Locale
Specification of language variants
A locale is an identifier of a language and region, plus an optional writing script. Machine translation APIs use locales to specify the language of the source and target text. Locales are also used to label the language of crawled web documents when building training data.
Example: fr-CA means French (fr) as spoken in Canada (CA).
The formatting varies from one system to another. fr-CA, fr-ca, and fr_CA are all common formats.
Language codes are typically two or three letters, following ISO 639. Regions are typically two letters, following ISO 3166. Scripts are optionally specified as four-letter ISO 15924 codes, such as sr-Cyrl-RS for Serbian written in Cyrillic script in Serbia.
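Because separators and casing differ between systems, it can help to parse a locale into its parts before comparing it with anything else. The following is a minimal sketch, not a full parser for any standard: the names Locale, parse_locale and format_locale are hypothetical, and it assumes the parts are separated by "-" or "_" and appear in the order language, script, region.

```python
import re
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str            # ISO 639 code, lowercased (e.g. "fr")
    script: Optional[str]    # ISO 15924 code, title-cased (e.g. "Cyrl")
    region: Optional[str]    # two letters uppercased (e.g. "CA") or three digits


# Language first, then an optional script and an optional region,
# each introduced by "-" or "_". This is an assumed, simplified grammar.
_PATTERN = re.compile(
    r"^([A-Za-z]{2,3})"
    r"(?:[-_]([A-Za-z]{4}))?"
    r"(?:[-_]([A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_locale(tag: str) -> Locale:
    """Split a locale string into language, script and region."""
    match = _PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised locale: {tag!r}")
    language, script, region = match.groups()
    return Locale(
        language.lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


def format_locale(locale: Locale, sep: str = "-") -> str:
    """Join the parts that are present using the chosen separator."""
    return sep.join(part for part in locale if part)
```

With this sketch, fr-ca and fr_CA parse to the same value, and format_locale can write that value back out with whichever separator a target system expects.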
API support
These language variations are supported by many API vendors:
- Chinese (zh):
  - Chinese, Simplified (zh-cn, also zh-Hans)
  - Chinese, Traditional (zh-tw, also zh-Hant)
- Portuguese (pt):
  - Portugal (pt-pt)
  - Brazil (pt-br)
- French (fr):
  - France (fr-fr)
  - Canada (fr-ca)
- Spanish (es):
  - Spain (es-es)
  - Mexico (es-mx)
  - Latin America and Caribbean region (es-419)
- English (en):
  - United States (en-us)
  - Great Britain (en-gb)
- Serbian (sr):
  - Serbia, Cyrillic script (sr-Cyrl-rs)
  - Serbia, Latin script (sr-Latn-rs)
- Norwegian (no):
  - Norwegian Bokmål (nb, nob)
  - Norwegian Nynorsk (nn, nno)
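Since each vendor supports a different set of codes, a client usually has to match a requested locale against the vendor's list. One possible approach, sketched here, is to try the full code and then drop trailing parts one at a time, similar in spirit to the lookup scheme in RFC 4647. The SUPPORTED list and the function names are hypothetical and do not describe any real vendor.

```python
from typing import Iterable, Optional


def normalise(tag: str) -> str:
    """Lowercase the tag and use "-" as the only separator."""
    return tag.strip().replace("_", "-").lower()


def resolve(requested: str, supported: Iterable[str]) -> Optional[str]:
    """Return the vendor's own spelling of the closest supported code, or None."""
    by_key = {normalise(code): code for code in supported}
    parts = normalise(requested).split("-")
    # Try the full tag first, then drop trailing parts one at a time.
    while parts:
        candidate = "-".join(parts)
        if candidate in by_key:
            return by_key[candidate]
        parts.pop()
    return None


# Hypothetical vendor list, for illustration only.
SUPPORTED = ["en", "fr", "fr-ca", "es-419", "zh-Hans", "zh-Hant"]
```

In this sketch a request for fr-BE falls back to fr, while a request for es-MX returns None instead of silently selecting es-419. Whether a sibling variant is an acceptable substitute is a decision the caller has to make.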
Challenges
- When a translation API uses only a language code without a region code or script, it can be unclear which variant is being translated. For example, zh alone does not say whether the text is Simplified or Traditional Chinese.
- Not all languages or variants have standardised locale codes, which leads to inconsistencies between APIs.
- In some cases, the locale codes have changed over time. For example, older systems may represent Cantonese as zh-HK, while newer systems use the newer language code yue.
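One way to cope with these challenges is to keep an explicit alias table and to flag bare language codes that need a variant. The sketch below is illustrative only: the table entries mirror the examples in this article (with Cantonese written in hyphenated form as zh-hk), the names ALIASES, AMBIGUOUS, canonicalise and needs_variant are hypothetical, and a real system would need a table tailored to the APIs it talks to.

```python
# Illustrative alias table; the entries mirror examples in this article
# and are not taken from any particular vendor.
ALIASES = {
    "zh-hk": "yue",      # legacy Cantonese code -> newer language code
    "zh-cn": "zh-Hans",  # region-based code -> script-based code
    "zh-tw": "zh-Hant",
}

# Bare language codes that cover more than one common variant (assumed set).
AMBIGUOUS = {"zh", "sr", "no"}


def canonicalise(tag: str) -> str:
    """Map a known legacy or alternative code to the preferred one."""
    key = tag.strip().replace("_", "-").lower()
    return ALIASES.get(key, tag.strip())


def needs_variant(tag: str) -> bool:
    """Report whether the code is a bare language with several variants."""
    return canonicalise(tag).lower() in AMBIGUOUS
```

Codes that are not in the table pass through unchanged, so canonicalise("fr-CA") returns fr-CA, while needs_variant("zh") is true and can be used to prompt the caller for a script or region.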