Tag wisely

(Most of the content on this page comes directly from RFC 5646.)

For the same body of text, you may have several possible tags. Interoperability is best served when all users use the same language tag for the same language. The rules here are intended to help in that respect.

Subtags should only be used where they add useful distinguishing information; extraneous subtags interfere with the meaning, understanding, and processing of language tags. In particular, the Suppress-Script fields in the registry should be obeyed: for instance, fr (French) has Suppress-Script: Latn because the overwhelming majority of French texts are in the Latin script. Tagging French text as fr-Latn is therefore useless and confusing; a simple fr is enough. In the unlikely case that you meet French text in the Arabic script, you can add a script subtag: fr-Arab. (This is especially important because the former standard, RFC 3066, did not have script subtags, so old applications may have trouble handling them.)
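
The Suppress-Script rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SUPPRESS_SCRIPT table is a small hand-copied excerpt (a real tool would load the values from the IANA Language Subtag Registry), the function name drop_suppressed_script is made up for this page, and the script subtag is expected to come right after the language subtag.

```python
# Assumed excerpt of Suppress-Script values; a real tool would load them
# from the IANA Language Subtag Registry rather than hard-code them.
SUPPRESS_SCRIPT = {"fr": "Latn", "en": "Latn", "de": "Latn", "ru": "Cyrl"}


def drop_suppressed_script(tag):
    """Return tag without its script subtag when the registry suppresses it."""
    subtags = tag.split("-")
    if len(subtags) < 2:
        return tag
    language, candidate = subtags[0].lower(), subtags[1]
    # Simplification: the script is expected right after the language, so
    # tags with extended language subtags are returned unchanged.
    is_script = len(candidate) == 4 and candidate.isascii() and candidate.isalpha()
    if is_script and SUPPRESS_SCRIPT.get(language) == candidate.title():
        return "-".join(subtags[:1] + subtags[2:])
    return tag


print(drop_suppressed_script("fr-Latn"))  # fr
print(drop_suppressed_script("fr-Arab"))  # fr-Arab
```

A script that differs from the suppressed one, as in fr-Arab, is left alone because it carries real information.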

Use as precise a tag as possible, but no more specific than is justified. Avoid using subtags that are not important for distinguishing content in an application. For example, de might suffice for tagging an email written in German, whereas de-CH-1996, although legal, is probably unnecessarily precise for such a task.

But do not be too vague: the primary language subtag might not be sufficient to give all the information necessary to understand the text. For example, the tag az (for Azerbaijani) is probably insufficient in the absence of context, because this language has no dominant script. A person fluent in one script might not be able to read another, even though the text might otherwise be identical. Content tagged as az is most probably written in just one script and thus might not be intelligible to a reader familiar only with a different one. A tag such as az-Latn, az-Cyrl or az-Arab is probably necessary.
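
A tagging tool could flag such under-specified tags. The following Python sketch is only an illustration: the set NO_DOMINANT_SCRIPT and the function name is_too_vague are assumptions made for this page, and only az is taken from the text above.

```python
# Assumption: the set of languages treated as having no dominant script.
# Only "az" is taken from the text; extend the set from your own knowledge.
NO_DOMINANT_SCRIPT = {"az"}


def is_too_vague(tag):
    """Return True when a tag for a multi-script language names no script."""
    subtags = tag.split("-")
    if subtags[0].lower() not in NO_DOMINANT_SCRIPT:
        return False
    # Simplification: only the subtag right after the language is inspected.
    second = subtags[1] if len(subtags) > 1 else ""
    has_script = len(second) == 4 and second.isascii() and second.isalpha()
    return not has_script
```

With this sketch, az and az-AZ are reported as too vague, while az-Latn is not.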

If a tag or subtag has a Preferred-Value field in its registry entry, then the value of that field should be used to form the language tag. For example, use he for Hebrew in preference to iw.
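
As a minimal sketch of this rule in Python, the function below replaces a deprecated primary language subtag with its preferred value. The PREFERRED_LANGUAGE table is a small hand-copied excerpt and the function name use_preferred_language is made up for this page. The sketch only looks at the primary language subtag, although other registry entries (regions or whole tags, for example) can carry a Preferred-Value too.

```python
# Assumed excerpt of Preferred-Value mappings for deprecated primary
# language subtags; the full list is in the IANA Language Subtag Registry.
PREFERRED_LANGUAGE = {"iw": "he", "in": "id", "ji": "yi"}


def use_preferred_language(tag):
    """Replace a deprecated primary language subtag with its preferred value."""
    subtags = tag.split("-")
    subtags[0] = PREFERRED_LANGUAGE.get(subtags[0].lower(), subtags[0])
    return "-".join(subtags)


print(use_preferred_language("iw"))     # he
print(use_preferred_language("iw-IL"))  # he-IL
```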

Validity of a tag is not everything. A tag may be both valid and meaningless. This is unavoidable with a generative system like the language subtag mechanism. So, ar-Cyrl-AQ (Arabic written in the Cyrillic script, as used in Antarctica) is perfectly valid but should nevertheless be avoided because it has no relationship to reality (there is not a single document with these characteristics).
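
The Python sketch below shows why a checker cannot catch such tags: it looks up each subtag on its own, which is roughly how a generative system works, so nothing tells it that a combination is implausible. The three sets are tiny assumed excerpts of the registry, the function name is_valid_simple_tag is made up for this page, and only tags of the form language, optional script, optional region are handled. Full validity under RFC 5646 involves more conditions than this.

```python
# Assumed excerpts of registered subtags, enough for the example only.
LANGUAGES = {"ar", "fr", "de", "az", "he"}
SCRIPTS = {"Arab", "Cyrl", "Latn"}
REGIONS = {"AQ", "CH", "FR", "EG"}


def is_valid_simple_tag(tag):
    """Check a language[-Script][-REGION] tag subtag by subtag."""
    subtags = tag.split("-")
    if subtags[0].lower() not in LANGUAGES:
        return False
    rest = subtags[1:]
    if rest and rest[0].title() in SCRIPTS:
        rest = rest[1:]
    if rest and rest[0].upper() in REGIONS:
        rest = rest[1:]
    # Anything left over (variants, extensions, ...) is outside this sketch.
    return not rest


print(is_valid_simple_tag("ar-Cyrl-AQ"))  # True
```

The check accepts ar-Cyrl-AQ just as readily as ar-EG; judging whether a tag matches real content remains the tagger's job.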