(Most of the content on this page comes directly from RFC 5646.)
For the same body of text, you may have several possible tags. Interoperability is best served when all users use the same language tag for the same language. The rules here are intended to help in that respect.
Subtags should only be used where they add useful distinguishing information; extraneous subtags interfere with the meaning, understanding, and processing of language tags. In particular, the Suppress-Script fields in the registry should be obeyed: for instance, fr (French) has Suppress-Script: Latn because the overwhelming majority of French texts are in the Latin script. Therefore, tagging French text as fr-Latn is useless and confusing. A simple fr is enough. In the unlikely case that you meet French texts in the Arabic script, you can add a subtag for the script: fr-Arab. (This is especially important since the former standard, RFC 3066, did not have subtags for scripts, and old applications will therefore have trouble handling them.)
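The Suppress-Script rule can be sketched in a few lines of Python. This is only an illustration: the SUPPRESS_SCRIPT table is a hand-copied two-entry excerpt standing in for the real registry, the function name drop_suppressed_script is hypothetical, and the sketch assumes a simple tag in which the script, if present, is the second subtag.

```python
# Hand-copied excerpt of Suppress-Script values (assumption: a real
# application would load these from the IANA Language Subtag Registry).
SUPPRESS_SCRIPT = {"fr": "Latn", "en": "Latn"}


def drop_suppressed_script(tag: str) -> str:
    """Remove the script subtag when it equals the language's Suppress-Script.

    Assumes a simple tag of the form language[-script][-more...].
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    # A script subtag is four letters; here it must directly follow the language.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        if subtags[1].title() == SUPPRESS_SCRIPT.get(language):
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("fr-Latn"))     # fr
print(drop_suppressed_script("fr-Arab"))     # fr-Arab
print(drop_suppressed_script("fr-Latn-CA"))  # fr-CA
```

The script subtag is only dropped when it matches the registry entry, so fr-Arab keeps the information that actually distinguishes it.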
Use as precise a tag as possible, but no more specific than is justified. Avoid using subtags that are not important for distinguishing content in an application. For example, de might suffice for tagging an email written in German, whereas de-CH-1996, although legal, is probably unnecessarily precise for such a task.
But do not be too vague: the primary language subtag might not be sufficient to give all the information necessary to understand the text. For example, the tag az (for Azerbaijani) is probably insufficient in the absence of context, because this language has no dominant script. A person fluent in one script might not be able to read the other, even though the text might be identical. Content tagged as az is most probably written in just one script and thus might not be intelligible to a reader familiar with the other script. az-Latn, az-Cyrl or az-Arab is probably necessary.
If a tag or subtag has a Preferred-Value field in its registry entry, then the value of that field should be used to form the language tag. For example, use he for Hebrew in preference to iw.
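As a minimal sketch of the Preferred-Value rule, the Python function below rewrites a deprecated primary language subtag. The PREFERRED_VALUE table is a small hand-copied excerpt, not the full registry, and the function name apply_preferred_value is hypothetical. Preferred-Value fields also exist on other kinds of registry entries, which this sketch deliberately ignores.

```python
# Hand-copied excerpt of Preferred-Value mappings for deprecated
# language subtags (assumption: the full list comes from the registry).
PREFERRED_VALUE = {"iw": "he", "in": "id", "ji": "yi"}


def apply_preferred_value(tag: str) -> str:
    """Replace a deprecated primary language subtag with its Preferred-Value."""
    # Split off the primary language subtag; keep the rest untouched.
    language, sep, rest = tag.partition("-")
    language = PREFERRED_VALUE.get(language.lower(), language)
    return language + sep + rest


print(apply_preferred_value("iw"))     # he
print(apply_preferred_value("iw-IL"))  # he-IL
print(apply_preferred_value("fr"))     # fr
```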
Validity of a tag is not everything. A tag may be both valid and meaningless. This is unavoidable with a generative system like the language subtag mechanism. So, ar-Cyrl-AQ (Arabic written in the Cyrillic script, as used in Antarctica) is perfectly valid but should nevertheless be avoided because it bears no relationship to reality (there is not a single document with these characteristics).
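The gap between validity and meaning shows up in any mechanical check. The Python sketch below validates a tag of the simple form language[-Script][-REGION] against three tiny hand-copied sets of registered subtags. The sets, the function name is_valid_simple_tag, and the restriction to conventionally cased tags are all assumptions made for illustration; a real validator would use the complete registry and the full tag syntax.

```python
# Tiny hand-copied sets of registered subtags, for illustration only.
LANGUAGES = {"ar", "az", "de", "fr"}
SCRIPTS = {"Arab", "Cyrl", "Latn"}
REGIONS = {"AQ", "CH", "FR"}


def is_valid_simple_tag(tag: str) -> bool:
    """Check a language[-Script][-REGION] tag against the sets above."""
    subtags = tag.split("-")
    if subtags[0] not in LANGUAGES:
        return False
    rest = subtags[1:]
    # Consume an optional script, then an optional region, in that order.
    if rest and rest[0] in SCRIPTS:
        rest = rest[1:]
    if rest and rest[0] in REGIONS:
        rest = rest[1:]
    return not rest  # anything left over is not handled by this sketch


print(is_valid_simple_tag("ar-Cyrl-AQ"))  # True: valid, yet meaningless
print(is_valid_simple_tag("ar-Xxxx-AQ"))  # False
```

Each subtag of ar-Cyrl-AQ is registered, so the check passes; nothing in the registry can say whether the combination describes any real text. That judgment remains with the person choosing the tag.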