Internet-Draft                                       H. Alvestrand
draft-alvestrand-lang-tag-v2-03.txt
                                                     Cisco Systems
Target Category: Best Current Practice
                                                       August 2000
Obsoletes: RFC 1766                         Expires: February 2001














Tags for the Identification of Languages



Status of this Memo
     The file name of this memo is draft-alvestrand-lang-tag-v2-03.txt
     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC 2026.
     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.
     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as "work
     in progress."
     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt
     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.
Comments on this draft should be sent to the mailing list
<ietf-languages@iana.org>.

Abstract
This document describes a language tag for use in cases where it is
desired to indicate the language used in an information object.



1. Introduction
Tags for the names of languages                  Harald Alvestrand
draft-alvestrand-lang-tag-v2-03.txt          Expires December 2000

There are a number of languages presently or previously used by human
beings in this world.
A great number of these people would prefer to have information
presented in a language which they understand.
In some contexts, it is possible to have information available in more
than one language, or it might be possible to provide tools  (such as
dictionaries) to assist in the understanding of a language.
In other cases, it may be desirable to use a computer program to
convert information from one format (such as plaintext) into another
(such as computer-synthesized speech, or Braille, or high-quality print
renderings).

A prerequisite for any such function is a means of labelling the
information content with an identifier for the language that is used in
this information content.
This document specifies an identifier mechanism.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC 2119].

2. The Language tag

2.1 Language tag syntax
The language tag is composed of one or more parts: a primary language
tag and a (possibly empty) series of subtags.


The syntax of this tag in ABNF [RFC 2234] is:
 Language-Tag = Primary-tag *( "-" Subtag )
 Primary-tag = 1*8ALPHA
 Subtag = 1*8ALPHA

All tags are to be treated as case insensitive; there exist conventions
for capitalization of some of them, but these should not be taken to
carry meaning. For instance, [ISO 3166] recommends that country codes
are capitalized (MN Mongolia), while [ISO 639] recommends that language
codes are written in lower case (mn Mongolian).
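
As an illustration only, the grammar and the case-insensitivity rule
above can be sketched in Python. The function names is_well_formed and
same_tag are hypothetical and not part of this specification; the
sketch checks syntax only and says nothing about whether a tag is
actually registered.

```python
import re

# Language-Tag = Primary-tag *( "-" Subtag ), each part being 1*8ALPHA.
_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_well_formed(tag: str) -> bool:
    """Return True if tag matches the Language-Tag grammar of section 2.1."""
    return _LANGUAGE_TAG.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case, as section 2.1 requires."""
    return a.lower() == b.lower()
```

For example, is_well_formed("en-US") holds, is_well_formed("en-") does
not, and same_tag("EN-us", "en-US") holds.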

2.2 Language tag sources

The namespace of language tags is administered by the IANA according to
the rules in section 3 of this document.
The following registrations are predefined:


In the primary language tag:
- All 2-letter tags are interpreted according to assignments found in
  ISO standard 639, "Code for the representation of names of languages"
  [ISO 639], or subsequently made by the standard's registration
  authority.
  (Note: A revision is underway, and is expected to be released as ISO
  639-1:2000.)
- All 3-letter tags are interpreted according to assignments found in
  ISO 639 part 2, "Codes for the representation of names of languages
  -- Part 2: Alpha-3 code" [ISO 639-2], or subsequently made by the
  standard's registration authority.
- The value "i" is reserved for IANA-defined registrations.
- The value "x" is reserved for private use. Subtags of "x" shall not
  be registered by the IANA.
- Other values shall not be assigned except by revision of this
  standard.
The reason for reserving all other tags is to be open towards new
revisions of ISO 639; the use of "i" and "x" is the minimum we can do
here to be able to extend the mechanism to meet our immediate
requirements.
In the first subtag:
- All 2-letter codes are interpreted as ISO 3166 alpha-2 country codes
  from [ISO 3166], or subsequently assigned by the standard's
  registration authority, denoting the area to which this language
  variant relates.
- Codes of 3 to 8 letters may be registered with the IANA, according to
  the rules in section 3 of this document.
The information in the subtag may for instance be:
- Country identification, such as en-US (this usage is described in ISO
  639)
- Dialect or variant information, such as no-nyn (nynorsk) or en-scouse
- Languages not listed in ISO 639 that are not variants of any listed
  language, which can be registered with the i-prefix, such as
  i-cherokee
- Script variations, such as az-Arab and az-Cyrl (Azerbaijani in Arabic
  or Cyrillic script; these script codes are suggested by the pending
  script code standard ISO/DIS 15924)

This document does not place any restriction on what values one can
register here, as long as they conform to the rules in section 3.
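
The predefined registrations above amount to a classification by
length and by the reserved values "i" and "x". A minimal sketch in
Python follows; the function name describe_tag and the wording of the
returned labels are assumptions made for illustration, and the sketch
does not consult any ISO or IANA code list.

```python
from typing import Tuple


def describe_tag(tag: str) -> Tuple[str, str]:
    """Classify the primary tag and first subtag by the rules of section 2.2."""
    parts = tag.lower().split("-")
    primary = parts[0]

    if primary == "i":
        primary_kind = "IANA-defined registration"
    elif primary == "x":
        primary_kind = "private use"
    elif len(primary) == 2:
        primary_kind = "ISO 639 language code"
    elif len(primary) == 3:
        primary_kind = "ISO 639-2 language code"
    else:
        # All other values are reserved for future revisions.
        primary_kind = "reserved"

    if len(parts) == 1:
        subtag_kind = "no subtag"
    elif primary == "x":
        # Subtags of "x" are never registered by the IANA.
        subtag_kind = "private use"
    elif len(parts[1]) == 2:
        subtag_kind = "ISO 3166 country code"
    elif 3 <= len(parts[1]) <= 8:
        subtag_kind = "IANA-registered subtag"
    else:
        subtag_kind = "not defined by this document"

    return primary_kind, subtag_kind
```

Under these assumptions, describe_tag("en-US") yields an ISO 639
language code with an ISO 3166 country code, and
describe_tag("i-cherokee") yields an IANA-defined registration with an
IANA-registered subtag.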
ISO 639 defines a registration authority for additions to and changes
in the list of languages in ISO 639. This authority is:

      International Information Centre for Terminology (Infoterm)
      P.O. Box 130
      A-1021 Wien
      Austria
      Phone: +43 1  26 75 35 Ext. 312
      Fax:   +43 1 216 32 72

ISO 639-2 defines a registration authority for additions to and changes
in the list of languages in ISO 639-2. This authority is:
     Library of Congress
     Network Development and MARC Standards Office
     Washington, D.C. 20540
     USA
     Phone: +1 202 707 6237
     Fax:   +1 202 707 0115
     URL: http://www.loc.gov/standards/iso639

The registration agency for ISO 3166 (country codes) is:
     ISO 3166 Maintenance Agency Secretariat
     c/o DIN Deutsches Institut fuer Normung
     Burggrafenstrasse 6
     Postfach 1107
     D-10787 Berlin
     Germany
     Phone: +49 30 26 01 320
     Fax:   +49 30 26 01 231
     URL: http://www.din.de/gremien/nas/nabd/iso3166ma/


ISO 3166 reserves the country codes AA, QM-QZ, XA-XZ and ZZ as user-
assigned codes.
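
An application that wants to recognize these user-assigned codes could
test for them as sketched below. The function name
is_user_assigned_country is hypothetical; the ranges are exactly the
ones listed above.

```python
def is_user_assigned_country(code: str) -> bool:
    """Return True for the ISO 3166 user-assigned codes AA, QM-QZ, XA-XZ, ZZ."""
    c = code.upper()
    if len(c) != 2 or not c.isascii() or not c.isalpha():
        return False
    if c in ("AA", "ZZ"):
        return True
    if c[0] == "Q":
        return "M" <= c[1] <= "Z"
    # Every two-letter code beginning with X falls in XA-XZ.
    return c[0] == "X"
```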
2.3 Choice of language tag
One may occasionally be faced with several possible tags for the same
body of text.
Interoperability is best served if all users send the same tag, and use
the same tag for the same language for all documents. Exact
requirements may need to vary by application area; if so, the
application protocol specification MUST specify how the procedure
varies from the one given here.
The text below is based on the set of tags known to the tagging entity.
1. Use the most precise tagging known to the sender that can be
  ascertained and is useful within the application context.
2. When a language has both an ISO 639-1 2-character tag and an ISO
  639-2 3-character tag, you MUST use the ISO 639-1 2-character tag.
3. When a language has no ISO 639-1 2-character tag, and the ISO 639-2/T
  (Terminology) tag and the ISO 639-2/B (Bibliographic) tag differ, you
  MUST use the Terminology tag.
  NOTE: At present, all languages for which there is a difference have
  2-character tags, and the displeasure of developers about the
  existence of 2 tag sets has been adequately communicated to ISO, so
  this situation will hopefully not arise.
4. When a language has both an IANA-registered tag (i-something) and an
  ISO registered tag, you MUST use the ISO tag.
  NOTE: When such a situation is discovered, the IANA-registered tag
  SHOULD be deprecated as soon as possible.
5. You SHOULD NOT use the UND (Undetermined) tag unless the protocol in
  use forces you to give a value for the language tag, even if the
  language is unknown. Omitting the tag is preferred.
6. You MUST NOT use the MUL (Multiple) tag if the protocol allows you to
  use multiple languages, as is the case for the Content-Language:
  header.
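
Rules 2 through 5 above can be read as a fixed order of preference
among the tags a sender knows for one language. The following Python
sketch assumes a hypothetical record type, KnownTags, whose field
names are invented for illustration; it does not cover rule 1
(precision) or rule 6 (MUL), which depend on the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownTags:
    """Tags the sender knows for one language (hypothetical field names)."""
    iso639_1: Optional[str] = None    # 2-character tag
    iso639_2_t: Optional[str] = None  # 3-character Terminology tag
    iso639_2_b: Optional[str] = None  # 3-character Bibliographic tag
    iana: Optional[str] = None        # IANA-registered tag, e.g. i-something


def preferred_tag(known: KnownTags) -> Optional[str]:
    """Pick the tag to send, or None if the tag is best omitted."""
    if known.iso639_1:
        return known.iso639_1      # rule 2: 2-character tag wins
    if known.iso639_2_t:
        return known.iso639_2_t    # rule 3: Terminology over Bibliographic
    if known.iso639_2_b:
        return known.iso639_2_b
    # Rule 4: an IANA tag is used only when no ISO tag exists.
    # Rule 5: None means "omit the tag" rather than sending UND.
    return known.iana
```

A caller that is forced by its protocol to supply some value would
substitute UND for a None result, as rule 5 allows.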
NOTE: In order to avoid versioning difficulties in applications such as
that of RFC 1766, the ISO 639 RA-JAC has agreed on the following policy
statement:

  "After the publication of ISO/DIS 639-1 as an International Standard,
  no new 2-letter code shall be added to ISO 639-1 unless a 3-letter
  code is also added at the same time to ISO 639-2. In addition, no
  language with a 3-letter code available at the time of publication of
  ISO 639-1 which at that time had no 2-letter code shall be
  subsequently given a 2-letter code."

This will ensure that, for example, a user who implements "hwi"
(Hawai'ian), which currently has no 2-letter code, will not find his or
her data invalidated by eventual addition of a 2-letter code for that
language.


2.4 Meaning of the language tag


The language tag always defines a language as spoken (or written,
signed or otherwise signalled) by human beings for communication of
information to other human beings.
Computer languages such as programming languages are explicitly
excluded.
There is no guaranteed relationship between languages whose tags begin
with the same series of subtags; specifically, they are NOT guaranteed
to be mutually intelligible, although it will sometimes be the case
that they are.

Applications should always treat a language tag as a single token; the
division into main tag and subtags is an administrative mechanism, not
a navigation aid.
The relationship between the tag and the information it relates to is
defined by the standard describing the context in which it appears.
Accordingly, this section can only give possible examples of its usage.
- For a single information object, it should be taken as the set of
  languages that is required for a complete comprehension of the
  complete object.
  Example: Plain text documents.
- For an aggregation of information objects, it should be taken as the
  set of languages used inside components of that aggregation.
  Examples: Document stores and libraries.
- For information objects whose purpose is to provide alternatives, it
  should be regarded as a hint that the material inside is provided in
  several languages, and that one has to inspect each of the
  alternatives in order to find its language or languages.  In this
  case, multiple languages need not mean that one needs to be
  multilingual to get complete understanding of the document.
  Example: MIME multipart/alternative.
- In markup languages, such as HTML, it is possible to define a
  construct embedding a language tag to indicate that contained text is
  written in this language, such that one could write
  <DIV lang="FR">C'est la vie</DIV> inside a Norwegian document; the
  Norwegian-speaking user could then access a French-Norwegian
  dictionary to find out what the marked section meant.
  If the user were listening to that document through a speech
  synthesis interface, this information could be used to signal the
  synthesizer to appropriately apply French text-to-speech
  pronunciation rules to that span of text, instead of misapplying the
  Norwegian rules.
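
To make the markup example concrete, the following Python sketch uses
the standard library's html.parser to split a document into runs of
text, each labelled with the language in effect. The class name
LangTextCollector and its attributes are hypothetical. The sketch
assumes every start tag has a matching end tag, so it would
mislabel text after an unclosed element such as a bare <BR>.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LangTextCollector(HTMLParser):
    """Collect (language tag, text) runs, inheriting lang from enclosing tags."""

    def __init__(self, default_lang: str) -> None:
        super().__init__()
        self._stack: List[str] = [default_lang.lower()]
        self.runs: List[Tuple[str, str]] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # An element without a lang attribute inherits the enclosing language.
        lang = dict(attrs).get("lang")
        self._stack.append(lang.lower() if lang else self._stack[-1])

    def handle_endtag(self, tag: str) -> None:
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.runs.append((self._stack[-1], text))


collector = LangTextCollector("no")
collector.feed('<P>Han sa <DIV lang="FR">C\'est la vie</DIV> i dag.</P>')
collector.close()
```

After this runs, collector.runs holds three entries: the surrounding
text labelled "no" and the embedded phrase labelled "fr". A speech
synthesizer or dictionary lookup could then be chosen per run.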


2.5 Language-range
Since the publication of RFC 1766, it has become apparent that there is
a need to define a term for a set of languages that share some common
property. The following definition of language-range is derived from
HTTP/1.1 [RFC 2616].
          language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )


A language-range matches a language-tag if it exactly equals the tag,
or if it exactly equals a prefix of the tag such that the first tag
character following the prefix is "-".
The special range "*" matches any tag. A protocol which uses language
ranges may specify additional rules about the semantics of "*"; for
instance, HTTP/1.1 specifies that it only matches languages not matched
by any other range within an "Accept-Language:" header.
NOTE: This use of a prefix matching rule does not imply that language
tags are assigned to languages in such a way that it is always true
that if a user understands a language with a certain tag, then this
user will also understand all languages with tags for which this tag is
a prefix. The prefix rule simply allows the use of prefix tags if this
is the case.
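
The matching rule of this section is short enough to state as code.
The Python sketch below is illustrative only; the function name
range_matches is hypothetical, and the comparison is made
case-insensitive on the assumption that the rule of section 2.1
applies to ranges as well as to tags. Protocol-specific refinements of
"*", such as the HTTP/1.1 rule mentioned above, are not modelled.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Return True if language_range matches tag under section 2.5."""
    if language_range == "*":
        return True
    r = language_range.lower()
    t = tag.lower()
    # Exact match, or a prefix that is immediately followed by "-".
    return t == r or t.startswith(r + "-")
```

Thus the range "en" matches "en" and "en-US" but not "eng", because
the character following the prefix in "eng" is not "-".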



3. IANA registration procedure for language tags
Any language tag shall begin with an existing tag, and extend it.
The registration form given here must be used by anyone who wants to
use a language tag not defined by ISO or IANA.
----------------------------------------------------------------------
LANGUAGE TAG REGISTRATION FORM

Name of requester          :
E-mail address of requester:
Tag to be registered       :

English name of language   :

Native name of language (transcribed into ASCII):

Reference to published description of the language (book or article):

Any other relevant information:

----------------------------------------------------------------------
The language form must be sent to <ietf-languages@iana.org> for a
2-week review period before it can be submitted to IANA.  (This is an
open list. Requests to be added should be sent to
<ietf-languages-request@iana.org>.)
When the two week period has passed, the language tag reviewer, who is
appointed by the IETF Applications Area Director, either forwards the
request to IANA@ISI.EDU, or rejects it because of significant
objections raised on the list. Note that the reviewer can raise
objections on the list himself, if he so desires. The important thing
is that the objection must be made publicly.
The applicant is free to modify a rejected application with additional
information and submit it again; this restarts the 2-week comment
period.
Decisions made by the reviewer may be appealed to the IESG.
All registered forms are available online in the directory
ftp://ftp.isi.edu/in-notes/iana/assignments/languages/


Updates of registrations follow the same procedure as registrations.
The language tag reviewer decides whether to allow a new registrant to
update a registration made by someone else; in the normal case,
objections by the original registrant would carry extra weight in such
a decision.
There is no deletion of registrations; when some registered tag should
not be used any more, for instance because a corresponding ISO 639 code
has been registered, the registration should  be amended by adding a
remark like "DO NOT USE: use <new code> instead" to the "other relevant
information" section.
Note: The "published description" is intended as an aid to people
trying to verify whether two suggested language tags refer to the same
language or not. In most cases, reference to an authoritative grammar
or dictionary of the language will be useful; in cases where no such
work exists, other well known works in or about that language may be
appropriate. The language tag reviewer is the ultimate authority on
what constitutes a "good enough" literature reference.

4. Security Considerations
The only security issue that has been raised with language tags since
the publication of RFC 1766, which stated that "Security issues are
believed to be irrelevant to this memo", is a concern with language
ranges used in content negotiation - that they may be used to infer the
nationality of the sender, and thus identify potential targets for
surveillance.
This is a special case of the general problem that anything you send is
visible to the receiving party; it is useful to be aware that such
concerns can exist in some cases.
The exact magnitude of the threat, and any possible countermeasures, is
left to each application protocol.

5. Character set considerations
Codes may always be expressed using the US-ASCII character repertoire
(a-z), which is present in most character sets.
The issue of deciding upon the rendering of a character set based on
the language tag is not addressed in this memo; however, it is thought
impossible to make such a decision correctly for all cases unless means
of switching language in the middle of a text are defined (for example,
a rendering engine that decides font based on Japanese or Chinese
language may fail to work when a mixed Japanese-Chinese text is
encountered).





6. Acknowledgements
This document has benefited from many rounds of review and comments in
various fora of the IETF and the Internet working groups.
Any list of contributors is bound to be incomplete; please regard the
following as only a selection from the group of people who have
contributed to make this document what it is today.
In alphabetical order:
Tim Berners-Lee, Nathaniel Borenstein, Sean M. Burke, Jim Conklin, John
Cowan, Dave Crocker, Martin Duerst, Michael Everson, Ned Freed, Tim
Goodwin, Dirk-Willem van Gulik, Paul Hoffman, Olle Jarnefors, John
Klensin, Keith Moore, Masataka Ohta, Keld Jorn Simonsen, Rhys
Weatherley, Misha Wolf, Francois Yergeau and many, many others.

Special thanks must go to Michael Everson, who has served as language
tag reviewer for almost the complete period since the publication of
RFC 1766, and has provided a great deal of input to this revision.

7. Author's Address
Harald Tveit Alvestrand
Cisco Systems
Weidemanns vei 27
7043 Trondheim
NORWAY
EMail: Harald@Alvestrand.no
Phone: +47 73 50 33 52

8. References

[ISO 639]
     ISO 639:1988 (E/F) - Code for the representation of names of
     languages - The International Organization for Standardization,
     1st edition, 1988-04-01 Prepared by ISO/TC 37 - Terminology
     (principles and coordination).
     Note that a new version (ISO 639-1:2000) is in preparation at the
     time of this writing.
[ISO 639-2]
     ISO 639-2:1998 - Codes for the representation of names of
     languages -- Part 2: Alpha-3 code  - edition 1, 1998-11-01, 66
     pages, prepared by ISO/TC 37/SC 2

[ISO 3166]
     ISO 3166:1988 (E/F) - Codes for the representation of names of
     countries - The International Organization for Standardization,
     3rd edition, 1988-08-15.

[ISO 15924]
     ISO/DIS 15924 - Codes for the representation of names of scripts
     (under development by ISO TC46/SC2)
[RFC 1327]
     Kille, S., "Mapping between X.400(1988) / ISO 10021 and RFC 822",
     RFC 1327, University College London, May 1992.
[RFC 1521]
     Borenstein, N., and N. Freed, "MIME Part One: Mechanisms for
     Specifying and Describing the Format of Internet Message Bodies",
     RFC 1521, Bellcore, Innosoft, September 1993.
[RFC 2119]
     Bradner, S., "Key words for use in RFCs to Indicate Requirement
     Levels", RFC 2119, March 1997.
[RFC 2234]
     Crocker, D., Ed., and P. Overell, "Augmented BNF for Syntax
     Specifications: ABNF", RFC 2234, November 1997.
[RFC 2616]
     Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
     Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
     HTTP/1.1", RFC 2616, June 1999.

Appendix A: Language Tag Reference Material
The Library of Congress, maintainers of ISO 639-2, has made the list of
languages registered available on the Internet.
At the time of this writing, it can be found at
http://www.loc.gov/standards/iso639-2/langhome.html

The IANA registration forms for registered language codes can be found
at
http://www.isi.edu/in-notes/iana/assignments/languages/

The ISO 3166 Maintenance Agency has published Web pages at
http://www.din.de/gremien/nas/nabd/iso3166ma/

Appendix B: Changes from RFC 1766
. Email list address changed from ietf-types@uninett.no to
  ietf-languages@iana.org
. Updated author's address
. Added language-range construct from HTTP/1.1
. Added use of ISO 639-2 language codes
. Added reference to Library of Congress lists of language codes
. Changed examples to use registered tags
. Added "Any other information" to registration form
. Added description of procedure for updating registrations
. Changed target category for document from standards track to BCP
. Moved the content-language header definition into another document

Appendix X: Changes between drafts
This appendix is to be deleted by the RFC Editor before publication as
RFC.

Changes from draft -00 to -01
- Fixed up the language tag table
- Moved multipart/alternative stuff to appendix
- Changed examples to use registered tags
- Added * in language tag table to indicate B/T conflicts
- Considered, but did not adopt, changing from recommending T codes to
  recommending B codes. At the moment, the only argument that appeals
  to the author is that the T codes look more like the 639-1 codes than
  the B codes do.
- Added procedures for updating a registration

Changes from draft -01 to -02
- Minor updates
- Added reference to Library of Congress code lists instead of
  including code values
- Changed grammars to use RFC 2234 ABNF
- Used MUST and SHOULD in label choice algorithm


Changes from draft -02 to -03
. Minor updates
. Content-language: header moved to another draft
. Added URL for ISO 3166 maintenance agency web pages
. Added text to clarify purpose of the literature reference on the
  registration form

