IETF Policy on Character Sets and Languages
RFC 2277

Document Type RFC - Best Current Practice (January 1998; Errata)
Also known as BCP 18
Last updated 2013-03-02
Stream Legacy
Network Working Group                                     H. Alvestrand
Request for Comments: 2277                                      UNINETT
BCP: 18                                                    January 1998
Category: Best Current Practice

              IETF Policy on Character Sets and Languages

Status of this Memo

   This document specifies an Internet Best Current Practices for the
   Internet Community, and requests discussion and suggestions for
   improvements.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

1.  Introduction

   The Internet is international.

   With the international Internet follows an absolute requirement to
   interchange data in a multiplicity of languages, which in turn
   utilize a bewildering number of characters.

   This document states the current policies being applied by the
   Internet Engineering Steering Group (IESG) to the standardization
   efforts in the Internet Engineering Task Force (IETF) in order to
   help Internet protocols fulfill these requirements.

   The document is very much based upon the recommendations of the IAB
   Character Set Workshop of February 29-March 1, 1996, which is
   documented in RFC 2130 [WR].  This document attempts to be concise,
   explicit and clear; people wanting more background are encouraged to
   read RFC 2130.

   The document uses the terms 'MUST', 'SHOULD' and 'MAY', and their
   negatives, in the way described in [RFC 2119].  In this case, 'the
   specification' as used by RFC 2119 refers to the processing of
   protocols being submitted to the IETF standards process.

Alvestrand               Best Current Practice                  [Page 1]
RFC 2277                     Charset Policy                 January 1998

2.  Where to do internationalization

   Internationalization is for humans. This means that protocols are not
   subject to internationalization; text strings are. Where protocol
   elements look like text tokens, such as in many IETF application
   layer protocols, protocols MUST specify which parts are protocol and
   which are text. [WR 2.2.1.1]
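
   As an illustration only, the protocol/text distinction can be
   sketched for a hypothetical reply line consisting of a numeric code
   followed by free text.  The line format, the function name
   split_reply, and the choice of UTF-8 for the text part are all
   assumptions made for this sketch; they are not taken from any
   particular IETF protocol.

```python
from typing import Tuple


def split_reply(line: bytes) -> Tuple[int, str]:
    """Split a hypothetical reply line into (protocol code, human text)."""
    code_part, _, text_part = line.partition(b" ")
    # Protocol element: a fixed ASCII token that software acts on.
    # It is never translated or localized.
    code = int(code_part.decode("ascii"))
    # Text: shown to a human, so it may be in any language.
    # UTF-8 is assumed here purely for the sake of the example.
    text = text_part.decode("utf-8")
    return code, text


code, text = split_reply("250 V\u00e6r s\u00e5 god".encode("utf-8"))
print(code, text)
```

   Software would branch on the code only; the text could be replaced
   by a translation without changing protocol behaviour.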

   Names are a problem, because people feel strongly about them, many of
   them are mostly for local usage, and all of them tend to leak out of
   the local context at times. RFC 1958 [RFC 1958] recommends US-ASCII
   for all globally visible names.

   This document does not mandate a policy on name internationalization,
   but requires that all protocols describe whether names are
   internationalized or US-ASCII.

   NOTE: In the protocol stack for any given application, there is
   usually one or a few layers that need to address these problems.

   It would, for instance, not be appropriate to define language tags
   for Ethernet frames. But it is the responsibility of the WGs to
   ensure that whenever responsibility for internationalization is left
   to "another layer", those responsible for that layer are in fact
   aware that they HAVE that responsibility.

3.  Definition of Terms

   This document uses the term "charset" to mean a set of rules for
   mapping from a sequence of octets to a sequence of characters, such
   as the combination of a coded character set and a character encoding
   scheme; this is also what is used as an identifier in MIME "charset="
   parameters, and registered in the IANA charset registry [REG].  (Note
   that this is NOT a term used by other standards bodies, such as ISO.)
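
   A short sketch may make the "octets to characters" mapping concrete.
   The same five octets are decoded under two different charset labels
   and yield different character sequences.  The sample octets and
   variable names are chosen for this example only.

```python
# Five octets, with no charset label attached yet.
octets = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

# Each charset is a different rule for turning octets into characters.
as_utf8 = octets.decode("utf-8")         # 4 characters: c a f U+00E9
as_latin1 = octets.decode("iso-8859-1")  # 5 characters: c a f U+00C3 U+00A9

print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])
print(len(as_latin1), [f"U+{ord(c):04X}" for c in as_latin1])
```

   The octets alone do not say which result is intended; only the
   charset label does.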

   For a definition of the term "coded character set", refer to the
   workshop report.

   A "name" is an identifier such as a person's name, a hostname, a
   domainname, a filename or an E-mail address; it is often treated as
   an identifier rather than as a piece of text, and is often used in
   protocols as an identifier for entities, without surrounding text.

3.1.  What charset to use

   All protocols MUST identify, for all character data, which charset is
   in use.
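
   One way a receiver might act on such a label is sketched below,
   using the MIME "charset=" parameter mentioned above and the Python
   standard library's email.message.Message to parse it.  The helper
   name decode_body is hypothetical, and refusing to decode unlabelled
   data is a choice made for this sketch, not a rule stated in this
   document.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes) -> str:
    """Decode body octets using the charset named in a Content-Type value."""
    header = Message()
    header["Content-Type"] = content_type
    # Returns the lower-cased charset parameter, or None if it is absent.
    charset = header.get_content_charset()
    if charset is None:
        # Sketch-only behaviour: refuse to guess when no label is given.
        raise ValueError("no charset label; refusing to guess")
    return body.decode(charset)


print(decode_body("text/plain; charset=utf-8", b"bl\xc3\xa5b\xc3\xa6r"))
print(decode_body("text/plain; charset=iso-8859-1", b"bl\xe5b\xe6r"))
```

   Both calls produce the same characters even though the octets
   differ, because each body is labelled with the charset it uses.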

   Protocols MUST be able to use the UTF-8 charset, which consists of
   the ISO 10646 coded character set combined with the UTF-8 character
   encoding scheme, as defined in [10646] Annex R (published in
   Amendment 2), for all text.
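
   As a rough illustration of the UTF-8 character encoding scheme, the
   sketch below encodes a few ISO 10646 characters with Python's
   built-in "utf-8" codec and shows the resulting octets.  The sample
   characters and the helper name utf8_octets are chosen for this
   example; the codec follows the present-day definition of UTF-8,
   which is assumed here to agree with Annex R for these characters.

```python
def utf8_octets(ch: str) -> str:
    """Return the UTF-8 octets of one character as spaced hex digits."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


# LATIN CAPITAL A, e-acute, EURO SIGN, MUSICAL SYMBOL G CLEF
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_octets(ch)}")
```

   US-ASCII characters keep their single-octet values, so US-ASCII text
   is already valid UTF-8; other characters take two or more octets.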

   Protocols MAY specify, in addition, how to use other charsets or
   other character encoding schemes for ISO 10646, such as UTF-16, but
   lack of an ability to use UTF-8 is a violation of this policy; such a
   violation would need a variance procedure ([BCP9] section 9) with