IETF                                                         A. Sullivan
Internet-Draft                                                       Dyn
Intended status: Best Current Practice                         D. Thaler
Expires: August 18, 2014                                       Microsoft
                                                              J. Klensin

                                                       February 14, 2014


              IETF Policy on Character Sets and Languages
                     draft-sullivan-rfc2277-bis-00

Abstract

   This is a proposed new policy for the IETF on Character Sets and
   Languages.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 18, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.



Sullivan, et al.         Expires August 18, 2014                [Page 1]


Internet-Draft               Charset Policy                February 2014


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Where to do internationalization  . . . . . . . . . . . . . .   3
     2.1.  Domain names  . . . . . . . . . . . . . . . . . . . . . .   3
     2.2.  Non-DNS, "invisible" protocol elements  . . . . . . . . .   4
     2.3.  Non-DNS, "visible" protocol elements  . . . . . . . . . .   5
     2.4.  Protocol data . . . . . . . . . . . . . . . . . . . . . .   6
   3.  General charset policy  . . . . . . . . . . . . . . . . . . .   6
   4.  Languages . . . . . . . . . . . . . . . . . . . . . . . . . .   7
     4.1.  The need for language information . . . . . . . . . . . .   7
     4.2.  Requirement for language tagging  . . . . . . . . . . . .   7
     4.3.  How to identify a language  . . . . . . . . . . . . . . .   8
     4.4.  Considerations for language negotiation . . . . . . . . .   8
     4.5.  Default language  . . . . . . . . . . . . . . . . . . . .   9
   5.  Locale  . . . . . . . . . . . . . . . . . . . . . . . . . . .   9
   6.  Documenting Internationalization Decisions  . . . . . . . . .   9
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  10
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  10
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  10
   10. Informative References  . . . . . . . . . . . . . . . . . . .  10
   Appendix A.  Version History  . . . . . . . . . . . . . . . . . .  12
     A.1.  00  . . . . . . . . . . . . . . . . . . . . . . . . . . .  12
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  12

1.  Introduction

   The Internet is international.

   With the international Internet follows an absolute requirement to
   interchange data in a multiplicity of languages, which in turn
   utilize a bewildering number of characters.

   The document is very much based upon RFC 2277 [RFC2277] which is the
   current policy being applied by the Internet Engineering Steering
   Group (IESG) towards the standardization efforts in the Internet
   Engineering Task Force (IETF) in order to help Internet protocols
   fulfill these requirements.

   RFC 2277 in turn was based on the recommendations of the IAB
   Character Set Workshop of February 29-March 1, 1996, which is
   documented in RFC 2130 [RFC2130].  This document is a proposed
   replacement for RFC 2277 and attempts to be explicit and clear, and
   as concise as possible without leaving out necessary detail.[[CREF1:
   What other references do we want to add? --ajs@anvilwalrusden.com]]





Sullivan, et al.         Expires August 18, 2014                [Page 2]


Internet-Draft               Charset Policy                February 2014


1.1.  Terminology

   This document uses the terms "character", "charset", "coded character
   set", "language", "locale", and "protocol elements" as defined in RFC
   6365 [RFC6365].  IDNA terminology is defined in RFC 5890 [RFC5890].
   Any of those definitions may be used below, and the reader is
   expected to be familiar with them.  [[CREF2: That last sentence makes
   this document much less accessible.  I think at a minimum we need to
   list which terms used in this document are defined in each other RFC.
   I've now added a list above for 6365, but it may be missing some and
   the list of terms used from 5890 is needed.
   --dthaler@microsoft.com]][[CREF3: This is fair.  I suggest we leave
   this as is and do an exhaustive pass for terminology later and
   updates these lists. --ajs@anvilwalrusden.com]]

   This document uses the terms 'MUST', 'SHOULD' and 'MAY', and their
   negatives, in the way described in RFC 2119 [RFC2119].  In this case,
   'the specification' as used by RFC 2119 refers to the processing of
   protocols being submitted to the IETF standards process.

2.  Where to do internationalization

   Internationalization is necessary because of the way natural language
   is written.  It enables localization, which is for humans.  This
   means that protocols are not subject to internationalization; text
   strings are.  Where protocol elements look like text tokens, such as
   in many IETF application layer protocols, protocols MUST specify
   which parts are protocol and which are text (see Section 2.2.1.1 of
   [RFC2130]).

   It is helpful to distinguish among four different types of strings
   for these purposes: domain names whether in the DNS or not, other
   protocol elements that are not normally visible to users, other
   protocol elements that are (even sometimes) normally visible to
   users, and data (in most cases, the protocol payload).

2.1.  Domain names

   Domain names (or strings of domain-name-like things) are used in a
   number of protocols, and not all of those names are intended to be
   looked up in the DNS.  This raises a number of issues explored at
   length in [RFC6055].

   Given this state of affairs, it is possible to recommend the
   following.  These recommendations are consistent with RFC 6055:

   o  At resolution time, names that are to be looked up in the global
      DNS SHOULD be transmitted as A-labels.



Sullivan, et al.         Expires August 18, 2014                [Page 3]


Internet-Draft               Charset Policy                February 2014


   o  At resolution time, names that are not to be looked up in the
      global DNS ought to be transmitted in the form appropriate to the
      name resolution protocol.  This is often UTF-8.

   o  Storage of internationalized domain names ought generally to be in
      the form of U-labels.

   o  Any protocol that needs to use domain names ought to use U-labels
      or A-labels consistently, and ought to prefer U-labels.

   o  Storage of U-labels (or putative U-labels) should be in the
      encoding form appropriate to the context.  For instance, on a
      system that normally encodes UTF-8 using NFD, that is how the
      strings should be stored; similarly, a system that uses UTF-16
      should store the strings in that form.

   [[CREF4: This in the end will need to be checked carefully for its
   consistency with 6055. --ajs@anvilwalrusden.com]]

2.2.  Non-DNS, "invisible" protocol elements

   Many protocols include elements that are either words or word-like in
   some natural language (usually English), but that are never exposed
   to users under normal circumstances.  Users might encounter these
   protocol elements in log messages and so on, and system
   administrators might regularly encounter them as part of the ordinary
   support burden.  But these elements are no more candidates for
   internationalization than are hexadecimal protocol parameters.
   Because they are not intended for user consumption, they should not
   be treated as any part of a user interface.  Internationalization
   considerations do not apply to them.

   It is important to recognize that some of this class of protocol
   element sometimes appears to be exposed to users -- for instance,
   many user agents for mail display headers.  In these cases, it is
   important to distinguish between the protocol element itself, and the
   user cues it may provide.  The protocol element does not need to be
   internationalized.  The user interface might.  In general, it is best
   to internationalize (or localize) strings that are encountered by the
   user and to keep those that are passed between computer systems and
   interpreted by them as simple and unambiguous as possible.  Even for
   names or strings that provide the underpinnings for the strings that
   users type or with which they interact, it is important to keep their
   forms as simple as possible.  Examples of such strings include the
   results of a search or material that must be translated into several
   different languages.





Sullivan, et al.         Expires August 18, 2014                [Page 4]


Internet-Draft               Charset Policy                February 2014


2.3.  Non-DNS, "visible" protocol elements

   Sometimes, protocol elements are expected to be visible or, as
   likely, manipulable by users.  [[CREF5: Sorry, the following bit
   needs some more references, which I've failed to get right in the
   interests of expediency.  This is here to remind me.
   --ajs@anvilwalrusden.com]] For instance, many values of SMTP
   [RFC5321] commands are parts of mail addresses that users are
   expected to type.  In the presence of EAI, those addresses may well
   be internationalized.

   In general, there are two ways to handle these sorts of strings.  One
   is to use an ASCII-compatible encoding in the way that IDNA does.
   Another is to internationalize the protocol.  If an internationalized
   protocol is to be undertaken, agility among coded character sets
   appears to cause more problems than it solves.  Therefore, for the
   purposes of transmission, it is best to transmit protocol elements as
   UTF-8 strings in "Net-Unicode" [RFC5198] form, with an appropriate
   profile.  All ASCII-only strings meet this criterion.  [[CREF6: Maybe
   the profile stuff needs to refer to PRECIS anyway.
   --ajs@anvilwalrusden]]

   Merely requiring Net-Unicode is not enough.  The PRECIS working group
   documents outline a number of considerations for how protocol
   elements and data need to be handled in the face of
   internationalization concerns.  These kinds of considerations are
   especially important for protocol elements that may be influenced by
   user action.  For instance, if comparisons are to be used, good
   PRECIS profiles for those elements are critical.

   In the design of protocols for use on the Internet (or in other
   communications systems) that use textual keywords, there is a
   tradeoff between strings that have high mnemonic value (i.e., the
   identifiers are easily remembered by those who will use them) in
   local environments and those that are easily recognized and used
   internationally.  Most cases are (and should be) resolved in favor of
   the latter, because these are strings used in protocols, a single set
   can easily be translated, and because it is possible to choose a
   single well-known script with good properties for those strings.  But
   there are cases when other considerations are more important and each
   case and protocol should be carefully and separately considered.
   [[CREF7: I think I'd remove the last of those sentences unless we
   want to say when. --ajs@anvilwalrusden.com]]








Sullivan, et al.         Expires August 18, 2014                [Page 5]


Internet-Draft               Charset Policy                February 2014


2.4.  Protocol data

   Protocol data is very frequently user visible, and to the extent
   there are highly variable internationalization principles, they
   appear more commonly here.

   In general, protocol data needs to carry an indicator of its coded
   character set.  A protocol MUST identify, for all character data,
   which coded character set is in use.  Protocols MUST be able to use
   UTF-8.  New protocols SHOULD use UTF-8, and UTF-8 only, unless strong
   motivation is given for exceptions.  The identification methods
   discussed in this section are for use with legacy protocols and
   situations.

   NOTE: In the protocol stack for any given application, there is
   usually one or a few layers that need to address these problems.

   It would, for instance, not be appropriate to define language tags
   for Ethernet frames.  It is the responsibility of protocol designers
   to ensure that whenever responsibility for internationalization is
   left to "another layer", those responsible for that layer are in fact
   aware that they have that responsibility.  The precis framework
   provides more guidance.  [[CREF8: Surely this is too hand-wavy?
   Should we refer to particular bits? --ajs]]

3.  General charset policy

   The general policy of the IETF is that all data should be transmitted
   on the wire as UTF-8.  Any protocol that does not conform to this
   policy but that is intended for the IETF standards track MUST justify
   it to the IETF.

   When the protocol allows a choice of multiple charsets, someone must
   make a decision on which charset to use.

   In some cases, like HTTP, there is direct or semi-direct
   communication between the producer and the consumer of data
   containing text.  In such cases, it may make sense to negotiate a
   charset before sending data.

   In other cases, like E-mail or stored data, there is no such
   communication, and the best one can do is to make sure the charset is
   clearly identified with the stored data, and choosing a charset that
   is as widely known as possible.

   Note that a charset is an absolute; text that is encoded in a charset
   cannot be rendered comprehensibly without supporting that charset.




Sullivan, et al.         Expires August 18, 2014                [Page 6]


Internet-Draft               Charset Policy                February 2014


   This also applies to English texts; charsets like EBCDIC do NOT have
   ASCII as a proper subset.

   Negotiating a charset may be regarded as an interim mechanism that is
   to be supported until support for interchange of UTF-8 is prevalent.
   Despite the wide adoption of Unicode and UTF-8, the timeframe of
   "interim" may remain long, though perhaps not permanent.

4.  Languages

4.1.  The need for language information

   All human-readable text has a language.

   Many operations, including high quality formatting, text-to-speech
   synthesis, searching, hyphenation, spellchecking and so on benefit
   greatly from, or are all but impossible without, access to
   information about the language of a piece of text (Section 3.1.1.4 of
   [RFC2130]).

   Humans have some tolerance for foreign languages, but are generally
   very unhappy with being presented text in a language they do not
   understand; this is why negotiation, or at least negotiation, of
   language is needed.

   In most cases, machines will not be able to deduce the language of a
   transmitted text by themselves; the protocol must specify how to
   transfer the language information if it is to be available at all.
   It is sometimes possible to guess the langage of a block of text, but
   such guessing is usually unreliable and becomes dramatically less
   reliable the shorter the block of text.

4.2.  Requirement for language tagging

   Protocols that transfer text MUST provide for carrying information
   about the language of that text.

   Protocols SHOULD also provide for carrying language information about
   visible protocol elements (especially if they are names), where
   appropriate.

   Note that this does not mean that such information must always be
   present; the requirement is that if the sender of information wishes
   to send information about the language of a text, the protocol
   provides a well-defined way to carry this information.  Nevertheless,
   if the data originator does not supply that information, it is
   generally impossible to make it up later.




Sullivan, et al.         Expires August 18, 2014                [Page 7]


Internet-Draft               Charset Policy                February 2014


4.3.  How to identify a language

   The language tag [RFC5646] is at the moment the most flexible tool
   available for identifying a language; protocols SHOULD use this, or
   provide clear and solid justification for doing otherwise in the
   document.  Language tags are in general not useful without profiling
   appropriate to the case, and there is significant danger of over-
   specification with tags.  See Section 4.1 of RFC 5646.

   Note also that a language is distinct from a POSIX locale (see
   Section 5); a POSIX locale identifies a set of cultural conventions,
   which may imply a language (the "POSIX" and "C" locales of course do
   not), while a language tag identifies only a language.

4.4.  Considerations for language negotiation

   Protocols where users have text presented to them in response to user
   actions MUST provide for support of multiple languages.

   How this is done will vary between protocols; for instance, in some
   cases, a negotiation where the client proposes a set of languages and
   the server replies with one is appropriate; in other cases, a server
   may choose to send multiple variants of a text and let the client
   pick which one to display.

   Negotiation is useful in the case where one side of the protocol
   exchange is able to present text in multiple languages to the other
   side, and the other side has a preference for one of these; the most
   common example is the text part of error responses, or Web pages that
   are available in multiple languages.

   Users do not, of course, actually use protocols, but instead user
   interfaces that in turn use the protocols.  Therefore, what is
   necessary to support is not the full internationalization of
   everything in the protocol, but enough that the user-visible
   components can be localized appropriately.  See Section 2.3.

   Negotiating a language should be regarded as a permanent requirement
   of the protocol that will not go away at any time in the future.

   In many cases, it should be possible to include it as part of the
   connection establishment, together with authentication and other
   preferences negotiation.








Sullivan, et al.         Expires August 18, 2014                [Page 8]


Internet-Draft               Charset Policy                February 2014


4.5.  Default language

   For the purposes of display, it may be necessary to pick a default
   language to use when it is not possible to determine the language.
   It is evident that picking a default may lead to user dissatisfaction
   or confusion, but when language cannot be determined such fallbacks
   may be necessary.

   Section 4.1 of [RFC5646], numbers 5 and 7, outline the considerations
   for language identification when the language cannot be determined.

5.  Locale

   The POSIX standard [ISO.9945-2.1993] defines a concept called a
   "locale", which includes a lot of information about collating order
   for sorting, date format, currency format and so on.

   In some cases, and especially with text where the user is expected to
   do processing on the text, locale information may be usefully
   attached to the text; this would identify the sender's opinion about
   appropriate rules to follow when processing the document, which the
   recipient may choose to agree with or ignore.

   This document does not require the communication of locale
   information on all text, but encourages its inclusion when
   appropriate.

   Note that language and character set information will often be
   present as parts of a locale tag (such as no_NO.iso-8859-1; the
   language is before the underscore and the character set is after the
   dot); care must be taken to define precisely which specification of
   character set and language applies to any one text item.

   The default locale is the "POSIX" locale.

6.  Documenting Internationalization Decisions

   In documents that deal with internationalization issues at all, a
   synopsis of the approaches chosen for internationalization SHOULD be
   collected into a section called "Internationalization
   considerations".  This practice has historically not been followed
   regularly, but it remains a good idea.  The goal is to provide an
   easy reference for those who are looking for advice on these issues
   when implementing the protocol.







Sullivan, et al.         Expires August 18, 2014                [Page 9]


Internet-Draft               Charset Policy                February 2014


7.  Security Considerations

   Security warnings in a foreign language may cause inappropriate
   behaviour (such as ignoring the warning entirely) from the user.  In
   addition, the issues raised in [RFC6943], especially in its section
   4.2 and section 5, are of particular relevance to
   internationalization.

8.  Acknowledgements

   Much of the text comes from [RFC2277].  Harald Alvestrand was the
   primary author of that RFC.

   Most of the discussion above was initiated as part of the IAB's
   internationalization program.  At the time of writing, the program
   members were (in alphabetical order) Marc Blanchet, Stuart Cheshire,
   Leslie Daigle, Patrik Faltstrom, Heather Flanagan, John Klensin, Olaf
   Kolkman, Barry Leiba, Xing Li, Pete Resnick, Peter Saint-Andre,
   Andrew Sullivan, and Dave Thaler.

   Significant text in Section 2.2 and Section 2.3 was derived from a
   forthcoming Internet Society education module for next-generation
   Internet leaders and future influencers and used with permission.
   The contributions and support for that work of Toral Cowleson and
   Niel Harper of the Internet Society are gratefully acknowledged.

9.  IANA Considerations

   This document makes no requests of IANA.

10.  Informative References

   [ISO.10646-1.1993]
              International Organization for Standardization,
              "Information Technology - Universal Multiple-octet coded
              Character Set (UCS) - Part 1: Architecture and Basic
              Multilingual Plane", ISO Standard 10646-1, May 1993.

   [ISO.9945-2.1993]
              International Organization for Standardization, "ISO/IEC
              9945-2:1993 Information Technology -- Portable Operating
              System Interface (POSIX) -- Part 2: Shell and Utilities",
              ISO Standard 9945-2, 1993.

   [RFC1033]  Lottor, M., "Domain administrators operations guide", RFC
              1033, November 1987.





Sullivan, et al.         Expires August 18, 2014               [Page 10]


Internet-Draft               Charset Policy                February 2014


   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC2026]  Bradner, S., "The Internet Standards Process -- Revision
              3", BCP 9, RFC 2026, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
              the IAB Character Set Workshop held 29 February - 1 March,
              1996", RFC 2130, April 1997.

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, July 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
              Interchange", RFC 5198, March 2008.

   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
              October 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5893, August 2010.





Sullivan, et al.         Expires August 18, 2014               [Page 11]


Internet-Draft               Charset Policy                February 2014


   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, August 2010.

   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
              Internationalized Domain Names in Applications (IDNA)
              2008", RFC 5895, September 2010.

   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
              Encodings for Internationalized Domain Names", RFC 6055,
              February 2011.

   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
              Internationalization in the IETF", BCP 166, RFC 6365,
              September 2011.

   [RFC6762]  Cheshire, S. and M. Krochmal, "Multicast DNS", RFC 6762,
              February 2013.

   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
              Purposes", RFC 6943, May 2013.

Appendix A.  Version History

A.1.  00

   Initial version.  Contains a number of xml2rfc warnings.

Authors' Addresses

   Andrew Sullivan
   Dyn
   150 Dow St.
   Manchester, NH  03101
   U.S.A.

   Email: asullivan@dyn.com


   Dave Thaler
   Microsoft Corporation
   One Microsoft Way
   Redmonad, WA  98052
   USA

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com




Sullivan, et al.         Expires August 18, 2014               [Page 12]


Internet-Draft               Charset Policy                February 2014


   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   USA

   Phone: +1 617 245 1457
   Email: john-ietf@jck.com












































Sullivan, et al.         Expires August 18, 2014               [Page 13]