IETF A. Sullivan
Internet-Draft Dyn
Intended status: Best Current Practice D. Thaler
Expires: August 18, 2014 Microsoft
J. Klensin
February 14, 2014
IETF Policy on Character Sets and Languages
draft-sullivan-rfc2277-bis-00
Abstract
This is a proposed new policy for the IETF on Character Sets and
Languages.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 18, 2014.
Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Sullivan, et al. Expires August 18, 2014 [Page 1]
Internet-Draft Charset Policy February 2014
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3
2. Where to do internationalization . . . . . . . . . . . . . . 3
2.1. Domain names . . . . . . . . . . . . . . . . . . . . . . 3
2.2. Non-DNS, "invisible" protocol elements . . . . . . . . . 4
2.3. Non-DNS, "visible" protocol elements . . . . . . . . . . 5
2.4. Protocol data . . . . . . . . . . . . . . . . . . . . . . 6
3. General charset policy . . . . . . . . . . . . . . . . . . . 6
4. Languages . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1. The need for language information . . . . . . . . . . . . 7
4.2. Requirement for language tagging . . . . . . . . . . . . 7
4.3. How to identify a language . . . . . . . . . . . . . . . 8
4.4. Considerations for language negotiation . . . . . . . . . 8
4.5. Default language . . . . . . . . . . . . . . . . . . . . 9
5. Locale . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6. Documenting Internationalization Decisions . . . . . . . . . 9
7. Security Considerations . . . . . . . . . . . . . . . . . . . 10
8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
10. Informative References . . . . . . . . . . . . . . . . . . . 10
Appendix A. Version History . . . . . . . . . . . . . . . . . . 12
A.1. 00 . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction
The Internet is international.
With the international Internet follows an absolute requirement to
interchange data in a multiplicity of languages, which in turn
utilize a bewildering number of characters.
This document is based heavily on RFC 2277 [RFC2277], which is the
current policy applied by the Internet Engineering Steering Group
(IESG) to the standardization efforts of the Internet Engineering
Task Force (IETF) in order to help Internet protocols fulfill these
requirements.
RFC 2277 in turn was based on the recommendations of the IAB
Character Set Workshop of February 29-March 1, 1996, which is
documented in RFC 2130 [RFC2130]. This document is a proposed
replacement for RFC 2277 and attempts to be explicit and clear, and
as concise as possible without leaving out necessary detail. [[CREF1:
What other references do we want to add? --ajs@anvilwalrusden.com]]
1.1. Terminology
This document uses the terms "character", "charset", "coded character
set", "language", "locale", and "protocol elements" as defined in RFC
6365 [RFC6365]. IDNA terminology is defined in RFC 5890 [RFC5890].
Any of those definitions may be used below, and the reader is
expected to be familiar with them. [[CREF2: That last sentence makes
this document much less accessible. I think at a minimum we need to
list which terms used in this document are defined in each other RFC.
I've now added a list above for 6365, but it may be missing some and
the list of terms used from 5890 is needed.
--dthaler@microsoft.com]] [[CREF3: This is fair. I suggest we leave
this as is, do an exhaustive pass for terminology later, and update
these lists. --ajs@anvilwalrusden.com]]
This document uses the terms 'MUST', 'SHOULD' and 'MAY', and their
negatives, in the way described in RFC 2119 [RFC2119]. In this case,
'the specification' as used by RFC 2119 refers to the processing of
protocols being submitted to the IETF standards process.
2. Where to do internationalization
Internationalization is necessary because of the way natural language
is written. It enables localization, which is for humans. This
means that protocols are not subject to internationalization; text
strings are. Where protocol elements look like text tokens, such as
in many IETF application layer protocols, protocols MUST specify
which parts are protocol and which are text (see Section 2.2.1.1 of
[RFC2130]).
It is helpful to distinguish among four different types of strings
for these purposes: domain names, whether in the DNS or not; other
protocol elements that are not normally visible to users; other
protocol elements that are (at least sometimes) visible to users;
and data (in most cases, the protocol payload).
2.1. Domain names
Domain names (or strings of domain-name-like things) are used in a
number of protocols, and not all of those names are intended to be
looked up in the DNS. This raises a number of issues explored at
length in [RFC6055].
Given this state of affairs, it is possible to recommend the
following. These recommendations are consistent with RFC 6055:
o At resolution time, names that are to be looked up in the global
DNS SHOULD be transmitted as A-labels.
o At resolution time, names that are not to be looked up in the
global DNS ought to be transmitted in the form appropriate to the
name resolution protocol. This is often UTF-8.
o Storage of internationalized domain names ought generally to be in
the form of U-labels.
o Any protocol that needs to use domain names ought to use U-labels
or A-labels consistently, and ought to prefer U-labels.
o Storage of U-labels (or putative U-labels) should be in the
encoding form appropriate to the context. For instance, on a
system that normally encodes UTF-8 using NFD, that is how the
strings should be stored; similarly, a system that uses UTF-16
should store the strings in that form.
[[CREF4: This in the end will need to be checked carefully for its
consistency with 6055. --ajs@anvilwalrusden.com]]
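The difference between the two label forms can be illustrated with a
minimal Python sketch. It assumes the input is already a valid label
and applies only NFC normalization and the raw Punycode
transformation through Python's built-in "punycode" codec. It
performs none of the validity checks that IDNA2008 [RFC5891]
requires, and the function names are illustrative only.

```python
import unicodedata

ACE_PREFIX = "xn--"  # ASCII-compatible encoding prefix used by A-labels


def to_a_label(label: str) -> str:
    """Convert a (putative) U-label to A-label form.

    Illustrative only: no IDNA2008 validity checks are performed.
    """
    if label.isascii():
        return label  # already an ASCII label; nothing to encode
    nfc = unicodedata.normalize("NFC", label)
    return ACE_PREFIX + nfc.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Convert an A-label back to Unicode form; pass other labels through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def to_wire_form(name: str) -> str:
    """Apply to_a_label to each dot-separated label of a domain name."""
    return ".".join(to_a_label(part) for part in name.split("."))
```

Under these assumptions, a name could be kept in U-label form in
storage and converted with to_wire_form() only at the point where it
is handed to the global DNS.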
2.2. Non-DNS, "invisible" protocol elements
Many protocols include elements that are either words or word-like in
some natural language (usually English), but that are never exposed
to users under normal circumstances. Users might encounter these
protocol elements in log messages and so on, and system
administrators might regularly encounter them as part of the ordinary
support burden. But these elements are no more candidates for
internationalization than are hexadecimal protocol parameters.
Because they are not intended for user consumption, they should not
be treated as any part of a user interface. Internationalization
considerations do not apply to them.
It is important to recognize that protocol elements of this class
sometimes appear to be exposed to users -- for instance, many mail
user agents display message headers. In these cases, it is
important to distinguish between the protocol element itself, and the
user cues it may provide. The protocol element does not need to be
internationalized. The user interface might. In general, it is best
to internationalize (or localize) strings that are encountered by the
user and to keep those that are passed between computer systems and
interpreted by them as simple and unambiguous as possible. Even for
names or strings that provide the underpinnings for the strings that
users type or with which they interact, it is important to keep their
forms as simple as possible. Examples of such strings include the
results of a search or material that must be translated into several
different languages.
2.3. Non-DNS, "visible" protocol elements
Sometimes, protocol elements are expected to be visible or, as
likely, manipulable by users. [[CREF5: Sorry, the following bit
needs some more references, which I've failed to get right in the
interests of expediency. This is here to remind me.
--ajs@anvilwalrusden.com]] For instance, many values of SMTP
[RFC5321] commands are parts of mail addresses that users are
expected to type. In the presence of EAI, those addresses may well
be internationalized.
In general, there are two ways to handle these sorts of strings. One
is to use an ASCII-compatible encoding in the way that IDNA does.
Another is to internationalize the protocol. If an internationalized
protocol is to be undertaken, agility among coded character sets
appears to cause more problems than it solves. Therefore, for the
purposes of transmission, it is best to transmit protocol elements as
UTF-8 strings in "Net-Unicode" [RFC5198] form, with an appropriate
profile. All ASCII-only strings meet this criterion. [[CREF6: Maybe
the profile stuff needs to refer to PRECIS anyway.
--ajs@anvilwalrusden]]
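As a rough illustration of what preparing a string for transmission
might involve, the following Python sketch applies a simplified
reading of the Net-Unicode rules: CRLF line endings, no other control
characters, NFC normalization, and UTF-8 encoding. The function name
is hypothetical, rejecting every other control character is stricter
than [RFC5198] itself, and no protocol-specific profile is applied.

```python
import unicodedata


def to_net_unicode(text: str) -> bytes:
    """Return text as UTF-8 octets in a simplified Net-Unicode form."""
    # Use CRLF as the only line-ending sequence.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    # Reject control characters (C0, DEL, C1) other than CR and LF.
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in "\r\n":
            raise ValueError(f"control character U+{ord(ch):04X} not allowed")
    # Normalize to NFC, then encode as UTF-8.
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

An ASCII-only string without stray control characters passes through
this function unchanged, which matches the observation above that
such strings already meet the criterion.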
Merely requiring Net-Unicode is not enough. The PRECIS working group
documents outline a number of considerations for how protocol
elements and data need to be handled in the face of
internationalization concerns. These kinds of considerations are
especially important for protocol elements that may be influenced by
user action. For instance, if comparisons are to be used, good
PRECIS profiles for those elements are critical.
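The comparison problem can be seen in a small Python sketch. It is
not a PRECIS profile; it only contrasts a code-point comparison with
one that applies NFC normalization and case folding first, and both
function names are illustrative.

```python
import unicodedata


def naive_equal(a: str, b: str) -> bool:
    """Compare two strings code point by code point."""
    return a == b


def normalized_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and case folding."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFC", s).casefold()
        # Case folding can leave a string unnormalized, so normalize again.
        return unicodedata.normalize("NFC", folded)

    return key(a) == key(b)


composed = "r\u00e9sum\u00e9"      # precomposed e-acute
decomposed = "Re\u0301sume\u0301"  # "e" followed by a combining acute accent
```

Here naive_equal(composed, decomposed) is False although a user would
see the same word, while normalized_equal() treats the two as equal.
Which transformations are appropriate depends on the protocol element,
which is why a profile is needed.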
In the design of protocols for use on the Internet (or in other
communications systems) that use textual keywords, there is a
tradeoff between strings that have high mnemonic value (i.e., the
identifiers are easily remembered by those who will use them) in
local environments and those that are easily recognized and used
internationally. Most cases are (and should be) resolved in favor of
the latter, because these are strings used in protocols, a single set
can easily be translated, and because it is possible to choose a
single well-known script with good properties for those strings. But
there are cases when other considerations are more important and each
case and protocol should be carefully and separately considered.
[[CREF7: I think I'd remove the last of those sentences unless we
want to say when. --ajs@anvilwalrusden.com]]
2.4. Protocol data
Protocol data is very frequently user visible, and to the extent
there are highly variable internationalization principles, they
appear more commonly here.
In general, protocol data needs to carry an indicator of its coded
character set. A protocol MUST identify, for all character data,
which coded character set is in use. Protocols MUST be able to use
UTF-8. New protocols SHOULD use UTF-8, and UTF-8 only, unless strong
motivation is given for exceptions. The identification methods
discussed in this section are for use with legacy protocols and
situations.
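One way to picture the identification requirement is a data item that
always travels with its charset label and defaults to UTF-8. The
following Python sketch uses a hypothetical LabeledText type; it is
not drawn from any particular protocol.

```python
import codecs
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledText:
    """Character data together with the charset label that identifies it."""

    octets: bytes
    charset: str = "utf-8"  # default for new protocols

    def decode(self) -> str:
        """Decode the octets strictly according to the declared charset."""
        try:
            codecs.lookup(self.charset)
        except LookupError:
            raise ValueError(f"unsupported charset: {self.charset!r}") from None
        return self.octets.decode(self.charset)
```

Decoding is strict, so data that is mislabeled fails visibly rather
than being rendered as the wrong characters.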
NOTE: In the protocol stack for any given application, there is
usually one or a few layers that need to address these problems.
It would, for instance, not be appropriate to define language tags
for Ethernet frames. It is the responsibility of protocol designers
to ensure that whenever responsibility for internationalization is
left to "another layer", those responsible for that layer are in fact
aware that they have that responsibility. The PRECIS framework
provides more guidance. [[CREF8: Surely this is too hand-wavy?
Should we refer to particular bits? --ajs]]
3. General charset policy
The general policy of the IETF is that all data should be transmitted
on the wire as UTF-8. Any protocol that does not conform to this
policy but that is intended for the IETF standards track MUST justify
it to the IETF.
When the protocol allows a choice of multiple charsets, someone must
make a decision on which charset to use.
In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of data
containing text. In such cases, it may make sense to negotiate a
charset before sending data.
In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the charset is
clearly identified with the stored data, and to choose a charset that
is as widely known as possible.
Note that a charset is an absolute; text that is encoded in a charset
cannot be rendered comprehensibly without supporting that charset.
This also applies to English texts; charsets like EBCDIC do NOT have
ASCII as a proper subset.
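The point about EBCDIC can be checked with a short Python sketch. It
uses "cp037", one EBCDIC code page available in Python's standard
codecs, as a stand-in for the EBCDIC family; the variable names are
illustrative.

```python
text = "Hello"

ascii_octets = text.encode("ascii")
utf8_octets = text.encode("utf-8")
ebcdic_octets = text.encode("cp037")  # cp037: one EBCDIC code page

# ASCII text has the same octets in UTF-8, but not in EBCDIC.
utf8_matches_ascii = utf8_octets == ascii_octets
ebcdic_matches_ascii = ebcdic_octets == ascii_octets

# Treating the EBCDIC octets as ASCII does not recover the text.
try:
    ebcdic_octets.decode("ascii")
    ebcdic_readable_as_ascii = True
except UnicodeDecodeError:
    ebcdic_readable_as_ascii = False
```

Even plain English text is therefore unreadable to a receiver that
does not know which charset was used.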
Negotiating a charset may be regarded as an interim mechanism that is
to be supported until support for interchange of UTF-8 is prevalent.
Despite the wide adoption of Unicode and UTF-8, the timeframe of
"interim" may remain long, though perhaps not permanent.
4. Languages
4.1. The need for language information
All human-readable text has a language.
Many operations, including high quality formatting, text-to-speech
synthesis, searching, hyphenation, spellchecking and so on benefit
greatly from, or are all but impossible without, access to
information about the language of a piece of text (Section 3.1.1.4 of
[RFC2130]).
Humans have some tolerance for foreign languages, but are generally
very unhappy with being presented text in a language they do not
understand; this is why negotiation of language is needed.
In most cases, machines will not be able to deduce the language of a
transmitted text by themselves; the protocol must specify how to
transfer the language information if it is to be available at all.
It is sometimes possible to guess the language of a block of text, but
such guessing is usually unreliable and becomes dramatically less
reliable the shorter the block of text.
4.2. Requirement for language tagging
Protocols that transfer text MUST provide for carrying information
about the language of that text.
Protocols SHOULD also provide for carrying language information about
visible protocol elements (especially if they are names), where
appropriate.
Note that this does not mean that such information must always be
present; the requirement is that if the sender of information wishes
to send information about the language of a text, the protocol
provides a well-defined way to carry this information. Nevertheless,
if the data originator does not supply that information, it is
generally impossible to make it up later.
4.3. How to identify a language
The language tag [RFC5646] is at the moment the most flexible tool
available for identifying a language; protocols SHOULD use this, or
provide clear and solid justification for doing otherwise in the
document. Language tags are in general not useful without profiling
appropriate to the case, and there is significant danger of over-
specification with tags. See Section 4.1 of RFC 5646.
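As an example of what a restrictive profile might look like, the
following Python sketch accepts only tags of the form language,
optional script, optional region, and returns them in the
conventional case. It is an assumption made for illustration, not the
full [RFC5646] grammar: variants, extensions, private-use subtags,
and registry validation are deliberately left out.

```python
import re
from typing import NamedTuple, Optional

# language[-script][-region]; anything else is rejected by this profile.
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


class SimpleTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_simple_tag(tag: str) -> Optional[SimpleTag]:
    """Parse a tag under the restricted profile, or return None."""
    m = _SIMPLE_TAG.fullmatch(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    return SimpleTag(
        m.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )
```

A protocol that adopted such a profile would reject a tag like
"de-CH-1901" rather than guess at its meaning, which is one way of
limiting the over-specification risk mentioned above.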
Note also that a language is distinct from a POSIX locale (see
Section 5); a POSIX locale identifies a set of cultural conventions,
which may imply a language (the "POSIX" and "C" locales of course do
not), while a language tag identifies only a language.
4.4. Considerations for language negotiation
Protocols where users have text presented to them in response to user
actions MUST provide for support of multiple languages.
How this is done will vary between protocols; for instance, in some
cases, a negotiation where the client proposes a set of languages and
the server replies with one is appropriate; in other cases, a server
may choose to send multiple variants of a text and let the client
pick which one to display.
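The first style of negotiation can be sketched in Python as follows.
The client supplies language tags in order of preference and the
server picks the first one it can satisfy, falling back from a
specific tag to a more general one by dropping trailing subtags. This
is loosely modeled on the "lookup" scheme of RFC 4647 but omits some
of its rules; the function name and parameters are illustrative.

```python
from typing import Iterable, Optional


def choose_language(
    client_preferences: Iterable[str],
    server_languages: Iterable[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Pick the server language that best matches the client's preferences."""
    # Language tags are compared case-insensitively.
    available = {tag.lower(): tag for tag in server_languages}
    for preferred in client_preferences:
        candidate = preferred.lower()
        while candidate:
            if candidate in available:
                return available[candidate]
            # Drop the last subtag and try the more general tag.
            candidate = candidate.rpartition("-")[0]
    return default
```

For a client that prefers "fr-CA" and then "en", a server offering
"en" and "fr" would answer in "fr" under this sketch. What the server
does when nothing matches is left to the caller through the default
parameter.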
Negotiation is useful in the case where one side of the protocol
exchange is able to present text in multiple languages to the other
side, and the other side has a preference for one of these; the most
common example is the text part of error responses, or Web pages that
are available in multiple languages.
Users do not, of course, actually use protocols, but instead user
interfaces that in turn use the protocols. Therefore, what is
necessary to support is not the full internationalization of
everything in the protocol, but enough that the user-visible
components can be localized appropriately. See Section 2.3.
Negotiating a language should be regarded as a permanent requirement
of the protocol that will not go away at any time in the future.
In many cases, it should be possible to include it as part of the
connection establishment, together with authentication and other
preferences negotiation.
4.5. Default language
For the purposes of display, it may be necessary to pick a default
language to use when it is not possible to determine the language.
It is evident that picking a default may lead to user dissatisfaction
or confusion, but when language cannot be determined such fallbacks
may be necessary.
Items 5 and 7 in Section 4.1 of [RFC5646] outline the considerations
for language identification when the language cannot be determined.
5. Locale
The POSIX standard [ISO.9945-2.1993] defines a concept called a
"locale", which includes a lot of information about collating order
for sorting, date format, currency format and so on.
In some cases, and especially with text where the user is expected to
do processing on the text, locale information may be usefully
attached to the text; this would identify the sender's opinion about
appropriate rules to follow when processing the document, which the
recipient may choose to agree with or ignore.
This document does not require the communication of locale
information on all text, but encourages its inclusion when
appropriate.
Note that language and character set information will often be
present as parts of a locale tag (such as no_NO.iso-8859-1; the
language is before the underscore and the character set is after the
dot); care must be taken to define precisely which specification of
character set and language applies to any one text item.
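The structure of such a locale tag can be made concrete with a short
Python sketch. It assumes the commonly used layout
language[_territory][.codeset][@modifier]; actual locale names vary
between systems, and the function and type names are illustrative.

```python
import re
from typing import NamedTuple, Optional

_LOCALE_NAME = re.compile(
    r"(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?"
)


class LocaleParts(NamedTuple):
    language: Optional[str]
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


def parse_locale_name(name: str) -> Optional[LocaleParts]:
    """Split a locale name into its parts, or return None if malformed."""
    if name in ("C", "POSIX"):
        # These locales imply neither a language nor a character set.
        return LocaleParts(None, None, None, None)
    m = _LOCALE_NAME.fullmatch(name)
    if m is None:
        return None
    return LocaleParts(*m.group("language", "territory", "codeset", "modifier"))
```

For "no_NO.iso-8859-1" this yields the language "no" and the codeset
"iso-8859-1". A protocol that also carries a separate charset label
or language tag would still need to state which of the two sources
takes precedence.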
The default locale is the "POSIX" locale.
6. Documenting Internationalization Decisions
In documents that deal with internationalization issues at all, a
synopsis of the approaches chosen for internationalization SHOULD be
collected into a section called "Internationalization
considerations". This practice has historically not been followed
regularly, but it remains a good idea. The goal is to provide an
easy reference for those who are looking for advice on these issues
when implementing the protocol.
7. Security Considerations
Security warnings in a foreign language may cause inappropriate
behaviour (such as ignoring the warning entirely) from the user. In
addition, the issues raised in [RFC6943], especially in its section
4.2 and section 5, are of particular relevance to
internationalization.
8. Acknowledgements
Much of the text comes from [RFC2277]. Harald Alvestrand was the
primary author of that RFC.
Most of the discussion above was initiated as part of the IAB's
internationalization program. At the time of writing, the program
members were (in alphabetical order) Marc Blanchet, Stuart Cheshire,
Leslie Daigle, Patrik Faltstrom, Heather Flanagan, John Klensin, Olaf
Kolkman, Barry Leiba, Xing Li, Pete Resnick, Peter Saint-Andre,
Andrew Sullivan, and Dave Thaler.
Significant text in Section 2.2 and Section 2.3 was derived from a
forthcoming Internet Society education module for next-generation
Internet leaders and future influencers and used with permission.
The contributions and support for that work of Toral Cowleson and
Niel Harper of the Internet Society are gratefully acknowledged.
9. IANA Considerations
This document makes no requests of IANA.
10. Informative References
[ISO.10646-1.1993]
International Organization for Standardization,
"Information Technology - Universal Multiple-octet coded
Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane", ISO Standard 10646-1, May 1993.
[ISO.9945-2.1993]
International Organization for Standardization, "ISO/IEC
9945-2:1993 Information Technology -- Portable Operating
System Interface (POSIX) -- Part 2: Shell and Utilities",
ISO Standard 9945-2, 1993.
[RFC1033] Lottor, M., "Domain administrators operations guide", RFC
1033, November 1987.
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
STD 13, RFC 1034, November 1987.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision
3", BCP 9, RFC 2026, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997.
[RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
Specification", RFC 2181, July 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
[RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network
Interchange", RFC 5198, March 2008.
[RFC5321] Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
October 2008.
[RFC5646] Phillips, A. and M. Davis, "Tags for Identifying
Languages", BCP 47, RFC 5646, September 2009.
[RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework",
RFC 5890, August 2010.
[RFC5891] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", RFC 5891, August 2010.
[RFC5892] Faltstrom, P., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)",
RFC 5892, August 2010.
[RFC5893] Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
Internationalized Domain Names for Applications (IDNA)",
RFC 5893, August 2010.
[RFC5894] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Background, Explanation, and
Rationale", RFC 5894, August 2010.
[RFC5895] Resnick, P. and P. Hoffman, "Mapping Characters for
Internationalized Domain Names in Applications (IDNA)
2008", RFC 5895, September 2010.
[RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
Encodings for Internationalized Domain Names", RFC 6055,
February 2011.
[RFC6365] Hoffman, P. and J. Klensin, "Terminology Used in
Internationalization in the IETF", BCP 166, RFC 6365,
September 2011.
[RFC6762] Cheshire, S. and M. Krochmal, "Multicast DNS", RFC 6762,
February 2013.
[RFC6943] Thaler, D., "Issues in Identifier Comparison for Security
Purposes", RFC 6943, May 2013.
Appendix A. Version History
A.1. 00
Initial version. Contains a number of xml2rfc warnings.
Authors' Addresses
Andrew Sullivan
Dyn
150 Dow St.
Manchester, NH 03101
U.S.A.
Email: asullivan@dyn.com
Dave Thaler
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
USA
Phone: +1 425 703 8835
Email: dthaler@microsoft.com
John C Klensin
1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140
USA
Phone: +1 617 245 1457
Email: john-ietf@jck.com