Network Working Group J. Klensin
Internet-Draft January 29, 2007
Expires: August 2, 2007
ASCII Escaping of Unicode Characters
draft-klensin-unicode-escapes-01.txt
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 2, 2007.
Copyright Notice
Copyright (C) The IETF Trust (2007).
Abstract
There are a number of circumstances in which an escape mechanism is
needed in conjunction with a protocol to encode characters that
cannot be represented or transmitted directly. With ASCII coding the
traditional escape has been either the decimal or hexadecimal offset
of the character, written in a variety of different ways. The move
to Unicode, where characters occupy two or more octets and may be
coded in several different forms, has further complicated the
question of escapes. This document discusses some options now in use
and makes a proposal for general use in new IETF protocols and
protocols that are now being internationalized.
Klensin Expires August 2, 2007 [Page 1]
Internet-Draft Unicode Escapes January 2007
Warning: Interim Draft
This version of the specification is an interim draft, intended to
correct (or at least note) obvious errors and reflect some of the
discussion on the mailing list in order to help focus the discussion
on remaining critical issues. It is not complete, nor does it claim
to accurately reflect all of the discussions so far.
Table of Contents
1. Introduction
1.1. Context and Background
1.2. Terminology
1.3. Discussion List
2. Encodings that Represent Unicode Code Points
2.1. Unicode Table Position versus UTF-8 Octets
3. Referring to Unicode Characters
4. Syntax for Code Point Escapes
5. Presentation Variants for Unicode Code Points
5.1. The C Programming Language: Backslash-U
5.2. HTML and XML
5.3. Perl: A Hexadecimal String
5.4. Java: Escaped UTF-16
6. Security Considerations
7. Acknowledgments
8. Change log
8.1. Changes in -01
9. References
9.1. Normative References
9.2. Informative References
Author's Address
Intellectual Property and Copyright Statements
1. Introduction
1.1. Context and Background
There are a number of circumstances in which an escape mechanism is
needed in conjunction with a protocol to encode characters that
cannot be represented or transmitted directly. With ASCII [ASCII]
coding the traditional escape has been either the decimal or
hexadecimal offset of the character, written in a variety of
different ways. For example, in different contexts, we have seen
%dNN or %NN for the decimal form, and %NN, %xNN, X'nn', and %X'NN' for
the hexadecimal form. "%NN" has become popular in recent years to
represent a hexadecimal value without further qualification, perhaps
as a consequence of its use in URLs and their prevalence. There are
even some applications around in which octal forms are used and,
while they do not generalize well, the MIME Quoted-Printable and
Encoded-word forms can be thought of as yet another set of escapes.
So, even for the fairly simple cases of ASCII and standards built by
extending ASCII, such as the ISO 8859 family, we have been living
with several different escaping forms, each the result of some
history.
When one moves to Unicode [Unicode] [ISO10646], where characters
occupy two or more octets and may be coded in several different
forms, the question of escapes becomes even more complicated. In
particular, we have seen fairly extensive use of both hexadecimal
representations of the UTF-8 encoding [RFC3629] of a character and
variations on the U+NNNN[N[N]] notation commonly used in conjunction
with the Unicode Standard. This document proposes that new
protocols, and protocols being internationalized, SHOULD use some
contextually-appropriate variation on the latter unless other
considerations outweigh those described here.
This recommendation is not applicable to protocols that already
accept native UTF-8 or some other encoding of Unicode. In general,
when protocols are internationalized, it is preferable to accept
those forms rather than using escapes. This recommendation applies
to cases, including transition arrangements, in which that is not
practical.
In addition to the protocol contexts addressed in this specification,
escapes to represent Unicode characters also appear in presentations
to users, i.e., in user interfaces (UI). The formats specified in,
and the reasoning of, this document may be applicable in UI contexts
as well, but this is not a proposal to standardize UI or presentation
forms.
1.2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
1.3. Discussion List
Discussion of this document should be addressed to the
discuss@apps.ietf.org mailing list.
2. Encodings that Represent Unicode Code Points
There are many different ways to designate, encode, or call out a
Unicode character. Given adequate decoding facilities, all of these
other than the formal character name are equivalent. However, when
information about characters is to be processed by people,
information about the Unicode code point is preferable to a further
encoding of the encoded form of the character. It is also desirable
to use hexadecimal references to code points because the Unicode
Standard is organized on a hexadecimal basis.
These issues are discussed in the following subsections.
2.1. Unicode Table Position versus UTF-8 Octets
There are two major families of ways to represent Unicode characters.
One uses the code point position in the table in some representation
(see the next section), the other encodes the octets of the UTF-8
encoding. Some other options are possible, but they have been rare
in practice. This specification recommends that, in the absence of
compelling reasons to do otherwise, the Unicode code point forms
SHOULD be used rather than the UTF-8 ones. There are several reasons
for this, including:
o One reason for the success of many IETF protocols is that they use
human-interpretable text forms to communicate, rather than
encodings that generally require computer programs (or hand
simulation of algorithms) to decode. This suggests that the
presentation form should reference the Unicode tables for
characters and do so as simply as possible.
o The nature of UTF-8 implies that a decimal or hexadecimal numeral
representation of UTF-8 requires conversion to the UTF-8 form,
then conversion from the UTF-8 form to a Unicode character
position form in order to look the character up in a table. That
may be appropriate in some cases where the goal is really to
represent the UTF-8 form but, in general, it just obscures desired
information and makes errors more likely and debugging harder.
o Except for characters in the ASCII subset of Unicode (U+0000
through U+007F), the character code position form is generally
more compact than forms based on coding UTF-8 octets, sometimes
much more compact.
The same considerations that apply to encoding of UTF-8 octets also
apply to more compact ACE encodings such as the "bootstring" encoding
[RFC3492] with or without its "Punycode" profile.
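As an illustration of the comparison above, the following Python sketch renders the same characters in a code point form and as a hexadecimal rendering of their UTF-8 octets. The function names and the use of "%NN" for the octet form are assumptions made for this example only; they are not defined by this document.

```python
def code_point_form(ch: str) -> str:
    """Return the U+NNNN[N[N]] notation for a single character."""
    return f"U+{ord(ch):04X}"


def utf8_octet_form(ch: str) -> str:
    """Return one %NN escape per octet of the character's UTF-8 encoding."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


# Three sample characters: U+00E9, U+20AC, and U+1D11E (outside Plane 0).
for ch in ("\u00e9", "\u20ac", "\U0001d11e"):
    print(code_point_form(ch), utf8_octet_form(ch))
```

For U+20AC the code point form can be looked up in the Unicode tables directly, while the octet form %E2%82%AC must first be decoded as UTF-8 before any lookup is possible.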
3. Referring to Unicode Characters
Regardless of what decisions are made about escapes for Unicode
characters in protocol or similar contexts, references to Unicode
characters in text SHOULD use the U+NNNN[N[N]] syntax for code point
references specified in the Unicode Standard, where the NNNN... string
consists of four to six hexadecimal digits.
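A minimal Python sketch of producing and checking such references follows. It assumes four to six uppercase hexadecimal digits and the code point range U+0000 through U+10FFFF; the helper names to_u_plus and from_u_plus are hypothetical.

```python
import re

# "U+" followed by four to six uppercase hexadecimal digits.
_U_PLUS = re.compile(r"U\+([0-9A-F]{4,6})")


def to_u_plus(code_point: int) -> str:
    """Format a code point as U+NNNN, padding to at least four digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return f"U+{code_point:04X}"


def from_u_plus(text: str) -> int:
    """Parse a U+NNNN[N[N]] reference and return the code point."""
    match = _U_PLUS.fullmatch(text)
    if match is None:
        raise ValueError("not in U+NNNN[N[N]] form")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError("beyond U+10FFFF")
    return value
```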
4. Syntax for Code Point Escapes
There are many options for code point escapes, some of which are
summarized below. All are equivalent in content and semantics -- the
differences lie in syntax. The best choice of syntax for a
particular protocol or other application depends on that application:
one form may simply "fit" better in a given context than others. It
is clear, however, that hexadecimal values are preferable to other
alternatives: Systems based on decimal or octal offsets SHOULD NOT be
used.
Since this specification does not recommend one specific syntax,
protocol specifications that use escapes MUST define the syntax they
are using, including any necessary escapes to permit the escape
sequence to be used literally.
The application designer selecting a format should consider at least
the following factors:
o If similar or related protocols already use one form, it may be
best to select that form for consistency and predictability.
o A Unicode code point can fall in the range from U+0000 to
U+10FFFF. Different escape systems may use four, five, six, or
eight hexadecimal digits. To avoid clever syntax tricks and the
consequent risk of confusion and errors, forms that use explicit
string terminators are generally preferred over other
alternatives. In many contexts, symmetric paired delimiters are
easier to recognize and understand than visually-unrelated ones.
o Forms that require decoding surrogate pairs share most of the
problems that appear with encoding of UTF-8 octets and SHOULD NOT,
in general, be used.
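The preference for explicit terminators can be illustrated with a short Python sketch. Both syntaxes below are hypothetical stand-ins chosen for the example: an undelimited form with exactly four hexadecimal digits, and a delimited form with one to six digits between braces.

```python
import re

# Hypothetical undelimited form: backslash, "u", exactly four hex digits.
FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")
# Hypothetical delimited form: backslash, "x", "{", 1-6 hex digits, "}".
DELIMITED = re.compile(r"\\x\{([0-9A-Fa-f]{1,6})\}")


def decode(pattern: re.Pattern, text: str) -> str:
    """Replace every escape matched by pattern with the character it names."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# Someone intending U+1D11E writes five digits in the four-digit form.
# The decoder reads U+1D11 and leaves the trailing "E" as literal text.
misread = decode(FIXED, r"\u1D11E")

# With explicit delimiters the extent of the digit string is unambiguous.
intended = decode(DELIMITED, r"\x{1D11E}")
```

Here misread holds two characters (U+1D11 followed by the letter "E"), while intended holds the single character U+1D11E.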
5. Presentation Variants for Unicode Code Points
There are a number of different ways to represent a Unicode code
point position. No one of them appears to be "best" for all
contexts. In addition, when an escape is needed for the escape
mechanism itself, the optimal one of those might differ from one
context to another.
Some forms that are in popular use, and that might reasonably be
considered for use in a given protocol, are described below and
identified with a current-use context when feasible.
5.1. The C Programming Language: Backslash-U
The forms
\UNNNNNNNN (for any Unicode character) and
\uNNNN (for Unicode characters in plane 0)
are utilized in the C Programming Language [ISO-C] when an ASCII
escape for embedded Unicode characters is needed.
Specifically, in ABNF [RFC4234], [[anchor10: Note in Draft: The ABNF
that follows is _not_ valid because ABNF literal strings are not
case-sensitive. Once more substantive issues are resolved, this
syntax will need to be corrected, either to escape the "u" and "U"
(at least) or to note an exception from the standard ABNF rules. If
the characters are escaped, a note will be necessary that the escapes
are references to ASCII (or Unicode) character abstractions, not a
limitation to the use of those particular octets.]]
EmbeddedUnicodeChar = BMP-form / Full-form
Hex-quad            = 4*4HexDigit
BMP-form            = "\u" Hex-quad
Full-form           = "\U" 2*2Hex-quad
HexDigit            = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" /
                      "8" / "9" / "A" / "B" / "C" / "D" / "E" / "F"
This form has disadvantages that may be significant.
First, the use of a case variation (between "u" for the four digit
form and "U" for the eight digit form) may not seem natural in
environments in which upper and lower case characters are generally
considered equivalent and might be confusing to people who are not
very familiar with Latin-based alphabets (although those people might
have even more trouble reading relevant English text and
explanations). Second, the very fact that there are several
different conventions that start in \u or \U may become a source of
confusion as people make incorrect assumptions about what they are
looking at. The similarity between this convention and the
surrogate-using Java one (see Section 5.4) is a particularly
unfortunate example of this.
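The two C forms described in this section could be decoded along the following lines. This Python sketch is illustrative only: the function name decode_c_escapes is hypothetical, hexadecimal digits are accepted in either case, surrogate values and values above U+10FFFF are rejected as a choice of this sketch, and escaped backslashes are not handled.

```python
import re

# "\u" + exactly 4 hex digits, or "\U" + exactly 8 hex digits.
# The letter after the backslash is case-sensitive; the digits are not.
_C_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_c_escapes(text: str) -> str:
    """Replace \\uNNNN and \\UNNNNNNNN escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        value = int(match.group(1) or match.group(2), 16)
        if value > 0x10FFFF:
            raise ValueError("escape is beyond U+10FFFF")
        if 0xD800 <= value <= 0xDFFF:
            # Assumption of this sketch: surrogates are not characters.
            raise ValueError("escape names a surrogate code point")
        return chr(value)

    return _C_ESCAPE.sub(replace, text)
```

Because the four-digit form has no terminator, input such as \u1D11E is decoded as U+1D11 followed by a literal "E", which is the kind of misreading discussed in Section 4.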
5.2. HTML and XML
HTML and XML use the form &#xNNNN;. Like the Perl form, this form
has a clear terminator, reducing ambiguity. However, it is generally
considered ugly and awkward outside of its native HTML, XML, and
similar contexts.
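For illustration, the hexadecimal character reference form can be produced and decoded with the Python standard library. The helper to_hex_char_ref is a hypothetical name used only here; html.unescape is the standard library function for resolving character references.

```python
import html


def to_hex_char_ref(ch: str) -> str:
    """Return the &#xNNNN; hexadecimal character reference for one character."""
    return f"&#x{ord(ch):X};"


# Encode U+263A, then decode a string holding two references.
ref = to_hex_char_ref("\u263a")
decoded = html.unescape("&#x263A; &#x1D11E;")
```

The semicolon terminator lets the same syntax carry a four-digit and a five-digit code point without any change of form.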
5.3. Perl: A Hexadecimal String
Perl uses the form \x{NNN...}. The advantage of this form is that
there are explicit delimiters, resolving the issue of having
variable-length strings or using the case-change mechanism of the
C form (Section 5.1) to distinguish between Plane 0 and more general
forms.
Some other programming languages would tend to favor X'NNN...' forms
for hexadecimal strings and perhaps U'NNNN...' for Unicode-specific
strings, but those forms do not seem to be in use around the IETF.
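A protocol adopting a delimited form in the style of Perl's \x{...} also needs a way to write the escape introducer literally. The Python sketch below assumes, purely for illustration, that a literal backslash is written as two backslashes; neither that rule nor the function names come from Perl or from this document.

```python
import re


def escape_non_ascii(text: str) -> str:
    """Write non-ASCII characters as \\x{N...} and double literal backslashes."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")            # assumed rule: "\" becomes "\\"
        elif ord(ch) < 0x80:
            out.append(ch)                # ASCII passes through unchanged
        else:
            out.append(f"\\x{{{ord(ch):X}}}")
    return "".join(out)


# Either a doubled backslash or a delimited escape with 1-6 hex digits.
_TOKEN = re.compile(r"\\\\|\\x\{([0-9A-Fa-f]{1,6})\}")


def unescape(text: str) -> str:
    """Reverse escape_non_ascii."""

    def replace(match: re.Match) -> str:
        if match.group(1) is None:        # matched the doubled backslash
            return "\\"
        return chr(int(match.group(1), 16))

    return _TOKEN.sub(replace, text)
```

Scanning left to right and consuming a doubled backslash as one token keeps literal text such as a backslash followed by x{41} from being mistaken for an escape.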
5.4. Java: Escaped UTF-16
Java uses the form \uNNNN, but can represent characters outside Plane
0 (i.e., above U+FFFF) only by the use of surrogate pairs. Decoding
(or de-mapping) surrogates raises some of the same issues as the use
of UTF-8 octets discussed above. For characters in Plane 0, the Java
form is identical to the Plane 0-only C form (\uNNNN) described in
Section 5.1.
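The arithmetic that a reader or program must perform to map between a code point above U+FFFF and its surrogate pair is sketched below in Python. The function names are hypothetical; the constants are those of the UTF-16 encoding form.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def java_style_escape(code_point: int) -> str:
    """Write a code point in \\uNNNN form, using a surrogate pair if needed."""
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"
```

For example, U+1D11E is written as \uD834\uDD1E; neither half resembles the code point, so the value cannot be looked up without first undoing the mapping.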
6. Security Considerations
This document proposes a specific mechanism for encoding Unicode
characters when other considerations do not apply. Since the
encoding is unambiguous and normalization issues are not involved, it
should not introduce any security issues that are not present as a
result of simple use of non-ASCII characters, no matter how they are
encoded. The mechanism suggested should slightly lower the risks of
confusing users with encoded characters by making the identity of the
characters being used somewhat more obvious than some of the
alternatives.
7. Acknowledgments
This document was produced in response to a series of discussions
within the IETF Applications Area and as part of work on email
internationalization and internationalized domain name updates. It
is a synthesis of a large number of discussions, the comments of the
participants in which are gratefully acknowledged. The help of Mark
Davis in constructing a list of alternative presentations and
selecting among them was especially important.
Stephane Bortzmeyer, Frank Ellermann, Clive D.W. Feather, Bill
McQuillan, Simon Josefsson, and Julian Reschke provided careful
reading and some corrections and suggestions on the initial draft.
Taken together, their suggestions motivated the significant revision
of this document and its recommendations between version -00 and
version -01.
8. Change log
[[anchor14: RFC Editor: Please remove this section before
publication.]]
8.1. Changes in -01
o Corrected ABNF syntax for Hex-quad and Full-form.
9. References
9.1. Normative References
[ISO10646]
International Organization for Standardization,
"Information Technology - Universal Multiple-Octet Coded
Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
[RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 4234, October 2005.
[Unicode] The Unicode Consortium, "The Unicode Standard, Version
5.0", 2006.
(Addison-Wesley, 2006. ISBN 0-321-48091-0).
9.2. Informative References
[ASCII] American National Standards Institute (formerly United
States of America Standards Institute), "USA Code for
Information Interchange", ANSI X3.4-1968, 1968.
ANSI X3.4-1968 has been replaced by newer versions with
slight modifications, but the 1968 version remains
definitive for the Internet.
[ISO-C] International Organization for Standardization,
"Information technology -- Programming languages -- C",
ISO/IEC 9899:1999, 1999.
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
for Internationalized Domain Names in Applications
(IDNA)", RFC 3492, March 2003.
Author's Address
John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
USA
Phone: +1 617 245 1457
Email: john-ietf@jck.com
Full Copyright Statement
Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).