ASCII Escaping of Unicode Characters
RFC 5137
Section | Field | Value |
---|---|---|
Document | Type | RFC - Best Current Practice (February 2008; Errata). Also known as BCP 137. Was draft-klensin-unicode-escapes (individual in app area) |
Document | Last updated | 2016-12-16 |
Document | Stream | IETF |
Stream | WG state | (None) |
Stream | Document shepherd | No shepherd assigned |
IESG | IESG state | RFC 5137 (Best Current Practice) |
IESG | Consensus Boilerplate | Unknown |
IESG | Responsible AD | Chris Newman |
IESG | Send notices to | (None) |
Network Working Group
J. Klensin
Request for Comments: 5137
February 2008
BCP: 137
Category: Best Current Practice

ASCII Escaping of Unicode Characters

Status of This Memo

This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. Distribution of this memo is unlimited.

Abstract

There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII coding, the traditional escape has been either the decimal or hexadecimal numeric value of the character, written in a variety of different ways. The move to Unicode, where characters occupy two or more octets and may be coded in several different forms, has further complicated the question of escapes. This document discusses some options now in use and discusses considerations for selecting one for use in new IETF protocols, and protocols that are now being internationalized.

Table of Contents

1. Introduction
   1.1. Context and Background
   1.2. Terminology
   1.3. Discussion List
2. Encodings that Represent Unicode Code Points: Code Position versus UTF-8 or UTF-16 Octets
3. Referring to Unicode Characters
4. Syntax for Code Point Escapes
5. Recommended Presentation Variants for Unicode Code Point Escapes
   5.1. Backslash-U with Delimiters
   5.2. XML and HTML
6. Forms that Are Normally Not Recommended
   6.1. The C Programming Language: Backslash-U
   6.2. Perl: A Hexadecimal String
   6.3. Java: Escaped UTF-16
7. Security Considerations
8. Acknowledgments
9. References
   9.1. Normative References
   9.2. Informative References
Appendix A. Formal Syntax for Forms Not Recommended
   A.1. The C Programming Language Form
   A.2. Perl Form
   A.3. Java Form

1. Introduction

1.1. Context and Background

There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII [ASCII] coding, the traditional escape has been either the decimal or hexadecimal numeric value of the character, written in a variety of different ways. For example, in different contexts, we have seen %dNN or %NN for the decimal form, and %NN, %xNN, X'nn', and %X'NN' for the hexadecimal form. "%NN" has become popular in recent years to represent a hexadecimal value without further qualification, perhaps as a consequence of its use in URLs and their prevalence. There are even some applications around in which octal forms are used and, while they do not generalize well, the MIME Quoted-Printable and Encoded-word forms can be thought of as yet another set of escapes. So, even for the fairly simple cases of ASCII and standards built by extending ASCII, such as the ISO 8859 family, we have been living with several different escaping forms, each the result of some history.
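As an illustration of how many spellings the same single-octet value has acquired, the following Python sketch renders one ISO 8859-1 character in each of the forms listed above. The function name `ascii_escape_forms` and the dictionary labels are hypothetical and chosen only for this example, and the choice of upper or lower case for the hexadecimal digits is an assumption, since the text does not fix it. The last lines use two standard-library encoders (`urllib.parse.quote` and `quopri.encodestring`) to show the URL-style and Quoted-Printable escapes mentioned in the paragraph.

```python
import quopri
from urllib.parse import quote


def ascii_escape_forms(ch):
    """Return several historical escape spellings for one single-octet character.

    Hypothetical helper for illustration; not defined by any specification.
    """
    value = ord(ch)
    if value > 0xFF:
        raise ValueError("sketch handles single-octet characters only")
    return {
        "decimal (%dNN)": f"%d{value}",
        "hex (%NN)": f"%{value:02X}",
        "hex (%xNN)": f"%x{value:02X}",
        "hex (X'nn')": f"X'{value:02x}'",
        "hex (%X'NN')": f"%X'{value:02X}'",
    }


# LATIN SMALL LETTER E WITH ACUTE, a single octet (hex E9) in ISO 8859-1.
e_acute = "\u00e9"
for label, text in ascii_escape_forms(e_acute).items():
    print(f"{label:16} {text}")

# Standard-library encoders for two of the escape families named in the text.
url_form = quote(e_acute, encoding="latin-1")             # URL-style %NN
qp_form = quopri.encodestring(e_acute.encode("latin-1"))  # MIME Quoted-Printable
print(url_form, qp_form.decode("ascii"))
```

Every spelling carries the same numeric value; only the surrounding syntax differs, which is why a receiver has to know in advance which convention is in use.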
When one moves to Unicode [Unicode] [ISO10646], where characters occupy two or more octets and may be coded in several different forms, the question of escapes becomes even more complicated. Unicode represents characters as code points: numeric values from 0 to hex 10FFFF. When referencing code points in flowing text, they are represented using the so-called "U+" notation, as values from U+0000 to U+10FFFF. When serialized into octets, these code points