UTF-16, an encoding of ISO 10646
RFC 2781
Document | Type |
RFC - Informational
(February 2000; No errata)
Was draft-hoffman-utf16 (individual)
|
|
---|---|---|---|
Authors | Paul Hoffman , François Yergeau | ||
Last updated | 2013-03-02 | ||
Stream | Legacy | ||
Formats | plain text html pdf htmlized bibtex | ||
Stream | Legacy state | (None) | |
Consensus Boilerplate | Unknown | ||
RFC Editor Note | (None) | ||
IESG | IESG state | RFC 2781 (Informational) | |
Telechat date | |||
Responsible AD | (None) | ||
Send notices to | (None) |
Network Working Group P. Hoffman Request for Comments: 2781 Internet Mail Consortium Category: Informational F. Yergeau Alis Technologies February 2000 UTF-16, an encoding of ISO 10646 Status of this Memo This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited. Copyright Notice Copyright (C) The Internet Society (2000). All Rights Reserved. 1. Introduction This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet, discusses MIME charset naming as described in [CHARSET-REG], and contains the registration for three MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16. 1.1 Background and motivation The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly define a coded character set (CCS), hereafter referred to as Unicode, which encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the object of this specification, is one of the standard ways of encoding Unicode character data; it has the characteristics of encoding all currently defined characters (in plane 0, the BMP) in exactly two octets and of being able to encode all other characters likely to be defined (the next 16 planes) in exactly four octets. The Unicode Standard further defines additional character properties and other application details of great interest to implementors. Up to the present time, changes in Unicode and amendments to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintain this very useful synchronism, as well as not to assign characters outside of the 17 planes accessible to UTF-16. Hoffman & Yergeau Informational [Page 1] RFC 2781 UTF-16, an encoding of ISO 10646 February 2000 The IETF policy on character sets and languages [CHARPOLICY] says that IETF protocols MUST be able to use the UTF-8 character encoding scheme [UTF-8]. Some products and network standards already specify UTF-16, making it an important encoding for the Internet. This document is not an update to the [CHARPOLICY] document, only a description of the UTF-16 encoding. 1.2 Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. Throughout this document, character values are shown in hexadecimal notation. For example, "0x013C" is the character whose value is the character assigned the integer value 316 (decimal) in the CCS. 2. UTF-16 definition UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE]. The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646]. The rest of this section summarizes the definition is simple terms. In ISO 10646, each character is assigned a number, which Unicode calls the Unicode scalar value. This number is the same as the UCS-4 value of the character, and this document will refer to it as the "character value" for brevity. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value. Serialization of these integers for transmission as a byte stream is discussed in Section 3. The rules for how characters are encoded in UTF-16 are: - Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number. - Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area). - Characters with values greater than 0x10FFFF cannot be encoded in UTF-16. Note: Values between 0xD800 and 0xDFFF are specifically reserved for use with UTF-16, and don't have any characters assigned to them. Hoffman & Yergeau Informational [Page 2] RFC 2781 UTF-16, an encoding of ISO 10646 February 2000 2.1 Encoding UTF-16 Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greaterShow full document text