Unicode Format for Network Interchange
RFC 5198

Document Type RFC - Proposed Standard (March 2008; Errata)
Obsoletes RFC 698
Updates RFC 854
Was draft-klensin-net-utf8 (individual in app area)
Last updated 2013-03-02
Stream IETF
Stream WG state (None)
Consensus Unknown
Document shepherd No shepherd assigned
IESG state RFC 5198 (Proposed Standard)
Responsible AD Chris Newman
Send notices to the.map@alum.mit.edu, john-ietf@jck.com
Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track

                 Unicode Format for Network Interchange

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   The Internet today is in need of a standardized form for the
   transmission of internationalized "text" information, paralleling the
   specifications for the use of ASCII that date from the early days of
   the ARPANET.  This document specifies that format, using UTF-8 with
   normalization and specific line-ending sequences.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
     1.1.  Requirement for a Standardized Text Stream Format  . . . .  2
     1.2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Net-Unicode Definition . . . . . . . . . . . . . . . . . . . .  3
   3.  Normalization  . . . . . . . . . . . . . . . . . . . . . . . .  5
   4.  Versions of Unicode  . . . . . . . . . . . . . . . . . . . . .  5
   5.  Applicability and Stability of this Specification  . . . . . .  7
     5.1.  Use in IETF Applications Specifications  . . . . . . . . .  7
     5.2.  Unicode Versions and Applicability . . . . . . . . . . . .  7
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . .  9
   7.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 10
   Appendix A.  History and Context . . . . . . . . . . . . . . . . . 11
   Appendix B.  The ASCII NVT Definition  . . . . . . . . . . . . . . 12
   Appendix C.  The Line-Ending Problem . . . . . . . . . . . . . . . 14
   Appendix D.  A Note about Related Future Work  . . . . . . . . . . 14
   References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     Normative References . . . . . . . . . . . . . . . . . . . . . . 15
     Informative References . . . . . . . . . . . . . . . . . . . . . 16

Klensin & Padlipsky         Standards Track                     [Page 1]
RFC 5198                    Network Unicode                   March 2008

1.  Introduction

1.1.  Requirement for a Standardized Text Stream Format

   Historically, Internet protocols have been largely ASCII-based and
   references to "text" in protocols have assumed ASCII text and
   specifically text in Network Virtual Terminal ("NVT") or "Network
   ASCII" form (see Appendix A and Appendix B).  Protocols and formats
   that have moved beyond ASCII have included arrangements to
   specifically identify the character set and often the language being
   used.

   In our more internationalized world, "text" clearly no longer equates
   unambiguously to "network ASCII".  Fortunately, however, we are
   converging on Unicode [Unicode] [ISO10646] as a single international
   interchange character coding and no longer need to deal with per-
   script standards for character sets (e.g., one standard for each of
   Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
   languages that are usually considered to share a script, such as
   French, German, or Swedish).  Unfortunately, though, while it is
   certainly time to define a Unicode-based text type for use as a
   common text interchange format, "use Unicode" involves even more
   ambiguity than "use ASCII" did decades ago.

   Unicode identifies each character by an integer, called its "code
   point", in the range 0-0x10ffff.  These integers can be encoded into
   byte sequences for transmission in at least three standard and
   generally-recognized encoding forms, all of which are completely
   defined in The Unicode Standard and the documents cited below:

   o  UTF-8 [RFC3629] defines a variable-length encoding that may be
      applied uniformly to all code points.

   o  UTF-16 [RFC2781] encodes the range of Unicode characters whose
      code points are less than 65536 straightforwardly as 16-bit
      integers, and provides a "surrogate" mechanism for encoding larger
      code points in 32 bits.

   o  UTF-32 (also known as UCS-4) simply encodes each code point as a
      32-bit integer.
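
   As an illustration only (not part of the specification), the three
   encoding forms listed above can be compared with a short Python
   sketch.  The function name "encoding_forms" is hypothetical, and the
   big-endian codecs are chosen here simply to avoid emitting a byte
   order mark; neither choice comes from this document.

```python
# Illustrative sketch: encode one code point in the three encoding forms.
# "encoding_forms" is a hypothetical helper name, not defined by any standard.

def encoding_forms(code_point):
    """Return the UTF-8, UTF-16, and UTF-32 byte sequences for a code point."""
    if not (0 <= code_point <= 0x10FFFF):
        raise ValueError("code point outside the range 0-0x10ffff")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are reserved for UTF-16 and cannot be
        # encoded on their own.
        raise ValueError("surrogate code points cannot be encoded alone")
    ch = chr(code_point)
    return {
        "UTF-8": ch.encode("utf-8"),       # variable length, 1 to 4 bytes
        "UTF-16": ch.encode("utf-16-be"),  # 2 bytes, or 4 via surrogates
        "UTF-32": ch.encode("utf-32-be"),  # always 4 bytes
    }


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        forms = encoding_forms(cp)
        print("U+%04X" % cp, {name: data.hex() for name, data in forms.items()})
```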

   Older forms and nomenclature, such as the 16-bit UCS-2, are now
   strongly discouraged.
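
   The UTF-16 "surrogate" mechanism mentioned above can be sketched by
   hand as follows.  This is a minimal illustration, assuming the usual
   arithmetic from the UTF-16 definition; the helper names
   "utf16_code_units" and "fits_in_ucs2" are hypothetical.  The second
   helper reflects general background not stated in this text: UCS-2
   is a fixed 16-bit form with no surrogate mechanism, so it cannot
   carry code points above 0xffff.

```python
# Illustrative sketch of UTF-16 surrogate-pair arithmetic.
# Both function names are hypothetical and exist only for this example.

def utf16_code_units(code_point):
    """Return the list of 16-bit code units UTF-16 uses for a code point."""
    if not (0 <= code_point <= 0x10FFFF):
        raise ValueError("code point outside the range 0-0x10ffff")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded alone")
    if code_point < 0x10000:
        # Below 65536: the code point is its own single 16-bit unit.
        return [code_point]
    # Larger code points: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return [high, low]


def fits_in_ucs2(code_point):
    """True if a single 16-bit unit suffices, i.e. no surrogates needed."""
    return len(utf16_code_units(code_point)) == 1


if __name__ == "__main__":
    for cp in (0x20AC, 0x1F600):
        units = " ".join("%04X" % u for u in utf16_code_units(cp))
        print("U+%04X -> %s (fits in UCS-2: %s)" % (cp, units, fits_in_ucs2(cp)))
```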

   As with ASCII, any of these forms may be used with different line-
   ending conventions.  That flexibility can be an additional source of