UTF-9 and UTF-18 Efficient Transformation Formats of Unicode
RFC 4042

Document Type RFC - Informational (April 2005; No errata)
Last updated 2013-03-02
Stream ISE
Network Working Group                                         M. Crispin
Request for Comments: 4042                             Panda Programming
Category: Informational                                     1 April 2005

                           UTF-9 and UTF-18
              Efficient Transformation Formats of Unicode

Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   ISO-10646 defines a large character set called the Universal
   Character Set (UCS), which encompasses most of the world's writing
   systems.  The same set of codepoints is defined by Unicode, which
   further defines additional character properties and other
   implementation details.  By policy of the relevant standardization
   committees, changes to Unicode and amendments and additions to
   ISO/IEC 10646 track each other, so that the character repertoires and
   code point assignments remain in synchronization.

   The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
   are neither storage-efficient nor computation-efficient on platforms
   that use the 9-bit nonet as a natural storage unit instead of the
   8-bit octet.

   This document describes a transformation format of Unicode that takes
   advantage of the nonet so that the format will be storage and
   computation efficient.

1.  Introduction

   A number of Internet sites utilize platforms that are not based upon
   the traditional 8-bit byte or octet.  One such platform is the PDP-
   10, which is based upon a 36-bit word.  On these platforms, it is
   wasteful to represent data in octets, since 4 bits are left unused in
   each word.  The 9-bit nonet is a much more sensible representation.
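
   The arithmetic behind this claim can be checked with a short sketch.
   The helper name pack() is hypothetical and used only for
   illustration; the 36-bit word size is the one stated above.

```python
WORD_BITS = 36  # PDP-10 word size, as stated in the text


def pack(unit_bits, word_bits=WORD_BITS):
    """Return (whole units that fit in one word, bits left unused)."""
    return divmod(word_bits, unit_bits)


octets_per_word, octet_waste = pack(8)   # octets in a 36-bit word
nonets_per_word, nonet_waste = pack(9)   # nonets in a 36-bit word

print("octets:", octets_per_word, "per word,", octet_waste, "bits unused")
print("nonets:", nonets_per_word, "per word,", nonet_waste, "bits unused")
```

   Both units fit four to a word, but only the nonet fills the word
   completely.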

   Although these platforms support IETF standards, many of these
   platforms still utilize a text representation based upon the septet,

Crispin                      Informational                      [Page 1]
RFC 4042                    UTF-9 and UTF-18                1 April 2005

   which is only suitable for [US-ASCII] (although it has been used for
   various ISO 10646 national variants).

   To maximize international and multi-lingual interoperability, the IAB
   has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
   coded character set.

   Although other transformation formats of [UNICODE] exist, and
   conceivably can be used on nonet-oriented machines (most notably
   [UTF-8]), they suffer significant disadvantages:

      [UTF-8]
         requires one to three octets to represent codepoints in the
         Basic Multilingual Plane (BMP), four octets to represent
         [UNICODE] codepoints outside the BMP, and six octets to
         represent non-[UNICODE] codepoints.  When stored in nonets,
         this results in as many as four wasted bits per [UNICODE]
         character.

      [UTF-16]
         requires a hexadecet to represent codepoints in the BMP, and
         two hexadecets to represent [UNICODE] codepoints outside the
         BMP.  When stored in nonet pairs, this results in as many as
         four wasted bits per [UNICODE] character.  This transformation
         format requires complex surrogates to represent codepoints
         outside the BMP, and cannot represent non-[UNICODE] codepoints
         at all.

      [UTF-7]
         requires one to five septets to represent codepoints in the
         BMP, and as many as eight septets to represent codepoints
         outside the BMP.  When stored in nonets, this results in as
         many as sixteen wasted bits per character.  This transformation
         format requires very complex and computationally expensive
         shifting and "modified BASE64" processing, and cannot
         represent non-[UNICODE] codepoints at all.
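
   The worst-case figures quoted above can be reproduced with a short
   sketch.  It assumes that each octet or septet is stored in its own
   nonet and each hexadecet in its own nonet pair, as the text
   describes; the helper name wasted_bits() is hypothetical.

```python
NONET = 9  # bits per nonet


def wasted_bits(unit_bits, container_bits, max_units):
    """Bits left unused when each unit occupies its own container."""
    return (container_bits - unit_bits) * max_units


# UTF-8: up to four octets per Unicode character, one octet per nonet.
utf8_waste = wasted_bits(8, NONET, 4)

# UTF-16: up to two hexadecets per character, one per nonet pair.
utf16_waste = wasted_bits(16, 2 * NONET, 2)

# UTF-7: up to eight septets per character, one septet per nonet.
utf7_waste = wasted_bits(7, NONET, 8)

print(utf8_waste, utf16_waste, utf7_waste)
```

   The three results are 4, 4, and 16 bits, matching the figures given
   for [UTF-8], [UTF-16], and [UTF-7] respectively.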

   By comparison, UTF-9 uses one to two nonets to represent codepoints
   in the BMP, three nonets to represent [UNICODE] codepoints outside
   the BMP, and three or four nonets to represent non-[UNICODE]
   codepoints.  There are no wasted bits, and as the examples in this
   document demonstrate, the computational processing is minimal.
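
   The nonet counts above can be illustrated with a minimal sketch.  It
   assumes the layout defined later in this memo: the low-order 8 bits
   of each nonet carry one octet of the codepoint, and the high-order
   bit marks continuation.  The function name utf9_encode() and the
   31-bit upper bound on codepoints are assumptions made for this
   illustration, and the sketch does not reject surrogates.

```python
def utf9_encode(codepoint):
    """Encode one codepoint as a list of 9-bit integers (nonets)."""
    if not 0 <= codepoint <= 0x7FFFFFFF:  # assumed 31-bit range
        raise ValueError("codepoint out of range")
    # Number of octets needed to hold the codepoint value, at least one.
    length = max(1, (codepoint.bit_length() + 7) // 8)
    octets = codepoint.to_bytes(length, "big")
    # Set the continuation bit (0x100) on every nonet except the last.
    return [0x100 | b for b in octets[:-1]] + [octets[-1]]


for cp in (0x41, 0x391, 0x10330, 0x7FFFFFFF):
    nonets = utf9_encode(cp)
    # Nonets are conventionally written as three octal digits.
    print(hex(cp), len(nonets), " ".join(format(n, "03o") for n in nonets))
```

   Under these assumptions, a codepoint in the BMP yields one or two
   nonets, a [UNICODE] codepoint outside the BMP yields three, and a
   non-[UNICODE] codepoint yields three or four.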

   Transformation between [UTF-8] and UTF-9 is straightforward, with
   most of the complexity in the handling of [UTF-8].  It is hoped that
   future extensions to protocols such as SMTP will permit the use of
   UTF-9 in these protocols between nonet platforms without the use of
   [UTF-8] as an "on the wire" format.

   Similarly, transformation between [UNICODE] codepoints and UTF-18 is