Network Working Group M. Crispin
Request for Comments: 4042 Panda Programming
Category: Informational 1 April 2005
UTF-9 and UTF-18
Efficient Transformation Formats of Unicode
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2005).
Abstract
ISO-10646 defines a large character set called the Universal
Character Set (UCS), which encompasses most of the world's writing
systems. The same set of codepoints is defined by Unicode, which
further defines additional character properties and other
implementation details. By policy of the relevant standardization
committees, changes to Unicode and amendments and additions to
ISO-10646 track each other, so that the character repertoires and
code point assignments remain in synchronization.
The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
are inefficient in both storage and computation on platforms that use
the 9-bit nonet as a natural storage unit instead of the 8-bit octet.
This document describes a transformation format of Unicode that takes
advantage of the nonet to be efficient in both storage and
computation.
1. Introduction
A number of Internet sites utilize platforms that are not based upon
the traditional 8-bit byte or octet. One such platform is the
PDP-10, which is based upon a 36-bit word. On these platforms, it is
wasteful to represent data in octets, since 4 bits are left unused in
each word. The 9-bit nonet is a much more sensible representation.
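The arithmetic behind this claim can be checked with a short sketch.
The helper name `packing` is hypothetical and exists only for
illustration; the only figures taken from the text are the 36-bit word
and the 8-bit and 9-bit unit sizes.

```python
def packing(word_bits, unit_bits):
    """Return (whole units per word, unused bits per word)."""
    return divmod(word_bits, unit_bits)


# Octets in a 36-bit word: four fit, four bits are left over.
print(packing(36, 8))  # (4, 4)

# Nonets in a 36-bit word: four fit, nothing is left over.
print(packing(36, 9))  # (4, 0)
```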
Although these platforms support IETF standards, many of these
platforms still utilize a text representation based upon the septet,
which is only suitable for [US-ASCII] (although it has been used for
various ISO 646 national variants).
To maximize international and multi-lingual interoperability, the IAB
has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
coded character set.
Although other transformation formats of [UNICODE] exist, and
conceivably can be used on nonet-oriented machines (most notably
[UTF-8]), they suffer significant disadvantages:
[UTF-8]
requires one to three octets to represent codepoints in the
Basic Multilingual Plane (BMP), four octets to represent
[UNICODE] codepoints outside the BMP, and six octets to
represent non-[UNICODE] codepoints. When stored in nonets,
this results in as many as four wasted bits per [UNICODE]
character.
[UTF-16]
requires a hexadecet to represent codepoints in the BMP, and
two hexadecets to represent [UNICODE] codepoints outside the
BMP. When stored in nonet pairs, this results in as many as
four wasted bits per [UNICODE] character. This transformation
format requires complex surrogates to represent codepoints
outside the BMP, and cannot represent non-[UNICODE] codepoints
at all.
[UTF-7]
requires one to five septets to represent codepoints in the
BMP, and as many as eight septets to represent codepoints
outside the BMP. When stored in nonets, this results in as
many as sixteen wasted bits per character. This transformation
format requires very complex and computationally expensive
shifting and "modified BASE64" processing, and cannot
represent non-[UNICODE] codepoints at all.
By comparison, UTF-9 uses one to two nonets to represent codepoints
in the BMP, three nonets to represent [UNICODE] codepoints outside
the BMP, and three or four nonets to represent non-[UNICODE]
codepoints. There are no wasted bits, and as the examples in this
document demonstrate, the computational processing is minimal.
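The nonet counts quoted above can be reproduced with a minimal sketch.
It assumes the encoding rule that this memo defines in a later section
(not shown here): the low-order 8 bits of each nonet carry one octet of
the codepoint, most significant octet first with leading zero octets
omitted, and the high-order bit is set on every nonet except the last
to indicate continuation. The function name `utf9_encode` is
hypothetical.

```python
def utf9_encode(cp):
    """Encode one codepoint as a list of 9-bit integers (nonets).

    Assumed rule: split the codepoint into octets, most significant
    first, drop leading zero octets, and set bit 8 (octal 400) on
    every nonet except the last to flag continuation.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("codepoint out of range")
    # Keep at least one octet so that U+0000 encodes as a single nonet.
    octets = cp.to_bytes(4, "big").lstrip(b"\x00") or b"\x00"
    nonets = [0o400 | octet for octet in octets[:-1]]
    nonets.append(octets[-1])
    return nonets


# Compare unit counts: BMP, non-BMP, and a non-Unicode codepoint.
for cp in (0x41, 0x20AC, 0x1F600, 0x1000000):
    # Python's UTF-8 codec only covers U+0000..U+10FFFF.
    utf8_len = len(chr(cp).encode("utf-8")) if cp <= 0x10FFFF else None
    print(f"U+{cp:04X}: UTF-9 nonets={len(utf9_encode(cp))}, "
          f"UTF-8 octets={utf8_len}")
```

Under this assumed rule, U+0391 encodes as the two nonets 403 221
(octal), and every bit of every nonet is either data or the
continuation flag.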
Transformation between [UTF-8] and UTF-9 is straightforward, with
most of the complexity in the handling of [UTF-8]. It is hoped that
future extensions to protocols such as SMTP will permit the use of
UTF-9 in these protocols between nonet platforms without the use of
[UTF-8] as an "on the wire" format.
Similarly, transformation between [UNICODE] codepoints and UTF-18 is
also quite simple. Although (like UCS-2) UTF-18 only represents a
subset of the available [UNICODE] codepoints, it encompasses the
non-private codepoints that are currently assigned in [UNICODE].
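As a rough illustration of how simple that transformation is, the
sketch below assumes the mapping that this memo defines in a later
section (not shown here): codepoints U+0000 through U+2FFFF are copied
unchanged, codepoints U+E0000 through U+EFFFF are moved to the range
0x30000 through 0x3FFFF, and everything else is not representable. The
function name `utf18_encode` is hypothetical.

```python
def utf18_encode(cp):
    """Map one codepoint to a single 18-bit UTF-18 value.

    Assumed mapping: planes 0-2 are copied as-is and plane 14 is
    relocated to 0x30000-0x3FFFF; other codepoints are rejected.
    """
    if 0x0 <= cp <= 0x2FFFF:
        return cp
    if 0xE0000 <= cp <= 0xEFFFF:
        return cp - 0xB0000
    raise ValueError(f"U+{cp:04X} is not representable in UTF-18")


print(hex(utf18_encode(0x20AC)))   # 0x20ac
print(hex(utf18_encode(0xE0001)))  # 0x30001
```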