draft-ullmann-net-utf-00

Internet Draft                                          R. L. Ullmann
                                         Process Software Corporation
                                                        June 12, 1992



                 NET-UTF: International character set



1  Status of this Memo

This document is an Internet Draft.  Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups.  (Note that other groups may also distribute
working documents as Internet Drafts.)

Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other
documents at any time.  It is not appropriate to use Internet Drafts
as reference material or to cite them other than as a "working draft"
or "work in progress."

Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet
Draft.

This draft expires on or before December 12, 1992.



2  Introduction

The Internet is no longer a creature of the United States, much less
of DARPA (the US Defense Advanced Research Projects Agency).  It is now
an international network, and the ability to communicate in any of the
world's languages on an equal footing is an imperative.

This draft attempts to track the development of ISO 10646 [2], a
moving target at this writing.  The reference cited below is the
second draft of 10646, for which balloting has just concluded (June
1992).  This memo may therefore change until the publication of
IS 10646.  Some of the following text refers to 10646 in the present
tense, as if it were already an IS; it should be understood in this
context.



3  Motivation

A working group (JTC1/SC2/WG2) of the ISO is currently working on the
specification of a 32-bit Universal Character Set (UCS), ISO 10646
[2], including as annex F the specification of a Universal Transfer
Format (UTF) algorithm that would convert the 32-bit codes into 1-5
octet sequences.  The UTF code is deliberately designed to be usable
with existing software.


Ullmann            DRAFT: expires December 12, 1992         [page  1]

Internet Draft   NET-UTF: International Character Set   June 12, 1992


This document is intended to facilitate the earliest possible use of
IS-10646-UTF as the new universal code for text within the Internet,
referring to it as NET-UTF, by analogy with NET-ASCII, the Internet
standard for the use of ASCII-7 within the Internet.

NET-UTF is upward-compatible with NET-ASCII.  Since "upward/downward
compatible" has been much abused, a more precise definition is in
order:

     1.  A document in 7 bit NET-ASCII can be taken to be in NET-UTF
         without altering its interpretation; the NET-UTF
         representation of the document is bit-for-bit identical to
         NET-ASCII.

     2.  A document in NET-UTF that consists only of characters
         representable by NET-ASCII, either by design or happenstance,
         is identical to the same document in NET-ASCII.


This document is itself in NET-UTF.

4  Terminology:  Octets and Characters

Since NET-UTF is a multi-octet character set, an important distinction
must be drawn between the term "octet" and the term "character".

     1.  An octet is an 8-bit datum, which may contain values 0
         through 255 decimal.

     2.  A character is a conceptual entity, such as "A" or "ö" (o
         with 2 dots over it).  Its coded representation is not part
         of the definition of "character".

     3.  A coded character set is a set of unambiguous rules that
         establish a set of characters and the relationship between
         the characters of the set and their coded representation.


Note that the same character coded in different character sets may
result in different octets.  For example, the character ö
(o-diaeresis) is code point 246. decimal in UCS.  In the variant of the
IS 646 character set used for (e.g.) Swedish it is the octet 124.; in
IS 8859-1 it is the octet 246.; and in UTF it is the 2 octets 160. 246.
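
As an informal illustration only, the following Python sketch derives
the IS 8859-1 and UTF codings of this character from its code point.
The two-octet UTF form is taken from the range table later in this
memo (code points 160. to 255. become A0 followed by the code point),
not from the IS itself, and the variable names are this example's own.

```python
# The character o-diaeresis, written as an escape so that this example
# is itself plain 7 bit ASCII.
ch = "\u00f6"

# The character is a concept; its UCS code point is an integer.
code_point = ord(ch)

# IS 8859-1 codes the character as one octet equal to the code point.
latin1_octets = ch.encode("latin-1")

# Assumption from the range table in this memo: code points 160..255
# become two octets in UTF, A0 followed by the code point value.
utf_octets = bytes([0xA0, code_point])

print(code_point, list(latin1_octets), list(utf_octets))
```

One character thus yields one octet in IS 8859-1 and two in UTF,
which is why "character" and "octet" cannot be used interchangeably.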

Particular attention must be paid to this distinction when reading
RFCs that predate 10646 (i.e.  all of them, at this writing ...).  The
RFCs often say "character" when the intention is "octet", assuming an
equivalence that, while valid at the time, is no longer valid.

This can lead directly to fruitless arguments about Original Intent.
It is more important to determine which definition is more useful in
any given case, and to refine the definition of the protocol(s)
described appropriately.


In almost all cases in the existing RFCs, "character" should be read
as "octet", and specific "character X" as "octet containing the value
assigned in ASCII-7 to the character X".



4  Description of IS-10646-UTF

Important Note:  This section and the next are not to be taken as in
any way authoritative.  Only the IS itself is authoritative.  This
description is provided only for informal reference and exposition.

IS 10646 defines (will define) a 32 bit set, UCS-4, with characters
assigned to integer code points in the range 0. to 2147483647.  (Note
that the high bit of the first octet is never set; codes with it set
may be used for any internal purpose within a device, but may not
appear in external text conforming to the standard.)

The first 256 code points are IS 8859-1 (aka "ASCII-8").

For example, LATIN CAPITAL LETTER A is 65., or 00 00 00 41 in hex.
This is the same integer code point in 7 bit ASCII, in IS 8859-1, and in
UCS-4.  In UCS-4 it contains octets (00 in this case) that "look like"
control codes to software not recognizing multi-octet codes.
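
The problem can be seen in a short Python sketch.  It assumes the
most significant octet is written first, as in the hexadecimal form
above; the function name is invented for this example.

```python
def ucs4_octets(code_point):
    """Return the 4-octet UCS-4 form, most significant octet first."""
    return code_point.to_bytes(4, "big")

octets = ucs4_octets(65)                  # LATIN CAPITAL LETTER A

# Octets below 20 hex are what 7 or 8 bit software treats as controls.
control_like = [o for o in octets if o < 0x20]

print(octets.hex(), control_like)
```

Three of the four octets are 00, which most existing software takes
to be NUL.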

The standard defines (now in second working draft) a Universal
Transfer Format, to address this problem.  Codes are mapped through an
algorithm to a 1-5 octet sequence.

Ranges of the UCS code space are mapped to ranges of UTF as follows:

        UCS-4 (decimal codes)     UTF (hexadecimal octets)

        0. to 159.                00 to 9F
        160. to 255.              A0 A0 to A0 FF
        256. to 16405.            A1 21 to F5 FF
        16406. to 233005.         F6 21 21 to FB FF FF
        233006. to 4294967295.    FC 21 21 21 21 to FF 59 3C C8 C3


The octets used in the multi-octet characters after the first are
always in the range(s) 21-7E and A0-FF, and therefore do not look like
control codes to software unaware that it is transmitting a
multi-octet code set.
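
The mapping can be sketched in Python as follows.  This is an informal
reconstruction from the range table and the octet ranges given above,
not a transcription of annex F.  In particular the function t, which
places the values 0 to 189 onto the octets 21-7E and A0-FF in
ascending order, is an assumption; it does reproduce the UTF column of
the tables in this memo for the values tried.  All names are this
example's own.

```python
BASE = 190                     # 94 octets in 21-7E plus 96 in A0-FF


def t(z):
    """Map 0..189 onto the octets 21-7E and A0-FF, in ascending order."""
    return z + 0x21 if z < 94 else z + 0x42


def net_utf_encode_one(code):
    """Encode one 32 bit code as a sequence of 1, 2, 3 or 5 octets."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code out of range")
    if code < 160:
        return bytes([code])                       # 00 to 9F
    if code < 256:
        return bytes([0xA0, code])                 # A0 A0 to A0 FF
    if code < 16406:
        y = code - 256                             # A1 21 to F5 FF
        return bytes([0xA1 + y // BASE, t(y % BASE)])
    if code < 233006:
        y = code - 16406                           # F6 21 21 to FB FF FF
        return bytes([0xF6 + y // BASE**2,
                      t(y // BASE % BASE),
                      t(y % BASE)])
    y = code - 233006                              # FC 21 21 21 21 and up
    return bytes([0xFC + y // BASE**4,
                  t(y // BASE**3 % BASE),
                  t(y // BASE**2 % BASE),
                  t(y // BASE % BASE),
                  t(y % BASE)])


def net_utf_encode(text):
    """Encode a Python string, one code point at a time."""
    return b"".join(net_utf_encode_one(ord(c)) for c in text)


print(net_utf_encode_one(2147483647).hex())
```

Note that no code is mapped to a 4 octet sequence; the lengths in use
are 1, 2, 3 and 5.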

There are no shifts or locking shifts, a major technical advantage
over the previous draft of 10646.  Any control character (including
SPACE) thus provides a resynchronization point if an error occurs.
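
A receiver might use this property as in the following Python sketch.
It assumes, from the octet ranges given above, that the octets 00-20
and 7F-9F never occur inside a multi-octet code; the function names
are hypothetical.

```python
def is_standalone(octet):
    """True for octets that can never be part of a multi-octet code.

    Lead octets of multi-octet codes are A0-FF and trailing octets are
    21-7E or A0-FF, so 00-20 and 7F-9F always stand alone.
    """
    return octet <= 0x20 or 0x7F <= octet <= 0x9F


def resync(data, pos):
    """Return the index of the first stand-alone octet at or after pos.

    Returns len(data) if there is none.  Decoding can safely restart
    at the returned index.
    """
    while pos < len(data) and not is_standalone(data[pos]):
        pos += 1
    return pos


damaged = b"\xa1\x21 \xa0\xf6"     # suppose the lead octet A1 is suspect
print(resync(damaged, 1))
```

An ordinary letter such as "a" (61) is not a safe restart point, since
the same octet may be the trailing part of a longer code.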

UTF also provides an advantage of compactness in most cases,
especially when small amounts of text in various languages are
intermixed, with the majority in a Latin language.  (One might say
"English", but that might lead to complaints of parochialism.  So one
doesn't.)
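
The saving is easy to estimate from the range table alone.  The
following Python sketch counts octets for a short Latin-script phrase;
the phrase and the function names are merely illustrative.

```python
def utf_length(code):
    """Number of octets UTF uses for one code, per the range table."""
    if code < 160:
        return 1
    if code < 16406:
        return 2
    if code < 233006:
        return 3
    return 5                      # there is no 4 octet form


def sizes(text):
    """Return (octets in UTF, octets in UCS-4) for a string."""
    return sum(utf_length(ord(c)) for c in text), 4 * len(text)


# Ten characters, two of them outside 7 bit ASCII.
print(sizes("na\u00efve caf\u00e9"))
```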


It also facilitates further compression with general purpose text
compression techniques, since the most useful statistics are found in
the tri-octet range, exactly where they are in NET-ASCII text, almost
regardless of the language used.  (The word "tri-octet" has not
appeared in print before to this author's knowledge, but "trigraph",
the usual term used in cryptography and data compression research,
would be a misnomer here.)



5  Outline of code table

The following is an outline of the (DRAFT) 10646 code table.  As with
the information in the previous section, this may change up to the
issuance of the IS, and is not authoritative; refer to the IS.

Each block is described by its name in the (DRAFT) standard (CJK means
Chinese-Japanese-Korean), the range of code points in the block in
decimal, and the first and last code points in UTF in hexadecimal.

The last column is the actual UTF code for the first and last points
in the block.  These may not (and in many cases do not) correspond to
assigned characters; even if this document is displayed with a
10646-UTF rendering process, it will not show anything useful for
those code points.

ISO-646 IRV                        32.    20
                                  126.    7E                  ~

Latin-1 Supplement                160.    A0 A0                 
                                  255.    A0 FF                ÿ

Extended Latin-A                  256.    A1 21               ¡!
                                  383.    A1 C1               ¡Á

Extended Latin-B                  384.    A1 C2               ¡Â
                                  591.    A2 D3               ¢Ó

IPA Extensions                    592.    A2 D4               ¢Ô
                                  687.    A3 54               £T

Spacing modifier letters          688.    A3 55               £U
                                  767.    A3 C5               £Å

Combining diacritical marks       768.    A3 C6               £Æ
                                  879.    A4 56               ¤V

Greek                             880.    A4 57               ¤W
                                 1023.    A5 28               ¥(

Cyrillic                         1024.    A5 29               ¥)
                                 1279.    A6 6A               ¦j

Armenian                         1328.    A6 BC               ¦¼
                                 1423.    A7 3C               §<


Hebrew                           1424.    A7 3D               §=
                                 1535.    A7 CD               §Í

Arabic                           1536.    A7 CE               §Î
                                 1791.    A9 30               ©0

Devanagari                       2304.    AB D6               «Ö
                                 2431.    AC 76               ¬v

Bengali                          2432.    AC 77               ¬w
                                 2559.    AD 38               8

Gurmukhi                         2560.    AD 39               9
                                 2687.    AD D9               Ù

Gujarati                         2688.    AD DA               Ú
                                 2815.    AE 7A               ®z

Oriya                            2816.    AE 7B               ®{
                                 2943.    AF 3C               ¯<

Tamil                            2944.    AF 3D               ¯=
                                 3071.    AF DD               ¯Ý

Telugu                           3072.    AF DE               ¯Þ
                                 3199.    B0 7E               °~

Kannada                          3200.    B0 A0               ° 
                                 3327.    B1 40               ±@

Malayalam                        3328.    B1 41               ±A
                                 3455.    B1 E1               ±á

Thai                             3584.    B2 A4               ²¤
                                 3711.    B3 44               ³D

Lao                              3712.    B3 45               ³E
                                 3839.    B3 E5               ³å

Tibetan                          4096.    B5 49               µI
                                 4191.    B5 C9               µÉ

Georgian                         4256.    B6 2B               ¶+
                                 4351.    B6 AB               ¶«

Additional Extended Latin        7680.    C8 2F               È/
                                 7935.    C9 70               Ép

Greek Extensions                 7936.    C9 71               Éq
                                 8191.    CA D3               ÊÓ

General Punctuation              8192.    CA D4               ÊÔ
                                 8303.    CB 64               Ëd



Superscripts and Subscripts      8304.    CB 65               Ëe
                                 8351.    CB B5               Ëµ

Currency Symbols                 8352.    CB B6               Ë¶
                                 8399.    CB E5               Ëå

Combining Diacritical Marks      8400.    CB E6               Ëæ
        For Symbols              8447.    CC 36               Ì6

Letterlike Symbols               8448.    CC 37               Ì7
                                 8527.    CC A7               Ì§

Number Forms                     8528.    CC A8               Ì¨
                                 8591.    CC E7               Ìç

Arrows                           8592.    CC E8               Ìè
                                 8703.    CD 78               Íx

Mathematical Operators           8704.    CD 79               Íy
                                 8959.    CE DB               ÎÛ

Miscellaneous Technical          8960.    CE DC               ÎÜ
                                 9215.    D0 3E               Ð>

Control Pictures                 9216.    D0 3F               Ð?
                                 9279.    D0 7E               Ð~

Optical Character Recognition    9280.    D0 A0               Ð 
                                 9311.    D0 BF               Ð¿

Enclosed Alphanumerics           9312.    D0 C0               ÐÀ
                                 9471.    D1 A1               Ñ¡

Box Drawing                      9472.    D1 A2               Ñ¢
                                 9599.    D2 42               ÒB

Block Elements                   9600.    D2 43               ÒC
                                 9631.    D2 62               Òb

Geometric Shapes                 9632.    D2 63               Òc
                                 9727.    D2 E3               Òã

Miscellaneous Dingbats           9728.    D2 E4               Òä
                                 9983.    D4 46               ÔF

Dingbats                         9984.    D4 47               ÔG
                                10175.    D5 48               ÕH

CJK Symbols and Punctuation     12288.    E0 5F               à_
                                12351.    E0 BF               à¿

Hiragana                        12352.    E0 C0               àÀ
                                12447.    E1 40               á@

Katakana                        12448.    E1 41               áA
                                12543.    E1 C1               áÁ

Bopomofo                        12544.    E1 C2               áÂ
                                12591.    E1 F1               áñ

Hangul Jamo                     12592.    E1 F2               áò
                                12687.    E2 72               âr

CJK Miscellaneous               12688.    E2 73               âs
                                12703.    E2 A3               â£

Combining Hangul Jamo           12704.    E2 A4               â¤
                                12799.    E3 24               ã$

Enclosed CJK Letters            12800.    E3 25               ã%
        and Months              13055.    E4 66               äf

CJK Compatibility Words         13056.    E4 67               äg
        and Hours               13183.    E5 28               å(

Hangul                          13312.    E5 CA               åÊ
                                15663.    F2 32               ò2

Supplementary Hangul            15872.    F3 45               óE
                                17807.    F6 28 68            ö(h

Old Hangul                      17920.    F6 28 FA            ö(ú
                                19599.    F6 31 DB            ö1Û

CJK Unified Ideograms           19968.    F6 33 D0            ö3Ð
                                40959.    F6 C3 4C            öÃL

Private Use Area                57344.    F7 3A 79            ÷:y
                                63487.    F7 5A D9            ÷ZÙ

CJK Compatibility Ideographs    63744.    F7 5C 3D            ÷\=
                                64255.    F7 5E E1            ÷^á

Alphabetic Presentation Forms   64256.    F7 5E E2            ÷^â
                                64335.    F7 5F 52            ÷_R

Arabic Presentation Forms       64592.    F7 60 B6            ÷`¶
                                65023.    F7 62 E9            ÷bé

CJK Compatibility Forms         65072.    F7 63 3B            ÷c;
                                65103.    F7 63 5A            ÷cZ

Small Form Variants             65104.    F7 63 5B            ÷c[
                                65135.    F7 63 7A            ÷cz

Arabic Presentation Forms-B     65136.    F7 63 7B            ÷c{
                                65278.    F7 64 4B            ÷dK

Halfwidth and Fullwidth Forms   65280.    F7 64 4D            ÷dM
                                65519.    F7 65 7E            ÷e~


Specials                        65520.    F7 65 A0            ÷e 
                                65533.    F7 65 AD            ÷e

Private Use Planes           14680064.    FC 23 35 46 3D      ü#5F=
                             16777215.    FC 23 6F 57 D7      ü#oW×

Private Use Groups         1610612736.    FD 4D D6 E4 D8      ýMÖäØ
                           2147483647.    FD BD 2B B9 40      ý½+¹@
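
Rows of this table can be checked mechanically.  The following Python
sketch decodes one code sequence back to its code point.  It assumes
the arithmetic implied by the range table in the previous section (190
usable trailing octets, taken in ascending order) and, like that
section, is not authoritative.  The names u and net_utf_decode_one are
invented here.

```python
BASE = 190                     # 94 octets in 21-7E plus 96 in A0-FF


def u(octet):
    """Map a trailing octet back to 0..189 (21-7E first, then A0-FF)."""
    if 0x21 <= octet <= 0x7E:
        return octet - 0x21
    if 0xA0 <= octet <= 0xFF:
        return octet - 0x42
    raise ValueError("octet %02X cannot follow a lead octet" % octet)


def net_utf_decode_one(octets):
    """Decode exactly one code sequence and return its code point."""
    first, rest = octets[0], list(octets[1:])
    if first < 0xA0:
        want, value = 0, first
    elif first == 0xA0:
        want, value = 1, 0
    elif first <= 0xF5:
        want, value = 1, 256 + (first - 0xA1) * BASE
    elif first <= 0xFB:
        want, value = 2, 16406 + (first - 0xF6) * BASE**2
    else:
        want, value = 4, 233006 + (first - 0xFC) * BASE**4
    if len(rest) != want:
        raise ValueError("wrong number of octets for this lead octet")
    if first == 0xA0:
        if rest[0] < 0xA0:
            raise ValueError("A0 must be followed by A0-FF")
        return rest[0]
    # The trailing octets are base 190 digits, most significant first.
    for i, octet in enumerate(rest):
        value += u(octet) * BASE ** (want - 1 - i)
    return value


print(net_utf_decode_one(bytes.fromhex("F62868")))
```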




6  Notes on particular Internet protocols

Most of the common Internet application protocols, as implemented by
commercial software, already provide for the use of 8 bit characters,
usually to permit use of IS 8859 variants for alphabetic languages.

Public domain software is often lacking in this area, not having been
subjected to international commercial pressures, and usually being
hacked by the user to handle character sets other than 7 bit ASCII
where necessary.  Which is almost everywhere:  the only two languages
that can be written properly with ASCII-7 are Hawaiian and Swahili
(when written in the Latin script; the Arabic script is also used).
English cannot be (Rôle, cliché, coöperate, façade; although the
spelling variants without the diacritics are considered acceptable),
and one wants to spell proper names properly:  Ångström, Mötley
Crüe.  (In that last case, one concludes that the spelling was chosen
for appearance; the implied pronunciation is awesome.)

Most commercial implementations also permit either CRLF or LF as a
line terminator, even when standards dictate CRLF.  It is perhaps
useful to move toward the simple use of LF as line terminator in
NET-UTF.  This would be consistent with the de facto text standard in
NFS, derived from Unix.
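
A receiver that accepts either terminator might be sketched in Python
as follows.  This is only an illustration of the lenient behaviour
described above, with an invented function name; it relies on the
observation that the octets 0D and 0A never occur inside a multi-octet
code, so the split can be made before any character decoding.

```python
def split_lines(data):
    """Split octets into lines, accepting either CRLF or a bare LF."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                  # the text ended with a terminator
    # Drop the CR of a CRLF pair; a bare LF line is left unchanged.
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]


print(split_lines(b"one\r\ntwo\nthree\r\n"))
```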

The following sections are simply observations on the major
application protocols, without any attempt to be comprehensive.  (IHR
is the Internet Host Requirements, RFC 1123).



6.1  TELNET

TELNET (the Internet remote login protocol) has very specific
requirements for the use of CR and LF, best explained in IHR, 3.3.1.

It also (IHR, 3.4.1) specifies character set transparency, at least
for 7 bit characters.  Most implementations actually already provide 8
bit transparency, whether "binary mode" is negotiated or not.  If the
binary mode is negotiated, this serves to turn off newline
interpretation and other control interpretation (if any), not to
enable 8 bit transmission.


While the default assumption should now be NET-UTF, the actual set
used may be entirely a private issue; note that TELNET servers and
clients may have (e.g.) knowledge of terminal types.



6.2  FTP

FTP (Internet file transfer protocol) uses two separate session
connections, one for issuing commands and responses, the other
(re-)established for each file transferred, and carrying only data.



6.2.1  Control connection

NET-UTF is only to be used in path names of files to be transferred.
This is an extension of the specifications of RFC 959 and IHR 4.1.4.1,
which specify any 7 bit character other than CR and LF.  Note that
(for example) the BSD Unix file system allows any 8 bit character
(nominally IS 8859-1), and the FTP implementation permits these names
to be used.



6.2.2  Data connection

When the data is being transferred in text mode, most existing
implementations permit 8 bit characters, and accept either LF or CRLF
as line terminators.



6.3  SMTP

6.3.1  Protocol (RFC 821)

In the SMTP (Internet mail transfer protocol) itself (distinguished
from the message header) NET-UTF does not make any real difference,
since electronic mail addresses consist of a restricted set of
characters.  The other parts of the command syntax are entirely
keywords and fixed elements.

NET-UTF might be used in the text of responses (which are not
interpreted by the protocol), particularly when giving a
multi-lingual response.

The only issue of actual protocol failure in data transmission might
occur when an octet of value (hex) AE is the only content of a line;
if the data is being "stripped" to 7 bit (i.e.  by non-NET-UTF
compliant software) this might look like the dot (hex 2E) used to end
the message.


This is not considered a problem, since AE is never used by itself in
IS 10646-UTF.  It is always either the first octet of a 2 octet
character, or a subsequent octet in a multi-octet character; the
problem does not arise, except perhaps by intentional mischief.
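
The failure in question can be reproduced in a few lines of Python.
The sketch below models "stripping" as clearing the high bit of every
octet, which is an assumption about how such software behaves; the
function names are hypothetical.

```python
def strip_to_7_bit(line):
    """Clear the high bit of every octet, as stripping software might."""
    return bytes(o & 0x7F for o in line)


def could_end_message_early(line):
    """True if a line that is not a lone dot becomes one when stripped."""
    return line != b"." and strip_to_7_bit(line) == b"."


print(could_end_message_early(b"\xae"))        # AE alone on a line
print(could_end_message_early(b"\xa0\xae"))    # AE as a trailing octet
```

Only a line consisting of the single octet AE is at risk, and such a
line is not a complete code in UTF.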

No negotiation of 8 bit transmission is done; this would simply
introduce a presently non-existent failure mode.  Communities of users
that need to use 8 bit character sets already are using the protocol
with 8 bit transmission.  More importantly:  Internet mail transfer
agents do not now have license to modify the content of messages in
any way (although public domain software often does, to the detriment
of everyone); it would be a serious regression to allow any such
license.



6.3.2  Internet message headers (RFC 822)

The use of NET-UTF in message headers is effectively already
implemented, with the notable exception of the large quantity of
software that does not even attempt to comply with the existing
standard, to which no concession need be made.  (The existing standard
being a decade old at this writing.)

Where message header fields contain arbitrary text, either there are
no restrictions (e.g.  the subject field) or a well defined
combination of single "character" and string quoting is used.  Present
implementations consider this to be octet-level quoting (i.e.  given
that there has been no distinction between "octet" and "character"),
and this interpretation is used, reading "octet" for "character" in
the specification.

The de-escaped field content can then be interpreted as a NET-UTF
string, to be rendered as any other text.
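
One reading of this octet-level interpretation is sketched below in
Python for the quoted-string case.  It is an illustration under that
reading, not a statement of what RFC 822 requires, and the function
name is invented.  Under this reading an octet 5C (backslash) or 22
(double quote) that occurs as a trailing octet of a multi-octet code
would presumably have to be quoted by the sender like any other.

```python
def unquote_octets(quoted):
    """Remove quoted-string quoting, treating the input purely as octets."""
    if len(quoted) < 2 or quoted[:1] != b'"' or quoted[-1:] != b'"':
        raise ValueError("not a quoted string")
    body = quoted[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == 0x5C:              # backslash quotes the next octet
            i += 1
            if i == len(body):
                raise ValueError("dangling backslash")
        out.append(body[i])
        i += 1
    return bytes(out)


# F7 5C 3D appears in the code table; its middle octet is a backslash,
# so a sender following this reading would transmit F7 5C 5C 3D.
print(unquote_octets(b'"\xf7\x5c\x5c\x3d"').hex())
```

The result is then handed to the NET-UTF decoder as a whole; no
knowledge of character boundaries is needed while removing the
quoting.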



6.3.3  SMTP/X.25 (RFC 1090)

The specification for the use of SMTP directly on X.25 and the packet
mode of the ISDN should now refer to IS 10646-UTF instead of the
reference to IS 8859-1.



6.4  NFS

The network file system does not normally place any interpretation on
the content of files when used in a Unix-only environment, but often
implementations on other operating systems must do some interpretation
or conversion of "text" files.


Taking all the existing 7 bit ASCII files to be NET-UTF is a powerful
extension of the present day environment, and should provide an
immediately effective transition to a universally useful network data
base.



7  References

[1] David H. Crocker.  Standard for the Format of ARPA Internet Text
    Messages.  RFC 822, University of Delaware, August, 1982.

[2] International Organization for Standardization.  Information
    technology -- Universal Coded Character Set (UCS).  ISO/IEC DIS
    10646-1.2, ISO, 26 December, 1991.  (Draft, ballot just concluded
    at this writing)

[3] Jon Postel.  Simple Mail Transfer Protocol.  RFC 821, USC
    Information Sciences Institute, August, 1982.

[4] Robert L. Ullmann.  SMTP on X.25.  RFC 1090, Prime Computer,
    February, 1989.



8  Author's Address


Robert Ullmann
Process Software Corporation
959 Concord Street
Framingham, MA 01701
USA

Phone: +1 508 879 6994 x226
Email: Ariel@Process.COM


This draft expires on or before December 12, 1992.
