Internet Draft R. L. Ullmann
Process Software Corporation
June 12, 1992
NET-UTF: International character set
1 Status of this Memo
This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet Drafts
as reference material or to cite them other than as a "working draft"
or "work in progress."
Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet
Draft.
This draft expires on or before December 12, 1992.
2 Introduction
The Internet is no longer a creature of the United States, much less
of DARPA (the US Defense Advanced Research Projects Agency). It is now
an international network, and the ability to communicate in any of the
world languages on an equal footing is an imperative.
This draft attempts to track the development of ISO 10646 [2], a
moving target at this writing. The reference citation below is to the
2nd 10646 draft, for which balloting has just concluded (June 1992).
This memo is therefore expected to change until the publication of
IS 10646. Some of the following text refers to 10646 in the present
tense, as if it were already an IS; it should be understood in this
context.
3 Motivation
A working group (JTC1/SC2/WG2) of the ISO is currently working on
specification of a 32-bit Universal Character Set (UCS), ISO 10646
[2], including as annex F the specification of a Universal Transfer
Format (UTF) algorithm that would convert the 32-bit codes into 1-5
octet sequences. The UTF code is deliberately designed to be useable
with existing software.
This document is intended to facilitate the earliest possible use of
IS-10646-UTF as the new universal code for text within the Internet,
referring to it as NET-UTF, by analogy with NET-ASCII, the Internet
standard for the use of ASCII-7 within the Internet.
NET-UTF is upward-compatible with NET-ASCII. Since "upward/downward
compatible" has been much abused, a more precise definition is in
order:
1. A document in 7 bit NET-ASCII can be taken to be in NET-UTF
without altering its interpretation; the NET-UTF
representation of the document is bit-for-bit identical to
NET-ASCII.
2. A document in NET-UTF that consists only of characters
representable by NET-ASCII, either by design or happenstance,
is identical to the same document in NET-ASCII.
This document is itself in NET-UTF.
4 Terminology: Octets and Characters
Since NET-UTF is a multi-octet character set, an important distinction
must be drawn between the term "octet" and the term "character".
1. An octet is an 8-bit datum, which may contain values 0
through 255 decimal.
2. A character is a conceptual entity, such as "A" or "ö" (o
with 2 dots over it). Its coded representation is not part
of the definition of "character".
3. A coded character set is a set of unambiguous rules that
establish a set of characters and the relationship between
the characters of the set and their coded representation.
Note that the same character coded in different character sets may
result in different octets. For example, take the character ö
(o-diaeresis), code point 246. decimal in UCS: in the variant of the
IS 646 character set used for (e.g.) Swedish it is the octet 124., in
IS 8859-1 it is the octet 246., and in UTF it is the 2 octets 160. 246.
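The distinction can be illustrated with a minimal Python sketch. It
relies on Python's built-in "latin-1" codec for IS 8859-1; for UTF it
applies only the rule that a code point from 160. to 255. is sent as
the octet A0 followed by the code point. The helper name
utf_latin1_supplement is hypothetical and is not part of any standard.

```python
def utf_latin1_supplement(code: int) -> bytes:
    """UTF octets for a code point in 160..255 (hypothetical helper)."""
    if not 160 <= code <= 255:
        raise ValueError("only code points 160..255 are handled here")
    return bytes([0xA0, code])

char = "\u00f6"                         # the character o-diaeresis
code = ord(char)                        # its UCS code point
latin1_octets = char.encode("latin-1")  # coded in IS 8859-1
utf_octets = utf_latin1_supplement(code)

print(code)                 # 246
print(list(latin1_octets))  # [246]
print(list(utf_octets))     # [160, 246]
```

One character, one code point, but a different octet sequence in each
coded character set.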
Particular attention must be paid to this distinction when reading
RFCs that predate 10646 (i.e. all of them, at this writing ...). The
RFCs often say "character" when the intention is "octet", assuming an
equivalence that, while valid at the time, is no longer valid.
This can lead directly to fruitless arguments over Original Intent. It
is more important to determine which definition is more useful in any
given case, and refine the definition of the protocol(s) described
appropriately.
In almost all cases in the existing RFCs, "character" should be read
as "octet", and specific "character X" as "octet containing the value
assigned in ASCII-7 to the character X".
4 Description of IS-10646-UTF
Important Note: This section and the next are not to be taken as in
any way authoritative. Only the IS itself is authoritative. This
description is provided only for informal reference and exposition.
IS 10646 defines (will define) a 32 bit set, UCS-4, with characters
assigned to integer code points in the range 0. to 2147483647. (Note
that the high bit of the first octet is never set; codes with it set
may be used for any internal purpose within a device, but may not
appear in external text conforming to the standard.)
The first 256 code points are IS 8859-1 (aka "ASCII-8").
For example, LATIN CAPITAL LETTER A is 65., or 00 00 00 41 in hex.
This is the same integer code point in 7 bit ASCII, in 8859-1, and in
UCS-4. In UCS-4 it contains octets (00 in this case) that "look like"
control codes to software not recognizing multi-octet codes.
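As an informal illustration, the following Python sketch writes a code
point as four UCS-4 octets, most significant first (the order used in
the example above), and counts the octets that fall in the control
range. The function name ucs4_octets is an assumption made for this
sketch.

```python
import struct

def ucs4_octets(code: int) -> bytes:
    """Four-octet UCS-4 form of a code point, most significant octet first."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are limited to 31 bits")
    return struct.pack(">I", code)

octets = ucs4_octets(65)                   # LATIN CAPITAL LETTER A
print(octets.hex(" "))                     # 00 00 00 41
print(sum(1 for o in octets if o < 0x20))  # 3
```

Three of the four octets would be taken for NUL by software that
handles text one octet at a time.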
The standard defines (now in second working draft) a Universal
Transfer Format, to address this problem. Codes are mapped through an
algorithm to a 1-5 octet sequence.
Ranges of the UCS code space are mapped to ranges of UTF as follows:
UCS-4 (decimal codes)        UTF (hexadecimal octets)

     0. to        159.       00             to 9F
   160. to        255.       A0 A0          to A0 FF
   256. to      16405.       A1 21          to F5 FF
 16406. to     233005.       F6 21 21       to FB FF FF
233006. to 4294967295.       FC 21 21 21 21 to FF 59 3C C8 C3
The octets used in the multi-octet characters after the first are
always in the range(s) 21-7E and A0-FF, and therefore do not look like
control codes to software unaware that it is transmitting a
multi-octet code set.
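The mapping can be sketched in Python as follows. This is a
reconstruction from the range table and the trailing-octet ranges given
above, not the normative algorithm of the IS: it assumes that the
offset within each range is written as base-190 digits (94 octets in
21-7E plus 96 octets in A0-FF), with the leading digit added to the
first octet of the range. The names utf_encode and trail are
hypothetical.

```python
TWO_OCTET_BASE = 256        # first code point sent as 2 octets (A1 21)
THREE_OCTET_BASE = 16406    # first code point sent as 3 octets (F6 21 21)
FIVE_OCTET_BASE = 233006    # first code point sent as 5 octets (FC 21 21 21 21)
RADIX = 190                 # 94 octets in 21-7E plus 96 octets in A0-FF

def trail(digit: int) -> int:
    """Map a base-190 digit to an octet in 21-7E or A0-FF."""
    return digit + 0x21 if digit < 94 else digit + 0x42

def utf_encode(code: int) -> bytes:
    """Encode one UCS-4 code point as a 1-5 octet UTF sequence (sketch)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code point outside the UCS-4 range")
    if code < 160:
        return bytes([code])
    if code < 256:
        return bytes([0xA0, code])
    if code < THREE_OCTET_BASE:
        lead, n, count = 0xA1, code - TWO_OCTET_BASE, 1
    elif code < FIVE_OCTET_BASE:
        lead, n, count = 0xF6, code - THREE_OCTET_BASE, 2
    else:
        lead, n, count = 0xFC, code - FIVE_OCTET_BASE, 4
    digits = []
    for _ in range(count):          # least significant digit first
        n, d = divmod(n, RADIX)
        digits.append(trail(d))
    return bytes([lead + n] + digits[::-1])

print(utf_encode(246).hex(" ").upper())     # A0 F6
print(utf_encode(1024).hex(" ").upper())    # A5 29
print(utf_encode(16405).hex(" ").upper())   # F5 FF
```

The sketch reproduces the boundary values of the range table; only the
IS itself is authoritative for anything beyond that.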
There are no shifts or locking shifts, a major technical advantage
over the previous draft of 10646. Any control character (including
SPACE, for this purpose) thus provides a resynchronization point if an
error occurs.
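A receiver might locate such a resynchronization point as sketched
below in Python. The sketch assumes only what is stated above: octets
00-20 and 7F-9F lie outside the trailing ranges 21-7E and A0-FF, so
each is always a complete single-octet character. The function names
are hypothetical.

```python
def is_sync_octet(octet: int) -> bool:
    """True for octets 00-20 and 7F-9F, which never occur as trailing octets."""
    return octet <= 0x20 or 0x7F <= octet <= 0x9F

def resynchronize(data: bytes, start: int = 0) -> int:
    """Index of the first control or SPACE octet at or after start.

    Decoding can safely restart at the returned index. If no such
    octet is found, len(data) is returned.
    """
    for i in range(start, len(data)):
        if is_sync_octet(data[i]):
            return i
    return len(data)

# A stray trailing octet (29), then SPACE, then the two octets A0 F6.
damaged = bytes([0x29, 0x20, 0xA0, 0xF6])
print(resynchronize(damaged))   # 1
```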
UTF also provides an advantage of compactness in most cases,
especially when small amounts of text in various languages are
intermixed, with the majority in a Latin language. (One might say
"English", but that might lead to complaints of parochialism. So one
doesn't.)
It also facilitates further compression with general purpose text
compression techniques, since the most useful statistics are found in
the tri-octet range, exactly where they are in NET-ASCII text, almost
regardless of the language used. (The word "tri-octet" has not
appeared in print before to this author's knowledge, but "trigraph",
the usual term used in cryptography and data compression research,
would be a misnomer here.)
5 Outline of code table
The following is an outline of the (DRAFT) 10646 code table. As with
the information in the previous section, this may change up to the
issuance of the IS, and is not authoritative; refer to the IS.
Each block is described by its name in the (DRAFT) standard (CJK means
Chinese-Japanese-Korean), the range of code points in the block in
decimal, and the first and last code points in UTF in hexadecimal.
The last column is the actual UTF code for the first and last points
in the block. These may not (and in many cases do not) correspond to
assigned characters; even if this document is displayed with a
10646-UTF rendering process, it will not show anything useful for
those code points.
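For checking entries of the outline by hand, the hexadecimal column
can be converted back to a decimal code point with a short Python
sketch such as the one below. It assumes the same reading of the range
table as before (offsets written as base-190 digits over the octets
21-7E and A0-FF) and is not taken from the IS; utf_decode_one and
untrail are hypothetical names.

```python
def untrail(octet: int) -> int:
    """Map a trailing octet (21-7E or A0-FF) back to a base-190 digit."""
    if 0x21 <= octet <= 0x7E:
        return octet - 0x21
    if 0xA0 <= octet <= 0xFF:
        return octet - 0x42
    raise ValueError("octet %02X cannot be a trailing octet" % octet)

def utf_decode_one(octets: bytes) -> int:
    """Decode exactly one character from all of its octets (sketch)."""
    lead, rest = octets[0], octets[1:]
    if lead < 0xA0:
        count, base, n = 0, lead, 0
    elif lead == 0xA0:
        # A0 is followed by the code point itself, in the range A0-FF.
        if len(rest) != 1 or rest[0] < 0xA0:
            raise ValueError("malformed A0 sequence")
        return rest[0]
    elif lead < 0xF6:
        count, base, n = 1, 256, lead - 0xA1
    elif lead < 0xFC:
        count, base, n = 2, 16406, lead - 0xF6
    else:
        count, base, n = 4, 233006, lead - 0xFC
    if len(rest) != count:
        raise ValueError("wrong number of octets for this first octet")
    for octet in rest:
        n = n * 190 + untrail(octet)
    return base + n

print(utf_decode_one(bytes.fromhex("A5 29")))      # 1024
print(utf_decode_one(bytes.fromhex("F6 33 D0")))   # 19968
```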
ISO-646 IRV                            32.  20
                                      126.  7E              ~
Latin-1 Supplement                    160.  A0 A0
                                      255.  A0 FF           ÿ
Extended Latin-A                      256.  A1 21           ¡!
                                      383.  A1 C1           ¡Á
Extended Latin-B                      384.  A1 C2           ¡Â
                                      591.  A2 D3           ¢Ó
IPA Extensions                        592.  A2 D4           ¢Ô
                                      687.  A3 54           £T
Spacing modifier letters              688.  A3 55           £U
                                      767.  A3 C5           £Å
Combining diacritical marks           768.  A3 C6           £Æ
                                      879.  A4 56           ¤V
Greek                                 880.  A4 57           ¤W
                                     1023.  A5 28           ¥(
Cyrillic                             1024.  A5 29           ¥)
                                     1279.  A6 6A           ¦j
Armenian                             1328.  A6 BC           ¦¼
                                     1423.  A7 3C           §<
Hebrew                               1424.  A7 3D           §=
                                     1535.  A7 CD           §Í
Arabic                               1536.  A7 CE           §Î
                                     1791.  A9 30           ©0
Devanagari                           2304.  AB D6           «Ö
                                     2431.  AC 76           ¬v
Bengali                              2432.  AC 77           ¬w
                                     2559.  AD 38           8
Gurmukhi                             2560.  AD 39           9
                                     2687.  AD D9           Ù
Gujarati                             2688.  AD DA           Ú
                                     2815.  AE 7A           ®z
Oriya                                2816.  AE 7B           ®{
                                     2943.  AF 3C           ¯<
Tamil                                2944.  AF 3D           ¯=
                                     3071.  AF DD           ¯Ý
Telugu                               3072.  AF DE           ¯Þ
                                     3199.  B0 7E           °~
Kannada                              3200.  B0 A0           °
                                     3327.  B1 40           ±@
Malayalam                            3328.  B1 41           ±A
                                     3455.  B1 E1           ±á
Thai                                 3584.  B2 A4           ²¤
                                     3711.  B3 44           ³D
Lao                                  3712.  B3 45           ³E
                                     3839.  B3 E5           ³å
Tibetan                              4096.  B5 49           µI
                                     4191.  B5 C9           µÉ
Georgian                             4256.  B6 2B           ¶+
                                     4351.  B6 AB           ¶«
Additional Extended Latin            7680.  C8 2F           È/
                                     7935.  C9 70           Ép
Greek Extensions                     7936.  C9 71           Éq
                                     8191.  CA D3           ÊÓ
General Punctuation                  8192.  CA D4           ÊÔ
                                     8303.  CB 64           Ëd
Superscripts and Subscripts          8304.  CB 65           Ëe
                                     8351.  CB B5           Ëµ
Currency Symbols                     8352.  CB B6           Ë¶
                                     8399.  CB E5           Ëå
Combining Diacritical Marks          8400.  CB E6           Ëæ
  For Symbols                        8447.  CC 36           Ì6
Letterlike Symbols                   8448.  CC 37           Ì7
                                     8527.  CC A7           Ì§
Number Forms                         8528.  CC A8           Ì¨
                                     8591.  CC E7           Ìç
Arrows                               8592.  CC E8           Ìè
                                     8703.  CD 78           Íx
Mathematical Operators               8704.  CD 79           Íy
                                     8959.  CE DB           ÎÛ
Miscellaneous Technical              8960.  CE DC           ÎÜ
                                     9215.  D0 3E           Ð>
Control Pictures                     9216.  D0 3F           Ð?
                                     9279.  D0 7E           Ð~
Optical Character Recognition        9280.  D0 A0           Ð
                                     9311.  D0 BF           Ð¿
Enclosed Alphanumerics               9312.  D0 C0           ÐÀ
                                     9471.  D1 A1           Ñ¡
Box Drawing                          9472.  D1 A2           Ñ¢
                                     9599.  D2 42           ÒB
Block Elements                       9600.  D2 43           ÒC
                                     9631.  D2 62           Òb
Geometric Shapes                     9632.  D2 63           Òc
                                     9727.  D2 E3           Òã
Miscellaneous Dingbats               9728.  D2 E4           Òä
                                     9983.  D4 46           ÔF
Dingbats                             9984.  D4 47           ÔG
                                    10175.  D5 48           ÕH
CJK Symbols and Punctuation         12288.  E0 5F           à_
                                    12351.  E0 BF           à¿
Hiragana                            12352.  E0 C0           àÀ
                                    12447.  E1 40           á@
Katakana                            12448.  E1 41           áA
                                    12543.  E1 C1           áÁ
Bopomofo                            12544.  E1 C2           áÂ
                                    12591.  E1 F1           áñ
Hangul Jamo                         12592.  E1 F2           áò
                                    12687.  E2 72           âr
CJK Miscellaneous                   12688.  E2 73           âs
                                    12703.  E2 A3           â£
Combining Hangul Jamo               12704.  E2 A4           â¤
                                    12799.  E3 24           ã$
Enclosed CJK Letters                12800.  E3 25           ã%
  and Months                        13055.  E4 66           äf
CJK Compatibility Words             13056.  E4 67           äg
  and Hours                         13183.  E5 28           å(
Hangul                              13312.  E5 CA           åÊ
                                    15663.  F2 32           ò2
Supplementary Hangul                15872.  F3 45           óE
                                    17807.  F6 28 68        ö(h
Old Hangul                          17920.  F6 28 FA        ö(ú
                                    19599.  F6 31 DB        ö1Û
CJK Unified Ideograms               19968.  F6 33 D0        ö3Ð
                                    40959.  F6 C3 4C        öÃL
Private Use Area                    57344.  F7 3A 79        ÷:y
                                    63487.  F7 5A D9        ÷ZÙ
CJK Compatibility Ideographs        63744.  F7 5C 3D        ÷\=
                                    64255.  F7 5E E1        ÷^á
Alphabetic Presentation Forms       64256.  F7 5E E2        ÷^â
                                    64335.  F7 5F 52        ÷_R
Arabic Presentation Forms           64592.  F7 60 B6        ÷`¶
                                    65023.  F7 62 E9        ÷bé
CJK Compatibility Forms             65072.  F7 63 3B        ÷c;
                                    65103.  F7 63 5A        ÷cZ
Small Form Variants                 65104.  F7 63 5B        ÷c[
                                    65135.  F7 63 7A        ÷cz
Arabic Presentation Forms-B         65136.  F7 63 7B        ÷c{
                                    65278.  F7 64 4B        ÷dK
Halfwidth and Fullwidth Forms       65280.  F7 64 4D        ÷dM
                                    65519.  F7 65 7E        ÷e~
Specials                            65520.  F7 65 A0        ÷e
                                    65533.  F7 65 AD        ÷e
Private Use Planes               14680064.  FC 23 35 46 3D  ü#5F=
                                 16777215.  FC 23 6F 57 D7  ü#oW×
Private Use Groups             1610612736.  FD 4D D6 E4 D8  ýMÖäØ
                               2147483647.  FD BD 2B B9 40  ý½+¹@
6 Notes on particular Internet protocols
Most of the common Internet application protocols, as implemented by
commercial software, already provide for the use of 8 bit characters,
usually to permit use of IS 8859 variants for alphabetic languages.
Public domain software is often lacking in this area, not having been
subjected to international commercial pressures, and usually being
hacked by the user to handle character sets other than 7 bit ASCII
where necessary. Which is almost everywhere: the only two languages
that can be written properly with ASCII-7 are Hawaiian and Swahili
(when written in the Latin script; the Arabic script is also used).
English cannot be (Rôle, cliché, coöperate, façade; although the
spelling variants without the diacritics are considered acceptable),
and one wants to spell proper names properly: Ångstrøm, Mötley
Crüe. (In that last case, one concludes that the spelling was chosen
for appearance; the implied pronunciation is awesome.)
Most commercial implementations also permit either CRLF or LF as a
line terminator, even when standards dictate CRLF. It is perhaps
useful to move toward the simple use of LF as line terminator in
NET-UTF. This would be consistent with the de facto text standard in
NFS, derived from Unix.
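A receiver that accepts either terminator can work purely at the octet
level, as in the Python sketch below. This is safe for NET-UTF because
the octets 0D (CR) and 0A (LF) lie outside the trailing-octet ranges
21-7E and A0-FF and so never occur inside a multi-octet character. The
function name split_lines is an assumption of this sketch.

```python
def split_lines(data: bytes) -> list[bytes]:
    """Split octets into lines, accepting CRLF or a bare LF as terminator."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                  # a final terminator starts no new line
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]

text = b"one\r\ntwo\n\xa0\xf6\n"     # the last line is the two octets A0 F6
print(split_lines(text))             # [b'one', b'two', b'\xa0\xf6']
```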
The following sections are simply observations on the major
application protocols, without any attempt to be comprehensive. (IHR
is the Internet Host Requirements, RFC 1123).
6.1 TELNET
TELNET (the Internet remote login protocol) has very specific
requirements for the use of CR and LF, best explained in IHR, 3.3.1.
It also (IHR, 3.4.1) specifies character set transparency, at least
for 7 bit characters. Most implementations actually already provide 8
bit transparency, whether "binary mode" is negotiated or not. If the
binary mode is negotiated, this serves to turn off newline
interpretation and other control interpretation (if any), not to
enable 8 bit transmission.
While the default assumption should now be NET-UTF, the actual set
used may be entirely a private issue; note that TELNET servers and
clients may have (e.g.) knowledge of terminal types.
6.2 FTP
FTP (Internet file transfer protocol) uses two separate session
connections, one for issuing commands and responses, the other
(re-)established for each file transferred, and carrying only data.
6.2.1 Control connection
NET-UTF is only to be used in path names of files to be transferred.
This is an extension of the specifications of RFC959 and IHR 4.1.4.1,
which specify any 7 bit character other than CR and LF. Note that
(for example) the BSD Unix file system allows any 8 bit character
(nominally IS 8859-1), and the FTP implementation permits these names
to be used.
6.2.2 Data connection
When the data is being transferred in text mode, most existing
implementations permit 8 bit characters, and accept either LF or CRLF
as line terminators.
6.3 SMTP
6.3.1 Protocol (RFC 821)
In the SMTP (Internet mail transfer protocol) itself (distinguished
from the message header) NET-UTF does not make any real difference,
since electronic mail addresses consist of a restricted set of
characters. The other parts of the command syntax are entirely
keywords and fixed elements.
NET-UTF might be used in the text of responses (which are not
interpreted by the protocol), particularly when giving a
multi-lingual response.
The only issue of actual protocol failure in data transmission might
occur when an octet of value (hex) AE is the only content of a line;
if the data is being "stripped" to 7 bit (i.e. by non-NET-UTF
compliant software) this might look like the dot (hex 2E) used to end
the message.
This is not considered a problem, since AE is never used by itself in
IS 10646-UTF. It is always either the first octet of a 2 octet
character, or a subsequent octet in a multi-octet character; the
problem does not arise, except perhaps by intentional mischief.
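The arithmetic behind this failure mode is simple enough to show in a
few lines of Python. The sketch models "stripping" as clearing the high
bit of every octet, which is an assumption about how such software
behaves; both function names are hypothetical.

```python
def strip_to_7_bit(data: bytes) -> bytes:
    """Clear the high bit of every octet, as 7 bit software might."""
    return bytes(octet & 0x7F for octet in data)

def is_end_of_data(line: bytes) -> bool:
    """True if the line consists solely of the dot that ends the message."""
    return line == b"."

lone = strip_to_7_bit(bytes([0xAE]))          # AE alone on a line
print(is_end_of_data(lone))                   # True

whole = strip_to_7_bit(bytes([0xAE, 0x7A]))   # AE 7A, a complete character
print(is_end_of_data(whole))                  # False
```

A complete character beginning with AE always carries at least one more
octet, so the stripped line is never the lone dot.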
No negotiation of 8 bit transmission is done; this would simply
introduce a presently non-existent failure mode. Communities of users
that need to use 8 bit character sets already are using the protocol
with 8 bit transmission. More importantly, Internet mail transfer
agents do not now have license to modify the content of messages in
any way (although public domain software often does, to the detriment
of everyone); it would be a serious regression to allow any such
license.
6.3.2 Internet message headers (RFC 822)
The use of NET-UTF in message headers is effectively already
implemented, with the notable exception of the large quantity of
software that does not even attempt to comply with the existing
standard, to which no concession need be made. (The existing standard
being a decade old at this writing.)
Where message header fields contain arbitrary text, either there are
no restrictions (e.g. the subject field) or a well defined
combination of single "character" and string quoting is used. Present
implementations consider this to be octet-level quoting (i.e. given
that there has been no distinction between "octet" and "character"),
and this interpretation is used, reading "octet" for "character" in
the specification.
The de-escaped field content can then be interpreted as a NET-UTF
string, to be rendered as any other text.
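Octet-level de-escaping of a quoted string might look like the Python
sketch below. It handles only the double-quote delimiters and the
backslash escape, reading "octet" for "character" as described above;
it is not a full RFC 822 parser, and the function name dequote is
hypothetical.

```python
def dequote(field: bytes) -> bytes:
    """Remove the quotes and backslash escapes of a quoted string, by octet."""
    if len(field) < 2 or field[:1] != b'"' or field[-1:] != b'"':
        raise ValueError("not a quoted string")
    inner = field[1:-1]
    out = bytearray()
    i = 0
    while i < len(inner):
        if inner[i] == 0x5C:         # a backslash quotes the next octet
            i += 1
            if i == len(inner):
                raise ValueError("dangling backslash")
        out.append(inner[i])
        i += 1
    return bytes(out)

# "R" + A0 F4 + "le", a space, then an escaped quoted x.
field = b'"R\xa0\xf4le \\"x\\""'
print(dequote(field))                # b'R\xa0\xf4le "x"'
```

The result is then handed, unchanged, to whatever renders NET-UTF text.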
6.3.3 SMTP/X.25 (RFC 1090)
The specification for the use of SMTP directly on X.25 and the packet
mode of the ISDN should now refer to IS 10646-UTF instead of
IS 8859-1.
6.4 NFS
The network file system does not normally place any interpretation on
the content of files when used in a Unix-only environment, but often
implementations on other operating systems must do some interpretation
or conversion of "text" files.
Taking all the existing 7 bit ASCII files to be NET-UTF is a powerful
extension of the present day environment, and should provide an
immediately effective transition to a universally useful network data
base.
7 References
[1] David H. Crocker. Standard for the Format of ARPA Internet Text
Messages. RFC 822, University of Delaware, August, 1982.
[2] International Organization for Standardization. Information
technology -- Universal Coded Character Set (UCS). ISO/IEC DIS
10646-1.2, ISO, 26 December, 1991. (Draft, ballot just concluded
at this writing)
[3] Jon Postel. Simple Mail Transfer Protocol. RFC 821, USC
Information Sciences Institute, August, 1982.
[4] Robert L. Ullmann. SMTP on X.25. RFC 1090, Prime Computer,
February, 1989.
8 Author's Address
Robert Ullmann
Process Software Corporation
959 Concord Street
Framingham, MA 01701
USA
Phone: +1 508 879 6994 x226
Email: Ariel@Process.COM
This draft expires on or before December 12, 1992.