Internet Draft M. Duerst
<draft-duerst-i18n-norm-00.txt> University of Zurich
Expires in six months July 1997
Normalization of Internationalized Identifiers
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working doc-
uments of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute work-
ing documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months. Internet-Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet-
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress".
To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
Rim).
Distribution of this document is unlimited. Please send comments to
the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at
uri@bunyip.com. This document is currently a very early draft,
intended to stimulate discussion only. It is intended to become part
of a suite of documents related to the internationalization of URLs.
Abstract
The Universal Character Set (UCS) makes it possible to extend the
repertoire of characters used in non-local identifiers beyond US-
ASCII. The UCS contains a large overall number of characters, many
codepoints for backwards compatibility, and various mechanisms to
cope with the features of the writing systems of the world. All this
together can lead to ambiguities in representation. Such ambiguities
are not a problem when representing running text. Therefore, existing
standards have only defined equivalences. For use in identifiers,
which are compared using their binary representation, this is
not sufficient. This document defines a normalization algorithm and
gives usage guidelines to avoid such ambiguities.
Expires End of January 1998 [Page 1]
Internet Draft Normalization of Identifiers July 1997
Table of contents
1. Introduction
1.1 Motivation
1.2 List of Potential Ambiguities
1.3 Categories
1.3.1 Category Overview
1.3.2 Category List
1.4 Applicability and Conformance
1.5 Notation
2. Normalization Rules
2.1 Normalization of Combining Sequences
2.2 Hangul Jamo Normalization
2.3 Arabic Ligature and Presentation Form Normalization
3. Forbidden Characters and Character Combinations
4. Dangerous Characters and Character Combinations
5. Discouraged Characters and Character Combinations
5.1 Similar Letters in Different Alphabets
6. No Normalization nor Restriction
6.1 Case Folding
Acknowledgements
Bibliography
Author's Address
1. Introduction
1.1 Motivation
For the identification of resources in networks, many kinds of iden-
tifiers are in use. Locally, many kinds of identifiers can contain
characters from all kinds of languages and scripts, but as long as
different encodings for the same characters exist, these cannot be
used in identifiers across a wider network. Therefore, network iden-
tifiers had to be limited to a very restricted character repertoire,
usually a subset of US-ASCII.
With the definition of the Universal Character Set (UCS) [ISO10646]
[Unicode2], it becomes possible to extend the character repertoire of
such identifiers. In some cases, this has already been done, for
example in Java and for URNs [URN-Syntax]; other cases are under
study. While identifiers for resources of full worldwide interest
should continue to be limited to a very restricted set of widely
known characters, names for resources mainly used in a language-local
or script-local context may provide significant additional user con-
venience if they can make use of a wider character repertoire.
The UCS contains a large overall number of characters, many code-
points for backwards compatibility, and various mechanisms to allow
it to cope with the features of the writing systems of the world.
These all lead to ambiguities that in some cases can be resolved by
careful display, printing, and examination by the reader, but in
other cases are intended to be unnoticeable by the reader. Such ambiguities
can be dealt with in systems processing running text by using
various kinds of equivalences and normalizations, which may differ by
implementation.
However, identifier processing software usually compares the binary
representations of two identifiers to establish that they are identical. In
some cases, some additional processing may be done to account for the
specifics of identifier syntax variation. To upgrade all such soft-
ware to take into account the equivalences and ambiguities in the UCS
would be extremely tedious. For some classes of identifiers, it is
impossible because their binary representation is transparent in the
sense that it may allow legacy character encodings besides a charac-
ter encoding based on UCS to be used and/or it may allow for arbi-
trary binary data to be contained in identifiers.
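The effect of a purely binary comparison can be seen in a few lines of
Python. This is only an illustration: the sample strings and the
function name same_bytes are made up here, UTF-8 is assumed as the
encoding, and Unicode Normalization Form C (which postdates this
document) is used merely as a stand-in for "some normalization".

```python
import unicodedata

# Two spellings of the same visible word.
precomposed = "caf\u00E9"    # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"    # ends in "e" + U+0301 COMBINING ACUTE ACCENT

def same_bytes(a, b):
    """Compare two identifiers by their binary (here: UTF-8) representation."""
    return a.encode("utf-8") == b.encode("utf-8")

# Without normalization the two identifiers do not match.
print(same_bytes(precomposed, decomposed))

# After normalizing both sides to one form, they do.
print(same_bytes(unicodedata.normalize("NFC", precomposed),
                 unicodedata.normalize("NFC", decomposed)))
```

The first comparison fails and the second succeeds, which is the
situation a normalization applied at identifier creation time is meant
to avoid.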
In order to facilitate the use of identifiers containing characters
from UCS, this document therefore intends to develop clear specifica-
tions for a normalization algorithm removing basic ambiguities, and
guidelines for the use of characters with potential ambiguity.
A key design goal of the algorithm was and is that for most
identifiers in current use, applying the algorithm results in the
identity transform (i.e. the identifier is already normalized). This
makes it possible to continue using existing identifiers and to start
using internationalized identifiers in new settings even before all
the details of the normalization algorithm have been agreed upon.
Other goals when designing the algorithms and rules have been as fol-
lows:
- Avoid bad surprises for users who cannot understand why two
identifiers that look exactly the same do not match. The user in
this case is an average user without any specific knowledge of
character encoding, but with a basic dose of "computer literacy"
(e.g. knowing that 0 and O have distinct keys on a keyboard).
- Restrict normalization to cases where it is really necessary;
cover remaining ambiguities by guidelines.
- Define normalization so that it can be implemented using widely
accessible documentation.
- Take measures for best possible compatibility with future addi-
tions to the UCS.
There are some issues this document does not currently address, in
particular bidirectionality. It is not clear yet whether this will be
included in this document or treated separately.
1.2 List of Potential Ambiguities
To give an idea of the extent of the problem, this section lists
potential character ambiguities, roughly ordered so that those cases
that are more difficult to distinguish come first. The difficulty of
distinguishing certain characters or combinations may depend greatly on
context.
- Precomposed/decomposed diacritic character representation
- Hangul jamo vs. johab and jamo representation alternatives
- CJK compatibility ideographs
- Other backwards compatibility duplicated characters
- Separately coded Indic length/AI/AU marks
- Glyphs for vertical variants
- Croatian digraphs, other ligatures (Latin, Arabic,...)
- Various variant punctuation (apostrophes, middle dots, spaces,...)
- Half-width/full-width characters (Latin, Katakana and Hangul)
- Vertical variants (U+FE30...)
- Presence or absence of joiner/non-joiner
- Superscript/subscript variants (numbers and IPA)
- Small form variants (U+FE50...)
- Upper case/lower case
- Similar letters from different scripts (varying degrees) (e.g. "A"
in Latin, Greek, and Cyrillic)
- Letterlike symbols, Roman numerals (varying degrees)
- Enclosed alphanumerics, katakana, hangul,...
- Squared katakana (units,...), squared Latin abbreviations,...
- CJK ideograph variants (varying degrees, in particular general
simplifications, backwards-compatibility non-unifications, JIS
78/83 problems)
- Ignorable whitespace, hyphens,... (sorting)
- Ignorable accents,... (sorting)
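A few of the ambiguities listed above can be made concrete as follows.
The pairs in SAMPLES are picked for illustration only and are not an
exhaustive or official list; describe is a hypothetical helper. Each
pair looks alike (or identical) when displayed but differs in its
coded representation.

```python
import unicodedata

# (left, right) pairs that are easily confused; chosen for illustration only.
SAMPLES = [
    ("\u00C5", "A\u030A"),  # precomposed vs. decomposed diacritic
    ("\u212A", "K"),        # backwards compatibility duplicate (KELVIN SIGN)
    ("\uFF21", "A"),        # full-width vs. ordinary Latin letter
    ("\u0391", "A"),        # Greek vs. Latin
    ("\u0410", "A"),        # Cyrillic vs. Latin
]

def describe(text):
    """Spell out a string as codepoints with their character names."""
    return " + ".join("U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text)

for left, right in SAMPLES:
    # None of the pairs compares equal as coded strings.
    print(describe(left), "<->", describe(right), "| equal:", left == right)
```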
1.3 Categories
1.3.1 Category Overview
This specification distinguishes various categories of ambiguous
characters or strings. For each category, it will list or describe:
- The characters and character combinations in the category
- The context, if necessary
- The nature of the ambiguity
- The necessary actions or recommendations
1.3.2 Category List
The following categories are currently under investigation:
- Normalized: Characters and character combinations in this category
are not allowed in identifiers; they MUST be converted to a
normalized form. Examples include characters with strong
equivalences.
- Forbidden: Characters and character combinations in this category
are not allowed at all in identifiers; identifiers containing them
are illegal. Examples include characters that cause problems to
software, such as control characters, and cases that need
normalization but where normalization is too difficult to specify
algorithmically.
- Dangerous: Characters and character combinations in this category
are seriously advised against. Software would usually alert a user
of an attempt to use such a character, but not force the user to
remove it.
- Discouraged: Characters and character combinations in this cate-
gory are advised against, but not as strongly as to necessitate an
alert.
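One possible way for software to represent the four categories and the
reaction each one calls for is sketched below. The names Category and
review, and the keys of the returned dictionary, are hypothetical; the
assignment of actual characters to categories is not shown because it
is still under investigation.

```python
from enum import Enum

class Category(Enum):
    NORMALIZED = "normalized"    # must be converted to a normalized form
    FORBIDDEN = "forbidden"      # makes the identifier illegal
    DANGEROUS = "dangerous"      # user is alerted, but not forced to change it
    DISCOURAGED = "discouraged"  # advised against, no alert

def review(found):
    """Summarize the reaction to the categories found in one identifier."""
    found = set(found)
    return {
        "legal": Category.FORBIDDEN not in found,
        "must_normalize": Category.NORMALIZED in found,
        "alert_user": Category.DANGEROUS in found,
        "advise_against": bool(found & {Category.DANGEROUS, Category.DISCOURAGED}),
    }

print(review([Category.NORMALIZED, Category.DISCOURAGED]))
```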
1.4 Applicability and Conformance
Where identifiers are used just to transmit data from one point to
another, e.g. in the case of the query component of a URL resulting
from a FORM reply, there is no need to apply the normalization rules
and guidelines defined in this document.
Identifiers containing a wide range of characters should be used with
care and only for an audience that is understood to be able to tran-
scribe them without problems.
1.5 Notation
Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
hexadecimal representation, according to [Unicode2].
Ranges of characters are expressed as U+XXXX-U+YYYY. A block of char-
acters may also be identified by its first codepoint, followed by
"...". Official ISO character names are given in all upper case.
2. Normalization Rules
This chapter defines several normalization algorithms. They deal
with different kinds of phenomena, or different scripts. They are
defined so that the sequence of their application does not change the
normalization result; each algorithm has to be applied at least once.
Applying an algorithm a second time will not change the result any
further.
The algorithms are to a certain extent written in a procedural fash-
ion. This does not imply that an implementation has to follow each
step. The only thing that is relevant is whether an implementation
produces the same outputs on the same inputs for all possible inputs,
i.e. for all randomly generated strings of arbitrary length. An
implementation may also combine the various algorithms into a single
one if the result is the same as applying each of the algorithms at
least once.
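The two properties required above (the order of application does not
matter, and a second application changes nothing) can be checked
mechanically on randomly generated strings. The harness below is a
sketch; all function names are hypothetical, and fold_kelvin and
fold_angstrom are placeholder algorithms used only to exercise the
harness, not algorithms defined by this document.

```python
import itertools
import random

def random_strings(count, max_len, alphabet, seed=0):
    """Generate reproducible random test strings over the given alphabet."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
            for _ in range(count)]

def apply_all(algorithms, text):
    """Apply each algorithm once, in the given order."""
    for algorithm in algorithms:
        text = algorithm(text)
    return text

def order_independent_and_stable(algorithms, samples):
    """Check both properties for every sample and every application order."""
    for sample in samples:
        results = set()
        for order in itertools.permutations(algorithms):
            once = apply_all(order, sample)
            if apply_all(order, once) != once:   # second pass must be a no-op
                return False
            results.add(once)
        if len(results) > 1:                     # all orders must agree
            return False
    return True

# Placeholder algorithms, for demonstration only.
def fold_kelvin(text):
    return text.replace("\u212A", "K")

def fold_angstrom(text):
    return text.replace("\u212B", "\u00C5")

samples = random_strings(200, 8, "AK\u212A\u212B\u00C5")
print(order_independent_and_stable([fold_kelvin, fold_angstrom], samples))
```

Such a test cannot prove the properties for all possible inputs, but it
is a cheap way to catch an implementation that violates them.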
2.1 Normalization of Combining Sequences
UCS contains a general mechanism for encoding diacritic combinations
from base letters and modifying diacritics, as well as many combina-
tions as precomposed codepoints.
The following algorithm normalizes such combinations:
Step 1: Starting from the beginning of the identifier, find a maximal
sequence of a base character (possibly decomposable) followed by mod-
ifying letters.
Step 2: Fully decompose the sequence found in step 1, using all
canonical decompositions defined in [Unicode2] and all canonical
decompositions defined for future additions to the UCS.
Step 3: Sort the sequence of modifying letters found in Step 2
according to the canonical ordering algorithm of Section 3.9 of [Uni-
code2].
Step 4: If the base character is a Hebrew character, go to step 6.
Step 5: Try to recombine as much as possible of the sequence result-
ing from Step 3 into a precomposed character by finding the longest
initial match with any canonical decomposition sequence defined in
[Unicode2], ignoring decomposition sequences of length 1.
Step 6: Use the result obtained so far as output and continue with
Step 1.
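The following Python sketch walks through Steps 1 to 6. It is an
illustration under stated assumptions, not a reference implementation:
it takes its decomposition data from Python's unicodedata module, which
reflects a later version of the UCS than [Unicode2]; it treats
characters with a non-zero canonical combining class as the modifying
letters of Step 1; it limits the recomposition table to U+0000-U+FFFF;
and it recognizes Hebrew base characters by the block U+0590-U+05FF.
All function names are hypothetical.

```python
import unicodedata

def canonical_decompose(ch):
    """Step 2 helper: recursively expand canonical decompositions only."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):   # none, or compatibility only
        return [ch]
    out = []
    for code in mapping.split():
        out.extend(canonical_decompose(chr(int(code, 16))))
    return out

def canonical_order(chars):
    """Step 3: stable-sort each run of combining marks by combining class."""
    out, run = [], []
    for ch in chars:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return out

def build_recomposition_table():
    """Map fully expanded decompositions back to their precomposed character."""
    table = {}
    for cp in range(0x10000):                    # assumption: BMP only
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping or mapping.startswith("<") or len(mapping.split()) < 2:
            continue                             # skip length-1 decompositions
        key = tuple(canonical_order(canonical_decompose(chr(cp))))
        table.setdefault(key, chr(cp))
    return table

RECOMPOSE = build_recomposition_table()

def normalize_combining(identifier):
    out = []
    i = 0
    while i < len(identifier):
        # Step 1: one character plus all directly following combining marks.
        j = i + 1
        while j < len(identifier) and unicodedata.combining(identifier[j]):
            j += 1
        # Steps 2 and 3: full canonical decomposition, then canonical ordering.
        seq = []
        for ch in identifier[i:j]:
            seq.extend(canonical_decompose(ch))
        seq = canonical_order(seq)
        # Step 4: Hebrew base characters are left decomposed.
        if not ("\u0590" <= seq[0] <= "\u05FF"):
            # Step 5: longest initial match against the recomposition table.
            for n in range(len(seq), 1, -1):
                precomposed = RECOMPOSE.get(tuple(seq[:n]))
                if precomposed is not None:
                    seq = [precomposed] + seq[n:]
                    break
        out.extend(seq)                          # Step 6
        i = j
    return "".join(out)
```

With this sketch, "e" followed by U+0301 becomes U+00E9, U+212A becomes
"K", U+FB31 becomes U+05D1 U+05BC and stays decomposed, and an
identifier consisting only of US-ASCII characters is returned
unchanged.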
NOTE -- In Step 5, the decomposition sequences in [Unicode2]
have to be recursively expanded for each character
(except for decomposition sequences of length 1) before
application. Otherwise, a character such as U+1E1C, LATIN
CAPITAL LETTER E WITH CEDILLA AND BREVE, will not be
recomposed correctly.
NOTE -- In Step 5, canonical decompositions defined for
future additions to the UCS are explicitly not considered.
This is done to ease forwards compatibility. It is assumed
that systems knowing about newly defined precompositions
will be able to decompose them correctly in Step 2, but
that it would be hard to change identifiers on older sys-
tems using a decomposed representation.
NOTE -- Maybe we have to define additions to the canonical
equivalences, and/or to add more exceptions such as Hebrew.
NOTE -- A different definition of Step 5 may lead to
shorter normalizations for some identifiers. The current
definition was chosen for simplicity and implementation
speed. (This may be subject to discussion, in particular
if somebody has an implementation and is ready to share the
code.)
NOTE -- The above algorithm can be sped up by shortcuts, in
particular by noting that most precomposed characters which
are not followed by modifying letters are already normal-
ized.
NOTE -- The exception for "precomposed letters that have a
decomposition sequence of length 1" in Step 5 is necessary
to avoid e.g. the letter "K" being "aggregated" to "KELVIN
SIGN" U+212A.
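The two special cases mentioned in the notes (recursive expansion and
decomposition sequences of length 1) can be inspected directly. The
sketch below assumes the character data shipped with Python's
unicodedata module, which is newer than [Unicode2]; direct_mapping and
full_decomposition are hypothetical helper names.

```python
import unicodedata

def direct_mapping(ch):
    """Code points listed in the character's own decomposition entry."""
    return [chr(int(code, 16))
            for code in unicodedata.decomposition(ch).split()
            if not code.startswith("<")]         # ignore compatibility tags

def full_decomposition(ch):
    """Canonical decomposition after recursive expansion and ordering."""
    return list(unicodedata.normalize("NFD", ch))

for ch in ("\u1E1C", "\u212A"):
    print("U+%04X" % ord(ch),
          "direct:", ["U+%04X" % ord(c) for c in direct_mapping(ch)],
          "full:", ["U+%04X" % ord(c) for c in full_decomposition(ch)])
```

For U+1E1C the direct mapping has two elements while the full
decomposition has three, so a recomposition table built from direct
mappings alone would never match the fully decomposed sequence. For
U+212A the direct mapping is the single letter "K"; if such mappings
were used for recomposition, every "K" would be turned into KELVIN
SIGN.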
2.2 Hangul Jamo Normalization
Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
notations and therefore must be carefully normalized. The following
algorithm should be used:
Step 1: A sequence of Hangul jamo is split up into syllables according
to the definition of syllable boundaries on page 3-12 of [Unicode2].
Each of these syllables is processed according to Steps 2-4.
Step 2: Fillers are inserted as necessary to form a canonical
syllable as defined on page 3-12 of [Unicode2].
Step 3: Sequences of choseong, jungseong, and jongseong (leading con-
sonants, vowels, and trailing consonants) are replaced by a single
choseong, jungseong, and jongseong respectively according to the com-
patibility decompositions given in [Unicode2]. If this is not possi-
ble, this is a forbidden sequence.
Step 4: The sequence is replaced by a Hangul Syllable (U+AC00-U+D7AF)
if this is possible according to the algorithm given on pp. 3-12/3 of
[Unicode2].
NOTE -- We are not currently dealing with compatibility
Jamo (U+3130...).
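Step 4 can be illustrated with the arithmetic mapping from jamo to
precomposed syllables as it appears in the Unicode Standard. The
sketch assumes that the pages of [Unicode2] cited above describe the
same arithmetic, and it covers Step 4 only, for a syllable that already
consists of a single choseong, a single jungseong, and at most one
jongseong. The name compose_syllable is hypothetical.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_syllable(lead, vowel, trail=None):
    """Return the Hangul Syllable for the given jamo, or None if impossible."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        return None                      # not representable as one syllable
    if trail is None:
        t_index = 0                      # no trailing consonant
    else:
        t_index = ord(trail) - T_BASE
        if not (1 <= t_index < T_COUNT):
            return None
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

syllable = compose_syllable("\u1112", "\u1161", "\u11AB")
print("U+%04X" % ord(syllable))
# Cross-check against the composition performed by the Python library.
print(syllable == unicodedata.normalize("NFC", "\u1112\u1161\u11AB"))
```

A return value of None corresponds to the "if this is possible"
condition in Step 4; in that case the jamo sequence is kept as it is.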
2.3 Arabic Ligature and Presentation Form Normalization
It is not yet clear whether a normalization algorithm should be
defined here, or whether ligatures and presentation forms should
simply be forbidden.
3. Forbidden Characters and Character Combinations
To be completed.
4. Dangerous Characters and Character Combinations
Half-width and full-width compatibility characters (U+FF00...) can
easily be mistaken and are frequently interchanged. The version not
in the compatibility section (i.e. half-width for Latin and symbols,
full-width for Katakana, Hangul, "LIGHT VERTICAL", arrows, black
square, and white circle) should be used wherever possible. Because
half-width Latin characters may be needed in certain parts of certain
identifiers anyway, keyboard settings in places where identifiers are
input should be set to produce half-width Latin characters by
default, making the input of full-width characters more tedious.
Also, while the difference between half-width and full-width
characters is clearly visible on computers in contexts that use
fixed-pitch displays, it is not reliably preserved when identifiers
are transcribed on paper or reproduced with high quality printing.
Identifiers should never differ by a half-width/full-width
difference only.
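Software that wants to alert a user to such characters could detect
them along the following lines. The sketch relies on the "<wide>" and
"<narrow>" compatibility decomposition tags in Python's unicodedata
module, which is an assumption about how the compatibility section is
identified; the function names are hypothetical.

```python
import unicodedata

def width_variants(identifier):
    """Return (character, preferred form) pairs for width variants."""
    found = []
    for ch in identifier:
        mapping = unicodedata.decomposition(ch)
        if mapping.startswith(("<wide>", "<narrow>")):
            preferred = "".join(chr(int(code, 16))
                                for code in mapping.split()[1:])
            found.append((ch, preferred))
    return found

def fold_width(identifier):
    """Replace every width variant by the form outside the compatibility area."""
    preferred = dict(width_variants(identifier))
    return "".join(preferred.get(ch, ch) for ch in identifier)

def differ_only_by_width(a, b):
    """True if two distinct identifiers become equal once widths are folded."""
    return a != b and fold_width(a) == fold_width(b)

print(width_variants("\uFF21bc"))
print(differ_only_by_width("\uFF21bc", "Abc"))
```

The first function supports an alert at input time; the last one
supports a check, at identifier creation time, that no existing
identifier differs from the new one by width only.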
To be completed.
5. Discouraged Characters and Character Combinations
To be completed.
5.1 Similar Letters in Different Alphabets
Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A)
are discouraged in contexts where their assignment to a given
alphabet is or may be ambiguous. This means that mixed-alphabet
identifiers are discouraged, in particular in cases where the use of
each alphabet is not clearly marked, e.g. by separators.
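A rough check for mixed-alphabet identifiers is sketched below. Since
Python's unicodedata module does not expose a script property, the
sketch uses the first word of the character name as a proxy for the
alphabet; this is an approximation, the list of alphabets is limited
to the three mentioned above, and separators are not taken into
account. The function names are hypothetical.

```python
import unicodedata

ALPHABETS = ("LATIN", "GREEK", "CYRILLIC")

def alphabets_used(identifier):
    """Return the set of alphabets whose letters occur in the identifier."""
    used = set()
    for ch in identifier:
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in ALPHABETS:
            used.add(first_word)
    return used

def is_mixed(identifier):
    """True if letters from more than one alphabet occur."""
    return len(alphabets_used(identifier)) > 1

print(is_mixed("paypal"))         # Latin letters only
print(is_mixed("p\u0430ypal"))    # second letter is U+0430 CYRILLIC SMALL LETTER A
```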
In the case of single letters mixed with numbers and symbols, such as
typically appear in part numbers, it should be assumed that such
letters are Latin with first priority, and Cyrillic with second
priority. Priority could also be different for different locations.
[what is best, fixed priorities or regional?]
Lower-case identifiers should be preferred to upper-case identifiers
because lower-case letters are more distinct.
6. No Normalization nor Restriction
This chapter lists cases where in some circumstances normalization is
applied or may seem advisable, but which are explicitly not normal-
ized, for example because a consistent normalization worldwide is not
possible.
6.1 Case Folding
This document assumes that case is distinguished, and does not have
to be folded or normalized. However, for some identifiers or parts
thereof, case folding may be taking place. In the absence of any specific
knowledge about this, it is very much advisable, both for automatic
processing and for user behaviour, to copy identifiers
without changing case in any way. On the other hand, it is advisable
for identifier creators to choose simple and consistent casing.
Intermittent casing can be copied visually, but is difficult to
transmit aurally.
The decision whether to make some part of an identifier
case-sensitive or not can be made freely as long as identifiers
are limited to the basic Latin alphabet. In many cases, there
is a tendency to extrapolate this to the Latin script in general.
However, the Latin script at large contains several special cases
which are language-dependent (e.g. Turkish dotted and dotless I/i) or
invalidate the one-to-one correspondence of upper case and lower case
(e.g. German sharp s). For identifiers with a repertoire extending
beyond the basic Latin alphabet, it is therefore highly advisable to
strictly distinguish case, i.e. to make identifiers case-sensitive.
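The special cases mentioned above can be observed with the
language-independent case mappings built into Python, which are used
here as an assumed example of a typical case conversion facility. The
helper round_trips is hypothetical.

```python
def round_trips(identifier):
    """True if upper-casing and then lower-casing restores the original."""
    return identifier.upper().lower() == identifier

# German sharp s: upper case is two letters, so the way back is lost.
print("\u00DF".upper())
print(round_trips("strasse"), round_trips("stra\u00DFe"))

# Turkish: the default mappings pair "I" with "i", whereas Turkish
# pairs "I" with dotless U+0131 and "i" with dotted U+0130.
print("I".lower() == "i")
print("\u0131".upper() == "I")
print(["U+%04X" % ord(ch) for ch in "\u0130".lower()])
```

An identifier containing only basic Latin letters survives the round
trip, while one containing a sharp s does not, and the Turkish letters
collapse onto the basic Latin I/i under the default mappings.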
Acknowledgements
I am grateful in particular to the following persons for contributing
ideas, advice, criticism and help: Mark Davis, Larry Masinter,
Michael Kung, Edward Cherlin, Alain LaBonte, Francois Yergeau, (to be
completed).
Bibliography
[ISO10646] ISO/IEC 10646-1:1993. International standard -- Infor-
mation technology -- Universal multiple-octet coded
character Set (UCS) -- Part 1: Architecture and basic
multilingual plane.
[Unicode2] The Unicode Standard, Version 2, Addison-Wesley, Read-
ing, MA, 1996.
[URN-Syntax] R. Moats, "URN Syntax", RFC 2141, May 1997.
Author's Address
Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland
Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch
NOTE -- Please write the author's name with u-Umlaut wherever
possible, e.g. in HTML as D&uuml;rst.