[Search] [txt|pdfized|bibtex] [Tracker] [WG] [Email] [Nits]

Versions: 00 01                                                         
Internet Draft                                     Liana Ye
draft-ietf-idn-step-00.txt                          Y&D ISG
May 29, 2001
Expires in six months (November 2001)

         StepCode- A User Access Oriented IDN Encoding

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed
         at http://www.ietf.org/shadow.html.


Abstract

This document describes a transformation method from an end user
into Unicode library for representing Chinese characters in host
name parts in a fashion that is extendable to include more than
Unicode symbols and completely compatible with the current DNS.
It is a potential candidate for an ASCII-Compatible Encoding (ACE)
for internationalized hostnames.  This method is based
on user widely used concept for denoting a Chinese character with
its phonetic elements, and register their IDN names with such an
extented phonetic description (English speakers can register a
CJK glyph for their blessing too.), so that an IDN name in either
traditional or simplified character set can be effective with
both user communities depending on servers, and allows standard
variations for compressing and security filtering of that
information. The method can be extented for other scripts as long
as they need more then 26 basic ASCII letters to be mapped.


1. Introduction

1.1 Context

There is a strong world-wide desire to use characters other than
plain ASCII in hostnames. Hostnames have become the equivalent of
business or product names for many services on the Internet, so
there is a need to make them usable by people whose native scripts
are not representable by ASCII. The requirements for
internationalizing hostnames are described in the IDN WG's
requirements document, [IDNReq].

The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ACE), and arch-3 (just send ACE). StepCode is an ACE that
can be used with protocols that match arch-2 or arch-3. Hopefully,
this ACE may render arch-2 unnecessary. It is either a server or a
client generatable and selectable ACE according to a string's
language/script tag/label[RFC1766]. Because it does not attempt any
particular optimization or compression of string patterns, the
average length for a Chinese character in [ISO10646] is about 13
characters long, enough to code ÏÃٱ̅©¬.com (I wish 20% reader
can see this, and the rest should be at lost.) as:
xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0.com with 54 octets.
In the code the first group of digits specify the tones of each
glyph as well as the number of codepoints in [ISO10646] or glyphs.
The rest of digits specify the layout of different parts of each
glyph of above string.

The StepCode protocol has the following features:

- There is exactly one way to convert internationalized host parts
to and from StepCode strings with a script tag/label. It permits
different script tags to access the same glyph in [ISO10646]
similar with a searching method into a book library, while the
codepoints of [ISO10646] is an analogy to ISDN book number. Host
name part uniqueness is preserved. If there is a difference in the
code, it is considered as user input error.

- Host parts have no international glyphs but US-ASCII.

- Names using StepCode have lengths proportionate to the number
of glyphs (from IS 10646) in the names themselves plus the
script tag.  However, StepCode for most frequently used glyphs
in the table is shortend significantly such as Cyrillic.

- This specification allows standard compression or security
treatment compatible with existing hostnames.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation
that does not follow these statements exactly is likely to cause
damage to the Internet by creating non-unique representations of
hostnames.

1.2 Author's Disclaimer

This document was written for the convenience of the IDN WG, in
case someone believes that there are no agreeable mechanisms for
referencing internationalized names without converting [ISO10646]
codepoints. The author believes that [ISO10646] is to
establish a symbol libaray, and there are better ways to do high
frequency symbol accessing.  Display a symbol onto
no DNS-based approach can solve the "IDN" problem as it
is hoped by users and company/enterprise domain name
registrants and it is prossible not to add another coding format,
further complicate the DNS and risk unknown problems and
incompatibilities.

1.3 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in
 [RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values
are shown preceded with an "0b". For example, a nine-bit value might
be shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter
"a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".

StepCode converts strings at a client site with internationalized
characters into strings of US-ASCII that are acceptable as host
name parts in current DNS host naming usage. The former are called
"pre-converted" and a "glyph" for a symbol repesented by one codepoint
in [ISO10646] or "glyphs" for a string of that, and the latter
are called "post-converted".

The protocol contains one procedure and calls for standardizing
a minimum number of glyphs of a script using the same script
tag. Glyphs in the minimum number of glyph set is called
"particles".

The protocol using US-ASCII to denote the phonetic elements of a
script and calls for standardizing such a mapping for each
script tag. The phonetic elements of a glyph is called "spelling"
of the glyph and is called "stem" for that of a particle.

The protocol specifies an ASCII Compatible [ACE] Encoding
method and using Chinese script as an example to demonstrate its
features, here is refered as an "ACE" process.

1.4 IDN summary

Using the terminology in [IDNComp], StepCode specifies an ACE format
for arch-2 (send binary or ACE), and arch-3 (just send ACE).

The length characteristics of StepCode are discussed above
(1.1 Context), is a variable depending on users' choice among
many factors. It fits well with existing compression and security
treatments.

It calls for standardrizing phonetic elements and minimum glyph set
within its user community (and labeled by script tag), while
asking the internet industry to enforce the standard and providing
cross reference to different script tags into Unicode standard.

2. Host Part Transformation

According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and
the hyphen character ("-"). This, of course, excludes any
internationalized characters, as well as many other characters in
the ASCII character repertoire. Further, domain name parts must be
63 octets or shorter in length.

2.1 Name tagging

All post-converted name parts that contain internationalized
characters begin with the string "gl-p-", where "gl" denote the
glyph set or script encoded as specified by [RFC2277], and
"p" denote the phonetic standard used, it SHOULD be reserved with
IANA. The string "gl-p-" will allow 674 scripts and 24 phonetic
standards of each to be encoded. The assignment of "gl-p-" shall
be defined in future versions of this draft.

Note that a zone administrator MAY still choose to use "gl-p-" at
the beginning of a hostname part even if that part does not contain
internationalized characters. Zone administrators MAY create
host part names that begin with "gl-p-" which means no conversion
is done and display systems SHOULD ignore converting
internationalized characters back for display.

2.2 Converting an internationalized name to an ACE name part

To convert a string of internationalized characters into an ACE name
part, the following steps MUST be preformed in the exact order of
the subsections given here.

If a name part consists exclusively of characters that conform to
the hostname requirements in [STD13] or the string "gl-p-",
the name MUST NOT be converted to StepCode. That is, a name part
that can be represented without StepCode MUST NOT be changed.
This absolute requirement prevents:
        1. double encoding from a client of user keyboard input and
         a server provider;
        2. mess up existing registered domain names;
        3. from being two different encodings for a single DNS
        registered hostname;
        4. interfering with registered glyphs with more than one
        phonetic standard, such as Chinese script.

If any checking for prohibited name parts (such as ones that are
prohibited characters, case-folding, or canonicalization) is to be
done, it MUST be done after doing the conversion to an ACE name
part as it is specified in [nameprep].

Characters outside the first plane of characters (those with
codepoints above U+FFFF) MUST be represented using surrogates, as
described in the UTF-16 description in ISO 10646.

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the
pre-converted string.

2.2.1 Check the input string for disallowed names

If the input string consists only of characters that conform to the
hostname requirements in [STD13], or the input string consists
a null language tag, the conversion MUST stop with an error.

2.2.2 Represent glyphs by their spelling and particle layout.

2.2.2.1 StepCode defination for digits

        Tone marks [Macmillan93]
        0       no tone
        1       flat/macron (-)
        2       rise/acute (/)
        3       dip/breve (v)
        4       drop/grave (\)
        5       throw/circumflex (^)
        6       thrill/tilde (~)
        7       dieresis (..)
        8       cedilla (hook)

        Particle layout [Ye95]
        0       end of a stem or a spelling
        1       to its left
        2       to its bellow
        3       to its inside (an enclosure particle)
        4       to its outside (normally a center divider)

2.2.2.2 StepCode phonetic symbol tables

2.2.2.2.1 Chinese

        Note: This is a list extracted from [Dictionary79] and
other sources.

        Pinyin          Wade            Zhuyin
        b               p               b
        p               p/              p
        m               m               m
        f               f               f
        d               t               d
        t               t/              t
        n               n               n
        l               l               l
        g               k               g
        k               k/              k
        h               h               h
        j               ch              j
        q               ch/             q
        x               hs              x
        zh              ch              zh
        ch              ch/             ch
        sh              sh              sh
        r               j               r
        z               ts              z
        c               ts/             c
        s               s               s
        a               a               a
        o               u               o
        e               e^              e
        e^              eh              ei
        i               uh              i
        u               u               u
        uo              o               oo
        u..             u..             u..
        y               y               y
        w               w               w
        v (' spelling separator)

2.2.2.2.2 Other scripts to come

2.2.3 StepCode Format
        Format Defination: A Stepcode unit is a string of [A-Za-z0-9]
characters without any white spaces, BLANK, in between. For each StepCode unit,
there are data elements indicated by "", which is a required elelment,
and [] where the element is optional, and / where the data is selectable.

        Sx stands for Spelling of xth glyph;
        Tx stands for Tone of xth glyph;
        Py stands for Stem for yth particle;
        Ly stands for Layout relation from y to y+1;
        Px.y stands for Stem for Xth glyph and its yth particle;
        Lx.y stands for Layout relation from Xth glyph and its y to y+1.

2.2.3.1. One glyph
        "S""T"[P1][L1][P2][L2]...[Py][0/BLANK]

Example:xin1qin
        xin1qin1jin0

2.2.3.2. Glyphs
        "S1S2S3...Sx""T1T2...Tx"[P1.1][L1.1][P1.2][L1.2]...[P1.y][0]
                        [P2.1][L2.1][P2.2][L2.2]...[P2.y][0]
                        ...
                        [Px.1][Lx.1][Px.2][Lx.2]...[Px.y][0/BLANK]

Example of glyphs of four:
        xinzhuqinghua1212
        xinzhuqinghua1212qin1jin0ge1ge0shui1qing0
        xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua
        xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0

The four StepCodes are equivalent, depending on where
it is registered, the size of the database, as well as there exist
similar hostnames it has confict with.

2.3. StepCode Encoding Process

Either, StepCode may be obtained from Unicode to StepCode through
a code lookup table, and combines glyph code into glyphs code as
shown in 2.2.

Or, it is inputed directly from keyboards, where an input
processing module to verify correctness of intented glyphs is necessary.

Prepend "glp--" or the name of conversion table used as script tag
to the post-convered string; finish. This is the hostname part that
can be used in DNS registration as well as resolution.

Go through [nameprep], checking for prohibited characters,
case-folding, or canonicalization.

2.4. Converting a StepCode hostname part to an internationalized name

The process has three steps with script tag untouched:

Step 1.If a domain name part consists no script tag, then goto Step 3;
        Otherwise enable conversion table named "glp" from StepCode to
        Unicode or other code, obtain the corespondends.
Step 2.If the corespondent is there then goto Step 3;
        Otherwise decomposes the post-converted string into a number
           of individual glyph specified in the "T" field;
        Searching for each glyph;
        If any of the glyph is not found,
                compose an error message and
                Requesting the missing glyph to be supplied
                        from the sender.
Step 3.Display available glyph where missing glyph is shown with
        its StepCode.

3. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. Thus, StepCode makes no changes to the DNS itself.

Hostnames are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different
servers based on different interpretations of the internationalized
hostname. Thus the restriction of DNS names to a small symbol set is
necessary and effective, where adding any other data format such as
UTF-8 only opens the security gate for complication.

4.Internationalization considerations

StepCode is designed so that every internationalized hostname part can
be represented as one and only one DNS-compatible string. If there
is two different ways to obtain the same glyph on a display device,
then they are still two distinct hostnames, with no bearing on
security issue. If there is any way to follow the steps in this
document and get two or more different results, it is an error in
the domain name registration process, where one domain name register
 fails updates other domain name register servers a newly registered
 and well researched hostname.

5. References

[ASCII] American National Standards Institute (formerly United
States of America Standards Institute), X3.4, 1968, "USA Code for
Information Interchange". (ANSI X3.4-1968)

[IDNCOMP]   "Comparison of Internationalized Domain Name Proposals",
            draft-ietf-idn-compare-00.txt, June 2000, P. Hoffman.

[IDNReq] Zita Wenzel and James Seng, "Requirements of
Internationalized Domain Names", draft-ietf-idn-requirements. May 2001.)

[ISO10646]  ISO/IEC 10646-1:2000 (note that an amendment 1 is in
            preparation), ISO/IEC 10646-2 (in preparation), plus
            corrigenda and amendments to these standards.

[Dictionary79] Beijing Foriegn Language Dept., "A Chinese-English
Dictionary", 1979, BK# 9017.810.

[Macmillan93] The Macmillan Visual Desk Reference, 1993,
ISBN 0-02-531310-x.

[RFC2277]   "IETF Policy on Character Sets and Languages",
            rfc2277.txt, January 1998, H. Alvestrand.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[UNICODE]   The Unicode Consortium, "The Unicode Standard". Described at
            http://www.unicode.org/unicode/standard/versions/.

[UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
            10646-1:2000. Described at http://www.unicode.org/unicode/
            standard/versions/Unicode3.0.html.

[Ye95] Liana Ye, "A Language Oriented Chinese Encoding for
Multilingual Computing Environments", in "Proceeding of the 1995
International Conference on Computer Processing of Oriental
Languages", Page 323.

6. Acknowledgements

The author has reused existing IDN draft document and language as
much as possible to demonstrate the deep respect for the work has
been done by members of this working group.

7. IANA Considerations

This document require IANA action for availibility of language tag,
and registration for each tag and its sub-field for phonetic system
used.

8. Author Contact Information

Liana Ye
Y&D ISG
2607 Read Ave.
Belmont, CA 94002, USA.
(650) 592-7092
liana.ydisg@juno.com

Expires November 2001