Network Working Group                                        XiaoDong LEE
Internet-Draft                         Chinese Academy of Sciences, CNNIC
Expires: Nov 21, 2002                                         Kenny Huang
                                                                Erin Chen
                                                                    TWNIC
                                                               Xiang DENG
                                                             YanFeng WANG
                                                                    CNNIC

  Chinese Name String in Search-based access model for the DNS
                  draft-xdlee-cnnamestr-00.txt
Status of this Memo

   This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice
   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Content
1.      Abstract
2.      Terminology
3.      CNS equivalence
4.      Requirements
5.      Solution suggested
6.      Phonetic input of Chinese name string
7.      Encoding
8.      Security Considerations
9.      Authors' Addresses
10.     Acknowledgements
11.     References

1. Abstract
There is growing demand for internationalized, human-readable Internet
identifiers and names, and many DNS-based systems have been developed to
meet it. John C. Klensin has proposed a three-layer search-based access
model for the DNS [DNSSEARCH]. This document only addresses some of the
related problems mentioned in that proposal. It focuses on the
equivalence of Traditional and Simplified Chinese and on some other
requirements specific to Chinese.

The ultimate goal of any search-based access system is to help users
access network resources in more natural ways, and what is natural
differs from one user group to another. For a system to be valuable and
human-friendly while respecting the language conventions of Chinese
users, it is very important that it handle the equivalence of
Traditional and Simplified Chinese.

2. Terminology
The key words "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", "MUST", and
"MAY" in this paper are to be interpreted as described in [RFC2119].

To simplify the discussion, we first define the following terms.

"TC" is an abbreviation for Traditional Chinese.

"SC" is an abbreviation for Simplified Chinese.

"CNS" is defined as an acronym of Chinese Name String that is the most
important facet (the name string) mentioned in [DNSSEARCH] and that
contains at least one Chinese character. For the scope of "Chinese
character", refer to ISO/IEC 10646-1:2000(E) [second edition
2000-09-15]: a character marked "C and G-Hanzi-T" MUST be considered a
Chinese character. This definition does not mean that the character is
not also a character of other countries that use HAN ideographs.

"TC-only CNS" is a CNS in which all characters are TC characters.

"SC-only CNS" is a CNS in which all characters are SC characters.

"Mixed-use TC and SC CNS" is a CNS that contains at least one
Traditional and at least one Simplified Chinese character.

3.      CNS equivalence
The TC/SC equivalence problem is very complex and difficult to solve
perfectly; see [CTCC]. Nevertheless, equivalence between single TC and
SC characters falls mainly into three categories, which can be addressed
one at a time. Once these three categories are handled, most TC/SC
problems are solved, and the result will be acceptable to most Chinese
users.
a)      One to one
E.g. U+98A8 (TC, the wind) can be mapped to U+98CE (SC, the wind)
U+5099 (TC, to prepare) can be mapped to U+5907 (SC, to prepare)
U+908A (TC, a side) can be mapped to U+8FB9 (SC, a side)
b)      One to many
E.g. U+6FF1 (TC, the shore) can be mapped to U+6EE8, U+6D5C (SC, the
shore)
U+53C3 (TC, three, to take part in) can be mapped to U+53C2 (SC, to take
part in), U+53C1 (SC, three)
U+58DF (TC, a ridge or walkway in a field) can be mapped to U+5784,
U+5785 (SC, a ridge or walkway in a field)
c)      Many to one
E.g. U+85F9, U+8B6A (TC, friendly) can be mapped to U+853C (SC, friendly)
U+5225 (TC, to leave), U+5F46 (TC, awkward) can be mapped to U+522B
(SC, to leave, awkward)
U+93DF (TC, a shovel), U+5277 (TC, a shovel) can be mapped to U+94F2 (SC,
a shovel)
The equivalence problem for a whole CNS combines the three categories
above, so it is more complex than the single-character case, but a CNS
can still be processed one character at a time.
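
The three categories and the character-by-character processing described
above can be sketched in a few lines of Python. This is only an
illustration, not a proposed implementation: the table holds nothing but
the example code points listed in this section, and the names TC_TO_SC,
category and candidates are hypothetical. A real table would be far
larger.

```python
# Illustrative TC -> SC table built only from the examples in this
# section. Each TC character maps to a tuple of candidate SC characters.
TC_TO_SC = {
    "\u98a8": ("\u98ce",),            # one to one: the wind
    "\u5099": ("\u5907",),            # one to one: to prepare
    "\u908a": ("\u8fb9",),            # one to one: a side
    "\u6ff1": ("\u6ee8", "\u6d5c"),   # one to many: the shore
    "\u53c3": ("\u53c2", "\u53c1"),   # one to many: to take part in / three
    "\u58df": ("\u5784", "\u5785"),   # one to many: a ridge in a field
    "\u85f9": ("\u853c",),            # many to one: friendly
    "\u8b6a": ("\u853c",),            # many to one: friendly
    "\u5225": ("\u522b",),            # many to one: to leave
    "\u5f46": ("\u522b",),            # many to one: awkward
    "\u93df": ("\u94f2",),            # many to one: a shovel
    "\u5277": ("\u94f2",),            # many to one: a shovel
}


def category(tc_char):
    """Classify a TC character from the table into one of the categories."""
    targets = TC_TO_SC[tc_char]
    if len(targets) > 1:
        return "one to many"
    # Count how many TC characters share the same single SC target.
    sources = [tc for tc, sc in TC_TO_SC.items() if sc == targets]
    return "many to one" if len(sources) > 1 else "one to one"


def candidates(cns):
    """Return the SC candidates for each character of a CNS, in order.

    Characters that are not in the table are passed through unchanged.
    """
    return [TC_TO_SC.get(ch, (ch,)) for ch in cns]
```

For a CNS made of U+98A8 followed by U+6FF1, candidates() yields one
single-element tuple and one two-element tuple, which shows that only the
second character needs a decision.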

4.      Requirements
These requirements SHOULD be considered by any system that supports
Chinese name strings.
a)      TC and SC CNS equivalent matching
SC is derived from TC, and Chinese people use both. Chinese users
therefore regard a TC CNS as equivalent to its corresponding SC forms,
and any implementation should meet this expectation.
b)      Mixed TC and SC CNS cause an exponential problem
To ensure that a CNS resolves correctly in both its TC and SC forms, one
could register all of its forms, but the number of forms grows
exponentially with the length of the label. Most Chinese character
variants are in daily use, so an ordinary Chinese name string may have
dozens, hundreds, or even thousands of TC/SC variants. Registering them
all is unreasonable for users, hard for administrators to manage, and
complex for the system to resolve. Whatever the kind of search-based
access system (flat, hierarchical, centrally controlled, and so on), it
is not reasonable for any administration to process these thousands of
name strings manually.
c)      Some other special requirements
There are many differences of convention between Chinese and English,
for example in the order of the parts of a name string. An English
writer would write "Minneapolis, Minnesota" to represent a location,
whereas a Chinese writer would write "Minnesota, Minneapolis". So if a
search-based access system is permitted to use sequence attributes to
represent delegation or hierarchy, this kind of special requirement
should be satisfied.
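
The exponential growth described in requirement b) can be illustrated
with a short Python sketch. It assumes, for illustration only, that the
forms within a variant group are all interchangeable. The groups reuse
example code points from section 3, and the names VARIANT_GROUPS,
variant_count and all_variants are hypothetical.

```python
from itertools import product

# Hypothetical variant groups: the forms in each set are assumed to be
# interchangeable. The code points are examples taken from section 3.
VARIANT_GROUPS = [
    {"\u98a8", "\u98ce"},
    {"\u5099", "\u5907"},
    {"\u908a", "\u8fb9"},
    {"\u6ff1", "\u6ee8", "\u6d5c"},
]

# Map every character to the sorted list of forms in its group.
FORMS = {ch: sorted(group) for group in VARIANT_GROUPS for ch in group}


def variant_count(cns):
    """Number of TC/SC forms of a CNS, without enumerating them."""
    count = 1
    for ch in cns:
        count *= len(FORMS.get(ch, (ch,)))
    return count


def all_variants(cns):
    """Enumerate every TC/SC form of a CNS (only practical for short CNS)."""
    choices = (FORMS.get(ch, (ch,)) for ch in cns)
    return ["".join(combo) for combo in product(*choices)]
```

Under these assumptions a label of ten characters with two forms each
already has 2^10 = 1024 forms, which is why registering every form by
hand does not scale.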

5.  Solution suggested
As mentioned in [DNSSEARCH], there are many challenges in establishing
Traditional and Simplified Chinese equivalence, because HAN characters
are used not only in China but also in other countries, mostly in Asia.
It should be emphasized first that no method can solve Traditional and
Simplified Chinese equivalence perfectly and correctly; so far, the best
algorithm achieves only about 99%, not 100%. That may be the reason why
no consensus has been reached in the IDN WG.

Search layer two has two facets, language and country code/geographical
location, which are very useful for solving most of these problems.
Based on these two facets, a system configured for a given language and
country code can pick appropriate rules for Traditional and Simplified
Chinese equivalence without any impact on other languages and countries.
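
As a rough sketch of this facet-based selection, a system could keep a
small registry keyed by the two facets. The registry below is purely
hypothetical: the ("CN", "zh-cn") entry follows the example used later in
this section, while the ("TW", "zh-tw") entry and the names RULES and
select_rule are assumptions made for illustration.

```python
# Hypothetical rule registry keyed by (country code, language) facets.
# The "TW"/"zh-tw" entry is an assumed example of a TC area.
RULES = {
    ("CN", "zh-cn"): "TC->SC",
    ("TW", "zh-tw"): "SC->TC",
}


def select_rule(country, language):
    """Return the conversion direction for the given facets.

    Returns None when no Chinese equivalence rule applies, so other
    languages and countries are left untouched.
    """
    return RULES.get((country.upper(), language.lower()))
```

A query carrying facets outside the registry, for example ("US", "en"),
gets no rule at all, which reflects the point that these rules have no
impact on other languages and countries.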

In Mainland China, for the "One to One" and "Many to One" categories,
all TC characters can be converted into SC characters, and the resulting
SC-only CNS saved in the database used for Chinese name string
resolution. The "One to Many" category, however, may depend on context.
A system could handle it with artificial intelligence methods, but
unfortunately even the best such algorithm cannot perform this
conversion completely. In our opinion, this kind of conversion should
not be done in layer two, which should give a definite result from a few
simple facets. It should instead be done in layer three, or resolved
with feedback from the user to determine which name string he or she
wants. User feedback may be requested at conversion time, or the result
cached from a previous conversion may be reused.

E.g.
a)      One to one
{[CN] + [zh-cn] + TC} --> {[CN] + [zh-cn] + SC}
b)      Many to one
{[CN] + [zh-cn] + TC1/TC2/.../TCn}  --> {[CN] + [zh-cn] + SC}
c)      One to many
                       User feedback
{[CN] + [zh-cn] + TC} -------------------> {[CN] + [zh-cn] + SC1/.../SCn}

Finally, every Mixed-use TC and SC CNS should be converted into an
SC-only CNS before resolution, and only SC-only CNS are stored in the
resolution database on the server. Moreover, if "One to Many" conversion
is nevertheless to be implemented in layer two, the TC CNS could be
bound to one of its corresponding SC forms on a "first come, first use"
basis, subject to a principle of reasonableness: the binding process
should avoid binding two unrelated CNS and thereby causing meaningless
equivalent resolution.
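
The conversion to an SC-only CNS, with user feedback for the "One to
Many" case and reuse of a cached result, might look like the following
Python sketch. It is an illustration under stated assumptions, not a
specification: the table contains only a few example code points from
section 3, the choose callback stands in for whatever form user feedback
takes, and the names TC_TO_SC, to_sc_only, choose and cache are
hypothetical.

```python
# Illustrative TC -> SC table using example code points from section 3.
TC_TO_SC = {
    "\u98a8": ("\u98ce",),            # one to one
    "\u85f9": ("\u853c",),            # many to one
    "\u8b6a": ("\u853c",),            # many to one
    "\u6ff1": ("\u6ee8", "\u6d5c"),   # one to many
}


def to_sc_only(cns, choose, cache):
    """Convert a TC-only or mixed-use CNS into an SC-only CNS.

    choose(tc_char, candidates) stands in for user feedback and is only
    called for "one to many" characters. cache maps an input CNS to the
    SC-only form chosen the first time it was converted.
    """
    if cns in cache:
        return cache[cns]          # reuse the result of the last conversion
    out = []
    for ch in cns:
        # Characters not in the table (already SC, or not Chinese) pass through.
        targets = TC_TO_SC.get(ch, (ch,))
        out.append(targets[0] if len(targets) == 1 else choose(ch, targets))
    cache[cns] = "".join(out)
    return cache[cns]
```

Because the cache is keyed by the whole input CNS, the first choice made
for a given CNS is the one that later conversions return, which is one
simple way to approximate "first come, first use" binding. The sketch
does not check that the bound forms are reasonably related.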

As shown above, Mainland China could select conversion rules from TC to
SC, while TC areas could select the opposite rules, from SC to TC. In
this suggestion, user feedback is very important for One to Many
conversion. We only provide a mechanism to resolve CNS correctly; it
permits users to enter an unconventional Mixed-use TC and SC CNS in a
given language and country or area, although in practice this does not
happen very often.

Some people suggest using a fuzziness level to determine matching
precision, letting users select which kind of conversion they want. We
do not think this helps to solve the TC/SC equivalence problem:
Traditional and Simplified Chinese equivalence is not a fuzziness
problem like the other fuzzy matching problems in a search-based access
system. Offering fuzziness levels for Chinese matching would mislead end
users and produce a questionable namespace in layer two. Chinese name
strings should be processed by the same rules at the system level, and
those rules should not depend entirely on user intention.

6.      Phonetic input of Chinese name string
Phonetic input is a very convenient way for users to surf the Internet,
especially for applications on mobile devices that lack a convenient
input device, and many vendors have developed applications for it.
However, this method should be employed carefully in search layer two.

Every language has its own pronunciation, and Chinese is no exception.
PINYIN is the official standard for marking the pronunciation of Chinese
words. Some people once suggested using this romanization to replace
Chinese glyphs, but the idea is advocated only by some academic
scholars, because no one wants to lose the beautiful Chinese glyphs
forever, even though a method has been developed that uses roman PINYIN
with certain numbers to represent any Chinese character.

Although a Simplified Chinese character has the same pronunciation and
PINYIN as its Traditional Chinese form, PINYIN is not very useful for
solving Traditional and Simplified Chinese equivalence problems in layer
two. A name string denotes the agreed label, name, identifier, or other
designation of the network resource itself, and PINYIN cannot be used as
a name string.

7. Encoding
In layer two and in layer three and above, we suggest using UNICODE
directly for the encoding of Chinese characters. Any additional encoding
would increase system complexity and is unreasonable as a long-term
solution.

8. Security Considerations
This document is only a complement to [DNSSEARCH], so it has the same
security considerations. The TC/SC CNS equivalence problem does not
bring any additional security problems into this search-based access
model.

9.      Authors' Addresses
XiaoDong LEE
Chinese Academy of Sciences, CNNIC
4 South 4th Street, ZhongGuanCun, Beijing, P.R. China 100080
Phone: +86 10 62619750 ext. 3020
E-mail: lee@cnnic.net.cn

Kenny Huang
Taiwan Network Information Center (TWNIC)
4F-2, No.9 Sec. 2, Roosevelt Rd., Taipei, 100 Taiwan, R.O.C.
E-mail: huangk@alum.sinica.edu

Erin Chen (also known as Yu Hsuan Chen)
Taiwan Network Information Center (TWNIC)
4F-2, No.9 Sec. 2, Roosevelt Rd., Taipei, 100 Taiwan, R.O.C.
Phone: +886 2 23411313 ext. 502
E-mail: erin@twnic.net.tw

Xiang DENG
China Internet Network Information Center (CNNIC)
4 South 4th Street, ZhongGuanCun, Beijing, P.R. China 100080
Phone: +86 10 62619750 ext. 3018
E-mail: deng@cnnic.net.cn

YanFeng WANG
China Internet Network Information Center (CNNIC)
4 South 4th Street, ZhongGuanCun, Beijing, P.R. China 100080
Phone: +86 10 62619750 ext. 3022
E-mail: wyf@cnnic.net.cn

10.     Acknowledgements
Thanks to the following people for their suggestions and efforts:
HuaLin QIAN <hlqian@cnnic.net.cn>; CAS, CNNIC
Li-Ming Tseng <tsenglm@csie.ncu.edu.tw>; NCU, TWNIC
Wei MAO <mao@cnnic.net.cn>; CNNIC
Wen-Sung Chen <wschen@twnic.net.tw>; TWNIC

11. References
[RFC2119] Scott Bradner, Key words for use in RFCs to Indicate
Requirement Levels, March 1997, RFC 2119.

[STD13]   Paul Mockapetris, Domain names - implementation and
specification, November 1987, STD 13 (RFC 1034 and 1035).

[CTCC]    Jack Halpern, Jouni Kerman, "The Pitfalls and Complexities of
Chinese to Chinese Conversion".

[ISO10646] ISO/IEC 10646-1:2000. International Standard - Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane.

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version3.0",
           ISBN 0-201-61633-5.

[DNSSEARCH] John C. Klensin, "A Search-based access model for the DNS",
            draft-klensin-dns-search-03.txt, May 2001.

[KEYWORD] Yves Arrouye, T. W. Tan, X.D. Lee, "Keywords Systems -
Definition and Requirements", draft-arrouye-keywords-reqs-01.txt.