IETF IDN Working Group Seungik Lee, Hyewon Shin, Dongman Lee
Internet Draft ICU
draft-ietf-idn-icu-00.txt Eunyong Park, Sungil Kim
Expires: 14 January 2001 KKU, Netpia.com
14 July 2000
Architecture of Internationalized Domain Name System
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
1. Abstract
Because the Domain Name System (DNS) restricts domain names to
alphanumeric characters, a way is needed to find an Internet host
using multi-lingual domain names: the Internationalized Domain Name
System (IDNS).
This document describes, from an architectural view, how multi-lingual
domain names are handled in a new protocol scheme for IDNS servers and
resolvers. It updates [RFC1035] but preserves backward compatibility
with the current DNS protocol.
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
"IDNS" (Internationalized Domain Name System) is used here to
indicate a new system designed for a domain name service, which
supports multi-lingual domain names.
"The current/conventional DNS" or "DNS" (Domain Name System) is used
here to indicate the domain name systems currently in use. They
conform to [RFC1034] and [RFC1035], but their implementations and
functional operations may differ from each other.
The "alphanumeric" character data used here is the character set that
is allowed for a domain name in DNS query format, [a-zA-Z0-9-].
3. Introduction
The Domain Name System (DNS) has eliminated the difficulty of
remembering IP addresses. As the Internet spreads to more people, it
becomes more likely that people who are not familiar with alphanumeric
characters will use it. Domain names in alphanumeric characters are
difficult to remember or use for people who have not learned English.
Therefore, a way is needed to find an Internet host using
multi-lingual domain names: the Internationalized Domain Name System.
3.1 The current issues of IDNS
IDNS maps a name to an IP address as the typical DNS does, but it
allows domain names to contain multi-lingual characters. The
multi-lingual characters need to be encoded and decoded in one
standardized format, which requires changes to the conventional DNS
protocol described in [RFC1034] and [RFC1035]. These changes must be
kept to a minimum so that backward compatibility is guaranteed.
The IDNS issues have been discussed in the IETF IDN Working Group and
are described in [IDN-REQ]. The main issues are:
- Compatibility and interoperability. The DNS protocol is widely used
in the Internet. Even after a new protocol is introduced, the current
protocol will continue to be used unchanged. Therefore, IDNS, as a new
design for the DNS protocol, must provide backward compatibility and
interoperability with the current DNS.
- Internationalization. The purpose of IDNS is to allow multi-lingual
domain names. International character data must be represented in one
standardized format in domain names.
- Canonicalization. DNS indexes and matches domain names to look up a
domain name in zone data. In the conventional DNS, canonicalization
applies to US-ASCII only. In IDNS, all multi-lingual character data
must be canonicalized by its own rules to support a standardized DNS
matching policy, e.g. case-insensitive matching.
- Operational issues. IDNS uses international character data for
domain names. Normalization and canonicalization of domain names are
needed in addition to the current DNS operations. IDNS also needs
operations for interoperability with the current DNS. Therefore,
operational guidelines for IDNS need to be specified.
3.2 Overview of the proposed scheme
Our proposed scheme for IDNS addresses the issues described above in
order to fulfill the requirements of IDN [IDN-REQ].
The proposed scheme can be summarized as follows:
- The IN bit, which is reserved and currently unused in the DNS
query/response format header, is used to distinguish between the
queries generated by IDNS servers or resolvers and those of non-IDNS
ones [Oscarsson]. This mechanism is also needed to indicate whether
the query is generated by the appropriate IDNS operations for
canonicalization and normalization or not.
- The multi-lingual domain names are encoded in UTF-8 as the wire
format. [RFC2130] recommends UTF-8 as the default character encoding
scheme (CES) for new protocols that transmit text. This scheme allows
the IDNS server to handle DNS queries from non-IDNS servers or
resolvers because ASCII codes are unchanged in UTF-8.
- The UTF-8 domain names must be case-folded before transmission. This
minimizes the server's overhead for case-insensitive name matching. It
also guarantees that cached query results can be used without any
further normalization and canonicalization. If an IDNS server receives
a non-IDNS query that is not case-folded, it case-folds the query
before transmitting it to other servers.
4. Design considerations
Our proposed scheme is designed to fulfill the requirements of the
IETF IDN WG [IDN-REQ]. Every method used in an IDNS scheme must
satisfy the requirements document. The design described in this
document is based on these requirements.
4.1 Protocol Extensions
To indicate an IDNS query format, we use an unallocated bit in the
current DNS query format header, named the 'IN' bit [Oscarsson]. All
IDNS queries have the IN bit set to 1. Without this bit set to 1, we
cannot guarantee that the query is in the appropriate format for IDNS.
The 'IN' bit indicates whether or not the query is from an IDNS
resolver or server. It also reduces the overhead of the
canonicalization operation at the IDNS server, as described further in
<4.4. Canonicalization>.
We devise new operations and new structures for resolvers and name
servers to add multi-lingual domain name handling to the DNS. DNS
servers and resolvers must therefore be changed in order to use
multi-lingual domain names. The new architectures for resolvers and
servers are described in <5. Architectures>.
4.2 Compatibility and interoperability
In the conventional DNS protocol, the bit position used for the 'IN'
bit is a valid part of the query header and is defined to be zero
[RFC1035]. The operations and structures of IDNS preserve the
conventional rules of DNS to guarantee interoperability with
conventional DNS servers and resolvers, so the changes are optional.
These properties make this scheme for IDNS compatible with the current
protocol.
Although the current DNS protocol uses 7-bit ASCII characters only,
the query format of the current DNS protocol is 8-bit clean.
Therefore, we can guarantee backward compatibility and
interoperability with the current DNS when using UTF-8, because ASCII
codes are preserved unchanged in UTF-8.
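The ASCII-preservation property that this argument relies on can be
checked directly. The following Python sketch is illustrative only;
the sample names are assumptions and are not part of this
specification.

```python
# Illustrative check: ASCII names have identical octets in ASCII and UTF-8,
# while non-ASCII characters use only octets with the high bit set.
ascii_name = "www.example.com"            # hypothetical sample name
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

korean_label = "\ud55c\uad6d"             # two Hangul syllables, sample only
utf8_octets = korean_label.encode("utf-8")

# Every octet of a non-ASCII character is >= 0x80, so it can never be
# mistaken for an ASCII letter, digit, or hyphen.
all_high = all(octet >= 0x80 for octet in utf8_octets)
print(utf8_octets.hex(), all_high)
```

Because the two octet ranges do not overlap, a name that contains only
ASCII characters is transmitted exactly as it is today.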
Note: Some implementations in use are compatible with the current DNS
but extend their operations to use UTF-8 domain names. The IDNS
described here interoperates with these implementations, as described
in <5.4 Interoperability with the current DNS>.
4.3 Internationalization
All international character data must be represented in one
standardized format, and the standardized format must be compatible
with the current ASCII-based protocols. Therefore, the coded
character set (CCS) for the IDNS protocol must be Unicode [UNICODE],
and it must be encoded using the UTF-8 [RFC2279] character encoding
scheme (CES).
The client-side interface may accept domain names encoded in any
local character set, Unicode, ASCII, and so on, but they must be
converted to Unicode before being passed to the IDNS resolver. The
IDNS resolver accepts Unicode character data only and finally converts
it to UTF-8 for transmission.
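As a minimal sketch of this conversion path, assuming a client that
supplies a label in the EUC-KR local character set, the following
Python fragment decodes the input to Unicode and then encodes it in
UTF-8. The function name to_wire_octets and the sample label are
hypothetical, and the case-folding step of <4.4. Canonicalization> is
omitted here.

```python
def to_wire_octets(raw: bytes, local_charset: str) -> bytes:
    """Decode from a local charset to Unicode, then encode to UTF-8."""
    text = raw.decode(local_charset)      # local charset -> Unicode
    return text.encode("utf-8")           # Unicode -> UTF-8 wire octets

# Hypothetical input: a Hangul label as a EUC-KR application might supply it.
euc_kr_input = "\ud55c\uad6d".encode("euc_kr")
wire = to_wire_octets(euc_kr_input, "euc_kr")
print(wire.hex())
```

The first line of the function corresponds to the client-side
interface and the second to the IDNS resolver.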
4.4 Canonicalization
In the current DNS protocol, domain names are matched
case-insensitively. Therefore, the domain names in a query and in a
zone file must be case-folded before the equivalence test.
The case-folding issue has been discussed for a long time in the IETF
IDN WG. The main problem is that case folding can be locale-dependent:
the same character can fold to different characters in different
locales. For example, Latin capital letter I (U+0049) case-folded to
lower case in the Turkish context becomes Latin small letter dotless i
(U+0131), but in the English context it becomes Latin small letter i
(U+0069).
Therefore, our new IDNS design case-folds domain names in a
locale-independent way, using the method defined in [UTR21].
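The effect of locale-independent folding on the example above can be
sketched in Python. Python's str.casefold() is used here only as a
stand-in for the [UTR21] mapping table; that substitution is an
assumption of this sketch and is not mandated by this document.

```python
# str.casefold() applies a locale-independent Unicode case folding,
# used here as a stand-in for the [UTR21] mapping table (an assumption).
folded_i = "I".casefold()
print(hex(ord(folded_i)))          # U+0069 regardless of the process locale

# Dotless i (U+0131) is a distinct character and stays distinct after
# folding, so a Turkish-style mapping of "I" would not match this result.
dotless = "\u0131"
print(dotless.casefold() == folded_i)
```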
Multi-lingual domain names should be case-folded in IDNS resolvers or
IDNS servers before being transmitted to other IDNS/DNS servers. That
is, an IDNS resolver should case-fold the domain name and convert it
to UTF-8 before transmission. If an IDNS server receives a query with
the IN bit set to 1, it does not need to canonicalize the
multi-lingual domain name again. If the IDNS server receives a query
with the IN bit set to 0, it cannot determine whether the query is in
the appropriate canonicalized format, so it case-folds the
multi-lingual domain name in the query and sets the 'IN' bit to 1.
The current DNS keeps the original case of domain names in queries and
responses. To be consistent with this rule, IDNS resolvers and servers
should store each multi-lingual domain name before case-folding it and
should restore the original form before sending the response.
To case-fold UTF-8 data with the method in [UTR21], the UTF-8 data
must first be converted to Unicode code points, which are then looked
up in the mapping table. If a case-folding mapping table were built
directly for UTF-8 character data, this overhead could be reduced.
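The decode, fold, and re-encode sequence described above might look as
follows in Python. The function name casefold_utf8 and the sample name
are hypothetical, and str.casefold() again stands in for the [UTR21]
table.

```python
def casefold_utf8(name_octets: bytes) -> bytes:
    """Fold a UTF-8 encoded name: decode, fold, re-encode."""
    text = name_octets.decode("utf-8")    # UTF-8 -> Unicode code points
    folded = text.casefold()              # stand-in for the [UTR21] table
    return folded.encode("utf-8")         # Unicode -> UTF-8

sample = "B\u00dcCHER.Example".encode("utf-8")   # hypothetical mixed-case name
print(casefold_utf8(sample))
```

The two conversions around the folding step are the overhead that a
table keyed directly on UTF-8 octets would avoid.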
However, some overhead for canonicalization in IDNS servers cannot be
avoided, because the canonicalization of international character data
is complicated.
To minimize this overhead, we use the 'IN' bit to indicate that the
query has already been canonicalized, so that no further
canonicalization is needed. The detailed operations according to the
'IN' bit are described later in <5. Architectures>.
With international character data, canonicalization (e.g.
case-folding) is much more complicated than with US-ASCII, and it
differs from one locale context to another.
This document does not specify any method or recommendation beyond
case-folding. For canonicalization of international character data,
[UTR15] is a good starting point. The topic must be discussed further
and specified in the IDNS protocol specification.
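To illustrate why normalization matters for name matching, the
following Python sketch compares two spellings of the same character
before and after applying a [UTR15] normalization form. The choice of
form NFC is an assumption made only for this example; this document
does not select a form.

```python
import unicodedata

composed = "\u00e9"            # e with acute as one code point
decomposed = "e\u0301"         # e followed by a combining acute accent

# The two spellings differ code point for code point, so a naive match fails.
print(composed == decomposed)

# After normalization to one form (NFC chosen here as an assumption),
# both spellings compare equal.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
```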
4.5 Operational issues
The current DNS scheme uses only ASCII as its wire format, whereas our
new IDNS scheme uses UTF-8. All IDNS resolvers must transmit queries
that are encoded in UTF-8 and case-folded. This format is indicated by
the IN bit: if the IN bit is set to 1, the query is encoded in UTF-8
and case-folded. Otherwise, the IDNS server cannot be sure that the
query is encoded in UTF-8 and case-folded, so it must perform the
additional operations itself, such as encoding to UTF-8 and
case-folding.
The current DNS resolvers transmit queries in ASCII. This is not a
problem for IDNS servers because ASCII codes are preserved unchanged
in UTF-8.
Some applications and resolvers transmit queries in UTF-8 although
they do not follow the structure of the new IDNS resolvers, e.g.
Microsoft's DNS servers. We cannot guarantee that those queries are
case-folded correctly. Therefore, in that case the IDNS server, rather
than the resolver, should convert them to appropriate IDNS queries.
All detailed operations of IDNS servers and resolvers are described
in <5. Architectures>.
5. Architectures
5.1 New header format
New IDNS servers and resolvers must interoperate with those of the
current DNS. Therefore, we need a way to determine whether or not a
query is an IDNS query. For this reason, we use a new header format as
proposed in [Oscarsson].
                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The IDNS resolvers and servers identify themselves in a query or a
response by setting the 'IN' bit to 1 in the DNS query/response
format header. This bit is defined to be zero by default in the
current DNS servers and resolvers.
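A minimal Python sketch of setting and testing the 'IN' bit follows.
It assumes the bit numbering of the diagram above, where bit 0 is the
most significant bit of each 16-bit word, so the 'IN' bit at position
9 has the mask 0x0040. The function names are hypothetical.

```python
import struct

# Flag masks for the second 16-bit header word. Bit 0 in the diagram is the
# most significant bit, so diagram position 9 has the mask 1 << (15 - 9).
IN_MASK = 1 << (15 - 9)     # 0x0040, the 'IN' bit proposed in this document
RD_MASK = 1 << (15 - 7)     # 0x0100, recursion desired [RFC1035]

def pack_header(query_id: int, flags: int, qdcount: int = 1) -> bytes:
    """Pack the 12-octet header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT."""
    return struct.pack("!6H", query_id, flags, qdcount, 0, 0, 0)

def has_in_bit(header: bytes) -> bool:
    """Return True when the 'IN' bit is set in a packed header."""
    (flags,) = struct.unpack("!H", header[2:4])
    return bool(flags & IN_MASK)

idns_header = pack_header(0x1234, RD_MASK | IN_MASK)
legacy_header = pack_header(0x1234, RD_MASK)
print(has_in_bit(idns_header), has_in_bit(legacy_header))
```

A header built by a current DNS implementation leaves this bit at
zero, so has_in_bit() returns False for it without any change on the
sender's side.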
5.2 Structures of IDNS resolvers
To use multi-lingual domain names with IDNS servers, all IDNS/DNS
resolvers must generate queries in UTF-8 or ASCII. Resolver designs
may differ from each other according to the local operating system or
application. We propose new design guidelines for a resolver as a
basis for standardization.
The IDNS resolver accepts domain names in Unicode from the user
interface; other character sets should be rejected. It encodes all
such character data into UTF-8 for transmission to name servers.
An IDNS resolver operates as follows:
<1>. If the resolver receives a domain name in Unicode or ASCII, it
stores the original domain name. Otherwise, the lookup request is
rejected. In the current DNS protocol, the original case of the domain
name should be preserved. Therefore, the resolver must store the
original case of the domain name before canonicalization (e.g.
case-folding).
<2>. Case-fold the domain name with the locale-independent
case-mapping table defined in [UTR21].
<3>. Convert it to UTF-8.
<4>. Set the IN bit to 1. This indicates that the query is from an
IDNS resolver and that the name is in UTF-8 and case-folded.
<5>. Send the query to name servers.
<6>. Restore the original domain name in the response that is
received.
<7>. Send the response to the application.
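Steps <1> to <4> of the resolver procedure can be sketched in Python
as follows. This is an illustration under stated assumptions, not a
reference implementation: the function names are hypothetical,
str.casefold() stands in for the [UTR21] table, and network
transmission (steps <5> to <7>) is left out.

```python
import struct

IN_MASK = 0x0040   # 'IN' bit, diagram position 9 of the flags word
RD_MASK = 0x0100   # recursion desired [RFC1035]

def encode_name(name: str) -> bytes:
    """Encode a dotted name as length-prefixed UTF-8 labels [RFC1035]."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        octets = label.encode("utf-8")              # step <3>
        if not 0 < len(octets) <= 63:
            raise ValueError("label must be 1..63 octets")
        out.append(len(octets))
        out += octets
    out.append(0)                                   # root label
    return bytes(out)

def build_idns_query(name: str, query_id: int, qtype: int = 1):
    """Return (original_name, query_octets) for steps <1> to <4>."""
    original = name                                 # step <1>: keep original
    folded = name.casefold()                        # step <2>: fold
    flags = RD_MASK | IN_MASK                       # step <4>: set 'IN' bit
    header = struct.pack("!6H", query_id, flags, 1, 0, 0, 0)
    # QTYPE as given, QCLASS 1 (the Internet class of [RFC1035]).
    question = encode_name(folded) + struct.pack("!2H", qtype, 1)
    return original, header + question

original, query = build_idns_query("WWW.Example.COM", 0x0001)
print(original, query.hex())
```

The resolver keeps the returned original name so that it can restore
it in the response in step <6>.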
5.3 Structures of IDNS servers
The operation of an IDNS server is similar to that of a current DNS
server, but the IDNS server additionally accepts UTF-8 queries and
converts them to the appropriate format.
The IDNS server distinguishes IDNS queries from DNS queries by
checking the IN bit in the query/response format header, and it
operates differently according to the 'IN' bit.
An IDNS server operates as follows:
<1>. If the IN bit in the query/response format header is set to 1,
the server matches the domain name against its zone file data or
forwards the query for resolution. It operates in the same way as a
current DNS server, except that it handles UTF-8 data. In this case,
it does not need to canonicalize the domain name because the name has
already been canonicalized by an IDNS resolver or another IDNS server.
Go to step <7>.
<2>. Otherwise, set the IN bit to 1.
<3>. Store the original domain name.
<4>. Case-fold the domain name with the locale-independent
case-mapping table defined in [UTR21].
<5>. Match the domain name against the zone file data or forward the
query for lookup.
<6>. Restore the original domain name in the response.
<7>. Send the response to the resolver or server that sent the query.
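The branching on the 'IN' bit in the server procedure can be sketched
in Python as follows. The function handle_query, the dictionary-based
zone, and the sample address are hypothetical. Forwarding to other
servers is omitted, and str.casefold() stands in for the [UTR21]
table.

```python
def handle_query(name_octets: bytes, in_bit: bool, zone: dict):
    """Return (in_bit_for_response, name_to_echo, answer) per steps <1> to <7>."""
    if in_bit:
        # Step <1>: already canonicalized upstream; match the octets as-is.
        return True, name_octets, zone.get(name_octets)
    original = name_octets                          # step <3>
    # Step <4>: fold; casefold() stands in for the [UTR21] table.
    folded = name_octets.decode("utf-8").casefold().encode("utf-8")
    answer = zone.get(folded)                       # step <5>
    # Steps <2> and <6>: mark as canonicalized, echo the original spelling.
    return True, original, answer

# Hypothetical zone keyed by case-folded UTF-8 names.
zone = {b"www.example.com": "192.0.2.1"}
print(handle_query(b"WWW.Example.COM", False, zone))
print(handle_query(b"www.example.com", True, zone))
```

In this sketch a query that arrives with the 'IN' bit set is matched
without any folding, which is where the saving in server overhead
comes from.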
5.4 Interoperability with the current DNS
DNS servers and resolvers accept domain names in ASCII only, whereas
IDNS servers and resolvers accept domain names in UTF-8. Queries from
DNS servers and resolvers to IDNS ones are handled correctly because
UTF-8 is a superset of ASCII. However, queries from IDNS servers and
resolvers to DNS ones that contain non-ASCII characters will be
rejected because such UTF-8 octets are outside the range of ASCII.
Note: Some implementations can handle UTF-8 domain names although they
do not follow this specification of IDNS and are implemented entirely
according to the DNS protocol specification, e.g. Microsoft's DNS
server and resolvers. In this case, we cannot guarantee that queries
from these third-party implementations are encoded in UTF-8 and
properly canonicalized. However, these queries have the 'IN' bit set
to 0, so the IDNS server checks whether or not the domain name is
valid UTF-8, converts it to UTF-8 if necessary, and finally
canonicalizes it.
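The check that a name received with the 'IN' bit set to 0 is valid
UTF-8 could be sketched in Python as follows. The function name
canonicalize_legacy_name is hypothetical. How a server should treat
octets that are not valid UTF-8 is not settled by this sketch, which
simply returns None in that case.

```python
def canonicalize_legacy_name(name_octets: bytes):
    """Return folded UTF-8 octets, or None when the input is not valid UTF-8."""
    try:
        text = name_octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None    # not UTF-8; further handling is left open here
    return text.casefold().encode("utf-8")

utf8_query = "Caf\u00c9".encode("utf-8")            # hypothetical UTF-8 input
euc_kr_query = "\ud55c\uad6d".encode("euc_kr")      # local-charset input
print(canonicalize_legacy_name(utf8_query))
print(canonicalize_legacy_name(euc_kr_query))
```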
6. Security Considerations
This architecture of IDNS transmits 8-bit clean queries and handles
UTF-8 instead of ASCII. The DNS protocol already provides an 8-bit
query format for domain names. Therefore, the IDNS protocol inherits
the security issues of the current DNS.
Canonicalization for IDNS is defined in [UTR15] and case folding in
[UTR21]. All security issues related to canonicalization or
normalization are inherited from those described in [UTR15] and
[UTR21].
As always with data, if software does not check for data that can
cause a problem, security may be affected. Because more characters
than ASCII are allowed, software that expects only ASCII and performs
no checks may now have security problems.
7. References
[IDN-REQ] James Seng, "Requirements of Internationalized Domain
Names," Internet Draft, June 2000
[KWAN] Stuart Kwan, "Using the UTF-8 Character Set in the
Domain Name System," Internet Draft, February 2000
[Oscarsson] Dan Oscarsson, "Internationalisation of the Domain Name
Service," Internet Draft, February 2000
[RFC1034] Mockapetris, P., "Domain Names - Concepts and
Facilities," STD 13, RFC 1034, USC/ISI, November 1987
[RFC1035] Mockapetris, P., "Domain Names - Implementation and
Specification," STD 13, RFC 1035, USC/ISI, November
1987
[RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels," RFC 2119, March 1997
[RFC2130] C. Weider et al., "The Report of the IAB Character Set
Workshop held 29 February - 1 March 1996," RFC 2130,
Apr 1997.
[RFC2279] F. Yergeau, "UTF-8, a transformation format of ISO
10646," RFC 2279, January 1998
[RFC2535] D. Eastlake, "Domain Name System Security Extensions,"
RFC 2535, March 1999
[UNICODE] The Unicode Consortium, "The Unicode Standard - Version
3.0," http://www.unicode.org/unicode/
[UTR15] M. Davis and M. Duerst, "Unicode Normalization Forms",
Unicode Technical Report #15, Nov 1999,
http://www.unicode.org/unicode/reports/tr15/
[UTR21] Mark Davis, "Case Mappings," Unicode Technical Report
#21, May 2000,
http://www.unicode.org/unicode/reports/tr21
8. Acknowledgments
Kyoungseok Kim <gimgs@asadal.cs.pusan.ac.kr>
Chinhyun Bae <piano@netpia.com>
9. Authors' Addresses
Seungik Lee
Email: silee@icu.ac.kr
Hyewon Shin
Email: hwshin@icu.ac.kr
Dongman Lee
Email: dlee@icu.ac.kr
Information & Communications University
58-4 Whaam-dong Yuseong-gu Taejon, 305-348 Korea
Eunyong Park
Email: eunyong@eunyong.pe.kr
Konkuk University
93-1 Mojindong, Kwangjin-ku Seoul, 143-701 Korea
Sungil Kim
Email: clicky@netpia.com
Netpia.com
35-1 8-ga Youngdeungpo-dong Youngdeungpo-gu Seoul, 150-038 Korea