Internet Draft Dan Oscarsson
draft-oscarsson-i18ndns-00.txt Telia ProSoft
Updates: RFC 2181, 1035, 1034, 2535 25 February 2000
Expires: 25 August 2000
Internationalisation of the Domain Name Service
Status of this memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
There is a very strong world-wide desire to use characters other than
ASCII in the DNS, especially in domain names.
This document updates the Domain Name System (DNS) [RFC1035] in a way
that is compatible with the current DNS and specifies how
international characters are handled.
1. Introduction
There is an immediate need to use international (non-ASCII)
characters in the DNS. This rules out extending the DNS protocol
itself, as that would take too long. Instead, the current ASCII-only
handling needs to be extended to non-ASCII in a way that can be used
without updating current software.
The basic handling of character data in DNS has several properties
Dan Oscarsson Expires: 25 August 2000 [Page 1]
Internet Draft Internationalisation of DNS 25 February 2000
that need to be preserved:
- The DNS itself places only one restriction on the particular
labels that can be used to identify resource records. That one
restriction relates to the length of the label and the full name.
The length of any one label is limited to between 1 and 63 octets.
A full domain name is limited to 255 octets (including the
separators). [RFC2181]
- Any binary string whatever can be used as the label of any
resource record. Similarly, any binary string can serve as the
value of any record that includes a domain name as some or all of
its value (SOA, NS, MX, PTR, CNAME, and any others that may be
added). Implementations of the DNS protocols must not place any
restrictions on the labels that can be used. In particular, DNS
servers must not refuse to serve a zone because it contains labels
that might not be acceptable to some DNS client programs.
[RFC2181]
- Names must be compared with case-insensitivity. [RFC1035]
- The original case should be preserved when possible as data is
entered into the system. This also implies that responses should
preserve case when possible. [RFC1035] Some of the reasons for
this are:
+ Domain names are used for many purposes.
+ One use is for company names or trademarks. Companies and
trademarks very commonly use a combination of upper and lower
case to enhance the image of the name. Many of them would
prefer that when you, for example, look up the domain name for
an IP address, the correct case is returned.
+ Another is the e-mail address defined in the SOA record.
While many systems now do a case-insensitive comparison on
the user name part of the e-mail address, there may still be
those that don't. Here too, e-mail addresses can be made
more readable by mixing upper and lower case.
+ If you look up a host name from an IP address you may want to
compare the host name with other data. Many services under
Unix do this, and many of them are not case-insensitive, so
they need the correct case returned.
+ There may be other uses of domain names that require them to
be unchanged.
- The characters in the ASCII character set must still be encoded as
ASCII.
This document specifies the updates needed to the DNS protocol, user
interface issues and the effect on other protocols. It is intended to
fulfil the requirements for internationalised domain names which are
currently being worked on by the IDN working group.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
2. The DNS Protocol
The DNS protocol is used when communicating between DNS servers and
other DNS servers or DNS clients. User interface issues like the
format of zone files or how to enter or display domain names are not
part of the protocol.
The update of the protocol defined here can be used immediately as it
is fully compatible with the DNS of today.
2.1 Internationalisation aware software
Internationalisation aware DNS software (i18n aware) is software that
follows the rules for handling international text as defined here.
Only i18n aware software will get all requirements fulfilled.
Referring to section 4.1.1 in [RFC1035] and section 6.1 in [RFC2535],
the DNS query/response format header is updated by allocating the
last unallocated bit in the header. This bit is defined to be zero
in old servers and resolvers. For a description of all fields, see the
sections in the above RFCs.
                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
I18n aware software identifies itself in a query or a response by
setting the IN bit in the DNS query/response format header. As this
bit is defined to be zero in old servers and resolvers they identify
themselves as non-i18n aware.
I18n aware software MUST set the IN bit in both queries and
responses.
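As a minimal sketch of the header update (an illustration, not part of
the specification), the following Python fragment packs the 12-octet
header and sets or tests the IN bit. The names IN_BIT, build_header and
is_i18n_aware are assumptions made for this example. The mask 0x0040
follows from IN being bit 9 of the second 16-bit word, counting bit 0 as
the most significant bit.

```python
import struct

# Masks for the second 16-bit word of the header (bit 0 = most significant).
RD_BIT = 0x0100
IN_BIT = 0x0040  # the previously unallocated bit, between RA and AD


def build_header(msg_id, flags, qdcount=1, ancount=0, nscount=0, arcount=0):
    """Pack a DNS header, always setting IN as i18n aware software must."""
    return struct.pack("!6H", msg_id, flags | IN_BIT,
                       qdcount, ancount, nscount, arcount)


def is_i18n_aware(message):
    """Return True if the IN bit is set in a received query or response."""
    (flags,) = struct.unpack_from("!H", message, 2)
    return bool(flags & IN_BIT)


header = build_header(0x1234, RD_BIT)
print(header.hex())  # prints 123401400001000000000000
```

A header produced by old software has this bit cleared, so
is_i18n_aware() returns False for it without any extra negotiation.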
Note: EDNS [RFC2671] is not used because:
- The update should work with the current pre-i18n DNS software.
- I18n aware software should not need to send any additional
requests.
2.2 Character data
Character data needs to be able to represent as many as possible of
the characters in the world as well as being compatible with ASCII.
It must also be well defined so that it can easily be handled, and
should be compact as only 63 octets are available per label without
an extension of the protocol.
Therefore character data used in the DNS protocol MUST:
- Use ISO 10646 (UCS) [ISO10646] as coded character set.
- Be normalised using form C as defined in Unicode technical report
#15 [UTR15].
- Be encoded using the UTF-8 [RFC2279] character encoding scheme.
The only exception to the above rules is in the interoperability
with non-i18n aware DNS software, as defined later.
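The three requirements can be sketched in Python as follows. The
function name encode_label is hypothetical. Note also that Python's
unicodedata module implements form C for the Unicode version bundled
with the interpreter, which is newer than the version referenced in this
document, so this is an approximation rather than a conforming
implementation.

```python
import unicodedata

MAX_LABEL_OCTETS = 63  # label limit restated from [RFC2181]


def encode_label(text):
    """Normalise a label to form C and encode it as UTF-8 octets."""
    nfc = unicodedata.normalize("NFC", text)
    octets = nfc.encode("utf-8")
    if not 1 <= len(octets) <= MAX_LABEL_OCTETS:
        raise ValueError("label must be 1 to 63 octets, got %d" % len(octets))
    return octets


# "A" followed by COMBINING RING ABOVE composes to U+00C5 under form C,
# which UTF-8 encodes as the two octets c3 85.
print(encode_label("A\u030a").hex())  # prints c385
```

The length check is done on the encoded octets, not on the number of
characters, since the 63 octet limit applies to the wire format.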
2.2.1 Down coding
As a local character set may not support all of the characters of UCS
used internally in DNS, a way to encode unsupported characters into
the local character set is needed. That way a domain name can be used
even if the local character set cannot represent all characters in a
name. By setting the local character set to ASCII we get domain names
that are allowed in non-i18n aware software.
This will be done by down coding UTF-8 into the local character set.
It is done as follows:
- If a character can be represented in the local character set, map
it from UCS to local character set.
- If a character cannot be represented in the local character set,
map the UTF-8 octet sequence for the character to a hyphen ("-")
followed by the hex code of each octet as two characters per
octet.
- If down coding was needed because not all characters could be
represented in the local character set, all original hyphens MUST
be replaced by two hyphens ("--") and the entire string MUST end
with a single hyphen.
Examples:
If we have the name Ab-<a with ring above>r<greek omega>z, it is
represented in DNS as UTF-8:
(HEX) 41 62 2d c3 a5 72 cf 89 7a
If the local character set is ISO 8859-1, the down coded name is:
Ab--<a with ring above>r-cf89z-
If the local character set is ASCII, the down coded name is:
Ab---c3a5r-cf89z-
Note: In other formats like HTML, unsupported characters are handled
like: &number; (prefix, code point value and terminator). The above
format is chosen because it only needs a prefix (the length is
defined by the UTF-8 encoding so a terminator is not needed) and can
easily be checked for a valid sequence.
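The down coding steps can be sketched in Python as follows. This is an
illustration under stated assumptions, not a normative algorithm: the
function name down_code is made up for this sketch, Python codec names
stand in for the local character set, and the sample name (containing
U+00E5 and U+20AC EURO SIGN) is chosen here and is not one of the
examples in this document.

```python
def down_code(name, charset="ascii"):
    """Down code a name into a local character set (hypothetical helper)."""
    def representable(ch):
        try:
            ch.encode(charset)
            return True
        except UnicodeEncodeError:
            return False

    if all(representable(ch) for ch in name):
        return name  # no down coding needed, the name is used as it is

    parts = []
    for ch in name:
        if ch == "-":
            parts.append("--")  # original hyphens are doubled
        elif representable(ch):
            parts.append(ch)
        else:
            # a hyphen followed by two hex digits per UTF-8 octet
            parts.append("-" + ch.encode("utf-8").hex())
    return "".join(parts) + "-"  # the trailing hyphen marks down coding


name = "Ab-\u00e5r\u20acz"
print(down_code(name, "ascii"))  # prints Ab---c3a5r-e282acz-
```

With "latin-1" as the local character set, U+00E5 is kept as it is and
only the euro sign is replaced by its hex form.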
2.2.2 Up coding
When character data is entered into i18n aware DNS software, it must
be up coded from the down coding format into UTF-8. A down coded name
is identified by a trailing hyphen. When up coding, a name containing
an invalid UTF-8 sequence should be left as it is, as it may be an old
name with a trailing hyphen.
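A sketch of up coding in Python, under the following assumptions: the
function name up_code is hypothetical, the length of each escaped
sequence is taken from its UTF-8 lead octet, and only sequences of up to
four octets are accepted because that is what Python's UTF-8 decoder
supports. Any name that does not parse cleanly is returned unchanged.

```python
def _sequence_length(lead):
    """Number of octets in a UTF-8 sequence, judged from its lead octet."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # not a valid lead octet for a non-ASCII character


def up_code(text):
    """Reverse down coding; leave the name alone if it is not down coded."""
    if not text.endswith("-"):
        return text
    body = text[:-1]  # drop the trailing marker hyphen
    out = []
    i = 0
    while i < len(body):
        if body[i] != "-":
            out.append(body[i])
            i += 1
        elif body.startswith("--", i):
            out.append("-")  # a doubled hyphen is one original hyphen
            i += 2
        else:
            try:
                count = _sequence_length(int(body[i + 1:i + 3], 16))
                digits = body[i + 1:i + 1 + 2 * count]
                if count == 0 or len(digits) != 2 * count:
                    return text
                out.append(bytes.fromhex(digits).decode("utf-8"))
            except ValueError:
                return text  # probably an old name with a trailing hyphen
            i += 1 + 2 * count
    return "".join(out)


print(up_code("old-name-"))  # prints old-name-
```

Scanning left to right keeps the format unambiguous: a hyphen is either
followed by another hyphen (a literal hyphen) or by the hex digits of
exactly one UTF-8 sequence.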
2.3 Domain name matching
One of the most difficult areas of internationalisation is deciding
which names are equivalent to one another. For ASCII this was easily
solved by case-insensitivity. It is also easily solved for many other
Latin based alphabets. But when you look at the whole world you get a
mixture of rules, some conflicting, including case-insensitivity,
half width/full width, final/non-final forms and much more.
This type of matching will be called "equivalence matching"
hereafter.
2.3.1 Equivalence matching rules
To compare two domain names, both names must first be mapped to a
format where all equivalent characters are mapped to one character so
that the names then can be binary compared. This mapping is done
from the original UCS normalised form C format, by case folding to
lower case followed by additional normalisation and simplification.
Folding to lower case MUST be done by following the one to one
mapping as defined in the Unicode 3.0 Character Database [UDATA].
Additional folding will probably also be done, but this has not been
agreed on yet. For normalisation Unicode 3.0 defines a normalisation
form KC [UTR15] that is a good start, but more is needed. More about
case folding to lower case is available in Unicode Technical Report
21 [UTR21].
Additional folding, normalisation and simplification will be defined
here or in a separate document at a later stage.
Note: Turkish rules fold upper case I to dotless i, not to the dotted
i used in ASCII and in the case mapping above. Turkish names with
dotless i will therefore always have to be entered in lower case.
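The two steps that are already agreed on (form C, then one-to-one
folding to lower case) can be sketched in Python as below. The names
simple_lower and match_key are hypothetical. Python's str.lower() uses
the case data of the Unicode version bundled with the interpreter and
may return more than one character, so the sketch keeps a character
unchanged in that case as an approximation of a one-to-one mapping. The
additional folding and simplification, which are not yet defined, are
not attempted.

```python
import unicodedata


def simple_lower(ch):
    """Lower case a character only when the result is a single character."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def match_key(name):
    """Map a name to octets that can be compared with a binary compare."""
    nfc = unicodedata.normalize("NFC", name)
    return "".join(simple_lower(ch) for ch in nfc).encode("utf-8")


# Upper case I folds to dotted i, so it does not match dotless i (U+0131).
print(match_key("I") == match_key("\u0131"))  # prints False
```

Two names are then equivalent when their keys are equal octet for octet,
which is what lets the comparison itself stay a plain binary compare.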
2.3.2 Matching of domain names in DNS servers
To be able to handle correct domain name matching in lookups, the
following MUST be followed by DNS servers:
- Do matching on authoritative data using the full name equivalence
matching needed for the characters used in the data.
- On non-authoritative data, either do binary matching or
case-insensitive matching on ASCII letters and binary matching on all
others.
- Implement the equivalence matching rules as defined above. Local
variations are not allowed.
The effect of the above is:
- Only servers handling authoritative data must implement equivalence
matching of names, and they need only implement the subset needed
for the subset of characters of UCS they support in their
authoritative zones.
- Lookups are normally fast because data is usually sent like:
resolver <-> server <-> authoritative server.
While full equivalence matching can be complex and CPU consuming,
the server in the middle will do caching with only simple and fast
binary matching. So complex matching rules should not slow down
DNS very much.
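For the non-authoritative case, the cheaper of the two permitted
behaviours beyond pure binary matching can be sketched as below. The
function name cache_key is an assumption for this example. Only the
octets for ASCII "A" to "Z" are folded, and every other octet, including
all octets of multi-octet UTF-8 sequences, is compared as it is.

```python
def cache_key(label):
    """Fold ASCII upper case letters only; other octets stay binary."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


upper = "\u00c5".encode("utf-8")  # octets c3 85
lower = "\u00e5".encode("utf-8")  # octets c3 a5

print(cache_key(b"WWW") == cache_key(b"www"))  # prints True
print(cache_key(upper) == cache_key(lower))    # prints False
```

The second comparison shows the cost of this shortcut: a caching server
may hold two entries for names that an authoritative server treats as
equivalent, but it never returns an answer for the wrong name.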
2.4 Interoperability between i18n aware and non-i18n aware DNS software
While the current non-i18n aware DNS software MUST allow UTF-8
encoded domain names (if it follows RFC 1035 and RFC 2181), a lot of
software using DNS may not (for example SMTP). To not break all the
old software only expecting or allowing ASCII in domain names, the
following rules MUST be followed by an i18n aware DNS server:
- A query with the IN bit set is assumed to be from i18n aware
software.
- A query with domain names having valid non-ASCII UTF-8 characters
is assumed to be from i18n aware software even if the IN bit is
not set. (This is because the query may have been sent from an
i18n aware resolver through a non-i18n aware server.)
- Always down code (see above) the UTF-8 names into ASCII before
sending them when responding to non-i18n aware software.
- Never have down coded names in the response when responding to
i18n aware software.
- Always check for down coded names in requests and up code them.
- Not perform zone transfers to non-i18n aware software if the zone
contains non-ASCII.
- Return the server failure error if a label cannot be down coded and
fit in the 63 octets allowed.
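The first two rules, which decide whether a query is treated as coming
from i18n aware software, can be sketched as below. The function name
is_i18n_aware_query and its parameters (the 16-bit flags word and the
query name as a list of label octet strings) are assumptions for this
example, and Python's UTF-8 decoder is used as the validity check.

```python
IN_BIT = 0x0040  # position of the IN bit in the header flags word


def is_i18n_aware_query(flags, labels):
    """Classify a query from its flags word and its list of raw labels."""
    if flags & IN_BIT:
        return True
    for label in labels:
        if any(octet >= 0x80 for octet in label):
            try:
                label.decode("utf-8")
            except UnicodeDecodeError:
                continue  # non-ASCII octets, but not valid UTF-8
            return True  # valid non-ASCII UTF-8 implies i18n aware
    return False


relayed = ["\u00e5".encode("utf-8"), b"example"]
print(is_i18n_aware_query(0x0100, relayed))  # prints True
```

The sample query has the IN bit clear but carries a valid UTF-8 label,
which is the case of an i18n aware resolver behind an old server.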
An i18n aware DNS resolver MUST:
- Up code any down coded names before sending them using the DNS
protocol.
- Up code any down coded names received in a response.
The result of this is:
- Old software gets an ASCII only domain name using only the old set
of allowed characters.
- Both i18n aware DNS servers and resolver software must handle up
coding of domain names.
- Domain names used from old software will work in other protocols
only allowing ASCII names.
- We may get old software that is never fixed as it still works.
- We do not get rid of the user-unfriendly practice of encoding
everything in ASCII that many non-ASCII users complain about.
Note: As a non-i18n aware DNS server only understands matching using
ASCII case-insensitivity, it may cache i18n responses as different
even though they are i18n equivalent. This will result in more data
being cached but will not give invalid responses.
2.5 DNSSEC
DNSSEC [RFC2535] is complex and has not yet been fully studied,
especially the canonical DNS name order and the signing of RRsets.
The canonical DNS name order sorts names with letters as lower case.
In i18n this means to fold to lower case, normalise and simplify as
is done in lookups. This would mean that only a DNS server knowing
the full equivalence rules could do the sorting. It would be better
if this was not needed.
Signing of RRsets is done on the canonical RR form. RFC 2535 is
somewhat unclear on whether domain names inside the RDATA should be
lower cased. If they are not, so that the original format of the RDATA
is preserved, signing should be no problem in i18n aware DNS software.
The full handling of DNSSEC and i18n data may have to be described in
a separate document.
3. Characters allowed in domain names
The DNS protocol does not place any restriction on characters used in a
domain name. However, applications that make use of DNS data may have
restrictions imposed on what particular values are acceptable in
their environment. If the client has such restrictions, it is solely
responsible for validating the data from the DNS to ensure that it
conforms before it makes any use of that data. [RFC2181]
For example domains, hosts and e-mail addresses are represented in
DNS and may have different rules.
As the whole idea of internationalisation of DNS is to get domain
names with non-ASCII characters, the original recommendation in DNS
[RFC1035] for host/domain names needs to be updated.
It is recommended that domains, hosts and e-mail addresses all be
extended to allow all letters, digits and some separators of UCS.
This has to be defined in another document.
4. User interface issues
Locally on a system or in a user interface a different character set
than the one defined to be used in the DNS protocol may be used.
Therefore software must map between the local character set and the
character set of the protocol, so that human beings can understand
it.
This means that a zone file that is edited in a text editor by a
person before being loaded into a DNS server must be allowed to be in
the local character set. Software must not assume that the user can
edit text encoded in UTF-8. A zone file that is transmitted between
DNS software without being handled by a human can be transmitted using
any format.
When character data is presented to a human or entered by a human,
software must, as far as possible, present it using the local character
set and allow it to be entered using the local character set. It is
the responsibility of the software, not the human, to convert between
the local character set and the one used in the protocol.
The down coding defined above allows all names to be entered and
displayed for all users, as long as at least the ASCII characters are
supported.
4.1 Applications using DNS software
If an application makes a call to the DNS, it must present the data to
the users in the local character set used by the user, down coding if
necessary. Software used to access DNS should give the application
programmer the possibility of doing queries and getting responses both
in the local character set and in UTF-8.
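One way a resolver library could offer both forms is sketched below. The
function names to_protocol and to_local are hypothetical, and Python
codec names stand in for the local character set. When a name cannot be
represented locally, to_local returns None to signal that the caller
needs to fall back to down coding.

```python
import unicodedata


def to_protocol(raw, local_charset):
    """Convert octets in the local character set to form C UTF-8."""
    text = raw.decode(local_charset)
    return unicodedata.normalize("NFC", text).encode("utf-8")


def to_local(octets, local_charset):
    """Convert protocol UTF-8 to the local character set, if possible."""
    text = octets.decode("utf-8")
    try:
        return text.encode(local_charset)
    except UnicodeEncodeError:
        return None  # not representable locally, down coding is needed


# ISO 8859-1 octet e5 (a with ring above) becomes UTF-8 octets c3 a5.
print(to_protocol(b"\xe5", "latin-1").hex())  # prints c3a5
```

An application that wants UTF-8 directly would simply skip both
conversions and pass the protocol octets through.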
5. Effect on other protocols
As a domain name may now include non-ASCII characters, many other
protocols that include domain names need to be updated, for example
SMTP, HTTP and URIs. The down coding to ASCII as defined above can be
used when interfacing with ASCII only software or protocols. Protocols
like SMTP could be extended using ESMTP and a UTF8 option that defines
that all headers are in UTF-8.
It is recommended that protocols updated to handle i18n do this by
encoding character data in the same standard format as defined for
DNS in this document. Encoding it in ASCII or using tagged
character sets should be avoided.
The DNS does not contain only domain names; for example, e-mail
addresses are also included. So an e-mail address would be expected
to be changed to include non-ASCII both before and after the @-sign.
Software needs to be updated to follow the user interface
recommendations given above, so that a human will see the characters
in their local character set, if possible.
6. Security Considerations
As always with data, if software does not check for data that can be
a problem, security may be affected. As more characters than ASCII are
allowed, software that expects only ASCII and performs no checks may
now have security problems.
7. References
[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
STD 13, RFC 1034, November 1987.
[RFC1035] P. Mockapetris, "Domain Names - Implementation and
Specification", STD 13, RFC 1035, November 1987.
[RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2181] R. Elz and R. Bush, "Clarifications to the DNS
Specification", RFC 2181, July 1997.
[RFC2279] F. Yergeau, "UTF-8, a transformation format of ISO 10646",
RFC 2279, January 1998.
[RFC2535] D. Eastlake, "Domain Name System Security Extensions",
RFC 2535, March 1999.
[RFC2671] P. Vixie, "Extension Mechanisms for DNS (EDNS0)", RFC
2671, August 1999.
[ISO10646] ISO/IEC 10646-1:2000. International Standard --
Information technology -- Universal Multiple-Octet Coded
Character Set (UCS)
[Unicode] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
http://www.unicode.org/unicode/standard/versions/
Unicode3.0.html
[UTR15] M. Davis and M. Duerst, "Unicode Normalization Forms",
Unicode Technical Report #15, Nov 1999,
http://www.unicode.org/unicode/reports/tr15/.
[UTR21] M. Davis, "Case Mappings", Unicode Technical Report #21,
Dec 1999, http://www.unicode.org/unicode/reports/tr21/.
[UDATA] The Unicode Character Database,
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.
The database is described in
ftp://ftp.unicode.org/Public/UNIDATA/
UnicodeCharacterDatabase.html.
8. Acknowledgements
Ideas from drafts by Paul Hoffman, Stuart Kwan, James Gilroy and Kent
Karlsson.
Magnus Gustavsson, Mark Davis, Kent Karlsson and Andrew Draper for
comments on my draft.
Discussions and comments by the members of the IDN working group.
Author's Address
Dan Oscarsson
Telia ProSoft AB
Box 85
201 20 Malmo
Sweden
E-mail: Dan.Oscarsson@trab.se