Internet Draft                                             Dan Oscarsson
draft-oscarsson-i18ndns-00.txt                             Telia ProSoft
Updates: RFC 2181, 1035, 1034, 2535                     25 February 2000
Expires: 25 August 2000

            Internationalisation of the Domain Name Service

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Abstract

   There is a very strong world-wide desire to use characters other than
   ASCII in the DNS, especially in domain names.

   This document updates the Domain Name System (DNS) [RFC1035] in a way
   that is compatible with the current DNS and specifies how
   international characters are handled.



1. Introduction

   There is an immediate need to use international (non-ASCII)
   characters in DNS. This means that waiting for an extension of the
   DNS protocol is not an option, as that would take too long. Instead,
   the current ASCII-only handling needs to be extended to non-ASCII in
   a way that can be used without updating current software.

   The basic handling of character data in DNS has several properties



Dan Oscarsson           Expires: 25 August 2000                 [Page 1]


Internet Draft        Internationalisation of DNS       25 February 2000


   that need to be preserved:
    - The DNS itself places only one restriction on the particular
      labels that can be used to identify resource records. That one
      restriction relates to the length of the label and the full name.
      The length of any one label is limited to between 1 and 63 octets.
      A full domain name is limited to 255 octets (including the
      separators).  [RFC2181]
    - Any binary string whatever can be used as the label of any
      resource record. Similarly, any binary string can serve as the
      value of any record that includes a domain name as some or all of
      its value (SOA, NS, MX, PTR, CNAME, and any others that may be
      added).  Implementations of the DNS protocols must not place any
      restrictions on the labels that can be used. In particular, DNS
      servers must not refuse to serve a zone because it contains labels
      that might not be acceptable to some DNS client programs.
      [RFC2181]
    - Names must be compared with case-insensitivity.  [RFC1035]
    - The original case should be preserved when possible as data is
      entered into the system. This also implies that responses should
      preserve case when possible. [RFC1035] Some of the reasons for
      this are:
        + Domain names are used for many purposes.
        + One is domain names in which company names or trademarks are
          used.  Companies and trademarks very commonly use a
          combination of upper and lower case to enhance the image of
          the name.  Many of them would prefer that when you, for
          example, look up the domain name for an IP address, the
          correct case is returned.
        + Another is the e-mail address defined in the SOA record.
          While many systems now do a case-insensitive comparison on
          the user name part of the e-mail address, there may still be
          those that do not.  Here too, e-mail addresses can be made
          more readable by mixing upper and lower case.
        + If you look up a host name from an IP address you may want to
          use the host name to compare with other data. Many services
          under Unix do this, and many of them are not
          case-insensitive. So they need the correct case returned.
        + There may be other uses of domain names that require them to
          be unchanged.
    - The characters in the ASCII character set must still be encoded as
      ASCII.

   This document specifies the update needed to the DNS protocol, user
   interface issues and the effect on other protocols. It is intended to
   fulfil the requirements for internationalised domain names, which are
   currently being worked on by the IDN working group.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].


2. The DNS Protocol

   The DNS protocol is used when communicating between DNS servers and
   other DNS servers or DNS clients. User interface issues like the
   format of zone files or how to enter or display domain names are not
   part of the protocol.

   The update of the protocol defined here can be used immediately as it
   is fully compatible with the DNS of today.

2.1 Internationalisation aware software

   Internationalisation aware DNS software (i18n aware) is software that
   implements the rules for handling international text as defined here.
   Only i18n aware software will have all requirements fulfilled.

   Referring to section 4.1.1 in [RFC1035] and section 6.1 in [RFC2535],
   the DNS query/response format header is updated by allocating the
   last unallocated bit in the header. This bit is defined to be zero
   in old servers and resolvers. For a description of all fields see
   the sections in the above RFCs.

                                           1  1  1  1  1  1
             0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                      ID                       |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    QDCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    ANCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    NSCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    ARCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

   I18n aware software identifies itself in a query or a response by
   setting the IN bit in the DNS query/response format header.  As this
   bit is defined to be zero in old servers and resolvers they identify
   themselves as non-i18n aware.

   I18n aware software MUST set the IN bit in both queries and
   responses.
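
   As an illustration only, the IN bit can be set and tested on the
   first 12 octets of a message roughly as follows. This is a minimal
   sketch: the names IN_BIT, set_in_bit and peer_is_i18n_aware are
   assumptions made for the example and are not defined by this
   document. The mask 0x0040 follows from the bit position shown in the
   header diagram above.

```python
import struct

# Mask for bit 9 of the 16-bit flags word, counting from the most
# significant bit as in the header diagram (IN_BIT is an assumed name).
IN_BIT = 0x0040


def set_in_bit(message: bytes) -> bytes:
    """Return a copy of a DNS message with the IN bit set."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 octets long")
    ident, flags = struct.unpack("!HH", message[:4])
    return struct.pack("!HH", ident, flags | IN_BIT) + message[4:]


def peer_is_i18n_aware(message: bytes) -> bool:
    """Return True if the sender of this message set the IN bit."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 octets long")
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & IN_BIT)
```

   Old software leaves the bit at zero, so peer_is_i18n_aware returns
   False for its messages without any change on its side.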

   Note: EDNS [RFC2671] is not used because:
    - The update should work with the current pre-i18n DNS software.
    - I18n aware software should not need to send any additional
      requests.


2.2 Character data

   Character data needs to be able to represent as many as possible of
   the characters in the world while remaining compatible with ASCII.
   It must also be well defined so that it can easily be handled, and
   it should be compact, as only 63 octets are available per label
   without an extension of the protocol.

   Therefore character data used in the DNS protocol MUST:
    - Use ISO 10646 (UCS) [ISO10646] as coded character set.
    - Be normalised using form C as defined in Unicode technical report
      #15 [UTR15].
    - Be encoded using the UTF-8 [RFC2279] character encoding scheme.

   The only exception to the above rules is in the interoperability
   with non-i18n aware DNS software, as defined later.
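
   The three requirements above can be sketched as a single conversion
   step. This is only an illustration: to_wire_label is an assumed
   name, and Python's unicodedata module implements the Unicode version
   shipped with the interpreter rather than the version referenced in
   this document, so results may differ for some characters.

```python
import unicodedata

MAX_LABEL_OCTETS = 63  # label length limit quoted from RFC 2181


def to_wire_label(label: str) -> bytes:
    """Normalise a label to form C and encode it as UTF-8."""
    encoded = unicodedata.normalize("NFC", label).encode("utf-8")
    if not 1 <= len(encoded) <= MAX_LABEL_OCTETS:
        raise ValueError("label must be 1 to 63 octets after encoding")
    return encoded
```

   For example, "A" followed by a combining ring above (U+0041 U+030A)
   is composed to U+00C5 by form C and is then encoded as the two
   octets c3 85.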

2.2.1 Down coding

   As a local character set may not support all of the characters of UCS
   used internally in DNS, a way to encode unsupported characters into
   the local character set is needed. That way a domain name can be used
   even if the local character set cannot represent all characters in a
   name. By setting the local character set to ASCII we get domain names
   that are allowed in non-i18n aware software.

   This will be done by down coding UTF-8 into the local character set.
   It is done as follows:
    - If a character can be represented in the local character set, map
      it from UCS to local character set.
    - If a character cannot be represented in the local character set,
      map the UTF-8 octet sequence for the character to a hyphen ("-")
      followed by the hex code of each octet as two characters per
      octet.
    - If down coding was needed because not all characters could be
      represented in the local character set, all original hyphens must
      be replaced by two hyphens ("--") and the entire string MUST end
      with a single hyphen.

   Examples:



   If we have the name: Ab-<a with ring above>r<greek small omega>z, it
   is represented in DNS as UTF-8:
      (HEX) 41 62 2d c3 a5 72 cf 89 7a
   If the local character set is ISO 8859-1, the down coded name is:
   Ab--<a with ring above>r-cf89z-.
   If the local character set is ASCII, the down coded name is:
   Ab---c3a5r-cf89z-.

   Note: In other formats like HTML, unsupported characters are handled
   like: &#number; (prefix, code point value and terminator).  The above
   format is chosen because it only needs a prefix (the length is
   defined in the UTF-8 encoding so a terminator is not needed) and can
   easily be checked for a valid sequence.
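
   The down coding rules of this section can be sketched as follows.
   This is a minimal illustration, not a reference implementation: the
   function name down_code is an assumption, and the local character
   set is identified by a Python codec name, which is also an
   assumption of the example.

```python
def down_code(name: str, local_charset: str = "ascii") -> str:
    """Down code a name into the given local character set."""
    pieces = []
    escaped = False
    for ch in name:
        try:
            ch.encode(local_charset)
            pieces.append(ch)  # representable: keep the character
        except UnicodeEncodeError:
            escaped = True
            # hyphen followed by two hex digits per UTF-8 octet
            pieces.append("-" + ch.encode("utf-8").hex())
    if not escaped:
        return name  # nothing needed down coding: name is unchanged
    # Double every original hyphen and add the trailing marker hyphen.
    return "".join("--" if p == "-" else p for p in pieces) + "-"
```

   As a usage example with an illustrative name, "Ab-", a with ring
   above, "r", Cyrillic zhe (U+0436, UTF-8 d0 b6) and "z" down codes to
   "Ab---c3a5r-d0b6z-" for ASCII. For ISO 8859-1 only the Cyrillic
   letter is escaped. A name that needs no escaping is returned
   unchanged, without doubled hyphens or a trailing hyphen.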

2.2.2 Up coding

   When character data is entered into i18n aware DNS software, it must
   be up coded from the down coding format into UTF-8. A down coded name
   is identified by a trailing hyphen. When up coding, invalid UTF-8
   sequences should be left as they are, as the name may be an old name
   with a trailing hyphen.
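
   One possible reading of the up coding step is sketched below. The
   names up_code and _read_escape are assumptions. So are two design
   choices the text leaves open: a name is returned unchanged as soon
   as one escape is invalid, and also when it ends with a hyphen but
   contains no escape at all. The sketch only accepts UTF-8 sequences
   of two to four octets, which is what Python's decoder supports.

```python
import string

_HEX = set(string.hexdigits)


def _read_escape(text: str, pos: int):
    """Read one hex-escaped non-ASCII character starting at pos.

    Returns (character, hex digits consumed), or None if no valid
    UTF-8 sequence starts here.
    """
    for octets in (2, 3, 4):  # lengths of non-ASCII UTF-8 sequences
        chunk = text[pos:pos + 2 * octets]
        if len(chunk) != 2 * octets or not set(chunk) <= _HEX:
            continue
        try:
            decoded = bytes.fromhex(chunk).decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(decoded) == 1:
            return decoded, len(chunk)
    return None


def up_code(name: str) -> str:
    """Up code a down coded name; return other names unchanged."""
    if not name.endswith("-"):
        return name
    body = name[:-1]
    out = []
    saw_escape = False
    i = 0
    while i < len(body):
        if body[i] != "-":
            out.append(body[i])
            i += 1
        elif body.startswith("--", i):
            out.append("-")  # doubled hyphen is a literal hyphen
            i += 2
        else:
            escape = _read_escape(body, i + 1)
            if escape is None:
                return name  # probably an old name ending in a hyphen
            out.append(escape[0])
            i += 1 + escape[1]
            saw_escape = True
    return "".join(out) if saw_escape else name
```

   With this sketch, "Ab---c3a5r-d0b6z-" is up coded to "Ab-", a with
   ring above, "r", Cyrillic zhe, "z", while an old name such as
   "old-name-" is returned as it is.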


2.3 Domain name matching

   One of the most difficult areas of internationalisation is deciding
   which names are equivalent to one another. For ASCII this was easily
   solved by case-insensitivity. It is also easily solved for many
   other Latin based alphabets. But when you look at the whole world
   you get a mixture of rules, some conflicting, including
   case-insensitivity, half width/full width, final/non-final forms and
   much more.

   This type of matching will be called "equivalence matching"
   hereafter.

2.3.1 Equivalence matching rules

   To compare two domain names, both names must first be mapped to a
   format where all equivalent characters are mapped to one character so
   that the names then can be binary compared.  This mapping is done
   from the original UCS normalised form C format, by case folding to
   lower case followed by additional normalisation and simplification.

   Folding to lower case MUST be done by following the one to one
   mapping as defined in the Unicode 3.0 Character Database [UDATA].

   Additional folding will probably also be done, but this has not been
   agreed on yet. For normalisation Unicode 3.0 defines a normalisation
   form KC [UTR15] that is a good start, but more is needed. More about
   case folding to lower case is available in Unicode Technical Report
   21 [UTR21].

   Additional folding, normalisation and simplification will be defined
   here or in a separate document at a later stage.

   Note: As Turkish rules lower case I to dotless i instead of to the
   dotted i used in ASCII and in the above case mapping, Turkish names
   with dotless i will always have to be entered in lower case.
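
   The part of the mapping that is defined so far (form C followed by
   one to one lower casing) can be sketched as follows. The names
   fold_for_matching and names_match are assumptions, the additional
   folding and simplification steps are not covered since they are not
   yet defined, and Python's case data comes from a later Unicode
   version than 3.0.

```python
import unicodedata


def fold_for_matching(name: str) -> bytes:
    """Map a name to the octets used for binary comparison."""
    nfc = unicodedata.normalize("NFC", name)
    # Use only one to one lower case mappings: a character whose lower
    # case form is more than one character is left unchanged.
    folded = "".join(
        ch.lower() if len(ch.lower()) == 1 else ch for ch in nfc
    )
    return unicodedata.normalize("NFC", folded).encode("utf-8")


def names_match(a: str, b: str) -> bool:
    """Compare two names after folding."""
    return fold_for_matching(a) == fold_for_matching(b)
```

   Under this sketch "I" folds to the dotted "i", so a name written
   with upper case "I" does not match the same name written with
   dotless i (U+0131), which is the situation the note above describes.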


2.3.2 Matching of domain names in DNS servers

   To be able to handle correct domain name matching in lookups, the
   following MUST be followed by DNS servers:
    - Do matching on authoritative data using the full name equivalence
      matching needed for the characters used in the data.
    - On non-authoritative data, either do binary matching or do
      case-insensitive matching on ASCII letters and binary matching on
      all others.
    - Implement the equivalence matching rules as defined above. Local
      variations are not allowed.

   The effect of the above is:
    - Only servers handling authoritative data must implement
      equivalence matching of names. And they need only implement the
      subset needed for the subset of characters of UCS they support in
      their authoritative zones.
    - It normally gives fast lookup because data is usually sent like:
      resolver <-> server <-> authoritative server.
      While full equivalence matching can be complex and CPU consuming,
      the server in the middle will do caching with only simple and fast
      binary matching. So complex matching rules should not slow down
      DNS very much.
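
   The second matching option for non-authoritative data (ASCII
   case-insensitive, binary for everything else) can be sketched as a
   cache key computed on the octets of a name. The names cache_key and
   _ASCII_FOLD are assumptions made for this example.

```python
# Translation table that lower cases only the octets for ASCII "A".."Z"
# and leaves every other octet, including UTF-8 octets, untouched.
_ASCII_FOLD = bytes(o + 32 if 0x41 <= o <= 0x5A else o for o in range(256))


def cache_key(name_octets: bytes) -> bytes:
    """Return the key a caching server could use for binary lookup."""
    return name_octets.translate(_ASCII_FOLD)
```

   Two names that differ only in the case of a non-ASCII letter get
   different keys here. A caching server may therefore hold both, and
   only the authoritative server needs to know that they are
   equivalent.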


2.4 Interoperability between i18n aware DNS software and non-i18n aware

   While the current non-i18n aware DNS software MUST allow UTF-8
   encoded domain names (if they follow RFC1035, 2181), a lot of
   software using DNS may not (for example SMTP). To not break all the
   old software only expecting or allowing ASCII in domain names, the
   following rules MUST be followed by an i18n aware DNS server:
    - A query with the IN bit set is assumed to be from i18n aware
      software.
    - A query with domain names having valid non-ASCII UTF-8 characters
      is assumed to be from i18n aware software even if the IN bit is
      not set. (This is because the query may have been sent from an
      i18n aware resolver through a non-i18n aware server.)
    - Always down code (see above) the UTF-8 names into ASCII before
      sending them when responding to non-i18n aware software.
    - Never have down coded names in the response when responding to
      i18n aware software.
    - Always check for down coded names in requests and up code them.
    - Never do zone transfers to non-i18n aware software if the zone
      contains non-ASCII.
    - Return the Server Failure error if a label cannot be down coded
      and fit in the 63 octets allowed.

   An i18n aware DNS resolver MUST:
    - Up code any down coded names before sending them using the DNS
      protocol.
    - Up code any down coded names received in a response.

   The result of this is:
    - Old software gets an ASCII only domain name using only the old set
      of allowed characters.
    - Both i18n aware DNS servers and resolver software must handle up
      coding of domain names.
    - Domain names used from old software will work in other protocols
      only allowing ASCII names.
    - We may get old software that is never fixed as it still works.
    - We do not get rid of the user unfriendly "encode everything in
      ASCII" handling that many non-ASCII users complain about.

   Note: As a non-i18n aware DNS server only understands matching using
   ASCII case-insensitivity, it may cache i18n responses as different
   even though they are i18n equivalent. This will result in more data
   being cached but will not give invalid responses.
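
   The server side rules of this section can be sketched as two
   decisions: detecting whether the requester is i18n aware, and
   choosing the form of a label in the response. All names below
   (ServerFailure, query_is_i18n_aware, down_code_ascii,
   label_for_response) are assumptions made for the illustration. The
   sketch works on single labels and leaves out zone transfers and up
   coding of requests.

```python
from typing import Iterable

MAX_LABEL_OCTETS = 63


class ServerFailure(Exception):
    """Hypothetical signal that the server failed error is returned."""


def query_is_i18n_aware(in_bit: bool, labels: Iterable[bytes]) -> bool:
    """IN bit set, or a label holding valid non-ASCII UTF-8."""
    if in_bit:
        return True
    for label in labels:
        if any(octet >= 0x80 for octet in label):
            try:
                label.decode("utf-8")
            except UnicodeDecodeError:
                continue  # non-ASCII octets that are not valid UTF-8
            return True
    return False


def down_code_ascii(label: str) -> str:
    """Down code one label to ASCII (see section 2.2.1)."""
    if label.isascii():
        return label
    pieces = []
    for ch in label:
        if ch == "-":
            pieces.append("--")
        elif ch.isascii():
            pieces.append(ch)
        else:
            pieces.append("-" + ch.encode("utf-8").hex())
    return "".join(pieces) + "-"


def label_for_response(label: str, requester_is_i18n_aware: bool) -> bytes:
    """Return the octets of a label as they would be sent."""
    if requester_is_i18n_aware:
        return label.encode("utf-8")  # never down coded
    coded = down_code_ascii(label).encode("ascii")
    if len(coded) > MAX_LABEL_OCTETS:
        raise ServerFailure("down coded label exceeds 63 octets")
    return coded
```

   A label of 20 non-ASCII letters, for example, takes 40 octets in
   UTF-8 but 101 octets when down coded to ASCII, so it can be served
   to i18n aware software only.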


2.5 DNSSEC

   DNSSEC [RFC2535] is complex and its handling of i18n data has not
   yet been fully studied, especially the canonical DNS name order and
   the signing of RRsets.

   The canonical DNS name order sorts names with letters as lower case.
   In i18n this means to fold to lower case, normalise and simplify as
   is done in lookups.  This would mean that only a DNS server knowing
   the full equivalence rules could do the sorting. It would be better
   if this was not needed.

   Signing of RRsets is done on the canonical RR form. RFC 2535 is
   somewhat unclear on whether domain names inside the RDATA should be
   lower cased. If they are not, so that the original format of the
   RDATA is preserved, signing
   should be no problem in i18n aware DNS software.

   The full handling of DNSSEC and i18n data may have to be described in
   a separate document.


3. Characters allowed in domain names

   The DNS protocol does not place any restriction on the characters
   used in a domain name. However, applications that make use of DNS
   data may have restrictions imposed on what particular values are
   acceptable in their environment. If the client has such
   restrictions, it is solely responsible for validating the data from
   the DNS to ensure that it conforms before it makes any use of that
   data. [RFC2181]

   For example domains, hosts and e-mail addresses are represented in
   DNS and may have different rules.

   As the whole idea of internationalisation of DNS is to get domain
   names with non-ASCII, the original recommendation in DNS [RFC1035]
   for host/domain names needs to be updated.

   It is recommended that domains, hosts and e-mail addresses are all
   extended to allow all letters, digits and some separators of UCS.

   This has to be defined in another document.


4. User interface issues

   Locally on a system or in a user interface a different character set
   than the one defined to be used in the DNS protocol may be used.
   Therefore software must map between the local character set and the
   character set of the protocol, so that human beings can understand
   it.

   This means that a zone file that is edited in a text editor by a
   person before being loaded into a DNS server must be allowed to be in
   the local character set. Software must not assume that the user can
   edit text encoded in UTF-8. A zone file that is transmitted between
   DNS software and not handled by a human can be transmitted using any
   format.

   When character data is presented to a human or entered by a human,
   software must, as far as possible, present it using the local
   character set and allow it to be entered using the local character
   set.  It is the responsibility of the software, not the human, to
   convert between the local character set and the one used in the
   protocol.



   The down coding defined above allows all names to be entered and
   displayed for all users, as long as at least the ASCII characters are
   supported.

4.1 Applications using DNS software

   If an application makes a call to DNS, it must present the data to
   the users in the local character set used by the user, down coding if
   necessary. Software used to access DNS should give the application
   programmer the possibility of doing queries and getting responses
   both using the local character set and using UTF-8.


5. Effect on other protocols

   As a domain name may now include non-ASCII characters, many other
   protocols that include domain names need to be updated, for example
   SMTP, HTTP and URIs. The down coding to ASCII as defined above can be
   used when interfacing with ASCII only software or protocols.
   Protocols like SMTP could be extended using ESMTP and a UTF8 option
   that defines that all headers are in UTF-8.

   It is recommended that protocols updated to handle i18n do this by
   encoding character data in the same standard format as defined for
   DNS in this document. Encoding it in ASCII or using tagged character
   sets should be avoided.

   DNS data does not contain only domain names; e-mail addresses, for
   example, are also included. So an e-mail address would be expected
   to be changed to include non-ASCII both before and after the @-sign.

   Software needs to be updated to follow the user interface
   recommendations given above, so that a human will see the characters
   in their local character set, if possible.

6. Security Considerations

   As always with data, if software does not check for data that can be
   a problem, security may be affected. As more characters than ASCII
   are allowed, software that only expects ASCII and performs no checks
   may now get security problems.

7. References

   [RFC1034]  P. Mockapetris, "Domain Names - Concepts and Facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1035]  P. Mockapetris, "Domain Names - Implementation and
              Specification", STD 13, RFC 1035, November 1987.

   [RFC2119]  S. Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2181]  R. Elz and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, July 1997.

   [RFC2279]  F. Yergeau, "UTF-8, a transformation format of ISO 10646",
              RFC 2279, January 1998.

   [RFC2535]  D. Eastlake, "Domain Name System Security Extensions",
              RFC 2535, March 1999.

   [RFC2671]  P. Vixie, "Extension Mechanisms for DNS (EDNS0)", RFC
              2671, August 1999.

   [ISO10646] ISO/IEC 10646-1:2000. International Standard --
              Information technology -- Universal Multiple-Octet Coded
              Character Set (UCS)

   [Unicode]  The Unicode Consortium, "The Unicode Standard -- Version
              3.0", ISBN 0-201-61633-5. Described at
              http://www.unicode.org/unicode/standard/versions/
              Unicode3.0.html

   [UTR15]    M. Davis and M. Duerst, "Unicode Normalization Forms",
              Unicode Technical Report #15, Nov 1999,
              http://www.unicode.org/unicode/reports/tr15/.

   [UTR21]    M. Davis, "Case Mappings", Unicode Technical Report #21,
              Dec 1999, http://www.unicode.org/unicode/reports/tr21/.

   [UDATA]    The Unicode Character Database,
              ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.
              The database is described in
              ftp://ftp.unicode.org/Public/UNIDATA/
              UnicodeCharacterDatabase.html.



8. Acknowledgements

   Ideas from drafts by Paul Hoffman, Stuart Kwan, James Gilroy and Kent
   Karlsson.

   Magnus Gustavsson, Mark Davis, Kent Karlsson and Andrew Draper for
   comments on my draft.



   Discussions and comments by the members of the IDN working group.



Author's Address

   Dan Oscarsson
   Telia ProSoft AB
   Box 85
   201 20 Malmo
   Sweden

   E-mail: Dan.Oscarsson@trab.se