INTERNET-DRAFT                                          Martin Duerst
draft-ietf-idn-uri-01                             W3C/Keio University
Expires May 2002                                    November 20, 2001


           Internationalized Domain Names in URIs and IRIs

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


Abstract

This document proposes to upgrade the definitions of URIs [RFC 2396]
and IRIs (Internationalized Resource Identifiers, [IRI]) to work
consistently with internationalized domain names.

0. Change Log

0.1 Changes from -00 to -01

- Changed requirement for URI/IRI resolvers from MUST to SHOULD
- Changed IRI syntax slightly (ichar -> idchar, based on changes
   in [IRI])
- Various wording changes


1. Introduction

Internet domain names serve to identify hosts and services on the
Internet in a convenient way. The IETF IDN working group is currently
working on extending the character repertoire usable in domain names
beyond a subset of US-ASCII.

One of the most important places where domain names appear is in
Uniform Resource Identifiers (URIs, [RFC 2396], as modified by
[RFC 2732]). However, in the current definition of the generic URI
syntax, the restrictions on domain names are 'hard-coded'. In
Section 2, this document relaxes these restrictions by updating
the syntax, and defines how internationalized domain names are
encoded in URIs.

URIs are restricted to a subset of US-ASCII. However, IRIs
(Internationalized Resource Identifiers, [IRI]) in general allow
non-ASCII characters. But the syntax of IRIs has the same 'hard-coded'
restrictions on domain names as the syntax of URIs. In Section 3,
this document relaxes these restrictions by updating the IRI syntax.
This is done in a way that is compatible with the new syntax for URIs.
This means that encoding an internationalized domain name in a URI
and encoding the same domain name in an IRI will produce a URI and an
IRI that can be converted into each other using the procedures defined
in [IRI] for these conversions.


2. URI syntax changes

The syntax of URIs [RFC 2396] currently contains the following rules
relevant to domain names:

       hostname      = *( domainlabel "." ) toplabel [ "." ]
       domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
       toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

The latter two rules are changed as follows:

       domainlabel   = escalphanum | escalphanum *( escalphanum | "-" )
                       escalphanum
       toplabel      = escalpha | escalpha *( escalphanum | "-" )
                       escalphanum

and the following rules are added:

       escalphanum   = escaped8 | alphanum
       escalpha      = escaped8 | alpha
       escaped8      = "%" hexdig8 HEXDIG
       hexdig8       = <<HEXDIG greater than 7>>
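
As an illustration only, the modified rules can be approximated by a
regular expression. The following Python sketch is not part of the
proposed syntax. The names HOSTNAME and is_uri_hostname are
hypothetical, and accepting lower-case hexadecimal digits is an
assumption, since the rules above do not define HEXDIG.

```python
import re

# Building blocks taken from the rules above.
ALPHA = "[A-Za-z]"
ALPHANUM = "[A-Za-z0-9]"
# escaped8 = "%" hexdig8 HEXDIG, where hexdig8 is a hex digit above 7.
# Assumption: lower-case hex digits are accepted as well.
ESCAPED8 = "%[89A-Fa-f][0-9A-Fa-f]"
ESCALPHA = f"(?:{ESCAPED8}|{ALPHA})"
ESCALPHANUM = f"(?:{ESCAPED8}|{ALPHANUM})"


def _label(first: str) -> str:
    """label = first | first *( escalphanum | "-" ) escalphanum"""
    return f"{first}(?:(?:{ESCALPHANUM}|-)*{ESCALPHANUM})?"


DOMAINLABEL = _label(ESCALPHANUM)
TOPLABEL = _label(ESCALPHA)

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(f"(?:{DOMAINLABEL}\\.)*{TOPLABEL}\\.?")


def is_uri_hostname(host: str) -> bool:
    """Return True if host matches the modified hostname rule."""
    return HOSTNAME.fullmatch(host) is not None
```

Under these assumptions, "d%C3%BCrst.example.org" is accepted, while
"%65xample.com" is rejected because %65 escapes a US-ASCII character.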

The %HH escaping is used to encode characters outside the repertoire
of US-ASCII. This is done by first encoding the characters in UTF-8
[RFC 2279], resulting in a sequence of octets, and then escaping these
octets according to the rules defined in [RFC 2396].
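
The two steps just described (encoding in UTF-8, then escaping the
resulting octets) can be sketched in Python as follows. The function
name escape_hostname is hypothetical, and the sketch assumes that name
preparation has already been applied to its input.

```python
def escape_hostname(host: str) -> str:
    """Escape the non-ASCII characters of a host name as %HH.

    US-ASCII characters, including the "." separators, are copied
    unchanged. Every other character is encoded in UTF-8 and each
    resulting octet is written as "%" followed by two hex digits.
    """
    out = []
    for ch in host:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# U+00FC (u-umlaut) is the two octets C3 BC in UTF-8.
print(escape_hostname("d\u00fcrst.example.org"))
```

Because every octet of a UTF-8 encoded non-ASCII character is 0x80 or
above, the first hexadecimal digit of each escape produced this way is
always greater than 7, as required by the escaped8 rule.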

Using UTF-8 ensures that this encoding interoperates with IRIs (see
Section 3). It is also aligned with the recommendations in [RFC 2277]
and [RFC 2718], and is consistent with the URN syntax [RFC 2141] as
well as recent URL scheme definitions that define encodings of
non-ASCII characters based on UTF-8 (e.g., IMAP URLs [RFC 2192] and
POP URLs [RFC 2384]).

Please note that the use of UTF-8 for encoding internationalized
domain names in URIs is independent of the encoding chosen for these
names in the DNS protocol. Depending on the encoding used in the DNS
protocol, an appropriate conversion is necessary.

The above syntax rules do not extend the possible domain names based
on US-ASCII characters. This is in accordance with the current direction
of the IDN WG [IDNWG].

The above rules also do not allow escaping of US-ASCII characters,
although this is allowed in the other parts of a URI (except for the
special provisions in case of reserved characters). Allowing such
escaping would either make the syntax rules considerably more
complicated, allow the restrictions on US-ASCII characters to be
circumvented by escaping, or lead to much simpler syntax rules that
no longer express these restrictions.

Whether escaping of US-ASCII characters is allowed or not, two things
should be noted: 1) When producing URIs, it is always better not to
escape US-ASCII characters in domain names, because a resolver may not
unescape them. Purely US-ASCII domain names will then be resolved even
by such a resolver. 2) When processing URIs, because of the principle
of syntax uniformity for URIs, it is always more prudent to take into
account the possibility that US-ASCII characters are escaped.
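
A resolver following the second point might unescape every %HH
sequence in the host part, whether or not it escapes a US-ASCII
character. The Python sketch below is one possible approach, not a
required behavior; the function name unescape_hostname is hypothetical.

```python
from urllib.parse import unquote


def unescape_hostname(host: str) -> str:
    """Undo %HH escaping in a host name, including escaped US-ASCII.

    The unescaped octets are interpreted as UTF-8. With errors="strict",
    octets that are not valid UTF-8 raise UnicodeDecodeError instead of
    being silently replaced.
    """
    return unquote(host, encoding="utf-8", errors="strict")


# An escaped US-ASCII character and an escaped non-ASCII character.
print(unescape_hostname("%65xample.com"))
print(unescape_hostname("d%C3%BCrst.example.org"))
```

The result would still have to be checked against the restrictions on
domain names and passed through name preparation before resolution.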

Only the restrictions on US-ASCII characters are expressed in the
rules above. However, all the other restrictions on internationalized
domain names that are defined by the IDN WG [IDNWG] MUST be respected.

The work of the IDN WG currently includes some procedures for name
preparation. Before encoding an internationalized domain name in a
URI, this preparation step SHOULD be applied. Nevertheless, the URI
resolver SHOULD also apply name preparation.


3. IRI syntax changes

The syntax of IRIs [IRI] currently contains the following rules
relevant to domain names:

       hostname      = *( domainlabel "." ) toplabel [ "." ]
       domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
       toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

The latter two rules are changed as follows:

       domainlabel   = intalphanum | intalphanum *( intalphanum | "-" )
                       intalphanum
       toplabel      = intalpha | intalpha *( intalphanum | "-" )
                       intalphanum

and the following rules are added:

       intalphanum   = idchar | alphanum | escaped8
       intalpha      = idchar | alpha | escaped8
       escaped8      = "%" hexdig8 HEXDIG
       hexdig8       = <<HEXDIG greater than 7>>
       idchar        = << any character of the UCS [ISO10646] of U+00A0
                          and beyond, subject to limitations in Section
                          3.1. of [IRI] >>
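
For illustration, the modified IRI rules can be approximated in the
same way by a regular expression. In the Python sketch below, the names
IRI_HOSTNAME and is_iri_hostname are hypothetical. The limitations of
Section 3.1 of [IRI] on idchar are not modeled, and accepting lower-case
hexadecimal digits is an assumption.

```python
import re

ALPHA = "[A-Za-z]"
ALPHANUM = "[A-Za-z0-9]"
# escaped8 = "%" hexdig8 HEXDIG (assumption: lower case accepted too)
ESCAPED8 = "%[89A-Fa-f][0-9A-Fa-f]"
# idchar: any character from U+00A0 upwards. The further limitations
# of Section 3.1 of [IRI] are NOT checked in this sketch.
IDCHAR = "[\u00a0-\U0010ffff]"
INTALPHA = f"(?:{IDCHAR}|{ALPHA}|{ESCAPED8})"
INTALPHANUM = f"(?:{IDCHAR}|{ALPHANUM}|{ESCAPED8})"


def _label(first: str) -> str:
    """label = first | first *( intalphanum | "-" ) intalphanum"""
    return f"{first}(?:(?:{INTALPHANUM}|-)*{INTALPHANUM})?"


# hostname = *( domainlabel "." ) toplabel [ "." ]
IRI_HOSTNAME = re.compile(
    f"(?:{_label(INTALPHANUM)}\\.)*{_label(INTALPHA)}\\.?"
)


def is_iri_hostname(host: str) -> bool:
    """Return True if host matches the modified IRI hostname rule."""
    return IRI_HOSTNAME.fullmatch(host) is not None
```

Under these assumptions, a host name is accepted both with the
non-ASCII character written directly and with it written as escaped
UTF-8 octets, while escaped US-ASCII such as "%65xample.com" is
rejected.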

With respect to the allowed domain names based on US-ASCII characters,
the same considerations as in Section 2 apply.

As in Section 2, all the other restrictions on internationalized
domain names that will be defined by the IDN WG MUST be respected.
Also, before encoding an internationalized domain name in an IRI,
name preparation SHOULD be applied. Nevertheless, the IRI resolver
SHOULD also apply name preparation.

It is expected that the rules in Section 3.1 of [IRI] will be less
restrictive than the rules for internationalized domain names, so that
no escaping is necessary. Nevertheless, escaping is allowed for cases
where not all characters can be directly represented.


4. Security Considerations

The security considerations of [RFC 2396] and [IRI] and those applying
to internationalized domain names apply. There may be an increased
potential to smuggle escaped US-ASCII-based domain names across
firewalls, although because of the uniform syntax principle for
URIs, such a potential already exists.


Acknowledgements

Looking forward to comments. Will acknowledge them here!


Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.  However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Author's address

          Martin J. Duerst
          W3C/Keio University
          5322 Endo, Fujisawa
          252-8520 Japan
          duerst@w3.org
          http://www.w3.org/People/D%C3%BCrst/
          Tel/Fax: +81 466 49 1170

          Note: Please write "Duerst" with u-umlaut wherever
                possible, e.g. as "D&#252;rst" in XML and HTML.


References

[IDNWG] IETF Internationalized Domain Name (idn) Working Group.
  Information at http://www.ietf.org/html.charters/idn-charter.html.

[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
  (IRI)", Internet Draft, November 2001,
  <http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt>,
  work in progress.

[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
  Coded Character Set (UCS) - Part 1: Architecture and Basic
  Multilingual Plane, Oct. 2000, with amendments.

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
  Requirement Levels", March 1997.

[RFC 2141] R. Moats, "URN Syntax", May 1997.

[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.

[RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and
  Languages", January 1998.

[RFC 2279] F. Yergeau, "UTF-8, a transformation format of ISO 10646",
  January 1998.

[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.

[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform Resource
  Identifiers (URI): Generic Syntax", August 1998.

[RFC 2640] B. Curtis, "Internationalization of the File Transfer
  Protocol", July 1999.

[RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke,
  "Guidelines for new URL Schemes", November 1999.

[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal
  IPv6 Addresses in URL's", December 1999.