Internet Draft                                        Paul Hoffman
draft-hoffman-idn-reg-00.txt                            IMC & VPNC
March 25, 2003
Expires in six months
Intended status: Best Current Practice (BCP)


         Framework for Registering Internationalized Domain Names


Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


Abstract

This document describes a framework for registering internationalized
domain names (IDNs) in a zone. Before accepting registrations of domain
names into a zone, the zone's registry should decide which codepoints in
the Unicode character set the zone will accept. The registry should also
decide whether particular characters in a registered domain name should
cause registration of multiple equivalent domain names. With those
decisions, the registry can safely register names using the steps
described here.


1. Introduction

IDNA [IDNA] specifies an encoding of characters in the Unicode character
set [UNICODE] which is backwards-compatible with the current definition
of hostnames. This implies that domain names encoded according to IDNA
will be able to be transported between peers using any existing
protocol, including DNS.

IDNA, through its requirement of Nameprep [NAMEPREP], uses equivalence
tables that are based only on the characters themselves; no attention is
paid to the intended language (if any) for the domain name. However, for
many domain names, the intended language of one or more parts of the
domain name actually does matter to the registry for the names and to
users.

If there are no constraints on registration in a zone, people can
register names with characters that increase the risk of
misunderstandings, cybersquatting, and other forms of confusion. A
similar situation existed before the introduction of IDNA, exemplified
by domain names such as example.com and examp1e.com (note that the
latter domain has the digit "1" instead of the letter "l").

For some human languages, there are characters and/or strings that have
equivalent or near-equivalent meanings. If someone is allowed to
register a name with such a character or string, the registry might want
to automatically register all the names that have the same meaning in
that language. Further, some registries might want to restrict the set
of characters to be registered for language-based reasons. In addition,
IDNA allows the use of thousands of non-alphanumeric characters, and
some zone administrators will want to prohibit some or all of these
characters.

The intent of this document is that checking whether a label
can be approved can be a mathematical, objective inspection of the
codepoints in the label with no human intervention, and that all
applications of a particular table will yield identical results.

The mechanism described here does not require a registry to know the
"intended language" of a label. It is impossible to describe the
"intended language" of names that include numbers or acronyms. Proposals
that have this requirement require human intervention to validate the
assertion from the registrant and are therefore susceptible to fraud
from the registrant. Further, such a requirement prevents the
registration of labels that have two languages, some of which are common
in countries with multiple languages.

[IDN-ADMIN] describes a different approach to the problem of
registration policy. That document uses a more complex algorithm and a
different registration philosophy than what is described here.

It is suggested that a registry act conservatively when starting to
accept IDNA-based domain names. Equivalences are very hard (if not
impossible) to define after registration has started. Assume that the
labels "x" and "y" at first are different, but later the tables for the
registry are changed so that "x" and "y" are then treated as being the
same. If x.example.com and y.example.com both were already registered to
different registrants, it is unclear which of them has to withdraw the
registration, how that selection process is done, and so on. Thus, having
complete, publicly-stated policies before accepting registration will
lead to a much more stable registration process.

This document does not deal with how to handle whois data for multiple
registrations, and does not deal with registrar-registry protocols.
This document also deals only with variants of single characters,
not variants of strings.

1.1 Terminology

A "string" is an ordered sequence of one or more characters.

This document discusses characters that have equivalent or
near-equivalent characters or strings. The "base character" is the
character that has one or more equivalents; the "variant(s)" are the
character(s) and/or string(s) that are equivalent to the base character.

A "registration bundle" is the set of all labels that come from
expanding all base characters for a single name into their variants.

A registry is the administrative authority for a DNS zone. That is, the
registry is the body that makes and enforces policies that are used in a
particular zone in the DNS.

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

2. Language-based tables

The registration strategy described in this document uses a table that
lists all characters allowed for input and any variants of those
characters. Note that the table lists all characters allowed, not only
the ones that have variants.

It is widely expected that there will be different tables for the same
language created by different people. Many languages are spoken in many
different countries, and each country might have a different view of
which characters should or should not be considered needed for that
language. For example, some people would say that the Latin characters
are needed for various Indic languages, while others would say that
they are not.

A zone needs to have exactly one table; having more than one table can
lead to unpredictable results because the variants in the different
tables may conflict. The table must be carefully composed so that all
expected variants will be created, and no unexpected variants are
created.

The registry's table MUST NOT have more than one entry for a particular
base character. A table with more than one entry for the same base
character would require that some names be evaluated by humans and would
open the registration process to dispute.

The tables are language-specific, although it is possible to create a
single table that covers multiple languages. The following three
sub-sections describe the use of tables in three scenarios.

2.1 Table for a zone that uses names from one language

A zone that has a single language has a significant advantage over
zones that cover multiple languages. Its table can be constructed
without concern for variants that appear in other languages for the
base characters of the language used in the zone.

2.2 Table for a zone that uses names from a small number of languages

If a zone covers more than one language, the registry must create its
registration table from multiple language tables. Creating a table from
many languages is easy if none of the languages have overlapping
character variants for any single base character.

A registry MUST NOT blindly combine multiple tables which have
overlapping equivalences. Instead, the registry MUST carefully analyze
every instance in the combined table where a base character has one or
more different variants and select the desired set of variants for the
base character.
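
The analysis step above can be partly mechanized. The following Python
fragment is a minimal sketch, not part of this framework: it assumes
each language table is held as a mapping from a base character to a set
of variant strings, and the name merge_tables is hypothetical. It
merges the entries on which all tables agree and sets aside every base
character whose variant sets differ, so that the registry can decide
those cases by hand. Treating "no variants" in one table and "some
variants" in another as a conflict is an assumption of this sketch.

```python
def merge_tables(tables):
    """Merge language tables; return (merged, conflicts).

    Each table maps a base character to a set of variant strings.
    A base character whose variant sets differ between tables is
    left out of `merged` and reported in `conflicts` for review.
    """
    merged = {}
    conflicts = {}
    for table in tables:
        for base, variants in table.items():
            variants = frozenset(variants)
            if base in conflicts:
                # Already known to be in dispute: record this view too.
                conflicts[base].add(variants)
            elif base in merged and merged[base] != variants:
                # First disagreement: move the entry out of the result.
                conflicts[base] = {merged.pop(base), variants}
            else:
                merged[base] = variants
    return merged, conflicts
```

A registry would resolve each entry in the conflicts mapping by policy
and add exactly one entry per base character to the final table.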

2.3 Table for a zone that has no language restrictions

A registry that does not restrict the number of languages will probably
allow a much wider range of characters to be used in names. At the same
time, that registry cannot easily use character variants because
variants for one language will be different from the variants used in a
different language. To handle conflicting variants among languages, the
registry can choose to have no variants for any base characters, or can
choose to have variants for a subset of the languages that are
expressible in the characters allowed.


3. Table processing rules

The input to the process is called the "input label". The output of the
process is either failure (the input label cannot be registered at all),
or a registration bundle that contains one or more labels that have been
processed with ToASCII.

Processing the input label requires two versions of ToASCII: "standard
ToASCII" and "enhanced ToASCII". Standard ToASCII is exactly the same as
the ToASCII in [IDNA]. Enhanced ToASCII is standard ToASCII with the
steps from section 3.1 added.

Note that the process MUST be executed only once. The process MUST NOT
be run on any output of the process, only on the new label that was
input.


3.1 Creating enhanced ToASCII.

During the processing, a "temporary bundle" contains partial labels,
that is, labels that are being built and are not complete labels. The
partial labels in the temporary bundle consist of Unicode characters.

The following steps are added after step 2 but before step 3 of ToASCII.

2a) Split the input label into individual characters, called "candidate
characters". Compare each candidate character against the base
characters in the table. If any candidate character does not exist in
the set of base characters, the system MUST stop and not register any
names (that is, it MUST NOT register either the base name or any labels
that would have come from character variants).

2b) Continue the steps in standard ToASCII for the input label. If
ToASCII fails for the input label, the system MUST stop and not register
any of the labels (even if the other labels would have passed ToASCII).
If ToASCII succeeds, add the result to the registration bundle.

2c) Place one empty partial label in the temporary bundle. Then, for
each candidate character in the input label, do the following:

   2c1) Make a saved copy of all of the current partial labels in the
   temporary bundle, then append the candidate character to every
   partial label in the temporary bundle. If the base character that
   matches the candidate character has no variants, go to step 2c4.

   2c2) For each variant of the base character, do the following:

      2c2a) Duplicate all of the saved partial labels, append the
      variant to each duplicate, and add the duplicates to the
      temporary bundle.

      2c2b) If this is the last variant, go to step 2c3; otherwise,
      select the next variant, and go to step 2c2a.

   2c3) Discard the saved partial labels.

   2c4) If there are more candidate characters, select the next
   candidate character and go to step 2c1. Otherwise, go to step 2d.

2d) The temporary bundle now contains zero or more labels that consist
of Unicode characters. For each label in the temporary bundle:

   2da) Process the label with standard ToASCII.

   2db) If ToASCII succeeds, put the result in the registration bundle.
   Otherwise, do not put anything into the registration bundle.

   2dc) If there are more labels in the temporary bundle, select the
   next label and go to step 2da. Otherwise, go to step 2e.

2e) The resulting registration bundle has all the labels in ToASCII
encoding. Finish.
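
The overall effect of the steps in this section can be sketched in
Python. This is an illustration under stated assumptions, not a
reference implementation: the function name registration_bundle and the
dictionary form of the table are hypothetical, the nameprep and ToASCII
functions come from Python's standard encodings.idna module, and
Nameprep is called on every input (including all-ASCII input) as a
simplification. The expansion in step 2c is expressed as every
combination of each candidate character with its variants.

```python
from encodings.idna import ToASCII, nameprep
from itertools import product

def registration_bundle(input_label, table):
    """Return the set of ToASCII-encoded labels for input_label.

    `table` maps each allowed base character to a list of variant
    strings (an empty list means "allowed, no variants").
    Raises ValueError if the label cannot be registered at all.
    """
    prepared = nameprep(input_label)            # ToASCII step 2
    for ch in prepared:                         # step 2a
        if ch not in table:
            raise ValueError("U+%04X is not in the table" % ord(ch))
    try:                                        # step 2b
        bundle = {ToASCII(prepared).decode("ascii")}
    except UnicodeError as exc:
        raise ValueError("input label fails ToASCII") from exc
    # Steps 2c and 2d: try every combination of a candidate
    # character or one of its variants at each position.
    choices = [[ch] + list(table[ch]) for ch in prepared]
    for parts in product(*choices):
        try:
            bundle.add(ToASCII("".join(parts)).decode("ascii"))
        except UnicodeError:
            pass                                # failed labels are skipped
    return bundle
```

Because the bundle is a set, the input label that step 2b adds and the
identical label produced by the expansion are stored only once.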


4. Table format

The format of the table is meant to be machine-readable but not
human-readable. It is fairly trivial to convert the table into one
that can be read by people.

Each character in the table is given in the "U+" notation for Unicode
characters. The lines of the table are terminated with either a carriage
return character (ASCII 0x0D), a linefeed character (ASCII 0x0A), or a
sequence of carriage return followed by linefeed (ASCII 0x0D 0x0A). The
order of the lines in the table does not matter.

Each line in the table starts with the character that is allowed in the
registry. If that character has any variants, the base character is
followed by a vertical bar character ("|", ASCII 0x7C) and the variant
string. If the base character has more than one variant, the variants
are separated by a colon (":", ASCII 0x3A). Strings are given without
any intervening spaces.

The following is an example of how a table might look. The entries in
this table are purposely silly and should not be used by any registry as
the basis for choosing variants. For the example, assume that the
registry:
- allows the FOR ALL character (U+2200) with no variants
- allows the COMPLEMENT character (U+2201) which has a single variant
  of LATIN CAPITAL LETTER C (U+0043)
- allows the PROPORTION character (U+2237) which has one variant which
  is the string COLON (U+003A) COLON (U+003A)
- allows the PARTIAL DIFFERENTIAL character (U+2202) which has two
  variants: LATIN SMALL LETTER D (U+0064) and GREEK SMALL LETTER DELTA
  (U+03B4)

The table would look like:
U+2200
U+2201|U+0043
U+2237|U+003AU+003A
U+2202|U+0064:U+03B4

The registry's table MUST NOT have more than one entry for a particular
base character.

Implementors of table processors should remember that there are tens of
thousands of characters whose codepoints are greater than 0xFFFF. Thus,
any program that assumes that each character in the table is represented
in exactly six octets ("U", "+", and exactly four octets representing
the character value) will fail with tables that use characters whose
value is greater than 0xFFFF.
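
A table in this format can be read with a short program. The following
Python fragment is a sketch only; the names parse_table and _decode are
hypothetical, and it assumes the colon separator between variants that
is stated in the text above. It accepts four to six hexadecimal digits
after "U+" so that characters above 0xFFFF are handled, and it rejects
a table that has more than one entry for a base character.

```python
import re

_CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")
_STRING = re.compile(r"(?:U\+[0-9A-Fa-f]{4,6})+")

def _decode(text):
    """Turn 'U+003AU+003A' into the string '::'."""
    if not _STRING.fullmatch(text):
        raise ValueError("malformed character string: %r" % text)
    return "".join(chr(int(h, 16)) for h in _CODEPOINT.findall(text))

def parse_table(text):
    """Parse a table into {base character: [variant strings]}."""
    table = {}
    for line in text.splitlines():      # accepts CR, LF, and CR LF
        if not line:
            continue
        base_text, _, variants_text = line.partition("|")
        base = _decode(base_text)
        if len(base) != 1:
            raise ValueError("base must be one character: %r" % line)
        if base in table:
            raise ValueError("duplicate entry for U+%04X" % ord(base))
        variants = variants_text.split(":") if variants_text else []
        table[base] = [_decode(v) for v in variants]
    return table
```

Keeping each variant as a string rather than a single character lets a
variant such as COLON COLON be stored in the same way as a
one-character variant.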


5. Steps after registering an input label

A registry has three options for how to handle the case where
the registration bundle has more than one label. The policy options are:

1) Allocate all labels to the same registrant, making
the zone information identical to that of the input label.

2) Block all labels so they cannot be registered in the
future.

3) Allocate some labels and block some other labels.

Option 1 will cause end users to be able to find names with variants
more easily, but will result in larger zone files. For some
language tables, the zone file could become so large that it
could negatively affect the ability of the registry to perform name
resolution.

Option 2 does not increase the size of the zone file, but it
may cause end users to not be able to find names with variants
that they would expect.

Option 3 is likely to cause the most confusion with users because
including some variants will cause a name to be found, but using
other variants will cause the name to be not found.

With any of these three options, the registry MUST keep a database that
links each label in the registration bundle to the input label. This link
needs to be maintained so that changes in the non-DNS registration
information (such as the label's owner name and address) are reflected in
every member of the registration bundle as well.

If the registry chose option 1, when the zone information for the input
label changes, the zone information for all the members of the
registration bundle MUST change in exactly the same way. The zone
information for every member of the registration bundle MUST remain
identical as long as any of the members of the registration bundle
remain in the zone. A registry can keep the zone information for the
registration bundle identical using a database, or using DNAME records,
or using a combination of the two.

If the registry chose option 2, when the zone information for the input
label changes, the blocked information for all the members of the
registration bundle MUST be identical to that of the input label, and
MUST remain identical as long as the input label remains in the zone. A
registry can keep the zone and blocked name information for the
registration bundle identical using a database.

If the registry chose option 3, it must use an unspecified method to
keep the elements in the registration bundle cohesive. This option
SHOULD NOT be used except under carefully-controlled circumstances.
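
One way to picture the database link required above is a single shared
record per registration bundle. The following Python fragment is a
sketch under assumptions: the Bundle and BundleDatabase names and their
fields are hypothetical, only options 1 and 2 are modelled, and a real
registry would use persistent storage. Every label in a bundle points
at the same record, so a change to the registrant information reaches
every member.

```python
from dataclasses import dataclass, field

@dataclass
class Bundle:
    input_label: str
    registrant: str
    allocated: set = field(default_factory=set)   # labels in the zone
    blocked: set = field(default_factory=set)     # labels withheld

class BundleDatabase:
    def __init__(self):
        self._by_label = {}     # every bundle member -> its Bundle

    def register(self, input_label, labels, registrant, allocate_all):
        labels = set(labels) | {input_label}
        taken = labels.intersection(self._by_label)
        if taken:
            raise ValueError("already allocated or blocked: %s"
                             % sorted(taken))
        bundle = Bundle(input_label, registrant)
        if allocate_all:                        # option 1
            bundle.allocated = labels
        else:                                   # option 2
            bundle.allocated = {input_label}
            bundle.blocked = labels - {input_label}
        for label in labels:
            self._by_label[label] = bundle
        return bundle

    def update_registrant(self, label, registrant):
        # One shared record, so the change reaches every member.
        self._by_label[label].registrant = registrant
```

Refusing a registration when any member of the new bundle is already
allocated or blocked is a design choice of this sketch; it is what
makes a blocked label unavailable to later registrants.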


6. Examples

The following shows examples of the first two of the registry's options.
Both examples assume that the registry for the zone example.com uses the
following very short table, which says that LATIN SMALL LETTER L
(U+006C) has a single variant, DIGIT ONE (U+0031).

U+006C|U+0031

A registrant approaches the zone and requests a registration for the
name pale.example.com, for which there are two name servers
(x.example.com and y.example.com). After processing the input label
"pale", the registration bundle contains "pale" and "pa1e".

6.1 Example 1: allocating multiple labels

Assume that the registry for the zone example.com uses option 1
(allocating multiple labels) as its registration policy.

The registry allocates pale.example.com and pa1e.example.com to the
registrant. The registry also creates a link in its registration
database from pa1e.example.com to pale.example.com so that any changes
to either the non-zone information or the zone information for one name
will be reflected in the other name.

The registry adds the following four records to the example.com zone:

  $ORIGIN example.com.
  pale IN NS x.example.com.
  pale IN NS y.example.com.
  pa1e IN NS x.example.com.
  pa1e IN NS y.example.com.

Note that the registry can instead use DNAME records for allocating
labels. If the registry uses DNAMEs, the registry would instead add
the following three records to the example.com zone:

  $ORIGIN example.com.
  pale IN NS x.example.com.
  pale IN NS y.example.com.
  pa1e IN DNAME pale.example.com.

An end user who requests the name server for pa1e.example.com will get a
positive response with the correct information.

6.2 Example 2: blocking labels

Assume that the registry for the zone example.com uses option 2
(blocking labels) as its registration policy.

The registry allocates pale.example.com to the registrant and blocks
pa1e.example.com from being registered by anybody. The registry also
creates a link in its registration database from pa1e.example.com to
pale.example.com so that any changes to the non-zone information for
pale.example.com will be reflected in the blocked name.

The registry adds the following two records to the example.com zone:

  $ORIGIN example.com.
  pale IN NS x.example.com.
  pale IN NS y.example.com.

An end user who requests the name server for pa1e.example.com will get a
response of "no such name".
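
The zone records in the two examples above can be generated from the
registration bundle. The following Python fragment is a sketch, not a
provisioning tool: the function name zone_records and its parameters
are hypothetical, and the output is a list of zone-file lines relative
to the given origin.

```python
def zone_records(input_label, bundle, name_servers, origin,
                 allocate_all, use_dname=False):
    """Return zone-file lines (relative to `origin`) for one bundle."""
    records = ["%s IN NS %s" % (input_label, ns) for ns in name_servers]
    if not allocate_all:        # option 2: variants stay out of the zone
        return records
    for label in sorted(set(bundle) - {input_label}):
        if use_dname:
            # Point the variant at the input label with one DNAME.
            records.append("%s IN DNAME %s.%s"
                           % (label, input_label, origin))
        else:
            # Repeat the input label's delegation for the variant.
            records.extend("%s IN NS %s" % (label, ns)
                           for ns in name_servers)
    return records
```

For the bundle "pale" and "pa1e" with name servers x.example.com. and
y.example.com., this produces the four NS records of Example 1, the
three records of the DNAME alternative, or the two records of
Example 2, depending on the arguments.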


7. Owner implications of multiple labels

The creation of a registration bundle for equivalent or near-equivalent
labels in a zone at the time of registration leads to many delegations.
This leads to records in parallel zones which MUST be synchronized. That
is, the owner of a registration bundle MUST keep the same information in the
zone for each label in the bundle.

Using the examples from section 6, assume that the owner of the labels
"pale" and "pa1e" creates a subdomain, "www". If the owner of
"example.com" used multiple delegations for the labels, the owner of
"pale" and "pa1e" would use two records:

  $ORIGIN pale.example.com.
  www IN A 1.2.3.4

  $ORIGIN pa1e.example.com.
  www IN A 1.2.3.4

An alternative for these two records, which helps the registrant
keep their names in synch, would be:

  $ORIGIN pale.example.com.
  www IN A 1.2.3.4

  $ORIGIN pa1e.example.com.
  www IN CNAME www.pale.example.com.

If the owner of "example.com" used a DNAME record to make "pale" and
"pa1e" equivalent, the owner of "pale" and "pa1e" could instead use one
record:

  $ORIGIN pale.example.com.
  www IN A 1.2.3.4


8. Security considerations

Apart from considerations listed in the IDNA specification, this
document explicitly talks about equivalences that a registry can define
as part of the policy which can be applied in a zone. A registry can
apply an equivalence table which solves some problems with homographs
already outlined in the security consideration section of IDNA. This
might be considered good for security because it will reduce the
possible confusion for the user, and lower the risk that the user will
"connect" to a service which was not intended.


9. References

9.1 Normative References

[IDNA] "Internationalizing Domain Names in Applications (IDNA)",
draft-ietf-idn-idna.

[NAMEPREP] "Nameprep: A Stringprep Profile for Internationalized Domain
Names", draft-ietf-idn-nameprep.

[RFC2119] "Key words for use in RFCs to Indicate Requirement Levels",
March 1997, RFC 2119.

[UNICODE] The Unicode Consortium. The Unicode Standard, Version 3.2.0 is
defined by The Unicode Standard, Version 3.0 (Reading, MA,
Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode
Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/)
and by the Unicode Standard Annex #28: Unicode 3.2
(http://www.unicode.org/reports/tr28/).

9.2 Non-normative References

[IDN-ADMIN] "Internationalized Domain Names Registration and
Administration Guideline for Chinese, Japanese and Korean",
draft-jseng-idn-admin.


10. IANA considerations

There are no IANA considerations for this document. The tables described
in this document can be created by anyone. Tables at IANA are often
considered to be authoritative, but languages have no one who is
authoritative for them. It is unclear what value, if any, there is for
someone to know what table a particular zone says it is using for
registration. Further, the tables are expected to be updated at
irregular times as new characters are added to the list of acceptable
characters. Therefore, it is probably unwise for IANA to keep a registry
of these tables.


11. Author's address

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA  95060  USA
phoffman@imc.org