Network Working Group J. Klensin
Internet-Draft February 15, 2004
Expires: August 15, 2004
Registration of Internationalized Domain Names: Overview and Method
draft-klensin-reg-guidelines-02.txt
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 15, 2004.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
The IETF has introduced standards-track mechanisms to enable the use of
"internationalized", i.e., non-ASCII, names in the DNS and
applications that use it. This has led, in turn, to concerns that
characters with similar meanings or appearance could cause user
confusion and opportunities for deliberate deception and fraud. Part
of this problem can be addressed by limiting, on a per-zone (or
per-registry) basis, the specific characters that can be used to be a
subset of the list allowed by the standard and by creating
"reservations" of labels that might create confusion with those that
are permitted. The model for doing this for languages that use
characters that originated with Chinese has been extensively
developed in another document. This document discusses some of the
issues in that design and relates them to considerations and
Klensin Expires August 15, 2004 [Page 1]
Internet-Draft IDN Registration February 2004
mechanisms that might be appropriate for other languages and scripts,
especially those involving alphabetic characters.
In particular, it describes some suggested practices for registering
internationalized domain names (IDNs) in a zone. Before accepting
such registrations of domain names into a zone, the zone's registry
should decide which codepoints in the Unicode character set the zone
will accept. The registry should also decide whether particular
characters in a registered domain name should cause registration of
multiple equivalent domain names; these domain names might be added
to the zone or blocked from registration. This document also
describes how to handle character variants in registering IDNs, and
how to publish tables that list the character variants.
This document is intended to supply a basis for adapting methods
developed for Chinese, Japanese, and Korean to other languages and
scripts. If these adaptations are made carefully and with due
consideration for local issues, the likelihood of problematic DNS
registrations will be significantly reduced. A specific method is
introduced that should be applicable (directly, or with minor
modifications) to many scripts.
Table of Contents
1. Introduction
1.1 Background
1.2 Terminology
1.2.1 Characters, variants, registrations, and other issues
1.2.2 Confusion, fraud, and cybersquatting
1.3 A Review of the JET Guidelines
1.3.1 JET model
1.3.2 Reserved Names and Label Packages
1.4 Languages, Scripts, and Variants
1.4.1 Languages and Scripts
1.4.2 Variant Selection
1.5 Reservations and Exclusions
1.5.1 Sequence Exclusions for Valid Characters
1.5.2 Character Pairing Issues
1.6 The Registration Bundle
1.6.1 Definitions and Structure
1.6.2 Application of the Registration Bundle
2. Some Implications of this Approach
3. Required Modifications to JET Model Needed Under Some of
the Models Above
4. Conclusions and Recommendations about the General
Approach
5. A Model Table format
6. A Model Registration Procedure --"CreateBundle"
6.1 Description of CreateBundle
7. Security Considerations
8. Acknowledgements
References
Author's Address
Intellectual Property and Copyright Statements
1. Introduction
1.1 Background
Once work on the basic model for encoding non-ASCII strings in the
DNS with IDNA ([1], [2], [3]) was nearing completion, it became clear
that it would be desirable for registries to impose additional
restrictions on the names that could actually be registered (e.g.,
see [6]) as a means of reducing potential confusion among characters
that were similar in some way. These restrictions were, in many
respects, part of a long tradition. For example, while the original
DNS specifications [4] permitted any string of octets to be used in a
DNS label, they also recommended the use of a much more restricted
subset, one that was derived from the much older "hostname" rules [7]
and defined by the "LDH" (for "letter digit hyphen", the three
permitted types of characters) convention. Enforcement of those
restricted rules in registrations was the responsibility of the
registry or domain administrator. They were not embedded in the DNS
protocol itself, although some application protocols, notably those
concerned with electronic mail, imposed and enforced similar rules.
If there are no constraints on registration in a zone, people can
register characters that increase the risk of misunderstandings,
cybersquatting, and other forms of confusion. A similar situation
existed even before the introduction of IDNA as exemplified by domain
names such as example.com and examp1e.com (note that the latter
domain contains the digit "1" instead of the letter "l").
For non-ASCII names (so-called "internationalized domain names" or
"IDNs"), the problem was more complicated than that which led to the
"LDH" (hostname) rules. In the earlier situation, all protocols,
hosts, and DNS zones used ASCII exclusively in practice, so the LDH
restriction could reasonably be applied uniformly across the
Internet. With the introduction of a very large character repertoire,
and different locations and languages considering different
characters important, the optimal registration restrictions became
not a global matter but one that differed among areas and, hence,
among DNS zones.
For some human languages, there are characters and/or strings that
have equivalent or near-equivalent usages. If someone is allowed to
register a name with such a character or string, the registry might
want to automatically associate all the names that have the same
meaning with the registered name. The registry can also decide if the
names that came from one registration should go into the zone, be
blocked from other people registering them, or a combination of these
two actions.
To date, the best-developed system for handling registration
restrictions for IDNs is the JET Guidelines for Chinese, Japanese,
and Korean [5], the so-called "CJK" languages. That system is
limited to those languages and, in particular, to their common script
base. This document explores the principles behind those guidelines
and some of the issues that might arise in trying to adapt them to
alphabetic languages.
This document describes five things:
o The general background and considerations for non-ASCII scripts in
names. Just as the JET Guidelines contain some suggestions that
may not be applicable to alphabetic scripts, some of the
suggestions here, especially the more specific ones, may be
applicable to some scripts and not others.
o Suggested practices for describing character variants
o A method for using a zone's character variants to determine which
names should be associated with a registration
o A format for publishing a zone's table of character variants
o A model algorithm for name registration given the presence of
language tables.
1.2 Terminology
1.2.1 Characters, variants, registrations, and other issues
1. Characters in this document are given as their Unicode
codepoints in U+xxxx format or with their official names.
2. The following terms are used in this document.
3. A "string" is a sequence of one or more characters.
4. This document discusses characters that may have equivalent or
near-equivalent characters or strings. The "base character" is
the character that has zero or more equivalents. In the JET
Guidelines, base characters are referred to as "valid
characters".
5. The "variant(s)" are the character(s) and/or string(s) that are
equivalent to the base character. Note that these might not be
true equivalent characters: a base character might have a
mapping to a particular variant character, but that variant
character does not have to have a mapping to the base character.
Usually, characters or strings to be designated as variants are
considered either equivalent or sufficiently similar (by some
registry-specific definition) that confusion between them and
the base character might occur.
6. The "base registration" is the single name that the registrant
requested from the registry.
7. A label (or "name") is described as "registered" if it is
actually entered into a domain (i.e., a zone file) by the
registry, so that it can be accessed and resolved using standard
DNS tools. The JET Guidelines describe a "registered" label as
"activated".
8. A "registration bundle" is the set of all labels that come from
expanding the base characters for a single name into their
variants. The presence of a label in a registration bundle does
not imply that it is registered. In the JET Guidelines, a
registration bundle is called an "IDN Package".
9. A "reserved label" is a label in a registration bundle that is
not actually registered.
10. A "registry" is the administrative authority for a DNS zone.
That is, the registry is the body that enforces, and typically
makes, policies that are used in a particular zone in the DNS.
11. A coded character set ("CCS"): A term for a list of characters
and the code positions assigned to them. ASCII and Unicode are
CCSs.
12. A language: Something spoken by humans, independent of how it is
written or coded. ISO Standard 639 and IETF BCP 47 (RFC 3066)
[8] list and define codes for identifying languages.
13. Script: a collection of characters (glyphs, independent of
coding) that are used together, typically to represent one or
more languages. Note that the script for one language may
heavily overlap the script for another without their having
identical scripts.
14. Charset: An IETF-invented term to describe, more or less, the
combination of a script, a CCS that encodes that script, and
rules for serializing the bytes when those are stored on a
computer or transmitted over the network.
The last four of these definitions are redundant with, but
deliberately somewhat less precise than, the definitions in [12],
which also provides sources. The two sets of definitions are
intended to be consistent.
1.2.2 Confusion, fraud, and cybersquatting
The term "confusion" is used very generically in this document to
cover the entire range from accidental user misperception of the
relationship between characters with some characteristic in common
(typically appearance, sound, or meaning) to cybersquatting and
[other] deliberate fraudulent attempts to exploit those
relationships.
1.3 A Review of the JET Guidelines
1.3.1 JET model
In the JET Guidelines model, a prospective registrant approaches the
registry for a zone (perhaps through an intermediate registrar) with
a candidate base registration --a proposed name to be registered--
and a list of languages in which that name is to be interpreted. The
languages are defined according to the fairly high-resolution coding
of [8] -- Chinese as used on the mainland of the People's Republic of
China ("zh-cn") can, at registry option, be coded differently and
represented by a separate table compared to Chinese as used in Taiwan
("zh-tw").
The design of the JET Guidelines took one important constraint as a
basis: IDNA was treated as a firm standard. A procedure that
modified some portion of the IDNA functions, or was a variant on
them, was considered a violation of those standards and should not be
encouraged (or, probably, even permitted).
Each registry is expected to construct (or obtain) a table for each
language it considers relevant and appropriate. These tables list,
for the particular zone, the characters permitted for that language.
If a character does not appear as a "valid code point" (called a
"base character" in the rest of this document) in that table, then a
name containing it cannot be registered. If multiple languages are
listed for the registration, then the character must appear in the
tables for each of those languages.
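The validity rule just described can be sketched in code. This is a
minimal illustration and not part of the JET Guidelines: the table
contents, the name LANGUAGE_TABLES, and the helper is_registrable are
hypothetical, and real tables would be far larger.

```python
# Hypothetical language tables: each maps a language tag to the set of
# base characters ("valid code points") the zone permits for it.
LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
LANGUAGE_TABLES = {
    "sv": LDH | {"\u00e5", "\u00e4", "\u00f6"},
    "no": LDH | {"\u00e5", "\u00e6", "\u00f8"},
}


def is_registrable(label, languages):
    """Return True if every character of label is a base character
    in the table of every language listed for the registration."""
    if not languages:
        return False
    for lang in languages:
        table = LANGUAGE_TABLES.get(lang)
        if table is None:
            return False  # the zone has no table for this language
        if any(ch not in table for ch in label):
            return False
    return True
```

With these illustrative tables, a label containing U+00F6 is
acceptable for "sv" alone but not for a registration that lists both
"sv" and "no", since the character is absent from the second table.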
The tables may also contain columns that specify alternate or variant
forms of the valid character. If these variants appear, they are
used to synthesize labels that are alternatives to the original one.
These labels are all reserved and can be registered or "activated"
(placed into the DNS) only by the action or request of the original
registrant; some (the "preferred variant labels") are typically
registered automatically. The zone is expected to establish
appropriate policies for situations in which the variant forms of one
label conflict with already-reserved or already-registered labels.
Most of these concepts were introduced because of concerns about
specific issues with CJK characters, beginning from the requirement
that the use of Simplified Chinese by some registrants and
Traditional Chinese by others not be permitted to create confusion or
opportunities for fraud. While they may be applicable to registry
tables constructed for alphabetic scripts, the transfer should be done
with care, since many analogies are not exact.
Some of the important issues are discussed in the sections that
follow. The JET model may be considered as a specialized variation on
the model and method presented by the rest of this document. Other
languages or scripts may require other variations.
1.3.2 Reserved Names and Label Packages
A basic assumption of the JET model is that, if the properties of
Unicode [9], [10], IDNA, or the evolution of specific characters,
cause two strings to appear similar enough to cause confusion, either
or both should be registered by the same party or one of them should
become unregisterable. The definition of "appear similar enough"
will differ for different cultures and circumstances --and hence DNS
zones-- but the principle is fairly general. In the JET model, all
of the "variant" strings are identified, some are placed into the DNS
automatically, and others are simply reserved and can be activated,
if at all, only by the original registrant. Other zones might find
other policies appropriate. For example, a zone might conclude that
having similar strings registered in the DNS was undesirable. If so,
the list of variant labels would be used only to build a list of
names that would be reserved and not able to be registered.
1.4 Languages, Scripts, and Variants
1.4.1 Languages and Scripts
Conversations about scripts -- collections of characters associated
with particular languages -- are common when discussing character
sets and codes. But the boundaries between one script and another
are not well-defined. The Unicode Standard [9][10], for example,
does not define them at all, even though it is structured in terms of
usually-related blocks of characters. The issue is complicated by
the common origin of most alphabetic scripts (Cf. [11]), with certain
character-symbols appearing in the scripts associated with multiple
languages, sometimes with very different sounds or meanings. This
differs from the CJK situation in which, if a character appears in
more than one of the relevant languages, it will almost always have
the same interpretation in each one and, at least for the subset of
characters that actually are ideographs, pronunciation is expected to
vary widely while meaning is preserved. At least in part because of
that similarity of meaning, it made sense in the JET case to permit a
registration to specify multiple languages, to verify that the
characters in the label string were valid for each, and then to
generate variant labels using each language in turn. For many
alphabetic languages, it may make sense to prohibit the label string
submitted for registration from being associated with more than one
language. Indeed, "one label, one language" has been suggested as an
important barrier against common sources of "look-alike" confusion.
For example, the imposition of that rule in a zone would prevent the
insertion of a few Greek or Cyrillic characters with shapes identical
to the Latin ones into what was otherwise a Latin-based string. For
a particular table, the list of valid characters may be thought of as
the script associated with the relevant language, with the
understanding that the table design does not prevent the same
character from appearing in the tables for multiple languages.
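The look-alike insertion described above can be detected, very
roughly, by checking whether a label mixes scripts. The sketch below
is a simpler proxy for the "one label, one language" rule, not the
rule itself, which depends on per-language tables. It assumes that
the first word of a character's Unicode name is an adequate stand-in
for its script; the function names and the NEUTRAL set are
hypothetical.

```python
import unicodedata

# Characters treated as script-neutral in this sketch (an assumption).
NEUTRAL = set("0123456789-")


def rough_scripts(label):
    """Return the set of rough script names used in label.

    The first word of the Unicode character name (e.g. LATIN, GREEK,
    CYRILLIC) is used as a crude stand-in for the script; Python's
    unicodedata module does not expose the Unicode Script property.
    """
    scripts = set()
    for ch in label:
        if ch in NEUTRAL:
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        scripts.add(name.split(" ")[0])
    return scripts


def is_single_script(label):
    """True if all non-neutral characters share one rough script."""
    return len(rough_scripts(label)) <= 1
```

A label such as "example" passes, while the same string with
CYRILLIC SMALL LETTER A (U+0430) substituted for the Latin "a" is
reported as mixing two scripts.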
Indeed, this notion of a locally, and specifically, identified script
can be turned around: while the tables are referred to as "language
tables", they are associated with languages only insofar as thinking
about the character structure and word forms associated with a given
language helps to inform the construction of a table. A country like
Finland, for example, might select among
o One table each for Finnish, Swedish, and English characters and
conventions, permitting a string to be registered in one, two, or
all three languages (although a three-language registration would
presumably prohibit any characters that did not appear in all
three languages).
o One table each, but with a "one label, one language" rule for the
zone.
o A combined table based on the observation that all three writing
systems were based on Roman characters and that the possibilities
for confusion that were of interest to the registry would not be
reduced by "language" differentiation.
Regardless of what decisions were made about those languages and
scripts, if they also decided to permit registrations of labels
containing Cyrillic characters, they might have a separate table for
them. That table might contain some Roman-derived characters (either
as base characters or as variants) just as some CJK tables do. See
also Section 2, below.
It is also worth stressing, as the JET Guidelines do, that no tables
or systems of this type -- even if identified with languages as a
means of defining or describing those tables -- can assure linguistic
or even syntactic correctness of labels with regard to that language.
That level of assurance may not be possible without human
intervention or at least dictionary lookups of complete proposed
labels. It may even not be desirable to attempt that level of
correctness (see Section 2).
Of course, if any language-based tests or constraints, including "one
label, one language", are to be applied to limit those sources of
confusion, each zone must have a table for each language in which it
expects to accept registrations; the notion of a single combined
table for the zone is, in the general case, simply unworkable. One
could use a single table for the zone if the intent were to impose
only minimal restrictions, e.g., to force alphabetic and numeric
characters only and exclude symbols and punctuation. That type of
restriction might be useful in eliminating some problems, such as
those of unreadable labels, but would be unlikely to be very helpful
with, e.g., confusion caused by similar-looking characters.
1.4.2 Variant Selection
The area of character variants is rife with problems. There is no
universal agreement about which base characters have variants, or if
they do, what those variants are. For example, in some regions of the
world and in some languages, LATIN SMALL LETTER O WITH DIAERESIS and
LATIN SMALL LETTER O WITH STROKE are variants of each other, while in
other regions, most people would think that LATIN SMALL LETTER O WITH
DIAERESIS has no variants. In some cases, the list of variants is
difficult to enumerate. For example, it required several years for
the Chinese language community to create variant tables for use in
IDNA, and it remains, at the time of this writing, questionable how
widely those tables will be accepted among users of Chinese from
areas of the world other than those represented by the group that
created them.
Thus, the first thing a registry should ask is whether or not any of
the characters that they want to use have variants. If not, the
registry's work is much simpler. This is not to say that a registry
should ignore variants if they exist: adding variants after a
registry has started to take registrations is nearly as difficult
administratively as removing characters from the list of acceptable
characters. That is, if a registry later decides that two characters
are variants of each other, and there are actively-used names in the
zone that differ only in the new variants, the registry might have
to transfer ownership of one of the names to a different owner, using
some process that is certain to be controversial.
The list of character variants used in a zone should be stable.
Although it is possible to add variants for characters later, doing
so can cause confusion among registrants.
Of course, zone managers should inform all current registrants when
the registration policy for the zone changes. This includes when IDN
characters are allowed in the zone the first time, when characters
are added later, and when character variant tables change.
In many languages there are two variants for a character, but one
variant is strongly preferred. A registry might only allow the base
registration in the preferred form, or it might allow any form for
the base registration. If the variant tables are created carefully,
the resulting bundles will be the same, but some registries will give
special status to the base registration such as its appearance in
whois databases.
1.5 Reservations and Exclusions
1.5.1 Sequence Exclusions for Valid Characters
The JET Guidelines are based on processing only single characters.
Any processing of pairs or longer sequences of characters is left to
what that document describes as "additional processing" -- procedures
specifically permitted by the Guidelines but defined by a registry in
addition to the variant table processing specified in the Guidelines
themselves. A different zone, with different needs, could use a
modified version of the table structure, or different types of
additional processing, to prohibit, as well as accept, particular
sequences of characters by marking them as invalid. Other
modifications or extensions might be designed to prevent certain
letters from appearing at the beginning or end of labels. The use of
regular expressions in the "valid characters" column might be one
way to implement these types of restrictions.
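One way such sequence and position restrictions might be expressed
is sketched below. The specific rules are arbitrary examples chosen
for illustration, and the names PROHIBITED_PATTERNS and
violates_sequence_rules are hypothetical; a registry would define its
own patterns as part of its "additional processing".

```python
import re

# Hypothetical zone rules expressed as regular expressions.
PROHIBITED_PATTERNS = [
    re.compile(r"^-"),   # label must not begin with a hyphen
    re.compile(r"-$"),   # label must not end with a hyphen
    re.compile(r"ae"),   # example of a prohibited two-letter sequence
]


def violates_sequence_rules(label):
    """Return True if label matches any prohibited pattern."""
    return any(p.search(label) is not None for p in PROHIBITED_PATTERNS)
```

Keeping the patterns in a list separate from the character table lets
the single-character table processing remain unchanged while the
sequence rules evolve independently.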
In particular, in some scripts derived from Roman characters,
sequences that have historically been typographically represented by
single "ligature" or "digraph" characters may also be represented by
the separate characters (e.g., "ae" (U+00E6) or "ij" (U+0133)). If
it is desired to either prohibit these, or to treat them as variants,
some extensions to the single-character JET model may be needed (as
may be some careful thinking about IDNA (especially nameprep), since
some of these combinations are excluded there).
1.5.2 Character Pairing Issues
Some character pairings -- the use of a character form (glyph) in one
language and a different form with the same properties in a related
one -- closely approximate the issues with mapping between
Traditional and Simplified Chinese although the history is different.
For example, it might be useful to have "o" with a stroke (U+00F8) as
a variant for "o" with diaeresis above it (U+00F6) (and the
equivalent upper-case pair) in a Swedish table, and vice versa in a
Norwegian one, or to prohibit one of these characters entirely in
each table. In a German table, U+00F8 would presumably be prohibited,
while U+00F6 might have "oe" as a variant. Obviously, if the relevant
language of registration is unknown, this type of variant matching
cannot be applied in any sensible way.
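The per-language pairings in this example can be written down as
data. The following is only a sketch of the three tables mentioned
above; the structure, the name VARIANT_TABLES, and the helper
variants_for are hypothetical and do not reflect any published table
format.

```python
# Hypothetical variant tables keyed by language tag. Each maps a base
# character to its variants, which may be characters or strings.
VARIANT_TABLES = {
    "sv": {"\u00f6": ["\u00f8"]},  # o with diaeresis -> o with stroke
    "no": {"\u00f8": ["\u00f6"]},  # o with stroke -> o with diaeresis
    "de": {"\u00f6": ["oe"]},      # o with diaeresis -> "oe"
}


def variants_for(ch, language):
    """Return the variants of ch under the table for language.

    Raises ValueError when the language is unknown, because variant
    matching cannot sensibly be applied without a table.
    """
    if language not in VARIANT_TABLES:
        raise ValueError("no variant table for language %r" % (language,))
    return list(VARIANT_TABLES[language].get(ch, []))
```

The same base character (U+00F6) yields different variants depending
on the table consulted, which is why the language of registration
must be known before variants are generated.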
1.6 The Registration Bundle
1.6.1 Definitions and Structure
As one of its critical innovations, the JET model defines an "IDN
package", known in this document as a "registration bundle", which
consists of the primary registered string (which is used as the name
of the bundle), the information about the language table(s) used, the
variant labels for that string, and indications of which of those
labels are registered in the relevant zone file ("activated" in the
JET terminology). Registration bundles are also atomic -- one cannot
add variant labels to a bundle, or remove them from it, without
unregistering the entire package. A label exists in only one
registration bundle at a
time; if a new label is registered that would generate a variant that
matches one that appears in an existing package, that variant simply
is not included in the second package. A subsequent deregistration
of the first package does not cause the variant to be added to the
second. While it might be possible to change this in other models,
the JET conclusion was that other options would be far too complex to
implement and operate and would cause many new types of name
conflicts.
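The bundle behavior described above can be modeled with a small
in-memory structure. This is a toy sketch under stated assumptions,
not the JET procedure: the class and method names are hypothetical,
and rejecting a request whose base label already belongs to another
bundle is an assumption made for this illustration.

```python
class Registry:
    """Toy registry tracking which bundle holds each label."""

    def __init__(self):
        self.owner_of = {}  # label -> base label of the owning bundle
        self.bundles = {}   # base label -> set of labels in the bundle

    def register(self, base, variant_labels):
        """Create a bundle named by base; return the labels it holds.

        Variants already held by another bundle are left out of the
        new bundle rather than causing the request to fail.
        """
        if base in self.owner_of:
            raise ValueError("label already registered or reserved")
        labels = {base}
        for variant in variant_labels:
            if variant not in self.owner_of:
                labels.add(variant)
        for label in labels:
            self.owner_of[label] = base
        self.bundles[base] = labels
        return labels

    def unregister(self, base):
        """Remove the whole bundle; bundles are atomic."""
        for label in self.bundles.pop(base):
            del self.owner_of[label]
```

Note that unregister does not revisit other bundles, so a variant
that was omitted from a later bundle is not added to it when the
earlier bundle is removed.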
1.6.2 Application of the Registration Bundle
A registry has three options for how to handle the case where the
registration bundle has more than one label. The policy options are:
1. Resolve all labels in the zone, making the zone information
identical to that of the registered label. This option will cause
end users to be able to find names with variants more easily, but
will result in larger zone files. For some language tables, the
zone file could become so large that it could negatively affect
the ability of the registry to perform name resolution. If the
base registration contains several characters that have
equivalents, the owner could end up having to take care of a large
number of zones. For instance, if DIGIT ONE is a variant of LATIN
SMALL LETTER L, the owner of the domain name
all-lollypops.example.com will have to manage 32 zones.
2. Block all labels other than the registered label so they cannot
be registered in the future. This option does not increase the
size of the zone file and provides maximum safety against false
positives, but it may cause end users to not be able to find
names with variants that they would expect. If the base
registration contains characters that have equivalents, Internet
users who don't know which base characters were used in the
registration will not know what character to type in to get a DNS
response. For instance, if DIGIT ONE is a variant of LATIN SMALL
LETTER L, and LATIN SMALL LETTER L is a variant of DIGIT ONE, the
user who sees "pale.example.com" will not know whether to type a
"1" or a "l" after the "pa" in the first label.
3. Resolve some labels and block some other labels. This option is
likely to cause the most confusion with users because including
some variants will cause a name to be found, but using other
variants will cause the name to be not found. For example, even
if people understood that DIGIT ONE and LATIN SMALL LETTER L were
variants, a typical DNS user wouldn't know which character to
type because they wouldn't know whether this pair were allocating
variants or blocking variants. However, this option can be used
to balance the desires of the name owner (that every possible
attempt to enter their name will work) with the desires of the
zone administrator (to make the zone more manageable and possibly
to be compensated for greater amounts of work needed for a single
registration). For many circumstances, it may be the most
attractive option.
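The growth in bundle size mentioned under the first option is easy to
verify. The sketch below expands a label by substituting each
variant independently at each position; the function name expand and
the dictionary layout are hypothetical. The label "all-lollypops"
contains five occurrences of LATIN SMALL LETTER L, so a single
variant for that letter yields 2**5 labels.

```python
from itertools import product


def expand(label, variants):
    """Return the set of labels obtained by replacing each character
    of label with itself or any of its variants."""
    choices = [[ch] + list(variants.get(ch, [])) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


bundle = expand("all-lollypops", {"l": ["1"]})
print(len(bundle))  # 32
```

Each additional character with variants multiplies the bundle size,
which is the main operational argument for not resolving every label
in the zone.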
In all cases, at least the registered label should appear in the
zone. It would be almost impossible to describe to name owners why
the name that they asked for is not in the zone, but some other name
that they now control is. By implication, if the requested label is
already registered, the entire registration request must be rejected.
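The three policy options, together with the rule that the requested
label always appears in the zone, can be summarized as a function
that splits a bundle into registered and reserved labels. The policy
names and the function apply_policy are hypothetical labels invented
for this sketch.

```python
def apply_policy(base, bundle, policy, preferred=()):
    """Split a bundle into (registered, reserved) sets of labels.

    policy is "resolve_all", "block_all", or "mixed"; for "mixed",
    preferred lists the variant labels to place in the zone.
    The base label is always registered.
    """
    labels = set(bundle) | {base}
    if policy == "resolve_all":
        registered = set(labels)
    elif policy == "block_all":
        registered = {base}
    elif policy == "mixed":
        registered = {base} | (set(preferred) & labels)
    else:
        raise ValueError("unknown policy: %r" % (policy,))
    return registered, labels - registered
```

Under every policy the reserved set is simply whatever remains of the
bundle, so no label in the bundle is left available to other
registrants.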
2. Some Implications of this Approach
Historically, DNS labels were considered to be arbitrary identifier
strings, without any inherent meaning. Even in ASCII, there was no
requirement that labels form words. Labels that could not possibly
represent words in any Romance or Germanic language have actually
been quite common. In general, in those languages, words contain at
least one vowel and do not have embedded numbers. The more one moves
toward "language"-based registry restrictions, the less it is going
to be possible to construct labels out of fanciful strings. Such
strings may make very good identifiers, while being terrible
candidates for "words". To take a trivial example using only ASCII
characters, "rtr32w", "rtr32x", and "rtr32z" might be very good DNS
labels for a particular zone and application, but, given the embedded
digits and lack of vowels, would fail even the most superficial of
tests for valid English word forms.
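To make the point concrete, a deliberately superficial "word form"
test of the kind alluded to above might look like the following. The
function name and the choice to count "y" as a vowel are assumptions
made for this sketch; it is shown only to illustrate how such a test
rejects otherwise useful identifiers.

```python
VOWELS = set("aeiouy")


def looks_like_word(label):
    """Superficial test: letters only and at least one vowel."""
    lowered = label.lower()
    return lowered.isalpha() and any(ch in VOWELS for ch in lowered)
```

All three example labels ("rtr32w", "rtr32x", and "rtr32z") fail this
test, even though they may be perfectly good identifiers.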
Interestingly, if one is trying to develop an "only words" system, a
rather different --but very restrictive-- model could be developed
using lookups in a dictionary for the relevant language and a listing
of valid business names for the relevant area. If a string did not
appear in either, it would not be permitted to be registered. Models
effectively equivalent to this one have historically been used to
restrict registrations in some country-code top level domains. On the
other hand, if look-alike characters are a concern, even that type of
rule (or restriction) would still not avoid the need for variants.
Consequently, registries applying the principles outlined in this
document should be careful not to apply more severe restrictions than
are reasonable and appropriate while, at the same time, being aware
of how difficult it usually is to add restrictions at a later time.
3. Modifications to the JET Model Needed Under Some of the Models
Above
The JET model was designed for CJK characters. The discussion above
implies that some extensions to it may be needed to handle the
characteristics of various alphabetic scripts and the decisions that
might be made about them in different zones. Those extensions might
include facilities to process:
o Two-character (or more) sequences, such as ligatures and
typographic spelling conventions, as variants.
o Regular expressions or some other mechanism for dealing with
string positions of characters (e.g., characters that must, or
must not, appear at the beginning or end of strings).
o Delimiter breaks to permit multiple languages to be used,
separately, within the same label. E.g., is it possible to define
a label as consisting of two or more sublabels, each in a
different language, with some particular delimiter used to define
the boundaries of the sublabels?
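The second item in the list above (positional constraints) can be
sketched with a regular expression. The rule shown here, that MIDDLE
DOT (U+00B7) may appear only between other characters, is a
hypothetical example chosen for illustration and is not taken from the
JET Guidelines or from any registry's table. The names
NO_EDGE_MIDDLE_DOT and satisfies_position_rule are likewise
assumptions.

```python
import re

# Hypothetical positional rule: MIDDLE DOT (U+00B7) must not be the
# first or the last character of a label.
NO_EDGE_MIDDLE_DOT = re.compile(r"(?!\u00B7).*(?<!\u00B7)", re.DOTALL)


def satisfies_position_rule(label):
    """Return True if the label obeys the hypothetical rule above."""
    return NO_EDGE_MIDDLE_DOT.fullmatch(label) is not None


print(satisfies_position_rule("a\u00B7b"))   # middle dot inside the label
print(satisfies_position_rule("\u00B7ab"))   # middle dot at the beginning
print(satisfies_position_rule("ab\u00B7"))   # middle dot at the end
```

A table processor extended this way would apply such checks to the
base registration, and to each generated variant label, in addition to
the per-character lookup.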
4. Conclusions and Recommendations about the General Approach
Thinking about the implications of the use in DNS labels of the full
range of characters permitted by IDNA has led multiple groups to the
conclusion that some restrictions, on a per-registry or per-zone
basis, are needed to prevent many forms of user confusion about the
actual structure of a name or the word, phrase, or term that it
appears to spell out. It appears that the best way to approach such
restrictions involves drawing from the language and culture of the
community of registrants and users in the relevant zone: if
particular characters are likely to be unintelligible to both of
those groups, it is probably wise to not permit them to be used in
registrations. Registration restrictions can be carried much further
than restricting permitted characters to a selected Unicode subset.
The idea of a reserved "bundle" of related labels permits
probably-confusing combinations or sets of characters to be bound
together, under the control of a single registrant. While that
registrant might use the package in a way that confused his or her
own users, the possibility of turning potential confusion into a
hostile attack would be considerably reduced.
At the same time, excessive restrictions may make DNS identifiers
less useful for their original, intended, purpose: identifying
particular hosts and similar resources on the network in an orderly
way. Registries creating rules and policies about what can be
registered in particular zones -- whether those are based on the JET
Guidelines or the suggestions in this document-- should balance the
need for restrictions against the need for flexibility in
constructing identifiers.
The discussion above provides many options that could be selected,
defined, and applied in different ways in different registries
(zones). Registrars would almost certainly prefer systems in which
they can predict, at least to a first order approximation, the
implications of a particular potential registration to ones in which
they cannot. Predictability of that sort probably requires more
standards, and less flexibility, than the model itself might suggest.
5. A Model Table format
The format of the table is meant to be machine-readable but not
human-readable. It is fairly trivial to convert the table into one
that can be read by people.
Each character in the table is given in the "U+" notation for Unicode
characters. The lines of the table are terminated with either a
carriage return character (ASCII 0x0D), a linefeed character (ASCII
0x0A), or a sequence of carriage return followed by linefeed (ASCII
0x0D 0x0A). The order of the lines in the table may or may not
matter, depending on how the table is constructed.
Comment lines in the table are preceded with a "#" character (ASCII
0x23).
Each non-comment line in the table starts with the character that is
allowed in the registry, which is also called the "base character".
If the base character has any variants, it is followed by a vertical
bar character ("|", ASCII 0x7C) and the variant string. If the base
character has more than one variant, the variants are separated by a
colon (":", ASCII 0x3A). Strings are given with a hyphen ("-", ASCII
0x2D) between each character. Comments begin with a "#" (ASCII
0x23), and may be preceded by spaces (" ", ASCII 0x20).
The following is an example of how a table might look. The entries in
this table are purposely silly and should not be used by any registry
as the basis for choosing variants. For the example, assume that the
registry:
o allows the FOR ALL character (U+2200) with no variants
o allows the COMPLEMENT character (U+2201) which has a single
variant of LATIN CAPITAL LETTER C (U+0043)
o allows the PROPORTION character (U+2237) which has one variant
which is the string COLON (U+003A) COLON (U+003A)
o allows the PARTIAL DIFFERENTIAL character (U+2202) which has two
variants: LATIN SMALL LETTER D (U+0064) and GREEK SMALL LETTER
DELTA (U+03B4)
The table would look like:
# An example of a table
U+2200
U+2201|U+0043
U+2237|U+003A-U+003A # Note that the variant is a string
U+2202|U+0064:U+03B4
Implementors of table processors should remember that there are tens
of thousands of characters whose codepoints are greater than 0xFFFF.
Thus, any program that assumes that each character in the table is
represented in exactly six octets ("U", "+", and exactly four octets
representing the character value) will fail with tables that use
characters whose value is greater than 0xFFFF.
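The table format described in this section can be read with a short
parser. The following is a minimal sketch, not a reference
implementation: the names parse_codepoint, parse_table, and
EXAMPLE_TABLE are assumptions made for this example, and the choice to
accept four to six hexadecimal digits after "U+" is one way of
honoring the warning about code points above 0xFFFF. The sample input
deliberately mixes the three permitted line terminators.

```python
import re

# "U+" followed by 4 to 6 hex digits, so code points above 0xFFFF work.
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_codepoint(token):
    """Convert one "U+XXXX" token into a one-character string."""
    match = CODEPOINT.fullmatch(token)
    if match is None:
        raise ValueError(f"not a U+ code point: {token!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"code point out of range: {token!r}")
    return chr(value)


def parse_table(text):
    """Return a dict mapping each base character to its variant strings."""
    table = {}
    # Lines end with CR, LF, or CR LF.
    for raw in re.split(r"\r\n|\r|\n", text):
        # Drop comments (which may follow spaces) and surrounding blanks.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        base_token, _, variant_part = line.partition("|")
        base = parse_codepoint(base_token.strip())
        variants = []
        if variant_part:
            # Variants are separated by ":"; characters within a
            # variant string are separated by "-".
            for variant in variant_part.split(":"):
                chars = [parse_codepoint(t.strip()) for t in variant.split("-")]
                variants.append("".join(chars))
        table[base] = variants
    return table


EXAMPLE_TABLE = (
    "# An example of a table\r\n"
    "U+2200\n"
    "U+2201|U+0043\r"
    "U+2237|U+003A-U+003A # Note that the variant is a string\n"
    "U+2202|U+0064:U+03B4\n"
)

for base, variants in parse_table(EXAMPLE_TABLE).items():
    print(f"U+{ord(base):04X}", variants)
```

Because every character is written in "U+" notation, the colon that
separates variants cannot be confused with a COLON (U+003A) that is
itself part of a variant string.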
6. A Model Registration Procedure -- "CreateBundle"
This procedure has three inputs:
o the proposed base registration
o the language for the proposed base registration
o the processing table associated with that language
The output of the process is either failure (the base registration
cannot be registered at all), or a registration bundle that contains
one or more labels (always including the base registration). As
described earlier, the registration bundle should be stored with its
date of creation so that issues with overlapping elements between
bundles can later be resolved on a first-come, first-served basis.
There are two steps to processing the registration:
1. Check whether the proposed base registration exists in any
bundle. If it does, stop immediately with a failure.
2. Process the base registration with the CreateBundle process
described below.
Note that the process must be executed only once. The process must
not be run on any output of the process, only on the proposed base
registration.
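The two steps above can be sketched as follows. This is an
illustration under stated assumptions rather than a definitive
implementation: the registry is modeled as a simple in-memory list of
(creation time, labels) pairs, and create_bundle is a hypothetical
callable that already has the language and its table bound to it and
that returns either a list of labels or None on failure.

```python
from datetime import datetime, timezone


def register(base, registry, create_bundle):
    """Try to register `base`; return the new bundle or None on failure.

    registry      -- list of (created, labels) tuples (assumed storage model)
    create_bundle -- hypothetical callable: base label -> list of labels or None
    """
    # Step 1: fail if the base registration is already in any bundle.
    if any(base in labels for _, labels in registry):
        return None
    # Step 2: run CreateBundle exactly once, on the base registration
    # only, never on labels that the process itself produced.
    bundle = create_bundle(base)
    if bundle is None:
        return None
    # Store the creation date so that overlaps between bundles can be
    # resolved later on a first-come, first-served basis.
    registry.append((datetime.now(timezone.utc), bundle))
    return bundle
```

Note that the check in step 1 looks at every label in every stored
bundle, not just at earlier base registrations, so a request for a
label that exists only as someone else's variant is also rejected.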
6.1 Description of CreateBundle
The CreateBundle process determines if a registration bundle can be
created and, if so, fills that bundle only with valid labels.
During the processing, a "temporary bundle" contains partial labels,
that is, labels that are being built and are not complete labels. The
partial labels in the temporary bundle consist of strings.
The steps in the CreateBundle process are:
1. Split the base registration into individual characters, called
"candidate characters". Compare every candidate character against
the base characters in the table. If any candidate character does
not exist in the set of base characters, the system must stop and
not register any names (that is, it must not register either the
base registration or any labels that would have come from
character variants).
2. Perform the steps in ToASCII for the base registration. If
ToASCII fails for the base registration, the system must stop and
not register any of the labels (that is, it must not register
either the base registration or any created labels, even if those
labels would have passed ToASCII). If ToASCII succeeds, add the
result to the registration bundle.
3. For every candidate character in the base registration, do the
following:
1. Create the set of characters that consists of the candidate
character and any variants.
2. For each character in the set from the previous step,
duplicate the temporary bundle that resulted from the
previous candidate character, and add the new character to
the end of each partial label.
4. The temporary bundle now contains zero or more labels that
consist of Unicode characters. For every label in the temporary
bundle, do the following:
Process the label with ToASCII to see if ToASCII succeeds. If it
does, put the label into the registration bundle. Otherwise, do
not process this label from the temporary bundle any further; it
will not go into the registration bundle.
5. The result is the registration bundle with the base registration
and possibly other labels. Finish.
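The steps of CreateBundle can be sketched as shown below. Several
assumptions are made for the sake of a short, runnable example: the
table is a dictionary that maps each base character to a list of
variant strings; ToASCII is taken from the Python standard library's
"encodings.idna" module (an IDNA implementation based on RFC 3490);
the bundle stores the Unicode form of each label rather than the
ToASCII output; and step 3 is implemented as a Cartesian product,
which yields the same set of labels as extending the temporary bundle
one candidate character at a time. The names create_bundle and
to_ascii_ok are not from this document.

```python
from itertools import product
from encodings.idna import ToASCII  # standard-library IDNA ToASCII


def to_ascii_ok(label):
    """Return True if ToASCII succeeds for the label."""
    try:
        ToASCII(label)
    except UnicodeError:
        return False
    return True


def create_bundle(base, table):
    """Return the registration bundle for `base`, or None on failure."""
    # Step 1: every candidate character must be a base character.
    if any(ch not in table for ch in base):
        return None
    # Step 2: the base registration itself must pass ToASCII.
    if not to_ascii_ok(base):
        return None
    bundle = [base]
    # Step 3: for each candidate character, the choices are the
    # character itself plus its variants; combine all the choices.
    choices = [[ch] + list(table[ch]) for ch in base]
    temporary = ["".join(parts) for parts in product(*choices)]
    # Step 4: keep only the labels for which ToASCII succeeds.
    for label in temporary:
        if label not in bundle and to_ascii_ok(label):
            bundle.append(label)
    # Step 5: the bundle holds the base registration and any variants.
    return bundle
```

With the silly example table from Section 5, a base registration of
U+2200 U+2202 has three candidate labels, because the first character
has no variants and the second has two. The number of candidates grows
multiplicatively with the number of variants per character, which
registries may want to bear in mind when designing tables.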
7. Security Considerations
Registration of labels in the DNS that contain essentially
unrestricted sequences of arbitrary Unicode characters may introduce
several opportunities for either attacks or simple confusion. Some
of these risks, such as confusion about which character (of several
that look alike) is actually intended, may be associated with the
presentation form of DNS names. Others may be linked to databases
associated with the DNS, e.g., with the difficulty of finding an
entry in a Whois file when it is not clear how to enter, or search
for, the characters that make up a name. This document discusses a
family of restrictions on the names that can be registered that can
be imposed on a DNS zone ("registry") and some possible tools for
implementing restrictions of that sort. No plausible set of
restrictions will eliminate all problems and sources of confusion:
for example, it has often been pointed out that the characters
digit-one ("1") and lower case L ("l") can easily be confused in some
fonts used to display ASCII. But, to the degree to which security
may be aided by sensible risk reduction, these techniques may be
helpful.
8. Acknowledgements
Discussions in the process of developing the JET Guidelines were
vital in developing this document and all of the JET participants are
consequently acknowledged. Attempts to explain some of the issues
there to, and feedback from, Vint Cerf, Wendy Rickard, and members of
the ICANN IDN Committee were also helpful in the thinking leading up
to this document.
An effort by Paul Hoffman to create a generic specification for
registration restrictions of this type helped to inspire this
document, which takes a somewhat different, more language-oriented,
approach. While the initial version of that document indicated that
multiple languages (or multiple language tables) for a single zone
were infeasible, more recent versions [13] shifted to inclusion of
language-based approaches. The current version of this document
incorporates considerable text, and even more ideas, from those
drafts, with Paul Hoffman's generous permission.
The opinions expressed here are, of course, the sole responsibility
of the author. Some of those whose ideas are reflected in this
document may disagree with the conclusions the author has drawn
from them.
References
[1] Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing
Domain Names in Applications (IDNA)", RFC 3490, March 2003.
[2] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile
for Internationalized Domain Names (IDN)", RFC 3491, March
2003.
[3] Costello, A., "Punycode: A Bootstring encoding of Unicode for
Internationalized Domain Names in Applications (IDNA)", RFC
3492, March 2003.
[4] Mockapetris, P., "Domain names - implementation and
specification", RFC 1035, STD 13, November 1987.
[5] Seng, J., Ed., Klensin, J., Ed., Rickard, W., Ed., Konishi, K.,
Huang, K., Qian, H. and Y. Ko, "International Domain Names
Registration and Administration Guidelines for Chinese,
Japanese, and Korean", draft-jseng-idn-admin-05.txt (work in
progress), June 2003.
[6] Internet Engineering Steering Group, IETF, "IESG Statement on
IDN", IESG Statement IDNstatement.txt, February 2003.
[7] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet host
table specification", RFC 952, October 1985.
[8] Alvestrand, H., "Tags for the Identification of Languages", BCP
47, RFC 3066, January 2001.
[9] The Unicode Consortium, "The Unicode Standard--Version 3.0",
January 2000.
[10] The Unicode Consortium, "Unicode Standard Annex #28", March
2002.
[11] Drucker, J., "The Alphabetic Labyrinth: The Letters in History
and Imagination", 1995.
[12] Hoffman, P., "Terminology Used in Internationalization in the
IETF", RFC 3536, May 2003.
[13] Hoffman, P., "A Method for Registering Internationalized Domain
Names", draft-hoffman-idn-reg-02.txt (work in progress),
October 2003.
Author's Address
John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
USA
Phone: +1 617 491 5735
EMail: john-ietf@jck.com
Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP-11. Copies of
claims of rights made available for publication and any assurances of
licenses to be made available, or the result of an attempt made to
obtain a general license or permission for the use of such
proprietary rights by implementors or users of this specification can
be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.
Full Copyright Statement
Copyright (C) The Internet Society (2004). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assignees.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgment
Funding for the RFC Editor function is currently provided by the
Internet Society.