INTERNET-DRAFT                                John C. Klensin
March 1, 2002
Expires August 2002


                           A Search-based access model for the DNS
                                   draft-klensin-dns-search-03.txt


Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document supplements a companion document [DNSROLE] on the role
of the DNS relative to the uses to which it is being put and is
intended to start laying the groundwork for a specific proposal.
Both documents, their successors, and closely-related issues, can be
discussed on the mailing list at ietf-irnss@lists.elistx.com.
See http://lists.elistx.com/archives/ for subscription and
archival information.


Copyright Notice

Copyright (C) The Internet Society (2000).  All Rights Reserved.



0. Abstract

This memo discusses strategies for supporting "DNS searching" --
finding of names in the DNS, or references that will ultimately point
to DNS names, by a mechanism layered above the DNS itself that
permits fuzzy matching, selection that uses attributes or facets, and
use of descriptive terms. Demand for these facilities appears to be
increasing with growth in the Internet (and especially the web) and
with requirements to move beyond the restricted subset of ASCII names
that have been the traditional contents of DNS "Class=IN".  This
document proposes a three-level system for access to DNS names in
which the upper two levels involve search, rather than lookup
(exactly known target), functions. It also discusses some of the
issues and challenges in completing the design of, and deploying,
such a system.

Table of Contents

0. Abstract
1. Introduction and Executive Summary

2. A three (or four) search-layer environment.
2.1.  Search Layer One: Identifiers -- a lookup system and the DNS.
2.2. Search Layer Two: Names -- a faceted search system with a small
       number of facets.
2.2.1. The facets
2.2.2. The name string
2.2.3. Case matching
2.2.4. More complex character matching
2.2.5. Query formation and specification
2.2.6. Imprecise matching
2.2.7. Registration rules and query rules
2.3.  Search Layer three: locality and/or content-domain-specific
       lookup mechanisms.
2.4. A search layer above the third: free-text searching applications.
2.5. Database and searching differentiation

3. Context and model details
3.1 The data search and access model
3.2 Uniqueness of name structures at the second search layer.
3.2.1 The case for unique names
3.2.2 Non-unique names
3.2.3 A middle ground approach: artificial uniqueness
3.3 Sources for controlled-vocabulary facets ("attributes")
3.3.1 Discussion of language identification
3.3.2 Discussion of geographical identification
3.4 Deployment against the existing DNS base
3.5 Thoughts about User Interfaces (UIs)
3.6 Implementation models
3.6.1 Calling and returning values
3.6.2 The cache model
3.6.3 An example: Looking at Chinese Traditional-Simplified Mappings
3.6.4 An example: Distance functions and Latin-based alphabets
3.7 Older applications

4. Comparisons to existing and proposed technology
4.1 The IDN Strawman
4.2 "Keyword" systems
4.3 Client-side and server-side solutions

5. Comments on business models
6. Glossary
7. Summary
8. IANA Considerations and related topics
9. Security Considerations

10. References
10.1. Normative references
10.2. Explanatory and informative references

11. Acknowledgements
12. Author's Address
Placeholders


1. Introduction and Executive Summary

The notion of "DNS searching" is somewhat of an oxymoron: the DNS is
structured to only perform exact lookups of structured strings of
labels.  But, as discussed elsewhere, there is considerable demand
for searching facilities -- partial and fuzzy matching, selection
that uses attributes or facets, and searching using descriptive
terms -- and that demand appears to be increasing with growth in the
Internet (and especially the web) and with requirements to move
beyond the restricted subset of ASCII names that have been the
traditional contents of DNS Class=IN.  This document proposes a
three-level system for access to DNS names in which the upper two
levels involve search, rather than lookup, functions. It also
discusses some of the issues and challenges in completing the design
of, and deploying, such a system.

These types of services are unnecessary as long as the problem is
defined as "get non-ASCII identifiers into the DNS, but keep to a
well-specified set of characters and usage so they retain strict
identifier properties".  Such approaches do not, as discussed in
[DNSROLE], solve the problem as perceived by many people. One
non-technical way of looking at this is that the DNS is fundamentally
downward-facing: it is designed to support references to network and
host resources.  Users want something upward-facing, i.e., that
provides natural-language terminology and searching for resources of
interest.  And, as the IAB has pointed out [RFC2825], even if "fixing
the DNS" did the job, it would be the easy part: the harder problem
is considering and adjusting the applications and applications-level
user interfaces.

It has been suggested that introducing a "directory" or "keywords"
into, or above, the DNS could be used as a solution to the IDN problem
and, often, several others.  Probing statements about "directories"
often quickly demonstrates that their advocates don't agree on what
they mean.  Similarly, many of those who advocate "keyword systems"
use that term to describe something very different from the
traditional use of keywords in information retrieval systems.  This
section outlines a three-layer search/lookup model (adding two layers
to the one provided by the DNS, i.e., constructing a three-layer
model, rather than continuing with the single one we have today).
Those layers consist of the current DNS, a faceted search-capable
layer using an extremely simple set of facets, and a layer capable of
broader search approaches in a localized context. It is intended as a
strawman for criticism and development, rather than as a specific
proposal.  I.e., the details are left for WG efforts.

As a terminology issue, the "layers" described here are probably best
thought of as sublayers of the applications layer, with actual
user-facing applications lying yet above them.  The term "search
layer" has been used below where it appears to be needed for clarity
or emphasis, and "sublayer" and "level" are sometimes used
interchangeably with it: suggestions for better terminology would be
welcomed.

At the two "above DNS" sublayers, international ("universal")
character sets and scripts are assumed and part of this initial
design.  Since actual or applications-applied DNS restrictions are
not being inherited upward into these sublayers, coding can be chosen
for maximum utility and balance among language groups.  E.g., native
UCS-4 could be used as an alternative to a secondary encoding form
such as UTF-8 or an ASCII-compatible ("ACE") recoding.

This document is intended to evolve into a framework and model for the
layered search system, rather than a complete specification or even an
approximation to one.  It is complemented (for sublayer two) by
[Mealling-SLS], which discusses a CRNP-based [CRNP] implementation
model for the middle sublayer, by the more keyword-focused model of
[Arrouye], and, we hope, by other systems more tailored to specific
languages or cultures.  Additional documents are expected to be
developed that describe other aspects of both sublayers.


2. A three (or four) search-layer environment.

The material below suggests three or more sublayers for name lookup
and search:

   (1) The DNS, with the existing lookup mechanisms and a single
   global name space in which names are unique.

      Names are placed in the DNS by those who wish to use those names
      themselves (e.g., for identifying hosts and resources within a
      home, an enterprise, or cooperating groups of organizations).
      The DNS was never designed for searching for, or querying of, an
      identifier by someone who does not already know what it is.

      A useful analogy has been drawn between DNS names and variable
      names in a programming language [Austein].

   (2) A restricted, facet-based, search system.  This system still
   preserves a global name space, but name strings are not expected
   to be unique and the set of facet values for a given entity may
   not be (see section 3.2).

      Names are placed into this second-sublayer system by those who
      want to be found, or want the names or resources to be found,
      by others.  The assumptions are neither that those others will
      know exactly what name they are trying to access (where the DNS
      requires precise knowledge of names or very good guessing) nor
      that names will be unique (where the DNS requires uniqueness).
      But the search activity is still based on names (and
      attributes), not topics.

      It may be useful to think of this layer as similar to "white
      pages" services.  This comparison is discussed in more detail
      below.

   (3) Commercial, localized, and potentially topic-specific, search
   environments.   These environments utilize multiple, localized,
   name spaces.  These would typically be localized by language or
   (physical or political) geography, but might be structured around,
   e.g., specific subject matter.

      Names are placed in this sublayer by those who wish them to be
      found within a topic area context (or language or locality or
      combination of them).  Because the environments are localized,
      different search terms and levels of granularity can be used in
      different search sites and name spaces.

      It may be useful to think of this layer as similar to "yellow
      pages" services.   Again, the comparison is discussed in more
      detail below.

   (4) Something else?


2.1.  Search Layer One: Identifiers -- a lookup system and the DNS.

In this model, the DNS remains largely as is (see section 3.4ff) or,
perhaps, a bit closer to its original purpose and assumptions than
the direction in which it has evolved in recent years.  I.e., it is a
distributed database, with precise lookups, whose lookup keys are
identifiers for Internet hosts and other objects.  We give up the
notion that these identifiers should also serve as human-useful names
or at least try to abandon that notion.

   As an aside, note that some people have suggested that we
   should dehumanize DNS names entirely, e.g., prohibit the
   registration and use of any name that can be found in any
   dictionary for any language that can be represented in the
   DNS-acceptable character set.  This proposal doesn't
   include that idea.  But it is absent primarily because it does
   not appear that the transition process would be worth the time it
   would take to explore, rather than because it has no appeal.

The goal at this sublayer is relatively simple, unique, identifiers.
It is probably desirable that these identifiers be able to have some
human mnemonic value, but less important that they be tightly bound
to real-world names and descriptions.

The inputs and outputs at this layer remain as they are in the DNS
today, although modifications to accommodate non-hostname format
names there remain possible if that is deemed important for mnemonic
or other purposes.  "Hostname-format names" are those that are
restricted to the ASCII-based "letter-digit-hyphen" (LDH) format
traditionally used in Internet applications [HOSTNAME] and identified
as prudent practice in section 2.3.1 of the DNS specification
[RFC1035].


2.2. Search Layer Two: Names -- a faceted search system with a small
number of facets.

Much of the current burden borne by the DNS would appear to be better
focused on a search system that contains names and a small number of
attributes represented in name facets.  That DNS burden includes a
wide range of non-identifier goals and constraints: names that a user
can understand and find and that have significant mnemonic value,
names with trademark implications, a wide variety of naming systems
and, in general, helping people find the things for which they are
looking.  It is critical that the number of attributes be constrained
to a minimal set, and that other attributes, especially those of
special interest, be deferred to the third search layer.

The term "attribute" is used here and below to identify the
controlled vocabulary or rule-defined facets as distinct from the
free-form "name-string".

It is probably most useful to think about this layer in terms of a
structured, multifaceted, multihierarchical, thesaurus-like database
with search capability (Cf. ISO IS 5127-1 and IS 5127-6 [THES]),
rather than as a "directory" in the sense of X.500 and its
derivatives and antagonists.

2.2.1. The facets

A key question is what facets to use once the major commercial
product requirements are removed (to search layer three, see below).
It appears to me that, to satisfy the critical name-uniqueness and
real world pressures on the DNS, candidates for identifying facets
might be

     name-string
        Characters from IS 10646, see below.
     language
        Presumably codes as specified in RFC 3066 (see section 3.3.1).
     geographical location
        Country, and/or for some federal countries, country/province
        ("state"). Granularity is important and there may be a case
        for an additional facet based in a coordinate system or for a
        two-level facet.  See section 3.3.2.
     network location
        If we can figure out what that means and how to express it in
        a canonical way.
     industry category code
        For companies, presumably derived from some existing official
        list such as the WIPO Nice Agreement list [WIPO-NICE]. The
        list would presumably require extension in some way to deal
        with non-commercial organizations and entities and to
        identify resources and services associated with people.

This typology gives the trademark view of the world somewhat more
precedence in looking at name conflict issues than one might like in
principle.  But, in practice, one of the key issues we have
encountered in trying to store "names", rather than identifiers, in
the DNS is that the process unreasonably flattens the space, not only
from a technical standpoint but from a usability one.  That "Joe's
Auto Repair" and "Joe's Pizza" can co-exist in the same geographical
area without conflict or confusion and that "Joe's Pizza" in one area
can co-exist with "Joe's Pizza" in another, again without conflict or
confusion, are the consequence of the way we name and identify things
in the real world.  Most trademark rules are the consequence of those
naming systems, not their cause and many perceived conflicts between
the trademark system and DNS usage are the result of this flattening.

It is not intended that this level act as a white pages service for
people.  Doing so leads down several slippery slopes at once,
including heightened privacy concerns and a stronger requirement for
URL targets rather than DNS label ones (see below).

The general intent is that the list of facets be fixed by protocol
and that possible values for each facet be controlled vocabularies,
not necessarily (and probably not) controlled from the same source
(see section 3.2).  We would hope to utilize existing terminology
lists where possible.  For a particular record (i.e., a name and its
set of attributes), and especially if requirements for uniqueness can
be bypassed or relaxed, the selection (from the controlled
vocabularies) of particular facet values would be the responsibility
of the entity registering the names.  In other words, someone
registering a "name" in this system would select values for each of
the facets from the controlled vocabulary for that facet as part of
the process of placing the name into a database.

It is important to note that the registration of that name would
include all of the associated facets, although the vocabularies for
all of the facets other than the "name-string" would be drawn from
specific, external lists (controlled vocabularies or rules).  It
would not be desirable, and probably would not be feasible, for
registrants to record their names in independent, facet-based,
databases with one facet per database.

There is also no magic in the proposed system.  Names are placed in
the system with particular facet sets because a registrant wants them
there.  A registrant who wishes to have a given name-string
associated with different facet values (e.g., to identify different
locations or lines of business) will make multiple registrations.

While all faceted name strings would contain the same facets, there
is no technical reason why one or more of these might not have a
blank (or "missing") value, presumably causing a match to any search
term for that facet.  More important, searching for a name might
omit one or more facets from the search, again matching any value
that actually appeared in the database.
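
The blank-value and omitted-facet behavior just described can be
sketched as follows.  This is an illustration only, not part of the
proposal: the field names, the use of None for a blank value, and the
sample facet values are all assumptions made for the example, and the
name-string comparison is exact purely to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetRecord:
    """One registration: a name-string plus its attribute facets.

    Field names are hypothetical; None stands for a blank ("missing")
    facet value.
    """
    name_string: str
    language: Optional[str] = None
    geo_location: Optional[str] = None
    network_location: Optional[str] = None
    industry_code: Optional[str] = None


ATTRIBUTES = ("language", "geo_location", "network_location",
              "industry_code")


def facets_match(record, query):
    """True unless some facet is present on both sides and differs.

    A facet omitted from the query matches any stored value, and a
    blank stored value matches any search term.
    """
    for facet in ATTRIBUTES:
        wanted = query.get(facet)
        stored = getattr(record, facet)
        if wanted is not None and stored is not None and wanted != stored:
            return False
    return True


def search(records, name_string, **query):
    """Exact name-string lookup, narrowed by whichever facets are given."""
    return [r for r in records
            if r.name_string == name_string and facets_match(r, query)]


# Placeholder registrations; the facet values are invented for the example.
RECORDS = [
    FacetRecord("Joe's Pizza", language="en", geo_location="US",
                industry_code="43"),
    FacetRecord("Joe's Pizza", language="en", geo_location="GB",
                industry_code="43"),
    FacetRecord("Joe's Pizza", language="en"),  # blank location, industry
]
```

A registrant wanting the same name-string under several locations
simply appears several times in RECORDS, which mirrors the "multiple
registrations" approach described above.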

It should be clear that there is significantly more information (from
the values of the facets) at this layer than there is in the DNS.

2.2.2.  The name string

The names in this environment can reasonably be written in IS 10646
codes or some recoding of them.  Since we would be starting more or
less from scratch, we could select lengths and codings for maximum
efficiency and utility, not to meet the constraints of existing
software.  In such a context, this author has a slight bias for
direct UCS-4 coding. This is in preference to ASCII-compatible
("ACE") codes; compressed, null-octet-eliminating, systems such as
UTF-8; or surrogate introducers to hold things to 16 bits.  The loss
in transport efficiency is likely to be more than compensated for by
gains in cleanliness and equal treatment of all scripts.  And, if
compression is needed, it is perhaps better to do it at the string
level rather than the character one.  But that issue is separate from
the main and important design arguments of this document.
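
As a rough illustration of the transport cost mentioned above, the
sketch below compares octet counts for a name string in a fixed-width
four-octet form and in UTF-8.  Python's "utf-32-be" codec is used
here as a stand-in for UCS-4, which is an assumption of the example,
and the function name is hypothetical.

```python
def encoded_sizes(name):
    """Return the octet count of a name in two candidate encodings."""
    return {
        # Fixed width: four octets per character, whatever the script.
        "UCS-4": len(name.encode("utf-32-be")),
        # Variable width: one to four octets per character.
        "UTF-8": len(name.encode("utf-8")),
    }


ascii_sizes = encoded_sizes("Lower Slobbovian University")
han_sizes = encoded_sizes("\u5927\u5b78")  # two Han characters
```

For the ASCII-only name the fixed-width form is four times the size
of UTF-8; for the two-character Han string the gap is much smaller.
That is the trade being described: more octets overall, but the same
cost per character for every script.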

The work done to define "nameprep" and, later, the set of
"stringprep" functions [NAMEPREP], in the IDN WG is almost certainly
relevant to determining which names to actually store in the
database.  But the stakes are lower here than the "get it right or
fail completely" constraint of the DNS lookup environment: one can
imagine search mechanisms that would apply a more liberal set of
matching rules (and/or localized and language-specific ones) than the
rules used to encode names (much like recent applications protocols
that explicitly distinguish between the formats one is permitted to
send and those one is expected to accept (Cf. [RFC2822])).

At the same time, it would be sensible to permit short phrases as
these "names", something which is not generally possible in the DNS
(or in the IDN proposals).  The necessity, in the DNS, to turn, e.g.,
"Lower Slobbovian University" into "LowerSlobbovianUniversity.edu",
and hope case will be preserved (or "lowerslobbovian.edu", or worse)
is, ultimately, just another example of the unfortunate mismatch
between the identifiers of DNS and real-world naming systems.  So we
would assume that it is a design requirement to make it possible to
use "Lower Slobbovian University" and "University of Lower Slobbovia"
as stored names.

2.2.3.  Case matching

In the system proposed here, case-matching should be treated as just
another case of fuzzy searching and matching, not a relationship with
unique status.  As discussed below, in all cases, the user (or her
agent) would provide a string, some subset of facets, and
search-method specifications as input, and would receive a set of
matching results, in the form in which they are stored in the
database.

Case matching -- treating upper and lower case letters as identical
-- is another historical DNS property that does not have a simple and
unambiguous interpretation in the real world of non-ASCII character
sets and a range of language applications.  Some scripts contain
glyph forms that clearly represent two cases, some scripts clearly do
not have case distinctions, and, as the IDN WG has discovered, there
are character-matching requirements in some languages (e.g., equality
of Simplified and Traditional Chinese [CNDC], see below) for which
the appropriateness of an analogy to case-matching has caused a
considerable controversy, not least because of the apparent absence
of a set of mapping tables that cover all of the possible character
pairs.  The IDN WG has also discovered that, even for scripts with
the presence of clear case distinctions, the matching rules sometimes
differ by geographical locality.

It is not yet completely clear how case matching should best be
handled, but one thing that appears completely clear is that the
model the IDN group seems to be creating is not desirable.  That
model essentially results in different rules being applied to
different scripts: case matching in some situations, none in others,
and some but not all characters in yet other cases.  This may
be the best compromise given the combination of the constraints
of the DNS with the idiosyncrasies of Unicode, but, without the DNS
constraints, we should strive to treat all languages and scripts in
as nearly an identical way as possible.

While there are other options, it would appear to be better to handle
case-matching on the server, as it is done in the DNS.  As with other
searching variants, it should be possible to return the form of the
name as stored in the database while finding it using any of the
user-acceptable variations (use of client-side string preparation for
both the stored name and query formation, as an IDN-DNS seems to
require, loses information that some people consider important).
Case-matching in the proposed faceted system could be applied (or
not) as dictated either by a heuristic using the combination of the
language facet and a query containing the preferred location-context
of the user (see below), or by an explicit query flag (or indicator
carrying more than one bit of information).  This
author tends to prefer the latter because of a profound distrust of
heuristics, but the question requires additional study.
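
A minimal sketch of the explicit-flag alternative, assuming a
hypothetical "fold_case" query parameter and using Python's
str.casefold() as one possible (by no means the only) definition of
caseless comparison.  The point illustrated is that the match is
computed on the server side and the name is returned in the form in
which it was stored.

```python
def find_names(stored_names, query, fold_case=True):
    """Return the stored forms of all names that match the query.

    fold_case is a hypothetical explicit query flag.  When it is set,
    both sides are compared in case-folded form, but the result is
    always the name exactly as it appears in the database.
    """
    if fold_case:
        key = query.casefold()
        return [name for name in stored_names if name.casefold() == key]
    return [name for name in stored_names if name == query]


STORED = ["Lower Slobbovian University", "University of Lower Slobbovia"]
```

A richer indicator (the "more than one bit" case) could select among
several folding rules, for instance per language or locality; this
sketch deliberately stops at a single Boolean.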

2.2.4.  More complex character matching

The case-matching strategy applies to more complex cases of character
matching as well.  If one can establish sufficient context, and
specify the types of expanded matching to be used, and permit
multiple variants to be returned to the application, then one could
support matching of similar-appearing characters (e.g., Latin "A" and
Greek Alpha), or Latin-derived and Cyrillic-derived scripts for
Serbo-Croatian, or, perhaps most important, mapping between
Traditional and Simplified Chinese.
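
The same pattern can be sketched for similar-appearing characters.
The equivalence table below is a tiny, hand-made assumption for
illustration; real tables would have to be developed and vetted per
script pair, and nothing here addresses the Chinese mapping problem.
The "similar-appearance" match-type label is likewise hypothetical.

```python
# Hypothetical equivalence table: each key is treated as matching the
# Latin letter it maps to.  Three entries only, for illustration.
EQUIVALENTS = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}


def skeleton(text):
    """Replace each character by its table equivalent, if it has one."""
    return "".join(EQUIVALENTS.get(ch, ch) for ch in text)


def expanded_match(stored_names, query, match_types=frozenset()):
    """Return every stored variant that matches under the requested rules.

    match_types names the kinds of expanded matching the caller asked
    for; with none requested, only exact matches are returned.
    """
    if "similar-appearance" in match_types:
        key = skeleton(query)
        return [name for name in stored_names if skeleton(name) == key]
    return [name for name in stored_names if name == query]


# The second entry begins with Greek Alpha rather than Latin "A".
STORED_VARIANTS = ["ALPHA", "\u0391LPHA"]
```

Because all matching variants are returned in stored form, the
application (or user) can still tell them apart and choose.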

2.2.5. Query formation and specification

As is common with systems of this type, we would anticipate the
possibility of searching on any of the attributes and that searching
on free-text strings would not be exact (i.e., near-match responses
could be returned using any of several algorithms, with the user
making choices).  One could also imagine distance function
calculations on appropriately structured restricted-vocabulary facets
being implemented in some search engines.  As is equally common, we
should think about user interfaces that store both queries and
response sets so that the responses could be used offline and
refreshed when the client systems were attached to the Internet (see
the discussion of caching in section 3.6.2).

At the same time, we would assume that a search without at least some
approximation to a name string would rarely be productive and would
expect search systems to be optimized accordingly.

In summary, the goal at this layer is to provide tuples of
human-recognizable (not just mnemonic) facets (names and attributes),
but names that are relevant within the context set by the
attributes, rather than a global system based on the names alone.

The input at this layer is a query consisting of search values for
one or more of the facets, plus information to control the search.
E.g., to the extent that designers of search protocols can provide
the proper tools and terminology, one would expect the query to be
accompanied by rule statements about how much "fuzziness" was
permitted, how "distant" names might be from the chosen ones and
still be selected, whether character set or language translation (or
even phonetic recognition) was to be applied (and whether translation
was to be restricted to a small group of languages or made more
general) and so forth.
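
One way to picture a query accompanied by such rule statements is
sketched below.  The control fields and their names are assumptions,
and the similarity measure (the ratio computed by Python's
difflib.SequenceMatcher) is merely a convenient stand-in for whatever
distance functions a real search protocol would specify.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SearchControl:
    """Hypothetical search-control information carried with a query."""
    min_similarity: float = 1.0  # 1.0 permits exact matches only
    fold_case: bool = False


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_search(stored_names, name_string, control):
    """Return stored names within the permitted "distance", best first."""
    def norm(s):
        return s.casefold() if control.fold_case else s

    target = norm(name_string)
    scored = [(similarity(norm(name), target), name)
              for name in stored_names]
    hits = [pair for pair in scored if pair[0] >= control.min_similarity]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in hits]


NAMES = ["Joe's Pizza", "Joe's Auto Repair", "Lower Slobbovian University"]
```

With the default control only an exact name-string matches; lowering
min_similarity lets a query such as "Joes Pizza" (no apostrophe) find
"Joe's Pizza" while still excluding unrelated names.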

The outputs are still being discussed, but would appear to best be
the full facet set of the matched tuple(s) (more than one such set if
multiple tuples match) and one or more DNS names associated with each
tuple. These DNS names, of course, have the same uniqueness
properties as the DNS itself: while a query, or full set of matching
facets, could match (and return) multiple DNS names, nothing would
make the DNS names less unique than they are today (i.e., as the DNS
requires).   One of many interesting questions is whether this layer
should pass through and return the DNS records themselves (labels,
class, type, and target) or whether it should return names (labels)
and let the applications do the DNS lookups.  Another possibility is
to return one or more URLs (or more general URIs?) rather than DNS
names.  Doing so increases flexibility but at the cost of greater
complexity and risk of recursion problems.

Still another possibility would be to create a URI [RFC-URI] for DNS
record information and use it to abstract this return information
into something applications can then specify or decode as
appropriate.  Use of this would need to be carefully structured to
avoid complex problems (e.g., recursion in either this system, the
DNS, or both), but might be a reasonable approach.

If the output is either a DNS name or a URI, and the DNS is extended,
as is being discussed in the IDN WG, the process of looking up DNS
names that emerge from the sublayer two search would presumably go
through the extended process, e.g., stringprep and IDNA or their
descendants.

Experience with the DNS and other distributed databases also argues
persuasively that these records are not forever.  Unless there are no
local copying and caching mechanisms (which seems unlikely and hard
to enforce), some type of time to live (TTL) or other expiration or
reverification mechanism will be needed.
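
A sketch of what a returned tuple with an expiration mechanism might
look like on the client side follows.  The record layout, the field
names, the per-result TTL in seconds, and the ".example" DNS name are
all assumptions; the outputs of this layer are, as noted above, still
being discussed.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    """One matched tuple: its full facet set plus associated DNS names."""
    facets: dict
    dns_names: list
    ttl_seconds: int
    fetched_at: float = field(default_factory=time.time)

    def is_fresh(self, now=None):
        """True while the result is younger than its time to live."""
        now = time.time() if now is None else now
        return (now - self.fetched_at) < self.ttl_seconds


class ResultCache:
    """Local store of query responses that honors each result's TTL."""

    def __init__(self):
        self._entries = {}

    def put(self, query_key, results):
        self._entries[query_key] = results

    def get(self, query_key, now=None):
        """Return cached results, or None if absent or partly expired."""
        results = self._entries.get(query_key)
        if results is None:
            return None
        if not all(r.is_fresh(now) for r in results):
            # Any stale result forces the whole response to be re-fetched.
            del self._entries[query_key]
            return None
        return results


result = SearchResult(facets={"name-string": "Joe's Pizza"},
                      dns_names=["joes-pizza.example"],
                      ttl_seconds=60, fetched_at=1000.0)
cache = ResultCache()
cache.put("joes pizza", [result])
```

Dropping the whole response when any one result expires is only one
possible policy; reverifying individual tuples would be another.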

2.2.6. Imprecise matching

<<to be supplied - see "placeholders" at the end of the document>>

2.2.7. Registration rules and query rules

As with the DNS, it may be more important to be conservative about
what types of names are registered than to be restrictive about
queries.  At the same time, if there are well-known and easy to
understand rules about registration restrictions (probably implying
that the same registration rules must be used globally), it should be
possible to optimize query interfaces (corresponding to "resolvers"
in traditional DNS terminology) to immediately return "invalid name"
error messages rather than returning "not found" after a search.

One could, for example, easily imagine a query interface that would
maintain a local (although periodically updated) table of ISO 3166-2
codes to perform validation against the major components of
geographic names before initiating a search of a remote database.
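
Such a query interface might be sketched as below.  The code table is
a three-entry illustrative subset rather than the real list, the
"CC-subdivision" input form is an assumption, and the function and
exception names are hypothetical.

```python
# Illustrative subset only; a real interface would load and
# periodically refresh the full table.
KNOWN_COUNTRY_CODES = frozenset({"FR", "GB", "US"})


class InvalidNameError(ValueError):
    """Raised locally, before any remote search is attempted."""


def check_geo_facet(value):
    """Validate the country component of a geographical facet value."""
    country = value.split("-", 1)[0].upper()
    if country not in KNOWN_COUNTRY_CODES:
        raise InvalidNameError("unknown country component: %r" % country)
    return value


def search_with_validation(geo_value, remote_search):
    """Run remote_search only if the facet value passes local checks."""
    check_geo_facet(geo_value)
    return remote_search(geo_value)
```

The remote database is passed in as a callable so that the sketch
stays self-contained; an invalid value produces the "invalid name"
error without any network traffic.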

Similarly, if a sublayer two database was created for a particular
country and language, registrations in it would presumably be
restricted to records for that country and language, and to name
strings that conformed to validation rules developed for that country
and language.

The category lists (restricted vocabularies) for each of the facets
would presumably come from different, although standardized,
databases, e.g., IS 3166-2 and UN/LOCODE for geographical
information, RFC 3066 for language names, and an extended version of the
WIPO-NICE code set for industry codes.  But the name databases
themselves would contain a complete set of tuples for the facets
(some, of course, might be missing or, more precisely, "let anything
match").


2.3.  Search Layer three: locality and/or content-domain-specific
data and mechanisms.

The problem with the second-search-layer model is that there are a
number of usability and marketplace pressures for naming systems that
offer finer granularity and better match user needs.  For many
purposes, users want localized, not global, systems. This has been
confirmed in those systems which have been included in experiments or
partially deployed (see, e.g., [RFC2345], [Netword], and
[RealNames]), which require contextual localization, not a single
global environment.  There are many causes for this, but requirements
for very specific searches that are geographic-area, topic-area, or
language or culturally specific, tend to dominate the list.

The issue is perhaps illustrated by an example.  Suppose the
granularity of an entry at the second level is

  {"Joe's", "UK", Restaurant,... }

Now, I might want to create a business around a restaurant directory
for Bristol.  I would probably want to construct a database that
contained exact locations, type of food, menu information, prices,
etc., and permit people to query it that way.  That type of product
bears a strong relationship to traditional yellow pages services: the
best attributes to collect and the optimal way to organize them will
differ by topic (e.g., for most people, "menu" has no obvious analogy
in an automobile repair shop) and the business models are fairly
established.  Part of the history of those business models is the
observation that, when there are competing yellow pages services (or
guidebooks, or other, similar services), those who consistently make
better (and "more accurate") choices of categories and keywords tend,
other things being equal, to be judged "better" and to capture larger
market share.

One can imagine many different types of keyword and (yellow
pages-like) directory services at this level, using different types
of protocol mechanisms as well as different types of database content
and schema.  But those services are nearly ideal candidates for
competition: there is no requirement that either the providers or the
services be global or unique or even highly standardized.  Having all
three search layers bound to the same data sources --inheriting
values from them if one wants to think about it that way-- would
provide a degree of consistency that might be very attractive to
users, so there are clearly issues here that will need to be worked
out in the marketplace.

Directories of these types are, of course, common and widespread
outside the Internet.  There is no shortage --some would say there is
a surplus-- of directories and guides to resources and services of
particular types and in particular areas.  Some are supported by
advertising or placement fees from the resource owners, some by book
sales or fees charged to users, and others by a combination.  Most of
these directories and guides publish year after year and seem
profitable.

Inputs at the third search layer will differ by service: one can
imagine free-text interfaces and menus (but see section 2.4) as well
as systems that more closely resemble faceted search terms.  Outputs
will normally be search-layer-two names or strings to preserve name
and reference portability, or might be URIs containing such names.

Summary: Just as the monohierarchical identifier-lookup system at the
first (bottom, DNS) level should be supplemented by a multilingual,
multifaceted, multihierarchy search system at the second, that second
level system should be supplemented by a collection of localized,
subject- and topic- specific systems at the third.  These third-level
systems need not be centrally coordinated in any way, although some
similarity of function and interface would almost certainly make them
more consistent for users and easier to market.

2.4. A search layer above the third: free-text searching applications.

The approaches described above omit one set of techniques used today:
"web searches" on full text or its equivalent.  These systems have an
important role (and, similar to the third level, there seems no
particular advantage to trying to standardize them worldwide).  But
their disadvantage, if seen as a DNS surrogate or replacement, is
that they have difficulty distinguishing between the name of
something, a pointer to it, and a reference to, or discussion of, it
or how it works.  The other systems discussed in this document are
all "directories" in the sense that someone must make an explicit
decision to put an entry in a database; they are not full text
searching systems or analogues of them.

If, for example, one is looking for a web site for a company, the
third level would presumably find that site (assuming the company
wanted to be found).  The second (or even the DNS) might find it with
some guessing, but this fourth level would (as web search engines do
today) probably not reliably distinguish the company's site from
sites that reference the company or its products.

Search layer three produces information that is explicitly bound to
the query, i.e., what one is looking for, while a search engine
returns values that also include sites where the subject of the query
might have been mentioned.

2.5. Database and searching differentiation

In both sublayers two and three, but especially in two, we assume
that "compiling databases" (i.e., registry and, if appropriate,
registrar functions) and "designing and building search functions and
providing search services" are separate.  It would be necessary to
have database interfaces be sufficiently general and well-specified
that referrals were possible and different search services could rest
on top of them, but we would expect some search services to be much
more extensive than others and for their vendors to seek increased
compensation for those more extensive services.  In many cases, the
market would eventually sort out the optimal combinations of
capabilities and costs.

Ultimately, the term "fuzzy search", used extensively in this
document and elsewhere, is handwaving.  Whether heuristic or
deterministic, one must devise, for each facet, systems for
determining whether matches have occurred and, for inexact matches,
whether the combination of query term and database entity is "close
enough" together to be candidates for being returned as responses.
We can imagine phonetic matching as well as character-string
matching, application of contextual rules as well as simple
character-pair rules for matching of Traditional and Simplified
Chinese, and similar rules for matching of Kanji and kana strings.
And we would presume that users, or their agents, would be able to
control such decisions by choice of search providers, configuration,
or choices on a per-search basis.


3. Context and model details

3.1 The data search and access model

It is interesting that recent IETF "directory" work has focused on
accessing mechanisms without worrying intensely about the underlying
database content, maintenance, and update issues.  Those latter issues
seem to be the harder ones, i.e., the difference between LDAP [LDAP]
and CNRP [CNRP] may make less difference than how we structure,
maintain, match, and distribute the relevant data.

Of course, that does not suggest that work on accessing mechanisms is
not important or that it isn't required.  And, to deploy the model
suggested above, we will need to deal with a pair of uncomfortable
problems:

     * CNRP looks interesting, but has not been widely implemented or
         deployed in production.

     * LDAP is widely deployed, but primarily in implementations that
         contain sufficient extensions and special features to be
         non-interoperable. Effective referral mechanisms have also not
         been clearly standardized in LDAP, and this might pose a
         barrier.

Some readers of earlier drafts have also suggested that the history
of LDAP points to local extensions that will result in inconsistent
search behavior, while CNRP may be better specified (or at least
closer to a clean slate).

If we are going to choose -- and search layer two certainly implies a
choice-- we need to figure out how to do that.



3.2 Uniqueness of name structures at the second search layer.

There are cases to be made both for and against uniqueness of names
(more precisely, of the combination of the name-string facet and all
of the other facets) at this sublayer, and even a partial middle
ground, in which names are unique within a registry namespace, but
there are mechanisms for identifying such spaces so that the names are
unique across the Internet.  The community should address the
tradeoffs because no position is ideal; summaries of the extreme
positions are below.  In none of these cases is it necessary, or even
desirable, that the name-string itself (without the additional
"attribute" facet values) be unique.

3.2.1 The case for unique names

The IAB's discussion of DNS root uniqueness [RFC2826] argues that DNS
names must be unique, i.e., that there must not be alternate or
surrogate root structures if the Internet is to survive as a seamless
whole and be universally addressable and accessible.  Even with
imprecise matching, similar arguments may apply at level two,
especially if this is the first level at which names in natural
languages (hence including "multilingual" names), rather than
constrained identifiers, appear.  The mathematical arguments aside,
the main argument for uniqueness is that a given combination of
name-string and facets will yield exactly one logical host (or
equivalent, an approach called "direct navigation" in some of the
so-called keyword proposals [Arrouye]).  If this is not the case, it
seems inevitable that users will be faced with choices they need to
resolve even when they have an exact match for a full set of facets.

Because the name structures stored in the databases at the second
level, in this case, still must be unique, some mechanism for
registries or structuring of names will be necessary to avoid
conflicts.  The problem is somewhat easier than the ones encountered
by ICANN and its associated groups because the very structuring of the
names and attributes creates opportunities for dividing up
responsibilities, but the registration problems exist nonetheless and
will need to be resolved.

3.2.2 Non-unique names

Conversely, one could have multiple appearances of the same set of
facets (including the name-string), such that an exact match could
still yield multiple "hits".  This would have the advantage of
eliminating all requirements for monopoly registries or [other]
technical mechanisms for guaranteeing that name conflicts did not
occur.  The disadvantage is that it would force more user choices or
heuristics, and at least some errors in which the wrong host or site
was identified would be almost inevitable.  If it turned out that
most user queries occurred at sublayer three or four, rather than
directly at this sublayer, that issue might not be significant.

Were extensive use of per-user (or per-group) local directories
("bookmarks", "favorites", etc.) to evolve, they might also make the
difficulties with non-uniqueness insignificant.  This would be
especially likely if these directories contained not only a keyword
and (DNS name or URI) target, but also a stored form of the search
used so that local data could be recalculated and replenished.  See
sections 3.5 and 3.6 for some related discussion.

3.2.3 A middle ground approach: artificial uniqueness

A proposal was made in the initial version of [Mealling-SLS], that an
additional facet could be added to represent the registry which
records the names.  If this were done, names could be kept unique
within registries and would be globally unique as long as the
registry-identifying facet had a unique value for each registry.
There would be no need to restrict the number of registries in this
model or resolve naming disputes among them -- each one could have a
unique, randomly-generated and assigned identifier-- so the approach
could provide some degree of technical uniqueness while still
preserving most of the benefits of the non-unique approach.
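
The registry-facet idea can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the class name
Registry, the facet field names, and the use of a random UUID as the
registry identifier are all hypothetical and are not taken from
[Mealling-SLS].

```python
import uuid


class Registry:
    """Hypothetical registry whose names are unique only locally."""

    def __init__(self):
        # Randomly generated identifier, assigned without coordinating
        # with any other registry (an assumption of this sketch).
        self.registry_id = uuid.uuid4().hex
        self._entries = {}

    def register(self, name, facets, target):
        """Store one entry and return its globally unique key."""
        local_key = (name, tuple(sorted(facets.items())))
        if local_key in self._entries:
            raise ValueError("already registered in this registry")
        self._entries[local_key] = target
        # Adding the registry facet makes the key unique Internet-wide.
        return (self.registry_id,) + local_key
```

Two registries can each accept the same name-string and facet values;
the returned keys differ only in the registry facet, while a second
identical registration within one registry is refused.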

This model could, of course, be deployed at a "registrar" level
instead, just by changing the assignment of the identifier facet from
value-per-registry to value-per-registrar.   Other variations are, of
course, possible.

3.3. Sources for controlled-vocabulary facets ("attributes")

We anticipate that most of the sublayer two facets other than the
name-string itself will have values chosen from controlled
vocabularies, i.e., the user-registrants will be able to select
whatever values seem to match their needs, but only from pre-defined
lists of possible values.  These are not intended as free-text
entities; to make them free text would push the second-sublayer
system toward the lowered precision of Internet search engines and
other free-text search environments.  The facet values that are not
populated from controlled vocabularies will be determined by
deterministic and unambiguous rules.  For example, if one of these
attributes is a geographic location that uses a coordinate scheme, the
definition of the coordinate scheme should be sufficient to yield a
predictable and exact value.

The question, then, is how to establish the vocabulary lists and
write the defining rules.

It has been something of an Internet tradition, building on Jon
Postel's principles for registration and registries, to try to avoid
having IETF or IANA become embroiled in controversies about names,
their ownership, propriety of using them, and so on.  The use of IS
3166-1 alpha-2 codes as the basis for "country code" top-level domain
names (see [RFC1591]) is just one instance of the application of this
principle.  Following this tradition, facets should be chosen, in
part, on the basis of availability of pre-existing, well-known lists
of names and authorities or, at worst, the ability to identify
relatively non-controversial authorities who can quickly establish
such lists.

Some specific possibilities are discussed in the subsections that
follow.

3.3.1 Discussion of language identification

The IETF already has a standard for identifying languages and dialects,
documented in [LangTag] and based on an ISO Standard [ISO639].  It
appears that it would be usable here, with minimum fuzziness
associated with an exact match of all subtags and a higher degree of
fuzziness permitting matching different (national or dialect)
variations on the same language.
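
As a rough sketch of how such matching might behave, the Python
fragment below compares two language tags.  The function name and the
rule that any non-zero fuzziness compares only the primary language
subtag are assumptions made for illustration; a real system would
presumably define intermediate levels as well.

```python
def language_matches(query_tag, stored_tag, fuzziness=0):
    """Compare two hyphen-separated language tags.

    fuzziness == 0: every subtag must match.
    fuzziness  > 0: only the primary language subtag must match
                    (an assumption of this sketch).
    """
    # Language tags are compared without regard to case.
    query = query_tag.lower().split("-")
    stored = stored_tag.lower().split("-")
    if fuzziness == 0:
        return query == stored
    return query[0] == stored[0]
```

With this sketch, "en-GB" and "en-US" do not match at fuzziness 0 but
do match at any higher value, while "en" and "fr" never match.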

3.3.2 Discussion of geographical identification

For larger countries, and areas with many semi-independent
administrative districts, identification of the country may not
provide sufficiently precise resolution.  On the other hand, it is
desirable to have a scheme that is hierarchical or that otherwise
readily permits search expansion.  Conceptually, the coding should be
something like

 country / administrative-district / city or town

Fortunately, such a system exists as a generalization of one that is
in common use in the Internet.  ISO 3166-2 [ISO3166] provides a
model, and list of values, for representing countries and
administrative districts, and is designed to be compatible with the
UN/LOCODE list when those further subdivisions are provided and
satisfactory from a national point of view.  Since ISO 3166 is
probably even more satisfactory for this purpose than it is for its
use in defining ccTLD names, it should probably be used (with the
UN/LOCODE where appropriate) unless something clearly better can be
found.  For example, a complete coding using this approach would be
something like "DE-BW-DESTR" for Stuttgart.

The corresponding matching rules seem obvious, but, to review them:

* If the stored record contains all three elements, then a query
  with fuzziness=exact should imply that

   "country" matches every record for that country
   "country and subdivision" matches all cities in that subdivision,
      but does not match other subdivisions
   "country, subdivision, city" matches only that exact stored
      record.

The "fuzziness" indicator should be fairly clear here, e.g., 0=exact,
10=match next level ("country, subdivision, city" matches the whole
subdivision), 99=all levels ("country, subdivision, city" matches the
whole country), and intermediate values might match adjacent cities
or subdivisions using some reasonable distance or adjacency function.
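
The rules above amount to prefix matching on the location hierarchy,
with fuzziness shortening the query before comparison.  The Python
sketch below is one possible reading, not a specification: the
function name is hypothetical, values between the three named levels
are simply rounded down to the nearest named level, and no distance or
adjacency function is attempted.

```python
def location_matches(query, stored, fuzziness=0):
    """Match a location query against a stored three-element record.

    Both arguments are tuples ordered (country, subdivision, city);
    the query may contain fewer elements than the stored record.
    """
    if fuzziness >= 99:
        keep = 1                         # all levels: country only
    elif fuzziness >= 10:
        keep = max(1, len(query) - 1)    # next level up
    else:
        keep = len(query)                # exact
    prefix = tuple(query[:keep])
    return tuple(stored[:len(prefix)]) == prefix
```

For a stored record ("DE", "BW", "DESTR"), the queries ("DE",) and
("DE", "BW") both match at fuzziness 0, while a full query for a
different city in the same subdivision matches only once fuzziness
reaches 10.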


3.4. Deployment against the existing DNS base

As with the "new class" approach to DNS changes [NEWCLASS], the
approach outlined here does not require any changes to the existing
installed DNS base.  But, like all solutions to the multilingual name
issues, it requires changes to all relevant applications.  The notion
of moving from lookup to searching does imply that we will need not
merely to change the code that calls the name resolution system, but
also to rethink the UIs of those applications.

3.5  Thoughts about user interfaces (UIs)

There are many possible models for user interfaces to be used with a
system of the type proposed here.  The IETF should, as usual, remain
agnostic about them.   At the same time, some notions about possible
user interfaces are important to demonstrate that the concepts are
practical and to inform the design of protocol interfaces.  So, with
the understanding that other approaches are possible, and may be
preferable:

As discussions on both DNS "searching" and multilingual names, and
the general model presented here, have evolved, it has become
apparent to some observers that these approaches would be best
realized in conjunction with user-specific directories or memory with
refresh capability, whether modeled on a local directory, or cache,
or history file, or something else.  It has been surmised [WJR ref?]
that the behavior of typical users is to spend most of their time
using or referencing known services and hosts (whether web sites,
hosts used in email addresses, or other services) and much less time
"searching" for unknown resources.  If this is actually the case,
then a typical reference should involve a DNS "name to address"
lookup only, even though it would be desirable for the DNS name to
not be visible to that user.   The user might reasonably see his or
her original collection of search terms, or a name assigned to that
search or its results, but actual searching would take place only as
a first-time activity or in the process of refreshing the search and
results (at user request or, perhaps, automatically).

3.6 Implementation models

While this document is not an implementation specification, nor is it
intended to substitute for one, some remarks about implementation
issues may be helpful in understanding the concepts that appear
elsewhere.

3.6.1 Calling and returning values

<<to be supplied -- see "placeholders" at the end of this draft>>

3.6.2 The cache model

Such "bookmarks" can be thought of as a local cache of queries and
responses with sufficient information to both immediately locate a
target associated with the user's perception of what was looked for
and of "refreshing" the search if circumstances changed or values
timed out.  In the presence of a particular query, a client system
would presumably check for a matching bookmark.  If one was not
found, the layer two search would be performed, yielding values that
might require user intervention for selection.  Once selected, the
search, the full set of facets returned, the DNS names or URIs, and
any TTL information would be stored (possibly using a user-supplied
name or tag) and the resource accessed via the appropriate DNS name
or URI.   If the search or tag was found in the cache, checks would
be made for the values being current and then the DNS name or URI
used directly, without going back through the search procedure.
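
The flow just described might be sketched as follows in Python.  The
class and parameter names are hypothetical, the "search" callable
stands in for both the layer two search and any user selection among
its results, and the TTL is assumed to be expressed in seconds.

```python
import time


class BookmarkCache:
    """Local cache of queries, returned facets, targets and TTLs."""

    def __init__(self, search, clock=time.time):
        # search(query) -> (facets, target, ttl_seconds); it represents
        # the layer two search plus user selection (an assumption).
        self._search = search
        self._clock = clock
        self._entries = {}

    def resolve(self, query):
        """Return the DNS name or URI for a query, searching if needed."""
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now < entry["expires"]:
            # Current bookmark: use the target without searching again.
            return entry["target"]
        # Missing or timed out: run the search and store the result.
        facets, target, ttl = self._search(query)
        self._entries[query] = {
            "facets": facets,
            "target": target,
            "expires": now + ttl,
        }
        return target
```

Repeated calls to resolve() with the same query invoke the search only
once until the stored TTL expires, after which the entry is refreshed.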

3.6.3 An example: Looking at Chinese Traditional-Simplified Mappings

One of the problems that the IDN WG has been unable to solve in a
satisfactory way is the requirement that strings written with
Traditional Chinese ("TC") characters match those written with the
corresponding Simplified Chinese ("SC") ones.  The relationships among
these characters have been variously described as similar to font
differences that are not properly reflected in the IS 10646 coding and
structure and as similar to case mapping in alphabetic scripts that
support case.  Although both are thought-provoking, there are
significant weaknesses in both analogies.  But the problem is
sufficiently important that the working group has received requests to
delay DNS-level internationalization implementation of all (or
selected subsets of) Chinese characters.

Fortunately, mapping between TC and SC is fairly easily handled at
sublayer two of the system proposed here.  Details and variations
still need to be worked out and a specific proposal developed, but it
appears that something similar to the following outline would be one
option:

Unlike the DNS, the sublayer two system will have the critical
language identification information available.  This eliminates the
problems associated with distinguishing Chinese character usage from
uses of similar characters (at the same IS10646 code points) in
Japanese and Korean.  Assuming that the language is Chinese,
"fuzziness" could be used to determine the precision of matching
required.  E.g., "no fuzziness" might be construed as "exact match",
i.e., no attempt at TC-SC matching.  A low (but non-zero) fuzziness
value might permit unambiguous single-character (i.e., "one to one")
matching between TC and SC characters, but no other variations.  And a
higher degree of fuzziness might match more extensively, including
cases in which multiple characters of context, or user selection
from a menu or pick list, were needed to determine a correct match.  ((Is
distinguishing between these two cases actually helpful?  I would
think it would be useful in the design of good-quality user
interfaces.))
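
A very small Python sketch of the three levels follows.  The two
tables hold only a handful of well-known character pairs and are not a
complete or authoritative mapping; the function names and the
threshold value of 50 for the higher level are assumptions.  At the
higher level the sketch simply treats ambiguous characters as
equivalent, where a real system would use context or ask the user.

```python
# Traditional characters with a single Simplified counterpart.
ONE_TO_ONE = {"國": "国", "學": "学"}

# Two Traditional characters that share one Simplified form, so the
# reverse mapping is ambiguous without context.
MANY_TO_ONE = {"發": "发", "髮": "发"}


def fold(text, fuzziness):
    """Fold Traditional characters to Simplified, as fuzziness allows."""
    table = {}
    if fuzziness > 0:
        table.update(ONE_TO_ONE)
    if fuzziness >= 50:
        table.update(MANY_TO_ONE)
    return "".join(table.get(ch, ch) for ch in text)


def tc_sc_matches(query, stored, fuzziness=0):
    """Compare two Chinese-language strings at a given fuzziness."""
    return fold(query, fuzziness) == fold(stored, fuzziness)
```

The sketch assumes the language facet has already identified both
strings as Chinese, which is what makes the folding safe to apply.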

If it were worthwhile, other variations on a sublayer two system could
be used to handle different character input models as server
functions.  For example, use of a different language subtype (or a
heuristic on the name string) could permit phonetic input (presumably
Pinyin, but, if anyone wanted it, a different subtype could permit use
of alternate systems such as Wade-Giles) even though the names in the
database were stored in Chinese characters.  Use of phonetic input of
course absolutely requires matching of TC and SC characters.

3.6.4 An example: Distance functions and Latin-based alphabets

The discussions of case mapping for scripts in which the
rules are subtle or culturally dependent have restarted the argument in
some quarters as to whether the case-mapping rule of the DNS was wise.
The alternate position is that users are better off with a single form
of writing an identifier and that they will then "get used to getting
it right".  The use of fuzziness with such scripts might permit this
issue to be left to the user or interface designer, e.g., no fuzziness
would imply no case matching, somewhat more fuzziness would permit
case matching in those cases where the rules were exact and
one-to-one, and additional fuzziness would permit matching, e.g.,
with and without diacritical marks or across character variants.  The
presence of language information makes these approaches much more
workable than they would be with the DNS, even with a more complex
canonicalization process than is now anticipated in "nameprep".
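
The graduated behavior can be illustrated with the Python standard
library.  The threshold values 10 and 20 and the function names are
assumptions, and the sketch deliberately ignores the language facet,
so it does not handle scripts or languages whose case rules are not
one-to-one.

```python
import unicodedata


def fold_latin(text, fuzziness):
    """Fold a Latin-script string according to a fuzziness level."""
    if fuzziness >= 10:
        # Simple case matching; not language-sensitive.
        text = text.lower()
    if fuzziness >= 20:
        # Decompose, then drop combining marks (diacritics).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
    return text


def latin_matches(query, stored, fuzziness=0):
    """Compare two strings after folding both at the same level."""
    return fold_latin(query, fuzziness) == fold_latin(stored, fuzziness)
```

At fuzziness 0 the comparison is exact, at 10 it ignores case, and at
20 it also ignores diacritical marks.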


3.7 Older applications

To fully realize the benefits of internationalized naming requires
changing all relevant applications to understand the new method,
whatever it is.  Even the "internationalize the DNS" proposals are
subject to this principle.  Older applications will see distorted and
unfriendly names under some systems, and no names at all under others
(some approaches might cause implementations of some applications to
fail entirely).

The environment contemplated here is a "no international names in old
applications", i.e., "no new names without upgrading", one --
applications that have not been upgraded will not see
internationalized names or other natural-language phrases, nor coded
surrogates for them.

The advantages of a "no names without upgrading" approach are that it
avoids confusion and the risk, however slight, of catastrophe.  As
with the original host table to DNS conversion, it provides an
incentive to convert old applications to make newer naming styles,
and newer names, visible.  None of these transitions are ever easy,
but it may be worth going through this one to get things right,
rather than investing a large fraction of the pain to get a solution
that doesn't quite do the job.


4.  Comparisons to existing and proposed technology

4.1 The IDN Strawman

After the IETF IDN working group came into being, its work rapidly
converged on the assumption that internationalized name referencing
issues and requirements --including the requirements, not heretofore
satisfied even for ASCII-based names, to be able to search for things
using the DNS-- could be achieved by placing non-ASCII identifiers
into the DNS itself, in some coded form.  These identifiers have
commonly been described as "multilingual names", further complicating
the work program and consensus-seeking process in that working group.

Many of the problems associated with trying to overload the DNS in
this way have been described in [DNSROLE].  And that document, and
the experience from which it is drawn, predict that the IDN WG effort
will ultimately fail if it goes down paths that require sensitivity
to the characteristics of particular languages, rather than just an
expanded set of characters to be used in identifiers.  As implied in
the [DNSROLE] document, consideration of language-related issues and
their appropriate handling was one of the primary motivations for
the model developed here.

However, at least from the viewpoint of this author, one important
question remains: assuming that the IDN WG's work can be appropriately
narrowed down to characters and identifiers, does the value of
local-language identifiers justify putting non-ASCII strings into the
DNS even if end users never see them?  We argue in section 2.1 that it
is not necessary and poses some risks.  However, the "variables in
programming languages" analogy and the "local directory or cache"
approach, both outlined above, suggest that such names would be
extremely useful and fairly safe if the limits of code-point-level
matching and identifier-only use are taken narrowly and observed
conservatively [ICANN-Permitted].  And, if one believes the model
outlined here, or any competing "keyword" model (see next section),
will achieve wide deployment and use, the needs and perspectives of
such systems should condition the evaluation of IDN WG-produced
alternatives.  So there is a serious and complex set of engineering
(and, realistically, political) tradeoffs to be evaluated in making
the decision as to whether wide deployment of some version of the IDN
work is appropriate.

4.2 "Keyword" systems

In the Internet object-referencing context, the term "keyword system"
is used to refer to many different things.  Many would fit nicely into
the third sublayer environment, but most of the existing proposals put
them directly on top of the DNS, or skip the DNS entirely and go
directly to IP addresses.  The difficulty with these systems is that
they either must be localized (e.g., a different system or database
for each language, country, or smaller locality) or they don't scale
well.  In particular, they eventually suffer from either the "all the
good names are taken" problem (of which the DNS is frequently accused)
or they are very vulnerable to poor retrieval precision properties as
the number of names (or keyword combinations) in the name space grows
large.

Adapting bibliographic styles of keyword systems to operate locally
and as part of the third sublayer model proposed here would appear to
be the best way forward for such systems.  It has been observed that
what most users really want most of the time is localization, and
locally-oriented keyword systems could satisfy much of that
requirement.  And keyword systems would be strengthened by being
placed on a base of use- and language-sensitive naming and searching,
rather than on the low-context, monohierarchical DNS.

Other types of keyword systems, including the one described by Arrouye
and Popp [Arrouye], are really special cases of the sublayer two
search service that rely on careful selection of names (and,
consequently, resolution of "best fit" and "rights") to achieve
uniqueness and, hence what they describe as "direct navigation" (see
elsewhere in this document).  Similar systems might utilize a set of
keywords combined into a phrase that can be interpreted, possibly with
permutation rules, in a search service.  In the interest of
simplification and presenting simple names to users, these systems are
likely to omit most or all of the non-name string facets from
user-visible search interfaces.  Some further analysis, as to whether
what is optimally desirable is a set of unordered keywords, or an
ordered phrase that might contain such keywords, seems called for.
Different answers could, of course, be implemented at different layers
of this model.


4.3 Client-side and server-side solutions

The key approaches being considered in the IDN WG are essentially
client implementations, applied to names before they are placed in
the DNS.  This contrasts with the existing use and protocols of DNS
in which, e.g., string matching is done on the server.  Ignoring
speed of deployment (which can be argued either way), the advantage
of client-side implementations is that they don't require changes to
the DNS fabric itself (and therefore minimize the risk of damaging
existing applications that rely on that fabric).  Because the
sublayer two and three mechanisms do not rely on the DNS for any
searching or matching activities, and are completely new, server-side
implementations are again feasible: applications will require
modification to access these services (just as they would to support
a client-side implementation), but older, unmodified, applications
will not touch them at all.

Server-side implementations have several advantages over client-side
ones.  If something complicated is being done, it is often possible
to apply more computer resources, or larger tables, on a server, and
to update those resources and tables more easily if needed.  And
server-side implementations tend to yield more uniformity of behavior
relative to having a potentially wide mix of client implementations.


5. Comments on business models

Historically, the IETF has had even less desire to involve itself
with business models than it has with user interfaces (see section
3.5).  But the approach outlined here, and the protocol and
operational proposals that will derive from it, face a particular
challenge: the DNS works well for its intended purpose (something we
don't intend to change) and arguably works at least tolerably for
some purposes, including as a search engine, for which it was not
intended.  Many of us see its quality and capabilities, when used as
a search (or, more accurately, "guessing") engine, deteriorating, but
collapse, if it occurs, is still in the future.  There are also
considerable vested interests -- both economic and policy control--
associated with the current DNS structure and arrangements.

The ability to produce and deploy a different model, especially one
that requires new work in several areas, against that backdrop will
be challenging at best.  Unless there are clear business models for
doing so, the odds of success are quite low.  So this section
outlines some of the business issues and models not covered elsewhere
in this document.   As with the user interface discussion, it is not
intended to be definitive: some of these models may fail and others
may be more attractive.  But it is intended to provide a sufficient
demonstration of concept that, perhaps, the technical ideas can be
taken seriously.

We observe that a telephone system analogy may be helpful.  With the
telephone system, there are registries, described as national
numbering databases, that record which numbers are in use and by
whom.  There are white pages services which, given locale and some
other information (e.g., whether business or residential in some
areas) and a near or exact match to a name, provide name to number
lookup.  And there are yellow pages services, with precise categories
and organization differing somewhat from one location to another.
Organizations make money at all three levels, but the greatest
aggregate income occurs with the yellow pages services.

At each of sublayers two and three, there are multiple services.
Some of these would probably need to be operated as public goods,
spreading costs over the producers of other services.  Others would
presumably be directly profitable.

5.1 Sublayer two - faceted global searching

5.1.1 Facet listings and identification

For the attribute facets that rely on controlled vocabularies, some
organizational structure would be required to oversee those
vocabularies.  As suggested elsewhere, the ideal would be to use
pre-existing organizations and pre-existing lists (the WIPO
classification of goods and services [NICE] is an example of such a
list, as would be the IS 3166-1 list traditionally used for country
code domain names).  Where such lists did not exist, it would be
necessary to build arrangements for them.  The maintenance of such
vocabularies would, from an Internet standpoint, be a public good.

5.1.2 Registration and searching

Actual registrations would be required for names and their attributes
with, as mentioned above, multiple registrations when an individual,
organization, or business wished to be registered with more than one
attribute set.  The economic model would presumably parallel the
current registrar and registry business, with a charge for
registration (since there is no intrinsic requirement for a single
registry, registry services might well be competitive, eliminating
the need for models that separate registries and registrars).
However, lookup and search activities would be more flexible than the
DNS, with extended services, including character set transposition,
language translation, and potentially more extensive search
variations being potential areas on which providers could compete,
using fee for service or subscription models to support costs.

5.2 Sublayer three - localized databases and searching

As mentioned above, yellow pages and publication of directories and
guidebooks are traditionally where the money has been made.  The
analogies apply: one could imagine charging for entering information
into the databases, or for searching, or for information delivered,
or all three of these.  And all have been used for papers and related
databases.


6. Glossary

ACE

Encoding form

Facet

Keyword (see section 4.2)

Multilingual name (see section 4.1)


7.  Summary

The solution to the "multilingual DNS" problem, and to a series of
other limitations of the DNS relative to today's expectations for
naming and searching, lies in mechanisms targeted to those problems,
rather than in superimposing additional mechanisms on the DNS in ways
that, those who advocate them hope, will not cause problems with
older programs and unconverted infrastructure.  Inserting new search
layers avoids those risks and permits a clean solution that is
adapted to the problems, rather than to the limitations imposed by
existing properties of the DNS.


8. IANA Considerations and related topics

At search layer two, it is difficult to think about how the system
might function successfully without controlled vocabularies for each
of the non-name facets.  As discussed in section 2.2, we have already
established one such registry (bound to an ISO standard), and
mechanisms for utilizing it, with RFC 3066.  The Madrid agreement and
its predecessors [NICE] provide classifications for types of
businesses, but we would need to extend the registry for names that
are not business-related.  The two locational attributes are somewhat
vague at this point, but controlled vocabularies would presumably be
needed, and should, if possible, be drawn from stable, non-IETF, work
(e.g., IS 3166-1 and 3166-2 might provide a foundation, and possibly a
complete list, for the location vocabulary).  Curiously, there is no
technical reason why the name-strings themselves must be unique: that
is one of the attractions of a model like this over attempting to
overload the DNS.  If conflicts or confusion occur, those are standard
civil (marketplace or trademark) issues that can be resolved in their
own environments, rather than posing special Internet problems.
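
One way to picture the arrangement described above, offered only as a
sketch and not as a proposed design: non-name facets are checked
against controlled vocabularies, while name-strings are accepted
without any uniqueness test.  The facet names, the class and method
names, and the tiny sample vocabularies below are all hypothetical
and chosen for illustration; real vocabularies would come from the
external lists discussed in this section.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical, deliberately tiny excerpts standing in for externally
# maintained controlled vocabularies (assumptions for this sketch).
VOCABULARIES: Dict[str, Set[str]] = {
    "language": {"en", "fr", "zh"},    # RFC 3066 style language tags
    "country": {"US", "FR", "CN"},     # IS 3166-1 two-letter codes
    "business_class": {"30", "35"},    # illustrative class numbers only
}


@dataclass
class Registration:
    name: str                   # free-form name-string, not required unique
    attributes: Dict[str, str]  # non-name facet -> vocabulary value
    uri: str


class FacetRegistry:
    """Accepts registrations whose non-name facets use known values."""

    def __init__(self, vocabularies: Dict[str, Set[str]]) -> None:
        self.vocabularies = vocabularies
        self.entries: List[Registration] = []

    def register(self, reg: Registration) -> None:
        # Every supplied non-name facet must exist and use a listed value.
        for facet, value in reg.attributes.items():
            allowed = self.vocabularies.get(facet)
            if allowed is None:
                raise ValueError(f"unknown facet: {facet}")
            if value not in allowed:
                raise ValueError(f"{value!r} is not in the {facet} vocabulary")
        # Deliberately no uniqueness check on reg.name.
        self.entries.append(reg)

    def lookup(self, name: str) -> List[Registration]:
        # Several registrations may legitimately share one name-string.
        return [entry for entry in self.entries if entry.name == name]
```

In this sketch two registrants may both hold the name-string
"Example" with different attribute sets; telling them apart is left
to the other facets, and any dispute between them is left to the
civil mechanisms mentioned above.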


9. Security Considerations

Additional layers of naming, searching, and databases imply additional
opportunities for compromising those databases and mechanisms.
Part of the challenge with the model implied here is to determine how
to secure and authenticate those databases and access (especially
modify access) to them.  The good news is that, since the functions
are new, we should be able to design security mechanisms in, rather
than --as with the DNS-- have to try to graft them on to a structure
not designed for them.

10. References

Most of the references in this document are to examples of approaches
to the systems outlined here, or provide additional information about
the context of some of the suggestions, or are included to give
credit for particular ideas or to better identify earlier
approaches.  None of those references are normative in the protocol
sense typically used in the IETF.

10.1. Normative References

[ISO639] ISO 639:1988 (E/F) - Code for the representation of names of
languages - The International Organization for Standardization, 1st
edition, 1988-04-01

[ISO3166-2] International Organization for Standardization. "Codes for
the representation of names of countries and their subdivisions --
Part 2: Country subdivision code".  1998.
     The values provided by this standard, and its use with the
     UN/LOCODE list, are discussed at
     http://www.din.de/gremien/nas/nabd/iso3166ma/devrel_2.html

[LangTag] Alvestrand, H. "Tags for the Identification of Languages",
RFC 3066, January 2001.

[RFC882] Mockapetris, P.V. "Domain names: Concepts and facilities",
RFC 882.  November 1983.

[RFC883] Mockapetris, P.V. "Domain names: Implementation
specification", RFC 883.  November 1983.

[RFC1035] Mockapetris, P.V. "Domain names - implementation and
specification", RFC 1035.  November 1987.

[RFC2826] IAB. "IAB Technical Comment on the Unique DNS Root", RFC
2826.  May 2000.

[RFC3066] Alvestrand, H. "Tags for the Identification of Languages",
RFC 3066. January 2001.


10.2. Explanatory and informative references

[Arrouye] Arrouye, Yves, et al. "Keyword Lookup Systems As a Class
of Naming Systems".  Work in progress, draft-arrouye-kls-00.txt, and
unpublished BOF proposal.

[Austein] Austein, Rob.  Private communication.

[CDNC] One or more of the TC<->SC works in progress, to be supplied.

[CNRP] Popp, N., M.  Mealling, L. Masinter, K. Sollins. "Context and
Goals for Common Name Resolution", RFC 2972, October 2000.

[DNSROLE] Klensin, J., "Role of the Domain Name System", work in
progress, draft-klensin-dns-role-02.txt.

[HOSTNAME] Harrenstien, K., M.K. Stahl, E.J. Feinler.  "Hostname
Server", RFC 953, October 1985.  Also Braden, R., ed. "Requirements
for Internet Hosts - Application and Support", RFC 1123, October
1989.

[ICANN-Permitted]

[LDAP] Wahl, M., T. Howes, S.  Kille. "Lightweight Directory Access
Protocol (v3)", RFC 2251, December 1997.

[Mealling-SLS] Mealling, M. and L. Daigle, "Service Lookup System
(SLS)", work in progress, draft-mealling-sls-01.txt.

[NAMEPREP] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Host Names", work in progress,
draft-ietf-idn-nameprep-04.txt

[WIPO-NICE] World Intellectual Property Organization, "Nice Agreement
concerning the International Classification of Goods and Services for
the Purposes of the Registration of Marks", June 1957.

[Netword] http://corp.netword.com/ -- real reference needed.

[NEWCLASS] Klensin, John, "Internationalizing the DNS -- A New
Class", work in progress, draft-klensin-i18n-newclass-...

[RealNames] http://www.realnames.com/ -- real reference needed.

[RFC1591] Postel, J. "Domain Name System Structure and Delegation",
RFC 1591, March 1994.

[RFC2345] Klensin, J., T. Wolf, G. Oglesby. "Domain Names and Company
Name Retrieval", RFC 2345. May 1998.  It is perhaps worth noting
that, as in the case of many RFCs, descriptions of this work were
widely circulated in draft form and discussed for a year or two
before being published as an RFC.

[RFC2822] Resnick, P., Editor. "Internet Message Format", RFC 2822.
April 2001.

[RFC2825] IAB, L. Daigle, ed. "A Tangled Web: Issues of I18N, Domain
Names, and the Other Internet protocols", RFC 2825.  May 2000.

[THES] International Organization for Standardization. "Information
and documentation -- Vocabulary" ISO 5127:2001.

[RFC-URI] Berners-Lee, T., R. Fielding, L. Masinter. "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396.  August 1998.

[WAIS] M. St. Pierre, J. Fullton, K. Gamiel, J.  Goldman, B.
Kahle, J. Kunze, H. Morris, F. Schiettecatte.  "WAIS over
Z39.50-1988", RFC 1625.  June 1994.

[Z39] International Organization for Standardization. "Information and
documentation -- Information retrieval (Z39.50) -- Application service
definition and protocol specification", ISO 23950:1998.


11. Acknowledgements

This document, and the related notes, are the result of thinking that
has come together and evolved since before the issue of
internationalized access to domain names came onto the IETF's radar.
Discussions with a number of people have led to refinements in the
approach or the text, even though some of them might not recognize
their contributions or agree with the conclusions I have drawn from
them (indeed, some of those discussions were rooted in challenges to
the general ideas expressed here).  Particularly important suggestions
have come from, or arisen out of conversations with, Ran Atkinson,
Harald Alvestrand, Rob Austein, Fred Baker, Christine Borgman, Eric
Brunner-Williams, Randy Bush, Vint Cerf, Kilnam Chon, Dave Crocker,
Leslie Daigle, Patrik Faltstrom, Michael Froomkin, Francis Gurry,
Marti Hearst, Paul Hoffman, Kenny Huang, Karen Liu, Mao Wei, Michael
Mealling, Gary Oglesby, Mike Padlipsky, Qian Huilin, James Seng,
Theresa Swinehart, Tan Tin Wee, Len Tower, and Zita Wenzel, as well as
some memorable long-ago conversations with Jon Postel and J.C.R.
Licklider.


12. Author's Address

John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
klensin+srch@jck.com


Placeholders

For some reason, new ideas or approaches, or ways of presenting or clarifying
existing ones, seem to arise immediately before a version of this
document is submitted for posting.  It has often been impossible to
properly incorporate these.  The following are pending, and will be
picked up in the next revision:

(i) A new section (3.3.3) on "Discussion of Industry Types" that will
introduce a better model (and less handwaving) for handling industry
type codes where they are appropriate and structuring data for that
facet for other types of names.

(ii) A reworking of section 3.6.2, which is still not as clear as it
should be and needs to be expanded to fully explain the model unless
someone else produces a caching/bookmark/refresh description first.
(Other parts of 3.6 are likely to be reworked in the process -- that
whole section is a new idea, responding to a number of comments, in
the current revision).

(iii) Section 3.6.3 should be reviewed with people who actually
understand the language and issues and then rewritten.  And section
3.6.4, which is now just an outline, needs to be filled in.

(iv) Completion of the glossary, which seems to be necessary for
readers who have not been immersed in, e.g., the discussions of the
IDN WG.

(v) Section 2.6.1, as originally prepared for this draft, was not
coherent and has been removed.  A new section will be written.  The
idea is not to specify an API, but to make clear, in call-return
terms, that the expected "boundary" inputs to a sublayer two search
system will be pairs of facet value and fuzziness indicator, one pair
for each specified facet, but with missing facet values permitted (as
discussed above).  The return values will contain zero or more records
that, in turn, contain the stored strings for all facets, a URI, and
time to live information, plus, perhaps, a free-text descriptive
string (in the same language as the name-string) provided at the time
of registration of the information into the database.  Similar
problems affected section 2.2.6.
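
As a purely illustrative sketch of the call-return shape described in
item (v), and not a proposed API, the boundary might look like the
following Python.  The facet names, the Record and search names, the
0.0 to 1.0 fuzziness scale, and the use of difflib string similarity
as the fuzzy matcher are all assumptions made for this example; none
of them come from this document.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    facets: Dict[str, str]             # stored strings for all facets
    uri: str                           # where the match points
    ttl: int                           # time to live, in seconds
    description: Optional[str] = None  # optional free-text description


# A query maps a facet to a (value, fuzziness) pair.  Facets left out
# of the mapping are the "missing facet values" and match anything.
Query = Dict[str, Tuple[str, float]]


def facet_matches(stored: str, wanted: str, fuzziness: float) -> bool:
    """Compare one facet; fuzziness 0.0 demands a case-insensitive
    exact match, larger values tolerate more difference (assumed scale)."""
    similarity = SequenceMatcher(None, stored.lower(), wanted.lower()).ratio()
    return similarity >= 1.0 - fuzziness


def search(database: List[Record], query: Query) -> List[Record]:
    """Return zero or more records that satisfy every specified facet."""
    results = []
    for record in database:
        if all(
            facet_matches(record.facets.get(facet, ""), value, fuzziness)
            for facet, (value, fuzziness) in query.items()
        ):
            results.append(record)
    return results
```

A query that omits a facet places no constraint on it, so in this
sketch a query with no facets at all returns every record.  Whether a
real system should permit that is outside the scope of the sketch.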

Expires August 2002