INTERNET-DRAFT                                John C. Klensin
July 20, 2001
Expires January 2002


                           A Search-based access model for the DNS
                                   draft-klensin-dns-search-01.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document supplements a companion document [DNSROLE] on the role
of the DNS relative to the uses to which it is being put and is
intended to start laying the groundwork for a specific proposal.
Both documents, their successors, and closely-related issues, can be
discussed on the mailing list at ietf-i18n-dns-directory@imc.org.
See http://www.imc.org/ietf-i18n-dns-directory/ for subscription and
archival information.


Copyright Notice

Copyright (C) The Internet Society (2000).  All Rights Reserved.



0. Abstract

This memo discusses strategies for supporting "DNS searching" --
finding of names in the DNS by a mechanism layered above the DNS
itself that permits fuzzy matching, selection that uses attributes or
facets, and use of descriptive terms. Demand for these facilities
appears to be increasing with growth in the Internet (and especially
the web) and with requirements to move beyond the restricted subset
of ASCII names that have been the traditional contents of DNS
"Class=IN".  This document proposes a three-level system for access
to DNS names in which the upper two levels involve search, rather
than lookup, functions. It also discusses some of the issues and
challenges in completing the design of, and deploying, such a system.


1. Introduction and Executive Summary

The notion of "DNS searching" is something of an oxymoron: the DNS is
structured to perform only exact lookups of structured strings of
labels.  But, as discussed elsewhere, there is considerable demand
for searching facilities -- partial and fuzzy matching, selection
that uses attributes or facets, and searching using descriptive
terms -- and that demand appears to be increasing with growth in the
Internet (and especially the web) and with requirements to move
beyond the restricted subset of ASCII names that have been the
traditional contents of DNS Class=IN.  This document proposes a
three-level system for access to DNS names in which the upper two
levels involve search, rather than lookup, functions. It also
discusses some of the issues and challenges in completing the design
of, and deploying, such a system.

These types of services are unnecessary as long as the problem is
defined as "get non-ASCII identifiers into the DNS, but keep to a
well-specified set of characters and usage so they retain strict
identifier properties".  Such approaches do not, as discussed in
[DNSROLE], solve the problem as perceived by many people.  And, as the
IAB has pointed out [RFC2825], "fixing the DNS" is the easy part: the
harder problem is considering and adjusting the applications and
applications-level user interfaces.

It has been suggested that introducing a "directory" or "keywords"
into, or above, the DNS could be used as a solution to the IDN
problem and, often, several others.  Probing statements about
"directories" often quickly demonstrates that their advocates don't
agree on what they mean.  This section outlines a three-layer
search/lookup model (adding two layers to the one provided by the
DNS, i.e., constructing a three-layer model, rather than continuing
with the single one we have today).  Those layers consist of the
current DNS, a search-capable layer using an extremely simple set of
facets, and a layer capable of broader search approaches in a
localized context. It is intended as a strawman for criticism and
development, rather than as a specific proposal.  I.e., the details
are left for WG efforts.

As a terminology issue, the "layers" described here are probably best
thought of as sublayers of the applications layer, with actual
user-facing applications lying yet above them.  The term "search
layer" has been used below where it appears to be needed for clarity
or emphasis, and "sublayer" and "level" are sometimes used
interchangeably with it: suggestions for better terminology would be
welcomed.

At the two "above DNS" sublayers, international ("universal")
character sets and scripts are assumed and part of this initial
design.  Since actual or applications-applied DNS restrictions are
not being inherited upward into these sublayers, coding can be chosen
for maximum utility and balance among language groups.  E.g., native
UCS-4 could be used as an alternative to a secondary encoding form
such as UTF-8 or an ASCII-compatible recoding.

This document is a preliminary proposal -- a framework and fodder for
a working group or design team -- rather than a complete specification
or even an approximation to one.  It is complemented by
[Mealing-SLS], which discusses a CNRP-based implementation model for
the middle sublayer.


2. A three (or four) search-layer environment.

The material below suggests three or more sublayers for name lookup
and search:

   (1) The DNS, with the existing lookup mechanisms

   (2) A restricted, facet-based, search system.

   (3) Commercial, localized, and potentially topic-specific, search
   environments.

   (4) Something else?


2.1.  Search Layer One: Identifiers -- a lookup system and the DNS.

In this model, the DNS remains largely as is (see section 3.3ff) or,
perhaps, a bit closer to its original purpose and assumptions than
the direction in which it has evolved in recent years.  I.e., it is a
distributed database, with precise lookups, whose lookup keys are
identifiers for Internet hosts and other objects.  We give up the
notion that these identifiers should also serve as human-useful names
or at least try to abandon that notion.

   As an aside, note that some people have suggested that we
   should dehumanize DNS names entirely, e.g., prohibit the
   registration and use of any name that can be found in any
   dictionary for any language that can be represented in the
   DNS-acceptable character set.  This proposal doesn't
   include that idea.  But it is absent primarily because it does
   not appear that the transition process is worth the time it
   would take to explore, rather than because it has no appeal.

The goal at this sublayer is relatively simple, unique, identifiers.
It is probably desirable that these identifiers be able to have some
human mnemonic value, but less important that they be tightly bound
to real-world names and descriptions.

The inputs and outputs at this layer are as they are in the DNS
today, although modifications to accommodate non-hosttable format
names there remain possible if that is deemed important.


2.2 Search Layer Two: Names -- a faceted search system with a small
number of facets.

Much of the current burden borne by the DNS would appear to be better
localized in a search system that contains names and a small number
of facets/attributes.  This burden includes a wide range of
non-identifier goals and constraints: names that a user can
understand and find and that have significant mnemonic value, names
with trademark implications, a wide variety of naming systems and, in
general, helping people find the things for which they are looking.
It is critical that the number of attributes be constrained to a
minimal set -- and that other attributes, especially those of special
interest, be deferred to the third search layer.

It is probably most useful to think about this layer in terms of a
structured, multifaceted, multihierarchical, thesaurus-like database
with search capability (Cf. ISO IS 5127-1 and IS 5127-6 [THES]),
rather than as a "directory" in the sense of X.500 and its
derivatives and antagonists.

A key question is what facets to use once the commercial product
requirements are removed (to search layer three, see below).  It
appears to me that, to satisfy the critical name-uniqueness and
real-world pressures on the DNS, candidates might be

     name-string (IS 10646, see below)
     language (presumably per RFC 3066)
     geographical location (country, and/or for some federal
            countries, country/province ("state"); granularity is
            important; there may be a case for an additional facet
            in a coordinate system)
     network location (if we can figure out what that means
            and how to express it in a canonical way)
     industry category code (for companies, presumably derived
            from some existing official list; the list would
            need to be extended to deal with non-commercial
            organizations and entities and for identifying
            resources and services associated with people)

This typology gives the trademark view of the world somewhat more
precedence in looking at name conflict issues than one might like in
principle.  But, in practice, one of the key issues we have
encountered in trying to store "names", rather than identifiers, in
the DNS is that the process unreasonably flattens the space.  That
"Joe's Auto Repair" and "Joe's Pizza" can co-exist in the same
geographical area without conflict or confusion and that "Joe's
Pizza" in one area can co-exist with "Joe's Pizza" in another, again
without conflict or confusion, are the consequence of the way we name
and identify things in the real world.  Most trademark rules are the
consequence of those naming systems, not their cause.

It is not intended that this level act as a white pages service for
people.  Doing so leads down several slippery slopes at once,
including heightened privacy concerns and a stronger requirement for
URL targets rather than DNS label ones (see below).

The general intent is that the list of facets be fixed by protocol
and that possible values for each facet be controlled vocabularies,
not necessarily (and probably not) controlled from the same source.
We would hope to utilize existing terminology lists where possible.
For a particular record (i.e., a name and its set of attributes), and
especially if requirements for uniqueness can be bypassed or relaxed,
the selection (from the controlled vocabularies) of particular facet
values would be the responsibility of the entity registering the
names.  In other words, someone registering a "name" in this system
would select values for each of the facets from the controlled
vocabulary for that facet as part of the process of placing the name
into a database.
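
As a rough illustration only, the registration step might be sketched
as follows.  The vocabularies, field names, and the register()
function are assumptions made for this example; nothing here is
specified by this document, and the network-location facet is omitted
because its form is still undefined.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies.  Real ones would be drawn from
# external lists (e.g., RFC 3066 language tags, IS 3166 codes).
LANGUAGES = {"en", "fr", "zh"}
LOCATIONS = {"GB", "US", "US-MA"}
CATEGORIES = {"restaurant", "auto-repair"}


@dataclass(frozen=True)
class NameRecord:
    """One layer-two entry: a name-string plus its facet values."""
    name_string: str
    language: str
    location: str
    category: str


def register(name_string, language, location, category):
    """Build a record, rejecting facet values outside the vocabularies."""
    checks = [
        ("language", language, LANGUAGES),
        ("location", location, LOCATIONS),
        ("category", category, CATEGORIES),
    ]
    for facet, value, vocabulary in checks:
        if value not in vocabulary:
            raise ValueError(f"{facet} value {value!r} is not in the vocabulary")
    # The name-string itself is free text and is not vocabulary-checked.
    return NameRecord(name_string, language, location, category)
```

The registrant chooses the facet values; the system only checks that
each choice comes from the corresponding vocabulary.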

It should be clear that there is significantly more information (from
the values of the facets) at this layer than there is in the DNS.

The names in this environment can reasonably be written in IS 10646
codes or some recoding of them.  Since we would be starting more or
less from scratch, we could select lengths and codings for maximum
efficiency and utility, not to meet the constraints of existing
software.  In such a context, this author has a slight bias for
direct UCS-4 coding, rather than ASCII-compatible ("ACE") codes;
compressed, null-octet-eliminating, systems such as UTF-8; or
surrogate introducers to hold things to 16 bits.  The loss in
transport efficiency is likely to be more than compensated for by
gains in cleanliness and equal treatment of all scripts.  But that
issue is separate from the main and important design arguments of
this document.
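
The size of that loss is easy to estimate.  The sketch below compares
octet counts for the two encodings, using Python's "utf-32-be" codec
as a stand-in for UCS-4 (the two coincide for the code points used
here).  The function name and sample strings are illustrative.

```python
def encoded_sizes(name):
    """Return the octet counts of a name in UCS-4 (UTF-32BE) and UTF-8."""
    return {
        "ucs4": len(name.encode("utf-32-be")),  # always 4 octets per character
        "utf8": len(name.encode("utf-8")),      # 1 to 4 octets per character
    }


ascii_name = "Joe's"                    # five ASCII characters
cjk_name = "\u65e5\u672c\u8a9e"         # three CJK characters
print(encoded_sizes(ascii_name))
print(encoded_sizes(cjk_name))
```

For the five-character ASCII string the UCS-4 form takes 20 octets
against 5 for UTF-8; for the three-character CJK string it takes 12
against 9, so the penalty falls mostly on ASCII-range names.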

The work done for "nameprep" [NAMEPREP] in the IDN WG is almost
certainly relevant to determining which names to actually store in
the database.  But the stakes are lower here than the "get it right
or fail completely" constraint of the DNS lookup environment: one can
imagine search mechanisms that would apply a more liberal set of
matching rules (and/or localized and language-specific ones) than the
rules used to encode names, much like recent applications protocols
that explicitly distinguish between the formats one is permitted to
send and those one is expected to accept (Cf. [RFC2822]).
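
A minimal sketch of that split, assuming NFKC normalization for the
stored form and a looser, punctuation-insensitive key for matching,
might look like the following.  Neither function is the nameprep
algorithm; both are stand-ins chosen for brevity, and the function
names are hypothetical.

```python
import unicodedata


def stored_form(name):
    """Strict form applied when a name is entered into the database."""
    return unicodedata.normalize("NFKC", name)


def match_key(name):
    """More liberal form, used only to compare queries with entries."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Drop punctuation and spaces so "Joes Pizza" can match "Joe's Pizza".
    return "".join(ch for ch in folded if ch.isalnum())


def matches(query, entry):
    """True if the query and the stored entry agree under the liberal rules."""
    return match_key(query) == match_key(entry)
```

The stored name keeps its case and punctuation; only the comparison
is relaxed, and a localized service could substitute its own
match_key() without touching the stored data.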

As is common with systems of this type, we would anticipate the
possibility of searching on any of the attributes and that searching
on free-text strings would not be exact (i.e., near-match responses
could be returned using any of several algorithms, with the user
making choices).  As is equally common, we should think about user
interfaces that store both queries and response sets so that the
responses could be used offline and refreshed when the client systems
were attached to the Internet.
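
One of the "several algorithms" could be as simple as a similarity
ratio over the name-string.  The sketch below uses the standard
Python difflib module for that purpose; the sample names, the cutoff,
and the near_matches() function are assumptions for illustration, and
a real service would presumably use something language-aware.

```python
import difflib

# Hypothetical stored name-strings.
NAMES = ["Joe's Pizza", "Joe's Auto Repair", "Jo's Pizzeria", "Bristol Books"]


def near_matches(query, names=NAMES, limit=3, cutoff=0.6):
    """Return stored names similar to the query, best match first."""
    # Compare case-insensitively but return the names as stored.
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]
```

A misspelled query such as "joes piza" would return a short list of
candidates, leaving the final choice to the user.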

In summary, the goal at this layer is to provide unique tuples of
human-recognizable (not just mnemonic) names, but names that are
unique within a context, rather than a global system based on the
names alone.

The inputs at this layer are search values for one or more of the
facets.  The outputs are still controversial, but would appear to
best be the full facet set of the matched tuple(s) and one or more
DNS names.  One of many interesting questions is whether this layer
should pass through and return the DNS records themselves (labels,
class, type, and target) or whether it should return names (labels)
and let the applications do the DNS lookups.  Another possibility is
to return one or more URLs (or more general URIs?) rather than DNS
names.  Doing so increases flexibility but at the cost of greater
complexity and risk of recursion problems.
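
To make the first of those output options concrete, the sketch below
takes values for any subset of facets and returns the full matching
entries together with their DNS names, leaving the DNS lookup to the
application.  The entry layout, the ".example" names, and search()
are all hypothetical.

```python
# Hypothetical layer-two entries: facet values plus associated DNS names.
ENTRIES = [
    {"name": "Joe's Pizza", "language": "en", "location": "GB",
     "category": "restaurant", "dns": ["joes-pizza.example"]},
    {"name": "Joe's Pizza", "language": "en", "location": "US",
     "category": "restaurant", "dns": ["joespizza.example"]},
    {"name": "Joe's Auto Repair", "language": "en", "location": "GB",
     "category": "auto-repair", "dns": ["joes-auto.example"]},
]


def search(**facets):
    """Return every entry whose facets equal all of the supplied values."""
    return [
        entry for entry in ENTRIES
        if all(entry.get(facet) == value for facet, value in facets.items())
    ]
```

A query on the name-string alone returns two entries here; adding a
location facet narrows the result to one, whose DNS name the
application can then look up at layer one.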

Still another possibility would be to create a URI for DNS record
information and use it to abstract this return information into
something applications can then specify or decode as appropriate.
Use of this would need to be carefully structured to avoid complex
problems, but might be a reasonable approach.

Experience with the DNS and other distributed databases also argues
persuasively that these records are not forever.  Unless there are no
local copying and caching mechanisms (which seems unlikely and hard
to enforce), some type of time to live (TTL) or other expiration or
reverification mechanism will be needed.
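
A client-side cache that honors such an expiration might be sketched
as below.  The class and method names are assumptions, and the
injectable clock exists only so the behavior can be exercised without
waiting.

```python
import time


class ExpiringCache:
    """Local copy of search results that must be re-verified after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # query -> (expiry time, results)

    def put(self, query, results, ttl_seconds):
        """Store results along with the time at which they expire."""
        self._items[query] = (self._clock() + ttl_seconds, results)

    def get(self, query):
        """Return cached results, or None if absent or expired."""
        item = self._items.get(query)
        if item is None:
            return None
        expires_at, results = item
        if self._clock() >= expires_at:
            # Expired: discard so the caller re-queries the search layer.
            del self._items[query]
            return None
        return results
```

A None result tells the caller to repeat the query against the search
layer rather than trust a stale local copy.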


2.3.  Search Layer Three: locality and/or content-domain-specific
lookup mechanisms.

The problem with the second-search-layer model is that there are a
number of usability and marketplace pressures for naming systems that
offer finer granularity and better match user needs.  Interestingly,
those systems which have been included in experiments or partially
deployed (see, e.g., [RFC2345], [Netword], and [RealNames]) have
demonstrated that these systems require contextual localization, not
a single global environment.  There are many causes for this, but
the need for very specific searches that are geographic-area,
topic-area, or language or culturally specific tends to dominate the
list.

The issue is perhaps illustrated by an example.  Suppose the
granularity of an entry at the second level is

  {"Joe's", "UK", Restaurant,... }

Now, I might want to create a business around a restaurant directory
for Bristol.  I would probably want to construct a database that
contained exact locations, type of food, menu information, prices,
etc., and permit people to query it that way.  That type of product
bears a strong relationship to traditional yellow pages services: the
right attributes to collect and the right way to organize them will
differ by topic (e.g., "menu" has no obvious analogy in an automobile
repair shop) and the business models are fairly established.

One can imagine many different types of keyword and (yellow
pages-like) directory services at this level, using different types
of protocol mechanisms as well as different types of database content
and schema.  But those services are nearly ideal candidates for
competition: there is no requirement that either the providers or the
services be global or unique or even highly standardized.  Having all
three search layers bound to the same data sources --inheriting
values from them if one wants to think about it that way-- would
provide a degree of consistency that might be very attractive to
users, so there are clearly issues here that will need to be worked
out in the marketplace.

Inputs at the third search layer will differ by service: one can
imagine free-text interfaces and menus (but see section 2.4) as well
as systems that more closely resemble faceted search terms.  Outputs
will normally be search-layer-two names or strings to preserve name
and reference portability, or might be URIs containing such names.

Summary: Just as the monohierarchical identifier-lookup system at the
first (bottom, DNS) level should be supplemented by a multilingual,
multifaceted, multihierarchy search system at the second, that second
level system should be supplemented by a collection of localized,
subject- and topic- specific systems at the third.  These third-level
systems need not be centrally coordinated in any way, although some
similarity of function and interface would almost certainly make them
more consistent for users and easier to market.

2.4. A search layer above the third: free-text searching applications.

The approaches described above omit one set of techniques used today:
"web searches" on full text or its equivalent.  These systems have an
important role (and, similar to the third level, there seems no
particular advantage to trying to standardize them worldwide).  But
their disadvantage, if seen as a DNS surrogate or replacement, is
that they have difficulty distinguishing between the name of
something, a pointer to it, and a reference or discussion of it or
how it works.

If, for example, one is looking for a web site for a company, the
third level would presumably find that site.  The second (or even the
DNS) might find it with some guessing, but this fourth level would
(as web search engines do today) probably not distinguish the
company's site from sites that reference the company or its products.

Search layer three produces information that is explicitly bound to
the query, i.e., what one is looking for, while a search engine
returns values that also include sites where the subject of the query
might have been mentioned.


3 Context and directions

3.1 The data search and access model

It is interesting that recent IETF "directory" work has focused on
accessing mechanisms without worrying intensely about the underlying
database content, maintenance, and update issues.  Those issues seem
to be the harder ones, i.e., the difference between LDAP and CNRP may
make less difference than how we structure, maintain, and distribute
the relevant data.

Of course, that does not suggest that the work is not important or
that it isn't required.  And, to deploy the model suggested above, we
will need to deal with a pair of uncomfortable problems:

     * CNRP looks interesting, but has not been widely implemented or
         deployed in production.

     * LDAP is widely deployed, but primarily in implementations that
         contain sufficient extensions and special features to be
         non-interoperable.

If we are going to choose -- and search layer two certainly implies a
choice -- we need to figure out how to do that.



3.2 Uniqueness of name structures at the second search layer.

There are cases to be made both for and against uniqueness of names
(more precisely, of the combination of the name-string facet and all
of the other facets) at this sublayer, and even a partial middle
ground, in which names are unique within a registry namespace, but
there are mechanisms for identifying such spaces so that the names
are unique across the Internet.  The community should address the
tradeoffs because no position is ideal; summaries of the extreme
positions are below.  In none of these cases is it necessary, or even
desirable, that the name-string itself (without the additional facet
values) be unique.

3.2.1 The case for unique names

The IAB's discussion of DNS root uniqueness [RFC2826] argues that DNS
names must be unique, i.e., that there must not be alternate or
surrogate root structures if the Internet is to survive as a seamless
whole and be universally addressable and accessible.  Even with
imprecise matching, similar arguments may apply at level two,
especially if this is the first level at which names in natural
languages (hence including multilingual names), rather than
constrained identifiers, appear.  The mathematical arguments aside,
the main argument for uniqueness is that a given combination of
name-string and facets will yield exactly one logical host (or
equivalent).  If this is not the case, it seems inevitable that users
will be faced with choices they need to resolve even when they have
an exact match for a full set of facets.

Because the name structures at the second level still must be unique,
some mechanism for registries or structuring of names will be
necessary to avoid conflicts.  The problem is somewhat easier than
the ones encountered by ICANN and its associated groups because the
very structuring of the names and attributes creates opportunities
for dividing up responsibilities, but the registration problems exist
nonetheless and will need to be resolved.

3.2.2 Non-unique names

Conversely, one could have multiple appearances of the same set of
facets (including the name-string), such that an exact match could
still yield multiple "hits".  This would have the advantage of
eliminating all requirements for monopoly registries or [other]
technical mechanisms for guaranteeing that name conflicts did not
occur.  The disadvantage is that it would force more user choices or
heuristics, and at least some errors in which the wrong host or site
was identified would be almost inevitable.  If it turned out that
most user queries occurred at sublayer three or four, rather than
directly at this sublayer, that issue might not be significant.

3.2.3 The middle ground

A proposal has been made in the initial version of [Mealing-SLS] that
an additional facet could be added to represent the registry which
records the names.  If this were done, names could be kept unique
within registries and would be globally unique as long as the
registry-identifying facet had a unique value for each registry.
There would be no need to restrict the number of registries in this
model or resolve naming disputes among them -- each one could have a
unique, randomly-generated and assigned identifier -- so the approach
could provide some degree of technical uniqueness while still
preserving most of the benefits of the non-unique approach.

This model could, of course, be deployed at a "registrar" level
instead, just by changing the assignment of the identifier facet from
value-per-registry to value-per-registrar.   Other variations are, of
course, possible.
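
The middle-ground approach can be sketched in a few lines.  In the
example below, which is an assumption-laden illustration rather than
anything taken from [Mealing-SLS], each registry draws a random
identifier, enforces uniqueness of the facet tuple locally, and
prepends its identifier to form a globally distinct key.

```python
import uuid


class Registry:
    """Keeps facet tuples unique within one registry."""

    def __init__(self):
        # Randomly generated; no coordination with other registries needed.
        self.registry_id = uuid.uuid4().hex
        self._entries = {}  # facet tuple -> DNS name

    def register(self, name_string, language, location, category, dns_name):
        """Record a name and return its globally distinct key."""
        key = (name_string, language, location, category)
        if key in self._entries:
            raise ValueError("facet tuple already registered in this registry")
        self._entries[key] = dns_name
        # The registry facet is prepended to the locally unique tuple.
        return (self.registry_id,) + key
```

Two registries can each accept the same name-string and facet values
without conflict, since the resulting keys differ in the registry
facet; a second identical registration within one registry is
refused.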

3.3 Deployment against the existing DNS base

As with the "new class" approach to DNS changes [NEWCLASS], the
approach outlined here does not require any changes to the existing
installed DNS base.  But, like all solutions to the multilingual name
issues, it requires changes to all relevant applications.  The notion
of moving from lookup to searching does imply that we will need, not
merely to change the code that calls the name resolution system, but
to rethink the UIs of those applications.

3.4 Older applications

To fully realize the benefits of internationalized naming requires
changing all applications to understand the new method, whatever it
is.  Even the "internationalize the DNS" proposals are subject to
this principle.  Older applications will see distorted and unfriendly
names under some systems, and no names at all under others (some
approaches might cause some applications implementations to fail
entirely).

The environment contemplated here is a "no international names in old
applications", i.e., "no new names without upgrading", one --
applications that have not been upgraded will not see
internationalized names or other natural-language phrases, nor coded
surrogates for them.

The advantages of a "no names without upgrading" approach are that it
avoids confusion and the risk, however slight, of catastrophe.  As
with the original host table to DNS conversion, it provides an
incentive to convert old applications to make newer naming styles,
and newer names, visible.  None of these transitions are ever easy,
but it may be worth going through this one to get things right,
rather than investing a large fraction of the pain to get a solution
that doesn't quite do the job.


3.5 Why not just a keyword system

As suggested above, the term "keyword system" is used to refer to
many different things.  Many would fit nicely into the third sublayer
environment, but most of the existing proposals put them directly on
top of the DNS, or skip the DNS entirely and go directly to IP
addresses.  The difficulty with these systems is that they either
must be localized (e.g., a different system or database for each
language, country, or smaller locality) or they don't scale well.  In
particular, they eventually either suffer from the "all the good
names are taken" problem (of which the DNS is frequently accused) or
are very vulnerable to poor retrieval precision as the number of
names (or keyword combinations) in the name space grows large.


4  Summary

The solution to the "multilingual DNS" problem, and to a series of
other limitations of the DNS relative to today's expectations for
naming and searching, lies in solutions targeted to those problems,
rather than superimposing additional mechanisms on the DNS in ways
that, those who advocate them hope, will not cause problems with
older programs and unconverted infrastructure.  Inserting new search
layers avoids those risks and permits a clean solution that is
adapted to the problems, rather than the limitations imposed by
existing properties of the DNS.


5 IANA Considerations and related topics

At search layer two, it is difficult to think about how the system
might function successfully without controlled vocabularies for each
of the non-name facets.  As discussed in section 2.2, we have already
established one such registry (bound to an ISO standard), and
mechanisms for utilizing it, with RFC 3066.  The Madrid agreement and
its predecessors [MADRID, NICE] provide classifications for types of
businesses, but we would need to extend the registry for names that
are not business-related.  The two locational attributes are somewhat
vague at this point, but controlled vocabularies would presumably be
needed, and should, if possible, be drawn from stable, non-IETF, work
(e.g., IS 3166-1 and 3166-2 might provide a foundation, and possibly
a complete list, for the location vocabulary).  Curiously, there is
no technical reason why the name-strings themselves must be unique:
that is one of the attractions of a model like this over attempting
to overload the DNS.  If conflicts or confusion occur, those are
standard civil (marketplace or trademark) issues that can be resolved
in their own environments, rather than posing special Internet
problems.


6 Security Considerations

Additional layers of naming, searching, and databases imply
additional opportunities for compromising those databases and
mechanisms.  Part of the challenge with the model implied here is to
determine how to secure and authenticate those databases and access
(especially modify access) to them.  The good news is that, since the
functions are new, we should be able to design security mechanisms
in, rather than -- as with the DNS -- have to try to graft them on to
a structure not designed for them.

7 References

[Mealing-SLS]  Mealling, M and L Daigle, "Service Lookup System (SLS)",
work in progress, draft-mealling-sls-00.txt.

[MADRID]

[NAMEPREP] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Host Names", work in progress,
draft-ietf-idn-nameprep-04.txt

[NICE] World Intellectual Property Organization, "Nice Agreement
concerning the International Classification of Goods and Services for
the Purposes of the Registration of Marks", June 1957.


[Netword] http://corp.netword.com/ -- real reference needed.

[NEWCLASS] Klensin, John, "Internationalizing the DNS -- A New
Class", work in progress, draft-klensin-i18n-newclass-...

[RealNames] http://www.realnames.com/ -- real reference needed.

[RFC882] Mockapetris, P.V., "Domain names: Concepts and facilities".
RFC 882.  Nov-01-1983.

[RFC883] Mockapetris, P.V. "Domain names: Implementation
specification", RFC 883. Nov-01-1983.

[RFC1035] Mockapetris, P.V. "Domain names - implementation and
specification", RFC 1035. Nov-01-1987.

[RFC2345] Klensin, J, T. Wolf, G.  Oglesby. "Domain Names and Company
Name Retrieval", RFC 2345. May 1998.

[RFC2822] Resnick, P., Editor. "Internet Message Format", RFC 2822.
April 2001.

[RFC2825] IAB, L. Daigle, ed. "A Tangled Web: Issues of I18N, Domain
Names, and the Other Internet protocols", RFC 2825.  May 2000.

[RFC2826] IAB. "IAB Technical Comment on the Unique DNS Root", RFC
2826.  May 2000.

[RFC3066] Alvestrand, H. "Tags for the Identification of Languages",
RFC 3066. January 2001.

[THES] IS 5127-1, IS 5127-2.

[WAIS] M. St. Pierre, J. Fullton, K. Gamiel, J.  Goldman, B.
Kahle, J. Kunze, H. Morris, F. Schiettecatte.  "WAIS over
Z39.50-1988", RFC 1625.  June 1994.

[Z39] Z39.50, IS 23950.

8 Acknowledgements

This document, and the related notes, are the result of thinking that
has come together and evolved since before the issue of
internationalized access to domain names came onto the IETF's radar.
Discussions with a number of people have led to refinements in the
approach or the text, even though some of them might not recognize
their contributions or agree with the conclusions I have drawn from
them (indeed, some of those discussions were rooted in challenges to
the general ideas expressed here).  Particularly important
suggestions have come from, or arisen out of conversations with,
Harald Alvestrand, Rob Austein, Fred Baker, Eric Brunner-Williams,
Randy Bush, Vint Cerf, Kilnam Chon, Dave Crocker, Leslie Daigle,
Patrik Fältström, Michael Froomkin, Francis Gurry, Marti Hearst, Paul
Hoffman, Kenny Huang, Mao Wei, Michael Mealling, Gary Oglesby, Qian
Huilin, James Seng, Theresa Swinehart, Tan Tin Wee, Len Tower, and
Zita Wenzel as well as some memorable long-ago conversations with Jon
Postel and J.C.R. Licklider.

The current draft has been prepared to meet the submission deadline
before IETF-51.  Some comments that the author already has in hand
have consequently been omitted and will be included in future
versions.


9 Author's Address

John C Klensin
AT&T Labs
99 Bedford St, 4th floor
Boston, MA 02111 USA
+1 617 574 3076
klensin@att.com
