   There has been much discussion and several documents written about
   the need for an Internet Directory.  Recently, this discussion has
   focussed on ways to discover an organization's domain name without
   relying on use of DNS as a directory service.  This draft discusses
   lessons that were learned during InterNIC Directory and Database
   Services' development and operation of WWWSeeker, an application that
   finds a web site given information about the name and location of an
   organization.  The back end database that drives this application was
   built from information obtained from domain registries via WHOIS and
   other protocols.  We present this information to help future
   implementors to avoid some of the blind alleys that we have already
   explored.  This work builds on the Netfind system that was created by
   Mike Schwartz and his team at the University of Colorado at Boulder.

1. Introduction

   Over time, there have been several RFCs [2, 3, 4] about approaches
   for providing Internet Directories.  Many of the earlier documents
   discussed white pages directories that supply mappings from a
   person's name to their telephone number, email address, etc.

   More recently, there has been discussion of directories that map from
   a company name to a domain name or web site [5].  Many people are
   using DNS as a directory today to find this type of information about
   a given company.  Typically when DNS is used, users guess the domain
   name of the company they are looking for and then prepend "www.".
   This makes it highly desirable for a company to have an easily
   guessable name.

   There are two major problems here.  As the number of assigned names
   increases, it becomes more difficult to get an easily guessable name.
   Also, the TLD must be guessed as well as the name.  While many users
   just guess ".COM" as the "default" TLD today, there are many two-
   letter country code top-level domains in current use as well as other
   gTLDs (.NET, .ORG, and possibly .EDU) with the prospect of additional
   gTLDs soon.  As the number of TLDs in general use continues to
   increase, guessing gets more difficult every day.

   Between July 1996 and our shutdown in March 1998, the InterNIC
   Directory and Database Services project maintained the Netfind search
   engine [1] and the associated database that maps organization
   information to domain names and thus acts as the type of Internet
   directory that associates company names with domain names.  We also
   built WWWSeeker, a system that used the Netfind database to find web
   sites associated with a given organization.  The experience gained
   from maintaining and growing this database provides valuable insight
   into the issues of providing a directory service.  We present it here
   to allow future implementors to avoid some of the blind alleys that
   we have already explored.

2. Directory Population

2.1. Using WHOIS to Populate the Directory

   One proposal for populating a directory is to use WHOIS to gather
   information about the organization that owns a domain.  At the
   conclusion of the InterNIC Directory and Database Services project,
   our backend database contained about 2.9 million records with
   data that could be retrieved via WHOIS.  The entire database
   contained 3.25 million records, with the additional records coming
   from sources other than WHOIS.

   In our experience this information contains a significant number of
   factual and typographical errors and requires further examination and
   processing to improve its quality.  Also, those TLDs that have
   registrars that support WHOIS typically only support WHOIS
   information for second-level domains, as opposed to lower-level
   domains.  Further, there are TLDs without registrars, TLDs without
   WHOIS support, and still other TLDs that use other methods (HTTP,
   FTP, gopher) for providing organizational information.  Based on our
   experience, an implementor of an Internet directory needs to support
   multiple protocols for directory population.
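
   As an illustration of what supporting multiple protocols might look
   like, the following minimal sketch keeps a table of per-protocol
   fetchers and normalizes whatever text they return.  It is not the
   code used by the project.  The names FETCHERS, whois_query,
   parse_fields and fetch_record are hypothetical, and the "Key: value"
   parsing is an assumption, since WHOIS responses have no standard
   format and differ between registries.

```python
import socket


def whois_query(server, domain, timeout=10.0):
    """Send one WHOIS query: TCP port 43, query line terminated by CRLF."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1", errors="replace")


def parse_fields(response):
    """Collect 'Key: value' lines into a dict of lists (assumed layout)."""
    fields = {}
    for line in response.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and lines without a key separator.
        if not line or line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            fields.setdefault(key.strip().lower(), []).append(value)
    return fields


# Hypothetical protocol table; HTTP, FTP or gopher fetchers with the same
# (server, domain) -> text signature would be registered here as well.
FETCHERS = {"whois": whois_query}


def fetch_record(protocol, server, domain):
    """Fetch raw text with the fetcher for `protocol` and normalize it."""
    try:
        fetcher = FETCHERS[protocol]
    except KeyError:
        raise ValueError("no fetcher registered for " + repr(protocol))
    return parse_fields(fetcher(server, domain))
```

   Keeping the fetchers behind one table means a registry that only
   publishes data over HTTP or FTP can be added without touching the
   parsing or storage steps.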

2.2. Using "Tree Walks" to Populate the Directory

   Another proposal is to use a variant of a "Tree Walk" to determine
   the domains that need to be added to the directory.  Our experience
   is that this is neither a reasonable nor an efficient proposal for
   maintaining such a directory.  Except for some infrequent and long-
   standing DNS surveys [6], DNS "tree walks" tend to be discouraged by
   the Internet community, especially given that the frequency of DNS
   changes would require a new tree walk monthly.  Also, our experience
   has shown that data on allocated DNS domains can usually be retrieved
   via other faster and more efficient methods (FTP, HTTP, etc.).

   Since existing domains in the database may be verified via direct DNS
   lookups rather than a "tree walk," "tree walks" should be the choice
   of last resort for directory population.
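
   A direct lookup of this kind can be sketched as follows.  This is a
   rough illustration only: the function names are hypothetical, and
   resolving the bare name or its "www." form is used as an assumed
   proxy for "the domain still exists".  A production check would
   more likely query NS or SOA records, which the Python standard
   library does not expose.

```python
import socket


def domain_resolves(domain, resolver=socket.getaddrinfo):
    """Return True if the name, or 'www.' + name, resolves to an address."""
    for name in (domain, "www." + domain):
        try:
            resolver(name, None)
        except (socket.gaierror, UnicodeError):
            continue  # this candidate failed; try the next one
        return True
    return False


def stale_domains(domains, resolver=socket.getaddrinfo):
    """List the domains already in the database that no longer resolve."""
    return [d for d in domains if not domain_resolves(d, resolver)]
```

   The resolver argument exists only so the check can be exercised
   without network access; by default the system resolver is used.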

3. Directory Updating: Full Rebuilds vs Incremental Updates

   Given the size of our database in April 1998 when it was last
   generated, a complete rebuild of the database that is available from
   WHOIS lookups would require between 11.6 million and 14.5 million
   seconds of time.  This estimate does not include other considerations
   that would increase the amount of time to rebuild the entire database.

   Whether this is feasible depends on the frequency of database updates
   provided.  Because of the rate of growth of allocated domain names
   (150K-200K new allocated domains per month), we provided monthly
   updates of the database.  To rebuild the database each month would
   require between 3 and 5 machines to be dedicated full time to the
   task.  Instead, we checkpointed the allocated domain list and rebuilt
   on an incremental basis during one weekend of the month.  This
   allowed us to complete the update on between 1 and 4 machines without
   full dedication over a couple of days.  Further, by coupling
   incremental updates with periodic refresh of existing data (which can
   be done during another part of the month, and doesn't require full
   dedication of machine hardware), older records would be periodically
   updated when the underlying information changes.  The tradeoff is
   updated when the underlying information changes.  The tradeoff is
   timeliness and accuracy of data (some data in the database may be
   old) against hardware and processing costs.
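
   The scale of the difference can be seen from a rough calculation.
   The sketch below assumes that the rebuild estimate applies to the
   roughly 2.9 million WHOIS-derived records mentioned in Section 2.1,
   which implies a per-record cost that this document does not state
   directly.  It also counts only newly allocated domains and ignores
   the periodic refresh of existing records.

```python
WHOIS_RECORDS = 2_900_000                        # assumed from Section 2.1
FULL_REBUILD_SECONDS = (11_600_000, 14_500_000)  # low and high estimates
NEW_DOMAINS_PER_MONTH = (150_000, 200_000)       # low and high growth
SECONDS_PER_DAY = 86_400


def seconds_per_record():
    """Implied cost of one lookup (low, high), derived, not measured."""
    return tuple(total / WHOIS_RECORDS for total in FULL_REBUILD_SECONDS)


def incremental_seconds():
    """Monthly cost (low, high) of processing only the new domains."""
    low_cost, high_cost = seconds_per_record()
    low_new, high_new = NEW_DOMAINS_PER_MONTH
    return (low_new * low_cost, high_new * high_cost)


def to_machine_days(seconds):
    """Convert machine-seconds to machine-days, rounded to one decimal."""
    return round(seconds / SECONDS_PER_DAY, 1)


if __name__ == "__main__":
    print(seconds_per_record())                              # (4.0, 5.0)
    print(incremental_seconds())                             # (600000.0, 1000000.0)
    print([to_machine_days(s) for s in incremental_seconds()])  # [6.9, 11.6]
```

   Under these assumptions an incremental update costs well under a
   tenth of a full rebuild, which is consistent with completing it on
   a few machines over a weekend.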

4. Directory Presentation: Distributed vs Monolithic

   While a distributed directory is a desirable goal, we maintained our
   database as a monolithic structure.  Given past growth, it is not
   clear at what point migrating to a distributed directory becomes
   actually necessary to support customer queries.  Our last database
   contained over 3.25 million records in a flat ASCII file.  Searching
   was done via a PERL script over an inverted tree (also produced by a
   PERL script).  While admittedly primitive, this configuration
   supported over 200,000 database queries per month from our production
   servers.

   Increasing the database size only requires more disk space to hold
   the database and inverted tree.  Of course, using database technology
   would probably improve performance and scalability, but we had not
   reached the point where this technology was required.
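
   To give a feel for this style of search, the following minimal
   sketch builds an inverted structure over flat text records and
   answers a query by intersecting token matches.  It is not the
   project's PERL code: a plain inverted index (token to record
   positions) stands in for the inverted tree, whose actual layout is
   not described here, and the "|"-separated record format is a
   hypothetical example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of positions of records containing it."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for token in tokenize(record):
            index[token].add(position)
    return index


def search(index, records, query):
    """Return the records that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # A record must appear in the posting set of every query token.
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [records[i] for i in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical records: domain | organization | location
    records = [
        "example.com|Example Widgets Inc|Boulder Colorado US",
        "example.org|Example Society|Denver Colorado US",
    ]
    index = build_index(records)
    print(search(index, records, "widgets colorado"))
```

   With this layout, a larger database grows the record file and the
   index but leaves the lookup logic unchanged.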

5. Acknowledgments

   The work described in this document was partially supported by the
   National Science Foundation under Cooperative Agreement NCR-9218179.

