Network Working Group                                         Ryan Moats
INTERNET DRAFT                                                Rick Huber
Category: Informational                                             AT&T
Expires: October 1998                                         April 1998
Directories and DNS: Experiences from Netfind
<draft-rfced-info-moats-00.txt>
Status of This Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
To view the entire list of current Internet-Drafts, please check
the "1id-abstracts.txt" listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
(Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
(Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu
(US West Coast).
Distribution of this document is unlimited.
Abstract
There have been several Internet-Drafts and RFCs written about the
need for Internet Directories. This draft discusses lessons learned
during the InterNIC Directory and Database Services project's
custodianship of the Netfind search engine and database, lessons that
have direct implications for providing and maintaining the mappings
between domain names and company information that are essential for
an Internet Directory. This work builds on that of Mike Schwartz and
his team at the University of Colorado at Boulder [1].
1. Introduction
There have been several Internet-Drafts [2, 3] and RFCs [4, 5, 6]
written about approaches for providing Internet Directories. Many of
the earlier documents discussed white pages directories that supply
mappings from a person's name to their telephone number, email
address, etc. More recently, there has been discussion of
directories that map from a company name to a domain name or web
site.
From July 1996 until our shutdown in March 1998, the InterNIC
Directory and Database Services project maintained the Netfind search
engine [1] and the associated "Seed Database" that maps organization
information to domain names and thus acts as the type of Internet
directory that associates company names with domain names. The
experience gained from maintaining and growing this database has
provided valuable insight into the issues of providing a directory
service.
Many people are using DNS as a directory today to find information
about a given company. Typically when DNS is used, users guess the
domain name of the company they are looking for and then prepend
"www.". This makes it highly desirable for a company to have an
easily guessable name.
There are two major difficulties here. As the number of assigned
names increases, it becomes more difficult to get an easily guessable
name. Also, the TLD must be guessed as well as the name. While many
users just guess ".COM" today, there are many two-letter country code
top-level domains in current use as well as other gTLDs (.NET, .ORG,
and possibly .EDU in addition to .COM) with the prospect of
additional gTLDs in the near future. Since both of these problems
are or will soon be present in DNS, guessing is getting more
difficult every day.
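The guessing behavior described above can be sketched as follows: a
user effectively enumerates candidate names across a growing set of
TLDs. This is purely illustrative; the company and domain names are
hypothetical examples, not real registrations.

```python
def candidate_domains(company, tlds=(".com", ".net", ".org")):
    """Return the www-prefixed domain names a user might try, one per TLD."""
    # Collapse the company name to a single lowercase label, as users
    # typically do when guessing ("Example Widgets" -> "examplewidgets").
    label = "".join(ch for ch in company.lower() if ch.isalnum())
    return ["www." + label + tld for tld in tlds]

print(candidate_domains("Example Widgets"))
# -> ['www.examplewidgets.com', 'www.examplewidgets.net', 'www.examplewidgets.org']
```

As the pool of TLDs grows, the candidate list grows with it, which is
exactly why guessing becomes less reliable over time.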
2. Building a Directory
We are dealing here with directories whose goal is to map company
names to domain names. The reverse lookup (domain name to "owning"
company) can be done using WHOIS or similar tools (for TLDs where
such tools are supported). A database that contains the mapping we
want can be built from the WHOIS data, but we must first get data on
what DNS names exist.
There are three issues that must be addressed:
- Finding new domain names for directory updates (and finding all
domain names for the initial directory build).
- Finding the company name associated with each domain name.
- Determining when the data associated with an existing domain name
has changed.
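The three tasks above can be sketched as a single update loop over an
in-memory directory keyed by domain name. The record layout and
function names here are our own illustration, not Netfind's actual
code.

```python
def update_directory(directory, allocated, lookup_org):
    """Bring a directory up to date against a list of allocated domains.

    directory:  dict mapping domain name -> organization name
    allocated:  iterable of currently allocated domain names
    lookup_org: callable domain -> organization string (e.g. via WHOIS)
    """
    for domain in allocated:
        org = lookup_org(domain)
        if domain not in directory:
            directory[domain] = org        # task 1: new domain found
        elif directory[domain] != org:
            directory[domain] = org        # task 3: existing data changed
    return directory

d = update_directory({}, ["example.com"], lambda dom: "Example, Inc.")
print(d)  # -> {'example.com': 'Example, Inc.'}
```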
3. Finding New Domain Names
One proposal to determine domain name existence is to use a variant
of a "Tree Walk" to determine the domains that need to be added to
the directory. Our experience with the Netfind database is that this
is neither a reasonable nor an efficient mechanism for maintaining
such a directory. DNS "Tree Walks" tend to be discouraged (as they
should) by the Internet community for both security and load reasons.
In addition, our experience has shown that data on allocated DNS
domains can often be retrieved via other methods (FTP, HTTP, etc.).
Therefore, to find new domain names, FTP or HTTP should be used to
download lists of allocated domains, and DNS "Tree Walks" should be
used only as a last resort.
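The preferred path can be sketched as fetching a registry's published
list of allocated domains and diffing it against the domains already
known to the directory. The URL and one-domain-per-line format below
are placeholders; each registry publishes such lists in its own way.

```python
import urllib.request

def fetch_domain_list(url):
    """Download a one-domain-per-line list and return it as a set."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def new_domains(allocated, known):
    """Domains in the registry list but not yet in the directory."""
    return sorted(allocated - known)

# Offline example of the diff step:
print(new_domains({"a.com", "b.com"}, {"a.com"}))  # -> ['b.com']
```

Only the newly appearing names then need company-information lookups,
which is what makes this far cheaper than walking the DNS tree.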
4. Associating Company Information with a Domain Name
WHOIS appears to be the logical starting point for information
relating company names to domain names, and several of the directory
proposals [2,3] discuss using WHOIS for this purpose. As of the
March 1998 release, the Netfind seed database had approximately 2.7
million records that contained data retrievable by WHOIS.
This constituted 82.8% of the entire Netfind database, but our
experience has shown that this information contains a number of
factual and typographical errors. Further, TLDs whose registrars
support WHOIS typically provide WHOIS information only for
second-level domains, not for lower-level domains. There also
remains the other 17.2%: TLDs without
registrars, TLDs without WHOIS support, and TLDs that use tools other
than WHOIS (HTTP, FTP, gopher) for providing organizational
information. In summary, using WHOIS alone is not sufficient to
populate an Internet Directory.
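A WHOIS lookup of the kind described above can be sketched as a plain
TCP exchange on port 43 (per RFC 954) followed by a scan of the reply
for an organization line. This is an illustrative reconstruction, not
Netfind's code: the default server is a period-appropriate example,
and the field names checked below are only common variants; the lack
of a uniform reply format is one source of the errors noted above.

```python
import socket

def whois_org(domain, server="whois.internic.net"):
    """Query a WHOIS server for a domain and extract an organization name."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(domain.encode() + b"\r\n")
        chunks = []
        while data := sock.recv(4096):   # read until the server closes
            chunks.append(data)
    return parse_org(b"".join(chunks).decode(errors="replace"))

def parse_org(text):
    """Scan a WHOIS reply for a line naming the organization."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ("orgname", "organization", "org"):
            return value.strip()
    return None  # reply used some other format

print(parse_org("Domain: example.com\nOrgName: Example, Inc.\n"))
# -> Example, Inc.
```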
5. Keeping Data Current
Given the current size of the Netfind database and a reasonable
processor, it requires somewhere between 7.2 million and 9.0 million
seconds of CPU time to rebuild the entire portion of the Netfind
database that is available from WHOIS lookups. This is roughly
83-104 CPU days if no parallel processing is done. Note that this
estimate does not include other considerations that would increase
the amount of time to rebuild the database.
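The conversion behind that estimate is simple arithmetic, made
explicit here: 7.2-9.0 million CPU-seconds at 86,400 seconds per day.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

low_days = 7_200_000 / SECONDS_PER_DAY    # lower bound of the measurement
high_days = 9_000_000 / SECONDS_PER_DAY   # upper bound

print(round(low_days), round(high_days))  # -> 83 104
```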
During our maintenance of the Netfind database, we provided monthly
updates; a full database rebuild every month would have required
between 3 and 5 machines dedicated full time. Such a
dedication was unreasonable, given that the set of allocated domains
changes currently by around 150,000 new allocated domains per month.
Checkpointing the allocated domain list and rebuilding during a
single weekend of the month would have confined the work to that
weekend, but raised the requirement to between 40 and 60 machines for
a full update.
A more reasonable approach was to do incremental updates of the
directory. Such an approach allowed incremental updates to be
handled on a monthly basis using a reasonable number (between 1 and
4) of machines. Coupling such an approach with a periodic refresh of
already allocated domains allowed for older records to be updated
when underlying information changes. Note that the periodic refresh
was not triggered by any event; rather, it was a scheduled procedure.
When using an incremental approach, it was necessary to verify the
information for domains already in the database. This was
done by direct DNS lookups to verify the existence of the domain name
in question and then using WHOIS lookups to determine if that
information had changed. This was done on a rotating basis so that
acceptable performance was maintained. In practice, we did a 100%
check by direct DNS lookup and checked about 10% of the names in
WHOIS each month.
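The rotation described above can be sketched as follows: every month
all domains get the cheap DNS existence check, while a rotating 10%
slice gets the costlier WHOIS re-check, so a full WHOIS pass
completes roughly every ten months. The slicing scheme below is our
own illustration of the idea.

```python
def whois_slice(domains, month, fraction=10):
    """Return the 1/fraction of domains re-checked via WHOIS this month.

    month is a running counter (0, 1, 2, ...); over `fraction`
    consecutive months every domain is selected exactly once.
    """
    return [d for i, d in enumerate(sorted(domains))
            if i % fraction == month % fraction]

doms = ["d%02d.com" % i for i in range(20)]
print(whois_slice(doms, month=0))  # -> ['d00.com', 'd10.com']
```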
6. Distributed vs. Monolithic
While a distributed directory is a desirable goal, the March 1998
Netfind database was monolithic in nature. Given past growth, it is
not clear at what point migrating to a distributed directory becomes
actually necessary to support customer queries. The final Netfind
database holds approximately 3.26 million records in a flat ASCII
file. Searching is done via a Perl script and an inverted tree.
While admittedly primitive, this configuration supported over 70,000
queries per month (with a peak level of 200,000 in one month) from
our production servers. Increasing the database size only requires
more disk space to hold the database and inverted tree. Of course,
using actual database technology would probably improve performance
and scalability, but such technology has not yet been required.
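The flat-file-plus-inverted-index arrangement can be sketched as
below: one record per line, with an index mapping each word of the
organization name to the records containing it. This mirrors the
idea, not Netfind's actual file layout or its Perl implementation.

```python
def build_index(records):
    """records: list of (domain, org) tuples -> {word: set of row ids}."""
    index = {}
    for row, (domain, org) in enumerate(records):
        for word in org.lower().split():
            index.setdefault(word, set()).add(row)
    return index

def search(records, index, word):
    """Return all records whose organization name contains the word."""
    return [records[row] for row in sorted(index.get(word.lower(), ()))]

recs = [("example.com", "Example Widgets"), ("acme.org", "Acme Widgets")]
idx = build_index(recs)
print(search(recs, idx, "widgets"))
# -> [('example.com', 'Example Widgets'), ('acme.org', 'Acme Widgets')]
```

Growing the database only grows the file and the index, which is why
more disk space was the main scaling cost.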
7. Other Directory Considerations
Availability goals can be met by having multiple copies of the
database in place. InterNIC Directory and Database Services
maintained 3 production copies of the Netfind database, and there are
about a dozen others maintained by other organizations throughout the
world. This ensures that users almost always have access to the
database. At the InterNIC Directory and Database services sites,
service downtime for database update was avoided by doing updates in
series; only one server was being updated at any given time.
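The serial-update scheme can be sketched as taking replicas out of
rotation one at a time, so that with n copies at least n-1 are always
serving queries. The server names below are illustrative.

```python
def update_in_series(servers, apply_update):
    """Update each replica in turn; at most one is ever out of service."""
    for server in servers:
        in_service = [s for s in servers if s != server]
        assert len(in_service) == len(servers) - 1  # the rest stay up
        apply_update(server)  # rebuild this replica, then move on

log = []
update_in_series(["db1", "db2", "db3"], log.append)
print(log)  # -> ['db1', 'db2', 'db3']
```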
8. Security Considerations
This document specifies methods of collecting and accessing data that
is already freely accessible to anyone on the Internet. Such
gathering will make access to this data easier, and may increase
opportunities for abuse.
9. Acknowledgments
The work described in this document was partially supported by the
National Science Foundation through Cooperative Agreement NCR-
9218179.
10. References
Requests For Comments (RFCs) are available from
<URL:ftp://venera.isi.edu/in-notes> and Internet-Drafts are available
from <URL:ftp://ftp.ietf.org/internet-drafts>. Both are also
available from numerous mirror sites.
[1] M. F. Schwartz, C. Pu, "Applying an Information Gathering
    Architecture to Netfind: A White Pages Tool for a Changing and
    Growing Internet", University of Colorado Technical Report
    CU-CS-656-93, December 1993, revised July 1994.
    <URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/
    schwartz/Netfind.Gathering.txt.Z>
[2] G. Mansfield, et al., "A Directory for Organizations and
    Services from DNS and WHOIS", Internet Draft (work in
    progress), November 1997.
[3] J. Klensin, T. Wolf, Jr., "Domain Names and Company Name
    Retrieval", Internet Draft (work in progress), July 1997.
[4] K. Sollins, "Plan for Internet Directory Services", RFC 1107,
    M.I.T. Laboratory for Computer Science, July 1989.
[5] S. Hardcastle-Kille, "Replication Requirements to provide an
    Internet Directory using X.500", RFC 1275, University College
    London, November 1991.
[6] J. Postel, C. Anderson, "White Pages Meeting Report", RFC 1588,
    February 1994.
11. Authors' Addresses
   Ryan Moats
   AT&T
   15621 Drexel Circle
   Omaha, NE 68135-2358
   USA
   EMail: jayhawk@att.com

   Rick Huber
   AT&T
   Room 1B-433, 101 Crawfords Corner Road
   Holmdel, NJ 07733-3030
   USA
   EMail: rvh@att.com