Network Working Group R. Di Cosmo
INTERNET DRAFT ENS France
Category: Experimental P.E. Martinez Lopez
UNLP Argentina
January 1998
Distributed Robots: a Technology for Fast Web Indexing
<draft-rfced-exp-cosmo-00.txt>
Status of This Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
To learn the current status of any Internet-Draft, please check
the "1id-abstracts.txt" listing contained in the Internet-
Drafts Shadow Directories on ftp.is.co.za (Africa),
ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).
Distribution of this document is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1998). All Rights Reserved.
1. Abstract
We propose a protocol (the Remote Update Protocol, RUP) for
cooperation between Web Servers and Web Robots, in order to
increase the reliability of Web indexing and to decrease the load
on both the server and the robot side. If the servers conform to
the RUP protocol, the task of the Robot will appear to be
distributed among the servers that it consults; for that
reason we chose to call this note "Distributed Robots".
2. Introduction
Web Robots are programs that automatically traverse the Web's
hypertext structure, mainly in order to perform indexing and/or
maintenance tasks [3]. Usually, the Robot connects to the servers
in order to retrieve and index the relevant documents.

Due to communication latency, the exponential growth in the number
of Web servers, multi-headed servers and various other factors,
the task of indexing the whole Web is a daunting one. A short
back-of-the-envelope calculation, assuming a 10-second delay
between requests to avoid overloading the servers and an average
of 100 million URLs to index, and even ignoring the bandwidth
necessary to transfer the actual data, shows that we would need
more than 30 years to index the Web using one robot (10^8 URLs x
10 seconds is 10^9 seconds, roughly 32 years), and several months
even supposing we had hundreds of independent robots examining
disjoint partitions of the Web. Considering the widely variable
lifetime of URLs, this means that the snapshot taken by a web
search robot is doomed to be a rather old one, so that the
probability of getting dead URLs as a result of a search on a web
index is quite high, and bound to increase steadily, unless some
radical change in indexing technology occurs.

The purpose of the present draft is to propose such a new
technology, via a public protocol for Remote Updates. The key
observation is that, as always, work should be done where it costs
less: checking what is new on a web server is best done by the web
server itself, not by an external search engine. Better still,
checking modifications of the server's file system is a task that
many Webmasters already perform on their own, for security and
administrative reasons, on a daily basis. Hence, it is the web
server that should notify registered robots, on a periodic basis,
of relevant modifications, and provide unregistered robots with
the ability to query for modifications that occurred over a
designated span of time, thus taking a relevant part of the
workload off the robots.

Also, the web server is the one best placed to know whether some
URLs in its domain are not to be indexed by a given robot (like
synthetic ones, or local temporary links, etc.), and this
information is already available on the site through the
/robots.txt file, covered by the Robot Exclusion Protocol (REP)
[1,2]. Combining this local information (modification logs and
exclusion preferences) with a registration mechanism for indexing
robots, we can obtain the following advantages:
* lower server load: registered robots will no longer flood the
server with bursts of GET or HEAD HTTP requests covering the
whole server's URL addressing space;
* lower robot load: registered robots will only have to retrieve
modified URLs;
* lower bandwidth usage: besides drastically reducing the
bandwidth abuse due to indexing bursts, the remote update
protocol may further reduce bandwidth usage by sending
modification information back to robots in e-mail messages (which
use store-and-forward methods instead of long-range TCP/IP
connections);
* increased index liveness: the remote update mechanism makes it
possible to maintain more up-to-date indexes, and to discover
modification patterns that will allow the robot to apply sound
reindexing policies (like tagging "hot" servers with higher
reindexing priorities, while using lower priorities for relatively
stable ones).

It is worth noting that what we are proposing is the web
equivalent of replacing the polling of an input device with
interrupt-driven technology, with similar benefits.
3. Protocol description

We now present the components of the Remote Update Protocol for
the communication between servers and robots. The protocol
consists of

* a registration protocol,
* an interaction protocol for performing further actions like
unregistering, modifying preferences or requesting update
information on the fly (this last being suitable for unregistered
robots), and
* a text format for the communication of update information.
In order to describe data formats, we will use formal expressions
with the following conventions:

* literal text should appear exactly as written in the actual
data;
* lowercase names such as cgiurl are variables that should be
replaced by a proper value;
* any text enclosed between [ ]* symbols can appear zero, one or
more times;
* any text enclosed between { } symbols can appear at most once
(that is, it is optional);
* the letters NL are used to indicate the end-of-line character
sequence, which is system-dependent.
3.1 Registration protocol
At present, when a Robot wants data from a Web server, it accesses
the server and retrieves the information it wants. If it is
willing, it can retrieve the /robots.txt file and respect the
guidance provided there. In our protocol, the first thing the
Robot must do is register with the server. The registration of
a robot involves its identification to the server, and the
communication of its preferences (latency between updates, for
example). The server should accept this registration - after
an optional authentication of the robot - initialize the data for
the communication with the robot, and give the robot back a key
needed for further operations like unregistering or changing
preference information. To accomplish that task, servers
implementing the RUP protocol will have in their root WWW
directory a file /rupinfo.txt, containing information about the
registration procedure (which takes place via a CGI script)
and the implemented features of the protocol (like valid values
for latency, exclusion and inclusion policies, etc.).
3.1.1 /rupinfo.txt data format
The file /rupinfo.txt has the following syntax:
RUP-CGI: cgiurl NL
{Authentifier: authmethod NL}
{Latency: latvalue[, latvalue]* NL}
{Exclude: urlpatvalue[, urlpatvalue]* NL}
{IncludeOnly: urlpatvalue[, urlpatvalue]* NL}
{MaxQuerySpan: integer-latvalue NL}
where
* cgiurl is the URL of the CGI script implementing the RUP
protocol on the server.
* authmethod is a verification scheme to determine the robot's
identity.
For the time being, we do not provide any support for robot
authentication, so the only valid value is currently none; a
future version of the protocol may add new values.
* latvalue is a value for accepted latencies (common values are
day, week, month).
* urlpatvalue is a URL pattern, expressed using regular
expression syntax.
* integer is a positive number.
The RUP-CGI field indicates the URL of the CGI script that should
be run in order to perform any action of the RUP protocol. The
robots will communicate with the RUP server by issuing HTTP
requests to cgiurl, preferably using the POST CGI method (but the
GET method should also be honored); the possible actions are
described in subsequent sections. The Latency field indicates the
possible magnitudes for the interval between notifications. The
Exclude and IncludeOnly fields are lists of accepted URL patterns
that the robot may want to include in its exclude or includeonly
list in the registration phase (the default is none). The
MaxQuerySpan field indicates how long the server keeps the
modification information; it is used by registered or
unregistered robots in order to know how far in the past they
can obtain update information from the server.
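
As an illustration, a server accepting weekly and monthly
notifications, providing no robot authentication and keeping three
months of modification logs might publish the following
/rupinfo.txt (the server name and patterns are hypothetical):

   RUP-CGI: http://www.example.org/cgi-bin/rup.cgi
   Authentifier: none
   Latency: week, month
   Exclude: /tmp/.*, /cgi-bin/.*
   MaxQuerySpan: 3-month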
3.1.2 Registration phase
A robot willing to register with a server will retrieve the
/rupinfo.txt file to find which CGI script to call for
registration and the preference values supported by the
server. It will then issue an HTTP request to that cgiurl with
the following set of key/value pairs as arguments:
Action=Register
RobotEmail=email-address
{Latency={integer-}latvalue}
{Exclude=urlpatvalue[ urlpatvalue]*}
{IncludeOnly=urlpatvalue[ urlpatvalue]*}
where
* email-address is a valid e-mail address where the robot wants
to be notified of changes,
* Latency indicates the time the robot wants to wait between two
successive reports (where latvalue is chosen from the Latency
field in /rupinfo.txt),
* Exclude indicates that the robot wants to be informed of all
changes, except those affecting files matching the listed path
patterns,
* IncludeOnly indicates that the robot only wants to monitor
changes on files matching the listed path patterns. This is
especially suitable to allow single users to monitor changes on
specific sets of pages, if the server supports registration of
single users.
The only required value is the e-mail address of the robot, and
either Exclude or IncludeOnly is allowed, but not both. After
authentication of the robot's identity, the server will answer
this request with an HTML document containing the line:

RegistrationKey: string

with status code 200 (OK). This key will be required for any
further modification of the robot's preferences record stored by
the server. In case the key is lost, human intervention will be
required. If an error occurs (for example, if the e-mail address
is not a valid one), the answer will be an error status code,
typically 400 (Bad Request) or 501 (Not Implemented). After
registration, a robot no longer needs to interact with the server
via the RUP protocol during normal operation, because the server
will send modification updates to the registered e-mail address
according to the latency preferences and include/exclude
directives given by the robot.
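
As an illustration only, the following Python sketch shows how a
robot might carry out the registration phase; the server name, CGI
path, e-mail address and preference values are hypothetical, and
the sketch is one possible client, not part of the protocol.

   import urllib.parse
   import urllib.request

   SERVER = "http://www.example.org"        # hypothetical RUP-aware server

   # Step 1: retrieve /rupinfo.txt and extract the RUP-CGI field.
   with urllib.request.urlopen(SERVER + "/rupinfo.txt") as reply:
       rupinfo = reply.read().decode("ascii", "replace")

   cgiurl = None
   for line in rupinfo.splitlines():
       if line.startswith("RUP-CGI:"):
           cgiurl = line.split(":", 1)[1].strip()
           break
   if cgiurl is None:
       raise SystemExit("server does not advertise a RUP-CGI script")

   # Step 2: issue the registration request, preferably with POST.
   fields = urllib.parse.urlencode({
       "Action": "Register",
       "RobotEmail": "robot@indexer.example",  # address where reports are sent
       "Latency": "week",                      # one of the advertised latvalues
       "Exclude": "/tmp/.* /cgi-bin/.*",       # hypothetical exclusion patterns
   })
   request = urllib.request.Request(cgiurl, data=fields.encode("ascii"))
   with urllib.request.urlopen(request) as reply:
       answer = reply.read().decode("ascii", "replace")

   # Step 3: keep the key needed for Unregister or ModifyPreferences later.
   for line in answer.splitlines():
       if line.startswith("RegistrationKey:"):
           print("registered, key =", line.split(":", 1)[1].strip())

The same request pattern (a set of key/value pairs sent to cgiurl)
is used for all the further interactions described in section 3.3.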
3.2 Data format for modification updates
The modification updates will be MIME-compliant e-mail messages
whose content has the following format:
SequenceNumber: integer NL
URLBase: URL NL
NL
[
New[date]: filepath[, filepath]* NL |
Change[date]: filepath[, filepath]* NL |
Delete[date]: filepath[, filepath]* NL
]*
In other words, each of these messages is composed of a header of
two lines giving the sequence number of the report and the URL
base of the site (the server address), then a blank line, then a
sequence of lines indicating the changes in the server
contents. Each line in the sequence is of one of three types: a
New line, a Change line, or a Delete line. A New line with date
date indicates that the filepaths were created on that
date. Similarly, a Change line indicates that the filepaths were
changed on the indicated date, and a Delete line indicates that
they were deleted on the indicated date. The sequence number is
used to detect lost update reports as explained in 3.3.3.
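
For instance, a hypothetical report number 42 for the server
www.example.org, recording one new page, two changed pages and one
deleted page, could read as follows (the date format shown is only
illustrative, since this draft does not fix one):

   SequenceNumber: 42
   URLBase: http://www.example.org/

   New[12 Jan 1998]: /papers/rup.html
   Change[13 Jan 1998]: /index.html, /people/staff.html
   Delete[14 Jan 1998]: /tmp/old-draft.html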
3.3 Further interactions
3.3.1 Canceling a registration
If for any reason a robot no longer needs a server's
information, it can cancel its registration by issuing an HTTP
request with the following key/value pairs:
Action=Unregister
RegistrationKey=string
RegisteredRobotEmail=email-address
The effect of this command is that the server will erase the
robot from its database and stop sending it reports of changes.
3.3.2 Modifying preferences
A robot can modify its preferences on the server at any moment by
issuing an HTTP request with the following key/value pairs:
Action=ModifyPreferences
RegistrationKey=string
RegisteredRobotEmail=email-address
{RobotEmail=email-address}
{Latency={integer-}latvalue}
{Exclude=urlpatvalue[ urlpatvalue]*}
{IncludeOnly=urlpatvalue[ urlpatvalue]*}
The effect of this command is that the server will check whether
the robot registered with the e-mail address given in the
RegisteredRobotEmail field is in its database, and whether the
registration key is valid. If so, it modifies the robot's
preference items given in the request, leaving the other items
unchanged; otherwise it will return an error status code,
typically 400 (Bad Request).
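
For example, a registered robot that wants to switch to monthly
reports sent to a new address could issue a request with the
following (hypothetical) values:

   Action=ModifyPreferences
   RegistrationKey=a7f3c2
   RegisteredRobotEmail=robot@indexer.example
   RobotEmail=new-robot@indexer.example
   Latency=month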
3.3.3 Discovering and handling errors
A given Robot should receive notifications from a server within a
fixed period of time - the latency time set by the Robot. For
that reason, if the latency time has expired and no message
arrives within a reasonable amount of time, the Robot can consult
the SequenceNumber stored on the server to check whether any
report was lost - that is, the server sent it but it never
arrived. This is done by issuing an HTTP request with the
following key/value pairs:
Action=GetSequenceNumber
RegistrationKey=string
RegisteredRobotEmail=email-address
that will produce either an error or an HTML document containing
the line SequenceNumber=integer. In the case of a lost report,
the Robot can either get the lost data using the unregistered
robot part of the protocol, or, if this is not supported or the
required data extends too much into the past for the server to
satisfy the request, completely reindex the site. After that,
normal operation is resumed.
3.3.4 Requesting an online modification update
A server may maintain the modification log in a format suitable
to answer unregistered queries asking for a selected part of the
modification log. In this case, it is very advisable to make this
information available even to robots not yet registered, via a
simple query protocol. This is done by issuing an HTTP request
with the following key/value pairs:
Action=GetIndex
Span=integer-latvalue
whose answer is either an error or an HTML document of the same
data format as the content of periodic update mail messages sent
to registered robots (see 3.2), with the requested modification
information. This method can also be used by registered robots
to recover a lost update report, as described in 3.3.3.
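
The following is a minimal sketch, under the same hypothetical
names as the registration example above, of how a robot might
detect a lost report with GetSequenceNumber and recover the
missing data with an unregistered GetIndex query (the span value
shown is hypothetical):

   import re
   import urllib.parse
   import urllib.request

   CGIURL = "http://www.example.org/cgi-bin/rup.cgi"  # hypothetical RUP-CGI value
   KEY = "a7f3c2"                           # key obtained at registration time
   EMAIL = "robot@indexer.example"          # registered robot address

   def rup_request(fields):
       # POST a set of key/value pairs to the RUP CGI script, return the body.
       data = urllib.parse.urlencode(fields).encode("ascii")
       request = urllib.request.Request(CGIURL, data=data)
       with urllib.request.urlopen(request) as reply:
           return reply.read().decode("ascii", "replace")

   # Ask the server for the sequence number of the last report it produced.
   answer = rup_request({
       "Action": "GetSequenceNumber",
       "RegistrationKey": KEY,
       "RegisteredRobotEmail": EMAIL,
   })
   match = re.search(r"SequenceNumber=(\d+)", answer)
   server_seq = int(match.group(1)) if match else None

   last_received_seq = 41   # sequence number of the last report actually received
   if server_seq is not None and server_seq > last_received_seq:
       # At least one report was lost: fetch the missing span on the fly,
       # here asking for the last two weeks of modifications.
       print(rup_request({"Action": "GetIndex", "Span": "2-week"}))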
4. Security Considerations
The protocol proposed in this memo relies on the HTTP, CGI and
e-mail protocols for data transfer and does not change the
underlying security issues involved in those protocols. The only
new security issue raised concerns the risk of malicious
modification of a web indexer's preferences record, which we
counter in a simple way by providing a unique identification key
for every web indexer. More robust solutions could be provided by
means of a stronger authentication protocol, for which a hook is
provided in the protocol; the choice of the authentication
method, which is outside the scope of the present memo, does not
alter the proposed protocol in any way.
5. Conclusions
In the current state of the art for Web Robots, the daunting
amount of work involved in their task is a real barrier to
achieving completeness. For that reason, some form of distribution
of the workload is needed. In this note we have presented a
protocol that can be used to distribute the task of indexing the
Web. Each server cooperates with the Robots, preparing reports of
changes in its contents, so that the Robot does not have to
discover those changes by itself. The protocol described does not
impose any significant overhead on the servers, in consideration
of the fact that the needed modification logs are usually
maintained anyway for administrative reasons; and if this
technology spreads, we will surely find that the information the
Robots can gather and maintain will be much more accurate than it
is at present.

One could think of the possibility of allowing not only change
monitoring tasks, but also indexing tasks, to be performed on the
server in idle time: it is quite easy to add support for such a
feature, for example by allowing transmission of executable code
during the registration phase and allowing proprietary data to be
included in the periodic report. Nevertheless, due to the varying
nature of the indexing algorithms used by different robots, this
would require robot-dependent code (or even robot-supplied code)
to be executed on the server, which would increase the overhead
and raise fundamental security issues that would prevent easy
distribution of the protocol. So we decided not to include support
for this feature in this version of the protocol, but we may do so
in the future.

A prototype RUP implementation on the server side is currently
being tested and will be made available as a reference
implementation with the final version of this note.
6. References
[1] Koster, M., "A Standard for Robot Exclusion", June 1994,
    http://info.webcrawler.com/mak/projects/robots/norobots.html

[2] Kollar, C. P., Leavitt, J. R. R., and M. Mauldin, "Robot
    Exclusion Standard Revisited", June 1996,
    http://www.kollar.com/robots.html

[3] Koster, M., "The Web Robots Pages", 1995,
    http://info.webcrawler.com/mak/projects/robots/robots.html
7. Author Information
Roberto Di Cosmo
LIENS-DMI
Ecole Normale Superieure
45, Rue d'Ulm
75230 Paris CEDEX 05
France
E-mail: Roberto.Di.Cosmo@ens.fr
Pablo Ernesto Martinez Lopez
LIFIA
Universidad Nacional de La Plata
CC.11, Correo Central
La Plata
Argentina
E-mail: fidel@info.unlp.edu.ar