INTERNET DRAFT          EXPIRES SEPT 1998       INTERNET DRAFT
Network Working Group                               R. Di Cosmo
INTERNET DRAFT                                      ENS France
Category: Experimental                              P.E. Martinez Lopez
                                                    UNLP Argentina
                                                    January 1998

      Distributed Robots: a Technology for Fast Web Indexing
                <draft-rfced-exp-cosmo-00.txt>

Status of This Memo

This document is an Internet-Draft.  Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups.  Note that other groups may also
distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time.  It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."

To learn the current status of any Internet-Draft, please check
the "1id-abstracts.txt" listing contained in the Internet-
Drafts Shadow Directories on ftp.is.co.za (Africa),
ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.


Copyright Notice

  Copyright (C) The Internet Society (1998). All Rights Reserved.

1. Abstract

  We propose a protocol (the Remote Update Protocol, RUP) for
  cooperation between Web Servers and Web Robots in order to
  increase the reliability of Web indexing and decrease the load on
  both the server and the robot side. If the servers conform to the
  RUP protocol, the task of the Robot will appear to be distributed
  among the servers it consults - for that reason we chose to call
  this note "Distributed Robots".

2. Introduction

  Web Robots are programs that automatically traverse the Web's
  hypertext structure, mainly in order to perform indexing and/or
  maintenance tasks [3]. Usually, the Robot connects to the servers
  in order to retrieve and index the relevant documents. Due to
  communication latency, the exponential growth of Web servers,
  multi-headed servers and various other factors, the task of
  indexing the whole Web is a daunting one: a short
  back-of-the-envelope calculation, assuming a 10-second delay
  between requests to avoid overloading the servers and an average
  of 100 million URLs to index, and even ignoring the bandwidth
  necessary to transfer the actual data, shows that we would need
  more than 30 years to index the Web using one robot (10^8 URLs x
  10 seconds is 10^9 seconds, or roughly 32 years), and several
  months even supposing we had hundreds of independent robots
  examining disjoint partitions of the Web. Considering the widely
  variable lifetime of URLs, this means that the snapshot taken by a
  web search robot is doomed to be a rather old one, so that the
  probability of getting dead URLs as the result of a search on a
  web index is quite high, and bound to increase steadily, unless
  some radical change in indexing technology occurs.

  The purpose of the present draft is to propose such a new
  technology, via a public protocol for Remote Updates. The key
  observation is that, as always, work should be done where it costs
  less: checking what is new on a web server is best done by the web
  server itself, not by an external search engine. Better still,
  checking modifications of the server's file system is a task that
  many Webmasters already perform on their own, for security and
  administrative reasons, on a daily basis. Hence, it is the web
  server that should notify registered robots, on a periodic basis,
  of relevant modifications, and provide unregistered robots with
  the ability to query for modifications that occurred over a
  designated span of time, thus taking a relevant part of the
  workload off the robots. Also, the web server is best able to know
  whether some URLs in its domain are not to be indexed by a given
  robot (like synthetic ones, or local temporary links, etc.), and
  this information is already available on the site through the
  /robots.txt file, covered in the Robot Exclusion Protocol (REP)
  [1,2]. Combining this local information (modification logs and
  exclusion preferences) with a registration mechanism for indexing
  robots, we can obtain the following advantages:

  * lower server load: registered robots will no longer overload the
  server with bursts of GET or HEAD HTTP requests covering the whole
  server's URL addressing space.

  * lower robot load: registered robots will only have to retrieve
  modified URLs.

  * lower bandwidth usage: besides drastically reducing the
  bandwidth abuse due to indexing bursts, the remote update protocol
  may further reduce bandwidth usage by sending modification
  information back to robots in e-mail messages (which use
  store-and-forward methods instead of long-range TCP/IP
  connections).

  * increased index liveness: the remote update mechanism makes it
  possible to maintain more up-to-date indexes, and to discover
  modification patterns that will allow the robot to apply sound
  reindexing policies (like tagging "hot" servers with higher
  reindexing priorities, while using lower priorities for relatively
  stable ones).

  It is worth noting that what we are proposing is the web
  equivalent of replacing polling of an input device with
  interrupt-driven technology, with similar benefits.

3. Protocol Description

  We now present the components of the remote update protocol for
  the communication between servers and robots. It consists of:

  * a registration protocol,

  * an interaction protocol for performing further actions like
  unregistering, modifying preferences or requesting update
  information on the fly (this last one also available to
  unregistered robots), and

  * a text format for the communication of update information.

  In order to describe data formats, we will use formal expressions
  with the following conventions:

  * characters in typewriter font should appear the same in the
  data text;

  * names in italics are variables that should be replaced by a
  proper value;

  * any text enclosed between [ ]* symbols can appear zero, one or
  more times;

  * any text enclosed between { } symbols can appear at most one
  time (that is, it is optional);

  * the letters NL are used to indicate end-of-line, and are system
  dependent.

3.1 Registration protocol

  At present, when a Robot wants data from a Web server, it accesses
  the server and retrieves the information it wants. If it is
  willing, it can retrieve the /robots.txt file and respect the
  guidance provided there. In our protocol, the first action the
  Robot must perform is to register with the server. The
  registration of a robot involves its identification to the server,
  and the communication of its preferences (latency between updates,
  for example). The server should accept this registration - after
  an optional authentication of the robot - initialize the data for
  the communication with the robot, and give back to the robot a key
  needed for further operations like unregistering or changing
  preference information. To accomplish that task, servers
  implementing the RUP protocol will have in their root WWW
  directory a file /rupinfo.txt, containing information about the
  registration procedure (which will take place via a CGI script)
  and the implemented features of the protocol (like valid values
  for latency, exclusion and inclusion policies, etc.).

3.1.1 /rupinfo.txt data format

  The file /rupinfo.txt has the following syntax:

     RUP-CGI: cgiurl NL
     {Authentifier: authmethod NL}
     {Latency: latvalue[, latvalue]* NL}
     {Exclude: urlpatvalue[, urlpatvalue]* NL}
     {IncludeOnly: urlpatvalue[, urlpatvalue]* NL}
     {MaxQuerySpan: integer-latvalue NL}

  where

  * cgiurl is the URL of the CGI script implementing the RUP
  protocol on the server.


  * authmethod is a verification scheme to determine the robot's
  identity.

  For the time being, we do not provide any support for robot
  authentication, so the only valid value is currently "none"; a
  future version of the protocol may add new values.

  * latvalue is a value for accepted latencies (common values are
  day, week, month).

  * urlpatvalue is a URL pattern, expressed using regular expression
  syntax.

  * integer is a positive number.

  The RUP-CGI field indicates the URL of the CGI script that should
  be run in order to perform any action of the RUP protocol. The
  robots will communicate with the RUP server by issuing HTTP
  requests to cgiurl, preferably using the POST CGI method (but the
  GET method should also be honored); the possible actions are
  described in subsequent sections. The Latency field indicates the
  possible magnitudes for the interval between notifications. The
  Exclude and IncludeOnly fields are lists of accepted patterns of
  URLs that the robot may want to include in the Exclude and
  IncludeOnly lists in the registration phase (the default is none).
  The MaxQuerySpan field indicates how long the server keeps the
  modification information; it is used by registered or unregistered
  robots in order to know how far in the past they can obtain update
  information from the server.
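
  As a non-normative illustration, the following Python sketch shows
  one way a robot could retrieve and parse /rupinfo.txt. The helper
  name, the use of plain HTTP and the simple line-oriented parsing
  are our own assumptions, not part of this specification.

     import urllib.request

     def fetch_rupinfo(server):
         """Retrieve and parse the /rupinfo.txt file of an RUP server."""
         url = "http://%s/rupinfo.txt" % server
         with urllib.request.urlopen(url) as resp:
             text = resp.read().decode("ascii", "replace")
         info = {}
         for line in text.splitlines():
             if ":" not in line:
                 continue          # skip blank or malformed lines
             key, value = line.split(":", 1)
             # multi-valued fields are comma-separated lists
             info[key.strip()] = [v.strip() for v in value.split(",")]
         return info

     # Example use (hypothetical host):
     #   rup = fetch_rupinfo("www.example.org")
     #   cgiurl = rup["RUP-CGI"][0]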

3.1.2 Registration phase

  A robot willing to register with a server will retrieve the
  /rupinfo.txt file to find which CGI script to call for
  registration and the preference values supported by the server. It
  will then issue an HTTP request to the cgiurl found there, with
  the following set of key/value pairs as arguments:

     Action=Register
     RobotEmail=email-address
     {Latency={integer-}latvalue}
     {Exclude=urlpatvalue[ urlpatvalue]*}
     {IncludeOnly=urlpatvalue[ urlpatvalue]*}

  where

  * email-address is a valid e-mail address where the robot wants
  to be notified of changes,

  * Latency indicates the time the robot wants to wait between two
  successive reports (where latvalue is chosen from the Latency
  field in /rupinfo.txt),

  * Exclude indicates that the robot wants to be informed of all
  changes, except those affecting files matching the listed path
  patterns,

  * IncludeOnly indicates that the robot only wants to monitor
  changes on files matching the listed path patterns. This is
  especially suitable for allowing single users to monitor changes
  on specific sets of pages, if the server supports registration of
  single users.

  The only required value is the e-mail address of the robot, and
  either Exclude or IncludeOnly is allowed, not both. After
  authentication of the robot's identity, the server will answer
  this request with an HTML document containing the line:

     RegistrationKey: string

  with status code 200 (OK). This key will be required for any
  further modification of the robot's preferences record stored by
  the server. In case the key is lost, human intervention will be
  required. If an error occurs (for example, if the e-mail address
  is not a valid one), the answer will be an error status code -
  typically 400 (Bad Request) or 501 (Not Implemented). After
  registration, a robot no longer needs to interact with the server
  via the RUP protocol during normal operation, because the server
  will send the modification updates to the registered e-mail
  address according to the latency preferences and include/exclude
  directives given by the robot.
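
  As a non-normative illustration, the following Python sketch sends
  a registration request and extracts the returned key. It assumes
  the RegistrationKey line appears on a line of its own in the
  answer; the helper name is ours.

     import urllib.parse, urllib.request

     def register(cgiurl, robot_email, latency=None, exclude=None):
         """Send an Action=Register request; return the key."""
         fields = {"Action": "Register", "RobotEmail": robot_email}
         if latency:
             fields["Latency"] = latency        # e.g. "2-week"
         if exclude:
             # Exclude patterns are space-separated in the request
             fields["Exclude"] = " ".join(exclude)
         data = urllib.parse.urlencode(fields).encode("ascii")
         # 'data' makes urlopen issue a POST request; error answers
         # (400, 501) are raised as urllib.error.HTTPError
         with urllib.request.urlopen(cgiurl, data=data) as resp:
             answer = resp.read().decode("ascii", "replace")
         for line in answer.splitlines():
             if line.strip().startswith("RegistrationKey:"):
                 return line.split(":", 1)[1].strip()
         raise ValueError("no RegistrationKey in server answer")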

3.2 Data format for modification updates

  The modification updates will be MIME-compliant e-mail messages
  whose content has the following format:


     SequenceNumber: integer NL
     URLBase: URL NL
     NL
     [
       New[date]: filepath[, filepath]* NL |
       Change[date]: filepath[, filepath]* NL |
       Delete[date]: filepath[, filepath]* NL
     ]*


  In other words, each of these messages is composed of a header of
  two lines giving the sequence number of the report and the URL
  base of the site (the server address), then a blank line, then a
  sequence of lines indicating the changes in the server's contents.
  Each line in the sequence is of one of three types: a New line, a
  Change line, or a Delete line. A New line with date date indicates
  that the listed filepaths were created on that date. Similarly, a
  Change line indicates that the filepaths were changed on the
  indicated date, and a Delete line indicates that they were deleted
  on the indicated date. The sequence number is used to detect lost
  update reports, as explained in 3.3.3.
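
  As a non-normative illustration, the following Python sketch
  parses the content of such an update message. It assumes the date
  is written between square brackets right after the New, Change or
  Delete keyword, as the grammar above suggests, and that line
  endings have already been normalized to "\n".

     def parse_update(content):
         """Split an update report into header fields and entries."""
         header, _, changes = content.partition("\n\n")
         report = {"entries": []}
         for line in header.splitlines():
             key, _, value = line.partition(":")
             # header fields: SequenceNumber and URLBase
             report[key.strip()] = value.strip()
         for line in changes.splitlines():
             if not line.strip():
                 continue
             # e.g. "New[19980115]: /a.html, /b/c.html"
             keyword, _, paths = line.partition(":")
             action, _, date = keyword.partition("[")
             report["entries"].append(
                 (action.strip(), date.rstrip("]").strip(),
                  [p.strip() for p in paths.split(",")]))
         return report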

3.3 Further interactions

3.3.1 Canceling a registration

  If for any reason a robot no longer needs a server's information,
  it can cancel its registration by issuing an HTTP request with the
  following key/value pairs:

     Action=Unregister
     RegistrationKey=string
     RegisteredRobotEmail=email-address

  The effect of this command is that the server will erase the robot
  from its database and stop sending it reports of changes.
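
  A minimal, non-normative sketch of the corresponding request
  (helper name ours):

     import urllib.parse, urllib.request

     def unregister(cgiurl, key, robot_email):
         """Issue an Action=Unregister request to stop reports."""
         data = urllib.parse.urlencode({
             "Action": "Unregister",
             "RegistrationKey": key,
             "RegisteredRobotEmail": robot_email,
         }).encode("ascii")
         urllib.request.urlopen(cgiurl, data=data).close()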

3.3.2 Modifying preferences

  A robot can modify its preferences on the server at any moment by
  issuing an HTTP request with the following key/value pairs:

     Action=ModifyPreferences
     RegistrationKey=string
     RegisteredRobotEmail=email-address
     {RobotEmail=email-address}
     {Latency={integer-}latvalue}
     {Exclude=urlpatvalue[ urlpatvalue]*}
     {IncludeOnly=urlpatvalue[ urlpatvalue]*}

  The effect of this command is that the server will check whether
  the robot registered with the e-mail address given in the
  RegisteredRobotEmail field is in its database, and whether the
  registration key is valid. If so, it modifies the robot's
  preference items given in the request, leaving the other items
  unchanged; otherwise it will return an error status code,
  typically 400 (Bad Request).
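
  As a non-normative sketch of the server-side check just described,
  assuming the server keeps its robot records in a simple dictionary
  keyed by e-mail address (our assumption, not a requirement of the
  protocol):

     def modify_preferences(database, request):
         """Check the registration, then merge the given items."""
         record = database.get(request.get("RegisteredRobotEmail"))
         given_key = request.get("RegistrationKey")
         if record is None or record.get("RegistrationKey") != given_key:
             return 400                          # Bad Request
         for field in ("RobotEmail", "Latency",
                       "Exclude", "IncludeOnly"):
             if field in request:
                 record[field] = request[field]  # others unchanged
         return 200                              # OK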

3.3.3 Discovering and handling errors

  A given Robot should receive notifications from a server within a
  fixed period of time - the latency time set by the Robot. For that
  reason, if the latency time has expired and no message has arrived
  within a reasonable amount of time, the Robot can consult the
  SequenceNumber stored on the server to check whether any report
  was lost - that is, the server sent it, but it did not arrive.
  This is done by issuing an HTTP request with the following
  key/value pairs:

     Action=GetSequenceNumber
     RegistrationKey=string
     RegisteredRobotEmail=email-address

  that will produce either an error or an HTML document containing
  the line SequenceNumber=integer. In the case of a lost report, the
  Robot can either get the lost data using the unregistered robot
  part of the protocol, or, if this is not supported or the required
  data extends too far into the past for the server to satisfy the
  request, completely reindex the site. After that, normal operation
  is resumed.
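
  The following non-normative Python sketch asks the server for its
  current SequenceNumber and compares it with the last report
  received; the helper name and the way the robot remembers
  last_seen are our own assumptions.

     import urllib.parse, urllib.request

     def reports_missed(cgiurl, key, robot_email, last_seen):
         """Return the number of reports lost since 'last_seen'."""
         data = urllib.parse.urlencode({
             "Action": "GetSequenceNumber",
             "RegistrationKey": key,
             "RegisteredRobotEmail": robot_email,
         }).encode("ascii")
         with urllib.request.urlopen(cgiurl, data=data) as resp:
             answer = resp.read().decode("ascii", "replace")
         for line in answer.splitlines():
             if line.strip().startswith("SequenceNumber="):
                 return int(line.strip().split("=", 1)[1]) - last_seen
         raise ValueError("no SequenceNumber in server answer")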

3.3.4 Requesting an online modification update

  A server may maintain the modification log in a format suitable
  for answering unregistered queries asking for a selected part of
  the modification log. In this case, it is highly advisable to make
  this information available even to robots not yet registered, via
  a simple query protocol. This is done by issuing an HTTP request
  with the following key/value pairs:

     Action=GetIndex
     Span=integer-latvalue

  whose answer is either an error or an HTML document in the same
  data format as the content of the periodic update mail messages
  sent to registered robots (see 3.2), with the requested
  modification information. This method can also be used by
  registered robots to recover a lost update report, as described in
  3.3.3.
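
  A minimal, non-normative sketch of such an unregistered query
  (helper name ours); the answer can then be parsed with the same
  code used for the periodic update messages of section 3.2.

     import urllib.parse, urllib.request

     def get_index(cgiurl, span):
         """Ask an RUP server for its modification log over 'span'."""
         data = urllib.parse.urlencode({
             "Action": "GetIndex",
             "Span": span,                       # e.g. "3-week"
         }).encode("ascii")
         with urllib.request.urlopen(cgiurl, data=data) as resp:
             return resp.read().decode("ascii", "replace")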

4. Security Considerations

  The protocol proposed in this memo relies on the HTTP, CGI and
  e-mail protocols for data transfer and does not change the
  underlying security issues involved in those protocols. The only
  new security issue raised concerns the risk of malicious
  modification of a web indexer's preferences record, which we
  counter in a simple way by providing a unique identification key
  for every web indexer. More robust solutions could be provided by
  means of a stronger authentication protocol, for which a hook is
  provided in the protocol; the choice of the authentication method,
  which is out of the scope of the present memo, does not alter in
  any way the proposed protocol.

5. Conclusions

  In the current state of the art for Web Robots, the formidable
  amount of work involved in their task is a real barrier to
  achieving completeness. For that reason, some form of distribution
  of the workload is needed. In this note we have presented a
  protocol that can be used to distribute the task of indexing the
  Web. Each server cooperates with the Robots, preparing reports of
  changes in its contents, so that the Robot does not have to figure
  out those changes by itself. The protocol described does not
  impose any significant overhead on the servers, considering that
  the needed modification logs are usually maintained for
  administrative reasons anyway; and if this technology spreads, we
  will surely find that the information the Robots can gather and
  maintain will be much more accurate than at the present time.

  One could think of the possibility of allowing not only
  change-monitoring tasks, but also indexing tasks, to be performed
  on the server in idle time: it is quite easy to add support for
  such a feature, for example by allowing transmission of executable
  code during the registration phase and allowing proprietary data
  to be included in the periodic report. Nevertheless, due to the
  varying nature of the indexing algorithms used by different
  robots, this would require robot-dependent code (or even
  robot-supplied code) to be executed on the server, which would
  increase the overhead and raise fundamental security issues that
  would hinder easy deployment of the protocol. So we decided not to
  include support for this feature in this version of the protocol,
  but we may do so in the future. A prototype RUP implementation on
  the server side is being tested at the time of writing and will be
  made available as a reference implementation with the final
  version of this note.

6. References

   [1] M. Koster, "A Standard for Robot Exclusion", June 1994.
       http://info.webcrawler.com/mak/projects/robots/norobots.html

   [2] C. P. Kollar, J. R. R. Leavitt, M. Mauldin, "Robot Exclusion
       Standard Revisited", June 1996.
       http://www.kollar.com/robots.html

   [3] M. Koster, "The Web Robots Pages", 1995.
       http://info.webcrawler.com/mak/projects/robots/robots.html



7. Author Information

   Roberto Di Cosmo
   LIENS-DMI
   Ecole Normale Superieure
   45, Rue d'Ulm
   75230 Paris CEDEX 05
   France

   E-mail: Roberto.Di.Cosmo@ens.fr


   Pablo Ernesto Martinez Lopez
   LIFIA
   Universidad Nacional de La Plata
   CC.11, Correo Central
   La Plata
   Argentina

   E-mail: fidel@info.unlp.edu.ar
















