Internet-Draft                                              Carlos Bueno
                                                                May 2004





             A Distributed Web Search Protocol -- Dowser/0.1
                        draft-dowser-spec-00.txt


Status of This Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document Expires on November 04, 2004

Abstract

   Dowser is an application-level, peer-to-peer protocol for creating a
   searchable index and cache of documents. It is intended for both
   small-scale intranets and Worldwide Web-scale indexes. This document
   describes the messages that members, or "nodes", of the network pass
   to each other to distribute workload, request & respond to queries,
   and maintain the index.














Bueno                                                          [Page 1]


Internet-Draft                 Dowser/0.1                       May 2004



Table of contents

        1. Introduction  ................................................. 3
        1.1  Purpose  .................................................... 3
        1.2  Definitions  ................................................ 3
        1.3  URL vs. URI vs. URN  ........................................ 4
        1.4  Requirements  ............................................... 5
        2. Some Examples  ................................................ 5
        2.1  NODEFIND  ................................................... 5
        2.1.1 Node-id & seed  ............................................ 6
        2.1.2 Last-key  .................................................. 6
        2.1.3 Ring-id  ................................................... 7
        2.1.4 Port  ...................................................... 7
        2.1.5 Server Responses  .......................................... 7
        2.2 SEARCH  ...................................................... 8
        2.2.1 Expires  ................................................... 9
        2.2.2 Content-key  ............................................... 9
        2.2.3 Search syntax  ............................................. 9
        2.3. URL caching  ............................................... 10
        2.4. Content caching  ........................................... 11
        2.5. CRAWL  ..................................................... 11
        2.6.INDEXADD  ................................................... 12
        3. Announcing  .................................................. 13
        4. Redundancy  .................................................. 14
        5. Optional headers  ............................................ 14
        6. Error conditions  ............................................ 14
        7. Notes on implementation  ..................................... 15
        7.1 Ranking  .................................................... 15
        7.2 Proxying  ................................................... 15
        7.3 TCP vs UDP  ................................................. 16
        7.4 Ping/Pong  .................................................. 16
        7.5 MHTML  ...................................................... 16
        7.6 Spam  ....................................................... 16
        7.7 Indexing  ................................................... 16
        8. Acknowledgements  ............................................ 16
        9. References  .................................................. 17
        9.1 Informative  ................................................ 17
        9.2 Normative  .................................................. 17
        10. Security Considerations  .................................... 17
        10.1 Personal Information  ...................................... 18
        10.2 Sensitive data  ............................................ 18
        11. IPR & Copyright  ............................................ 18
        12. Contact Information  ........................................ 19













Bueno                                                          [Page 2]


Internet-Draft                 Dowser/0.1                       May 2004


1. Introduction

1.1 Purpose

   The express purpose of the Dowser protocol is to encourage open
   research into web-scale, decentralized indexing systems. The
   specification for the Hypertext Transfer Protocol [RFC1945]
   has this observation:

      Practical information systems require more functionality than
      simple retrieval, including search, front-end update, and
      annotation. HTTP allows an open-ended set of methods to be used to
      indicate the purpose of a request.

   HTTP is in widespread use by hundreds of millions of people every
   day, and they issue hundreds of millions of searches for information.
   Most of these searches are served by centralized "search engines"
   that cache copies of as many websites as possible to create their
   indexes. The problems of scale have so far been admirably met
   [GOOGLE], but the operating costs of web-scale engines are now out of
   the reach of most Universities and corporations. Conversely, larger
   and larger amounts of storage and processing power are available to
   personal computer users every year.

   Dowser is an extension to HTTP that can be used by itself or added to
   existing HTTP implementations. Each node on the network claims a
   small part of the keyspace of a distributed hash table [CHORD].
   Indexes of documents are keyed under the hash of the word; documents
   themselves  are keyed by their Uniform Resource Identifier [URI].
   There is also a  "cache of caches", to allow popular documents or
   indexes to be  retrieved from any node that has a copy.

   The syntax and formatting rules are inherited from HTTP/1.1
   [RFC2616]. The required "Host" header in HTTP/1.1 does not really
   apply to Dowser, but including it shouldn't hurt anyone. Several
   other headers and return codes are adopted for use.


1.2 Definitions

   node

      In this protocol there is no difference between a "client" or a
      "server". They are all equal "nodes" on the network, capable of
      sending and serving requests. Each node is identified by a
      40-digit hexadecimal number called a "node-id". They participate
      in creating, storing and serving a distributed hash table. This
      hash table can contain document caches, indexes of those
      documents, and other data.

      For sanity's sake, the term "server" can be taken to mean
      "responding node" and "client" to mean "requesting node" in the
      context of the conversation being described.




Bueno                                                          [Page 3]


Internet-Draft                 Dowser/0.1                       May 2004


   hash, key

      Hash functions are used to evenly distribute data around the
      network and to uniquely identify nodes and pieces of information.
      Dowser uses the Secure Hash Algorithim, version 2 [SHA]. The terms
      "hash" and "key" refer to the 40-digit hexadecimal number gotten
      from passing some piece of information through the SHA function,
      e.g.:

         SHA("foo") == 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33

   distributed hash table

      A distributed hash table is a key-->value mapping that is
      contained on many computers, where they keys are hashed.

   range

      The "range" of a node is defined by the first and last key of the
      hash table it is expected to store data under.

   left-hand, right-hand

      The total keyspace is imagined as the perimeter of a circle, or a
      "ring". While standing in the center, points to the left generally
      decrease and points to the right generally increase, except at the
      point where "fff..." meets "000...".

      A useful analog is timezones, where West is left and East is
      right. The time is always earlier to the left except at the
      International Date Line.


1.3 URL vs. URI vs. URN

   The protocol descibed here has a focus different from most
   filesharing protocols. Searches for files are not broadcast but
   targeted; nodes are authoritative regarding the location and
   checksums of data in their ranges. Searches for a particular node are
   sent to arbitrary nodes but not broadcast per se; they are not
   forwarded but responded to directly.

   In most ad-hoc filesharing protocols the only unique identifier is
   the checksum. The filesystem names and metadata may be different for
   different copies but the checksum remains the same. In HTTP jargon
   this is a Universal Resource Name or URN , that which uniquely
   identifies the file without necessarily referencing its location or
   common name [RFC2396].

   The content of a cached web page may change with time. This means the
   checksum will change while its Univeral Resource Locator, the origin
   of the page, remains constant.

   Indexes built from these cached pages are also treated as files, one
   index file per word or phrase. An index's Universal Resource Name is


Bueno                                                          [Page 4]


Internet-Draft                 Dowser/0.1                       May 2004


   that word. (More correctly it would be "dowser://foo", not "foo".)

   Searching in Dowser therefore starts with a lookup for the node(s)
   who handle the search term(s). The search terms (URN) are then sent
   to those nodes, resulting in either a) the content of index file or
   b) the latest checksum of the index file and a list of nodes that
   have previously downloaded a copy. Retrieving a cache of a web page
   is done the same way.


1.4 Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [34].

   An implementation is not compliant if it fails to satisfy one or more
   of the MUST or REQUIRED level requirements for the protocols it
   implements. An implementation that satisfies all the MUST or REQUIRED
   level and all the SHOULD level requirements for its protocols is said
   to be "unconditionally compliant"; one that satisfies all the MUST
   level requirements but not all the SHOULD level requirements for its
   protocols is said to be "conditionally compliant."

   In reality, the complications of this protocol are mostly in the
   state information, not the ins and outs of the messages being passed.


2. Some Examples

   We will assume the reader is familiar with HTTP transactions. It is
   probably most instructive to give examples using the new methods and
   then a discussion of particulars.  Elipses ("...") are used to
   truncate long hashes for readability.


2.1 NODEFIND

   NODEFINDs are the bread and butter of routing between nodes. The
   client wants to find the node(s) that have items under a certain hash
   key, in this example, "5c89379d0aa9840ac910fd8cacbde2dbf877214a".
   This can be a search term, url, or anything. The responding node may
   or may not have this key. If not, it tells the client about nodes it
   knows about that are closer to the target.


      NODEFIND 5c89379d0aa9840ac910fd8cacbde2dbf877214a Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754...
      Last-key: 13333...
      Port: 4666


   - Response -



Bueno                                                          [Page 5]


Internet-Draft                 Dowser/0.1                       May 2004



      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34....
      Last-key: 49999...

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   - Or -


      Dowser/0.1 211 That's me
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 4000... de34....
      Last-key: 7aaaa...


   There is a lot to cover here. The first line of the request is like
   HTTP, identifying the protocol and version, the 'method' or action to
   be done, and the 'path' or key being requested. There are three
   additional headers which MUST be included in any message: Ring-id,
   Node-id, and Last-key. Additionally, the Port header MUST be included
   in every request.


2.1.1 Node-id & seed

   The node-id is comprised of two 40-digit hexadecimal numbers,
   separated by whitespace. The first number serves as the FIRST key in
   this node's range. This number MUST be generated by hashing some
   random data iteratively. To help ensure this has occurred, the second
   number (the "seed") MUST be the second-to-last result of the
   iterations.

         Node-id: b274f2e2a8d2881035af5866014e9ad5510ab15d
            cdd2ae2594a83ef90c05ee6014b78631db8538d8

   This example (linewrapped to fit) is well-formed because the first
   number is the SHA hash of the second number. Servers MUST check that
   the node-id and its "seed" are correct. If not, they MUST send a 412
   "Precondition failed" error.


2.1.2 Last-key

   Last-key is also an SHA hash, and is the LAST key in this node's
   range. The last-key may change over time, as nodes leave or join the
   network.

   Note: it has been brought up that a node might be able to route a
   large portion of traffic to itself by making its Last-key equal to
   the node-id minus 1. In Internet Protocol terms, this would be
   similar to setting your network address to 255.255.255.255. There is


Bueno                                                          [Page 6]


Internet-Draft                 Dowser/0.1                       May 2004


   nothing in the protocol to prevent this, but implementations should
   come up with a way to deal with it reasonably. Our feeling that it is
   self-correcting. If someone tries it on a network with a lot of nodes
   one hopes he or she has good fire-supression equipment handy; but the
   concern is the possible damage to the network routing.


2.1.3 Ring-id

   Ring-id is yet another SHA hash to identify the group of nodes it
   wishes to communicate with. This allows many Dowser networks to be
   running on an internet at once. Alternate rings are useful for
   testing implementations of the protocol on a large scale, or for
   partitioning networks in general.

   The "official" public ring-id:

      Dowser/0.1  -->  deadbeef00000000000000000000000000000000

   If the Ring-id header sent does not match the ring-id of the server,
   it MUST send a 412 "Precondition falied" message, and may provide
   information in the body of the response as to the reason. Note: there
   is a security risk in actually revealing the server's ring-id in the
   response, if the ring is intended for private data.


2.1.4 Port

   Simple enough: we need to advertise what TCP or UDP port we are
   listening on for future reference.


2.1.5 Server Responses

   If the hash is not in the listening node's range, it returns a 310
   code and a list of nodes that are closer to the target.  The list of
   nodes takes this form:

      IP PORT NODE-ID LAST-KEY


   If it is in range, it sends a 211 "That's Me" code.

   When overlap occurs a node may choose which one to contact first in
   an implementation-dependent way, such as IP or network distance, or
   closeness of the desired key to the node-id.

   Note: IPv4 addresses are used in these examples, but implementations
   SHOULD be written to handle IPv6 addresses as well, e.g.:

      FEDC:BA98:7654:3210:FEDC:BA98:7654:3210
      1080:0:0:0:8:800:200C:4171
      3ffe:2a00:100:7031::1
      1080::8:800:200C:417A
      ::192.9.5.5


Bueno                                                          [Page 7]


Internet-Draft                 Dowser/0.1                       May 2004


      ::FFFF:129.144.52.38
      2010:836B:4179::836B:4179


2.2 SEARCH

   We have the basis of all conversations, and have described how nodes
   identify themselves and each other. Now we can go to the next level:
   content routing.


   - Request -

      SEARCH foo+bar+baz Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c... 55754...
      Last-key: 1e400cc...
      Port:  4666


   - Response -

      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888... de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
      ...


   - Or -

      Dowser/0.1 300 Multiple choices
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888... de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   A SEARCH request is almost exactly like a nodefind except the "path"
   is a list of search terms. The server determines if any of the terms
   fall into its range. How to break up the terms may be
   implementation-specific, but the simplest is to break on whitespace.


   If any are in-range it returns a 300 code, the "content-key" of the
   data (a  SHA checksum of the content) a list of the nodes that have


Bueno                                                          [Page 8]


Internet-Draft                 Dowser/0.1                       May 2004


   requested that data.

   Or, it may return the index data itself. Requesting nodes SHOULD keep
   a copy of the data it receives for a reasonable amount of time, as
   hinted by the Expires: header. The data should be keyed under the
   content-key, for retrieval via a CACHE request. This helps distribute
   the workload. The index is in the form:

      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]

   And is tab-delimited. AGE is the number of seconds since the record
   was created. One or more "SNIPPET"s contain text that form an
   abstract of the document.

   As always, the server can alternately send the standard 310 Not in
   Range, or some error code.


2.2.1  Expires:

   Indexes and caches of web documents are not long-lived in the way a
   media file in traditional peer-to-peer networks are. The Expires
   header hints at how long a node should cache this data, expressed in
   seconds from the present time.


2.2.2 Content-key:

   This is a SHA hash of the data's content, used to positively identify
   it. This is different from the hash of the URL or search term. A page
   whose URL is

      "http://foo.example.com"

   ... and whose content is

      "<h1>Foo!</h1>"

   ... would have a URL hash of cfab46bb7dbd11e6187360d429586e2942f2d42e
   and a content-key of cf5ce65061218164e4148038cc3a56a9e988fe7a. The
   URL hash is a unique identifier of the document's origin, while the
   content-key is an identifier of a particular version of that
   document.



2.2.3 Search syntax

   There are a few schemes for search syntax out in the wild. The most
   common has operands for words that must or must not appear, and exact
   phrases. Implementations MUST honor these operands to the best of
   their ability.

      foo +bar



Bueno                                                          [Page 9]


Internet-Draft                 Dowser/0.1                       May 2004


   Means that the documents MUST contain "bar" and MAY contain "foo"

      foo -bar

   Means that the documents MUST contain "foo" and MUST NOT contain
   "bar". When there is only one term without a "-" operand, the "+" is
   implied.

      "foo bar"
      'foo bar'

   Means that the documents MUST contain the exact phrase "foo bar", in
   that order.


   Note: in the SEARCH example in section 2.0, it may seem like the "+"
   signs in "foo+bar+baz" are search syntax. They are not; the terms are
   "url-encoded", meaning spaces are encoded as "+" and other non-word
   characters are encoded as %HEX where HEX is the hexadecimal ASCII
   code. Therefore "foo +bar" would appear as "foo+%2Bbar"

   Implementations may honor other operands as they see fit.



2.3. URL caching

   Like a SEARCH, a URLCACHE does not necessarily return the data -- it
   can return the content-hash of the data and a list of the last X
   nodes to request that data, where X is implementation-specific. If
   none of the listed nodes have the data, the requesting node may then
   request it from the original node.

   - Request -

      URLCACHE fb9642989a2222305832a5d9c29b34... Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c... a5754......
      Last-key: 1e400...
      Url:  http://foo.example.com/some/page.html
      Port:  4666

   - Response -

      Dowser/0.1  300 Multiple choices
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...

   - Or -


Bueno                                                          [Page 10]


Internet-Draft                 Dowser/0.1                       May 2004



      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      ... (file data) ...




2.4. Content caching

   This can be page caches, indexes, or really anything. A client sends
   a request with the content-key as the path:

   - Request -

      CACHE 3fdb13677b10691debb3909dd917b00ee751115a Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c...  de34...
      Last-key: 1e400...
      Url:  http://foo.example.com/some/page.html
      Port:  4666

   - Response -

      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      ... (file data) ...

   - Or -

      Dowser/0.1 404 not found
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...

## todo: content-length? simultaneous downloading? the Range: & Content-length: headers?




2.5. CRAWL

   So we know how to find nodes and content. The last question is: where
   does the content come from? And the indexes?



Bueno                                                          [Page 11]


Internet-Draft                 Dowser/0.1                       May 2004



      CRAWL http://foo.example.com/foo/bar/baz.html Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754......
      Last-key: 13333...
      Referer: http://foo.example.com/foo/index.html
      Port: 4666
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a


   - Response -

      Dowser/0.1 202 Accepted
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...


   - Or -

      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   When a node crawls a page, it usually has links on it. If the hash of
   those second-level URLs are not in the node's range, it sends a CRAWL
   annoucement to the nodes that do handle that range, so they can
   continue the crawl.

   A node SHOULD first check (via URLCACHE) if a cached copy exists
   before downloading a page directly.

   Implementations SHOULD respect the "robots.txt" protocol for nice
   spiders.

   ##todo: so where do we START crawling? that is not part of the
   protocol because it could be a network of office machines sharing
   their spreadsheets, or whatever -- need to say that in a nice way.



2.6.INDEXADD

   An INDEXADD is an informational message to a node that a particular
   URL contains a keyword inside the server's range.


      INDEXADD 09dd917b00ee751115a3fdb13677b10691debb39 Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000


Bueno                                                          [Page 12]


Internet-Draft                 Dowser/0.1                       May 2004


      Node-Id: 12222...  a5754......
      Last-key: 13333...
      Term:  foo bar baz
      Url: http://foo.example.com/foo/bar/baz.html
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
      Port: 4666

##todo: positions, counts, tdf*idf?

   - Response -

      Dowser/0.1 202 Accepted
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34..
      Last-key: 49999...

   - Or -

      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34..
      Last-key: 49999...

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   The Term: header contains what words the server may be interested in,
   separated by tabs. An INDEXADD message implies that the requesting
   node has a copy of the document. Listening nodes should retrieve it
   from that node via a CONTENTCACHE message. This puts a cost on adding
   words to the index.


3. Announcing

   When a new node joins the network, it must first be provided with a
   list of "well-known" nodes and a ring-id. Knowing nothing else about
   the network topology, it generates a node-id and sends a NODEFIND
   request on that key to those well-known nodes for its own key:


      NODEFIND 12222...0 Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754...
      Last-key: 12222...
      Port:  4666


   This eventually leads the new node to its "neighbors" on the ring.
   The new node's range SHOULD initially be of the same length as its
   left-hand neighbor, but MAY grow or shrink according to how much
   capacity it has at hand, but the Last-key MUST NOT be less than the
   node-id of its closest neighbor.


Bueno                                                          [Page 13]


Internet-Draft                 Dowser/0.1                       May 2004




4. Redundancy

   All nodes may not be available at all times, and the "overlap" is not
   guaranteed, requiring some fiat redundancy. So:

   Each node assumes a range starting with its nodeid and That defines
   its "core" range. It SHALL also handle an "auxilary" range of equal
   length, calculated by adding 8 to the first digit. The addtion does
   not carry. For example:

   Node A "Core" range:

      9000000000000000000000000000000000000000 to
      c000000000000000000000000000000000000000

   Auxilary range:

      1000000000000000000000000000000000000000 to
      4000000000000000000000000000000000000000



5. Optional headers

   To save on bandwidth, nodes MAY compress their responses. A
   requesting node MAY indicate that it can accept compressed data with
   this optional header:

      Accept-Encoding: gzip, deflate

   If a server compresses a response, it MUST inidcate that with this
   header:

      Content-encoding: gzip


   The server MAY also send this optional header to SEARCH responses to
   help guide the user:

      Related-terms: meta, syntax, jargon



6. Error conditions

   Many error codes from HTTP/1.1 apply to Dowser.

   500 Internal Server error

   501 Not Implemented

   503 Server busy (Service unavailable)



Bueno                                                          [Page 14]


Internet-Draft                 Dowser/0.1                       May 2004


   505 HTTP Version Not Supported

   400 Bad request

   404 Not Found

   412 Precondition failed

      When a node's id and seed don't check out, for instance. This is
      different from a mal-formed request which should be responded to
      with a 400. Also, nodes may keep a "blacklist" of node-ids and/or
      IPs that they do not wish to do business with (See SPAM). In the
      case of blacklists, the server MAY simply close the connection,
      but it's polite to give some error information.

      Implementations MAY elect to withold the Ring-id header in this
      case, for security.



7. Notes on implementation

   A reference implementation is currently in alpha testing and should be
   released by mid-summer 2004 [IMPL]



7.1 Ranking

   There is not a lot of ranking data returned by SEARCH requests. The
   idea is to return somewhat raw indexes to the client and let it sort
   things out through intersecting multiple responses and its own
   biases.

   One time-saving trick may be employed by implementations: recall in
   the SEARCH example, all of the search terms were sent, even though
   it's unlikely one node will handle all of them. The reason is to
   allow nodes to pick out terms that tend to accompany the terms they
   are responsible for. The node may then keep an extra index of those
   accompanying terms and use it to trim and/or rank the results sent
   back.


7.2 Proxying

   If a relatively new node receives a request for a key that is in its
   range but not its datastore, it may create a new request to its
   upstream neighbor for it, on the theory that it may still be in that
   node's datastore. It then keeps a copy while forwarding it to the
   original requesting node. This allows new nodes to build on the work
   of older ones.

   Implementations might want to work out semantics for long-lived
   connections on the lines of HTTP/1.1 Keep-alive headers.



Bueno                                                          [Page 15]


Internet-Draft                 Dowser/0.1                       May 2004



7.3 TCP vs UDP

   Several methods (NODEFIND, CRAWL, INDEXADD) have two interesting
   properties: they will be used often, and involve a very small amount
   of data. It is tempting to implement those parts over UDP and the
   rest over TCP, or to come up with some tricks to do it all over UDP.


7.4 Ping/Pong

   Neighboring nodes should be in the habit of pinging their nearest
   neighbors, if they have not heard from them in a while. A "ping" can
   take the form of a NODEFIND using the neighbor's node-id.


7.5 MHTML

   HTML documents of today tend to be comprised of many files including
   graphics, stylesheets, etc. It may be useful for implementations to
   offer "MHTML" versions of cached documents that contain these
   supporting files in one MIME-encoded file. If a client supports
   MHTML, it should include it in the Accept: header. Common-sense
   restrictions apply, such as not including executable scripts or
   plugins and files that do not come from the same domain.


7.6 Spam

   The network described above may one day contain thousands of nodes.
   Thousands of nodes implies thousands of users looking for
   information, which nowadays means spam. An implementation may elect
   to deal with it in the way email clients do, with filters built from
   analysis of links classified by the user as "spam" or "useful". Even
   in the abscence of spurious traffic a good Bayesian filter may help
   tailor results.

   A sophisticated method of injecting targeted spam involves generating
   node-ids until one pair is "close enough" to the hash of a desired
   keyword, where "close enough" is matching the first 4 or 5 digits.
   Even thousands of nodes will be sparse in the keyspace, so an
   exhaustive search is not necessary.


7.7 Indexing

   A large sample of web documents suggests that there is a fairly
   constant number of terms, something on the close order of 14 million
   [GOOGLE].

   An unpublished survey of a large sample of search engine queries
   suggests that there is also a fairly constant number of unique terms
   searched, on the close order of 1 million. This is very interesting.



Bueno                                                          [Page 16]


Internet-Draft                 Dowser/0.1                       May 2004


   Suppose we give more "weight" to words found in the query stream over
   those that are not. As we've seen in the adventures of Internet
   search engines over the last few years, documents often lie about
   their contents, but users rarely lie about their desires. They are
   vauge, perhaps, but not false.

   . . . that is, until the query stream becomes a thing of value. A
   clever person might include a popular word and a made-up one into his
   or her web page, then send thousands of queries with both of those
   words to artificially boost the correlation. I have a wonderful proof
   describing how this gambit might be defeated, but unfortunately there
   is not enough room to include it here.



8. Acknowledgements

   This specification builds on the years of work put into designing and
   implementing Dr. Berner-Lee's HTTP protocol by many thousands of
   people around the world. You know who you are.

   Ideas and algorithims were stolen liberally from Chord, Circle, and
   Freenet.  Lots and lots of ideas from the original distributed search
   projects, Harvest and WebAnts.

   Special thanks to Richard Vasquez and Thomas Lackner for their
   devious minds and help with early implementations.


9. References

9.1 Informative

   [GOOGLE] Brin & Page, "The Anatomy of a Search Engine", Stanford
      University, 1997

   [CHORD] Stoica, et al. "A Scalable Peer-to-peer Lookup Service for
      Internet Applications", Sigcomm 2001
      http://www.pdos.lcs.mit.edu/papers/chord:sigcomm01/

   [IMPL] Dowser P2P client, http://dowser.sourceforge.net

9.2 Normative

   [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
      Resource Identifiers (URI): Generic Syntax and Semantics", RFC
      2396, August 1998.

   [HTTP] Feilding, et al. "Hypertext Transfer Protocol -- HTTP/1.1",
      RFC2616, June 1999

   [SHA] Eastlake & Jones, US Secure Hash Algorithm 1 (SHA1), RFC3174,
      September 2001




Bueno                                                          [Page 17]


Internet-Draft                 Dowser/0.1                       May 2004


10. Security Considerations

10.1 Personal Information

   Dowser deals directly with people's searching habits. While spreading
   this information around is kind of the point, implementations SHOULD
   take care not to "leak" any more than neccessary. For instance, it
   may be tempting to skip the NODEFIND sequence and just issue SEARCHes
   or URLCACHEs, since they also return node-finding information. This
   sends people's search terms and requested urls to nodes that don't
   strictly need it, and SHOULD be avoided.

   Implementations can be expected to work closely with traditional HTTP
   applications, with their own privacy & security considerations.



10.2 Sensitive data

   Dowser was originally intended to deal only with publically
   accessible information, but it can also be used for sharing
   sensitive data among trusted computers, say, all the word processor
   documents on an office LAN. For these situations, implementations
   should use some sort of IP whitelisting or subnetting and make the
   user aware of the risks.


11. IPR & Copyright

    The IETF invites any interested party to bring to its
    attention any copyrights, patents or patent applications, or
    other proprietary rights which may cover technology that may be
    required to practice this standard.  Please address the
    information to the IETF Executive Director.

    Copyright (C) The Internet Society (2004). All Rights
    Reserved.

    This document and translations of it may be copied and
    furnished to others, and derivative works that comment on or
    otherwise explain it or assist in its implmentation may be
    prepared, copied, published and distributed, in whole or in
    part, without restriction of any kind, provided that the above
    copyright notice and this paragraph are included on all such
    copies and derivative works.  However, this document itself may
    not be modified in any way, such as by removing the copyright
    notice or references to the Internet Society or other Internet
    organizations, except as needed for the  purpose of developing
    Internet standards in which case the procedures for copyrights
    defined in the Internet Standards process must be followed, or
    as required to translate it into languages other than English.

    The limited permissions granted above are perpetual and will
    not be revoked by the Internet Society or its successors or
    assigns.



Bueno                                                          [Page 18]


Internet-Draft                 Dowser/0.1                       May 2004


    This document and the information contained herein is provided
    on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
    OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
    IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
    PARTICULAR PURPOSE.


12. Contact Information


   Carlos Bueno
   6231 SW 78th St., Ste. 20
   Miami, FL, USA 33143
   Email: carlos@bueno.org



This document expires on November 4, 2004
































Bueno                                                          [Page 19]