FIND Working Group                              Jeff Allen
INTERNET-DRAFT                                  Bunyip Information
<draft-ietf-find-new-cip-00.txt>                Systems, Inc.

                                                Patrik Faltstrom
                                                Tele2/Swipnet

              The Common Indexing Protocol (CIP)

Status of this memo
-------------------

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted
by other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or "work in progress."

Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any
other Internet Draft.

This Internet Draft expires November 13, 1996.

Introduction
------------

With the accelerating expansion of the availability of information of
all kinds on the Internet, search technologies have become more and
more important to users to help them turn the mountains of data into
gems of information. Technologies such as WAIS, Z39.50, and
the growth of ad-hoc interfaces based on HTML forms technology
bring the task of searching a single database (locally or remotely)
pretty much under control. A multitude of such databases and access
protocols/languages exist, and are serving users satisfactorily every
day.

In addition to the single databases available on the Net, we are also
seeing an explosion of massive databases created by automated indexing
tools, or (in rarer cases) by hand. Databases like Alta Vista,
Webcrawler, and Yahoo are aggregating vast amounts of information
into massive centrally maintained and accessed indexes of "everything
out there". By all accounts, this approach to managing massive amount
of widely distributed information is facing scaling problems (both
those based on automated technology, and those based on human
indexing), and is certainly not the end of the road for information
retrieval technologies.

Distributed database technology would seem to be the next step. While
much progress has been made with respect to managing searches in a
distributed environment, for a variety of reasons, this technology has
not come into use to make "searching the Internet" effective and
efficient.

The Common Indexing Protocol (CIP) is proposed as a mechanism for
distributing searches not only across several instances of a single
type of search engine, but across diverse search technologies.  The
CIP approach was originally proposed in the
context of enabling searches across multiple Internet white pages
(people) database technologies (e.g., Whois++, X.500, PH) with a view
to enabling the creation of a global directory of people.  In this
application domain, the raw information (personal data) is usually
maintained by an organization to which the person is related (for
purposes of data integrity and security).  When it became clear that
there was no one white pages technology that would be embraced by all
organizations, the goal of a global directory service had to be met by
a technology that would act across the diverse set of available
systems.

With CIP, we plan to provide a scalable, flexible scheme to tie
existing and future databases into distributed data warehouses which
can scale gracefully with the growth of the Internet. A CIP mesh can
be searched efficiently, and scales easily. The goal of this document
is to define an architecture and protocol to move indexing information
among existing and future servers. We will also explain how clients
will make use of this information to conduct efficient searches.

Architecture and Protocol
-------------------------

In this section we define the architecture of the system, as well as
the protocol used between conforming CIP servers. Finally, we describe
the issues involved from a client's point of view, when executing a
query.

Overview
--------

* CIP integrates multiple data-access protocols

CIP is a protocol meant to be spoken between servers (existing and
future) which provide a local database search. CIP is not a protocol
for finding data. CIP allows servers to pass hints among themselves
about where data is so that, in their native protocol, the servers can
refer clients to the place(s) where their query will yield
results. Due to these referrals, a query is "moved" towards parts of
the distributed database where it is more likely to be successfully
answered. This process is called query routing, and it's central to
effective, efficient searching. CIP makes query routing possible.

* Indexing defined

Indexing, in the context of CIP, is the process of gathering some
"reconnaissance information" which can be used to help route queries
to the servers most likely to have answers. The process of indexing a
body of data and passing that knowledge on to other parts of the
distributed system is an investment, of sorts. By spending a little
time behind the scenes distributing indexing information among
themselves, a group of servers that form a distributed database can
make sure that answering queries is only attempted by servers likely
to have a productive part of the answer. Using query routing like this
can avoid the wasteful floods of queries that must occur in a
distributed system with no indexing capabilities.

The data generated and passed among these servers is called an index
object.  An index object is not the same as the full database. It is
simply a set of "hints" on what kind of queries the index object
originator will be able to answer. Based on the index object from a
server, you cannot, in general, reconstruct that server's database.
(See Security Considerations, below, for a discussion of why this is
an important property for index objects to have.)

* Organization of index servers

A group of servers sharing index objects of a common type is called a
"mesh". A mesh, as the name implies, can be more complicated than a
simple tree. In general, it is simplest to think of the mesh as a
rough tree-shaped hierarchy, with servers near the root of the tree
having more knowledge about the rest of the tree than those nearer the
leaves. Leaf nodes (called "data servers") hold no indexing
information about other servers, but they export index objects based
on their own data.
Servers with some knowledge of the tree below them are called "index
servers", since they hold index objects of their children, and can do
query routing based on this knowledge.
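
As an illustration only (it is not part of the protocol, and the class
and field names are invented for this sketch), the following Python
fragment models a small mesh in which an index server holds one index
object per data server it polls:

    # Illustrative sketch only; class and field names are invented
    # here and are not defined by CIP.
    class MeshNode:
        def __init__(self, name):
            self.name = name
            self.children = []        # servers this node polls
            self.index_objects = {}   # DSI -> index object received

        def add_child(self, child, index_object):
            # An index server holds an index object for each server
            # it polls, keyed by that server's dataset identifier.
            self.children.append(child)
            self.index_objects[child.dsi] = index_object

    class DataServer(MeshNode):
        def __init__(self, name, dsi):
            super().__init__(name)
            self.dsi = dsi            # dataset exported by this leaf

    # A tiny two-level hierarchy: one index server polling two leaves.
    root = MeshNode("index-server")
    leaf1 = DataServer("data-server-1", "1.3.6.1.4.1.1375.1")
    leaf2 = DataServer("data-server-2", "1.3.6.1.4.1.1375.2")
    root.add_child(leaf1, "<index object from leaf1>")
    root.add_child(leaf2, "<index object from leaf2>")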

The mesh is allowed to be less rigid than a strict tree because the
organization of the mesh needs to remain fluid. Different types of
data used for different applications may have divergent requirements
for index structure. Because the mesh need not be a strict tree, these
different structures may be overlaid on the same data servers. The
mesh is a simple solution to a hard problem. It's impossible to decide
on the "right" distributed architecture when faced with conflicting
requirements, so CIP allows multiple concurrent indexing
organizations.

* Implementation

CIP transactions take place on reliable data streams (currently, the
only implementation is for TCP). CIP servers can reside on any TCP
port, though the IANA assigned port number of XXX is common. It is
mostly an NVT-ASCII protocol, with the exception of the index objects
themselves. The format for index objects is defined separately from
CIP.

CIP's place in the World
------------------------

Based on the initial definition of the system above, we'll present an
example of what it means for CIP to be a backend protocol. Consider
the following arrangement of a client, and two servers:


  ----------   Query 1      ----------  API  -----------
  | Client | <----------->  |Server 1| <---> |CIP Impl.|
  ----------                ----------       -----------
          |                                   ^
          |                                   | Index Object moved
          |                                   | via CIP
          |    Query 2      ----------  API  -----------
          |-------------->  |Server 2| <---> |CIP Impl.|
                            ----------       -----------

                      Figure 1.


In this example, we are dealing with a very small mesh: so small, in
fact, that it consists of exactly two servers. Server 1 (S1) polls
Server 2
(S2) via CIP. S1 holds not only all the data loaded into it by its
administrator, but also a copy of the index object that S2 sent it.

The client begins a search by asking S1 for a particular item. This
transaction (labeled "Query 1", above) takes place in the native data
access language and protocol of S1. Server 1 could be a Whois++
server, an LDAP server, or some other server yet to be developed. The
query could be for any kind of data appropriate for the server and its
protocol.

During the course of executing the query, S1 checks both its local
data and the index object it has on hand from S2. Based on this
search, it can generate a result immediately if it has
the data locally, or it can generate a referral to the data if the
index object it searches indicates the result may be on S2.

The client may or may not chase down the referral, depending on how it
is programmed, or on what the user decides to do next. An important
property of CIP is that the client side is responsible for tracking
referrals. For clients which are not capable of understanding
referrals (because they predate the referral mechanism), a proxy might be
implemented on the server-side to chase referrals. However, this is
not a CIP protocol issue; the proxy server would be specific to the
native protocol in use on that server.

Note that the CIP implementation and the server implementation are
shown in separate "boxes", even though they need not, as the figure
might suggest, be separated into different processes. Instead, CIP is
envisioned to be a
service that can be added into a system with minimal modification. The
work CIP does has been carefully abstracted from the work a
conventional database server does so that the two can be connected in
a straightforward manner via an implementation-defined API.

The CIP Dataset
---------------

For the purposes of making and managing CIP index objects, it is
important to have a unique way to identify particular databases. Each
database is called a dataset. A server may contain one or more
datasets. Each dataset can be indexed in one or more ways.

To allow CIP servers to talk about datasets in a meaningful way, every
dataset in the Internet which is indexed by CIP must have a
network-wide unique dataset identifier, or DSI. To avoid requiring the
IANA or other authority to worry about trying to name every database on
the net, we have chosen to use the ISO Object Identifier (OID) for
DSIs. While OIDs are a bit unwieldy for humans to manage and
understand, their storage and transfer format is easy to handle.
They are also essentially infinitely extensible, though certain limits
on CIP protocol message sizes put practical limits on the OIDs to make
the protocol easier to implement.

The part of the OID-space used by CIP is the same part used by SNMP
enterprise numbers. Thus, to begin naming CIP datasets within an
organization, it is necessary to get an enterprise number assigned to
you by the IANA. For help on getting enterprise numbers, see [IANA].

In addition to describing datasets with OIDs, any place a DSI is
given, a DSI description can be added. The description is a free-form
text field intended to help humans understand the more
computer-friendly OIDs.
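
For illustration, a DSI might be formed by appending a locally chosen
suffix to an organization's enterprise number. The helper below is a
hypothetical sketch, not part of CIP; the enterprise number 1375 is
the one that appears in the protocol examples later in this document.

    # Hypothetical helper, not defined by CIP: build a DSI string
    # from an IANA enterprise number and a locally assigned suffix.
    def make_dsi(enterprise_number, local_part):
        return "1.3.6.1.4.1.%d.%s" % (enterprise_number, local_part)

    # The first dataset registered under enterprise number 1375:
    print(make_dsi(1375, "1"))      # -> 1.3.6.1.4.1.1375.1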

The Index Object
----------------

Since not all data can be indexed the same way, CIP allows for many
different kinds of index objects. All of them share a common container
and general format. The exact derivation and specific contents of an
index object is defined by the index policy, which is defined by CIP
users who are interested in applying the protocol to a new domain of
data. A domain of data is the set of all data with common needs with
respect to indexing. Digital pictures, sound, and text would certainly
fall into separate domains, since they have such vastly different
needs with respect to searching. More subtly, full text databases and
record-oriented databases might fall into different domains, depending
on how one chose to index them. Because of this, a given dataset might
have one or more index objects associated with it; one index object
for every domain.

A concrete example lies in the current use of CIP with Whois++. These
servers exchange a form of index object called a centroid. A centroid
is a list of tokens in which each token has associated with it the
attribute and template type in which it appeared in the original
data. An index object definition must include a specification of how
the index is derived, and how it is formatted for transport. The
Whois++ specification defines the derivation as the set of unique
tokens (collated by attribute and template type) and the format as a
nested set of attribute/value pairs.  For an exact specification of
the Whois++ centroid, see [RFC-1913].

When an index is passed between two servers, the domain is specified
by the polling server. Thus, the server which is about to attempt to
receive and make sense of the index object specifies what format it is
expecting. If the sending server does not have an index object
available for the specified domain, nothing will be transferred. In
normal operation, this should not happen, since polling relationships
are set up by server administrators, who should know what domains the
potential peer provides index objects for. However, this mechanism
leaves the door open for research on cross-domain transfer of indexing
information. Note also, that it is possible within the existing
protocol for a server to export index objects for two different
domains based on the same data.

The index is packaged for transport in a container which looks like
this:

    # INDEX-CHANGES
     Version-Number: 3.0
     Index-Type: <index object type-name>
     DSI: 1.3.6.1.4.1.<enterprise-number>.<local part of OID>
     DSI-Description: <ASCII string describing the dataset>
     Base-URI: <base url>
     Authentication-Type: Password
     Authentication-Data: xxxxxxxx
              ...                                 ...
              ... domain-specific parts come here ...
              ...                                 ...
    # END INDEX-CHANGES

There are two parts to the index container. The first is the
domain-independent part, where meta-information defined by the core
CIP protocol is stored.

The domain-independent information in the index object container includes:

Index-Type:
     This is the case-insensitive name of the index object type. The
     only currently defined type name is "CENTROID". Experimental index object
     types should start with "X-". New type names are defined by the
     specifications written for new indexing objects.

DSI:
     This is the DSI for the dataset from which this index object was created.
     See "The CIP Dataset" above for information on where DSIs come from.

DSI-Description: (optional)
     A free-form human-readable line describing the dataset.

Base-URI: (optional)
     This is a URI (Uniform Resource Identifier) which can be used
     (augmented by an actual search string) to find items in a given index
     object. For a more complete description of the Base-URI and its use
     when preparing referrals, see "Navigating the Mesh" below.

After the domain-independent part comes the index object itself. The
specific formatting of the index, as well as the rules governing how
an index object is derived from a given dataset are defined separately
from CIP as domain-specific extensions. The first existing object
specification is for the Whois++ "centroid" index object type. See
"Future Directions" below for more information on future research into
alternative domains and cross-domain index management.
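
As a non-normative sketch of how the container might be assembled, the
Python function below simply reproduces the textual layout shown
above; the function name and argument names are invented for this
example.

    # Non-normative sketch: serialize an INDEX-CHANGES container in
    # the textual layout shown above. Attribute names follow the
    # example; everything else is invented for illustration.
    def build_index_changes(index_type, dsi, body, dsi_description=None,
                            base_uri=None, auth_type=None, auth_data=None):
        lines = ["# INDEX-CHANGES",
                 " Version-Number: 3.0",
                 " Index-Type: %s" % index_type,
                 " DSI: %s" % dsi]
        if dsi_description:
            lines.append(" DSI-Description: %s" % dsi_description)
        if base_uri:
            lines.append(" Base-URI: %s" % base_uri)
        if auth_type:
            lines.append(" Authentication-Type: %s" % auth_type)
            lines.append(" Authentication-Data: %s" % auth_data)
        lines.append(body)            # domain-specific index data
        lines.append("# END INDEX-CHANGES")
        return "\n".join(lines)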

In a future version of this Internet Draft, the issues involved in
incremental updates will be addressed. There is still some contention
as to whether incremental updates should be domain-dependent or
domain-independent.

Updating the mesh - transferring an index
-----------------------------------------

When an index object is transferred, a mixture of pushing and polling
is used. A server whose index object has changed must inform all of
its mesh neighbors that they
should fetch an update to their current index objects. The command
sent is called a Data-Changed command.  It is only ever sent from the
_polled_ server to a _polling_ server.

After this command is transferred, the polling server can choose when
it wants to fetch the new index object. Existing implementations pick
up the index object immediately, but there are provisions in the
protocol to allow the polled server to suggest a low-traffic period
during which it would prefer to handle the poll.

A polled server can send several Data-Changed commands to the same
server even if that server has not yet fetched a new index. Real-time
updates are strongly discouraged; that is, it is best not to send a
change notice for every change made in the data, and instead to
accept a small amount of lag in the system in the name of lower
polling overhead.

The Data-Changed command is only a _notification_ that a newer index
object is available. Only once the polling server connects back to the
polled server does the index object get transferred.

The Data-Changed command
------------------------

The Data-Changed command includes the following data, for example:

      # DATA-CHANGED
       Version-Number: 3.0
       Modification-Date: 199603041625
       DSI: 1.3.6.1.4.1.1375.1
       Host-Name: pollee-hostname
       Host-Port: 63
       Best-Time-Next-Poll: 199603050100
       Window-Size: 3600
      # END

Notice that the Modification-Date and Best-Time-Next-Poll are both
given in GMT, and the Window-Size in seconds. This window for polling
is only a recommendation from the polled server's point of view; the
polling server may, because of its own workload, have to poll
whenever it can.

After a Data-Changed command is sent from one server to another,
the command is acknowledged, and the connection is closed.
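
As an illustration of how a polling server might schedule its poll
within the advertised window, the sketch below parses the GMT
timestamp format (YYYYMMDDHHMM) and window size (in seconds) used in
the example above; the random spreading of polls across the window is
a local policy assumed for this sketch, not something mandated by CIP.

    # Illustrative only: choose a poll time inside the window
    # advertised in a DATA-CHANGED command. Timestamps are
    # YYYYMMDDHHMM in GMT; the window is given in seconds.
    from datetime import datetime, timedelta
    import random

    def choose_poll_time(best_time_next_poll, window_size):
        start = datetime.strptime(best_time_next_poll, "%Y%m%d%H%M")
        # Spread polls randomly across the window so that all polling
        # servers do not connect at the same instant (a local policy).
        offset = random.uniform(0, window_size)
        return start + timedelta(seconds=offset)

    print(choose_poll_time("199603050100", 3600))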

Polling
-------

The Poll command is the command that is sent from the polling server
to the polled server when it wants to request a new index. The Poll
command is normally only sent as a response to one or more earlier
Data-Changed commands sent to the polling server.

The format of the POLL command is as follows (an example):

     # POLL
      Version-Number: 3.0
      Character-Set: UNICODE-1-1-UTF-8
      DSI: 1.3.6.1.4.1.1375.1
      DSI-Description: Bunyip
      Type-Of-Poll: CENTROID
      Tokenization-Type: Tokens
     # END

The value of the Type-Of-Poll attribute implies some extra
attributes; in the case of CENTROID, the extra attribute is
Tokenization-Type, which gives the tokenization type the polling
server requests.

Navigating the mesh
-------------------

With the CIP infrastructure in place to manage index objects, the only
problem remaining is how to successfully use the indexing information
to do efficient searches. CIP allows query routing, which is
essentially a client activity. A client connects to one server, which
redirects the query to the servers with the answer. This redirection
message is called a referral.

* The Referral

The concept of a referral and the mechanism for deciding when they
should be issued is described by CIP. However, the referral itself
must be transferred to the client in the native protocol, so its
syntax is not directly a CIP issue. Recall the example in Figure
1. The server S1 generates a referral, directing the client to contact
server S2 for more results. The mechanism for deciding this referral
needed to be made resides in the CIP part of the server. The mechanism
for generating and sending that referral to the client resides in the
server itself.

A referral is made when a search against the index objects held by the
server shows that there may be hits available in one of the datasets
represented by those index objects. There may be no more than one
referral per dataset. If there is more than one index object (each of
a different type) for the same dataset, only one of them will generate
a referral.

Though the format of the referral is dependent on the native protocol
of the CIP server, the content of the referral is constant across all
protocols. At the least, a DSI and a URI must be returned. The DSI is
the DSI associated with the dataset which caused the hit. This must be
presented to the client so that it can avoid referral loops. The other
required piece is a URI. In general, this URI provides a compact way
to return the hostname and port number that the client is being
referred to. The URI was chosen for this field because it can hold
additional information as needed. When an index object
container is received with a Base-URI attribute in it, referrals based
on that index will use that URI instead of the default, which is
generated from the hostname and port associated with the index object.
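
One possible server-side rule for producing the referral target is
sketched below; the record layout is invented for this example, and
the "whois://" form is only a placeholder, since the actual URI scheme
depends on the native protocol of the referred-to server.

    # Illustrative sketch: choose the URI returned in a referral for
    # a dataset whose index object produced a hit. CIP itself only
    # requires that a DSI and a URI be returned, in the server's
    # native protocol; the record layout here is invented.
    def referral_for(index_record):
        if index_record.get("base_uri"):
            uri = index_record["base_uri"]
        else:
            # Default: build a URI from the host and port associated
            # with the index object ("whois://" is a placeholder).
            uri = "whois://%s:%d" % (index_record["host"],
                                     index_record["port"])
        return {"DSI": index_record["dsi"], "URI": uri}

    print(referral_for({"dsi": "1.3.6.1.4.1.1375.1",
                        "host": "pollee-hostname", "port": 63}))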

The additional information in the Base-URI may be necessary for the
server receiving the referred query to correctly handle it. A good
example of this is an LDAP server, which needs a base X.500
distinguished name from which to search. When an LDAP server sends a
CENTROID up to a CIP indexing server, it sends a Base-URI along with
the name of the X.500 subtree for which the index was made. When a
referral is made, the Base-URI is passed back to the client so that it
can pass it to the original LDAP server.

As usual, in addition to sending the DSI, a DSI-Description attribute
can optionally be sent. Because a client may attempt to check with the
user before chasing the referral, and because this string is the
friendliest identifier CIP has to offer, it should be included in
referrals when possible. Of course, the DSI-Description is only
available for inclusion in referrals if it was sent to the index
server as part of the index object.

* Cross-protocol Mappings

Each data access protocol which uses CIP will need a clearly defined
set of rules to map queries in the native protocol to searches against
an index object. These rules will vary according to the data
domain. In principle, this could create a bit of a scaling difficulty;
for N protocols and M data domains, there would be N x M mappings
required. In practice, this should not be the case, since some access
protocols will be wholly unsuited to some data domains. Consider, for
example, an LDAP server trying to make a search in an index object
composed from Web pages. What would the results be? How would you even
make sense of the incoming query or the outgoing results?

However, as pre-existing protocols are connected to CIP, and as new
ones are developed to work with CIP, this issue must be examined. In
the case of Whois++ and the CENTROID index type, there is an extremely
close mapping, since the two were designed together. When hooking LDAP
to the CENTROID index type, it will be necessary to map the attribute
names to attribute names which are already being used in the CENTROID
mesh. It will also be necessary to tokenize the LDAP queries under the
same rules as the CENTROID indexing policy, so that searches will take
place correctly.
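
As a toy illustration of such a mapping (the attribute names below are
invented and carry no significance), an LDAP front end might translate
its attribute names before searching a CENTROID index:

    # Toy illustration only: map LDAP attribute names to the
    # attribute names in use in a CENTROID mesh before searching the
    # index. The names below are invented for this example.
    LDAP_TO_CENTROID = {
        "cn":   "Name",
        "mail": "Email",
        "o":    "Organization",
    }

    def map_ldap_attribute(ldap_attribute):
        return LDAP_TO_CENTROID.get(ldap_attribute, ldap_attribute)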

* Moving through the mesh

From a client's point of view, CIP simply pushes all the "hard work"
onto its shoulders. After all, it's the client which needs to track
down the real data.  While this is true, it's very misleading. Because
the client has control over the query routing process, the client has
total control over the size of the result set, the speed with which
the query progresses, and the depth of the search.

The simplest client implementation simply provides referrals to the
user in a raw, ready-to-reuse form, without attempting to follow
them. For instance, one Whois++ client, which interacts with the user
via a Web-based form, simply makes referrals into links. Encoded in
the link via the HTML forms interface GET encoding rules is the data
of the referral: the hostname, port, and query. If a user chooses to
click on the referral, they execute a new search on the new host. A
more savvy client might present the referrals to the user and ask
which should be followed. And, assuming appropriate limits were placed
on search time, bandwidth usage, etc., it might be reasonable to
program a client to follow all referrals automatically.

However, when following all referrals, a client must show a bit of
intelligence.  Remember that the mesh is defined as an interconnected
graph of CIP servers. This graph may have cycles, which could cause an
infinite loop of referrals, wasting the servers' time and the client's
too. When faced with the job of tracking down all referrals, a client
must use some form of mesh traversal algorithm. Such an algorithm
has been documented for use with Whois++ in [RFC-1914]. The same
algorithm can be easily used with this version of CIP. In Whois++ the
equivalent of a DSI is called a handle. With this substitution, the
Whois++ mesh traversal algorithm works unchanged with CIP.
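
A minimal client-side traversal along these lines, tracking visited
DSIs to break referral loops, might look like the sketch below; the
search function is a placeholder for a native-protocol search and is
not part of any defined API.

    # Minimal sketch of a client chasing all referrals while avoiding
    # referral loops by remembering which DSIs it has already visited.
    # search_server() is a placeholder that is assumed to return a
    # (results, referrals) pair; it is not part of any defined API.
    def traverse_mesh(entry_uri, entry_dsi, query, search_server):
        visited = set()
        to_visit = [(entry_uri, entry_dsi)]
        results = []
        while to_visit:
            uri, dsi = to_visit.pop()
            if dsi in visited:
                continue              # loop detected; skip referral
            visited.add(dsi)
            hits, referrals = search_server(uri, query)
            results.extend(hits)
            for ref in referrals:     # each carries a DSI and a URI
                to_visit.append((ref["URI"], ref["DSI"]))
        return results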

Finally, the mesh entry point (i.e. the first server queried) can have
an impact on the success of the query. To avoid scaling issues, it is
not acceptable to use a single "root" node and force all machines to
connect to it. Instead, clients should connect to a reasonably well
connected (with respect to the CIP mesh, not the Internet
infrastructure) server. If no match can be made from this entry point,
the client can expand the search by asking the original server who
polls it. In general, those servers will have a better "vantage point"
on the mesh, and will turn up answers that the initial search
didn't. The mechanism for dynamically determining the mesh structure
in this way exists, but is not documented here for brevity. See [RFC-1913]
for more information on the POLLED-BY and POLLED-FOR commands.

CIP in the real world
---------------------

Much of what we have discussed here appears to simply form a framework
for future work. While it is true that there are many productive
directions to go with CIP from here, CIP is already in use in at least
one application (Whois++), and it is enlightening to explore another
imminent application (Web indexing).

* Whois++ and CIP

CIP evolved out of work originally done on the Whois++ directory
service system.  Though it was originally designed for directory
service tasks, the Whois++ indexing service was abstracted into what
is now being developed as CIP. Thus, an early version of CIP is alive
and in real use today in the form of Whois++ clients and servers.

Whois++ deals with records that are called templates. A template is an
ordered list of attribute/value pairs. Multiple pairs with the same
attribute are allowed in a given template. Typical search requests on
a directory service revolve around finding the template with a certain
word, or set of words in it. For instance, the search for "Jeff and
Allen" matches all templates with both of those words in any
attribute. In addition to unconstrained searches, it's also useful to
constrain a keyword to a single attribute name, i.e. "Jeff and Allen
and company=Bunyip".

To support this kind of searching in a distributed context, Whois++
servers exchange centroids. To create a centroid, first the records
are broken up into tokens. The data is broken on spaces and other
punctuation characters to form these tokens. The centroid is a list of
unique tokens in the database sorted and separated by template name
and attribute name. The centroid is case sensitive.
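
For illustration, a centroid's token list could be derived roughly as
follows; the punctuation handling and the in-memory representation are
simplified for this sketch, and [RFC-1913] remains the actual
specification.

    # Rough illustration of centroid derivation: break each value
    # into tokens on whitespace and punctuation, then keep the set of
    # unique tokens per (template, attribute). Collation and the
    # transfer format are simplified; see RFC-1913 for the real rules.
    import re

    def make_centroid(records):
        centroid = {}             # (template, attribute) -> token set
        for template, attributes in records:
            for attribute, value in attributes:
                tokens = re.split(r"[\s.,;:()]+", value)
                key = (template, attribute)
                centroid.setdefault(key, set()).update(
                    t for t in tokens if t)
        return centroid

    records = [("USER", [("Name", "Jeff Allen"),
                         ("Organization", "Bunyip Information Systems")])]
    print(make_centroid(records))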

The design of the centroid was predicated on the searching and privacy
requirements.  It was necessary to allow keyword searching in a set of
attributes and values. It was necessary to constrain keywords to
particular attributes and/or values. It was required that the centroid
not divulge the original database (thus the requirement for collation
and duplicate removal). Though the terms had not been coined yet, the
centroid designers were making the first index object by defining the
first set of indexing policies. All that was left was to define the
transfer format, and the distributed system was ready to go to work.

* CIP and the Web

Whois++ is a "textbook" case of applying CIP, since the two were
designed hand-in-hand. What happens, one may ask, when CIP meets a new
problem? Below, we discuss at the same level the design of a
hypothetical index object called the weboid, used to index World Wide
Web (WWW) content.

We start by exploring the requirements for the data domain to see if
it is like any other data domains which are already being successfully
indexed by CIP. If we can use one of those sets of indexing policies,
our work will be done, and there will be no need for the
weboid. First, web searching requires keyword searches. Next, metadata
in the form of attribute/value pairs must be searchable (i.e. author,
title, filetype). So far, it looks like we are simply talking about
reinventing the centroid here. The final two requirements are what
tell us we need a new indexing object.  We need to be able to do
searches on adjacency (i.e. "Jeff near Allen"), and we don't care too
much how close the index is to the actual data, though we'd like to
see some compression from the indexing process, or else we'll be faced
with a system as unworkable as the systems in use today.

There are some other observations researchers in the web indexing field
have made that come to bear on our problem. First, very rare words
(i.e. misspellings, acronyms, and exceedingly long words) will not be
useful in the finished database. If a word is unique enough to occur
only once in 1 million web pages, how likely is it to be used by a
user during a search? Also, various stemming algorithms can greatly
reduce the size of the index while increasing the yield of queries,
making both administrators and users happy.

Because of the last two requirements, and because of the special
characteristics of the web indexing data domain, we choose to make a
new indexing type. The weboid is born! The actual design of the weboid
is left as an exercise to the reader.

Future Directions
-----------------

CIP is a work in progress. It developed out of the indexing component
of Whois++ [RFC-1913]. It became clear that the centroid-passing ideas
originally used in Whois++ would be useful in other realms if they
could be abstracted out of the Whois++ design. Though much progress
has been made toward this end, work still needs to be done.

* Multi-domain index management issues

Research needs to be done into the topic of managing the interface of
multiple CIP data domains. First, the question must be answered, "Is
it useful to attempt to merge index objects from multiple domains?"
Because data that is useful in many ways and via many different search
techniques is likely to be indexed with CIP, it would seem important
to make full use of the CIP meshes, instead of forcing them to remain
forever isolated from one another.

However, even merging centroids with differing tokenization
algorithms appears quite challenging. It will be necessary to come to
grips with what it means to merge index objects from multiple domains,
and to understand under what conditions this can be reasonably
accomplished.

* Future Data Domains

As Internet content becomes more varied, the demands on search and
retrieval software do too. Because of the wide variation in data
that is proliferating on the Net, CIP is designed to be extensible. Of
course, this extensibility is only useful if we take the next step and
actually develop alternative indexing objects for CIP.

Work is progressing on common indexing formats for web documents [IIF,
DISW]. By applying CIP to web indexing using the example above (see
"CIP and the Web") as a starting point, a scalable Web indexing system
should soon be within our grasp.

Other challenging and exciting applications domains await, including
indexing pictures and sounds.

* Mesh Management

Currently, CIP servers perform their polling duties with respect to a
static configuration programmed by server administrators. There is no
support in the protocol to change these polling relationships; thus
all such configuration happens locally via implementation-specific
programs. In the future, we'd like to see mesh management take place
over the TCP connection itself, so that configuration can be
accomplished remotely. Of course, this will require advances in the
authentication methods used, as well as a protocol for the management
itself.

In the distant future, it's reasonable to imagine groups of CIP
servers "managing themselves". A group of CIP servers which have been
given the tools to measure the optimality of the mesh, goals to reach
for, and an algorithm to get them to the goal may be able to
dynamically reconfigure a mesh to minimize the amount of polling
overhead while improving client response time (by reducing the number
of referrals in the system).

* Domain Identification

Because CIP is now extensible, there needs to be a way for CIP peer
servers to enquire what data domains a particular server supports.
This should be a simple addition to the protocol.

Security Considerations
-----------------------

There are two distinct levels at which security must be discussed with
respect to CIP. First, we must explore the security necessary when
dealing with indexing data. There are also standard issues to be
explored with respect to protocol security.

* Secure Indexing

CIP is designed to index all kinds of data. Some of this data might be
considered valuable, proprietary, or even highly sensitive by the data
maintainer. Take, for example, a human resources database. Certain
public bits of data, in moderation, can be very helpful for a company
to make public. However, the database in its entirety is a very
valuable asset, which the company must protect. Much experience has
been gained in the directory service community over the years as to
how best to walk this fine line between completely revealing the
database and making useful pieces of it available.

Another example where security becomes a problem is for a data
publisher who'd like to participate in a CIP mesh. The data that
publisher creates and manages is the prime asset of the company. There
is a financial incentive to participate in a CIP mesh, since exporting
indices of the data will make it more likely that people will search
the publisher's database. (Making a profit from the search activity is
left as an exercise to the entrepreneur.) Once again, the index must
be designed carefully to protect the database while providing a useful
synopsis of the data.

One of the basic premises of CIP is that data providers will be
willing to provide indices of their data to peer indexing
servers. Unless they are carefully constructed, these indices could
constitute a threat to the security of the database. Thus, security of
the data must be a prime consideration when developing a new index
object type. The risk of reverse engineering a database based only on
the index exported from it must be kept to a level consistent with the
value of the data and the need for fine-grained indexing.

* Protocol Security

During the automated exchange of indexing information, there must be
provisions made for CIP servers to identify themselves to one
another. Though it's conceivable that a data server would be willing
to make its index available on an anonymous basis, most index objects
will be passed between servers participating in a well-defined,
statically configured indexing relationship.

In the existing implementations of CIP, only clear-text password
authentication is available. When a poll request is submitted, a
hostname and password are sent along. If the receiving server does not
recognize the poller, the index object will not be sent. Likewise,
when a "DATA-CHANGED" command is sent, it is ignored by the receiving
server unless the sender's authentication information is valid.

Clear-text passwords are clearly not an acceptable authentication
mechanism for use in the long term on the Internet. CIP either needs
to rely on lower level authentication (like that provided by parts of
IPv6) or it must be wrapped in a secure communication layer, such as
SSL. Due to the complexity of the cryptography field, especially with
respect to commercial interests, solutions to the authentication
weakness in CIP have not been attempted yet.

Acknowledgements
----------------

Generous thanks to Leslie Daigle, Erik Selberg, and Roland Hedberg for
comments on previous drafts of this paper.

References
----------

[DISW]          Erik Selberg, "DISW Query Routing Breakout Notes", June 1996.
                http://www.cs.washington.edu/homes/speed/disw-wu.html

[RFC-1913]      C. Weider, J. Fullton, S. Spero, "Architecture of the
                Whois++ Index Service", February 1996.

[RFC-1914]      P. Faltstrom, R. Schoultz, C. Weider, "How to Interact
                with a Whois++ Mesh", February 1996.

[INDEX500]      David Chadwick. "IndeX.500", May 1996.
                http://www.dante.net/pubs/dip/19/19.html

[CENTIPEDE]     T. Howes. "SLAPD and SLURPD administrator's guide".
                http://www.umich.edu/~rsug/ldap/doc/guides/slapd

[IIF]           Kevin Chang, Hector Garcia-Molina, Luis Gravano,
                Andreas Paepcke. "Internet Information Finding"
                http://www-db.stanford.edu/~gravano/standards

[IANA]          Internet Assigned Numbers Authority
                http://www.isi.edu/iana

Author contact info
-------------------

   Jeff R. Allen
   Bunyip Information Systems
   Suite 300
   310 Ste-Catherine St. West
   Montreal, Quebec   H2X 2A1
   Canada

   Phone: +1 514 875-8611
   Fax:   +1 514 875-8134
   EMail: jeff@bunyip.com


   Patrik Faltstrom
   Tele2/Swipnet
   BOX 62
   S-164 94 Kista
   Sweden

   Phone: +46-8-56264000
   Fax:   +46-8-56264200
   Email: paf@swip.net