Network Working Group                                               M. Wahl
INTERNET-DRAFT                                          Critical Angle Inc.
Expires in six months from                                  8 February 1997

                       A CIP-based Centroid Exchange for LDAP
                       draft-ietf-find-ldapc-00.txt


Status of this Memo

   This document is an Internet Draft.   Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas, and
   its working groups.  Note that other groups may also distribute working
   documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference material
   or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing  contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
   ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

   Please note that this document reflects experimental software, and is
   incomplete.  The underlying specification is likely to change slightly.

Abstract

   This document describes how an LDAP server (the supplier) may transmit,
   through an out-of-band email, index information or attributes of its
   naming context to another LDAP server (the consumer).  The consumer
   server will make use of this information when determining whether the
   supplier server is likely to have entries in that naming context which
   match a particular search filter.  This assists the consumer in
   processing subtree searches in distributed directories.

1. Goals

   The primary goal of this specification is to allow an LDAP-capable
   server with a large number of subordinate references to more efficiently
   perform a subtree search operation, without having to chain or refer
   sub-operations for each subordinate reference.

   Secondary goals of this specification are:

   - allow servers to efficiently handle all shapes of search filters in
     common use, even distressful searches like

        (|(cn=joe bloggs)(sn=joe bloggs)(uid=joe bloggs)
          (cn=*joe bloggs*)(sn=*joe bloggs*)(uid=*joe bloggs*)
          (cn~=joe bloggs)(sn~=joe bloggs)(uid~=joe bloggs))

    - not require any modifications to the internals of servers holding
      subordinate contexts;

    - allow the organizations which maintain these contexts to control bulk
      retrieval of the data, and to schedule when it may be retrieved;

    - ensure that it is possible to protect against unauthorized disclosure of
      bulk directory data while in transit, and provide some protection
      against spoofing attacks.

   The following are NON-goals of this version of the specification:

    - exchange of index information without prior agreement (e.g. trawling);

    - negotiation of agreements (separate document for this);

    - updates of non-complete naming contexts;

    - allowing the consumer to poll for updates;

    - alias entries and/or non-hierarchical models;

    - representation of access control or transfer of
      non-publicly-accessible attributes.

2. Introduction

   This document defines a centroid-based index for a subtree of the
   directory. It is carried by an update MIME object.

   The supplier server is an LDAP server which masters or shadows a naming
   context.  In this protocol it acts as a "simple CIP leaf server".

   The supplier server will at intervals mail the update to the consumer
   server.  The consumer server is a different LDAP server with a subordinate
   or cross reference to the supplier server, which makes use of this update
   to determine how to route queries.

   The centroid-based index consists of processed attribute values from
   entries. The supplier may send the complete attribute values, or if this
   would violate data protection laws, only approximate match codes of values,
   which are nonreversible.  As more processing is performed on values, the
   size of the index is reduced, as is its usefulness to the consumer.

   The specification allows for both total and incremental updates to be sent.

   This specification is intended for "push" environments, where there are many
   (tens of thousands of) naming contexts, and a small number (dozens) of
   index consumers.  The naming contexts are as a whole static; however, it is
   desirable for changes to take effect rapidly.

   (If the consumer were instead to start one poll per second to cover 100,000
   suppliers, then if a person moved from one naming context to another, in
   the worst case the index information would be out of date for more than a
   day, making that person unlocatable.)
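
   The worst-case figure above follows from simple arithmetic, sketched
   here in Python.  The variable names are illustrative only; the two input
   figures are the ones given in the text.

```python
# Worst-case staleness if a consumer polls its suppliers round-robin.
SUPPLIERS = 100_000        # naming contexts to cover (figure from the text)
POLLS_PER_SECOND = 1       # polling rate (figure from the text)

# Time needed to visit every supplier exactly once.
cycle_seconds = SUPPLIERS / POLLS_PER_SECOND
cycle_hours = cycle_seconds / 3600
cycle_days = cycle_seconds / 86_400
```

   A full polling cycle therefore takes 100,000 seconds, or roughly 27.8
   hours, which is why a change could go unnoticed for more than a day.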



   Thus it is not intended to represent index information in the directory
   itself.  The index information is expected to be of interest to only a
   very small portion of users of the directory, and for legal reasons should
   only be visible to authorized servers.  Furthermore the size of the index
   information is often proportional to the size of the rest of the DIT.

3. Agreement

   An agreement is established between the organization which administers the
   supplier and the organization which administers the consumer, which
   specifies how the servers will communicate. The agreement contains the
   following:

    - "version": The version of the agreement and the index type.  This
      specification describes the index type "x-ldap-centroid-1".

    - "baseobject": The Distinguished Name of the prefix entry of the
      supplier's subtree.

    - "scope": The subset of information in the supplier's subtree for
      which the update will provide index information. For this version of the
      specification, the scope is always "subtree": the base object and all
      entries down to the leaves of the tree, including any subordinate naming
      contexts.

    - "dsi": An OID which uniquely identifies the subtree and scope.

    - "supplier": The hostname and listening LDAP port number of the
      supplier server, as well as any alternative servers holding the same
      naming context, in case the supplier is unavailable.

    - "consumeraddr": This is a URI of the "mailto:" form, with the RFC
      822 email address of the consumer server.  Subsequent versions of the
      specification may allow other forms of URI, so that the consumer may
      retrieve the update via the WWW, FTP or a Common Indexing Protocol.

    - "updateinterval": The maximum duration in seconds between occurrences of
      the supplier server generating an update.  If the consumer server has
      not received an update from the supplier server after waiting this
      long since the previous update, it is likely that the index
      information is now out of date.

      A typical value for a server with frequent updates would be 604800
      seconds, or every week.  Servers whose DITs are only modified annually
      could have a much longer update interval.

    - "securityoption": Whether and how the supplier server should sign and
       encrypt the update before sending it to the consumer server.  Options
       for this version of the specification are

       "none": the update is sent in plaintext
       "PGP/MIME": the update is digitally signed and encrypted using PGP
       "Fortezza": the update is digitally signed and encrypted using Fortezza

      It is recommended that the "PGP/MIME" option be used when sensitive
      information is exchanged across public networks and both the supplier
      and consumer have PGP keys. The "Fortezza" option is intended for use in
      environments where security protocols are based on Fortezza-compatible
      devices.

    - Security Credentials: The long-term cryptographic credentials used for
      key exchange and authentication of the consumer and supplier servers, if
      a security option was selected. For "PGP/MIME", this will be the trusted
      public keys of both servers.  For "Fortezza", this will be the
      certificate paths of both servers to a common point of trust.

4. Content Type

   The update consists of a MIME object of type application/cip-index-object.

   The parameters are:
    - "type": this has value "x-ldap-centroid-1".

    - "dsi": the DSI from the agreement.

    - "base-uri": a set of URIs, each of the "ldap:" form, separated by spaces.
      In each URI, the hostname/portno must be distinct, and based on the
      "supplier" part of the agreement.  Each URI must have on the RHS the
      base object distinguished name.

   The payload is mostly textual data but may include bytes with the high bit
   set.  The quoted-printable content-transfer-encoding is recommended if
   there are any bytes with the high bit set; otherwise no transfer
   encoding is needed.

   This object may be encapsulated in a wrapper content (such as
   multipart/signed) or be encrypted as part of the security procedures.
   The resulting content will in this version of the specification be emailed.

   For example, an email without any security transformations may resemble:

   From: supplier@sup.com
   Date: Thu, 16 Jan 1997 13:50:37 -0500
   Message-Id: <199701161850.NAA29295@sup.com>
   To: consumer@consumer.com    <<-- from "consumeraddr" in the agreement
   Reply-To: supplier-admin@sup.com
   MIME-Version: 1.0
   Content-Type: application/cip-index-object; type=x-ldap-centroid-1;
    dsi=1.3.6.1.4.1.1466.85.85.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16;
    base-uri="ldap://sup.com/dc=sup,dc=com ldap://alt.com/dc=sup,dc=com"

   ...payload here...
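
   As a minimal sketch, the unsecured message above could be assembled with
   the Python standard library as follows.  The function names
   "needs_quoted_printable" and "build_update" are hypothetical, and the
   addresses and DSI are simply those of the example; signing, encryption
   and mail submission are not shown.

```python
import quopri
from email.message import Message


def needs_quoted_printable(payload: bytes) -> bool:
    """Return True if any byte of the payload has the high bit set."""
    return any(b > 0x7F for b in payload)


def build_update(sender, consumer, dsi, base_uris, payload: bytes) -> Message:
    """Wrap an index payload in an application/cip-index-object message."""
    msg = Message()
    msg["From"] = sender
    msg["To"] = consumer
    msg["MIME-Version"] = "1.0"
    # add_header() turns the keyword base_uri into the parameter "base-uri".
    msg.add_header("Content-Type", "application/cip-index-object",
                   type="x-ldap-centroid-1", dsi=dsi,
                   base_uri=" ".join(base_uris))
    if needs_quoted_printable(payload):
        msg["Content-Transfer-Encoding"] = "quoted-printable"
        msg.set_payload(quopri.encodestring(payload).decode("ascii"))
    else:
        # Pure US-ASCII payload: no transfer encoding is needed.
        msg.set_payload(payload.decode("ascii"))
    return msg


msg = build_update(
    "supplier@sup.com", "consumer@consumer.com",
    "1.3.6.1.4.1.1466.85.85.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16",
    ["ldap://sup.com/dc=sup,dc=com", "ldap://alt.com/dc=sup,dc=com"],
    "version: 1\r\n".encode("utf-8"),
)
```

   The transfer encoding is chosen per message, so a supplier that only
   ever emits US-ASCII never adds a Content-Transfer-Encoding header.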

   The payload is a series of CRLF-terminated lines. Each line is in the UTF-8
   encoding of the Unicode (ISO-10646 BMP) character set. No other character
   sets are permitted by this version of the specification.  Some supplier
   servers may only be able to generate the printable US-ASCII subset,
   but all consumer servers must be able to handle the full range of Unicode
   characters.



   The "x-ldap-centroid-1" index payload begins with a header section, which
   is followed by one or more attribute sections.  Each section is separated
   from the others by a blank line.  Multiple blank lines together are
   treated the same as a single blank line.

   Lines which begin with a '#' character are ignored.
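
   The section structure described above might be recovered by a consumer
   along the following lines.  This is only a sketch: the function name
   "split_sections" is hypothetical, and the payload is assumed to have been
   transfer-decoded and converted from UTF-8 to a string already.

```python
def split_sections(payload: str) -> list:
    """Split a decoded payload into sections, each a list of lines.

    Lines starting with '#' are ignored, and one or more blank lines
    separate sections.  The first section returned is the header.
    """
    sections, current = [], []
    for line in payload.split("\r\n"):
        if line.startswith("#"):
            continue                      # comment line
        if line.strip() == "":
            if current:                   # blank line closes a section
                sections.append(current)
                current = []
            continue
        current.append(line)
    if current:
        sections.append(current)
    return sections
```

   Because a '#' inside a token is always escaped with a backslash, a
   token line can never be mistaken for a comment line.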

4.1. Header Section

   The header section consists of one or more lines in "type:value" format.

   The following types are defined:

    - "version": This line must always be present, and have the value "1" for
      this version of the specification.

    - "updatetype": This line must always be present.  It takes as the value
      either "total" or "incremental".  The first update sent by
      a supplier server to a consumer server for a DSI must be a "total"
      update.

    - "thisupdate": This line must always be present. The value is an ASN.1
      GeneralizedTime in UTC (Z suffix) of the time at which the supplier
      constructed this update.

    - "lastupdate": This line must be present if the "updatetype" line has the
      value "incremental".  It is an ASN.1 GeneralizedTime in UTC (Zulu),
      which is the value of the "thisupdate" of the previous update sent to the
      consumer.

    - "contextsize": This line may be present at the supplier's option. The
      value is a number, which is the approximate total number of entries in
      the subtree.  This information is provided for statistical purposes only.

    - "attributetypes": This line may be present any number of times with
      distinct values at the supplier's option.  The value is the string
      encoding of an LDAP AttributeTypeDescription. This allows the supplier
      to include privately defined or nonstandard attributes in the update.

      The supplier generally should not include in the header:

      - descriptions of attribute types defined in X.500 or RFC 1274,

      - descriptions of attribute types for which it will not be including
        index information (other than presence),

      - descriptions of attribute types with privately-defined syntaxes,

      - descriptions of attribute types whose syntax is "DirectoryString" or
        "IA5String" and matching rule is "caseIgnoreMatch" or
        "caseIgnoreIA5Match".

      For each attribute description the supplier provides, the syntax and
      equality rule names must be included.

      The consumer may wish to make use of the oid, name, syntax and equality
      fields when processing queries using the information from the update.
      If the consumer does not recognize an attribute type in the update, and
      it is not defined in this header, the consumer must treat it as having
      an unknown oid, DirectoryString syntax, and caseIgnoreMatch equality
      matching.

    - "chopbefore": This line may be present any number of times with distinct
      values at the supplier's option. The value is a Distinguished Name in
      LDAP format.  The entry with this name and all its subordinates have
      been excluded from the generation of the index information.  Typically
      the entry will be a context prefix or an administrative point.

      Note: It is assumed that either the consumer will establish a separate
      agreement to cover each excluded area, or these areas are to be ignored
      during searching.

    - "chopafter": Similar to the "chopbefore" option, except that the index
      information does include the attributes of this entry, but not its
      subordinates. (E.g. the subordinate entries are held in a QUIPU DSA.)

   An example header would be:

     version: 1
     updatetype: total
     thisupdate: 199701121341Z
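
   A consumer might check the mandatory header lines roughly as follows.
   This is a sketch under stated assumptions: the names "parse_header" and
   "parse_generalized_time" are hypothetical, type names are compared
   case-insensitively, white space around the value is ignored, and only
   GeneralizedTime values with minute or second precision are accepted.

```python
from datetime import datetime, timezone

# Header types which may appear more than once.
MULTI_VALUED = {"attributetypes", "chopbefore", "chopafter"}


def parse_generalized_time(value: str) -> datetime:
    """Parse YYYYMMDDHHMMZ or YYYYMMDDHHMMSSZ into an aware UTC datetime."""
    formats = {13: "%Y%m%d%H%MZ", 15: "%Y%m%d%H%M%SZ"}
    fmt = formats.get(len(value))
    if fmt is None:
        raise ValueError("unsupported GeneralizedTime: %r" % value)
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def parse_header(lines) -> dict:
    """Turn "type:value" lines into a dict and check the required types."""
    header = {}
    for line in lines:
        name, sep, value = line.partition(":")   # split at the first colon
        if not sep:
            raise ValueError("not a type:value line: %r" % line)
        name, value = name.strip().lower(), value.strip()
        if name in MULTI_VALUED:
            header.setdefault(name, []).append(value)
        else:
            header[name] = value
    if header.get("version") != "1":
        raise ValueError("version must be 1")
    if header.get("updatetype") not in ("total", "incremental"):
        raise ValueError("updatetype must be total or incremental")
    if "thisupdate" not in header:
        raise ValueError("thisupdate is required")
    if header["updatetype"] == "incremental" and "lastupdate" not in header:
        raise ValueError("lastupdate is required for incremental updates")
    for name in ("thisupdate", "lastupdate"):
        if name in header:
            header[name] = parse_generalized_time(header[name])
    return header
```

   Splitting at the first colon only keeps Distinguished Names and
   attribute type descriptions that contain colons intact.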

4.2. Attribute Sections

   This section is present any number of times following the header in an
   x-ldap-centroid-1 index object.  Each section is separated from the others
   by a blank line.

   Each section corresponds to one attribute type.

   The first line of the section consists of three parts separated by colons:
   the attribute type name, the match form, and the tokenization rule.

   The match form is one of the following:
    - equality: the tokens correspond to values of that attribute present in
      entries in the subtree.

    - approx: the tokens correspond to approximate match codes of all values of
      that attribute present in entries in the subtree.  The attribute
      must be of the DirectoryString or IA5String syntax.

    - presence: the token indicates whether there is at least one value of that
      attribute present in any entry in the subtree.



   The supplier should use the equality match form for attributes whose values
   are DirectoryString, IA5String or OID, which are useful for filtering, and
   whose values the supplier is willing to disclose to the consumer.  If the
   supplier is not willing to disclose the values, the supplier should use the
   approx match form.  For all other attributes on which the supplier permits
   searching, but which would not be useful to include in a centroid (because
   they are large, binary or do not compress well), the presence match form
   should be used.

   Tokenization rules define the transformation from attribute values to
   tokens.

   The "flat" tokenization rule is always applied.  First, the attribute value
   is converted to an LDAP string.  Non-string values (e.g. photo and audio)
   are discarded.  Leading and trailing spaces are removed.  Multiple
   consecutive spaces and non-printing characters are replaced by a single
   space. Lower case ASCII letters are converted to upper case.  ASCII
   characters 0-31 and 127 are removed.
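
   One possible reading of the "flat" rule is sketched below.  The function
   name "flat_token" is hypothetical.  The text says both that non-printing
   characters are replaced by a space and that ASCII characters 0-31 and 127
   are removed; this sketch assumes the latter, so a tab is dropped rather
   than turned into a space.  The value is assumed to be a string already.

```python
import re

_CONTROL = re.compile(r"[\x00-\x1f\x7f]")   # ASCII 0-31 and 127
_SPACES = re.compile(r" {2,}")               # runs of two or more spaces
# Map only ASCII a-z to A-Z; other letters are left untouched.
_ASCII_UPPER = {ord(c): ord(c) - 32 for c in "abcdefghijklmnopqrstuvwxyz"}


def flat_token(value: str) -> str:
    """Apply the "flat" tokenization rule to one string value."""
    value = _CONTROL.sub("", value)      # remove control characters
    value = _SPACES.sub(" ", value)      # collapse consecutive spaces
    value = value.strip(" ")             # drop leading/trailing spaces
    return value.translate(_ASCII_UPPER)
```

   str.upper() is deliberately avoided, since it would also change
   non-ASCII letters, which the rule as written does not call for.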

   If the match form is "approx", the "soundex" rule is also applied.  Each
   word (separated by spaces) is replaced by its SOUNDEX code and becomes its
   own token.
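
   This specification does not say which SOUNDEX variant is meant.  The
   sketch below assumes the common American Soundex (one letter followed by
   three digits), and the names "soundex" and "approx_tokens" are
   hypothetical.  Input is assumed to have passed through the "flat" rule.

```python
_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the American Soundex code of a word, or "" if no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")    # code of the previous consonant
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:            # adjacent equal codes collapse
                code += digit
            prev = digit
        elif ch not in "HW":
            prev = ""                    # a vowel separates equal codes
        # H and W neither add a digit nor separate equal codes.
    return (code + "000")[:4]


def approx_tokens(value: str) -> list:
    """One token per space-separated word of an already "flat" value."""
    return [soundex(word) for word in value.split(" ") if word]
```

   Under this assumption "JOSEPH BLOGGS" yields the two tokens J210 and
   B420.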

   If the match form is "presence", the "presence" rule is applied.  It
   generates only one token: "*" if there is any value.

   If a character "#", "-", "+", ";" or "\" occurs in the token, each is
   preceded by a "\" character.

   The supplier may append modifiers to each token.  Modifiers are separated
   from the tokens (and each other) by a semicolon.

   One modifier is described here:

      ;ec=<number>: An approximate count of the number of entries in which
      this token occurs.

   Modifiers are optional and consumers may choose to ignore them.
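
   The escaping and modifier conventions above can be illustrated with a
   small writer and reader.  This is a sketch only; the names
   "escape_token", "format_line" and "parse_line" are hypothetical, and the
   "+" or "-" prefix of incremental updates is not handled here.

```python
SPECIALS = "#-+;\\"      # characters preceded by a backslash in a token


def escape_token(token: str) -> str:
    """Precede each special character with a backslash."""
    return "".join("\\" + c if c in SPECIALS else c for c in token)


def format_line(token: str, ec=None) -> str:
    """Build a token line, optionally with an ;ec=<number> modifier."""
    line = escape_token(token)
    if ec is not None:
        line += ";ec=%d" % ec
    return line


def parse_line(line: str):
    """Return (token, modifiers) where modifiers maps names to strings."""
    fields, current, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            current.append(line[i + 1])   # escaped character, taken as-is
            i += 2
            continue
        if c == ";":                      # unescaped: field separator
            fields.append("".join(current))
            current = []
        else:
            current.append(c)
        i += 1
    fields.append("".join(current))
    modifiers = {}
    for field in fields[1:]:
        name, _, value = field.partition("=")
        modifiers[name] = value
    return fields[0], modifiers
```

   A consumer that ignores modifiers can simply discard the second element
   of the returned pair.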

   If the "updatetype" is "total", each of the lines in the section after the
   first is one token.   The order of lines is unimportant.

   If the "updatetype" is "incremental", each of the lines starts with either a
   "+" or a "-" character, indicating that the token should be added to or
   removed from the tokens of this attribute.  Lines must be processed in the
   order they occur.  Modifiers on a "-" token are ignored.



   For example, if the subtree contained:

     dn:dc=sup,dc=com
     objectclass:top
     objectclass:domain
     dc:sup

     dn:cn=Joe Bloggs,dc=sup,dc=com
     objectclass:top
     objectclass:person
     objectclass:strongAuthenticationUser
     cn:Joe Bloggs
     cn:Joseph Bloggs
     sn:Bloggs
     userCertificate;binary::0281...

     dn:cn=Mary Bloggs,dc=sup,dc=com
     objectclass:top
     objectclass:person
     cn:Mary Bloggs
     sn:Bloggs

   And the supplier generated equality match form for "sn", "objectclass" and
   "description", approximate match form with soundex for "cn", and presence
   match form for "userCertificate;binary", the payload might resemble:

     version:1
     updatetype: total
     thisupdate: 199701121341Z

     sn:equality:flat
     BLOGGS;ec=2

     objectclass:equality:flat
     TOP;ec=3
     DOMAIN;ec=1
     PERSON;ec=2
     STRONGAUTHENTICATIONUSER;ec=1

     description:equality:flat

     cn:approx:soundex
     J000;ec=1
     J210;ec=1
     M600;ec=1
     B420;ec=2

     userCertificate;binary:presence:presence
     *;ec=1
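
   The equality sections of this example can be reproduced with a short
   sketch.  The function name "equality_section" and the dictionary
   representation of entries are assumptions made for illustration; the
   tokenization is a simplified form of the "flat" rule that is adequate
   for this ASCII data, and token escaping is omitted.

```python
from collections import Counter

# The three example entries, reduced to the attributes of interest.
ENTRIES = [
    {"objectclass": ["top", "domain"], "dc": ["sup"]},
    {"objectclass": ["top", "person", "strongAuthenticationUser"],
     "cn": ["Joe Bloggs", "Joseph Bloggs"], "sn": ["Bloggs"]},
    {"objectclass": ["top", "person"],
     "cn": ["Mary Bloggs"], "sn": ["Bloggs"]},
]


def equality_section(entries, attr: str) -> list:
    """Return the lines of an equality section for one attribute type."""
    counts = Counter()
    for entry in entries:
        # A set, so that ec counts entries rather than values.
        tokens = {" ".join(v.split()).upper() for v in entry.get(attr, [])}
        counts.update(tokens)
    lines = ["%s:equality:flat" % attr]
    # The order of token lines is unimportant; sorted for repeatability.
    lines += ["%s;ec=%d" % (tok, n) for tok, n in sorted(counts.items())]
    return lines
```

   An attribute with no values in the subtree, such as "description" here,
   produces a section consisting of its first line only.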

   If the subtree was subsequently modified to add a value to an entry:

      description: seldom seen

   then the next update might resemble:

     version:1
     updatetype: incremental
     thisupdate: 199701131339Z
     lastupdate: 199701121341Z

     description:equality:flat
     +SELDOM SEEN;ec=1
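
   A consumer might apply the lines of an incremental attribute section to
   its stored token set as sketched below.  The names "token_of" and
   "apply_incremental" are hypothetical, and modifiers are simply dropped,
   which the specification permits.

```python
def token_of(body: str) -> str:
    """Return the unescaped token of a line body, dropping modifiers."""
    out, i = [], 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            out.append(body[i + 1])      # escaped character, taken as-is
            i += 2
        elif c == ";":
            break                        # start of the modifiers
        else:
            out.append(c)
            i += 1
    return "".join(out)


def apply_incremental(tokens: set, lines) -> set:
    """Apply "+token" and "-token" lines, in order, to a token set."""
    for line in lines:
        sign, body = line[0], line[1:]
        if sign == "+":
            tokens.add(token_of(body))
        elif sign == "-":
            tokens.discard(token_of(body))
        else:
            raise ValueError("line must start with + or -: %r" % line)
    return tokens
```

   Because lines are applied strictly in order, a "+" followed by a "-"
   for the same token leaves the token absent.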

5. Aggregation

   Aggregation may be performed to combine an index of a naming context with
   indices of ALL subordinate naming contexts.  It cannot be usefully
   performed if there is missing index information for one or more
   subordinates.

   When combining centroids, if an attribute centroid is provided by one index
   but not by any others, the attribute may be replaced by a presence index.

   Chopafter and chopbefore lists are merged.

   The base URIs are changed to point to the aggregating server and its
   alternatives.

   The DSI of the aggregate is distinct from the DSI of any subordinate
   naming context.

   If all subordinate indexes included a contextsize header, then the
   aggregate may also have a contextsize header.

   For example, suppose a server holds a naming context C=US with the
   following entry:

     dn:c=US
     objectclass: top
     objectclass: country
     c: US

   It has two subordinate references "o=Foo,c=US" and "o=Bar,c=US".



   It consumes an x-ldap-centroid-1 from the Foo supplier:

     version: 1
     updatetype: total
     thisupdate: 199701131339Z

     objectclass:equality:flat
     TOP;ec=2
     ORGANIZATION;ec=1
     PERSON;ec=1

     o:equality:flat
     FOO;ec=1

     cn:equality:flat
     SHARON META;ec=1

     sn:equality:flat
     META;ec=1

   And an x-ldap-centroid-1 from the Bar supplier:

     version: 1
     updatetype: total
     thisupdate: 199701131339Z

     o:equality:flat
     BAR;ec=1

     cn:equality:flat
     JEFF RUSSELL;ec=1
     JEFFREY RUSSELL;ec=1
     PENELOPE JONES;ec=1

     sn:equality:flat
     RUSSELL;ec=1
     JONES;ec=1

     uid:approx:soundex
     J311;ec=1
     J12;ec=1

     objectclass:equality:flat
     TOP;ec=3
     ORGANIZATION;ec=1
     PERSON;ec=2
     BARSPECIFICPERSON;ec=2

   To aggregate, it first converts its own entry into a centroid.

   Since the server can determine that it is the only centroid without an
   o attribute, it can create an "o:equality:flat" section for itself with no
   values.  This will prevent the o attribute information from being lost.



   Only the Bar server sent the uid attribute.  Since the Foo server did not
   include a "uid" attribute, the server infers that there are no uid values
   for the Foo server.

   It then merges the o,cn,sn,uid and objectclass sections, to build the
   resulting centroid.

     version: 1
     completeness: index

     o:equality:flat
     FOO;ec=1
     BAR;ec=1

     cn:equality:flat
     SHARON META;ec=1
     JEFF RUSSELL;ec=1
     JEFFREY RUSSELL;ec=1
     PENELOPE JONES;ec=1

     sn:equality:flat
     META;ec=1
     RUSSELL;ec=1
     JONES;ec=1

     uid:approx:soundex
     J311;ec=1
     J12;ec=1

     objectclass:equality:flat
     TOP;ec=6
     ORGANIZATION;ec=2
     PERSON;ec=3
     BARSPECIFICPERSON;ec=2
     COUNTRY;ec=1

   Finally, the server generates a new DSI.  It can transfer this object to
   other servers as the index for C=US and all subordinates.
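
   The merging step of this example can be sketched as follows.  The
   function name "merge_sections" and the dictionary layout are
   hypothetical.  Summing the "ec" counts is an assumption that matches the
   figures in the example; the sketch also assumes that every centroid uses
   the same match form for a given attribute.

```python
from collections import Counter


def merge_sections(*centroids) -> dict:
    """Merge centroids given as {section first line: {token: ec}}."""
    merged = {}
    for centroid in centroids:
        for section, tokens in centroid.items():
            # Counter.update() adds the counts of tokens already present.
            merged.setdefault(section, Counter()).update(tokens)
    return merged


# The aggregating server's own entry, plus parts of the two consumed
# centroids from the example.
own = {"objectclass:equality:flat": {"TOP": 1, "COUNTRY": 1},
       "o:equality:flat": {}}
foo = {"objectclass:equality:flat":
           {"TOP": 2, "ORGANIZATION": 1, "PERSON": 1},
       "o:equality:flat": {"FOO": 1}}
bar = {"objectclass:equality:flat":
           {"TOP": 3, "ORGANIZATION": 1, "PERSON": 2,
            "BARSPECIFICPERSON": 2},
       "o:equality:flat": {"BAR": 1}}

merged = merge_sections(own, foo, bar)
```

   The empty "o:equality:flat" section contributed by the aggregating
   server keeps the section present without adding any tokens to it.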

6. Use of Index Objects in the Consumer

   This procedure is followed for each data set held by the consumer LDAP
   server, when a search operation is to be performed that includes the subtree
   in its scope.

   Index information is not used during bind password validation or comparison
   operations.

   If the target object of the search is below the subtree prefix,
   then the operation is chained or referred to the supplier (or processed
   with a shadow copy), regardless of the contents of the index information.



7. Filter Evaluation

   If the base object of a search is superior to the subtree prefix, the index
   information is used to make a routing decision.

   The consumer will evaluate the client's search filter.  The result will be
   one of the following outcomes:

    - LIKELY: there is a good possibility that there is at least one matching
      entry in the naming context.

    - POSSIBLE: there may or may not be any matching entries in the naming
      context.

    - UNLIKELY: the only matching entries would be those which were modified in
      or added to the naming context after the index information was
      generated.

    - UNINDEXED: the supplying server will likely not have any matching entries
      as it does not allow searching on one or more attribute types referenced
      in the filter.  This occurs if any of the attributes in the search filter
      are not represented in the centroid.

   The consumer server should first chain or return referrals for subordinate
   naming contexts in which the evaluation returned LIKELY.  If the time and
   size limits permit, the server should then chain or return referrals for
   subordinate naming contexts in which the evaluation returned POSSIBLE.  The
   server should ignore subordinate contexts which were UNLIKELY or UNINDEXED.
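
   The ordering described above might be expressed as follows.  This is a
   sketch: the name "routing_order" is hypothetical, and enforcement of the
   time and size limits is left to the caller, which may stop working
   through the returned list at any point.

```python
# Outcomes that lead to chaining or a referral, in order of preference.
PREFERENCE = {"LIKELY": 0, "POSSIBLE": 1}


def routing_order(evaluations: dict) -> list:
    """Order naming contexts for chaining or referral.

    evaluations maps a naming context name to its evaluation outcome.
    UNLIKELY and UNINDEXED contexts are left out of the result.
    """
    selected = [(PREFERENCE[outcome], name)
                for name, outcome in evaluations.items()
                if outcome in PREFERENCE]
    return [name for _, name in sorted(selected)]
```

   Contexts with the same outcome are returned in name order here purely
   to make the result repeatable; the specification sets no such order.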

   Generated referrals will be URIs of the LDAP form, in which the base object
   is the name of the naming context and the scope is subtree.  There may be
   multiple URIs in a referral, if there are alternate servers for this
   naming context.

7.1. "and" filter

   Evaluate each of the component filters.
   If any filter returned "UNLIKELY", return "UNLIKELY".
   If any filter returned "POSSIBLE", return "POSSIBLE".
   Otherwise all filters returned "LIKELY", so return "LIKELY".
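
   The three rules above translate directly into code.  The function name
   "evaluate_and" is hypothetical.  The rules do not say how an UNINDEXED
   component is to be combined, so this sketch rejects it rather than
   guess.

```python
COMBINABLE = ("LIKELY", "POSSIBLE", "UNLIKELY")


def evaluate_and(results) -> str:
    """Combine the outcomes of the component filters of an "and"."""
    results = list(results)
    for outcome in results:
        if outcome not in COMBINABLE:
            raise ValueError("outcome not covered here: %r" % (outcome,))
    if "UNLIKELY" in results:
        return "UNLIKELY"
    if "POSSIBLE" in results:
        return "POSSIBLE"
    return "LIKELY"
```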

7.2. "or" filter

   Evaluate each of the component filters.
   If all filters returned "UNLIKELY", return "UNLIKELY".
   If all filters returned "LIKELY", return "LIKELY".
   Otherwise return "POSSIBLE".
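
   A corresponding sketch for the "or" rules follows.  The function name
   "evaluate_or" is hypothetical, and as the rules do not cover an
   UNINDEXED component, the sketch rejects one rather than guess.

```python
COMBINABLE = ("LIKELY", "POSSIBLE", "UNLIKELY")


def evaluate_or(results) -> str:
    """Combine the outcomes of the component filters of an "or"."""
    results = list(results)
    for outcome in results:
        if outcome not in COMBINABLE:
            raise ValueError("outcome not covered here: %r" % (outcome,))
    if all(outcome == "UNLIKELY" for outcome in results):
        return "UNLIKELY"
    if all(outcome == "LIKELY" for outcome in results):
        return "LIKELY"
    return "POSSIBLE"
```

   Note that under these rules a single LIKELY component mixed with
   UNLIKELY components yields POSSIBLE, not LIKELY.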

7.3. "equalityMatch" filter

   If there is an attribute section for the presented attribute type of the
   equality match form, then tokenize the presented value and compare. If the
   presented value matches, return "LIKELY", otherwise return "UNLIKELY".

   If there is an attribute section for the presented attribute type of the
   approx match form, then tokenize the presented value and compare. If the
   approximated presented value matches, return "POSSIBLE", otherwise
   return "UNLIKELY".

   If there is an attribute section for the presented attribute type of the
   presence match form, then if the "*" token is present, return "POSSIBLE",
   otherwise return "UNLIKELY".
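
   The three cases above might be coded as follows.  All names are
   hypothetical.  The tokenization functions are passed in by the caller;
   "simple_flat" and "first_letter" are stand-ins for the "flat" and
   "soundex" rules so that the sketch is self-contained.  For the approx
   form the sketch assumes that every word of the presented value must have
   its code in the section, and an attribute with no section at all is
   reported as UNINDEXED, following Section 7.

```python
def simple_flat(value: str) -> str:
    """Stand-in for the "flat" rule, adequate for ASCII test data."""
    return " ".join(value.split()).upper()


def first_letter(word: str) -> str:
    """Stand-in for a real approximate match code such as SOUNDEX."""
    return word[:1]


def evaluate_equality(section, value, flat=simple_flat, approx=first_letter):
    """Evaluate an equalityMatch filter item against one attribute section.

    section is None when the attribute is not represented in the index,
    otherwise a pair (match form, set of tokens).
    """
    if section is None:
        return "UNINDEXED"
    form, tokens = section
    if form == "equality":
        return "LIKELY" if flat(value) in tokens else "UNLIKELY"
    if form == "approx":
        codes = [approx(word) for word in flat(value).split(" ")]
        return "POSSIBLE" if all(c in tokens for c in codes) else "UNLIKELY"
    if form == "presence":
        return "POSSIBLE" if "*" in tokens else "UNLIKELY"
    raise ValueError("unknown match form: %r" % (form,))
```

   An approx or presence section can never yield LIKELY for an equality
   filter, since neither holds the actual values.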

7.4. "substrings" filter

   If there is an attribute section for the presented attribute type of the
   equality match form, then tokenize the filter values and apply the filter
   against each token.  If there is a match, return "LIKELY", otherwise
   return "UNLIKELY".

   If there is an attribute section for the presented attribute type of the
   approx match form, then tokenize the filter values and compare each
   against each token.  If there is a match, return "LIKELY", otherwise
   return "POSSIBLE".

   If there is an attribute section for the presented attribute type of the
   presence match form, then if the "*" token is present, return "POSSIBLE",
   otherwise return "UNLIKELY".
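
   For the equality match form, applying a substrings filter to each token
   could look like the sketch below.  The names are hypothetical, the
   filter is given as its "initial", "any" and "final" components, and
   those components are assumed to have been put through the "flat" rule
   already.  The approx and presence forms are not covered.

```python
import re


def substring_pattern(initial=None, any_parts=(), final=None):
    """Compile a pattern that matches tokens satisfying the filter."""
    pattern = re.escape(initial) if initial else ""
    pattern += ".*"
    for part in any_parts:               # "any" components, in order
        pattern += re.escape(part) + ".*"
    if final:
        pattern += re.escape(final)
    return re.compile(pattern, re.DOTALL)


def evaluate_substrings(tokens, initial=None, any_parts=(), final=None):
    """Evaluate a substrings filter against an equality-form section."""
    pattern = substring_pattern(initial, any_parts, final)
    # fullmatch() anchors the pattern at both ends of the token.
    if any(pattern.fullmatch(token) for token in tokens):
        return "LIKELY"
    return "UNLIKELY"
```

   For instance, the filter (cn=*joe bloggs*) has the single "any"
   component "JOE BLOGGS" once tokenized.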

7.5. "approxMatch" filter

   If there is an attribute section for the presented attribute type of the
   equality match form, then apply the filter against each token.  If there
   is a match, return "LIKELY", otherwise return "UNLIKELY".

   If there is an attribute section for the presented attribute type of the
   approx match form, then compare each tokenized word in the presented value
   against the tokens.  If there is at least one matching word, return
   "LIKELY", otherwise return "UNLIKELY".

   If there is an attribute section for the presented attribute type of the
   presence match form, then if the "*" token is present, return "POSSIBLE",
   otherwise return "UNLIKELY".

7.6. "present" filter

   If there are any values or tokens of the presented attribute type in the
   index object, return "POSSIBLE", otherwise return "UNLIKELY".

7.7. "not", "greaterOrEqual", "lessOrEqual" and "extensibleMatch" filters

   Return "UNLIKELY".

8. Recommendations

   To be written.



9. Security Considerations

   This specification provides a way to transfer directory information from
   one server to another.  This may consist of white pages data, which is
   protected by privacy laws in many countries.

   Depending on the requirements of the data suppliers, a number of index
   encoding options are available, which provide a range of non-reversibility,
   at a cost of usefulness for the consumer.

   The specification recommends that a digital signature be applied and the
   data be encrypted before being transferred to the consumer.  This will allow
   the consumer to verify the source of the data, and to ensure that
   unauthorized parties are not able to access the data while in transit.

   The specification is designed to work in environments where there is an
   agreement between the index supplier and consumer.  This may be based on
   a legal or contractual agreement between the two parties, which defines the
   protections the consumer must provide to the index information.

10. Author's Address

   Mark Wahl
   Critical Angle Inc.
   4815 W Braker Lane #502-385
   Austin, TX 78759
   USA

   EMail:  M.Wahl@critical-angle.com

Bibliography

[CIP] <draft-ietf-find-new-cip-01.txt>

[LDAP] <draft-ietf-asid-ldapv3-protocol-04.txt>

[LDIF] <draft-ietf-asid-ldif-00.txt>

[Fortezza] <draft-housley-msp-mime-01.txt>

[UTF8] RFC 2044