INTERNET DRAFT                                              C. Feather
Expires: 8 December 2001                                      Thus plc
                                                            8 June 2001
                            The Hashed URI

                   draft-feather-hashed-uri-00.txt

1. Status of this Document

This document is an Internet-Draft and is in full conformance with
Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or made obsolete by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accesses at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft shadow directories can be accessed
at http://www.ietf.org/shadow.html.

This section will be updated with the appropriate verbiage from RFC 2223
when this document has been found ready for publication as an RFC.

2. Abstract

There are situations where it is desirable to determine whether two
URIs are identical without revealing the value of one or the other if
they are not. For example, the publisher of filtering software may want
to provide a set of URLs to be filtered without identifying the actual
resources in question.

This problem has traditionally been done using cryptographically
secure hashes. The two items are both hashed and the resulting values
then compared. If they are equal it can be assumed that the two original
items were equal as well; if not, the hash does not reveal anything
about the original item. This technique is eminently suitable for this
situation.

This document describes a "hashed" URI naming scheme for representing
the hashed form of a URI [RFC2396] as another URI.










Feather                                                       [Page 1]


INTERNET DRAFT             The Hashed URI                  8 June 2001

3. Formal definitions

Formal definitions in this document use augmented Backus-Naur syntax
[RFC2234].   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
[RFC2119].

4. The hashed URI scheme

4.1. Syntax

A hashed URI is formally described as follows:

     hashed-uri = hashed-scheme ":" hash-algorithm "=" hash-value

     hashed-scheme = "hashed"

     hash-algorithm = identifier

     hash-value = 1*HEXDIG

     identifier = 1*(ALPHA / DIGIT / "." / "-")

     ALPHA  = %x41-5A / %x61-7A   ; A-Z / a-z

     DIGIT  = %x30-39             ; 0-9

     HEXDIG =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

The entire URI is case-insensitive.

4.2. Identification of hash algorithms

Each hashing algorithm has an identifier for the purposes of the hashed
URI scheme. Names beginning with "x-" are available for private use. The
Internet Assigned Numbers Authority (IANA) will maintain a registry of
hash algorithms and the associated identifier. The initial contents of
this registry will be:

     Identifier    Definition
     "md5"         Message Digest algorithm 5 ("MD5")     [RFC1321]
     "sha-1"       Secure Hash Algorithm ("SHA-1")        [FIPS-180-1]

A hashed URI MUST only use an identifier that is in the registry or
that is available for private use. An application that interprets hashed
URIs MUST accept "sha-1" and SHOULD accept all identifiers in the
registry.







Feather                                                       [Page 2]


INTERNET DRAFT             The Hashed URI                  8 June 2001

4.3. Generating a hashed URI

A hashed URI is generated from an initial URI in the following three
steps.

(1) The initial URI is reduced to its equivalent canonical form. To do
this, any #fragment or ?query component is removed, a relative URI is
converted to the equivalent absolute URI, and the scheme name is
converted to lowercase; any other changes depend on the specific scheme
used. Except where done as part of a canonicalization algorithm for the
specific scheme, no encoding or decoding of escaped characters is done.
A possible canonicalization algorithm for hierarchical URIs is given in
Appendix A.

(2) The resulting string is passed through the appropriate hash
algorithm. The string does not include any trailing zero bytes, a
length, delimiting angle brackets, or any other ornamentation. This
produces a sequence of octets (16 octets for MD5, 20 octets for
SHA-1). If the algorithm requires the data to be a multiple of a
particular size and does not specify a padding rule, the string is
padded with zero bytes to that size (MD5 and SHA-1 both specify their
own padding rules and therefore no padding is done at this stage).

(3) Each octet is converted to two hexadecimal digits in the normal
manner and these are written in the same order as the corresponding
octets.

(4) Any #fragment or ?query component is restored.

4.4. Examples

Initial URI               Hashed URI

mailto:clive@demon.net    hashed:md5=cd933f3b87ee60e58917448c9678ee32

http://www.thus.net       hashed:md5=be96302a0468481aca954431cf293e05

HTTP://www.Thus.net       hashed:md5=be96302a0468481aca954431cf293e05

http://10.20.30.40/a%62c  hashed:md5=754a2c63a13aa2e8153d027066e31f8f

5. IANA considerations

Section 4.2 requires IANA to maintain a registry of hash algorithms and
identifiers. Since implementers will want to handle as many hashes as
possible, adding new algorithms places a significant burden on them.
Therefore new algorithms should only be added when they are widely
accepted in the community. The IETF Consensus process [RFC2434] is
recommended.





Feather                                                       [Page 3]


INTERNET DRAFT             The Hashed URI                  8 June 2001

6. Security Considerations

The security considerations for URIs in [RFC2396] continue to apply.

The security of the hashed URI depends completely on that of the hash
algorithm used.

Where the initial URI is limited to a sufficiently small set of
possibilities, an exhaustive search can be used to determine the initial
URI from the hashed URI. Given H hashed URIs and P possible initial
URIs, the time take to do this search is O(H+P) and not O(HP). In
particular, the set of URIs of the form <http://www.domain>, where the
domain is in a major domain registry, may be small enough for such a
search to be feasible.

Because (subject to the above) the initial URI cannot be determined from
a hashed URI, it is necessary to trust the originator of the latter as
to whether the resource has any of the properties claimed for it.

7. References

[FIPS-180-1] "Secure Hash Standard", National Institute of Standards
              and Technology, U.S. Department Of Commerce, April 1995.
              Also known as: 59 Fed Reg 35317 (1994).

[RFC1321]    Rivest, R., "The MD5 Message-Digest Algorithm", RFC
              1321, April 1992.

[RFC2119]    Bradner, S., "Keywords for use in RFCs to Indicate
              Requirement Levels", RFC 2119.

[RFC2234]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 2234, November 1997.

[RFC2396]    Berners-Lee, T., R. Fielding and L. Manister, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

[RFC2434]    Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", RFC 2434,
              October 1998.

8. Acknowledgements

The assistance of Richard Clayton in preparing this document is greatly
appreciated.







Feather                                                       [Page 4]


INTERNET DRAFT             The Hashed URI                  8 June 2001

9. Author's Address

Clive Feather
Thus plc
322 Regents Park Road
London
N3  2QQ
United Kingdom

Email: clive@demon.net
    or: clive@davros.org
Tel: +44 20 8371 1138
Fax: +44 20 8371 4037



This document expires 8 December 2001.




































Feather                                                       [Page 5]


INTERNET DRAFT             The Hashed URI                  8 June 2001

A  Canonicalization of hierarchical URIs

The following algorithm is proposed as a way of canonicalizing
hierarchical URIs; that is, those that use the production "hier_part"
from [RFC2396] appendix A.

In the real world it is not always possible to say whether two URIs
address the same resource, because this will depend on the actual
implementation deployed on network servers. Therefore the algorithm has
two variants - "N" and "P" - which respectively tend towards false
negatives (the URIs are assumed to be different) and false positives
(they are assumed to be the same). The choice of variant will depend on
the application domain.

(1) The changes described in section 3.3 step (1) are made.

(2) The URI is decomposed into scheme, authority, and a list of
     path_segments. If the scheme uses user/host/port triples as
     authorities, the authority is further split into the three parts.
     Any of these parts may be empty.

(3) If the port is the default port for the scheme, it is deleted.

(4) Variant N: the scheme and host (if a domain name) are lowercased.
     Variant P: all the components are lowercased.

(5) The host (if an IP address) and port have all leading zeroes
     removed.

(6) Variant P only: if the last path_segment ends with one of the
     strings in the first column, that string is replaced by the one in
     the second column:

         .html      .htm
         .jpeg      .jpg
         .text      .txt
         .ram       .ra

(7) Scheme "http" only: if the last path_segment is "index.htm" or
     "index.html", it is removed.

(8) The URI is recomposed, including only those components that remain
     and are not empty strings. That is, it is the concatenation of:
         [always]                          scheme ":"
         [if there is an authority]        "//" authority
         [if there is a user and host]     "//" user "@" host
         [if there is only a host]         "//" host
         [if there is a port]              ":" port
         [for each path_segment in order]  "/" path_segment





Feather                                                       [Page 6]


INTERNET DRAFT             The Hashed URI                  8 June 2001

Examples:

   Original URI: "http://Fred@WWW.Thus.net:0112/Test/index.html"
   Variant N:    "http://Fred@www.thus.net:112/Test/"
   Variant P:    "http://fred@www.thus.net:112/test/"

   Original URI: "http://111.011.101.001/test/Image.jpeg"
   Variant N:    "http://111.11.101.1/test/Image.jpeg"
   Variant P:    "http://111.11.101.1/test/image.jpg"










































This document expires 8 December 2001.

Feather                                                       [Page 7]

--
Clive D.W. Feather    | Internet Expert      | Work: <clive@demon.net>
Tel: +44 20 8371 1138 | Demon Internet       | Home: <clive@davros.org>
Fax: +44 20 8371 4037 | Thus plc             | Web:  <http://www.davros.org>
Written on my laptop; please observe the Reply-To address