Internet Engineering Task Force                               M. Nilsson
INTERNET DRAFT                                         17th January 1999
Document: draft-nilsson-latin1-http-uri-00.txt
Expires 17th July 1999


                  8 bit latin1 characters in HTTP URIs

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its
   areas, and its working groups.  Note that other groups may also
   distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as
   "work in progress."

   To view the entire list of current Internet-Drafts, please check
   the "1id-abstracts.txt" listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
   (Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East
   Coast), or ftp.isi.edu (US West Coast).

   This memo provides information for the Internet community. This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.


Abstract

   The recent growth in the number of Internet users outside the US has
   increased the demand for 8 bit characters in URIs. The lack of a
   recommended character map has led to several incompatible
   implementations. This document suggests the use of ISO-8859-1 to
   represent all the characters present in that character table.


1.  The problem

   The definition of a URI to be used in HTTP/1.0 and HTTP/1.1, as
   described in RFC 1945 and RFC 2068, includes "International
   characters" among the allowed characters. It is further stated at
   the end of section 3.2.1 that:

    "The BNF above includes national characters not allowed in valid URLs
    as specified by RFC 1738, since HTTP servers are not restricted in
    the set of unreserved characters allowed to represent the rel_path
    part of addresses, and HTTP proxies may receive requests for URIs not
    defined by RFC 1738."

   But nothing is said about the representation of these 8 bit
   characters.

   Since different applications use different character maps to
   represent 8 bit URIs, the following problem, and several other
   similar ones, can occur:

   1. An HTML page is authored and published on a UNIX system. The page
      contains a link to another page which has an 8 bit name, but
      neither page has any embedded information regarding its character
      encoding.

   2. A Macintosh user is looking at the first page with a web browser.
      All characters on the page are displayed as intended, including
      the 8 bit ones in the link when using 'display source', as the
      browser assumes ISO-8859-1, which happens to be the character set
      used on the authoring system.

   3. When the user tries to follow the link to the document with an
      8 bit URI, the server only returns a "Not found" message, because
      the browser encodes its URIs with the Macintosh character set.
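
   The mismatch in the three steps above can be illustrated with a
   minimal sketch. The path, the set of published names and the
   server_lookup helper are hypothetical and exist only for this
   illustration; the sketch also assumes that the server compares the
   raw bytes of the request with the bytes of the stored name.

```python
# Hypothetical path with two 8 bit characters; not taken from the draft.
path = "/sidor/blåbär.html"

# Bytes stored on the UNIX authoring system (ISO-8859-1).
as_latin1 = path.encode("iso-8859-1")
# Bytes sent by a browser that uses the Macintosh character set.
as_macintosh = path.encode("mac_roman")


def server_lookup(request_bytes: bytes, published: set) -> bool:
    """Assumed server behaviour: a plain byte-for-byte comparison."""
    return request_bytes in published


published = {as_latin1}

print(as_latin1)      # b'/sidor/bl\xe5b\xe4r.html'
print(as_macintosh)   # b'/sidor/bl\x8cb\x8ar.html'
print(server_lookup(as_latin1, published))     # True
print(server_lookup(as_macintosh, published))  # False, i.e. "Not found"
```

   The same two characters end up as different octets, so the second
   request does not match the published name.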


2.  Solutions

   First it must be recognised that URIs are on the whole very difficult
   to internationalise and that a complete internationalisation is not
   possible. At the same time, we want as much internationalisation as
   possible within the constraints given by the URI definition in the
   HTTP/1.0 and HTTP/1.1 standards. Considering this, suggesting that
   ISO-8859-1 should be the default mapping is not a solution.

   A solution that might look like a good one at first glance is to use
   the same encoding as that of the document in which the link was
   found. E.g. if the document was encoded in ISO-8859-1 all links
   should be treated as ISO-8859-1 and if the document was encoded in
   ISO-8859-2 all links should be treated as ISO-8859-2. This is a
   solution for characters present in only one character set, but it
   does not solve the problem described in section 1.

   The solution suggested by this document is to encode all characters
   present in ISO-8859-1 with ISO-8859-1 and let all other characters
   remain in their present encoding, if possible. ISO-8859-1 is chosen
   since it is the default encoding of HTML documents, which means that
   URIs from pages without any character encoding description can be
   used without modification.

   While this isn't a complete solution, it is very straightforward and
   solves most of the problems, since the ISO-8859-1 characters are the
   ones present in most character sets (although at different
   positions).


3.  Suggested solution

   All characters present in ISO-8859-1 should be represented with their
   ISO-8859-1 encoding.
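
   One possible reading of this rule is sketched below. The function
   name encode_uri_chars and its document_charset parameter are
   assumptions made for this illustration and are not defined by this
   document. Characters that exist in ISO-8859-1 are emitted as their
   ISO-8859-1 octet; any other character is left in the encoding of the
   document where the link was found, and an error is raised if that
   is not possible.

```python
def encode_uri_chars(uri: str, document_charset: str) -> bytes:
    """Encode a URI character by character (illustrative sketch only).

    Characters present in ISO-8859-1 get their ISO-8859-1 octet.
    Other characters keep the encoding of the linking document,
    which is passed in as document_charset (an assumed parameter).
    """
    out = bytearray()
    for ch in uri:
        try:
            out += ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            # Not in ISO-8859-1: fall back to the document's own
            # encoding. This raises UnicodeEncodeError if the
            # character cannot be represented there either.
            out += ch.encode(document_charset)
    return bytes(out)


# A link found in a page that the browser handles in the Macintosh
# character set is still sent as ISO-8859-1 octets.
print(encode_uri_chars("/sidor/blåbär.html", "mac_roman"))
# b'/sidor/bl\xe5b\xe4r.html'

# A character outside ISO-8859-1 keeps its ISO-8859-2 octet.
print(encode_uri_chars("/ő", "iso-8859-2"))
# b'/\xf5'
```

   The last call also shows why this is not a complete solution: the
   octet 0xF5 is a different character when read as ISO-8859-1.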


4.  Security considerations

   Since this document does not suggest any technical changes of the URI
   definition (such as adding or removing valid URI characters) the
   author does not see any security issues.


5.  References

   [HTTP1.0]
     T. Berners-Lee, R. Fielding, H. Frystyk, "Hypertext Transfer
     Protocol -- HTTP/1.0", RFC 1945, May 1996

   [HTTP1.1]
     R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
     "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, January 1997

   [ISO-8859-1]
     ISO/IEC DIS 8859-1, "8-bit single-byte coded graphic character
     sets, Part 1: Latin alphabet No. 1", Technical committee /
     subcommittee: JTC 1 / SC 2


6.  Full Copyright Statement

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published and
   distributed, in whole or in part, without restriction of any kind,
   provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."


7.  Author's Address

   Martin Nilsson
   Rydsvägen 246 C. 30
   S-584 34 Linköping
   Sweden

   Email: nilsson@id3.org