INTERNET-DRAFT                      Larry Masinter, Xerox Corporation
draft-masinter-url-i18n-01                              March 9, 1998
Expires in 6 months


        Using UTF8 for non-ASCII Characters in Extended URIs

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net
   (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East
   Coast), or ftp.isi.edu (US West Coast).

   This document is not a product of any working group, but may
   be discussed on the mailing list url-i18n@unicode.org.

Abstract

   URIs are defined as sequences of characters chosen from a limited
   subset of the repertoire of ASCII characters, both for transmission
   in network protocols and representation in spoken and written human
   communication.

   This document defines a uniform way of representing non-ASCII
   scripts in URIs and in an Extended URI, so these identifiers can be
   used for the world's languages.

1. Introduction

   URIs [RFC-URI-SYNTAX] are defined as sequences of characters chosen
   from a limited subset of the repertoire of ASCII characters.  The
   characters in URIs are frequently used for representing English
   words and phrases; unfortunately, this leaves out most of the
   world, who do not write merely with the letters A-Z.

2. Syntax

   This memo defines two ways of representing non-ASCII characters
   within URIs:

   1) Within traditional URIs: To be compatible with [RFC-URI-SYNTAX],
      non-ASCII characters SHOULD be transcribed in URIs by first
      representing the characters with the UTF-8 character encoding
      [RFC 2044], and then using the hex-encoding defined in
      [RFC-URI-SYNTAX] to encode any octet that does not correspond to
      an allowed, non-reserved character.

   2) Within a new object, an 8-bit URI (8URI): for a more compact
      and natural representation, an 8URI consists of a sequence of
      octets in the UTF-8 encoding; all characters are represented
      directly by their UTF-8 encoding, except those disallowed in
      [RFC-URI-SYNTAX] (reserved, delimiters, white space, unwise
      special characters), which MUST be hex-encoded.

   Any octet sequence which would likely yield ambiguous or incorrect
   results when printed or displayed and then subsequently typed by a
   user SHOULD be hex-encoded. (See [RFC-DUERST] for details.)
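
   The two forms defined in this section can be illustrated with a
   minimal sketch. The function names to_uri and to_8uri are
   hypothetical, and the set of characters left unencoded is assumed
   to be the unreserved set of [RFC-URI-SYNTAX] (letters, digits, and
   a few marks). The sketch treats its input as a single URI
   component, so reserved characters such as "/" are hex-encoded too.

```python
# Sketch only: names and the exact "unreserved" set are assumptions
# made for illustration, not definitions taken from this memo.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)


def to_uri(text: str) -> str:
    """Form 1: UTF-8 encode, then hex-encode every octet that is not unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        if octet < 0x80 and ch in UNRESERVED:
            out.append(ch)
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)


def to_8uri(text: str) -> bytes:
    """Form 2: keep non-ASCII UTF-8 octets as-is; hex-encode disallowed ASCII."""
    out = bytearray()
    for octet in text.encode("utf-8"):
        if octet >= 0x80 or chr(octet) in UNRESERVED:
            out.append(octet)
        else:
            out.extend("%{:02X}".format(octet).encode("ascii"))
    return bytes(out)
```

   Under these assumptions, the component "r\u00e9sum\u00e9" becomes
   "r%C3%A9sum%C3%A9" as a URI, while the 8URI keeps the two octets
   C3 A9 of each accented letter unencoded.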

3. Software Requirements

   Supporting URIs for non-ASCII characters requires cooperation from
   the providers of several different components of URI software:

3.1 Requirements for URI entry

   One component of software that deals with URIs allows users to type
   in the URIs. A human transcribes a visual representation of a URI
   (as a sequence of glyphs, in some order, in some visual display)
   using some entry method that will result in a URI.

   If the visual representation contains only those characters that
   are allowed by the standard URI syntax [RFC-URI-SYNTAX], the
   transcription is simple. However, for all other sequences of
   characters, it is desirable that the entry results in characters
   from the ISO 10646 character repertoire, in logical order, encoded
   using the UTF-8 method [RFC 2044], and then subsequently encoded as
   necessary using the URI hex-encoding (the set of octets that
   require encoding depends on whether the result is a URI or an
   8URI).

   Care must be taken in the identification of the characters and
   the character sequence: all accented characters should be
   translated into their combined form, no extraneous BIDI
   (bidirectional) marks should be left in the resulting stream, and
   characters that are intended to represent Western European letters
   should be transcribed into their ISO-8859-1 equivalents and not,
   for example, as double-wide characters. See [RFC-DUERST] for more
   complete rules.
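
   The clean-up steps above might be sketched as follows. This is an
   illustration under stated assumptions, not the rule set of
   [RFC-DUERST]: the function name clean_entry is hypothetical, the
   list of directional marks is an assumed subset, every such mark is
   dropped (the sketch does not try to decide which ones are
   extraneous), and "combined form" is taken to mean Unicode
   Normalization Form C.

```python
import unicodedata

# Assumed set of directional formatting characters to drop; the
# authoritative rules are deferred to [RFC-DUERST].
BIDI_MARKS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
}


def clean_entry(text: str) -> str:
    """Normalize typed text before UTF-8 encoding and hex-encoding."""
    out = []
    for ch in text:
        if ch in BIDI_MARKS:
            continue  # simplification: drop every directional mark
        code = ord(ch)
        # Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E.
        if 0xFF01 <= code <= 0xFF5E:
            ch = chr(code - 0xFEE0)
        out.append(ch)
    # NFC composes a base letter plus combining accent into a single
    # precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

   With this sketch, "e" followed by a combining acute accent (U+0301)
   yields the single character U+00E9, and the fullwidth letters
   U+FF21 U+FF22 yield "AB".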

3.2 Requirements for URI generation and interpretation

   Systems that offer resources through the Internet, where those
   resources have logical names, sometimes offer the ability to
   generate URIs for the resources they offer.  For example, some HTTP
   servers offer the ability to generate a 'directory listing' for
   file directories under their purview, and then to respond to the
   generated URIs with the files. If the names of the files consist
   solely of US-ASCII characters the transcription is simple, but
   other file systems offer a wider variety of characters. For maximum
   interoperability, the names in generated directory listings SHOULD
   be represented in UTF-8, and the results hex-encoded as appropriate
   for the URI or 8URI.

   This requirement applies to HTTP servers, FTP servers, gopher
   servers, and the like.
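
   As a rough illustration, a server whose file system stores names
   in some local character set could generate listing URIs as
   sketched below. The function name listing_uris, its parameters,
   and the example base URI are hypothetical. The sketch relies on
   urllib.parse.quote from the Python standard library, whose set of
   always-unencoded characters is slightly narrower than the
   unreserved set of [RFC-URI-SYNTAX].

```python
from urllib.parse import quote


def listing_uris(base: str, raw_names: list, local_charset: str) -> list:
    """Turn file names stored in a local charset into hex-encoded UTF-8 URIs."""
    uris = []
    for raw in raw_names:
        # Recover the characters from the file system's own encoding.
        name = raw.decode(local_charset)
        # quote() encodes the string as UTF-8 and %HH-escapes every octet
        # outside its safe set; safe="" also escapes "/" inside a name.
        uris.append(base + quote(name, safe="", encoding="utf-8"))
    return uris
```

   For a file stored as the ISO-8859-1 octets of "caf\u00e9.txt", the
   call listing_uris("http://example.com/files/", [b"caf\xe9.txt"],
   "iso-8859-1") produces a URI ending in "caf%C3%A9.txt" rather than
   "caf%E9.txt".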

3.3 Requirements for display of URIs

   Software that displays URIs to users (or any other kind of
   transcription, e.g., deciding what to print in a magazine) should
   follow a general principle: "Don't display a URI that the viewer
   wouldn't be able to type!" The consequences of this principle
   require judgement about the availability of software that
   implements the character input method described in section 3.1.

   a) In situations where most viewers would not have the capability
      of typing non-ASCII characters, any octet not allowed in the
      [RFC-URI-SYNTAX] definition of URIs SHOULD be displayed as if it
      were hex-encoded.

   b) In situations where the viewer is likely to have software for
      non-ASCII character entry as described in section 3.1, sequences
      of octets MAY be displayed directly as the non-ASCII character
      sequence they represent in UTF-8. In addition, sequences of
      %HH-encoding which correspond to non-ASCII characters in UTF-8
      MAY either be displayed directly as those characters or be
      displayed unchanged, showing the hex-encoding in ASCII.
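
   One possible reading of cases a) and b) is sketched below. The
   function name display_form and the boolean flag are hypothetical;
   deciding whether a viewer can type non-ASCII characters is the
   judgement call described above and is outside the sketch. Only
   runs of %HH escapes with the high bit set are considered for
   direct display, and a run that is not valid UTF-8 is left
   hex-encoded.

```python
import re

# A run of consecutive %HH escapes whose octets are all >= 0x80.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match):
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return run  # not UTF-8: keep the hex-encoding visible


def _hex_encode_non_ascii(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def display_form(uri: str, viewer_can_type_non_ascii: bool) -> str:
    if viewer_can_type_non_ascii:
        # Case b): show hex-encoded UTF-8 as the characters it represents.
        return _HIGH_RUN.sub(_decode_run, uri)
    # Case a): show every non-ASCII character as hex-encoded UTF-8.
    return _hex_encode_non_ascii(uri)
```

   Escapes of ASCII octets such as %20 or %2F are never decoded for
   display in this sketch, since doing so could change how the URI is
   parsed when it is typed back in.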

3.4 Requirements for interpretation of URIs

   Software that interprets URIs as the names of local resources
   SHOULD accept multiple renditions of the URIs in the case where
   those resource names might have non-ASCII representations.

   Allowing case-insensitive file names makes URIs more robust,
   because the person viewing the URI might type the case differently
   than it is displayed. Similarly, URI-interpreting software should
   be generous in allowing all of the possible representations that
   might result from the recommendations in section 3.1. In addition,
   it is useful if unaccented characters are accepted, when possible,
   as aliases for accented characters, and if other equivalences are
   made.
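
   A server could implement this kind of generous matching by
   comparing requests and local names through a common key, as in the
   sketch below. The names match_key and resolve are hypothetical.
   The key ignores case, hex-encoding, composed versus decomposed
   accents, and the accents themselves. Accent stripping is lossy, so
   two distinct local names can share a key; the sketch simply
   returns the first match, which a real server may not want.

```python
import unicodedata
from urllib.parse import unquote


def match_key(name: str) -> str:
    """Reduce a name to a key that ignores encoding, accents, and case."""
    # Undo %HH-encoding, interpreting the octets as UTF-8.
    decoded = unquote(name, encoding="utf-8", errors="replace")
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", decoded)
    # Drop the combining characters, leaving unaccented base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


def resolve(requested: str, local_names: list):
    """Return the first local name equivalent to the request, or None."""
    wanted = match_key(requested)
    for name in local_names:
        if match_key(name) == wanted:
            return name
    return None
```

   With a local file named "Caf\u00e9.txt", the requests
   "caf%C3%A9.txt", "cafe%CC%81.txt", and "cafe.txt" all resolve to
   that file under this sketch.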

Summary

These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.

Acknowledgements

Many thanks to Martin Duerst and others for help with this draft.


References

[RFC 2044]       F. Yergeau, "UTF-8, a transformation format of
                 Unicode and ISO 10646", RFC 2044, October 1996.
[RFC-URI-SYNTAX] draft-fielding-url-syntax
[RFC-DUERST]     draft-duerst-url-???