INTERNET-DRAFT                                         Larry Masinter
                                                    Xerox Corporation
                                                        Martin Duerst
                                                  W3C/Keio University
draft-masinter-url-i18n-02                            August 30, 1998
Expires in 6 months


   Representing non-ASCII Characters in URIs and Extended URIs

Status of this Memo

This document is an Internet-Draft.  Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups.  Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To view the entire list of current Internet-Drafts, please check
the "1id-abstracts.txt" listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
(Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
(Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US
West Coast).

This document is not a product of any working group, but may
be discussed on the mailing list url-i18n@unicode.org.

Abstract

URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission in
network protocols and representation in spoken and written human
communication.

This document defines a uniform way of representing non-ASCII scripts
in URIs and in an extended 8-bit form (8URI), so these identifiers can
be used for the world's languages. The document gives guidelines for
the use and deployment of these forms in various elements of software
that deal with URIs.

1. Introduction

URIs [RFC 2396] are defined as sequences of characters chosen from a
limited subset of the repertoire of ASCII characters.  The characters
in URIs are frequently used for representing English words and
phrases; unfortunately, this leaves out most of the world, who do not
write merely with the letters A-Z.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

2. Syntax

This document defines two ways of representing non-ASCII characters in
resource identifiers: a URI syntax which is compatible with the
definition of URI syntax [RFC 2396], and a new syntax which is usable
in contexts where resource identifiers are transported within "8-bit"
environments. This new syntax is called an "8URI"; it is upward
compatible with the URI syntax, but is defined as a sequence of 8-bit
octets.

2.1 URI syntax

The standard definition of URIs [RFC 2396] requires that URIs be
represented with a very limited repertoire of characters which are a
subset of those characters representable in ASCII. URIs are defined as
a sequence of characters (since URIs may be written on paper or read
out loud) which may be represented as a sequence of 7-bit bytes.

Character sequences that include non-ASCII characters must be
transcribed to represent them in URIs. The transcription to be applied
to a character sequence before it is included in an element of a URI
(path, etc.) SHOULD be performed by:

1) representing the characters as a sequence of ISO 10646 characters.
2) "normalizing" the character sequence to reduce ambiguity.
   [UNI15] defines several normalization forms; for the purpose
   of representing characters in URIs, "Normalization Form CC"
   is used.
3) encoding the result with the UTF-8 character encoding [RFC 2279].
4) using %HH hex-encoding [RFC 2396] to encode any octet that
   does not correspond to an allowed, non-reserved character.

This syntax is consistent with the definition of the generic URI
syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
scheme definitions [RFC 2192], [RFC 2384].
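
As an illustration only, the four transcription steps of this section
might be sketched in Python as follows. The function name
to_uri_component is hypothetical and is not defined by this document.
The sketch assumes that the "Normalization Form CC" of [UNI15]
corresponds to the form that Python's unicodedata module calls "NFC",
and it treats only the unreserved characters of [RFC 2396] as allowed.

```python
import unicodedata

# Unreserved characters of RFC 2396: alphanumerics plus the "mark" set.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"
)

def to_uri_component(text):
    """Transcribe a character sequence for use in one URI element."""
    # Step 1: a Python str is already a sequence of ISO 10646 characters.
    # Step 2: normalize (assumption: "Form CC" is what Python calls NFC).
    normalized = unicodedata.normalize("NFC", text)
    # Step 3: encode the result as UTF-8.
    octets = normalized.encode("utf-8")
    # Step 4: %HH-encode every octet that is not an unreserved character.
    return "".join(
        chr(o) if o in UNRESERVED else "%{:02X}".format(o)
        for o in octets
    )
```

Under these assumptions, the precomposed and the decomposed spelling
of a name such as "D" U+00FC "rst" both yield "D%C3%BCrst".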

2.2 8URI syntax

This specification defines a new protocol element, called an '8URI'.
An 8URI is similar to a URI in its use, but is different in that it is
solely for use in network protocols that allow the transport of octets
outside of the range allowed within URIs. An 8URI MAY have 8-bit
octets within it. An 8URI is represented using the same methods (1-4)
defined in section 2.1, but in step (4), octets with the leading bit
on need not be encoded; all characters outside of those explicitly
disallowed in RFC 2396 (reserved, delimiters, white space, unwise
special characters) MAY be represented directly by their UTF-8
encoding.

An '8URI' for characters outside of the ASCII range will use
considerably less space than the corresponding hex-encoded URI.
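
The difference between the two forms, and the saving in space, can be
sketched as follows. This is a minimal illustration, not a definitive
implementation: the name to_8uri_component is hypothetical, "NFC" is
assumed to stand for the normalization form named in section 2.1, and
only the unreserved characters of [RFC 2396] are passed through among
the ASCII octets.

```python
import unicodedata

# Unreserved characters of RFC 2396: alphanumerics plus the "mark" set.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"
)

def to_8uri_component(text):
    """Return the 8URI form of `text` as a sequence of octets (bytes)."""
    octets = unicodedata.normalize("NFC", text).encode("utf-8")
    out = bytearray()
    for o in octets:
        if o >= 0x80 or o in UNRESERVED:
            # Octets with the leading bit on need not be encoded in an 8URI.
            out.append(o)
        else:
            # Other ASCII octets are still hex-encoded, as in a URI.
            out.extend("%{:02X}".format(o).encode("ascii"))
    return bytes(out)
```

For the name "D" U+00FC "rst" the sketch produces six octets, where
the hex-encoded URI form "D%C3%BCrst" needs ten characters.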

Even within 8URIs, any octet sequence which would likely yield
ambiguous or incorrect results when printed or displayed and then
subsequently typed by a user SHOULD be hex-encoded.

Internet protocols that currently allow the designation of a URI may
be extended at some point to allow 8URIs as well as URIs, but this
extension must be done explicitly. Section 4 lays out some of the
software guidelines that will allow the deployment of 8URIs in
existing Internet Protocols.

3. Software Requirements and Upgrade Strategy

Supporting URIs for non-ASCII characters requires cooperation from the
providers of several different components of URI software: software
that allows users to enter URIs, software that generates URIs,
software that displays URIs, and software that interprets URIs.

3.1 URI entry

One component of software that deals with URIs allows users to enter a
URI, e.g., by typing or dictation. For example, a person viewing a
visual representation of a URI (as a sequence of glyphs, in some
order, in some visual display) might use a keyboard entry method for
keys in that language to create the URI. For ASCII characters with
standard English keyboards, the process is simple, since there is
generally a simple correspondence between letters represented, keys
pressed, and internal system representation, but for other languages
the process is much more complex.

If the visual representation contains only those characters that are
allowed by the standard syntax of URIs [RFC 2396], the transcription
is simple. However, for all other sequences of characters, it is
RECOMMENDED that the entry result in characters, in logical order,
from the ISO 10646 character repertoire, encoded using the UTF-8
method [RFC 2279], and then subsequently encoded as necessary using
the URI hex-encoding. The set of octets that require encoding
depends on whether the result is a URI or an 8URI.

The characters the user has entered should be normalized according to
the rules in [RFC-DUERST]; for example, all accented characters should
be translated into their combined form, no extraneous BIDI
(bidirectional) marks should be left in the resulting stream, and
characters that are intended to represent Western European letters
should be transcribed into their ISO-8859-1 equivalents and not, for
example, as double-wide characters.
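
The three examples in the preceding paragraph might be approximated as
shown below. This sketch does not implement the rules of [RFC-DUERST];
the function name normalize_entry is hypothetical, and the sketch makes
three simplifying assumptions: "combined form" is taken to be Python's
"NFC", every directional mark or embedding control is removed (not only
extraneous ones), and "double-wide characters" are taken to be the
fullwidth forms U+FF01 through U+FF5E.

```python
import unicodedata

# Directional marks and embedding controls (assumption: all are removed).
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

# Translation table: drop bidi controls, map fullwidth forms to ASCII.
_ENTRY_TABLE = {ord(c): None for c in BIDI_CONTROLS}
_ENTRY_TABLE.update({cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)})

def normalize_entry(text):
    """Normalize characters a user has entered, before URI encoding."""
    cleaned = text.translate(_ENTRY_TABLE)
    # Compose base letters with their accents (assumed to be NFC).
    return unicodedata.normalize("NFC", cleaned)
```

For instance, a fullwidth "A" followed by "e", a combining acute
accent and a left-to-right mark becomes the two characters "A" and
U+00E9.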

Whether URI entry should result in a URI or an 8URI will depend on the
capability of the protocol or software to which the result will be
submitted.

3.2 URI generation

Systems that are offering resources through the Internet, where those
resources have logical names, sometimes offer the ability to generate
URIs for the resources they offer.  For example, some HTTP servers
offer the ability to generate a 'directory listing' for file
directories under their purview, and then to respond to the generated
URIs with the files. If the names of the files consist solely of
US-ASCII characters the transcription is simple, but other file
systems offer a wider variety of characters. Many currently deployed
systems do not transform the local character representation
of the underlying system before generating URIs.

For maximum interoperability, systems that generate resource
identifiers SHOULD translate the local encoding to UTF-8 and
hex-encode the results as appropriate for the URI or 8URI.
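
A server that builds a directory listing might apply this
recommendation roughly as follows. The sketch is illustrative: the
function name generate_uri_path and its parameters are hypothetical,
and the local encoding of the file system is assumed to be known to
the caller. It uses the quote function of Python's urllib.parse
module and produces the URI form, not the 8URI form.

```python
from urllib.parse import quote

# The "mark" characters of RFC 2396; quote() never encodes alphanumerics.
RFC2396_MARKS = "-_.!~*'()"

def generate_uri_path(local_name, local_encoding):
    """Turn a file name in the local encoding into a URI path segment."""
    # Translate the local encoding to characters ...
    text = local_name.decode(local_encoding)
    # ... then encode as UTF-8 and hex-encode what is not allowed.
    return quote(text, safe=RFC2396_MARKS, encoding="utf-8")
```

With this sketch, a file name stored as the ISO-8859-1 octets
"D" 0xFC "rst" is published as "D%C3%BCrst" and not as "D%FCrst".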

Whether the generated identifier should result in a URI or an 8URI
depends on the capability of the protocol or software to which the
result will be submitted.

This recommendation applies to HTTP servers as well as those systems
that generate and interpret URLs for FTP, gopher and the like.

3.3 Display of URIs

Many systems contain software that presents URIs to users as part of
their user interface (sometimes presenting 'friendly' URIs). This
section applies to this presentation, as well as to the strategy for
printing URIs in magazines, newspapers, or reading them over the
radio.

Software that displays identifiers to users should follow a general
principle: "Don't display something to a user that the user would not
be able to enter." The consequences of this principle require
judgement about the availability of software that implements the entry
methods described in section 3.1.

a) In situations where a viewer is not likely to have software that
implements non-ASCII character entry as described in section 3.1, any
octet not representable by a character allowed in [RFC 2396]
SHOULD be displayed as if it were hex-encoded.

b) In situations where a viewer _is_ likely to have such software,
sequences of octets MAY be displayed directly as the non-ASCII
character sequences they represent in UTF-8. Sequences of
%HH-encoding which correspond to non-ASCII characters MAY be displayed
directly without decoding, or may be treated as sequences of
hex-encoded UTF-8 and displayed as the characters they represent.
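
The two display cases might be sketched as follows. Both function
names are hypothetical. For brevity, display_plain only hex-encodes
octets with the high bit set; a fuller version would also cover the
ASCII characters that [RFC 2396] excludes. display_native decodes only
%HH sequences for octets of 0x80 and above, so that escapes such as
"%2F" keep their meaning.

```python
import re

# One or more consecutive %HH escapes whose value is 0x80 or above.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def display_plain(identifier_octets):
    """Case (a): show every octet with the high bit set as %HH."""
    return "".join(
        chr(o) if o < 0x80 else "%{:02X}".format(o)
        for o in identifier_octets
    )

def display_native(uri):
    """Case (b): show hex-encoded UTF-8 as the characters it represents."""
    def decode_run(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the escapes exactly as they are.
            return match.group(0)
    return _HIGH_ESCAPES.sub(decode_run, uri)
```

The decision between the two functions is the judgement described
above and is deliberately left to the caller.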

3.4 Interpretation of URIs

Software that interprets URIs as the names of local resources SHOULD
accept multiple renditions of the URIs in the case where those
resource names might have non-ASCII representations; this includes
accepting both the URI syntax of section 2.1 and the 8URI form in
section 2.2.

Just as allowing case-insensitive file names makes URIs more robust
(because the person viewing the URI might type the case differently
than it is displayed), similarly, URI-interpreting software should be
generous in allowing all of the possible representations that might
result from the recommendations in section 3.1. In addition, it is
useful if unaccented characters are accepted, when possible, as
aliases for accented characters, and if other equivalences are made.
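
One possible way to accept unaccented and differently cased spellings
is to look names up through a folded key. The sketch below is an
assumption about how such aliasing could be done, not a method
prescribed by this document; the names fold_accents,
build_alias_index and lookup are hypothetical.

```python
import unicodedata

def fold_accents(name):
    """Remove combining marks so that accented letters match plain ones."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def build_alias_index(local_names):
    """Map each folded, lower-cased key to the first local name seen."""
    index = {}
    for name in local_names:
        index.setdefault(fold_accents(name).lower(), name)
    return index

def lookup(index, requested):
    """Return the local name that `requested` is an alias for, or None."""
    return index.get(fold_accents(requested).lower())
```

Two local names that fold to the same key conflict with each other;
this sketch keeps the first one, which may not be the right policy for
every site (see section 5).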

For example, a URI which contains a string in Japanese might actually
arrive with a variety of encodings, due to the variety of
interpretations of deployed systems. While this recommendation
specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in
practice many URIs will be presented which contain characters encoded
using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to
transition to the new regime, URI-interpreting software for Japanese
should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings.
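
A transitional interpreter for Japanese resource names might be
sketched as follows. The function names and the order in which the
encodings are tried are assumptions made for this illustration; the
codec names are those of Python's standard library, and "NFC" is
assumed to stand for the normalization form named in section 2.1.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# UTF-8 first, since it is the canonical encoding; then the legacy ones.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")

def candidate_names(request_path):
    """List the names a received path could stand for, best guess first."""
    # Undo any %HH encoding; raw 8-bit octets pass through unchanged.
    octets = unquote_to_bytes(request_path)
    names = []
    for encoding in CANDIDATE_ENCODINGS:
        try:
            name = unicodedata.normalize("NFC", octets.decode(encoding))
        except UnicodeDecodeError:
            continue  # the octets are not valid in this encoding
        if name not in names:
            names.append(name)
    return names

def resolve(request_path, local_names):
    """Return the first candidate that names a local resource, or None."""
    for name in candidate_names(request_path):
        if name in local_names:
            return name
    return None
```

Because an octet sequence may be valid in more than one encoding, a
candidate is accepted only when it matches an existing local name.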

4. Upgrading

As this recommendation places further constraints on software for
which many instances are already deployed, it is important to
introduce upgrades carefully.

4.1 Upgrade sequence

The deployment strategy (for both hex-encoded and 8URIs) is in the
following sequence:

  Interpret  -->   Generation
              |
              +->  Entry   --> Display

Initially, it is most important to upgrade the URI interpreting
software according to the recommendations of section 3.4.

The upgrade of generating software to use UTF-8 (instead of a local
encoding) should happen only after the service is upgraded to accept
such URIs. Similarly, 8URIs should only be generated when the service
accepts 8URIs and the intervening infrastructure and protocol is known
to transport them safely.

Similarly, once interpreting software has been modified to accept
alternative encodings, then the entry software can also transition.

Display software should be upgraded only after upgraded entry software
has been widely deployed to the population that will see the displayed
result.

These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.

4.2 Examples: upgrading URIs within various contexts

4.2.1 URIs within HTTP

The HTTP protocol [RFC HTTP] includes the URI of the resource being
accessed as the 'Request-URI' in the request line. Most deployed HTTP
servers that access resources with localized non-ASCII naming do not
translate the Request-URI's character encoding to a local form, and
will need to be upgraded to accept such aliases.  Most deployed HTTP
servers do not restrict the octets allowed in the protocol, and
so an upgrade from URI to 8URI will not be difficult.

4.2.2 URIs within HTML and XML

Within an HTML [HTML4] or XML [XML1] document the primary difficulty
for the use of 8URIs is that the document itself may be represented
and labelled with a charset other than UTF-8. In such situations, the
document as a whole might be transcoded into another
encoding. However, the hex-encoded URIs following the recommendations
of this document should pass from the recipient of the document back
into the URI interpreting infrastructure without change.
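
The effect of transcoding can be seen in a small experiment. The
sketch below is illustrative; the helper name transcode and the two
sample links are hypothetical. It converts a document from UTF-8 to
ISO-8859-1 and compares the octets of a hex-encoded link with those of
a link written with an 8-bit character.

```python
def transcode(document_octets, from_charset, to_charset):
    """Re-encode a whole document, as a gateway or editor might."""
    return document_octets.decode(from_charset).encode(to_charset)

# The same link, once hex-encoded and once with the character itself.
HEX_LINK = '<a href="/People/D%C3%BCrst/">'
RAW_LINK = '<a href="/People/D\u00fcrst/">'

hex_utf8 = HEX_LINK.encode("utf-8")
raw_utf8 = RAW_LINK.encode("utf-8")

hex_latin1 = transcode(hex_utf8, "utf-8", "iso-8859-1")
raw_latin1 = transcode(raw_utf8, "utf-8", "iso-8859-1")

# The hex-encoded link is pure ASCII, so its octets do not change.
print(hex_latin1 == hex_utf8)
# The 8-bit link changes: octets C3 BC become the single octet FC.
print(raw_latin1 == raw_utf8)
```

A recipient that passes the octets of the second link on without
regard to the document's charset would therefore request a different
identifier from the one the author wrote.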

4.2.3 URIs within email and text/plain

E-mail messages are frequently transmitted as text/plain; the use of
octets outside of US-ASCII requires an encoding of the message using
quoted-printable or base64. In addition, text messages that arrive
with charset=utf-8 may be transcoded into a local character
representation before storage or display. Thus, URIs within email
messages should likely remain within the limited repertoire rather
than the 8URI representation.

However, it is now common for email software to recognize embedded
URIs within email messages and present them specially, e.g., as
hypertext links. Within such systems, it is reasonable to upgrade
the email display software to present URIs as the natural characters
they represent, as long as the entry software in the same system
has been upgraded.

5. Security Considerations

If URI entry software is upgraded to normalize the characters entered,
but the URI interpreting software has not been upgraded to treat
multiple forms as equivalent, this introduces the possibility of
"spoofing": having different resources whose URIs look the same but
are not the same. For example, if "abc" and "def" are different
encodings of the same visual characters, "http://a.com/abc" and
"http://a.com/def" might look the same to users, might display the
same, and different URI entry software components might generate
different ones; e.g., EUC-JP-based Japanese URI entry software might
generate one encoding, while UTF-8-based software would generate
another one. In this case, if "a.com" allows multiple users to
establish different areas, it might be possible for someone other than
the owner of "http://a.com/abc" to put different content at
"http://a.com/def" and "spoof" the results.

Conceptually, this is no different from the problems surrounding the
use of case-insensitive web servers.  For example, a popular web page
with a mixed case name (http://big.site/PopularPage) might be
"spoofed" by someone who obtains access to
(http://big.site/popularpage).  However, the introduction of the
Unicode canonicalization rules in conjunction with mapping from
multiple possible native encodings might result in aliasing which is
difficult to determine in advance. Administrators of large sites which
allow independent users to create subareas may need to be careful that
the aliasing rules do not create such conflicts.
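
An administrator might check for such conflicts before they occur by
grouping the names of subareas under the key that the interpreting
software would treat as equivalent. The sketch below is one possible
check under stated assumptions: the function name
find_alias_conflicts is hypothetical, and the equivalence is assumed
to be normalization to "NFC" followed by case folding. A site whose
server applies other aliasing rules would need a different key.

```python
import unicodedata
from collections import defaultdict

def find_alias_conflicts(names):
    """Return groups of distinct names that the server would alias."""
    groups = defaultdict(list)
    for name in names:
        # Assumed equivalence: same text after NFC and case folding.
        key = unicodedata.normalize("NFC", name).casefold()
        groups[key].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Applied to the names "PopularPage", "popularpage" and "other", the
sketch reports the first two as one conflicting group.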

6. Acknowledgements

Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne,
Roy Fielding and many others for help with this document.

7. Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works.  However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

8. Authors' addresses

        Larry Masinter
        Xerox Corporation
        3333 Coyote Hill Road
        Palo Alto, CA 94304
        masinter@parc.xerox.com
        http://www.parc.xerox.com/masinter
        Fax: +1 650 812-4333

        Martin J. Duerst
        W3C/Keio University
        5322 Endo, Fujisawa
        252-8520 Japan
        duerst@w3.org
        http://www.w3.org/People/D%C3%BCrst/
        Tel/Fax: +81 466 49 1170

9. References

[RFC 2119]    S. Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", March 1997.

[RFC 2279]    F. Yergeau. "UTF-8, a transformation format of ISO 10646."
              January 1998.

[RFC 2396]    T.Berners-Lee, R.Fielding, L.Masinter. "Uniform
              Resource Identifiers (URI): Generic Syntax." August,
              1998.

[UNI15]       M.Davis, "Unicode Normalization Forms", Draft Unicode
              Technical Report #15, August 1998.

[RFC HTTP]    R.Fielding, J.Gettys, et al, "Hypertext Transfer Protocol --
              HTTP/1.1", <draft-ietf-http-v11-spec-rev-04.txt>.

[RFC 2141]    R. Moats, "URN Syntax", May 1997.

[RFC 2192]    C. Newman, "IMAP URL Scheme", September 1997.

[RFC 2384]    R. Gellens, "POP URL Scheme", August 1998.

[RFC FTP]     B. Curtis, "Internationalization of the File Transfer Protocol",
              <draft-ietf-ftpext-intl-ftp-05.txt>.

[HTML4]       "HTML 4.0", World Wide Web Consortium,
              <http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2>.

[XML1]        "XML 1.0", World Wide Web Consortium Recommendation,
              <http://www.w3.org/TR/REC-xml#sec-external-ent>.