draft-ietf-ftpext-intl-ftp-00

FTPEXT Working Group                              B. Curtin
INTERNET DRAFT           Defense Information Systems Agency
Expires 26 May 1997                        26 November 1996


       Internationalization of the File Transfer Protocol
                  <draft-ietf-ftpext-intl-ftp-00.txt>

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are
   working documents of the Internet Engineering Task Force
   (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of
   six months. Internet-Drafts may be updated, replaced, or
   obsoleted by other documents at any time.  It is not
   appropriate to use Internet-Drafts as reference material or
   to cite them other than as a "working draft" or "work in
   progress".

   To learn the current status of any Internet-Draft, please
   check the 1id-abstracts.txt listing contained in the
   Internet-Drafts Shadow Directories on ds.internic.net (US
   East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West
   Coast), or munnari.oz.au (Pacific Rim).

   Distribution of this document is unlimited.  Please send
   comments to the FTP Extension working group (FTPEXT-WG) of
   the Internet Engineering Task Force (IETF) at
   <ftp-wg@hops.ag.utk.edu>. Subscription address is
   <ftp-wg-request@hops.ag.utk.edu>. Discussions of the group
   are archived at <URL:ftp://hops.ag.utk.edu/ftp-wg/archives/>.


Abstract

   The File Transfer Protocol, as defined in RFC 959 [RFC959]
   and RFC 1123 Section 4 [RFC1123], is one of the oldest and
   most widely used protocols on the Internet. The protocol's primary
   character set, 7 bit ASCII, has served the protocol well
   through the early growth years of the Internet. However, as
   the Internet becomes more global, there is a need to support
   character sets beyond 7 bit ASCII.

   This document addresses the internationalization (I18n) of
   FTP, which includes supporting the multiple character sets
   found throughout the Internet community.  This is achieved
   by extending the FTP specification and giving recommendations
   for proper internationalization support.



                              Expires 26 May 1997            [Page 1]


INTERNET DRAFT    FTP Internationalization 26 November, 1996




Table of Contents


1. INTRODUCTION

1.1 SCOPE

2.0 INTERNATIONALIZATION

2.1 INTERNATIONAL CHARACTER SET

2.2 TRANSFER ENCODING

2.3 TRANSLATIONS

2.3.1 ISO/IEC 8859-8 EXAMPLE

2.3.2 VENDOR CODEPAGE EXAMPLE

3. CONFORMANCE

3.1 INTERNATIONAL SERVERS

3.1.1 SERVER STRATEGIES EXAMPLES

3.2 INTERNATIONAL CLIENTS

4.0 SECURITY

5.0 ACKNOWLEDGEMENTS

BIBLIOGRAPHY

AUTHOR'S ADDRESS

1.   Introduction

   As the Internet grows throughout the world the requirement to
   support character sets outside of the ASCII / Latin-1
   character set becomes ever more urgent.  For FTP, because of
   the large installed base, it is paramount that this be done
   without breaking existing clients and servers.  This document
   addresses this need. In doing so it defines a solution which
   will still allow the installed base to interoperate with new
   international clients and servers.

1.1  Scope

   This document enhances the capabilities of the File Transfer
   Protocol by defining a Universal Character Set (UCS), a UCS
   transformation format (UTF), and removing the 7-bit
   restrictions on pathnames used in client commands and server
   responses.

2.0  Internationalization

   The File Transfer Protocol was developed in a period when the
   predominant character sets were 7 bit ASCII and 8 bit EBCDIC.
   Today these character sets cannot support the wide range of
   characters needed by multinational systems. Given that there
   are a number of character sets in current use that provide
   more characters than 7-bit ASCII, it makes sense to decide on
   a convenient way to represent the union of those
   possibilities. Working globally requires either support for a
   number of character sets and the ability to translate between
   them, or the use of a single preferred character set. To
   assure interoperability this document recommends the latter
   approach and defines a single character set, in addition to
   NVT ASCII and EBCDIC, which is understandable by all systems.
   For FTP this character set will be ISO/IEC 10646:1993 and the
   UTF-8 encoding.  For support of global compatibility it is
   strongly recommended that clients and servers use UTF-8
   encoding when performing operations on filenames. Clients and
   servers are, however, under no obligation to perform any
   translation on the contents of a file for operations such as
   STOR or RETR.

   A more thorough description of UTF-8, ISO/IEC 10646, and
   Unicode, beyond what is given in this document, can be found
   in RFC 2044 [RFC2044].

2.1  International Character Set

   The character set defined for international support of FTP
   shall be the Universal Character Set as defined in ISO
   10646:1993 [ISO-10646] as amended. This standard incorporates
   the script and symbol character sets of many existing
   international, national, and corporate standards. ISO/IEC
   10646 defines two alternate forms of encoding, UCS-4 and
   UCS-2. UCS-4 is a four byte (31 bit) encoding containing
   2**31 code positions divided into 128 groups of 256 planes.
   Each plane consists of 256 rows of 256 cells. UCS-2 is a 2
   byte (16 bit) character set consisting of plane zero or the
   Basic Multilingual Plane (BMP).  Currently, no codesets have
   been defined outside of the 2 byte BMP.
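
   The group, plane, row and cell structure described above can be
   illustrated with a short sketch. It assumes the usual reading of
   a UCS-4 code as four octets (group, plane, row, cell, from most
   to least significant). The names ucs4_position, ucs4_split and
   ucs4_in_bmp are hypothetical and are not defined by ISO/IEC 10646
   or by this document.

```c
/* Sketch only: the type and function names are illustrative. */
struct ucs4_position {
    unsigned int group;  /* 0..127, the top bit of UCS-4 is zero */
    unsigned int plane;  /* 0..255 */
    unsigned int row;    /* 0..255 */
    unsigned int cell;   /* 0..255 */
};

/* Split a UCS-4 code into its group, plane, row and cell octets. */
struct ucs4_position ucs4_split(unsigned long ucs4)
{
    struct ucs4_position p;

    p.group = (unsigned int) ((ucs4 >> 24) & 0x7FUL);
    p.plane = (unsigned int) ((ucs4 >> 16) & 0xFFUL);
    p.row   = (unsigned int) ((ucs4 >> 8) & 0xFFUL);
    p.cell  = (unsigned int) (ucs4 & 0xFFUL);
    return p;
}

/* A code is in the Basic Multilingual Plane when its group and
   plane are both zero, i.e. when it fits in 16 bits (UCS-2). */
int ucs4_in_bmp(unsigned long ucs4)
{
    return ucs4 <= 0xFFFFUL;
}
```

   For example, the code 0x000005D5 splits into group 0, plane 0,
   row 0x05 and cell 0xD5, so it lies in the BMP.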

   The Unicode standard version 2.0 [UNICODE] is consistent with
   the UCS-2 subset of ISO/IEC 10646. The Unicode standard
   version 2.0 includes the repertoire of IS 10646 characters,
   amendments 1-7 of IS 10646, and editorial and technical
   corrigenda.

     NOTE -- implementers should be aware that ISO 10646 is
     amended from time to time; 4 amendments have been adopted since the
     initial 1993 publication, none of which significantly
     affects this specification.  A fifth amendment, now under
     consideration, will introduce incompatible changes to the
     standard: 6656 Korean Hangul syllables allocated between
     code positions 3400 and 4DFF (hexadecimal) will be moved to
     new positions (and 4516 new syllables added), thus making
     references to the old positions invalid.  Since the Unicode
     consortium has already adopted the corresponding amendment
     in Unicode 2.0, adoption of DAM 5 is considered likely and
     implementers should probably consider the old code positions
     as already invalid.  Despite this one-time change, the
     relevant standard bodies have committed themselves not to
     change any allocated code position in the future.  To encode
     Korean Hangul irrespective of these changes, the conjoining
     Hangul Jamo in the range 1100-11F9 can be used.

2.2  Transfer Encoding

   UCS Transformation Format 8 (UTF-8) [UTF-8], also known as
   UTF-2, will be used as a transfer encoding to transmit the
   international character set. UTF-8 is a file safe encoding
   which avoids the use of byte values which have special
   significance during the parsing of file name character
   strings. UTF-8 is an 8 bit encoding of the characters in the
   UCS. Some of UTF-8's benefits are that it is compatible with
   7 bit ASCII, so it doesn't affect programs that give special
   meanings to various ASCII characters; it is resistant to
   synchronization errors, because the first byte of a character
   can always be told apart from a continuing byte; and it has
   enough space to support large character sets.

   UTF-8 encoding represents each UCS character as a sequence of
   1 to 6 bytes. In a one byte sequence the most significant bit
   is ZERO. In a sequence of more than one byte, the first byte
   begins with as many ONE bits as there are bytes in the
   sequence, followed by a ZERO bit. For example, the first byte
   of a 3 byte UTF-8 sequence has 1110 as its most significant
   bits. Each additional byte (continuing byte) in the sequence
   has a ONE bit followed by a ZERO bit as its most significant
   bits. The remaining free bit positions in the first byte and
   in the continuing bytes carry the bits of the UCS character
   code. The relationship between UCS and UTF-8 is demonstrated
   in the following table:

   UCS-4 range              UTF-8 byte sequence
   0000 0000-0000 007F      0xxxxxxx
   0000 0080-0000 07FF      110xxxxx 10xxxxxx
   0000 0800-0000 FFFF      1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                            10xxxxxx
   0400 0000-7FFF FFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                            10xxxxxx 10xxxxxx
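
   The table above can be read in two directions: from a UCS-4 code
   to the number of bytes its encoding needs, and from the first
   byte of a sequence to the number of bytes that sequence
   contains. The following is a minimal sketch of both lookups; the
   function names utf8_encoded_length and utf8_sequence_length are
   hypothetical and do not appear elsewhere in this document.

```c
/* Number of bytes UTF-8 needs for one UCS-4 code, following the
   ranges in the table. Returns 0 for codes above 0x7FFFFFFF. */
int utf8_encoded_length(unsigned long ucs4)
{
    if (ucs4 <= 0x0000007FUL) return 1;
    if (ucs4 <= 0x000007FFUL) return 2;
    if (ucs4 <= 0x0000FFFFUL) return 3;
    if (ucs4 <= 0x001FFFFFUL) return 4;
    if (ucs4 <= 0x03FFFFFFUL) return 5;
    if (ucs4 <= 0x7FFFFFFFUL) return 6;
    return 0;
}

/* Number of bytes announced by the first byte of a sequence.
   Returns 0 if the byte cannot start a sequence (a continuing
   byte 10xxxxxx, or the values 0xFE and 0xFF). */
int utf8_sequence_length(unsigned char first)
{
    if ((first & 0x80) == 0x00) return 1;
    if ((first & 0xE0) == 0xC0) return 2;
    if ((first & 0xF0) == 0xE0) return 3;
    if ((first & 0xF8) == 0xF0) return 4;
    if ((first & 0xFC) == 0xF8) return 5;
    if ((first & 0xFE) == 0xFC) return 6;
    return 0;
}
```

   Both functions agree for any correctly encoded character: the
   length computed from the code equals the length announced by
   the first byte of its encoding.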

   A beneficial property of UTF-8 is that its single byte
   sequence is consistent with the ASCII character set. This
   feature will allow a transition where old ASCII-only clients
   can still interoperate with new servers which support the
   UTF-8 encoding.

   Another feature is that the encoding rules make it very
   unlikely that a character sequence from a different character
   set will be mistaken for a UTF-8 encoded character sequence.
   Clients and servers can use a simple routine to determine if
   the byte sequence being exchanged is valid UTF-8:

   int utf8_valid(const unsigned char *buf, unsigned int len)
   {
     const unsigned char *endbuf = buf + len;
     int trailing = 0;  /* continuing bytes still expected */

     while (buf != endbuf)
     {
          unsigned char c = *buf++;
          if (trailing)
          {
               /* inside a sequence: only 10xxxxxx is allowed */
               if ((c&0xC0) == 0x80)  trailing--;
               else                   return 0;
          }
          else
          {
               /* first byte: its leading bits give the length */
               if      ((c&0x80) == 0x00)  continue;
               else if ((c&0xE0) == 0xC0)  trailing = 1;
               else if ((c&0xF0) == 0xE0)  trailing = 2;
               else if ((c&0xF8) == 0xF0)  trailing = 3;
               else if ((c&0xFC) == 0xF8)  trailing = 4;
               else if ((c&0xFE) == 0xFC)  trailing = 5;
               else                        return 0;
          }
     }
     return trailing == 0;  /* 0 if the last sequence is cut short */
   }

2.3  Translations

   Translation from the local filesystem character set to UTF-8
   will normally involve a two step process. First translate the
   local character set to the UCS; then translate the UCS to
   UTF-8.

   The first step in the process can be performed by maintaining
   a translation table which includes the local character set
   code and the corresponding UCS code. For instance the ISO/IEC
   8859-8 [ISO-8859] code for the Hebrew letter "VAV" is 0xE5.
   The corresponding 4 byte ISO/IEC 10646 code is 0x000005D5.
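
   A translation table of this kind might be sketched as follows.
   The table is deliberately partial: it lists only the first six
   Hebrew letters of ISO/IEC 8859-8 (codes 0xE0 to 0xE5) and passes
   7 bit ASCII through unchanged, whereas a real table would list
   every assigned code. The names map_entry, local_to_ucs4,
   ucs4_to_local and UCS4_UNMAPPED are hypothetical.

```c
#include <stddef.h>

#define UCS4_UNMAPPED 0xFFFFFFFFUL  /* illustrative "no mapping" value */

struct map_entry {
    unsigned char local;  /* code in the local character set */
    unsigned long ucs4;   /* corresponding UCS-4 code */
};

/* Partial ISO/IEC 8859-8 table, for illustration only. */
static const struct map_entry iso8859_8_table[] = {
    { 0xE0, 0x000005D0UL },  /* HEBREW LETTER ALEF  */
    { 0xE1, 0x000005D1UL },  /* HEBREW LETTER BET   */
    { 0xE2, 0x000005D2UL },  /* HEBREW LETTER GIMEL */
    { 0xE3, 0x000005D3UL },  /* HEBREW LETTER DALET */
    { 0xE4, 0x000005D4UL },  /* HEBREW LETTER HE    */
    { 0xE5, 0x000005D5UL }   /* HEBREW LETTER VAV   */
};

#define TABLE_LEN (sizeof iso8859_8_table / sizeof iso8859_8_table[0])

/* Local code to UCS-4; UCS4_UNMAPPED if the table has no entry. */
unsigned long local_to_ucs4(unsigned char code)
{
    size_t i;

    if (code <= 0x7F)
        return (unsigned long) code;  /* ASCII maps to itself */
    for (i = 0; i < TABLE_LEN; i++)
        if (iso8859_8_table[i].local == code)
            return iso8859_8_table[i].ucs4;
    return UCS4_UNMAPPED;
}

/* UCS-4 to local code; -1 if the table has no entry. */
int ucs4_to_local(unsigned long ucs4)
{
    size_t i;

    if (ucs4 <= 0x7FUL)
        return (int) ucs4;
    for (i = 0; i < TABLE_LEN; i++)
        if (iso8859_8_table[i].ucs4 == ucs4)
            return (int) iso8859_8_table[i].local;
    return -1;
}
```

   The same table serves both directions, so a name that is
   translated to the UCS and back yields the original local code.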

   The next step is to translate the UCS character code to the
   UTF-8 encoding. The following routine can be used to
   determine and encode the correct number of bytes based on the
   UCS-4 character code:

   /* Converts ucs4_len UCS-4 character codes to UTF-8. utf8_buf
      must have room for up to 6 bytes per character code. Returns
      the number of bytes written to utf8_buf. */
   int ucs4_to_utf8 (const unsigned long *ucs4_buf,
                     unsigned int ucs4_len, unsigned char *utf8_buf)
   {
     const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
     unsigned char *utf8_start = utf8_buf;
     unsigned long ucs4_ch;

     while (ucs4_buf != ucs4_endbuf)
       {
         ucs4_ch = *ucs4_buf++;
         if ( ucs4_ch <= 0x7FUL ) /* ASCII chars no conversion needed */
           *utf8_buf++ = (unsigned char) ucs4_ch;
         else if ( ucs4_ch <= 0x07FFUL ) /* In the 2 byte utf-8 range */
           {
             *utf8_buf++ = (unsigned char) (0xC0UL + (ucs4_ch/0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL + (ucs4_ch%0x40UL));
           }
         else if ( ucs4_ch <= 0xFFFFUL ) /* In the 3 byte utf-8 range.
                                            The values 0x0000FFFE,
                                            0x0000FFFF and
                                            0x0000D800 - 0x0000DFFF do
                                            not occur in UCS-4 */
           {
             *utf8_buf++ = (unsigned char) (0xE0UL
                           + (ucs4_ch/0x1000UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x40UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL + (ucs4_ch%0x40UL));
           }
         else if ( ucs4_ch <= 0x1FFFFFUL ) /* In the 4 byte utf-8
                                              range */
           {
             *utf8_buf++ = (unsigned char) (0xF0UL
                           + (ucs4_ch/0x040000UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x1000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x40UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL + (ucs4_ch%0x40UL));
           }
         else if ( ucs4_ch <= 0x03FFFFFFUL ) /* In the 5 byte utf-8
                                                range */
           {
             *utf8_buf++ = (unsigned char) (0xF8UL
                           + (ucs4_ch/0x01000000UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x040000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x1000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x40UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL + (ucs4_ch%0x40UL));
           }
         else if ( ucs4_ch <= 0x7FFFFFFFUL ) /* In the 6 byte utf-8
                                                range */
           {
             *utf8_buf++ = (unsigned char) (0xFCUL
                           + (ucs4_ch/0x40000000UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x01000000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x040000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x1000UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL
                           + ((ucs4_ch/0x40UL)%0x40UL));
             *utf8_buf++ = (unsigned char) (0x80UL + (ucs4_ch%0x40UL));
           }
         /* Codes above 0x7FFFFFFF are not UCS-4 and are skipped */
       }
     return (int) (utf8_buf - utf8_start);
   }

   When moving from UTF-8 encoding to the local character set
   the reverse procedure is used. First the UTF-8 encoding is
   transformed into the UCS-4 character set. The UCS-4 is then
   converted to the local character set from a translation table
   (i.e. the opposite of the table used to form the UCS-4
   character code).

   To convert from UTF-8 to UCS-4 the free bits (those that do
   not define UTF-8 sequence size or signify continuation bytes)
   in a UTF-8 sequence are concatenated as a bit string. The
   bits are then distributed into a four byte sequence starting
   from the least significant bits. Any bit positions in the
   four byte sequence that are left unassigned are filled with
   ZERO bits. The following routine converts the UTF-8 encoding
   to UCS-4 character codes:

   /* Converts utf8_len bytes of UTF-8 to UCS-4 character codes.
      The input is expected to have passed utf8_valid. Returns the
      number of codes written to ucs4_buf, or -1 if a sequence has
      an invalid first byte or is cut short. */
   int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                     const unsigned char *utf8_buf)
   {
     const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
     unsigned long *ucs4_start = ucs4_buf;
     const unsigned char *b;  /* bytes of the current sequence */
     unsigned int n;          /* length of the current sequence */

     while (utf8_buf != utf8_endbuf)
       {
         b = utf8_buf;

         /* The first byte gives the length of the sequence */
         if      ((b[0] & 0x80) == 0x00)  n = 1;
         else if ((b[0] & 0xE0) == 0xC0)  n = 2;
         else if ((b[0] & 0xF0) == 0xE0)  n = 3;
         else if ((b[0] & 0xF8) == 0xF0)  n = 4;
         else if ((b[0] & 0xFC) == 0xF8)  n = 5;
         else if ((b[0] & 0xFE) == 0xFC)  n = 6;
         else                             return -1;

         if ((unsigned int) (utf8_endbuf - utf8_buf) < n)
           return -1;  /* sequence is cut short */

         if (n == 1)       /* ASCII chars no conversion needed */
           *ucs4_buf++ = (unsigned long) b[0];
         else if (n == 2)  /* In the 2 byte utf-8 range */
           *ucs4_buf++ = ((b[0] - 0xC0UL) * 0x40UL)
                         + (b[1] - 0x80UL);
         else if (n == 3)  /* In the 3 byte utf-8 range */
           *ucs4_buf++ = ((b[0] - 0xE0UL) * 0x1000UL)
                         + ((b[1] - 0x80UL) * 0x40UL)
                         + (b[2] - 0x80UL);
         else if (n == 4)  /* In the 4 byte utf-8 range */
           *ucs4_buf++ = ((b[0] - 0xF0UL) * 0x040000UL)
                         + ((b[1] - 0x80UL) * 0x1000UL)
                         + ((b[2] - 0x80UL) * 0x40UL)
                         + (b[3] - 0x80UL);
         else if (n == 5)  /* In the 5 byte utf-8 range */
           *ucs4_buf++ = ((b[0] - 0xF8UL) * 0x01000000UL)
                         + ((b[1] - 0x80UL) * 0x040000UL)
                         + ((b[2] - 0x80UL) * 0x1000UL)
                         + ((b[3] - 0x80UL) * 0x40UL)
                         + (b[4] - 0x80UL);
         else              /* In the 6 byte utf-8 range */
           *ucs4_buf++ = ((b[0] - 0xFCUL) * 0x40000000UL)
                         + ((b[1] - 0x80UL) * 0x01000000UL)
                         + ((b[2] - 0x80UL) * 0x040000UL)
                         + ((b[3] - 0x80UL) * 0x1000UL)
                         + ((b[4] - 0x80UL) * 0x40UL)
                         + (b[5] - 0x80UL);
         utf8_buf += n;
       }
     return (int) (ucs4_buf - ucs4_start);
   }

2.3.1     ISO/IEC 8859-8 Example

   This example demonstrates mapping the ISO/IEC 8859-8 character
   set to UTF-8 and back to ISO/IEC 8859-8. As noted earlier,
   the Hebrew letter "VAV" is translated from the ISO/IEC 8859-8
   character code 0xE5 to the corresponding 4 byte ISO/IEC 10646
   code of 0x000005D5 by a simple lookup of a
   translation/mapping file.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

     1. Because the UCS-4 character is between 0x80 and 0x07FF it
        will map to a 2 byte UTF-8 sequence.
     2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40))
        = 0xD7.
     3. The second byte is defined by (0x80 + (0x000005D5 %
        0x40)) = 0x95.

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier, as follows:

     1. Because the first byte of the sequence, when the '&'
        operator with a value of 0xE0 is applied, will produce
        0xC0 (0xD7 & 0xE0 = 0xC0) the UTF-8 is a 2 byte sequence.
     2. The four byte UCS-4 character code is produced by
        (((0xD7 - 0xC0) * 0x40) + (0x95 - 0x80)) = 0x000005D5.

   Finally, the UCS-4 character code is translated to the ISO/IEC
   8859-8 character code (using the translation table which
   matches ISO/IEC 8859-8 to UCS-4) to produce the original
   0xE5 code for the Hebrew letter "VAV".
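
   The arithmetic of this example can be checked with a small
   sketch that handles only the 2 byte range (0x80 to 0x07FF). The
   helper names encode_2byte and decode_2byte are hypothetical;
   they isolate the 2 byte case of the routines given earlier.

```c
/* Encode a UCS-4 code in the range 0x80..0x07FF as two UTF-8
   bytes. The caller must ensure the code is in that range. */
void encode_2byte(unsigned long ucs4, unsigned char out[2])
{
    out[0] = (unsigned char) (0xC0UL + (ucs4 / 0x40UL));
    out[1] = (unsigned char) (0x80UL + (ucs4 % 0x40UL));
}

/* Decode a 2 byte UTF-8 sequence (110xxxxx 10xxxxxx) back to
   its UCS-4 code. */
unsigned long decode_2byte(const unsigned char in[2])
{
    return ((unsigned long) (in[0] - 0xC0) * 0x40UL)
           + (unsigned long) (in[1] - 0x80);
}
```

   Encoding 0x000005D5 with encode_2byte gives the bytes 0xD7 and
   0x95, and decode_2byte applied to those bytes gives 0x000005D5
   again.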

2.3.2     Vendor Codepage Example

   This example demonstrates the mapping of a codepage to UTF-8
   and back to a vendor codepage. Mapping between vendor
   codepages can be done in a very similar manner as described
   above. For instance both the PC and Mac codepages reflect the
   character set from the Thai standard TIS 620-2533. The
   character code on both platforms for the Thai letter "SO SO"
   is 0xAB. This character can then be mapped into the UCS-4 by
   way of a translation/mapping file to produce the UCS-4 code
   of 0x0E0B.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

     1. Because the UCS-4 character is between 0x0800 and 0xFFFF
        it will map to a 3 byte UTF-8 sequence.
     2. The first byte is defined by (0xE0 + (0x00000E0B /
        0x1000)) = 0xE0.
     3. The second byte is defined by (0x80 + ((0x00000E0B /
        0x40) % 0x40)) = 0xB8.
     4. The third byte is defined by (0x80 + (0x00000E0B % 0x40))
        = 0x8B.

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier, as follows:

     1. Because the first byte of the sequence, when the '&'
        operator with a value of 0xF0 is applied, will produce
        0xE0 (0xE0 & 0xF0 = 0xE0) the UTF-8 is a 3 byte sequence.
     2. The four byte UCS-4 character code is produced by
        (((0xE0 - 0xE0) * 0x1000) + ((0xB8 - 0x80) * 0x40) +
        (0x8B - 0x80)) = 0x00000E0B.


   Finally, the UCS-4 character code is translated to either the
   PC or Mac codepage character code (using the translation
   table which matches the codepage to UCS-4) to produce the
   original 0xAB code for the Thai letter "SO SO".
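
   The Thai example can be sketched in the same way. The sketch
   assumes that the codepage follows TIS 620-2533 for the Thai
   characters, in which case the assigned codes 0xA1 to 0xDA and
   0xDF to 0xFB differ from their UCS codes by a constant 0x0D60.
   Vendor-specific additions are not covered. The names
   thai_to_ucs4, encode_3byte, decode_3byte and NO_MAPPING are
   hypothetical.

```c
#define NO_MAPPING 0xFFFFFFFFUL  /* illustrative "no mapping" value */

/* Map a TIS 620-2533 code to UCS-4. Only ASCII and the assigned
   Thai ranges are handled; everything else is NO_MAPPING. */
unsigned long thai_to_ucs4(unsigned char code)
{
    if (code <= 0x7F)
        return (unsigned long) code;
    if ((code >= 0xA1 && code <= 0xDA) ||
        (code >= 0xDF && code <= 0xFB))
        return (unsigned long) code + 0x0D60UL;
    return NO_MAPPING;
}

/* Encode a UCS-4 code in the range 0x0800..0xFFFF as three
   UTF-8 bytes. The caller must ensure the code is in range. */
void encode_3byte(unsigned long ucs4, unsigned char out[3])
{
    out[0] = (unsigned char) (0xE0UL + (ucs4 / 0x1000UL));
    out[1] = (unsigned char) (0x80UL + ((ucs4 / 0x40UL) % 0x40UL));
    out[2] = (unsigned char) (0x80UL + (ucs4 % 0x40UL));
}

/* Decode a 3 byte UTF-8 sequence back to its UCS-4 code. */
unsigned long decode_3byte(const unsigned char in[3])
{
    return ((unsigned long) (in[0] - 0xE0) * 0x1000UL)
           + ((unsigned long) (in[1] - 0x80) * 0x40UL)
           + (unsigned long) (in[2] - 0x80);
}
```

   With these helpers the code 0xAB maps to 0x0E0B, which encodes
   as the bytes 0xE0 0xB8 0x8B and decodes back to 0x0E0B.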

3.   Conformance

   File names are sequences of bytes.  The character set of
   names that are valid UTF-8 sequences is UTF-8.  The character
   set of other names is undefined.

   Conforming internationalized clients and servers must either
   support UTF-8 or support a local character set which is
   supported by both the client and server. Clients and servers,
   unless otherwise configured to support a specific native
   character set, should check for a valid UTF-8 byte sequence
   to determine if the pathname being presented is UTF-8.

3.1  International Servers

   The 7-bit restriction on pathnames used in server responses
   is dropped.

   If servers and clients are not configured to share the same
   character set, servers should use UTF-8 encoding for all
   pathname transfers.

   There are several plausible UTF-8 server implementation
   strategies:

   - A server that copies filenames transparently from a local
   filesystem may continue to do so. It is then up to the local
   file creators to use UTF-8 filenames.

   - A server may translate filenames from a local character set
   to UTF-8. Each filename will be translated to UTF-8 before it
   is sent to the client.

   - UTF-8 filenames received from the client must be translated
   back if possible. Many existing servers interpret 8-bit
   filenames as being in the local character set. They may
   continue to do so for filenames that are not valid UTF-8.

   A high-quality translating server will use the following
   procedure:

      If fn is valid UTF-8 and can be translated to the local
      character set:
        Translate fn to the local character set, obtaining
        localfn.
        Attempt to operate on localfn.
          Upon success: Stop.
          Upon temporary error: Return an error message to the
          client. Stop.
        Attempt to operate on fn.
          Upon temporary error: Return an error message to the
          client. Stop.
      Otherwise:
        Attempt to operate on fn.
          Upon temporary error: Return an error message to the
          client. Stop.
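
   The procedure above might be sketched in C as follows. The
   sketch is not a complete server: the status values, the two
   callback types and the 1024 byte buffer are assumptions made
   for illustration, and the demo_ functions are stand-ins for a
   real translation routine and a real file operation.

```c
#include <stddef.h>
#include <string.h>

enum op_status { OP_SUCCESS, OP_TEMP_ERROR, OP_PERM_ERROR };

/* Hypothetical hooks supplied by the server implementation.
   translate_fn returns 1 if fn is valid UTF-8 and was translated
   into localfn, otherwise 0. */
typedef int (*translate_fn)(const char *fn, char *localfn, size_t size);
typedef enum op_status (*operate_fn)(const char *name);

/* Try the translated name first, then fall back to the raw name.
   The caller reports the returned status to the client. */
enum op_status operate_on_filename(const char *fn,
                                   translate_fn to_local,
                                   operate_fn operate)
{
    char localfn[1024];  /* arbitrary limit chosen for the sketch */
    enum op_status st;

    if (to_local(fn, localfn, sizeof localfn)) {
        st = operate(localfn);
        if (st == OP_SUCCESS || st == OP_TEMP_ERROR)
            return st;   /* success or temporary error: stop */
        /* otherwise fall through and try the name as received */
    }
    return operate(fn);
}

/* Stand-in translation: only "utf8-name" can be translated. */
int demo_to_local(const char *fn, char *localfn, size_t size)
{
    if (strcmp(fn, "utf8-name") != 0 || size < sizeof "local-name")
        return 0;
    strcpy(localfn, "local-name");
    return 1;
}

/* Stand-in file operation: two names exist, all others fail. */
enum op_status demo_operate(const char *name)
{
    if (strcmp(name, "local-name") == 0 || strcmp(name, "raw-name") == 0)
        return OP_SUCCESS;
    return OP_PERM_ERROR;
}
```

   With the stand-in functions, "utf8-name" succeeds through its
   translated form, "raw-name" succeeds as received, and any other
   name is reported as a failure.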

3.1.1 Server Strategies Examples

   There are a number of server strategies which might be
   employed:

   - Server's OS uses one fixed character set.  In this case,
   the server should easily be able to support built-in
   translation to UTF-8. This is trivial where that fixed
   character set is ASCII, ISO 8859-1, or UTF-8.

   - Server supports charset labeling of files and/or
   directories, such that different file names may have
   different charsets. The server should attempt to translate
   all file names to UTF-8, but if it can't then it should leave
   that name in its raw form.

   - Server's OS does not mandate the character set, but the
   administrator configures it in the FTP server. The server
   should be configured to use a particular translation table.
   (The table may be external, but the server might have some
   common choices built in.)  This also allows the flexibility
   of defining different charsets for different directories.

   - Server's OS does not mandate the character set and it is
   not configured. The server should simply use the raw bytes in
   the file name.  They might be ASCII or UTF-8.

   - Server is a mirror, and wants to look just like the site it
   is mirroring. It should save the exact file name bytes that
   it received from the main server.

3.2  International Clients

   The 7-bit restriction on pathnames used by client commands is
   dropped.


   While clients are not obligated to support all of the
   characters or the associated glyphs defined in the UCS,
   clients which are presented UTF-8 filenames by the server
   should parse UTF-8 correctly, and attempt to display the
   filename within the limitation of the resources available.
   Characters for which the client has no glyph might be
   displayed as question marks, as hexadecimal codes, or in some
   other way. This is a quality-of-implementation issue.
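
   As one illustration of such a fallback, the sketch below
   assumes a client whose display can show only 7 bit ASCII. Each
   ASCII byte is copied as is and each multi-byte UTF-8 character
   is replaced by a single question mark. The function name
   display_fallback is hypothetical, and the input is assumed to
   have already been checked to be valid UTF-8.

```c
#include <stddef.h>

/* Copies name to out, replacing every non-ASCII UTF-8 character
   by one '?'. out is always NUL terminated when outsize > 0.
   Returns the number of characters written, excluding the NUL. */
size_t display_fallback(const unsigned char *name, size_t len,
                        char *out, size_t outsize)
{
    size_t i = 0;  /* read position in name */
    size_t o = 0;  /* write position in out */

    if (outsize == 0)
        return 0;
    while (i < len && o + 1 < outsize) {
        unsigned char c = name[i++];

        if (c < 0x80) {
            out[o++] = (char) c;  /* ASCII: show as is */
        } else {
            out[o++] = '?';       /* one mark per character */
            /* skip the continuing bytes (10xxxxxx) that follow */
            while (i < len && (name[i] & 0xC0) == 0x80)
                i++;
        }
    }
    out[o] = '\0';
    return o;
}
```

   A client with richer display resources would instead look up a
   glyph for each decoded character and use the fallback only for
   characters it cannot render.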

   Client developers should be aware that it will be possible
   for pathnames to contain mixed characters (e.g.
   /Latin1DirectoryName/HebrewFileName). They should be prepared
   to handle the Bi-directional (BIDI) display of these
   character sets (i.e. left to right display for the Latin-1
   directory name and right to left display for the Hebrew
   filename).

   Character semantics of other names shall remain undefined. If
   a client detects that a server is non-UTF-8, it should change
   its display appropriately. How a client implementation
   handles non-UTF-8 names is a quality-of-implementation issue.
   It may try to assume some other encoding, give the user a
   chance to choose an encoding, or save encoding assumptions
   for a server from one FTP session to another.

   Client implementation notes: Many existing clients interpret
   8-bit filenames as being in the local character set. They may
   continue to do so for filenames that are not valid UTF-8.

4.0  Security

   This document addresses the support of character sets beyond
   1 byte. Conformance to this document should not induce a
   security threat.

5.0 Acknowledgements

   The following people have contributed to this document:

   Alex Belits
   D. J. Bernstein
   Martin J. Duerst
   Mark Harris
   Paul Hethmon
   Alun Jones
   James Matthews
   Keith Moore
   Benjamin Riefenstahl

   (and others from the FTPEXT working group)


Bibliography

[ISO-8859]

        ISO 8859.  International standard -- Information
        processing -- 8-bit single-byte coded graphic character
        sets -- Part 1: Latin alphabet No. 1 (1987) -- Part 2:
        Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
        No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
        Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
        Latin/Arabic alphabet (1987) -- Part 7: Latin/Greek
        alphabet (1987) -- Part 8: Latin/Hebrew alphabet (1988)
        -- Part 9: Latin alphabet No. 5 (1989) -- Part 10: Latin
        alphabet No. 6 (1992)

[ISO-10646]

        ISO/IEC 10646-1:1993. International standard --
        Information technology -- Universal multiple-octet coded
        character set (UCS) -- Part 1: Architecture and basic
        multilingual plane.

[RFC959]

        J. Postel, J. Reynolds, "File Transfer Protocol (FTP)",
        RFC 959, October 1985.

[RFC1123]

        R. Braden, "Requirements for Internet Hosts --
        Application and Support", RFC 1123, October 1989.

[RFC2044]

        F. Yergeau, "UTF-8, a transformation format of Unicode
        and ISO 10646", RFC 2044, October 1996.

[UNICODE]

        The Unicode Consortium, "The Unicode Standard - Version
        2.0", Addison-Wesley Developers Press, July 1996.

[UTF-8]

        ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
        Transformation Format 8 (UTF-8).

Author's Address

JIEO
Attn JEBBD (Bill Curtin)
Ft. Monmouth, N.J. 07703-5613

curtinw@ftm.disa.mil

Document type: This is an older version of an Internet-Draft that was ultimately published as RFC 2640. Expired & archived.