FTPEXT Working Group B. Curtin
INTERNET DRAFT Defense Information Systems Agency
Expires 16 December 1997 20 November 1997
Internationalization of the File Transfer Protocol
<draft-ietf-ftpext-intl-ftp-03.txt>
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months. Internet-Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use
Internet-Drafts as reference material or to cite them other than
as a "working draft" or "work in progress".
To view the entire list of current Internet-Drafts, please check
the 1id-abstracts.txt listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
(Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East
Coast), or ftp.isi.edu (US West Coast).
Distribution of this document is unlimited. Please send comments
to the FTP Extension working group (FTPEXT-WG) of the Internet
Engineering Task Force (IETF) at <ftp-wg@hops.ag.utk.edu>.
Subscription address is <ftp-wg-request@hops.ag.utk.edu>.
Discussions of the group are archived at
<URL:ftp://hops.ag.utk.edu/ftp-wg/archives/>.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119 [RFC 2119].
Expires 20 May 1998 [Page 1]
INTERNET DRAFT FTP Internationalization 20 November 1997
Abstract
The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
1123 Section 4 [RFC1123], is one of the oldest and widely used
protocols on the Internet. The protocol's primary character set, 7
bit ASCII, has served the protocol well through the early growth
years of the Internet. However, as the Internet becomes more
global, there is a need to support character sets beyond 7 bit
ASCII.
This document addresses the internationalization (I18n) of FTP,
which includes supporting the multiple character sets found
throughout the Internet community. This is achieved by extending
the FTP specification and giving recommendations for proper
internationalization support.
Table of Contents
1 INTRODUCTION......................................................3
2 INTERNATIONALIZATION..............................................3
2.1 International Character Set....................................4
2.2 Transfer Encoding..............................................4
3 CONFORMANCE.......................................................5
3.1 General........................................................5
3.2 International Servers..........................................7
3.3 International Clients..........................................7
4 SECURITY..........................................................7
5 ACKNOWLEDGMENTS...................................................8
6 GLOSSARY..........................................................8
7 BIBLIOGRAPHY......................................................9
8 AUTHOR'S ADDRESS.................................................10
APPENDIX A - IMPLEMENTATION CONSIDERATIONS........................A-1
A.1 General Considerations.......................................A-1
A.2 Transition Considerations....................................A-2
APPENDIX B - SAMPLE CODE AND EXAMPLES.............................B-1
B.1 Valid UTF-8 check............................................B-1
B.2 Conversions..................................................B-2
B.2.1 Conversion from local character set to UTF-8.............B-2
B.2.2 Conversion from UTF-8 to local character set.............B-4
B.2.3 ISO/IEC 8859-8 Example..................................B-6
B.2.4 Vendor Codepage Example..................................B-7
B.3 Pseudo Code for translating servers..........................B-8
Expires 20 May 1998 [Page 2]
INTERNET DRAFT FTP Internationalization 20 November 1997
1 Introduction
As the Internet grows throughout the world the requirement to
support character sets outside of the ASCII [ASCII] / Latin-1
[ISO-8859] character set becomes ever more urgent. For FTP,
because of the large installed base, it is paramount that this be
done without breaking existing clients and servers. This document
addresses this need. In doing so it defines a solution which will
still allow the installed base to interoperate with new
international clients and servers.
This document enhances the capabilities of the File Transfer
Protocol by removing the 7-bit restrictions on pathnames used in
client commands and server responses, recommends the use of a
Universal Character Set (UCS) ISO/IEC 10646 [ISO-10646], and
recommends a UCS transformation format (UTF) UTF-8 [UTF-8].
The recommendations made in this document are consistent with the
recommendations expressed by the 29 Feb - 1 Mar 1996 IAB Character
Set Workshop as expressed in RFC 2130 [RFC 2130].
2 Internationalization
The File Transfer Protocol was developed in a period when the
predominate character sets were 7 bit ASCII and 8 bit EBCDIC.
Today these character sets cannot support the wide range of
characters needed by multinational systems. Given that there are a
number of character sets in current use that provide more
characters than 7-bit ASCII, it makes sense to decide on a
convenient way to represent the union of those possibilities. To
work globally either requires support of a number of character
sets and to be able to convert between them, or the use of a
single preferred character set. To assure global interoperability
this document RECOMMENDS the latter approach and defines a single
character set, in addition to NVT ASCII and EBCDIC, which is
understandable by all systems. For FTP this character set SHALL be
ISO/IEC 10646:1993. For support of global compatibility it is
STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
when exchanging pathnames. Clients and servers are, however, under
no obligation to perform any conversion on the contents of a file
for operations such as STOR or RETR.
The character set used to store files SHALL remain a local
decision and MAY depend on the capability of local operating
systems. Prior to the exchange of pathnames they should be
converted into a ISO/IEC 10646 format and UTF-8 encoded. This
approach, while allowing international exchange of pathnames, will
still allow backward compatibility with older systems because the
code set positions for ASCII characters are identical to the one
byte sequence in UTF-8.
Expires 20 May 1998 [Page 3]
INTERNET DRAFT FTP Internationalization 20 November 1997
Sections 2.1 and 2.2 give a brief description of the international
character set and transfer encoding recommended by this document.
A more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
[UNICODE], beyond that given in this document, can be found in RFC
2044 [RFC2044].
2.1 International Character Set
The character set defined for international support of FTP SHALL
be the Universal Character Set as defined in ISO 10646:1993 as
amended. This standard incorporates the character sets of many
existing international, national, and corporate standards. ISO/IEC
10646 defines two alternate forms of encoding, UCS-4 and UCS-2.
UCS-4 is a four byte (31 bit) encoding containing 2**31 code
positions divided into 128 groups of 256 planes. Each plane
consists of 256 rows of 256 cells. UCS-2 is a 2 byte (16 bit)
character set consisting of plane zero or the Basic Multilingual
Plane (BMP). Currently, no codesets have been defined outside of
the 2 byte BMP.
The Unicode standard version 2.0 [UNICODE] is consistent with the
UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
includes the repertoire of IS 10646 characters, amendments 1-7 of
IS 10646, and editorial and technical corrigenda.
2.2 Transfer Encoding
UCS Transformation Format 8 (UTF-8), in the past referred to as
UTF-2 or UTF-FSS, SHALL be used as a transfer encoding to transmit
the international character set. UTF-8 is a file safe encoding
which avoids the use of byte values which have special
significance during the parsing of pathname character strings.
UTF-8 is an 8 bit encoding of the characters in the UCS. Some of
UTF-8's benefits are that it is compatible with 7 bit ASCII, so it
doesn't affect programs that give special meanings to various
ASCII characters; it is immune to synchronization errors; its
encoding rules allow for easy identification; and it has enough
space to support a large number of character sets.
UTF-8 encoding represents each UCS character as a sequence of 1 to
6 bytes in length. For all sequences of one byte the most
significant bit is ZERO. For all sequences of more than one byte
the number of ONE bits in the first byte, starting from the most
significant bit position, indicates the number of bytes in the
UTF-8 sequence followed by a ZERO bit. For example, the first byte
of a 3 byte UTF-8 sequence would have 1110 as its most significant
bits. Each additional bytes (continuing bytes) in the UTF-8
sequence, contain a ONE bit followed by a ZERO bit as their most
significant bits. The remaining free bit positions in the
continuing bytes are used to identify characters in the UCS. The
Expires 20 May 1998 [Page 4]
INTERNET DRAFT FTP Internationalization 20 November 1997
relationship between UCS and UTF-8 is demonstrated in the
following table:
UCS-4 range(hex) UTF-8 byte sequence(binary)
00000000 - 0000007F 0xxxxxxx
00000080 - 000007FF 110xxxxx 10xxxxxx
00000800 - 0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx
10xxxxxx
04000000 - 7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx
10xxxxxx 10xxxxxx
A beneficial property of UTF-8 is that its single byte sequence is
consistent with the ASCII character set. This feature will allow a
transition where old ASCII-only clients can still interoperate
with new servers which support the UTF-8 encoding.
Another feature is that the encoding rules make it very unlikely
that a character sequence from a different character set will be
mistaken for a UTF-8 encoded character sequence. Clients and
servers can use a simple routine to determine if the character set
being exchanged is valid UTF-8. Section B.1 shows a code example
of this check.
3 Conformance
3.1 General
- The 7-bit restriction for pathnames exchanged is dropped.
- Many operating system allow the use of spaces <SP>, carriage
return <CR>, and line feed <LF> characters as part of the
pathname. The exchange of pathnames with these special command
characters will cause the pathnames to be parsed improperly. This
is because ftp commands associated with pathnames have the form:
COMMAND <SP> <pathname> <CRLF>.
To allow the exchange of pathnames containing these characters,
the definition of pathname is changed from
<pathname> ::= <string> ; in BNF format
to
pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF]
To avoid mistaking these characters within pathnames as special
command characters the following rules will apply:
Expires 20 May 1998 [Page 5]
INTERNET DRAFT FTP Internationalization 20 November 1997
There MUST be only one <SP> between a ftp command and the
pathname. Implementations MUST assume <SP> characters
following the initial <SP> as part of the pathname. For
example the pathname in STOR <SP><SP><SP>foo.bar<CRLF> is
<SP><SP>foo.bar .
Current implementations which may allow multiple <SP>
characters as separators between the command and pathname MUST
assure that they comply with this single <SP> convention.
Note: Implementations which treat 3 character commands (e.g.
CWD, MKD, etc.) as a fixed 4 character command by padding the
command with a trailing <SP> are in non-compliance to this
specification.
When a <CR> character is encountered as part of a pathname it
MUST be padded with a <NUL> character prior to sending the
command. On receipt of a pathname containing a <CR><NUL> sequence
the <NUL> character MUST be stripped away. This approach is
described in RFC 854 [RFC854] on pages 11 and 12. For example, to
store a pathname foo<CR><LF>boo.bar the pathname would become
foo<CR><NUL><LF>boo.bar prior to sending the command STOR
<SP>foo<CR><NUL><LF>boo.bar<CRLF> .
Upon receipt of the altered pathname the <NUL> character
following the <CR> would be stripped away to form the original
pathname.
- Conforming internationalized clients and servers MUST support
UTF-8 for the transfer and receipt of pathnames. Clients and
servers MAY in addition give the user a choice to specify
interpretation of pathnames in another encoding. Note that
configuring clients and servers to use other character sets /
encoding other than UTF-8 is outside of the scope of this
document. While it is recognized that in certain operational
scenarios this may be desirable, this is left as a quality of
implementation and operational issue.
- Pathnames are sequences of bytes. The encoding of names that
are valid UTF-8 sequences is assumed to be UTF-8. The character
set of other names is undefined. Clients and servers, unless
otherwise configured to support a specific native character set,
MUST check for a valid UTF-8 byte sequence to determine if the
pathname being presented is UTF-8.
- To avoid data loss, clients and servers SHOULD use the UTF-8
encoded pathnames when unable to convert them to a usable code
set.
- There may be cases when the code set / encoding presented to the
server or client cannot be determined. In such cases the raw
bytes SHOULD be used.
Expires 20 May 1998 [Page 6]
INTERNET DRAFT FTP Internationalization 20 November 1997
3.2 International Servers
- Mirror servers may want to exactly reflect the site that they
are mirroring. In such cases servers MAY store and present the
exact pathname bytes that it received from the main server.
3.3 International Clients
- Clients which do not require display of pathnames are under no
obligation to do so. Non-display clients do not need to conform
to requirements associated with display.
- Clients which are presented UTF-8 pathnames by the server SHOULD
parse UTF-8 correctly, and attempt to display the pathname within
the limitation of the resources available.
- Character semantics of other names shall remain undefined. If a
client detects that a server is non UTF-8, it SHOULD change its
display appropriately. How a client implementation handles non
UTF-8 is a quality of implementation issue. It MAY try to assume
some other encoding, give the user a chance to try to assume
something, or save encoding assumptions for a server from one FTP
session to another.
- Glyph rendering is outside the scope of this document. How a
client presents characters it cannot display is a quality of
implementation issue. This document RECOMMENDS that octets
corresponding to non-displayable characters SHOULD be presented
in URL %HH format defined in RFC 1738 [RFC1738]. They MAY,
however, display them as question marks, with their UCS
hexadecimal value, or in any other suitable fashion.
- Many existing clients interpret 8-bit pathnames as being in the
local character set. They MAY continue to do so for pathnames
that are not valid UTF-8.
4 Security
This document addresses the support of character sets beyond 1
byte. Conformance to this document should not induce a security
threat.
Expires 20 May 1998 [Page 7]
INTERNET DRAFT FTP Internationalization 20 November 1997
5 Acknowledgments
The following people have contributed to this document:
D. J. Bernstein
Martin J. Duerst
Mark Harris
Paul Hethmon
Alun Jones
James Matthews
Keith Moore
Sandra O'Donnell
Benjamin Riefenstahl
Stephen Tihor
(and others from the FTPEXT working group)
6 Glossary
BIDI - abbreviation for Bi-directional, a reference to mixed right-
to-left and left-to-right text.
Character Set - a collection of characters used to represent
textual information in which each character has a numeric value
Code Set - (see character set).
Glyph - a character image represented on a display device.
I18N - "I eighteen N", the first and last letters of the word
"internationalization" and the eighteen letters in between.
UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.
UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.
UTF-8 - the UCS Transformation Format represented in 8 bits.
UTF-16 - A 16-bit format including the BMP (directly encoded) and
surrogate pairs to represent characters in planes 01-16; equivalent
to Unicode.
Expires 20 May 1998 [Page 8]
INTERNET DRAFT FTP Internationalization 20 November 1997
7 Bibliography
[ABNF]
D. Crocker, P. Overell, Augmented BNF for Syntax Specifications:
ABNF, RFC 2234, November 1997.
[ASCII]
ANSI X3.4:1986 Coded Character Sets - 7 Bit American National
Standard Code for Information Interchange (7-bit ASCII)
[ISO-8859]
ISO 8859. International standard -- Information processing --
8-bit single-byte coded graphic character sets -- Part 1: Latin
alphabet No. 1 (1987) -- Part 2: Latin alphabet No. 2 (1987) --
Part 3: Latin alphabet No. 3 (1988) -- Part 4: Latin alphabet No.
4 (1988) -- Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
Latin/Arabic alphabet (1987) -- Part : Latin/Greek alphabet
(1987) -- Part 8: Latin/Hebrew alphabet (1988) -- Part 9: Latin
alphabet No. 5 (1989) -- Part10: Latin alphabet No. 6 (1992)
[ISO-10646]
ISO/IEC 10646-1:1993. International standard -- Information
technology -- Universal multiple-octet coded character set (UCS)
-- Part 1: Architecture and basic multilingual plane.
[RFC854]
J. Postel, J Reynolds, "Telnet Protocol Specification", RFC 854,
May 1983.
[RFC959]
J. Postel, J Reynolds, "File Transfer Protocol (FTP)", RFC 959,
October 1985.
[RFC1123]
R. Braden, "Requirements for Internet Hosts -- Application and
Support", RFC 1123, October 1989.
[RFC1738]
T. Berners-Lee, L. Masinter, M.McCahill, "Uniform Resource
Locators (URL)", RFC 1738, December 1994.
Expires 20 May 1998 [Page 9]
INTERNET DRAFT FTP Internationalization 20 November 1997
[RFC2044]
F. Yergeau, "UTF-8, a transformation format of Unicode and ISO
10646", RFC 2044, October 1996.
[RFC 2119]
S. Bradner, " Key words for use in RFCs to Indicate Requirement
Levels", RFC 2119, March 1997.
[RFC 2130]
C. Weider, C. Preston, K.Simonsen, H. Alvestrand, " The Report of
the IAB Character Set Workshop held 29 February - 1 March, 1996",
RFC 2130, April, 1997.
[UNICODE]
The Unicode Consortium, "The Unicode Standard - Version 2.0",
Addison Westley Developers Press, July 1996.
[UTF-8]
ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation
Format 8 (UTF-8).
8 Author's Address
JIEO
Attn JEBBD (Bill Curtin)
Ft. Monmouth, N.J.
07703-5613
curtinw@ftm.disa.mil
Expires 20 May 1998 [Page 10]
INTERNET DRAFT FTP Internationalization 20 November 1997
Appendix A - Implementation Considerations
A.1 General Considerations
- Implementers should ensure that their code accounts for
potential problems, such as using a NULL character to terminate a
string or no longer being able to steal the high order bit for
internal use, when supporting the extended character set.
- Implementers should be aware that there is a chance that
pathnames which are non UTF-8 may be parsed as valid UTF-8. The
probability is low for some encoding or statistically zero to
zero for others. A recent non-scientific analysis found that EUC
encoded Japanese words had a 2.7% false reading; SJIS had a
0.0005% false reading; other encoding such as ASCII or KOI-8 have
a 0% false reading. This probability is highest for short
pathnames and decreases as pathname size increases. Implementers
may want to look for signs that pathnames which parse as UTF-8
are not valid UTF-8, such as the existence of multiple local
character sets in short pathnames. Hopefully, as more
implementations conform to UTF-8 transfer encoding there will be
a smaller need to guess at the encoding.
- Client developers should be aware that it will be possible for
pathnames to contain mixed characters (e.g.
/Latin1DirectoryName/HebrewFileName). They should be prepared to
handle the Bi-directional (BIDI) display of these character sets
(i.e. right to left display for the directory and left to right
display for the filename). While bi-directional display is
outside the scope of this document and more complicated than the
above example, an algorithm for bi-directional display can be
found in the UNICODE 2.0 [UNICODE] standard. Also note that
pathnames can have different byte ordering yet be logically and
display-wise equivalent due to the insertion of BIDI control
characters at different points during composition. Also note that
mixed character sets may also present problems with font
swapping.
- A server that copies pathnames transparently from a local
filesystem may continue to do so. It is then up to the local file
creators to use UTF-8 pathnames.
- Servers can supports charset labeling of files and/or
directories, such that different pathnames may have different
charsets. The server should attempt to convert all pathnames to
UTF-8, but if it can't then it should leave that name in its raw
form.
- Some server's OS do not mandate character sets, but allow
administrators to configures it in the FTP server. These servers
Expires 20 May 1998 [Page A-1]
INTERNET DRAFT FTP Internationalization 20 November 1997
should be configured to use a particular mapping table (either
external or built-in). This will allow the flexibility of
defining different charsets for different directories.
- If the server's OS does not mandate the character set and the
FTP server cannot be configured, the server should simply use the
raw bytes in the file name. They might be ASCII or UTF-8.
- If the server is a mirror, and wants to look just like the site
it is mirroring, it should store the exact file name bytes that
it received from the main server.
A.2 Transition Considerations
-Clients and servers can transition to UTF-8 by either converting
to/from the local encoding, or the users can store UTF-8
filenames. The former approach is easier on tightly controlled
file systems (e.g. PCs and MACs). The latter approach is easier
on more free form file systems (e.g. Unix).
-For interactive use attention should be focused on user interface
and ease of use. Non-interactive use requires a consistent and
controlled behavior.
-There may be many applications which reference files under their
old raw pathname (e.g. linked URLs). Changing the pathname to
UTF-8 will cause access to the old URL to fail. A solution may be
for the server to act as if there was 2 different pathnames
associated with the file. This might be done internal to the
server on controlled file systems or by using symbolic links on
free form systems. While this approach may work for single file
transfer non-interactive use, a non-interactive transfer of all
of the files in a directory will produce duplicates. Interactive
users may be presented with lists of files which are double the
actual number files.
Expires 20 May 1998 [Page A-2]
INTERNET DRAFT FTP Internationalization 20 November 1997
Appendix B - Sample Code and Examples
B.1 Valid UTF-8 check
The following routine checks if a byte sequence is valid UTF-8.
This is done by checking for the proper tagging of the first and
following bytes to make sure they conform to the UTF-8 format. It
then checks to assure that the data part of the UTF-8 sequence
conform to the proper range allowed by the encoding. Note: This
routine will not detect characters which have not been assigned
and therefore do not exist.
int utf8_valid(const unsigned char *buf, unsigned int len)
{
const unsigned char *endbuf = buf + len;
unsigned char byte2mask=0x00, c;
int trailing = 0; // trailing (continuation) bytes
to follow
while (buf != endbuf)
{
c = *buf++;
if (trailing)
if ((c&0xC0) == 0x80) // Does trailing byte follow UTF-8
format?
{if (byte2mask) // Need to check 2nd byte for
proper range?
if (c&byte2mask) // Are appropriate bits set?
byte2mask=0x00;
else
return 0;
trailing--; }
else
return 0;
else
if ((c&0x80) == 0x00) continue; // valid 1 byte UTF-8
else if ((c&0xE0) == 0xC0) // valid 2 byte UTF-8
if (c&0x1E) // Is UTF-8 byte in
proper range?
trailing =1;
else
return 0;
else if ((c&0xF0) == 0xE0) // valid 3 byte UTF-8
{if (!(c&0x0F)) // Is UTF-8 byte in
proper range?
byte2mask=0x20; // If not set mask to
check next byte
trailing = 2;}
else if ((c&0xF8) == 0xF0) // valid 4 byte UTF-8
{if (!(c&0x07)) // Is UTF-8 byte in
proper range?
Expires 20 May 1998 [Page B-1]
INTERNET DRAFT FTP Internationalization 20 November 1997
byte2mask=0x30; // If not set mask to
check next byte
trailing = 3;}
else if ((c&0xFC) == 0xF8) // valid 5 byte UTF-8
{if (!(c&0x03)) // Is UTF-8 byte in
proper range?
byte2mask=0x38; // If not set mask to
check next byte
trailing = 4;}
else if ((c&0xFE) == 0xFC) // valid 6 byte UTF-8
{if (!(c&0x01)) // Is UTF-8 byte in
proper range?
byte2mask=0x3C; // If not set mask to
check next byte
trailing = 5;}
else return 0;
}
return trailing == 0;
}
B.2 Conversions
The code examples in this section closely reflect the algorithm in
ISO 10646 and may not present the most efficient solution for
converting to / from UTF-8 encoding. If efficiency is an issue,
implementers should use the appropriate bitwise operators.
Additional code examples and numerous mapping tables can be found
at the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.
Note that the conversion examples below assume that the local
character set supported in the operating system is something other
than UCS2/UTF-16. There are some operating systems which already
support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case
no conversion will be necessary from the local character set to
the UCS.
B.2.1 Conversion from local character set to UTF-8
Conversion from the local filesystem character set to UTF-8 will
normally involve a two step process. First convert the local
character set to the UCS; then convert the UCS to UTF-8.
The first step in the process can be performed by maintaining a
mapping table which includes the local character set code and the
corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
code for the Hebrew letter "VAV" is 0xE4. The corresponding 4 byte
ISO/IEC 10646 code is 0x000005D5.
Expires 20 May 1998 [Page B-2]
INTERNET DRAFT FTP Internationalization 20 November 1997
The next step is to convert the UCS character code to the UTF-8
encoding. The following routine can be used to determine and
encode the correct number of bytes based on the UCS-4 character
code:
unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
ucs4_len, unsigned char *utf8_buf)
{
const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
unsigned int utf8_len = 0; // return value for UTF8 size
unsigned char *t_utf8_buf = utf8_buf;// Temporary pointer
// to load UTF8 values
while (ucs4_buf != ucs4_endbuf)
{
if ( *ucs4_buf <= 0x7F) // ASCII chars no conversion needed
{
*t_utf8_buf++ = (unsigned char) *ucs4_buf;
utf8_len++;
ucs4_buf++;
}
else
if ( *ucs4_buf <= 0x07FF ) // In the 2 byte utf-8 range
{
*t_utf8_buf++= (unsigned char) (0xC0 + (*ucs4_buf/0x40));
*t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
utf8_len+=2;
ucs4_buf++;
}
else
if ( *ucs4_buf <= 0xFFFF ) /* In the 3 byte utf-8 range. The
values 0x0000FFFE, 0x0000FFFF
and 0x0000D800 - 0x0000DFFF do
not occur in UCS-4 */
{
*t_utf8_buf++= (unsigned char) (0xE0 + (*ucs4_buf/0x1000));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x40)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
utf8_len+=3;
ucs4_buf++;
}
else
if ( *ucs4_buf <= 0x1FFFFF ) //In the 4 byte utf-8 range
{
*t_utf8_buf++= (unsigned char) (0xF0 + (*ucs4_buf/0x040000));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x10000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
Expires 20 May 1998 [Page B-3]
INTERNET DRAFT FTP Internationalization 20 November 1997
((*ucs4_buf/0x40)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
utf8_len+=4;
ucs4_buf++;
}
else
if ( *ucs4_buf <= 0x03FFFFFF )//In the 5 byte utf-8 range
{
*t_utf8_buf++= (unsigned char) (0xF8 +
(*ucs4_buf/0x01000000));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x040000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x1000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x40)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
(*ucs4_buf%0x40));
utf8_len+=5;
ucs4_buf++;
}
else
if ( *ucs4_buf <= 0x7FFFFFFF )//In the 6 byte utf-8 range
{
*t_utf8_buf++= (unsigned char)
(0xF8 +(*ucs4_buf/0x40000000));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x01000000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x040000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x1000)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
((*ucs4_buf/0x40)%0x40));
*t_utf8_buf++= (unsigned char) (0x80 +
(*ucs4_buf%0x40));
utf8_len+=6;
ucs4_buf++;
}
}
return (utf8_len);
}
B.2.2 Conversion from UTF-8 to local character set
When moving from UTF-8 encoding to the local character set the
reverse procedure is used. First the UTF-8 encoding is transformed
into the UCS-4 character set. The UCS-4 is then converted to the
Expires 20 May 1998 [Page B-4]
INTERNET DRAFT FTP Internationalization 20 November 1997
local character set from a mapping table (i.e. the opposite of the
table used to form the UCS-4 character code).
To convert from UTF-8 to UCS-4 the free bits (those that do not
define UTF-8 sequence size or signify continuation bytes) in a
UTF-8 sequence are concatenated as a bit string. The bits are then
distributed into a four byte sequence starting from the least
significant bits. Those bits not assigned a bit in the four byte
sequence are padded with ZERO bits. The following routine converts
the UTF-8 encoding to UCS-4 character codes:
int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
unsigned char *utf8_buf)
{
const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
unsigned int ucs_len=0;
while (utf8_buf != utf8_endbuf)
{
if ((*utf8_buf & 0x80) == 0x00) /*ASCII chars no conversion
needed */
{
*ucs4_buf++ = (unsigned long) *utf8_buf;
utf8_buf++;
ucs_len++;
}
else
if ((*utf8_buf & 0xE0)== 0xC0) //In the 2 byte utf-8 range
{
*ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xC0) * 0x40)
+ ( *(utf8_buf+1) - 0x80));
utf8_buf += 2;
ucs_len++;
}
else
if ( (*utf8_buf & 0xF0) == 0xE0 ) /*In the 3 byte utf-8
range */
{
*ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xE0) * 0x1000)
+ (( *(utf8_buf+1) - 0x80) * 0x40)
+ ( *(utf8_buf+2) - 0x80));
utf8_buf+=3;
ucs_len++;
}
else
if ((*utf8_buf & 0xF8) == 0xF0) /* In the 4 byte utf-8
range */
{
Expires 20 May 1998 [Page B-5]
INTERNET DRAFT FTP Internationalization 20 November 1997
*ucs4_buf++ = (unsigned long)
(((*utf8_buf - 0xF0) * 0x040000)
+ (( *(utf8_buf+1) - 0x80) * 0x1000)
+ (( *(utf8_buf+2) - 0x80) * 0x40)
+ ( *(utf8_buf+3) - 0x80));
utf8_buf+=4;
ucs_len++;
}
else
if ((*utf8_buf & 0xFC) == 0xF8) /* In the 5 byte utf-8
range */
{
*ucs4_buf++ = (unsigned long)
(((*utf8_buf - 0xF8) * 0x01000000)
+ ((*(utf8_buf+1) - 0x80) * 0x040000)
+ (( *(utf8_buf+2) - 0x80) * 0x1000)
+ (( *(utf8_buf+3) - 0x80) * 0x40)
+ ( *(utf8_buf+4) - 0x80));
utf8_buf+=5;
ucs_len++;
}
else
if ((*utf8_buf & 0xFE) == 0xFC) /* In the 6 byte utf-8
range */
{
*ucs4_buf++ = (unsigned long)
(((*utf8_buf - 0xFC) * 0x40000000)
+ ((*(utf8_buf+1) - 0x80) * 0x010000000)
+ ((*(utf8_buf+2) - 0x80) * 0x040000)
+ (( *(utf8_buf+3) - 0x80) * 0x1000)
+ (( *(utf8_buf+4) - 0x80) * 0x40)
+ ( *(utf8_buf+5) - 0x80));
utf8_buf+=6;
ucs_len++;
}
}
return (ucs_len);
}
B.2.3 ISO/IEC 8859-8 Example
This example demonstrates mapping ISO/IEC 8859-8 character set to
UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew
letter "VAV" is convertd from the ISO/IEC 8859-8 character code
0xE4 to the corresponding 4 byte ISO/IEC 10646 code of 0x000005D5
by a simple lookup of a conversion/mapping file.
The UCS-4 character code is transformed into UTF-8 using the
ucs4_to_utf8 routine described earlier by:
Expires 20 May 1998 [Page B-6]
INTERNET DRAFT FTP Internationalization 20 November 1997
1. Because the UCS-4 character is between 0x80 and 0x07FF it will
map to a 2 byte UTF-8 sequence.
2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) =
0xD7.
3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) =
0x95.
The UTF-8 encoding is transferred back to UCS-4 by using the
utf8_to_ucs4 routine described earlier by:
1. Because the first byte of the sequence, when the '&' operator
with a value of 0xE0 is applied, will produce 0xC0 (0xD7 &
0xE0 = 0xC0) the UTF-8 is a 2 byte sequence.
2. The four byte UCS-4 character code is produced by (((0xD7 -
0xC0) * 0x40) + (0x95 -0x80)) = 0x000005D5.
Finally, the UCS-4 character code is converted to ISO/IEC 8859-8
character code (using the mapping table which matches ISO/IEC
8859-8 to UCS-4 ) to produce the original 0xE4 code for the Hebrew
letter "VAV".
B.2.4 Vendor Codepage Example
This example demonstrates the mapping of a codepage to UTF-8 and
back to a vendor codepage. Mapping between vendor codepages can be
done in a very similar manner as described above. For instance
both the PC and Mac codepages reflect the character set from the
Thai standard TIS 620-2533. The character code on both platforms
for the Thai letter "SO SO" is 0xAB. This character can then be
mapped into the UCS-4 by way of a conversion/mapping file to
produce the UCS-4 code of 0x0E0B.
The UCS-4 character code is transformed into UTF-8 using the
ucs4_to_utf8 routine described earlier by:
1. Because the UCS-4 character is between 0x0800 and 0xFFFF it
will map to a 3 byte UTF-8 sequence.
2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000) =
0xE0.
3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
0x40))) = 0xB8.
4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) =
0x8B.
The UTF-8 encoding is transferred back to UCS-4 by using the
utf8_to_ucs4 routine described earlier by:
1. Because the first byte of the sequence, when the '&' operator
with a value of 0xF0 is applied, will produce 0xE0 (0xE0 &
0xF0 = 0xE0) the UTF-8 is a 3 byte sequence.
Expires 20 May 1998 [Page B-7]
INTERNET DRAFT FTP Internationalization 20 November 1997
2. The four byte UCS-4 character code is produced by (((0xE0 -
0xE0) * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B -0x80) =
0x0000E0B.
Finally, the UCS-4 character code is converted to either the PC or
MAC codepage character code (using the mapping table which matches
codepage to UCS-4 ) to produce the original 0xAB code for the Thai
letter "SO SO".
B.3 Pseudo Code for a high-quality translating server
if utf8_valid(fn)
{
attempt to convert fn to the local charset, producing localfn
if (conversion fails temporarily) return error
if (conversion succeeds)
{
attempt to open localfn
if (open fails temporarily) return error
if (open succeeds) return success
}
}
attempt to open fn
if (open fails temporarily) return error
if (open succeeds) return success
return permanent error
Expires 20 May 1998 [Page B-8]