UTF-7 - A Mail-Safe Transformation Format of Unicode
RFC 1642

Document Type RFC - Experimental (July 1994; No errata)
Obsoleted by RFC 2152
Was draft-goldsmith-mime-utf7 (individual)
Last updated 2013-03-02
Stream Legacy
Network Working Group                                       D. Goldsmith
Request for Comments: 1642                                      M. Davis
Category: Experimental                                    Taligent, Inc.
                                                               July 1994

                                 UTF-7

              A Mail-Safe Transformation Format of Unicode

Status of this Memo

   This memo defines an Experimental Protocol for the Internet
   community.  This memo does not specify an Internet standard of any
   kind.  Distribution of this memo is unlimited.

Abstract

   The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E)
   jointly define a 16-bit character set (hereafter referred to as
   Unicode) which encompasses most of the world's writing systems.
   However, Internet mail (STD 11, RFC 822) currently supports only
   7-bit US-ASCII as a character set. MIME (RFC 1521 and RFC 1522)
   extends Internet mail to support different media types and character
   sets, and thus could support Unicode in mail messages. MIME neither
   defines Unicode as a permitted character set nor specifies how it
   would be encoded, although it does provide for the registration of
   additional character sets over time.

   This document describes a new transformation format of Unicode that
   contains only 7-bit ASCII characters and is intended to be readable
   by humans in the limiting case that the document consists of
   characters from the US-ASCII repertoire. It also specifies how this
   transformation format is used in the context of RFC 1521, RFC 1522,
   and the document "Using Unicode with MIME".

Motivation

   Although other transformation formats of Unicode exist and could
   conceivably be used in this context (most notably UTF-1 and UTF-8,
   also known as UTF-2 or UTF-FSS), they suffer the disadvantage that
   they use octets in the range decimal 128 through 255 to encode
   Unicode characters outside the US-ASCII range. Thus, in the context
   of mail, those octets must themselves be encoded. This requires
   putting text through two successive encoding processes, and leads to
   a significant expansion of characters outside the US-ASCII range,
   putting non-English speakers at a disadvantage. For example, using
   UTF-FSS together with the Quoted-Printable content transfer encoding
   of MIME represents US-ASCII characters in one octet, but other
   characters may require up to nine octets.
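
   The nine-octet figure can be checked with a short sketch. It assumes
   Python's standard "quopri" module as a stand-in for the MIME
   Quoted-Printable encoding and uses U+65E5 as an arbitrary sample
   character; the helper name qp_octets_per_char is hypothetical and
   is not part of this memo.

```python
import quopri


def qp_octets_per_char(ch: str) -> int:
    """Octets used by one character after UTF-8, then Quoted-Printable."""
    utf8 = ch.encode("utf-8")              # first encoding pass
    return len(quopri.encodestring(utf8))  # second encoding pass


# A US-ASCII letter passes through both encodings unchanged.
ascii_cost = qp_octets_per_char("A")

# U+65E5 needs three UTF-8 octets, all in the range 128 through 255,
# and Quoted-Printable rewrites each of them as a three-octet "=XX".
sample_utf8 = "\u65e5".encode("utf-8")
bmp_cost = qp_octets_per_char("\u65e5")

print(ascii_cost, list(sample_utf8), bmp_cost)
```

   Each UTF-8 octet above 127 becomes three octets on the wire, so a
   character that needs three UTF-8 octets costs nine in total.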

Overview

   UTF-7 encodes Unicode characters as US-ASCII, together with shift
   sequences to encode characters outside that range. For this purpose,
   one of the characters in the US-ASCII repertoire is reserved for use
   as a shift character.
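
   As an informal illustration, the sketch below uses the "utf-7" codec
   built into Python. That codec follows RFC 2152, the successor to
   this memo, so its exact output may differ in detail from the rules
   given here; the helper name to_utf7 and the sample string are
   assumptions made for the example.

```python
def to_utf7(text: str) -> bytes:
    """Encode text with Python's built-in UTF-7 codec (RFC 2152 rules)."""
    return text.encode("utf-7")


sample = "Hi Mom \u263a"  # U+263A lies outside the US-ASCII range
encoded = to_utf7(sample)

# Every octet of the result fits in 7 bits.
is_seven_bit = all(octet < 0x80 for octet in encoded)

# Decoding restores the original text.
round_trip = encoded.decode("utf-7")

# Plain US-ASCII letters are carried through unchanged.
plain = to_utf7("Hello")

print(encoded, is_seven_bit, round_trip == sample, plain)
```

   The character outside US-ASCII is carried inside a shifted run of
   ASCII characters, while the surrounding letters remain readable.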

   Many mail gateways and systems cannot handle the entire US-ASCII
   character set (those based on EBCDIC, for example), and so UTF-7
   contains provisions for encoding characters within US-ASCII in a way
   that all mail systems can accommodate.

   UTF-7 should normally be used only in the context of 7-bit
   transports, such as mail and news. In other contexts, straight
   Unicode or UTF-8 is preferred.

   See the document "Using Unicode with MIME" for the overall
   specification on usage of Unicode transformation formats with MIME.

Definitions

   First, the definition of Unicode:

      The 16-bit character set Unicode is defined by "The Unicode
      Standard, Version 1.1". This character set is identical with the
      character repertoire and coding of the international standard
      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
      Subset=300; Implementation Level=3.

      Note. Unicode 1.1 further specifies the use and interaction of
      these character codes beyond the ISO standard. However, any valid
      10646 BMP (Basic Multilingual Plane) sequence is a valid Unicode
      sequence, and vice versa; Unicode supplies interpretations of
      sequences on which the ISO standard is silent as to
      interpretation.

   Next, some handy definitions of US-ASCII character subsets:

      Set D (directly encoded characters) consists of the following
      characters (derived from RFC 1521, Appendix B): the upper and
      lower case letters A through Z and a through z, the 10 digits 0-9,
      and the following nine special characters (note that "+" and "="
      are omitted):

               Character   ASCII & Unicode Value (decimal)
                  '           39
                  (           40
                  )           41
                  ,           44
                  -           45
                  .           46
                  /           47
                  :           58