UTF-8, a transformation format of ISO 10646
RFC 3629

 
Document Type RFC - Internet Standard (November 2003; No errata)
Obsoletes RFC 2279
Also known as STD 63
Last updated 2013-03-02
Stream ISE
Formats plain text pdf html
Stream ISE state (None)
Document shepherd No shepherd assigned
IESG IESG state RFC 3629 (Internet Standard)
Telechat date
Responsible AD Ted Hardie
Send notices to <fyergeau@alis.ca>
Network Working Group                                         F. Yergeau
Request for Comments: 3629                             Alis Technologies
STD: 63                                                    November 2003
Obsoletes: 2279
Category: Standards Track

              UTF-8, a transformation format of ISO 10646

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   ISO/IEC 10646-1 defines a large character set called the Universal
   Character Set (UCS) which encompasses most of the world's writing
   systems.  The originally proposed encodings of the UCS, however, were
   not compatible with many current applications and protocols, and this
   has led to the development of UTF-8, the object of this memo.  UTF-8
   has the characteristic of preserving the full US-ASCII range,
   providing compatibility with file systems, parsers and other software
   that rely on US-ASCII values but are transparent to other values.
   This memo obsoletes and replaces RFC 2279.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
   2.  Notational conventions . . . . . . . . . . . . . . . . . . . .  3
   3.  UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . .  4
   4.  Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . .  5
   5.  Versions of the standards  . . . . . . . . . . . . . . . . . .  6
   6.  Byte order mark (BOM)  . . . . . . . . . . . . . . . . . . . .  6
   7.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . .  8
   8.  MIME registration  . . . . . . . . . . . . . . . . . . . . . .  9
   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 10
   10. Security Considerations  . . . . . . . . . . . . . . . . . . . 10
   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
   12. Changes from RFC 2279  . . . . . . . . . . . . . . . . . . . . 11
   13. Normative References . . . . . . . . . . . . . . . . . . . . . 12

Yergeau                     Standards Track                     [Page 1]
RFC 3629                         UTF-8                     November 2003

   14. Informative References . . . . . . . . . . . . . . . . . . . . 12
   15. URI's  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
   16. Intellectual Property Statement  . . . . . . . . . . . . . . . 13
   17. Author's Address . . . . . . . . . . . . . . . . . . . . . . . 13
   18. Full Copyright Statement . . . . . . . . . . . . . . . . . . . 14

1. Introduction

   ISO/IEC 10646 [ISO.10646] defines a large character set called the
   Universal Character Set (UCS), which encompasses most of the world's
   writing systems.  The same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementers.  Up to the present time, changes in Unicode and
   amendments and additions to ISO/IEC 10646 have tracked each other, so
   that the character repertoires and code point assignments have
   remained in sync.  The relevant standardization committees have
   committed to maintain this very useful synchronism.

   ISO/IEC 10646 and Unicode define several encoding forms of their
   common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.  In an
   encoding form, each character is represented as one or more encoding
   units.  All standard UCS encoding forms except UTF-8 have an encoding
   unit larger than one octet, making them hard to use in many current
   applications and protocols that assume 8 or even 7 bit characters.

   UTF-8, the object of this memo, has a one-octet encoding unit.  It
   uses all bits of an octet, but has the quality of preserving the full
   US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
   octet having the normal US-ASCII value, and any octet with such a
   value can only stand for a US-ASCII character, and nothing else.

   UTF-8 encodes UCS characters as a varying number of octets, where the
   number of octets, and the value of each, depend on the integer value
   assigned to the character in ISO/IEC 10646 (the character number,
   a.k.a. code position, code point or Unicode scalar value).  This
   encoding form has the following characteristics (all values are in
   hexadecimal):

   o  Character numbers from U+0000 to U+007F (US-ASCII repertoire)
      correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
Show full document text