A Generalized Unified Character Code: Western European and CJK Sections
RFC 5242

Document Type RFC - Informational (April 2008; No errata)
Last updated 2013-03-02
Stream ISE
Network Working Group                                         J. Klensin
Request for Comments: 5242
Category: Informational                                    H. Alvestrand
                                                                  Google
                                                            1 April 2008

A Generalized Unified Character Code: Western European and CJK Sections

Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

IESG Note

   This is not an IETF document.  Readers should be aware of RFC 4690,
   "Review and Recommendations for Internationalized Domain Names
   (IDNs)", and its references.

   This document is not a candidate for any level of Internet Standard.
   The IETF disclaims any knowledge of the fitness of this document for
   any purpose, and in particular notes that it has not had IETF review
   for such things as security, congestion control, or inappropriate
   interaction with deployed protocols.  The RFC Editor has chosen to
   publish this document at its discretion.  Readers of this document
   should exercise caution in evaluating its value for implementation
   and deployment.

Abstract

   Many issues have been identified with the use of general-purpose
   character sets for internationalized domain names and similar
   purposes.  This memo describes a fully unified coded character set
   for scripts based on Latin, Greek, Cyrillic, and Chinese (CJK)
   characters.  It is not a complete specification of that character
   set.

Klensin & Alvestrand         Informational                      [Page 1]
RFC 5242                      Unified CCS                     April 2008

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  3
     1.2.  Discussion . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Types of Characters  . . . . . . . . . . . . . . . . . . . . .  4
     2.1.  Base Character . . . . . . . . . . . . . . . . . . . . . .  4
     2.2.  Nonspacing Marks . . . . . . . . . . . . . . . . . . . . .  4
     2.3.  Case Indicators  . . . . . . . . . . . . . . . . . . . . .  4
     2.4.  Joining Indicators . . . . . . . . . . . . . . . . . . . .  5
     2.5.  Character-Matrix Positioning Indicators  . . . . . . . . .  5
     2.6.  Position Shaping Controls  . . . . . . . . . . . . . . . .  6
     2.7.  Repetition Indicators  . . . . . . . . . . . . . . . . . .  6
     2.8.  Control Characters . . . . . . . . . . . . . . . . . . . .  7
   3.  Code Assignment Groupings  . . . . . . . . . . . . . . . . . .  7
   4.  Canonical Form . . . . . . . . . . . . . . . . . . . . . . . .  7
   5.  Examples of Graphic Element Codes  . . . . . . . . . . . . . .  8
   6.  Composite Characters and Unicode Equivalences  . . . . . . . . 10
   7.  Ideographic Characters . . . . . . . . . . . . . . . . . . . . 11
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 11
   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   10. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 12
   11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13
     11.1. Normative References . . . . . . . . . . . . . . . . . . . 13
     11.2. Informative References . . . . . . . . . . . . . . . . . . 13


1.  Introduction

   Many issues have been identified with the use of general-purpose
   character sets for internationalized domain names and similar
   purposes.  This memo specifies a fully unified coded character set
   for scripts based on Latin, Greek, Cyrillic, and Chinese characters.

   There are four important principles in this work:

   1.  If it looks alike, it is alike.  The number of base characters
       and marks should be minimized.  Glyphs are more important than
       character abstractions.

   2.  If it is the same thing, it is the same thing.  Two symbols that
       have the same semantic meaning in all contexts should be encoded
       in a way that allows their identity to be discovered by removing
       modifiers, rather than having to resort to external equivalence
       tables.

   3.  For simplicity, when a character form can be evaluated on the
       basis of either serif or sanserif fonts, the sanserif font is
       always preferred.

   4.  The use of combining characters and modifiers is preferred to
       adding more base characters.
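
   Principles 2 and 4 can be illustrated with a minimal sketch, written
   here in Python.  It assumes that a coded character is a base
   character plus a sequence of modifiers, and that two sequences are
   compared by discarding the modifiers.  The names Coded, UPPER_CASE,
   ACUTE, strip_modifiers, and same_when_stripped are hypothetical and
   are not code assignments taken from this memo.

```python
from dataclasses import dataclass

# Hypothetical modifier names, invented for this sketch only.
UPPER_CASE = "UPPER_CASE"  # stands in for a case indicator
ACUTE = "ACUTE"            # stands in for a nonspacing mark


@dataclass(frozen=True)
class Coded:
    """One coded character: a base character plus optional modifiers."""
    base: str
    modifiers: tuple = ()


def strip_modifiers(text):
    """Reduce a coded sequence to its base characters only."""
    return tuple(c.base for c in text)


def same_when_stripped(a, b):
    """Compare two sequences without any external equivalence table."""
    return strip_modifiers(a) == strip_modifiers(b)


# Assumption: an upper-case letter is its lower-case base plus an indicator.
lower_a = Coded("a")
upper_a = Coded("a", (UPPER_CASE,))
print(same_when_stripped([lower_a], [upper_a]))
```

   Under these assumptions, the comparison needs only the coded
   sequences themselves: the relationship between the two forms is
   carried by the shared base character rather than by a separate
   table.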

   Based on these principles, it becomes obvious that:

   o  Ligatures, digraphs, and final forms are constructed with special