A Generalized Unified Character Code: Western European and CJK Sections
RFC 5242
Document type: RFC - Informational (April 2008; No errata)
Last updated: 2013-03-02
Stream: ISE
Formats: plain text, pdf, html, bibtex
ISE state: (None)
Consensus Boilerplate: Unknown
Document shepherd: No shepherd assigned
IESG state: RFC 5242 (Informational)
Responsible AD: (None)
Send notices to: (None)
Network Working Group J. Klensin
Request for Comments: 5242
Category: Informational H. Alvestrand
Google
1 April 2008
A Generalized Unified Character Code: Western European and CJK Sections
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
IESG Note
This is not an IETF document. Readers should be aware of RFC 4690,
"Review and Recommendations for Internationalized Domain Names
(IDNs)", and its references.
This document is not a candidate for any level of Internet Standard.
The IETF disclaims any knowledge of the fitness of this document for
any purpose, and in particular notes that it has not had IETF review
for such things as security, congestion control, or inappropriate
interaction with deployed protocols. The RFC Editor has chosen to
publish this document at its discretion. Readers of this document
should exercise caution in evaluating its value for implementation
and deployment.
Abstract
Many issues have been identified with the use of general-purpose
character sets for internationalized domain names and similar
purposes. This memo describes a fully unified coded character set
for scripts based on Latin, Greek, Cyrillic, and Chinese (CJK)
characters. It is not a complete specification of that character
set.
Klensin & Alvestrand Informational [Page 1]
RFC 5242 Unified CCS April 2008
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Discussion . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Types of Characters . . . . . . . . . . . . . . . . . . . . . 4
2.1. Base Character . . . . . . . . . . . . . . . . . . . . . . 4
2.2. Nonspacing Marks . . . . . . . . . . . . . . . . . . . . . 4
2.3. Case Indicators . . . . . . . . . . . . . . . . . . . . . 4
2.4. Joining Indicators . . . . . . . . . . . . . . . . . . . . 5
2.5. Character-Matrix Positioning Indicators . . . . . . . . . 5
2.6. Position Shaping Controls . . . . . . . . . . . . . . . . 6
2.7. Repetition Indicators . . . . . . . . . . . . . . . . . . 6
2.8. Control Characters . . . . . . . . . . . . . . . . . . . . 7
3. Code Assignment Groupings . . . . . . . . . . . . . . . . . . 7
4. Canonical Form . . . . . . . . . . . . . . . . . . . . . . . . 7
5. Examples of Graphic Element Codes . . . . . . . . . . . . . . 8
6. Composite Characters and Unicode Equivalences . . . . . . . . 10
7. Ideographic Characters . . . . . . . . . . . . . . . . . . . . 11
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11
9. Security Considerations . . . . . . . . . . . . . . . . . . . 12
10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 12
11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13
11.1. Normative References . . . . . . . . . . . . . . . . . . . 13
11.2. Informative References . . . . . . . . . . . . . . . . . . 13
1. Introduction
Many issues have been identified with the use of general-purpose
character sets for internationalized domain names and similar
purposes. This memo specifies a fully unified coded character set
for scripts based on Latin, Greek, Cyrillic, and Chinese characters.
There are four important principles in this work:
1. If it looks alike, it is alike. The number of base characters
and marks should be minimized. Glyphs are more important than
character abstractions.
2. If it is the same thing, it is the same thing. Two symbols that
have the same semantic meaning in all contexts should be encoded
in a way that allows their identity to be discovered by removing
modifiers, rather than having to resort to external equivalence
tables.
3. For simplicity, when a character form can be evaluated on the
basis of either serif or sanserif fonts, the sanserif font is
always preferred.
4. The use of combining characters and modifiers is preferred to
adding more base characters.
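As a rough illustration of principles 2 and 4, the sketch below shows how
identity can be discovered by removing modifiers rather than by consulting
an external equivalence table. It is only an analogy: it uses Unicode
canonical decomposition from the Python standard library as a stand-in,
not the code assignments of this memo, and the helper names
strip_modifiers and same_base are hypothetical.

```python
import unicodedata


def strip_modifiers(text: str) -> str:
    """Decompose text and drop nonspacing marks, leaving base characters.

    Assumption: Unicode NFD decomposition and the "Mn" general category
    stand in for the memo's base-character-plus-modifier model.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def same_base(a: str, b: str) -> bool:
    """Report whether two strings share the same sequence of base characters."""
    return strip_modifiers(a) == strip_modifiers(b)
```

With this approach, a comparison such as same_base("résumé", "resume")
needs no lookup table: the relationship falls out of the encoding once
the marks are removed.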
Based on these principles, it becomes obvious that:
o Ligatures, digraphs, and final forms are constructed with special
modifiers so that relationships to basic forms are obvious.
o Symbols consisting of multiple marks are always constructed from