i;unicode-casemap - Simple Unicode Collation Algorithm
RFC 5051
Document | Type |
RFC - Proposed Standard
(October 2007; No errata)
Was draft-crispin-collation-unicasemap (individual in app area)
|
|
---|---|---|---|
Author | Mark Crispin | ||
Last updated | 2015-10-14 | ||
Stream | IETF | ||
Formats | plain text html pdf htmlized bibtex | ||
Reviews | |||
Stream | WG state | (None) | |
Document shepherd | No shepherd assigned | ||
IESG | IESG state | RFC 5051 (Proposed Standard) | |
Consensus Boilerplate | Unknown | ||
Telechat date | |||
Responsible AD | Lisa Dusseault | ||
Send notices to | alexey.melnikov@isode.com |
Network Working Group M. Crispin Request for Comments: 5051 University of Washington Category: Standards Track October 2007 i;unicode-casemap - Simple Unicode Collation Algorithm Status of This Memo This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited. Abstract This document describes "i;unicode-casemap", a simple case- insensitive collation for Unicode strings. It provides equality, substring, and ordering operations. 1. Introduction The "i;ascii-casemap" collation described in [COMPARATOR] is quite simple to implement and provides case-independent comparisons for the 26 Latin alphabetics. It is specified as the default and/or baseline comparator in some application protocols, e.g., [IMAP-SORT]. However, the "i;ascii-casemap" collation does not produce satisfactory results with non-ASCII characters. It is possible, with a modest extension, to provide a more sophisticated collation with greater multilingual applicability than "i;ascii-casemap". This extension provides case-independent comparisons for a much greater number of characters. It also collates characters with diacriticals with the non-diacritical character forms. This collation, "i;unicode-casemap", is intended to be an alternative to, and preferred over, "i;ascii-casemap". It does not replace the "i;basic" collation described in [BASIC]. 2. Unicode Casemap Collation Description The "i;unicode-casemap" collation is a simple collation which is case-insensitive in its treatment of characters. It provides equality, substring, and ordering operations. The validity test operation returns "valid" for any input. Crispin Standards Track [Page 1] RFC 5051 i;unicode-casemap October 2007 This collation allows strings in arbitrary (and mixed) character sets, as long as the character set for each string is identified and it is possible to convert the string to Unicode. Strings which have an unidentified character set and/or cannot be converted to Unicode are not rejected, but are treated as binary. Each input string is prepared by converting it to a "titlecased canonicalized UTF-8" string according to the following steps, using UnicodeData.txt ([UNICODE-DATA]): (1) A Unicode codepoint is obtained from the input string. (a) If the input string is in a known charset that can be converted to Unicode, a sequence in the string's charset is read and checked for validity according to the rules of that charset. If the sequence is valid, it is converted to a Unicode codepoint. Note that for input strings in UTF-8, the UTF-8 sequence must be valid according to the rules of [UTF-8]; e.g., overlong UTF-8 sequences are invalid. (b) If the input string is in an unknown charset, or an invalid sequence occurs in step (1)(a), conversion ceases. No further preparation is performed, and any partial preparation results are discarded. The original string is used unchanged with the i;octet comparator. (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]), are performed on the resulting codepoint from step (1)(a). (a) If the codepoint has a titlecase property in UnicodeData.txt (this is normally the same as the uppercase property), the codepoint is converted to the codepoints in the titlecase property. (b) If the resulting codepoint from (2)(a) has a decomposition property of any type in UnicodeData.txt, the codepoint is converted to the codepoints in the decomposition property. This step is recursively applied to each of the resulting codepoints until no more decomposition is possible (effectively Normalization Form KD). Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON) has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a decomposition property of U+0044 (LATIN CAPITAL LETTER D) U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a decomposition property of U+007A (LATIN SMALL LETTER Z) U+030c Crispin Standards Track [Page 2] RFC 5051 i;unicode-casemap October 2007Show full document text