i;unicode-casemap - Simple Unicode Collation Algorithm
RFC 5051

Document Type RFC - Proposed Standard (October 2007; No errata)
Was draft-crispin-collation-unicasemap (individual in app area)
Last updated 2013-03-02
Stream IETF
Formats plain text pdf html
Stream WG state (None)
Consensus Unknown
Document shepherd No shepherd assigned
IESG IESG state RFC 5051 (Proposed Standard)
Telechat date
Responsible AD Lisa Dusseault
Send notices to mrc@cac.washington.edu, alexey.melnikov@isode.com
Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007

         i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   This document describes "i;unicode-casemap", a simple case-
   insensitive collation for Unicode strings.  It provides equality,
   substring, and ordering operations.

1.  Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics.  It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   However, the "i;ascii-casemap" collation does not produce
   satisfactory results with non-ASCII characters.  It is possible, with
   a modest extension, to provide a more sophisticated collation with
   greater multilingual applicability than "i;ascii-casemap".  This
   extension provides case-independent comparisons for a much greater
   number of characters.  It also collates characters with diacriticals
   with the non-diacritical character forms.

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap".  It does not replace the
   "i;basic" collation described in [BASIC].

2.  Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which is
   case-insensitive in its treatment of characters.  It provides
   equality, substring, and ordering operations.  The validity test
   operation returns "valid" for any input.

Crispin                     Standards Track                     [Page 1]
RFC 5051                   i;unicode-casemap                October 2007

   This collation allows strings in arbitrary (and mixed) character
   sets, as long as the character set for each string is identified and
   it is possible to convert the string to Unicode.  Strings which have
   an unidentified character set and/or cannot be converted to Unicode
   are not rejected, but are treated as binary.

   Each input string is prepared by converting it to a "titlecased
   canonicalized UTF-8" string according to the following steps, using
   UnicodeData.txt ([UNICODE-DATA]):

      (1) A Unicode codepoint is obtained from the input string.

          (a) If the input string is in a known charset that can be
              converted to Unicode, a sequence in the string's charset
              is read and checked for validity according to the rules of
              that charset.  If the sequence is valid, it is converted
              to a Unicode codepoint.  Note that for input strings in
              UTF-8, the UTF-8 sequence must be valid according to the
              rules of [UTF-8]; e.g., overlong UTF-8 sequences are
              invalid.

          (b) If the input string is in an unknown charset, or an
              invalid sequence occurs in step (1)(a), conversion ceases.
              No further preparation is performed, and any partial
              preparation results are discarded.  The original string is
              used unchanged with the i;octet comparator.

      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
          are performed on the resulting codepoint from step (1)(a).

          (a) If the codepoint has a titlecase property in
              UnicodeData.txt (this is normally the same as the
              uppercase property), the codepoint is converted to the
              codepoints in the titlecase property.

          (b) If the resulting codepoint from (2)(a) has a decomposition
              property of any type in UnicodeData.txt, the codepoint is
              converted to the codepoints in the decomposition property.
              This step is recursively applied to each of the resulting
              codepoints until no more decomposition is possible
              (effectively Normalization Form KD).

          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
Show full document text