i;unicode-casemap - Simple Unicode Collation Algorithm

   This document describes "i;unicode-casemap", a simple case-
   insensitive collation for Unicode strings.  It provides equality,
   substring, and ordering operations.

1.  Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics.  It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   However, the "i;ascii-casemap" collation does not produce
   satisfactory results with non-ASCII characters.  It is possible, with
   a modest extension, to provide a more sophisticated collation with
   greater multilingual applicability than "i;ascii-casemap".  This
   extension provides case-independent comparisons for a much greater
   number of characters.  It also collates characters with diacriticals
   with the non-diacritical character forms.

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap".  It does not replace the
   "i;basic" collation described in [BASIC].

2.  Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which is
   case-insensitive in its treatment of characters.  It provides
   equality, substring, and ordering operations.  The validity test
   operation returns "valid" for any input.

   This collation allows strings in arbitrary (and mixed) character
   sets, as long as the character set for each string is identified and
   it is possible to convert the string to Unicode.  Strings which have
   an unidentified character set and/or cannot be converted to Unicode
   are not rejected, but are treated as binary.
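
   As an illustration only, the fallback rule above might be sketched in
   Python as follows.  The function name decode_or_none and the use of
   None to signal "treat as binary" are assumptions made for this
   sketch, not part of the collation.  Python's own codec registry
   stands in for whatever charset conversion an implementation has
   available.

```python
from typing import Optional


def decode_or_none(data: bytes, charset: Optional[str]) -> Optional[str]:
    """Return the string converted to Unicode, or None if the caller
    should treat the original octets as binary.

    Hypothetical helper: the name and the None convention are
    illustrative assumptions.
    """
    if charset is None:
        # Unidentified character set: not rejected, treated as binary.
        return None
    try:
        # Strict decoding fails on any invalid sequence; Python's UTF-8
        # codec also refuses overlong forms.
        return data.decode(charset, errors="strict")
    except (LookupError, UnicodeError):
        # Unknown charset name, or a sequence that cannot be converted.
        return None
```

   A caller would compare the returned text with the collation when the
   result is not None, and otherwise compare the original octets
   unchanged.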

   Each input string is prepared by converting it to a "titlecased
   canonicalized UTF-8" string according to the following steps, using
   UnicodeData.txt ([UNICODE-DATA]):

      (1) A Unicode codepoint is obtained from the input string.

          (a) If the input string is in a known charset that can be
              converted to Unicode, a sequence in the string's charset
              is read and checked for validity according to the rules of
              that charset.  If the sequence is valid, it is converted
              to a Unicode codepoint.  Note that for input strings in
              UTF-8, the UTF-8 sequence must be valid according to the
              rules of [UTF-8]; e.g., overlong UTF-8 sequences are invalid.

          (b) If the input string is in an unknown charset, or an
              invalid sequence occurs in step (1)(a), conversion ceases.
              No further preparation is performed, and any partial
              preparation results are discarded.  The original string is
              used unchanged with the i;octet comparator.

      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
          are performed on the resulting codepoint from step (1)(a).

          (a) If the codepoint has a titlecase property in
              UnicodeData.txt (this is normally the same as the
              uppercase property), the codepoint is converted to the
              codepoints in the titlecase property.

          (b) If the resulting codepoint from (2)(a) has a decomposition
              property of any type in UnicodeData.txt, the codepoint is
              converted to the codepoints in the decomposition property.
              This step is recursively applied to each of the resulting
              codepoints until no more decomposition is possible
              (effectively Normalization Form KD).
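
          Steps (2)(a) and (2)(b) might be sketched in Python as
          follows.  This is a rough illustration under stated
          assumptions: the helper names are hypothetical, the Unicode
          data comes from Python's bundled unicodedata module rather
          than a particular UnicodeData.txt release, and the titlecase
          step only approximates the UnicodeData.txt titlecase field by
          accepting str.title() results that are a single codepoint.

```python
import unicodedata
from typing import List


def simple_titlecase(ch: str) -> str:
    # Assumption: str.title() applies full case mappings, so a result
    # longer than one codepoint is ignored to approximate the
    # single-codepoint titlecase field of UnicodeData.txt.
    titled = ch.title()
    return titled if len(titled) == 1 else ch


def decompose(ch: str) -> List[str]:
    # unicodedata.decomposition() returns e.g. "<compat> 0044 017E",
    # or "" when the codepoint has no decomposition property.
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        fields = fields[1:]  # drop the type tag; any type is accepted
    if not fields:
        return [ch]
    result: List[str] = []
    for field in fields:
        # Recurse until no further decomposition is possible.
        result.extend(decompose(chr(int(field, 16))))
    return result


def prepare_codepoint(cp: int) -> List[int]:
    """Apply step (2)(a), then step (2)(b), to one codepoint."""
    return [ord(c) for c in decompose(simple_titlecase(chr(cp)))]
```

          For instance, prepare_codepoint(0x00E9) (LATIN SMALL LETTER E
          WITH ACUTE) returns U+0045 U+0301 in this sketch.  Note that
          the result of decomposition is not titlecased again.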

          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
