Network Working Group                                       P. Faltstrom
Internet-Draft                                         Cisco Systems Inc
Expires: May 15, 2002                                  November 14, 2001


                        Lookup, Search and Sort
                  draft-faltstrom-i18n-sorting-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://
   www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 15, 2002.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   Lookup, searching and sorting is somewhat defined when using a
   limited set of characters, and especially only one or a few scripts.
   When operating in a more complex environment, the algorithms get more
   complex, and in many cases need information about locale.

   This document describes some of the problems which one encounter in
   an international environment.








Faltstrom                 Expires May 15, 2002                  [Page 1]


Internet-Draft           Lookup, Search and Sort           November 2001


1. Introduction

   Characters mentioned in this document are identified by their
   position or code point in the Unicode character set [UCS].  The
   notation U+12AB, for example, indicates the character at the position
   12AB (hexadecimal) in the [UCS].  It is strongly recommended that a
   [UCS] table is available for reference for the ideograph described.

   In various applications the communication involves a requester and a
   servant of information.  The requester ask for something, and the
   servant send back the result of the operation which takes place on
   the servant side of the communication link.  The communication
   between the requester and the servant is done according to the
   definitions of the protocol which is used.

   If the same request is coming from two different requesters, the
   result given back from the servant is supposed to be the same, or
   else the requester will be surprised.  For example, if a requester is
   accessing the same servant from two different places in the world,
   the requester is supposed to get the same result back in both cases.
   Also, if two different requesters send the same request to the
   servant, they are supposed to get the same result back.

   The request itself include some definition created by the requester
   on what information he wants from the servant.  This request is in
   the case of textual based data expressed through a textual string,
   which of course can be encoded according to the transport encoding
   mechanisms defined in the protocol used.

   This gives that the requester compose a request in his local
   environment, pass that request in the application protocol which is
   used to the servant, which is parsing the request, computes a result
   which is passed back to the requester.

   The problem with lookups, searches and sorting is that the request is
   created in the local environment of the requester, and the
   computation which leads to the result set which is passed back from
   the servant is computed in the local environment of the servant.  In
   a global environment these local environments might be different.
   Examples of differences are character sets, scripts, and locale
   dependent issues like language and geographical locality.

   The environmental variables which can change the computational
   algorithm at the servent include: - Character set - Language -
   Geographical locality - Sort order definition(s) - Equality
   definition(s) - Case folding definition(s)





Faltstrom                 Expires May 15, 2002                  [Page 2]


Internet-Draft           Lookup, Search and Sort           November 2001


2. Lookup

   A lookup is done when the requester know the exact key or value for
   the data which is requested.  If the key is not exactly the right
   one, no data is returned.  When using a distributed system, this
   implies also knowing what server to send the query to.  In turn this
   implies that the key must include enough information to know where to
   send the query, and what the query should be (normally the key
   itself).

   Example of a system which is using lookup is the Domain Name System.

2.1 Problems

   match is a relativly simple operation, as the computation at the
   servent involves making a decision about what is "equal".  This gives
   that the requester and the servant needs to agree on what is equal.
   Equality for text is defined by comparison of characters, and can be
   defined being case depenent or independent.

   A classical example of equivalence definition problems inside one
   script can be illustrated using the character 'A' with a ring above.
   This character can be encoded in Unicode in three ways: 1) U+00C5
   LATIN CAPITAL LETTER A WITH RING ABOVE 2) U+0041 LATIN CAPITAL LETTER
   A followed by U+030A COMBINING RING ABOVE 3) U+212B ANGSTROM SIGN The
   equivalence between 1) and 3) is a singleton equivalence.  The
   equivalence between 1) and 2) is a precomposed/decomposed
   equivalence, where 1) is the precomposed representation, and 2) is
   the decomposed representation.  In all three cases, it is supposed to
   look the same for the reader.

   Case folding gives yet some other problems.  In many parts of the
   world the character 'i' (U+0069, LATIN SMALL LETTER I) has as upper
   case equivalent the character 'I' (U+0049, LATIN CAPITAL LETTER I).
   That is not the case in Turkey.  There they have a character U+0131
   (LATIN SMALL LETTER DOTLESS I) which is lower case of U+0049 (LATIN
   CAPITAL LETTER I), and also a special upper case character which is
   the upper case equivalent of the character 'i'.

   Multiple scripts introduce yet more complicated problems if the
   computation on the servant side is not done carefully.  Basically,
   one can get matches inside any of the scripts (i.e.  both the request
   and the data at the servant is in the same script) but also between
   scripts.  This because some requests might be possible to express in
   more than one script (i.e.  the same character might exist in more
   than one script).

   Another issue have to do with "unification" of characters (or not)



Faltstrom                 Expires May 15, 2002                  [Page 3]


Internet-Draft           Lookup, Search and Sort           November 2001


   between different languages and scripts.  In Unicode a unification
   between Chinese, Japanese and Korean (CJK) characters were used,
   which means in practice that the "same" character used in any of the
   languages are unified to the same codepoint.  This implies that given
   a character and no context, one doesn't know if one talk about the
   Chinese version of the character or the Japanese.  In some
   environments knowing the difference is important for example when
   trying to define matching rules between the simplified and
   traditional form of the chinese character when used in the chinese
   language.









































Faltstrom                 Expires May 15, 2002                  [Page 4]


Internet-Draft           Lookup, Search and Sort           November 2001


3. Sort

   If more than one item is given back in the result set, there is a
   need to decide in what order the results are presented in the result
   set.  The decision on ordering results normally consists on two
   steps: deciding what characters to use for creating the order and
   calculation of the order between two characters.

   An example of the former is a decision that sorting of white pages
   records is to be done according to the last name of a person.  The
   latter is the process of calculating the order between different last
   names.

   Example of a system which allowes sorting is a white pages service
   using the LDAP protocol.

3.1 Problems

   What characters to use when doing sorting changes between
   applications and between cultures.  A phone book on Iceland is sorted
   on the first name of a person while one in Sweden is sorted on the
   last name for example.  A list of email in an email application might
   be sorted in different ways (date, sender, subject) in the same
   session.  What way to sort records is in many cases a request from
   the client to the server, and the server tries to fulfil that request
   just like it tries to fulfil the request of finding the data itself.
   This in many cases leads to a need for negotiation between client and
   server 031-934660.  Gun Segerling

   When comparing the individual characters with each other, the
   collation order is normally dependent on the language which is used,
   which means that cross language collation order is very hard to
   define (it is in most cases undefined, so there is no "standard" to
   reference).  For example, the order of the local characters used in
   Danish, Swedish and Norwegian is different in the three countries,
   even though they in some way can be seen as being the same.















Faltstrom                 Expires May 15, 2002                  [Page 5]


Internet-Draft           Lookup, Search and Sort           November 2001


4. Search

   When doing searches, compared with lookups, just matching characters
   is not enough as a computation algorithm.  The data which is to be
   searched have to be structured in such a way that (more or less) the
   matching itself can be done in an arbitrary part of the data, and not
   only the key of the record.  This is in many cases when the datastore
   is centralized (i.e.  all data is stored at one servant) about the
   same problems as in the case of lookups, but if the data is
   distributed among several servants, finding the exact record is a
   non-trivial task.  In the case of systems (such as DNS) which uses
   lookups, the structure of the key is normally chosen in a way so the
   key itself gives information about what servant have information
   about the record.  Searches might also include getting information
   back from more than one servant if the data is distributed.

   If the search term allow expressions like the regular expression "[a-
   c]" (match every character between 'a' and 'c', inclusive) the same
   problems occur like is described below regarding sorting, i.e.  that
   a rank of each character is needed.  The example given, LDAP, allow
   such expressions, so even though in LDAP a specific sort directive is
   not given, the computation which is taking place on the servant might
   in some cases involve sorting data.

4.1 Problems

   The problems with searching is about the same problem as when doing a
   lookup, with the difference that a lookup is with an exact key, while
   a search is done via approximate matchings.  Either approximate on
   the level of character by character, or one level above, as a
   semantic match ("color" and "colour").



Author's Address

   Patrik Faltstrom
   Cisco Systems Inc
   170 W Tasman Drive SJ-13/2
   San Jose CA 95134
   USA

   EMail: paf@cisco.com
   URI:   http://www.cisco.com







Faltstrom                 Expires May 15, 2002                  [Page 6]


Internet-Draft           Lookup, Search and Sort           November 2001


Appendix A. Appendix


















































Faltstrom                 Expires May 15, 2002                  [Page 7]


Internet-Draft           Lookup, Search and Sort           November 2001


Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.



















Faltstrom                 Expires May 15, 2002                  [Page 8]