
Internet Engineering Task Force                              A. Sullivan
Internet-Draft                                                       Dyn
Intended status: Informational                                A. Freytag
Expires: September 10, 2015                                   ASMUS Inc.
                                                           March 9, 2015


A Problem Statement to Motivate Work on Locale-free Unicode Identifiers
                   draft-sullivan-lucid-prob-stmt-00

Abstract

   Internationalization techniques that the IETF has adopted depended on
   some assumptions about the way characters get added to Unicode.  Some
   of those assumptions turn out not to have been true.  Discussion is
   necessary to determine how the IETF should respond to the new
   understanding of how Unicode works.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of




Sullivan & Freytag     Expires September 10, 2015               [Page 1]


Internet-Draft           LUCID Problem Statement              March 2015


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Background  . . . . . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  The Inclusion Mechanism . . . . . . . . . . . . . . . . .   3
     2.2.  The Difference Between Theory and Practice  . . . . . . .   4
       2.2.1.  Confusability . . . . . . . . . . . . . . . . . . . .   4
         2.2.1.1.  Not everything can be solved  . . . . . . . . . .   5
       2.2.2.  The Problem Now Before Us . . . . . . . . . . . . . .   6
   3.  Identifiers . . . . . . . . . . . . . . . . . . . . . . . . .   7
     3.1.  Types of Identifiers  . . . . . . . . . . . . . . . . . .   8
   4.  Possible Nature of Problem  . . . . . . . . . . . . . . . . .   9
     4.1.  Just a Species of Confusables . . . . . . . . . . . . . .   9
     4.2.  Just a Species of Homoglyphs  . . . . . . . . . . . . . .   9
     4.3.  Separate Problem  . . . . . . . . . . . . . . . . . . . .  10
     4.4.  Unimportant Problem . . . . . . . . . . . . . . . . . . .  10
   5.  Possible Ways Forward . . . . . . . . . . . . . . . . . . . .  10
     5.1.  Find the Cases, Disallow New Ones, and Deal
           With Old Ones . . . . . . . . . . . . . . . . . . . . . .  10
     5.2.  Disallow Certain Combining Sequences
           Absolutely  . . . . . . . . . . . . . . . . . . . . . . .  11
     5.3.  Do Nothing, Possibly Warn . . . . . . . . . . . . . . . .  11
     5.4.  Identify Enough Commonality for a New
           Property  . . . . . . . . . . . . . . . . . . . . . . . .  12
     5.5.  Create an IETF-only Normalization Form  . . . . . . . . .  12
   6.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  12
   7.  Informative References  . . . . . . . . . . . . . . . . . . .  12
   Appendix A.  Examples . . . . . . . . . . . . . . . . . . . . . .  13
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15

1.  Introduction

   Among its features, IDNA2008 [RFC5890] [RFC5891] [RFC5892] [RFC5893]
   [RFC5894] [RFC5895] provides a way of using Unicode [Unicode]
   characters without regard to the version of Unicode available.  The
   same approach is generalized for protocols other than DNS by the
   PRECIS framework [I-D.ietf-precis-framework].

   The mechanism used is called "inclusion", and is outlined in
   Section 2.1 below.  We call the general strategy "inclusion-based
   identifier internationalization" or "i3" for short.  I3 depends on
   certain assumptions made in the IETF at the time it was being
   developed.  Some of those assumptions were about the relationships
   between various characters and the likelihood that similar such
   relationships would get added to future versions of Unicode.  Those
   assumptions turn out not to have been true in every case.  This
   raises a question, therefore, about whether the current approach
   meets the needs of the IETF for internationalizing identifiers.

   This memo attempts to give enough background about the situation so
   that IETF participants can take part in a discussion about what (if
   anything) to do about the state of affairs; the discussion is
   expected to happen as part of the LUCID BoF at IETF 92.  The reader
   is assumed to be familiar with the terminology in [RFC6365].  This
   memo owes a great deal to the exposition in
   [I-D.klensin-idna-5892upd-unicode70].

2.  Background

   The intent of Unicode is to encode all known writing systems into a
   single coded character set.  One consequence of that goal is that
   Unicode encodes an enormous number of characters.  Another is that
   the work of Unicode does not end until every writing system is
   encoded; even after that, it needs to continue to track any changes
   in those writing systems.  Unicode encodes abstract characters, not
   glyphs.  Because of the way Unicode was built up over time, there are
   sometimes multiple ways to encode the same abstract character.  If
   Unicode encodes an abstract character in more than one way, then for
   most purposes the different encodings should all be treated as though
   they're the same character.  This is called "canonical equivalence".

   A lack of a defined canonical equivalence is tantamount to an
   assertion by Unicode that the two encodings do not represent the same
   abstract character, even if both happen to result in the same
   appearance.
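
   As an illustration only, the following Python sketch uses the
   standard unicodedata module to compare two encodings of one abstract
   character.  The example character (LATIN SMALL LETTER E WITH ACUTE)
   and the helper name canonically_equivalent are chosen here for
   illustration and are not taken from the discussion above.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
combining = "\u0065\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The code point sequences differ, but Unicode treats them as equivalent.
print(precomposed == combining)                        # False
print(canonically_equivalent(precomposed, combining))  # True
```

   Comparing the fully decomposed forms is one common way to test for
   canonical equivalence; comparing the NFC forms would work equally
   well.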

   Every encoded character in Unicode (that is, every code point) is
   associated with a set of properties.  The properties define what
   script a code point is in, whether it is a letter or a number or
   punctuation and so forth, what direction it is written in, to what
   other code point or code point sequence it is canonically equivalent,
   and many other properties.  These properties are important to the
   inclusion mechanism.
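
   A few of these properties can be inspected with Python's standard
   unicodedata module, as in the sketch below.  The helper name describe
   is hypothetical.  Note that the standard library does not expose the
   Script property, so it is not shown here.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return a few Unicode properties of a single code point."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # General_Category, e.g. Ll, Nd, Mn
        unicodedata.bidirectional(ch),  # Bidi_Class, e.g. L, AL, NSM
        unicodedata.decomposition(ch),  # decomposition mapping, "" if none
    )


for ch in ("a", "7", "\u0628", "\u0654", "\u00E9"):
    print(describe(ch))
```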

2.1.  The Inclusion Mechanism

   Because of both the enormous number of characters in Unicode and the
   many purposes it must serve, Unicode contains characters that are not
   well-suited for use as part of identifiers for network protocols.
   The inclusion mechanism starts by assuming an empty set of
   characters.  It then evaluates Unicode characters not individually,
   but instead by classifying them according to their properties.  This
   classification provides the "derived properties" that IDNA2008 and
   PRECIS rely upon.

   In practice, the inclusion mechanism includes code points that are
   letters or digits.  There are some ways to include or exclude
   characters that otherwise would be excluded or included
   (respectively); but it is impractical to evaluate each character, so
   most characters are included or excluded based on the properties they
   have.
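
   The general shape of such a property-driven decision might be
   sketched as follows.  This is a deliberately coarse approximation,
   not the IDNA2008 or PRECIS derivation: the two exception tables and
   the function name included are hypothetical, and the choice to admit
   combining marks alongside letters and digits is an assumption made
   here because the later sections concern combining sequences.

```python
import unicodedata

# Hypothetical exception tables; the real IDNA2008 and PRECIS rules are
# considerably more detailed than this sketch.
ALWAYS_INCLUDE = {"-"}       # punctuation, so otherwise excluded
ALWAYS_EXCLUDE = {"\u0640"}  # ARABIC TATWEEL: a letter, excluded anyway


def included(ch: str) -> bool:
    """Decide inclusion from properties, starting from an empty set."""
    if ch in ALWAYS_EXCLUDE:
        return False
    if ch in ALWAYS_INCLUDE:
        return True
    category = unicodedata.category(ch)
    # Letters (L*), combining marks (M*), and decimal digits (Nd).
    return category[0] in ("L", "M") or category == "Nd"


print([included(ch) for ch in ("a", "7", "\u0628", "!", " ", "\u0640", "-")])
```

   Because the decision is driven by properties rather than by a list of
   code points, the same function gives an answer for characters added
   in later versions of Unicode.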

   I3 depends on the assumption that strings that will be used in
   identifiers will not have any ambiguous matching to other strings.
   In practice, this means that input strings to the protocol are
   expected to be in Normalization Form C.  This way, any alternative
   sequences of code points for the same characters will be normalized
   to a single form.  Assuming then that those characters are all
   included by the inclusion mechanism, the string is eligible to be an
   identifier under the protocol.
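
   A check that an input string is already in Normalization Form C could
   look like the following sketch.  The function name is_nfc and the
   sample strings are illustrative assumptions.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", s) == s


print(is_nfc("caf\u00E9"))   # precomposed e-acute: True
print(is_nfc("cafe\u0301"))  # e + combining acute: False
```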

2.2.  The Difference Between Theory and Practice

   In principle, under i3 identifiers should be unambiguous.  It has
   always been recognized, however, that for humans some ambiguity was
   inevitable, because of the vagaries of writing systems and of human
   perception.

   Normalization Form C (NFC) removes the ambiguities based on dual or
   multiple encoding for the same abstract character.  However,
   characters are not the same as their glyphs.  This means that it is
   possible for certain abstract characters to share a glyph.  We call
   such abstract characters "homoglyphs".  While this looks at first
   like something that should be handled (or should have been handled)
   by normalization (NFC or something else), there are important
   differences; the situation is in some sense an extreme case of a
   spectrum of ambiguity discussed in the following section.

2.2.1.  Confusability

   While Unicode deals in abstract characters and i3 works on Unicode
   code points, users interact with the characters as actually rendered:
   glyphs.  There are characters that, depending on font, sometimes look
   quite similar to one another (such as "l" and "1"); any character
   that is like this is often called "visually similar".  More difficult
   are characters that, in any normal rendering, always look the same as
   one another.  The shared history of Cyrillic, Greek, and Latin
   scripts, for example, means that there are characters in each script
   that function similarly and that are usually indistinguishable from
   one another, though they are not the same abstract character.  These
   are examples of "homoglyphs."  Any character that can be confused for
   another one can be called confusable, and confusability can be
   thought of as a spectrum with "visually similar" at one end, and
   "homoglyphs" at the other.  (We use the term "homoglyph" strictly:
   code points that normally use the same glyph when rendered.)
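
   For example, the following sketch shows that a Latin and a Cyrillic
   letter that usually share a glyph remain distinct after
   normalization.  The particular pair is chosen here for illustration.

```python
import unicodedata

latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Normalization does not unify them: they are different abstract
# characters that merely tend to be rendered with the same glyph.
same_after_nfc = (
    unicodedata.normalize("NFC", latin_a)
    == unicodedata.normalize("NFC", cyrillic_a)
)
print(same_after_nfc)  # False
```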

   Most of the time, there is some characteristic that can help to
   mitigate confusion.  Mitigation may be as simple as using a font
   designed to distinguish among different characters.  For homoglyphs,
   a large number of cases (but not all of them) turn out to be in
   different scripts.  As a result, there is an operational convention
   that identifiers should always be in a single script.  (This strategy
   can be less than successful in cases where each identifier is in a
   single script, but the repertoire used in operation allows multiple
   scripts, because of whole string confusables -- strings made up
   entirely of homoglyphs of another string in a different script.)
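
   A single-script check, and its limits, might be sketched as follows.
   The Python standard library does not expose the Script property, so
   this sketch assumes that the first word of a character's Unicode name
   is an acceptable stand-in; that heuristic, the function name
   rough_scripts, and the sample strings are all assumptions made for
   illustration.

```python
import unicodedata


def rough_scripts(s: str) -> set:
    """Approximate the set of scripts used in s from character names.

    Assumption: the first word of the Unicode character name (LATIN,
    CYRILLIC, GREEK, ...) stands in for the Script property.  Digits
    and the hyphen are skipped because they are shared across scripts.
    """
    scripts = set()
    for ch in s:
        if ch.isdigit() or ch == "-":
            continue
        scripts.add(unicodedata.name(ch).split()[0])
    return scripts


mixed = "p\u0430ypal"            # Latin letters with one Cyrillic letter
whole_cyrillic = "\u0441\u043E\u0440"  # Cyrillic es, o, er
whole_latin = "cop"

print(rough_scripts(mixed))           # two scripts: rejected by the rule
print(rough_scripts(whole_cyrillic))  # one script: passes
print(rough_scripts(whole_latin))     # one script: passes
```

   The last two strings each pass a single-script rule, yet they are
   whole string confusables of one another, which is the limitation
   noted in the parenthetical above.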

   There is another convention that operators should only ever use the
   smallest repertoire of code points possible for their environment.
   So, for example, if there is a code point that is sometimes used but
   is perhaps a little obscure, it is better to leave it out and gain
   some experience with other cases first.  In particular, code points
   used in a language with which the administrator is not familiar
   should probably be excluded.  In the case of IDNA, some client
   programs restrict display of U-labels to top-level domains known to
   have policies about single-script labels.  None of these policies or
   conventions does anything to help with code points that are strict
   homoglyphs of each other in the same script (see Appendix A for some
   example cases).

2.2.1.1.  Not everything can be solved

   Before continuing, it is worth noting that there are some cases that,
   regardless of mitigation, are fundamentally impossible to solve.
   There are certainly cases of two strings in which all the code points
   in one script in the first string, and all the code points in another
   script in the second string, are respectively confusable with one
   another.  In that case, the strings cannot be distinguished by a
   reader, and the whole string is confusable.  Further, human
   perception is easily tricked, so that entirely unrelated character
   sequences can become confusable, for example "rn" being confused with
   "m".

   Given the facts of history and the contingencies of writing systems,
   one cannot defend against all of these cases; and it seems all but
   certain that many of these cases cannot successfully be addressed on
   the protocol level alone.  In general, the i3 strategy can only
   define rules for one identifier at a time, and has no way to offer
   guidance about how different identifiers under the same scheme ought
   to interact.  Humans are likely to respond according to the entire
   identifier string, so there seems to be a deep tension between the
   narrow focus of i3, and the actual experience of users.

   In addition, several factors limit the ability to ensure that any
   solution adopted is final and complete: the sheer complexity of
   writing systems, the fact that many of them are not as well
   understood as Latin or Han, and that many less developed writing
   systems are potentially susceptible to paradigm changes as digital
   support for them becomes more widespread.  Detailed knowledge about,
   and implementation experience for, these writing systems only emerges
   over time; disruptive changes are neither predictable ahead of time
   nor preventable.  In essence, any solution to eliminate ambiguity can
   be expected to get some detail wrong.

   Nobody should imagine that the present discussion takes as its goal
   the complete elimination of all possible confusion.  The failure to
   achieve such a goal does not mean, however, that we should do
   nothing, any more than the low chances of ever arresting all grifters
   means that we should not enact laws against fraud.  Our discussion,
   then, must focus on those problems that can be addressed within the
   constraints of the protocols; and, in particular, the subset that
   are suitable for that

2.2.2.  The Problem Now Before Us

   During the expert review necessary for supporting Unicode 7.0.0 for
   use with IDNA, a new code point U+08A1, ARABIC LETTER BEH WITH HAMZA
   ABOVE came in for some scrutiny.  Using versions of Unicode up to and
   including 7.0.0, it is possible to combine ARABIC LETTER BEH (U+0628)
   and ARABIC HAMZA ABOVE (U+0654) to produce a glyph that is
   indistinguishable from the one produced by U+08A1.  But U+08A1 and
   \u'0628'\u'0654' are not canonically equivalent.  (For more
   discussion of this issue, see [I-D.klensin-idna-5892upd-unicode70].)
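
   The lack of canonical equivalence can be observed directly.  The
   sketch below assumes a Python whose unicodedata module carries
   Unicode 7.0.0 or later data; the contrasting ALEF WITH HAMZA ABOVE
   pair is the one listed in Appendix A, and the helper name nfc is
   hypothetical.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize s to Normalization Form C."""
    return unicodedata.normalize("NFC", s)


beh_hamza_single = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_hamza_sequence = "\u0628\u0654"  # BEH + HAMZA ABOVE

alef_hamza_single = "\u0623"          # ARABIC LETTER ALEF WITH HAMZA ABOVE
alef_hamza_sequence = "\u0627\u0654"  # ALEF + HAMZA ABOVE

# No canonical decomposition is defined for U+08A1, so NFC leaves the
# combining sequence alone and the two forms stay distinct.
print(nfc(beh_hamza_sequence) == beh_hamza_single)    # False

# U+0623 does decompose to U+0627 U+0654, so NFC unifies the two forms.
print(nfc(alef_hamza_sequence) == alef_hamza_single)  # True
```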

   Further investigation reveals that there are several similar cases.
   ARABIC HAMZA ABOVE (U+0654) turns out to be implicated in some cases,
   but not all of them.  There are cases in Latin (see Appendix A for
   examples).  There are certainly cases in other scripts (some examples
   are provided in Appendix A).  Most of the cases have a handful of
   things in common:

   o  There are at least two forms by which the same glyph is produced.

   o  One of the forms uses a combining sequence and another form is a
      precomposed character, or else one of the forms is a digraph.
      [[CREF1: Is this true?  Are there any cases that don't match it?
      --ajs]]

   o  The results when rendered as glyphs cannot be distinguished from
      one another.

   o  The two forms are not canonically equivalent.

   o  All of the relevant code points have the same script property, or
      else inherit the script property of the previous character so that
      it is not possible to select on the basis of the script.

   o  Competent users of the writing system in a language do not treat
      one of the two forms (the combining sequence or the precomposed
      character) as reasonable.  To writers for whom the combining
      sequence is
      "wrong", it is not a case of a base character modified by an
      additional mark, but instead a separate letter.  Conversely, to
      writers for whom the precomposed character is "wrong", it is
      definitely a matter of adding something to a character that
      otherwise stands on its own.  (Not every possible combination
      would normally be used by anyone, of course, and sometimes -- not
      infrequently -- one of the alternatives is not used by any
      orthography.)

   Cases that match these conditions might be considered to involve
   "non-normalizable diacritics", because most of the combining marks in
   question are non-spacing marks that are or act like diacritics.
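
   Only some of the conditions listed above can be tested mechanically.
   The sketch below checks two of them (no canonical equivalence, and a
   sequence ending in a non-spacing mark) against the pairs listed in
   Appendix A.  Whether the rendered glyphs match and how users of a
   writing system regard each form cannot be determined this way.  The
   function name and the choice of conditions are assumptions made for
   illustration.

```python
import unicodedata

# (single code point, sequence) pairs taken from Appendix A.
PAIRS = [
    ("\u08A1", "\u0628\u0654"),  # ARABIC LETTER BEH WITH HAMZA ABOVE
    ("\u0681", "\u062D\u0654"),  # ARABIC LETTER HAH WITH HAMZA ABOVE
    ("\u0623", "\u0627\u0654"),  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    ("\u09E1", "\u098C\u09E2"),  # BENGALI LETTER VOCALIC LL
    ("\u019A", "\u006C\u0335"),  # LATIN SMALL LETTER L WITH BAR
]


def matches_mechanical_conditions(single: str, sequence: str) -> bool:
    """Check only the machine-testable conditions for one pair."""
    not_equivalent = (
        unicodedata.normalize("NFD", single)
        != unicodedata.normalize("NFD", sequence)
    )
    ends_in_nonspacing_mark = unicodedata.category(sequence[-1]) == "Mn"
    return not_equivalent and ends_in_nonspacing_mark


results = [matches_mechanical_conditions(s, q) for s, q in PAIRS]
print(results)  # only the ALEF pair is canonically equivalent
```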

3.  Identifiers

   Part of the reason i3 works from the assumption that not all Unicode
   code points are appropriate for identifiers is that identifiers do
   not work like words or phrases in a language.  First, identifiers
   often appear in contexts where there is no way to tell the language
   of the identifiers.  Indeed, many identifiers are not really "in a
   language" at all.  Second, and partly because of that lack of
   linguistic root, identifiers are often either not words or use
   unusual orthography precisely to differentiate themselves.

   In ordinary language use, the ambiguity identified in Section 2.2 may
   well create no difficulty.  Running text has two properties that make
   this so.  First, because there is a linguistic context (the rest of
   the text), it is possible to detect code points that are used in an
   unusual way and flag them or, even, create automatic rules to "fix"
   such issues.  Second, linguistic context comes with spelling rules
   that automatically determine whether something is written the right
   way.  Because of these facts, it is often possible even without a
   locale identifier to work out what the locale of the text ought to
   be.  So, even in cases where passages of text need to be compared, it
   is possible to mitigate the issue.

   The same locale-detection approach does not work for identifiers.
   Worse, identifiers, by their very nature, are things that must
   provide reliable exact matches.  The whole point of an identifier is
   that it provides a reliable way of uniquely naming the thing to be
   identified.  Partial matches and heuristics are inadequate for those
   purposes.  Identifiers are often used as part of the security
   practices for a protocol, and therefore ambiguity in matching
   presents a risk for the security of any protocol relying on the
   identifier.
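
   To see why this matters for exact matching, consider a registry that
   normalizes to NFC and then compares code points exactly.  The
   registry, the function name register, and the owner names in the
   sketch below are hypothetical.

```python
import unicodedata

registry = {}


def register(identifier: str, owner: str) -> None:
    """Register identifier for owner, refusing exact duplicates."""
    key = unicodedata.normalize("NFC", identifier)
    if key in registry:
        raise ValueError("identifier already registered")
    registry[key] = owner


register("\u08A1", "alice")           # single code point form
register("\u0628\u0654", "mallory")   # combining sequence: also accepted

# Two registrations that a reader cannot tell apart now coexist.
print(len(registry))
```

   The duplicate check works as designed, yet it cannot protect against
   the second registration, because the two keys are different strings
   even after normalization.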

3.1.  Types of Identifiers

   It is worth observing that not all identifiers are of the same type.
   There are four relevant dimensions in which identifiers can differ in
   type:

   1.  Scope

       (a)  Internet-wide

       (b)  Unique within a context (often a site)

       (c)  Link-local only

   2.  Management

       (a)  Centrally managed

       (b)  Contextually managed (e.g. registering a nickname with a
            server for a session)

       (c)  Unmanaged

   3.  Durability

       (a)  Permanent

       (b)  Durable but with possible expiration

       (c)  Temporary

       (d)  Ephemeral

   4.  Authority

       (a)  Single authority

       (b)  Multiple authorities (possibly within a hierarchy)

       (c)  No authority

   These different dimensions present ways in which mitigation of the
   identified issue might be possible.  For instance, a protocol that
   uses only link-local identifiers that are unmanaged, temporary, and
   configured automatically does not really present a problem, because
   for practical purposes its linguistic context is constrained to the
   social realities of the LAN in question.  A durable Internet-wide
   identifier centrally managed by multiple authorities will present a
   greater issue unless locale information comes along with the
   identifier.
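
   The four dimensions can be written down as a small data structure,
   which may help when classifying the identifiers of a given protocol.
   The class and member names below are hypothetical, and the authority
   value for the link-local example is an assumption, since the text
   above does not state it.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    INTERNET_WIDE = 1
    CONTEXT = 2       # unique within a context (often a site)
    LINK_LOCAL = 3


class Management(Enum):
    CENTRAL = 1
    CONTEXTUAL = 2
    UNMANAGED = 3


class Durability(Enum):
    PERMANENT = 1
    DURABLE = 2       # durable but with possible expiration
    TEMPORARY = 3
    EPHEMERAL = 4


class Authority(Enum):
    SINGLE = 1
    MULTIPLE = 2
    NONE = 3


@dataclass(frozen=True)
class IdentifierType:
    scope: Scope
    management: Management
    durability: Durability
    authority: Authority


# The two examples from the paragraph above.
lan_name = IdentifierType(
    Scope.LINK_LOCAL, Management.UNMANAGED,
    Durability.TEMPORARY, Authority.NONE,  # authority assumed
)
wide_name = IdentifierType(
    Scope.INTERNET_WIDE, Management.CENTRAL,
    Durability.DURABLE, Authority.MULTIPLE,
)
print(lan_name)
print(wide_name)
```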

4.  Possible Nature of Problem

   We may regard this problem as one of several different kinds, and
   depending on how we view it we will have different approaches to
   addressing it.

4.1.  Just a Species of Confusables

   Under this interpretation, the current issue is no different from any
   other confusable case, except in detail.  Since there is no way to
   solve the general problem of confusables, there is no way to solve
   this problem either.  Moreover, to the degree that confusables are
   solved outside protocols, by administration and policy, the current
   issue might be addressed by the same strategy.

   This interpretation seems unsatisfying, because there exist some
   partial mitigations, and if suitable further mitigations are possible
   it would be wise to apply them.

4.2.  Just a Species of Homoglyphs

   Under this interpretation, the current issue is no different from any
   other homoglyph case.  After all, the basic problem is that there is
   no way for a user to tell which code point is represented by what the
   user sees in either case.

   There is some merit to this view, but it has the problem that many of
   the homoglyph issues (admittedly not all of them) can be mitigated
   through registration rules, and those rules can be established
   without examining the particular code points in question (that is,
   they can operate just on the properties of code points, such as
   script membership).  The current issue does not allow such mitigation
   given the properties that are currently available.  At the same time,
   it may be that it is impossible to deal with this adequately, and
   some judgement will be needed for what is adequate.  This is an area
   where more discussion is clearly needed.

4.3.  Separate Problem

   Under this interpretation, there is a definable problem, and its
   boundaries can be specified.

   That we can list some necessary conditions for the problem suggests
   that it is a separable problem.  The list of factors in Section 2.2.2
   seems to indicate that it is possible to describe the bounds of a
   problem that can be addressed separately.

   What is not clear is whether it is separable enough to make it worth
   treating separately.

4.4.  Unimportant Problem

   Under this interpretation, while it is possible to describe the
   problem, it is not a problem worth addressing since nobody would ever
   create such identifiers on purpose.

   The problem with this approach, for identifiers, is that it
   represents an opportunity for phishing and other similar attacks.
   While mitigation will not stop all such attacks, we should try to
   understand opportunities for those attacks and close them when we
   have identified them and it is practical to do so.

   Whether phishing or other attacks using confusable code points "pay
   off" depends to some extent on the popularity or frequency of the
   code points in question.  While it may be worthwhile to address the
   generalized issue, individual edge cases may have no practical
   consequences.  The inability to address them, then, should not hold
   up progress on a solution for the more common, general case.

5.  Possible Ways Forward

   There are a few ways that this issue could be mitigated.  Note that
   this section is closely related to Section 3 in
   [I-D.klensin-idna-5892upd-unicode70].

5.1.  Find the Cases, Disallow New Ones, and Deal With Old Ones

   In this case, it is necessary to enumerate all the cases, add
   exceptions to DISALLOW any new cases from happening, and make a
   determination about what to do for every past case.  There are two
   reasons to doubt whether this approach will work.

   1.  The IETF did not catch these issues during previous
       internationalization efforts, and it seems unlikely that in the
       meantime it has acquired enough expertise in writing systems to
       do a proper job of it this time.

   2.  This approach blunts the effectiveness of being Unicode version-
       agnostic, since it would effectively block any future additions
       to Unicode that had any interaction with the present version.

   So, this approach does not seem too promising.

5.2.  Disallow Certain Combining Sequences Absolutely

   In this case, instead of treating all the code points in Unicode, the
   IETF would need only to look at all combining characters.  While the
   IETF obviously does not have the requisite expertise in writing
   systems to do this unilaterally, the Unicode Consortium does.  In
   fact the Unicode Technical Committee has a clear understanding that
   some combining sequences are never intended to be used for
   orthographic purposes.  Any glyph needed for an orthography or
   writing system will, once identified, be added as a single code point
   with a "pre-composed" glyph.

   In principle there is no obstacle, in these cases, to asking Unicode
   to express this understanding in the form of a character property,
   which would let the IETF DISALLOW the combining marks having such a
   property.
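
   How such a rule might behave is sketched below.  The set
   NOT_FOR_ORTHOGRAPHIC_USE is a hypothetical stand-in for the property
   proposed in this section, and its single member is chosen only for
   illustration; the function name disallowed is likewise an assumption.

```python
import unicodedata

# Hypothetical stand-in for the proposed character property.
NOT_FOR_ORTHOGRAPHIC_USE = {"\u0654"}  # ARABIC HAMZA ABOVE, for illustration


def disallowed(s: str) -> bool:
    """Return True if the NFC form of s still contains a flagged mark."""
    normalized = unicodedata.normalize("NFC", s)
    return any(ch in NOT_FOR_ORTHOGRAPHIC_USE for ch in normalized)


print(disallowed("\u0628\u0654"))  # BEH + HAMZA ABOVE: mark survives NFC
print(disallowed("\u08A1"))        # single code point: no mark present
print(disallowed("\u0627\u0654"))  # ALEF + HAMZA ABOVE: composed by NFC
```

   Testing after normalization means that sequences with a canonical
   composition are unaffected, while sequences that have no canonical
   equivalent are rejected in favour of the single code point.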

5.3.  Do Nothing, Possibly Warn

   One possibility is to accept that there is nothing one can do in
   general here, and that therefore the best one can do is warn people
   to be careful.

   The problem with this approach, of course, is that it all but
   guarantees future problems with ambiguous identifiers.  It would
   provide a good reason to reject all internationalized identifiers as
   representing a significant security risk, and would therefore mean
   that internationalized identifiers would become "second class".
   Unfortunately, however, the demand for internationalized identifiers
   would not likely be reduced by this decision, so some people would
   end up using identifiers with known security problems.

   This approach may be the only one possible in some of the borderline
   cases where mitigation approaches are not successful.

5.4.  Identify Enough Commonality for a New Property

   There is reason to suppose that, if the IETF can come up with clear
   and complete conditions under which code points causing an issue
   could be classified, the Unicode Technical Committee would add such a
   property to code points in future versions of the Unicode Standard.
   Assuming the conditions were clear, future additions to the Standard
   could also be assigned appropriate values of the property, meaning
   that the IETF could revert to making decisions about code points
   based on derived properties.  Beyond the property mentioned in
   Section 5.2, this property could cover certain combining marks in the
   Arabic script.

   If this is possible, it seems a desirable course of action.

5.5.  Create an IETF-only Normalization Form

   Under this approach, the IETF creates a special normalization form
   that it maintains outside the Unicode Standard.  For the sake of the
   discussion, we'll call this "NFI".

   This option does not seem workable.  The IETF would have to evaluate
   every new release of Unicode to discover the extent to which the new
   release interacts with NFI.  Because it would be independently
   maintained, Unicode stability guarantees would not apply to NFI; the
   results would be unpredictable.  As a result, either the IETF would
   have to ignore new additions to Unicode, or else it would need UTC to
   take NFI into account.  If UTC were able to do so, this option
   reduces to the option in Section 5.4.  The UTC might not be able to
   do this, however, because the very principles that Unicode uses to
   assign new characters in certain situations guarantee that new
   characters will be added that cannot be so normalized and yet are
   essential for still-to-be-encoded writing systems.  Communities for
   which these new characters would be added would also not accept any
   existing code point sequence as equivalent.  This also means that
   Unicode cannot create a stability policy to take into account the
   needs of such an NFI.
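
   To make the maintenance burden concrete, a naive NFI could be
   sketched as NFC followed by an extra, IETF-maintained composition
   table.  Everything in the sketch below is hypothetical: the table,
   its two entries (taken from the pairs in Appendix A), and the
   function name nfi.

```python
import unicodedata

# Hypothetical IETF-maintained table of extra compositions.
NFI_EXTRA_COMPOSITIONS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE
    "\u062D\u0654": "\u0681",  # HAH + HAMZA ABOVE
}


def nfi(s: str) -> str:
    """Apply NFC, then the extra compositions (naive substring replace)."""
    out = unicodedata.normalize("NFC", s)
    for sequence, single in NFI_EXTRA_COMPOSITIONS.items():
        out = out.replace(sequence, single)
    return out


print(nfi("\u0628\u0654") == "\u08A1")  # True: covered by the table
print(nfi("\u098C\u09E2") == "\u09E1")  # False: not in the table
```

   Any pair missing from the table, including every pair introduced by a
   later version of Unicode, is silently left unnormalized, which is why
   the table would need review at each release.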

6.  Acknowledgements

   The discussion in this memo owes a great deal to the IAB
   Internationalization program, and particularly to John Klensin.

7.  Informative References

   [I-D.ietf-precis-framework]
              Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              draft-ietf-precis-framework-23 (work in progress),
              February 2015.

   [I-D.klensin-idna-5892upd-unicode70]
              Klensin, J. and P. Faeltstroem, "IDNA Update for Unicode
              7.0.0", draft-klensin-idna-5892upd-unicode70-03 (work in
              progress), January 2015.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5893, August 2010.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, August 2010.

   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
              Internationalized Domain Names in Applications (IDNA)
              2008", RFC 5895, September 2010.

   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
              Internationalization in the IETF", BCP 166, RFC 6365,
              September 2011.

   [Unicode]  "The Unicode Standard",
              http://www.unicode.org/versions/Unicode7.0.0/.

Appendix A.  Examples

   There are a number of cases that illustrate the combining sequence or
   digraph issue:

   U+08A1 vs \u'0628'\u'0654'  This case is ARABIC LETTER BEH WITH HAMZA
      ABOVE, the case that was detected during expert review and that
      caused the IETF to notice the issue.  The issue existed before
      this, but we did not know it.  For detailed discussion of this
      case and some of the following ones, see
      [I-D.klensin-idna-5892upd-unicode70].

   U+0681 vs \u'062D'\u'0654'  This case is ARABIC LETTER HAH WITH HAMZA
      ABOVE, which (like U+08A1) does not have a canonical equivalent.
      In both cases, the places where hamza above is used are
      specialized enough that the combining marks can be excluded in
      some cases (for example, the root zone under IDNA).

   U+0623 vs \u'0627'\u'0654'  This case is ARABIC LETTER ALEF WITH
      HAMZA ABOVE.  Unlike the previous two cases, it does have a
      canonical equivalence with the combining sequence.  In the past,
      the IETF misunderstood the reasons for the difference between this
      pair and the previous two cases.

   U+09E1 vs \u'098C'\u'09E2'  This case is BENGALI LETTER VOCALIC LL.
      This is an example in Bengali script of a case without a canonical
      equivalence to the combining sequence.  Per Unicode, the single
      code point should be used to represent vowel letters in text, and
      the sequence of code points should not be used.  But it is not a
      simple matter of disallowing the combining vowel mark in cases
      like this; where the combination does not exist and the use of the
      sequence is already established, Unicode is unlikely to encode the
      combination.

   U+019A vs \u'006C'\u'0335'  This case is LATIN SMALL LETTER L WITH
      BAR.  In at least some fonts, there is a detectable difference
      with the combining sequence, but only if one types them one after
      another and compares them.  There is no canonical equivalence
      here.  Unicode has a principle of encoding barred letters as
      composites when needed for any writing system.

   U+00F8 vs \u'006F'\u'0337'  This is LATIN SMALL LETTER O WITH STROKE.
      The effect is similar to the previous case.  Unicode has a
      principle of encoding stroked letters as composites when needed
      for any writing system.

   U+02A6 vs \u'0074'\u'0073'  This is LATIN SMALL LETTER TS DIGRAPH,
      which is not canonically equivalent to the letters t and s.  The
      intent appears to be that the digraph shows the two shapes as
      kerned, but the difference may be slight out of context.

   U+01C9 vs \u'006C'\u'006A'  Unlike the TS digraph, the LJ digraph has
      a relevant compatibility decomposition, so it fails the relevant
      stability rules under i3 and is therefore DISALLOWED.  This
      illustrates the way that consistencies that might be natural to
      some users of a script are not necessarily found in it, possibly
      because of uses by another writing system.

   U+06C8 vs \u'0648'\u'0670'  ARABIC LETTER YU is an example where the
      normally-rendered character looks just like a combining sequence,
      but is named differently.  In other words, this is an example
      where the simple fact of the Unicode name would have concealed the
      apparent relationship from the casual observer.

   U+0069 vs \u'0069'\u'0307'  LATIN SMALL LETTER I followed by
      COMBINING DOT ABOVE, by definition, renders exactly the same as
      LATIN SMALL LETTER I by itself, and does so in practice for any
      good font.  The same would be true if "i" were replaced with any
      of the other Soft_Dotted characters defined in Unicode.  The
      character sequence \u'0069'\u'0307' (followed by no other
      combining mark) is reportedly rather common on the Internet.
      Because the base character and the stand-alone code point are the
      same in this case, and the code points affected have the
      Soft_Dotted property already, this could be mitigated separately
      via a context rule affecting U+0307.
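
   The normalization relationships described in the list above can be
   checked mechanically.  The following is a minimal sketch in Python,
   not part of any specification.  It assumes that the Unicode data
   bundled with the interpreter is version 7.0 or later (so that U+08A1
   is assigned); the helper name relation() is hypothetical.

```python
import unicodedata

# (single code point, look-alike sequence) pairs taken from the list above.
PAIRS = [
    ("\u08A1", "\u0628\u0654"),  # ARABIC LETTER BEH WITH HAMZA ABOVE
    ("\u0681", "\u062D\u0654"),  # ARABIC LETTER HAH WITH HAMZA ABOVE
    ("\u0623", "\u0627\u0654"),  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    ("\u09E1", "\u098C\u09E2"),  # BENGALI LETTER VOCALIC LL
    ("\u019A", "\u006C\u0335"),  # LATIN SMALL LETTER L WITH BAR
    ("\u00F8", "\u006F\u0337"),  # LATIN SMALL LETTER O WITH STROKE
    ("\u02A6", "\u0074\u0073"),  # LATIN SMALL LETTER TS DIGRAPH
    ("\u01C9", "\u006C\u006A"),  # LATIN SMALL LETTER LJ
    ("\u06C8", "\u0648\u0670"),  # ARABIC LETTER YU
    ("\u0069", "\u0069\u0307"),  # LATIN SMALL LETTER I
]


def relation(single: str, sequence: str) -> str:
    """Report how Unicode normalization relates the two strings.

    Hypothetical helper: returns "canonical" if the strings match under
    NFC, "compatibility" if they match only under NFKC, else "none".
    """
    nfc = unicodedata.normalize
    if nfc("NFC", single) == nfc("NFC", sequence):
        return "canonical"
    if nfc("NFKC", single) == nfc("NFKC", sequence):
        return "compatibility"
    return "none"


if __name__ == "__main__":
    for single, sequence in PAIRS:
        seq = " ".join(f"U+{ord(c):04X}" for c in sequence)
        print(f"U+{ord(single):04X} vs {seq}: {relation(single, sequence)}")
```

   Under these assumptions, only the U+0623 pair is related by canonical
   equivalence and only the U+01C9 pair by compatibility equivalence.
   For every other pair, normalization alone leaves the single code
   point and the sequence distinct.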

   Other cases test the claim that the issue lies primarily with
   combining sequences:

   U+0B95 vs U+0BE7  The TAMIL LETTER KA and TAMIL DIGIT ONE are always
      indistinguishable, but needed to be encoded separately because one
      is a letter and the other is a digit.

   Arabic-Indic Digits vs. Extended Arabic-Indic Digits  Seven digits of
      these two sequences have entirely identical shapes.
      This case is an example of something dealt with in i3 that
      nevertheless can lead to confusions that are not fully mitigated.
      IDNA, for example, contains context rules restricting the digits
      to one set or another; but such rules apply only to a single
      label, not to an entire name.  Moreover, it provides no way of
      distinguishing between two labels that both conform to the context
      rule, but where each contains one of the seven identical shapes.

   U+53E3 vs U+56D7  These are two Han characters (roughly rectangular)
      that are different when laid side by side; but they may be
      impossible to distinguish out of context or in small print.
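
   The per-label limitation noted for the two digit sets above can be
   illustrated with a short sketch.  The following Python is a
   deliberately simplified stand-in for a "do not mix the two digit
   sets" context rule, not an implementation of the IDNA rules
   themselves; the function names label_ok() and name_ok_per_label()
   are hypothetical.

```python
# Code point ranges for the two digit sets discussed above.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def label_ok(label: str) -> bool:
    """Simplified rule: one label must not mix the two digit sets."""
    has_ai = any(ord(c) in ARABIC_INDIC for c in label)
    has_eai = any(ord(c) in EXTENDED_ARABIC_INDIC for c in label)
    return not (has_ai and has_eai)


def name_ok_per_label(name: str) -> bool:
    """Apply the rule to each dot-separated label independently."""
    return all(label_ok(label) for label in name.split("."))


if __name__ == "__main__":
    mixed_label = "\u0661\u06F1"   # both digit sets inside one label
    mixed_name = "\u0661.\u06F1"   # one digit set per label
    print(label_ok(mixed_label))          # rule rejects the single label
    print(name_ok_per_label(mixed_name))  # rule accepts the whole name
```

   In this sketch the label that mixes the two sets is rejected, while
   the name whose two labels each use a different set is accepted,
   because the check never looks across label boundaries.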

Authors' Addresses

   Andrew Sullivan
   Dyn
   150 Dow St.
   Manchester, NH  03101
   US

   Email: asullivan@dyn.com


   Asmus Freytag
   ASMUS Inc.

   Email: asmus@unicode.org