Internet Architecture Board                                 T. Hardie,Ed.
Internet-Draft                                             L. Daigle, Ed.
                                                                     IAB


Considerations on Increasing Character Repertoires for Protocol Elements
                         draft-iab-char-rep-00.txt


Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as Internet-
    Drafts.

    Internet-Drafts are draft documents valid for a maximum of six months
    and may be updated, replaced, or obsoleted by other documents at any
    time.  It is inappropriate to use Internet-Drafts as reference
    material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at http://
    www.ietf.org/ietf/1id-abstracts.txt.

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.



Copyright Notice

    Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

    This document describes a set of considerations and strategies to
    use in increasing the character repertoire available in a protocol
    element or suite of protocol elements.  This document is not meant
    to provide normative instruction to protocol designers, but does
    hope to provide guidance on common issues arising from this task.
    This initial draft does not contain real-world examples for the
    different strategies outlined below; before final publication, the
    IAB plans to add such examples and solicits feedback from the
    community on examples appropriate to this draft. Feedback on this
    point or the draft as a whole should be sent to the editors or the
    IAB.


Definitions

    Protocol element: A protocol element is any portion of a message
    which affects processing of that message by the protocol in
    question.  In general, protocol elements are bound to specific
    processing choices by membership in a set of predetermined tokens
    or by explicit structure.  Protocol elements are context dependent
    in that the processing for a token is specific to a protocol.  To
    IP, for example, a TCP port number is payload; to TCP it is a
    protocol element.  Similarly, to TCP a Content-encoding: header is
    payload; to HTTP, it is a protocol element.

    Character repertoire: A character repertoire is the set of all
    characters in all permitted encodings which may be used in a
    protocol element.  Each element in a character repertoire is a
    tuple of a code point and an encoding.  Thus the glyph "a" would
    appear three times in a character repertoire that permitted ASCII,
    iso-8859-1, and iso-8859-7.

    Character set:  As it says, "a set of characters", but more
    particularly a set of characters as represented by code
    points in a particular encoding.


1. Introduction.

    After a protocol's initial deployment, changes in the use of the
    protocol sometimes neccesitate revisiting the character repertoire
    originally chosen for one or more of the elements which make up
    the protocol.  On rare occasions, this occurs because the protocol
    designers need to increase the number of tokens available in a
    fixed-length field and choose to do so by increasing the number of
    characters which may be used.  More commonly, the motive for the
    increase of a character repertoire is the exposure of a protocol
    element to a user community.  Once this leakage occurs, there is
    often pressure to expand the permitted character repertoire of the
    protocol element to match the character repertoire in use in that
    community.

    Though increasing a character repertoire may appear to be a
    relatively simple matter, there are a number of protocol
    processing functions which may be affected.  First among these is
    matching.  Many encodings have very specific matching rules or
    equivalence tables; increasing a character repertoire to include a
    new encoding implies that the protocol must specify how matching
    works in that encoding.  Like matching, sorting works in different
    ways in different encoding schemes, and including a new encoding
    means specifying sorting algorithms for use with it.  Transformation
    presents some unique issues, as it may be possible for some
    systems to map only unidirectionally from one encoding to
    another.  Any of these, and more, can present problems to
    a protocol designer who must post-facto retrofit an increased
    character repertoire into a deployed protocol.

2. Avoidance mechanisms.

   To avoid the need to increase character repertoires at some later
   date, protocol designers can either start with a character
   repertoire which is large enough to encompass that in use in the target
   user community or use protocol elements that are sufficiently
   opaque to a human user that their leakage is unlikely to present
   later pressure.  Both strategies, unfortunately, have been
   notororiously difficult to get right.

2.1 Choosing a large initial character repertoire.

   In this avoidance strategy, the protocol designers presume that
   their protocol elements will leak in the future and provide a
   character repertoire which is sufficiently rich to match the user
   community's needs.  Increasing use of a protocol, however, often
   changes the target user community beyond the intial designers
   projections. A character repetoire which looks large to one user
   community may be completely wrong or very limited to another.  When
   protocol designers attempt to avoid the issue by using a character
   repertoire with a very large number of code points in a very large
   number of encodings, they incurr real costs in parser complexity,
   processing overhead, and bloat.  They also risk that
   misconfiguration of these complex parsers will result in incorrect
   protocol processing.

2.2 Choosing opaque protocol tokens.

   In the second case, designers who choose to use tokens or structure
   which are not human-readable can resist later pressure to increase
   the character repertoire available.  As those who have used
   encodings like ASN.1 can attest, there is, however, an increased
   development cost, as those working with the protocol must develop
   an understanding of the use of the tokens or structure without the
   aid of readability.  This avenue may also be blocked or narrowed to
   protocol designers who will need to pass the new elements among
   different protocols; in those cases, the new protocol is either
   constrained by the previous choices or must provide a normative
   mapping to them.

   When designers use tokens or structures which are not human readable,
   it is common to create a presentation format or layer which is mapped
   to the tokens or structures.  One of the advantage to this approach
   is that new mappings can be defined as new user communities express
   the need for them.  It is important, however, that these are always
   retained as mappings to the protocol elements, and are not treated
   as protocol elements themselves.


3.  Expansion mechanisms.

   For designers who must increase the character repertoire for a
   particular protocol element, there are three basic strategies
   available: they may replace the existing protocol element with a
   new one; they may subsume the character repertoire of the existing
   protocol element in a new one; they may map the new character
   repertoire into the existing repertoire.

   For each of the following strategies, consider the following
   example: a protocol element called "POSTAL" used to name the
   U.S. zip code in which the network element is placed cannot handle
   postal codes containing characters outside (0,1,2,3,4,5,6,7,8,9)
   encoded in a subset of US-ASCII.  We will refer to this character
   set as (NUM-ASCII).  The original character repertoire for this
   protocol element has NUM-ASCII as its single member character set.

3.1 Replace.

   Replacing an existing protocol element with an entirely new
   protocol element with a different character repertoire is by far
   the cleanest solution from a design perspective.  A new protocol
   element may have its own matching and sorting rules, without
   regard to any previous deployment.  This means that the new
   element will have as little baggage as is possible when updating
   parsers and setting forth how it fits into the protocol's
   semantics.

   Unfortunately, this method presents a raft of deployment problems.
   Since existing protocol implementations will know nothing about
   it, they cannot be interoperable with any entirely new protocol
   element.  At best, they can ignore it gracefully; at worst, they
   will fail.  A protocol designer can react to this by changing the
   revision number on a protocol, by using some form of feature
   negotiation, or by using heuristics (including failure!) to
   determine whether or not a new protocol element may be used.  All
   of these are difficult to get right, especially in hop-by-hop
   protocols, in which it may not be possible to determine whether
   all hops support specific features or versions.

   A protocol designer tackling this problem for the protocol element
   naming the postal code in which a network element is placed might
   replace "POSTAL" with "NEW_POSTAL" and create a new character
   repertoire for "NEW_POSTAL" which contained the single entry
   (ISO-8859-1). [This is merely an example; the choice of which
   character set or sets to use would be made in this instance by
   reference to the relevant international postal standards.]
   Obviously, any system which did not understand "NEW_POSTAL" would
   need to be upgraded to handle the new character set.  Depending on
   the transition mechanism, systems communicating postal codes which
   were numeric-only might well include both "POSTAL" and
   "NEW_POSTAL" protocol elements.


3.2 Subsume.

   Rather than completely replacing an existing protocol element,
   another strategy is to create a protocol element which subsumes
   the character repertoire of the existing protocol element.  When
   this option is chosen, the new protocol element retains all the
   character sets and the related matching and sorting rules which
   were originally present.  These become a strict subset of the
   new character repertoire.

   This strategy limits the functionality of the new protocol element
   both by forcing it to include specific character sets and by
   requiring that the semantics of the new protocol element exactly
   match the existing protocol element.  This strategy also retains
   many of the deployment problems of the replacement strategy,
   though it offers some opportunities to mitigate the issues.  Like
   the replacement strategy, there may need to be negotiation
   mechanisms capable of handling both protocol elements, though new
   implementations can sometimes treat the old protocol element as a
   degenerate case of new protocol element.

   If our "POSTAL" protocol design team took this strategy, they
   might replace the (NUM-ASCII) character repertoire of "POSTAL"
   with a new protocol element "BIG_POSTAL" for which the character
   repertoire is (NUM-ASCII, US-ASCII).  Because NUM-ASCII is a
   strict subset of US-ASCII, the protocol can treat all "POSTAL"
   protocol elements as if they were "BIG_POSTAL" protocol elements.
   Note that this is the simplest possible example of this particular
   strategy, as there is no need to mark which character set from the
   character repertoire is in use.  More complex examples may
   require much more complex processing to achieve the same
   results.


3.3 Map.

   In some instances it may be possible and desirable to map an
   expanded character repertoire onto the existing code points
   specified by a protocol.  In this case, the code points are
   themselves retained but the character encoding portion of the
   tuple is changed to create an expanded character repertoire.  This
   strategy can only work when some marker is used to indicate which
   character encoding applies to a specific instance of the protocol.
   This marker must be something which is non-operative in the
   original protocol processing, or the strategy will incur the
   negotiation costs mentioned above.  This strategy will tend to
   increase the size of protocol elements unless the original code
   points were radically under-used.  It also carries the
   near-certainty that there will be occasions in which protocol
   elements encoded with the new character encoding are
   mis-identified as being encoded with the original character
   encoding.

   This strategy has somewhat unique deployment consequences, in that
   it is both easier to get initial deployment and harder to get
   complete penetration.  Because the same code points are used
   throughout, there is no requirement that all systems upgrade for
   the increased character repertoire to be available to a subset of
   users.  There is also, however, almost no incentive for upgrade of
   systems which do not themselves require the increased repertoire.
   This is particularly true in hop-by-hop and commonly proxied
   protocols, because the on-path intermediate systems will pass the
   elements of the expanded repertoire by virtue of their being
   legitimate code points in the original repertoire; they do not
   need to upgrade and they probably never will.

   For our protocol design team to tackle "POSTAL" using this
   strategy they must develop or discover an encoding which allows
   them to represent all the needed characters using just
   (NUM-ASCII).  If, for example, the character repertoire needed to
   add a character set which included (A-Z), but no others, the team
   could use US-ASCII's three digit decimal encoding for each
   included character.  A postal code like "KLHSW1" would then be
   encoded as "075076071083087049".  Provided that the original
   POSTAL protocol element had a field length sufficient to handle
   the new encoding, it could carry the new values without any
   difficulty.  The difficulty would be determining whether the
   new encoding or the old should be assumed; in this limited case,
   length alone could be made a marker by padding any short alphabetic
   postal codes with the ASCII null character,"OOO", until they reached a
   length sufficient to trigger treatment as non-ZIP code postal codes.
   In other cases more complex triggers would be required.


4. Layering a presentation element on a new protocol element.

   It is noted above that designers using non-human readable tokens
   may provide a mapping to a presentation element which can be used
   by humans working with the protocol.  In employing any of the
   strategies above, it is useful for protocol designers to consider
   introducing a presentation element at the same time.  This is
   almost a required part of the mapping strategy, as using an
   encoding based on the original set of code points does not help the
   user community unless it can also be mapped to an encoding in
   common use for presentation.  It may be used with any of them,
   though, and given the potential for the introduction of new
   character encodings, it must be considered carefully as a method
   of ensuring that the same problem does not face the protocol
   in a few years time.



5. Selecting a strategy.

   The first step in selecting a strategy is identifying the protocol
   processing choices which depend on the protocol element.  If a
   protocol element is passed among different protocols, this set of
   choices must be identified for each of the protocols which depend
   on the element.  After those have been identified, the available
   methods for passing the protocol elements from one protocol to
   another must be considered.

   If at all possible, a single strategy should be selected for
   use with a specific protocol element, even when that protocol
   element will be passed among different protocols.  Since protocol
   processing is context-specific, it is technically possible
   to use different methods in different contexts, but this
   increase in complexity rarely has a corresponding gain.

   Whether the protocol element will be used in one protocol or
   several,the core question to consider is how best to maintain
   interoperability while increasing the character repertoire.  For
   example, if creating a new protocol element as a fully fledged
   replacement, are there available mechanisms to handle the
   negotiation and/or versioning?  Alternatively, are there methods
   which would allow both protocol elements to coexist?

   The second question to consider is the cost of implementation.  If,
   for example, a choice is made to introduce a protocol element which
   subsumes the original character repertoire in a larger character
   repertoire, how expensive will the increase in parsing complexity
   be?

   The third question to consider is likely deployment patterns.  For
   a client/server protocol, will it be feasible to update both
   client and server?  For a hop-by-hop protocol, will there be
   any pressure for interemdiate servers to upgrade?

   A related question is whether this change will be tied to other
   changes which will drive adoption, or whether this change
   will be unrelated to other updates to the protocol.




6. Security Considerations.

   Any protocol processing which depends on a specific set of tokens
   or structure is at risk when the matching and sorting rules for the
   set is indeterminate.  In some cases, this can result in a denial
   of service, as legitimate tokens are not recognized; in other
   cases, inappropriate access may be granted by matching incorrectly.


7. IANA Considerations.

   There are no IANA considerations defined in this memo.


8. Acknowledgements

    The authors would like to thank Martin Duerst for his attention
    and expertise.


Normative References

None.

Non-normative References

None.

Editors' Addresses

    Ted Hardie
    Qualcomm, Inc.
    675 Campbell Technology Parkway
    Suite 200
    Campbell, CA
    U.S.A.

    EMail: hardie@qualcomm.com

    Leslie Daigle
    VeriSign Applied Research

    EMail: leslie@thinkingcat.com

Full Copyright Statement


    Copyright (C) The Internet Society (2002).  All Rights Reserved.

    This document and translations of it may be copied and furnished to
    others, and derivative works that comment on or otherwise explain it
    or assist in its implementation may be prepared, copied, published
    and distributed, in whole or in part, without restriction of any
    kind, provided that the above copyright notice and this paragraph are
    included on all such copies and derivative works.  However, this
    document itself may not be modified in any way, such as by removing
    the copyright notice or references to the Internet Society or other
    Internet organizations, except as needed for the purpose of
    developing Internet standards in which case the procedures for
    copyrights defined in the Internet Standards process must be
    followed, or as required to translate it into languages other than
    English.

    The limited permissions granted above are perpetual and will not be
    revoked by the Internet Society or its successors or assigns.

    This document and the information contained herein is provided on an
    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

    Funding for the RFC Editor function is currently provided by the
    Internet Society.