draft-ietf-idn-altdude-00

INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-altdude-00.txt                                2001-Mar-19
Expires 2001-Sep-19

                         AltDUDE version 0.0.2

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited.  Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org.  A non-paginated (and possibly
    newer) version of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/altdude

Abstract

    DUDE [DUDE01] by Mark Welter and Brian Spolarich is an
    ASCII-Compatible Encoding (ACE) of Unicode strings, and AltDUDE is a
    slight variation on it that is conceptually simpler.

    AltDUDE is a reversible map from a sequence of unsigned integers
    (intended to be Unicode code points) to a sequence of letters (A-Z,
    a-z), digits (0-9), and hyphen-minus (-), henceforth called LDH
    characters.  Such a map might be useful for internationalized domain
    names [IDN], because host name labels are currently restricted to
    LDH characters by [RFC952] and [RFC1123].

    Besides domain names, there might also be other contexts where it is
    useful to transform Unicode [UNICODE] code points (or any unsigned
    integers that exhibit locality) into "safe" (delimiter-free)
    ASCII characters.  (If other contexts consider hyphen-minus to be
    unsafe, it can trivially be eliminated, or replaced by a different
    character, like underscore.)

Contents

    Differences from DUDE
    Features
    Name
    Overview
    Base-32 characters
    Encoding procedure
    Decoding procedure
    Signature
    Case sensitivity models
    Comparison with other ACEs
    Example strings
    Security considerations
    Credits
    References
    Author
    Example implementation

Differences from DUDE

    AltDUDE differs from DUDE in four respects:

     1) DUDE computes the XOR of each integer and the previous in order
        to decide how many bits of each integer to encode, whereas
        AltDUDE encodes the XOR itself, so there is no need for a mask.

     2) DUDE makes the first quintet of each sequence different from the
        rest, while AltDUDE makes the last quintet different, so it's
        easier for the decoder to detect the end of the sequence.

     3) AltDUDE uses a base-32 map that avoids 0, 1, o, and l, to help
        humans avoid transcription errors.

     4) AltDUDE uses 96 rather than 0 as the initial value of the
        previous code point.  For domain names, this makes a few
        encodings one character shorter and makes none longer.

Features

    Uniqueness:  Every sequence of integers maps to at most one LDH
    string.

    Completeness:  Every sequence of integers maps to an LDH string.
    Restrictions on which integers are allowed, and on sequence length,
    may be imposed by higher layers.

    Efficient encoding:  The ratio of encoded size to original size is
    small.  This is important in the context of domain names because
    [RFC1034] restricts the length of a domain label to 63 characters.

    Simplicity:  The encoding and decoding algorithms are reasonably
    simple to implement.  The goals of efficiency and simplicity are at
    odds; AltDUDE places greater emphasis on simplicity.

    Case-preservation:  If a Unicode string has been case-folded prior
    to encoding, it is possible to record the case information in the
    case of the letters in the encoding, allowing a mixed-case Unicode
    string to be recovered if desired, but a case-insensitive comparison
    of two encoded strings is equivalent to a case-insensitive
    comparison of the Unicode strings.  This feature is optional; see
    section "Case sensitivity models".

Name

    AltDUDE is a working name that should be changed if it is adopted.
    Rather than waste good names on experimental proposals, let's
    wait until one proposal is chosen, then assign it a good name.
    Suggestions (assuming the primary use is in domain names):

        DUDE (if the DUDE authors wish to adopt this algorithm)
        UniHost
        UTF-D ("D" for "domain names")
        UTF-33 (there are 33 characters in the output repertoire)

Overview

    AltDUDE encodes unsigned integers as characters, although
    implementations will of course need to represent the output
    characters somehow, usually as bytes or other code units.  When
    AltDUDE is used to encode Unicode characters, the integers are the
    corresponding Unicode code points, not UTF-16 surrogates.

    Each integer is represented by an integral number of characters in
    the encoded string.  There is no intermediate bit string or octet
    string.

    Integers with value 45 are represented by hyphen-minus characters
    (45 is the Unicode code point for hyphen-minus).  Each
    non-hyphen-minus character in the encoded string represents five
    bits (a "quintet").  A sequence of quintets represents the bitwise
    XOR between each non-45 integer and the previous one.

    The exception for 45 and hyphen-minus is useful for domain names,
    but could be dropped in other contexts, or replaced by a different
    exception.

Base-32 characters

        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase
    forms of the base-32 characters.  The case may or may not convey
    information, as described in section "Case sensitivity models".

Encoding procedure

    All ordering of nybbles and quintets is big-endian (most significant
    first).  A nybble is 4 bits.  XOR is bitwise exclusive or.

    let prev = 96
    for each input integer n (in order) do begin
      if n == 45 then output hyphen minus
      else begin
        let diff = prev XOR n
        extract the least significant nybbles of diff, as few as are
          sufficient to hold all the nonzero bits (but at least one)
        prepend 0 to the last nybble and 1 to the rest
        output base-32 characters corresponding to the quintets
        let prev = n
      end
    end

    The encoder must either correctly handle all integer values that can
    be represented in the type of its input, or it must check whether
    the input contains values that it cannot handle and return an error
    if so.  Under no circumstances may it produce incorrect output.

Decoding procedure

    let prev = 96
    while the input string is not exhausted do begin
      if the next character is hyphen-minus then output 45
      else begin
        input characters and convert them to quintets until
          encountering a quintet beginning with 0
        fail upon encountering a non-base-32 character or end-of-input
        strip the first bit of each quintet
        concatenate the resulting nybbles to form diff
        let prev = prev XOR diff
        output prev
      end
    end
    encode the output sequence and compare it to the input string
    fail if they are not equal

    The comparison at the end must be case-insensitive if ACEs are
    always compared case-insensitively (which is true of domain names),
    case-sensitive otherwise.  See also section "Case sensitivity
    models".  This check is necessary to guarantee the uniqueness
    property (there cannot be two distinct encoded strings representing
    the same sequence of integers).  This check also frees the decoder
    from having to check for overflow while decoding the base-32
    characters.

Signature

    The issue of how to distinguish ACE strings from unencoded strings
    is largely orthogonal to the encoding scheme itself, and is
    therefore not specified here.  In the context of domain name labels,
    a standard prefix and/or suffix (chosen to be unlikely to occur
    naturally) would presumably be attached to ACE labels.  (In that
    case, it would probably be good to forbid the encoding of Unicode
    strings that appear to match the signature, to avoid confusing
    humans about whether they are looking at a Unicode string or an ACE
    string.)

    In order to use AltDUDE in domain names, the choice of signature
    must be mindful of the requirement in [RFC952] that labels never
    begin or end with hyphen-minus.  The raw encoded string will
    begin or end with a hyphen-minus iff the Unicode string does.  If
    the Unicode strings are forbidden from beginning or ending with
    hyphen-minus (which seems prudent anyway), then there is no problem.
    Otherwise, the signature must consist of both a prefix and a suffix.

    It appears that "---" is extremely rare in domain names; among the
    four-character prefixes of all the second-level domains under .com,
    .net, and .org, "---" never appears at all.  Therefore, perhaps the
    signature should be of the form ?--- (prefix) or ---? (suffix),
    where ? could be "u" for Unicode, or "i" for internationalized, or
    "a" for ACE, or maybe "q" or "z" because they are rare.

Case sensitivity models

    The higher layer must choose one of the following four models.

    Models suitable for domain names:

      * Case-insensitive:  Before a string is encoded, all its non-LDH
        characters must be case-folded so that any strings differing
        only in case become the same string (for example, strings could
        be forced to lowercase).  Folding LDH characters is optional.
        The case of base-32 characters and literal-mode characters is
        arbitrary and not significant.  Comparisons between encoded
        strings must be case-insensitive.  The original case of non-LDH
        characters cannot be recovered from the encoded string.

      * Case-preserving:  The case of the Unicode characters is not
        considered significant, but it can be preserved and recovered,
        just like in non-internationalized host names.  Before a string
        is encoded, all its non-LDH characters must be case-folded
        as in the previous model.  LDH characters are naturally able
        to retain their case attributes because they are encoded
        literally.  The case attribute of a non-LDH character is
        recorded in the last of the base-32 characters that represent
        it, which is guaranteed to be a letter rather than a digit.
        If the base-32 character is uppercase, it means the Unicode
        character is caseless or should be forced to uppercase after
        being decoded (which is a no-op if the case folding already
        forces to uppercase).  If the base-32 character is lowercase,
        it means the Unicode character is caseless or should be forced
        to lowercase after being decoded (which is a no-op if the case
        folding already forces to lowercase).  The case of the other
        base-32 characters in a multi-quintet encoding is arbitrary
        and not significant.  Only uppercase and lowercase attributes
        can be recorded, not titlecase.  Comparisons between encoded
        strings must be case-insensitive, and are equivalent to
        case-insensitive comparisons between the Unicode strings.  The
        intended mixed-case Unicode string can be recovered as long as
        the encoded characters are unaltered, but altering the case of
        the encoded characters is not harmful--it merely alters the case
        of the Unicode characters, and such a change is not considered
        significant.

        In this model, the input to the encoder and the output of the
        decoder can be the unfolded Unicode string (in which case the
        encoder and decoder are responsible for performing the case
        folding and recovery), or can be the folded Unicode string
        accompanied by separate case information (in which case the
        higher layer is responsible for performing the case folding and
        recovery).  Whichever layer performs the case recovery must
        first verify that the Unicode string is properly folded, to
        guarantee the uniqueness of the encoding.

        It is not very difficult to extend the nameprep algorithm
        [NAMEPREP03] to remember case information.

    The case-insensitive and case-preserving models are interoperable.
    If a domain name passes from a case-preserving entity to a
    case-insensitive entity, the case information will be lost, but
    the domain name will still be equivalent.  This phenomenon already
    occurs with non-internationalized domain names.

    Models unsuitable for domain names, but possibly useful in other
    contexts:

      * Case-sensitive:  Unicode strings may contain both uppercase and
        lowercase characters, which are not folded.  Base-32 characters
        must be lowercase.  Comparisons between encoded strings must be
        case-sensitive.

      * Case-flexible:  Like case-preserving, except that the choice
        of whether the case of the Unicode characters is considered
        significant is deferred.  Therefore, base-32 characters must
        be lowercase, except for those used to indicate uppercase
        Unicode characters.  Comparisons between encoded strings may be
        case-sensitive or case-insensitive, and such comparisons are
        equivalent to the corresponding comparisons between the Unicode
        strings.

Comparison with other ACEs

    The differences between AltDUDE and DUDE were given in section
    "Differences from DUDE".  For a comparison between DUDE and other
    ACEs, please see the AMC-ACE-O specification [AMCACEO00].

Example strings

    The first several examples are all translations of the sentence "Why
    can't they just speak in <language>?" (courtesy of Michael Kaplan's
    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
    been removed, as is often done in domain names.

    (A) Arabic (Egyptian):
        U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
        U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

        AltDUDE: yueqpcycrcyjhbpznpitjycxf

    (B) Chinese (simplified):
        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

        AltDUDE: w85gvk7g9k2iwf6x9j6x7ju54k

    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky

        <ccaron> = U+010D
        <ecaron> = U+011B
        <iacute> = U+00ED

        AltDUDE: tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc

    (D) Hebrew:
        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
        U+05D1 U+05E8 U+05D9 U+05EA

        AltDUDE: x5nckajvjpvnpenqpcvjvbevrvdvjvbvd

    (E) Hindi:
        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
        U+0939 U+0948 U+0902  (Devanagari)

        AltDUDE: 3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzq\
                 zbwk

    (F) Japanese:
        U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
        U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
        (kanji and hiragana)

        AltDUDE: vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf

    (G) Korean:
        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C  (Hangul syllables)

        AltDUDE: 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8si\
                 tusauiyz5i23az96iz6ze3xaz2td96ry3si

    (H) Russian:
        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
        U+0438  (Cyrillic)

        AltDUDE: wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc

    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol

        <eacute> = U+00E9
        <ntilde> = U+00F1

        AltDUDE: tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb\
                 3a3qd

    (J) Taiwanese:
        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

        AltDUDE: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k

    (K) Vietnamese:
        Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
        <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t

        <dotbelow>  = U+0323
        <ocirc>     = U+00F4
        <ecirc>     = U+00EA
        <hookabove> = U+0309
        <acute>     = U+0301

        AltDUDE: tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
                 yitptp2dv8mvyrjtBtr2dv6jvxh

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3<nen>B<gumi><kinpachi><sensei>  (Japanese TV program title)

        <nen>              = U+5E74                       (kanji)
        <gumi>             = U+7D44                       (kanji)
        <kinpachi><sensei> = U+91D1 U+516B U+5148 U+751F  (kanji)

        AltDUDE: xdx8whx8tGz7ug863f6s5kuduwxh

    (M) <amuro><namie>-with-SUPER-MONKEYS  (Japanese music group name)

        <amuro><namie> = U+5B89 U+5BA4 U+5948 U+7F8E U+6075  (kanji)

        AltDUDE: x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK

    (N) Hello-Another-Way-<sorezore><no><basho>  (Japanese song title)

        <sorezore><no> = U+305D U+308C U+305E U+308C U+306E  (hiragana)
        <basho>        = U+5834 U+6240                       (kanji)

        AltDUDE: Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze

    (O) <hitotsu><yane><no><shita>2  (Japanese TV program title)

        <hitotsu> = U+3072 U+3068 U+3064  (hiragana)
        <yane>    = U+5C4B U+6839         (kanji)
        <no>      = U+306E                (hiragana)
        <shita>   = U+4E0B                (kanji)

        AltDUDE: vstctkny6urvwzcx2xhz8yfw8vj

    (P) Maji<de>Koi<suru>5<byou><mae> (Japanese song title)

        <de>        = U+3067         (hiragana)
        <suru>      = U+3059 U+308B  (hiragana)
        <byou><mae> = U+79D2 U+524D  (kanji)

        AltDUDE: PnmdvssqvssNegvsva7cvs5qz38hu53r

    (Q) <pafii>de<runba>  (Japanese song title)

        <pafii> = U+30D1 U+30D5 U+30A3 U+30FC  (katakana)
        <runba> = U+30EB U+30F3 U+30D0         (katakana)

        AltDUDE: vs5bezgxrvs3ibvs2qtiud

    (R) <sono><supiido><de>  (Japanese song title)

        <sono>    = U+305D U+306E                (hiragana)
        <supiido> = U+30B9 U+30D4 U+30FC U+30C9  (katakana)
        <de>      = U+3067                       (hiragana)

        AltDUDE: vsvpvd7hypuivf4q

    The last example is an ASCII string that breaks not only the
    existing rules for host name labels but also the rules proposed in
    [NAMEPREP03] for internationalized domain names.

    (S) -> $1.00 <-

        AltDUDE: -xqtqetftrtqatatn-

Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority.  If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another.  Therefore AltDUDE is
    designed so that each Unicode string has a unique encoding.

    However, there can still be multiple Unicode representations of the
    "same" text, for various definitions of "same".  This problem is
    addressed to some extent by the Unicode standard under the topic
    of canonicalization, but some text strings may be misleading or
    ambiguous to humans when used as domain names, such as strings
    containing dots, slashes, at-signs, etc.  These issues are being
    further studied under the topic of "nameprep" [NAMEPREP03].

Credits

    AltDUDE reuses a number of preexisting techniques.

    The basic encoding of integers to nybbles to quintets to base-32
    comes from UTF-5 [UTF5], and the particular variant used here comes
    from AMC-ACE-M [AMCACEM00].

    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
    from SFS [SFS].

    From DUDE (of which the latest version is [DUDE01]) comes the idea
    of encoding differences between successive integers.  The idea
    of using the alphabetic case of base-32 characters to record the
    desired case of the Unicode characters was suggested by this author,
    but in DUDE it was first applied it to the UTF-5-style encoding.

References

    [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
    draft-ietf-idn-amc-ace-m-00.

    [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19,
    draft-ietf-idn-amc-ace-o-00.

    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
    http://www.trigeminal.com/samples/provincial.html.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
    "Requirements for Internet Hosts -- Application and Support",
    1989-Oct, RFC 1123.

    [SFS] David Mazieres et al, "Self-certifying File System",
    http://www.fs.net/.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

Author

    Adam M. Costello <amc@cs.berkeley.edu>
    http://www.cs.berkeley.edu/~amc/

    See also the authors of DUDE [DUDE01].


Example implementation


/******************************************/
/* altdude.c 0.0.2 (2001-Mar-19-Sun)      */
/* Adam M. Costello <amc@cs.berkeley.edu> */
/******************************************/

/* This is ANSI C code (C89) implementing AltDUDE    */
/* (draft-ietf-idn-altdude-00), a simplified variant */
/* of DUDE (draft-ietf-idn-dude-01).                 */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum altdude_status {
  altdude_success,
  altdude_invalid_input,
  altdude_output_too_big
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x1FFFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum altdude_status altdude_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output );

    /* altdude_encode() converts Unicode to AltDUDE (without any      */
    /* signature).  The input must be represented as an array         */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII.  The input_length is the number of code */
    /* points in the input.  The output_size is an in/out argument:   */
    /* the caller must pass in the maximum number of characters       */
    /* that may be output (including the terminating null), and on    */
    /* successful return it will contain the number of characters     */
    /* actually output (including the terminating null, so it will be */
    /* one more than strlen() would return, which is why it is called */
    /* output_size rather than output_length).  The uppercase_flags   */
    /* array must hold input_length boolean values, where nonzero     */
    /* means the corresponding Unicode character should be forced     */
    /* to uppercase after being decoded, and zero means it is         */
    /* caseless or should be forced to lowercase.  Alternatively,     */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros.  The encoder always outputs lower case base-32   */
    /* characters except when nonzero values of uppercase_flags       */
    /* require otherwise.  The return value may be any of the         */
    /* altdude_status values defined above; if not altdude_success,   */
    /* then output_size and output may contain garbage.  On success,  */
    /* the encoder will never need to write an output_size greater    */
    /* than input_length*k+1 if all the input code points are less    */
    /* than 1 << (4*k), because of how the encoding is defined.       */

enum altdude_status altdude_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags );

    /* altdude_decode() converts AltDUDE (without any signature) to   */
    /* Unicode.  The input must be represented as null-terminated     */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points.  The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()).  The output_length is      */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output.  The uppercase_flags array must have room for at least */
    /* output_length values, or it may be a null pointer if the case  */
    /* information is not needed.  A nonzero flag indicates that the  */
    /* corresponding Unicode character should be forced to uppercase  */
    /* by the caller, while zero means it is caseless or should be    */
    /* forced to lowercase.  The return value may be any of the       */
    /* altdude_status values defined above; if not altdude_success,   */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage.  On success, the decoder will never need to write     */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */


/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* Character utilities: */

/* is_AtoZ(c) returns 1 if c is an         */
/* uppercase ASCII letter, zero otherwise. */

static unsigned char is_AtoZ(char c)
{
  return c >= 65 && c <= 90;
}

/* base32[n] is the lowercase base-32 character representing  */
/* the number n from the range 0 to 31.  Note that we cannot  */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.                   */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.                                                */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
/* then ASCII A-Z are considered equal to a-z respectively.           */

static int unequal(
  enum case_sensitivity case_sensitivity, const char *s1, const char *s2 )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* altdude_initial_code_point is the initial value of the */
/* "previous" code point, before the first code point. */

static const u_code_point altdude_initial_code_point = 96;

/* Encoder: */

enum altdude_status altdude_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output )
{
  unsigned int next_in, next_out, max_out, n, out;
  u_code_point prev, codept, diff, tmp;
  char shift;

  prev = altdude_initial_code_point;
  max_out = *output_size;
  next_out = 0;

  for (next_in = 0;  next_in < input_length;  ++next_in) {
    codept = input[next_in];

    if (codept == 45) {
      /* hyphen-minus stands for itself */
      if (max_out - next_out < 1) return altdude_output_too_big;
      output[next_out++] = 45;
      continue;
    }

    shift = uppercase_flags && uppercase_flags[next_in] ? 32 : 0;
    /* shift will determine the case of the last base-32 digit */
    diff = prev ^ codept;
    for (tmp = diff >> 4, n = 1;  tmp != 0;  ++n, tmp >>= 4);
    /* n is the number of base-32 digits */
    if (max_out - next_out < n) return altdude_output_too_big;

    /* Computing the base-32 digits in reverse order is easiest. */
    /* Only the last base-32 digit has the high bit clear.       */

    out = next_out + n - 1;
    output[out] = base32[diff & 0xF] - shift;

    while (out > next_out) {
      diff >>= 4;
      output[--out] = base32[0x10 | (diff & 0xF)];
    }

    next_out += n;
    prev = codept;
  }

  /* null terminator: */
  if (max_out - next_out < 1) return altdude_output_too_big;
  output[next_out++] = 0;
  *output_size = next_out;
  return altdude_success;
}

/* Decoder: */

enum altdude_status altdude_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags )
{
  u_code_point prev, q, diff;
  const char *in;
  char c;
  unsigned int next_out, max_out, input_size, scratch_size;
  enum altdude_status status;

  prev = altdude_initial_code_point;
  max_out = *output_length;
  next_out = 0;
  in = input;

  for (c = *in;  c != 0; ) {
    if (max_out - next_out < 1) return altdude_output_too_big;

    if (c == 45) {
      /* hyphen-minus stands for itself */
      output[next_out] = 45;
      if (uppercase_flags) uppercase_flags[next_out] = 0;
      ++next_out;
      c = *++in;
      continue;
    }

    /* Base-32 sequence: */

    diff = 0;

    do {
      q = base32_decode(c);
      if (q == base32_invalid) return altdude_invalid_input;
      diff = (diff << 4) | (q & 0xF);
      c = *++in;
    } while (q >> 4 == 1);

    /* case of last digit determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]);
    prev = output[next_out++] = prev ^ diff;
  }

  /* Re-encode the output and compare to the input: */

  input_size = in - input + 1;
  scratch_size = input_size;
  status = altdude_encode(next_out, output, uppercase_flags,
                          &scratch_size, scratch_space);
  if (status != altdude_success ||
      scratch_size != input_size ||
      unequal(case_sensitivity, scratch_space, input)
     ) return altdude_invalid_input;

  *output_length = next_out;
  return altdude_success;
}


/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive  /* suitable for host names */
};


static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads big-endian UTF-32 and writes AltDUDE ASCII.\n"
    "%s -d reads AltDUDE ASCII and writes big-endian UTF-32.\n"
    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}


static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

int main(int argc, char **argv)
{
  enum altdude_status status;
  int r;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != '\0') usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size];
    unsigned int input_length, output_size;
    int c0, c1, c2, c3;

    /* Read the UTF-32 input string: */

    input_length = 0;

    for (;;) {
      c0 = getchar();
      c1 = getchar();
      c2 = getchar();
      c3 = getchar();
      if (ferror(stdin)) fail(io_error);

      if (c1 == EOF || c2 == EOF || c3 == EOF) {
        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
        break;
      }

      if (input_length == unicode_max_length) fail(too_big);

      if ((c0 != 0 && c0 != 0x80)
          || c1 < 0 || c1 > 0x10
          || c2 < 0 || c2 > 0xFF
          || c3 < 0 || c3 > 0xFF ) {
        fail(invalid_input);
      }

      input[input_length] = ((u_code_point) c1 << 16) |
                            ((u_code_point) c2 <<  8) | (u_code_point) c3;
      uppercase_flags[input_length] = (c0 >> 7);
      ++input_length;
    }

    /* Encode, and output the result: */

    output_size = ace_max_size;
    status = altdude_encode(input_length, input, uppercase_flags,
                            &output_size, output);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);
    r = fputs(output,stdout);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size];
    u_code_point output[unicode_max_length], codept;
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int output_length, i;

    /* Read the AltDUDE ASCII input string: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (!feof(stdin)) fail(too_big);

    /* Decode, and output the result: */

    output_length = unicode_max_length;
    status = altdude_decode(test_case_sensitivity, scratch, input,
                            &output_length, output, uppercase_flags);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);

    for (i = 0;  i < output_length;  ++i) {
      r = putchar(uppercase_flags[i] ? 0x80 : 0);
      if (r == EOF) fail(io_error);
      codept = output[i];
      r = putchar(codept >> 16);
      if (r == EOF) fail(io_error);
      r = putchar((codept >> 8) & 0xFF);
      if (r == EOF) fail(io_error);
      r = putchar(codept & 0xFF);
      if (r == EOF) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}



                   INTERNET-DRAFT expires 2001-Sep-19
Document	Document type	Expired Internet-Draft (idn WG) Expired & archived
	Select version	00
	Author	Adam M. Costello Email authors
	RFC stream
	Intended RFC status	(None)
	Other formats	txt pdf bibtex bibxml
	Additional resources	Mailing list discussion