Internet Draft M. Ishisone
draft-ietf-idn-mace-01.txt SRA
Jun 28, 2001 Y. Yoneya
Expires Dec 28, 2001 JPNIC
MACE: Modal ASCII Compatible Encoding for IDN
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet- Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Abstract
MACE is a reversible transformation method from a sequence of Unicode
[UNICODE] characters to a sequence of ASCII letters, digits and
hyphens (LDH characters). It is designed to be used as an encoding
for internationalized domain names [IDN].
Contents
1. Introduction
2. Terminology
3. Overview
4. Base32 Format
5. Notations
6. Encoding Description
7. Encoding Procedure
8. Decoding Description
9. Decoding Procedure
10. ACE Identifier
11. Examples
Expires December 28th, 2001 [Page 1]
Internet Draft MACE June 28th, 2001
12. Security Considerations
13. References
14. Acknowlegdements
15. Authors' Address
A. Changes from draft-ietf-idn-mace-00
B. Sample Implementation
1. Introduction
MACE is intended to be used as an ACE in the IDNA architecture
[IDNA], and encodes a sequence of Unicode (ISO/IEC 10646) characters
in the range U+0000-U+10FFFF as a sequence of LDH characters.
MACE is designed to have following features:
Completeness: Every Unicode string has a map to an LDH character
string.
Uniqueness: Every Unicode string maps to at most one LDH character
string.
Reversibility: The original Unicode string can be obtained from an
LDH character string to which the Unicode string maps.
Efficiency: The ratio of encoded size to original size is small.
If the code points of the Unicode string are clustered, a
compression algorithm enables a compact encoding. Even if they
are not, the encoded size is still kept small.
Simplicity: The encoding/decoding algorithms are fairly simple to
implement.
2. Terminology
LDH characters are the letters A-Z and a-z, the digits 0-9, and
hyphen-minus.
As in the Unicode Standard [UNICODE], Unicode characters are denoted
by "U+" followed by four to six hexadecimal digits representing its
UCS-4 code point. A range of Unicode characters is denoted by the
form "U+xxxx-U+yyyy".
3. Overview
MACE encodes a sequence of Unicode (ISO/IEC 10646) characters in the
range U+0000-U+10FFFF as a sequence of LDH characters.
MACE is a modal encoding. There are two major modes and one of which
Expires December 28th, 2001 [Page 2]
Internet Draft MACE June 28th, 2001
has four submodes. Each character is encoded in a specific
mode/submode. The mode/submode is chosen according to the code point
of the character and possibly its neiboring characters. The modal
encoding enables compact representation of each character, and the
modes are chosen so that mode change occurs rather infrequently as
long as the source string is written in a single language.
LDH characters are represented literally, for the compactness of the
encoded result. Other Unicode characters are represented as base32
format strings. Each of Unicode characters in Basic Multilingual
Plane (BMP, U+0000-U+FFFF) except LDH characters is encoded as a
3-octet base32 format sting, while each non-BMP (U+10000-U+10FFFF)
character is encoded as a 4-octet base32 format string.
To achieve fairly good compression for non-LDH charactes, there is
also a submode for differential encoding. Using this submode,
characters are encoded as bitwise-xor value between the code points
of the previous character and the current character. In this submode
a character is encoded as a 1 or 2 octet base32 format string.
So if the code points of the input string are clusterd in a small
region, an effective compression algorithm enables 1 or 2
octets/character encoding (plus some overhead for mode changes).
Even if the code points are widely scattered and difficult to
compress (such as CJK Han characters), 3 octets/character (for BMP)
or 4 octets/character (for Non-BMP) encoding (plus some overhead for
mode changes) can be achieved.
4. Base32 Format
MACE uses base32 format string to encode non-negative intergers. The
base32 format used for MACE is:
"0" = 0 = 0x00 = 00000 "g" = 16 = 0x10 = 10000
"1" = 1 = 0x01 = 00001 "h" = 17 = 0x11 = 10001
"2" = 2 = 0x02 = 00010 "i" = 18 = 0x12 = 10010
"3" = 3 = 0x03 = 00011 "j" = 19 = 0x13 = 10011
"4" = 4 = 0x04 = 00100 "k" = 20 = 0x14 = 10100
"5" = 5 = 0x05 = 00101 "l" = 21 = 0x15 = 10101
"6" = 6 = 0x06 = 00110 "m" = 22 = 0x16 = 10110
"7" = 7 = 0x07 = 00111 "n" = 23 = 0x17 = 10111
"8" = 8 = 0x08 = 01000 "o" = 24 = 0x18 = 11000
"9" = 9 = 0x09 = 01001 "p" = 25 = 0x19 = 11001
"a" = 10 = 0x0A = 01010 "q" = 26 = 0x1A = 11010
"b" = 11 = 0x0B = 01011 "r" = 27 = 0x1B = 11011
"c" = 12 = 0x0C = 01100 "s" = 28 = 0x1C = 11100
"d" = 13 = 0x0D = 01101 "t" = 29 = 0x1D = 11101
"e" = 14 = 0x0E = 01110 "u" = 30 = 0x1E = 11110
"f" = 15 = 0x0F = 01111 "v" = 31 = 0x1F = 11111
Expires December 28th, 2001 [Page 3]
Internet Draft MACE June 28th, 2001
The encoding is big-endian (most-significant bits first). The
following shows some examples.
decimal hexadecimal binary base32 string
-------------------------------------------------------
40 0x28 00001 01000 "18"
9876 0x2694 01001 10100 10100 "9kk"
5. Notations
In the following description, following five functions are used.
base32_encode(N, LEN)
denotes a base32 format string of LEN octets representing number
N. If LEN is larger than what needs to represent N, "0" is
prepended.
base32_decode(S)
denotes a number which corresponds to a base32 format string S.
codepoint(C)
denotes a UCS-4 code point value for character C.
character(N)
denotes a Unicode character whose UCS-4 code point is N.
xor(N, M)
denotes a bit-wise XOR value of integer N and M.
6. Encoding Description
MACE can encode Unicode/ISO10646 characters in the range
U+0000-U+10FFFF. If the input string contains other characters, or
it represents a non-internationalized host name parts (conforms to
[STD13]), it MUST NOT be converted.
MACE has several encoding modes/submodes. There are two major modes,
`Literal' and `Non-Literal'. Non-Literal mode has four submodes,
while Literal mode has none. Each character is encoded in a specific
mode/submode. The encoding process of a character is:
1. Determine the mode/submode to encode the character.
2. If and only if it is necessary to change the current mode,
output ASCII hyphen-minus to change the mode.
3. If and only if it is necessary to change the current submode,
output the submode introducer octet (described below) to change
the submode.
4. Encode the character in the mode/submode.
Expires December 28th, 2001 [Page 4]
Internet Draft MACE June 28th, 2001
ASCII letter and digit characters are encoded in Literal mode, while
non-LDH characters are encoded in Non-Literal mode. ASCII hyphen
character (U+002D) can be encoded in either modes, and is always
encoded as a sequence of two hyphen-minus ("--"). Switching between
Literal mode and Non-Literal mode is indicated by an ASCII hyphen not
followed by another hyphen. The initial mode is Non-Literal.
In Literal mode, characters are encoded as they are. For example
ASCII character "a" is encoded as "a". In Non-Literal mode,
characters are encoded as a base32 format string.
Non-Literal mode further comprises four submodes, `BMP-A', `BMP-B',
`Non-BMP' and `Compress'. Every non-LDH character is encoded one of
these submodes. Shifting to each submode is indicated by a certain
octet (called introducer octet) shown below. These introducer octets
can be distinguished from the base32 string since they never appear
in the base32 string used by MACE.
submode introducer octet
---------------------------
BMP-A "w"
BMP-B "x"
Non-BMP "y"
Compress "z"
Switching between Literal mode and Non-Literal mode doesn't affect
current submode, that is, on returning from the Literal mode,
previous submode is restored. This lowers the necessity of submode
changes. The initial submode is BMP-A.
BMP-A and BMP-B submodes are used for encoding characters in Unicode
Basic Multilingual Plane (U+0000-U+FFFF), except LDH characters. In
these submodes, a character is encoded as base32 format string of 3
octets. BMP-A is used for characters in the range U+0000-U+1FFF and
U+A000-U+FFFF, covering most of Western/Middle-Eastern scripts and
Hangul. BMP-B is used for characters in the range U+2000-U+9FFF,
covering CJK unification area. Those characters are first mapped to
integers of the range 0x0000-0x7fff (15bit integer), then converted
to base32 format string using the following scheme:
submode character range encoding
-----------------------------------------------------------------
BMP-A U+0000-U+1FFF base32_encode(codepoint(C), 3)
U+A000-U+FFFF base32_encode(codepoint(C) - 0x8000, 3)
BMP-B U+2000-U+9FFF base32_encode(codepoint(C) - 0x2000, 3)
Expires December 28th, 2001 [Page 5]
Internet Draft MACE June 28th, 2001
Here are some examples:
character submode integer base32 string
---------------------------------------------
U+00B0 BMP-A 0xb0 "05g"
U+5678 BMP-B 0x3678 "djo"
U+BCDE BMP-A 0x3CDE "f6u"
Non-BMP submode is used for encoding Unicode characters outside Basic
Multilingual Plane (U+10000-U+10FFFF). In this mode a character is
encoded as base 32 format string of 4 octets. Characters
U+10000-U+10FFFF are first mapped to intergers of the range
0x00000-0xFFFFF (20bit integer), then converted to bae32 format
string using the following scheme:
submode character range encoding
-------------------------------------------------------------------
Non-BMP U+10000-U+10FFFF base32_encode(codepoint(C) - 0x10000, 4)
Compress submode is used for the efficient encoding of non-LDH
characters. This mode can be used for any non-LDH characters if
certain condition is met. In this mode, a character is encoded as a
bit-wise XOR value between the code point of the character (called C)
and the last non-LDH character before C (called PREV). The XOR value
(xor(codepoint(PREV), codepoint(C))) must be less than 0x200, or the
Compress submode cannot be used. If the XOR value is less than 16,
it is encoded as a base32 format string of 1 octet. Otherwise 0x200
is added to the XOR value, then it is encoded as a base32 format
string of 2 octets. When decoding, this encoding enables to determine
the encoded length by looking at the first octet.
submode character range encoding condition
------------------------------------------------------------------
Compress U+0000-U+10FFFF base32_encode(X, 1) if X < 16
base32_encode(X + 0x200, 2) if X >= 16
[where X is xor(codepoint(PREV), codepoint(C))]
There are two possible submodes for encoding a non-LDH character C,
one of which is Compress, and the other is one of the other three
(BMP-A, BMP-B, Non-BMP). The submode is determined using the
following algorithm. This algorithm is designed so that it chooses
the submode which produces shorter encoding result.
1. Let PREV be the last non-LDH character before C, and let NXT be
the first non-LDH character after C. In case C is the first
non-LDH character of the input string, let PREV be U+0000.
2. If xor(codepoint(PREV), codepoint(C)) > 0x1FF, go to 4.
3. If at least one of the following conditions holds, choose
`Compress'. Otherwise go to 4.
a) the current submode is `Compress'
b) C is non-BMP character (U+10000-U+10FFFF)
Expires December 28th, 2001 [Page 6]
Internet Draft MACE June 28th, 2001
c) xor(codepoint(PREV), codepoint(C)) is less than 16
d) NXT exists and xor(codepoint(C), codepoint(NXT)) <= 0x1ff
4. If C is in the range U+0000-U+1FFF or U+A000-U+FFFF, choose
`BMP-A'.
5. If C is in the range U+2000-U+9FFF, choose `BMP-B'.
6. Otherwise choose `Non-BMP'.
Initial state is set as follows.
mode : Non-Literal
submode : BMP-A
PREV : U+0000
7. Encoding Procedure
procedure encode(INPUT)
MODE = `Non-Literal'
SUBMODE = `BMP-A'
PREV = U+0000
while (is_not_empty(INPUT))
C = read_one_character(INPUT)
if (<C is not in the range U+0000-U+10FFFF>)
<encode error>
else if (<C is hyphen (U+002D)>)
output("--")
else if (<C is ASCII letter or digit>)
if (MODE != `Literal')
output("-")
MODE = `Literal'
endif
output(C)
else
if (MODE != `Non-Literal')
output("-")
MODE = `Non-Literal'
endif
if (compressible(SUBMODE, C, PREV, INPUT) == TRUE)
NEW_SUBMODE = `Compress'
V = xor(codepoint(PREV), codepoint(C))
if (V >= 16)
V = V + 0x200
LEN = 2
else
LEN = 1
endif
else
V = codepoint(C)
if (0x0000 <= V <= 0x1FFF)
NEW_SUBMODE = `BMP-A'
Expires December 28th, 2001 [Page 7]
Internet Draft MACE June 28th, 2001
LEN = 3
else if (0xA000 <= V <= 0xFFFF)
NEW_SUBMODE = `BMP-A'
V = V - 0x8000
LEN = 3
else if (0x2000 <= V <= 0x9FFF)
NEW_SUBMODE = `BMP-B'
V = V - 0x2000
LEN = 3
else
NEW_SUBMODE == `Non-BMP'
V = V - 0x10000
LEN = 4
endif
endif
if (NEW_SUBMODE != SUBMODE)
output(<submode introducer for NEW_SUBMODE>)
SUBMODE = NEW_SUBMODE
endif
output(base32_encode(V, LEN))
PREV = C
endif
end
end
function compressible(SUBMODE, C, PREV, INPUT)
if (xor(codepoint(C), codepoint(PREV)) > 0x1FF)
return (FALSE)
endif
# The differenct between C and PREV is confined to lower 9 bits.
if (SUBMODE == `Compress')
return (TRUE)
else if (codepoint(C) >= 0x10000)
return (TRUE)
else if (xor(codepoint(C), codepoint(PREV)) < 16)
return (TRUE)
else
<peek the next non-LDH character in INPUT>
if (<there is such a character (called NXT)> and
xor(codepoint(NXT), codepoint(C)) <= 0x1FF)
return (TRUE)
endif
endif
return (FALSE)
end
8. Decoding Description
Like encoding, MACE decoding process keeps track of the current
Expires December 28th, 2001 [Page 8]
Internet Draft MACE June 28th, 2001
mode/submode to decode each character. The initial state for
decoding is the same as that of encoding.
mode : Non-Literal
submode : BMP-A
PREV : U+0000
Because ASCII domain names are case-insensitive, decoding process
MUST treat uppercase leters and lowercase letters equally.
The consecutive two ASCII hyphen-minus characters are always decoded
as a single ASCII hyphen-minus, regardless of the current
mode/submode. ASCII hyphen-minus not followed by another
hyphen-minus indicates mode switching between Literal mode and
Non-Literal mode.
In Literal mode, all ASCII letter and digit characters are decoded as
they are.
In Non-Literal mode, every character is either a submode introducer
or a part of base32 format string. If a character is a submode
introducer, the current submode is changed to the corresponding
submode. If it isn't, it is a part of base32 format string.
To decode base32 format string in a certain submode, first determine
the length of the string which is decoded to a single Unicode
character. For submodes other than Compress, the number of octets
which encodes a character is fixed (3 for BMP-A and BMP-B, 4 for
Non-BMP). For Compress submode, the number of octets is variable (1
or 2), and can be determined by looking at the first octet. If the
first octet represents a number less than 16 in base32 (either 0-9,
a-f or A-F) the number of octets is one, otherwise two. The
following list shows the length of the string S and how to get the
decoded character in each submode.
submode length decoded character condition
--------------------------------------------------------------
BMP-A 3 character(N) if N < 0x2000
character(N + 0x8000) if N >= 0x2000
BMP-B 3 character(N + 0x2000)
Non-BMP 4 character(N + 0x10000)
Compress 1 character(xor(P, N))
2 character(xor(P, N - 0x200))
[where N is base32_decode(S), P is codepoint(PREV)]
MACE decoding process can accept invalidly-encoded strings as well.
In order to guarantee the unique mapping, following two types of
check must be performed.
1) The decoded string must be checked if it is a [STD13] conforming
name. If it is, decoding process MUST fail.
Expires December 28th, 2001 [Page 9]
Internet Draft MACE June 28th, 2001
2) The decoded string must be re-encoded and compared to the input
string. If they are not equal (allowing case-difference),
decoding process MUST fail.
9. Decoding Procedure
procedure decode(input)
MODE = `Non-Literal'
SUBMODE = `BMP-A'
PREV = U+0000
while (is_not_empty(INPUT))
C = read_one_character(INPUT)
if (<C is hyphen>)
NXT = read_one_character(INPUT)
if (<NXT is hyphen>)
output("-")
else
<push back NXT to INPUT>
if (MODE == `Literal')
MODE = `Non-Literal'
else
MODE = `Literal'
endif
endif
else if (MODE == `Literal')
output(C)
else if (<C matches one of the submode introducer octets>)
SUBMODE = <corresponding submode>
else
<push back C to INPUT>
if (SUBMODE == `BMP-A')
S = read_string_of_length(INPUT, 3)
V = base32_decode(S)
if (V >= 0x2000)
V = V + 0x8000
endif
else if (SUBMODE == `BMP-B')
S = read_string_of_length(INPUT, 3)
V = base32_decode(S) + 0x2000
else if (SUBMODE == `Non-BMP')
S = read_string_of_length(INPUT, 4)
V = base32_decode(S) + 0x10000
else if (SUBMODE == `Compress')
if (<C is either 0-9, a-f or A-F>)
S = read_string_of_length(INPUT, 1)
V = base32_decode(S)
else
S = read_string_of_length(INPUT, 2)
Expires December 28th, 2001 [Page 10]
Internet Draft MACE June 28th, 2001
V = base32_decode(S) - 0x200
endif
V = PREV xor V
endif
output(character(V))
PREV = character(V)
endif
end
end
The above decoding procedure accepts invalidly-encoded strings as
well. In order to guarantee the unique mapping, following two
additional checks MUST be performed after decoding:
1) that the decoding string is NOT a [STD13] conforming name.
2) that the string which is the result of re-encoding of the
decoded string matches the original string.
10. ACE Identifier
In order to use MACE as an ACE, there must be a certain prefix or
suffix string which is unlikely to be used in normal domain names and
thus identifies MACE-encoded domain name parts. Since MACE-encoded
names can begin with hyphen-minus and names beginning with
hyphen-minus do not conform [STD13], a prefix string should be used.
So if MACE is used for encoding domain name parts, the encoded names
should be prefixed by the prefix string.
This document does not specify the prefix string for MACE. The
actual selection should be left to certain authority such as IANA
[ACEID].
For testing purpose, there is a registry of test prefix strings for
ACEs on IETF IDN working group web site [IDN].
11. Examples
The following examples are meaningless strings, but they are designed
to exercise various aspects of the algorithm in order to verify the
correctness of the implementation.
(a) U+0200 U+4000 U+002D U+B001 U+40001 U+0061
MACE: 0g0x800--wc01y6001-a
(b) U+0061 U+002D U+0300 U+0062 U+0400 U+3000 U+002D U+5000
MACE: -a---0o0-b-100x400--c00
(c) U+1FFF U+2000 U+9FFF U+A000 U+FFFF U+10000 U+10FFFF
MACE: 7vvx000vvvw800vvvy0000vvvv
Expires December 28th, 2001 [Page 11]
Internet Draft MACE June 28th, 2001
(d) U+0200 U+002F U+0030 U+0039 U+003A U+0200 U+0040 U+0041 \
U+005A U+005B U+0200 U+0060 U+0061 U+007A U+007B
MACE: 0g001f-09-01q0g0020-AZ-02r0g0030-az-03r
(e) U+0061 U+0062 U+0063 U+002D U+1000 U+1200 U+002D \
U+2000 U+2010 U+2200 U+002D U+3000 U+3010
MACE: -abc---4004g0--x00000g0g0--40040g
(f) U+0100 U+0102 U+0200 U+002D U+0201 U+002D U+03FE U+0061 U+0234
MACE: zo02w0g0--z1--vv-a-ua
(g) U+3000 U+002D U+3010 U+0061 U+3100 U+310F U+31FF
MACE: x400--zgg-a-ogfng
(h) U+20000 U+002D U+20100 U+0061 U+20010 U+20012 U+200FF
MACE: y2000--zo0-a-og2nd
The following examples are typical Japanese fairly long (15-25
characters) names.
(i) 15 CJK Han characters
<zaiadanhoujinhokkaidoshizenhogosuishinkyoukai>
U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA \
U+7136 U+4FDD U+8B77 U+63A8 U+9032 U+5354 U+4F1A
MACE: xr51dn3j6lblqconjbns2jofak9mbutqrngt8s1icqkboq
(j) 4 Digits, 2 CJK Han, 1 Hiragana, 6 CJK Han, 6 Katakana characters
2001<nenharu><no><koutsujikobokumetsu><kyanpe-n>
U+0032 U+0030 U+0030 U+0031 U+5E74 U+6625 U+306E U+4EA4 \
U+901A U+4E8B U+6545 U+64B2 U+6EC5 U+30AD U+30E3 U+30F3 \
U+30DA U+30FC U+30F3
MACE: -2001-xfjkhh543ebl4s0qbkbha5h5ijm545dzieggh9h6f
(k) 9 CJK Han, 9 Katakana characters
<saitamarinkaikaiyohakubutsukan><marinmyu-jiamu>
U+57FC U+7389 U+81E8 U+6D77 U+6D77 U+6D0B U+535A U+7269 \
U+9928 U+30DE U+30EA U+30F3 U+30DF U+30E5 U+30FC U+30B8 \
U+30A2 U+30E0
MACE: xdvsks9of8jbnz0jsxcqqkj9u9846uzhkgphchqgpi4gqi2
(l) 6 CJK Han, 19 Katakana characters
<shadanhoujinnippon><nettowa-kuinfome-shonsenta->
U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3 \
U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9 \
U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF \
U+30FC
MACE: xm9udn3j6lblqhf5hpc46dzebh7gjijbinh6jsi8gtibiggki8i8ici3
12. Security Considerations
Expires December 28th, 2001 [Page 12]
Internet Draft MACE June 28th, 2001
Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain
name could map to multiple ACE domain names, each controlled by a
different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore MACE is designed so
that each Unicode string has a unique encoding.
13. References
[UNICODE] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html
[IDN] Internationalized Domain Names (IETF Working Group),
http://www.i-d-n.net/, idn@ops.ietf.org
[IDNA] Patrik Falstrom, Paul Hoffman, "Internationalizing Host
Names In Applications (IDNA)", draft-ietf-idn-idna-01
[STD13] Paul Mockapetris, "DOMAIN NAMES - IMPLEMENTATION AND
SPECIFICATION", Nov 1987, STD 13 (RFC 1035)
[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
Table Specification", Oct 1985, RFC 952
[NAMEPREP] Paul Hoffman, Marc Blanchet, "Preparation of
Internationalized Host Names", Feb 2001,
draft-ietf-idn-nameprep-03
[ACEID] Naomasa Maruyama, Yoshiro Yoneya, "Proposal for a determining
process of ACE identifier", Jun 2001, draft-ietf-idn-aceid-02
[BRACE] Adam M. Costello, "BRACE: Bi-mode Row-based
ASCII-Compatible Encoding for IDN", Sep 2000,
draft-ietf-idn-brace-00
[DUDE] Mark Welter, Brian W. Spolarich, Adam M. Costello,
"Differential Unicode Domain Encoding (DUDE)", Jun 2001,
draft-ietf-idn-dude-02
14. Acknowlegdements
Some of the ideas in MACE are taken from other ACE proposals.
The idea of Literal/Non-Literal mode is taken from BRACE draft
[BRACE] by Adam M. Costello.
The idea of differencial encoding used by Compress submode is taken
Expires December 28th, 2001 [Page 13]
Internet Draft MACE June 28th, 2001
from DUDE [DUDE], by Mark Welter, Brian W. Spolarich and Adam M.
Costello.
The structure of this document and text of some sections are borrowed
from AMC-ACE- series draft (draft-ietf-idn-amc-ace-*) by Adam
M. Costello.
15. Authors' Address
Makoto Ishisone
Software Research Associates, Inc.
4-16-10, Chigasaki-Minami, Tsuzuki-ku, Yokohama,
Kanagawa 224-0037 Japan
<ishisone@sra.co.jp>
Yoshiro Yoneya
Japan Network Information Center (JPNIC)
Fuundo Bldg 1F, 1-2 Kanda-ogawamachi,
Chiyoda-ku Tokyo 101-0052, Japan
<yone@nic.ad.jp>
A. Changes from draft-ietf-idn-mace-00
1) A typo in example a) is fixed.
2) More examples are added.
3) A sample implementation is included as an appendix.
B. Sample Implementation
/*
* MACE encoder/decoder sample implementation.
*
* For brevity, this code assumes it is written in ASCII code (or
* its superset).
*
* Option -e encodes the input Unicode characters (standard U+XXXX
* notation) and output MACE-encoded string.
* Option -d decodes MACE-encoded string and output Unicode characters
* (also in U+XXXX notation).
*/
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
Expires December 28th, 2001 [Page 14]
Internet Draft MACE June 28th, 2001
/* UCS-4 code point type */
typedef unsigned long mace_ucs_t;
/* Encode/decode status */
typedef enum {
mace_success, /* succeeded */
mace_overflow, /* buffer overflow */
mace_invalid_input, /* input string is invalid */
mace_nomemory /* malloc failed */
} mace_result_t;
extern mace_result_t
mace_encode(const mace_ucs_t *from, size_t from_len, char *to,
size_t to_size, size_t *to_lenp);
extern mace_result_t
mace_decode(const char *from, size_t from_len, mace_ucs_t *to,
size_t to_size, size_t *to_lenp);
/* Major mode and submode */
typedef enum { Literal, Non_Literal } mace_mode_t;
typedef enum { BMP_A, BMP_B, Non_BMP, Compress } mace_submode_t;
/* Submode introducer octets. */
static int submode_introducer[] = { 'w', 'x', 'y', 'z' };
/* Encode length for a character in each submode. */
/* For Comperss submode it is actually either 1 or 2. */
static int submode_encodelen[] = { 3, 3, 4, 1 };
#define LDH_CHARACTER(c) \
(('a' <= (c) && (c) <= 'z') || ('A' <= (c) && (c) <= 'Z') || \
('0' <= (c) && (c) <= '9') || (c) == '-')
#define LOWER_CHAR(c) \
(('A' <= (c) && (c) <= 'Z') ? (((c) - 'A') + 'a') : (c))
static int
conform_to_std13(mace_ucs_t *s, size_t len)
{
/*
* Check if Unicode string S whose length is LEN conforms
* to the host name part specification by STD13.
*/
size_t idx;
for (idx = 0; idx < len; idx++) {
if (!LDH_CHARACTER(s[idx])) return (0);
}
Expires December 28th, 2001 [Page 15]
Internet Draft MACE June 28th, 2001
return (s[0] != '-' && s[len - 1] != '-');
}
static void
base32_encode(mace_ucs_t v, size_t len, char *s)
{
/*
* Convert non-negative integer V to the corresponding
* base32 format string of LEN octets and store to S.
* Caller must check S is big enough to hold LEN octets beforehand.
*/
static char *b32 = "0123456789abcdefghijklmnopqrstuv";
int idx;
for (idx = len - 1; idx >= 0; idx--) {
s[idx] = b32[v & 0x1F];
v >>= 5;
}
}
static int
base32_decode(const char *s, size_t len, mace_ucs_t *vp)
{
/*
* Convert a base32 string S of LEN octets long to
* a non-negative integer, store it to *VP and return 1.
* If S is not a valid base32 string, return 0.
*/
mace_ucs_t v = 0;
int x;
while (len-- > 0) {
if ('0' <= *s && *s <= '9') x = *s++ - '0';
else if ('a' <= *s && *s <= 'v') x = *s++ - 'a' + 10;
else if ('A' <= *s && *s <= 'V') x = *s++ - 'A' + 10;
else return 0;
v = (v << 5) + x;
}
*vp = v;
return 1;
}
static int
round_trip_check(const mace_ucs_t *u, size_t ulen,
const char *a, size_t alen)
{
/*
* Encode Unicode string U whose length is ULEN and compare the
* result with string A of length ALEN. If the two are same
* (allowing case-difference), return mace_success. Otherwise
* return mace_invalid_input or mace_nomemory (if malloc failed).
Expires December 28th, 2001 [Page 16]
Internet Draft MACE June 28th, 2001
*/
char *check;
size_t reallen, idx;
if ((check = malloc(alen)) == NULL) return mace_nomemory;
if (mace_encode(u, ulen, check, alen, &reallen) != mace_success ||
reallen != alen)
goto invalid;
for (idx = 0; idx < alen; idx++) {
if (LOWER_CHAR(a[idx]) != LOWER_CHAR(check[idx])) goto invalid;
}
free(check);
return mace_success;
invalid:
free(check);
return mace_invalid_input;
}
static int
compressible(mace_submode_t submode, mace_ucs_t prev, mace_ucs_t c,
const mace_ucs_t *rest, size_t rest_len)
{
/*
* Determin whether the Unicode character C should be
* encoded in Compress submode or not.
*/
int idx;
if ((c ^ prev) > 0x1FF) return 0;
if (submode == Compress || c >= 0x10000 || (c ^ prev) < 16)
return 1;
/* Find the next non-LDH character */
for (idx = 0; idx < rest_len; idx++) {
if (!LDH_CHARACTER(rest[idx])) break;
}
return (idx < rest_len && (c ^ rest[idx]) <= 0x1FF);
}
mace_result_t
mace_encode(const mace_ucs_t *from, size_t from_len,
char *to, size_t to_size, size_t *to_lenp)
{
/*
* Encode a Unicode string FROM whose length is FROM_LEN and store
* the result to TO, whose allocated length is TO_SIZE. The
* length of the result string is stored to *TO_LENP. Note that
Expires December 28th, 2001 [Page 17]
Internet Draft MACE June 28th, 2001
* TO will not be terminated by NUL character.
*/
mace_mode_t mode = Non_Literal;
mace_submode_t submode = BMP_A;
mace_ucs_t prev = 0;
const mace_ucs_t *from_ptr = from;
size_t from_rest = from_len, to_idx = 0, len;
mace_ucs_t c, v;
#define OUTPUT(c) \
if (to_idx >= to_size) return mace_overflow; to[to_idx++] = (c)
while (from_rest > 0) {
c = *from_ptr++;
from_rest--;
/* Perform range check. */
if (c > 0x10FFFF) return mace_invalid_input;
if (c == '-') {
OUTPUT('-'); OUTPUT('-');
} else if (LDH_CHARACTER(c)) {
if (mode != Literal) {
/* Switch to Literal mode. */
OUTPUT('-');
mode = Literal;
}
OUTPUT(c);
} else {
mace_submode_t new_submode;
if (mode != Non_Literal) {
/* Switch to Non-Literal mode. */
OUTPUT('-');
mode = Non_Literal;
}
if (compressible(submode, prev, c, from_ptr, from_rest)) {
/* Compress submode */
new_submode = Compress;
v = prev ^ c;
len = 1;
if (v >= 16) {
v += 0x200;
len = 2;
}
} else {
/* Choose the right submode based on the code point. */
if ((0x0000 <= c && c <= 0x1FFF) ||
(0xA000 <= c && c <= 0xFFFF)) {
new_submode = BMP_A;
Expires December 28th, 2001 [Page 18]
Internet Draft MACE June 28th, 2001
v = c - (c >= 0xA000 ? 0x8000 : 0);
} else if (0x2000 <= c && c <= 0x9FFF) {
new_submode = BMP_B;
v = c - 0x2000;
} else {
new_submode = Non_BMP;
v = c - 0x10000;
}
len = submode_encodelen[new_submode];
}
if (new_submode != submode) {
/* Shift to the new submode. */
OUTPUT(submode_introducer[new_submode]);
submode = new_submode;
}
/* Remember the last non-LDH character. */
prev = c;
/* Convert to base32 format string. */
if (to_idx + len > to_size) return mace_overflow;
base32_encode(v, len, &to[to_idx]);
to_idx += len;
}
}
#undef OUTPUT
*to_lenp = to_idx;
return mace_success;
}
mace_result_t
mace_decode(const char *from, size_t from_len,
mace_ucs_t *to, size_t to_size, size_t *to_lenp)
{
/*
* Decode a MACE-encoded string FROM whose length is FROM_LEN
* and store the result to TO, whose allocated length is TO_SIZE.
* The length of the result string is stored to *TO_LENP.
*/
mace_mode_t mode = Non_Literal;
mace_submode_t submode = BMP_A;
mace_ucs_t prev = 0, v;
const char *from_ptr = from;
size_t from_rest = from_len, to_idx = 0;
int c;
#define OUTPUT(c) \
if (to_idx >= to_size) return mace_overflow; to[to_idx++] = (c)
Expires December 28th, 2001 [Page 19]
Internet Draft MACE June 28th, 2001
while (from_rest > 0) {
c = *from_ptr++;
from_rest--;
if (c == '-') {
if (from_rest > 0 && from_ptr[0] == '-') {
OUTPUT('-');
from_ptr++, from_rest--;
} else {
mode = (mode == Literal) ? Non_Literal : Literal;
}
} else if (mode == Literal) {
OUTPUT(c);
} else if (c == 'w' || c == 'W') {
submode = BMP_A;
} else if (c == 'x' || c == 'X') {
submode = BMP_B;
} else if (c == 'y' || c == 'Y') {
submode = Non_BMP;
} else if (c == 'z' || c == 'Z') {
submode = Compress;
} else {
int encode_len = submode_encodelen[submode];
from_ptr--, from_rest++; /* push back C */
if (from_rest < encode_len ||
base32_decode(from_ptr, encode_len, &v) == 0)
return mace_invalid_input;
if (submode == BMP_A) {
if (v >= 0x2000) v += 0x8000;
} else if (submode == BMP_B) {
v += 0x2000;
} else if (submode == Non_BMP) {
v += 0x10000;
} else { /* Compress */
if (v >= 16) {
encode_len = 2;
if (from_rest < encode_len ||
base32_decode(from_ptr, encode_len, &v) == 0)
return mace_invalid_input;
v -= 0x200;
}
v ^= prev;
}
OUTPUT(v);
prev = v;
from_ptr += encode_len;
from_rest -= encode_len;
}
Expires December 28th, 2001 [Page 20]
Internet Draft MACE June 28th, 2001
}
#undef OUTPUT
*to_lenp = to_idx;
if (conform_to_std13(to, to_idx)) return mace_invalid_input;
return (round_trip_check(to, to_idx, from, from_len));
}
/******* Test Driver **************************************************/
static void
error(const char *msg)
{
fprintf(stderr, "%s\n", msg);
exit(1);
}
static void
mace_error(const char *s, mace_result_t r)
{
static char *emsg[] = {
"no error", "buffer overflowed",
"input string is invalid", "malloc failed",
};
fprintf(stderr, "%s: %s\n", s, emsg[r]);
exit(1);
}
int
main(int ac, char **av)
{
char *cmd = *av;
char line[256];
mace_ucs_t ucs[64];
mace_result_t r;
size_t len, ucslen, i;
int encode = 1;
if ('a' != 0x61) error("oops. not ASCII code (EBCDIC?)");
if (ac > 2) {
usage:
fprintf(stderr, "Usage: %s [-e|-d]\n", cmd);
return 1;
} if (ac == 2) {
if (!strcmp(av[1], "-e")) encode = 1;
else if (!strcmp(av[1], "-d")) encode = 0;
else goto usage;
}
Expires December 28th, 2001 [Page 21]
Internet Draft MACE June 28th, 2001
while (fgets(line, sizeof(line), stdin) != NULL) {
if (encode) {
char *p = line, *nxt;
int idx = 0;
while (idx < 64) {
while (isspace((unsigned char)*p)) p++;
if (*p == '\0') break;
if (strncmp(p, "U+", 2) != 0)
error("invalid input format");
ucs[idx++] = strtoul(p + 2, &nxt, 16);
if (nxt == p + 2) error("invalid input format");
p = nxt;
}
if (idx >= 64) error("input too long");
r = mace_encode(ucs, idx, line, 255, &len);
if (r != mace_success) mace_error("mace_encode", r);
printf("%1.*s\n", (int)len, line);
} else {
len = strlen(line) - 1; /* 1 for newline */
r = mace_decode(line, len, ucs, 64, &ucslen);
if (r != mace_success) mace_error("mace_decode", r);
for (i = 0; i < ucslen; i++) {
printf("U+%04lX ", ucs[i]);
}
printf("\n");
}
}
return 0;
}
Expires December 28th, 2001 [Page 22]