Internet Draft Paul Hoffman draft-hoffman-imaa-02.txt IMC & VPNC August 6, 2003 Adam M. Costello Expires in six months UC Berkeley Internationalizing Mail Addresses in Applications (IMAA) Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract The Internationalizing Domain Names in Applications (IDNA) specification describes how to process domain names that contain characters outside the ASCII repertoire. A user who has a non-ASCII domain name may want to use it in an Internet mail address that contains non-ASCII characters not only in the domain part but also in the local part (the part to the left of the "@"). This document describes how to use non-ASCII characters in local parts. It defines internationalized local parts (ILPs), internationalized mail addresses (IMAs), and a mechanism called IMAA for handling them in a standard fashion. 1. Introduction A mail address consists of a local part, an at-sign (@), and a domain name. The IDNA specification [IDNA] describes how to handle domain names that have non-ASCII characters. This document describes how to handle non-ASCII characters in the rest of the mail address.
This document explicitly does not discuss internationalization of display names and comments in mail addresses that appear in message headers [MSGFMT]. MIME part three [MIME3] describes how to use an extended set of characters in message headers, and this document does not alter that specification. This document is being discussed on the ietf-imaa mailing list. See <http://www.imc.org/ietf-imaa/> for information about subscribing and the list's archive. 1.1 Relationship to IDNA This document relies heavily on IDNA for both its concepts and its justification. This document omits a great deal of the justification and design information that might otherwise be found here because it is identical to that in IDNA. Anyone reading this document needs to have first read [IDNA], [PUNYCODE], [NAMEPREP], and [STRINGPREP]. There are a few key differences between the way IMAA treats local parts of mail addresses and the way IDNA treats domain names. - The ACE infix for internationalized local parts is different from the ACE prefix for internationalized domain labels. - Domain names have an intrinsic segmentation into labels, and are already segmented before transformations are performed. Local parts, on the other hand, have no intrinsic segmentation. The transformations on local parts perform a segmentation internally, but it has no external significance. - There is no UseSTD3ASCIIRules flag for local parts. One apparent difference that is not really a difference is the handling of quoting mechanisms. IDNA did not discuss quoting because the phrase "domain label" is presumed to refer to a simple literal string. [DNS] defines domain labels in terms of their literal form (which is used in DNS protocol messages), and later introduces a quoting syntax for representing domain labels in master files, but there is never any doubt that the domain label itself is a simple unstructured sequence.
It goes without saying that domain labels obtained from contexts that use quoting (like master files) need to be reduced to their literal form before any processing is done on them. Local parts, on the other hand, are defined in [MSGFMT] and [SMTP] in terms of their quoted form, as they appear in message headers and SMTP commands. Later it is stated that the quotation characters are not really part of the local part. To avoid any ambiguity, IMAA explicitly discusses the process of dequoting and requoting local parts. 1.2 Open issues This section describes the issues that are known to be unresolved. There may also be other issues we haven't thought of yet. This section might be easier to follow after the rest of the draft has been read. This section will be removed before the document is passed to the IESG or RFC Editor for publication. Throughout the draft, comments related to these open issues appear inside brackets like this: [[ OPEN ISSUE: comments ]]. In the requoting step, the original quoted local part is recommended when ToASCII/ToUnicode had no effect and the original quoting style is compatible with the destination context. Should we keep that recommendation? It adds complexity, and should not be necessary, but it makes IMAA less likely to trigger quotation-related bugs, and is motivated by the principle of not altering local parts unnecessarily (for example, when converting an already-ASCII local part to ASCII, don't gratuitously change the way it's quoted). Are the recommendations in section 6 regarding stored strings and query strings good? Are they sufficient? 1.3 Closed issues that could be reopened The 59-character limit on the Punycode encoder output is aimed at making it easier to reuse Punycode implementations that were written for IDNA (and which might use fixed-sized buffers). This limit could be relaxed for IMAA. 
Unlike domain labels, which have a hard size limit imposed by the syntax of DNS messages, local parts have no hard limit (SMTP must support local parts up to 64 characters, but may support arbitrarily large local parts). A Punycode implementation using 31-bit unsigned integers (or 32-bit signed integers) ought to be able to handle Unicode strings in excess of 2000 code points (I have not calculated the exact limit). For very long strings, the O(n^2) running time of Punycode might become an issue, but I think an O(n log n) implementation is possible. If we were to sacrifice the protection of hyphens, it would slightly simplify the application of Punycode (there would be no need to convert between hyphens and the ACE infix). If we were to sacrifice the protection of hyphens, it would become possible to reuse the IDNA ACE prefix for IMAA. The disadvantage of using different ACE markers in IDNA and IMAA is that humans cannot, without computational assistance, copy local parts into domain labels (as in SOA records) or copy domain names into local parts, because copying the non-ASCII form and then converting to ASCII would give a different result versus converting to ASCII and then copying, and it's the latter procedure that must be considered correct (for compatibility with IMA-unaware and IDN-unaware software that might try to do the same sort of copying). Furthermore, once the copying has happened, the result will display unintelligibly (the ACE will be visible), because the different ACE prefix won't be recognized on the other side of the at-sign. It is impossible to fully solve this problem, because encoded strings don't mark their own endings, only their own beginnings.
Even if the same ACE prefix is used on both sides of the at-sign, if local parts are segmented then a multi-segment local part copied into a domain label will not display intelligibly, while if local parts are not segmented then a multi-label domain name copied into a local part will not display intelligibly. However, using the same ACE prefix would allow the common cases to work intuitively: Local parts containing only LDH characters and non-ASCII characters could be copied (by humans, in non-ACE form) into domain labels (where they would display correctly), and domain names obeying the STD3 ASCII rules could be copied (by humans, in non-ACE form) into local parts (where they would display correctly). One concern with using the same prefix is that in the uncommon cases where it doesn't work nicely, the unintelligible display will not be an ACE, but will be non-ASCII gobbledygook (which will still work if copied back to the other side of the at-sign, but might be even less user-friendly than an ACE). 2. Terminology The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and "MAY" in this document are to be interpreted as described in RFC 2119 [KEYWORDS]. Code point, Unicode, and ASCII are defined in [IDNA]. The "protected code points" are 0..40, 5B..60, 7B..7F (in other words, those corresponding to ASCII characters other than letters and digits). A "mail address" consists of a local part, an at-sign, and a domain name, in that order. The exact details of the syntax depend on the context; for example, a "mailbox" in [SMTP] and an "addr-spec" in [MSGFMT] are both mail addresses, but they define slightly different syntaxes for local parts and domain names. A "dequoted local part" is the simple literal text string that is the intended "meaning" of a local part after it has undergone lexical interpretation. 
A dequoted local part excludes optional white space, comments, and lexical metacharacters (like backslashes and quotation marks used to quote other characters). Dequoted local parts are generally not allowed in protocols (like SMTP commands and message headers), but they are needed by IMAA as an intermediate form. The dequoted form of X is sometimes written dequote(X). An "internationalized local part" (ILP) is anything that satisfies both of the following conditions: (1) It conforms to the same syntax as a non-internationalized local part except that non-ASCII Unicode characters are allowed wherever ASCII letters are allowed. (2) After it has been dequoted, the ToASCII operation can be applied to it without failing (see section 4). The term "internationalized local part" is a generalization, embracing both old ASCII local parts and new non-ASCII local parts. Although most Unicode characters can appear in internationalized local parts, ToASCII will fail for some inputs. Anything that fails to satisfy condition 2 is not a valid internationalized local part. A "traditional local part" is a local part that contains only ASCII characters and whose dequoted form would be left unchanged by the ToUnicode operation (see section 4). An "internationalized mail address" (IMA) consists of an internationalized local part, an at-sign, and an internationalized domain name [IDNA], in that order. Equivalence of local parts is defined in terms of the dequoted form (see above) and the ToASCII operation, which constructs an ASCII form for a given dequoted local part (whether or not the local part was already an ASCII local part). Two traditional local parts X and Y are equivalent if and only if dequote(X) and dequote(Y) are exactly identical. (That is not a new rule, it is inferred from [SMTP] and [MSGFMT].) 
For internationalized local parts X and Y that are not both traditional, they are defined to be equivalent if and only if ToASCII(dequote(X)) matches ToASCII(dequote(Y)) using a case-insensitive ASCII comparison. Unlike traditional local parts, non-traditional internationalized local parts are always case-insensitive. Two internationalized mail addresses are equivalent if and only if their local parts are equivalent (according to the previous definition) and their domain parts are equivalent (according to IDNA). To allow internationalized local parts to be handled by existing applications, IMAA uses an "ACE local part" (ACE stands for ASCII Compatible Encoding). An ACE local part is an internationalized local part that can be rendered in ASCII and is equivalent to an internationalized local part that cannot be rendered in ASCII. Given any internationalized local part (in dequoted form) that cannot be rendered in ASCII, the ToASCII operation will convert it to an equivalent ACE local part (whereas an ASCII local part will be left unaltered by ToASCII). ACE local parts are unsuitable for display to users. The ToUnicode operation will convert any local part (in dequoted form) to an equivalent non-ACE local part. In fact, an ACE local part is formally defined to be any local part that the ToUnicode operation would alter (whereas non-ACE local parts are left unaltered by ToUnicode). The ToASCII and ToUnicode operations are specified in section 4. The "ACE infix" is defined in this document to be a string of ASCII characters that occurs within every encoded segment in a dequoted ACE local part. It is specified in section 5. A "mail address slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying a mail address (or part of a mail address).
Mail address slots exist, for example, in the MAIL and RCPT commands of the SMTP protocol, in the To: and Received: fields of message headers, and in a mailto: URI in the href attribute of an HTML <A> tag. General text that just happens to contain a mail address is not a mail address slot; for example, a mail address appearing in the plain text body of a message is not occupying a mail address slot. An "IMA-aware mail address slot" is defined in this document to be a mail address slot explicitly designated for carrying an internationalized mail address as defined in this document. The designation may be static (for example, in the specification of the protocol or interface) or dynamic (for example, as a result of negotiation in an interactive session). An "IMA-unaware mail address slot" is defined in this document to be any mail address slot that is not an IMA-aware mail address slot. Obviously, this includes any mail address slot whose specification predates this document. 3. Requirements and applicability 3.1 Requirements IMAA conformance means adherence to the following four requirements: 1) In an internationalized mail address, the following characters MUST be recognized as at-signs for separating the local part from the domain name: U+0040 (commercial at), U+FF20 (fullwidth commercial at). 2) Whenever a mail address (or part of a mail address) is put into an IMA-unaware mail address slot (see section 2), it MUST contain only ASCII characters. Given an internationalized mail address, an equivalent mail address satisfying this requirement can be obtained by applying ToASCII to the local part as specified in section 4, changing the at-sign to U+0040, and processing the domain name as specified in [IDNA]. 3) ACE local parts obtained from mail address slots SHOULD be hidden from users when it is known that the environment can handle the non-ACE form, except when the ACE form is explicitly requested.
When it is not known whether or not the environment can handle the non-ACE form, the application MAY use the non-ACE form (which might fail, such as by not being displayed properly), or it MAY use the ACE form (which will look unintelligible to the user). Given an internationalized local part, an equivalent non-ACE local part can be obtained by applying the ToUnicode operation as specified in section 4. When requirements 2 and 3 both apply, requirement 2 takes precedence. 4) If two mail addresses are equivalent and either one refers to a mailbox, then both MUST refer to the same mailbox, regardless of whether they use the same form of at-sign. Discussion: This implies that non-ASCII local parts cannot be deployed in domains whose mail exchangers are case-sensitive. IMAA is designed to work without upgrading mail exchangers, but it works only for mail exchangers that treat ASCII local parts as case-insensitive (which is the common and preferred behavior). All local parts received by an IMA-unaware mail exchanger are ASCII, either traditional or ACE, and a case-insensitive exchanger will automatically obey requirement 4 without being aware of it. Case-sensitive exchangers will not correctly handle ACE local parts, but administrators can simply refrain from creating ACE local parts in those domains. This is necessary because a round-trip through ToUnicode and ToASCII is not case-preserving, and therefore the result might refer to a different mailbox (in violation of requirement 4) if interpreted by a case-sensitive mail exchanger. 3.2 Applicability IMAA is applicable to all mail addresses in all mail address slots except where it is explicitly excluded. This implies that IMAA is applicable to protocols that predate IMAA. Note that mail addresses occupying mail address slots in those protocols MUST be in ASCII form (see section 3.1, requirement 2). 3.2.1. 
Case-sensitive local parts IMAA does not apply to local parts that are interpreted case-sensitively (see section 3.1 requirement 4). 3.2.2. Local parts versus domain names The IMAA ToASCII and ToUnicode operations apply to local parts, not to domain labels. The IDNA ToASCII and ToUnicode operations apply to domain labels, not to local parts. There exist conventions for transplanting local parts into domain labels (in DNS SOA records, for example), and there may exist conventions for transplanting domain names into local parts. Such conventions that predate IMAA are IMA-unaware, and therefore the domain labels receiving the transplanted local parts and the local parts receiving the transplanted domain names are IMA-unaware slots. Therefore the strings MUST be in ASCII form before they are transplanted. If they were transplanted in non-ASCII form they would risk being passed through the wrong ToASCII operation. 4. Conversion operations An application converts a local part put into an IMA-unaware mail address slot or displayed to a user. This section specifies the steps to perform in the conversion, and the ToASCII and ToUnicode operations. The input to ToASCII or ToUnicode is a dequoted local part that is a sequence of Unicode code points (remember that all ASCII code points are also Unicode code points). If a local part is represented using a character set other than Unicode or US-ASCII, it will first need to be transcoded to Unicode. Starting from a local part, the steps that an application takes to do the conversions are: 1) Decide whether the local part is a "stored string" or a "query string" as described in [STRINGPREP] (see section 6 below for a discussion). If this conversion follows the "queries" rule from [STRINGPREP], set the flag called "AllowUnassigned". 2) Save a copy of the local part. 3) Dequote the local part; that is, perform lexical interpretation and remove all nonliteral characters. 
For example, for a local part that uses the lexical syntax of [SMTP] or [MSGFMT], unfold it, remove comments and unquoted white space, and remove backslashes and quotation marks used to quote other characters. The result is a simple literal text string. 4) Process the string with either the ToASCII or the ToUnicode operation as appropriate. Typically, you use the ToASCII operation if you are about to put the local part into an IMA-unaware slot, and you use the ToUnicode operation if you are displaying the local part to a user. 5) If step 4 had no effect on the string, and if the saved local part from step 2 is a valid representation of the string in the destination context, then the saved local part SHOULD be used, otherwise proceed to step 6. 6) Apply whatever quoting is needed in the destination context (if any). For "mailbox" slots [SMTP] and "addr-spec" slots [MSGFMT] the following action suffices: If the string contains any control characters, spaces, or specials [MSGFMT], or if it begins or ends with a dot, or contains two consecutive dots, then convert it to a quoted-string: insert a backslash before every quotation mark and backslash, then enclose the string with quotation marks. [[ OPEN ISSUE: Keep steps 2 and 5? ]] The destination context might also impose a length restriction. Depending on whether the restriction applies to the quoted form or the dequoted form, the application might want to check the length just before or after step 5. The following two subsections define the ToASCII and ToUnicode operations that are used in step 4. This description of the protocol uses specific procedure names, names of flags, and so on, in order to facilitate the specification of the protocol. These names, as well as the actual steps of the procedures, are not required of an implementation. In fact, any implementation which has the same external behavior as specified in this document conforms to this specification. 
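As an informal illustration of steps 3 and 6 above, the following Python sketch dequotes and requotes a local part. It is a simplification under stated assumptions, not a complete [MSGFMT] parser: dequote() handles only the quoted-string form and ignores comments, folding, and unquoted white space. The names dequote, needs_quoting, requote, and SPECIALS_EXCEPT_DOT are hypothetical and chosen for this example only. Dots are checked separately from the other specials because step 6 permits interior single dots.

```python
# Specials from [MSGFMT] other than ".", which step 6 treats separately.
SPECIALS_EXCEPT_DOT = set('()<>[]:;@\\,"')


def dequote(local_part: str) -> str:
    """Step 3 (simplified): strip the quotation marks and quoting backslashes.

    Assumption: no comments, folding, or unquoted white space are present.
    """
    if len(local_part) >= 2 and local_part[0] == '"' and local_part[-1] == '"':
        out = []
        chars = iter(local_part[1:-1])
        for ch in chars:
            if ch == "\\":
                # A backslash quotes the next character; keep that one literally.
                ch = next(chars, "")
            out.append(ch)
        return "".join(out)
    # Not a quoted-string: in this sketch the text is already literal.
    return local_part


def needs_quoting(s: str) -> bool:
    """Step 6 test: controls, spaces, specials, or badly placed dots."""
    has_unsafe = any(
        ord(c) < 0x20 or ord(c) == 0x7F or c == " " or c in SPECIALS_EXCEPT_DOT
        for c in s
    )
    return has_unsafe or s.startswith(".") or s.endswith(".") or ".." in s


def requote(s: str) -> str:
    """Step 6: convert to a quoted-string only when the test above requires it."""
    if not needs_quoting(s):
        return s
    # Insert a backslash before every backslash and quotation mark.
    escaped = s.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'
```

For example, dequote('"john smith"') yields the literal string john smith, and requote applied to that string restores the quotation marks, whereas requote('john.smith') returns its input unchanged.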
4.1 ToASCII The ToASCII operation takes a sequence of Unicode code points that make up a dequoted local part and transforms it into a sequence of code points in the ASCII range (0..7F). If ToASCII succeeds, the original sequence and the resulting sequence are equivalent dequoted local parts. It is important to note that the ToASCII operation can fail. ToASCII fails if any step of it fails. If any step of the ToASCII operation fails, that string MUST NOT be used as an internationalized local part. The method for dealing with this failure is application-specific. The inputs to ToASCII are a sequence of code points, and the AllowUnassigned flag. The output of ToASCII is either a sequence of ASCII code points or a failure condition. ToASCII never alters a sequence of code points that are all in the ASCII range to begin with. Applying the ToASCII operation multiple times has exactly the same effect as applying it just once. ToASCII consists of the following steps: 1. If the sequence contains any code points outside the ASCII range (0..7F) then proceed to step 2, otherwise stop, leaving the sequence unchanged. 2. Perform the steps specified in [NAMEPREP] and fail if there is an error. The AllowUnassigned flag is used in [NAMEPREP]. 3. If the sequence is empty then stop, leaving an empty result. 4. Divide the sequence into segments. Segment boundaries occur wherever a protected code point is adjacent to a non-protected code point, and nowhere else. (Therefore segments are never empty, and they alternate between segments containing only protected code points and segments containing only non-protected code points.) 5. For each segment perform the following substeps: (a) If the segment contains any code points outside the ASCII range (0..7F) then proceed to substep b, otherwise leave the segment unchanged. (b) Verify that the ACE infix does NOT occur anywhere within the segment. (c) Encode the sequence using the encoding algorithm in [PUNYCODE] and fail if there is an error. 
(d) Verify that the result contains no more than 59 code points. (e) The sequence will contain at most one instance of U+002D (hyphen-minus). If it is absent then prepend the ACE infix; otherwise verify that the ACE infix does not already occur before the hyphen-minus, and substitute the ACE infix in place of it. 6. Rejoin the segments into a single sequence. 4.2 ToUnicode The ToUnicode operation takes a sequence of Unicode code points that make up a dequoted local part and returns a sequence of Unicode code points. If the input sequence is a dequoted local part in ACE form, then the result is an equivalent dequoted internationalized local part that is not in ACE form, otherwise the original sequence is returned unaltered. ToUnicode never fails. If any step fails, then the original input sequence is returned immediately in that step. The Punycode decoder can never output more code points than it inputs, but Nameprep can, and therefore ToUnicode can. Note that the number of octets needed to represent a sequence of code points depends on the particular character encoding used. The inputs to ToUnicode are a sequence of code points, and the AllowUnassigned flag. The output of ToUnicode is a sequence of code points. ToUnicode consists of the following steps: 1. If the sequence contains any code points outside the ASCII range (0..7F) then proceed to step 2, otherwise skip to step 3. 2. Perform the steps specified in [NAMEPREP] and fail if there is an error. The AllowUnassigned flag is used in [NAMEPREP]. 3. Verify that the sequence is nonempty. 4. Divide the sequence into segments (same as step 4 of ToASCII). 5. For each segment perform the following substeps: (a) If the ACE infix does not occur anywhere within the segment then leave the segment unchanged, otherwise save a copy of the segment and proceed to substep b. 
(b) If the ACE infix occurs at the very beginning of the segment then remove it, otherwise substitute U+002D (hyphen-minus) in place of the first occurrence of the ACE infix. (c) Decode the segment using the decoding algorithm in [PUNYCODE] and catch any error. If there was an error then restore the saved copy from substep a. 6. Verify that at least one segment was altered in step 5. 7. Rejoin the segments into a single sequence, and save a copy of the result. 8. Apply ToASCII to the current sequence and to a copy of the original input. 9. Verify that the two results of step 8 match using a case-insensitive ASCII comparison. 10. Return the saved copy from step 7. 5. ACE infix [[ Note to the IESG and Internet Draft readers: The two uses of the string "0iesg1" below are to be changed at time of publication to an infix that fulfills the requirements in the first paragraph. IANA will assign this value. ]] The ACE infix, used in the conversion operations (section 4), is two ASCII letters surrounded by two distinct ASCII digits. The ToASCII and ToUnicode operations MUST recognize the ACE infix in a case-insensitive manner. The ACE infix for IMAA is "0iesg1" or any capitalization thereof. This means that an ACE local part might be "foobar!de0iesg1jg4avhby1noc0d!0iesg1d9juau41awczczp", where "de-jg4avhby1noc0d" and "d9juau41awczczp" are the results of the encoding steps in [PUNYCODE]. While every encoded segment (segment that would be altered by ToUnicode) within an ACE local part contains the ACE infix, not every segment containing the ACE infix is an encoded segment. Segments that contain the ACE infix but are not encoded segments will confuse users, and local parts containing such segments SHOULD NOT be used as mailbox names. 6. 
Stored strings and query strings [STRINGPREP] prohibits unassigned code points in "stored strings" and allows them in "query strings", but concedes that "different Internet protocols use strings very differently, so these terms cannot be used exactly in every protocol that needs to use stringprep". In the context of IMAA, the following clarifications apply. A string that assigns/creates the name of an object is a "stored string". A string that merely refers to an object using a name that is presumed to have been assigned/created elsewhere is a "query string". Examples of stored strings: * In a mail server configuration file/database, the strings that create the mail addresses associated with the local mailboxes. (These mail addresses might be defined in pieces: the domain parts might be defined by a set of local domains, and the local parts might be defined by a separate set of user names and aliases, but the net effect is that these strings create a set of mail addresses, and are therefore stored strings.) * The msg-id in the Message-ID: field of a message header. Examples of query strings: * A mail address in the From: or To: or Reply-To: field of a message header. * A mail address in the MAIL or RCPT command of SMTP. * A mail address in a personal address book. * A msg-id in the In-Reply-To: or References: field of a message header. [[ OPEN ISSUE: Does this section say the right things? Should it say more? ]] 7. References 7.1 Normative references [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. [PUNYCODE] Costello, A., "Punycode: A Bootstring encoding of Unicode for use with Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. 
[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [SMTP] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821, April 2001. [MSGFMT] Resnick, P., "Internet Message Format", RFC 2822, April 2001. [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. 7.2 Informative references [MIME3] Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", RFC 2047, November 1996. [DNS] Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034 and "Domain names - implementation and specification", STD 13, RFC 1035, November 1987. 8. Security considerations Because this document normatively refers to [IDNA], [NAMEPREP], [PUNYCODE], and [STRINGPREP], it includes the security considerations from those documents as well. Internationalized local parts will cause mail addresses to become longer, and possibly make it harder to keep lines in a header under 78 characters. Lines that are longer than 78 characters (which is a SHOULD specification, not a MUST specification, in RFC 2822) could possibly cause mail user agents to fail in ways that affect security. 9. IANA considerations IANA will assign the ACE infix in consultation with the IESG, possibly following the same process used for [IDNA]. 10. Authors' addresses Paul Hoffman Internet Mail Consortium and VPN Consortium 127 Segre Place Santa Cruz, CA 95060 USA phoffman@imc.org Adam M. Costello University of California, Berkeley http://www.nicemice.net/amc/