Skip to main content

IAB Statement on Identifiers and Unicode

Document Type IAB Statement
Title IAB Statement on Identifiers and Unicode
Published 2018-03-15
Metadata last updated 2023-08-09
State Active
Send notices to (None)

In January of 2015, the IAB published a statement on “Identifiers and Unicode 7.0.0.” This statement was published to alert the community to a problem that had been discovered during the examination of new characters added in Unicode 7.0.0.

The problem occurs when a script contains a precomposed character form and two or more other characters which may be composed to render an indistinguishable form.  There are cases in which Unicode specifies that a precomposed character and the same form achieved by composition are canonically equivalent. For the cases which triggered the original statement, no such equivalence was specified.  Because both methods of producing the form are valid, there are risks to interoperability and security, as comparison, sorting, and equivalence will not meet user or developer expectations. Note that these are not merely visually similar characters, as described in RFC 5890 Section 4.4, but different methods of producing the same form in the same script.

The IAB recommended that the Internationalized Domain Names for Applications (IDNA) Parameters registry not be updated to Unicode 7.0.0 until the IETF had consensus on a solution to this problem.  More generally, the IAB called upon the community not to use characters in global identifiers such as domain names when those characters do not normalize to the same codepoint but would be considered indistinguishable in form without further context.

The IETF has not come to consensus on this matter. There have been three attempts to look at related topics (LUCID being the most directly related, with ARCING and ENAME touching on the same issues). The situation hasn’t failed entirely to progress, as there’s been some work on cataloging characters that have the “non-decomposing codepoint” problem and individual drafts suggesting more general approaches to this class of problems.  But no consensus has emerged. It now seems clear that the existence of such characters predates Unicode 7, and that the issue is inherent in different assumptions about context between Unicode and IDNA2008.

Given the passage of time, the IDNA Parameters registry is now seriously behind the deployed versions of Unicode, version 10 being common and version 11 expected in June of 2018. Mismatches between the derived property values calculated by applying the normative rules to the most recent Unicode version and the non-normative tables published by IANA have well-known failure modes which were demonstrated in the deployment of IDNA 2003.  With each increase in the mismatch, the balance between those failure modes and the security and interoperability issues inherent in these characters has had to be reassessed. It seems to the IAB that keeping the registry frozen does ongoing damage to the deployment of IDNA2008, without getting us closer to a better answer to the non-decomposing codepoint problem.

The IAB believes that the best current course of action for now is to bring the two back into alignment while the IETF continues to work toward consensus on the underlying issues.  We have therefore asked the expert reviewer to bring the IDNA Parameters registry up to date by June of 2018, working with other experts as necessary.

IDNA is based, in part, on the assumption that there will always be some strings that are permissible under the standard that are nonetheless ill-advised for use in specific identifiers and contexts. For DNS names the responsibility for providing guidance on identifier composition extends beyond the IETF, to administrators of DNS zones. For the DNS root, this administrator is ICANN, which has continued to develop its rules for IDNs in the root zone (see, for example, the formation of a study group on application of the rootzone label generation rules).  It also extends to ccTLD administrators, TLD registries, and throughout the DNS registration hierarchy.  For now, implementation guidelines by ICANN, TLD registries, and other administrators must continue to include serious caution about DNS name registrations using characters of this type.

Security considerations which have long applied to mixed-script environments, such as phishing and pharming, will also apply to single script environments which contain these characters.  Developers should consider visual markers parallel to those set out in RFC 5890, Section 4.4 for these situations as well. There may, of course, be no human user to see such a marker, and user input systems such as speech-recognition systems may not provide ways to see them. This reinforces the need for caution in registration but, ultimately, Internet identifier systems can’t presume that registration rules have eliminated this problem completely.

The IAB continues to encourage the IETF to pursue the work discussed above, and we encourage all others to be conservative in the assignment and use of potentially ambiguous identifiers. RFC 6912 provides some useful principles in this regard for the DNS that may also extend to other global identifiers and the protocols that use them.