Internet-Draft Compact, Grammar-Friendly Representations for UUIDs July 2020
Taylor Expires 27 January 2021
Workgroup: Network Working Group
Internet-Draft: draft-taylor-uuid-ncname-00
Updates: RFC4122 (if approved)
Published: July 2020
Intended Status: Informational
Expires: 27 January 2021
Author: D. Taylor, Independent

Compact, Grammar-Friendly Representations for UUIDs

Abstract

The Universally Unique Identifier (UUID) is, as the name suggests, a suitable standard for uniquely identifying entities in a symbol space large enough that the identifiers do not collide. However, the literal representation specified in RFC 4122 and elsewhere cannot be used in conjunction with a number of formal grammars where it would be beneficial to do so. This document provides the UUID with two additional representations to make these applications possible.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 27 January 2021.

1. Introduction

There are a number of places in formal languages where it would be useful to put UUIDs, but the grammar forbids it. Many grammars forbid identifiers from beginning with digits, or from containing hyphens or colons (as the URN representation in RFC 4122 [RFC4122] does). The NCName production [XML-NAMES], which is pervasive in XML and RDF applications, is one such example. Up until a recent change, the HTML ID production had similar constraints. Virtually every programming language likewise requires identifiers such as variable and function names to start with a letter or underscore, and very few admit hyphens. These constraints cause developers to turn to ad-hoc solutions when they want to use UUIDs in these places.

This document specifies a representation - or rather, two representations - as well as the related transformations to and from the familiar UUID format. A provisional name for these representations is UUID-NCName, with the two variants styled as UUID-NCName-32 and UUID-NCName-64, referring to the base of their respective encodings. The goal of this specification is in part to eliminate an extra decision on the part of developers who find themselves in this position, and in part to provide alternative representations for UUIDs which remain valid but are shorter than the original.

1.1. Motivation & Applications

The purpose of an identifier in general is to pick out some information resource, such that it can be referred to, ideally unambiguously. The purpose of a large, generated identifier like the UUID is to satisfy the uniqueness criterion while also specifying a datatype and normal form for said identifiers, and ultimately to alleviate the need to sit down and think these identifiers up. The reason to insert UUIDs in places where they would not otherwise fit is so that they can be cross-referenced in some other database where they do fit. Consider:

  • A component content management system that uses UUIDs to identify elementary content components uses the UUID-NCName-64 representations of the same UUIDs as fragment identifiers when those components are transcluded.
  • A literate programming system uses the UUID-NCName-32 representation as stable identifiers for all symbols (variables, constants, class names, etc.), enabling said identifiers to be defined and described elsewhere, while still yielding syntactically-correct code.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Strategy

Not all 128 bits of a UUID are data; rather, several bits are masked. The top four bits of the third segment, known as time_hi_and_version, specify the UUID's version, which is fixed. Up to three high bits in the following segment, called clock_seq_hi_and_reserved, specify the variant: how the UUID - if applicable - is meant to be read. We remove these masked quartets (we take an extra bit for the variant) and use them as "bookends" for the rest of the identifier, mapping them to the first sixteen symbols of the Base32 table [RFC4648], which are all letters. The remaining 120 bits, which we bit-shift to close the gaps of the two masked quartets we removed, now divide evenly by both 5 and 6, the number of bits per character in Base32 and Base64, respectively.

For example, the transformation takes the UUID 4abc6330-f548-4e67-b9f9-12d4323769cd and returns ESrxjMPVI5nn5EtQyN2nNL for Base64 and ejk6ggmhvjdtht6is2qzdo2onl for Base32. These symbols always start and end with case-insensitive letters, and the entire Base32 symbol is case-insensitive.
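As a quick check of the arithmetic above, the following Python sketch derives the two symbol lengths from the bit counts. The names payload_bits and symbol_length are ours, introduced only for illustration; they are not part of this specification.

```python
UUID_BITS = 128
MASKED_BITS = 8  # the version quartet plus the variant quartet

# Bits left over once the two bookend quartets have been removed.
payload_bits = UUID_BITS - MASKED_BITS


def symbol_length(bits_per_char: int) -> int:
    """Payload characters plus one bookend character at each end."""
    if payload_bits % bits_per_char != 0:
        raise ValueError("payload does not divide evenly")
    return payload_bits // bits_per_char + 2


BASE32_LENGTH = symbol_length(5)  # Base32 carries 5 bits per character
BASE64_LENGTH = symbol_length(6)  # Base64 carries 6 bits per character
```

Under these assumptions the lengths agree with the sample symbols given in this section.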

4. Syntax

Here is the ABNF grammar for the productions uuid-ncname-32 and uuid-ncname-64:

uuid-ncname-32 = bookend 24base32 bookend
uuid-ncname-64 = bookend 20base64url bookend
bookend        = %x41-50 / %x61-70 ; [A-Pa-p]
base32         = %x32-37 / %x41-5a / %x61-7a ; [2-7A-Za-z]
base64url      = %x2d / %x30-39 / %x41-5a / %x5f / %x61-7a
                 ; [-0-9A-Z_a-z]
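
For implementers who prefer regular expressions, the grammar above can be approximated as follows. This is a sketch in Python; the constant and function names are hypothetical, and the patterns are our transcription of the ABNF rather than a normative part of it.

```python
import re

# Character classes transcribed from the ABNF productions.
BOOKEND = r"[A-Pa-p]"
UUID_NCNAME_32 = re.compile(rf"{BOOKEND}[2-7A-Za-z]{{24}}{BOOKEND}")
UUID_NCNAME_64 = re.compile(rf"{BOOKEND}[-0-9A-Z_a-z]{{20}}{BOOKEND}")


def is_uuid_ncname_32(symbol: str) -> bool:
    """True if the whole string matches the uuid-ncname-32 production."""
    return UUID_NCNAME_32.fullmatch(symbol) is not None


def is_uuid_ncname_64(symbol: str) -> bool:
    """True if the whole string matches the uuid-ncname-64 production."""
    return UUID_NCNAME_64.fullmatch(symbol) is not None
```

Because the two productions have different fixed lengths, no string can match both.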

"Bookends" are 4-bit sequences (nybbles, quartets, etc.) which we map directly onto the Base32 table from [RFC4648]. Indeed, this portion of the Base64 table is identical, though we say Base32 to underscore the fact that bookend characters are case-insensitive. Certain environments encode meaning into the case of the first character of a symbol, so it is important that its literal representation be flexible. There is likewise little value in arbitrarily constraining the last character. Nevertheless, UUID-NCName-64 symbols SHOULD be generated with upper-case bookend characters, while UUID-NCName-32 bookends (and indeed the entire symbol) SHOULD be lower-case.

4.1. Recognizing UUID-NCName Symbols

UUID-NCName symbols always have a fixed length and certain characteristics: UUID-NCName-32 symbols are always exactly 26 characters long while UUID-NCName-64 symbols are always 22 characters long. The version (first bookend character) is mapped to the Base32 table where A is 0, so B is 1, etc. Random (version 4) UUIDs will therefore always start with the letter E. Any value higher than F (version 5/truncated SHA-1 UUID) is unspecified (though there is room for future UUID specifications to go all the way up to version 15). Likewise the variant bit-mask defined in [RFC4122] will cause the symbol to always end, modulo upper/lower-case, in I, J, K, or L (8, 9, 10, 11).
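
The version and variant can be read off a symbol without decoding the rest of it. The Python sketch below assumes only the length and bookend rules described above; the function name bookend_values is hypothetical.

```python
from typing import Tuple

# The Base32 alphabet from RFC 4648; only the first sixteen symbols
# (A through P) are valid bookends.
B32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def bookend_values(symbol: str) -> Tuple[int, int]:
    """Return (version, variant) as integers in the range 0-15."""
    if len(symbol) not in (22, 26):
        raise ValueError("wrong length for a UUID-NCName symbol")
    # Bookends are case-insensitive, so fold to upper case first.
    version = B32_ALPHABET.find(symbol[0].upper())
    variant = B32_ALPHABET.find(symbol[-1].upper())
    if not (0 <= version <= 15 and 0 <= variant <= 15):
        raise ValueError("bookend characters must be in A-P or a-p")
    return version, variant
```

For the version 4 sample symbols in Section 3 this yields version 4 and variant 11 (the letter L).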

4.2. Equivalency

Two UUID-NCName symbols are necessarily equivalent if they produce the same UUID. Two UUID-NCName-32 symbols are equivalent if their string values match when normalized to all upper- or lower-case letters. Two UUID-NCName-64 symbols are equivalent if their string values match when the bookend characters are normalized to either upper- or lower-case.
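
A minimal Python sketch of the string-level comparisons follows. It compares two symbols of the same form only; comparing a Base32 symbol with a Base64 symbol would require decoding both to UUIDs. The function names normalize and equivalent are ours, and syntax validation is assumed to have happened already.

```python
def normalize(symbol: str) -> str:
    """Fold the case-insensitive parts of a symbol to the preferred case."""
    if len(symbol) == 26:
        # UUID-NCName-32: the entire symbol is case-insensitive.
        return symbol.lower()
    if len(symbol) == 22:
        # UUID-NCName-64: only the two bookends are case-insensitive.
        return symbol[0].upper() + symbol[1:-1] + symbol[-1].upper()
    raise ValueError("wrong length for a UUID-NCName symbol")


def equivalent(a: str, b: str) -> bool:
    """Compare two symbols of the same form after normalization."""
    return normalize(a) == normalize(b)
```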

5. Algorithms

These are candidate algorithms for encoding and decoding the symbols, transforming them to and from the conventional UUID representation. There are certainly many equivalents.

5.1. Encoding Algorithm

First we apply the shifting algorithm:

  1. Convert the UUID to a binary string bin.
  2. Convert bin to an array of four 32-bit unsigned network-endian integers ints.
  3. Extract version as (ints[1] & 0x0000f000) >> 12.
  4. Extract variant as (ints[2] & 0xf0000000) >> 24.
  5. Assign ints[1] = (ints[1] & 0xffff0000) | ((ints[1] & 0x00000fff) << 4) | ((ints[2] & 0x0fffffff) >> 24).
  6. Assign ints[2] = (ints[2] & 0x00ffffff) << 8 | (ints[3] >> 24).
  7. Assign ints[3] = (ints[3] << 8) | variant.
  8. Convert ints back into a binary string and return it along with the version.
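
The shifting steps above can be sketched in Python as follows. This is one possible rendering, not a reference implementation: the function name shift is ours, and because Python integers are unbounded we mask ints[3] back to 32 bits explicitly, which the steps leave implicit.

```python
import struct
import uuid
from typing import Tuple


def shift(u: uuid.UUID) -> Tuple[bytes, int]:
    """Move the version and variant quartets out of the UUID body."""
    # Steps 1-2: four 32-bit unsigned network-endian integers.
    ints = list(struct.unpack(">4I", u.bytes))
    # Steps 3-4: version as 0-15; variant left in the high nybble of an octet.
    version = (ints[1] & 0x0000F000) >> 12
    variant = (ints[2] & 0xF0000000) >> 24
    # Steps 5-7: close the two gaps and park the variant in the last octet.
    ints[1] = ((ints[1] & 0xFFFF0000)
               | ((ints[1] & 0x00000FFF) << 4)
               | ((ints[2] & 0x0FFFFFFF) >> 24))
    ints[2] = ((ints[2] & 0x00FFFFFF) << 8) | (ints[3] >> 24)
    ints[3] = ((ints[3] << 8) & 0xFFFFFFFF) | variant
    # Step 8: back to a 16-octet string, returned along with the version.
    return struct.pack(">4I", *ints), version
```

For the sample UUID 4abc6330-f548-4e67-b9f9-12d4323769cd, this returns the octets 4abc6330f548e679f912d4323769cdb0 and version 4.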

Then we apply one of the formatting algorithms. Here is Base64:

  1. Take the binary string bin and shift the last octet to the right by two bits.
  2. Encode bin with the base64url algorithm to get the string b64.
  3. Truncate b64 to 21 characters.
  4. Convert version to its value in the Base32 table.
  5. Return version concatenated to b64.
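
A Python sketch of the Base64 formatting steps is shown below. It assumes its input is the 16-octet string and version produced by the shifting algorithm; the function name format_base64 is hypothetical.

```python
import base64

# The Base32 alphabet from RFC 4648, used for the leading bookend.
B32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def format_base64(shifted: bytes, version: int) -> str:
    """Format a shifted 16-octet string as a UUID-NCName-64 symbol."""
    # Step 1: shift the last octet right by two bits, so the variant
    # quartet occupies exactly one 6-bit Base64 character.
    data = shifted[:15] + bytes([shifted[15] >> 2])
    # Steps 2-3: base64url-encode, then drop the 22nd character and padding.
    b64 = base64.urlsafe_b64encode(data).decode("ascii")[:21]
    # Steps 4-5: prepend the version as an upper-case Base32 letter.
    return B32_ALPHABET[version & 0xF] + b64
```

Given the shifted octets 4abc6330f548e679f912d4323769cdb0 and version 4, this produces ESrxjMPVI5nn5EtQyN2nNL.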

And Base32:

  1. Take the binary string bin and shift the last octet to the right by one bit.
  2. Encode bin with the base32 algorithm to get the string b32.
  3. Truncate b32 to 25 characters.
  4. Convert version to its value in the Base32 table.
  5. Return version concatenated to b32, optionally in either upper or lower case.
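
The Base32 steps can be sketched the same way. As before, the input is assumed to be the output of the shifting algorithm, and the function name format_base32 is hypothetical. We return lower case here, following the recommendation in Section 4.

```python
import base64

# The Base32 alphabet from RFC 4648, used for the leading bookend.
B32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def format_base32(shifted: bytes, version: int) -> str:
    """Format a shifted 16-octet string as a UUID-NCName-32 symbol."""
    # Step 1: shift the last octet right by one bit, so the variant
    # quartet occupies exactly one 5-bit Base32 character.
    data = shifted[:15] + bytes([shifted[15] >> 1])
    # Steps 2-3: Base32-encode, then drop the 26th character and padding.
    b32 = base64.b32encode(data).decode("ascii")[:25]
    # Steps 4-5: prepend the version letter and fold to lower case.
    return (B32_ALPHABET[version & 0xF] + b32).lower()
```

Given the shifted octets 4abc6330f548e679f912d4323769cdb0 and version 4, this produces ejk6ggmhvjdtht6is2qzdo2onl.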

5.2. Decoding Algorithm

  1. First verify the syntax and determine whether the symbol ncname is base32 or base64.
  2. If ncname is base64 and the last character is lowercase, set it to uppercase.
  3. Remove the first character of the symbol ncname and convert it into an integer according to the base32 spec; call that integer version.
  4. Append padding if necessary to satisfy the decoder, A====== for Base32 and A== for Base64.
  5. Decode the remainder of ncname by either the base32 or base64url decoding algorithm into binary string bin.
  6. If ncname was base32, shift the last octet of bin one bit to the left; if base64 shift it two bits.
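
These first six decoding steps might look like this in Python. The function name unformat is ours; the syntax check uses regular expressions transcribed from the ABNF in Section 4, and the Base32 body is folded to upper case because Python's decoder is case-sensitive by default.

```python
import base64
import re
from typing import Tuple

B32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def unformat(ncname: str) -> Tuple[bytes, int]:
    """Return the shifted 16-octet string and the version."""
    # Step 1: verify the syntax and determine the encoding.
    if re.fullmatch(r"[A-Pa-p][2-7A-Za-z]{24}[A-Pa-p]", ncname):
        is_base32 = True
    elif re.fullmatch(r"[A-Pa-p][-0-9A-Z_a-z]{20}[A-Pa-p]", ncname):
        is_base32 = False
    else:
        raise ValueError("not a UUID-NCName symbol")
    # Step 3: the first character is the version, read as Base32.
    version = B32_ALPHABET.index(ncname[0].upper())
    body = ncname[1:]
    if is_base32:
        # Steps 4-5: pad to a full Base32 quantum and decode.
        data = base64.b32decode(body.upper() + "A======")
        bits = 1
    else:
        # Step 2: the trailing bookend must be upper case for Base64.
        body = body[:-1] + body[-1].upper()
        # Steps 4-5: pad to a full Base64 quantum and decode.
        data = base64.urlsafe_b64decode(body + "A==")
        bits = 2
    # Step 6: shift the last octet back to the left.
    return data[:15] + bytes([(data[15] << bits) & 0xFF]), version
```

Both sample symbols from Section 3 decode to the same shifted octets, 4abc6330f548e679f912d4323769cdb0, with version 4.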

Now we apply the shifting algorithm in reverse:

  1. Ensure version is in the range of 0-15 by masking it with 0xf.
  2. Convert the binary string bin into four 32-bit unsigned network-endian integers ints.
  3. Assign variant = (ints[3] & 0xf0) << 24.
  4. Shift and assign ints[3] >>= 8.
  5. Union and assign ints[3] |= ((ints[2] & 0xff) << 24).
  6. Shift and assign ints[2] >>= 8.
  7. Union and assign ints[2] |= ((ints[1] & 0xf) << 24) | variant.
  8. Assign ints[1] = (ints[1] & 0xffff0000) | (version << 12) | ((ints[1] >> 4) & 0xfff).
  9. Convert ints back into the new binary string bin.
  10. Format bin as a UUID.
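
The reverse shifting steps can be sketched in Python as follows. The function name unshift is hypothetical, and the input is assumed to be the 16-octet string and version recovered by the first half of the decoding algorithm.

```python
import struct
import uuid


def unshift(shifted: bytes, version: int) -> uuid.UUID:
    """Restore the version and variant quartets to their usual places."""
    # Step 1: keep the version within 0-15.
    version &= 0xF
    # Step 2: four 32-bit unsigned network-endian integers.
    ints = list(struct.unpack(">4I", shifted))
    # Step 3: lift the variant back into the top nybble of a 32-bit word.
    variant = (ints[3] & 0xF0) << 24
    # Steps 4-7: reopen the gap for the variant quartet.
    ints[3] >>= 8
    ints[3] |= (ints[2] & 0xFF) << 24
    ints[2] >>= 8
    ints[2] |= ((ints[1] & 0xF) << 24) | variant
    # Step 8: reopen the gap for the version quartet.
    ints[1] = ((ints[1] & 0xFFFF0000)
               | (version << 12)
               | ((ints[1] >> 4) & 0xFFF))
    # Steps 9-10: back to octets, formatted as a UUID.
    return uuid.UUID(bytes=struct.pack(">4I", *ints))
```

Applied to the octets 4abc6330f548e679f912d4323769cdb0 with version 4, this recovers the UUID 4abc6330-f548-4e67-b9f9-12d4323769cd.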

6. IANA Considerations

There are no discernible IANA considerations associated with this specification.

7. Security Considerations

As UUID-NCName symbols are isomorphic to their conventional UUID representations, the security considerations for these symbols are the same as those of [RFC4122], though we repeat here the admonition not to assume that UUIDs are hard to guess.

8. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC4122]
Leach, P., Mealling, M., and R. Salz, "A Universally Unique IDentifier (UUID) URN Namespace", RFC 4122, DOI 10.17487/RFC4122, <https://www.rfc-editor.org/info/rfc4122>.
[RFC4648]
Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", RFC 4648, DOI 10.17487/RFC4648, <https://www.rfc-editor.org/info/rfc4648>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.

9. Informative References

[XML-NAMES]
Bray, T., Hollander, D., Layman, A., Tobin, R., and H. S. Thompson, "Namespaces in XML 1.0 (Third Edition)", <https://www.w3.org/TR/2009/REC-xml-names-20091208/>.

Appendix A. Samples

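Table 1: Samples of canonical UUID representations

+===================+======================================+
| Version           | Canonical UUID Representation        |
+===================+======================================+
| 0, Nil            | 00000000-0000-0000-0000-000000000000 |
| 1, Timestamp      | ca6be4c8-cbaf-11ea-b2ab-00045a86c8a1 |
| 2, DCE "Security" | 000003e8-cbb9-21ea-b201-00045a86c8a1 |
| 3, MD5            | 3d813cbb-47fb-32ba-91df-831e1593ac29 |
| 4, Random         | 01867b2c-a0dd-459c-98d7-89e545538d6c |
| 5, SHA-1          | 21f7f8de-8051-5b89-8680-0195ef798b6a |
+-------------------+--------------------------------------+

Table 2: Samples of UUID-NCName representations

+===================+============================+========================+
| Version           | Base32                     | Base64                 |
+===================+============================+========================+
| 0, Nil            | aaaaaaaaaaaaaaaaaaaaaaaaaa | AAAAAAAAAAAAAAAAAAAAAA |
| 1, Timestamp      | bzjv6jsglv4pkfkyaarninsfbl | BymvkyMuvHqKrAARahsihL |
| 2, DCE "Security" | caaaah2glxepkeaiaarninsfbl | CAAAD6Mu5HqIBAARahsihL |
| 3, MD5            | dhwatzo2h7mv2dx4ddykzhlbjj | DPYE8u0f7K6Hfgx4Vk6wpJ |
| 4, Random         | eagdhwlfa3vm4rv4j4vcvhdlmj | EAYZ7LKDdWcjXieVFU41sJ |
| 5, SHA-1          | feh37rxuakg4jnaabsxxxtc3ki | FIff43oBRuJaAAZXveYtqI |
+-------------------+----------------------------+------------------------+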

Appendix B. Implementations

As of this writing, there are two implementations of UUID-NCName:

Author's Address

Dorian Taylor
Independent