Unicode Character Repertoire Subsets
draft-bray-unichars-10

Document Type Active Internet-Draft (individual in art area)
Authors Tim Bray, Paul E. Hoffman
Last updated 2024-12-12 (Latest revision 2024-12-11)
RFC stream Internet Engineering Task Force (IETF)
Intended RFC status Proposed Standard
Reviews
ARTART Early review (of -09) by Barry Leiba Partially completed Ready w/nits
Additional resources Mailing List
Stream WG state (None)
Document shepherd John R. Levine
Shepherd write-up Last changed 2025-01-09
IESG IESG state AD Evaluation
Action Holder
Consensus boilerplate Unknown
Telechat date (None)
Responsible AD Orie Steele
Send notices to johnl@taugh.com
Network Working Group                                            T. Bray
Internet-Draft                                       Textuality Services
Intended status: Standards Track                              P. Hoffman
Expires: 14 June 2025                                              ICANN
                                                        11 December 2024

                  Unicode Character Repertoire Subsets
                         draft-bray-unichars-10

Abstract

   This document discusses specifying subsets of the Unicode character
   repertoire for use in protocols and data formats.  It also specifies
   those subsets as PRECIS profiles.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 14 June 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Bray & Hoffman            Expires 14 June 2025                  [Page 1]
Internet-Draft               Unicode Subsets               December 2024

Table of Contents

   1.  Introduction
     1.1.  Notation
   2.  Characters and Code Points
     2.1.  Transformation Formats
     2.2.  Problematic Code Points
       2.2.1.  Surrogates
       2.2.2.  Control Codes
       2.2.3.  Noncharacters
   3.  Dealing With Problematic Code Points
   4.  Subsets
     4.1.  Unicode Scalars
     4.2.  XML Characters
     4.3.  Unicode Assignables
   5.  Using Subsets
   6.  IANA Considerations
     6.1.  UnicodeScalarsClass Profile
     6.2.  XMLCharactersClass Profile
     6.3.  Unicode Assignables Profile
   7.  Security Considerations
   8.  Normative References
   9.  Informative References
   Acknowledgements
   Authors' Addresses

1.  Introduction

   When a protocol or data format has text fields, that text is normally
   composed of Unicode [UNICODE] characters, to support use by speakers
   of many languages.  Unicode characters are represented by numeric
   code points, and the "set of all Unicode code points" is generally
   not a good choice for use in text fields.  Unicode recognizes
   different types of code points, not all of which are appropriate in
   protocols, or even associated with characters.  Therefore, even if
   the desire is to support "all Unicode characters", a subset of the
   Unicode code point repertoire should be specified.  Subsets such as
   those discussed in this document are appropriate choices.

   In this document, "subset" means a subset of the Unicode character
   repertoire.  This document specifies subsets that exclude some or all
   of the code points that are "problematic" as defined in Section 2.2.
   Authors should have a way to concisely and exactly reference a stable
   specification that identifies which subset a protocol or data format
   accepts.

   This document discusses issues that apply in choosing subsets, names
   two subsets that have been popular in practice, and suggests one new
   subset.  The goal is to provide a convenient target for cross-
   reference from other specifications.

1.1.  Notation

   In this document, the numeric values assigned to Unicode characters
   are provided in hexadecimal.  This document uses Unicode's standard
   notation of "U+" followed by four or more hexadecimal digits.  For
   example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black
   Heart), decimal 128,420, is U+1F5A4.

   Groups of numeric values described in Section 4 are given in ABNF
   [RFC5234].  In ABNF, hexadecimal values are preceded by "%x" rather
   than "U+".

   All the numeric ranges in this document are inclusive.
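
   As an informal illustration, which is not part of this
   specification, the two notations can be produced with a few lines
   of Python.  The function names u_plus and abnf_hex are hypothetical
   and exist only for this sketch.

```python
def u_plus(code_point: int) -> str:
    """Format a code point as "U+" followed by four or more hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not in the Unicode codespace")
    return f"U+{code_point:04X}"


def abnf_hex(code_point: int) -> str:
    """Format the same value the way ABNF writes it, with a "%x" prefix."""
    return f"%x{code_point:X}"


print(u_plus(ord("A")))     # U+0041
print(u_plus(128420))       # U+1F5A4
print(abnf_hex(0x1F5A4))    # %x1F5A4
```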

   The subsets are described both in ABNF and as PRECIS profiles
   [RFC8264].

2.  Characters and Code Points

   Definition D9 in section 3.4 of [UNICODE] defines "Unicode codespace"
   as "a range of integers from 0 to 10FFFF_16".  Definition D10 defines
   "code point" as "Any value in the Unicode codespace".

   The Unicode Standard's definition of "Unicode character" is
   conceptual.  However, each Unicode character is assigned a code
   point, used to represent the characters in computer memory and
   storage systems and, in specifications, to specify allowed subsets.

   There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer
   than 150,000 have been assigned to characters.  It is difficult to
   specify that unassigned code points should be avoided because they
   regularly become assigned when new characters are added to Unicode.

2.1.  Transformation Formats

   Unicode describes a variety of "transformation formats", ways to
   marshal code points into byte sequences.  A survey of transformation
   formats is beyond the scope of this document.  However, it is useful
   to note that the "UTF-16" format represents each code point with one
   or two 16-bit chunks, and the "UTF-8" format uses variable-length
   byte sequences.
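
   The difference between the two formats can be seen in a short
   Python sketch that encodes the two example characters from
   Section 1.1.  The helper names are hypothetical; the codec names
   "utf-8" and "utf-16-be" are those of the Python standard library.

```python
def utf8_bytes(ch: str) -> list:
    """Hex bytes of the UTF-8 encoded form."""
    return [f"{b:02X}" for b in ch.encode("utf-8")]


def utf16_units(ch: str) -> list:
    """Hex 16-bit code units of the UTF-16 encoded form (big-endian)."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


for ch in ("A", "\U0001F5A4"):
    print(f"U+{ord(ch):04X}", utf8_bytes(ch), utf16_units(ch))
# U+0041 ['41'] ['0041']
# U+1F5A4 ['F0', '9F', '96', 'A4'] ['D83D', 'DDA4']
```

   The second character needs two 16-bit units in UTF-16 and four
   bytes in UTF-8.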

   The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],
   says "Protocols MUST be able to use the UTF-8 charset", which becomes
   a mandate to use UTF-8 for any protocol or data format that specifies
   a single transformation format.  UTF-8 is widely used for
   interoperable data formats such as JSON, YAML, CBOR, and XML.

2.2.  Problematic Code Points

   This section classifies as "problematic" all the code points which
   can never represent useful text and in some cases can lead to
   software misbehavior.  This is a low bar; the PRECIS [RFC8264]
   framework's "IdentifierClass" and "FreeformClass" exclude many more
   code points which can cause problems when displayed to humans, in
   some cases presenting security risks.  Specifications of fields in
   protocols and data formats whose contents are designed for display to
   and interactions with humans would benefit from careful consideration
   of the issues described by PRECIS; its more-restrictive subsets might
   be better choices than those specified in this document.

   Definition D10a in section 3.4 of [UNICODE] defines seven code point
   types.  Three types of code points are assigned to entities which are
   not actually characters or whose value as Unicode characters in text
   fields is questionable: "Surrogate", "Control", and "Noncharacter".
   In this document, "problematic" refers to code points whose type is
   "Surrogate" or "Noncharacter", and to "legacy controls" as defined in
   Section 2.2.2.2 below.

   Unicode's definition D49 concerns the "private-use" type and section
   3.5.10 states that they "are considered to be assigned characters".
   Section 23.5 further states that these characters' "use may be
   determined by private agreement among cooperating users".  Because
   private-use code points may have uses based on private agreements,
   this document does not classify them as "problematic".

2.2.1.  Surrogates

   A total of 2,048 code points, the range U+D800-U+DFFF, is divided
   into two blocks called "high surrogates" and "low surrogates";
   collectively the 2,048 code points are referred to as "surrogates".
   Surrogates may only be used in Unicode texts encoded in UTF-16, where
   a high-surrogate/low-surrogate pair represents a code point greater
   than U+FFFF.

   A surrogate which occurs in text encoded in any transformation format
   other than UTF-16 has no meaning.  In particular, [UNICODE] section
   3.9.3 forbids representing a surrogate in UTF-8.
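
   The following Python sketch shows how a surrogate pair maps to a
   code point, and that Python's UTF-8 codec refuses a lone surrogate.
   The function names are hypothetical.  The split of the range into
   high surrogates (U+D800-U+DBFF) and low surrogates (U+DC00-U+DFFF)
   is taken from the Unicode Standard rather than from this document.

```python
def is_surrogate(code_point: int) -> bool:
    return 0xD800 <= code_point <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Return the code point that a UTF-16 surrogate pair represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(f"U+{combine_surrogates(0xD83D, 0xDDA4):04X}")   # U+1F5A4

# A surrogate on its own cannot be serialized as UTF-8.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected by the UTF-8 encoder")
```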

2.2.2.  Control Codes

   Section 23.1 of [UNICODE] introduces the control codes for
   compatibility with legacy pre-Unicode standards.  They comprise 65
   code points in the ranges U+0000-U+001F ("C0 controls") and
   U+0080-U+009F ("C1 controls"), plus U+007F, "DEL".

2.2.2.1.  Useful Controls

   The C0 controls include newline (U+000A), carriage return (U+000D),
   and tab (U+0009); this document refers to these three characters as
   the "useful controls".

2.2.2.2.  Legacy Controls

   Aside from the useful controls, the control codes are mostly obsolete
   and generally lack interoperable semantics.  This document uses the
   phrase "legacy controls" to describe control codes that are not
   useful controls.

   Because the code points for C0 controls include the 32 smallest
   integers including zero, they are likely to occur in data as a result
   of programming errors.
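
   A minimal Python sketch of the distinction between useful and
   legacy controls follows.  The names are hypothetical, and the
   sketch only restates the ranges given in Section 2.2.2.

```python
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})   # tab, newline, carriage return


def is_control(code_point: int) -> bool:
    """True for the C0 controls, DEL, and the C1 controls."""
    return 0x00 <= code_point <= 0x1F or 0x7F <= code_point <= 0x9F


def is_legacy_control(code_point: int) -> bool:
    """True for control codes other than the three useful controls."""
    return is_control(code_point) and code_point not in USEFUL_CONTROLS


controls = [cp for cp in range(0x100) if is_control(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))   # 65 62
```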

2.2.3.  Noncharacters

   Certain code points are classified as "noncharacters", and [UNICODE]
   asserts repeatedly that they are not designed or used for open
   interchange.

   Code points are organized into 17 "planes", each containing 2^16 code
   points.  The last two code points in each plane are noncharacters:
   U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so
   on, up to U+10FFFE, U+10FFFF.

   The code points in the range U+FDD0-U+FDEF are noncharacters.
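
   The two rules above can be combined into a single test, sketched
   here in Python.  The function name is hypothetical.  Enumerating
   the codespace with it yields 66 noncharacters: two per plane across
   17 planes, plus the 32 code points in U+FDD0-U+FDEF.

```python
def is_noncharacter(code_point: int) -> bool:
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not in the Unicode codespace")
    # The last two code points of each plane end in FFFE or FFFF.
    if (code_point & 0xFFFE) == 0xFFFE:
        return True
    return 0xFDD0 <= code_point <= 0xFDEF


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))              # 66
print(f"U+{noncharacters[-1]:04X}")    # U+10FFFF
```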

3.  Dealing With Problematic Code Points

   [RFC9413], "Maintaining Robust Protocols", provides a thorough
   discussion of strategies for dealing with issues in input data.

   Different types of problematic code points cause different issues.
   Noncharacters and legacy controls are unlikely to cause software
   failures, but they cannot usefully be displayed to humans, and can be
   used in attacks based on attempting to display text that includes
   them.

   The behavior of software which encounters surrogates is unpredictable
   and differs among programming-language implementations, even between
   different API calls in the same language.

   Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence
   which would map to a surrogate is ill-formed.  If a specification
   requires that input data be encoded with UTF-8, and if all input were
   well-formed, implementors would never have to concern themselves with
   surrogates.

   Unfortunately, industry experience teaches that problematic code
   points, including surrogates, can and do occur in program input where
   the source of input data is not controlled by the implementor.  In
   particular, the specification of JSON allows any code point to appear
   in object member names and string values [RFC8259].

   For example, the following is a conforming JSON text:

   {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

   The value of the "example" field contains the C0 control NUL, the C1
   control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired
   surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two
   escaped UTF-16 surrogate code points.  It is unlikely to be useful as
   the value of a text field.  That value cannot be serialized into
   well-formed UTF-8, but the behavior of libraries asked to parse the
   sample is unpredictable; some will silently parse this and generate
   an ill-formed UTF-8 string.
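
   As one data point, the sketch below feeds the sample to the json
   module of the Python standard library.  It illustrates the behavior
   of a single library only; other libraries may reject the text or
   substitute replacement characters.

```python
import json

text = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'
value = json.loads(text)["example"]

# The parser accepts the text and yields four code points, pairing the
# last two escapes and leaving the unpaired surrogate as it is.
print([f"U+{ord(ch):04X}" for ch in value])
# ['U+0000', 'U+0089', 'U+DEAD', 'U+7FFFF']

try:
    value.encode("utf-8")
except UnicodeEncodeError:
    print("cannot be serialized as well-formed UTF-8")
```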

   Two reasonable options for dealing with problematic input are either
   rejecting text containing problematic code points, or replacing the
   problematic code points with placeholders.

   Silently deleting an ill-formed part of a string is a known security
   risk.  Responding to that risk, [UNICODE] section 3.2 recommends
   dealing with ill-formed byte sequences by signaling an error, or
   replacing problematic code points, ideally with "�" (U+FFFD,
   REPLACEMENT CHARACTER), although some popular software platforms,
   notably Java, use "?".
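
   The two options might be sketched in Python as follows.  All names
   are hypothetical, and is_problematic simply restates the
   definitions in Section 2.2; this is an illustration under those
   assumptions, not a recommended implementation.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_problematic(cp: int) -> bool:
    """Surrogates, legacy controls, and noncharacters (Section 2.2)."""
    if 0xD800 <= cp <= 0xDFFF:
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return cp not in (0x09, 0x0A, 0x0D)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def reject_problematic(text: str) -> str:
    """Option 1: signal an error instead of accepting the text."""
    for index, ch in enumerate(text):
        if is_problematic(ord(ch)):
            raise ValueError(
                f"problematic code point U+{ord(ch):04X} at index {index}"
            )
    return text


def replace_problematic(text: str) -> str:
    """Option 2: substitute U+FFFD for each problematic code point."""
    return "".join(
        REPLACEMENT_CHARACTER if is_problematic(ord(ch)) else ch for ch in text
    )


print(replace_problematic("a\x00b\udeadc") == "a\ufffdb\ufffdc")   # True
```

   Neither function deletes anything silently: the first reports the
   offending position, and the second keeps the length of the text
   unchanged.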

4.  Subsets

   This section describes subsets that can be used in specifying
   acceptable content for text fields in protocols and data types.
   Specifications can refer to these subsets by the names "Unicode
   Scalars", "XML Characters", and "Unicode Assignables".

4.1.  Unicode Scalars

   Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode
   scalar value" as "Any Unicode code point except high-surrogate and
   low-surrogate code points."

   The "Unicode Scalars" subset can be expressed as an ABNF production:

   unicode-scalar =
      %x0-D7FF / %xE000-10FFFF  ; exclude surrogates

   This subset is the default for CBOR [RFC8949], and has the advantage
   of excluding surrogates.  However, it includes legacy controls and
   noncharacters.

   This subset is called the UnicodeScalarsClass for use in PRECIS.  Its
   registration template can be found in Section 6.1.
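
   A membership test for this subset is a direct transcription of the
   ABNF.  The Python sketch below uses hypothetical function names.

```python
def is_unicode_scalar(code_point: int) -> bool:
    """Mirror the unicode-scalar production: everything but surrogates."""
    return 0x0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def contains_only_scalars(text: str) -> bool:
    return all(is_unicode_scalar(ord(ch)) for ch in text)


print(contains_only_scalars("caf\u00e9 \U0001F5A4"))   # True
print(contains_only_scalars("\x00\ufffe"))             # True
print(contains_only_scalars("bad \udead"))             # False
```

   The second call returns True because legacy controls and
   noncharacters are Unicode scalar values.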

4.2.  XML Characters

   The XML 1.0 Specification [XML], in its grammar production labeled
   "Char", specifies a subset of Unicode code points that excludes
   surrogates, legacy C0 controls, and the noncharacters U+FFFE and
   U+FFFF.

   The "XML Characters" subset can be expressed as an ABNF production:

   xml-character =
      %x9 / %xA / %xD /   ; useful controls
      %x20-D7FF /         ; exclude surrogates
      %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
      %x10000-10FFFF

   While this subset does not exclude all the problematic code points,
   the C1 controls are less likely than the C0 controls to appear
   erroneously in data, and have not been observed to be a frequent
   source of problems.  Also, the noncharacters greater in value than
   U+FFFF are rarely encountered.

   This subset is called the XMLCharactersClass for use in PRECIS.  Its
   registration template can be found in Section 6.2.
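
   The subset can be checked with a small table of inclusive ranges
   that follows the "Char" production of [XML].  The Python names
   below are hypothetical.

```python
XML_CHARACTER_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0xD7FF),                       # stops before the surrogates
    (0xE000, 0xFFFD),                     # stops before FFFE and FFFF
    (0x10000, 0x10FFFF),                  # planes 1 through 16
)


def is_xml_character(code_point: int) -> bool:
    return any(low <= code_point <= high for low, high in XML_CHARACTER_RANGES)


print(is_xml_character(0x0000))    # False: legacy C0 control
print(is_xml_character(0x0085))    # True: C1 controls are allowed
print(is_xml_character(0xFFFE))    # False: excluded noncharacter
print(is_xml_character(0x1FFFF))   # True: noncharacter above U+FFFF
```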

4.3.  Unicode Assignables

   This document defines the "Unicode Assignables" subset as all the
   Unicode code points that are not problematic.  This subset, which is
   smaller than the others, comprises all code points that are currently
   assigned, excluding legacy control codes, or that might in future be
   assigned.

   Unicode Assignables can be expressed as an ABNF production:

   unicode-assignable =
      %x9 / %xA / %xD /               ; useful controls
      %x20-7E /                       ; exclude C1 controls and DEL
      %xA0-D7FF /                     ; exclude surrogates
      %xE000-FDCF /                   ; exclude FDD0-FDEF nonchars
      %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
      %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
      %x30000-3FFFD / %x40000-4FFFD /
      %x50000-5FFFD / %x60000-6FFFD /
      %x70000-7FFFD / %x80000-8FFFD /
      %x90000-9FFFD / %xA0000-AFFFD /
      %xB0000-BFFFD / %xC0000-CFFFD /
      %xD0000-DFFFD / %xE0000-EFFFD /
      %xF0000-FFFFD / %x100000-10FFFD

   This subset is called the AssignablesClass for use in PRECIS.  Its
   registration template can be found in Section 6.3.
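
   The repeated per-plane ranges lend themselves to being generated
   rather than listed.  The Python sketch below, with hypothetical
   names, builds the same inclusive ranges as the ABNF and tests
   membership against them.

```python
ASSIGNABLE_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0x7E),                         # stops before DEL and C1 controls
    (0xA0, 0xD7FF),                       # stops before the surrogates
    (0xE000, 0xFDCF),                     # stops before FDD0-FDEF
    (0xFDF0, 0xFFFD),                     # stops before FFFE and FFFF
] + [
    # Planes 1 through 16, each without its last two code points.
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def is_unicode_assignable(code_point: int) -> bool:
    return any(low <= code_point <= high for low, high in ASSIGNABLE_RANGES)


print(is_unicode_assignable(0x1F5A4))   # True: assigned character
print(is_unicode_assignable(0xE000))    # True: private-use code point
print(is_unicode_assignable(0x0085))    # False: C1 control
print(is_unicode_assignable(0x7FFFF))   # False: noncharacter
```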

5.  Using Subsets

   Many IETF specifications rely on well-known data formats such as
   JSON, I-JSON, CBOR, YAML, and XML.  These formats specify default
   subsets.  For example, JSON allows object member names and string
   values to include any Unicode code point, including all the
   problematic types.

   A protocol based on JSON can be made more robust and implementor-
   friendly by restricting the contents of object member names and
   string values to one of the subsets described in Section 4.
   Equivalent restrictions are possible for other packaging formats such
   as I-JSON, XML, YAML, and CBOR.

   Note that escaping techniques such as those in the JSON example in
   Section 3 cannot be used to circumvent this sort of restriction,
   which applies to data content, not textual representation in
   packaging formats.  If a specification restricted a JSON field value
   to the Unicode Assignables, the example would remain a conforming
   JSON Text but the data it represents would not constitute Unicode
   Assignable code points.
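
   One way to enforce such a restriction is to validate the decoded
   data rather than the JSON text.  The Python sketch below parses a
   text with the standard json module and then checks every object
   member name and string value against the Unicode Assignables.  The
   function names are hypothetical, and a real protocol would decide
   for itself whether to reject or repair non-conforming input.

```python
import json


def is_unicode_assignable(cp: int) -> bool:
    if 0xD800 <= cp <= 0xDFFF:                                # surrogates
        return False
    if (cp <= 0x1F and cp not in (0x9, 0xA, 0xD)) or 0x7F <= cp <= 0x9F:
        return False                                          # legacy controls
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:     # noncharacters
        return False
    return 0 <= cp <= 0x10FFFF


def strings_in(node):
    """Yield every object member name and string value in parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for name, member in node.items():
            yield name
            yield from strings_in(member)
    elif isinstance(node, list):
        for item in node:
            yield from strings_in(item)


def conforms(json_text: str) -> bool:
    document = json.loads(json_text)
    return all(
        is_unicode_assignable(ord(ch))
        for string in strings_in(document)
        for ch in string
    )


print(conforms(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'))   # False
print(conforms(r'{"example": "caf\u00e9 \ud83d\udda4"}'))           # True
```

   Both inputs use escapes, but only the decoded code points matter:
   the second text decodes to U+00E9 and U+1F5A4, which are Unicode
   Assignables.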

6.  IANA Considerations

   This document defines new PRECIS profiles to be entered into the
   "PRECIS Profiles" registry.

   These profiles are specified with straightforward ABNF expressions.
   This contrasts with the more flexible and expressive approach in
   [RFC8264], but is appropriate because the subsets are much simpler
   and less restrictive.

   Note that these profiles are oblivious to many of the issues that are
   considered in depth in [RFC8264], including case mapping,
   normalization, and directionality.  As noted above, specifications
   for text fields that are designed for display to and interaction with
   humans would benefit from consideration of those issues.

6.1.  UnicodeScalarsClass Profile

   The registration template for this class is as follows:

   Name: UnicodeScalarsClass

   Base Class: None

   Applicability: Protocols that want to include all Unicode code points
   except surrogates

   Replaces: None

   Width Mapping Rule: None

   Additional Mapping Rule: None

   Case Mapping Rule: None

   Normalization Rule: None

   Directionality Rule: None

   Enforcement: Not specified

   Specification: Section 4.1 of this RFC

6.2.  XMLCharactersClass Profile

   The registration template for this class is as follows:

   Name: XMLCharactersClass

   Base Class: None

   Applicability: Protocols that want to allow the same Unicode code
   points that are allowed in XML

   Replaces: None

   Width Mapping Rule: None

   Additional Mapping Rule: None

   Case Mapping Rule: None

   Normalization Rule: None

   Directionality Rule: None

   Enforcement: Not specified

   Specification: Section 4.2 of this RFC

6.3.  Unicode Assignables Profile

   The registration template for this class is as follows:

   Name: AssignablesClass

   Base Class: None

   Applicability: Protocols that want to allow all Unicode code points
   that are currently assigned, or might be assigned in the future, to
   characters that are not "legacy controls" as defined in
   Section 2.2.2.2 of this document.

   Replaces: None

   Width Mapping Rule: None

   Additional Mapping Rule: None

   Case Mapping Rule: None

   Normalization Rule: None

   Directionality Rule: None

   Enforcement: Not specified

   Specification: Section 4.3 of this RFC

7.  Security Considerations

   Section 3 of this document discusses security issues.

   Unicode Security Considerations [TR36] is a wide-ranging survey of
   the issues implementors should consider while writing software to
   process Unicode text.  Many of the attacks it discusses are aimed at
   deceiving human readers, but vulnerabilities involving issues such as
   surrogates and noncharacters are also covered, and in fact can
   contribute to human-deceiving exploits.

   The Security Considerations in Section 12 of [RFC8264] generally
   apply to this document as well.

   Note that the Unicode-character subsets specified in this document
   include a successively-decreasing number of problematic code points,
   and thus should be less and less susceptible to vulnerabilities.  The
   Section 4.3 subset, "Unicode Assignables", excludes all of them.
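
   The decrease can be quantified.  The Python sketch below, which
   uses hypothetical names and restates the definitions of Sections
   2.2 and 4, counts how many problematic code points each subset
   still admits.

```python
def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def is_legacy_control(cp):
    return (cp <= 0x1F or 0x7F <= cp <= 0x9F) and cp not in (0x9, 0xA, 0xD)


def is_noncharacter(cp):
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_problematic(cp):
    return is_surrogate(cp) or is_legacy_control(cp) or is_noncharacter(cp)


def is_unicode_scalar(cp):
    return not is_surrogate(cp)


def is_xml_character(cp):
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def is_unicode_assignable(cp):
    return not is_problematic(cp)


PROBLEMATIC = [cp for cp in range(0x110000) if is_problematic(cp)]


def admitted(subset):
    """Number of problematic code points that a subset lets through."""
    return sum(1 for cp in PROBLEMATIC if subset(cp))


print(len(PROBLEMATIC))                  # 2176
print(admitted(is_unicode_scalar))       # 128
print(admitted(is_xml_character))        # 97
print(admitted(is_unicode_assignable))   # 0
```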

8.  Normative References

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC8264]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              RFC 8264, DOI 10.17487/RFC8264, October 2017,
              <https://www.rfc-editor.org/info/rfc8264>.

   [TR36]     The Unicode Consortium, "Unicode Security Considerations",
              <https://www.unicode.org/reports/tr36/>.  Note that this
              reference is to the latest version of this document,
              rather than to a specific release.  It is not expected
              that future updates will affect the referenced
              discussions.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard",
              <http://www.unicode.org/versions/latest/>.  Note that this
              reference is to the latest version of Unicode, rather than
              to a specific release.  It is not expected that future
              changes in the Unicode Standard will affect the referenced
              definitions.

9.  Informative References

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277,
              January 1998, <https://www.rfc-editor.org/info/rfc2277>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

   [RFC8949]  Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", STD 94, RFC 8949,
              DOI 10.17487/RFC8949, December 2020,
              <https://www.rfc-editor.org/info/rfc8949>.

   [RFC9413]  Thomson, M. and D. Schinazi, "Maintaining Robust
              Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023,
              <https://www.rfc-editor.org/info/rfc9413>.

   [XML]      Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F.
              Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
              Edition)", 26 November 2008,
              <http://www.w3.org/TR/2008/REC-xml-20081126/>.  Note that
              this reference is to a specific release, based on a
              history of previous "Edition" releases having changed this
              production.

Acknowledgements

   Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata
   Report against RFC 8259, The JavaScript Object Notation, noting
   frequent references to "Unicode characters", when in fact the RFC
   formally specifies the use of Unicode Code Points.

   Thanks also to Asmus Freytag for careful review and many constructive
   suggestions aimed at making the language more consistent with the
   structure of the Unicode Standard.

   Thanks also to James Manger for the correctness of the ABNF and JSON
   samples.

   Thanks also to Peter Saint-Andre for harmonization with PRECIS.

   This document got a great deal of thoughtful discussion during the
   late stages of review which helped tighten up wording and make
   difficult points clearer.

Authors' Addresses

   Tim Bray
   Textuality Services
   Email: tbray@textuality.com

   Paul Hoffman
   ICANN
   Email: paul.hoffman@icann.org
