                            Unicode in ABNF

   This experimental document adds support for Unicode strings in ABNF
   (Augmented Backus-Naur Form), and provides certain symbols related to
   Unicode code point ranges.

1.  Introduction

   Augmented Backus-Naur Form (ABNF) [RFC5234] is a formal syntax that
   is widely used in Internet specifications, often together with the
   Core Rules defined in Appendix B.1 of [RFC5234]. ABNF is defined in
   terms of ASCII [ASCII86, RFC0020]; however, Unicode [UNICODE] has
   become increasingly popular--even required--as the Internet has
   evolved over the last two decades. Unicode (as UTF-8) will be
   permitted in the RFC series [IABNA], while [RFC5198] established
   Net-Unicode as the standard form for the use of Unicode as "network
   text". Protocols that were originally ASCII-based have been, or are
   being, extended to support Unicode. However, protocols that use
   Unicode in some way (e.g., by permitting UTF-8 content in a
   production) use differing ABNF expressions to do so. Some of these
   expressions do not conform to Version 9.0.0 of the Unicode Standard
   and could therefore introduce interoperability or security problems.

   Many parties have expressed interest in incorporating [UNICODE] into
   ABNF, yet the questions remain: "How?" and "To what extent?"

   This document proposes standardized techniques for expressing Unicode
   code points using ABNF. This document is deliberately conservative
   in its approach: a conforming implementation needs to know only how
   to map between the Unicode scalar values and any Unicode encoding
   form. The Unicode Character Database (UCD, Section 4.1 of [UNICODE])
   is intentionally not required. ABNF text that uses the syntax
   defined in this document needs to be in a Unicode encoding form
   (Conformance Clause D89 of [UNICODE]), but ABNF text that uses only
   the rules or terminal values can be expressed in ASCII [RFC0020].
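
   As an illustration of the last point, the following fragment is
   ABNF text that stays entirely within ASCII yet denotes a non-ASCII
   code point by its terminal value, assuming (as Section 2 specifies)
   that Unicode is the coded character set. This is a sketch only: the
   rule name "euro-example" is hypothetical and is not defined by
   [RFC5234] or by this document.

      euro-example = %x20AC   ; U+20AC EURO SIGN, written in ASCII only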

2.  Unicode Code Points in ABNF

   (Consult Section 2.3 of [RFC5234] in relation to this paragraph.)
   Unicode has been expressed in several different ways in RFCs to date.
   This document establishes that in contexts where Unicode is specified
   as the coded character set [RFC2130], the terminal values %x00-10FFFF
   are to be used to represent the Unicode code points. Only the Unicode
   scalar values are to be used in specifications that follow this
   document; surrogate code points (%xD800-DFFF) are not to be used
   [[NB: directly]]. This technique aligns ABNF with W3C EBNF [XMLEBNF]
   and Unicode EBNF [UNICODE].
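
   As a minimal sketch of the restriction above, the Unicode scalar
   values can be written as two terminal-value ranges that skip the
   surrogate code points. The rule names below are hypothetical,
   chosen for illustration only; they are not defined by [RFC5234] or
   by this document.

      ; All Unicode code points, U+0000 through U+10FFFF
      code-point-example   = %x00-10FFFF

      ; Scalar values only: the code points minus %xD800-DFFF
      scalar-value-example = %x00-D7FF / %xE000-10FFFF

   Under this reading, a production meant to accept any single
   character would draw on the second rule rather than the first, so
   that surrogate code points never appear directly in the grammar.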

   (Consult Section 2.4 and Appendix B.2 of [RFC5234] in relation to
   this paragraph.)
   In contexts where Unicode is specified as the character set, the
