
Early Review of draft-bray-unichars-09
review-bray-unichars-09-genart-early-worley-2024-10-20-00

Request Review of draft-bray-unichars-09
Requested revision 09 (document currently at 11)
Type Early Review
Team General Area Review Team (Gen-ART) (genart)
Deadline 2024-12-01
Requested 2024-10-04
Requested by Orie Steele
Authors Tim Bray, Paul E. Hoffman
I-D last updated 2024-10-20
Completed reviews Genart Early review of -09 by Dale R. Worley (diff)
Artart Early review of -09 by Barry Leiba (diff)
Artart Early review of -09 by Harald T. Alvestrand (diff)
Comments
This document provides general guidance regarding the use of Unicode in protocols.
Please consider the internationalization, interoperability and security implications of the document.
Since this document is AD sponsored, please note the mailing list for discussion is:

https://mailarchive.ietf.org/arch/browse/art/?q=draft-bray-unichars
Assignment Reviewer Dale R. Worley
State Completed
Request Early review on draft-bray-unichars by General Area Review Team (Gen-ART) Assigned
Posted at https://mailarchive.ietf.org/arch/msg/gen-art/35jqGyl3kOELhBAiP5ZfxTjnSm4
Reviewed revision 09 (document currently at 11)
Result Ready w/issues
Completed 2024-10-20
I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other comments.

For more information, please see the FAQ at

<https://wiki.ietf.org/en/group/gen/GenArtFAQ>.

Document:  draft-bray-unichars-09
Reviewer:  Dale R. Worley
Review Date:  2024-10-20
IETF LC End Date:  [not known]
IESG Telechat date:  [not known]

Summary:

    This draft is basically ready for publication, but has a
    considerable number of editorial issues that should be fixed
    before publication.

Editorial comments:

Check whether "numeric values", "code points", and "characters" are
used correctly throughout the document.  I don't have a good sense of
the proper usage of these terms regarding Unicode, but I have a sense
(that might be incorrect) that "code point" is a subclass of "numeric
value", and should always be used when referring to the number
representing a character.

You probably want to ASCII-ize various quote symbols used in the
document.  I'm not sure how the Editor wants to handle the "black
heart" characters, but they are informative examples and ought to be
retained if possible.

1.  Introduction

   This document discusses issues that apply in choosing subsets, names
   two subsets that have been popular in practice, and suggests one new
   subset.  The goal is to provide a convenient target for cross-
   reference from other specifications.

It would be useful to describe here why the newly defined subsets are
superior to the two existing subsets.

Also, this statement is incorrect; the document defines four new
subsets, comprising one base class and three profiles.

1.1.  Notation

   In the text, Unicode’s standard "U+",
   zero-padded to four places, is used.  For example, "A", decimal 65,
   would be expressed as U+0041, and "🖤" (Black Heart), decimal 128,420,
   would be U+1F5A4.

This seems awkward to me.  Perhaps:

   In the text, we use Unicode’s standard notation of "U+" followed by
   four or more hexadecimal digits.  For example, "A", decimal 65,
   is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420,
   is U+1F5A4.
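
As an illustration of the notation described in the suggested text, here
is a minimal Python sketch.  The function name u_plus is hypothetical
and not taken from the draft; the sketch only assumes that "four or more
hexadecimal digits" means zero-padding to at least four uppercase hex
digits.

```python
def u_plus(code_point: int) -> str:
    """Format a code point as "U+" followed by four or more hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # 04X pads to four digits but lets longer values grow naturally.
    return f"U+{code_point:04X}"


print(u_plus(65))      # U+0041
print(u_plus(128420))  # U+1F5A4
```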

--

   The subsets are described both in ABNF and as PRECIS profiles
   [RFC8264].

This is correct, but ... The entire document is organized as being
within the PRECIS conceptual framework, and yet the references to
PRECIS are all phrased as pointers to various parts of the PRECIS
RFCs, not to the whole.  The document should "at the top level" (it
seems like this means in section 1) state that it is part of, or
within, the PRECIS framework, and reference the relevant PRECIS RFCs
at that point.  The later references to PRECIS can then be omitted
unless they are to specific sections of RFCs that are relevant to the
particular reference.

2.  Characters and Code Points

   However, each Unicode character is assigned a code
   point, used to represent the characters in computer memory and
   storage systems and, in specifications, to specify allowed subsets.

This is an awkward mix of singular and plural usages.  Inquire of
Editor the best way to phrase this.

   Section 6.1 defines a new PRECIS base class that encompasses all
   Unicode code points.  This base class is used for the PRECIS profiles
   for the subsets defined in this document.

Would be a little clearer as

   Section 6.1 defines a new PRECIS base class, UnicodeBaseClass, that
   encompasses all Unicode code points.  UnicodeBaseClass is used for
   the PRECIS profiles for the subsets defined in this document.

Also, "used for" could probably be replaced with a more specific term
describing the relationship between a base class and a profile.

2.1.  Transformation Formats

   However, it is useful
   to note that the "UTF-16" format represents each code point with one
   or two 16-bit chunks, and the “UTF-8” format uses variable-length
   byte sequences.

I think the usual terminology would be "variable-length sequences of
8-bit chunks" or better "variable-length sequences of octets".
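
To make the difference between the two formats concrete, the following
Python sketch counts 16-bit code units and octets for a few sample
characters.  The helper name and the choice of samples are mine, not
the draft's.

```python
def code_unit_counts(ch: str) -> tuple:
    """Return (UTF-16 code units, UTF-8 octets) for one character."""
    # "utf-16-be" emits no byte order mark, so every 2 bytes is one unit.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    utf8_octets = len(ch.encode("utf-8"))
    return utf16_units, utf8_octets


for ch in ("A", "\u00e9", "\u20ac", "\U0001F5A4"):
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

With these samples, UTF-16 uses one unit for everything in the Basic
Multilingual Plane and two units for U+1F5A4, while the UTF-8 length
ranges from one to four octets.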

2.2.  Problematic Code Points

   [...] would benefit from careful consideration of the issues
   described by PRECIS; [...]

It seems to me this ought to specify where these issues are described.

   Definition D10a in section 3.4 of [UNICODE] defines seven code point
   types.  Three types of code points are assigned to entities which are
   not actually characters or whose value as Unicode characters in text
   fields is questionable: "Surrogate", "Control", and "Noncharacter".
   In this document, "problematic" refers to code points whose type is
   "Surrogate" or "Noncharacter", and to "legacy controls" as defined in
   Section 2.2.2.2.

Given that "section 3.4" at the beginning of the paragraph refers to
[UNICODE], it might be clearer to say "as defined in Section 2.2.2.2
of this document" or "as defined in Section 2.2.2.2 below".

2.2.1.  Surrogates

   A total of 2,048 code points, in the range U+D800-U+DFFF, are divided

Since "the range" consists of 2,048 code points, this can be said more
exactly:

   A total of 2,048 code points, the range U+D800-U+DFFF, are divided

Also, doesn't "total" take a singular verb?  Or is that an
Americanism?
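
The arithmetic behind "2,048" can be checked directly.  The sketch below
also shows how the Unicode Standard's UTF-16 encoding form uses the two
halves of the range (high and low surrogates) to encode a supplementary
code point; the function name is illustrative only.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # U+D800-U+DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # U+DC00-U+DFFF

print(0xDFFF - 0xD800 + 1)                         # 2048
print(len(HIGH_SURROGATES), len(LOW_SURROGATES))   # 1024 1024


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = to_surrogate_pair(0x1F5A4)
print(f"{high:04X} {low:04X}")  # D83D DDA4
```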

2.2.2.2.  Legacy Controls

   Aside from the useful controls, the control codes are mostly obsolete

I think you need to capitalize "Control Codes" here.

2.2.3.  Noncharacters

Rule D15 of section 3.4 of Unicode 15.0.0 shows "noncharacter" as not
intrinsically capitalized in Unicode usage, but rule D10a shows
"Noncharacter" as intrinsically capitalized.  Perhaps ask the Editor
about this.

3.  Dealing With Problematic Code Points

   [RFC9413], "Maintaining Robust Protocols", provides a thorough
   discussion of strategies for dealing with issues in input data, for
   example problematic code points.

Probably better to use "including" in place of "for example".

   [...] can be
   used in attacks based on misleading human readers of text that
   attempt to display them [TR36].

Text does not itself attempt anything.  Better is "attacks
based on attempting to display text that includes them".

   [...] differs in programming-language implementations [...]

I would say "differs between".

   Thus, in theory, if a
   specification requires that input data be encoded with UTF-8,
   implementors should never have to concern themselves with surrogates.

This sentence doesn't make sense to me.  If a specification requires
something, there is no "in theory" which implies that the input data
will conform to the specification.  Perhaps something like

   Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence
   which would map to a surrogate is ill-formed.  If a specification
   requires that input data be encoded with UTF-8, and all input were
   well-formed, implementors would never have to concern themselves
   with surrogates.

But it's not clear to me that the second sentence adds any useful
information.  It seems that the paragraph could just continue with the
next sentence:

   Unfortunately, industry experience teaches that problematic code
   points, including surrogates, can and do occur in program input where
   the source of input data is not controlled by the implementor.

If the source of the data is controlled by the implementor, it isn't
"input".  So it seems to me that "where the source of input data is
not controlled by the implementor" can be omitted.
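
As a concrete illustration of the point that a UTF-8 byte sequence
which would map to a surrogate is ill-formed, here is a small Python
sketch.  It shows only how one strict decoder (Python's built-in
codec) behaves; other implementations may differ, which is part of why
such input still needs a defined handling policy.  The function names
are mine.

```python
# The three-octet pattern that would decode to U+D800 if it were allowed.
ENCODED_SURROGATE = b"\xed\xa0\x80"


def decode_or_reject(data: bytes) -> str:
    """Strict decoding: raises UnicodeDecodeError on ill-formed input."""
    return data.decode("utf-8", errors="strict")


def decode_or_replace(data: bytes) -> str:
    """Lenient decoding: ill-formed octets become U+FFFD placeholders."""
    return data.decode("utf-8", errors="replace")


try:
    decode_or_reject(ENCODED_SURROGATE)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

replaced = decode_or_replace(ENCODED_SURROGATE)
print(all(ch == "\ufffd" for ch in replaced))  # True
```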

   In
   particular, the specification of JSON allows any code point to appear
   in object member names and string values [RFC8259]; the following is
   a conforming JSON text:

It seems like this should start a new paragraph, and be prefixed with
"For example,".

   Reasonable options for dealing with problematic input include, first,
   rejecting text containing problematic code points, and second,
   replacing them with placeholders.  (As an exception, [UNICODE] notes
   that it may in some cases be appropriate, specifically for
   noncharacters, to treat them as non-problematic unassigned code
   points.)

I think you can omit "As an exception", since the parenthesized
sentence already contains "may in some cases be appropriate".

   Silently deleting an ill-formed part of a string is a known security
   risk.

It seems well worth referencing a discussion of the "known security
risk".
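
To show the general shape of the three behaviors discussed here
(reject, replace, silently delete), a hedged Python sketch follows.  It
covers only surrogates and noncharacters, whose ranges are fixed by the
Unicode Standard; the draft's "legacy controls" are deliberately left
out because their definition is not reproduced in this review.  All
names are hypothetical, and the "<script" filter is a toy example.

```python
REPLACEMENT = "\ufffd"


def is_problematic(cp: int) -> bool:
    """Surrogates and noncharacters only (legacy controls omitted)."""
    surrogate = 0xD800 <= cp <= 0xDFFF
    # U+FDD0-U+FDEF plus the last two code points of every plane.
    noncharacter = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return surrogate or noncharacter


def reject(text: str) -> str:
    if any(is_problematic(ord(c)) for c in text):
        raise ValueError("problematic code point in input")
    return text


def replace(text: str) -> str:
    return "".join(REPLACEMENT if is_problematic(ord(c)) else c for c in text)


def delete_silently(text: str) -> str:
    return "".join(c for c in text if not is_problematic(ord(c)))


hostile = "<scr\ufffeipt>"
print("<script" in hostile)                   # False: a naive filter passes it
print("<script" in delete_silently(hostile))  # True: deletion rebuilt the token
print("<script" in replace(hostile))          # False: the placeholder keeps it split
```

Replacement keeps the two halves apart, whereas deletion can join text
that an earlier check had already inspected and approved.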

   [RFC9413] emphasizes that when encountering problematic input,
   software should consider the field as a whole, not individual code
   points or bytes.

This needs to be clarified; RFC 9413 does not contain the word
"field", and only one instance of "as a whole" (in the phrase
"protocol as a whole").

4.1.  Unicode Scalars

   This subset is called the UnicodeScalarsClass for use in PRECIS.

This is awkward.  Why not:

   This subset is the PRECIS profile UnicodeScalarsClass.

Similarly in sections 4.2 and 4.3.

4.2.  XML Characters

   [...] surrogates, legacy C0 Controls, and the noncharacters U+FFFE [...]

The phrase "legacy C0 Controls" is not defined.  I think you mean "C0
Controls".

4.3.  Unicode Assignables

   This subset comprises
   all code points that are currently assigned, or might in future be
   assigned, to characters that are not legacy control codes.

This is awkward because it seems to be careful to exclude "code points
that might in future be assigned to characters that are legacy control
codes", and of course there are none of those.  Probably better:

   This subset comprises
   all code points that are currently assigned,
   excluding legacy control codes, or that might in future be
   assigned.

5.  Using Subsets

   These formats specify default subsets.

This is unclear.  Do you mean:

   These specifications specify default subsets of Unicode for use in
   their protocols.

--

   Note that escaping techniques such as those in the JSON example in
   Section 3 cannot be used to circumvent this sort of restriction,
   which applies to data content, not textual representation in
   packaging formats.

This could be clarified.  Perhaps

   A restriction placed on the contents of a name or value would not
   be circumventable by an escaping technique (such as those in the
   JSON example in Section 3) because the restriction applies to the
   data content, not the textual representation of the content.
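
A short Python sketch of that distinction, under the assumption that
the restriction in question is "no surrogates" and that Python's
standard json module is the parser.  The member name and value are made
up for illustration and are not the draft's example.

```python
import json


def has_surrogate(text: str) -> bool:
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# On the wire this is pure ASCII: backslash, "u", four hex digits.
wire_text = '{"name": "\\uD800"}'
print(has_surrogate(wire_text))  # False: the textual representation is clean

# The restriction applies to the data content after unescaping.
content = json.loads(wire_text)["name"]
print(has_surrogate(content))    # True: the decoded value is a lone surrogate
```

A check applied only to the packaging text would miss the surrogate;
applying it to the decoded name or value catches it.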

6.1.  Addition to the PRECIS Base Classes Registry

   Reference: Section 2 of this RFC

This isn't flagged explicitly for Editor/IANA attention.  That may be
OK, but usually these items are marked explicitly.  See also other
occurrences of "this RFC".

6.2.3.  Unicode Assignables Profile

   Applicability: Protocols that want to allow all Unicode code points
   that are currently assigned, or might be assigned in the future, to
   characters that are not "legacy controls" as defined in
   Section 2.2.2.2

It seems like this should be "section 2.2.2.2 of [this RFC]".

Also, see the comment for section 4.3.

7.  Security Considerations

It might be worth pointing to section 3 here, as that section contains
some security considerations, and points to security considerations
documented elsewhere.

   Note that the Unicode-character subsets specified in this document
   include a successively-decreasing number of problematic code points,
   [...]

It might be worth explicitly saying "problematic code points (as
defined in section 2.2)" so section 7 can be read correctly by someone
who hasn't read the rest of the document.

8.  Normative References

   [UNICODE]  The Unicode Consortium, "The Unicode Standard",
              <http://www.unicode.org/versions/latest/>.  Note that this
              reference is to the latest version of Unicode, rather than
              to a specific release.  It is not expected that future
              changes in the Unicode Standard will affect the referenced
              definitions.

It isn't your problem, but currently the URL
<http://www.unicode.org/versions/latest/> goes to a page titled
"Unicode(R) 16.0.0", but that page gives only a summary of changes,
not the contents of Unicode 16.  You have to go to
e.g. <https://www.unicode.org/versions/Unicode15.0.0/> to see the
standard.

[END]