Zstandard Compression and the application/zstd Media Type
RFC 8478

Note: This ballot was opened for revision 01 and is now closed.

Alexey Melnikov Yes

Comment (2018-07-13 for -02)
No email
send info
Checking a few minor points with authors before approval.

Ignas Bagdonas No Objection

Comment (2018-05-24 for -01)
No email
send info
No objection as in "have read the document but it is not in the domain of expertise that I can authoritatively comment on".

Deborah Brungard No Objection

Comment (2018-05-21 for -01)
No email
send info
Confused also on the track. Datatracker says intended status is Informational, but the document says Standards Track.

(Ben Campbell) No Objection

Comment (2018-05-23 for -01)
No email
send info
I agree with Adam's discuss points, although it was not clear to me if public dictionaries are expected.

Others have already pointed out the document says "standards track". Otherwise I have a few mostly editorial comments:

§2.1 - s/indepedently/independently
§2.1.1, last paragraph -  s/origina/original
§  "The value of this bit should be set to zero"- I know this isn't using 2119 language--but is the plain-English meaning of "should" what you had in mind? Also, why is this stated differently from § (Also also, what's the IETF record for section nesting depth?)
§, reserved -  The text says "this value cannot be used with the current specification". What if you get a block that used it?
§ : "always strictly less than" - Is this really true? There's no possible input (e.g. already compressed text) where the decompressed and compressed sizes are equal?

- "Usual precautions"- does that refer to the subsequent paragraphs, or something else?
- Any chance of a citation for "fuzz-test"?
- Last paragraph, last sentence: Forward looking predictions in RFCs have a habit of becoming dated, one way or another.

§6.2 - Are none of these references properly normative?

Alissa Cooper No Objection

(Spencer Dawkins) No Objection

Comment (2018-05-18 for -01)
No email
send info
I'll let you folks chat with Mirja about the larger topic of requirements language, but I note that this formulation  Unused Bit

   The value of this bit should be set to zero.  A decoder compliant
   with this specification version shall not interpret it.  It might be
   used in a future version, to signal a property which is not mandatory
   to properly decode the frame.

really doesn't protect that bit for future use. I don't care if it's "MUST be set to zero by encoders and ignored by decoders" or "is always set to zero in this version of the algorithm", but a weaker constraint doesn't prevent implementers from squatting on this bit now. 

You know, something like the next subsection:  Reserved Bit

   This bit is reserved for some future feature.  Its value must be
   zero.  A decoder compliant with this specification version must
   ensure it is not set.  This bit may be used in a future revision, to
   signal a feature that must be interpreted to decode the frame

This text 

   For improved interoperability, decoders are recommended to be
   compatible with Window_Size >= 8 MB, and encoders are recommended to
   not request more than 8 MB.  It's merely a recommendation though, and
   decoders are free to support larger or lower limits, depending on
   local limitations.

is pretty clear about the motivation for limiting the Window_Size to 8 MB, and why an implementation might want to use a smaller Window_Size, but is there anything you could say about why an implementation might want to use a larger Window_Size value?

In this text,

2.4.  Entropy Encoding

   Two types of entropy encoding are used by the Zstandard format: FSE,
   and Huffman coding.

could you give any guidance about why you might choose to use one format over another?

Is the meaning of "under control of a third party" well understood?

   One should never compress together a message whose content must
   remain secret with a message under control of a third party. 

I might be able to guess at a precise definition, but I'd be guessing.

I'm wondering if you really want to remove all of 

5.  Implementation Status

   [RFC EDITOR: Please remove this section prior to publication.]

   Source code for a C language implementation of a "Zstandard"
   compliant library is available at [ZSTD-GITHUB].  This implementation
   is production ready, implementing the full range of the
   specification.  It is tested against security hazards, and widely
   deployed within Facebook infrastructure.

given that this text 

2.5.  Dictionary Format


   However, dictionaries created by "zstd --train" in the reference
   implementation follow a specific format, described here.

refers to what I'm assuming is the same reference implementation (but I can't be sure, because there's no reference pointer in the Section 2.5 usage). 

(I was on the IESG and balloted Yes for https://datatracker.ietf.org/doc/rfc7942/, so I understand that this says you delete Implementation Sections before publishing as an RFC, but I don't think pointers to a reference implementation fall into the same category as the typical "so far, X, Y, and Z have implemented this protocol" Implementation Sections that are instantly outdated. A pointer to a reference implementation sounds more useful for future readers. But, at a minimum, adding a reference pointer to the Section 2.5 occurrence would be useful, since that's the first time a reference implementation is mentioned)

Benjamin Kaduk (was Discuss) No Objection

Comment (2018-07-12 for -02)
No email
send info
Thanks for addressing my DISCUSS (and pointing out that my other point was not grounded in fact)!
I think I had intended to switch to Yes, but since I was so slow to respond to the author comments,
I have forgotten enough about the document that I will just go with No Objection instead, to avoid
any further delay.

Original comments preserved below:

Some high-level comments:

After reading Section (and subsections), I'm not sure I
fully understand the procedures for the Huffman coding.  Granted,
in order to obtain the efficient compression ratios the procedure is
necessarily complicated, so arguably there is some onus on me as the
reader to work harder to understand.  Having said that, I think the
mapping from weights to prefixes in the prefix code makes sense to
me, but I'm not sure I understand how the weights are assigned to
their respective symbols/literals (i.e., what the prefix decodes
to).  My current reading of the text is that weights are given for
literals 0 through (last literal - 1), and in particular if I want
to do raw (non-FSE) weights, I can only do 128 literals like this.
So, are the symbols/literals just these values from 0 to 127?  That
seems unlikely to be correct, given that we want to compress data
containing bytes with the high bit set, but I'm unsure where I'm
going astray.

It seems like a link to some dedicated/official test vectors seems
like it would be useful, if test vectors themselves are seen as
being too bulky to include in the document.

The text about third-party dictionaries in Section 4 makes me wonder
if it should explicitly be stated that "dictionary input to
decompression should be treated as being similarly untrusted as
input compressed frames, and the appropriate bounds checking
performed on data accesses as needed".

In Appendix B, I don't see a reference to match up to a chapter
"from normalized distribution to decoding tables".

Some more minor section-by-section comments follow.

Section 2.1.1

Should the frame-structure diagram list "0 or 4 bytes" for the
content checksum?

   Content_Checksum:  An optional 32-bit checksum, only present if the
      Content_Checksum_flag is set.  The content checksum is the result
      of the xxh64() hash function [XXHASH] digesting the origina



   In this case, Window_Descriptor byte is skipped, but

"the Window_Descriptor byte"

What's the difference between "Unused" and "Reserved", viz. and


   This is a two-bit flag (= FHD & 3)

two-bit flag (whose value is obtained by masking the frame header
descriptor with 0x3)


   Provides guarantees on minimum memory buffer required to decompress a
   frame. [...]

"the minimum memory buffer"


   Field size depends on Dictionary_ID_flag. [...]

"The field size depends on the"

How are dictionary (ID)s allocated and standardized?


   RLE_Block:  This is a single byte, repeated Block_Size times.
      Block_Content consists of a single byte.  On the decompression
      side, this byte must be repeated Block_Size times.

I think we need normative language on "the decoder MUST verify


This "all previously decoded data" is only within a frame, right?
It might be worth reiterating that here.


For Size_Format for Raw_Literals_Block and RLE_Literals_Block, I
think we need to explicitly say "when parsing the flags bit-by-bit,
if the low-order bit of the Size_Format field is zero, the field is
only one bit, and processing proceeds to the next bitfield.
Regenerated_Size uses [...]"


Please expand FSE on first use.


   Finally, the decoder can tell how many bytes were used in this
   process, and how many symbols are present.  The bitstream consumes a
   round number of bytes.  Any remaining bit within the last byte is
   simply unused.

Usually we say "set to zero on encode, ignore on decode" to make
things deterministic.

Section 2.4.2

   When decompressing, the last byte containing the padding is the first
   byte to read.  The decompressor needs to skip 0-7 initial 0-bits and
   the first 1-bit lt occurs.  Afterwards, the useful part of the
   bitstream begins.

I guess "lt" is supposed to be "that"?


   The tree depth is 4, since its smallest element uses 4 bits.

It's not entirely clear to me what "smallest" means here --
presumably "lowest in the tree", though that is rather tautological
in this context.  "Smallest weight" would make sense, except there
is not a weight column in the table.


     if Number_of_Bits != 0
         Number_of_Bits = Max_Number_of_Bits + 1 - Weight

Should the first Number_of_Bits be Weight?

Suresh Krishnan No Objection

Mirja Kühlewind No Objection

Comment (2018-05-18 for -01)
No email
send info
Unfortunately there is no shepherd write-up, therefore I have couple of questions/comments:

1) What's the reason that this document is submitted as "Standards Track"? In there a working group that is planning to use this mechanism? Or why does it need to be in the IETF/have IETF consensus?

2) I think this document would benefit from the use of normative language.

3) Why is there a normative reference to a website? I would think the document should be and is describing the compression mechanism comprehensively without the need to have a look at a website that might not even provide a stable reference.

Sorry one more comment I forgot earlier:
4) Should the User_Data Frame be further discussed in the security section as it can carry arbitrary information which can be security-relevant or privacy sensitive...?

(Terry Manderson) No Objection

Alvaro Retana No Objection

Comment (2018-05-24 for -01)
No email
send info
I share Adam's concern and support his DISCUSS.

FWIW, I would also be happy with a clarification along the lines of what Adam suggested.

Adam Roach (was Discuss) No Objection

Comment (2018-07-11 for -02)
No email
send info
Thanks for addressing my discuss and comments.

Martin Vigoureux No Objection