Note: This ballot was opened for revision 01 and is now closed.
Checking a few minor points with authors before approval.
No Objection in the sense of "I have read the document, but it is not in an area of expertise in which I can comment authoritatively."
I am also confused about the track: the Datatracker says the intended status is Informational, but the document says Standards Track.
I agree with Adam's DISCUSS points, although it was not clear to me whether public dictionaries are expected. Others have already pointed out that the document says "Standards Track". Otherwise, I have a few mostly editorial comments:

§2.1 - s/indepedently/independently

§2.1.1, last paragraph - s/origina/original

§2.1.1.1.1.3 - "The value of this bit should be set to zero": I know this isn't using 2119 language, but is the plain-English meaning of "should" what you had in mind? Also, why is this stated differently from §2.1.1.1.1.4? (Also also, what's the IETF record for section nesting depth?)

§2.1.1.2.2, reserved - The text says "this value cannot be used with the current specification". What happens if you get a block that used it?

§2.1.1.2.3 - "always strictly less than": Is this really true? Is there no possible input (e.g., already-compressed text) where the decompressed and compressed sizes are equal?

§4:
- "Usual precautions": does that refer to the subsequent paragraphs, or to something else?
- Any chance of a citation for "fuzz-test"?
- Last paragraph, last sentence: forward-looking predictions in RFCs have a habit of becoming dated, one way or another.

§6.2 - Are none of these references properly normative?
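On the "always strictly less than" question: as I understand the format, the invariant can only hold because encoders fall back to a Raw_Block whenever compression does not strictly shrink the content, e.g. for already-compressed input. A sketch of that encoder-side fallback (the function name and return shape are illustrative, not from the spec):

```python
def choose_block_type(raw: bytes, compressed: bytes):
    # Hypothetical encoder-side fallback: a Compressed_Block is emitted
    # only when compression strictly shrinks the content; otherwise the
    # data goes out as a Raw_Block. This is what would keep the
    # compressed size "always strictly less than" the decompressed size
    # on the wire, even for incompressible input.
    if len(compressed) < len(raw):
        return ("Compressed_Block", compressed)
    return ("Raw_Block", raw)
```

In other words, equal sizes would never appear on the wire for a Compressed_Block, but only because the encoder is obliged to avoid them; the text could say so explicitly.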
I'll let you folks chat with Mirja about the larger topic of requirements language, but I note that this formulation:

   2.1.1.1.1.3. Unused Bit

   The value of this bit should be set to zero. A decoder compliant
   with this specification version shall not interpret it. It might be
   used in a future version, to signal a property which is not
   mandatory to properly decode the frame.

really doesn't protect that bit for future use. I don't care whether it's "MUST be set to zero by encoders and ignored by decoders" or "is always set to zero in this version of the algorithm", but a weaker constraint doesn't prevent implementers from squatting on this bit now. You know, something like the next subsection:

   2.1.1.1.1.4. Reserved Bit

   This bit is reserved for some future feature. Its value must be
   zero. A decoder compliant with this specification version must
   ensure it is not set. This bit may be used in a future revision, to
   signal a feature that must be interpreted to decode the frame
   correctly.

This text:

   For improved interoperability, decoders are recommended to be
   compatible with Window_Size >= 8 MB, and encoders are recommended to
   not request more than 8 MB. It's merely a recommendation though, and
   decoders are free to support larger or lower limits, depending on
   local limitations.

is pretty clear about the motivation for limiting Window_Size to 8 MB, and about why an implementation might want to use a smaller Window_Size, but is there anything you could say about why an implementation might want to use a larger Window_Size?

In this text:

   2.4. Entropy Encoding

   Two types of entropy encoding are used by the Zstandard format: FSE,
   and Huffman coding.

could you give any guidance about why you might choose one format over the other?

Is the meaning of "under control of a third party" well understood?

   One should never compress together a message whose content must
   remain secret with a message under control of a third party.

I might be able to guess at a precise definition, but I'd be guessing.

I'm wondering if you really want to remove all of:

   5. Implementation Status

   [RFC EDITOR: Please remove this section prior to publication.]

   Source code for a C language implementation of a "Zstandard"
   compliant library is available at [ZSTD-GITHUB]. This implementation
   is production ready, implementing the full range of the
   specification. It is tested against security hazards, and widely
   deployed within Facebook infrastructure.

given that this text:

   2.5. Dictionary Format

   (snip)

   However, dictionaries created by "zstd --train" in the reference
   implementation follow a specific format, described here.

refers to what I'm assuming is the same reference implementation (but I can't be sure, because there's no reference pointer in the Section 2.5 usage).

(I was on the IESG and balloted Yes for https://datatracker.ietf.org/doc/rfc7942/, so I understand that it says to delete Implementation Status sections before publishing as an RFC, but I don't think pointers to a reference implementation fall into the same category as the typical "so far, X, Y, and Z have implemented this protocol" sections that are instantly outdated. A pointer to a reference implementation sounds more useful for future readers. At a minimum, though, adding a reference pointer to the Section 2.5 occurrence would be useful, since that's the first time a reference implementation is mentioned.)
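Returning to the Unused/Reserved bits above, the encode/decode asymmetry being asked for can be made concrete. A sketch of a strict decoder's handling of the Frame_Header_Descriptor byte (bit positions are from the published layout; the function name and dict keys are illustrative):

```python
def parse_frame_header_descriptor(fhd: int) -> dict:
    """Illustrative strict parse of the Frame_Header_Descriptor byte.

    Bit layout per the spec: 7-6 Frame_Content_Size_flag,
    5 Single_Segment_flag, 4 Unused, 3 Reserved,
    2 Content_Checksum_flag, 1-0 Dictionary_ID_flag.
    """
    if fhd & 0x08:
        # Reserved bit: a compliant decoder refuses the frame, since a
        # future feature signaled here must be understood to decode.
        raise ValueError("Reserved bit set; frame requires a future version")
    # Unused bit (0x10): set to zero on encode, ignored on decode, so
    # no implementation can squat on it today.
    return {
        "Frame_Content_Size_flag": (fhd >> 6) & 0x3,
        "Single_Segment_flag": (fhd >> 5) & 0x1,
        "Content_Checksum_flag": (fhd >> 2) & 0x1,
        "Dictionary_ID_flag": fhd & 0x3,
    }
```

The point of the ballot comment is the two different behaviors in the sketch: reject on Reserved, silently ignore on Unused, with encoders obliged to emit zero for both.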
Thanks for addressing my DISCUSS (and for pointing out that my other point was not grounded in fact)! I think I had intended to switch to Yes, but since I was so slow to respond to the author comments, I have forgotten enough about the document that I will just go with No Objection instead, to avoid any further delay. Original comments preserved below:

Some high-level comments:

After reading Section 2.4.2.1 (and subsections), I'm not sure I fully understand the procedures for the Huffman coding. Granted, the procedure is necessarily complicated in order to obtain efficient compression ratios, so arguably there is some onus on me as the reader to work harder to understand it. Having said that, I think the mapping from weights to prefixes in the prefix code makes sense to me, but I'm not sure I understand how the weights are assigned to their respective symbols/literals (i.e., what the prefix decodes to). My current reading of the text is that weights are given for literals 0 through (last literal - 1), and in particular, if I want to use raw (non-FSE) weights, I can only cover 128 literals this way. So are the symbols/literals just the values from 0 to 127? That seems unlikely to be correct, given that we want to compress data containing bytes with the high bit set, but I'm unsure where I'm going astray.

A link to some dedicated/official test vectors seems like it would be useful, if test vectors themselves are seen as too bulky to include in the document.

The text about third-party dictionaries in Section 4 makes me wonder if it should be stated explicitly that "dictionary input to decompression should be treated as similarly untrusted to input compressed frames, with the appropriate bounds checking performed on data accesses as needed".

In Appendix B, I don't see a reference to match up with the chapter "from normalized distribution to decoding tables".

Some more minor section-by-section comments follow.
Section 2.1.1

Should the frame-structure diagram list "0 or 4 bytes" for the content checksum?

   Content_Checksum: An optional 32-bit checksum, only present if the
   Content_Checksum_flag is set. The content checksum is the result of
   the xxh64() hash function [XXHASH] digesting the origina

"original"

Section 2.1.1.1.1.2

   In this case, Window_Descriptor byte is skipped, but

"the Window_Descriptor byte"

What's the difference between "Unused" and "Reserved", viz. 2.1.1.1.1.3 and 2.1.1.1.1.4?

Section 2.1.1.1.1.6

   This is a two-bit flag (= FHD & 3)

"two-bit flag (whose value is obtained by masking the frame header descriptor with 0x3)"

Section 2.1.1.1.2

   Provides guarantees on minimum memory buffer required to decompress
   a frame. [...]

"the minimum memory buffer"

Section 2.1.1.1.3

   Field size depends on Dictionary_ID_flag. [...]

"The field size depends on the"

How are dictionary IDs allocated and standardized?

Section 2.1.1.2.2

   RLE_Block: This is a single byte, repeated Block_Size times.
   Block_Content consists of a single byte. On the decompression side,
   this byte must be repeated Block_Size times.

I think we need normative language here: "the decoder MUST verify [...]".

Section 2.1.1.2

This "all previously decoded data" is only within a frame, right? It might be worth reiterating that here.

Section 2.1.1.3.1.1

For the Size_Format of Raw_Literals_Block and RLE_Literals_Block, I think we need to say explicitly: "when parsing the flags bit by bit, if the low-order bit of the Size_Format field is zero, the field is only one bit, and processing proceeds to the next bitfield. Regenerated_Size uses [...]"

Section 2.1.1.3.2.1

Please expand FSE on first use.

Section 2.4.1.1

   Finally, the decoder can tell how many bytes were used in this
   process, and how many symbols are present. The bitstream consumes a
   round number of bytes. Any remaining bit within the last byte is
   simply unused.

Usually we say "set to zero on encode, ignore on decode" to make things deterministic.

Section 2.4.2

   When decompressing, the last byte containing the padding is the
   first byte to read. The decompressor needs to skip 0-7 initial
   0-bits and the first 1-bit lt occurs. Afterwards, the useful part of
   the bitstream begins.

I guess "lt" is supposed to be "that"?

Section 2.4.2.1

   The tree depth is 4, since its smallest element uses 4 bits.

It's not entirely clear to me what "smallest" means here; presumably "lowest in the tree", though that is rather tautological in this context. "Smallest weight" would make sense, except there is no weight column in the table.

Section 2.4.2.1.3

   if Number_of_Bits != 0
       Number_of_Bits = Max_Number_of_Bits + 1 - Weight

Should the first Number_of_Bits be Weight?
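Two of the points above, sketched as code (function names are illustrative, not from the spec): the RLE_Block length check that arguably deserves a MUST, and the weight-to-bits conversion with the condition testing Weight rather than Number_of_Bits:

```python
def decode_rle_block(block_content: bytes, block_size: int) -> bytes:
    # The check the review asks to make normative: Block_Content of an
    # RLE_Block is exactly one byte; anything else is a corrupt frame.
    if len(block_content) != 1:
        raise ValueError("corrupt frame: RLE_Block content is not one byte")
    # The single byte is regenerated Block_Size times.
    return block_content * block_size


def number_of_bits(weight: int, max_number_of_bits: int) -> int:
    # With the suggested correction, a zero Weight means the symbol is
    # absent and gets no code; otherwise the smallest weight yields the
    # longest code.
    if weight != 0:
        return max_number_of_bits + 1 - weight
    return 0
```

With Max_Number_of_Bits = 4 as in the quoted example, a weight-1 symbol gets a 4-bit code, which is consistent with "the tree depth is 4, since its smallest element uses 4 bits" if "smallest" indeed means smallest weight.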
Unfortunately there is no shepherd write-up, so I have a couple of questions/comments:

1) What's the reason that this document is submitted as "Standards Track"? Is there a working group that is planning to use this mechanism? Or why does it need to be in the IETF / have IETF consensus?

2) I think this document would benefit from the use of normative language.

3) Why is there a normative reference to a website? I would think the document should be (and is) describing the compression mechanism comprehensively, without the need to consult a website that might not even provide a stable reference.

Sorry, one more comment I forgot earlier:

4) Should the User_Data frame be further discussed in the security section, since it can carry arbitrary information that can be security-relevant or privacy-sensitive?
I share Adam's concern and support his DISCUSS. FWIW, I would also be happy with a clarification along the lines of what Adam suggested.
Thanks for addressing my discuss and comments.