Skip to main content

CBOR: On Deterministic Encoding
draft-bormann-cbor-det-03

Document Type Active Internet-Draft (individual)
Author Carsten Bormann
Last updated 2024-07-21
RFC stream (None)
Intended RFC status (None)
Formats
Additional resources GitHub Repository
Stream Stream state (No stream defined)
Consensus boilerplate Unknown
RFC Editor Note (None)
IESG IESG state I-D Exists
Telechat date (None)
Responsible AD (None)
Send notices to (None)
draft-bormann-cbor-det-03
CBOR                                                          C. Bormann
Internet-Draft                                    Universität Bremen TZI
Intended status: Informational                              21 July 2024
Expires: 22 January 2025

                    CBOR: On Deterministic Encoding
                       draft-bormann-cbor-det-03

Abstract

   CBOR (STD 94, RFC 8949) defines "Deterministically Encoded CBOR" in
   its Section 4.2.  The present document provides additional
   information about use cases, deployment considerations, and
   implementation choices for Deterministic Encoding.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-bormann-cbor-det/.

   Discussion of this document takes place on the Concise Binary Object
   Representation Maintenance and Extensions (CBOR) Working Group
   mailing list (mailto:cbor@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/cbor/.  Subscribe at
   https://www.ietf.org/mailman/listinfo/cbor/.

   Source for this draft and an issue tracker can be found at
   https://github.com/cbor-wg/draft-ietf-cbor-cde.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 22 January 2025.

Bormann                  Expires 22 January 2025                [Page 1]
Internet-Draft          On Deterministic Encoding              July 2024

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Conventions and Definitions . . . . . . . . . . . . . . .   4
   2.  Use Cases for Deterministic Encoding  . . . . . . . . . . . .   5
     2.1.  Diagnostics . . . . . . . . . . . . . . . . . . . . . . .   5
     2.2.  Caching . . . . . . . . . . . . . . . . . . . . . . . . .   6
     2.3.  Security: Signing Inputs  . . . . . . . . . . . . . . . .   6
   3.  Support by Generic Encoders and Decoders  . . . . . . . . . .   7
     3.1.  Basic Support . . . . . . . . . . . . . . . . . . . . . .   7
     3.2.  Application Requirements and Tags . . . . . . . . . . . .   8
       3.2.1.  Example with Tags 0 and 1 (Date/Time) . . . . . . . .   8
       3.2.2.  Example with Major Types 0, 1, and 7, and Tags 2 and
               3 . . . . . . . . . . . . . . . . . . . . . . . . . .   8
   4.  Specification Considerations  . . . . . . . . . . . . . . . .  11
     4.1.  Media Type Considerations . . . . . . . . . . . . . . . .  11
     4.2.  The Need for Maps . . . . . . . . . . . . . . . . . . . .  12
   5.  Implementation Considerations for Deterministic Encoding  . .  12
     5.1.  API Considerations  . . . . . . . . . . . . . . . . . . .  12
     5.2.  Map Key Ordering  . . . . . . . . . . . . . . . . . . . .  13
   6.  Application Profiles of Deterministic Encoding  . . . . . . .  14
     6.1.  The need for CBOR Common Deterministic Encoding (CDE) . .  14
     6.2.  Numeric Reduction in dCBOR  . . . . . . . . . . . . . . .  14
   7.  Using Deterministically Encoded CBOR as a Deterministic
           Encoding of JSON  . . . . . . . . . . . . . . . . . . . .  15
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  16
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  16
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  16
     10.1.  Normative References . . . . . . . . . . . . . . . . . .  16
     10.2.  Informative References . . . . . . . . . . . . . . . . .  17
   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  19
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  19

Bormann                  Expires 22 January 2025                [Page 2]
Internet-Draft          On Deterministic Encoding              July 2024

1.  Introduction

   The Concise Binary Object Representation (CBOR, [STD94] as documented
   in RFC 8949) is a data format whose design goals include the
   possibility of extremely small code size, fairly small message size,
   and extensibility without the need for version negotiation.

   In many cases, CBOR allows some information to be encoded in several
   variants, which provide different amounts of space and thus lengths
   in Bytes.  The encoder is generally free to choose the length that is
   most practical for it (with the constraint, of course, that the data
   need to fit).  For most encoders, it is natural to always choose the
   shortest form available (essentially avoiding leading zeros).
   Section 4.1 (Preferred Serialization) of RFC 8949 [STD94] names this
   practice and provides additional guidance for CBOR implementations;
   another term in use is "Preferred Encoding".

   Section 4.2 (Deterministically Encoded CBOR) of RFC 8949 [STD94] goes
   beyond the Preferred Serialization practice by providing rules for
   _Deterministic Encoding_. The objective of Deterministic Encoding is
   to, deterministically, always produce the same encoding for data
   items that are equivalent at the data model level.  To achieve this,
   Preferred Serialization is mandated, an encoding choice intended for
   incremental encoding (indefinite length encoding) is disabled, and
   additional effort is expended for encoding key/value pairs in maps
   (the order of which does not matter semantically) in a deterministic
   order.

   Given that additional effort needs to be expended and/or
   implementation choices are taken away, neither Preferred
   Serialization nor Deterministic Encoding are mandatory in CBOR.
   (Contrast this with UTF-8 (Section 3 of RFC 3629 [STD63]), which is
   always treating as "invalid" any encoding variants that are longer
   than necessary.)

Bormann                  Expires 22 January 2025                [Page 3]
Internet-Draft          On Deterministic Encoding              July 2024

   Deterministic Encoding is defined in Section 4.2 of RFC 8949 [STD94]
   (note that Section 4.2.3 of RFC 8949 [STD94] defines a variant that
   was needed at the time for backward compatibility and will not be
   discussed further in this document).  The present document elaborates
   on this normative definition by providing additional information
   about use cases, deployment considerations, and implementation
   choices for Deterministic Encoding; it is an informational document
   that however may still be cited where a single reference for the
   background of Deterministic Encoding is convenient.  This document is
   intended to be used in conjunction with CBOR Common Deterministic
   Encoding (CDE, [I-D.ietf-cbor-cde]), a normative specification for a
   deterministic encoding profile that was developed in order to allow
   generic CBOR implementations to provide common support for a variety
   of applications of deterministic encoding.

1.1.  Conventions and Definitions

   The definitions of [STD94] apply.  Readers are expected to be
   familiar with CBOR, and particularly so with Sections 4.1 and 4.2 of
   RFC 8949 [STD94].

   The following terms introduced in the text of [STD94] receive their
   own separate definitions here:

   Preferred Serialization:  a set of choices made during Serialization
      (Encoding) that generally leads to shortest-form encodings where a
      choice of encoding lengths is available, without expending
      additional effort on converting between different kinds of data
      item.  See Section 4.1 of RFC 8949 [STD94] and the terms defined
      in that section.  The Preferred Encoding rules for data items in
      the Basic Generic Data Model may be augmented by rules for
      specific Tags, see for instance Section 3.4.3 of RFC 8949 [STD94].

   Preferred Encoding:  Preferred Serialization

   Deterministic Encoding:  An encoding process that employs Preferred
      Serialization and makes additional decisions to always
      (deterministically) lead to the exact same encoding for equivalent
      inputs at the data model level.  Similar to Preferred
      Serialization, the equivalence model as defined for the Basic
      Generic Data Model may be augmented by equivalence rules defined
      for specific Tags (see also Section 2.1 of RFC 8949 [STD94]).

   In this document, CBOR data items at the data model level are
   represented in the CBOR diagnostic notation (Section 8 of RFC 8949
   [STD94] as extended by Appendix G of [RFC8610], further elaborated in
   [I-D.ietf-cbor-edn-literals]), abbreviated with "EDN" (extended
   diagnostic notation).

Bormann                  Expires 22 January 2025                [Page 4]
Internet-Draft          On Deterministic Encoding              July 2024

   While this document is informative, it does use certain keywords to
   indicate practical requirements for interoperability.  The key words
   "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
   "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in [BCP14]
   (RFC2119) (RFC8174) when, and only when, they appear in all capitals,
   as shown here.

2.  Use Cases for Deterministic Encoding

   Before discussing further details of Deterministic Encoding, we would
   like to point out three areas of use cases, which differ enough in
   the resulting objectives that it is worth to have terminology for
   them.

2.1.  Diagnostics

   In many cases, diagnostic procedures benefit from having available a
   single, easily comparable representation of some data:

   *  Comparing outputs of a test or validation suite during development

      -  CI (Continuous Integration) may capture Deterministically
         Encoded copies of process output, of data in flight or data at
         rest, of specific test output etc.  Being able to compare them
         over time or between systems without differences occurring as
         false positives can help indicate the presence or absence of
         certain problems.

      -  Test vectors and other kinds of tests often represent some
         input and desired output of a transformation.  By making sure
         the output is deterministically encoded, a simple bytewise
         comparison can find out whether the transformation was
         performed successfully.

   *  Improving the presentation of diagnostic information to humans

      By minimizing inconsequential differences between representations
      of similar data, humans may be faster in finding information they
      are interested in.  In particular inconsistent map ordering can
      easily hide information that would have been useful for diagnostic
      purposes.  Transformation to human-readable forms may be easier
      and more useful if there is only one form of representation for
      the interchanged data.

Bormann                  Expires 22 January 2025                [Page 5]
Internet-Draft          On Deterministic Encoding              July 2024

2.2.  Caching

   Many systems cache (memoize) results of a request so they can reply
   with the cached result when the same request comes in again and the
   context of the reply has not changed.

   If two requests that are semantically the same also have the same
   representation, the representation (or its hash) can serve as an
   efficient cache key.  If the request is already encoded
   deterministically, this is by definition the case; alternatively, the
   recipient can re-encode a request with Deterministic Encoding.

   Were the Deterministic Encoding to fail, this could lead to cache
   failures, which could be benign, but also could be specifically
   evoked by an active attacker to degrade a system.

   As usual for deterministically encoded data, not all forms of
   application equivalence imply equivalence at the data model level, so
   some equivalence processing (_deterministic representation_) may be
   required at the application level as well, to achieve equivalent
   representations and thus a good cache hit rate.

2.3.  Security: Signing Inputs

   Security Frameworks such as COSE and JOSE sign or MAC (authenticate
   with a Message Authentication Code, MAC) information in the form in
   which it has actually been interchanged, making representation
   variants less relevant.

   (Note that Section 9 of RFC 9052 [STD96] defines deterministic
   encoding rules for its own derivation of signing inputs from
   interchange data and additional cryptographic parameters; these are a
   compatible subset of the Core Deterministic Encoding Requirements
   specified in Section 4.2.1 of RFC 8949 [STD94] and thus of CDE.)

   However, in some cases, the signing input for a signature or a MAC
   may need to be derived from data at rest and/or specific
   transformations of the data that was interchanged.  Such a
   transformation is fraught with perils at the application level that
   may be exploited by attackers; this problem is outside the scope of
   the present document.  Deterministic Encoding may remove one
   potential source of variability that might make signatures or MACs
   useless between systems.

Bormann                  Expires 22 January 2025                [Page 6]
Internet-Draft          On Deterministic Encoding              July 2024

3.  Support by Generic Encoders and Decoders

   CBOR implementations can be specific to a particular application, or
   they can be _Generic_.  There is a strong incentive to be able to use
   a Generic encoder/decoder across the spectrum of CBOR applications;
   CBOR applications that require specific support from an encoder/
   decoder can considerably reduce the wide implementation support CBOR
   enjoys from existing generic implementations.  So, as a general best
   practice, we want to minimize the number of ways an application may
   need to influence a generic coder/decoder by options, flags,
   switches, etc.

3.1.  Basic Support

   There is some expectation that, barring any particular constraints
   that would make this more difficult than normally, a CBOR encoder
   will use Preferred Encoding, in particular generic encoders.
   Deterministic Encoding, however, will need to be switched on
   explicitly in most implementations.  Note that Preferred Encoding,
   while using the shortest form available for the specific data item to
   be encoded, doesn't have that shortness as the overriding objective:
   Conversions of a data item into a different one to achieve shorted
   encoding are not part of the processing labeled "Preferred Encoding".
   (This is particularly relevant for CBOR's different numeric systems;
   see Section 3.2.2 below.)

   Some applications will also want to check that an encoded input
   actually satisfies the requirements for Deterministic Encoding.  By
   the definition of Deterministic Encoding, this can be done after
   decoding a data item by deterministically encoding the just decoded
   data item and comparing the result with the decoding input.  However,
   specific support for checking immediately in the decoding process can
   be more efficient.

   As a result, support for Deterministic Encoding in generic encoder
   implementations is RECOMMENDED to be provided by a flag to switch on
   (or separate function that enables) Deterministic Encoding.
   Similarly, generic decoders are RECOMMENDED to have a flag to switch
   on/separate function to enable checking for Deterministic Encoding,
   whether that is efficiently implemented during decoding or less
   efficiently by comparing a re-encoding.

Bormann                  Expires 22 January 2025                [Page 7]
Internet-Draft          On Deterministic Encoding              July 2024

3.2.  Application Requirements and Tags

   The definition of Deterministic Encoding can become more complicated
   with the addition of Tags (Section 3.4 of RFC 8949
   [STD94][IANA.cbor-tags]).  Not all tags come with a strong
   equivalence model.  Worse, the equivalence model may be more
   application specific than for basic Deterministic Encoding.

3.2.1.  Example with Tags 0 and 1 (Date/Time)

   For instance, are the following Tag 0 timestamps (expressed in CBOR
   diagnostic notation) equivalent?

   0("2013-10-23T21:52:23Z")
   0("2013-10-23T21:52:23+00:00")
   0("2013-10-23T14:52:23-07:00")

   They all denote the same instance in time, so if that is the relevant
   application semantics, they should all be represented as
   0("2013-10-23T21:52:23Z") in Deterministic Encoding as that is the
   shortest form.  However, they carry additional semantics that may be
   incidental or intentional (the e-mail message from which this date/
   time example was taken originated from California, which then was at
   a time zone the time offset of which is expressed by the -07:00).
   Whether the first two are exactly equivalent or not is the subject of
   Section 2 of [RFC9557].

   If the additional semantics conveyed by the time-offset (Section 5.6
   of [RFC3339]) is not relevant to the application, an application-
   specific rule may be needed to convert text-based timestamps into the
   "Z" form before encoding.  Some applications may also process this
   timestamp as 1(1382565143), losing the additional semantics as well,
   and using a quite different form.  Is that maybe an even better
   Deterministic Encoding?  (Note that 0("2016-12-31T23:59:60Z") does
   not have an equivalent form with Tag 1, so the application can either
   decide to never use such a date/time, or to exceptionally encode the
   rare leap second with Tag 0.)

3.2.2.  Example with Major Types 0, 1, and 7, and Tags 2 and 3

      |  In some of the following examples, we will use the number
      |  65 536 000 000 (or its floating-point form, 65536000000.0, in
      |  diagnostic notation), as it has both binary and non-binary
      |  (decimal) factors: it is equal to 2^16⋅10^6 (and thus to
      |  2^22⋅5^6).

   CBOR has four different sets of numeric representations:

Bormann                  Expires 22 January 2025                [Page 8]
Internet-Draft          On Deterministic Encoding              July 2024

   *  Major types 0 and 1.

      These provide for a variable-length representation of 64-bit
      unsigned integer numbers (major type 0) or negative numbers (major
      type 1) and, by combining these, of 65-bit signed integer numbers.
      The various lengths are intended to be semantically without
      meaning; the Preferred Encoding always chooses the shortest one.

   *  Tags 2 and 3 ("bignums")

      These provide for a variable-length representation of arbitrarily
      large unsigned (Tag 2) or negative (Tag 3) integer numbers.
      According to Section 3.4.3 of RFC 8949 [STD94], the Preferred
      Encoding of an integer that fits into major type 0 or 1 is just
      that, i.e., the boundary between regular integers and bignums is
      intentionally thin.  This means that, in Preferred Encoding, the
      value space of integral numbers is cleanly split into basic
      integers (64-bit unsigned integers or 64-bit negative integers)
      and bignums (Tag 2/3 integers that fit into neither of the two
      64-bit forms).

      As a result, an application may want to place any distinctions it
      needs in the area of integer numbers not on the representation as
      a regular integer or a bignum, but on the value: e.g., an
      application could provide a 64-bit signed integer range separate
      from a bignum-based arbitrary size integer range that is outside
      64-bit signed space, and would map half of the 65-bit space into
      the arbitrary size range.

      Note that, accordingly, Preferred Encoding as defined in
      Section 3.4.3 of RFC 8949 [STD94] selects the shortest encoding in
      major type 0/1 space if that is available and the shortest
      encoding (no leading zero bytes) in Tag 2/3 space only if the
      former is not available.  This means that the integer number
      65 536 000 000 in preferred representation is encoded as (9 bytes)

      1b 00 00 00 0f 42 40 00 00

      and not as (7 bytes)

      c2 45 0f 42 40 00 00

      (2(h'0f 42 40 00 00') in diagnostic notation), even though the
      latter is shorter by two bytes.

   *  Major type 7

Bormann                  Expires 22 January 2025                [Page 9]
Internet-Draft          On Deterministic Encoding              July 2024

      CBOR directly provides the [IEEE754] types binary16, binary32, and
      binary64, colloquially known as half-precision, single-precision,
      and double-precision floating point.  Note that other [IEEE754]
      binary floating types are indirectly supported via Tag 4, as well
      as decimal fractions via Tag 5.

      The set of values that binary32 and binary64 can represent are
      proper supersets of the value sets of the binary16 and binary32,
      respectively.  These sets have CDDL names of float16, float32, and
      float64 (Section 3.3 of [RFC8610]).  Again, preferred encoding
      chooses the smallest of the encodings; e.g., an application
      float64 such as 1.5 will be represented in a binary16 (0xf93e00)
      because that representation is the shortest floating point that
      provides the range and precision needed for this value.  (Bulk
      encoding of floating point values, where the need for detection of
      this situation might cause a performance limitation, is handled by
      tagged arrays [RFC8746].)

      While the three major type 7 floating point representations are
      semantically equivalent among each other in the same way as the
      major type 0/1 integer representations are to each other,
      implementers have indicated that between these two groups, numbers
      need to be kept separated into integers and floating point numbers
      at the generic data model level.

      This means that the integer number 65 536 000 000 in preferred
      representation is encoded as (9 bytes)

      1b 00 00 00 0f 42 40 00 00

      and not as (5 bytes)

      fa 51 74 24 00

      which would be considered to be the semantically separate floating
      point value 65536000000.0 (CBOR diagnostic notation).

   *  Tag 4 and 5 (decimal fractions, "bigfloats")

      Instead of adopting further formats such as decimal64 or binary128
      from [IEEE754], CBOR defines two generalized tags that can be used
      for extended precision representation: Tag 5 for general binary
      floating point numbers ("bigfloats") and Tag 4 for general decimal
      floating point (decimal fractions).  Section 3.4.4 of RFC 8949
      [STD94] also states that "Bigfloats may also be used by
      constrained applications that need some basic binary floating-
      point capability without the need for supporting IEEE 754", while
      decimal fractions "are most useful if an application needs the

Bormann                  Expires 22 January 2025               [Page 10]
Internet-Draft          On Deterministic Encoding              July 2024

      exact representation of a decimal fraction such as 1.1 because
      there is no exact representation for many decimal fractions in
      binary floating-point representations", as might occur when
      representing literal JSON [STD90] instead of I-JSON-interpreted
      JSON [RFC7493].

      Neither bigfloats nor decimal fractions provide rules for
      preferred encoding, except implicitly by providing a choice
      between basic integer and bignum representation for the mantissa
      value that will in turn be governed by the preferred encoding
      rules for integers.  Beyond that, the assumption is that these
      Tags create separate number spaces, and that any deterministic
      representation of numbers via these tags is shaped by application
      rules for the use of Tag 4 and 5.

4.  Specification Considerations

   In many specifications, asserting that interchange is based on
   deterministically encoded data items (and specifying what has to
   happen if that is not the case) is all that is needed.

4.1.  Media Type Considerations

   Some specifications define a media type for their interchange
   formats.  This definition is a good place to reiterate that a
   deterministically encoded data item is required for instances of that
   media type.

   A question arises whether a Structured Syntax Suffix (SSS, [RFC6838])
   should be defined for CBOR data items in Deterministic Encoding (and
   similarly for CBOR sequences [RFC8742] of such).

   There is precedent for this approach, as ASN.1 DER (Distinguished
   Encoding Rules) has an SSS, +der.  However, this appears misguided as
   the purpose of an SSS is to enable processing of the underlying data
   representation format, and any ASN.1 BER (Basic Encoding Rules)
   processor (+ber) can also process a +der instance, which is not
   apparent from the +der suffix.  (This was maybe mitigated by
   introducing both SSS at the same time.)  Similarly, any CBOR decoder
   today can process deterministically encoded data items as plain CBOR
   data items (without any mitigation of having introduced a related
   suffix at the same time), so the SSS should be the usual +cbor/+cbor-
   seq.  (The additional processing that would be enabled by identifying
   data items as deterministically encoded appears rather limited.)

   Alternatively, instead of replacing +cbor, an indication of
   Deterministic Encoding could be provided by adding multiple suffixes
   to the SSS concept.  There is an ongoing effort to define a more

Bormann                  Expires 22 January 2025               [Page 11]
Internet-Draft          On Deterministic Encoding              July 2024

   complex structure of media type suffixes, as documented in
   [I-D.ietf-mediaman-suffixes].  In general, the combination of
   multiple SSS in one media type name raises similar questions to the
   multiple inheritance problem in object-oriented programming
   languages, so it may not be easy to use such a mechanism in practice.

4.2.  The Need for Maps

   As an extension to JSON objects in JSON [STD90], maps are an
   important data structure in the CBOR generic data model to obtain
   extensibility of "struct"-like data (see Section 2 of [RFC8610]).
   Where this is not needed or can be provided in another way,
   expressing the entire data item without the use of maps can be an
   efficient option, avoiding any additional processing for
   Deterministic Encoding beyond that needed for Preferred Encoding.
   (This requires ensuring that a similar kind of uncertainty then does
   not occur at the application level, though.)

5.  Implementation Considerations for Deterministic Encoding

5.1.  API Considerations

   Support for Deterministic Encoding can be added to an API for a
   generic CBOR encoder and decoder by adding one flag each:

   *  a flag for the encoder to produce Deterministic Encoding

   *  a flag for the decoder to check for Deterministic Encoding
      (optional)

   Additional elements could be added to a decoder API to give
   diagnostic information about inputs that were not deterministically
   encoded, e.g., by flagging elements with error codes.  It is often
   useful to give the application full information about well-formed
   CBOR that is not deterministically encoded even when it should be.
   However, if a flag for checking is provided and switched on, there
   SHOULD be no chance that any other decoded data item is mistaken for
   one that was encoded deterministically.

   As reordering maps for Deterministic Encoding is relatively
   expensive, a generic encoder can also offer additional APIs for
   providing map content in a pre-ordered form.  If an encoder complies
   with Preferred Encoding and maps can be supplied in ordered form, an
   explicit Deterministic Encoding flag may not be required.  If it is,
   it is RECOMMENDED that the encoder not simply rely on the assumption
   that inputs were properly ordered by the application.

Bormann                  Expires 22 January 2025               [Page 12]
Internet-Draft          On Deterministic Encoding              July 2024

5.2.  Map Key Ordering

   Generating deterministically encoded data items requires arranging
   key/value pairs in maps into an order defined in Section 4.2.1 of RFC
   8949 [STD94].

   This map is ordered by the byte-wise lexicographic ordering of the
   deterministically encoded map keys.  Section 4.2.1 of RFC 8949
   [STD94] notes:

   |  Implementation note: the self-delimiting nature of the CBOR
   |  encoding means that there are no two well-formed CBOR encoded data
   |  items where one is a prefix of the other.  The bytewise
   |  lexicographic comparison of deterministic encodings of different
   |  map keys therefore always ends in a position where the byte
   |  differs between the keys, before the end of a key is reached.

   Also, an implementation may be able to make use of the property that
   map keys in Deterministic Encodings are ordered by the following
   information, in order of precedence:

   *  the key's major type

   *  the numeric value of the argument of the key

   *  any content of the key data item, such as

      -  the string value in a byte or text string key

      -  the elements of an array key, in order

      -  the key/value pairs of a map-shaped key, deterministically
         ordered

      -  the tag content of a tagged key

   I may be expeditious to use this property, e.g. by processing
   integers first, starting with unsigned integers in ascending order
   and then negative integers in descending order, and then strings
   (byte strings first), ordered by their length in bytes (encoded in
   the argument) and then the string content, arrays ordered by length
   and then content, and maps ordered by length and then content.
   Often, and particularly with integer and string keys, it may not be
   necessary to actually build a deterministically encoded data item for
   a map key to perform the overall map content ordering.

Bormann                  Expires 22 January 2025               [Page 13]
Internet-Draft          On Deterministic Encoding              July 2024

6.  Application Profiles of Deterministic Encoding

   To enable the use of generic encoders, applications are encouraged to
   define rules for representing application information in the CBOR
   generic data model that enable the use of Preferred Encoding on that
   level as well.

6.1.  The need for CBOR Common Deterministic Encoding (CDE)

   Applications can also define their own deterministic encoding rules,
   as for instance FIDO CTAP2 (Client to Authenticator Protocol [CTAP2])
   does with the _CTAP2 canonical CBOR encoding form_ (Section 6 of
   [CTAP2]).  Its description appears to be derived from an equivalent
   of Section 4.2.3 of RFC 8949 [STD94].  (The actual structure of CTAP2
   limits its use to cases where that is compatible with standard
   Deterministic Encoding and thus CDE; there is text in the
   specification that calls for revisiting the definition when this
   would no longer be the case.)

   Application-specific deterministic encoding rules can make it
   difficult to use existing generic encoders and may therefore diminish
   the value of using a standard representation format.

   Instead, applications can define transformations of their data into a
   more limited data model that reduces the cases the Deterministic
   Encoding rules have to implement.  This allows both the following
   implementation choices:

   *  the use of generic encoders with standard Deterministic Encoding
      rule implementations after some application processing, or

   *  the use of specialized encoders which combine encoding with the
      implementation of the application transformations.

   The next subsection describes some of the considerations that led to
   one such application profile for Deterministic Encoding.

6.2.  Numeric Reduction in dCBOR

   The dCBOR specification [I-D.mcnally-deterministic-cbor] describes
   the pervasive use of Deterministic Encoding throughout an
   application.  It also defines a simplified application data model of
   numbers, where there no longer is a distinction between integers and
   floating point numbers at the application data model level — all
   numbers are of a single numeric type, and the choice of integer or
   floating point representations is made based on value:

Bormann                  Expires 22 January 2025               [Page 14]
Internet-Draft          On Deterministic Encoding              July 2024

   *  integral numbers that fit into Major Type 0 and 1 are represented
      in this way even if they were originally represented as floating
      point values;

   *  all other numbers are represented as floating point values (and
      all NaN values are mapped to a single quiet NAN).

   The underlying CBOR Deterministic Encoding rules ensure that, in both
   cases, the shortest form for the case will then be used for encoding.

   Reducing the separate integer and floating point spaces to a single
   numeric space is particularly attractive in implementation languages
   that also only have a single such space, such as JavaScript
   [ECMA262].  (While JavaScript recently has acquired a separate
   integer type, it is much less well integrated into the language and
   existing libraries than the more well-established general numeric
   type.)

   Within the CBOR working group of the IETF, the dCBOR specification
   prompted a discussion about profiles for deterministic encoding,
   which led to the CBOR Common Deterministic Encoding (CDE)
   specification [I-D.ietf-cbor-cde] and the concept of a deterministic
   encoding _application profile_ (Section 3 of [I-D.ietf-cbor-cde]).
   Without help of the CDE specification at the time, an early version
   of the dCBOR specification restated much of Section 4.2 of RFC 8949
   [STD94] and added a rule that gets in the way of compatibility with
   Deterministic Encoding (disallowing the interchange of basic negative
   integers in the range -2^64 to -2^63-1).

      |  Early dCBOR specifications also had a mention as future work of
      |  subnormal values [IEEE754], which work fine for interchange
      |  (even with Deterministic Encoding) in [STD94].  Note that
      |  specific values may not be available to applications that
      |  employ floating point hardware implementing the FTZ ("flush to
      |  zero") and/or DAZ ("denormals are zero") strategies.  These may
      |  then require special handling in the application data model.
      |  It is generally difficult to rely on exact equality of floating
      |  point values, which however is what Deterministic Encoding
      |  requires.

7.  Using Deterministically Encoded CBOR as a Deterministic Encoding of
    JSON

   Certain applications could make use of a Deterministic Encoding for
   JSON [STD90] data.  Deterministically Encoded CBOR provides an
   attractive solution to that as it is already well-defined.

Bormann                  Expires 22 January 2025               [Page 15]
Internet-Draft          On Deterministic Encoding              July 2024

   While the data model of JSON is not well-defined, I-JSON provides one
   interpretation that is generally accepted [RFC7493].  Section 6.2
   (Converting from JSON to CBOR) of RFC 8949 [STD94] provides a way to
   transform JSON data that conform to this data model to CBOR.  When
   used with its default parameters, the combination of (1) I-JSON, (2)
   the JSON-to-CBOR transformation, and (3) the rules for CBOR
   Deterministic Encoding provide a well-defined Deterministic Encoding
   for JSON data.

   Transforming decoded CBOR data after interchange back to data-model
   level JSON data can be done with the inverse of Section 6.2 of RFC
   8949 [STD94] (the full generality of Section 6.1 (Converting from
   CBOR to JSON) of RFC 8949 [STD94] is obviously not required as only
   the JSON subset of the CBOR generic data model is used).

   Comparing the handling of numeric data in the JSON-to-CBOR
   transformation to that reported in Section 6.2, the main difference
   is that the former only maps integral values between -2^53+1 and
   2^53-1 to basic CBOR integers and leaves the others in floating point
   form.  (The rationale is that only this range is injective
   ("unambiguous" or "exact") in the mapping of integers to binary64
   floating point values, which may be a desirable property beyond the
   use in JSON encoding.)

8.  Security Considerations

   One of the major use cases of Deterministic Encoding is in security,
   namely in the derivation of signing inputs from some CBOR data only
   available at the model level.  Any transformation error from the
   application data to the CBOR model level and then to deterministic
   encoding can lead to a potential exploit by an attacker.

   Pertinent Security Considerations are further discussed Section 8 of
   [I-D.ietf-cbor-cde].

9.  IANA Considerations

   This document has no IANA actions.

10.  References

10.1.  Normative References

   [BCP14]    Best Current Practice 14,
              <https://www.rfc-editor.org/info/bcp14>.
              At the time of writing, this BCP comprises the following:

Bormann                  Expires 22 January 2025               [Page 16]
Internet-Draft          On Deterministic Encoding              July 2024

              Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

              Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [I-D.ietf-cbor-cde]
              Bormann, C., "CBOR Common Deterministic Encoding (CDE)",
              Work in Progress, Internet-Draft, draft-ietf-cbor-cde-03,
              8 July 2024, <https://datatracker.ietf.org/doc/html/draft-
              ietf-cbor-cde-03>.

   [IANA.cbor-tags]
              IANA, "Concise Binary Object Representation (CBOR) Tags",
              19 September 2013,
              <https://www.iana.org/assignments/cbor-tags>.

   [RFC8610]  Birkholz, H., Vigano, C., and C. Bormann, "Concise Data
              Definition Language (CDDL): A Notational Convention to
              Express Concise Binary Object Representation (CBOR) and
              JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610,
              June 2019, <https://www.rfc-editor.org/rfc/rfc8610>.

   [STD94]    Internet Standard 94,
              <https://www.rfc-editor.org/info/std94>.
              At the time of writing, this STD comprises the following:

              Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", STD 94, RFC 8949,
              DOI 10.17487/RFC8949, December 2020,
              <https://www.rfc-editor.org/info/rfc8949>.

10.2.  Informative References

   [CTAP2]    FIDO Alliance, "Client to Authenticator Protocol (CTAP)",
              CTAP2 canonical CBOR encoding form (in Section 6), 27
              February 2018, <https://fidoalliance.org/specs/fido-v2.0-
              id-20180227/fido-client-to-authenticator-protocol-v2.0-id-
              20180227.html#ctap2-canonical-cbor-encoding-form>.

   [ECMA262]  Ecma International, "ECMAScript 2020 Language
              Specification", Standard ECMA-262, 11th Edition, June
              2020, <https://www.ecma-
              international.org/publications/standards/Ecma-262.htm>.

Bormann                  Expires 22 January 2025               [Page 17]
Internet-Draft          On Deterministic Encoding              July 2024

   [I-D.ietf-cbor-edn-literals]
              Bormann, C., "CBOR Extended Diagnostic Notation (EDN)",
              Work in Progress, Internet-Draft, draft-ietf-cbor-edn-
              literals-10, 4 July 2024,
              <https://datatracker.ietf.org/doc/html/draft-ietf-cbor-
              edn-literals-10>.

   [I-D.ietf-mediaman-suffixes]
              Sporny, M. and A. Guy, "Media Type Suffixes", Work in
              Progress, Internet-Draft, draft-ietf-mediaman-suffixes-08,
              19 June 2024, <https://datatracker.ietf.org/doc/html/
              draft-ietf-mediaman-suffixes-08>.

   [I-D.mcnally-deterministic-cbor]
              McNally, W., Allen, C., and C. Bormann, "dCBOR: A
              Deterministic CBOR Application Profile", Work in Progress,
              Internet-Draft, draft-mcnally-deterministic-cbor-10, 10
              June 2024, <https://datatracker.ietf.org/doc/html/draft-
              mcnally-deterministic-cbor-10>.

   [IEEE754]  IEEE, "IEEE Standard for Floating-Point Arithmetic", IEEE
              Std 754-2019, DOI 10.1109/IEEESTD.2019.8766229,
              <https://ieeexplore.ieee.org/document/8766229>.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002,
              <https://www.rfc-editor.org/rfc/rfc3339>.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013,
              <https://www.rfc-editor.org/rfc/rfc6838>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/rfc/rfc7493>.

   [RFC8742]  Bormann, C., "Concise Binary Object Representation (CBOR)
              Sequences", RFC 8742, DOI 10.17487/RFC8742, February 2020,
              <https://www.rfc-editor.org/rfc/rfc8742>.

   [RFC8746]  Bormann, C., Ed., "Concise Binary Object Representation
              (CBOR) Tags for Typed Arrays", RFC 8746,
              DOI 10.17487/RFC8746, February 2020,
              <https://www.rfc-editor.org/rfc/rfc8746>.

Bormann                  Expires 22 January 2025               [Page 18]
Internet-Draft          On Deterministic Encoding              July 2024

   [RFC9557]  Sharma, U. and C. Bormann, "Date and Time on the Internet:
              Timestamps with Additional Information", RFC 9557,
              DOI 10.17487/RFC9557, April 2024,
              <https://www.rfc-editor.org/rfc/rfc9557>.

   [STD63]    Internet Standard 63,
              <https://www.rfc-editor.org/info/std63>.
              At the time of writing, this STD comprises the following:

              Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [STD90]    Internet Standard 90,
              <https://www.rfc-editor.org/info/std90>.
              At the time of writing, this STD comprises the following:

              Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

   [STD96]    Internet Standard 96,
              <https://www.rfc-editor.org/info/std96>.
              At the time of writing, this STD comprises the following:

              Schaad, J., "CBOR Object Signing and Encryption (COSE):
              Structures and Process", STD 96, RFC 9052,
              DOI 10.17487/RFC9052, August 2022,
              <https://www.rfc-editor.org/info/rfc9052>.

              Schaad, J., "CBOR Object Signing and Encryption (COSE):
              Countersignatures", STD 96, RFC 9338,
              DOI 10.17487/RFC9338, December 2022,
              <https://www.rfc-editor.org/info/rfc9338>.

Acknowledgments

   This document was motivated by the work of Wolf McNally and
   Christopher Allen as documented in [I-D.mcnally-deterministic-cbor]
   and discussed in 2023 in the CBOR working group.  It collects
   information that is present in the apps-discuss and CBOR WG mailing
   list discussions since 2013, but not necessarily easy to find.  The
   author is grateful to the many contributors to these discussions.

Author's Address

Bormann                  Expires 22 January 2025               [Page 19]
Internet-Draft          On Deterministic Encoding              July 2024

   Carsten Bormann
   Universität Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany
   Phone: +49-421-218-63921
   Email: cabo@tzi.org

Bormann                  Expires 22 January 2025               [Page 20]