Erasure Encoding of Files in NFSv4.2
draft-haynes-nfsv4-erasure-encoding-03

Document Type Active Internet-Draft (individual)
Author Thomas Haynes
Last updated 2024-11-05
RFC stream (None)
Intended RFC status (None)
Stream Stream state (No stream defined)
Consensus boilerplate Unknown
RFC Editor Note (None)
IESG IESG state I-D Exists
Telechat date (None)
Responsible AD (None)
Send notices to (None)
Network File System Version 4                                  T. Haynes
Internet-Draft                                               Hammerspace
Intended status: Standards Track                         5 November 2024
Expires: 9 May 2025

                  Erasure Encoding of Files in NFSv4.2
                 draft-haynes-nfsv4-erasure-encoding-03

Abstract

   Parallel NFS (pNFS) allows a separation between the metadata (onto a
   metadata server) and data (onto a storage device) for a file.  The
   Flexible File Version 2 Layout Type is defined in this document as an
   extension to pNFS that allows the use of storage devices that require
   only a limited degree of interaction with the metadata server and use
   already-existing protocols.  Data replication is also added to
   provide integrity.

Note

   This note is to be removed before publishing as an RFC.

   Discussion of this draft takes place on the NFSv4 working group
   mailing list (nfsv4@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/nfsv4/.  Working Group
   information can be found at https://datatracker.ietf.org/wg/nfsv4/
   about/.

Note

   This note is to be removed before publishing as an RFC.

   This draft starts sparse and will be filled in as details are ironed
   out.  For example, WRITE_BLOCK4 in Section 6.5 is presented as being
   WRITE4 (see Section 18.32 of [RFC8881]) plus some semantic changes.
   In the first draft, we simply explain the semantic changes.  As
   these are accepted by the knowledgeable reviewers, we will flesh out
   the WRITE_BLOCK4 section to include sub-sections more akin to 18.32.3
   and 18.32.4 of [RFC8881].

   Except where called out, all the semantics of the Flexible File
   Version 1 Layout Type presented in [RFC8435] still apply.  This new
   version extends it and does not replace it.

Haynes                     Expires 9 May 2025                   [Page 1]
Internet-Draft              erasure encoding               November 2024

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 9 May 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   4
     1.2.  Requirements Language . . . . . . . . . . . . . . . . . .   5
   2.  Flexible File Version 2 Layout Type . . . . . . . . . . . . .   5
     2.1.  ffv2_encoding_type  . . . . . . . . . . . . . . . . . . .   6
     2.2.  ff_flags4 . . . . . . . . . . . . . . . . . . . . . . . .   6
     2.3.  ffv2_file_info4 . . . . . . . . . . . . . . . . . . . . .   6
     2.4.  ffv2_ds_flags4  . . . . . . . . . . . . . . . . . . . . .   7
     2.5.  ffv2_data_server4 . . . . . . . . . . . . . . . . . . . .   7
     2.6.  ffv2_encoding_type_data . . . . . . . . . . . . . . . . .   8
     2.7.  ffv2_mirror4  . . . . . . . . . . . . . . . . . . . . . .   8
     2.8.  ffv2_layout4  . . . . . . . . . . . . . . . . . . . . . .   8
     2.9.  ffv2_layouthint4  . . . . . . . . . . . . . . . . . . . .   9
     2.10. Mixing of Encoding Types  . . . . . . . . . . . . . . . .   9
   3.  Erasure Encoding  . . . . . . . . . . . . . . . . . . . . . .  11
     3.1.  Encoding a Data Block . . . . . . . . . . . . . . . . . .  11
     3.2.  Decoding a Data Block . . . . . . . . . . . . . . . . . .  15
   4.  Blocks and Activating . . . . . . . . . . . . . . . . . . . .  18
     4.1.  Dead or Partitioned Client  . . . . . . . . . . . . . . .  18
     4.2.  Client Overwrite  . . . . . . . . . . . . . . . . . . . .  18
     4.3.  Racing Clients  . . . . . . . . . . . . . . . . . . . . .  21
     4.4.  Reader and Writer Racing  . . . . . . . . . . . . . . . .  24
   5.  New Infrastructure  . . . . . . . . . . . . . . . . . . . . .  25
     5.1.  Errors  . . . . . . . . . . . . . . . . . . . . . . . . .  25
     5.2.  EXCHGID4_FLAG_USE_PNFS_DS . . . . . . . . . . . . . . . .  25
     5.3.  Block Owner . . . . . . . . . . . . . . . . . . . . . . .  26
   6.  New NFSv4.2 Operations  . . . . . . . . . . . . . . . . . . .  27
     6.1.  Operation 77: ACTIVATE_BLOCK4 - Activate Cached Block
           Data  . . . . . . . . . . . . . . . . . . . . . . . . . .  27
     6.2.  Operation 78: READ_BLOCK_STATUS4 - Read Block Commit Status
           from File . . . . . . . . . . . . . . . . . . . . . . . .  28
     6.3.  Operation 79: READ_BLOCK4 - Read Blocks from File . . . .  28
     6.4.  Operation 80: ROLLBACK_BLOCK - Rollback Cached Block
           Data  . . . . . . . . . . . . . . . . . . . . . . . . . .  31
     6.5.  Operation 81: WRITE_BLOCK4 - Write Blocks to File . . . .  33
   7.  Extraction of XDR . . . . . . . . . . . . . . . . . . . . . .  37
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  37
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  37
     9.1.  pNFS Layout Types Registry  . . . . . . . . . . . . . . .  38
     9.2.  NFSv4 Recallable Object Types Registry  . . . . . . . . .  38
     9.3.  Flexible Files Version 2 Layout Type Erasure Encoding Type
           Registry  . . . . . . . . . . . . . . . . . . . . . . . .  38
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  39
     10.1.  Normative References . . . . . . . . . . . . . . . . . .  39
     10.2.  Informative References . . . . . . . . . . . . . . . . .  40
   Appendix A.  Acknowledgments  . . . . . . . . . . . . . . . . . .  40
   Appendix B.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  40
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  40

1.  Introduction

   In Parallel NFS (pNFS) (see Section 12 of [RFC8881]), the metadata
   server returns layout type structures that describe where file data
   is located.  There are different layout types for different storage
   systems and methods of arranging data on storage devices.  [RFC8435]
   defined the Flexible File Version 1 Layout Type used with file-based
   data servers that are accessed using the NFS protocols: NFSv3
   [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC8881], and NFSv4.2
   [RFC7862].

   The Client Side Mirroring (see Section 8 of [RFC8435]), introduced
   with the first version of the Flexible File Layout Type, provides for
   replication of data but does not provide for integrity of data.  In
   the event of an error, a user would be able to repair the file by
   resilvering the mirror contents.  I.e., they would pick one of the
   mirror instances and replicate it to the other instance locations.

   However, lacking integrity checks, silent corruption cannot be
   detected and the choice of what constitutes the good copy is
   difficult.  This document updates the Flexible File Layout Type to
   version 2 by providing data integrity for erasure encoding.  It
   introduces new variants of COMMIT4 (see Section 18.3 of [RFC8881]),
   READ4 (see Section 18.22 of [RFC8881]), and WRITE4 (see
   Section 18.32 of [RFC8881]) to allow for the transmission of
   integrity-checking information.

   Using the process detailed in [RFC8178], the revisions in this
   document become an extension of NFSv4.2 [RFC7862].  They are built on
   top of the external data representation (XDR) [RFC4506] generated
   from [RFC7863].

1.1.  Definitions

   block:  One of the resulting blocks to be exchanged with a data
      server after a transformation has been applied to a data block.
      Note that the resulting block may be a different size than the
      data block.

   Client Side Mirroring:  A file based replication method where copies
      are maintained in parallel.

   data block:  A block of data in the client's cache for a file.

   Erasure Encoding:  A data protection scheme where a block of data is
      replicated into fragments and additional redundant fragments are
      added to achieve parity.  The new blocks are stored in different
      locations.

   Client Side Erasure Encoding:  A file based integrity method where
      copies are maintained in parallel.

   consistency of payload:  A payload is consistent when all contained
      blocks have the same owner, i.e., they share the same writing
      client and transaction id.

   integrity of data:  Data integrity refers to the accuracy,
      consistency, and reliability of data throughout its life cycle.

   payload:  The set of metadata headers and transformed blocks
      generated per data block by the erasure encoding type.  Note that
      the resulting blocks might be of type active, parity, spare, or
      repair.

   replication of data:  Data replication is making and storing multiple
      copies of data in different locations.

   write hole:  A write hole is a data corruption scenario where either
      two clients are trying to write to the same block or one client is
      overwriting an existing block of data.

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Flexible File Version 2 Layout Type

   In order to introduce erasure encoding to pNFS, a new layout type of
   LAYOUT4_FLEX_FILES_V2 needs to be defined.  While we could define a
   new layout type per erasure encoding type, there exist use cases
   where multiple erasure encoding types exist in the same layout.

   The original layouttype4 introduced in [RFC8881] is modified to be as
   shown in Figure 1.

          enum layouttype4 {
              LAYOUT4_NFSV4_1_FILES   = 1,
              LAYOUT4_OSD2_OBJECTS    = 2,
              LAYOUT4_BLOCK_VOLUME    = 3,
              LAYOUT4_FLEX_FILES      = 4,
              LAYOUT4_FLEX_FILES_V2   = 5
          };

          struct layout_content4 {
              layouttype4             loc_type;
              opaque                  loc_body<>;
          };

          struct layout4 {
              offset4                 lo_offset;
              length4                 lo_length;
              layoutiomode4           lo_iomode;
              layout_content4         lo_content;
          };

                                  Figure 1

   This document defines structures associated with the layouttype4
   value LAYOUT4_FLEX_FILES_V2.  [RFC8881] specifies the loc_body
   structure as an XDR type 'opaque'.  The opaque layout is
   uninterpreted by the generic pNFS client layers but is interpreted by
   the Flexible File Version 2 Layout Type implementation.  This section
   defines the structure of this otherwise opaque value, ffv2_layout4.

2.1.  ffv2_encoding_type

      /// enum ffv2_encoding_type {
      ///     FFV2_ENCODING_MIRRORED       = 0x1
      /// };

                                  Figure 2

   The ffv2_encoding_type (see Figure 2) encompasses a new IANA registry
   for 'Flex Files V2 Erasure Encoding Type Registry' (see Section 9.3).
   I.e., instead of defining a new Layout Type for each Erasure
   Encoding, we define a new Erasure Encoding Type.  Except for
   FFV2_ENCODING_MIRRORED, each of the types is expected to employ the
   new operations in this document.

   FFV2_ENCODING_MIRRORED offers replication of data and not integrity
   of data.  As such, it does not need operations like WRITE_BLOCK4 (see
   Section 6.5).

2.2.  ff_flags4

      const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001;
      const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002;
      const FF_FLAGS_NO_READ_IO        = 0x00000004;
      const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008;
      typedef uint32_t            ff_flags4;

                                  Figure 3

   ff_flags4 is defined as in Section 5.1 of [RFC8435] and is shown in
   Figure 3 for reference.

2.3.  ffv2_file_info4

      /// struct ffv2_file_info4 {
      ///     stateid4                fffi_stateid;
      ///     nfs_fh4                 fffi_fh_vers;
      /// };

                                  Figure 4

   The ffv2_file_info4 is a new structure to help with the stateid issue
   discussed in Section 5.1 of [RFC8435].  In version 1 of the Flexible
   File Layout Type, there was the singleton ffds_stateid combined with
   the ffds_fh_vers array, even though each NFSv4 version has its own
   stateid.  In Figure 4, each NFSv4 file handle has a one-to-one
   correspondence to a stateid.

2.4.  ffv2_ds_flags4

      /// const FFV2_DS_FLAGS_ACTIVE        = 0x00000001;
      /// const FFV2_DS_FLAGS_SPARE         = 0x00000002;
      /// const FFV2_DS_FLAGS_PARITY        = 0x00000004;
      /// const FFV2_DS_FLAGS_REPAIR        = 0x00000008;
      /// typedef uint32_t            ffv2_ds_flags4;

                                  Figure 5

   The ffv2_ds_flags4 (in Figure 5) flags detail the state of the data
   servers.  With Erasure Encoding algorithms, there are both Systematic
   and Non-Systematic approaches.  In the Systematic approach, the bits
   for integrity are placed amongst the resulting transformed block.
   Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and
   FFV2_DS_FLAGS_SPARE data servers.  The FFV2_DS_FLAGS_SPARE ones allow
   the client to repair a payload without engaging the metadata server.
   I.e., if one of the FFV2_DS_FLAGS_ACTIVE data servers did not respond
   to a WRITE_BLOCK4, the client could fail the block over to the
   FFV2_DS_FLAGS_SPARE data server.

   With the Non-Systematic approach, the data and integrity live on
   different data servers.  Such an implementation would typically see
   FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_PARITY data servers.  If the
   implementation wanted to allow for local repair, it would also use
   FFV2_DS_FLAGS_SPARE.  Note that with a Non-Systematic approach, it is
   possible to update parts of the blocks, see Section 6.5.3.2.

   See [Plank97] for further reference to storage layouts for encoding.
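
   As a rough illustration of how a client might consult these flags,
   the following Python sketch mirrors the constants from Figure 5 and
   picks a spare data server to fail a block over to.  The DataServer
   class and the pick_failover_target helper are hypothetical names
   introduced only for this example and are not part of the protocol.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import List, Optional


class DsFlags(IntFlag):
    """Mirror of the ffv2_ds_flags4 constants in Figure 5."""
    ACTIVE = 0x00000001
    SPARE = 0x00000002
    PARITY = 0x00000004
    REPAIR = 0x00000008


@dataclass
class DataServer:
    """Hypothetical stand-in for one ffv2_data_server4 entry."""
    name: str
    flags: DsFlags


def pick_failover_target(servers: List[DataServer]) -> Optional[DataServer]:
    """Return the first data server flagged as a spare, or None."""
    for ds in servers:
        if ds.flags & DsFlags.SPARE:
            return ds
    return None


servers = [
    DataServer("ds0", DsFlags.ACTIVE),
    DataServer("ds1", DsFlags.ACTIVE),
    DataServer("ds2", DsFlags.SPARE),
]
# If ds0 did not answer a WRITE_BLOCK4, the client could retry on this one.
target = pick_failover_target(servers)
```

   A layout with no FFV2_DS_FLAGS_SPARE entries yields None here, in
   which case the client would presumably have to involve the metadata
   server.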

2.5.  ffv2_data_server4

      /// struct ffv2_data_server4 {
      ///     deviceid4               ffds_deviceid;
      ///     uint32_t                ffds_efficiency;
      ///     ffv2_file_info4         ffds_file_info<>;
      ///     fattr4_owner            ffds_user;
      ///     fattr4_owner_group      ffds_group;
      ///     ffv2_ds_flags4          ffds_flags;
      /// };

                                  Figure 6

   The ffv2_data_server4 (in Figure 6) describes a data file and how to
   access it via the different NFS protocols.

2.6.  ffv2_encoding_type_data

      /// union ffv2_encoding_type_data switch
      ///         (ffv2_encoding_type fetd_encoding) {
      ///     case FFV2_ENCODING_MIRRORED:
      ///         void;
      /// };

                                  Figure 7

   The ffv2_encoding_type_data (in Figure 7) describes erasure encoding
   type specific fields.  I.e., this is how the encoding type can
   communicate the need for counts of active, spare, parity, and repair
   types of blocks.

2.7.  ffv2_mirror4

      /// struct ffv2_mirror4 {
      ///     ffv2_data_server4       ffm_data_servers<>;
      ///     ffv2_encoding_type_data ffm_encoding_type_data;
      /// };

                                  Figure 8

   The ffv2_mirror4 (in Figure 8) describes the Flexible File Layout
   Version 2 specific fields.

2.8.  ffv2_layout4

      /// struct ffv2_layout4 {
      ///     length4                 ffl_stripe_unit;
      ///     ffv2_mirror4            ffl_mirrors<>;
      ///     ff_flags4               ffl_flags;
      ///     uint32_t                ffl_stats_collect_hint;
      /// };

                                  Figure 9

   The ffv2_layout4 (in Figure 9) describes the Flexible Files Layout
   Version 2.

2.9.  ffv2_layouthint4

   /// union ffv2_mirrors_hint switch (ffv2_encoding_type ffmh_type) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };
   ///
   /// struct ffv2_layouthint4 {
   ///     ffv2_encoding_type fflh_supported_types<>;
   ///     ffv2_mirrors_hint fflh_mirrors_hint;
   /// };

                                 Figure 10

   The ffv2_layouthint4 (in Figure 10) describes the layout_hint (see
   Section 5.12.4 of [RFC8881]) that the client can provide to the
   metadata server.

2.10.  Mixing of Encoding Types

   Note that, effectively, multiple encoding types can be present in a
   Flexible Files Version 2 Layout Type layout.  The ffv2_layout4 has an
   array of ffv2_mirror4, each of which has an ffv2_encoding_type.  The
   main reason to allow for this is to provide for either the
   assimilation of a non-erasure encoded file to an erasure encoded file
   or the exporting of an erasure encoded file to a non-erasure encoded
   file.

   Assume there is an additional ffv2_encoding_type of
   FFV2_ENCODING_REED_SOLOMON and it needs 4 active blocks, 2 parity
   blocks, and 2 spare blocks.  The user wants to actively assimilate a
   regular file.  As such, a layout might be as represented in
   Figure 11.  As this is an assimilation, most of the data reads will
   be satisfied by READ4 (see Section 18.22 of [RFC8881]) calls to index
   0.  However, as this is also an active file, there could also be
   READ_BLOCK4 (see Section 6.3) calls to the other indexes.

            +---------------------------------------------------+
            | ffv2_layout4:                                     |
            +---------------------------------------------------+
            |     ffl_mirrors[0]:                               |
            |         ffm_data_servers:                         |
            |             ffv2_data_server4[0]                  |
            |                 ffds_flags: 0                     |
            |         ffm_encoding: FFV2_ENCODING_MIRRORED      |
            +---------------------------------------------------+
            |     ffl_mirrors[1]:                               |
            |         ffm_data_servers:                         |
            |             ffv2_data_server4[0]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
            |             ffv2_data_server4[1]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
            |             ffv2_data_server4[2]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
            |             ffv2_data_server4[3]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
            |             ffv2_data_server4[4]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
            |             ffv2_data_server4[5]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
            |             ffv2_data_server4[6]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
            |             ffv2_data_server4[7]                  |
            |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
            |         ffm_encoding: FFV2_ENCODING_REED_SOLOMON  |
            +---------------------------------------------------+

                                 Figure 11
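
   The shape of the layout in Figure 11 can be sketched in Python as
   below.  This is only a model of the example, not an XDR decoder: the
   Mirror class, the role_counts helper, and the numeric value chosen
   for the assumed FFV2_ENCODING_REED_SOLOMON type are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum, IntFlag
from typing import List


class EncodingType(IntEnum):
    MIRRORED = 0x1
    REED_SOLOMON = 0x2  # hypothetical value; not assigned by the draft


class DsFlags(IntFlag):
    ACTIVE = 0x00000001
    SPARE = 0x00000002
    PARITY = 0x00000004
    REPAIR = 0x00000008


@dataclass
class Mirror:
    """Hypothetical stand-in for one ffv2_mirror4 entry."""
    encoding: EncodingType
    ds_flags: List[DsFlags]  # one entry per ffv2_data_server4


def role_counts(mirror: Mirror) -> Counter:
    """Count how many data servers carry each flag value."""
    return Counter(mirror.ds_flags)


# ffl_mirrors[0] is the regular file, ffl_mirrors[1] the encoded copy.
layout = [
    Mirror(EncodingType.MIRRORED, [DsFlags(0)]),
    Mirror(
        EncodingType.REED_SOLOMON,
        [DsFlags.ACTIVE] * 4 + [DsFlags.PARITY] * 2 + [DsFlags.SPARE] * 2,
    ),
]
counts = role_counts(layout[1])
encodings = {m.encoding for m in layout}
```

   Here counts reports 4 active, 2 parity, and 2 spare data servers for
   the second mirror, matching the requirements assumed in the text.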

   When performing I/O via an FFV2_ENCODING_MIRRORED encoding type, the
   non-transformed data will be used, whereas with other encoding types,
   a metadata header and transformed block will be sent.  Further, when
   reading data from the instance files, the client MUST be prepared to
   have one of the encoding types supply data and the other type not to
   supply data.  I.e., the READ_BLOCK4 call might return rlr_eof set to
   true (see Figure 37), which indicates that there is no data, where
   the READ4 call might return eof to be false, which indicates that
   there is data.  The client MUST determine that there is in fact data.

   An example use case is the active assimilation of a file to ensure
   integrity.  As the client is helping to translate the file to the
   new encoding scheme, it is actively modifying the file.  As such, it
   might be sequentially reading the file in order to translate.  The
   READ4 call would be returning data and the READ_BLOCK4 would not be
   returning data.  As the client overwrites the file, the WRITE4 call
   and the WRITE_BLOCK4 call would both have data sent.  Finally, if the
   client read back a section which had been modified earlier, both the
   READ4 and READ_BLOCK4 calls would return data.

3.  Erasure Encoding

   Erasure Encoding takes a data block and transforms it to a payload
   to send to the data servers (see Figure 12).  It generates a metadata
   header and transformed block per data server.  The header is metadata
   information for the transformed block.  From now on, the metadata is
   simply referred to as the header and the transformed block as the
   block.  The payload of a data block is the set of generated headers
   and blocks for that data block.

   The change_id is a unique identifier generated by the client to
   describe the current write transaction.  The client_id is a unique
   identifier assigned by the metadata server to describe which client
   is making the current write transaction.  The seq_id describes the
   index across the payload.  The eff_len is the length of the data
   within the block.  Finally, the crc32 is the 32-bit CRC calculation
   of the header (with the crc32 field being 0) and the block.  By
   combining the two parts of the payload, integrity is ensured for both
   the parts.
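
   The CRC calculation described above might be sketched in Python as
   follows.  Two assumptions are made that this document does not
   state: the five header fields are serialized as big-endian unsigned
   32-bit integers, and the CRC is the one computed by zlib.crc32.  The
   helper names are hypothetical.

```python
import struct
import zlib


def header_bytes(change_id: int, client_id: int, seq_id: int,
                 eff_len: int, crc32: int = 0) -> bytes:
    """Serialize the header (assumed layout: five big-endian uint32)."""
    return struct.pack(">5I", change_id, client_id, seq_id, eff_len, crc32)


def block_crc32(change_id: int, client_id: int, seq_id: int,
                eff_len: int, block: bytes) -> int:
    """CRC over the header with its crc32 field zeroed, then the block."""
    zeroed = header_bytes(change_id, client_id, seq_id, eff_len, 0)
    return zlib.crc32(zeroed + block) & 0xFFFFFFFF


# Placeholder block contents; a real block comes from the encoder.
block = b"\xab" * (3 * 1024)
wb_crc = block_crc32(7, 6, 0, len(block), block)
```

   Because the header is covered by the CRC, the same block written
   under a different change_id or client_id yields a different value.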

   While the data block might have a length of 4kB, that does not
   necessarily mean that the length of the block is 4kB.  That length is
   determined by the erasure encoding type algorithm.  For example, Reed
   Solomon might have 4kB blocks with the data integrity being
   provided by parity blocks.  Another example would be the Mojette
   Transformation, which might have 1kB block lengths.

   The payload contains redundancy which will allow the erasure encoding
   type algorithm to repair blocks in the payload as it is transformed
   back to a data block (see Figure 17).  A payload is consistent when
   all of the contained headers share the same change_id and client_id.
   It has integrity when it is consistent and the blocks all pass the
   crc32 checks.
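
   The two properties defined above can be expressed as a pair of
   checks, sketched here in Python.  The Chunk class is a hypothetical
   in-memory form of one header plus block, and the CRC helper assumes
   the same layout as before (five big-endian uint32 header fields and
   zlib.crc32), neither of which is fixed by this document.

```python
import struct
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical in-memory form of one header plus block."""
    change_id: int
    client_id: int
    seq_id: int
    eff_len: int
    crc32: int
    block: bytes


def compute_crc(c: Chunk) -> int:
    # Assumed layout: header fields as big-endian uint32, crc32 zeroed.
    hdr = struct.pack(">5I", c.change_id, c.client_id, c.seq_id, c.eff_len, 0)
    return zlib.crc32(hdr + c.block) & 0xFFFFFFFF


def is_consistent(payload: List[Chunk]) -> bool:
    """True when every header shares one (change_id, client_id) pair."""
    owners = {(c.change_id, c.client_id) for c in payload}
    return len(owners) == 1  # an empty payload is treated as inconsistent


def has_integrity(payload: List[Chunk]) -> bool:
    """Consistent, and every block passes its crc32 check."""
    return is_consistent(payload) and all(
        compute_crc(c) == c.crc32 for c in payload
    )


def make_chunk(change_id: int, client_id: int, seq_id: int,
               block: bytes) -> Chunk:
    c = Chunk(change_id, client_id, seq_id, len(block), 0, block)
    c.crc32 = compute_crc(c)
    return c


good = [make_chunk(3, 6, i, bytes(16)) for i in range(6)]
# One block written by a different transaction makes the payload
# inconsistent.
mixed = good[:5] + [make_chunk(4, 6, 5, bytes(16))]
# A single flipped bit keeps the payload consistent but fails the CRC.
flipped = Chunk(3, 6, 0, 16, good[0].crc32, b"\x01" + bytes(15))
damaged = [flipped] + good[1:]
```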

3.1.  Encoding a Data Block

                         +-----------------+
                         |  data block     |
                         +-----------------+
                         |                 |
                         | 3kB data        |
                         |                 |
                         +-----------------+
                         | 1kB empty       |
                         +-------+---------+
                                 |
                                 |
          +----------------------+-----------------------+
          |      Erasure Encoding (Transform Forward)    |
          +----+-------------------------------------+---+
               |                                     |
               |                                     |
           +---+----------------+         +----------+---------+
           | HEADER             |         | HEADER             |
           +--------------------+         +--------------------+
           | change_id: 3       |         | change_id: 3       |
           | client_id: 6       |         | client_id: 6       |
           | seq_id   : 0       |         | seq_id   : 5       |
           | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
           | crc32    :         |         | crc32    :         |
           +--------------------+         +--------------------+
           | BLOCK              |         | BLOCK              |
           +--------------------+         +--------------------+
           | data: ....         |         | data: ....         |
           +--------------------+         +--------------------+
                Data Server 1                 Data Server 6

                                 Figure 12

   Each data block of the file resident in the client's cache of the
   file will be encoded into a payload of N headers and blocks, one per
   data server, as shown in Figure 12.  As WRITE_BLOCK4 (see
   Section 6.5) can encode multiple write_block4 into a single
   transaction, a more accurate description of a WRITE_BLOCK4 might be
   as in Figure 13.

           +------------------------------------+
           | WRITE_BLOCK4args                   |
           +------------------------------------+
           | wba_stateid: 0                     |
           | wba_offset: 1                      |
           | wba_stable: FILE_SYNC4             |
           | wba_seq_id: 0                      |
           | wba_owner:                         |
           |            bo_change_id: 3         |
           |            bo_client_id: 6         |
           | wba_block[0]:                      |
           |            wb_crc    :  0x32ef89   |
           |            wb_effective_len  : 4kB |
           |            wb_block  :  ......     |
           | wba_block[1]:                      |
           |            wb_crc    :  0x56fa89   |
           |            wb_effective_len  : 4kB |
           |            wb_block  :  ......     |
           | wba_block[2]:                      |
           |            wb_crc    :  0x7693af   |
           |            wb_effective_len  : 3kB |
           |            wb_block  :  ......     |
           +------------------------------------+

                                 Figure 13

   // pay attention to the 128 bits alignment for wb_block_val
   //
   // -- DF

   This describes a 3 block write of data from an offset of 1 block in
   the file.  As each block shares the wba_owner, it is only presented
   once.  I.e., the data server will be able to construct the header for
   each wba_block from the wba_seq_id, wba_owner, wb_effective_len, and
   wb_crc.
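
   How a data server might rebuild the per-block headers from the
   arguments in Figure 13 is sketched below in Python.  The classes
   are hypothetical simplifications (wba_stateid and wba_stable are
   omitted), and deriving the block id as wba_offset plus the array
   index is an inference from the values shown in Figure 14 rather than
   a rule stated in this document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WriteBlock:
    wb_crc: int
    wb_effective_len: int
    wb_block: bytes


@dataclass
class WriteBlockArgs:
    wba_offset: int      # offset in blocks, as in Figure 13
    wba_seq_id: int
    bo_change_id: int    # wba_owner.bo_change_id
    bo_client_id: int    # wba_owner.bo_client_id
    wba_block: List[WriteBlock]


@dataclass
class Header:
    change_id: int
    client_id: int
    seq_id: int
    eff_len: int
    crc32: int


def build_headers(args: WriteBlockArgs) -> List[Tuple[int, Header]]:
    """Return (block id, header) pairs; the owner is shared by all blocks."""
    result = []
    for i, wb in enumerate(args.wba_block):
        block_id = args.wba_offset + i  # assumed from Figure 14
        header = Header(args.bo_change_id, args.bo_client_id,
                        args.wba_seq_id, wb.wb_effective_len, wb.wb_crc)
        result.append((block_id, header))
    return result


KB = 1024
# Values taken from Figure 13; block contents are left empty here.
args = WriteBlockArgs(1, 0, 3, 6, [
    WriteBlock(0x32ef89, 4 * KB, b""),
    WriteBlock(0x56fa89, 4 * KB, b""),
    WriteBlock(0x7693af, 3 * KB, b""),
])
headers = build_headers(args)
```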

   Assuming that there were no issues, Figure 14 illustrates the
   results.  The payload sequence id is implicit in the
   WRITE_BLOCK4args.

           +-------------------------------+
           | WRITE_BLOCK4resok             |
           +-------------------------------+
           | wbr_count: 3                  |
           | wbr_committed: FILE_SYNC4     |
           | wbr_writeverf: 0xf1234abc     |
           | wbr_owners[0]:                |
           |            bo_block_id: 1     |
           |            bo_change_id: 3    |
           |            bo_client_id: 6    |
           |            bo_activated: true |
           | wbr_owners[1]:                |
           |            bo_block_id: 2     |
           |            bo_change_id: 3    |
           |            bo_client_id: 6    |
           |            bo_activated: true |
           | wbr_owners[2]:                |
           |            bo_block_id: 3     |
           |            bo_change_id: 3    |
           |            bo_client_id: 6    |
           |            bo_activated: true |
           +-------------------------------+

                                 Figure 14

3.1.1.  Calculating the CRC32

           +--------------------+
           | HEADER             |
           +--------------------+
           | change_id: 7       |
           | client_id: 6       |
           | seq_id   : 0       |
           | eff_len  : 3kB     |
           | crc32    : 0       |
           +--------------------+
           | BLOCK              |
           +--------------------+
           | data:  ....        |
           +--------------------+
                Data Server 1

                                 Figure 15

   Assuming the header and block as in Figure 15, the crc32 needs to
   be calculated in order to fill in the wb_crc field.  In this case,
   the crc32 is calculated over the 5 fields as shown in the header and
   the data of the block.  In this example, it is calculated to be
   0x21de8.  The resulting WRITE_BLOCK4 is shown in Figure 16.

           +------------------------------------+
           | WRITE_BLOCK4args                   |
           +------------------------------------+
           | wba_stateid: 0                     |
           | wba_offset: 1                      |
           | wba_stable: FILE_SYNC4             |
           | wba_seq_id: 0                      |
           | wba_owner:                         |
           |            bo_change_id: 7         |
           |            bo_client_id: 6         |
           | wba_block[0]:                      |
           |            wb_crc    :  0x21de8    |
           |            wb_effective_len  : 3kB |
           |            wb_block  :  ......     |
           +------------------------------------+

                                 Figure 16
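
   As an illustration only, the calculation can be sketched in Python as
   follows.  The field widths, the big-endian packing, and the use of
   the zlib CRC-32 are assumptions made for this sketch, and the helper
   name header_crc32 is hypothetical; this section does not define a
   byte layout for the header.  The sketch only shows that the crc32
   field is held at 0 while the checksum is computed over the five
   header fields and the block data.

```python
import struct
import zlib


def header_crc32(change_id, client_id, seq_id, eff_len, data):
    """CRC over the five header fields (crc32 held at 0) plus the data."""
    # Assumed layout: 64-bit change_id and client_id, then 32-bit seq_id,
    # eff_len, and crc32, all big-endian.  This layout is hypothetical.
    header = struct.pack(">QQIII", change_id, client_id, seq_id, eff_len, 0)
    return zlib.crc32(header + data) & 0xFFFFFFFF


# Values taken from the header in Figure 15; the data is a placeholder.
data = b"\x00" * 3072
wb_crc = header_crc32(change_id=7, client_id=6, seq_id=0,
                      eff_len=3072, data=data)
```

   The resulting value would be placed in wb_crc.  It will not equal the
   0x21de8 of the example, which is itself only illustrative.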

3.2.  Decoding a Data Block

                Data Server 1                 Data Server 6
           +--------------------+         +--------------------+
           | HEADER             |         | HEADER             |
           +--------------------+         +--------------------+
           | change_id: 1       |         | change_id: 1       |
           | client_id: 6       |         | client_id: 6       |
           | seq_id   : 0       |         | seq_id   : 5       |
           | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
           | crc32    :         |         | crc32    :         |
           +--------------------+         +--------------------+
           | BLOCK              |         | BLOCK              |
           +--------------------+         +--------------------+
           | data:  ....        |         | data:  ....        |
           +---+----------------+         +----------+---------+
               |                                     |
               |                                     |
          +----+-------------------------------------+---+
          |      Erasure Decoding (Transform Reverse)    |
          +----------------------+-----------------------+
                                 |
                                 |
                         +-------+---------+
                         |  data block     |
                         +-----------------+
                         |                 |
                         | 3kB data        |
                         |                 |
                         +-----------------+
                         | 1kB empty       |
                         +-----------------+

                                 Figure 17

   When reading blocks via a READ_BLOCK4 operation, the client will
   decode the headers and payload into data blocks as shown in
   Figure 17.  If the decoded data is smaller than a data block, i.e.,
   the rb_effective_len is less than the data block size, then the
   inverse transformation MUST fill the remainder of the data block
   with 0s.  The result must appear as a freshly written data block
   which was not completely filled.
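
   A minimal sketch of the zero fill, assuming the decoded bytes and the
   effective length are already in hand, is shown below.  The function
   name fill_data_block and its parameters are hypothetical and are not
   defined by this document.

```python
def fill_data_block(decoded, effective_len, block_size):
    """Return a full-size data block, zero-filled past effective_len."""
    if effective_len > block_size:
        raise ValueError("effective length exceeds the data block size")
    if len(decoded) < effective_len:
        raise ValueError("decoded data is shorter than the effective length")
    # Keep only the effective bytes and pad the remainder with 0s.
    return decoded[:effective_len] + b"\x00" * (block_size - effective_len)


# As in Figure 17: 3kB of data in a 4kB data block leaves 1kB of 0s.
block = fill_data_block(b"\xab" * 3072, 3072, 4096)
```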

   Note that at this time, the client could detect issues in the
   integrity of the data.  The handling and repair are out of the scope
   of this document and MUST be addressed in the document describing
   each erasure encoding type.

3.2.1.  Checking the CRC32

           +------------------------------------+
           | READ_BLOCK4resok                   |
           +------------------------------------+
           | rbr_eof: false                     |
           | rbr_blocks[0]:                     |
           |            rb_crc: 0x21de8         |
           |            rb_effective_len  : 3kB |
           |            rb_owner:               |
           |                 bo_block_id: 1     |
           |                 bo_change_id: 7    |
           |                 bo_client_id: 6    |
           |                 bo_activated: true |
           |            rb_block  :  ......     |
           +------------------------------------+

                                 Figure 18

   Assuming the READ_BLOCK4 results as in Figure 18, the crc32 needs to
   be checked in order to ensure data integrity.  Conceptually, a header
   and payload can be built as shown in Figure 19.  The crc32 is
   calculated over the 5 fields as shown in the header and the 3kB of
   block data.  In this example, it is calculated to be 0x21de8, which
   matches the rb_crc in Figure 18.  Thus the payload from this data
   server has data integrity.

           +---+----------------+
           | HEADER             |
           +--------------------+
           | change_id: 7       |
           | client_id: 6       |
           | seq_id   : 0       |
           | eff_len  : 3kB     |
           | crc32    : 0       |
           +--------------------+
           | BLOCK              |
           +--------------------+
           | data:  ....        |
           +--------------------+
                Data Server 1

                                 Figure 19
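
   The check can be sketched as recomputing the checksum and comparing
   it against rb_crc.  As before, the packing of the header, the use of
   the zlib CRC-32, and the helper names are assumptions made for
   illustration and are not specified by this document.

```python
import struct
import zlib


def expected_crc32(change_id, client_id, seq_id, eff_len, data):
    # Hypothetical header layout with the crc32 field held at 0.
    header = struct.pack(">QQIII", change_id, client_id, seq_id, eff_len, 0)
    # Only the effective bytes of the block are covered in this sketch.
    return zlib.crc32(header + data[:eff_len]) & 0xFFFFFFFF


def block_is_intact(rb_crc, change_id, client_id, seq_id, eff_len, data):
    """True when the recomputed checksum matches the one that was read."""
    return expected_crc32(change_id, client_id, seq_id, eff_len, data) == rb_crc


payload = b"\x5a" * 3072
good_crc = expected_crc32(7, 6, 0, 3072, payload)
damaged = b"\x00" + payload[1:]  # first byte corrupted
```

   With these placeholder values, block_is_intact() accepts payload and
   rejects damaged.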

4.  Blocks and Activating

   Unlike the regular NFSv4.2 I/O operations, the base unit of I/O in
   this document is the block.  The raw data stream is encoded/decoded
   into blocks as described in Section 3.  Each block is either
   activated or pending activation.  This distinction is crucial in
   detecting write holes.  A write hole occurs either when two different
   clients write to the same block concurrently or when a client
   overwrites existing data.  In the first scenario, the order of writes
   is not deterministic and can result in a mixture of blocks in the
   payload.  In the second scenario, network partitions or client
   restarts can result in partial writes.  In both cases, the blocks
   have to be repaired, either by abandoning the new I/O or by sorting
   out the winner.  Note that unlike the case of the encoding type
   detecting data integrity issues (see Section 3.2), the case of write
   holes is in the scope of this document.

   What is out of scope of this document is the manner in which the data
   servers implement the semantics of the new operations.  I.e., a data
   server might be able to leverage the native file system to achieve
   the semantics, or it might implement a multi-file approach to stage
   WRITE_BLOCK4 results and then shuffle blocks when the ACTIVATE_BLOCK4
   or ROLLBACK_BLOCK4 operations activate or discard the data.

4.1.  Dead or Partitioned Client

   Consider a client which crashes while in the middle of sending
   WRITE_BLOCK4 to a set of data servers.  Regardless of whether it comes
   back online or not, the metadata server can detect that the client
   had restarted while it held an outstanding LAYOUTIOMODE4_RW layout on
   the file.  The metadata server can assign the file to a repair
   program, which would scan the entire file with READ_BLOCK_STATUS4.
   When the repair program determines that it does not have enough
   payload blocks to rebuild a data block, it can conclude that the I/O
   for that data block was not complete and throw away the blocks.

   Note that the repair process can throw away the blocks by using the
   ROLLBACK_BLOCK4 operation to unstage the pending written blocks.
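
   One way a repair program might make this decision is sketched below.
   The shape of the input (one list of block owners per data server, for
   a single block offset) and the threshold "needed" are assumptions;
   the number of payload blocks required to rebuild a data block depends
   on the erasure encoding type and is not defined here.

```python
from collections import Counter


def blocks_to_roll_back(statuses, needed):
    """Pick pending transactions that too few data servers hold.

    statuses: one list per data server of (change_id, client_id,
    activated) tuples for one block offset, as READ_BLOCK_STATUS4 might
    report them.  needed: hypothetical count of payload blocks required
    to rebuild the data block.
    """
    pending = Counter()
    for per_server in statuses:
        for change_id, client_id, activated in per_server:
            if not activated:
                pending[(change_id, client_id)] += 1
    # Pending owners seen on fewer than `needed` servers cannot be decoded.
    return sorted(owner for owner, seen in pending.items() if seen < needed)


# Six data servers; only two staged the write with change_id 3.
statuses = [
    [(2, 6, True), (3, 6, False)],
    [(2, 6, True), (3, 6, False)],
    [(2, 6, True)],
    [(2, 6, True)],
    [(2, 6, True)],
    [(2, 6, True)],
]
stale = blocks_to_roll_back(statuses, needed=4)
```

   Each owner in the returned list would then be named in a
   ROLLBACK_BLOCK4 sent to the data servers that hold it.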

4.2.  Client Overwrite

   Consider a client which gets back conflicting information in the
   WRITE_BLOCK4 results.  Assume that we had written to 6 data servers
   with WRITE_BLOCK4s as in Figure 20.  And we get the results as in
   Figure 21.

           +------------------------------------+
           | WRITE_BLOCK4args                   |
           +------------------------------------+
           | wba_stateid: 0                     |
           | wba_offset: 1                      |
           | wba_stable: FILE_SYNC4             |
           | wba_seq_id: 0                      |
           | wba_owner:                         |
           |            bo_change_id: 3         |
           |            bo_client_id: 6         |
           | wba_block[0]:                      |
           |            wb_crc    :  0x32ef89   |
           |            wb_effective_len  : 4kB |
           |            wb_block  :  ......     |
           | wba_block[1]:                      |
           |            wb_crc    :  0x56fa89   |
           |            wb_effective_len  : 4kB |
           |            wb_block  :  ......     |
           +------------------------------------+

                                 Figure 20

   Figure 21 shows that the first block was an overwrite and an
   activation has to be done in order for the newly written block to be
   returned in a READ_BLOCK4.  Assume that the next four data servers
   had the same type of response.

                   Data Server 1
           +--------------------------------+
           | WRITE_BLOCK4resok              |
           +--------------------------------+
           | wbr_count: 2                   |
           | wbr_committed: FILE_SYNC4      |
           | wbr_writeverf: 0xf1234abc      |
           | wbr_owners[0]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 2     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           | wbr_owners[1]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: false |
           | wbr_owners[2]:                 |
           |            bo_block_id: 2      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           +--------------------------------+

                                 Figure 21

   But assume that data server 4 does not respond to the WRITE_BLOCK4
   operation.  While the client can detect this and send the
   WRITE_BLOCK4 to any data server marked as FFV2_DS_FLAGS_SPARE, it
   might decide to see if the data server did in fact do the
   transaction.  It might also be the case that there are no data
   servers marked as FFV2_DS_FLAGS_SPARE.  The client issues a
   READ_BLOCK_STATUS4 (see Figure 22) and gets the results in Figure 23.
   This indicates that data server 4 did not get the WRITE_BLOCK4
   request.

   In general, the client can either resend the WRITE_BLOCK4 request,
   determine by the erasure encoding type that there are sufficient
   payload blocks present to decode the data block, or ROLLBACK_BLOCK4
   the existing blocks to back out the change.

                   Data Server 4
           +--------------------------------+
           | READ_BLOCK_STATUS4args         |
           +--------------------------------+
           | rbsa_stateid: 0                |
           | rbsa_offset: 1                 |
           | rbsa_count: 3                  |
           +----------+---------------------+

                                 Figure 22

                   Data Server 4
           +--------------------------------+
           | READ_BLOCK_STATUS4resok        |
           +--------------------------------+
           | rbsr_eof: true                 |
           | rbsr_blocks[0]:                |
           |            bo_block_id: 1      |
           |            bo_change_id: 2     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           +--------------------------------+

                                 Figure 23

4.3.  Racing Clients

   Assume that the client has written to 6 data servers with
   WRITE_BLOCK4s as in Figure 20.  But now it gets back the conflicting
   results in Figure 24 and Figure 25.  From this, it can detect that
   there was a race with another client.  Note, even though both clients
   present the same bo_change_id, nothing can be inferred as to the
   ordering of the two transactions.  In some cases, bo_client_id 10 won
   the race and in some cases, bo_client_id 6 won the race.

   As a subsequent READ_BLOCK4 will produce garbage, the clients need to
   agree on how to fix this issue without any communication.  A
   simplistic approach is for each client to retry the WRITE_BLOCK4
   until such time as the payload is consistent.  Note, this does not
   mean that both clients win, it just means that one of them wins.

   Another option is for the clients to report a LAYOUTERROR4 (see
   Section 15.6 of [RFC7862]) to the metadata server with an error of
   NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT.  That would then allow the
   metadata server to arrange for the repair of the file.

                   Data Server 1
           +--------------------------------+
           | WRITE_BLOCK4resok              |
           +--------------------------------+
           | wbr_count: 2                   |
           | wbr_committed: FILE_SYNC4      |
           | wbr_writeverf: 0xf1234abc      |
           | wbr_owners[0]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 10    |
           |            bo_activated: true  |
           | wbr_owners[1]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: false |
           | wbr_owners[2]:                 |
           |            bo_block_id: 2      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           +--------------------------------+

                                 Figure 24

                   Data Server 2
           +--------------------------------+
           | WRITE_BLOCK4resok              |
           +--------------------------------+
           | wbr_count: 2                   |
           | wbr_committed: FILE_SYNC4      |
           | wbr_writeverf: 0xf1234abc      |
           | wbr_owners[0]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           | wbr_owners[1]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 10    |
           |            bo_activated: false |
           | wbr_owners[2]:                 |
           |            bo_block_id: 2      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           +--------------------------------+

                                 Figure 25
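
   The detection described in this section can be sketched as comparing
   which owner each data server reports as activated for the block.  The
   dictionary representation of block_owner4 and the function names are
   hypothetical; the values are those of Figure 24 and Figure 25.

```python
def owner(block_id, change_id, client_id, activated):
    return {"bo_block_id": block_id, "bo_change_id": change_id,
            "bo_client_id": client_id, "bo_activated": activated}


def activated_owner(owners, block_id):
    """Return (change_id, client_id) of the activated entry, if any."""
    for o in owners:
        if o["bo_block_id"] == block_id and o["bo_activated"]:
            return (o["bo_change_id"], o["bo_client_id"])
    return None


def race_detected(responses, block_id):
    """True when data servers disagree on the activated owner."""
    winners = {activated_owner(owners, block_id) for owners in responses}
    return len(winners) > 1


fig24 = [owner(1, 3, 10, True), owner(1, 3, 6, False), owner(2, 3, 6, True)]
fig25 = [owner(1, 3, 6, True), owner(1, 3, 10, False), owner(2, 3, 6, True)]
```

   For block 1 the two data servers report different winners, so a race
   is detected; for block 2 they agree.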

4.3.1.  Multiple Writers

   Note that nothing prevents pending blocks from accumulating or from
   more than 2 writers trying to write the same payload.  An example of
   such a WRITE_BLOCK4resok in response to the example of Figure 20 is
   shown in Figure 26.  Not only has client 6 tried to update block 1,
   but clients 7 and 10 are also attempting to update it.

                   Data Server 2
           +--------------------------------+
           | WRITE_BLOCK4resok              |
           +--------------------------------+
           | wbr_count: 2                   |
           | wbr_committed: FILE_SYNC4      |
           | wbr_writeverf: 0xf1234abc      |
           | wbr_owners[0]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           | wbr_owners[1]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 4     |
           |            bo_client_id: 6     |
           |            bo_activated: false |
           | wbr_owners[2]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 20    |
           |            bo_client_id: 7     |
           |            bo_activated: false |
           | wbr_owners[3]:                 |
           |            bo_block_id: 1      |
           |            bo_change_id: 3     |
           |            bo_client_id: 10    |
           |            bo_activated: false |
           | wbr_owners[4]:                 |
           |            bo_block_id: 2      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           +--------------------------------+

                                 Figure 26

4.4.  Reader and Writer Racing

   In addition to the above write hole scenarios, a further complication
   is a racing reader and writer.  If the client reads a block and
   determines that the payload is not consistent (i.e., not all of the
   payload blocks share the same client_id and change_id), then it can
   assume that it has encountered a race with another client writing to
   the file.  It SHOULD retry the READ_BLOCK4 operation until payload
   consistency is achieved.  It may decide to send a LAYOUTERROR4 to
   the metadata server with an error of
   NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT.
   // And should it hang forever?  Perhaps a new layout error that the
   // client can send the MDS?  Or should it probe with
   // READ_BLOCK_STATUS4 to try to repair?
   //
   // -- TH
   // Perhaps a LAYOUTERROR_BLOCK4 to send an encoding type specific
   // location?
   //
   // -- TH
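
   The consistency test itself can be sketched as below.  The retry
   bound is purely an assumption of this sketch, since how long a client
   should retry is still an open question in this document; the function
   names and the dictionary form of a payload block are hypothetical.

```python
def payload_is_consistent(blocks):
    """All payload blocks share the same client_id and change_id."""
    owners = {(b["client_id"], b["change_id"]) for b in blocks}
    return len(owners) == 1


def read_consistent(read_payload, max_attempts=3):
    """Retry a read until the payload is consistent or attempts run out."""
    for _ in range(max_attempts):
        blocks = read_payload()
        if payload_is_consistent(blocks):
            return blocks
    # The caller may then report the error to the metadata server.
    return None


mixed = [{"client_id": 6, "change_id": 3}, {"client_id": 10, "change_id": 3}]
settled = [{"client_id": 6, "change_id": 3}, {"client_id": 6, "change_id": 3}]
attempts = iter([mixed, settled])
result = read_consistent(lambda: next(attempts))
```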

5.  New Infrastructure

5.1.  Errors

5.1.1.  Error 10097 - NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT

   The client encountered a payload in which the blocks were
   inconsistent and remain inconsistent.  As the client cannot tell if
   another client is actively writing, it informs the metadata server of
   this error via LAYOUTERROR4.  The metadata server can then arrange
   for repair of the file.

   Note that due to the opaqueness of the clientid4, the client cannot
   differentiate between boot instances of the metadata server or
   client, but the metadata server can do that differentiation.  I.e.,
   it can tell whether the inconsistency is from the same client and
   whether that client is active and actively writing to the file (i.e.,
   does the client have the file open with a LAYOUTIOMODE4_RW layout?).

5.1.2.  Error 10098 - NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED

   The client requested a ffv2_encoding_type which the metadata server
   does not support.  I.e., if the client sends a layout_hint requesting
   an erasure encoding type that the metadata server does not support,
   this error code can be returned.  The client might have to send the
   layout_hint several times to determine the overlapping set of
   supported erasure encoding types.

5.1.3.  Error 10099 - NFS4ERR_ERASURE_ENCODING_BLOCK_MISMATCH

   The client asked the data server to update only the header, but the
   data server cannot find a matching block at that offset.

5.2.  EXCHGID4_FLAG_USE_ERASURE_DS

   /// const EXCHGID4_FLAG_USE_ERASURE_DS      = 0x00100000;

                                 Figure 27

   When a data server connects to a metadata server, it can state its
   pNFS role via EXCHANGE_ID (see Section 18.35 of [RFC8881]).  The data
   server can use EXCHGID4_FLAG_USE_ERASURE_DS (see Figure 27) to
   indicate that it supports the new NFSv4.2 operations introduced in
   this document.  Section 13.1 of [RFC8881] describes the interaction
   of the various pNFS roles masked by EXCHGID4_FLAG_MASK_PNFS.
   However, that mask does not cover EXCHGID4_FLAG_USE_ERASURE_DS.
   I.e., EXCHGID4_FLAG_USE_ERASURE_DS can be used in combination with
   all of the pNFS flags.

   If the data server sets EXCHGID4_FLAG_USE_ERASURE_DS during the
   EXCHANGE_ID operation, then it MUST support: ACTIVATE_BLOCK4,
   READ_BLOCK_STATUS4, READ_BLOCK4, ROLLBACK_BLOCK4, and WRITE_BLOCK4.
   Further, note that this support is orthogonal to the Erasure Encoding
   Type selected.  The data server is unaware of which type is driving
   the I/O.  It is also unaware of the payload layout or what type of
   block it is serving.
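
   The relationship between the new flag and the pNFS role mask can be
   illustrated as follows.  The pNFS flag values are those of
   EXCHANGE_ID in [RFC8881]; the function name supports_erasure_ops is
   hypothetical.

```python
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000
EXCHGID4_FLAG_MASK_PNFS = 0x00070000
EXCHGID4_FLAG_USE_ERASURE_DS = 0x00100000  # Figure 27


def supports_erasure_ops(flags):
    """True when the erasure data server flag is set."""
    return bool(flags & EXCHGID4_FLAG_USE_ERASURE_DS)


# A data server announcing both its pNFS role and erasure support.
flags = EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_ERASURE_DS
pnfs_role = flags & EXCHGID4_FLAG_MASK_PNFS
```

   Because the new bit lies outside EXCHGID4_FLAG_MASK_PNFS, masking
   for the pNFS role leaves only EXCHGID4_FLAG_USE_PNFS_DS.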

5.3.  Block Owner

   /// struct block_owner4 {
   ///     uint32_t    bo_block_id;
   ///     changeid4   bo_change_id;
   ///     clientid4   bo_client_id;
   ///     bool        bo_activated;
   /// };

                                 Figure 28

   The block_owner4 (see Figure 28) is used to determine when and by
   whom a block was written.  The bo_block_id is used to identify the
   block and MUST be the index of the block within the file.  I.e., it
   is the offset of the start of the block divided by the block length.
   The bo_client_id MUST be the client id handed out by the metadata
   server to the client as the eir_clientid during the EXCHANGE_ID
   results (see Section 18.35 of [RFC8881]) and MUST NOT be the client
   id supplied by the data server to the client.  I.e., across all data
   files, the bo_client_id uniquely describes one and only one client.

   The bo_change_id is like the change attribute (see Section 5.8.1.4 of
   [RFC8881]) in that each block write by a given client has to have a
   unique bo_change_id.  I.e., it can be determined to which transaction,
   across all data files, a block corresponds.
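
   As a small worked example of the bo_block_id rule, assuming a fixed
   block length for the data file, the index could be derived as shown
   below.  The function name is hypothetical.

```python
def block_id_for_offset(byte_offset, block_len):
    """Index of the block that starts at byte_offset."""
    if block_len <= 0:
        raise ValueError("block length must be positive")
    if byte_offset % block_len != 0:
        raise ValueError("offset is not the start of a block")
    return byte_offset // block_len


# With 4kB blocks, the block starting at byte 8192 has bo_block_id 2.
bo_block_id = block_id_for_offset(8192, 4096)
```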

   The bo_activated is used by the data server to indicate whether the
   block I/O was activated or pending activation.  The first
   WRITE_BLOCK4 to a location is automatically activated if the
   WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY is set.  Subsequent WRITE_BLOCK4
   modifications to that block location are not automatically activated.
   The client has to ACTIVATE_BLOCK4 the block in order to get it
   activated.

   The concept of automatically activating is dependent on the
   wba_stable field of the WRITE_BLOCK4args.

6.  New NFSv4.2 Operations

6.1.  Operation 77: ACTIVATE_BLOCK4 - Activate Cached Block Data

6.1.1.  ARGUMENTS

   /// struct ACTIVATE_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     offset4         aba_offset;
   ///     count4          aba_count;
   ///     block_owner4    aba_blocks<>;
   /// };

                                 Figure 29

6.1.2.  RESULTS

   /// struct ACTIVATE_BLOCK4resok {
   ///     verifier4       abr_writeverf;
   /// };

                                 Figure 30

   /// union ACTIVATE_BLOCK4res switch (nfsstat4 abr_status) {
   ///     case NFS4_OK:
   ///         ACTIVATE_BLOCK4resok   abr_resok4;
   ///     default:
   ///         void;
   /// };

                                 Figure 31

6.1.3.  DESCRIPTION

   ACTIVATE_BLOCK4 is COMMIT4 (see Section 18.3 of [RFC8881]) with
   additional semantics over the block_owner activating the blocks.  As
   such, all of the normal semantics of COMMIT4 directly apply.

   The main difference between the two operations is that
   ACTIVATE_BLOCK4 works on blocks and not a raw data stream.  As such
   aba_offset is the starting block offset in the file and not the byte
   offset in the file.  Some erasure encoding types can have different
   block sizes depending on the block type.  Further, aba_count is a
   count of blocks to activate and not bytes to activate.

   Further, while it may appear that the combination of aba_offset and
   aba_count is redundant with aba_blocks, the purpose of aba_blocks is
   to allow the data server to differentiate between potentially
   multiple pending blocks.
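
   A toy in-memory model of a single block location may help to show why
   the owner is needed.  How a data server actually stages blocks is out
   of scope of this document, so the class below is only a hypothetical
   sketch; it ignores stability and does not say what happens to the
   other pending blocks after an activation.

```python
ACTIVATE_IF_EMPTY = 0x00000002  # WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY


class BlockSlot:
    """One block location with an active block and pending blocks."""

    def __init__(self):
        self.active = None   # (owner, data) or None
        self.pending = {}    # owner -> data, owner = (change_id, client_id)

    def write(self, owner, data, flags=0):
        """Stage a write; return True if it was activated immediately."""
        if self.active is None and flags & ACTIVATE_IF_EMPTY:
            self.active = (owner, data)
            return True
        self.pending[owner] = data
        return False

    def activate(self, owner):
        """Activate the pending block named by owner (KeyError if absent)."""
        self.active = (owner, self.pending.pop(owner))

    def read(self):
        """Only the activated block is visible to a read."""
        return self.active


slot = BlockSlot()
first = slot.write((2, 6), b"old", ACTIVATE_IF_EMPTY)
second = slot.write((3, 6), b"new", ACTIVATE_IF_EMPTY)
slot.write((3, 10), b"other")
before = slot.read()
slot.activate((3, 6))
after = slot.read()
```

   With two pending blocks at the same location, the offset and count
   alone cannot say which one to activate; the owner can.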

6.2.  Operation 78: READ_BLOCK_STATUS4 - Read Block Commit Status from
      File

6.2.1.  ARGUMENTS

   /// struct READ_BLOCK_STATUS4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4    rbsa_stateid;
   ///     offset4     rbsa_offset;
   ///     count4      rbsa_count;
   /// };

                                 Figure 32

6.2.2.  RESULTS

   /// struct READ_BLOCK_STATUS4resok {
   ///     bool            rbsr_eof;
   ///     block_owner4    rbsr_blocks<>;
   /// };

                                 Figure 33

   /// union READ_BLOCK_STATUS4res switch (nfsstat4 rbsr_status) {
   ///     case NFS4_OK:
   ///         READ_BLOCK_STATUS4resok   rbsr_resok4;
   ///     default:
   ///         void;
   /// };

                                 Figure 34

6.2.3.  DESCRIPTION

   READ_BLOCK_STATUS4 differs from READ_BLOCK4 in that it only reads
   active and pending headers in the desired data range.

6.3.  Operation 79: READ_BLOCK4 - Read Blocks from File

6.3.1.  ARGUMENTS

   /// struct READ_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4    rba_stateid;
   ///     offset4     rba_offset;
   ///     count4      rba_count;
   /// };

                                 Figure 35

6.3.2.  RESULTS

   /// struct read_block4 {
   ///     uint32_t        rb_crc;
   ///     uint32_t        rb_effective_len;
   ///     block_owner4    rb_owner;
   ///     uint32_t        rb_seq_id;
   ///     opaque          rb_block<>;
   /// };

                                 Figure 36

   /// struct READ_BLOCK4resok {
   ///     bool        rbr_eof;
   ///     read_block4 rbr_blocks<>;
   /// };

                                 Figure 37

   /// union READ_BLOCK4res switch (nfsstat4 rbr_status) {
   ///     case NFS4_OK:
   ///          READ_BLOCK4resok     rbr_resok4;
   ///     default:
   ///          void;
   /// };

                                 Figure 38

6.3.3.  DESCRIPTION

   READ_BLOCK4 is READ4 (see Section 18.22 of [RFC8881]) with additional
   semantics over the block_owner and the activation of blocks.  As
   such, all of the normal semantics of READ4 directly apply.

   The main difference between the two operations is that READ_BLOCK4
   works on blocks and not a raw data stream.  As such rba_offset is the
   starting block offset in the file and not the byte offset in the
   file.  Some erasure encoding types can have different block sizes
   depending on the block type.  Further, rba_count is a count of blocks
   to read and not bytes to read.

   READ_BLOCK4 also only returns the activated block at the location.
   I.e., if a client overwrites a block at offset 10 and then tries to
   read the block without activating it, the original block is returned.

   When reading a set of blocks across the data servers, it can be the
   case that some data servers do not have any data at that location.
   In that case, the server either returns rbr_eof if the rba_offset
   exceeds the number of blocks of which the data server is aware, or
   it returns an empty block for that block.

   For example, in Figure 39, the client asks for 4 blocks starting with
   the 3rd block in the file.  The second data server responds as in
   Figure 40.  The client would read this as there is valid data for
   blocks 2 and 4, there is a hole at block 3, and there is no data for
   block 5.  Note that the data server MUST calculate a valid rb_crc for
   block 3 based on the generated fields.

                   Data Server 2
           +--------------------------------+
           | READ_BLOCK4args                |
           +--------------------------------+
           | rba_stateid: 0                 |
           | rba_offset: 2                  |
           | rba_count: 4                   |
           +----------+---------------------+

                                 Figure 39

                   Data Server 2
           +--------------------------------+
           | READ_BLOCK4resok               |
           +--------------------------------+
           | rbr_eof: true                  |
           | rbr_blocks[0]:                 |
           |     rb_crc: 0x3faddace         |
           |     rb_effective_len: 4kB      |
           |     rb_owner:                  |
           |            bo_block_id: 2      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           |     rb_seq_id: 1               |
           |     rb_block: ....             |
           | rbr_blocks[1]:                 |
           |     rb_crc: 0xdeade4e5         |
           |     rb_effective_len: 4kB      |
           |     rb_owner:                  |
           |            bo_block_id: 3      |
           |            bo_change_id: 0     |
           |            bo_client_id: 0     |
           |            bo_activated: false |
           |     rb_seq_id: 1               |
           |     rb_block: 0000...00000     |
           | rbr_blocks[2]:                 |
           |     rb_crc: 0x7778abcd         |
           |     rb_effective_len: 2kB      |
           |     rb_owner:                  |
           |            bo_block_id: 4      |
           |            bo_change_id: 3     |
           |            bo_client_id: 6     |
           |            bo_activated: true  |
           |     rb_seq_id: 1               |
           |     rb_block: ....             |
           +--------------------------------+

                                 Figure 40
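
   The way a client might read the results of this example can be
   sketched as below.  Treating a returned block with bo_activated set
   to false as a hole follows the example only and is an assumption, as
   are the function name and the labels it returns.

```python
def classify(rba_offset, rba_count, rbr_eof, blocks):
    """Label each requested block as data, hole, no data, or unknown.

    blocks: list of (bo_block_id, bo_activated) pairs as returned.
    """
    returned = dict(blocks)
    state = {}
    for block_id in range(rba_offset, rba_offset + rba_count):
        if block_id in returned:
            state[block_id] = "data" if returned[block_id] else "hole"
        else:
            # Nothing was returned for this block; past EOF means no data.
            state[block_id] = "no data" if rbr_eof else "unknown"
    return state


# Request of Figure 39 and the three blocks returned in Figure 40.
result = classify(2, 4, True, [(2, True), (3, False), (4, True)])
```

   This gives valid data for blocks 2 and 4, a hole at block 3, and no
   data for block 5, matching the reading given in the text.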

6.4.  Operation 80: ROLLBACK_BLOCK4 - Rollback Cached Block Data

6.4.1.  ARGUMENTS

   /// struct ROLLBACK_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     offset4         rba_offset;
   ///     count4          rba_count;
   ///     block_owner4    rba_blocks<>;
   /// };

                                 Figure 41

6.4.2.  RESULTS

   /// struct ROLLBACK_BLOCK4resok {
   ///     verifier4       rbr_writeverf;
   /// };

                                 Figure 42

   /// union ROLLBACK_BLOCK4res switch (nfsstat4 rbr_status) {
   ///     case NFS4_OK:
   ///         ROLLBACK_BLOCK4resok   rbr_resok4;
   ///     default:
   ///         void;
   /// };

                                 Figure 43

6.4.3.  DESCRIPTION

   ROLLBACK_BLOCK4 is a new operation modeled on COMMIT4 (see
   Section 18.3 of [RFC8881]) with additional semantics over the
   block_owner for rolling back the writing of blocks.  As such, all of
   the normal semantics of COMMIT4 directly apply.

   The main difference between the two operations is that
   ROLLBACK_BLOCK4 works on blocks and not a raw data stream.  As such,
   rba_offset is the starting block offset in the file and not the byte
   offset in the file.  Some erasure encoding types can have different
   block sizes depending on the block type.  Further, rba_count is a
   count of blocks to roll back and not bytes to roll back.

   While it may appear that the combination of rba_offset and rba_count
   is redundant with rba_blocks, the purpose of rba_blocks is to allow
   the data server to differentiate between potentially multiple
   pending blocks.

   ROLLBACK_BLOCK4 deletes prior WRITE_BLOCK4 transactions.  In case of
   write holes, it allows the client to undo transactions to repair the
   file.
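
   As a minimal sketch, and not part of the protocol, the following
   Python models how a data server might select which pending blocks a
   ROLLBACK_BLOCK4 removes.  The in-memory pending map, the BlockOwner
   tuple, and the rollback_blocks function are hypothetical names.
   Matching on (block_id, change_id, client_id) and keying the map by
   block offset are assumptions about how rba_blocks differentiates
   pending blocks.

```python
from typing import NamedTuple


class BlockOwner(NamedTuple):
    """Hypothetical stand-in for block_owner4 (bo_activated omitted)."""
    block_id: int
    change_id: int
    client_id: int


def rollback_blocks(pending, offset, count, owners):
    """Drop pending writes in [offset, offset + count) that match an owner.

    pending maps a block offset to the list of BlockOwner entries that
    have been written there but not yet activated (an assumed model).
    Returns the list of removed entries.
    """
    wanted = set(owners)
    removed = []
    for blk in range(offset, offset + count):
        if blk not in pending:
            continue
        keep = []
        for owner in pending[blk]:
            (removed if owner in wanted else keep).append(owner)
        pending[blk] = keep
    return removed


# Two clients raced on block 3; client 6 undoes only its own write.
pending = {3: [BlockOwner(3, 1, 6), BlockOwner(3, 1, 9)]}
removed = rollback_blocks(pending, offset=3, count=1,
                          owners=[BlockOwner(3, 1, 6)])
```

   In this model the offset and count bound the range that is scanned,
   while the owner list decides which of several pending writes to the
   same block is discarded.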

6.5.  Operation 81: WRITE_BLOCK4 - Write Blocks to File

6.5.1.  ARGUMENTS

   /// const WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY   = 0x00000001;
   /// const WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY    = 0x00000002;

                                 Figure 44

   /// struct write_block4 {
   ///     uint32_t        wb_crc;
   ///     uint32_t        wb_effective_len;
   ///     uint32_t        wb_flags;
   ///     opaque          wb_block<>;
   /// };

                                 Figure 45

   /// struct guard_block_owner4 {
   ///     changeid4   gbo_change_id;
   ///     clientid4   gbo_client_id;
   /// };

                                 Figure 46

   /// union write_block_guard4 switch (bool wbg_check) {
   ///     case TRUE:
   ///         guard_block_owner4   wbg_block_owner;
   ///     case FALSE:
   ///         void;
   /// };

                                 Figure 47

   /// struct WRITE_BLOCK4args {
   ///     /* CURRENT_FH: file */
   ///     stateid4           wba_stateid;
   ///     offset4            wba_offset;
   ///     stable_how4        wba_stable;
   ///     block_owner4       wba_owner;
   ///     uint32_t           wba_seq_id;
   ///     write_block_guard4 wba_guard;
   ///     write_block4       wba_data<>;
   /// };

                                 Figure 48

6.5.2.  RESULTS

   /// struct WRITE_BLOCK4resok {
   ///     count4          wbr_count;
   ///     stable_how4     wbr_committed;
   ///     verifier4       wbr_writeverf;
   ///     block_owner4    wbr_owners<>;
   /// };

                                 Figure 49

   /// union WRITE_BLOCK4res switch (nfsstat4 wbr_status) {
   ///     case NFS4_OK:
   ///         WRITE_BLOCK4resok    wbr_resok4;
   ///     default:
   ///         void;
   /// };

                                 Figure 50

6.5.3.  DESCRIPTION

   WRITE_BLOCK4 is WRITE4 (see Section 18.32 of [RFC8881]) with
   additional semantics over the block_owner and the activation of
   blocks.  As such, all of the normal semantics of WRITE4 directly
   apply.

   The main difference between the two operations is that WRITE_BLOCK4
   works on blocks and not a raw data stream.  As such wba_offset is the
   starting block offset in the file and not the byte offset in the
   file.  Some erasure encoding types can have different block sizes
   depending on the block type.  Further, wbr_count is a count of
   written blocks and not written bytes.

   If wba_stable is FILE_SYNC4, the data server MUST commit the written
   header and block data plus all file system metadata to stable storage
   before returning results.  This corresponds to the NFSv2 protocol
   semantics.  Any other behavior constitutes a protocol violation.  If
   wba_stable is DATA_SYNC4, then the data server MUST commit all of the
   header and block data to stable storage and enough of the metadata to
   retrieve the data before returning.  The data server implementer is
   free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but
   with a possible performance drop.  If wba_stable is UNSTABLE4, the
   data server is free to commit any part of the header and block data
   and the metadata to stable storage, including all or none, before
   returning a reply to the client.  There is no guarantee whether or
   when any uncommitted data will subsequently be committed to stable
   storage.  The only guarantees made by the data server are that it
   will not destroy any data without changing the value of writeverf and
   that it will not commit the data and metadata at a level less than
   that requested by the client.
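
   The last guarantee above can be sketched as follows.  The numeric
   values are those of stable_how4 in [RFC8881].  The function names
   and the notion of a per-server minimum policy are hypothetical and
   serve only to illustrate that the reported level is never lower
   than the requested one.

```python
from enum import IntEnum


class StableHow(IntEnum):
    """Values of stable_how4 as defined in RFC 8881."""
    UNSTABLE4 = 0
    DATA_SYNC4 = 1
    FILE_SYNC4 = 2


def committed_level(requested, server_policy):
    """Return the level a data server could report in wbr_committed.

    server_policy is a hypothetical per-server minimum.  The result is
    never lower than what the client requested in wba_stable.
    """
    return max(requested, server_policy)


def reply_is_valid(requested, committed):
    """Client-side check: the reply must not be weaker than the request."""
    return committed >= requested
```

   A data server that always syncs everything would pass FILE_SYNC4 as
   its policy, which matches the note that DATA_SYNC4 may be
   implemented in the same fashion as FILE_SYNC4.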

   The activation of header and block data interacts with the
   bo_activated field of each of the written blocks.  If the data is
   not committed to stable storage, then the bo_activated field MUST
   NOT be set to true.  Once the data is committed to stable storage,
   the data server can set the block's bo_activated if one of these
   conditions applies:

   *  it is the first write to that block and the
      WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY flag is set

   *  an ACTIVATE_BLOCK4 is issued later for that block.
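
   The write-time half of this rule can be sketched as a small
   predicate.  The function name and its boolean inputs are
   hypothetical.  Only the flag value comes from this document.  The
   second condition, a later ACTIVATE_BLOCK4, happens outside
   WRITE_BLOCK4 and is therefore not modeled here.

```python
WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY = 0x00000002


def activated_after_write(committed_stable, first_write, wb_flags):
    """Return the bo_activated value a data server could report.

    committed_stable: the block reached stable storage.
    first_write: no earlier write exists for this block.
    wb_flags: the wb_flags field sent with the block.
    """
    if not committed_stable:
        # MUST NOT be set before the data is on stable storage.
        return False
    wants_activation = bool(wb_flags & WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY)
    return first_write and wants_activation
```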

   There are subtle interactions with write holes caused by racing
   clients.  One client could win the race for every block but, because
   it used a wba_stable of UNSTABLE4, a second client whose later writes
   use a wba_stable of FILE_SYNC4 can be the one for which bo_activated
   is set to true for each of the blocks in the payload.

   Finally, the interaction with wba_stable can cause a client that
   receives a response with bo_activated set to false to mistakenly
   believe that the blocks remain inactive.  A subsequent READ_BLOCK4
   or READ_BLOCK_STATUS4 might show that bo_activated is true without
   the client having issued an ACTIVATE_BLOCK4.

   // Automatic setting of bo_activated to true if it is the first write
   // should be a performance boost.  But it can lead to the client
   // having incorrect information (as above) and trying to
   // ACTIVATE_BLOCK4 a payload that has lost the race.  But is that
   // bad?  If you have racing clients, there is no guarantee at all as
   // to the contents of the file.
   //
   // -- TH

6.5.3.1.  Guarding the Write

   A guarded WRITE_BLOCK4 is one in which wba_guard.wbg_check is set.
   The write of a block MUST fail unless the target block has both the
   same change_id as gbo_change_id and the same client_id as
   gbo_client_id.  This is useful in read-update-write scenarios: the
   client reads a block, updates it, and is prepared to write it back.
   It guards the write such that if another writer has modified the
   block in the meantime, the data server will reject the modification.

   Note that as the guard_block_owner4 (see Figure 46) does not have a
   block_id and the WRITE_BLOCK4 applies to all blocks in the range of
   wba_offset to the length of wba_data, each of the target blocks
   MUST have the same change_id and client_id.  The client SHOULD
   present the smallest possible set of blocks to meet this
   requirement.
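
   A minimal sketch of the guard check follows.  It assumes that the
   data server vets every target block before writing any of them,
   which this document does not yet specify.  The Guard tuple and the
   guard_allows_write function are hypothetical names.

```python
from typing import NamedTuple, Optional


class Guard(NamedTuple):
    """Hypothetical stand-in for guard_block_owner4."""
    change_id: int
    client_id: int


def guard_allows_write(targets, guard: Optional[Guard]):
    """Return True if every target block matches the guard.

    targets is a list of (change_id, client_id) pairs, one per block in
    the range being written.  guard is None when wbg_check is FALSE,
    in which case the write is unguarded and always allowed.
    """
    if guard is None:
        return True
    expected = (guard.change_id, guard.client_id)
    return all(target == expected for target in targets)
```

   Because one guard covers the whole range, a single block with a
   different change_id or client_id is enough to reject the request in
   this model, which is why a small range is preferable.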

   // And the complexity goes up here.  Does the DS reject only based on
   // active blocks?  Or can inactive ones also cause rejection?
   //
   // -- TH

   // Is the DS supposed to vet all blocks first or proceed to the first
   // error?  Or do all blocks and return an array of errors?  (This
   // last one is a no-go.)  Also, if we do the vet first, what happens
   // if a WRITE_BLOCK4 comes in after the vetting?  Are we to lock the
   // file during this process?  Even if we do that, we still have the
   // issue of multiple DSes.
   //
   // -- TH

6.5.3.2.  Updating the Header Only

   Some erasure encoding types keep their blocks in plain text and have
   parity blocks in order to provide integrity.  A common configuration
   for Reed-Solomon is 4 active blocks, 2 parity blocks, and 2 spares.
   Assuming 4kB data blocks, then each payload delivers 16kB of data and
   8kB of parity.  If the application modifies the first data block,
   then all that needs to change is the first active block and the two
   parity blocks in the payload.

   In any other approach, only 12kB of the total 24kB (one 4kB data
   block and two 4kB parity blocks) has to be written to storage.  If
   that is attempted in the Flexible Files Version 2 Layout Type, then
   the payload will be deemed inconsistent.  The reason for this is
   that the change_id for the unmodified blocks will not match those of
   the modified blocks.

   The WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY flag in wb_flags can be used
   to avoid transmitting the unmodified blocks.  If it is set, then
   wb_block is ignored and MUST be empty.  Note that the client MUST
   modify only the wb_crc and the wba_owner.bo_change_id fields in
   this case.  The wb_crc MUST change as the wba_owner.bo_change_id has
   been modified (see Section 3.1.1).
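
   The saving can be illustrated with the 4+2 example from this
   section.  The sketch below counts only block data and ignores header
   overhead.  It assumes that, without the flag, every block of the
   payload is resent so that all change_ids match.  That is one reading
   of the text above and not a normative statement.  All names are
   hypothetical.

```python
BLOCK_SIZE = 4 * 1024  # 4kB blocks, as in the example in this section
DATA_BLOCKS = 4
PARITY_BLOCKS = 2


def payload_bytes(modified_data_blocks, header_only):
    """Block data sent for one payload, ignoring header overhead.

    header_only False: every data and parity block is resent.
    header_only True: unmodified data blocks are sent with the
    UPDATE_HEADER_ONLY flag and an empty wb_block.
    """
    if header_only:
        return (modified_data_blocks + PARITY_BLOCKS) * BLOCK_SIZE
    return (DATA_BLOCKS + PARITY_BLOCKS) * BLOCK_SIZE


full = payload_bytes(1, header_only=False)    # all six blocks
saved = payload_bytes(1, header_only=True)    # one data + two parity
```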

   For the purpose of computing the activation state of the block, the
   data server MUST treat this as an overwrite.  Thus, in the response,
   bo_activated MUST be false.

7.  Extraction of XDR

   This document contains the external data representation (XDR)
   [RFC4506] description of the Flexible Files Version 2 Layout Type.
   The XDR description is embedded in this document in a way that makes
   it simple for the reader to extract into a ready-to-compile form.
   The reader can feed this document into the following shell script to
   produce the machine-readable XDR description of the layout type:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called 'extract.sh',
   and this document is in a file called 'spec.txt', then the reader can
   do:

   sh extract.sh < spec.txt > erasure_coding_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of '///'.  XDR descriptions with the
   sentinel sequence are embedded throughout the document.
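
   For readers without a POSIX shell, the same filter can be sketched
   in Python.  This is an illustrative equivalent of the script above
   and not a replacement for it.  The function name extract_xdr is
   hypothetical.

```python
import re


def extract_xdr(text):
    """Return the XDR lines embedded in text, without the sentinel."""
    out = []
    for line in text.splitlines():
        # Same selection as the grep step: optional spaces, then slashes.
        if not re.match(r" *///", line):
            continue
        # Same two substitutions as the sed steps.
        line = re.sub(r"^ */// ", "", line)
        line = re.sub(r"^ *///$", "", line)
        out.append(line)
    return "\n".join(out)


sample = "prose\n   /// struct x {\n   ///     int a;\n   ///\n   /// };\n"
xdr = extract_xdr(sample)
```

   Indentation after the sentinel is preserved, and a bare sentinel
   line becomes an empty line, as with the shell version.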

   Note that the XDR code contained in this document depends on types
   from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]) and the
   Flexible Files Layout Type flexfiles.x file (generated from
   [RFC8435]).  This includes both NFS types that end with a 4, such as
   offset4, length4, etc., as well as more generic types such as
   uint32_t and uint64_t.

   While the XDR can be appended to that from [RFC7863], the various
   code snippets belong in their respective areas of that XDR.

8.  Security Considerations

   This document has the same security considerations as both Flex Files
   Layout Type version 1 (see Section 15 of [RFC8435]) and NFSv4.2 (see
   Section 17 of [RFC7862]).

9.  IANA Considerations

9.1.  pNFS Layout Types Registry

   [RFC8881] introduced the 'pNFS Layout Types Registry'; new layout
   type numbers in this registry need to be assigned by IANA.  This
   document defines the protocol associated with an existing layout type
   number: LAYOUT4_FLEX_FILES_V2 (see Table 1).

    +=======================+=======+==========+=====+================+
    | Layout Type Name      | Value | RFC      | How | Minor Versions |
    +=======================+=======+==========+=====+================+
    | LAYOUT4_FLEX_FILES_V2 | 0x6   | RFCTBD10 | L   | 1              |
    +-----------------------+-------+----------+-----+----------------+

                      Table 1: Layout Type Assignments

9.2.  NFSv4 Recallable Object Types Registry

   [RFC8881] also introduced the 'NFSv4 Recallable Object Types
   Registry'.  This document defines new recallable objects for
   RCA4_TYPE_MASK_FFV2_LAYOUT_MIN and RCA4_TYPE_MASK_FFV2_LAYOUT_MAX
   (see Table 2).

   +================================+=======+==========+===+==========+
   | Recallable Object Type Name    | Value | RFC      |How| Minor    |
   |                                |       |          |   | Versions |
   +================================+=======+==========+===+==========+
   | RCA4_TYPE_MASK_FFV2_LAYOUT_MIN | 20    | RFCTBD10 |L  | 1        |
   +--------------------------------+-------+----------+---+----------+
   | RCA4_TYPE_MASK_FFV2_LAYOUT_MAX | 21    | RFCTBD10 |L  | 1        |
   +--------------------------------+-------+----------+---+----------+

               Table 2: Recallable Object Type Assignments

9.3.  Flexible Files Version 2 Layout Type Erasure Encoding Type
      Registry

   This document introduces the 'Flexible Files Version 2 Layout Type
   Erasure Encoding Type Registry'.  This document defines the
   FFV2_ENCODING_MIRRORED type for Client-Side Mirroring (see Table 3).

    +============================+=======+==========+=====+==========+
    | Erasure Encoding Type Name | Value | RFC      | How | Minor    |
    |                            |       |          |     | Versions |
    +============================+=======+==========+=====+==========+
    | FFV2_ENCODING_MIRRORED     | 1     | RFCTBD10 | L   | 2        |
    +----------------------------+-------+----------+-----+----------+

      Table 3: Flexible Files Version 2 Layout Type Erasure Encoding
                             Type Assignments

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006, <https://www.rfc-editor.org/info/rfc4506>.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015, <https://www.rfc-editor.org/info/rfc7530>.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016, <https://www.rfc-editor.org/info/rfc7862>.

   [RFC7863]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 External Data Representation Standard (XDR)
              Description", RFC 7863, DOI 10.17487/RFC7863, November
              2016, <https://www.rfc-editor.org/info/rfc7863>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8178]  Noveck, D., "Rules for NFSv4 Extensions and Minor
              Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017,
              <https://www.rfc-editor.org/info/rfc8178>.

   [RFC8435]  Halevy, B. and T. Haynes, "Parallel NFS (pNFS) Flexible
              File Layout", RFC 8435, DOI 10.17487/RFC8435, August 2018,
              <https://www.rfc-editor.org/info/rfc8435>.

   [RFC8881]  Noveck, D., Ed. and C. Lever, "Network File System (NFS)
              Version 4 Minor Version 1 Protocol", RFC 8881,
              DOI 10.17487/RFC8881, August 2020,
              <https://www.rfc-editor.org/info/rfc8881>.

10.2.  Informative References

   [Plank97]  Plank, J., "A Tutorial on Reed-Solomon Coding for Fault-
              Tolerance in RAID-like Systems", September 1997,
              <http://web.eecs.utk.edu/~jplank/plank/papers/CS-96-332.html>.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995,
              <https://www.rfc-editor.org/info/rfc1813>.

Appendix A.  Acknowledgments

   The following from Hammerspace were instrumental in driving Flex
   Files v2: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron,
   Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.

   Christoph Hellwig was instrumental in making sure Flexible Files
   Version 2 Layout Type was applicable to more than one Erasure-
   Encoding Type.

Appendix B.  RFC Editor Notes

   This section is to be removed before publishing as an RFC.

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]

Author's Address

   Thomas Haynes
   Hammerspace
   Email: loghyr@gmail.com
