NFSv4                                                         C. Hellwig
Internet-Draft                                             July 02, 2017
Intended status: Standards Track
Expires: January 3, 2018


                    Parallel NFS (pNFS) RDMA Layout
                 draft-hellwig-nfsv4-rdma-layout-00.txt

Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The RDMA Layout Type is defined in this document
   as an extension to pNFS to allow the use of RDMA Verbs operations to
   access remote storage, with a special focus on accessing byte
   addressable persistent memory.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of




Hellwig                  Expires January 3, 2018                [Page 1]


Internet-Draft              pNFS RDMA Layout                   July 2017


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Conventions Used in This Document . . . . . . . . . . . .   3
     1.2.  General Definitions . . . . . . . . . . . . . . . . . . .   3
     1.3.  Code Components Licensing Notice  . . . . . . . . . . . .   4
     1.4.  XDR Description . . . . . . . . . . . . . . . . . . . . .   4
   2.  RDMA Layout Description . . . . . . . . . . . . . . . . . . .   6
     2.1.  Background and Architecture . . . . . . . . . . . . . . .   6
     2.2.  layouttype4 . . . . . . . . . . . . . . . . . . . . . . .   6
     2.3.  Device Addressing and Discovery . . . . . . . . . . . . .   7
       2.3.1.  pnfs_rdma_device_addr4  . . . . . . . . . . . . . . .   7
     2.4.  Data Structures: Extents and Extent Lists . . . . . . . .   7
       2.4.1.  Layout Requests and Extent Lists  . . . . . . . . . .   9
       2.4.2.  Layout Commits  . . . . . . . . . . . . . . . . . . .  11
       2.4.3.  Layout Returns  . . . . . . . . . . . . . . . . . . .  11
       2.4.4.  Layout Revocation . . . . . . . . . . . . . . . . . .  12
       2.4.5.  Client Copy-on-Write Processing . . . . . . . . . . .  12
       2.4.6.  Extents are Permissions . . . . . . . . . . . . . . .  13
       2.4.7.  End-of-file Processing  . . . . . . . . . . . . . . .  14
       2.4.8.  Layout Hints  . . . . . . . . . . . . . . . . . . . .  15
     2.5.  Crash Recovery Issues . . . . . . . . . . . . . . . . . .  15
     2.6.  Transient and Permanent Errors  . . . . . . . . . . . . .  15
   3.  Security Considerations . . . . . . . . . . . . . . . . . . .  16
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  17
   5.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  17
     5.1.  Normative References  . . . . . . . . . . . . . . . . . .  17
     5.2.  Informative References  . . . . . . . . . . . . . . . . .  18
   Appendix A.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  18
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  18

1.  Introduction

   Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
   system:

       +-----------+
       |+-----------+                                 +-----------+
       ||+-----------+                                |           |
       |||           |       NFSv4.1 + pNFS           |           |
       +||  Clients  |<------------------------------>|   Server  |
        +|           |                                |           |
         +-----------+                                |           |
              |||                                     +-----------+
              |||                                           |
              |||                                           |
              ||| Storage        +-----------+              |
              ||| Protocol       |+-----------+             |
              ||+----------------||+-----------+  Control   |
              |+-----------------|||           |    Protocol|
              +------------------+||  Storage  |------------+
                                  +|  Systems  |
                                   +-----------+

                                 Figure 1

   The overall approach is that pNFS-enhanced clients obtain sufficient
   information from the server to enable them to access the underlying
   storage (on the storage systems) directly.  See Section 12 of
   [RFC5661] for more details.  RDMA ([RFC5040] [RFC5041] [IBARCH]) is a
   technique for moving data efficiently between end nodes.  By
   directing data into destination buffers as it is sent on a network,
   and placing it via direct memory access by hardware, the benefits of
   faster transfers and reduced host overhead are obtained.  Unlike the
   RPC-over-RDMA transport [RFC8166], the pNFS RDMA layout does not
   transfer remote procedure calls over RDMA networks, but instead uses
   raw RDMA READ and WRITE operations to access a memory region exposed
   on a storage device.

1.1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2.  General Definitions

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader.

   Byte  This document defines a byte as an octet, i.e., a datum exactly
      8 bits in length.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote file
      system services for a set of applications.

   Server  The "server" is the entity responsible for coordinating
      client access to a set of file systems and is identified by a
      server owner.

   metadata server (MDS)  The metadata server is a pNFS server which
      provides metadata information for a file system object.  It also
      is responsible for generating layouts for file system objects.
      Note that the MDS is also responsible for directory-based
      operations.

1.3.  Code Components Licensing Notice

   The external data representation (XDR) description and scripts for
   extracting the XDR description are Code Components as described in
   Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL].
   These Code Components are licensed according to the terms of
   Section 4 of "Legal Provisions Relating to IETF Documents".

1.4.  XDR Description

   This document contains the XDR [RFC4506] description of the NFSv4.1
   RDMA layout protocol.  The XDR description is embedded in this
   document in a way that makes it simple for the reader to extract into
   a ready-to-compile form.  The reader can feed this document into the
   following shell script to produce the machine readable XDR
   description of the NFSv4.1 RDMA layout:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called "extract.sh",
   and this document is in a file called "spec.txt", then the reader can
   do:

   sh extract.sh < spec.txt > rdma_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".
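
   For readers without a POSIX shell at hand, the same extraction can be
   sketched in Python.  This is a minimal, non-normative sketch that
   mirrors the grep and sed steps of the script above; the function
   name "extract_xdr" is an illustrative assumption and is not defined
   by this document.

```python
import re

_SENTINEL = re.compile(r"^ *///")


def extract_xdr(text):
    """Return the XDR description embedded in `text` (hypothetical helper)."""
    out = []
    for line in text.splitlines():
        if not _SENTINEL.match(line):
            continue  # like grep '^ *///': keep only sentinel lines
        line = re.sub(r"^ */// ", "", line)  # like sed 's?^ */// ??'
        line = re.sub(r"^ *///$", "", line)  # like sed 's?^ *///$??'
        out.append(line)
    return "".join(kept + "\n" for kept in out)
```

   Calling extract_xdr() on the text of this document and writing the
   result to "rdma_prot.x" should give the same file as the shell
   pipeline.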

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   each marked with the sentinel sequence, are embedded throughout the
   document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both NFS
   types that end with a 4, such as offset4, length4, etc., as well as
   more generic types such as uint32_t and uint64_t.

      /// /*
      ///  * This code was derived from RFCTBD10
      ///  * Please reproduce this note if possible.
      ///  */
      /// /*
      ///  * Copyright (c) 2010,2015 IETF Trust and the persons
      ///  * identified as the document authors.  All rights reserved.
      ///  *
      ///  * Redistribution and use in source and binary forms, with
      ///  * or without modification, are permitted provided that the
      ///  * following conditions are met:
      ///  *
      ///  * - Redistributions of source code must retain the above
      ///  *   copyright notice, this list of conditions and the
      ///  *   following disclaimer.
      ///  *
      ///  * - Redistributions in binary form must reproduce the above
      ///  *   copyright notice, this list of conditions and the
      ///  *   following disclaimer in the documentation and/or other
      ///  *   materials provided with the distribution.
      ///  *
      ///  * - Neither the name of Internet Society, IETF or IETF
      ///  *   Trust, nor the names of specific contributors, may be
      ///  *   used to endorse or promote products derived from this
      ///  *   software without specific prior written permission.
      ///  *
      ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
      ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
      ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
      ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
      ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
      ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
      ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
      ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
      ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
      ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
      ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
      ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
      ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
      ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
      ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
      ///  */
      ///

      /// /*
      ///  *      nfs4_rdma_layout_prot.x
      ///  */
      ///
      /// %#include "nfsv41.h"
      ///

2.  RDMA Layout Description

2.1.  Background and Architecture

   A pNFS RDMA layout is responsible for mapping from an NFS file (or
   portion of a file) to memory regions that contain the file.  These
   regions are expressed as extents with 64-bit offsets and lengths
   using the existing NFSv4 offset4 and length4 types, and map to memory
   regions that the server has registered and for which it exposes a
   handle (R_Key or STag) that allows RDMA READ and RDMA WRITE
   operations from the client.

   The pNFS operation for requesting a layout (LAYOUTGET) includes the
   "layoutiomode4 loga_iomode" argument, which indicates whether the
   requested layout is for read-only use or read-write use.  A read-only
   layout may contain holes that are read as zero, whereas a read-write
   layout will contain allocated, but un-initialized storage in those
   holes (read as zero, can be written by client).  This document also
   supports client participation in copy-on-write (e.g., for file
   systems with snapshots) by providing both read-only and un-
   initialized storage for the same extent in a layout.  Reads are
   initially performed on the read-only storage, with writes going to
   the un-initialized storage.  After the first write that initializes
   the un-initialized storage, all reads are performed to that now-
   initialized writable storage, and the corresponding read-only storage
   is no longer used.

2.2.  layouttype4

   The layouttype4 enumeration defined in [RFC5662] is extended with a
   new value as follows:

       enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 1,
           LAYOUT4_OSD2_OBJECTS    = 2,
           LAYOUT4_BLOCK_VOLUME    = 3,
           LAYOUT4_SCSI            = 4,
           LAYOUT4_RDMA            = 0x80000006
   [[RFC Editor: please modify the LAYOUT4_RDMA
     to be the layouttype assigned by IANA]]
       };

   This document defines the structures associated with the layouttype4
   value LAYOUT4_RDMA.  [RFC5661] specifies the loc_body structure as an
   XDR type "opaque".  The opaque layout is uninterpreted by the generic
   pNFS client layers, but must be interpreted by the Layout Type
   implementation.

2.3.  Device Addressing and Discovery

   Data operations to a storage device require the client to know the
   network address of the storage device.  The NFSv4.1+ GETDEVICEINFO
   operation (Section 18.40 of [RFC5661]) is used by the client to
   retrieve that information.

2.3.1.  pnfs_rdma_device_addr4

   The "pnfs_rdma_device_addr4" data structure is returned by the server
   as the storage-protocol-specific opaque field da_addr_body in the
   "device_addr4" structure by a successful GETDEVICEINFO operation
   [RFC5661].  It contains the network address of the storage device.
   The RDMA Connection Manager (RDMA/CM) shall be used to establish the
   queue pair for the RDMA READ and RDMA WRITE operations used by the
   layout.  Details of connection establishment will be provided in
   future versions of this document.

    /// struct pnfs_rdma_device_addr4 {
    ///       netaddr4       addr; /* address of the device */
    /// };
    ///
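
   As an illustration, a client might decode the da_addr_body as
   follows.  This is a hedged sketch: it assumes the body holds exactly
   one netaddr4 as defined in [RFC5661] (two XDR strings, na_r_netid and
   na_r_addr), and that the address is an IPv4 universal address of the
   form "h1.h2.h3.h4.p1.p2".  The function names, the "rdma" netid, and
   the sample address in the usage note are assumptions for illustration
   only; this document does not yet specify connection establishment.

```python
import struct


def _unpack_xdr_string(buf, pos):
    """Decode one XDR string<> at `pos`; return (text, next_pos)."""
    (length,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    raw = buf[pos:pos + length]
    if len(raw) != length:
        raise ValueError("truncated XDR string")
    # XDR pads variable-length data to a multiple of four bytes.
    return raw.decode("ascii"), pos + ((length + 3) & ~3)


def decode_rdma_device_addr(da_addr_body):
    """Return (netid, universal_address) from a pnfs_rdma_device_addr4."""
    netid, pos = _unpack_xdr_string(da_addr_body, 0)
    uaddr, _ = _unpack_xdr_string(da_addr_body, pos)
    return netid, uaddr


def split_ipv4_uaddr(uaddr):
    """Split 'h1.h2.h3.h4.p1.p2' into (host, port); IPv4 only."""
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("not an IPv4 universal address")
    octets = [int(p) for p in parts]
    if any(not 0 <= o <= 255 for o in octets):
        raise ValueError("octet out of range")
    return ".".join(parts[:4]), octets[4] * 256 + octets[5]
```

   For example, a universal address of "192.0.2.10.8.1" splits into the
   host "192.0.2.10" and the port 8 * 256 + 1.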

2.4.  Data Structures: Extents and Extent Lists

   A pNFS RDMA layout is a list of extents within a flat array of data
   in a device.  The RDMA layout describes the individual byte ranges
   (extents) on the device that make up the file.  The offsets and
   lengths contained in an extent are specified in units of bytes.

    /// enum pnfs_rdma_extent_state4 {
    ///     PNFS_RDMA_READ_WRITE_DATA = 0, /* the data located by
    ///                                       this extent is valid
    ///                                       for reading and
    ///                                       writing. */
    ///     PNFS_RDMA_READ_DATA      = 1,  /* the data located by this
    ///                                       extent is valid for
    ///                                       reading only; it may not
    ///                                       be written. */
    ///     PNFS_RDMA_INVALID_DATA   = 2,  /* the location is valid; the
    ///                                       data is invalid.  It is a
    ///                                       newly (pre-) allocated
    ///                                       extent.  The client MUST
    ///                                       NOT read from this
    ///                                       space */
    ///     PNFS_RDMA_NONE_DATA      = 3   /* the location is invalid.
    ///                                       It is a hole in the file.
    ///                                       The client MUST NOT read
    ///                                       from or write to this
    ///                                       space */
    /// };


    ///
    /// struct pnfs_rdma_extent4 {
    ///     deviceid4    re_device_id;     /* id of the device on
    ///                                       which extent of file is
    ///                                       stored. */
    ///     offset4      re_file_offset;   /* starting byte offset
    ///                                       in the file */
    ///     uint32_t     re_handle;        /* registered memory
    ///                                       handle */
    ///     length4      re_length;        /* size in bytes of the
    ///                                       extent */
    ///     offset4      re_storage_offset;/* starting byte offset
    ///                                       in the volume */
    ///     pnfs_rdma_extent_state4 re_state;
    ///                                    /* state of this extent */
    /// };
    ///

    /// /* RDMA layout-specific type for loc_body */
    /// struct pnfs_rdma_layout4 {
    ///     pnfs_rdma_extent4 rl_extents<>;
    ///                                    /* extents which make up this
    ///                                       layout. */
    /// };
    ///

   The RDMA layout consists of a list of extents that map the regions of
   the file to locations on a device.  The "re_storage_offset" field
   within each extent identifies a location on the device specified by
   the "re_device_id" field in the extent.

   Each extent maps a region of the file onto a portion of the specified
   device.  The re_file_offset, re_length, and re_state fields for an
   extent returned from the server are valid for all extents.  In
   contrast, the interpretation of the re_storage_offset field depends
   on the value of re_state as follows (in increasing order):

   PNFS_RDMA_READ_WRITE_DATA  means that re_storage_offset is valid, and
      points to valid/initialized data that can be read and written.

   PNFS_RDMA_READ_DATA  means that re_storage_offset is valid and points
      to valid/initialized data that can only be read.  Write operations
      are prohibited.

   PNFS_RDMA_INVALID_DATA  means that re_storage_offset is valid, but
      points to invalid un-initialized data.  This data MUST NOT be read
      from the device until it has been initialized.  A read request for
      a PNFS_RDMA_INVALID_DATA extent MUST fill the user buffer with
      zeros, unless the extent is covered by a PNFS_RDMA_READ_DATA
      extent of a copy-on-write file system.  Write requests MUST write
      whole server-sized blocks to the device; bytes not initialized by
      the user MUST be set to zero.  Any write to parts of a device
      covered by a PNFS_RDMA_INVALID_DATA extent changes the written
      portion of the extent to PNFS_RDMA_READ_WRITE_DATA; the pNFS
      client is responsible for reporting this change via LAYOUTCOMMIT.

   PNFS_RDMA_NONE_DATA  means that re_storage_offset is not valid, and
      this extent MUST NOT be used to satisfy write requests.  Read
      requests MAY be satisfied by zero-filling as for
      PNFS_RDMA_INVALID_DATA.  PNFS_RDMA_NONE_DATA extents MAY be
      returned by requests for readable extents; they are never returned
      if the request was for a writable extent.
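
   The read-side consequences of the four states can be sketched as a
   lookup that decides, for one file offset, whether the client issues
   an RDMA READ or zero-fills the buffer.  This is an illustration only:
   it assumes the client keeps its extent list up to date (including
   local conversion of written PNFS_RDMA_INVALID_DATA ranges), and it
   returns the offset within the device rather than a remote address,
   since that mapping is not spelled out here.  All names are
   hypothetical.

```python
from collections import namedtuple

Extent = namedtuple(
    "Extent", "file_offset length handle storage_offset state")
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = range(4)


def resolve_read(extents, offset):
    """Return ("rdma", handle, device_offset) or ("zero",) for `offset`."""
    covering = [e for e in extents
                if e.file_offset <= offset < e.file_offset + e.length]
    if not covering:
        raise LookupError("no extent held for this offset")
    for e in covering:
        # Initialized data wins, including a READ_DATA extent that
        # overlaps an INVALID_DATA extent for copy-on-write.
        if e.state in (READ_WRITE_DATA, READ_DATA):
            delta = offset - e.file_offset
            return ("rdma", e.handle, e.storage_offset + delta)
    # Only INVALID_DATA or NONE_DATA cover the offset: reads as zeros.
    return ("zero",)
```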

   An extent list contains all relevant extents in increasing order of
   the re_file_offset of each extent; any ties are broken by increasing
   order of the extent state (re_state).

2.4.1.  Layout Requests and Extent Lists

   Each request for a layout specifies at least three parameters: file
   offset, desired size, and minimum size.  If the status of a request
   indicates success, the extent list returned MUST meet the following
   criteria:

   o  A request for a readable (but not writable) layout MUST return
      either PNFS_RDMA_READ_DATA or PNFS_RDMA_NONE_DATA extents.  It
      SHALL NOT return PNFS_RDMA_INVALID_DATA or
      PNFS_RDMA_READ_WRITE_DATA extents.

   o  A request for a writable layout MUST return
      PNFS_RDMA_READ_WRITE_DATA or PNFS_RDMA_INVALID_DATA extents, and
      it MAY return additional PNFS_RDMA_READ_DATA extents for ranges
      covered by PNFS_RDMA_INVALID_DATA extents to allow client-side
      copy-on-write operations.  A request for a writable layout SHALL
      NOT return PNFS_RDMA_NONE_DATA extents.

   o  The first extent in the list MUST contain the requested starting
      offset.

   o  The total size of extents within the requested range MUST cover at
      least the minimum size.  One exception is allowed: the total size
      MAY be smaller if only readable extents were requested and EOF is
      encountered.

   o  Extents in the extent list MUST be logically contiguous for a
      read-only layout.  For a read-write layout, the set of writable
      extents (i.e., excluding PNFS_RDMA_READ_DATA extents) MUST be
      logically contiguous.  Every PNFS_RDMA_READ_DATA extent in a read-
      write layout MUST be covered by one or more PNFS_RDMA_INVALID_DATA
      extents.  This overlap of PNFS_RDMA_READ_DATA and
      PNFS_RDMA_INVALID_DATA extents is the only permitted extent
      overlap.

   o  Extents MUST be ordered in the list by starting offset, with
      PNFS_RDMA_READ_DATA extents preceding PNFS_RDMA_INVALID_DATA
      extents in the case of equal re_file_offsets.
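
   A client that wants to sanity-check a returned extent list against
   the criteria above could do so along the following lines.  This is a
   rough sketch, not a complete conformance test: the hit_eof flag
   stands in for the client's own knowledge that EOF was encountered,
   and every name is an assumption made for the example.

```python
from collections import namedtuple

Extent = namedtuple("Extent", "file_offset length state")
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = range(4)


def _end(e):
    return e.file_offset + e.length


def _is_covered(extent, covers):
    """True if every byte of `extent` lies inside the extents in `covers`."""
    pos = extent.file_offset
    for c in sorted(covers, key=lambda c: c.file_offset):
        if c.file_offset <= pos < _end(c):
            pos = _end(c)
    return pos >= _end(extent)


def check_layout(extents, offset, minlength, writable, hit_eof=False):
    """Return a list of violated criteria (empty if none were found)."""
    if not extents:
        return ["empty extent list"]
    errors = []
    if writable:
        allowed = {READ_WRITE_DATA, INVALID_DATA, READ_DATA}
    else:
        allowed = {READ_DATA, NONE_DATA}
    if any(e.state not in allowed for e in extents):
        errors.append("extent state not permitted for this iomode")
    keys = [(e.file_offset, e.state) for e in extents]
    if keys != sorted(keys):
        errors.append("extents not ordered by offset, then state")
    if not (extents[0].file_offset <= offset < _end(extents[0])):
        errors.append("first extent does not contain the requested offset")
    # In a writable layout, READ_DATA extents overlap INVALID_DATA
    # extents and are left out of the contiguity and size checks.
    main = [e for e in extents if not (writable and e.state == READ_DATA)]
    if any(_end(a) != b.file_offset for a, b in zip(main, main[1:])):
        errors.append("extents are not logically contiguous")
    if writable:
        invalid = [e for e in extents if e.state == INVALID_DATA]
        if any(e.state == READ_DATA and not _is_covered(e, invalid)
               for e in extents):
            errors.append(
                "READ_DATA extent not covered by INVALID_DATA extents")
    covered = sum(max(0, _end(e) - max(e.file_offset, offset))
                  for e in main)
    if covered < minlength and not (hit_eof and not writable):
        errors.append("extents cover less than the minimum size")
    return errors
```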

   The server shall ensure that it has registered handles for the memory
   regions that the extents in the layout refer to so that RDMA READ
   and/or RDMA WRITE requests can be performed by the client.  Multiple
   extents may refer to the same handle.  The handle shall be
   invalidated on a LAYOUTRETURN operation, including implicit layout
   returns as part of CB_LAYOUTRECALL operations, or when a layout is
   revoked.

   According to [RFC5661], if the minimum requested size,
   loga_minlength, is zero, this is an indication to the metadata server
   that the client desires any layout at offset loga_offset or less that
   the metadata server has "readily available".  Given the lack of a
   clear definition of this phrase, in the context of the RDMA layout
   type, when loga_minlength is zero, the metadata server SHOULD:

   o  when processing requests for readable layouts, return all such
      layouts, even if some extents are in the PNFS_RDMA_NONE_DATA
      state.

   o  when processing requests for writable layouts, return extents
      which can be returned in the PNFS_RDMA_READ_WRITE_DATA state.

2.4.2.  Layout Commits

    ///
    /// /* RDMA layout-specific type for lou_body */
    ///
    /// struct pnfs_rdma_range4 {
    ///     offset4      rr_file_offset;   /* starting byte offset
    ///                                       in the file */
    ///     length4      rr_length;        /* size in bytes */
    /// };
    ///
    /// struct pnfs_rdma_layoutupdate4 {
    ///     pnfs_rdma_range4 rlu_commit_list<>;
    ///                                    /* list of extents which
    ///                                     * now contain valid data.
    ///                                     */
    /// };

   The "pnfs_rdma_layoutupdate4" structure is used by the client as the
   RDMA layout-specific argument in a LAYOUTCOMMIT operation.  The
   "rlu_commit_list" field is a list covering regions of the file layout
   that were previously in the PNFS_RDMA_INVALID_DATA state, but have
   been written by the client and SHOULD now be considered in the
   PNFS_RDMA_READ_WRITE_DATA state.  The extents in the commit list MUST
   be disjoint and MUST be sorted by rr_file_offset.  Implementors
   should be aware that a server MAY be unable to commit regions at a
   granularity smaller than a file-system block (typically 4 KB or 8
   KB).  The block size that the server uses is available as an NFSv4
   attribute, and any extents included in the "rlu_commit_list" MUST be
   aligned to this granularity and have a size that is a multiple of
   this granularity.  Since the block in question
   is in state PNFS_RDMA_INVALID_DATA, byte ranges not written SHOULD be
   filled with zeros.  This applies even if it appears that the area
   being written is beyond what the client believes to be the end of
   file.
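
   One way a client might assemble the commit list is sketched below.
   The sketch widens each written byte range to block boundaries, merges
   the results so that the list is disjoint and sorted, and encodes it
   as a pnfs_rdma_layoutupdate4.  It assumes the client has already
   zero-filled the unwritten bytes of every block it reports, and that
   block_size was obtained from the server; the function names are
   hypothetical.

```python
import struct


def build_commit_list(written, block_size):
    """Turn written (file_offset, length) byte ranges into a sorted,
    disjoint, block-aligned list of (rr_file_offset, rr_length)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    aligned = []
    for off, length in written:
        if length <= 0:
            continue
        start = (off // block_size) * block_size              # round down
        end = -(-(off + length) // block_size) * block_size   # round up
        aligned.append((start, end))
    aligned.sort()
    merged = []
    for start, end in aligned:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e - s) for s, e in merged]


def encode_layoutupdate(ranges):
    """XDR-encode rlu_commit_list<> as a count plus (offset4, length4)."""
    out = [struct.pack(">I", len(ranges))]
    out += [struct.pack(">QQ", off, length) for off, length in ranges]
    return b"".join(out)
```

   With a 4096-byte block size, writes at (5000, 100), (100, 10), and
   (4096, 10) collapse into the single range (0, 8192).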

2.4.3.  Layout Returns

   A LAYOUTRETURN operation represents an explicit release of resources
   by the client.  This MAY be done in response to a CB_LAYOUTRECALL or
   before any recall, in order to avoid a future CB_LAYOUTRECALL.  When
   the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return
   type, then the layoutreturn_file4 data structure specifies the region
   of the file layout that is no longer needed by the client.

   The LAYOUTRETURN operation is done without any RDMA layout-specific
   data.  The opaque "lrf_body" field of the "layoutreturn_file4" data
   structure MUST have length zero.

2.4.4.  Layout Revocation

   Layouts MAY be unilaterally revoked by the server, due to the
   client's lease time expiring, or the client failing to return a
   recalled layout in a timely manner.  For the RDMA layout type this
   is accomplished by invalidating the handle for the remote memory
   region exposed to the client.  Once the invalidation has completed,
   the host channel adapter (HCA) will reject all access from the
   client to the memory region.

2.4.5.  Client Copy-on-Write Processing

   Copy-on-write is a mechanism used to support file and/or file system
   snapshots.  When writing to unaligned regions, or to regions smaller
   than a file system block, the writer MUST copy the portions of the
   original file data to a new location on disk.  This behavior can
   either be implemented on the client or the server.  The paragraphs
   below describe how a pNFS RDMA layout client implements access to a
   file that requires copy-on-write semantics.

   Distinguishing the PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_READ_DATA
   extent types in combination with the allowed overlap of
   PNFS_RDMA_READ_DATA extents with PNFS_RDMA_INVALID_DATA extents
   allows copy-on-write processing to be done by pNFS clients.  In
   classic NFS, this operation would be done by the server.  Since pNFS
   enables clients to do direct block access, it is useful for clients
   to participate in copy-on-write operations.  All pNFS RDMA layout
   clients MUST support this copy-on-write processing.

   When a client wishes to write data covered by a PNFS_RDMA_READ_DATA
   extent, it MUST have requested a writable layout from the server;
   that layout will contain PNFS_RDMA_INVALID_DATA extents to cover all
   the data ranges of that layout's PNFS_RDMA_READ_DATA extents.  More
   precisely, for any re_file_offset range covered by one or more
   PNFS_RDMA_READ_DATA extents in a writable layout, the server MUST
   include one or more PNFS_RDMA_INVALID_DATA extents in the layout that
   cover the same re_file_offset range.  When performing a write to such
   an area of a layout, the client MUST effectively copy the data from
   the PNFS_RDMA_READ_DATA extent for any partial blocks of
   re_file_offset and range, merge in the changes to be written, and
   write the result to the PNFS_RDMA_INVALID_DATA extent for the blocks
   for that re_file_offset and range.  That is, if entire blocks of data
   are to be overwritten by an operation, the corresponding
   PNFS_RDMA_READ_DATA blocks need not be fetched, but any partial-
   block writes MUST be merged with data fetched via PNFS_RDMA_READ_DATA
   extents before storing the result via PNFS_RDMA_INVALID_DATA extents.
   For the purposes of this discussion, "entire blocks" and "partial
   blocks" refer to the server's file-system block size.  Storing of
   data in a PNFS_RDMA_INVALID_DATA extent converts the written portion
   of the PNFS_RDMA_INVALID_DATA extent to a PNFS_RDMA_READ_WRITE_DATA
   extent; all subsequent reads MUST be performed from this extent; the
   corresponding portion of the PNFS_RDMA_READ_DATA extent MUST NOT be
   used after storing data in a PNFS_RDMA_INVALID_DATA extent.  If a
   client writes only a portion of an extent, the extent MAY be split at
   block aligned boundaries.

   When a client wishes to write data to a PNFS_RDMA_INVALID_DATA extent
   that is not covered by a PNFS_RDMA_READ_DATA extent, it MUST treat
   this write identically to a write to a file not involved with copy-
   on-write semantics.  Thus, data MUST be written in at least block-
   sized increments, aligned to multiples of block-sized offsets, and
   unwritten portions of blocks MUST be zero filled.
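
   The merge step described in this section can be illustrated with
   in-memory buffers standing in for the two extents.  This is a sketch
   under simplifying assumptions: "dst" models the storage behind a
   PNFS_RDMA_INVALID_DATA extent, "src" models the storage behind an
   overlapping PNFS_RDMA_READ_DATA extent (or is None when there is no
   such extent), both start at the same block-aligned file offset, and
   slice operations stand in for RDMA READ and RDMA WRITE.  The function
   name is hypothetical.

```python
def cow_write(dst, offset, data, block_size, src=None):
    """Write `data` at `offset` in whole blocks; return the
    (offset, length) of the block range that was stored."""
    if not data:
        return None
    start = (offset // block_size) * block_size
    end = -(-(offset + len(data)) // block_size) * block_size
    if end > len(dst) or (src is not None and end > len(src)):
        raise ValueError("write extends past the extent")
    staged = bytearray(end - start)  # unwritten bytes default to zero
    head = offset - start
    tail = head + len(data)
    if src is not None:
        # Fetch only the partial-block bytes before and after the new
        # data; blocks that are fully overwritten are never read.
        staged[:head] = src[start:offset]
        staged[tail:] = src[offset + len(data):end]
    staged[head:tail] = data
    dst[start:end] = staged
    return (start, end - start)
```

   The returned range is the portion that has become
   PNFS_RDMA_READ_WRITE_DATA and is a candidate for the next
   LAYOUTCOMMIT.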

2.4.6.  Extents are Permissions

   Layout extents returned to pNFS clients grant permission to read or
   write; PNFS_RDMA_READ_DATA and PNFS_RDMA_NONE_DATA are read-only
   (PNFS_RDMA_NONE_DATA reads as zeros), PNFS_RDMA_READ_WRITE_DATA and
   PNFS_RDMA_INVALID_DATA are read/write (PNFS_RDMA_INVALID_DATA reads
   as zeros, any write converts it to PNFS_RDMA_READ_WRITE_DATA).  This
   is the only means a client has of obtaining permission to perform
   direct I/O to storage devices; a pNFS client MUST NOT perform direct
   I/O operations that are not permitted by an extent held by the
   client.  Client adherence to this rule places the pNFS server in
   control of potentially conflicting storage device operations,
   enabling the server to determine what does conflict and how to avoid
   conflicts by granting and recalling extents to/from clients.
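
   A client-side guard for this rule might look like the following
   sketch, which checks that a byte range is fully covered by held
   extents whose state permits the operation before any direct I/O is
   issued.  The tuple layout and the function name are assumptions made
   for the example.

```python
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = range(4)

# Every state may be read (NONE_DATA and INVALID_DATA read as zeros);
# only READ_WRITE_DATA and INVALID_DATA may be written.
_CAN_READ = {READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA}
_CAN_WRITE = {READ_WRITE_DATA, INVALID_DATA}


def io_permitted(extents, offset, length, write):
    """extents: iterable of (file_offset, length, state) held by the
    client.  True if [offset, offset + length) is fully covered by
    extents that permit the requested operation."""
    allowed = _CAN_WRITE if write else _CAN_READ
    pos, end = offset, offset + length
    for start, ext_len, state in sorted(extents):
        if state not in allowed:
            continue
        if start <= pos < start + ext_len:
            pos = start + ext_len
        if pos >= end:
            return True
    return pos >= end
```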

   If a client makes a layout request that conflicts with an existing
   layout delegation, the request will be rejected with the error
   NFS4ERR_LAYOUTTRYLATER.  The client is then expected to retry the
   request after a short interval.  During this interval, the server
   SHOULD recall the conflicting portion of the layout delegation from
   the client that currently holds it.  This reject-and-retry approach
   does not prevent client starvation when there is contention for the
   layout of a particular file.  For this reason, a pNFS server SHOULD
   implement a mechanism to prevent starvation.  One possibility is that
   the server can maintain a queue of rejected layout requests.  Each
   new layout request can be checked to see if it conflicts with a
   previous rejected request, and if so, the newer request can be
   rejected.  Once the original requesting client retries its request,
   its entry in the rejected request queue can be cleared, or the entry
   in the rejected request queue can be removed when it reaches a
   certain age.
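
   One way to picture the rejected-request queue described above is the
   following sketch.  It is an illustration under stated assumptions,
   not a required design: the class and method names are hypothetical,
   a "conflict" is simplified to overlapping byte ranges requested by
   different clients, and the five-second maximum age is an arbitrary
   placeholder.

```python
import time
from typing import Callable, List, NamedTuple


class Rejected(NamedTuple):
    client_id: str
    offset: int
    length: int
    rejected_at: float


class RejectedQueue:
    """Oldest rejected request wins over newer overlapping ones."""

    def __init__(self, max_age: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self.entries: List[Rejected] = []

    def note_rejected(self, client_id: str, offset: int,
                      length: int) -> None:
        """Record a request answered with NFS4ERR_LAYOUTTRYLATER."""
        for e in self.entries:
            if (e.client_id, e.offset, e.length) == (
                    client_id, offset, length):
                return                  # already queued
        self.entries.append(
            Rejected(client_id, offset, length, self.clock()))

    def admit(self, client_id: str, offset: int, length: int) -> bool:
        """False if an older request from another client overlaps."""
        now = self.clock()
        # Drop entries that have reached the maximum age.
        self.entries = [e for e in self.entries
                        if now - e.rejected_at < self.max_age]
        for i, e in enumerate(self.entries):
            overlaps = (e.offset < offset + length
                        and offset < e.offset + e.length)
            if not overlaps:
                continue
            if e.client_id == client_id:
                del self.entries[i]     # original client retried
                return True
            return False
        return True
```

   A server using this sketch would call admit() before its normal
   conflict check and note_rejected() whenever it rejects a request.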

   NFSv4 supports mandatory locks and share reservations.  These are
   mechanisms that clients can use to restrict the set of I/O operations
   that are permissible to other clients.  Since all I/O operations
   ultimately arrive at the NFSv4 server for processing, the server is
   in a position to enforce these restrictions.  However, with pNFS
   layouts, I/Os will be issued from the clients that hold the layouts
   directly to the storage devices that host the data.  These devices
   have no knowledge of files, mandatory locks, or share reservations,
   and are not in a position to enforce such restrictions.  For this
   reason the NFSv4 server MUST NOT grant layouts that conflict with
   mandatory locks or share reservations.  Further, if a conflicting
   mandatory lock request or a conflicting open request arrives at the
   server, the server MUST recall the part of the layout in conflict
   with the request before granting the request.

2.4.7.  End-of-file Processing

   The end-of-file location can be changed in two ways: implicitly as
   the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file,
   or explicitly as the result of a SETATTR request.  Typically, when a
   file is truncated by an NFSv4 client via the SETATTR call, the server
   frees any disk blocks belonging to the file that are beyond the new
   end-of-file byte, and MUST write zeros to the portion of the new end-
   of-file block beyond the new end-of-file byte.  These actions render
   any pNFS layouts that refer to the blocks that are freed or written
   semantically invalid.  Therefore, the server MUST recall from clients
   the portions of any pNFS layouts that refer to blocks that will be
   freed or written by the server before effecting the file truncation.
   These recalls may take time to complete; as explained in [RFC5661],
   if the server cannot respond to the client SETATTR request in a
   reasonable amount of time, it SHOULD reply to the client with the
   error NFS4ERR_DELAY.

   Blocks in the PNFS_RDMA_INVALID_DATA state that lie beyond the new
   end-of-file block present a special case.  The server has reserved
   these blocks for use by a pNFS client with a writable layout for the
   file, but the client has yet to commit the blocks, and they are not
   yet a part of the file mapping on disk.  The server MAY free these
   blocks while processing the SETATTR request.  If so, the server MUST
   recall any layouts from pNFS clients that refer to the blocks before
   processing the truncate.  If the server does not free the
   PNFS_RDMA_INVALID_DATA blocks while processing the SETATTR request,
   it need not recall layouts that refer only to the
   PNFS_RDMA_INVALID_DATA blocks.

   When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond
   the current end-of-file, or extended explicitly by a SETATTR request,
   the server need not recall any portions of any pNFS layouts.
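
   The recall rules in this section can be sketched as a server-side
   helper that, for a SETATTR that changes the file size, lists the
   byte ranges of outstanding extents that would have to be recalled.
   This is one possible reading, offered as an illustration only: the
   function name, the tuple representation of extents, and the
   free_invalid flag (whether the server chooses to free
   PNFS_RDMA_INVALID_DATA blocks) are assumptions.

```python
from typing import List, Tuple

Extent = Tuple[int, int, str]   # (offset, length, state name)


def extents_to_recall(extents: List[Extent], new_size: int,
                      block_size: int,
                      free_invalid: bool) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges to recall before truncating."""
    # Blocks from the new end-of-file block onward are zeroed or freed.
    recall_start = (new_size // block_size) * block_size
    eof_block_end = -(-new_size // block_size) * block_size
    result = []
    for offset, length, state in extents:
        start = max(offset, recall_start)
        end = offset + length
        if state == "PNFS_RDMA_INVALID_DATA" and not free_invalid:
            # Not freed: the part beyond the EOF block may stay.
            end = min(end, eof_block_end)
        if start < end:
            result.append((start, end - start))
    return result
```

   When the new size lies at or beyond the end of every outstanding
   extent, as when a file is extended, the helper returns an empty
   list, which matches the rule that extension requires no recall.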

2.4.8.  Layout Hints

   The layout hint attribute specified in [RFC5661] is not supported by
   the RDMA layout, and the pNFS server MUST reject setting a layout
   hint attribute with a loh_type value of LAYOUT4_RDMA_VOLUME during
   OPEN or SETATTR operations.  On a file system that supports only the
   RDMA layout, a server MUST NOT report the layout_hint attribute in
   the supported_attrs attribute.

2.5.  Crash Recovery Issues

   A critical requirement in crash recovery is that both the client and
   the server know when the other has failed.  Additionally, it is
   required that a client sees a consistent view of data across server
   restarts.  These requirements and a full discussion of crash recovery
   issues are covered in the "Crash Recovery" section of the NFSv4.1
   specification [RFC5661].  This document contains additional crash
   recovery material specific only to the RDMA layout.

   When the server crashes while the client holds a writable layout, and
   the client has written data to blocks covered by the layout, and the
   blocks are still in the PNFS_RDMA_INVALID_DATA state, the client has
   two options for recovery.  If the data that has been written to these
   blocks is still cached by the client, the client can simply re-write
   the data via NFSv4, once the server has come back online.  However,
   if the data is no longer in the client's cache, the client MUST NOT
   attempt to source the data from the data servers.  Instead, it SHOULD
   attempt to commit the blocks in question to the server during the
   server's recovery grace period, by sending a LAYOUTCOMMIT with the
   "loca_reclaim" flag set to true.  This process is described in detail
   in Section 18.42.4 of [RFC5661].

2.6.  Transient and Permanent Errors

   The server may respond to LAYOUTGET with a variety of error statuses.
   These errors can convey transient conditions or more permanent
   conditions that are unlikely to be resolved soon.

   The error NFS4ERR_RECALLCONFLICT indicates that the server has
   recently issued a CB_LAYOUTRECALL to the requesting client, making it
   necessary for the client to respond to the recall before processing
   the layout request.  A client can wait for that recall to be
   received and processed, or it can retry as for
   NFS4ERR_LAYOUTTRYLATER, as described below.

   The error NFS4ERR_LAYOUTTRYLATER is used to indicate that the server
   cannot immediately grant the layout to the client.  This may be due
   to constraints on writable sharing of blocks by multiple clients or
   to a conflict with a recallable lock (e.g., a delegation).  In either
   case, a reasonable approach for the client is to wait several
   milliseconds and retry the request.  The client SHOULD track the
   number of retries, and if forward progress is not made, the client
   SHOULD abandon the attempt to get a layout and perform READ and WRITE
   operations by sending them to the metadata server.

   The error NFS4ERR_LAYOUTUNAVAILABLE MAY be returned by the server if
   layouts are not supported for the requested file or its containing
   file system.  The server MAY also return this error code if the
   server is in the process of migrating the file from secondary
   storage, there is a conflicting lock that would prevent the layout
   from being granted, or for any other reason that causes the server
   to be unable to supply the layout.  As a result of receiving
   NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD abandon the attempt to
   get a layout and perform READ and WRITE operations by sending them to
   the MDS.  It is expected that a client will not cache the file's
   layoutunavailable state forever.  In particular, when the file is
   closed or opened by the client, issuing a new LAYOUTGET is
   appropriate.
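
   The client behavior described in this section can be sketched as a
   bounded retry loop.  This is a minimal illustration, not a
   definitive implementation: the function name, the string status
   values, the retry limit, and the 5 ms delay are all assumptions, and
   the retryable error is spelled NFS4ERR_LAYOUTTRYLATER as in
   Section 2.4.6 and [RFC5661].  The send_layoutget callable stands in
   for whatever sends the LAYOUTGET and returns its status.

```python
import time
from typing import Callable

RETRYABLE = ("NFS4ERR_LAYOUTTRYLATER", "NFS4ERR_RECALLCONFLICT")


def get_layout_or_fall_back(send_layoutget: Callable[[], str],
                            max_retries: int = 5,
                            delay: float = 0.005,
                            sleep: Callable[[float], None] = time.sleep
                            ) -> str:
    """Return "LAYOUT" if a layout was granted, or "MDS_IO" if the
    client should send READ and WRITE to the metadata server."""
    for attempt in range(max_retries + 1):
        status = send_layoutget()
        if status == "NFS4_OK":
            return "LAYOUT"
        if status in RETRYABLE:
            # Transient: wait a few milliseconds, then try again.
            if attempt < max_retries:
                sleep(delay)
            continue
        # NFS4ERR_LAYOUTUNAVAILABLE and other errors: stop trying.
        return "MDS_IO"
    return "MDS_IO"          # no forward progress after the retries
```

   The result is deliberately not cached by this sketch; a client would
   call it again on a later open or close of the file.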

3.  Security Considerations

   The pNFS extension partitions the NFSv4.1+ file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4.1+ with
   respect to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4.1+ provides
   defenses for in environments where these threats are considered
   significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ or
   RW) and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified data files corresponding to
   the requested iomode.  When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations.  If access is
   allowed, the client uses the corresponding (READ or RW) credentials
   to perform the I/O operations at the data file's storage devices.
   When the metadata server receives a request to change a file's
   permissions or ACL, it SHOULD recall all layouts for that file and it
   MUST fence off the clients holding outstanding layouts for the
   respective file by implicitly invalidating the outstanding
   credentials on all data files comprising the file before committing
   to the new permissions and ACL.  Doing this will ensure that clients
   re-authorize their layouts according to the modified permissions and
   ACL by requesting new layouts.  Recalling the layouts in this case is
   a courtesy of the server intended to prevent clients from getting an
   error on I/Os done after the client was fenced off.

4.  IANA Considerations

   IANA is requested to assign a new pNFS layout type in the pNFS Layout
   Types Registry as follows (the value 6 is suggested):

      Layout Type Name: LAYOUT4_RDMA
      Value: 0x00000006
      RFC: RFCTBD10
      How: L (new layout type)
      Minor Versions: 1

5.  References

5.1.  Normative References

   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF Documents",
              November 2008, <http://trustee.ietf.org/docs/
              IETF-Trust-License-Policy.pdf>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
              STD 67, RFC 4506, May 2006.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, January 2010.

   [RFC8166]  Lever, C., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, June 2017.

5.2.  Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1 Release 1.3", March 2015.

   [RFC5040]  Recio, B., Ed., Metzler, B., Ed., Culley, P., Ed.,
              Hilland, J., Ed., and D. Garcia, Ed., "A Remote Direct
              Memory Access Protocol Specification", RFC 5040, October
              2007.

   [RFC5041]  Shah, H., Ed., Pinkerton, J., Ed., Recio, B., Ed., and P.
              Culley, Ed., "Direct Data Placement over Reliable
              Transports", RFC 5041, October 2007.

Appendix A.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]

Author's Address

   Christoph Hellwig

   Email: hch@lst.de