Using the Parallel NFS (pNFS) SCSI Layout with NVMe
draft-ietf-nfsv4-scsi-layout-nvme-02

The information below is for an old version of the document.
Document Type
This is an older version of an Internet-Draft that was ultimately published as RFC 9561.
Authors Christoph Hellwig, Chuck Lever, Sorin Faibish, David L. Black
Last updated 2023-03-13
Replaces draft-hellwig-nfsv4-scsi-layout-nvme
RFC stream Internet Engineering Task Force (IETF)
Stream WG state WG Document
IESG IESG state Became RFC 9561 (Proposed Standard)
NFSv4                                                    C. Hellwig, Ed.
Internet-Draft                                                          
Intended status: Standards Track                                C. Lever
Expires: 14 September 2023                                        Oracle
                                                              S. Faibish
                                              Cirrus Data Solutions Inc.
                                                                D. Black
                                                       Dell Technologies
                                                           13 March 2023

          Using the Parallel NFS (pNFS) SCSI Layout with NVMe
                  draft-ietf-nfsv4-scsi-layout-nvme-02

Abstract

   This document specifies how to use the Parallel Network File System
   (pNFS) SCSI Layout Type to access storage devices using the NVMe
   protocol family.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 14 September 2023.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Hellwig, et al.         Expires 14 September 2023               [Page 1]
Internet-Draft          pNFS SCSI Layout for NVMe             March 2023

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
     1.2.  General Definitions . . . . . . . . . . . . . . . . . . .   3
   2.  SCSI Layout mapping to NVMe . . . . . . . . . . . . . . . . .   3
     2.1.  Volume Identification . . . . . . . . . . . . . . . . . .   4
     2.2.  Client Fencing  . . . . . . . . . . . . . . . . . . . . .   4
       2.2.1.  PRs - Key Registration  . . . . . . . . . . . . . . .   5
       2.2.2.  PRs - MDS Registration and Reservation  . . . . . . .   5
       2.2.3.  Fencing Action  . . . . . . . . . . . . . . . . . . .   6
       2.2.4.  Client Recovery after a Fence Action  . . . . . . . .   6
     2.3.  Volatile write caches . . . . . . . . . . . . . . . . . .   6
   3.  Security Considerations . . . . . . . . . . . . . . . . . . .   6
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   5.  Normative References  . . . . . . . . . . . . . . . . . . . .   7
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   NFSv4.1 (in [RFC8881]) includes a pNFS feature that allows reads and
   writes to be performed by means other than directing read and write
   operations to the server.  Through use of this feature, the server,
   in the role of metadata server, is responsible for managing file and
   directory metadata while separate means are provided for execution of
   reads and writes.

   These other means of performing file reads and writes are defined by
   individual mapping types, which often have their own specifications.
   The SCSI Layout Type, defined in [RFC8154], describes how I/O is to
   be done directly to block storage devices.

   The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is a
   layout type that allows NFS clients to directly perform I/O to block
   storage devices while bypassing the Metadata Server (MDS).  It is
   specified by using concepts from the SCSI protocol family for the
   data path to the storage devices.

   This document defines how NVMe Namespaces using the NVM Command Set
   [NVME-NVM] exported by NVMe Controllers implementing the NVMe Base
   specification [NVME-BASE] are to be used as storage devices using the
   SCSI Layout Type.  The definition is independent of the underlying
   transport used by the NVMe Controller and thus supports Controllers
   implementing a wide variety of transports, including PCI Express,
   RDMA, TCP, and Fibre Channel.

   This document does not amend the existing SCSI layout document.
   Rather, it defines how NVMe Namespaces can be used within the SCSI
   Layout by establishing a mapping of the SCSI constructs used in the
   SCSI layout document to corresponding NVMe constructs.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  General Definitions

   The following definitions provide an appropriate context for the
   reader.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly or be part of the
      operating system that provides remote file system services for a
      set of applications.

   Metadata Server (MDS)  The Metadata Server (MDS) is the entity
      responsible for coordinating client access to a set of file
      systems and is identified by a server owner.

2.  SCSI Layout mapping to NVMe

   The SCSI layout definition [RFC8154] references only a few SCSI-
   specific concepts directly.  This document provides a mapping from
   these SCSI concepts to NVM Express concepts that are used when using
   the pNFS SCSI layout with NVMe namespaces.

2.1.  Volume Identification

   The pNFS SCSI layout uses the Device Identification VPD page (page
   code 0x83) from [SPC5] to identify the devices used by a layout.
   Implementations that use NVMe namespaces as storage devices map NVMe
   namespace identifiers to a subset of the identifiers that the Device
   Identification VPD page supports for SCSI logical units.

   To be used as storage devices for the pNFS SCSI layout, NVMe
   namespaces MUST support either the EUI64 or NGUID value reported in a
   Namespace Identification Descriptor, the I/O Command Set Independent
   Identify Namespace Data Structure, and the Identify Namespace Data
   Structure, NVM Command Set.  If available, use of the NGUID value is
   preferred as it is the larger identifier.

   Note: The PS_DESIGNATOR_T10 and PS_DESIGNATOR_NAME have no equivalent
   in NVMe and cannot be used to identify NVMe storage devices.

   The pnfs_scsi_base_volume_info4 structure for an NVMe namespace SHALL
   be constructed as follows:

   1.  The "sbv_code_set" field SHALL be set to PS_CODE_SET_BINARY.

   2.  The "sbv_designator_type" field SHALL be set to
       PS_DESIGNATOR_EUI64.

   3.  The "sbv_designator" field SHALL contain either the NGUID or the
       EUI64 identifier for the namespace.  If both NGUID and EUI64
       identifiers are available, then the NGUID identifier SHOULD be
       used as it is the larger identifier.

   RFC 8154 specifies the "sbv_designator" field as an XDR variable
   length opaque<>.  The length of that XDR opaque<> value (part of its
   XDR representation) indicates which NVMe identifier is present.  That
   length MUST be 16 octets for an NVMe NGUID identifier and MUST be 8
   octets for an NVMe EUI64 identifier.  All other lengths MUST NOT be
   used with an NVMe namespace.
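
   As a non-normative illustration, the construction rules above can be
   sketched in Python.  The numeric enum values, the
   "sbv_designator_type" field name, and the trailing "sbv_pr_key" field
   are assumptions recalled from the XDR description in [RFC8154] and
   should be checked against that document; the function names are
   hypothetical.

```python
import struct
from typing import Optional

# Assumed enum values (recalled from the XDR in RFC 8154; verify before use).
PS_CODE_SET_BINARY = 1
PS_DESIGNATOR_EUI64 = 2

NGUID_LEN = 16  # octets
EUI64_LEN = 8   # octets


def pick_designator(nguid: Optional[bytes], eui64: Optional[bytes]) -> bytes:
    """Prefer the NGUID when both identifiers are available."""
    if nguid is not None:
        return nguid
    if eui64 is not None:
        return eui64
    raise ValueError("namespace reports neither an NGUID nor an EUI64")


def encode_base_volume_info(designator: bytes, pr_key: int) -> bytes:
    """XDR-encode a pnfs_scsi_base_volume_info4 for an NVMe namespace."""
    if len(designator) not in (NGUID_LEN, EUI64_LEN):
        raise ValueError("designator must be 16 octets (NGUID) or 8 (EUI64)")
    pad = (-len(designator)) % 4  # XDR pads opaque data to a multiple of 4
    return b"".join([
        struct.pack(">I", PS_CODE_SET_BINARY),   # sbv_code_set
        struct.pack(">I", PS_DESIGNATOR_EUI64),  # sbv_designator_type
        struct.pack(">I", len(designator)),      # opaque<> length prefix
        designator + b"\x00" * pad,              # sbv_designator
        struct.pack(">Q", pr_key),               # sbv_pr_key (assumed field)
    ])
```

   A receiver of such a structure can tell the two identifier kinds
   apart solely from the opaque<> length prefix, which is why any length
   other than 8 or 16 is rejected in this sketch.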

2.2.  Client Fencing

   The SCSI layout uses Persistent Reservations (PRs) to provide client
   fencing.  For this to be achieved, both the MDS and the Clients have
   to register a key with the storage device, and the MDS has to create
   a reservation on the storage device.

   The following sub-sections provide a full mapping of the required
   PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT SCSI commands [SPC5]
   to NVMe commands which MUST be used when using NVMe namespaces as
   storage devices for the pNFS SCSI layout.

2.2.1.  PRs - Key Registration

   On NVMe namespaces, reservation keys are registered using the
   Reservation Register command (refer to Section 7.3 of [NVME-BASE])
   with the Reservation Register Action (RREGA) field set to 000b (i.e.,
   Register Reservation Key) and supplying the reservation key in the
   New Reservation Key (NRKEY) field.

   Reservation keys are unregistered using the Reservation Register
   command with the Reservation Register Action (RREGA) field set to
   001b (i.e., Unregister Reservation Key) and supplying the reservation
   key in the Current Reservation Key (CRKEY) field.

   One important difference between SCSI Persistent Reservations and
   NVMe Reservations is that NVMe reservation keys always apply to all
   controllers used by a host (as indicated by the NVMe Host
   Identifier).  This behavior is analogous to setting the ALL_TG_PT bit
   when registering a SCSI Reservation key, and is always supported by
   NVMe Reservations, unlike the ALL_TG_PT for which SCSI support is
   inconsistent and cannot be relied upon.  Registering a reservation
   key with a namespace creates an association between a host and a
   namespace.  A host that is a registrant of a namespace may use any
   controller with which that host is associated (i.e., that has the
   same Host Identifier, refer to Section 5.27.1.25 of [NVME-BASE]) to
   access that namespace as a registrant.
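
   The two Reservation Register uses described in this section can be
   sketched as follows.  This is a non-normative illustration: the
   placement of RREGA in the low three bits of Command Dword 10 and the
   16-byte little-endian payload (CRKEY followed by NRKEY) are
   assumptions based on a reading of [NVME-BASE] and should be verified
   there; the function names are hypothetical.

```python
import struct

RREGA_REGISTER = 0b000    # Register Reservation Key
RREGA_UNREGISTER = 0b001  # Unregister Reservation Key


def reservation_register(rrega: int, crkey: int = 0, nrkey: int = 0):
    """Return (cdw10, payload) for a Reservation Register command.

    Assumed layout: RREGA occupies CDW10 bits 2:0; the payload carries
    CRKEY then NRKEY, each as an 8-byte little-endian integer.
    """
    cdw10 = rrega & 0x7
    payload = struct.pack("<QQ", crkey, nrkey)
    return cdw10, payload


def register_key(key: int):
    """Register: the key is supplied in the NRKEY field."""
    return reservation_register(RREGA_REGISTER, nrkey=key)


def unregister_key(key: int):
    """Unregister: the key is supplied in the CRKEY field."""
    return reservation_register(RREGA_UNREGISTER, crkey=key)
```

   Because NVMe registrations are associated with the Host Identifier,
   a host would issue such a command once per namespace rather than once
   per controller.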

2.2.2.  PRs - MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using PRs.  This is done
   by registering the reservation key generated for the MDS with the
   device (see Section 2.2.1), followed by a Reservation Acquire command
   (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire
   Action (RACQA) field set to 000b (i.e., Acquire) and the Reservation
   Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants
   Only Reservation).
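
   A non-normative sketch of the MDS preparation sequence follows.  The
   bit positions used (RACQA in Command Dword 10 bits 2:0, RTYPE in bits
   15:8) and the little-endian CRKEY/PRKEY payload are assumptions based
   on a reading of [NVME-BASE]; the function names and the tuple-based
   command representation are hypothetical.

```python
import struct

RREGA_REGISTER = 0b000
RACQA_ACQUIRE = 0b000
RTYPE_EXCLUSIVE_ACCESS_REGISTRANTS_ONLY = 0x4


def reservation_acquire(racqa: int, rtype: int, crkey: int, prkey: int = 0):
    """Return (cdw10, payload) for a Reservation Acquire command.

    Assumed layout: RACQA in CDW10 bits 2:0, RTYPE in CDW10 bits 15:8;
    payload is CRKEY then PRKEY, each 8 bytes little-endian.
    """
    cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
    payload = struct.pack("<QQ", crkey, prkey)
    return cdw10, payload


def mds_prepare_volume(mds_key: int):
    """Ordered commands the MDS issues before handing out the volume."""
    register = ("Reservation Register", RREGA_REGISTER,
                struct.pack("<QQ", 0, mds_key))  # key goes in NRKEY
    acquire = ("Reservation Acquire",) + reservation_acquire(
        RACQA_ACQUIRE, RTYPE_EXCLUSIVE_ACCESS_REGISTRANTS_ONLY, mds_key)
    return [register, acquire]
```

   The order matters in this sketch: the key has to be registered
   before it can be presented as the current key of the Reservation
   Acquire command.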

2.2.3.  Fencing Action

   In case of a non-responding client, the MDS fences the client by
   executing a Reservation Acquire command (refer to Section 7.2 of
   [NVME-BASE]), with the Reservation Acquire Action (RACQA) field set
   to 001b (i.e., Preempt) or 010b (i.e., Preempt and Abort), the
   Current Reservation Key (CRKEY) field set to the server's reservation
   key, the Preempt Reservation Key (PRKEY) field set to the reservation
   key associated with the non-responding client, and the Reservation
   Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants
   Only Reservation).  The client can distinguish I/O errors due to
   fencing from other errors based on the Reservation Conflict NVMe
   status code.
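
   The effect of the fencing action can be illustrated with a small
   in-memory model.  This is a deliberately simplified, non-normative
   sketch of Exclusive Access - Registrants Only behavior, not a
   description of any controller implementation; the class and method
   names are hypothetical, and the exception merely stands in for the
   Reservation Conflict status code.

```python
class ReservationConflict(Exception):
    """Stands in for the Reservation Conflict NVMe status code."""


class SimulatedNamespace:
    """Toy model of one namespace with a registrants-only reservation."""

    def __init__(self):
        self.keys = {}      # host identifier -> registered reservation key
        self.holder = None  # host identifier that holds the reservation

    def register(self, host, key):
        self.keys[host] = key

    def acquire(self, host, crkey):
        if self.keys.get(host) != crkey or self.holder not in (None, host):
            raise ReservationConflict(host)
        self.holder = host

    def preempt(self, host, crkey, prkey):
        """Remove every other registrant whose key equals prkey."""
        if self.keys.get(host) != crkey:
            raise ReservationConflict(host)
        victims = [h for h, k in self.keys.items() if k == prkey and h != host]
        for victim in victims:
            del self.keys[victim]
            if self.holder == victim:
                self.holder = host

    def io(self, host):
        """Only registrants may access the namespace while it is reserved."""
        if self.holder is not None and host not in self.keys:
            raise ReservationConflict(host)
        return "ok"
```

   In this model, a client that registered key 2 can perform I/O until
   the MDS calls preempt() with its own key as crkey and 2 as prkey;
   the client's next io() call then raises ReservationConflict, while
   the MDS and other registered clients are unaffected.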

2.2.4.  Client Recovery after a Fence Action

   If an NVMe command issued by the client to the storage device returns
   a non-retryable error (refer to the DNR bit defined in Figure 92 in
   [NVME-BASE]), the client MUST commit all layouts that use the storage
   device through the MDS, return all outstanding layouts for the
   device, forget the device ID, and unregister the reservation key.
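
   The ordering of the recovery actions can be sketched as follows.
   This is a non-normative illustration under stated assumptions: the
   caller has already extracted the DNR bit from the command completion,
   layouts are represented by opaque labels, and the step names are
   hypothetical placeholders rather than protocol operations.

```python
from typing import List, Tuple


def recovery_steps(dnr_set: bool, layouts: List[str]) -> List[Tuple[str, str]]:
    """Return the ordered recovery actions for one storage device.

    dnr_set is True when the failed NVMe command reported a
    non-retryable error; layouts lists the layouts using the device.
    """
    if not dnr_set:
        return []  # retryable errors are outside the scope of this rule
    steps = [("commit_layout_via_mds", layout) for layout in layouts]
    steps += [("return_layout", layout) for layout in layouts]
    steps.append(("forget_device_id", ""))
    steps.append(("unregister_reservation_key", ""))
    return steps
```

   For two layouts "L1" and "L2", the sketch yields both commits first,
   then both returns, and only then the device-level cleanup, mirroring
   the order in which the requirements are listed above.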

2.3.  Volatile write caches

   For NVMe controllers, a volatile write cache is enabled if bit 0 of
   the Volatile Write Cache (VWC) field in the Identify Controller Data
   Structure, I/O Command Set Independent (see Figure 275 in
   [NVME-BASE]) is set and the Volatile Write Cache Enable (WCE) bit
   (i.e., bit 00) in the Volatile Write Cache Feature (Feature
   Identifier 06h) (see Section 5.27.1.4 of [NVME-BASE]) is set.  If a
   volatile write cache is enabled on an NVMe namespace used as a
   storage device for the pNFS SCSI layout, the pNFS server (MDS) MUST
   use the NVMe Flush command (see Section 7.1 of [NVME-BASE]) to flush
   the volatile write cache to stable storage before the LAYOUTCOMMIT
   operation returns.  The NVMe Flush command is equivalent to the SCSI
   SYNCHRONIZE CACHE commands.
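
   The enablement test described above reduces to checking two bits.
   The following non-normative sketch assumes the caller has already
   read the VWC field from the Identify Controller data and the current
   value of the Volatile Write Cache Feature; the function names are
   hypothetical.

```python
def volatile_write_cache_enabled(vwc: int, wce_feature_value: int) -> bool:
    """True when a volatile write cache is both present and enabled.

    vwc: the VWC field from the Identify Controller data structure.
    wce_feature_value: current value of Feature Identifier 06h.
    """
    cache_present = bool(vwc & 0x1)                # VWC bit 0
    cache_enabled = bool(wce_feature_value & 0x1)  # WCE, bit 0
    return cache_present and cache_enabled


def flush_required_before_layoutcommit(vwc: int, wce_feature_value: int) -> bool:
    """The MDS must issue Flush only when the cache is enabled."""
    return volatile_write_cache_enabled(vwc, wce_feature_value)
```

   Both bits are needed: a controller may report a cache as present
   while the feature is currently disabled, in which case this sketch
   reports that no Flush is required.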

3.  Security Considerations

   NFSv4 clients access NFSv4 metadata servers using the NFSv4 protocol.
   The security considerations generally described in [RFC8881] apply to
   a client's interactions with the metadata server.  However, NFSv4
   clients and servers access NVMe storage devices at a lower layer than
   NFSv4.  NFSv4 and RPC security are not directly applicable to the
   I/Os to data servers using NVMe.  Refer to Section 2.4.6 (Extents Are
   Permissions) and Section 4 (Security Considerations) of [RFC8154] for
   the Security Considerations of direct block access from NFS clients.

   pNFS with an NVMe layout can be used with NVMe transports (e.g., NVMe
   over PCIe [NVME-PCIE]) that provide essentially no additional
   security functionality.  Or, pNFS may be used with storage protocols
   such as NVMe over TCP [NVME-TCP] that can provide significant
   transport layer security.

   It is the responsibility of those administering and deploying pNFS
   with an NVMe layout to ensure that appropriate protection is deployed
   to that protocol.  When using IP-based storage protocols such as NVMe
   over TCP, data confidentiality and integrity SHOULD be provided for
   traffic between pNFS clients and NVMe storage devices by using a
   secure communication protocol such as TLS [RFC8446].  For NVMe over
   TCP, TLS SHOULD be used as described in [NVME-TCP] to protect traffic
   between pNFS clients and NVMe namespaces used as storage devices.

   Physical security is a common means of protection for protocols not
   based on IP.  In environments where the security requirements for the
   storage protocol cannot be met, pNFS with an NVMe layout SHOULD NOT
   be deployed.

   When security is available for the data server storage protocol, it
   is generally at a different granularity and with a different notion
   of identity than NFSv4 (e.g., NFSv4 controls user access to files,
   and NVMe controls initiator access to volumes).  As with pNFS with
   the block layout type [RFC5663], the pNFS client is responsible for
   enforcing appropriate correspondences between these security layers.
   In environments where the security requirements are such that client-
   side protection from access to storage outside of the layout is not
   sufficient, pNFS with a SCSI layout on an NVMe namespace SHOULD NOT be
   deployed.

   As with other block-oriented pNFS layout types, the metadata server
   is able to fence off a client's access to the data on an NVMe
   namespace used as a storage device.  If a metadata server revokes a
   layout, the client's access MUST be terminated at the storage devices
   via fencing as specified in Section 2.2.  The client has a subsequent
   opportunity to acquire a new layout.

4.  IANA Considerations

   This document does not require any actions by IANA.

5.  Normative References

   [NVME-BASE]
              NVM Express, Inc., "NVM Express Base Specification,
              Revision 2.0b", January 2022.

   [NVME-NVM] NVM Express, Inc., "NVM Express NVM Command Set
              Specification, Revision 1.0b", January 2022.

   [NVME-PCIE]
              NVM Express, Inc., "NVMe over PCIe Transport
              Specification, Revision 1.0b", January 2022.

   [NVME-TCP] NVM Express, Inc., "NVM Express TCP Transport
              Specification, Revision 1.0b", January 2022.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5663]  Black, D., Fridella, S., and J. Glasgow, "Parallel NFS
              (pNFS) Block/Volume Layout", RFC 5663,
              DOI 10.17487/RFC5663, January 2010,
              <https://www.rfc-editor.org/info/rfc5663>.

   [RFC8154]  Hellwig, C., "Parallel NFS (pNFS) Small Computer System
              Interface (SCSI) Layout", RFC 8154, DOI 10.17487/RFC8154,
              May 2017, <https://www.rfc-editor.org/info/rfc8154>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
              Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
              <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8881]  Noveck, D., Ed. and C. Lever, "Network File System (NFS)
              Version 4 Minor Version 1 Protocol", RFC 8881,
              DOI 10.17487/RFC8881, August 2020,
              <https://www.rfc-editor.org/info/rfc8881>.

   [SPC5]     INCITS Technical Committee T10, "SCSI Primary Commands-5",
              ANSI INCITS 502-2019, 2019.

Authors' Addresses

   Christoph Hellwig (editor)
   Email: hch@lst.de

   Charles Lever
   Oracle Corporation
   United States of America
   Email: chuck.lever@oracle.com

   Sorin Faibish
   Cirrus Data Solutions Inc.
   11 Selwyn Road
   Newton, MA 02461
   United States of America
   Email: sorin.faibish@cdsi.us.com

   David L. Black
   Dell Technologies
   176 South Street
   Hopkinton, MA 01748
   United States of America
   Email: david.black@dell.com
