Using the Parallel NFS (pNFS) SCSI Layout with NVMe
draft-ietf-nfsv4-scsi-layout-nvme-00
Document | Type | This is an older version of an Internet-Draft that was ultimately published as RFC 9561. | |
---|---|---|---|
Authors | Christoph Hellwig, Chuck Lever, Sorin Faibish, David L. Black | ||
Last updated | 2022-09-29 | ||
RFC stream | Internet Engineering Task Force (IETF) | ||
Additional resources | Mailing list discussion | ||
Stream | WG state | WG Document | |
Document shepherd | (None) | ||
IESG | IESG state | Became RFC 9561 (Proposed Standard) | |
Consensus boilerplate | Unknown | ||
Telechat date | (None) | ||
Responsible AD | (None) | ||
Send notices to | (None) |
NFSv4                                                         C. Hellwig
Internet-Draft
Intended status: Informational                                  C. Lever
Expires: 30 March 2023                                            Oracle
                                                              S. Faibish
                                              Cirrus Data Solutions Inc.
                                                                D. Black
                                                       Dell Technologies
                                                       26 September 2022


          Using the Parallel NFS (pNFS) SCSI Layout with NVMe
                  draft-ietf-nfsv4-scsi-layout-nvme-00

Abstract

   This document explains how to use the Parallel Network File System
   (pNFS) SCSI Layout Type with transports using the NVMe or NVMe over
   Fabrics protocol.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 30 March 2023.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Hellwig, et al.           Expires 30 March 2023                 [Page 1]

Internet-Draft          pNFS SCSI Layout for NVMe         September 2022

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  General Definitions
   2.  SCSI Layout mapping to NVMe
     2.1.  Volume Identification
     2.2.  Client Fencing
       2.2.1.  PRs - Key Registration
       2.2.2.  PRs - MDS Registration and Reservation
       2.2.3.  Fencing Action
       2.2.4.  Client Recovery after a Fence Action
     2.3.  Volatile write caches
   3.  Security Considerations
   4.  IANA Considerations
   5.  Normative References
   Authors' Addresses

1.  Introduction

   The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is
   a layout type that allows NFS clients to directly perform I/O to
   block storage devices while bypassing the Metadata Server (MDS). It
   is specified by using concepts from the SCSI protocol family for the
   data path to the storage devices. This document explains how to
   access NVM Command Set Namespaces [NVME-NVM] exported by NVMe
   Controllers implementing the NVMe Base specification [NVME-BASE]
   using the SCSI layout type. This document is independent of the
   underlying transport used by the NVMe Controller and thus supports
   Controllers implementing a wide variety of transports, including
   PCI Express, RDMA, TCP, and Fibre Channel.

   This document does not amend the pNFS SCSI layout document, but
   instead explains how to map the SCSI constructs used in the pNFS
   SCSI layout document to NVMe concepts.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  General Definitions

   The following definitions are provided to give the reader an
   appropriate context.

   Client  The "client" is the entity that accesses the NFS server's
      resources. The client may be an application that contains the
      logic to access the NFS server directly. The client may also be
      the traditional operating system client that provides remote file
      system services for a set of applications.

   Server  The "server" is the entity responsible for coordinating
      client access to a set of file systems and is identified by a
      server owner.

2.  SCSI Layout mapping to NVMe

   The SCSI layout definition [RFC8154] references only a few
   SCSI-specific concepts directly. This document provides a mapping
   from these SCSI concepts to NVM Express concepts that SHOULD be used
   when using the pNFS SCSI layout with NVMe namespaces.

2.1.  Volume Identification

   The pNFS SCSI layout uses the Device Identification VPD page (page
   code 0x83) from [SPC5] to identify the devices used by a layout.
   Implementations that use NVMe namespaces as storage devices map NVMe
   namespace identifiers to a subset of the identifiers that the Device
   Identification VPD page supports for SCSI logical units.

   To be used as storage devices for the pNFS SCSI layout, NVMe
   namespaces MUST support either the EUI64 or NGUID value reported in
   a Namespace Identification Descriptor, the I/O Command Set
   Independent Identify Namespace Data Structure, and the Identify
   Namespace Data Structure, NVM Command Set. If available, the NGUID
   value SHOULD be used as it is the larger identifier. Methods based
   on the Serial Number are not suitable for unique addressing needs
   and thus MUST NOT be used.
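   The identifier preference described above can be sketched as
   follows. This is a minimal illustration, not part of the
   specification: the function name and the constants are hypothetical,
   and the sketch assumes that an absent or all-zero value means the
   namespace does not report that identifier.

```python
from typing import Optional

NGUID_LEN = 16  # octets in an NVMe NGUID
EUI64_LEN = 8   # octets in an NVMe EUI64


def pick_namespace_identifier(nguid: Optional[bytes],
                              eui64: Optional[bytes]) -> bytes:
    """Return the identifier to use for a namespace, preferring NGUID.

    Assumption: None or an all-zero value means "not reported".
    Serial-number based identification is deliberately not offered.
    """
    def usable(value: Optional[bytes], length: int) -> bool:
        return value is not None and len(value) == length and any(value)

    if usable(nguid, NGUID_LEN):
        return nguid          # larger identifier, preferred when present
    if usable(eui64, EUI64_LEN):
        return eui64
    raise ValueError("namespace reports neither NGUID nor EUI64")
```

   A namespace for which this selection fails is not usable as a
   storage device for the pNFS SCSI layout.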
   The pnfs_scsi_base_volume_info4 structure for an NVMe namespace
   SHALL be constructed as follows:

   1.  The "sbv_code_set" field SHALL be set to PS_CODE_SET_BINARY.

   2.  The "sbv_designator_type" field SHALL be set to
       PS_DESIGNATOR_EUI64.

   3.  The "sbv_designator" field SHALL contain either the NGUID or the
       EUI64 identifier for the namespace. If both NGUID and EUI64
       identifiers are available, then the NGUID identifier SHOULD be
       used as it is the larger identifier.

   RFC 8154 specifies the "sbv_designator" field as an XDR variable
   length opaque<>. The length of that XDR opaque<> value (part of its
   XDR representation) indicates which NVMe identifier is present. That
   length MUST be 16 octets for an NVMe NGUID identifier and MUST be 8
   octets for an NVMe EUI64 identifier. All other lengths MUST NOT be
   used with an NVMe namespace.

2.2.  Client Fencing

   The SCSI layout uses Persistent Reservations (PRs) to provide client
   fencing. For this, both the MDS and the clients have to register a
   key with the storage device, and the MDS has to create a reservation
   on the storage device.

   The following is a full mapping of the required PERSISTENT RESERVE
   IN and PERSISTENT RESERVE OUT SCSI commands [SPC5] to NVMe commands
   which MUST be used when using NVMe namespaces as storage devices for
   the pNFS SCSI layout.

2.2.1.  PRs - Key Registration

   On NVMe namespaces, reservation keys are registered using the
   Reservation Register command (refer to Section 7.3 of [NVME-BASE])
   with the Reservation Register Action (RREGA) field set to 000b
   (i.e., Register Reservation Key) and supplying the reservation key
   in the New Reservation Key (NRKEY) field.

   Reservation keys are unregistered using the Reservation Register
   command with the Reservation Register Action (RREGA) field set to
   001b (i.e., Unregister Reservation Key) and supplying the
   reservation key in the Current Reservation Key (CRKEY) field.
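   The two Reservation Register uses described above can be sketched as
   follows. The helper names are hypothetical. The placement of RREGA
   in bits 2:0 of Command Dword 10 and the 16-byte data buffer (CRKEY
   followed by NRKEY, each a 64-bit little-endian value) reflect one
   reading of [NVME-BASE] and should be checked against that
   specification. The sketch only builds the fields; it does not submit
   a command to a controller.

```python
import struct

RREGA_REGISTER = 0b000    # Register Reservation Key
RREGA_UNREGISTER = 0b001  # Unregister Reservation Key


def reservation_register(rrega: int, crkey: int = 0, nrkey: int = 0):
    """Return (cdw10, data) for a Reservation Register command.

    Assumed layout: RREGA in CDW10 bits 2:0; data buffer holds CRKEY
    then NRKEY, each 64-bit little-endian.
    """
    if rrega not in (RREGA_REGISTER, RREGA_UNREGISTER):
        raise ValueError("only register/unregister are sketched here")
    cdw10 = rrega & 0x7
    data = struct.pack("<QQ", crkey, nrkey)
    return cdw10, data


def register_key(key: int):
    """Register: RREGA = 000b, key supplied in NRKEY."""
    return reservation_register(RREGA_REGISTER, nrkey=key)


def unregister_key(key: int):
    """Unregister: RREGA = 001b, key supplied in CRKEY."""
    return reservation_register(RREGA_UNREGISTER, crkey=key)
```

   Keeping registration and unregistration as separate helpers makes it
   harder to place the key in the wrong field, which is the main
   difference between the two actions.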
   One important difference between SCSI Persistent Reservations and
   NVMe Reservations is that NVMe reservation keys always apply to all
   controllers used by a host (as indicated by the NVMe Host
   Identifier). This behavior is analogous to setting the ALL_TG_PT bit
   when registering a SCSI Reservation key, and is always supported by
   NVMe Reservations, unlike the ALL_TG_PT bit, for which SCSI support
   is inconsistent and cannot be relied upon. Registering a reservation
   key with a namespace creates an association between a host and a
   namespace. A host that is a registrant of a namespace may use any
   controller with which that host is associated (i.e., that has the
   same Host Identifier, refer to Section 5.27.1.25 of [NVME-BASE]) to
   access that namespace as a registrant.

2.2.2.  PRs - MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using PRs. This is done
   by registering the reservation key generated for the MDS with the
   device (see Section 2.2.1) followed by a Reservation Acquire command
   (refer to Section 7.2 of [NVME-BASE]) with the Reservation Acquire
   Action (RACQA) field set to 000b (i.e., Acquire) and the Reservation
   Type (RTYPE) field set to 4h (i.e., Exclusive Access - Registrants
   Only Reservation).

2.2.3.  Fencing Action

   In case of a non-responding client, the MDS fences the client by
   executing a Reservation Acquire command (refer to Section 7.2 of
   [NVME-BASE]), with the Reservation Acquire Action (RACQA) field set
   to 001b (i.e., Preempt) or 010b (i.e., Preempt and Abort), the
   Current Reservation Key (CRKEY) field set to the server's
   reservation key, the Preempt Reservation Key (PRKEY) field set to
   the reservation key associated with the non-responding client, and
   the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive
   Access - Registrants Only Reservation).
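   The MDS reservation and the fencing action can be sketched with a
   single helper for the Reservation Acquire command. The function
   names are hypothetical. The sketch assumes RACQA occupies bits 2:0
   and RTYPE bits 15:8 of Command Dword 10, and that the data buffer
   holds CRKEY followed by PRKEY as 64-bit little-endian values; these
   positions should be verified against [NVME-BASE]. RTYPE uses 4h, the
   value Section 2.2.2 gives for Exclusive Access - Registrants Only.

```python
import struct

RACQA_ACQUIRE = 0b000
RACQA_PREEMPT = 0b001
RACQA_PREEMPT_AND_ABORT = 0b010
RTYPE_EXCL_ACCESS_REG_ONLY = 0x4  # Exclusive Access - Registrants Only


def reservation_acquire(racqa: int, crkey: int, prkey: int = 0,
                        rtype: int = RTYPE_EXCL_ACCESS_REG_ONLY):
    """Return (cdw10, data) for a Reservation Acquire command.

    Assumed layout: RACQA in CDW10 bits 2:0, RTYPE in bits 15:8;
    data buffer holds CRKEY then PRKEY, each 64-bit little-endian.
    """
    cdw10 = (racqa & 0x7) | ((rtype & 0xFF) << 8)
    data = struct.pack("<QQ", crkey, prkey)
    return cdw10, data


def mds_reserve(mds_key: int):
    """MDS preparation step: Acquire with the MDS key as CRKEY."""
    return reservation_acquire(RACQA_ACQUIRE, crkey=mds_key)


def fence_client(mds_key: int, client_key: int, abort: bool = False):
    """Fence a client: Preempt (or Preempt and Abort) its key."""
    racqa = RACQA_PREEMPT_AND_ABORT if abort else RACQA_PREEMPT
    return reservation_acquire(racqa, crkey=mds_key, prkey=client_key)
```

   Whether to use Preempt or Preempt and Abort is left to the caller
   here, since the text permits either action.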
   The client can distinguish I/O errors due to fencing from other
   errors based on the Reservation Conflict NVMe status code.

2.2.4.  Client Recovery after a Fence Action

   If an NVMe command issued by the client to the storage device
   returns a non-retryable error (refer to the DNR bit defined in
   Figure 92 in [NVME-BASE]), the client MUST commit all layouts that
   use the storage device through the MDS, return all outstanding
   layouts for the device, forget the device ID, and unregister the
   reservation key.

2.3.  Volatile write caches

   For NVMe controllers, a volatile write cache is enabled if bit 0 of
   the Volatile Write Cache (VWC) field in the Identify Controller Data
   Structure, I/O Command Set Independent (see Figure 275 in
   [NVME-BASE]) is set and the Volatile Write Cache Enable (WCE) bit
   (i.e., bit 0) in the Volatile Write Cache Feature (Feature
   Identifier 06h) (see Section 5.27.1.4 of [NVME-BASE]) is set. If a
   volatile write cache is enabled on an NVMe namespace used as a
   storage device for the pNFS SCSI layout, the pNFS server (MDS) MUST
   use the NVMe Flush command (see Section 7.1 of [NVME-BASE]) to flush
   the volatile write cache to stable storage before the LAYOUTCOMMIT
   operation returns.

3.  Security Considerations

   NFSv4 clients access NFSv4 metadata servers using the NFSv4
   protocol. The security considerations generally described in
   [RFC8881] apply to a client's interactions with the metadata server.
   However, NFSv4 clients and servers access NVMe storage devices at a
   lower layer than NFSv4. NFSv4 and RPC security are not directly
   applicable to the I/Os to data servers using NVMe.

   pNFS with an NVMe layout can be used with NVMe transports (e.g.,
   NVMe over PCIe [NVME-PCIE]) that provide essentially no additional
   security functionality. Alternatively, pNFS may be used with storage
   protocols such as NVMe over TCP [NVME-TCP] that can provide
   significant transport layer security.

   It is the responsibility of those administering and deploying pNFS
   with an NVMe layout to ensure that appropriate protection is
   deployed for that protocol. When using IP-based storage protocols
   such as NVMe over TCP, data confidentiality and integrity SHOULD be
   provided for traffic between pNFS clients and NVMe storage devices
   by using a secure communication protocol such as TLS [RFC8446]. For
   NVMe over TCP, TLS SHOULD be used as described in [NVME-TCP] to
   protect traffic between pNFS clients and NVMe namespaces used as
   storage devices.

   Physical security is a common means of protection for protocols not
   based on IP. In environments where the security requirements for the
   storage protocol cannot be met, pNFS with an NVMe layout SHOULD NOT
   be deployed.

   When security is available for the data server storage protocol, it
   is generally at a different granularity and with a different notion
   of identity than NFSv4 (e.g., NFSv4 controls user access to files,
   and NVMe controls initiator access to volumes). As with pNFS with
   the block layout type [RFC5663], the pNFS client is responsible for
   enforcing appropriate correspondences between these security layers.
   In environments where the security requirements are such that
   client-side protection from access to storage outside of the layout
   is not sufficient, pNFS with a SCSI layout on an NVMe namespace
   SHOULD NOT be deployed.

   As with other block-oriented pNFS layout types, the metadata server
   is able to fence off a client's access to the data on an NVMe
   namespace used as a storage device. If a metadata server revokes a
   layout, the client's access MUST be terminated at the storage
   devices via fencing as specified in Section 2.2. The client has a
   subsequent opportunity to acquire a new layout.

4.  IANA Considerations

   The document does not require any actions by IANA.

5.  Normative References

   [NVME-BASE]
              NVM Express, Inc., "NVM Express Base Specification,
              Revision 2.0b", January 2022.

   [NVME-NVM] NVM Express, Inc., "NVM Express NVM Command Set
              Specification, Revision 1.0b", January 2022.

   [NVME-PCIE]
              NVM Express, Inc., "NVMe over PCIe Transport
              Specification, Revision 1.0b", January 2022.

   [NVME-TCP] NVM Express, Inc., "NVM Express TCP Transport
              Specification, Revision 1.0b", January 2022.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5663]  Black, D., Fridella, S., and J. Glasgow, "Parallel NFS
              (pNFS) Block/Volume Layout", RFC 5663,
              DOI 10.17487/RFC5663, January 2010,
              <https://www.rfc-editor.org/info/rfc5663>.

   [RFC8154]  Hellwig, C., "Parallel NFS (pNFS) Small Computer System
              Interface (SCSI) Layout", RFC 8154,
              DOI 10.17487/RFC8154, May 2017,
              <https://www.rfc-editor.org/info/rfc8154>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8881]  Noveck, D., Ed. and C. Lever, "Network File System (NFS)
              Version 4 Minor Version 1 Protocol", RFC 8881,
              DOI 10.17487/RFC8881, August 2020,
              <https://www.rfc-editor.org/info/rfc8881>.

   [SPC5]     INCITS Technical Committee T10, "SCSI Primary
              Commands-5", ANSI INCITS 502-2019, 2019.

Authors' Addresses

   Christoph Hellwig
   Email: hch@lst.de


   Charles Lever
   Oracle Corporation
   United States of America
   Email: chuck.lever@oracle.com


   Sorin Faibish
   Cirrus Data Solutions Inc.
   11 Selwyn Road
   Newton, MA 02461
   United States of America
   Email: sorin.faibish@cdsi.us.com


   David L. Black
   Dell Technologies
   176 South Street
   Hopkinton, MA 01748
   United States of America
   Email: david.black@dell.com