NFSv4                                                          M. Eisler
Internet-Draft                                                    NetApp
Intended status: Standards Track                        October 27, 2008
Expires: April 30, 2009


                       Metadata Striping for pNFS
               draft-eisler-nfsv4-pnfs-metastripe-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 30, 2009.

Abstract

   This Internet-Draft describes a means to add metadata striping to
   pNFS.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].







Eisler                   Expires April 30, 2009                 [Page 1]


Internet-Draft           pNFS Metadata Striping             October 2008


Table of Contents

   1.  Introduction and Motivation  . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Scope of Metadata Striping . . . . . . . . . . . . . . . . . .  4
   4.  The Definition of Metadata Striping Layout . . . . . . . . . .  5
     4.1.  Name of Metadata Striping Layout Type  . . . . . . . . . .  5
     4.2.  Value of Metadata Striping Layout Type . . . . . . . . . .  5
     4.3.  Definition of the da_addr_body Field of the
           device_addr4 Data Type . . . . . . . . . . . . . . . . . .  6
     4.4.  Definition of the loh_body Field of the layouthint4
           Data Type  . . . . . . . . . . . . . . . . . . . . . . . .  7
     4.5.  Definition of the loc_body Field of the
           layout_content4 Data Type  . . . . . . . . . . . . . . . .  8
     4.6.  Definition of the lou_body Field of the layoutupdate4
           Data Type  . . . . . . . . . . . . . . . . . . . . . . . . 14
     4.7.  Storage Access Protocols . . . . . . . . . . . . . . . . . 14
     4.8.  Revocation of Layouts  . . . . . . . . . . . . . . . . . . 14
     4.9.  Stateids . . . . . . . . . . . . . . . . . . . . . . . . . 15
     4.10. Lease Terms  . . . . . . . . . . . . . . . . . . . . . . . 15
     4.11. Layout Operations Sent to an L-MDS . . . . . . . . . . . . 15
     4.12. Filehandles in Metadata Layouts  . . . . . . . . . . . . . 16
     4.13. READ and WRITE Operations  . . . . . . . . . . . . . . . . 16
     4.14. Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 16
       4.14.1.  Failure and Restart of Client . . . . . . . . . . . . 16
       4.14.2.  Failure and Restart of Server . . . . . . . . . . . . 16
       4.14.3.  Failure and Restart of Storage Device . . . . . . . . 16
   5.  Negotiation  . . . . . . . . . . . . . . . . . . . . . . . . . 16
   6.  Operational Recommendation for Deployment  . . . . . . . . . . 16
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 17
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 17
   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 17
   10. Normative References . . . . . . . . . . . . . . . . . . . . . 17
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 18
   Intellectual Property and Copyright Statements . . . . . . . . . . 19


1.  Introduction and Motivation

   The NFSv4.1 specification describes pNFS [2].  In NFSv4.1, pNFS is
   limited to the data contents of regular files.  The content of
   regular files is distributed (striped) across multiple storage
   devices.  Metadata is not distributed or striped, and indeed, the
   model presented in the NFSv4.1 specification is that of a single
   metadata server.  This document describes a means to add metadata
   striping to pNFS, which includes the notion of multiple metadata
   servers.  With metadata striping, multiple metadata servers can work
   together to provide higher parallel performance.

   This document does not require a new minor version of NFSv4.
   Instead, it requires a new layout type.

   The XDR description is provided in this document in a way that makes
   it simple for the reader to extract it into a ready-to-compile form.
   The reader can feed this document into the following shell script to
   produce the machine-readable XDR description of the metadata layout:

   #!/bin/sh
   grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'


   That is, if the above script is stored in a file called "extract.sh",
   and this document is in a file called "spec.txt", then the reader can
   run:

    sh extract.sh < spec.txt > md.x

   The script keeps only the lines that carry the sentinel sequence
   "///" and removes the leading white space and the sentinel from each
   of them.
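
   For readers without a POSIX shell, the same extraction can be
   sketched in Python.  This is an illustration only and not part of the
   specification: the function name extract_xdr is hypothetical, and the
   regular expressions simply mirror the grep and sed patterns of the
   shell script.

```python
import re


def extract_xdr(spec_text: str) -> str:
    """Return the XDR lines that are marked with the '///' sentinel."""
    out = []
    for line in spec_text.splitlines():
        # Like grep "^  *///": keep lines made of spaces, then the sentinel.
        if not re.match(r" +///", line):
            continue
        # Like sed 's?^  *///  ??': drop indent, sentinel, and two spaces.
        line = re.sub(r"^ +///  ", "", line, count=1)
        # Like sed 's?^.*///??': sentinel-only lines become empty lines.
        line = re.sub(r"^.*///", "", line, count=1)
        out.append(line)
    return "".join(item + "\n" for item in out)
```

   Calling extract_xdr on the text of this document would yield the same
   lines that the shell pipeline writes to md.x.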


2.  Terminology

   o  Initial Metadata Server (I-MDS).  The I-MDS is the metadata server
      the client obtains a filehandle from prior to acquiring any layout
      on the file.

   o  Layout Metadata Server (L-MDS).  The L-MDS is the metadata server
      the client obtains a filehandle from after being directed to it by
      a layout.

   o  Regular file: An object of file type NF4REG or NF4NAMEDATTR.


3.  Scope of Metadata Striping

   This proposal assumes a model where there are two or more servers
   capable of supporting NFSv4.1 operations.  At least one server is an
   I-MDS, and the I-MDS should be thought of as a normal NFSv4.1 server,
   with the additional capability of granting metadata layouts on
   demand.  The I-MDS might also be capable of granting non-metadata
   layouts, but this is irrelevant to the scope of metadata striping.
   The model also requires at least one additional server, an L-MDS,
   that is capable of supporting NFSv4.1 operations that are directed to
   the server by the I-MDS.  It is permissible for an I-MDS to also be
   an L-MDS, and an L-MDS to also be an I-MDS.  Indeed, a simple
   submodel is for every NFSv4.1 server in a set to be both an I-MDS and
   L-MDS.

   Metadata striping applies to all NFSv4.1 operations that operate on
   file objects.  These operations can be broken down into three
   classes:

   o  Filehandle-only.  These are operations that take just filehandles
      as arguments, i.e., the current filehandle, or both the current
      filehandle and the saved filehandle, and no component names of
      files.  When a client obtains a filehandle of a file object from
      an NFS server, it can obtain a metadata layout that indicates the
      optimal destination in the network to send filehandle-only
      operations for that file object.  For example, after obtaining the
      filehandle via OPEN, and the metadata layout via LAYOUTGET, the
      client wants to get a byte range lock on the file.  The client
      sends the LOCK request to the network address specified in the
      metadata layout.

   o  Name-based.  These are operations that take one or two filehandles
      (i.e., the current filehandle, or both the current filehandle and
      the saved filehandle) and one or two component names of files.
      When a client obtains a filehandle of a file object that is of
      type directory, it can obtain a metadata layout that indicates the
      optimal destinations in the network to send name-based operations
      for that directory.  The optimal destinations MUST apply to the
      current filehandle that the operation uses.  In other words, for
      LINK and RENAME, which take both the saved filehandle and the
      current filehandle as parameters, the pNFS client would use the
      metadata layout of the target directory (indicated in the current
      filehandle) for guidance on where to send the operation.  Note
      that if an L-MDS accepts a LINK or RENAME operation, the L-MDS
      MUST perform the operation atomically.  If it cannot, then the
      L-MDS MUST return the error NFS4ERR_XDEV, and the client MUST send
      the operation to the I-MDS.

      The choice of destination is a function of the name the client is
      requesting.  For example, after the client obtains the filehandle
      of a directory via LOOKUP and the metadata layout via LAYOUTGET,
      the client wants to open a regular file within the directory.  As
      with the LAYOUT4_NFSV4_1_FILES layout type, the client has a list
      of network addresses to send requests to.  With the
      LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list
      of network addresses is computed from the offset of the read or
      write request.  With the metadata layout, the choice of the index
      is derived from the name passed to OPEN (or by some other method,
      such as the name combined with one or more attributes of the
      directory, such as the filehandle, fileid, etc.).

   o  Directory-reading.  These are operations that take one filehandle
      and return the contents of a directory (currently, NFSv4 has just
      one such operation, READDIR).  When a client obtains a filehandle
      of a file object that is of type directory, it can obtain a
      metadata layout that indicates the optimal destination in the
      network to send directory-reading operations for that directory.
      For example, after the client obtains the filehandle of a
      directory via LOOKUP and the metadata layout via LAYOUTGET, the
      client wants to read the directory.  As with the
      LAYOUT4_NFSV4_1_FILES layout type, the client has a list of
      network addresses to send requests to.  With the
      LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list
      of network addresses is computed from the offset of the read or
      write request.  Since directories have cookies, which resemble
      offsets, the choice of the index is computed from the "cookie"
      argument to the operation.
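
   The three classes can be pictured as a small lookup that tells a
   client which part of a metadata layout (see Section 4.5) governs an
   operation.  The following Python sketch is illustrative only: the
   names OpClass and classify are hypothetical, and the operation sets
   are a partial selection, since this document does not enumerate the
   members of each class.

```python
from enum import Enum


class OpClass(Enum):
    # Each value names the md_layout4 array that guides the class.
    FILEHANDLE_ONLY = "mdl_fhops_layout"
    NAME_BASED = "mdl_nameops_layout"
    DIRECTORY_READING = "mdl_dirrdops_layout_segments"


# Partial, illustrative sets (an assumption of this sketch).
FILEHANDLE_ONLY_OPS = {"LOCK", "GETATTR", "SETATTR", "CLOSE", "READ", "WRITE"}
NAME_BASED_OPS = {"OPEN", "LOOKUP", "CREATE", "REMOVE", "LINK", "RENAME"}
DIRECTORY_READING_OPS = {"READDIR"}


def classify(op_name: str) -> OpClass:
    """Map an operation name to the class described above."""
    if op_name in DIRECTORY_READING_OPS:
        return OpClass.DIRECTORY_READING
    if op_name in NAME_BASED_OPS:
        return OpClass.NAME_BASED
    if op_name in FILEHANDLE_ONLY_OPS:
        return OpClass.FILEHANDLE_ONLY
    raise ValueError("operation not covered by this sketch: " + op_name)
```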


4.  The Definition of Metadata Striping Layout

4.1.  Name of Metadata Striping Layout Type

   The name of the metadata striping layout type is LAYOUT4_METADATA.

4.2.  Value of Metadata Striping Layout Type

   The value of the metadata striping layout type is TBD1.

4.3.  Definition of the da_addr_body Field of the device_addr4 Data Type

   ///  %#include "nfs4_prot.h"
   ///  union md_layout_addr4 switch (bool mdla_simple) {
   ///    case TRUE:
   ///      multipath_list4              mdla_simple_addr;
   ///    case FALSE:
   ///      nfsv4_1_file_layout_ds_addr4 mdla_complex_addr;
   ///  };


                                 Figure 1

   If mdla_simple is TRUE, the remainder of the device address contains
   a list of elements (mdla_simple_addr), where each element represents
   a network address of an L-MDS which can serve equally as the target
   of metadata operations (typically the filehandle-only operations).
   See Section 13.5 of [2] for a description of how the multipath_list4
   data type supports multi-pathing.

   If mdla_simple is FALSE, the remainder of the device address is the
   same as the LAYOUT4_NFSV4_1_FILES device address, consisting of an
   array of lists of L-MDSes (nflda_multipath_ds_list), and an array of
   indices (nflda_stripe_indices).  Each element of
   nflda_multipath_ds_list contains one or more subelements, and each
   subelement represents a network address of an L-MDS which may serve
   equally as the target of name-based and directory-reading operations
   (see Section 13.5 of [2]).  The number of elements in the
   nflda_multipath_ds_list array might be different from the stripe
   count.  The stripe count is the number of elements in
   nflda_stripe_indices.  The value of each element of
   nflda_stripe_indices is an index into nflda_multipath_ds_list, and
   thus the value of each element of nflda_stripe_indices MUST be less
   than the number of elements in nflda_multipath_ds_list.
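
   A client that has decoded a device address might check these
   constraints before using it.  The sketch below is a hedged
   illustration, not a definitive implementation: MdLayoutAddr and
   check_device_addr are hypothetical names, and network addresses are
   modeled as plain strings rather than netaddr4 structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MdLayoutAddr:
    """In-memory stand-in for a decoded md_layout_addr4."""
    simple: bool                                           # mdla_simple
    simple_addr: Optional[List[str]] = None                # mdla_simple_addr
    multipath_ds_list: Optional[List[List[str]]] = None    # nflda_multipath_ds_list
    stripe_indices: Optional[List[int]] = None             # nflda_stripe_indices


def check_device_addr(addr: MdLayoutAddr) -> None:
    """Raise ValueError if the complex form breaks the rules above."""
    if addr.simple:
        return
    lists = addr.multipath_ds_list or []
    if any(len(paths) == 0 for paths in lists):
        raise ValueError("each multipath list needs one or more addresses")
    for idx in addr.stripe_indices or []:
        # Every stripe index must select an existing multipath list.
        if not 0 <= idx < len(lists):
            raise ValueError(f"stripe index {idx} is out of range")
```

   Here the stripe count is len(stripe_indices), which may differ from
   len(multipath_ds_list).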

4.4.  Definition of the loh_body Field of the layouthint4 Data Type

   ///  enum md_layout_hint_care4 {
   ///         MD4_CARE_STRIPE_UNIT_SIZE    = 0x040,
   ///         MD4_CARE_STRIPE_CNT_NAMEOPS  = 0x080,
   ///         MD4_CARE_STRIPE_CNT_DIRRDOPS = 0x100
   ///  };
   ///  %
   ///  %/* Encoded in the loh_body field of type layouthint4: */
   ///  %
   ///  struct md_layouthint4 {
   ///         uint32_t        mdlh_care;
   ///         count4          mdlh_stripe_cnt_nameops;
   ///         count4          mdlh_stripe_cnt_dirrdops;
   ///         nfs_cookie4     mdlh_stripe_unit_size;
   ///  };

                                 Figure 2

   The layout-type specific content for the LAYOUT4_METADATA layout type
   is composed of four fields.  The first field, mdlh_care, is a set of
   flags indicating which values of the hint the client cares about.  If
   MD4_CARE_STRIPE_CNT_NAMEOPS is set, then the client indicates in the
   second field, mdlh_stripe_cnt_nameops, the preferred stripe count for
   name-based operations.  If MD4_CARE_STRIPE_CNT_DIRRDOPS is set, then
   the client indicates in the third field, mdlh_stripe_cnt_dirrdops,
   the preferred stripe count for directory-reading operations.  If
   MD4_CARE_STRIPE_UNIT_SIZE is set, then the client indicates in the
   fourth field, mdlh_stripe_unit_size, the preferred stripe unit size
   for directory-reading operations.
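
   As an illustration of how the flags and fields fit together, the
   following Python sketch builds the XDR encoding of an md_layouthint4.
   It relies on count4 being a 32-bit and nfs_cookie4 a 64-bit unsigned
   integer, as in the NFSv4.1 XDR.  The function name
   encode_md_layouthint4 is hypothetical, and sending zero in fields the
   client does not care about is an assumption of this sketch; this
   document does not say what such fields contain.

```python
import struct

MD4_CARE_STRIPE_UNIT_SIZE = 0x040
MD4_CARE_STRIPE_CNT_NAMEOPS = 0x080
MD4_CARE_STRIPE_CNT_DIRRDOPS = 0x100


def encode_md_layouthint4(cnt_nameops=None, cnt_dirrdops=None,
                          unit_size=None) -> bytes:
    """Encode the hint; an argument left as None is flagged as unused."""
    care = 0
    if cnt_nameops is not None:
        care |= MD4_CARE_STRIPE_CNT_NAMEOPS
    if cnt_dirrdops is not None:
        care |= MD4_CARE_STRIPE_CNT_DIRRDOPS
    if unit_size is not None:
        care |= MD4_CARE_STRIPE_UNIT_SIZE
    # XDR is big-endian: three 32-bit words, then one 64-bit cookie.
    return struct.pack(">IIIQ", care, cnt_nameops or 0,
                       cnt_dirrdops or 0, unit_size or 0)
```

   For example, encode_md_layouthint4(cnt_nameops=8) produces 20 octets
   whose first word is MD4_CARE_STRIPE_CNT_NAMEOPS.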

4.5.  Definition of the loc_body Field of the layout_content4 Data Type

   ///  struct md_layout_fhonly {
   ///    deviceid4   mdlf_devid;
   ///    nfs_fh4     mdlf_fh<1>;
   ///  };
   ///
   ///  struct md_layout_namebased {
   ///    deviceid4   mdln_devid;
   ///    uint32_t    mdln_namebased_alg;
   ///    uint32_t    mdln_first_index;
   ///    nfs_fh4     mdln_fh_list<>;
   ///  };
   ///
   ///  union md_layout_dirread_fhlist
   ///        switch (bool mdldf_use_namebased) {
   ///    case TRUE:
   ///      void;
   ///    case FALSE:
   ///      nfs_fh4     mdldf_fh_list<>;
   ///  };
   ///
   ///  struct md_layout_dirread {
   ///    deviceid4                mdld_devid;
   ///    nfs_cookie4              mdld_first_cookie;
   ///    nfs_cookie4              mdld_unit_size;
   ///    uint32_t                 mdld_first_index;
   ///    md_layout_dirread_fhlist mdld_fh_list;
   ///  };
   ///
   ///  struct md_layout4 {
   ///    md_layout_fhonly    mdl_fhops_layout<1>;
   ///    md_layout_namebased mdl_nameops_layout<1>;
   ///    md_layout_dirread   mdl_dirrdops_layout_segments<>;
   ///  };


                                 Figure 3

   The reply to a successful LAYOUTGET request MUST contain exactly one
   element in logr_layout.  The element contains the metadata layout.
   The metadata layout consists of three variable length arrays.  At
   least one of the arrays MUST be of non-zero length.

   o  mdl_fhops_layout.  This is an array of up to one element.  If
      there is one element, the element indicates the preferred set of
      L-MDSes as the target of filehandle-only operations.  The element
      contains two fields: mdlf_devid, the pNFS device ID of the L-MDS,
      and mdlf_fh, an array of up to one filehandle.

      When the client receives a layout that has an mdl_fhops_layout
      array with one element, it uses GETDEVICEINFO to map mdlf_devid to
      a device address of data type md_layout_addr4.  The field
      mdlf_devid MUST map to a device address with mdla_simple set to
      TRUE.  The client can then select any element in mdla_simple_addr
      as the destination of a filehandle-only operation.  The current
      filehandle REQUIRED for use with the filehandle-only operation is
      either mdlf_fh[0] (if and only if mdlf_fh has one element) or the
      filehandle the pNFS client used as the current filehandle to the
      LAYOUTGET operation that returned the metadata layout.

   o  mdl_nameops_layout.  This is an array of up to one element.  If
      there is one element, the element indicates the preferred set of
      L-MDSes as the target of name-based operations.  The list of
      L-MDSes is mapped from the mdln_devid device ID.  The array
      mdln_fh_list is used to select a filehandle for accessing an
      L-MDS.  The number of elements in this array MUST be one of three
      values:

      *  Zero.  This means that filehandles used for each L-MDS are the
         same as the filehandle used as the current filehandle to
         LAYOUTGET.

      *  One.  This means that every L-MDS uses the filehandle in
         mdln_fh_list[0].

      *  The same number of elements as
         mdla_complex_addr.nflda_multipath_ds_list.  Thus when sending a
         name-based operation to any L-MDS in
         mdla_complex_addr.nflda_multipath_ds_list[X], the filehandle in
         mdln_fh_list[X] MUST be used.

      The field mdln_first_index is the index of the first element of
      the mdla_complex_addr.nflda_stripe_indices array to use.  The
      field mdln_namebased_alg identifies the algorithm used to compute
      the actual element in the mdla_complex_addr.nflda_stripe_indices
      array to use.

      When the client receives a layout that has an mdl_nameops_layout
      array with one element, it uses GETDEVICEINFO to map mdln_devid to
      a device address of data type md_layout_addr4.  The value of the
      device address field mdla_simple MUST be set to FALSE.  The client
      determines the filehandle and the set of L-MDS network addresses
      for a name-based operation via the following algorithm:

   let F be the function designated by
     mdln_namebased_alg;

   let X = (x1, x2, x3, ...) some set of inputs for
     function F, such that x1 SHOULD be the
     component name of the file;

   stripe_unit_number = F(X);
   stripe_count = number of elements in
     mdla_complex_addr.nflda_stripe_indices;

   j = (stripe_unit_number + mdln_first_index) %
     stripe_count;

   idx = nflda_stripe_indices[j];

   fh_count = number of elements in mdln_fh_list;
   lmds_count = number of elements in
   mdla_complex_addr.nflda_multipath_ds_list;

   switch (fh_count) {
   case lmds_count:
     fh = mdln_fh_list[idx];
     break;

   case 1:
     fh = mdln_fh_list[0];
     break;

   case 0:
     fh = current filehandle passed to LAYOUTGET;
     break;

   default:
     throw a fatal exception;
     break;
   }

   address_list =
     mdla_complex_addr.nflda_multipath_ds_list[idx];

                                   Figure 4

      The client would then select an L-MDS from address_list, and send
      the name-based operation using the filehandle specified in fh.

   o  mdl_dirrdops_layout_segments.  This is an array of zero or more
      elements.  Each element indicates the preferred set of L-MDSes as
      the destination for directory-reading operations and the pattern
      in which directory-reading operations iterate over the L-MDSes.
      The set of L-MDSes is mapped from the mdld_devid device ID.  The
      field mdld_first_cookie indicates the first directory entry cookie
      a directory-reading operation can use for the first unit of the
      pattern in this element.  E.g., the value of mdld_first_cookie can
      be used as the value of the "cookie" field in READDIR4args.  In
      the first element, mdld_first_cookie MUST be zero.  The last
      cookie that can be used on the pattern can be no higher than one
      less than the value of mdld_first_cookie of the next element.  If
      there is no next element, then the pattern is valid for all
      cookies from mdld_first_cookie through NFS4_UINT64_MAX inclusive.
      The field mdld_unit_size indicates the maximum number of cookies
      that can be read from each unit of a pattern, and thus indicates
      the lowest value of the "cookie" field in READDIR4args for each
      unit after the first unit.  For example, if mdld_unit_size is
      100000, and mdld_first_cookie is zero, then the value of the
      "cookie" field in the READDIR4args of the READDIR operation sent
      to the second unit MUST be greater than or equal to 100000, and
      less than 200000.  The field mdld_fh_list is used to select a
      filehandle for accessing an L-MDS.  It is a switched union with a
      boolean discriminator mdldf_use_namebased.  If mdldf_use_namebased
      is TRUE, then the filehandle is selected from
      mdl_nameops_layout.mdln_fh_list; if it is FALSE, the filehandle is
      selected from mdldf_fh_list.  The number of elements in the
      selected array MUST be one of three values:

      *  Zero.  This means that filehandles used for each L-MDS are the
         same as the filehandle used as the current filehandle to
         LAYOUTGET.

      *  One.  This means that every L-MDS uses the filehandle in
         element zero of that array.

      *  The same number of elements as
         mdla_complex_addr.nflda_multipath_ds_list.  Thus when sending a
         directory-reading operation to any L-MDS in
         mdla_complex_addr.nflda_multipath_ds_list[X], the filehandle in
         element X of that array MUST be used.

      The field mdld_first_index is the index of the first element of
      the mdla_complex_addr.nflda_stripe_indices array to use.

      When the client receives a layout that has an
      mdl_dirrdops_layout_segments array with more than zero elements,
      it uses GETDEVICEINFO to map the mdld_devid of each element of the
      array to a device address of data type md_layout_addr4.  The value
      of the device address field mdla_simple MUST be set to FALSE.  The
      client determines the filehandle and the set of L-MDS network
      addresses for a directory-reading operation via the following
      algorithm:

   let cookie_arg be the cookie the pNFS client will
     use as the value of the cookie argument to a
     directory reading operation;

   segment_count = number of elements in
     mdl_dirrdops_layout_segments;

   find index k, such that (cookie_arg >=
     mdl_dirrdops_layout_segments[k].mdld_first_cookie)
     && ((k == (segment_count - 1)) || (cookie_arg
     < mdl_dirrdops_layout_segments[k+1].mdld_first_cookie));

   relative_cookie = cookie_arg -
     mdl_dirrdops_layout_segments[k].mdld_first_cookie;

   /* i is the stripe unit number within segment k */
   i = floor(relative_cookie /
     mdl_dirrdops_layout_segments[k].mdld_unit_size);

   stripe_count = number of elements in
     mdla_complex_addr.nflda_stripe_indices;

   j = (i + mdl_dirrdops_layout_segments[k].mdld_first_index) %
     stripe_count;

   idx = nflda_stripe_indices[j];

   if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
       mdldf_use_namebased == TRUE) {
     fh_count = number of elements in mdln_fh_list;
   } else {
     fh_count = number of elements in
       mdl_dirrdops_layout_segments[k].mdld_fh_list.
         mdldf_fh_list;
   }

   lmds_count = number of elements in
     mdla_complex_addr.nflda_multipath_ds_list;

   switch (fh_count) {
     case lmds_count:
       if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
           mdldf_use_namebased == TRUE) {
         fh = mdln_fh_list[idx];
       } else {
         fh = mdl_dirrdops_layout_segments[k].mdld_fh_list.
           mdldf_fh_list[idx];
       }
       break;

     case 1:
       if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
           mdldf_use_namebased == TRUE) {
         fh = mdln_fh_list[0];
       } else {
         fh = mdl_dirrdops_layout_segments[k].mdld_fh_list.
           mdldf_fh_list[0];
       }
       break;

     case 0:
       fh = current filehandle passed to LAYOUTGET;
       break;

     default:
       throw a fatal exception;
       break;
   }

   address_list =  mdla_complex_addr.
     nflda_multipath_ds_list[idx];

                                   Figure 5

      The client would then select an L-MDS from address_list, and send
      the directory-reading operation using the filehandle specified in
      fh.  When the client is reading the beginning of the directory,
      cookie_arg is always zero.  Subsequent directory-reading
      operations to read the rest of the directory will use the last
      cookie returned by the L-MDS.  An MDS returning a metadata layout
      SHOULD return cookies that can be used directly with the I-MDS
      that returned the layout.  However, this might not always be
      possible.  For example, the directory design of the filesystem of
      the MDS might not return cookies in ascending order, or in any
      order at all for that matter, whereas striping by definition
      requires an ordering.  In such cases, if a directory is restriped
      while a pNFS client is reading its contents from the L-MDSes, it
      is possible that the client will be unable to complete reading the
      directory, and as a result an error is returned to the process
      reading the directory.  To mitigate this, servers that have sent a
      CB_LAYOUTRECALL on the directory SHOULD NOT revoke the layout as
      long as they detect that the client is completing a read of the
      entire directory.  Once a client has received a CB_LAYOUTRECALL,
      it SHOULD NOT send a directory-reading operation to an L-MDS with
      a cookie argument of zero.  If the server has sent a
      CB_LAYOUTRECALL, the L-MDS SHOULD reject requests to read the
      directory that have a cookie argument of zero and return the error
      NFS4ERR_PNFS_NO_LAYOUT.
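
   The selection algorithms of Figure 4 and Figure 5 can be sketched in
   Python as follows.  This is an illustration under stated assumptions,
   not a normative implementation.  The function names, the dictionary
   keys used for a layout segment, and the use of CRC-32 as the function
   F are all hypothetical (this document leaves the algorithms named by
   mdln_namebased_alg undefined).  Segments are assumed to be sorted by
   first cookie, and each segment carries the stripe indices and
   multipath lists of the device address its device ID maps to.

```python
import zlib
from bisect import bisect_right


def pick_fh(fh_list, idx, lmds_count, layoutget_fh):
    """Apply the zero / one / one-per-L-MDS rule for a filehandle array."""
    if len(fh_list) == lmds_count:
        return fh_list[idx]
    if len(fh_list) == 1:
        return fh_list[0]
    if len(fh_list) == 0:
        return layoutget_fh
    raise ValueError("filehandle array has an invalid length")


def name_based_target(name, first_index, stripe_indices,
                      multipath_ds_list, fh_list, layoutget_fh, alg=None):
    """Figure 4: choose (fh, address_list) for a name-based operation."""
    # Hypothetical stand-in for F: CRC-32 of the component name.
    alg = alg or (lambda n: zlib.crc32(n.encode("utf-8")))
    stripe_unit_number = alg(name)
    j = (stripe_unit_number + first_index) % len(stripe_indices)
    idx = stripe_indices[j]
    fh = pick_fh(fh_list, idx, len(multipath_ds_list), layoutget_fh)
    return fh, multipath_ds_list[idx]


def dirread_target(cookie_arg, segments, nameops_fh_list, layoutget_fh):
    """Figure 5: choose (fh, address_list) for a directory read."""
    firsts = [seg["first_cookie"] for seg in segments]
    # Last segment whose first cookie is at or below cookie_arg.
    k = bisect_right(firsts, cookie_arg) - 1
    seg = segments[k]
    i = (cookie_arg - seg["first_cookie"]) // seg["unit_size"]
    j = (i + seg["first_index"]) % len(seg["stripe_indices"])
    idx = seg["stripe_indices"][j]
    fh_list = nameops_fh_list if seg["use_namebased"] else seg["fh_list"]
    fh = pick_fh(fh_list, idx, len(seg["multipath_ds_list"]), layoutget_fh)
    return fh, seg["multipath_ds_list"][idx]
```

   With a single segment whose unit size is 100000 and whose stripe
   indices are [0, 1], a cookie of 150000 falls in stripe unit 1 and so
   selects the second multipath list, while a cookie of 200000 wraps
   around to the first.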

4.6.  Definition of the lou_body Field of the layoutupdate4 Data Type

   ///  %/*
   ///  % * LAYOUT4_METADATA.
   ///  % * Encoded in the lou_body field of type layoutupdate4:
   ///  % *      Nothing. lou_body is a zero length array of octets.
   ///  % */
   ///  %

                                 Figure 6

   The LAYOUT4_METADATA layout type has no content for the lou_body
   field of the layoutupdate4 data type.

4.7.  Storage Access Protocols

   The LAYOUT4_METADATA layout type uses NFSv4.1 operations (and
   potentially, operations of higher minor versions of NFSv4, subject to
   the definition of a minor version of NFSv4) to access striped
   metadata.  The LAYOUT4_METADATA layout type does not affect access to
   storage devices.  Thus a client might be able to obtain both a
   LAYOUT4_METADATA layout and a layout of a non-LAYOUT4_METADATA type
   (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or
   LAYOUT4_BLOCK_VOLUME) on the same regular file.  Of course, for a
   non-regular file, a pNFS client will be unable to get layouts of
   types LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or
   LAYOUT4_BLOCK_VOLUME.

4.8.  Revocation of Layouts

   Servers MAY revoke layouts of type LAYOUT4_METADATA.  A client
   detects that a layout has been revoked when an operation is rejected
   with NFS4ERR_PNFS_NO_LAYOUT.  In NFSv4.1, the error
   NFS4ERR_PNFS_NO_LAYOUT could be returned only by READ and WRITE.
   When the server returns a layout of type LAYOUT4_METADATA, the set of
   operations that can return NFS4ERR_PNFS_NO_LAYOUT is: ACCESS, CLOSE,
   COMMIT, CREATE, DELEGRETURN, GETATTR, LINK, LOCK, LOCKT, LOCKU,
   LOOKUP, LOOKUPP, NVERIFY, OPEN, OPENATTR, OPEN_DOWNGRADE, READ,
   READDIR, READLINK, REMOVE, RENAME, SECINFO, SETATTR, VERIFY, WRITE,
   GET_DIR_DELEGATION, SECINFO_NO_NAME, and WANT_DELEGATION.
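
   A client could encode this rule as a simple membership test.  The
   Python sketch below is illustrative only: layout_revoked is a
   hypothetical helper, operations and status codes are represented by
   their names as strings, and what the client does after detecting the
   revocation is outside the scope of the sketch.

```python
# Operations that can return NFS4ERR_PNFS_NO_LAYOUT under a metadata layout.
NO_LAYOUT_OPS = frozenset({
    "ACCESS", "CLOSE", "COMMIT", "CREATE", "DELEGRETURN", "GETATTR",
    "LINK", "LOCK", "LOCKT", "LOCKU", "LOOKUP", "LOOKUPP", "NVERIFY",
    "OPEN", "OPENATTR", "OPEN_DOWNGRADE", "READ", "READDIR", "READLINK",
    "REMOVE", "RENAME", "SECINFO", "SETATTR", "VERIFY", "WRITE",
    "GET_DIR_DELEGATION", "SECINFO_NO_NAME", "WANT_DELEGATION",
})


def layout_revoked(op_name: str, status: str) -> bool:
    """True when a reply signals that the metadata layout was revoked."""
    return status == "NFS4ERR_PNFS_NO_LAYOUT" and op_name in NO_LAYOUT_OPS
```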

4.9.  Stateids

   The pNFS specification for LAYOUT4_NFSV4_1_FILES states that data
   servers MUST be aware of the stateids granted by the MDS so that the
   stateids passed to READ and WRITE can be properly validated.  This
   requirement extends to the LAYOUT4_METADATA layout type: the L-MDS
   MUST be aware of any non-layout stateids granted by the I-MDS, if and
   only if the client is in contact with the L-MDS under direction of a
   metadata layout returned by the I-MDS, and the I-MDS has not recalled
   or revoked that layout.  In addition, because an L-MDS can accept
   operations like OPEN and LOCK that create or modify stateids, the
   I-MDS MUST be aware of stateids that an L-MDS has returned to a
   client, if and only if the I-MDS granted the client a metadata layout
   that directed the client to the L-MDS.

   In some cases, one L-MDS MUST be aware of a stateid generated by
   another L-MDS.  For example, a client can obtain a stateid from the
   L-MDS serving as the destination of name-based operations, which
   include OPEN.  However, operations that use the stateid will be
   filehandle-only operations, and the L-MDS the OPEN operation is sent
   to might differ from the L-MDS the LOCK operation for the same target
   file is sent to.

4.10.  Lease Terms

   Any state the client obtains from an I-MDS or L-MDS is guaranteed to
   last for an interval as long as the maximum of the lease_time
   attributes of the I-MDS and of any L-MDS the client is directed to as
   the result of a metadata layout.  The client has a lease for each
   client ID it has with an I-MDS or L-MDS, and each lease MUST be
   renewed separately for each client ID.
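
   As a small worked example of the interval described above, the
   following Python sketch computes it from the lease_time values, in
   seconds, that a client has read from each server.  The function name
   is hypothetical.

```python
def guaranteed_state_interval(imds_lease_time, lmds_lease_times):
    """Maximum lease_time across the I-MDS and the L-MDSes in use."""
    return max([imds_lease_time, *lmds_lease_times])
```

   With an I-MDS lease_time of 90 and L-MDS lease_time values of 60 and
   120, the interval is 120 seconds.  Each lease still has to be renewed
   separately.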

4.11.  Layout Operations Sent to an L-MDS

   An L-MDS MAY allow a LAYOUTGET operation.  One reason the L-MDS might
   allow a LAYOUTGET operation is to allow hierarchical striping.  For
   example, for name-based operations, the pNFS server might use a radix
   tree, (which the field mdln_namebased_alg would indicate).  The first
   four bytes of the component name would be combined to form a 32 bit
   stripe_unit_number.  Once the client contacted the L-MDS, it would
   repeat the algorithm on the second four bytes of the component, and
   so on until the component name was exhausted.
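
   The chunking step in the radix tree example can be sketched as
   follows.  This is a minimal, non-normative sketch: the function name
   is hypothetical, and the big-endian byte order and the zero padding
   of a final short chunk are assumptions made only for illustration,
   since the text above does not specify either.

```python
from typing import List


def radix_stripe_unit_numbers(component: bytes) -> List[int]:
    """Split a component name into successive 32-bit stripe unit numbers.

    Element 0 would be used at the first level of the hierarchy,
    element 1 after contacting the L-MDS chosen at level 0, and so on
    until the component name is exhausted.
    """
    units = []
    for offset in range(0, len(component), 4):
        # ASSUMPTION: pad a short trailing chunk with zero bytes.
        chunk = component[offset:offset + 4].ljust(4, b"\x00")
        # ASSUMPTION: combine the four bytes in big-endian order.
        units.append(int.from_bytes(chunk, "big"))
    return units


print([hex(u) for u in radix_stripe_unit_numbers(b"abcdefgh")])
# prints ['0x61626364', '0x65666768']
```

   An eight-byte component name therefore yields two levels of
   striping under these assumptions.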

   Once an L-MDS grants a layout, the client MUST use only the L-MDS
   that granted the layout to send LAYOUTUPDATE, LAYOUTCOMMIT, and
   LAYOUTRETURN.

4.12.  Filehandles in Metadata Layouts

   The filehandles returned in a metadata layout are subject to becoming
   stale at any time.  The L-MDS SHOULD NOT return NFS4ERR_STALE unless
   the I-MDS has recalled or revoked the corresponding layout.

4.13.  READ and WRITE Operations

   READ and WRITE are filehandle-only operations, and thus the pNFS
   client SHOULD attempt to obtain a non-metadata layout for a regular
   file.  If it cannot, then it MAY use the metadata layout to send READ
   and WRITE operations to an L-MDS.  An L-MDS MUST accept a READ or
   WRITE operation if the layout the I-MDS returned to the client
   included a filehandle-only layout.
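
   One way a client might order these choices is sketched below.  The
   function name, parameter names, and returned labels are hypothetical.
   Falling back to the I-MDS when no usable layout exists is an
   assumption of this sketch, and because use of the metadata layout is
   a MAY, a client could also choose the I-MDS in the second case.

```python
def choose_io_target(has_non_metadata_layout: bool,
                     metadata_layout_has_fh_only: bool) -> str:
    """Pick where a client sends READ or WRITE for a regular file."""
    if has_non_metadata_layout:
        # Preferred: use the non-metadata layout the client obtained.
        return "storage-device"
    if metadata_layout_has_fh_only:
        # Permitted: the L-MDS must accept READ/WRITE in this case.
        return "l-mds"
    # ASSUMPTION: with no usable layout, send the I/O to the I-MDS.
    return "i-mds"


print(choose_io_target(False, True))  # prints l-mds
```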

4.14.  Recovery

   [[Comment.1: it is likely this section will follow that of the files
   layout type specified in the NFSv4.1 specification.]]

4.14.1.  Failure and Restart of Client

   TBD

4.14.2.  Failure and Restart of Server

   TBD

4.14.3.  Failure and Restart of Storage Device

   TBD


5.  Negotiation

   A pNFS client sends a GETATTR operation for attribute
   fs_layout_type.  If the reply contains the metadata layout type, then
   metadata striping is supported, subject to further verification by a
   LAYOUTGET operation.  If not, the client cannot use metadata
   striping.
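
   The client-side check can be sketched as follows.  This is a
   non-normative sketch: the function name is hypothetical, and the
   numeric value given to LAYOUT4_METADATA is a placeholder only, since
   the real value is still to be assigned (see the IANA Considerations
   section).

```python
from typing import Iterable

LAYOUT4_NFSV4_1_FILES = 0x1    # value assigned by the NFSv4.1 specification
LAYOUT4_METADATA = 0x7FFFFFFF  # HYPOTHETICAL placeholder; real value is TBD


def may_try_metadata_striping(fs_layout_type: Iterable[int]) -> bool:
    """Return True if the GETATTR reply lists the metadata layout type.

    A True result only means that a LAYOUTGET for a metadata layout is
    worth attempting; the LAYOUTGET itself provides the verification.
    """
    return LAYOUT4_METADATA in set(fs_layout_type)


print(may_try_metadata_striping([LAYOUT4_NFSV4_1_FILES]))  # prints False
```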


6.  Operational Recommendation for Deployment

   Deploy the metadata striping layout when it is anticipated that the
   workload will involve a high fraction of non-I/O operations on
   filehandles.


7.  Acknowledgements

   Brent Welch had the idea of returning a separate device ID for
   filehandle-only operations in the metadata layout.  Pranoop Erasani,
   Dave Noveck, and Richard Jernigan provided valuable feedback.


8.  Security Considerations

   The security considerations of Section 13.12 of [2] that are specific
   to data servers apply to L-MDSes.  In addition, each L-MDS and its
   client are, respectively, a complete NFSv4.1 server and client, and
   so the security considerations of [2] apply to any client or server
   using the metadata layout type.


9.  IANA Considerations

   This specification requires an addition to the Layout Types registry
   described in Section 22.4 of [2].  The five fields added to the
   registry are:

   1.  Name of layout type: LAYOUT4_METADATA

   2.  Value of layout type: TBD1.

   3.  Standards Track RFC that describes this layout: RFCTBD2, which is
       the RFC of this document.

   4.  How the RFC Introduces the specification: L.

   5.  Minor versions of NFSv4 that can use the layout type: 1.

   This specification requires the creation of a registry of hash
   algorithms for supporting the field mdln_namebased_alg.  Details TBD.


10.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.

   [2]  Shepler, S., Eisler, M., and D. Noveck, "NFS Version 4 Minor
        Version 1", draft-ietf-nfsv4-minorversion1-26 (work in
        progress), Sep 2008.


Author's Address

   Mike Eisler
   NetApp
   5765 Chase Point Circle
   Colorado Springs, CO  80919
   US

   Phone: +1-719-599-9026
   Email: mike@eisler.com


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.