Internet Engineering Task Force                           M. Eisler, Ed.
Internet-Draft                                                    NetApp
Intended status: Informational                          M. Susairaj, Ed.
Expires: April 16, 2011                                           Oracle
                                                        October 13, 2010


            Extending NFS to Support Enterprise Applications
                 draft-eisler-nfsv4-enterprise-apps-00

Abstract

   This document proposes a new operating to efficiently initialize
   files.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 16, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.





Eisler & Susairaj        Expires April 16, 2011                 [Page 1]


Internet-Draft              Abbreviated Title               October 2010


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
     1.1.  Requirements Language  . . . . . . . . . . . . . . . . . .  4
   2.  Operation XX: INITIALIZE - Initialize File . . . . . . . . . .  4
     2.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . .  5
     2.5.  IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . .  6
   3.  Operation XX: IO_ADVISE - Advise server of client's
       intended I/O access pattern  . . . . . . . . . . . . . . . . .  6
     3.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . .  7
     3.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . .  7
     3.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . .  7
     3.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . .  8
   4.  Operation XX: READ_WITH_ADVICE - READ with advice  . . . . . .  9
     4.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . .  9
     4.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . . 10
     4.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . 10
     4.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . . 11
   5.  Operation XX: WRITE_WITH_ADVICE - WRITE with advice  . . . . . 11
     5.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . . 12
     5.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . . 13
     5.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . 13
     5.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . . 13
   6.  Operation XX: SET_WORKFLOW_TAG - Sets the workflow tag of
       a given session  . . . . . . . . . . . . . . . . . . . . . . . 14
     6.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . . 15
     6.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     6.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . 15
     6.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . . 15
   7.  Operation XX: SESSION_CTL - Adjust session parameters  . . . . 15
     7.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . . 16
     7.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . . 16
     7.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . 16
     7.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . . 17
   8.  Modification to Operation 42: EXCHANGE_ID - Instantiate
       Client ID  . . . . . . . . . . . . . . . . . . . . . . . . . . 18
     8.1.  ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . . . 18
     8.2.  RESULT . . . . . . . . . . . . . . . . . . . . . . . . . . 18
     8.3.  MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . 19
     8.4.  DESCRIPTION  . . . . . . . . . . . . . . . . . . . . . . . 19
   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19
   10. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 20
   11. Security Considerations  . . . . . . . . . . . . . . . . . . . 20
   12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 20
     12.1. Normative References . . . . . . . . . . . . . . . . . . . 20



Eisler & Susairaj        Expires April 16, 2011                 [Page 2]


Internet-Draft              Abbreviated Title               October 2010


     12.2. Informative References . . . . . . . . . . . . . . . . . . 20
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 21

















































Eisler & Susairaj        Expires April 16, 2011                 [Page 3]


Internet-Draft              Abbreviated Title               October 2010


1.  Introduction

   Enterprise applications (such as databases) have requirements that go
   beyond the traditional use cases for NFS.  The requirements falls
   into two broad categories: (1) data integrity and (2) quality of
   service.  This document proposes a set of operatons for a future
   minor version of NFSv4 to support requirements of enterprise
   applications.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


2.  Operation XX: INITIALIZE - Initialize File

2.1.  ARGUMENT

   struct INITIALIZE4args {
              /* CURRENT_FH: file */
              stateid4        ia_stateid;
              offset4         ia_offset;
              length4         ia_blocksize
              length4         ia_blockcount;
              length4         ia_reloff_pattern;
              length4         ia_reloff_blocknum;
              opaque          ia_pattern<>;
   };


2.2.  RESULT

    nfsstat4;

2.3.  MOTIVATION

   Most enterprise applications that use files almost always need to
   initialize such files to a known state.  Even with existing files,
   after such a file grows, the application needs to initialize the
   expanded region the file.  The most trivial initial state is
   intialize every byte to zero.  The problem with initializing to zero
   is that it is often difficult to distinguish a byte-range of
   initialized to all zeroes from data corruption, since a pattern of
   zeroes is a probable pattern for corruption.  Instead, some
   applications, such as database management systems, use pattern
   consisting of bytes or words of non-zero values.  Ideally one would



Eisler & Susairaj        Expires April 16, 2011                 [Page 4]


Internet-Draft              Abbreviated Title               October 2010


   like to efficiently initialize an entire file to a specified pattern
   without having to send WRITE requests for the entire file.  The
   INITIALIZE operation is bandwidth conserving operation for
   initializing file state.

2.4.  DESCRIPTION

   The INITIALIZE operation is used to initialize an open file to an
   iterated pattern.  The pattern consists of a fixed string, and a
   block number.  The pattern is defined by the arguments.

   o  ia_offset: where to start the iterated pattern.  This value is
      specified in bytes.

   o  ia_blocksize: the size of each iteration of the pattern.  Each
      iteration is called a block.

   o  ia_reloff_pattern: the relative offset within a block where to
      write the specified pattern encoded in ia_pattern.

   o  ia_reloff_blocknum: the relative offset within a block where to
      write a 64 bit block number.  The block number is incremented once
      a block is written.  The block number is always written in little
      endian order.  If ia_reloff_blocknum is set to NFS4_UINT64_MAX,
      then this informs the server that no block number is to be
      written.

   o  ia_pattern: a fixed string written to every block.  If the length
      of ia_pattern is zero, then this informs the server that no string
      is to be written.

   The field ia_stateid is the stateid corresponding to the current
   filehandle's share reservation, delegation, or byte range lock.

   An example will illustrate how the client uses INITIALIZE.  Suppose
   the arguments (except for ia_stateid) are: { 0, 500, 1000, 8, 0,
   "DeadBeef" }.  Then starting with offset zero, the content of the
   file will have these contents.













Eisler & Susairaj        Expires April 16, 2011                 [Page 5]


Internet-Draft              Abbreviated Title               October 2010


   offset value (decimal or ASCII)
   0      0   0   0   0   0   0   0   0
   8      'D' 'e' 'a' 'd' 'B' 'e' 'e' 'f'
   16-499 zeroes

   500    0   0   0   0   0   0   0   1
   508    'D' 'e' 'a' 'd' 'B' 'e' 'e' 'f'
   516-999 zeroes

   ...

   499500  0   0   0   0   0   0   3   231
   499508  'D' 'e' 'a' 'd' 'B' 'e' 'e' 'f'
   499516-499999 zeroes


2.5.  IMPLEMENTATION

   When an NFS server receives this operation, instead of writing the
   iterated pattern over each block, it should de-allocate the data of
   affected range of the file and record the the values of ia_offset,
   ia_blocksize, ia_blockcount, ia_reloff_pattern, ia_reloff_blocknum,
   and ia_pattern<> in the file's system metadata.  When a client sends
   a READ request, instead of returning zeroes, it should construct a
   response corresponding to the pattern specified in the arguments to
   INITIALIZE.

   An application likely has a legacy pattern for initialized blocks
   which cannot be mapped to that specified for INITIALIZE.  The
   application should modified to detect that the block corresponds to
   INITIALIZE's pattern.  When the application sees such a block, it can
   overwrite the block with the legacy pattern.  Note that will cause
   the block to be allocated on the NFS server.

   When the length of ia_pattern is zero and the value of
   ia_reloff_blocknum is NFS4_UINT64_MAX, then the client is requesting
   that a hole be punched into the file.


3.  Operation XX: IO_ADVISE - Advise server of client's intended I/O
    access pattern










Eisler & Susairaj        Expires April 16, 2011                 [Page 6]


Internet-Draft              Abbreviated Title               October 2010


3.1.  ARGUMENT

           enum io_advise_type {
              IO_ADVISE4_SEQUENTIAL_CACHE       = 0,
              IO_ADVISE4_SEQUENTIAL_DONTCACHE       = 1,
              IO_ADVISE4_RANDOM           = 2,
              IO_ADVISE4_PREFETCH         = 3,
              IO_ADVISE4_PREFETCH_OPPORTUNISTIC = 4,
              IO_ADVISE4_INTENT_TO_WRITE = 5,
              IO_ADVISE4_RECENTLY_USED = 6
           };


           struct io_directions {
                   stateid4        iod_stateid;
                   offset4         iod_offset;
                   bitmap4         iod_flags;
           };

           struct IO_ADVISE4args {
                  /* CURRENT_FH: file */
                  io_directions ioaa_directions;
                  length4       ioaa_count;
           };



3.2.  RESULT

      struct IO_ADVISE4resok {
              bitmap4              ioar_flags;
      };

      union IO_ADVISE4res switch (nfsstat4 ioar_status) {
       case NFS4_OK:
               IO_ADVISE4resok  ioar_resok4;
       default:
               void;
      };


3.3.  MOTIVATION

   The client is in a better position to deduce the intended I/O pattern
   than the server, especially if the application provides this
   information.  With this information, the server can optimize I/O to
   the file.




Eisler & Susairaj        Expires April 16, 2011                 [Page 7]


Internet-Draft              Abbreviated Title               October 2010


3.4.  DESCRIPTION

   The IO_ADVISE operation is used advise the server as to how the
   holder of the stateid intends to access the file over the specified
   byte range (iod_offset through iod_offset + ioaa_count - 1).

   o  IO_ADVISE4_SEQUENTIAL_CACHE: Sequential access to data expected.
      The server should leave data in its cache.

   o  IO_ADVISE4_SEQUENTIAL_DONTCACHE: Sequential access to data
      expected.  The server does not need to leave data in its cache.

   o  IO_ADVISE4_RANDOM: Random access to data expected.

   o  IO_ADVISE4_PREFETCH: Stateid holder expects to access the data
      soon; prefetch data in preparation.

   o  IO_ADVISE4_PREFETCH_OPPORTUNISTIC: Stateid holder expects to
      access the data soon; prefetch if it can be done at a marginal
      cost.

   o  IO_ADVISE4_INTENT_TO_WRITE: Byte range will be written soon so no
      point in caching data.

   o  IO_ADVISE4_RECENTLY_USED: The client has recently accessed the
      byte range in its own cache.  This informs the server that the
      data in the byte range remains important to the client.  When the
      server reaches resource exhaustion, knowing which data is more
      important allows the server to make better choices about which
      data to, for example purge from a cache, or move to secondary
      storage.  It also informs the server which delegations are more
      important, since if delegations are working correctly, once
      delegated to a client, a server might never receive another I/O
      request for the file.

   The results indicate which advice the server intends to follow.  The
   server MUST NOT return an error if it does not recognize or does not
   support the requested advice.  The server MAY return different advice
   than what the client requested.  If it does, then this might be due
   to one of several conditions, including, but not limited to: another
   client advising of a different I/O access pattern; a different I/O
   access pattern from another client that that the server has
   heuristically detected; or the server is not able to support the
   requested I/O access pattern, perhaps due to a temporary resource
   limitation (for example, a request for IO_ADVISE4_SEQUENTIAL_CACHE
   might not be supported because the server cannot afford to cache
   data, and/or cannot afford to queue read-a-head requests).




Eisler & Susairaj        Expires April 16, 2011                 [Page 8]


Internet-Draft              Abbreviated Title               October 2010


4.  Operation XX: READ_WITH_ADVICE - READ with advice

4.1.  ARGUMENT

           enum io_advise_type {
              IO_ADVISE4_SEQUENTIAL_CACHE       = 0,
              IO_ADVISE4_SEQUENTIAL_DONTCACHE       = 1,
              IO_ADVISE4_RANDOM           = 2,
              IO_ADVISE4_PREFETCH         = 3,
              IO_ADVISE4_PREFETCH_OPPORTUNISTIC = 4,
              IO_ADVISE4_INTENT_TO_WRITE = 5,
              IO_ADVISE4_RECENTLY_USED = 6
           };


           struct io_directions {
                   stateid4        iod_stateid;
                   offset4         iod_offset;

                   bitmap4         iod_flags;
           };

      struct READ_WITH_ADVICE4args {
              /* CURRENT_FH: file */
              io_directions rwaa_directions;
              length4       rwaa_count;
      };
























Eisler & Susairaj        Expires April 16, 2011                 [Page 9]


Internet-Draft              Abbreviated Title               October 2010


4.2.  RESULT

      const NFS4_TWO_GB            = 0x80000000;

      typedef opaque twoGB_byte_array4[NFS4_INT32_MAX];

      struct fourGB_buffer4 {
           twoGB_byte_array4[2];
      }

      struct large_buffer4 {
              fourGB_buffer4 lb_big_buffers<>;
              opaque         lb_small_buffer<>;
      }


      struct READ_WITH_ADVICE4resok {
              bool                 rwar_eof;
              bitmap4              rwar_flags;
              large_buffer4        rwar_data;

      };

      union READ_WITH_ADVICE4res switch (nfsstat4 rwar_status) {
       case NFS4_OK:
               READ_WITH_ADVICE4resok  rwar_resok4;
       default:
               void;
      };


4.3.  MOTIVATION

   Under some circumstances, the IO_ADVISE operation is insufficient
   when the client is also performing a READ operation.  Some advice
   needs to be communicated atomically with the READ operation and an
   IO_ADVISE in the same COMPOUND operation as the READ operation would
   fail to provide the necessary advice.  For example, if IO_ADVISE
   proceeded READ, and the server was given advice to not cache the data
   requested by READ, the IO_ADVISE would be too late, because the
   server might already have cached the data.  If IO_ADVISE preceded
   READ, in order to be effective, the advice would have to be
   communicated across two operations in the same COMPOUND.  This would
   complicate the server implementation.







Eisler & Susairaj        Expires April 16, 2011                [Page 10]


Internet-Draft              Abbreviated Title               October 2010


4.4.  DESCRIPTION

   The READ_WITH_ADVICE operation is used read from a file and to advise
   the server as to how the reader of intends to access the file over
   the specified byte range (iod_offset through iod_offset + rwaa_count
   - 1).

   o  IO_ADVISE4_SEQUENTIAL_CACHE: Sequential access to data expected.
      The server should leave data in its cache.

   o  IO_ADVISE4_SEQUENTIAL_DONTCACHE: Sequential access to data
      expected.  The server does not need to leave data in its cache.

   o  IO_ADVISE4_RANDOM: Random access to data expected.

   o  IO_ADVISE4_PREFETCH: Not applicable.

   o  IO_ADVISE4_PREFETCH_OPPORTUNISTIC: Not applicable.

   o  IO_ADVISE4_INTENT_TO_WRITE: Byte range will be written soon so no
      point in caching data.

   o  IO_ADVISE4_RECENTLY_USED: Explicit hint to keep data of byte range
      in cache.

   The results indicate which advice the server intends to follow.  The
   server MUST NOT return an error if it does not recognize or does not
   support the requested advice.

   The intent is that READ_WITH_ADVICE is preferred over READ.  In
   addition to providing I/O hints, READ_WITH_ADVICE uses 64 bit data
   lengths, which anticipates the expected improvements in average
   network speeds and network buffer capacities.  Because the XDR
   standard does not support 64 bit array lengths, the large_buffer4
   data type is introduced to encode an array of zero or more buffers of
   fixed size of 2^32 bytes, followed by a variable length array of up
   to 2^32 - 1 bytes


5.  Operation XX: WRITE_WITH_ADVICE - WRITE with advice











Eisler & Susairaj        Expires April 16, 2011                [Page 11]


Internet-Draft              Abbreviated Title               October 2010


5.1.  ARGUMENT

              enum stable_how4 { /* from NFSv4.0 */
               UNSTABLE4       = 0,
               DATA_SYNC4      = 1,
               FILE_SYNC4      = 2,
               LAYOUT_SYNC4    = 3 /* new */
              };

           enum io_advise_type {
              IO_ADVISE4_SEQUENTIAL_CACHE       = 0,
              IO_ADVISE4_SEQUENTIAL_DONTCACHE       = 1,
              IO_ADVISE4_RANDOM           = 2,
              IO_ADVISE4_PREFETCH         = 3,
              IO_ADVISE4_PREFETCH_OPPORTUNISTIC = 4,
              IO_ADVISE4_INTENT_TO_WRITE = 5,
              IO_ADVISE4_RECENTLY_USED = 6
           };

           struct io_directions {
                   stateid4        iod_stateid;
                   offset4         iod_offset;
                   bitmap4         iod_flags;
           };

      struct WRITE_WITH_ADVICE4args {
              /* CURRENT_FH: file */
              stable_how4     wwaa_stable;
              io_directions   wwaa_directions;
              large_buffer4   wwaa_data<>;
      };




















Eisler & Susairaj        Expires April 16, 2011                [Page 12]


Internet-Draft              Abbreviated Title               October 2010


5.2.  RESULT


      struct WRITE_WITH_ADVICE4resok {
              length4              wwar_count;
              stable_how4          wwar_committed;
              bitmap4              wwar_flags;


      };

      union WRITE_WITH_ADVICE4res switch (nfsstat4 wwar_status) {
       case NFS4_OK:
               WRITE_WITH_ADVICE4resok  wwar_resok4;
       default:
               void;
      };


5.3.  MOTIVATION

   Under some circumstances, the IO_ADVISE operation is insufficient
   when the client is also performing a WRITE operation.  Some advice
   needs to be communicated atomically with the WRITE operation and an
   IO_ADVISE in the same COMPOUND operation as the WRITE operation would
   fail to provide the necessary advice.  For example, if IO_ADVISE
   proceeded WRITE and the server was given advice to not cache the data
   requested by WRITE the IO_ADVISE would be too late, because the
   server might already have cached the data.  If IO_ADVISE preceded
   WRITE in order to be effective, the advice would have to be
   communicated across two operations in the same COMPOUND.  This would
   complicate the server implementation.

   This operation adds a new enumerated value for stable_how4 called
   LAYOUT_SYNC4 in order to reduce the need for LAYOUT_COMMIT
   operations.

5.4.  DESCRIPTION

   The WRITE_WITH_ADVICE operation is used write to a file and to advise
   the server as to how the writer intends to access the file over the
   specified byte range (iod_offset through iod_offset + amount of data
   in wwaa_data - 1).

   o  IO_ADVISE4_SEQUENTIAL_CACHE: Sequential access to data expected.
      The server should leave data in its cache.





Eisler & Susairaj        Expires April 16, 2011                [Page 13]


Internet-Draft              Abbreviated Title               October 2010


   o  IO_ADVISE4_SEQUENTIAL_DONTCACHE: Sequential access to data
      expected.  The server does not need to leave data in its cache.

   o  IO_ADVISE4_RANDOM: Random access to data expected.

   o  IO_ADVISE4_PREFETCH: Not applicable.

   o  IO_ADVISE4_PREFETCH_OPPORTUNISTIC: Not applicable.

   o  IO_ADVISE4_INTENT_TO_WRITE: Byte range will be over-written soon
      so no point in caching data.

   o  IO_ADVISE4_RECENTLY_USED: Explicit hint to keep data of byte range
      in cache.

   The results indicate which advice the server intends to follow.  The
   server MUST NOT return an error if it does not recognize or does not
   support the requested advice.

   The intent is that WRITE_WITH_ADVICE is preferred over WRITE.  In
   addition to providing I/O hints, WRITE_WITH_ADVICE uses 64 bit data
   lengths, which anticipates the expected improvements in average
   network speeds and network buffer capacities.  Because the XDR
   standard does not support 64 bit array lengths, the large_buffer4
   data type is introduced to encode an array of zero or more buffers of
   fixed size of 2^32 bytes, followed by a variable length array of up
   to 2^32 - 1 bytes

   If general, if the value of wwaa_stable is valid, then the value of
   wwar_committed in the reply MUST NOT be less than the value of
   wwaa_stable.  The exception is if the wwaa_stable is LAYOUT_SYNC4.
   LAYOUT_SYNC4 is an enumerated value that can be used by the client
   when the server is an pNFS data server, and the client has a layout
   that covers the byte range specified by iod_offset and the amlount of
   data in wwaa_data.  If the client sends a WRITE_WITH_ADVICE to a data
   server with wwaa_stable set to LAYOUT_SYNC4, then a successful reply
   MUST return value of wwar_committed equal to LAYOUT_SYNC4 or
   FILE_SYNC4.  Regardless what value wwaa_stable is, if the server is a
   pNFS data server, it MAY return a value of wwar_committed equal to
   LAYOUT_SYNC4.  Whenever wwar_committed is LAYOUT_SYNC4, this
   indicates that range of the layout covered by iod_offset and
   wwar_count has been committed to the metadata server, and there is
   not need to send a LAYOUT_COMMIT for that range.


6.  Operation XX: SET_WORKFLOW_TAG - Sets the workflow tag of a given
    session




Eisler & Susairaj        Expires April 16, 2011                [Page 14]


Internet-Draft              Abbreviated Title               October 2010


6.1.  ARGUMENT



      struct SET_WORKFLOW_TAG 4args {

              uint64_t  swta_tag;
      };


6.2.  RESULT


           nfsstat4


6.3.  MOTIVATION

   Enterprise applications require guarantees of quality and/or priority
   of service Providing end-to-end guarantees requires awareness at the
   file services level of the necessary quality and/or priority.

6.4.  DESCRIPTION

   Sets the workflow tag of a given session.  All operations in progress
   before the server receives SET_WORKFLOW_TAG use the previous tag (if
   any).  All operations received after the server receives
   SET_WORKFLOW_TAG use the new tag.


7.  Operation XX: SESSION_CTL - Adjust session parameters




















Eisler & Susairaj        Expires April 16, 2011                [Page 15]


Internet-Draft              Abbreviated Title               October 2010


7.1.  ARGUMENT

      struct channel_attrs4 { /* from NFSv4.1 */
              count4                  ca_headerpadsize;
              count4                  ca_maxrequestsize;
              count4                  ca_maxresponsesize;
              count4                  ca_maxresponsesize_cached;
              count4                  ca_maxoperations;
              count4                  ca_maxrequests;
              uint32_t                ca_rdma_ird&lt1>;
      };

      /* from NFSv4.1 */

      const CREATE_SESSION4_FLAG_PERSIST              = 0x00000001;
      const CREATE_SESSION4_FLAG_CONN_BACK_CHAN       = 0x00000002;
      const CREATE_SESSION4_FLAG_CONN_RDMA            = 0x00000004;

     struct session_ctl4 {
            uint32_t                sc_flags;
            channel_attrs4          sc_fore_chan_attrs;
            channel_attrs4          sc_back_chan_attrs;
     };

     typedef session_ctl SESSION_CTL4args;




7.2.  RESULT


      union SESSION_CTL4res switch (nfsstat4 scr_status) {
      case NFS4_OK:
              session_ctl4  scr_resok4;
      default:
              void;
      };



7.3.  MOTIVATION

   The introduction of the session model in NFSv4.1 imposes an explicit
   limitation on the number of outstanding requests a client can make of
   an NFS server.  In enterprise applications, it is possible each NFS
   request corresponds to a single application request.  Thus, the size
   of the slot table can bound the number of outstanding application



Eisler & Susairaj        Expires April 16, 2011                [Page 16]


Internet-Draft              Abbreviated Title               October 2010


   requests.  While there are workarounds (examples include (1)implement
   a mapping layer between application's request slot list and the
   client's slot table (2) create additional sessions in order to
   preserve a one-to-one mapping between application and client slots),
   these workarounds introduce complexity.  The application's needs for
   more slots are dynamic.  The NFSv4.1 model assumes a dynamic slot
   table, but the size of the slot table is driven by the server via the
   reply to the SEQUENCE operation and the CB_RECALL_SLOT operation.
   What is missing is a method for the client to request a larger slot
   table.

7.4.  DESCRIPTION

   This operation allows the client to request changes to the session's
   parameters.  There are three major fields in the arguments and
   results:

   o  sc_flags.  These flags correspond to the csa_flags and csr_flags
      argument and result of CREATE_SESSION.  In the result, the value
      of a bit in sc_flags MUST be one of:

      *  The corresponding bit in sc_flags of the arguments to
         SESSION_CTL.

      *  The corresponding bit in sc_flags of the result of the previous
         SESSION_CTL that the server executed.

      *  If the server has not executed a previous SESSION_CTL, then the
         corresponding bit in the csr_flags field of the reply the
         CREATE_SESSION operation that created the session.

   o  sc_fore_chan_attrs.  In the arguments of SESSION_CTL, the fields
      within sc_fore_chan_attrs correspond to the fields of the argument
      csa_fore_chan_attrs in the arguments of CREATE_SESSION.  In the
      results of SESSION_CTL, the values fields within
      sc_fore_chan_attrs correspond to the fields of the result
      csr_fore_chan_attrs in the response to CREATE_SESSION.  The values
      of the fields in the result sc_fore_chan_attrs are governed
      according to the same rules that govern the values of the fields
      of csr_fore_chan_attrs.

   o  sc_back_chan_attrs.  In the arguments of SESSION_CTL, the fields
      within sc_back_chan_attrs correspond to the fields of the argument
      csa_back_chan_attrs in the arguments of CREATE_SESSION.  In the
      results of SESSION_CTL, the values fields within
      sc_back_chan_attrs correspond to the fields of the result
      csr_back_chan_attrs in the response to CREATE_SESSION.  The values
      of the fields in the result sc_back_chan_attrs are governed



Eisler & Susairaj        Expires April 16, 2011                [Page 17]


Internet-Draft              Abbreviated Title               October 2010


      according to the same rules that govern the values of the fields
      of csr_back_chan_attrs.

   The SESSION_CTL operation MUST be sent on a COMPOUND operation
   prefixed by a SEQUENCE operation with the sa_slotid argument set to
   zero.  If SESSION_CTL requests a smaller slot table on the fore
   channel, and there are operations in progress on other slots of the
   fore channel, the server MUST do one of (1) return
   NFS4ERR_FORE_CHAN_BUSY (a new error); (2) allow SESSION_CTL to
   succeed, wait for the in progress operations to complete and reply to
   those operations before replying to SESSION_CTL; or (3) if all the in
   progress operations allow the one or both of the errors NFS4ERR_DELAY
   or NFS4ERR_SERVERFAULT, allow SESSION_CTL to succeed, abort the in
   progress operations, reply with to those operations with either
   NFS4ERR_DELAY or NFS4ERR_SERVERFAULT, and then reply to SESSION_CTL.
   Because a server is free to return NFS4ERR_FORE_CHAN_BUSY, it is
   strongly RECOMMENDED that when a client sends a SESSION_CTL operation
   that it have no other requests in progress.

   If SESSION_CTL request a smaller slot table on the backchannel and
   there are operations in progress on other slots of the backchannel,
   the server MUST do one of (1) return NFS4ERR_BACK_CHAN_BUSY; (2)
   allow SESSION_CTL to succeed, and for wait replies to the in progress
   backchannel operations before replying to SESSION_CTL; or (3) if all
   the in progress operations allow the one or both of the errors
   NFS4ERR_DELAY or NFS4ERR_SERVERFAULT, allow SESSION_CTL to succeed,
   abort the in progress operations, reply with to those operations with
   either NFS4ERR_DELAY or NFS4ERR_SERVERFAULT, and then reply to
   SESSION_CTL.  Before a client sends a SESSION_CTL operation, it
   SHOULD reply to all in progress backchannel requests of the same
   session as the SESSION_CTL operation.


8.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

8.1.  ARGUMENT

   /* new */
   const EXCHGID4_FLAG_SUPP_FENCE_OPS     = 0x00000004;


8.2.  RESULT
   Unchanged








Eisler & Susairaj        Expires April 16, 2011                [Page 18]


Internet-Draft              Abbreviated Title               October 2010


8.3.  MOTIVATION

   Enterprise applications require guarantees that an operation has
   either aborted or completed.  NFSv4.1 provides this guarantee as long
   as the session is alive: simply send a SEQUENCE operation on the same
   slot with a new sequence number, and the successful return of
   SEQUENCE indicates the previous operation has completed.  However, if
   the session is lost, there is no way to know when any in progress
   operations have aborted or completed.  In hindsight, the NFSv4.1
   specification should have mandated that DESTROY_SESSION abort/
   complete all outstanding operations.

8.4.  DESCRIPTION

   A client MAY request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability when
   it sends an EXCHANGE_ID operation.  The server MAY set this
   capability in the EXCHANGE_ID request whether the client requests it
   or not.  If the client ID is created with this capability then the
   following will occur:

   o  The server will not reply to DESTROY_SESSION until all operations
      in progress are completed or aborted.

   o  The server will not reply to subsequent EXCHANGE_ID invoked on the
      same Client Owner with a new verifier until all operations in
      progress on the Client ID's session are completed or aborted.

   o  When DESTROY_CLIENTID is invoked, if there are sessions (both idle
      and non-idle), opens, locks, delegations, layouts, and/or wants
      (Section 18.49) associated with the client ID are removed.
      Pending operations will be completed or aborted before the
      sessions, opens, locks, delegations, layouts, and/or wants are
      deleted.

   o  The NFS server SHOULD support client ID trunking, and if it does
      and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
      session ID created on one node of the storage cluster MUST be
      destroyable via DESTROY_SESSION.  In addition, DESTROY_CLIENTID
      and an EXCHANGE_ID with a new verifier affects all sessions
      regardless what node the sessions were created on.


9.  Acknowledgements

   Contributors to this document include: Sumanta Chatterjee, Steve
   Daniel, Mike Eisler, Jeff Kimmel, Akshay Shah, Margaret Susairaj, and
   Lynne Thieme.




Eisler & Susairaj        Expires April 16, 2011                [Page 19]


Internet-Draft              Abbreviated Title               October 2010


10.  IANA Considerations

   The IO_ADVISE4 flags are considered extendable.  Values 32 through 63
   are reserved for private use.  All others are standards track.


11.  Security Considerations

   None.


12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2.  Informative References

   [I-D.eisler-nfsv4-pnfs-dedupe]
              Eisler, M., "Storage De-Duplication Awareness in NFS",
              draft-eisler-nfsv4-pnfs-dedupe-00 (work in progress),
              October 2008.

   [I-D.eisler-nfsv4-pnfs-metastripe]
              Eisler, M., "Metadata Striping for pNFS",
              draft-eisler-nfsv4-pnfs-metastripe-01 (work in progress),
              October 2008.

   [I-D.faibish-nfsv4-pnfs-access-permissions-check]
              Faibish, S., Black, D., Eisler, M., and J. Glasgow, "pNFS
              Access Permissions Check",
              draft-faibish-nfsv4-pnfs-access-permissions-check-03 (work
              in progress), July 2010.

   [I-D.ietf-nfsv4-minorversion1]
              Shepler, S., Eisler, M., and D. Noveck, "NFS Version 4
              Minor Version 1", draft-ietf-nfsv4-minorversion1-29 (work
              in progress), December 2008.

   [I-D.lentini-nfsv4-server-side-copy]
              Lentini, J., Eisler, M., Kenchammana, D., Madan, A., and
              R. Iyer, "NFS Server-side Copy",
              draft-lentini-nfsv4-server-side-copy-05 (work in
              progress), July 2010.

   [I-D.myklebust-nfsv4-pnfs-backend]



Eisler & Susairaj        Expires April 16, 2011                [Page 20]


Internet-Draft              Abbreviated Title               October 2010


              Myklebust, T., "Network File System (NFS) version 4 pNFS
              back end protocol extensions",
              draft-myklebust-nfsv4-pnfs-backend-00 (work in progress),
              July 2009.

   [I-D.quigley-nfsv4-sec-label]
              Quigley, D. and J. Morris, "MAC Security Label Support for
              NFSv4", draft-quigley-nfsv4-sec-label-01 (work in
              progress), February 2010.

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File System
              (NFS) version 4 Protocol", RFC 3530, April 2003.


Authors' Addresses

   Michael Eisler (editor)
   NetApp
   5765 Chase Point Circle
   Colorado Springs, CO  80919
   US

   Phone: +1 719 599 9026
   Email: mike@eisler.com


   Margaret Susairaj (editor)
   Oracle
   7806 Garden Bend
   Sugar Land, TX  77479
   US

   Phone: +1 408 431 7405
   Email: Margaret.Susairaj@oracle.com
















Eisler & Susairaj        Expires April 16, 2011                [Page 21]