NFSv4                                                          M. Eisler
Internet-Draft                                                    NetApp
Intended status: Standards Track                        October 27, 2008
Expires: April 30, 2009


                Storage De-Duplication Awareness in NFS
               draft-eisler-nfsv4-pnfs-metastripe-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 30, 2009.

Abstract

   This Internet-Draft describes a means to add awareness of de-
   duplication storage to NFS.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].







Eisler                   Expires April 30, 2009                 [Page 1]


Internet-Draft   Storage De-Duplication Awareness in NFS    October 2008


Table of Contents

   1.  Introduction and Motivation  . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Scope of De-Duplication  . . . . . . . . . . . . . . . . . . .  5
   4.  Overview of De-Duplication via pNFS  . . . . . . . . . . . . .  5
   5.  The Definition of De-Duplication Layouts . . . . . . . . . . .  5
     5.1.  Name of De-Duplication Striping Layout Type  . . . . . . .  5
     5.2.  Value of De-Duplication Striping Layout Type . . . . . . .  6
     5.3.  Definition of the da_addr_body Field of the
           device_addr4 Data Type . . . . . . . . . . . . . . . . . .  6
     5.4.  Definition of the loh_body Field of the layouthint4
           Data Type  . . . . . . . . . . . . . . . . . . . . . . . .  7
     5.5.  Definition of the loc_body Field of the
           layout_content4 Data Type  . . . . . . . . . . . . . . . .  8
     5.6.  Definition of the lou_body Field of the layoutupdate4
           Data Type  . . . . . . . . . . . . . . . . . . . . . . . . 20
     5.7.  Storage Access Protocols . . . . . . . . . . . . . . . . . 20
     5.8.  Revocation of Layouts  . . . . . . . . . . . . . . . . . . 20
     5.9.  Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 21
       5.9.1.  Failure and Restart of Client  . . . . . . . . . . . . 21
       5.9.2.  Failure and Restart of Server  . . . . . . . . . . . . 21
       5.9.3.  Failure and Restart of Storage Device  . . . . . . . . 21
   6.  Negotiation  . . . . . . . . . . . . . . . . . . . . . . . . . 21
   7.  Operational Recommendation for Deployment  . . . . . . . . . . 21
   8.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 21
   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 21
   10. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 22
   11. Normative References . . . . . . . . . . . . . . . . . . . . . 22
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 22
   Intellectual Property and Copyright Statements . . . . . . . . . . 24


1.  Introduction and Motivation

   De-duplication is an emerging trend in data storage.  De-duplication
   means that two files that have common content derive that content
   from a common location on the same storage device.  As a result, the
   total storage used is less than the sum of the lengths of the files.
   A hole in a file represents a primitive form of de-duplication.
   De-duplication is also called folding.

   De-duplication is accomplished in several ways, including:

   o  Hierarchical de-duplication, where one file is derived from
      another, usually by one file starting off as a copy of another,
      but zero, or nearly zero, bytes of data are actually copied or
      moved.  Instead, the two files share common blocks of data
      storage.  An example is a snapshot, where a snapshot is made of a
      file system, such that the snapshot and active file system are
      equal at the time the snapshot is taken, and share the same data
      storage, and thus are effectively copies that involve zero or
      near zero movement of data.  As the source file system changes,
      the number of shared blocks of data storage decreases.  A
      variation of this is a writable snapshot (aka clone) which is
      taken of a file system.  In this variation, as the source and
      cloned file systems each change, there are fewer shared blocks.

   o  In-line de-duplication, where a storage access protocol initiator
      (e.g., an NFS client) creates content via write operations, and
      the target of the storage access protocol checks if the content
      being written is duplicated somewhere else on the target's
      storage.  If so, the data is not written, but instead the logical
      content refers to the duplicate.

   o  Background de-duplication, where a background task on the storage
      access protocol target scans for duplicate blocks, and frees all
      but one of the duplicates, mapping the pointers to the now free
      blocks to the remaining duplicate.

   The use of de-duplicated storage does not require changes to the NFS
   protocol.  However, if the NFS client is caching content from an NFS
   server that provides access to de-duplicated files, then without
   changes to the protocol, the client will use resources such as
   memory and network bandwidth inefficiently.  For example, suppose
   two files of length 1024 bytes are identical and are de-duplicated.
   The client reads and caches the first file.  A process on the client
   then requests to read the second file.  If the client were aware
   that the second file was a duplicate of the first, it would not have
   to read the second file over the network, nor would it have to cache
   the second file separately.  A classic use case is
   hypervisors, which switch between multiple guest operating systems on
   a single physical computer.  If each of these guest operating systems
   were cloned from a single source, or if each guest was installed from
   the same operating system installation image, then much of the data
   of each guest might be highly de-duplicated.  De-duplication
   awareness is consistent with the typical reasons for deploying a
   hypervisor: reducing costs by reducing utilization of memory,
   computer cycles, and network.

   This document describes a method by which NFSv4.1 clients can be
   aware of de-duplicated storage.  This document does not require a new
   minor version of NFSv4.  Instead, it requires several new layout
   types, and thus uses the pNFS protocol [2].

   The XDR description is provided in this document in a way that makes
   it simple for the reader to extract into a ready-to-compile form.
   The reader can feed this document into the following shell script to
   produce the machine-readable XDR description of the de-duplication
   layout:

   #!/bin/sh
   grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'


   I.e. if the above script is stored in a file called "extract.sh", and
   this document is in a file called "spec.txt", then the reader can do:

    sh extract.sh < spec.txt > dd.x

   The effect of the script is to remove leading white space from each
   line of the specification, plus a sentinel sequence of "///".
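
   For readers who prefer not to depend on grep and sed, the same
   extraction can be sketched in Python.  This is an illustration only
   and not part of the specification; the function name extract_xdr is
   hypothetical, and the three regular expressions simply mirror the
   three stages of the shell pipeline above.

```python
import re


def extract_xdr(lines):
    """Return the XDR text carried by lines marked with the /// sentinel."""
    out = []
    for line in lines:
        # Stage 1 (grep): keep only lines that start with spaces then "///".
        if not re.match(r"^ +///", line):
            continue
        # Stage 2 (first sed): drop leading spaces, the sentinel, two spaces.
        line = re.sub(r"^ +///  ", "", line, count=1)
        # Stage 3 (second sed): drop anything up to a remaining sentinel,
        # which handles marked lines that carry no content.
        line = re.sub(r"^.*///", "", line, count=1)
        out.append(line)
    return out


if __name__ == "__main__":
    import sys
    for xdr_line in extract_xdr(sys.stdin.read().splitlines()):
        print(xdr_line)
```

   Saved under a name of the reader's choosing, the sketch reads the
   document on standard input and writes the XDR description to
   standard output, just as extract.sh does.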


2.  Terminology

   o  Source file, the file that contains the de-duplicated data.

   o  Target file, the file the client has opened.

   o  Block, the smallest unit of de-duplication that the server is
      willing to support.

   o  Slab, a byte range that refers to lists of other byte ranges that
      contain de-duplicated data (either in whole, or part).  A slab can
      refer to a list of smaller slabs, or a list of blocks.

   o  Regular file: An object of file type NF4REG or NF4NAMEDATTR.


3.  Scope of De-Duplication

   This document only de-duplicates the data contents of regular files.
   Everything else is considered metadata, and de-duplication of
   metadata is not considered in this document.  [[Comment.1: Some
   metadata, including the contents of directories and symbolic links,
   as well as attributes (e.g., ACLs), are practical to de-duplicate.  A
   future revision of this document might address de-duplication of
   metadata.]]


4.  Overview of De-Duplication via pNFS

   Providing awareness of de-duplication to clients needs to be
   practical.  If the data structures the server provides the client are
   not compact, or require expensive processing, then de-duplication
   awareness is not practical.  The approach presented in this document
   uses leaf bitmaps to indicate whether a byte range of a file has been
   de-duplicated, and if so from what offset of what file.  Since the
   granularity of de-duplication will vary by implementation, and by
   file, the NFS server has the option of providing indirect bitmaps
   that refer to bitmaps of finer grained byte ranges.  An indirect
   bitmap can refer to another indirect bitmap or a leaf bitmap.

   As noted in Section 1, de-duplication can be the result of
   hierarchical, in-line, or background processes.  The approach to
   providing awareness of de-duplication allows servers to optimize for
   any approach.

   NFSv4.1 introduces pNFS, which allows clients to access data from
   multiple storage devices.  This means that the NFS server is
   distributed across a set of nodes on a network.  Such a server might
   be capable of de-duplication among the server's nodes.  The de-
   duplication awareness feature will allow servers to present awareness
   of cross-node de-duplication to NFS clients.


5.  The Definition of De-Duplication Layouts

5.1.  Name of De-Duplication Striping Layout Type

   There are multiple de-duplication layout types, in order to support
   multiple levels of indirection plus a leaf level.  Since the maximum
   sized file in pNFS is 2^64 - 1 bytes, a total of 63 levels of
   indirection are provided.  If for some reason a server needs more
   levels of indirection, the server will use a layout type from the
   private use range, 0x80000000 to 0xFFFFFFFF.

   The name of the top-level de-duplication layout type is
   LAYOUT4_DEDUP_TOP.  The names of the remaining de-duplication layout
   types are LAYOUT4_DEDUP_LEVEL_<xx>, where <xx> is a two-digit decimal
   number that ranges between 02 and 64.  The server MUST NOT return
   LAYOUT4_DEDUP_LEVEL_<xx> in the response to a GETATTR request for the
   fs_layout_type attribute.
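
   The naming scheme and the private use range can be enumerated with a
   few lines of Python.  This is a sketch for illustration; the helper
   names are hypothetical, and the numeric values of the layout types
   are deliberately absent because they are not yet assigned.

```python
# Private use range for layout types, as quoted in the text above.
PRIVATE_USE_FIRST = 0x80000000
PRIVATE_USE_LAST = 0xFFFFFFFF


def dedup_layout_type_names():
    """List the top-level name followed by the 63 level names (02..64)."""
    names = ["LAYOUT4_DEDUP_TOP"]
    names += ["LAYOUT4_DEDUP_LEVEL_%02d" % level for level in range(2, 65)]
    return names


def is_private_use(layout_type):
    """True if a numeric layout type falls in the private use range."""
    return PRIVATE_USE_FIRST <= layout_type <= PRIVATE_USE_LAST


def may_appear_in_fs_layout_type(name):
    """Level types MUST NOT be returned for the fs_layout_type attribute."""
    return not name.startswith("LAYOUT4_DEDUP_LEVEL_")
```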

5.2.  Value of De-Duplication Striping Layout Type

   The value of LAYOUT4_DEDUP_TOP is TBD1.  The values of
   LAYOUT4_DEDUP_LEVEL_<xx> are TBD02 through TBD64.

5.3.  Definition of the da_addr_body Field of the device_addr4 Data Type

   ///  %#include "nfs4_prot.h"
   ///
   ///  %/* Encoded in the da_addr_body field. */
   ///
   ///  union dd_layout_addr switch (bool ddla_simple) {
   ///    case TRUE:
   ///      multipath_list4 ddla_simple_addr;
   ///    case FALSE:
   ///      layouttype4     ddla_complex_addr;
   ///  };


                                 Figure 1

   The device address is only used in leaf layouts, and even then, only
   when cross server-node de-duplication is in effect.  There are two
   types of device addresses, a simple network address, with zero or
   more alternate addresses for multipathing, or a complex address which
   is the value of another layout type.  The value of
   ddla_complex_addr MUST NOT be LAYOUT4_DEDUP_TOP or any of
   LAYOUT4_DEDUP_LEVEL_<xx>.


5.4.  Definition of the loh_body Field of the layouthint4 Data Type

   ///  enum dd_layout_hint_care4 {
   ///
   ///         DD4_CARE_STRIPE_UNIT_SIZE    = 0x040,
   ///         DD4_CARE_STRIPE_UNIT_ALIGN   = 0x100
   ///  };
   ///  %
   ///  %/* Encoded in the loh_body field of type layouthint4: */
   ///  %
   ///  struct dd_layouthint4 {
   ///         uint32_t       ddlh_care;
   ///         length4        ddlh_stripe_unit_size;
   ///         length4        ddlh_stripe_unit_align;
   ///  };

                                 Figure 2

   The layout-type specific content for the LAYOUT4_DEDUP_TOP layout
   type is composed of three fields.  The first field, ddlh_care, is a
   set of flags indicating which values of the hint the client cares
   about.  If DD4_CARE_STRIPE_UNIT_SIZE is set, then the client
   indicates in the second field, ddlh_stripe_unit_size, the preferred
   unit of granularity for de-duplication, in bytes.  If
   DD4_CARE_STRIPE_UNIT_ALIGN is set, then the client indicates in the
   third field, ddlh_stripe_unit_align, the preferred minimum alignment
   of de-duplicated units.  For example, if the client specifies
   ddlh_stripe_unit_size as 1024, and ddlh_stripe_unit_align as 128,
   then if two files have in common a string of bytes that is 1024
   bytes long, and the string is at offset zero in the first file, and
   offset 1024 + 128 = 1152 in the second file, then the client would
   like the server to de-duplicate the common 1024 byte string.  Note
   that the leaf layouts returned by the server are unable to indicate
   byte ranges that are not a whole multiple of the unit size the
   server uses, so if the server accepts a layout hint with
   ddlh_stripe_unit_align less than ddlh_stripe_unit_size, it will
   report units that are equal to ddlh_stripe_unit_align.  If the client
   specifies a value in ddlh_stripe_unit_align that is greater than the
   value of ddlh_stripe_unit_size, the server will ignore the
   ddlh_stripe_unit_align hint.
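
   The interaction between the two hint values described for
   dd_layouthint4 can be sketched as follows.  This is a minimal
   illustration in Python, assuming a server that accepts the hint; the
   function name reported_unit is hypothetical, and the handling of a
   zero alignment and of a hint with no size preference are assumptions
   that this document does not specify.

```python
# Flag values copied from the dd_layout_hint_care4 enum (Figure 2).
DD4_CARE_STRIPE_UNIT_SIZE = 0x040
DD4_CARE_STRIPE_UNIT_ALIGN = 0x100


def reported_unit(ddlh_care, ddlh_stripe_unit_size, ddlh_stripe_unit_align):
    """Unit size an accepting server would report, or None.

    None means the client expressed no size preference (assumption:
    the server then picks its own unit).
    """
    if not ddlh_care & DD4_CARE_STRIPE_UNIT_SIZE:
        return None
    # An alignment smaller than the unit size forces units equal to the
    # alignment.  Treating zero as "no alignment" is an assumption.
    if (ddlh_care & DD4_CARE_STRIPE_UNIT_ALIGN
            and 0 < ddlh_stripe_unit_align < ddlh_stripe_unit_size):
        return ddlh_stripe_unit_align
    # An alignment greater than the unit size is ignored.
    return ddlh_stripe_unit_size
```

   With the values from the example in the text (a size of 1024 and an
   alignment of 128, both flags set), the sketch yields a unit of 128,
   so the 1024 byte string at offset 1152 is covered by eight whole
   units.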


5.5.  Definition of the loc_body Field of the layout_content4 Data Type

   ///  %/*
   ///  % * How the bits of each element
   ///  % * of ddll_blockmap are split up
   ///  % */
   ///  const DDLL4_BLKMAP_MASK_ACTIVE      = 0x8000000000000000;
   ///
   ///  %/* The remaining bits follow DDLL4_BITS_* */
   ///  const DDLL4_BLKMAP_MASK_PARTITIONED = 0x7FFFFFFFFFFFFFFF;
   ///
   ///  %/* These constants index into ddll_blockmap_partition */
   ///  const DDLL4_BITS_FOR_DEVID_IDX   = 0;
   ///  const DDLL4_BITS_FOR_FH_IDX      = 1;
   ///  const DDLL4_BITS_FOR_BLK_NUM_IDX = 2;
   ///
   ///  struct dd_layout_leaf4 {
   ///    length4   ddll_block_size;
   ///
   ///  % /* ddll_blockmap_partition[0-2] MUST add up to 63 */
   ///
   ///    opaque    ddll_blockmap_partition[4];
   ///    verifier4 ddll_fhsuffix;
   ///    nfs_fh4   ddll_fhlist<>;
   ///    uint64_t  ddll_change_attr<>;
   ///    deviceid4 ddll_devlist<>;
   ///    uint64_t  ddll_blockmap<>;
   ///  };
   ///
   ///  struct dd_layout_indirect4 {
   ///    length4     ddli_slab_size;
   ///    layouttype4 ddli_next_level;
   ///    bitmap4     ddli_bitmap;
   ///  };
   ///
   ///  union dd_layout4_u switch (bool ddl_is_leaf) {
   ///    case TRUE:
   ///      dd_layout_leaf4     ddl_leaf;
   ///    case FALSE:
   ///      dd_layout_indirect4 ddl_indirect;
   ///  };
   ///  struct dd_layout4 {
   ///    offset4      ddl_firstoff;
   ///    offset4      ddl_lastoff;
   ///    dd_layout4_u ddl_u;
   ///  };


                                 Figure 3

   The first two fields further bound the layout.

   o  ddl_firstoff, the first offset in the file that the layout has de-
      duplication information for.  The relationship between the
      lo_offset field of the layout4 data type that envelops the de-
      duplication layout and ddl_firstoff is that ddl_firstoff MUST be
      greater than or equal to lo_offset.  If ddl_firstoff is not equal
      to lo_offset, then this means that the byte range from lo_offset
      through ddl_firstoff - 1 inclusive either has not been de-
      duplicated or the server has decided to not provide the
      information.  The value of the field ddl_firstoff MUST be a whole
      multiple of ddli_slab_size or ddll_block_size.

   o  ddl_lastoff, the last offset in the file that the layout has
      de-duplication information for.  Field ddl_lastoff MUST be
      greater than or equal to ddl_firstoff.  Field ddl_lastoff MUST be
      less than or equal to lo_offset + lo_length - 1.  If ddl_lastoff
      is less than lo_offset + lo_length - 1, then this means that the
      byte range from offset ddl_lastoff + 1 through lo_offset +
      lo_length - 1 inclusive either has not been de-duplicated or the
      server has decided to not provide the information.  The value of
      ddl_lastoff + 1 MUST be a whole multiple of ddli_slab_size or
      ddll_block_size, even if this means ddl_lastoff goes beyond the
      end of file.
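
   The constraints on ddl_firstoff and ddl_lastoff lend themselves to a
   simple client-side sanity check.  The following Python sketch is
   illustrative only; the function names are hypothetical, and "unit"
   stands for whichever of ddli_slab_size or ddll_block_size applies to
   the layout being checked.

```python
def check_dedup_bounds(lo_offset, lo_length, ddl_firstoff, ddl_lastoff,
                       unit):
    """Return a list describing every violated constraint (empty if OK)."""
    problems = []
    if ddl_firstoff < lo_offset:
        problems.append("ddl_firstoff is less than lo_offset")
    if ddl_lastoff < ddl_firstoff:
        problems.append("ddl_lastoff is less than ddl_firstoff")
    if ddl_lastoff > lo_offset + lo_length - 1:
        problems.append("ddl_lastoff is beyond the layout segment")
    if ddl_firstoff % unit != 0:
        problems.append("ddl_firstoff is not a multiple of the unit size")
    if (ddl_lastoff + 1) % unit != 0:
        problems.append("ddl_lastoff + 1 is not a multiple of the unit size")
    return problems


def uncovered_ranges(lo_offset, lo_length, ddl_firstoff, ddl_lastoff):
    """Inclusive (first, last) byte ranges of the segment that carry no
    de-duplication information."""
    gaps = []
    if ddl_firstoff > lo_offset:
        gaps.append((lo_offset, ddl_firstoff - 1))
    lo_last = lo_offset + lo_length - 1
    if ddl_lastoff < lo_last:
        gaps.append((ddl_lastoff + 1, lo_last))
    return gaps
```

   For a layout segment covering bytes 0 through 8191 with ddl_firstoff
   of 1024, ddl_lastoff of 4095, and a unit of 1024, the check passes
   and the two uncovered ranges are the bytes before 1024 and the bytes
   from 4096 onward.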

   The remainder of the de-duplication layout is either a leaf layout or
   an indirect layout.

   An indirect layout consists of:

   o  ddli_slab_size is the length, in bytes, of each slab represented
      by the ddli_bitmap bitmap array.

   o  ddli_next_level is the layout type the NFS client MUST use when
      using LAYOUTGET to get finer grained de-duplication information
      about the de-duplication of one or more slabs.  This field SHOULD
      be one of LAYOUT4_DEDUP_LEVEL_<xx>.  The use of ddli_next_level
      provides a hint to the server for what slab or block size to use
      on the next level of de-duplication.

   o  ddli_bitmap is a bitmap.  If bit N is set in ddli_bitmap, then
      this means that slab N has de-duplicated content.  Each bit
      represents a byte range (a slab) of size ddli_slab_size, such that
      ddl_firstoff is the start of the first slab (slab zero, relative
      to ddl_firstoff), and ddl_lastoff is the end of the last slab.
      Slab N represents the byte range ddl_firstoff + N * ddli_slab_size
      to ddl_firstoff + (N + 1) * ddli_slab_size - 1, inclusive.  The
      field ddli_bitmap is an array of elements, each a 32-bit unsigned
      integer.  The number of elements in ddli_bitmap MUST be greater
      than or equal to ((((ddl_lastoff - ddl_firstoff) + 1) /
      ddli_slab_size) / 32) rounded up to the next whole number.
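
   The slab arithmetic for an indirect layout can be sketched in Python
   as follows.  This is an illustration under one stated assumption:
   bit N is taken to live in element N / 32 of ddli_bitmap at bit
   position N mod 32, counting from the least significant bit, which is
   the usual bitmap4 convention.  The function names are hypothetical.

```python
def slab_byte_range(ddl_firstoff, ddli_slab_size, n):
    """Inclusive (first, last) byte offsets covered by slab n."""
    first = ddl_firstoff + n * ddli_slab_size
    return first, first + ddli_slab_size - 1


def slab_is_deduplicated(ddli_bitmap, n):
    """True if bit n is set; a missing element counts as not set."""
    word, bit = divmod(n, 32)
    return word < len(ddli_bitmap) and bool((ddli_bitmap[word] >> bit) & 1)


def min_bitmap_words(ddl_firstoff, ddl_lastoff, ddli_slab_size):
    """Smallest permitted number of 32-bit elements in ddli_bitmap."""
    slabs = (ddl_lastoff - ddl_firstoff + 1) // ddli_slab_size
    return -(-slabs // 32)  # ceiling division
```

   For example, with ddl_firstoff of 4096 and ddli_slab_size of 1024,
   slab 2 covers bytes 6144 through 7167 inclusive.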

   A leaf layout consists of:

   o  ddll_block_size is the length, in bytes, of each block
      represented by the ddll_blockmap array.

   o  ddll_blockmap_partition is an array of bytes, the first three of
      which are inspected by the client.  This array indicates how each
      element of ddll_blockmap is partitioned.

   o  ddll_fhlist is an array of zero or more filehandles.  Each element
      of ddll_blockmap can correspond to a filehandle in ddll_fhlist.
      Each filehandle represents a source file that has a de-duplicated
      block that it shares with the target file.  If the array is of
      zero length, then the source file for all de-duplicated blocks is
      the target file.

   o  ddll_fhsuffix MUST be appended to each filehandle in ddll_fhlist
      that the client uses for READ or LAYOUTGET operations.  This
      allows the server to detect if the client is using an invalid
      layout.

   o  ddll_change_attr is an array of zero or more change attributes.
      If the array is not zero in length, then each element corresponds
      to the element in ddll_fhlist with the same position in the array.
      I.e., ddll_change_attr[i] is the change attribute for the source
      file identified by ddll_fhlist[i].  If the array is of zero
      length, then the server is promising to recall the byte ranges of
      the layout that refer to each file identified in the ddll_fhlist
      array.  If the ddll_fhlist array is of zero length, and the
      ddll_change_attr array has one element, then ddll_change_attr[0]
      is the change attribute for the source file, which also happens to
      be the target file.

   o  ddll_devlist is an array of zero or more device IDs, for the
      purpose of enabling cross-node de-duplication.  Each element of
      ddll_blockmap can correspond to a device ID in ddll_devlist.  Each
      device ID represents a device that has a source file with a de-
      duplicated block.  The device ID is always for a LAYOUT4_DEDUP_TOP
      device, and can either map to a network address of an MDS, or a
      non-de-duplication layout type.  The device ID will map to an MDS
      network address if the source file has not been striped.
      Otherwise, the device ID will be the layout type used for striping
      the file.  By providing the layout type, the client does not have
      to send a GETATTR request on the source file for fs_layout_type
      attribute.

   o  ddll_blockmap is an array of elements, each a 64-bit unsigned
      integer.  Each element corresponds to a block of size
      ddll_block_size.  E.g., the first element, ddll_blockmap[0]
      corresponds to the byte range, ddl_firstoff through ddl_firstoff +
      ddll_block_size - 1 inclusive.

      *  If ddll_blockmap[i] & DDLL4_BLKMAP_MASK_ACTIVE is non-zero,
         then this element corresponds to a block that is de-duplicated.
         Otherwise, the element does not correspond to a de-duplicated
         block, and the rest of the element is undefined.

      *  The mask ddll_blockmap[i] & DDLL4_BLKMAP_MASK_PARTITIONED
         represents a bit field that is partitioned according to the
         content of ddll_blockmap_partition.

         The element ddll_blockmap_partition[DDLL4_BITS_FOR_DEVID_IDX]
         indicates how many bits at the start of the bit field are for
         indexing into the ddll_devlist array.  The number of elements
         in ddll_devlist MUST be less than or equal to
         2^ddll_blockmap_partition[DDLL4_BITS_FOR_DEVID_IDX].  If
         ddll_blockmap_partition[DDLL4_BITS_FOR_DEVID_IDX] is zero, then
         this means that the blocks of the source file come from the
         same MDS as the target file.

         The element ddll_blockmap_partition[DDLL4_BITS_FOR_FH_IDX]
         indicates how many bits in the middle of the bit field are for
         indexing into the ddll_fhlist array.  The number of elements in
         ddll_fhlist MUST be less than or equal to
         2^ddll_blockmap_partition[DDLL4_BITS_FOR_FH_IDX].  If
         ddll_blockmap_partition[DDLL4_BITS_FOR_FH_IDX] is zero, this
         means that the source file is the same as the target file in
         every element of ddll_blockmap.

         The element ddll_blockmap_partition[DDLL4_BITS_FOR_BLK_NUM_IDX]
         indicates how many bits at the end of the bit field correspond
         to an absolute block number into the source file.  The absolute
         offset is calculated by computing the product of
         ddll_block_size and the absolute block number.  If
         ddll_blockmap_partition[DDLL4_BITS_FOR_BLK_NUM_IDX] is zero,
         then this means the absolute block number of the source is the
         same as the absolute block number of the target.

         The dynamic partitioning of the ddll_blockmap element allows
         for several optimizations.  If the de-duplication in the range
         identified by the layout is due to hierarchical de-duplication,
         then there is no need for a block number, so
         ddll_blockmap_partition[DDLL4_BITS_FOR_BLK_NUM_IDX] will be
         zero.  If there is no cross-node de-duplication in the range
         then ddll_blockmap_partition[DDLL4_BITS_FOR_DEVID_IDX] will be
         zero.  If all the de-duplication in the range is confined to
         the target file, i.e. the duplicate blocks were only in the
         target file and no other file, then
         ddll_blockmap_partition[DDLL4_BITS_FOR_FH_IDX] will be zero.
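
   The partitioning of a ddll_blockmap element can be made concrete
   with a short decoder.  The Python sketch below is illustrative only;
   the function names are hypothetical.  It reads "start of the bit
   field" as the most significant bits of the 63-bit partitioned field,
   so the device index comes first, the filehandle index second, and
   the block number occupies the least significant bits.

```python
# Constants copied from the XDR description (Figure 3).
DDLL4_BLKMAP_MASK_ACTIVE = 0x8000000000000000
DDLL4_BLKMAP_MASK_PARTITIONED = 0x7FFFFFFFFFFFFFFF
DDLL4_BITS_FOR_DEVID_IDX = 0
DDLL4_BITS_FOR_FH_IDX = 1
DDLL4_BITS_FOR_BLK_NUM_IDX = 2


def decode_blockmap_element(element, partition):
    """Split one element into (dev_idx, fh_idx, blk_num).

    Returns None when the block is not de-duplicated, in which case
    the rest of the element is undefined.
    """
    dev_bits = partition[DDLL4_BITS_FOR_DEVID_IDX]
    fh_bits = partition[DDLL4_BITS_FOR_FH_IDX]
    blk_bits = partition[DDLL4_BITS_FOR_BLK_NUM_IDX]
    if dev_bits + fh_bits + blk_bits != 63:
        raise ValueError("partition widths must add up to 63")
    if not element & DDLL4_BLKMAP_MASK_ACTIVE:
        return None
    field = element & DDLL4_BLKMAP_MASK_PARTITIONED
    blk_num = field & ((1 << blk_bits) - 1)
    fh_idx = (field >> blk_bits) & ((1 << fh_bits) - 1)
    dev_idx = (field >> (blk_bits + fh_bits)) & ((1 << dev_bits) - 1)
    return dev_idx, fh_idx, blk_num


def source_offset(blk_num, blk_bits, target_block, ddll_block_size):
    """Absolute byte offset of the block in the source file.

    With zero block-number bits, the source block number is the same
    as the target block number.
    """
    block = target_block if blk_bits == 0 else blk_num
    return block * ddll_block_size
```

   A partition of 4, 11, and 48 bits, for instance, allows up to 16
   devices and 2048 filehandles to be indexed, and leaves 48 bits for
   the absolute block number.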

   An outline for an algorithm for processing a read() system call when
   the potential for de-duplicated data exists follows.  This algorithm
   illustrates how the layout is interpreted.  In this algorithm, we
   assume that the client always starts with a layout that spans the
   entire file.


   /*
    * Returns a vector called "result" of elements
    * containing key / value pairs of ((offset,
    * length), (status, source_mds, source_fh,
    * source_offset)).
    */

   dedupe_read(read_offset, read_length, target_fh,
       layout4 logr_layout[]) {

     if (number of elements in logr_layout == zero)
     {
       result[(read_offset, read_length)] =
           NO_DEDUP_AVAILABLE;

       return result;
     }

     for i from the end of logr_layout to start {
       if (logr_layout[i].lo_offset > read_offset) {
         continue;
       }

       /* check for range split across segments */
       if (logr_layout[i].lo_offset +
           logr_layout[i].lo_length <
           read_offset + read_length) {

         read_offset_A = read_offset;
         read_length_A = logr_layout[i].lo_offset +
           logr_layout[i].lo_length - read_offset;
         read_offset_B = logr_layout[i+1].lo_offset;
         read_length_B = read_length -
           read_length_A;

         result[(read_offset_A, read_length_A)] =
           dedupe_read(read_offset_A, read_length_A,
           target_fh, logr_layout);

         result[(read_offset_B, read_length_B)] =
           dedupe_read(read_offset_B, read_length_B,
           target_fh, logr_layout);

         return result;
       }

       last_offset = read_offset + read_length - 1;

       if (read_offset > ddl_lastoff) {
         result[(read_offset, read_length)] =
           NO_DEDUP_AVAILABLE;
         return result;
       }

       if (last_offset > ddl_lastoff) {
         /* we cannot de-dupe the entire range */

         result[(ddl_lastoff + 1, last_offset -
           ddl_lastoff)] = NO_DEDUP_AVAILABLE;
         last_offset = ddl_lastoff;
       }
       if (read_offset < ddl_firstoff) {
         /* we cannot de-dupe the entire range */

         result[(read_offset, ddl_firstoff -
           read_offset)] = NO_DEDUP_AVAILABLE;
         read_offset = ddl_firstoff;
       }

       if (ddl_is_leaf == FALSE) {
         /* indirect layout */

         let trunc_read_off = read_offset truncated
           to next lowest multiple of
           ddli_slab_size;

         let round_last_off = (last_offset rounded
           to next highest multiple of
           ddli_slab_size) - 1;

         first_bit = (trunc_read_off - ddl_firstoff) /
           ddli_slab_size;
         last_bit = (last_offset - ddl_firstoff) /
           ddli_slab_size;

         for (j = first_bit; j <= last_bit; j++)
         {
           k = j / 32;
           l = j mod 32;
           bit = 1 << l;

           if (j == first_bit) {
             read_offset_A = read_offset;
             read_length_A = trunc_read_off +
               ddli_slab_size - read_offset;

           } else {
             read_offset_A = ddl_firstoff + (j *
               ddli_slab_size);
             read_length_A = ddli_slab_size;
           }

           if ((ddli_bitmap[k] & bit) != 0) {
             next_layout_off = ddl_firstoff + (j *
               ddli_slab_size);

             next_layout_length = ddli_slab_size;
             next_layout_type = ddli_next_level;

             if (client does not have layout for
                 (next_layout_off,
                 next_layout_length, and
                 next_layout_type)) {

                send a LAYOUTGET request;
             }
             let logr_layout_A = logr_layout array
                 of layout for (next_layout_off,
                 next_layout_length,
                 next_layout_type);

             result[(read_offset_A, read_length_A)]
               = dedupe_read(read_offset_A,
               read_length_A, target_fh,
               logr_layout_A);

           } else {
             result[(read_offset_A, read_length_A)]
               = NO_DEDUP_AVAILABLE;

           }
         }
       } else {
         /* process a leaf layout */

         let trunc_read_off = read_offset truncated
           to next lowest multiple of
           ddll_block_size;

         let round_last_off = (last_offset rounded
           to next highest multiple of
           ddll_block_size) - 1;

         bits_for_blknum = ddll_blockmap_partition
           [DDLL4_BITS_FOR_BLK_NUM_IDX];

         mask_for_blknum = 0;
         for (j = 0; j < bits_for_blknum; j++) {
           mask_for_blknum = (mask_for_blknum
             << 1) | 1;
         }

         bits_for_fh = ddll_blockmap_partition
           [DDLL4_BITS_FOR_FH_IDX];

         mask_for_fh = 0;
         for (j = 0; j < bits_for_fh; j++) {
           mask_for_fh = (mask_for_fh <<
             1) | 1;
         }

         mask_for_fh = mask_for_fh <<
           bits_for_blknum;

         bits_for_dev = ddll_blockmap_partition
           [DDLL4_BITS_FOR_DEVID_IDX];

         mask_for_dev = 0;
         for (j = 0; j < bits_for_dev; j++) {
           mask_for_dev = (mask_for_dev << 1)
             | 1;
         }
         mask_for_dev = mask_for_dev <<
           (bits_for_blknum + bits_for_fh);

         if ((bits_for_blknum + bits_for_fh +
             bits_for_dev) != 63) {

           result[(read_offset, read_length)] =
             CORRUPT_LAYOUT;

           return result;
         }

         first_block = trunc_read_off /
           ddll_block_size;
         last_block = (round_last_off + 1) /
           ddll_block_size;
         slopoff = read_offset - trunc_read_off;
         sloplen = round_last_off - last_offset;

         read_offset_A = trunc_read_off;

         for (j = first_block; j <= last_block;
             j++, read_offset_A += ddll_block_size) {

           if (ddll_blockmap[j] &
               DDLL4_BLKMAP_MASK_ACTIVE) {

             blockmap = ddll_blockmap[j] &
               DDLL4_BLKMAP_MASK_PARTITIONED;

             source_length = ddll_block_size;
             source_change = 0;
             source_dev = 0;

             if (mask_for_blknum == 0) {
               source_offset = ddl_firstoff + j *
                 ddll_block_size;
             } else {
               source_offset = (blockmap &
                 mask_for_blknum) * ddll_block_size;
             }

             if (j == first_block) {
               source_offset += slopoff;
               source_length -= slopoff;
               read_offset_B = read_offset;
             } else {
               read_offset_B = read_offset_A;
             }

             if (j == last_block) {
               source_length -= sloplen;
             }

             if (mask_for_fh == 0) {
               source_fh = target_fh;
               if (number of elements in
                   ddll_change_attr > 0) {
                 source_change = ddll_change_attr[0];
               }
             } else {
               fhidx = (blockmap & mask_for_fh) >>
                 bits_for_blknum;
               source_fh = ddll_fhlist[fhidx];
               if (number of elements in
                   ddll_change_attr > 0) {
                 source_change =
                   ddll_change_attr[fhidx];
               }
             }
             read_source_fh = source_fh concatenated
               with ddll_fhsuffix;
             source_ltype = 0;
             source_mds = MDS of target_fh;
             if (mask_for_dev != 0) {
               devidx = (blockmap & mask_for_dev) >>
                 (bits_for_blknum + bits_for_fh);
               source_dev = ddll_devlist[devidx];

               if (client does not have device
                   address for source_dev) {
                 send a GETDEVICEINFO
                   (LAYOUT4_DEDUP_TOP, source_dev);
               }

               if (ddla_simple from GETDEVICEINFO is
                   TRUE) {
                 let source_mds be an element of
                   ddla_simple_addr;
               } else {
                 source_ltype = ddldp_ltype;

                 if (client does not have layout for
                     (source_mds, source_fh,
                     source_ltype, source_offset,
                     source_length)) {

                   send a LAYOUTGET request for
                     (read_source_fh, source_ltype,
                     source_dev, source_offset,
                     source_length) to target_fh's
                     MDS;

                   cache LAYOUTGET result;
                 }

                 if (client still does not have
                     layout for (source_mds, source_fh,
                     source_ltype, source_offset,
                     source_length)) {
                   source_ltype = 0;
                 } else {
                   let source_layout = the layout
                     from cache;
                 }
               }
             }

             if (source_change == 0 || client has
                 delegation on source_fh) {

               if ({source_fh, source_mds,
                   source_offset, source_length} in
                   cache) {

                 result[(read_offset_B,
                   source_length)] =
                   (SATISFY_READ_FROM_CACHE,
                   source_mds, source_fh,
                   source_offset);

               } else {
                 if (source_ltype == 0) {
                   if (read_source_fh not yet open)
                   {
                     send an OPEN request for
                       read_source_fh;
                   }
                   send a { PUTFH read_source_fh,
                     READ source_offset,
                     source_length } request to
                     source_mds;

                   enter results in cache;

                 } else {
                   read from read_source_fh,
                     source_offset, source_length
                     according to source_layout;

                   enter results in cache;
                 }

                 result[(read_offset_B,
                   source_length)] =
                   (SATISFY_READ_FROM_CACHE,
                   source_mds, source_fh,
                   source_offset);

               }
             } else {
               if ({source_mds, source_fh,
                   source_offset, source_length} in
                   cache) {

                 send a { PUTFH source_fh, GETATTR
                   change } request to source_mds;

                 if (change attribute ==
                     source_change) {

                   result[(read_offset_B,
                     source_length)] =
                     (SATISFY_READ_FROM_CACHE,
                     source_mds, source_fh,
                     source_offset);

                 } else {
                   result[(read_offset_B,
                     source_length)] =
                     (STALE_DEDUP_LAYOUT,
                     source_mds, source_fh,
                     source_offset);

                 }
               }
             }
           }
         }
       }
       return result;
     }

     /* should never get here */
     result[(read_offset, read_length)] =
       CORRUPT_LAYOUT;

     return result;
   }

                                 Figure 4
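
   The field extraction in the leaf branch of Figure 4 can be sketched
   in Python.  This is a minimal illustration, not part of the
   protocol: it assumes a 64-bit block map entry whose top bit is the
   active flag and whose low 63 bits are partitioned, from least to
   most significant, into block number, filehandle index, and device
   index.  The constant values and the names low_bits_mask and
   decode_entry are assumptions made for the example.

```python
# Assumed values: the top bit marks an active entry and the low 63 bits
# carry the partitioned fields. The draft's real constants may differ.
BLKMAP_MASK_ACTIVE = 1 << 63
BLKMAP_MASK_PARTITIONED = (1 << 63) - 1


def low_bits_mask(nbits):
    """Return a mask with the low nbits set (0 when nbits is 0)."""
    return (1 << nbits) - 1


def decode_entry(entry, bits_for_blknum, bits_for_fh, bits_for_dev):
    """Split one block map entry into (blknum, fhidx, devidx).

    Returns None for an inactive entry. Raises ValueError when the three
    widths do not add up to 63, mirroring the CORRUPT_LAYOUT check.
    """
    if bits_for_blknum + bits_for_fh + bits_for_dev != 63:
        raise ValueError("partition widths must sum to 63")
    if not entry & BLKMAP_MASK_ACTIVE:
        return None
    blockmap = entry & BLKMAP_MASK_PARTITIONED
    blknum = blockmap & low_bits_mask(bits_for_blknum)
    fhidx = (blockmap >> bits_for_blknum) & low_bits_mask(bits_for_fh)
    shift_dev = bits_for_blknum + bits_for_fh
    devidx = (blockmap >> shift_dev) & low_bits_mask(bits_for_dev)
    return blknum, fhidx, devidx


# 40 bits of block number, 8 bits of filehandle index, 15 of device index.
entry = BLKMAP_MASK_ACTIVE | (5 << 48) | (3 << 40) | 1234
print(decode_entry(entry, 40, 8, 15))  # prints (1234, 3, 5)
```

   A field of width zero always decodes to zero here; Figure 4 treats
   a zero mask as "field absent" and falls back to the target file's
   own offset, filehandle, or metadata server.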

   There is a trade-off between the resources (space and time) used for
   providing de-duplication layouts (especially leaf layouts) and the
   resources used for redundant caching of de-duplicated storage.  For
   example, if a client has to descend through 52 levels of caching to
   avoid caching a single 4096-byte block twice, then it is not
   cost-effective for the server to return a layout.  On the other
   hand, if 99% of a file is
   using de-duplicated storage, then having a complete block map for a
   one gigabyte file, or at least the parts of the file the client wants
   to cache, is more effective than redundantly caching nearly one
   gigabyte of storage.
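
   The arithmetic behind the one-gigabyte case can be made concrete.
   The sketch below is illustrative only.  It assumes 8-octet block
   map entries (63 partitioned bits plus one flag bit, which is an
   assumption about the encoding), reuses the 4096-byte block size
   from the example above, and reads "one gigabyte" as 2^30 bytes.
   The helper name blockmap_octets is hypothetical.

```python
def blockmap_octets(file_size, block_size, entry_octets=8):
    """Octets needed for a block map with one entry per block."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * entry_octets


ONE_GIB = 1 << 30
map_octets = blockmap_octets(ONE_GIB, 4096)
print(map_octets)             # 2097152, i.e. 2 MiB
print(ONE_GIB // map_octets)  # 512
```

   Under these assumptions a complete leaf block map costs about 0.2%
   of the file size, which is small next to redundantly caching nearly
   one gigabyte.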

5.6.  Definition of the lou_body Field of the layoutupdate4 Data Type

   ///  %/*
   ///  % * LAYOUT4_DEDUP_TOP or any of LAYOUT4_DEDUP_LEVEL_<xx>.
   ///  % * Encoded in the lou_body field of type layoutupdate4:
   ///  % *      Nothing. lou_body is a zero length array of octets.
   ///  % */
   ///  %

                                 Figure 5

   The LAYOUT4_DEDUP_TOP and LAYOUT4_DEDUP_LEVEL_<xx> layout types have
   no content for the lou_body field of the layoutupdate4 data type.

5.7.  Storage Access Protocols

   The LAYOUT4_DEDUP_TOP and LAYOUT4_DEDUP_LEVEL_<xx> layout types use
   NFSv4.1 operations (and potentially, operations of higher minor
   versions of NFSv4, subject to the definition of a minor version of
   NFSv4) to access de-duplicated data.  The de-duplication layout types
   do not affect access to storage devices.  Thus a client might be able
   to obtain both a de-duplication layout type and a non-de-duplication
   layout type (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or
   LAYOUT4_BLOCK_VOLUME) on the same regular file.

5.8.  Revocation of Layouts

   Servers MAY revoke de-duplication layouts.  A client using a
   de-duplication layout SHOULD check whether the change attribute of
   the source file has changed.  The use of ddll_fhsuffix prevents
   clients that hold revoked de-duplication layouts from using
   potentially stale information.  Once a layout has been revoked,
   attempts to use filehandles with the value of ddll_fhsuffix appended
   will result in NFS4ERR_STALE.
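
   A client-side view of these two checks might look like the
   following sketch.  It mirrors the change-attribute test in Figure 4
   and treats NFS4ERR_STALE on a suffixed filehandle as a sign that
   the layout is no longer valid.  The function names, the dictionary
   used as a layout cache, and the returned strings are hypothetical.

```python
NFS4_OK = 0
NFS4ERR_STALE = 70  # value from the NFSv4 nfsstat4 enumeration


def cached_block_usable(layout_change, current_change, has_delegation):
    """Decide whether cached de-duplicated data may satisfy a READ.

    layout_change is the change attribute recorded in the layout
    (0 meaning none was supplied); current_change is the value from a
    fresh GETATTR on the source file.
    """
    if layout_change == 0 or has_delegation:
        return True
    return current_change == layout_change


def handle_read_status(status, layout_cache, layout_key):
    """Drop a cached layout when a suffixed filehandle is rejected."""
    if status == NFS4ERR_STALE:
        layout_cache.pop(layout_key, None)
        return "LAYOUT_REVOKED"
    return "OK" if status == NFS4_OK else "ERROR"
```

   After a "LAYOUT_REVOKED" result the client would fall back to
   reading the target file directly or request a fresh layout.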





5.9.  Recovery

   [[Comment.2: it is likely this section will follow that of the files
   layout type specified in the NFSv4.1 specification.]]

5.9.1.  Failure and Restart of Client

   TBD

5.9.2.  Failure and Restart of Server

   TBD

5.9.3.  Failure and Restart of Storage Device

   TBD


6.  Negotiation

   A pNFS client sends a GETATTR request for the fs_layout_type
   attribute to see if the LAYOUT4_DEDUP_TOP layout type is supported.
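
   As a rough sketch, the client-side test reduces to a membership
   check on the list of layout types returned in fs_layout_type.  The
   value of LAYOUT4_DEDUP_TOP is still TBD1, so the example takes it
   as a parameter; the number 0x7001 below is a placeholder chosen for
   illustration, not an assigned value, and the function name is
   hypothetical.

```python
LAYOUT4_NFSV4_1_FILES = 0x1  # existing NFSv4.1 layout type


def dedup_supported(fs_layout_types, dedup_top_value):
    """True when the server lists the de-duplication top layout type.

    fs_layout_types: layout type numbers from the fs_layout_type
    attribute returned by GETATTR.
    dedup_top_value: the number eventually assigned to
    LAYOUT4_DEDUP_TOP (TBD1 in this document).
    """
    return dedup_top_value in set(fs_layout_types)


PLACEHOLDER_DEDUP_TOP = 0x7001  # hypothetical, not an assigned value
print(dedup_supported([LAYOUT4_NFSV4_1_FILES, PLACEHOLDER_DEDUP_TOP],
                      PLACEHOLDER_DEDUP_TOP))  # prints True
```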


7.  Operational Recommendation for Deployment

   Deploy the de-duplication layouts when a significant fraction of
   data storage is de-duplicated.


8.  Acknowledgements

   Thanks to Pranoop Erasani, Arthur Lent, and Dave Noveck for
   validating the strategy described in this document.


9.  Security Considerations

   If an ACCESS operation by the principal on the source file would
   fail, then the server has to take care when processing requests for
   de-duplication layouts of the target file.  If the server is unable
   to perform access control at the granularity of a byte range, then
   the server MUST NOT allow the principal to read the source file.  A
   related concern is that if the server can provide per-byte-range
   access, then the server will need to allow an OPEN operation of the
   source file by the principal.  The server will need to reject READ
   operations for the non-de-duplicated data.  The reader should adjust
   the algorithm in Figure 4 accordingly.
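
   One way a server with per-byte-range access control could reject
   READ operations outside the de-duplicated data is sketched below.
   This is only an illustration under assumed data structures: the
   list of (start, length) pairs and the function name are
   hypothetical, and a real server would derive the ranges from its
   own de-duplication metadata.

```python
def read_permitted(offset, length, dedup_ranges):
    """True when [offset, offset + length) lies inside one shared range.

    dedup_ranges: (start, length) pairs of the source file that back
    de-duplicated blocks of a target file the principal may read.
    """
    if length <= 0:
        return False
    end = offset + length
    return any(start <= offset and end <= start + size
               for start, size in dedup_ranges)


shared = [(4096, 8192)]  # source bytes 4096..12287 are shared
print(read_permitted(4096, 4096, shared))  # prints True
print(read_permitted(0, 4096, shared))     # prints False
```

   The check does not merge adjacent ranges, so a caller would need to
   coalesce them first if a single READ may span several shared
   ranges.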



10.  IANA Considerations

   This specification requires 64 additions to the Layout Types registry
   described in Section 22.4 of [2].  Each added entry has five fields.
   The first entry is:

   1.  Name of layout type: LAYOUT4_DEDUP_TOP.

   2.  Value of layout type: TBD1.

   3.  Standards Track RFC that describes this layout: RFCTBD65, which
       is the RFC of this document.

   4.  How the RFC Introduces the specification: L.

   5.  Minor versions of NFSv4 that can use the layout type: 1.

   The second through 64th additions to the Layout Types registry each
   have the following form, where <xx> is a decimal number between 02
   and 64, inclusive:

   1.  Name of layout type: LAYOUT4_DEDUP_LEVEL_<xx>.

   2.  Value of layout type: TBD_<xx>.

   3.  Standards Track RFC that describes this layout: RFCTBD65, which
       is the RFC of this document.

   4.  How the RFC Introduces the specification: L.

   5.  Minor versions of NFSv4 that can use the layout type: 1.
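
   The full set of requested names can be enumerated mechanically.
   The sketch below only checks the count and the two-digit naming
   pattern described above; the function name is hypothetical, and no
   registry values are implied.

```python
def dedup_layout_type_names():
    """List the 64 layout type names requested from IANA."""
    names = ["LAYOUT4_DEDUP_TOP"]
    names += [f"LAYOUT4_DEDUP_LEVEL_{n:02d}" for n in range(2, 65)]
    return names


names = dedup_layout_type_names()
print(len(names))           # prints 64
print(names[1], names[-1])
# prints LAYOUT4_DEDUP_LEVEL_02 LAYOUT4_DEDUP_LEVEL_64
```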


11.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.

   [2]  Shepler, S., Eisler, M., and D. Noveck, "NFS Version 4 Minor
        Version 1", draft-ietf-nfsv4-minorversion1-26 (work in
        progress), Sep 2008.


Author's Address

   Mike Eisler
   NetApp
   5765 Chase Point Circle
   Colorado Springs, CO  80919
   US

   Phone: +1-719-599-9026
   Email: mike@eisler.com


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.