NFSv4 Working Group D. Hildebrand
Internet Draft M. Eshel
Intended status: Standards Track IBM Almaden
Expires: March 2011 September 29, 2010
Simple and Efficient Read Support for Sparse Files
draft-hildebrand-nfsv4-read-sparse-01.txt
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on March 29, 2011.
Hildebrand, et al. Expires March 29, 2011 [Page 1]
Internet-Draft Read Support for Sparse Files September 2010
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the BSD License.
Abstract
This document extends the NFSv4.1 protocol to support efficient
reading of sparse files. The number of sparse files is growing in
the data center, most notably due to the increasing number of virtual
disk images. This simple extension provides an easy and efficient
way for administrators to copy and manage these files without wasting
disk space or transferring data unnecessarily.
Table of Contents
1. Introduction
1.1. Requirements Language
2. Terminology
3. Applications and Sparse Files
4. Overview of Sparse Files and NFSv4
5. Definition of Sparse Reads with NFS
5.1. Definition of the READ4res
5.2. Definition of the READ4reshole
5.3. Sparse Files and pNFS
5.4. Example
6. Related Work
7. Security Considerations
8. IANA Considerations
9. References
9.1. Normative References
9.2. Informative References
10. Acknowledgments
1. Introduction
NFS is now used in many data centers as the sole or primary method of
data access. Consequently, more types of applications are using NFS
than ever before, each with their own requirements and generated
workloads. As part of this, sparse files are increasing in number
while NFS continues to lack any specific knowledge of a sparse file's
layout. This document extends the NFSv4.1 protocol to support
efficient reading of sparse files.
A sparse file is a common way of representing a large file without
having to pre-allocate data for it. Consequently, a sparse file uses
fewer blocks than its size indicates. This means the file contains
'holes', byte ranges within the file that contain no data. Most
modern file systems support sparse files, including most UNIX file
systems and NTFS, but notably not Apple's HFS+. Common examples of
sparse files include VM OS/disk images, database files, log files,
and even checkpoint recovery files most commonly used by the HPC
community.
If an application reads 'holes' in a sparse file, the file system
converts empty blocks into "real" blocks filled with zeros, and
returns them to the application. For local data access there is
little penalty, but with NFS these zeroes must be transferred back to
the client. If an application uses the NFS client to read data into
memory, this wastes time and bandwidth as the application waits for
the zeroes to be transferred. Once the zeroes arrive, they then
steal memory or cache space from real data. To make matters worse,
if an application then proceeds to write data to another file system,
the zeroes are written into the file, expanding the sparse file into a
full-sized regular file. Beyond wasting disk space, this can
actually prevent large sparse files from ever being copied to another
storage location due to space limitations.
This document adds a new return value to the READ operation that lets
the server avoid returning the contents of holes in sparse files and
tells the client the location of the next valid data block. This
solution is intentionally very simple and does not build on
complicated and optional features such as pNFS. The intent is for
sparse file reads to be supported by the widest possible range of
client implementations.
The XDR description is provided in this document in a way that makes
it simple for the reader to extract into a ready-to-compile form.
The reader can feed this document into the following shell script to
produce the machine-readable XDR description of the sparse read
extension:
#!/bin/sh
grep "^ *///" | sed 's?^ */// ??' | sed 's?^.*///??'
That is, if the above script is stored in a file called "extract.sh",
and this document is in a file called "spec.txt", then the reader can do:
sh extract.sh < spec.txt > md.x
The effect of the script is to remove leading white space from each
line of the specification, plus a sentinel sequence of "///".
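For readers without a POSIX shell, the same extraction can be
sketched in Python. This is an illustrative equivalent of the script
above, not part of the specification; the function name extract_xdr
is hypothetical.

```python
import re

def extract_xdr(text):
    """Mimic: grep "^ *///" | sed 's?^ */// ??' | sed 's?^.*///??'"""
    out = []
    for line in text.splitlines():
        # grep: keep only lines that start with optional spaces and "///"
        if not re.match(r" *///", line):
            continue
        # first sed: drop leading spaces, the sentinel, and one space
        line = re.sub(r"^ */// ", "", line, count=1)
        # second sed: drop a bare sentinel (used for blank XDR lines)
        line = re.sub(r"^.*///", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")

sample = (
    "   /// struct READ4reshole {\n"
    "   This prose line is discarded.\n"
    "   ///     offset4 data_offset;\n"
)
print(extract_xdr(sample), end="")
```

Only lines carrying the "///" sentinel survive, so surrounding prose
never reaches the generated .x file.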
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].
2. Terminology
o Regular file: An object of file type NF4REG or NF4NAMEDATTR.
o Sparse file: A regular file whose size is greater than the space
occupied by the blocks allocated for the file.
o Hole: A byte range within a sparse file that is either unallocated
or contains only zeroes.
3. Applications and Sparse Files
Applications may cause an NFS client to read empty blocks in a file
for several reasons. This section describes three different
application workloads that cause the NFS client to transfer data
unnecessarily. These workloads are simply examples, and there are
probably many more workloads that are negatively impacted by sparse
files.
The first workload that can cause empty blocks to be read is
sequential reads within a sparse file. When this happens, the NFS
client may perform read requests ("readahead") into sections of the
file not explicitly requested by the application. Since the NFS
client cannot differentiate between allocated and unallocated
sections, the NFS client may prefetch empty sections of the file.
This workload is exemplified by Virtual Machines and their associated
file system images, e.g., VMware .vmdk files, which are large sparse
files encapsulating an entire operating system. If a VM reads files
within the file system image, this will translate to sequential NFS
read requests into the much larger file system image file. Since NFS
does not understand the internals of the file system image, it ends
up performing readahead into unallocated sections. Note that it is
also common for several VMs on different NFS clients to share a
single file system image file, which exacerbates the problem by
resending empty blocks to multiple clients.
The second workload is generated by copying a file from a directory
in NFS to the same NFS server, to another file system (e.g., another
NFS or Samba server, or a local ext3 file system), or even to a
network socket. In this case, bandwidth and server resources are
wasted as the entire file, including both allocated and unallocated
blocks, is transferred from the NFS server to the NFS client. Once
a data block has been transferred to the client, it is up to the
client application (e.g., rsync, cp, scp) how it writes the data
to the target location. For example, cp supports sparse files and
will not write zero-filled blocks, whereas scp does not support
sparse files and will transfer every data block.
The third workload is generated by applications that do not utilize
the NFS client cache, but instead use direct I/O and manage cached
data independently, e.g., databases. These applications may perform
whole file caching with sparse files, which would mean that even the
unallocated sections will be transferred to the clients and cached.
4. Overview of Sparse Files and NFSv4
This proposal seeks to provide sparse file support to the largest
number of NFS client and server implementations, and as such proposes
to add a new return code to the mandatory NFSv4.1 READ operation
instead of proposing additions or extensions of new or existing
optional features (such as pNFS).
As well, this document seeks to ensure that the proposed extensions
are simple and do not transfer data between the client and server
unnecessarily. For example, one possible way to implement sparse file
read support would be to have the client, on the first hole
encountered or at OPEN time, request a block layout map from the
server. While this option seems simple, it can become inefficient
and cumbersome. First, large block layout maps can be returned from
the server, which can reduce overall READ performance. For example,
VMware's .vmdk files use 64KB blocks and can have a file size of over
100 GB. This means that the number of allocated (or unallocated)
blocks in the file can grow very large in the worst-case
scenario. In addition, this large block layout map may need to be
transferred multiple times with each update to the file. For
example, a VM that updates a config file in its file system image
would invalidate the block layout map not only for itself, but for
all other clients accessing the same file system image.
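As a rough illustration of the scale involved, the following sketch
works through the arithmetic for the .vmdk example. It assumes binary
units (1 KB = 1024 bytes) and a hypothetical map encoding of one
(offset4, length4) pair, 16 bytes in XDR, per allocated extent; the
encoding is an assumption made for this estimate and is not defined
by this document.

```python
KB = 1024
GB = 1024 ** 3

def block_count(file_size, block_size):
    """Number of blocks needed to cover file_size bytes (rounded up)."""
    return (file_size + block_size - 1) // block_size

blocks = block_count(100 * GB, 64 * KB)

# Worst case for an extent list: allocated and unallocated blocks
# alternate, so every second block is its own one-block extent.
worst_case_extents = blocks // 2

# Hypothetical encoding: offset4 + length4, 8 bytes each in XDR.
map_bytes = worst_case_extents * 16

print(blocks, worst_case_extents, map_bytes)
```

Under these assumptions a single 100 GB image has over 1.6 million
blocks, and a fully fragmented map would occupy roughly 13 MB, all of
which would have to be resent whenever the map is invalidated.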
Another way to handle holes is compression, but this is not ideal since
it requires all implementations to agree on a single compression
algorithm and requires a fair amount of computational overhead.
Note that supporting writing to a sparse file does not require
changes to the protocol. Applications and/or NFS implementations can
choose to ignore WRITE requests of all zeroes to the NFS server
without consequence.
5. Definition of Sparse Reads with NFS
The following sections detail changes to the READ operation in the
NFSv4.1 specification [3] to allow NFS clients to avoid reading holes
in a file.
Our proposal is very simple: if a client READ request would return
all zeroes from a file hole, the server does not waste computational
and network resources by sending the zeroes back to the client.
Instead, the server returns a new return value and result structure
that tells the client that the READ result is all zeroes and gives the
offset of the next non-zero segment of data. Sending the location of
the next valid data block, and only upon request, avoids transferring
large block layout maps that may soon be invalidated and avoids
sending large amounts of information about a file that may not even be
read in its entirety.
5.1. Definition of the READ4res
/// union READ4res switch (nfsstat4 status) {
/// case NFS4_OK:
/// READ4resok resok4;
/// case NFS4ERR_HOLE:
/// READ4reshole reshole4;
/// default:
/// void;
/// };
If status is NFS4ERR_HOLE, then the entire byte range of the read
request is in a hole, and can be assumed to be zero. Information
regarding the location of the next non-hole, or allocated block, in
the file is contained in reshole4.
5.2. Definition of the READ4reshole
/// struct READ4reshole {
/// offset4 data_offset;
/// length4 data_length;
/// };
If a READ request is into a hole, a READ4reshole structure is
returned. The READ4reshole structure is considered valid until the
file is changed (detected via the change attribute). If the first
part of the READ request is into a section of the file that has non-
zero data, and the rest of the request is all zeros, the server
should return a short read.
The values of the fields are as follows:
o data_offset, which is the offset of the next region of allocated
data in the file.
o data_length, which is the length of the non-zero data segment at
data_offset. If data_length is not zero, then the byte range from
data_offset to data_offset + data_length is allocated and does not
contain a hole. If data_length is zero, then either the server
has no further information regarding holes in the remainder of the
file or it can be assumed that all remaining bytes in the file are
allocated and contain no holes. Either way, the client can ignore
the information in READ4reshole.
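One way a client might act on these results is sketched below. The
Python classes mirror the XDR structures in this section, but the
names Status and interpret_read are hypothetical, the enum values are
placeholders (this document does not assign a number to
NFS4ERR_HOLE), and READ4resok is reduced to the two fields needed
here.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Status(Enum):
    NFS4_OK = auto()
    NFS4ERR_HOLE = auto()  # placeholder value, not a protocol number

@dataclass
class READ4resok:
    eof: bool
    data: bytes

@dataclass
class READ4reshole:
    data_offset: int
    data_length: int

def interpret_read(status, res, offset, count):
    """Return (data, next_offset); data is None when the range is a hole."""
    if status is Status.NFS4_OK:
        # Possibly a short read: continue right after the bytes received.
        return res.data, offset + len(res.data)
    if status is Status.NFS4ERR_HOLE:
        if res.data_length != 0:
            # Skip straight to the next allocated region.
            return None, res.data_offset
        # data_length == 0: ignore READ4reshole and keep reading in order.
        return None, offset + count
    raise OSError(f"READ failed with status {status.name}")

print(interpret_read(Status.NFS4ERR_HOLE, READ4reshole(262144, 32768),
                     32768, 65536))
```

Returning None for a hole lets the caller decide whether to
materialize zeroes or to seek past the range when writing a copy.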
5.3. Sparse Files and pNFS
With pNFS, the semantics of NFS4ERR_HOLE remain the same. Any data
server can return an NFS4ERR_HOLE result for a READ request that it
receives. In addition, when a data server is returning a
READ4reshole structure, it should still contain the offset and length
of the next allocated block in the file, even if that block is not
located on that particular data server.
When a pNFS client receives an NFS4ERR_HOLE result and a READ4reshole
structure with a non-zero data_length, it uses this information in
conjunction with a valid layout for the file to determine the next
data server for the next allocated block of data.
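The mapping from data_offset to a data server depends entirely on the
layout the client holds. As a minimal sketch, assume a hypothetical
layout that stripes the file round-robin across num_servers data
servers in fixed stripe units; real pNFS layouts [3] can be more
involved, and the function names below are illustrative only.

```python
def data_server_index(offset, stripe_unit, num_servers):
    """Index of the data server holding offset under round-robin striping."""
    return (offset // stripe_unit) % num_servers

def next_read_target(data_offset, data_length, stripe_unit, num_servers):
    """Return (server index, offset, count) for the next READ, or None.

    None means data_length was zero, so READ4reshole carries no usable
    information about the next allocated block.
    """
    if data_length == 0:
        return None
    server = data_server_index(data_offset, stripe_unit, num_servers)
    # Do not read past the end of the stripe unit held by that server.
    unit_end = (data_offset // stripe_unit + 1) * stripe_unit
    count = min(data_length, unit_end - data_offset)
    return server, data_offset, count

# Next data at byte 262144, 32768 bytes long; 64 KiB units, 4 servers.
print(next_read_target(262144, 32768, 65536, 4))
```

The READ that discovered the hole may have gone to a different data
server than the one selected here, which is why the offset and length
must describe the file as a whole.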
5.4. Example
To see how NFS4ERR_HOLE will work, the following table describes a
sparse file. For each byte range, the file contains either non-zero
data or all zero data.
+-------------+-----------+
| Byte-Range | Contents |
+-------------+-----------+
| 0-31999 | Non-Zero |
| 32K-255999 | Zero |
| 256K-287999 | Non-Zero |
| 288K-353999 | Zero |
| 354K-417999 | Non-Zero |
+-------------+-----------+
Under the given circumstances, if a client were to read the file from
beginning to end with a maximum read size of 64K, the following would
be the result. This assumes the client has already opened the file and
acquired a valid stateid and just needs to issue READ requests.
1. READ(s, 0, 64K) --> NFS4_OK, eof = false, data<>[32K]. Return a
short read, as the last half of the request was all zeroes.
2. READ(s, 32K, 64K) --> NFS4ERR_HOLE, READ4reshole(256K, 32K). The
requested range was all zeroes, and the next chunk is located at
offset 256K and is 32K in length.
3. READ(s, 256K, 32K) --> NFS4_OK.
4. READ(s, 288K, 64K) --> NFS4ERR_HOLE, READ4reshole(354K, 64K). The
client has no information regarding this range, so it issues the
read request and learns that the next chunk is located at offset
354K and is 64K in length.
5. READ(s, 354K, 64K) --> NFS4_OK, eof = true.
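The exchange above can be reproduced with a small simulation. This is
a sketch under stated assumptions, not a reference implementation: K
is taken to be 1000 bytes so that "32K" directly follows byte 31999
in the table, the function names are hypothetical, and the handling
of a request that starts in a hole but reaches allocated data is an
assumption, since the example does not exercise that case.

```python
K = 1000  # assumption: matches the table's 0-31999 / 32K boundaries

# (start, length) of each non-zero segment in the example file
SEGMENTS = [(0, 32 * K), (256 * K, 32 * K), (354 * K, 64 * K)]
FILE_SIZE = 418 * K

def server_read(offset, count):
    """Return ("NFS4_OK", eof, nbytes) or ("NFS4ERR_HOLE", off, length)."""
    if offset >= FILE_SIZE:
        return ("NFS4_OK", True, 0)
    count = min(count, FILE_SIZE - offset)
    for start, length in SEGMENTS:
        end = start + length
        if start <= offset < end:
            n = min(count, end - offset)  # short read at a hole boundary
            return ("NFS4_OK", offset + n == FILE_SIZE, n)
        if offset < start:
            if offset + count <= start:   # whole request lies in the hole
                return ("NFS4ERR_HOLE", start, length)
            # Assumption: return only the zeroes that precede the data.
            return ("NFS4_OK", False, start - offset)
    return ("NFS4ERR_HOLE", 0, 0)  # trailing hole; not reached for this file

def client_read_all(max_read=64 * K):
    """Read the file front to back and return the list of exchanges."""
    trace, offset, count = [], 0, max_read
    while True:
        result = server_read(offset, count)
        trace.append((offset, count) + result)
        if result[0] == "NFS4ERR_HOLE":
            _, data_offset, data_length = result
            if data_length != 0:
                offset, count = data_offset, min(max_read, data_length)
            else:
                offset, count = offset + count, max_read
        else:
            _, eof, nbytes = result
            if eof:
                return trace
            offset, count = offset + nbytes, max_read

for step in client_read_all():
    print(step)
```

In this simulation the client moves 128K of payload in five READ
operations and never receives the 290K of zeroes that make up the two
holes.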
6. Related Work
Solaris and ZFS support an extension to lseek(2) that allows
applications to discover holes in a file. The values, SEEK_HOLE and
SEEK_DATA, allow clients to seek to the next hole or beginning of
data, respectively.
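For comparison with the READ extension, the following sketch walks a
local file with these lseek(2) values. It assumes a platform where
Python exposes os.SEEK_DATA and os.SEEK_HOLE (for example, Linux);
the function name data_segments is hypothetical. File systems without
hole support typically report the whole file as a single data region.

```python
import errno
import os

def data_segments(path):
    """Yield (offset, length) for each data region the file system reports."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # no data at or after offset
                    break
                raise
            # The end of file counts as a hole, so this always succeeds.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield (start, end - start)
            offset = end
    finally:
        os.close(fd)
```

Each yielded pair plays the same role locally that data_offset and
data_length play in READ4reshole.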
XFS supports the XFS_IOC_GETBMAP ioctl, which returns the allocation
information for a file. Clients can then use this information to
read only allocated data blocks.
NTFS and CIFS support the FSCTL_SET_SPARSE control code, which allows
applications to control whether empty regions of the file are
preallocated and filled in with zeroes or simply left unallocated.
7. Security Considerations
The additions to the NFS protocol for supporting sparse file reads
do not alter the security considerations of the NFSv4.1 protocol
[3].
8. IANA Considerations
There are no IANA considerations in this document. All NFSv4.1 IANA
considerations are covered in [3].
9. References
9.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[2] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003.
[3] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January
2010.
9.2. Informative References
[4] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010.
[5] Nowicki, B., "NFS: Network File System Protocol specification",
RFC 1094, March 1989.
[6] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
Protocol Specification", RFC 1813, June 1995.
10. Acknowledgments
This document was prepared using 2-Word-v2.0.template.dot. Valuable
input and advice were received from Sorin Faibish, Benny Halevy, Trond
Myklebust, and Richard Scheffenegger.
Authors' Addresses
Dean Hildebrand
IBM Almaden
650 Harry Rd
San Jose, CA 95120
Phone: +1 408-927-2013
Email: dhildeb@us.ibm.com
Marc Eshel
IBM Almaden
650 Harry Rd
San Jose, CA 95120
Phone: +1 408-927-1894
Email: eshel@almaden.ibm.com