Internet Draft                                              Garth Gibson
Expires: August 2004                                  Panasas Inc. & CMU
                                                           Peter Corbett
                                                 Network Appliance, Inc.

Document: draft-gibson-pnfs-problem-statement-00.txt       February 2004




                          pNFS Problem Statement


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
        http://www.ietf.org/shadow.html.





Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.











Gibson et al            Expires - August 2004                [Page 1]


Internet Draft          pNFS Problem Statement           February 2004



Abstract

   This draft considers the problem of limited bandwidth to NFS servers.
   The bandwidth limitation exists because an NFS server has limited
   network, CPU, memory and disk I/O resources.  Yet, access to any one
   file system through the NFSv4 protocol requires that a single server
   be accessed.  While NFSv4 allows file system migration, it does not
   provide a mechanism that supports multiple servers simultaneously
   exporting a single writable file system.

   This problem has been aggravated in recent years by the advent of
   inexpensive, easily expanded clusters of application servers that
   are also NFS clients.  The aggregate bandwidth demands of such
   clustered clients, typically working on a shared data set
   preferentially stored in a single file system, can increase much more
   quickly than the bandwidth of any server.  The proposed solution is
   to provide for the parallelization of file services, by enhancing
   NFSv4 in a minor version.


Table of Contents

   1. Introduction...................................................2
   2. Bandwidth Scaling in Clusters..................................4
   3. Clustered Applications.........................................4
   4. Existing File Systems for Clusters.............................6
   5. Eliminating the Bottleneck.....................................7
   6. Separated Control and Data Access Techniques...................8
   7. Security Considerations........................................9
   8. Informative References.........................................9
   9. Acknowledgments...............................................11
   10. Authors' Addresses...........................................11
   11. Full Copyright Statement.....................................11


1. Introduction

   The storage I/O bandwidth requirements of clients are rapidly
   outstripping the ability of network file servers to supply them.
   Increasingly, this problem is being encountered in installations
   running the NFS protocol.  The problem can be solved by increasing
   the server bandwidth.  This draft suggests that an effort be mounted
   to enable NFS file service to scale with its clusters of clients.
   The proposed approach is to increase the aggregate bandwidth possible
   to a single file system by parallelizing the file service, resulting
   in multiple network connections to multiple server endpoints
   participating in the transfer of requested data.  This should be



Gibson et al            Expires - August 2004                [Page 2]


Internet Draft          pNFS Problem Statement           February 2004


   achievable within the framework of NFS, possibly in a minor version
   of the NFSv4 protocol.

   In many application areas, single system servers are rapidly being
   replaced by clusters of inexpensive commodity computers. As
   clustering technology has improved, the barriers to running
   application codes on very large clusters have been lowered. Examples
   of application areas that are seeing the rapid adoption of scalable
   client clusters are data intensive applications such as genomics,
   seismic processing, data mining, content and video distribution, and
   high performance computing. The aggregate storage I/O requirements of
   a cluster can scale proportionally to the number of computers in the
   cluster.  It is not unusual for clusters today to make bandwidth
   demands that far outstrip the capabilities of traditional file
   servers.  A natural solution to this problem is to enable file
   service to scale as well, by increasing the number of server nodes
   that are able to service a single file system to a cluster of
   clients.

   Scalable bandwidth can be claimed by simply adding multiple
   independent servers to the network. Unfortunately, this leaves to
   file system users the task of spreading data across these independent
   servers.  Because the data processed by a given data-intensive
   application is usually logically associated, users routinely co-
   locate this data in a single file system, directory or even a single
   file.  The NFSv4 protocol currently requires that all the data in a
   single file system be accessible through a single exported network
   endpoint, constraining access to be through a single NFS server.

   A better way of increasing the bandwidth to a single file system is
   to enable access to be provided through multiple endpoints in a
   coordinated or coherent fashion.  Separation of control and data
   flows provides a straightforward framework to accomplish this, by
   allowing transfers of data to proceed in parallel from many clients
   to many data storage endpoints.  Control and file management
   operations, inherently more difficult to parallelize, can remain the
   province of a single NFS server, preserving the simple management of
   today's NFS file service, while offloading data transfer operations
   allows bandwidth to scale.  Data transfer may be done using NFS or
   other protocols, such as iSCSI.

   While NFS is a widely used network file system protocol, most of the
   world's data resides in data stores that are not accessible through
   NFS.  Much of this data is stored in Storage Area Networks,
   accessible by SCSI's Fibre Channel Protocol (FCP), or increasingly,
   by iSCSI.  Storage Area Networks routinely provide much higher data
   bandwidths than do NFS file servers.  Unfortunately, the simple array
   of blocks interface into Storage Area Networks does not lend itself
   to controlling multiple clients that are simultaneously reading and


Gibson et al            Expires - August 2004                [Page 3]


Internet Draft          pNFS Problem Statement           February 2004


   writing the blocks of the same or different files, a workload usually
   referred to as data sharing.  NFS file service, with its hierarchical
   namespace of separately controlled files, offers simpler and more
   cost-effective management.  One might conclude that users must
   choose between high bandwidth and data sharing.  Not only is this
   conclusion false, but it should also be possible to allow data
   stored in SAN devices, whether FCP or iSCSI, to be accessed under
   the control of an NFS server.  Such an approach protects the
   industry's large investment in NFS, since the bandwidth bottleneck
   no longer needs to drive users to adopt a proprietary alternative
   solution, and it leverages SAN storage infrastructures, all within a
   common architectural framework.


2. Bandwidth Scaling in Clusters

   When applied to data-intensive applications, clusters can generate
   unprecedented demand for storage bandwidth.  At present, each node
   in the cluster is likely to be a dual-processor machine, with each
   processor running at multiple GHz and with gigabytes of DRAM.
   Depending on the specific application, each node is capable of
   sustaining a demand of 10s to 100s of MB/s of data from storage.  In
   addition, the number of nodes in a cluster is commonly in the 100s,
   with many instances of 1000s to 10,000s of nodes.  The result is
   that storage systems may be called upon to provide aggregate
   bandwidths ranging from GB/s upwards toward TB/s.
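
   As a purely illustrative calculation, using round numbers taken from
   the ranges above (the specific figures are assumptions, not
   measurements):

       1,000 nodes x 100 MB/s per node  =  100 GB/s aggregate demand
      10,000 nodes x 100 MB/s per node  =    1 TB/s aggregate demand

   Even the smaller of these configurations far exceeds what a single
   traditional NFS server can sustain.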

   The performance of a single NFS server has been improving, but it is
   not able to keep pace with cluster demand. Directly connected storage
   devices behind an NFS server have given way to disk arrays and
   networked disk arrays, making it now possible for an NFS server to
   directly access 100s to 1000s of disk drives whose aggregate capacity
   reaches upwards to PBs and whose raw bandwidths range upwards to 10s
   of GB/s.

   An NFS server is interposed between the scalable storage subsystem
   and the scalable client cluster.  Multiple NIC endpoints help network
   bandwidth keep up with DRAM bandwidth.  However, the rate of
   improvement of NFS server performance is not faster than the rate of
   improvement in each client node. As long as an NFS file system is
   associated with a single client-side network endpoint, the aggregate
   capabilities of a single NFS server to move data between storage
   networks and client networks will not be able to keep pace with the
   aggregate demand of clustered clients and large disk subsystems.


3. Clustered Applications

   Large datasets and high bandwidth processing of large datasets are
   increasingly common in a wide variety of applications.  As most


Gibson et al            Expires - August 2004                [Page 4]


Internet Draft          pNFS Problem Statement           February 2004


   computer users can affirm, the size of everyday presentations,
   pictures and programs seems to grow continuously, and in fact average
   file size does grow with time [Ousterhout85, Baker91].  Even this
   baseline use of growing files in day-to-day business and personal
   computing, through simple copying, viewing, archiving and sharing,
   drives up the bandwidth demand on servers.

   Some applications, however, make much larger demands on file and file
   system capacity and bandwidth.  Databases of DNA sequences, used in
   bioinformatics search, range up to tens of GBs and are often in use
   by all cluster users at the same time [NIH03].  These huge files may
   experience bursts of many concurrent clients loading the whole file
   independently.

   Bioinformatics is an example of extensive search in a scientific
   application, but extensive search is much broader than science.  Wall
   Street has taken to collecting long-term transaction record
   histories.  Looking for patterns of unbilled transactions, fraud or
   predictable market trends is a growing financial opportunity
   [Agarwal95, Senator95].

   Security and authentication are driving a need for image search, such
   as face recognition [Flickner95].  Databasing the faces of approved
   or suspected individuals and searching through many camera feeds
   involves huge datasets and bandwidths.  Traditional database
   indexing in these high-dimensional data structures often fails to
   avoid full database scans of these huge files [Berchtold97].

   With huge storage repositories and fast computers, large-scale
   capture of sensor data is increasingly common in many applications.
   Consumer digital photography fits this model, with photo touch-up
   and slide show generation tools driving bandwidth, although much
   more demanding applications are not unusual.

   Medical test imagery is being captured at very high resolution and
   tools are being developed for automatic preliminary diagnosis, for
   example [Afework98]. In the science world, even larger datasets are
   captured from satellites, telescopes, and atom-smashers, for example
   [Greiman97].  Preliminary processing of a sky survey suggests that
   thousand node clusters may sustain GB/s storage bandwidths [Gray03].
   Seismic trace data, often measured in helicopter loads, commands
   large clusters for days to months [Knott03].

   At the high end of science application, accurate physical simulation,
   its visualization and fault-tolerance checkpointing, has been
   estimated to need 10 GB/s bandwidth and 100 TB of capacity for every
   thousand nodes in a cluster [SGPFS01].




Gibson et al            Expires - August 2004                [Page 5]


Internet Draft          pNFS Problem Statement           February 2004


   Most of these applications make heavy use of shared data across many
   clients, users and applications, have limited budgets available to
   fund aggressive computational goals, and have technical or scientific
   users with strong preferences for file systems and no patience for
   tuning storage.  NFS file service, appropriately scaled up in
   capacity and bandwidth, is highly desired.

   In addition to these search, sensor and science applications,
   traditional database applications are increasingly employing NFS
   servers.  These applications often have hotspot tables, leading to
   high bandwidth storage demands.  Yet SAN-based solutions are
   sometimes harder to manage than NFS-based solutions, especially in
   databases with a large number of tables. NFS servers with scalable
   bandwidth would accelerate the adoption of NFS for database
   applications.

   These examples suggest that there is no shortage of applications
   frustrated by the limitations of a single network endpoint on a
   single NFS server exporting a single file system or single huge file.


4. Existing File Systems for Clusters

   The server bottleneck has induced various vendors to develop
   proprietary alternatives to NFS.

   Known variously as asymmetric, out-of-band, clustered or SAN file
   systems, these proprietary alternatives exploit the scalability of
   storage networks by attaching all nodes in the client cluster to the
   storage network.  Then, by reorganizing client and server code
   functionality to separate data traffic from control traffic, client
   nodes are able to access storage devices directly rather than
   requesting all data from the same single network endpoint in the file
   server that handles control traffic.

   Most proprietary alternative solutions have been tailored to storage
   area networks based on the fixed-sized block SCSI storage device
   command set and its Fibre Channel SCSI transport.  Examples in this
   class include EMC's High Road (www.emc.com); IBM's TotalStorage SAN
   FS, SANergy and GPFS (www.ibm.com); Sistina/Red Hat's GFS
   (www.redhat.com); SGI's CXFS (www.sgi.com); Veritas' SANPoint Direct
   and CFS (www.veritas.com); and Sun's QFS (www.sun.com).  The
   Fibre Channel SCSI transport used in these systems may soon be
   replaceable by a TCP/IP SCSI transport, iSCSI, enabling these
   proprietary alternatives to operate on the same equipment and IETF
   protocols commonly used by NFS servers.

   While fixed-sized block SCSI storage devices are used in most file
   systems with separated data and control paths, this is not the only


Gibson et al            Expires - August 2004                [Page 6]


Internet Draft          pNFS Problem Statement           February 2004


   alternative available today.  SCSI's newly emerging command set, the
   Object Storage Device (OSD) command set, transmits variable length
   storage objects over SCSI transports [T10-03].  Panasas' ActiveScale
   storage cluster employs a proto-OSD command set over iSCSI on its
   separated data path (www.panasas.com).  IBM Research is also
   demonstrating a variant of its TotalStorage SAN FS employing proto-
   OSD commands [Azagury02].

   Even more distinctive is Zforce's File Switch technology
   (www.zforce.com).  Zforce virtualizes a CIFS file server by
   spreading the contents of a file share over many backend CIFS
   storage servers, and it places the control path functionality inside
   a network switch in order to obtain some of the properties of both
   separated and non-separated data and control paths.  However,
   striping files over multiple file-based storage servers is not a new
   concept.  Berkeley's Zebra file system, the successor to the log-
   structured file system developed for RAID storage, had separated
   data and control paths with file protocols on both [Hartman95].


5. Eliminating the Bottleneck

   The restriction of a single network endpoint results from the way NFS
   associates file servers and file systems.  Essentially, each client
   machine "mounts" each exported file system; these mount operations
   bind a network endpoint to all files in the exported file system,
   instructing the client to address that network endpoint with all
   requests associated with all files in that file system.  Mechanisms
   intended primarily for failover have been established for giving
   clients a list of network endpoints associated with a given file
   system.

   Multiple NFS servers can be used instead of a single NFS server, and
   many cluster administrators, programmers and end-users have
   experimented with this alternative.  The principal compromise
   involved in exploiting multiple NFS servers is that a single file or
   single file system is decomposed into multiple files or file systems,
   respectively. For instance, a single file can be decomposed into many
   files, each located in a part of the namespace that is exported by a
   different NFS server; or the files of a single directory can be
   linked to files in directories located in file systems exported by
   different NFS servers.  Because this decomposition is done without
   NFS server support, the work of decomposing and recomposing and the
   implications of the decomposition on capacity and load balancing,
   backup consistency, error recovery, and namespace management all fall
   to the customer. Moreover, the additional statefulness of NFSv4 makes
   correct semantics for files decomposed over multiple servers without
   NFS support much more complex. Such extra work and extra problems are



Gibson et al            Expires - August 2004                [Page 7]


Internet Draft          pNFS Problem Statement           February 2004


   usually referred to as storage management costs, and are blamed for
   causing a high total cost of ownership for storage.
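
   As a purely illustrative sketch of this user-side burden (not a
   recommended practice), the following Python fragment decomposes one
   large file into stripe files spread across several independently
   mounted NFS file systems.  The mount points, stripe size and naming
   scheme are hypothetical assumptions chosen only for illustration.

      # Illustrative only: manual decomposition of one large file into
      # stripe files placed on several independent NFS mounts.
      import os

      MOUNTS = ["/mnt/nfs0", "/mnt/nfs1", "/mnt/nfs2", "/mnt/nfs3"]
      STRIPE = 1 << 20   # assumed stripe unit: 1 MB

      def decompose(src_path, name):
          # Split src_path round-robin across MOUNTS; the user must
          # remember this layout in order to ever reassemble the data.
          outs = [open(os.path.join(m, "%s.%d" % (name, i)), "wb")
                  for i, m in enumerate(MOUNTS)]
          try:
              with open(src_path, "rb") as src:
                  i = 0
                  while True:
                      chunk = src.read(STRIPE)
                      if not chunk:
                          break
                      outs[i % len(MOUNTS)].write(chunk)
                      i += 1
          finally:
              for f in outs:
                  f.close()

   Reassembling the file, rebalancing when one server fills, keeping
   backups consistent across all four servers, and recovering from a
   partial failure are all left to the user; these are the storage
   management costs referred to above.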

   Preserving the relative ease of use of NFS storage systems requires
   solutions to the bandwidth bottleneck that do not decompose files and
   directories in the file subtree namespace.

   A solution to this problem should continue to use the existing single
   network endpoint for control traffic, including namespace
   manipulations. Decompositions of individual files and file systems
   over multiple network endpoints can be provided via the separated
   data paths, without separating the control and metadata paths.


6. Separated Control and Data Access Techniques

   Separating storage data flow from file system control flow
   effectively moves the bottleneck away from the single endpoint of an
   NFS server and distributes it across the bisectional bandwidth of the
   storage network between the cluster nodes and storage devices.  Since
   switch bandwidths of upwards of terabits per second are available
   today, this bottleneck is at least two orders of magnitude better
   than that of an NFS server network endpoint.
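
   As a rough illustration (the figures are assumptions for the sake of
   the arithmetic), a switch offering 1 Tb/s of bisectional bandwidth,
   compared with a single server endpoint of 1 to 10 Gb/s, yields a
   factor of 100 to 1,000, consistent with the two orders of magnitude
   cited above.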

   In an architecture that separates the storage data path from the NFS
   control path there are choices of protocol for the data path.  One
   straightforward answer is to extend the NFS protocol so that it can
   be used on both the control and the separated data paths.
   Another straightforward answer is to capture the existing market's
   dominant separated data path, fixed-sized block SCSI storage. A third
   alternative is the emerging object storage SCSI command set, OSD,
   which is appearing in new products with separate data and control
   paths.

   A solution that accommodates all of these approaches provides the
   broadest applicability for NFS.  Specifically, NFS extensions should
   make minimal assumptions about the storage data server access
   protocol.  The clients in such an extended NFS system should be
   compatible with the current NFSv4 protocol, and should be compatible
   with earlier versions of NFS as well.  A solution should be capable
   of providing both asymmetric data access, with the data path
   connected via NFS or other protocols and transports, and symmetric
   parallel access to servers that run NFS on each server node.
   Specifically, it is desirable to enable NFS to manage asymmetric
   access to storage attached via iSCSI and Fibre Channel/SCSI storage
   area networks.

   As previously discussed, the root cause of the NFS server bottleneck
   is the binding between one network endpoint and all the files in a
   file system.  NFS extensions can allow the association of additional


Gibson et al            Expires - August 2004                [Page 8]


Internet Draft          pNFS Problem Statement           February 2004


   network endpoints with specific files.  These associations could be
   represented as layout maps [Gibson98].  NFS clients could be extended
   to have the ability to retrieve and use these layout maps.
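
   As a purely illustrative sketch, and not a protocol proposal, a
   layout map might associate the byte ranges of a file with the data
   server endpoints that store them, allowing a client that has
   retrieved the map to issue requests to several endpoints in
   parallel.  The field names, the striping scheme and the
   read_from_data_server() placeholder below are assumptions made only
   for illustration.

      # Illustrative only: a hypothetical layout map and a client-side
      # parallel read that uses it.
      from concurrent.futures import ThreadPoolExecutor
      from dataclasses import dataclass
      from typing import List

      @dataclass
      class LayoutMap:
          stripe_unit: int          # bytes per stripe unit (assumed)
          data_servers: List[str]   # endpoints holding the stripes

      def read_from_data_server(endpoint, offset, length):
          # Placeholder for the separated data path (e.g. NFSv4, iSCSI
          # or OSD); the actual transfer protocol is out of scope here.
          raise NotImplementedError

      def parallel_read(layout, offset, length):
          # Split a logical read into per-stripe-unit requests and
          # issue them to the responsible data servers concurrently.
          servers = layout.data_servers
          requests = []
          pos = offset
          while pos < offset + length:
              unit = pos // layout.stripe_unit
              unit_end = (unit + 1) * layout.stripe_unit
              n = min(unit_end, offset + length) - pos
              requests.append((servers[unit % len(servers)], pos, n))
              pos += n
          with ThreadPoolExecutor() as pool:
              parts = pool.map(lambda r: read_from_data_server(*r),
                               requests)
          return b"".join(parts)

   The control path remains a single NFS server that hands out the
   layout map; only the data transfers fan out to multiple endpoints.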

   NFSv4 provides an excellent foundation for this.  We may be able to
   extend the current notion of file delegations to include the ability
   to retrieve and utilize a file layout map.  A number of ideas have
   been proposed for storing, accessing, and acting upon layout
   information stored by NFS servers to allow separate access to file
   data over separate data paths.  Data access can be supported over
   multiple protocols, including NFSv4, iSCSI, and OSD.

7. Security Considerations

   Bandwidth scaling solutions that employ separation of control and
   data paths will introduce new security concerns.  For example, the
   data access methods will require authentication and access control
   mechanisms that are consistent with the primary mechanisms on the
   NFSv4 control paths.  Object storage employs revocable cryptographic
   restrictions on each object, which can be created and revoked in the
   control path.  With iSCSI access methods, iSCSI security
   capabilities are available, but they do not include NFS access
   control.  Fibre Channel based SCSI access methods have less
   sophisticated security than iSCSI; these access methods typically
   rely on private networks to provide security.
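
   As a purely illustrative sketch of such revocable cryptographic
   restrictions (the encoding below is an assumption, not the format
   defined in [T10-03]), a capability can be thought of as a keyed MAC
   that the control server computes over the object identifier, the
   permitted rights and an expiry time; the storage device verifies it
   on every request, and the control server can revoke outstanding
   capabilities by rolling the shared key.

      # Illustrative only: a capability-style credential for object
      # access; the encoding and field names are assumptions.
      import hashlib, hmac, time

      def make_capability(key, object_id, rights, lifetime_s=300):
          # Minted by the control server and handed to the client.
          expiry = int(time.time()) + lifetime_s
          msg = ("%s|%s|%d" % (object_id, rights, expiry)).encode()
          tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
          return {"object_id": object_id, "rights": rights,
                  "expiry": expiry, "tag": tag}

      def verify_capability(key, cap):
          # Checked by the storage device before serving the request.
          msg = ("%s|%s|%d" % (cap["object_id"], cap["rights"],
                               cap["expiry"])).encode()
          want = hmac.new(key, msg, hashlib.sha256).hexdigest()
          return (hmac.compare_digest(want, cap["tag"])
                  and time.time() < cap["expiry"])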

   Any proposed solution must be analyzed for security threats and any
   such threats must be addressed.  The IETF and the NFS working group
   have significant expertise in this area.


8. Informative References

   [Afework98] A. Afework, M. Beynon, F. Bustamonte, A. Demarzo, R.
      Ferriera, R. Miller, M. Silberman, J. Saltz, A. Sussman, H. Tang,
      "Digital dynamic telepathology - the virtual microscope," Proc. of
      the AMIA'98 Fall Symposium 1998.

   [Agarwal95] Agrawal, R. and Srikant, R. "Fast Algorithms for Mining
      Association Rules" VLDB, September 1995.

   [Azagury02] Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor,
      D., Rinetzky, N., Satran, J., Tavory, A., Yerushalmi, L, "Towards
      an Object Store," IBM Storage Systems Technology Workshop,
      November 2002.

   [Baker91] Baker, M.G., Hartman, J.H., Kupfer, M.D., Shirriff, K.W.
      and Ousterhout, J.K. "Measurements of a Distributed File System"
      SOSP, October 1991.


Gibson et al            Expires - August 2004                [Page 9]


Internet Draft          pNFS Problem Statement           February 2004



   [Berchtold97] Berchtold, S., Boehm, C., Keim, D.A. and Kriegel, H. "A
      Cost Model For Nearest Neighbor Search in High-Dimensional Data
      Space" ACM PODS, May 1997.

   [Fayyad98] Fayyad, U. "Taming the Giants and the Monsters: Mining
      Large Databases for Nuggets of Knowledge" Database Programming and
      Design, March 1998.

   [Flickner95] Flickner, M., Sawhney, H., Niblack, W., Ashley, J.,
      Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic,
      D., Steele, D. and Yanker, P. "Query by Image and Video Content:
      the QBIC System" IEEE Computer, September 1995.

   [Gibson98] Gibson, G. A., et. al., "A Cost-Effective, High-Bandwidth
      Storage Architecture," International Conference on Architectural
      Support for Programming Languages and Operating Systems (ASPLOS),
      October 1998.

   [Gray03] Jim Gray, "Distributed Computing Economics," Technical
      Report MSR-TR-2003-24, March 2003.

   [Greiman97] Greiman, W., W. E. Johnston, C. McParland, D. Olson, B.
      Tierney, C. Tull, "High-Speed Distributed Data Handling for HENP,"
      Computing in High Energy Physics, April, 1997. Berlin, Germany.

   [Hartman95] John H. Hartman and John K. Ousterhout, "The Zebra
      Striped Network File System," ACM Transactions on Computer Systems
      13, 3, August 1995.

   [Knott03] Knott, T., "Computing colossus," BP Frontiers magazine,
      Issue 6, April 2003, http://www.bp.com/frontiers.

   [NIH03] "Easy Large-Scale Bioinformatics on the NIH Biowulf
      Supercluster," http://biowulf.nih.gov/easy.html, 2003.

   [Ousterhout85] Ousterhout, J.K., DaCosta, H., Harrison, D., Kunze,
      J.A., Kupfer, M. and Thompson, J.G. "A Trace-Driven Analysis of
      the UNIX 4.2 BSD File System" SOSP, December 1985.

   [Senator95] Senator, T.E., Goldberg, H.G., Wooten, J., Cottini, M.A.,
      Khan, A.F.U., Klinger, C.D., Llamas, W.M., Marrone, M.P. and Wong,
      R.W.H. "The Financial Crimes Enforcement Network AI System (FAIS):
      Identifying potential money laundering from reports of large cash
      transactions"  AIMagazine 16 (4), Winter 1995.

   [SGPFS01] SGS File System RFP, DOE NNCA and DOD NSA, April 25, 2001.




Gibson et al            Expires - August 2004               [Page 10]


Internet Draft          pNFS Problem Statement           February 2004


   [T10-03] Draft OSD Standard, T10 Committee, Storage Networking
      Industry Association (SNIA),
      ftp://www.t10.org/ftp/t10/drafts/osd/osd-r08.pdf


9. Acknowledgments

   David Black, Gary Grider, Benny Halevy, Dean Hildebrand, Dave Noveck,
   Julian Satran, Tom Talpey, and Brent Welch contributed to the
   development of this problem statement.


10. Authors' Addresses

   Garth Gibson
   Panasas Inc, and Carnegie Mellon University
   1501 Reedsdale Street
   Pittsburgh, PA 15233 USA
   Phone: +1 412 323 3500
   Email: ggibson@panasas.com

   Peter Corbett
   Network Appliance Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA
   Phone: +1 781 768 5343
   Email: peter@pcorbett.net


11. Full Copyright Statement

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the procedures
   for copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.


Gibson et al            Expires - August 2004               [Page 11]


Internet Draft          pNFS Problem Statement           February 2004



   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.












































Gibson et al            Expires - August 2004               [Page 12]