Internet-Draft                                   S. Bailey (Sandburst)
Expires: May 2002                                    D. Garcia (Compaq)
                                                    J. Hilland (Compaq)
                                                     A. Romanow (Cisco)

                    Direct Access Problem Statement
                  draft-garcia-direct-access-problem-00
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This problem statement describes barriers to the use of Internet
Protocols for highly scalable, high bandwidth, low latency
transfers necessary in some of today's important applications,
particularly applications found within data centers. In addition
to describing technical reasons for the problems, it gives an
overview of common non-IP solutions to these problems which have
been deployed over the years.
The perspective of this draft is that it would be very beneficial
to have an IP-based solution for these problems so IP can be used
for high speed data transfers within data centers, in addition to
IP's many other uses.
Table Of Contents
1.     Introduction
1.1.   High Bandwidth Transfer Overhead
1.2.   Proliferation Of Fabrics in Data Centers
1.3.   Potential Solutions
2.     High Bandwidth Data Transfer In The Data Center
2.1.   Scalable Data Center Applications
2.2.   Client/Server Communication
2.3.   Block Storage
2.4.   File Storage
2.5.   Backup
2.6.   The Common Thread
3.     Non-IP Solutions
3.1.   Proprietary Solutions
3.2.   Standards-based Solutions
3.2.1. The Virtual Interface Architecture (VIA)
3.2.2. InfiniBand
4.     Conclusion
5.     Security Considerations
6.     References
       Authors' Addresses
A.     RDMA Technology Overview
A.1    Use of Memory Access Transfers
A.2    Use Of Push Transfers
A.3    RDMA-based I/O Example
       Full Copyright Statement
1. Introduction
Protocols in the IP family offer a huge, ever increasing range of
functions, including mail, messaging, telephony, media and
hypertext content delivery, block and file storage, and network
control. IP has been so successful that applications only use
other forms of communication when there is a very compelling
reason. Currently, it is often not acceptable to use IP protocols
for high-speed communication within a data center. In these cases,
copying data to application buffers consumes too much of the CPU
capacity that is otherwise needed to perform application functions.
This limitation of IP protocols has not been particularly important
until now because the domain of high performance transfers was
limited to a relatively specialized niche of low volume
applications, such as scientific supercomputing. Applications that
needed more efficient transfer than IP could offer simply used
other purpose-built solutions.
As the use of the Internet has become pervasive and critical, the
growth in number and importance of data centers has matched the
growth of the Internet. The role of the data center is similarly
critical. The high-end environment of the data center makes up the
core and nexus of today's Internet. Everything goes in and out of
data centers.
Applications running within data centers frequently require high
bandwidth data transfer. Due to the high host processing overhead
of high bandwidth communication in IP, the industry has developed
non-IP technology to serve data center traffic. That said, the
obstacles to lowering host processing overhead in IP are well
understood and straightforward to address. Simple techniques could
allow the penetration of existing IP protocols into data centers
where non-IP technology is currently used.
Technology advances have made feasible specially designed network
interfaces that place IP protocol data directly in application
buffers. While it is certainly possible to use control information
directly from existing IP protocol messages to place data in
application buffers, the sheer number and diversity of current
and future IP protocols calls for a generic solution instead.
Therefore, the goal is to investigate a generic data placement
solution for IP protocols that would allow a single network
interface to perform direct data placement for a wide variety of
mature, evolving and completely new protocols.
There is a great desire to develop lower overhead, more scalable
data transfer technology based on IP. This desire comes from the
advantages of using one protocol technology rather than several,
and from the many efficiencies of technology based upon a single,
widely adopted, open standard.
This document describes the problems that IP faces in delivering
highly scalable high bandwidth data transfer. The first section
describes the issues in general. The second section describes
several specific scenarios, discussing particular application
domains and specific problems that arise. The third section
describes approaches that have historically been used to address
low overhead, high bandwidth data transfer needs. The appendix
gives an overview of how a particular class of non-IP technologies
addresses this problem with Remote Direct Memory Access (RDMA).
1.1. High Bandwidth Transfer Overhead
Transport protocols such as TCP [TCP] and SCTP [SCTP] have
successfully shielded upper layers from the complexities of moving
data between two computers. This has been very successful in
making TCP/IP ubiquitous. However, with current IP
implementations, Upper Layer Protocols (ULPs), such as NFS [NFSv3]
and HTTP [HTTP], require incoming data packets to be buffered and
copied before the data is used.
It is this data copying that is a primary source of overhead in IP
data transfers. Copying received data for high bandwidth transfers
consumes significant processing time and memory bandwidth. If data
is buffered and then copied, the data moves across the memory bus
at least three times during the data transfer. By comparison, if
the incoming data is placed directly where the application requires
it, the data moves across the memory bus only once. This copying
overhead currently means that additional processing resources, such
as additional processors in a multiprocessor machine, are needed to
reach faster and faster wire speeds.
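As a concrete illustration of the receive-side copy, the sketch below
shows a conventional sockets receive loop in C.  It is only a sketch:
the function name, buffer handling, and the 10 gigabit arithmetic in
the comments are illustrative assumptions, not measurements from any
particular system.

   /*
    * Illustrative sketch of a conventional copy-based receive path.
    * The NIC DMAs each arriving packet into a kernel buffer (memory
    * bus crossing 1); recv() then reads that kernel buffer (crossing
    * 2) and writes the payload into the application buffer (crossing
    * 3).  At 10 Gb/s of payload (1.25 GB/s), this is roughly 3.75
    * GB/s of memory traffic, versus 1.25 GB/s if the NIC placed the
    * data directly in the application buffer.
    */

   #include <sys/types.h>
   #include <sys/socket.h>

   ssize_t receive_into_app_buffer(int sock, char *app_buf, size_t len)
   {
       size_t  got = 0;
       ssize_t n;

       while (got < len) {
           /* Each recv() copies data the kernel has already buffered
            * into the application's buffer. */
           n = recv(sock, app_buf + got, len - got, 0);
           if (n <= 0)
               return n;           /* error or connection closed */
           got += (size_t)n;
       }
       return (ssize_t)got;
   }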
A wide range of ad hoc solutions have been explored to eliminate
data copying overhead within the framework of current IP
protocols, but despite extensive study, still no adequate or
general solution exists [Chase].
1.2. Proliferation Of Fabrics in Data Centers
The current alternative to paying the high costs due to data
transfer overhead in data centers is the use of several different
communication technologies at once. Data centers are likely to have
separate IP (Ethernet), Fibre Channel storage, and InfiniBand, VIA or
proprietary interprocess communication (IPC) networks. Special
purpose networks are used for storage and IPC to reduce the
processor overhead associated with data communications; and in the
case of IPC, to reduce latency as well.
Using such proprietary and special purpose solutions runs counter
to the requirements of data center computing. Data center
designers and operators do not want the expense and complexity of
building and maintaining three separate communications networks.
Three NICs and three fabric ports are expensive, consume valuable
IO card slots, power and machine room space.
A single IP fabric would be far preferable. IP networks are best
positioned to fill the role of all three of these existing
networks. At 1 to 10 gigabit speeds current IP interconnects could
offer comparable or superior performance characteristics to special
purpose purpose interconnects, if it were not for the high overhead
and latency of IP data transfers. An IP-based alternative to the
IPC and storage fabrics would be less costly, and much more easily
manageable than maintaining separate communication fabrics.
1.3. Potential Solutions
One frequently proposed solution to the problem of data transfer
overhead in IP data transfers is to wait for the next generation of
faster processors and speedier memories to render the problem
irrelevant. However, in the evolution of the Internet, processor
and memory speeds are not the only variables that have increased
exponentially over time. Data link speeds have grown exponentially
as well. Recently, spurred by the demand for core network
bandwidth, data link speeds have grown faster than both processor
computation rates and processor memory transfer rates. Whatever
speed increases occur in processors and memories, it is clear that
link speeds will continue to grow aggressively as well.
Rather than relying on increasing CPU performance, non-IP solutions
use network interface hardware to attack several distinct sources of
overhead. For a small, one-way IP data
transfer, typically both the sender and receiver must make several
context switches, process several interrupts, and send and receive
a network packet. In addition, the receiver must perform at least
one data copy. This single transfer could require 10,000
instructions of execution and total time measured in hundreds of
microseconds if not milliseconds. The sources of overhead in this
transfer are:
o context switches and interrupts,
o execution of protocol code,
o copying the data on the receiver.
Copying competes with DMA and other processor accesses for memory
system bandwidth, and all these sources of overhead can also have
significant secondary effects on the efficiency of application
execution by interfering with system caches.
Depending on the application, each of these sources of overhead may
be a small or a large factor in total overhead, but the cumulative
effect of all of them is nearly always substantial for high
bandwidth transfers. If data transfers are very small, data
copying is only a small cost, but context switching and protocol
stack execution become performance limiting factors. For large
transfers, the most common high bandwidth data transfers, context
switching and protocol stack execution can be amortized away,
within certain limits, but data copying becomes costly.
Non-IP solutions address these sources of overhead with network
interface hardware that:
o reduces context switches and interrupts with kernel-bypass
capability, where the application communicates directly
through the network interface without kernel intervention,
o reduces protocol stack processing with protocol offload
hardware that performs some or all protocol processing (e.g.
ACK processing),
o reduces data copying overhead by placing data directly in
application buffers.
The application of these techniques reduces both data transfer
overhead, and data transfer latency. Context switches and data
copying are substantial sources of end-to-end latency that are
eliminated by kernel-bypass and direct data placement. Offloaded
protocol processing can also typically be performed an order of
magnitude faster than a comparable, general purpose protocol stack,
due to the ability to exploit extensive parallelism in hardware.
While protocol offload does reduce overhead, for the vast majority
of current high bandwidth data transfer applications, eliminating
data copies is much more important.
These techniques, and others, may be equally applicable to reducing
the overhead of IP data transfers.
2. High Bandwidth Data Transfer In The Data Center
There are numerous uses of high bandwidth data transfers in today's
data centers. While these applications are found in the data
center, they have implications for the desktop as well. This
problem statement focuses on data center scenarios below, but it
would be beneficial to find a solution that meets data center needs while
possibly remaining affordable for the desktop.
Why is high bandwidth data transfer in the data center important
for IP networking? Performance on the Internet, as well as
intranets, is dependent on the performance of the data center.
Every request, be it a web page, database query or file and print
service goes to or through data center servers. Often a multi-
tiered computing solution is used, where multiple machines in the
data center satisfy these requests. Despite the explosive growth
of the server market, data centers are running into critical
limitations that impact every client directly or indirectly.
Unlike servers, clients are largely limited in performance by the
human at the interface. In contrast, data center performance is
limited by the speeds and feeds of the network and I/O devices as
well as hardware and software components.
With new protocols such as iSCSI, IP networks are increasingly
taking on the functions of special purpose interconnects, such as
Fibre Channel. However, the limitations created by high data
transfer overhead described here have not as yet been addressed for
IP protocols in general.
First and foremost, all the problems illustrated in scenarios below
occur on IP protocol based networks. It is imperative to
understand the pervasiveness of IP networks within the data center
and that all of the problems described below occur in IP-based data
transfer solutions. Therefore, a solution to these problems will
naturally also be a part of the IP protocol suite.
Although the problems discussed below manifest themselves in
different ways, investigation into the source of these problems
shows a common thread running through them. These scenarios are
not an exhaustive list, but rather describe the wide range of problems
exhibited in scalability and performance of the applications and
infrastructures encountered in data center computing as a result of
high communication overhead.
2.1. Scalable Data Center Applications
A key characteristic of any data center application is its ability
to scale as demands increase. For many Internet services,
applications must scale in response to the success of the service
and the increased demand which results. In other cases,
applications must be scaled as capabilities are added to a service,
again in response to the success of the service, changes in the
competitive environment or goals of the provider.
Virtually all data center applications require intermachine
communication, and therefore, application scalability may be
directly limited by communication overhead. From the application
viewpoint, every CPU cycle spent performing data transfer is a
wasted cycle that affects scalability. For high bandwidth data
transfers using IP, this overhead can be 30-40% of available CPU.
If an application is running on a single server, and it is
scaled by adding a second server, communication overhead of 40%
means that the CPU available to the application from two servers is
only 120% of that of the single server. The problem is even worse
with many servers, because most servers are communicating with more
than one other server. If three servers are connected in a
pipeline where 40% CPU is required for data transfers to or from
another server, the total available CPU power would still be only
120% of the power of a single server! Not all data center
applications require this level of communication, but many do. The
high overhead of data transfers in IP severely impacts the
viability of IP for scalable data center applications.
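Stated as simple arithmetic, and using the 40% figure above purely as
an illustration, the two-server case above works out as:

   useful CPU per server = 100% - 40% communication overhead = 60%
   aggregate useful CPU  = 2 x 60% = 120% of a single server

Each additional server a machine must exchange high bandwidth data
with consumes a further share of that machine's cycles, which is why
adding servers to a communicating application yields far less than
linear scaling.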
2.2. Client/Server Communication
Client/server communication in the data center is a variation of
the scalable data center application scenario, but applies to
standalone servers as well as parallel applications. The overhead
of high bandwidth data communication weighs heavily on the server.
The server's ability to respond is limited by any communication
overhead it incurs.
In addition, client/server application performance is often
dominated by data transfer latency characteristics. Reducing
latency can greatly improve application performance. Techniques
commonly employed in IP network interfaces, such as TCP checksum
calculation offload, reduce transfer overhead somewhat, but they
typically do not reduce latency at all. Another technique used to
reduce latency in IP communication is to dedicate multiple threads
of execution, each running on a separate processor, to processing
requests concurrently. However, this multithreading solution has
limits, as the number of outstanding requests can vastly exceed the
number of processors. Furthermore, the effect of multithreading
concurrency is additive with any other latency reduction in the
data transfers themselves.
To address the problems of high bandwidth IP client/server
communication, a solution would ideally reduce both end to end
communication latency, and communication overhead.
2.3. Block Storage
Block storage, in the form of iSCSI [iSCSI] and IP Fibre Channel
protocols [FCIP, iFCP], is a new IP application area of great
interest to the storage and data center communities. Just as data
centers eagerly desire to replace special-purpose interprocess
communication fabrics with IP, there is parallel and equal interest
in migrating block storage traffic from special-purpose storage
fabrics to IP.
As with other forms of high bandwidth communication, the data
transfer overhead in traditional IP implementations, particularly
the three bus crossings required for receiving data, may
substantially limit data center storage transfer performance
compared to what is commonplace with special-purpose storage
fabrics. In addition, data copying, even if it is performed within
a specialized IP-storage adapter, will substantially increase
transfer latency, which can noticeably degrade the performance of
both file systems, and applications.
Protocol offload and direct data placement comparable to what is
provided by existing storage fabric interfaces (Fibre Channel,
SCSI, FireWire, etc.) are possible pieces of a solution to the
problems created by IP data transfer overhead for block storage.
It has been claimed that block storage is such an important
application that IP block storage protocols should be directly
offloaded by network interface hardware, rather than through use of a
generic application-independent offload solution. However, even
the block storage community recognizes the benefits of more
general-purpose ways to reduce IP transfer overhead, and most
expect to eventually use such general-purpose capabilities for
block storage when they become available, if for no other reason
than it reduces the risks and impact of changing and evolving the
block storage protocols themselves.
2.4. File Storage
The file storage application exhibits a compound problem within the
data center. File servers and clients are subject to the
communication characteristics of both block storage and
client/server applications. The problems created by high transfer
overhead are particularly acute for file storage implementations
that are built with a substantial amount of user-mode code. In any
form of file storage application, many CPU cycles are spent
traversing the kernel mode file system, disk storage subsystems,
protocol stacks, and driving network hardware, similar to the block
storage scenario. In addition, file systems must address the
communication problems of a distributed client/server application.
There may be substantial shared state distributed among servers and
clients creating the need for extensive communication to maintain
this shared state.
A solution to the communication overhead problems of IP data
transfer for file storage involves a union of the approaches for
efficient disk storage and efficient client/server communication,
as discussed above. In other words, both low overhead and low
latency communication are goals.
2.5. Backup
One of the problems with IP-based storage backup is that it
consumes a great deal of the host CPU's time and resources.
Unfortunately, the high overhead required for IP-based backup is
typically not acceptable in an active data center.
The challenge of backup is that it is usually performed on machines
which are also actively participating in the services the data
center is providing. At a minimum, a machine performing backup
must maintain some synchronization with other machines modifying
the state being backed up, so the backup is coherent. As discussed
in the section above on Scalable Data Center Applications, any
overhead placed on active machines can substantially affect
scalability and solution cost.
Backup solutions on specialized storage fabrics allow systems to
back up data without the host processor ever touching the data.
Data is transferred to the backup device from disk storage through
host memory, or sometimes even directly without passing through the
host, as a so-called third party transfer.
Storage backup in the data center could be done with IP if data
transfer overhead were substantially reduced.
2.6. The Common Thread
There is a common thread running through the problems of using IP
communication in all of these scenarios. The union of the
solutions to these problems is a high bandwidth, low latency, low
CPU overhead data transfer solution. Non-IP solutions offer
technical solutions to these problems, but they lack the
ubiquity and price/performance characteristics necessary for a
viable, general solution.
3. Non-IP Solutions
The most refined non-IP solution to reducing communication
overhead has a rich history reaching back almost 20 years. This
solution uses a data transfer metaphor called Remote Direct Memory
Access (RDMA). See Appendix A for an introduction to RDMA. In
spite of the technical advantages of the various non-IP solutions,
all have ultimately lacked the ubiquity and price/performance
characteristics necessary to gain widespread usage. This lack of
widespread adoption has also resulted in various shortcomings of
particular incarnations, such as incomplete integration with native
platform capabilities, or other software implementation
limitations. In addition, no non-IP solutions offer the massive
range of network scalability IP protocols support. Non-IP
solutions typically only scale to tens or hundreds of nodes in a
single network, and have no story to tell about interconnection of
multiple networks.
Several non-IP solutions will be briefly described here to show the
state of experience with this set of problems.
3.1. Proprietary Solutions
Low overhead communication technologies have traditionally been
developed as proprietary value-added products by computer platform
vendors. Such solutions were tightly integrated with platform
operating systems and did provide powerful, well integrated
communication capabilities. However, applications written for one
solution were not portable to others. Also, the solutions were
expensive, as is typically the case with value-added technologies.
The earliest example of a low overhead communication technology
was Digital's VAX Cluster Interconnect (CI), first released in
1983. The CI allowed computers and storage to be connected as
peers on a small multipoint network used for both IPC and I/O. The
CI made VAX/VMS Clusters the only alternative to mainframes for
large commercial applications for many years.
Tandem ServerNet was another proprietary block transfer
technology developed in the mid 1990s. It has been used to perform
disk I/O, IPC and network I/O in the Himalaya product line. This
architecture allows the Himalaya platform to be inherently scalable
because the software has been designed to take advantage of the
offload capability and zero copy techniques. Tandem attempted to
take this product into the Industry Standard Server market, but its
price/performance characteristics and proprietary nature prevented
wide adoption.
Silicon Graphics used a standards-based network fabric, HiPPI-800,
but built a proprietary low overhead communication mechanism on
top. Other platform vendors such as IBM, HP and Sun have also
offered a variety of proprietary low overhead communication
solutions over the years.
3.2. Standards-based Solutions
Increasing fluidity in the landscape of major platform vendors has
drastically increased the desire for all applications to be
portable. Platforms which were here yesterday might be gone
tomorrow. This has killed the willingness of application and data
center designers and maintainers to use proprietary features of any
platform.
Unwillingness to continue to use proprietary interconnects forced
platform vendors to collaborate on standards-based low overhead
communication technologies to replace the proprietary ones which
had become critical to building data center applications. Two of
these standards-based solutions, considered to be roughly parent and
child, are described below.
3.2.1. The Virtual Interface Architecture (VIA)
VIA [VI] was a technology jointly developed by Compaq, Intel and
Microsoft. VIA helped prove the feasibility of doing IPC offload,
user mode I/O and traditional kernel mode I/O as well.
While VIA implementations met with some limited success, VIA turned
out to only fill a small market niche, for several reasons. First,
commercially available operating systems lacked a pervasive
interface. Second, because the standard did not define a wire
protocol, no two implementations of the VIA standard were
interoperable on the wire. Third, different implementations were
not interoperable at the software layer either, since the API
definition was an appendix to the specification and not part of the
specification itself.
Yet with parallel applications, VIA proved itself time and again.
It was used to set the new benchmark record in the terabyte data
sort at Sandia Labs. It set new TPC-C records for distributed
databases, and it was used to set new TPC-C records as the client-
server communication link. VIA also set the foundation for work
such as the Sockets Direct Protocol through the implementation of
the Winsock Direct Protocol in Windows 2000 [WSD]. And it gave the
DAFS collective a rally point for a common programming interface
[DAFSAPI].
3.2.2. InfiniBand
InfiniBand [IB] was developed by the InfiniBand Trade Association
(IBTA) as a low overhead communication technology that provides
remote direct memory access transfers, including interlocked atomic
operations, as well as traditional datagram-style transfers.
InfiniBand defines a new electromechanical interface, card and
cable form factors, physical interface, link layer, transport layer
and upper layer software transport interface. The IBTA has also
described a fabric management infrastructure to initialize and
maintain the fabric.
While all of the specialized technology of InfiniBand does provide
impressive performance characteristics, IB lacks the ubiquity and
price/performance of IP. In addition, management of InfiniBand
fabrics will require new tools and training, and InfiniBand lacks
the huge base of applications, protocols, and thoroughly engineered
security and routing technology available in
IP.
4. Conclusion
This document has described the set of problems that hinder the
widespread use of IP for high speed data transfers in data centers.
There have been a variety of other, non-IP solutions available
which have met with only limited success, for different reasons.
After many years of experience in both the IP and non-IP domains,
the problems appear to be reasonably well understood, and a
direction to a solution is suggested by this study. However, some
additional investigation, and subsequent work on an
architecture and the necessary protocol(s) for reducing overhead in
high bandwidth IP data transfers, are required.
5. Security Considerations
This draft states a problem and, therefore, does not require
particular security considerations other than those dedicated to
squelching the free spread of ideas, should the problem discussion
itself be considered seditious or otherwise unsafe.
6. References
[Chase]
J. S. Chase, et al., "End system optimizations for high-
speed TCP", IEEE Communications Magazine , Volume: 39, Issue:
4 , April 2001, pp 68-74.
http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}
[DAFSAPI]
"Direct Access File System Application Programming Interface",
version 0.9.5, 09/21/2001.
http://www.dafscollaborative.org/tools/dafs_api.pdf
[FCIP]
R. Bhagwat, et al., "Fibre Channel Over TCP/IP (FCIP)",
09/20/2001. http://www.ietf.org/internet-drafts/draft-ietf-
ips-fcovertcpip-06.txt
[HTTP]
J. Gettys et al., "Hypertext Transfer Protocol - HTTP/1.1",
RFC 2616, June 1999
[IB] InfiniBand Architecture Specification, Volumes 1 and 2,
release 1.0.a. http://www.infinibandta.org
[iFCP]
C. Monia, et al., "iFCP - A Protocol for Internet Fibre
Channel Storage Networking", 10/19/2001.
http://www.ietf.org/internet-drafts/draft-ietf-ips-ifcp-06.txt
[iSCSI]
J. Satran, et al., "iSCSI", 10/01/2001.
http://www.ietf.org/internet-drafts/draft-ietf-ips-
iscsi-08.txt
[NFSv3]
B. Callaghan, "NFS Version 3 Protocol Specification", RFC
1813, June 1995
[SCTP]
R.R. Stewart, Q. Xie, K. Morneault, C. Sharp, H.J.
Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang,
and, V. Paxson, "Stream Control Transmission Protocol,"
RFC2960, October 2000.
[TCP]
Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981
[VI] Virtual Interface Architecture Specification version 1.0.
http://www.viarch.org/html/collateral/san_10.pdf
[WSD]
"Winsock Direct and Protocol Offload On SANs", version 1.0,
3/3/2001, from "Designing Hardware for the Microsoft Windows
Family of Operating Systems".
http://www.microsoft.com/hwdev/network/san
Authors' Addresses
Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810
USA
Phone: +1 978 689 1614
Email: steph@sandburst.com
Dave Garcia
Compaq Computer Corp.
19333 Valco Parkway
Cupertino, CA 95014
USA
Phone: +1 408 285 6116
EMail: dave.garcia@compaq.com
Jeff Hilland
Compaq Computer Corp.
20555 SH 249
Houston, TX 77070
USA
Phone: +1 281 514 9489
EMail: jeff.hilland@compaq.com
Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA
Phone: +1 408 525 8836
Email: allyn@cisco.com
Appendix A. RDMA Technology Overview
This section describes how Remote Direct Memory Access (RDMA)
technology such as the Virtual Interface Architecture (VIA) and
InfiniBand (IB) provide for low overhead data transfer. VIA and IB
are examples of the RDMA technology also used by many proprietary
low overhead data transfer solutions.
The IB and VIA protocols both provide memory access and push
transfer semantics. With memory access transfers, data from the
local computer is written/read directly to/from an address space of
the remote computer. How, when and why buffers are accessed is
defined by the ULP layer above IB or VIA.
With push transfers, the data source pushes data to an anonymous
receive buffer at the destination. TCP and UDP transfers are both
examples of push transfers. VIA and IB both call their push
transfer a Send operation, which is a datagram-style push transfer.
The data receiver chooses where to place the data; the receive
buffer is anonymous with respect to the sender of the data.
A.1 Use of Memory Access Transfers
In the memory access transfer model, the initiator of the data
transfer explicitly indicates where data is extracted from or
placed on the remote computer. VI and InfiniBand both define
memory access read (called RDMA Read) and memory access write
(called RDMA Write) transfers. The buffer address is carried in
each PDU allowing the network interface to directly place the data
in application buffers. Placing the data directly into the
application's buffer has three significant benefits:
o CPU and memory bus utilization are lowered by not having to
copy the data. Since memory access transfers use buffer
addresses supplied by the application, data can be directly
placed at its final location.
o Memory access transfers incur no CPU overhead during transfers
if the network interface offloads RDMA (and lower layer)
protocol processing. There is enough information in RDMA PDUs
for the target network interface to complete RDMA Reads or
RDMA Writes without any local CPU action.
o Memory access transfers allow splitting of ULP headers and
data. With memory access transfers, the ULP can control the
exact placement of all received data, including ULP headers
and ULP data. ULP headers and other control information can
be placed in separate buffers from ULP data. This is
frequently a distinct advantage compared to having ULP headers
and data in the same buffers, as an additional data copy may
be otherwise required to separate them.
Providing memory access transfers does not mean a processor's
entire memory space is open for unprotected transfers. The remote
computer controls which of its buffers can be accessed by memory
access transfers. Incoming RDMA Read and RDMA Write operations can
only access buffers to which the receiving host has explicitly
permitted RDMA accesses. When the ULP allows RDMA access to a
buffer, the extent and address characteristics of the buffer can be
chosen by the ULP. A buffer could use the virtual address space of
the process, it could be a physical address (if allowed), or it
could be a new virtual address space created for the individual
buffer.
In both IB and VIA, the RDMA buffer must be registered with the receiving
network interface before RDMA operations can occur. For a typical
hardware offload network interface, this is enough information to
build an address translation table and associate appropriate
security information with the buffer. The address translation table
lets the NIC convert the incoming buffer target address into a
local physical address.
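A minimal sketch of the state created by such a registration, and of
the check and translation applied to an incoming RDMA target address,
is shown below.  The structure and function names are hypothetical
and do not correspond to VIA, InfiniBand, or any other specific
interface; the buffer is assumed to start on a page boundary.

   /*
    * Hypothetical sketch of the per-buffer state an RDMA-capable
    * network interface keeps after registration, and of the
    * translation applied to the target address carried in an
    * incoming RDMA Read or RDMA Write PDU.  Names are illustrative.
    */

   #include <stddef.h>
   #include <stdint.h>

   #define PAGE_SHIFT 12
   #define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)

   struct rdma_region {              /* one registered buffer          */
       uint32_t  key;                /* quoted in each RDMA PDU        */
       uint64_t  base_va;            /* buffer address the peer uses   */
       uint64_t  length;             /* extent the ULP opened for RDMA */
       int       remote_write_ok;    /* access right granted by ULP    */
       uint64_t *page_phys;          /* physical address of each page  */
   };

   /*
    * Translate a (key, virtual address) pair from an incoming RDMA
    * PDU into a local physical address, enforcing the bounds and
    * rights granted at registration time.  Returns 0 on a violation.
    */
   uint64_t rdma_translate(const struct rdma_region *r, uint32_t key,
                           uint64_t va, uint64_t len, int is_write)
   {
       uint64_t off;

       if (key != r->key || va < r->base_va)
           return 0;
       off = va - r->base_va;
       if (len > r->length || off > r->length - len)
           return 0;                 /* outside the registered extent  */
       if (is_write && !r->remote_write_ok)
           return 0;                 /* ULP did not permit RDMA Write  */
       return r->page_phys[off >> PAGE_SHIFT] + (off & (PAGE_SIZE - 1));
   }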
A.2 Use Of Push Transfers
Memory access transfers contrast with the push transfers typically
used by IP applications. With push transfers the source has no
visibility or control over where data will be delivered on the
destination machine. While most protocols use some form of push
transfer, IB and VIA define a datagram-style push transfer that
allows a form of direct data placement on the receive side.
IB and VIA both require the application to pre-post receive
buffers. The application pre-posts receive buffers for a
connection and they are filled by subsequent incoming Send
operations. Since the receive buffer is pre-posted, the network
interface can place the data from the incoming Send operation
directly into the application's buffer. IB and VIA allow use of
scattered receive buffers to support splitting the ULP header from
data within a single Send.
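The pre-posting model can be sketched as a simple queue of
application-owned buffers, as below.  The names and the fixed queue
depth are hypothetical and are used only to illustrate how a posted
buffer is consumed by an incoming Send without an intermediate copy.

   /*
    * Hypothetical sketch of pre-posted receive buffers for push
    * (Send) transfers.  The application posts buffers ahead of time;
    * the network interface fills them in order, placing each incoming
    * Send directly into the oldest posted buffer.
    */

   #include <stddef.h>

   #define MAX_POSTED 64

   struct recv_queue {
       void  *buf[MAX_POSTED];       /* application-owned buffers       */
       size_t len[MAX_POSTED];
       unsigned head;                /* next buffer the NIC will fill   */
       unsigned tail;                /* next slot the application posts */
   };

   /* Application: make a buffer available for a future incoming Send. */
   int post_receive(struct recv_queue *q, void *buf, size_t len)
   {
       unsigned next = (q->tail + 1) % MAX_POSTED;

       if (next == q->head)
           return -1;                /* no room to post another buffer  */
       q->buf[q->tail] = buf;
       q->len[q->tail] = len;
       q->tail = next;
       return 0;
   }

   /* Network interface (conceptually): hand the oldest posted buffer
    * to an arriving Send; if none is posted, the Send cannot be
    * placed. */
   void *consume_receive(struct recv_queue *q, size_t *len_out)
   {
       void *buf;

       if (q->head == q->tail)
           return NULL;              /* no receive buffer posted        */
       buf = q->buf[q->head];
       *len_out = q->len[q->head];
       q->head = (q->head + 1) % MAX_POSTED;
       return buf;
   }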
Neither memory access nor push transfers are inherently superior --
each has its merits. Furthermore, memory access transfers can be
built atop push transfers or vice versa. However, direct support
of memory access transfers allows much lower transfer overhead than
if memory access transfers are emulated.
A.3 RDMA-based I/O Example
If the RDMA protocol is offloaded to the network interface, the
RDMA Read operation allows an I/O subsystem, such as a storage
array, to fully control all aspects of data transfer for
outstanding I/O operations. An example of a simple I/O operation
shows several benefits of using memory access transfers.
Consider an I/O block Write operation where the host processor
wishes to move a block of data (the data source) to an I/O
subsystem. The host first registers the data source with its
network interface as an RDMA address block. Next the host pushes a
small Send operation to the I/O subsystem. The message describes
the I/O write request and tells the I/O subsystem where it can find
the data in the virtual address space presented through the
communication connection by the network interface. After receiving
this message, the I/O subsystem can pull the data from the host's
buffer as needed. This gives the I/O subsystem the ability to both
schedule and pace its data transfer, thereby requiring less
buffering on the I/O subsystem. When the I/O subsystem completes
the data pull, it pushes a completion message back to the host with
a small Send operation. The completion message tells the host the
I/O operation is complete and that it can deregister its RDMA
block.
In this example the host processor spent very few CPU cycles doing
the I/O block Write operation. The processor sent out a small
message and the I/O subsystem did all the data movement. After the
I/O operation was completed the host processor received a single
completion message.
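A host-side sketch of this exchange, in C, follows.  The function and
structure names (rdma_register(), send_small_message(), and so on)
are placeholders introduced only for illustration; they are not the
API of VIA, InfiniBand, or any other interface.

   /*
    * Hypothetical host-side sequence for the I/O block Write
    * described above.  All names are placeholders, not a real API.
    */

   #include <stddef.h>
   #include <stdint.h>

   struct write_request {            /* small Send to the I/O subsystem */
       uint64_t src_va;              /* where the data block lives ...  */
       uint32_t src_key;             /* ... in the registered region    */
       uint32_t length;
       uint64_t disk_block;          /* illustrative I/O parameter      */
   };

   struct completion_msg {
       uint32_t status;              /* success or failure of the write */
   };

   /* Placeholder prototypes for the hypothetical interface. */
   extern uint32_t rdma_register(int conn, const void *buf, uint32_t len);
   extern void     rdma_deregister(int conn, uint32_t key);
   extern void     send_small_message(int conn, const void *msg, size_t len);
   extern void     wait_for_message(int conn, void *msg, size_t len);

   int block_write(int conn, const void *data, uint32_t length,
                   uint64_t disk_block)
   {
       struct write_request  req;
       struct completion_msg done;
       uint32_t key;

       /* 1. Register the data source so the I/O subsystem may pull it. */
       key = rdma_register(conn, data, length);

       /* 2. Push a small Send describing the request.  The I/O
        *    subsystem issues RDMA Reads against (src_va, src_key) at
        *    its own pace; no further host CPU cycles are spent moving
        *    the data. */
       req.src_va     = (uint64_t)(uintptr_t)data;
       req.src_key    = key;
       req.length     = length;
       req.disk_block = disk_block;
       send_small_message(conn, &req, sizeof(req));

       /* 3. A small Send from the I/O subsystem signals completion. */
       wait_for_message(conn, &done, sizeof(done));

       /* 4. The data source may now be deregistered. */
       rdma_deregister(conn, key);
       return (int)done.status;
   }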
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the
purpose of developing Internet standards in which case the
procedures for copyrights defined in the Internet Standards process
must be followed, or as required to translate it into languages
other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on
an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.