Internet-Draft                                   S. Bailey (Sandburst)
Expires: May 2002                                    D. Garcia (Compaq)
                                                    J. Hilland (Compaq)
                                                     A. Romanow (Cisco)

                    Direct Access Problem Statement
                  draft-garcia-direct-access-problem-00
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This problem statement describes barriers to the use of Internet
Protocols for highly scalable, high bandwidth, low latency
transfers necessary in some of today's important applications,
particularly applications found within data centers. In addition
to describing technical reasons for the problems, it gives an
overview of common non-IP solutions to these problems which have
been deployed over the years.
The perspective of this draft is that it would be very beneficial
to have an IP-based solution for these problems so IP can be used
for high speed data transfers within data centers, in addition to
IP's many other uses.
Table Of Contents
1.     Introduction
1.1.   High Bandwidth Transfer Overhead
1.2.   Proliferation Of Fabrics in Data Centers
1.3.   Potential Solutions
2.     High Bandwidth Data Transfer In The Data Center
2.1.   Scalable Data Center Applications
2.2.   Client/Server Communication
2.3.   Block Storage
2.4.   File Storage
2.5.   Backup
2.6.   The Common Thread
3.     Non-IP Solutions
3.1.   Proprietary Solutions
3.2.   Standards-based Solutions
3.2.1. The Virtual Interface Architecture (VIA)
3.2.2. InfiniBand
4.     Conclusion
5.     Security Considerations
6.     References
       Authors' Addresses
A.     RDMA Technology Overview
A.1    Use of Memory Access Transfers
A.2    Use Of Push Transfers
A.3    RDMA-based I/O Example
       Full Copyright Statement
1. Introduction
Protocols in the IP family offer a huge, ever increasing range of
functions, including mail, messaging, telephony, media and
hypertext content delivery, block and file storage, and network
control. IP has been so successful that applications only use
other forms of communication when there is a very compelling
reason. Currently, it is often not acceptable to use IP protocols
for high-speed communication within a data center. In these cases,
copying data to application buffers consumes too much of the CPU
capacity that is otherwise needed to perform application functions.
This limitation of IP protocols has not been particularly important
until now because the domain of high performance transfers was
limited to a relatively specialized niche of low volume
applications, such as scientific supercomputing. Applications that
needed more efficient transfer than IP could offer simply used
other purpose-built solutions.
As the use of the Internet has become pervasive and critical, the
growth in number and importance of data centers has matched the
growth of the Internet. The role of the data center is similarly
critical. The high-end environment of the data center makes up the
core and nexus of today's Internet. Everything goes in and out of
data centers.
Applications running within data centers frequently require high
bandwidth data transfer. Due to the high host processing overhead
of high bandwidth communication in IP, the industry has developed
non-IP technology to serve data center traffic. That said, the
obstacles to lowering host processing overhead in IP are well
understood and straightforward to address. Simple techniques could
allow the penetration of existing IP protocols into data centers
where non-IP technology is currently used.
Technology advances have made feasible specially designed network
interfaces that place IP protocol data directly in application
buffers. While it is certainly possible to use control information
directly from existing IP protocol messages to place data in
application buffers, the sheer number and diversity of current
and future IP protocols calls for a generic solution instead.
Therefore, the goal is to investigate a generic data placement
solution for IP protocols that would allow a single network
interface to perform direct data placement for a wide variety of
mature, evolving and completely new protocols.
There is a great desire to develop lower overhead, more scalable
data transfer technology based on IP. This desire comes from the
advantages of using one protocol technology rather than several,
and from the many efficiencies of technology based upon a single,
widely adopted, open standard.
This document describes the problems that IP faces in delivering
highly scalable high bandwidth data transfer. The first section
describes the issues in general. The second section describes
several specific scenarios, discussing particular application
domains and specific problems that arise. The third section
describes approaches that have historically been used to address
low overhead, high bandwidth data transfer needs. The appendix
gives an overview of how a particular class of non-IP technologies
addresses this problem with Remote Direct Memory Access (RDMA).
1.1. High Bandwidth Transfer Overhead
Transport protocols such as TCP [TCP] and SCTP [SCTP] have
successfully shielded upper layers from the complexities of moving
data between two computers. This has been very successful in
making TCP/IP ubiquitous. However, with current IP
implementations, Upper Layer Protocols (ULPs), such as NFS [NFSv3]
and HTTP [HTTP], require incoming data packets to be buffered and
copied before the data is used.
It is this data copying that is a primary source of overhead in IP
data transfers. Copying received data for high bandwidth transfers
consumes significant processing time and memory bandwidth. If data
is buffered and then copied, the data moves across the memory bus
at least three times during the data transfer. By comparison, if
the incoming data is placed directly where the application requires
it, the data moves across the memory bus only once. This copying
overhead currently means that additional processing resources, such
as additional processors in a multiprocessor machine, are needed to
reach faster and faster wire speeds.
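As a concrete illustration of the receive-side copy, the sketch below
shows a conventional sockets receive loop in C.  It is only a sketch:
the function name, buffer handling, and the 10 gigabit arithmetic in
the comments are illustrative assumptions, not measurements from any
particular system.

   /*
    * Illustrative sketch of a conventional copy-based receive path.
    * The NIC DMAs each arriving packet into a kernel buffer (memory
    * bus crossing 1); recv() then reads that kernel buffer (crossing
    * 2) and writes the payload into the application buffer (crossing
    * 3).  At 10 Gb/s of payload (1.25 GB/s), this is roughly 3.75
    * GB/s of memory traffic, versus 1.25 GB/s if the NIC placed the
    * data directly in the application buffer.
    */

   #include <sys/types.h>
   #include <sys/socket.h>

   ssize_t receive_into_app_buffer(int sock, char *app_buf, size_t len)
   {
       size_t  got = 0;
       ssize_t n;

       while (got < len) {
           /* Each recv() copies data the kernel has already buffered
            * into the application's buffer. */
           n = recv(sock, app_buf + got, len - got, 0);
           if (n <= 0)
               return n;           /* error or connection closed */
           got += (size_t)n;
       }
       return (ssize_t)got;
   }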
A wide range of ad hoc solutions have been explored to eliminate
data copying overhead within the framework of current IP
protocols, but despite extensive study, still no adequate or
general solution exists [Chase].
1.2. Proliferation Of Fabrics in Data Centers
The current alternative to paying the high costs due to data
transfer overhead in data centers is the use of several different
communication technologies at once. Data centers are likely to have
separate IP (Ethernet), Fibre Channel storage, and InfiniBand, VIA or
proprietary interprocess communication (IPC) networks. Special
purpose networks are used for storage and IPC to reduce the
processor overhead associated with data communications; and in the
case of IPC, to reduce latency as well.
Using such proprietary and special purpose solutions runs counter
to the requirements of data center computing. Data center
designers and operators do not want the expense and complexity of
building and maintaining three separate communications networks.
Three NICs and three fabric ports are expensive, consume valuable
IO card slots, power and machine room space.
A single IP fabric would be far preferable. IP networks are best
positioned to fill the role of all three of these existing
networks. At 1 to 10 gigabit speeds current IP interconnects could
offer comparable or superior performance characteristics to special
purpose purpose interconnects, if it were not for the high overhead
and latency of IP data transfers. An IP-based alternative to the
IPC and storage fabrics would be less costly, and much more easily
manageable than maintaining separate communication fabrics.
1.3. Potential Solutions
One frequently proposed solution to the problem of data transfer
overhead in IP data transfers is to wait for the next generation of
faster processors and speedier memories to render the problem
irrelevant. However, in the evolution of the Internet, processor
and memory speeds are not the only variables that have increased
exponentially over time. Data link speeds have grown exponentially
as well. Recently, spurred by the demand for core network
bandwidth, data link speeds have grown faster than both processor
computation rates and processor memory transfer rates. Whatever
speed increases occur in processors and memories, it is clear that
link speeds will continue to grow aggressively as well.
Rather than relying on increasing CPU performance, non-IP solutions
use network interface hardware to attack several distinct sources of
overhead. For a small, one-way IP data
transfer, typically both the sender and receiver must make several
context switches, process several interrupts, and send and receive
a network packet. In addition, the receiver must perform at least
one data copy. This single transfer could require 10,000
instructions of execution and total time measured in hundreds of
microseconds if not milliseconds. The sources of overhead in this
transfer are:
o context switches and interrupts,
o execution of protocol code,
o copying the data on the receiver.
Copying competes with DMA and other processor accesses for memory
system bandwidth, and all these sources of overhead can also have
significant secondary effects on the efficiency of application
execution by interfering with system caches.
Depending on the application, each of these sources of overhead may
be a small or a large factor in total overhead, but the cumulative
effect of all of them is nearly always substantial for high
bandwidth transfers. If data transfers are very small, data
copying is only a small cost, but context switching and protocol
stack execution become performance limiting factors. For large
transfers, the most common high bandwidth data transfers, context
switching and protocol stack execution can be amortized away,
within certain limits, but data copying becomes costly.
Non-IP solutions address these sources of overhead with network
interface hardware that:
o reduces context switches and interrupts with kernel-bypass
capability, where the application communicates directly
through the network interface without kernel intervention,
o reduces protocol stack processing with protocol offload
hardware that performs some or all protocol processing (e.g.
ACK processing),
o reduces data copying overhead by placing data directly in
application buffers.
The application of these techniques reduces both data transfer
overhead, and data transfer latency. Context switches and data
copying are substantial sources of end-to-end latency that are
eliminated by kernel-bypass and direct data placement. Offloaded
protocol processing can also typically be performed an order of
magnitude faster than a comparable, general purpose protocol stack,
due to the ability to exploit extensive parallelism in hardware.
While protocol offload does reduce overhead, for the vast majority
of current high bandwidth data transfer applications, eliminating
data copies is much more important.
These techniques, and others, may be equally applicable to reducing
the overhead of IP data transfers.
2. High Bandwidth Data Transfer In The Data Center
There are numerous uses of high bandwidth data transfers in today's
data centers. While these applications are found in the data
center, they have implications for the desktop as well. This
problem statement focuses on data center scenarios below, but it
would be beneficial to find a solution that meets data center needs while
possibly remaining affordable for the desktop.
Why is high bandwidth data transfer in the data center important
for IP networking? Performance on the Internet, as well as
intranets, is dependent on the performance of the data center.
Every request, be it a web page, database query or file and print
service goes to or through data center servers. Often a multi-
tiered computing solution is used, where multiple machines in the
data center satisfy these requests. Despite the explosive growth
of the server market, data centers are running into critical
limitations that impact every client directly or indirectly.
Unlike servers, clients are largely limited in performance by the
human at the interface. In contrast, data center performance is
limited by the speeds and feeds of the network and I/O devices as
well as hardware and software components.
With new protocols such as iSCSI, IP networks are increasingly
taking on the functions of special purpose interconnects, such as
Fibre Channel. However, the limitations created by high data
transfer overhead described here have not as yet been addressed for
IP protocols in general.
First and foremost, all the problems illustrated in scenarios below
occur on IP protocol based networks. It is imperative to
understand the pervasiveness of IP networks within the data center
and that all of the problems described below occur in IP-based data
transfer solutions. Therefore, a solution to these problems will
naturally also be a part of the IP protocol suite.
Although the problems discussed below manifest themselves in
different ways, investigation into the source of these problems
shows a common thread running through them. These scenarios are
not an exhaustive list, but rather describe the wide range of problems
exhibited in scalability and performance of the applications and
infrastructures encountered in data center computing as a result of
high communication overhead.
2.1. Scalable Data Center Applications
A key characteristic of any data center application is its ability
to scale as demands increase. For many Internet services,
applications must scale in response to the success of the service
and the increased demand which results. In other cases,
applications must be scaled as capabilities are added to a service,
again in response to the success of the service, changes in the
competitive environment or goals of the provider.
Virtually all data center applications require intermachine
communication, and therefore, application scalability may be
directly limited by communication overhead. From the application
viewpoint, every CPU cycle spent performing data transfer is a
wasted cycle that affects scalability. For high bandwidth data
transfers using IP, this overhead can be 30-40% of available CPU.
If an application is running on a single server, and it is
scaled by adding a second server, communication overhead of 40%
means that the CPU available to the application from two servers is
only 120% of that of the single server. The problem is even worse
with many servers, because most servers are communicating with more
than one other server. If three servers are connected in a
pipeline where 40% CPU is required for data transfers to or from
another server, the total available CPU power would still be only
120% of the power of a single server! Not all data center
applications require this level of communication, but many do. The
high overhead of data transfers in IP severely impacts the
viability of IP for scalable data center applications.
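Stated as simple arithmetic, and using the 40% figure above purely as
an illustration, the two-server case above works out as:

   useful CPU per server = 100% - 40% communication overhead = 60%
   aggregate useful CPU  = 2 x 60% = 120% of a single server

Each additional server a machine must exchange high bandwidth data
with consumes a further share of that machine's cycles, which is why
adding servers to a communicating application yields far less than
linear scaling.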
2.2. Client/Server Communication
Client/server communication in the data center is a variation of
the scalable data center application scenario, but applies to
standalone servers as well as parallel applications. The overhead
of high bandwidth data communication weighs heavily on the server.
The server's ability to respond is limited by any communication
overhead it incurs.
In addition, client/server application performance is often
dominated by data transfer latency characteristics. Reducing
latency can greatly improve application performance. Techniques
commonly employed in IP network interfaces, such as TCP checksum
calculation offload, reduce transfer overhead somewhat, but they
typically do not reduce latency at all. Another technique used to
reduce latency in IP communication is to dedicate multiple threads
of execution, each running on a separate processor, to processing
requests concurrently. However, this multithreading solution has
limits, as the number of outstanding requests can vastly exceed the
number of processors. Furthermore, the effect of multithreading
concurrency is additive with any other latency reduction in the
data transfers themselves.
To address the problems of high bandwidth IP client/server
communication, a solution would ideally reduce both end to end
communication latency, and communication overhead.
2.3. Block Storage
Block storage, in the form of iSCSI [iSCSI] and IP Fibre Channel
protocols [FCIP, iFCP], is a new IP application area of great
interest to the storage and data center communities. Just as data
centers eagerly desire to replace special-purpose interprocess
communication fabrics with IP, there is parallel and equal interest
in migrating block storage traffic from special-purpose storage
fabrics to IP.
As with other forms of high bandwidth communication, the data
transfer overhead in traditional IP implementations, particularly
the three bus crossings required for receiving data, may
substantially limit data center storage transfer performance
compared to what is commonplace with special-purpose storage
fabrics. In addition, data copying, even if it is performed within
a specialized IP-storage adapter, will substantially increase
transfer latency, which can noticeably degrade the performance of
both file systems, and applications.
Protocol offload and direct data placement comparable to what is
provided by existing storage fabric interfaces (Fibre Channel,
SCSI, FireWire, etc.) are possible pieces of a solution to the
problems created by IP data transfer overhead for block storage.
It has been claimed that block storage is such an important
application that IP block storage protocols should be directly
offloaded by network interface hardware, rather than through use of a
generic application-independent offload solution. However, even
the block storage community recognizes the benefits of more
general-purpose ways to reduce IP transfer overhead, and most
expect to eventually use such general-purpose capabilities for
block storage when they become available, if for no other reason
than it reduces the risks and impact of changing and evolving the
block storage protocols themselves.
2.4. File Storage
The file storage application exhibits a compound problem within the
data center. File servers and clients are subject to the
communication characteristics of both block storage and
client/server applications. The problems created by high transfer
overhead are particularly acute for file storage implementations
that are built with a substantial amount of user-mode code. In any
form of file storage application, many CPU cycles are spent
traversing the kernel mode file system, disk storage subsystems,
protocol stacks, and driving network hardware, similar to the block
storage scenario. In addition, file systems must address the
communication problems of a distributed client/server application.
There may be substantial shared state distributed among servers and
clients creating the need for extensive communication to maintain
this shared state.
A solution to the communication overhead problems of IP data
transfer for file storage involves a union of the approaches for
efficient disk storage and efficient client/server communication,
as discussed above. In other words, both low overhead and low
latency communication are goals.
2.5. Backup
One of the problems with IP-based storage backup is that it
consumes a great deal of the host CPU's time and resources.
Unfortunately, the high overhead required for IP-based backup is
typically not acceptable in an active data center.
The challenge of backup is that it is usually performed on machines
which are also actively participating in the services the data
center is providing. At a minimum, a machine performing backup
must maintain some synchronization with other machines modifying
the state being backed up, so the backup is coherent. As discussed
in the section above on Scalable Data Center Applications, any
overhead placed on active machines can substantially affect
scalability and solution cost.
Backup solutions on specialized storage fabrics allow systems to
back up data without the host processor ever touching the data.
Data is transferred to the backup device from disk storage through
host memory, or sometimes even directly without passing through the
host, as a so-called third party transfer.
Storage backup in the data center could be done with IP if data
transfer overhead were substantially reduced.
2.6. The Common Thread
There is a common thread running through the problems of using IP
communication in all of these scenarios. The union of the
solutions to these problems is a high bandwidth, low latency, low
CPU overhead data transfer solution. Non-IP solutions offer
technical solutions to these problems, but they lack the
ubiquity and price/performance characteristics necessary for a
viable, general solution.
3. Non-IP Solutions
The most refined non-IP solution to reducing communication
overhead has a rich history reaching back almost 20 years. This
solution uses a data transfer metaphor called Remote Direct Memory
Access (RDMA). See Appendix A for an introduction to RDMA. In
spite of the technical advantages of the various non-IP solutions,
all have ultimately lacked the ubiquity and price/performance
characteristics necessary to gain widespread usage. This lack of
widespread adoption has also resulted in various shortcomings of
particular incarnations, such as incomplete integration with native
platform capabilities, or other software implementation
limitations. In addition, no non-IP solutions offer the massive
range of network scalability IP protocols support. Non-IP
solutions typically only scale to tens or hundreds of nodes in a
single network, and have no story to tell about interconnection of
multiple networks.
Several non-IP solutions will be briefly described here to show the
state of experience with this set of problems.
3.1. Proprietary Solutions
Low overhead communication technologies have traditionally been
developed as proprietary value-added products by computer platform
vendors. Such solutions were tightly integrated with platform
operating systems and did provide powerful, well integrated
communication capabilities. However, applications written for one
solution were not portable to others. Also, the solutions were
expensive, as is typically the case with value-added technologies.
The earliest example of a low overhead communication technology
was Digital's VAX Cluster Interconnect (CI), first released in
1983. The CI allowed computers and storage to be connected as
peers on a small multipoint network used for both IPC and I/O. The
CI made VAX/VMS Clusters the only alternative to mainframes for
large commercial applications for many years.
Tandem ServerNet was another proprietary block transfer
technology developed in the mid 1990s. It has been used to perform
disk I/O, IPC and network I/O in the Himalaya product line. This
architecture allows the Himalaya platform to be inherently scalable
because the software has been designed to take advantage of the
offload capability and zero copy techniques. Tandem attempted to
take this product into the Industry Standard Server market, but its
price/performance characteristics and proprietary nature prevented
wide adoption.
Silicon Graphics used a standards-based network fabric, HiPPI-800,
but built a proprietary low overhead communication mechanism on
top. Other platform vendors such as IBM, HP and Sun have also
offered a variety of proprietary low overhead communication
solutions over the years.
3.2. Standards-based Solutions
Increasing fluidity in the landscape of major platform vendors has
drastically increased the desire for all applications to be
portable. Platforms which were here yesterday might be gone
tomorrow. This has killed the willingness of application and data
center designers and maintainers to use proprietary features of any
platform.
Unwillingness to continue to use proprietary interconnects forced
platform vendors to collaborate on standards-based low overhead
communication technologies to replace the proprietary ones which
had become critical to building data center applications. Two of
these standards-based solutions, considered to be roughly parent and
child, are described below.
3.2.1. The Virtual Interface Architecture (VIA)
VIA [VI] was a technology jointly developed by Compaq, Intel and
Microsoft. VIA helped prove the feasibility of doing IPC offload,
user mode I/O and traditional kernel mode I/O as well.
While VIA implementations met with some limited success, VIA turned
out to only fill a small market niche, for several reasons. First,
commercially available operating systems lacked a pervasive
interface. Second, because the standard did not define a wire
protocol, no two implementations of the VIA standard were
interoperable on the wire. Third, different implementations were
not interoperable at the software layer either, since the API
definition was an appendix to the specification and not part of the
specification itself.
Yet with parallel applications, VIA proved itself time and again.
It was used to set the new benchmark record in the terabyte data
sort at Sandia Labs. It set new TPC-C records for distributed
databases, and it was used to set new TPC-C records as the client-
server communication link. VIA also set the foundation for work
such as the Sockets Direct Protocol through the implementation of
the Winsock Direct Protocol in Windows 2000 [WSD]. And it gave the
DAFS collective a rally point for a common programming interface
[DAFSAPI].
3.2.2. InfiniBand
InfiniBand [IB] was developed by the InfiniBand Trade Association
(IBTA) as a low overhead communication technology that provides
remote direct memory access transfers, including interlocked atomic
operations, as well as traditional datagram-style transfers.
InfiniBand defines a new electromechanical interface, card and
cable form factors, physical interface, link layer, transport layer
and upper layer software transport interface. The IBTA has also
described a fabric management infrastructure to initialize and
maintain the fabric.
While all of the specialized technology of InfiniBand does provide
impressive performance characteristics, IB lacks the ubiquity and
price/performance of IP. In addition, management of InfiniBand
fabrics will require new tools and training, and InfiniBand lacks
the huge base of applications, protocols, and thoroughly engineered
security and routing technology available in
IP.
4. Conclusion
This document has described the set of problems that hinder the
widespread use of IP for high speed data transfers in data centers.
There have been a variety of other, non-IP solutions available
which have met with only limited success, for different reasons.
After many years of experience in both the IP and non-IP domains,
the problems appear to be reasonably well understood, and a
direction to a solution is suggested by this study. However, some
additional investigation, and subsequent work on an
architecture and the necessary protocol(s) for reducing overhead in
high bandwidth IP data transfers, are required.
5. Security Considerations
This draft states a problem and, therefore, does not require
particular security considerations other than those dedicated to
squelching the free spread of ideas, should the problem discussion
itself be considered seditious or otherwise unsafe.
6. References
[Chase]
J. S. Chase, et al., "End system optimizations for high-
speed TCP", IEEE Communications Magazine , Volume: 39, Issue:
4 , April 2001, pp 68-74.
http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}
[DAFSAPI]
"Direct Access File System Application Programming Interface",
version 0.9.5, 09/21/2001.
http://www.dafscollaborative.org/tools/dafs_api.pdf
[FCIP]
R. Bhagwat, et al., "Fibre Channel Over TCP/IP (FCIP)",
09/20/2001. http://www.ietf.org/internet-drafts/draft-ietf-
ips-fcovertcpip-06.txt
[HTTP]
J. Gettys et al., "Hypertext Transfer Protocol - HTTP/1.1",
RFC 2616, June 1999
[IB] InfiniBand Architecture Specification, Volumes 1 and 2,
release 1.0.a. http://www.infinibandta.org
[iFCP]
C. Monia, et al., "iFCP - A Protocol for Internet Fibre
Channel Storage Networking", 10/19/2001.
http://www.ietf.org/internet-drafts/draft-ietf-ips-ifcp-06.txt
[iSCSI]
J. Satran, et al., "iSCSI", 10/01/2001.
http://www.ietf.org/internet-drafts/draft-ietf-ips-
iscsi-08.txt
[NFSv3]
B. Callaghan, "NFS Version 3 Protocol Specification", RFC
1813, June 1995
[SCTP]
R.R. Stewart, Q. Xie, K. Morneault, C. Sharp, H.J.
Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang,
and, V. Paxson, "Stream Control Transmission Protocol,"
RFC2960, October 2000.
[TCP]
Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981
[VI] Virtual Interface Architecture Specification version 1.0.
http://www.viarch.org/html/collateral/san_10.pdf
[WSD]
"Winsock Direct and Protocol Offload On SANs", version 1.0,
3/3/2001, from "Designing Hardware for the Microsoft Windows
Family of Operating Systems".
http://www.microsoft.com/hwdev/network/san
Authors' Addresses
Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810
USA
Phone: +1 978 689 1614
Email: steph@sandburst.com
Dave Garcia
Compaq Computer Corp.
19333 Valco Parkway
Cupertino, CA 95014
USA
Phone: +1 408 285 6116
EMail: dave.garcia@compaq.com
Jeff Hilland
Compaq Computer Corp.
20555 SH 249
Houston, TX 77070
USA
Phone: +1 281 514 9489
EMail: jeff.hilland@compaq.com
Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA
Phone: +1 408 525 8836
Email: allyn@cisco.com
Appendix A. RDMA Technology Overview
This section describes how Remote Direct Memory Access (RDMA)
technology such as the Virtual Interface Architecture (VIA) and
InfiniBand (IB) provide for low overhead data transfer. VIA and IB
are examples of the RDMA technology also used by many proprietary
low overhead data transfer solutions.
The IB and VIA protocols both provide memory access and push
transfer semantics. With memory access transfers, data from the
local computer is written/read directly to/from an address space of
the remote computer. How, when and why buffers are accessed is
defined by the ULP layer above IB or VIA.
With push transfers, the data source pushes data to an anonymous
receive buffer at the destination. TCP and UDP transfers are both
examples of push transfers. VIA and IB both call their push
transfer a Send operation, which is a datagram-style push transfer.
The data receiver chooses where to place the data; the receive
buffer is anonymous with respect to the sender of the data.
A.1 Use of Memory Access Transfers
In the memory access transfer model, the initiator of the data
transfer explicitly indicates where data is extracted from or
placed on the remote computer. VI and InfiniBand both define
memory access read (called RDMA Read) and memory access write
(called RDMA Write) transfers. The buffer address is carried in
each PDU allowing the network interface to directly place the data
in application buffers. Placing the data directly into the
application's buffer has three significant benefits:
o CPU and memory bus utilization are lowered by not having to
copy the data. Since memory access transfers use buffer
addresses supplied by the application, data can be directly
placed at its final location.
o Memory access transfers incur no CPU overhead during transfers
if the network interface offloads RDMA (and lower layer)
protocol processing. There is enough information in RDMA PDUs
for the target network interface to complete RDMA Reads or
RDMA Writes without any local CPU action.
o Memory access transfers allow splitting of ULP headers and
data. With memory access transfers, the ULP can control the
exact placement of all received data, including ULP headers
and ULP data. ULP headers and other control information can
be placed in separate buffers from ULP data. This is
frequently a distinct advantage compared to having ULP headers
and data in the same buffers, as an additional data copy may
be otherwise required to separate them.
Providing memory access transfers does not mean a processor's
entire memory space is open for unprotected transfers. The remote
computer controls which of its buffers can be accessed by memory
access transfers. Incoming RDMA Read and RDMA Write operations can
only access buffers to which the receiving host has explicitly
permitted RDMA accesses. When the ULP allows RDMA access to a
buffer, the extent and address characteristics of the buffer can be
chosen by the ULP. A buffer could use the virtual address space of
the process, it could be a physical address (if allowed), or it
could be a new virtual address space created for the individual
buffer.
In both IB and VIA, the RDMA buffer must be registered with the receiving
network interface before RDMA operations can occur. For a typical
hardware offload network interface, this is enough information to
build an address translation table and associate appropriate
security information with the buffer. The address translation table
lets the NIC convert the incoming buffer target address into a
local physical address.
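A minimal sketch of the state created by such a registration, and of
the check and translation applied to an incoming RDMA target address,
is shown below.  The structure and function names are hypothetical
and do not correspond to VIA, InfiniBand, or any other specific
interface; the buffer is assumed to start on a page boundary.

   /*
    * Hypothetical sketch of the per-buffer state an RDMA-capable
    * network interface keeps after registration, and of the
    * translation applied to the target address carried in an
    * incoming RDMA Read or RDMA Write PDU.  Names are illustrative.
    */

   #include <stddef.h>
   #include <stdint.h>

   #define PAGE_SHIFT 12
   #define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)

   struct rdma_region {              /* one registered buffer          */
       uint32_t  key;                /* quoted in each RDMA PDU        */
       uint64_t  base_va;            /* buffer address the peer uses   */
       uint64_t  length;             /* extent the ULP opened for RDMA */
       int       remote_write_ok;    /* access right granted by ULP    */
       uint64_t *page_phys;          /* physical address of each page  */
   };

   /*
    * Translate a (key, virtual address) pair from an incoming RDMA
    * PDU into a local physical address, enforcing the bounds and
    * rights granted at registration time.  Returns 0 on a violation.
    */
   uint64_t rdma_translate(const struct rdma_region *r, uint32_t key,
                           uint64_t va, uint64_t len, int is_write)
   {
       uint64_t off;

       if (key != r->key || va < r->base_va)
           return 0;
       off = va - r->base_va;
       if (len > r->length || off > r->length - len)
           return 0;                 /* outside the registered extent  */
       if (is_write && !r->remote_write_ok)
           return 0;                 /* ULP did not permit RDMA Write  */
       return r->page_phys[off >> PAGE_SHIFT] + (off & (PAGE_SIZE - 1));
   }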
A.2 Use Of Push Transfers
Memory access transfers contrast with the push transfers typically
used by IP applications. With push transfers the source has no
visibility or control over where data will be delivered on the
destination machine. While most protocols use some form of push
transfer, IB and VIA define a datagram-style push transfer that
allows a form of direct data placement on the receive side.
IB and VIA both require the application to pre-post receive
buffers. The application pre-posts receive buffers for a
connection and they are filled by subsequent incoming Send
operations. Since the receive buffer is pre-posted, the network
interface can place the data from the incoming Send operation
directly into the application's buffer. IB and VIA allow use of
scattered receive buffers to support splitting the ULP header from
data within a single Send.
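The pre-posting model can be sketched as a simple queue of
application-owned buffers, as below.  The names and the fixed queue
depth are hypothetical and are used only to illustrate how a posted
buffer is consumed by an incoming Send without an intermediate copy.

   /*
    * Hypothetical sketch of pre-posted receive buffers for push
    * (Send) transfers.  The application posts buffers ahead of time;
    * the network interface fills them in order, placing each incoming
    * Send directly into the oldest posted buffer.
    */

   #include <stddef.h>

   #define MAX_POSTED 64

   struct recv_queue {
       void  *buf[MAX_POSTED];       /* application-owned buffers       */
       size_t len[MAX_POSTED];
       unsigned head;                /* next buffer the NIC will fill   */
       unsigned tail;                /* next slot the application posts */
   };

   /* Application: make a buffer available for a future incoming Send. */
   int post_receive(struct recv_queue *q, void *buf, size_t len)
   {
       unsigned next = (q->tail + 1) % MAX_POSTED;

       if (next == q->head)
           return -1;                /* no room to post another buffer  */
       q->buf[q->tail] = buf;
       q->len[q->tail] = len;
       q->tail = next;
       return 0;
   }

   /* Network interface (conceptually): hand the oldest posted buffer
    * to an arriving Send; if none is posted, the Send cannot be
    * placed. */
   void *consume_receive(struct recv_queue *q, size_t *len_out)
   {
       void *buf;

       if (q->head == q->tail)
           return NULL;              /* no receive buffer posted        */
       buf = q->buf[q->head];
       *len_out = q->len[q->head];
       q->head = (q->head + 1) % MAX_POSTED;
       return buf;
   }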
Neither memory access nor push transfers are inherently superior --
each has its merits. Furthermore, memory access transfers can be
built atop push transfers or vice versa. However, direct support
of memory access transfers allows much lower transfer overhead than
if memory access transfers are emulated.
A.3 RDMA-based I/O Example
If the RDMA protocol is offloaded to the network interface, the
RDMA Read operation allows an I/O subsystem, such as a storage
array, to fully control all aspects of data transfer for
outstanding I/O operations. An example of a simple I/O operation
shows several benefits of using memory access transfers.
Consider an I/O block Write operation where the host processor
wishes to move a block of data (the data source) to an I/O
subsystem. The host first registers the data source with its
network interface as an RDMA address block. Next the host pushes a
small Send operation to the I/O subsystem. The message describes
the I/O write request and tells the I/O subsystem where it can find
the data in the virtual address space presented through the
communication connection by the network interface. After receiving
this message, the I/O subsystem can pull the data from the host's
buffer as needed. This gives the I/O subsystem the ability to both
schedule and pace its data transfer, thereby requiring less
buffering on the I/O subsystem. When the I/O subsystem completes
the data pull, it pushes a completion message back to the host with
a small Send operation. The completion message tells the host the
I/O operation is complete and that it can deregister its RDMA
block.
In this example the host processor spent very few CPU cycles doing
the I/O block Write operation. The processor sent out a small
message and the I/O subsystem did all the data movement. After the
I/O operation was completed the host processor received a single
completion message.
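A host-side sketch of this exchange, in C, follows.  The function and
structure names (rdma_register(), send_small_message(), and so on)
are placeholders introduced only for illustration; they are not the
API of VIA, InfiniBand, or any other interface.

   /*
    * Hypothetical host-side sequence for the I/O block Write
    * described above.  All names are placeholders, not a real API.
    */

   #include <stddef.h>
   #include <stdint.h>

   struct write_request {            /* small Send to the I/O subsystem */
       uint64_t src_va;              /* where the data block lives ...  */
       uint32_t src_key;             /* ... in the registered region    */
       uint32_t length;
       uint64_t disk_block;          /* illustrative I/O parameter      */
   };

   struct completion_msg {
       uint32_t status;              /* success or failure of the write */
   };

   /* Placeholder prototypes for the hypothetical interface. */
   extern uint32_t rdma_register(int conn, const void *buf, uint32_t len);
   extern void     rdma_deregister(int conn, uint32_t key);
   extern void     send_small_message(int conn, const void *msg, size_t len);
   extern void     wait_for_message(int conn, void *msg, size_t len);

   int block_write(int conn, const void *data, uint32_t length,
                   uint64_t disk_block)
   {
       struct write_request  req;
       struct completion_msg done;
       uint32_t key;

       /* 1. Register the data source so the I/O subsystem may pull it. */
       key = rdma_register(conn, data, length);

       /* 2. Push a small Send describing the request.  The I/O
        *    subsystem issues RDMA Reads against (src_va, src_key) at
        *    its own pace; no further host CPU cycles are spent moving
        *    the data. */
       req.src_va     = (uint64_t)(uintptr_t)data;
       req.src_key    = key;
       req.length     = length;
       req.disk_block = disk_block;
       send_small_message(conn, &req, sizeof(req));

       /* 3. A small Send from the I/O subsystem signals completion. */
       wait_for_message(conn, &done, sizeof(done));

       /* 4. The data source may now be deregistered. */
       rdma_deregister(conn, key);
       return (int)done.status;
   }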
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the
purpose of developing Internet standards in which case the
procedures for copyrights defined in the Internet Standards process
must be followed, or as required to translate it into languages
other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on
an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.