Allyn Romanow      (Cisco)
Internet-draft                                Jeff Mogul        (Compaq)
Expires: September 2002                       Tom Talpey        (NetApp)
                                              Stephen Bailey (Sandburst)

                     RDMA over IP Problem Statement
          draft-romanow-rdma-over-ip-problem-statement-00.txt


Status of this Memo

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

Copyright Notice

     Copyright (C) The Internet Society (2002). All Rights Reserved.

Abstract

     This draft describes the problem of high system costs in end-hosts
     caused by data copying in network I/O at high speeds.  The problem
     is due to the high cost of memory bandwidth, and it can be
     substantially mitigated using "copy avoidance."  The high overhead
     has prevented TCP/IP from being used as an interconnection network,
     and instead special purpose memory-to-memory fabrics have been
     developed and widely used.  An IP-based solution, developed within
     the IETF, is desirable for interoperability of various network
     fabrics. It is also particularly important for the IETF to guide
     the standardization because interconnection technology will soon
     start to be used over the wide area in the Internet.



Romanow, et al           Expires September 2002                 [Page 1]


Internet-Draft       RDMA over IP Problem Statement          21 Feb 2002


Table Of Contents

     1.   Introduction . . . . . . . . . . . . . . . . . . . . . . .   2
     2.   The high cost of data movement operations in network I/O .   3
     2.1. Copy avoidance improves processing overhead  . . . . . . .   5
     3.   Memory bandwidth is the root cause of the problem  . . . .   6
     4.   High copy overhead is problematic for many key Internet
          applications . . . . . . . . . . . . . . . . . . . . . . .   7
     5.   How remote direct memory access (RDMA) can solve this
          problem  . . . . . . . . . . . . . . . . . . . . . . . . .   9
     6.   Why this problem is relevant for the IETF  . . . . . . . .  11
     7.   Security Considerations  . . . . . . . . . . . . . . . . .  12
     8.   Acknowledgements . . . . . . . . . . . . . . . . . . . . .  12
          References . . . . . . . . . . . . . . . . . . . . . . . .  12
          Author's Address . . . . . . . . . . . . . . . . . . . . .  16
          Full Copyright Statement . . . . . . . . . . . . . . . . .  17


1.  Introduction

     This draft considers the problem of high host processing overhead
     associated with network I/O that occurs under high speed
     conditions. This problem is often referred to as the "I/O
     bottleneck" [CT90]. More specifically, the source of high overhead
     that is of interest here is data movement operations -- copying.
     This issue is not to be confused with TCP offload, which is not
     addressed here.  High speed refers to conditions where the network
     link speed is high relative to the bandwidths of the host CPU and
     memory.  With today's computer systems, 1 Gbits/s and above is
     considered high speed.

     The high cost associated with copying is an issue primarily for
     large scale systems.  Although smaller systems such as rack-mounted
     PCs and small workstations would benefit from a reduction in
     copying overhead, the benefit to smaller machines will come mainly
     over the next few years, as they scale up the amount of bandwidth
     they handle.  Today it is large machines with high bandwidth feeds,
     usually multiprocessors and clusters, that are adversely affected
     by copying overhead.  Examples of such machines include all
     varieties of servers: database servers, storage servers,
     application servers for transaction processing, e-commerce, and
     web serving, content distribution, video distribution, backups,
     data mining and decision support, and scientific computing.

     These larger systems typically, though not exclusively, terminate
     local connections rather than just wide area network connections.
     They are often located in data centers and they carry corporate and
     Internet traffic.  Increasingly, large systems access storage over
     a Storage Area Network (SAN) rather than using directly attached
     disks, and many SANs are IP-based.

     Note that such servers almost exclusively service many concurrent
     sessions (transport connections), which, in aggregate, are
     responsible for > 1 Gbits/s of communication.  Nonetheless, the
     copying overhead for a given aggregate load is the same whether it
     comes from few or many sessions.

     Because of high end-host processing overhead in current
     implementations, the TCP/IP protocol stack is not used for high
     speed transfer.  Instead, special purpose network fabrics using
     remote direct memory access (RDMA) have been developed and are
     widely used.  RDMA is a technology that allows the network adapter,
     under control of the application, to place data directly into and
     out of application buffers.  This capability is also referred to as
     "direct data placement".  Examples of such interconnection fabrics
     include Fibre Channel [FIBRE] for block storage transfer, Virtual
     Interface Architecture [VI] for database clusters, and Infiniband
     [IB], Compaq Servernet [SRVNET], and Quadrix [QUAD] for System
     Area Networks.  These link level technologies limit application
     scaling in both distance and size, meaning the number of nodes.

     This problem statement substantiates the claim that in network I/O
     processing, high overhead is caused by data movement operations,
     specifically copying, and that copy avoidance significantly
     decreases the processing overhead.  It describes when and why the
     high processing overheads occur, explains why the overhead is
     problematic, and points out which applications are most affected.
     The draft also considers why this problem needs to be addressed by
     the IETF in particular.

     The I/O bottleneck, and the role of data movement operations, have
     been widely studied in research and industry over roughly the last
     14 years, and we draw freely on these results.  The
     problem was investigated when high speed meant 100 Mbits/s FDDI and
     Fast Ethernet; it was again of concern when ATM with 155 Mbits/s
     and 1 Gbits/s Ethernet were introduced.  And now that 10 Gbits/s
     Ethernet is becoming available there is an upswing of activity in
     industry and research [DAFS, IB, VI, CGZ01, Ma02, MAF+02].

2.  The high cost of data movement operations in network I/O

     A wealth of data from research and industry shows that copying is
     responsible for substantial amounts of processing overhead. It
     further shows that even in carefully implemented systems,
     eliminating copies significantly reduces the overhead, as
     referenced below.



     Clark et al. [CJRS89] showed in 1989 that TCP [Po81] processing
     overhead is attributable both to operating system costs such as
     interrupts, context switches, process management, buffer
     management, and timer management, and to the costs associated with
     processing individual bytes, specifically computing the checksum
     and moving data in memory.  They found that moving data in memory
     is the more important of the costs, and their experiments showed
     that memory bandwidth is the greatest source of limitation.  In the
     data presented [CJRS89], 64% of the measured microsecond overhead
     was attributable to data-touching operations, and 48% was accounted
     for by copying.  The system measured was Berkeley TCP on a Sun-3/60
     using 1460 Byte Ethernet packets.

     In a well-implemented system, copying can occur between the network
     interface and the kernel, and between the kernel and application
     buffers - two copies, each of which is two memory bus crossings -
     for read and write. Although in certain circumstances it is
     possible to do better, usually two copies are required on receive.

     Subsequent work has consistently shown the same phenomenon as the
     earlier Clark study.  A number of studies report that data-touching
     operations, checksumming and data movement, dominate the processing
     costs for messages longer than 128 Bytes [BS96, CGY01, Ch96,
     CJRS89, DAPP93, KP96].  For smaller messages, per-packet overheads
     dominate [KP96, CGY01].

     The percentage of overhead due to data-touching operations
     increases with packet size, since time spent on per-byte operations
     scales linearly with message size [KP96].  For example, Chu [Ch96]
     reported substantial per-byte latency costs as a percentage of
     total networking software costs for MTU-size packets on a
     SPARCstation/20 running memory-to-memory TCP tests over networks
     with 3 different MTU sizes.  The percentage of total software costs
     attributable to per-byte operations was:

        1500 Byte Ethernet 18-25%
        4352 Byte FDDI     35-50%
        9180 Byte ATM      55-65%


     Although many studies report results for data-touching operations
     that include both checksumming and data movement together, much
     work has focused on copying alone [BS96, B99, Ch96, TK95].  For
     example, [KP96] reports results that separate the processing times
     for checksum from those for data movement operations.  For 1500
     Byte Ethernet-size packets, 20% of total processing overhead time
     is attributable to copying.  The study used two DECstation 5000/200
     machines connected by an FDDI network.  (In this study checksumming
     accounts for 30% of the processing time.)



2.1.  Copy avoidance improves processing overhead

     A number of studies show that eliminating copies substantially
     reduces overhead.  For example, results from copy-avoidance in the
     IO-Lite system [PDZ99], which aimed at improving web server
     performance, show a throughput increase of 43% over an optimized
     web server, and 137% improvement over an Apache server. The system
     was implemented in a 4.4BSD derived UNIX kernel, and the
     experiments used a server system based on a 333MHz Pentium II PC
     connected to a switched 100 Mbits/s Fast Ethernet.

     There are many other examples where eliminating copies, using a
     variety of different approaches, showed significant improvement in
     system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We
     discuss the results of one of these studies in detail in order to
     clarify the significant degree of improvement produced by copy
     avoidance [Ch02].

     Recent work by Chase et al. [CGY01], measuring CPU utilization,
     shows that avoiding copies reduces CPU time spent on data access
     from 24% to 15% at 370 Mbits/s for a 32 KByte MTU, using a Compaq
     Professional Workstation and a Myrinet adapter [BCF+95].  This is
     an absolute improvement of 9% due to copy avoidance.

     The total CPU utilization was 35%, with data access accounting for
     24%.  Thus the relative importance of reducing copies is 26%.  At
     370 Mbits/s, the system is not very heavily loaded. The relative
     improvement in achievable bandwidth is 34%. This is the improvement
     we would see if copy avoidance were added when the machine was
     saturated by network I/O.

     Note that improvement from the optimization becomes more important
     if the overhead it targets is a larger share of the total cost.
     This is what happens if other sources of overhead, such as
     checksumming, are eliminated. In [CGY01], after removing checksum
     overhead, copy avoidance reduces CPU utilization from 26% to 10%.
     This is a 16% absolute reduction, a 61% relative reduction, and a
     160% relative improvement in achievable bandwidth.
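
     As a check, the arithmetic behind the percentages above can be
     reproduced in a few lines.  The script below is ours, for
     illustration only; the input figures are those quoted in the text
     from [CGY01].

```python
# Input figures quoted in the text from [CGY01]; the arithmetic below
# is our own illustrative reconstruction.
total_util = 0.35          # total CPU utilization at 370 Mbits/s
data_access = 0.24         # CPU share spent on data access, with copies
data_access_nocopy = 0.15  # CPU share on data access, copies avoided

absolute_gain = data_access - data_access_nocopy   # 9% absolute
relative_importance = absolute_gain / total_util   # ~26% of total cost
new_total = total_util - absolute_gain             # 26% utilization
bandwidth_gain = total_util / new_total - 1        # ~34% more bandwidth

# With checksum overhead also removed, the text reports 26% -> 10%:
relative_reduction = (0.26 - 0.10) / 0.26          # ~61% relative
bandwidth_gain2 = 0.26 / 0.10 - 1                  # 160% more bandwidth
```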

     In fact, today's NICs commonly offload the checksum, which removes
     the other source of per-byte overhead. They also coalesce
     interrupts to reduce per-packet costs.  Thus, today copying costs
     account for a relatively larger part of CPU utilization than
     previously, and therefore relatively more benefit is to be gained
     in reducing them. (Of course this argument would be specious if the
     amount of overhead were insignificant, but it has been shown to be
     substantial.)




3.  Memory bandwidth is the root cause of the problem

     Data movement operations are expensive because memory bandwidth is
     scarce relative to network bandwidth and CPU bandwidth [PAC+97].
     This trend existed in the past and is expected to continue into the
     future [HP97, STREAM], especially in large multiprocessor systems.

     With copies crossing the bus twice per copy, network processing
     overhead is high whenever network bandwidth is large in comparison
     to CPU and memory bandwidths. Generally with today's end-systems,
     the effects are observable at network speeds over 1 Gbits/s.

     A common question is whether an increase in CPU processing power
     alleviates the problem of high processing costs for network I/O.
     The answer is no; it is memory bandwidth that is the issue.  Faster
     CPUs do not help if the CPU spends most of its time waiting for
     memory [CGY01].

     The widening gap between microprocessor performance and memory
     performance has long been a widely recognized and well-understood
     problem [PAC+97].  Hennessy [HP97] shows that microprocessor
     performance grew at 60% per year from 1980 to 1998, while DRAM
     access time improved at only 10% per year, giving rise to an
     increasing "processor-memory performance gap".

     Another source of relevant data is the STREAM Benchmark Reference
     Information website which provides information on the STREAM
     benchmark [STREAM]. The benchmark is a simple synthetic benchmark
     program that measures sustainable memory bandwidth (in MBytes/s)
     and the corresponding computation rate for simple vector kernels
     measured in MFLOPS.  The website tracks information on sustainable
     memory bandwidth for hundreds of machines and all major vendors.

     The measured results show that processing performance from 1985 to
     2001 increased at 50% per year on average, while sustainable memory
     bandwidth from 1975 to 2001 increased at 35% per year on average,
     over all the systems measured.  A similar 15% per year lead of
     processing bandwidth over memory bandwidth shows up in another
     statistic, machine balance [Mc95], a measure of the relative rate
     of CPU to memory bandwidth: (FLOPS/cycle) / (sustained memory
     ops/cycle) [STREAM].

     Network bandwidth has been increasing about 10-fold roughly every 8
     years, which is a 40% per year growth rate.

     A typical example illustrates that memory bandwidth compares
     unfavorably with link speed.  The STREAM benchmark shows that a
     modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001,
     will move the data 3 times in performing a receive operation -- 1
     for the NIC to deposit the data in memory, and 2 for the CPU to
     copy the data.  With 1 GBytes/s of memory bandwidth, counting each
     read or write once, the machine could handle approximately 2.67
     Gbits/s of network bandwidth, one third of the raw memory
     bandwidth.  But this assumes 100% utilization, which is not
     achievable, and more importantly the machine would be totally
     consumed!  (A rule of thumb for databases is that no more than 20%
     of the machine should be required to service I/O, leaving 80% for
     the database application; the less, the better.)

     In 2001, 1 Gbits/s links were common.  An application server might
     typically have two 1 Gbits/s connections - one back-end connection
     to a storage server and one front-end, say for serving HTTP
     [FGM+99].  The communications could thus use 2 Gbits/s.  In our
     typical example, the machine could handle 2.7 Gbits/s at its
     theoretical maximum while doing nothing else.  This means that the
     machine basically could not keep up with the communication demands
     in 2001, and the relative growth trends will only make the
     situation worse.
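
     The arithmetic of this example can be sketched as follows.  The
     calculation is ours, for illustration; the input figures are those
     given in the text.

```python
# Figures from the example in the text; the arithmetic is ours.
mem_bw_gbits = 1 * 8       # 1 GBytes/s of memory bandwidth = 8 Gbits/s
crossings = 3              # receive: 1 NIC DMA write + 2 for the copy

ceiling = mem_bw_gbits / crossings   # ~2.67 Gbits/s at 100% utilization
demand = 2 * 1.0                     # two 1 Gbits/s links

# The ceiling barely exceeds the demand, and only at full saturation
# with the CPU doing nothing but network I/O.
headroom = ceiling - demand          # ~0.67 Gbits/s
```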

4.  High copy overhead is problematic for many key Internet applications

     If a significant portion of the resources on an application machine
     is consumed in network I/O rather than in application processing,
     it becomes difficult for the application to scale - to handle more
     clients, to offer more services.

     Several years ago the applications most affected were streaming
     multimedia, parallel file systems, and supercomputing on clusters
     [BS96].  Today, in addition, the applications that suffer from
     copying overhead are more central to Internet computing - they
     store, manage, and distribute the information of the Internet and
     the enterprise.  They include database applications doing
     transaction processing, e-commerce, web serving, decision support,
     content distribution, video distribution, and backups.  Clusters
     are typically used for this category of application, since they
     have advantages of availability and scalability.

     Today these applications, which provide and manage Internet and
     corporate information, are typically run in data centers that are
     organized into three logical tiers. One tier is typically web
     servers connecting to the WAN. The second tier is application
     servers that run the specific applications usually on more powerful
     machines, and the third tier is backend databases. Physically, the
     first two tiers - web server and application server - are usually
     combined [Pi01].  For example an e-commerce server communicates
     with a database server and with a customer site, or a content
     distribution server connects to a server farm, or an OLTP server
     connects to a database and a customer site.

     When network I/O uses too much memory bandwidth, performance on
     network paths between tiers can suffer.  (There might also be
     performance issues on SAN paths used either by the database tier or
     the application tier.)  The high overhead from network-related
     memory copies diverts system resources from other application
     processing.  It also can create bottlenecks that limit total system
     performance.

     There are a large and growing number of these application servers
     distributed throughout the Internet.  In 1999 approximately 3.4
     million server units were shipped, in 2000, 3.9 million units, and
     the estimated annual growth rate for 2000-2004 was 17 percent
     [Ne00, PA01].

     There is high motivation to maximize the processing capacity of
     each CPU, as scaling by adding CPUs one way or another has
     drawbacks. For example, adding CPUs to a multiprocessor will not
     necessarily help, as a multiprocessor improves performance only
     when the memory bus has additional bandwidth to spare. Clustering
     can add additional complexity to handling the applications.

     In order to scale a cluster or multiprocessor system, one must
     proportionately scale the interconnect bandwidth.  Interconnect
     bandwidth governs the performance of communication-intensive
     parallel applications; if this (often expressed in terms of
     "bisection bandwidth") is too low, adding additional processors
     cannot improve system throughput.  Interconnect latency can also
     limit the performance of applications that frequently share data
     between processors.

     So, excessive overheads on network paths in a "scalable" system
     both can require the use of more processors than optimal, and can
     reduce the marginal utility of those additional processors.

     Copy avoidance scales a machine upwards by removing at least two-
     thirds of the bus bandwidth load from the "very best" 1-copy (on
     receive) implementations, and at least 80% of the bandwidth
     overhead from 2-copy implementations.
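
     These fractions follow from counting memory bus crossings on
     receive.  Below is a minimal sketch of the accounting, under the
     model used in this draft: each copy is one read plus one write
     across the bus, and the NIC's DMA deposit is one unavoidable
     crossing.

```python
# Bus-crossing model (our assumption): each copy costs one read plus
# one write across the memory bus; the NIC's DMA deposit is one
# crossing that no scheme can avoid.
dma = 1
per_copy = 2

two_copy = dma + 2 * per_copy    # 5 crossings per received byte
one_copy = dma + 1 * per_copy    # 3 crossings ("very best" receive)
zero_copy = dma                  # 1 crossing with direct data placement

saved_vs_one = (one_copy - zero_copy) / one_copy   # 2/3 of bus load
saved_vs_two = (two_copy - zero_copy) / two_copy   # 4/5 = 80%
```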

     An example showing poor performance with copies and improved
     scaling with copy avoidance is illustrative.  The IO-Lite work
     [PDZ99] shows higher server throughput servicing more clients using
     a zero-copy system. In an experiment designed to mimic real world
     web conditions by simulating the effect of TCP WAN connections on
     the server, the performance of 3 servers was compared. One server
     was Apache, another an optimized server called Flash, and the third
     the Flash server running IO-Lite, called Flash-Lite with zero copy.
     The measurement was of throughput in requests/second as a function
     of the number of slow background clients that could be served. As
     the table shows, Flash-Lite has better throughput, especially as
     the number of clients increases.

                Apache              Flash         Flash-Lite
                ------              -----         ----------
     #Clients   Thruput (reqs/s)    Thruput       Thruput

     0          520                 610           890
     16         390                 490           890
     32         360                 490           850
     64         360                 490           890
     128        310                 450           880
     256        310                 440           820
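
     The relative gains implied by the table can be computed directly.
     The arithmetic below is ours, applied to the numbers above.

```python
# Throughput numbers copied from the table above (requests/s).
clients    = [0, 16, 32, 64, 128, 256]
apache     = [520, 390, 360, 360, 310, 310]
flash      = [610, 490, 490, 490, 450, 440]
flash_lite = [890, 890, 850, 890, 880, 820]

# Flash-Lite's advantage grows as slow background clients are added.
for n, a, f, fl in zip(clients, apache, flash, flash_lite):
    print(f"{n:3d} clients: Flash-Lite vs Flash {fl/f - 1:+.0%}, "
          f"vs Apache {fl/a - 1:+.0%}")
```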


     Traditional Web servers (which mostly send data and can keep most
     of their content in the file cache) are not the worst case for copy
     overhead.  Web proxies (which often receive as much data as they
     send) and complex Web servers based on SANs or multi-tier systems
     will suffer more from copy overheads than in the example above.

5.  How remote direct memory access (RDMA) can solve this problem

     RDMA is a technology that allows the network adapter, under control
     of the application, to place data directly into and out of
     application buffers.  This capability is also referred to as
     "direct data placement". It reduces the need for data movement.
     RDMA has been used extensively in memory-to-memory networks, both
     in research and in industry, as referenced below. It is a simple
     solution that once implemented does not need to be constantly
     revised with OS and architectural changes. Also it can be used with
     any OS and machine architecture.

     There has been extensive investigation and experience with two main
     alternative approaches to eliminating data movement overhead, often
     along with improving other Operating System processing costs.  In
     one approach, hardware and/or software changes within a single host
     reduce processing costs. In another approach, memory-to-memory
     networking [MAF+02], hosts communicate via information that allows
     them to reduce processing costs.

     As discussed below, research and industry experience has shown that
     copy avoidance techniques within the receiver processing path alone
     have proven to be problematic.  Many implementations have
     successfully achieved zero-copy transmit, but few have accomplished
     zero-copy receive.  Those that have done so impose strict alignment
     and no-touch requirements on the application, greatly reducing the
     portability and usefulness of the implementation.

     In contrast, experience has been very satisfactory with memory-to-
     memory systems that do direct data placement, eliminating copies by
     passing information between sender and receiver. Direct data
     placement is a single solution for zero-copy networking in both the
     transmit and receive paths.
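
     The contrast between the two receive paths can be caricatured as
     follows.  This is a purely illustrative sketch, not a real RDMA
     API; the function names are hypothetical, and in a real system the
     placement is performed by the network adapter, not by CPU code.

```python
# Toy contrast (ours) between a copy-based receive path and direct
# data placement.  Hypothetical names; not a real RDMA API.

def copy_receive(nic_buffer: bytes) -> bytes:
    """Conventional path: the CPU copies the data twice."""
    kernel_buffer = bytes(nic_buffer)   # copy 1: NIC/kernel boundary
    app_buffer = bytes(kernel_buffer)   # copy 2: kernel/application
    return app_buffer

def rdma_receive(nic_data: bytes, app_buffer: bytearray) -> None:
    """Direct placement: data lands in the buffer the application
    registered, with no intermediate CPU copies.  (Here the placement
    itself is merely simulated.)"""
    app_buffer[:len(nic_data)] = nic_data
```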

     The single host approaches range from entirely new hardware and
     software architectures [KSZ95, Wa97] to new or modified software
     systems [BP96, Ch96, TK95, DP93, PDZ99].

     In early work, one goal of the software approaches was to show that
     TCP could go faster with appropriate OS support [CJRS89, CFF+94].
     While this goal was achieved, further investigation and experience
     showed that, though it is possible to craft software solutions, the
     specific system optimizations have been complex, fragile,
     interdependent with other system parameters in intricate ways, and
     often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93,
     KSZ95, PDZ99].  The network I/O system interacts with other aspects
     of the Operating System, such as machine architecture, file I/O,
     and disk I/O [Br99, Ch96, DP93].

     For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
     page remapping, shows that the results are highly interdependent
     with other systems, such as the file system, and that the
     particular optimizations are specific to particular architectures,
     meaning that for each variation in architecture the optimizations
     must be re-crafted.

     A number of research projects and industry products have been based
     on the memory-to-memory approach to copy avoidance. These include
     U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB],
     Winsock Direct [Pi01]. Several memory-to-memory systems have been
     widely used and have generally been found to be robust, to have
     good performance, and to be relatively simple to implement. These
     include VI [VI], Myrinet [BCF+95], Quadrix [QUAD], Compaq/Tandem
     Servernet [SRVNET].  Networks based on these memory-to-memory
     architectures have been used widely in scientific applications and
     in data centers for block storage, file system access, and
     transaction processing.

     By exporting direct memory access "across the wire", applications
     may direct the network stack to manage all data directly from
     application buffers.  A large and growing class of applications has
     already emerged that takes advantage of such capabilities,
     including all the major databases, as well as file systems such as
     DAFS [DAFS] and network protocols such as Sockets Direct [SD].

6.  Why this problem is relevant for the IETF

     There are several reasons why this issue is relevant for the IETF.
     Interoperability is one reason; the others involve the convergence
     of interconnection networks and the WAN.

     Most interconnection technology has been proprietary, even when
     developed by multiple vendors. There have been interoperability
     problems even with standards such as SCSI and PCI. An IP approach
     developed in the IETF would allow a heterogeneous underlying fabric
     to be tied together by a single IP networking technology.  It would
     allow for multiple-vendor systems, for underlying hardware
     interconnection fabrics that could change over time yet remain
     interoperable, and for interoperation across multiple hardware
     technologies, such as 1 and 10 Gbits/s Ethernet.

     Traditionally interconnection technology has been developed in an
     electrical engineering domain, and networking technology has been
     developed in the IETF.  These domains are now converging, as
     hardware designers increasingly adopt networking-based approaches,
     and in particular are building IP-based systems. Since the IETF
     represents the best networking expertise, it is desirable to have
     it guide the standardization work.

     The most compelling reason interconnection network technology is
     relevant for the IETF is that our experience suggests that
     inevitably, and soon, there will be an intermixing between
     "interconnect" networks and WAN/Internet networks.  Although today
     IP-based interconnect traffic is in local clusters and within the
     data center, inevitably this traffic will "leak out" and will be
     seen over the wide area network, including the Internet.  There is
     already pressure for distributed data centers in the metro domain.
     Data centers distributed over the WAN will add value, and therefore
     someone will do it. It would be better for the development of the
     Internet and for the IETF to guide the development of IP-based
     interconnection technology properly while it is still primarily in
     the local environment, rather than having to deal with the
     technology later as it emerges onto the Internet.

     If the IETF does not become involved in engineering an IP standard,
     that will not prevent such a set of protocols from being developed;
     unfortunately, they will simply not benefit from the appropriate
     IETF networking expertise.




7.  Security Considerations

     The problem of reducing copying overhead in high bandwidth
     transfers via one or more protocols does not suggest any new
     security concerns. As a layer properly atop Internet transport
     protocols, the protocol(s) will gain leverage from IPSec and other
     Internet security standards. When a solution is proposed, security
     will be addressed in detail for that particular solution.

     The immediate target systems are local, where traditionally
     security has been treated in a more relaxed fashion.  However, the
     near certainty that high speed interconnects will run over the
     Internet makes it especially important to get security right from
     the outset.  This is another good reason for the IETF to guide the
     standardization.

8.  Acknowledgements

     Jeff Chase generously provided many useful insights and
     information.

9.  References

     [BCF+95]
          N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L.
          Seitz, J. N. Seizovic, and W. Su. "Myrinet - A gigabit-per-
          second local-area network", IEEE Micro, February 1995

     [BJM+96]
          G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes,
          "An implementation of the Hamlyn send-managed interface
          architecture", in Proceedings of the Second Symposium on
          Operating Systems Design and Implementation, USENIX Assoc.,
          Oct. 1996

     [BLA+94]
          M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten,
          "A virtual memory mapped network interface for the SHRIMP
          multicomputer", in Proceedings of the 21st Annual Symposium on
          Computer Architecture, April 1994, pp. 142-153

     [Br99]
          J. C. Brustoloni, "Interoperation of copy avoidance in network
          and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

     [BS96]
          J. C. Brustoloni, P. Steenkiste, "Effects of buffering
          semantics on I/O performance", Proceedings OSDI'96, USENIX,
          Seattle, WA Oct. 1996, pp. 277-291



     [CFF+94]
          C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
          Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-
          performance TCP/IP and UDP/IP networking in DEC OSF/1 for
          Alpha AXP",  Proceedings of the 3rd IEEE Symposium on High
          Performance Distributed Computing, August 1994, pp. 36-42

     [CGY01]
          J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system
          optimizations for high-speed TCP", IEEE Communications
          Magazine, Volume 39, Issue 4, April 2001, pp. 68-74.
          http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

     [Ch96]
          H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996
          Annual Technical Conference, San Diego, CA, Jan. 1996

     [Ch02]
          Jeffrey Chase, Personal communication

     [CJRS89]
          D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis
          of TCP processing overhead", IEEE Communications Magazine,
          Volume 27, Issue 6, June 1989, pp. 23-29

     [CT90]
          D. D. Clark, D. Tennenhouse, "Architectural considerations for
          a new generation of protocols", Proceedings of the ACM SIGCOMM
          Conference, 1990

     [DAFS]
          Direct Access File System http://www.dafscollaborative.org
          http://www.ietf.org/internet-drafts/draft-wittle-dafs-00.txt

     [DAPP93]
          P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson,
          "Network subsystem design", IEEE Network, July 1993, pp. 8-17

     [DP93]
          P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-
          domain transfer facility", Proceedings of the 14th ACM
          Symposium on Operating Systems Principles, Dec. 1993

     [EBBV95]
          T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A
          user-level network interface for parallel and distributed
          computing", Proc. of the 15th ACM Symposium on Operating
          Systems Principles, Copper Mountain, Colorado, Dec. 3-6, 1995



     [FGM+99]
          R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P.
          Leach, T. Berners-Lee, "Hypertext Transfer Protocol -
          HTTP/1.1", RFC 2616, June 1999

     [FIBRE]
          Fibre Channel Standard
          http://www.fibrechannel.com/technology/index.master.html

     [HP97]
          J. L. Hennessy, D. A. Patterson, Computer Organization and
          Design, 2nd Edition, San Francisco: Morgan Kaufmann
          Publishers, 1997

     [IB] InfiniBand Architecture Specification, Volumes 1 and 2,
          Release 1.0.a.  http://www.infinibandta.org

     [KP96]
          J. Kay, J. Pasquale, "Profiling and reducing processing
          overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol
          4, No. 6, pp. 817-828, Dec. 1996

     [KSZ95]
          K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for
          outboard buffering and checksumming", SIGCOMM'95

     [Ma02]
          K. Magoutis, "Design and Implementation of a Direct Access
          File System (DAFS) Kernel Server for FreeBSD", in Proceedings
          of USENIX BSDCon 2002 Conference, San Francisco, CA, February
          11-14, 2002.

     [MAF+02]
          Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, Margo I.
          Seltzer, Jeffrey S. Chase, Drew Gallatin, Richard Kisley,
          Rajiv Wickremesinghe, Eran Gabber, "Structure and Performance
          of the Direct Access File System (DAFS)", accepted for
          publication at the 2002 USENIX Annual Technical Conference,
          Monterey, CA, June 9-14, 2002.

     [Mc95]
          J. D. McCalpin, "A Survey of memory bandwidth and machine
          balance in current high performance computers", IEEE TCCA
          Newsletter, December 1995

     [Ne00]
          A. Newman, "IDC report paints conflicted picture of server
          market circa 2004", ServerWatch, July 24, 2000
          http://serverwatch.internet.com/news/2000_07_24_a.html

     [Pa01]
          M. Pastore, "Server shipments for 2000 surpass those in 1999",
          ServerWatch, Feb. 7, 2001
          http://serverwatch.internet.com/news/2001_02_07_a.html

     [PDZ99]
          V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O
          buffering and caching system", Proc. of the 3rd Symposium on
          Operating Systems Design and Implementation, New Orleans, LA,
          Feb. 1999

     [PAC+97]
          D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton,
          C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent
          RAM: IRAM", IEEE Micro, April 1997

     [Pi01]
          J. Pinkerton, "Winsock Direct: the value of System Area
          Networks". http://www.microsoft.com/windows2000/techinfo/
          howitworks/communications/winsock.asp

     [Po81]
          Postel, J., "Transmission Control Protocol - DARPA Internet
          Program Protocol Specification", RFC 793, September 1981

     [QUAD]
          Quadrix Solutions, http://www.quadrix.com

     [SD] Sockets Direct,

     [SRVNET]
          Compaq Servernet,
          http://nonstop.compaq.com/view.asp?PAGE=ServerNet

     [STREAM]
          The STREAM Benchmark Reference Information,
          http://www.cs.virginia.edu/stream/

     [TK95]
          M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O
          framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

     [VI] Virtual Interface Architecture Specification Version 1.0.
          http://www.viarch.org/html/collateral/san_10.pdf





     [Wa97]
          J. R. Walsh, "DART: Fast application-level networking via
          data-copy avoidance", IEEE Network, July/August 1997, pp.
          28-38

Authors' Addresses


     Allyn Romanow
     Cisco Systems, Inc.
     170 W. Tasman Drive
     San Jose, CA 95134 USA

     Phone: +1 408 525 8836
     Email: allyn@cisco.com


     Tom Talpey
     Network Appliance
     375 Totten Pond Road
     Waltham, MA 02451 USA

     Phone: +1 781 768-5329
     EMail: thomas.talpey@netapp.com


     Jeffrey C. Mogul
     Western Research Laboratory
     Compaq Computer Corporation
     250 University Avenue
     Palo Alto, California, 94305 USA

     Phone: +1 650 617 3304 (email preferred)
     EMail: JeffMogul@acm.org


     Stephen Bailey
     Sandburst Corporation
     600 Federal Street
     Andover, MA  01810
     USA

     Phone: +1 978 689 1614
     Email: steph@sandburst.com







Full Copyright Statement

     Copyright (C) The Internet Society (2002). All Rights Reserved.

     This document and translations of it may be copied and furnished to
     others, and derivative works that comment on or otherwise explain
     it or assist in its implementation may be prepared, copied,
     published and distributed, in whole or in part, without restriction
     of any kind, provided that the above copyright notice and this
     paragraph are included on all such copies and derivative works.
     However, this document itself may not be modified in any way, such
     as by removing the copyright notice or references to the Internet
     Society or other Internet organizations, except as needed for the
     purpose of developing Internet standards in which case the
     procedures for copyrights defined in the Internet Standards process
     must be followed, or as required to translate it into languages
     other than English.

     The limited permissions granted above are perpetual and will not be
     revoked by the Internet Society or its successors or assigns.

     This document and the information contained herein is provided on
     an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
     ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
     IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
     THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.























