Internet-Draft Brent Callaghan
Expires: January 2005 Sun Microsystems, Inc.
Tom Talpey
Network Appliance, Inc.
Document: draft-ietf-nfsv4-rpcrdma-00.txt July, 2004
RDMA Transport for ONC RPC
Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
or will be disclosed, and any of which I become aware will be
disclosed, in accordance with RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-
Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
A protocol is described providing RDMA as a new transport for ONC
RPC. The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while requiring only
minimal change to RPC applications and no revision of the application
RPC protocol, or of the RPC protocol itself.
Table of Contents
1. Introduction
2. Abstract RDMA Model
3. Protocol Outline
3.1. Short Messages
3.2. Data Chunks
3.3. Flow Control
3.4. XDR Encoding with Chunks
3.5. Padding
3.6. XDR Decoding with Read Chunks
3.7. XDR Decoding with Write Chunks
3.8. RPC Call and Reply
4. RPC RDMA Message Layout
4.1. RPC RDMA Transport Header
4.2. XDR Language Description
5. Large Chunkless Messages
5.1. Message as an RDMA Read Chunk
5.2. RDMA Write of Long Replies
5.3. RPC RDMA Header Errors
6. Connection Configuration Protocol
6.1. Initial Connection State
6.2. Protocol Description
7. Memory Registration Overhead
8. Errors and Error Recovery
9. Node Addressing
10. RPC Binding
11. Security
12. IANA Considerations
13. Acknowledgements
14. Normative References
15. Informative References
16. Authors' Addresses
17. Full Copyright Statement
Acknowledgement
1. Introduction
RDMA is a technique for efficient movement of data over high speed
transports. It facilitates data movement via direct memory access by
hardware, yielding faster transfers of data over a network while
reducing host CPU overhead.
ONC RPC [RFC1831] is a remote procedure call protocol that has been
run over a variety of transports. Most implementations today use UDP
or TCP. RPC messages are defined in terms of an eXternal Data
Representation (XDR) [RFC1832] which provides a canonical data
representation across a variety of host architectures. An XDR data
stream is conveyed differently on each type of transport. On UDP,
RPC messages are encapsulated inside datagrams, while on a TCP byte
stream, RPC messages are delineated by a record marking protocol. An
RDMA transport also conveys RPC messages in a unique fashion that
must be fully described if client and server implementations are to
interoperate.
RDMA transports present new semantics unlike the behaviors of either
UDP or TCP. They retain message delineations like UDP, while also
providing a reliable, sequenced data transfer like TCP. All also
provide the efficient bulk transfer service of RDMA. RDMA transports
are therefore naturally viewed as a new transport type by ONC RPC.
RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at
moving data efficiently between host memory and a high speed network
with little or no host CPU involvement. In this context, the NFS
protocol, in all its versions, is an obvious beneficiary of RDMA.
Many other RPC-based protocols will also benefit.
Although the RDMA transport described here provides relatively
transparent support for any RPC application, the proposal goes
further in describing mechanisms that can optimize the use of RDMA
with more active participation by the RPC application.
2. Abstract RDMA Model
An RPC transport is responsible for conveying an RPC message from a
sender to a receiver. An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client. An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply. The call header
contains a transaction ID (XID) followed by the program and procedure
number as well as a security credential. An RPC reply header begins
with an XID that matches that of the RPC call message, followed by a
security verifier and results. All data in an RPC message is XDR
encoded. For a complete description of the RPC protocol and XDR
encoding, see [RFC1831] and [RFC1832].
This protocol assumes an abstract model for RDMA transports. The
following terms, common in the RDMA lexicon, are used in this
document. A more complete glossary of RDMA terms can be found in
[RDMA].
o Registered Memory
All data moved via RDMA must be resident in registered
memory at its source and destination. Each segment of
registered memory must be identified with a Steering Tag
(STag) of no more than 32 bits and memory addresses of up
to 64 bits in length.
o RDMA Send
The RDMA provider supports an RDMA Send operation with
completion signalled at the receiver when data is placed
in a pre-posted buffer. The amount of transferred data
is limited only by the size of the receiver's buffer.
Sends complete at the receiver in the order they were
issued at the sender.
o RDMA Write
The RDMA provider supports an RDMA Write operation to
directly place data in the receiver's buffer. An RDMA
Write is initiated by the sender and completion is
signalled at the sender. No completion is signalled at
the receiver. The sender uses a Steering Tag (STag),
memory address and length of the remote destination
buffer. A subsequent completion, provided by RDMA Send,
must be obtained at the receiver to guarantee that RDMA
Write data has been successfully placed in the receiver's
memory.
o RDMA Read
The RDMA provider supports an RDMA Read operation to
directly place peer source data in the requester's buffer.
An RDMA Read is initiated by the receiver and completion is
signalled at the receiver. The receiver provides
Steering Tags, memory addresses and a length for the
remote source and local destination buffers.
Since the peer at the data source receives no notification
of RDMA Read completion, there is an assumption that on
receiving the data the receiver will signal completion
with an RDMA Send message, so that the peer can free the
source buffers.
In its abstract form, this protocol is not an interoperable standard.
It becomes a useful, implementable standard only when mapped
onto a specific RDMA transport, like iWARP [RDDP] or Infiniband
[IB].
3. Protocol Outline
An RPC message can be conveyed in identical fashion, whether it is a
CALL or REPLY message. In each case, the transmission of the message
proper is preceded by transmission of a transport header for use by
RPC over RDMA transports. This header is analogous to the record
marking used for RPC over TCP, but is more extensive, since RDMA
transports support several modes of data transfer and it is important
to allow the client and server to use the most efficient mode for any
given transfer. Multiple segments of a message may be transferred in
different ways to different remote memory destinations.
All transfers of a CALL or REPLY begin with an RDMA send which
transfers at least the transport header, usually with the CALL or
REPLY message appended, or at least some part thereof. Because the
size of what may be transmitted via RDMA send is limited by the size
of the receiver's pre-posted buffer, the RPC over RDMA transport
provides a number of methods to reduce the amount transferred by
means of the RDMA send, when necessary, by transferring various parts
of the message using RDMA read and RDMA write.
3.1. Short Messages
Many RPC messages are quite short. For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
byte filehandle argument and 4 bytes of length. The reply to this
common request is about 100 bytes.
There is no benefit in transferring such small messages with an RDMA
Read or Write operation. The overhead in transferring STags and
memory addresses is justified only by large transfers. The critical
message size that justifies RDMA transfer will vary depending on the
RDMA implementation and network, but is typically of the order of a
few kilobytes. It is appropriate to transfer a short message with an
RDMA Send to a pre-posted buffer. The transport header with the
short message (CALL or REPLY) immediately following is transferred
using a single RDMA send operation.
Short RPC messages over an RDMA transport will look like this:
    Client                                Server
       |              RPC Call              |
  Send | ------------------------------>    |
       |                                    |
       |              RPC Reply             |
       |    <------------------------------ | Send
3.2. Data Chunks
Some protocols, like NFS, have RPC procedures that can transfer very
large "chunks" of data in the RPC call or reply and would cause the
maximum send size to be exceeded if one tried to transfer them as
part of the RDMA send. These large chunks typically range from a
kilobyte to a megabyte or more. An RDMA transport can transfer large
chunks of data more efficiently via the direct placement of an RDMA
Read or RDMA Write operation. Using direct placement instead of in-
line transfer not only avoids expensive data copies, but provides
correct data alignment at the destination.
3.3. Flow Control
It is critical to provide flow control for an RDMA connection. RDMA
receive operations will fail if a pre-posted receive buffer is not
available to accept an incoming RDMA Send. Such errors are fatal to
the connection. This is a departure from conventional TCP/IP
networking where buffers are allocated dynamically on an as-needed
basis, and pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC
server. Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations. Additionally, for protocol correctness, the server must
be able to reply whether or not a new buffer can be posted to accept
future receives.
Flow control is implemented as a simple request/grant protocol in the
transport header associated with each RPC message. The transport
header for RPC CALL messages contains a requested credit value for
the server, which may be dynamically adjusted by the caller to match
its expected needs. The transport header for RPC REPLY messages
provides the granted result, which may have any value except that it
must not be zero when no operations are in progress at the server,
since such a value would result in deadlock. The value may be
adjusted up or down at each opportunity to match the server's needs
or policies.
While RPC CALLs may complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send
ordering properties. It is always the most recent server-granted
credit value minus the number of requests in flight.
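
For illustration only, the following C fragment sketches the client-
side credit accounting just described. The structure and function
names are placeholders and are not defined by this protocol.

    /*
     * Sketch of client-side credit accounting (hypothetical names).
     * The client may have at most "granted - in_flight" new CALLs
     * outstanding at any time.
     */
    struct rdma_credit_state {
        unsigned int granted;    /* most recent credit value from a REPLY */
        unsigned int in_flight;  /* CALLs sent but not yet replied to */
    };

    static int
    may_send_call(const struct rdma_credit_state *cs)
    {
        return cs->in_flight < cs->granted;
    }

    static void
    on_reply_received(struct rdma_credit_state *cs, unsigned int new_grant)
    {
        cs->granted = new_grant;    /* Send ordering makes this current */
        cs->in_flight--;
    }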
3.4. XDR Encoding with Chunks
The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine. XDR data
types such as integers, strings, arrays and linked lists are commonly
implemented over two very simple functions that encode either an XDR
data unit (32 bits) or an array of bytes.
Normally, the separate data items in an XDR call or reply are encoded
as a contiguous sequence of bytes for network transmission over UDP
or TCP. However, in the case of an RDMA transport, local routines
such as XDR encode can determine that an opaque byte array is large
enough to be more efficiently moved via an RDMA data transfer
operation like RDMA Read or RDMA Write.
When sending any message (request or reply) that contains a candidate
large data chunk, the XDR encoding routine avoids moving the data
into the XDR stream. Instead of encoding the data portion, it records
the address and size of each chunk in a separate "read chunk list"
encoded within the RPC RDMA transport-specific header. Such chunks
will be transferred via RDMA Read operations initiated by the
receiver.
Since the chunks are to be moved via RDMA, the memory for each chunk
must be registered. This registration may take place within XDR
itself, providing for full transparency to upper layers, or it may be
performed by any other specific local implementation.
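
As an illustration only, the following C fragment sketches how an
encode routine might divert a large opaque array into the read chunk
list rather than copying it into the XDR stream. All function and
structure names here (e.g. rdma_register(), chunk_list_append()) are
hypothetical placeholders, and the exact recorded position of a chunk
is an assumption of this sketch.

    #include <stdint.h>

    /*
     * Hypothetical sketch: divert large opaque data into the read
     * chunk list during XDR encode.
     */
    #define RDMA_CHUNK_THRESHOLD  1024   /* implementation-chosen minimum */

    int
    xdr_encode_opaque_chunked(struct xdr_stream *xdrs, void *data,
                              uint32_t len, struct chunk_list *rcl)
    {
        xdr_put_uint32(xdrs, len);           /* byte count stays in-line */

        if (len < RDMA_CHUNK_THRESHOLD)
            return xdr_put_bytes(xdrs, data, len);   /* small: in-line */

        /* Large: register the memory and record it as a read chunk. */
        struct chunk c;
        c.stag     = rdma_register(data, len);   /* registration handle */
        c.length   = len;
        c.offset   = (uint64_t)(uintptr_t)data;
        c.position = xdr_stream_pos(xdrs);       /* where data would go */
        return chunk_list_append(rcl, &c);
    }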
Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to accept
the data directly via RDMA Write. These chunks will therefore be
pre-filled by the server prior to responding, and XDR decode at the
client will not be required. These "write chunk lists" undergo a
similar registration and advertisement to chunks built as a part of
XDR encoding. Just as with an encoded read chunk list, the memory
referenced in an encoded write chunk list must be pre-registered. If
the client chooses not to make a write chunk list available, then the
server must return data inline in the reply, or via a read chunk
list.
When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR
stream element. In particular, for counted fields (e.g. a "<>"
encoding), the byte count, which is encoded as part of the field,
remains in the XDR stream, as well as being encoded in the chunk
list. Only the data portion is elided. This is important to
maintain upper layer implementation compatibility - both the count
and the data must be transferred as part of the XDR stream. In
addition, any byte count in the XDR stream must match the sum of the
byte counts present in the corresponding read or write chunk list.
If they do not agree, an RPC protocol encoding error results.
The following items are contained in a chunk list entry.
STag
Steering tag or handle obtained when the chunk
memory is registered for RDMA.
Length
The length of the chunk in bytes.
Offset
The offset or memory address of the chunk.
Position
For data which is to be encoded, the position in
the XDR stream where the chunk would normally
reside. It is possible that a contiguous sequence
of chunks might all have the same position. For
data which is to be decoded, no "position" is
used.
When XDR marshaling is complete, the chunk list is XDR encoded,
then sent to the receiver prepended to the RPC message. Any source
data for a read chunk, or the destination of a write chunk, remains
behind in the sender's registered memory.
    +----------------+----------------+-------------
    |                |                |
    | RDMA header w/ |   RPC Header   | Non-chunk args/results
    |     chunks     |                |
    +----------------+----------------+-------------
Read chunk lists are structured differently from write chunk lists.
This is due to the different usage - read chunks are decoded and
indexed by their position in the XDR data stream, and may be used
for both arguments and results. Write chunks on the other hand are
used only for results, and have no preassigned offset in the XDR
stream until the results are produced. The mapping of Write chunks
onto designated NFS procedures and results is described in [NFSDDP].
Therefore, read chunks are encoded as a single array, with each
entry tagged by its position in the XDR stream. Write chunks are
encoded as a list of arrays of RDMA buffers, with each list element
providing buffers for a separate result.
3.5. Padding
Alignment of specific opaque data enables certain scatter/gather
optimizations. Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed into
pre-posted receive buffers by Sends.
Many servers can make good use of such padding. Padding allows the
chaining of RDMA receive buffers such that any data transferred by
RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer. In this
way, the need for servers to perform RDMA Read to satisfy all but the
largest client writes is obviated.
The effect of padding is demonstrated below showing prior bytes on an
XDR stream (XXX) followed by an opaque field consisting of four
length bytes (LLLL) followed by data bytes (DDDD). The receiver of
the RDMA Send has posted two chained receive buffers. Without
padding, the opaque data is split across the two buffers. With the
addition of padding bytes (ppp) prior to the first data byte, the
data can be forced to align correctly in the second buffer.
                                   Buffer 1        Buffer 2
    Unpadded                       --------------  --------------
    XXXXXXXLLLLDDDDDDDDDDDDDD   -->XXXXXXXLLLLDDD  DDDDDDDDDDD

    Padded
    XXXXXXXLLLLpppDDDDDDDDDDDDDD -->XXXXXXXLLLLppp  DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport encoding,
flagged with a specific message type. Where padding is applied, two
values are passed to the peer: an "rdma_align" which is the padding
value used, and "rdma_thresh", which is the opaque data size at or
above which padding is applied. For instance, if the server is using
chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
could be used to achieve alignment of the data. If padding is to
apply only to chunks at least 1 KB in size, then the threshold should
be set to 1 KB. The XDR routine at the peer will consult these
values when decoding opaque values. Where the decoded length is at or
above the rdma_thresh, the XDR decode will skip over the appropriate
padding as indicated by rdma_align and the current XDR stream
position.
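
A minimal sketch of the padding computation follows, in C. It is
illustrative only; it assumes, consistent with the figure above, that
the padding bytes are inserted immediately after the 4-byte length
and before the first data byte.

    #include <stdint.h>

    /*
     * Sketch of padding computation.  "align" and "thresh" correspond
     * to rdma_align and rdma_thresh from the RDMA_MSGP header.
     */
    static uint32_t
    pad_bytes_needed(uint32_t stream_pos, uint32_t data_len,
                     uint32_t align, uint32_t thresh)
    {
        if (data_len < thresh || align == 0)
            return 0;                    /* below threshold: no padding */

        /* Pad so the first data byte (after the 4-byte length) lands
         * on an "align"-byte boundary in the receiver's buffer chain. */
        uint32_t first_data = stream_pos + 4;
        return (align - (first_data % align)) % align;
    }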
3.6. XDR Decoding with Read Chunks
The XDR decode process moves data from an XDR stream into a data
structure provided by the client or server application. Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR
decode to automatically allocate storage of sufficient size.
When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RDMA transport header, then
proceeds to decode the body of the RPC message (arguments or
results). Whenever the XDR offset in the decode stream matches that
of a chunk in the read chunk list, the XDR routine initiates an RDMA
Read to bring over the chunk data into locally registered memory for
the destination buffer. After completing such a transfer, the RPC
receiver must issue an RDMA_DONE message (described in Section 3.8)
to notify the peer that the source buffers can be freed.
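
The following C fragment sketches this decode-time handling of a read
chunk. The helper and structure names (e.g. rdma_read(),
chunk_at_position()) are hypothetical and serve only to illustrate
the sequence of operations.

    #include <stdint.h>

    /*
     * Sketch: when the XDR offset reaches a chunk's recorded position,
     * the data is pulled over with RDMA Read instead of being copied
     * from the in-line stream.
     */
    int
    xdr_decode_opaque_chunked(struct rdma_conn *conn,
                              struct xdr_stream *xdrs, void *dest,
                              uint32_t len, struct chunk_list *rcl)
    {
        struct chunk *c = chunk_at_position(rcl, xdr_stream_pos(xdrs));

        if (c == NULL)                       /* data was sent in-line */
            return xdr_get_bytes(xdrs, dest, len);

        uint32_t lkey = rdma_register(dest, len);   /* local buffer */
        rdma_read(conn, dest, lkey, c->stag, c->offset, c->length);
        rdma_wait_read_complete(conn);

        /* Once all chunks are transferred, the receiver sends RDMA_DONE
         * so the peer can free and de-register its source buffers. */
        return 0;
    }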
The read chunk list is constructed and used entirely within the
RPC/XDR layer. Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to an
RPC application.
3.7. XDR Decoding with Write Chunks
When a "write chunk list" is provided for the results of the RPC
CALL, the server must provide any corresponding data via RDMA Write
to the memory referenced in the chunk list entries. The RPC REPLY
conveys this by returning the write chunk list to the client with the
lengths rewritten to match the actual transfer. The XDR "decode" of
the reply therefore performs no local data transfer but merely
returns the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list, which
in turn consists of an array of RDMA segments. The length is
therefore the sum of all returned lengths in all segments comprising
the corresponding list entry. As each list entry is "decoded", the
entire entry is consumed.
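
A sketch of this length computation on the client, assuming
rpcgen-style C bindings for the XDR types of Section 4.2 (the
target_len/target_val naming is that convention, not something
defined by this protocol):

    #include <stdint.h>

    /*
     * Sketch: "decoding" a write chunk list entry on the client.  The
     * data has already been placed by the server's RDMA Writes; only
     * the total returned length is computed, and the entire list
     * entry is consumed.
     */
    uint32_t
    write_chunk_result_length(const struct xdr_write_chunk *wc)
    {
        uint32_t total = 0;

        for (unsigned int i = 0; i < wc->target.target_len; i++)
            total += wc->target.target_val[i].length;

        return total;
    }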
The write chunk list is constructed and used by the RPC application.
The RPC/XDR layer simply conveys the list between client and server
and initiates the RDMA Writes back to the client. The mapping of
write chunk list entries to procedure arguments must be determined
for each protocol. An example of a mapping is described in [NFSDDP].
3.8. RPC Call and Reply
The RDMA transport for RPC provides three methods of moving data
between client and server:
In-line
Data are moved between client and server
within an RDMA Send.
RDMA Read
Data are moved between client and server
via an RDMA Read operation, using the STag, address
and offset obtained from a read chunk list.
RDMA Write
Result data is moved from server to client
via an RDMA Write operation, using the STag, address
and offset obtained from a write chunk list
or reply chunk in the client's RPC call message.
These methods of data movement may occur in combinations within a
single RPC. For instance, an RPC call may contain some in-line
data along with some large chunks transferred via RDMA Read by the
server. The reply to that call may have some result chunks that
the server RDMA Writes back to the client. The following protocol
interactions illustrate RPC calls that use these methods to move
RPC message data:
An RPC with write chunks in the call message looks like this:
    Client                                Server
       |    RPC Call + Write Chunk list     |
  Send | ------------------------------>    |
       |                                    |
       |              Chunk 1               |
       |    <------------------------------ | Write
       |                 :                  |
       |              Chunk n               |
       |    <------------------------------ | Write
       |                                    |
       |              RPC Reply             |
       |    <------------------------------ | Send
An RPC with read chunks in the call message looks like this:
    Client                                Server
       |    RPC Call + Read Chunk list      |
  Send | ------------------------------>    |
       |                                    |
       |              Chunk 1               |
       | +------------------------------    | Read
       | v----------------------------->    |
       |                 :                  |
       |              Chunk n               |
       | +------------------------------    | Read
       | v----------------------------->    |
       |                                    |
       |              RPC Reply             |
       |    <------------------------------ | Send
And an RPC with read chunks in the reply message looks like this:
    Client                                Server
       |              RPC Call              |
  Send | ------------------------------>    |
       |                                    |
       |    RPC Reply + Read Chunk list     |
       |    <------------------------------ | Send
       |                                    |
       |              Chunk 1               |
  Read |    ------------------------------+ |
       |    <-----------------------------v |
       |                 :                  |
       |              Chunk n               |
  Read |    ------------------------------+ |
       |    <-----------------------------v |
       |                                    |
       |              RPC Done              |
  Send | ------------------------------>    |
The final RPC Done message allows the client to signal the server
that it has received the chunks, so the server can de-register and
free the memory holding the chunks. An RPC Done completion is not
necessary for an RPC call, since the RPC reply Send is itself a
receive completion notification.
The RPC Done message has no effect on protocol latency since the
client has no expectation of a reply from the server. Nor does it
adversely affect bandwidth since it is only 16 bytes in length. In
the event that the client fails to return the Done message, the
server can de-register and free the chunk buffers after a time-out.
It is important to note that the RPC Done message consumes a credit
at the server. The client must take this into account when tracking
its available credits, and the server should replenish the credit
consumed by RPC Done at its earliest opportunity.
Finally, it is possible to conceive of RPC exchanges that involve
any or all combinations of write chunks in the RPC CALL, read
chunks in the RPC CALL, and read chunks in the RPC REPLY. Support
for such exchanges is straightforward from a protocol perspective,
but in practice such exchanges would be quite rare, limited to upper
layer protocols that transfer bulk data in both the call and the
corresponding reply.
4. RPC RDMA Message Layout
RPC call and reply messages are conveyed across an RDMA transport
with a prepended RDMA transport header. The transport header
includes data for RDMA flow control credits, padding parameters and
lists of addresses that provide direct data placement via RDMA Read
and Write operations. The layout of the RPC message itself is
unchanged from that described in [RFC1831] except for the possible
exclusion of large data chunks that will be moved by RDMA Read or
Write operations. If the RPC message (along with the transport
header) is too long for the posted receive buffer (even after any
large chunks are removed), then the entire RPC message can be moved
separately as a chunk, leaving just the transport header in the RDMA
Send.
4.1. RPC RDMA Transport Header
The RPC RDMA transport header begins with four 32-bit fields that are
always present and which control the RDMA interaction including RDMA-
specific flow control. These are then followed by a number of items
such as chunk lists and padding which may or may not be present
depending on the type of transmission. The four fields which are
always present are:
1. Transaction ID (XID).
The XID generated for the RPC call and reply. Having
the XID at the beginning of the message makes it easy to
establish the message context. This XID mirrors the XID
in the RPC call header, and takes precedence.
2. Version number.
This version of the RPC RDMA message protocol is 1.
The version number must be increased by one whenever the
format of the RPC RDMA messages is changed.
3. Flow control credit value.
When sent in an RPC CALL message, the requested value is
provided. When sent in an RPC REPLY message, the
granted value is returned. RPC CALLs must not be sent
in excess of the currently granted limit.
4. Message type.
RDMA_MSG = 0 indicates that chunk lists and RPC message
follow. RDMA_NOMSG = 1 indicates that after the chunk
lists there is no RPC message. In this case, the chunk
lists provide information to allow the message proper to
be transferred using RDMA Read or Write, and it is therefore not
appended to the RPC RDMA transport header. RDMA_MSGP =
2 indicates that a chunk list and RPC message with some
padding follow. RDMA_DONE = 3 indicates that the
message signals the completion of a chunk transfer via
RDMA Read. RDMA_ERROR = 4 is used to signal any detected
error(s) in the RPC RDMA chunk encoding.
Because the version number is encoded as part of this header, and
the RDMA_ERROR message type is used to indicate errors, these first
four fields and the start of the following message body must always
remain aligned at these fixed offsets for all versions of the RPC
RDMA transport header.
For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
chunk lists follow. If the Read chunk list is null (a 32 bit word
of zeros), then there are no chunks to be transferred separately
and the RPC message follows in its entirety. If non-null, it marks
the beginning of an XDR encoded sequence of Read chunk list
entries. If the Write chunk list is non-null, then an XDR encoded
sequence of Write chunk entries follows.
If the message type is RDMA_MSGP, then two additional fields that
specify the padding alignment and threshold are inserted prior to
the Read and Write chunk lists.
A transport header of message type RDMA_MSG or RDMA_MSGP will be
followed by the RPC call or reply message, beginning with the XID.
This XID should match the one at the beginning of the RPC message
header.
    +--------+---------+---------+-----------+-------------+----------
    |        |         |         |  Message  |    NULLs    | RPC Call
    |  XID   | Version | Credits |   Type    |     or      |   or
    |        |         |         |           | Chunk Lists | Reply Msg
    +--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
RPC message follows. As an implementation hint: a gather operation
on the Send of the RDMA RPC message can be used to marshal the
initial header, the chunk list, and the RPC message itself.
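
That gather hint might be realized roughly as in the following C
sketch, where a three-element gather list describes the header, chunk
lists and RPC message for a single RDMA Send. The rdma_post_send()
interface and the sge structure are hypothetical stand-ins for a
local RDMA provider interface.

    #include <stdint.h>

    struct sge { void *addr; uint32_t length; uint32_t lkey; };

    /*
     * Sketch of a gathered RDMA Send of header + chunk lists + RPC
     * message.  The receiver sees a single contiguous message in its
     * pre-posted buffer.
     */
    int
    send_rpc_rdma_msg(struct rdma_conn *conn,
                      void *hdr,   uint32_t hdr_len,   uint32_t hdr_key,
                      void *clist, uint32_t clist_len, uint32_t clist_key,
                      void *rpc,   uint32_t rpc_len,   uint32_t rpc_key)
    {
        struct sge gather[3] = {
            { hdr,   hdr_len,   hdr_key   },   /* fixed transport header */
            { clist, clist_len, clist_key },   /* read/write chunk lists */
            { rpc,   rpc_len,   rpc_key   },   /* RPC call or reply body */
        };

        return rdma_post_send(conn, gather, 3);
    }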
4.2. XDR Language Description
Here is the message layout in XDR language.
    struct xdr_rdma_segment {
        uint32 handle;          /* Registered memory handle */
        uint32 length;          /* Length of the chunk in bytes */
        uint64 offset;          /* Chunk virtual address or offset */
    };

    struct xdr_read_chunk {
        uint32 position;        /* Position in XDR stream */
        struct xdr_rdma_segment target;
    };

    struct xdr_read_list {
        struct xdr_read_chunk entry;
        struct xdr_read_list *next;
    };
    struct xdr_write_chunk {
        struct xdr_rdma_segment target<>;
    };

    struct xdr_write_list {
        struct xdr_write_chunk entry;
        struct xdr_write_list *next;
    };

    struct rdma_msg {
        uint32 rdma_xid;        /* Mirrors the RPC header xid */
        uint32 rdma_vers;       /* Version of this protocol */
        uint32 rdma_credit;     /* Buffers requested/granted */
        rdma_body rdma_body;
    };

    enum rdma_proc {
        RDMA_MSG=0,     /* An RPC call or reply msg */
        RDMA_NOMSG=1,   /* An RPC call or reply msg - separate body */
        RDMA_MSGP=2,    /* An RPC call or reply msg with padding */
        RDMA_DONE=3,    /* Client signals reply completion */
        RDMA_ERROR=4    /* An RPC RDMA encoding error */
    };

    union rdma_body switch (rdma_proc proc) {
        case RDMA_MSG:
            rpc_rdma_header rdma_msg;
        case RDMA_NOMSG:
            rpc_rdma_header_nomsg rdma_nomsg;
        case RDMA_MSGP:
            rpc_rdma_header_padded rdma_msgp;
        case RDMA_DONE:
            void;
        case RDMA_ERROR:
            rpc_rdma_error rdma_error;
    };
    struct rpc_rdma_header {
        struct xdr_read_list *rdma_reads;
        struct xdr_write_list *rdma_writes;
        struct xdr_write_chunk *rdma_reply;
        /* rpc body follows */
    };

    struct rpc_rdma_header_nomsg {
        struct xdr_read_list *rdma_reads;
        struct xdr_write_list *rdma_writes;
        struct xdr_write_chunk *rdma_reply;
    };

    struct rpc_rdma_header_padded {
        uint32 rdma_align;      /* Padding alignment */
        uint32 rdma_thresh;     /* Padding threshold */
        struct xdr_read_list *rdma_reads;
        struct xdr_write_list *rdma_writes;
        struct xdr_write_chunk *rdma_reply;
        /* rpc body follows */
    };

    enum rpc_rdma_errcode {
        ERR_VERS = 1,
        ERR_CHUNK = 2
    };

    union rpc_rdma_error switch (rpc_rdma_errcode) {
        case ERR_VERS:
            uint32 rdma_vers_low;
            uint32 rdma_vers_high;
        case ERR_CHUNK:
            void;
        default:
            uint32 rdma_extra[8];
    };
5. Large Chunkless Messages
The receiver of RDMA Send messages is required to have previously
posted one or more correctly sized buffers. The client can inform
the server of the maximum size of its RDMA Send messages via the
Connection Configuration Protocol described later in this document.
Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers. Even large messages like NFS READ
or WRITE will be quite small once the chunks are removed from the
message. However, there may be large, chunkless messages that would
demand a very large buffer be posted. A good example is an NFS
READDIR reply which may contain a large number of small filename
strings. Also, the NFS version 4 protocol [RFC3530] features
COMPOUND request and reply messages of unbounded length.
Ideally, each upper layer will negotiate these limits. However, it
is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk
One relatively simple method is to have the client identify any RPC
message that exceeds the server's posted buffer size and move it
separately as a chunk, i.e. reference it as the first entry in the
read chunk list with an XDR position of zero.
    Normal Message

    +--------+---------+---------+------------+-------------+----------
    |        |         |         |            |             | RPC Call
    |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
    |        |         |         |            |             | Reply Msg
    +--------+---------+---------+------------+-------------+----------

    Long Message

    +--------+---------+---------+------------+-------------+
    |        |         |         |            |             |
    |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
    |        |         |         |            |             |
    +--------+---------+---------+------------+-------------+
                                                      |
                                                      |  +----------
                                                      |  | Long RPC Call
                                                      +->|     or
                                                         | Reply Message
                                                         +----------
If the receiver gets a transport header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
position, it allocates a registered buffer and issues an RDMA Read of
the long RPC message into it. The receiver then proceeds to XDR
decode the RPC message as if it had received it in-line with the Send
data. Further decoding may issue additional RDMA Reads to bring over
additional chunks.
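
A sketch of this receiver handling, in C, is shown below. The helper
names (alloc_registered(), rdma_read(), send_rdma_done(), and so on)
are hypothetical; only the sequence of steps is intended to match the
description above.

    #include <stdint.h>

    /*
     * Sketch of receiver handling for RDMA_NOMSG with a position-zero
     * read chunk referring to the entire RPC message.
     */
    int
    receive_long_message(struct rdma_conn *conn, uint32_t xid,
                         const struct xdr_read_chunk *rc)
    {
        if (rc == NULL || rc->position != 0)
            return -1;                       /* not a long-message transfer */

        void    *buf  = alloc_registered(rc->target.length);
        uint32_t lkey = local_key_of(buf);

        rdma_read(conn, buf, lkey, rc->target.handle,
                  rc->target.offset, rc->target.length);
        rdma_wait_read_complete(conn);

        decode_rpc_message(buf, rc->target.length);  /* as if in-line */
        send_rdma_done(conn, xid);       /* let sender free its buffer */
        return 0;
    }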
Although the handling of long messages requires one extra network
turnaround, in practice these messages should be rare if the posted
receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers.
An RPC with long reply returned via RDMA Read looks like this:
    Client                                Server
       |              RPC Call              |
  Send | ------------------------------>    |
       |                                    |
       |        RPC Transport Header        |
       |    <------------------------------ | Send
       |                                    |
       |         Long RPC Reply Msg         |
  Read |    ------------------------------+ |
       |    <-----------------------------v |
       |                                    |
       |              RPC Done              |
  Send | ------------------------------>    |
5.2. RDMA Write of Long Replies
An alternative method of handling long, chunkless RPC replies is to
have the client post a large buffer into which the server can write a
large RPC reply. This has the advantage that an RDMA Write may incur
somewhat lower network latency than an RDMA Read. Additionally, it
removes the need for the RDMA_DONE message that would be required if
the large reply were instead returned as a Read chunk.
This protocol supports direct return of a large reply via the
inclusion of an optional rdma_reply write chunk after the read chunk
list and the write chunk list. The client allocates a buffer sized
to receive a large reply and enters its STag, address and length in
the rdma_reply write chunk. If the reply message is too long to
return in-line with an RDMA Send (exceeds the size of the client's
posted receive buffer), even with read chunks removed, then the
server RDMA writes the RPC reply message into the buffer indicated by
the rdma_reply chunk. If the client does not provide an rdma_reply
chunk, or if it is too small, then the message must be returned as a
Read chunk.
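
The server's choice among these reply methods can be sketched as
follows, in C. All function and structure names are hypothetical;
"inline_max" stands for the size of the client's posted receive
buffer as known to the server.

    #include <stdint.h>

    /*
     * Sketch of the server's reply-method decision for a possibly
     * long, chunkless reply.
     */
    int
    send_reply(struct rdma_conn *conn, struct reply *rep,
               const struct xdr_write_chunk *rdma_reply,
               uint32_t inline_max)
    {
        if (rep->length <= inline_max)
            return send_inline(conn, rep);           /* ordinary Send */

        if (rdma_reply != NULL &&
            write_chunk_capacity(rdma_reply) >= rep->length) {
            /* RDMA Write the body, then Send just the transport header. */
            rdma_write_into_chunk(conn, rdma_reply, rep->data, rep->length);
            return send_transport_header_only(conn, rep);
        }

        /* No usable rdma_reply chunk: return the body as a read chunk
         * at XDR position zero (the client will RDMA Read it). */
        return send_as_read_chunk(conn, rep);
    }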
An RPC with long reply returned via RDMA Write looks like this:
    Client                                Server
       |      RPC Call with rdma_reply      |
  Send | ------------------------------>    |
       |                                    |
       |         Long RPC Reply Msg         |
       |    <------------------------------ | Write
       |                                    |
       |        RPC Transport Header        |
       |    <------------------------------ | Send
The use of RDMA Write to return long replies requires that the
client application anticipate a long reply and have some knowledge
of its size so that a correctly sized buffer can be allocated.
This is certainly true of NFS READDIR replies, where the client
already provides an upper bound on the size of the encoded directory
fragment to be returned by the server.
5.3. RPC RDMA Header Errors
When a peer receives an RPC RDMA message, it must perform certain
basic validity checks on the header and chunk contents. If errors
are detected in an RPC request, an RDMA_ERROR reply should be
generated.
Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC RDMA header version which it
does not support (currently this draft defines only version 1), it
replies with an error code of ERR_VERS, and provides the low and high
inclusive version numbers it does, in fact, support. The version
number in this reply can be any value otherwise valid at the
receiver. When other decoding errors are detected in the header or
chunks, either an RPC decode error may be returned, or the error code
ERR_CHUNK.
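
For illustration, building an ERR_VERS reply might look like the
following C sketch. The flattened structure and the helper functions
used here are hypothetical; the exact C binding of the XDR union in
Section 4.2 depends on the local XDR tooling.

    #include <stdint.h>

    /*
     * Sketch of constructing an RDMA_ERROR / ERR_VERS reply for a
     * request carrying an unsupported header version.
     */
    void
    reply_version_mismatch(struct rdma_conn *conn, uint32_t xid)
    {
        struct rdma_error_reply err;    /* hypothetical flattened view */

        err.xid       = xid;            /* mirror the offending request */
        err.vers      = 1;              /* a version valid at this receiver */
        err.credit    = current_grant(conn);
        err.proc      = RDMA_ERROR;
        err.errcode   = ERR_VERS;
        err.vers_low  = 1;              /* lowest version supported */
        err.vers_high = 1;              /* highest version supported */

        send_xdr_encoded(conn, &err);   /* hypothetical helper */
    }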
6. Connection Configuration Protocol
RDMA Send operations require the receiver to post one or more buffers
at the RDMA connection endpoint, each large enough to receive the
largest Send message. Buffers are consumed as Send messages are
received. If a buffer is too small, or if there are no buffers
posted, the RDMA transport will return an error and break the RDMA
connection. The receiver must post sufficient, correctly sized
buffers to avoid buffer overrun or capacity errors.
The protocol described above includes only a mechanism for managing
the number of such receive buffers, and no explicit features to allow
the client and server to provision or control buffer sizing, nor any
other session parameters.
In the past, this type of connection management has not been
necessary for RPC. RPC over UDP or TCP does not have a protocol to
negotiate the link. The server can get a rough idea of the maximum
size of messages from the server protocol code. However, a protocol
to negotiate transport features on a more dynamic basis is desirable.
The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to
inform the client of its connection limits.
6.1. Initial Connection State
This protocol will be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates in-
band, i.e. it uses the connection itself to negotiate the connection
parameters. To provide a basis for connection negotiation, the
connection is assumed to provide a basic level of interoperability:
the ability to exchange at least one RPC message at a time that is at
least 1 KB in size. The server may exceed this basic level of
configuration, but the client must not assume it.
6.2. Protocol Description
Version 1 of the protocol consists of a single procedure that allows
the client to inform the server of its connection requirements and
the server to return connection information to the client.
The maxcallsize argument is the maximum size of an RPC call message
that the client will send in-line in an RDMA Send message to the
server. The server may return a maxcallsize value that is smaller or
larger than the client's request. The client must not send an in-
line call message larger than what the server will accept. The
maxcallsize limits only the size of in-line RPC calls. It does not
limit the size of long RPC messages transferred as an initial chunk
in the Read chunk list.
The maxreplysize is the maximum size of an in-line RPC message that
the client will accept from the server.
The maxrdmaread is the maximum number of RDMA Reads which may be
active at the peer. This number corresponds to the incoming RDMA
Read count ("IRD") configured into each originating endpoint by the
client or server. If more than this number of RDMA Read operations
by the connected peer are issued simultaneously, connection loss or
suboptimal flow control may result; therefore, the value should be
observed at all times. The peers' values need not be equal. If
zero, the peer must not issue requests which require RDMA Read to
satisfy, as no transfer will be possible.
The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays. The client can
use this value to compute the number of prepended pad bytes when XDR
encoding opaque values in the RPC call message.
    typedef unsigned int uint32;

    struct config_rdma_req {
        uint32 maxcallsize;     /* max size of in-line RPC call */
        uint32 maxreplysize;    /* max size of in-line RPC reply */
        uint32 maxrdmaread;     /* max active RDMA Reads at client */
    };
    struct config_rdma_reply {
        uint32 maxcallsize;     /* max call size accepted by server */
        uint32 align;           /* server's receive buffer alignment */
        uint32 maxrdmaread;     /* max active RDMA Reads at server */
    };

    program CONFIG_RDMA_PROG {
        version VERS1 {
            /*
             * Config call/reply
             */
            config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
        } = 1;
    } = nnnnnn;    <-- Need program number assigned
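
For illustration only, a client might issue the configuration call
through the standard ONC RPC client interface roughly as sketched
below. The use of rpcgen-generated XDR routines
(xdr_config_rdma_req, xdr_config_rdma_reply) and of a CLIENT handle
already bound over the RDMA transport are assumptions of this sketch,
and the request values shown are arbitrary examples.

    #include <rpc/rpc.h>
    #include <string.h>

    int
    negotiate_connection(CLIENT *clnt)
    {
        struct config_rdma_req   req;
        struct config_rdma_reply rep;
        struct timeval tv = { 25, 0 };

        req.maxcallsize  = 1024;   /* largest in-line call we will send */
        req.maxreplysize = 4096;   /* largest in-line reply we accept */
        req.maxrdmaread  = 8;      /* our configured IRD */

        memset(&rep, 0, sizeof(rep));
        if (clnt_call(clnt, 1 /* CONF_RDMA */,
                      (xdrproc_t)xdr_config_rdma_req,   (caddr_t)&req,
                      (xdrproc_t)xdr_config_rdma_reply, (caddr_t)&rep,
                      tv) != RPC_SUCCESS)
            return -1;

        /* rep.maxcallsize, rep.align and rep.maxrdmaread now bound
         * this connection's in-line sizes, padding and RDMA Read use. */
        return 0;
    }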
7. Memory Registration Overhead
RDMA requires that all data be transferred between registered memory
regions at the source and destination. All protocol headers as well
as separately transferred data chunks must use registered memory.
Since the cost of registering and de-registering memory can be a
large proportion of the RDMA transaction cost, it is important to
minimize registration activity. This is easily achieved within
RPC-controlled memory by allocating chunk list data and RPC headers in a
reusable way from pre-registered pools.
The data chunks transferred via RDMA may occupy memory that persists
outside the bounds of the RPC transaction. Hence, the default
behavior of an RDMA transport is to register and de-register these
chunks on every transaction. However, this is not a limitation of
the protocol - only of the existing local RPC API. The API is easily
extended through such functions as rpc_control(3) to change the
default behavior so that the application can assume responsibility
for controlling memory registration through an RPC-provided
registered memory allocator.
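
One way to amortize registration cost for RPC-controlled memory is a
simple pre-registered buffer pool, sketched below in C. The
rdma_register() call is a hypothetical placeholder for the local
provider's registration interface.

    #include <stdint.h>
    #include <stdlib.h>

    /*
     * Sketch of a pre-registered buffer pool for transport headers
     * and chunk list data.  Buffers are registered once at pool
     * creation, not per RPC transaction.
     */
    struct reg_buf {
        void           *addr;
        uint32_t        length;
        uint32_t        lkey;       /* local registration handle */
        struct reg_buf *next;
    };

    static struct reg_buf *free_list;

    void
    pool_init(unsigned int count, uint32_t size)
    {
        for (unsigned int i = 0; i < count; i++) {
            struct reg_buf *b = malloc(sizeof(*b));
            b->addr   = malloc(size);
            b->length = size;
            b->lkey   = rdma_register(b->addr, size);  /* register once */
            b->next   = free_list;
            free_list = b;
        }
    }

    struct reg_buf *
    pool_get(void)                       /* no per-call registration */
    {
        struct reg_buf *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;
    }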
8. Errors and Error Recovery
Error reporting and recovery is outside the scope of this protocol.
It is assumed that the link itself will provide some degree of error
detection and retransmission. Additionally, the RPC layer itself can
accept errors from the link level and recover via retransmission.
RPC recovery can handle complete loss and re-establishment of the
link.
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server. The
mechanism used to obtain this address, and to open an RDMA connection
is dependent on the type of RDMA transport, and outside the scope of
this protocol.
10. RPC Binding
RPC services normally register with a portmap or rpcbind service,
which associates an RPC program number with a service address. In
the case of UDP or TCP, the service address for NFS is normally port
2049. This policy should be no different with RDMA interconnects.
One possibility is to have the server's portmapper register itself on
the RDMA interconnect at a "well known" service address. On UDP or
TCP, this corresponds to port 111. A client could connect to this
service address and use the portmap protocol to obtain a service
address in response to a program number, e.g. a VI discriminator or
an Infiniband GID.
11. Security
ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS can provide message authentication, integrity
checking, and privacy. This security mechanism will be unaffected by
the RDMA transport. The data integrity and privacy features alter
the body of the message, presenting it as a single chunk. For large
messages the chunk may be large enough to qualify for RDMA Read
transfer. However, there is much data movement associated with
computation and verification of integrity, or encryption/decryption,
so any performance advantage will be lost.
Exposed memory addresses should raise no new issues. The only exposed
addresses are those in the chunk lists and in the transport packets
generated by RDMA operations. The data contained in these addresses
is adequately protected by RPCSEC_GSS integrity and privacy.
RPCSEC_GSS security mechanisms are typically implemented by the host
CPU. This additional data movement and CPU use may cancel out much
of the RDMA direct placement and offload benefit.
A more appropriate security mechanism for RDMA links may be link-
level protection, such as IPsec, which may be co-located in the RDMA
link hardware. The use of link-level protection may be negotiated
through the use of a new RPCSEC_GSS mechanism like the Credential
Cache GSS Mechanism (CCM) [CCM].
12. IANA Considerations
As a new RPC transport, this protocol should have no effect on RPC
program numbers or registered port numbers. The new RPC transport
should be assigned a new RPC "netid". If adopted, the Connection
Configuration protocol described herein will require an RPC program
number assignment.
13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
contributions to this document.
14. Normative References
[RFC1831]
R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
Version 2",
Standards Track RFC,
http://www.ietf.org/rfc/rfc1831.txt
[RFC1832]
R. Srinivasan, "XDR: External Data Representation Standard",
Standards Track RFC,
http://www.ietf.org/rfc/rfc1832.txt
[RFC1813]
B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
Specification",
Informational RFC,
http://www.ietf.org/rfc/rfc1813.txt
[RFC3530]
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
Eisler, D. Noveck, "NFS version 4 Protocol",
Standards Track RFC,
http://www.ietf.org/rfc/rfc3530.txt
[RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt
15. Informative References
[RDMA] R. Recio et al, "An RDMA Protocol Specification",
Internet Draft Work in Progress,
http://www.ietf.org/internet-drafts/
draft-ietf-rddp-rdmap-01.txt
[CCM] M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism",
Internet Draft Work in Progress,
http://www.ietf.org/internet-drafts/
draft-ietf-nfsv4-ccm-03.txt
[NFSRDMA]
T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions"
Internet Draft Work in Progress,
http://www.ietf.org/internet-drafts/
draft-ietf-nfsv4-session-00.txt
[NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement"
Internet Draft Work in Progress,
http://www.ietf.org/internet-drafts/
draft-ietf-nfsv4-nfsdirect-00.txt
[RDDP]
Remote Direct Data Placement Working Group Charter,
http://www.ietf.org/html.charters/rddp-charter.html
[RDDPPS]
Remote Direct Data Placement Working Group Problem Statement,
Internet Draft Work in Progress,
A. Romanow, J. Mogul, T. Talpey, S. Bailey,
http://www.ietf.org/internet-drafts/
draft-ietf-rddp-problem-statement-04.txt
[IB]
Infiniband Architecture Specification,
http://www.infinibandta.org
16. Authors' Addresses
Brent Callaghan
Sun Microsystems, Inc.
17 Network Circle
Menlo Park, California 94025 USA
Phone: +1 650 786 5067
EMail: brent.callaghan@sun.com
Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com
17. Full Copyright Statement
Copyright (C) The Internet Society (2004). This document is subject
to the rights, licenses and restrictions contained in BCP 78
and except as set forth therein, the authors retain all their
rights.
This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REP-
RESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.