RPC-over-RDMA Extensions to Reduce Internode Round-trips
draft-dnoveck-nfsv4-rpcrdma-rtrext-00
| Document | Type | Active Internet-Draft (individual) | |
|---|---|---|---|
| Author | David Noveck | ||
| Last updated | 2016-06-05 | ||
| Stream | (None) | ||
| Formats | plain text xml htmlized pdfized bibtex | ||
| Stream | Stream state | (No stream defined) | |
| Consensus boilerplate | Unknown | ||
| RFC Editor Note | (None) | ||
| IESG | IESG state | I-D Exists | |
| Telechat date | (None) | ||
| Responsible AD | (None) | ||
| Send notices to | (None) |
Network File System Version 4 D. Noveck
Internet-Draft HPE
Intended status: Standards Track June 5, 2016
Expires: December 7, 2016
RPC-over-RDMA Extensions to Reduce Internode Round-trips
draft-dnoveck-nfsv4-rpcrdma-rtrext-00
Abstract
It is expected that the RPC-over-RDMA transport will, at some point,
allow protocol extensions to be defined. This would provide for the
specification of OPTIONAL features allowing participants who
implement the OPTIONAL features to cooperate as specified by that
extension, while still interoperating with participants who do not
support that extension.
A particular extension is described herein, whose purpose is to
reduce the latency due to inter-node round-trips needed to effect
operations which involve direct data placement or which transfer RPC
messages longer than the fixed inline buffer size limit.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 7, 2016.
Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
Noveck Expires December 7, 2016 [Page 1]
Internet-Draft RPC/RDMA Round-trip Reductions June 2016
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
1.2. Introduction . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Prerequisites . . . . . . . . . . . . . . . . . . . . . . 3
1.4. Role Terminology . . . . . . . . . . . . . . . . . . . . 4
2. Extension Overview . . . . . . . . . . . . . . . . . . . . . 5
3. Direct Data Placement Feature . . . . . . . . . . . . . . . . 5
3.1. Current Situation . . . . . . . . . . . . . . . . . . . . 5
3.2. RDMA_MSGP . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3. Send-based DDP . . . . . . . . . . . . . . . . . . . . . 7
3.4. Other DDP-Related Extensions . . . . . . . . . . . . . . 7
4. Message Continuation Feature . . . . . . . . . . . . . . . . 8
4.1. Current Situation . . . . . . . . . . . . . . . . . . . . 8
4.2. Message Continuation Changes . . . . . . . . . . . . . . 9
4.3. Message Continuation and Credits . . . . . . . . . . . . 9
5. Protocol Additions . . . . . . . . . . . . . . . . . . . . . 10
5.1. New Operation Support . . . . . . . . . . . . . . . . . . 10
5.2. Message Continuation Support . . . . . . . . . . . . . . 11
5.3. Send-based DDP Support . . . . . . . . . . . . . . . . . 11
5.4. Error Reporting . . . . . . . . . . . . . . . . . . . . . 12
6. XDR Preliminaries . . . . . . . . . . . . . . . . . . . . . . 13
6.1. Message Continuation Preliminaries . . . . . . . . . . . 13
6.2. Data Placement Preliminaries . . . . . . . . . . . . . . 14
7. Data Placement Structures . . . . . . . . . . . . . . . . . . 17
7.1. Data Placement Overview . . . . . . . . . . . . . . . . . 17
7.2. Buffer Structure Definition . . . . . . . . . . . . . . . 18
7.3. Message DDP Structures . . . . . . . . . . . . . . . . . 20
7.4. Response Direction DDP Structures . . . . . . . . . . . . 21
8. Transport Characteristics . . . . . . . . . . . . . . . . . . 24
8.1. Characteristics List . . . . . . . . . . . . . . . . . . 24
8.2. RTR Support Characteristic . . . . . . . . . . . . . . . 25
8.3. Receive Buffer Structure Characteristic . . . . . . . . . 25
8.4. Request Transmission Receive Limit Characteristic . . . . 26
8.5. Response Transmission Send Limit Characteristic . . . . . 26
9. New Operations . . . . . . . . . . . . . . . . . . . . . . . 27
9.1. Operations List . . . . . . . . . . . . . . . . . . . . . 27
9.2. Transmit Request Operation . . . . . . . . . . . . . . . 28
9.3. Transmit Response Operation . . . . . . . . . . . . . . . 28
9.4. Transmit Continue Operations . . . . . . . . . . . . . . 29
9.5. Transmit Error Operations . . . . . . . . . . . . . . . . 30
10. XDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
10.1. Code Component License . . . . . . . . . . . . . . . . . 34
10.2. XDR Proper for Extension . . . . . . . . . . . . . . . . 36
11. Security Considerations . . . . . . . . . . . . . . . . . . . 36
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 36
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 36
13.1. Normative References . . . . . . . . . . . . . . . . . . 36
13.2. Informative References . . . . . . . . . . . . . . . . . 37
Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 37
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 38
1. Preliminaries
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
1.2. Introduction
This document describes a potential extension to the RPC-over-RDMA
protocol, which would allow participating implementations to have
more flexibility in how they use RDMA sends and receives to effect
necessary transmission of RPC requests and replies.
In contrast to existing facilities defined in RPC-over-RDMA Version
One in which the mapping between RPC messages and RPC-over-RDMA
messages is strictly one-to-one and DDP is effected only through use
of explicit RDMA operations, the following features are made
available through this extension:
o The ability to effect Direct Data Placement in the context of a
single RPC-over-RDMA transmission, rather than requiring explicit
RDMA operations to effect the necessary placement.
o The ability to continue an RPC request or reply over multiple RPC-
over-RDMA transmissions.
1.3. Prerequisites
This document is written assuming that certain underlying facilities
will be made available to build upon, in the context of a future
version of RPC-over-RDMA. It is most likely that such facilities
will be first available in Version Two.
o A protocol extension mechanism is needed to enable the extensions
to RPC-over-RDMA described here.
This document is currently written to conform to the extension
model for the proposed RPC-over-RDMA Version Two as described in
[rpcrdmav2].
o An existing means of communicating transport characteristics
between the RPC-over-RDMA endpoints is assumed.
This document is currently written assuming the transport
characteristic model defined in [xcharext] will be available and
can be extended to meet the needs of this extension.
As the documents referred to above are currently personal
Internet-Drafts and subject to change, adjustments to this document
are expected to be necessary when, and if, the needed facilities are
defined in working group documents.
1.4. Role Terminology
A number of different terms are used regarding the roles of the two
participants in an RPC-over-RDMA connection. Some of these roles last
for the duration of a connection while others vary from request to
request or from message to message.
The roles of the client and server are fixed for the lifetime of the
connection, with the client defined as the endpoint which initiated
the connection.
The roles of requester and responder often parallel those of client
and server, although this is not always the case. Most requests are
made in the forward direction, in which the client is the requester
and the server is the responder. However, backward direction
requests are possible, in which case the server is the requester and
the client is the responder. As a result, clients and servers may
both act as requesters and responders for different requests issued
on the same connection.
The roles of sender and receiver vary from message to message. With
regard to the messages described in this document, the sender may act
as a requester by sending RPC requests or a responder by sending RPC
replies or as both at the same time by sending a mix of the two.
2. Extension Overview
This extension is intended to function as part of RPC-over-RDMA and
implementations should successfully interoperate with existing RPC-
over-RDMA Version One implementations. Nevertheless, this extension
seeks to take a somewhat different approach to high-performance RPC
operation than has been used previously in that it seeks to de-
emphasize the use of explicit RDMA operations. It does this in two
ways:
o By implementing a send-based form of Direct Data Placement (see
Section 3), use of explicit RDMA operations can be avoided in many
common cases in which data is directly placed.
o Use of explicit RDMA to support reply chunks and position-zero
read chunks can be avoided by allowing a single message to be
split into multiple transmissions. This can be used to avoid many
instances of the only existing use of explicit RDMA operations not
associated with Direct Data Placement.
While use of explicit RDMA operations allows the cost of the actual
data transfer to be offloaded from the client and server CPUs to the
RNIC, there are ancillary costs in setting up the transfer that
cannot be ignored. As a result, send-based functions are often
preferable, since the RNIC also uses DMA to effect these operations.
In addition, the cost of the additional inter-node round trips
required by explicit RDMA operations can be an issue, which
becomes increasingly troublesome as internode distances increase.
Once one moves from in-machine-room to campus-wide or metropolitan-
area distances the additional round-trip delay of 16 microseconds per
mile becomes an issue impeding use of explicit RDMA operations.
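As a rough illustration of the arithmetic behind this figure, the
following Python sketch estimates the delay added by extra round
trips. The function name and the sample distances are illustrative
assumptions; only the figure of 16 microseconds per mile comes from
the text above.

```python
# Added round-trip delay per mile of internode distance, as cited in the text.
RTT_US_PER_MILE = 16.0


def added_delay_us(distance_miles, extra_round_trips):
    """Estimate the delay (in microseconds) added by extra round trips."""
    return distance_miles * RTT_US_PER_MILE * extra_round_trips


if __name__ == "__main__":
    # Hypothetical distances: machine room, campus, metropolitan area.
    for miles in (0.01, 1.0, 25.0):
        print(miles, added_delay_us(miles, extra_round_trips=1))
```

Under these assumptions, a single extra round trip at 25 miles adds
roughly 400 microseconds, before any setup or interrupt costs.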
3. Direct Data Placement Feature
3.1. Current Situation
Although explicit RDMA operations are used in the existing RPC-over-
RDMA protocol for purposes unrelated to Direct Data Placement, all
DDP is effected using explicit RDMA operations.
As a result, all operations involving Direct Data Placement involve
multiple internode round trips.
3.2. RDMA_MSGP
Although this was not stated explicitly, it appears that RDMA_MSGP
(defined in [RFC5666], removed from RPC-over-RDMA by [rfc5666bis]),
was an early attempt to effect correct placement of bulk data within
a single RPC-over-RDMA transmission.
As things turned out, the fields within the RDMA_MSGP header were not
described in [RFC5666] in a way that allowed this message type to be
implemented.
In attempting to provide DDP functionality, we have to keep in mind
and avoid the problems that led to failure of RDMA_MSGP. It appears
that the problems go deeper than neglecting to write a few relevant
sentences. It is helpful to note that:
o The inline message size limits eventually adopted were too small
to allow RDMA_MSGP to be used effectively. This is true of both
the 1K limit in Version One [rfc5666bis] and the 4K limit
specified in [rpcrdmav2].
On the other hand, there is text within [RFC5667] that suggests
that much longer messages were anticipated at some points during
the evolution of RPC-over-RDMA.
o The fact that NFSv4 COMPOUNDs often have additional operations
beyond the one including the bulk data means that the RDMA_MSGP
model cannot be extended to NFSv4. As a result, the bulk data
needs to be excised from the data stream just as chunks are, so
that the payload stream can include non-bulk data both before and
after the logical position of the excised bulk data.
o In order for the sender to determine the appropriate amount of
padding necessary within a transmission to place the bulk data at
the proper position within the receive buffer, the sender must know
more about the structure of the receiver's buffers. Since the
padding needs to bring the bulk data to a position within the
buffer that is appropriate to receive the bulk data, the sender
needs to know where within the receive buffers such DDP-eligible
areas are located.
o While appropriate padding could place the bulk data within a large
WRITE into an appropriately aligned buffer or set of buffers, there
is no corresponding provision for the bulk data associated with a
READ. In short, there is no way to indicate to the responder that
it should use RDMA_MSGP to appropriately place bulk data in the
response.
o There is no explicit discussion of the required padding's use in
effecting proper data placement or connection with the ULB's
specification of DDP-eligible XDR.
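The padding computation discussed in the items above can be sketched
as follows. This is a hypothetical illustration, not part of
RDMA_MSGP as defined in [RFC5666]: it assumes the sender already
knows the offset of a DDP-eligible area within the receiver's buffer,
which is exactly the information RDMA_MSGP did not provide. All
names are invented for the example.

```python
def padding_needed(bytes_before_bulk, ddp_area_offset):
    """Return the pad bytes needed so bulk data starts at ddp_area_offset.

    bytes_before_bulk: header plus payload bytes preceding the bulk data.
    ddp_area_offset:   offset of the DDP-eligible area within the receive
                       buffer (assumed to be known to the sender).
    Returns None when the preceding bytes already overrun the area, in
    which case padding alone cannot place the bulk data correctly.
    """
    if bytes_before_bulk > ddp_area_offset:
        return None
    return ddp_area_offset - bytes_before_bulk
```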
To summarize, RDMA_MSGP was an attempt to properly place data that
was thought of as a local optimization, and insufficient attention was
given to it to make it successful. As a result, as RPC-over-RDMA
Version One was developed, Direct Data Placement was identified with
the use of explicit RDMA operations, and the possibility of Data
Placement within sends was not recognized.
3.3. Send-based DDP
In the sections below, we will attempt to provide send-based data
placement in a more complete way:
o By defining the structure of receive buffers as a transport
characteristic.
o By treating positioning of bulk data within a message as an
instance of DDP, causing the bulk data to be excised from the
payload XDR stream, as is the case with other forms of DDP.
o By defining new DDP control data structures that support both
send-based DDP and the form of DDP using explicit RDMA operations
that was specified in RPC-over-RDMA Version One. These new
control structures, described in Section 7.1, are organized
differently from the chunk-based structures described in
[rfc5666bis].
3.4. Other DDP-Related Extensions
In order to support send-based DDP, new DDP-related data structures
have been defined, as described in Sections 7.3 and 7.4.
These new data structures support both send-based and RDMA-operation-
based DDP. In addition, because of the restructuring described in
Section 7.1, a number of additional facilities are made available:
o The ability to restrict entries regarding DDP in response data to
XDR data items generated in response to performing particular
constituent operations within a given RPC request (e.g. operations
within an NFSv4 COMPOUND).
o The ability to make use of DDP contingent on the actual length of
a DDP-eligible data item in the response.
o The ability to specify whether use of DDP for a particular DDP-
eligible data item is required or optional.
These additional facilities will be available to implementations that
do not support send-based DDP, as long as both parties support the
OPTIONAL Header types that include these new structures. For more
information about the relationships among the new transport
characteristics, operations, and features, see Section 5.
4. Message Continuation Feature
4.1. Current Situation
Within RPC-over-RDMA Version One [rfc5666bis], each transmission of a
request or reply involves sending a single RDMA send message and
conversely each message-related transmission involves only a single
RPC request or reply.
This strict one-to-one model leads to some potential performance
issues.
o Because of RDMA's use of fixed-size receives, some requests and
replies will inevitably not fit in the limited space available,
even if they do not contain any DDP-eligible bulk data.
Such cases will raise performance issues because, to deal with
them, the server is interrupted twice to receive a single request
and all the necessary transfers are serialized. In particular,
there are two server interrupt latencies involved before the
server can process the actual request, in addition to the OTW
round-trip latencies.
o In the case of replies, there may be cases in which reply chunks
need to be allocated and registered even if the actual reply would
fit within the fixed receive-size limit. Because the decision to
create a reply chunk is made at the time the request is sent, even
an extremely low probability of a longer reply will trigger
allocation of a reply chunk.
Because this decision is made in conformance with ULB rules,
which, by their nature, may only reference a limited set of data,
a reply chunk may be required even when the actual probability of
a long reply is exactly zero. For example, a GETATTR request can
generate a long reply due to a long ACL, and thus a COMPOUND with
this operation might allocate a reply chunk, even if the specific
file system being interrogated only supports ACLs of limited
size, or the GETATTR in question does not interrogate one of the
ACL attributes. Also, the OWNER attribute is a string, and it may
be impossible to determine a priori that the owner of any
particular file has no chance of requiring more than, for example,
4K bytes of space. The fact that there are no such user names,
while probably true, is not a fact that RPC-over-RDMA
implementations can depend on.
4.2. Message Continuation Changes
Continuing a single RPC request or reply is addressed by defining
separate optional header types to begin and to continue sending a
single RPC message, rather than by creating a header with a
continuation bit. In this approach, all of the DDP-related fields,
which include support for send-based DDP, appear in the starting
header (of type ROPT_XMTREQ or ROPT_XMTRESP) and apply to the RPC
message as a whole.
Later RPC-over-RDMA messages (of type ROPT_XMTCONT) may extend the
payload stream and/or provide additional buffers to which bulk data
can be directed.
In this case, all of the RPC-over-RDMA messages used together are
referred to as a transmission group and must be received in order
without any intervening message.
In implementations using this optional facility, those decoding RPC
messages received using RPC-over-RDMA no longer have the assurance
that each RPC message is in a contiguous buffer. As most XDR
implementations are built based on the assumption that input will not
be contiguous, this will not affect performance in most cases.
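A receiver's handling of a transmission group might be sketched as
below. The tuple layout (XID, sequence number, payload bytes) and the
assumption that sequence numbers start at zero are hypothetical
choices made for illustration; the actual header layouts are given by
the XDR defined for this extension.

```python
def assemble_group(transmissions):
    """Concatenate the payload streams of one transmission group.

    transmissions: list of (xid, seq, payload) tuples in arrival order.
    Raises ValueError if a transmission belongs to another RPC message
    (an intervening message) or arrives out of order.
    """
    if not transmissions:
        raise ValueError("empty transmission group")
    first_xid = transmissions[0][0]
    parts = []
    for expected_seq, (xid, seq, payload) in enumerate(transmissions):
        if xid != first_xid:
            raise ValueError("intervening message from another RPC")
        if seq != expected_seq:
            raise ValueError("transmission received out of order")
        parts.append(payload)
    # Joining copies the data; an XDR decoder that accepts a list of
    # buffers could decode the parts in place instead.
    return b"".join(parts)
```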
4.3. Message Continuation and Credits
Using multiple transmissions to send a single request or response can
complicate credit management. In the case of the message
continuation feature, deadlocks can be avoided because use of message
continuation is not obligatory. The requester or responder can use
explicit RDMA operations if sufficient credits to use message
continuation are not available.
A requester is well positioned to make this choice with regard to the
sending of requests. The requester must know, before sending a
request, how long it will be, and therefore, how many credits it
would require to send the request using message continuation. If
these are not available, it can avoid message continuation by either
creating read chunks sufficient to make the payload stream fit in a
single transmission or by creating a position-zero read chunk.
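The requester's choice described above can be sketched as follows.
The sketch assumes, purely for illustration, that each transmission
consumes one credit and carries the same amount of payload; the
function names and return strings are hypothetical.

```python
def transmissions_needed(request_len, payload_per_transmission):
    """Number of transmissions (and, by assumption, credits) needed."""
    # Ceiling division; at least one transmission is always sent.
    return max(1, -(-request_len // payload_per_transmission))


def choose_request_method(request_len, payload_per_transmission,
                          credits_available):
    """Use message continuation when credits allow, else explicit RDMA."""
    needed = transmissions_needed(request_len, payload_per_transmission)
    if needed == 1:
        return "single-transmission"
    if needed <= credits_available:
        return "message-continuation"
    # Fall back to read chunks (possibly a position-zero read chunk).
    return "read-chunks"
```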
With regard to the response, the requester is not in position to know
exactly how long the response will be. However, the ULB will allow
the maximum response length to be determined based on the request.
This value can be used:
o To determine the maximum number of receive buffers that might be
required to receive any response sent.
o To allocate and register a reply chunk to hold a possible large
reply.
The requester can avoid doing the second of these if the responder
has indicated it can use message continuation to send the response.
In this case, it makes sure that the buffers will be available and
indicates to the responder how many additional buffers (in the form
of pre-posted receives) have been made available to accommodate
continuation transmissions.
When the responder processes the request, those additional receive
buffers may be used or not, or used only in part. This may be
because the response is shorter than the maximum possible response,
or because a reply chunk was used to transmit the response.
After the first or only transmission associated with the response is
received by the requester, it can be determined how many of the
additional buffers were used for the response. Any unused buffers
can be made available for other uses such as expanding the pool of
receive buffers available for the initial transmissions of responses
or for receiving opposite direction requests. Alternatively, they
can be kept in reserve for future uses, such as being made available
to future requests which have potentially long responses.
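The buffer accounting described in this section can be sketched as
below. It assumes, for illustration only, that the first transmission
of a response lands in an ordinarily posted receive buffer and that
each continuation transmission consumes exactly one reserved buffer;
the function names are hypothetical.

```python
def buffers_to_reserve(max_response_len, buffer_size):
    """Additional receive buffers needed for a maximum-length response."""
    total = max(1, -(-max_response_len // buffer_size))  # ceiling division
    return total - 1  # the first transmission uses an ordinary receive


def unused_buffers(reserved, transmissions_in_response):
    """Reserved buffers left over once the actual response is known."""
    used = max(0, transmissions_in_response - 1)
    return max(0, reserved - used)
```

Any buffers reported as unused could then be returned to the general
receive pool or held for a later request, as discussed above.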
5. Protocol Additions
In using existing RPC-over-RDMA facilities for protocol extension,
interoperability with existing implementations needs to be assured.
Because this document describes support for multiple features, we
need to clearly specify the various possible extensions and how peers
can determine whether certain facilities are supported by both ends
of the connection.
5.1. New Operation Support
Note that most of the new operations defined in this extension are
not tightly tied to a specific feature. XOPT_XMTREQ and XOPT_XMTRESP
are designed to support implementations that support either or both
of send-based DDP and message continuation. However, the converse is not
the case and these header types can be implemented by those not
supporting either of these features. For example, implementations
may only need support for the facilities described in Section 3.4.
Implementations may determine whether a peer implementation supports
XOPT_XMTREQ, XOPT_XMTRESP, or XOPT_XMTCONT by attempting these
operations. An alternative is to interrogate the RTR Support
Characteristic for information about which operations are supported.
5.2. Message Continuation Support
Implementations may determine, and act on, the level of peer
implementation support for message continuation as follows:
o To deal with issues relating to sending the peer multi-
transmission requests, the requester can interrogate the peer's
value of the Request Transmission Receive Limit (Section 8.4). In
cases in which the characteristic is not provided or has the value
one, the requester implementation can avoid sending multi-
transmission requests, and use the equivalent of position-zero
read chunks to convey a request larger than the receive buffer
limit.
Similarly, if the request is longer than can fit in a set of
transmissions given that limit, the request can be conveyed in the
same fashion.
o To deal with issues relating to sending the peer multi-
transmission responses, responders will only send multi-
transmission responses for requests conveyed using XOPT_XMTREQ
where the number of response transmissions is less than or equal
to the buffer reservation count (in the field optxrq_rsbuf). The
requester can avoid receiving a message consisting of too many
transmissions by setting this field appropriately. This includes
the case in which the requester cannot handle any multi-
transmission responses.
o To avoid reserving receive buffers that the responder is not
prepared to use, the requester can interrogate the peer's value of
the Response Transmission Send Limit (Section 8.5). In
cases in which it is possible that a request might result in a
response too large for this set of buffers, the requester can
provide a reply chunk to receive the response, which
the responder can use if the count of buffers provided is
insufficient.
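The request-side rule in the first item above can be sketched as
follows. The function name and return strings are hypothetical, and a
peer that does not provide the Request Transmission Receive Limit is
represented here by None.

```python
def request_plan(request_transmissions, peer_receive_limit):
    """Decide how to convey a request given the peer's receive limit.

    request_transmissions: transmissions the request would occupy.
    peer_receive_limit:    the peer's Request Transmission Receive Limit,
                           or None when the characteristic is absent.
    """
    # An absent characteristic is treated like a limit of one.
    limit = 1 if peer_receive_limit is None else peer_receive_limit
    if request_transmissions <= 1:
        return "single-transmission"
    if limit <= 1 or request_transmissions > limit:
        # Use the equivalent of a position-zero read chunk instead.
        return "position-zero-read-chunk"
    return "multi-transmission"
```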
5.3. Send-based DDP Support
Implementations may determine and adapt to the level of peer
implementation support for send-based DDP as described below. Note
that an implementation may be able to send messages containing bulk
data items placed using send-based DDP while not being prepared to
receive them, or the reverse.
o The requester can interrogate the responder's Receive Buffer
Structure Characteristic. In cases in which the characteristic is
not provided or shows no DDP-targetable buffer segments, an
implementation knows that messages containing bulk data may not be
sent using send-based DDP. In such cases, when XOPT_XMTREQ is
used to send a request, bulk items may be transferred by setting
the associated DDP information to indicate that the bulk data is
to be fetched using explicit RDMA operations.
o In cases in which a requester is unprepared to accept messages
using send-based DDP, its Receive Buffer Structure Characteristic
will make this clear to the responder. Nevertheless, the
requester will generally indicate to the responder that bulk data
items are to be returned using explicit RDMA operations. As a
result, requesters may use XOPT_XMTREQ (and get the benefit of the
DDP-related features discussed in Section 3.4) even if they support
neither message continuation nor send-based DDP.
o Since it is possible for a responder to generate responses
containing bulk data using send-based DDP even if it is not
prepared to receive such messages, a requester who is prepared to
accept such messages should specify in the request that the
responses are to contain (or may contain) bulk data placed in this
way. In deciding whether this is to be done the requester can
interrogate the responder's RTR Support Characteristic for
information about whether the peer can send responses in
this form. It can do this without regard to whether the responder
can accept messages containing bulk data items placed using send-
based DDP.
In determining whether bulk data will be placed using send-based DDP
or via explicit RDMA operations, the level of support for message
continuation will have a role. This is because DDP using explicit
RDMA will reduce message size while send-based DDP reduces the size
of the payload stream by rearranging the message, leaving the message
size the same. As a result, the considerations discussed in
Section 4.3 will have to be attended to by the sender in determining
which form of DDP is to be used.
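The size difference between the two forms of DDP can be illustrated
with a small sketch. It considers a message with a single bulk item
and ignores header and padding overhead, which is a simplifying
assumption; the function name and the string labels are hypothetical.

```python
def message_size(payload_len, bulk_len, ddp_form):
    """Bytes carried in sends for a message with one bulk data item.

    payload_len: length of the payload stream after the bulk item has
                 been excised.
    ddp_form:    "explicit-rdma" moves the bulk item out of the sends;
                 "send-based" keeps it in the sends, only rearranged.
    """
    if ddp_form == "explicit-rdma":
        return payload_len
    if ddp_form == "send-based":
        return payload_len + bulk_len
    raise ValueError("unknown DDP form")
```

A sender short of credits might therefore prefer explicit RDMA for a
large bulk item, since only that form shrinks the number of
transmissions needed.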
5.4. Error Reporting
The more extensive transport layer functionality described in this
document requires its own means of reporting errors, to deal with
issues that are distinct from:
o Errors (including XDR errors) in the XDR stream as received by
responder or requester.
o XDR errors detected in the XDR headers defined by the base
protocol.
o XDR errors detected in the new operations defined in this
document.
Beyond the above, the following sorts of errors will have to be dealt
with, depending on which of the features of the extension are
implemented.
o Information associated with send-based DDP may be inconsistent or
otherwise invalid, even though it conforms to the XDR definition.
o There may be problems with the organization of transmission groups
in that there are missing or extraneous transmissions.
In each of the above cases, the problem will be reported to the
sender using the Transmit Error operation which needs to be supported
by every endpoint that sends ROPT_XMITREQ, ROPT_XMITRESP, or
ROPT_XMITCONT. This includes cases in which the problem is one with
a reply. The function of the Transmit Error operation is to aid in
diagnosing transport protocol errors and allowing the sender to
recover or decide recovery is not possible. Reporting failure to the
requesting process is dealt with indirectly. For example:
o When the transmissions used to send a request are ill-formed, the
requester can respond to the error indication by proceeding to
send the request using existing (i.e. non-extended) facilities.
If it chooses not to do so, the requester can report an RPC
request failure to the initiator of the RPC.
o When the transmissions used to send a response are ill-formed, the
responder needs to know about the problem since it will otherwise
assume that the transmissions succeeded. It can proceed to resend
the reply using existing (i.e. non-extended) facilities. If it
chooses not to do so, the requester will not see a response and
eventually an RPC timeout will occur.
6. XDR Preliminaries
6.1. Message Continuation Preliminaries
In order to implement message continuation, we have occasion to refer
to particular RPC-over-RDMA transmissions within a transmission group
or to characteristics of a later transmission group.
<CODE BEGINS>
typedef uint32 xms_grpxn;
typedef uint32 xms_grpxc;
struct xms_id {
        uint32          xmsi_xid;
        msg_type        xmsi_dir;
        xms_grpxn       xmsi_seq;
};
<CODE ENDS>
An xms_grpxn designates a particular RPC-over-RDMA transmission
within a set of transmissions devoted to sending a single RPC
message.
An xms_grpxc specifies the number of RPC-over-RDMA transmissions in a
potential group of transmissions devoted to sending a single RPC
message.
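For readers less familiar with XDR, the on-the-wire form of an xms_id
can be sketched in Python. XDR encodes unsigned integers and
enumerations as 4-byte big-endian quantities, so the structure
occupies 12 bytes. The helper names are hypothetical; the msg_type
values (CALL = 0, REPLY = 1) are those of the ONC RPC protocol.

```python
import struct

CALL, REPLY = 0, 1  # msg_type values as defined by ONC RPC

# uint32 xid, enum direction (signed 32-bit), uint32 sequence number.
_XMS_ID = struct.Struct(">IiI")


def encode_xms_id(xid, direction, seq):
    """Encode an xms_id as three big-endian 32-bit XDR words."""
    return _XMS_ID.pack(xid, direction, seq)


def decode_xms_id(data):
    """Decode the first 12 bytes of data as (xid, direction, seq)."""
    return _XMS_ID.unpack(data[:12])
```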
6.2. Data Placement Preliminaries
Data structures related to data placement use a number of XDR
typedefs to help clarify the meaning of fields in the data structures
which use these typedefs.
<CODE BEGINS>
typedef uint32 xmddp_itemlen;
typedef uint32 xmddp_pldisp;
typedef uint32 xmddp_vsdisp;
typedef uint32 xmddp_tbsn;
enum xmddp_type {
        XMDTYPE_EXRW = 1,
        XMDTYPE_TBSN = 2,
        XMDTYPE_CHOOSE = 3,
        XMDTYPE_BYSIZE = 4,
        XMDTYPE_TOOSHORT = 5,
        XMDTYPE_NOITEM = 6
};
<CODE ENDS>
An xmddp_itemlen specifies the length of an XDR item. Because items
excised from the XDR stream are XDR items, lengths of items excised
from the XDR stream are denoted by xmddp_itemlens.
An xmddp_pldisp specifies a specific displacement within the payload
stream associated with a single RPC-over-RDMA transmission or a group
of such transmissions. Note that when multiple transmissions are
used for a single message, all of the payload streams within a
transmission group are considered concatenated.
An xmddp_vsdisp specifies a displacement within the virtual XDR
stream associated with the set of RPC messages transferred by a single
RPC-over-RDMA transmission or a group of such transmissions. The
virtual XDR stream includes bulk data excised from the payload stream
and so displacements within it reflect those of the corresponding
objects in the XDR stream that might be sent and received if no bulk
data excision facilities were involved in the RPC transmission.
An xmddp_tbsn designates a particular target buffer segment within a
(trivial or non-trivial) RPC-over-RDMA transmission group. Each DDP-
targetable buffer segment is assigned a number starting with zero and
proceeding through all the buffer segments for all the RPC-over-RDMA
transmissions in the group. This includes buffer segments not
actually used because transmissions are shorter than the maximum size
and those in which a DDP-targetable buffer segment is used to hold
part of the payload XDR stream rather than bulk data.
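The numbering rule above can be sketched in Python as follows. This
is one possible reading, not a definitive one: it assumes that only
DDP-targetable segments receive numbers, that every receive buffer in
the group has the same segment layout, and that transmissions are
indexed from zero. The function name is hypothetical.

```python
from typing import List, Tuple


def number_ddp_segments(seg_is_ddp: List[bool],
                        transmissions: int) -> List[Tuple[int, int, int]]:
    """Return (tbsn, transmission index, segment index) triples.

    seg_is_ddp describes one receive buffer: True marks a
    DDP-targetable segment.  Numbers are assigned whether or not a
    segment ends up holding bulk data.
    """
    numbering = []
    tbsn = 0
    for xn in range(transmissions):
        for seg_index, is_ddp in enumerate(seg_is_ddp):
            if is_ddp:
                numbering.append((tbsn, xn, seg_index))
                tbsn += 1
    return numbering
```

For a buffer whose second and third segments are DDP-targetable, a
two-transmission group yields numbers 0 and 1 in the first
transmission and 2 and 3 in the second.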
An xmddp_type allows a selection between DDP using explicit RDMA
operations and that using send-based DDP. It is used in a number of
contexts. The specific context governs which subset of the types is
valid:
o In request messages, they indicate where each of the directly-
placed data items within the request has been placed. In this
case, xmddp_type appears as the discriminator within an xmddp_loc
which is part of an xmddp_mitem that is an element within a
request's optxrq_ddp field.
o In request messages, they direct the responder as to where
potential directly-placed items are to be placed. In this case,
xmddp_type appears as the discriminator within an xmddp_rsdloc
which is part of an xmddp_rsditem that is an element within a
request's optxrq_rsd field.
o In response messages, they indicate how each of the potential
directly-placed items has been dealt with. A subset of these are
directly-placed data items and are presented in the same form as
that used for directly-placed data items within a request. In
this case, xmddp_type appears as the discriminator within an
xmddp_loc which is part of an xmddp_mitem that is an element
within a response's optxrs_ddp field.
A number of these types are valid in all of these contexts, since they
specify use of a specific mode of direct placement which is to be
used or has been used.
o XMDTYPE_EXRW selects DDP using explicit RDMA reads and writes.
o XMDTYPE_TBSN selects use of send-based DDP in which DDP-eligible
data is located in DDP-targetable buffer segments.
Another set of types is used to direct the use of specific sets of
types but cannot specify an actual choice that has been made.
o XMDTYPE_CHOICE indicates that the responder may use either send-
based DDP or chunk-based DDP using explicit RDMA operations, with
a place for the latter having been provided by the requester.
o XMDTYPE_BYSIZE indicates that the responder is to use either send-
based DDP or chunk-based DDP using explicit RDMA operations, with
the choice between the two governed by the actual size of the
associated DDP-eligible XDR item.
The following types are used when no actual direct placement has
occurred. They are used in responses to indicate ways in which a
direction to govern DDP in a reply was responded to without resulting
in direct placement.
o XMDTYPE_TOOSHORT indicates that the corresponding entry in an
xmddp_rsdset was matched with a DDP-eligible item which was too
small to be handled using direct placement, resulting in the DDP-
eligible item being placed inline.
o XMDTYPE_NOITEM indicates that the corresponding entry in an
xmddp_rsdset was not matched with a DDP-eligible item in the
reply.
The following table indicates which of the above types is valid in
each of the contexts in which these types may appear. For valid
occurrences, it distinguishes those which give sender-generated
information about the message, and those that direct reply
construction, from those that indicate how those directions governed
the construction of a reply. For invalid occurrences, we distinguish
between those that result in XDR decode errors and those which are
valid from the XDR point of view but are semantically invalid.
+------------------+--------------+-----------------+---------------+
| Type | xmddp_loc in | xmddp_rsdloc in | xmddp_loc in |
| | request | request | response |
+------------------+--------------+-----------------+---------------+
| XMDTYPE_EXRW | Valid Info | Valid Direction | Valid Result |
| XMDTYPE_TBSN | Valid Info | Valid Direction | Valid Result |
| XMDTYPE_BYSIZE | XDR Invalid | Valid Direction | XDR Invalid |
| XMDTYPE_CHOICE | XDR Invalid | Valid Direction | XDR Invalid |
| XMDTYPE_TOOSHORT | Sem. Invalid | XDR Invalid | Valid Result |
| XMDTYPE_NOITEM | Sem. Invalid | XDR Invalid | Valid Result |
+------------------+--------------+-----------------+---------------+
Table 1
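A receiver might consult Table 1 programmatically when checking an
incoming xmddp_type. The Python sketch below simply transcribes the
table; the context names and function names are hypothetical labels
for the three columns.

```python
# Columns of Table 1, in order (hypothetical context names).
CONTEXTS = ("loc_in_request", "rsdloc_in_request", "loc_in_response")

TABLE_1 = {
    "XMDTYPE_EXRW": ("Valid Info", "Valid Direction", "Valid Result"),
    "XMDTYPE_TBSN": ("Valid Info", "Valid Direction", "Valid Result"),
    "XMDTYPE_BYSIZE": ("XDR Invalid", "Valid Direction", "XDR Invalid"),
    "XMDTYPE_CHOICE": ("XDR Invalid", "Valid Direction", "XDR Invalid"),
    "XMDTYPE_TOOSHORT": ("Sem. Invalid", "XDR Invalid", "Valid Result"),
    "XMDTYPE_NOITEM": ("Sem. Invalid", "XDR Invalid", "Valid Result"),
}


def classify(type_name, context):
    """Return the Table 1 cell for a type in the given context."""
    return TABLE_1[type_name][CONTEXTS.index(context)]


def is_valid(type_name, context):
    """True when the table marks the occurrence as valid."""
    return classify(type_name, context).startswith("Valid")
```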
7. Data Placement Structures
7.1. Data Placement Overview
To understand the new DDP structure defined here, it is necessary to
review the existing DDP structures used in RPC-over-RDMA Version One
and look at the corresponding structures in the new message
transmission headers defined in this document.
We look first at the existing structures.
o Read chunks are specified on requests to indicate data items to be
excised from the payload stream and fetched from the requester's
memory by the responder. As such, they serve as a means of
supplying data excised from the payload XDR stream.
Read chunks appear in replies but they have no clear function
there.
o Write chunks are specified on requests to provide locations in
requester memory to which DDP-eligible items in the corresponding
reply are to be transferred. They do not describe data in the
request but serve to direct reply construction.
When write chunks appear in replies they serve to indicate the
length of the data transferred. The addresses to which the bulk
reply data has been transferred are available, but this information
is already known to the requester.
o Reply chunks are specified to provide a location in the
requester's memory to which the responder can transfer the
response using RDMA Write. Like write chunks, they do not
describe data in the request but serve to direct reply
construction.
When reply chunks appear in reply message headers, they serve
mainly to indicate whether the reply chunk was actually used.
Within the DDP structures defined here a different organization is
used, even where DDP using explicit RDMA operations is supported.
o All messages that contain bulk data contain structures that
indicate where the excised data is located. See Section 7.3 for
details.
o Requests that might generate replies containing bulk data contain
structures that provide guidance as to where the bulk data is to
be placed. See Section 7.4 for details.
Both sets of data structures are defined at the granularity of an RPC-
over-RDMA transmission group. That is, they describe the placement
of data within an RPC message and the scope of description is not
limited to a single RPC-over-RDMA transmission.
7.2. Buffer Structure Definition
Buffer structure definition information is used to allow the sender
to know how receive buffers are constructed, to allow it to
appropriately pad messages being sent so that bulk data will be
received into a memory area with the appropriate characteristics.
In this case, Direct Data Placement will not place data at a specific
address, picked and registered in advance as is done to effect DDP
using explicit RDMA operations. Instead, a message is sent so that
when it is matched with one of the preposted receives, the bulk data
will be received into a memory area of the appropriate
characteristics, including:
o size
o alignment
o DDP-targetability and potentially other memory characteristics
such as speed and persistence.
<CODE BEGINS>
struct xmrbs_seg {
uint32 xmrseg_length;
uint32 xmrseg_align;
uint32 xmrseg_flags;
};
const uint32 XMRSFLAG_DDP = 0x01;
struct xmrbs_group {
uint32 xmrgrp_count;
xmrbs_seg xmrgrp_info;
};
struct xmrbs_buf {
uint32 xmrbuf_length;
xmrbs_group xmrbuf_groups<>;
};
<CODE ENDS>
Buffers can be, and typically are, structured to contain multiple
segments. Preposted receives that target a buffer use a scatter
list to place received messages in successive buffer segments.
An xmrbs_seg defines a single buffer segment. The fields included
are:
o xmrseg_length is the length of this contiguous buffer segment.
o xmrseg_align specifies the guaranteed alignment for the
corresponding buffer segment.
o xmrseg_flags specifies some noteworthy characteristics of the
associated buffer segment.
The following flag bit is the only one currently defined:
o XMRSFLAG_DDP indicates that the buffer segment in question is to
be considered suitable as a target for direct data placement.
An xmrbs_group designates a set of buffer segments, all with the same
buffer segment characteristics as indicated by xmrgrp_info. The
buffer segments are contiguous within the buffer although they are
likely not to be physically contiguous.
An xmrbs_buf defines a receiver's buffer structure and consists of
multiple xmrbs_groups. This buffer structure, when made available as
a transport characteristic, allows the sender to structure
transmissions so as to place DDP-eligible data in appropriate target
buffer segments.
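To make the relationship between xmrbs_buf, xmrbs_group and xmrbs_seg
concrete, the following Python sketch expands a buffer description
into a flat list of segments. The class and function names are
hypothetical stand-ins for the XDR structures, and the sketch assumes
that segments are laid out back to back in the order the groups are
listed.

```python
from dataclasses import dataclass
from typing import List, Tuple

XMRSFLAG_DDP = 0x01


@dataclass
class Seg:
    length: int   # xmrseg_length
    align: int    # xmrseg_align
    flags: int    # xmrseg_flags


@dataclass
class Group:
    count: int    # xmrgrp_count
    info: Seg     # xmrgrp_info


@dataclass
class Buf:
    length: int          # xmrbuf_length
    groups: List[Group]  # xmrbuf_groups


def expand_segments(buf: Buf) -> List[Tuple[int, int, bool]]:
    """Return (offset in buffer, length, is DDP-targetable) per segment."""
    segments = []
    offset = 0
    for group in buf.groups:
        for _ in range(group.count):
            ddp = bool(group.info.flags & XMRSFLAG_DDP)
            segments.append((offset, group.info.length, ddp))
            offset += group.info.length
    return segments
```

A sender could use such a list to decide how much padding is needed
before bulk data so that it lands at the start of a DDP-targetable
segment.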
7.3. Message DDP Structures
These data structures show where in the virtual XDR stream for the
set of messages, data is to be excised from that XDR stream and where
that excised bulk data is to be found instead.
<CODE BEGINS>
union xmddp_loc switch(xmddp_type type) {
case XMDTYPE_EXRW:
rpcrdma1_segment xmdl_ex<>;
case XMDTYPE_TBSN:
xmddp_itemlen xmdl_offset;
xmddp_tbsn xmdl_bsnum<>;
case XMDTYPE_TOOSHORT:
case XMDTYPE_NOITEM:
void;
};
struct xmddp_mitem {
xmddp_vsdisp xmdmi_disp;
xmddp_itemlen xmdmi_length;
xmddp_loc xmdmi_where;
};
typedef xmddp_mitem xmddp_grpinfo<>;
<CODE ENDS>
An xmddp_loc shows where a particular piece of bulk data is located.
This information exists in multiple forms.
o The case for DDP using explicit RDMA operations contains, in
xmdl_ex, an array of rpcrdma1_segments showing where bulk data is
to be fetched from or has been transferred to.
o The case for send-based DDP contains, in xmdl_bsnum, an array of
DDP-targetable buffer segment numbers, indicating where bulk data,
excised
from the payload stream, is actually located. The bulk data
starts xmdl_offset bytes into the buffer segment designated by
xmdl_bsnum[0] and then proceeds through buffer segments denoted by
successive xmdl_bsnum entries until the length of the data item is
exhausted.
o The cases for XMDTYPE_TOOSHORT and XMDTYPE_NOITEM are only valid in
responses.
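The way a send-based xmddp_loc is walked can be sketched in Python as
follows. The function name and the seg_lengths mapping (from target
buffer segment number to segment length) are hypothetical; the sketch
assumes that the data starts xmdl_offset bytes into the first listed
segment and at offset zero in each subsequent one.

```python
def locate_item(seg_lengths, offset, bsnums, item_length):
    """Return (tbsn, start, length) pieces holding one bulk data item."""
    pieces = []
    remaining = item_length
    start = offset
    for tbsn in bsnums:
        if remaining == 0:
            break
        available = seg_lengths[tbsn] - start
        if available <= 0:
            raise ValueError("offset lies beyond the end of the segment")
        take = min(available, remaining)
        pieces.append((tbsn, start, take))
        remaining -= take
        start = 0  # later segments are used from their beginning
    if remaining:
        raise ValueError("segments exhausted before item length reached")
    return pieces
```

For example, a 5000-byte item starting 1024 bytes into a 4096-byte
segment occupies the remaining 3072 bytes of that segment and the
first 1928 bytes of the next one.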
An xmddp_mitem denotes a specific item of bulk data. It consists of:
o The XDR stream displacement of the bulk data excised from the
payload stream, in xmdmi_disp.
o The length of the data item, in xmdmi_length.
o The actual location of the bulk data, in xmdmi_where.
An xmddp_grpinfo consists of an array of xmddp_mitems describing all
of the bulk data excised from all RPC messages sent in a single RPC-
over-RDMA transmission group. Some possible cases:
o The array is of length zero, indicating that there is no DDP-
eligible data excised from the virtual XDR stream. In this case,
the virtual XDR stream and the payload stream are identical.
o The array consists of one or more xmddp_mitems, each of whose
xmdmi_where fields is of type XMDTYPE_EXRW. In this case, the DDP
data corresponds to read chunks in the case in which a request is
being sent and to write chunks in the case in which a reply is
being sent.
o The array consists of one or more xmddp_mitems, each of whose
xmdmi_where fields is of type XMDTYPE_TBSN. In this case, each
entry, whether it applies to bulk data in a request or a reply,
describes data logically part of the message being sent, which may
be part of any of the RPC-over-RDMA transmissions in the same
transmission group.
o The array consists of one or more xmddp_mitems, with xmdmi_where
fields of a mixture of types. In this case, each entry, whether it
applies to bulk data in a request or a reply, describes data
logically part of the message being sent, although the method of
getting access to that data may vary from entry to entry.
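The relationship between the payload stream and the virtual XDR
stream can be illustrated with a short Python sketch. It is a
simplification under stated assumptions: each excised item is given
as a (xmdmi_disp, bulk data) pair whose data has already been fetched
or located, XDR padding is ignored, and the function name is
hypothetical.

```python
def rebuild_virtual_stream(payload, items):
    """Reinsert excised bulk data into the payload stream.

    items is a list of (virtual stream displacement, bytes) pairs.
    """
    out = bytearray()
    consumed = 0  # bytes of the payload stream already copied
    for disp, data in sorted(items, key=lambda item: item[0]):
        gap = disp - len(out)
        if gap < 0 or consumed + gap > len(payload):
            raise ValueError("displacement inconsistent with payload")
        out += payload[consumed:consumed + gap]
        consumed += gap
        out += data
    out += payload[consumed:]
    return bytes(out)
```

With an empty item list the function returns the payload unchanged,
matching the case in which the virtual XDR stream and the payload
stream are identical.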
7.4. Response Direction DDP Structures
These data structures, when sent as part of the request, instruct the
responder how to use Direct Data Placement to place response data
subject to direct data placement.
<CODE BEGINS>
union xmddp_rsdloc switch(xmddp_type type) {
case XMDTYPE_EXRW:
case XMDTYPE_CHOICE:
rpcrdma1_segment xmdrsdl_ex<>;
case XMDTYPE_BYSIZE:
xmddp_itemlen xmdrsdl_dsdov;
rpcrdma1_segment xmdrsdl_bsex<>;
case XMDTYPE_TBSN:
void;
};
struct xmddp_rsdrange {
xmddp_vsdisp xmdrsdr_begin;
xmddp_vsdisp xmdrsdr_end;
};
struct xmddp_rsditem {
xmddp_itemlen xmdrsdi_minlen;
xmddp_rsdloc xmdrsdi_loc;
};
struct xmddp_rsdset {
xmddp_rsdrange xmdrsds_range;
xmddp_rsditem xmdrsds_items<>;
};
typedef xmddp_rsdset xmddp_rsdgroup<>;
<CODE ENDS>
An xmddp_rsdloc contains information specifying where bulk data
generated as part of a reply is to be placed. This information is
defined as a union with the following cases:
o The case for DDP using explicit RDMA operations, XMDTYPE_EXRW,
contains, in xmdrsdl_ex, an array of rpcrdma1_segments showing
where bulk data generated by the corresponding reply is to be
transferred to.
o The case allowing the responder to freely choose the DDP method,
XMDTYPE_CHOICE, is identical. It also contains, in xmdrsdl_ex, an
array of rpcrdma1_segments showing where bulk data generated by
the corresponding reply is to be transferred to if explicit RDMA
requests are to be used.
o The case for send-based DDP, XMDTYPE_TBSN, is void, since the
decisions as to where bulk data is to be placed are made by the
responder.
o In the case directing the responder to choose the DDP method based
on item size, XMDTYPE_BYSIZE, an array of rpcrdma1_segments is in
xmdrsdl_bsex.
In all cases, each xmddp_rsdloc sent as part of a request has a
corresponding xmddp_loc in the associated response. The xmddp_type
specified in the request will affect the type in the response, but
the types are not necessarily the same. The table below describes
the valid combinations of request and response xmddp_types.
In this table, rows correspond to types in requests directing the
responder as to the desired placement in the response while the
columns correspond to types in the ensuing response. Invalid
combinations are labelled "Inv" while valid combinations are labelled
either "NDR" denoting no need to deregister memory, or "DR" to
indicate that memory previously registered will need to be
deregistered.
+---------+--------+--------+-----------+---------+
| Type | EXRW | TBSN | TOOSHORT | NOITEM |
+---------+--------+--------+-----------+---------+
| EXRW | DR | Inv. | DR | DR |
| TBSN | Inv. | NDR | NDR | NDR |
| CHOICE | DR | NDR | DR | DR |
| BYSIZE | DR | NDR | DR | DR |
+---------+--------+--------+-----------+---------+
Table 2
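A requester processing a response might use Table 2 to decide whether
memory registered for an entry has to be deregistered. The Python
sketch below transcribes the table; the function name and the short
type labels are hypothetical.

```python
# Rows: type in the request.  Columns: type in the response.
TABLE_2 = {
    "EXRW": {"EXRW": "DR", "TBSN": "Inv", "TOOSHORT": "DR", "NOITEM": "DR"},
    "TBSN": {"EXRW": "Inv", "TBSN": "NDR", "TOOSHORT": "NDR", "NOITEM": "NDR"},
    "CHOICE": {"EXRW": "DR", "TBSN": "NDR", "TOOSHORT": "DR", "NOITEM": "DR"},
    "BYSIZE": {"EXRW": "DR", "TBSN": "NDR", "TOOSHORT": "DR", "NOITEM": "DR"},
}


def needs_deregistration(request_type, response_type):
    """Return True for "DR", False for "NDR"; reject invalid pairs."""
    outcome = TABLE_2[request_type][response_type]
    if outcome == "Inv":
        raise ValueError("invalid request/response type combination")
    return outcome == "DR"
```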
An xmddp_rsdrange denotes a range of positions in the XDR stream
associated with a request. Particular directions regarding bulk data
in the corresponding response are limited to such ranges, where
response XDR stream positions and request XDR stream positions can be
reliably tied together.
When the ULP supports multiple individual operations per RPC request
(e.g., COMPOUND and CB_COMPOUND in NFSv4), an xmddp_rsdrange can
isolate elements of the reply due to particular operations.
An xmddp_rsditem specifies the handling of one potential item of bulk
data. The handling specified is qualified by a length range. If the
item is smaller than xmdrsdi_minlen, it is not treated as bulk data
and the corresponding data item appears in the payload stream, while
that particular xmddp_rsditem is considered used up, making the next
xmddp_rsditem in the xmddp_rsdset the target of the next DDP-eligible
data item in the reply. Note that in the case in which xmdrsdi_loc
specifies use of explicit RDMA operations, the area specified is not
used and the requester is responsible for deregistering it.
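The matching of xmddp_rsditems against DDP-eligible items in a reply
can be sketched in Python as follows. This is a simplified model
under stated assumptions: items and entries are matched strictly in
order within one xmddp_rsdset, only the xmdrsdi_minlen test is
modelled, and the label "PLACED" is a hypothetical stand-in for
whichever direct placement type the responder actually reports.

```python
def match_items(minlens, item_lengths):
    """Return one outcome per xmddp_rsditem, in order.

    minlens lists the xmdrsdi_minlen of each entry; item_lengths
    lists the lengths of the DDP-eligible items found in the reply.
    """
    outcomes = []
    for index, minlen in enumerate(minlens):
        if index >= len(item_lengths):
            outcomes.append("XMDTYPE_NOITEM")    # no item for this entry
        elif item_lengths[index] < minlen:
            outcomes.append("XMDTYPE_TOOSHORT")  # entry used up, data inline
        else:
            outcomes.append("PLACED")
    return outcomes
```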
For each xmddp_rsditem, there will be a corresponding xmddp_mitem
in the associated response.
An xmddp_rsdset contains a set of xmddp_rsditems applicable to a
given xmddp_rsdrange in the request.
An xmddp_rsdgroup designates a set of xmddp_rsdsets applicable to a
particular RPC-over-RDMA transmission group. The xmdrsds_range
fields of successive xmddp_rsdsets must be disjoint and in strictly
increasing order.
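The ordering constraint on xmdrsds_range fields could be checked with
a Python sketch such as the one below. It assumes that each range is
half-open, covering displacements from xmdrsdr_begin up to but not
including xmdrsdr_end; the document does not state this, so it should
be treated as an assumption. The function name is hypothetical.

```python
def ranges_are_valid(ranges):
    """Check that (begin, end) pairs are disjoint and increasing."""
    previous_end = None
    for begin, end in ranges:
        if begin > end:
            return False  # malformed range
        if previous_end is not None and begin < previous_end:
            return False  # overlaps or precedes the previous range
        previous_end = end
    return True
```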
8. Transport Characteristics
8.1. Characteristics List
In this document we take advantage of the fact that the set of
transport characteristics defined in [xcharext] is subject to later
extension. The additional transport characteristics are summarized
below in Table 3.
In that table the columns have the following values:
o The column labeled "characteristic" identifies the transport
characteristic described by the current row.
o The column labeled "#" specifies the xcharid value used to
identify this characteristic.
o The column labeled "XDR type" gives the XDR type of the data used to
communicate the value of this characteristic. This data overlays
the nominally opaque field xchv_data in an xcharval.
o The column labeled "default" gives the default value for the
characteristic which is to be assumed by those who do not receive,
or are unable to interpret, information about the actual value of
the characteristic.
o The column labeled "section" indicates the section (within this
document) that explains the semantics and use of this transport
characteristic.
+------------------------------+----+-----------+---------+---------+
| characteristic | # | XDR type | default | section |
+------------------------------+----+-----------+---------+---------+
| RTR Support | 4 | uint32 | 0 | 8.2 |
| Receive Buffer Structure | 5 | xmrbs_buf | Note1 | 8.3 |
| Request Transmission Receive | 6 | xms_grpxc | 1 | 8.4 |
| Limit | | | | |
| Response Transmission Send | 7 | xms_grpxc | 1 | 8.5 |
| Limit | | | | |
+------------------------------+----+-----------+---------+---------+
Table 3
The following notes apply to the above table:
1. The default value for the Receive Buffer Structure always
consists of a single buffer segment, without any alignment
restrictions and not targetable for DDP. The length of that
buffer segment derives from the Receive Buffer Size
characteristic if available, and from the default receive buffer
size otherwise.
8.2. RTR Support Characteristic
<CODE BEGINS>
const uint32 XCHAR_RTRSUPP = 4;
typedef uint32 xchrrtrs;
const uint32 RTRS_XREQ = 1;
const uint32 RTRS_XRESP = 2;
const uint32 RTRS_XCONT = 4;
<CODE ENDS>
8.3. Receive Buffer Structure Characteristic
This characteristic defines the structure of the endpoint's receive
buffers, in order to give a sender the ability to place bulk data in
specific DDP-targetable buffer segments.
<CODE BEGINS>
const uint32 XCHAR_RBSTRUCT = 5;
typedef xmrbs_buf xchrrbs;
<CODE ENDS>
Normally, this characteristic, if specified, should be in agreement
with the Receive Buffer Size characteristic. However, the following
rules apply.
o If the value of the Receive Buffer Structure characteristic is not
specified, it is derived from the Receive Buffer Size
characteristic, if known, and the default buffer size otherwise.
The buffer is considered to consist of a single non-DDP-targetable
segment whose size is the buffer size.
o If the value of the Receive Buffer Size characteristic is not
specified and the Receive Buffer Structure characteristic is
specified, the value of the former is derived from the latter, by
adding up the length of all buffer segments specified.
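The two derivation rules above can be sketched in Python as follows.
The function name is hypothetical, segments are modelled as (length,
alignment, is DDP-targetable) tuples, an alignment of 1 is used to
mean "no alignment restriction", and the default buffer size of 1024
is only a placeholder, since the actual default is defined outside
this section.

```python
DEFAULT_RECEIVE_BUFFER_SIZE = 1024  # placeholder, not from this section


def effective_buffer(structure=None, size=None):
    """Return (segment list, buffer size) after applying the rules.

    structure is a list of (length, align, is_ddp) tuples or None;
    size is the Receive Buffer Size value or None.
    """
    if structure is None:
        length = size if size is not None else DEFAULT_RECEIVE_BUFFER_SIZE
        # A single segment that is not DDP-targetable.
        return [(length, 1, False)], length
    if size is None:
        size = sum(length for length, _align, _ddp in structure)
    return structure, size
```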
8.4. Request Transmission Receive Limit Characteristic
This characteristic specifies the length of the longest request
message (in terms of number of transmissions) that a responder will
accept.
<CODE BEGINS>
const uint32 XCHAR_REQRXLIM = 6;
typedef uint32 xchrqrxl;
<CODE ENDS>
A requester can use this characteristic to determine whether to send
long requests by use of message continuation or by using a position-
zero read chunk.
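A requester's decision might look like the following Python sketch.
It is deliberately simplified: it assumes every transmission can
carry the same number of payload bytes (the hypothetical
per_transmission_payload parameter) and ignores header overhead and
DDP. The function name and the returned labels are hypothetical.

```python
import math


def plan_request(message_len, per_transmission_payload, request_rx_limit=1):
    """Choose how to send a request of message_len bytes.

    request_rx_limit is the responder's Request Transmission Receive
    Limit, which defaults to 1 when not known.
    """
    needed = max(1, math.ceil(message_len / per_transmission_payload))
    if needed == 1:
        return ("single transmission", 1)
    if needed <= request_rx_limit:
        return ("message continuation", needed)
    return ("position-zero read chunk", 1)
```

With the default limit of 1, any request too long for a single
transmission falls back to a position-zero read chunk.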
8.5. Response Transmission Send Limit Characteristic
This characteristic specifies the length of the longest response
message (in terms of number of transmissions) that a responder will
generate.
<CODE BEGINS>
const uint32 XCHAR_RESPSXLIM = 7;
typedef uint32 xchrssxl;
<CODE ENDS>
9. New Operations
9.1. Operations List
The proposed new operations are set forth in Table 4 below. In that
table, the columns have the following values:
o The column labeled "operation" specifies the particular operation.
o The column labeled "#" specifies the value of opttype for this
operation.
o The column labeled "XDR type" gives the XDR type of the data structure
used to describe the information in this new message type. This
data overlays the nominally opaque field optinfo in an
RDMA_OPTIONAL message.
o The column labeled "msg" indicates whether this operation is
followed (or not) by an RPC message payload (or something else).
o The column labeled "section" indicates the section (within this
document) that explains the semantics and use of this optional
operation.
+--------------------+----+--------------+--------+----------+
| operation | # | XDR type | msg | section |
+--------------------+----+--------------+--------+----------+
| Transmit Request | 5 | optxmt_req | Note1 | 9.2 |
| Transmit Response | 6 | optxmt_resp | Note1 | 9.3 |
| Transmit Continue | 7 | optxmt_cont | Note2 | 9.4 |
| Transmit Error | 8 | optxmt_err | No. | 9.5 |
+--------------------+----+--------------+--------+----------+
Table 4
The following notes apply to the above table:
1. Contains an initial segment of the message payload stream for an
RPC message, or the entire payload stream. The optxr[qs]_pslen
field indicates the length of the section present.
2. May contain a part of a message payload stream for an RPC
message, although not the entire payload stream. The optxc_pslen
field, if non-zero, indicates that this portion is present, and
the length of the section.
9.2. Transmit Request Operation
The message definition for this operation is as follows:
<CODE BEGINS>
const uint32 ROPT_XMTREQ = 5;
struct optxmt_req {
xmddp_grpinfo optxrq_ddp;
xmddp_rsdgroup optxrq_rsd;
xms_grpxc optxrq_count;
xms_grpxc optxrq_rsbuf;
xmddp_pldisp optxrq_pslen;
};
<CODE ENDS>
The field optxrq_ddp describes the fields in virtual XDR stream which
have been excised in forming the payload stream, and information
about where the corresponding bulk data is located.
The field optxrq_rsd consists of information directing the responder
as to how to construct the reply, in terms of DDP. of length zero.
The field optxrq_count specifies the count of transmissions in this
group of transmissions used to send a request.
The field optrq_repch serves as a way to transfer a reply chunk to
the responder to serve as a way in which a reply longer than the
inline size limit may be transferred. Although not prohibited by
the protocol, it is unlikely to be used in environments in which
message continuation is supported.
The field optxrq_pslen gives the length of the payload stream for the
RPC transmitted. The payload stream begins right after the end of
the optxmt_req and proceeds for optxrq_pslen bytes. This can include
crossing buffer segment boundaries.
9.3. Transmit Response Operation
The message definition for this operation is as follows:
<CODE BEGINS>
const uint32 ROPT_XMTRESP = 6;
struct optxmt_resp {
xmddp_grpinfo optxrs_ddp;
xms_grpxc optxrs_count;
xmddp_pldisp optxrs_pslen;
};
<CODE ENDS>
The field optxrs_ddp describes the fields in virtual XDR stream which
have been excised in forming the payload stream, and information
about where the corresponding bulk data is located.
The field optxrs_count specifies the count of transmissions in this
group of transmissions used to send a reply.
The field optxrs_pslen gives the length of the payload stream for the
RPC transmitted. The payload stream begins right after the end of
the optxmt_resp and proceeds for optxrs_pslen bytes. This can include
crossing buffer segment boundaries.
9.4. Transmit Continue Operations
RPC-over-RDMA headers of this type are used to continue RPC messages
begun by an RPC-over-RDMA message of type ROPT_XMTREQ or ROPT_XMTRESP.
The xid field of this message must match that in the initial
transmission.
This operation needs to be supported for the message continuation
feature to be used.
The message definition for this operation is as follows:
<CODE BEGINS>
const uint32 ROPT_XMTCONT = 7;
struct optxmt_cont {
xms_grpxn optxc_xnum;
uint32 optxc_itype;
xmddp_pldisp optxc_pslen;
};
<CODE ENDS>
The field optxc_xnum indicates the transmission number of this
transmission within its transmission group.
The field optxc_pslen gives the length of the section of the payload
stream which is located in the current RPC-over-RDMA transmission.
It is valid for this length to be zero, indicating that there is no
portion of the payload stream in this transmission. Except when the
length is zero, the payload stream begins right after the end of the
optxmt_cont and proceeds for optxc_pslen bytes. This can include
crossing buffer segment boundaries. In any case, the payload streams
for all transmissions within the same group are considered
concatenated.
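Reassembly of the payload stream for a transmission group can be
sketched in Python as follows. The function name is hypothetical, and
the sketch assumes that transmissions are numbered from zero, with
the initial Transmit Request or Transmit Response counted as number
zero and each Transmit Continue identified by its optxc_xnum.

```python
def reassemble_payload(count, pieces):
    """Concatenate per-transmission payload sections in order.

    count is the number of transmissions in the group; pieces maps a
    transmission number to the payload bytes it carried (possibly
    empty, since a zero optxc_pslen is valid).
    """
    missing = [n for n in range(count) if n not in pieces]
    if missing:
        raise ValueError("transmissions not yet received: %r" % missing)
    return b"".join(pieces[n] for n in range(count))
```

Because the pieces are keyed by transmission number, the result does
not depend on the order in which they were stored.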
9.5. Transmit Error Operations
This RPC-over-RDMA message type is used to signal the occurrence of
errors that do not involve:
The preliminary error-related definition is as follows:
<CODE BEGINS>
enum optx_err {
OPTXERR_BADHMT = 1,
OPTXERR_BADOMT = 2,
OPTXERR_BADCONT = 3,
OPTXERR_BADSEQ = 4,
OPTXERR_BADXID = 5,
OPTXERR_BADOFF = 6,
OPTXERR_BADTBSN = 7,
OPTXERR_BADPL = 8
};
union optx_info switch(optx_err optxe_which) {
case OPTXERR_BADHMT:
case OPTXERR_BADOMT:
case OPTXERR_BADSEQ:
case OPTXERR_BADXID:
uint32 optxi_expect;
uint32 optxi_current;
case OPTXERR_BADCONT:
void;
case OPTXERR_BADTBSN:
case OPTXERR_BADOFF:
case OPTXERR_BADPL:
uint32 optxi_value;
uint32 optxi_min;
uint32 optxi_max;
};
<CODE ENDS>
optx_err enumerates the various error conditions that might be
reported.
o OPTXERR_BADHMT indicates that a header message type other than the
one expected was received. In this context, a particular message
type can be considered "expected" only because of message or group
continuation.
o OPTXERR_BADOMT indicates that an optional message type other than
the one expected was received. In this context, a particular
message type can be considered "expected" only because of message
or group continuation.
o OPTXERR_BADCONT indicates that a continuation message was
received when there was no reason to expect one.
o OPTXERR_BADSEQ indicates that a transmission sequence number other
than the one expected was received.
o OPTXERR_BADXID indicates that an xid other than the one expected
was received in a continuation context.
o OPTXERR_BADTBSN indicates that an invalid target buffer segment
number was received.
o OPTXERR_BADOFF indicates that a bad offset was received as part of
an xmddp_loc. This is typically because the offset is larger than
the buffer segment size.
o OPTXERR_BADPL indicates that a bad value was received for the
payload length. This is typically because the length would make
the area devoted to the payload stream not a subset of the actual
transmission.
The optx_info gives information about the specific invalid field being
reported. The additional information given depends on the specific
error.
o For the errors OPTXERR_BADHMT, OPTXERR_BADOMT, OPTXERR_BADSEQ,
and OPTXERR_BADXID, the expected and actual values of the field
are reported.
o For the error OPTXERR_BADCONT, no additional information is provided.
o For the errors OPTXERR_BADTBSN, OPTXERR_BADOFF, and OPTXERR_BADPL,
the actual value together with a range of valid values is
provided. When the actual value is within the valid range, it can
be inferred that the actual value is not properly aligned (e.g., not
on a 32-bit boundary).
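The inference described for the range-style errors can be sketched in
Python as follows. The function name is hypothetical; the sketch
assumes that optxi_min and optxi_max are inclusive bounds and that
the alignment in question is four bytes, as in the example above.

```python
def explain_range_error(value, minimum, maximum, alignment=4):
    """Interpret optxi_value against optxi_min and optxi_max."""
    if value < minimum or value > maximum:
        return "out of range"
    if value % alignment != 0:
        return "misaligned"
    return "unexplained"
```

A receiver of a Transmit Error message could use such a helper to
produce a more specific diagnostic than the error code alone.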
The message definition for this operation is as follows:
<CODE BEGINS>
const uint32 ROPT_XMTERR = 8;
struct optxmt_err {
xms_id optxe_bad;
xms_id *optxe_lead;
optx_info optxe_info;
};
<CODE ENDS>
The field optxe_bad is a description of the transmission on which the
error was actually detected.
The optional field optxe_lead is a description of an earlier
transmission that might have led to the error reported.
The field optxe_info provides information about the
10. XDR
This section contains an XDR [RFC4506] description of the proposed
extension.
This description is provided in a way that makes it simple to extract
into ready-to-use form. The reader can apply the following shell
script to this document to produce a machine-readable XDR description
of the extension which can be combined with XDR for the base protocol to
produce an XDR that includes the base protocol together with the
optional extensions.
<CODE BEGINS>
#!/bin/sh
grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'
<CODE ENDS>
That is, if the above script is stored in a file called "extract.sh"
and this document is in a file called "ext.txt" then the reader can
do the following to extract an XDR description file for this
extension:
<CODE BEGINS>
sh extract.sh < ext.txt > xmitext.x
<CODE ENDS>
The XDR description for this extension can be combined with that for
other extensions and that for the base protocol. While this is a
complete description and can be processed by the XDR compiler, the
result might not be usable to process the extended protocol, for a
number of reasons:
The RPC-over-RDMA transport headers do not constitute an RPC
program, and version negotiation and message selection are part of
the XDR, rather than being external to it.
Headers used for requests and replies are not necessarily paired,
as they would be in an RPC program.
Header types defined as optional extensions overlay existing
nominally opaque fields in the base protocol. While this overlay
architecture allows code aware of the overlay relationships to
have a more complete view of header structure, this overlay
relationship cannot be expressed within the XDR language.
10.1. Code Component License
Code components extracted from this document must include the
following license text. When the extracted XDR code is combined with
other complementary XDR code which itself has an identical license,
only a single copy of the license text need be preserved.
<CODE BEGINS>
/// /*
/// * Copyright (c) 2010, 2016 IETF Trust and the persons
/// * identified as authors of the code. All rights reserved.
/// *
/// * The author of the code is: D. Noveck.
/// *
/// * Redistribution and use in source and binary forms, with
/// * or without modification, are permitted provided that the
/// * following conditions are met:
/// *
/// * - Redistributions of source code must retain the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer.
/// *
/// * - Redistributions in binary form must reproduce the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer in the documentation and/or other
/// * materials provided with the distribution.
/// *
/// * - Neither the name of Internet Society, IETF or IETF
/// * Trust, nor the names of specific contributors, may be
/// * used to endorse or promote products derived from this
/// * software without specific prior written permission.
/// *
/// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
/// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
/// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
/// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
/// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
/// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
/// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
/// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
/// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
/// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
/// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
/// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
/// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
/// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
/// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
/// */
<CODE ENDS>
10.2. XDR Proper for Extension
<CODE BEGINS>
/// /*
/// * TBD soon
/// *
/// * In a future iteration, this will be assembled from existing XDR
/// * code fragments already appearing in the document.
/// */
<CODE ENDS>
11. Security Considerations
The information transferred in the transport characteristics
described in this document does not raise any security issues.
If and when additional transport characteristics are proposed, the
review of the associated standards track document should deal with
possible security issues raised by those new transport
characteristics.
12. IANA Considerations
This document does not require any actions by IANA.
13. References
13.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>.
[RFC4506] Eisler, M., Ed., "XDR: External Data Representation
Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
2006, <http://www.rfc-editor.org/info/rfc4506>.
[rfc5666bis]
Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
Memory Access Transport for Remote Procedure Call", May
2016, <http://www.ietf.org/id/
draft-ietf-nfsv4-rfc5666bis-07.txt>.
Work in progress.
13.2. Informative References
[RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1
External Data Representation Standard (XDR) Description",
RFC 5662, DOI 10.17487/RFC5662, January 2010,
<http://www.rfc-editor.org/info/rfc5662>.
[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
Transport for Remote Procedure Call", RFC 5666,
DOI 10.17487/RFC5666, January 2010,
<http://www.rfc-editor.org/info/rfc5666>.
[RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS)
Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
January 2010, <http://www.rfc-editor.org/info/rfc5667>.
[rpcrdmav2]
Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version Two",
June 2016, <http://www.ietf.org/id/
draft-cel-nfsv4-rpcrdma-version-two-01.txt>.
Work in progress.
[xcharext]
Noveck, D., "RPC-over-RDMA Extension to Manage Transport
Characteristics", April 2016, <http://www.ietf.org/id/
draft-dnoveck--nfsv4-xcharext-00.txt>.
Work in progress.
Appendix A. Acknowledgements
The author gratefully acknowledges the work of Brent Callaghan and
Tom Talpey in producing the original RPC-over-RDMA Version One
specification [RFC5666] and also Tom's work in helping to clarify
that specification.
The author also wishes to thank Chuck Lever for his work resurrecting
NFS support for RDMA in [rfc5666bis], for clarifying the relationship
between RDMA and direct data placement, and for beginning the work on
RPC-over-RDMA Version Two.
The extract.sh shell script and formatting conventions were first
described by the authors of the NFSv4.1 XDR specification [RFC5662].
Author's Address
David Noveck
Hewlett Packard Enterprise
165 Dascomb Road
Andover, MA 01810
USA
Phone: +1 781-572-8038
Email: davenoveck@gmail.com