INTERNET-DRAFT                           Jeff Hilland
draft-hilland-rddp-verbs-00.txt            Hewlett-Packard Company
                                         Paul Culley
                                           Hewlett-Packard Company
                                         Jim Pinkerton
                                           Microsoft Corporation
                                         Renato Recio
                                           IBM Corporation

                                         Expires: October, 2003



   RDMA Protocol Verbs Specification

1  Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html The list of Internet-Draft
   Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

2  Abstract

   This document describes an abstract interface to a RDMA enabled NIC
   (RNIC). This interface is implemented as a combination of the RNIC,
   its associated firmware, and host software. It provides access to
   the RNIC queuing and memory management resources, as well as the
   underlying networking layers.










   Hilland, et al.       Expires October 2003                  [Page 1]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Table of Contents

   1    Status of this Memo.........................................1
   2    Abstract....................................................1
   3    Introduction................................................7
   4    Glossary....................................................9
   4.1  Abbreviations..............................................19
   5    RNIC Interface.............................................22
   5.1  The RNIC...................................................23
   5.1.1  RNIC Resources...........................................23
   5.1.1.1   Expected Creation Sequence............................24
   5.1.1.2   Expected Destruction Sequence.........................25
   5.1.2  Opening an RNIC..........................................28
   5.1.3  Query RNIC...............................................28
   5.1.4  Closing an RNIC..........................................28
   5.2  Protection Domains.........................................28
   5.2.1  Allocating a PD..........................................29
   5.2.2  Deallocating a PD........................................30
   5.3  Completion Queues..........................................30
   5.3.1  Creating a Completion Queue..............................30
   5.3.2  Querying Completion Queue Attributes.....................31
   5.3.3  Modifying Completion Queue Attributes....................32
   5.3.4  Destroying a Completion Queue............................32
   6    Queue Pairs................................................33
   6.1  Queue Pair Resource Handling...............................34
   6.1.1  Creating a Queue Pair....................................34
   6.1.2  Querying Queue Pair Attributes...........................35
   6.1.3  Modifying Queue Pair Attributes..........................36
   6.1.4  Destroying a Queue Pair..................................39
   6.2  Queue Pair Resource States.................................41
   6.2.1  Idle State...............................................43
   6.2.1.1   Idle to Idle..........................................44
   6.2.1.2   Idle to RTS...........................................44
   6.2.1.3   Idle to Error.........................................46
   6.2.2  RTS (Ready to Send) State................................48
   6.2.2.1   RTS to RTS............................................48
   6.2.2.2   RTS to Closing........................................49
   6.2.2.3   RTS to Terminate......................................49
   6.2.2.4   RTS to Error..........................................50
   6.2.3  Terminate State..........................................53
   6.2.4  Error State..............................................56
   6.2.5  Closing State............................................58
   6.3  Shared Receive Queue.......................................62
   6.3.1  Creating a Shared Receive Queue..........................63
   6.3.2  Modifying a Shared Receive Queue.........................63
   6.3.3  Destroying a Shared Receive Queue........................63
   6.3.4  Associating an S-RQ with a QP............................64
   6.3.5  Shared Receive Queue Processing Model....................64
   6.3.6  S-RQ Error Semantics.....................................66
   6.3.7  S-RQ Resource Sizing.....................................66



   Hilland, et al.        Expires October 2003               [Page 2]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   6.3.8  S-RQ Limit Checking......................................67
   6.4  Stopping QP processing and Sending the Terminate Message...68
   6.5  Outstanding RDMA Read Resource Management..................71
   6.5.1  Example IRD/ORD Negotiation..............................74
   6.6  Connection Management......................................75
   6.6.1  Connection Initialization................................75
   6.6.1.1   Active Connection Initialization after LLP Startup....76
   6.6.1.2   Passive Connection Initialization after LLP Startup...78
   6.6.2  Connection Teardown......................................79
   6.6.2.1   Normal Close..........................................80
   6.6.2.2   ULP Initiated Termination.............................81
   6.6.2.3   ULP Initiated Abortive Teardown.......................82
   6.6.2.4   Remote Termination....................................83
   6.6.2.5   Local Termination, Local Abortive Teardown and Remote
   Abortive Teardown...............................................83
   7    Memory Management..........................................87
   7.1  Memory Management Overview.................................87
   7.2  Steering Tag (STag)........................................88
   7.2.1  STag of zero.............................................90
   7.2.2  Summary of Memory Region STag States.....................91
   7.3  Memory Registration........................................93
   7.3.1  Memory Regions...........................................94
   7.3.1.1   Memory Region Tagged Offset (TO)......................94
   7.3.2  Memory Region Creation and Registration..................94
   7.3.2.1   Allocate Non-Shared Memory Region STag................95
   7.3.2.2   RI-Register Non-Shared Memory Region..................95
   7.3.2.3   RI-Reregister Non-Shared Memory Region................96
   7.3.2.4   Register Shared Memory Region.........................98
   7.3.2.5   Fast-Register Non-Shared Memory Region................99
   7.4  Access to Registered Memory...............................100
   7.4.1  Local Access to Registered Memory.......................101
   7.4.2  Remote Access to Registered Memory......................101
   7.4.3  Multiple Registrations of Memory Regions................103
   7.5  Memory Access Control.....................................104
   7.5.1  Local Access Control....................................105
   7.5.2  Remote Access Control...................................106
   7.6  Addressing................................................106
   7.6.1  Addressing Registered Memory............................106
   7.6.1.1   Addressing with VA based TO..........................107
   7.6.1.2   Addressing with Zero Based TO........................108
   7.6.2  Physical Buffer Lists...................................109
   7.6.2.1   Page Lists...........................................109
   7.6.2.2   Block Lists..........................................110
   7.6.3  Error Checking of Local and Remote Accesses to MRs......110
   7.7  Querying Memory Regions...................................111
   7.8  Invalidating Memory Regions...............................111
   7.9  Deallocation of STag associated with a Memory Region......114
   7.10   Memory Windows..........................................115
   7.10.1  Allocating Memory Windows..............................115
   7.10.2  Binding Memory Windows to Memory Regions...............116



   Hilland, et al.        Expires October 2003               [Page 3]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   7.10.3  Querying Memory Windows................................120
   7.10.4  Invalidating or De-allocating Memory Windows...........120
   7.10.4.1  Invalidating or De-allocating Active Windows.........121
   7.10.5  Summary of Memory Window STag States...................121
   7.10.6  Error Checking during Memory Window Operations.........122
   7.10.6.1  Error Checking at Window Bind Time...................122
   7.10.6.2  Error Checking at Window Access Time.................123
   7.10.6.3  Error Checking at Window Invalidate Time.............123
   8    Work Requests and the WR Processing Model.................125
   8.1  Work Requests.............................................125
   8.1.1  Creating Work Requests..................................125
   8.1.2  Work Request Types......................................125
   8.1.2.1   Send/Receive.........................................125
   8.1.2.2   RDMA.................................................126
   8.1.2.3   Memory...............................................129
   8.1.3  Work Request Contents...................................130
   8.1.3.1   Signaled Completions.................................130
   8.1.3.2   Scatter/Gather List..................................131
   8.1.3.3   RDMA Data Source & Data Sink.........................132
   8.2  Work Request Processing Model.............................133
   8.2.1  Submitting Work Request to a Work Queue.................133
   8.2.2  Work Request Processing.................................134
   8.2.2.1   Memory Management Operation Ordering.................137
   8.2.2.2   Read Fence and Local Fence Indicators................140
   8.2.3  Completion Processing...................................143
   8.2.4  Returning Completed Work Requests.......................144
   8.2.5  Asynchronous Completion Notification....................145
   8.3  Error Handling............................................147
   8.3.1  Immediate Errors........................................148
   8.3.2  Work Completion Errors..................................148
   8.3.3  Asynchronous Errors.....................................150
   9    RNIC Verbs................................................157
   9.1  Consumer Accessibility....................................157
   9.2  RNIC Resource Management..................................158
   9.2.1  RNIC....................................................158
   9.2.1.1   Open RNIC............................................158
   9.2.1.2   Query RNIC...........................................159
   9.2.1.3   Close RNIC...........................................161
   9.2.2  Protection Domain.......................................162
   9.2.2.1   Allocate PD..........................................162
   9.2.2.2   Deallocate PD........................................163
   9.2.3  Completion Queue........................................163
   9.2.3.1   Create CQ............................................163
   9.2.3.2   Query CQ.............................................164
   9.2.3.3   Modify CQ............................................165
   9.2.3.4   Destroy CQ...........................................166
   9.2.4  Shared Receive Queue....................................167
   9.2.4.1   Create S-RQ..........................................167
   9.2.4.2   Query S-RQ...........................................168
   9.2.4.3   Modify S-RQ..........................................169



   Hilland, et al.        Expires October 2003               [Page 4]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   9.2.4.4   Destroy S-RQ.........................................170
   9.2.5  Queue Pair..............................................170
   9.2.5.1   Create QP............................................170
   9.2.5.2   Query QP.............................................174
   9.2.5.3   Modify QP............................................176
   9.2.5.4   Destroy QP...........................................178
   9.2.6  Memory Management.......................................179
   9.2.6.1   Allocate Non-Shared Memory Region STag...............179
   9.2.6.2   Register Non-Shared Memory Region (RI-Register)......180
   9.2.6.3   Query Memory Region..................................182
   9.2.6.4   Deallocate STag......................................183
   9.2.6.5   Reregister Non-Shared Memory Region (RI-Reregister)..184
   9.2.6.6   Register Shared Memory Region........................187
   9.2.6.7   Allocate Memory Window...............................188
   9.2.6.8   Query Memory Window..................................189
   9.3  Work Request Processing...................................190
   9.3.1  QP Operations...........................................190
   9.3.1.1   PostSQ...............................................190
   9.3.1.2   PostRQ...............................................197
   9.3.2  CQ Operations...........................................198
   9.3.2.1   Poll for Completion (Poll CQ)........................198
   9.3.2.2   Request Completion Notification......................200
   9.4  Event Handling............................................200
   9.4.1  Set Completion Event Handler............................200
   9.4.2  Set Asynchronous Event Handler..........................202
   9.5  Result Types..............................................203
   9.5.1  Immediate Status Codes..................................203
   9.5.1.1   RNIC Management Verb Status..........................204
   9.5.1.2   PD Management Verb Status............................204
   9.5.1.3   CQ Management Verb Status............................205
   9.5.1.4   S-RQ Management Verb Status..........................205
   9.5.1.5   QP Management Verb Status............................206
   9.5.1.6   Memory Management Verb Status........................207
   9.5.1.7   Post Verb Status.....................................208
   9.5.1.8   Event Management Verb Status.........................209
   9.5.2  Completion Status Codes.................................210
   9.5.3  Asynchronous Event Identifiers..........................212
   10   Security Considerations...................................217
   11   IANA Considerations.......................................218
   12   References................................................219
   12.1   Normative References....................................219
   12.2   Informative References..................................219
   13   Appendix..................................................220
   13.1   Connection Initialization at LLP Startup................220
   13.2   Graceful Receive Overflow Handling......................221
   14   AuthorÆs Addresses........................................223
   15   Acknowledgments...........................................224
   16   Full Copyright Statement..................................227





   Hilland, et al.        Expires October 2003               [Page 5]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Table of Figures

   Figure 1 - Architectural RNIC & RI Model..........................8
   Figure 2 - Resource Creation Dependency Diagram..................25
   Figure 3 - Resource Destruction Dependency Diagram...............27
   Figure 4 - Allowable QP Attribute Modifications..................37
   Figure 5 - Optional QP Attribute Modifications...................38
   Figure 6 - QP State Diagram......................................42
   Figure 7 - Idle State summary....................................47
   Figure 8 - RTS State summary.....................................52
   Figure 9 - Terminate State summary...............................55
   Figure 10 - Error State summary..................................57
   Figure 11 - Closing State summary................................61
   Figure 12- Terminate Control Field Values........................71
   Figure 13 - An example RDMA Read Resource negotiation............75
   Figure 14 - Connection Initialization after LLP Startup..........76
   Figure 15 - Normal Close on TCP..................................81
   Figure 16 - Abortive Teardown example on TCP.....................86
   Figure 17 - Memory Region and Window State Diagram...............92
   Figure 18 - Valid Combinations of MR Access Rights..............103
   Figure 19 - MR to MW Valid Binding Combinations.................117
   Figure 20 - Valid Combinations of MW & MR Access Rights.........119
   Figure 21 - Valid QP & STag Access Right Combinations...........128
   Figure 22 - Fencing on Prior Operations.........................142
   Figure 23 - Completion Errors with Resulting Terminate Codes....150
   Figure 24 - Affiliated Asynchronous Errors with Terminate Codes.155
   Figure 25 - Unaffiliated Asynchronous Errors with Terminate Code156
   Figure 26 - Memory Management Verbs.............................179
   Figure 27 - PostSQ Input Modifier Validity......................196
   Figure 28 - RNIC Management Verb Status.........................204
   Figure 29 - PD Management Verb Status...........................204
   Figure 30 - CQ Management Verb Status...........................205
   Figure 31 - S-RQ Management Verb Status.........................206
   Figure 32 - QP Management Verb Status...........................207
   Figure 33 - Memory Management Verb Status.......................208
   Figure 34 - Post Verb Status....................................209
   Figure 35 - Event Management Verb Status........................209
   Figure 36 - Completion Status Codes.............................212
   Figure 37 - Asynchronous Event Identifiers......................216
   Figure 39 - Connection Initialization at LLP Startup (using TCP)220













   Hilland, et al.        Expires October 2003               [Page 6]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

3  Introduction

   This document describes an abstract interface to an RDMA aware NIC
   (RNIC). The RNIC implements the RDMA Protocol [RDMAP][DDP] above a
   reliable transport, such as [MPA] over TCP. The Verbs provide the
   Consumer with a semantic definition of the RNIC Interface.

   RDMA provides Verbs Consumers the capability to control data
   placement, eliminate data copy operations, and significantly reduce
   communications overhead and latencies by allowing one Verbs Consumer
   to directly place information in another Verbs Consumer's memory,
   while preserving OS and memory protection semantics. Specification
   of syntactic definitions (API's, hardware registers) and
   implementation details (hardware, firmware, software tradeoffs) are
   beyond the scope of this specification.

   Section 5 of this document defines the semantics of the RNIC
   Interface (RI). This interface is implemented as a combination of
   the RNIC, its associated firmware, and host software. Section 6
   describes Queue Pairs, which represent the focus of interaction with
   the RNIC for work submission. Section 7 describes Memory Management
   and how the RNIC accesses buffers which contain data to be
   transferred. Section 8 describes Work Requests and the WR Processing
   Model, detailing the processing of the units of work from submission
   to completion. Section 9 describes the RNIC Verbs. The Verbs are an
   abstract description of the functionality of an RNIC Interface.
   Section 10 describes security issues associated with implementing an
   RDMA infrastructure.

   A concept frequently encountered in this specification is that of
   the Verbs Consumer, or simply, the Consumer. The precise meaning of
   the phrase varies, as a function of context, but it always means the
   executing entity employing the capabilities of the RNIC to
   accomplish some objective. In some instances the Verb Consumer may
   be an OS kernel thread, in others a non-privileged application, and
   in still others, some special, privileged process. Where the
   difference is important to the correct behavior of an
   implementation, it is defined explicitly.

   Specification of the API used by the Verbs Consumer to access the
   capabilities of the RI is outside of the scope of this
   specification.

   Figure 1 is a conceptual diagram that describes an architectural
   model which includes Privileged Mode consumers, Non-Privileged Mode
   consumers, RNIC components and the RI.







   Hilland, et al.        Expires October 2003               [Page 7]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003




             < Figure 1 did not convert properly from source >
             <  to be corrected in an upcoming version       >















                 Figure 1 - Architectural RNIC & RI Model






























   Hilland, et al.        Expires October 2003               [Page 8]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

4  Glossary

   Access Rights - The Local and Remote Memory Access Rights assigned
       to an STag. This includes Local Read, Local Write, Remote Read,
       Remote Write, Remote Access Flag, and Bind.

   Address List - A list of addresses that represent the physical pages
       or blocks referenced by the Physical Buffer List.

   Advertisement (Advertised, Advertise, Advertisements, Advertises) -
       The act of informing a Remote Peer that a Local Node's Buffer is
       available to it. A Node makes a buffer available for incoming
       RDMA Read Request Message or incoming RDMA Write Message access
       by informing its RDMA/DDP peer of the Tagged Buffer identifiers
       (STag, TO, and buffer length). This advertisement of Tagged
       Buffer information is not defined by RDMA/DDP and is left to the
       ULP. A typical method would be for the Local Peer to embed the
       Tagged Buffer's Steering Tag, TO, and length in a Send Message
       destined for the Remote Peer.

   Affiliated Asynchronous Event - This is an indication from the Verb
       layer to the Consumer that an event has occurred related to a
       specific identifiable RNIC Resource, such as a Completion Queue
       or Queue Pair.

   Affiliated Error - An error that can be directly related back to a
       specific RNIC Resource, such as a QP, S-RQ or CQ, but that
       cannot be returned through a Work Completion.

   Associated QP - The QP on the Remote Peer which is directly
       accessing the other end of the RDMA Stream.

   Asynchronous Error - This is an error that could not be reported
       through immediate or completion error-handling mechanisms at the
       local end. An asynchronous mechanism is necessary as a single
       point of error handling for errors which could not otherwise be
       reported through the normal mechanism since they are not
       associated directly with any single QP, S-RQ or CQ or the QP
       and/or CQ is in a state where an error cannot be reported.
       Asynchronous errors may be Unaffiliated or may be Affiliated
       with a specific QP, CQ or S-RQ.

   Base Tagged Offset (Base TO) - The offset assigned to the first byte
       of a Memory Region or a Memory Window.

   Bind, Binding, Bound - The act of associating an STag, TO, and
       Length within a previously registered Memory Region in order to
       define a Memory Window.





   Hilland, et al.        Expires October 2003               [Page 9]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Block List - A list of physical addresses describing a set of memory
       blocks, which specifies the block size, list of physical
       addresses, and offset to the start of the memory region of the
       first block. Each block has the same length and that length can
       be any value in the range supported by the RNIC. Each block may
       start at a byte granularity address. The starting address for
       the entire list may be an offset into the first block and the
       entire list may have any length.

   Complete (Completed, Completion, Completes) - When the Consumer can
       determine that a particular RDMA Operation has performed all
       functions specified for the RDMA Operation, including Placement
       and Delivery. This can be determined through a Work Completion
       for Signaled Work Requests. For Unsignaled Work Requests, this
       means that the Completion Rules have been met. Note that this is
       a superset of the [RDMAP] definition for RDMA Completion.

   Completion Error - A Processing Error reported through the
       Completion Queue.

   Completion Queue (CQ) - A sharable queue containing one or more
       entries which can contain Completion Queue Entries. A CQ is used
       to create a single point of completion notification for multiple
       Work Queues. The Work Queues associated with a Completion Queue
       may be from different QPs and of differing queue types (SQs or
       RQs).

   Completion Queue Entry (CQE) - The RNIC Interface internal
       representation of a Work Completion.

   Completion Status - The resultant status of a Work Request returned
       as part of a Work Completion.

   Consumer, Verbs Consumer - A software process that communicates
       using RDMA/DDP Verbs. The Consumer typically consists of an
       application program, or an operating system adaptation layer,
       which provides some OS specific API.

   Direct Data Placement Protocol (DDP) - A wire protocol that supports
       Direct Data Placement by associating explicit memory buffer
       placement information with the LLP payload units.

   Data Delivery (Delivery, Delivered, Delivers) - Delivery is defined
       as the process of informing the ULP or Consumer that a
       particular Message is available for use. This is specifically
       different from Data Placement, which may generally occur in any
       order, while the order of Data Delivery is strictly defined.

   Data Placement (Placement, Placed, Places) - A mechanism whereby ULP
       data contained within RDMA/DDP Segments may be put directly into



   Hilland, et al.        Expires October 2003              [Page 10]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       its final destination in memory without processing by the ULP.
       This may occur even when the RDMA/DDP Segments arrive out of
       order. Note that this differs from Data Delivery (see definition
       in this section). From the Verbs viewpoint, Data Placement is
       only confirmed upon Completion.

   Data Sink - The peer receiving a data payload. Note that the Data
       Sink can be required to both send and receive RDMA/DDP Messages
       to transfer a data payload.

   Data Source - The peer sending a data payload. Note that the Data
       Source can be required to both send and receive RDMA/DDP
       Messages to transfer a data payload.

   Event - An indication provided by the RDMAP Layer to the ULP to
       indicate a Completion or other condition requiring immediate
       attention.

   Fabric - The collection of links, switches, and routers that connect
       a set of Nodes with RDMA/DDP protocol implementations.

   First Byte Offset (FBO) - The offset into the first Physical Buffer
       of a Memory Region. The value of the FBO cannot exceed the size
       of the Physical Buffer Entry Size associated with the Memory
       Region.

   Handle - An opaque identifier used to reference an RNIC or an RNIC
       Resource. Whether this is an index, object or some other
       construct is outside the scope of this specification.

   Immediate Error -                   - An error discovered by the RNIC Interface (RI) and
       reported through the RI without affecting the RNIC.

   Inbound RDMA Read Queue Depth (IRD) - The maximum number of incoming
       outstanding RDMA Read Request Messages the RNICÆs QP can handle
       at the Data Source.

   Inbound RDMA Read Request Queue (IRRQ) - The RI internal resource
       which handles incoming RDMA Read Request Messages, queues them
       for processing them by the RI, and then generates the RDMA Read
       Response Messages. This corresponds to Queue Number 1 in [DDP].

   Invalidate STag (Invalidate, Invalidated, etc.) - A mechanism used
       to prevent the Remote Peer from reusing an Advertised STag,
       until the Local Peer transitions the STag to the Valid state.

   Invalidate Local STag - A Work Request that takes an STag which is
       valid within the local RI and performs an Invalidate STag
       operation.




   Hilland, et al.        Expires October 2003              [Page 11]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   iWARP - A suite of wire protocols comprised of [RDMAP] & [DDP]. The
       iWARP protocol suite may be layered above [MPA] and [TCP], or it
       may be layered over [SCTP] or other transport protocols.

   Local Access - The rights used to verify the RNIC's ability to
       access the Data Sink for incoming Untagged Messages, the Data
       Source for outgoing Untagged Messages and the Data Source for
       outgoing RDMA Write Messages.

   Local Fence - To block the current operation from executing until
       all prior local operations submitted on the same Work Queue have
       Completed.

   Local Peer - The RDMA/DDP protocol implementation on the local end
       of the connection. Used to refer to the local entity when
       describing a protocol exchange or other interaction between two
       Nodes.

   Lower Layer Protocol (LLP) - The protocol layer beneath the protocol
       layer currently being referenced. For example, for DDP the LLP
       is SCTP, MPA, or other transport protocols. For RDMA, the LLP is
       DDP.

   LLP Closed (LLP Close)- When the LLP Stream can no longer be used
       for data transmission. If there is a single LLP Stream on an LLP
       Connection, it may also mean that the LLP Connection has been
       torn down. For example, for TCP this could include the states
       TIME_WAIT, CLOSING, LAST-ACK, and CLOSED

   LLP Connection - Corresponds to an LLP transport-level connection
       between the peer LLP layers on two nodes.

   LLP Reset - The abnormal LLP closing mechanism, usually used to
       indicate that the LLP Stream (and possibly Connection) was
       aborted mid-stream. An example of this would be a TCP connection
       being closed due to the reception or transmission of a TCP RST
       on the connection.

   LLP Stream - Corresponds to a single bi-directional LLP transport-
       level association between the peer LLP layers on two Nodes. One
       or more LLP Streams may map to a single transport-level LLP
       Connection. For transport protocols that support multiple
       Streams per connection (e.g. SCTP), a LLP Stream corresponds to
       one transport-level Stream.

   Memory Region (MR) - An area of memory that the Consumer wants the
       RNIC to be able to (locally or locally and remotely) access
       directly in a logically contiguous fashion. A Memory Region is
       identified by an STag, a Base TO, and a length. A Memory Region
       is associated with a Physical Buffer List through the STag.



   Hilland, et al.        Expires October 2003              [Page 12]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Memory Registration (Registration, Register) - The mechanism used to
       enable direct (local or local and remote) access by the RNIC of
       a Consumer Memory Region. The memory registration operation
       associates a Physical Buffer List to the Steering Tag (STag)
       returned.

   Memory Translation and Protection Table(s) (TPT) - The data
       structure(s) used by an RNIC to control buffer access and
       translate STags and Tagged Offsets into local memory addresses
       directly accessible by the local Node.

   Memory Window (MW) -                      - A subset of a Memory Region, which can be
       remotely accessed in a logically contiguous fashion. A Memory
       Window is identified by an STag, a Base TO, and a length, but
       also references an underlying Memory Region and has Access
       Rights.

   Message Sequence Number (MSN) - For the Untagged Buffer Model, it
       specifies a sequence number that is increasing with each DDP
       Message.

   Modifiers - In a Verb definition, the list of input and output
       objects that specify how, and on what, the Verb is to be
       executed.

   Node - A computing device attached to one or more links of a Fabric
       (network). A Node in this context does not refer to a specific
       application or protocol instantiation running on the computer. A
       Node may consist of one or more RNICs installed in a host
       computer.

   Non-Privileged Mode - An operating mode in which Consumers must rely
       on another agent, having a sufficiently high level of privilege,
       to manipulate OS data structures.

   Non-Shared Memory Region - A Memory Region that solely owns the
       Physical Buffer List associated with the Memory Region.
       Specifically, the PBL is not shared, and has never been shared,
       with another Memory Region.

   Outbound RDMA Read Queue Depth (ORD) - The maximum number of
       outstanding RDMA Read Request Messages the RNIC can initiate
       from the SQ at the Data Sink.

   Outstanding - The state of a Work Request after it has been posted
       on a Work Queue, but before the retrieval of the Work
       Completion, or confirmation that the WR has been completed, by
       the Consumer.





   Hilland, et al.        Expires October 2003              [Page 13]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Page List - A list of physical addresses describing a set of memory
       pages, which specifies the page size, list of physical
       addresses, and offset to the start of the memory region of the
       first page. The starting physical addresses of each page is
       aligned on power-of-two addresses and the size of the page is a
       power of two. Note that it is possible for the starting offset
       to be an offset into the first page and to be of a byte
       granularity and the entire list may have an arbitrary length.

   Physical Address - A physical address is used by an RNIC to retrieve
       contents from the local host's memory. Physical addresses are
       determined via the translation of the STag and Tagged Offset by
       the use of the Memory Translation and Protection Table(s).

   Physical Buffer - A set of physically contiguous memory locations
       that can be directly accessed by the RNIC through Physical
       Addresses. A Physical Buffer can either be a block buffer or a
       page buffer, depending on its use as part of a Page List or
       Buffer List.

   Physical Buffer Entry Size - The size, in bytes, of each Physical
       Buffer in the Physical Buffer List. If the Physical Buffer List
       references a Page List, the size is a power of two. If the
       Physical Buffer List references a Block List, the size can have
       any value within the range supported by the RNIC.

   Physical Buffer List (PBL) - A list of Physical Buffers. The
       Physical Buffer List can either be a Block List or a Page List.

   Physical Memory Addresses - The addresses an RNIC uses when
       accessing host system memory.

   Pinning memory - A function supplied by the OS that forces the
       Memory Region to be resident in physical memory and keeps the
       virtual-to-physical address translations constant from the
       RNIC's point of view.

   Place - Also Placed, Placement. See Data Placement.

   Post Receive Queue Work Request (PostRQ) - A Verb that posts a Work
       Request to the Receive Queue of a Queue Pair. This is done to
       indicate the Data Sink Buffers for incoming Send Operation
       Types.

   Post Send Queue Work Request (PostSQ) - A Verb that posts a Work
       Request to the Send Queue of a Queue Pair. This is done to
       initiate all data transfer operations as well as Fast-Register,
       Bind MW and Local Invalidate operations.





   Hilland, et al.        Expires October 2003              [Page 14]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Privileged Mode - A mode in which Consumers operate where they have
       a privilege level sufficient to access OS internal data
       structured directly, and that have the responsibility to control
       access to the RI.

   Processing Error - An error detected below the RNIC Interface during
       the processing of a Work Request or an incoming RDMA operation.

   Protection Domain (PD) - A mechanism for tracking the association of
       Queue Pairs, Memory Windows, and Memory Regions. PDs are
       intended to be set by a Privileged Consumer to provide
       protection of one process from accessing another's memory
       through the use of the RNIC.

   Protection Domain ID (PD ID) - The identifier which represents a
       Protection Domain. It is passed in as an Input Modifier when
       creating QPs, Memory Windows and MRs. The value of PD IDs are
       compared during processing of Work Requests.

   Queue Pair (QP) - The pair of queues that allow the Consumer to
       interact with the RNIC Interface. The two queues are the Send
       Queue and the Receive Queue. Each queue stores a Work Queue
       Element from the time it is posted until the time it is
       completed.

   Queue Pair Context - The collection of information needed by the
       RNIC Interface to perform the RDMA Operations associated with
       the Queue Pair. This includes various pointers to buffers,
       queues, and CQs, as well as LLP specific connection and stream
       information.

   Queue Pair Identifier (QP ID) - An identifier representing a Queue
       Pair.

   Read Fence - To block the current operation from executing until all
       prior RDMA Read Type WRs submitted to the Send Queue have
       Completed.

   Receive Queue (RQ) - One of the two Work Queues associated with a
       Queue Pair. The Receive Queue contains Work Queue Elements that
       describe the Buffers into which data from incoming Send
       Operation Types is placed.

   Remote Access - The Access Rights used to verify the RNIC's ability
       to access the Data Sink for incoming DDP Tagged Messages and the
       Data Source for RDMA Read Request Messages.

   Remote Direct Memory Access (RDMA) - A method of accessing memory on
       a remote system in which the local system specifies the remote
       location of the data to be transferred. Employing an RNIC in the



   Hilland, et al.        Expires October 2003              [Page 15]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       remote system allows the access to take place without
       interrupting the processing of the CPU(s) on the system. Also
       used to indicate the layer implementing the RDMAP wire protocol
       semantics.

   RDMA Message - The sequence of DDP segments which represents an RDMA
       Operation.

   RDMA Operation - A sequence of RDMAP Messages, including control
       Messages, to transfer data from a Data Source to a Data Sink.
       The following RDMA Operations are defined - RDMA Write
       Operation, RDMA Read Operation, Send Operation, Send with
       Invalidate Operation, Send with Solicited Event Operation, Send
       with Solicited Event & Invalidate Operation, and Terminate
       Operation. Note that the various forms of Send Operations are
       defined in [RDMAP] to be called Send Type Operations.

   RDMA Protocol (RDMAP) - A wire protocol that supports RDMA
       Operations to transfer ULP data between a Local Peer and the
       Remote Peer. See [RDMAP].

   RDMA Read Operation - An RDMA Operation that consists of a single
       RDMA Read Request Message and a single RDMA Read Response
       Message. The Data Sink uses this operation to transfer the
       contents of a Data Source buffer from the Remote Peer to the
       Local Peer.

   RDMA Read Request - An RDMA Message used by the Data Sink to request
       the Data Source to transfer the contents of a buffer. The RDMA
       Read Request Message describes both the Data Source and Data
       Sink buffers.

   RDMA Read Response - An RDMA Message used by the Data Source to
       respond to an RDMA Read Request Message.

   RDMA Read Type Work Request - A PostSQ Work Request which specifies
       an operation type of either an RDMA Read or an RDMA Read with
       Invalidate Local STag.

   RDMA Stream - A single bi-directional association between the peer
       RDMA layers on two Nodes over a single LLP Stream.

   RDMA Write Operation - An RDMA Operation that transfers the contents
       of a source buffer from the Local Peer to a destination buffer
       at the Remote Peer using an RDMAP Write Message. The RDMAP Write
       Message only describes the Data Sink's buffer.

   RDMA Network Interface Controller (RNIC) - A network I/O adapter or
       embedded controller with iWARP and Verbs functionality.




   Hilland, et al.        Expires October 2003              [Page 16]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Remote Peer - The RDMA protocol implementation on the opposite end
       of the connection. Used to refer to the remote entity when
       describing protocol exchanges or other interactions between two
       Nodes.

   Remote RDMA Read Operation - a sequence of events that begins upon
       receipt of an incoming RDMA Read Request by the RI and stays in-
       process until the corresponding RDMA Read Response Message has
       been generated. This includes posting the RDMA Read Request to
       the Inbound RDMA Read Request Queue (See Section 6.5 -
       Outstanding RDMA Read Resource Management).

   RNIC Interface (RI) - The presentation of the RNIC to the Verbs
       Consumer as implemented through the combination of the RNIC and
       the RNIC device driver.

   Scatter/Gather Element (SGE) - An individual entry in a
       Scatter/Gather List. Each SGE consists of an STag, Tagged Offset
       and Length.

   Scatter/Gather List (SGL) - A List of Scatter/Gather Elements. The
       list describes one or more ULP Buffers which will have their
       data gathered on transmission or scattered upon reception.

   Send - An RDMA Operation that transfers the contents of an Untagged
       buffer from the Local Peer to an Untagged buffer at the Remote
       Peer.

   Send Operation Types - The set of Send operations that result in the
       consumption of a Receive Queue Work Request at the Data Sink.
       Specifically this includes Send, Send with Invalidate, Send with
       Solicited Event and Send with Solicited Event & Invalidate.

   Send Queue (SQ) - One of the two Work Queues associated with a Queue
       Pair. The Send Queue contains PostSQ Work Queue Elements that
       have specific operation types, such as Send Type, RDMA Write, or
       RDMA Read Type Operations, as well as STag operations such as
       Bind and Invalidate.

   Shared Memory Region - An MR that currently shares, or at one time
       shared, the Physical Buffer List associated with the Memory
       Region. Specifically, the PBL is currently shared or was
       previously shared with another Memory Region.

   Shared Receive Queue - An optional mechanism which allows the
       Receive Queues from multiple QPs to retrieve Receive Queue Work
       Queue Elements from the same shared queue as needed.

   Signaled - A WR which requires that the RNIC generate a Work
       Completion.



   Hilland, et al.        Expires October 2003              [Page 17]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Solicited Event (SE) - A facility by which an RDMA Operation sender
       may cause an Event to be generated at the recipient, if the
       recipient is configured to generate such an Event, when a Send
       with Solicited Event or Send with Solicited Event & Invalidate
       Message is received.

   Steering Tag (STag) - An identifier of a Memory Window or Memory
       Region. STags are composed of two components: an STag Index and
       an STag Key. The Consumer forms the STag by combining the STag
       Index with the STag Key. This specification further refines the
       definitions of STags contained in [RDMAP] and [DDP].

   STag Key - The least significant 8 bit portion of an STag. This
       field of an STag can be set to any value by the Consumer when
       performing a Memory Registration operation, such as Bind Memory
       Window, Fast-Register Memory Region and Register Memory Region.

   STag Index - The most significant 24 bits of an STag. This field of
       the STag is managed by the RI and is treated as an opaque object
       by the Consumer.

   Tagged Buffer - A buffer that can be Advertised to a Remote Peer
       through exchange of an STag, Tagged Offset, and length.

   Tagged Offset (TO) - The offset within a Tagged Buffer.

   Terminate - An RDMA Message used by a Node to pass an error
       indication to the Remote Peer on an RDMA Stream.

   Upper Layer Protocol (ULP) - The protocol layer above the Verb
       layer. An example is SDP.

   ULP Buffer - A buffer owned above the RI that can be represented
       within the RNIC, in whole or in part, by a Memory Window or a
       Memory Region.

   ULP Message - The ULP data that is handed to a specific protocol
       layer for transmission. Data boundaries are preserved as they
       are transmitted through iWARP.

   ULP Payload - The portion of a ULP Message that is contained within
       a single protocol segment or packet (e.g. a DDP Segment).

   Unaffiliated Asynchronous Event - This is an indication from the
       Verb layer to the Consumer that an event has occurred unrelated
       to any single identifiable RNIC Resource.

   Unsignaled - A Work Request which only generates a Work Completion
       if it encounters an error during processing.




   Hilland, et al.        Expires October 2003              [Page 18]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Untagged Buffer - A buffer which is not Advertised to a Remote Peer,
       that has Local Access Rights, and that is referenced by an STag,
       Tagged Offset, and length.

   Verbs - An abstract description of the functionality of an RNIC
       Interface. The OS may expose some or all of this functionality
       via one or more APIs to applications. The OS will also use some
       of the functionality to manage the RNIC Interface.

   Virtual Address - An address represented in the address space of a
       local process on a node. It is generally used to present
       logically contiguous addressability for an underlying and
       possibly non-contiguous list of physical pages.

   Virtual Address Based Tagged Offset (VA Based TO) - The Base TO of
       an MR or MW that starts at a non-zero TO.

   Work Completion (WC) - The output modifiers that the Consumer
       retrieves from a Completion Queue indicating the results of a
       Work Request.

   Work Queue (WQ) - One of either a Send Queue or Receive Queue.

   Work Queue Element (WQE) - The RNIC Interface's internal
       representation of Work Request.

   Work Request (WR) - An elementary object used by Consumers to
       enqueue a requested operation (WQEs) onto the Send and Receive
       Queues of a QP.

   Work Request List (WRL) - A list of Work Requests.

   Zero Based Tagged Offset (Zero Based TO) - The Base TO of an MR or
       MW that starts at TO=0.

4.1  Abbreviations

   CQ - Completion Queue

   CQE - Completion Queue Entry

   DDP - Direct Data Placement Protocol

   FBO - First Byte Offset

   IRD - Inbound RDMA Read Queue Depth

   IRRQ - Inbound RDMA Read Request Queue

   LLP - Lower Layer Protocol



   Hilland, et al.        Expires October 2003              [Page 19]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   MR - Memory Region

   MW - Memory Window

   ORD - Outbound RDMA Read Queue Depth

   PBL - Physical Buffer List

   PD - Protection Domain

   PD ID - Protection Domain Identifier

   QP - Queue Pair

   QP ID - Queue Pair Identifier

   RQ - Receive Queue

   RDMA - Remote Direct Memory Access

   RDMAP - Remote Direct Memory Access Protocol

   RNIC - RDMA NIC

   RI - RNIC Interface

   SGE - Scatter-Gather Element

   SGL - Scatter-Gather List

   SE - Solicited Event

   S-RQ - Shared Receive Queue

   SQ - Send Queue

   STag - Steering Tag

   TO - Tagged Offset

   TPT - Translation & Protection Table

   ULP - Upper Layer Protocol

   WC - Work Completion

   WQ - Work Queue

   WQE - Work Queue Element




   Hilland, et al.        Expires October 2003              [Page 20]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   WR - Work Request

   WRL - Work Request List


















































   Hilland, et al.        Expires October 2003              [Page 21]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

5  RNIC Interface

   The RNIC Interface (RI) is the locus of interaction between the
   Consumer of RNIC services and the RNIC. Semantic behavior of the
   RNIC is specified via Verbs, which enable creation and management of
   Queue Pairs, management of the RNIC, management of Work Requests,
   and transferring error indications from the RI that may be surfaced
   via the Verbs. All these activities must be carried out so as to
   enable Verbs Consumers to expect the same level of protection and
   security as are guaranteed other entities supported by the host
   operating system.

   A fundamental function of the RI is management of RNICs. This
   includes arranging access to them, accessing and modifying their
   attributes, and shutting them down. These activities are described
   below, and details of the corresponding Verbs semantics are given in
   subsequent sections.

   Direct, protected access to Consumer memory is critical to realizing
   the performance potential of the RNIC. This specification describes
   the semantics of memory access defined in this architecture. It
   describes in detail the ideas of Memory Regions and Memory Windows,
   how they are created and managed, Access Rights for local and remote
   access to registered memory, and the semantics of errors that may
   arise.

   The RI is assumed to be a traditional software interface, typically
   synchronous in behavior, while QP interactions are assumed to be
   work requests queued to connection specific, hardware based queues.
   The queue processing model and associated memory protection
   semantics allow QPs to be safely mapped and utilized by both Non-
   Privileged and Privileged routines.

   Queue Pairs (QPs) are a key component required for the operation of
   the RI. They are the RNIC resource used by Consumers to submit Work
   Requests to the RI. A QP is used to interact with an RDMA Stream on
   an RNIC which is running the RDMA Protocol. There may be thousands
   of QPs per RNIC. Each QP provides the Consumer with a single point
   of access to an individual RDMA Stream.

   Work Requests (WRs) provide the mechanism for Consumers to enqueue
   Work Queue Elements (WQEs) onto the Send and Receive queues of a QP.
   The varieties of WRs, and the dynamics of their creation, use, and
   disposition are described in the sections to follow, as are the
   disposition of errors that may arise as WR are processed. Details of
   the WR contents are discussed as well.

   Completion Queues (CQs) provide the mechanism for the Consumer to
   retrieve WR status. In addition, there are notification mechanisms




   Hilland, et al.        Expires October 2003              [Page 22]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   which help a Consumer to efficiently notice when WRs have completed
   processing in the RI. There may be thousands of CQs per RNIC.

   Event Handlers provide the mechanism for Consumers to be notified of
   Asynchronous Events which occur within the RI but which cannot be
   reported through the Completion Queues due to their asynchronous
   nature or the fact that they are not easily associated with a Work
   Completion.

5.1  The RNIC

   Consumers gain access to an RNIC through the RNIC Interface. The
   Verbs allow the Consumer to open the RNIC, retrieve RNIC attributes,
   and close the RNIC.

   All resources MUST be in the scope of the RNIC on which they are
   created. This means that there is no requirement for resources on
   one RNIC to be available, associated with or meaningful to another
   RNIC, even if they are managed by the same RNIC driver. This
   includes all QPs, STags, PDs, CQs, and multiple Completion Event
   Handlers. This also means that any IDs which are created by the RI
   are specific to that RNIC and are not guaranteed to be unique across
   all RNICs.

   An intent of the architecture is to allow an implementation to pass
   Work Requests and Work Completions to and from a Non-Privileged Mode
   Consumer process directly to and from the RNIC. Another intent of
   the architecture is to optimize for a Privileged Mode
   implementation, which shares the Work Request and Work Completion
   requirements of Non-Privileged Mode Consumers but has slightly
   different memory management requirements.

   Because the architecture attempts to optimize for both Privileged
   Mode and Non-Privileged Mode Consumers, there are some Verbs and
   Verb modes which are not allowed to be executed by non-Privileged
   Mode Consumers. An example of this is the use of the STag of zero or
   the ability to do Fast-Register WRs. In addition, there are some
   operations that, while being allowed in kernel mode, are intended to
   be used by Non-Privileged mode applications. An example of this is
   Memory Windows. Any restrictions are clearly specified in this
   document where required.

5.1.1  RNIC Resources

   RNIC Resources can be allocated from a variety of places. They can
   be allocated in host memory on behalf of the Consumer or allocated
   within the RNIC. Where an RNIC allocates resources is implementation
   specific. Consequently, values that the RNIC returns as output
   modifiers when Querying the RNIC indicate the maximum amount of any




   Hilland, et al.        Expires October 2003              [Page 23]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   given resource it can allocate, in the absence of other resource
   allocations.

   For example, an RNIC may allocate QPs, CQs from the same memory
   within the RNIC. If a Consumer allocates the maximum amount of QPs
   before allocating any CQs, it may not be able to allocate any CQs
   due to an insufficient resource condition - even though the RNIC
   indicates that its maximum number of CQs is much larger than the
   number currently allocated.

   The purpose of a handle is to provide a mechanism to lookup a
   specific resource. Resources that have handles associated with them
   are the RNIC, CQ, S-RQ, QP, and Asynchronous Event Handler. Often a
   handle is an address in memory. An identifier or index also
   references a specific resource. An identifier or index is used when
   the value must be used in a comparison operation. The QP ID, PD ID,
   Completion Event Handler Identifier and STag Index fall in this
   category.

   It is expected that a resource manager above the RI will manage RNIC
   resources appropriately for the operating environment.

5.1.1.1  Expected Creation Sequence

   Due to RI Resource interdependencies, there is an ordering sequence
   to the allocation and creation of RNIC resources. The sequence
   indicated below, while not strictly required in all cases, may be
   helpful to the reader.

   1.  Open the RNIC and setup up an Asynchronous Event Handler.

   2.  Prior to initiating a LLP Connection, select the opened RNIC on
       which you will create the connection and create a Protection
       Domain.

   3.  Create one or more Completion Queues.

   4.  Set up one or more Completion Event Handlers.

   5.  Allocate and initialize a Shared Receive Queue, if desired.

   6.  Allocate and initialize one or more QPs.

   7.  Register one or more Memory Regions.

   8.  Allocate Non-Shared Memory Region STags, if desired.

   9.  Allocate Memory Windows, if desired.

   10. Transition the QP through the state diagram to RTS.



   Hilland, et al.        Expires October 2003              [Page 24]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   11. Initiate Work Request Processing through PostSQ, PostRQ and Poll
       CQ.

   Below in Figure 2 is a dependency diagram which may also be helpful
   when determining the order in which resources are created. The
   arrows indicate that the resource the arrow comes from must be
   created or allocated before the item the arrow points to can be
   created or allocated.










             < Figure 2 did not convert properly from source >
             <  to be corrected in an upcoming version       >















              Figure 2 - Resource Creation Dependency Diagram

5.1.1.2  Expected Destruction Sequence

   Due to RI Resource interdependencies, there is an ordering of de-
   allocation and destruction of RNIC resources. The sequence indicated



   Hilland, et al.        Expires October 2003              [Page 25]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   below, while not strictly required in all cases, may be helpful to
   the reader.

   1.  Invalidate all Memory Windows which are in the Valid state
       through a QP WR, if possible.

   2.  Drain the SQ & RQ of WRs and poll the Work Completions through
       the CQ.

   3.  Transition the QP state to Closing.

   4.  When the QP is in the Idle state, Destroy the Memory Windows.

   5.  Destroy the Memory Regions.

   6.  Destroy the Queue Pair.

   7.  Destroy the Shared Receive Queue, if created.

   8.  Destroy the Completion Queues.

   9.  Destroy the Protection Domain.

   10. Close the RNIC.

   Below in Figure 3 is a dependency diagram which may also be helpful
   when determining the order in which resources are destroyed. The
   arrows indicate that the resource the arrow comes from must be
   destroyed or deallocated before the item the arrow points to can be
   destroyed or deallocated. A dashed line means the action should
   occur before the resource can be destroyed. A solid line means the
   action must occur before the resource can be destroyed.





















   Hilland, et al.        Expires October 2003              [Page 26]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003













             < Figure 3 did not convert properly from source >
             <  to be corrected in an upcoming version       >


















            Figure 3 - Resource Destruction Dependency Diagram







   Hilland, et al.        Expires October 2003              [Page 27]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

5.1.2  Opening an RNIC

   The Open RNIC Verb is used to open an RNIC and returns an opaque
   handle to uniquely reference each RNIC so that Consumers can
   distinguish between RNICs in the Local Node.

   Opening an RNIC prepares it for use by the Consumer. Once opened, an
   RNIC cannot be opened again until after it has been closed. At the
   time the RNIC is opened, the RI MUST perform any initialization
   functions required by the RNIC and the RI.

   When the Consumer invokes the Open RNIC Verb, it indicates if this
   RNIC is to be opened in Page Mode or Block Mode. The RI MUST
   initialize the RNIC in either Page Mode or Block Mode, as indicated
   by the Consumer with the input modifier. This will affect all Memory
   Registrations and usage as well as resource consumption on the RNIC.
   Note that while Page Mode MUST be supported, Block Mode is OPTIONAL.
   For more information on Block Mode vs. Page Mode, see Section 7.6.2
   - Physical Buffer Lists.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.1.1 - Open RNIC.

5.1.3  Query RNIC

   Consumers MUST be able to retrieve all of the defined attributes and
   characteristics of the RNIC through the Query RNIC Verb. The full
   list of RNIC Attributes is defined in Section 9.2.1.2 - Query RNIC.

   The maximum values returned when querying the RNIC are values which
   the RI will not exceed. This does not imply that a Consumer can
   allocate all resources to their maximum levels simultaneously.

5.1.4  Closing an RNIC

   Closing the RNIC resets the RNIC and deallocates any resources
   allocated during the RNIC open.

   The RI MUST track all RNIC resources created on behalf of the
   Consumer, such as those allocated within the RI during the creation
   of PDs, QPs, CQs, Memory Windows and MRs. When the Close RNIC verb
   returns, the RI MUST have freed all RNIC resources.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.1.3 - Close RNIC.

5.2  Protection Domains

   A Protection Domain (PD) is the mechanism used to associate Queue
   Pairs with Memory Regions and Memory Windows as a means of enabling



   Hilland, et al.        Expires October 2003              [Page 28]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   and controlling RNIC access to host system memory. A Protection
   Domain is represented by a unique identifier called a Protection
   Domain Identifier (PD ID).

   When the Consumer creates a PD, a PD ID is returned. The Consumer
   then provides the PD ID to the RI when creating QPs, MRs & Memory
   Windows. When a data transfer takes place, if the STag refers to an
   MR, then the PD ID of the MR is validated against the PD ID of the
   QP. If they do not match, the data transfer generates an error and
   no data transfer takes place. If the STag refers to an MW, then the
   PD ID of the MW is validated against the PD ID of the QP when the MW
   is Bound to the QP. When a data transfer takes place, the QP ID of
   the MW is validated against the QP ID of the QP. These rules allow
   the Consumer to ensure that any STag being used on that connection,
   either locally or remotely, has been specifically allowed by the
   Consumer to be used on that connection.

   Each Queue Pair in an RNIC MUST be associated with a single PD ID.
   Multiple Queue Pairs MUST be able to be associated with the same PD
   ID.

   Each Memory Region MUST be associated with a single PD ID. Multiple
   Memory Regions MUST be able to be associated with the same PD ID.

   Each Memory Window MUST be associated with a single PD ID when
   allocated. Multiple Memory Windows MUST be able to be associated
   with the same PD ID.

   The RI MUST be able to associate any PD ID with any MW, MR, QP or S-
   RQ on the RNIC.

   Binding a Memory Window to a Memory Region and Fast-Register are
   performed as Send Queue operations. The Bind operation MUST only be
   allowed if the PD ID of the QP matches the PD ID of the Memory
   Region and the PD ID of the QP matches the PD ID of the Memory
   Window. Similarly, the Fast-Register operation MUST only be allowed
   if the PD ID of the QP matches the PD ID of the STag used as an
   input modifier for the Fast-Register. If the PD ID checks fail for
   either operation, the operation MUST NOT take place and a Completion
   Error MUST be generated.

   Note that S-RQs use PDs as well. PD rules for S-RQs are covered in
   Section 6.3

5.2.1  Allocating a PD

   Protection Domains MUST only be allocated through the RI. A PD ID is
   required to be supplied as an input modifier when creating a Queue
   Pair, registering a Memory Region, or allocating a Memory Window.




   Hilland, et al.        Expires October 2003              [Page 29]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The RI MUST assign a unique PD ID to each PD allocated by the RI. PD
   ID's MUST be unique per RNIC. PD ID's MAY be unique across multiple
   RNICS which share the same RI.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.2.1 - Allocate PD.

5.2.2  Deallocating a PD

   PDs MUST only be deallocated through the RI. A PD MUST NOT be
   deallocated if it is still associated with any Queue Pair, Shared
   Receive Queue, Memory Region, or Memory Window. If this is
   attempted, the Verbs MUST return an Immediate Error and not allow
   the PD to be deallocated.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.2.2 - Deallocate PD.

5.3  Completion Queues

   The Completion Queue consists of entries to hold Work Completions.
   The RI's internal representations of Work Completions are called
   Completion Queue Entries (CQEs). The RI will post a CQE to the CQ
   when it completes the operation of a Signaled WR. The Consumer then
   Polls the CQ to retrieve the CQE as a Work Completion. When the Work
   Completion is retrieved, the CQE is freed from the CQ and the entry
   is available for another Work Request's Work Completion information.
   For an Unsignaled WR, the RI will not generate a CQE when the WR
   completes successfully. The RI will post a CQE to the CQ when an
   Unsignaled WR completes in an error. For more information on
   Signaled and Unsignaled Completions, see Section 8.1.3.1.

   A Completion Queue (CQ) MUST be the only mechanism used for the
   retrieval of Work Completions.

   A single CQ is used to hold CQEs from one or more Work Queues across
   one or more Queue Pairs on the same RNIC. A CQ MAY have zero or more
   Work Queue associations. Completion Queues MUST be able to service
   Send Queues, Receive Queues or both. Work Queues from multiple QPs
   MUST be able to be associated with a single CQ.

   Completion Queues and Completion Queue Entries are internal to the
   RNIC Interface, and are not directly accessible, nor is the format
   directly visible, by Verb Consumers.

5.3.1  Creating a Completion Queue

   Completion Queues MUST only be created through the RNIC Interface.





   Hilland, et al.        Expires October 2003              [Page 30]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The RI MUST verify that the Consumer has specified the number of
   CQEs the CQ should hold when creating a Completion Queue. The
   Consumer should ensure that this value is the maximum number of
   Completions the Consumer expects to be outstanding. The RNIC will
   then create the CQ with at least the specified number of entries.
   The number of entries allocated for the CQ by the RI MAY be greater
   than the number requested. If the CQ can be created, the RI MUST
   return the actual number of entries allocated for that CQ to the
   Consumer. If the RI is unable to allocate at least as many entries
   as the Consumer requested, an Immediate Error MUST be returned and
   the CQ MUST NOT be created.

   The RI is NOT REQUIRED to perform CQ overflow detection or
   protection. Therefore, the CQ overflow error codes in this document
   are OPTIONAL. When an overflow occurs, the results are
   indeterminate. Overflow of a CQ MUST NOT affect QPs which do not
   report Work Completions to that CQ and MUST NOT affect other CQs.
   Consequently, when creating the CQ, the Consumer should request
   enough outstanding Work Requests so that if every possible
   outstanding WR were to complete (such as may happen in an error
   case), there would be room for the CQE on the CQ. The RI MUST NOT
   enforce that every WQE on every Work Queue associated with the CQ
   must have a CQE available for the WQE's Work Completion information.

   If the Consumer wishes to have deterministic error behavior, at
   Create/Modify QP, the sum of the maximum number of WQEs associated
   with a single CQ should be less than or equal to the number of
   entries in the CQ. A Consumer can size the CQ smaller, in which case
   the error semantics of a CQ overflow are not deterministic, but
   possible RNIC behavior includes overwriting previous CQEs in whole
   or in part and thus may result in a data integrity issue.

   An additional consideration for sizing the CQ is QP Destruction. Any
   outstanding WRs which were on a Work Queue when it is destroyed may
   occupy entries on the associated CQ. For more information, see
   Section 6.1.4 - Destroying a Queue Pair.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.3.1 - Create CQ.

5.3.2  Querying Completion Queue Attributes

   There are two Completion Queue attributes that can be queried
   through the RI.

   The first of these attributes is the maximum number of entries
   allowed on the CQ. This attribute MUST be able to be retrieved
   through the Query CQ Verb.





   Hilland, et al.        Expires October 2003              [Page 31]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The other attribute is the Completion Event Handler Identifier,
   which also MUST be able to be retrieved through the Query CQ Verb.

   With one exception, the CQ Verbs do not expose which Work Queues are
   associated with a CQ. The exception is that the QP ID is reported by
   Poll CQ.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.3.2 - Query CQ.

5.3.3  Modifying Completion Queue Attributes

   An implementation MUST support resizing of a CQ through the RI while
   WRs are outstanding. Work Completions MUST NOT be lost due to a CQ
   resize. Resizing the CQ MUST NOT directly generate errors beyond
   Resize CQ Verb Immediate Errors and must either succeed or fail
   atomically. It is understood that this may adversely affect
   performance, and MAY result in connection timeouts. Note that this
   could ultimately result in the connection being torn down. If the
   Consumer wishes to avoid any possibility of a connection being torn
   down during the CQ resize operation, it should quiesce operations to
   the Work Queues associated with the CQ before resizing the CQ. The
   RI MUST NOT allow a CQ to be resized to a size that is smaller than
   the number of CQEs currently on the CQ; if this is attempted, an
   Immediate Error MUST be returned.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.3.3 - Modify CQ.

5.3.4  Destroying a Completion Queue

   CQs MUST only be destroyed through the RI.

   A CQ MUST NOT be destroyed if it is still associated with any Work
   Queue. If this is attempted, the Verbs MUST return an Immediate
   Error and not allow the CQ to be destroyed.

   When the Destroy CQ Verb returns, the RI MUST have returned or
   released any host resources allocated below the RNIC Interface on
   behalf of the Consumer that are related to the specified CQ. After
   the Destroy CQ Verb returns, the RI MUST NOT return any more Work
   Completions that are associated with the destroyed CQ.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.3.4 - Destroy CQ.








   Hilland, et al.        Expires October 2003              [Page 32]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6  Queue Pairs

   Queue Pairs (QP) are the RNIC resource used by Consumers to submit
   operations to the RNIC. A QP consists of a pair of Work Queues (Send
   and Receive) as well as a posting mechanism for each queue. The Send
   Queue (SQ) and Receive Queue (RQ) are each Work Queues, in that the
   Consumer posts Work Requests (WR) to them in order to get the RI to
   perform operations. In addition, there are resources that make up
   the QP with which the Consumer does not directly interact. These
   include the Inbound RDMA Read Request Queue and the Work Queue
   Elements (WQEs).

   Work Queue Elements are the representation of Work Requests inside
   of the RI, once the Work Requests have been posted to the QP.

   An internal Inbound RDMA Read Request Queue (IRRQ) MUST be
   associated with a Queue Pair when the QP is created or modified to
   support greater than zero incoming RDMA Read Request Messages. The
   IRRQ enqueues incoming RDMA Read Request Messages and processes them
   in order, sending RDMA Read Response Messages as a result. The depth
   of this queue MUST be specified when the QP is created and is set
   with the IRD Input Modifier.

   A QP is created by the RI at the request of a Consumer. The
   resources required by the RI to create the Work Queues and get them
   to transmit and receive resources are allocated at this time. The
   memory needed may be allocated from system memory, memory associated
   within the RNIC, or any other resources accessible through the
   Verbs.

   Certain QP attributes may be changed after QP creation. A Modify QP
   Verb is provided to modify the attributes. The details of this Verb
   are defined in Section 6.1.3 - Modifying Queue Pair Attributes.

   The Consumer should instruct the RI to destroy a QP that is no
   longer in use. The semantics for destruction of a QP are provided in
   this Section 6.1.4 - Destroying a Queue Pair.

   The Verbs Post Send Queue Work Request (Section 9.3.1.1 PostSQ) and
   Post Receive Queue Work Request (Section 9.3.1.2 PostRQ) provide a
   posting mechanism for the Consumer to indicate to the RI that there
   is work for the RI to perform and that there is a new WR,
   represented within the RI by a WQE, on the Work Queue. Details of
   Work Request handling are defined in Section 8 - Work Requests and
   the WR Processing Model.








   Hilland, et al.        Expires October 2003              [Page 33]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.1  Queue Pair Resource Handling

6.1.1  Creating a Queue Pair

   Queue Pairs are created through the RI. When a QP is created, the RI
   MUST verify that the Consumer has specified a complete set of
   initial attributes. The attributes that need to be defined when the
   QP is created are specified in Section 9.2.5.1 - Create QP.

   Two of the attributes that must be initialized when a QP is created
   is the maximum number of Outstanding WRs on the SQ and the maximum
   number of Outstanding WRs on the RQ. This number represents the
   maximum number of WRs which have been submitted but which have not
   Completed at any given time. This is really the maximum depth of the
   SQ or RQ and not the number of WRs on the Work Queue at the moment.
   The RI MUST support Consumers specifying the maximum number of
   outstanding WRs on the SQ and on the RQ and allow the maximum number
   of outstanding WRs on the SQ to be different from that on the RQ.
   The Consumer requests a maximum number of outstanding WRs on the SQ
   and on the RQ. The RI MUST return the maximum number of outstanding
   WRs allocated on the SQ and on the RQ, and each of these numbers MAY
   be greater than the number requested. For information on determining
   when WRs are completed, see Section 8.1.3.1 - Signaled Completions.
   Note that if the QP uses an S-RQ for incoming Untagged Messages, the
   maximum number of Outstanding WRs on the RQ is not needed.

   Each Work Queue in a QP MUST be associated with one and only one CQ
   when that QP is created.

   Since both WQEs and CQEs are implemented below the RI and the
   implementations are outside the scope of this specification, they
   may be implemented using a variety of mechanisms, including in the
   Local Host virtual memory address space. The RI MAY require that the
   Work Queues be in the same memory space as the corresponding
   Completion Queues or the creation of the QP will fail. Therefore the
   Consumer should assume that the CQ & QP share the same address
   space. If the RI detects that QP and CQ are inaccessible to each
   other, creation of the QP MAY fail.

   Other attributes that MUST be initialized when a QP is created are
   whether or not this QP will support the Fast-Register Non-Shared
   Memory Region operation and whether the QP supports an STag of zero.
   These attributes must only be enabled on QPs used by Privileged Mode
   Consumers. See Section 7.2.1 - STag of zero for an explanation of
   the STag of Zero. For an explanation of the Fast-Register Non-Shared
   Memory Region operation, see Section 7.3.2.5 - Fast-Register Non-
   Shared Memory Region.

   When a QP is created it MUST be associated with a PD. This is done
   by specifying the PD ID as an Input Modifier to Create QP.



   Hilland, et al.        Expires October 2003              [Page 34]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   An attribute that MUST be created within the RI when the Consumer
   invokes the Create QP Verbs is a Queue Pair Identifier (QP ID). The
   QP ID MUST be used by the RI to uniquely identify this QP within
   this RNIC to the Consumer. The QP ID is used when trying to
   determine if a Memory Window is Bound to the QP, as discussed in
   Section 7.10.2 - Binding Memory Windows to Memory Regions. The QP ID
   value MUST be returned as part of the Create QP, Query QP and Poll
   CQ Verbs.

   Create QP MUST NOT associate an LLP Connection or LLP Stream with
   the QP. No data will flow until the QP is Associated with another QP
   through an LLP Stream and the QP state is changed to RTS. For more
   details, see Section 6.6.1 - Connection Initialization.

   A QP can exist in one of several states. For the details of the QP
   states, see Section 6.2 - Queue Pair Resource States. The following
   list summarizes the valid QP states:

   *   Idle state - No LLP Stream is associated with the QP.

   *   RTS state - An LLP Stream is associated with the QP and normal
       data transfer can occur.

   *   Closing state - An error free LLP Close has begun but has not
       finished. It was initiated by either the Remote Peer or Local
       Peer.

   *   Terminate state - An error occurred. A Terminate Message was
       either sent or received, and the QP is waiting for either a LLP
       Close or LLP Reset before automatically transitioning the QP to
       the Error state.

   *   Error state - An error occurred. No LLP Stream is associated
       with the QP. A Terminate Message will be available through
       QueryQP if the QP transitioned through the Terminate state
       before entering the Error state. If the transition was from the
       Closing state to the Error state, a Terminate Message may be
       available.

   When the QP is created, it is initialized to the Idle state.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.5.1 - Create QP.

6.1.2  Querying Queue Pair Attributes

   Queue Pairs have attributes that can be retrieved through the Query
   QP Verb. The RI MUST support the complete list of QP attributes as
   described in Section 9.2.5.2 - Query QP.



   Hilland, et al.        Expires October 2003              [Page 35]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.1.3  Modifying Queue Pair Attributes

   Certain QP attributes may be modified after the QP has been created.
   If the Consumer invokes Modify QP without specifying all Required
   Attributes as defined in Figure 4, the RI MUST NOT modify any of the
   QP attributes and MUST return an Immediate Error. The RI MUST allow
   the Consumer to request a change for the Allowed Additional
   Attributes as described in Figure 4, for the QP state transitions
   also shown in the figure. On Consumer request, the RI MAY change the
   allowed Additional Attributes as described in Figure 5, for the QP
   state transitions shown in the figure, if the RI indicates through
   Query RNIC that the attribute in question is allowed to be changed.
   The Modify QP Verb output modifiers can be used to determine if the
   changes are actually made.

   If any of the QP attributes requested to be modified are invalid or
   the requested state transition is invalid, the RI MUST NOT modify
   any of the QP attributes and an Immediate Error MUST be returned.
   Note that the table is heavily dependent upon the QP state. For
   further information on the QP state, see Section 6.2 - Queue Pair
   Resource States.
































   Hilland, et al.        Expires October 2003              [Page 36]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

    Transition   Attributes that        Attributes that the RI must
                 Consumer is Required   Support and the Consumer may
                 to Supply for the      Supply for the State
                 State Transition       Transition

    Idle->Idle   Next state             ORD

    Idle->RTS    Next state,            Stream message buffer,
                 LLP Stream Handle      ORD

    Idle->Error  Next state             None

    RTS->RTS     Next state             ORD
    (Footnote 1)

    RTS->CLOSING Next state             None

    RTS->TERM    Next state             None


    RTS->Error   Next state             None

    Error->Idle  Next state             None

              Figure 4 - Allowable QP Attribute Modifications




















   Footnote 1: Changing these parameters in RTS requires care to avoid
   race conditions to prevent errors.




   Hilland, et al.        Expires October 2003              [Page 37]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

      Transition   Attributes that the RI Optionally Supports
                   and the Consumer may Supply for the State
                   Transition

      Idle->Idle   Max Number of SQ WQE,
                   Max Number of RQ WQE (Footnote 2),
                   IRD,
                   QP's RQ Limit,
                   QP's RQ Limit Armed

      Idle->RTS    Max Number of SQ WQE,
                   Max Number of RQ WQE (Footnote 2),
                   IRD,
                   QP's RQ Limit,
                   QP's RQ Limit Armed

      Idle->Error  None

      RTS->RTS     Max Number of SQ WQE,
      (Footnote 3) Max Number of RQ WQE (Footnote 2),
                   IRD

      RTS->CLOSING None

      RTS->TERM    None

      RTS->Error   None

      Error->Idle  None

              Figure 5 - Optional QP Attribute Modifications

   It is possible to modify the QP attributes in Figure 4 and Figure 5
   with Work Requests outstanding on the QP. Depending on the
   modification, any Work Requests outstanding on the specified QP
   might not execute properly when the attributes are changed.

   An RNIC MAY allow the Consumer to change the maximum number of
   outstanding WRs on the SQ and on the RQ. The RNIC MUST indicate to
   the Consumer if it supports the ability to change the number of


   Footnote 2: Note that changing the Max Number of RQ WQEs has no
   effect if the QP uses an S-RQ

   Footnote 3: Changing these parameters in RTS requires care to avoid
   race conditions to prevent errors.




   Hilland, et al.        Expires October 2003              [Page 38]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   outstanding WRs on a QP. If the RNIC supports it, it MUST allow the
   number of outstanding WRs on both the SQ and the RQ to be changed
   while WRs are still outstanding. In addition, the RI MUST support
   the ability to change this on every QP if it indicates an ability to
   change the outstanding number of WRs.

   It is understood that changing the number of WRs that a Work Queue
   may have outstanding may adversely affect performance. Resizing the
   QP MUST NOT cause Immediate, Completion or Asynchronous Errors, with
   the exception of Immediate Errors returned by the Modify Queue Pair
   Verb and possible LLP time-outs. It is expected that the resize
   operation MAY adversely affect the Associated QP attempting to
   communicate with the Local QP during the resize operation in the
   form of LLP time-outs and retries which could result in LLP Stream
   teardown (which would result in an Asynchronous Error). It is
   suggested that the Consumer only perform this resize operation when
   activity on the connections has been quiesced to minimize the risk
   of transitioning Associated QPs to the Error state as a result of
   LLP time-outs.

   If the number of requested outstanding WRs is smaller than the
   actual number of outstanding WRs currently on the Work Queue(s),
   then the modification of the QP MUST fail with an Immediate Error
   and the QP MUST remain in the original state.

   For information on performing a Modify QP and modifying the value of
   IRD and/or ORD, see Section 6.5 - Outstanding RDMA Read Resource
   Management.

   When the Modify QP Verb completes, any state change requested MUST
   have occurred or an Immediate Error MUST be returned, in which case
   the QP state and accompanying modifier changes MUST remain as they
   were prior to the Modify QP Verb being invoked.

   The LLP Stream and the LLP Stream Message Buffer Input Modifiers for
   Modify QP are covered in Section 6.6.1.

   Detailed information on the accompanying Verb can be found in
   Section 9.2.5.3 - Modify QP.

6.1.4  Destroying a Queue Pair

   Queue Pairs MUST only be destroyed through the RNIC Interface.

   Successful destruction of a QP MUST release all resources allocated
   by the RI for the QP on behalf of the Consumer. The RI MUST have
   destroyed the QP when the Destroy QP Verb has successfully
   completed. If the LLP Stream is still associated with the QP, a
   Destroy QP MUST include disassociating the LLP resources from the
   QP, and MAY include an LLP Reset. After a Destroy QP finishes, the



   Hilland, et al.        Expires October 2003              [Page 39]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   QP ID will be immediately available for use on any subsequently
   created QP. The QP will cease processing all WRs, and no additional
   CQEs resulting from any outstanding WRs on this QP will be posted to
   the CQ.

   The RI MUST not allow a QP to be destroyed if there are still Memory
   Windows Bound to the QP. If the Consumer attempts to destroy a QP
   with Memory Windows Bound to the QP, an Immediate Error MUST be
   returned by the RI.

   The RI MUST allow the Destroy QP Verb to succeed regardless of the
   QP's state, provided there are no MWs Bound to it. For more
   information on the resource destruction and deallocation sequence,
   see Section 5.1.1.2 - Expected Destruction Sequence.

   It is RECOMMENDED that before a Consumer attempts to destroy a Queue
   Pair, it should cleanly complete all outstanding Work Requests and
   invalidate all Memory Windows which are Bound to the QP. It is
   recommended that ULPs and Consumers provide a graceful termination
   mechanism, return all Advertised STags to a known state, submit WRs
   to Invalidate all outstanding Memory Windows and then move through
   the Closing state. The Consumer should then retrieve all outstanding
   Work Completions through the CQ(s) associated with the QP's SQ & RQ.
   Only then should the Consumer destroy the QP.

   A QP is allowed to have Work Requests outstanding on both Work
   Queues when a request to destroy the QP is made.

   Any outstanding WRs posted to the QP but not yet processed by the RI
   MAY result in CQEs that MAY be retrievable by the Consumer. Note
   that even in the case where CQEs were generated it might not be
   possible for the Consumer to retrieve them after the QP has been
   destroyed. Since it is implementation dependent as to whether CQEs
   are consumed for outstanding WRs on a QP after that QP is destroyed,
   for the purposes of CQ overflow prevention, the Consumer should
   consider each outstanding WR to have consumed an entry in the CQ.
   There are three ways to free the CQE consumed within the CQ. Any
   method is acceptable and they are not mutually exclusive. The three
   methods are:

   *   the Consumer polls the CQ (See Section 9.3.2.1 - Poll for
       Completion (Poll CQ)) until the CQ is empty, or

   *   the Consumer retrieves a WC for a WR which was submitted to a
       Work Queue associated with the same CQ and that WR was submitted
       after the previous QP was destroyed, or

   *   the Consumer polls (See Section 9.3.2.1 - Poll for Completion
       (Poll CQ) a number of Work Completions equal to the total number
       of entries that the CQ can hold.



   Hilland, et al.        Expires October 2003              [Page 40]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Detailed information on the accompanying Verb can be found in
   Section 9.2.5.4 - Destroy QP.

6.2  Queue Pair Resource States

   The RI MUST restrict the QP to only be in one of the five Resource
   States (or just "states") as shown in Figure 6. The RI MUST NOT
   support transitions between QP states that are not shown in Figure
   6.

   During any state in which iWARP processing is done, it is possible
   for errors to be detected by the RNIC. When this occurs, the QP
   state will eventually transition to the Error state.

   State transitions must only be initiated by the Modify QP Verb,
   except where otherwise explicitly stated in the state descriptions.

   Creation of a QP causes the QP to enter the state diagram in the
   Idle state. Destruction of a QP causes the QP to exit the state
   diagram.

   Below, in Figure 6, is the QP State diagram. It shows the five QP
   states and the allowed transitions between states as well as the
   events and methods which cause those transitions. The individual
   states and transitions are described in the following sections in
   detail.



























   Hilland, et al.        Expires October 2003              [Page 41]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003












             < Figure 6 did not convert properly from source >
             <  to be corrected in an upcoming version       >













                        Figure 6 - QP State Diagram





   Hilland, et al.        Expires October 2003              [Page 42]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.1  Idle State

   The QP MUST be in the Idle state following QP creation or when moved
   to this state with Modify QP. In this state, Send or Receive WRs MAY
   be posted but they MUST NOT be processed and CQEs MUST NOT be
   generated.

   Note that whether or not the Consumer posts WRs to the Send Queue
   when the QP is in the Idle state depends on the method chosen for
   connection initialization (see Section 6.6.1 - Connection
   Initialization).

   While in the Idle state the RI MUST NOT associate an LLP stream to
   the QP.

   The RI MUST return an Immediate Error if the Consumer attempts to
   transition the QP from the Idle state to the Terminate state or to
   the Closing state.

   A short summary table describing the state changes for Idle state is
   shown in Figure 7. The following are detailed descriptions of those
   changes.

   Note that under certain conditions the Consumer might be required to
   flush Work Requests from a prior RDMAP Stream when in the Idle
   state. This can be done by transitioning the QP from the Idle to
   Error state (the Error state flushes all WRs) and then back to the
   Idle state. This may be necessary if when the Idle state is reached
   automatically (i.e. no Consumer intervention) from the RTS state at
   the Local Peer, which will occur if:

   *   the QP is currently in the RTS state, and the Consumer is
       actively posting Work Requests (PostSQ or PostRQ),

   *   the Remote Peer initiates an LLP Close (e.g. for TCP, it
       generates a FIN segment),

   *   the Local RI receives the LLP Close request, and immediately
       transitions to Closing state,

   *   the RI automatically creates an LLP Close acknowledgement (i.e.
       for TCP, it generates a FIN ACK segment), thus finishing the LLP
       Close from the Local PeerÆs perspective,

   *   the RI flushes all WRs, and no errors occurred during the LLP
       Close or flush,

   *   the RI automatically transitions the QP to the Idle state,





   Hilland, et al.        Expires October 2003              [Page 43]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   the Consumer is not aware of the transition to Idle, and posts a
       Work Request thinking it can still transmit or receive data.

   Note that a normal close should only be done by a ULP after an end-
   to-end synchronization to ensure all outstanding Work Requests have
   been flushed end-to-end. This is because RDMAP does not provide a
   graceful close. Thus if the Consumer performed a PostSQ, it is an
   error made by the Consumer. However, if the ULP posted an extra
   PostRQ buffer, it is arguable whether this is an error made by the
   Consumer or not. In either case, to recover the resources before
   reusing the QP, the Consumer should cause the QP to transition to
   Error state to flush the WQEs on the SQ and RQ, and then transition
   the QP back to the Idle state.

6.2.1.1  Idle to Idle

   The Modify QP Verb MUST allow a transition of the QP from the Idle
   state to the Idle state. This is to allow certain Queue Pair Context
   attributes to be modified in this state before an association with a
   Remote Peer's QP has been established.

6.2.1.2  Idle to RTS

   The Modify QP Verb MUST allow a transition from the Idle state to
   the RTS state. This is to support LLP Stream establishment. For this
   transition, the Modify QP Verb requires an LLP Stream Handle, and
   allows a Stream Message Buffer as well as other Input Modifiers. In
   order to transition from Idle to RTS, the LLP must be in its
   "Established" state, able to send and receive data. If not, the
   Modify QP Verb MUST return an Immediate Error. For more details on
   LLP Stream establishment, see Section 6.6.1 - Connection
   Initialization.

   The RI performs the following actions in the Idle to RTS transition,
   which MAY be performed in order:

   1.  The RI resets the RDMAP, DDP and MPA layers to the initial
       conditions specified in the appropriate specifications. For
       example, the DDP Untagged Message Sequence Numbers (MSN) for the
       Receive queue & IRRQ, and the MPA marker position must be reset
       as described in [RDMAP], [DDP], and [MPA].

   2.  If the Modify QP Verb includes a Stream Message Buffer to send,
       it is RECOMMENDED that the RI performs the following list in
       order:

      1. The implementation should stop receiving messages from the LLP
         Stream.





   Hilland, et al.        Expires October 2003              [Page 44]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

      2. The RI should transmit the specified message buffer to the
         Remote Peer in streaming mode.

      3. The RI should associate the LLP Stream with the RDMAP, DDP,
         and MPA layers and the RI should enable the QP to receive and
         transmit iWARP messages.

      4. The implementation should resume receiving messages from the
         LLP Stream.

   3.  If the Modify QP Verb does not include a Stream Message Buffer
       to send, the RI should associate the LLP Stream with the RDMA,
       DDP, and MPA layers and the RI should enable the QP to receive
       and transmit iWARP messages.

   4.  The RI moves the QP to the RTS state and begins normal
       operation.

   The RI MAY implement the Verb in other ways, but the end result
   MUST:

   1. Associate RDMAP and Lower layers with the QP;

   2. While in streaming mode, transmit any Stream Message Buffer that
      was included in the Modify QP;

   3. Ensure that the QP enables reception and transmission of iWARP
      messages; and,

   4. That regardless of how quickly the remote side returns the first
      iWARP message, ensure that messages MUST NOT be lost.

   For example, if the Verb did not stop the LLP receive side, the
   following race condition MUST be handled properly:

   1. The Associated QP transitions to RTS,

   2. It begins transmitting RDMA packets,

   3. Then the rapid arrival of an iWARP message from the Remote Peer
      occurs while the Local Peer is transitioning, but not completed
      the transition, to the RTS state.

   Note that the Modify QP Idle to RTS transition that includes a
   Stream Message Buffer to send may take a significant amount of time
   to complete. This is due to the requirement to reliably transmit the
   stream message.






   Hilland, et al.        Expires October 2003              [Page 45]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.1.3  Idle to Error

   The Modify QP Verb MUST allow the Consumer to modify the QP from the
   Idle state to the Error state.

   If it becomes necessary to remove WQEs posted to the queues in the
   Idle state, the Consumer may Modify the QP to the Error state, and
   then back to Idle. Any WQEs on the SQ & RQ will be Completed with a
   Flushed status by this procedure. This procedure will not change the
   Completion Status of CQEs already Completed on the CQ. The Consumer
   can then Poll for Completion on the Completion Queue and examine the
   Completion Status to determine which WRs were flushed.

   Note there is no effect on the LLP since no LLP Stream has been
   associated with the QP at this point.






































   Hilland, et al.        Expires October 2003              [Page 46]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Event                        Action                       Next
                                                             State

   PostSQ, PostRQ               Enqueue WQE                  Idle

   WQE is present on or added   WQE is NOT processed         Idle
   to the tail of the SQ

   Modify QP->Idle (Footnote 4)                              Idle

   Modify QP->RTS and Stream    Reset RDMAP and Lower layers RTS
   Message Buffer included      to their initial conditions.
                                Associate RDMAP and Lower
                                layers with QP. Transmit the
                                specified Stream Message
                                buffer in Streaming mode and
                                enable iWARP mode as
                                described in 6.2.1.2.

   Modify QP->RTS with NO       Reset RDMAP and lower layers RTS
   Stream Message Buffer        to their initial conditions.
   included                     Associate RDMAP and lower
                                layers with QP. Enable iWARP
                                mode.

   Modify QP->Error                                          Error

   PostSQ/PostRQ error          Return an Immediate Error    Idle

   Modify QP, results in error  Return an Immediate Error    Idle

                       Figure 7 - Idle State summary













   Footnote 4: This transition allows changing QP parameters as defined
   in Figure 4.




   Hilland, et al.        Expires October 2003              [Page 47]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.2  RTS (Ready to Send) State

   The RTS state is the main operational state for iWARP operation. All
   normal message processing, both incoming and outgoing, occurs in
   this state.

   The QP MUST be in the RTS state to begin transmitting and receiving
   any messages. Prior to moving to this state, the LLP Connection &
   LLP Stream MUST be fully established.

   Once in this state, any WQEs already posted on the Send Queue will
   begin processing. Any new WQEs posted MUST be added to the tail of
   the queue, (and begin processing, if the queue is empty). Once in
   this state, valid incoming iWARP Messages MUST be processed, placed
   and Completed. In this state, posted Receive WRs will be added to
   the Receive Queue (or S-RQ), processed when a Send Operation Type
   arrives, and Completed as described in Section 8.2.4 - Completed
   Work Requests.

   The RTS state MAY be left automatically by any of a variety of
   processing Errors, which will cause a transition to either the
   Terminate or Error state. See Section 8.3 - Error Handling for
   details on which errors result in transitioning to which state.

   The RI MUST return an Immediate Error if the Consumer attempts to
   transition the QP from the RTS state to the Idle state.

   A short summary table describing the state changes for RTS state is
   shown in Figure 8. Following are detailed descriptions of those
   changes.

6.2.2.1  RTS to RTS

   The Modify QP Verb MUST allow the Consumer to modify the QP from the
   RTS state to the RTS state. This allows certain QP parameters to be
   changed while the QP is Associated with another QP through an LLP
   Stream.

   Among the parameters that MAY be changed are IRD and ORD, the
   maximum number of WQEs supported by the SQ or RQ. A Consumer should
   take care when making changes to these parameters in order to
   prevent potential race conditions between the Modify operation, the
   posting of operations on the Send and Receive Queue, and incoming
   messages. For example, reducing the size of the Send or Receive
   Queue can only be done when there are fewer WQEs present on the
   queue than the new size. It is the responsibility of the consumer to
   track the number of outstanding WR on the SQ and RQ if it intends to
   modify the size of the SQ or the RQ. For IRD and ORD details, see
   Section 6.5 - Outstanding RDMA Read Resource Management.




   Hilland, et al.        Expires October 2003              [Page 48]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.2.2  RTS to Closing

   If the Remote Peer begins an LLP Close operation that does not
   include a Terminate Message (e.g. for TCP a FIN was received), the
   RI MUST cause the QP to leave the RTS state automatically. If all
   Send Queue Work Requests and Remote RDMA Read Operations (i.e.
   incoming RDMA Read Request Messages and associated RDMA Read
   Response Messages) are completed, the QP MUST transition to the
   Closing state; If this is not true, or a Terminate Message was
   received, the QP MUST transition to the Terminate state (see
   following section). In all of the above cases the RI MUST create an
   Affiliated Asynchronous Event to report the transition.

   The Modify QP Verb MUST allow the Consumer to modify the QP from the
   RTS state to the Closing state, to begin an LLP Close operation
   (e.g. for TCP a FIN segment is generated), and MUST NOT generate an
   Affiliated Asynchronous Event. See Section 6.6.2.1 - Normal Close
   for more details. When doing a Modify QP to Closing, all Send Queue
   Work Requests should have been previously Completed, any Remote RDMA
   Read Operations should have been previously finished, and the
   Consumer should have stopped posting PostSQ operations, so that no
   work remains for the QP to do. If this is not the case, the RI MUST
   ensure that either of the following actions are taken:

   *   The Modify QP MAY cause a transition to the Closing state which
       is immediately followed by a transition to the Error state (due
       to the SQ being non-empty).

   *   The Modify QP MAY cause a transition to the Closing state
       followed by a transition to the Idle state (because the SQ was
       originally empty, the LLP Close completed, causing the
       transition to the Idle state, and yet the Consumer was still
       posting SQ operations).

   If this Modify QP Verb completes without error, the QP has
   successfully transitioned to the Closing state (although it may have
   already transitioned out of the Closing state).

6.2.2.3  RTS to Terminate

   The Modify QP Verb MUST allow the Consumer to modify the QP from the
   RTS state to the Terminate state. This enables the Consumer to
   inform the Remote Peer that an Abnormal ULP Termination of the
   connected stream is being done. The Modify QP will result in the
   Error Code subfield of the Terminate Control Field of the Terminate
   Message (See [RDMAP]) having a value of 0x0000: Local Catastrophic
   Error. The Terminate Buffer will then be available to the Local node
   via Query QP and to the Remote Peer through Query QP (provided the
   Terminate Message arrives at and is processed by the Remote Peer).




   Hilland, et al.        Expires October 2003              [Page 49]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   When this Verb completes, the QP is in the Terminate state. For more
   details, see 6.6.2.2 - ULP Initiated Termination.

   The RTS to Terminate state transition MUST occur automatically
   following: a locally detected error; a Remote Peer beginning an LLP
   Close (e.g. for TCP a FIN was received) with either local Send Queue
   WQEs incomplete, or local Remote RDMA Read Operations incomplete;
   operation error; or any other error that would cause the RI to
   generate a Terminate Message. If the transition to the Terminate
   state is due to other locally detected errors, the RI MUST create
   the appropriate Asynchronous Error Event reporting that error. See
   Section 8.3.3 - Asynchronous Errors.

   The WR, if any, which caused the QP to enter into the Terminate
   state MUST be completed with the correct Completion Error Code for
   the error through the CQ associated with the WQ that experienced the
   error.

   If a remote Terminate Message is received, the Terminate state MUST
   be automatically entered and an Asynchronous Error Event MUST be
   reported with a status of "Termination Message Received". In this
   case, the RI MUST NOT send a Terminate Message back to the Remote
   Peer. Note that if TCP is the LLP, depending upon implementation of
   LLP Close, the RI may immediately transition to the Error state or
   it may wait for a TCP ACK before the transition.

6.2.2.4  RTS to Error

   The Modify QP Verb MUST allow the Consumer to modify the QP from the
   RTS state to the Error state. This enables the Consumer to perform
   an Abnormal ULP initiated Abortive Teardown (for more details, see
   Section 6.6.2.3 - ULP Initiated Abortive Teardown).

   An LLP failure that prevents further transmissions will also cause
   the RTS to Error transition.

   When the QP transitions from the RTS state to the Error state, the
   LLP stream MUST NOT be associated with the QP.

   The following are done prior to entering Error state:

   *   The RI MUST stop processing SQ WRs, Remote RDMA Read Operations,
       and any incoming iWARP Segments targeting the QP. See Section
       6.4 - Stopping QP processing and Sending the Terminate Message
       for additional information.

   *   If the LLP Stream has not closed, an LLP Reset MUST occur

   *   The LLP Stream resources MUST no longer be associated with the
       QP once the LLP actions, if any, are taken.



   Hilland, et al.        Expires October 2003              [Page 50]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   If this transition is due to a failure of the LLP, the RI MUST
       create an Asynchronous Error event reporting the error.

   When the prior items complete, the QP MUST be transitioned to the
   Error state.
















































   Hilland, et al.        Expires October 2003              [Page 51]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Event                        Action                     Next
                                                           State

   PostSQ, PostRQ               Enqueue WQE                RTS
   Valid iWARP Segment Arrives  Process Segment            RTS
   WQE is present on or added   Process WQE(s) and send    RTS
   to Send Queue                data (as necessary)
   Modify QP->Closing           Begin LLP Graceful Close   Closing
   Modify QP->RTS               Modify QP parameters as    RTS
                                document in Section 6.5
   Modify QP->Error             Stop QP processing,LLP     Error
                                Reset & LLP Disassociated
   Modify QP->Terminate         Generate Terminate Message  Terminate
   PostSQ/PostRQ error          Return an Immediate Error  RTS

   Modify QP, resulting in an   Return an Immediate Error  RTS
   Immediate Error
   LLP Failure that prevents    Stop QP processing, LLP    Error
   transmission of the          Reset, LLP Disassociated,
   Terminate Message            Create Asynchronous Error
   LLP Failure that allows      Generate Terminate Message Terminate
   transmission of the
   Terminate Message
   Local incoming RDMA Message  Generate Terminate Message Terminate
   processing error (RDMA Read
   Request, RDMA Read Response,
   or RDMA Write handling)
   Local incoming Send Type     Generate Terminate Message Terminate
   Message Processing Error
   Local WQ processing error    Complete WR as necessary,  Terminate
                                Generate Terminate Message
   Received Terminate Message                              Terminate
   LLP Close Received AND (SQ   Generate Terminate Message Terminate
   NOT empty OR IRRQ NOT empty)
   LLP Close Received AND SQ                               Closing
   empty AND IRRQ empty

                       Figure 8 - RTS State summary



   Hilland, et al.        Expires October 2003              [Page 52]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.3  Terminate State

   The Terminate state is used to send the final Terminate Message and
   begin an LLP Close if an error has occurred, or as a staging ground
   to perform an LLP Close if a Terminate Message was received from the
   Remote Peer. This state is transitory. The duration is limited by
   the time to finish the LLP Close operation or a final timeout in LLP
   Close (which would cause an LLP Reset).

   When the Terminate state is exited to the Error state, the LLP
   Stream MUST no longer be associated with the QP and the LLP Stream
   MUST be in either a condition of LLP Closed or LLP Reset.

   It is possible to examine the Terminate Message buffer while in this
   state by using Query QP (Section 9.2.5.2) to retrieve the Terminate
   Message.

   A short summary table describing the state changes for the Terminate
   state is shown in Figure 9. The following are detailed descriptions
   of those changes.

   While in the Terminate state, the following are done:

   *   The RI MUST stop processing SQ WRs, Remote RDMA Read Operations
       and any new incoming iWARP Segments targeting the QP. For
       additional information, see Section 6.4 - Stopping QP processing
       and Sending the Terminate Message.

   *   The RNIC MUST attempt to send the RDMAP Terminate Message,
       indicating the cause of error, except when the Terminate state
       is entered due to reception of a remote Terminate Message. Note
       that sending the Terminate Message may not be successful if an
       LLP Reset occurs.

   *   The RI MUST begin an LLP Close operation.

   *   If the current stream is the last (or only) active LLP Stream on
       the LLP Connection, or the LLP is in a state where all streams
       are unable to operate, the LLP Close MUST cause the LLP
       Connection to be closed. (For example, in [TCP] the FIN is sent
       and the close sequence is done.)

   *   If an LLP error occurs during the sending of the Terminate
       Message (including reception of an incoming LLP Reset, between
       the time the Terminate state is entered and the LLP Close
       sequence is completed), or due to an LLP final timeout while the
       LLP Close operation is not finished, then an LLP Reset MUST
       occur and its resources MUST no longer be associated with the
       QP. Note that the LLP MUST use a timeout to detect errors, so
       that the QP is in the Terminate state for a bounded time.



   Hilland, et al.        Expires October 2003              [Page 53]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   At some point in the Terminate state, the RI MUST begin to
       return an Immediate Error for any attempt to post a WR to a Work
       Queue; prior to that point, WQEs MUST be enqueued (and
       eventually flushed) or result in an Immediate Error.

   *   The RI MAY begin to flush any incomplete WRs on the SQ or RQ.
       Please see the Section 6.2.4 - Error State for further
       requirements about flushing incomplete WRs.

   *   When the prior actions are done:

       1.  If the transition to the Terminate state is due to the
           Modify QP Verb, the RI MUST NOT create an Asynchronous Error
           Event reporting "Error State Entered". If the transition to
           the Terminate state is due to the Modify QP Verb, but an LLP
           error occurred while in the Terminate state, then the RI
           MUST generate an Asynchronous Error reporting "Bad Close".

       2.  If the transition to the Terminate state is due to an error
           that is reported in a Work Completion, the RI MUST NOT
           create an Asynchronous Error. See Section 8.3.2 - Work
           Completion Errors. If the transition to the Terminate state
           is due to an error that is reported in a Completion, but an
           LLP error occurred while in the Terminate state, then the RI
           MUST generate an Asynchronous Error reporting "Bad Close".

   When the actions listed above are complete, and the LLP Close is
   finished, the QP state MUST move automatically to the Error state.

   When the LLP Close is finished or an LLP Reset occurs, the RI MUST
   disassociate the QP from the LLP Stream, including any LLP Stream
   context and any resources associated with it. Disassociating the LLP
   Stream from the QP means that it becomes possible for the QP to be
   transitioned to Idle and to RTS with a new LLP Stream.



















   Hilland, et al.        Expires October 2003              [Page 54]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Any attempt to perform a Modify QP in the Terminate state MUST
   return with an Immediate Error.

   Event                        Action                      Next
                                                            State

   On entry                     Stop QP processing,         Terminate
                                Send & attempt to complete
                                Terminate Message if one
                                wasn't received.
                                LLP Close Initiated

   LLP Close complete           Create Asynchronous Event   Error
                                if necessary,
                                LLP Disassociated from QP

   LLP Failure that prevents    LLP Reset and create        Error
   transmission of the          Asynchronous Event if
   Terminate Message            necessary,
                                LLP Disassociated from QP

   Valid IWARP Segment Arrives  Ignore Segment              Terminate

   PostSQ/PostRQ error          Return an Immediate Error   Terminate

   Modify QP                    Return an Immediate Error   Terminate

   WQE is present on or added   WQE is NOT processed and is Terminate
   to a Work Queue              eventually flushed.

                    Figure 9 - Terminate State summary




















   Hilland, et al.        Expires October 2003              [Page 55]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.4  Error State

   The Error state provides an indication that the QP has experienced
   an error (or transitioned to the Error state through the use of a
   Modify QP) and has stopped operations. On entry to the Error state,
   the LLP Stream MUST NOT be associated with the QP.

   The RI MUST return an Immediate Error if the Consumer attempts to
   transition the QP from the Error state to the RTS, Terminate, or
   Closing state.

   The following is done on entry into the Error state:

   *   The RI MUST flush any incomplete WRs on the SQ or RQ. All WQEs
       on the SQ and RQ, except for the WQE that caused the error (if
       any), MUST be returned with the Flushed Error Completion Status
       through the Completion Queue associated with the WQ. Note that
       the WQE which caused the error may not be at the head of the
       Work Queue. The Consumer should expect in some cases to retrieve
       Work Completions with the Flushed Error Completion Status, as
       well as potential successful completions, before retrieving the
       WC for the WR which caused the error. The RI MUST NOT return
       more than one Work Completion with a Work Completion Status set
       to something other than the Flushed Completion Status or the
       Success Completion Status.

   *   At some point in the execution of the flushing operation, the RI
       MUST begin to return an Immediate Error for any attempt to post
       a WR to a Work Queue; prior to that point, any WQEs posted to a
       Work Queue MUST be enqueued and then flushed as described above
       (e.g. The PostSQ is done in Non-Privileged Mode and the Non-
       Privileged Mode portion of the RI has not yet been informed that
       the QP is in the Error state).

   If a Terminate Message was sent or received, the RI MUST allow the
   Consumer to retrieve it through the Query QP Verb (Section 9.2.5.2).

   Following entry to the Error state, and before Destroying the QP or
   restarting the QP by going through Idle to RTS, it may be necessary
   to clean up some of the resources associated with the QP.

   *   Work Completions should be reaped by using Poll for Completion
       (Poll CQ) (see Section 9.3.2.1) before destroying the QP,
       otherwise they may become inaccessible.

   *   Memory Window resources MUST be deallocated by using Deallocate
       STag (see Section 9.2.6.4). This is necessary since in the Valid
       state they are associated with the QP. QP destruction will fail
       when Memory Windows which are in the Valid state are still Bound
       to the QP.



   Hilland, et al.        Expires October 2003              [Page 56]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Memory Regions can be invalidated by posting an Invalidate Local
       STag WR to other SQs in the same PD, or they can be deallocated
       by using Deallocate STag. If left in the Valid state, the
       associated memory may be at risk of unexpected remote access.

   If the QP is transitioning to the Error state, or has not yet
   finished flushing the Work Queues, a Modify QP request to transition
   to the IDLE state MUST fail with an Immediate Error. If none of the
   prior conditions are true, a Modify QP to the Idle state MUST take
   the QP to the Idle state. No other state transitions out of Error
   are supported. Any attempt to transition the QP to a state other
   than Idle MUST result in an Immediate Error.

   A short summary table describing the state changes for Error state
   is shown in Figure 10.

   Event                        Action                       Next
                                                             State

   On Entry                     Flush any incomplete WQEs

   Modify QP->Idle                                           Idle
   (no outstanding WRs and
    not in transition to Error)

   Modify QP->Idle              Return an Immediate Error    Error
   (outstanding WRs or
    in transition to Error)

   Post WR                      Post WQE, and then Flush it, Error
                                OR
                                Return an Immediate Error

   Modify QP, resulting in an   Return an Immediate Error    Error
   error

                      Figure 10 - Error State summary














   Hilland, et al.        Expires October 2003              [Page 57]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.2.5  Closing State

   This state is used to wait for the LLP to complete the LLP Close, if
   no errors occurred. For some LLPs or some RI implementations, moving
   a QP from the RTS state to the Idle state can require an end-to-end
   acknowledgement or require the Remote Peer to close their half of
   the LLP Stream before the LLP Close is finished. This may take a
   significant amount of time. Thus the Closing state is provided so
   that these operations are done in a fashion that is visible to the
   Consumer. Note that some RI implementations may require the LLP
   Stream to be completely closed before transitioning to the Idle
   state. This can be in the order of tens of seconds (e.g. an RI
   implementation on TCP may require TCP to be in the CLOSED state,
   possibly waiting in the TIME-WAIT state for a significant amount of
   time).

   If the LLP Close operation does not require the LLP to transmit
   messages (e.g. for SCTP there is no mechanism to close a single LLP
   Stream, thus when one LLP Stream is closed and other LLP Streams
   remain active, there is no end-to-end handshake required), then the
   RI MAY transition rapidly through this state.

   When the Closing state is exited to Idle, the LLP Stream MUST NOT be
   associated with the QP.

   Any attempt to perform a Modify QP in the Closing state MUST return
   an Immediate Error.

   Errors detected by the RI when the QP is in the Closing state result
   in a transition to the Error state; for LLP failures, this is
   indicated with the specific Asynchronous Event "LLP Connection
   Lost".

   A short summary table describing the state changes for the Closing
   state is shown in Figure 11. Following are detailed descriptions of
   those changes.

   The following are done prior to exiting Closing state:

   *   The RI MUST stop processing SQ WRs and Remote RDMA Read
       Operations targeting the QP.

   *   The RI MUST stop processing any incoming segments, though the RI
       MAY process any arriving Terminate Messages.

   *   At some point in the Closing state the RI MUST begin to return
       an Immediate Error for any attempt to post a WR to a Work Queue;
       prior to that point, WQEs MUST be enqueued or result in an
       Immediate Error.




   Hilland, et al.        Expires October 2003              [Page 58]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The RI MUST flush all incomplete WQEs on the RQ. All WQEs on the
       RQ MUST be returned with the Flushed Error Completion Status
       through the Completion Queue associated with the RQ. If RQ WQEs
       are enqueued, the RI MUST flush the WQE with the Flushed Error
       Completion Status through the Completion Queue associated with
       the RQ.

   *   If no errors have been detected (see next bullet), an LLP Close
       MUST occur. If the LLP Stream is the last or only active stream
       for the LLP Connection, the LLP Connection MUST be attempted to
       be closed gracefully. (For example, in [TCP] the FIN is sent and
       close sequence is done.).

   *   The RI MUST generate an Asynchronous Error if:

       o   Any SQ WQEs were on the SQ at any time during the Closing
           state. Note, this condition may happen if the PostSQ is done
           in Non-Privileged Mode and the Non-Privileged Mode portion
           of the RI has not yet been informed that the QP is in the
           Closing state. Also, the Error state will flush all SQ WQEs.

       o   Any incoming data arrives during the LLP Close. If the
           incoming data is a Terminate Message, the RI MAY allow the
           Consumer to retrieve the Terminate Message through the Query
           QP Verb.

       o   Any Remote RDMA Read Operations are in process.

       o   An LLP Stream failure (e.g. LLP Stream is lost) occurs
           during the LLP Close. Note that the RI MUST use a timeout
           mechanism to detect LLP errors during the LLP Close, so that
           the QP is in the Closing state for a bounded time. If the
           LLP detects a final timeout, it MUST be considered an error.

   *   If the RI generates an Asynchronous Error, the following MUST
       occur in order:

       o   An LLP Reset MUST occur and the LLP resources MUST no longer
           be associated with the QP.

       o   The QP MUST be transitioned to the Error state.

       o   The RI MUST generate an Asynchronous Event

   *   If no error occurs during the LLP Close operation:

       o   When all RQ WRs have been flushed and the LLP Close has
           finished, the LLP Stream MUST be disassociated with the QP,
           the RI MUST generate an Asynchronous Event "LLP Close
           Complete".



   Hilland, et al.        Expires October 2003              [Page 59]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   When the prior items complete, the QP MUST be transitioned
           to the Idle state.

   When the LLP Close is finished or an LLP Reset occurs, the RI MUST
   disassociate the QP from the LLP Stream, including any LLP Stream
   context and any resources associated with it. Disassociating the LLP
   Stream from the QP means that it becomes possible for the QP to be
   transitioned to Idle and to RTS with a new LLP Stream.

   Note that it is possible for the Consumer to post WRs while the
   automatic transition from RTS to Closing to Idle is occurring. See
   Section 6.2.1 - Idle State for additional details.









































   Hilland, et al.        Expires October 2003              [Page 60]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Event                       Action                        Next
                                                             State

   On Entry                    Stop QP processing, start LLP Closing
                               Close, and start Flushing any
                               incomplete WQEs on Receive
                               queues.

   LLP Close complete, all RQ  Create Asynchronous Event:    Idle
   WQEs flushed, and no SQ     "LLP Close Complete", LLP
   WQEs on the SQ              Disassociated from QP

   At least one SQ WQE on the  Perform LLP Reset, Create     Error
   SQ or Remote RDMA Read      Asynchronous Event: "Bad
   Operation in progress.      Close", LLP Disassociated
                               from QP

   LLP Connection Failure      Perform LLP Reset, Create     Error
                               Asynchronous Event: "LLP
                               Connection Lost", LLP
                               Disassociated from QP

   Segment Arrives and is not  Perform LLP Reset, Segment is Error
   a Terminate Message         not processed. Create
                               Asynchronous Event: "Bad
                               Close", LLP Disassociated
                               from QP

   Segment Arrives and is a    Perform LLP Reset, MAY create Error
   Terminate Message           Async Event: "Bad Close"; MAY
                               allow examination of
                               Terminate Message, LLP
                               Disassociated from QP

   PostSQ/PostRQ with          Return an Immediate Error     Closing
   Immediate Error

   Modify QP                   Return an Immediate Error     Closing

   PostRQ without Immediate    Enqueue and flush             Closing
   Error

   PostSQ without Immediate    Enqueue & Flush, Perform LLP  Error
   Error                       Reset, Create Async Event
                               "Bad Close", LLP
                               Disassociated from QP.

                     Figure 11 - Closing State summary




   Hilland, et al.        Expires October 2003              [Page 61]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.3  Shared Receive Queue

   The Verbs support a Shared Receive Queue (S-RQ). Support for the
   Shared Receive Queue is OPTIONAL. The Query RNIC Verb MUST indicate
   whether the RNIC supports the Shared Receive Queue.

   A Shared Receive Queue is an RNIC resource which allows multiple RQs
   to retrieve WQEs from the same shared queue on an as needed basis.
   This allows a Consumer to post WRs to the S-RQ instead of the RQ.
   When a message arrives, the RI uses a WQE from the S-RQ and makes it
   appear as if the WQE has been copied from the S-RQ to the QP's RQ. A
   CQE for an incoming message which result in a WQE being consumed
   from an S-RQ MUST be posted to the CQ associated with the QP's RQ.

   The RI MUST return the maximum number of S-RQs supported by the RI
   as an output modifier of Query RNIC, and the value MUST be zero if
   the RI does not support S-RQs.

   The RI MUST return the maximum number of outstanding WRs on an S-RQ
   as an output modifier of Query RNIC, and the value MUST be zero if
   the RI does not support S-RQs.

   Each S-RQ MUST be associated with a single PD ID. Multiple S-RQs
   MUST be able to be associated with the same PD ID.

   The SQ of a QP associated with an S-RQ MUST operate no differently
   than the SQ of a QP which is not associated with an S-RQ.

   When using an S-RQ, the RI MUST allow Work Requests to be posted to
   the S-RQ and MUST NOT allow WRs to be posted to an RQ of a QP
   associated with the S-RQ.

   If the RI supports an S-RQ, then it MUST:

   *   support the Create S-RQ Verb (See Section 9.2.4.1),

   *   support the Query S-RQ Verb (See Section 9.2.4.2),

   *   support the Modify S-RQ Verb (See Section 9.2.4.3),

   *   support the Destroy S-RQ Verb (See Section 9.2.4.4),

   *   support the S-RQ Handle as an Input Modifier for Create QP (See
       Section 9.2.5.1), and

   *   support an S-RQ Limit Event and a QP RQ Limit Event (See Section
       6.3.8),

   *   support the S-RQ Handle as an Input Modifier for PostRQ,




   Hilland, et al.        Expires October 2003              [Page 62]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   support the S-RQ Handle as an Asynchronous Event Handler routine
       parameter.

6.3.1  Creating a Shared Receive Queue

   When the S-RQ is created, it MUST be associated with a PD ID, and
   the maximum number of WRs which can be posted at any time must be
   provided as an Input Modifier. Note that the number of WQEs on the
   S-RQ at any given moment is dependent upon the completion semantics
   described below.

6.3.2  Modifying a Shared Receive Queue

   The RI MAY allow the Consumer to change the maximum number of
   outstanding WRs on the S-RQ. If the RI supports the ability to
   change the number of outstanding WRs on a SQ and RQ, and the RI
   supports S-RQs, then it MUST:

   *   allow the maximum number of outstanding WRs on the S-RQ to be
       changed;

   *   allow the maximum number of outstanding WRs to be changed while
       WRs are still outstanding; and

   *   support the ability to change this on every S-RQ.

   It is understood that changing the number of WRs that an S-RQ may
   have outstanding MAY adversely affect performance. Resizing the S-RQ
   MUST NOT cause Immediate, Completion or Asynchronous Errors, with
   the exception of Immediate Errors returned by the Modify S-RQ Verb
   and possible LLP time-outs. It is expected that the resize operation
   MAY adversely affect the Associated QPs attempting to communicate
   with the QPs associated with the S-RQ during the resize operation
   possibly resulting in LLP time-outs and retries which could result
   in LLP Stream teardown (which would result in an Asynchronous
   Error). It is suggested that the Consumer only perform this resize
   operation when activity on the connections has been quiesced to
   minimize the risk of transitioning Associated QPs to the Error state
   as a result of LLP time-outs.

   If the number of requested outstanding WRs is smaller than the
   actual number of outstanding WRs currently on the S-RQ, then the
   modification of the S-RQ MUST fail with an Immediate Error and the
   S-RQ MUST remain in the original state.

6.3.3  Destroying a Shared Receive Queue

   The Verbs provide a Destroy S-RQ Verb to allow a Consumer to destroy
   an S-RQ that is no longer needed. The RI MUST only allow an S-RQ to
   be destroyed when all the QPs associated with that S-RQ have been



   Hilland, et al.        Expires October 2003              [Page 63]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   destroyed. The RI MUST allow an S-RQ to be destroyed when there are
   WRs still posted to the S-RQ. Note that it is recommended that a
   Consumer drain the S-RQ or track all WRs posted to the S-RQ before
   destroying it so that no WRs are lost. For example, a WR which was
   Posted to the S-RQ but which was never Completed would still be on
   the S-RQ when the S-RQ was destroyed so the Consumer would never be
   notified that the buffers associated with the WR were available
   again.

   After the Destroy S-RQ returns to the Consumer, the RI:

   *   MUST have freed all RI resources associated with Receive Work
       Requests that were not Completed and were posted on that S-RQ,
       and

   *   MUST ensure that it will no longer reference any Consumer
       resources associated with Receive Work Requests that were not
       Completed and were posted on that S-RQ.

6.3.4  Associating an S-RQ with a QP

   A Shared Receive Queue MUST only be associated with a QP when the QP
   is created. When the QP is created, the RI MUST ignore the maximum
   number of outstanding RQ WRs Input Modifier.

6.3.5  Shared Receive Queue Processing Model

   If a QP is associated with an S-RQ, the RI MUST allow WRs to be
   posted to the S-RQ using PostRQ, specifying the S-RQ Handle instead
   of the QP Handle. If the QP is associated with an S-RQ, the RI MUST
   NOT allow WRs to be posted to the Local RQ through PostRQ and MUST
   return an Immediate Error if Posting to the Local RQ is attempted by
   the Consumer.

   The RI MUST ensure that S-RQs follow the rules for Work Queues with
   respect to the posting rules and completion rules defined in Section
   8.2.1 - Submitting Work Request to a Work Queue and Section 8.2.3 -
   Completion Processing. This means the RI MUST prevent a Consumer
   from overflowing the S-RQ using the PostRQ.

   When an incoming Untagged Message arrives on a QP, the RI determines
   if the QP is associated with an S-RQ. If it is, the RI must make it
   appear as if the WQE has been dequeued from the S-RQ and queued to
   the QP's local RQ. This does not guarantee that the S-RQ WQE is
   free. The S-RQ WQE is considered to be part of the S-RQ until the
   Work Completion associated with the S-RQ WQE has been retrieved or
   the S-RQ is destroyed.

   The RI MAY dequeue or use the S-RQ WQEs in any order. Since the WQEs
   are in an implementation specific order, the Consumer should not



   Hilland, et al.        Expires October 2003              [Page 64]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   depend on S-RQ post order in any way. The RI should support one of
   the following two models: sequential order or arrival order.

   *   In sequential ordering, the RI dequeues S-RQ WQEs as messages
       arrive. If messages arrive out of order, in addition to
       dequeueing the WQE required to place the data for that message,
       the RI also dequeues a WQE for each message with an MSN lower
       than the out-of-order message that has not arrived and does not
       yet have an associated WQE.

   *   In arrival ordering, the RI dequeues S-RQ WQEs as the messages
       arrive. If messages arrive out of order, only the WQE required
       to place the out of order message will be dequeued from the S-
       RQ. WQEs required to place data for the messages with an MSN
       lower than the out of order message will be dequeued from the S-
       RQ when those messages arrive.

   The RI MUST Complete incoming Send Message Types in the order they
   were Posted to the Associated QP's Send Queue. This means Work
   Completions retrieved from the CQ for any individual QP will be
   retrieved only in Message Sequence Number (MSN) order (see [DDP] for
   details). The RI MUST dequeue only one WQE from the S-RQ to place
   any message represented by a single MSN. Note that the Work
   Completions are not necessarily in the order in which the Send
   Message Types arrived, nor in the order the WQEs were posted to the
   S-RQ, nor in the order the WQEs were dequeued from the S-RQ.

   When a Work Completion which represents a WR originally submitted to
   an S-RQ has been returned to the Consumer via the Poll for
   Completion Verb, the RI MUST allow the Consumer to be able to post
   another Work Request to the S-RQ immediately.

   All QPs that use an S-RQ MUST be able to consume S-RQ WQEs, as long
   as the S-RQ has unconsumed WQEs available. If there are no S-RQ WQEs
   when an Untagged Message arrives on a QP which is associated with
   that S-RQ, then the LLP Stream MAY be Terminated. If the LLP Stream
   is not terminated, the reader should see Section 13.2 - Graceful
   Receive Overflow Handling for one implementation option.

   Protection Domain checking rules are slightly different for an S-RQ.
   An S-RQ MUST have a PD ID assigned as an Input Modifier for Create
   S-RQ. When an Untagged Message arrives and the QP has been
   determined to use an S-RQ for its incoming Untagged Message WQEs,
   then the PD ID of the STags in the WQEs MUST be validated against
   the PD ID of the S-RQ and MUST NOT be validated against the PD ID of
   the QP.

   Note that due to the Protection Domain checking rule above, the
   Consumer will not be able to invalidate an STag used by the S-RQ
   unless the S-RQ's PD is the same as the QP's PD, even if the QP uses



   Hilland, et al.        Expires October 2003              [Page 65]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   the S-RQ. This is because the PD used for comparison in Invalidation
   operations is that of the QP, not the S-RQ.

   The use of the STag of zero as part of a SGE in a WR MUST be
   validated by the RI based on the QP's attribute which indicates if
   it is allowed on the QP. If use of the STag of zero is not permitted
   on the QP and a WQE referencing STag zero is processed on the QP,
   the RI MUST return a Completion Error. Consequently, if the Consumer
   uses the STag of zero in S-RQ Work Requests and the S-RQ is accessed
   by QPs that have the use of STag of zero enabled as well as QPs that
   do not have the use of STag of zero enabled, then the QPs that do
   not have the use of STag of zero enabled will transition to the
   Error state as soon as they retrieve a WQE which contains an STag of
   zero.

6.3.6  S-RQ Error Semantics

   All errors encountered MUST be reported through Work Completions
   where possible. This is due to the semantic requiring the WQE to
   appear as if it had been on the QP's RQ. The exception is that a
   catastrophic S-RQ error MUST be reported as an Affiliated
   Asynchronous Error.

   Errors related to a connection for a QP associated with an S-RQ MUST
   NOT affect the S-RQ. Any WQEs already consumed by the QP from the S-
   RQ will be completed in error or flushed in the case of an LLP
   Stream error. Any other QPs associated with the S-RQ MUST remain
   unaffected by a local QP error.

   Errors related to a Work Request on an S-RQ will be posted to the CQ
   associated with the QP's RQ if they are processing errors, or
   returned as Verb results if they are Immediate Errors.

   In the case of a catastrophic S-RQ failure, any QP associated with
   the S-RQ will transition to the Terminate state when the QP attempts
   to dequeue a WQE from the S-RQ when handling an incoming Send Type
   Message. The resource ID returned by the Asynchronous Event Handler
   MUST be the QP ID. All outstanding WQEs on the QP will be flushed
   and an Affiliated Asynchronous Event: "S-RQ error on a QP" MUST be
   generated as part of the Terminate state transition.

   The RI MUST NOT flush the WQEs on an S-RQ which have not been used
   to Place incoming Untagged Messages when any associated QP
   transitions to the Terminate, Error or Closing states.

6.3.7  S-RQ Resource Sizing

   The Consumer is responsible for sizing the S-RQ and the CQs
   associated with the QP's RQs appropriately. The RI MUST ignore the
   sizing information provided for the QP's RQ when the QP uses an S-



   Hilland, et al.        Expires October 2003              [Page 66]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   RQ. The Consumer should note this fact when invoking the Create QP
   Verb using an S-RQ handle. In addition, S-RQs are subject to the
   Completion confirmation rules defined in Section 8.2.3 - Completion
   Processing. This means that the WR MUST be considered to be in the
   scope of the RI, and thus using a WQE on the S-RQ until the Work
   Completion has been retrieved. In addition, the RI MUST allow any
   single RQ to utilize all of the WQEs posted to an S-RQ . Note also
   that the RI is not required to perform CQ overflow detection.

   The RQ size Input Modifier is not used when a QP is associated with
   an S-RQ. In this case, the RQ has no defined size. It can be up to
   the size of the S-RQ. If the S-RQ is resized, any QP MUST be able to
   utilize all of the WQEs posted to the S-RQ. It is up to the
   implementation to process multiple messages in progress at one time.
   Note that the number of messages that can be in progress at once is
   limited by the S-RQ size, the LLP receive window, and possibly other
   factors.

6.3.8  S-RQ Limit Checking

   An RI that supports the S-RQ MUST support an S-RQ Limit
   Notification. An RI that supports S-RQ MUST support an S-RQ Limit
   input modifier on the Create S-RQ and Modify S-RQ Verbs to establish
   the value of the Limit. The S-RQ Limit detection MUST be armed by
   the RI upon creation of the S-RQ, if non-zero. This is only used for
   generation of the Affiliated Asynchronous Event and MUST NOT
   otherwise disrupt the QP operation. When the number of available (or
   unused) WQEs posted to the S-RQ drops below the S-RQ Limit, the RI
   MUST generate an Asynchronous Event and provide the S-RQ Handle as
   the Resource ID. This event will only be triggered once after it is
   armed and will not generate another event until the Consumer re-arms
   the event. The RI MUST allow the Consumer to re-arm this event
   through the use of Modify S-RQ. The RI MUST arm this event when the
   S-RQ is created if the S-RQ Limit is greater than zero. The RI MUST
   allow an already armed S-RQ Limit to be armed again. If the S-RQ
   Limit is armed for an S-RQ and the maximum number of outstanding WRs
   on the S-RQ is modified below S-RQ Limit, then the RI MUST return an
   Immediate Error indicating that an invalid Input Modifier was
   provided.

   An RI that supports the S-RQ MUST support a QP RQ Limit Notification
   for QPs associated with an S-RQ. The QP RQ Limit detection MUST be
   armed by the RI upon creation of the QP, if non-zero. The Consumer
   specifies the QP RQ Limit as part of either Create QP or Modify QP.
   This is only used for generation of the Affiliated Asynchronous
   Event and MUST NOT otherwise disrupt the QP operation. When the
   number of messages in progress on the QP (which is defined as
   messages being Placed, and thus have WQEs associated with them, but
   which have not yet had CQEs generated for the WQEs and thus have not
   been Delivered to the Consumer) exceeds the QP's RQ Limit, the RI



   Hilland, et al.        Expires October 2003              [Page 67]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   MUST generate an Asynchronous Event and provide the QP ID as the
   Resource ID. This event will only be triggered once after it is
   armed and will not generate another event until the Consumer re-arms
   the event. The RI MUST allow the Consumer to re-arm this event
   through the use of Modify QP. The RI MUST arm this event when the QP
   is created if the QP's RQ Limit is greater than zero. The RI MUST
   allow an already armed S-RQ Limit to be armed again. If the S-RQ
   Limit specified in the Create S-RQ or Modify S-RQ is greater than
   the maximum number of outstanding WRs on the S-RQ, then the RI MUST
   return an Immediate Error indicating that an invalid Input Modifier
   was provided.

   Note that neither Limit Notification forces Work Completions to be
   retrieved by the Consumer. Only retrieving the Work Completions
   allows the Consumer to Post additional WQEs to the S-RQ.
   Consequently, if separate Consumers are allowed to share an S-RQ,
   then one Consumer could consume all or part of the S-RQ entries if
   it does not retrieve Work Completions.

6.4  Stopping QP processing and Sending the Terminate Message

   Certain conditions require that QP operations be stopped, and a
   final Terminate Message be sent. Stopping WR processing on the QP
   and transmission of a Terminate Message are associated with QP state
   changes; the specific QP state transitions that require this are
   described in Section 6.2 - Queue Pair Resource States. When a QP
   must be stopped, either by a Modify QP Verb, or by QP state change
   due to an error, the following notes apply:

   1.  For Errors that do not impact the integrity of an outbound DDP
       Segment or for Modify QP Verb invocations that require stopping
       the QP, outbound processing MUST be stopped only on DDP Segment
       boundaries, in the absence of LLP errors. Any Terminate Message
       (if required) MUST be filled out as described in [RDMAP] and
       MUST be sent after the last complete outbound DDP Segment.

       For Errors that impact the integrity of an outbound DDP Segment
       that require stopping the QP:

       o   If the RI has not begun sending the DDP Segment, then
           outbound processing MUST be stopped before the DDP Segment
           is sent; and the Terminate Message and error code MUST be
           sent instead of the erroneous DDP Segment.

       o   If the RI has begun sending the DDP Segment, then outbound
           processing MUST be stopped immediately on the byte that
           experienced the error and the LLP Stream MUST be Reset.

   2.  For Errors or Modify QP Verbs (except for RTS to Closing
       transitions) that require stopping the QP, the RI MUST cease to



   Hilland, et al.        Expires October 2003              [Page 68]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       process inbound DDP Segments, at least by the time that any
       currently in-process DDP Segment has completed processing.

       The semantics of stopping QP processing and handling incoming
       DDP segments for Modify QP Verbs that require the transition
       from RTS to Closing are discussed at length in Section 6.2.5.

       Subsequent inbound DDP Segments (if any) are ignored and any
       inbound DDP Segments that have been Placed but not Delivered are
       never Delivered.

   3.  For Modify QP Verbs that require stopping the QP, the RI SHOULD
       stop outbound QP processing prior to sending any current DDP
       Segment to the LLP and MUST stop outbound QP processing at least
       by the time that any currently in-process outbound message has
       completed processing.

   4.  For Errors detected while creating RDMA Write, Send Type, or
       RDMA Read Type Work Requests, the RI MUST stop outbound QP
       processing prior to sending the current DDP Segment to the LLP.
       The Terminate Message and Error code MUST be sent instead of the
       original message (or DDP Segment). In this case, the [RDMAP]
       Terminate Message's Terminate Control Field is set to represent
       RDMA and the Error Type is set to represent Local Catastrophic
       Error.

   5.  For Errors detected while creating RDMA Read Responses to a
       Remote RDMA Read Operation, the RI MUST stop outbound QP
       processing prior to sending the erroneous DDP Segment to the
       LLP. The Terminate Message and Error code are sent instead of
       the erroneous RDMA Read Response Message.

   6.  For Errors detected while creating CQEs, or other reasons not
       directly associated with creating an outbound DDP Segment, the
       RI SHOULD stop outbound QP processing prior to sending any
       current DDP Segment to the LLP and MUST stop outbound QP
       processing at least by the time that any currently in-process
       outbound DDP message has completed processing. In this case,
       [RDMAP] Terminate Message's Terminate Control Field's Header
       Control Bits are all zero.

   7.  If an error is detected by an iWARP implementation while an
       incoming DDP Segment data is being Placed, the error actions
       (changing state, stopping the QP, etc.) MUST be delayed until
       after the segment is actually delivered by the LLP. If more than
       one error is detected on incoming segments, then the first DDP
       Segment Delivered with a detected error MUST result in the error
       actions. The first detected error MAY have been detected by the
       LLP, DDP Layer, or RDMA Layer. If, while waiting for Delivery of
       an incoming segment that contains an error, another error is



   Hilland, et al.        Expires October 2003              [Page 69]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       detected that is not associated with incoming segments (for
       example, an LLP error, Send Queue or RDMA Read Response
       processing error), then the RI MUST perform the actions for that
       error without waiting for Delivery of any other segments.

   8.  For errors detected on incoming DDP Segments (after they have
       been Delivered by the LLP), the Terminate Message MUST include a
       copy of the iWARP header from the DDP Segment in error (see
       [RDMA]).

   Below, in Figure 12, is a table which should indicate the values for
   the fields in the Terminate Control Field of the Terminate Message
   in [RDMAP].

                         Layer EType   Error HdrCt DDP   Term  Term
                                       Code        Seg.  DDP   RDMA
                                                   Lgt   Hdr.  Hdr.

For Modify QP from RTS   No Terminate Message is sent.
to Error

For Modify QP from RTS   RDMA  Local   None  000b  All   All   All
to Terminate             (0x)  Catast. (0x)        zeros zeros zeros
                                (0x)

For Errors detected      RDMA  Local   None  000b  All   All   All
while creating RDMA      (0x)  Catast. (0x)        zeros zeros zeros
Write, Send Type, or            (0x)
RDMA Read Request
Messages

For Errors detected      RDMA  Local   None  000b  All   All   All
while creating           (0x)  Catast. (0x)        zeros zeros zeros
completions, or other           (0x)
reasons not directly
associated with
creating an outbound
DDP Segment

For Errors detected      Depends on error, see [RDMAP] specification
processing or Placing    and/or Sections 8.3.2 & 8.3.3.
incoming Send Type,
RDMA Write, RDMA Read
Request or RDMA Read
Response Messages







   Hilland, et al.        Expires October 2003              [Page 70]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

                         Layer EType   Error HdrCt DDP   Term  Term
                                       Code        Seg.  DDP   RDMA
                                                   Lgt   Hdr.  Hdr.

For Errors detected      Depends on error, see [RDMAP] specification
while creating RDMA      and/or Sections 8.3.2 & 8.3.3.
Read Response Messages

For LLP layer errors     No Terminate Message is sent.
detected by an iWARP
implementation (e.g.
incoming LLP Reset
while QP in RTS)

Incoming Terminate Msg   No Terminate Message is sent.


                 Figure 12- Terminate Control Field Values

6.5  Outstanding RDMA Read Resource Management

   RDMA allows multiple RDMA Read Request Messages to be outstanding on
   a single LLP Stream. To enable this feature, the RNIC provides
   resources associated with both the inbound and outbound stream. For
   each outbound RDMA Read Request Message, the RNIC has some resources
   to track the request until a local Completion occurs. Similarly, for
   each inbound RDMA Read Request Message, the RNIC has an Inbound RDMA
   Read Request Queue (IRRQ) (associated with the DDP Queue Number of
   1) to store the state of the request until it has been satisfied by
   sending all of the requested data in the RDMA Read Response Message.
   The Input Modifier that specifies this value is called the Inbound
   RDMA Read Queue Depth (IRD).

   The Outbound RDMA Read Queue Depth (ORD) is the allocated number of
   outstanding RDMA Read Request Messages the RNIC is allowed to have
   outstanding at the Data Sink of an RDMA Read Operation. This is the
   resource used to track the request until a local Completion occurs.

   The Inbound RDMA Read Queue Depth (IRD) is the allocated number of
   incoming RDMA Read Request Messages a QP can support at the Data
   Source for an RDMA Read Operation. This is the resource used to
   track inbound RDMA Read Request Messages.

   An RNIC MUST implement these resources as either per QP resources,
   or shared per RNIC resources. Per QP means that the resources are
   tied to the QP and are most likely part of the QP Context. Per RNIC
   resources implies that the RNIC has a pool of such resources
   internally that it assigns to the QP based on the values of IRD and
   ORD associated with the QP.



   Hilland, et al.        Expires October 2003              [Page 71]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Query RNIC MUST return the type of resources the specified RNIC
   supports. The results are returned in the following Output Modifiers
   for Query RNIC:

   *   The maximum number of Inbound RDMA Read Request Queue messages
       that can be outstanding per RNIC. This is the per RNIC parameter
       that corresponds to IRD. This value is Zero if the resources for
       handling Inbound RDMA Read Requests are not shared between QPs.

   *   The maximum number of Outbound RDMA Read Request messages that
       can be outstanding per RNIC. This is the per RNIC parameter that
       corresponds to ORD. This value is Zero if the outstanding RDMA
       Read Requests are not shared between QPs.

   *   The maximum number of inbound RDMA Read Request Messages that
       the Inbound RDMA Read Request Queue can store per QP
       (corresponds to IRD).

   *   The maximum number of outbound RDMA Read Request Messages that
       can be outstanding per QP (corresponds to ORD).

   The Consumer is responsible for setting the RDMA Read Data Sink QP's
   ORD so that it does not exceed the Associated QP's IRD at the Data
   Source.

   If the Consumer attempts to set IRD or ORD to one or greater, and
   there are not enough resources to allow this, the Create QP or
   Modify QP Verb MUST fail with an Immediate Error. This can happen
   because the maximum amount of IRD/ORD resources returned by Query
   RNIC MAY be affected by consumption of unrelated resources, so that
   not all of the reported resources may actually be available
   simultaneously.

   If the IRD and ORD resources are not shared between QPs (e.g. fixed
   per QP instead of allocated out of a pool for the RNIC), then the
   ULP need only negotiate the values for IRD and ORD. But if the IRD
   and ORD resources are shared across the RNIC, then some function of
   the Consumer or Consumer's environment (such as a resource manager)
   must determine how to allocate the resources among the QPs in
   addition to negotiating the IRD and ORD values.

   The RNIC MUST ensure that it does not issue more RDMA Read Request
   Messages than is specified by the QP's ORD value. However, the RI
   MUST allow the Consumer to post as many RDMA Read Type Work Requests
   as it can, within the limit of the total Work Requests the Send
   Queue can support. The RI MUST delay processing of an RDMA Read Type
   Work Request posted to the SQ which would result in exceeding the
   QP's ORD value until a prior RDMA Read Type Work Request Completes.





   Hilland, et al.        Expires October 2003              [Page 72]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The rules in Section 8.2.2 enable subsequent Work Requests to be
   executed before the RDMA Read Type Work Request Completes. If
   however, a delay in processing occurs due to waiting for a prior
   RDMA Read Type Work Completion, this will effectively prevent
   subsequent Work Requests from being executed until the delay is over
   (i.e. stall Send Queue processing). If the Consumer wants to avoid
   this type of delay in Send Queue processing, it can issue up to as
   many RDMA Read Work Requests as supported by the value of ORD for
   that QP, and when each one Completes, then add an additional RDMA
   Read Type Work Request.

   The Consumer should manage the number of RDMA Read Request Messages
   outstanding, either by correctly setting the QP's ORD value to be
   less than or equal to the Associated QP's IRD value, or by limiting
   the number of RDMA Read Type Work Requests the Consumer posts on the
   Send Queue at any one time to be less than or equal to the
   Associated QP's IRD value. If this is not done correctly, the Local
   Peer may attempt to send more RDMA Read Request Messages than the
   Remote Peer can accept, which will result in an error from the
   Remote Peer that Terminates the RDMAP Stream (See Section 6.6.2.4 -
   Remote Termination).

   The RDMA Read Resources (IRD and ORD) MUST be initialized at QP
   creation (Create QP). The RDMA Read Resources MAY be changed while
   the QP is in Idle state and when the QP is in the RTS state. If the
   Consumer changes the resources while the QP is in the RTS state, the
   Consumer should ensure that no RDMA Read Operations are outstanding
   for the affected direction (outbound for ORD, inbound for IRD). If
   the Consumer modifies the RDMA Read Resources when RDMA Read
   Operations are outstanding, the QP state MAY be indeterminate and
   the RI MUST NOT adversely affect any other QPs supported by the RI.
   Changing RDMA Read Resources when RDMA Read Operations are not
   outstanding is easily done if IRD and ORD are set before any RDMA
   Read Work Requests are posted by either Peer. If RDMA Read Work
   Requests have already been posted, it is up to the Consumer to
   ensure that they have all Completed before changing IRD or ORD or
   the QP may be in an indeterminate state.

   The following semantics are required of the RI:

   *   All RNICs MUST allow the Consumer to reduce the ORD in the IDLE
       and RTS states.

   *   It is OPTIONAL for an RI to allow the Consumer to increase IRD
       or ORD after the QP has been created.

   *   It is OPTIONAL for an RI to accept reductions of IRD from the
       Consumer after the QP has been created.





   Hilland, et al.        Expires October 2003              [Page 73]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The RNIC MUST support a total number of inbound RDMA Read
       Request Messages and outbound RDMA Read Request Messages, so
       that each is at least equal to the total number of QPs supported
       by the RNIC. The RNIC thus MUST be able to support at least
       IRD=1 and ORD=1 for each QP.

   *   RNICs that implement shared "per RNIC" RDMA Read Resources for
       IRD and ORD, MUST have enough so that all of the QPs can be
       assigned a value of one for IRD and one for ORD. It is up to the
       resource manager to allocate these resources fairly, so that
       applications that need RDMA Read Resources can be assured of
       their availability.

   Note that the maximum amount of resources returned by Query RNIC may
   be adversely affected by consumption of unrelated resources, so that
   not all of the reported number may actually be available
   simultaneously.

   If the Consumer attempts to set either IRD or ORD to one or greater,
   and there are not enough resources to allow this, the Create QP or
   Modify QP Verb MUST fail with an Immediate Error.

   Note that when using "per RNIC" resources, the Create or Modify QP
   IRD and ORD values are also limited by the "per QP" resources.

6.5.1  Example IRD/ORD Negotiation

   The example in Figure 13 shows one possible negotiation for a single
   direction (if the ULP uses RDMA Read Operations in both directions
   on the RDMA Stream, it must also do the same thing in reverse). Note
   that the last step may be omitted if the ULP is not interested in
   reducing the resources used at the left side of the connection when
   the right side supports less.




















   Hilland, et al.        Expires October 2003              [Page 74]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003








             < Figure 13 did not convert properly from source >
             <  to be corrected in an upcoming version        >






           Figure 13 - An example RDMA Read Resource negotiation

6.6  Connection Management

6.6.1  Connection Initialization

   RDMA Stream initialization can occur as the transport connection is
   created or sometime thereafter. In the latter case, the connection
   may require a ULP supplied end-to-end handshake before iWARP is
   initialized. Either the active or passive side of the connection may
   initiate turning on iWARP.

   In either case, the ULP must know, before iWARP mode is to begin,
   which model of operation is to be used by the ULP.

   An RI MUST support RDMA Stream initialization sometime after the
   transport connection is established and some streaming mode data has
   been sent.

   An RI MAY support RDMA Stream startup along with the transport
   connection, with no streaming mode data sent. This option is more
   completely described in Section 13.1 - Connection Initialization at
   LLP Startup.

   Once iWARP initialization is complete, the RI MUST allow only iWARP
   messages to be sent across the LLP connection until the RDMA Stream
   is torn down.





   Hilland, et al.        Expires October 2003              [Page 75]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Section 6.6.1.1 and 6.6.1.2 provide informative examples of methods
   for the ULP to transition to RDMA mode. Other implementations are
   possible.

6.6.1.1  Active Connection Initialization after LLP Startup

   For this discussion, the Active side goes to iWARP mode first. In
   the figures below, the thin lines represent TCP Streaming mode and
   the thick lines represent iWARP mode.







             < Figure 14 did not convert properly from source >
             <  to be corrected in an upcoming version        >

















          Figure 14 - Connection Initialization after LLP Startup

   Below is the sequence for an active side iWARP startup. Note that
   the dotted line arrows above indicate messages that may not be
   needed for some implementations.

   1.  The ULP establishes the LLP Connection and LLP Stream.

   2.  The active side ULP ensures that the passive side is able to
       enter iWARP mode via some negotiation or other mechanism, which
       is outside the scope of this specification.




   Hilland, et al.        Expires October 2003              [Page 76]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   3.  The active side Consumer creates a QP, setting up the CQ, PD
       etc., and registers memory for buffers. Note that in some
       instances, this may have been done at some previous time during
       the initialization process.

   4.  The active side Consumer posts receive buffers via PostRQ that
       are appropriate for the expected traffic. A first message may
       arrive quickly after the transition to RTS.

   5.  The active side Consumer moves the QP to the RTS state. The
       Consumer includes the LLP Stream Handle in the Modify QP Verb,
       and a single message buffer which contains the last streaming
       mode message to be sent to the Remote Peer. The RI uses the
       presence of this message buffer to recognize the Active startup
       sequence. For information on implementing this state transition,
       see Section 6.2.1.2 - Idle to RTS.

   6.  When the active side Consumer receives the first RDMA/DDP
       Message from the passive side (e.g. a Send type message), the
       active side Consumer is free to post additional Work Requests to
       the Send Queue. The active side Consumer should not have posted
       any SQ WRs while the QP was in the Idle state, or while the QP
       is in the RTS state. The active side consumer should not post
       any SQ WRs until the first RDMA/DDP Message is received. If the
       Consumer posts SQ WRs during either of these times, the Remote
       Peer is likely to improperly synchronize to the LLP Stream and
       to Terminate the LLP Stream. One way that the Consumer can
       determine that the message arrives is to have the initial
       message sent from the Associated QP have the Solicited Event bit
       set, thus generating an event at the Local Peer.

   7.  If the local Consumer intends to perform RDMA Read Operations,
       the local Consumer obtains, by some ULP defined message, the
       number of Incoming RDMA Read Request Messages that the Remote
       Peer can have outstanding (IRD). If the Remote Peer's IRD is
       smaller than the local Peer's ORD, the local Consumer should
       also perform a Modify QP Verb with the Remote Peer's IRD placed
       into the local ORD prior to posting the first RDMA Read Type WR.
       The local Consumer may also transmit, in some ULP defined
       message, the number of Outbound RDMA Read Request Messages that
       the Local Peer can have outstanding (ORD).

   8.  If the local ULP intends the QP to be a target of RDMA Read
       Operations, the local Consumer provides, in some ULP defined
       mechanism, the number of Inbound RDMA Read Request Messages that
       the Local Peer can have outstanding (IRD). The Consumer may also
       receive, by some ULP defined mechanism, the Number of Outbound
       RDMA Read Request Messages that the Remote Peer can have
       outstanding (ORD). If the Remote Peer's ORD is smaller than the
       Local Peer's IRD and the Local RNIC supports IRD reduction, the



   Hilland, et al.        Expires October 2003              [Page 77]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       local Consumer could perform a Modify QP Verb with the Remote
       Peer's ORD placed into the local IRD prior to posting the first
       RDMA Read Type WR.

6.6.1.2  Passive Connection Initialization after LLP Startup

   Below is the sequence for a passive side iWARP startup:

   1.  The passive side ULP establishes the LLP Connection and LLP
       Stream.

   2.  The passive side ULP informs the active side that it is able to
       enter iWARP mode via some negotiation.

   3.  The passive side ULP waits for the Active side to send a last
       streaming mode message to indicate that it should enter RDMA
       mode and that the remote node is in RDMA mode. When that message
       arrives, and if it indicates that iWARP mode is desired, the
       passive side Consumer continues with the items below.

   4.  The passive side Consumer creates a QP, setting up the CQ, PD
       etc. Note that this may have been done previously.

   5.  The passive side Consumer posts receive buffers appropriate for
       the expected traffic to the RQ.

   6.  The passive side Consumer posts at least one Send type Work
       Request that is used by the active side to complete the
       negotiation. The WR may contain any data that the ULP needs to
       communicate.

       Note: the passive side Consumer may delay the posting of buffers
       and Work Requests until after the transition to RTS, described
       below.

   7.  The passive side Consumer moves the QP to RTS state, specifying
       the LLP Stream Handle. The passive side Consumer does not
       include a last streaming mode message buffer in the Modify QP
       Verb; if it does, the Remote Peer is likely to improperly
       synchronize to the RDMA Stream and be forced to terminate the
       LLP Stream.

   8.  The passive side Consumer may now begin posting additional Work
       Requests.

   9.  If the local Consumer intends to perform RDMA Read Operations,
       the local Consumer obtains, in some ULP defined message, the
       number of incoming RDMA Read Request Messages that the Remote
       Peer can have outstanding (IRD). If the Remote Peer's IRD is
       smaller than the local Peer's ORD, the local Consumer should



   Hilland, et al.        Expires October 2003              [Page 78]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       also perform a Modify QP Verb with the Remote Peer's IRD placed
       into the local ORD prior to posting the first RDMA Read Type WR.
       The local Consumer may also transmit, in some ULP defined
       message, the number of outgoing RDMA Read Request Messages that
       the Local Peer can have outstanding (ORD).

   10. If the local Consumer intends the QP to be a target of RDMA Read
       Operations, the Consumer provides, in some ULP defined message,
       the number of incoming RDMA Read Request Messages that the Local
       Peer can have outstanding (IRD). The Consumer may also receive,
       in some ULP defined message, the number of outgoing RDMA Read
       Request Messages that the Remote Peer can have outstanding
       (ORD). If the Remote Peer's ORD is smaller than the Local Peer's
       IRD, the local Consumer may also perform a Modify QP Verb with
       the Remote Peer's ORD value placed into the local IRD prior to
       posting the first RDMA Read Type WR, if the RI supports IRD
       reduction.

6.6.2  Connection Teardown

   Five types of iWARP and LLP connection teardown mechanisms are
   supported:

   *   A normal close is an LLP Close that finishes with no errors (see
       Section 6.2.5 - Closing State, for a list of possible errors).
       This is used when the Consumers on both sides of the connection
       have sent their last message and wish to close the LLP Stream
       (see Section 6.6.2.1 - Normal Close).

   *   A ULP initiated Termination is used when the ULP desires to
       perform an LLP Close with an error message to the Associated QP
       (see Section 6.6.2.2 - ULP Initiated Termination).

   *   A ULP initiated Abortive Teardown is used when the ULP wishes to
       perform an LLP Reset with no error message to the Associated QP
       (see Section 6.6.2.3 - ULP Initiated Abortive Teardown).

   *   Remote Termination occurs when the RI receives a Terminate
       Message from the Associated QP, and the LLP Close process has
       begun (see Section 6.6.2.4 - Remote Termination).

   *   Local Termination, Local Abortive Teardown and Remote Abortive
       Teardown occur when the RI or LLP Stream detects an error and a
       Terminate Message is sent prior to an LLP Close or an LLP Reset
       is initiated (see Section 6.6.2.5 - Local Termination, Local
       Abortive Teardown and Remote Abortive Teardown).

   Sections 6.6.2.1 through 6.6.2.5 provide informative examples of
   methods for the ULP to terminate an RDMA Stream. Other
   implementations are possible.



   Hilland, et al.        Expires October 2003              [Page 79]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

6.6.2.1  Normal Close

   A normal close is provided as a mechanism for the ULP to cease
   activity, flush any receive buffers that have been posted to the RQ,
   and disassociate the LLP Stream from the QP. It requires that no
   errors occur during the close process. If an error occurs, it is now
   an abnormal close, which would cause the QP to transition to the
   Error state.

   The Consumer initiates a normal close, either locally or remotely,
   when both sides of a LLP Stream agree to the close.

   When the Consumer desires a normal close, the following items must
   be done:

   1.  The Consumer waits for all outstanding Work Requests on the Send
       Queues on both sides of the LLP Stream to be Completed. Note:
       the Completion on the remote WQ can be inferred by the arrival
       of a SEND message from the ULP that indicates that it intends to
       do no more work.

   2.  One of the Consumers moves the QP state to Closing with the
       Modify QP Verb, resulting in the following actions:

       o   If any WQEs are present on the Send Queue, or if any RDMA
           Read Operations are incomplete on the IRRQ, an error will
           result (for more information, see Section 6.2.5 - Closing
           State).

       o   The RI stops QP processing and flushes all incomplete WQEs
           on the Receive Queue by Completing them with the Flushed
           Completion Status.

       o   The RI performs an LLP Close. If this QP was using the last
           LLP Stream on the LLP Connection, the RI closes the LLP
           Connection.

       o   When the LLP Close actions are complete, the RI
           automatically moves the QP to the Idle state and an
           Affiliated Asynchronous Event: "LLP Close Complete" is
           created.

   3.  The Consumer may re-use the QP for a new LLP Stream or it may
       destroy the QP (see Section 6.1.3 - Modifying Queue Pair
       Attributes and Section 6.1.4 - Destroying a Queue Pair).

   The normal close may also be initiated remotely (e.g. for TCP a FIN
   segment is received). If the Send Queue is empty and the IRRQ is
   empty, the RI moves the QP state to the Closing state and an




   Hilland, et al.        Expires October 2003              [Page 80]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Asynchronous Event: "LLP Close Complete" will be generated. If this
   is the last LLP Stream, the LLP Connection will be closed.








             < Figure 15 did not convert properly from source >
             <  to be corrected in an upcoming version        >













                      Figure 15 - Normal Close on TCP

6.6.2.2  ULP Initiated Termination

   A ULP initiated Termination is usually used when the Consumer (such
   as the OS) detects an error. The ULP needs to perform an LLP Close,
   but would like to let the Remote Peer know that an error occurred.
   Note that an ULP initiated termination may entail loss of data.

   When the ULP desires a ULP initiated Termination, the following
   items must be done:

   1.  The Consumer modifies the QP to the Terminate state.

       o   Before returning from the Modify QP -> Terminate, the RI
           stops QP processing, formats a Terminate Message containing
           the termination code: "Local Catastrophic Error" and sends
           it to the Remote Peer.

       o   The RI performs an LLP Close. If the LLP cannot deliver the
           Terminate Message, an LLP Reset is performed, and the RI
           generates an Asynchronous Error Event: "Bad Close".



   Hilland, et al.        Expires October 2003              [Page 81]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   2.  After returning from the Modify QP -> Terminate, the Consumer
       waits for the QP to automatically be moved to the Error state.
       This is signaled by an Asynchronous Error Event: "Error State
       Entered".

   3.  Once in the Error state, the RI flushes all incomplete WQEs on
       both the Send and Receive Queues by completing them with the
       Flushed Completion Status. The Consumer would presumably reap
       all of the Work Completions to ensure all resources are cleaned
       up. Once the Consumer believes all Work Completions have been
       reaped, it should attempt to transition the QP to the Idle state
       by performing a Modify QP. If the transition is successful, the
       Consumer knows it can either re-use the QP for another LLP
       Stream or call Destroy QP (see Section 6.1.3 - Modifying Queue
       Pair Attributes and Section 6.1.4 - Destroying a Queue Pair). If
       the Modify QP returns with an error (presumably because Work
       Requests are still being flushed), the Consumer must try at a
       later time to transition to the Idle state. The Consumer might
       arm a timeout. If the Consumer is unable to transition to the
       Idle state after some amount of time, it should destroy the QP
       (presumably because the QP can not recover from an internal
       error).

6.6.2.3  ULP Initiated Abortive Teardown

   A ULP initiated Abortive Teardown is usually used when the Consumer
   (such as the OS) detects an error, and the ULP needs to tear down
   the entire LLP Stream immediately (i.e. perform an LLP Reset). Note
   that a ULP initiated abortive teardown may entail loss of data.

   When the ULP desires an Abnormal ULP initiated Abortive Teardown,
   the following items must be done:

   1.  The Consumer modifies the QP to the Error state.

       o   The RI stops QP processing and performs an LLP Reset.

   2.  Once in the Error state, the RI flushes all incomplete WQEs on
       both the Send and Receive Queues by completing them with the
       Flushed Completion Status. The Consumer would presumably reap
       all of the Work Completions to ensure all resources are cleaned
       up. Once the Consumer believes all Work Completions have been
       reaped, it should attempt to transition the QP to the Idle state
       by performing a Modify QP. If the transition is successful, the
       Consumer knows it can either re-use the QP for another LLP
       Stream or it can invoke Destroy QP (see Section 6.1.3 -
       Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
       Queue Pair). If the Modify QP returns with an error (presumably
       because Work Requests are still being flushed), the Consumer
       must try at a later time to transition to the Idle state. The



   Hilland, et al.        Expires October 2003              [Page 82]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       Consumer might arm a timeout. If the Consumer is unable to
       transition to the Idle state after some amount of time, it
       should destroy the QP (presumably because the QP can not recover
       from an internal error).

6.6.2.4  Remote Termination

   Remote Termination occurs when the Associated QP sends a Terminate
   Message to the Local Peer. Note that remote termination may entail
   loss of data.

   When the Remote Peer sends a Terminate Message, and it is locally
   received, the following sequence occurs:

   1.  The RI stops QP processing.

   2.  The RI moves the QP automatically to the Terminate state. The RI
       then generates an Asynchronous Error Event: "Terminate Message
       Received".

   3.  The RI performs an LLP Close, or if an LLP final timeout occurs,
       an LLP Reset.

   4.  The RI moves the QP to the Error state.

   5.  Once in the Error state, the RI flushes all incomplete WQEs on
       both the Send and Receive Queues by completing them with the
       Flushed Completion Status. The Consumer would presumably reap
       all of the Work Completions to ensure all resources are cleaned
       up. Once the Consumer believes all Work Completions have been
       reaped, it should attempt to transition the QP to the Idle state
       by performing a Modify QP. If the transition is successful, the
       Consumer knows it can either re-use the QP for another LLP
       Stream or it can invoke Destroy QP (see Section 6.1.3 -
       Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
       Queue Pair). If the Modify QP returns with an error (presumably
       because Work Requests are still being flushed), the Consumer
       must try at a later time to transition to the Idle state. The
       Consumer might arm a timeout. If the Consumer is unable to
       transition to the Idle state after some amount of time, it
       should destroy the QP (presumably because the QP can not recover
       from an internal error).

6.6.2.5  Local Termination, Local Abortive Teardown and Remote Abortive
         Teardown

   iWARP defines an abortive teardown mechanism which is invoked if a
   catastrophic iWARP error is encountered locally. iWARP attempts to
   send a Terminate Message, but depending upon the condition of the
   LLP, it is possible a Terminate Message can not be sent or can not



   Hilland, et al.        Expires October 2003              [Page 83]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   be successfully delivered to the Associated QP. If an LLP Stream
   error occurs, it is possible for the LLP Stream or LLP Connection to
   be torn down before a) iWARP is aware of the error, b) before iWARP
   is able to send the Terminate Message, or c) after iWARP has posted
   the Terminate Message to the LLP, but it is still in the LLP send
   queue. Thus the Consumer at the Remote Peer may or may not be able
   to retrieve a valid Terminate reason for some forms of abortive
   teardown. The Consumer at the Remote Peer can retrieve the Terminate
   Message, if available, using the Query QP when the QP has
   transitioned to the Error state. The Consumer at the Local Peer
   should always be able to retrieve the Terminate Message that was
   sent (if the QP transitioned through the Terminate state),
   regardless of whether it was successfully delivered to the Remote
   Peer.

   Note that an abortive teardown may entail loss of data. The RI will
   complete all outstanding (incomplete) iWARP messages in error. In
   general, when an abortive teardown occurs it is impossible to tell
   for sure what iWARP messages were successfully placed and delivered
   at the Remote Peer. Thus even completed messages on the Send Queue
   should be treated as incomplete unless a ULP Acknowledge has been
   received. Note that Completed RDMA Read Type Work Requests act as a
   ULP Acknowledgement, in that any prior RDMA Write Messages, Send
   Type Messages, RDMA Read Operations and the RDMA Read Request
   Message itself are required to have arrived at the Remote Peer
   before the RDMA Read Response Message can be generated at the Remote
   Peer to Complete the RDMA Read Type Work Request.

   When iWARP detects a local error the following items are done:

   1.  If the LLP Stream is still functional, the RI moves the QP to
       the Terminate state. If the error was not reported in a CQE, the
       RI generates an Asynchronous Error Event, with an appropriate
       error code (see 8.3.3 - Asynchronous Errors). Then the RI stops
       QP processing.

       If the LLP Stream is not functional, the RI performs an LLP
       Reset and moves the QP to the Error state. If the error was not
       reported in a CQE, the RI generates an Asynchronous Error Event,
       with an appropriate error code (see 8.3.3 - Asynchronous
       Errors). The RI skips steps 2 and 3 below.

   2.  The RI formats a Terminate Message with an appropriate
       termination error code and sends it to the Remote Peer.

   3.  The RI performs an LLP Close. If the LLP could not successfully
       perform the LLP Close (e.g. for TCP, transitioning through the
       normal closing states incurred a final timeout), an LLP Reset
       occurs. Once either the LLP Close or LLP Reset is finished, the
       RI transitions the QP to the Error state.



   Hilland, et al.        Expires October 2003              [Page 84]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   4.  Once in the Error state, the RI flushes all incomplete WQEs on
       both the Send and Receive Queues by completing them with the
       Flushed Completion Status. The Consumer would presumably reap
       all of the Work Completions to ensure all resources are cleaned
       up. Once the Consumer believes all Work Completions have been
       reaped, it should attempt to transition the QP to the Idle state
       by performing a Modify QP. If the transition is successful, the
       Consumer knows it can either re-use the QP for another LLP
       Stream or it can invoke Destroy QP (see Section 6.1.3 -
       Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
       Queue Pair). If the Modify QP returns with an error (presumably
       because Work Requests are still being flushed), the Consumer
       must try at a later time to transition to the Idle state. The
       Consumer might arm a timeout. If the Consumer is unable to
       transition to the Idle state after some amount of time, it
       should destroy the QP (presumably because the QP can not recover
       from an internal error).

   Figure 16 is an example of how the abortive teardown might occur.
   Other sequences of events are possible. For example, the TCP FIN
   could be sent in a separate TCP segment. Another example is the
   Remote Peer RI might not transition from the Terminate state when
   the LLP can no longer be used for data transmission (i.e. the TCP
   FIN ACK segment is sent). Instead it waits for TCP finite state
   machine to reach the Closed state. If the latter implementation is
   used, QP resources may not be able to be recycled until after TCP
   finishes transitioning through the TIME-WAIT state, which takes a
   considerable amount of time. See Section 10, Security
   Considerations, for potential security issues with this approach.
























   Hilland, et al.        Expires October 2003              [Page 85]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003









             < Figure 16 did not convert properly from source >
             <  to be corrected in an upcoming version        >











               Figure 16 - Abortive Teardown example on TCP
























   Hilland, et al.        Expires October 2003              [Page 86]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

7  Memory Management

7.1  Memory Management Overview

   There are two basic methods for enabling memory to be accessed by an
   RNIC. These are Memory Regions and Memory Windows. Memory Regions
   are used to assign an STag to a Physical Buffer List, associate it
   with a starting Tagged Offset and length, and assign it Memory
   Access Rights. Memory Windows are used to assign an STag to a
   portion, or window, of a Memory Region.

   Fundamental to Memory Management is the definition of an STag (see
   Section 7.2 - Steering Tag (STag)) and the Tagged Offset (TO)
   associated with it (see Section 7.3.1.1 - Memory Region Tagged
   Offset (TO) and Section 7.6.1 - Addressing Registered Memory). Also
   fundamental is the concept of a Physical Buffer List (PBL), which
   contains the physical address mappings for the memory used in the
   Memory Region, as discussed in Section 7.6.2 - Physical Buffer
   Lists.

   An STag can be associated with either a Memory Region or a Memory
   Window. While both Memory Regions and Memory Windows can be used for
   data transfer operations, they differ with respect to the Verbs used
   to manipulate them. These distinctions are covered in great detail
   in this section.

   There are three mechanisms for associating a Memory Region's STag
   with a Physical Buffer List. A Consumer can allocate an STag with
   the PBL in one step, as is done with RI-Register Non-Shared Memory
   Region. A Consumer can also allocate an STag and then use a Fast-
   Register WR to associate the PBL with the STag. Finally, a Consumer
   can create a new STag that is associated with an existing Memory
   Region through the Register Shared Memory Region. For more
   information on Memory Region creation, see Section 7.3.2 - Memory
   Region Creation and Registration.

   There are two types of Memory Regions. These are Non-Shared MR and
   Shared MR. A Non-Shared MR has a PBL that is not shared with other
   MRs. A Shared MR has a PBL that may be shared with other MRs. A Non-
   Shared MR becomes a Shared MR through the Register Shared Memory
   Region operation. For more information on Shared Memory Regions, see
   Section 7.3.2.4 - Register Shared Memory Region. MR (without any
   qualifiers) is used to refer to both Non-Shared MR and Shared MRs.

   Before use, Memory Windows must first be allocated and then Bound to
   a Memory Region. The allocation is a RI Verbs call, but the Bind
   operation is a WR. For more information on Memory Windows, see
   Section 7.10 - Memory Windows.





   Hilland, et al.        Expires October 2003              [Page 87]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Memory registration enables access to a Memory Region by a specific
   RNIC. Binding a Memory Window enables the specific RNIC to access
   memory represented by that Memory Window. STags are specific to an
   RNIC and the RI is NOT REQUIRED to grant access to the Memory Region
   by other local RNICs.

   Mechanisms are provided for Re-registering Non-Shared Memory
   Regions. These are discussed in Sections 7.3.2.3 - RI-Reregister
   Non-Shared Memory Region. In addition, the Verbs provide mechanisms
   for Registering Memory Regions which share PBL mappings. These are
   discussed in Section 7.3.2.4 - Register Shared Memory Region.

   Architecturally, only Bind Memory Window and Fast-Register Non-
   Shared Memory Region are anticipated to be optimized for
   performance. The rest of the Memory Registration mechanisms are not
   anticipated to be performance optimized.

   All Memory Regions MUST have Access Rights associated with them to
   indicate if local read, local write, remote read and remote write
   accesses are allowed. This is discussed in Section 7.4 - Access to
   Registered Memory. All Memory Windows MUST have Access Rights
   associated with them to indicate if remote read and remote write
   accesses are allowed. This is discussed in Section 7.4 - Access to
   Registered Memory.

   Non-Shared Memory Regions and Memory Windows have to be invalidated
   before they can have their PBL associations changed. This has other
   benefits as well, such as preventing remote accesses using that
   STag. This is discussed is Section 7.8 - Invalidating Memory Regions
   and 7.10.4 - Invalidating or De-allocating Memory Windows.

   The RI also provides Verbs for retrieving STag attributes, as
   discussed in Section 7.7 - Querying Memory Regions and 7.10.3 -
   Querying Memory Windows. The Verbs also define the destruction and
   deallocation of Memory Windows and Memory Regions in Section 7.9 -
   Deallocation of STag associated with a Memory Region and in Section
   7.10.4 - Invalidating or De-allocating Memory Windows, respectively.

7.2  Steering Tag (STag)

   All local and remote memory accesses through the Verbs require the
   use of an STag. For local access, the STag, along with a Tagged
   Offset (TO) is used by the RI, when processing a Work RequestÆs SGE,
   to identify a memory location within a specific Memory Region. For
   remote access, the STag, along with a TO, is used by the RI when
   handling RDMA operations to identify a memory location within a
   specific Memory Region or Memory Window.

   An STag is a 32-bit identifier that has two sub-fields: a Consumer
   provided STag Key and an RI provided STag Index. The STag Key is the



   Hilland, et al.        Expires October 2003              [Page 88]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   8 least significant bits of the STag. The STag Index is the 24 most
   significant bits of the STag.

   The 8 bit STag Key is provided by the Consumer. The Consumer can use
   the STag Key in any way it desires. For example, it can be used as
   an incrementing value to help discover application errors by using a
   different value with each registration. As a general rule, the
   Consumer provides the STag Key to the RI whenever the consumer
   causes the transition of an STag to the Valid state, or when the
   STag is being Invalidated. In the Invalid state, only the STag Index
   is meaningful.

   There is no default value for the STag Key. The RI MUST use the STag
   Key provided by the Consumer for the following Verbs:

   *   Register Non-Shared Memory Region,

   *   Register Shared Memory Region,

   *   Reregister Non-Shared Memory Region,

   *   PostSQ Verb Fast-Register Non-Shared Memory Region operation,
       and

   *   PostSQ Verb Bind operation,

   *   PostSQ Invalidate Local STag.

   The RI MUST return the value of the STag Index sub-field on an
   invocation of the following:

   *   Allocate Non-Shared Memory Region STag,

   *   Allocate Memory Window,

   *   Register Non-Shared Memory Region,

   *   Register Shared Memory Region, and

   *   Reregister Non-Shared Memory Region.

   The RI MUST use the same STag Index sub-field as was passed in by
   the Consumer, on an invocation of the following:

   *   Query Memory Region,

   *   Query Memory Window,

   *   Register Shared Memory Region,




   Hilland, et al.        Expires October 2003              [Page 89]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Reregister Non-Shared Memory Region,

   *   PostSQ Fast-Register Non-Shared Memory Region,

   *   PostSQ Bind Memory Window,

   *   PostSQ Invalidate Local STag, and

   *   Deallocate STag.

   Implementation Note: To guarantee that the immediately previous STag
   is no longer valid, the Consumer may change the STag Key field each
   time the STag is bound. The use of a suitable random number with
   each binding can provide a valuable interface check and diagnostic
   tool.

7.2.1  STag of zero

   The STag of zero (STag with a value of zero) is a special STag. It
   has a fixed value for the STag Index and STag Key. The STag Key is
   composed of all zeros and the STag Index is composed of all zeros.
   It has no PD associated with it and it cannot be used for Remote
   Access operations.

   The purpose of an STag of zero is to allow Privileged Mode Consumers
   to be able to reference a Physical Buffer in a WR without first
   registering the buffer with the RI. This approach has the advantage
   of reduced overhead. It has the potential disadvantage that the
   buffer is represented by only a single SGE and therefore must be
   contiguous. Note that buffers which are not contiguous can be
   represented by multiple SGEs in this case, but all SGLs have a
   finite limit of the number of entries allowed by the RI. If the
   buffer is not physically contiguous, any access to the non-existent
   memory may result in an access error.

   Using an STag of zero as part of a Scatter/Gather Element tells the
   RNIC that it MUST interpret the TO portion of the SGE as a physical
   address on the local node. Note the RI MUST never generate an STag
   Index of zero. The RI MUST NOT allow the Consumer to associate an
   STag Key with the STag of zero.

   The STag of zero has the following semantics, which are different
   than the semantics of any other STag:

   1.  The RNIC MUST NOT perform any PD checks on an STag of zero.

   2.  When accessing an STag of zero on a given QP, the RNIC MUST
       assure access to the STag of zero is enabled on that QP. If
       allowing an STag of zero is not enabled, then the operation MUST
       result in a protection error.



   Hilland, et al.        Expires October 2003              [Page 90]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   3.  The RNIC MUST NOT permit any remote access that references STag
       of zero and any attempt to do so MUST result in a protection
       error. The RI MUST grant STag of zero Local Read and Local Write
       Access Rights.

   4.  The RNIC MUST NOT allow Memory Windows to be Bound to STag of
       zero. Any attempt to do so MUST result in an error.

   5.  The RNIC MUST NOT allow a Local or Remote Invalidation of the
       STag of zero. Any attempt to do so MUST result in an error. The
       STag of zero MUST always be in the Valid state.

   6.  The RNIC MUST NOT allow an STag of zero to be an input modifier
       of an RI-Reregister Non-Shared Memory Region, Register Shared
       Memory Region, Query Memory Region, Query Memory Window, Bind
       Memory Window, Deallocate STag, Invalidate STag or Fast-Register
       and MUST return an Immediate Error if a Consumer attempts to do
       so.

   7.  The RI MUST NOT return a value of zero as an STag Index for RI-
       Register Non-Shared Memory Region, RI-Reregister Non-Shared
       Memory Region, Register Shared Memory Region, Allocate Non-
       Shared Memory Region STag and Allocate Memory Window.

7.2.2  Summary of Memory Region STag States

   The STag associated with a Non-Shared Memory Region has two states.
   They are Invalid and Valid. Memory accesses MUST NOT be allowed if
   the STag is in the Invalid state.

   Below in Figure 17 is the Memory Region and Memory Window state
   diagram. It indicates the state transitions required to change
   Memory Regions and Memory Windows from the Valid state to and from
   the Invalid state. In addition, it denotes the effects of the
   Register Shared Memory Region Verb on a Memory Region.


















   Hilland, et al.        Expires October 2003              [Page 91]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003













             < Figure 17 did not convert properly from source >
             <  to be corrected in an upcoming version        >


























            Figure 17 - Memory Region and Window State Diagram

   For a Non-Shared Memory Region, the following bulleted list
   indicates the state, if memory access is allowed in that state, and
   what Verbs are used to enter and exit the specified state.



   Hilland, et al.        Expires October 2003              [Page 92]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Invalid - May not be used to access a memory location.

       o   Entered through: Allocate Non-Shared Memory Region STag,
           PostSQ Invalidate STag, incoming Send with Invalidate STag
           Message, incoming Send with Solicited Event and Invalidate
           STag Message, or local RDMA Read with Invalidate Local STag
           WR.

       o   Exited through: RI-Register Non-Shared Memory Region, RI-
           Reregister Non-Shared Memory Region, Fast-Register Non-
           Shared Memory Region WR, or Deallocate STag.

   *   Valid - May be used to access a memory location.

       o   Entered through: RI-Register Non-Shared Memory Region, RI-
           Reregister Non-Shared Memory Region, Fast-Register Non-
           Shared Memory Region WR.

       o   Exited through: PostSQ Invalidate STag, incoming Send with
           Invalidate STag Message, incoming Send with Solicited Event
           and Invalidate STag Message, local RDMA Read with Invalidate
           Local STag WR, or Deallocate STag.

   Note: Deallocate STag exits the state logic captured above, as does
   RI-Reregister Non-Shared Memory Region (if a different STag is
   returned).

   The STag associated with a Shared Memory Region MUST always be in
   the Valid state. Note that the Register Shared Memory Region Verb
   does two things - it returns a new Shared Memory Region STag for an
   existing Memory Region's Physical Buffer List (either Shared or Non-
   Shared), and if the input STag is for a Non-Shared MR, the Non-
   Shared MR is permanently converted into a Shared MR (See Section
   7.3.2.4 - Register Shared Memory Region). The following bulleted
   list indicates what Verbs are used to enter and exit the Valid state
   for a Shared Memory Region.

   *   Valid - May be used to access a memory location.

       o   Entered through: Register Shared Memory Region.

       o   Exited through: Deallocate STag.

   Note: Deallocate STag of a Non-Shared MR MUST exit the state logic
   captured above.

7.3  Memory Registration

   Memory Registration provides mechanisms that allow Consumers to
   describe a set of virtually contiguous memory locations or a set of



   Hilland, et al.        Expires October 2003              [Page 93]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   physically contiguous memory locations to the RI in order to allow
   the RNIC to access either as a virtually contiguous buffer using the
   STag and Tagged Offset.

   Memory Registration provides the RNIC with a mapping between a STag
   and Tagged Offset and a Physical Memory Address. It also provides
   the RNIC with a description of the access control associated with
   the memory location.

   Before using a data buffer with the RI, all Consumers MUST
   explicitly register with the RI the memory locations associated with
   the data buffer, except when using an STag of zero. Local or remote
   attempts to access unregistered memory MUST result in a protection
   error. Thus every WR simply uses an STag, TO and length to reference
   a buffer.

   Memory Registration MAY fail due to the RNICÆs inability to find
   resources to hold information needed by the RNIC to record the
   registration. Memory MUST NOT be registered in this case and MUST
   NOT consume any RI resources if the Registration fails.

7.3.1  Memory Regions

   A set of memory locations that have been registered are referred to
   as a Memory Region (MR).

   The RNIC uses two values to identify a memory location within a
   Memory Region: Steering Tag (STag) and Tagged Offset (TO).

7.3.1.1  Memory Region Tagged Offset (TO)

   The base of the TO field is specified by the Consumer when the
   Memory Region is registered through RI-Register Non-Shared Memory
   Region, RI-Reregister Non-Shared Memory Region, or Fast-Register
   Non-Shared Memory Region. Two bases MUST be supported by the RNIC:
   Virtual Address (VA) based TO and zero based TO. For a VA based TO,
   the TO of the first memory location associated with the Memory
   Region equals the VA value passed as an input modifier of the Verb
   or WR used to register the Memory Region. For a zero based TO, the
   TO of the first memory location associated with the Memory Region
   equals zero.

7.3.2  Memory Region Creation and Registration

   Before the RNIC can use a Memory Region, the resources associated
   with a Memory Region must be allocated and the Memory Region must be
   registered with the RNIC. The RI defines the following mechanisms
   for providing these functions through the Verbs interface: Allocate
   Non-Shared Memory Region STag, Register Shared Memory Region, RI-




   Hilland, et al.        Expires October 2003              [Page 94]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Register Non-Shared Memory Region, RI-Reregister Non-Shared Memory
   Region, and Fast-Register Non-Shared Memory Region.

   When registering a Memory Region, the Consumer specifies whether
   Memory Windows may be Bound to the Memory Region or not.

7.3.2.1  Allocate Non-Shared Memory Region STag

   This Verb allocates memory registration resources in the RI. When
   the Verb completes, the STag Index will be allocated as described
   below and provided as an output modifier.

   When allocating an STag:

   *   the RI MUST verify the Consumer specified maximum Physical
       Buffer List Size is less than or equal to the size allowed by
       the RI. The RI MUST return the Physical Buffer List (PBL) size
       allocated, which MUST be greater than or equal to the size
       requested. The RI MUST also return the allocated STag Index. If
       the Consumer specified a maximum PBL Size greater than the size
       allowed by the RI, the RI MUST return an Immediate Error.

   *   the RI MUST verify and use the Consumer specified Input Modifier
       called the Remote Access Flag to indicate if Remote Access is
       enabled with the STag. If the Remote Access Flag is enabled, the
       RI MUST be able to allow remote reads or remote writes that
       reference the STag. Otherwise, the RI MUST NOT allow the STag to
       be used in remote read or remote write operations.

   An STag created through the Allocate Non-Shared Memory Region STag
   Verb MUST be able to be used in an RI-Reregister or a Fast-Register
   Non-Shared Memory Region.

   When the Allocate Non-Shared Memory Region STag Verb returns control
   to the Consumer and the Verb has completed successfully, the
   returned STag is in the Invalid state. The STag MUST be placed in
   the Valid state before it can be used by a local or remote operation
   to access a memory location. See Section 7.2.2 - Summary of Memory
   Region STag States for the requirements on transitioning the STag to
   the Valid state.

   For a description of the Verb which Allocates an STag, see Section
   9.2.6.1 - Allocate Non-Shared Memory Region STag.

7.3.2.2  RI-Register Non-Shared Memory Region

   When the RI-Register Non-Shared Memory Region Verb returns, it has
   allocated the appropriate memory registration resources on the RNIC
   and has registered a Non-Shared Memory Region. When the RI-Register
   Non-Shared Memory Region Verb is invoked:



   Hilland, et al.        Expires October 2003              [Page 95]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The RI MUST accept and use any STag Key passed in by the
       Consumer for the Memory Registration.

   *   The RI MUST use the Physical Buffer List passed in by the
       Consumer.

   *   The RI MUST verify and use the Consumer specified modifier which
       indicates if Remote Access is enabled with the STag. If Remote
       Access is enabled, the RI MUST allow remote reads or remote
       writes that reference the STag. Otherwise, the RI MUST NOT allow
       the STag to be used in remote read or remote write operations.

   When the RI-Register Non-Shared Memory Region Verb completes
   successfully:

   *   the RI MUST have Registered the Non-Shared Memory Region with
       the RNIC,

   *   the RI MUST return the STag Index associated with the Non-Shared
       Memory Region to the Consumer,

   *   the RI MUST return the number of Physical Buffer List Entries in
       the allocated Physical Buffer List, which may be larger than the
       requested size, and

   *   the returned STag MUST be in the Valid state.

   See Section 9.2.6.2 - Register Non-Shared Memory Region (RI-
   Register) for a description of the RI-Register Non-Shared Memory
   Region Verb.

7.3.2.3  RI-Reregister Non-Shared Memory Region

   This Verb conceptually performs the functional equivalent of
   Deallocate STag followed by RI-Register Non-Shared Memory Region.
   Where possible, resources below the Verb layer are expected to be
   reused instead of deallocated and reallocated. This Verb may be used
   to change the Access Rights and/or PD ID of a Region, as well as
   changing the memory locations that are registered.

   When the RI-Reregister Non-Shared Memory Region Verb is invoked:

   *   The STag MUST be the STag of a Non-Shared Memory Region.

   *   The STag MUST be in either the Invalid or Valid state.

   *   The RI MUST accept and use any STag Key passed in by the
       Consumer for the Memory Reregistration.





   Hilland, et al.        Expires October 2003              [Page 96]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The RI MUST ensure that no Memory Windows are Bound to the STag
       Index passed in by the Consumer. If any Memory Windows are Bound
       to it, an Immediate Error is returned.

   *   The STag passed in by the Consumer MAY have an original PBL size
       that is smaller than the new PBL size to be associated with that
       STag. If the PBL passed in by the Consumer is greater than the
       PBL associated with the STag, the RI MAY return an error
       indicating it had insufficient resources to complete the
       request.

   If the RI-Reregister Non-Shared Memory Region Verb does not complete
   successfully:

   *   If the RI returns an "Invalid RNIC handle", "Invalid STag Index"
       or "One or more Memory Windows is still Bound to the Region"
       Immediate Error, the RI MUST make no changes to the current
       registration (assuming that it even exists).

   *   If the RI returns any error other than "Invalid RNIC handle",
       "Invalid STag Index" or "One or more Memory Windows is still
       Bound to the Region", the RI MUST Deallocate the Memory Region
       associated with the STag Index used as an Input Modifier and
       ensure that no new Memory Region is registered.

   When the RI-Reregister Non-Shared Memory Region Verb completes
   successfully:

   *   the RI MUST have registered the Non-Shared Memory Region with
       the RNIC;

   *   the RI MAY return a different STag Index than the one passed in
       by the Consumer. If a different STag Index is returned, all
       resources associated with the prior STag MUST have been
       effectively Deallocated (e.g. transition to the Deallocated
       state);

   *   the RI MUST return the number of Physical Buffer List Entries in
       the allocated Physical Buffer List, which may be larger than the
       requested size,

   *   the RI MUST use and set the Remote Access Rights and Remote
       Access Flag for the STag as indicated with the Input Modifier,
       and

   *   the returned STag MUST be in the Valid state. This STag can be
       used to access a memory location.

   The Consumer should note that since the STag Index returned MAY be
   different than the STag Index provided to the Verb, any attempt to



   Hilland, et al.        Expires October 2003              [Page 97]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   use the previous STag Index in this case would result in a memory
   protection error.

   The RI-Reregister Non-Shared Memory Region Verb can be used to
   modify the attributes of a Memory Region created through the RI-
   Register Non-Shared Memory Region, RI-Reregister Non-Shared Memory
   Region, or an Allocate Non-Shared Memory Region STag Verb. A Memory
   Region MUST be allowed to be reregistered an arbitrary number of
   times provided the PBL length is less than or equal to the original
   PBL length.

   For the error case where a Remote Peer is accessing a Non-Shared
   Memory Region while it is in the process of being reregistered,
   implementations MUST present the same semantics as a deallocate or
   invalidate operation followed by a separate registration operation.

   For information on the Verb to Reregister a Memory Region, see
   Section 9.2.6.5 - Reregister Non-Shared Memory Region (RI-
   Reregister).

7.3.2.4  Register Shared Memory Region

   Shared Memory Regions provide a way for the Consumer to obtain a new
   STag Index for a Memory Region that has already been registered.
   This allows optimization of RNIC resources because returning a new
   STag Index allows the Consumer to assign different Access Rights,
   change the VA Base, change if the Region is VA Based or Zero Based,
   assign an STag Key and use a different PD, but use the same Physical
   Buffer List as a previously registered Memory Region. Thus an
   optimized implementation is possible where the new STag can use the
   previous PBL for memory translation but has new STag properties for
   Access Rights and Protection Domain checks.

   When the Shared Memory Region Verb is invoked:

   *   If the STag Index, passed in by the Consumer, is associated with
       a Non-Shared Memory Region, the RI MUST verify that the Memory
       Region STag Index passed in is in the Valid state. Note that
       Shared Memory Regions are always in the Valid state.

   *   Any Memory Windows that are currently bound to the MR,
       associated with the STag Index passed in by the Consumer, MUST
       be unaffected.

   *   The RI MUST verify that the STag Key of the existing MR matches
       the STag Key supplied as an input modifier by the Consumer.

   *   The RI MUST accept and use any STag Key passed in by the
       Consumer for the Shared Memory Registration.




   Hilland, et al.        Expires October 2003              [Page 98]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   If the STag Index passed in by the Consumer references a VA
       based TO, the RI MUST verify that the VA passed in by the
       Consumer produces an FBO that matches the FBO of the PBL that is
       associated with the STag Index passed in by the Consumer.

   When the Shared Memory Region Verb completes successfully:

   *   the RI MUST have registered the new Shared Memory Region with
       the RNIC;

   *   the RI MUST return a different STag Index that is associated
       with the same or identical PBL as the PBL referenced by the STag
       Index passed in by the Consumer;

   *   The RI MUST allow the new Shared Memory Region to have different
       Access Rights, change the VA Base, change if the Region is VA
       Based or Zero Based, assign an STag Key and a different PD; and

   *   if the STag Index passed in by the Consumer is associated with a
       Non-Shared Memory Region, the RI MUST convert the Non-Shared
       Memory Region to a Shared Memory Region but MUST NOT change any
       other attributes of the Memory Region being converted.

   The returned STag, which references the new, Shared Memory Region,
   is in the Valid state. The STag can be used to access a memory
   location.

7.3.2.5  Fast-Register Non-Shared Memory Region

   Fast-Register provides a mechanism for the Consumer to use the
   PostSQ Verb to invoke an asynchronous memory registration. Fast-
   Register Non-Shared Memory Region MUST support registration using
   STags that were created with the Allocate Non-Shared Memory Region
   STag, RI-Register Non-Shared Memory Region Verb or RI-Reregister
   Non-Shared Memory Region Verb and have not subsequently been
   converted to a Shared Memory Region.

   When the Fast-Register Non-Shared Memory Region mechanism is
   invoked:

   *   The RI MUST accept and use any STag Key passed in by the
       Consumer for the Fast-Register operation.

   *   The RI MUST use the STag Index passed in by the Consumer to
       register a Non-Shared Memory Region with the RNIC.

   *   The RI MUST verify that the STag Index passed in by the Consumer
       is in the same PD as the QP. The RI MUST verify that the STag
       Index passed in by the Consumer is not the STag of zero. The RI
       MUST verify that the STag Index passed in by the consumer is not



   Hilland, et al.        Expires October 2003              [Page 99]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       the STag of a Memory Window. If the STag Index is not in the
       same PD as the QP or the STag is that of a Memory Window or the
       STag is the STag of zero, the RI MUST return an error.

   *   The STag MUST be in the Invalid state at the time the Fast-
       Register Non-Shared Memory Region is processed. See Section
       7.2.2 - Summary of Memory Region STag States for more details.
       If the STag is not in the Invalid state at the time the Fast-
       Register Non-Shared Memory Region WR is processed, the RI MUST
       return an error.

   *   If the Non-Shared Memory Region referenced by the STag does not
       have a maximum PBL size greater than or equal to the PBL size
       passed in the Fast-Register Non-Shared Memory Region, the RI
       MUST return an error.

   *   The RI MUST prevent an STag with the Remote Access Flag disabled
       from having its Access Rights changed to include remote Access
       Rights. The RNIC MUST assure an STag with the Remote Access Flag
       enabled can have its Access Rights changed to include remote and
       local, or local only Access Rights. Note that the Remote Access
       Flag cannot be changed except by the RI-Reregister Non-Shared
       Memory Region Verb. If Remote Access Rights are requested and
       the Remote Access Flag is not enabled, the RI MUST return an
       error.

   *   The RI MUST verify that Fast-Register access is enabled on the
       QP that is processing the Fast-Register Non-Shared Memory Region
       operation. Note that this is intended to prevent a Non-
       Privileged Mode application from accessing physical memory
       without Privileged Mode intervention. If Fast-Register is not
       enabled on the QP, the RI MUST return an error.

   The Fast-Register operation MUST take place within the RI at any
   time between when the Work Request is posted and before execution of
   the Work Request immediately after the Fast-Register operation.

   When the Fast-Register Non-Shared Memory Region operation completes
   successfully, the associated STag MUST be in the Valid state. The
   STag can be used to access a memory location.

   For a description of the Fast-Register Non-Shared Memory Region
   mechanism, see Section 9.3.1.1 - PostSQ.

7.4  Access to Registered Memory

   The RI MUST support four distinct Memory Region Access Rights: Local
   Read, Local Write, Remote Read, and Remote Write. The Access Rights
   of the Memory Region MUST apply to each memory location within the
   Memory Region. The RI MUST allow changing Access Rights from local



   Hilland, et al.        Expires October 2003             [Page 100]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   to local and remote only through an RI-Reregister or through a
   Deallocate followed by an Allocate or RI-Register.

   The RI MUST support a Remote Access Flag. It can be supplied as an
   Input Modifier for the Allocate STag, RI-Register and RI-Reregister
   Verbs. If the Remote Access Flag is enabled, the RI MUST allow the
   remote Access Rights to be set on the STag. If the Remote Access
   Flag is disabled, the RI MUST not allow the remote Access Rights to
   be set on the STag.

   When performing local and remote data transfer operations, the RI
   MUST validate all 32 bits of the STag used to represent the data
   transfer.

7.4.1  Local Access to Registered Memory

   The RI MUST allow the Consumer to assign one or both of the Local
   Access Rights to a given Memory Region. If the Consumer does not
   assign one of the local Access Rights, the RI MUST return an error.

   If the RI assigns Local Read Access to a Memory Region, the RNIC is
   allowed to use the STag and Tagged Offset to read any location
   within the Memory Region. If the RI assigns Local Write Access to a
   Memory Region, the RNIC is allowed to use the STag and Tagged Offset
   to write any location within the Memory Region.

   Work Requests may require the Consumer to supply a locally
   accessible data buffer. Locally accessible data buffers are
   described by the STag associated with that Memory Region, a Tagged
   Offset that points to a location within a Memory Region, and the
   quantity of bytes in the buffer that may be used by the Work
   Request.

   The RI MUST enforce that Scatter Gather Elements used in Send
   Operation Type and RDMA Write Work Requests posted to the SQ have
   Local Read Access enabled or a Completion Error will result.

   The RI MUST enforce that Scatter Gather Elements used in Receive
   Work Requests posted to the Receive Queue or Shared-Receive Queue
   have Local Write Access enabled or a Completion Error will result.

   The RI MUST use only Local Access Rights when determining the Access
   Rights for Scatter/Gather Elements. The RI MUST NOT use Remote
   Access Rights when determining the Access Rights for Scatter/Gather
   Elements.

7.4.2  Remote Access to Registered Memory

   The Consumer may, in addition to the Local Access Rights, request
   the RI to assign one or both of the Remote Access Rights to a given



   Hilland, et al.        Expires October 2003             [Page 101]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Memory Region. The RI MUST NOT allow the Consumer to assign Remote
   Write to an MR that has not been assigned Local Write. The RI MUST
   NOT allow the Consumer to assign Remote Read to an MR that has not
   been assigned Local Read.

   If the Consumer assigns Remote Read Access to a Memory Region, the
   RNIC is allowed to use the STag and Tagged Offset to read any subset
   of the Memory Region when processing an incoming RDMA Read Request
   Message. If the Consumer assigns Remote Write Access to a Memory
   Region, the RNIC is allowed to use the STag and Tagged Offset to
   write any subset of the Memory Region when processing an incoming
   RDMA Write or RDMA Read Response Message. For more information, see
   [RDMAP].

   The RI MUST enforce that Tagged Buffers at the Data Sink targeted by
   incoming RDMA Write Messages have Remote Write Access enabled or an
   Asynchronous Error will result at the Data Sink.

   The RI MUST enforce that Tagged Buffers whose contents are retrieved
   by RDMA Read Request Messages have Remote Read Access enabled or an
   Asynchronous Error will result at the Data Source.

   The RI MUST enforce that Tagged Buffers consumed by RDMA Read
   Response Messages have Remote Write Access enabled or an
   Asynchronous Error will result at the Data Sink. The access control
   on the Local Address is not verified until a remote access is
   attempted through the RDMA Read Response Message.

   Remote Access Rights MUST only be used by the RI when determining
   the Access Rights for incoming Tagged and remote Invalidation
   operations. The RI MUST NOT allow an STag with only Local Access
   Rights to be Invalidated by an incoming remote Invalidation
   operation or a protection error will result.

   Figure 18 summarizes local and remote Access Rights and the validity
   of their combinations that the RI MUST enforce:

















   Hilland, et al.        Expires October 2003             [Page 102]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

           Local              Remote          Valid Access Combination

            None               None                      No

            None               Read                      No

            None               Write                     No

            None          Read and Write                 No

            Read               None                     Yes

            Read               Read                     Yes

            Read               Write                     No

            Read          Read and Write                 No

           Write               None                     Yes

           Write               Read                      No

           Write              Write                     Yes

           Write          Read and Write                 No

       Read and Write          None                     Yes

       Read and Write          Read                     Yes

       Read and Write          Write                     Yes

       Read and Write     Read and Write                Yes

            Figure 18 - Valid Combinations of MR Access Rights

7.4.3  Multiple Registrations of Memory Regions

   The same set of memory locations may be registered multiple times,
   resulting in multiple STags. There are two methods for doing this in
   the architecture. The first is the Shared Memory Region, which is
   discussed in Section 7.3.2.4 - Register Shared Memory Region. The
   second is to simply register a set of memory locations a second time
   using the same, similar or overlapping Physical Buffer List.
   Regardless of the method, each resulting STag represents a separate
   and distinct Memory Region and may be independently associated with
   any PD and have distinct Access Rights.




   Hilland, et al.        Expires October 2003             [Page 103]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The RI MUST support registration of Non-Shared Memory Regions that
   have partially or completely overlapping Physical Buffer Lists and
   return a different STag Index for each.

   In cases where multiple registrations that use the same memory
   locations is desired, provision for optimizing the use of RI
   resources is provided. This Verb is called Register Shared Memory
   Region and is discussed in Section 7.3.2.4 - Register Shared Memory
   Region and the Verb is discussed in Section 9.2.6.6 - Register
   Shared Memory Region.

   Given an existing Non-Shared Memory Region, a Shared Memory Region
   Verb creates a new Shared Memory Region associated with the same
   Physical Memory Addresses, with the intention that the new Shared
   Memory Region shares RNIC mapping resources to the extent possible.
   This also turns the existing Non-Shared Memory Region into a Shared
   Memory Region. Through repeated calls to the Register Shared Memory
   Region Verb, an arbitrary number of Shared Memory Regions can
   potentially share the same RNIC mapping resources, all associated
   with the same Physical Memory Addresses. The Base TO, VA (if the
   input STag Index references a VA Based TO), PD ID, and Access Rights
   specified for the new Shared Memory Region need not be the same as
   those of the existing Memory Region. For a VA Based TO, the RI MUST
   verify that the VA passed in by the Consumer produces a FBO that
   matches the FBO of the PBL that is associated with the STag Index
   passed in by the Consumer. The lengths are by definition the same.

7.5  Memory Access Control

   Only a Privileged Mode Consumer can invoke an RI-Register, RI-
   Reregister, or Allocate Non-Shared Memory Region STag Verb. In
   general, the OS is responsible for determining and enforcing access
   control policy for memory registrations it does on behalf of Non-
   privileged Consumers. For instance, it is anticipated, but not
   required, that operating systems will enforce policies similar to
   the following:

   *   A Non-Privileged Mode Consumer has control over which of its
       memory areas can be accessed by local and remote RNIC data
       transfer operations.

   *   A Non-Privileged Mode Consumer can enable any local memory area
       it has access to for access by RNIC data transfer operations.

   *   A Non-Privileged Mode Consumer cannot enable RNIC read access to
       memory areas that the Consumer itself doesnÆt have read access
       to.






   Hilland, et al.        Expires October 2003             [Page 104]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   A Non-Privileged Mode Consumer cannot enable RNIC write access
       to memory areas that the Consumer itself doesnÆt have write
       access to.

   When a Consumer creates QPs or CQs (through the appropriate Verbs),
   the RI automatically allocates and pins any local memory needed for
   the associated RI internal control structures. Access by the RNIC to
   these control structures is implicitly enabled. Access by the
   Consumer to these control structures is supported only indirectly
   through Verbs. Any STags used within the RI that are used for the
   control structures (if they exist) MUST NOT be exposed to the
   Consumer.

   A Consumer controls which Memory Regions and Memory Windows are
   accessible by each QP through the use of PDs. Prior to creating any
   QPs, registering any Memory Regions, or allocating any Memory
   Windows, the Consumer should allocate one or more PDs. When
   registering Memory Regions or allocating Memory Windows, the
   Consumer specifies the PD ID to associate to each. For information
   on the use of PDs, see Section 5.2 - Protection Domains.

7.5.1  Local Access Control

   With Send Type, RDMA Write, and Receive Queue WRs, the Consumer
   explicitly specifies the data buffers to be accessed through the
   local Scatter Gather Elements (SGEs) that the Consumer posts with
   the associated Work Requests.

   When registering a Memory Region, a Privileged Consumer can
   generally specify the following local Access Rights for the Region:
   read only, write only, read and write.

   The Consumer can access the Memory Region through the STag. This
   STag grants the Consumer local Access Rights for the entire Memory
   Region as bounded by the base TO and byte length and the granularity
   of the access control is enforced at the byte level.

   The following list defines the local Access Rights requirements for
   SGEs used in local operations:

   *   Local read access MUST be specified for Gather Elements used in
       Send Type WRs and RDMA Write WRs,

   *   Local Write access MUST be specified for Scatter Elements used
       in Receive WRs, and

   *   For RDMA Read Type WRs, Local Access Rights are not used to
       verify the Local Address or Remote Address.





   Hilland, et al.        Expires October 2003             [Page 105]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

7.5.2  Remote Access Control

   When a Consumer wants to allow Remote Peers to access its local
   memory using RDMA Writes or RDMA Read Operations, the Consumer
   should explicitly enable remote access and Advertise an appropriate
   STag to the Remote Peer for it to use when initiating these RDMA
   Operations targeting the ConsumerÆs (local) memory.

   A Consumer can use either of two mechanisms to enable remote access
   to its memory. The first mechanism consists of using a Memory Region
   that has remote Access Rights. The second mechanism consists of
   allocating and binding Memory Windows. Either results in an STag
   with associated remote Access Rights for the memory referenced by
   the STag.

   Two types of remote access - read and write - are supported. RDMA
   Write requires Remote Write Access at the Remote Peer. The RDMA
   Protocol converts an RDMA Read Type WR into an RDMA Read Operation
   that uses two RDMAP Messages: RDMA Read Request and RDMA Read
   Response. Remote Read Access MUST be enabled for Memory Regions read
   by a remote RDMA Read Request Message. Remote Write Access MUST be
   enabled for Memory Regions written by a remote RDMA Read Response
   Message. If the Memory Region does not have the appropriate Access
   Rights, a protection error occurs.

   For RDMA Read Operations, during the processing of a RDMA Read Type
   WR, the RNIC is responsible for generating one RDMA Read Request
   Message that contains a description of the Local Address and Remote
   Address. Local Access Rights are not used to verify the Local
   Address or Remote Address. The Remote Access Rights of the Local
   Address is not verified until an incoming RDMA Read Response Message
   is received. The Remote Access Rights of the Remote Address are
   verified when the Remote Peer processes the RDMA Read Request
   Message.

   In order to set either Remote Access control types in a Fast-
   Register operation, when the Non-Shared Memory Region STag was
   created, it MUST have been created with the Remote Access Flag
   enabled.

7.6  Addressing

   The Tagged Offset field is used by local and remote operations to
   address registered Memory Regions.

7.6.1  Addressing Registered Memory

   The RI MUST support two mechanisms for specifying the offset within
   Memory Regions: VA Based TO and Zero Based TO. At the time the
   Memory Region is registered, the RI MUST allow the Consumer to



   Hilland, et al.        Expires October 2003             [Page 106]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   choose between these two mechanisms. A Virtual Address Base Tagged
   Offset (VA Based TO) is one that has a Tagged Offset base that
   starts at a non-zero Virtual Address. A Zero Based Tagged Offset
   (Zero Based TO) is one that has a Tagged Offset base that starts at
   zero.

7.6.1.1  Addressing with VA based TO

   The Virtual Addresses that Consumers manipulate and pass as input
   modifiers are referred to simply as Virtual Addresses in this
   specification. The size of the Virtual Addresses used to specify a
   Memory Region to be registered is implementation dependent. The size
   of the TO MUST be 64 bits. The TO passed in the SGE defines the VA
   of the first byte of the SGE.

   A Memory Region is specified by a Virtual Address that points to the
   first byte, which is specified by the First Byte Offset of the
   Physical Buffer List, and by the length of the set in bytes. The
   Physical Buffer size that backs the Region depends on the host
   system hardware and host operating system.

   The RI MUST allow a Consumer to specify an arbitrary alignment and
   length of the virtually contiguous buffer to be registered through a
   RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-Shared
   Memory Region Verb, or Fast-Register Non-Shared Memory Region.

   The following operations should be performed before registering a VA
   Based TO Non-Shared Memory Region:

   *   Translate the set of virtually contiguous memory locations that
       are associated with the Non-Shared Memory Region into a Physical
       Buffer List.

   *   Pin the Physical Buffers in the Physical Buffer List.

   While a Memory Region is Valid, every Physical Buffer within the
   Region must be pinned down in physical memory. This guarantees to
   the RNIC that the Memory Region is physically resident (not paged
   out) and that the virtual to physical address translation remains
   fixed while the Region is registered. The RI is NOT REQUIRED to
   verify that the Physical Buffers in the Physical Buffer List are
   pinned.

   When the Consumer registers a Non-Shared Memory Region addressed
   through the VA based TO mechanism, the following input modifiers are
   passed to the RI (along with additional input modifiers - see
   Section 9.2.6):

   *   Virtual Address - The VA Physical Buffer offset portion of the
       VA defines the offset into the first Physical Buffer of the Non-



   Hilland, et al.        Expires October 2003             [Page 107]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       Shared Memory Region. The RI checks that the VA modulo Physical
       Buffer Size equals the FBO.

   *   Physical Buffer size - Size of all Physical Buffers referenced
       by the Non-Shared Memory Region.

   *   First Byte Offset (FBO) - Offset into the first Physical Buffer
       of the Non-Shared Memory Region

   When a RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-
   Shared Memory Region Verb, Register Shared Memory Region or Fast-
   Register Non-Shared Memory Region is processed, the RI MUST verify
   that the Base TO modulo the Physical Buffer Size is equal to the VA
   modulo the Physical Buffer Size.

7.6.1.2  Addressing with Zero Based TO

   A zero based contiguous set of memory locations is specified by the
   length of the set in bytes. The RI MUST associate a TO that has a
   value of zero with the First Byte Offset in the Physical Buffer
   List.

   The following operations must be performed before registering a zero
   Based TO Non-Shared Memory Region:

   *   Translate the set of virtually contiguous memory locations
       associated with the Non-Shared Memory Region into a Physical
       Buffer List.

   *   Pin the Physical Buffers in the Physical Buffer List.

   While a Memory Region is Valid, every Physical Buffer within the
   Region must be pinned down in physical memory. This guarantees to
   the RNIC that the Memory Region is physically resident (not paged
   out) and that the virtual to physical address translation remains
   fixed while the Region is registered. The RI is NOT REQUIRED to
   verify that the Physical Buffers in the Physical Buffer List are
   pinned.

   When the Consumer registers a Non-Shared Memory Region addressed
   through the Zero Based TO mechanism, the following input modifiers
   are passed to the RI (along with additional input modifiers - see
   Section 9.2.6):

   *   First Byte Offset - Offset into the first Physical Buffer of the
       Non-Shared Memory Region

   *   Buffer size - Size of all Physical Buffers referenced by the
       Non-Shared Memory Region.




   Hilland, et al.        Expires October 2003             [Page 108]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   When a RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-
   Shared Memory Region Verb, Register Shared Memory Region Verb or
   Fast-Register Non-Shared Memory Region WR is processed for a Zero
   base TO MR, the base TO MUST be set to zero.

   Note that a Memory Window cannot be bound to a Zero base TO MR.

7.6.2  Physical Buffer Lists

   Two Physical Buffer types are defined in this specification: Page
   and Block. The RI MUST support the Page Physical Buffer type.
   Support for the Block Physical Buffer type by the RI is OPTIONAL. If
   the RI supports Block Mode, the RI MUST support the ability to place
   the RNIC into either Block Mode or Page Mode when the RNIC is
   opened. The RI MUST support a mechanism for querying the RNIC to
   determine if the Block Physical Buffer type is supported.

   Memory that is part of a Physical Buffer List should remain pinned
   while the RI has any reference to it. It is not safe for the
   Consumer to assume that when an STag is deallocated that the
   Physical Buffer can be unpinned, since another STag may still have a
   reference to that resource. It is the responsibility of the Consumer
   to determine if and when the Physical Buffers should be unpinned.

7.6.2.1  Page Lists

   A Page List is defined by the following attributes:

   *   Page size - The size, in bytes, of each page in the list.

   *   Address List - A list of addresses that point to the physical
       pages referenced by the Page List. The Address List has the
       following attributes:

       o   All pages in the list have the same size, and that size MUST
           be a power of two.

       o   Page addresses MUST be an integral number of page size. In
           other words, each address in the Address List modulo page
           size MUST equal zero.

   *   First Byte Offset (FBO) - Byte offset to start of Memory Region
       within the first page.

   *   Length - Total length in bytes of the Memory Region.

   When a Page List is used to register a Non-Shared Memory Region that
   has a VA based TO, the RI MUST check that the VA modulo the Page
   Size equals the FBO.




   Hilland, et al.        Expires October 2003             [Page 109]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

7.6.2.2  Block Lists

   A Block List is defined by the following attributes:

   *   Block size - The size, in bytes, of each block in the list.

   *   Address List - A list of addresses that point to the physical
       blocks referenced by the Block List. The Address List has the
       following attributes:

       o   The RI MUST interpret each block referenced in the Address
           List as having the same size.

       o   The RI MUST allow Block Addresses to have an arbitrary byte
           alignment.

   *   First Byte Offset (FBO) - Byte offset to start of Memory Region
       within the first block.

   *   Length - Total length in bytes of the Memory Region.

   When a Block List is used to register a Non-Shared Memory Region
   that has a VA based TO, the RI MUST check that the VA modulo the
   Block Size equals the FBO.

7.6.3  Error Checking of Local and Remote Accesses to MRs

   When a local or remote operation attempts to access a registered
   Memory Region, the RI MUST ensure that:

   *   The Access Rights of the Memory Region allow the type of access
       being performed by the operation,

   *   The Access Rights of the QP allow the type of access being
       performed by the operation,

   *   For a QP not associated with an S-RQ, the PD ID associated with
       the Memory Region matches the PD ID associated with the QP that
       is processing the operation,

   *   For a QP that is associated with an S-RQ:

       o   On an incoming Send Operation Type, the PD ID associated
           with the Memory Region matches the PD ID associated with the
           S-RQ that is processing the operation, and

       o   On an outbound Send or RDMA Write, or any incoming RDMA
           Message, the PD ID associated with the Memory Region matches
           the PD ID associated with the QP that is processing the
           operation,



   Hilland, et al.        Expires October 2003             [Page 110]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The memory access as specified by the TO & length is within the
       base and bounds of the Memory Region. The RI MUST enforce this
       with a byte level granularity.

   If the length of the access is zero, the RI MUST NOT perform any of
   the above checks on the Memory Region.

7.7  Querying Memory Regions

   Memory Regions have attributes that can be retrieved through the
   Query Memory Region Verb. The RI MUST support the complete list of
   QP attributes as described in Section 9.2.6.3 - Query Memory Region.

7.8  Invalidating Memory Regions

   When access to a Non-Shared Memory Region by an RI is no longer
   required, but the Consumer wants to retain the STag for use in
   future Fast-Register Non-Shared Memory Region and RI-Reregister Non-
   Shared Memory Region Verb invocations, the Consumer may directly
   invalidate access to the Non-Shared Memory Region through an
   Invalidate Local STag WR or an RDMA Read with Invalidate Local STag
   WR. Additionally, an STag may be invalidated by a remote Consumer
   through the use of a Send with Invalidate Message or a Send with
   Solicited Event and Invalidate Message.

   Multiple Memory Regions can represent memory locations that have
   been registered multiple times. The invalidation of a single STag
   prevents RNIC access to those memory locations via the STag
   associated with that Memory Region. Access to the memory locations
   via STags associated with other Memory Regions other than the STag
   being Invalidated MUST NOT be affected. Invalidating an STag
   associated with a Memory Region that partially or completely overlap
   other Memory Regions MUST NOT cause the RI to affect the
   registration of those other Memory Regions.

   The requirements for unpinning the physical buffers associated with
   deallocated Memory Regions are covered in Section 7.6.2 - Physical
   Buffer Lists.

   Invalidating an STag associated with a Shared Memory Region MUST
   result in an Completion Error. Consequently, using an STag
   associated with a Shared Memory Region under the following
   conditions will cause a Completion Error at the Data Sink that
   results in the LLP Stream being torn down after the data transfer
   operation takes place:

   *   As the STag specified in an Invalidate Local STag WR.

   *   As the Data Sink STag for an RDMA Read with Invalidate Local
       STag WR.



   Hilland, et al.        Expires October 2003             [Page 111]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   As the STag to be Invalidated for a Send with Invalidate or Send
       with SE & Invalidate Message.

   When a local Invalidate Local STag WR, a local RDMA Read with
   Invalidate Local STag WR, an incoming Send with Invalidate, or an
   incoming Send with Solicited Event and Invalidate completes
   successfully, the RNIC MUST place the associated STag in the Invalid
   state. For more information, see Section 8.2.2.1 - Memory Management
   Operation Ordering.

   An Invalidated STag retains associated RI resources, such as the PD,
   and the Remote Access Flag, and the number of Physical Buffer List
   entries but the contents of the Address List Entries become
   indeterminate when the Memory Region is in the Invalid state.

   The RI MUST fail Local Work Requests or Remote Operations that
   attempt to access memory locations in a Non-Shared Memory Region
   that has had its STag Invalidated with a protection error. The RNIC
   MUST NOT be able to access any memory locations through an STag that
   is in the Invalid state.

   For Non-Shared Memory Regions created through the RI-Register Non-
   Shared Memory Region Verb, when an STag is Invalidated, the RNIC
   MUST retain:

   *   The Maximum Physical Buffer List (PBL) size and entries used:

       o   When the RI-Register Non-Shared Memory Region was invoked,
           if an RI-Reregister Verb has not been invoked on the Non-
           Shared Memory Region; or

       o   On the last RI-Reregister Non-Shared Memory Region that used
           the Non-Shared Memory Region.

   *   The state of the Remote Access Flag.

   *   The PD associated with the Non-Shared Memory Region.

   For Non-Shared Memory Regions created through the Allocate Non-
   Shared Memory Region STag Verb, when an STag is Invalidated, the
   RNIC MUST retain:

   *   The Maximum Physical Buffer List size and entries used:

       o   When the STag was created for a Non-Shared Memory Region, if
           an RI-Reregister Verb has not been invoked on the Non-Shared
           Memory Region; or

       o   On the last RI-Reregister Non-Shared Memory Region that used
           the Non-Shared Memory Region.



   Hilland, et al.        Expires October 2003             [Page 112]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The state of the Remote Access Flag.

   *   The PD associated with the Non-Shared Memory Region.

   For Memory Regions created through the RI-Reregister Non-Shared
   Memory Region Verbs, when an STag is Invalidated, the RNIC MUST
   retain:

   *   The Maximum Physical Buffer List (PBL) size and entries used:

       o   When the RI-Register Non-Shared Memory Region was invoked,
           if an RI-Reregister Verb has not been invoked on the Non-
           Shared Memory Region; or

       o   On the last RI-Reregister Non-Shared Memory Region that used
           the Non-Shared Memory Region.

   *   The PD associated with the Non-Shared Memory Region.

   If a Fast-Register is invoked after an RI-Register Memory Region ,
   Allocate Non-Shared Memory Region STag or RI-Reregister Memory
   Region, the Consumer is guaranteed that the RNIC can register a Non-
   Shared Memory Region with a PBL size that is equal to or smaller
   than the original PBL size returned when the Non-Shared Memory
   Region was created or allocated.

   An STag is allowed to already be in the Invalid state, when the RNIC
   performs the STag Invalidation.

   In order to perform an Invalidation Operation on a given QP, either
   through a Local Invalidation operation or an incoming Send with
   Invalidate or Send with Solicited Event and Invalidate, the
   following checks MUST be performed by the RI:

   *   The STag MUST be Non-Shared and in the Valid or Invalid state.

   *   The STag MUST NOT be the STag of zero.

   *   If the STag is that of a Non-Shared Memory Region, the PD ID of
       the STag MUST equal the PD ID of the QP.

   *   If the STag is that of a Non-Shared Memory Region, there MUST
       NOT be any Memory Windows Bound to it.

   *   The STag Key supplied by the Invalidate Operation must be
       validated against the STag Key associated with the Memory Region
       when moving the STag to the Invalid state.

   *   If the Invalidation Operation is due to an Incoming Send with
       Invalidate or Send with Solicited Event & Invalidate, the RI



   Hilland, et al.        Expires October 2003             [Page 113]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       MUST ensure that the QP has either of the remote Access Rights
       enabled and the STag has either of the remote Access Rights
       enabled.

   If any of the above checks fail, a Protection Error MUST result
   unless the STag is in the Deallocated state, in which case an
   Operation Error MUST result. If the operation was initiated by a
   Local Invalidation, a Completion Error MUST result. If the operation
   was initiated by an incoming Invalidation operation, a processing
   error MUST result and the Queue Pair will enter the Terminate state.

   For descriptions of the Work Requests that Invalidate STags
   (Invalidate STag, Send with Invalidate, Send with Solicited Event
   and Invalidate and RDMA Read with Invalidate Local STag), see
   Section 9.3.1.1 - PostSQ.

7.9  Deallocation of STag associated with a Memory Region

   The Consumer can reverse the allocation or registration process that
   created the STag by invoking the Deallocate STag Verb. The process
   of deallocating an STag MUST revoke all RNIC Access Rights
   associated with that STag.

   The RI MUST verify that the STag Index used as an Input Modifier is
   a valid STag on the specified RNIC.

   Multiple Memory Regions can represent memory locations that have
   been registered multiple times. The deallocation of a single STag
   prevents RNIC access to those memory locations via the STag
   associated with that Memory Region. Access to memory locations using
   STags associated with other Memory Regions MUST NOT be affected.
   Deallocating an STag associated with a Memory Region that partially
   or completely overlaps other Memory Regions MUST NOT cause the RI to
   affect the registration of those other Memory Regions. Deallocating
   an STag associated with a Shared Memory Region MUST NOT cause the RI
   to affect the registration of any other Shared Memory Region.

   The requirements for unpinning the physical buffers associated with
   deallocated Memory Regions are covered in Section 7.6.2 - Physical
   Buffer Lists.

   When the Deallocate STag Verb is invoked, any in-process Local or
   Remote Operations that are actively referencing memory locations by
   using the STag being deallocated, MUST fail with a protection error.
   Local or Remote Operations attempting to access memory locations in
   a Memory Region with a deallocated STag MUST fail with a protection
   error.






   Hilland, et al.        Expires October 2003             [Page 114]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Before the Deallocate Verb returns, the RI MUST free all resources
   associated with the STag and revoke the right to use the STag in
   Local or Remote Operations.

   When a Deallocate STag is invoked, the RI MUST NOT:

   *   check the state of the associated STag. That is, an STag
       associated with a Non-Shared MR can be in either the Valid or
       Invalid state when the Deallocate STag is invoked.

   *   check the STag Key portion of the STag. Note that the Deallocate
       Verb does not have an STag Key Input Modifier.

   If any Memory Windows are Bound to the Memory Region and the
   Consumer invokes the Deallocate STag Verb, the RI MUST return an
   Immediate Error and MUST NOT deallocate the Memory Region. Memory
   Windows can reverse the Bind process through deallocation or
   invalidation.

   For a description of the Deallocate Memory Region mechanism, see
   Section 9.2.6.4 - Deallocate STag.

7.10 Memory Windows

   When a Consumer needs more flexible control over remote access to
   its memory, the Consumer can use Memory Windows. Memory Windows are
   intended for situations where:

   *   A Non-Privileged Mode Consumer wants to grant and revoke remote
       Access Rights to a registered Region in a dynamic fashion with
       less of a performance penalty than using
       deallocation/registration or invalidation/re-registration.

   *   A Consumer wants to grant different remote Access Rights to
       different Remote Peers and/or grant those rights over different
       ranges within a registered Region.

   To use a Memory Window, the Consumer allocates a Memory Window and
   then Binds it to a specified TO range of an existing Memory Region
   that is enabled for use with Memory Windows. The range can include
   the entire Memory Region or any subset of the Memory Region.

   See Section 9.2.6 - Memory Management for a description of the Verbs
   used to manage Memory Windows.

7.10.1 Allocating Memory Windows

   The Allocate Memory Window Verb is used to allocate a Memory Window.
   When the Verb returns, it must have allocated Memory Window
   resources on the RNIC, associated the STag with the PD ID supplied



   Hilland, et al.        Expires October 2003             [Page 115]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   as an Input Modifier by the Consumer, and returned the STag
   associated with the allocated Memory Window. The RI MUST ensure that
   the returned STag is in the Invalid state. The RI MUST NOT allow the
   returned STag to be used with RI-Reregister Non-Shared Memory
   Region, Register Shared Memory Region, Query Memory Region or Fast-
   Register Non-Shared Memory Region. For allocating a Memory Window,
   see Section 9.2.6.7 - Allocate Memory Window.

7.10.2 Binding Memory Windows to Memory Regions

   The PostSQ Verb is used to Bind a Memory Window to a previously
   registered Memory Region. After the WR that Binds the MW is
   processed, the STag associated with the Memory Window is in the
   Valid state.

   The RI MUST allow a MW to Bind to a Non-Shared Memory Region. The RI
   MUST allow a MW to Bind to a Shared Memory Region. The RI MUST allow
   all allocated MWs to be Bound to a single MR. The RI MUST allow all
   allocated MWs to be Bound to a single QP.

   If the STag representing the Memory Region to which the Memory
   Window will Bind has an STag of zero, the Verb MUST return either an
   Immediate Error or a Completion Error.

   During the processing of PostSQ Bind Memory Window Verb, the RNIC
   MUST ensure that the PD ID of the Memory Window equals the PD ID of
   the Memory Region and with the PD ID of the QP that is processing
   the PostSQ Bind Memory Window Verb. If the three PD IDs are equal,
   the Memory Window is Bound to the Memory Region and is associated
   with the QP that processed the PostSQ Bind Memory Window Verb.
   Otherwise an invalid PD Completion Error is returned to the
   Consumer. When a Memory Window is Bound to a QP at this point, it is
   conceptually equivalent to having the PD ID of the Memory Window
   replaced with the QP ID of the QP. Thus, instead of performing a PD
   check upon validating the STag for incoming RDMA operations, the QP
   ID of the Memory Window MUST be equal to the QP ID of the QP where
   the incoming RDMA operation arrived.

   The RI MUST check that the QP has the ability to Bind Memory Windows
   enabled.

   When Binding a Memory Window, the RI MUST ensure that the memory
   locations being associated with the Memory Window are within the
   base TO and length of the associated Memory Region. The RI MUST
   support Memory Windows with a Zero Based TO. The RI MUST support
   Memory Windows with a VA Based TO. The RI MUST allow Memory Windows
   to bind to Memory Regions with a VA based TO. If the Memory Window
   has a VA based TO, the RNIC MUST ensure that the value assigned for
   the base of the Memory Window be between the MR's base VA, and the
   MR's Base VA plus the MR's length.



   Hilland, et al.        Expires October 2003             [Page 116]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   When the Bind MW WR completes successfully:

   *   The RI MUST have Bound the MW to the Non-Shared Memory Region.

   *   The RI MUST have Bound the MW to the QP that processed the Bind
       WR, by associating the QP's QP ID to the MW.

   *   The RI MUST have set the MW STag's access rights as requested by
       the Consumer.

   *   The RI MUST accept and use the STag Key passed in by the
       Consumer for the Bind operation.

   *   The RI MUST have set the MW Address Type as requested by the
       Consumer.

   *   If the Address Type of the MW was requested as VA Based, the RI
       MUST have set the Virtual Address as requested by the Consumer.

   *   The RI MUST have placed the MW STag in the Valid State.

   Figure 19 indicates which MR to MW Binding combinations are valid.
   Note that the figure is based on the Base TO type of the Memory
   Region and Memory Window. If the Consumer attempts to Bind a MW to a
   Zero-based TO MR, the RI MUST return an error. The Underlying Memory
   Region in this case may be either a Non-Shared Memory Region or a
   Shared Memory Region.

     Underlying Memory   Memory Window TO base    Valid combination
       Region TO base

         Zero based            Zero based                 No

         Zero based             VA based                  No

          VA based             Zero based                Yes

          VA based              VA based                 Yes

              Figure 19 - MR to MW Valid Binding Combinations

   When a remote access references a Bound Memory Window, the RNIC MUST
   ensure that the QP ID associated with the Memory Window matches the
   QP ID associated with the remote access' RDMA Stream. The RNIC MUST
   also ensure that the memory locations being referenced by the remote
   access are within the base TO and length of the associated Bound
   Memory Window. The RI MUST enforce this with a byte level
   granularity.



   Hilland, et al.        Expires October 2003             [Page 117]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   When Binding a Memory Window, a Consumer can request any combination
   of remote Access Rights for the Window. However, if the associated
   Memory Region does not have local write access enabled and the
   Consumer requests remote write for the Window, implementations MUST
   return a Completion Error.

   Memory Windows MUST support two distinct remote Access Rights:
   Remote Read and Remote Write. Bind Memory Window WRs must specify
   one or both of these rights. Memory Windows with Remote Write Access
   MUST be bound to Memory Regions that have Local Write Access
   Enabled. Memory Windows with Remote Read access MUST be bound to
   Memory Regions that have Local Read Access Enabled.

   A Consumer is allowed and commonly expected to enable remote Access
   Rights when Binding a Window that it may not have enabled when it
   registered the underlying Region - provided it doesnÆt violate the
   above rule regarding local access. For example, a Consumer might
   register a Region with no remote Access Rights, and later Bind one
   or more Windows to that Region that would grant remote Access
   Rights.

   Figure 20 summarizes the access right mappings between Memory
   Regions and Memory Windows and if the Memory Window Access Right
   requested is allowable or not. The RI MUST validate Memory Windows
   Access Right requests according to Figure 20 and if the Access Right
   requested is not allowed, the Bind operation must result in a
   Completion Error.

     Underlying Memory      Requested Remote    Access Right Requested
       Region's Local       Access Rights for          allowed:
       Access Rights         Memory Window

         Local Read           Remote Write                No

         Local Read            Remote Read                Yes

         Local Read       Remote Read and Write           No

         Local Write           Remote Write               Yes

         Local Write           Remote Read                No

         Local Write      Remote Read and Write           No

    Local Read and Write       Remote Write               Yes

    Local Read and Write       Remote Read                Yes





   Hilland, et al.        Expires October 2003             [Page 118]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

     Underlying Memory      Requested Remote    Access Right Requested
       Region's Local       Access Rights for          allowed:
       Access Rights         Memory Window

    Local Read and Write  Remote Read and Write           Yes

            None                   Any                    No

            Any                   None                    No

          Figure 20 - Valid Combinations of MW & MR Access Rights

   Allocating or de-allocating a Memory Window requires a Privileged
   mode transition for a Non-Privileged Consumer, and thus incurs the
   associated software overhead. Binding a Memory Window is performed
   with a Work Request posted to a Send Queue, and thus incurs far less
   software overhead.

   An STag used in a PostSQ Bind Memory Window Verb MUST be in the
   Invalid state.

   Each time a Memory Window is Bound, the Consumer passes the STag Key
   portion of the STag to the RI. The RI MUST use the STag Key provided
   by the Consumer. Additionally, the RI MUST NOT change the STag Index
   portion of the STag passed in by the Consumer. Note that the Bind
   Memory Window WR has unique ordering rules which are detailed in
   Section 8.2.2.1 - Memory Management Operation Ordering. Once the
   Bind operation has completed processing, RNIC implementations MUST
   guarantee that no additional accesses on this Memory Window can be
   performed with any STag Key other than the one used in the last Bind
   operation.

   If the RNIC detects an error with the Bind operation, it MUST put
   the QP into the Error state.

   Multiple Windows can be Bound to the same Memory Region, each with
   arbitrary remote Access Rights, and their associated areas can be
   overlapping or disjoint.

   For a description of the error conditions checked during MW Bind and
   MW access, see Section 7.10.6 - Error Checking during Memory Window
   Operations.

   For a description of the Bind Memory Window operation, see Section
   9.3.1.1 - PostSQ.






   Hilland, et al.        Expires October 2003             [Page 119]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

7.10.3 Querying Memory Windows

   Memory Windows have attributes that can be retrieved through the
   Query Memory Window Verb. The RI MUST support the complete list of
   QP attributes as described in Section 9.2.6.8 - Query Memory Window.

7.10.4 Invalidating or De-allocating Memory Windows

   When access to a Memory Window by the RI is no longer required, but
   the Consumer wants to retain the STag for use in future PostSQ Bind
   Memory Window Verb invocations, the Consumer may directly invalidate
   access to the Memory Window through either an Invalidate Local STag
   WR or an RDMA Read with Invalidate Local STag WR. Additionally, an
   STag associated with a Memory Window may be invalidated by a remote
   Consumer through the use of a Send with Invalidate Message or a Send
   with Solicited Event and Invalidate Message. For more information on
   these Verbs, see Section 7.8 - Invalidating Memory Regions.

   Memory Windows are Deallocated in a fashion similar to Memory
   Regions: with the Deallocate STag Verb. For more information, see
   Section 7.9 - Deallocation of STag associated with a Memory Region.

   When processing an Invalidate operation on an MW STag:

   *   and the MW is in the Valid state, the RI MUST check and enforce
       that the QP ID associated with the MW is equal to the QP ID of
       the QP processing the Invalidate Local STag WR. If the QP IDs
       match, the RNIC MUST place the specified local STag in the
       Invalid state. If the QP IDs do not match, the RI MUST return an
       error.

   *   and the MW is in the Invalid state, the RI MUST check and
       enforce that the PD ID associated with the MW is equal to the PD
       ID associated with the QP processing the Invalidate Local STag
       WR. If the PD IDs do not match, the RI MUST return an error.

   When a local Invalidate Local STag WR, local RDMA Read with
   Invalidate Local STag WR, an incoming Send with Invalidate Message,
   or an incoming Send with Solicited Event and Invalidate Message
   completes successfully, the RNIC MUST:

   *   transition the associated STag to the Invalid state,

   *   change the association of the newly invalidated STag from the QP
       to the PD of the QP that processed the STag Invalidation,

   *   retain the Memory Window resources associated with the STag,

   *   remove the association of the Memory Window with the underlying
       Memory Region.



   Hilland, et al.        Expires October 2003             [Page 120]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   An invalidated STag which was either Invalidated as described above,
   or in the Invalid state because it was created through the Allocate
   Memory Window Verb but never used, can be used as the MW in a PostSQ
   Bind Memory Window WR.

   Once an STag associated with a MW is successfully Invalidated, the
   RI MUST associate the STag with the PD associated with the QP
   processing the Invalidate Local STag WR.

   For information on Invalidating Memory Windows through the
   Invalidate Local STag or RDMA Read with Invalidate Local STag WR,
   see Section 9.3.1.1 - PostSQ. For information on Invalidating Memory
   Windows through Send with Invalidate or Send with Solicited Event &
   Invalidate WR, see Section 9.3.1.1 - PostSQ. For a description of
   the Verb to deallocate a Memory Window, see Section 9.2.6.4 -
   Deallocate STag.

7.10.4.1 Invalidating or De-allocating Active Windows

   Under normal operation, it is improper for a Consumer to deallocate
   or Invalidate the STag of the Memory Window while it is being used
   in an incoming, remote operation. However, this can occur if the
   Remote Consumer misbehaves, or it can occur under error recovery
   circumstances.

   Any Remote Operations that are in-process and actively using a
   Memory Window when its STag is Invalidated MUST fail with a
   protection error. Once the Completion of the Invalidate operation
   has been determined by the Consumer, the RI MUST guarantee that no
   additional accesses can be performed under the previous binding.

   Any Remote Operations that are in-process and actively using a
   Memory Window when it is deallocated MUST fail with a protection
   error. Once the de-allocation Verb completes, RNIC implementations
   MUST guarantee that no additional accesses can be performed through
   that Memory Window.

   An STag is allowed to already be in the Invalid state, when the RNIC
   performs the STag Invalidation.

7.10.5 Summary of Memory Window STag States

   An STag associated with a Memory Window has two states:

   *   Invalid - May not be used to access a memory location.

       o   Entered through: Allocate Memory Window, PostSQ Invalidate
           STag WR, incoming Send with Invalidate STag Message,
           incoming Send with Solicited Event and Invalidate STag
           Message, or local RDMA Read with Invalidate Local STag WR.



   Hilland, et al.        Expires October 2003             [Page 121]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Exited through: PostSQ Bind Memory Window WR or Deallocate
           STag.

   *   Valid - May be used to access a memory location.

       o   Entered through: PostSQ Bind Memory Window WR.

       o   Exited through: PostSQ Invalidate STag MW, incoming Send
           with Invalidate STag Message, incoming Send with Solicited
           Event and Invalidate STag Message, local RDMA Read with
           Invalidate Local STag WR, or Deallocate STag.

   Note: Deallocate STag exits the state logic captured above.

7.10.6 Error Checking during Memory Window Operations

7.10.6.1 Error Checking at Window Bind Time

   The RI MUST check for the following error conditions during the
   Memory Window Bind operation and, if any error is detected the RI
   MUST return a Completion Error.

   *   The RNIC MUST check and enforce that the MW STag is an MW STag
       and is in the Invalid state.

   *   The RNIC MUST check and enforce that the QP has Memory Window
       Binding enabled.

   *   The RNIC MUST check and enforce that the STag of the MR is an MR
       STag and is in the Valid state and is not the STag of zero.

   *   The RNIC MUST check and enforce that the Memory Window, Memory
       Region, and QP belong to the same PD.

   *   The RNIC MUST check and assure that the Memory Region has Window
       binding enabled.

   *   The RNIC MUST check and enforce that the Memory Window Access
       Rights are compatible with the Access Rights of the underlying
       Memory Region. (See Figure 19).

   *   The RNIC MUST check and enforce that the Memory Region is not a
       Zero based TO MR.

   *   The RNIC MUST check and enforce that the Memory Window base TO
       and bounds is within the base TO and bounds of the underlying
       Memory Region. The RI MUST enforce this with a byte level
       granularity.





   Hilland, et al.        Expires October 2003             [Page 122]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

7.10.6.2 Error Checking at Window Access Time

   The following conditions MUST be checked for each incoming RDMAP
   Tagged Message targeting an STag that is associated with a Memory
   Window:

   *   The RNIC MUST check and enforce that the MW STag is in the Valid
       state.

   *   The RNIC MUST check and enforce that the QP ID associated with
       the Memory Window is equal to the QP ID associated with the
       incoming remote operation that is accessing the Memory Window.

   *   The RNIC MUST check and enforce the incoming memory access as
       represented by the TO and length is within the TO base and
       bounds of the Memory Window. The RI MUST enforce this with a
       byte level granularity.

   *   The RNIC MUST check and enforce the Access Rights associated
       with the Memory Window.

   *   The RNIC MUST NOT check or enforce the Access Rights associated
       with the Memory Region to which the Memory Window is Bound.

   *   The RI MUST check that the appropriate MW and QP Remote Access
       Rights are enabled for the incoming RDMA Message. For example,
       if the incoming RDMA Message is an RDMA Write targeting a MW,
       the RI must check that the MW and the QP have Remote Write
       Access Rights enabled.

   If any of the above checks fail, the RI MUST not allow the memory
   access to take place and a protection error MUST be generated.

   If the length of the access is zero, the RI MUST NOT perform any of
   the above checks on the Memory Window.

   Note that the QP attributes must be verified as well. For more
   information, see Section 8.1.2.2.

7.10.6.3 Error Checking at Window Invalidate Time

   The following conditions MUST be checked on a PostSQ Invalidate
   Local STag WR, RDMA Read with Invalidate Local STag WR, incoming
   Send with Invalidate Message, or incoming Send with Solicited Event
   and Invalidate Message that accesses a Memory Window:

   *   If the Memory Window is in the Valid state, the RNIC MUST check
       and enforce that the QP ID associated with the Memory Window is
       equal to the QP ID associated with the QP processing the
       Invalidate Local STag WR.



   Hilland, et al.        Expires October 2003             [Page 123]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   If the Memory Window is in the Invalid state, the RNIC MUST
       check and enforce that the PD ID associated with the Memory
       Window is equal to the PD ID associated with the QP processing
       the Invalidate Local STag WR.

   If any of the above checks fail, the RI MUST NOT allow the
   invalidation to take place and the operation MUST result in an
   error.













































   Hilland, et al.        Expires October 2003             [Page 124]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

8  Work Requests and the WR Processing Model

8.1  Work Requests

   A Work Request is the fundamental unit of work used by the Consumer
   to indicate to the RNIC that there is data to transfer and control
   operations to process on a specific QP. The following sections
   describe the creation of Work Requests, types of Work Requests and
   Work Request Contents.

8.1.1  Creating Work Requests

   Work Requests MUST be the only mechanism available to Consumers to
   submit work to the Work Queues. The Work Requests Verbs MUST be used
   only to pass operations from the Consumer to the RI. Specifically,
   these Verbs are PostSQ (Section 9.3.1.1) and PostRQ (Section
   9.3.1.2).

   Work Requests can only be posted to the SQ or RQ of a specific QP,
   or, if the QP is associated with an S-RQ, to the S-RQ associated
   with the QP.

   Work Requests are created by the Consumer above the RI and submitted
   through the Verbs to the RI for processing. The format of Work
   Requests within the RI is not defined. Its structure is opaque to
   the Consumer and is not part of this specification. WRs are only
   valid during the Posting process. WRs are then represented by WQEs
   until Completed.

   The RNIC MUST support the submission of multiple WRs to the RI as a
   list of individual Work Requests. The intention of this requirement
   is to allow for optimizations in the RNIC such that the RI can
   inform the RNIC of WQEs in the most efficient manner for that
   individual RNIC.

8.1.2  Work Request Types

   There are three basic Work Request types. These are those dealing
   with Send/Receive, RDMA, and Memory.

8.1.2.1  Send/Receive

   The Send/Receive model supports the Untagged Buffer Model in the
   RDMAP/DDP specifications. The Send/Receive model uses a one-to-one
   correspondence between outgoing Sends Operation Type WRs and
   incoming Receive Queue WRs. Successful Send Type Work Requests MUST
   result in the consumption of a Receive Queue Work Request at the
   Associated QP. Receive Queue Work Requests should be posted to the
   RQ before the incoming Send Message Type arrives. If a WQE is not
   available on the RQ to describe the Untagged Buffer for the incoming



   Hilland, et al.        Expires October 2003             [Page 125]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Send Message Type, then the LLP Stream MAY be terminated. If the LLP
   Stream is not terminated, the reader should see Section 13.2 -
   Graceful Receive Overflow Handling for one implementation option.

   The RI MUST allow Send Work Requests to only be posted to a Send
   Queue. This includes all Send Operation Types, which are: Send, Send
   with Solicited Event, Send with Invalidate and Send with Solicited
   Event & Invalidate. The RI MUST allow only Receive Work Requests to
   be posted to a Receive Queue or Shared Receive Queue.

   A Receive Queue Scatter/Gather List Work Request MUST contain at
   least enough buffer space to place the incoming Send Message Type.
   If it does not, a Completion Error MUST be returned. The length of
   the buffer represented by the Scatter/Gather List of a Receive Queue
   Work Request MAY be greater than the length of the incoming data.
   The length of incoming data MUST be returned by the RI as part of
   the Work Completion. In the case of any Completion Error, the value
   of the length in the Work Completion MUST be considered
   indeterminate.

   Since segmentation and reassembly is provided by DDP, Send Operation
   Types and corresponding Receives can be larger than the EMSS (See
   [RDMAP][DDP]). The maximum data transfer length supported by the
   architecture is 2^32-1 octets of data. Note that for any given
   message, the length of the buffers represented by the WRs posted to
   the RQ MAY have a total length that is smaller than the maximum data
   transfer length. It is up to the Consumer to negotiate the maximum
   receive buffer size with the Remote Peer.

   The Data Source of Send Operation Types MUST be a local
   Scatter/Gather List. See Section 8.1.3.2 for a description of
   Scatter/Gather List.

   The Data Sink of Receive operations MUST be a local Scatter/Gather
   List.

8.1.2.2  RDMA

   RDMA Write WRs, RDMA Read WRs, and RDMA Read with Invalidate Local
   STag WRs  MUST NOT result in the consumption of a Receive Queue Work
   Request at the Remote Peer.

   The Data Source of an RDMA Write Work Request MUST be a
   Scatter/Gather List consisting of local buffers.

   The Data Sink used in an RDMA Read Type WR MUST be in the local
   node's address space as represented by the TO, STag and Length
   contained in the RDMA Read Type WR. The STag MUST be Bound to either
   a Memory Region or a Memory Window containing the buffer represented
   by the TO and length.



   Hilland, et al.        Expires October 2003             [Page 126]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The Data Source for an RDMA Read Type WR and the Data Sink for an
   RDMA Write WR MUST be in the Remote Peer's address space as
   represented by the TO, STag and Length contained in the Work
   Request. The STag MUST represent either a Memory Region or a Memory
   Window containing the buffer represented by the STag, TO and length.

   Queue Pairs have RDMA Read enable and RDMA Write enable attributes.
   Memory Regions and Memory Windows have Remote Read and Remote Write
   attributes as well. Memory Regions also have Local Read and Local
   Write attributes. RDMA transfers MUST only take place when the
   appropriate QP RDMA attribute is enabled and the appropriate STag
   attribute is enabled where the STag represents either a Memory
   Region or a Memory Window. If the STag is that of a Memory Window,
   the attributes of the Memory Region do not apply at memory access
   time. These attributes are checked at the node where the target
   memory is located. After the STag Access Rights and QP Access Rights
   have been verified, the RI MUST verify that the STag Access Rights
   match the QP Access Rights. If the RI detects an invalid Access
   Rights combination, the operation MUST result in a protection error.
   The combinations of QP Access Rights and STag Access Rights which
   will allow the data transfer to take place are shown in Figure 21.
































   Hilland, et al.        Expires October 2003             [Page 127]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   STag Used as   QP Attribute        STag Attribute(5)   Access
                                                          Allowed?

   RDMA Read Type Inbound RDMA Read:  Remote Read Access:
   Data Source
                   Enabled             Enabled             Yes

                   Disabled            Either              No

                   Either              Disabled            No

   RDMA Write or  Inbound RDMA Write  Remote Write
   RDMA Read Type and inbound RDMA    Access:
   Data Sink      Read Response:

                   Enabled             Enabled             Yes

                   Disabled            Either              No

                   Either              Disabled            No

   RDMA Write or                      Local Read Access:
   Send Type Data  Either
   Source                              Enabled             Yes

                                       Disabled            No

   Receive Data                       Local Write Access:
   Sink            Either
                                       Enabled             Yes

                                       Disabled            No


           Figure 21 - Valid QP & STag Access Right Combinations

   The RDMA Read with Invalidate Local STag WR behaves similar to an
   RDMA Read Work Request which is then immediately followed by a
   Invalidate Local STag WR on the STag in the Local Address. The
   slight difference in behavior is in this case the Invalidate will
   not occur until after the RDMA Read Operation is complete; while
   with two separate WRs, the Invalidate operation could begin
   processing before the RDMA Read Type WR Completes. Work Requests
   subsequent to an RDMA Read with Invalidate Local STag WR may begin


   Footnote 5: The STag may have additional Access Rights, but only the
   rights listed effect the allowed access.




   Hilland, et al.        Expires October 2003             [Page 128]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   processing before the RDMA Read with Invalidate Local STag WR
   Completes. See Section 8.2.2.1 - Memory Management Operation
   Ordering for more details.

8.1.2.3  Memory

   The following Memory Operations can be posted to the SQ: Bind Memory
   Window, Fast-Register Non-Shared Memory Region, Invalidate Local
   STag and RDMA Read with Invalidate Local STag.

8.1.2.3.1   Bind Memory Windows

   The Bind Memory Window WR associates a previously allocated MW to a
   specified Tagged Offset (TO) range within an existing MR, as well as
   sets the MW's RDMA remote Access Rights.

   Bind operations MUST be posted to the SQ as a Work Request. Binds
   only affect local RNIC mapping resources and MUST NOT cause any
   segment to be issued to the LLP. No resources at the associated QP
   are directly affected.

   For more information on the Memory Window Bind operation, see
   Section 7.10.2 - Binding Memory Windows to Memory Regions.

8.1.2.3.2   Fast-Register Non-Shared Memory Region

   The Fast-Register Non-Shared Memory Region WR associates an MR STag
   that is in the Invalid state to a specified Physical Buffer List
   (For more information on Invalidating STags, see Section 7.8 -
   Invalidating Memory Regions). For information on the STag types
   allowed, see Section 7.3.2.5 - Fast-Register Non-Shared Memory
   Region.

   Fast-Register Non-Shared Memory Region operations MUST be posted to
   the Send Queue. Fast-Register Non-Shared Memory Region operations
   only affect local RNIC mapping resources and do not cause any data
   transfer. No resources at the Associated QP are directly affected.

8.1.2.3.3   Invalidate Local STag

   The Invalidate Local STag and RDMA Read with Invalidate Local STag
   WRs use the STag supplied as the target for the invalidation and
   transition the STag to the Invalid state.

   The STag which is the target of an Invalidate Local STag or RDMA
   Read with Invalidate Local STag WR MUST be associated with a Non-
   Shared Memory Region (i.e. created by Allocate Non-Shared Memory
   Region STag, RI-Register Non-Shared Memory Region, RI-Reregister
   Non-Shared Memory Region and has not transitioned to a Shared Memory
   Region) or MW (i.e. created by Allocate Memory Window).



   Hilland, et al.        Expires October 2003             [Page 129]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   For information on Invalidating STags associated with a Non-Shared
   MR, see Section 7.8 - Invalidating Memory Regions. For information
   on Invalidating STags associated with MWs, see Section 7.10.4 -
   Invalidating or De-allocating Memory Windows.

   Invalidate Local STag operations MUST be posted to the Send Queue as
   a Work Request. The Invalidate Local STag operations only affect
   local RNIC mapping resources and MUST NOT cause any data transfer.
   No resources at the Associated QP are directly affected.

   The initiation of an Invalidate Local STag operation must remain
   ordered with respect to other Work Requests on the same QP and the
   operation must take effect before any subsequent WRs can begin
   processing by the RNIC, as defined in the ordering rules in Section
   8.2.2.1 and Section 8.2.2.2.

8.1.3  Work Request Contents

   Every Work Request submitted through the Verbs contains all of the
   information required to perform the requested operation. The exact
   WR contents are covered in the Section 9.3.1.1 - PostSQ and 9.3.1.2
   - PostRQ. The characteristics of two of the Post Send Request Verb
   modifiers are discussed below.

8.1.3.1  Signaled Completions

   Signaled Completions refer to Work Requests that result in a Work
   Completion. Unsignaled Completions provide a mechanism where Work
   Requests posted to the Send Queue do not generate a Work Completion
   in the associated Completion Queue if the operations complete
   successfully. The RI MUST support PostSQ WRs with Unsignaled
   Completions on every QP.

   Every WR posted to the RQ MUST result in a Work Completion.
   Consequently, all RQ WRs are considered Signaled WRs.

   The Consumer can indicate that it does not need a Signaled
   Completion by setting the Unsignaled Completion indicator in a Work
   Request posted to the SQ.

   When an error is encountered on an Unsignaled or Signaled WR, a CQE
   will be generated for that WR with the appropriate error code. In
   addition, the RI MUST Complete all subsequent WRs with a Flushed
   Error Completion Status regardless of their signaling type. The
   Consumer is safe in assuming that all WRs prior to the one resulting
   in an error were completed successfully.

   An Unsignaled WR is defined as completed successfully when all of
   the following rules are met:




   Hilland, et al.        Expires October 2003             [Page 130]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   A Work Completion is retrieved from the CQ associated with the
       SQ where the unsignaled Work Request was posted,

   *   that Work Completion corresponds to a subsequent Work Request on
       the same Send Queue as the unsignaled Work Request, and

   *   the subsequent Work Request is ordered after the unsignaled Work
       Request as per the ordering rules. Depending on the Work Request
       used, this may require using the Local Fence indicator in order
       to guarantee ordering.

   When an unsignaled WQE completes successfully:

   *   The RI MUST free up any resources associated with the Unsignaled
       WQE,

   *   The Consumer MAY consider the WQE as having completed
       successfully, and

   *   The Consumer MAY re-use any resources associated with the
       Unsignaled WQE.

   The Consumer should ensure that in the event that a WQE with an
   Unsignaled Completion indicator results in an error that the CQ will
   not overflow as stated in Section 5.3.1. This is because the WQE
   will cause a CQE and every WQE after it will cause a CQE as well
   since they result in CQEs with the Flushed status.

8.1.3.2  Scatter/Gather List

   The RI MUST allow each Scatter/Gather List (SGL) to contain one or
   more Scatter/Gather Elements (SGE). The SGE references a buffer via
   an STag, TO, and length. The STag specified in the SGE MUST be
   Registered with the RI prior to submission, except for the STag of
   zero. These buffers referenced by the STag MUST be considered to be
   in the scope of the RI from the time they are submitted to a Work
   Queue until Completion of the Work Request has been confirmed.

   If a Memory Window STag is used in an SGE in a PostRQ or PostSQ Send
   Operation Type or the Data Source for an RDMA Write WR, the RI MUST
   Complete the Work Request with a Completion Error.

   The sum total of all of the buffer lengths in an SGL MUST NOT exceed
   the maximum message payload size specified for RDMAP. This is 2^32-1
   bytes. If an SGE has a length of zero, the STag MUST NOT be
   validated by the RI. For PostSQ WRs, the sum of the Length field in
   all of the SGEs MUST be the total length of that RDMAP operation.
   This value MUST be able to be zero.





   Hilland, et al.        Expires October 2003             [Page 131]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   An RI MAY support more than one Scatter/Gather Element per
   Scatter/Gather List. The exact number of Scatter/Gather Elements per
   Scatter/Gather List supported by the RNIC MUST be returned via the
   Query RNIC Verb (Section 9.2.1.2) where there is one value for Send
   Operation Type WR for Data Source buffers (which also applies to
   PostRQ buffers) and one value for RDMA Write WR Data Source buffers.
   The Consumer can specify the maximum number of Scatter/Gather
   Elements per Scatter/Gather List for each Work Queue as an input
   modifier to the Create QP (Section 9.2.5.1). The RI MUST return an
   Immediate Error if the value in Create QP exceeds the value
   supported by the RNIC.

   An RI MUST support at least four Scatter/Gather Elements per
   Scatter/Gather List when the Scatter/Gather List refers to the Data
   Source of a Send Operation Type or the Data Sink of a Receive
   Operation. An RI is NOT REQUIRED to support more than one
   Scatter/Gather Element per Scatter/Gather List when the
   Scatter/Gather List refers to the Data Source of an RDMA Write.

8.1.3.2.1   STag of zero Usage

   The ability to use the reserved STag of zero MUST NOT be allowed for
   Non-Privileged Mode accessible QPs. The RI must generate an
   Affiliated Asynchronous Error if an RDMAP Tagged message is received
   with an STag of zero. If the STag of zero is used in an outgoing
   RDMA Read Type WR or as the Data Sink of an RDMA Write WR, the RI
   MUST return a Completion Error. Thus the Consumer should not
   Advertise the STag of zero, since an error will result.

8.1.3.3  RDMA Data Source & Data Sink

   For RDMA Read Type Work Requests, the RI MUST support the Data
   Source Local Address as an input modifier to PostSQ. The structure
   representing this information is known as a Data Source Address. A
   Data Source Address consists of an STag, Tagged Offset and Length.
   An RI MUST support exactly one Data Source Address for RDMA Read
   Type Work Requests.

   For RDMA Write Work Requests, the RI MUST support the Data Source
   Scatter/Gather List as an input modifier to PostSQ.

   For RDMA Write and RDMA Read Type Work Requests, the RI MUST support
   the Data Sink Remote Address as an input modifier to PostSQ. The
   structure representing this information is known as a Data Sink
   Address. A Data Sink Address consists of an STag, Tagged Offset and
   Length. An RI MUST support exactly one Data Sink Address for RDMA
   Read Type Work Requests and RDMA Write Work Requests.






   Hilland, et al.        Expires October 2003             [Page 132]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

8.2  Work Request Processing Model

   The Work Request processing model describes how requests are sub-
   mitted, processed by the RNIC, and the results returned to the
   Consumer.

8.2.1  Submitting Work Request to a Work Queue

   Work Requests are submitted to the RNIC through the Verbs. They are
   represented within the RI as Work Queue Elements. Work Queue
   Elements are abstract. This means they are not accessible directly
   by the Consumer of the RNIC Interface.

   Work Requests can be submitted to the RNIC as a list of Work
   Requests. Each Work Request in the Work Request List which is
   successfully inserted into the Work Queue MUST result in the
   consumption of one WQE on the Work Queue, and each Work Request MUST
   be submitted to the Work Queue in the order specified in the Work
   Request List. When a list of WRs containing more than one WR is
   posted on an SQ, RQ, or an S-RQ, the first Immediate Error in
   processing a WR MUST stop processing of the Work Request List and
   MUST NOT enqueue the subsequent WRs in the list onto the Work Queue.
   All Work Requests prior to the Work Request in error MUST be
   inserted into the Work Queue. The RI MUST return to the Consumer the
   number of successfully posted WRs and the verbs result MUST indicate
   the Immediate Error associated with the WR that resulted in the
   first error.

   The intent of supporting a WR List is to allow some implementations
   to reduce the number of Consumer to RI interactions when the
   Consumer has multiple WRs to post, and to reduce the number of
   interactions between the RI and RNIC due to alerting the RNIC of
   additional work to perform.

   One of the intentions of the architecture is to allow an
   implementation to pass Work Requests from a Non-Privileged Mode
   Consumer directly to the RNIC. Consequently, certain Verbs are
   designed to be invoked in either Privileged Mode or Non-Privileged
   Mode while others are designed to be invoked only in Privileged
   Mode. The Verbs that are intended to be invoked in either Privileged
   Mode or Non-Privileged Mode are: PostSQ, PostRQ, Poll for Completion
   and Request Completion Notification.

   The RI MUST return control to the Consumer immediately after a WR or
   WR List has been submitted to the SQ, RQ or S-RQ and the RNIC has
   been notified that a new WR or WR List is ready to process.

   The RI MUST ensure that the space occupied by a Work Request in
   either the Send or Receive Work Queue is not made available for
   posting a new Work Request until:



   Hilland, et al.        Expires October 2003             [Page 133]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   In the case where the WR was Signaled, the associated Completion
       has been reaped.

   *   In the case where the WR is Unsignaled, one of the following is
       true:

       o   The WR has Completed processing successfully, OR

       o   The associated Completion has been reaped for the WR if the
           Unsignaled WR Completed in error, OR

       o   A Completion associated with a subsequently posted WR to the
           same WQ has been reaped.

   If space is not available on a Work Queue, then an RI MUST return an
   Immediate Error.

   The Unsignaled WR confirmation rules dictate that the Consumer must
   post a WR with the Signaled Completion indicator set with a
   frequency less than or equal to the maximum number of WQEs on the
   SQ. In other words, if X equals the maximum number of WQEs on the
   SQ, then the Consumer must post at least one Signaled Completion
   Work Request every X Work Requests. In addition, the Consumer must
   retrieve a Work Completion of a Signaled Completion with a frequency
   less than or equal to the maximum number of WQEs on the SQ. This is
   done in order to force confirmation that prior Unsignaled WRs are
   Completed. If the Consumer does not follow these rules, a situation
   may arise where the Consumer is unable to post WRs to the SQ. A ULP
   reply based on the data that was in a SQ WR is insufficient for
   determining if the WR has completed, since hardware resources may be
   held in use until the WCs are polled from the CQ.

   The QP can accept Work Requests only when the QP is in a state that
   allows Work Requests to be submitted.

   For details on the Verbs which submit Work Requests, see Sections
   9.3.1.1 - PostSQ and 9.3.1.2 - PostRQ.

8.2.2  Work Request Processing

   Processing of Work Requests submitted to a Work Queue is initiated
   and processed according to the rules in this section.

   It is important to understand the difference between Placement and
   Delivery ordering since RDMAP provides different semantics for the
   two.

   Note that many current protocols, both as used in the Internet and
   elsewhere, assume that data is both Placed and Delivered in order.
   This allowed applications to take a variety of shortcuts that



   Hilland, et al.        Expires October 2003             [Page 134]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   depended on in-order Placement and Delivery. For RDMAP, many of
   these shortcuts are no longer safe to use, and could cause
   application failure. To ensure reliable operation, applications need
   to take the rules described below into account.

   The following rules apply to implementations of the RDMAP protocol:

   1.  Send Type, RDMA Write, and RDMA Read Type Work Requests
       submitted to a Send Queue MUST be initiated and sent in the
       order submitted to the Send Queue.

   2.  Work Requests submitted to a single Send Queue or Receive Queue
       MUST be Completed by the RI in the same order as the Work
       Requests were submitted. Note that this does not apply to WRs
       posted to S-RQs.

   3.  Ordering guarantees for processing and Completion notifications
       exist only between Work Requests submitted to the same Work
       Queue. The RI is NOT REQUIRED to provide ordering guarantees
       across multiple local SQ to remote RQ pairs.

   4.  RDMA Messages MAY be Placed in any order while in the scope of
       the RI. If an application uses overlapping buffers (points
       different Messages or portions of a single Message at the same
       buffer), then it is possible that the last incoming write to the
       Data Sink buffer will not be the last outgoing data sent from
       the Data Source.

   5.  For a Send Type Operation, the contents of the Receive Queue
       Buffer at the Data Sink MAY be indeterminate until the Receive
       Queue Work Request is Completed at the Data Sink.

   6.  For an RDMA Write Operation, the contents of the buffer at the
       Data Sink MUST be considered indeterminate until a subsequent
       Send Type Message is Completed by consuming a Receive Queue WQE
       at the Data Sink.

   7.  For an RDMA Read Operation, the contents of the buffer at the
       Data Sink MUST be considered indeterminate until the RDMA Read
       Type Work Request has been Completed.

       Statements 5, 6, and 7 imply no peeking at the data in a buffer
       to see if all of the data has arrived. It is possible for some
       data to arrive before logically earlier data does, and peeking
       may cause unpredictable application failure

   8.  Except for Unsignaled WRs that complete successfully, the
       resources associated with a Work Request must be considered to
       be in the scope of the RI from the time the Work Request is sub-
       mitted to a Work Queue until the associated Work Completion has



   Hilland, et al.        Expires October 2003             [Page 135]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       been returned. For Unsignaled WRs that complete successfully,
       refer to Section 8.1.3.1 for a description of when the resources
       associated with the Unsignaled WR are freed.

   9.  If the Consumer or Application modifies the contents of Data
       Sink Buffers while the buffers are in the scope of the RI, the
       state of the Data Sink Buffers is indeterminate.

   10. If the Consumer or Application modifies the contents of Data
       Source Buffers while the buffers are in the scope of the RI, the
       state of the Data Sink buffers is indeterminate.

   11. The RI is NOT REQUIRED to guarantee that the Completion of an
       RDMA Write or Send Type WR at the Local Peer means that the ULP
       Message has: reached the Remote Peer, reached the Remote Peer
       ULP Buffer, or been examined by the Remote Peer ULP.

   12. Incoming Untagged RDMAP Messages (sent in FIFO and MSN order)
       MUST use RQ or S-RQ Buffers and Complete through the RQ's CQ, in
       the same order as the Send Message Type Work Requests are posted
       to the Associated QP's Send Queue.

   13. Upon local Completion of an incoming Untagged RDMAP Message the
       RI MUST guarantee that any prior Send or RDMA Write Messages
       from the same Associated QP have also Completed at the Data
       Sink.

   14. If the Consumer overlaps its Data Sink buffers for different
       operations, subsequent Operations MAY cause the RI to overwrite
       the data in those buffers before the Consumer receives and
       processes the Completion.

   15. The RI MAY begin processing subsequent Work Requests posted to
       the Send Queue (except for operations which are affected by a
       fence - see Section 8.2.2.2), before Completing a prior RDMA
       Read Type Work Request (including zero-length RDMA Read Type
       Work Requests). Therefore, when an application does an RDMA Read
       Type Work Request followed by an RDMA Write or Send Type WR
       targeting the same buffer, it MAY return the data from the later
       RDMA Write or Send Type WR in the RDMA Read Operation Data Sink
       buffer, even though the operations Complete in order on the Send
       Queue's Completion Queue. If this behavior is not desired, the
       Local Peer Consumer must set the Read Fence indicator on the
       later RDMA Write or Send Type Work Request.

   16. Before an Inbound RDMA Read Request Message is processed (the
       specified buffer is read), the RI MUST have delivered all prior
       incoming RDMAP Messages initiated from the same Remote Peer's
       Send Queue. Therefore, when an application does an RDMA Write or
       Send Type Work Request followed by an RDMA Read Type Work



   Hilland, et al.        Expires October 2003             [Page 136]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       Request targeting the same remote buffer, the RDMA Read Type WR
       MUST return the data as modified by the prior operations.

   17. The RI MAY Complete incoming Send Message Types before the RI
       has finished generating RDMA Read Response Messages for an
       incoming RDMA Read Request Message (initiated from the same
       Remote Peer's Send Queue). Therefore, indeterminate results may
       occur if an application does an RDMA Read Type Work Request
       followed by a Send Type Work Request, and uses the Work
       Completion on the Associated QP's RQ Completion Queue (for the
       incoming Send Type Message) as an indicator that the inbound
       RDMA Read Operation processing has finished. If this behavior is
       not desired, the Local Peer Consumer must set the Read Fence
       indicator on the later RDMA Write (or Send Type) Work Request.

   18. If more RDMA Read Type Work Requests are posted to the Send
       Queue than are indicated by the ORD QP Attribute, the RI MUST
       pause the processing of the Send Queue until at least one prior
       RDMA Read Type WR Completes. If zero outbound RDMA Read Request
       Messages are supported on the QP, and the Consumer posts an RDMA
       Read Type Work Request, the RI MUST Complete the Work Request in
       error.

   Access by the RNIC to Memory Regions or Memory Windows are NOT
   REQUIRED to be cache-coherent. If an RNIC caches some portion of
   memory buffers during the time that the buffers are being processed
   by the RNIC, there is no requirement that updates to these buffers
   by any entity be seen by the RNIC. Also, any updates to these
   buffers by the RNIC are implementation dependent and may not be
   immediately seen by the system processor, other IO devices, or other
   RNICs.

8.2.2.1  Memory Management Operation Ordering

   This section defines the ordering constraints imposed on Work
   Requests. The next section defines additional ordering constraints
   that can be placed by using the Read Fence or Local Fence indicator.

   Because one of the objectives of DDP is to enable placement of
   incoming out-of-order DDP segments into the buffer provided by the
   Consumer, ordering semantics can not be guaranteed for certain
   operation combinations. If the Work Request sends payload to the
   Remote Peer, just because a Work Request Completes locally does not
   necessarily mean that the Remote Peer has received the data, or that
   subsequent DDP Segment payload can not overwrite the current data if
   targeting the same Remote Peer buffer.

   Thus, for example, an RDMA Write Message, containing payload1
   immediately followed by an RDMA Write Message containing payload2 to
   the same Remote Peer buffer location may result in the remote buffer



   Hilland, et al.        Expires October 2003             [Page 137]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   containing either payload1, payload2, or some combination of
   payload1 and payload2. Thus a programming model that does multiple
   RDMA Write WRs into the same Remote Peer buffer location without an
   end-to-end synchronization mechanism is NOT RECOMMENDED.

   1.  An Incoming Remote Invalidate (the Invalidate portion of the
       Send with Invalidate or Send with Solicited Event & Invalidate
       operation) MUST be performed after the Send Message payload is
       delivered to the appropriate Receive Queue Entry buffer, and
       before the Associated RQ WR Completes.

       Note: Send with Invalidate is usually used by Remote Peers to
       invalidate STags that were enabled for remote access and
       advertised to the Remote Peer. The expected usage is:

       a.  Local Peer Consumer creates a Send WR containing a command
           to be remotely executed and an STag enabled for Remote
           access and posts it to the Send Queue.

       b.  Remote Consumer gets the Send Message through a Completion
           of an RQ buffer, and does one or more accesses to the STag's
           buffers via RDMA Read Type WRs and/or RDMA Write WRs.

       c.  Remote Consumer creates a Send with Invalidate or Send with
           SE and Invalidate WR with the status from the Consumer's
           operation and the original STag to be invalidated as an
           input modifier. Note that the Read Fence indicator would
           most likely be set on the Send with Invalidate or Send with
           SE and Invalidate WR if the remote buffer to be Invalidated
           was accessed using an RDMA Read or RDMA Read with Invalidate
           Local STag WR.

       d.  RI at Local Peer gets the Send with Invalidate or Send with
           SE and Invalidate Message, places the data according to the
           RQ WQE, Invalidates the STag, and creates a CQE on the
           Receive Queue's Completion Queue, which also contains the
           Invalidated STag as part of the CQE.

       e.  Local Consumer checks that the Invalidate STag output
           modifier from the Work Completion is the same as was
           originally sent (as a check on the remote Consumer). If it
           was not, and the Consumer wishes to prevent remote access,
           the Consumer should post an Invalidate Local STag WR for the
           STag.

   2.  RDMA Read with Invalidate Local STag
       The Invalidate portion of the RDMA Read with Invalidate Local
       STag Work Request MUST be performed after the RDMA Read Response
       Message is delivered to the Data Sink buffers, and before a Work
       Completion is retrieved for the RDMA Read with Invalidate Local



   Hilland, et al.        Expires October 2003             [Page 138]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       STag WR. As with RDMA Read, subsequent operations MUST be
       allowed to begin executing before the Invalidate takes place,
       unless the subsequent operations have the Read Fence indicator
       set.

   3.  Fast-Register
       The RI MUST ensure that the Fast-Register operation takes effect
       prior to the execution of any subsequent Work Requests.

   4.  Bind
       The RI MUST ensure that the Bind Memory Window operation takes
       effect prior to the execution of any subsequent Work Requests.

   5.  Invalidate Local STag
       The Invalidate Local STag Work Request MUST take effect prior to
       the execution of any subsequent Work Requests.

   The RI MAY perform Fast-Register WRs, Bind WRs and Invalidate Local
   STag WRs at any time between the posting of the Work Request and the
   execution of a subsequent Work Request. Consequently, it is up to
   the Consumer to ensure that the posting of the Invalidate Work
   Request takes place after the STag is no longer in use.

   SQ processing of Memory Management Operations (Fast-Register, Bind
   and Invalidate Local STag) does not usually require the prior
   operation to Complete before the current operation begins execution.
   Thus it is possible to have an Invalidate Local STag operation be
   applied to an RDMA Write WR Data Source buffer before the RDMA Write
   Message payload has been completely sent. To ensure that this does
   not occur, the Local Fence indicator may be set to require that all
   prior operations Complete first (See Section 8.2.2.2).

   Note that performing a Fast-Register on an already registered
   region, or a Bind on a Window that is already Bound, will result in
   a Completion Error. As such, it is up to the application to ensure
   that the STag is in the Invalid state before the Fast-Register or
   Bind Memory Window Work Request is posted.

   The rules for Invalidate and Fast-Register or Bind Memory Window
   above are based on the following usage model:

       a.  Allocate an STag (through either Allocate Non-Shared Memory
           Region STag or Allocate Memory Window).

       b.  Fast-Register or Bind the STag

       c.  Use the STag in a manner compatible with its Access Rights.






   Hilland, et al.        Expires October 2003             [Page 139]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       d.  Wait for the Completion of the operations using the STag.
           This ensures that the STag and its related buffer is no
           longer in use.

       e.  Invalidate the STag

       f.  Loop to (b) as long as the STag is still needed; otherwise,
           Deallocate the STag.

8.2.2.2  Read Fence and Local Fence Indicators

   Two types of fence indicators are defined in Verbs -                                                      - a Read Fence
   indicator for RDMA Write or Send Type WRs, and a Local Fence
   indicator for Invalidate Local STag WRs. The Read Fence ensures that
   the current WR does not execute until all prior RDMA Read Type WRs
   Complete. The Local Fence indicator ensures that all prior
   operations Complete before the Invalidate Local STag WR is executed.

   Note that in the Verbs specification, a fence indicates that some
   set of prior operations have completed before the current operation
   begins. A different concept is operations that are required to
   Complete before future operations in the SQ can be executed -
   specifically Bind, Fast-Register, and Invalidate Local STag WR. By
   default, these operations do not ensure prior operations have
   completed before they execute. For Invalidate Local STag, if the
   Local Fence indicator is set, it can ensure that all prior SQ
   operations Complete before it executes.

   Note that RDMAP does not provide any end-to-end acknowledgement
   except for an RDMA Read Operation. Thus in general an end-to-end
   fence is not possible without using an RDMA Read Operation, unless
   an explicit ULP exchange of messages is done. Some operations are
   local only operations - specifically PostSQ Invalidate Local STag,
   Bind Memory Window and PostSQ Fast-Register. For combinations of
   these operations and the local buffers which they operate on (the
   Data Source for an RDMA Write and Send Type Operation, or the Data
   Sink for an RDMA Read Operation), it is possible to ensure that a
   current operation is not executed until prior operation which
   operate on the referenced local buffer are Completed.

   Figure 22 shows the fencing semantics when one operation is followed
   by another, and whether that operation will not execute until all
   prior operations have Completed, some prior operations have
   completed, or potentially no prior operations have completed. The
   rows are the first operation, and the columns are the second
   operation. The fields are defined as follows:

   *   NA-1 - a fence is not applicable. An Invalidate must precede
       Bind or Fast-Register. Thus in terms of potential WRs in the SQ,




   Hilland, et al.        Expires October 2003             [Page 140]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       it is the Invalidate Local STag operation that must be fenced to
       ensure proper operation.

   *   NA-2 - A fence is not applicable. This is because RDMAP allows
       RDMA Write Message payloads and Send Type Message payloads to be
       Placed out-of-order. Thus a local Completion of prior WRs does
       not ensure the payload has been Placed at the Remote Peer.

   *   Not Needed - A fence is not needed, because RDMAP requires that
       the RDMA Read Request Message at the Data Source (i.e. the
       Remote Peer) must be executed in order. Note that RDMAP does not
       ensure that operations which are sent after the RDMA Read
       Request Message occur after the RDMA Read Type WR Completes.
       Thus the need for the Read Fence Indicator for RDMA Write and
       Send Type WRs.

   *   Yes, Full - If the Local Fence indicator is set on the
       Invalidate Local STag WR, then the operation and subsequent
       operations will not be executed until all prior operations
       Complete. Note that this can effectively cause a pipeline stall
       in transmission of RDMAP Messages, and should be used
       judiciously.

   *   Yes, Partial - If the Read Fence indicator is set on the RDMA
       Write or Send Type WR, then all prior RDMA Read Type WRs must
       Complete before the current operation can begin execution.


























    Hilland, et al.        Expires October 2003             [Page 141]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   PostSQ     Send    RDMA    RDMA       Bind Fast-    Invalidate
   Work       Type    Write   Read            Register
   Request

   Send Type  NA-2    NA-2    Not Needed NA-1 NA-1     Yes, full

   RDMA Write NA-2    NA-2    Not Needed NA-1 NA-1     Yes, full

   RDMA Read  Yes,    Yes,    Not Needed NA-1 NA-1     Yes, full
               Partial Partial

   Bind       NA-2    NA-2    Not Needed NA-1 NA-1     Yes, full

   Fast-      NA-2    NA-2    Not Needed NA-1 NA-1     Yes, full
   Register

   Invalidate NA-2    NA-2    Not Needed NA-1 NA-1     Yes, full


                  Figure 22 - Fencing on Prior Operations

   The following paragraphs provide the rules which dictate the above
   behavior.

   Read Fence - set in RDMA Write or Send Type Work Requests to ensure
       all prior RDMA Read Type WRs have been processed by the RI.
       The RI MUST provide a Read Fence indicator for Send Type Work
       Requests and RDMA Write Work Requests. This indicator MUST cause
       the RI to pause before the execution of the Read Fenced Work
       Request if all prior RDMA Read Type Work Requests are not
       complete. Once all prior RDMA Read Type Work Requests are
       complete the RI MUST resume SQ processing.

   Local Fence - set in Invalidate Local STag Work Requests to ensure
       all prior operations have been processed by the RI.
       The RI MUST provide a Local Fence indicator for the Invalidate
       Local STag Work Request. This Indicator MUST cause the RI to
       wait until all prior Work Requests on the Send Queue Complete.
       Once all prior WRs on the SQ complete, the RI MUST resume SQ
       processing.

       Note: This indicator may be used by the Consumer when there are
       insufficient STags available to allow them to remain in use
       until the Consumer can process the Completions for Work Requests
       using those STags. For example, the following sequence could be
       used:

       a.  Allocate an STag (either Allocate Non-Shared Memory Region
           or Allocate Memory Window)



   Hilland, et al.        Expires October 2003             [Page 142]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       b.  Fast-Register or Bind the STag

       c.  Use the STag in a manner compatible with its Access Rights.

       d.  Invalidate the STag using an Invalidate Local STag Work
           Request with the with Local Fence indicator set.

       e.  Loop to (b) as long as the STag is still needed; otherwise,
           Deallocate the STag by invoking the Deallocate STag Verb.

       Using this model, the application can reuse an STag multiple
       times without having to wait for the prior Work Request to
       Complete before posting the next Work Request. Using the Local
       Fence indicator may require the RI to stall before processing
       the Invalidate Local STag Work Request, reducing the rate of
       Send Queue processing.

   Implementation of an end-to-end fence - using an RDMA Write WR
       followed by an RDMA Read Type WR.
       An end-to-end fence ensures that all outstanding operations have
       been flushed from the network fabric prior to the next operation
       executing. [RDMAP] enables an application to use an RDMA Read
       Operation to ensure that all RDMA Write Operations and Send Type
       Operations prior to the RDMA Read Operation on the same RDMAP
       Stream have made it to remote memory and can be read back by any
       other RDMAP Stream connecting through the same remote RNIC with
       access to the remote memory. The RDMA Read Operation need not be
       to any of the data written, and can even be a zero length RDMA
       Read Operation (which does not even require a valid Data Source
       STag) to have this effect. This enables the Consumer to
       implement an end-to-end fence by waiting for a RDMA Read WR
       Completion to determine that data is up to date at the Remote
       Peer.

       If the requirement, for example, is to ensure, from the Data
       Source, that one RDMA Write Message has been Placed at the
       Remote Peer before another RDMA Write Message occurs, the
       following sequence can be used by the Consumer:

       a.  Perform one (or more) RDMA Write WR(s).

       b.  Perform an RDMA Read Type WR (zero length is acceptable)

       c.  Perform a second RDMA Write WR with the Read Fence indicator
           enabled on the Work Request.

8.2.3  Completion Processing

   A CQE is an internal representation of the Work Completion. The
   results from a Work Request operation are placed in a Completion



   Hilland, et al.        Expires October 2003             [Page 143]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Queue Entry (CQE) on the CQ associated with the Work Queue when the
   request has completed. A CQE MUST be generated for each WQE that
   results in a Work Completion.

8.2.4  Returning Completed Work Requests

   All Work Completions are abstracted through the Verbs. The only
   method of retrieving a Work Completion MUST be through the Poll for
   Completion Verb. The RI MUST enable the Consumer to be able to
   retrieve WCs resulting from WRs posted to QPs which are in any valid
   QP state. Note that a destroyed QP is not in a valid QP state. See
   Section 6.1.4.

   A Work Request is confirmed Complete when the associated Work
   Completion is retrieved from its CQ. The RI MUST NOT return a Work
   Completion for an Unsignaled Work Request that completed
   successfully. When the RI returns a single WC through Poll for
   Completion, it MUST free at least one CQE. Note that more than one
   CQE may be freed due to Unsignaled Completions. See Section 8.1.3.1,
   Signaled Completions, for the rules on determining when Unsignaled
   Work Requests have Completed.

   When a Work Request has Completed, any Scatter/Gather Elements or
   other information associated with the original WR are no longer in
   the domain of the RI. The RI MUST NOT access any memory locations
   referenced by the Scatter/Gather Elements, Local Address or Remote
   Address for a WR that has Completed. The RI MUST provide Work
   Completions through the Poll for Completion Verb no more than once
   per Work Request. Note that if Destroy QP is invoked with Work
   Requests pending, the Work Completion may be lost.

   The Work Completion contents are specified in 9.3.2.1 - Poll for
   Completion.

   A Consumer is able to find out if a Work Completion is available by
   polling or notification.

   Work Completions MUST be returned when the Consumer polls the CQ in
   the following cases:

   *   On Completion of a Work Request submitted to a Send Queue with a
       Signaled Completion.

   *   On Completion of a Work Request submitted to a Send Queue that
       completed in error.

   *   On Completion of a Work Request submitted to a Receive Queue.

   When the Consumer desires to know if a QP has had all of its WRs
   retrieved and the Work Queues are empty, but there may be only



   Hilland, et al.        Expires October 2003             [Page 144]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Unsignaled Work Requests on the Send Queue, the Consumer can
   transition the QP to the Error state (See Section 6.2.4) and then to
   the Idle state. This will guarantee that all WRs have been
   Completed. In order to ensure that the WQEs have been freed and the
   entries on the CQ have been made available, the Consumer should free
   any associated CQEs, if any are consumed. There are three methods
   for a Consumer to free the CQE consumed within the CQ. They are:

   *   for the Consumer to poll the CQ (See Section 9.3.2.1 - Poll for
       Completion (Poll CQ)) until the CQ is empty, or

   *   the Consumer retrieves a WC for a WR submitted to a Work Queue
       associated with the same CQ where the former WR was submitted
       and the new WR was submitted after the previous QP was
       destroyed, or

   *   the Consumer polls (See Section 9.3.2.1 - Poll for Completion
       (Poll CQ) a number of Work Completions equal to the total number
       of entries that the CQ can hold.

8.2.5  Asynchronous Completion Notification

   A Consumer of a CQ may request asynchronous notification of when
   CQEs have been added to a Completion Queue by invoking the Request
   Completion Notification Verb. The Verbs architecture assumes a
   Privileged Mode intermediary will process Asynchronous CQ Events for
   CQs. The Verbs architecture allows this intermediary to register one
   or more CQ Event Handlers for Asynchronous CQ Events by invoking the
   Set Completion Event Handler Verb. It is the responsibility of this
   intermediary to create the asynchronous completion notification to
   the Consumer that called the Request Completion Notification Verb.

   A Completion Event Handler Identifier delineates each Completion
   Event Handler. The Set Completion Event Handler is invoked once per
   supported Completion Event Handler. Note that the maximum number of
   supported Completion Event Handlers is returned by Query RNIC.

   Each Set Completion Event Handler invocation can be used to:

   *   Return a Completion Event Handler Identifier that is used as an
       input modifier to Create CQ (to associate a CQ with a Completion
       Event Handler).

   *   Clear a Completion Event Handler associated with the Completion
       Event Handler Identifier.

   *   Modify the address of the Completion Event Handler for the
       Completion Event Handler Identifier.





   Hilland, et al.        Expires October 2003             [Page 145]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The RI is NOT REQUIRED to disassociate CQs from CQ Event Handlers
   when those CQ Event Handlers associated with the Completion Event
   Handler Identifiers are cleared. If a CQ Event Handler is cleared
   and the Consumer still has CQs associated with that CQ Event Handler
   (through the CQ Event Handler Identifier), and a Completion occurs
   which would have invoked the CQ Event Handler, behavior of the RI is
   indeterminate. The Consumer should keep this in mind before clearing
   the association to prevent indeterminate behavior, such as possible
   race conditions.

   The Request Completion Notification Verb is set on a per CQ basis.
   When armed, the RI MUST generate at most one notification until the
   notification has been rearmed by invoking Request Completion
   Notification Verb. Once Completion Notifications have been enabled,
   additional Request Completion Notification calls have no effect. The
   Completion Event Handler will be called only once when the next CQE
   is added to the CQ. The RI MUST invoke the Completion Event Handler
   associated with the CQ Event Handler Identifier which is associated
   with the CQ where the CQE was added. Once the Completion Event
   Handler routine has been invoked, the Consumer should call Request
   Completion Notification again to be notified when a new entry is
   added to the CQ, since the notification is a "one shot" mechanism.

   Existing CQEs on the CQ at the time the notification is enabled do
   not result in a call to the Completion Event Handler. The Completion
   Event Handler MUST be called when the next CQE is added to the CQ
   after the Request Completion Notification has been set.

   The RI MUST provide the ability for the Consumer to specify whether
   the Completion Event Handler is invoked for either:

   *   the next Solicited Completion Event only, or

   *   the next Completion Event.

   If the local Consumer requests the next solicited Completion in the
   Request Completion Notification Verb, the RI MUST generate a
   Completion Event when:

   *   an incoming Send with Solicited Event or Send with SE and
       Invalidate successfully causes a Receive Queue's WQE to be
       consumed, and thus a CQE to be added to a CQ, or

   *   a Work Completion for a Work Request which Completed in error is
       added to a CQ.

   If the Consumer requested an event for the next completion in the
   Request Completion Notification Verb, the RI MUST generate a
   Completion Event when any incoming Send operation type or Signaled
   Local SQ WR completes.



   Hilland, et al.        Expires October 2003             [Page 146]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   If multiple calls to Request Completion Notification have been made
   for the same CQ and at least one of the requests set the type to the
   next Work Completion, the RI MUST invoke the CQ event handler when
   the next CQE is added to that CQ. The CQ Event Handler MUST be
   called only once, even if multiple CQ notification requests were
   made prior to the Completion Event for the specified CQ.

   The RI MUST ensure that the following sequence of events will not
   result in a Completion Notification being missed. Therefore, the
   following sequence of calls should be used by the Consumer when
   using Request Completion Notification in order to ensure that a new
   CQE is not missed for the specified CQ:

   *   Call Poll for Completion to dequeue all existing CQ entries

   *   Call Request Completion Notification.

   *   Call Poll for Completion to dequeue all of the CQ entries that
       were added between the time the last Poll for Completion was
       called and the notification was enabled.

   When the Completion Event Handler is invoked, the RI MUST supply the
   CQ handle of the CQ which generated the Completion notification.

   The Consumer is responsible for polling the CQ to retrieve the Work
   Completion. This function MUST NOT be performed automatically by the
   RI when the notification occurs.

   For details on the Asynchronous Completion Verbs, refer to Section
   9.4.1 - Set Completion Event Handler and Section 9.3.2.2 - Request
   Completion Notification.

8.3  Error Handling

   The following section details many of the errors that can occur when
   using the RNIC, and the responsibilities of the RNIC and the
   Consumer.

   Errors are returned to the Consumer by one of three mechanisms:
   Immediate Errors, Work Completions, or Asynchronous Error Events.
   Immediate Errors are returned immediately as an Output Modifier of a
   Verb. Work Completions are used when the error can be related
   directly to the Work Request in progress. Asynchronous Error Events
   are used when the error can only be localized to the QP, CQ or RNIC
   but are not directly attributable to any single Work Request. Each
   of these errors is described below.







   Hilland, et al.        Expires October 2003             [Page 147]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

8.3.1  Immediate Errors

   Immediate Errors are those surfaced as Verb results provided to the
   Consumer via Output Modifiers. The individual Immediate Errors are
   documented within each Verb in Section 9 - RNIC Verbs. A summary of
   all of the Immediate Errors are covered in Section 9.5.1 - Immediate
   Status Codes.

   When the RI returns an Immediate Error, the RI MUST NOT affect the
   RI Resource that is the subject of the verb for which the Immediate
   Error is being returned, except for RI-Reregister Non-Shared Memory
   Region (which has slightly different rules). That is, for an
   Immediate Error returned on any verb that has the:

   - RI as the subject, the RI remains unchanged;

   - CQ as the subject, the CQ remains unchanged;

   - QP as the subject, the QP remains unchanged;

   - S-RQ as the subject, the S-RQ remains unchanged;

   - STag as the subject, the STag remains unchanged (except certain
   rules for RI-ReRegister Memory Region);

   - PD as the subject, the PD remains unchanged;

   - Asynchronous Event handling as the subject, Asynchronous Events
   must not be lost.

8.3.2  Work Completion Errors

   The following errors can be associated with a specific Work Request.
   The RI MUST return a Completion Error via a Work Completion on the
   Completion Queue associated with the Send or Receive Queue on which
   the Work Request was posted for the errors defined in Figure 23. The
   Work Completion's Completion Status field contains the Error
   information. In each case, the QP MUST be moved to the Terminate
   state and a Terminate Message is sent with the indicated Terminate
   code (see Section 6.6.2.5 - Local Termination, Local Abortive
   Teardown and Remote Abortive Teardown). On any Work Completion that
   includes the sending of a Terminate Message, the Terminate Message
   Buffer MUST be available for examination while the QP is in the
   Terminate state or Error state using Query QP. The Terminate Message
   may contain useful diagnostic information, depending on the error.
   For information on the format of the Terminate Message, see [RDMAP].







   Hilland, et al.        Expires October 2003             [Page 148]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Error                            Terminate Action
                                 Code

Receive Queue Work Request Errors - These errors are probably due to a
local Consumer error.

Invalid WQE format,              0x0000    The RI Terminates the LLP
Invalid STag in SGE,                        Stream with Local
Base and bounds violation                   Catastrophic Error and the
(including length errors),                  QP transitions to the
Access Rights violation,                    Terminate state.
Invalid PD ID,
Wrap error (TO & Segment Length
caused an address to wrap).

Receive Queue Remote Protection Errors - These errors may be due to a
Consumer error at either end.

Invalidate STag Invalid.         0x0100    The RI Terminates the LLP
                                            Stream with the indicated
Invalidate STag Access Rights.   0x0102     Error and the QP
                                            transitions to the
Invalidate STag Invalid PD ID.   0x0103     Terminate state.
or STag not Bound to QP.

Invalidate MR STag had Bound MW. 0x0109

Send Queue Work Request Errors - These errors are probably due to a
local Consumer error.

Invalid WQE format,              0x0000    The RI Terminates the LLP
Zero ORD.                                   Stream with Local
                                            Catastrophic Error and the
                                            QP transitions to the
                                            Terminate state.

Local SQ Protection Errors -     0x0000    The RI Terminates the LLP
Send Types, RDMA Writes, and                Stream with Local
RDMA Read Types:                            Catastrophic Error and the
Invalid STag,                               QP transitions to the
Base and bounds violation                   Terminate state.
(including length errors),
Access Rights violation,
Invalid PD ID,
Wrap error.







   Hilland, et al.        Expires October 2003             [Page 149]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error              Terminate Action
                                 Code

SQ Fast-Register errors:         0x0000    The RI Terminates the LLP
QP not in Privileged Mode,                  Stream with Local
Invalid Region STag,                        Catastrophic Error and the
Invalid Physical Buffer Size,               QP transitions to the
Physical Buffer List too long,              Terminate state.
STag not in Invalid state,
Invalid PD ID,
Invalid Access Rights Specified,
Invalid Virtual Address,
Invalid FBO,
Invalid Length

SQ Bind errors:                  0x0000    The RI Terminates the LLP
Invalid Region STag                         Stream with Local
Invalid Window STag                         Catastrophic Error and the
Base and bounds violation                   QP transitions to the
Access Rights violation                     Terminate state.
STag not in Invalid state
MR not in Valid state
Invalid PD ID

SQ Invalidate errors             0x0000    The RI Terminates the LLP
 (Footnote 6):                              Stream with Local
Invalid STag                                Catastrophic Error and the
Invalid PD ID (or QP ID)                    QP transitions to the
Invalidate MR STag had Bound MW             Terminate state.

       Figure 23 - Completion Errors with Resulting Terminate Codes

8.3.3  Asynchronous Errors

   The Consumer may register an Asynchronous Event Handler to be called
   when an Asynchronous Event occurs which is not associated with an
   individual CQE by using the Set Asynchronous Event Handler Verb.

   An input modifier to the Set Asynchronous Event Handler Verb is the
   address of the event handler routine. This is a Consumer routine
   that is invoked when an Asynchronous Event is generated. When the
   handler routine is invoked, an indication of the origin of the
   error, called an Event Record, is provided.




   Footnote 6: This includes RDMA Read and Invalidate.




   Hilland, et al.        Expires October 2003             [Page 150]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   The errors defined in Figure 24 are returned to the Consumer via an
   Event Record in the Asynchronous Event Handler.

   There is only one Asynchronous Event Handler per RNIC. If Set
   Asynchronous Event Handler Verb is called more than once, the new
   handler MUST replace the previous handler. The RI MUST turn off
   Asynchronous Event Notification if the Asynchronous Event Handler's
   address is zero.

   After the Asynchronous Event Handler is registered, all subsequent
   asynchronous events not associated with a CQE MUST result in a call
   to the handler. Until an Asynchronous Event Handler is registered,
   asynchronous events will be lost.

   For more information, see Section 9.4.2 - Set Asynchronous Event
   Handler and Section 9.5.3 - Asynchronous Event Identifiers.

   The following table covers the errors that can be associated with a
   QP, thus the Event Record should include the QP ID when the error is
   associated with a specific QP. On any Asynchronous Error Event that
   includes the reception or sending of a Terminate Message, the
   Terminate Message Buffer is available for examination while the QP
   is in the Terminate or Error state by retrieving it through Query
   QP. Note that Terminate Messages generated locally as well as
   Terminated Messages received from the Associated QP are available
   through Query QP. The Terminate Message may contain useful
   diagnostic information, depending on the error. For information on
   the format of the Terminate Message, see [RDMAP].

              Error               Terminate Action
                                   Code

Remotely detected Errors

"Terminate Message Received"       None      QP -> Terminate state. See
An incoming Terminate Message has            6.6.2.4 Remote Termination
arrived.















   Hilland, et al.        Expires October 2003             [Page 151]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error               Terminate Action
                                   Code

LLP Errors - Errors on incoming RDMAP Segments or Messages probably due
to the Remote Peer or fabric corruption.

"LLP Connection Lost" -            None      QP -> Error state. See
Usually caused by Timeout or Too             6.6.2.4 Remote Termination
many Retries at the LLP.

"LLP Connection Reset" -           None      QP -> Error state. See
Caused by an incoming Reset at the           6.6.2.4 Remote Termination
LLP.

"LLP Integrity Error: Segment size 0x1000    If this cannot be
invalid" -                                   corrected by the LLP (drop
The incoming segment is too small            and retry etc.), then
to contain a valid RDMAP header,             QP -> Terminate state.
or larger than supported by this             The RI Terminates the LLP
implementation.                              Stream with the indicated
                                             error. See 6.6.2.5.
" LLP Integrity Error: Invalid     0x0202
CRC" -
The incoming segment had a bad LLP
CRC.

"Bad FPDU" -The incoming segment   0x0203
Received MPA marker and 'Length'
fields do not agree on the start
of a FPDU






















   Hilland, et al.        Expires October 2003             [Page 152]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error               Terminate Action
                                   Code

Remote Operation Errors - Protocol Errors on incoming RDMAP Segments or
Messages probably due to the Remote Peer.

Invalid DDP version                0x1206    QP -> Terminate state. The
                                             RI Terminates the LLP
Invalid RDMA version               0x0205    Stream with the indicated
                                             error. See 6.6.2.5.
Unexpected Opcode                  0x0206

Invalid DDP Queue Number           0x1201

Invalid RDMA Read Request - RDMA   0x1201
Read not enabled

No 'L' bit when expected           0x0207

Remote Protection Errors (not associated with the RQ) - Protection
Errors on incoming DDP Segments or RDMAP Messages that are not RDMA
Read Request Messages, probably due to the Remote Peer's Consumer.

Invalid STag                       0x1100    QP -> Terminate state. The
                                             RI Terminates the LLP
Base and bounds violation          0x1101    Stream with the indicated
                                             error. See 6.6.2.5.
Access Rights violation            0x1102

Invalid PD ID                      0x1102

Wrap error - TO and segment length 0x1103
caused an address wrap past
0xFFFFFFFFFFFFFFFF


















   Hilland, et al.        Expires October 2003             [Page 153]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error               Terminate Action
                                   Code

Remote Closing Error - Probably due to Consumer not properly
synchronizing the ULP close operation.

Bad Close - QP in Closing state    None      QP -> Error state.
and: Segment arrives, at least one
SQ WQE on the SQ, or RDMA in
progress.

Bad LLP Close - LLP Close received 0x0207    QP -> Terminate state. The
AND (the Send Queue was NOT empty            RI Terminates LLP Stream
OR the IRRQ was NOT empty)                   with indicated error. See
(Footnote 7)                                 6.6.2.5.

Remote Protection Errors associated with the Receive Queue - Protection
Errors on incoming RDMAP Segments or Messages probably due to the
Remote Peer's Consumer.

Invalid MSN - MSN range not valid  0x1202    QP -> Terminate state. The
                                             RI Terminates LLP Stream
                                             with indicated error. See
                                             6.6.2.5.

Invalid MSN - gap in MSN           0x1202    QP -> Terminate state. The
                                             RI Terminates LLP Stream
                                             with indicated error. See
                                             6.6.2.5.

IRRQ Protection Errors - Error processing an incoming RDMA Read Request
and generating the outgoing RDMA Read Response.

Invalid STag                       0x0100    QP -> Terminate state. The
                                             RI Terminates the LLP
Base and bounds violation(includes 0x0101    Stream with the indicated
RDMA Read Request larger than                error. See 6.6.2.5.
supported by the Data Source STag)

Access Rights violation            0x0102

Invalid PD ID                      0x0103




   Footnote 7: For TCP this would be a 1/2 close and a Terminate
   Message could be sent. For SCTP, no Terminate Message is sent.



    Hilland, et al.        Expires October 2003             [Page 154]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error               Terminate Action
                                   Code

Wrap error - TO and length caused  0x0104
an address wrap past
0xFFFFFFFFFFFFFFFF

Invalid MSN - too many RDMA Read   0x1203
Request Messages in process

Invalid MSN - gap in MSN (RDMA     0x1203
Messages found missing when LLP
claims a Message is delivered.)

Invalid MSN - MSN range is not     0x1203
valid (MSN is unreasonably beyond
the end of the queue.)

Local Errors

CQ/SQ error - An error occurred on 0x0207    QP -> Terminate state. The
the CQ during a SQ completion.               CQ number itself must be
CQ Overflow error                            determined by using Query
CQ Operation error                           QP. The RI Terminates the
                                             LLP Stream with the
CQ/RQ error - An error occurred on 0x0207    indicated error. See
the CQ during a RQ completion.               6.6.2.5.
CQ Overflow error
CQ Operation error

S-RQ error on a QP - An error      0x0207    QP-> Terminate state. The
occurred while attempting to pull            S-RQ can be determined by
a WQE from the S-RQ associated               using Query QP. The RI
with the QP.                                 Terminates the LLP Stream
                                             with the indicated error.
                                             See 6.6.2.5.

Local QP Catastrophic Error - An   0x0207    The RI will attempt to
error related to the QP occurred             move the QP to the Error
while processing (probably a                 state. The QP is most
problem with the RNIC).                      likely unusable and should
                                             be destroyed.

      Figure 24 - Affiliated Asynchronous Errors with Terminate Codes

   Figure 25 indicates errors that cannot be associated with a QP; the
   Asynchronous Event Record MUST contain the additional information as
   indicated in the table.



   Hilland, et al.        Expires October 2003             [Page 155]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

              Error               Terminate Action
                                  Code

Locally detected Catastrophic Errors

CQ Operation Error - An error     None      The Asynchronous Event
occurred on the CQ unrelated to             Record includes the CQ
a specific QP completion.                   handle. All completions on
                                            the CQ are in an undefined
                                            state. It may be necessary
                                            to destroy any QPs
                                            targeting the CQ and
                                            destroy the CQ.

Shared Receive Queue              None      The Asynchronous Event
Catastrophic Failure - A problem            Record includes the S-RQ
occurred with the RNIC or its               handle. All WRs on the S-RQ
driver that renders the RNIC                are in an undefined state.
unable to use the S-RQ.                     It may be necessary to
                                            destroy any QPs using the
                                            S_RQ and destroy the S-RQ.

RNIC Catastrophic failure - A     0x0208    The Asynchronous Event
problem occurred with the RNIC              Record does not include any
or its driver that renders the              additional information. If
RNIC unable to reliably                     possible, the RI Terminates
function. All RNIC/QP/CQ state              all LLP Connections with
is indeterminate. The only                  Global Catastrophic Error.
recovery is to close the RNIC               See 6.6.2.5
(and reopen it if desired).

     Figure 25 - Unaffiliated Asynchronous Errors with Terminate Codes




















   Hilland, et al.        Expires October 2003             [Page 156]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9  RNIC Verbs

   The Verbs described in this chapter provide an abstract definition
   of the functionality provided to a host by a RI. Host RIs that are
   compliant with this specification MUST exhibit the semantic behavior
   described by the Verbs.

   Since the Verbs define the behavior of the host RI, they may
   influence the design of software constructs, such as application
   programming interfaces (APIs), which provide access to the host RI.
   However, this specification explicitly does not define any such API.
   In particular, there is no requirement that an API used with a
   compliant host RI be semantically identical to, or expose the
   semantics of, the Verbs. For example, whether the input modifiers
   referenced in the Verbs are pass-by-reference or pass-by-value is
   outside the scope of this specification.

   It is OPTIONAL for an RI to implement Block Lists. It is OPTIONAL
   for an RI to implement S-RQs. Support for S-RQs can be discovered
   using Query RNIC. Support for Block Lists can be discovered by
   attempting to open the RNIC in Block Mode. If the Verb fails with
   the error "Block List Not Supported", the RNIC does not support
   Block Mode.

   The RI MUST use the values and information provided in the Input
   Modifiers when processing the requests and operations instantiated
   in the Verbs for mandatory features. The RI MUST use the values and
   information provided in the Input Modifiers when processing the
   requests and operations instantiated in the Verbs for optional
   features if the RI supports that optional feature.

9.1  Consumer Accessibility

   Verb Consumers are the direct users of the Verbs, and are sub-
   divided into two classes, Privileged and Non-Privileged.

   Privileged Consumers are typically those Consumers that operate at a
   privilege level sufficient to access OS internal data structures
   directly, and have the responsibility to control access to the RNIC
   Interface. All Verbs are available for use by Privileged Consumers.

   Non-Privileged Mode Consumers are those Consumers that must rely on
   another agent, having a sufficient high level of privilege, to
   manipulate OS data structures. Only those Verbs specifically labeled
   as such are available to be used by Non-Privileged Mode Consumers.
   Conceptually, the intent is that Non-Privileged Mode Consumers are
   not allowed to manipulate RI resources that could affect a QP in a
   different Protection Domain. Any manipulation of resources that can
   affect another Protection Domain, such as registering physical




   Hilland, et al.        Expires October 2003             [Page 157]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   memory, are assumed to be done by a trusted intermediary, or
   Privileged Consumer.

   The Protection Domain provides a mechanism to detect when a Consumer
   is posting WRs to QPs with which it is not associated. The RI also
   usually provides a mechanism to help prevent posting WRs to QPs not
   directly owned by the Consumer (e.g. a multi-Consumer application
   which shares the same PD). But it may still be possible to post a WR
   to a QP that is not owned by the Consumer in some environments.
   Preventing access to memory structures such as QPs not directly
   created by that Consumer can be partially provided by the Local
   HostÆs operating environment through the use of the virtual memory
   subsystem and mapping of RNIC resources. Since this is
   implementation and environment dependent, the mechanism describing
   it is outside the scope of the architecture.

   All Verbs can be accessed by Privileged Mode Consumers. To maintain
   the access control over RI resources, the host environment MUST
   provide Non-Privileged Mode Consumers with direct access to only the
   following Verbs:

   *   PostSQ

   *   PostRQ

   *   Poll for Completion

   *   Request Completion Notification

9.2  RNIC Resource Management

9.2.1  RNIC

9.2.1.1  Open RNIC

   Description:

       Opens the specified RNIC.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.1.2 - Opening an RNIC.

   Input Modifiers:

   *   The unique identifier for this RNIC. The naming scheme is
       implementation dependent.






   Hilland, et al.        Expires October 2003             [Page 158]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The Physical Block List mode of the RNIC. This MUST either be
       Block List mode or Page List mode. Block List mode is only valid
       if the RNIC supports it.

   Output Modifiers:

   *   If the operation completed successfully:

       o   RNIC Handle.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid Modifier (RNIC name).

       o   Block List mode not supported.

       o   RNIC in use.

9.2.1.2  Query RNIC

   Description:

       Returns the attributes for the specified RNIC.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.1.3 - Query RNIC.

   Input Modifiers:

   *   RNIC handle.

   Output Modifiers:

   *   RNIC Attributes & Values, if the operation completed
       successfully:

       o   Vendor specific information. This could, but is not required
           to, include information such as a vendor identifier, part
           number and/or hardware version.

       o   The maximum number of QPs supported by this RNIC.

       o   The maximum number of outstanding Work Requests on any Send
           Queue or Receive Queue supported by this RNIC.




   Hilland, et al.        Expires October 2003             [Page 159]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The maximum number of outstanding Work Requests on any S-RQ
           supported by this RNIC. If S-RQs are not supported by this
           RNIC, this number is zero.

       o   The maximum number of Scatter/Gather Elements per Send
           Operation Type Work Request supported by this RNIC. This
           value also applies to the maximum number of Scatter/Gather
           Elements for WRs posted to Receive Queues as well as those
           posted to Shared-Receive Queues.

       o   The maximum number of Scatter/Gather Elements per RDMA Write
           Work Request supported by this RNIC.

       o   The maximum number of CQs supported by this RNIC.

       o   The maximum number of entries in each CQ supported by this
           RNIC.

       o   The maximum number of CQ Event Handlers supported by this
           RNIC.

       o   The maximum number of Memory Regions supported by this RNIC.

       o   The maximum number of Physical Buffer Entries per Physical
           Buffer List.

       o   The maximum number of Protection Domains supported by this
           RNIC.

       o   The maximum number of inbound RDMA Read Request Messages
           that can be in the IRRQ per RNIC. This is the per RNIC
           parameter that represents the maximum total value of IRD for
           all QPs. This value MUST be Zero if the resources used to
           handle Inbound RDMA Read Requests are not shared between
           QPs. (For more information, see Section 6.5 - Outstanding
           RDMA Read Resource Management)

       o   The maximum number of outbound RDMA Read Request Messages
           that can be outstanding per RNIC. This is the per RNIC
           parameter that represents the maximum total value of ORD for
           all QPs. This value is Zero if the resources used to handle
           outstanding Outbound RDMA Read Request Messages are not
           shared between QPs.

       o   The maximum number of inbound RDMA Read Request Messages
           that can be in the IRRQ per QP. This represents the maximum
           value for IRD for any QP.






   Hilland, et al.        Expires October 2003             [Page 160]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The maximum number of outbound RDMA Read Request Messages
           that can be outstanding per QP. This represents the maximum
           value for ORD for any QP.

       o   Ability of this RNIC to support modifying IRD after the QP
           has been created.

       o   Ability of this RNIC to support increasing ORD after the QP
           has been created.

       o   The maximum number of Memory Windows supported by this RNIC.

       o   The ability of this RNIC to support modifying the maximum
           number of outstanding Work Requests per QP. (For more
           information, see Section 6.1.3 - Modifying Queue Pair
           Attributes)

       o   The Physical Block List mode of the RNIC. This MUST either
           be Block List Mode or Page List Mode.

       o   If Block List Mode is supported:

           +   The Physical Buffer Entry range of sizes supported by
               this RNIC.

       o   If Page List Mode is supported:

           +   The List of Page sizes supported by this RNIC.

       o   The ability of this RNIC to support Shared Receive Queues.

       o   The ability of this RNIC to perform CQ Overflow detection.

       o   If Shared Receive Queues are supported:

           +   The maximum number of Shared Receive Queues supported by
               this RNIC.

           +   The dequeuing model the RNIC supports: arrival order or
               sequential order.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

9.2.1.3  Close RNIC

   Description:



   Hilland, et al.        Expires October 2003             [Page 161]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       Closes and resets the specified RNIC.

       This Verb is responsible for de-allocating resources allocated
       by the RI and to make the RNIC unavailable for use by the
       Consumer.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.1.4 - Closing an RNIC.

   Input Modifiers:

   *   RNIC handle.

   Output Modifiers:

   *   Verb Results

       o   Operation completed successfully.

       o   Invalid RNIC handle.

9.2.2  Protection Domain

9.2.2.1  Allocate PD

   Description:

       Allocates an unused Protection Domain.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.2.1 - Allocating a PD.

   Input Modifiers:

   *   RNIC Handle.

   Output Modifiers:

   *   If the operation completed successfully:

       o   PD ID.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.




   Hilland, et al.        Expires October 2003             [Page 162]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Insufficient resources to complete request.

9.2.2.2  Deallocate PD

   Description:

       Deallocates a previously Allocated Protection Domain.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.2.2 - Deallocating a PD.

       The Protection Domain MUST NOT be deallocated if it is still
       associated with any Queue Pair, Non-Shared Memory Region, Shared
       Memory Region, Shared Receive Queue, Bound Memory Window or
       Invalidated Memory Window.

   Input Modifiers:

   *   RNIC Handle.

   *   PD ID.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid PD ID.

       o   Invalid RNIC handle.

       o   Protection Domain is in use.

9.2.3  Completion Queue

9.2.3.1  Create CQ

   Description:

       Creates a CQ on the specified RNIC. In addition, a Completion
       Event Handler may be registered for the created CQ.

       The Consumer must specify the minimum number of entries in the
       CQ. The number of allocated entries for CQEs on the specified
       CQ, which might be different than the number requested, is
       returned on successful creation. The number returned differs
       only when the number of actual entries is greater than the
       number that the Consumer requested. If the maximum number of



   Hilland, et al.        Expires October 2003             [Page 163]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       entries the RNIC supports is less than the Consumer requested,
       an Immediate Error is returned and the CQ is not created.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.3.1 - Creating a Completion Queue.

   Input Modifiers:

   *   RNIC handle.

   *   The minimum number of entries in the CQ.

   *   Completion Event Handler Identifier - An opaque handle used to
       identify a Completion Event Handler. If the identifier is set to
       zero, then there is no Completion Event Handler associated with
       this CQ. Completion Event Handler Identifiers are obtained via
       the Set Completion Event Handler Verb.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The handle of the newly created CQ.

       o   The allocated number of entries in the CQ.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Number of CQ entries requested exceeds RNIC capability.

       o   Invalid Completion Event Handler Identifier

9.2.3.2  Query CQ

   Description:

       Returns the number of entries in the specified CQ.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.3.2 - Querying Completion Queue Attributes.

   Input Modifiers:



   Hilland, et al.        Expires October 2003             [Page 164]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   RNIC handle.

   *   CQ handle.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The allocated number of entries in the CQ.

       o   The Completion Event Handler Identifier.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

9.2.3.3  Modify CQ

   Description:

       Resizes the CQ.

       A CQ must be able to be resized with outstanding Work
       Completions on the CQ and Work Requests on queues associated
       with the specified CQ. If the requested minimum number of
       entries in the CQ is insufficient to hold the current number of
       entries on the CQ, an Immediate Error will result.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.3.3 - Modifying Completion Queue Attributes.

   Input Modifiers:

   *   RNIC handle.

   *   CQ handle.

   *   The minimum number of entries in the CQ.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The allocated number of entries in the CQ.




   Hilland, et al.        Expires October 2003             [Page 165]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

       o   Number of CQ entries requested exceeds RNIC capability.

       o   An Attempt to shrink the size of the queue failed because
           too many Completion Queue Entries were still present on the
           Completion Queue.

9.2.3.4  Destroy CQ

   Description:

       Destroys the specified CQ.

       The CQ cannot be destroyed if any Work Queue is still associated
       with the CQ.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 5.3.4 - Destroying a Completion Queue.

   Input Modifiers:

   *   RNIC handle.

   *   CQ handle.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

       o   One or more Work Queues is still associated with the CQ.







   Hilland, et al.        Expires October 2003             [Page 166]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.2.4  Shared Receive Queue

9.2.4.1  Create S-RQ

   Description:

       Creates an S-RQ for the specified RNIC.

       A set of initial S-RQ attributes must be specified by the
       Consumer. If any of the required initial attributes are illegal
       or missing, an error is returned and the S-RQ is not created.

       The RI MUST support this Verb if the Query RNIC Output Modifier
       indicates support for an S-RQ and MUST support all of the Input
       & Output Modifiers in this case, except where noted. For more
       information, see Section 6.3.1 - Creating a Shared Receive
       Queue.

   Input Modifiers:

   *   RNIC handle.

   *   The maximum number of outstanding Work Requests the Consumer
       expects to submit to the Shared Receive Queue.

   *   The S-RQ Limit. The S-RQ Limit detection is armed by the RI upon
       creation of the S-RQ, if the S-RQ Limit is non-zero.

   *   The maximum number of Scatter/Gather Elements the Consumer can
       specify in a Work Request.

   *   PD ID.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The S-RQ Handle.

       o   The allocated number of outstanding Work Requests the
           Consumer can submit to the Shared Receive Queue.

       o   The allocated number of scatter/gather elements that can be
           specified in Work Requests. If an error is not returned,
           this is guaranteed to be greater than or equal to the number
           requested.

   *   Verb Results:

       o   Operation completed successfully.



   Hilland, et al.        Expires October 2003             [Page 167]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Maximum number of Work Requests requested exceeds RNIC
           capability.

       o   Maximum number of scatter/gather elements per Receive Queue
           Work Request requested exceeds RNIC capability.

       o   Invalid PD ID.

       o   S-RQ Limit out of range.

9.2.4.2  Query S-RQ

   Description:

       Returns the attribute list and current values for the specified
       S-RQ.

       The RI MUST support this Verb if the Query RNIC Output Modifier
       indicates support for an S-RQ and MUST support all of the Input
       & Output Modifiers in this case, except where noted.

   Input Modifiers:

   *   RNIC Handle.

   *   S-RQ Handle.

   Output Modifiers:

   *   The S-RQ attributes, if the operation completed successfully.
       The list of attributes returned by the query are:

       o   The allocated number of outstanding Work Requests supported
           on the Shared Receive Queue.

       o   The allocated number of Scatter/Gather Elements supported on
           Work Requests submitted to the Shared Receive Queue.

       o   PD ID.

       o   The S-RQ Limit.

       o   S-RQ Limit Armed Indicator.

   *   Verb Results:




   Hilland, et al.        Expires October 2003             [Page 168]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid S-RQ handle.

9.2.4.3  Modify S-RQ

   Description:

       Modifies the attributes for the specified S-RQ.

       The RI MUST support this Verb if the Query RNIC Output Modifier
       indicates support for an S-RQ and MUST support all of the Input
       & Output Modifiers in this case, except where noted. For more
       information, see Section 6.3.2 - Modifying a Shared Receive
       Queue.

   Input Modifiers:

   *   RNIC Handle.

   *   S-RQ Handle.

   *   The S-RQ attributes to modify and their new values. The S-RQ
       attributes that can be modified after the S-RQ has been created
       are:

       o   The maximum number of outstanding Work Requests the Consumer
           expects to submit to the Shared Receive Queue (if changing
           is supported by the RNIC).

       o   The S-RQ Limit.

       o   Re-arm the S-RQ Limit Asynchronous Event.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The allocated number of outstanding Work Requests supported
           on the Shared Receive Queue.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.



   Hilland, et al.        Expires October 2003             [Page 169]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Invalid S-RQ handle.

       o   Maximum number of Shared Receive Queue Work Requests
           requested exceeds RNIC capability.

       o   An Attempt to shrink the size of the queue failed because
           too many elements were still present.

       o   S-RQ Limit out of range.

       o   Invalid Input Modifier.

9.2.4.4  Destroy S-RQ

   Description:

       Destroys the specified S-RQ.

       The RI MUST support this Verb if the Query RNIC Output Modifier
       indicates support for an S-RQ and MUST support all of the Input
       & Output Modifiers in this case, except where noted.

       For more information, see Section 6.3.3 - Destroying a Shared
       Receive Queue.

   Input Modifiers:

   *   RNIC handle.

   *   S-RQ handle.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid S-RQ handle.

       o   QPs still associated with the S-RQ.

9.2.5  Queue Pair

9.2.5.1  Create QP

   Description:

       Creates a QP for the specified RNIC.



   Hilland, et al.        Expires October 2003             [Page 170]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       A set of initial QP attributes must be specified by the
       Consumer. If any of the required initial attributes are illegal
       or missing, an error is returned and the Queue Pair is not
       created.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 6.1.1 - Creating a Queue Pair.

   Input Modifiers:

   *   RNIC handle.

   *   The QP attributes that must be specified at QP create time are:

       o   The CQ handle of the CQ to be associated with the Send
           Queue.

       o   The CQ handle of the CQ to be associated with the Receive
           Queue. (Note that this may be the same CQ that is associated
           with the Send Queue, or it may be a different CQ than the
           one associated with the Send Queue).

       o   The maximum number of outstanding Work Requests the Consumer
           expects to submit to the Send Queue.

       o   The maximum number of outstanding Work Requests the Consumer
           expects to submit to the Receive Queue. This value is
           ignored if the QP is associated with an S-RQ.

       o   If the QP's RQ will be associated with an S-RQ:

           +   S-RQ Handle.

           +   QP RQ Limit Indicator, as discussed in Section 6.3.8 -
               S-RQ Limit Checking. The QP RQ Limit detection is armed
               by the RI upon creation of the QP, if non-zero.

       o   Inbound RDMA Read enable.

       o   Inbound RDMA Write and inbound RDMA Read Response enable.

       o   Bind Memory Windows enable.

       o   The maximum number of scatter/gather elements the Consumer
           can specify in a Send Operation Type Work Request submitted
           to the Send Queue.






   Hilland, et al.        Expires October 2003             [Page 171]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The maximum number of scatter/gather elements the Consumer
           can specify in a RDMA Write Work Request submitted to the
           Send Queue.

       o   The maximum number of scatter/gather elements the Consumer
           can specify in a Work Request submitted to the Receive
           Queue. This value is not returned if the QP is associated
           with an S-RQ.

       o   ORD (Requested) - The requested maximum number of
           outstanding Outgoing RDMA Read Request Messages the RNIC can
           initiate from the SQ.

       o   IRD (Requested) - The requested maximum number of
           outstanding Incoming RDMA Read Request Messages (e.g. IRRQ
           depth) the RNIC can handle for this QP.

       o   PD ID.

       o   Enable or disable the Use of the STag of zero and Fast-
           Register Non-Shared Memory Region Operations. This MUST only
           be allowed to be enabled for Privileged Mode Consumers.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The QP Handle.

       o   The QP ID.

       o   The allocated number of outstanding Work Requests supported
           on the Send Queue. If an error is not returned, this is
           guaranteed to be greater than or equal to the number
           requested. (This may require the Consumer to increase the
           size of the CQ.)

       o   The allocated number of outstanding Work Requests supported
           on the Receive Queue. If an error is not returned, this is
           guaranteed to be greater than or equal to the number
           requested. (This may require the Consumer to increase the
           size of the CQ.) This value is not returned if the QP is
           associated with an S-RQ.

       o   The allocated number of scatter/gather elements that can be
           specified in Work Requests submitted to the Send Queue. If
           an error is not returned, this is guaranteed to be greater
           than or equal to the number requested.





   Hilland, et al.        Expires October 2003             [Page 172]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The allocated number of Scatter/Gather Elements supported on
           RDMA Write Work Requests submitted to the Send Queue. If an
           error is not returned, this is guaranteed to be greater than
           or equal to the number requested.

       o   The allocated number of Scatter/Gather Elements that can be
           specified in Work Requests submitted to the Receive Queue.
           If an error is not returned, this is guaranteed to be
           greater than or equal to the number requested. This value is
           not returned if the QP is associated with an S-RQ.

       o   ORD (allocated) - The allocated number of outstanding RDMA
           Read Request Messages the RNIC can initiate from the SQ at
           the Data Sink. This number MUST be between zero and the
           number requested, inclusive. If the Consumer requested a
           non-zero number and the RI was unable to provision at least
           one then an Immediate Error MUST be returned.

       o   IRD (allocated) - The allocated number of incoming
           outstanding RDMA Read Request Messages (e.g. IRRQ depth) the
           RNICÆs QP can handle at the Data Source. If the Consumer
           requested a non-zero number and the RI was unable to
           provision at least one then an Immediate Error MUST be
           returned.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

       o   Invalid S-RQ handle.

       o   The value requested for ORD exceeds RNIC capability.

       o   The value requested for IRD exceeds RNIC capability.

       o   Maximum number of Send Queue Work Requests requested exceeds
           RNIC capability.

       o   Maximum number of Receive Queue Work Requests requested
           exceeds RNIC capability

       o   Maximum number of scatter/gather elements per Send Queue
           Work Request requested exceeds RNIC capability.




   Hilland, et al.        Expires October 2003             [Page 173]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Maximum number of scatter/gather elements per Receive Queue
           Work Request requested exceeds RNIC capability.

       o   Invalid Protection Domain.

       o   QP RQ Limit Out of Range.

9.2.5.2  Query QP

   Description:

       Returns the attribute list and current values for the specified
       QP.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 6.1.2 - Querying Queue Pair Attributes.

   Input Modifiers:

   *   RNIC Handle.

   *   QP Handle.

   Output Modifiers:

   *   The QP attributes, if the operation completed successfully. The
       list of attributes returned by the query are:

       o   Handle of the Completion Queue associated with the Send
           Queue.

       o   Handle of the Completion Queue associated with the Receive
           Queue.

       o   Handle of the S-RQ. This value is only returned if the QP is
           associated with an S-RQ.

       o   The allocated number of outstanding Work Requests supported
           on the Send Queue.

       o   The allocated number of outstanding Work Requests supported
           on the Receive Queue. This value is not returned if the QP
           is associated with an S-RQ.

       o   The actual number of Scatter/Gather Elements supported on
           Send Operation Type Work Requests submitted to the Send
           Queue.





   Hilland, et al.        Expires October 2003             [Page 174]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The allocated number of Scatter/Gather Elements supported on
           RDMA Write Work Requests submitted to the Send Queue.

       o   The allocated number of Scatter/Gather Elements supported on
           Work Requests submitted to the Receive Queue. This value is
           not returned if the QP is associated with an S-RQ.

       o   ORD - The allocated number of outstanding RDMA Read Request
           Messages the RNIC can initiate from the SQ at the Data Sink.

       o   IRD - The allocated number of outstanding incoming RDMA Read
           Request Messages (e.g. IRRQ depth) the RNICÆs QP can handle
           at the Data Source.

       o   Current QP state.

       o   PD ID.

       o   QP ID.

       o   Use of the STag of zero and Fast-Register Non-Shared Memory
           Region Operations enabled.

       o   Inbound RDMA Read enable.

       o   Inbound RDMA Write and inbound RDMA Read Response enable.

       o   Bind Memory Windows enable.

       The following attributes are not defined unless the QP is in the
       Terminate or Error states.

       o   A buffer containing the Terminate Message that was received
           or sent (if possible).

       o   An indicator to state if the Terminate Message was generated
           locally or by the Associated QP.

       The following attributes are only defined if the QP is
       associated with a Shared Receive Queue.

       o   Current QP's RQ Limit.

       o   QP's RQ Limit armed indicator.

       The following attributes are only defined if the QP is not in
       the Idle state.

       o   LLP Stream Handle.




   Hilland, et al.        Expires October 2003             [Page 175]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid QP handle.

9.2.5.3  Modify QP

   Description:

       Modifies the attributes for the specified QP then causes the QP
       to transition to the specified QP state. Only a subset of the QP
       attributes can be modified in each of the QP states.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 6.1.3 - Modifying Queue Pair Attributes.

   Input Modifiers:

   *   RNIC Handle.

   *   QP Handle.

   *   The QP attributes to modify and their new values. The QP
       attributes that can be modified after the QP has been created
       are:

       o   Next QP state. If the current state is specified, only the
           QP attributes will be modified.

       o   ORD - The requested number of outstanding RDMA Read Request
           Messages the RNIC can initiate from the SQ at the Data Sink.

       o   IRD - The requested number of incoming outstanding RDMA Read
           Request Messages (e.g. IRRQ depth) the RNICÆs QP can handle
           at the Data Source.

       o   The maximum number of outstanding Work Requests the Consumer
           expects to submit to the Send Queue (if changing is
           supported by the RNIC).

       o   The maximum number of outstanding Work Requests the Consumer
           expects to submit to the Receive Queue (if changing is
           supported by the RNIC). This value is not allowed if the QP
           is associated with an S-RQ.





   Hilland, et al.        Expires October 2003             [Page 176]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       The following attributes are only defined if the QP is
       associated with a Shared Receive Queue.

       o   QP's RQ Limit, as described in Section 6.3.8 - S-RQ Limit
           Checking.

       o   Re-arm the QP's RQ Limit, as described in Section 6.3.8 - S-
           RQ Limit Checking. The RI MUST allow an already armed S-RQ
           limit to be armed.

       Valid only when moving from Idle to RTS.

       o   LLP Stream Handle

       o   Stream Message Buffer.

   Output Modifiers:

   *   If the operation completed successfully:

       o   The allocated number of outstanding Work Requests supported
           on the Send Queue.

       o   The allocated number of outstanding Work Requests supported
           on the Receive Queue. This value is not returned if the QP
           is associated with an S-RQ.

       o   ORD - The allocated number of outstanding RDMA Read Request
           Messages the RNIC can initiate from the SQ at the Data Sink.
           This number MUST be between zero and the number requested,
           inclusive. If the Consumer requested a non-zero number and
           was unable to provision at least one then an Immediate Error
           will be returned.

       o   IRD - The allocated number of incoming outstanding RDMA Read
           Request Messages (e.g. the IRRQ depth) the RNICÆs QP can
           handle at the Data Source. If the Consumer requested a non-
           zero number and was unable to provision at least one then an
           Immediate Error will be returned.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid QP handle.




   Hilland, et al.        Expires October 2003             [Page 177]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Cannot change QP attribute.

       o   Invalid QP state change requested.

       o   Maximum number of Send Queue Work Requests requested exceeds
           RNIC capability.

       o   Maximum number of Receive Queue Work Requests requested
           exceeds RNIC capability.

       o   The value requested for ORD exceeds RNIC capability.

       o   The value requested for IRD exceeds RNIC capability.

       o   An Attempt to shrink the size of the queue failed because
           too many elements were still present.

       o   Invalid LLP Stream Handle.

       o   Invalid Modifier.

       o   RI still flushing WQEs.

       o   RQ Limit Out of Range.

9.2.5.4  Destroy QP

   Description:

       Destroys the specified QP.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. The QP cannot be
       destroyed if any Memory Windows are still Bound to the QP.

       For more information, see Section 6.1.4 - Destroying a Queue
       Pair.

   Input Modifiers:

   *   RNIC handle.

   *   QP handle.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.




   Hilland, et al.        Expires October 2003             [Page 178]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Invalid RNIC handle.

       o   Invalid QP handle.

       o   Memory Windows still Bound to QP.

9.2.6  Memory Management

   Memory Management Verbs are used to manage Memory Regions and Memory
   Windows. The following table describes what each of the Memory
   Management Verbs manage and where the Verb appears to performed:

                    Verb                 Used to manage  Performed by
                                           MR vs. MW     RI vs. RNIC

    Allocate Non-Shared Memory Region    MR              RI
    STag

    Register Non-Shared Memory Region    MR              RI
    (RI-Register)

    Reregister Non-Shared Memory Region  MR              RI
    (RI-Reregister)

    Register Shared Memory Region        MR              RI

    Fast-Register Non-Shared Memory      MR              RNIC
    Region (PostSQ)

    Query Memory Region                  MR              RI

    Invalidate Local STag (PostSQ)       MR or MW        RNIC

    Deallocate STag                      MR or MW        RI

    Allocate Memory Window               MW              RI

    Query Memory Window                  MW              RI

    Bind Memory Window (PostSQ)          MW              RNIC

                    Figure 26 - Memory Management Verbs

9.2.6.1  Allocate Non-Shared Memory Region STag

   Description:

       Allocates memory registration resources on the RNIC.



   Hilland, et al.        Expires October 2003             [Page 179]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.3.2.1 - Allocate Non-Shared Memory Region STag.

   Input Modifiers:

   *   RNIC Handle.

   *   Requested Physical Buffer List size to be allocated.

   *   PD ID.

   *   Remote Access Flag. If set, Local and Remote Access is enabled.
       Otherwise only Local access is enabled.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Index - used for local and, if specified by the input
           modifiers, remote access.

       o   The actual number of Physical Buffer List Entries in the
           allocated Physical Buffer List. Note that this MAY be
           greater than the number requested.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid PD ID.

9.2.6.2  Register Non-Shared Memory Region (RI-Register)

   Description:

       Registers a Non-Shared Memory Region for use by an RNIC.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.3.2.2 - RI-Register Non-Shared Memory Region.

   Input Modifiers:

   *   RNIC Handle.




   Hilland, et al.        Expires October 2003             [Page 180]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Physical Buffer Entry size - The size, in bytes, of each
       Physical Buffer in the list. Note: If the Physical Buffer List
       references a Page List, the size MUST be a power of two. If the
       Physical Buffer List references a Block List, the size MAY have
       a byte alignment.

   *   Address List - A list of addresses that point to the Physical
       Buffers referenced by the Physical Buffer List. All Physical
       Buffers in the list have the same size.

   *   Address List Length - the number of entries in the Address list.

   *   First Byte Offset (FBO) - Offset to start of Non-Shared Memory
       Region on first Physical Buffer.

   *   Length - Total length of the Non-Shared Memory Region (can be of
       arbitrary byte-aligned length).

   *   Addressing type. The Addressing type MUST be one of the
       following:

       o   VA Based TO

       o   Zero Based TO

   *   The following input modifier is only valid if the Addressing
       type is VA Based TO:

       o   Virtual Address - The VA address of the first byte in the
           Non-Shared Memory Region.

   *   PD ID.

   *   STag Key.

   *   Remote Access Flag.

   *   Access Control - The following MAY be selected in any
       combination except as noted:

       o   Enable Local Write Access.

       o   Enable Remote Write Access. Remote Write Access requires
           Local Write Access to be enabled.

       o   Enable Local Read Access.

       o   Enable Remote Read Access. Remote Read Access requires Local
           Read Access to be enabled.




   Hilland, et al.        Expires October 2003             [Page 181]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Enable Memory Window Binding.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Index - used for local and, if specified by the input
           modifiers, remote access. Note: the RNIC associates the STag
           Key passed in as an input modifier to STag associated with
           the registered Non-Shared Memory Region.

       o   The actual number of Physical Buffer List Entries in the
           allocated Physical Buffer List. Note that this MAY be
           greater than the number requested.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid PD ID.

       o   Invalid Virtual Address.

       o   Invalid length.

       o   Invalid First Byte Offset.

       o   Invalid Access Rights requested.

       o   Invalid Physical Buffer List entry.

       o   Invalid Physical Buffer size.

9.2.6.3  Query Memory Region

   Description:

       Retrieves information about a specific Memory Region.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.7 - Querying Memory Regions.

   Input Modifiers:

   *   RNIC Handle.



   Hilland, et al.        Expires October 2003             [Page 182]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   STag Index - as originally returned from an Allocate Non-Shared
       Memory Region STag, RI-Register Non-Shared Memory Region, RI-
       Reregister Non-Shared Memory Region or Register Shared Memory
       Region Type Verb.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Key - Current STag Key associated with the Memory
           Region, if it is in the Valid state.

       o   Remote Access Flag.

       o   PD ID.

       o   STag State: Valid or Invalid.

       o   STag Type: Shared or Non-Shared.

       o   The actual number of Physical Buffer List Entries in the
           allocated Physical Buffer List. Note that this MAY be
           greater than the number requested.

       o   Access Control settings for the registered Region. The
           following MAY be set in any combination except as noted:

           +   Local Write Access Enabled.

           +   Remote Write Access Enabled. Remote Write Access
               requires Local Write Access to be enabled.

           +   Local Read Access Enabled.

           +   Remote Read Access Enabled. Remote Read Access requires
               Local Read Access to be enabled.

           +   Memory Window Binding Enabled.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid STag Index.

9.2.6.4  Deallocate STag

   Description:



   Hilland, et al.        Expires October 2003             [Page 183]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       Removes an STag created through an Allocate Non-Shared Memory
       Region STag, RI-Register Non-Shared Memory Region, RI-Reregister
       Non-Shared Memory Region, Register Shared Memory Region or
       Allocate Memory Window from the RNIC.

       Work Requests or Remote Operation requests that are in-process
       and actively referencing memory locations associated with the
       STag being deallocated must fail with a protection error.

       If the STag references a Memory Region which has Memory Windows
       Bound to it, an immediate Error MUST be returned and the Memory
       Region must not be destroyed or modified.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.9 - Deallocation of STag associated with a Memory
       Region and Section 7.10.4 - Invalidating or De-allocating Memory
       Windows.

   Input Modifiers:

   *   RNIC Handle.

   *   STag Index - as originally returned from an Allocate Non-Shared
       Memory Region STag, Allocate Memory Window, or RI-Register Non-
       Shared Memory Region, RI-Reregister Non-Shared Memory Region or
       Register Shared Memory Region Verb.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid STag Index.

       o   One or more Memory Windows is still Bound to the Memory
           Region. Applies only if the STag is associated with a Memory
           Region.

9.2.6.5  Reregister Non-Shared Memory Region (RI-Reregister)

   Description:

       Modifies the attributes of an existing Non-Shared Memory Region.

       The STag output modifier from this Verb must be used in place of
       any previously issued for this Non-Shared Memory Region.



   Hilland, et al.        Expires October 2003             [Page 184]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       If the STag references a Non-Shared Memory Region which has
       Memory Windows Bound to it, an immediate Error MUST be returned
       and the Non-Shared Memory Region must not be destroyed or
       modified.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.3.2.3 - RI-Reregister Non-Shared Memory Region.

   Input Modifiers:

   *   RNIC Handle.

   *   Physical Buffer Entry size - The size, in bytes, of each
       Physical Buffer Entry in the list. Note: If the Physical Buffer
       List references a Page-List, the size MUST be a power of two. If
       the Physical Buffer List references a Block-List, the size MAY
       have a byte alignment.

   *   Address List - A list of addresses that point to the Physical
       Buffers referenced by the Physical Buffer List. All Physical
       Buffers in the list MUST have the same size.

   *   Address List Length - the number of entries in the Address list.

   *   First Byte Offset (FBO) - Offset to start of Non-Shared Memory
       Region on first Physical Buffer.

   *   Length - Total length of Non-Shared Memory Region (can be of
       arbitrary byte-aligned length).

   *   Addressing type. The addressing type MUST be one of the
       following:

       o   VA Based TO

       o   Zero Based TO

   *   The following input modifier is only valid if the Addressing
       type is VA Based TO:

       o   Virtual Address - The VA address of the first byte in the
           Non-Shared Memory Region.

   *   PD ID.

   *   STag Index.

   *   STag Key (not the existing STag Key, but the new STag Key).




   Hilland, et al.        Expires October 2003             [Page 185]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   Remote Access Flag.

   *   Access Control - The following MAY be selected in any
       combination except as noted:

       o   Enable Local Write Access.

       o   Enable Remote Write Access. Remote Write Access requires
           Local Write Access to be enabled.

       o   Enable Local Read Access.

       o   Enable Remote Read Access. Remote Read Access requires Local
           Read Access to be enabled.

       o   Enable Memory Window Binding.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Index - used for local and, if specified by the input
           modifiers, remote access. Note: the RNIC associates the STag
           Key passed in as an input modifier to STag associated with
           the registered Non-Shared Memory Region. If the output STag
           index differs from the input STag index, the old STag index
           was Deallocated.

       o   The actual number of Physical Buffer List Entries in the
           allocated Physical Buffer List. Note that this MAY be
           greater than the number requested.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid STag Index.

       o   Invalid Virtual Address.

       o   Invalid Length.

       o   Invalid PD ID.

       o   Invalid First Byte Offset.




   Hilland, et al.        Expires October 2003             [Page 186]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Invalid Access Rights request.

       o   One or more Memory Windows is still Bound to the Region.

       o   Invalid Physical Buffer List entry.

       o   Invalid Physical Buffer size.

9.2.6.6  Register Shared Memory Region

   Description:

       Registers a new Shared Memory Region which shares RNIC mapping
       resources with a previously registered Memory Region, thus
       returning a new STag. Note that other than the change of the
       original Memory Region to a Shared Memory Region, the original
       Memory Region remains unaffected by this operation.

       The Base TO,VA (if the input STag Index references a VA Based
       TO), PD ID, and Access Rights specified for the new Memory
       Region need not be the same as those of the existing Memory
       Region. The lengths are by definition the same.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.4.3 - Multiple Registrations of Memory Regions.

   Input Modifiers:

   *   RNIC Handle.

   *   STag Index of the existing Memory Region. If the existing Memory
       Region is Non-Shared, successful completion of this verb will
       convert the existing Non-Shared Memory Region to a Shared Memory
       Region.

   *   Addressing type. The addressing type MUST be one of the
       following:

       o   VA Based TO

       o   Zero Based TO

   *   The following modifier is only valid if the Addressing type of
       the existing region is VA Based TO:

       o   Virtual Address - The VA address of the first byte in the
           Memory Region.

   *   PD ID.



   Hilland, et al.        Expires October 2003             [Page 187]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   STag Key of the new STag.

   *   Remote Access Flag.

   *   Access Control - The following MAY be selected in any
       combination except as noted:

       o   Enable Local Write Access.

       o   Enable Remote Write Access. Remote Write Access requires
           Local Write Access to be enabled.

       o   Enable Local Read Access.

       o   Enable Remote Read Access. Remote Read Access requires Local
           Read Access to be enabled.

       o   Enable Memory Window Binding.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Index - used for local and, if specified by the input
           modifiers, remote access. Note: the RNIC associates the STag
           Key passed in as an input modifier to STag associated with
           the registered Shared Memory Region.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid STag Index.

       o   Invalid Virtual Address.

       o   Invalid PD ID.

       o   Invalid Access Rights requested.

9.2.6.7  Allocate Memory Window

   Description:






   Hilland, et al.        Expires October 2003             [Page 188]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       This Verb allocates a memory window and associates it with a
       Protection Domain. It is not inherently associated with any
       Memory Region when allocated.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.10.1 - Allocating Memory Windows.

   Input Modifiers:

   *   RNIC Handle.

   *   PD ID.

   Output Modifiers:

   *   If the operation completed successfully:

       o   STag Index - an unbound STag for use in specifying the
           Window when invoking a Bind Work Request through the Post
           Send Verb.

   *   Verb Results:

       o   Operation completed successfully.

       o   Insufficient resources to complete request.

       o   Invalid RNIC handle.

       o   Invalid PD ID.

9.2.6.8  Query Memory Window

   Description:

       This Verb returns the attributes associated with the specified
       memory window.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 7.10.3 - Memory Windows.

   Input Modifiers:

   *   RNIC Handle.

   *   STag Index - the current STag associated with the Memory Window.

   Output Modifiers:



   Hilland, et al.        Expires October 2003             [Page 189]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   If the operation completed successfully:

       o   STag Key - current value of the STag Key, if the STag is in
           the Valid state.

       o   STag State: Valid or Invalid.

       o   PD ID.

       o   Access Rights. The following may be set in any combination
           except as noted.

           +   Remote Write Access Enabled. If set Remote Write Access
               is enabled.

           +   Remote Read Access Enabled. If set Remote Read Access is
               enabled.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid STag Index.

9.3  Work Request Processing

9.3.1  QP Operations

9.3.1.1  PostSQ

   Description:

       Builds a WQE on the Send Queue of the specified QP for each
       entry in the Work Request List submitted by the Consumer. This
       WQE is added to the end of the Send Queue and the RNIC is
       notified that a new WQE is ready to be processed.

       Note that not all Input Modifiers are valid for all operations.
       If Input Modifiers are specified that are not valid for a
       particular operation, they are ignored.

       Following the Verbs is a Work Request table which contains a
       List of the Operation Types and the Input Modifiers which are
       required for each of those Operation Types.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.2.1 - Submitting Work Request to a Work Queue.



   Hilland, et al.        Expires October 2003             [Page 190]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Input Modifiers:

   *   RNIC Handle

   *   QP Handle.

   *   A list of Work Requests. Each Work Request MUST contain the
       following information:

       o   A user defined 64-bit Work Request ID

       o   Operation type. The operation type MUST be one of the
           following:

           +   Send

           +   Send with Solicited Event

           +   Send with Invalidate

           +   Send with Solicited Event & Invalidate

           +   RDMA Write

           +   RDMA Read

           +   RDMA Read with Invalidate Local STag

           +   Bind Memory Window

           +   Fast-Register Non-Shared Memory Region

           +   Invalidate Local STag

       o   Completion Notification Type: Signaled or Unsignaled.

       o   The following list of modifiers are only valid for Send
           Operation Types and RDMA Write WRs to represent the Local
           Buffer:

           +   Scatter/Gather List. The Scatter/Gather List can contain
               zero or more Scatter/Gather Elements. This list is
               specified only for Send and RDMA type operations.

           +   Number of Scatter/Gather Elements.

           +   Note that the length is determined by adding up the
               Length field in the SGEs of the SGL.

           +   Read Fence indicator.



   Hilland, et al.        Expires October 2003             [Page 191]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   The following list of modifiers are only valid for RDMA Read
           Type operations to represent the Local Buffer:

           +   Local Address. This is a contiguous buffer represented
               by a TO, an STag, and a Length to be read.

       o   The following list of modifiers are only valid for RDMA
           Write or RDMA Read Type WRs to represent the Remote Buffer:

           +   Remote Address. This is a contiguous buffer represented
               by a TO and an STag.

       o   The following modifier is only valid for the Send with
           Invalidate and Send with Solicited Event & Invalidate
           operations:

           +   Remote STag. This is the STag to be Invalidated at the
               Remote Peer.

       o   The following list of modifiers are only valid for Bind
           Memory Window operations:

           +   STag Index for the Memory Window.

           +   STag Key for the Memory Window.

           +   STag for the Memory Region that the Memory Windows is to
               be associated with. This parameter includes both the
               STag Index and STag Key.

           +   Length or range to be Bound in number of octets.

           +   Addressing type. The addressing type MUST be one of the
               following:

               *   VA Based TO

               *   Zero Based TO

           +   Virtual Address - The VA address of the first byte into
               the Memory Region. This may be different than the
               starting address of the Memory Region.

           +   Access Control - either or both of the following must be
               selected:

               *   Enable Remote Write Access. Requires the Memory
                   Region to have Local Write Access.





   Hilland, et al.        Expires October 2003             [Page 192]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

               *   Enable Remote Read Access. Requires the Memory
                   Region to have Local Read Access.

       o   The following list of modifiers are only valid for Fast-
           Register Non-Shared Memory Region operations:

           +   Physical Buffer Entry size - The size, in bytes, of each
               Physical Buffer in the list. Note: If the Physical
               Buffer List references a Page-List, the size MUST be a
               power of two. If the Physical Buffer List references a
               Block-List, the size MUST be an RNIC supported size (see
               Section 9.2.1.2 - Query RNIC).

           +   Address List - A list of addresses that point to the
               Physical Buffers referenced by the Physical Buffer List.
               All Physical Buffers in the list MUST have the same
               size.

           +   Address List Length - the number of entries in the
               Address list.

           +   First Byte Offset (FBO) - Offset to start of Non-Shared
               Memory Region on first Physical Buffer.

           +   Length - Total length of Non-Shared Memory Region (can
               be any value supported by the RNIC).

           +   Addressing type. The addressing type MUST be one of the
               following:

               *   VA Based TO

               *   Zero Based TO

           +   The following modifier is only valid if the Addressing
               type is VA Based TO:

               *   Virtual Address - The VA address of the first byte
                   in the Non-Shared Memory Region

           +   STag Index.

           +   STag Key.

           +   Access Control - The following may be selected in any
               combination except as noted:

               *   Enable Local Write Access.





   Hilland, et al.        Expires October 2003             [Page 193]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

               *   Enable Remote Write Access. Remote Write Access
                   requires Local Write Access to be enabled. The STag
                   Index MUST have the Remote Access Flag enabled.

               *   Enable Local Read Access.

               *   Enable Remote Read Access. Remote Read Access
                   requires Local Read Access to be enabled. The STag
                   Index MUST have the Remote Access Flag enabled.

               *   Enable Memory Window Binding.

       o   The following list of modifiers are only valid for
           Invalidate Local STag operations:

           +   STag to be the target of the Invalidate operation.

           +   Local Fence indicator.

   Below, in Figure 27, is a matrix of the Input Modifiers for PostSQ
   and the Operation Types. The intersection of the matrix indicates
   that the Input Modifier is required for that Operation Type by
   specifying "Yes".

  Opcode-> Send Send Send Send RDMA  RDMA RDMA Bind Fast- Inv.
  Input         w/   w/   w/   Write Read Read MW   Reg.  Local
  Modifier      SE   Inv. SE &             w/       NS MR STag
                          Inv.            Inv.

  WR ID    Yes  Yes  Yes  Yes  Yes   Yes  Yes  Yes  Yes   Yes

  Compltn. Yes  Yes  Yes  Yes  Yes   Yes  Yes  Yes  Yes   Yes
  Notif.
  Type

  SGL      Yes  Yes  Yes  Yes  Yes

  SGE No.  Yes  Yes  Yes  Yes  Yes

  Read     Yes  Yes  Yes  Yes  Yes
  Fence

  Local                                                   Yes
  Fence

  Local                              Yes  Yes
  Address





   Hilland, et al.        Expires October 2003             [Page 194]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

  Opcode-> Send Send Send Send RDMA  RDMA RDMA Bind Fast- Inv.
  Input         w/   w/   w/   Write Read Read MW   Reg.  Local
  Modifier      SE   Inv. SE &            w/        NS MR STag
                          Inv.            Inv.

  Remote                       Yes   Yes  Yes
  Address

  Remote             Yes  Yes
  STag

  MW STag                                      Yes
  Key

  MW STag                                      Yes
  Index

  MW's                                         Yes
  MR STag

  MW                                           Yes
  Length

  Addr                                         Yes  Yes
  Type

  VA, if                                       Yes  Yes
  VA Based
  TO

  Acs                                               Yes
  Ctrl:
  Local Rd

  Acs                                          Yes  Yes
  Ctrl:
  Remote
  Rd

  Acs                                               Yes
  Ctrl:
  Local Wt

  Acs                                          Yes  Yes
  Ctrl:
  Remote
  Wt




   Hilland, et al.        Expires October 2003             [Page 195]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

  Opcode-> Send Send Send Send RDMA  RDMA RDMA Bind Fast- Inv.
  Input         w/   w/   w/   Write Read Read MW   Reg.  Local
  Modifier      SE   Inv. SE &            w/        NS MR STag
                          Inv.            Inv.

  Acs                                               Yes
  Ctrl:
  Bind
  Enable

  PBLE                                              Yes
  Size

  PBL                                               Yes

  FBO                                               Yes

  STag                                              Yes   Yes
  Index

  STag Key                                          Yes   Yes


                Figure 27 - PostSQ Input Modifier Validity

   Output Modifiers:

   *   Number of WRs posted.

   *   Verb Results:

       o   Operation completed successfully

       o   Invalid RNIC Handle

       o   Invalid QP Handle

       o   Too many Work Requests posted.

       o   Invalid operation type.

       o   Invalid QP state.

       o   Invalid Scatter/Gather list format.

       o   Invalid Scatter/Gather list length.

       o   Invalid Modifier.




   Hilland, et al.        Expires October 2003             [Page 196]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.3.1.2  PostRQ

   Description:

       Builds a WQE on the Receive Queue of the specified QP for each
       entry in the Work Request List submitted by the Consumer. This
       WQE is added to the end of the Receive Queue and the RNIC is
       notified that a new WQE is ready to be processed.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.2.1 - Submitting Work Request to a Work Queue.

   Input Modifiers:

   *   RNIC Handle.

   *   QP Handle, for QP's not associated with an S-RQ.

   *   S-RQ Handle, for QP's associated with an S-RQ.

   *   A list of Work Requests. Each Work Request MUST contain the
       following information.

       o   A user defined 64-bit Work Request ID.

       o   Scatter/Gather List. The scatter/gather list can contain one
           or more Data Segments.

       o   Number of Scatter/Gather List elements.

   Output Modifiers:

   *   Number of WRs posted.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid QP handle.

       o   Invalid S-RQ handle.

       o   Too many Work Requests posted.

       o   Invalid QP state.

       o   Invalid Scatter/Gather list format.



   Hilland, et al.        Expires October 2003             [Page 197]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       o   Invalid Scatter/Gather list length.

       o   Invalid Modifier.

       o   RQ Associated with S-RQ.

9.3.2  CQ Operations

9.3.2.1  Poll for Completion (Poll CQ)

   Description:

       Polls the specified CQ for a Work Completion.

       If a CQE is present, the CQE at the head of the CQ MUST be
       returned to the Consumer as a Work Completion. Note that the
       resources used are expected to be directly accessible by a Non-
       Privileged Mode Consumer.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.2.4 - Returning Completed Work Requests.

   Input Modifiers:

   *   RNIC Handle

   *   CQ Handle.

   Output Modifiers:

   *   The Work Completion. If an entry is present on the CQ and if the
       operation completed successfully, this contains information
       relating to a completed Work Request. If the status of the
       operation that generates the Work Completion is anything other
       than success, the contents of the Work Completion are undefined
       except as noted below. The contents of a Work Completion are:

       o   The 64-bit Work Request ID set by the Consumer in the asso-
           ciated Work Request. This is always valid, regardless of the
           status of the operation.

       o   The operation type specified in the completed Work Request.
           The valid operation types are:

           +   Send (for WRs posted to the Send Queue)

           +   Send with Solicited Event (for WRs posted to the Send
               Queue)




   Hilland, et al.        Expires October 2003             [Page 198]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

           +   Send with Invalidate (for WRs posted to the Send Queue)

           +   Send with Solicited Event & Invalidate (for WRs posted
               to the Send Queue)

           +   RDMA Write (for WRs posted to the Send Queue)

           +   RDMA Read (for WRs posted to the Send Queue)

           +   RDMA Read with Invalidate Local STag (for WRs posted to
               the Send Queue)

           +   Memory Window Bind (for WRs posted to the Send Queue)

           +   Fast-Register Non-Shared Memory Region (for WRs posted
               to the Send Queue)

           +   Invalidate Local STag (for WRs posted to the Send Queue)

           +   Receive (for WRs posted to the Receive Queue)

       o   The number of bytes transferred. This is only valid if the
           operation type was a Receive.

       o   The Completion Status of the operation. This modifier MUST
           be as specified in Section 9.5.2 - Completion Status Codes.

       o   STag Invalidated Indicator. This indicates that the incoming
           Untagged Message destined for the RQ was a Send with
           Invalidate or Send with Solicited Event & Invalidate, and
           thus the STag Invalidated field is valid.

       o   STag Invalidated. This contains the STag which was
           Invalidated. This is only valid when the Invalidated STag
           Indicator is set.

       o   QP ID. This is the QP ID of the QP where the WR which
           generated this completion was posted.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

       o   CQ empty.





   Hilland, et al.        Expires October 2003             [Page 199]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.3.2.2  Request Completion Notification

   Description:

       Requests the CQ event handler be called when the next CQE of the
       specified type is added to the specified CQ.

       A CQ event handler must be specified prior to calling this
       routine (see Section 9.4.1 - Set Completion Event Handler). If
       the CQ event handler has not been registered when the event is
       generated, the handler will not be called.

       Once the handler routine has been invoked, the Consumer must
       call Request Completion Notification again to be notified when a
       new entry is added to that CQ.

       It is the responsibility of the Consumer to call the Poll for
       Completion Verb to retrieve a Work Completion after the handler
       is called.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.2.5 - Asynchronous Completion Notification.

   Input Modifiers:

   *   RNIC Handle.

   *   CQ Handle.

   *   Completion notification type. This MUST be either the next
       completion event or the next solicited completion event.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC handle.

       o   Invalid CQ handle.

9.4  Event Handling

9.4.1  Set Completion Event Handler

   Description:





   Hilland, et al.        Expires October 2003             [Page 200]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

       A RNIC MUST support one CQ Event Handler, and MAY support
       additional Completion Event Handlers. Each Completion Event
       Handler address is maintained by the RI and delineated by an
       opaque handle called a Completion Event Handler Identifier. The
       consumer uses the Set Completion Event Handler to register
       individual Completion Event Handlers and obtain a unique
       Completion Event Handler Identifier. The Completion Event
       Handler Identifier is used in Create CQ to associate a CQ with a
       specific Completion Event Handler.

       This call does not automatically request a notification on a
       completion event. The Request Completion Notification Verb must
       be called in order to request notification.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.2.5 - Asynchronous Completion Notification.

   Input Modifiers:

   *   RNIC Handle

   *   Completion Event Handler Address. If set to zero, then the Set
       Completion Handler Verb is being used to clear the associated
       Completion Event Handler address identified by the Completion
       Event Handler Identifier. The Completion Event Handler will be
       invoked when an appropriate Completion occurs with the following
       input parameters passed in to it:

       o   RNIC Handle.

       o   CQ Handle.

   *   Completion Event Handler Identifier - An opaque handle used to
       identify a Completion Event Handler address.

       o   If set to zero, the Set Completion Event Handler verb is
           being used to register a new Completion Event Handler
           address and the verb will return a new Completion Event
           Handler Identifier.

       o   If set to non-zero, then the Set Completion Event Handler is
           being used:

           +   to clear the associated Completion Event Handler address
               for the specified Completion Event Handler Identifier,
               if the Completion Event Handler address is zero;

           +   to modify the associated Completion Event Handler
               address for the specified Completion Event Handler



   Hilland, et al.        Expires October 2003             [Page 201]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

               Identifier, if the Completion Event Handler address is
               non-zero.

   Output Modifiers:

   *   Completion Event Handler Identifier - Only returned if the Set
       Completion Event Handler verb is being used to register a new
       Completion Event Handler address.

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC Handle.

       o   Invalid Completion Event Handler Identifier.

       o   Insufficient Resources.

9.4.2  Set Asynchronous Event Handler

   Description:

       Registers the asynchronous event handler. Only one asynchronous
       event handler can be registered per RNIC. Additional calls to
       this Verb will overwrite the handler routine to be called.
       Additional calls will not generate an additional handler
       routine. If the new handler address is zero, there will be no
       Asynchronous Event Handler associated with the RNIC.

       The RI MUST support this Verb and MUST support all of the Input
       & Output Modifiers, except where noted. For more information,
       see Section 8.3.3 - Asynchronous Errors.

   Input Modifiers:

   *   RNIC Handle

   *   Asynchronous Event Handler Address. This routine will be invoked
       with the following input parameters passed in:

       o   RNIC Handle.

       o   Event Record. This contains information which indicates the
           resource type and identifier as well as which event
           occurred:

           +   Resource Indicator. This indicates the type of resource
               to which the Resource Identifier refers. This must be
               one of the following values:



   Hilland, et al.        Expires October 2003             [Page 202]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

               *   QP

               *   CQ

               *   RNIC

               *   S-RQ

           +   Resource Identifier. This value is the QP Handle, CQ
               Handle, S-RQ Handle or RNIC Handle for the Asynchronous
               Event.

           +   Event Identifier. This indicates the event which caused
               the Asynchronous Event to be generated. The possible
               list of Event Identifiers can be found in Section 9.5.3
               - Asynchronous Event Identifiers.

   Output Modifiers:

   *   Verb Results:

       o   Operation completed successfully.

       o   Invalid RNIC Handle.

9.5  Result Types

   The following section is a summary of Verb results detailed in
   Sections 9.2 - 9.4)

9.5.1  Immediate Status Codes

Operation completed successfully - The Verb was      All Verbs
executed successfully.


















   Hilland, et al.        Expires October 2003             [Page 203]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.5.1.1  RNIC Management Verb Status

Insufficient resources to complete request - An      Open RNIC, Query
error was detected due to insufficient resources.    RNIC

Invalid Modifier - One of the parameters were        Open RNIC
invalid.

Block List mode not supported - The RNIC does not    Open RNIC
support Block List mode and Block List mode was
requested.

RNIC in use - The RNIC was already in use.           Open RNIC

Invalid RNIC handle - An invalid RNIC handle was     Query RNIC, Close
specified.                                           RNIC

                  Figure 28 - RNIC Management Verb Status

9.5.1.2  PD Management Verb Status

Insufficient resources to complete request - An      Allocate PD
error was detected due to insufficient resources.

Invalid RNIC handle - An invalid RNIC handle was     Allocate PD,
specified.                                           Deallocate PD

Invalid PD ID - An invalid PD was specified.         Deallocate PD

Protection Domain is in use - The PD was currently   Deallocate PD
in use by a QP, Memory Region, or Memory Window.

                   Figure 29 - PD Management Verb Status


















   Hilland, et al.        Expires October 2003             [Page 204]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.5.1.3  CQ Management Verb Status

Insufficient resources to complete request - An      Create CQ, Modify
error was detected due to insufficient resources.    CQ

Number of CQE requested exceeds RNIC capability -    Create CQ, Modify
Too many CQ entries for this RNIC were requested.    CQ

An Attempt to shrink the size of the queue failed    Modify CQ
because too many elements were still present.

Invalid RNIC handle - An invalid RNIC handle was     Create CQ, Query
specified.                                           CQ, Modify CQ,
                                                     Destroy CQ, Poll
                                                     CQ

Invalid CQ handle- An invalid CQ handle was          Query CQ, Modify
specified.                                           CQ, Destroy CQ,
                                                     Poll CQ

CQ In Use - One or more QPs is still tied to the CQ. Destroy CQ

CQ empty - There were no Work Completions available  Poll CQ
to be retrieved.

Invalid Completion Event Handler Identifier - An     Create CQ
invalid identifier was specified.

                   Figure 30 - CQ Management Verb Status

9.5.1.4  S-RQ Management Verb Status

Insufficient resources to complete request - An      Create S-RQ,
error was detected due to insufficient resources.    Modify S-RQ

Invalid RNIC handle - An invalid RNIC handle was     Create S-RQ,
specified.                                           Query S-RQ,
                                                     Modify S-RQ,
                                                     Destroy S-RQ

Invalid PD ID - An invalid PD was specified.         Create S-RQ

Maximum number of Work Requests requested exceeds    Create S-RQ,
RNIC capability.                                     Modify S-RQ







   Hilland, et al.        Expires October 2003             [Page 205]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Maximum number of scatter/gather elements per        Create S-RQ
Receive Queue Work Request requested exceeds RNIC
capability.

S-RQ Limit out of range                              Create S-RQ,
                                                     Modify S-RQ

Invalid S-RQ handle                                  Query S-RQ,
                                                     Modify S-RQ,
                                                     Modify S-RQ

An attempt to shrink the size of the queue failed    Modify S-RQ
because too many elements were still present

QPs still associated with the S-RQ                   Modify S-RQ

Invalid Input Modifer                                Modify S-RQ

                  Figure 31 - S-RQ Management Verb Status

9.5.1.5  QP Management Verb Status

Insufficient resources to complete request - An    Create QP, Modify QP
error was detected due to insufficient resources.

Invalid RNIC handle - An invalid RNIC handle was   Create QP, Query QP,
specified.                                         Modify QP, Destroy
                                                   QP

Invalid CQ handle - An invalid CQ handle was       Create QP
specified.

Value requested for ORD exceeds RNIC capability.   Create QP, Modify QP

Value requested for IRD exceeds RNIC capability.   Create QP, Modify QP

Maximum number of Work Requests requested exceeds  Create QP, Modify QP
RNIC capability.

Maximum number of scatter/gather elements          Create QP, Modify QP
requested per Work Request exceeds RNIC
capability.

Invalid PD ID - The PD ID provided was not valid   Create QP

Invalid QP ID - An invalid QP handle was           Query QP, Modify QP,
specified.                                         Destroy QP




   Hilland, et al.        Expires October 2003             [Page 206]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Cannot change QP attribute - An attempt was made   Modify QP
to modify an attribute which is not allowed by the
RNIC (for example, number of WQEs)

An Attempt to shrink the size of the queue failed  Modify QP
because too many elements were still present.

Invalid state - An invalid QP state was specified. Modify QP

Invalid LLP Stream handle                          Modify QP

Invalid Modifier - One of the modifiers was        Modify QP
invalid or was not allowed to be modified in the
current state or state transition.

RI Still flushing WQEs - The QP is in the Error    Modify QP
state and a request to transition to the Idle
state but the RI is still flushing WQEs and
therefore cannot transition.

Invalid S-RQ handle                                Create QP

QP RQ Limit Out of Range.                          Create QP, Modify QP

Memory Windows still Bound to QP                   Destroy QP

                   Figure 32 - QP Management Verb Status

9.5.1.6  Memory Management Verb Status

Insufficient resources to complete     Allocate NS MR STag, RI-
request - An error was detected due to Register, RI-Reregister,
insufficient resources.                Register Shared MR, Allocate MW

Invalid RNIC handle - An invalid RNIC  Allocate NS MR STag, RI-
handle was specified.                  Register,Query MR, Deallocate
                                       STag, RI-Reregister, Register
                                       Shared MR, Allocate MW, Query MW

Invalid PD ID - An invalid PD ID was   Allocate NS MR STag, RI-
specified.                             Register, RI-Reregister,
                                       Register Shared MR, Allocate MW

Invalid Virtual Address - An invalid   RI-Register, RI-Reregister,
Memory Address or Offset was           Register Shared MR
specified.





   Hilland, et al.        Expires October 2003             [Page 207]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Invalid Length - An invalid Length was RI-Register, RI-Reregister
specified. Too many pages or the MR
length was too long.

Invalid Access Rights requested - An   RI-Register, RI-Reregister,
invalid Access Control specifier was   Register Shared MR
specified.

Invalid Physical Buffer List entry.    RI-Register, RI-Reregister

Invalid Physical Buffer size - The     RI-Register, RI-Reregister
Physical Buffer size
(Page/Block)_requested was not
supported by the RNIC.

Invalid STag Index - An invalid Memory Query MR, RI-Reregister,
Region STag Index was specified.       Deallocate STag,
                                       Register Shared MR, Query MW

Invalid FBO - the FBO is larger than   RI-register, RI-Reregister
the physical buffer size

One or more Memory Windows is still    Deallocate STag, RI-Reregister,
Bound to the Region.

                 Figure 33 - Memory Management Verb Status

9.5.1.7  Post Verb Status

Invalid RNIC handle - An invalid RNIC handle was     PostSQ, PostRQ
specified.

Invalid QP handle - An invalid QP handle was         PostSQ, PostRQ
specified.

Invalid S-RQ handle - An invalid S-RQ handle was     PostRQ
specified.

Too many Work Requests posted.                       PostSQ, PostRQ

Invalid Operation type                               PostSQ

Invalid QP state.                                    PostSQ, PostRQ

Invalid Scatter/Gather list format                   PostSQ, PostRQ

Invalid Scatter/Gather list length - The Work        PostSQ, PostRQ




   Hilland, et al.        Expires October 2003             [Page 208]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Request specified more Scatter/Gather elements than
the QP can support.

RQ Associated with S-RQ - This QP is associated with PostRQ
an S-RQ and therefore the QP Handle cannot be used
to post receive Work Requests. The S-RQ handle
should be used instead.

Invalid Modifier - One of the parameters were        PostSQ, PostRQ
invalid.

                       Figure 34 - Post Verb Status

9.5.1.8  Event Management Verb Status

Invalid RNIC handle - An invalid RNIC      Request Completion
handle was specified.                      Notification, Set Completion
                                           Event Handler, Set
                                           Asynchronous Event Handler

Invalid CQ handle - An invalid CQ handle   Request Completion
was specified.                             Notification

Invalid Notify Type - An invalid CQ        Request Completion
Notification type was specified.           Notification

Invalid Completion event handler           Set Completion Event Handler
identifier -           - An invalid identifier was
specified while attempting to clear a
Completion Event Handler address.

Insufficient Resources - The RI did not    Set Completion Event Handler
have sufficient resources to complete the
request, such as when the Consumer
requests another Completion Event Handler
Identifier but has already set an amount
equal to the value returned in Query RNIC.

                 Figure 35 - Event Management Verb Status












   Hilland, et al.        Expires October 2003             [Page 209]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

9.5.2  Completion Status Codes

Success - The RNIC Operation was           Send Operation Types,
successful.                                Receive, RDMA Write, RDMA
                                           Read, RDMA Read with
                                           Invalidate Local STag, Bind,
                                           Fast-Register, Invalidate
                                           Local STag

Flushed - The Work Request was incomplete  Send Operation Types,
when the QP entered the Error state.       Receive, RDMA Write, RDMA
                                           Read, RDMA Read with
                                           Invalidate Local STag, Bind,
                                           Fast-Register, Invalidate
                                           Local STag

Invalid WQE - The Work Request Element     Send Operation Types,
contained a format error.                  Receive, RDMA Write, RDMA
                                           Read, RDMA Read with
                                           Invalidate Local STag, Bind,
                                           Fast-Register, Invalidate
                                           Local STag

Local QP Catastrophic Error - An error     Send Operation Types,
related to the QP occurred while           Receive, RDMA Write, RDMA
processing the Work Request.               Read, RDMA Read with
                                           Invalidate Local STag, Bind,
                                           Fast-Register, Invalidate
                                           Local STag

Remote Termination Error - A Terminate     Send Operation Types, RDMA
Message was received from the Remote Peer  Write, RDMA Read, RDMA Read
that appears to be related to the          with Invalidate Local STag
execution of this Work Request. The error
type can be examined by looking at the
Terminate Message buffer via Query QP.

Invalid STag - An invalid STag was found   Send Operation Types,
in the local SGL. The STag was either not  Receive, RDMA Write, RDMA
found allocated, bound, or registered in   Read, RDMA Read with
the RI, or an STag of zero was specified   Invalidate Local STag, Bind,
for a QP without Privileged rights, or     Fast-Register, Invalidate
referred to a Shared Memory Region, or the Local STag
type of STag supplied was not allowed to
be used in the specified operation.







   Hilland, et al.        Expires October 2003             [Page 210]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Base & Bounds Violation - The local SGL    Send Operation Types,
referenced an address beyond the limits    Receive, RDMA Write, RDMA
specified for the MR or MW. This includes  Read, RDMA Read with
length errors. For a Bind, the MW was not  Invalidate Local STag, Bind
wholly contained in the MR.

Access Violation - The RNIC attempted to   Send Operation Types,
read or write to a local SGL MR or MW that Receive, RDMA Write, RDMA
did not provide appropriate Access Rights. Read, RDMA Read with
For a Bind, the MW Access Rights were not  Invalidate Local STag, Bind
compatible with the MR Access Rights.

Invalid PD ID - For one of the STags       Send Operation Types,
specified in the Work Request the PD of    Receive, RDMA Write, RDMA
the MR STag was not the same as the PD of  Read, RDMA Read with
the QP, or, the QP of the MW STag was not  Invalidate Local STag, Bind,
the same as QP.                            Fast-Register, Invalidate
                                           Local STag

Wrap Error - The specified Address or      Send Operation Types,
offset (TO or MO) added to the length of   Receive, RDMA Write, RDMA
the operation resulted in a wrap beyond    Read, RDMA Read with
the machine-supported address.             Invalidate Local STag, Bind,
                                           Fast-Register

STag to Invalidate had Invalid PD or       Receive
Access Rights - The Invalidate STag on a
Receive did not have a PD ID that matched
the PD ID of the QP (for a MR) or a QP ID
that matched the QP ID of the QP (for a
MW). Or the STag did not have Access
Rights to be invalidated remotely.

Zero RDMA Read Resources - The QP ORD      RDMA Read, RDMA Read with
value was set to zero.                     Invalidate Local STag

QP Not In Privileged Mode - The QP is not  Fast-Register
enabled to perform the Privileged WR.

STag Not In Invalid state - The STag was   Bind,
already registered or bound, when          Fast-Register
attempting to Register or Bind it.

Invalid Page Size - The page size          Fast-Register
requested was not supported by the RNIC.

Invalid Physical Buffer Size - size not    Fast-Register
supported by the RNIC.




   Hilland, et al.        Expires October 2003             [Page 211]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

Invalid Physical Buffer List entry - for   Fast-Register
page mode, the entry must start on page
size boundaries.

Invalid FBO - the FBO is larger than the   Fast-Register
physical buffer size.

Invalid length - requested length is       Fast-Register
larger than supported by the buffer list.

Invalid Access Rights specified.           Fast-Register

Physical Buffer List too long.             Fast-Register

Invalid Virtual Address - VA and FBO are   Fast-Register
not consistent.

Invalid Region - The STag specified for    Bind
the MR in the BIND request was invalid.

Invalid Window - The STag specified for    Bind
the MW in the BIND request was invalid.

Invalid Length - The total size of the     Send, Receive, RDMA Write,
data to be moved as specified by the sum   RDMA Read, RDMA Read with
of the SGL elements, was larger than that  Invalidate Local STag
supported by the RNIC.

                    Figure 36 - Completion Status Codes

9.5.3  Asynchronous Event Identifiers

   The following table contains the list of Event Identifiers and
   Resource Indicators that the RNIC MUST support as Asynchronous Event
   Identifiers to be returned by the Asynchronous Event Handler. Note
   that the Resource Indicator dictates that the appropriate Resource
   Identifier corresponding to that Resource Indicator MUST be returned
   as well. For more information, see Section 9.4.2 - Set Asynchronous
   Event Handler.












   Hilland, et al.        Expires October 2003             [Page 212]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

             Event Identifier and Description.             Resource
                                                            Indicator

   LLP Close Complete - The RDMA Stream has completed     QP ID
   Closing and no SQ WQEs were flushed.

   Terminate Message Received                             QP ID

   LLP Connection Reset - An incoming LLP Reset (e.g. RST QP ID
   on TCP) was received.

   LLP Connection Lost                                    QP ID

   LLP Integrity Error: Segment size invalid              QP ID

   LLP Integrity Error: Invalid CRC                       QP ID

   LLP Integrity Error: Bad FPDU - Received MPA marker    QP ID
   and 'Length' fields do not agree on the start of a
   FPDU

   Remote Operation Error: Invalid DDP version - caused   QP ID
   by an inbound segment.

   Remote Operation Error: Invalid RDMA version - caused  QP ID
   by an inbound segment.

   Remote Operation Error: Unexpected Opcode - caused by  QP ID
   an inbound segment.

   Remote Operation Error: Invalid DDP Queue Number -     QP ID
   caused by an inbound segment.

   Remote Operation Error: Invalid RDMA Read Request      QP ID
   Message, RDMA Read not enabled - caused by an inbound
   segment.

   Remote Operation Error: Invalid RDMA Write or RDMA     QP ID
   Read Response Message, RDMA Write & RDMA Read Response
   not enabled - caused by an inbound segment.

   Remote Operation Error: Invalid RDMA Read Request      QP ID
   Message, message size too small or Offset non-zero -
   caused by an inbound segment.

   Remote Operation Error: No 'L' bit when expected -     QP ID
   caused by an inbound segment.





   Hilland, et al.        Expires October 2003             [Page 213]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Protection Error: Invalid STag - caused by an inbound  QP ID
   Tagged DDP segment not valid for this QP. This
   includes using the STag of zero, the STag was not
   associated with the QP or the STag was in the Invalid
   state.

   Protection Error: Tagged Base and bounds violation -   QP ID
   caused by an inbound Tagged segment attempted to
   access memory outside the limits assigned to the STag.

   Protection Error: Tagged Access Rights violation -     QP ID
   caused by an inbound segment referencing a Tagged
   Buffer which did not have the necessary memory Access
   Rights for the requested operation.

   Protection Error: Tagged Invalid PD - caused by an     QP ID
   inbound segment referencing a Tagged Buffer which was
   not allowed to be referenced by QP.

   Protection Error: Wrap error - caused by an inbound    QP ID
   segment not targeting the RQ.

   Bad Close - The QP was in the Closing state when a     QP ID
   Segment arrived.

   Bad LLP Close - An attempt was made to close the RDMA  QP ID
   Stream with work in progress.

   RQ Protection Error - Invalid MSN - MSN range not      QP ID
   valid. Caused by an inbound segment targeting the RQ.
   Possibly due to Receive Queue being empty.

   RQ Protection Error - Invalid MSN - gap in MSN. Caused QP ID
   by an inbound segment targeting the RQ.

   IRRQ Protection Error: Invalid MSN - too many RDMA     QP ID
   Read Request Messages in progress - caused by an
   inbound segment not targeting the IRRQ.

   IRRQ Protection Error: Invalid MSN - gap in MSN -      QP ID
   caused by an inbound segment not targeting the RQ.

   IRRQ Protection Error: Invalid MSN - range is not      QP ID
   valid - caused by an inbound segment not targeting the
   RQ.

   IRRQ Protection Error: Invalid STag - Data Source STag QP ID
   determined to be invalid during RDMA Read Response




   Hilland, et al.        Expires October 2003             [Page 214]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   processing.

   IRRQ Protection Error: Tagged Base and bounds          QP ID
   violation - This includes RDMA Read Request of a
   message larger than supported by the RNIC. It is
   detected accessing the Data Source during RDMA Read
   Response processing.

   IRRQ Protection Error: Tagged Access Rights violation  QP ID
   - Data Source Access Rights violation detected during
   RDMA Read Response processing.

   IRRQ Protection Error: Tagged Invalid PD - Data Source QP ID
   PD violation detected during RDMA Read Response
   processing.

   IRRQ Protection Error: Wrap error - detected during    QP ID
   RDMA Read Response processing.

   CQ/SQ Error: CQ Overflow Error - An error occurred on  QP ID
   the CQ during a SQ completion.

   CQ/RQ Error: CQ Operation error - An error occurred on QP ID
   the CQ during a RQ completion.

   S-RQ error on a QP - An error occurred while           QP ID
   attempting to pull a WQE from the S-RQ associated with
   the QP.

   Local QP Catastrophic Error - occurred during          QP ID
   processing.

   CQ Overflow Detected - An overflow of the Completion   CQ Handle
   Queue has been detected. This Error Code is OPTIONAL.

   CQ Operation Error - An error occurred on the CQ       CQ Handle
   unrelated to a specific QP completion.

   Shared Receive Queue Limit reached - The Limit value   S-RQ Handle
   established for the Shared Receive Queue has been
   reached.

   QP RQ Limit Reached - The Limit value established for  QP ID
   the QP's RQ has been reached.

   Shared Receive Queue Catastrophic Failure - A problem  S-RQ Handle
   occurred with the RNIC or its driver that renders the
   RNIC unable to use the S-RQ.




   Hilland, et al.        Expires October 2003             [Page 215]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   RNIC Catastrophic Failure - A problem occurred with    RNIC Handle
   the RNIC or its driver that renders the RNIC unable to
   reliably function.

                Figure 37 - Asynchronous Event Identifiers















































   Hilland, et al.        Expires October 2003             [Page 216]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

10 Security Considerations

   Security Considerations are necessary for the RDMA Protocols and
   this specification.  An Internet Draft is under development.

















































   Hilland, et al.        Expires October 2003             [Page 217]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

11 IANA Considerations

   If DDP was enabled a priori for a ULP by connecting to a well-known
   port, this well-known port would be registered for the DDP with
   IANA.
















































   Hilland, et al.        Expires October 2003             [Page 218]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

12 References

12.1 Normative References

   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
       3", BCP 9, RFC 2026, October 1996.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
       Requirement Levels", BCP 14, RFC 2119, March 1997.

   [MPA] P. Culley et al., "Markers with PDU Alignment", RDMA
       Consortium Draft Specification draft-cully-iwarp-mpa-00.doc,
       October 2002

   [DDP] H. Shah et al., "Direct Data Placement over Reliable
       Transports", RDMA Consortium Draft Specification draft-shah-
       iwarp-ddp-00.txt, October 2002

   [RDMAP] R. Recio et al., "RDMA Protocol Specification", RDMA
       Consortium Draft Specification draft-recio-iwarp-00, October
       2002

   [SCTP] R. Stewart et al., "Stream Control Transmission Protocol",
       RFC 2960, October 2000.

   [TCP] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
       September 1981.

12.2 Informative References

    [IPSEC] Atkinson, R., Kent, S., "Security Architecture for the
       Internet Protocol", RFC 2401, November 1998.





















   Hilland, et al.        Expires October 2003             [Page 219]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

13 Appendix

13.1 Connection Initialization at LLP Startup

   The purpose of an initialization at LLP Startup is to enable iWARP
   using the minimum number of messages possible. Note that not all
   RNIC/OS implementations are required to support this.







             < Figure 39 did not convert properly from source >
             <  to be corrected in an upcoming version        >










     Figure 39 - Connection Initialization at LLP Startup (using TCP)


   Below is an example sequence for an iWARP startup that accomplishes
   this (other sequences are possible). The Sequence applies equally to
   either the active or passive side.

   *   The Consumer establishes the LLP Connection using a non-Verbs
       interface.

   *   The Consumer creates a QP, setting up the CQ, PD, etc., and
       registers memory for buffers.

   *   The Consumer posts buffers to the RQ appropriate for the
       expected traffic.

   *   If the ULP intends to transmit first, the Consumer could Post
       one or more Work Request(s) on the SQ (usually a SEND message)
       that will be sent after the QP is placed in the RTS state.







   Hilland, et al.        Expires October 2003             [Page 220]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   *   The Consumer moves the QP state to RTS. The Modify QP Verb for
       this includes the LLP Stream Handle, and does not include a
       streaming message buffer.

   *   If the local Consumer intends to perform RDMA Read Type WRs, the
       local Consumer obtains, in some ULP defined message, the number
       of incoming RDMA Read Request Messages that the Remote Peer can
       have outstanding (IRD). If the Remote Peer's IRD is smaller than
       the local Peer's ORD, the local Consumer should also perform a
       Modify QP Verb with the Remote Peer's IRD value placed into the
       local ORD value prior to posting the first RDMA Read Type WR.
       The local Consumer may also transmit, in some ULP defined
       message, the number of outgoing RDMA Read Request Messages that
       the Local Peer can have outstanding (ORD).

   *   If the local Consumer intends the QP to be the Data Source of
       RDMA Read Operations, the Consumer provides, in some ULP defined
       message, the number of incoming RDMA Read Request Messages (e.g.
       IRRQ depth) that the Local Peer can have outstanding (IRD). The
       Consumer may also receive, in some ULP defined message, the
       number of outgoing RDMA Read Request Messages that the Remote
       Peer can have outstanding (ORD). If the Remote Peer's ORD is
       smaller than the Local Peer's IRD, the local Consumer may also
       perform a Modify QP Verb with the Remote Peer's ORD value placed
       into the local IRD value prior to posting the first RDMA Read
       Type WR.

   This specification does not define which side of the connection
   sends the first message, the active or passive side; the ULP is
   responsible for determining this. In addition, this specification
   does not preclude the use of Active/Active connections.

   RNIC Implementers note: Since there is no integration between the RI
   and the LLP Connection startup sequence, as defined above, it is
   possible that some data may arrive over the transport before the
   RNIC is in iWARP mode. It is the responsibility of the RI to accept
   this data and interpret it as iWARP data. Alternately, the Consumer
   (or other service that establishes the LLP Connection) can ensure
   that no data will be received prior to moving the QP to RTS state.
   If neither of these methods is available, then iWARP startup with
   the LLP is not available.

13.2 Graceful Receive Overflow Handling

   A valid implementation option is to gracefully handle Receive Queue
   or Shared-Receive Queue overflow. In a strictly layered model, this
   may be difficult but in an RNIC implementation, this should be
   feasible.





   Hilland, et al.        Expires October 2003             [Page 221]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   In the current architecture, if there are no Receive Queue Work
   Queue Elements available when an Untagged Message arrives then the
   connection is dropped. This is true if there is a Shared Receive
   Queue or a dedicated receive queue.

   In this case, the implementation (RI/RNIC), which is not relying on
   an external LLP, may choose to handle this gracefully through LLP
   mechanisms. In this case, the RI will choose to not drop the
   connection and instead appear to pause receive queue processing
   until more WQEs have been posted to the RQ or S-RQ.

   How the RNIC decides to perform this function is left up to
   implementation. One example mechanism which may be used to
   gracefully handle receive overflow is for the implementation to drop
   incoming packets when there are no WQEs on the RQ or S-RQ. This type
   of mechanism may have side effects, such as causing back-off
   algorithms to be invoked, but this type of mechanism is still a
   valid implementation option.



































   Hilland, et al.        Expires October 2003             [Page 222]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

14 AuthorÆs Addresses

   Jeff Hilland
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: +1 (281) 514-9489
   Email: jeff.hilland@hp.com

   Paul R. Culley
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: +1 (281) 514-5543
   Email: paul.culley@hp.com

   James Pinkerton
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA. 98052 USA
   Phone: +1 (425) 705-5442
   Email: jpink@windows.microsoft.com

   Renato Recio
   IBM Corporation
   11501 Burnett Road
   Austin, TX 78758 USA
   Phone: +1 (512) 838-1365
   Email: recio@us.ibm.com
























   Hilland, et al.        Expires October 2003             [Page 223]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

15 Acknowledgments

   John Carrier
   Adaptec, Inc.
   691 S. Milpitas Blvd.
   Milpitas, CA 95035 USA
   Phone: +1 (360) 378-8526
   Email: john_carrier@adaptec.com

   Hari Ghadia
   Adaptec, Inc.
   691 S. Milpitas Blvd.,
   Milpitas, CA 95035 USA
   Phone: +1 (408) 957-5608
   Email: hari_ghadia@adaptec.com

   Patricia Thaler
   Agilent Technologies, Inc.
   1101 Creekside Ridge Drive, #100
   M/S-RG10
   Roseville, CA 95678
   Phone: +1 (916) 788-5662
   email: pat_thaler@agilent.com

   Mike Penna
   Broadcom Corporation
   16215 Alton Parkway
   Irvine, California 92619-7013 USA
   Phone: +1 (949) 926-7149
   Email: MPenna@Broadcom.com

   Uri Elzur
   Broadcom Corporation
   16215 Alton Parkway
   Irvine, California 92619-7013 USA
   Phone: +1 (949) 585-6432
   Email: Uri@Broadcom.com

   Ted Compton
   EMC Corporation
   Research Triangle Park, NC 27709, USA
   Phone: +1 (919) 248-6075
   Email: compton_ted@emc.com

   Dwight Barron
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: +1 (281) 514-2769
   Email: Dwight.Barron@Hp.com



   Hilland, et al.        Expires October 2003             [Page 224]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Mallikarjun Chadalapaka
   Hewlett-Packard Company
   8000 Foothills Blvd.
   Roseville, CA 95747-5668, USA
   Phone: +1 (916) 785-5621
   Email: cbm@rose.hp.com

   Dave Garcia
   Hewlett-Packard Company
   19333 Vallco Parkway
   Cupertino, Ca. 95014 USA
   Phone: +1 (408) 285-6116
   Email: dave.garcia@hp.com

   Mike Krause
   Hewlett-Packard Company, 43LN
   19410 Homestead Road
   Cupertino, CA 95014 USA
   Phone: +1 (408) 447-3191
   Email: krause@cup.hp.com

   Jim Wendt
   Hewlett-Packard Company
   8000 Foothills Boulevard
   Roseville, CA 95747-5668 USA
   Phone: +1 (916) 785-5198
   Email: jim_wendt@hp.com

   John L. Hufferd
   IBM Corp.
   650 Harry Rd.
   San Jose CA
   Phone: +1 (408) 256-0403
   Email: hufferd@us.ibm.com

   Mike Ko
   IBM Corp.
   650 Harry Rd.
   San Jose, CA 95120, USA
   Phone: +1 (408) 927-2085
   Email: mako@us.ibm.com

   Ellen Deleganes
   Intel Corporation
   MS JF5-355
   2111 NE 25th Ave.
   Hillsboro, OR 97124 USA
   Phone: +1 (503) 712-4173
   Email: ellen.m.deleganes@intel.com




   Hilland, et al.        Expires October 2003             [Page 225]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

   Frank Berry
   Intel Corporation
   2111 NE 25th Ave.
   Hillsboro, OR 97124 USA
   Phone: +1 (503) 712-3897
   Email: frank.berry@intel.com

   Howard C. Herbert
   Intel Corporation
   MS CH7-404
   5000 West Chandler Blvd.
   Chandler, AZ 85226 USA
   Phone: +1 (480) 554-3116
   Email: howard.c.herbert@intel.com

   Dave Minturn
   Intel Corporation
   MS JF1-210
   5200 North East Elam Young Parkway
   Hillsboro, OR 97124 USA
   Phone: +1 (503) 712-4106
   Email: dave.b.minturn@intel.com

   Hemal Shah
   Intel Corporation
   MS PTL1
   1501 South Mopac Expressway, #400
   Austin, TX 78746 USA
   Phone: +1 (512) 732-3963
   Email: hemal.shah@intel.com

   James Livingston
   NEC Solutions (America), Inc.
   7525 166th Ave. N.E., Suite D210
   Redmond, WA 98052-7811
   Phone: +1 (425) 897-2033
   Email: james.livingston@necsam.com

   Tom Talpey
   Network Appliance
   375 Totten Pond Road
   Waltham, MA 02451 USA
   Phone: +1 (781) 768-5329
   Email: thomas.talpey@netapp.com









   Hilland, et al.        Expires October 2003             [Page 226]


   Internet-Draft      RDMA Verbs Specification            25 Apr 2003

16 Full Copyright Statement

   This document and the information contained herein is provided on an
   ææAS ISÆÆ basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
   CORPORATION, CISCO SYSTEMS INC., DELL COMPUTER CORPORATION, EMC
   CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
   MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION, NEC
   SOLUTIONS (AMERICA), INC., NETWORK APPLIANCE INC., THE INTERNET
   SOCIETY, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

   Copyright (c) 2002, 2003 ADAPTEC INC., BROADCOM CORPORATION, CISCO
   SYSTEMS INC., DELL COMPUTER CORPORATION, EMC CORPORATION, HEWLETT-
   PACKARD COMPANY, INTERNATIONAL BUSINESS MACHINES CORPORATION, INTEL
   CORPORATION, MICROSOFT CORPORATION, NETWORK APPLIANCE INC., All
   Rights Reserved.


































   Hilland, et al.        Expires October 2003             [Page 227]