INTERNET DRAFT                                               V. Kashyap
<draft-kashyap-ipoib-connected-mode-00.txt>                         IBM
Expiration Date: August 2003                              February 2003

                    IP over InfiniBand: Connected Mode

Status of this memo

    This document is an Internet-Draft and is in full conformance
    with all provisions of Section 10 of RFC 2026.

    Internet-Drafts are working documents of the Internet
    Engineering Task Force (IETF), its areas, and its working
    groups. Note that other groups may also distribute working
    documents as Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use
    Internet-Drafts as reference material or to cite them other
    than as "work in progress".

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed
    at http://www.ietf.org/shadow.html

    This memo provides information for the Internet community.
    This memo does not specify an Internet standard of any kind.
    Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

    The InfiniBand Architecture (IBA) defines a high-speed,
    channel-based interconnect between systems and devices. IBA
    provides multiple transport service modes with differing
    characteristics. This document describes IP over IBA's connected
    transport modes.








Kashyap                                                         [Page 1]


INTERNET-DRAFT            Connected mode IPoIB             February 2003


Table of Contents

    1.0        Introduction
    2.0        IPoIB-connected mode
    2.1        Outline of Address Resolution
    2.2        Outline of Connection Setup
    3.0        Address Resolution
    4.0        Connection Setup
    4.1        Service ID
    4.2        MTU
    5.0        IP Encapsulation
    6.0        Security Considerations
    7.0        References


1.0 Introduction

    IBA defines two connected modes:

        1. Reliable Connected (RC)
        2. Unreliable Connected (UC)

    As the names indicate, the two modes differ mainly in whether
    delivery of data across the connection is reliable. This document
    treats them together because the discussion applies equally to
    both; collectively they are referred to as IPoIB-CM (connected
    mode). Where a distinction is needed, IPoIB over the reliable
    connected mode is referred to as IPoIB-RC and IPoIB over the
    unreliable connected mode as IPoIB-UC.


    The connected modes offer link MTUs of up to 2^31 bytes, whereas
    the datagram modes are limited to 4096 bytes. Connected modes can
    therefore offer significant benefits by supporting much larger
    MTUs. Reliability is also enhanced if the 'automatic path
    migration' feature supported by the connected modes is utilised
    [IBARCH].


    This document presents a method of address resolution and
    transmission of IP packets over connected modes of IBA.


    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
    NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
    in this document are to be interpreted as described in RFC 2119.




2.0 IPoIB-connected mode

    The connected modes of IBA do not support multicasting, although
    every node is capable of communicating with, i.e. setting up a
    connection to, every other node in the InfiniBand (IB) network.

    Consequently, IPoIB-CM cannot rely on native broadcast or
    multicast over the connected transport to send out an address
    resolution query, and an ARP server is not an efficient
    alternative. Fortunately, IPoIB-CM can use the approach outlined
    below.


2.1 Outline of Address Resolution

    IBA requires that all Host Channel Adapters (HCAs) support the
    reliable and unreliable connected modes. Support for the connected
    modes is optional for Target Channel Adapters (TCAs). IBA also
    requires all HCAs and TCAs to support the unreliable datagram
    mode, which does support multicasting. IPoIB over unreliable
    datagram (IPoIB-UD), as presented in [IPoIB_MCAST] and
    [IPoIB_ENCAP], requires the IB subnet to support IB-level
    multicast.

    The above makes it possible to use a multicast query for IPoIB-CM
    address resolution. Address resolution in IPoIB-CM networks uses
    an unreliable datagram (UD) queue pair (QP).


    [IPoIB_ENCAP] proposes that the address resolution query be
    multicast to an IB multicast address that is joined by every
    member of the IPoIB subnet. This IB multicast address is referred
    to as the 'broadcast-GID' [IPoIB_ENCAP]. This document extends the
    requirement of joining the 'broadcast-GID' to IPoIB-CM by
    associating an unreliable datagram QP with it.

    A broadcast-GID is formed from the scope bits, the IP version and
    the partition key (P_Key) associated with the subnet. These three
    parameters must therefore be known to the node before an IPoIB
    interface can be brought up. The exact format and the rules for
    setting up the broadcast-GID are defined in [IPoIB_MCAST].

    An implementation MAY use the same unreliable datagram (UD) queue
    pair (QP) as used by the IPoIB-UD implementation if the latter
    mode is supported in the same partition and scope.

    During address resolution, therefore, the query is sent out on the
    broadcast-GID and a unicast reply is received on the UD QP
    associated with IPoIB-CM.



2.2 Outline of Connection Setup

    Address resolution is only the first step. Once the link address
    of the remote node is known, a connection must be set up between
    the nodes before any IP communication can occur.

    To make a connection, the sender must know the service-ID to use
    in the connection request [IBARCH]. It must also supply its queue
    pair to the remote node, and the peer replies with its own queue
    pair. Note that every connection is peer to peer and does not use
    shared queue pairs (QPs): each connection uses a pair of QPs, one
    on each node, and each is unique on its node.

    Although address resolution occurs per individual address, the
    connection between the nodes is at the IB layer. An individual
    address resolution therefore does not necessarily imply a new
    connection between the peers.


3.0 Address Resolution

    Every IPoIB-CM node MUST join the broadcast-GID associated with the
    subnet. The address resolution query is always sent out on the
    broadcast-GID.


    IPoIB encapsulation [IPoIB_ENCAP] describes the link-layer address
    as follows:

        <1 octet reserved>:QP: GID


    This document extends the link-layer address as follows:

        <flags>:QP-cookie:GID


    QP-cookie:  An IB connection, as noted above, uses a pair of QPs,
                one on each node. An implementation therefore cannot
                advertise the QP it intends to use as part of the
                address resolution query, since such a message is seen
                by all members of the subnet. Instead an
                implementation-specific, 3-octet QP-cookie is used.

                The QP-cookie is used by the peer for two purposes:

                a.  The QP-cookie is used to form the service-ID that
                    is used in the IB connection messages. Based on
                    the service-ID, and thereby the QP-cookie, the
                    receiver can decide which QP to create, or
                    whether to deny the request.

                b.  The QP-cookie is used to determine whether a
                    connection already exists to the peer. In such a
                    case the node can avoid an attempt at connection
                    setup. Note that the GID cannot be used for this
                    purpose since a GID can be shared by multiple
                    interfaces.


    Flags:      This is a single-octet field. If bit 0 is set, then
                in the sender's view the subnet is built over IB's
                'reliable connected' (RC) mode. If bit 1 is set, then
                in the sender's view the subnet is built over IB's
                'unreliable connected' (UC) mode. All other bits in
                the octet are reserved and MUST be set to 0.

                The RC and UC flags are mutually exclusive: they MUST
                NOT both be set at the same time.
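
    The extended link-layer address can be illustrated with a short,
    non-normative Python sketch. It assumes that the address is laid
    out as one flags octet, a 3-octet QP-cookie in network byte order
    and a 16-octet GID, and that 'bit 0' denotes the most significant
    bit of the flags octet (the usual IETF bit numbering); this
    document does not state the bit or byte order. The names
    LinkAddress, pack_link_address and unpack_link_address are
    hypothetical.

```python
from typing import NamedTuple

# Assumption: bit 0 is the most significant bit of the flags octet.
FLAG_RC = 0x80  # bit 0: subnet uses reliable connected mode
FLAG_UC = 0x40  # bit 1: subnet uses unreliable connected mode
RESERVED_MASK = 0xFF & ~(FLAG_RC | FLAG_UC)


class LinkAddress(NamedTuple):
    flags: int      # 1 octet
    qp_cookie: int  # 3 octets
    gid: bytes      # 16 octets (an IB GID is 128 bits long)


def pack_link_address(addr: LinkAddress) -> bytes:
    """Serialise <flags>:QP-cookie:GID into 20 octets."""
    if addr.flags & RESERVED_MASK:
        raise ValueError("reserved flag bits must be 0")
    if (addr.flags & FLAG_RC) and (addr.flags & FLAG_UC):
        raise ValueError("RC and UC flags are mutually exclusive")
    if not 0 <= addr.qp_cookie < (1 << 24):
        raise ValueError("QP-cookie must fit in 3 octets")
    if len(addr.gid) != 16:
        raise ValueError("GID must be 16 octets")
    # Assumption: the QP-cookie is carried in network byte order.
    return bytes([addr.flags]) + addr.qp_cookie.to_bytes(3, "big") + addr.gid


def unpack_link_address(raw: bytes) -> LinkAddress:
    """Split a 20-octet link-layer address into its three fields."""
    if len(raw) != 20:
        raise ValueError("link-layer address must be 20 octets")
    return LinkAddress(raw[0], int.from_bytes(raw[1:4], "big"), raw[4:])
```

    Packing rejects an address that sets reserved bits or both mode
    flags, which mirrors the two MUST-level rules on the flags octet.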


    The receiver replies with its own link-address and set of flags.
    If the flags do not match, there is a misconfiguration, since
    members of the same subnet expect different link characteristics
    (IB modes). In such a case a suitable error indication SHOULD be
    provided to the administrator.
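
    The flag comparison described above might be sketched as follows.
    This is an illustration only: the function name check_reply_flags
    and the wording of the messages are hypothetical, and the flag
    values assume that 'bit 0' is the most significant bit of the
    flags octet, which this document does not state.

```python
from typing import Optional

# Assumption: bit 0 is the most significant bit of the flags octet.
FLAG_RC = 0x80
FLAG_UC = 0x40
MODE_BITS = FLAG_RC | FLAG_UC


def check_reply_flags(local_flags: int, reply_flags: int) -> Optional[str]:
    """Return None if the reply agrees with the local view of the
    subnet, otherwise a message to report to the administrator."""
    if (reply_flags & MODE_BITS) == MODE_BITS:
        return "peer set both the RC and UC flags"
    if (reply_flags & MODE_BITS) != (local_flags & MODE_BITS):
        return "peer expects a different IB mode on this subnet"
    return None
```

    How the message is surfaced (log entry, counter, management trap)
    is left to the implementation.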

    As in the case of IPoIB over unreliable datagram (IPoIB-UD), the
    receiver's reply is unicast back to the sender after the receiver
    has resolved the sender's GID to a LID.

    Once address resolution is completed, the connection may be set
    up.


4.0 Connection Setup

    The connection may be set up by either peer, though it is more
    likely that the one that initiated the address resolution phase,
    probably as a result of the need to send IP data, will initiate
    the connection setup. IBA allows passive-active and active-active
    connection setup.

    The service ID used when setting up the connection is derived from
    the QP-cookie received during the address resolution process. A
    node MAY return the same cookie for multiple addresses. For
    example, the node might support multiple subnets over the same GID
    and prefer to make only one IB-level connection. This includes
    using the same IB connection for both IPv4 and IPv6 to a peer
    node. The choice of the QP-cookie is implementation dependent.

    Therefore, the end initiating the connection needs to defer to the
    peer's choice. If the peer has returned the same QP-cookie in
    response to multiple address resolution requests then a separate
    connection attempt SHOULD NOT be made for each of those addresses.
    This is true even if the requestor itself presented different
    QP-cookies, and would have created separate QPs had it received
    the request instead.

    If a node does receive a connection request for the same
    service-ID from the same peer, it is up to the implementation
    whether to honour or reject it.
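
    One way an initiator might track existing connections is sketched
    below. The sketch is an assumption, not part of this
    specification: it keys connections on the pair (peer GID, peer
    QP-cookie), and the class and method names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical key: the peer's GID plus the QP-cookie the peer
# returned during address resolution. The GID alone is not enough,
# since a GID can be shared by multiple interfaces.
ConnKey = Tuple[bytes, int]


class ConnectionTable:
    """Remembers which (GID, QP-cookie) pairs already have a connection."""

    def __init__(self) -> None:
        self._conns: Dict[ConnKey, str] = {}

    def needs_connection(self, peer_gid: bytes, peer_cookie: int) -> bool:
        """True if no connection exists yet for this GID and cookie."""
        return (peer_gid, peer_cookie) not in self._conns

    def record(self, peer_gid: bytes, peer_cookie: int, handle: str) -> None:
        """Store an opaque handle for a newly established connection."""
        self._conns[(peer_gid, peer_cookie)] = handle
```

    With such a table, two IP addresses that resolve to the same GID
    and QP-cookie share one IB connection, while a different cookie
    from the same GID leads to a new connection attempt.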

4.1 Service ID


    The InfiniBand specification defines a block of service IDs
    for IETF use. The InfiniBand specification has left the
    definition and management of this block to the IETF. The
    64-bit block is:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|<-------------------IETF use--------------------------------->|
+--------+--------+--------+--------+--------+--------+--------+--------+

    The ServiceIDs used by IPoIB will use the following format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|Reserved| 3-octet  QP  cookie      |Reserved|Reserved|Reserved|
+--------+--------+--------+--------+--------+--------+--------+--------+

    The Reserved fields MUST be transmitted as zeroes. They are
    ignored on reception.
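
    The mapping between a QP-cookie and the ServiceID can be sketched
    as below. The sketch assumes that the leftmost octet in the
    figure is the most significant octet of the 64-bit value; the
    function names are hypothetical.

```python
IETF_PREFIX = 0x01  # first octet of the block reserved for IETF use


def service_id_from_cookie(qp_cookie: int) -> int:
    """Build the 64-bit ServiceID; all reserved octets are sent as 0."""
    if not 0 <= qp_cookie < (1 << 24):
        raise ValueError("QP-cookie must fit in 3 octets")
    # Octet 0 = 0x01, octet 1 reserved, octets 2-4 = cookie,
    # octets 5-7 reserved.
    return (IETF_PREFIX << 56) | (qp_cookie << 24)


def cookie_from_service_id(service_id: int) -> int:
    """Extract the QP-cookie, ignoring the reserved octets."""
    if (service_id >> 56) != IETF_PREFIX:
        raise ValueError("ServiceID is not in the IETF block")
    return (service_id >> 24) & 0xFFFFFF
```

    For example, under these assumptions the QP-cookie 0xABCDEF maps
    to the ServiceID 0x0100ABCDEF000000.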

4.2 MTU

    The IB connection might be used for both IPv4 and IPv6, or it
    could be used for only one of them while a different connection is
    used for the other. If the connection is used for both IPv4 and
    IPv6, the link MTU MUST be able to support the minimum MTU
    required by both IPv4 and IPv6.

    Every connection setup message includes a 'private data' field
    [IBARCH]. The private data field MUST carry the following
    information:

        0               15
        +----------------+
        | Desired   MTU  |
        +----------------+
        | Minimum MTU    |
        +----------------+

    The connection request (CM REQ) MUST carry the requested MTU in
    the 'Desired MTU' field and the minimum acceptable MTU in the
    'Minimum MTU' field. If the requested MTU is not acceptable to the
    peer, the peer MUST indicate its preferred value in the 'Desired
    MTU' field when rejecting (CM REJ) the request. If the 'Desired
    MTU' is lower than the minimum MTU that the peer can support, the
    connection MUST be rejected (CM REJ) with the peer's minimum
    acceptable MTU in both the 'Desired MTU' and 'Minimum MTU' fields.
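
    The MTU exchange can be sketched as follows. Several details are
    assumptions made for illustration, since this document does not
    fix them: the two 16-bit fields are taken to be in network byte
    order, a requested MTU is treated as acceptable when it lies
    between the responder's own minimum and maximum, and an accepting
    reply echoes the agreed MTU. The function names are hypothetical.

```python
import struct
from typing import Tuple


def pack_mtu(desired: int, minimum: int) -> bytes:
    """Encode 'Desired MTU' and 'Minimum MTU' as two 16-bit fields."""
    # Assumption: network byte order.
    return struct.pack("!HH", desired, minimum)


def unpack_mtu(private_data: bytes) -> Tuple[int, int]:
    """Decode the first four octets into (desired, minimum)."""
    desired, minimum = struct.unpack("!HH", private_data[:4])
    return desired, minimum


def answer_req(private_data: bytes, local_min: int,
               local_max: int) -> Tuple[str, bytes]:
    """Return the CM message type and the private data to send back."""
    desired, _requester_min = unpack_mtu(private_data)
    if desired < local_min:
        # Below what can be supported: both fields carry the minimum.
        return "REJ", pack_mtu(local_min, local_min)
    if desired > local_max:
        # Not acceptable: reject, indicating the preferred value.
        return "REJ", pack_mtu(local_max, local_min)
    # Assumption: an accepting reply echoes the agreed MTU.
    return "REP", pack_mtu(desired, local_min)
```

    A requester that receives a rejection could retry with the
    'Desired MTU' indicated by the peer, provided that value is not
    below its own minimum.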


5.0 IP Encapsulation

    IP encapsulation is done as defined in the IPoIB encapsulation
    specification [IPoIB_ENCAP].

    IP multicast cannot be carried over the IPoIB-CM modes.

6.0 Security Considerations

    An imposter may return a false set of flags to a node. This may
    cause unnecessary connection attempts and some delay or disruption
    in IPoIB communication. The same is true if wrong or spurious
    QP-cookie values are provided.


7.0 References

[IBARCH]         InfiniBand Architecture Specification, version 1.1,
                 www.infinibandta.org

[IPoIB_ARCH]     draft-ietf-ipoib-architecture-01.txt, V. Kashyap

[IPoIB_ENCAP]    draft-ietf-ipoib-ip-over-infiniband-01.txt,
                 V. Kashyap, H.K. Jerry Chu

[IPoIB_MCAST]    draft-ietf-ipoib-link-multicast-02.txt,
                 H.K. Jerry Chu, V. Kashyap

8.0 Author's Address

Vivek Kashyap

15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

    Copyright (C) The Internet Society (2001). All Rights Reserved.

    This document and translations of it may be copied and
    furnished to others, and derivative works that comment on or
    otherwise explain it or assist in its implementation may be
    prepared, copied, published and distributed, in whole or in
    part, without restriction of any kind, provided that the above
    copyright notice and this paragraph are included on all such
    copies and derivative works. However, this document itself may
    not be modified in any way, such as by removing the copyright
    notice or references to the Internet Society or other Internet
    organizations, except as needed for the purpose of developing
    Internet standards in which case the procedures for copyrights
    defined in the Internet Standards process must be followed, or
    as required to translate it into languages other than
    English.

    The limited permissions granted above are perpetual and will
    not be revoked by the Internet Society or its successors or
    assigns.

    This document and the information contained herein is provided
    on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
    USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
    ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
    PARTICULAR PURPOSE.