INTERNET DRAFT                                               V. Kashyap
<draft-kashyap-ipoib-connected-mode-02.txt>                         IBM
Expiration Date: December 2004                                June 2004



                   IP over InfiniBand: Connected Mode



Status of this memo


    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC 2026.


    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note that
    other groups may also distribute working documents as Internet-
    Drafts.


    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as Reference
    material or to cite them other than as ``work in progress''.


    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt


    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html


    This memo provides information for the Internet community.  This
    memo does not specify an Internet standard of any kind.
    Distribution of this memo is unlimited.


Copyright Notice


    Copyright (C) The Internet Society (2001).  All Rights Reserved.



Abstract


        This document specifies a method for transmitting IPv4/IPv6
        packets and performing address resolution over the connected
        modes of InfiniBand.









Kashyap                                                         [Page 1]


INTERNET-DRAFT            Connected mode IPoIB                 June 2004



        Table of Contents


        1.0       Introduction
        2.0       IPoIB-connected mode
        2.1       Multicasting
        2.2       Outline of Address Resolution
        2.3       Outline of Connection Setup
        3.0       Address Resolution
        3.1       Link-layer Address
        3.2       IB Connection Setup
        3.3       Service-ID
        4.0       Frame Format
        5.0       Maximum Transmission Unit
        5.1       Per-Connection MTU
        6.0       Security Considerations
        7.0       References


1.0 Introduction


        The InfiniBand specification [IB_ARCH] can be found at
        www.infinibandta.org.  The document [IPoIB_ARCH] provides a
        short overview of InfiniBand architecture along with
        consideration for specifying IP over InfiniBand networks.


        The InfiniBand architecture (IBA) defines multiple transport
        modes. Of these, the unreliable datagram (UD) transport
        method best matches the needs of IP. IP over InfiniBand (IPoIB)
        over UD is described in [IPoIB_ENCAP]. This document describes
        IP transmission over the connected modes of IBA.


        IBA defines two connected modes:


             1. Reliable Connected (RC)
             2. Unreliable Connected (UC)


        As is evident from the nomenclature, the two modes differ mainly
        in providing reliability of data delivery across the connection.
        This document applies equally to both the connected modes.
        IPoIB over these two modes is referred to as IPoIB-CM (connected
        mode) in this document.  For clarity IPoIB over the unreliable
        datagram mode, as described in [IPoIB_ENCAP] is referred to as
        IPoIB-UD.


        IBA requires that all Host Channel Adapters (HCAs) support the
        reliable and unreliable connected modes [IB_ARCH]. It is
        optional for Target Channel Adapters (TCAs) to support the
        connected modes.





        The connected modes offer link MTUs of up to 2^31 octets in
        length.  Thus the use of connected modes can offer significant
        benefits by supporting reasonably large MTUs. The datagram modes
        of InfiniBand Architecture (IBA) are limited to 4096 octets.


        Reliability is also enhanced if the underlying feature of
        "automatic path migration", supported by the connected modes,
        is utilized.


        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
        NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described
        in RFC 2119.


2.0 IPoIB-connected mode


        This document refers extensively to [IPoIB_ENCAP] and extends
        the IPoIB description given there to IPoIB-CM. Therefore, only
        the additional requirements or enhancements needed to enable
        IPoIB-CM are described.


        The IP encapsulation, default MTU, link layer address format and
        the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM
        exactly as described in [IPoIB_ENCAP].


2.1 Multicasting


        The connected modes of IBA define a non-broadcast, multiple
        access network. The connected modes of IBA do not support
        multicasting though every node can communicate with every other
        node if desired.


        This requires that multicasting be emulated in some form by the
        network. However, in the case of an InfiniBand network, instead
        of an emulation, an unreliable datagram (UD)  queue pair (QP)
        can be used to support multicasting while the connected mode  QP
        is used for unicast traffic. Since IBA requires all channel
        adapters to support the UD mode, every implementation supporting
        IPoIB-CM will also be able to utilize UD QPs.


        Multicast mapping, transmission and reception of multicast
        packets, and multicast routing are performed over the UD QP
        associated with the IPoIB-CM interface in accordance with
        [IPoIB_ENCAP].


2.2 Outline of Address Resolution


        Every IPoIB-CM interface MUST have two QPs associated with it:




                1) A connected mode QP
                2) An unreliable datagram mode QP


        [IPoIB_ENCAP] proposes that the address resolution query is
        multicast over an IB multicast address that is joined by every
        member of the IPoIB subnet. This IB multicast address is
        referred to as the "broadcast-GID" [IPoIB_ENCAP]. This document
        extends the requirement of joining the "broadcast-GID" to IPoIB-
        CM too by requiring every IPoIB-CM interface to "FullMember"
        join the broadcast-GID using the associated UD QP.


        A broadcast-GID is formed from the scope bits, the IP version
        and the partition key (P_Key) associated with the subnet. Thus
        these three parameters must be known to the node before an
        IPoIB interface can be brought up. The exact format and rules
        to set up the broadcast-GID are defined in [IPoIB_ENCAP].


2.3 Outline of Connection Setup


        Once the link address of the remote node is known, an IB
        connection must be set up between the nodes before any IP
        communication may occur.


        To make a connection, the sender must know the service-ID to use
        in the connection request [IB_ARCH]. It must also supply its
        "connected mode" queue pair to the remote node. The peer replies
        with its own queue pair. Each IB connection is peer to peer and
        uses one connected mode QP at each end.


        Though address resolution occurs at the level of an individual
        IP address, the connection between the nodes is at the IB layer.
        Therefore not every address resolution implies a new connection
        between the peers.


3.0 Address Resolution


        Address resolution queries are sent out on the "broadcast-GID"
        over the UD QP associated with the IPoIB-CM interface. A unicast
        reply is received on the UD QP associated with the IPoIB-CM
        interface.


        An IPoIB-CM implementation MAY use the same UD QP as used by the
        IPoIB-UD implementation if the latter mode is supported in the
        same partition and scope.







3.1 Link-layer Address


        IPoIB encapsulation [IPoIB_ENCAP] describes the link-layer
        address as follows:


            <1 octet reserved>:QPN:GID


        This document extends the link-layer address as follows:


            <Flags>:QPN:GID


            Flags:


                This is a single octet field. If bit 0 is set then it
                implies that, in the sender's view, the subnet is built
                over IB's "reliable connected" i.e. RC mode. If bit 1 is
                set then it implies that the subnet is built over IB's
                "unreliable connected" i.e. UC mode. All other bits in
                the octet are reserved and MUST be set to 0.


                If IPoIB-CM is not supported i.e. if the implementation
                only supports IPoIB-UD, then the implementation MUST
                ignore the <Flags> on reception. It MUST set the <Flags>
                octet to all zeroes as specified in [IPoIB_ENCAP].


                The RC and UC flags MUST NOT both be set at the same
                time.  They are mutually exclusive.


                The format of the flags is:


                    +--+--+--+--+--+--+--+--+
                    |RC|UC| 0| 0| 0| 0| 0| 0|
                    +--+--+--+--+--+--+--+--+


                Note:
                    The above implies that a given IP subnet can only be
                    supported on one of the InfiniBand modes at any
                    time. If the link-layer address includes no flags
                    then the interface is part of an IPoIB-UD subnet; if
                    it includes the RC flag then the interface is part
                    of an IPoIB-RC subnet; if it includes the UC flag
                    then the interface is part of an IPoIB-UC subnet.


            QPN:


                The queue-pair number (QPN) on which the unicast address
                resolution reply will be received. This allows the
                IPoIB-UD address resolution code and method to be used
                for IPoIB-CM address resolution.


                The QPN also serves another purpose. It is used to form
                the Service-ID that is used to setup the IB connection.
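

        As an illustration only, the extended link-layer address can be
        sketched in Python. The sketch assumes that bit 0 is the most
        significant bit of the Flags octet, as the diagram suggests, and
        that the QPN occupies 3 octets and the GID 16 octets as in
        [IPoIB_ENCAP]. The function and constant names are hypothetical.

```python
# Assumption: bit 0 is the leftmost (most significant) bit of the octet.
FLAG_RC = 0x80  # bit 0: subnet built over reliable connected mode
FLAG_UC = 0x40  # bit 1: subnet built over unreliable connected mode


def pack_link_address(flags: int, qpn: int, gid: bytes) -> bytes:
    """Build the 20-octet <Flags>:QPN:GID link-layer address."""
    if flags & ~(FLAG_RC | FLAG_UC) & 0xFF or not 0 <= flags <= 0xFF:
        raise ValueError("reserved flag bits must be zero")
    if flags == (FLAG_RC | FLAG_UC):
        raise ValueError("RC and UC flags are mutually exclusive")
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    return bytes([flags]) + qpn.to_bytes(3, "big") + gid


def unpack_link_address(addr: bytes):
    """Split a 20-octet link-layer address into (flags, qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("link-layer address must be 20 octets")
    return addr[0], int.from_bytes(addr[1:4], "big"), addr[4:]
```

        An IPoIB-UD-only implementation would pass flags of zero when
        packing and ignore the first returned element when unpacking.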


        On receiving the multicast/broadcast address resolution request
        the receiver replies with its own link-address, including the
        associated UD QPN and the appropriate flag. If the flags do not
        match then there is a misconfiguration since the underlying IB
        modes do not match. In such a case a suitable error indication
        SHOULD be provided to the administrator.
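

        The flag comparison described above might look like the
        following Python sketch. It assumes the same bit assignment as
        the Flags diagram (RC in the most significant bit) and uses the
        standard logging module as a stand-in for whatever error
        indication an implementation provides; all names are
        hypothetical.

```python
import logging

FLAG_RC = 0x80  # assumption: bit 0 is the most significant bit
FLAG_UC = 0x40


def mode_from_flags(flags: int) -> str:
    """Map a Flags octet to the IPoIB mode it advertises."""
    rc = bool(flags & FLAG_RC)
    uc = bool(flags & FLAG_UC)
    if rc and uc:
        raise ValueError("RC and UC flags are mutually exclusive")
    if rc:
        return "RC"
    if uc:
        return "UC"
    return "UD"


def peer_mode_matches(local_flags: int, peer_flags: int) -> bool:
    """Return True if both ends agree on the mode; warn otherwise."""
    local_mode = mode_from_flags(local_flags)
    peer_mode = mode_from_flags(peer_flags)
    if local_mode != peer_mode:
        logging.warning("IPoIB mode mismatch: local %s, peer %s",
                        local_mode, peer_mode)
        return False
    return True
```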


        The receiver's reply is unicast back to the sender after the
        receiver has, as in the case of IPoIB over unreliable datagram
        (IPoIB-UD), resolved the GID to the LID and determined other
        required parameters [IPoIB_ENCAP].


        Once the address resolution is completed the underlying IB
        connection can be set up.


3.2 IB Connection Setup


        The IB reliable/unreliable connected mode connection may be set
        up by either peer, though it is more likely that the one that
        initiated the address resolution phase, probably as a result of
        the need to send IP data, will initiate the connection setup.
        IBA allows passive-active and active-active connection setup.


        To set up a connection, IB Management Datagrams (MADs) are
        directed to the peer's communication manager (CM). The
        connection request always contains a Service-ID for the peer to
        associate the request with the appropriate entity. If the
        request is accepted the peer returns the relevant connected mode
        QPN in the response MAD. The format of the CM connection
        messages and the IB connection setup process are described in
        [IB_ARCH].


        The CM messages include, among other parameters, the Service-ID,
        Local QPN, and the payload size to use over the connection.


        Note:
            The IB connection is set up using the Service-ID as defined
            in section 3.3. The node MUST keep a record of the IB
            connections it is participating in. The node SHOULD NOT
            attempt another connection to the remote peer using the same
            Service-ID as used for an existing IB connection.
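

        One way to keep such a record is a table keyed by the remote GID
        and the Service-ID, so that repeated address resolutions for the
        same peer reuse the existing connection. The Python sketch below
        is only an illustration; the class name and the connect callback
        (which stands in for the actual CM exchange) are hypothetical.

```python
class ConnectionTable:
    """Record of IB connections, keyed by (remote GID, Service-ID)."""

    def __init__(self):
        self._connections = {}

    def get_or_connect(self, remote_gid: bytes, service_id: int, connect):
        """Return the existing connection, or create one via connect().

        connect is a caller-supplied callable taking (remote_gid,
        service_id) and returning a connection handle.
        """
        key = (remote_gid, service_id)
        if key not in self._connections:
            self._connections[key] = connect(remote_gid, service_id)
        return self._connections[key]

    def remove(self, remote_gid: bytes, service_id: int) -> None:
        """Forget a connection, for example after it is torn down."""
        self._connections.pop((remote_gid, service_id), None)
```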






3.3 Service-ID


        The InfiniBand specification defines a block of service IDs for
        IETF use. The InfiniBand specification has left the definition
        and management of this block to the IETF [IB_ARCH]. The 64-bit
        block is:


  +--------+--------+--------+--------+-------+--------+--------+------+
  |00000001|<-------------------IETF use------------------------------>|
  +--------+--------+--------+--------+-------+--------+--------+------+


        The Service-IDs used by IPoIB will be in the format:


  +--------+--------+--------+--------+-------+-------+--------+-------+
  |00000001|  Type  |Reserved|       QPN             |   Reserved     |
  +--------+--------+--------+--------+-------+-------+--------+-------+


        The Reserved fields MUST be transmitted as zeroes. They are
        ignored on reception.


        The QPN MUST be the UD QPN exchanged during address resolution.


        The Type MUST be set to 0.


        The service-ID formed using the UD QPN used for address
        resolution MUST be supported by the associated interface.
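

        The Service-ID layout above can be sketched in Python as
        follows. The sketch reads the diagram as eight octets in network
        order: the IETF prefix, Type, one reserved octet, a 3-octet QPN
        and two reserved octets. The function names are hypothetical.

```python
IETF_PREFIX = 0x01  # first octet of the IETF Service-ID block


def ipoib_service_id(ud_qpn: int, type_: int = 0) -> int:
    """Form the 64-bit Service-ID from the UD QPN (Type is 0)."""
    if not 0 <= ud_qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if not 0 <= type_ <= 0xFF:
        raise ValueError("Type must fit in one octet")
    # Reserved fields (octet 2 and octets 6-7) are transmitted as zero.
    return (IETF_PREFIX << 56) | (type_ << 48) | (ud_qpn << 16)


def qpn_from_service_id(service_id: int) -> int:
    """Extract the QPN, ignoring the reserved fields on reception."""
    if (service_id >> 56) & 0xFF != IETF_PREFIX:
        raise ValueError("not in the IETF Service-ID block")
    return (service_id >> 16) & 0xFFFFFF
```

        For example, a UD QPN of 0x000123 yields the Service-ID
        0x0100000001230000.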


4.0 Frame Format


        All IP and ARP datagrams transported over InfiniBand are
        prefixed  by a 4-octet encapsulation header as described in
        [IPoIB_ENCAP].


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |                               |
    |         Type                  |       Reserved                |
    |                               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+












        The type field SHALL indicate the encapsulated protocol as per
        the following table.


                        +----------+-------------+
                        | Type     |    Protocol |
                        +----------+-------------+
                        | 0x0800   |    IPv4     |
                        +----------+-------------+
                        | 0x0806   |    ARP      |
                        +----------+-------------+
                        | 0x8035   |    RARP     |
                        +----------+-------------+
                        | 0x86DD   |    IPv6     |
                        +----------+-------------+


        These values are taken from the "ETHER TYPE" numbers assigned by
        Internet Assigned Numbers Authority (IANA). Other network
        protocols, identified by different values of "ETHER TYPE", may
        use the encapsulation format defined herein but such use is
        outside of the scope of this document.
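

        A minimal Python sketch of the encapsulation follows. It assumes
        both header fields are carried in network byte order and that
        the Reserved field is sent as zero; the names are hypothetical.

```python
import struct

ETHER_TYPES = {
    "IPv4": 0x0800,
    "ARP": 0x0806,
    "RARP": 0x8035,
    "IPv6": 0x86DD,
}


def encapsulate(protocol: str, payload: bytes) -> bytes:
    """Prefix a datagram with the 4-octet IPoIB encapsulation header."""
    # 16-bit Type followed by a 16-bit Reserved field set to zero.
    return struct.pack("!HH", ETHER_TYPES[protocol], 0) + payload


def decapsulate(frame: bytes):
    """Return (type, payload) from an encapsulated frame."""
    if len(frame) < 4:
        raise ValueError("frame shorter than the encapsulation header")
    type_, _reserved = struct.unpack("!HH", frame[:4])
    return type_, frame[4:]
```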


5.0 Maximum Transmission Unit


        The IB connection setup might be used for both IPv4 and IPv6 or
        it could be used for only one of them while a different
        connection is used for the other. The link MTU MUST be able to
        support the minimum MTU required by the protocols.


        The default MTU of the IPoIB-CM interface is 2044 octets i.e.
        2048 octet IPoIB-link MTU minus the 4 octet encapsulation
        header.


        The connected modes of InfiniBand allow message sizes up to 2^31
        octets.  Therefore, IPoIB-CM can use a much larger MTU for
        unicast communication between any two endpoints. At the same
        time the maximum and/or optimal payload that can be received or
        sent over an InfiniBand connection is dependent on the
        implementation, HCA and the resources configured.


        An implementation MAY utilise the following mechanism to
        request/accept MTUs across an IB connection.


5.1 Per-Connection MTU


        Every IB connection setup message includes a "private data"
        field [IB_ARCH]. The private data field MUST carry the following
        information:


                        0               15
                        +----------------+
                        | Desired   MTU  |
                        +----------------+
                        | Minimum MTU    |
                        +----------------+


        The connection setup message (CM REQ) MUST insert the requested
        MTU in the "Desired MTU" field and the minimum acceptable MTU in
        the "Minimum MTU" field. The "Minimum MTU" value SHOULD NOT be
        less than the MTU set for multicast communication i.e. the MTU
        received on "FullMember" join of the broadcast-GID on the
        associated UD QP. The "Desired" and "Minimum" MTUs may be set to
        the same value.


        If the "Desired MTU" is not acceptable to the peer then it MUST
        indicate its preferred value in the "Desired MTU" field when
        rejecting (CM REJ) the request. If the "Desired MTU" is lower
        than the minimum MTU that the peer can support, the connection
        MUST be rejected (CM REJ message) with the minimum acceptable
        MTU set in both the "Desired MTU" and "Minimum MTU" fields.


        It is up to the implementation to utilize this mechanism for
        setting the per IB connection MTU. The IPoIB interface must
        account for the 4-octet encapsulation header and so the IPoIB
        MTU over the connection will be smaller by that amount.
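

        The negotiation can be sketched in Python as below. The sketch
        assumes two 16-bit fields in network byte order, as the diagram
        suggests, and models the responder with a hypothetical pair of
        limits (local_min, local_max). The value placed in the "Minimum
        MTU" field when the requested MTU is too large is not specified
        above and is an assumption of this sketch.

```python
import struct

ENCAP_HEADER_LEN = 4  # octets of IPoIB encapsulation header


def pack_mtu_private_data(desired: int, minimum: int) -> bytes:
    """Encode the Desired MTU and Minimum MTU fields."""
    if minimum > desired:
        raise ValueError("Minimum MTU must not exceed Desired MTU")
    return struct.pack("!HH", desired, minimum)


def unpack_mtu_private_data(data: bytes):
    """Decode (desired, minimum) from the start of the private data."""
    return struct.unpack("!HH", data[:4])


def respond_to_request(desired: int, local_min: int, local_max: int):
    """Decide on a CM REQ: ("REP", mtu) or ("REJ", desired, minimum)."""
    if desired < local_min:
        # Too small: reject with the minimum acceptable MTU in both fields.
        return ("REJ", local_min, local_min)
    if desired > local_max:
        # Too large: reject, indicating the preferred value.
        # Assumption: the Minimum MTU field carries local_min here.
        return ("REJ", local_max, local_min)
    return ("REP", desired)


def ipoib_mtu(connection_mtu: int) -> int:
    """IP MTU once the encapsulation header is accounted for."""
    return connection_mtu - ENCAP_HEADER_LEN
```

        With the default 2048-octet link MTU, ipoib_mtu(2048) gives the
        2044-octet default interface MTU stated in section 5.0.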


6.0 Security Considerations


        A node may be returned a false set of flags by an impostor. This
        may cause unnecessary attempts and some delay/disruption in
        IPoIB communication. The same is the case if wrong/spurious QPN
        values are provided during address resolution
        broadcast/multicast.


7.0 References


        [IB_ARCH]      InfiniBand Architecture Specification, version 1.1
                       www.infinibandta.org


        [IPoIB_ARCH]   draft-ietf-ipoib-architecture-04.txt, V. Kashyap


        [IPoIB_ENCAP]  draft-ietf-ipoib-ip-over-infiniband-06.txt,
                       H.K. Jerry Chu, V. Kashyap





Author's Address


        Vivek Kashyap


        15350, SW Koll Parkway
        Beaverton, OR 97006


        Phone: +1 503 578 3422
        Email: vivk@us.ibm.com



Full Copyright Statement


Copyright (C) The Internet Society (2001). All Rights Reserved.


This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it or
assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are included
on all such copies and derivative works. However, this document itself
may not be modified in any way, such as by removing the copyright notice
or references to the Internet Society or other Internet organizations,
except as needed for the purpose of developing Internet standards in
which case the procedures for copyrights defined in the Internet
Standards process must be followed, or as required to translate it into
languages other than English.


The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.


This document and the information contained herein is provided on an "AS
IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT
INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.















