INTERNET DRAFT                                               V. Kashyap
<draft-ietf-ipoib-connected-mode-00.txt>                            IBM
Expiration Date: August 2005                              February 2005


                   IP over InfiniBand: Connected Mode


Status of this memo

    By submitting this Internet-Draft, I certify that any applicable
    patent or other IPR claims of which I am aware have been disclosed,
    or will be disclosed, and any of which I become aware will be
    disclosed, in accordance with RFC 3668.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as Internet-
    Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt.

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.


Copyright Notice

    Copyright (C) The Internet Society (2001).  All Rights Reserved.


Abstract

        This document specifies a method for transmitting IPv4/IPv6
        packets and address resolution over the connected modes of
        InfiniBand.









Kashyap                                                         [Page 1]


INTERNET-DRAFT            Connected mode IPoIB             February 2005


        Table of Contents

        1.0       Introduction
        2.0       IPoIB-connected mode
        2.1       Multicasting
        2.2       Outline of Address Resolution
        2.3       Outline of Connection Setup
        3.0       Address Resolution
        3.1       Link-layer Address
        3.2       IB Connection Setup
        3.3       Service-ID
        4.0       Frame Format
        5.0       Maximum Transmission Unit
        5.1       Per-Connection MTU
        6.0       IPoIB-CM Considerations
        6.1       A Cautionary Note on IPoIB-RC
        7.0       Security Considerations
        8.0       IANA Considerations
        9.0       References

1.0 Introduction

        The InfiniBand specification [IB_ARCH] can be found at
        www.infinibandta.org.  The document [IPoIB_ARCH] provides a
        short overview of InfiniBand architecture along with
        consideration for specifying IP over InfiniBand networks.

        The InfiniBand architecture (IBA) defines multiple modes of
        transports. Of these the unreliable datagram (UD) transport
        method best matches the needs of IP. IP over InfiniBand (IPoIB)
        over UD is described in [IPoIB_UD]. This document describes
        IP transmission over the connected modes of IBA.

        IBA defines two connected modes:

             1. Reliable Connected (RC)
             2. Unreliable Connected (UC)

        As is evident from the nomenclature, the two modes differ mainly
        in providing reliability of data delivery across the connection.
        This document applies equally to both the connected modes.
        IPoIB over these two modes is referred to as IPoIB-CM (connected
        mode) in this document.  For clarity, IPoIB over the unreliable
        datagram mode, as described in [IPoIB_UD], is referred to as
        IPoIB-UD.

        IBA requires that all Host Channel Adapters (HCAs) support the
        reliable and unreliable connected modes [IB_ARCH]. It is
        optional for Target Channel Adapters (TCAs) to support the
        connected modes.

        The connected modes offer link MTUs of up to 2^31 octets in
        length.  Thus the use of connected modes can offer significant
        benefits by supporting much larger MTUs. The datagram modes of
        IBA are limited to link MTUs of 4096 octets.

        Reliability is also enhanced if the underlying feature of
        "automatic path migration" supported by the connected modes is
        utilized.

        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
        NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described
        in RFC 2119.

2.0 IPoIB-connected mode

        Every IPoIB implementation MUST support IPoIB-UD. The IPoIB-CM
        support is OPTIONAL.

        This document extensively refers to [IPoIB_UD] and extends the
        IPoIB description given in [IPoIB_UD] to IPoIB-CM. Therefore,
        only the additional requirements or enhancements needed to
        enable IPoIB-CM are described.

        The IP encapsulation, default MTU, link layer address format and
        the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM
        exactly as described in [IPoIB_UD].

2.1 Multicasting

        The connected modes of IBA define a non-broadcast, multiple
        access network. The connected modes of IBA do not support
        multicasting though every node can communicate with every other
        node if desired.

        This requires that multicasting be emulated in some form by the
        network.  However, in the case of an InfiniBand network, an
        unreliable datagram (UD) queue pair (QP) can be used to support
        multicasting instead of an emulation, while the connected mode
        QP is used for unicast traffic.  Since every IPoIB
        implementation is required to support the UD mode, every
        implementation supporting IPoIB-CM will be able to use the
        coexisting IPoIB-UD QP for all broadcast/multicast
        communications.

        Multicast mapping, transmission and reception of multicast
        packets and multicast routing MUST use the IPoIB-UD QP
        associated with the IPoIB-CM interface.

2.2 Outline of Address Resolution

        Every IPoIB-CM interface MUST have two QPs associated with it:

                1) A connected mode QP
                2) An unreliable datagram mode QP

        [IPoIB_UD] proposes that the address resolution query is
        multicast over an IB multicast address that is joined by every
        member of the IPoIB subnet. This IB multicast address is
        referred to as the "broadcast-GID" [IPoIB_UD]. Every IPoIB-UD
        implementation joins the "broadcast-GID" as a "FullMember" on
        the associated QP [IPoIB_UD].

        A broadcast-GID is formed with the knowledge of the scope bits,
        the IP version and the partition key (P_Key) associated with the
        subnet. Thus these three parameters must be known to the node
        before an IPoIB interface can be brought up. The exact format
        and rules to set up the broadcast-GID are defined in [IPoIB_UD].

        The response to the query is received on the IPoIB-UD QP
        [IPoIB_UD].

2.3 Outline of Connection Setup

        Once the link address of the remote node is known, an IB
        connection must be set up between the nodes before any IP
        communication may occur.

        To make a connection, the sender must know the service-ID to use
        in the connection request [IB_ARCH]. It must also supply its
        "connected mode" queue pair to the remote node. The peer replies
        with its queue pair. Each IB connection is peer to peer and uses
        one connected mode QP at each end.

        Though address resolution occurs at the level of an individual
        IP address, the connection between the nodes is at the IB layer.
        Therefore, not every address resolution implies a new
        connection between the peers.
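
        As an illustration only, a node might key its connection
        records by the peer's link-layer identity rather than by IP
        address, so that several IP addresses resolving to the same
        peer share one IB connection. The class name, the key (GID plus
        UD QPN) and the `connect` callable below are hypothetical and
        are not defined by this document or by [IB_ARCH].

```python
class ConnectionTable:
    """Hypothetical per-interface record of IB connections, keyed by peer."""

    def __init__(self, connect):
        # `connect` is an assumed callable that performs the IB CM
        # connection setup and returns some connection handle.
        self._connect = connect
        self._by_peer = {}

    def get(self, peer_gid: bytes, peer_ud_qpn: int):
        """Return the existing connection to a peer, creating it if needed."""
        key = (peer_gid, peer_ud_qpn)
        conn = self._by_peer.get(key)
        if conn is None:
            # First resolution that maps to this peer: set up a connection.
            conn = self._connect(peer_gid, peer_ud_qpn)
            self._by_peer[key] = conn
        # Later resolutions for other IP addresses on the same peer reuse it.
        return conn
```

        With such a table, resolving a second IP address that maps to
        an already connected peer returns the existing connection
        instead of triggering a new connection setup.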

3.0 Address Resolution

        Address resolution queries are sent out on the "broadcast-GID"
        over the IPoIB-UD QP associated with the IPoIB-CM interface. A
        unicast reply is received on the UD QP.



3.1 Link-layer Address

        IPoIB encapsulation [IPoIB_UD] describes the link-layer address
        as follows:

            <1 octet reserved>:QPN:GID

        This document extends the link-layer address as follows:

            <Flags>:QPN:GID

            Flags:

                This is a single octet field. The bits indicate the
                connected modes supported by the interface.

                Bit 0 specifies the support for the "reliable connected"
                (RC) mode.  Bit 1 indicates the support for the
                "unreliable connected" (UC) mode.  All other bits in the
                octet are reserved and MUST be set to 0 on transmits and
                ignored on receives.  The format of the flags is:

                    +--+--+--+--+--+--+--+--+
                    |RC|UC| 0| 0| 0| 0| 0| 0|
                    +--+--+--+--+--+--+--+--+

                Both the RC and UC bits MAY be set at the same time if
                the interface supports both modes. Since the IPoIB-UD
                mode is always supported, there are no flags to indicate
                IPoIB-UD support.

                If IPoIB-CM is not supported, i.e., if the
                implementation only supports IPoIB-UD, then the
                implementation MUST ignore the <Flags> on reception. It
                MUST set the <Flags> octet to all zeroes on transmission
                as specified in [IPoIB_UD].

            QPN:

                The queue-pair number (QPN) on which the unicast address
                resolution reply will be received. This allows the
                IPoIB-UD address resolution code and method to be used
                for IPoIB-CM address resolution.

                The QPN also serves another purpose. It is used to form
                the Service-ID that is used to setup the IB connection.
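
        The layout above can be sketched as follows. This is a minimal
        illustration, not a definitive implementation. It assumes that
        "bit 0" is the most significant bit of the octet (the leftmost
        box in the figure, per the usual IETF convention), that the QPN
        occupies 3 octets and the GID 16 octets as in the link-layer
        address of [IPoIB_UD], and that all fields are in network byte
        order. The names used are hypothetical.

```python
from dataclasses import dataclass

FLAG_RC = 0x80        # bit 0, assumed to be the most significant bit
FLAG_UC = 0x40        # bit 1
RESERVED_MASK = 0x3F  # remaining six bits of the flags octet


@dataclass(frozen=True)
class LinkAddress:
    flags: int   # connected modes supported by the interface
    qpn: int     # UD QPN on which address resolution replies arrive
    gid: bytes   # 16-octet GID

    def pack(self) -> bytes:
        if not 0 <= self.qpn < (1 << 24):
            raise ValueError("QPN must fit in 24 bits")
        if len(self.gid) != 16:
            raise ValueError("GID must be 16 octets")
        # Reserved bits MUST be sent as zero.
        flags = self.flags & ~RESERVED_MASK & 0xFF
        return bytes([flags]) + self.qpn.to_bytes(3, "big") + self.gid

    @classmethod
    def unpack(cls, raw: bytes) -> "LinkAddress":
        if len(raw) != 20:
            raise ValueError("link-layer address must be 20 octets")
        # Reserved bits are ignored on receive.
        flags = raw[0] & (FLAG_RC | FLAG_UC)
        return cls(flags, int.from_bytes(raw[1:4], "big"), raw[4:20])
```

        An implementation that supports only IPoIB-UD would pack a
        flags value of zero and would not look at the flags of a
        received address.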

        On receiving the multicast/broadcast address resolution request
        the receiver replies with its own link-address, including the
        associated UD QPN and the appropriate flags.

        The receiver's reply is unicast back to the sender after the
        receiver has, as in the case of IPoIB-UD, resolved the GID to
        the LID and determined other required parameters [IPoIB_UD].

        Once the address resolution is completed, the underlying IB
        connection on a supported connected mode can be set up. An
        implementation is NOT REQUIRED to set up a connection merely
        because the peer indicates the capability. The decision to make
        such a connection is left to the implementation.

3.2 IB Connection Setup

        The IB reliable/unreliable mode connection may be set up by
        either peer, though it is more likely that the one that
        initiated the address resolution phase, probably as a result of
        the need to send IP data, will initiate the connection setup.
        IBA allows passive-active and active-active connection setup
        [IB_ARCH].

        To set up a connection, IB Management Datagrams (MADs) are
        directed to the peer's communication manager (CM). The
        connection request always contains a Service-ID for the peer to
        associate the request with the appropriate entity. If the
        request is accepted the peer returns the relevant connected mode
        QPN in the response MAD. The format of the CM connection
        messages and the IB connection setup process is described in
        [IB_ARCH].

        The CM messages include, among other parameters, the Service-ID,
        Local QPN, and the payload size to use over the connection.

        Note:
              The IB connection is set up using the Service-ID defined
              in Section 3.3. The node MUST keep a record of the IB
              connections it is participating in. The node MAY attempt
              another connection to the remote peer using the same
              Service-ID as used for an existing IB connection.
              Similarly, the receiver of such a connection request MAY
              drop it with a suitable error indication in the CM
              response. The decision to accept or initiate multiple
              connections from or to an IPoIB interface is left to the
              implementation.

3.3 Service-ID

        The InfiniBand specification defines a block of service IDs for
        IETF use. The InfiniBand specification has left the definition
        and management of this block to the IETF [IB_ARCH]. The 64-bit
        block is:

  +--------+--------+--------+--------+-------+--------+--------+------+
  |00000001|<-------------------IETF use------------------------------>|
  +--------+--------+--------+--------+-------+--------+--------+------+

        The Service-IDs used by IPoIB will be in the format:

  +--------+--------+--------+--------+-------+-------+--------+-------+
  |00000001|  Type  |         Reserved        |        QPN             |
  +--------+--------+--------+--------+-------+-------+--------+-------+

        The Reserved fields MUST be transmitted as zeroes. They are
        ignored on reception.

        The QPN MUST be the UD QPN exchanged during address resolution.

        The Type MUST be set to 0.
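
        A minimal sketch of forming and checking such a Service-ID
        follows. The field widths are read off the figure above (one
        octet for the IETF prefix, one octet for Type, three reserved
        octets, three octets for the QPN); the function names are
        hypothetical.

```python
IETF_PREFIX = 0x01  # first octet of the IETF service-ID block


def make_service_id(ud_qpn: int, sid_type: int = 0) -> int:
    """Build the 64-bit Service-ID from the UD QPN learned in address resolution."""
    if not 0 <= ud_qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if not 0 <= sid_type <= 0xFF:
        raise ValueError("Type must fit in one octet")
    # Reserved bits (bits 24..47 counted from the least significant bit)
    # are left as zero.
    return (IETF_PREFIX << 56) | (sid_type << 48) | ud_qpn


def parse_service_id(service_id: int) -> tuple:
    """Return (type, qpn); reserved bits are ignored on reception."""
    if (service_id >> 56) != IETF_PREFIX:
        raise ValueError("not in the IETF service-ID block")
    return (service_id >> 48) & 0xFF, service_id & 0xFFFFFF
```

        For example, a peer whose UD QPN is 0x123456 would be reached
        with the Service-ID 0x0100000000123456 when Type is 0.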

4.0 Frame Format

        All IP and ARP datagrams transported over InfiniBand are
        prefixed by a 4-octet encapsulation header as described in
        [IPoIB_UD].

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |                               |
    |         Type                  |       Reserved                |
    |                               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        The type field SHALL indicate the encapsulated protocol as per
        the following table.

                        +----------+-------------+
                        | Type     |    Protocol |
                        |------------------------|
                        | 0x800    |    IPv4     |
                        |------------------------|
                        | 0x806    |    ARP      |
                        |------------------------|
                        | 0x8035   |    RARP     |
                        |------------------------|
                        | 0x86DD   |    IPv6     |
                        +------------------------+

        These values are taken from the "ETHER TYPE" numbers assigned by
        Internet Assigned Numbers Authority (IANA). Other network
        protocols, identified by different values of "ETHER TYPE", may
        use the encapsulation format defined herein but such use is
        outside of the scope of this document.
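
        The encapsulation can be sketched as below, assuming network
        byte order for the 16-bit Type field and a zeroed Reserved
        field on transmit. The function names are hypothetical; only
        the four protocol types listed in the table are accepted.

```python
import struct

# Type values from the table above.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x8035: "RARP", 0x86DD: "IPv6"}

# 16-bit Type followed by 16-bit Reserved, network byte order.
HEADER = struct.Struct("!HH")


def encapsulate(ethertype: int, payload: bytes) -> bytes:
    """Prefix a datagram with the 4-octet encapsulation header."""
    if ethertype not in ETHERTYPES:
        raise ValueError("unsupported type 0x%04x" % ethertype)
    return HEADER.pack(ethertype, 0) + payload


def decapsulate(frame: bytes) -> tuple:
    """Return (protocol name, payload); the Reserved field is ignored."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than encapsulation header")
    ethertype, _reserved = HEADER.unpack_from(frame)
    if ethertype not in ETHERTYPES:
        raise ValueError("unsupported type 0x%04x" % ethertype)
    return ETHERTYPES[ethertype], frame[HEADER.size:]
```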

5.0 Maximum Transmission Unit

        The IB connection setup might be used for both IPv4 and IPv6 or
        it could be used for only one of them while a different
        connection is used for the other. The link MTU MUST be able to
        support the minimum MTU required by the protocols.

        The default MTU of the IPoIB-CM interface is 2044 octets, i.e.,
        the 2048-octet IPoIB-link MTU minus the 4-octet encapsulation
        header.

        However, connected modes of InfiniBand allow message sizes up to
        2^31 octets. Therefore, IPoIB-CM can use a much larger MTU for
        unicast communication between any two endpoints. The maximum
        and/or optimal payload that can be received or sent over an
        InfiniBand connection is dependent on the implementation, HCA
        and the resources configured.

        An implementation MAY utilize the following mechanism to
        exchange the optimal message size across the IB connection.

5.1 Per-Connection MTU

        Every IB connection setup message includes a "private data"
        field [IB_ARCH]. The private data field MUST carry the following
        information:

                        0               15
                        +----------------+
                        | Receive   MTU  |
                        +----------------+

        The connection setup message (CM REQ) MUST insert the requested
        MTU in the "Receive MTU" field. This indicates the maximum
        packet size the requester can accept. The requester MUST be able
        to accept smaller MTU sizes as well.

        It is up to the implementation to utilize this mechanism for
        setting the per IB connection MTU. The IPoIB interface must
        account for the 4-octet encapsulation header and so the IPoIB
        MTU over the connection will be smaller by that amount.
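
        The arithmetic can be sketched as follows. This is an
        illustration under stated assumptions: the 16-bit "Receive MTU"
        is taken to sit at the start of the private data in network
        byte order, and the sender is assumed to cap its messages at
        the smaller of the peer's Receive MTU and its own send limit.
        The function names and the `local_send_limit` parameter are
        hypothetical.

```python
import struct

ENCAP_HEADER_LEN = 4     # 4-octet encapsulation header
DEFAULT_LINK_MTU = 2048  # default IPoIB-link MTU


def pack_receive_mtu(receive_mtu: int) -> bytes:
    """Encode the Receive MTU for the private data of a CM message."""
    if not 0 < receive_mtu <= 0xFFFF:
        raise ValueError("Receive MTU must fit in 16 bits")
    return struct.pack("!H", receive_mtu)


def unpack_receive_mtu(private_data: bytes) -> int:
    """Decode the peer's Receive MTU from received private data."""
    if len(private_data) < 2:
        raise ValueError("private data too short")
    return struct.unpack_from("!H", private_data)[0]


def ip_mtu_for_connection(peer_receive_mtu: int, local_send_limit: int) -> int:
    """IP MTU usable over the connection, after the encapsulation header."""
    message_size = min(peer_receive_mtu, local_send_limit)
    return message_size - ENCAP_HEADER_LEN
```

        With the default link MTU this gives 2048 - 4 = 2044 octets; a
        peer advertising a Receive MTU of 32768 to a sender that can
        transmit at least that much would yield an IP MTU of 32764
        octets.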

6.0 IPoIB-CM Considerations

        Every IPoIB interface supports IPoIB-UD. It may additionally
        support one or both of IPoIB-CM modes. Therefore, there can be
        multiple methods of communicating between any two peers. This
        implies that an interface MAY transmit/receive a packet over any
        of the RC, UC or UD modes depending on the modes supported
        between it and the peer. It further follows that every IPoIB
        implementation compliant with this document MUST accept all
        unicast transmissions over any of the IPoIB modes it supports.
        Multicast and broadcast packets by their nature will always be
        transmitted and received over the IPoIB-UD QP.
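
        One possible transmit-side policy is sketched below. The flag
        values assume bit 0 of the <Flags> octet is the most
        significant bit, and the preference order (RC before UC) is an
        arbitrary choice made for the example; this document leaves
        that decision to the implementation. The function name is
        hypothetical.

```python
FLAG_RC = 0x80  # assumed encoding of bit 0 of the <Flags> octet
FLAG_UC = 0x40  # assumed encoding of bit 1

_MODE_FLAG = {"RC": FLAG_RC, "UC": FLAG_UC}


def select_transport(local_flags, peer_flags, is_multicast,
                     preference=("RC", "UC")):
    """Pick the mode used to transmit one packet to a peer."""
    if is_multicast:
        # Multicast and broadcast always use the IPoIB-UD QP.
        return "UD"
    common = local_flags & peer_flags
    for mode in preference:
        if common & _MODE_FLAG[mode]:
            return mode
    # No connected mode in common: IPoIB-UD is always available.
    return "UD"
```

        Whatever the sender chooses, the receiver still has to accept
        unicast packets arriving over any of the modes it supports.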

6.1 A Cautionary Note on IPoIB-RC

        The RC mode of InfiniBand guarantees in-order delivery of
        packets. Every message transmitted over the RC connection is
        broken into physical MTU sized packets by the RC connection. If
        any packet is lost, it is retransmitted until the complete
        message is exchanged. Therefore there is a possibility of a
        reliable transport layer, such as TCP, retransmitting due to a
        shorter timeout while the RC layer is still in the process of
        transferring the complete message. A retransmission at the upper
        layer will add to the already existing congestion.

        Therefore, the RC timers as well as the maximum message size
        supported at the IPoIB-RC connection must be set judiciously.

7.0 Security Considerations

        A node may be returned a false set of flags by an impostor. This
        may cause unnecessary connection attempts and some delay or
        disruption in IPoIB communication. The same is the case if wrong
        or spurious QPN values are provided during the
        broadcast/multicast address resolution exchange.

8.0 IANA Considerations

        This document requires that the reserved bits and octets be set
        to zero on sends and ignored on receives.  Proposed uses of the
        reserved bits MUST be published as RFCs.

9.0 References

Normative

        [IB_ARCH]      InfiniBand Architecture Specification, version 1.1
                       www.infinibandta.org

        [IPoIB_ARCH]   draft-ietf-ipoib-architecture-04.txt, V. Kashyap

        [IPoIB_UD]     draft-ietf-ipoib-ip-over-infiniband-06.txt,
                       H.K. Jerry Chu, V. Kashyap

Author's Address

        Vivek Kashyap

        15350, SW Koll Parkway Beaverton, OR 97006

        Phone: +1 503 578 3422 Email: vivk@us.ibm.com


Full Copyright Statement

        Copyright (C) The Internet Society (2004).  This document is
        subject to the rights, licenses and restrictions contained in
        BCP 78 and except as set forth therein, the authors retain all
        their rights.

        This document and the information contained herein are provided
        on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
        REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
        THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY
        THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY
        RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
        FOR A PARTICULAR PURPOSE.






Intellectual Property


        The IETF takes no position regarding the validity or scope of
        any Intellectual Property Rights or other rights that might be
        claimed to pertain to the implementation or use of the
        technology described in this document or the extent to which any
        license under such rights might or might not be available; nor
        does it represent that it has made any independent effort to
        identify any such rights.  Information on the procedures with
        respect to rights in RFC documents can be found in BCP 78 and
        BCP 79.

        Copies of IPR disclosures made to the IETF Secretariat and any
        assurances of licenses to be made available, or the result of an
        attempt made to obtain a general license or permission for the
        use of such proprietary rights by implementers or users of this
        specification can be obtained from the IETF on-line IPR
        repository at http://www.ietf.org/ipr.

        The IETF invites any interested party to bring to its attention
        any copyrights, patents or patent applications, or other
        proprietary rights that may cover technology that may be
        required to implement this standard.  Please address the
        information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

        Funding for the RFC Editor function is currently provided by the
        Internet Society.




















