INTERNET DRAFT                                               V. Kashyap
<draft-kashyap-ipoib-connected-mode-01.txt>                         IBM
Expiration Date: March 2004                              September 2003

                    IP over InfiniBand: Connected Mode

Status of this memo

    This document is an Internet-Draft and is in full conformance
    with all provisions of Section 10 of RFC 2026.

    Internet-Drafts are working documents of the Internet
    Engineering Task Force (IETF), its areas, and its working
    groups. Note that other groups may also distribute working
    documents as Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use
    Internet-Drafts as reference material or to cite them other
    than as ``work in progress''.

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed
    at http://www.ietf.org/shadow.html

    This memo provides information for the Internet community.
    This memo does not specify an Internet standard of any kind.
    Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

    The InfiniBand Architecture (IBA) defines a high-speed,
    channel-based interconnect between systems and devices. IBA
    provides multiple modes of transport services with differing
    characteristics. This document describes IP over IBA's connected
    transport modes.








Kashyap                                                         [Page 1]


INTERNET-DRAFT            Connected mode IPoIB            September 2003


Table of Contents

    1.0        Introduction
    2.0        IPoIB-connected mode
    2.1        Outline of Address Resolution
    2.2        Outline of Connection Setup
    3.0        Address Resolution
    4.0        IB Connection Setup
    4.1        Service-ID
    4.2        MTU
    5.0        IP Encapsulation
    6.0        Security Considerations
    7.0        References

1.0 Introduction

    IBA defines two connected modes:

        1. Reliable Connected (RC)
        2. Unreliable Connected (UC)

    As the names indicate, the two modes differ mainly in whether
    data delivery across the connection is reliable. This document
    applies equally to both connected modes. IPoIB over either mode
    is referred to as IPoIB-CM (connected mode) in this document.
    Where the distinction matters, IPoIB over reliable connected mode
    is referred to as IPoIB-RC and IPoIB over unreliable connected
    mode as IPoIB-UC. For clarity, IPoIB over the unreliable datagram
    mode, as described in [IPoIB_ENCAP] and [IPoIB_MCAST], is
    referred to as IPoIB-UD.

    The connected modes offer link MTUs of up to 2^31 bytes, whereas
    the datagram modes of IBA are limited to 4096 bytes. The use of
    connected modes can therefore offer significant benefits by
    supporting much larger MTUs. Reliability is also enhanced if the
    'automatic path migration' feature supported by the connected
    modes is utilised [IB_ARCH].

    This document presents a method of address resolution and
    transmission of IP packets over connected modes of IBA.

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
    NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
    in this document are to be interpreted as described in RFC 2119.

2.0 IPoIB-connected mode

    The connected modes of IBA define a non-broadcast, multiple access
    network: every node can communicate with every other node if
    desired, but the connected modes do not support multicasting.

    Consequently, native broadcast or multicast over the connected
    modes cannot be used to send out the address resolution query.
    An ARP server could be used instead, but it is not an efficient
    solution. Fortunately, IPoIB-CM has another option, as explained
    below.

2.1 Outline of Address Resolution

    IBA requires that all Host Channel Adapters (HCAs) support the
    reliable and unreliable connected modes [IB_ARCH]. It is optional
    for Target Channel Adapters (TCAs) to support the connected modes.
    At the same time, IBA requires all HCAs and TCAs to support the
    unreliable datagram mode, which does support multicasting. IPoIB
    over unreliable datagram (IPoIB-UD), as presented in
    [IPoIB_MCAST] and [IPoIB_ENCAP], requires the IB subnet to support
    IB-level multicast.

    Therefore it is possible to use a multicast query over IB-UD
    for IPoIB-CM address resolution.


    [IPoIB_ENCAP] proposes that the address resolution query is
    multicast over an IB multicast address that is joined by every
    member of the IPoIB subnet. This IB multicast address is referred
    to as the 'broadcast-GID' [IPoIB_ENCAP]. This document extends the
    requirement of joining the 'broadcast-GID' to IPoIB-CM too by
    associating an unreliable datagram queue-pair with every IPoIB-CM
    interface.

    A broadcast-GID is formed from the scope bits, the IP version,
    and the partition key (P_Key) associated with the subnet. Thus
    these three parameters must be known to the node before an IPoIB
    interface can be brought up. The exact format and the rules to
    set up the broadcast-GID are defined in [IPoIB_MCAST].

2.2 Outline of Connection Setup

    Address resolution is but the first step. Once the link address of
    the remote node is known, an IB connection must be set up between
    the nodes before any IP communication may occur.

    To make a connection, the sender must know the service-ID to use
    in the connection request [IB_ARCH]. It must also supply its queue
    pair to the remote node. The peer replies with its own queue pair.
    Note that each IB connection is peer to peer and uses one
    connected mode QP at each end.

    Although address resolution occurs at the level of an individual
    IP address, the connection between the nodes is at the IB layer.
    Therefore not every address resolution implies a new connection
    between the peers.

3.0 Address Resolution

    Every IPoIB-CM node MUST join the broadcast-GID associated with the
    subnet [IPoIB_MCAST]. This join is over a UD QP. The address
    resolution query is always sent out on the broadcast-GID.

    An IPoIB-CM implementation MAY use the same unreliable datagram
    (UD) queue pair (QP) as used by the IPoIB-UD implementation if
    the latter mode is supported in the same partition and scope.

    Therefore the address resolution query is sent to the
    broadcast-GID on the associated UD QP. A unicast reply is
    received on the UD QP associated with IPoIB-CM.

    Note:
        The IPoIB-CM link need not be the same as the link defined by
        IPoIB-UD. In other words the broadcast-GID used for an
        IPoIB-CM is independent of other broadcast-GIDs supported over
        the same IB subnet. It MAY be the same but is not required to
        be the same.

    IPoIB encapsulation [IPoIB_ENCAP] describes the link-layer address
    as follows:

        <1 octet reserved>:QPN:GID

    This document extends the link-layer address as follows:

        <flags>:QPN:GID

        Flags:
            This is a single octet field. If bit 0 is set, then in
            the sender's view the subnet is built over IB's 'reliable
            connected' (RC) mode. If bit 1 is set, then the subnet is
            built over IB's 'unreliable connected' (UC) mode. All
            other bits in the octet are reserved and MUST be set to 0.

            The RC and UC flags are mutually exclusive: they MUST NOT
            both be set at the same time.

            The format of the flags is:

            +--+--+--+--+--+--+--+--+
            |RC|UC| 0| 0| 0| 0| 0| 0|
            +--+--+--+--+--+--+--+--+

            Note:
                The above implies that a given IP subnet can only be
                supported on one InfiniBand mode. If the link-layer
                address includes no flags, the node is part of an
                IPoIB-UD subnet. If it includes the RC flag, the node
                is part of an IPoIB-RC subnet. If it includes the UC
                flag, the node is part of an IPoIB-UC subnet.

        QPN:
            The queue-pair number (QPN) on which the unicast address
            resolution reply will be received. This allows the
            IPoIB-UD address resolution code and method to be used
            for IPoIB-CM address resolution.

            The QPN also serves another purpose. It is used to form
            the Service-ID that is used to set up the IB connection.
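
    As an illustration only, the extended link-layer address can be
    packed and parsed as in the Python sketch below. The sketch
    assumes the 20-octet layout of [IPoIB_ENCAP] (one octet, a
    3-octet QPN, a 16-octet GID) with the formerly reserved octet
    carrying the flags. It also assumes that "bit 0" in the flags
    diagram is the most significant bit of the octet, so that RC is
    0x80 and UC is 0x40. The names FLAG_RC, FLAG_UC, pack_link_addr
    and unpack_link_addr are hypothetical.

```python
import struct

FLAG_RC = 0x80        # assumption: bit 0 is the most significant bit
FLAG_UC = 0x40        # assumption: bit 1 is the next bit
RESERVED_MASK = 0x3F  # the remaining six bits MUST be 0


def pack_link_addr(flags: int, qpn: int, gid: bytes) -> bytes:
    """Build the 20-octet <flags>:QPN:GID link-layer address."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one octet")
    if flags & RESERVED_MASK:
        raise ValueError("reserved flag bits must be 0")
    if (flags & FLAG_RC) and (flags & FLAG_UC):
        raise ValueError("RC and UC are mutually exclusive")
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    # Flags octet followed by the 24-bit QPN, in network byte order.
    return struct.pack("!I", (flags << 24) | qpn) + gid


def unpack_link_addr(addr: bytes):
    """Split a 20-octet address into (flags, qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("link-layer address must be 20 octets")
    (word,) = struct.unpack("!I", addr[:4])
    return word >> 24, word & 0xFFFFFF, addr[4:]
```

    With these definitions, pack_link_addr(FLAG_RC, 0x123, gid)
    yields an address whose first four octets are 80 00 01 23,
    followed by the GID.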

    On receiving the multicast/broadcast address resolution request
    the receiver replies with its own link-address, including the
    associated UD QPN and the appropriate flag. If the flags do not
    match then there is a misconfiguration since the underlying IB
    modes do not match. In such a case a suitable error indication
    SHOULD be provided to the administrator.
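
    The mismatch check described above might look like the following
    Python sketch. It is not a definitive implementation: the flag
    values assume that bit 0 is the most significant bit of the flags
    octet, and the names mode_of and check_peer_mode are hypothetical.

```python
from typing import Optional

FLAG_RC = 0x80  # assumption: bit 0 is the most significant bit
FLAG_UC = 0x40  # assumption: bit 1 is the next bit


def mode_of(flags: int) -> str:
    """Map a flags octet to the IPoIB mode it advertises."""
    rc = bool(flags & FLAG_RC)
    uc = bool(flags & FLAG_UC)
    if rc and uc:
        raise ValueError("RC and UC flags are mutually exclusive")
    if rc:
        return "IPoIB-RC"
    if uc:
        return "IPoIB-UC"
    return "IPoIB-UD"


def check_peer_mode(local_flags: int, peer_flags: int) -> Optional[str]:
    """Return None if the modes agree, else a message for the administrator."""
    local, peer = mode_of(local_flags), mode_of(peer_flags)
    if local == peer:
        return None
    return f"IPoIB mode mismatch: local interface is {local}, peer reports {peer}"
```

    A caller would log the returned message, if any, and refrain from
    attempting a connection to that peer.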

    The receiver's reply is unicast back to the sender after the
    receiver has, as in the case of IPoIB over unreliable datagram
    (IPoIB-UD), resolved the GID to the LID and determined the other
    required parameters [IPoIB_ENCAP].

    Once the address resolution is completed, the underlying IB
    connection can be set up.

4.0 IB Connection Setup

    The IB reliable/unreliable mode connection may be set up by
    either peer, though it is more likely that the one that initiated
    the address resolution phase, probably as a result of the need to
    send IP data, will initiate the connection setup. IBA allows
    passive-active and active-active connection setup.

    To set up a connection, IB Management Datagrams (MADs) are
    directed to the peer's communication manager (CM). The connection
    request always contains a Service-ID for the peer to associate the
    request with the appropriate entity. If the request is accepted,
    the peer returns the relevant connected mode QPN in the response
    MAD. The content of the CM connection messages and the IB
    connection setup are described in [IB_ARCH].

    The CM messages include, among other parameters, the Service-ID,
    Local QPN, and the payload size to use over the connection.

    Note: The IB connection is set up using the Service-ID as defined
          in section 4.1. The node MUST keep a record of the IB
          connections it is participating in. The node SHOULD NOT
          attempt another connection to the remote peer using the same
          Service-ID as used for an existing IB connection.
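
    One way to keep such a record is sketched below in Python. This
    is a minimal illustration under stated assumptions: connections
    are keyed by the peer's GID and the Service-ID, which this
    document does not mandate, and the class ConnectionTable and the
    connect callable (standing in for the CM message exchange) are
    hypothetical.

```python
from typing import Callable, Dict, Tuple

ConnKey = Tuple[bytes, int]  # (peer GID, Service-ID); keying is an assumption


class ConnectionTable:
    """Record of the IB connections this node participates in."""

    def __init__(self) -> None:
        self._conns: Dict[ConnKey, int] = {}

    def get_or_connect(self, peer_gid: bytes, service_id: int,
                       connect: Callable[[bytes, int], int]) -> int:
        """Return the local connected-mode QPN used towards this peer.

        `connect` is a hypothetical callable standing in for the CM
        exchange. It is invoked only when no connection with the same
        peer and Service-ID is already on record.
        """
        key = (peer_gid, service_id)
        if key not in self._conns:
            self._conns[key] = connect(peer_gid, service_id)
        return self._conns[key]

    def remove(self, peer_gid: bytes, service_id: int) -> None:
        """Forget a connection, for example after it is torn down."""
        self._conns.pop((peer_gid, service_id), None)
```

    Because the lookup is by peer and Service-ID rather than by IP
    address, resolving a second IP address that maps to the same peer
    reuses the existing connection.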

4.1 Service-ID

    The InfiniBand specification defines a block of service IDs for
    IETF use and leaves the definition and management of this block
    to the IETF [IB_ARCH]. The 64-bit block is:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|<-------------------IETF use--------------------------------->|
+--------+--------+--------+--------+--------+--------+--------+--------+

    The Service-IDs used by IPoIB will be in the format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|  Type  |Reserved|            QPN           |   Reserved      |
+--------+--------+--------+--------+--------+--------+--------+--------+

    The Reserved fields MUST be transmitted as zeroes. They are
    ignored on reception.

    The QPN MUST be the value exchanged during address
    resolution.

    The Type MUST be set to 0.

    Note:
          The Service-ID formed using the UD QPN used for address
          resolution MUST be supported by the associated interface.
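
    The Service-ID format above can be illustrated with a short
    Python sketch. It assumes the diagram is read in network byte
    order, with the leftmost octet the most significant. The names
    make_service_id and qpn_from_service_id are hypothetical.

```python
IETF_PREFIX = 0x01  # first octet of the IETF service-ID block
IPOIB_TYPE = 0      # the Type field MUST be 0


def make_service_id(qpn: int, type_: int = IPOIB_TYPE) -> int:
    """Form the 64-bit Service-ID from the UD QPN used in address resolution."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if not 0 <= type_ <= 0xFF:
        raise ValueError("Type must fit in one octet")
    # Octet 0: prefix, octet 1: Type, octet 2: reserved (0),
    # octets 3-5: QPN, octets 6-7: reserved (0).
    return (IETF_PREFIX << 56) | (type_ << 48) | (qpn << 16)


def qpn_from_service_id(service_id: int) -> int:
    """Extract the QPN; reserved fields are ignored on reception."""
    if (service_id >> 56) != IETF_PREFIX:
        raise ValueError("not in the IETF service-ID block")
    return (service_id >> 16) & 0xFFFFFF
```

    For example, a QPN of 0x000123 gives the Service-ID
    0x0100000001230000.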

4.2 MTU

    An IB connection, once set up, might be used for both IPv4 and
    IPv6, or it could be used for only one of them while a different
    connection is used for the other. The link MTU MUST be able to
    support the minimum MTU required by the protocols carried over
    the connection.

    Every connection setup message includes a 'private data' field
    [IB_ARCH]. The private data field MUST carry the following
    information:

        0               15
        +----------------+
        | Desired   MTU  |
        +----------------+
        | Minimum MTU    |
        +----------------+

    The connection setup message (CM REQ) MUST carry the requested
    MTU in the 'Desired MTU' field and the minimum acceptable MTU in
    the 'Minimum MTU' field. If the requested MTU is not acceptable
    to the peer, the peer MUST indicate its preferred value in the
    'Desired MTU' field when rejecting the request (CM REJ). If the
    'Desired MTU' is lower than the minimum MTU that can be
    supported, the connection MUST be rejected (CM REJ) with the
    minimum acceptable MTU set in both the 'Desired MTU' and
    'Minimum MTU' fields.
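
    The MTU exchange can be sketched in Python as follows. This is
    an illustration under several assumptions that this document
    does not state: the two 16-bit fields are taken to be in network
    byte order at the start of the private data, the accepting reply
    is assumed to echo the requested values, and a rejection for too
    large an MTU is assumed to carry the responder's own minimum in
    the 'Minimum MTU' field. The names MtuData, pack_mtu, unpack_mtu
    and respond are hypothetical.

```python
import struct
from typing import NamedTuple, Tuple


class MtuData(NamedTuple):
    desired: int
    minimum: int


def pack_mtu(mtu: MtuData) -> bytes:
    """Encode the two 16-bit fields (byte order is an assumption)."""
    return struct.pack("!HH", mtu.desired, mtu.minimum)


def unpack_mtu(private_data: bytes) -> MtuData:
    """Decode the first four octets of the private data field."""
    return MtuData(*struct.unpack("!HH", private_data[:4]))


def respond(req: MtuData, local_min: int, local_max: int) -> Tuple[str, MtuData]:
    """Decide how a peer answers a CM REQ; returns (message, private data)."""
    if req.desired < local_min:
        # Requested MTU is below what this node can support:
        # reject with the minimum acceptable MTU in both fields.
        return "REJ", MtuData(local_min, local_min)
    if req.desired > local_max:
        # Not acceptable: reject and advertise the preferred value.
        # Carrying local_min in the second field is an assumption.
        return "REJ", MtuData(local_max, local_min)
    # Acceptable: echoing the request in the reply is an assumption.
    return "REP", MtuData(req.desired, req.minimum)
```

    On receiving a rejection, the initiator can compare the returned
    'Desired MTU' with its own minimum and retry with the lower value
    if that is still acceptable to it.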

5.0 IP Encapsulation

    The IP encapsulation will be done as defined in the IPoIB
    encapsulation standard [IPoIB_ENCAP].

    IP multicast cannot be done over the IPoIB-CM modes. Multicast
    traffic MUST be transmitted over the UD QP associated with the
    IPoIB-CM interface.
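
    The resulting choice of transmit path might be sketched in
    Python as below. The function name select_qp and the labels
    "UD" and "CM" are hypothetical. Sending the IPv4 limited
    broadcast address over the UD QP is an assumption that follows
    from broadcast being carried on the broadcast-GID;
    subnet-directed broadcast is not handled in this sketch.

```python
import ipaddress

IPV4_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def select_qp(destination: str) -> str:
    """Return "UD" for multicast/broadcast traffic, "CM" for unicast."""
    addr = ipaddress.ip_address(destination)
    if addr.is_multicast:
        # IP multicast cannot be carried over the connected modes.
        return "UD"
    if addr.version == 4 and addr == IPV4_BROADCAST:
        # Assumption: limited broadcast also uses the UD QP.
        return "UD"
    return "CM"
```

    For instance, select_qp("224.0.0.1") and select_qp("ff02::1")
    both select the UD QP, while a unicast destination such as
    192.0.2.7 selects the connected mode QP.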

6.0 Security Considerations

    A node may be returned a false set of flags by an impostor. This
    may cause unnecessary connection attempts and some delay or
    disruption in IPoIB communication. The same is the case if wrong
    or spurious QPN values are provided during the broadcast/multicast
    address resolution.

    The same precautions MUST be taken as described in the 'security
    considerations' sections of [IPoIB_MCAST] and [IPoIB_ENCAP].

7.0 References

[IB_ARCH]        InfiniBand Architecture Specification, version 1.1,
                 www.infinibandta.org

[IPoIB_ARCH]     draft-ietf-ipoib-architecture-02.txt, V. Kashyap

[IPoIB_ENCAP]    draft-ietf-ipoib-ip-over-infiniband-04.txt,
                 V. Kashyap, H.K. Jerry Chu

[IPoIB_MCAST]    draft-ietf-ipoib-link-multicast-04.txt,
                 H.K. Jerry Chu, V. Kashyap

8.0 Author's Address

Vivek Kashyap

15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

    Copyright (C) The Internet Society (2001). All Rights Reserved.

    This document and translations of it may be copied and
    furnished to others, and derivative works that comment on or
    otherwise explain it or assist in its implementation may be
    prepared, copied, published and distributed, in whole or in
    part, without restriction of any kind, provided that the above
    copyright notice and this paragraph are included on all such
    copies and derivative works. However, this document itself may
    not be modified in any way, such as by removing the copyright
    notice or references to the Internet Society or other Internet
    organizations, except as needed for the purpose of developing
    Internet standards in which case the procedures for copyrights
    defined in the Internet Standards process must be followed, or
    as required to translate it into languages other than
    English.

    The limited permissions granted above are perpetual and will
    not be revoked by the Internet Society or its successors or
    assigns.

    This document and the information contained herein is provided
    on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
    USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
    ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
    PARTICULAR PURPOSE.


