INTERNET DRAFT                                               Vivek Kashyap
<draft-ietf-ipoib-ip-over-infiniband-02.txt>                           IBM
Expiration Date: September, 2003                            H.K. Jerry Chu
                                                          Sun Microsystems

                                                           September, 2003


    IP encapsulation and address resolution over InfiniBand networks

Status of this memo

        This document is an Internet-Draft and is in full conformance
        with all provisions of Section 10 of RFC 2026.

        Internet-Drafts are working documents of the Internet
        Engineering Task Force (IETF), its areas, and its working
        groups. Note that other groups may also distribute working
        documents as Internet-Drafts.

        Internet-Drafts are draft documents valid for a maximum of six
        months and may be updated, replaced, or obsoleted by other
        documents at any time. It is inappropriate to use
        Internet-Drafts as reference material or to cite them other
        than as "work in progress".

        The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt

        The list of Internet-Draft Shadow Directories can be accessed
        at http://www.ietf.org/shadow.html

        This memo provides information for the Internet community.
        This memo does not specify an Internet standard of any kind.
        Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

        This document specifies the frame format for transmission of
        IP and ARP packets over InfiniBand networks. Unless explicitly
        specified, the term 'IP' refers to both IPv4 and IPv6. The
        term 'ARP' refers to all the ARP protocols/op-codes such as
        ARP/RARP. This document also describes the method of forming
        IPv6 link-local addresses, and the content of the



Kashyap, Chu                                                    [Page 1]


INTERNET-DRAFT             IP over InfiniBand             February, 2003


        source/target link layer address option used in Neighbor
        solicitation and advertisement, router advertisement, router
        redirect and router solicitation on IPv6 over InfiniBand.

Table of Contents

        1.0     Introduction
        2.0     InfiniBand Datalink
        2.1     IP Support on IPoIB Link
        3.0     Frame Format
        4.0     Maximum Transmission Unit
        5.0     IPv6 Stateless Autoconfiguration
        5.1     IPv6 Link Local Address
        6.0     Address Mapping - Unicast
        6.1     Link Information
        6.1.1   Link Layer Address/Hardware Address
        6.1.2   Auxiliary Link Information
        6.2     Address Resolution in IPv4 Subnets
        6.3     Address Resolution in IPv6 Subnets
        7.0     IANA Considerations
        8.0     Security Considerations
        9.0     Acknowledgements
       10.0     References
       11.0     Authors' Addresses

1.0 Introduction

        The InfiniBand specification [IB_ARCH] can be found at
        www.infinibandta.org. The document [IPoIB_ARCH] provides a
        short overview of InfiniBand architecture along with
        considerations for specifying IP over InfiniBand networks. The
        document [IPoIB_MCAST] defines the configuration of IPoIB
        links and the support of IP multicast over InfiniBand
        networks.

        The InfiniBand architecture (IBA) defines multiple modes of
        transport over which IP may be implemented. The unreliable
        datagram (UD) transport method best matches the needs of IP
        and the need for universality in general, as described in
        [IPoIB_ARCH].

        This document specifies IPoIB over IB's unreliable datagram
        (UD) mode. IPoIB over other modes of IB is beyond the scope
        of this document.

        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
        NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described
        in RFC 2119.

2.0 InfiniBand Datalink

        The document [IPoIB_MCAST] defines the IPoIB link, its setup,
        and IP multicast over InfiniBand in detail. The following
        discussion gives a short overview.

        An InfiniBand (IB) subnet is formed by a network of IB nodes
        interconnected either directly or via IB switches. IB subnets
        may be connected using IB routers to form a fabric made of
        multiple IB subnets. Multiple IP subnets may be overlaid over
        this IB cloud. The boundary of such an IP subnet is arbitrary
        and is not associated with a physical demarcation. The IPoIB
        nodes that are members of an IP subnet are interconnected by
        an abstract 'link'. The link is defined by its members and by
        common characteristics, such as the P_Key, link MTU and
        Q_Key, that are defined per 'link'.

        IPv4 defines a limited-broadcast address over the link. All
        IPv4 hosts that are members of the IPv4 subnet receive
        packets sent to this address. IPv6 defines a multicast
        address referred to as the all-nodes multicast address. IPoIB
        defines a mapping from these (and other IPv4/v6 multicast
        addresses) to IB multicast GIDs [IPoIB_MCAST]. The multicast
        GID derived from the IPv4 limited-broadcast address and the
        multicast GID derived from the IPv6 all-nodes multicast
        address will collectively be referred to as the broadcast-GID
        in this document. The broadcast-GID must be set up before an
        IPoIB subnet can be formed.

        Every IPoIB interface MUST join the InfiniBand multicast group
        defined by the broadcast-GID. This operation returns the MTU
        and the Q_Key associated with the IPoIB link. Thus the IPoIB
        subnet (and the link) is formed by the IPoIB nodes joining the
        broadcast-GID.

        The P_Key is a configuration parameter that must be known
        before the broadcast-GID can be formed [IPoIB_MCAST].

2.1 IP Support on IPoIB Link

        The unreliable datagram (UD) mode of communication is
        supported by all IB elements, be they IB routers, Host Channel
        Adapters (HCAs) or Target Channel Adapters (TCAs). In addition
        to being the only universal transmission method, it supports
        multicasting, partitioning and a 32-bit CRC [IB_ARCH]. IB does
        not require that all IB components support multicasting.
        Therefore, IB subnets with no multicast support are always
        possible. However, IPoIB architecture requires the
        participating components to support multicast.

        All IPoIB implementations MUST support IP over the unreliable
        datagram (UD) transport mode of IBA.

3.0 Frame Format

        All IP and ARP datagrams transported over InfiniBand are
        prefixed by a 4-octet encapsulation header as illustrated
        below.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |                               |
    |         Type                  |       Reserved                |
    |                               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 1

        The type field SHALL indicate the encapsulated protocol as per
        the following table.

        +----------+-------------+
        |   Type   |  Protocol   |
        |----------+-------------|
        | 0x800    |    IPv4     |
        |----------+-------------|
        | 0x806    |    ARP      |
        |----------+-------------|
        | 0x8035   |    RARP     |
        |----------+-------------|
        | 0x86DD   |    IPv6     |
        +----------+-------------+

            Table 1

        These values are taken from the 'ETHER TYPE' numbers assigned
        by [IANA]. Other network protocols, identified by different
        values of 'ETHER TYPE', may use the encapsulation format
        defined herein but such use is outside of the scope of this
        document.
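
        As an illustration, the encapsulation described above can be
        sketched in Python as follows. The sketch assumes that both
        header fields are carried in network byte order and that the
        Reserved field is sent as zero; neither is stated in this
        section. The function and constant names are illustrative
        only.

```python
import struct
from typing import Tuple

# 'ETHER TYPE' values from Table 1.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ETHERTYPE_IPV6 = 0x86DD

# Two 16-bit fields: Type, Reserved (network byte order assumed).
_ENCAP = struct.Struct("!HH")


def encapsulate(ethertype: int, datagram: bytes) -> bytes:
    """Prefix an IP/ARP datagram with the 4-octet header of Figure 1."""
    # Reserved is written as zero; this is an assumption of the sketch.
    return _ENCAP.pack(ethertype, 0) + datagram


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Split a received payload into (type, encapsulated datagram)."""
    if len(payload) < _ENCAP.size:
        raise ValueError("payload shorter than the encapsulation header")
    ethertype, _reserved = _ENCAP.unpack_from(payload)
    return ethertype, payload[_ENCAP.size:]
```

        A receiver would dispatch on the returned type value, for
        example handing 0x86DD payloads to the IPv6 input path.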

    |<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
    +-------+------+---------+---------+-------------+---------+-------+
    |Local  |      |Base     |Datagram |   4-octet   |         |       |
    |Routing| GRH* |Transport|Extended |   header    |Invariant|Variant|
    |Header |Header|Header   |Transport|      +      |  CRC    |  CRC  |
    |       |      |         |Header   |   IP/ARP    |         |       |
    +-------+------+---------+---------+-------------+---------+-------+

                    Figure 2

        Figure 2 depicts the IB frame encapsulating an IP/ARP
        datagram. The IB frame headers are described in detail in the
        InfiniBand Architecture specification [IB_ARCH]. The
        InfiniBand specification requires the use of the Global
        Routing Header (GRH) [IPoIB_ARCH] when multicasting or when an
        InfiniBand packet traverses from one IB subnet to another
        through an IB router. Its use is optional for unicast
        transmission between nodes within an IB subnet. The IPoIB
        implementation MUST be able to handle packets received with
        or without the GRH.


4.0 Maximum Transmission Unit


        IB MTU:
                The IB components, i.e., IB links, switches, CAs, and
                IB routers, may support maximum payloads of 256, 512,
                1024, 2048 or 4096 bytes.



        IPoIB-Link MTU:
                An IPoIB link is formed by the IPoIB nodes joining the
                broadcast-GID [IPoIB_MCAST]. The IPoIB-link MTU is the
                MTU value associated with the broadcast-GID. The
                IPoIB-link MTU can be set to any value up to the
                smallest IB MTU supported by the IB components
                comprising the IPoIB link.


        In order to reduce problems with fragmentation and path-MTU
        discovery, this document requires that all IPoIB
        implementations support an MTU of 2044 octets, i.e., a
        2048-octet IPoIB-link MTU minus the 4-octet encapsulation
        overhead. Larger and smaller MTUs MAY be supported, but the
        default configuration must support an MTU of 2044 octets.
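
        The arithmetic above can be written out as a short Python
        sketch. The function names are hypothetical, and the list of
        component MTUs is assumed to be known to the caller; how it
        is obtained is not covered here.

```python
from typing import Iterable

ENCAP_HEADER_LEN = 4  # octets, the encapsulation header of Section 3.0


def max_ipoib_link_mtu(component_ib_mtus: Iterable[int]) -> int:
    """Largest usable IPoIB-link MTU: the smallest IB MTU on the link."""
    return min(component_ib_mtus)


def ip_mtu(ipoib_link_mtu: int) -> int:
    """MTU seen by IP: the IPoIB-link MTU minus the 4-octet header."""
    return ipoib_link_mtu - ENCAP_HEADER_LEN
```

        For a link whose components support 4096, 2048 and 4096
        octets, the IPoIB-link MTU can be at most 2048 octets, which
        leaves 2044 octets for the IP datagram.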

        In IPv6 subnets the MTU may be reduced by a Router
        Advertisement [RFC2461] containing an MTU option which
        specifies a smaller MTU, or by manual configuration of each
        node. If a Router Advertisement received on an IPoIB interface
        has an MTU option specifying an MTU larger than the link MTU
        or larger than a manually configured value, that MTU option
        may be logged to system management but must be otherwise
        ignored.
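
        A minimal Python sketch of this rule follows. It assumes that
        the caller supplies the current MTU, the link MTU and any
        manually configured value as plain integers; the function
        name and the use of the logging module are illustrative
        choices, not requirements of this document.

```python
import logging
from typing import Optional

log = logging.getLogger("ipoib.mtu")


def apply_ra_mtu(current_mtu: int, ra_mtu: int, link_mtu: int,
                 manual_mtu: Optional[int] = None) -> int:
    """Return the MTU to use after receiving an RA MTU option."""
    # The option may only reduce the MTU below the link MTU and below
    # any manually configured value.
    limit = link_mtu if manual_mtu is None else min(link_mtu, manual_mtu)
    if ra_mtu > limit:
        # May be logged to system management, otherwise ignored.
        log.warning("ignoring RA MTU option %d (limit %d)", ra_mtu, limit)
        return current_mtu
    return ra_mtu
```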

        Similarly, the IPv4 MTU may also be reduced by manual
        configuration of each node.

        For purposes of this document, information received from DHCP
        is considered "manually configured".


5.0 IPv6 Stateless Autoconfiguration

        IB architecture associates an EUI-64 identifier termed the
        GUID (Globally Unique Identifier) [IPoIB_ARCH, IB_ARCH] with
        each IB port. The LID (16 bits) is unique within an IB subnet
        only.

        The interface identifier may be chosen from:

                1) The EUI-64 compliant Globally Unique Identifier
                   (GUID) assigned by the manufacturer.

                2) If the IPoIB subnet is fully contained within an IB
                   subnet, any of the unique 16-bit LIDs of the port
                   associated with the IPoIB interface.

                   The LID values of an IB port may change after a
                   reboot/power-cycle of the IB node. Therefore, if a
                   persistent value is desired, it would be prudent to
                   not use the LID to form the interface identifier.

                   On the other hand, the LID provides an identifier
                   that can be used to create a more anonymous IPv6
                   address since the LID is not globally unique and is
                   subject to change over time.

        It is RECOMMENDED that the link-local address be constructed
        from the port's EUI-64 identifier as per the rules specified
        in [RFC2373].

5.1 IPv6 Link Local Address

        The IPv6 link local address for an IPoIB interface is formed
        as described in [RFC2373] using the Interface Identifier
        described in the previous section.
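
        The following Python sketch shows one reading of this
        construction for the EUI-64 case. It assumes the rule of
        [RFC2373] that the universal/local bit of the EUI-64 is
        inverted to form the interface identifier, which is then
        appended to the link-local prefix fe80::/64. The GUID value
        used in the usage note is made up for illustration.

```python
import ipaddress

LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")  # fe80::/64


def interface_id_from_guid(guid: bytes) -> bytes:
    """Form a 64-bit interface identifier from a port's EUI-64 GUID."""
    if len(guid) != 8:
        raise ValueError("a GUID is 8 octets long")
    # Invert the universal/local bit (0x02 in the first octet).
    return bytes([guid[0] ^ 0x02]) + guid[1:]


def link_local_address(guid: bytes) -> ipaddress.IPv6Address:
    """Combine the link-local prefix with the interface identifier."""
    return ipaddress.IPv6Address(
        LINK_LOCAL_PREFIX + interface_id_from_guid(guid))
```

        With the hypothetical GUID 00:02:c9:03:00:00:10:39 this
        yields the address fe80::202:c903:0:1039.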


6.0 Address Mapping - Unicast

        Address resolution in IPv4 subnets is accomplished through
        the Address Resolution Protocol (ARP) [RFC826]. It is
        accomplished in IPv6 subnets using the Neighbor Discovery
        protocol [RFC2461].

6.1 Link Information

        An InfiniBand packet over the UD mode includes multiple
        headers such as the LRH (local route header), GRH (global
        route header), BTH (base transport header) and DETH (datagram
        extended transport header), as depicted in Figure 2 and
        specified in the InfiniBand architecture [IB_ARCH]. All these
        headers comprise the link-layer in an IPoIB link.

        The parameters needed in these IBA headers constitute the
        link-layer information that needs to be determined before an
        IP packet may be transmitted across the IPoIB link.

        The parameters that need to be determined are:

        a) LID (local identifier)

                The LID is always needed. A packet always includes the
                LRH that is targeted at the remote node's LID, or an
                IB router's LID to get to the remote node in another
                IB subnet.

        b) GID (global identifier)

                The GID is not needed when exchanging information
                within an IB subnet, though it may be included in any
                packet. It is an absolute necessity when transmitting
                across multiple IB subnets since the IB routers use
                the GID to correctly forward the packets. The source
                and destination GIDs are fields included in the GRH.

                The GID, if formed using the GUID, can be used to
                unambiguously identify an endpoint.

        c) QPN (queue pair number)

                Every unicast UD communication is directed to a
                particular queue pair (QP) at the peer.

        d) Q_Key

                A Q_Key is associated with each unreliable datagram
                QPN. The received packets must contain a Q_Key that
                matches the QP's Q_Key to be accepted.

        e) P_Key

                A successful communication between two IB nodes using
                UD mode can occur only if the two nodes have
                compatible P_Keys. This is referred to as being in the
                same partition[IB_ARCH]. P_Keys are checked at the
                receiving channel adapter and may be optionally
                checked at intermediate switches/IB routers. If the
                P_Key in the packet does not match the expected P_Key
                the packet is dropped.

        f) SL (service level)

                Every IBA packet contains an SL value. A path in IBA
                is defined by the three-tuple (source LID, destination
                LID, SL). The SL in turn is mapped to a virtual lane
                (VL) at every CA or switch that sends/forwards the
                packet [IPoIB_ARCH]. Multiple SLs may be used between
                two endpoints to provide for load-balancing. SLs may
                also be used for providing a QoS infrastructure or to
                avoid deadlocks in the IBA fabric.


        Another auxiliary piece of information, not included in the
        IBA headers, is:

        g) Path rate

                The InfiniBand architecture defines multiple link
                speeds. A higher speed transmitter can swamp
                switches/CAs. To avoid such congestion every source
                transmitting at greater than 1x speeds is required to
                determine the 'path rate' before the data may be
                transmitted [IB_ARCH].
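
        One way an implementation might group the parameters a)
        through g) for a given destination is sketched below in
        Python. The class and field names are hypothetical. Only the
        widths that appear in this document are checked (16-bit LID,
        128-bit GID, 24-bit QPN); the representation of the remaining
        fields is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkParameters:
    """Link-layer information needed before sending to one peer."""
    lid: int                         # a) peer's or IB router's LID
    gid: bytes                       # b) 16-octet GID
    qpn: int                         # c) destination queue pair number
    q_key: int                       # d) must match the QP's Q_Key
    p_key: int                       # e) partition key of the link
    sl: int                          # f) service level
    path_rate: Optional[str] = None  # g) not carried in the IBA headers

    def __post_init__(self) -> None:
        if not 0 <= self.lid <= 0xFFFF:
            raise ValueError("LID does not fit in 16 bits")
        if len(self.gid) != 16:
            raise ValueError("GID must be 16 octets")
        if not 0 <= self.qpn <= 0xFFFFFF:
            raise ValueError("QPN does not fit in 24 bits")
```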

6.1.1 Link Layer Address/Hardware Address

        Though the list of information required for a successful
        transmittal of an IPoIB packet is large, not all the
        information need be determined during the IP address
        resolution process.

        The IPoIB link-layer address used in the source/target
        link-layer address option in IPv6 and the 'hardware address'
        in IPv4/ARP have the same format.

        The format is as described below:


     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |    Reserved   |              Queue Pair Number                |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    +                                                               +
    |                                                               |
    +                            GID                                +
    |                                                               |
    +                                                               +
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                Figure 3

        a) Reserved Flags

                These 8 bits are reserved for future use. These bits
                MUST be set to zero on send and ignored on receive
                unless specified differently in a future document.

        b) Queue Pair Number (QPN)

                Every unicast communication in IB architecture is
                directed to a specific queue pair(QP)[IB_ARCH]. This
                QP number is included in the link description. All IP
                communication to the relevant IPoIB interface MUST be
                directed to this QPN. In the case of IPv4 subnets the
                address resolution protocol(ARP) reply packets are
                also directed to the same QPN.

                The choice of the QPN value for IP/ARP communication
                is up to the receiving implementation.

        c) Global Identifier (GID)

                This is one of the Global Identifiers (GIDs) [IB_ARCH]
                of the port associated with the IPoIB interface. IB
                associates multiple GIDs with a port. It is
                RECOMMENDED that the 'GID at index 0' be included in
                the link-layer/hardware address [IB_ARCH]. The GID at
                index 0 is formed using the IB port's manufacturer
                assigned EUI-64 identifier.
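
        The 20-octet layout described above (8 reserved bits, a
        24-bit QPN and a 16-octet GID) can be sketched in Python as
        follows. Network byte order is assumed, and the function
        names are illustrative only.

```python
import struct
from typing import Tuple

LL_ADDR_LEN = 20  # 4 octets (reserved + QPN) + 16 octets (GID)


def pack_ll_address(qpn: int, gid: bytes) -> bytes:
    """Build the IPoIB link-layer/hardware address."""
    if not 0 <= qpn <= 0xFFFFFF:
        raise ValueError("QPN does not fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    # The reserved octet is the top byte of the first 32-bit word and
    # is sent as zero.
    return struct.pack("!I", qpn) + gid


def unpack_ll_address(addr: bytes) -> Tuple[int, bytes]:
    """Return (QPN, GID); the reserved bits are ignored on receive."""
    if len(addr) != LL_ADDR_LEN:
        raise ValueError("link-layer address must be 20 octets")
    (word,) = struct.unpack_from("!I", addr)
    return word & 0xFFFFFF, addr[4:]
```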

6.1.2  Auxiliary Link Information

        The rest of the parameters are determined as follows:

        a) Local Identifier(LID)

                The method of determining the peer's LID is not
                defined in this document. It is up to the
                implementation to use any of the IBA approved methods
                to determine the destination LID. One such method is
                to use the GID determined during address resolution
                to retrieve the associated LID from the IB routing
                infrastructure or the Subnet Administrator (SA)
                [IB_ARCH].

                It is the responsibility of the administrator to
                ensure that the IB subnet(s) have unicast connectivity
                between the IPoIB nodes. The GID exchanged between two
                endpoints in a multicast message(ARP/ND) does not
                guarantee the existence of a unicast path between the
                two. This has to be ensured by the fabric
                administrator.

                There may be multiple LIDs, and hence paths, between
                the endpoints. The criteria for selection of the LIDs
                are beyond the scope of this document.

        b) Q_Key

                The Q_Key received on joining the broadcast-GID MUST
                be used for all IPoIB communication over the
                particular IPoIB link.

        c) P_Key

                The network administrator is required to set up an
                IPoIB link by setting up an IB partition and assigning
                it a unique P_Key [IPoIB_MCAST].

                Thus the P_Key to be used in the IP subnet is not
                discovered but is a configuration parameter.

        d) Service Level(SL)

                The method of determining the SL is not defined in
                this document. The SL is determined by any of the IBA
                approved methods.

        e) Path rate

                The implementation must leverage IB methods to
                determine the path rate as required.

6.2 Address Resolution in IPv4 Subnets

        The ARP packet header is as defined in [RFC826]. The hardware
        type is set to 32 (decimal) as specified by the Internet
        Assigned Numbers Authority (IANA) [IANA_ARP]. The rest of the
        fields are used as per [RFC826].

                16 bits: hardware type
                16 bits: protocol
                 8 bits: length of hardware address
                 8 bits: length of protocol address
                16 bits: ARP operation

        The remaining fields in the packet hold the sender/target
        hardware and protocol addresses.

        [ sender hardware address ]
        [ sender protocol address ]
        [ target hardware address ]
        [ target protocol address ]

        The hardware address included in the ARP packet will be as
        specified in section 6.1.1 and depicted in Figure 3.

        The length of the hardware address used in the ARP packet
        header is therefore 20 octets.
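
        As a sketch, an IPv4 ARP request carrying these values could
        be assembled in Python as shown below. The operation code 1
        (request) comes from [RFC826]. Filling the target hardware
        address with zeros in a request is an assumption of this
        sketch, and the function name is hypothetical.

```python
import ipaddress
import struct

ARP_HRD_INFINIBAND = 32   # hardware type assigned by IANA
ARP_PRO_IPV4 = 0x0800     # protocol type for IPv4
ARP_HLN = 20              # length of the IPoIB hardware address
ARP_PLN = 4               # length of an IPv4 address
ARP_OP_REQUEST = 1


def build_arp_request(sha: bytes, spa: str, tpa: str) -> bytes:
    """Build an ARP request body (without the encapsulation header)."""
    if len(sha) != ARP_HLN:
        raise ValueError("sender hardware address must be 20 octets")
    header = struct.pack("!HHBBH", ARP_HRD_INFINIBAND, ARP_PRO_IPV4,
                         ARP_HLN, ARP_PLN, ARP_OP_REQUEST)
    tha = bytes(ARP_HLN)  # unknown target; zero-filled by assumption
    return (header + sha + ipaddress.IPv4Address(spa).packed
            + tha + ipaddress.IPv4Address(tpa).packed)
```

        The resulting packet is 8 + 20 + 4 + 20 + 4 = 56 octets long
        before the 4-octet encapsulation header is added.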

6.3 Address Resolution in IPv6 Subnets

        The Source/Target Link-layer address option is used in Router
        Solicitation, Router Advertisement, Redirect, Neighbor
        Solicitation and Neighbor Advertisement messages when such
        messages are transmitted on InfiniBand networks.

        The source/target address option is specified as follows:

        Type:
            Source Link-layer address    1
            Target Link-layer address    2

        Length: 3

        Link-layer address:

            The link-layer address is as specified in section 6.1.1
            and depicted in Figure 3.
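
        The Length value of 3 follows from the option format of
        [RFC2461], in which the length is expressed in units of 8
        octets and covers the Type and Length fields. A small Python
        sketch of the arithmetic is given below; the function name is
        hypothetical, and where the padding octets are placed within
        the option is not addressed by the sketch.

```python
from typing import Tuple


def nd_ll_option_size(ll_addr_len: int = 20) -> Tuple[int, int]:
    """Return (Length field value, padding octets) for the option."""
    total = 1 + 1 + ll_addr_len    # Type + Length + link-layer address
    units = -(-total // 8)         # round up to a multiple of 8 octets
    return units, units * 8 - total
```

        For the 20-octet IPoIB link-layer address this gives a Length
        of 3 (24 octets) with 2 octets of padding.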


7.0 IANA Considerations

        To support ARP over InfiniBand a value for the Address
        Resolution Parameter 'Number Hardware Type (hrd)' is required.
        IANA has assigned the number '32' to indicate
        InfiniBand[IANA_ARP].

8.0 Security Considerations

        This document specifies IP transmission over a multicast
        network. Any network of this kind is vulnerable to a sender
        claiming another's identity in order to forge traffic or
        eavesdrop. It is the responsibility of the higher layers or
        applications to implement suitable counter-measures if this
        is a problem.

        The successful transmission of an IPoIB packet is dependent on
        multiple parameters that must be determined correctly. The
        operations for creating and configuring an IPoIB link are
        described in [IPoIB_MCAST]. These include, among others,
        creating IB multicast groups in the SA and creating and
        attaching QPs to IB multicast groups. These operations MUST
        be protected by the underlying operating system. This is to
        prevent malicious, non-privileged software from hijacking
        important resources and configurations. For example, a bogus
        IPoIB broadcast group may prevent a proper one from being
        created when the network administrator tries to set up a
        link.

        Controlled Q_Keys SHOULD be used in IPoIB links. This is to
        prevent non-privileged software from fabricating IP
        datagrams.

9.0 Acknowledgements

        The authors would like to thank Bruce Beukema, David Brean,
        Dan Cassiday, Yaron Haviv, Thomas Narten, Erik Nordmark, Greg
        Pfister, Jim Pinkerton, Renato Recio, Kevin Reilly, Madhu
        Talluri and Satya Sharma for their suggestions and many
        clarifications on the IBA specification.

10.0 References

[IB_ARCH]       InfiniBand Architecture Specification, Volume 1.0a
                  www.infinibandta.org

[IPoIB_ARCH]    draft-ietf-ipoib-architecture-02.txt

[IPoIB_MCAST]   draft-ietf-ipoib-link-multicast-02.txt

[RFC2373]       IP Version 6 Addressing Architecture

[RFC2375]       IPv6 Multicast Address Assignments

[RFC826]        An Ethernet Address Resolution Protocol

[RFC1700]       Assigned Numbers.

[RFC2434]       Guidelines for Writing an IANA Considerations Section in RFCs

[RFC2461]       Neighbor Discovery for IP version 6 (IPv6)

[RFC3041]       Extensions to IPv6 Address Autoconfiguration

[IANA]          Internet assigned numbers authority, www.iana.org

[IANA_ARP]      www.iana.org/assignments/arp-parameters


11.0 Authors' Addresses

Vivek Kashyap

15450, SW Koll Parkway
Beaverton, OR 97006
USA

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

H.K. Jerry Chu

17 Network Circle, UMPK17-201
Menlo Park, CA 94025
USA

Phone: +1 650 786-5146
Email: jerry.chu@sun.com


Full Copyright Statement

        Copyright (C) The Internet Society (2001). All Rights Reserved.

        This document and translations of it may be copied and
        furnished to others, and derivative works that comment on or
        otherwise explain it or assist in its implementation may be
        prepared, copied, published and distributed, in whole or in
        part, without restriction of any kind, provided that the above
        copyright notice and this paragraph are included on all such
        copies and derivative works. However, this document itself may
        not be modified in any way, such as by removing the copyright
        notice or references to the Internet Society or other Internet
        organizations, except as needed for the purpose of developing
        Internet standards in which case the procedures for copyrights
        defined in the Internet Standards process must be followed, or
        as required to translate it into languages other than
        English.

        The limited permissions granted above are perpetual and will
        not be revoked by the Internet Society or its successors or
        assigns.

        This document and the information contained herein is provided
        on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
        ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
        USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
        ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
        PARTICULAR PURPOSE.