INTERNET-DRAFT                                            H.K. Jerry Chu
<draft-ietf-ipoib-link-multicast-02.txt>                Sun Microsystems
                                                           Vivek Kashyap
                                                                     IBM
Expires: December, 2002                                       June, 2002


             IP link and multicast over InfiniBand networks


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   Copyright (C) The Internet Society (2002).  All Rights Reserved.


Abstract

   This document specifies a method for setting up IP subnets and
   multicast services over InfiniBand(TM) networks. Discussions in this
   document are applicable to both IPv4 and IPv6, unless explicitly
   specified. A separate document will cover unicast and encapsulation
   of IP datagrams over InfiniBand networks.


Table of Contents
   1.0     Introduction
   2.0     Terminology
   3.0     Basic IPoIB Transport - Unreliable Datagram
   4.0     IB Multicast Architecture
   5.0     IB Links vs IPoIB Links



Chu & Kashyap                                                   [Page 1]


draft-ietf-ipoib-link-multicast-02.txt                         June 2002


   6.0     Setting up an IPoIB Link
   6.1     Maximum Transmission Unit
   6.2     IPoIB Link Q_Key
   6.3     Other Link Attributes
   7.0     The IPoIB Broadcast Group
   8.0     Mapping for other Multicast Groups
   9.0     Sending and Receiving IP Multicast Packets
   10.0    Security Considerations
   11.0    Acknowledgments
   12.0    References
   13.0    Author's Address
   14.0    Full Copyright Statement


1.0 Introduction

   InfiniBand Architecture (IBA) defines four layers of network services
   corresponding to layer one through layer four of the OSI reference
   model.  For the purpose of running IP over an InfiniBand (IB)
   network, the IB link, network, and transport layers collectively
   constitute the data link layer to the IP stack. One can find a
   general overview of IB architecture related to IP networks in
   [IPoIB_ARCH].

   This document focuses on the steps necessary to lay out an IP
   network on top of an IB network. It describes all the elements of an
   IP over InfiniBand (IPoIB) link, how to configure its associated
   attributes, and how to set up basic broadcast and multicast services
   for it. The IPoIB link is the building block upon which an IP network
   consisting of many IP subnets connected by routers can be built.
   Subnetting allows the containment of broadcast traffic within a
   single link. It also provides a certain degree of isolation for
   administrative purposes between nodes on different subnets.

2.0 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.0 Basic IPoIB Transport - Unreliable Datagram

   InfiniBand defines four types of transport services [IBTA]: reliable
   connection, unreliable connection, reliable datagram, and unreliable
   datagram. IBA also defines a special raw datagram service for
   encapsulation purposes. Both unreliable datagram and raw datagram
   define support for multicast. They provide the basic transport
   mechanism that best matches the IP datagram paradigm.

   IB unreliable datagram provides many additional features such as
   partition key (P_Key) protection, multiple queue pairs (QPs), and
   Q_Key protection. Moreover, it requires a 32-bit invariant CRC,
   which provides much stronger protection against data corruption
   than the 16-bit CRC that a raw datagram carries.

   For these reasons, IB unreliable datagram is considered to be a much
   better choice as the basic IPoIB transport than the raw datagram, and
   is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH],
   [IPoIB_ENCAP]).

4.0 IB Multicast Architecture

   The following discussion gives a short overview of the multicast
   architecture in InfiniBand. For a more complete description, the
   reader is referred to [IBTA] and [IPoIB_ARCH].

   IBA defines two layers of multicast services. Its link layer uses
   multicast LIDs (MLIDs), which are allocated by the Subnet Manager
   (SM) and fall in the range from 0xC000 to 0xFFFE (approximately
   16k). MLIDs are used by IB switches to program their multicast
   forwarding tables. An IB switch implementation may, however, support
   far fewer MLIDs in its forwarding table.
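
   The range check implied above can be sketched in a few lines. This
   is only an illustration, assuming the multicast LID range 0xC000
   through 0xFFFE; the names MLID_FIRST, MLID_LAST, and
   is_multicast_lid are hypothetical and not part of any IB interface.

```python
# Assumed multicast LID range: 0xC000 through 0xFFFE inclusive.
MLID_FIRST = 0xC000
MLID_LAST = 0xFFFE

# Number of MLIDs in the range, which is roughly 16k.
MLID_COUNT = MLID_LAST - MLID_FIRST + 1


def is_multicast_lid(lid: int) -> bool:
    """Return True if a 16-bit LID falls inside the multicast range."""
    return MLID_FIRST <= lid <= MLID_LAST
```

   Since a switch may support far fewer entries than MLID_COUNT, an
   implementation should not assume that every MLID in the range can be
   programmed into a given forwarding table.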

   The IB network layer uses multicast GIDs (MGIDs), which closely
   resemble the IPv6 multicast addresses [AARCH] shown below.

   |   8    |  4 |  4 |                  112 bits                   |
   +--------+----+----+---------------------------------------------+
   |11111111|flgs|scop|                  group ID                   |
   +--------+----+----+---------------------------------------------+

                                 Figure 1

   [IPoIB_ARCH] describes each field in more detail.
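
   The layout in Figure 1 can be decoded with simple shifts and masks.
   The following is a minimal sketch, not a normative parser. The names
   MgidFields and parse_mgid are hypothetical, and Python's ipaddress
   module is used only as a convenient way to read the 128-bit value
   from IPv6-style text.

```python
import ipaddress
from typing import NamedTuple


class MgidFields(NamedTuple):
    prefix: int    # top 8 bits; 0xFF for a multicast GID
    flags: int     # 4-bit flgs field
    scope: int     # 4-bit scop field
    group_id: int  # low 112 bits


def parse_mgid(text: str) -> MgidFields:
    """Split a 128-bit MGID, written in IPv6 notation, into its fields."""
    value = int(ipaddress.IPv6Address(text))
    return MgidFields(
        prefix=value >> 120,
        flags=(value >> 116) & 0xF,
        scope=(value >> 112) & 0xF,
        group_id=value & ((1 << 112) - 1),
    )
```

   For instance, parse_mgid("FF12:401B:8::2") yields a prefix of 0xFF,
   flags of 1, and a scope of 2.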

   Since every IB multicast packet is required to carry both LRH and
   GRH, a valid MGID and a valid MLID are both needed before a valid IB
   multicast packet can be constructed.

   An IB multicast group is uniquely identified by a valid MGID. Before
   an MGID can be used within an IB subnet, either as the destination
   address of a multicast packet or to represent a multicast group that
   an IB node can join, an IB multicast group corresponding to the MGID
   must be created through the Subnet Administrator (SA). Besides the
   MGID, the creator must supply values of the path MTU, Q_Key, TClass,
   FlowLabel, and HopLimit that are appropriate for all the potential
   clients of the multicast group to use. In return, SA will allocate
   an MLID to be used by switches in the local IB subnet.

   Unreliable multicast is defined by IBA as an optional functionality
   for channel adaptors (CAs) and switches. Link multicast, however,
   has become an indispensable function in modern IP networks. For this
   reason, an IPoIB fabric is required to support multicast. This
   requirement applies to all the CAs and switches that are part of an
   IP network.

5.0 IB Links vs IPoIB Links

   A link segment on top of which an IP subnet can be configured is
   defined in [IPV6] as a communication facility or medium over which
   nodes can communicate at the "link" layer.  For most types of
   communication media, the boundary between different data link
   segments follows the physical topology of the network. E.g. an
   Ethernet network connected by switches, hubs, or bridges usually
   forms a single link segment and broadcast/multicast domain. Different
   Ethernet segments can be connected together by IP routers at the
   network layer.

   InfiniBand defines its own link-layer and subnets consisting of nodes
   connected by IB switches. However, the IPoIB link boundary need not
   follow the IB link boundary. Nodes residing on different IB subnets
   can still communicate directly with one another through IB routers at
   the InfiniBand network layer. This communication at the network layer
   applies to unicast as well as multicast.

   The ultimate requirement for two nodes in the same IB fabric to
   communicate at the IB level, besides physical connectivity, is a
   common P_Key.

   Partitioning in IB provides an isolation mechanism among nodes in an
   IB fabric, much like VLANs in an Ethernet network.  Each HCA (Host
   Channel Adaptor) port of an endnode contains a P_Key table holding
   all the valid P_Keys the port is allowed to use. The P_Key table is
   set up by the SM of the local IB subnet. Each QP is programmed with a
   P_Key from the local P_Key table. This P_Key is carried in all the
   outgoing packets from the QP, and is used to compare against the
   P_Key of incoming packets to the QP. Any packet with an invalid P_Key
   will be discarded by the QP and trigger a P_Key violation trap.  IB
   switches may optionally enforce partition checking too.

   Following the above, IB partitions are the natural choice for
   defining the IPoIB link boundary. They also provide much needed
   flexibility for a network administrator to group nodes logically into
   different subnets in a large network.

6.0 Setting up an IPoIB Link

   A network administrator defines an IPoIB link by setting up an IB
   partition and assigning it a unique P_Key. An IB partition may or may
   not span multiple IB subnets; whether it does is mostly transparent
   to IPoIB.

   Each node attached to the IB partition MUST have one of its HCAs
   assigned the P_Key to use. Note that the P_Key table of an HCA port
   may contain many P_Keys. It is up to the implementation to define the
   method by which the P_Key relevant to a particular IPoIB subnet is
   determined and conveyed to the IPoIB stack. For example, an
   implementation can resort to manual configuration to choose the P_Key
   or a set of P_Keys for IPoIB use, and rely on DHCP [DHCP] to assign
   an IP subnet number to each IPoIB link.

   Once an IB partition is established for IPoIB use, the link MTU and
   Q_Key are two other attributes that must be chosen before the IPoIB
   link can be configured.

6.1 Maximum Transmission Unit

   IB defines five permissible maximum payload sizes (MTUs): 256, 512,
   1024, 2048, and 4096 bytes. [IPV6] requires a link MTU of 1280 bytes
   or greater. For better compatibility with Ethernet, the dominant
   network medium in both LAN and WAN environments, the IPoIB link MTU
   SHALL be 1500 bytes or greater. This leaves only 2048 and 4096 bytes
   as the two acceptable MTUs for IPoIB. Channel adaptors supporting an
   MTU smaller than this minimum can still expose an acceptable MTU to
   IP through an adaptation layer that fragments larger messages into
   smaller IB packets and reassembles them on the receiving end.
   However, this must be done in a way that is transparent to the IP
   stack.

   It is up to the network administrator to select a link MTU to use
   when configuring an IPoIB link. The link MTU SHALL NOT be greater
   than the MTU of any IB device on the IPoIB link minus the size of the
   "Type" field encapsulated in the payload [IPoIB_ENCAP]. Here the IB
   devices include IB switches, CAs, and routers.
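
   The upper bound on the link MTU can be sketched as follows. This is
   an illustration only. The function name max_link_mtu is hypothetical,
   and the 4-byte value used for the encapsulation overhead is an
   assumption made for the example; the actual size of the "Type" field
   is defined by [IPoIB_ENCAP]. The sketch also does not model the
   adaptation layer mentioned above for CAs with a smaller MTU.

```python
from typing import Iterable

IB_MTUS = (256, 512, 1024, 2048, 4096)  # permissible IB MTUs in bytes
MIN_IPOIB_LINK_MTU = 1500               # minimum IPoIB link MTU
ENCAP_BYTES = 4  # ASSUMPTION: placeholder for the "Type" field size


def max_link_mtu(device_mtus: Iterable[int],
                 encap_bytes: int = ENCAP_BYTES) -> int:
    """Largest link MTU usable by every IB device on the IPoIB link."""
    mtus = list(device_mtus)
    if not mtus:
        raise ValueError("at least one device MTU is required")
    for mtu in mtus:
        if mtu not in IB_MTUS:
            raise ValueError(f"{mtu} is not a permissible IB MTU")
    # The smallest device MTU bounds the link, less the encapsulation.
    link_mtu = min(mtus) - encap_bytes
    if link_mtu < MIN_IPOIB_LINK_MTU:
        raise ValueError("devices cannot support the minimum link MTU")
    return link_mtu
```

   An administrator remains free to configure a value below this bound,
   as long as it is not smaller than 1500 bytes.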

   In general, the largest possible link MTU should be employed to
   attain better throughput. One caveat is that once a link MTU is
   chosen for a given IPoIB link, nodes whose CAs support only a smaller
   MTU will not be able to join the link unless the whole link and all
   the devices attached to it are reconfigured to use a smaller MTU.

   The flexibility to configure a link MTU smaller than the maximum
   does make it easier to bridge an IPoIB link with an Ethernet link,
   by setting the IPoIB link MTU to 1500 bytes. For IPv4, this may
   require manual configuration of a link MTU different from the
   maximum that all the nodes support. (See 7.0 below.)  For IPv6, one
   can use the MTU option of the router advertisement [DISC] to announce
   a smaller MTU to all the nodes.

   In case an IPoIB link spans more than one IB subnet, the IPoIB link
   MTU MUST NOT exceed the path MTU of any path connecting two nodes in
   the same IB partition. It is up to the network administrator to
   determine the appropriate path MTU value that will work for any node
   on the same IPoIB link.

6.2 IPoIB Link Q_Key

   A Q_Key is placed by the source QP in every IB datagram, and is
   compared against the Q_Key of the destination QP.  A Q_Key violation
   will cause the offending datagram to be dropped, and a Q_Key
   violation trap to be raised.

   A Q_Key must be selected for use by all the QPs attached to an IPoIB
   link. It is recommended that a controlled Q_Key, i.e. one with the
   high order bit set, be used. This is to prevent non-privileged
   software from fabricating and sending out bogus IP datagrams. All QPs
   configured to be used on a given IPoIB link SHALL be assigned the
   same per-link Q_Key.
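
   A check for a controlled Q_Key reduces to testing the high order bit
   of the 32-bit value. The sketch below is illustrative; the names
   QKEY_CONTROLLED_BIT and is_controlled_qkey are hypothetical.

```python
QKEY_CONTROLLED_BIT = 0x80000000  # high order bit of the 32-bit Q_Key


def is_controlled_qkey(q_key: int) -> bool:
    """Return True if the Q_Key has its high order bit set."""
    if not 0 <= q_key <= 0xFFFFFFFF:
        raise ValueError("a Q_Key is a 32-bit value")
    return bool(q_key & QKEY_CONTROLLED_BIT)
```

   A configuration tool could use such a check to warn the
   administrator when the Q_Key chosen for an IPoIB link is not a
   controlled one.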

6.3 Other Link Attributes

   TClass, FlowLabel, and HopLimit are three other attributes that are
   required if an IPoIB link covers more than a single IB subnet.  The
   selection of these values is implementation dependent, but it must
   take into account the topology of the IB subnets comprising the IPoIB
   link in order to allow successful communication between any two nodes
   on the same IPoIB link.

7.0 The IPoIB Broadcast Group

   Once an IB partition is created with link attributes identified for
   an IPoIB link, the network administrator must create a special IB
   all-node multicast group (henceforth referred to as the broadcast
   group) with these link attributes that every node on the IPoIB link
   can join.

   The MGID of the broadcast group embeds the P_Key of the IB partition
   that defines the IPoIB link. A special signature is also embedded to
   identify the MGID as being for IPoIB use only. For IPv4 over IB, the
   signature will be "0x401B". For IPv6 over IB, the signature will be
   "0x601B".

   For an IPv4 subnet, the MGID for this special IB multicast group
   SHALL have the following format:

   |   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
   +--------+----+----+----------------+---------+----------+---------+
   |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
   +--------+----+----+----------------+---------+----------+---------+

                                 Figure 2


   For an IPv6 subnet, the format of the MGID SHALL look like this:

   |   8    |  4 |  4 |     16 bits    | 16 bits |       80 bits      |
   +--------+----+----+----------------+---------+--------------------+
   |11111111|0001|scop|0110000000011011|< P_Key >|000.............0001|
   +--------+----+----+----------------+---------+--------------------+

                                 Figure 3

   As for the scop bits, if the IPoIB link is fully contained within a
   single IB subnet, the scop bits SHALL be set to 2 (link-local).
   Otherwise the scope will be set to a higher value.
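
   The layouts in Figures 2 and 3 can be assembled with shifts, as the
   following sketch shows. It is an illustration under stated
   assumptions, not a reference implementation: the function name
   broadcast_mgid is hypothetical, and Python's ipaddress module is used
   only to print the 128-bit value in IPv6-style notation.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B
SCOPE_LINK_LOCAL = 2


def broadcast_mgid(p_key: int, ipv6: bool = False,
                   scope: int = SCOPE_LINK_LOCAL) -> ipaddress.IPv6Address:
    """Build the broadcast MGID of Figure 2 (IPv4) or Figure 3 (IPv6)."""
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("a P_Key is a 16-bit value")
    if not 0 <= scope <= 0xF:
        raise ValueError("scop is a 4-bit value")
    signature = IPV6_SIGNATURE if ipv6 else IPV4_SIGNATURE
    # Low 80 bits: 48 zero bits then 32 one bits for IPv4; 0...01 for IPv6.
    low80 = 1 if ipv6 else 0xFFFFFFFF
    value = ((0xFF << 120) | (0x1 << 116) | (scope << 112)
             | (signature << 96) | (p_key << 80) | low80)
    return ipaddress.IPv6Address(value)
```

   With a P_Key of 8 and link-local scope, this gives
   ff12:401b:8::ffff:ffff for IPv4 and ff12:601b:8::1 for IPv6.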

   The broadcast group for IPv4 will serve to provide a broadcast
   service for protocols like ARP to use.

   When a node is brought up on an IPoIB link identified by a P_Key, it
   must look for the right broadcast group to join. This is done by
   constructing the MGID with the link P_Key and the IPoIB signature.
   The node SHOULD always look for an MGID of link-local scope first
   before attempting one with a greater scope.

   Once the right MGID and broadcast group are identified, the local
   node SHOULD use the MTU associated with the broadcast group.  If the
   MTU of the broadcast group is greater than what the local HCA can
   support, the node cannot join the IPoIB link and operate as an IP
   node. Otherwise the local node must join the broadcast group and use
   the rest of the link attributes associated with the group for all
   future communication on the link.
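
   The lookup order and the MTU check described above might be sketched
   as follows. Everything about the SA interface here is an assumption:
   sa_groups is a hypothetical in-memory mapping standing in for SA
   queries, GroupAttrs and find_broadcast_group are illustrative names,
   and the set of greater scopes that is tried (3 through 14) is chosen
   only for the example.

```python
import ipaddress
from typing import Mapping, NamedTuple, Optional, Tuple


class GroupAttrs(NamedTuple):
    mtu: int    # MTU recorded for the IB multicast group, in bytes
    q_key: int  # per-link Q_Key recorded for the group


def _broadcast_mgid(p_key: int, scope: int,
                    ipv6: bool) -> ipaddress.IPv6Address:
    signature = 0x601B if ipv6 else 0x401B
    low80 = 1 if ipv6 else 0xFFFFFFFF
    return ipaddress.IPv6Address(
        (0xFF << 120) | (0x1 << 116) | (scope << 112)
        | (signature << 96) | (p_key << 80) | low80)


def find_broadcast_group(
        sa_groups: Mapping[ipaddress.IPv6Address, GroupAttrs],
        p_key: int, hca_mtu: int, ipv6: bool = False,
) -> Optional[Tuple[ipaddress.IPv6Address, GroupAttrs]]:
    """Locate the broadcast group, trying link-local scope first."""
    for scope in range(2, 0xF):  # ASSUMPTION: scopes 2..14, in order
        mgid = _broadcast_mgid(p_key, scope, ipv6)
        attrs = sa_groups.get(mgid)
        if attrs is None:
            continue
        if attrs.mtu > hca_mtu:
            # The node cannot join the IPoIB link in this case.
            raise ValueError("group MTU exceeds what the HCA supports")
        return mgid, attrs
    return None  # no broadcast group exists for this P_Key
```

   On success, the caller would go on to join the returned group and
   adopt its attributes for the link.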

   In addition to the special all-node multicast group for broadcast
   purposes, an all-router multicast group SHOULD be created at link
   configuration time if an IP router will be attached to the link. This
   is to facilitate the IP multicast operations described later. An IB
   multicast group for the all-router MGID must cover every IB subnet
   that the IPoIB link encompasses.  The format of the all-router MGID
   will be covered in the next section.

8.0 Mapping for other Multicast Groups

   The support of general IP multicast [IPMULT] over IB is similar to
   the case of the special broadcast group discussed above. An
   algorithmic mapping is used so that, given an IP multicast address,
   an individual host can compute the corresponding IB multicast address
   (MGID) by itself without having to consult an external entity.
   This also removes the need for an externally maintained IP to IB
   multicast mapping table.

   The IPoIB multicast mapping is depicted in Figure 4. The same mapping
   function is used for both IPv4 and IPv6 except for the IPoIB
   signature field.

   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits       |
   +--------+----+----+-----------------+---------+--------------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID      |
   +--------+----+----+-----------------+---------+--------------------+

                                 Figure 4

   Since an MGID allocated for transporting IP multicast datagrams is
   considered only a transient link-layer multicast address, all IB
   MGIDs allocated for IPoIB purposes SHOULD have T = 1. The scope bits
   SHALL be the same as those of the all-node MGID for the same IPoIB
   link.

   The IP multicast address is used together with a given IPoIB link
   P_Key to form the MGID of the IB multicast group. For IPv6, the lower
   80 bits of the group ID are used directly as the lower 80 bits of the
   MGID. For IPv4, the group ID is only 28 bits long and the rest of the
   80 bits are filled with 0.

   The rest of the bits are the same as those of the broadcast MGID.
   E.g. on an IPoIB link that is fully contained within a single IB
   subnet with a P_Key of 8, the MGID for the all-router multicast
   group with group ID 2 [AARCH, IGMP2] is

   FF12:401B:8:0:0:0:0:2

   for IPv4, or

   FF12:401B:8::2

   in a compressed format, and

   FF12:601B:8:0:0:0:0:2

   for IPv6, or

   FF12:601B:8::2

   in a compressed format.

   A special case exists for the IPv4 limited broadcast address
   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
   broadcast MGID for IPv4 networks as described in section 7 above.
   Also the IPv6 all-node multicast address "FF0X::1" [AARCH] maps
   naturally to the special broadcast MGID for IPv6 networks.
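
   The mapping of Figure 4, together with the limited broadcast special
   case, can be sketched as a single function. This is a minimal
   illustration: the name ip_multicast_to_mgid is hypothetical, and
   Python's ipaddress module is used both to parse the IP address and to
   print the resulting 128-bit MGID in IPv6-style notation.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def ip_multicast_to_mgid(addr: str, p_key: int,
                         scope: int = 2) -> ipaddress.IPv6Address:
    """Map an IP multicast (or IPv4 limited broadcast) address to an MGID."""
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("a P_Key is a 16-bit value")
    if not 0 <= scope <= 0xF:
        raise ValueError("scop is a 4-bit value")
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        signature = 0x401B
        if ip == LIMITED_BROADCAST:
            low80 = 0xFFFFFFFF               # broadcast MGID, low 32 bits set
        elif ip.is_multicast:
            low80 = int(ip) & 0x0FFFFFFF     # 28-bit IPv4 group ID
        else:
            raise ValueError("not an IPv4 multicast or broadcast address")
    else:
        if not ip.is_multicast:
            raise ValueError("not an IPv6 multicast address")
        signature = 0x601B
        low80 = int(ip) & ((1 << 80) - 1)    # lower 80 bits of the group ID
    value = ((0xFF << 120) | (0x1 << 116) | (scope << 112)
             | (signature << 96) | (p_key << 80) | low80)
    return ipaddress.IPv6Address(value)
```

   For the all-router groups 224.0.0.2 and FF02::2 with a P_Key of 8,
   the function produces ff12:401b:8::2 and ff12:601b:8::2
   respectively, matching the examples given earlier in this section.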

9.0 Sending and Receiving IP Multicast Packets

   For any MGID, the equivalent IB multicast group must be created
   before use. The implication for a sender is that, to send a packet
   destined for an IP multicast address, it must first check for the
   existence of the IB multicast group corresponding to the MGID on the
   outbound link. If one already exists, the MLID associated with the
   multicast group is used as the DLID for the packet. Otherwise, no
   member exists on the local link, and the packet should be forwarded
   to locally connected routers. This allows local routers to forward
   the packet to multicast listeners on remote networks.  The specific
   mechanism for a sender to forward packets to routers is left to
   implementations. One can use, for example, the broadcast group or
   the all-router multicast group for this purpose.

   A sender of multicast packets should cache information regarding the
   MLID and other attributes of the target IB multicast group in order
   to avoid expensive SA calls on every outgoing multicast packet. The
   cache may need to be validated periodically. E.g., if SA supports
   multicast group create/delete traps, the sender should register to
   monitor the status of the target IB multicast group through event
   notification. If multicast packets were sent to the all-router
   multicast group because no local listener existed, the sender must be
   notified by SA when listeners show up later on the local link. This
   allows the sender to redirect its packets to the right multicast
   group.
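
   One possible shape for the sender-side logic is sketched below. It is
   an illustration only and makes several assumptions: sa_lookup is a
   hypothetical callable standing in for an SA query, the all-router
   multicast group is chosen as the fallback (the broadcast group would
   serve equally well), and on_group_event stands for whatever hook an
   implementation ties to SA create/delete notifications.

```python
from typing import Callable, Dict, Hashable, NamedTuple, Optional


class GroupInfo(NamedTuple):
    mlid: int  # MLID allocated for the IB multicast group


class MulticastSender:
    def __init__(self,
                 sa_lookup: Callable[[Hashable], Optional[GroupInfo]],
                 all_router_mgid: Hashable) -> None:
        self._sa_lookup = sa_lookup            # hypothetical SA query
        self._all_router_mgid = all_router_mgid
        self._cache: Dict[Hashable, Optional[GroupInfo]] = {}

    def _resolve(self, mgid: Hashable) -> Optional[GroupInfo]:
        # Query SA only on a cache miss; "no such group" is cached too.
        if mgid not in self._cache:
            self._cache[mgid] = self._sa_lookup(mgid)
        return self._cache[mgid]

    def pick_dlid(self, mgid: Hashable) -> Optional[int]:
        """Return the DLID for a packet addressed to the given MGID."""
        info = self._resolve(mgid)
        if info is None:
            # No local member: hand the packet to the local routers.
            info = self._resolve(self._all_router_mgid)
        return None if info is None else info.mlid

    def on_group_event(self, mgid: Hashable) -> None:
        """Drop the cached entry when SA reports a create or delete."""
        self._cache.pop(mgid, None)
```

   Because negative results are cached, the event hook is what lets the
   sender switch from the all-router group to the proper group once a
   local listener appears.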

   For a node to join an IP multicast group, it must first construct an
   MGID for it, using the rule described above. Note that it must
   remember the scope bits from the all-node MGID, and use the same
   scope in all the MGIDs it constructs.

   The local node then calls SA to join the IB multicast group
   corresponding to the MGID. If the group doesn't already exist, one
   must be created first with the IPoIB link MTU. For the rest of the
   attributes, it is recommended that the same values from the all-node
   multicast/broadcast group be used.

   The join call enables SM to program local IB switches and routers
   with the new multicast information. Specifically it causes an IB
   switch to add the LID of the caller to its forwarding table entry
   corresponding to the MLID allocated for the group. It also causes an
   IB router to attach itself to the IB multicast tree corresponding to
   the MGID.

   When a node leaves an IP multicast group, it SHOULD notify the SA in
   order for all the related resources to be freed up. This gives SM an
   opportunity to delete an IB multicast group that is no longer in use,
   and free up the MLID allocated for it. The specific algorithm is
   implementation-dependent, and is out of the scope of this document.
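
   The join and leave steps above can be modeled with a small in-memory
   table. This sketch does not represent any real SA interface: the
   Group record, the groups dictionary, and the functions join_group and
   leave_group are all hypothetical, and deleting a group as soon as its
   last member leaves is just one possible policy.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Hashable, Set


@dataclass
class Group:
    mtu: int
    q_key: int
    tclass: int
    flow_label: int
    hop_limit: int
    members: Set[int] = field(default_factory=set)  # LIDs of joined ports


def join_group(groups: Dict[Hashable, Group], mgid: Hashable,
               broadcast: Group, link_mtu: int, lid: int) -> Group:
    """Join the group for mgid, creating it first if it does not exist."""
    group = groups.get(mgid)
    if group is None:
        # Create with the link MTU; copy the rest from the broadcast group.
        group = replace(broadcast, mtu=link_mtu, members=set())
        groups[mgid] = group
    group.members.add(lid)
    return group


def leave_group(groups: Dict[Hashable, Group], mgid: Hashable,
                lid: int) -> None:
    """Leave the group; delete it once no member remains (assumed policy)."""
    group = groups.get(mgid)
    if group is None:
        return
    group.members.discard(lid)
    if not group.members:
        del groups[mgid]
```

   In a real fabric the membership change would also prompt SM to
   update switch forwarding tables, which this model leaves out.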

   Note that for an IPoIB link that spans more than one IB subnet
   connected by IB routers, adequate multicast forwarding support at
   the IB level is required for multicast packets to reach listeners on
   remote IB subnets. The specific mechanism for this will be covered in
   [IBTA], and is beyond the scope of IPoIB.

10.0 Security Considerations

   All the operations for creating and configuring an IPoIB link
   described in this document, including assigning P_Keys to CAs,
   creating IB multicast groups in SA, and creating and attaching QPs
   to IB multicast groups, are privileged operations, and MUST be
   protected by the underlying operating system. This is to prevent
   malicious, non-privileged software from hijacking important
   resources and configurations.  E.g., a bogus IPoIB broadcast group
   may prevent a proper one from being created when the network
   administrator tries to set up a link.

   Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent
   non-privileged software from fabricating IP datagrams to send, as
   mentioned in section 6.2.

11.0 Acknowledgments

   The authors would like to thank Bruce Beukema, David Brean, Dan
   Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
   Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, and David L.
   Stevens for their suggestions and many clarifications on the IBA
   specification.

12.0 References

   [AARCH]   Hinden, R. and S. Deering, "IP Version 6 Addressing
             Architecture", RFC 2373, July 1998.

   [DHCP]    Droms, R., "Dynamic Host Configuration Protocol", RFC 2131,
             March 1997.

   [DISC]    Narten, T., Nordmark, E. and W. Simpson, "Neighbor
             Discovery for IP Version 6 (IPv6)", RFC 2461, December
             1998.

   [HOSTS]   Braden, R., "Requirements for Internet Hosts --
             Communication Layers", RFC 1122, October 1989.

   [IBTA]    InfiniBand Architecture Specification, Release 1.0.a, by
             InfiniBand Trade Association at www.infinibandta.org.

   [IGMP2]   Fenner, W., "Internet Group Management Protocol, Version
             2", RFC 2236, November 1997.

   [IPMULT]  Deering, S., "Host Extensions for IP Multicasting", RFC
             1112, August 1989.

   [IPoIB_ARCH]  draft-ietf-ipoib-architecture-01.txt

   [IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-01.txt

   [IPV6]    Deering, S. and R. Hinden, "Internet Protocol, Version 6
             (IPv6) Specification", RFC 2460, December 1998.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.


13.0 Author's Address

   H.K. Jerry Chu
   17 Network Circle, UMPK17-201
   Menlo Park, CA 94025
   USA

   Phone: +1 650 786-5146
   EMail: jerry.chu@sun.com


   Vivek Kashyap
   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Phone: 503 578 3422
   EMail: vivk@us.ibm.com



14.0 Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.