INTERNET-DRAFT                                            H.K. Jerry Chu
<draft-ietf-ipoib-link-multicast-00.txt>                Sun Microsystems
                                                           Vivek Kashyap
Expires: July, 2002                                        January, 2002

             IP link and multicast over InfiniBand networks

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   Copyright (C) The Internet Society (date).  All Rights Reserved.


   This document specifies a method for setting up IP subnets and
   multicast services over InfiniBand(TM) networks. Discussions in this
   document are applicable to both IPv4 and IPv6, unless explicitly
   specified. A separate document will cover unicast and encapsulation
   of IP datagrams over InfiniBand networks.

1. Introduction

   InfiniBand Architecture (IBA) defines four layers of network services
   corresponding to layer one through layer four of the OSI reference
   model.  For the purpose of running IP over an InfiniBand (IB)
   network, the IB network and all its link, network, and transport

Chu & Kashyap                                                   [Page 1]

draft-ietf-ipoib-link-multicast-00.txt                      January 2002

   layers collectively constitute the data link layer to the IP stack.

   An IP network is often divided into many subnets connected by IP
   routers.  Subnetting allows the containment of broadcast traffic
   within a single subnet. It also provides a certain degree of
   isolation between nodes on different subnets. The latter may be an
   important consideration for administrative purposes.

   This document will focus on all the steps required to lay out an IP
   network on top of an IB network. It will describe all the elements an
   IP over InfiniBand (IPoIB) link consists of, how to configure its
   associated link attributes, and how to set up basic broadcast and
   multicast services on an IPoIB link.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. Basic IPoIB Transport - Unreliable Datagram

   InfiniBand defines four types of transport services [IBTA]: reliable
   connection, unreliable connection, reliable datagram, and unreliable
   datagram. IBA also defines a special raw datagram service for
   encapsulation purposes. Both unreliable datagram and raw datagram
   define support for multicast. They provide the basic transport
   mechanism that best matches the IP datagram paradigm.

   IB unreliable datagram provides many additional transport features
   such as the partition key (P_Key) protection, multiple queue pairs
   (QPs), and Q_Key protection. Moreover, it requires a 32-bit invariant
   CRC checksum, which provides much stronger protection against data
   corruption than the 16-bit CRC that a raw datagram carries.

   For these reasons, IB unreliable datagram is considered to be a much
   better choice as the basic IPoIB transport than the raw datagram, and
   is chosen as the default IPoIB transport mechanism for the rest of
   the discussion in this document.

   An IB unreliable datagram contains the following headers:

   o  Local Route Header (LRH) - provides IB link-layer addressing
      information. An IB link layer address is based on a 16-bit
      identifier called Local Identifier (LID), and is used by IB
      switches to relay packets within an IB subnet.

   o  Global Route Header (GRH) - provides routing information for IB
      routers to relay packets between IB subnets inside an IB fabric.
      The GRH is required only for multicast packets and for any unicast
      packet that is destined to a node in a different IB subnet. The
      GRH carries an IB network-layer address, which is a 128-bit
      identifier called Global Identifier (GID) that closely mimics the
      IPv6 addressing architecture [AARCH].

   o  Base Transport Header (BTH) - provides various information for IB
      transport services, including the P_Key and the destination queue
      pair number (QPN).

   o  Datagram Extended Header (DETH) - provides additional IB
      information such as the Q_Key and the source queue pair number for
      datagram service.

   From the perspective of IP over IB encapsulation, all the above IB
   headers are considered as link layer encapsulation for IP datagrams.

4. IB Multicast Architecture

   IBA defines two layers of multicast services. Its link layer uses
   multicast LIDs (MLIDs), which are allocated by the Subnet Manager
   (SM) and fall in the range from 0xC000 to 0xFFFE (approximately
   16k values). MLIDs are used by IB switches to program their multicast
   forwarding tables. An IB switch implementation may support far fewer
   MLIDs in its forwarding table, though.
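
   As an illustration only, the MLID range can be expressed as a simple
   bounds check. The sketch below assumes 16-bit LIDs, so that the
   multicast range is 0xC000 to 0xFFFE. The names MLID_MIN, MLID_MAX,
   MLID_COUNT and is_multicast_lid are hypothetical and are not taken
   from the IBA specification.

```python
# Assumed bounds of the multicast LID range (LIDs are 16 bits wide).
MLID_MIN = 0xC000
MLID_MAX = 0xFFFE

# Number of LID values in the multicast range (roughly 16k).
MLID_COUNT = MLID_MAX - MLID_MIN + 1


def is_multicast_lid(lid: int) -> bool:
    """Return True if a 16-bit LID falls inside the multicast range."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("LID must fit in 16 bits")
    return MLID_MIN <= lid <= MLID_MAX
```

   A switch may implement far fewer forwarding entries than MLID_COUNT,
   so this check says nothing about how many groups a fabric can hold.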

   IB network layer uses multicast GIDs (MGIDs), which closely resemble
   IPv6 multicast addresses [AARCH] as shown in the following.

   |   8    |  4 |  4 |                  112 bits                   |
   +--------+----+----+---------------------------------------------+
   |11111111|flgs|scop|                  group ID                   |
   +--------+----+----+---------------------------------------------+

      11111111 at the start of the address identifies the address as
      being a multicast address.

      flgs is a set of 4 flags:     |0|0|0|T|

         The high-order 3 flags are reserved, and must be initialized to
         0.

         T = 0 indicates a permanently-assigned ("well-known") multicast
         address, assigned by the global internet numbering authority.

         T = 1 indicates a non-permanently-assigned ("transient")
         multicast address.

      scop is a 4-bit multicast scope value used to limit the scope of
      the multicast group.  The values are:

         0  reserved
         1  node-local scope
         2  link-local scope
         3  (unassigned)
         4  (unassigned)
         5  site-local scope
         6  (unassigned)
         7  (unassigned)
         8  organization-local scope
         9  (unassigned)
         A  (unassigned)
         B  (unassigned)
         C  (unassigned)
         D  (unassigned)
         E  global scope
         F  reserved

      group ID identifies the multicast group, either permanent or
      transient, within the given scope.
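
   The field layout above can be decoded with a few shifts and masks.
   The following is a minimal sketch, assuming the MGID is written in
   IPv6 text form; the function name parse_mgid and the dictionary keys
   are hypothetical and chosen only for this illustration.

```python
import ipaddress


def parse_mgid(mgid: str) -> dict:
    """Split a 128-bit MGID, given in IPv6 text form, into its fields."""
    value = int(ipaddress.IPv6Address(mgid))
    if value >> 120 != 0xFF:
        raise ValueError("not a multicast GID: first 8 bits are not all 1")
    flags = (value >> 116) & 0xF
    return {
        "transient": bool(flags & 0x1),         # T flag (low-order flag)
        "scope": (value >> 112) & 0xF,          # 4-bit scop field
        "group_id": value & ((1 << 112) - 1),   # low-order 112 bits
    }
```

   For instance, parse_mgid("ff12::1") reports a transient address of
   link-local scope with group ID 1.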

   MGIDs are mainly used by IB routers when forwarding multicast packets
   to remote IB subnets that are part of a multicast forwarding tree.
   Since every IB multicast packet is required to carry both LRH and
   GRH, a valid MGID and a valid MLID are both needed before a valid IB
   multicast packet can be constructed.

   An IB multicast group is uniquely identified by a valid MGID. Before
   a MGID can be used within an IB subnet, either as a destination
   address of a multicast packet, or representing a multicast group that
   an IB node can join, a "MCGroupRecord" corresponding to the MGID must
   be created through the Subnet Administrator (SA). Besides the MGID,
   the creator must supply values of the path MTU, Q_Key, TClass,
   FlowLabel, and HopLimit that are appropriate for all the potential
   clients of the multicast group to use. In return, the SA will
   allocate a MLID to be used by switches in the local IB subnet.

   Note that MLIDs are allocated and managed by the SM when new MGIDs
   are created through the creation of MCGroupRecords. The number of
   valid MLIDs that are available in a given IB subnet is limited by the
   implementation-dependent size of the multicast forwarding tables of
   the IB switches. Since the number can be small, reuse of one MLID for
   several MGIDs may be unavoidable. Implementations should nevertheless
   avoid sharing the same MLID among high-volume multicast groups in
   order to reduce software filtering overhead and attain higher
   efficiency.

   Unreliable multicast is defined by IBA as an optional functionality
   for channel adaptors (CAs) and switches. In today's IP technology,
   link multicast has become an indispensable function for better
   supporting a modern IP network. For this reason, it is required that
   an IPoIB fabric supports multicast. This includes all the CAs and
   switches that make up an IP network.

5. IB Links vs IPoIB Links

   A link segment on top of which an IP subnet can be configured is
   defined in [IPV6] as a communication facility or medium over which
   nodes can communicate at the "link" layer.  For most types of
   communication media, the boundary between different data link
   segments follows the physical topology of the network connectivity,
   and is usually obvious. For example, an Ethernet network connected by
   switches, hubs, or bridges usually forms a single link segment and
   broadcast/multicast domain.  Different Ethernet segments can be
   connected by IP routers at the network layer.

   InfiniBand defines its own link layer and subnets consisting of nodes
   connected by IB switches. However, the IPoIB link boundary need not
   follow the IB link boundary. Nodes residing on different IB subnets
   can still communicate directly with one another through IB routers at
   the InfiniBand network layer. The same applies to multicast: nodes
   on the same IB subnet can exchange multicast packets with one
   another within that subnet through the IB link multicast facility,
   and nodes on different IB subnets can still exchange multicast
   packets with one another using IB network-layer multicast.

   The ultimate requirement for two nodes in the same IB fabric to
   communicate at the IB level, besides the physical connectivity, is a
   common P_Key.

   Partitioning in IB provides an isolation mechanism among nodes in an
   IB fabric, much like VLANs in an Ethernet network.  Each port of an
   endnode contains a P_Key table of all the valid P_Keys the port is
   allowed to use.  The P_Key table is set up by the SM of the local IB
   subnet.  Each QP is programmed with a P_Key from the local P_Key
   table.  This P_Key is carried in the BTH of all the outgoing packets
   from the QP, and is used to compare against the P_Key in the BTH of
   all the incoming packets to the QP. Reception of an invalid P_Key
   causes the packet to be discarded. IB switches may optionally enforce
   partition checking too.

   Therefore the P_Key and IB partition are the most natural choice for
   defining the IPoIB link boundary. This choice also affords network
   administrators much flexibility when different links are set up in a
   large network, and is very similar to VLANs in Ethernet.

6. Setting up an IPoIB Link

   A network administrator defines an IPoIB link by setting up an IB
   partition and assigning it a unique P_Key. An IB partition may or may
   not span multiple IB subnets. But whether it does or not is mostly
   irrelevant to IPoIB.

   Each node attached to the IB partition MUST have one of its CAs
   assigned the P_Key to use.

   Once an IB partition is established for IPoIB use, the link MTU and
   Q_Key are two other important attributes that must be chosen before
   the IPoIB link can be configured.

6.1 Maximum Transmission Unit

   IB defines five permissible maximum payload sizes. They are 256, 512,
   1024, 2048 and 4096 bytes.  [IPV6] requires a link MTU of 1280 bytes
   or greater. This leaves only 2048 and 4096 bytes as two acceptable
   choices for IPv6. Channel adaptors supporting a maximum payload size
   less than the minimal MTU requirement can still expose an acceptable
   link MTU to IP through an adaptation layer that fragments larger
   messages into smaller IB packets, and reassembles them on the
   receiving end. But this must be done in a way that is completely
   transparent to the IP stack.

   It is up to the network administrator to select a link MTU to use
   when configuring an IPoIB link. The link MTU SHALL NOT be greater
   than the maximum payload size of any CA or switch connected to the
   IPoIB link.

   In general a larger link MTU can potentially offer a better
   throughput performance. The caveat is that once a link MTU is chosen
   for a given IPoIB link, nodes connected by CAs of a smaller maximum
   payload size won't be able to join the link unless the whole link and
   all the nodes attached to it are reconfigured to use a smaller MTU.
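
   The selection rule above can be sketched as follows. This is only an
   illustration under stated assumptions: it treats the link MTU as one
   of the five IB payload sizes, ignores any encapsulation overhead and
   any adaptation layer, and uses the hypothetical names
   IB_PAYLOAD_SIZES, IPV6_MIN_LINK_MTU and largest_usable_link_mtu.

```python
# The five maximum payload sizes permitted by IB, in bytes.
IB_PAYLOAD_SIZES = (256, 512, 1024, 2048, 4096)

# Minimum link MTU required for IPv6, in bytes.
IPV6_MIN_LINK_MTU = 1280


def largest_usable_link_mtu(max_payloads, ipv6=True):
    """Return the largest payload size every device supports, or None.

    max_payloads lists the maximum payload size of each CA and switch
    connected to the link. With ipv6=True, sizes below the IPv6 minimum
    link MTU are rejected.
    """
    if not max_payloads:
        raise ValueError("at least one device is required")
    limit = min(max_payloads)  # the weakest device bounds the link MTU
    candidates = [size for size in IB_PAYLOAD_SIZES if size <= limit]
    if ipv6:
        candidates = [s for s in candidates if s >= IPV6_MIN_LINK_MTU]
    return max(candidates) if candidates else None
```

   With devices supporting 4096, 2048 and 4096 bytes, the sketch yields
   2048. Adding a 1024-byte device leaves no acceptable choice for IPv6
   and the function returns None.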

   Note that the above discussion assumes that IP datagrams are fully
   encapsulated in the payload of IB unreliable datagrams. The actual
   MTU size, i.e., the payload size available for IP datagrams to use,
   may be slightly smaller. This will depend on the actual IPoIB
   encapsulation scheme, which will be covered in a separate document.

   Note also that in case an IPoIB link spans more than one IB subnet,
   the IPoIB link MTU MUST NOT be set to a value greater than the path
   MTU of any path connecting two nodes in the same IB partition. It is
   up to the network administrator to determine the appropriate path MTU
   value that will work for any node in the same IPoIB link.

6.2 IPoIB Link Q_Key

   A Q_Key is programmed by the source QP in every IB datagram, and is
   verified by the destination QP against the Q_Key it has been
   assigned. A Q_Key violation will cause the offending datagram to be
   dropped, and a Q_Key violation trap to be raised.

   A Q_Key must be selected to be used by all the QPs attached to an
   IPoIB link. It is recommended that a controlled Q_Key be used with
   the high order bit set. This is to prevent non-privileged software
   from fabricating and sending out bogus IP datagrams. All QPs
   configured for use on a given IPoIB link SHALL be assigned the same
   per-link Q_Key.
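
   The two checks described in this section can be sketched as below.
   The sketch assumes a 32-bit Q_Key, so that the high order bit is
   0x80000000; the names CONTROLLED_QKEY_BIT, is_controlled_qkey and
   accept_datagram are hypothetical.

```python
# Assumption: a Q_Key is 32 bits wide, so this is its high order bit.
CONTROLLED_QKEY_BIT = 0x80000000


def is_controlled_qkey(q_key: int) -> bool:
    """Return True if the Q_Key has its high order bit set."""
    if not 0 <= q_key <= 0xFFFFFFFF:
        raise ValueError("Q_Key must fit in 32 bits")
    return bool(q_key & CONTROLLED_QKEY_BIT)


def accept_datagram(packet_q_key: int, qp_q_key: int) -> bool:
    """Model the destination QP check: accept only on an exact match.

    A False result corresponds to a Q_Key violation, in which case the
    datagram is dropped.
    """
    return packet_q_key == qp_q_key
```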

6.3 Other Link Attributes

   TClass, FlowLabel, and HopLimit are three other attributes that are
   required for an IPoIB link covering more than a single IB subnet.
   The selection of these values is implementation dependent, but it
   must take into account the topology of the IB subnets comprising the
   IPoIB link to ensure successful communication between any two nodes
   in the same IPoIB link.

7. The IPoIB All-Node Multicast and Broadcast Group

   Once an IB partition is created with link attributes identified for
   an IPoIB link, the network administrator must create a special IB
   multicast group for every node on the IPoIB link to join. This is
   achieved through the creation of "MCGroupRecord" in each IB subnet
   that the IB partition encompasses, as described in section 4 above.

   The MGID will have the P_Key of the IB partition that defines the
   IPoIB link embedded in it. A special signature is also embedded to
   identify the MGID for IPoIB use only. For IPv4 over IB, the signature
   will be "0x401B". For IPv6 over IB, the signature will be "0x601B".

   For an IPv4 subnet, the MGID for this special IB multicast group
   SHALL have the following format:

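   |   8    |  4 |  4 |     16 bits     | 16 bits | 48 bits  | 32 bits |
   +--------+----+----+-----------------+---------+----------+---------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|00.......0|<all 1's>|
   +--------+----+----+-----------------+---------+----------+---------+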

   For an IPv6 subnet, the format of the MGID SHALL look like this:

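   |   8    |  4 |  4 |     16 bits     | 16 bits |       80 bits      |
   +--------+----+----+-----------------+---------+--------------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|000.............0001|
   +--------+----+----+-----------------+---------+--------------------+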

   As for the scop bits, if the IPoIB link is fully contained within a
   single IB subnet, the scop bits SHALL be set to 2 (link-local).
   Otherwise the scope will be set higher.
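
   The two formats above can be assembled by shifting each field into
   place. The following sketch is illustrative only; the names
   IPOIB_SIGNATURE and all_node_mgid are hypothetical, and the default
   scope of 2 assumes a link contained within a single IB subnet.

```python
import ipaddress

# IPoIB signatures for IPv4 and IPv6, keyed by IP version.
IPOIB_SIGNATURE = {4: 0x401B, 6: 0x601B}


def all_node_mgid(ip_version: int, p_key: int, scope: int = 2):
    """Build the all-node multicast/broadcast MGID for an IPoIB link."""
    if ip_version not in IPOIB_SIGNATURE:
        raise ValueError("ip_version must be 4 or 6")
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    # Low 80 bits: 48 zero bits then 32 one bits for IPv4; 1 for IPv6.
    low80 = 0xFFFFFFFF if ip_version == 4 else 0x1
    value = (
        (0xFF << 120)                             # 11111111 prefix
        | (0x1 << 116)                            # flags 0001
        | (scope << 112)                          # scop
        | (IPOIB_SIGNATURE[ip_version] << 96)     # IPoIB signature
        | (p_key << 80)                           # P_Key
        | low80
    )
    return ipaddress.IPv6Address(value)
```

   The result is returned as an IPv6Address object purely for its
   compressed text form; an MGID is not an IPv6 address.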

   A MCGroupRecord will be created with all the IPoIB link attributes
   described before, including the link MTU, Q_Key, TClass, FlowLabel,
   and HopLimit. When a node is attached to an IPoIB link identified by
   a P_Key, it must look for a special, all-node multicast/broadcast
   group to join. This is done by constructing the MGID with the link
   P_Key and the IPoIB signature. The node SHOULD always look for a MGID
   of a link-local scope first before attempting one with a greater
   scope.

   Once the right MGID and MCGroupRecord are identified, the local node
   SHOULD use the link MTU recorded in the MCGroupRecord. It MUST accept
   a smaller MTU if one is advertised through the link MTU option of a
   router advertisement [DISC].

   In case the link MTU is greater than the maximum payload size that
   the local HCA can support, the node cannot join the IPoIB link and
   operate as an IP node.

   After the right MTU is determined, the local node must join the
   special all-node multicast/broadcast group by calling the SA to
   create a MCMemberRecord corresponding to the MGID. The SA will return
   all the link attributes for the local node to use. The node MUST use
   these attributes in all future multicast operations to the local
   IPoIB link.  The broadcast group for IPv4 provides a broadcast
   service for protocols such as ARP to use.
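
   The lookup order and the MTU check described above can be sketched
   as follows. The sketch makes several assumptions: only the assigned
   scope values greater than link-local are tried, the SA query is
   replaced by a plain dictionary, and the names SCOPE_SEARCH_ORDER,
   LinkMtuTooLarge and select_all_node_group are hypothetical.

```python
# Assigned scope values, smallest first: link-local, site-local,
# organization-local, global. Trying only these is an assumption.
SCOPE_SEARCH_ORDER = (0x2, 0x5, 0x8, 0xE)


class LinkMtuTooLarge(Exception):
    """The recorded link MTU exceeds what the local CA can carry."""


def select_all_node_group(recorded_mtu_by_scope, local_max_payload):
    """Pick the all-node group to join and the link MTU to use.

    recorded_mtu_by_scope maps a scope value to the link MTU stored in
    the MCGroupRecord found at that scope (a stand-in for an SA query).
    Returns (scope, link_mtu), or None if no group exists at any scope.
    """
    for scope in SCOPE_SEARCH_ORDER:
        link_mtu = recorded_mtu_by_scope.get(scope)
        if link_mtu is None:
            continue  # no MCGroupRecord at this scope; try a wider one
        if link_mtu > local_max_payload:
            # The node cannot join the IPoIB link in this case.
            raise LinkMtuTooLarge(
                f"link MTU {link_mtu} exceeds {local_max_payload}"
            )
        return scope, link_mtu
    return None
```

   A smaller MTU advertised later by a router would still have to be
   accepted; that step is outside this sketch.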

   In addition to the all-node multicast/broadcast group, an all-router
   multicast group SHOULD be created at link configuration time if an IP
   router will be attached to the link. This is to facilitate IP
   multicast operations described later. A MCGroupRecord for the all-
   router MGID must be created in every IB subnet that the IPoIB link
   encompasses. The format of the all-router MGID will be covered in
   the next section.

8. Mapping for other Multicast Groups

   The support of general IP multicast [IPMULT] over IB is similar to
   the case of the special all-node multicast/broadcast group discussed
   above. An algorithmic mapping is used so that, given an IP multicast
   address, an individual host can compute the corresponding IB
   multicast address (MGID) by itself without having to consult an
   external entity. This also removes the need for an externally
   maintained IP to IB multicast mapping table.

   The IPoIB multicast mapping is defined as follows. The same mapping
   function is used for both IPv4 and IPv6, except for the IPoIB
   signature.

   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits       |
   +--------+----+----+-----------------+---------+--------------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID      |
   +--------+----+----+-----------------+---------+--------------------+

   Since a MGID allocated for transporting IP multicast datagrams is
   considered only a transient link-layer multicast address, all IB
   MGIDs allocated for IPoIB purposes SHOULD have T = 1. The scope bits
   SHALL be the same as those of the all-node MGID for the same IPoIB
   link.

   An IP multicast address is used together with a given IPoIB link
   P_Key to form the MGID of the IB multicast group. For IPv6, the lower
   80 bits of the group ID are used directly as the lower 80 bits of the
   MGID. For IPv4, the group ID is only 28 bits long, and the rest of
   the 80 bits are filled with 0.

   The rest of the bits are the same as those of the all-node MGID.
   E.g. on an IPoIB link that is fully contained within a single IB
   subnet with a P_Key of 8, the MGIDs for the all-router multicast
   group with group ID 2 [AARCH, IGMP2] are:




   for IPv4 in a compressed format, and




   for IPv6 in a compressed format.
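
   The mapping rule of this section can be sketched as a single
   function. This is an illustration under stated assumptions, not a
   normative algorithm: the names IPOIB_SIGNATURE and
   ip_multicast_to_mgid are hypothetical, the scope defaults to 2
   (link-local), and the special cases discussed in this section are
   not handled.

```python
import ipaddress

# IPoIB signatures for IPv4 and IPv6, keyed by IP version.
IPOIB_SIGNATURE = {4: 0x401B, 6: 0x601B}


def ip_multicast_to_mgid(group: str, p_key: int, scope: int = 2):
    """Map an IP multicast address and a link P_Key to an MGID."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError("not an IP multicast address")
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    if addr.version == 4:
        group_id = int(addr) & 0x0FFFFFFF        # 28-bit IPv4 group ID
    else:
        group_id = int(addr) & ((1 << 80) - 1)   # lower 80 bits
    value = (
        (0xFF << 120)                            # 11111111 prefix
        | (0x1 << 116)                           # flags 0001 (T = 1)
        | (scope << 112)                         # scop
        | (IPOIB_SIGNATURE[addr.version] << 96)  # IPoIB signature
        | (p_key << 80)                          # P_Key
        | group_id
    )
    return ipaddress.IPv6Address(value)


print(ip_multicast_to_mgid("224.0.0.2", 0x0008))  # ff12:401b:8::2
print(ip_multicast_to_mgid("ff02::2", 0x0008))    # ff12:601b:8::2
```

   The two printed values are what this sketch computes for the
   all-router group with group ID 2 on a single-subnet link with a
   P_Key of 8.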

   A special case exists for the IPv4 limited broadcast address
   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
   broadcast MGID for IPv4 networks as described in section 7 above.
   Also the IPv6 all-node multicast address "FF0X::1" [AARCH] will be
   mapped to the special all-node MGID for IPv6 networks.

   When a node wishes to join an IP multicast group on a local link, it
   first needs to construct the corresponding MGID, using the rule
   described above. Note that it must remember the scope bits from the
   all-node MGID, and use the same scope in all later MGIDs it
   constructs.

   The local node then checks with SA to see if a MCGroupRecord
   corresponding to the MGID already exists. If not, one must be created
   first. The MCGroupRecord MUST be created with the IPoIB link MTU. For
   the rest of the attributes, it is recommended that the node use the
   same values as the all-node multicast/broadcast group corresponding
   to the link.

   Note that for an IPoIB link that spans more than one IB subnet
   connected by IB routers, adequate multicast forwarding support at the
   IB level is required for multicast packets to be forwarded properly
   to members in remote IB subnets. The specific mechanism for this will
   be covered in [IBTA], and is out of scope of this document.

   Once the IB multicast group is identified, the node must join the
   group, unless it is a member already, by calling the SA to create a
   MCMemberRecord corresponding to the MGID.  The join call enables SM
   to program local IB switches and routers with the new multicast
   information.  Specifically it causes an IB switch to add the LID of
   the caller to its forwarding table entry corresponding to the MLID
   allocated for the group.  It also causes an IB router to attach
   itself to the IB multicast tree corresponding to the MGID.

   When a node leaves an IP multicast group, it SHOULD delete the
   MCMemberRecord from the SA. This allows the SA to free up related
   resources. The SM should delete MCGroupRecords that are no longer in
   use, and free up the MLIDs allocated for them. The specific algorithm
   is implementation-dependent, and therefore is out of scope of this
   document.

   In order to send a packet destined for an IP multicast address, a
   node must first check if a MCGroupRecord for the corresponding MGID
   of the outbound link exists or not. If one already exists, the MLID
   allocated by the SM for the MCGroupRecord is used as the DLID for the
   packet. Otherwise, no member exists on the local link, and the
   packet should be forwarded to the all-router multicast group
   described before. If the all-router group does not exist either, no
   router is present on the local subnet, and the packet can then be
   silently dropped.

   Note that the local node MUST be notified when an IB multicast group
   corresponding to the MGID ever comes into existence later. This
   signifies that an interested party just showed up on the local link
   and therefore must be copied.
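
   The send-side decision described in this section can be summarized
   in a short sketch. It assumes the SA lookup is replaced by a plain
   dictionary from MGID to the MLID of its MCGroupRecord; the name
   select_dlid and its parameters are hypothetical.

```python
def select_dlid(mgid, all_router_mgid, mlid_by_mgid):
    """Return the DLID for an outbound multicast packet, or None.

    mlid_by_mgid maps each MGID that has a MCGroupRecord to the MLID
    allocated for it (a stand-in for an SA query). None means that the
    packet can be silently dropped.
    """
    if mgid in mlid_by_mgid:
        return mlid_by_mgid[mgid]             # a member exists locally
    if all_router_mgid in mlid_by_mgid:
        return mlid_by_mgid[all_router_mgid]  # hand the packet to routers
    return None                               # no member and no router
```

   The sketch does not model the later notification that a group has
   come into existence, which would add a new entry to the dictionary.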

9. Support for IP Multicast Routing

   IP multicast routing requires a router to receive a copy of every
   link multicast packet on a locally connected link [IPMULT, IP6MLD].
   For Ethernet this is usually done by turning on promiscuous multicast
   mode on a locally connected Ethernet interface.

   Unfortunately IBA does not support promiscuous multicast mode.
   Therefore the IPoIB driver should forward a copy of every outbound
   multicast datagram to the MGID corresponding to the all-router
   multicast group. This is to ensure that multicast packets are
   properly forwarded to remote IP networks.

10. Security Considerations

   All the operations for creating and configuring an IPoIB link
   described in this document, including assigning P_Keys to CAs,
   creating MCGroupRecords and MCMemberRecords in the SA, and creating
   and attaching QPs to IB multicast groups, are privileged operations,
   and MUST be protected by the underlying operating system. This is to
   prevent malicious, non-privileged software from hijacking important
   resources and configurations. For example, a bogus all-node IPoIB
   multicast group may prevent a proper one from being created when the
   network administrator tries to set up a link.

   Controlled Q_Keys SHOULD be used in IB multicast groups in order to
   prevent non-privileged software from fabricating IP datagrams to
   send, as mentioned in section 6.2.

11. Acknowledgments

   The authors would like to thank Bruce Beukema, David Brean, Dan
   Cassiday, Thomas Narten, Erik Nordmark, Greg Pfister, Renato Recio,
   David L. Stevens, and Madhu Talluri for their suggestions and many
   clarifications on the IBA specification.

12. References

   [AARCH]   Hinden, R. and S. Deering "IP Version 6 Addressing
             Architecture", RFC 2373, July 1998.

   [DISC]    Narten, T., Nordmark, E. and W. Simpson, "Neighbor
             Discovery for IP Version 6 (IPv6)", RFC 2461, December
             1998.

   [HOSTS]   Braden R., "Requirements for Internet Hosts --
             Communication Layers", RFC 1122, October 1989.

   [IBTA]    InfiniBand Architecture Specification, Release 1.0.a by
             InfiniBand Trade Association at www.infinibandta.org

   [IGMP2]   Fenner W., "Internet Group Management Protocol, Version 2",
             RFC 2236, November 1997.

   [IPMULT]  Deering S., "Host Extensions for IP Multicasting", RFC
             1112, August 1989.

   [IPV6]    Deering, S. and R. Hinden, "Internet Protocol, Version 6
             (IPv6) Specification", RFC 2460, December 1998.

   [IP6MLD]  Deering S., Fenner W., Haberman B., "Multicast Listener
             Discovery (MLD) for IPv6", RFC 2710, October 1999.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

13. Authors' Addresses

   H.K. Jerry Chu
   901 San Antonio Road, UMPK17-201
   Palo Alto, CA 94303-4900

   Phone: +1 650 786-5146
   EMail: jerry.chu@sun.com

   Vivek Kashyap
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Phone: 503 578 3422
   EMail: vivk@us.ibm.com

14. Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
