INTERNET DRAFT Vivek Kashyap
<draft-ietf-ipoib-ip-over-infiniband-04.txt> IBM
Expiration Date: October, 2003 H.K.Jerry Chu
Sun Microsystems
April, 2003
IP encapsulation and address resolution over InfiniBand networks
Status of this memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as ``work in progress''.
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html
This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This document specifies the frame format for transmission of
IP and ARP packets over InfiniBand networks. Unless explicitly
specified, the term 'IP' refers to both IPv4 and IPv6. The
term 'ARP' refers to all the ARP protocols/op-codes such as
ARP/RARP. This document also describes the method of forming
IPv6 link-local addresses, and the content of the
Kashyap, Chu [Page 1]
INTERNET-DRAFT IP over InfiniBand April, 2003
source/target link layer address option used in Neighbor
solicitation and advertisement, router advertisement, router
redirect and router solicitation on IPv6 over InfiniBand.
Table of Contents
1.0 Introduction
2.0 InfiniBand Datalink
2.1 IP Support on IPoIB Link
3.0 Frame Format
4.0 Maximum Transmission Unit
5.0 IPv6 Stateless Autoconfiguration
5.1 IPv6 Link Local Address
6.0 Address Mapping - Unicast
6.1 Link Information
6.1.1 Link Layer Address/Hardware Address
6.1.2 Auxiliary Link Information
6.2 Address Resolution in IPv4 Subnets
6.3 Address Resolution in IPv6 Subnets
6.4 Cautionary Note on QPN Caching
7.0 IANA Considerations
8.0 Security Considerations
9.0 Acknowledgements
10.0 References
11.0 Authors' Addresses
1.0 Introduction
The InfiniBand specification [IB_ARCH] can be found at
www.infinibandta.org. The document [IPoIB_ARCH] provides a
short overview of InfiniBand architecture along with
considerations for specifying IP over InfiniBand networks. The
document [IPoIB_MCAST] defines the configuration of IPoIB
links and the support of IP multicast over InfiniBand
networks.
The InfiniBand architecture (IBA) defines multiple modes of
transport over which IP may be implemented. The unreliable
datagram (UD) transport method best matches the needs of IP
and the need for universality in general, as described
in [IPoIB_ARCH].
This document specifies IPoIB over IB's unreliable
datagram (UD) mode. The transmission of IP over other modes of
IB is beyond the scope of this document.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC 2119.
2.0 InfiniBand Datalink
The document [IPoIB_MCAST] defines the IPoIB link, its setup,
and IP multicast over InfiniBand in detail. The following
discussion gives a short overview.
An InfiniBand (IB) subnet is formed by a network of IB nodes
interconnected either directly or via IB switches. IB subnets
may be connected using IB routers to form a fabric made of
multiple IB subnets. Multiple IP subnets may be overlaid over
this IB cloud. The boundary of each such IP subnet is arbitrary
and not associated with a physical demarcation. The IPoIB nodes
that are members of an IP subnet are interconnected by an
abstract 'link'. The link is defined by its members and by
common characteristics, such as the P_Key, link MTU and Q_Key,
that are defined per 'link'.
IPv4 defines a limited-broadcast address over the link. All
IPv4 hosts that are members of the IPv4 subnet are members of
this address. IPv6 defines a multicast address referred to as
the all-nodes multicast address. IPoIB defines a mapping from
these (and other IPv4/IPv6 multicast addresses) to IB multicast
GIDs [IPoIB_MCAST]. The multicast GID derived from the IPv4
limited-broadcast address and the multicast GID derived from
the IPv6 all-nodes multicast address will collectively be
referred to as the broadcast-GID in this document. The
broadcast-GID must be set up before an IPoIB subnet can be
formed.
Every IPoIB interface MUST join the InfiniBand multicast group
defined by the broadcast-GID. This operation returns the MTU
and the Q_Key associated with the IPoIB link. Thus the IPoIB
subnet (and the link) is formed by the IPoIB nodes joining the
broadcast-GID.
The P_Key is a configuration parameter that must be known
before the broadcast-GID can be formed [IPoIB_MCAST].
2.1 IP Support on IPoIB Link
The unreliable datagram (UD) mode of communication is
supported by all IB elements, be they IB routers, Host Channel
Adapters (HCAs) or Target Channel Adapters (TCAs). In addition
to being the only universal transmission method, it supports
multicasting, partitioning and a 32-bit CRC [IB_ARCH]. IB does
not require that all IB components support multicasting.
Therefore, IB subnets with no multicast support are
possible. However, the IPoIB architecture requires the
participating components to support multicast.
All IPoIB implementations MUST support IP over the unreliable
datagram (UD) transport mode of IBA.
3.0 Frame Format
All IP and ARP datagrams transported over InfiniBand are
prefixed by a 4-octet encapsulation header as illustrated
below.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | |
| Type | Reserved |
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1
The type field SHALL indicate the encapsulated protocol as per
the following table.
+----------+-------------+
|   Type   |  Protocol   |
+----------+-------------+
|  0x0800  |    IPv4     |
+----------+-------------+
|  0x0806  |    ARP      |
+----------+-------------+
|  0x8035  |    RARP     |
+----------+-------------+
|  0x86DD  |    IPv6     |
+----------+-------------+
Table 1
These values are taken from the 'ETHER TYPE' numbers assigned
by [IANA]. Other network protocols, identified by different
values of 'ETHER TYPE', may use the encapsulation format
defined herein but such use is outside of the scope of this
document.
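The encapsulation described above can be sketched in a few
lines of Python. This is a minimal illustration, not a
reference implementation. It assumes that Figure 1 divides the
4-octet header into a 16-bit type field followed by a 16-bit
reserved field, that both are in network byte order, and that
the reserved field is sent as zero and ignored on receipt. The
function and constant names are hypothetical.

```python
import struct
from typing import Tuple

# EtherType values from Table 1.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ETHERTYPE_IPV6 = 0x86DD

KNOWN_TYPES = {ETHERTYPE_IPV4, ETHERTYPE_ARP, ETHERTYPE_RARP, ETHERTYPE_IPV6}

# Assumed layout: 16-bit Type, 16-bit Reserved, network byte order.
HEADER = struct.Struct("!HH")


def encapsulate(ethertype: int, datagram: bytes) -> bytes:
    """Prefix an IP/ARP datagram with the 4-octet encapsulation header."""
    if ethertype not in KNOWN_TYPES:
        raise ValueError(f"type 0x{ethertype:04x} is not listed in Table 1")
    return HEADER.pack(ethertype, 0) + datagram


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Split a received payload into (type, datagram)."""
    if len(payload) < HEADER.size:
        raise ValueError("payload is shorter than the encapsulation header")
    ethertype, _reserved = HEADER.unpack_from(payload)  # reserved is ignored
    return ethertype, payload[HEADER.size:]
```

A receiver would typically dispatch on the returned type value
and drop frames whose type it does not handle.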
|<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
+-------+------+---------+---------+-------------+---------+-------+
|Local | |Base |Datagram | 4-octet | | |
|Routing| GRH* |Transport|Extended | header |Invariant|Variant|
|Header |Header|Header |Transport| + | CRC | CRC |
| | | |Header | IP/ARP | | |
+-------+------+---------+---------+-------------+---------+-------+
Figure 2
Figure 2 depicts the IB frame encapsulating an IP/ARP
datagram. The IB frame headers are described in detail in the
InfiniBand Architecture specification [IB_ARCH]. The InfiniBand
specification requires the use of the Global Routing Header
(GRH) [IPoIB_ARCH] when multicasting or when an InfiniBand
packet traverses from one IB subnet to another through an IB
router. Its use is optional for unicast transmission between
nodes within an IB subnet. The IPoIB implementation MUST be
able to handle packets received with or without the use of
GRH.
4.0 Maximum Transmission Unit
IB MTU:
The IB components, i.e. IB links, switches, CAs, and IB
routers, may support maximum payloads of 256, 512,
1024, 2048 or 4096 bytes. The maximum IB payload
supported by the IB components in any IB path is the
IB MTU for the path.
IPoIB-Link MTU:
An IPoIB link is formed by the IPoIB nodes joining the
broadcast-GID [IPoIB_MCAST]. The IPoIB-link MTU is the
MTU value associated with the broadcast-GID. The
IPoIB-link MTU can be set to any value up to the
smallest IB MTU supported by the IB components
comprising the IPoIB link.
In order to reduce problems with fragmentation and path-MTU
discovery, this document requires that all IPoIB
implementations support an MTU of 2044 octets, i.e. a 2048
octet IPoIB-link MTU minus the 4 octet encapsulation overhead.
Larger and smaller MTUs MAY be supported, but the default
configuration must support an MTU of 2044 octets.
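The arithmetic behind these values can be sketched as follows.
The sketch assumes that the IB MTU of a path is the smallest
maximum payload among the components on that path. The
function names are hypothetical.

```python
IB_MTU_VALUES = (256, 512, 1024, 2048, 4096)  # payload sizes listed above
ENCAP_OVERHEAD = 4                            # 4-octet encapsulation header


def ib_path_mtu(component_mtus):
    """IB MTU of a path: the smallest payload any component on it supports."""
    mtus = list(component_mtus)
    if not mtus:
        raise ValueError("a path needs at least one component")
    for mtu in mtus:
        if mtu not in IB_MTU_VALUES:
            raise ValueError(f"{mtu} is not a valid IB MTU")
    return min(mtus)


def ip_mtu(ipoib_link_mtu):
    """Largest IP datagram that fits in one frame on the IPoIB link."""
    return ipoib_link_mtu - ENCAP_OVERHEAD
```

For a path through components supporting 4096, 2048 and 4096
bytes, ib_path_mtu() yields 2048, and ip_mtu(2048) yields the
2044 octets required of every implementation.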
In IPv6 subnets the MTU may be reduced by a Router
Advertisement [RFC2461] containing an MTU option which
specifies a smaller MTU, or by manual configuration of each
node. If a Router Advertisement received on an IPoIB interface
has an MTU option specifying an MTU larger than the link MTU
or larger than a manually configured value, that MTU option
may be logged to system management but must be otherwise
ignored.
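One possible way to apply this rule is sketched below. It
assumes the node tracks the IP MTU of the link and an optional
manually configured MTU. The function name, its parameters and
the logger name are hypothetical.

```python
import logging

log = logging.getLogger("ipoib")  # hypothetical logger name


def effective_ipv6_mtu(link_mtu, ra_mtu=None, manual_mtu=None):
    """Return the IPv6 MTU to use on an IPoIB interface.

    link_mtu   -- IP MTU of the IPoIB link (e.g. 2044)
    ra_mtu     -- MTU option from a Router Advertisement, if any
    manual_mtu -- manually configured (or DHCP supplied) MTU, if any
    """
    limit = link_mtu if manual_mtu is None else min(link_mtu, manual_mtu)
    if ra_mtu is None:
        return limit
    if ra_mtu > limit:
        # The option may be logged but must otherwise be ignored.
        log.warning("ignoring RA MTU option %d (limit %d)", ra_mtu, limit)
        return limit
    return ra_mtu
```

A Router Advertisement can therefore only lower the MTU in use.
It never raises the MTU above the link MTU or above a manually
configured value.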
Similarly, the IPv4 MTU may also be reduced by manual
configuration of each node.
For purposes of this document, information received from DHCP
is considered "manually configured".
5.0 IPv6 Stateless Autoconfiguration
IB architecture associates an EUI-64 identifier termed the
GUID (Globally Unique Identifier) [IPoIB_ARCH, IB_ARCH] with
each port. The LID (16 bits) is unique within an IB subnet
only.
The interface identifier may be chosen from:
1) The EUI-64 compliant Globally Unique
Identifier (GUID) assigned by the manufacturer.
2) If the IPoIB subnet is fully contained within an IB
subnet, any of the unique 16-bit LIDs of the port
associated with the IPoIB interface.
The LID values of a port may change after a
reboot/power-cycle of the IB node. Therefore, if a
persistent value is desired, it would be prudent
not to use the LID to form the interface identifier.
On the other hand, the LID provides an identifier
that can be used to create a more anonymous IPv6
address since the LID is not globally unique and is
subject to change over time.
It is RECOMMENDED that the link-local address be constructed
from the port's EUI-64 identifier as per the rules specified
in [RFC2373].
5.1 IPv6 Link Local Address
The IPv6 link local address for an IPoIB interface is formed
as described in [RFC2373] using the Interface Identifier
described in the previous section.
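As an illustration, the recommended construction from the
port's EUI-64 GUID can be sketched as below. The sketch follows
the modified EUI-64 rule of [RFC2373], which inverts the
universal/local bit of the identifier, and prepends the
link-local prefix fe80::/64. The function names and the GUID
used in the usage note are hypothetical.

```python
import ipaddress

LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")  # fe80::/64


def interface_id_from_guid(guid: bytes) -> bytes:
    """Form a modified EUI-64 interface identifier from a port GUID."""
    if len(guid) != 8:
        raise ValueError("a GUID is 8 octets long")
    # Invert the universal/local bit (0x02 in the first octet).
    return bytes([guid[0] ^ 0x02]) + guid[1:]


def link_local_address(guid: bytes) -> ipaddress.IPv6Address:
    """Combine the link-local prefix with the interface identifier."""
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX + interface_id_from_guid(guid))
```

With the made-up GUID 00:02:c9:03:00:00:12:34, the interface
identifier becomes 02:02:c9:03:00:00:12:34 and the resulting
link-local address is fe80::202:c903:0:1234.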
6.0 Address Mapping - Unicast
Address resolution in IPv4 subnets is accomplished through
the Address Resolution Protocol (ARP) [RFC826]. It is
accomplished in IPv6 subnets using the Neighbor Discovery
protocol [RFC2461].
6.1 Link Information
An InfiniBand packet over the UD mode includes multiple
headers such as the LRH (local route header), GRH (global route
header), BTH (base transport header) and DETH (datagram extended
transport header), as depicted in Figure 2 and specified in the
InfiniBand architecture [IB_ARCH]. All these headers comprise
the link-layer in an IPoIB link.
The parameters needed in these IBA headers constitute the
link-layer information that needs to be determined before an
IP packet may be transmitted across the IPoIB link.
The parameters that need to be determined are:
a) LID (local identifier)
The LID is always needed. A packet always includes the
LRH that is targeted at the remote node's LID, or an
IB router's LID to get to the remote node in another
IB subnet.
b) GID (global identifier)
The GID is not needed when exchanging information
within an IB subnet though it may be included in any
packet. It is an absolute necessity when transmitting
across multiple IB subnets since the IB routers use the GID
to correctly forward the packets. The source and
destination GIDs are fields included in the GRH.
The GID, if formed using the GUID, can be used to
unambiguously identify an endpoint.
c) QPN (queue pair number)
Every unicast UD communication is directed to a
particular queue pair (QP) at the peer.
d) Q_Key
A Q_Key is associated with each unreliable datagram
QPN. The received packets must contain a Q_Key that
matches the QP's Q_Key to be accepted.
e) P_Key
Successful communication between two IB nodes using
UD mode can occur only if the two nodes have
compatible P_Keys. This is referred to as being in the
same partition [IB_ARCH]. P_Keys are checked at the
receiving channel adapter and may optionally be
checked at intermediate switches/IB routers. If the
P_Key in the packet does not match the expected P_Key,
the packet is dropped.
f) SL (service level)
Every IBA packet contains an SL value. A path in IBA
is defined by the three-tuple (source LID, destination
LID, SL). The SL in turn is mapped to a virtual
lane (VL) at every CA and switch that sends or forwards
the packet [IPoIB_ARCH]. Multiple SLs may be used
between two endpoints to provide load-balancing. SLs
may also be used to provide a QoS infrastructure or to
avoid deadlocks in the IBA fabric.
Another auxiliary piece of information, not included in
the IBA headers, is:
g) Path rate
The InfiniBand architecture defines multiple link
speeds. A higher speed transmitter can swamp the
switches and the CAs. To avoid such congestion every
source transmitting at greater than 1x speeds is
required to determine the 'path rate' before the data
may be transmitted [IB_ARCH].
6.1.1 Link Layer Address/Hardware Address
Though the list of information required for the successful
transmission of an IPoIB packet is large, not all the
information need be determined during the IP address
resolution process.
The IPoIB link-layer address used in the source/target
link-layer address option in IPv6 and the 'hardware address'
in IPv4/ARP have the same format.
The format is as described below:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved | Queue Pair Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ GID +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3
a) Reserved Flags
These 8 bits are reserved for future use. These bits
MUST be set to zero on send and ignored on receive
unless specified differently in a future document.
b) Queue Pair Number (QPN)
Every unicast communication in IB architecture is
directed to a specific queue pair (QP) [IB_ARCH]. This
QP number is included in the link description. All IP
communication to the relevant IPoIB interface MUST be
directed to this QPN. In the case of IPv4 subnets the
Address Resolution Protocol (ARP) reply packets are
also directed to the same QPN.
The choice of the QPN value for IP/ARP communication
is up to the receiving implementation.
c) Global Identifier (GID)
This is one of the Global Identifiers (GIDs) [IB_ARCH]
of the port associated with the IPoIB interface. IB
associates multiple GIDs with a port. It is
RECOMMENDED that the 'GID at index 0' be included in
the link-layer/hardware address [IB_ARCH]. The GID at
index 0 is formed using the IB port's manufacturer
assigned EUI-64 identifier.
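The 20-octet address of Figure 3 can be packed and parsed as
sketched below. The sketch assumes network byte order, with the
8 reserved bits occupying the most significant octet of the
first 32-bit word and the QPN the remaining 24 bits. The
function names are hypothetical.

```python
import struct

LLADDR_LEN = 20  # 4 octets (reserved + QPN) followed by a 16-octet GID


def pack_lladdr(qpn: int, gid: bytes) -> bytes:
    """Build the link-layer/hardware address of Figure 3."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("the QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("a GID is 16 octets long")
    # The reserved octet is zero on send, so the first word is just the QPN.
    return struct.pack("!I", qpn) + gid


def unpack_lladdr(addr: bytes):
    """Return (qpn, gid); the reserved bits are ignored on receive."""
    if len(addr) != LLADDR_LEN:
        raise ValueError("an IPoIB link-layer address is 20 octets long")
    (word,) = struct.unpack_from("!I", addr)
    return word & 0xFFFFFF, addr[4:]
```

Masking rather than rejecting non-zero reserved bits keeps a
receiver compatible with any future use of those bits.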
6.1.2 Auxiliary Link Information
The rest of the parameters are determined as follows:
a) Local Identifier(LID)
The method of determining the peer's LID is not
defined in this document. It is up to the
implementation to use any of the IBA approved methods
to determine the destination LID. One such method is
to use the GID determined during address resolution
to retrieve the associated LID from the IB
routing infrastructure or the Subnet
Administrator (SA) [IB_ARCH].
It is the responsibility of the administrator to
ensure that the IB subnet(s) have unicast connectivity
between the IPoIB nodes. The GID exchanged between two
endpoints in a multicast message (ARP/ND) does not
guarantee the existence of a unicast path between the
two. This has to be ensured by the fabric
administrator.
There may be multiple LIDs, and hence paths, between
the endpoints. The criteria for selection of the LIDs
are beyond the scope of this document.
b) Q_Key
The Q_Key received on joining the broadcast-GID MUST
be used for all IPoIB communication over the
particular IPoIB link.
c) P_Key
The network administrator is required to set up an
IPoIB link by setting up an IB partition and assigning
it a unique P_Key [IPoIB_MCAST].
Thus the P_Key to be used in the IP subnet is not
discovered but is a configuration parameter.
d) Service Level(SL)
The method of determining the SL is not defined in
this document. The SL is determined by any of the IBA
approved methods.
e) Path rate
The implementation must leverage IB methods to
determine the path rate as required.
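To summarise where each parameter comes from, an
implementation might keep per-link and per-neighbor state
along the following lines. This is only a sketch: the class
and field names are hypothetical, and the representation of
the path rate is left as a plain integer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkParams:
    """Parameters shared by every node on one IPoIB link."""
    p_key: int  # configured by the administrator, not discovered
    q_key: int  # received on joining the broadcast-GID


@dataclass
class Neighbor:
    """Parameters needed to send unicast packets to one peer."""
    qpn: int    # from the link-layer address (ARP or Neighbor Discovery)
    gid: bytes  # from the link-layer address (ARP or Neighbor Discovery)
    lid: Optional[int] = None        # determined by an IBA approved method
    sl: Optional[int] = None         # determined by an IBA approved method
    path_rate: Optional[int] = None  # determined by IB methods, as required

    def unresolved(self) -> List[str]:
        """Names of the auxiliary parameters that are still unknown."""
        return [n for n in ("lid", "sl", "path_rate") if getattr(self, n) is None]
```

Address resolution alone fills in only the QPN and GID. The
remaining fields must be resolved through IB mechanisms before
the first packet is sent.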
6.2 Address Resolution in IPv4 Subnets
The ARP packet header is as defined in [RFC826]. The hardware
type is set to 32 (decimal) as specified by the Internet
Assigned Numbers Authority (IANA). The rest of the fields are
used as per [RFC826].
16 bits: hardware type
16 bits: protocol
8 bits: length of hardware address
8 bits: length of protocol address
16 bits: ARP operation
The remaining fields in the packet hold the sender/target
hardware and protocol addresses.
[ sender hardware address ]
[ sender protocol address ]
[ target hardware address ]
[ target protocol address ]
The hardware address included in the ARP packet will be as
specified in section 6.1.1 and depicted in Figure 3.
The length of the hardware address used in the ARP packet
header is therefore 20 octets.
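An ARP packet for IPv4 over IPoIB can therefore be assembled
as sketched below. The protocol type 0x0800, the protocol
address length of 4 and the operation codes are the usual
[RFC826] values for IPv4. The function and constant names are
hypothetical.

```python
import struct

ARP_HRD_INFINIBAND = 32  # hardware type assigned by IANA
ARP_PRO_IPV4 = 0x0800    # protocol type for IPv4
ARP_HLEN = 20            # length of the IPoIB hardware address
ARP_PLEN = 4             # length of an IPv4 address
ARP_OP_REQUEST = 1
ARP_OP_REPLY = 2


def build_arp(op: int, sha: bytes, spa: bytes, tha: bytes, tpa: bytes) -> bytes:
    """Build an ARP packet carrying 20-octet IPoIB hardware addresses."""
    if len(sha) != ARP_HLEN or len(tha) != ARP_HLEN:
        raise ValueError("hardware addresses must be 20 octets long")
    if len(spa) != ARP_PLEN or len(tpa) != ARP_PLEN:
        raise ValueError("protocol addresses must be 4 octets long")
    header = struct.pack("!HHBBH", ARP_HRD_INFINIBAND, ARP_PRO_IPV4,
                         ARP_HLEN, ARP_PLEN, op)
    return header + sha + spa + tha + tpa
```

The resulting packet is 56 octets long: 8 octets of fixed
header followed by two (hardware address, protocol address)
pairs of 24 octets each. In a request the target hardware
address is not yet known and is typically sent as zeros.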
6.3 Address Resolution in IPv6 Subnets
The Source/Target Link-layer address option is used in Router
Solicitation, Router Advertisement, Redirect, Neighbor
Solicitation and Neighbor Advertisement messages when such
messages are transmitted on InfiniBand networks.
The source/target address option is specified as follows:
Type:
Source Link-layer address 1
Target Link-layer address 2
Length: 3
Link-layer address:
The link-layer address is as specified in section 6.1.1
and depicted in Figure 3.
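A length of 3 denotes 24 octets, since [RFC2461] counts option
lengths in units of 8 octets. The type and length fields take
2 octets and the link-layer address 20, which leaves 2 octets
of padding. The sketch below assumes that the padding is two
zero octets placed between the length field and the address.
That placement is an assumption of this example and is not
stated in the text above. The function and constant names are
hypothetical.

```python
ND_OPT_SOURCE_LLADDR = 1
ND_OPT_TARGET_LLADDR = 2
ND_OPT_LENGTH = 3  # in units of 8 octets, i.e. 24 octets in total


def build_lladdr_option(opt_type: int, lladdr: bytes) -> bytes:
    """Build a source/target link-layer address option for IPoIB."""
    if opt_type not in (ND_OPT_SOURCE_LLADDR, ND_OPT_TARGET_LLADDR):
        raise ValueError("type must be 1 (source) or 2 (target)")
    if len(lladdr) != 20:
        raise ValueError("an IPoIB link-layer address is 20 octets long")
    # Assumption: two zero octets of padding precede the address.
    option = bytes([opt_type, ND_OPT_LENGTH]) + b"\x00\x00" + lladdr
    assert len(option) == 8 * ND_OPT_LENGTH
    return option
```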
6.4 Cautionary Note on QPN Caching
The link-layer address for IPoIB includes the QPN, which might
not be constant across reboots or even across network interface
resets. Cached QPN entries, such as in static ARP entries or
in RARP servers, will only work if the implementations using
these options ensure that the QPN associated with an interface
is invariant across reboots/network resets.
7.0 IANA Considerations
To support ARP over InfiniBand, a value for the Address
Resolution Parameter 'Number Hardware Type (hrd)' is required.
IANA has assigned the number '32' to indicate
InfiniBand [IANA_ARP].
8.0 Security Considerations
This document specifies IP transmission over a multicast
network. Any network of this kind is vulnerable to a sender
claiming another's identity in order to forge traffic or
eavesdrop. It is the responsibility of the higher layers or
applications to implement suitable counter-measures if this is
a problem.
Successful transmission of IP packets depends on the correct
setup of the IPoIB link [IPoIB_MCAST], creation of the
broadcast-GID, creation of the QP and its attachment to the
broadcast-GID, and the correct determination of various link
parameters such as the LID, service level and path rate.
These operations, many of which involve interactions with the
SM/SA, MUST be protected by the underlying operating system.
This is to prevent malicious, non-privileged software from
hijacking important resources and configurations.
Controlled Q_Keys SHOULD be used in all transmissions. This is
to prevent non-privileged software from fabricating IP
datagrams.
9.0 Acknowledgements
The authors would like to thank Bruce Beukema, David Brean,
Dan Cassiday, Yaron Haviv, Thomas Narten, Erik Nordmark, Greg
Pfister, Jim Pinkerton, Renato Recio, Kevin Reilly, Madhu
Talluri and Satya Sharma for their suggestions and many
clarifications on the IBA specification.
10.0 References
[IB_ARCH] InfiniBand Architecture Specification, Volume 1.0a
www.infinibandta.org
[IPoIB_ARCH] draft-ietf-ipoib-architecture-02.txt
[IPoIB_MCAST] draft-ietf-ipoib-link-multicast-03.txt
[RFC2373] IP Version 6 Addressing Architecture
[RFC2375] IPv6 Multicast Address Assignments
[RFC826] An Ethernet Address Resolution Protocol
[RFC1700] Assigned Numbers.
[RFC2434] Guidelines for Writing an IANA Considerations Section in RFCs
[RFC2461] Neighbor Discovery for IP version 6 (IPv6)
[RFC3041] Extensions to IPv6 Address Autoconfiguration
[IANA] Internet assigned numbers authority, www.iana.org
[IANA_ARP] www.iana.org/assignments/arp-parameters
11.0 Authors' Addresses
Vivek Kashyap
15450, SW Koll Parkway
Beaverton, OR 97006
USA
Phone: +1 503 578 3422
Email: vivk@us.ibm.com
H.K. Jerry Chu
17 Network Circle, UMPK17-201
Menlo Park, CA 94025
USA
Phone: +1 650 786-5146
Email: jerry.chu@sun.com
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and
furnished to others, and derivative works that comment on or
otherwise explain it or assist in its implementation may be
prepared, copied, published and distributed, in whole or in
part, without restriction of any kind, provided that the above
copyright notice and this paragraph are included on all such
copies and derivative works. However, this document itself may
not be modified in any way, such as by removing the copyright
notice or references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will
not be revoked by the Internet Society or its successors or
assigns.
This document and the information contained herein is provided
on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.