Working Group: ARMD                                     Himanshu Shah
Intended Status: Proposed Standard                         Ciena Corp
Internet Draft
Expiration Date: May, 2011
October 18, 2010
ARP Reduction in Data Center
draft-shah-armd-arp-reduction-00.txt
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on May 18, 2011
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
Shah, et al. Expires May 2011 1
Internet Draft draft-shah-arp-reduction-00.txt
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.
Abstract
With the advent of virtual machine (VM) technologies, a host is able
to support multiple VMs in a single physical machine. Data center
applications leverage this capability to instantiate tens to
hundreds of VMs in a server. Each VM operates as an independent
IP host with its own MAC address, associated with a virtual Network
Interface Card (vNIC) that maps to a single physical Ethernet
interface. These physical servers are typically stacked in a rack
with their Ethernet interfaces connected to a top-of-the-rack (ToR)
switch. The ToR switches are interconnected through an
End-of-the-Row (EoR) switch, which in turn is connected to core
switches.
As discussed in [ARP-Problem], the VM hosts use ARP broadcasts to
find other VM hosts and use periodic (broadcast) gratuitous ARPs to
refresh their IP to MAC address bindings in other VM hosts. In a
large data center with potentially thousands of VM hosts in a
layer-2 based topology, such broadcasts can consume significant
network bandwidth and host CPU resources.
This document describes a solution whereby a ToR switch assumes the
handling of the ARP broadcasts based on the ARP table that it
maintains by gleaning information from the passing ARP PDUs. When
the information is not new, gratuitous ARP PDUs are dropped, and ARP
broadcast requests from hosts are answered by the switch from the
learned ARP information instead of being forwarded.
Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC 2119].
Table of Contents
Copyright Notice .................................................... 1
Abstract.............................................................. 2
1.0 Contributing Authors ............................................. 3
2.0 Overview.......................................................... 3
2.1 Terminology ..................................................... 4
3.0 Topology.......................................................... 5
4.0 Configuration..................................................... 5
5.0 Building the ARP tables........................................... 6
5.1 ARP Requests .................................................... 6
5.2 ARP Response .................................................... 7
5.3 Gratuitous ARP .................................................. 7
5.4 Host movement ................................................... 7
5.5 IPv6 Hosts ...................................................... 8
5.6 External ARP servers ............................................ 8
6.0 Conclusion ...................................................... 8
7.0 Security Considerations........................................... 9
8.0 References........................................................ 9
8.1 Normative References ............................................ 9
8.2 Informative References .......................................... 9
9.0 Author's Address.................................................. 9
1.0 Contributing Authors
This document is the combined effort of the following individuals
and many others who have carefully reviewed this document and
provided technical clarifications:
Linda Dunbar Huawei
Sue Hares Huawei
T Sridhar Force10 Networks
2.0 Overview
The following factors exacerbate the effect of ARP broadcasts in
data centers.
. Ever-increasing dependence on applications that run in the data
center
. Large number of physical server hosts in the server farms
. Use of a large number of VMs in each physical server host
. Each VM has its own IP and MAC address, and all VMs reside in
the same subnet so that they can be moved between physical
hosts based on fair resource distribution policies. This
requires the data center network to be layer 2 based
. Each VM resorts to frequent ARP broadcasts, either as requests
to find target VMs or as gratuitous ARPs to refresh its IP to
MAC address binding in the peers with which it communicates
most often. This also stems from the fact that VMs hold a
relatively small ARP table and use aggressive aging in order to
keep only the most active peers in the table.
Broadcasts in layer 2 networks have far-reaching impact: they waste
network bandwidth as well as the CPU resources of all the VMs that
must process superfluous ARP broadcasts.
The ills of ARP broadcasts in the data center network can be
minimized in a relatively simple fashion. The solution requires the
first hop Ethernet switches, typically ToRs, to maintain an ARP
table learned from the passing ARP PDUs and to selectively propagate
requests and/or proxy on behalf of the remote peer. These ARP
processing principles are well known and are described in L2VPN
Working Group documents such as [ARP-Mediation] and [IPLS].
The following sections describe the inner workings of ARP snooping,
learning and maintaining ARP tables, and using the learned
information to limit broadcast propagation and to proxy (the
response) on behalf of the remote peers.
2.1 Terminology
ToR Top-of-Rack. An Ethernet switch present on top
of a rack which provides network connectivity to
the servers present on the rack.
Downlink In this document, downlink refers to an Ethernet
connection on the ToR switch that faces the local
hosts (servers in the rack).
Uplink In this document, uplink refers to a network-facing
Ethernet connection on the ToR switch.
Typically, the uplinks from ToRs connect to
end-of-row switches.
EoR End-of-Row Ethernet switch. This is more of an
aggregation switch. Uplinks from ToRs connect
to the EoR, and uplinks from the EoR connect to
the core switch.
Host/Server The terms host and server are used in this
document to refer to an IP end station. An IP
host can be a single physical entity or a
logical entity (such as a Virtual Machine)
within a physical host. The term server refers
to its application role in the data center. Both
terms are used interchangeably or together.
Local hosts This term is used in the context of a ToR
switch to denote the (VM) hosts connected to the
ToR on a downlink, i.e. directly connected
hosts.
Remote hosts This term is used in the context of a ToR to
denote the hosts that are accessible through
uplink of the ToR.
VM Virtual Machine. This is a logical instance of
a host that operates independently in a
physical host and has its own IP and MAC
address. The VM architecture allows efficient
use of physical host resources in data center
application.
3.0 Topology
The example topology of a data center network referred to in this
document is a hierarchical arrangement of low to high density
Ethernet switches that provides a flat (common broadcast domain)
layer 2-based network for the servers in the data center. Each
server host thus connected is said to be on the same subnet and
communicates directly using IP without having to go through a
router or default gateway. In other words, an IPv4 host (VM or
otherwise) on this network can find another IPv4 host's MAC address
using the ARP methodologies.
4.0 Configuration
It is assumed that the ARP reduction methodologies defined in this
document will be limited to ToR switches. We believe that the
maximum benefit of restraining ARP broadcasts in the network can be
achieved by the first hop switches (those directly connected to
hosts) without placing additional burden on second or third tier
switches. The ToR switches need to be configured with this feature
enabled. Each Ethernet interface needs to be identified as either a
downlink or an uplink within the context of this feature. The ARP
reduction feature treats ARP frames received from a downlink
differently from those received from an uplink, as described in the
following sections.
The operator can configure various ARP reduction parameters, such
as:
. ARP aging timer
. Size of the ARP table
. Static entries of IP to MAC address
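As a rough illustration only, the configuration items above could be
modelled as shown below. The names (PortRole, ArpReductionConfig and
its fields), the port names and the default values are assumptions
made for this sketch; this document does not define them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class PortRole(Enum):
    """Role of an Ethernet interface within the ARP reduction feature."""
    DOWNLINK = "downlink"  # faces local hosts (servers in the rack)
    UPLINK = "uplink"      # faces the network (EoR / core)


@dataclass
class ArpReductionConfig:
    """Hypothetical container for the operator-tunable parameters."""
    enabled: bool = False
    aging_timer_s: int = 600   # assumed default, not taken from the draft
    table_size: int = 4096     # assumed default, not taken from the draft
    port_roles: Dict[str, PortRole] = field(default_factory=dict)
    static_entries: Dict[str, str] = field(default_factory=dict)  # IP -> MAC

    def role_of(self, port: str) -> PortRole:
        # Every interface has to be classified before its ARP frames can
        # be treated; an unclassified port is reported as a config error.
        try:
            return self.port_roles[port]
        except KeyError:
            raise ValueError(f"port {port!r} has no downlink/uplink role") from None


cfg = ArpReductionConfig(
    enabled=True,
    port_roles={"eth1": PortRole.DOWNLINK, "eth48": PortRole.UPLINK},
    static_entries={"10.0.0.1": "00:11:22:33:44:55"},
)
```

Rejecting unclassified ports is a design choice of the sketch; an
implementation could equally default such ports to one of the roles.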
There are situations where low-cost ToR switches do not have the
capacity needed to perform the ARP reduction functions. Under those
circumstances, the external ARP server approach (described in
Section 5.6) can be considered.
5.0 Building the ARP tables
When enabled, the ToR switch starts monitoring data frames for ARP
PDUs. It is recommended that ARP PDUs be processed in the following
manner.
. All ARP request PDUs should be redirected to the control plane
CPU
. All gratuitous ARP PDUs should be redirected to the control
plane CPU
. All ARP response PDUs should be bi-cast; one copy is sent to
the control plane CPU and the other copy is forwarded normally.
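The three rules above can be sketched as a small classifier. This is
an illustrative sketch, not a definitive implementation: the function
and action names are hypothetical, a gratuitous ARP is assumed to be
recognizable by identical sender and target IP addresses, and PDUs
with any other opcode are assumed to be forwarded untouched. The PDU
layout is the Ethernet/IPv4 format of RFC 826.

```python
import struct
from typing import NamedTuple, Set

ARP_FMT = "!HHBBH6s4s6s4s"  # RFC 826 layout for Ethernet/IPv4 (28 bytes)
OP_REQUEST, OP_REPLY = 1, 2


class ArpPdu(NamedTuple):
    oper: int
    sha: bytes  # sender hardware (MAC) address
    spa: bytes  # sender protocol (IPv4) address
    tha: bytes  # target hardware (MAC) address
    tpa: bytes  # target protocol (IPv4) address


def parse_arp(payload: bytes) -> ArpPdu:
    """Decode an ARP payload; raises struct.error if it is too short."""
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FMT, payload[:28])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP PDU")
    return ArpPdu(oper, sha, spa, tha, tpa)


def actions_for(pdu: ArpPdu) -> Set[str]:
    """Return the hypothetical actions 'cpu' and/or 'forward' for a PDU."""
    if pdu.spa == pdu.tpa:        # gratuitous ARP (assumed test)
        return {"cpu"}
    if pdu.oper == OP_REQUEST:    # redirected to the control plane
        return {"cpu"}
    if pdu.oper == OP_REPLY:      # bi-cast: CPU copy plus normal forwarding
        return {"cpu", "forward"}
    return {"forward"}            # assumption: other opcodes pass through


request = struct.pack(ARP_FMT, 1, 0x0800, 6, 4, OP_REQUEST,
                      bytes.fromhex("020000000001"), bytes([10, 0, 0, 1]),
                      bytes(6), bytes([10, 0, 0, 2]))
print(actions_for(parse_arp(request)))
```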
The ARP table can become large. The scaling factor dictates that the
table be maintained in control plane memory rather than in hardware
tables in the forwarding plane. In either case, it is prudent to
prefer 'local host' entries over 'remote host' entries when IP to
MAC address associations compete for ARP table space.
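One possible way to express the local-over-remote preference is
sketched below. The class and method names are hypothetical, and the
eviction policy (evict the stalest remote entry first, and never
displace a local entry in favor of a remote one) is an assumption of
this sketch rather than a rule stated in this document.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ArpEntry:
    mac: str
    port: str
    is_local: bool    # True when learned on a downlink
    last_seen: float  # timestamp supplied by the caller, in seconds


class ArpTable:
    """Bounded IP-to-MAC table that prefers local hosts (illustrative)."""

    def __init__(self, capacity: int, aging_s: float) -> None:
        self.capacity = capacity
        self.aging_s = aging_s
        self.entries: Dict[str, ArpEntry] = {}

    def lookup(self, ip: str, now: float) -> Optional[ArpEntry]:
        entry = self.entries.get(ip)
        if entry is not None and now - entry.last_seen > self.aging_s:
            del self.entries[ip]  # entry has aged out
            return None
        return entry

    def learn(self, ip: str, mac: str, port: str,
              is_local: bool, now: float) -> bool:
        """Store or refresh a binding; return True if it was new or changed."""
        old = self.lookup(ip, now)
        changed = old is None or (old.mac, old.port) != (mac, port)
        if old is None and len(self.entries) >= self.capacity:
            if not self._make_room(for_local=is_local):
                return changed  # table holds only local hosts; not stored
        self.entries[ip] = ArpEntry(mac, port, is_local, now)
        return changed

    def _make_room(self, for_local: bool) -> bool:
        # Evict the stalest remote entry first; a local entry is displaced
        # only by another local host, never by a remote one.
        remote = [ip for ip, e in self.entries.items() if not e.is_local]
        candidates = remote or (list(self.entries) if for_local else [])
        if not candidates:
            return False
        victim = min(candidates, key=lambda ip: self.entries[ip].last_seen)
        del self.entries[victim]
        return True
```

The boolean returned by learn() indicates whether the binding carried
new information, which is the condition the later sections use to
decide whether a PDU needs to be propagated.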
5.1 ARP Requests
ARP requests are broadcast frames. The ToR gleans the IP and MAC
addresses from the ARP PDU. The source IP and MAC address
association is learned or, if already learned, updated/refreshed.
The destination IP address is then searched in the ARP table. If an
entry exists, the associated MAC address from the table is used to
prepare a unicast ARP reply PDU. The same MAC address is also used
as the source MAC address in the MAC header that is prepended to the
unicast ARP reply PDU.
If the destination IP address in the request is not present in the
table, the original ARP request PDU is broadcast to all the switch
ports except the port on which the request was received. However, if
the requested (destination) IP address is present in the ARP table,
the unicast ARP reply PDU is prepared as described above and sent
out the egress port on which the target of the reply resides, and
the original ARP request PDU is dropped.
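The request handling described above can be sketched as follows. This
is a minimal sketch under stated assumptions: a plain dictionary
stands in for the switch's ARP table, frames are modelled as
dictionaries with hypothetical field names, and the target of the
proxied reply is taken to be the requesting host, so the reply leaves
through the port on which the request arrived. Gratuitous ARPs are
assumed to be handled separately.

```python
from typing import Dict, List, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"
ArpTable = Dict[str, Tuple[str, str]]  # IP -> (MAC, port); stand-in table
Frame = Dict[str, str]


def handle_arp_request(table: ArpTable, ports: List[str], in_port: str,
                       sender_ip: str, sender_mac: str,
                       target_ip: str) -> List[Tuple[str, Frame]]:
    """Return the (egress port, frame) pairs the switch should transmit."""
    table[sender_ip] = (sender_mac, in_port)  # learn or refresh the sender
    hit = table.get(target_ip)
    if hit is None:
        # Unknown target: flood the original request, except to its source.
        request = {"op": "request", "eth_src": sender_mac,
                   "eth_dst": BROADCAST, "sender_ip": sender_ip,
                   "sender_mac": sender_mac, "target_ip": target_ip}
        return [(p, request) for p in ports if p != in_port]
    target_mac, _ = hit
    # Known target: answer on its behalf and drop the original request.
    # The learned MAC is used both inside the PDU and as Ethernet source.
    reply = {"op": "reply", "eth_src": target_mac, "eth_dst": sender_mac,
             "sender_ip": target_ip, "sender_mac": target_mac,
             "target_ip": sender_ip, "target_mac": sender_mac}
    return [(in_port, reply)]
```

With an empty table the first request is flooded; once the target has
itself been seen as a sender, later requests for it are answered
locally and no broadcast leaves the switch.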
The intent is to prevent propagation of ARP request PDU broadcasts
as much as possible using the information present in the
ARP table. The following observations can be made from such
behavior.
. Most of the ARP requests from the local hosts for the local
hosts can be suppressed most of the time
. Most of the ARP requests from the remote hosts for the local
hosts can be prevented from being forwarded towards downlinks
or other uplinks
. Many of the ARP requests from the local hosts for the remote
hosts can be prevented from being forwarded towards uplinks, if
the remote host IP to MAC association is known.
5.2 ARP Response
The unicast ARP response is gleaned to learn/update the ARP table
with the source and destination IP/MAC address associations and is
then forwarded as a normal frame.
5.3 Gratuitous ARP
The Gratuitous ARP reply is a broadcast ARP PDU in which the
destination IP address and MAC address are those of the sender. It
is typically used by (VM) IP hosts to keep their associations fresh
in their peers' ARP caches.
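For illustration, a frame of this kind can be assembled as shown
below. This is a sketch under assumptions: the function name is
hypothetical, the PDU uses the Ethernet/IPv4 layout of RFC 826 with
the reply opcode, and the target fields simply repeat the sender's own
addresses as described above. Other encodings of gratuitous ARP exist
in practice.

```python
import socket
import struct

ETH_BROADCAST = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
OP_REPLY = 2


def build_gratuitous_arp(ip: str, mac: str) -> bytes:
    """Return a broadcast Ethernet frame carrying a gratuitous ARP reply."""
    mac_b = bytes.fromhex(mac.replace(":", ""))
    ip_b = socket.inet_aton(ip)
    eth_header = ETH_BROADCAST + mac_b + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack("!HHBBH6s4s6s4s",
                      1, 0x0800, 6, 4, OP_REPLY,
                      mac_b, ip_b,   # sender MAC / IP
                      mac_b, ip_b)   # target fields repeat the sender's own
    return eth_header + arp


frame = build_gratuitous_arp("10.0.0.5", "02:00:00:00:00:05")
print(len(frame))
```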
The ToR switch should process a Gratuitous ARP in the following
manner.
. Learn/update/refresh the ARP table entry
. If the ARP entry was new or existed with different information,
the gratuitous ARP PDU is forwarded; otherwise the PDU is
dropped.
The important goal in handling gratuitous ARP PDUs from the
downlinks (i.e. local hosts) is to not propagate them into the
'network' (i.e. to uplinks) if the information is not new.
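The forward-or-drop decision can be sketched as follows. A plain
dictionary again stands in for the ARP table, the function name is
hypothetical, and treating a change of ingress port as "different
information" is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

ArpTable = Dict[str, Tuple[str, str]]  # IP -> (MAC, port); stand-in table


def handle_gratuitous_arp(table: ArpTable, ports: List[str], in_port: str,
                          ip: str, mac: str) -> List[str]:
    """Return the ports to forward the PDU to; an empty list means drop."""
    previous = table.get(ip)
    table[ip] = (mac, in_port)  # learn / update / refresh
    if previous == (mac, in_port):
        return []  # nothing new: do not propagate
    return [p for p in ports if p != in_port]
```

A host that repeats the same announcement therefore reaches the
network only once, while a changed MAC address or port is propagated
immediately.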
5.4 Host movement
The VM architecture allows VMs to be moved to different physical
servers based on optimum resource utilization policies. The act of
movement is called vMotion, and this flexibility adds to its
attraction. vMotion can be manual (operator initiated) or automatic
in reaction to demands placed by the application users. The
important point is that in either case,
vMotion is not transparent and is made known to the network. There
is ongoing work in the IEEE 802.1 working group (802.1Qbg) to
coordinate/communicate the presence and capabilities of the VMs to
the directly connected network switch.
It is expected that the ToR would leverage the knowledge obtained
about a newly arrived VM to update its local ARP table as well as to
notify the network (other switches) via an unsolicited gratuitous
ARP on behalf of the VM. The details of such procedures will be
described in subsequent revisions of this document.
5.5 IPv6 Hosts
IPv6 hosts use Neighbor Discovery procedures that are different
from the ARP methodologies used by IPv4 hosts. The details of
handling Neighbor Discovery procedures will be described in
subsequent updates to this document.
5.6 External ARP servers
It is possible that in some configurations the ToR switches are not
capable of handling the ARP reduction procedures. For such
configurations, the ARP reduction procedures can be outsourced to
one or more external ARP server hosts. The ToR switches will then be
configured as follows.
. Identify the interface(s) connected to the ARP servers. Such
interface(s) must be separate from downlink and uplinks that
are connected to 'host reachable' (or native) networks as
described above. This concept is similar to how switches treat
'management' network separate from user data network.
. All broadcast ARP PDUs are forwarded to the interface(s) where
the ARP servers reside
. All unicast ARP PDUs received from 'native' interfaces are
bi-cast; one copy of the PDU is forwarded to the ARP server and
the other is forwarded normally
. All ARP PDUs received from the ARP server interface(s) are PDUs
generated by the ARP servers as a result of the ARP reduction
procedures. They are handled in the following manner.
o The source MAC address is not learned
o Instead, the source MAC address is used to determine the
'real' native ingress interface. That is, switch will
treat the packet as if it was received from the interface
where source appears to reside and make the forwarding
decision based on destination MAC address and the newly
determined ingress port.
. If multiple ARP server interfaces exist (in order to avoid a
single point of failure), an ARP PDU received from one ARP
server interface is never forwarded out another ARP server
interface (i.e. split horizon rule).
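The forwarding rules above can be sketched as a single decision
function. This is a sketch under assumptions, not a definitive
implementation: the function and parameter names are hypothetical,
'fdb' stands for the switch's MAC forwarding table (MAC to port), and
unknown unicast destinations are assumed to be flooded on the native
ports as in ordinary layer 2 switching.

```python
from typing import Dict, List, Optional, Set

BROADCAST = "ff:ff:ff:ff:ff:ff"


def forward_arp(in_port: str, src_mac: str, dst_mac: str,
                server_ports: Set[str], native_ports: Set[str],
                fdb: Dict[str, str]) -> List[str]:
    """Return the egress ports for an ARP frame (illustrative only)."""
    if in_port in server_ports:
        # PDU generated by an ARP server: the source MAC is not learned.
        # It is used to find the 'real' native ingress port instead.
        real_in: Optional[str] = fdb.get(src_mac)
        if dst_mac == BROADCAST or dst_mac not in fdb:
            out = sorted(p for p in native_ports if p != real_in)
        else:
            out = [] if fdb[dst_mac] == real_in else [fdb[dst_mac]]
        # Split horizon: never hand a server's PDU to any server port.
        return [p for p in out if p not in server_ports]

    fdb[src_mac] = in_port  # normal MAC learning on native ports
    if dst_mac == BROADCAST:
        return sorted(server_ports)  # broadcasts go to the ARP servers only
    # Unicast ARP PDU: bi-cast, one copy to the servers, one forwarded.
    if dst_mac in fdb:
        normal = [fdb[dst_mac]]
    else:
        normal = sorted(p for p in native_ports if p != in_port)
    return sorted(server_ports) + normal
```

For example, a broadcast request arriving on a native port is handed
only to the ARP server ports, and a proxied reply coming back from a
server is delivered to the requester's port without the server port
ever appearing in the forwarding table.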
6.0 Conclusion
Based on the procedures described in this document, it is possible
for ToR switches in the data center to dampen ARP broadcasts
significantly. The solution is not new; it is based on well-known
procedures, is non-intrusive, and is low-hanging fruit that curtails
the broadcasts that are increasingly becoming a problem in data
centers. In essence, the ToR switches offload extended ARP table
management from the IP hosts onto themselves. The ARP table timeout
can be tuned higher by the operator based on the available switch
resources and network traffic behavior. A larger ARP table capacity
translates directly into more effective suppression of ARP
broadcasts. An additional approach is described to further offload
ARP table and PDU
management to dedicated server(s), either for reduced-capacity
low-end ToR switches or as a cost-effective solution.
7.0 Security Considerations
The details of the security aspects will be addressed in a future
revision.
8.0 References
8.1 Normative References
[ARP] RFC 826, STD 37, D. Plummer, "An Ethernet Address Resolution
Protocol: Or Converting Network Protocol Addresses to 48.bit
Ethernet Addresses for Transmission on Ethernet Hardware".
[ARP-Problem] L. Dunbar et al., "Scalable Address Resolution for
Large Data Center Problem Statements",
draft-dunbar-arp-for-large-dc-problem-statement-00.txt.
8.2 Informative References
[ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking
in Layer 2 VPN", draft-ietf-l2vpn-arp-mediation-13.txt.
[IPLS] H. Shah et al., "IP-only LAN service",
draft-ietf-l2vpn-ipls-09.txt.
[PROXY-ARP] RFC 925, J. Postel, "Multi-LAN Address Resolution".
9.0 Author's Address
Himanshu Shah
Ciena Corp
Email: hshah@ciena.com