Working Group: ARMD                                  Himanshu Shah
   Intended Status: Proposed Standard                      Ciena Corp
   Internet Draft

   Expiration Date: May, 2011

                                                    October 18, 2010








                        ARP Reduction in Data Center
                   draft-shah-armd-arp-reduction-00.txt



Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on May 18, 2011



Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


   Shah, et al.            Expires May 2011                          1
   Internet Draft   draft-shah-arp-reduction-00.txt


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.


Abstract

   With the advent of virtual machine (VM) technologies, a single
   physical host is able to support multiple VMs. Data center
   applications leverage this capability to instantiate tens to
   hundreds of VMs in a server. Each VM operates as an independent IP
   host with its own MAC address associated with a virtual Network
   Interface Card (vNIC) that maps to a single physical Ethernet
   interface. The physical servers are typically stacked in a rack,
   with their Ethernet interfaces connected to a top-of-the-rack (ToR)
   switch. The ToR switches are interconnected through an
   End-of-the-Row (EoR) switch, which in turn is connected to core
   switches.

   As discussed in [ARP-Problem], the VM hosts use ARP broadcasts to
   find other VM hosts and use periodic (broadcast) gratuitous ARPs to
   refresh their IP to MAC address bindings in other VM hosts. In a
   large data center with potentially thousands of VM hosts in a
   layer 2 topology, such broadcasts can waste substantial network
   bandwidth and host CPU resources.

   This document describes a solution whereby a ToR switch takes over
   the handling of ARP broadcasts, based on an ARP table that it
   maintains by gleaning information from passing ARP PDUs. Gratuitous
   ARP PDUs that carry no new information are dropped, and broadcast
   ARP requests from hosts are answered by the switch from the learned
   ARP information instead of being forwarded.


Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].



Table of Contents


Copyright Notice
Abstract
1.0 Contributing Authors
2.0 Overview
 2.1 Terminology
3.0 Topology
4.0 Configuration
5.0 Building the ARP tables
 5.1 ARP Requests
 5.2 ARP Response
 5.3 Gratuitous ARP
 5.4 Host movement
 5.5 IPv6 Hosts
 5.6 External ARP servers
6.0 Conclusion
7.0 Security Considerations
8.0 References
 8.1 Normative References
 8.2 Informative References
9.0 Author's Address

1.0  Contributing Authors

   This document is the combined effort of the following individuals
   and many others who have carefully reviewed this document and
   provided technical clarifications.

   Linda Dunbar                 Huawei
   Sue Hares                    Huawei
   T Sridhar          Force10 Networks



2.0 Overview

   The following factors exacerbate the effect of ARP broadcasts in
   data centers.
     . Ever-increasing dependence on applications that run in the data
        center
     . A large number of physical server hosts in the server farms
     . A large number of VMs in each physical server host
     . Each VM has its own IP and MAC address, and all VMs reside in
        the same subnet so that they can move between physical hosts
        based on fair resource distribution policies. This requires
        the data center network to be layer 2 based
     . Each VM resorts to frequent ARP broadcasts, both as requests to
        find target VMs and as gratuitous ARPs to refresh its IP to
        MAC address binding in the peers with which it communicates
        most often. This also stems from the fact that VMs hold a
        relatively small ARP table and use aggressive aging in order
        to keep only the most active peers in the table.

   Broadcasts in layer 2 networks have far-reaching impact: they waste
   network bandwidth as well as the CPU resources of all the VMs that
   must process superfluous ARP broadcasts.

   It appears possible to minimize the ills of ARP broadcasts in the
   data center network in a relatively simple fashion. The solution
   requires the first-hop Ethernet switches, typically ToR switches,
   to maintain an ARP table learned from the passing ARP PDUs, and to
   selectively propagate ARP PDUs and/or proxy on behalf of the remote
   peer. These ARP processing principles are well known and are
   used/described in L2VPN Working Group documents such as
   [ARP-Mediation] and [IPLS].

   The following sections describe the inner workings of ARP snooping,
   learning and maintaining ARP tables, and using the learned
   information to limit broadcast propagation and to proxy (the
   response) on behalf of remote peers.






2.1 Terminology


        ToR            Top-of-Rack. An Ethernet switch present on top
                       of a rack which provides network connectivity to
                       the servers present on the rack.

        Downlink       Downlink in this document refers to the local
                        host (servers in the rack) facing Ethernet
                        connection in the ToR switch.


        Uplink         Uplink in this document refers to the network
                        facing Ethernet connection in the ToR switch.
                        Typically, the uplinks from ToRs connect to
                        End-of-Row switches.

        EoR            End-of-Row Ethernet switch. This is more of an
                        aggregation switch. Uplinks from ToR switches
                        connect to the EoR switch, and uplinks from
                        the EoR switch connect to a core switch.




        Host/Server    The terms host and server are used in this
                        document to refer to an IP host or server. An
                        IP host could be a physical entity or a
                        logical entity (such as a Virtual Machine) in
                        a physical host. The term server refers to its
                        application role in the data center. Both
                        terms are used interchangeably or together and
                        mean an IP end station.

        Local hosts    This term is used in the context of a ToR
                        switch to denote the (VM) hosts connected to a
                        ToR on the downlink, i.e. directly connected
                        hosts.

        Remote hosts   This term is used in the context of a ToR to
                        denote the hosts that are accessible through
                        the uplink of the ToR.

        VM             Virtual Machine. This is a logical instance of
                        a host that operates independently in a
                        physical host and has its own IP and MAC
                        address. The VM architecture allows efficient
                        use of physical host resources in data center
                        applications.



3.0 Topology

   The example data center topology referred to in this document is a
   hierarchy of low to high density Ethernet switches that provides a
   flat (common broadcast domain) layer 2 network for the servers in
   the data center. Each server host thus connected is said to be on
   the same subnet and communicates directly using IP without having
   to go through a router or default gateway. In other words, an IPv4
   host (VM or otherwise) on this network can find another IPv4 host's
   MAC address using ARP.


4.0 Configuration

   It is assumed that the ARP reduction methodologies defined in this
   document will be limited to ToR switches. We believe that the
   maximum benefit of restraining ARP broadcasts in the network can be
   achieved by the first-hop switches (those directly connected to
   hosts) without placing additional burden on second or third tier
   switches.

   The ToR switches need to be configured with this feature enabled.
   Each Ethernet interface needs to be identified as either a downlink
   or an uplink within the context of this feature. The ARP reduction
   feature treats ARP frames received from a downlink differently from
   those received from an uplink, as described in the following
   sections.




   The operator can configure various ARP reduction parameters, such
   as:
     . ARP aging timer
     . Size of the ARP table
     . Static IP to MAC address entries

   Some low-cost ToR switches do not have the capacity needed to
   perform the ARP reduction functions. In those circumstances, the
   external ARP server approach (described in Section 5.6) can be
   considered.
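
   The configurable items above could be captured in a structure along
   the following lines. This is only an illustrative sketch in Python:
   the names ArpReductionConfig, PortRole, aging_timer_s and
   max_entries, the interface names, and all default values are
   assumptions made for this example and are not defined by this
   document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class PortRole(Enum):
    """Role of a ToR Ethernet interface for the ARP reduction feature."""
    DOWNLINK = "downlink"  # faces local hosts (servers in the rack)
    UPLINK = "uplink"      # faces the network (e.g. an EoR switch)


@dataclass
class ArpReductionConfig:
    """Hypothetical per-switch settings; every default is an assumption."""
    enabled: bool = False
    aging_timer_s: int = 300   # ARP aging timer, in seconds
    max_entries: int = 4096    # size of the ARP table
    port_roles: Dict[str, PortRole] = field(default_factory=dict)
    static_entries: Dict[str, str] = field(default_factory=dict)  # IP -> MAC

    def validate(self) -> None:
        """Reject settings that cannot describe a working configuration."""
        if self.aging_timer_s <= 0:
            raise ValueError("aging timer must be positive")
        if self.max_entries < len(self.static_entries):
            raise ValueError("table too small for the static entries")
        if self.enabled and not self.port_roles:
            raise ValueError("no interface has a downlink/uplink role")


config = ArpReductionConfig(
    enabled=True,
    port_roles={"eth1": PortRole.DOWNLINK, "eth48": PortRole.UPLINK},
    static_entries={"10.0.0.1": "00:11:22:33:44:55"},
)
config.validate()  # raises ValueError if the settings are inconsistent
```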


5.0 Building the ARP tables

   When the feature is enabled, the ToR switch starts monitoring data
   frames for ARP PDUs. It is recommended that ARP PDUs be handled in
   the following manner.
     . All ARP request PDUs should be redirected to the control plane
        CPU
     . All gratuitous ARP PDUs should be redirected to the control
        plane CPU
     . All ARP response PDUs should be bi-casted: one copy is sent to
        the control plane CPU and the other copy is forwarded out
        normally.
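
   The three handling rules above can be sketched as a classifier
   over the ARP payload. The sketch below is a Python illustration
   under stated assumptions: the names Action, classify and build_arp
   are invented for this example, only Ethernet/IPv4 ARP is
   considered, and a PDU is treated as gratuitous when its sender and
   target IP addresses are equal.

```python
import struct
from enum import Enum

# Ethernet/IPv4 ARP payload: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REQUEST, ARP_REPLY = 1, 2


class Action(Enum):
    REDIRECT_TO_CPU = "redirect"  # only the control plane sees the PDU
    BICAST = "bicast"             # copy to control plane, forward normally
    FORWARD = "forward"           # not handled by this feature


def build_arp(oper: int, sender_mac: bytes, sender_ip: bytes,
              target_mac: bytes, target_ip: bytes) -> bytes:
    """Pack an Ethernet/IPv4 ARP payload (helper for this example)."""
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, oper,
                       sender_mac, sender_ip, target_mac, target_ip)


def classify(arp_payload: bytes) -> Action:
    """Map an ARP payload to the handling recommended in the text."""
    size = struct.calcsize(ARP_FORMAT)
    if len(arp_payload) < size:
        return Action.FORWARD
    htype, ptype, hlen, plen, oper, _sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, arp_payload[:size])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return Action.FORWARD
    if spa == tpa:               # assumption: gratuitous ARP
        return Action.REDIRECT_TO_CPU
    if oper == ARP_REQUEST:
        return Action.REDIRECT_TO_CPU
    if oper == ARP_REPLY:
        return Action.BICAST
    return Action.FORWARD


mac_a, ip_a = bytes.fromhex("020000000001"), bytes([10, 0, 0, 1])
mac_b, ip_b = bytes.fromhex("020000000002"), bytes([10, 0, 0, 2])
request = build_arp(ARP_REQUEST, mac_a, ip_a, bytes(6), ip_b)
reply = build_arp(ARP_REPLY, mac_b, ip_b, mac_a, ip_a)
gratuitous = build_arp(ARP_REPLY, mac_a, ip_a, mac_a, ip_a)
```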

   The ARP table can become large. Scaling considerations dictate
   that the table be maintained in control plane memory rather than in
   hardware tables in the forwarding plane. In either case, it is
   prudent to prefer 'local host' entries over 'remote host' entries
   when placing IP to MAC address associations in contested ARP table
   space.
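
   One way to realize the preference for local hosts is sketched
   below in Python. The eviction policy shown is an assumption of
   this example, not a procedure defined by this document: when the
   table is full, the oldest remote entry is evicted first, a new
   local entry may displace the oldest local entry if no remote entry
   exists, and a new remote entry is simply not stored when the table
   holds only local entries. The names ArpTable, Entry and learn are
   hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    mac: str
    port: str
    is_local: bool    # True if learned on a downlink
    last_seen: float  # time of the last learn/refresh


class ArpTable:
    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: Dict[str, Entry] = {}

    def _oldest(self, local: bool) -> Optional[str]:
        """Return the IP of the least recently seen local/remote entry."""
        candidates = [ip for ip, e in self.entries.items()
                      if e.is_local == local]
        return min(candidates,
                   key=lambda ip: self.entries[ip].last_seen,
                   default=None)

    def learn(self, ip: str, mac: str, port: str,
              is_local: bool, now: float) -> bool:
        """Add or refresh a binding; return False if it was not stored."""
        if ip not in self.entries and len(self.entries) >= self.max_entries:
            victim = self._oldest(local=False)      # remote entries go first
            if victim is None and is_local:
                victim = self._oldest(local=True)   # local may displace local
            if victim is None:
                return False                        # no room for this entry
            del self.entries[victim]
        self.entries[ip] = Entry(mac, port, is_local, now)
        return True


table = ArpTable(max_entries=2)
table.learn("10.0.0.1", "02:00:00:00:00:01", "eth1", True, now=1.0)
table.learn("10.0.0.9", "02:00:00:00:00:09", "eth48", False, now=2.0)
# The table is full: the new local host displaces the remote entry.
table.learn("10.0.0.2", "02:00:00:00:00:02", "eth2", True, now=3.0)
```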


5.1 ARP Requests

   ARP requests are broadcast frames. The ToR gleans the sender's IP
   and MAC addresses from the ARP PDU, and the association is learned
   or, if already learned, updated/refreshed. The destination
   (target) IP address is then searched in the ARP table. If an entry
   exists, the associated MAC address from the table is used to
   prepare a unicast ARP reply PDU. The same MAC address is also used
   as the source MAC address in the MAC header that is prepended to
   the unicast ARP reply PDU.

   If the destination IP address in the request is not present in the
   table, the original ARP request PDU is broadcast to all switch
   ports except the port on which the request was received. If the
   requested (destination) IP address is present in the ARP table,
   the unicast ARP response PDU prepared as described above is sent
   out of the port through which the reply's destination, i.e. the
   requesting host, is reached, and the original ARP request PDU is
   dropped.
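
   The request handling described in this section can be sketched as
   follows. This Python example is illustrative only: the types
   ArpPdu and Frame, the function handle_arp_request, and the table
   layout (IP address mapped to MAC address and port) are assumptions
   of this sketch. It also assumes that the proxy reply leaves on the
   port where the request arrived, since that is where the requesting
   host resides.

```python
from typing import Dict, List, NamedTuple, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"


class ArpPdu(NamedTuple):
    oper: str          # "request" or "reply"
    sender_mac: str
    sender_ip: str
    target_mac: str
    target_ip: str


class Frame(NamedTuple):
    src_mac: str       # source MAC in the Ethernet header
    dst_mac: str       # destination MAC in the Ethernet header
    arp: ArpPdu


Table = Dict[str, Tuple[str, str]]  # IP -> (MAC, port)


def handle_arp_request(table: Table, frame: Frame, ingress_port: str,
                       all_ports: List[str]) -> List[Tuple[str, Frame]]:
    """Return the (egress port, frame) pairs the switch should transmit."""
    req = frame.arp
    # Learn or refresh the sender's IP to MAC association.
    table[req.sender_ip] = (req.sender_mac, ingress_port)

    known = table.get(req.target_ip)
    if known is None:
        # Unknown target: flood the original request, except to its source.
        return [(p, frame) for p in all_ports if p != ingress_port]

    # Known target: answer on its behalf and drop the original request.
    target_mac, _target_port = known
    reply = ArpPdu("reply", target_mac, req.target_ip,
                   req.sender_mac, req.sender_ip)
    return [(ingress_port, Frame(target_mac, req.sender_mac, reply))]


table: Table = {"10.0.0.2": ("02:00:00:00:00:02", "eth2")}
ports = ["eth1", "eth2", "eth48"]
hit = Frame("02:00:00:00:00:01", BROADCAST,
            ArpPdu("request", "02:00:00:00:00:01", "10.0.0.1",
                   "00:00:00:00:00:00", "10.0.0.2"))
miss = hit._replace(arp=hit.arp._replace(target_ip="10.0.0.99"))
proxied = handle_arp_request(table, hit, "eth1", ports)
flooded = handle_arp_request(table, miss, "eth1", ports)
```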

   The intent is to prevent propagation of ARP request broadcasts as
   much as possible using the information present in the ARP table.
   The following observations can be made about this behavior.
      . Most ARP requests from local hosts for other local hosts can
         be kept from propagating
      . Most ARP requests from remote hosts for local hosts can be
         prevented from being forwarded towards downlinks or other
         uplinks
      . Many ARP requests from local hosts for remote hosts can be
         prevented from being forwarded towards uplinks, if the remote
         host's IP to MAC association is known.

5.2 ARP Response

   The unicast ARP response is gleaned to learn/update the ARP table
   with the source and destination IP/MAC address associations, and
   is then forwarded out as a normal frame.
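
   A minimal Python sketch of this gleaning step follows. The names
   ArpReply and glean_reply are hypothetical, and for brevity the
   table here maps an IP address to a MAC address only; forwarding of
   the frame itself is left to the normal data path and is not
   modeled.

```python
from typing import Dict, NamedTuple


class ArpReply(NamedTuple):
    sender_mac: str
    sender_ip: str
    target_mac: str
    target_ip: str


def glean_reply(table: Dict[str, str], reply: ArpReply) -> bool:
    """Record both bindings in a unicast reply; True if the table changed."""
    changed = False
    bindings = ((reply.sender_ip, reply.sender_mac),
                (reply.target_ip, reply.target_mac))
    for ip, mac in bindings:
        if table.get(ip) != mac:
            table[ip] = mac
            changed = True
    return changed


table: Dict[str, str] = {}
reply = ArpReply("02:00:00:00:00:02", "10.0.0.2",
                 "02:00:00:00:00:01", "10.0.0.1")
first = glean_reply(table, reply)    # both bindings are new
second = glean_reply(table, reply)   # nothing new the second time
```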

5.3 Gratuitous ARP

   A Gratuitous ARP reply is a broadcast ARP PDU in which the
   destination IP address and MAC address are those of the sender. It
   is typically used by (VM) IP hosts to keep their associations fresh
   in their peers' ARP caches.

   The ToR switch should process a Gratuitous ARP in the following
   manner.
      . Learn/update/refresh the ARP table entry
      . If the ARP entry was new, or existed with different
         information, the gratuitous ARP PDU is forwarded out;
         otherwise the PDU is dropped.

   The important goal in handling gratuitous ARP PDUs from the
   downlinks (i.e. local hosts) is to not propagate them into the
   'network' (i.e. to uplinks) if the information is not new.
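
   The forward-or-drop decision for gratuitous ARP can be sketched as
   below. This Python fragment is illustrative: the function name
   handle_gratuitous_arp and the table layout are assumptions, and
   "different information" is interpreted here as a change of either
   the MAC address or the port on which the host was seen.

```python
from typing import Dict, Tuple

Table = Dict[str, Tuple[str, str, float]]  # IP -> (MAC, port, last_seen)


def handle_gratuitous_arp(table: Table, ip: str, mac: str,
                          port: str, now: float) -> bool:
    """Update the table; return True to forward the PDU, False to drop it."""
    old = table.get(ip)
    is_news = old is None or (old[0], old[1]) != (mac, port)
    table[ip] = (mac, port, now)  # learn, update, or refresh the entry
    return is_news


table: Table = {}
new_host = handle_gratuitous_arp(
    table, "10.0.0.1", "02:00:00:00:00:01", "eth1", now=1.0)
refresh = handle_gratuitous_arp(
    table, "10.0.0.1", "02:00:00:00:00:01", "eth1", now=2.0)
moved = handle_gratuitous_arp(
    table, "10.0.0.1", "02:00:00:00:00:01", "eth2", now=3.0)
```

   In this run the first PDU is forwarded because the entry is new,
   the second is dropped as a pure refresh, and the third is
   forwarded because the port changed.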

5.4 Host movement

   The VM architecture allows VMs to move to different physical
   servers based on optimum resource utilization policies. This
   movement is called vMotion, and the flexibility it offers adds to
   the attraction of VMs. vMotion could be manual (operator
   initiated) or automatic, in reaction to demands placed by the
   application users. The important point is that in either case,
   vMotion is not transparent and is made known to the network. There
   is ongoing work in IEEE 802.1 (802.1Qbg) to
   coordinate/communicate the presence and capabilities of the VMs to
   the directly connected network switch.

   It is expected that the ToR would leverage the knowledge obtained
   about a newly arrived VM to update its local ARP table as well as
   to notify the network (other switches) via an unsolicited
   gratuitous ARP on behalf of the VM. The details of such procedures
   will be described in subsequent revisions of this document.




5.5 IPv6 Hosts

   IPv6 hosts use Neighbor Discovery procedures, which differ from the
   ARP methodologies used by IPv4 hosts. The handling of Neighbor
   Discovery procedures will be described in subsequent updates to
   this document.

5.6 External ARP servers

   In some configurations, the ToR switches may not be capable of
   handling the ARP reduction procedures. For such configurations, it
   is possible to outsource the ARP reduction procedures to one or
   more external ARP server hosts. The ToR switches are then
   configured as follows.
      . The interface(s) connected to the ARP servers are identified.
         Such interface(s) must be separate from the downlinks and
         uplinks that are connected to the 'host reachable' (or
         native) networks described above. This concept is similar to
         how switches treat a 'management' network separately from the
         user data network.
      . All broadcast ARP PDUs are forwarded to the interface(s) where
         the ARP servers reside
      . All unicast ARP PDUs received from 'native' interfaces are
         bi-casted; one copy of the PDU is forwarded to the ARP server
         and the other is forwarded normally
      . All ARP PDUs received from the ARP server interface(s) are
         PDUs generated by the ARP servers as a result of the ARP
         reduction procedures. They are handled in the following
         manner.
           o The source MAC address is not learned
           o Instead, the source MAC address is used to determine the
              'real' native ingress interface. That is, the switch
              treats the packet as if it was received from the
              interface where the source appears to reside, and makes
              the forwarding decision based on the destination MAC
              address and the newly determined ingress port.
      . If multiple ARP server interfaces exist (in order to avoid a
         single point of failure), an ARP PDU received from one ARP
         server interface is never forwarded out to the other ARP
         server interface(s) (i.e. split horizon rule).
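
   The forwarding rules for ARP PDUs in the external server case can
   be sketched as a single egress decision. The Python below is an
   illustration under assumptions: the function arp_egress_ports, its
   parameters, and the flooding of unknown unicast destinations to
   native ports are choices made for this example and are not defined
   by this document.

```python
from typing import Dict, Optional, Set

BROADCAST = "ff:ff:ff:ff:ff:ff"


def arp_egress_ports(dst_mac: str, src_mac: str, ingress: str,
                     server_ports: Set[str], native_ports: Set[str],
                     mac_table: Dict[str, str]) -> Set[str]:
    """Return the ports on which an ARP PDU should be transmitted."""

    def normal(effective_ingress: Optional[str]) -> Set[str]:
        # Ordinary bridging, restricted to native (host reachable) ports.
        known = mac_table.get(dst_mac)
        if dst_mac != BROADCAST and known is not None:
            return {known} - {effective_ingress}
        return native_ports - {effective_ingress}

    if ingress in server_ports:
        # PDU generated by an ARP server: the source MAC is not learned;
        # it only selects the 'real' native ingress port. The result never
        # contains a server port, which gives the split horizon rule.
        return normal(mac_table.get(src_mac))

    mac_table[src_mac] = ingress         # usual learning on native ports
    if dst_mac == BROADCAST:
        return set(server_ports)         # broadcasts go to the servers only
    return normal(ingress) | server_ports  # unicast ARP is bi-casted


servers = {"arp0", "arp1"}
native = {"eth1", "eth2", "eth48"}
macs = {"02:00:00:00:00:01": "eth1", "02:00:00:00:00:09": "eth48"}

from_host_bcast = arp_egress_ports(
    BROADCAST, "02:00:00:00:00:01", "eth1", servers, native, macs)
from_host_ucast = arp_egress_ports(
    "02:00:00:00:00:09", "02:00:00:00:00:01", "eth1", servers, native, macs)
from_server = arp_egress_ports(
    "02:00:00:00:00:01", "02:00:00:00:00:09", "arp0", servers, native, macs)
```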

6.0 Conclusion

   Based on the procedures described in this document, it is possible
   for ToR switches in the data center to dampen ARP broadcasts
   significantly. The solution is not new: it is based on well-known
   procedures, is non-intrusive, and is low-hanging fruit for
   curtailing the broadcasts that are increasingly becoming a problem
   in data centers. In essence, the ToR switches offload the extended
   ARP table management from the IP hosts onto themselves. The ARP
   table timeout can be tuned higher by the operator based on the
   available switch resources and network traffic behavior. A larger
   ARP table capacity directly translates to more effective
   suppression of ARP broadcasts. An additional approach is described
   to further offload ARP table and PDU management to dedicated
   server(s), either for reduced-capacity low-end ToR switches or as a
   cost-effective solution.


7.0 Security Considerations

   The details of the security aspects will be addressed in a future
   revision.

8.0 References

8.1 Normative References


   [ARP] RFC 826, STD 37, D. Plummer, "An Ethernet Address Resolution
      Protocol:  Or Converting Network Protocol Addresses to 48.bit
      Ethernet Addresses for Transmission on Ethernet Hardware".

   [ARP-Problem] L. Dunbar et al., "Scalable Address Resolution for
      Large Data Center Problem Statements",
      draft-dunbar-arp-for-large-dc-problem-statement-00.txt.


8.2 Informative References

   [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking
      in Layer 2 VPN", draft-ietf-l2vpn-arp-mediation-13.txt.

   [IPLS] H. Shah et al., "IP-only LAN service",
      draft-ietf-l2vpn-ipls-09.txt.

   [PROXY-ARP] RFC 925, J. Postel, "Multi-LAN Address Resolution".


9.0 Author's Address

   Himanshu Shah
   Ciena Corp
   Email: hshah@ciena.com


