OPSAWG                                                       R. Krishnan
Internet Draft                                                 S. Khanna
Intended status: Experimental                     Brocade Communications
Expires: July 2013                                               L. Yong
January 12, 2013                                              Huawei USA
                                                             A. Ghanwani
                                                                    Dell
                                                                 Ning So
                                                     Tata Communications
                                                           B. Khasnabish
                                                         ZTE Corporation
Best Practices for Optimal LAG/ECMP Component Link Utilization in
Provider Backbone Networks
draft-krishnan-opsawg-large-flow-load-balancing-02.txt
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on July 12, 2013.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
Krishnan Expires July 12, 2013 [Page 1]
Internet-Draft Optimal Load Distribution over LAG/ECMP January 2013
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Abstract
Demands on networking infrastructure are growing exponentially; the
drivers include bandwidth-hungry rich media applications and
inter-data-center communications. In this context, it is important to
make optimal use of the bandwidth in service provider backbone
networks, which extensively use LAG/ECMP techniques for bandwidth
scaling. This draft describes the issues faced in service provider
backbones in the context of LAG/ECMP and recommends some best
practices for managing bandwidth efficiently in these networks.
Table of Contents
1. Introduction
1.1. Conventions
2. Hash-based Load Distribution in LAG/ECMP
3. Best Practices for Optimal LAG/ECMP Component Link Utilization
3.1. Large Flow Recognition
3.1.1. Flow Identification
3.1.2. Sampling Techniques - sFlow/PSAMP
3.1.3. Automatic Hardware Recognition
3.2. Load Re-Balancing Options
3.2.1. Alternative Placement of Large Flows
3.2.2. Redistributing Other Flows
3.2.2.1. Redistributing All Other Flows
3.2.2.2. Redistributing the Other Flows on the Congested Link
3.2.3. Component Link Protection Considerations
3.2.4. Load Re-Balancing Example
4. Operational Considerations
5. Data Model Considerations
6. IANA Considerations
7. Security Considerations
8. Acknowledgements
9. References
9.1. Normative References
9.2. Informative References
Appendix A. Internet Traffic Analysis and Load Balancing Simulation
1. Introduction
Service provider backbone networks extensively use LAG/ECMP
techniques for capacity scaling. Network traffic can be broadly
categorized into two types: long-lived large flows and other flows
(which include long-lived small flows and short-lived small/large
flows). Stateless hash-based techniques [ITCOM] [RFC2991] [RFC2992]
[RFC6790] are often used to distribute both long-lived large flows
and other flows over the component links in a LAG/ECMP. However, the
traffic may not be evenly distributed over the component links due to
the traffic pattern.
This draft describes best practices for optimal LAG/ECMP component
link utilization while using hash-based techniques. These best
practices comprise the following steps: recognizing long-lived large
flows in a router, and then either assigning the long-lived large
flows to specific LAG/ECMP component links or redistributing other
flows when a component link on the router is congested.
1.1. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119]. The
following acronyms and terms are used in this document:
COTS: Commercial Off-the-shelf
DOS: Denial of Service
ECMP: Equal Cost Multi-path
GRE: Generic Routing Encapsulation
LAG: Link Aggregation Group
Large flow(s): long-lived large flow(s)
MPLS: Multiprotocol Label Switching
NVGRE: Network Virtualization using Generic Routing Encapsulation
Other flows: long-lived small flows and short-lived small/large flows
QoS: Quality of Service
VXLAN: Virtual Extensible LAN
2. Hash-based Load Distribution in LAG/ECMP
Hashing techniques are often used for flow-based load distribution
[ITCOM]. A larger flow identification space, i.e., a finer
granularity of flows, yields a more random spread of the flows over a
set of component links. The advantages of hash-based load
distribution are that the packet sequence within a flow is preserved
and that distribution happens in real time without keeping state for
individual flows. If the traffic flows are spread randomly in the
flow identification space, the flow rates are much smaller than the
link capacity, and the rate differences are not dramatic, the hashing
algorithm works very well in general. However, if one or more of
these conditions are not met, hashing may result in very unbalanced
loads on individual component links. One example is illustrated in
Figure 1. There is a LAG between two routers, R1 and R2. This LAG has
three component links: (1), (2), and (3).
. Component link (1) has 2 other flows and 1 large flow and the
link utilization is normal.
. Component link (2) has 3 other flows and no large flow and the
link utilization is light.
o The absence of any large flow leaves the component link
under-utilized.
. Component link (3) has 2 other flows and 2 large flows and the
link is over-utilized.
o The presence of 2 large flows causes the component link to
become congested.
+-----------+ +-----------+
| | -> -> | |
| |=====> | |
| (1)|--/---/-|(1) |
| | | |
| | | |
| (R1) |-> -> ->| (R2) |
| (2)|--/---/-|(2) |
| | | |
| | -> -> | |
| |=====> | |
| |=====> | |
| (3)|--/---/-|(3) |
| | | |
+-----------+ +-----------+
Where: ->-> other flows
===> large flow
Figure 1: Unevenly Utilized Component Links
This document presents improved hash-based load distribution
techniques that are aware of large flows. The techniques compensate
for the unbalanced load distribution that hashing can produce under
such traffic patterns.
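As a rough illustration of the behavior described in this section,
the following Python sketch hashes flows onto three component links
and sums the offered load per link. The flow keys, the rates, and the
use of CRC32 as the hash function are assumptions made for this
example; hash functions in real routers are vendor specific.

```python
import zlib

def pick_link(flow_key, num_links):
    """Stateless choice: the same flow key always maps to the same link."""
    return zlib.crc32(flow_key.encode()) % num_links

def link_loads(flows, num_links):
    """Sum the rate (in Mbps) of the flows hashed onto each link."""
    loads = [0] * num_links
    for flow_key, rate_mbps in flows.items():
        loads[pick_link(flow_key, num_links)] += rate_mbps
    return loads

# Hypothetical traffic mix: 30 small flows plus 3 large flows.
flows = {f"10.0.0.{i}:5000->10.1.0.{i}:80/tcp": 10 for i in range(30)}
flows.update({f"10.2.0.{i}:6000->10.3.0.{i}:443/tcp": 4000 for i in range(3)})

loads = link_loads(flows, num_links=3)
print(loads)  # per-link load in Mbps
```

Because the hash looks only at the flow key and not at the flow
rate, nothing prevents two or more of the large flows from landing on
the same component link.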
3. Best Practices for Optimal LAG/ECMP Component Link Utilization
The techniques suggested in this draft constitute a local
optimization: both the measurement of large flows and the
re-balancing of the load take place at individual nodes in the
network. This approach would not yield a globally optimal placement
of a large flow across several nodes in the network, which some
networks may desire or require. On the other hand, it may be adequate
for some operators for the following reasons: 1) different links in
the network experience different levels of utilization and, thus, a
more "targeted" solution is needed for the few hot-spots in the
network; 2) some networks may lack end-to-end visibility, e.g., when
a network carries the traffic from multiple other networks.
The various steps in achieving optimal LAG/ECMP component link
utilization in backbone networks are detailed below:
Step 1) This involves large flow recognition in routers and
maintaining the mapping of the large flow to the component link that
it uses. The recognition of large flows is explained in Section 3.1.
Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a pre-
programmed threshold, an operator alert is generated. The large flows
mapped to the congested egress component link are exported to a
central management entity.
Step 3) On receiving the alert about the congested component link,
the operator, through a central management entity, finds the large
flows mapped to that component link and the LAG/ECMP group to which
the component link belongs.
Step 4) The operator can choose to rebalance the large flows onto
lightly loaded component links of the LAG/ECMP group or to
redistribute all the other flows on the congested link to other
component links of the group. The operator, through a central
management entity, can choose one of the following actions:
1) Indicate specific large flows to rebalance;
2) Let the router decide the best large flows to rebalance;
3) Let the router redistribute all the other flows on the
congested link to other component links in the group.
The central management entity conveys the above information to the
router. The load re-balancing options are explained in section 3.2.
Optionally, if desired, steps 2) to 4) could become an automated
process.
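The periodic scan in Step 2 could be sketched as follows. This is
only an illustration: the ComponentLink structure, the link names,
the 80% threshold, and the alert format are assumptions, not part of
any router or management interface.

```python
from dataclasses import dataclass, field

@dataclass
class ComponentLink:
    name: str
    capacity_mbps: float
    load_mbps: float
    large_flows: list = field(default_factory=list)  # flow keys mapped here

def scan_links(links, threshold=0.8):
    """Return one alert per link whose utilization exceeds the threshold."""
    alerts = []
    for link in links:
        utilization = link.load_mbps / link.capacity_mbps
        if utilization > threshold:
            # Export the large flows mapped to the congested link.
            alerts.append({
                "link": link.name,
                "utilization": utilization,
                "large_flows": list(link.large_flows),
            })
    return alerts

links = [
    ComponentLink("lag1/1", 10_000, 6_000, ["flowA"]),
    ComponentLink("lag1/2", 10_000, 2_000),
    ComponentLink("lag1/3", 10_000, 9_500, ["flowB", "flowC"]),
]
alerts = scan_links(links)
```

In this example only lag1/3 exceeds the threshold, so the central
management entity would receive a single alert listing flowB and
flowC as candidates for Step 4.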
The techniques described above are especially useful when bundling
links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
described in [I-D.ietf-rtgwg-cl-requirement].
3.1. Large Flow Recognition
3.1.1. Flow Identification
A flow (large flow or other flow) can be defined as a sequence of
packets for which ordered delivery should be maintained. Flows are
commonly identified by using any of the following sets of fields in a
packet header:
. Layer 2: source MAC address, destination MAC address, VLAN ID
. IP 5 tuple: IP Protocol, IP source address, IP destination
address, TCP/UDP source port, TCP/UDP destination port
. IP 3 tuple: IP Protocol, IP source address, IP destination
address
. MPLS Labels
. IPv6: IP source address, IP destination address and IPv6 flow
label (RFC 6437)
Flow identification is possible based on inner and/or outer headers
for tunneling protocols like GRE, VXLAN, NVGRE etc.
The above list is not exhaustive. The best practices described in
this document are agnostic to the fields that are used for flow
identification.
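To make the effect of the chosen field set concrete, the sketch below
builds a flow key from either the IP 5 tuple or the IP 3 tuple. The
Packet type is a hypothetical stand-in for parsed header fields and
does not correspond to any real packet library.

```python
from typing import NamedTuple, Optional

class Packet(NamedTuple):
    """Hypothetical parsed header fields."""
    ip_proto: int
    src_ip: str
    dst_ip: str
    src_port: Optional[int] = None
    dst_port: Optional[int] = None

def flow_key(pkt, mode="5-tuple"):
    """Build the tuple that identifies the flow a packet belongs to."""
    if mode == "5-tuple":
        return (pkt.ip_proto, pkt.src_ip, pkt.dst_ip,
                pkt.src_port, pkt.dst_port)
    if mode == "3-tuple":
        return (pkt.ip_proto, pkt.src_ip, pkt.dst_ip)
    raise ValueError(f"unknown mode: {mode}")

# Two TCP connections between the same pair of hosts.
p1 = Packet(6, "192.0.2.1", "198.51.100.7", 40000, 443)
p2 = Packet(6, "192.0.2.1", "198.51.100.7", 40001, 443)
```

With the 5 tuple the two connections are separate flows; with the 3
tuple they collapse into one flow, which is a coarser granularity.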
3.1.2. Sampling Techniques - sFlow/PSAMP
Enable sFlow [RFC3176] or PSAMP [RFC5475] sampling on all the egress
ports in the routers. Processing the samples in an sFlow collector
gives an approximate indication of the large flows mapped to each of
the component links in each LAG/ECMP group. The advantages and
disadvantages of sFlow/PSAMP are detailed below.
Advantages:
. Supported in most routers.
. Requires minimal router resources.
Disadvantages:
. Large flow recognition is slow rather than instantaneous.
The time taken to determine a candidate large flow depends on the
number of sFlow samples being generated and the processing power of
the external sFlow collector.
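A collector could estimate per-flow rates from packet samples roughly
as sketched below, assuming 1-in-N packet sampling so that each
sampled byte stands for about N bytes on the wire. The function name,
the input format, and the threshold are assumptions for illustration
and do not reflect the sFlow or PSAMP record formats.

```python
from collections import defaultdict

def estimate_large_flows(samples, sampling_rate, interval_s, threshold_bps):
    """Scale sampled bytes by the sampling rate to estimate bit rates.

    samples: iterable of (flow_key, packet_length_bytes) pairs taken
    with 1-in-sampling_rate sampling over interval_s seconds.
    Returns {flow_key: estimated_bps} for flows at or above threshold_bps.
    """
    sampled_bytes = defaultdict(int)
    for key, length in samples:
        sampled_bytes[key] += length
    estimates = {
        key: total * sampling_rate * 8 / interval_s
        for key, total in sampled_bytes.items()
    }
    return {k: bps for k, bps in estimates.items() if bps >= threshold_bps}

# Hypothetical 10-second window with 1-in-1000 sampling.
samples = [("flowA", 1500)] * 200 + [("flowB", 1500)] * 2
large = estimate_large_flows(samples, 1000, 10, threshold_bps=100e6)
```

Here flowA is estimated at 240 Mbps and reported as large, while
flowB, at about 2.4 Mbps, is not. With so few samples for flowB the
estimate is coarse, which is why recognition by sampling takes time.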
3.1.3. Automatic Hardware Recognition
Implementations may choose to recognize large flows automatically in
the hardware of a router. The characteristics of such an
implementation would be:
. Inline solution
. Maintain line-rate performance
. Perform accounting of large flows with a high degree of
accuracy
Using automatic hardware recognition of large flows, an accurate
indication of large flows mapped to each of the component links in a
LAG/ECMP group is available. The advantages and disadvantages of
automatic hardware recognition are:
Advantages:
. Accurate and in real-time
Disadvantages:
. Not supported in many routers
The measurement interval for determining a large flow and the
bandwidth threshold of a large flow would be programmable parameters
in the router.
The implementation of automatic hardware recognition of large flows
is vendor dependent. Below is a suggested technique.
Suggested Technique for Automatic Hardware Recognition
Step 1) If the flow already exists as a large flow in a hardware
table resource like TCAM, increment the counter of the flow. Else,
proceed to Step 2.
Step 2) There are multiple hash tables, each with a different hash
function. Each hash table entry has an associated counter. On packet
arrival, the flow is looked up in parallel in all the hash tables and
the corresponding counter in each table is incremented. If the
counters in all the hash tables exceed a programmed threshold within
a given time interval, a candidate large flow is learnt and
programmed in a hardware table resource like TCAM.
There may be some false positives due to multiple other flows
masquerading as a large flow; the number of false positives is
reduced by parallel hashing using different hash functions.
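A software model of the two steps above might look like the following
sketch. It is not a hardware design: the class name, the table sizes,
the byte-based counters, and the use of SHA-256 with a per-table
prefix as a stand-in for "different hash functions" are all
assumptions made for illustration.

```python
import hashlib

class LargeFlowDetector:
    def __init__(self, num_tables=3, table_size=1024, threshold_bytes=1_000_000):
        self.num_tables = num_tables
        self.table_size = table_size
        self.threshold = threshold_bytes
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.tcam = {}  # learnt large flows -> byte counter (models the TCAM)

    def _index(self, table_id, flow_key):
        # Prefixing the table id gives each table an independent hash.
        digest = hashlib.sha256(bytes([table_id]) + flow_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.table_size

    def on_packet(self, flow_key, length):
        # Step 1: the flow is already learnt, so just count it.
        if flow_key in self.tcam:
            self.tcam[flow_key] += length
            return
        # Step 2: bump the flow's counter in every hash table.
        exceeded = 0
        for t in range(self.num_tables):
            i = self._index(t, flow_key)
            self.tables[t][i] += length
            if self.tables[t][i] > self.threshold:
                exceeded += 1
        # Learn the flow only if all tables agree that it is large.
        if exceeded == self.num_tables:
            self.tcam[flow_key] = 0

    def end_interval(self):
        """Clear the hash-table counters when a measurement interval ends."""
        self.tables = [[0] * self.table_size for _ in range(self.num_tables)]

det = LargeFlowDetector(threshold_bytes=10_000)
for _ in range(10):
    det.on_packet("big", 1500)
```

In this run the flow "big" crosses the threshold on its seventh
packet and is learnt; the remaining three packets are then counted
against its TCAM entry. A small flow is learnt by mistake only if it
shares an over-threshold counter with other traffic in every table,
which is why adding tables lowers the false positive rate.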
3.2. Load Re-Balancing Options
Below are suggested techniques for load re-balancing. Equipment
vendors should implement all these techniques and allow the operator
to choose one or more techniques based on their applications.
3.2.1. Alternative Placement of Large Flows
In the LAG/ECMP group, choose the other member component links with
the least average port utilization. Move some large flow(s) from the
heavily loaded component link to those member component links using
a Policy Based Routing (PBR) rule in the ingress processing
element(s) in the routers. The key aspects of this are:
. Other flows are not subjected to flow re-ordering.
. Only the large flows that are moved are subjected to momentary
flow re-ordering.
Note that perfect re-balancing of large flows may not be possible
since flows arrive and depart at different times.
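One possible placement policy is sketched below: move the largest
flow on the congested link to the least-loaded other member, unless
the move would only shift the congestion. The choice of the largest
flow, the data structures, and the returned (flow, link) pair
standing in for a PBR rule are assumptions; the draft leaves the
selection policy open.

```python
def rebalance_large_flow(link_load, large_flows, congested):
    """Move one large flow off the congested link.

    link_load: {link: load_mbps}; large_flows: {flow: (link, rate_mbps)}.
    Returns (flow, new_link), or None if no useful move exists.
    """
    candidates = {flow: rate
                  for flow, (link, rate) in large_flows.items()
                  if link == congested}
    others = [link for link in link_load if link != congested]
    if not candidates or not others:
        return None
    flow = max(candidates, key=candidates.get)
    rate = candidates[flow]
    target = min(others, key=link_load.get)
    # Skip the move if the target would end up no better off than
    # the congested link is now.
    if link_load[target] + rate >= link_load[congested]:
        return None
    link_load[congested] -= rate
    link_load[target] += rate
    large_flows[flow] = (target, rate)
    return flow, target

# Loads loosely modeled on Figure 1 (values in Mbps are made up).
link_load = {1: 5000, 2: 1500, 3: 9000}
large_flows = {"L1": (1, 4000), "L2": (3, 4000), "L3": (3, 4000)}
move = rebalance_large_flow(link_load, large_flows, congested=3)
```

In this example one large flow moves from link 3 to link 2, leaving
each link with exactly one large flow.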
3.2.2. Redistributing Other Flows
Some large flows may consume the entire bandwidth of the component
link(s). In this case, it would be desirable for the other flows to
not use the congested component link(s). This can be accomplished in
one of the following ways.
3.2.2.1. Redistributing All Other Flows
This works on existing router hardware. The idea is to prevent the
other flows from hashing into the congested component link(s).
. Modify the LAG/ECMP table to only include the non-congested
component link(s). The other flows hash into this table to be
mapped to a destination component link.
. All the other flows are subject to momentary flow re-ordering.
. The PBR rules for large flows (refer to Section 3.2.1) have
strict precedence over the LAG/ECMP table lookup result.
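The lookup order described in this section could be modeled as in the
sketch below. The function and parameter names, the CRC32 hash, and
the fallback to the full member list when every link is congested are
assumptions made for illustration.

```python
import zlib

def forward(flow_key, lag_members, congested, pbr_rules):
    """Pick the egress component link for a packet.

    pbr_rules pins large flows to links and takes strict precedence.
    Every other flow hashes into a table that holds only the members
    that are not congested.
    """
    if flow_key in pbr_rules:
        return pbr_rules[flow_key]
    table = [m for m in lag_members if m not in congested] or lag_members
    return table[zlib.crc32(flow_key.encode()) % len(table)]

members = [1, 2, 3]
pbr = {"big": 3}  # the large flow stays pinned to the congested link
egress = {f"flow{i}": forward(f"flow{i}", members, {3}, pbr)
          for i in range(100)}
```

Since the table shrinks from three entries to two, flows that
previously hashed to links 1 or 2 may also change links, which is why
all the other flows are subject to momentary re-ordering.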
3.2.2.2. Redistributing the Other Flows on the Congested Link
This needs a switch/router hardware change.
. If a packet belongs to one of the other flows and is hashed to a
congested component link, apply a second hash to it, which maps
the flow to one of the non-congested component links.
. The other flows originally directed to the congested link are
re-directed to other non-congested component links.
. The other flows originally directed to a congested component
link are subject to momentary flow re-ordering.
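The second-hash approach can be modeled as follows. The two salted
SHA-256 hashes are stand-ins for whatever hash functions the hardware
provides, and large flows are assumed to have been handled already by
the PBR rules of Section 3.2.1; both points are assumptions of this
sketch.

```python
import hashlib

def _hash(flow_key, salt):
    digest = hashlib.sha256(salt + flow_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def forward_with_rehash(flow_key, lag_members, congested):
    """Hash over all members; re-hash only if the result is congested."""
    first = lag_members[_hash(flow_key, b"h1") % len(lag_members)]
    if first not in congested:
        return first  # flows on non-congested links are left untouched
    healthy = [m for m in lag_members if m not in congested]
    if not healthy:
        return first  # every member is congested; nowhere better to go
    return healthy[_hash(flow_key, b"h2") % len(healthy)]

members = [1, 2, 3]
before = {f"flow{i}": forward_with_rehash(f"flow{i}", members, set())
          for i in range(200)}
after = {k: forward_with_rehash(k, members, {3}) for k in before}
```

Unlike shrinking the LAG/ECMP table, only the flows that originally
hashed to the congested link change links here; all others keep their
original mapping.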
3.2.3. Component Link Protection Considerations
If desired, certain component links may be reserved for link
protection. These reserved component links are not used by any of the
flow placements described in Section 3.2. When component link(s)
fail, all the flows on the failed component link(s) are moved to the
reserved component link(s). In the table that maps large flows to
component links, each reference to the failed component link is
simply replaced with a reference to the reserved link. The same
replacement is made in the LAG/ECMP hash table.
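The pointer replacement can be modeled with two small tables, as in
the sketch below. The table layouts and link names are assumptions
for illustration only.

```python
def fail_over(failed, reserved, large_flow_map, lag_table):
    """Point every reference to the failed link at the reserved link."""
    for flow, link in large_flow_map.items():
        if link == failed:
            large_flow_map[flow] = reserved
    for i, link in enumerate(lag_table):
        if link == failed:
            lag_table[i] = reserved

# Link "r" is reserved and appears in neither table before the failure.
large_flow_map = {"L1": 1, "L2": 2, "L3": 3}
lag_table = [1, 2, 3, 1, 2, 3]  # hash buckets -> component link
fail_over(failed=3, reserved="r",
          large_flow_map=large_flow_map, lag_table=lag_table)
```

Flows on the surviving links keep their mapping, so only the flows
that were on the failed link are disturbed.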
3.2.4. Load Re-Balancing Example
Optimal LAG/ECMP component link utilization for the use case in
Figure 1 is depicted below in Figure 2. The large flow rebalancing
explained in Section 3.2.1 is used. The improved link utilization is
as follows:
. Component link (1) has 2 other flows and 1 large flow and the
link utilization is normal.
. Component link (2) has 3 other flows and 1 large flow and the
link utilization is normal now.
. Component link (3) has 2 other flows and 1 large flow and the
link utilization is normal now.
+-----------+ +-----------+
| | -> -> | |
| |=====> | |
| (1)|--/---/-|(1) |
| | | |
| |=====> | |
| (R1) |-> -> ->| (R2) |
| (2)|--/---/-|(2) |
| | | |
| | | |
| | -> -> | |
| |=====> | |
| (3)|--/---/-|(3) |
| | | |
+-----------+ +-----------+
Where: ->-> other flows
===> large flow
Figure 2: Evenly Utilized Component Links
4. Operational Considerations
For future study. The authors would like to get operators' input
here.
5. Data Model Considerations
For Step 2 in Section 3, IETF could potentially consider a standards-
based activity around, say, a data-model used to move the long-lived
large flow information from the router to the central management
entity.
For Step 4 in Section 3, IETF could potentially consider a standards-
based activity around, say, a data-model used to move the long-lived
large flow re-balancing information from the central management
entity to the router.
6. IANA Considerations
This memo includes no request to IANA.
7. Security Considerations
This document does not directly impact the security of the Internet
infrastructure or its applications. In fact, the techniques could
help in the case of a DOS attack pattern that causes a hash imbalance
and thereby heavily overloads certain LAG/ECMP component links with
large flows.
8. Acknowledgements
The authors would like to thank Shane Amante for all the support and
valuable input. The authors would like to thank Curtis Villamizar
for his valuable input. The authors would also like to thank Fred
Baker and Wes George for their input.
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2234] Crocker, D. and Overell, P.(Editors), "Augmented BNF for
Syntax Specifications: ABNF", RFC 2234, Internet Mail
Consortium and Demon Internet Ltd., November 1997.
9.2. Informative References
[I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al., "Requirements
for MPLS Over a Composite Link", June 2012.
[RFC6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS
Forwarding", RFC 6790, November 2012.
[CAIDA] CAIDA Internet Traffic Analysis, www.caida.org/home.
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.
[ITCOM] Jo, J., et al., "Internet traffic load balancing using
dynamic hashing with flow volume", SPIE ITCOM, 2002.
[RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
Multicast Next-Hop Selection", RFC 2991, November 2000.
[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
Algorithm", RFC 2992, November 2000.
[RFC5475] Zseby, T., et al., "Sampling and Filtering Techniques for
IP Packet Selection", RFC 5475, March 2009.
[RFC3176] Phaal, P., et al., "InMon Corporation's sFlow: A Method for
Monitoring Traffic in Switched and Routed Networks", RFC 3176,
September 2001.
Appendix A. Internet Traffic Analysis and Load Balancing Simulation
Internet traffic [CAIDA] has been analyzed in terms of packet volume
per flow. The five tuple in the packet header (IP addresses, TCP/UDP
ports, and IP protocol) is used for flow identification. The analysis
indicates that fewer than ~2% of the flows, the top-ranked ones by
rate, account for ~30% of the total traffic volume, while the
remaining >98% of the flows contribute ~70% in total [YONG].
The simulation has shown that, given the Internet traffic pattern,
the hash method does not evenly distribute the flows over ECMP paths.
Some links may be >90% loaded while others may be <40% loaded. The
more ECMP paths exist, the more severe the imbalance. This implies
that hash-based distribution can cause some paths to become congested
while other paths are only partially filled [YONG].
The simulation also shows a substantial improvement when using the
large-flow-aware hashing distribution technique described in this
document. Using the same simulated traffic, the improved rebalancing
can achieve <10% load differences among the links. This shows that
large-flow-aware hash distribution can effectively compensate for the
uneven load balancing caused by hashing and the traffic pattern.
Authors' Addresses
Ram Krishnan
Brocade Communications
San Jose, 95134, USA
Phone: +001-408-406-7890
Email: ramk@brocade.com
Sanjay Khanna
Brocade Communications
San Jose, 95134, USA
Phone: +001-408-333-4850
Email: skhanna@brocade.com
Lucy Yong
Huawei USA
5340 Legacy Drive
Plano, TX 75025, USA
Phone: 469-277-5837
Email: lucy.yong@huawei.com
Anoop Ghanwani
Dell
San Jose, CA 95134
Phone: (408) 571-3228
Email: anoop@alumni.duke.edu
Ning So
Tata Communications
Plano, TX 75082, USA
Phone: +001-972-955-0914
Email: ning.so@tatacommunications.com
Bhumip Khasnabish
ZTE Corporation
New Jersey, 07960, USA
Phone: +001-781-752-8003
Email: bhumip.khasnabish@zteusa.com