
Last Call Review of draft-ietf-opsawg-large-flow-load-balancing-11
review-ietf-opsawg-large-flow-load-balancing-11-opsdir-lc-pignataro-2014-05-15-00

Request Review of draft-ietf-opsawg-large-flow-load-balancing
Requested revision No specific revision (document currently at 15)
Type Last Call Review
Team Ops Directorate (opsdir)
Deadline 2014-05-06
Requested 2014-04-24
Authors Ramki Krishnan, Lucy Yong, Anoop Ghanwani, Ning So, Bhumip Khasnabish
I-D last updated 2014-05-15
Completed reviews Genart Last Call review of -11 by Martin Thomson (diff)
Secdir Last Call review of -11 by Yoav Nir (diff)
Secdir Telechat review of -15 by Yoav Nir
Opsdir Last Call review of -11 by Carlos Pignataro (diff)
Opsdir Telechat review of -15 by Carlos Pignataro
Assignment Reviewer Carlos Pignataro
State Completed
Request Last Call review on draft-ietf-opsawg-large-flow-load-balancing by Ops Directorate Assigned
Reviewed revision 11 (document currently at 15)
Result Not ready
Completed 2014-05-15
Hi!

I have reviewed this document as part of the Operational directorate's ongoing
effort to review all IETF documents being processed by the IESG. These comments
were written primarily for the benefit of the operational area directors.
Document editors and WG chairs should treat these comments just like any other
last call comments.

This document is on the Informational track, providing mechanisms for
optimization of LAG/ECMP load-balancing.

Summary: Not ready, there are issues to solve

The document offers a thorough analysis and presents a taxonomy and lexicon to
talk about LAG/ECMP load-balancing. It also presents various options with pros
and cons of the mechanics presented in this document. This is all very useful
and helpful.

I do have a number of questions and comments included below, categorized as
Major and Minor. I apologize in advance if I misunderstood anything in the
document that led to these questions and observations, and I hope these are
useful.

Major:

Section 4 describes the trade-offs and limitations of local-only optimizations.
However, what this document describes is an active (stateful) mechanism, as
opposed to a hash-based passive (stateless) mechanism. There should be a
section on operational considerations of stateful LAG/ECMP load balancing,
given that monitoring flows degrades forwarding performance, requires state
maintenance, etc.

4.2. Operational Overview
...
   Step 2) The egress component links are periodically scanned for link
   utilization and the imbalance for the LAG/ECMP group is monitored. If
   the imbalance exceeds a certain imbalance threshold, then re-
   balancing is triggered. Measurement of the imbalance is discussed
   further in 5.1. Additional criteria may also be used to determine
   whether or not to trigger rebalancing, such as the maximum
   utilization of any of the component links, in addition to the
   imbalance.

If the egress component links of an ECMP group are measured, but those links
are on different routers, how is this a local-only method, and how is the loop
closed and "rebalancing required" signaled?

Take for example:

   +--B
A==+
   +--C

If B and C measure imbalance, how do they know they belong to the same ECMP?

The doc says:

   All of the steps identified above can be done locally within the
   router itself or could involve the use of a central management
   entity.

But I am not sure how some of these are done locally only, and also the
"central management entity" seems underspecified.

5.1. Configuration Parameters for Flow Rebalancing
...

Also, this paragraph and the document as a whole define a number of variables,
such as the "imbalance threshold" and the "max utilization of any component
links". From an operational perspective: how are those values set? What are
their defaults? What are appropriate ranges and values? Section 5 describes the
parameters nicely, but does not give guidance on default values and ranges.
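
To make the question concrete, the sketch below shows where these values would
enter a rebalancing trigger. It is only an illustration: the `RebalanceConfig`
name, both default values, and the imbalance formula (largest deviation from
the capacity-weighted mean utilization, relative to that mean) are assumptions,
not taken from the draft.

```python
from dataclasses import dataclass


@dataclass
class RebalanceConfig:
    # Hypothetical defaults: the draft gives no defaults or ranges.
    imbalance_threshold: float = 0.2   # fraction of the mean utilization
    max_link_utilization: float = 0.9  # fraction of link capacity


def imbalance(utilizations, speeds):
    """Largest deviation from the capacity-weighted mean utilization,
    relative to that mean (assumed definition, for illustration only)."""
    mean = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    if mean == 0:
        return 0.0
    return max(abs(u - mean) for u in utilizations) / mean


def should_rebalance(utilizations, speeds, cfg):
    """Trigger only when the group is imbalanced and some link is hot."""
    return (imbalance(utilizations, speeds) > cfg.imbalance_threshold
            and max(utilizations) > cfg.max_link_utilization)
```

With three 10G links at 95%, 50% and 50% utilization this trigger fires. How
sensitive that outcome is to the two defaults is what the draft leaves open.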

4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   typically identified using one or more fields from the packet header,
   for example:

     .  Layer 2: source MAC address, destination MAC address, VLAN ID.

     .  IP header: IP Protocol, IP source address, IP destination
        address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
        destination port.

Are these only applicable to TCP and UDP traffic? I think there needs to be a
more exhaustive list of transports for this to be useful.
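
As an illustration of the gap, here is a minimal sketch of a flow key built
from the fields listed above. The `FlowKey` type, the packet dictionary layout,
and the choice to leave ports empty for anything other than TCP and UDP are
assumptions made for this example, not behavior described in the draft.

```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int                     # IP protocol number
    src_port: Optional[int] = None    # only set for transports with ports
    dst_port: Optional[int] = None
    flow_label: Optional[int] = None  # IPv6 only


# Assumption: only TCP (6) and UDP (17) ports are understood.
PORT_PROTOCOLS = {6, 17}


def flow_key(pkt: dict) -> FlowKey:
    """Build a flow key from a parsed packet (hypothetical dict layout)."""
    has_ports = pkt["protocol"] in PORT_PROTOCOLS
    return FlowKey(
        pkt["src_ip"],
        pkt["dst_ip"],
        pkt["protocol"],
        pkt.get("src_port") if has_ports else None,
        pkt.get("dst_port") if has_ports else None,
        pkt.get("flow_label"),
    )
```

Under these assumptions a GRE packet (IP protocol 47) reduces to addresses and
protocol only, so all traffic between two tunnel endpoints collapses into one
flow.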

   For tunneling protocols like Generic Routing Encapsulation (GRE)
   [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN],
   Network Virtualization using Generic Routing Encapsulation (NVGRE)
   [NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow
   identification is possible based on inner and/or outer headers.

Please add L2TPv3 as a key tunneling protocol.

Also, for tunneling protocols, there is more to consider than inner and/or
outer headers: there is typically a tunnel header as well, with fields such as
the GRE Key or the L2TPv3 Session ID. Sometimes these summarize a flow
decision. You might also want to look at (and reference) RFC 5640,
"Load-Balancing for Mesh Softwires".

4.3.2. Criteria and Techniques for Large Flow Recognition

   From a bandwidth and time duration perspective, in order to recognize
   large flows we define an observation interval and observe the
   bandwidth of the flow over that interval.  A flow that exceeds a
   certain minimum bandwidth threshold over that observation interval
   would be considered a large flow.

From an operational standpoint, these techniques appear under-specified. How
are the thresholds, time intervals, etc. configured? What are the defaults?
What are appropriate ranges?
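
The recognition rule in the quoted paragraph can be sketched as follows, which
makes the two unspecified knobs explicit. The function name, the packet
representation, and the parameter names `interval_s` and `min_bw_bps` are
assumptions for illustration; no values are suggested because the draft gives
none.

```python
from collections import defaultdict


def large_flows(packets, interval_s, min_bw_bps):
    """Return the flow keys whose average rate over one observation
    interval exceeds the minimum bandwidth threshold.

    packets: iterable of (flow_key, size_bytes) seen during the interval.
    """
    bytes_per_flow = defaultdict(int)
    for key, size in packets:
        bytes_per_flow[key] += size
    # Convert the per-flow byte count to bits per second before comparing.
    return {key for key, total in bytes_per_flow.items()
            if total * 8 / interval_s > min_bw_bps}
```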

Sections 4.3 and 4.4 present different techniques for sampling and
re-balancing, respectively. The analyses are very useful. It would be really
helpful to have a table summarizing all the different options and associated
pros and cons, and perhaps some applicability-based recommendations.

5.2. System Configuration and Identification Parameters
...
How are those parameters (besides an IP address) defined? What is a "LAG ID"?
A UTF-8 string? A 64-bit unsigned integer?

5.3. Information for Alternative Placement of Large Flows

See comment above regarding transport protocols and tunnels.

5.6. Monitoring information

5.6.1. Interface (link) utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and
   interface speed (ifSpeed) can be measured from the Interface table
   (iftable) MIB [RFC 1213].

Why are these algorithms using MIBs only?
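
The utilization computation itself does not depend on where the counters come
from. A minimal sketch, assuming two octet-counter samples taken `elapsed_s`
seconds apart and a link speed in bits per second (as ifSpeed reports it); the
function name and the single-wrap handling are illustrative assumptions:

```python
# ifInOctets and ifOutOctets are 32-bit counters in RFC 1213.
COUNTER32_MODULUS = 2 ** 32


def link_utilization(octets_t0, octets_t1, elapsed_s, if_speed_bps):
    """Fraction of link capacity used between two counter samples."""
    # The modulo tolerates at most one counter wrap between samples.
    delta_octets = (octets_t1 - octets_t0) % COUNTER32_MODULUS
    return delta_octets * 8 / (elapsed_s * if_speed_bps)
```

Any source that can supply the same two samples (not only an SNMP poll of the
ifTable) would feed this computation equally well.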

Minor:

I think it is confusing to talk about "short-lived large flows" referring to
them as "small flows". In fact I think it is potentially very confusing. I'd
recommend creating a new term.

The introduction cites several numbers (5% of link bandwidth, 10s/100s of
flows, etc.), but from an operational standpoint it is not clear how those
vary or whether they are tied to a specific set of use cases. Further, it is
not clear how they influence the different algorithms. Maybe the answer is to
put caps on them, or something else, but it would help to be more prescriptive
about applicability.

1.2. Terminology

   ECMP table: A table that is used as the nexthop of an ECMP route that
   comprises the set of component links and the weights associated with
   each of those component links.  The weights are used to determine
   which values of the hash function map to a given component link.

It is not clear what the "weights" are if this is ECMP and not UCMP (U for
Unequal).

Also, "a table used as the next hop" is confusing.
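
One possible reading of the weights is sketched below: each component link
receives a share of hash buckets proportional to its weight, so equal weights
give ECMP behavior and unequal weights give UCMP behavior. The bucket count,
the use of CRC-32 as the hash, and the function names are assumptions for
illustration; the draft does not specify them.

```python
import zlib


def build_table(weights, buckets=16):
    """Expand {link: weight} into a list of hash buckets, giving each
    link a share proportional to its weight (hypothetical layout)."""
    total = sum(weights.values())
    table = []
    for link, weight in weights.items():
        table.extend([link] * round(buckets * weight / total))
    return table


def select_link(table, flow_key):
    """Map a flow key (bytes) to a component link via a hash of the key."""
    return table[zlib.crc32(flow_key) % len(table)]
```

For example, `build_table({"link1": 1, "link2": 1, "link3": 2})` assigns half
of the 16 buckets to `link3`, which is unequal-cost behavior even though the
table is called an ECMP table.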

   LAG table: A table that is used as the output port which is a LAG
   that comprises the set of component links and the weights associated
   with each of those component links.

What is the input? Or, what is the LAG table associated with (given that it is
not a route)?

                Figure 2: Unevenly Utilized Component Links

I am not sure how realistic the example in Section 3, Figure 2 is, if only two
flows congest a member link...

4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization

   The suggested mechanisms in this draft are about a local optimization
   solution; they are local in the sense that both the identification of
   large flows and re-balancing of the load can be accomplished
   completely within individual nodes in the network without the need
   for interaction with other nodes.

It is not clear to me how a local-only node can deal with node polarization in
ECMP networks. A small explanation of this could help.

     .  Component Link Weight: The relative weight to be applied to
        traffic for a given component link when using hash-based
        techniques for load distribution.

Is this for ECMP or UCMP?

11. References

11.1. Normative References

11.2. Informative References

I would have expected that many of these references are Normative (i.e., needed
to understand the document). Yes, the doc is Informational. The meaning of
Normative vs. Informative still remains.

Hope this helps.

Thanks,

-- Carlos.