Network Working Group H. Chen
Internet-Draft W. Song
Intended status: Standards Track Huawei Technologies
Expires: April 30, 2015 October 27, 2014
Load balancing without packet reordering in NVO3
draft-chen-nvo3-load-banlancing-00
Abstract
Traditional ECMP cannot balance loads well in the data center
network because it splits loads at the granularity of a flow. Packets
belonging to a single flow have to be delivered along the same path.
Though this avoids packet reordering, it may degrade
bandwidth utilization.
This document describes a method of splitting a single flow across
multiple parallel paths without causing packet reordering, which is
more effective when large flows exist. The specific path selection
algorithm is not discussed in this document.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 30, 2015.
Chen & Song Expires April 30, 2015 [Page 1]
Internet-Draft Load Balancing October 2014
Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Rationale for flowlet-based splitting . . . . . . . . . . . 3
4. Flowlet-based load balancing . . . . . . . . . . . . . . . . 5
4.1. Unicast . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2. Multicast . . . . . . . . . . . . . . . . . . . . . . . . 6
5. The state machine . . . . . . . . . . . . . . . . . . . . . . 6
6. Header extension examples . . . . . . . . . . . . . . . . . . 7
6.1. VXLAN header extension . . . . . . . . . . . . . . . . . 7
6.2. NVGRE header extension . . . . . . . . . . . . . . . . . 8
7. Acknowledge frame format . . . . . . . . . . . . . . . . . . 9
8. Security Considerations . . . . . . . . . . . . . . . . . . . 9
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 9
10.1. Normative References . . . . . . . . . . . . . . . . . . 9
10.2. Informative References . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10
1. Introduction
Large flows are not rare in current data center networks. Typical
examples include: 1) copying large amounts of data during
virtual machine migration, 2) storage traffic when employing the
iSCSI technique. In order to increase bandwidth utilization, ECMP
routing is introduced to balance the loads. However, the existing ECMP
technique splits loads at the granularity of a flow, which means
all packets belonging to a single flow have to be delivered along the same
path. Though ECMP is able to avoid packet reordering, it may degrade
bandwidth utilization.
One basic idea to increase bandwidth utilization is to split a
single flow into several bursts of packets and deliver them along
parallel paths. The requirement for the splitting method is that
reordering be avoided. Flowlet-based splitting [FLARE] can meet
this requirement. A flowlet is defined as a burst of packets from a
single flow that is separated from other bursts by a large enough gap.
Utilizing the time gap between consecutive bursts of packets from a
single flow, flowlet-based ECMP splits a large flow into flowlets
provided that the time gap is larger than the path delay. These
flowlets are delivered along multiple parallel paths, and reordering
does not happen because the packets still arrive in sequence.
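The gap rule described above can be sketched in a few lines. This is a minimal illustration, not part of the mechanism specified in this document: the function name split_into_flowlets, the parameter min_gap_us (standing in for the path delay), and the use of integer microsecond timestamps are all assumptions made for the example.

```python
from typing import List


def split_into_flowlets(send_times_us: List[int], min_gap_us: int) -> List[List[int]]:
    """Group packet send times (microseconds, ascending) into flowlets.

    A new flowlet starts whenever the idle time since the previous packet
    is larger than min_gap_us (assumed here to stand in for the path delay).
    """
    flowlets: List[List[int]] = []
    for t in send_times_us:
        if flowlets and t - flowlets[-1][-1] <= min_gap_us:
            flowlets[-1].append(t)   # same burst: stays on the current path
        else:
            flowlets.append([t])     # gap exceeded: this burst may use another path
    return flowlets
```

For example, with a 5000 microsecond threshold, packets sent at 0, 1000, 2000, 10000 and 11000 microseconds form two flowlets, because only the 8000 microsecond pause exceeds the threshold.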
2. Terminology
This document makes use of the following terms; additional terms are
defined in [RFC7348]:
ECMP Equal-Cost Multipath
iSCSI Internet Small Computer Systems Interface
NVGRE Network Virtualization using Generic Routing Encapsulation
NVO3 Network Virtualization over Layer 3
VM Virtual Machine
VXLAN Virtual eXtensible Local Area Network
3. Rationale for flowlet-based splitting
In data center networks, more than 90% of the load is delivered over TCP.
For a TCP flow, reordering becomes harmful when three or more
packets are received before a "late" packet. In this case TCP
enters fast-retransmit mode, which consumes extra bandwidth (and
could potentially cause more loss, decreasing throughput) as it
attempts to unnecessarily retransmit the delayed packet(s) [RFC2991].
For this reason, per-packet ECMP, which randomly hashes packets to
paths, is rarely used in modern data center networks.
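The "three or more packets before a late packet" condition can be checked on an arrival trace. The sketch below is a deliberately simplified model (it ignores loss, delayed ACKs and selective acknowledgments); the function name and the dup_threshold parameter are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def triggers_fast_retransmit(arrival_order: Sequence[int], dup_threshold: int = 3) -> bool:
    """Return True if some packet arrives after at least dup_threshold
    packets that carry higher sequence numbers, i.e. it is "late" enough
    for the receiver to have generated that many duplicate ACKs."""
    for i, seq in enumerate(arrival_order):
        # Count packets that overtook this one on the way to the receiver.
        overtaken_by = sum(1 for earlier in arrival_order[:i] if earlier > seq)
        if overtaken_by >= dup_threshold:
            return True
    return False
```

Under this model, the arrival order 1, 3, 4, 5, 2 would trigger a spurious fast retransmit for packet 2, while 1, 3, 2, 4, 5 would not.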
MPTCP [RFC6182] is one feasible method to increase bandwidth without
causing packet reordering. But it adds more complexity to an
already complex transport layer burdened by new requirements such as
low latency and burst tolerance in data centers [CONGA].
Besides, load balancing is best done in the network. The transport
layer should not be complicated further. Specifically, the existing TCP
protocol should be utilized without modification.
Flowlet-based switching can meet this requirement, especially for the
leaf-spine topologies in data center networks. Flowlets are bursts of
packets from a single flow that are separated by large enough idle
intervals, referred to here as gaps. Split into several flowlets, a large
flow can be delivered across multiple parallel paths rather than
along a single path all the while. In this case, potential
congestion can be avoided and bandwidth utilization is increased.
The idle intervals between consecutive packets are inherent in a
TCP flow due to TCP's burstiness. As shown in Figure 1, given two
consecutive packets in a TCP flow, if the first packet reaches the
egress NVE before the second packet leaves the ingress NVE, the
ingress NVE can route the second packet (and subsequent packets from
this flow) on to another available path with no threat of reordering.
.................
. .
. ----------- .
+-------+ . / \ . +-------+
TCP |Ingress| ./ L3 overlay \.Pkt1 | Egress|
--flow --->| NVE |-----. Network .->---| NVE |---->
| | .\ /. | |
+-------+ . \Pkt2 / . +-------+
. ->--------- .
. .
.................
Figure 1: Rationale of splitting TCP flow into flowlets
If no packets of this TCP flow are sent out from the Ingress NVE
during the time interval in which the previous packet travels to the
Egress NVE, then this time interval can be considered large enough to
be used to split the TCP flow. In order to find the 'gap', the Egress
NVE may reply with an acknowledge packet for each received packet,
with some information to identify which packet it replies to.
The Ingress NVE may decide whether this time interval is large enough
by comparing the identification of the latest sent packet with that of
the received acknowledge packet. If the time interval is large
enough, the two identifications are equal, which means no further
packets of this flow were sent out during the interval.
Otherwise, some packets must have been sent out during the
interval, so it cannot be considered a large enough gap to be
used to split the TCP flow. The identification mentioned here should
include the flow ID and the packet's sequence ID within the flow.
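The comparison described above reduces to an equality test on the (flow ID, sequence ID) pair. The following is a minimal sketch; the PacketId type and the function name gap_is_large_enough are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketId:
    """Identification carried by a data packet and echoed in its acknowledge packet."""
    flow_id: int
    seq_id: int


def gap_is_large_enough(latest_sent: PacketId, acked: PacketId) -> bool:
    """True when the acknowledge packet refers to the latest packet sent,
    meaning nothing else from this flow left the Ingress NVE in between."""
    return latest_sent == acked
```

If the Ingress NVE has already sent sequence ID 8 of a flow when the acknowledge packet for sequence ID 7 arrives, the identifications differ and the flow stays on its current path.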
4. Flowlet-based load balancing
4.1. Unicast
For unicast traffic, the NVE will process the outgoing/incoming
packets as described below:
1. The Ingress NVE computes the identifier for the incoming flow.
Packets from this flow will be populated with the same flow ID.
2. Packets from a single flow will be indexed by a sequence ID
field in an incremental manner. For example, the first packet
has sequence ID 0, the next packet has sequence ID 1,
and so on.
3. For packets originating from the Ingress NVE, the sender
flag in the outer header will be set to 1 and the receiver flag
will be set to 0 to indicate that it is a data packet rather than
an acknowledge packet.
4. The Ingress NVE has to maintain a flow state table for the
active flows, with each entry recording the flow ID and sequence
ID. Notice that the communication is full-duplex: each NVE could
act as the Ingress NVE for one outgoing flow and as the Egress NVE
for another incoming flow at the same time. So each NVE may
have a flow state table for all of its outgoing TCP flows.
5. There is also an aging time associated with the flow state table.
The aging time can be configured through the NVE's management
interface. One option for calculating this value is to follow the
approach used in [TCP]. In this way the flow state table can be
kept small and will not take too many system resources.
6. The Egress NVE will reply to the Ingress NVE with an acknowledge
packet after successfully receiving each packet. The
acknowledge packet is an encapsulated IPv4 packet with a vacant
payload. Its source IPv4 address field will be populated with
the Egress NVE's IP address and its destination IPv4 address
will be populated with the Ingress NVE's IP address.
7. In the acknowledge packet, the sender flag in the outer header
will be set to 0 and the receiver flag will be set to 1 to
indicate that it is an acknowledge packet. The flow ID field and
sequence ID field of the acknowledge packet will be copied from
the corresponding incoming packet directly.
8. On receiving the acknowledge packet, the Ingress NVE will look
up its flow state table to find whether any entry has the same flow
ID as the acknowledge packet. If there is no matching
entry, the Ingress NVE will drop the acknowledge packet.
9. If the Ingress NVE finds that there is a matching entry, it will
compare the sequence ID field of this entry with the sequence ID
field in the outer header of the acknowledge packet.
10. If the two values are equal, no subsequent packets from this
flow were sent from the Ingress NVE before this acknowledge
packet was received. So it can be assumed
that the time interval between this sent packet and its
subsequent packet is large enough. In this case, the Ingress
NVE can move this flow to another path according to the
routing selection algorithm without causing packet reordering.
11. Otherwise, subsequent packets of this flow must have been
sent before the acknowledge packet was received. This indicates that
the time interval is not large enough and packet reordering may
happen if the flow is switched to another path. So the Ingress NVE
will keep the current path for this flow until a large enough gap
appears.
flow ID sequence ID
+-------------+---------------+---------------+---------------+
| flow ID A | sequence A1 | sequence A2 | ... |
+-------------+---------------+---------------+---------------+
| flow ID B | sequence B1 | sequence B2 | ... |
+-------------+---------------+---------------+---------------+
| ... | ... | ... | ... |
+-------------+---------------+---------------+---------------+
| flow ID X | sequence X1 | sequence X2 | ... |
+-------------+---------------+---------------+---------------+
Figure 2: flow state table resides in NVE
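The Ingress NVE side of the procedure above can be sketched as a small table keyed by flow ID. This is an illustration under stated assumptions, not the specified implementation: the class and method names are hypothetical, round-robin path choice is only a placeholder (path selection is out of scope for this document), each entry keeps just the latest sequence ID, and sequence ID wrap-around is not handled.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class FlowEntry:
    seq_id: int        # sequence ID of the latest packet sent for this flow
    path: int          # index of the path currently used by this flow
    last_seen: float   # time of the last transmission, used for aging


class FlowStateTable:
    def __init__(self, num_paths: int, aging_time: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.num_paths = num_paths
        self.aging_time = aging_time
        self.clock = clock
        self.entries: Dict[int, FlowEntry] = {}

    def on_send(self, flow_id: int) -> Tuple[int, int]:
        """Steps 1-4: return (sequence ID, path) for an outgoing packet."""
        now = self.clock()
        entry = self.entries.get(flow_id)
        if entry is None:
            entry = FlowEntry(seq_id=0, path=0, last_seen=now)
            self.entries[flow_id] = entry
        else:
            entry.seq_id += 1
            entry.last_seen = now
        return entry.seq_id, entry.path

    def on_ack(self, flow_id: int, seq_id: int) -> bool:
        """Steps 8-11: return True if the flow was moved to another path."""
        entry = self.entries.get(flow_id)
        if entry is None:
            return False          # step 8: no matching entry, drop the acknowledge packet
        if entry.seq_id != seq_id:
            return False          # step 11: newer packets are in flight, keep the path
        # Step 10: the gap is large enough. Round-robin is a placeholder choice.
        entry.path = (entry.path + 1) % self.num_paths
        return True

    def expire(self) -> int:
        """Step 5: remove entries idle for longer than aging_time."""
        now = self.clock()
        stale = [f for f, e in self.entries.items()
                 if now - e.last_seen > self.aging_time]
        for flow_id in stale:
            del self.entries[flow_id]
        return len(stale)
```

The clock is injectable so that aging can be exercised deterministically; a real NVE would more likely run expiry from a timer.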
4.2. Multicast
For multicast traffic, the load balancing mechanism will not be
employed. Multicast packets will be routed according to the
existing routing techniques.
5. The state machine
+---------+
| init | Reset Aging Timer
+---------+
|
v
+------------+
| Recv(pkt) |
+------------+
from NVE | from host
+---------------------v-------------+
| |
v v
+-----------------+ +-------------------+
|pkt.hdr.Tflag==1?| |GenerateflowID(pkt)|
+-----------------+ +-------------------+
Yes | No |
+-----------v--------+ v
| | +-------------------+
v v |any match entry in |
+------------------+ +-----------------+ |flow state table ? |
| pkt.hdr.seqID | | forward to upper | +-------------------+
| == | | layer for further| No | Yes
| this.entry.seqID?| | processing | +-------- v---------+
+------------------+ +-----------------+ | |
Yes | No | |
+--v---------------+ v v
| | +-------------------+ +--------------------+
v v | new flow, create | | existing flow, |
+-------------+ +------------+ | an entry for it. | | this.entry.seqID ++|
| MATCH | |Do NOT MATCH| +-------------------+ +--------------------+
| switch path | | maintain | | |
+-------------+ +------------+ v v
+-------------------+ +--------------------+
|this.entry.flowID =| | forward pkt to path |
| pkt.hdr.flowID |---->| selection module |
|this.entry.seqID =0| | |
+-------------------+ +--------------------+
Figure 3: The state machine
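The branches of Figure 3 can be written as a single dispatch function. The sketch below is one possible reading of the figure, with hypothetical names (Action, Packet, step) and a plain dictionary mapping flow ID to the latest sequence ID. Aging is omitted, and an acknowledge packet with no matching entry is treated as "maintain" here, as in the figure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Action(Enum):
    SWITCH_PATH = "switch path"
    MAINTAIN_PATH = "maintain path"
    FORWARD_UP = "forward to upper layer"
    PATH_SELECTION = "forward to path selection module"


@dataclass
class Packet:
    from_nve: bool    # True: received from a peer NVE; False: received from a host
    flow_id: int      # flow ID from the header, or computed for host packets
    t_flag: int = 0   # receiver flag (T)
    seq_id: int = 0


def step(table: Dict[int, int], pkt: Packet) -> Action:
    """Process one received packet following the branches of Figure 3."""
    if pkt.from_nve:
        if pkt.t_flag != 1:
            return Action.FORWARD_UP           # data packet: hand to the upper layer
        if table.get(pkt.flow_id) == pkt.seq_id:
            return Action.SWITCH_PATH          # acknowledge packet matches latest sent packet
        return Action.MAINTAIN_PATH            # mismatch (or unknown flow): keep the path
    if pkt.flow_id in table:
        table[pkt.flow_id] += 1                # existing flow: seqID++
    else:
        table[pkt.flow_id] = 0                 # new flow: create entry, seqID = 0
    pkt.seq_id = table[pkt.flow_id]            # stamp the outgoing packet
    return Action.PATH_SELECTION
```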
6. Header extension examples
6.1. VXLAN header extension
The extension format of the VXLAN header is shown below. In order to
distinguish different flows and index the flowlets belonging to the same
flow, four fields have to be added to the VXLAN header: sender flag,
receiver flag, flow ID and sequence ID.
VXLAN header: an 8-byte field, as shown in Figure 4, which reuses the
24-bit reserved field that follows the flags in the VXLAN header.
- S (1 bit) : sender flag, set to 0 by default, set to 1 to indicate
that the packet comes from the Ingress NVE.
- T (1 bit) : receiver flag, set to 0 by default, set to 1 to
indicate that the packet comes from the Egress NVE.
- flow ID (12 bits) : employed to identify different flows, reuses
the higher 12 bits of the 24-bit reserved field in the VXLAN header.
- sequence ID (12 bits): employed to index the flowlet within the
same flow, reuses the 12 bits following the flow ID.
The lower 8 bits of the reserved fields in the VXLAN header are set to
zero on transmission and ignored on receipt.
Outer UDP Header: as suggested in Section 5 of [RFC7348], the source
port field is used to realize load balancing of the VM-to-VM
traffic across the VXLAN overlay. It will be set to a hash value
of the inner Ethernet frame's header. The UDP source port number will
be calculated in the dynamic/private port range 49152-65535.
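Mapping a hash of the inner header into the dynamic/private port range can be sketched as follows. The hash function is not specified in this document; CRC-32 is used here purely as an assumption for illustration, and the function name is hypothetical.

```python
import zlib

PORT_MIN = 49152
PORT_MAX = 65535


def vxlan_source_port(inner_header: bytes) -> int:
    """Map a hash of the inner Ethernet header into 49152-65535.

    CRC-32 is an illustrative choice only; any hash with a reasonable
    spread could be substituted.
    """
    span = PORT_MAX - PORT_MIN + 1   # 16384 usable ports
    return PORT_MIN + zlib.crc32(inner_header) % span
```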
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Outer UDP Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port (load balancing)  |    Dest Port = VXLAN Port     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          UDP Length           |         UDP Checksum          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
VXLAN Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|S|T|R|        flow ID        |      Sequence ID      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        VXLAN Network Identifier (VNI)         |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4: VXLAN Frame Format extension
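The VXLAN header layout of Figure 4 can be packed and parsed as two 32-bit words in network byte order. This is a minimal sketch: the type and function names are hypothetical, and the bit masks for S and T are an assumption derived from their positions in the figure (immediately after the I flag).

```python
import struct
from typing import NamedTuple

FLAG_I = 0x08   # VNI-valid flag from [RFC7348]
FLAG_S = 0x04   # sender flag, position assumed from Figure 4
FLAG_T = 0x02   # receiver flag, position assumed from Figure 4


class VxlanExt(NamedTuple):
    sender: bool
    receiver: bool
    flow_id: int   # 12 bits
    seq_id: int    # 12 bits
    vni: int       # 24 bits


def pack_vxlan_ext(h: VxlanExt) -> bytes:
    """Encode the 8-byte extended VXLAN header."""
    if not (0 <= h.flow_id < 1 << 12 and 0 <= h.seq_id < 1 << 12
            and 0 <= h.vni < 1 << 24):
        raise ValueError("field out of range")
    flags = FLAG_I | (FLAG_S if h.sender else 0) | (FLAG_T if h.receiver else 0)
    word0 = (flags << 24) | (h.flow_id << 12) | h.seq_id
    word1 = h.vni << 8            # low 8 reserved bits are sent as zero
    return struct.pack("!II", word0, word1)


def unpack_vxlan_ext(data: bytes) -> VxlanExt:
    """Decode the 8-byte extended VXLAN header; trailing reserved bits are ignored."""
    word0, word1 = struct.unpack("!II", data[:8])
    flags = word0 >> 24
    return VxlanExt(bool(flags & FLAG_S), bool(flags & FLAG_T),
                    (word0 >> 12) & 0xFFF, word0 & 0xFFF, word1 >> 8)
```

Since the sequence ID is only 12 bits wide, it wraps after 4096 packets; how wrap-around is handled is not covered by this sketch.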
6.2. NVGRE header extension
The extension format of the NVGRE header is shown below. In order to
distinguish different flows and index the flowlets from the same flow,
the sequence field has to be enabled in the NVGRE header. The sequence
flag should be set to 1. Lowest two bits of sequence field are used
to indicate sender flag and receiver flag respectively, and the
residual 30 bits can be used to indicate the sequence ID. The
combination of the VSID field and the flowID field (32 bits) can be
used to identify the outgoing packet.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
NVGRE Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0| |1|1|    Reserved0    | Ver |     Protocol Type 0x6558      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Virtual Subnet ID (VSID)            |    FlowID     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S|T|                        Sequence ID                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
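The NVGRE layout can be encoded in the same way as three 32-bit words. The sketch below assumes the bit positions drawn in the figure above (S and T in the first two bit positions of the third word, followed by a 30-bit sequence ID); the function names are hypothetical, and the first word is fixed to key and sequence flags set with version 0.

```python
import struct
from typing import Tuple

# First word: C=0, reserved=0, K=1, S=1, Reserved0=0, Ver=0, Protocol Type 0x6558.
NVGRE_WORD0 = 0x30006558


def pack_nvgre_ext(vsid: int, flow_id: int, seq_id: int,
                   sender: bool, receiver: bool) -> bytes:
    """Encode the 12-byte NVGRE header with the sequence field enabled."""
    if not (0 <= vsid < 1 << 24 and 0 <= flow_id < 1 << 8
            and 0 <= seq_id < 1 << 30):
        raise ValueError("field out of range")
    word1 = (vsid << 8) | flow_id
    # Flag positions follow the figure; this placement is an assumption.
    word2 = (int(sender) << 31) | (int(receiver) << 30) | seq_id
    return struct.pack("!III", NVGRE_WORD0, word1, word2)


def unpack_nvgre_ext(data: bytes) -> Tuple[int, int, int, bool, bool]:
    """Return (vsid, flow_id, seq_id, sender, receiver)."""
    _, word1, word2 = struct.unpack("!III", data[:12])
    return (word1 >> 8, word1 & 0xFF, word2 & 0x3FFFFFFF,
            bool(word2 >> 31), bool((word2 >> 30) & 1))
```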
7. Acknowledge frame format
The acknowledge packet is a general encapsulated IPv4 packet with a
vacant payload. The encapsulation format could be VXLAN, NVGRE, or
another format. According to the Ethernet frame format defined in
[IEEE802.3], the minimum size of the acknowledge packet has to be set to
42 bytes.
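How an Egress NVE might derive the acknowledge packet from a received data packet can be sketched at the field level. The OverlayPacket type and build_ack function are hypothetical, the addresses in the usage note are documentation addresses, and byte-level encapsulation and minimum-size padding are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlayPacket:
    src_ip: str          # outer IPv4 source (sending NVE)
    dst_ip: str          # outer IPv4 destination (receiving NVE)
    sender_flag: int     # S
    receiver_flag: int   # T
    flow_id: int
    seq_id: int
    payload: bytes = b""


def build_ack(received: OverlayPacket) -> OverlayPacket:
    """Build the acknowledge packet for a received data packet."""
    return OverlayPacket(
        src_ip=received.dst_ip,      # Egress NVE replies from its own address
        dst_ip=received.src_ip,      # back to the Ingress NVE
        sender_flag=0,
        receiver_flag=1,             # marks the packet as an acknowledge packet
        flow_id=received.flow_id,    # copied from the incoming packet
        seq_id=received.seq_id,      # copied from the incoming packet
        payload=b"",                 # vacant payload
    )
```

For a data packet sent from 192.0.2.1 to 192.0.2.2 with flow ID 9 and sequence ID 41, build_ack returns a packet from 192.0.2.2 to 192.0.2.1 carrying the same flow ID and sequence ID with S=0 and T=1.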
8. Security Considerations
Security considerations are not addressed in this document.
9. IANA Considerations
No IANA action is needed for this document.
10. References
10.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
10.2. Informative References
[CONGA] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan,
R., Chu, K., Fingerhut, A., and V. Lam, "CONGA:
Distributed Congestion-aware Load Balancing for
Datacenters", 2014.
[FLARE] Kandula, S., Katabi, D., Sinha, S., and A. Berger,
"Dynamic Load Balancing Without Packet Reordering", 2007.
[IEEE802.1Q]
"IEEE Standard for Local and metropolitan area networks--
Media Access Control (MAC) Bridges and Virtual Bridged
Local Area Networks IEEE Std 802.1Q-2011 (Revision of IEEE
Std 802.1Q-2005)", 2011.
[IEEE802.3]
"IEEE Standard for Information Technology--
Telecommunications and Information Exchange Between
Systems--Local and Metropolitan Area Networks--Specific
Requirements Part 3: Carrier Sense Multiple Access With
Collision Detection (CSMA/CD) Access Method and Physical
Layer Specifications", April 2014.
[RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
Multicast Next-Hop Selection", RFC 2991, November 2000.
[RFC6182] Ford, A., Raiciu, C., Handley, M., Barre, S., and J.
Iyengar, "Architectural Guidelines for Multipath TCP
Development", RFC 6182, March 2011.
[RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
eXtensible Local Area Network (VXLAN): A Framework for
Overlaying Virtualized Layer 2 Networks over Layer 3
Networks", RFC 7348, August 2014.
[TCP] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
September 1981.
Authors' Addresses
Hao Chen
Huawei Technologies
101 Software Ave., Yuhuatai Dist.
Nanjing, Jiangsu 210012
China
Phone: +86 025-5662-4440
Email: philips.chenhao@huawei.com
Wei Song
Huawei Technologies
101 Software Ave., Yuhuatai Dist.
Nanjing, Jiangsu 210012
China
Phone: +86 025-5662-6297
Email: songwei80@huawei.com