Network Working Group                                            H. Chen
Internet-Draft                                                   W. Song
Intended status: Standards Track                     Huawei Technologies
Expires: April 30, 2015                                 October 27, 2014


            Load balancing without packet reordering in NVO3
                   draft-chen-nvo3-load-banlancing-00

Abstract

   Traditional ECMP cannot balance loads well in data center networks
   because it splits loads at the granularity of a flow: packets
   belonging to a single flow have to be delivered along the same path.
   Though this avoids packet reordering, it may degrade bandwidth
   utilization.

   This document describes a method of splitting a single flow across
   multiple parallel paths without causing packet reordering, which is
   more effective when large flows exist.  The specific path selection
   algorithm is NOT discussed in this document.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 30, 2015.







Chen & Song              Expires April 30, 2015                 [Page 1]


Internet-Draft               Load Balancing                 October 2014


Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Rationale for flowlet-based splitting . . . . . . . . . . . .   3
   4.  Flowlet-based load balancing  . . . . . . . . . . . . . . . .   5
     4.1.  Unicast . . . . . . . . . . . . . . . . . . . . . . . . .   5
     4.2.  Multicast . . . . . . . . . . . . . . . . . . . . . . . .   6
   5.  The state machine . . . . . . . . . . . . . . . . . . . . . .   6
   6.  Header extension examples . . . . . . . . . . . . . . . . . .   7
     6.1.  VXLAN header extension  . . . . . . . . . . . . . . . . .   7
     6.2.  NVGRE header extension  . . . . . . . . . . . . . . . . .   8
   7.  Acknowledge frame format  . . . . . . . . . . . . . . . . . .   9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .   9
     10.1.  Normative References . . . . . . . . . . . . . . . . . .   9
     10.2.  Informative References . . . . . . . . . . . . . . . . .   9
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   Large flows are not rare in current data center networks.  Typical
   examples include: 1) the large amount of data copied during virtual
   machine migration, and 2) storage traffic when iSCSI is employed.  In
   order to increase bandwidth utilization, ECMP routing is introduced
   to balance the loads.  However, existing ECMP techniques split loads
   at the granularity of a flow, which means all packets from a single
   flow have to be delivered along the same path.  Though ECMP is able
   to avoid packet reordering, it may degrade bandwidth utilization.

   One basic idea to increase bandwidth utilization is to split a
   single flow into several bursts of packets and deliver them along
   parallel paths.  The requirement for the splitting method is that
   reordering be avoided.  Flowlet-based splitting [FLARE] can meet
   this requirement.  Flowlets are defined as bursts of packets from a
   single flow that are separated by large enough gaps.

   Utilizing the time gap between consecutive bursts of packets from a
   single flow, flowlet-based ECMP splits a large flow into flowlets,
   provided that the time gap is larger than the path delay.  These
   flowlets are delivered along multiple parallel paths, and reordering
   does not happen because packets still arrive in sequence.
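
   As a non-normative illustration, the following Python sketch groups
   the send times of one flow into flowlets using a fixed gap
   threshold.  The function name split_into_flowlets and the parameter
   min_gap are assumptions made for this example; this document does
   not define them.

```python
def split_into_flowlets(send_times, min_gap):
    """Group the send times of one flow into flowlets.

    A new flowlet starts whenever the idle time since the previous
    packet exceeds min_gap (assumed here to stand for the path delay).
    """
    flowlets = []
    for t in send_times:
        if flowlets and t - flowlets[-1][-1] <= min_gap:
            flowlets[-1].append(t)   # same burst: stays on the current path
        else:
            flowlets.append([t])     # gap is large enough: new flowlet
    return flowlets


# Send times in microseconds; 1000 us is an assumed path delay.
print(split_into_flowlets([0, 100, 200, 5000, 5100, 12000], min_gap=1000))
# prints [[0, 100, 200], [5000, 5100], [12000]]
```

   Each inner list could be handed to a different path, since the
   packets of the previous flowlet are assumed to have already arrived.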

2.  Terminology

   This document makes use of the following terms; additional terms are
   defined in [RFC7348]:

   ECMP Equal-Cost Multipath

   iSCSI Internet Small Computer System Interface

   NVGRE Network Virtualization using Generic Routing Encapsulation

   NVO3 Network Virtualization over Layer 3

   VM Virtual Machine

   VXLAN Virtual eXtensible Local Area Network

3.  Rationale for flowlet-based splitting

   In data center networks, more than 90% of the load is delivered over
   TCP.  For a TCP flow, reordering is harmful when three or more
   packets are received before a "late" packet: in this case TCP enters
   fast-retransmit mode, which consumes extra bandwidth (and could
   potentially cause more loss, decreasing throughput) as it attempts
   to unnecessarily retransmit the delayed packet(s) [RFC2991].  So
   per-packet ECMP, which randomly hashes packets to paths, is rarely
   used in modern data center networks.
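
   The effect can be illustrated with a small Python sketch.  It models
   a receiver that sends one cumulative ACK per arriving segment and a
   sender that starts fast retransmit after three duplicate ACKs, which
   is the commonly used threshold.  The function name and the
   simplified segment numbering (0, 1, 2, ...) are assumptions of this
   example, not part of any TCP implementation.

```python
def triggers_fast_retransmit(arrival_order, dupack_threshold=3):
    """Return True if the arrival order would cause a spurious fast retransmit.

    arrival_order lists segment numbers in the order they reach the
    receiver.  Every out-of-order arrival produces one duplicate ACK.
    """
    received = set()
    next_expected = 0
    dup_acks = 0
    for seg in arrival_order:
        received.add(seg)
        if seg == next_expected:
            # In-order data: the cumulative ACK advances, counter resets.
            while next_expected in received:
                next_expected += 1
            dup_acks = 0
        elif seg > next_expected:
            # A hole exists: the receiver repeats its previous ACK.
            dup_acks += 1
            if dup_acks >= dupack_threshold:
                return True
        # seg < next_expected (old duplicate) is ignored in this sketch.
    return False


print(triggers_fast_retransmit([0, 2, 3, 4, 1]))  # prints True
print(triggers_fast_retransmit([0, 2, 1, 3, 4]))  # prints False
```

   In the first call segment 1 is overtaken by three later segments, so
   the sender would retransmit it even though it was only delayed.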

   MPTCP [RFC6182] is one feasible method to increase bandwidth without
   causing packet reordering.  But it adds more complexity to an
   already complex transport layer burdened by new requirements such as
   low latency and burst tolerance in data centers [CONGA].

   Besides, load balancing is best done in the network.  The transport
   layer should NOT be made more complex.  Specifically, existing TCP
   should be utilized without modification.

   Flowlet-based switching can meet this requirement, especially for
   the leaf-spine topologies used in data center networks.  Flowlets are
   bursts of packets from a single flow that are separated by large
   enough idle intervals, or gaps.  Split into several flowlets, a large
   flow can be delivered across multiple parallel paths rather than
   along a single path for its whole lifetime.  In this case, potential
   congestion can be avoided and bandwidth utilization is increased.

   The idle intervals between consecutive packets are inherent in a
   TCP flow due to TCP's burstiness.  As shown in Figure 1, given two
   consecutive packets in a TCP flow, if the first packet reaches the
   egress NVE before the second packet leaves the ingress NVE, the
   ingress NVE can route the second packet, and subsequent packets from
   this flow, onto another available path with no threat of reordering.

                                .................
                                .               .
                                .  -----------  .
                  +-------+     . /           \ .     +-------+
         TCP      |Ingress|     ./ L3 overlay  \.Pkt1 | Egress|
       --flow --->|  NVE  |-----.    Network    .->---|  NVE  |---->
                  |       |     .\             /.     |       |
                  +-------+     . \Pkt2       / .     +-------+
                                .  ->---------  .
                                .               .
                                .................

        Figure 1: Rationale for splitting a TCP flow into flowlets

   If no packets of this TCP flow are sent out from the Ingress NVE
   during the time it takes the previous packet to reach the Egress
   NVE, then this time interval can be considered large enough to be
   used to split the TCP flow.  In order to find the 'gap', the Egress
   NVE may reply with an acknowledge packet for each received packet,
   carrying some information to identify which packet it replies to.

   The Ingress NVE may decide whether this time interval is large enough
   by comparing the identification of the latest sent packet with that
   of the received acknowledge packet.  If the time interval is large
   enough, the two are equal, which means no packets of this flow were
   sent out during the time interval.  Otherwise, some packets must
   have been sent out during the time interval, so it cannot be
   considered a large enough gap to be used to split the TCP flow.  The
   identification mentioned here should include the flow ID and the
   packet's sequence ID in the flow.
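
   The comparison can be illustrated with a small event-driven Python
   sketch for a single flow.  It assumes a fixed time, here called rtt,
   between sending a packet and receiving its acknowledge packet; the
   function name and the timing model are assumptions of this example.
   Rather than using a configured threshold, the sketch reports the
   sequence IDs whose acknowledge packet still matches the latest sent
   packet when it arrives.

```python
def acks_that_reveal_gap(send_times, rtt):
    """Return the sequence IDs after which the flow may be split."""
    events = []
    for seq, t in enumerate(send_times):
        events.append((t, 0, seq))          # kind 0: packet sent
        events.append((t + rtt, 1, seq))    # kind 1: its acknowledge arrives
    # At equal times a send sorts before an ack, which is the cautious order.
    events.sort()

    last_sent = None
    matched = []
    for _, kind, seq in events:
        if kind == 0:
            last_sent = seq                 # ingress records the latest packet
        elif seq == last_sent:
            matched.append(seq)             # nothing was sent in between
    return matched


# Times in microseconds; 1000 us is an assumed acknowledge delay.
print(acks_that_reveal_gap([0, 100, 200, 5000, 5100, 12000], rtt=1000))
# prints [2, 4, 5]
```

   Under this model an acknowledge packet matches exactly when the flow
   stays idle for at least one acknowledge delay after the packet.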

4.  Flowlet-based load balancing

4.1.  Unicast

   For unicast traffic, the NVE processes outgoing and incoming packets
   as described below:

   1.   The Ingress NVE computes the identifier for the incoming flow.
        Packets from this flow will be populated with the same flow ID.

   2.   Packets from a single flow will be indexed by a sequence ID
        field in an incremental manner.  For example, the first packet
        has sequence ID 0, the next packet has sequence ID 1, and so
        on.

   3.   For packets originating from the Ingress NVE, the sender flag
        in the outer header will be set to 1 and the receiver flag will
        be set to 0 to indicate that the packet is not an acknowledge
        packet.

   4.   The Ingress NVE has to maintain a flow state table for the
        active flows, with each entry recording the flow ID and
        sequence ID.  Note that the communication is full duplex: each
        NVE can act as the Ingress NVE for one outgoing flow and as the
        Egress NVE for another incoming flow at the same time.  So each
        NVE may have a flow state table for all of its outgoing TCP
        flows.

   5.   There is also an aging time associated with the flow state
        table.  The aging time can be configured through the NVE's
        management interface.  One option for calculating this value is
        to follow the way [TCP] does.  In this way the flow state table
        can be kept small and will not consume too many system
        resources.

   6.   The Egress NVE will reply to the Ingress NVE with an acknowledge
        packet after successfully receiving each packet.  The
        acknowledge packet is an encapsulated IPv4 packet with an empty
        payload.  Its source IPv4 address field will be populated with
        the Egress NVE's IP address and its destination IPv4 address
        field will be populated with the Ingress NVE's IP address.

   7.   In the acknowledge packet, the sender flag in the outer header
        will be set to 0 and the receiver flag will be set to 1 to
        indicate that it is an acknowledge packet.  The flow ID field
        and sequence ID field of the acknowledge packet will be copied
        directly from the corresponding incoming packet.

   8.   On receiving the acknowledge packet, the Ingress NVE will look
        up its flow state table to find whether any entry has the same
        flow ID as the acknowledge packet.  If there is no matching
        entry, the Ingress NVE will drop the acknowledge packet.

   9.   If the Ingress NVE finds that there is a matching entry, it will
        compare the sequence ID field of this entry with the sequence ID
        field in the outer header of the acknowledge packet.

   10.  If the two sequence IDs are equal, no subsequent packets from
        this flow were sent from the Ingress NVE before this
        acknowledge packet was received.  So it can be assumed that the
        time interval between this sent packet and its subsequent
        packet is large enough.  In this case, the Ingress NVE may
        distribute this flow to another path according to the routing
        selection algorithm without causing packet reordering.

   11.  Otherwise, subsequent packets of this flow must have been sent
        before the acknowledge packet was received.  This indicates that
        the time interval is not large enough and packet reordering may
        happen if the flow is switched to another path.  So the Ingress
        NVE will keep the current path for this flow until a large
        enough gap appears.

          flow ID                      sequence ID
      +-------------+---------------+---------------+---------------+
      |  flow ID A  |  sequence A1  |  sequence A2  |      ...      |
      +-------------+---------------+---------------+---------------+
      |  flow ID B  |  sequence B1  |  sequence B2  |      ...      |
      +-------------+---------------+---------------+---------------+
      |     ...     |      ...      |      ...      |      ...      |
      +-------------+---------------+---------------+---------------+
      |  flow ID X  |  sequence X1  |  sequence X2  |      ...      |
      +-------------+---------------+---------------+---------------+

               Figure 2: Flow state table residing in the NVE
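
   The following Python sketch illustrates one possible shape for the
   flow state table with its aging time (steps 4 and 5) and for the
   construction of an acknowledge packet at the Egress NVE (steps 6 and
   7).  The names OuterHeader, FlowStateTable, next_seq, expire, and
   build_ack are hypothetical, as is the use of a caller-supplied clock
   value; sequence ID wrap-around is not handled.

```python
from dataclasses import dataclass


@dataclass
class OuterHeader:
    """Only the outer-header fields this sketch needs."""
    src_ip: str
    dst_ip: str
    sender_flag: int
    receiver_flag: int
    flow_id: int
    seq_id: int


class FlowStateTable:
    """Per-flow sequence IDs with an aging time (hypothetical layout)."""

    def __init__(self, aging_time):
        self.aging_time = aging_time
        self._entries = {}  # flow_id -> [seq_id, time of last use]

    def next_seq(self, flow_id, now):
        """Return the sequence ID for the next outgoing packet of a flow."""
        entry = self._entries.get(flow_id)
        if entry is None:
            entry = self._entries[flow_id] = [0, now]   # new flow starts at 0
        else:
            entry[0] += 1
            entry[1] = now
        return entry[0]

    def expire(self, now):
        """Remove and return flows idle for longer than the aging time."""
        stale = [f for f, (_, used) in self._entries.items()
                 if now - used > self.aging_time]
        for f in stale:
            del self._entries[f]
        return stale


def build_ack(data_hdr):
    """Egress side: swap the addresses, set S=0 and T=1, copy both IDs."""
    return OuterHeader(src_ip=data_hdr.dst_ip, dst_ip=data_hdr.src_ip,
                       sender_flag=0, receiver_flag=1,
                       flow_id=data_hdr.flow_id, seq_id=data_hdr.seq_id)


table = FlowStateTable(aging_time=30)
hdr = OuterHeader("192.0.2.1", "192.0.2.2", 1, 0,
                  flow_id=7, seq_id=table.next_seq(7, now=0))
print(build_ack(hdr))
```

   A dictionary keyed by flow ID keeps the lookup in step 8 cheap, and
   calling expire periodically bounds the table size.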

4.2.  Multicast

   For multicast traffic, the load balancing mechanism is not
   employed.  Multicast packets are routed according to existing
   routing techniques.

5.  The state machine

                                      +---------+
                                      |  init   | Reset Aging Timer
                                      +---------+
                                           |
                                           v
                                    +------------+
                                    |  Recv(pkt) |
                                    +------------+
                         from NVE          |     from host
                     +---------------------v-------------+
                     |                                   |
                     v                                   v
             +-----------------+               +-------------------+
             |pkt.hdr.Tflag==1?|               |GenerateflowID(pkt)|
             +-----------------+               +-------------------+
                 Yes   |   No                            |
           +-----------v--------+                        v
           |                    |              +-------------------+
           v                    v              |any match entry in |
 +------------------+  +-----------------+     |flow state table ? |
 |  pkt.hdr.seqID   |  | forward to upper|     |flow state table ? |
 |        ==        |  |layer for further|      No       |    Yes
 | this.entry.seqID?|  | processing      |     +---------v---------+
 +------------------+  +-----------------+     |                   |
    Yes    |     No                            |                   |
        +--v---------------+                   v                   v
        |                  |         +-------------------+     +--------------------+
        v                  v         | new flow, create  |     | existing flow,     |
+-------------+     +------------+   | an entry for it.  |     | this.entry.seqID ++|
|  MATCH      |     |Do NOT MATCH|   +-------------------+     +--------------------+
| switch path |     | maintain   |            |                           |
+-------------+     +------------+            v                           v
                                     +-------------------+     +--------------------+
                                     |this.entry.flowID =|     | forward pkt to path|
                                     |    pkt.hdr.flowID |---->|  selection module  |
                                     |this.entry.seqID =0|     |                    |
                                     +-------------------+     +--------------------+

                        Figure 3: The state machine
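
   The branches of Figure 3 can be sketched in Python as two handlers
   over a plain dictionary that maps a flow ID to the latest sequence
   ID.  The function names, the dictionary-based header, and the way
   generate_flow_id derives a 12-bit value from a 5-tuple are all
   assumptions of this example; this document does not define how the
   flow ID is computed.  The "drop" result follows step 8 of
   Section 4.1 and is not drawn in the figure.

```python
import zlib


def generate_flow_id(five_tuple):
    # Hypothetical derivation: CRC32 of the 5-tuple, truncated to 12 bits.
    return zlib.crc32(repr(five_tuple).encode()) & 0xFFF


def on_host_packet(table, five_tuple):
    """Right branch: packet from a host, about to be encapsulated."""
    flow_id = generate_flow_id(five_tuple)
    if flow_id in table:
        table[flow_id] += 1          # existing flow: seqID++
    else:
        table[flow_id] = 0           # new flow: create an entry, seqID = 0
    # The header is then forwarded to the path selection module.
    return {"S": 1, "T": 0, "flow_id": flow_id, "seq_id": table[flow_id]}


def on_nve_packet(table, hdr):
    """Left branch: packet received from another NVE."""
    if hdr["T"] != 1:
        return "forward to upper layer"
    if hdr["flow_id"] not in table:
        return "drop"
    if hdr["seq_id"] == table[hdr["flow_id"]]:
        return "switch path"         # MATCH: the gap is large enough
    return "maintain path"           # no match: newer packets are in flight


table = {}
flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
first = on_host_packet(table, flow)
second = on_host_packet(table, flow)
print(on_nve_packet(table, dict(first, S=0, T=1)))   # prints maintain path
print(on_nve_packet(table, dict(second, S=0, T=1)))  # prints switch path
```

   The acknowledge for the first packet arrives after the second packet
   was sent, so only the second acknowledge allows a path switch.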

6.  Header extension examples

6.1.  VXLAN header extension

   The extended VXLAN header format is shown below.  In order to
   distinguish different flows and index the flowlets belonging to the
   same flow, four fields have to be added to the VXLAN header: sender
   flag, receiver flag, flow ID, and sequence ID.

   VXLAN header: an 8-byte field, as shown in Figure 4.  The extension
   reuses two reserved flag bits and the higher 24 bits of the reserved
   fields in the VXLAN header.

      - S (1 bit): sender flag, set to 0 by default; set to 1 to
      indicate that the packet is sent by the Ingress NVE.

      - T (1 bit): receiver flag, set to 0 by default; set to 1 to
      indicate that the packet is sent by the Egress NVE.

      - flow ID (12 bits): employed to identify different flows; reuses
      the higher 12 bits of the reserved fields in the VXLAN header.

      - sequence ID (12 bits): employed to index the flowlets within
      the same flow; reuses the 12 bits following the flow ID.

   The lower 8 bits of the reserved fields in the VXLAN header are set
   to zero on transmission and ignored on receipt.

   Outer UDP Header: as suggested in Section 5 of [RFC7348], the source
   port field is used to realize load balancing of the VM-to-VM
   traffic across the VXLAN overlay.  It will be set to a hash value
   of the inner Ethernet frame's header.  The UDP source port number
   will be calculated in the dynamic/private port range 49152-65535.
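
   A minimal Python sketch of such a source port calculation follows.
   The choice of CRC32 as the hash and the function name
   vxlan_source_port are assumptions of this example; any hash with a
   reasonably even spread could be substituted.

```python
import zlib

PORT_MIN = 49152
PORT_MAX = 65535


def vxlan_source_port(inner_eth_header: bytes) -> int:
    """Map a hash of the inner Ethernet header into 49152-65535."""
    span = PORT_MAX - PORT_MIN + 1          # 16384 usable ports
    return PORT_MIN + zlib.crc32(inner_eth_header) % span


# 14-byte inner Ethernet header: destination MAC, source MAC, EtherType.
inner = bytes.fromhex("00112233445566778899aabb0800")
print(vxlan_source_port(inner))
```

   Packets with the same inner header always map to the same port, so
   an underlay that hashes on the UDP source port keeps them on one
   path.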

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     Outer UDP Header:
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Source Port(load balancing)  |       Dest Port = VXLAN Port  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |           UDP Length          |        UDP Checksum           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     VXLAN Header:
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |R|R|R|R|I|S|T|R|       flow ID         |      Sequence ID      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |        VXLAN Network Identifier (VNI)         |   Reserved    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 4: VXLAN Frame Format extension
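
   The layout in Figure 4 can be exercised with the following Python
   sketch, which packs and unpacks the 8-byte extended VXLAN header.
   The bit positions are read from the figure (S and T are the sixth
   and seventh bits of the flags octet, and the I flag is set as
   required by [RFC7348]); the function names are hypothetical.

```python
import struct


def pack_vxlan_ext(s, t, flow_id, seq_id, vni):
    """Build the extended VXLAN header of Figure 4 (8 bytes)."""
    flags = 0x08 | (s << 2) | (t << 1)       # R R R R I S T R, with I = 1
    word0 = (flags << 24) | ((flow_id & 0xFFF) << 12) | (seq_id & 0xFFF)
    word1 = (vni & 0xFFFFFF) << 8            # low 8 bits stay reserved (0)
    return struct.pack("!II", word0, word1)


def unpack_vxlan_ext(data):
    """Parse the first 8 bytes of data as an extended VXLAN header."""
    word0, word1 = struct.unpack("!II", data[:8])
    flags = word0 >> 24
    return {
        "S": (flags >> 2) & 1,
        "T": (flags >> 1) & 1,
        "flow_id": (word0 >> 12) & 0xFFF,
        "seq_id": word0 & 0xFFF,
        "vni": word1 >> 8,
    }


raw = pack_vxlan_ext(s=1, t=0, flow_id=0xABC, seq_id=0x123, vni=5000)
print(raw.hex())   # prints 0cabc12300138800
print(unpack_vxlan_ext(raw))
```

   With 12 bits each, the flow ID and the sequence ID both wrap after
   4096 values; this sketch simply masks larger inputs.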

6.2.  NVGRE header extension

   The extended NVGRE header format is shown below.  In order to
   distinguish different flows and index the flowlets from the same
   flow, the sequence field has to be enabled in the NVGRE header: the
   sequence flag should be set to 1.  The two lowest-numbered bits of
   the sequence field (bits 0 and 1 in the diagram below) are used to
   indicate the sender flag and receiver flag respectively, and the
   remaining 30 bits can be used to indicate the sequence ID.  The
   combination of the VSID field and the FlowID field (32 bits) can be
   used to identify the outgoing packet.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   NVGRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|1|   Reserved0     | Ver |   Protocol Type 0x6558        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Virtual Subnet ID (VSID)        |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|T|                  Sequence ID                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
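
   As a rough illustration of the diagram above, the following Python
   sketch packs the 12-byte extended NVGRE header.  It assumes that S
   and T occupy the first two bit positions of the third word, as
   drawn, and that the first word carries the key-present and
   sequence-present flags with protocol type 0x6558; the function name
   is hypothetical.

```python
import struct


def pack_nvgre_ext(vsid, flow_id, s, t, seq_id):
    """Build the extended NVGRE header shown in the diagram (12 bytes)."""
    word0 = 0x30006558                       # key and sequence flags set,
                                             # Ver = 0, protocol type 0x6558
    word1 = ((vsid & 0xFFFFFF) << 8) | (flow_id & 0xFF)
    word2 = (s << 31) | (t << 30) | (seq_id & 0x3FFFFFFF)
    return struct.pack("!III", word0, word1, word2)


raw = pack_nvgre_ext(vsid=100, flow_id=0x2A, s=1, t=0, seq_id=7)
print(raw.hex())   # prints 300065580000642a80000007
```

   Compared with the VXLAN extension, the sequence ID has 30 bits here
   while the FlowID field has only 8.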

7.  Acknowledge frame format

   The acknowledge packet is a general encapsulated IPv4 packet with an
   empty payload.  The encapsulation format could be VXLAN, NVGRE, or
   another format.  According to the Ethernet frame format defined in
   [IEEE802.3], the minimum size of the acknowledge packet has to be
   set to 42 bytes.

8.  Security Considerations

   Security considerations are not addressed in this document.

9.  IANA Considerations

   No IANA action is needed for this document.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [CONGA]    Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan,
              R., Chu, K., Fingerhut, A., and V. Lam, "CONGA:
              Distributed Congestion-aware Load Balancing for
              Datacenters", 2014.

   [FLARE]    Kandula, S., Katabi, D., Sinha, S., and A. Berger,
              "Dynamic Load Balancing Without Packet Reordering", 2007.

   [IEEE802.1Q]
              "IEEE Standard for Local and metropolitan area networks--
              Media Access Control (MAC) Bridges and Virtual Bridged
              Local Area Networks IEEE Std 802.1Q-2011 (Revision of IEEE
              Std 802.1Q-2005)", 2011.

   [IEEE802.3]
              "IEEE Standard for Information Technology--
              Telecommunications and Information Exchange Between
              Systems--Local and Metropolitan Area Networks--Specific
              Requirements Part 3: Carrier Sense Multiple Access With
              Collision Detection (CSMA/CD) Access Method and Physical
              Layer Specifications", April 2014.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC6182]  Ford, A., Raiciu, C., Handley, M., Barre, S., and J.
              Iyengar, "Architectural Guidelines for Multipath TCP
              Development", RFC 6182, March 2011.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, August 2014.

   [TCP]      Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

Authors' Addresses

   Hao Chen
   Huawei Technologies
   101 Software Ave., Yuhuatai Dist.
   Nanjing, Jiangsu  210012
   China

   Phone: +86 025-5662-4440
   Email: philips.chenhao@huawei.com

   Wei Song
   Huawei Technologies
   101 Software Ave., Yuhuatai Dist.
   Nanjing, Jiangsu  210012
   China

   Phone: +86 025-5662-6297
   Email: songwei80@huawei.com