Coordinated Congestion Management

Document Type Active Internet-Draft (individual)
Authors Lv Yunping, Yuhan Zhang, Mengzhu Liu
Last updated 2024-04-19
RTGWG                                                             Y. Lyu
Internet-Draft                                                  Y. Zhang
Intended status: Standards Track                                  M. Liu
Expires: 21 October 2024                                          Huawei
                                                           19 April 2024

                   Coordinated Congestion Management


Abstract

   AI fabric is sensitive to bandwidth.  Congestion management,
   including congestion control and load balancing, is a main method of
   fully utilizing network resources.  However, current congestion
   management mechanisms are not coordinated, which leads to decreased
   throughput.  This document provides a scheme to coordinate different
   congestion management mechanisms.  It describes the design
   principles and the behaviors of network switches and hosts in the
   scheme, and gives an example of the end-to-end procedure.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 21 October 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components

Lyu, et al.              Expires 21 October 2024                [Page 1]
Internet-Draft                     CCM                        April 2024

   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Requirements Language
   4.  Existing congestion management
   5.  Design principle of coordinated congestion management
   6.  Coordinated congestion management scheme
     6.1.  Coordination tag
     6.2.  Notification message
     6.3.  Behavior of network switches
       6.3.1.  Identify congestion type
       6.3.2.  Notify CC congestion
       6.3.3.  Notify upstream point to perform AR
       6.3.4.  Perform congestion control
       6.3.5.  Perform adaptive routing
     6.4.  Behavior of source hosts
   7.  An example of end-to-end procedure
   8.  Security Considerations
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   ML/AI has been progressing rapidly over the last decade.  ChatGPT is
   a milestone of generative AI.  It has ignited the industry's
   enthusiasm for large AI models.  A single AI accelerator, or a
   single server with multiple AI accelerators, is not capable of
   training large models, due to lack of memory and compute power.  So
   it is imperative to employ a distributed system with parallel
   processing to train those models.

   AI training is bandwidth sensitive.  Taking data parallelism and MoE
   (Mixture of Experts), which are commonly used forms of parallel
   processing in AI training, as examples, the required bandwidth is at
   the GB level.  That brings a big challenge to AI fabric.  Increasing
   link speed is an important approach, from 400 Gbps to 800 Gbps, or
   even 1.6 Tbps in the future.  What's more, how to effectively use
   the bandwidth also becomes a critical issue.  The link bandwidth is
   expected to be fully utilized to achieve high throughput.  Network
   congestion is a major problem that degrades performance.  Thus,
   congestion management is always applied in the network to alleviate
   congestion.  Usually, congestion management includes congestion
   control and load balancing.  But today, congestion control and load
   balancing work independently, without any coordination.
   This document discusses the lack of coordination among current
   congestion management mechanisms, which leads to throughput issues
   that are particularly harmful in AI fabric.  It proposes a scheme
   for coordinating different congestion management mechanisms that
   can be effectively and widely deployed in AI fabric.

2.  Terminology

   *  ML: Machine Learning

   *  AI: Artificial Intelligence

   *  ECN: Explicit Congestion Notification

   *  AR: Adaptive Routing

   *  DCQCN: Data Center Quantized Congestion Notification [DCQCN]

   *  CNP: Congestion Notification Packet

   *  PLB: Protective Load Balancing [PLB]

   *  CC: Congestion Control

   *  ECMP: Equal-cost multi-path routing

3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

4.  Existing congestion management

   Congestion management includes congestion control and load
   balancing.  PFC-like flow control is not discussed in this document.
   It is useful as the last gate to prevent packet loss, but we do not
   count it as a part of congestion management.

   *  There are many congestion control mechanisms, such as DCQCN
      [DCQCN] and Timely [Timely].  Although they follow different
      procedures and use different algorithms, their purpose is the
      same: to control the sending rate at the source.  Basically,
      congestion control identifies network congestion from network
      status, such as the queue length of a switch port or the
      end-to-end round-trip time (RTT), and then adjusts the sending
      rate at the sender to alleviate congestion.  How to quickly bring
      down the rate to avoid packet loss, and how to recover the rate
      to minimize throughput reduction, are essential to congestion
      control.

   *  From another aspect, load balancing alleviates congestion by
      adjusting the forwarding paths of traffic.  ECMP is one way of
      load balancing.  It hashes each flow onto a specific path using
      the 5-tuple of the flow.  This does not work well for AI
      workloads, because AI traffic consists of a small number of
      flows, most of which are large.  ECMP cannot distribute such
      traffic evenly over the network, so adaptive routing is
      preferred.  Adaptive routing changes the path of a single flow
      according to network status.  For example, flow 1 originally uses
      path 1 for forwarding.  When a network switch detects that the
      path is becoming heavy-loaded, it selects another, light-loaded
      path, path 2, for the following packets in the flow.  The path
      status could be indicated by local link status and/or downstream
      link status, etc.  How to judge whether a path is heavy-loaded is
      implementation dependent.  Adaptive routing can select a path for
      each packet, thus using network resources in the most efficient
      way.  But avoiding unnecessary path switching is critical,
      because each path switch may increase system complexity, for
      example by requiring packet re-ordering.  Another load balancing
      mechanism is packet spray, in which the source host or network
      switch evenly distributes packets over all paths.  The
      distribution does not consider actual path status.  Compared with
      adaptive routing, it is easier to implement, but it is not the
      most optimized way.  In this document, we focus on adaptive
      routing.  The proposed scheme is also applicable to packet spray.
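
   The static behavior of ECMP described above can be illustrated with
   a minimal sketch.  It is not the hash used by any particular switch:
   the FiveTuple type, the ecmp_path function, the choice of SHA-256,
   and all addresses and port numbers are assumptions made for this
   example only.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def ecmp_path(flow: FiveTuple, num_paths: int) -> int:
    """Pick a path index by hashing the flow's 5-tuple (path load is ignored)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# Five large flows over four equal-cost paths: at least two flows must
# share a path, whatever the hash, and no flow ever leaves its path.
flows = [FiveTuple("10.0.0.1", "10.0.1.1", 49152 + i, 4791, 17) for i in range(5)]
paths = [ecmp_path(f, 4) for f in flows]
```

   Because the mapping depends only on the 5-tuple, a heavy-loaded path
   keeps receiving the same flows, which is the limitation that
   adaptive routing is meant to address.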

   Currently, congestion control and adaptive routing work
   independently, without coordination.  That has a negative impact on
   system performance.  For example, when congestion caused by
   imbalanced load on the network occurs on a switch, both DCQCN and
   adaptive routing are activated.  ECN is marked in data packets,
   causing a CNP to be sent back to the sender.  Thus, the sender slows
   down the sending rate of the congested flow.  Meanwhile, the switch
   changes the path for packets of the congested flow, steering the new
   incoming packets to a light-loaded path.  The result is that the
   congested flow is forwarded on the light-loaded path at a low rate.
   Then, DCQCN needs some time to recover the sending rate on the new
   path.  This reduces effective bandwidth and seriously impacts
   computation efficiency in AI training.  As another example, if the
   congestion is caused by in-cast traffic, congestion control should be
   enough.  Additional adaptive routing adjustments not only fail to
   mitigate congestion, but may also introduce more out-of-order
   packets.

   The fact is that current congestion management does not distinguish
   the cause of congestion, but triggers every mechanism whenever
   congestion is detected.  That brings trouble.  In principle, in-cast
   congestion cannot be mitigated by load balancing, and reducing the
   flow rate by congestion control for imbalanced-load congestion
   (in-network congestion) decreases network efficiency.

5.  Design principle of coordinated congestion management

   Coordinated congestion management is designed to coordinate
   congestion control and adaptive routing.  The design principles are
   as follows.

   *  Avoid unnecessary sending rate reduction
      AI fabric is bandwidth sensitive.  High throughput is extremely
      important.  Multipath is needed to make full use of network
      bandwidth.  Slowing down the sending rate while there are still
      available paths for traffic wastes network resources, thereby
      increasing communication time in the AI cluster and reducing AI
      training performance.

   *  Fully use multipath while reducing invalid path switching
      While searching for light-loaded paths for load balancing, new
      paths should be located quickly and accurately.  The search
      should not be restricted to local paths but should extend to
      available paths upstream.  Invalid path switching should be
      avoided.  Invalid path switching includes switching in-cast
      traffic: no matter how the traffic path is switched, the traffic
      will finally get congested at the last hop.

   *  Reuse current CC and AR algorithms
      There are already a variety of CC algorithms and AR algorithms.
      Those can still be used in the congestion management coordination
      scheme.  The scheme enables CC and AR to be triggered in a
      coordinated way, adjusting the sending rate or switching the path
      depending on the cause of congestion.

   *  Applicable to various topologies
      Most AI fabrics use Clos or fat-tree topologies, but there are
      also new studies considering the use of direct topologies, such
      as torus, dragonfly, and dragonfly+.  Some existing solutions for
      CC and AR coordination, e.g., PLB [PLB], rely on ECMP, which can
      only be used in topologies with equal-cost paths, like Clos.  For
      topologies without equal-cost paths, like dragonfly+, such
      solutions do not work.  The coordination scheme should be
      applicable to different topologies.

6.  Coordinated congestion management scheme

   The key to coordinated congestion management is to distinguish CC
   traffic from non-CC traffic, so that they are treated differently in
   the network when congestion occurs.  CC traffic consists of those
   packets which cause in-cast congestion.  Non-CC traffic is the rest
   of the packets in the network.

   CC traffic recognized by the network is notified to the source host.
   The subsequent packets of the same flow are tagged by the source
   host.  The tag instructs the network switch to perform the CC
   mechanism on those packets instead of AR.  For non-CC traffic, the
   network switch first performs AR.  Only when the AR mechanism cannot
   find a light-loaded path to switch to does the traffic become CC
   traffic, and CC is run to alleviate congestion.

   Coordinated congestion management requires interaction between
   network switches and source hosts.  The following sections explain
   the details of the scheme.

6.1.  Coordination tag

   A coordination tag is inserted into data packets by the source host
   when it sends out the packets.  The tag contains a CC indicator and
   an AR indicator.

   *  CC indicator: indicates whether the packet may cause in-cast
      congestion.

   *  AR indicator: indicates the location of the upstream AR point
      where adaptive routing can be performed.  The AR point can be a
      network switch or a source host.  The AR indicator can be an ID,
      an IP address, or other information which guides how to send a
      message to the AR point.

   The tag can be carried in data packets using an in-band telemetry
   scheme.  A new method, CSIG [I-D.draft-ravi-ippm-csig], may provide
   another option.
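
   The two fields of the tag can be pictured with a small data
   structure.  This is only a sketch: this document does not define a
   wire encoding, so the CoordinationTag class, its field names, and
   the use of a string for the AR indicator are assumptions.  The
   switch name L1-1-1 is borrowed from the example in Section 7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoordinationTag:
    """Hypothetical in-memory form of the coordination tag."""
    cc_indicator: bool = False          # set: packet may cause in-cast congestion
    ar_indicator: Optional[str] = None  # ID or IP address of the upstream AR point


# A tag as a source host would first emit it: the CC indicator is not
# set and no upstream AR point is recorded yet.
initial_tag = CoordinationTag()

# The same tag after a switch with several light-loaded paths has
# recorded itself as the AR point.
updated_tag = CoordinationTag(cc_indicator=False, ar_indicator="L1-1-1")
```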

6.2.  Notification message

   There are three types of notification message.

   *  Type 1: congestion control required
      Example: A type 1 message is sent from the switch experiencing
      in-cast congestion to the source host, notifying the source host
      to tag (set the CC indicator in) the packets belonging to the
      flow which causes in-cast congestion.

   *  Type 2: congestion control released
      Example: When in-cast congestion is eliminated, the switch sends
      a type 2 message to the corresponding source hosts, notifying
      them to clear the CC indicator in the subsequent packets of the
      corresponding flow.

   *  Type 3: upstream AR required
      Example: If the switch determines to perform AR upstream, a type
      3 message is sent to the upstream AR point.  The upstream AR
      point can be a one-hop neighbour of the switch or a point
      multiple hops away.

   The notification message includes source IP, destination IP,
   notification type and flow key.  Source IP is the IP address of the
   switch which sends the notification.  Destination IP is the IP
   address of the destination which will handle the notification
   message.  Notification type is one of the above three types.  Flow
   key is the information identifying the flow to be handled, such as
   its 5-tuple.
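
   The four fields of the notification message can be sketched as
   follows.  The class and member names, the representation of the
   flow key as a 5-tuple, and the example addresses (taken from
   documentation address ranges) are assumptions; no message encoding
   is defined in this document.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class NotificationType(IntEnum):
    CC_REQUIRED = 1           # Type 1: congestion control required
    CC_RELEASED = 2           # Type 2: congestion control released
    UPSTREAM_AR_REQUIRED = 3  # Type 3: upstream AR required


# Flow key as a 5-tuple: (src IP, dst IP, src port, dst port, protocol).
FlowKey = Tuple[str, str, int, int, int]


@dataclass(frozen=True)
class Notification:
    source_ip: str        # IP address of the switch sending the notification
    destination_ip: str   # IP address of the node that handles the notification
    ntype: NotificationType
    flow_key: FlowKey


# A last-hop switch asks the source host of a flow to set the CC indicator.
flow = ("192.0.2.10", "198.51.100.20", 49152, 4791, 17)
msg = Notification("203.0.113.1", flow[0], NotificationType.CC_REQUIRED, flow)
```

   For type 1 and type 2 messages the destination is the source host
   of the flow; for type 3 it is the upstream AR point read from the AR
   indicator.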

6.3.  Behavior of network switches

6.3.1.  Identify congestion type

   When congestion is detected, the network switch judges whether it is
   in-cast congestion.

   If congestion occurs at a switch egress port, and the switch is the
   last-hop switch to the destination host, the congestion is
   determined to be in-cast congestion.  The flows causing in-cast
   congestion are identified as in-cast flows.

   There may be other methods to identify the congestion type.  This
   document does not place any limitation on that.
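
   The last-hop rule above can be sketched as a simple check.  How
   congestion itself is detected is not specified here, so the queue
   depth threshold, the EgressPort class, and the function names are
   hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import List

QUEUE_THRESHOLD = 100  # hypothetical congestion threshold, in packets


@dataclass
class EgressPort:
    queue_depth: int              # current queue occupancy, in packets
    faces_destination_host: bool  # True if the port connects directly to a host
    flows: List[str] = field(default_factory=list)  # flows queued on the port


def is_congested(port: EgressPort) -> bool:
    return port.queue_depth > QUEUE_THRESHOLD


def incast_flows(port: EgressPort) -> List[str]:
    """Return the flows identified as in-cast flows, or an empty list if
    the port is not congested or the congestion is not in-cast."""
    if is_congested(port) and port.faces_destination_host:
        return list(port.flows)
    return []
```

   A congested port that does not face a destination host returns an
   empty list, so its traffic remains eligible for adaptive routing.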

6.3.2.  Notify CC congestion

   When in-cast congestion is determined by the network switch, it
   generates a type 1 notification message for each identified flow and
   sends the messages to the source hosts of the flows.  When the
   in-cast congestion is eliminated, the switch sends type 2
   notification messages to the source hosts.

6.3.3.  Notify upstream point to perform AR

   When the network switch determines to perform AR but cannot do it
   locally, and the AR indicator in the data packet shows that AR is
   available upstream, it sends a type 3 notification message to the
   upstream point identified by the AR indicator.

6.3.4.  Perform congestion control

   The network switch performs congestion control in the following
   cases.

   *  The congestion is identified as in-cast congestion.

   *  The congestion is not identified as in-cast congestion, but
      adaptive routing cannot be used because there is no available new
      path to switch traffic to, either locally or upstream.

   This document does not limit which CC mechanism is performed.

6.3.5.  Perform adaptive routing

   The network switch performs adaptive routing in the following cases.

   *  The packet is not in-cast traffic.  The CC indicator in the data
      packet is used to determine whether it is in-cast traffic.

   *  A type 3 notification message is received.  According to the flow
      information in the notification, a new path is selected for the
      subsequent packets of the flow.

   In order to enable upstream AR, it is required to update the AR
   indicator in data packets hop by hop.  When a data packet arrives at
   a network switch:

   *  If there are several local light-loaded paths available for AR on
      the switch, the switch updates the AR indicator in the data
      packet to point to itself, for example by writing its own ID.
      Then the switch selects the appropriate local path to send the
      data packet.  This document does not define the algorithm for
      local path selection.  It depends on the routing strategy of the
      network switch.

   *  If there is only one local light-loaded path available for AR,
      the network switch can only select that path for traffic.  The AR
      indicator in the data packet is not updated.

   *  If there is no local light-loaded path, the network switch
      determines upstream AR availability by reading the AR indicator
      in the data packet.  If the AR indicator indicates that an
      upstream point can perform AR, the network switch generates a
      type 3 notification message and sends it directly to the
      corresponding upstream point.  Otherwise, the network switch
      triggers a congestion control mechanism, such as setting ECN in
      the data packet.
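
   The per-packet decision described in Sections 6.3.4 and 6.3.5 can be
   summarized in a short sketch.  The Tag class, the decide function,
   the action names, and the path labels are all hypothetical; picking
   the first light-loaded path stands in for whatever local path
   selection algorithm the switch uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tag:
    cc_indicator: bool = False
    ar_indicator: Optional[str] = None


def decide(switch_id: str, tag: Tag, light_paths: List[str]) -> Tuple[str, Optional[str]]:
    """Return (action, argument) for one arriving data packet.

    Actions (illustrative names):
      "cc"           - do not switch path; congestion is handled by CC
      "forward"      - send the packet on the returned local path
      "notify_type3" - send a type 3 notification to the returned AR point
    """
    if tag.cc_indicator:                # in-cast traffic: CC instead of AR
        return ("cc", None)
    if len(light_paths) > 1:            # several choices: become the AR point
        tag.ar_indicator = switch_id
        return ("forward", light_paths[0])
    if len(light_paths) == 1:           # single choice: AR indicator unchanged
        return ("forward", light_paths[0])
    if tag.ar_indicator is not None:    # no local path: ask the upstream AR point
        return ("notify_type3", tag.ar_indicator)
    return ("cc", None)                 # no AR possible anywhere: fall back to CC


# The hops of packet P1 from the example in Section 7.
tag = Tag()
decide("L1-1-1", tag, ["to-L2-1-1", "to-L2-1-2"])
decide("L2-1-1", tag, ["to-L3-1-1", "to-L3-m-1"])
decide("L3-1-1", tag, ["to-L4-1-1"])
result = decide("L4-1-1", tag, [])  # ("notify_type3", "L2-1-1")
```

   Keeping the AR indicator unchanged at single-path switches is what
   lets a congested switch reach back past them to the nearest point
   that actually has a choice of paths.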

6.4.  Behavior of source hosts

   When receiving a type 1 notification message, the source host sets
   the CC indicator in the subsequent packets of the corresponding
   flow.

   When receiving a type 2 notification message, the source host clears
   the CC indicator in the subsequent packets of the corresponding
   flow.

   When receiving a type 3 notification message, the source host
   performs AR on the subsequent packets of the corresponding flow.

   When receiving congestion control signals while the CC indicator is
   set, the source host performs CC on the flow.
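
   The four host behaviors above can be sketched as a small state
   holder.  The SourceHost class and its members are hypothetical.
   Halving the rate and moving to the next path in round-robin order
   are placeholders for the CC and AR algorithms actually in use.  This
   document does not state what the host does with a congestion
   control signal when the CC indicator is not set; leaving the rate
   unchanged in that case is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

CC_REQUIRED, CC_RELEASED, UPSTREAM_AR_REQUIRED = 1, 2, 3


@dataclass
class SourceHost:
    num_paths: int = 2
    cc_flows: Set[str] = field(default_factory=set)         # flows tagged with CC indicator
    path_of: Dict[str, int] = field(default_factory=dict)   # current path index per flow
    rate_of: Dict[str, float] = field(default_factory=dict) # relative sending rate per flow

    def on_notification(self, ntype: int, flow: str) -> None:
        if ntype == CC_REQUIRED:
            self.cc_flows.add(flow)
        elif ntype == CC_RELEASED:
            self.cc_flows.discard(flow)
        elif ntype == UPSTREAM_AR_REQUIRED:
            # Placeholder AR: move the flow to the next path.
            self.path_of[flow] = (self.path_of.get(flow, 0) + 1) % self.num_paths

    def cc_indicator(self, flow: str) -> bool:
        """Value of the CC indicator for the flow's subsequent packets."""
        return flow in self.cc_flows

    def on_congestion_signal(self, flow: str) -> None:
        # CC is performed only while the CC indicator is set for the flow.
        if self.cc_indicator(flow):
            # Placeholder CC: halve the relative sending rate.
            self.rate_of[flow] = self.rate_of.get(flow, 1.0) / 2
```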

7.  An example of end-to-end procedure

   The network topology is shown in Figure 1.  This is a 4-layer
   fat-tree topology.  There are n computing racks and m switching
   racks.  Computing racks contain source hosts, layer 1 switches, and
   layer 2 switches.  Switching racks contain layer 3 and layer 4
   switches.

         Switching Rack 1    Switching Rack m
         +---------------+   +---------------+
         |L4-1-1...L4-1-e|   |L4-m-1...L4-m-e|
         |  | \    / |   |   |  | \    / |   |
         |  |  \  /  |   |   |  |  \  /  |   |
         |  |   \/   |   |   |  |   \/   |   |
         |  |   /\   |   |...|  |   /\   |   |
         |  |  /  \  |   |   |  |  /  \  |   |
         |  | /    \ |   |   |  | /    \ |   |
         |L3-1-1...L3-1-d|   |L3-m-1...L3-m-d|
         +--+-----------\    +-/----------+--+
            |            \    /           |
            |             \  /            |
            |  ......      \/     ......  |
            |              /\             |
            |             /  \            |
            |            /    \           |
         +--+-----------/      \----------+---+
         |L2-1-1...L2-1-c|    |L2-n-1...L2-n-c|
         |  | \    / |   |    |  | \    / |   |
         |  |  \  /  |   |    |  |  \  /  |   |
         |  |   \/   |   |    |  |   \/   |   |
         |  |   /\   |   |... |  |   /\   |   |
         |  |  /  \  |   |    |  |  /  \  |   |
         |  | /    \ |   |    |  | /    \ |   |
         |L1-1-1...L1-1-b|    |L1-n-1...L1-n-b|
         |  +        +   |    |  +        +   |
         | H-1-1... H-1-a|    | H-n-1... H-n-a|
         +---------------+    +---------------+
         Computing Rack 1     Computing Rack n

                         Figure 1: Network Topology

   *  Host H-1-1 in computing rack 1 sends out a data packet P1
      belonging to flow F1 to H-n-1 in computing rack n.  The CC
      indicator in the packet tag is not set, indicating that this
      packet does not belong to an in-cast flow.  The AR indicator in
      the packet tag does not point to any available AR point.

   *  P1 arrives at switch L1-1-1 in computing rack 1.  L1-1-1 has
      multiple light-loaded paths for AR.  The path from L1-1-1 to
      L2-1-1 is selected for P1.  The AR indicator in the P1 tag is
      updated to L1-1-1.

   *  P1 arrives at switch L2-1-1.  L2-1-1 also has multiple light-
      loaded paths for AR.  The path from L2-1-1 to L3-1-1 is selected
      for P1.  The AR indicator in the P1 tag is updated to L2-1-1.

   *  P1 arrives at switch L3-1-1.  L3-1-1 has only one light-loaded
      path.  The only path, from L3-1-1 to L4-1-1, is selected for P1.
      The AR indicator in the P1 tag remains L2-1-1.

   *  P1 arrives at switch L4-1-1.  L4-1-1 is congested and has no
      local path available for performing AR.  By reading the AR
      indicator in P1, L4-1-1 sends a type 3 notification to L2-1-1.

   *  After receiving the AR notification, L2-1-1 switches the path
      from L2-1-1->L3-1-1 to L2-1-1->L3-m-1 for the new incoming
      packets of flow F1.

   *  After a while, L1-n-1 is congested due to in-cast.  The flow F1
      is identified as an in-cast flow.  L1-n-1 sends a type 1
      notification to H-1-1.

   *  On receiving the type 1 notification, H-1-1 sets the CC indicator
      in the subsequent packets of F1, indicating that the packets are
      in an in-cast flow.  Thus AR will not be performed on those
      packets.  The sending rate of F1 will also be reduced according
      to the congestion control algorithm.

8.  Security Considerations


9.  IANA Considerations


10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

10.2.  Informative References

   [I-D.draft-ravi-ippm-csig]
              Ravi, A., Dukkipati, N., Mehta, N., and J. Kumar,
              "Congestion Signaling (CSIG)", Work in Progress, Internet-
              Draft, draft-ravi-ippm-csig-01, 2 February 2024.

   [DCQCN]    "Congestion Control for Large-Scale RDMA Deployments",
              August 2015.

   [Timely]   "TIMELY: RTT-based Congestion Control for the Datacenter",
              August 2015.

   [PLB]      "PLB: Congestion Signals are Simple and Effective for
              Network Load Balancing", August 2022.

Authors' Addresses

   Yunping(Lily) Lyu

   Yuhan Zhang

   Mengzhu Liu
