Adaptive Routing Framework
draft-cheng-rtgwg-adaptive-routing-framework-00

The information below is for an old version of the document.
Document Type
This is an older version of an Internet-Draft whose latest revision state is "Active".
Authors Weiqiang Cheng, Changwang Lin, Jiaming Ye
Last updated 2024-07-04
Network Working Group                                          W. Cheng
Internet Draft                                             China Mobile
Intended status: Informational                                   C. Lin
Expires: January 2, 2025                           New H3C Technologies
                                                                  J. Ye
                                                           China Mobile
                                                           July 4, 2024

                        Adaptive Routing Framework
              draft-cheng-rtgwg-adaptive-routing-framework-00

Abstract

   This document describes a framework for Adaptive Routing.
   Specifically, it identifies a set of adaptive routing components,
   explains their interactions, and exemplifies the workflow mechanism.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 2, 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Cheng, et al.          Expires January 2, 2025                [Page 1]
Internet-Draft        Adaptive Routing Framework             July 2024

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Problem Analysis
         2.1.1. Use Case 1
         2.1.2. Use Case 2
   3. Solution
   4. Framework
      4.1. Framework Overview
      4.2. Remote Path Info
      4.3. Routing Plane
      4.4. Forwarding Plane
      4.5. Adaptive Routing Mode
      4.6. Congestion Detection
      4.7. Congestion Notify
   5. Work Flow
      5.1. Remote Link Congestion Adjustment
      5.2. Remote Flow Congestion Adjustment
   6. Security Considerations
   7. IANA Considerations
   8. References
      8.1. Normative References
   Authors' Addresses

1. Introduction

   In many cases, ECMP flow-based hashing leads to high congestion and
   variable flow completion times, which reduces application
   performance. Load balancing based on local link quality is not
   always optimal. A global view of congestion, with information from
   remote links, is needed for optimal balancing.

   Adaptive routing is a network routing mechanism that dynamically
   adjusts routing paths based on changes in network conditions,
   thereby optimizing network performance and resource utilization.

   This document describes a framework for Adaptive Routing.
   Specifically, it identifies a set of adaptive routing components,
   explains their interactions, and exemplifies the workflow mechanism.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2. Problem Analysis

   Current AI networks carry a small number of flows, each with a heavy
   load. The commonly used load balancing strategy employs an N-tuple
   hash algorithm to forward traffic on a per-flow basis. In AI
   networks, this strategy can easily lead to load imbalance and
   therefore to network congestion.
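
   The imbalance described above can be illustrated with a minimal
   sketch. It assumes a SHA-256 based hash over the five-tuple; real
   devices use their own hash functions, and the addresses, ports, and
   rates below are hypothetical.

```python
import hashlib

def ecmp_hash_pick(five_tuple, num_paths):
    """Pick a path index by hashing the flow's five-tuple (per-flow ECMP)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def load_per_path(flows, num_paths):
    """Sum the offered load that lands on each path."""
    load = [0.0] * num_paths
    for five_tuple, rate in flows:
        load[ecmp_hash_pick(five_tuple, num_paths)] += rate
    return load

# Four heavy flows (hypothetical values) hashed onto two uplinks.
flows = [
    (("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"), 100.0),
    (("10.0.0.2", "10.0.1.2", 49153, 4791, "udp"), 100.0),
    (("10.0.0.3", "10.0.1.3", 49154, 4791, "udp"), 100.0),
    (("10.0.0.4", "10.0.1.4", 49155, 4791, "udp"), 100.0),
]
print(load_per_path(flows, 2))
```

   With a uniform hash, four flows split evenly across two paths in
   only 6 of the 16 equally likely assignments, so an uneven split is
   the more likely outcome when flows are few and heavy.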

   When network congestion occurs, the current adjustment strategy
   typically has devices near the congestion point switch links based
   on the local link congestion state. This approach is inefficient
   because adjustments made near the congestion point have limited
   impact. If load balancing adjustments could instead be initiated at
   the first routing devices on the path, those closest to the traffic
   source, load balancing would be significantly more efficient.

2.1.1. Use Case 1

            +--+        +--+
     Spine  |R1|        |R2|
            +--+        +--+
             | \        / |
             |   \    /   |
             |     \/     |
             |     /\     X <- congested
             |   /    \   |
             | /        \ |
            +--+        +--+
     Leaf   |R3|        |R4|
            +--+        +--+
             ^            |
             |            v
            Source      Destination

              Figure 1 Spine-Leaf network

   In the Spine-Leaf network shown in Figure 1, assume that the R2-R4
   link becomes congested. R3 has no knowledge of this remote
   congestion and continues to send traffic to both R1 and R2.
   Continuing to forward traffic at the current rate through R2
   exacerbates the congestion on the R2-R4 link and leads to the loss
   of some traffic.

2.1.2. Use Case 2

        Source
          |
          v
     +---------+
     |         |
     | Group 1 |-------------+
     |         |             |
     +---------+             |
          |             +---------+
          |             |         |
          X<- congested | Group 3 |
          |             |         |
          |             +---------+
     +---------+             |
     |         |             |
     | Group 2 |-------------+
     |         |
     +---------+
          |
          v
     Destination

              Figure 2 Dragon-fly network

   In the dragon-fly network shown in Figure 2, the ECMP paths include
   Group1->Group2 and Group1->Group3->Group2 for load balancing. When
   the link between Group1 and Group2 becomes congested, Group1
   continues to send traffic at the current rate through the
   Group1->Group2 link, exacerbating the congestion and causing the
   loss of some traffic.

3. Solution

   Using a weighted load balancing strategy instead of a hash-based
   strategy can more fully utilize the bandwidth resources of multiple
   links. By assigning forwarding weights based on the state of each
   link, the load can be more evenly balanced.

   Additionally, dynamically adjusting the weights of each link
   according to congestion conditions allows for better adaptation and
   adjustment to bursty traffic in AI networks.
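
   A minimal sketch of weighted selection follows. It assumes that each
   new forwarding decision draws a path at random in proportion to the
   path weights; the path names, weights, and the use of a seeded
   random generator are illustrative assumptions, not part of this
   framework.

```python
import random
from collections import Counter

def pick_paths(weights, count, rng):
    """Draw `count` paths at random, in proportion to the given weights."""
    paths = list(weights)
    return rng.choices(paths, weights=[weights[p] for p in paths], k=count)

rng = random.Random(0)                      # seeded for repeatability
weights = {"via-R1": 50, "via-R2": 50}
print(Counter(pick_paths(weights, 10000, rng)))

weights["via-R2"] = 10                      # congestion reported behind R2
print(Counter(pick_paths(weights, 10000, rng)))
```

   Lowering one weight shifts the expected share of decisions away from
   that path without removing the path from the ECMP set.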

   For example, in Figure 1, when R2 detects congestion on the R2->R4
   link, it sends the congestion information to R3 via the control
   plane. R3 then dynamically adjusts the forwarding weights of the
   ECMP paths based on the congestion status, reducing the forwarding
   weight for the congested link, thereby decreasing the traffic
   directed to that link and alleviating its load. Once the congestion
   is cleared, R2 sends a congestion clearance message to R3 via the
   control plane, and R3 restores the original forwarding weight for
   that link.

   In Figure 2, the egress router in Group 1 detects inter-group link
   congestion and sends a congestion message to the ingress router via
   the control plane. The ingress router dynamically adjusts the
   forwarding weights of the ECMP paths based on the congestion status,
   reducing the traffic through the Group1->Group2 link to alleviate
   the load on the congested link. Once the congestion is cleared, the
   egress router in Group 1 notifies the ingress router in Group 1 of
   the congestion-cleared message, and the ingress router restores the
   ECMP link weights.
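
   The reduce-and-restore behavior in both examples can be sketched as
   follows. The class and method names are hypothetical, and the
   reduced weight is passed in by the caller because this document
   does not define how it is derived.

```python
class EcmpGroup:
    """ECMP next hops for one prefix, with a saved copy of the base weights."""

    def __init__(self, base_weights):
        self.base = dict(base_weights)      # weights before any congestion
        self.current = dict(base_weights)   # weights used for forwarding

    def on_congestion(self, next_hop, reduced_weight):
        """Lower the weight of the next hop that leads to the congested link."""
        if next_hop in self.current:
            self.current[next_hop] = min(reduced_weight, self.base[next_hop])

    def on_congestion_cleared(self, next_hop):
        """Restore the original weight once the clearance message arrives."""
        if next_hop in self.current:
            self.current[next_hop] = self.base[next_hop]

group = EcmpGroup({"R1": 50, "R2": 50})
group.on_congestion("R2", reduced_weight=10)
print(group.current)   # {'R1': 50, 'R2': 10}
group.on_congestion_cleared("R2")
print(group.current)   # {'R1': 50, 'R2': 50}
```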

4. Framework

4.1. Framework Overview

   A high-level view of the Adaptive Routing framework, without
   expanding the functional entities in the network, is illustrated in
   Figure 3.

     +-------------+
     |Routing Plane|
     +-------------+
             |
             | Remote Path Info
             v
     +----------------+       +-----------------------+
     |Forwarding Plane|<------|Adaptive Routing Policy|
     +----------------+       +-----------------------+
                                        ^
                                        | Congestion Notify
                                        |
                            +----------------------------+
                            |Remote Congestion Detection |
                            +----------------------------+

              Figure 3 Adaptive Routing Framework Overview

   Moving from the top of Figure 3 to the bottom, the following
   components are defined:

    * Routing Plane: Responsible for the transmission and calculation of
     routes. The calculated routes should include remote path
     information. The routes and remote Path Info should be correlated
     and updated to the Forwarding Plane.

    * Forwarding Plane: Responsible for forwarding traffic. It adjusts
     paths based on the Adaptive Routing Policy and on remote link
     congestion information, and forwards traffic according to the
     adjusted forwarding strategy.

    * Adaptive Routing Policy: Responsible for processing remote link
     congestion information or flow information, dynamically adjusting
     routing accordingly, and updating the Forwarding Plane.

    * Remote Congestion Detection: Responsible for detecting link
     congestion and sending Congestion Notification to neighboring
     devices.

4.2. Remote Path Info

   Currently, the forwarding table contains information about the route
   destination, next hop, and exit interface. Local dynamic load
   balancing can dynamically adjust the weight of load distribution
   based on the link metric of local interfaces, such as interface
   traffic load and queue size.

   Load balancing based on local link quality is not always optimal.
   Global congestion awareness, with information from remote links, is
   needed for optimal balancing. Therefore, the forwarding table needs
   to contain not only local exit interface information but also remote
   path info and remote link congestion information.

   Remote path info can be remote links or remote nodes, specifically
   as follows:

    * For BGP-based networks: Remote path info can be the BGP identifier
     corresponding to the next-next-hop, as described in [I-D.wang-idr-
     next-next-hop-nodes]. It can also be the BGP AS-PATH information
     or BGP router-id, which is not detailed in this document.

    * For IGP-based networks: Remote path info can be the interface
     on the next-hop neighbor device that leads to the next-next-hop
     device, identified by the interface index or the interface's
     local address.

   By using remote path info, routes can be associated with remote
   paths.
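
   One way to picture this association is a route whose next hops each
   carry a remote path record. The sketch below is illustrative only:
   the type names, fields, prefix, addresses, and interface names are
   all hypothetical, and the remote path is modeled as a pair of node
   names rather than any specific BGP or IGP identifier.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RemotePath:
    """Remote link named by the next-hop neighbor and the node after it."""
    neighbor: str     # e.g. router-id of the next hop
    remote: str       # e.g. router-id or interface toward the next-next-hop

@dataclass
class NextHop:
    address: str
    out_interface: str
    remote_path: RemotePath

@dataclass
class Route:
    prefix: str
    next_hops: list = field(default_factory=list)

def next_hops_on_path(routes, remote_path):
    """Return the (prefix, next-hop address) pairs that use a remote path."""
    return [(route.prefix, nh.address)
            for route in routes
            for nh in route.next_hops
            if nh.remote_path == remote_path]

# R3's view of Figure 1, using hypothetical documentation addresses.
routes = [
    Route("198.51.100.0/24", [
        NextHop("192.0.2.1", "eth1", RemotePath("R1", "R4")),
        NextHop("192.0.2.2", "eth2", RemotePath("R2", "R4")),
    ]),
]
print(next_hops_on_path(routes, RemotePath("R2", "R4")))
# [('198.51.100.0/24', '192.0.2.2')]
```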

4.3. Routing Plane

   When calculating routes, the Routing Plane needs to learn the remote
   path of each route and attach this path information to the next hop.

   In a BGP-based network, a BGP route may carry the router-id of the
   peer from which that route is received, and the router-id will be
   added into the path information when calculating that route. BGP may
   need some extensions to support such a feature. For specific
   extensions, see [I-D.wang-idr-next-next-hop-nodes]; other possible
   extensions are not detailed in this document.

   In an IGP-based network, a router may compute the path information
   based on the SPF tree and attach it to the next hop. Path info can
   be a link-local address, interface ID, or Link Local Identifier, or
   other extensions. The detailed mechanisms are out of the scope of
   this document.

4.4. Forwarding Plane

   Figure 4 is a schematic of forwarding table maintenance. For each
   prefix, the next hop and weight corresponding to each path are
   recorded. The next hop of the prefix is constructed from the local
   next hop and remote path information. The forwarding weight is
   determined by the quality of the local next-hop interface
   (local(q)) and the quality of the remote link in the remote path
   (remote(q)).

   When responding to local congestion events, the next-hop address in
   the congestion event is used to find the corresponding ECMP entry,
   and the weight of this ECMP entry is modified according to the
   congestion level.

   When responding to remote congestion events, the path info in the
   congestion message is used to find the corresponding ECMP entry. The
   link quality of the remote path is updated, and a new weight value
   is calculated based on the local and remote link quality. Then the
   weight of this ECMP entry is modified according to the congestion
   level.

     +------+       +--------------------------+ local(q)+remote(q)
     |Prefix|---+-->|Next-hop: to R1, Weight w1|<----------------|
     +------+   |   +--------------------------+                 |
                |           |           +------------+   +--------+
                |           +---------->|Path: R1->R4|-->|Quality1|
                |                       +------------+   +--------+
                |   +--------------------------+ local(q)+remote(q)
                +-->|Next-hop: to R2, Weight w2|<----------------|
                    +--------------------------+                 |
                            |           +------------+   +--------+
                            +---------->|Path: R2->R4|-->|Quality2|
                                        +------------+   +--------+
           Figure 4 Forwarding table for Adaptive Routing
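
   The table in Figure 4 can be sketched as below. This document does
   not define how local(q) and remote(q) are combined, so the sketch
   assumes that each quality is a non-negative score (higher is
   better) and that the weight is simply their sum, read literally
   from the figure. All class and method names are hypothetical.

```python
class AdaptiveEntry:
    """One ECMP entry: local next hop, remote path, and both quality values."""

    def __init__(self, next_hop, path, local_q, remote_q):
        self.next_hop = next_hop
        self.path = path
        self.local_q = local_q
        self.remote_q = remote_q

    @property
    def weight(self):
        # Assumed combination rule: weight = local(q) + remote(q).
        return self.local_q + self.remote_q

class PrefixEntry:
    """All ECMP entries recorded for one prefix."""

    def __init__(self, entries):
        self.entries = list(entries)

    def on_local_event(self, next_hop, local_q):
        """Local congestion events are matched on the next-hop address."""
        for entry in self.entries:
            if entry.next_hop == next_hop:
                entry.local_q = local_q

    def on_remote_event(self, path, remote_q):
        """Remote congestion events are matched on the path info."""
        for entry in self.entries:
            if entry.path == path:
                entry.remote_q = remote_q

    def weights(self):
        return {entry.next_hop: entry.weight for entry in self.entries}

table = PrefixEntry([
    AdaptiveEntry("R1", ("R1", "R4"), local_q=25, remote_q=25),
    AdaptiveEntry("R2", ("R2", "R4"), local_q=25, remote_q=25),
])
print(table.weights())                     # {'R1': 50, 'R2': 50}
table.on_remote_event(("R2", "R4"), 5)
print(table.weights())                     # {'R1': 50, 'R2': 30}
```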

   When the number of flows is small or when there are elephant flows,
   adaptive routing needs to be performed through flow redirection.
   Figure 5 is a schematic of flow table maintenance in the Forwarding
   Plane. The flow table is keyed by the five-tuple of the traffic and
   records the path information corresponding to each flow.

   When responding to remote flow congestion events as described in
   Section 4.7, a new ECMP path is selected for the flow, and the flow
   is redirected to the least loaded ECMP path.

     +------+
     |SAddr |
     |DAddr |
     |SPort |       +------------------+
     |DPort |------>|Next-hop: to R1   |
     |Proto |       +------------------+
     +------+
               Figure 5 Flow table
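
   A minimal sketch of the flow table and of flow redirection follows.
   It assumes that a per-path load estimate is available to the
   Forwarding Plane and that the flow's current path is excluded when
   choosing the new one; neither detail is specified by this document,
   and the names and values are hypothetical.

```python
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "saddr daddr sport dport proto")

def redirect_flow(flow_table, flow, path_load):
    """Re-pin a congested flow to the least loaded ECMP path.

    flow_table: dict mapping FiveTuple -> next hop currently used.
    path_load:  dict mapping next hop -> load estimate (any consistent unit).
    The flow's current path is excluded from the candidates (an assumption).
    """
    current = flow_table.get(flow)
    candidates = {nh: load for nh, load in path_load.items() if nh != current}
    if not candidates:
        return current          # nowhere else to go; leave the flow as is
    best = min(candidates, key=candidates.get)
    flow_table[flow] = best
    return best

flow = FiveTuple("10.0.0.1", "10.0.1.1", 49152, 4791, "udp")  # hypothetical
flow_table = {flow: "R2"}
print(redirect_flow(flow_table, flow, {"R1": 0.3, "R2": 0.9}))  # R1
```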

4.5. Adaptive Routing Mode

   For network congestion, detection can be performed either on a per-
   link basis or on a per-flow basis.

   Link-based congestion detection and flow-based congestion detection
   can also be used in combination.

   For link-level congestion events, the forwarding weights of the
   corresponding ECMP links in the forwarding table are adjusted,
   thereby affecting the weight distribution of subsequent traffic for
   load balancing and reducing the traffic weight on the congested
   link. The forwarding weights are calculated based on the quality of
   the local link and the quality of the remote link.

   For flow-level congestion events, the corresponding flow is
   redirected to ECMP links with lower loads.

   Network congestion can be divided into multiple levels based on its
   severity, for example levels 1 to 7 corresponding to link congestion
   from mild to severe. The Adaptive Routing Policy adjusts the ECMP
   link weights according to the congestion level.
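
   As an illustration of a level-to-weight mapping, the sketch below
   assumes a linear reduction in which level 0 means no congestion and
   each higher level removes one eighth of the base weight. The
   function name, the linear rule, and the floor value are assumptions;
   this document does not mandate any particular mapping.

```python
MAX_LEVEL = 7   # most severe congestion level, as in the example levels 1 to 7

def weight_for_level(base_weight, level, floor=1):
    """Scale a base weight down linearly as the congestion level rises."""
    if not 0 <= level <= MAX_LEVEL:
        raise ValueError("congestion level must be between 0 and 7")
    scaled = base_weight * (MAX_LEVEL + 1 - level) // (MAX_LEVEL + 1)
    return max(floor, scaled)   # keep the path usable, never weight 0

print([weight_for_level(50, level) for level in range(MAX_LEVEL + 1)])
# [50, 43, 37, 31, 25, 18, 12, 6]
```

   The floor keeps a small amount of traffic on the congested path; an
   implementation could equally choose to drop the weight to zero at
   the highest level.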

4.6. Congestion Detection

   Congestion detection is generally performed by devices near the
   congestion point and covers both the onset of link congestion and
   congestion clearance. Network performance and congestion points can
   be identified by sending test traffic. A device may send a
   congestion notification when a queue exceeds a threshold depth.
   Congestion can also be inferred by monitoring the packet loss rate
   of a link. Specific congestion detection methods are beyond the
   scope of this document.
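
   As one possible illustration of queue-based detection, the sketch
   below reports congestion when the queue depth rises above one
   threshold and reports clearance only when it falls below a second,
   lower threshold. The two-threshold scheme, the class name, and the
   threshold values are assumptions made for this example.

```python
class QueueCongestionDetector:
    """Report congestion onset and clearance from queue depth samples."""

    def __init__(self, set_threshold, clear_threshold):
        if clear_threshold > set_threshold:
            raise ValueError("clear_threshold must not exceed set_threshold")
        self.set_threshold = set_threshold
        self.clear_threshold = clear_threshold
        self.congested = False

    def sample(self, queue_depth):
        """Return "congested", "cleared", or None if the state is unchanged."""
        if not self.congested and queue_depth > self.set_threshold:
            self.congested = True
            return "congested"
        if self.congested and queue_depth < self.clear_threshold:
            self.congested = False
            return "cleared"
        return None

detector = QueueCongestionDetector(set_threshold=800, clear_threshold=400)
print([detector.sample(depth) for depth in [100, 900, 700, 850, 300, 200]])
# [None, 'congested', None, None, 'cleared', None]
```

   Using two thresholds avoids sending a new notification each time
   the queue depth oscillates around a single value.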

4.7. Congestion Notify

   When a change in congestion status is detected, it needs to be
   communicated to remote devices in order to adjust traffic scheduling
   from the source.

   Congestion messages can be of two types:

   1) The first type includes path information, which helps in
      identifying the corresponding route for adjustments. It also
      includes the congestion information of the link corresponding to
      the path. With this information, global congestion calculation can
      be performed to derive the weight information for the forwarding
      table. For details, refer to Section 4.4.

   2) The second type includes the five-tuple information of the
      congested flow. By using this congested flow information,
      redirection of the congested flow can be implemented. For details,
      refer to Sections 4.4 and 4.5.

   Congestion messages can be delivered by extending an IGP to transmit
   link state information within the IGP domain, or by extending BGP
   and setting up BGP route reflectors to communicate between BGP
   neighbors. Alternatively, new protocols can be designed for this
   purpose. Congestion messages can be transmitted in-band or
   out-of-band. For high-performance solutions, additional protocols
   may be needed for efficient out-of-band message transmission.
   Specific methods are beyond the scope of this document.
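
   The two message types can be modeled as simple records, as in the
   sketch below. The field names and the use of level 0 to signal
   clearance are assumptions for illustration; no wire format is
   defined or implied.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkCongestion:
    """Type 1: path info plus the congestion state of that path's link."""
    node: str        # device that detected the congestion, e.g. "R2"
    next_hop: str    # far end of the congested link, e.g. "R4"
    level: int       # assumed: 0 means cleared, 1..7 mean mild to severe

@dataclass(frozen=True)
class FlowCongestion:
    """Type 2: five-tuple of the congested flow."""
    saddr: str
    daddr: str
    sport: int
    dport: int
    proto: str

def target_table(message):
    """Name the table that a receiver would update for this message."""
    if isinstance(message, LinkCongestion):
        return "forwarding table"
    if isinstance(message, FlowCongestion):
        return "flow table"
    raise TypeError(f"unknown congestion message: {message!r}")

print(target_table(LinkCongestion("R2", "R4", level=3)))
# forwarding table
print(target_table(FlowCongestion("10.0.0.1", "10.0.1.1", 49152, 4791, "udp")))
# flow table
```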

5. Work Flow

5.1. Remote Link Congestion Adjustment

   As shown in Figure 1, the workflow for handling remote link
   congestion is as follows:

   1) In the initial state, there are two paths from R3 to R4:
      R3->R1->R4 and R3->R2->R4. Assume the initial weights are the
      same, set to 50 for both. The initial table entries are as shown
      in Figure 6.

   2) R2 detects a change in congestion on the R2->R4 link using
      congestion detection methods and classifies the congestion into
      levels according to severity.

   3) R2 notifies the remote device R3 of the congestion change event,
      including the congested node (R2), the next-hop information (R4),
      and the congestion level.

   4) R3 receives the remote notification and, based on the congested
      node (R2) and next-hop information (R4), looks up its local
      forwarding table. It then adjusts the forwarding weights of the
      corresponding ECMP entries according to the congestion level. In
      this example, the weight of the entry via R2 is reduced to 10, as
      shown in Figure 7.

   5) When R3 receives new traffic, it performs load balancing according
      to the adjusted forwarding weights.

     +------+       +--------------------------+
     |Prefix|---+-->|Next-hop: to R1, Weight 50|
     +------+   |   +--------------------------+
                |           |           +----------------+
                |           +---------->|Path: R1->R4    |
                |                       +----------------+
                |   +--------------------------+
                +-->|Next-hop: to R2, Weight 50|
                    +--------------------------+
                            |           +----------------+
                            +---------->|Path: R2->R4    |
                                        +----------------+

           Figure 6 Initial forwarding table

     +------+       +--------------------------+
     |Prefix|---+-->|Next-hop: to R1, Weight 50|
     +------+   |   +--------------------------+
                |           |           +----------------+
                |           +---------->|Path: R1->R4    |
                |                       +----------------+
                |   +--------------------------+
                +-->|Next-hop: to R2, Weight 10|
                    +--------------------------+
                            |           +----------------+
                            +---------->|Path: R2->R4    |
                                        +----------------+

           Figure 7 Adaptive forwarding table
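
   The effect of the weight change between Figure 6 and Figure 7 can
   be worked out directly, assuming that each next hop receives
   traffic in proportion to its weight. The helper name below is
   hypothetical.

```python
from fractions import Fraction

def traffic_share(weights):
    """Return each next hop's share of traffic as weight / sum of weights."""
    total = sum(weights.values())
    return {next_hop: Fraction(w, total) for next_hop, w in weights.items()}

before = traffic_share({"R1": 50, "R2": 50})   # Figure 6
after = traffic_share({"R1": 50, "R2": 10})    # Figure 7
print(before["R2"], after["R2"])   # 1/2 1/6
print(before["R1"], after["R1"])   # 1/2 5/6
```

   Under this assumption, the share of new traffic sent toward the
   congested R2->R4 link falls from one half to one sixth, while the
   share sent via R1 rises to five sixths.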

5.2. Remote Flow Congestion Adjustment

   As shown in Figure 1, the workflow for handling remote flow
   congestion is as follows:

   1) R2 detects congestion on a specific flow passing through the
      R2->R4 link using congestion detection methods;

   2) R2 notifies the remote device R3 of the congestion change event,
      including the congested path info and flow information;

   3) R3 receives the flow congestion event and looks up the flow table
      based on the flow information, redirecting the flow to the least
      loaded link among the ECMP links;

   4) Subsequently, the flow is forwarded according to the new flow
      table.

6. Security Considerations

   TBD.

7. IANA Considerations

   TBD.

8. References

8.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, May 2017.

Authors' Addresses

   Weiqiang Cheng
   China Mobile
   China
   Email: chengweiqiang@chinamobile.com

   Changwang Lin
   New H3C Technologies
   China
   Email: linchangwang.04414@h3c.com

   Jiaming Ye
   China Mobile
   China
   Email: yejiaming@chinamobile.com
