Network Working Group                                           W. Cheng
Internet Draft                                              China Mobile
Intended status: Informational                                    C. Lin
Expires: April 28, 2025                             New H3C Technologies
                                                                 K. Wang
                                                        Juniper Networks
                                                                   J. Ye
                                                               R. Zhuang
                                                            China Mobile
                                                                  P. Huo
                                                               ByteDance
                                                        October 20, 2024
Adaptive Routing Framework
draft-cheng-rtgwg-adaptive-routing-framework-03
Abstract
In many cases, ECMP (Equal-Cost Multi-Path) flow-based hashing leads
to high congestion and variable flow completion time. This reduces
application performance. Load balancing based on local link quality
is not always optimal. A global view of congestion, with information
from remote links, is needed for optimal balancing. Adaptive routing
is a technology that makes dynamic routing decisions based on changes
in traffic load and network topology.
This document describes a framework for Adaptive Routing.
Specifically, it identifies a set of adaptive routing components,
explains their interactions, and exemplifies the workflow mechanism.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 28, 2025.
Cheng, et al. Expires April 28, 2025 [Page 1]
Internet-Draft Adaptive Routing Framework October 2024
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Revised BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
   1.1. Requirements Language
2. Problem Analysis
   2.1. Use Case 1
   2.2. Use Case 2
3. Solution
   3.1. Flow-based solution
      3.1.1. Weight-Based Dynamic ECMP Flow Mode
      3.1.2. Flow Redirection Mode
   3.2. Packet-based solution
4. Framework
   4.1. Framework Overview
   4.2. Remote Path Info
   4.3. Routing Plane
   4.4. Forwarding Plane
   4.5. Adaptive Routing Mode
      4.5.1. Flow-Based Adjustment Mode
      4.5.2. Packet-Based Adjustment Mode
   4.6. Congestion Detection
   4.7. Congestion definition
   4.8. Congestion Notify
5. Work Flow
   5.1. Weight-Based Dynamic ECMP Flow Adjustment Mode
   5.2. Flow Redirect Mode
   5.3. Packet-Based Adjustment Mode
6. Security Considerations
7. IANA Considerations
8. References
   8.1. Normative References
   8.2. Informative References
9. Acknowledgments
Authors' Addresses
1. Introduction
In many cases, ECMP (Equal-Cost Multi-Path) flow-based hashing leads
to high congestion and variable flow completion time. This reduces
application performance. Load balancing based on local link quality
is not always optimal. A global view of congestion, with information
from remote links, is needed for optimal balancing.
Adaptive routing is a network routing mechanism that dynamically
adjusts routing paths based on changes in network conditions,
thereby optimizing network performance and resource utilization.
This document defines a framework for Adaptive Routing.
Specifically, it identifies adaptive routing components, explains
their interactions, and illustrates the workflow mechanism. It
focuses exclusively on dynamic load balancing for existing loop-free
multiple paths, allowing adjustments based on remote link quality.
The formation of loop-free paths is outside the scope of this
document.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Problem Analysis
Current AI networks carry a small number of flows, each with a heavy
load. The commonly used load balancing strategy applies an N-tuple
hash algorithm to forward traffic on a per-flow basis. In such
networks, this strategy can easily map several heavy flows to the
same link, leading to load imbalance and network congestion.
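The effect can be illustrated with a minimal Python sketch. The flow
sizes, addresses, link names, and the use of SHA-256 as the hash
function are assumptions made for illustration only; real devices
use their own hash functions.

```python
import hashlib
from collections import defaultdict

def ecmp_hash_select(five_tuple, links):
    """Pick a link from a static hash of the flow's five-tuple."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]

def load_per_link(flows, links):
    """Sum the offered load (Gbit/s) that hashing places on each link."""
    load = defaultdict(float)
    for five_tuple, gbps in flows:
        load[ecmp_hash_select(five_tuple, links)] += gbps
    return {link: load[link] for link in links}

# Hypothetical workload: four heavy flows of 100 Gbit/s each.
links = ["R3->R1", "R3->R2"]
flows = [
    (("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"), 100.0),
    (("10.0.0.2", "10.0.1.2", 4791, 4791, "UDP"), 100.0),
    (("10.0.0.3", "10.0.1.3", 4791, 4791, "UDP"), 100.0),
    (("10.0.0.4", "10.0.1.4", 4791, 4791, "UDP"), 100.0),
]
print(load_per_link(flows, links))
```

Because the selection depends only on header fields, nothing prevents
most or all of the four heavy flows from being mapped to the same
link, regardless of how loaded that link already is.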
When network congestion occurs, the current load balancing
adjustment strategy typically has the devices near the congestion
point switch links based on the local link congestion state.
However, this approach is inefficient because adjustments made by
devices near the congestion point have limited impact. If load
balancing adjustments could be initiated by the earliest routing
devices on the path, the efficiency of load balancing would improve
significantly.
2.1. Use Case 1
       +--+         +--+
Spine  |R1|         |R2|
       +--+         +--+
        |   \    /   |
        |    \  /    |
        |     \/     |
        |     /\     X <- congested
        |    /  \    |
        |   /    \   |
       +--+         +--+
Leaf   |R3|         |R4|
       +--+         +--+
        ^            |
        |            v
     Source     Destination
Figure 1 Spine-Leaf network
In the Spine-Leaf network shown in Figure 1, assume that the R2-R4
link becomes congested. R3 will continue to send traffic to both R1
and R2. Continuing to forward traffic at the current rate through R2
will exacerbate the link congestion, leading to the loss of some
traffic.
2.2. Use Case 2
   Source
     |
     v
+---------+
|         |
| Group 1 |-------------+
|         |             |
+---------+             |
     |             +---------+
     |             |         |
     X<- congested | Group 3 |
     |             |         |
     |             +---------+
+---------+             |
|         |             |
| Group 2 |-------------+
|         |
+---------+
     |
     v
Destination
Figure 2 Dragon-fly network
In the dragon-fly network shown in Figure 2, the ECMP paths include
Group1->Group2 and Group1->Group3->Group2 for load balancing. When
the link between Group1 and Group2 becomes congested, Group1
continues to send traffic at the current rate through the
Group1->Group2 link, exacerbating the congestion and causing the
loss of some traffic.
3. Solution
To address the problem of load imbalance mentioned above, solutions
can be classified into two types: flow-based adjustments and packet-
based adjustments. In flow-based adjustment, each flow is forwarded
along a single path, and the dynamic adjustment is in the load
distribution across multiple paths. In packet-based adjustment,
packets can be forwarded across multiple paths on a per-packet
basis, and if ordering is required, the receiving end must handle
the reordering.
3.1. Flow-based solution
Flow-based load balancing adjustments can be further categorized
into weight-based dynamic ECMP flow mode and flow redirection mode.
The weight-based dynamic ECMP flow mode adjusts the forwarding ratio
across multiple paths in real time to prevent
further congestion on remote paths. In contrast, the flow
redirection mode reroutes the congested flows themselves from the
congested link to other links, which is suitable for scenarios with
relatively stable flows.
3.1.1. Weight-Based Dynamic ECMP Flow Mode
When congestion occurs and nearby devices detect it, congestion
information is sent to remote devices. The remote devices then
dynamically adjust the forwarding weights of the ECMP paths for this
link according to the congestion status. This reduces traffic
through the congested link and alleviates the load. Using a weighted
load balancing strategy instead of a hash-based strategy can more
effectively utilize the bandwidth resources of multiple links. By
assigning forwarding weights based on the status of each link, the
load can be more evenly balanced.
The disadvantage of weight-based dynamic ECMP flow mode is that it
cannot adjust existing flows, only new ones.
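This behavior can be sketched in Python as follows. The class name
WeightedEcmp, the path labels, the fixed random seed, and the choice
of weight 0 for the congested path are hypothetical and serve only
to make the example deterministic.

```python
import random

class WeightedEcmp:
    """Weighted path choice for new flows; existing flows stay pinned."""

    def __init__(self, weights, seed=0):
        self.weights = dict(weights)      # path -> forwarding weight
        self.flow_to_path = {}            # flow id -> pinned path
        self.rng = random.Random(seed)    # seeded for repeatability

    def set_weight(self, path, weight):
        """Apply a new weight, e.g. after a congestion notification."""
        self.weights[path] = weight

    def path_for(self, flow_id):
        """Return the pinned path, choosing by weight for a new flow."""
        if flow_id not in self.flow_to_path:
            paths = list(self.weights)
            chosen = self.rng.choices(
                paths, weights=[self.weights[p] for p in paths])[0]
            self.flow_to_path[flow_id] = chosen
        return self.flow_to_path[flow_id]

ecmp = WeightedEcmp({"via R1": 1, "via R2": 1})
old_path = ecmp.path_for("flow-A")      # placed before congestion
ecmp.set_weight("via R2", 0)            # R2->R4 reported as congested
print(ecmp.path_for("flow-A") == old_path)  # True: flow-A stays pinned
print(ecmp.path_for("flow-B"))              # via R1: only non-zero weight
```

A weight change therefore influences only flows that arrive after
the change; flows already in the table keep their path.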
Examples of weight-based dynamic ECMP flow mode are as follows:
Example 1:
In Figure 1, when R2 detects congestion on the R2->R4 link, R2 sends
the congestion information to R3 via the control plane. R3 then
dynamically adjusts the forwarding weights of the ECMP paths based
on the congestion status, reducing the forwarding weight for the
congested link. This decreases the traffic on that link and
alleviates its load. Once the congestion is cleared, R2 sends a
congestion clearance message to R3 via the control plane, and R3
restores the original forwarding weight for that link.
Example 2:
In Figure 2, when the egress router in Group 1 detects inter-group
link congestion, it sends a congestion message to the ingress router
via the control plane. The ingress router dynamically adjusts the
forwarding weights of the ECMP paths based on the congestion status,
reducing the traffic through the Group1->Group2 link to alleviate
the load on the congested link. Once the congestion is cleared, the
egress router in Group 1 notifies the ingress router in Group 1 of
the congestion-clearance message, and the ingress router restores
the ECMP link weights.
3.1.2. Flow Redirection Mode
When congestion occurs and nearby devices detect congestion in a
specific flow, they send congestion information to remote devices.
The remote devices then recompute the load balancing for the
congested flow and select a less loaded ECMP link for forwarding the
congested flow.
The advantage of the flow redirection mode is that it specifically
adjusts the congested flows, which can quickly alleviate congestion.
Examples of flow redirection mode are as follows:
Example 1:
In Figure 1, when R2 detects congestion in a specific flow S on the
R2->R4 link, R2 sends congestion information about flow S to R3 via
the control plane. R3 then redirects flow S by selecting a less
loaded ECMP path to forward flow S.
Example 2:
In Figure 2, when the egress router in Group 1 detects congestion in
a specific flow S on the inter-group link, it sends a congestion
message about flow S to the ingress router via the control plane.
The ingress router then redirects flow S by selecting a less loaded
link to forward flow S.
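A minimal Python sketch of the redirection step is shown below. The
function name redirect_flow, the utilization values, and the rule
that a flow is moved only when the alternative path is less loaded
are assumptions for illustration.

```python
def redirect_flow(flow_table, path_load, flow_id):
    """Move one congested flow to the least loaded alternative path."""
    current = flow_table[flow_id]
    candidates = {p: load for p, load in path_load.items() if p != current}
    if not candidates:
        return current                  # no alternative ECMP path
    target = min(candidates, key=candidates.get)
    if candidates[target] >= path_load[current]:
        return current                  # assumption: do not move to a worse path
    flow_table[flow_id] = target
    return target

# Flow S is currently forwarded via R2; R2 reports congestion for it.
flow_table = {"S": "via R2"}
path_load = {"via R1": 0.30, "via R2": 0.95}   # hypothetical utilization
print(redirect_flow(flow_table, path_load, "S"))  # prints "via R1"
```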
3.2. Packet-based solution
The packet-based adjustment method does not select load-balancing
paths based on flows; instead, it chooses the load-balancing path
for each individual packet. This approach helps prevent congestion
caused by imbalanced load distribution from extremely large flows.
For multiple ECMP links, link load status is monitored based on link
quality. When selecting the forwarding path for each packet, the
ECMP link with the lowest load is chosen. After forwarding a packet,
the recorded load for that link is correspondingly increased.
When responding to congestion notifications from remote devices, the
recorded load on the ECMP links is adjusted to influence the path
selection for subsequent packets. This dynamic adjustment of link
load helps achieve load balancing. Upon receiving a packet, the
device can apply per-packet forwarding based on markers carried in
the packet or on the locally configured forwarding mode. For each
packet, the least loaded link is selected for sending, which
balances the load overall.
Example:
In Figure 1, when R2 detects congestion on the R2->R4 link, it sends
congestion information to R3. R3 then dynamically adjusts the load
status of the ECMP paths according to the congestion status,
increasing the load for the congested link. As a result, subsequent
packets will choose other links due to the higher load on the
congested link, thus relieving congestion. Once the congestion is
cleared, R2 sends a congestion clearance message to R3 via the
control plane, and R3 decreases the load for that link.
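The per-packet selection and the load adjustment can be sketched in
Python as follows. The class name PerPacketBalancer, the unit packet
size, and the penalty value are hypothetical; how a congestion level
maps to a load adjustment is not specified by this document.

```python
class PerPacketBalancer:
    """Send each packet on the link with the lowest recorded load."""

    def __init__(self, links):
        self.load = {link: 0 for link in links}

    def send(self, size=1):
        """Pick the least loaded link and account for the packet."""
        link = min(self.load, key=self.load.get)
        self.load[link] += size
        return link

    def on_congestion(self, link, penalty):
        """Raise the recorded load after a congestion notification."""
        self.load[link] += penalty

    def on_clearance(self, link, penalty):
        """Lower the recorded load after a clearance message."""
        self.load[link] = max(0, self.load[link] - penalty)

lb = PerPacketBalancer(["via R1", "via R2"])
first = [lb.send() for _ in range(2)]    # one packet on each link
lb.on_congestion("via R2", penalty=4)    # R2->R4 reported as congested
after = [lb.send() for _ in range(4)]    # all four packets avoid R2
print(first, after)
```

In this run the four packets sent after the notification all use
the path via R1, because the recorded load of the path via R2 stays
higher until the two values meet.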
4. Framework
4.1. Framework Overview
A high-level view of the Adaptive Routing framework, without
expanding the functional entities in the network, is illustrated in
Figure 3.
 +-------------+
 |Routing Plane|
 +-------------+
        |
        | Remote Path Info
        v
+----------------+       +-----------------------+
|Forwarding Plane|<------|Adaptive Routing Policy|
+----------------+       +-----------------------+
                                     ^
                                     | Congestion Notify
                                     |
                      +----------------------------+
                      |Remote Congestion Detection |
                      +----------------------------+
Figure 3 Adaptive Routing Framework Overview
As shown in Figure 3, the following components are defined:
* Routing Plane: Responsible for the transmission and calculation
of routes. The calculated routes should include remote path
information. The routes and remote path info should be correlated
and updated to the Forwarding Plane. Control plane protocols can
statically specify remote paths or, through protocol extensions,
calculate remote path information and update it to the forwarding
plane.
* Forwarding Plane: Responsible for path adjustments based on
Adaptive Routing policies and remote link congestion information,
implementing adjusted forwarding strategies for traffic. In
addition to the traditional next-hop in routing, extra
information is needed to describe remote link details. Beyond the
direct next-hop, an additional remote next-hop is required to
convey remote path information.
* Adaptive Routing Policy: Handles remote link congestion or flow
information, dynamically adjusting routing and updating the
Forwarding Plane. There are two adjustment modes: flow-based and
packet-based. The local forwarding table is updated according to
the congestion information and the adjustment mode.
* Remote Congestion Detection: Detects link congestion and sends
Congestion Notifications to neighboring devices. It dynamically
senses congestion information, defines it consistently, and
promptly informs surrounding devices.
4.2. Remote Path Info
Currently, the forwarding table contains information about the route
destination, next hop, and exit interface. Local dynamic load
balancing can dynamically adjust the weight of load distribution
based on the link metric of local interfaces, such as interface
traffic load and queue size.
Load balancing based on local link quality is not always optimal.
Global congestion awareness, with information from remote links, is
needed for optimal balancing. Therefore, the forwarding table needs
to contain not only local exit interface information but also remote
path info and remote link congestion information.
Remote path info can be remote links or remote nodes, specifically
as follows:
* For BGP-based networks: Remote path info can be the BGP identifier
corresponding to the next-next-hop, as described in
[I-D.wang-idr-next-next-hop-nodes]. It can also be the BGP AS-PATH
information or BGP router-id; these options are not detailed in
this document.
* For IGP-based networks: Remote path info can be the interface
information of the link from the next-hop device to the
next-next-hop device, which could be the interface index or the
interface's local address.
By using remote path info, routes can be associated with remote
paths.
4.3. Routing Plane
When calculating routes, the routing plane needs to learn the remote
path and attach the path information to the next hop.
In a BGP-based network, a BGP route may carry the router-id of the
peer from which that route is received, and the router-id will be
added into the path information when calculating that route. The BGP
protocol may need some extensions to support such a feature. The
specific extensions are described in [I-D.wang-idr-next-next-hop-nodes]
or other documents and are not detailed in this document.
In an IGP-based network, a router may compute the path information
based on the SPF tree and attach it to the next hop. Path info can
be a link-local address, an interface ID, a Link Local Identifier,
or another extension. The detailed mechanisms are outside the scope
of this document.
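As an illustration of the IGP case, the following Python sketch
derives (next hop, next-next hop) pairs for the equal-cost shortest
paths of the Figure 1 topology. The function names, the unit link
costs, and the assumption of symmetric link costs are hypothetical;
this is one possible way to obtain remote path info from SPF
results, not a specified mechanism.

```python
import heapq

def shortest_dist(graph, src):
    """Dijkstra: cost of the shortest path from src to every node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                      # stale heap entry
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist

def remote_paths(graph, src, dst):
    """(next hop, next-next hop) pairs on equal-cost shortest paths."""
    best = shortest_dist(graph, src)[dst]
    to_dst = shortest_dist(graph, dst)    # assumes symmetric link costs
    pairs = []
    for nh, c1 in graph[src].items():
        for nnh, c2 in graph[nh].items():
            if nnh != src and c1 + c2 + to_dst[nnh] == best:
                pairs.append((nh, nnh))
    return sorted(pairs)

# Figure 1 topology with unit link costs.
graph = {
    "R1": {"R3": 1, "R4": 1},
    "R2": {"R3": 1, "R4": 1},
    "R3": {"R1": 1, "R2": 1},
    "R4": {"R1": 1, "R2": 1},
}
print(remote_paths(graph, "R3", "R4"))
# prints [('R1', 'R4'), ('R2', 'R4')]
```

Each pair identifies a remote link (R1->R4 or R2->R4) that R3 can
associate with the corresponding local next hop.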
4.4. Forwarding Plane
Taking Figure 1 as an example, the forwarding table on R3 is
illustrated in Figure 4. Below is an explanation of the table's
structure.
For each prefix, the next hop and the weight corresponding to each
path are recorded. If per-packet load balancing is supported, the
load on the path is also recorded.
The next hop for the prefix is constructed from the local next hop
and remote path information. The forwarding weight and load are
determined by the quality of the local next-hop interface (local(q))
and the quality of the remote link in the remote path (remote(q)).
Additionally, the load varies with the total traffic forwarded to
this path during per-packet forwarding.
When responding to local congestion events, the next-hop address in
the congestion event is used to find the corresponding ECMP entry,
and the weight and load of this ECMP entry are modified according to
the congestion level.
When responding to remote congestion events, the path info in the
congestion message is used to find the corresponding ECMP entry. The
link quality of the remote path is updated, and a new weight value
and new load value are calculated based on the local and remote link
quality. Then the weight and load of this ECMP entry are modified
according to the congestion level.
In per-packet load balancing mode, each time the path with the
smallest load is selected for forwarding. Each time a packet is
forwarded, the load for the corresponding path is increased
accordingly. When a link congestion event is received, the load for
that path is increased according to the level of congestion.
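The entry lookup and weight update described above can be sketched
in Python as follows. The names EcmpEntry, on_local_event, and
on_remote_event are hypothetical, and computing the weight as the
sum of the local and remote quality values is an assumption that
mirrors the "local(q)+remote(q)" label in Figure 4; the actual
formula is not specified by this document.

```python
from dataclasses import dataclass

@dataclass
class EcmpEntry:
    next_hop: str      # local next hop, e.g. "R2"
    remote_path: str   # remote path info, e.g. "R2->R4"
    local_q: int       # quality of the local next-hop interface
    remote_q: int      # quality of the remote link

    @property
    def weight(self):
        # Assumption: weight is the sum of both quality values.
        return self.local_q + self.remote_q

def on_local_event(entries, next_hop, quality):
    """Local congestion event: match entries by next-hop address."""
    for entry in entries:
        if entry.next_hop == next_hop:
            entry.local_q = quality

def on_remote_event(entries, remote_path, quality):
    """Remote congestion event: match entries by remote path info."""
    for entry in entries:
        if entry.remote_path == remote_path:
            entry.remote_q = quality

# R3's ECMP entries for one prefix, with hypothetical quality values.
entries = [
    EcmpEntry("R1", "R1->R4", local_q=15, remote_q=15),
    EcmpEntry("R2", "R2->R4", local_q=15, remote_q=15),
]
on_remote_event(entries, "R2->R4", quality=3)
print([entry.weight for entry in entries])   # prints [30, 18]
```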
+------+       +--------------------------+ local(q)+remote(q)
|Prefix|---+-->|Next-hop: to R1, Weight w1|<----------------|
+------+   |   |                   Load l1|                 |
           |   +--------------------------+                 |
           |              |           +------------+   +--------+
           |              +---------->|Path: R1->R4|-->|Quality1|
           |                          +------------+   +--------+
           |   +--------------------------+ local(q)+remote(q)
           +-->|Next-hop: to R2, Weight w2|<----------------|
               |                   Load l2|                 |
               +--------------------------+                 |
                          |           +------------+   +--------+
                          +---------->|Path: R2->R4|-->|Quality2|
                                      +------------+   +--------+
Figure 4 Forwarding table for Adaptive Routing
When the number of flows is small or when there are elephant flows,
adaptive routing needs to be performed through flow redirection.
Figure 5 is a schematic of the flow table maintained in the
forwarding plane. Flow table entries are keyed by the five-tuple of
the traffic and record the path information for the corresponding
flow.
When responding to remote flow congestion events as described in
Section 4.8, the ECMP path for the flow is selected again, and the
flow is redirected to the least loaded ECMP path.
+------+
|SAddr |
|DAddr |
|SPort |       +------------------+
|DPort |------>|Next-hop: to R1   |
|Proto |       +------------------+
+------+
Figure 5 Flow table
4.5. Adaptive Routing Mode
Adaptive routing adjustment modes can be categorized into flow-based
adjustment modes and packet-based adjustment modes. Flow-based
adjustments can be further divided into link congestion-based load
balancing weight adjustments and flow redirection-based adjustments.
In packet-based adjustment modes, each time a packet is forwarded,
the link with the lightest load is chosen for forwarding. This can
lead to out-of-order reception at the receiving end, requiring the
application side to handle out-of-order issues.
4.5.1. Flow-Based Adjustment Mode
In flow-based adjustment modes, the load balancing weight of links
can be adjusted, or specific flows can be redirected.
Weight-Based Dynamic ECMP Flow Mode: For link-level congestion
events, based on the congestion status of remote links and combined
with the local link congestion status, the forwarding weights of the
corresponding ECMP links in the forwarding table are dynamically
adjusted. This helps to achieve global link load balancing by
reducing the flow weights on congested links. The forwarding weight
is calculated based on the quality of the local and remote links.

Flow Redirection Mode: For specific flow congestion events, the
congested flow is redirected to ECMP links with lighter loads.
During flow redirection, the quality, such as the remaining
bandwidth, of these links must be considered to avoid congestion on
the redirected link.
4.5.2. Packet-Based Adjustment Mode
Based on the congestion status of local and remote links, the load
on corresponding ECMP links is dynamically adjusted. During data
forwarding, using a per-packet forwarding model, each data packet is
sent through the link with the lightest load. The application side
must be capable of handling out-of-order packets when receiving
them.
4.6. Congestion Detection
Congestion detection is generally performed by devices near the
congestion point and covers both the detection of link congestion
and the detection of congestion clearance. Network performance and
congestion points can be identified by sending test traffic. A queue
whose depth exceeds a threshold may trigger a congestion
notification. Congestion can also be inferred by monitoring the
packet loss rate of a link. Congestion detection also includes
flow-based congestion detection. Specific congestion detection
methods are beyond the scope of this document.
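As one illustration of queue-depth-based detection, the Python
sketch below reports both congestion and congestion clearance. The
class name QueueCongestionDetector and the use of two separate
thresholds (hysteresis) are assumptions; this document does not
mandate any particular detection method.

```python
class QueueCongestionDetector:
    """Report congestion and clearance from sampled queue depths."""

    def __init__(self, set_depth, clear_depth):
        if clear_depth > set_depth:
            raise ValueError("clear_depth must not exceed set_depth")
        self.set_depth = set_depth        # depth that signals congestion
        self.clear_depth = clear_depth    # depth that signals clearance
        self.congested = False

    def observe(self, depth):
        """Return "congested", "cleared", or None if nothing changed."""
        if not self.congested and depth >= self.set_depth:
            self.congested = True
            return "congested"
        if self.congested and depth <= self.clear_depth:
            self.congested = False
            return "cleared"
        return None

detector = QueueCongestionDetector(set_depth=80, clear_depth=40)
events = [detector.observe(depth) for depth in [10, 85, 60, 90, 30, 20]]
print(events)
# prints [None, 'congested', None, None, 'cleared', None]
```

Using a lower clearance threshold avoids sending a notification
each time the queue depth oscillates around a single value.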
4.7. Congestion definition
Congestion can be defined in terms of interface bandwidth or
forwarding buffer utilization and expressed as a quality level.
Levels can be arranged so that a lower level indicates poorer path
quality, and a level can be calculated from current bandwidth and
buffer usage using a specific ratio.
For instance, with 16 quality levels, on a 400G interface, level 0
could represent 25G and level 15 could represent 400G.
The exact method for calculating the quality level is beyond this
document's scope, but the rules must be consistent among routers
exchanging this information.
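One possible mapping that is consistent with the example above is
sketched below in Python. The function name quality_level, the
interpretation of the measured quantity as available bandwidth, and
the linear 25G step are assumptions; other mappings are equally
valid as long as all routers apply the same rule.

```python
import math

def quality_level(available_gbps, capacity_gbps=400, levels=16):
    """Quantize available bandwidth into levels 0 .. levels-1."""
    step = capacity_gbps / levels             # 25G per level here
    level = math.ceil(available_gbps / step) - 1
    return max(0, min(levels - 1, level))     # clamp to the valid range

print(quality_level(25))    # prints 0
print(quality_level(200))   # prints 7
print(quality_level(400))   # prints 15
```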
4.8. Congestion Notify
When a change in congestion status is detected, it needs to be
communicated to remote devices in order to adjust traffic scheduling
from the source.
Congestion messages can be of two types:
1) The first type includes Path information, which helps in
identifying the corresponding route for adjustments. It also
includes the congestion information of the link corresponding to
the Path. With this information, global congestion calculation can
be performed to derive the weight information for the forwarding
table. For details, refer to Section 4.4.
2) The second type includes the five-tuple information of the
congested flow. Using this information, the congested flow can be
redirected. For details, refer to Sections 4.4 and 4.5.
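The two message types can be modeled with a short Python sketch.
The class names, field names, and action labels are hypothetical
and do not define a message format; encoding is left to the
protocol that carries the notification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkCongestion:
    """Type 1: path information plus the link's congestion level."""
    path: str     # e.g. "R2->R4"
    level: int    # congestion or quality level of that link

@dataclass(frozen=True)
class FlowCongestion:
    """Type 2: five-tuple of the congested flow."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    proto: str

def action_for(message):
    """Map a received congestion message to the adjustment it drives."""
    if isinstance(message, LinkCongestion):
        return "adjust-weight-or-load"
    if isinstance(message, FlowCongestion):
        return "redirect-flow"
    raise TypeError("unknown congestion message type")

print(action_for(LinkCongestion("R2->R4", level=3)))
# prints adjust-weight-or-load
print(action_for(FlowCongestion("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")))
# prints redirect-flow
```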
Congestion notification can be achieved by extending an IGP to
transmit link state information within the IGP domain, or by
extending BGP and setting up BGP route reflectors to facilitate
communication between BGP neighbors. However, transmitting
congestion information through traditional routing protocols
presents performance challenges. On one hand, congestion
notifications need to be sent more frequently than routing
information. On the other hand, processing of congestion messages
should ideally occur in the forwarding plane. Therefore, to improve
the performance of adaptive routing, new protocols can be designed
specifically for this purpose. For specific extensions, see
[I-D. draft-zzhang-rtgwg-router-info] or other documents; they are
not detailed in this document.
Congestion messages can be transmitted in-band or out-of-band. For
high-performance solutions, additional protocols may be needed for
efficient out-of-band message transmission. Specific methods are
beyond the scope of this document.
5. Work Flow
Taking the network shown in Figure 1 as an example, the neighbor ID
of R4 is included in the routing distribution and calculation to
indicate the path from R2 to R4. The router ID can be used as the
neighbor ID. The following uses BGP routing as an example to explain
the entire working process.
When congestion occurs on the link between R2 and R4, R2 will use
R4's neighbor ID to notify R3 of the link congestion.
Below is a description of the workflow for handling congestion
information in various modes:
5.1. Weight-Based Dynamic ECMP Flow Adjustment Mode
R3 <---BGP Route--- R2                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R2, Next-next-hop BGP ID: R4>
R3 <---BGP Route--- R1                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R1, Next-next-hop BGP ID: R4>
            |
            | Per-neighbor Path Information
            v
R3's Forwarding Plane:                                      [S2]
+------+       +--------------------------+ local(q)+remote(q)
|Prefix|---+-->|Next-hop: to R1, Weight w1|<----------------|
+------+   |   +--------------------------+                 |
           |              |           +------------+   +--------+
           |              +---------->|Path: R1->R4|-->|Quality1|
           |                          +------------+   +--------+
           |   +--------------------------+ local(q)+remote(q)
           +-->|Next-hop: to R2, Weight w2|<----------------|
               +--------------------------+                 |
                          |           +------------+   +--------+
                          +---------->|Path: R2->R4|-->|Quality2|[S4]
                                      +------------+   +--------+
                                            ^
                                            | Link Congestion
                                            |
R3 <---Control Plane Notification--- R2:
   Link Congestion: Link ID, Congestion Level               [S3]
Figure 6
As shown in Figure 6, the workflow for handling remote link
congestion in weight-based dynamic ECMP flow mode is as follows:
[S1]:There are two paths from R3 to R4: R3->R1->R4 and R3->R2->R4.
Link information is advertised by R1 and R2 through BGP.
When R2 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R2 and the next-next-hop is R4.
When R1 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R1 and the next-next-hop is R4.
[S2]:R3 learns the routes and maintains two ECMP paths with equal
initial weights, set to 50 in this example.
[S3]:R2 detects a change in congestion on the R2->R4 link using
congestion detection methods and classifies the congestion into
levels according to severity. This information, including the
detecting node, the congested link, and the congestion level, is
sent to R3 through the control plane
[I-D. draft-zzhang-rtgwg-router-info].
[S4]:R3 receives the remote notification and, based on the detecting
node (R2) and its next hop (R4), looks up its local forwarding
table. It then adjusts the forwarding weight of the corresponding
ECMP entry according to the congestion level; in this example the
weight is adjusted to 10, as shown in Figure 7.
+------+       +--------------------------+
|Prefix|---+-->|Next-hop: to R1, Weight 50|
+------+   |   +--------------------------+
           |              |           +----------------+
           |              +---------->|Path: R1->R4    |
           |                          +----------------+
           |   +--------------------------+
           +-->|Next-hop: to R2, Weight 10|
               +--------------------------+
                          |           +----------------+
                          +---------->|Path: R2->R4    |
                                      +----------------+
Figure 7 Adaptive forwarding table
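The effect of the weights in Figure 7 on traffic distribution can be
worked out with a short Python sketch. The function name
traffic_share is hypothetical, and the result assumes that new flows
are distributed in proportion to the weights.

```python
from fractions import Fraction

def traffic_share(weights):
    """Fraction of new flows expected on each path, by weight."""
    total = sum(weights.values())
    return {path: Fraction(w, total) for path, w in weights.items()}

shares = traffic_share({"via R1": 50, "via R2": 10})
print(shares["via R1"], shares["via R2"])   # prints "5/6 1/6"
```

With weights 50 and 10, about five of every six new flows are
expected to use the path via R1, which relieves the congested
R2->R4 link.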
5.2. Flow Redirect Mode
R3 <---BGP Route--- R2                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R2, Next-next-hop BGP ID: R4>
R3 <---BGP Route--- R1                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R1, Next-next-hop BGP ID: R4>
            |
            | Per-neighbor Path Information
            v
R3's Forwarding Plane:                                      [S2]
+------+       +--------------------------+
|Flow 1|------>|Next-hop: to R2           |
+------+       +--------------------------+
                          |           +------------+
                          +---------->|Path: R2->R4| [S4]
                                      +------------+
                                            ^
                                            | Flow Congestion
                                            |
R3 <---Control Plane Notification--- R2:
   Flow Congestion: Flow 1, Congestion                      [S3]
Figure 8
As shown in Figure 8, the workflow for handling remote flow
congestion is as follows:
[S1]:There are two paths from R3 to R4: R3->R1->R4 and R3->R2->R4.
Link information is advertised by R1 and R2 through BGP.
When R2 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R2 and the next-next-hop is R4.
When R1 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R1 and the next-next-hop is R4.
[S2]:Flow 1 initially selects the path via R2 (R2->R4) for
forwarding, and R3 creates a flow table entry for it.
[S3]:R2 detects congestion on a specific flow passing through the
R2->R4 link using congestion detection methods. R2 notifies the
remote device R3 of the congestion change event, including the
congested path info and flow information
[I-D. draft-zzhang-rtgwg-router-info].
[S4]:R3 receives the flow congestion event, looks up the flow table
based on the flow information, and redirects the flow to the least
loaded link among the ECMP links. Subsequently, the flow is
forwarded according to the new flow table entry.
5.3. Packet-Based Adjustment Mode
R3 <---BGP Route--- R2                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R2, Next-next-hop BGP ID: R4>
R3 <---BGP Route--- R1                                      [S1]
   <NLRI: Prefix, Next-hop BGP ID: R1, Next-next-hop BGP ID: R4>
            |
            | Per-neighbor Path Information
            v
R3's Forwarding Plane:                                      [S2]
+------+       +--------------------------+ local(q)+remote(q)
|Prefix|---+-->|Next-hop: to R1, Load l1  |<----------------|
+------+   |   +--------------------------+                 |
           |              |           +------------+   +--------+
           |              +---------->|Path: R1->R4|-->|Quality1|
           |                          +------------+   +--------+
           |   +--------------------------+ local(q)+remote(q)
           +-->|Next-hop: to R2, Load l2  |<----------------|
               +--------------------------+                 |
                          |           +------------+   +--------+
                          +---------->|Path: R2->R4|-->|Quality2|[S4]
                                      +------------+   +--------+
                                            ^
                                            | Link Congestion
                                            |
R3 <---Control Plane Notification--- R2:
   Link Congestion: Link ID, Congestion Level               [S3]
Figure 9
As shown in Figure 9, the workflow for handling remote link
congestion in packet-based adjustment mode is as follows:
[S1]:There are two paths from R3 to R4: R3->R1->R4 and R3->R2->R4.
Link information is advertised by R1 and R2 through BGP.
When R2 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R2 and the next-next-hop is R4.
When R1 delivers BGP routes to R3, the NNHN Capability TLV is
carried in the attributes [I-D.wang-idr-next-next-hop-nodes],
indicating that the next-hop is R1 and the next-next-hop is R4.
[S2]:R3 learns the routes and maintains two ECMP paths, both with
initial equal load set to 50.
[S3]:R2 detects a change in congestion on the R2->R4 link using
congestion detection methods and classifies the congestion into
levels according to severity. This information, including the
detecting node, the congested link, and the congestion level, is
sent to R3 through the control plane
[I-D. draft-zzhang-rtgwg-router-info].
[S4]:R3 receives the remote notification and, based on the detecting
node (R2) and its next hop (R4), looks up its local forwarding
table. It then adjusts the load of the corresponding ECMP entry
according to the congestion level; in this example the load is
adjusted to 100, as shown in Figure 10.
Afterwards, when selecting a path for per-packet forwarding, the
lower-load path R1->R4 is chosen until the load on the path R1->R4
becomes higher than that of R2->R4.
+------+       +--------------------------+
|Prefix|---+-->|Next-hop: to R1, load 50  |
+------+   |   +--------------------------+
           |              |           +----------------+
           |              +---------->|Path: R1->R4    |
           |                          +----------------+
           |   +--------------------------+
           +-->|Next-hop: to R2, load 100 |
               +--------------------------+
                          |           +----------------+
                          +---------->|Path: R2->R4    |
                                      +----------------+
Figure 10
6. Security Considerations
TBD.
7. IANA Considerations
TBD.
8. References
8.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, May 2017.
8.2. Informative References
[I-D.wang-idr-next-next-hop-nodes] Wang, K., Haas, J., and C. Lin,
"BGP Next-next Hop Nodes", Work in Progress, Internet-Draft,
draft-wang-idr-next-next-hop-nodes-01, 4 September 2024,
<https://datatracker.ietf.org/doc/html/draft-wang-idr-next-next-hop-nodes-01>.
[I-D. draft-zzhang-rtgwg-router-info] Zhang, Z., Wang, K., and
C. Lin, "Advertising Router Information", Work in Progress,
Internet-Draft, draft-zzhang-rtgwg-router-info-01, 18 September
2024,
<https://datatracker.ietf.org/doc/html/draft-zzhang-rtgwg-router-info-01>.
9. Acknowledgments
TBD.
Authors' Addresses
Weiqiang Cheng
China Mobile
China
Email: chengweiqiang@chinamobile.com
Changwang Lin
New H3C Technologies
China
Email: linchangwang.04414@h3c.com
Kevin F. Wang
Juniper Networks
Email: kfwang@juniper.net
Jiaming Ye
China Mobile
China
Email: yejiaming@chinamobile.com
Rui Zhuang
China Mobile
China
Email: zhuangruiyjy@chinamobile.com
PengFei Huo
ByteDance
China
Email: huopengfei@bytedance.com