Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing
draft-geng-fantel-fantel-gap-analysis-02

Document Type Active Internet-Draft (individual)
Authors Xuesong Geng , Jie Dong , Weiqiang Cheng , Dan Li , Yongqing Zhu , Han Zhengxin
Last updated 2026-02-26
FANTEL                                                           X. Geng
Internet-Draft                                                   J. Dong
Intended status: Standards Track                                  Huawei
Expires: 31 August 2026                                         W. Cheng
                                                            China Mobile
                                                                   D. Li
                                                     Tsinghua University
                                                                  Y. Zhu
                                                           China Telecom
                                                                  Z. Han
                                                            China Unicom
                                                        27 February 2026

   Gap Analysis of Fast Notification for Traffic Engineering and Load
                               Balancing
                draft-geng-fantel-fantel-gap-analysis-02

Abstract

   Modern networks require fast, adaptive Traffic Engineering (TE) to
   support demanding applications like AI training and real-time
   services.  Existing mechanisms for load balancing, protection, and
   flow control often lack responsiveness and scalability.  This
   document analyzes key gaps in current TE solutions and proposes fast
   notification as a low-latency, event-driven enhancement.  Fast
   notification enables real-time network awareness and quicker
   reactions to dynamic conditions, improving overall network efficiency
   and reliability.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 31 August 2026.

Geng, et al.             Expires 31 August 2026                 [Page 1]
Internet-Draft           Gap Analysis of Fantel            February 2026

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Fast Notification for Traffic Engineering and Load Balancing:
           Gap Analysis
     1.1.  Introduction
     1.2.  Requirements Language
     1.3.  Gap Analysis for Load Balancing
       1.3.1.  IOAM Telemetry Limitations
       1.3.2.  Role of Fast Notification
     1.4.  Gap Analysis for Protection
       1.4.1.  Bidirectional Forwarding Detection (BFD)
       1.4.2.  Fast Reroute (FRR)
       1.4.3.  Routing Convergence
       1.4.4.  Multi-Path Routing (e.g., ECMP)
     1.5.  Gap Analysis for Flow Control
       1.5.1.  Sender-Based Congestion Control
       1.5.2.  Receiver-Based TCP Congestion Control
       1.5.3.  Explicit Congestion Notification (ECN)
       1.5.4.  In-band Network Telemetry (INT)
     1.6.  Conclusion
   2.  Informative References
   Authors' Addresses

1.  Fast Notification for Traffic Engineering and Load Balancing: Gap
    Analysis

1.1.  Introduction

   In use cases such as AI training, a lossless and adaptive network is
   required to ensure reliable and congestion-free data transfer.  These
   workloads demand high throughput, low latency, and zero packet loss
   across dynamically shifting traffic patterns.

   To meet these demands, networks rely on Traffic Engineering (TE)
   mechanisms, including load balancing, protection, and flow control.
   However, existing solutions face limitations in responsiveness,
   coverage, and operational overhead, especially in high-speed, large-
   scale environments.

   This document provides a gap analysis focused on three key TE areas:

   *  Load Balancing

   *  Protection

   *  Flow Control

   For each area, we analyze current limitations and explore how fast
   notification mechanisms can help fill these gaps.

   It is important to clarify that the mechanisms discussed in this
   document, such as BFD, IOAM, and traditional transport-layer flow
   control, are not necessarily alternatives to fast notification.
   Instead, these mechanisms can complement notification-based
   approaches.  For instance, measurement results from BFD or IOAM can
   serve as triggers for fast notifications.  Similarly, existing flow
   control mechanisms at the transport layer can work in coordination
   with network-layer flow control enabled by notifications.  Therefore,
   the "gaps" identified in this document describe what is missing when
   these mechanisms are used on their own, without fast notification;
   they do not suggest that the mechanisms should be replaced.

1.2.  Requirements Language

   TBD

1.3.  Gap Analysis for Load Balancing

   Load balancing ensures efficient utilization of available bandwidth
   and reduces congestion.  In modern networks, dynamic load balancing
   is essential but often lacks real-time responsiveness.

1.3.1.  IOAM Telemetry Limitations

   In-situ OAM (IOAM) provides visibility into traffic by embedding
   telemetry data directly in packets.  It enables measurement of path
   latency, loss, and other performance metrics.

   However, IOAM has notable drawbacks:

   *  Telemetry Export Delays: IOAM data is extracted and reported by
      the device CPU to a controller.  This adds latency and limits
      responsiveness.

   *  Controller Reaction Time: Centralized controllers typically
      process telemetry in software, resulting in delayed decision-
      making.

   Gap: Because of these delays, load-balancing decisions are based on
   stale data, which reduces the effectiveness of real-time load
   balancing.

1.3.2.  Role of Fast Notification

   To address the above:

   *  Proactive Signaling: Fast notification can signal network
      conditions (e.g., congestion) before service degradation occurs.

   *  Event-Driven Control: Control loops can dynamically adjust traffic
      distribution without relying on polling or telemetry aggregation.

   *  Lightweight Signaling: Avoids the overhead of traditional
      telemetry processing.
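
   As an illustration of event-driven control, the following Python
   sketch emits a message only when a queue crosses a threshold,
   instead of exporting every sample.  The class name, the watermark
   values, and the event names are hypothetical and are not defined by
   this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CongestionNotifier:
    """Hypothetical threshold-crossing notifier with hysteresis."""
    high_watermark: int      # queue depth that raises an alert
    low_watermark: int       # queue depth that clears the alert
    congested: bool = False

    def observe(self, queue_depth: int) -> Optional[str]:
        """Return an event name on a state change, otherwise None."""
        if not self.congested and queue_depth >= self.high_watermark:
            self.congested = True
            return "CONGESTION_ONSET"
        if self.congested and queue_depth <= self.low_watermark:
            self.congested = False
            return "CONGESTION_CLEARED"
        return None


def run(samples: List[int], notifier: CongestionNotifier) -> List[Tuple[int, str]]:
    """Feed queue-depth samples to the notifier and collect emitted events."""
    events = []
    for t, depth in enumerate(samples):
        event = notifier.observe(depth)
        if event is not None:
            events.append((t, event))
    return events


samples = [10, 40, 85, 90, 70, 95, 30, 20]
events = run(samples, CongestionNotifier(high_watermark=80, low_watermark=40))
print(events)
```

   In this example, eight queue samples produce two notifications, one
   at onset and one when the queue drains.  The gap between the two
   watermarks is a design choice that avoids repeated alerts while the
   queue oscillates around a single threshold.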

1.4.  Gap Analysis for Protection

   Protection mechanisms ensure service continuity in case of failures.
   While existing tools like BFD and FRR are widely deployed, they have
   inherent limitations in speed and scope.

1.4.1.  Bidirectional Forwarding Detection (BFD)

   BFD [RFC5880] is designed for rapid fault detection by sending
   frequent control packets between peers.  While widely used, it
   presents the following limitations:

   *  Overhead vs. Frequency Tradeoff: Higher probe frequency improves
      detection time but increases CPU and bandwidth usage.

   *  Scalability Issues: Maintaining many BFD sessions in large-scale
      networks strains the control plane.

   *  Path Detection Limitations: In scenarios with multiple ECMP paths,
      a BFD session exercises only the path its packets are hashed
      onto, making it difficult to identify partial failures or
      asymmetrical degradations on the other paths.

   Gap: BFD struggles to balance detection speed with system overhead
   and to cover individual ECMP paths.
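
   The overhead versus frequency tradeoff can be made concrete with a
   short calculation.  The Python sketch below assumes BFD asynchronous
   mode, where the detection time is the detect multiplier times the
   negotiated transmit interval [RFC5880], and it ignores transmit
   jitter.  The function names, the intervals, and the session count
   are illustrative assumptions.

```python
def bfd_detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Time without received packets before a session is declared down."""
    return tx_interval_ms * detect_mult


def bfd_control_pps(tx_interval_ms: float, sessions: int) -> float:
    """Control packets per second one node transmits across all sessions."""
    return sessions * 1000.0 / tx_interval_ms


# Assumed operating points: 1000 sessions, detect multiplier of 3.
for interval_ms in (300, 50, 10):
    detect = bfd_detection_time_ms(interval_ms, detect_mult=3)
    pps = bfd_control_pps(interval_ms, sessions=1000)
    print(f"interval={interval_ms} ms detection={detect} ms tx={pps:.0f} pps")
```

   Under these assumptions, reducing the interval from 300 ms to 10 ms
   cuts the detection time from 900 ms to 30 ms, but the transmit load
   grows by the same factor of 30, and the peer must process a matching
   receive load.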

1.4.1.1.  Fast Notification Enhancement

   *  Targeted Notifications: Fast notification provides event-driven
      alerts rather than continuous probing.

   *  Improved Scalability: Reduces resource usage while preserving
      rapid failure detection.

1.4.2.  Fast Reroute (FRR)

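   FRR (e.g., Remote LFA [RFC7490]) reroutes traffic onto a precomputed
   backup path upon link or node failure.  However:

   *  Local-Only Protection: The repair is typically triggered by the
      node adjacent to the failure, so it protects against only
      adjacent failures.

   Gap: FRR lacks flexibility and responsiveness in complex topologies,
   where failures or degradations may occur away from the repairing
   node.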

1.4.2.1.  Fast Notification Enhancement

   *  Instant Failure Alerts: Enables immediate detection and rerouting
      across the network.

   *  Minimized Packet Loss: Reduces the time between failure detection
      and redirection.

1.4.3.  Routing Convergence

   In the absence of local repair, recovery from a failure depends on
   routing protocol convergence, which may take hundreds of
   milliseconds.

   Gap: Delay-sensitive services cannot tolerate such slow failover.

1.4.3.1.  Fast Notification Enhancement

   *  Real-Time Failover: Triggers immediate switching to standby paths.

   *  Service Continuity: Ensures uninterrupted performance for critical
      applications.

1.4.4.  Multi-Path Routing (e.g., ECMP)

   Equal-Cost Multi-Path (ECMP) routing uses multiple paths for load
   sharing, typically by hashing each flow onto one of the equal-cost
   next hops.

   Gap: ECMP by itself lacks fast detection of path degradation or
   failure, making real-time traffic rebalancing difficult.

1.4.4.1.  Fast Notification Enhancement

   *  On-the-Fly Path Reallocation: Shifts traffic to healthy paths
      based on real-time failure or degradation alerts.

   *  Improved Reliability: Maintains availability during partial
      failures.
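
   The path reallocation described above can be sketched as a hash-
   based selection that skips any path reported as failed or degraded.
   This is a minimal Python sketch: the path names, the flow tuples,
   and the use of CRC32 as the hash are assumptions for illustration
   and do not describe any particular implementation.

```python
import zlib
from typing import List, Set, Tuple

# Source IP, destination IP, protocol, source port, destination port.
Flow = Tuple[str, str, int, int, int]


def pick_path(flow: Flow, paths: List[str], failed: Set[str]) -> str:
    """Hash a flow onto one of the paths not currently reported as failed."""
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy path available")
    key = "|".join(str(field) for field in flow).encode()
    return healthy[zlib.crc32(key) % len(healthy)]


paths = ["path-a", "path-b", "path-c", "path-d"]
flows = [("10.0.0.1", "10.0.1.1", 17, 4000 + i, 4791) for i in range(8)]

before = {f: pick_path(f, paths, failed=set()) for f in flows}
# A hypothetical fast notification reports that path-b has degraded.
after = {f: pick_path(f, paths, failed={"path-b"}) for f in flows}

moved = sum(1 for f in flows if before[f] != after[f])
print(f"{moved} of {len(flows)} flows changed path")
```

   Note that a plain modulo over the healthy set may also move flows
   that were not on the affected path.  A consistent or resilient
   hashing scheme would limit that disruption, at the cost of extra
   state.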

1.5.  Gap Analysis for Flow Control

   Flow control ensures congestion-free transmission and optimal
   throughput.  Current mechanisms either react too slowly or lack
   granular, real-time information.

1.5.1.  Sender-Based Congestion Control

   Sender-based congestion control relies on end-to-end feedback such as
   packet loss or RTT increases.

   *  End-to-End Delay Sensitivity: The sender detects congestion only
      from end-to-end signals, often after at least one RTT.  In bursty
      traffic scenarios such as data centers, this delay may result in
      bufferbloat or packet loss.

   *  Ambiguity in Signal Source: It is hard to distinguish congestion
      from transient fluctuations using loss or RTT alone, leading to
      overreaction or misjudgment in rate adaptation.

   Gap: These signals are slow and reactive, especially in long-RTT
   environments.
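
   To show why a one-RTT feedback delay matters at high speed, the
   following Python sketch computes how much data a sender transmits at
   line rate before any end-to-end signal can reach it.  The function
   name and the rate and RTT values are illustrative assumptions, not
   measurements.

```python
def bytes_sent_before_feedback(rate_gbps: float, rtt_us: float) -> float:
    """Bytes a sender transmits at line rate during one RTT."""
    bits = rate_gbps * rtt_us * 1000  # 1 Gb/s for 1 microsecond is 1000 bits
    return bits / 8


for rate_gbps, rtt_us in [(100, 10), (400, 20)]:
    sent = bytes_sent_before_feedback(rate_gbps, rtt_us)
    print(f"{rate_gbps} Gb/s, RTT {rtt_us} us: {sent:.0f} bytes in flight")
```

   With these inputs, a 400 Gb/s sender with a 20 microsecond RTT has
   already sent 1,000,000 bytes before it can react, all of which may
   have to be absorbed by switch buffers on the congested path.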

1.5.1.1.  Fast Notification Enhancement

   *  Mid-Path Feedback: Intermediate nodes can issue real-time
      congestion alerts.

   *  Faster Rate Adjustment: Prevents packet loss and improves flow
      responsiveness.

1.5.2.  Receiver-Based TCP Congestion Control

   Receiver-driven congestion control uses feedback signals from the
   receiver to adjust the sender's transmission rate.

   *  Control Loop Latency: These signals still traverse the network and
      are subject to RTT delays, which is especially problematic in
      high-speed dynamic environments.

   *  Bandwidth Overhead: In large-scale or short-flow-intensive
      environments like data centers, signaling from massive numbers of
      receivers can impose significant bandwidth and processing
      overhead.

1.5.2.1.  Fast Notification Enhancement

   *  Direct Congestion Signals: Reduces RTT-related lag by injecting
      congestion indicators directly into the network fabric.

   *  Efficient Scaling: Enables scalable control even in environments
      with many short-lived flows.

1.5.3.  Explicit Congestion Notification (ECN)

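   ECN [RFC3168] marks packets to indicate congestion instead of
   dropping them.  The mark is carried to the receiver and echoed back
   to the sender, which then reduces its rate.

   Gap: ECN still relies on end-to-end signaling and lacks precise
   real-time feedback.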
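
   For reference, the marking step at a congested queue can be sketched
   as follows.  The two-bit ECN field values are those defined in
   [RFC3168]; the step threshold on queue depth is a simplifying
   assumption (deployed queues usually apply an active queue management
   algorithm), and the function name is hypothetical.

```python
from typing import Tuple

# Two-bit ECN field codepoints from RFC 3168.
NOT_ECT = 0b00  # transport is not ECN-capable
ECT_1 = 0b01    # ECN-Capable Transport, ECT(1)
ECT_0 = 0b10    # ECN-Capable Transport, ECT(0)
CE = 0b11       # Congestion Experienced


def on_enqueue(ecn_field: int, queue_depth: int, mark_threshold: int) -> Tuple[str, int]:
    """Return (action, ecn_field) for a packet arriving at a queue."""
    if queue_depth < mark_threshold:
        return "forward", ecn_field      # no congestion, leave field untouched
    if ecn_field == NOT_ECT:
        return "drop", ecn_field         # cannot mark, so signal by dropping
    return "forward", CE                 # mark instead of dropping


print(on_enqueue(ECT_0, queue_depth=60, mark_threshold=50))
print(on_enqueue(NOT_ECT, queue_depth=60, mark_threshold=50))
```

   The mark set here is only a single bit of information, and it reaches
   the sender only after the packet arrives at the receiver and the
   receiver echoes it back, which is the end-to-end dependency noted in
   the gap above.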

1.5.3.1.  Fast Notification Enhancement

   *  Granular Congestion Updates: Real-time alerts from within the
      network augment ECN markings.

   *  Proactive Shaping: Faster congestion mitigation before queue
      buildup.

1.5.4.  In-band Network Telemetry (INT)

   INT provides path-level telemetry by inserting metadata at each hop,
   which is returned to the sender via the ACK.  Some congestion control
   algorithms, such as High Precision Congestion Control (HPCC), use INT
   for precise load-awareness.

   However:

   *  RTT Dependency: INT-based telemetry still incurs a one-RTT delay
      before feedback is received by the sender.

   *  Feedback Loop Latency: This delay limits responsiveness,
      especially in dynamic high-speed environments.

1.5.4.1.  Fast Notification Enhancement

   *  Immediate Inline Feedback: Enables mid-network nodes to send
      congestion indicators directly, without waiting for a full
      end-to-end round trip.

   *  Enhanced Responsiveness: Combines the accuracy of INT with faster
      notification paths for congestion control.
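
   The difference between ACK-carried feedback and a notification sent
   from the congested node can be compared with a simple latency model.
   The Python sketch below assumes symmetric per-link delays and ignores
   queueing and processing time; the function name and the delay values
   are illustrative assumptions.

```python
from typing import List, Tuple


def feedback_latencies_us(link_delays_us: List[int], congested_node: int) -> Tuple[int, int]:
    """Return (end_to_end, direct) feedback latency in microseconds.

    link_delays_us[i] is the one-way delay of link i, counted from the
    sender.  Node 0 is the sender, node k sits after link k-1, and the
    last node is the receiver.
    """
    if not 0 < congested_node < len(link_delays_us):
        raise ValueError("congested_node must be an intermediate node")
    # End-to-end: the packet continues to the receiver, then the ACK
    # crosses the whole path back to the sender.
    to_receiver = sum(link_delays_us[congested_node:])
    ack_return = sum(link_delays_us)
    # Direct: the congested node notifies the sender over the links
    # between them.
    direct = sum(link_delays_us[:congested_node])
    return to_receiver + ack_return, direct


end_to_end, direct = feedback_latencies_us([2, 2, 2, 2], congested_node=1)
print(f"end-to-end feedback: {end_to_end} us, direct notification: {direct} us")
```

   In this four-link example with congestion at the first switch, the
   ACK-carried signal takes 14 microseconds to reach the sender, while a
   direct notification takes 2.  The advantage shrinks as the congested
   node moves closer to the receiver.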

1.6.  Conclusion

   This document highlights the following gaps in Traffic Engineering
   mechanisms and how fast notification can enhance each area:

   +============+===========================+==========================+
   | Area       | Key Gap                   | Fast Notification        |
   |            |                           | Enhancement              |
   +============+===========================+==========================+
   | Load       | Slow telemetry export and | Event-driven             |
   | Balancing  | software control delays   | signaling for            |
   |            |                           | immediate adjustment     |
   +------------+---------------------------+--------------------------+
   | Protection | BFD probing overhead; FRR | Lightweight, fast        |
   |            | is local-only; slow       | fault alerts across      |
   |            | routing convergence       | the network topology     |
   +------------+---------------------------+--------------------------+
   | Flow       | TCP/ECN feedback too slow | Real-time congestion     |
   | Control    | for real-time adaptation  | feedback from network    |
   |            |                           | infrastructure           |
   +------------+---------------------------+--------------------------+

                                  Table 1

   Fast notification mechanisms provide a low-latency, low-overhead
   method for improving responsiveness across load balancing,
   protection, and flow control.  These capabilities are increasingly
   vital to support demanding applications like distributed AI training
   and real-time cloud services.

2.  Informative References

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/rfc/rfc3168>.

   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010,
              <https://www.rfc-editor.org/rfc/rfc5880>.

   [RFC7490]  Bryant, S., Filsfils, C., Previdi, S., Shand, M., and N.
              So, "Remote Loop-Free Alternate (LFA) Fast Reroute (FRR)",
              RFC 7490, DOI 10.17487/RFC7490, April 2015,
              <https://www.rfc-editor.org/rfc/rfc7490>.

Authors' Addresses

   Xuesong Geng
   Huawei
   Email: gengxuesong@huawei.com

   Jie Dong
   Huawei
   Email: jie.dong@huawei.com

   Weiqiang Cheng
   China Mobile
   Email: chengweiqiang@chinamobile.com

   Dan Li
   Tsinghua University
   Email: tolidan@tsinghua.edu.cn

   Yongqing Zhu
   China Telecom
   Email: zhuyq8@chinatelecom.cn

   Zhengxin Han
   China Unicom
   Email: hanzx21@chinaunicom.cn
