Problem Statement and Requirements for Fast Network Event Notification in Distributed AI Training and Inference
draft-sang-fann-fast-network-event-notification-01

This document is an Internet-Draft (I-D). Anyone may submit an I-D to the IETF. This I-D is not endorsed by the IETF and has no formal standing in the IETF standards process.
Document	Type	Active Internet-Draft (individual)
	Authors	Liu Sang , Xuesong Geng , Huan Deng
	Last updated	2026-07-03
	RFC stream	(None)
	Intended RFC status	(None)
Stream	Stream state	(No stream defined)
	Consensus boilerplate	Unknown
	RFC Editor Note	(None)
IESG	IESG state	I-D Exists
	Telechat date	(None)
	Responsible AD	(None)
	Send notices to	(None)
Fast Network Notification (FANN)                                 L. Sang
Internet-Draft China Academy of Information and Communications Technology
Intended status: Informational                                   X. Geng
Expires: 4 January 2027                              Huawei Technologies
                                                                 H. Deng
                                                           China Telecom
                                                             3 July 2026

 Problem Statement and Requirements for Fast Network Event Notification
                in Distributed AI Training and Inference
           draft-sang-fann-fast-network-event-notification-01

Abstract

   Distributed AI training and inference rely on tightly coordinated
   communication across large-scale AI fabrics, making timely awareness
   of network conditions essential to application performance.  Network
   events, including congestion, link degradation, path changes, and
   device failures, can significantly affect collective communication
   efficiency, job completion time, and overall resource utilization.
   Existing network event notification mechanisms are primarily designed
   for general-purpose IP networks and do not adequately address the
   timeliness, semantics, and coordination requirements of distributed
   AI workloads.

   This document identifies the problem space for fast network event
   notification in distributed AI training and inference environments.
   It presents representative use cases, identifies gaps in existing
   approaches, and derives a set of functional and operational
   requirements for timely, reliable, and interoperable dissemination of
   network events across AI fabrics.  These requirements are intended to
   facilitate future work on network architectures and protocols for AI
   networking.  This document does not specify a protocol, signaling
   mechanism, or protocol extension.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

Sang, et al.             Expires 4 January 2027                 [Page 1]
Internet-Draft       Fast Network Event Notification           July 2026

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 January 2027.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Background and Motivation . . . . . . . . . . . . . . . .   3
     1.2.  Scope . . . . . . . . . . . . . . . . . . . . . . . . . .   4
     1.3.  Requirements Language . . . . . . . . . . . . . . . . . .   4
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   5
   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   6
     3.1.  AI Fabric Traffic and Workload Characteristics  . . . . .   6
     3.2.  Limitations of Existing Network Monitoring and Notification
           Mechanisms  . . . . . . . . . . . . . . . . . . . . . . .   7
     3.3.  Capability Gap Analysis for AI Fabric Scenarios . . . . .   9
     3.4.  Problem Summary . . . . . . . . . . . . . . . . . . . . .  10
   4.  Representative Use Cases  . . . . . . . . . . . . . . . . . .  10
     4.1.  UC1: Congestion Escalation During Collective
           Communication . . . . . . . . . . . . . . . . . . . . . .  10
     4.2.  UC2: Communication Performance Degradation  . . . . . . .  11
     4.3.  UC3: Node and Path Failure  . . . . . . . . . . . . . . .  11
     4.4.  UC4: Runtime-driven Network Adaptation  . . . . . . . . .  12
     4.5.  UC5: Cross-domain AI Fabric Operation . . . . . . . . . .  12
   5.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .  13
     5.1.  REQ-1: Timely Event Dissemination . . . . . . . . . . . .  13
     5.2.  REQ-2: Event Granularity  . . . . . . . . . . . . . . . .  14
     5.3.  REQ-3: Rich Event Semantics . . . . . . . . . . . . . . .  14
     5.4.  REQ-4: Cross-layer Coordination . . . . . . . . . . . . .  14
     5.5.  REQ-5: Interoperability . . . . . . . . . . . . . . . . .  14
     5.6.  REQ-6: Scalability  . . . . . . . . . . . . . . . . . . .  15
     5.7.  REQ-7: Reliability  . . . . . . . . . . . . . . . . . . .  15
     5.8.  REQ-8: Security . . . . . . . . . . . . . . . . . . . . .  15
     5.9.  REQ-9: Extensibility  . . . . . . . . . . . . . . . . . .  15
   6.  Reference Deployment Model  . . . . . . . . . . . . . . . . .  16
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  17
     7.1.  Event Authenticity and Integrity  . . . . . . . . . . . .  17
     7.2.  Access Control  . . . . . . . . . . . . . . . . . . . . .  18
     7.3.  Denial-of-Service Risks . . . . . . . . . . . . . . . . .  18
     7.4.  Privacy Considerations  . . . . . . . . . . . . . . . . .  18
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  18
   Appendix A.  Future Work  . . . . . . . . . . . . . . . . . . . .  18
   Appendix B.  References . . . . . . . . . . . . . . . . . . . . .  19
     B.1.  Normative References  . . . . . . . . . . . . . . . . . .  19
     B.2.  Informative References  . . . . . . . . . . . . . . . . .  19
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  19

1.  Introduction

1.1.  Background and Motivation

   Recent advances in foundation models have accelerated the deployment
   of distributed AI training and inference across large-scale computing
   infrastructures.  Compared with conventional cloud applications,
   distributed AI workloads generate sustained high-bandwidth traffic
   and rely on tightly synchronized communication among a large number
   of computing nodes.  As a result, application performance is highly
   sensitive to network conditions, particularly during collective
   communication operations.

   To support these workloads, modern data centers increasingly deploy
   dedicated high-performance networking infrastructures, commonly
   referred to as AI Fabrics.  An AI Fabric integrates high-speed
   network interconnects, accelerators, and scheduling systems to
   provide scalable communication for large GPU clusters.  Technologies
   such as Remote Direct Memory Access (RDMA) over Converged Ethernet
   (RoCE) are widely adopted to reduce communication latency and improve
   transport efficiency for distributed AI applications.

   Distributed AI workloads depend on collective communication
   primitives, including AllReduce, AllGather, ReduceScatter, and
   pipeline-parallel communication, which require coordinated
   participation from hundreds or thousands of compute nodes.  The
   overall execution time of these operations is often determined by the
   slowest participant.  Consequently, transient network events, such as
   congestion, link degradation, path changes, or device failures, can
   interrupt communication synchronization, create straggler nodes, and
   significantly reduce overall training and inference efficiency.

   Existing network monitoring and event notification mechanisms are
   primarily designed for general-purpose IP networks, where traffic is
   relatively elastic and applications are generally tolerant of
   transient network fluctuations.  In contrast, distributed AI
   workloads require timely and consistent awareness of network events
   to enable rapid adaptation by communication libraries, runtime
   systems, schedulers, or network controllers.  As AI Fabrics continue
   to increase in scale and complexity, existing mechanisms provide
   limited support for the responsiveness and coordination required by
   these environments, motivating the need to identify requirements for
   fast network event notification.

1.2.  Scope

   This document focuses on the problem space of fast network event
   notification for distributed AI training and inference deployed over
   AI Fabrics.  It examines the communication characteristics of
   distributed AI workloads, identifies limitations of existing network
   event notification mechanisms, and derives a set of functional and
   operational requirements from representative deployment scenarios.

   The scope of this document is limited to problem statement, use case
   analysis, and requirement identification.  It does not define a
   network protocol, signaling mechanism, routing or forwarding
   behavior, traffic engineering algorithm, YANG data model, or
   implementation approach.  Protocol specification and solution design
   are considered out of scope.

   The objective of this document is to provide a common understanding
   of the problem space and associated requirements, serving as input to
   future work on AI networking architectures, protocols, and management
   models.  It is intended to facilitate discussion and interoperability
   across implementations rather than prescribe a specific technical
   solution.

   The requirements identified in this document are intended to be
   technology-neutral and are applicable to AI networking environments
   regardless of the underlying transport technology or network
   implementation.

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Terminology

   The terminology defined in this document is intended for the purpose
   of this document and does not redefine existing IETF terminology.

   *AI Fabric:* A networking infrastructure designed to interconnect
   large-scale AI computing resources and support distributed AI
   workloads.  An AI Fabric provides high-performance communication
   among compute nodes and is optimized for large-scale collective
   communication and accelerator-centric traffic patterns.

   *AI Job:* A distributed training or inference task executed across
   one or more compute nodes within an AI Fabric.  An AI job typically
   requires coordinated communication and resource allocation throughout
   its execution.

   *Distributed AI Training:* A computing paradigm in which model
   training is distributed across multiple compute nodes to accelerate
   the training of large-scale machine learning models.  Distributed AI
   training relies on frequent synchronization and collective
   communication to maintain model consistency.

   *Distributed AI Inference:* A deployment model in which inference
   workloads are distributed across multiple compute nodes to improve
   scalability, throughput, or latency.  Such deployments may require
   communication and synchronization among participating nodes.

   *Network Event:* A change in network state that may affect the
   communication performance of distributed AI workloads.  Examples
   include congestion, link degradation, path changes, packet loss, and
   device failures.

   *Fast Network Event Notification:* A mechanism for disseminating
   network events to relevant entities with sufficiently low latency to
   enable timely adaptation by distributed AI applications,
   communication libraries, runtime systems, or network controllers.

   *Telemetry:* A mechanism for collecting and exporting network state
   information, including traffic statistics, device status, and link
   performance, for network monitoring and operational purposes.

   *Control Plane:* The set of network functions responsible for
   topology discovery, routing, path computation, policy distribution,
   and other control functions that determine network behavior.

   *Data Plane:* The set of network functions responsible for forwarding
   packets and carrying application traffic between communicating
   endpoints.

3.  Problem Statement

   This section examines the problem space for fast network event
   notification in AI Fabrics.  It describes the communication
   characteristics of distributed AI workloads, discusses the
   limitations of existing network event notification mechanisms, and
   identifies the capability gaps that motivate the functional and
   operational requirements presented in Section 5.

3.1.  AI Fabric Traffic and Workload Characteristics

   Distributed AI training and inference exhibit communication
   characteristics that differ significantly from those of conventional
   data-center applications.  Rather than being limited by raw network
   throughput alone, the performance of distributed AI workloads depends
   heavily on timely and coordinated communication among a large number
   of participating compute nodes.  Consequently, transient network
   events that have little impact on conventional applications may
   substantially affect collective communication efficiency, accelerator
   utilization, and overall job completion time.  The following
   paragraphs summarize the communication characteristics that motivate
   the need for fast network event notification in AI Fabrics.

   *Collective Communication Dependency:* Distributed AI training relies
   extensively on collective communication operations, including
   AllReduce, AllGather, All-to-All, and pipeline-parallel
   communication.  These operations require coordinated participation
   from a large number of compute nodes, and the completion time of each
   communication round is often determined by the slowest participant.
   Consequently, network events affecting a single node or communication
   path may delay the entire collective operation and reduce overall
   application efficiency.  Timely dissemination of such events enables
   communication libraries and runtime systems to react before
   performance degradation propagates across the workload.
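
   As an illustration only, the dependency on the slowest participant
   can be sketched in Python.  The function name and all timing values
   below are hypothetical and are not taken from any measurement:

```python
def collective_completion_time(per_node_times_ms):
    """Completion time of one synchronized round: the slowest participant."""
    if not per_node_times_ms:
        raise ValueError("at least one participant is required")
    return max(per_node_times_ms)


# Hypothetical round: 1024 nodes that each need 10 ms.
healthy = [10.0] * 1024
# Same round, but one node is delayed to 25 ms by a network event.
one_straggler = [10.0] * 1023 + [25.0]

healthy_ms = collective_completion_time(healthy)
degraded_ms = collective_completion_time(one_straggler)
slowdown = degraded_ms / healthy_ms
```

   Under these assumed values, a delay on a single node out of 1024
   stretches the whole round from 10 ms to 25 ms, a slowdown of 2.5.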

   *Bursty Traffic and Communication Imbalance:* Distributed AI
   workloads generate communication patterns that differ from
   conventional client-server traffic.  Collective operations frequently
   produce many-to-one traffic bursts, while model synchronization
   creates long-lived, high-bandwidth elephant flows.  These traffic
   patterns are sensitive to transient congestion and localized
   performance degradation.  Detecting and disseminating significant
   network events in a timely manner can help reduce the impact of
   communication imbalance on distributed AI execution.

   *Sensitivity to Network Latency and Transient Degradation:* AI
   workloads are highly sensitive not only to network failures but also
   to transient performance degradation, including latency variation,
   packet loss, and path-quality changes.  Even short-lived network
   events may interrupt communication synchronization, reduce
   accelerator utilization, and increase overall job completion time.
   Compared with conventional applications, distributed AI workloads
   therefore require faster awareness of network conditions to support
   timely adaptation.

   *Dynamic Runtime Adaptation:* Modern AI systems continuously adapt
   workload placement, communication patterns, and resource allocation
   according to runtime conditions.  Such adaptation increasingly
   depends on timely information about network state, including
   congestion, path degradation, and device availability.  Efficient
   dissemination of network events enables runtime systems,
   communication libraries, and schedulers to coordinate their responses
   and improve the resilience and efficiency of distributed AI
   execution.

   The characteristics described above demonstrate that distributed AI
   workloads require more timely and application-aware dissemination of
   network events than conventional data-center applications.  The
   following section examines the extent to which existing network event
   notification mechanisms satisfy these requirements.

3.2.  Limitations of Existing Network Monitoring and Notification
      Mechanisms

   Existing network event notification and monitoring mechanisms provide
   valuable capabilities for congestion indication, fault detection,
   routing recovery, and operational visibility in general-purpose IP
   networks.  These mechanisms have been successfully deployed in a wide
   range of operational environments.  However, they were not
   specifically designed to support the communication characteristics of
   distributed AI workloads described in the previous section.  As a
   result, several limitations become apparent when they are applied to
   AI Fabric environments.

   Explicit Congestion Notification (ECN) [RFC3168] provides lightweight
   in-band congestion indication and enables transport protocols to
   react before packet loss occurs.  However, ECN conveys only limited
   congestion information and does not distinguish event severity,
   affected communication groups, or the operational impact on
   distributed AI workloads.  In addition, ECN is primarily designed to
   signal congestion rather than other network events, such as path
   degradation or device anomalies.  Consequently, ECN alone cannot
   provide sufficient information for AI runtimes and schedulers to
   perform workload-aware adaptation.
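
   The limited information carried by ECN can be illustrated with a
   short Python sketch.  The codepoint values follow [RFC3168]; the
   function and table names are assumptions made for this example:

```python
# The ECN field is the two low-order bits of the IPv4 TOS octet or the
# IPv6 Traffic Class octet, as defined in RFC 3168.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # transport is not ECN-capable
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def ecn_codepoint(traffic_class_octet):
    """Return the ECN codepoint name carried in a TOS/Traffic Class octet."""
    if not 0 <= traffic_class_octet <= 0xFF:
        raise ValueError("octet must be in the range 0..255")
    return ECN_CODEPOINTS[traffic_class_octet & 0b11]


# 0xB8 is DSCP 46 with the ECN bits cleared; 0xBB is the same DSCP
# after a congested node has set the CE codepoint.
before = ecn_codepoint(0xB8)
after = ecn_codepoint(0xBB)
```

   The per-packet signal is one of four values.  It has no room to
   express severity, the affected communication group, or the type of
   event.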

   In-band telemetry mechanisms, such as In-band Network Telemetry (INT)
   and In Situ Operations, Administration, and Maintenance (IOAM)
   [RFC9197], provide detailed visibility into packet forwarding paths
   and network conditions.  These mechanisms are primarily intended for
   network measurement and diagnostics rather than timely dissemination
   of network events.  Furthermore, continuous telemetry collection may
   introduce considerable processing and operational overhead in large-
   scale AI clusters.  As a result, telemetry alone does not provide an
   efficient event-driven notification mechanism for distributed AI
   workloads.

   Streaming telemetry continuously exports network state information to
   external monitoring systems and improves the timeliness of
   operational visibility compared with periodic polling.  However, it
   focuses on exporting measurements rather than communicating
   actionable network events.  Distributed AI workloads typically
   require concise and timely notification of significant network state
   changes instead of continuous streams of telemetry data.
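
   The difference between exporting measurements and communicating
   events can be sketched as follows.  This is a minimal illustration,
   assuming a hypothetical queue-depth metric and hypothetical
   thresholds; it does not describe any existing telemetry system:

```python
def events_from_samples(samples, raise_at, clear_at):
    """Reduce a stream of samples to state-change events.

    An event is emitted only when the congestion state changes.  Using a
    lower clear threshold than the raise threshold (hysteresis) avoids
    repeated events while the metric hovers around one value.
    """
    events = []
    congested = False
    for index, value in enumerate(samples):
        if not congested and value >= raise_at:
            congested = True
            events.append((index, "congestion-raised"))
        elif congested and value <= clear_at:
            congested = False
            events.append((index, "congestion-cleared"))
    return events


# Hypothetical queue-depth samples (percent of buffer in use).
samples = [10, 12, 85, 90, 88, 70, 40, 15, 11]
events = events_from_samples(samples, raise_at=80, clear_at=20)
```

   In this example, nine exported samples reduce to two events, one
   when congestion is raised at sample 2 and one when it clears at
   sample 7.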

   Bidirectional Forwarding Detection (BFD) [RFC5880] provides rapid
   detection of link and neighbor failures and plays an important role
   in improving network resiliency.  However, distributed AI workloads
   are often affected by transient performance degradation rather than
   complete failures.  Conditions such as latency variation, packet
   loss, or localized congestion may significantly reduce collective
   communication efficiency while remaining outside the scope of BFD
   notifications.

   Routing protocols restore network connectivity following topology
   changes or failures through protocol convergence.  Although these
   mechanisms improve network availability, they primarily address
   reachability rather than communication quality.  Furthermore, routing
   convergence is typically triggered after topology changes rather than
   transient network degradation.  Consequently, routing mechanisms
   alone do not provide the timely, application-aware event
   dissemination required by distributed AI workloads.

   The mechanisms discussed above provide complementary capabilities for
   congestion indication, telemetry, fault detection, and routing
   recovery.  Nevertheless, none of them individually, nor their
   straightforward combination, fully satisfies the communication
   characteristics of distributed AI workloads described in Section 3.1.
   The following section summarizes the common capability gaps observed
   across these mechanisms.

3.3.  Capability Gap Analysis for AI Fabric Scenarios

   Based on the workload characteristics described in Section 3.1 and
   the limitations of existing mechanisms discussed in Section 3.2, this
   section identifies the common capability gaps that prevent current
   network monitoring and notification mechanisms from fully supporting
   distributed AI workloads.  These gaps motivate the functional and
   operational requirements presented in Section 5.

   *Notification Timeliness:* Distributed AI workloads require network
   events to be delivered quickly enough to support runtime adaptation
   during communication-intensive operations.  Existing mechanisms are
   often optimized for monitoring, diagnostics, or protocol convergence,
   resulting in notification latency that may exceed the timescale of AI
   communication iterations.  Delayed notification limits the ability of
   communication libraries, runtime systems, and schedulers to mitigate
   the impact of transient network degradation before application
   performance is affected.
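
   The relationship between notification latency and the iteration
   timescale can be made concrete with a small calculation.  The
   function name and the numbers below are hypothetical and serve only
   to illustrate the argument:

```python
import math


def iterations_before_reaction(notification_latency_ms, iteration_ms):
    """Number of communication iterations that have started by the time
    a notification about a network event reaches the runtime."""
    if notification_latency_ms < 0 or iteration_ms <= 0:
        raise ValueError("latency must be >= 0 and iteration time > 0")
    return math.ceil(notification_latency_ms / iteration_ms)


# Assumed iteration time of 50 ms.
slow_path = iterations_before_reaction(1000.0, 50.0)  # 1 s notification
fast_path = iterations_before_reaction(5.0, 50.0)     # 5 ms notification
```

   With these assumed values, a notification that takes one second
   arrives after 20 iterations have already been exposed to the event,
   whereas a 5 ms notification arrives within the first affected
   iteration.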

   *Event Granularity:* Existing mechanisms primarily expose network
   status at the device, interface, or path level.  Distributed AI
   workloads, however, often require finer-grained visibility into
   communication flows and collective operations in order to identify
   the affected participants and communication context.  Insufficient
   event granularity limits the ability to perform targeted workload
   adaptation and localized performance optimization.

   *Event Semantics:* Current network event notifications primarily
   describe network-centric conditions, such as congestion, packet loss,
   or link failures.  However, distributed AI applications require
   richer event semantics that enable runtime systems to understand the
   operational impact of network events, including whether collective
   communication may be affected or whether adaptive actions should be
   initiated.  Without such semantics, network events cannot be
   efficiently consumed by upper-layer AI software.
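
   The granularity and semantics gaps can be illustrated by comparing
   a network-centric event with one that also carries workload context.
   The record below is a sketch only.  All field names, identifiers,
   and the decision rule are hypothetical, and it is not a proposed
   event format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class NetworkEvent:
    # All field names are hypothetical; this is not a proposed format.
    event_type: str                       # e.g. "congestion"
    location: str                         # device or interface observed
    severity: Severity
    affected_flows: Tuple[str, ...] = ()  # communication context, if known
    collective_impact: bool = False       # may a collective be delayed?


def needs_runtime_action(event: NetworkEvent) -> bool:
    """True when the event indicates that a collective may be affected."""
    return event.collective_impact and event.severity is not Severity.INFO


# The same condition reported without and with workload context.
network_centric = NetworkEvent("congestion", "leaf3:eth12", Severity.WARNING)
workload_aware = NetworkEvent(
    "congestion", "leaf3:eth12", Severity.WARNING,
    affected_flows=("job42/allreduce/rank17",), collective_impact=True,
)
```

   Both records describe the same congestion on the same interface,
   but only the second one tells a runtime system which communication
   activity is affected and whether a reaction is warranted.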

   *Cross-layer Coordination:* Distributed AI workloads increasingly
   rely on coordinated interaction among communication libraries,
   runtime systems, schedulers, and network infrastructure.  Existing
   notification mechanisms generally operate within the networking
   domain and provide limited support for efficient dissemination of
   network events across these components.  As a result, network
   conditions cannot always be translated into timely workload
   adaptation or resource management decisions.

   *Interoperability:* AI Fabrics are increasingly deployed across
   heterogeneous environments involving equipment from multiple vendors
   and diverse operational domains.  Existing notification mechanisms
   often employ implementation-specific event formats, interfaces, or
   operational models, making consistent dissemination and
   interpretation of network events difficult.  Improving
   interoperability is therefore important for enabling portable and
   vendor-neutral AI networking solutions.

   The capability gaps described above indicate that existing mechanisms
   provide useful building blocks but do not collectively satisfy the
   operational requirements of distributed AI workloads.  Addressing
   these gaps does not necessarily require replacing existing
   technologies.  Instead, it motivates the definition of a common set
   of functional and operational requirements for fast network event
   notification in AI Fabric environments.

3.4.  Problem Summary

   The analysis presented in this section indicates that distributed AI
   workloads introduce communication characteristics that are not fully
   addressed by existing network monitoring and notification mechanisms.
   Although current mechanisms provide valuable capabilities for
   congestion indication, telemetry, fault detection, and routing
   recovery, they do not collectively satisfy the requirements for
   timely, fine-grained, semantically rich, and interoperable
   dissemination of network events in AI Fabric environments.  These
   observations motivate the need for a common set of functional and
   operational requirements for fast network event notification, which
   are presented in Section 5.

4.  Representative Use Cases

   The capability gaps identified in Section 3 arise in a variety of
   operational scenarios in distributed AI training and inference.  This
   section presents representative use cases that illustrate these
   scenarios and highlights where timely network event notification can
   improve coordination between the network and AI runtime systems.  The
   observations from these use cases provide the basis for the
   requirements defined in Section 5.

4.1.  UC1: Congestion Escalation During Collective Communication

   *Background:* Distributed AI training relies heavily on collective
   communication operations such as AllReduce and AllGather.  These
   operations generate synchronized many-to-one traffic bursts and long-
   lived elephant flows, making communication performance highly
   sensitive to transient congestion within the AI Fabric.

   *Network Event:* Transient congestion develops on one or more
   forwarding paths during collective communication.  Although the
   congestion may not immediately result in packet loss, it increases
   communication latency and delays synchronization across participating
   compute nodes, leading to straggler effects and reduced training
   throughput.

   *Limitation of Existing Mechanisms:* Existing mechanisms such as ECN
   provide limited congestion indication, while telemetry mechanisms
   primarily support monitoring and post-event analysis.  They do not
   provide sufficiently timely and workload-aware notification for AI
   communication libraries or runtime systems.

   *Implication for Fast Network Event Notification:* The network should
   rapidly report congestion escalation, together with sufficient
   context to identify the affected communication activities, enabling
   AI runtimes to react before communication performance deteriorates
   significantly.

4.2.  UC2: Communication Performance Degradation

   *Background:* Distributed AI workloads depend on stable communication
   quality over long-running training and inference sessions.
   Performance degradation may originate from link jitter, intermittent
   packet loss, NIC anomalies, or bandwidth fluctuation without causing
   complete connectivity failures.

   *Network Event:* Communication quality gradually degrades because of
   transient or progressive network impairments.  These impairments
   increase retransmissions and synchronization delays while remaining
   difficult to detect using traditional fault detection mechanisms.

   *Limitation of Existing Mechanisms:* Current monitoring mechanisms
   primarily detect complete failures or export statistical
   measurements.  They provide limited support for identifying gradual
   communication degradation or correlating such events with ongoing AI
   workloads.

   *Implication for Fast Network Event Notification:* The notification
   mechanism should report communication quality degradation in a timely
   manner, allowing AI runtime systems to initiate workload adaptation
   before application performance is significantly affected.
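
   One way to picture gradual degradation that never appears as a
   failure is a smoothed latency that drifts away from its baseline.
   The following sketch uses an exponentially weighted moving average.
   The function name, the threshold ratio, and the sample values are
   assumptions chosen for illustration:

```python
def first_degraded_sample(latencies_us, baseline_us, alpha=0.5, ratio=1.5):
    """Return the index at which the smoothed latency first exceeds
    ratio * baseline_us, or None if it never does."""
    if not latencies_us:
        return None
    smoothed = latencies_us[0]
    for index, sample in enumerate(latencies_us):
        # Exponentially weighted moving average of the observed latency.
        smoothed = alpha * sample + (1 - alpha) * smoothed
        if smoothed > ratio * baseline_us:
            return index
    return None


# Hypothetical per-round latencies in microseconds.
stable = [100.0] * 10
drifting = [100.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0]

stable_result = first_degraded_sample(stable, baseline_us=100.0)
drifting_result = first_degraded_sample(drifting, baseline_us=100.0)
```

   In the drifting series no link ever goes down, so a failure detector
   would stay silent.  The smoothed latency nevertheless crosses 1.5
   times the baseline at sample 5, which is the kind of condition this
   use case expects to be reported.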

4.3.  UC3: Node and Path Failure

   *Background:* Distributed AI applications rely on large numbers of
   compute nodes interconnected through redundant network paths.
   Failures affecting either compute nodes or forwarding paths may
   interrupt collective communication and delay workload execution.

   *Network Event:* A compute node, network device, or forwarding path
   becomes unavailable, requiring communication sessions to recover
   through runtime adaptation or network rerouting.

   *Limitation of Existing Mechanisms:* Existing failure detection and
   routing mechanisms focus primarily on restoring connectivity.  They
   generally do not provide workload-aware notification that enables AI
   runtimes to coordinate communication recovery with network recovery.

   *Implication for Fast Network Event Notification:* Fast notification
   of node and path failures should enable communication libraries,
   runtime systems, and schedulers to coordinate recovery actions and
   minimize the impact on distributed AI execution.

4.4.  UC4: Runtime-driven Network Adaptation

   *Background:* Modern AI platforms continuously perform workload
   placement, scaling, migration, and resource scheduling according to
   runtime conditions.  These decisions increasingly depend on current
   network conditions.

   *Network Event:* Network conditions change because of congestion,
   resource contention, or topology changes, requiring runtime systems
   to adjust communication patterns or workload placement.

   *Limitation of Existing Mechanisms:* Existing monitoring systems
   primarily export measurements rather than delivering actionable
   events.  Consequently, network information cannot always be
   incorporated into runtime adaptation in a timely manner.

   *Implication for Fast Network Event Notification:* Network events
   should be disseminated in a form that can be efficiently consumed by
   AI runtime systems and schedulers to support coordinated workload
   adaptation.

4.5.  UC5: Cross-domain AI Fabric Operation

   *Background:* Large-scale AI deployments increasingly span multiple
   administrative domains and heterogeneous network infrastructures.
   Consistent dissemination of network events becomes more challenging
   in these environments.

   *Network Event:* Network events occur within different operational
   domains and must be interpreted consistently across heterogeneous
   devices and management systems.

   *Limitation of Existing Mechanisms:* Existing notification mechanisms
   often rely on implementation-specific event formats and interfaces,
   limiting interoperability across vendors and operational domains.

   *Implication for Fast Network Event Notification:* Fast network event
   notification should support interoperable event representation and
   dissemination, enabling consistent interpretation of network events
   across heterogeneous AI Fabric environments.

   The scenarios presented above illustrate representative situations in
   which timely and interoperable dissemination of network events can
   improve the operation of distributed AI workloads.  Although the
   scenarios involve different types of network events, they
   collectively demonstrate the need for common capabilities in fast
   network event notification.  These observations motivate the
   functional and operational requirements described in the following
   section.

5.  Requirements

   This section defines a set of functional and operational requirements
   for fast network event notification in AI Fabric environments.  These
   requirements are derived from the capability gaps identified in
   Section 3 and the representative use cases described in Section 4.
   They are intended to guide the design and evaluation of future
   solutions rather than prescribe a specific protocol or
   implementation.

5.1.  REQ-1: Timely Event Dissemination

   *Requirement:*
   The system SHOULD deliver network event notifications to subscribed
   consumers with sufficiently low latency to enable runtime or
   scheduling actions before transient network conditions significantly
   impact application performance.  The system SHOULD adopt an event-
   driven push model for significant network state changes.

   *Discussion:*
   Distributed AI workloads rely on tightly synchronized communication
   patterns.  Delayed visibility of network conditions reduces the
   effectiveness of runtime adaptation and may lead to performance
   degradation in collective communication operations.
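
   The event-driven push model named in this requirement can be
   sketched as a small in-process publish/subscribe bus.  This is a
   minimal illustration under stated assumptions, not a proposed
   design: the EventBus class, its method names, and the "link-state"
   event type are hypothetical, and a real system would deliver
   events over a network transport.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Pushes an event to subscribers only when a subject's state changes."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._state: Dict[str, str] = {}

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subs[event_type].append(handler)

    def report_state(self, event_type: str, subject: str, state: str) -> bool:
        """Return True if an event was pushed, False if nothing changed."""
        key = f"{event_type}/{subject}"
        if self._state.get(key) == state:
            return False          # no state change, so nothing is pushed
        self._state[key] = state
        event = {"type": event_type, "subject": subject, "state": state}
        for handler in self._subs[event_type]:
            handler(event)        # consumers are notified without polling
        return True


received: List[Event] = []
bus = EventBus()
bus.subscribe("link-state", received.append)
bus.report_state("link-state", "leaf1:eth3", "up")
bus.report_state("link-state", "leaf1:eth3", "up")    # unchanged: suppressed
bus.report_state("link-state", "leaf1:eth3", "down")
```

   The subscriber receives two events, one per state change.  The
   repeated "up" report generates no traffic, which is the main
   difference from periodic export of measurements.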

5.2.  REQ-2: Event Granularity

   *Requirement:*
   The system SHOULD support event notifications that include sufficient
   context to identify the scope of affected communication entities,
   such as links, paths, nodes, or communication groups, when such
   information is available.

   *Discussion:*
   Fine-grained event context enables runtime systems to localize
   performance issues and apply targeted mitigation strategies, reducing
   unnecessary impact on unaffected workloads.

5.3.  REQ-3: Rich Event Semantics

   *Requirement:*
   The system SHOULD support extensible event metadata that describes
   the operational significance of network events in a machine-readable
   format.  The event representation SHOULD be independent of vendor-
   specific interpretations.

   *Discussion:*
   AI runtime systems require semantic context beyond raw network state
   to determine whether adaptation actions are necessary.
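
   One possible shape for an event record that carries both the scope
   context of REQ-2 and the machine-readable metadata of REQ-3 is
   sketched below.  This document does not define an event data
   model, so every field name, the severity scale, and the JSON
   encoding are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class NetworkEvent:
    """Hypothetical event record; not a format defined by this document."""

    event_type: str        # e.g. "congestion" or "link-degradation"
    severity: str          # assumed scale: "info", "warning", "critical"
    scope_kind: str        # "link", "path", "node", or "group" (REQ-2)
    scope_ids: List[str]   # identifiers of the affected entities
    metadata: Dict[str, str] = field(default_factory=dict)  # extensible

    def to_json(self) -> str:
        # Sorted keys give a stable, vendor-neutral text representation.
        return json.dumps(asdict(self), sort_keys=True)


event = NetworkEvent(
    event_type="link-degradation",
    severity="warning",
    scope_kind="link",
    scope_ids=["spine2:eth7"],
    metadata={"loss-ratio": "0.002"},
)
encoded = event.to_json()
decoded = json.loads(encoded)
```

   Keeping the scope in dedicated fields lets a consumer decide
   whether it is affected without parsing the metadata, while the
   metadata map leaves room for additional semantic context.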

5.4.  REQ-4: Cross-layer Coordination

   *Requirement:*
   The system SHOULD enable coordination between network infrastructure,
   communication libraries, runtime systems, and scheduling components
   through standardized event dissemination interfaces, without
   requiring tight coupling between these layers.

   *Discussion:*
   AI workload optimization increasingly depends on coordinated actions
   across multiple system layers, requiring consistent visibility of
   network events.

5.5.  REQ-5: Interoperability

   *Requirement:*
   The system SHOULD define a standardized representation of network
   events that can be interpreted consistently across heterogeneous AI
   Fabric deployments.

   *Discussion:*
   Heterogeneous hardware and multi-vendor environments require
   consistent event interpretation to ensure portable workload behavior.
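
   The interoperability gap can be illustrated by an adapter layer
   that maps implementation-specific events onto one common
   representation.  The two vendor formats below are invented for the
   example and do not correspond to any real product; the common
   field names are likewise hypothetical.

```python
from typing import Any, Callable, Dict

CommonEvent = Dict[str, Any]


def _from_vendor_a(raw: Dict[str, Any]) -> CommonEvent:
    # Hypothetical vendor A: {"alarm": "LINK_DOWN", "dev": ..., "port": ...}
    return {
        "event_type": raw["alarm"].lower().replace("_", "-"),
        "subject": f'{raw["dev"]}:{raw["port"]}',
    }


def _from_vendor_b(raw: Dict[str, Any]) -> CommonEvent:
    # Hypothetical vendor B: {"evt": {"kind": "linkDown", "ifname": ...}}
    kinds = {"linkDown": "link-down", "linkUp": "link-up"}
    evt = raw["evt"]
    return {"event_type": kinds[evt["kind"]], "subject": evt["ifname"]}


ADAPTERS: Dict[str, Callable[[Dict[str, Any]], CommonEvent]] = {
    "vendor-a": _from_vendor_a,
    "vendor-b": _from_vendor_b,
}


def normalize(source: str, raw: Dict[str, Any]) -> CommonEvent:
    """Translate a source-specific event into the common representation."""
    return ADAPTERS[source](raw)


a = normalize("vendor-a", {"alarm": "LINK_DOWN", "dev": "leaf1", "port": "eth3"})
b = normalize("vendor-b", {"evt": {"kind": "linkDown", "ifname": "leaf1:eth3"}})
```

   Both inputs normalize to the same record.  A standardized
   representation would remove the need for each consumer to maintain
   such adapters.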

5.6.  REQ-6: Scalability

   *Requirement:*
   The system SHOULD support event dissemination in AI Fabric
   environments with large-scale deployments (e.g., thousands of compute
   nodes) without introducing disproportionate communication or
   processing overhead.

   *Discussion:*
   AI clusters are expected to continue scaling in size and complexity,
   requiring notification mechanisms that remain efficient under
   increasing event volume and node count.

5.7.  REQ-7: Reliability

   *Requirement:*
   The system SHOULD ensure reliable delivery of critical network event
   notifications, minimizing loss or inconsistent delivery when such
   events may affect workload correctness or performance.

   *Discussion:*
   Reliable event dissemination improves the consistency of distributed
   decision-making in AI workloads.
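
   One common way for a consumer to notice lost notifications is a
   per-publisher sequence number.  The sketch below assumes such a
   number exists in each event, which this document does not specify;
   the GapDetector class is hypothetical and does not perform
   retransmission.

```python
from typing import List, Optional


class GapDetector:
    """Tracks sequence numbers from one publisher and reports gaps."""

    def __init__(self) -> None:
        self._expected: Optional[int] = None

    def receive(self, seq: int) -> List[int]:
        """Return the sequence numbers skipped before this one, if any."""
        if self._expected is None:
            self._expected = seq + 1      # first event sets the reference
            return []
        if seq < self._expected:          # duplicate or late arrival
            return []
        missing = list(range(self._expected, seq))
        self._expected = seq + 1
        return missing


detector = GapDetector()
results = [detector.receive(s) for s in (1, 2, 5, 4)]
```

   After events 1 and 2, the arrival of event 5 reveals that events 3
   and 4 are missing.  A consumer could then request retransmission
   or resynchronize its view of network state.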

5.8.  REQ-8: Security

   *Requirement:*
   The system SHOULD ensure the authenticity, integrity, and controlled
   delivery of network event notifications, while maintaining acceptable
   notification latency.

   *Discussion:*
   Event notifications may directly influence scheduling and runtime
   behavior, requiring protection against unauthorized modification or
   injection.

5.9.  REQ-9: Extensibility

   *Requirement:*
   The system SHOULD support extensible event types and metadata fields
   to accommodate future AI networking technologies and deployment
   models without requiring changes to the core notification mechanism.

   *Discussion:*
   AI networking systems are rapidly evolving, requiring forward-
   compatible event representation mechanisms.
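
   Forward compatibility is often achieved by having consumers keep
   the fields they understand and carry unknown fields through
   without failing.  The sketch below assumes a JSON encoding and
   hypothetical field names; neither is defined by this document.

```python
import json
from typing import Any, Dict, Tuple

# Fields this (hypothetical) consumer version understands.
KNOWN_FIELDS = ("event_type", "severity", "scope_ids")


def decode_event(raw: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a JSON event into understood fields and unknown extensions."""
    doc = json.loads(raw)
    known = {k: doc[k] for k in KNOWN_FIELDS if k in doc}
    extensions = {k: v for k, v in doc.items() if k not in KNOWN_FIELDS}
    return known, extensions


raw = (
    '{"event_type": "congestion", "severity": "warning", '
    '"scope_ids": ["leaf4:eth1"], "queue_depth_bytes": 1048576}'
)
known, extensions = decode_event(raw)
```

   The consumer processes the event even though "queue_depth_bytes"
   was added after it was written.  The unknown field is preserved so
   that it can be logged or forwarded to newer components.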

6.  Reference Deployment Model

   This section presents a representative deployment model to illustrate
   how the network event notification requirements described in this
   document arise in practical distributed AI environments.  The
   deployment model is provided for explanatory purposes only.  It is
   not intended to define a reference architecture or prescribe any
   implementation.

   Modern distributed AI training and inference systems are evolving
   from isolated data-center deployments toward distributed AI
   fabrics, where computing resources, data sources, and AI model
   execution are distributed across multiple sites connected by high-
   performance networks.  Such deployments require coordinated
   utilization of compute resources while respecting operational
   constraints such as data locality, security policies, and
   administrative boundaries.

   A representative deployment consists of three logical domains:
   1) distributed compute sites, 2) a wide-area interconnect network,
   and 3) centralized compute and model resource pools.

   Figure 1 illustrates one example.

             +-----------------------------------------+
             |    Centralized Compute Resource Pool    |
             |-----------------------------------------|
             |       GPU / AI Accelerator Clusters     |
             |           Distributed Training          |
             |             Model Repository            |
             |            Inference Services           |
             +---------------------+-------------------+
                                   ^
                                   |
                       High-speed AI Fabric / WAN
                 (Low latency / High throughput / Reliable)
                                   |
             +---------------------+-------------------+
             |                     |                   |
             |                     |                   |
   +---------+---------+ +---------+---------+ +----------+--------+
   |  Distributed Site | |  Distributed Site | |  Distributed Site |
   |-------------------| |-------------------| |-------------------|
   |    Local Data     | |    Local Data     | |    Local Data     |
   |  Edge Processing  | |  Edge Processing  | |  Edge Processing  |
   |  Local Inference  | |  Local Inference  | |  Local Inference  |
   +-------------------+ +-------------------+ +-------------------+

   Within the distributed compute sites, local resources are responsible
   for data acquisition, preprocessing, and execution of latency-
   sensitive AI tasks.  Depending on deployment requirements, portions
   of AI models may also execute locally to reduce communication
   overhead or satisfy locality constraints.

   The wide-area interconnect network provides the communication
   substrate between distributed sites and centralized compute
   resources.  It carries training data, model parameters, gradients,
   intermediate activations, and inference requests.  Consequently,
   communication performance directly affects distributed AI execution,
   making timely awareness of congestion, failures, and communication
   degradation increasingly important.

   The centralized compute resource pool provides large-scale resources
   for distributed training, model fine-tuning, and inference serving.
   Compute resources may be dynamically shared among multiple
   distributed sites, enabling elastic workload placement and efficient
   resource utilization.

   Collaborative execution mechanisms, such as model partitioning and
   distributed execution, may span multiple compute domains.  In these
   deployments, different stages of model execution are distributed
   across local and centralized resources, while communication-intensive
   operations continue throughout the execution process.

   This deployment model illustrates the operational context considered
   throughout this document.  The network events discussed in Section 3
   naturally arise from communication across the distributed AI fabric.
   The representative use cases in Section 4 describe typical situations
   that may occur within this deployment, while the requirements defined
   in Section 5 identify the capabilities needed to enable timely
   dissemination of such network events and efficient adaptation by AI
   systems.

7.  Security Considerations

   Fast network event notifications influence AI scheduling, workload
   placement, and fault recovery decisions.  As a result, they introduce
   security risks that must be addressed to ensure trustworthy system
   behavior.

7.1.  Event Authenticity and Integrity

   Event messages MUST be protected against forgery and tampering.
   Unauthorized generation or modification of events may lead to
   incorrect scheduling decisions or service disruption.
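
   As one illustration of integrity protection, an event can carry a
   message authentication code computed with a key shared between
   publisher and consumer.  The sketch below uses HMAC-SHA256 from
   the Python standard library.  The message layout and the key are
   hypothetical, and the example does not address key distribution,
   replay protection, or confidentiality.

```python
import hashlib
import hmac
import json
from typing import Dict


def sign_event(event: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Serialize the event canonically and attach an HMAC-SHA256 tag."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_event(message: Dict[str, str], key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(
        key, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


key = b"example-shared-key"   # placeholder only; never hard-code real keys
message = sign_event({"type": "link-state", "state": "down"}, key)
accepted = verify_event(message, key)

# An attacker who flips the state cannot produce a matching tag.
tampered = dict(message, payload=message["payload"].replace("down", "up"))
rejected = not verify_event(tampered, key)
```

   A consumer that verifies the tag before acting on an event will
   discard the modified message rather than reschedule workloads on
   false information.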

7.2.  Access Control

   Access to event publication and subscription interfaces MUST be
   restricted to authorized network and control-plane components.
   Sensitive cluster and workload information SHOULD NOT be exposed to
   unauthorized entities.

7.3.  Denial-of-Service Risks

   Event notification systems are vulnerable to resource exhaustion
   through high-rate or malicious event injection.  Mechanisms SHOULD be
   in place to limit event rates and prioritize critical notifications
   under overload conditions.
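
   Rate limiting with priority for critical notifications can be
   sketched as a token bucket in which part of the burst capacity is
   reserved for critical events.  The class, its parameters, and the
   numeric values are hypothetical.  Time is passed in explicitly so
   that the example is deterministic.

```python
from typing import List


class PriorityRateLimiter:
    """Token bucket that keeps a reserve of tokens for critical events."""

    def __init__(self, rate_per_s: float, burst: float,
                 reserved_for_critical: float) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.reserved = reserved_for_critical
        self.tokens = burst
        self.last = 0.0

    def allow(self, critical: bool, now: float) -> bool:
        """Return True if an event arriving at time `now` may be sent."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Routine events may not dip into the reserved portion.
        floor = 0.0 if critical else self.reserved
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True
        return False


limiter = PriorityRateLimiter(rate_per_s=1.0, burst=4.0,
                              reserved_for_critical=2.0)
routine: List[bool] = [limiter.allow(False, now=0.0) for _ in range(3)]
critical: List[bool] = [limiter.allow(True, now=0.0) for _ in range(3)]
after_refill = limiter.allow(True, now=1.0)
```

   In this run the third routine event is dropped while two critical
   events are still admitted from the reserve.  One second later the
   bucket has refilled enough to admit another critical event.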

7.4.  Privacy Considerations

   Event metadata may reveal workload distribution, topology, or system
   load information.  Such data SHOULD be protected during cross-domain
   transmission using appropriate confidentiality mechanisms.

8.  IANA Considerations

   This document defines problem statements, use cases, and technical
   requirements for AI-oriented fast network event notification.  It
   does not define new protocol fields, message types, port numbers,
   code points, YANG modules, or registry entries.  Therefore, *no IANA
   actions are required* for this document.

Appendix A.  Future Work

   This document focuses on requirements and architectural
   considerations for AI-oriented fast network event notification.
   Several topics remain for future standardization work:

   1.  Definition of a complete protocol framework for event
       notification in AI Fabrics

   2.  Standardized event data model with extensible AI-specific
       semantics

   3.  YANG models for configuration and capability advertisement

   4.  Mechanisms for device capability negotiation in heterogeneous
       environments

   5.  Evaluation methodologies for latency, scalability, and
       reliability of event notification systems

   6.  Interfaces between network notification systems and AI training
       frameworks

   These areas are expected to be further developed in subsequent FANN
   standardization efforts.

Appendix B.  References

B.1.  Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
   Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March
   1997.

B.2.  Informative References

   [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
   of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI
   10.17487/RFC3168, September 2001.

   [RFC9197] Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi, Ed.,
   "Data Fields for In Situ Operations, Administration, and Maintenance
   (IOAM)", RFC 9197, DOI 10.17487/RFC9197, May 2022.

   [RFC5880] Katz, D. and D. Ward, "Bidirectional Forwarding Detection
   (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010.

Authors' Addresses

   Liu Sang
   China Academy of Information and Communications Technology
   China
   Email: sangliu@caict.ac.cn

   Xuesong Geng
   Huawei Technologies
   China
   Email: gengxuesong@huawei.com

   Huan Deng
   China Telecom
   China
   Email: denghuan@chinatelecom.cn
