Problem Statement and Requirements for Fast Network Event Notification in Distributed AI Training and Inference
draft-sang-fann-fast-network-event-notification-01
This document is an Internet-Draft (I-D).
Anyone may submit an I-D to the IETF.
This I-D is not endorsed by the IETF and has no formal standing in the
IETF standards process.
| Document | Type | Active Internet-Draft (individual) | |
|---|---|---|---|
| Authors | Liu Sang , Xuesong Geng , Huan Deng | ||
| Last updated | 2026-07-03 | ||
| RFC stream | (None) | ||
| Intended RFC status | (None) | ||
| Stream | Stream state | (No stream defined) | |
| Consensus boilerplate | Unknown | ||
| RFC Editor Note | (None) | ||
| IESG | IESG state | I-D Exists | |
| Telechat date | (None) | ||
| Responsible AD | (None) | ||
| Send notices to | (None) |
Fast Network Notification (FANN) L. Sang
Internet-Draft China Academy of Information and Communications Technology
Intended status: Informational X. Geng
Expires: 4 January 2027 Huawei Technologies
H. Deng
China Telecom
3 July 2026
Problem Statement and Requirements for Fast Network Event Notification
in Distributed AI Training and Inference
draft-sang-fann-fast-network-event-notification-01
Abstract
Distributed AI training and inference rely on tightly coordinated
communication across large-scale AI fabrics, making timely awareness
of network conditions essential to application performance. Network
events, including congestion, link degradation, path changes, and
device failures, can significantly affect collective communication
efficiency, job completion time, and overall resource utilization.
Existing network event notification mechanisms are primarily designed
for general-purpose IP networks and do not adequately address the
timeliness, semantics, and coordination requirements of distributed
AI workloads.
This document identifies the problem space for fast network event
notification in distributed AI training and inference environments.
It presents representative use cases, identifies gaps in existing
approaches, and derives a set of functional and operational
requirements for timely, reliable, and interoperable dissemination of
network events across AI fabrics. These requirements are intended to
facilitate future work on network architectures and protocols for AI
networking. This document does not specify a protocol, signaling
mechanism, or protocol extension.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Sang, et al. Expires 4 January 2027 [Page 1]
Internet-Draft Fast Network Event Notification July 2026
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 4 January 2027.
Copyright Notice
Copyright (c) 2026 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Background and Motivation . . . . . . . . . . . . . . . . 3
1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Requirements Language . . . . . . . . . . . . . . . . . . 4
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 6
3.1. AI Fabric Traffic and Workload Characteristics . . . . . 6
3.2. Limitations of Existing Network Monitoring and Notification
Mechanisms . . . . . . . . . . . . . . . . . . . . . . . 7
3.3. Capability Gap Analysis for AI Fabric Scenarios . . . . . 9
3.4. Problem Summary . . . . . . . . . . . . . . . . . . . . . 10
4. Representative Use Cases . . . . . . . . . . . . . . . . . . 10
4.1. UC1: Congestion Escalation During Collective
Communication . . . . . . . . . . . . . . . . . . . . . . 10
4.2. UC2: Communication Performance Degradation . . . . . . . 11
4.3. UC3: Node and Path Failure . . . . . . . . . . . . . . . 11
4.4. UC4: Runtime-driven Network Adaptation . . . . . . . . . 12
4.5. UC5: Cross-domain AI Fabric Operation . . . . . . . . . 12
5. Requirements . . . . . . . . . . . . . . . . . . . . . . . . 13
5.1. REQ-1: Timely Event Dissemination . . . . . . . . . . . . 13
5.2. REQ-2: Event Granularity . . . . . . . . . . . . . . . . 14
5.3. REQ-3: Rich Event Semantics . . . . . . . . . . . . . . . 14
5.4. REQ-4: Cross-layer Coordination . . . . . . . . . . . . . 14
5.5. REQ-5: Interoperability . . . . . . . . . . . . . . . . . 14
5.6. REQ-6: Scalability . . . . . . . . . . . . . . . . . . . 15
5.7. REQ-7: Reliability . . . . . . . . . . . . . . . . . . . 15
5.8. REQ-8: Security . . . . . . . . . . . . . . . . . . . . . 15
5.9. REQ-9: Extensibility . . . . . . . . . . . . . . . . . . 15
6. Reference Deployment Model . . . . . . . . . . . . . . . . . 16
7. Security Considerations . . . . . . . . . . . . . . . . . . . 17
7.1. Event Authenticity and Integrity . . . . . . . . . . . . 17
7.2. Access Control . . . . . . . . . . . . . . . . . . . . . 18
7.3. Denial-of-Service Risks . . . . . . . . . . . . . . . . . 18
7.4. Privacy Considerations . . . . . . . . . . . . . . . . . 18
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18
Appendix A. Future Work . . . . . . . . . . . . . . . . . . . . 18
Appendix B. References . . . . . . . . . . . . . . . . . . . . . 19
B.1. Normative References . . . . . . . . . . . . . . . . . . 19
B.2. Informative References . . . . . . . . . . . . . . . . . 19
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 19
1. Introduction
1.1. Background and Motivation
Recent advances in foundation models have accelerated the deployment
of distributed AI training and inference across large-scale computing
infrastructures. Compared with conventional cloud applications,
distributed AI workloads generate sustained high-bandwidth traffic
and rely on tightly synchronized communication among a large number
of computing nodes. As a result, application performance is highly
sensitive to network conditions, particularly during collective
communication operations.
To support these workloads, modern data centers increasingly deploy
dedicated high-performance networking infrastructures, commonly
referred to as AI Fabrics. An AI Fabric integrates high-speed
network interconnects, accelerators, and scheduling systems to
provide scalable communication for large GPU clusters. Technologies
such as Remote Direct Memory Access (RDMA) over Converged Ethernet
(RoCE) are widely adopted to reduce communication latency and improve
transport efficiency for distributed AI applications.
Distributed AI workloads depend on collective communication
primitives, including AllReduce, AllGather, and ReduceScatter, as
well as point-to-point exchanges between pipeline-parallel stages.
Collective operations require coordinated participation from
hundreds or thousands of compute nodes. The
overall execution time of these operations is often determined by the
slowest participant. Consequently, transient network events, such as
congestion, link degradation, path changes, or device failures, can
interrupt communication synchronization, create straggler nodes, and
significantly reduce overall training and inference efficiency.
Existing network monitoring and event notification mechanisms are
primarily designed for general-purpose IP networks, where traffic is
relatively elastic and applications are generally tolerant of
transient network fluctuations. In contrast, distributed AI
workloads require timely and consistent awareness of network events
to enable rapid adaptation by communication libraries, runtime
systems, schedulers, or network controllers. As AI Fabrics continue
to increase in scale and complexity, existing mechanisms provide
limited support for the responsiveness and coordination required by
these environments, motivating the need to identify requirements for
fast network event notification.
1.2. Scope
This document focuses on the problem space of fast network event
notification for distributed AI training and inference deployed over
AI Fabrics. It examines the communication characteristics of
distributed AI workloads, identifies limitations of existing network
event notification mechanisms, and derives a set of functional and
operational requirements from representative deployment scenarios.
The scope of this document is limited to problem statement, use case
analysis, and requirement identification. It does not define a
network protocol, signaling mechanism, routing or forwarding
behavior, traffic engineering algorithm, YANG data model, or
implementation approach. Protocol specification and solution design
are considered out of scope.
The objective of this document is to provide a common understanding
of the problem space and associated requirements, serving as input to
future work on AI networking architectures, protocols, and management
models. It is intended to facilitate discussion and interoperability
across implementations rather than prescribe a specific technical
solution.
The requirements identified in this document are intended to be
technology-neutral and are applicable to AI networking environments
regardless of the underlying transport technology or network
implementation.
1.3. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
2. Terminology
The terminology defined in this document is intended for the purpose
of this document and does not redefine existing IETF terminology.
*AI Fabric:* A networking infrastructure designed to interconnect
large-scale AI computing resources and support distributed AI
workloads. An AI Fabric provides high-performance communication
among compute nodes and is optimized for large-scale collective
communication and accelerator-centric traffic patterns.
*AI Job:* A distributed training or inference task executed across
one or more compute nodes within an AI Fabric. An AI job typically
requires coordinated communication and resource allocation throughout
its execution.
*Distributed AI Training:* A computing paradigm in which model
training is distributed across multiple compute nodes to accelerate
the training of large-scale machine learning models. Distributed AI
training relies on frequent synchronization and collective
communication to maintain model consistency.
*Distributed AI Inference:* A deployment model in which inference
workloads are distributed across multiple compute nodes to improve
scalability, throughput, or latency. Such deployments may require
communication and synchronization among participating nodes.
*Network Event:* A change in network state that may affect the
communication performance of distributed AI workloads. Examples
include congestion, link degradation, path changes, packet loss, and
device failures.
*Fast Network Event Notification:* A mechanism for disseminating
network events to relevant entities with sufficiently low latency to
enable timely adaptation by distributed AI applications,
communication libraries, runtime systems, or network controllers.
*Telemetry:* A mechanism for collecting and exporting network state
information, including traffic statistics, device status, and link
performance, for network monitoring and operational purposes.
*Control Plane:* The set of network functions responsible for
topology discovery, routing, path computation, policy distribution,
and other control functions that determine network behavior.
*Data Plane:* The set of network functions responsible for forwarding
packets and carrying application traffic between communicating
endpoints.
3. Problem Statement
This section examines the problem space for fast network event
notification in AI Fabrics. It describes the communication
characteristics of distributed AI workloads, discusses the
limitations of existing network event notification mechanisms, and
identifies the capability gaps that motivate the functional and
operational requirements presented in Section 5.
3.1. AI Fabric Traffic and Workload Characteristics
Distributed AI training and inference exhibit communication
characteristics that differ significantly from those of conventional
data-center applications. Rather than being limited by raw network
throughput alone, the performance of distributed AI workloads depends
heavily on timely and coordinated communication among a large number
of participating compute nodes. Consequently, transient network
events that have little impact on conventional applications may
substantially affect collective communication efficiency, accelerator
utilization, and overall job completion time. The following
paragraphs summarize the communication characteristics that motivate
the need for fast network event notification in AI Fabrics.
*Collective Communication Dependency:* Distributed AI training relies
extensively on collective communication operations, including
AllReduce, AllGather, and All-to-All, together with point-to-point
exchanges between pipeline-parallel stages. Collective operations
require coordinated participation
from a large number of compute nodes, and the completion time of each
communication round is often determined by the slowest participant.
Consequently, network events affecting a single node or communication
path may delay the entire collective operation and reduce overall
application efficiency. Timely dissemination of such events enables
communication libraries and runtime systems to react before
performance degradation propagates across the workload.
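The dependency on the slowest participant can be illustrated with a minimal sketch. The function name and the per-node timings below are hypothetical values chosen for illustration and are not measurements from any deployment.

```python
def collective_round_time(per_node_seconds):
    """Completion time of one synchronized round: the slowest participant."""
    return max(per_node_seconds)


# Hypothetical per-node communication times for one round (seconds).
healthy = [0.010] * 8
# Same round, with one node behind a congested path.
degraded = healthy[:-1] + [0.035]

slowdown = collective_round_time(degraded) / collective_round_time(healthy)
print(f"round slowdown factor: {slowdown:.1f}")
```

In this sketch, a delay on one of eight nodes slows the whole round by the same factor as that single node, which is why an event on one path can matter to every participant.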
*Bursty Traffic and Communication Imbalance:* Distributed AI
workloads generate communication patterns that differ from
conventional client-server traffic. Collective operations frequently
produce many-to-one traffic bursts, while model synchronization
creates long-lived, high-bandwidth elephant flows. These traffic
patterns are sensitive to transient congestion and localized
performance degradation. Detecting and disseminating significant
network events in a timely manner can help reduce the impact of
communication imbalance on distributed AI execution.
*Sensitivity to Network Latency and Transient Degradation:* AI
workloads are highly sensitive not only to network failures but also
to transient performance degradation, including latency variation,
packet loss, and path-quality changes. Even short-lived network
events may interrupt communication synchronization, reduce
accelerator utilization, and increase overall job completion time.
Compared with conventional applications, distributed AI workloads
therefore require faster awareness of network conditions to support
timely adaptation.
*Dynamic Runtime Adaptation:* Modern AI systems continuously adapt
workload placement, communication patterns, and resource allocation
according to runtime conditions. Such adaptation increasingly
depends on timely information about network state, including
congestion, path degradation, and device availability. Efficient
dissemination of network events enables runtime systems,
communication libraries, and schedulers to coordinate their responses
and improve the resilience and efficiency of distributed AI
execution.
The characteristics described above demonstrate that distributed AI
workloads require more timely and application-aware dissemination of
network events than conventional data-center applications. The
following section examines the extent to which existing network event
notification mechanisms satisfy these requirements.
3.2. Limitations of Existing Network Monitoring and Notification
Mechanisms
Existing network event notification and monitoring mechanisms provide
valuable capabilities for congestion indication, fault detection,
routing recovery, and operational visibility in general-purpose IP
networks. These mechanisms have been successfully deployed in a wide
range of operational environments. However, they were not
specifically designed to support the communication characteristics of
distributed AI workloads described in the previous section. As a
result, several limitations become apparent when they are applied to
AI Fabric environments.
Explicit Congestion Notification (ECN) [RFC3168] provides lightweight
in-band congestion indication and enables transport protocols to
react before packet loss occurs. However, ECN conveys only limited
congestion information and does not distinguish event severity,
affected communication groups, or the operational impact on
distributed AI workloads. In addition, ECN is primarily designed to
signal congestion rather than other network events, such as path
degradation or device anomalies. Consequently, ECN alone cannot
provide sufficient information for AI runtimes and schedulers to
perform workload-aware adaptation.
In-band telemetry mechanisms, such as In-band Network Telemetry (INT)
and In Situ Operations, Administration, and Maintenance (IOAM)
[RFC9197], provide
detailed visibility into packet forwarding paths and network
conditions. These mechanisms are primarily intended for network
measurement and diagnostics rather than timely dissemination of
network events. Furthermore, continuous telemetry collection may
introduce considerable processing and operational overhead in large-
scale AI clusters. As a result, telemetry alone does not provide an
efficient event-driven notification mechanism for distributed AI
workloads.
Streaming telemetry continuously exports network state information to
external monitoring systems and improves the timeliness of
operational visibility compared with periodic polling. However, it
focuses on exporting measurements rather than communicating
actionable network events. Distributed AI workloads typically
require concise and timely notification of significant network state
changes instead of continuous streams of telemetry data.
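The difference between a continuous measurement stream and concise notification of significant state changes can be sketched as follows. This is an illustration only: the function name, the utilization samples, and the 0.9 threshold are assumptions, and the document does not define what counts as a significant change.

```python
def significant_changes(samples, threshold):
    """Yield (index, state) only when a sample crosses the threshold."""
    congested = False
    for index, value in enumerate(samples):
        now_congested = value >= threshold
        if now_congested != congested:
            congested = now_congested
            yield index, "congested" if now_congested else "cleared"


# Hypothetical link-utilization samples as a telemetry stream would export them.
samples = [0.42, 0.45, 0.91, 0.93, 0.95, 0.52, 0.50]
events = list(significant_changes(samples, threshold=0.9))
print(events)
```

Seven samples reduce to two notifications here, one when the threshold is crossed and one when the condition clears.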
Bidirectional Forwarding Detection (BFD) [RFC5880] provides rapid
detection of link and neighbor failures and plays an important role
in improving network resiliency. However, distributed AI workloads
are often affected by transient performance degradation rather than
complete failures. Conditions such as latency variation, packet
loss, or localized congestion may significantly reduce collective
communication efficiency while remaining outside the scope of BFD
notifications.
Routing protocols restore network connectivity following topology
changes or failures through protocol convergence. Although these
mechanisms improve network availability, they primarily address
reachability rather than communication quality. Furthermore, routing
convergence is typically triggered after topology changes rather than
transient network degradation. Consequently, routing mechanisms
alone do not provide the timely, application-aware event
dissemination required by distributed AI workloads.
The mechanisms discussed above provide complementary capabilities for
congestion indication, telemetry, fault detection, and routing
recovery. Nevertheless, none of them individually, nor their
straightforward combination, fully addresses the communication
characteristics of distributed AI workloads described in Section 3.1.
The following section summarizes the common capability gaps observed
across these mechanisms.
3.3. Capability Gap Analysis for AI Fabric Scenarios
Based on the workload characteristics described in Section 3.1 and
the limitations of existing mechanisms discussed in Section 3.2, this
section identifies the common capability gaps that prevent current
network monitoring and notification mechanisms from fully supporting
distributed AI workloads. These gaps motivate the functional and
operational requirements presented in Section 5.
*Notification Timeliness:* Distributed AI workloads require network
events to be delivered quickly enough to support runtime adaptation
during communication-intensive operations. Existing mechanisms are
often optimized for monitoring, diagnostics, or protocol convergence,
resulting in notification latency that may exceed the timescale of AI
communication iterations. Delayed notification limits the ability of
communication libraries, runtime systems, and schedulers to mitigate
the impact of transient network degradation before application
performance is affected.
*Event Granularity:* Existing mechanisms primarily expose network
status at the device, interface, or path level. Distributed AI
workloads, however, often require finer-grained visibility into
communication flows and collective operations in order to identify
the affected participants and communication context. Insufficient
event granularity limits the ability to perform targeted workload
adaptation and localized performance optimization.
*Event Semantics:* Current network event notifications primarily
describe network-centric conditions, such as congestion, packet loss,
or link failures. However, distributed AI applications require
richer event semantics that enable runtime systems to understand the
operational impact of network events, including whether collective
communication may be affected or whether adaptive actions should be
initiated. Without such semantics, network events cannot be
efficiently consumed by upper-layer AI software.
*Cross-layer Coordination:* Distributed AI workloads increasingly
rely on coordinated interaction among communication libraries,
runtime systems, schedulers, and network infrastructure. Existing
notification mechanisms generally operate within the networking
domain and provide limited support for efficient dissemination of
network events across these components. As a result, network
conditions cannot always be translated into timely workload
adaptation or resource management decisions.
*Interoperability:* AI Fabrics are increasingly deployed across
heterogeneous environments involving equipment from multiple vendors
and diverse operational domains. Existing notification mechanisms
often employ implementation-specific event formats, interfaces, or
operational models, making consistent dissemination and
interpretation of network events difficult. Improving
interoperability is therefore important for enabling portable and
vendor-neutral AI networking solutions.
The capability gaps described above indicate that existing mechanisms
provide useful building blocks but do not collectively satisfy the
operational requirements of distributed AI workloads. Addressing
these gaps does not necessarily require replacing existing
technologies. Instead, it motivates the definition of a common set
of functional and operational requirements for fast network event
notification in AI Fabric environments.
3.4. Problem Summary
The analysis presented in this section indicates that distributed AI
workloads introduce communication characteristics that are not fully
addressed by existing network monitoring and notification mechanisms.
Although current mechanisms provide valuable capabilities for
congestion indication, telemetry, fault detection, and routing
recovery, they do not collectively satisfy the requirements for
timely, fine-grained, semantically rich, and interoperable
dissemination of network events in AI Fabric environments. These
observations motivate the need for a common set of functional and
operational requirements for fast network event notification, which
are presented in Section 5.
4. Representative Use Cases
The capability gaps identified in Section 3 arise in a variety of
operational scenarios in distributed AI training and inference. This
section presents representative use cases that illustrate these
scenarios and highlights where timely network event notification can
improve coordination between the network and AI runtime systems. The
observations from these use cases provide the basis for the
requirements defined in Section 5.
4.1. UC1: Congestion Escalation During Collective Communication
*Background:* Distributed AI training relies heavily on collective
communication operations such as AllReduce and AllGather. These
operations generate synchronized many-to-one traffic bursts and long-
lived elephant flows, making communication performance highly
sensitive to transient congestion within the AI Fabric.
*Network Event:* Transient congestion develops on one or more
forwarding paths during collective communication. Although the
congestion may not immediately result in packet loss, it increases
communication latency and delays synchronization across participating
compute nodes, leading to straggler effects and reduced training
throughput.
*Limitation of Existing Mechanisms:* Existing mechanisms such as ECN
provide limited congestion indication, while telemetry mechanisms
primarily support monitoring and post-event analysis. They do not
provide sufficiently timely and workload-aware notification for AI
communication libraries or runtime systems.
*Implication for Fast Network Event Notification:* The network should
rapidly signal congestion escalation together with sufficient context
to identify affected communication activities, enabling AI runtimes
to react before communication performance deteriorates significantly.
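One possible reading of "congestion escalation" is a rise in a discrete severity level. The sketch below assumes that reading. The level names, the queue-occupancy boundaries, and the sample values are hypothetical and are not defined by this document.

```python
import bisect

# Hypothetical queue-occupancy boundaries (fraction of buffer) between levels.
BOUNDARIES = [0.5, 0.8, 0.95]
LEVELS = ["normal", "elevated", "high", "critical"]


def escalations(occupancies):
    """Return (sample index, level) each time severity rises over the prior sample."""
    previous = 0
    reported = []
    for index, occupancy in enumerate(occupancies):
        level = bisect.bisect_right(BOUNDARIES, occupancy)
        if level > previous:
            reported.append((index, LEVELS[level]))
        previous = level
    return reported


print(escalations([0.30, 0.55, 0.60, 0.85, 0.97, 0.40]))
```

Only increases are reported. The drop back to a normal level in the last sample produces no escalation entry.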
4.2. UC2: Communication Performance Degradation
*Background:* Distributed AI workloads depend on stable communication
quality over long-running training and inference sessions.
Performance degradation may originate from link jitter, intermittent
packet loss, NIC anomalies, or bandwidth fluctuation without causing
complete connectivity failures.
*Network Event:* Communication quality gradually degrades because of
transient or progressive network impairments. These impairments
increase retransmissions and synchronization delays while remaining
difficult to detect using traditional fault detection mechanisms.
*Limitation of Existing Mechanisms:* Current monitoring mechanisms
primarily detect complete failures or export statistical
measurements. They provide limited support for identifying gradual
communication degradation or correlating such events with ongoing AI
workloads.
*Implication for Fast Network Event Notification:* The notification
mechanism should report communication quality degradation in a timely
manner, allowing AI runtime systems to initiate workload adaptation
before application performance is significantly affected.
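Gradual degradation of this kind can be distinguished from an isolated spike by smoothing the measurements before comparing them with a baseline. The following sketch uses an exponentially weighted moving average, which is one common technique and not a method defined by this document. The baseline, smoothing factor, ratio, and latency samples are assumptions.

```python
def first_degradation(latencies_us, baseline_us, alpha=0.2, ratio=1.5):
    """Return the first index where smoothed latency exceeds ratio * baseline."""
    smoothed = baseline_us
    for index, sample in enumerate(latencies_us):
        # Exponentially weighted moving average of the latency samples.
        smoothed = alpha * sample + (1 - alpha) * smoothed
        if smoothed > ratio * baseline_us:
            return index
    return None


# Hypothetical samples: one isolated spike, then a progressive drift upward.
samples = [10, 25, 10, 10, 14, 16, 18, 20, 22, 24]
print(first_degradation(samples, baseline_us=10.0))
```

With these values the isolated spike at index 1 does not trigger a report, while the sustained drift does once the smoothed value passes 1.5 times the baseline.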
4.3. UC3: Node and Path Failure
*Background:* Distributed AI applications rely on large numbers of
compute nodes interconnected through redundant network paths.
Failures affecting either compute nodes or forwarding paths may
interrupt collective communication and delay workload execution.
*Network Event:* A compute node, network device, or forwarding path
becomes unavailable, requiring communication sessions to recover
through runtime adaptation or network rerouting.
*Limitation of Existing Mechanisms:* Existing failure detection and
routing mechanisms focus primarily on restoring connectivity. They
generally do not provide workload-aware notification that enables AI
runtimes to coordinate communication recovery with network recovery.
*Implication for Fast Network Event Notification:* Fast notification
of node and path failures should enable communication libraries,
runtime systems, and schedulers to coordinate recovery actions and
minimize the impact on distributed AI execution.
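Coordinated recovery presupposes knowing which workloads a failure touches. The sketch below assumes a hypothetical inventory that maps each AI job to the forwarding paths it uses. The job names, link names, and the inventory itself are illustrative and are not defined by this document.

```python
def affected_jobs(failed_link, job_paths):
    """Return the jobs that have at least one path traversing the failed link."""
    return sorted(
        job
        for job, paths in job_paths.items()
        if any(failed_link in path for path in paths)
    )


# Hypothetical inventory: job -> list of paths, each path a list of link names.
job_paths = {
    "job-a": [["leaf1-spine1", "spine1-leaf2"]],
    "job-b": [["leaf3-spine2", "spine2-leaf4"]],
    "job-c": [["leaf1-spine1", "spine1-leaf4"], ["leaf1-spine2", "spine2-leaf4"]],
}
print(affected_jobs("leaf1-spine1", job_paths))
```

A notification scoped this way reaches only the jobs that use the failed link and leaves the others undisturbed.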
4.4. UC4: Runtime-driven Network Adaptation
*Background:* Modern AI platforms continuously perform workload
placement, scaling, migration, and resource scheduling according to
runtime conditions. These decisions increasingly depend on current
network conditions.
*Network Event:* Network conditions change because of congestion,
resource contention, or topology changes, requiring runtime systems
to adjust communication patterns or workload placement.
*Limitation of Existing Mechanisms:* Existing monitoring systems
primarily export measurements rather than delivering actionable
events. Consequently, network information cannot always be
incorporated into runtime adaptation in a timely manner.
*Implication for Fast Network Event Notification:* Network events
should be disseminated in a form that can be efficiently consumed by
AI runtime systems and schedulers to support coordinated workload
adaptation.
4.5. UC5: Cross-domain AI Fabric Operation
*Background:* Large-scale AI deployments increasingly span multiple
administrative domains and heterogeneous network infrastructures.
Consistent dissemination of network events becomes more challenging
in these environments.
*Network Event:* Network events occur within different operational
domains and must be interpreted consistently across heterogeneous
devices and management systems.
*Limitation of Existing Mechanisms:* Existing notification mechanisms
often rely on implementation-specific event formats and interfaces,
limiting interoperability across vendors and operational domains.
*Implication for Fast Network Event Notification:* Fast network event
notification should support interoperable event representation and
dissemination, enabling consistent interpretation of network events
across heterogeneous AI Fabric environments.
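Consistent interpretation across domains can be pictured as translating each implementation-specific record into one shared representation. In the sketch below, the vendor labels, field names, and severity scale are all invented for illustration and do not correspond to any real product or to a format defined by this document.

```python
def normalize(vendor, raw):
    """Translate a vendor-specific record into one common event dictionary."""
    if vendor == "vendor-a":
        return {
            "type": raw["evt"].lower(),
            "link": raw["ifname"],
            "severity": raw["sev"],
        }
    if vendor == "vendor-b":
        levels = {1: "low", 2: "medium", 3: "high"}
        return {
            "type": raw["eventType"],
            "link": raw["port"],
            "severity": levels[raw["level"]],
        }
    raise ValueError(f"no mapping for vendor {vendor!r}")


# Two hypothetical records describing the same condition in different formats.
a = normalize("vendor-a", {"evt": "CONGESTION", "ifname": "eth1/1", "sev": "high"})
b = normalize("vendor-b", {"eventType": "congestion", "port": "eth1/1", "level": 3})
print(a == b)
```

Once both records map to the same dictionary, a consumer can act on the event without knowing which domain produced it.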
The scenarios presented above illustrate representative situations in
which timely and interoperable dissemination of network events can
improve the operation of distributed AI workloads. Although the
scenarios involve different types of network events, they
collectively demonstrate the need for common capabilities in fast
network event notification. These observations motivate the
functional and operational requirements described in the following
section.
5. Requirements
This section defines a set of functional and operational requirements
for fast network event notification in AI Fabric environments. These
requirements are derived from the capability gaps identified in
Section 3 and the representative use cases described in Section 4.
They are intended to guide the design and evaluation of future
solutions rather than prescribe a specific protocol or
implementation.
5.1. REQ-1: Timely Event Dissemination
*Requirement:*
The system SHOULD deliver network event notifications to subscribed
consumers with sufficiently low latency to enable runtime or
scheduling actions before transient network conditions significantly
impact application performance. The system SHOULD adopt an event-
driven push model for significant network state changes.
*Discussion:*
Distributed AI workloads rely on tightly synchronized communication
patterns. Delayed visibility of network conditions reduces the
effectiveness of runtime adaptation and may lead to performance
degradation in collective communication operations.
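The event-driven push model named in REQ-1 can be sketched as a
minimal in-process publish/subscribe dispatcher. This is an
illustration only: the class name EventBus, the event type strings,
and the dictionary fields are hypothetical and are not defined by
this document, and a real deployment would push events across the
network rather than within one process.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical notifier that pushes events to subscribers as they occur."""

    def __init__(self) -> None:
        # Maps an event type to the handlers registered for it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """Register a consumer callback for one event type."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> int:
        """Push the event to every subscriber; return how many were notified."""
        handlers = list(self._subscribers.get(event_type, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


received = []
bus = EventBus()
bus.subscribe("link-degraded", received.append)

# The consumer is notified at publish time; it never polls for state.
notified = bus.publish("link-degraded", {"link": "leaf1:eth3", "loss_pct": 2.5})
# No subscriber registered for this type, so nothing is delivered.
ignored = bus.publish("congestion", {"queue": "spine2:q4"})
```

The consumer does no polling: its handler runs when the state change
is published, which is the property that REQ-1 asks for.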
5.2. REQ-2: Event Granularity
*Requirement:*
The system SHOULD support event notifications that include sufficient
context to identify the scope of affected communication entities,
such as links, paths, nodes, or communication groups, when such
information is available.
*Discussion:*
Fine-grained event context enables runtime systems to localize
performance issues and apply targeted mitigation strategies, reducing
unnecessary impact on unaffected workloads.
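As a sketch of how scoped event context might be used, the example
below attaches a set of affected links, nodes, and communication
groups to an event and selects only the jobs that touch an affected
node. The type names (EventScope, NetworkEvent), the field names, and
the job-to-node mapping are assumptions made for illustration; this
document does not define an event format.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class EventScope:
    """Hypothetical scope descriptor: which entities an event affects."""
    links: FrozenSet[str] = frozenset()
    nodes: FrozenSet[str] = frozenset()
    groups: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class NetworkEvent:
    event_type: str
    scope: EventScope


def affected_jobs(event: NetworkEvent,
                  job_nodes: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the jobs that use at least one node named in the event scope."""
    return sorted(job for job, nodes in job_nodes.items()
                  if nodes & event.scope.nodes)


event = NetworkEvent(
    event_type="node-failure",
    scope=EventScope(nodes=frozenset({"gpu-07"})),
)
jobs = {
    "job-a": frozenset({"gpu-01", "gpu-07"}),
    "job-b": frozenset({"gpu-20"}),
}
impacted = affected_jobs(event, jobs)
```

Only job-a is selected for mitigation; job-b is left undisturbed,
which is the targeted behavior described in the discussion above.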
5.3. REQ-3: Rich Event Semantics
*Requirement:*
The system SHOULD support extensible event metadata that describes
the operational significance of network events in a machine-readable
format. The event representation SHOULD be independent of vendor-
specific interpretations.
*Discussion:*
AI runtime systems require semantic context beyond raw network state
to determine whether adaptation actions are necessary.
5.4. REQ-4: Cross-layer Coordination
*Requirement:*
The system SHOULD enable coordination between network infrastructure,
communication libraries, runtime systems, and scheduling components
through standardized event dissemination interfaces, without
requiring tight coupling between these layers.
*Discussion:*
AI workload optimization increasingly depends on coordinated actions
across multiple system layers, requiring consistent visibility of
network events.
5.5. REQ-5: Interoperability
*Requirement:*
The system SHOULD define a standardized representation of network
events that can be interpreted consistently across heterogeneous AI
Fabric deployments.
*Discussion:*
Heterogeneous hardware and multi-vendor environments require
consistent event interpretation to ensure portable workload behavior.
5.6. REQ-6: Scalability
*Requirement:*
The system SHOULD support event dissemination in large-scale AI
Fabric deployments (e.g., thousands of compute nodes) without
introducing disproportionate communication or processing overhead.
*Discussion:*
AI clusters are expected to continue scaling in size and complexity,
requiring notification mechanisms that remain efficient under
increasing event volume and node count.
5.7. REQ-7: Reliability
*Requirement:*
The system SHOULD ensure reliable delivery of critical network event
notifications, minimizing loss or inconsistent delivery when such
events may affect workload correctness or performance.
*Discussion:*
Reliable event dissemination improves the consistency of distributed
decision-making in AI workloads.
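This document does not specify a reliability mechanism. One common
technique, shown here purely as an assumption, is for each publisher
to number its notifications so that a consumer can detect loss and
request recovery. The class name SequenceTracker and the choice to
start numbering at zero are hypothetical.

```python
from typing import List


class SequenceTracker:
    """Hypothetical consumer-side check for lost notifications."""

    def __init__(self) -> None:
        # Next sequence number expected from the publisher.
        self._next_expected = 0

    def observe(self, seq: int) -> List[int]:
        """Record a received sequence number; return any numbers skipped."""
        if seq < self._next_expected:
            # Duplicate or late delivery: nothing new is missing.
            return []
        missing = list(range(self._next_expected, seq))
        self._next_expected = seq + 1
        return missing


tracker = SequenceTracker()
gaps = [tracker.observe(seq) for seq in (0, 1, 4, 3, 5)]
```

Here the arrival of notification 4 reveals that 2 and 3 were not
received in order. What the consumer does next (for example,
requesting retransmission) is outside the scope of this sketch.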
5.8. REQ-8: Security
*Requirement:*
The system SHOULD ensure the authenticity, integrity, and controlled
delivery of network event notifications, while maintaining acceptable
notification latency.
*Discussion:*
Event notifications may directly influence scheduling and runtime
behavior, requiring protection against unauthorized modification or
injection.
5.9. REQ-9: Extensibility
*Requirement:*
The system SHOULD support extensible event types and metadata fields
to accommodate future AI networking technologies and deployment
models without requiring changes to the core notification mechanism.
*Discussion:*
AI networking systems are rapidly evolving, requiring forward-
compatible event representation mechanisms.
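Forward compatibility of the kind REQ-9 describes is often achieved
by having consumers preserve, rather than reject, fields they do not
recognize. The sketch below assumes a JSON encoding and two core
fields (event_type, severity); both are illustrative choices and are
not defined by this document.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Fields this hypothetical consumer version understands.
KNOWN_FIELDS = ("event_type", "severity")


@dataclass
class DecodedEvent:
    event_type: str
    severity: str = "unknown"
    # Unrecognized fields are kept here instead of causing a decode error.
    extensions: Dict[str, Any] = field(default_factory=dict)


def decode_event(raw: str) -> DecodedEvent:
    """Decode a JSON event, tolerating fields added by newer publishers."""
    data = json.loads(raw)
    extensions = {k: v for k, v in data.items() if k not in KNOWN_FIELDS}
    return DecodedEvent(
        event_type=data["event_type"],
        severity=data.get("severity", "unknown"),
        extensions=extensions,
    )


# A newer publisher adds "collective_id"; the older consumer still decodes.
decoded = decode_event(
    '{"event_type": "congestion", "severity": "major", "collective_id": 17}'
)
```

The core notification handling is unchanged when a publisher
introduces a new field, and the extra data remains available to
consumers that later learn to interpret it.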
6. Reference Deployment Model
This section presents a representative deployment model to illustrate
how the network event notification requirements described in this
document arise in practical distributed AI environments. The
deployment model is provided for explanatory purposes only. It is
not intended to define a reference architecture or prescribe any
implementation.
Modern distributed AI training and inference systems are evolving
from isolated data-center deployments toward distributed AI
fabrics, where computing resources, data sources, and AI model
execution are distributed across multiple sites connected by high-
performance networks. Such deployments require coordinated
utilization of compute resources while respecting operational
constraints such as data locality, security policies, and
administrative boundaries.
A representative deployment consists of three logical domains:
(1) distributed compute sites, (2) a wide-area interconnect network,
and (3) centralized compute and model resource pools.
Figure 1 illustrates one example.
          +-----------------------------------------+
          |    Centralized Compute Resource Pool    |
          |-----------------------------------------|
          |      GPU / AI Accelerator Clusters      |
          |          Distributed Training           |
          |            Model Repository             |
          |           Inference Services            |
          +---------------------+-------------------+
                                ^
                                |
                   High-speed AI Fabric / WAN
           (Low latency / High throughput / Reliable)
                                |
          +---------------------+---------------------+
          |                     |                     |
          |                     |                     |
+---------+---------+ +---------+---------+ +---------+---------+
| Distributed Site  | | Distributed Site  | | Distributed Site  |
|-------------------| |-------------------| |-------------------|
| Local Data        | | Local Data        | | Local Data        |
| Edge Processing   | | Edge Processing   | | Edge Processing   |
| Local Inference   | | Local Inference   | | Local Inference   |
+-------------------+ +-------------------+ +-------------------+
Within the distributed compute sites, local resources are responsible
for data acquisition, preprocessing, and execution of latency-
sensitive AI tasks. Depending on deployment requirements, portions
of AI models may also execute locally to reduce communication
overhead or satisfy locality constraints.
The wide-area interconnect network provides the communication
substrate between distributed sites and centralized compute
resources. It carries training data, model parameters, gradients,
intermediate activations, and inference requests. Consequently,
communication performance directly affects distributed AI execution,
making timely awareness of congestion, failures, and communication
degradation increasingly important.
The centralized compute resource pool provides large-scale resources
for distributed training, model fine-tuning, and inference serving.
Compute resources may be dynamically shared among multiple
distributed sites, enabling elastic workload placement and efficient
resource utilization.
Collaborative execution mechanisms, such as model partitioning and
distributed execution, may span multiple compute domains. In these
deployments, different stages of model execution are distributed
across local and centralized resources, while communication-intensive
operations continue throughout the execution process.
This deployment model illustrates the operational context considered
throughout this document. The network events discussed in Section 3
naturally arise from communication across the distributed AI fabric.
The representative use cases in Section 4 describe typical situations
that may occur within this deployment, while the requirements defined
in Section 5 identify the capabilities needed to enable timely
dissemination of such network events and efficient adaptation by AI
systems.
7. Security Considerations
Fast network event notifications influence AI scheduling, workload
placement, and fault recovery decisions. As a result, they introduce
security risks that must be addressed to ensure trustworthy system
behavior.
7.1. Event Authenticity and Integrity
Event messages MUST be protected against forgery and tampering.
Unauthorized generation or modification of events may lead to
incorrect scheduling decisions or service disruption.
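One way to detect forged or modified events is a keyed message
authentication code. The sketch below uses HMAC-SHA256 from the
Python standard library over a canonical JSON encoding. The shared
key, the encoding, and the function names are assumptions for
illustration; this document does not mandate a particular
mechanism, and key distribution is not shown.

```python
import hashlib
import hmac
import json


def sign_event(event: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical encoding of the event."""
    # Sorted keys and fixed separators give sender and receiver identical bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_event(event: dict, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_event(event, key), tag)


key = b"example-shared-key"  # Placeholder only; never hard-code real keys.
event = {"event_type": "link-down", "link": "leaf1:eth3"}
tag = sign_event(event, key)

tampered = dict(event, link="leaf9:eth0")
```

A consumer would discard any event for which verify_event returns
False, so that a modified or injected notification cannot trigger a
scheduling action.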
7.2. Access Control
Access to event publication and subscription interfaces MUST be
restricted to authorized network and control-plane components.
Sensitive cluster and workload information SHOULD NOT be exposed to
unauthorized entities.
7.3. Denial-of-Service Risks
Event notification systems are vulnerable to resource exhaustion
through high-rate or malicious event injection. Mechanisms SHOULD be
in place to limit event rates and prioritize critical notifications
under overload conditions.
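Rate limiting with priority for critical notifications could, for
example, be realized with a token bucket that holds back part of its
capacity for critical events. The class name, the parameters, and the
reservation policy below are hypothetical. The caller supplies the
current time so that the behavior is deterministic.

```python
class PriorityRateLimiter:
    """Hypothetical token bucket that reserves capacity for critical events."""

    def __init__(self, rate_per_s: float, burst: float,
                 reserved_for_critical: float) -> None:
        self.rate = rate_per_s                 # Tokens added per second.
        self.capacity = burst                  # Maximum tokens held.
        self.reserved = reserved_for_critical  # Tokens only critical events may use.
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float, critical: bool) -> bool:
        """Admit or reject one event arriving at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Non-critical events may not dip into the reserved tokens.
        floor = 0.0 if critical else self.reserved
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True
        return False


limiter = PriorityRateLimiter(rate_per_s=1.0, burst=3.0,
                              reserved_for_critical=1.0)
# A burst of three routine events at t=0: the third is rejected.
routine = [limiter.allow(0.0, critical=False) for _ in range(3)]
# A critical event at the same instant still gets through.
critical_ok = limiter.allow(0.0, critical=True)
```

Under a flood of routine events, the reserved token keeps a path open
for a critical notification, while the overall event rate stays
bounded.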
7.4. Privacy Considerations
Event metadata may reveal workload distribution, topology, or system
load information. Such data SHOULD be protected during cross-domain
transmission using appropriate confidentiality mechanisms.
8. IANA Considerations
This document defines problem statements, use cases, and technical
requirements for AI-oriented fast network event notification. It
does not define new protocol fields, message types, port numbers,
code points, YANG modules, or registry entries. Therefore, *no IANA
actions are required* for this document.
Appendix A. Future Work
This document focuses on requirements and architectural
considerations for AI-oriented fast network event notification.
Several topics remain for future standardization work:
1. Definition of a complete protocol framework for event
notification in AI Fabrics
2. Standardized event data model with extensible AI-specific
semantics
3. YANG models for configuration and capability advertisement
4. Mechanisms for device capability negotiation in heterogeneous
environments
5. Evaluation methodologies for latency, scalability, and
reliability of event notification systems
6. Interfaces between network notification systems and AI training
frameworks
These areas are expected to be further developed in subsequent FANN
standardization efforts.
Appendix B. References
B.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119,
March 1997.
B.2. Informative References
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI
10.17487/RFC3168, September 2001.
[RFC9197] Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi, Ed.,
"Data Fields for In Situ Operations, Administration, and Maintenance
(IOAM)", RFC 9197, DOI 10.17487/RFC9197, May 2022.
[RFC5880] Katz, D. and D. Ward, "Bidirectional Forwarding Detection
(BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010.
Authors' Addresses
Liu Sang
China Academy of Information and Communications Technology
China
Email: sangliu@caict.ac.cn
Xuesong Geng
Huawei Technologies
China
Email: gengxuesong@huawei.com
Huan Deng
China Telecom
China
Email: denghuan@chinatelecom.cn