Topology-Aware Construction of Collective Communication over Time-Expanded Networks
draft-wang-topology-aware-collective-communication-00
This document is an Internet-Draft (I-D).
Anyone may submit an I-D to the IETF.
This I-D is not endorsed by the IETF and has no formal standing in the
IETF standards process.
| Field | Value |
|---|---|
| Document type | Active Internet-Draft (individual) |
| Authors | ZIAN WANG, Qianqiumin Sun, Hui Gao |
| Last updated | 2026-05-24 |
| RFC stream | (None) |
| Intended RFC status | (None) |
| Stream state | (No stream defined) |
| Consensus boilerplate | Unknown |
| RFC Editor Note | (None) |
| IESG state | I-D Exists |
| Telechat date | (None) |
| Responsible AD | (None) |
| Send notices to | (None) |
Internet Engineering Task Force Z. Wang
Internet-Draft Q. Sun
Intended status: Informational H. Gao
Expires: 14 November 2026 BUPT
13 May 2026
Topology-Aware Construction of Collective Communication over Time-
Expanded Networks
draft-wang-topology-aware-collective-communication-00
Abstract
This document describes a topology-aware method for constructing
collective communication schedules in distributed systems. Instead
of selecting from a small set of predefined communication algorithms,
the method expands a target network topology along a time dimension,
tracks per-node data state, and incrementally builds a schedule
through candidate-source discovery and link-to-chunk matching. The
approach is intended for heterogeneous or asymmetric topologies in
which fixed communication patterns often underutilize available links
or create avoidable bottlenecks. According to the source material,
the resulting schedule is intended for collective communication tasks
involving data distribution, aggregation, reduction, and
synchronization, including all-gather, reduce-scatter, and all-
reduce.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 14 November 2026.
Copyright Notice
Copyright (c) 2026 IETF Trust and the persons identified as the
document authors. All rights reserved.
Wang, et al. Expires 14 November 2026 [Page 1]
Internet-Draft Topology-Aware Collective Communication May 2026
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Problem Statement
3. Design Goals
4. Terminology and Model
5. Construction Framework
5.1. Input Topology and Communication Objective
5.2. Chunk Partitioning and State Initialization
5.3. Time-Expanded Topology
5.4. Candidate Source Discovery
5.5. Link-to-Chunk Matching
5.6. State Update and Iterative Expansion
5.7. Resulting Schedule
6. Applicability to Collective Operations
7. Execution Considerations
8. IANA Considerations
9. Security Considerations
1. Introduction
Large-scale distributed training systems rely heavily on collective
communication. During training, nodes repeatedly exchange model
parameters, gradients, or intermediate results. As model size and
cluster size increase, the communication subsystem becomes a major
factor in overall job completion time.
Many existing implementations choose a communication procedure from a
predefined algorithm library. This works reasonably well on regular
topologies, but it becomes less effective when the underlying network
contains heterogeneous links, asymmetric connectivity, or multi-level
structure. In such environments, a fixed algorithm can overload some
links while leaving others underused.
Other approaches attempt to generate schedules through global
optimization or exhaustive search. These methods can produce good
results for some inputs, but their construction cost often grows
quickly with the number of nodes and links, which makes them harder
to use in large systems.
This document describes a different construction framework. It uses
a time-expanded network to represent both topology and time
evolution, models which data chunks are currently available at which
nodes, and incrementally builds communication steps until the target
communication state is reached.
2. Problem Statement
The construction method described in this document is motivated by
five practical issues.
1. Predefined collective algorithms do not always fit a specific
topology, especially when the topology is irregular.
2. Heterogeneous and asymmetric networks make it difficult to
maintain both good link utilization and low completion time with
a single fixed communication pattern.
3. Many implementations do not expose a unified model for data
state, temporal progression, and link occupancy during schedule
construction.
4. Globally optimized schedule generation can become expensive as
system size increases.
5. A practical method should support multiple collective
communication modes without being tied to a single handcrafted
algorithm family.
3. Design Goals
According to the source material, the construction method is intended
to meet the following goals:
* Account for topology details, including directed links, asymmetric
connectivity, and bandwidth differences.
* Build a schedule for the given topology, rather than only choosing
from a fixed library of existing algorithms.
* Represent communication progress as an explicit state transition
process.
* Expand the schedule progressively in time instead of requiring a
fully fixed time horizon in advance.
* Remain applicable to several collective communication patterns.
4. Terminology and Model
This section summarizes the key terms used by the construction
method.
Node
A compute element or processing unit participating in collective
communication.
Directed Link
A communication edge from one node to another. Link direction
matters when the physical or logical topology is asymmetric.
Chunk
The smallest schedulable unit obtained by splitting the data to be
communicated.
Precondition
The set of chunks currently available at each node at a given time
layer.
Postcondition
The target chunk distribution that defines completion of the
collective task.
Unsatisfied Target
A node-chunk pair that is required by the postcondition but has
not yet been satisfied in the current state.
Candidate Source
A node that already holds a required chunk and can potentially
deliver it to a target node through an available link in the
current time layer.
Time-Expanded Network
A representation in which the original topology is unfolded across
discrete time layers so that communication opportunities and state
transitions can be expressed in a single structure.
5. Construction Framework
The method takes as input a topology description, link attributes, a
collective communication objective, and an initial data placement.
It produces a communication schedule consisting of per-time-layer
send and receive actions, together with the derived transfer paths of
each chunk.
5.1. Input Topology and Communication Objective
The constructor receives a set of nodes, a set of directed links, and
link attributes including link bandwidth. It also receives the
target collective mode. The source material explicitly mentions all-
gather, reduce-scatter, and all-reduce, and more generally describes
distribution, aggregation, reduction, and synchronization tasks.
The communication objective is expressed as a desired postcondition
over chunks. The postcondition defines which nodes are expected to
hold which chunks after the collective operation completes.
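The inputs above can be sketched in Python as follows. This is only
an illustration: the Link class, its field names, the three-node
topology, and the bandwidth values are hypothetical and do not come
from the source material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A directed link with a bandwidth attribute (hypothetical unit)."""
    src: int
    dst: int
    bandwidth: float


# Hypothetical asymmetric topology: node 2 can send to node 0,
# but node 0 has no direct link to node 2.
nodes = [0, 1, 2]
links = [
    Link(0, 1, 100.0), Link(1, 0, 100.0),
    Link(1, 2, 25.0), Link(2, 1, 25.0),
    Link(2, 0, 50.0),
]


def all_gather_postcondition(nodes, chunks):
    """Postcondition for all-gather: every node holds every chunk."""
    return {n: set(chunks) for n in nodes}


post = all_gather_postcondition(nodes, ["c0", "c1", "c2"])
directed_pairs = {(l.src, l.dst) for l in links}
```

Because links are stored as ordered (src, dst) pairs, an asymmetric
topology needs no special handling: a missing reverse link is simply
absent from the set.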
5.2. Chunk Partitioning and State Initialization
Before schedule construction begins, the payload is partitioned into
chunks. Each chunk is treated as an independent schedulable item.
The constructor initializes a precondition describing which chunks
are currently available at each node. It also initializes the
postcondition that describes the desired final state. These two
state descriptions provide the basis for deciding whether
construction is complete and which transfers are still needed.
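A minimal Python sketch of partitioning and state initialization is
shown below. The function names, the near-equal contiguous split, and
the use of integer chunk identifiers are assumptions made for
illustration. The source material does not specify a partitioning
rule.

```python
def partition(payload, num_chunks):
    """Split payload into num_chunks nearly equal contiguous chunks."""
    base, extra = divmod(len(payload), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(payload[start:start + size])
        start += size
    return chunks


def unsatisfied_targets(pre, post):
    """Node-chunk pairs required by the postcondition but not yet held."""
    return {(n, c) for n, need in post.items()
            for c in need - pre.get(n, set())}


chunks = partition(b"abcdefghij", 3)

# All-gather style placement: node i initially holds chunk i.
pre = {0: {0}, 1: {1}, 2: {2}}
post = {n: {0, 1, 2} for n in pre}

pending = unsatisfied_targets(pre, post)
complete = not pending  # construction is complete when nothing is pending
```

Comparing the two state descriptions this way yields both the
completion test and the list of transfers still needed.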
5.3. Time-Expanded Topology
The original topology is expanded into discrete time layers. Each
original node is represented by a sequence of node copies, one for
each layer. A directed link in the original topology becomes a
temporal edge that carries a chunk from a source node in one layer to
a destination node in the next layer.
This representation captures spatial connectivity and temporal
progression in one model. The source material describes a layer-by-
layer expansion process rather than requiring a fixed final time
horizon in advance.
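The layer-by-layer expansion can be sketched as a structure that
appends one layer of temporal edges on demand. The class name, the
(node, layer) tuple encoding of node copies, and the method name are
hypothetical choices for this sketch.

```python
class TimeExpandedNetwork:
    """Unfolds a directed topology across discrete time layers."""

    def __init__(self, nodes, links):
        self.nodes = list(nodes)
        self.links = list(links)  # (src, dst) pairs of the original topology
        self.layers = []          # layers[t]: temporal edges leaving layer t

    def add_layer(self):
        """Append one layer; each link (u, v) becomes (u, t) -> (v, t + 1)."""
        t = len(self.layers)
        self.layers.append([((u, t), (v, t + 1)) for u, v in self.links])
        return t


ten = TimeExpandedNetwork(nodes=[0, 1, 2], links=[(0, 1), (1, 2)])
first = ten.add_layer()
second = ten.add_layer()
```

Layers are created only when requested, so no final time horizon has
to be fixed before construction starts.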
5.4. Candidate Source Discovery
For each unsatisfied target, the constructor searches for candidate
sources that already hold the required chunk and can reach the
destination through a valid temporal edge in the current layer.
According to the source material, this process starts from the target
node and traces reachable links in the current time layer to identify
source nodes that already hold the required chunk. The candidate set
therefore reflects both current chunk availability and current
temporal connectivity.
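One possible reading of this step is sketched below in Python. It
assumes that tracing is limited to a single hop, that is, to the
directed links that end at the target node. The function name and the
data layout are hypothetical.

```python
def candidate_sources(target, chunk, links, state):
    """Nodes that hold `chunk` and have a directed link into `target`.

    links: iterable of (src, dst) pairs usable in the current layer.
    state: dict mapping node -> set of chunks held in the current layer.
    """
    return sorted({u for (u, v) in links if v == target and chunk in state[u]})


links = [(0, 1), (2, 1), (1, 2)]
state = {0: {"a"}, 1: set(), 2: {"b"}}

sources_a = candidate_sources(1, "a", links, state)  # [0]
sources_b = candidate_sources(1, "b", links, state)  # [2]
no_holder = candidate_sources(2, "a", links, state)  # [] (node 1 lacks "a")
no_link = candidate_sources(0, "b", links, state)    # [] (no link into node 0)
```

A node appears in the result only if it both holds the chunk and has
a usable link into the target.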
5.5. Link-to-Chunk Matching
After candidate sources have been identified, the constructor selects
feasible transfers for the current time layer. A feasible transfer
binds one chunk to one directed link from one source node to one
destination node.
According to the source material, valid matching respects link
direction and only uses links that are not already occupied in the
current layer. The source material also notes that alternative
embodiments may consider path length, node load, or link utilization
when adjusting matching decisions.
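The following Python sketch shows one simple way to perform the
matching. It uses a greedy first-fit rule and assumes that each
directed link carries at most one chunk per layer. Both are
assumptions of this sketch, not rules stated by the source material,
and the greedy order is not claimed to be optimal.

```python
def match_layer(pending, links, state):
    """Greedily bind at most one chunk to each free directed link.

    pending: set of (dst, chunk) unsatisfied targets.
    links:   list of (src, dst) directed links.
    state:   dict mapping node -> set of chunks held in the current layer.
    Returns a list of (src, dst, chunk) transfers for this layer.
    """
    occupied = set()
    transfers = []
    for dst, chunk in sorted(pending):
        for src, v in links:
            if v == dst and chunk in state[src] and (src, dst) not in occupied:
                occupied.add((src, dst))
                transfers.append((src, dst, chunk))
                break
    return transfers


links = [(0, 1), (1, 0), (1, 2), (2, 1)]
state = {0: {"a"}, 1: {"b"}, 2: {"c"}}
pending = {(n, c) for n in state for c in "abc" if c not in state[n]}

transfers = match_layer(pending, links, state)
```

A tie-breaking rule based on path length, node load, or link
utilization could replace the sorted iteration order without changing
the occupancy check.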
5.6. State Update and Iterative Expansion
Once transfers have been selected for the current layer, the
constructor updates node state for the next layer. Any chunk that is
successfully delivered becomes part of the receiving node's
precondition in the following layer.
If unsatisfied targets remain after the update, the constructor
extends the time-expanded network and repeats candidate discovery and
link-to-chunk matching. This process continues until the
postcondition is fully satisfied.
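The iterative process can be sketched as a loop that alternates
matching and state update. The sketch below is self-contained and
relies on several simplifying assumptions: single-hop candidate
discovery, greedy matching, one chunk per link per layer, and no
relaying through nodes that do not themselves need a chunk. The
function name and the error raised when no progress is possible are
hypothetical.

```python
def build_schedule(links, pre, post):
    """Return a list of layers, each a list of (src, dst, chunk) transfers."""
    state = {n: set(chunks) for n, chunks in pre.items()}
    schedule = []
    while True:
        pending = sorted((n, c) for n in post for c in post[n] - state[n])
        if not pending:
            return schedule
        occupied, layer = set(), []
        for dst, chunk in pending:
            for src, v in links:
                if v == dst and chunk in state[src] and (src, dst) not in occupied:
                    occupied.add((src, dst))
                    layer.append((src, dst, chunk))
                    break
        if not layer:
            # No transfer is possible, so the state can never change again.
            raise ValueError("postcondition unreachable under this sketch")
        # Delivered chunks become available only in the next layer.
        for _, dst, chunk in layer:
            state[dst].add(chunk)
        schedule.append(layer)


# All-gather on a hypothetical unidirectional ring 0 -> 1 -> 2 -> 0.
ring = [(0, 1), (1, 2), (2, 0)]
pre = {0: {0}, 1: {1}, 2: {2}}
post = {n: {0, 1, 2} for n in pre}
schedule = build_schedule(ring, pre, post)
```

On this three-node ring the loop stops after two layers, which matches
the step count of a conventional ring all-gather.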
5.7. Resulting Schedule
The final output is a topology-aware communication schedule.
According to the source material, the output includes the transfer
path of each chunk, the send/receive relationships in each time
layer, and the complete collective communication plan.
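As an illustration, the per-chunk transfer paths and the per-layer
send/receive relationships can both be derived from a list of layers.
The schedule format (a list of layers, each a list of (src, dst,
chunk) tuples), the function names, and the example schedule are
assumptions of this sketch.

```python
from collections import defaultdict


def chunk_paths(schedule):
    """Map each chunk to its hops as (layer, src, dst), in time order."""
    paths = defaultdict(list)
    for t, layer in enumerate(schedule):
        for src, dst, chunk in layer:
            paths[chunk].append((t, src, dst))
    return dict(paths)


def node_actions(schedule, node):
    """Per-layer sends and receives for one node."""
    actions = []
    for layer in schedule:
        sends = [(dst, c) for src, dst, c in layer if src == node]
        recvs = [(src, c) for src, dst, c in layer if dst == node]
        actions.append({"send": sends, "recv": recvs})
    return actions


# Hypothetical two-layer all-gather schedule on a ring 0 -> 1 -> 2 -> 0.
schedule = [
    [(2, 0, 2), (0, 1, 0), (1, 2, 1)],
    [(2, 0, 1), (0, 1, 2), (1, 2, 0)],
]
paths = chunk_paths(schedule)
node0 = node_actions(schedule, 0)
```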
6. Applicability to Collective Operations
The same construction framework can be applied to several collective
communication patterns by changing the initial and target state
definitions.
* In all-gather, each node starts with a subset of chunks, and the
postcondition requires every node to hold the union of all chunks.
* In reduce-scatter, chunks are combined according to a reduction
rule and the postcondition assigns portions of the reduced result
to different nodes.
* All-reduce can be expressed as a reduce-scatter phase followed by
an all-gather phase, or as another equivalent state formulation.
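The state definitions for the three collectives can be sketched as
follows. The encoding is an assumption made for illustration: a
reduced item is written as (chunk index, set of contributing nodes),
so a fully reduced chunk is one whose contributor set contains every
node. The source material does not prescribe this representation.

```python
def all_gather_states(nodes):
    """Node i starts with chunk i; every node must end with all chunks."""
    pre = {n: {n} for n in nodes}
    post = {n: set(nodes) for n in nodes}
    return pre, post


def reduce_scatter_states(nodes):
    """Each node starts with its own contribution to every chunk;
    node i must end with chunk i reduced over all contributors."""
    full = frozenset(nodes)
    pre = {n: {(i, frozenset({n})) for i in nodes} for n in nodes}
    post = {n: {(n, full)} for n in nodes}
    return pre, post


def all_reduce_states(nodes):
    """Same start as reduce-scatter; every node must end with every
    fully reduced chunk."""
    full = frozenset(nodes)
    pre, _ = reduce_scatter_states(nodes)
    post = {n: {(i, full) for i in nodes} for n in nodes}
    return pre, post


nodes = [0, 1, 2]
ag_pre, ag_post = all_gather_states(nodes)
rs_pre, rs_post = reduce_scatter_states(nodes)
ar_pre, ar_post = all_reduce_states(nodes)
```

Under this encoding, the reduce-scatter postcondition is exactly the
starting state of an all-gather over the reduced chunks, which is
consistent with the two-phase view of all-reduce.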
7. Execution Considerations
The source material explicitly lists the node set, link set, link
direction, link bandwidth, and target collective mode as inputs to
schedule construction. It also describes the generated schedule as
an output that can later be read and executed by sending, receiving,
and processing data according to the per-layer communication
arrangement.
The source material further states that, after construction, the
resulting communication schedule can be executed layer by layer until
the target completion state is reached, and the final result can then
be supplied to later training, scheduling, or control modules.
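Layer-by-layer execution can be illustrated with an in-memory
simulation. This sketch replaces real sends and receives with
dictionary updates. The buffer layout, the function name, and the
example schedule are hypothetical, and the reduction processing that
reduce-scatter or all-reduce would require is not modeled.

```python
def execute(schedule, buffers):
    """Apply each layer's transfers to buffers (node -> {chunk: data})."""
    for layer in schedule:
        # Read every payload before writing, so a chunk received in
        # this layer is not forwarded within the same layer.
        in_flight = [(dst, chunk, buffers[src][chunk])
                     for src, dst, chunk in layer]
        for dst, chunk, data in in_flight:
            buffers[dst][chunk] = data
    return buffers


# Hypothetical two-layer all-gather schedule on a ring 0 -> 1 -> 2 -> 0.
schedule = [
    [(2, 0, 2), (0, 1, 0), (1, 2, 1)],
    [(2, 0, 1), (0, 1, 2), (1, 2, 0)],
]
buffers = {0: {0: b"A"}, 1: {1: b"B"}, 2: {2: b"C"}}
result = execute(schedule, buffers)
reached = all(set(held) == {0, 1, 2} for held in result.values())
```

After the last layer, comparing each node's buffer contents with the
postcondition confirms that the target completion state was reached.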
8. IANA Considerations
This document includes no request to IANA.
9. Security Considerations
This document describes a schedule construction method and does not
define a new wire protocol or a new security mechanism.