Skip to main content

Topology-Aware Construction of Collective Communication over Time-Expanded Networks
draft-wang-topology-aware-collective-communication-00

Document Type Active Internet-Draft (individual)
Authors ZIAN WANG , Qianqiumin Sun , Hui Gao
Last updated 2026-05-24
RFC stream (None)
Intended RFC status (None)
Formats
Stream Stream state (No stream defined)
Consensus boilerplate Unknown
RFC Editor Note (None)
IESG IESG state I-D Exists
Telechat date (None)
Responsible AD (None)
Send notices to (None)
draft-wang-topology-aware-collective-communication-00
Internet Engineering Task Force                                  Z. Wang
Internet-Draft                                                    Q. Sun
Intended status: Informational                                    H. Gao
Expires: 14 November 2026                                           BUPT
                                                             13 May 2026

   Topology-Aware Construction of Collective Communication over Time-
                           Expanded Networks
         draft-wang-topology-aware-collective-communication-00

Abstract

   This document describes a topology-aware method for constructing
   collective communication schedules in distributed systems.  Instead
   of selecting from a small set of predefined communication algorithms,
   the method expands a target network topology along a time dimension,
   tracks per-node data state, and incrementally builds a schedule
   through candidate-source discovery and link-to-chunk matching.  The
   approach is intended for heterogeneous or asymmetric topologies in
   which fixed communication patterns often underutilize available links
   or create avoidable bottlenecks.  According to the source material,
   the resulting schedule is intended for collective communication tasks
   involving data distribution, aggregation, reduction, and
   synchronization, including all-gather, reduce-scatter, and all-
   reduce.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 14 November 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Wang, et al.            Expires 14 November 2026                [Page 1]
Internet-Draft   Topology-Aware Collective Communication        May 2026

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Design Goals  . . . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Terminology and Model . . . . . . . . . . . . . . . . . . . .   4
   5.  Construction Framework  . . . . . . . . . . . . . . . . . . .   4
     5.1.  Input Topology and Communication Objective  . . . . . . .   5
     5.2.  Chunk Partitioning and State Initialization . . . . . . .   5
     5.3.  Time-Expanded Topology  . . . . . . . . . . . . . . . . .   5
     5.4.  Candidate Source Discovery  . . . . . . . . . . . . . . .   5
     5.5.  Link-to-Chunk Matching  . . . . . . . . . . . . . . . . .   6
     5.6.  State Update and Iterative Expansion  . . . . . . . . . .   6
     5.7.  Resulting Schedule  . . . . . . . . . . . . . . . . . . .   6
   6.  Applicability to Collective Operations  . . . . . . . . . . .   6
   7.  Execution Considerations  . . . . . . . . . . . . . . . . . .   7
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   9.  Security Considerations . . . . . . . . . . . . . . . . . . .   7

1.  Introduction

   Large-scale distributed training systems rely heavily on collective
   communication.  During training, nodes repeatedly exchange model
   parameters, gradients, or intermediate results.  As model size and
   cluster size increase, the communication subsystem becomes a major
   factor in overall job completion time.

   Many existing implementations choose a communication procedure from a
   predefined algorithm library.  This works reasonably well on regular
   topologies, but it becomes less effective when the underlying network
   contains heterogeneous links, asymmetric connectivity, or multi-level
   structure.  In such environments, a fixed algorithm can overload some
   links while leaving others underused.

   Other approaches attempt to generate schedules through global
   optimization or exhaustive search.  These methods can produce good
   results for some inputs, but their construction cost often grows
   quickly with the number of nodes and links, which makes them harder
   to use in large systems.

Wang, et al.            Expires 14 November 2026                [Page 2]
Internet-Draft   Topology-Aware Collective Communication        May 2026

   This document describes a different construction framework.  It uses
   a time-expanded network to represent both topology and time
   evolution, models which data chunks are currently available at which
   nodes, and incrementally builds communication steps until the target
   communication state is reached.

2.  Problem Statement

   The construction method described in this document is motivated by
   five practical issues.

   1.  Predefined collective algorithms do not always fit a specific
       topology, especially when the topology is irregular.

   2.  Heterogeneous and asymmetric networks make it difficult to
       maintain both good link utilization and low completion time with
       a single fixed communication pattern.

   3.  Many implementations do not expose a unified model for data
       state, temporal progression, and link occupancy during schedule
       construction.

   4.  Globally optimized schedule generation can become expensive as
       system size increases.

   5.  A practical method should support multiple collective
       communication modes without being tied to a single handcrafted
       algorithm family.

3.  Design Goals

   The source material indicates that the construction method is
   intended to meet the following goals:

   *  It is intended to account for topology details, including directed
      links, asymmetric connectivity, and bandwidth differences.

   *  It is intended to build a schedule for the given topology, rather
      than only choosing from a fixed library of existing algorithms.

   *  It is intended to represent communication progress as an explicit
      state transition process.

   *  It is intended to expand the schedule progressively in time
      instead of requiring a fully fixed time horizon in advance.

   *  It is intended to remain applicable to several collective
      communication patterns.

Wang, et al.            Expires 14 November 2026                [Page 3]
Internet-Draft   Topology-Aware Collective Communication        May 2026

4.  Terminology and Model

   This section summarizes the key terms used by the construction
   method.

   Node
      A compute element or processing unit participating in collective
      communication.

   Directed Link
      A communication edge from one node to another.  Link direction
      matters when the physical or logical topology is asymmetric.

   Chunk
      The smallest schedulable unit obtained by splitting the data to be
      communicated.

   Precondition
      The set of chunks currently available at each node at a given time
      layer.

   Postcondition
      The target chunk distribution that defines completion of the
      collective task.

   Unsatisfied Target
      A node-chunk pair that is required by the postcondition but has
      not yet been satisfied in the current state.

   Candidate Source
      A node that already holds a required chunk and can potentially
      deliver it to a target node through an available link in the
      current time layer.

   Time-Expanded Network
      A representation in which the original topology is unfolded across
      discrete time layers so that communication opportunities and state
      transitions can be expressed in a single structure.

5.  Construction Framework

   The method takes as input a topology description, link attributes, a
   collective communication objective, and an initial data placement.
   It produces a communication schedule consisting of per-time-layer
   send and receive actions, together with the derived transfer paths of
   each chunk.

Wang, et al.            Expires 14 November 2026                [Page 4]
Internet-Draft   Topology-Aware Collective Communication        May 2026

5.1.  Input Topology and Communication Objective

   The constructor receives a set of nodes, a set of directed links, and
   link attributes including link bandwidth.  It also receives the
   target collective mode.  The source material explicitly mentions all-
   gather, reduce-scatter, and all-reduce, and more generally describes
   distribution, aggregation, reduction, and synchronization tasks.

   The communication objective is expressed as a desired postcondition
   over chunks.  The postcondition defines which nodes are expected to
   hold which chunks after the collective operation completes.

5.2.  Chunk Partitioning and State Initialization

   Before schedule construction begins, the payload is partitioned into
   chunks.  Each chunk is treated as an independent schedulable item.

   The constructor initializes a precondition describing which chunks
   are currently available at each node.  It also initializes the
   postcondition that describes the desired final state.  These two
   state descriptions provide the basis for deciding whether
   construction is complete and which transfers are still needed.

5.3.  Time-Expanded Topology

   The original topology is expanded into discrete time layers.  Each
   original node is represented by a sequence of node copies, one for
   each layer.  A directed link in the original topology becomes a
   temporal edge that carries a chunk from a source node in one layer to
   a destination node in the next layer.

   This representation captures spatial connectivity and temporal
   progression in one model.  The source material describes a layer-by-
   layer expansion process rather than requiring a fixed final time
   horizon in advance.

5.4.  Candidate Source Discovery

   For each unsatisfied target, the constructor searches for candidate
   sources that already hold the required chunk and can reach the
   destination through a valid temporal edge in the current layer.

   According to the source material, this process starts from the target
   node and traces reachable links in the current time layer to identify
   source nodes that already hold the required chunk.  The candidate set
   therefore reflects both current chunk availability and current
   temporal connectivity.

Wang, et al.            Expires 14 November 2026                [Page 5]
Internet-Draft   Topology-Aware Collective Communication        May 2026

5.5.  Link-to-Chunk Matching

   After candidate sources have been identified, the constructor selects
   feasible transfers for the current time layer.  A feasible transfer
   binds one chunk to one directed link from one source node to one
   destination node.

   According to the source material, valid matching respects link
   direction and only uses links that are not already occupied in the
   current layer.  The source material also notes that alternative
   embodiments may consider path length, node load, or link utilization
   when adjusting matching decisions.

5.6.  State Update and Iterative Expansion

   Once transfers have been selected for the current layer, the
   constructor updates node state for the next layer.  Any chunk that is
   successfully delivered becomes part of the receiving node's
   precondition in the following layer.

   If unsatisfied targets remain after the update, the constructor
   extends the time-expanded network and repeats candidate discovery and
   link-to-chunk matching.  This process continues until the
   postcondition is fully satisfied.

5.7.  Resulting Schedule

   The final output is a topology-aware communication schedule.
   According to the source material, the output includes the transfer
   path of each chunk, the send/receive relationships in each time
   layer, and the complete collective communication plan.

6.  Applicability to Collective Operations

   The same construction framework can be applied to several collective
   communication patterns by changing the initial and target state
   definitions.

   *  In all-gather, each node starts with a subset of chunks, and the
      postcondition requires every node to hold the union of all chunks.

   *  In reduce-scatter, chunks are combined according to a reduction
      rule and the postcondition assigns portions of the reduced result
      to different nodes.

   *  In all-reduce, the framework can be viewed as a reduce-scatter
      phase followed by an all-gather phase, or as another equivalent
      state formulation.

Wang, et al.            Expires 14 November 2026                [Page 6]
Internet-Draft   Topology-Aware Collective Communication        May 2026

7.  Execution Considerations

   The source material explicitly lists the node set, link set, link
   direction, link bandwidth, and target collective mode as inputs to
   schedule construction.  It also describes the generated schedule as
   an output that can later be read and executed by sending, receiving,
   and processing data according to the per-layer communication
   arrangement.

   The source material further states that, after construction, the
   resulting communication schedule can be executed layer by layer until
   the target completion state is reached, and the final result can then
   be supplied to later training, scheduling, or control modules.

8.  IANA Considerations

   This document includes no request to IANA.

9.  Security Considerations

   This document describes a schedule construction method and does not
   define a new wire protocol or a new security mechanism.

Wang, et al.            Expires 14 November 2026                [Page 7]