RTG WG                                                       K. Kompella
Internet-Draft                                              V. P. Beeram
Intended status: Informational                          Juniper Networks
Expires: 6 May 2026                                            A. Mahale
                                                        Cerebras Systems
                                                             R. Bhargava
                                                                  Crusoe
                                                         2 November 2025

       Scheduling Network Resources for Machine Learning Clusters
                   draft-kompella-rtgwg-mlnwsched-01

Abstract

   Large Language Models (LLMs) are pushing the boundaries of
   technology.  The scale they have reached vastly exceeds the capacity
   of any single compute unit (XPU); this requires a distributed
   approach where multiple XPUs are connected via a "backend" network,
   sometimes in a single data center, but increasingly across multiple
   data centers connected by a "data center interconnect" (DCI)
   network.  Training and inferencing are expensive and critical
   operations; thus they are typically scheduled, i.e., the (compute)
   resources they need are carefully estimated, allocated and deployed
   so that these resources are efficiently used.  However, while the
   compute investment in these LLM processing clusters dwarfs that of
   the networks, it is becoming increasingly clear that the latter can
   greatly impact the former.  This has been the focus of recent
   conferences, including the fantel Birds of a Feather meeting at
   IETF 123, @Scale: Networking 2025 and Open Compute Project 2025.

   This memo proposes that the same care that is taken regarding
   allocation of compute resources to jobs be taken with networking
   resources: that they are estimated, allocated and deployed alongside
   compute resources; that they have contingency plans in case of
   network glitches; and that a holistic view be taken in order to
   optimize job completion times of training and inferencing jobs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Kompella, et al.           Expires 6 May 2026                   [Page 1]
Internet-Draft                 ML NW sched                 November 2025

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 May 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
       1.1.1.  Definition of Commonly Used Terms
   2.  Problem Statement
     2.1.  Collective Operation
     2.2.  Compute Scheduling
     2.3.  Network Scheduling
       2.3.1.  Traffic Engineering
       2.3.2.  Multipathing
     2.4.  Comparing Compute and Network Scheduling Features
     2.5.  Back to the Problem
   3.  Proposal
   4.  Conclusion
   5.  IANA Considerations
   6.  Security Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Introduction

   Large Language Models (LLMs) are pushing the industry to ever greater
   scale, both in training and in inference.  This leads to more
   critical use of backend networks and a higher stake in producing
   timely results.  A major learning from recent work is that the
   network cannot be taken for granted: a dropped or delayed packet can
   delay, stall or even abort a Machine Learning (ML) job, requiring
   more effort in checkpointing and managing job restarts, dealing with
   network congestion, and dealing with network failures.  The problems
   are exacerbated in multi-tenant clusters where multiple jobs are run
   and job isolation becomes a key requirement.  The fantel Birds of a
   Feather meeting (BoF) illustrated well the role the network plays in
   ML jobs, the potential for network events to disrupt jobs, and some
   early thoughts on how to handle these events.  While the BoF was very
   successful in exposing these issues, we believe that adding a
   proactive approach would be beneficial; this can go hand in hand with
   the reactive approach of dealing effectively with network events.

   This memo proposes that network resources be reserved/scheduled in
   coordination with the ML job scheduler, which is responsible for
   reserving compute resources (Central Processing Units [CPUs],
   Graphics Processing Units [GPUs], XPUs, memory, storage, ...).  This
   is especially useful when multiple jobs run in each cluster, as with
   GPUaaS (GPU as a Service), several simultaneous inference jobs, or
   multi-tenancy.  Reserving network resources reduces the probability
   of some disruptive network events and improves job isolation.  This
   is the network analogy of reserving compute resources and ideally
   can be done at the same time.
   Essentially, when an ML job is scheduled, the "size" of the job
   (type of model, complexity of model, number of parameters, etc.)
   determines how many CPU/GPU/XPU cores are needed and how much memory
   and storage is needed; typically, the same parameters determine the
   amount of network resources needed during different collective
   (i.e., inter-XPU) communication stages (Broadcast, AllReduce,
   Reduce, etc.).  Job placement (i.e., which XPUs to allocate for this
   job) also determines the source(s) and destination(s) of the
   communication.  If, at the time the job is scheduled, network
   resources are also reserved (and potentially, backup resources are
   put in place), the probability that network events can disrupt the
   job is reduced (although not eliminated).  One can also set up the
   communication pathway and reserve resources when a collective
   communications API call ([MPI] or [NCCL] or the like) is made; this
   is especially relevant for long-running jobs where the time between
   communication phases can be long, and the phases vary from (say)
   Broadcast to AllReduce to quiescent.  Finally, if backup pathways
   for a given communication are set up, traffic can quickly be
   protected when a failure happens; in parallel, the sources can be
   notified of the failure and can reduce the traffic they send, build
   new end-to-end pathways, or otherwise handle the failure.

   The previous paragraph suggests a proactive methodology.  Fast
   congestion notification and signaling constitutes a reactive
   methodology.  These fit well together.  One can couple network
   resource scheduling with fast event detection, signaling and
   mitigation for an overall much-reduced impact of network events on
   job progress.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.1.1.  Definition of Commonly Used Terms

   This section provides definitions for terms and abbreviations that
   are used in this memo.

   XPU:  one of several types of processing units: central processing
      unit (CPU), graphics processing unit (GPU), language processing
      unit (LPU), tensor processing unit (TPU) and the like.  They fall
      under the category of "compute resources".

   TE:  traffic engineering, a technology that allows the specification
      of constraints (such as "admin groups" or colors) to guide the
      layout of traffic paths (tunnels) in a network.

   phop:  previous hop (of N), a node and link that feeds into junction
      N.

   nhop:  next hop (of N), a node that is fed by N over a specified
      link.

   MPTE:  multipath TE, a technology that combines all the features of
      TE while offering multipathing with weighted load balancing for
      unicast traffic

   MCTE:  multicast TE, a technology that combines all the features of
      TE with load balancing for multicast traffic

   ML:  machine learning, a powerful technique to learn from data
      without explicit programming, used to solve problems of AI.

   junction:  a node in a DAG, with 0 or more phops, and 0 or more
      nhops.  A junction with 0 phops is an ingress; a junction with 0
      nhops is an egress.  Other junctions are transit.  A junction may
      be a unicast or a multicast junction.  A DAG must have 1 or more
      ingresses, 1 or more egresses, and 0 or more transit junctions.

   DSF:  disaggregated scheduled fabric, a methodology for packet
      spraying in networks with multipathing.

   DCI:  data center interconnect

   DAG:  directed acyclic graph

2.  Problem Statement

   Consider the ML cluster shown in Figure 1:

           S1         .... S2
         / ...\.......   /    \      Note: L1 & L2 are connected to S2;
       L1..    L2      L3      L4          L3 & L4 are connected to S1.
      /  \    /  \    /  \    /  \   All links are 400G links.
     X1  X2  X3  X4  X5  X6  X7  X8

                           Figure 1: ML Cluster 1

   The bottom layer consists of XPUs X1 through X8.  The next layer up
   consists of "leaf" switches L1 through L4.  The top layer consists of
   "spine" switches S1 and S2.  All links between layers are 400Gbps;
   thus there is no oversubscription in the network, provided:

   1.  All XPUs are well-behaved.

   2.  All switches load balance fairly and perfectly.

   However, "fair" load balancing is insufficient unless the load
   balancing is done on a per-packet (or better, per-cell) basis
   ("packet spraying") [DSF].  If load balancing is done on a per-flow
   basis ("flow level multipathing"), it is highly unlikely to be
   perfectly balanced across the next hops, in which case one next hop
   may see too much traffic, leading to congestion, packet delays or
   even packet drops.  Disaggregated Scheduled Fabric (DSF) uses per-
   packet or per-cell load balancing, but it comes at a cost, and may
   not scale (and scale is a big consideration in these networks).

   With flow level multipathing, say X1 and X2 are both sending 400G of
   traffic to L1.  L1 tries to load balance X1's traffic to S1 and S2
   (in principle, 200G each).  In practice, that may turn out to be 220G
   to S1 and 180G to S2.  L1 does the same with X2's traffic; let's say
   this goes 190G to S1 and 210G to S2.  The L1-S1 link will be
   congested, with 410G of traffic.
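
   The arithmetic above can be reproduced with a toy model of flow-
   level hashing.  This is an illustrative sketch only: the function
   name, the uniform per-flow rates and the coin-flip uplink choice are
   our assumptions, not a model of any real switch hash.

```python
import random

def flow_level_split(num_flows, total_gbps, seed=0):
    # Toy 2-way "ECMP": each flow is hashed (here, a coin flip) to one
    # of two uplinks; flows carry equal shares of the offered load.
    rng = random.Random(seed)
    per_flow = total_gbps / num_flows
    loads = [0.0, 0.0]
    for _ in range(num_flows):
        loads[rng.randrange(2)] += per_flow
    return loads

# X1 and X2 each offer 400G through L1; the ideal split is 200G per
# uplink per source, but hashed splits drift, and the drifts can add
# up (e.g., 220G + 190G = 410G on L1-S1, as in the text).
s1 = s2 = 0.0
for src in range(2):
    a, b = flow_level_split(num_flows=16, total_gbps=400, seed=src)
    s1, s2 = s1 + a, s2 + b
print(f"L1->S1: {s1:.0f}G, L1->S2: {s2:.0f}G (capacity 400G each)")
```

   Per-packet or per-cell spraying corresponds to num_flows growing
   very large, where the drift vanishes.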

   On the "downward" side (traffic going to the XPUs), there can be an
   "in-cast" problem: say both X1 and X3 are sending traffic to X6.  In
   the worst case, each sends 400G for a total of 800G to X6, but the
   L3-X6 link can only transmit 400G.  Thus, half the traffic will be
   dropped.

   If the entire cluster (here, XPUs X1 through X8) is working on a
   single ML job, things are a bit simpler (but the issues remain).
   However, if this cluster is used for inferencing, or multi-tenant
   workloads, additional considerations arise.  Tenant 1 (or inferencing
   job 1) (T1) may be using XPU X1 and part of X6; tenant 2 (or job 2)
   (T2) may be using XPU X3 and another part of X6.

   If T1 and T2 simultaneously require communication to X6, there could
   be contention for the L3-X6 link.  Again, this could lead to
   congestion, and hence delayed or dropped packets.  But now, the issue
   is inter-tenant.

   As stated in the Introduction (Section 1), such delayed or dropped
   packets can have big consequences for the jobs that are running.
   Issues such as these are the motivation for DSF, packet spraying and
   fast congestion notification.

2.1.  Collective Operation

   Collective operations [CO] are used in distributed computing for the
   participating compute entities to exchange information.  One example
   is the Message Passing Interface [MPI]; others are the NVIDIA
   Collective Communications Library [NCCL] and the ROCm Communication
   Collectives Library [RCCL].  These are used by the compute entities
   in a deep learning cluster to send information to each other, or as a
   group.

   Collective operations include both unicast and multicast
   communications.  Thus, in scheduling network resources, both patterns
   should be covered.
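
   To make the resource question concrete, the standard ring-AllReduce
   cost model gives the per-link demand of one collective step.  This
   memo does not prescribe any particular algorithm; ring AllReduce is
   used here purely as a well-known example.

```python
def ring_allreduce_bytes_per_link(n_nodes, payload_bytes):
    # Ring AllReduce: (n-1) reduce-scatter steps plus (n-1) all-gather
    # steps, each moving payload/n bytes per node, so each node sends
    # 2 * (n - 1) / n * payload bytes over its ring link in total.
    return 2 * (n_nodes - 1) / n_nodes * payload_bytes

# 8 XPUs exchanging a 1 GiB gradient: each sends 1.75 GiB on its link.
print(ring_allreduce_bytes_per_link(8, 2**30) / 2**30)  # 1.75
```

   An estimate like this, combined with job placement, is what a
   network scheduler would turn into a bandwidth reservation.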

2.2.  Compute Scheduling

   In shared compute environments, such as a compute cluster or a cloud,
   a scheduler is commonly used to orchestrate access to compute
   resources.  SLURM [SLURM] is a commonly used scheduler in Linux
   clusters; its documentation says "First, [SLURM] allocates exclusive
   and/or non-exclusive access to resources (compute nodes) to users for
   some duration of time so they can perform work."  Another is KAI
   [KAI] which says "KAI Scheduler is a robust, efficient, and scalable
   Kubernetes scheduler that optimizes GPU resource allocation for AI
   and machine learning workloads."  There are several other schedulers
   in common use.

   A scheduler offers several features.  The following are taken from
   SLURM:

   1.  Accounting

   2.  Advanced reservation

   3.  Gang scheduling (time sharing for parallel jobs)

   4.  Backfill scheduling

   5.  Topology optimized resource selection

   6.  Resource limits by user or bank account

   7.  Sophisticated multifactor job prioritization algorithms

   KAI offers the following:

   1.   Batch Scheduling

   2.   Bin Packing & Spread Scheduling

   3.   Workload Priority

   4.   Hierarchical Queues

   5.   Resource distribution

   6.   Fairness Policies

   7.   Workload Consolidation

   8.   Elastic Workloads

   9.   Dynamic Resource Allocation (DRA)

   10.  GPU Sharing

   To summarize, a compute scheduler allows effective and optimal
   sharing of compute resources among multiple tenants and multiple
   jobs, while ensuring fairness, enforcing limits and enabling
   accounting.  Without a scheduler, multitenancy and multiple jobs
   would be impractical and chaotic.

   Note that multi-tenancy is implicit.  There may be ways to reserve
   resources for a particular tenant or group of tenants without
   allocating them, but the documentation doesn't say how.

2.3.  Network Scheduling

   In shared network environments (which almost all networks are), a
   scheduler can be used to orchestrate access to network resources --
   primarily bandwidth, but also highly prized links, QoS, etc.

   The primary task of network resource scheduling is to reserve
   resources along a pathway (tunnel) from one or more XPUs (ingresses)
   to another set of XPUs (egresses).  Note that the paradigm here is of
   uni-directional reservations; this is more general than bidirectional
   reservations, as the traffic requirements may not be symmetric.

   Given that X1 wants to send 20Gbps to {X2, X3, X4}, one would create
   a tunnel from X1 to {X2, X3, X4} with 20Gbps capacity.  Note that
   this traffic might be unicast (distributing different parts of a
   matrix to the recipients) or broadcast (distributing the same
   information to all).  If further, one wanted to use certain links
   exclusively, one can color links in the network and state that this
   tunnel must/must not use links of a certain color.  Thus, link
   coloring is a tool that network administrators can use to hold back
   links for a subset of job types.  The compute analogy would be to
   hold back some XPUs, mark them "blue" and allow only a subset of jobs
   to use those XPUs.

   Link coloring allows a provider to partition their network to
   optimally serve their customers.  While links in a Clos network (as
   most ML clusters are) are perfectly symmetrical, once one gets into
   "distributed clusters" that are connected via DCI links, link
   coloring and other link attributes will find greater use.

   Reserving bandwidth means that a particular job J1 (probably) won't
   step on another job J2's traffic.  Say J1 is using a tunnel T1 with a
   reservation of 20G, and J2 is using a tunnel T2 with a reservation of
   50G.  The reservation procedure ensures any links T1 and T2 traverse
   in common have sufficient bandwidth for both T1 and T2 (and any other
   tunnels with reservations).  Of course, J1 may use more than its
   allocated bandwidth; this can negatively impact J2.  To reduce/
   prevent this, one can apply a policer at the ingress of J1's tunnels
   to ensure that J1 sends no more than its allocated share over each
   tunnel.  This policer can drop traffic over the limit, or simply mark
   it as such, so that if the other jobs on a common link are not using
   their full quota, J1's traffic can go through.
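
   The reservation bookkeeping described above can be sketched in a few
   lines.  This is a minimal admission check, not an implementation of
   any signaling protocol; the class and link names are illustrative,
   and policing is outside the scope of the sketch.

```python
class LinkLedger:
    # Track reserved bandwidth per link; admit a tunnel only if every
    # link on its path has headroom for the requested rate.
    def __init__(self, capacity_gbps):
        self.capacity = dict(capacity_gbps)             # link -> Gbps
        self.reserved = {l: 0.0 for l in self.capacity}

    def admit(self, path, gbps):
        if any(self.reserved[l] + gbps > self.capacity[l] for l in path):
            return False                                # no headroom
        for l in path:
            self.reserved[l] += gbps
        return True

links = LinkLedger({"L1-S1": 400, "S1-L3": 400, "L3-X6": 400})
assert links.admit(["L1-S1", "S1-L3", "L3-X6"], 200)  # T1: X1 -> X6
assert links.admit(["L3-X6"], 200)                    # T2 shares L3-X6
assert not links.admit(["L3-X6"], 100)                # would oversubscribe
```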

   This last point is crucial for multi-tenancy.  A provider who cannot
   provide hard (or at least soft) guarantees to their customers that
   they will in fact get the resources they asked (and paid) for will
   soon be out of business.

   Elastic bandwidth is a very useful feature that goes along with
   elastic compute.  If a job's requirements are: start me off with 5
   XPUs, but expand that to 8 as the need arises, and shrink it back
   down to 5 when no longer needed, then the job's bandwidth
   requirements are likely to grow and shrink in tandem.  Thus, in
   addition to making binding reservations, one must be able to adjust
   those reservations as needs change.
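
   Elastic ("auto-bandwidth") adjustment is then a resize of an
   existing reservation: growth needs an admission check, while
   shrinking always succeeds.  A sketch under the same illustrative
   bookkeeping; real implementations resignal the tunnel rather than
   mutate a table in place.

```python
def resize(reserved, capacity, link, old_gbps, new_gbps):
    # Apply the delta of a reservation change on one link; admit
    # growth only if the delta still fits under the link capacity.
    delta = new_gbps - old_gbps
    if reserved[link] + delta > capacity[link]:
        return False
    reserved[link] += delta
    return True

reserved, capacity = {"L1-S1": 300.0}, {"L1-S1": 400.0}
assert resize(reserved, capacity, "L1-S1", 100, 150)      # grow by 50
assert not resize(reserved, capacity, "L1-S1", 150, 300)  # no headroom
assert resize(reserved, capacity, "L1-S1", 150, 100)      # shrink fits
```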

   Finally, not all jobs (and all customers) are created equal.
   Priority and preemption are powerful tools in schedulers to give
   preference to certain jobs over others.  Without these tools, a
   provider would be helpless if their cluster were overrun with low
   priority jobs.  In addition, it would be nice to have a graceful way
   of managing preemption.

2.3.1.  Traffic Engineering

   All the features mentioned in the last section are available today,
   in bandwidth-aware traffic engineering (TE).

   TE constraints allow a user to specify constraints on the path a
   tunnel will take.  These can include administrative groups (colors),
   shared risk link groups (SRLGs), TE metric, other metrics such as
   delay, bandwidth reservations, and many others.

   Bandwidth reservation allows the allocation of bandwidth resources to
   a tunnel.  Policers are a useful adjunct to enforce limits.

   Elastic bandwidth (aka "auto-bandwidth") allows a tunnel to
   dynamically adjust its reservations (within limits).

   Priority and preemption are implemented by all vendors.  Graceful
   preemption is possible using "soft preemption".

   New traffic engineering parameters such as available buffer space,
   available queue-pairs for communication, etc. will be introduced and
   discussed in a future version of this memo, as well as in companion
   documents.

2.3.2.  Multipathing

   There is one missing piece with "regular" TE: ML clusters (and Clos
   networks or fat trees in general) make heavy use of multipathing, and
   often have multiple ingresses and egresses for their communications.
   Current traffic engineering techniques focus on a single path from
   one ingress to one egress.  However, a new technique for multipath TE
   that allows for multiple ingresses and egresses and multiple paths
   between them is being developed that has relevance here
   [I-D.kompella-teas-mpte].

2.4.  Comparing Compute and Network Scheduling Features

   In this section, we look at compute scheduling features, and ask
   whether the corresponding feature exists in network scheduling.

   +=====================================+=============================+
   | SLURM - Compute Scheduling          | Network Scheduling (Feature |
   | Features                            | Availability)               |
   +=====================================+=============================+
   | Accounting                          | Yes                         |
   +-------------------------------------+-----------------------------+
   | Advanced reservation                | Yes (bandwidth calendaring) |
   +-------------------------------------+-----------------------------+
   | Gang scheduling                     | Yes (primary effort is on   |
   |                                     | compute)                    |
   +-------------------------------------+-----------------------------+
   | Backfill scheduling                 | N/A                         |
   +-------------------------------------+-----------------------------+
   | Topology optimized resource         | Yes                         |
   | selection                           |                             |
   +-------------------------------------+-----------------------------+
   | Resource limits by user or          | Yes (via controller policy) |
   | bank account                        | (enforcement via policers)  |
   +-------------------------------------+-----------------------------+
   | Sophisticated multifactor job       | No (maybe N/A)              |
   | prioritization algorithms           |                             |
   +-------------------------------------+-----------------------------+

              Table 1: Comparing SLURM and Network Scheduling

   +===================+==============================================+
   | KAI features      | Network Scheduling (Feature Availability)    |
   +===================+==============================================+
   | Batch Scheduling  | Yes (via multi-ingress/multi-egress tunnels) |
   +-------------------+----------------------------------------------+
   | Bin Packing &     | Yes ("least-fill", "max-fill")               |
   | Spread Scheduling |                                              |
   +-------------------+----------------------------------------------+
   | Workload Priority | Yes                                          |
   +-------------------+----------------------------------------------+
   | Hierarchical      | Yes (via QoS in the data plane)              |
   | Queues            |                                              |
   +-------------------+----------------------------------------------+
   | Resource          | Yes (via tunnel priority)                    |
   | distribution      |                                              |
   +-------------------+----------------------------------------------+
   | Fairness Policies | Yes                                          |
   +-------------------+----------------------------------------------+
   | Workload          | N/A                                          |
   | Consolidation     |                                              |
   +-------------------+----------------------------------------------+
   | Elastic Workloads | Yes ("auto-bandwidth")                       |
   +-------------------+----------------------------------------------+
   | Dynamic Resource  | N/A (multivendor is a given)                 |
   | Allocation (DRA)  |                                              |
   +-------------------+----------------------------------------------+
   | GPU Sharing       | Yes (link sharing)                           |
   +-------------------+----------------------------------------------+

              Table 2: Comparing KAI and Network Scheduling

   As can be seen, almost all features are supported; some other
   features are supported in network scheduling that may not have
   analogies in compute scheduling.

2.5.  Back to the Problem

   Back to Figure 1.

   With flow level multipathing, say X1 and X2 both send 400G of traffic
   to L1.  L1 tries to load balance X1's traffic to S1 and S2 (in
   principle, 200G each).  In practice, that may turn out to be 220G to
   S1 and 180G to S2.  However, L1 knows that it's only supposed to send
   200G to S1 from X1, so it adjusts its load balancing weights
   ("adaptive load balancing") until the traffic it sends to each of S1
   and S2 is 200G.
   L1 does the same with X2's traffic; if all works well, L1 will send a
   total of 400G to each of S1 and S2.
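
   The "adaptive load balancing" described above amounts to nudging
   per-next-hop weights toward the reserved split.  The update rule and
   step size below are our toy choices, not any vendor's algorithm.

```python
def adapt_weights(weights, measured, target, step=0.5):
    # Move each next hop's weight toward its target share, scaled by
    # the observed shortfall or excess; then renormalize.
    new = [w + step * (t - m) / sum(target)
           for w, m, t in zip(weights, measured, target)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.55, 0.45]     # L1's initial split of X1's 400G (220/180)
for _ in range(20):
    measured = [400 * w for w in weights]  # assume load follows weights
    weights = adapt_weights(weights, measured, target=[200, 200])
print(weights)  # converges toward [0.5, 0.5]
```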

   On the "downward" side (traffic going to the XPUs), there can be an
   "in-cast" problem: say both X1 and X3 are sending traffic to X6.
   Now, X1 has a TE tunnel to X6 with only a 200G reservation;
   similarly for X3.  So, in principle, the L3-X6 link should only
   carry 400G.

   Reservations can be temporarily exceeded; that is equally true with
   compute reservations.  Depending on the enforcement policies, an
   oversubscription situation should be temporary and is clearly visible
   (since accounting is easy), allowing more severe enforcement should
   it be persistent.

3.  Proposal

   Multipath TE (MPTE) [I-D.kompella-teas-mpte] has all the features of
   Traffic Engineering, including the above-mentioned TE constraints.
   However, whereas "regular" TE [RFC2702] considers a TE path with one
   ingress, one egress and a single path between them, MPTE allows
   multiple ingresses and egresses, and considers all paths between
   ingresses and egresses that meet the TE constraints.  Thus, MPTE
   builds a directed acyclic graph (DAG) between ingresses and egresses.
   This allows traffic flowing over the MPTE DAG to be load balanced
   across these paths.  Moreover, MPTE computes near optimal load
   balancing factors at each node; it does not simply use an equally
   weighted scheme.
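
   As a deliberately simple stand-in for MPTE's computation (which is
   specified in [I-D.kompella-teas-mpte], not here), one can split
   traffic at each junction in proportion to next-hop link capacities:

```python
def split_fractions(dag, capacity):
    # For each junction, weight each next hop by the capacity of the
    # link to it.  MPTE computes near-optimal factors; this does not.
    out = {}
    for node, nhops in dag.items():
        caps = [capacity[(node, n)] for n in nhops]
        total = sum(caps)
        out[node] = {n: c / total for n, c in zip(nhops, caps)}
    return out

# A fragment of Figure 1: L1 reaches L3 via S1 or S2, all 400G links.
dag = {"L1": ["S1", "S2"], "S1": ["L3"], "S2": ["L3"]}
capacity = {("L1", "S1"): 400, ("L1", "S2"): 400,
            ("S1", "L3"): 400, ("S2", "L3"): 400}
print(split_fractions(dag, capacity)["L1"])  # {'S1': 0.5, 'S2': 0.5}
```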

   This memo proposes the use of MPTE to compute, set up and allocate
   bandwidth for unicast collective communication among compute nodes in
   a deep learning cluster.

   Multicast TE (MCTE) uses similar constructs as MPTE (namely, DAGs and
   junctions) to set up point-to-multipoint and multipoint-to-multipoint
   tunnels among compute nodes.  MCTE also obeys TE constraints and
   allocates bandwidth resources.  Thus, whatever type of communication
   is required at the various phases of a deep learning job,
   there is a TE construct to allocate network resources and instantiate
   the communication pattern.

   Both MPTE and MCTE can preprogram "backup" paths in case of a link or
   node failure.

   We believe the use of MPTE and MCTE will reduce the incidence of
   congestion in a deep learning cluster.  Of course, congestion can
   happen for a number of reasons, including network failures.  Thus
   congestion notification will be needed; however, with the state
   installed in the network for the TE tunnels, a node X that detects a
   (link or node) failure knows exactly what tunnels are affected by a
   given failure and which ingress nodes to notify.  Furthermore, X can
   quickly put in place a backup path to protect against that failure
   until the ingresses can either reduce the traffic they send, or
   compute alternate end-to-end tunnels.
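
   The failure handling described above relies on a simple property:
   with tunnel state installed, a failed link maps directly to the
   affected tunnels and the ingresses to notify.  A sketch with
   hypothetical tunnel and node names:

```python
def affected_tunnels(tunnel_paths, failed_link):
    # Return {tunnel: ingress} for every tunnel whose installed path
    # traverses the failed (directed) link.
    return {t: path[0] for t, path in tunnel_paths.items()
            if failed_link in zip(path, path[1:])}

paths = {"T1": ["X1", "L1", "S1", "L3", "X6"],
         "T2": ["X3", "L2", "S2", "L4", "X8"]}
print(affected_tunnels(paths, ("S1", "L3")))  # {'T1': 'X1'}
```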

4.  Conclusion

   As mentioned in the Introduction, to make optimal use of deep
   learning clusters, especially when multiple jobs are run (e.g.,
   several inferencing jobs) or multi-tenancy is in play, network
   scheduling takes on increasing importance as a proactive measure to
   prevent network events such as congestion.  (This works orthogonally
   to packet spraying.)  One can add fast network event notification as
   a reactive measure.  Together, these techniques present a more
   holistic approach and should allow much better utilization of ML
   resources.

5.  IANA Considerations

   None, for now.

6.  Security Considerations

   TBD

7.  References

7.1.  Normative References

   [I-D.kompella-teas-mpte]
              Kompella, K., Jalil, L., Khaddam, M., and A. Smith,
              "Multipath Traffic Engineering", Work in Progress,
              Internet-Draft, draft-kompella-teas-mpte-01, 7 July 2025,
              <https://datatracker.ietf.org/doc/html/draft-kompella-
              teas-mpte-01>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

7.2.  Informative References

   [CO]       "Collective operation", November 2025,
              <https://en.wikipedia.org/wiki/Collective_operation>.

   [DSF]      "Disaggregated Scheduled Fabric", October 2024,
              <https://engineering.fb.com/2024/10/15/data-
              infrastructure/open-future-networking-hardware-ai-ocp-
              2024-meta>.

   [KAI]      "KAI Scheduler", n.d.,
              <https://github.com/NVIDIA/KAI-Scheduler>.

   [MPI]      "MPI: A Message-Passing Interface Standard, version 5.0",
              5 June 2025,
              <https://www.mpi-forum.org/docs/mpi-5.0/mpi50-report.pdf>.

   [NCCL]     "Collective Operations", 2020,
              <https://docs.nvidia.com/deeplearning/nccl/user-
              guide/docs/usage/collectives.html>.

   [RCCL]     "ROCm Communication Collectives Library", 31 October 2025,
              <https://rocm.docs.amd.com/projects/rccl/en/latest/>.

   [RFC2702]  Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and J.
              McManus, "Requirements for Traffic Engineering Over MPLS",
              RFC 2702, DOI 10.17487/RFC2702, September 1999,
              <https://www.rfc-editor.org/rfc/rfc2702>.

   [SLURM]    "SLURM Workload Manager", n.d.,
              <https://slurm.schedmd.com/overview.html>.

Authors' Addresses

   Kireeti Kompella
   Juniper Networks
   Sunnyvale, California 94089
   United States of America
   Email: kireeti.ietf@gmail.com

   Vishnu Pavan Beeram
   Juniper Networks
   Sunnyvale, California 94089
   United States of America
   Email: vbeeram@juniper.net

   Aditya Mahale
   Cerebras Systems
   Sunnyvale, California 94085
   United States of America
   Email: aditya.ietf@gmail.com

   Raghav Bhargava
   Crusoe
   Sunnyvale, California 94085
   United States of America
   Email: rbhargava@crusoe.ai
