Collective Communication Optimizations: Requirement and Analysis
draft-yao-tsvwg-cco-requirement-and-analysis-02

Document Type Expired Internet-Draft (individual)
Expired & archived
Authors Kehan Yao, Xu Shiping, Liu Chang, Yizhou Li, Hongyi Huang, Weifeng Wang, Dirk KUTSCHER
Last updated 2025-01-09 (Latest revision 2024-07-08)
RFC stream (None)
Intended RFC status (None)
Stream Stream state (No stream defined)
Consensus boilerplate Unknown
RFC Editor Note (None)
IESG IESG state Expired
Telechat date (None)
Responsible AD (None)
Send notices to (None)

This Internet-Draft is no longer active.

Abstract

Generative AI applications depend on large-scale parallel computing clusters for model training and inference. Existing implementations of collective communication in parallel computing are built on top of RDMA, the most widely adopted AI transport protocol. However, One-to-Many, Many-to-One, and Many-to-Many collective operations all depend on the point-to-point transport semantics of RDMA, which inevitably increases bandwidth occupancy and transmission overhead. Emerging approaches to collective communication optimization focus on network-assisted collective acceleration and can work compatibly with RDMA. This document analyzes different technical schemes for network-assisted collective acceleration based on RDMA and presents the gap between this work and current IETF standards, notably iWARP. Requirements for designing new standards are proposed accordingly.
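
The bandwidth argument in the abstract can be illustrated with a rough sketch. It assumes a deliberately simplified model that is not taken from the draft: a flat One-to-Many operation in which the source sends one point-to-point copy per receiver, compared with a network-assisted case in which the source sends a single copy and the network replicates it. The function name and parameters are hypothetical, and real collective libraries often use ring or tree algorithms that spread this load differently.

```python
def sender_bytes(message_bytes: int, num_receivers: int, network_assisted: bool) -> int:
    """Bytes the source must put on its own link for a One-to-Many operation.

    Simplified model (an assumption, not from the draft):
    - point-to-point: one full copy of the message per receiver
    - network-assisted: one copy in total, replicated inside the network
    """
    if num_receivers <= 0:
        return 0
    copies = 1 if network_assisted else num_receivers
    return message_bytes * copies


if __name__ == "__main__":
    msg = 1024  # message size in bytes
    for n in (2, 8, 64):
        p2p = sender_bytes(msg, n, network_assisted=False)
        assisted = sender_bytes(msg, n, network_assisted=True)
        print(f"receivers={n}: point-to-point={p2p} B, network-assisted={assisted} B")
```

Under this model the load on the source link grows linearly with the number of receivers in the point-to-point case and stays constant in the network-assisted case.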
