rtgwg R. Chen
Internet-Draft ZTE Corporation
Intended status: Informational K. Yao
Expires: 9 January 2025 China Mobile
C. Gao
ZTE Corporation
8 July 2024
A Framework and Definition for Collective Communication Offloading
draft-chen-rtgwg-cco-framework-and-definition-00
Abstract
This document provides a definition of the term "Collective
Communication Offloading" for use within the IETF and specifically as
a reference for other IETF documents that describe or use aspects of
Collective Communication Offloading.
The document also describes the characteristics of Collective
Communication Offloading in the IETF context, related terms and their
meanings, and discusses the general framework for Collective
Communication Offloading, including the necessary system components
and interfaces.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 9 January 2025.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
Chen, et al. Expires 9 January 2025 [Page 1]
Internet-Draft A Framework and Definition for CCO July 2024
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
1.1. Requirements Language
2. Terms and Abbreviations
3. Definition of CCO
4. Framework
4.1. CCOM
4.2. Infrastructure Layer
4.3. CCOM Southbound Interface
4.3.1. Interface between the CCO-member and CCOM
4.3.2. Interface between the CCO-switch and CCOM
5. IANA Considerations
6. Acknowledgements
7. Informative References
Authors' Addresses
1. Introduction
The Collective Communication Offloading (CCO) feature allows
collective operations to be offloaded to switches. Distributed
applications that might benefit from CCO include, but are not limited
to:
* Artificial intelligence (AI).
* High performance computing (HPC).
In-Network Computing (INC) is a relatively common technology. In
both AI and HPC networks, the specific usage of INC is Collective
Communication Offloading (CCO). The use cases and their
characteristics are further described in
[I-D.yao-tsvwg-cco-problem-statement-and-usecases].
This document provides a definition of the term "Collective
Communication Offloading" for use within the IETF and specifically
as a reference for other IETF documents that describe or use aspects
of Collective Communication Offloading.
The document also describes the characteristics of Collective
Communication Offloading in the IETF context, related terms and their
meanings, and discusses the general framework for Collective
Communication Offloading, including the necessary system components
and interfaces.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Terms and Abbreviations
The terms and abbreviations used in this document are listed below.
INC: In-Network Computing
CCO: Collective Communication Offloading
CCOM: Collective Communication Offloading Manager
CCO-switch: A device in a network that performs collective operations.
CCO-member: A member of a collective group.
CCO-tree: A tree of CCO-switches used for collective offload for a
Collective Group.
Collective Group: A set of workers that participate in a collective
operation.
3. Definition of CCO
The definition of CCO in the IETF context is as follows:
Collective Communication Offloading (CCO) efficiently and
controllably utilizes the storage and computing resources of network
equipment without affecting the normal functions of that equipment.
The CCO feature offloads collective operations to CCO-switches to
improve network performance, for example by reducing latency and
increasing throughput.
The following types of collective operations are referred to in this
draft and can benefit from CCO:
* Broadcast: distribute data from one member to all other members.
* AllGather: collect data from all members and distribute the
collected data to all members.
* Reduce: combine data from all members and distribute the result to
one member.
* AllReduce: combine data from all members and distribute the result
to all members.
* ReduceScatter: combine data from all members and scatter the
result across all members, so that each member receives one
portion of it.
* Barrier: synchronize across all members.
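As an informal illustration only, the data movement of the operations
listed above can be sketched in Python. This is not an offload
implementation: each function simply computes what every member would
hold afterwards. The function names, the list-based data model, and
the default sum operator are assumptions of this sketch. Barrier
carries no data and is therefore omitted.

```python
from functools import reduce as fold
from operator import add

# Illustrative reference semantics only; all names are hypothetical.
# `inputs` holds one equally sized vector per member.

def broadcast(root_data, n):
    """Each of the n members ends up with a copy of the root's data."""
    return [list(root_data) for _ in range(n)]

def all_gather(inputs):
    """Each member ends up with the concatenation of all members' data."""
    gathered = [x for member in inputs for x in member]
    return [list(gathered) for _ in inputs]

def reduce_to_root(inputs, op=add):
    """Element-wise combination of all inputs, held by one member."""
    return [fold(op, column) for column in zip(*inputs)]

def all_reduce(inputs, op=add):
    """Element-wise combination of all inputs, held by every member."""
    result = reduce_to_root(inputs, op)
    return [list(result) for _ in inputs]

def reduce_scatter(inputs, op=add):
    """Element-wise combination; member i holds the i-th slice."""
    result = reduce_to_root(inputs, op)
    n = len(inputs)
    chunk = len(result) // n  # assumes the length is divisible by n
    return [result[i * chunk:(i + 1) * chunk] for i in range(n)]

inputs = [[1, 2], [3, 4]]
print(all_reduce(inputs))      # [[4, 6], [4, 6]]
print(reduce_scatter(inputs))  # [[4], [6]]
```

With CCO, the combination step would be performed by CCO-switches
rather than by the members themselves.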
4. Framework
The realization of CCO in the IETF context involves the following
components, which are defined here for consistent terminology (see
Figure 1).
* CCOM: The CCOM discovers CCO-switch capabilities and manages
CCO-switch resources. It is mainly responsible for establishing
collective groups and configuring the resources allocated to a
group for collective offload.
* Infrastructure Layer: It includes the CCO-switches and CCO-members.
* CCOM Southbound Interface: It includes the interface between the
CCO-member and the CCOM and the interface between the CCO-switch
and the CCOM.
+-----------------------------------------------------------------------+
|  +----------------+  +-------------------+  +----------------------+  |
|  |Group Management|  |Topology Management|  |CCO-switch capability |  |
|  +----------------+  +-------------------+  |      Management      |  |
|                    CCOM                     +----------------------+  |
+-----------------+-------------------------------+---------------------+
                  |                               |
          Interface between               Interface between
           the CCO-member                  the CCO-switch
              and CCOM                        and CCOM
                  |                               |
     +------------+-------------------------------+------------------+
     |            |                               |                  |
     |     +------+-----+                   +-----+------+           |
     |     | CCO-member |                   | CCO-switch |           |
     |     +------------+                   +------------+           |
     |                     Infrastructure Layer                      |
     +---------------------------------------------------------------+
Figure 1: The framework of the CCO
4.1. CCOM
The CCOM is mainly responsible for establishing collective groups and
allocating the necessary CCO resources for each collective group. The
CCOM is reachable by the CCO-switches and by the CCO-members of a
group, possibly through different networks. When a set of members
decides to form a group, the CCOM determines a CCO-tree to assign to
the group for collective offload. The CCOM then configures each
individual switch via the interface between the CCO-switch and the
CCOM, and finally returns to the CCO-members all information required
to communicate with their neighboring CCO-switches.
The CCOM has the following functional modules:
* Group Management: This module handles group creation and
destruction, group status queries, and the allocation and
de-allocation of the necessary CCO resources for the collective
group.
* Topology Management: The CCOM obtains or computes a CCO-tree
connecting the CCO-members and CCO-switches of the group after
receiving requests from all members. The CCO-tree is determined
using the underlay topology and CCO-switch resource information.
* CCO-switch Capability Management: The CCOM MUST obtain the
capabilities of each CCO-switch; it also manages CCO-switch
resources.
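One possible way to derive a CCO-tree from the underlay topology and
CCO-switch resource information is sketched below. This document does
not mandate any algorithm; the breadth-first search, the choice of a
root switch by the caller, and all names (compute_cco_tree, links,
member_attach, free_slots) are assumptions made for illustration.

```python
from collections import deque

def compute_cco_tree(links, member_attach, free_slots, root):
    """Return {node: parent} for a tree rooted at `root`, or None.

    Hypothetical inputs (not defined by this document):
    links:         {switch: [neighbouring switches]}, the underlay
    member_attach: {member: switch the member is attached to}
    free_slots:    {switch: number of additional groups it can serve}
    """
    usable = {s for s, n in free_slots.items() if n > 0}
    if root not in usable:
        return None
    # Breadth-first search over usable switches: shortest paths to root.
    parent = {root: None}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in links.get(current, []):
            if neighbour in usable and neighbour not in parent:
                parent[neighbour] = current
                queue.append(neighbour)
    # Keep only the switches on a path from some member up to the root.
    tree = {}
    for member, switch in member_attach.items():
        if switch not in parent:
            return None  # no usable path: the group cannot be offloaded
        tree[member] = switch
        while switch is not None and switch not in tree:
            tree[switch] = parent[switch]
            switch = parent[switch]
    return tree

links = {"spine": ["leaf1", "leaf2"], "leaf1": ["spine"], "leaf2": ["spine"]}
attach = {"m1": "leaf1", "m2": "leaf2"}
slots = {"spine": 1, "leaf1": 2, "leaf2": 1}
print(compute_cco_tree(links, attach, slots, "spine"))
```

In this example both members reach the spine switch through their leaf
switch, so the resulting tree has the spine as its root and the two
leaves as its children.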
4.2. Infrastructure Layer
The infrastructure layer includes CCO-switches and CCO-members.
* CCO-switch: A device in a network that performs collective
operations. It receives input data from CCO-members, performs a
reduction operation to produce a single result, and then
distributes the output data to one or more members, depending on
the collective group configuration and the particular collective
operation.
* CCO-member: A member of a collective group. It is the initiator
of collective operations: it provides input data and accepts
output data.
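The CCO-switch behavior described above can be illustrated with a toy
Python model. It is a sketch under simplifying assumptions: a single
group, in-memory vectors instead of packets, and a sum as the default
reduction. The class and function names (CcoSwitch, distribute) are
hypothetical and do not correspond to any defined protocol element.

```python
from operator import add

class CcoSwitch:
    """Toy model of one CCO-switch serving a single collective group."""

    def __init__(self, expected_inputs, op=add):
        self.expected_inputs = expected_inputs  # inputs per operation
        self.op = op
        self.pending = []

    def receive(self, vector):
        """Buffer one input; return the reduced vector once all arrived."""
        self.pending.append(vector)
        if len(self.pending) < self.expected_inputs:
            return None
        result = self.pending[0]
        for other in self.pending[1:]:
            result = [self.op(a, b) for a, b in zip(result, other)]
        self.pending = []
        return result

def distribute(result, members, operation, root=None):
    """Pick receivers: one member for Reduce, every member otherwise."""
    receivers = [root] if operation == "Reduce" else list(members)
    return {m: result for m in receivers}

switch = CcoSwitch(expected_inputs=3)
switch.receive([1, 1])
switch.receive([2, 2])
result = switch.receive([3, 3])
print(distribute(result, ["m1", "m2", "m3"], "AllReduce"))
# {'m1': [6, 6], 'm2': [6, 6], 'm3': [6, 6]}
```

Whether the output goes to one member or to all of them depends only
on the collective operation, which mirrors the distinction between
Reduce and AllReduce in Section 3.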
4.3. CCOM Southbound Interface
The following communication interfaces enable interworking and
interoperability between the CCOM and the CCO-switches and
CCO-members, providing a common means of provisioning, operating,
and monitoring CCO (see Figure 1).
4.3.1. Interface between the CCO-member and CCOM
This is the interface between the CCOM and a CCO-member. The CCOM
uses this interface to exchange group information with CCO-members,
including requests from CCO-members to join a group, creation and
destruction of the group, group status queries, etc. The main
information exchanged is as follows:
* CCO-member registration: A CCO-member needs to register with the
CCOM through this interface so that the CCOM knows of the member's
existence and can maintain a connection with it. Registration
request parameters include the CCO-member's addressing information
and the MTU supported by the CCO-member.
* Group setup: A CCO-member joins a group by providing a set of
required capabilities to the CCOM. A group is established after
all members have attempted to join. Group creation MUST fail if
the required network resources or capabilities cannot be provided.
* Group destruction: If the topology changes during the lifetime of
the group, or once any member has left, the group is no longer
usable. The CCOM must tear down the group and build a new
CCO-tree.
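The registration, group setup, and group destruction exchanges above
can be sketched as a small state machine. This document does not
define message formats, so everything below is hypothetical: the Ccom
and Group classes, the method signatures, the state names, and the
assumption that the CCOM knows the expected member set when the first
join request arrives. Rebuilding a CCO-tree after teardown is not
modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    expected: set                               # members that must join
    joined: dict = field(default_factory=dict)  # member -> required ops
    state: str = "PENDING"

class Ccom:
    """Hypothetical CCOM view of the CCO-member interface."""

    def __init__(self, offered_operations):
        self.offered = set(offered_operations)  # what the network offers
        self.members = {}                       # member -> (address, mtu)
        self.groups = {}

    def register(self, member, address, mtu):
        self.members[member] = (address, mtu)

    def join(self, group_id, expected, member, required_ops):
        if member not in self.members:
            raise KeyError(f"{member} has not registered")
        group = self.groups.setdefault(group_id, Group(set(expected)))
        group.joined[member] = set(required_ops)
        if set(group.joined) == group.expected:  # all attempted to join
            needed = set().union(*group.joined.values())
            # Creation fails if a required capability is not offered.
            group.state = "ESTABLISHED" if needed <= self.offered else "FAILED"
        return group.state

    def leave(self, group_id, member):
        # Once any member has left, the whole group is torn down.
        self.groups.pop(group_id)
        return "DESTROYED"

ccom = Ccom(offered_operations={"AllReduce", "Broadcast"})
ccom.register("m1", "192.0.2.1", 4096)
ccom.register("m2", "192.0.2.2", 4096)
print(ccom.join("g1", {"m1", "m2"}, "m1", {"AllReduce"}))  # PENDING
print(ccom.join("g1", {"m1", "m2"}, "m2", {"AllReduce"}))  # ESTABLISHED
```

The group only leaves the pending state after the last expected member
has attempted to join, which reflects the group setup rule above.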
4.3.2. Interface between the CCO-switch and CCOM
This is the interface between the CCOM and a CCO-switch. The CCOM
uses it to discover CCO-switch capabilities and to manage CCO-switch
resources. The main information exchanged is as follows:
* Discover CCO-switch capability: The CCOM queries the CCO-switches
to obtain their capabilities. The capabilities of a CCO-switch
mainly include: whether it supports in-network computing, the
types of collective operations supported, the number of groups
supported, the number of trees supported, the supported MTU, etc.
* Allocate and de-allocate switch resources for a group: To allocate
resources for a Collective Group, the CCOM first needs to know
the type of collective operations the group intends to perform.
This is necessary because a deployment can have different types of
CCO-switch; for example, some switches can support reduction while
others support only data transfer offload. The CCOM therefore
queries the CCO-switches to obtain their capabilities. In this
way, the CCOM can allocate appropriate resources even when
different Collective Groups perform different types of collective
operations.
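The capability discovery and resource allocation steps above can be
sketched as follows. The capability record mirrors the items listed in
this section, but its field names, the allocate and deallocate
helpers, and the one-slot-per-group accounting are assumptions of this
sketch rather than defined behavior.

```python
from dataclasses import dataclass

@dataclass
class SwitchCapability:
    """Hypothetical record of what a CCO-switch reports to the CCOM."""
    supports_inc: bool
    operations: frozenset   # supported collective operation types
    max_groups: int
    max_mtu: int
    groups_in_use: int = 0

def allocate(capabilities, group_operations, group_mtu):
    """Reserve one group slot on every switch able to serve the group."""
    needed = set(group_operations)
    chosen = [
        name for name, cap in capabilities.items()
        if cap.supports_inc
        and needed <= cap.operations
        and cap.groups_in_use < cap.max_groups
        and group_mtu <= cap.max_mtu
    ]
    for name in chosen:
        capabilities[name].groups_in_use += 1
    return chosen

def deallocate(capabilities, chosen):
    """Release the slots reserved by a previous allocate() call."""
    for name in chosen:
        capabilities[name].groups_in_use -= 1

capabilities = {
    # s1 supports reduction; s2 supports only data transfer offload.
    "s1": SwitchCapability(True, frozenset({"AllReduce", "Broadcast"}), 2, 4096),
    "s2": SwitchCapability(True, frozenset({"Broadcast"}), 2, 4096),
}
print(allocate(capabilities, {"AllReduce"}, 2048))  # ['s1']
```

A group that intends to perform AllReduce is placed only on the switch
with reduction support, while a Broadcast-only group could use either
switch.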
5. IANA Considerations
There are no requests to IANA in this framework document.
6. Acknowledgements
TBD.
7. Informative References
[I-D.yao-tsvwg-cco-problem-statement-and-usecases]
Yao, K., Shiping, X., Li, Y., Huang, H., and D. KUTSCHER,
"Collective Communication Optimization: Problem Statement
and Use cases", Work in Progress, Internet-Draft, draft-
yao-tsvwg-cco-problem-statement-and-usecases-00, 23
October 2023, <https://datatracker.ietf.org/doc/html/
draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
Authors' Addresses
Ran Chen
ZTE Corporation
Nanjing
China
Email: chen.ran@zte.com.cn
Kehan Yao
China Mobile
Beijing
China
Email: yaokehan@chinamobile.com
Chenqiang Gao
ZTE Corporation
Nanjing
China
Email: gao.chenqiang@zte.com.cn