Collective Communication Optimization: Requirement and Analysis
draft-yao-tsvwg-cco-requirement-and-analysis-00

The information below is for an old version of the document.
Document Type
This is an older version of an Internet-Draft whose latest revision state is "Active".
Authors Kehan Yao , Xu Shiping , Yizhou Li , Hongyi Huang , Dirk KUTSCHER
Last updated 2023-10-23
Transport Area Working Group                                      K. Yao
Internet-Draft                                                     S. Xu
Intended status: Informational                              China Mobile
Expires: 25 April 2024                                             Y. Li
                                                                H. Huang
                                                     Huawei Technologies
                                                             D. KUTSCHER
                                                       HKUST (Guangzhou)
                                                         23 October 2023

    Collective Communication Optimization: Requirement and Analysis
            draft-yao-tsvwg-cco-requirement-and-analysis-00

Abstract

   As described in [CCO PS & USECASE], the main reason existing
   protocols cannot meet the high-performance requirements of collective
   communication is that distributed applications are not co-designed
   with the underlying networking protocols.  There is a semantic gap
   between inter-process message transport and packet forwarding, which
   should be bridged by efficient mapping and optimization.

   This draft presents technical requirements for the design of
   collective communication optimization, and discusses and analyzes
   several pieces of related work.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

Yao, et al.               Expires 25 April 2024                 [Page 1]
Internet-Draft  Collective Communication Optimization: R    October 2023

   This Internet-Draft will expire on 25 April 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Existing Work and Analysis
     2.1.  COIN
     2.2.  SHArP
     2.3.  Relations with IP multicast
     2.4.  Unified Communication X
     2.5.  Other Considerations
       2.5.1.  Applicable Domain
       2.5.2.  Operational considerations
   3.  Requirements
     3.1.  Improve Transport Efficiency
     3.2.  Transport Suitability for Different Communication Modes
     3.3.  End-network Collaborative Collective Communication
     3.4.  Management and Control
     3.5.  Make Use of IP Multicast
     3.6.  Self-healing
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   Collective communication optimization originated in parallel
   computing, which was initially confined to High-Performance Computing
   (HPC).  However, with the development of distributed applications,
   especially distributed Artificial Intelligence (AI), collective
   communication has become a common inter-process communication pattern
   that applies to a variety of scenarios.  Its use cases are still
   evolving rapidly, from clusters in controlled environments, to campus
   networks, and even to the open Internet, as in Federated Learning.

   Collective communication reflects application behavior, which is
   currently not co-designed with network protocols.  Specific
   requirements on protocol design should be stated to guide
   standardization.  There is other work related to this topic, but its
   scope is either too broad or limited to single applications.  This
   document focuses on optimizing collective communication from the
   network perspective.  Solutions may first be introduced in limited
   domains [RFC8799], but we also consider how to extend the work to
   Internet-scale applications.

2.  Existing Work and Analysis

   This section describes work related to collective communication
   optimization.  This work comes from IETF working groups, research
   groups, open source projects, and industry practice.  We also analyze
   it to explain why collective communication optimization merits
   further investigation and study in the IETF.

2.1.  COIN

   According to the charter of the Computing in the Network (COIN)
   research group, its scope is to investigate how network data plane
   programmability can improve the Internet architecture.  The use cases
   that COIN defines are relatively broad, including network function
   offloading, machine learning acceleration, in-network caching, and
   in-network control.  What "computing" means here is very generic, and
   it is tightly coupled to the programmability of the chips.
   Collective communication offloading is briefly mentioned as one of
   the use cases of COIN.  Nevertheless, optimizing collective
   communication is not only about using programmable chips to compute
   collective operations in network devices.  It also includes potential
   standardization work such as communication protocol enhancements and
   application interface design.

2.2.  SHArP

   The Scalable Hierarchical Aggregation Protocol (SHArP) [SHArP] is a
   proprietary solution for optimizing collective communication, owned
   by Nvidia.  It has been implemented in Nvidia's commodity switches.
   SHArP departs from the end-to-end transport principle by implementing
   a Target Channel Adapter (TCA) in switches.  The TCA supports both
   Reliable Connection (RC) transport, to enable reliable delivery of
   data through the aggregation tree, and Unreliable Datagram (UD)
   transport, to enable multicast distribution of the aggregation
   result.  SHArP is based on the InfiniBand architecture.  It currently
   cannot interoperate with other network architectures, which limits
   its applicability.
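
   The aggregation-tree idea can be illustrated with a small host-side
   simulation.  This is only a sketch and does not reflect SHArP's
   implementation or wire format.  The tree layout, the node names, the
   functions reduce_up() and allreduce(), and the choice of an
   element-wise sum are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical aggregation tree: each switch lists its children.
# Names such as "spine", "leaf0", and "h0" are illustrative only.
TREE: Dict[str, List[str]] = {
    "spine": ["leaf0", "leaf1"],
    "leaf0": ["h0", "h1"],
    "leaf1": ["h2", "h3"],
}


def reduce_up(node: str, inputs: Dict[str, List[int]]) -> List[int]:
    """Return the element-wise sum of all host vectors below `node`."""
    if node not in TREE:  # a host contributes its own vector
        return list(inputs[node])
    partials = [reduce_up(child, inputs) for child in TREE[node]]
    # A switch forwards one aggregated vector upstream instead of
    # one vector per child.
    return [sum(column) for column in zip(*partials)]


def allreduce(inputs: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Aggregate towards the root, then hand the result to every host."""
    result = reduce_up("spine", inputs)
    return {host: list(result) for host in inputs}


if __name__ == "__main__":
    data = {"h0": [1, 2], "h1": [3, 4], "h2": [5, 6], "h3": [7, 8]}
    print(allreduce(data)["h0"])  # every host receives the same sums
```

   In this sketch each switch sends a single partial result upstream, so
   the traffic towards the root does not grow with the number of hosts
   below a switch.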

2.3.  Relations with IP multicast

   As analyzed in [CCO PS & USECASE], existing IP multicast protocols
   have not been fully used by distributed application systems that
   require collective communication.  Most collective operations still
   use unicast transmission, which incurs considerable overhead in
   network bandwidth and host CPU consumption.  Existing IP multicast
   forwards packets based on the destination multicast IP address.  It
   is not yet clear how group-based collective communication can
   leverage multicast to improve performance and save bandwidth.
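
   The bandwidth overhead of unicast can be made concrete by counting
   how many packet copies cross links when one sender distributes a
   message to a group.  The sketch below assumes a hypothetical tree
   topology (the PARENT map and all node names are illustrative) and
   ignores retransmissions and control traffic.

```python
from typing import Dict, List

# Hypothetical distribution tree rooted at the sender "src".
# Each entry maps a node to its parent; names are illustrative.
PARENT: Dict[str, str] = {
    "sw0": "src",
    "sw1": "sw0", "sw2": "sw0",
    "r0": "sw1", "r1": "sw1",
    "r2": "sw2", "r3": "sw2",
}


def path_to_root(node: str) -> List[tuple]:
    """Links (child, parent) on the path from `node` up to the sender."""
    links = []
    while node in PARENT:
        links.append((node, PARENT[node]))
        node = PARENT[node]
    return links


def unicast_link_loads(receivers: List[str]) -> int:
    """One copy per receiver crosses every link on that receiver's path."""
    return sum(len(path_to_root(r)) for r in receivers)


def multicast_link_loads(receivers: List[str]) -> int:
    """One copy crosses each link of the tree that spans the receivers."""
    return len({link for r in receivers for link in path_to_root(r)})
```

   For the four receivers r0 to r3 in this topology, unicast places 12
   packet copies on links while multicast places 7, and the sender's
   own link carries 4 copies with unicast but 1 with multicast.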

2.4.  Unified Communication X

   Unified Communication X (UCX) defines a set of network APIs and their
   implementation for high-throughput computing.  It is oriented towards
   programming models that need collective communication.  The core
   objective of UCX is to provide a framework for implementing
   collective communication that addresses cross-platform functionality
   and performance portability.  Multiple industry efforts optimize
   collective-based programming models, but each is tied either to a
   specific model or to specific hardware, which creates duplicate work
   and does not scale well.  UCX aims to address these issues by
   unifying network APIs across different network architectures, such as
   InfiniBand, RoCE, TCP/IP, and Cray.  The goal is challenging because
   it introduces a lot of abstraction, which may affect overall system
   performance.

   This draft is more concerned with the optimization of collective
   communication itself, including protocols that realize message
   transport, multicast, and OAM.  Cross-platform portability is not
   the major goal.

2.5.  Other Considerations

   It is also necessary to discuss other issues, such as where the
   optimization should be applied and who should operate it.  This scope
   analysis helps position the new technique in the overall landscape.

2.5.1.  Applicable Domain

   Whether the end-to-end transport mechanism should be broken to
   optimize collective communication is still open to discussion, and
   there are different technical roadmaps.  On the one hand, SHArP is
   representative of breaking end-to-end transport, but it is based on
   InfiniBand.  On the other hand, transport-transparent solutions like
   [NetReduce] do not break the end-to-end principle and can be
   host-friendly, so they have better scalability and interoperability
   for use in the Internet.  However, further investigation is needed
   into collective operations other than Allreduce that could be
   accelerated.  Considering that optimizing collective communication
   may introduce additional overhead for transport upgrades, and that
   there are security issues as side effects, as stated in
   [CCO PS & USECASE], collective communication optimization may first
   be introduced in limited domains and then extended to the Internet.

2.5.2.  Operational considerations

   Use cases like AI model training, distributed storage, and big data
   analysis usually need infrastructure deployed in clusters that are
   operated by a single entity.  In this case, not only the compute and
   network infrastructure, but also the application, may be owned by a
   single service provider.  These use cases are typically
   performance-driven, which means the application and infrastructure
   need to be co-designed to reach optimal performance.  However,
   applications need not be co-designed with the underlying network
   protocols case by case.  As long as vendors reach consensus on the
   definition and realization of the collective operations to be
   offloaded, for example through unified primitives for implementing
   collective communication, applications can rely on a standardized
   northbound API to improve performance, even if they do not belong to
   the same service provider [NetRPC].

3.  Requirements

3.1.  Improve Transport Efficiency

   Distributed parallel computing needs high-performance inter-process
   communication.  Remote Memory Access (RMA) can reduce the number of
   memory copies, which can improve the performance of distributed
   systems.

   R1.  The transport layer MUST support RMA functions.

   R2.  Memory sharing is RECOMMENDED to support collective operations.
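
   As a local stand-in for the memory sharing that R2 recommends, the
   sketch below lets a producer and a consumer access the same memory
   region through two handles, so the data is not passed as a message.
   It uses Python's multiprocessing.shared_memory module on a single
   host and is not RMA over a network.  The function name
   write_then_read() is hypothetical.

```python
from multiprocessing import shared_memory


def write_then_read(payload: bytes) -> bytes:
    """Write through one handle and read through a second handle
    attached to the same region, without an intermediate message.
    `payload` must be non-empty."""
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            owner.buf[:len(payload)] = payload      # producer writes in place
            return bytes(peer.buf[:len(payload)])   # consumer sees the bytes
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # release the region once both sides are done
```

   A real transport would additionally need memory registration, access
   keys, and completion notification, none of which are modeled here.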

3.2.  Transport Suitability for Different Communication Modes

   The communication modes and traffic characteristics of parallel
   computing applications vary, so the transport layer capabilities
   they require differ as well.  The transport layer needs to adapt to
   different communication modes.  In particular, once collective
   operations offloading is supported, the communication modes include
   both end-to-end and end-to-network, which means supporting
   in-network processing and computing functions.

   R3.  The implementation of blocking or non-blocking communication of
   applications MUST be adjusted according to different communication
   modes.

   R4.  The transport layer MUST provide appropriate reliability
   mechanisms to adapt to different communication modes.
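
   The difference between blocking and non-blocking communication in R3
   can be sketched as follows.  This is an illustrative model only: the
   function names are hypothetical, and a thread pool stands in for the
   transport that would actually carry out the collective.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

# Single worker thread that stands in for the communication engine.
_pool = ThreadPoolExecutor(max_workers=1)


def _sum_vectors(vectors: List[List[float]]) -> List[float]:
    """Stand-in for the transport carrying out an Allreduce (sum)."""
    return [sum(column) for column in zip(*vectors)]


def allreduce_blocking(vectors: List[List[float]]) -> List[float]:
    """Return only once the collective has completed."""
    return _sum_vectors(vectors)


def allreduce_nonblocking(vectors: List[List[float]]) -> Future:
    """Start the collective and return a handle immediately."""
    return _pool.submit(_sum_vectors, vectors)


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]
    handle = allreduce_nonblocking(grads)     # communication starts
    local = sum(x * x for x in range(1000))   # overlapped local compute
    print(handle.result(), local)             # wait for completion
```

   With the non-blocking form the application can overlap local
   computation with communication and wait on the handle only when the
   result is needed.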

3.3.  End-network Collaborative Collective Communication

   Collective operations offloading uses network devices to perform
   computing tasks that need low computational precision and high I/O,
   thereby optimizing collective operations.  Network computing devices
   not only route and forward packets, but also need to process
   transport layer messages.  The transport layer therefore needs to
   provide the mapping mechanism between packets and messages, and to
   complete the transmission of transport layer messages.  Consequently,
   a transport layer protocol that supports collective communication
   optimization needs to meet the following requirements:

   R5.  The transport layer MUST carry messages that network devices can
   recognize to complete offloaded collective operations processing.
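
   One possible way to picture the mapping between messages and packets
   is a small per-packet header that identifies the operation and the
   position of the packet within its message.  The header layout below
   (operation code, data type, message identifier, segment index,
   segment count) is purely hypothetical and is not a proposed wire
   format.

```python
import struct
from typing import Dict, List

# Hypothetical header: op code, data type, message id, segment index,
# segment count. Field names and sizes are assumptions, not a proposal.
HEADER = struct.Struct("!BBIHH")
OP_ALLREDUCE_SUM = 1
DTYPE_INT32 = 1


def segment(msg_id: int, payload: bytes, mtu_payload: int) -> List[bytes]:
    """Split one message into packets that each carry the header."""
    chunks = [payload[i:i + mtu_payload]
              for i in range(0, len(payload), mtu_payload)] or [b""]
    return [
        HEADER.pack(OP_ALLREDUCE_SUM, DTYPE_INT32, msg_id, idx, len(chunks))
        + chunk
        for idx, chunk in enumerate(chunks)
    ]


def reassemble(packets: List[bytes]) -> Dict[int, bytes]:
    """Rebuild complete messages from packets that may arrive reordered."""
    parts: Dict[int, Dict[int, bytes]] = {}
    totals: Dict[int, int] = {}
    for pkt in packets:
        _op, _dtype, msg_id, idx, total = HEADER.unpack_from(pkt)
        parts.setdefault(msg_id, {})[idx] = pkt[HEADER.size:]
        totals[msg_id] = total
    return {
        msg_id: b"".join(segs[i] for i in range(totals[msg_id]))
        for msg_id, segs in parts.items()
        if len(segs) == totals[msg_id]   # skip incomplete messages
    }
```

   Because every packet carries the message identifier and segment
   index, a device that understands the header can associate a packet
   with its message without relying on arrival order.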

   R6.  The transport layer MUST support a fallback mechanism, in case
   network devices cannot sufficiently support collective operations
   offloading.
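
   A fallback decision of the kind R6 asks for might look like the
   following sketch.  The capability table, the device names, and the
   rule that every device on the path must support the operation are
   assumptions chosen to keep the example small.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical capability table: which (operation, data type) pairs
# each on-path device can aggregate. Values are illustrative.
CAPABILITIES: Dict[str, Set[Tuple[str, str]]] = {
    "sw0": {("allreduce_sum", "int32"), ("allreduce_sum", "fp16")},
    "sw1": {("allreduce_sum", "int32")},
}


def choose_mode(op: str, dtype: str, path: List[str]) -> str:
    """Use in-network processing only if every device on the path
    supports the operation; otherwise fall back to end hosts."""
    if path and all((op, dtype) in CAPABILITIES.get(dev, set())
                    for dev in path):
        return "in-network"
    return "host-fallback"
```

   Here an fp16 Allreduce over the path sw0, sw1 falls back to the end
   hosts because sw1 does not list that data type, while the int32 case
   can be offloaded.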

3.4.  Management and Control

   Current hardware implementations of collective operations offloading
   vary, and there is no unified standard for the supported in-network
   computing primitives, which poses great challenges to the management
   and control plane.  In addition, as network scale continues to grow,
   topology-aware algorithms are needed for computing-related task
   allocation, path planning, and so on, in order to optimize control.

   R7.  The management and control plane MUST support unified definition
   and management of the data types and data structures that network
   devices support for collectives offloading, as well as unified
   management of resources such as the memory of network devices.

   R8.  It is RECOMMENDED to achieve topology awareness, task
   scheduling, and allocation in collaboration between the end and
   network.
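
   Topology awareness as recommended in R8 can be illustrated by placing
   the aggregation function on the switch that is closest, in total hop
   count, to the workers of a job.  The topology, the node names, and
   the hop-count cost are assumptions for illustration, and the sketch
   assumes a connected topology.

```python
from collections import deque
from typing import Dict, List

# Hypothetical topology as an adjacency list; names are illustrative.
TOPOLOGY: Dict[str, List[str]] = {
    "spine": ["leaf0", "leaf1"],
    "leaf0": ["spine", "h0", "h1"],
    "leaf1": ["spine", "h2"],
    "h0": ["leaf0"], "h1": ["leaf0"], "h2": ["leaf1"],
}


def hops_from(start: str) -> Dict[str, int]:
    """Breadth-first search giving the hop count from `start` to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in TOPOLOGY[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def place_aggregator(workers: List[str], candidates: List[str]) -> str:
    """Pick the candidate switch with the smallest total hop count
    to all workers of the job (ties go to the earlier candidate)."""
    def cost(switch: str) -> int:
        dist = hops_from(switch)
        return sum(dist[w] for w in workers)
    return min(candidates, key=cost)
```

   For a job whose workers are h0 and h1, this picks leaf0 rather than
   the spine, which keeps the aggregation traffic below a single leaf
   switch.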

3.5.  Make Use of IP Multicast

   IP multicast protocols provide a native one-to-group communication
   mechanism, which is well suited to one-to-group and group-to-group
   collective operations.  Although designing IP multicast protocols is
   not within the scope of the transport and application areas,
   extending IP multicast protocols to support collective communication
   is necessary.  Thus:

   R9.  IP multicast protocols SHOULD be extended to support collective
   communication.  However, whether to design new multicast algorithms
   and protocols dedicated to collective communication is out of scope.

3.6.  Self-healing

   Self-healing is required because a single in-network device may go
   out of service, due to either a single point of failure or a link
   breakdown.  Therefore:

   R10.  A mechanism for choosing an alternative node to carry out
   collective operations MUST be designed, to ensure system robustness
   and reliability.
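
   A very small sketch of such a selection mechanism is given below.  It
   assumes that an ordered list of candidate nodes and a health map are
   already available from some monitoring function; both inputs and the
   function name select_aggregator() are hypothetical.

```python
from typing import Dict, List, Optional


def select_aggregator(candidates: List[str],
                      healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first candidate that is currently healthy, in
    preference order, or None if no in-network node is available."""
    for node in candidates:
        if healthy.get(node, False):  # unknown nodes count as unhealthy
            return node
    return None
```

   When the function returns None, no in-network node is usable and the
   collective operation would have to be completed by the fallback
   mechanism required in R6.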

4.  Security Considerations

   Some security concerns have been described in [CCO PS & USECASE].

5.  IANA Considerations

   TBD.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8799]  Carpenter, B. and B. Liu, "Limited Domains and Internet
              Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020,
              <https://www.rfc-editor.org/info/rfc8799>.

6.2.  Informative References

   [NetReduce]
              Liu, S., "In-Network Aggregation with Transport
              Transparency for Distributed Training",
              DOI 10.1145/3582016.3582037, 2023,
              <https://doi.org/10.1145/3582016.3582037>.

   [NetRPC]   Zhao, B., "NetRPC: Enabling In-Network Computation in
              Remote Procedure Calls", 2023.

   [SHArP]    Graham, R. L., "Scalable Hierarchical Aggregation Protocol
              (SHArP): A Hardware Architecture for Efficient Data
              Reduction", DOI 10.1109/COMHPC.2016.006, 2016,
              <https://doi.org/10.1109/COMHPC.2016.006>.

Authors' Addresses

   Kehan Yao
   China Mobile
   Beijing
   100053
   China
   Email: yaokehan@chinamobile.com

   Shiping Xu
   China Mobile
   Beijing
   100053
   China
   Email: xushiping@chinamobile.com

   Yizhou Li
   Huawei Technologies
   Nanjing, Jiangsu
   China
   Email: liyizhou@huawei.com

   Hongyi Huang
   Huawei Technologies
   Beijing
   China
   Email: hongyi.huang@huawei.com

   Dirk KUTSCHER
   HKUST (Guangzhou)
   Guangzhou
   China
   Email: dku@hkust-gz.edu.cn
