NFSV4                                                          F. Liu
Internet Draft                                                W. Wang
Intended status: Standards Track                               R. Liu
Expires: August 2024                                              H3C
                                                                 Y. Mu
                                                                 K. Yao
                                                           China Mobile
                                                     February 28, 2024

              RoCEv2-based Collective Communication Offloading
                       draft-liu-nfsv4-rocev2-00.txt

Abstract

   This draft proposes a design scheme for RoCEv2-based collective
   communication offloading. By establishing RDMA connections between
   clients and switches, collective operations can be implemented on
   network nodes, improving the overall efficiency of collective
   communication.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November 10,
   2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format

Liu, et al.            Expires August 28, 2024                [Page 1]
Internet-Draft               RoCEv2 CCO                  February 2024

   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on August 28, 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents carefully,
   as they describe your rights and restrictions with respect to this
   document. Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the Trust
   Legal Provisions and are provided without warranty as described in
   the Simplified BSD License.

Table of Contents

   1. Introduction...................................................3
   2. Terminology and Definitions....................................4
   3. Architecture...................................................4
      3.1. In-network Computing Aggregation Manager..................6
      3.2. In-network Computing Switch...............................7
      3.3. In-network Computing Client...............................8
   4. Deployment.....................................................9
   5. Interaction Process...........................................12
      5.1. Control plane............................................12
      5.2. Forwarding plane.........................................13
   6. Packet encapsulation..........................................15
   7. Transport layer requirements..................................16
   8. Security Considerations.......................................17
   9. IANA Considerations...........................................17
   10. References...................................................17
      10.1. Normative References....................................17
      10.2. Informative References..................................18
   11. Acknowledgments..............................................18

1. Introduction

   Collective communication means that, within a network, multiple
   computers or devices communicate through shared resources and
   cooperation to achieve more efficient and secure data transmission
   and information exchange. Detailed use cases and problems are
   described in [I-D.yao-tsvwg-cco-problem-statement-and-usecases].
   Various collective communication operations are used in both
   artificial intelligence (AI) and high performance computing (HPC)
   workloads, including:

   1. Broadcast - spread data from one member to all other members.

   2. AllGather - collect data from all members and spread it to all
      members.

   3. AllToAll - distribute different data from all members to all other
      members.

   4. Scatter - distribute different data from one member to all other
      members.

   5. Gather - collect data from all members and send to one member.

   6. Reduce - merge the data of all members and send the result to one
      member.

   7. AllReduce - merge data of all members and spread it to all members.

   8. ReduceScatter - merge the data of all members, and distribute a
      different part of the result to each member.

   9. Barrier - synchronize among all members.
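   As an illustration only, the semantics of two of the operations
   above (AllReduce and ReduceScatter) can be sketched in plain Python;
   the function and variable names are not part of this specification:

```python
# Illustrative sketch of AllReduce and ReduceScatter semantics on
# per-member integer vectors, using elementwise sum as the reduction.

def allreduce(members):
    """Every member ends up with the elementwise sum of all inputs."""
    total = [sum(col) for col in zip(*members)]
    return [list(total) for _ in members]

def reduce_scatter(members):
    """Member i ends up with element i of the elementwise sum
    (assumes vector length equals the number of members)."""
    total = [sum(col) for col in zip(*members)]
    return [total[i] for i in range(len(members))]

# Four members, each contributing a 4-element vector.
data = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(allreduce(data)[0])      # every member holds [4, 8, 12, 16]
print(reduce_scatter(data))    # member i holds element i of the sum
```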

   In-network computing enables network devices to participate in
   collective communication by offloading the collective communication
   operations frequently used by HPC and AI to network nodes. The
   requirements and analysis are described in
   [I-D.yao-tsvwg-cco-requirement-and-analysis]. The acceleration of
   collective communication through in-network computing is of great
   significance, mainly in the following aspects. From an application
   point of view, in-network computing can significantly reduce
   communication traffic, thus improving the overall computing
   efficiency and the overall application performance. From the point
   of view of resource utilization, the computational tasks of
   processors are shared, thus the computation is accelerated and the
   overall resource utilization is improved. From the network point of
   view, the data flow in the network is reduced and congestion is
   relieved, so that the network utilization is improved.

2. Terminology and Definitions

   The following terms are used in this document:

   Aggregation
   The act of collecting and reducing input data from one or more group
   members.

   Collective
   Collective Operation - an operation done by a group of ranks.

   Collective Group
   A set of ranks that participate in a collective operation.

   INC
   In-network computing.

   INC-switch
   Switch with the capability to support INC.

3. Architecture

   Figure 1 illustrates a conceptual architecture of in-network
   computing.

                       +------------------------+
                       |                        |
                       |       INC Switch       |
                       |                        |
         +-------------+     +------------+     |
         |             |     | Switch Chip|     |
         |             |     +------------+     |
         |             |                        |
     +---+----+        +------/----------\------+
     | INC AM |             //            \\
     +---+----+            /                \\
         |                /   RDMA-RoCEv2     \
         |              //                     \\
         |             /                         \\
         | +----------/----------+       +---------\-----------+
         | |                     |       |                     |
         +-|     INC Client      |       |     INC Client      |
           |                     |       |                     |
           |   +-------------+   |       |   +-------------+   |
           |   |   GPU       |   |       |   |   GPU       |   |
           |   +-------------+   |       |   +-------------+   |
           |                     |       |                     |
           +---------------------+       +---------------------+

                           Figure 1 Architecture

   In order to offload collective communication, the architecture of in-
   network computing is mainly composed of three parts: the In-network
   Computing Aggregation Manager, the In-network Computing Switch, and
   the In-network Computing Client.

   o    In-network Computing Aggregation Manager (INC AM): It is the
      controller of the entire in-network computing system, mainly
      responsible for the generation and management of the Aggregation
      Tree, issuing in-network computing related flow tables to the
      switches, and real-time monitoring of the in-network computing
      task status.

   o    In-network Computing Switch (INC Switch): It is the core that
      offloads collective communication to network devices. It performs
      specific collective communication operations by receiving
      corresponding data and operation methods from the in-network
      computing client, and finally sends the results to the in-network
      computing client. It also provides related operation and
      maintenance data, such as in-network computing related task and
      message statistics.

   o    In-network Computing Client (INC client): It is the data source
      that needs to perform collective communication in in-network
      computing. It is deployed in the computing nodes and integrates
      with collective communication libraries such as MPI (Message
      Passing Interface) libraries and NCCL (NVIDIA Collective
      Communication Library) to send collective communication data to
      the in-network computing switch.

3.1. In-network Computing Aggregation Manager

   The main function of the in-network computing Aggregation Manager is
   to coordinate the establishment and dismantling of the collective
   communication group. It also manages the lifecycle of the collective
   communication group, and monitors the in-network computing switches
   and in-network computing clients through heartbeat detection.

   The in-network computing Aggregation Manager must be deployed in a
   location that can access the in-network computing switches and in-
   network computing clients; it connects to the in-network computing
   clients via gRPC and to the in-network computing switches via
   NETCONF.

   o    Topology Information. The in-network computing aggregation
      manager must be able to obtain the network topology and the
      capabilities of in-network computing switches, and display all in-
      network computing clients and in-network computing switches, as
      well as their connection relationships. This document focuses
      specifically on the tree topology, and does not discuss other
      topologies.

   o    Establishment of the Collective Communication Group. When using
      the offloading mode in collective communication, the in-network
      computing aggregation manager needs to calculate and determine
      which in-network computing switches have the capability and
      resources, and establish an aggregation tree. All unsupported
      devices will be excluded from the aggregation tree.

   1. Select a root switch; generally the root is a spine switch, so
   that all subsequent leaf switches can communicate directly with the
   root.

   2. Select the communication link between the root and leaf switches.

   3. The in-network computing Aggregation Manager configures the in-
   network computing switches via NETCONF, obtains information such as
   capabilities and RDMA parameters from the in-network computing
   switches, and sends it to the in-network computing clients.

   4. If there is any change in the topology during the lifecycle, the
   in-network computing Aggregation Manager needs to dismantle the
   collective communication group or establish a new aggregation tree.
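   The root and leaf selection in steps 1-2 above can be illustrated
   with a minimal Python sketch; the data structures and names are
   assumptions for illustration, not a normative format:

```python
# Illustrative sketch of aggregation-tree establishment: pick an
# INC-capable spine as the root, attach INC-capable leaf switches,
# and exclude devices without INC support.

def build_aggregation_tree(spines, leaves, capable):
    # Step 1: the root is a spine switch with INC capability.
    root = next(s for s in spines if s in capable)
    # Devices without INC support are excluded from the tree.
    members = [l for l in leaves if l in capable]
    # Step 2: select the links between the root and the leaf switches.
    return {"root": root, "leaves": members,
            "links": [(root, l) for l in members]}

tree = build_aggregation_tree(
    spines=["spine1", "spine2"],
    leaves=["leaf1", "leaf2", "leaf3"],
    capable={"spine1", "leaf1", "leaf2"})
print(tree["root"], tree["leaves"])   # spine1 ['leaf1', 'leaf2']
```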

   o    Dismantling of the Collective Communication Group. The
      conditions for dismantling the collective communication group
      include:

   1. In-network computing clients leaving the collective communication
   group.

   2. Failure of heartbeat detection for in-network computing clients.

   3. Link failure.

   4. Manual dismantling.

   o    Resource Allocation. Resource allocation and distribution are
      required for in-network computing, and the main functions include:

   1. Allocating identifiers for in-network computing: assigning an
   identifier to each in-network computing switch in the aggregation
   tree, and mapping the identifiers of in-network computing clients to
   the identifiers of in-network computing switches.

   2. Establishing QPs (Queue Pairs) in the RDMA protocol.

   3. Distributing the in-network computing forwarding tables to in-
   network computing clients and in-network computing switches: the in-
   network computing Aggregation Manager generates forwarding tables
   based on the aggregation tree and distributes them to in-network
   computing clients and in-network computing switches.

   4. Monitoring the status of in-network computing tasks: the in-
   network computing Aggregation Manager is responsible for monitoring
   the running status of in-network computing clients and in-network
   computing switches, including task status and statistics.
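   As a non-normative illustration of item 3, the following sketch
   derives per-node forwarding table entries from an aggregation tree;
   all field names (tree_id, role, parent, children) are invented for
   illustration only:

```python
# Illustrative sketch: derive a per-node forwarding table (tree id,
# role, parent, children) from an aggregation tree described as a
# parent -> children mapping.

def forwarding_tables(tree_id, root, children):
    # Root entry first; every other node gets its parent and role.
    tables = {root: {"tree_id": tree_id, "role": "root",
                     "parent": None, "children": children[root]}}
    for parent, kids in children.items():
        for k in kids:
            tables[k] = {"tree_id": tree_id,
                         "role": "inner" if k in children else "leaf",
                         "parent": parent,
                         "children": children.get(k, [])}
    return tables

t = forwarding_tables(1, "spine1",
                      {"spine1": ["leaf1", "leaf2"],
                       "leaf1": ["worker1", "worker2"]})
print(t["worker1"]["parent"], t["spine1"]["role"])   # leaf1 root
```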

3.2. In-network Computing Switch

   The in-network computing switch offloads collective communication of
   in-network computing clients. The in-network computing switch is
   directly or indirectly connected to the in-network computing clients
   and serves as the core for offloading collective communication to
   network devices. It performs specific collective communication
   operations by receiving corresponding data and instructions from in-
   network computing clients, and ultimately sends the results back to
   the client or clients. The interface and functions between the in-
   network computing switch and the in-network computing aggregation
   manager include:

   o    In-network computing related configuration processing,
      specifically including: configuring in-network computing
      management addresses, querying in-network computing aggregation
      trees, querying in-network computing statistics, and providing
      corresponding NETCONF interfaces.

   o    In-network computing packet parsing and encapsulation: parsing
      in-network computing packets sent from in-network computing
      clients, performing in-network computing processing, and then re-
      encapsulating the in-network computing packets to send to in-
      network computing clients or in-network computing root and leaf
      switches.

   o    Performing in-network computing processing based on the in-
      network computing forwarding table: supporting collective
      communication operations such as AllReduce, Broadcast, Barrier,
      etc.

   o    Providing in-network computing statistics: including packet
      statistics based on identity and packet statistics based on QP
      (Queue Pair).

3.3. In-network Computing Client

   The in-network computing client needs to integrate with collective
   communication libraries. OpenMPI and NCCL define standard collective
   communication interfaces, but allow third parties to provide their
   own implementations. By developing the INC Client to implement the
   collective communication interfaces of OpenMPI and NCCL, and
   implementing the collective communication algorithms for in-network
   computing, the INC Client can be integrated into the communication
   library through plugins or embedded directly.

   When the application calls the AllReduce interface of OpenMPI or
   NCCL, the call is handled directly by the INC Client. The INC Client
   sends the collective communication data, in the encapsulation format
   of in-network computing, to the in-network computing switch. The INC
   Client is also responsible for receiving the in-network computing
   response packets from the in-network computing switch and returning
   the results to the upper-layer application.
   The in-network computing client needs to have the following functions:

   o    Deployed within the computing node, used for integration with
      the MPI library and the NCCL library: it needs to provide plugins
      for integration with OpenMPI and NCCL respectively, as well as an
      INC Client library; the INC Client starts and stops together with
      the MPI process.

   o    Responsible for sending and receiving in-network computing
      packets: the in-network computing client sends in-network
      computing packets to the in-network computing switch based on the
      forwarding table issued by the in-network computing aggregation
      manager (including identity, task identification, QP, switch IP,
      etc.).

   o    Provide an interface for querying in-network computing task-
      related information: mainly including in-network computing task
      status, data block size, identity, task identification, QP, and
      message statistics.

   o    Provide INC Client logs.

4. Deployment

   Considering that the scale of the network can vary according to the
   size of the AI training task, in-network computing needs to support
   both single-level aggregation and multi-level aggregation. In
   general, the single-level aggregation method can meet the
   requirements of in-network computing. If the aggregation capacity of
   the in-network computing switch is insufficient, or in order to save
   bandwidth between switches, the multi-level aggregation method can
   be adopted. The networking diagram for single-level aggregation is
   as follows:

                  +-------------------------------------+
                  |           INC Switch                |
                  |                                     |
                  |         +-------------+             |
        +---------+         |   Leaf1     |             |
        |         |         |             |             |
        |         |         |  AllReduce  |             |
        |         |         +-/-+-------+-+             |
    +---+----+    +---------//--+-------+-\\------------+
    | INC AM |             /    |       |   \\
    +---+----+            /     |       |     \\
        |               //      |       |       \\
        |              /        |       |         \\
        |             /         |       |           \\
        | +---------//----------+-------+-------------\\--------+
        | | +------/---+ +------+---+ +-+--------+ +----\-----+ |
        +-| |INC Client| |INC Client| |INC Client| |INC Client| |
          | |          | |          | |          | |          | |
          | | Worker1  | | Worker2  | | Worker3  | | Worker4  | |
          | +----------+ +----------+ +----------+ +----------+ |
          +-----------------------------------------------------+

                 Figure 2 Single-level Aggregation Network

   In a single-level aggregation network environment, the following
   operations need to be implemented:

   o    The in-network computing aggregation manager generates
      aggregation trees and assigns Tree IDs for different computing
      tasks, and then sends the aggregation tree information to the
      switch.

   o    The in-network computing switch performs local aggregation
      based on the aggregation tree information upon receiving packets
      from the in-network computing client.

   o    The in-network computing switch broadcasts the local
      aggregation results to the in-network computing clients.

                  +----------------------------------------+
                  |             INC Switch                 |
                  |    +-------------+  +-------------+    |
                  |    |   Spine1    |  |   Spine2    |    |
                  |    |             |  |  AllReduce  |    |
        +---------+    +-+---------\\+  +/----------+-+    |
        |         |      |           \\//           |      |
        |         |      |           //\\           |      |
        |         |+-----+-------+ //    \\ +-------+-----+|
    +---+----+    ||   Leaf1     |/        \|   Leaf2     ||
    | INC AM |    ||  AllReduce  |          |  AllReduce  ||
    +---+----+    |++--------+---+          +-+----------++|
        |         +-+--------+----------------+----------+-+
        |           |        |                |          |
        |           |        |                |          |
        | +---------+--------+----------------+----------+------+
        | | +-------+--+ +---+------+ +-------+--+ +-----+----+ |
        +-| |INC Client| |INC Client| |INC Client| |INC Client| |
          | |          | |          | |          | |          | |
          | | Worker1  | | Worker2  | | Worker3  | | Worker4  | |
          | +----------+ +----------+ +----------+ +----------+ |
          +-----------------------------------------------------+

                 Figure 3 Multi-level Aggregation Network

   In a multi-level aggregation network environment, the following
   operations need to be implemented:

   o    The in-network computing aggregation manager generates
      aggregation trees and assigns Tree IDs for different computing
      tasks, then sends the aggregation tree information to the switch,
      and informs the switch of its role: leaf or root.

   o    The in-network computing switch first performs local
      aggregation based on the aggregation tree information upon
      receiving data packets from lower-level nodes.

   o    If it is the root, it indicates that the aggregation is
      completed, and broadcasts the aggregation result to all members.

   o    If it is not the root, it indicates the need for multi-level
      aggregation, and sends the local aggregation result to the upper-
      level in-network computing switch for further aggregation.

   o    When a leaf in-network computing switch receives the
      aggregation result from the upper-level in-network computing
      switch, it continues to broadcast the aggregation result to the
      members at the lower level.

5. Interaction Process

   The interaction process of in-network computing mainly consists of
   two parts, namely the control plane and the forwarding plane. The
   control plane is responsible for the establishment, resource
   allocation/release, and dismantling of in-network computing
   communication groups; the forwarding plane is responsible for
   executing the data processing tasks of specific in-network computing
   communication groups.

5.1. Control plane

   The deployment architecture model starts from the in-network
   computing client joining the collective communication group. The in-
   network computing aggregation manager allocates the corresponding
   resources for in-network computing by establishing the collective
   communication group. The in-network computing aggregation manager
   needs to be deployed in a network environment from which both the
   in-network computing switches and the in-network computing clients
   are reachable. It then needs to complete the registration of the in-
   network computing switch capabilities, discover the topology between
   the in-network computing switches and the in-network computing
   clients, and allocate/release the resources of the in-network
   computing switches according to the requirements of the in-network
   computing clients for the collective communication group.
   Communication between the in-network computing clients and the in-
   network computing switches, and between in-network computing
   switches, uses the RDMA protocol, so before RDMA communication it is
   necessary to apply for QPNs and create QPs. These resources can be
   allocated through CM (Communication Management), through the Socket
   API, or through the in-network computing aggregation manager.

   (1) Building a connection between RDMA QPs based on the Socket API
   requires establishing a TCP/IP connection between the two nodes
   through the Socket API, and then using this connection to exchange
   information about both QPs. The application program performs the
   TCP/IP three-way handshake, data exchange, and four-way close by
   calling the Socket API, and then starts to exchange information such
   as the QPN.
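   A minimal sketch of this Socket-API bootstrap, exchanging
   illustrative QP information (QPN/PSN) as JSON over a loopback TCP
   connection; the field names are assumptions, and a real
   implementation would exchange values obtained from the RDMA verbs
   library:

```python
# Illustrative sketch of option (1): two nodes exchange QP bootstrap
# information over a plain TCP socket before RDMA traffic starts.
import json, socket, threading

# Server side: bind and listen first so the client cannot connect
# before the listening socket exists.
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

peer_at_server = {}

def serve_exchange():
    conn, _ = srv.accept()
    with conn:
        # Receive the peer's QP info, then send ours back.
        peer_at_server.update(json.loads(conn.recv(4096)))
        conn.sendall(json.dumps({"qpn": 42, "psn": 7}).encode())

t = threading.Thread(target=serve_exchange)
t.start()

# Client side: send local QP info, receive the peer's in return.
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(json.dumps({"qpn": 17, "psn": 3}).encode())
    peer_at_client = json.loads(c.recv(4096))

t.join()
srv.close()
print(peer_at_server["qpn"], peer_at_client["qpn"])   # 17 42
```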

   (2) CM is a mechanism specifically used in RDMA technology to
   establish connections between QPs. It has a set of exclusive message
   formats, interaction processes, and user interfaces. The CM protocol
   establishes connections through multiple round-trip messages, and it
   also specifies the way to disconnect. Users control the CM to send
   and receive CM protocol messages through the CM programming interface,
   completing the interaction of GID, QPN, and other information.

   (3) Considering the complexity of implementation, it is also possible
   for the in-network computing aggregation manager to allocate the QPNs
   and QPs for the switches in in-network computing, while the QPN and
   QP allocation for each in-network computing client is done by the
   client itself, and the allocated information is synchronized to the
   in-network computing aggregation manager for unified pairing and
   management.
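   A toy sketch of this third option, in which a hypothetical
   aggregation manager allocates switch-side QPNs and records client
   registrations for unified pairing; all names and the QPN numbering
   are illustrative only:

```python
# Illustrative sketch of option (3): the AM allocates QPNs for the
# switches; clients allocate their own QPNs, register them with the
# AM, and the AM keeps the pairings.
import itertools

class AggregationManager:
    def __init__(self):
        self._next_qpn = itertools.count(0x100)  # switch-side QPN space
        self.pairs = []

    def alloc_switch_qpn(self, switch):
        # AM-side allocation for an in-network computing switch.
        return {"node": switch, "qpn": next(self._next_qpn)}

    def register_client_qp(self, client, qpn, switch_qp):
        # Unified pairing: record which client QP talks to which
        # switch QP.
        pair = ({"node": client, "qpn": qpn}, switch_qp)
        self.pairs.append(pair)
        return pair

am = AggregationManager()
sw_qp = am.alloc_switch_qpn("leaf1")
pair = am.register_client_qp("worker1", 0x11, sw_qp)
print(pair)
```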

5.2. Forwarding plane

   The forwarding plane process of in-network computing starts with the
   client sending a data packet. The in-network computing switches
   receive the data from the in-network computing clients based on the
   generated topology graph and process it. The data is then broadcast
   to all member clients. Considering the complexity of multi-level
   aggregation, the overall process is divided into upstream and
   downstream processes.

   We assume Worker1 and Worker2 are attached to the Leaf1 switch,
   Worker3 and Worker4 are attached to the Leaf2 switch, and the Leaf1
   and Leaf2 switches are attached to the root spine switch. The
   specific upstream process is shown in the following.

   o    Worker1 and Worker2 send the messages to be aggregated to
      Leaf1 according to the message format of RoCEv2, carrying
      corresponding information such as the QP and tree.

   o    Leaf1 receives the data from Worker1 and Worker2, aggregates
      it locally, and then sends the result to the spine switch,
      carrying information such as the QP and tree.

   o    Worker3 and Worker4 send the messages to be aggregated to
      Leaf2 according to the message format of RoCEv2, carrying
      corresponding information such as the QP and tree.

   o    Leaf2 receives the data from Worker3 and Worker4, aggregates
      it locally, and then sends the result to the spine switch,
      carrying information such as the QP and tree.

   o    The spine switch receives the information from Leaf1 and Leaf2
      and completes the aggregation.

   The specific downstream process is shown in the following.

   o    After the spine switch completes the aggregation, it locally
      replicates the aggregation result and sends it to the Leaf1 and
      Leaf2 switches respectively, carrying QP, tree, and other
      information.

   o    Leaf1 receives the aggregation result from the spine, completes
      the local replication, and then sends it to Worker1 and Worker2,
      carrying QP, tree, and other information.

   o    Leaf2 receives the aggregation result from the spine, completes
      the local replication, and then sends it to Work3 and Work4,
      carrying QP, tree, and other information.
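   The upstream and downstream flow above can be walked through with a
   toy Python model (no RDMA; elementwise sum stands in for the
   aggregation operation, and all values are illustrative):

```python
# Toy walk-through of the upstream/downstream flow: Worker1/2 feed
# Leaf1, Worker3/4 feed Leaf2, the leaves feed the spine, and the
# spine's result is replicated back down to every worker.

def aggregate(vectors):
    return [sum(col) for col in zip(*vectors)]

workers = {"Worker1": [1, 1], "Worker2": [2, 2],
           "Worker3": [3, 3], "Worker4": [4, 4]}

# Upstream: each leaf aggregates its attached workers, then the spine
# aggregates the leaf results.
leaf1 = aggregate([workers["Worker1"], workers["Worker2"]])
leaf2 = aggregate([workers["Worker3"], workers["Worker4"]])
spine = aggregate([leaf1, leaf2])

# Downstream: the spine replicates the result to the leaves, and each
# leaf replicates it to its workers.
results = {w: list(spine) for w in workers}
print(results["Worker1"])   # [10, 10]
```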

   The in-network computing switch is crucial for completing the
   aggregation operation. The following introduces the handling of
   aggregation on the switch.

   +-------------------------------------------------------------------+
   |  Tree ID=1                                                        |
   |                  slot0                          slot255           |
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |sum   | msg_0 |sum_1 | ... |sum_k |...|msg_255|sum_1 | ... |sum_k ||
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |                                                                   |
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |rank0 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |                                                                   |
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |rank1 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |                                                                   |
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   |rank63| msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
   |      +-------+------+-----+------+   +-------+------+-----+------+|
   +-------------------------------------------------------------------+

                    Figure 4 The aggregation operation

   As shown in the figure above, assume the following:

   o    For a certain tree id, the in-network computing switch needs to
      process 64 workers (represented by rank0-rank63).

   o    Each rank sends 256 messages at a time (represented by
      message0-message255).

   o    The in-network computing switch creates 256 aggregator pools
      (corresponding to slot0-slot255) for this tree id, with each slot
      responsible for aggregating a column of messages.

   For each slot, the switch checks whether the data from every rank
   under that slot has arrived. For example, for slot0, once the
   messages sent by rank0-rank63 have all been received and their tree
   id and message id are verified to match, the aggregation operation
   is performed on each data element (data1 to datak) in these
   messages:

   o    Aggregate rank0 data1, rank1 data1, and so on, up to rank63
      data1.

   o    Aggregate rank0 data2, rank1 data2, and so on, up to rank63
      data2.

   o    Continue until rank0 datak, rank1 datak, and so on, up to
      rank63 datak.
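   The three steps above amount to an elementwise sum across ranks for
   each data index. A minimal sketch, assuming Sum as the operation and
   an illustrative function name:

```python
def aggregate_slot(messages):
    """Sum data1..datak elementwise across ranks.

    messages: one payload per rank, each a list [data1, ..., datak].
    """
    k = len(messages[0])
    assert all(len(m) == k for m in messages)  # every rank sends k values
    # For each index i: rank0 data_i + rank1 data_i + ... + rank63 data_i.
    return [sum(m[i] for m in messages) for i in range(k)]

# e.g. 4 ranks, k = 3:
print(aggregate_slot([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]))
# prints [10, 10, 10]
```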

   After completing all the data aggregation, perform different
   operations based on whether the role is root or leaf:

   o    If it is the root, send the aggregated result to the leaf
      switches, clear the data under the slot, and update the expected
      message id.

   o    If it is a leaf, send the aggregated result of slot x to the
      root switch, and wait to receive the final aggregated result
      from the root before clearing the data under the slot and
      updating the expected message id.

   Each slot runs independently of the others. When a slot completes
   processing, it can begin processing the next message id on its own.
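   One slot's lifecycle, from collecting rank data to advancing the
   expected message id, might be modeled as below. This is a sketch
   with hypothetical names (Slot, on_packet); the leaf's wait for the
   root's final result is only noted in a comment, not modeled.

```python
class Slot:
    """One aggregator slot: collects per-rank data for the expected
    message id, aggregates when complete, then advances."""

    def __init__(self, num_ranks, role):
        self.num_ranks = num_ranks
        self.role = role              # "root" or "leaf"
        self.expected_msg_id = 0
        self.pending = {}             # rank -> payload for expected_msg_id

    def on_packet(self, rank, msg_id, payload):
        if msg_id != self.expected_msg_id:
            return None               # stale or early message id: ignore
        self.pending[rank] = payload
        if len(self.pending) < self.num_ranks:
            return None               # still waiting for other ranks
        # All ranks arrived: aggregate element by element.
        k = len(payload)
        result = [sum(p[i] for p in self.pending.values())
                  for i in range(k)]
        if self.role == "root":
            # Root sends the result downstream and frees the slot now.
            self.pending.clear()
            self.expected_msg_id += 1
        # A leaf would instead forward the result to the root and clear
        # the slot only after the final result comes back (not modeled).
        return result
```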

6. Packet encapsulation

   Communication between in-network computing switches and in-network
   computing clients is done through RDMA. RDMA requires a lossless
   network environment, so in an Ethernet environment the data
   messages for in-network computing are carried over RoCEv2. RDMA
   commonly uses RC (Reliable Connection) mode and UC (Unreliable
   Connection) mode. RC mode supports message acknowledgment and
   timeout retransmission: if a message times out without being
   acknowledged, that message and all subsequent messages are
   retransmitted. In UC mode, a connection is established in advance
   and messages do not need to carry address information, but
   acknowledgment and retransmission are not supported, so correct
   delivery to the other end is not guaranteed.

   The standard Ethernet/IP message format is used: UDP destination
   port 4791 identifies RoCEv2 messages; the Basic Transport Header
   (BTH) contains the fields that are present in all IBA transport
   services; the 16-byte RDMA Extended Transport Header (RETH) carries
   additional transport fields for RDMA operations; and the 4-byte
   Immediate Data Extended Transport Header (ImmDt) is followed by the
   data information related to in-network computing.

   The message carries the key information needed to execute the
   in-network computing operation, which includes the following:

   (1) Aggregation Tree ID: identifying the aggregation tree used for
   the collective communication.

   (2) Collective communication type: the specific collective
   operation to be performed, such as AllReduce, Broadcast, or
   Barrier.

   (3) Data type: the data type of the payload elements, such as 16-,
   32-, or 64-bit IEEE 754 floating point.

   (4) Operation type: the operation the in-network computing switch
   performs on the received collective communication data, such as Sum
   (add the data together), Min (find the minimum value), or Max (find
   the maximum value).

   (5) Payload: the data that is transferred through RDMA for
   in-network computing.
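   A possible encoding of these fields, placed after the BTH/RETH/ImmDt
   headers, is sketched below. The field widths, field order, and code
   point values here are assumptions made for the sketch; the draft
   names the fields but does not assign a wire layout.

```python
import struct

# Hypothetical code points; not assigned by the draft.
COMM_TYPES = {"AllReduce": 0, "Broadcast": 1, "Barrier": 2}
DATA_TYPES = {"fp16": 0, "fp32": 1, "fp64": 2}   # IEEE 754 widths
OP_TYPES   = {"Sum": 0, "Min": 1, "Max": 2}

def encode_inc_fields(tree_id, comm, dtype, op):
    # Assumed layout: 4-byte aggregation tree id in network byte
    # order, then one byte each for communication, data, and
    # operation type.
    return struct.pack("!IBBB", tree_id,
                       COMM_TYPES[comm], DATA_TYPES[dtype], OP_TYPES[op])

fields = encode_inc_fields(1, "AllReduce", "fp32", "Sum")
# 7 bytes, followed on the wire by the RDMA payload
```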

7. Transport layer requirements

   Data packets may be lost due to link quality, switch buffer
   overflow, or other abnormal conditions. If packet loss occurs, the
   in-network computing client is responsible for retransmission. If
   RC mode is used, all retransmissions are handled by the RDMA
   transport layer. If UC mode is used, the retransmission process for
   in-network computing is as follows:

   (1) The in-network computing client sends a packet with MessageID = n
   and starts the packet retransmission timer.

   (2) If the corresponding response packet with MessageID = n is
   received before the retransmission timer times out, the next
   MessageID packet is sent and the packet retransmission timer is reset.

   (3) If the packet retransmission timer times out, the packet with
   MessageID = n is retransmitted and the timer is restarted, until
   the packet is successfully acknowledged.

   (4) A threshold N can be set: if N timeouts occur without a
   successful transmission, the aggregation manager is notified for
   error handling.

   The in-network computing switch processes data packets passively.
   To determine whether a received packet is a retransmission and to
   prevent duplicate aggregation, the switch needs to record whether
   the packet with the corresponding MessageID has already been
   received.
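   Steps (1)-(4) on the client side and the duplicate check on the
   switch side can be sketched as follows. The class names are
   hypothetical, and the retransmission timer is modeled as a retry
   counter rather than a real clock.

```python
class IncClient:
    """UC-mode sender side of the retransmission procedure."""

    def __init__(self, max_timeouts):
        self.msg_id = 0                   # (1) MessageID = n in flight
        self.timeouts = 0
        self.max_timeouts = max_timeouts  # threshold N from step (4)

    def on_response(self, msg_id):
        # (2) ack for the current message: advance and reset the timer.
        if msg_id == self.msg_id:
            self.msg_id += 1
            self.timeouts = 0

    def on_timeout(self):
        # (3) timer expired: retransmit MessageID = n ...
        self.timeouts += 1
        if self.timeouts >= self.max_timeouts:
            return "notify_manager"       # (4) ... unless N is reached
        return "retransmit"


class IncSwitch:
    """Passive side: remember MessageIDs to avoid duplicate aggregation."""

    def __init__(self):
        self.seen = set()

    def accept(self, msg_id):
        if msg_id in self.seen:
            return False                  # retransmitted duplicate: drop
        self.seen.add(msg_id)
        return True
```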

8. Security Considerations

   The in-network computing scheme may introduce security and privacy
   concerns.

   Offloading collective operations may introduce new risks to the
   network. The content of the information exchanged among the INC
   aggregation manager, INC switches, and INC hosts may be topologically
   sensitive. It is possible to disclose the location information of
   computing resources hosted in the network and service sites, and
   attackers can use this information to identify vulnerable points in
   the network. For example, an attacker may tamper with network
   topology information to interrupt customer service delivery, or
   even redirect traffic elsewhere. The solution should
   support authentication and integrity protection mechanisms to enhance
   security.

9. IANA Considerations

   TBD

10. References

10.1. Normative References

   [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
             Specification Version 2", RFC 5531, May 2009,
             <https://www.rfc-editor.org/info/rfc5531>.

   [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
             Ed., and A. Bierman, Ed., "Network Configuration Protocol
             (NETCONF)", RFC 6241, June 2011,
             <https://www.rfc-editor.org/info/rfc6241>.

10.2. Informative References

   [I-D.yao-tsvwg-cco-problem-statement-and-usecases]
             Yao, K., Xu, S., Li, Y., Huang, H., and D. Kutscher,
             "Collective Communication Optimization: Problem Statement
             and Use cases", Work in Progress, Internet-Draft, draft-
             yao-tsvwg-cco-problem-statement-and-usecases-00, 23
             October 2023, <https://datatracker.ietf.org/doc/draft-
             yao-tsvwg-cco-problem-statement-and-usecases/>.

   [I-D.yao-tsvwg-cco-requirement-and-analysis]
             Yao, K., Xu, S., Li, Y., Huang, H., Wang, W., and D.
             Kutscher, "Collective Communication Optimizations:
             Requirement and Analysis", Work in Progress, Internet-
             Draft, draft-yao-tsvwg-cco-requirement-and-analysis-01, 5
             February 2024, <https://datatracker.ietf.org/doc/draft-
             yao-tsvwg-cco-requirement-and-analysis/>.

11. Acknowledgments

   TBD

Liu, et al.            Expires August 28, 2024               [Page 18]
Internet-Draft               RoCEv2 CCO                  February 2024

Authors' Addresses

   Feng Liu
   New H3C Technologies Co., Ltd
   Hangzhou, China
   Email: 11957147@qq.com

   Weifeng Wang
   New H3C Technologies Co., Ltd
   Beijing, China
   Email: wangweifeng@h3c.com

   Rubing Liu
   New H3C Technologies Co., Ltd
   Hangzhou, China
   Email: liurubing@h3c.com

   Yan Mu
   China Mobile
   Beijing, China
   Email: muyan@chinamobile.com

   Kehan Yao
   China Mobile
   Beijing, China
   Email: yaokehan@chinamobile.com
