Internet-Draft APDN March 2024
Wang, et al. Expires 2 September 2024
Workgroup:
Network Working Group
Internet-Draft:
draft-wh-rtgwg-application-aware-dc-network-02
Published:
March 2024
Intended Status:
Standards Track
Expires:
2 September 2024
Authors:
H. Wang
Huawei
K. Yao
China Mobile
W. Pan
Huawei
H. Huang
Huawei

Application-aware Data Center Network (APDN) Use Cases and Requirements

Abstract

The deployment of large-scale AI services within data centers introduces significant challenges to established technologies, including load balancing and congestion control. Additionally, the adoption of cutting-edge network technologies, such as in-network computing, is on the rise within AI-centric data centers. These advanced network-assisted application acceleration technologies necessitate the flexible exchange of cross-layer interaction information between end-hosts and network nodes.

The Application-aware Data Center Network (APDN) leverages the application-side extension of the Application-aware Networking (APN) framework to furnish the data center network with detailed application-aware information. This approach facilitates the rapid advancement of network-application co-design technologies. This document describes the use cases of APDN and outlines the associated requirements, setting the stage for enhanced performance and efficiency in data center operations tailored to the demands of AI services.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 2 September 2024.

1. Introduction

The advent of large AI models such as AlphaGo and ChatGPT4 has made distributed training of large AI models a pivotal workload in large-scale data centers. To train these models efficiently, a large number of computing units (for example, thousands of GPUs operating in tandem) are deployed for parallel processing, with the aim of minimizing job completion time (JCT). This setup requires frequent, bandwidth-heavy communication among concurrent computing nodes, introducing a multi-party communication mode that demands higher throughput, better load balancing, and stronger congestion management from the data center network.

Traditionally, data center technology primarily views the network as a mere conduit for data transmission for upper-layer applications, offering basic connectivity services. Yet, the scenario of large AI model training is increasingly incorporating network-assisted technologies, such as offloading parts of the computation to the network. This approach seeks to boost AI job efficiency through the joint optimization of network communication and computing applications. In many current instances of network assistance, operators tailor and implement proprietary protocols on a limited scale, leading to a lack of widespread interoperability.

However, as AI data centers grow and diversify in offering cloud services for various AI tasks, emerging data center network technologies must account for serving different transports and applications. Building large-scale data centers now involves not just ensuring device interoperability but also facilitating interaction between network devices and end-host services.

This document illustrates use cases that require application-aware information to be exchanged between network nodes and applications. Current ways of conveying such information are limited by the extensibility of packet headers: only coarse-grained information can be transmitted between the network and the host through a limited space (for example, one-bit ECN [RFC3168] or DSCP in the IP layer).

The Application-aware Networking (APN) framework [I-D.li-apn-framework] delineates how application-aware information, including APN identification (ID) and/or parameters (e.g., network performance requirements), is encapsulated by network edge devices. This information is then carried in packets across an APN domain to support service provisioning, enable fine-grained traffic steering, and adjust network resources. An extension of the APN framework caters to the application side [I-D.li-rtgwg-apn-app-side-framework], allowing APN domain resources to be allocated to applications that encapsulate the APN attribute in packets.

This document delves into the application side of the APN framework to foster enriched interaction between hosts and networks within the data center, outlining several use cases and the corresponding requirements for Application-aware Data center Network (APDN).

1.1. Terminology

APDN: APplication-aware Data center Network

SQN: SeQuence Number

ToR: Top of Rack switch

PFC: Priority-based Flow Control

NIC: Network Interface Card

ECMP: Equal-Cost Multi-Path routing

AI: Artificial Intelligence

JCT: Job Completion Time

PS: Parameter Server

INC: In-Network Computing

APN: APplication-aware Networking

1.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Use Cases and Requirements for Application-aware Data Center Network

2.1. Fine-grained packet scheduling for load balancing

Traditional data centers utilize the per-flow Equal-Cost Multi-Path (ECMP) method to distribute traffic evenly across several paths. These centers, primarily focused on cloud computing, handle a vast number of data flows. Despite the large quantity, these flows are predominantly small and short-lived, allowing the ECMP method to facilitate a nearly uniform traffic distribution across multiple pathways.

In contrast, communication patterns change markedly during the training of large AI models. This process demands unprecedented bandwidth: a single data flow between machines can saturate the upstream bandwidth of a server's egress Network Interface Card (NIC), with single-flow throughput approaching or exceeding a multiple (X) of 100 Gb/s.

Applying traditional per-flow ECMP strategies, such as hash-based or round-robin algorithms, often results in the concurrent allocation of large ("elephant") flows to a single pathway. This can lead to severe congestion, notably when two simultaneous 100Gb/s flows vie for the same 100Gb/s bandwidth, significantly impacting the completion time for AI jobs.

To mitigate these issues, there is a shift towards fine-grained, per-packet ECMP. This approach distributes the packets of a single flow across multiple paths, improving balance and reducing congestion. However, because delays (propagation and switching) differ across these paths, packets may arrive at the destination significantly out of order, degrading the performance of both the transport and application layers.
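The difference between per-flow and per-packet ECMP can be illustrated with a minimal sketch. The number of paths, the flow tuples, and the use of CRC32 as the hash are assumptions made purely for illustration; real switches use vendor-specific hash functions and spraying policies.

```python
import zlib
from collections import Counter

NUM_PATHS = 4  # assumed number of equal-cost paths


def per_flow_path(flow, num_paths=NUM_PATHS):
    """Per-flow ECMP: every packet of a flow hashes to the same path."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_paths


def load_per_path(flows, packets_per_flow, per_packet):
    """Count how many packets land on each path.

    per_packet=False pins each flow to one hashed path;
    per_packet=True sprays packets round-robin over all paths.
    """
    load = Counter({p: 0 for p in range(NUM_PATHS)})
    next_packet = 0  # round-robin counter used only when spraying
    for flow in flows:
        for _ in range(packets_per_flow):
            if per_packet:
                path = next_packet % NUM_PATHS
                next_packet += 1
            else:
                path = per_flow_path(flow)
            load[path] += 1
    return load
```

With two elephant flows, the per-flow variant can use at most two of the four paths and may place both flows on the same one, whereas the per-packet variant spreads the load evenly at the cost of possible reordering.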

A viable solution is to combine per-packet ECMP with resequencing of out-of-order packets at the egress Top-of-Rack (ToR) switch. This assumes that multipath transmission extends from the ingress ToR to the egress ToR. The reordering principle is that packets leave the last ToR in the same order in which they arrived at the first ToR.

Achieving packet reordering at the egress ToR necessitates a clear indication of packet arrival sequences at the ingress ToR. Current protocols do not directly mark sequence numbers (SQNs) at the Ethernet and IP layers.

  • Presently, SQNs are encapsulated within transport layers (e.g., TCP, QUIC, RoCEv2) or application protocols. Relying on these SQNs for packet reordering requires network devices to interpret a vast array of transport/application layer information.

  • SQNs at the transport/application layer are allocated per flow, with each having distinct sequence number spaces and initial values. These cannot directly represent the packet arrival sequence at the initial ToR. Although assigning a specific reordering queue to each flow at the egress ToR and reordering based on upper-layer SQNs is conceivable, the associated hardware resource demands are significant.

  • Direct modification of upper-layer SQNs by network devices to reflect ToR-ToR pairwise SQNs compromises end-to-end transmission reliability.

Consequently, a mechanism to convey specific order information across the multipath forwarding domain, from the initial to the final device with reordering capabilities, is essential.

The Application-aware Networking (APN) framework is proposed to transport critical ordering information. In this context, it records the sequence number of packets as they arrive at the ingress ToR (each ToR-ToR pair having a unique, incremental SQN), facilitating packet reordering by the egress ToR based on this data.
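The tagging and reordering behaviour described above can be sketched as follows. The class names, the dictionary-based packet representation, and the per-pair heap are hypothetical choices for illustration and are not defined by the APN framework. Packet loss, timeouts, and SQN wraparound are deliberately left out.

```python
import heapq
from collections import defaultdict


class IngressToR:
    """Assigns an incrementing SQN per (ingress, egress) ToR pair."""

    def __init__(self, name):
        self.name = name
        self.next_sqn = defaultdict(int)  # egress ToR -> next SQN to assign

    def tag(self, egress, payload):
        sqn = self.next_sqn[egress]
        self.next_sqn[egress] += 1
        return {"pair": (self.name, egress), "sqn": sqn, "payload": payload}


class EgressToR:
    """Releases the packets of each ToR pair in SQN order."""

    def __init__(self):
        self.expected = defaultdict(int)  # pair -> next SQN to release
        self.pending = defaultdict(list)  # pair -> min-heap keyed by SQN

    def receive(self, pkt):
        pair = pkt["pair"]
        heap = self.pending[pair]
        heapq.heappush(heap, (pkt["sqn"], pkt["payload"]))
        released = []
        # Drain every packet that is now in sequence.
        while heap and heap[0][0] == self.expected[pair]:
            _, payload = heapq.heappop(heap)
            released.append(payload)
            self.expected[pair] += 1
        return released
```

Because the SQN space is kept per ToR-ToR pair rather than per flow, the egress device needs one reordering queue per peer ToR in this sketch, independent of how many flows share that pair.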

Requirements:

  • [REQ1-1] The APN framework SHOULD tag each packet with an SQN alongside the APN ID to enable reordering. The ingress ToR SHOULD assign and log an SQN for each packet based on its arrival sequence, with SQN granularity adaptable to ToR-ToR, port-port, or queue-queue levels.

  • [REQ1-2] The APN-encapsulated SQN MUST remain unaltered within the multipathing domain and may be removed at the egress device.

  • [REQ1-3] The APN framework SHOULD convey necessary queue information (i.e., the sorting queue ID) to support fine-grained reordering. The queue ID SHOULD match the granularity of SQN assignments. Additionally, the APN framework MAY transport path details to expedite the differentiation between out-of-order packets and packet loss.

2.2. Enhancing Distributed Machine Learning Training with In-Network Computing

Distributed machine learning training frequently employs the AllReduce communication mode [mpi-doc] for efficient cross-accelerator data transfer. This method is pivotal in scenarios involving data and model parallelism, where parallel execution across multiple processors necessitates the exchange of intermediate results, such as gradient data, as a core component of the communication process.

The Parameter Server (PS) architecture [atp], which centralizes gradient data aggregation through a server from multiple clients and redistributes the aggregated results, often faces incast congestion challenges due to simultaneous large-volume data transmissions to the server.

In-network computing (INC) introduces a paradigm shift by delegating the server's processing tasks to network switches. Utilizing network devices equipped with high-capacity switching and computational abilities (for basic arithmetic operations) as surrogate parameter servers for gradient aggregation enables the consolidation of multiple data streams into a singular network stream. This approach not only alleviates server-side incast congestion but also leverages the superior speed of on-switch computing (e.g., ASICs) over traditional server-based processing (e.g., CPUs), offering a boon to distributed computing applications.
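A minimal sketch of switch-side gradient aggregation is given below. The task and chunk identifiers, the fixed worker count, and the use of integer (fixed-point style) gradient values are assumptions for illustration; they do not reflect the packet format of ATP, NetReduce, or any APN encoding.

```python
class AggregationSwitch:
    """Sums gradient chunks from all workers, then emits one result."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.partial = {}  # (task_id, chunk_id) -> running element-wise sum
        self.seen = {}     # (task_id, chunk_id) -> worker ids already counted

    def on_packet(self, task_id, chunk_id, worker_id, gradients):
        key = (task_id, chunk_id)
        seen = self.seen.setdefault(key, set())
        if worker_id in seen:
            return None  # ignore a duplicate from the same worker
        seen.add(worker_id)
        acc = self.partial.setdefault(key, [0] * len(gradients))
        for i, g in enumerate(gradients):
            acc[i] += g
        if len(seen) < self.num_workers:
            return None  # still waiting for the remaining workers
        del self.seen[key]
        return self.partial.pop(key)  # one aggregated stream leaves the switch
```

In this sketch the switch forwards nothing until every worker has contributed, so N incoming streams collapse into a single outgoing one, which is the incast relief described above.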

As outlined in [I-D.draft-lou-rtgwg-sinc], the realization of INC requires network devices to comprehend the computing tasks dictated by applications, including the accurate parsing of relevant data units and the coordination of synchronization signals across diverse data sources.

Present implementations like ATP [atp] and NetReduce [netreduce] necessitate that switches interpret upper-layer protocols and application-specific logic, which remains tailored to particular applications due to the absence of standardized transport or application protocols for INC. To accommodate a broad spectrum of INC applications, network devices must exhibit versatility across various protocol formats.

Moreover, while end users may encrypt payloads for security, they might be inclined to expose certain non-sensitive data to benefit from accelerated INC operations. However, the current protocol landscape does not facilitate easy access to necessary INC data without decrypting the entire payload, posing interoperability challenges between applications and INC functionalities.

The Application-aware Networking (APN) framework emerges as a solution, capable of conveying essential information for INC tasks and their associated data segments, thereby enabling the offloading of specific computational tasks to the network.

Requirements:

  • [REQ2-1] The APN framework MUST include identifiers to differentiate among INC tasks.

  • [REQ2-2] The APN framework MUST accommodate the transport of application data in varied formats and lengths, such as gradient data for INC, along with the specified operations.

  • [REQ2-3] To augment INC efficiency, the APN framework SHOULD transmit additional application-aware information to support computational processes without undermining end-to-end transport reliability.

  • [REQ2-4] The APN framework MUST have the capability to convey comprehensive INC outcomes and document the computational status within data packets.

2.3. Enhanced Congestion Control with Precise Feedback Mechanisms

Data center environments encompass various congestion scenarios, notably:

  • The prevalent use of multi-accelerator collaborative AI model training, employing AllReduce and All2All communication patterns (Section 2.2), often leads to server-side incast congestion as multiple clients simultaneously transmit substantial volumes of gradient data.

  • Diverse load balancing methodologies across different flows can induce overload conditions on specific links.

  • The inherent randomness of service access within data centers frequently triggers traffic bursts, extending queue lengths and precipitating congestion.

To mitigate these challenges, the industry has developed an array of congestion control algorithms tailored for data center networks. ECN-based congestion control mechanisms, such as DCTCP [RFC8257] and DCQCN [dcqcn], leverage ECN marks based on switch buffer occupancy levels to signal congestion.
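As an example of how a sender reacts to ECN marks, the core DCTCP update described in [RFC8257] can be sketched as below. The function names are hypothetical, and the gain g = 1/16 is used here as an assumed default. The sketch covers only the congestion estimate and window reduction, not the full protocol.

```python
def update_alpha(alpha, marked_bytes, acked_bytes, g=1 / 16):
    """Update the congestion estimate once per observation window.

    M is the fraction of acknowledged bytes that carried an ECN mark;
    alpha is an exponentially weighted moving average of M.
    """
    m = marked_bytes / acked_bytes if acked_bytes else 0.0
    return alpha * (1 - g) + g * m


def reduce_cwnd(cwnd, alpha):
    """Scale the congestion window in proportion to the estimate."""
    return cwnd * (1 - alpha / 2)
```

The sender can only infer the extent of congestion from the fraction of marked packets over time, since each individual packet carries a single mark.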

However, these approaches are constrained by the use of a single 1-bit mark within packet headers to denote congestion, limiting the scope of conveyed congestion details due to header space restrictions. Alternative strategies, such as HPCC++ [I-D.draft-miao-ccwg-hpcc], adopt in-band telemetry to cumulatively append congestion data at each hop, which increases packet length and bandwidth consumption.

A compromise solution, AECN [I-D.draft-shi-ippm-advanced-ecn], endeavors to encapsulate critical congestion indicators along the path while minimizing data overhead through hop-by-hop aggregation, including queue delay and congested hop counts. This model allows end-hosts to specify the congestion metrics of interest, with network devices incrementally compiling this data en route. APN frameworks can facilitate this nuanced exchange, enabling tailored congestion data accumulation.
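The trade-off between hop-by-hop aggregation and per-hop appending can be illustrated with a small sketch. The field names, the microsecond unit, and the congestion threshold are assumptions for illustration only and do not reflect the encoding defined in the AECN or HPCC++ drafts.

```python
from dataclasses import dataclass


@dataclass
class AggregatedCongestion:
    """Fixed-size field updated in place at every hop (assumed layout)."""
    max_queue_delay_us: int = 0
    congested_hops: int = 0


def update_at_hop(field, queue_delay_us, threshold_us=50):
    """Aggregation style: fold this hop's measurement into the field."""
    field.max_queue_delay_us = max(field.max_queue_delay_us, queue_delay_us)
    if queue_delay_us > threshold_us:
        field.congested_hops += 1
    return field


def append_at_hop(records, hop_id, queue_delay_us):
    """In-band telemetry style: one record is appended per hop."""
    records.append((hop_id, queue_delay_us))
    return records
```

The aggregated field keeps a constant size regardless of path length, while the appended record list grows with every hop. The price is that per-hop detail is lost unless a collector identifier is also carried.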

Requirements:

  • [REQ3-1] The APN framework MUST empower data senders to specify the congestion metrics they wish to gather.

  • [REQ3-2] The APN framework MUST enable network nodes to log and update selected measurements accordingly. This may encompass metrics such as port queue lengths, link monitoring rates, PFC frame counts, probed RTTs, and variability, among others. Additionally, the APN framework MAY tag each measurement with its collector, assisting in the identification of potential congestion points.

3. Encapsulation

The encapsulation in the APN Header [I-D.draft-li-apn-header] of the application-aware information required by the APDN use cases will be defined in a future version of this draft.

5. IANA Considerations

This document has no IANA actions.

6. References

6.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.

6.2. Informative References

[mpi-doc]
"Message-Passing Interface Standard", n.d., <https://www.mpi-forum.org/docs/mpi-4.1>.
[dcqcn]
"Congestion Control for Large-Scale RDMA Deployments", n.d., <https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf>.
[netreduce]
"NetReduce - RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration", n.d., <https://arxiv.org/abs/2009.09736>.
[atp]
"ATP - In-network Aggregation for Multi-tenant Learning", n.d., <https://www.usenix.org/conference/nsdi21/presentation/lao>.
[I-D.li-apn-framework]
Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C., and G. S. Mishra, "Application-aware Networking (APN) Framework", Work in Progress, Internet-Draft, draft-li-apn-framework-07, <https://datatracker.ietf.org/doc/html/draft-li-apn-framework-07>.
[I-D.li-rtgwg-apn-app-side-framework]
Li, Z. and S. Peng, "Extension of Application-aware Networking (APN) Framework for Application Side", Work in Progress, Internet-Draft, draft-li-rtgwg-apn-app-side-framework-00, <https://datatracker.ietf.org/doc/html/draft-li-rtgwg-apn-app-side-framework-00>.
[I-D.draft-lou-rtgwg-sinc]
Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao, "Signaling In-Network Computing operations (SINC)", Work in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, <https://datatracker.ietf.org/doc/html/draft-lou-rtgwg-sinc-01>.
[RFC8257]
Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., and G. Judd, "Data Center TCP (DCTCP): TCP Congestion Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, <https://www.rfc-editor.org/rfc/rfc8257>.
[I-D.draft-miao-ccwg-hpcc]
Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B., Tantsura, J., Alemania, A., and Y. Shpigelman, "HPCC++: Enhanced High Precision Congestion Control", Work in Progress, Internet-Draft, draft-miao-ccwg-hpcc-02, <https://datatracker.ietf.org/doc/html/draft-miao-ccwg-hpcc-02>.
[I-D.draft-shi-ippm-advanced-ecn]
Shi, H., Zhou, T., and Z. Li, "Advanced Explicit Congestion Notification", Work in Progress, Internet-Draft, draft-shi-ippm-advanced-ecn-00, <https://datatracker.ietf.org/doc/html/draft-shi-ippm-advanced-ecn-00>.
[I-D.draft-li-apn-header]
Li, Z., Peng, S., and S. Zhang, "Application-aware Networking (APN) Header", Work in Progress, Internet-Draft, draft-li-apn-header-04, <https://datatracker.ietf.org/doc/html/draft-li-apn-header-04>.

Authors' Addresses

Haibo Wang
Huawei
Kehan Yao
China Mobile
Wei Pan
Huawei
Hongyi Huang
Huawei