Network Working Group                                            H. Wang
Internet-Draft                                                  H. Huang
Intended status: Standards Track                                  Huawei
Expires: 8 May 2024                                      5 November 2023
Application-aware Data Center Network (APDN) Use Cases and Requirements
draft-wh-rtgwg-application-aware-dc-network-01
Abstract
Deploying large-scale AI services in data centers poses new
challenges to traditional technologies such as load balancing and
congestion control. Besides, emerging network technologies such as
in-network computing are gradually accepted and used in AI data
centers. These network-assisted application acceleration
technologies require that cross-layer interaction information can be
flexibly transmitted between end-hosts and network nodes.
APDN (Application-aware Data Center Network) adopts the APN framework
for the application side to provide more application-aware information
to the data center network, enabling the fast evolution of network-
application co-design technology.  This document elaborates use cases
of APDNs and proposes the corresponding requirements.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 8 May 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
Wang & Huang              Expires 8 May 2024                    [Page 1]
Internet-Draft APDN November 2023
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Requirements Language . . . . . . . . . . . . . . . . . . 4
2. Use Cases and Requirements for Application-aware Data Center
Network . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1. Fine-grained packet scheduling for load balancing . . . . 4
2.2. In-network computing for distributed machine learning
training . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3. Refined congestion control that requires feedback of
accurate congestion information . . . . . . . . . . . . . 7
3. Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . 9
4. Security Considerations . . . . . . . . . . . . . . . . . . . 9
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9
6. References . . . . . . . . . . . . . . . . . . . . . . . . . 9
6.1. Normative References . . . . . . . . . . . . . . . . . . 9
6.2. Informative References . . . . . . . . . . . . . . . . . 9
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 11
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11
1. Introduction
Distributed training of large AI models has gradually become an
important workload in large-scale data centers since the emergence of
models such as AlphaGo and ChatGPT.  In order to improve the
efficiency of large model training, large numbers of computing units
(for example, thousands of GPUs running simultaneously) perform the
computation in parallel to reduce the JCT (job completion time).
These concurrent computing nodes require periodic and bandwidth-
intensive communication.
This new multi-party communication mode among computing units places
higher requirements on the throughput performance, load-balancing
capability, and congestion-handling capability of the entire data
center network.  Traditional data center technology usually regards
the network purely as the data transmission carrier for upper-layer
applications, and the network
provides basic connectivity services. However, in the scenario of
large AI model training, network-assisted technology (e.g.,
offloading partial computation in the network) is being introduced to
improve the efficiency of AI jobs by joint optimization of network
communication and computing applications.  In most existing network-
assisted cases, network operators customize and implement private
protocols within a very limited scope and cannot achieve general
interoperability.  However, emerging data center network technology
needs to serve different transports and applications, as the scale of
AI data centers continues to increase and there is a trend toward
providing cloud services for different AI jobs.  The construction of
large-scale data centers needs to consider not only general
interoperability between devices but also interoperability between
network devices and end-host services.
This document illustrates use cases that require application-aware
information to be exchanged between network nodes and applications.
Current ways of conveying such information are limited by the
extensibility of packet headers: only coarse-grained information can
be transmitted between the network and the host through a limited
space (for example, the one-bit ECN mark [RFC3168] at the IP layer).
The Application-aware Networking (APN) framework
[I-D.li-apn-framework] defines application-aware information (i.e.,
the APN attribute), including an APN identification (ID) and/or APN
parameters (e.g., network performance requirements), that is
encapsulated at network edge devices and carried in packets
traversing an APN domain in order to facilitate service provisioning
and to perform fine-granularity traffic steering and network resource
adjustment.  The APN framework for the application side
[I-D.li-rtgwg-apn-app-side-framework] defines an extension of the APN
framework in which the APN resources of an APN domain are allocated
to applications, which compose and encapsulate the APN attribute in
packets.
This document explores the APN framework for the application side to
provide richer interactive information between hosts and networks
within the data center.  It presents several use cases and proposes
the corresponding requirements for the APplication-aware Data center
Network (APDN).
1.1. Terminology
APDN: APplication-aware Data center Network
SQN: SeQuence Number
TOR: Top Of Rack switch
PFC: Priority-based Flow Control
NIC: Network Interface Card
ECMP: Equal-Cost Multi-Path routing
AI: Artificial Intelligence
JCT: Job Completion Time
PS: Parameter Server
INC: In-Network Computing
APN: APplication-aware Network
1.2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Use Cases and Requirements for Application-aware Data Center Network
2.1. Fine-grained packet scheduling for load balancing
Traditional data centers adopt the per-flow ECMP method to balance
traffic across multiple paths.  In traditional data centers that
focus on cloud computing, due to the diversity of services and the
randomness of access, the number of data flows is large, but most
flows are typically small and short.  The ECMP method can therefore
achieve a nearly equal distribution of traffic over multiple paths.
In contrast, the communication pattern during large AI model training
is different.  The traffic is observed to require larger bandwidth
than ever: a single data flow between machines can often saturate the
upstream bandwidth of the entire server's egress NIC (for example,
the throughput of a single data flow can reach nearly X*100GB).  When
per-flow ECMP (e.g., hash-based or round-robin ECMP) is applied, it
is common for concurrent elephant flows to be distributed to a single
path.  For example, two concurrent 100 Gb/s flows may be placed on
the same path, competing for the available 100 Gb/s of bandwidth.  In
such a case, traffic congestion is severe and greatly affects the
flow completion time of AI jobs.
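As a non-normative illustration of the collision described above, the
following Python sketch shows how per-flow ECMP pins every packet of a
flow to one path chosen from a hash of its 5-tuple; the hash function,
path count, and 5-tuples are hypothetical, not part of this document:

```python
import hashlib

NUM_PATHS = 4  # hypothetical number of equal-cost paths

def ecmp_path(five_tuple):
    """Deterministically map a flow 5-tuple to a path index."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % NUM_PATHS

# (src IP, dst IP, protocol, src port, dst port) -- illustrative values
flow_a = ("10.0.0.1", "10.0.1.1", 17, 40000, 4791)
flow_b = ("10.0.0.2", "10.0.1.2", 17, 40001, 4791)

# Every packet of a flow follows the same path, so when the two
# indices collide, two 100 Gb/s elephant flows compete for one
# 100 Gb/s path while the remaining paths stay idle.
print(ecmp_path(flow_a), ecmp_path(flow_b))
```

Because the path choice is a pure function of the 5-tuple, such a
collision persists for the lifetime of both flows, however congested
the shared path becomes.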
Therefore, it is necessary to implement fine-grained per-packet ECMP
-- all the packets of the same flow are sprayed over multiple paths
to achieve balance and avoid congestion.  Due to differences in the
delay (propagation, switching) of different paths, packets of the
same flow are likely to arrive at the end-host in considerable
disorder, causing performance degradation in the upper transport and
application layers.  To this end, a feasible method is to reorder the
disordered packets at the egress TOR (top-of-rack switch) when
applying per-packet ECMP.  Assuming the scope of multipath
transmission extends from the ingress TOR to the egress TOR, the
principle of reordering is that, for each TOR-TOR pair, the order in
which packets leave the last TOR is consistent with the order in
which they arrived at the first TOR.
To realize packet reordering at the egress TOR, the order in which
packets enter the ingress TOR must be clearly indicated.  Looking at
existing protocols, sequence number (SQN) information is not directly
carried at the Ethernet and IP layers.
*  In current implementations, the per-flow/application SQN is
   generally encapsulated in the transport (e.g., TCP, QUIC, RoCEv2)
   or application layer.  If packet reordering depends on that SQN,
   the network devices MUST be able to parse a large number of
   transport/application-layer protocols.
*  The SQN in an upper-layer protocol is allocated per transport/
   application-level flow.  That is, the sequence number space and
   initial value may differ between flows and cannot directly express
   the order in which packets arrive at the ingress TOR.  Although it
   is possible to assign a dedicated reordering queue to each flow on
   the egress TOR and reorder packets using the upper-layer SQN, the
   hardware resource consumption cannot be overlooked.
*  If the network device directly overwrites the upper-layer SQN with
   a TOR-TOR pairwise SQN, end-to-end transmission reliability will
   no longer work.
Therefore, specific order information needs to be transmitted from
the first device to the last device with reordering functionality in
a given multipath forwarding domain.

The APN framework is explored to carry this order information, which,
in this use case, records the sequence number of packets arriving at
the ingress TOR (for example, each TOR-TOR pair has an independent,
incrementing SQN); the egress TOR then reorders the packets according
to that information.
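The reordering principle above can be sketched as follows
(illustrative Python, not a protocol specification; the per-TOR-pair
queue state and field names are assumptions of this sketch):

```python
import heapq

class ReorderQueue:
    """Egress-TOR reorder state for one ingress/egress TOR pair."""

    def __init__(self):
        self.next_sqn = 0   # next SQN expected for this TOR pair
        self.heap = []      # out-of-order packets buffered by SQN

    def receive(self, sqn, payload):
        """Buffer a packet; return the packets now releasable in order."""
        heapq.heappush(self.heap, (sqn, payload))
        released = []
        while self.heap and self.heap[0][0] == self.next_sqn:
            released.append(heapq.heappop(self.heap)[1])
            self.next_sqn += 1
        return released

# Packets sprayed over different paths arrive out of order, yet leave
# the egress TOR in their ingress arrival order.
q = ReorderQueue()
out = []
for sqn, pkt in [(1, "p1"), (0, "p0"), (3, "p3"), (2, "p2")]:
    out.extend(q.receive(sqn, pkt))
print(out)  # ['p0', 'p1', 'p2', 'p3']
```

A real device would also need a timeout for lost packets so that a
missing SQN does not stall the queue indefinitely; that is omitted
here.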
Requirements:
*  [REQ1-1] APN SHOULD encapsulate each packet with an SQN, in
   addition to the APN ID, for reordering.  The ingress TOR SHOULD
   assign and record an SQN at a certain granularity in each packet
   according to its arrival order.  The granularity of SQN assignment
   can be TOR-TOR, port-port, or queue-queue.

*  [REQ1-2] The SQN in APN MUST NOT be modified inside the multi-
   pathing domain and could be cleared from APN at the egress device.

*  [REQ1-3] APN SHOULD be able to carry the necessary queue
   information (i.e., the sorting queue ID) usable for the fine-
   grained reordering process.  The queue ID SHOULD have the same
   granularity as the SQN assignment.
2.2. In-network computing for distributed machine learning training
Distributed training of machine learning commonly applies the
AllReduce communication mode [mpi-doc] for cross-accelerator data
transfer in data-parallel and model-parallel scenarios, which execute
an application in parallel on multiple processors.

The exchange of the intermediate results of per-processor training
(i.e., gradient data in machine learning) occupies the majority of
the communication process.
Under the Parameter Server (PS) architecture [atp] (a centralized
parameter server collects gradient data from multiple clients,
aggregates it, and sends the aggregated result back to each client),
when multiple clients send large amounts of gradient data to the same
server simultaneously, incast (many-to-one) congestion is likely to
occur at the server.
In-network computing (INC) offloads the processing behavior of the
server to the switch.  When an on-path network device with both high
switching capacity and line-rate computing capability (for simple
arithmetic operations) acts as a parameter server in place of the
traditional end-host server for gradient aggregation (the "addition"
operation), the distributed AI training application can complete
gradient aggregation on the way.  On one hand, this merges multiple
data streams into a single stream within the network, eliminating
incast congestion at the server.  On the other hand, distributed
computing applications can also benefit from INC because on-switch
computing (e.g., in an ASIC) is faster than server-based computing
(e.g., on a CPU).
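A minimal, non-normative sketch of the switch-side "addition" offload,
assuming fixed-length gradient vectors and a known worker count (both
assumptions of this sketch, not requirements of this document):

```python
class IncAggregator:
    """Switch-resident gradient aggregator for one INC task."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.partial = None   # running element-wise sum
        self.seen = 0         # workers heard from this round

    def on_gradient(self, gradient):
        """Accumulate one worker's gradient; return the sum when done."""
        if self.partial is None:
            self.partial = list(gradient)
        else:
            self.partial = [a + b for a, b in zip(self.partial, gradient)]
        self.seen += 1
        if self.seen == self.num_workers:
            result, self.partial, self.seen = self.partial, None, 0
            return result     # one aggregated stream leaves the switch
        return None           # still accumulating: no incast at server

agg = IncAggregator(3)
assert agg.on_gradient([1.0, 2.0]) is None
assert agg.on_gradient([0.5, 0.5]) is None
print(agg.on_gradient([1.5, 0.5]))  # [3.0, 3.0]
```

N worker streams enter the switch, but only one aggregated stream
continues, which is exactly what removes the many-to-one pattern at
the server.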
[I-D.draft-lou-rtgwg-sinc] argues that to implement in-network
computing, network devices need to be aware of computing tasks
required by applications and correctly parse corresponding data
units. For multi-source computing, synchronization signals of
different data source streams need to be explicitly indicated as
well.
Current implementations (e.g., ATP [atp], NetReduce [netreduce])
require the switches to parse upper-layer protocols and understand
application-specific logic dedicated to a certain application,
because there are still no general transport or application protocols
for INC.  To support various INC applications, a switch MUST adapt to
all kinds of transport/application protocols.  Furthermore, end users
may simply encrypt the whole payload for security, even though they
are willing to provide some non-sensitive information to benefit from
accelerated INC operations.  In such a case, the switch is unable to
fetch the information necessary for INC operations without decrypting
the whole payload.  The current state of protocols makes it difficult
for applications and INC operations to interoperate.
Fortunately, APN is able to carry information about the requested INC
operations as well as the corresponding data segments, with which
applications can offload some analysis and computation to the
network.
Requirements:
*  [REQ2-1] APN MUST carry an identifier to distinguish different INC
   tasks.

*  [REQ2-2] APN MUST support carrying application data of various
   formats and lengths (such as the gradients in this use case) to
   which INC applies, along with the expected operations.

*  [REQ2-3] In order to improve the efficiency of INC, APN SHOULD be
   able to carry other application-aware information that can assist
   the computation, while making sure not to compromise the
   reliability of the end-to-end transport.

*  [REQ2-4] APN MUST be able to carry complete INC results and record
   the computation status in the data packets.
2.3. Refined congestion control that requires feedback of accurate
congestion information
The data center includes at least the following congestion scenarios:
*  Multi-accelerator collaborative AI model training commonly adopts
   the AllReduce and All2All communication modes (Section 2.2).  When
   multiple clients send large amounts of gradient data to a server
   at the same time, incast congestion is likely to occur at the
   server side.

*  Different flows may adopt different load-balancing methods and
   strategies, which may overload individual links.

*  Due to random access to services in the data center, bursts of
   traffic can still increase queue lengths and incur congestion.
The industry has proposed different types of congestion control
algorithms to alleviate traffic congestion on the paths of the data
center network.  Among them, ECN-based congestion control algorithms
such as DCTCP [RFC8257] and DCQCN [dcqcn] are commonly used in data
centers; they use ECN to mark congestion according to the occupancy
of the switch buffer.
However, these methods can only use a 1-bit mark in the packet to
indicate congestion information (i.e., that a queue size has reached
a threshold) and are unable to embrace richer in-situ measurement
information due to the limited header space.  Other proposals, for
example HPCC++ [I-D.draft-miao-ccwg-hpcc], collect congestion
information along the path hop by hop through in-band telemetry,
which keeps appending the information of interest to the data
packets.  However, this greatly increases the length of data packets
as they traverse hops and consumes more bandwidth.  A trade-off
method such as AECN [I-D.draft-shi-ccwg-advanced-ecn] can be used to
collect the most important information representing the congestion
along the path.  Meanwhile, AECN-like methods apply hop-by-hop
calculation to reduce the carrying of redundant information.  For
example, the queueing delay and the number of congested hops can be
calculated cumulatively as packets traverse the path.
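The cumulative, fixed-size update can be sketched as follows
(illustrative Python; the field names and thresholds are hypothetical
and not defined by AECN or by this document):

```python
def update_congestion_fields(fields, queue_delay_us, congested):
    """One hop's in-place update of the carried congestion summary."""
    # Fold this hop's measurement into fixed-size fields instead of
    # appending a per-hop record, so the header size stays constant.
    fields["cum_queue_delay_us"] += queue_delay_us
    fields["max_queue_delay_us"] = max(fields["max_queue_delay_us"],
                                       queue_delay_us)
    if congested:
        fields["congested_hops"] += 1   # a count, not a per-hop list
    return fields

# A packet traverses three hops; only the second and third are
# congested in this example.
fields = {"cum_queue_delay_us": 0,
          "max_queue_delay_us": 0,
          "congested_hops": 0}
for delay_us, congested in [(5, False), (40, True), (12, True)]:
    fields = update_congestion_fields(fields, delay_us, congested)
print(fields)
# {'cum_queue_delay_us': 57, 'max_queue_delay_us': 40, 'congested_hops': 2}
```

Contrast this with per-hop in-band telemetry, where the same traversal
would append three records and grow the packet at every hop.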
In this use case, the end-host can specify the scope of the
information it desires to collect, and each network device records or
updates the corresponding information hop by hop in the data packet.
The collected information might be echoed back to the sender via the
transport protocol.  APN could serve such interaction between hosts
and switches to realize customized information collection.
Requirements:
*  [REQ3-1] The APN framework MUST allow the data sender to express
   its intention about which measurements it wants to collect.
*  [REQ3-2] APN MUST allow network nodes to record/update the
   necessary measurement results if the nodes decide to do so.  The
   measurements could include the queue length of ports, the
   monitored rate of links, the number of PFC frames, the probed RTT
   and its variation, and so on.  APN MAY record the collector of
   each measurement so that information consumers can identify
   possible congestion points.
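As a non-normative sketch of [REQ3-1] and [REQ3-2], a sender-supplied
intent bitmap could let each node record only the requested
measurements; the bit assignments and field names below are
hypothetical, not an encoding proposed by this document:

```python
# Hypothetical intent bits the sender sets in the APN attribute.
WANT_QUEUE_LEN = 1 << 0
WANT_LINK_RATE = 1 << 1
WANT_PFC_COUNT = 1 << 2

def node_record(intent, local):
    """Return only the measurements the sender asked this node for."""
    report = {}
    if intent & WANT_QUEUE_LEN:
        report["queue_len"] = local["queue_len"]
    if intent & WANT_LINK_RATE:
        report["link_rate"] = local["link_rate"]
    if intent & WANT_PFC_COUNT:
        report["pfc_frames"] = local["pfc_frames"]
    return report

# One node's local measurements (illustrative values).
local = {"queue_len": 12, "link_rate": 93.5, "pfc_frames": 4}
print(node_record(WANT_QUEUE_LEN | WANT_PFC_COUNT, local))
# {'queue_len': 12, 'pfc_frames': 4}
```

The node contributes nothing the sender did not ask for, which keeps
the carried information minimal and lets the consumer attribute each
value to a collection intent.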
3. Encapsulation
The encapsulation of the application-aware information proposed by
the APDN use cases in the APN header [I-D.draft-li-apn-header] will
be defined in a future version of this draft.
4. Security Considerations
TBD.
5. IANA Considerations
This document has no IANA actions.
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
6.2. Informative References
[mpi-doc] "Message-Passing Interface Standard", August 2023,
<https://www.mpi-forum.org/docs/mpi-4.1>.
[dcqcn] "Congestion Control for Large-Scale RDMA Deployments",
n.d.,
<https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
p523.pdf>.
[netreduce]
"NetReduce - RDMA-Compatible In-Network Reduction for
Distributed DNN Training Acceleration", n.d.,
<https://arxiv.org/abs/2009.09736>.
[atp] "ATP - In-network Aggregation for Multi-tenant Learning",
n.d.,
<https://www.usenix.org/conference/nsdi21/presentation/
lao>.
[I-D.li-apn-framework]
Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C., and
G. S. Mishra, "Application-aware Networking (APN)
Framework", Work in Progress, Internet-Draft, draft-li-
apn-framework-07, 3 April 2023,
<https://datatracker.ietf.org/doc/html/draft-li-apn-
framework-07>.
[I-D.li-rtgwg-apn-app-side-framework]
Li, Z. and S. Peng, "Extension of Application-aware
Networking (APN) Framework for Application Side", Work in
Progress, Internet-Draft, draft-li-rtgwg-apn-app-side-
framework-00, 22 October 2023,
<https://datatracker.ietf.org/doc/html/draft-li-rtgwg-apn-
app-side-framework-00>.
[I-D.draft-lou-rtgwg-sinc]
Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
"Signaling In-Network Computing operations (SINC)", Work
in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
September 2023, <https://datatracker.ietf.org/doc/html/
draft-lou-rtgwg-sinc-01>.
[RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
October 2017, <https://www.rfc-editor.org/rfc/rfc8257>.
[I-D.draft-miao-ccwg-hpcc]
Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++:
Enhanced High Precision Congestion Control", Work in
Progress, Internet-Draft, draft-miao-ccwg-hpcc-00, 5 July
2023, <https://datatracker.ietf.org/doc/html/draft-miao-
ccwg-hpcc-00>.
[I-D.draft-shi-ccwg-advanced-ecn]
Shi, H. and T. Zhou, "Advanced Explicit Congestion
Notification", Work in Progress, Internet-Draft, draft-
shi-ccwg-advanced-ecn-00, 10 July 2023,
<https://datatracker.ietf.org/doc/html/draft-shi-ccwg-
advanced-ecn-00>.
[I-D.draft-li-apn-header]
Li, Z., Peng, S., and S. Zhang, "Application-aware
Networking (APN) Header", Work in Progress, Internet-
Draft, draft-li-apn-header-04, 12 April 2023,
<https://datatracker.ietf.org/doc/html/draft-li-apn-
header-04>.
Acknowledgements
Contributors
Authors' Addresses
Haibo Wang
Huawei
Email: rainsword.wang@huawei.com
Hongyi Huang
Huawei
Email: hongyi.huang@huawei.com