Internet-Draft CSIG February 2024
Ravi, et al. Expires 5 August 2024
Workgroup:
Networking Working Group
Internet-Draft:
draft-ravi-ippm-csig-01
Published:
Intended Status:
Experimental
Expires:
Authors:
A. Ravi
Google LLC
N. Dukkipati
Google LLC
N. Mehta
Google LLC
J. Kumar
Broadcom Inc.

Congestion Signaling (CSIG)

Abstract

This document presents Congestion Signaling (CSIG), an in-band network telemetry protocol that allows end-hosts to obtain visibility into fine-grained network signals for congestion control, traffic management, and network debuggability. CSIG provides a simple, low-overhead, and extensible packet header mechanism to obtain fixed-length summaries from bottleneck devices along a packet path. This summarized information is collected in L2 CSIG-tags in a compare-and-replace manner across network devices along the path. Receivers can reflect this information back to senders via L4+ CSIG reflection headers.

CSIG builds upon the successful aspects of prior work such as switch in-band network telemetry (INT), which incorporates multi-bit signals in live data packets. At the same time, CSIG's end-to-end mechanism for carrying the signals via a fixed-size header is simple, practical, and deployable, akin to Explicit Congestion Notification (ECN).

In addition to a detailed description of the end-to-end protocol, this document also motivates the use cases for CSIG and the rationale for design choices made in CSIG. It describes a set of signals of interest to applications (minimum available bandwidth, maximum link utilization, and maximum hop delay), methods to compute these signals in network devices, and how these signals can be leveraged in applications. Additionally, it describes how attributes about the bottleneck's location can be carried and made useful to applications. It also provides the framework to incorporate future signals. Finally, this document addresses incremental deployment, backward compatibility and nuances of CSIG's applicability in a range of scenarios.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 5 August 2024.

1. Introduction

Many network control loops, including Congestion Control, Traffic Engineering and Network Operations, make decisions based on the congestion experienced by application flows. The signals used to determine congestion are often implicitly derived from end-to-end signals, approximated over larger timescales than desired, or obtained out-of-band from the network. This can lead to suboptimal performance for applications or inefficiency in network usage. CSIG (Congestion Signaling) provides direct, real-time, inband signals that network control loops can incorporate for performance and efficiency.

A number of congestion control algorithms (CCA) are deployed in datacenters, including Swift [SWIFT], BBR [BBR], DCTCP [RFC8257], DCQCN [DCQCN] and HPCC++ [I-D.miao-tsv-hpcc]. These CCA vary in the congestion signals they use and in how they increase/decrease flow rates in response to the signals. Swift uses precise measurements of round-trip time (RTT) to modulate its congestion window. BBR uses a combination of flow's delivery rate and RTT measurements. DCTCP and DCQCN rely on Explicit Congestion Notification (ECN [RFC3168]) from switches that indicate if the queue build up is above a threshold. HPCC++ leverages per-hop queue depth and transmit bytes along the flow's path, obtained via inband telemetry probes, to update flow rates.

Despite the advances in sophisticated signals on when to slow down transfers, there continue to be blind spots for CCA when it comes to increasing flow rates, e.g., what is the appropriate starting rate for a flow? How quickly should a flow ramp up in the absence of congestion? Without explicit information from the network, end-to-end CCA have come to rely on heuristics that can either undershoot or overshoot the bottleneck bandwidth, which can lead to longer Flow Completion Times (FCT), increased round-trip times, or packet losses. At the same time, applications' appetite for fast network performance is rising: AI/ML applications are pushing for fast network transfers to avoid idling expensive Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs). Similarly, storage disaggregation needs fast transfers to make a remote storage device appear as a local device at the host.

In this document we introduce Congestion Signaling (CSIG) to explicitly notify the hosts of the bottleneck link metrics. There are several important use cases for CSIG, including:

  • Congestion Control Algorithms for making decisions on sending rate: CCA at senders can use CSIG for quickly and safely ramping up to the maximum feasible rate as determined by the bottleneck link, and react with precision to the bottleneck hop both in the presence and absence of congestion. The motivation for quick ramp-up stems from making maximal use of datacenter bandwidth, and decreasing latency even for large transfers. There are several ways in which CSIG can help complete transfers quickly, e.g., transfers belonging to an ML collective communication can ramp up quickly to maximally use all network bandwidth and complete close to the ideal transfer completion time.

  • Traffic Management systems, including Traffic Engineering (TE), Load Balancing and Multipathing, also benefit from CSIG. TE systems infer congested flows through an offline multi-minute process via superimposition of network traffic stats, topology and routing information. With CSIG, TE has more up-to-date information on the congested points and the application flows experiencing congestion. Using such finer-grained information can lead to more efficient and timely provisioning for bursty traffic. Similarly, CSIG-enabled multipathed transport flows can choose, in real time, the paths with the most available bandwidth.

  • Troubleshooting and Performance Optimization. We also envision CSIG to assist with debugging the network-level performance of datacenter applications. Large-scale applications, including ML training workloads, open thousands of connections at the transport layer. When the network is slow for an application, it is almost impossible to identify the bottleneck hops without joining many data sources across switches and hosts. Because CSIG conveys the path bottleneck characteristics, it is valuable in pinpointing choke points in the network. Knowledge of these choke points can lead to better bandwidth provisioning, timely repair processes, and real-time control, such as better load balancing.

CSIG provides simple, fixed-length summaries of bottleneck links along a path, such as maximum hop delay, minimum available bandwidth, and maximum link utilization. Information is collected at L2 from network devices along a packet path. Each data receiver then returns the collected information to the data sender via L4 transport options or payloads. CSIG uses a simple compare-and-replace operation at network devices, which allows it to scale with network topology, link speeds, and packet rates.

CSIG builds on the successful aspects of prior explicit feedback schemes, but is more capable. CSIG carries rich multi-bit switch telemetry in live data packets, drawing from the advancements in in-band network telemetry, also generally known as INT. At the same time, CSIG retains the fixed-size headers and reflection in L4 transports akin to Explicit Congestion Notification (ECN). The industry has three key variants of INT: the one first specified in P4.org [P4-INT], the IOAM (In Situ Operations, Administration, and Maintenance) standard [RFC9378] in IETF and the Inband Flow Analyzer (IFA) spec [I-D.kumar-ippm-ifa] that is used in HPCC deployment [HPCCPLUS]. While they differ in the header definitions and encapsulation mechanisms, they all stack per-switch telemetry data for each hop in the path of a packet. The packet size grows in proportion to the number of metrics per switch and the number of forwarding devices along its path. Depending on the use case and header definition, the per-packet overhead ranges from 20B to above 100B. The large and variable-size header overhead makes it challenging to conform to end-to-end MTU limits and to parse the packet header data in the forwarding or receiving devices.

Several efforts address the challenges of INT variants, including: 1) carrying INT data in synthetically generated non-data packets, also known as probe packets, and 2) carrying only the fixed-size INT instructions (e.g., specifying which data to collect per hop) in data packets, while hop devices generate separate report packets that deliver the requested per-hop data. While these techniques reduce the per-data-packet overhead, they do not fundamentally reduce the total bytes or PPS overhead on the network devices or the data collector. TCP-INT [TCP-INT] was developed in parallel to carry fixed-size min/max/sum aggregate metrics over the hops, together with a hop locator, in live data packets. However, it is limited to TCP Options and is hence not applicable to various modern transports for AI/HPC; furthermore, it offers no flexible way to introduce a new metric. CSIG's type-value format ensures a constant-size overhead while leaving room for future signals. The guaranteed constant size is small enough to fit into a 4B or 8B tag, enabling the placement of CSIG in L2, which frees operators from concerns around tunneling and encryption when deploying CSIG.

In the rest of the document, we describe the design of end-to-end CSIG at hosts and network devices.

1.1. Terminology

ABW:

Available Bandwidth

AQM:

Active Queue Management

CCA:

Congestion Control Algorithms

Connection / Flow:

A 5-tuple transport connection, e.g. TCP connection

CSIG:

Congestion Signaling

CSIG data fields:

Fields in the CSIG tag excluding the TPID.

CSIG packets:

Packets that contain the CSIG-tag and optionally the CSIG reflection header

CSIG-capable path:

Path is termed CSIG-capable if all transit devices along the path support the CSIG protocol and end hosts have at least pass-through support for CSIG packets

CSIG-tagged packets:

Packets that contain the CSIG-tag in the packet header

CSIG-domain:

Secure network deployment domain where all devices in the domain have complete CSIG support or pass-through CSIG support

PD:

Per-hop delay

E2E:

End-to-End

IPSec:

Internet Protocol Security

MTU:

Maximum Transmission Unit

MSS:

Maximum Segment Size

NIC:

Network Interface Card

Packet Path:

The port-by-port network path taken by a given packet specified as a sequence of device interfaces

PSP:

PSP Security Protocol

TPID:

Tag Protocol ID

TE:

Traffic Engineering

Transit device:

Any switch, router or middlebox in the path of a CSIG packet

WRR:

Weighted Round Robin

2. Design Principles

CSIG was conceived to address problems in congestion control, traffic management and network debuggability in production networks. We describe below the design principles that shaped CSIG, with simplicity and ease of deployment being at the forefront. Section 7 discusses the rationale behind the specific design choices made in CSIG.

  • Simple Signals driven by Use Cases: Simple device port or queue metrics that solve concrete use cases are at the heart of CSIG's design principles. This simplicity is not only important to applications, but also keeps the area, power and cost of implementation low on network devices. Signals in CSIG are designed to be implementable in ASICs at line rate. Signals that track per-flow state at the switch, for example, are harder to implement and deploy, and are hence avoided in CSIG. CSIG is also flexible enough to accommodate new signals and use cases beyond those described in this document.

  • End-to-End Perspective: CSIG's design stems from an end-to-end perspective of requirements and trade-offs for both applications and the network. This document covers the necessary end-to-end aspects and the resulting design choices that make CSIG both useful to applications and practical to deploy.

  • Small and Fixed Packet Overhead: It is important that the packet size does not increase as it traverses the network, which means that the MTU does not need to be changed. Any overhead that is introduced should be fixed and small, minimizing the cost of implementation in switch / NIC pipelines. Low protocol overhead also means low bandwidth overhead for small packets, minimizing impact to packet-per-second (PPS) load and bandwidth efficiency. We make very few assumptions about which packets and devices CSIG is enabled on. Device implementations must be able to process CSIG on packets at line rate with minimal CPU involvement. Keeping the overhead small and fixed allows for CSIG to be enabled on every single packet at line rate. This is important because deployments may choose to enable CSIG on every packet rather than on a small sample of packets.

  • Works easily under Tunneling and Encryption: Tunnels are broadly used in modern deployments; e.g., traffic-engineering systems and cloud traffic frequently use tunnels. CSIG is designed to easily support end-to-end signaling on devices even in the presence of complex tunneling deployments. This is in contrast to other in-band telemetry schemes that put more pressure on the ASICs to relocate metadata across inner and outer headers to work in the presence of tunnels. In addition, CSIG also works with encrypted packets, including PSP, IPSec and 802.1AE MAC Security.

  • Incremental Deployability: CSIG allows incremental deployment, where the mechanism can be deployed gradually into domains where some devices may support the new protocol and others may not. This document addresses interoperability in heterogeneous networks, and addresses backward compatibility with legacy devices. We envision CSIG to be broadly valuable across wired networks, although our target domain for initial usage is datacenter networks. We make minimal assumptions about the network architecture around tunneling, number of hops (diameter), routing, topology etc. Configuring CSIG for end-to-end consistency in a private network, or deployments over the Internet are not in scope for this document.

3. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying significance described in RFC 2119.

4. Congestion Signaling Protocol

The CSIG protocol defines two components in the packet header to achieve end-to-end congestion signaling in a production network.

  • CSIG-tag: An L2 protocol that end hosts and transit devices participate in.

  • CSIG Reflection: A flexible L4+ protocol that only end hosts participate in.

CSIG-tag is the core component of the CSIG specification. It enables end hosts to request network signals of interest and for transit devices to provide these signals to end hosts over the specified packet header bits.

However, to achieve end-to-end CSIG, CSIG-tag MAY be combined with the CSIG reflection protocol to expose the signals of interest to the relevant endpoints or consumers where the signals are needed.

This section first describes the header formats for CSIG-tag and CSIG reflection. Then it describes the life of a CSIG packet, outlining the different roles of network devices in the context of CSIG, and how these two packet header mechanisms work together to achieve end-to-end signaling.

4.1. CSIG-tag Header Format

The CSIG-tag is a fixed-size tag in the layer 2 header.

CSIG-tag placement in various packet encapsulations is shown below for completeness. It is always the last tag in the layer 2 header.

ARPA: dstmac / srcmac / csig-tag / ethertype / payload

802.1q: dstmac / srcmac / vlan-tag / csig-tag / ethertype / payload

802.1ad: dstmac / srcmac / vlan-tag / vlan-tag / csig-tag / ethertype / payload

802.1ad tunnel: dstmac / srcmac / vlan-tag / vlan-tag / vlan-tag / vlan-tag / csig-tag / ethertype / payload

802.1ae: dstmac / srcmac / security-tag / vlan-tag / csig-tag / ethertype / payload

Consequently, the placement / offset of the CSIG tag is not affected by the headers and payload at layers 3 and above. Layer 2.5 headers, such as MPLS, are also placed after the CSIG tag and do not impact its offset.
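Because the CSIG-tag is always the last tag in the layer 2 header, a parser can locate it by walking the tag stack without looking at layer 3 or above. The following Python sketch illustrates this under stated assumptions: only VLAN tags (TPIDs 0x8100 and 0x88A8) precede the CSIG-tag, the 802.1ae security tag is not handled, and the two CSIG TPID values are hypothetical placeholders, since the IEEE-allocated codepoints are not given in this document.

```python
import struct

VLAN_TPIDS = {0x8100, 0x88A8}   # 802.1Q and 802.1ad tag TPIDs
# Hypothetical placeholders: the real CSIG TPIDs are IEEE-allocated.
CSIG_TPID_COMPACT = 0xFF01
CSIG_TPID_EXPANDED = 0xFF02


def find_csig_tag(frame: bytes):
    """Return (offset, tag_length) of the CSIG-tag, or None if absent."""
    offset = 12                         # skip dstmac (6B) and srcmac (6B)
    while offset + 2 <= len(frame):
        (tpid,) = struct.unpack_from("!H", frame, offset)
        if tpid in VLAN_TPIDS:
            offset += 4                 # each VLAN tag is 4 bytes
        elif tpid == CSIG_TPID_COMPACT:
            return offset, 4
        elif tpid == CSIG_TPID_EXPANDED:
            return offset, 8
        else:
            return None                 # reached the ethertype: no CSIG-tag
    return None
```

The offset returned depends only on the number of preceding L2 tags, which is the property the paragraph above relies on.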

CSIG-tag is defined in two variants - Compact and Expanded. Each variant has a dedicated TPID codepoint to allow devices to infer which variant is in use. Each variant supports a distinct set of requirements with respect to production deployment and identifies contrasting trade-off points in the solution space. Deployment considerations are discussed in Section 6.

Structurally, the compact CSIG-tag variant resembles a single VLAN tag and the expanded CSIG-tag variant resembles a double VLAN tag. This structural similarity is intentional and the reasons are elaborated in Section 6.4.

4.1.1. Compact Format

CSIG-tag compact format is as shown, with 2B allocated for the CSIG Tag Protocol ID (TPID) and 2B allocated for the data fields.

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             TPID              |  T  |R|    S    |      LM     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   |0-15|  TPID  : IEEE allocated Tag Protocol ID for 4 Byte CSIG tag
   |16-18| T     : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
   |19|    R     : Reserved
   |20-24| S     : Signal Value: Bucketed (32 configurable buckets)
   |25-31| LM    : Locator Metadata of bottleneck device / port
Figure 1: CSIG-tag Compact version
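The bit layout in Figure 1 can be exercised with a short pack/unpack sketch in Python. Bit 0 is taken to be the most significant bit, as is conventional for header diagrams of this style. The TPID value is a hypothetical placeholder, and the function names are illustrative only.

```python
import struct

CSIG_TPID_COMPACT = 0xFF01   # hypothetical placeholder, not the IEEE codepoint


def pack_compact(t: int, s: int, lm: int, tpid: int = CSIG_TPID_COMPACT) -> bytes:
    """Build a 4-byte compact CSIG-tag: TPID(16) | T(3) | R(1) | S(5) | LM(7)."""
    if not (0 <= t < 8 and 0 <= s < 32 and 0 <= lm < 128):
        raise ValueError("field out of range for the compact format")
    data = (t << 13) | (s << 7) | lm     # reserved bit (bit 19) is left as 0
    return struct.pack("!HH", tpid, data)


def unpack_compact(tag: bytes):
    """Return (tpid, t, s, lm) from a 4-byte compact CSIG-tag."""
    tpid, data = struct.unpack("!HH", tag)
    return tpid, (data >> 13) & 0x7, (data >> 7) & 0x1F, data & 0x7F
```

For example, `pack_compact(2, 31, 5)` requests max(PD) with bucket 31 and locator metadata 5, and `unpack_compact` recovers the same three fields.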

4.1.2. Expanded Format

CSIG-tag expanded format is as shown, with 2B allocated for the Tag Protocol ID (TPID) and 6B allocated for the data fields

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             TPID              |               LM              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   T   |                  S                    |       R       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   |0-15|  TPID : IEEE allocated Tag Protocol ID for 8 Byte CSIG tag
   |16-31| LM   : Locator Metadata of bottleneck device / port
   |0-3|   T    : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
   |4-23|  S    : Signal Value: Uniformly quantized
   |24-31| R    : Reserved for future use
Figure 2: CSIG-tag Expanded version
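The expanded layout in Figure 2 can be sketched the same way in Python: the first 32-bit word holds the TPID and LM, and the second holds T, S and the reserved byte. Bit 0 is taken to be the most significant bit of each word. The TPID value is a hypothetical placeholder and the function names are illustrative only.

```python
import struct

CSIG_TPID_EXPANDED = 0xFF02   # hypothetical placeholder, not the IEEE codepoint


def pack_expanded(t: int, s: int, lm: int, tpid: int = CSIG_TPID_EXPANDED) -> bytes:
    """Build an 8-byte expanded CSIG-tag: TPID(16) | LM(16) | T(4) | S(20) | R(8)."""
    if not (0 <= t < 16 and 0 <= s < (1 << 20) and 0 <= lm < (1 << 16)):
        raise ValueError("field out of range for the expanded format")
    word2 = (t << 28) | (s << 8)         # reserved low byte is left as 0
    return struct.pack("!HHI", tpid, lm, word2)


def unpack_expanded(tag: bytes):
    """Return (tpid, t, s, lm) from an 8-byte expanded CSIG-tag."""
    tpid, lm, word2 = struct.unpack("!HHI", tag)
    return tpid, word2 >> 28, (word2 >> 8) & 0xFFFFF, lm
```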

4.1.3. CSIG-tag Data fields Description

This section describes the format and usage of data fields within the CSIG-tag

4.1.3.1. Signal Type

The Signal Type field T is three (four) bits long in the compact (expanded) format and indicates the type of signal being carried in the CSIG-tag. End hosts set the signal type T and request it on each packet of interest. Up to 8 signal types are supported in the compact format, and up to 16 signal types are supported in the expanded format. This draft concretely defines three signals: min(ABW), min(ABW/C) and max(PD), elaborated in Section 5 and Section 8. The remaining codepoints are reserved for future signals, and may be defined and used in future versions of CSIG.

A single packet can carry at most one Congestion Signal. However, end hosts MAY obtain multiple signals for a single 5-tuple flow by requesting different signal types on alternating packets of a flow or in a round-robin fashion across packets. Therefore, end hosts need not tie a single flow to a specific signal type, and MAY obtain all supported CSIG signals for a single flow.

4.1.3.2. Signal Value

The Signal Value field S is 5 bits (20 bits) long in the compact (expanded) format and captures the value of the signal specified by Signal Type T. End hosts set the initial Signal Value S alongside the requested Signal Type T, and each transit device along the packet path in the network MAY modify S in accordance with the e2e signal being computed. For example, for signals that are min() aggregations, end hosts set the initial value of S to the maximum allowable value of the signal (or its encoding), and transit devices perform compare-and-replace to compute the min() across the signals of individual devices on the packet path.

In the compact format, the 5-bit Signal Value is bucketed with 32 fully configurable buckets. Each bucket is configured with a (low, high) value range. This configuration is specific to each Signal Type and MAY vary across Signal Types. This allows the Signal Value representation to be tailored to the specific needs of each Signal Type. For example, in typical use cases of available bandwidth, it is more useful to have higher granularity at lower values of the signal (i.e., when ABW is close to 0) than at higher values of the signal. This is because lower values of ABW have greater impact on application control decisions, e.g., knowing whether there was 0 Gbps vs 1 Gbps available on a path makes a larger difference than knowing if there was 399 Gbps vs 400 Gbps available. Appendix A shows how the buckets could be defined in order to provide such a non-linear encoding of value ranges to buckets. Such configurable encodings allow capturing useful information about the signal with fewer bits and are a core feature of the compact CSIG format.
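As a rough illustration of non-linear bucketing, the Python sketch below maps an ABW value to one of 32 buckets whose widths grow with the value. The quadratic spacing and the 400 Gbps ceiling are assumptions chosen for this example only; they are not the bucket definitions of Appendix A. Bucket indices increase with the signal value, so a min() aggregation can compare indices directly.

```python
from bisect import bisect_right

LINK_MAX_GBPS = 400.0   # assumed ceiling for this example
NUM_BUCKETS = 32

# Hypothetical quadratic spacing: 31 boundaries, narrow near 0 and wide near the top.
EDGES = [LINK_MAX_GBPS * (i / NUM_BUCKETS) ** 2 for i in range(1, NUM_BUCKETS)]


def abw_to_bucket(abw_gbps: float) -> int:
    """Return the 5-bit bucket index (0..31) for an available-bandwidth value."""
    return bisect_right(EDGES, abw_gbps)


def bucket_range(bucket: int):
    """Return the (low, high) value range that a bucket index represents."""
    low = 0.0 if bucket == 0 else EDGES[bucket - 1]
    high = EDGES[bucket] if bucket < NUM_BUCKETS - 1 else float("inf")
    return low, high
```

With this spacing, 0 Gbps and 1 Gbps fall into different buckets (0 and 1), while 399 Gbps and 400 Gbps share bucket 31, mirroring the example in the text.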

In the expanded format, Signal Value is uniformly quantized into a 20-bit value. The unit of quantization is configurable on a per Signal Type basis, depending on the minimum and maximum value that needs to be represented with the given bits. The higher bit length allows for enhanced signal granularity and fewer configuration knobs in domains where the expanded CSIG format is viable to deploy (Section 6.5). 20 bits are sufficient to represent a wide range of values with high granularity. As an example, with an 8 Mbps quantum for min(ABW), the signal value field can represent up to about 8.4 Tbps (2^20 x 8 Mbps). With a 128 ns quantum for max(PD), the signal value field can represent up to about 134 ms (2^20 x 128 ns). More discussion on signal-specific quanta is in Appendix A.
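The range arithmetic for uniform quantization can be checked with a short Python sketch. Saturating at the largest code is an assumption made here for illustration; the document does not specify overflow behavior. Values are integers in Mbps for ABW and in ns for PD.

```python
S_BITS = 20
S_MAX = (1 << S_BITS) - 1        # largest 20-bit code: 1,048,575


def quantize(value: int, quantum: int) -> int:
    """Uniformly quantize a non-negative value, saturating at S_MAX (assumption)."""
    return max(0, min(S_MAX, value // quantum))


def max_representable(quantum: int) -> int:
    """Largest value (in the quantum's unit) that the 20-bit field can carry."""
    return S_MAX * quantum
```

With an 8 Mbps quantum, `max_representable(8)` is 8,388,600 Mbps (about 8.39 Tbps), and with a 128 ns quantum, `max_representable(128)` is 134,217,600 ns (about 134.2 ms). A 100 Gbps ABW encodes as `quantize(100_000, 8)`, i.e., 12500.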

Signal quantization / bucketing parameters are configured directly at the transit devices where the signal is computed. End hosts do not explicitly request or negotiate these parameters. As described in Section 5, all devices MUST be configured with the same quantization / bucketing parameters for each signal type, in order to correctly compute the requested signal along packet paths.

4.1.3.3. Locator Metadata

The Locator Metadata field LM is an optional 7 bits (16 bits) in the compact (expanded) format. It captures relevant metadata about the bottleneck port or device, where the notion of bottleneck is specific to individual signal types. Locator Metadata MAY include compressed attributes about the bottleneck that are relevant for the use case, e.g., capacity of the bottleneck port, stage of the bottleneck device in the data center topology, or orientation of the bottleneck port (uplink / downlink). LM MAY also include expanded attributes of the bottleneck (e.g., port ID, TTL). This document provides recommendations for the type of information that locator metadata MAY carry, but it does not require any specific set of metadata to be supported. Metadata that is useful and viable to support will depend on the production setting, which is out of scope for this document. Instances of CSIG deployment MAY include locator metadata with custom-defined metadata beyond those described in this document. Section 5.5 discusses requirements for supporting LM in devices.

End hosts initialize LM to a default value. Transit devices that do not update the Signal Value S on a given packet MUST NOT alter LM on the packet. Transit devices that update S on a packet MUST update LM on the same packet.
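The coupling between S and LM can be sketched in Python as a single compare-and-replace step. This is a minimal illustration, not a device implementation: the class and function names are hypothetical, ties are assumed to keep the earlier hop, and unknown signal types are assumed to be left untouched. The signal type codepoints follow the tag format figures.

```python
from dataclasses import dataclass

MIN_ABW, MIN_ABW_C, MAX_PD = 0, 1, 2
IS_MIN = {MIN_ABW: True, MIN_ABW_C: True, MAX_PD: False}


@dataclass
class CsigFields:
    t: int    # Signal Type
    s: int    # Signal Value (encoded)
    lm: int   # Locator Metadata


def compare_and_replace(tag: CsigFields, local_s: int, local_lm: int) -> bool:
    """Update S and LM together if this hop is the new bottleneck.

    Returns True when the tag was updated. LM is never touched unless S is.
    """
    if tag.t not in IS_MIN:
        return False                      # assumption: ignore unknown types
    if IS_MIN[tag.t]:
        is_new_bottleneck = local_s < tag.s
    else:
        is_new_bottleneck = local_s > tag.s
    if is_new_bottleneck:
        tag.s, tag.lm = local_s, local_lm
    return is_new_bottleneck
```

A sender requesting a min() signal would start from the largest encodable S, so the first hop always replaces it. Starting max() signals from 0 is the analogous assumption.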

4.2. CSIG Reflection Header Format

CSIG reflection enables consumption of tag data fields at the point where the signals are needed for telemetry or control. This mechanism is particularly relevant for sender-driven / source-based telemetry and control. For receiver-driven transports and controllers, CSIG reflection may not be necessary as the signals on the CSIG tag are available at the receiver without reflection (See Section 4.3).

This document provides recommendations on how CSIG reflection SHOULD be implemented, and provides the framework to make the implementation deployment-specific.

The CSIG reflection header is a separate header from the CSIG tag, implemented at layer 4 or above. The location of the header and the choice of which packets carry the header are transport-specific. As an example, the header can be carried on TCP ACK packets from the receiver back to the sender. Note that the presence of ACK coalescing, piggybacked ACKs, Selective Acknowledgements (SACK), etc. can impact the behavior of CSIG reflection. More generally, there may not be a 1:1 mapping between forward and reverse path packets. In a scenario where the transport implements ACK coalescing, the CSIG reflection header SHOULD reflect the latest CSIG-tag data fields received across the packets being acknowledged or a more advanced summary of the CSIG-tag data fields across the packets being acknowledged. It is important to note that since Signal Type is chosen on a per-packet granularity, a coalesced ACK may acknowledge multiple packets that carry different signal types in their CSIG-tags. In such a scenario, the reflection header MAY only reflect one of the signals. The sender transport should choose Signal Type for packets in a way that ensures that it can continue to receive all signals of interest.
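A minimal Python sketch of the "reflect the latest" policy under ACK coalescing follows. The class and method names are hypothetical, and a real transport would integrate this with its own ACK machinery.

```python
class LatestReflector:
    """Receiver-side state: remember the most recent CSIG data fields
    seen since the last ACK and reflect them on the next ACK."""

    def __init__(self):
        self._latest = None

    def on_data_packet(self, t: int, s: int, lm: int) -> None:
        # Overwrite: only the newest (T, S, LM) survives until the next ACK.
        self._latest = (t, s, lm)

    def on_send_ack(self):
        """Return the fields to place in the reflection header, or None."""
        fields, self._latest = self._latest, None
        return fields
```

The sketch also shows the pitfall noted above: if the sender alternates two signal types and every ACK covers exactly two packets, the same signal type is the latest one every time, and the other is never reflected. A sender can avoid this by varying the order in which it requests signal types.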

CSIG reflection header MAY include all of the CSIG data fields i.e., 2B for the compact version and 6B for the expanded version. However, one could optimize header space and include only a subset of the data fields if the consumer is interested only in a subset of signals or locator metadata.

CSIG reflection is an end-host-only protocol and transit devices do not participate in it. Therefore, CSIG reflection header can be incorporated in portions of the packet that are e2e encrypted via PSP or IPSec.

The following subsections discuss locations in the packet header where CSIG reflection could be implemented for different transports

4.2.1. Reflection in TCP

Reflection in TCP is typically achieved via TCP options. CSIG Reflection can be implemented via a new TCP Option, identified by a unique Kind.

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |    Length     |       CSIG data fields        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Kind              : Unique codepoint to recognize TCP CSIG option
   Length            : Length in bytes of the CSIG data fields
                       carried in the options payload
   CSIG data fields  : Values reflected from receiver to sender
Figure 3: CSIG Reflection TCP Option
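The option in Figure 3 can be sketched in Python as follows. The Kind value is a hypothetical placeholder, since no codepoint is assigned in this document. Length follows the field description in Figure 3 (bytes of CSIG data fields only); conventional TCP options count the Kind and Length bytes as well, so a real encoding may differ.

```python
import struct

CSIG_TCP_OPTION_KIND = 253   # hypothetical placeholder for the unassigned Kind


def build_csig_option(data_fields: bytes) -> bytes:
    """Encode Kind, Length and the reflected CSIG data fields."""
    if len(data_fields) not in (2, 6):   # compact or expanded data fields
        raise ValueError("CSIG data fields must be 2 or 6 bytes")
    return struct.pack("!BB", CSIG_TCP_OPTION_KIND, len(data_fields)) + data_fields


def parse_csig_option(option: bytes):
    """Return the CSIG data fields, or None if this is not a CSIG option."""
    if len(option) < 2:
        return None
    kind, length = struct.unpack_from("!BB", option)
    if kind != CSIG_TCP_OPTION_KIND or len(option) != 2 + length:
        return None
    return option[2:]
```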

4.2.2. Reflection in non-TCP Transports

Several transports such as QUIC [RFC9000] and PonyExpress [PONYEXPRESS] are built atop UDP. Reflection in UDP can be achieved by including CSIG data fields in the UDP payload from receiver to sender. For unidirectional UDP traffic, an out-of-band reverse connection from the receiver to the sender may be necessary for CSIG reflection.

As an example, PonyExpress [PONYEXPRESS] is a custom transport implemented within a userspace host networking stack. It supports a flexible L4 wire protocol that periodically changes as new features are added (Sec 3.1 in Snap). CSIG reflection can be implemented as additional bytes within this wire format.

                   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |    Flags    | CSIG data fields|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4: PonyExpress CSIG Reflection header

For simplicity and to avoid the need for negotiation, the CSIG reflection header can be carried on all packets independent of whether CSIG is enabled on them. The Valid bit in the Flags field can be set to 1 for packets that carry valid data fields in the reflection header. In certain deployments, negotiation is unavoidable for a variety of reasons. Section 6.3.3 provides details regarding options for negotiation.

4.3. CSIG Operation - Life of a packet

This section describes the end-to-end operation of CSIG with a walkthrough of the life of a packet. It assumes that all nodes in the path are CSIG-capable and omits the negotiation phase. Details of negotiation are covered in Section 6.3.3.

                            Forward Path
     --------------------------------------------------------->

     <---------------------------------------------------------
                            Reverse Path

+------+   +-----+   +------+  +------+  +------+  +-----+   +------+
| Host +---+ ToR +---+ Aggr +--+ Core +--+ Aggr +--+ ToR +---+ Host |
+------+   +-----+   +------+  +------+  +------+  +-----+   +------+

        C:   800G      100G      100G      100G      40G

      ABW:   100G       95G       70G       90G      20G
                                                     ---
    ABW/C:  12.5%       95%       70%       90%      50%
            -----
        D:   10us       3us      18us       5us      8us
                                 ----
Figure 5: Life of a CSIG packet. Underlined values show the forward path bottlenecks for the corresponding signal types

4.3.1. Forward Path

The sender end-host first constructs a CSIG-tagged packet for a flow of interest and sends out the packet with the tag data fields initialized. The transport determines these initial values for the packet, including Signal Type to request and default values for the other data fields. Each transit device performs a compare-and-replace on the CSIG-tag to optionally update the Signal Value and Locator Metadata fields on the tag. As the packet traverses through the network, the CSIG-tag data fields accumulate the desired aggregation of the requested signal.
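The forward-path aggregation can be worked through with the per-hop values of Figure 5. The Python sketch below uses raw values rather than encoded Signal Values, and uses a 1-based hop index as a stand-in for Locator Metadata; both are simplifications for illustration.

```python
# (capacity in Gbps, ABW in Gbps, hop delay in us) for the five hops of Figure 5
HOPS = [(800, 100, 10), (100, 95, 3), (100, 70, 18), (100, 90, 5), (40, 20, 8)]


def per_hop_value(signal: str, capacity: float, abw: float, delay: float) -> float:
    """Compute the local value of the requested signal at one hop."""
    if signal == "min(ABW)":
        return abw
    if signal == "min(ABW/C)":
        return abw / capacity
    if signal == "max(PD)":
        return delay
    raise ValueError(f"unknown signal: {signal}")


def traverse(signal: str):
    """Apply compare-and-replace hop by hop; return (value, bottleneck hop)."""
    is_min = signal.startswith("min")
    s = float("inf") if is_min else float("-inf")   # sender's initial value
    lm = None
    for hop, (capacity, abw, delay) in enumerate(HOPS, start=1):
        v = per_hop_value(signal, capacity, abw, delay)
        if (v < s) if is_min else (v > s):
            s, lm = v, hop
    return s, lm
```

This reproduces the underlined bottlenecks in Figure 5: `traverse("min(ABW)")` yields 20 at hop 5, `traverse("min(ABW/C)")` yields 0.125 (12.5%) at hop 1, and `traverse("max(PD)")` yields 18 at hop 3. The three signals identify three different bottleneck hops on the same path.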

4.3.2. Reverse Path

When the CSIG-tagged packet reaches the receiver end-host, the data fields in the CSIG tag are extracted and delivered to the transport layer at the receiver. The transport stores the data fields of the packet to be reflected, or a summary of these fields across packets. It reflects these data fields in the layer-4 CSIG reflection header on packets traversing the reverse path from receiver to sender. The CSIG reflection header is unmodified as the packet travels from receiver to sender. The sender extracts the CSIG data fields from the CSIG reflection header of the incoming packet, and hands them to the transport layer for use in applications at the sender. As a result, the sender transport learns the desired signal for a flow within approximately one round-trip time.

4.3.3. Multiple signals

The transport layer has a significant role to play in making CSIG usable. Although the CSIG data fields are carried on packets, the measurements are ultimately relevant at the flow / connection level for specific paths. If the sender transport desires to obtain multiple signals for the same flow, it MAY choose Signal Type on a per-packet basis (e.g., in a round robin fashion across the flow's packets), and internally keep track of all of the requested signals as part of the flow's state variables. This approach allows the sender transport to use all supported CSIG signals for use cases such as congestion control, load balancing and multipathing.
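
One way a sender transport could rotate Signal Type across the packets of a flow is sketched below. The state structure and function names (csig_flow_state, csig_next_signal_type, csig_record) are hypothetical, and the numbering of signal types is an assumption for illustration only.

```c
#include <stdint.h>

/* Illustrative signal types; numeric code points are assumptions. */
enum csig_signal {
    CSIG_MIN_ABW = 0,
    CSIG_MIN_ABW_C = 1,
    CSIG_MAX_PD = 2,
    CSIG_NUM_SIGNALS = 3
};

/* Hypothetical per-flow state kept by the sender transport. */
struct csig_flow_state {
    uint32_t next;                      /* next signal type to request */
    uint32_t latest[CSIG_NUM_SIGNALS];  /* last reflected value per type */
};

/* Choose the Signal Type for the next outgoing packet, round robin. */
static enum csig_signal csig_next_signal_type(struct csig_flow_state *f)
{
    enum csig_signal s = (enum csig_signal)f->next;
    f->next = (f->next + 1) % CSIG_NUM_SIGNALS;
    return s;
}

/* Store a reflected value under the signal type it was requested for. */
static void csig_record(struct csig_flow_state *f, enum csig_signal s,
                        uint32_t value)
{
    f->latest[s] = value;
}
```

With three signal types, each signal is refreshed on every third packet of the flow; a transport that weights one signal more heavily could use a different schedule.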

4.4. Device Roles

CSIG has three participating entities, each with their own roles and responsibilities for achieving end-to-end congestion signaling.

4.4.1. Sender host

The sender host is responsible for

(i) Constructing CSIG-tagged packets for flows of interest and initializing the CSIG-tag data fields on each packet as specified by the transport, and

(ii) Parsing the CSIG reflection header received in incoming packets and extracting CSIG data fields for use in the sender transport / applications.

Only the sender is allowed to insert CSIG-tags into packets.

4.4.2. Transit device

Transit devices are responsible for

(i) Computing and tracking congestion signals such as ABW and ABW/C for each port, and hop delay for each packet,

(ii) Parsing the CSIG-tag based on the TPID code point on incoming packets to identify the Signal type being requested, and

(iii) Performing compare-and-replace on the Signal value and locator metadata fields on the CSIG-tag based on the aggregation corresponding to the requested signal type (min / max)

Transit devices MUST NOT add CSIG tags to incoming packets that are not already CSIG-tagged. Transit devices MAY delete the CSIG tag before forwarding the packet. This functionality can be exercised when downstream devices are not CSIG-capable. Further discussion on this topic is in Section 6 on Incremental Deployment of CSIG.

4.4.3. Receiver host

The receiver host is responsible for

(i) Extracting the CSIG-tag on incoming packets and exposing the data fields to the transport layer and/or receiver-driven applications

(ii) Inserting and populating the CSIG Reflection header at the transport layer for packets traversing the reverse path to the sender.

4.4.4. Host roles for bidirectional flows

Note that for bi-directional flows, the Sender and Receiver are specific to each direction within the flow. For a bi-directional flow between hosts A and B,

(i) A plays the Sender host role and B plays the Receiver host role for data packets traveling from A to B, and similarly

(ii) B plays the Sender host role and A plays the Receiver host role for data packets traveling from B to A.

In this scenario, packets traversing from A to B contain both a CSIG-tag that captures the congestion signals on the forward A-->B path, and a CSIG reflection header that captures the CSIG data fields of the reverse B-->A path. Equivalently, packets traversing from B to A contain both a CSIG-tag that captures the congestion signals on the forward B-->A path, and a CSIG reflection header that captures the CSIG data fields of the reverse A-->B path.

5. Signals in CSIG

As described in the previous section, Signal Type indicates the type of congestion signal that CSIG-tag carries on each packet. Up to 8 signal types are supported by the compact format and up to 16 signal types are supported by the expanded format.

In this section, we concretely define three signals driven by use cases described in Section 8. While Section 8 covers how these three signals are useful to applications, this section focuses on precise definitions of these signals and how they may be implemented on transit devices.

Note for future extensions: Signals in CSIG are intended to be aggregation functions of individual per-hop or per-port signals across the path of a packet. The typical definition of such signals with max / min aggregations captures the notion of a path bottleneck for different definitions of bottleneck. However, structurally, the format supports arbitrary read-modify-write operations, including aggregations such as max, min, count and sum, allowing future use cases to leverage this structure for new signals.

5.1. Minimum Available Bandwidth - min(ABW)

min(ABW) captures the minimum absolute available bandwidth (in bps) across all the ports in the packet path. Available bandwidth is defined per egress port on each device.

5.1.1. ABW Computation

ABW can be computed using one of many algorithm variants, each having implications on HW or SW implementation complexity, timescales of computation and accuracy of the signal. In its rudimentary form, the raw ABW for a given egress port p over a time interval delta_t can be computed as follows:

// delta_txbit is the number of bits that exited on the wire
utilization_bps[p] = (delta_txbit[p]) / delta_t;
// capacity_bps[p] captures the link speed of port p
abw_bps[p] = capacity_bps[p] - utilization_bps[p];

Implementation of these computations relies on at least one of the following capabilities in the devices:

  • Timer-based computations: Most networking ASICs maintain hardware counters that track the number of bits that exit on each egress port. To compute available bandwidth, a periodic-timer thread in SW or HW triggers the computation and update of available bandwidth every delta_t time interval, where delta_t is a configurable parameter.

  • Per-packet computations: In this alternative, available bandwidth is computed and updated on every packet that is processed via the egress pipeline, typically in HW, e.g., via Exponential Weighted Moving Average (EWMA) estimation where the weights are configurable. delta_t is not an explicit parameter in this approach, and is implicitly determined by the EWMA weights.

Variants such as Discounted Rate Estimator (DRE) [CONGA] use a combination of per-packet updates and timer-based approaches.
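
The two styles of computation can be sketched as below. This is an illustration under stated assumptions rather than a required algorithm: the function names are hypothetical, floating point is used for readability (hardware would typically use fixed point), and clamping negative results to zero is our addition to guard against counter jitter.

```c
/* Timer-based raw ABW for one egress port over one interval.
 * delta_txbit: bits that exited on the wire during the interval.
 * delta_t_sec: interval length in seconds (assumed > 0). */
static double raw_abw_bps(double capacity_bps, double delta_txbit,
                          double delta_t_sec)
{
    double utilization_bps = delta_txbit / delta_t_sec;
    double abw_bps = capacity_bps - utilization_bps;
    /* Assumption: clamp at zero if counters report more than line rate. */
    return abw_bps < 0.0 ? 0.0 : abw_bps;
}

/* One EWMA step. weight is in (0, 1]; a larger weight tracks recent
 * samples more closely, i.e., corresponds to a shorter timescale. */
static double ewma_update(double prev, double sample, double weight)
{
    return (1.0 - weight) * prev + weight * sample;
}
```

For example, a 100 Gbps port that transmitted 25e9 bits in 0.5 s has a raw ABW of 50 Gbps over that interval.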

5.3. Shared requirements for min(ABW) and min(ABW/C)

5.3.1. Algorithm Requirements

To support min(ABW) or min(ABW/C) in CSIG, the device SHOULD support raw ABW computation with a configurable delta_t, and MAY support additional algorithms such as EWMA or DRE. This requirement enables the consistent interpretation of timescale over which available bandwidth is computed. This consistent interpretation allows end-hosts to tune their control decisions based on this timescale e.g., in relation to the flow's RTT.

5.3.2. Timescale and Accuracy Requirements

CSIG does not set strict requirements on the delta_t values to be supported by the implementation, except that it SHOULD be configurable to cover the range of RTTs in the network e.g., {10us, 100us, 1ms, 10ms, 100ms, 1s etc.}. Although one would expect all devices on a packet path to compute ABW at similar timescales to provide a consistent path-wide view, CSIG does NOT set strict requirements on the consistency of delta_t parameters chosen across the devices of a packet path. Choices of signal accuracy and timescales are a function of the use case and are not enforced by CSIG. End hosts MAY use EWMA across packets of a flow to calculate ABW or ABW/C over a longer timescale when CSIG on each packet carries ABW or ABW/C over shorter timescales. This technique is useful when flows traversing a given egress port span a wide range of RTTs while ABW computation over the egress port is fixed to a chosen timescale at each transit device.

5.3.3. Bucketing / Quantization Requirements

The computed ABW or ABW/C values MUST be compressed to fit in the available Signal value bits on the CSIG-tag. The device MUST support 32 fully configurable ABW buckets and ABW/C buckets for compact CSIG, and configurable quanta for uniform quantization in expanded CSIG. All devices along the packet path MUST be configured with the same buckets / quanta per signal type in order to correctly compute min(ABW) or min(ABW/C) along the path. Appendix A provides examples of these configurations.

Each transit device performs a compare-and-replace, i.e., updates the signal value on the CSIG tag if the incoming ABW or ABW/C signal value on the packet is higher than the device's locally computed ABW or ABW/C value for the packet's egress port, post bucketization / quantization. E.g.,

// Update the signal value on packet if current hop is the bottleneck
pkt->csig_tag->abw = min(pkt->csig_tag->abw, egr_port->abw)
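
A minimal sketch of bucketization for compact CSIG follows, assuming the 32 buckets are described by 31 ascending, operator-configured upper bounds. The function name csig_bucketize and the bounds-array representation are hypothetical; Appendix A remains the reference for actual configurations.

```c
#include <stdint.h>

#define CSIG_NUM_BUCKETS 32

/* Map a raw value to a bucket index in [0, 31].
 * upper[] holds ascending upper bounds for buckets 0..30;
 * bucket 31 catches every value above upper[30]. */
static uint8_t csig_bucketize(uint64_t value,
                              const uint64_t upper[CSIG_NUM_BUCKETS - 1])
{
    uint8_t b = 0;
    while (b < CSIG_NUM_BUCKETS - 1 && value > upper[b])
        b++;
    return b;
}
```

Because the mapping is monotonic, taking the minimum of bucket indices along the path selects the same bottleneck as taking the minimum of the raw values, provided every device uses the same bounds.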

5.3.4. QoS requirements

min(ABW) and min(ABW/C) are unambiguous signals with low implementation complexity on network devices. For simplicity, these definitions intentionally do NOT distinguish across QoS classes that may share the egress port. Available bandwidth per QoS class on an egress port is complex to define and meaningfully interpret since it depends on the scheduling policy (Strict Priority / WRR / Deficit WRR), buffer carving configuration and other policies (e.g., AQM) associated with QoS. Section 8 describes the applications of min(ABW) and min(ABW/C) as defined. We leave QoS-based variations of these signals and their potential use cases as future work.

5.4. Maximum Per-hop Delay - max(PD)

max(PD) captures the maximum per-hop delay experienced by a packet among all the hops in the packet path. Per-hop delay PD is the time spent by the packet in the device pipeline. It MAY include link layer delays or it MAY only include the delays observed in the forwarding pipeline.

5.4.1. Per-hop Delay Computation

Unlike ABW and ABW/C which are per-port signals, PD is a per-packet signal. It consists of PHY, MAC and switch pipeline delay experienced by the packet. Pipeline delay is the most relevant component as it captures congestion related queueing delay. Device implementations MAY track ingress and egress timestamps explicitly for each packet and perform a diff in the final stages of the pipeline. Precise definitions of these stages depend on the architecture of the device. For example, some devices could leverage existing timestamping support from tail timestamping capabilities for this purpose.
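
The timestamp difference can be sketched as below, assuming 32-bit nanosecond timestamps taken from a free-running counter. The function name, the timestamp width and the unit are assumptions; actual devices may differ.

```c
#include <stdint.h>

/* Per-hop delay from ingress and egress timestamps of the same packet.
 * Unsigned subtraction yields the correct delay across a single
 * wraparound of the 32-bit counter. */
static uint32_t hop_delay_ns(uint32_t ingress_ts_ns, uint32_t egress_ts_ns)
{
    return (uint32_t)(egress_ts_ns - ingress_ts_ns);
}
```

The result is only meaningful if the delay is shorter than the wrap period of the counter, which is about 4.29 seconds for a 32-bit nanosecond counter.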

5.4.2. Requirements

5.4.2.1. Algorithm Requirements

To support max(PD) in CSIG, the device SHOULD support per-packet tracking of delay experienced through the device.

5.4.2.2. Accuracy Requirements

It is desirable to have minimal gaps in the components of packet delays captured by the device. However, CSIG does NOT set strict requirements on the accuracy of PD to be supported by the implementation.

5.4.2.3. Bucketing / Quantization Requirements

The computed delay values MUST be compressed to fit in the available Signal value bits on the CSIG-tag. The device MUST support 32 fully configurable delay buckets for compact CSIG, and configurable quanta for uniform quantization in expanded CSIG. All devices along the packet path MUST be configured with the same buckets / quanta to correctly compute max(PD) along the path.

Each transit device performs a compare-and-replace, i.e., updates the signal value on the CSIG tag if the incoming delay signal value on the packet is lower than the device's locally computed delay for the packet, post bucketization / quantization. E.g.,

// Update the signal value on packet if current hop is the bottleneck
pkt->csig_tag->pd = max(pkt->csig_tag->pd, device->pkt->pd)

5.4.2.4. QoS requirements

Delay experienced by the packet on a device, as defined, is implicitly a QoS-specific signal. This is because the packet is subject to QoS policies as it traverses through the device pipeline, including prioritization, scheduling and buffering. For example, a high priority packet may see smaller delays than low priority packets. Therefore, the delay measured for the packet SHOULD include components in the pipeline where QoS policies are applied.

5.5. Locator Metadata Implementation

Locator metadata (LM) captures information about the bottleneck device or port, as described in Section 4.1.3.3. In this section, we discuss requirements for supporting LM in CSIG, and provide recommendations for commonly useful attributes to carry in LM.

5.5.1. Requirements

A single deployment MAY choose a subset of the attributes in Section 5.5.2 and/or newly defined attributes beyond those listed in Section 5.5.2 to include in LM. However, the total size of the individual attributes MUST be within 7 bits for Compact CSIG and within 16 bits for Expanded CSIG.

CSIG does not set strict requirements on the LM internal format i.e., how the individual attributes are organized among the available LM bits. However, this LM internal format MUST be consistent across devices in the deployment domain so that the end hosts can consistently interpret these bits. The LM internal format MAY be specific to each signal type.

Devices SHOULD support configuring per-port values for LM to be written on the CSIG-tag. Devices MAY provide more granular configurability of LM based on Signal type as well. CSIG packets egressing on a given port that have their Signal Value updated by the device MUST be updated with the LM corresponding to the port and Signal Type.

5.5.2. Attributes

Attributes can be designed to capture the level of resolution desired by use cases for pinpointing the bottleneck. Attributes may be encoded to fit within the limited number of LM bits available in CSIG.

We separate the list of attributes into compact attributes and expanded attributes. Compact attributes are motivated by the limited number of LM bits available in Compact CSIG, and therefore capture only the essential information about the bottleneck that is necessary for the use cases i.e., to inform control decisions or telemetry. Expanded attributes provide higher resolution information about the bottleneck, and can aid in directly pinpointing bottleneck devices or ports. Expanded attributes typically require more bits and are hence more suited for Expanded CSIG.

Examples of attributes are listed below.

5.5.2.1. Compact Attributes
  • Link capacity: Encodes the capacity of the bottleneck link. In typical deployments, the set of deployed link speeds is small and can be encoded using <= 5 bits.

  • Stage of the bottleneck: Encodes the stage of the topology where the bottleneck device / port is located. For example, in a 5-stage Clos topology, the stage of the device can be encoded with 3 bits.

  • Link orientation: Encodes the direction of a port in the context of the network topology. For example, with three categories - uplinks, downlinks and side-links - link orientation can be encoded using 2 bits.
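
As an illustration of how compact attributes could share the 7 LM bits of Compact CSIG, the sketch below packs a 5-bit link capacity code and a 2-bit link orientation. The bit positions, the enum values and the function names are assumptions; CSIG leaves the LM internal format to the deployment.

```c
#include <stdint.h>

/* Illustrative orientation codes; values are assumptions. */
enum link_orientation { LINK_UPLINK = 0, LINK_DOWNLINK = 1, LINK_SIDELINK = 2 };

/* Pack LM as: bits [6:2] = capacity code, bits [1:0] = orientation. */
static uint8_t lm_pack(uint8_t capacity_code, enum link_orientation o)
{
    return (uint8_t)(((capacity_code & 0x1F) << 2) | ((uint8_t)o & 0x3));
}

/* Extract the 5-bit capacity code from a packed LM value. */
static uint8_t lm_capacity_code(uint8_t lm)
{
    return (lm >> 2) & 0x1F;
}

/* Extract the 2-bit link orientation from a packed LM value. */
static enum link_orientation lm_orientation(uint8_t lm)
{
    return (enum link_orientation)(lm & 0x3);
}
```

End hosts would need the same table that maps capacity codes to link speeds as the devices, since only the code is carried in the tag.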

5.5.2.2. Expanded Attributes
  • Port ID: Encodes a unique identifier for each port within a deployment domain.

  • Device ID: Encodes a unique identifier for each device within a deployment domain.

  • TTL (Time-to-live): Captures the TTL value of the packet at the bottleneck device, represented using 8 bits. End hosts can use this attribute to infer the hop number at which the packet was bottlenecked.

LM attributes and encoding schemes are ultimately deployment specific and use-case specific. CSIG supports a flexible specification of LM to accommodate a variety of requirements and future applications.

6. Incremental Deployment of CSIG

Most production networks are heterogeneous, with a mix of network devices across generations. This document addresses the brownfield deployment of CSIG in a heterogeneous network, where there may be a mix of devices that offer varying degrees of support for CSIG packet construction and processing.

6.1. CSIG Stripping: A per egress-port primitive

Before describing incremental deployment, we introduce the idea of CSIG stripping, an action primitive which is foundational to deploying CSIG in a heterogeneous environment.

Devices that support CSIG MUST be capable of removing the CSIG tag before forwarding the packet. Devices MUST allow configuring CSIG stripping on a per egress-port basis. If a port is configured to strip CSIG, then all CSIG-tagged packets that egress on this port must have the tag removed before being forwarded.
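
A software view of the per egress-port stripping primitive might look like the sketch below. It assumes a 4-byte compact tag (the size of a single VLAN tag) located at a known byte offset in the frame; the struct, the function name and the way the tag is located are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: compact CSIG tag has the size of a single VLAN tag. */
#define CSIG_COMPACT_TAG_LEN 4

/* Hypothetical per egress-port configuration. */
struct egress_port {
    bool strip_csig;
};

/* Remove the CSIG tag if the port is configured to strip it.
 * tag_off is the byte offset of the tag within the frame; the caller
 * guarantees tag_off + CSIG_COMPACT_TAG_LEN <= len for tagged frames.
 * Returns the resulting frame length. */
static size_t egress_apply_strip(const struct egress_port *port,
                                 uint8_t *frame, size_t len,
                                 bool tagged, size_t tag_off)
{
    if (!port->strip_csig || !tagged)
        return len;
    memmove(frame + tag_off, frame + tag_off + CSIG_COMPACT_TAG_LEN,
            len - tag_off - CSIG_COMPACT_TAG_LEN);
    return len - CSIG_COMPACT_TAG_LEN;
}
```

Untagged packets and packets leaving on ports without stripping enabled are forwarded unchanged.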

In the following sections, we describe how this capability can enable incremental deployment.

6.2. Levels of CSIG Support

We first classify devices into three simplified categories based on their level of CSIG support. In the subsequent sections we describe how CSIG can interoperate with each category of device. Note that the level of support is a function of the tag placement and whether the compact or expanded CSIG tag format is used as shown in Section 4.1.

6.2.1. Discard

Devices in this category are not capable of recognizing or parsing CSIG tagged packets. If such packets are received, they will simply be dropped.

6.2.2. Pass-through

Devices in this category are able to recognize and parse CSIG tagged packets, and transparently forward the packet with the tag intact or with the tag stripped to neighboring devices (in the case of transit devices) or to the end host transport layer (in the case of end hosts). However, they do not support updating the CSIG data fields on the tag.

Some devices that do not natively support CSIG may be configured to support pass-through mode for CSIG if they support VLAN tags with configurable TPIDs. This is discussed in more detail in Section 6.4.

6.2.3. Complete

Devices in this category support the complete CSIG protocol, including recognition, parsing, forwarding, tag-stripping, signal computation, and signal updates on the tag. However, only a subset of signal types may be supported.

6.2.3.1. Software-assisted support

It is noteworthy that in some devices that do not natively support CSIG, resources available for VLAN tag processing can be repurposed to support CSIG for certain signal types using a combination of software and hardware capabilities. We refer to this level of support as software-assisted support. This capability is discussed in more detail in Section 6.4.

6.2.3.2. Native support

Devices that natively support CSIG are explicitly equipped with the hardware capabilities required to implement the CSIG protocol.

A CSIG domain is a deployment domain where all network devices have complete support or pass-through support for CSIG.

6.3. Interoperability in Brownfield Deployments

In this section, we first define the requirements for CSIG Interoperability in brownfield deployments. Then, we consider devices with all levels of support described in Section 6.2 and describe how these devices MAY be configured to achieve interoperability. Note that the following descriptions apply separately to both Compact and Expanded CSIG-tags.

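Table 1: Interoperability with devices having different levels of CSIG support

+---------------------------+--------------------------------------------------+
| Device category           | Interop support                                  |
+---------------------------+--------------------------------------------------+
| Discard                   | Upstream devices must strip CSIG tags before     |
|                           | packets reach this device                        |
+---------------------------+--------------------------------------------------+
| Pass-through support only | Device may strip tag or transparently forward    |
|                           | with tag unmodified depending on e2e signal      |
|                           | accuracy requirements                            |
+---------------------------+--------------------------------------------------+
| Native CSIG support       | Device updates CSIG-tag as per protocol          |
+---------------------------+--------------------------------------------------+
| SW-assisted CSIG support  | Device updates CSIG-tag using VLAN match/action  |
|                           | with approximate signals computed in S/W agent   |
+---------------------------+--------------------------------------------------+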

6.3.1. Requirements for interoperability

Forwarding: The fundamental requirement is that no CSIG-tagged packet should be dropped in the network due to a lack of CSIG support on a device. This requirement means that CSIG-tagged packets MUST NOT reach devices in the Discard category with the tag intact: either such packets are never forwarded to those devices, or their CSIG-tag is stripped before reaching them.

Negotiation: End hosts / flows SHOULD ensure that the path (including end hosts and transit devices) is CSIG-capable before enabling CSIG-tagging on packets. Devices in the Discard category should not require any changes in order to achieve negotiation. This requirement is to ensure correctness of data fields in end-to-end CSIG operation, and to interoperate with legacy devices or software stacks.

6.3.2. Forwarding

To achieve the forwarding interoperability requirements for CSIG, CSIG stripping may be exercised as follows:

  • When a neighboring device connected to a given egress port is a Discard device and cannot parse CSIG packets, this egress port MUST be configured to strip the tag on outgoing packets to ensure that the packet does not get dropped downstream.

  • When a device supports Pass-through only or does not support the requested signal type on a CSIG packet, egress ports on this device MAY be configured to strip the tag on outgoing packets to ensure that CSIG does not carry inaccurate information. In some use cases where it is acceptable for CSIG to miss capturing signals on certain hops, pass-through devices MAY transparently forward the packet with the CSIG tag intact.

  • At the boundary of a CSIG domain, device ports that are connected to devices outside of the CSIG domain MUST strip the tag to ensure that packets exiting the domain do not contain CSIG-tags. Only egress ports connected to devices within the CSIG domain SHOULD retain CSIG-tags on outgoing packets.

CSIG packets and non-CSIG packets can be used together in a brownfield setting. This means that end hosts MUST be capable of transmitting and receiving both CSIG packets and non-CSIG packets, including for the same flow. A packet marked with a CSIG-tag at the sender host may arrive at the receiver host without the tag. In addition, Compact CSIG and Expanded CSIG packets may be used together on the same network.

6.3.3. Negotiation

Support for sending and receiving CSIG-tagged packets may require software and/or hardware changes on transit devices and end hosts. In many deployments, particularly those requiring hardware upgrades to support CSIG (such as Switch or NIC support), version stragglers continue to exist for long time horizons for a variety of reasons, and interoperability with such stragglers is a critical requirement. Without negotiation for CSIG capability, devices that are not CSIG-compliant may drop CSIG packets and thus blackhole traffic. Negotiating for CSIG-capability of a path is critical to ensure that the CSIG protocol operates safely end-to-end in a brownfield deployment.

A path is considered CSIG-capable if end-hosts have at least Pass-through CSIG support and transit devices have Complete CSIG support (native or software-assisted). Before sending CSIG-tagged packets on a network flow, end-hosts must negotiate for path CSIG-capability. We discuss one approach to negotiation for path CSIG-capability, which involves two parts: negotiation for transit device support and negotiation for end host support.

6.3.3.1. Negotiation for transit device support

In this section, we describe one simple approach to negotiate CSIG support on transit devices with CSIG stripping.

CSIG stripping can be used to implicitly achieve negotiation by removing the CSIG-tag from the packet header at or before devices on the packet path that do not have the desired level of CSIG support. If the receiver end host receives a CSIG-tagged packet, it serves as an explicit indication that all devices on the packet path, including transit devices and end-hosts, have the desired CSIG support. If the receiver end host receives a packet without a CSIG-tag, it is an indication that one or more devices do not have the desired CSIG support, or that the packet was not tagged at the sender to begin with. This indication can be implicitly reported to the sender via an empty / invalid CSIG reflection header and the sender can determine whether the packet path was CSIG-capable.

This approach assumes that each device has knowledge about the level of CSIG support in its immediate neighboring devices, which is viable through configuration in typical private SDN networks. In the absence of centralization, mechanisms such as a new LLDP TLV may be defined to advertise aspects of CSIG support on the device, including compact vs expanded CSIG-tag support, signal types that are supported, pass-through vs complete support etc. We leave the details of such an LLDP extension for future extensions of the protocol.

6.3.3.2. Negotiation for end host support

A sender end host may need to explicitly negotiate with the remote end-host to ensure that the host networking stack at the remote host has the desired level of CSIG support. Ideally such explicit CSIG negotiation should be performed during or before the initial connection handshake, after which CSIG is enabled / disabled on packets post connection establishment. It may also be necessary to explicitly negotiate the use of CSIG Reflection in transports, separately from the negotiation for path CSIG-capability. For example, in TCP, negotiation is required to use the CSIG Reflection TCP Option. We leave the details of such negotiation schemes for future extensions of the protocol.

6.4. Backward Compatibility via Software-assisted CSIG

Transit devices without native CSIG support MAY participate in CSIG protocol via a Software-assisted approach. This allows brownfield deployments to reap incremental benefits of CSIG without having to upgrade a significant fraction of device HW on their networks.

Since compact and expanded CSIG tags are structurally similar to single VLAN-tags and double VLAN-tags respectively, VLAN resources in a transit device can be repurposed to support CSIG updates. More specifically, configurable TPIDs for VLAN tags can be used to treat CSIG tags as VLAN tags, and VLAN match/action resources for tag updates in the device can be leveraged to support updating CSIG data fields on the tag.

For signals such as ABW and ABW/C, a software agent running on the CPU of a transit device can periodically compute these signals based on hardware byte counters, and program VLAN match/action rules in the dataplane to update CSIG data fields based on the computed signals. Since the match/action rules are in the dataplane, CSIG packets can be processed at line rate without CPU involvement. However the match/action rules themselves can be updated at a slower cadence via the software agent.
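
One possible way for the software agent to express a min compare-and-replace with exact-match rules is sketched below: for a port whose current local bucket is b, every incoming Signal Value greater than b is rewritten to b, and all other values pass unchanged. The rule structure and function name are hypothetical and do not correspond to any particular device API; 32 signal value buckets are assumed, as in compact CSIG.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exact-match rewrite rule on the Signal Value field. */
struct csig_rule {
    uint8_t match_value;  /* incoming Signal Value to match */
    uint8_t new_value;    /* Signal Value to write on a match */
};

/* Build rules emulating tag = min(tag, local_bucket) for one port.
 * Returns the number of rules written (at most 31). */
static size_t build_min_rules(uint8_t local_bucket,
                              struct csig_rule rules[32])
{
    size_t n = 0;
    for (unsigned v = local_bucket + 1u; v < 32; v++) {
        rules[n].match_value = (uint8_t)v;
        rules[n].new_value = local_bucket;
        n++;
    }
    return n;
}
```

The agent would rebuild and reprogram this rule set for a port only when its periodically computed bucket changes, which keeps the update rate in the control path low.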

Compact CSIG is designed to enable software-assisted backward compatibility while operating within the constraints of commonly available VLAN resources on transit devices. Backward compatibility via software is a fundamental feature in the design of Compact CSIG.

Note that it may not be possible to track signal types such as hop delay per packet in a software agent. However, approximations of the signal based on available hardware counters and registers (such as latency histograms) can be implemented in the agent if software-assisted support is desired for such signal types.

6.5. Greenfield deployments

In greenfield deployments of CSIG domains, all devices in the domain natively support the CSIG protocol.

Expanded CSIG is designed to leverage greenfield deployments where backward compatibility, negotiation and interoperability are not requirements. It provides enhanced signal resolution via higher bit width for signal values and locator metadata in comparison to Compact CSIG. Expanded CSIG can also support up to 16 signal types.

Devices in Greenfield CSIG domains MUST support CSIG stripping at the domain boundary to ensure that CSIG packets don't exit the domain.

7. Design Rationale

CSIG's design choices are shaped by an end-to-end perspective of what matters to applications and where tradeoffs can be made towards simplicity and practicality. In this section, we discuss the rationale behind CSIG's design and the advantages it provides over existing state of the art.

7.1. Choice of Layer 2

CSIG-tag offsets at layer 2 are independent of headers and payload at layer 3 and above, which means that only a small set of tag placement offsets need to be supported for reading and updating the header. This makes device implementations of CSIG simpler. In contrast, in-band network telemetry schemes implemented at layer 3 or higher require support for a large set of packet formats as this set grows by the cross-product of formats / encapsulations at each layer. This complexity forces device implementations to restrict support for only a fraction of packet formats / encapsulations, hindering the adoption and deployment of such schemes. CSIG-tagging, on the other hand, is simpler to support and deploy since it is at layer 2 and has a fixed offset despite various formats / encapsulation at layer 3 and above.

The choice of layer 2 also makes compatibility with in-network tunneling and encryption simpler, which are common features in data center deployments.

  • CSIG-tags are, by design, compatible with PSP encrypted packets and IPSec encrypted packets, where Layer 4 headers and payloads may be encrypted.

  • CSIG tags are carried through Layer 3 tunnels e.g., IP-in-IP, VxLAN, Geneve, at a fixed offset in the packet header. This avoids the need to copy and relocate CSIG tags across inner / outer headers during encapsulation and decapsulation of packets, which would be necessary if implemented instead at layers 3 or higher.

  • CSIG tags are placed as the last header in the Layer 2 header stack to ensure compatibility with layer 2 and layer 2.5 tunneled domains as well. The placement of CSIG tags in MACSec and other Layer 2 encapsulations is shown in the table in Section 4.1.

Most in-band network telemetry schemes are not backward compatible. However, CSIG tag's structural similarity to VLAN tags enables backward compatibility with many devices that don't have native CSIG support as described in Section 6.4. This allows deployments to reap the benefits of CSIG without having to upgrade a significant portion of their network hardware.

In addition, since expanded CSIG is limited to 8B, i.e., the size of double VLAN tags, the packet parsing depth required on devices to read and process headers at layer 3 and above is not affected.

In summary, the choice of Layer 2 for CSIG-tag is a key part of CSIG's simplicity and efficiency, since it keeps device implementations simple while supporting multiple encapsulations and backward compatibility.

7.2. Separation of headers for CSIG-tag and reflection

CSIG's design separates the CSIG-tag and CSIG reflection headers into distinct layers. This decoupling enables end hosts to develop different transport-specific implementations of CSIG reflection while sharing the underlying CSIG-tag mechanism. This means that transit device behaviors are not impacted by innovations in CSIG reflection.

In addition, this decoupling enables the separate tracking of forward and reverse path bottlenecks. This is important since CCAs typically prefer to react to congestion on the forward path only and not react to congestion on the reverse path. In contrast, in-band schemes that mix signaling and reflection into the same header do not provide distinctions between forward and reverse path.

7.3. Fixed-size headers

CSIG's fixed-size headers constitute less than 0.2% bandwidth overhead in packets with 4k or 9k MTU. This means that there is no need for fragmentation or increasing MTU size for the purposes of supporting multiple congestion signals. Furthermore, the performance of network device packets per second (PPS) is minimally impacted by the inclusion of CSIG tag and reflection headers.

The low overhead allows CSIG to be enabled on all live data packets or explicit probe packets or sampled packets. This is an important capability because it allows for the direct quantification of the bottlenecks experienced by the data packets themselves instead of having to rely on probes. However, leveraging CSIG on probes or sampled packets is still an option for deployments that require such visibility.

CSIG is designed to perform compare-and-replace (or more generally read-modify-write for future extensions), with a fixed size header. Therefore, CSIG is not limited by the number of hops in a network path (i.e., diameter of the network) unlike schemes that append information at each hop.

7.4. Signal Design

CSIG's signal design focuses on simple, aggregate signals that are driven by use cases, as demonstrated in Section 5 and Section 8.

CSIG allows a single packet to carry only one congestion signal. To obtain multiple signals at the end hosts, it takes advantage of the fact that the end host can request different signal types across multiple packets of a flow. In contrast, other schemes tend to overload each packet with a lot of information, including metadata about multiple signals, which can be limiting. Moreover, CSIG-tag's format is also extensible, which means that it can be adapted to support additional signal types and locator metadata in the future without compromising the advantages of CSIG's design.
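
One way an end host might spread signal requests across the packets of a flow is a simple round-robin rotation, sketched below. The rotation policy and the `LatestSignals` helper are assumptions made for illustration; this document does not mandate how a sender chooses the signal type for each packet.

```python
from itertools import cycle

# Signal types named in this document; the round-robin order is an assumption.
SIGNAL_TYPES = ("min(ABW)", "min(ABW/C)", "max(PD)")


def signal_type_requests(signal_types=SIGNAL_TYPES):
    """Return an iterator giving the signal type to request on each successive packet."""
    return cycle(signal_types)


class LatestSignals:
    """Keeps the most recent reflected value for each signal type of a flow."""

    def __init__(self):
        self.values = {}

    def on_feedback(self, signal_type, value):
        self.values[signal_type] = value


requests = signal_type_requests()
first_six = [next(requests) for _ in range(6)]
print(first_six)  # each signal type is requested once every three packets
```

With three signal types, the sender refreshes its view of each signal every third packet while every packet still carries a single fixed-size tag.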

A unique feature of Compact CSIG's design is the ability to fully configure signal value buckets, which allows for efficient signal representations with a limited number of bits. For example, the encodings can be adjusted to provide greater granularity at value ranges that are more important to the application, and lower granularity at ranges that are less important. Similarly, locator metadata can be efficiently represented by carrying fewer bits of relevant compressed attributes of the bottleneck that are important to applications. Expanded CSIG, on the other hand, uses uniform signal quantization for more accuracy and provides even more flexibility in defining signals and locator metadata with a larger bit width.

8. Use Cases defined by Bottleneck Signals

The use cases for CSIG are motivated by congestion control, traffic management and network debuggability. These use cases predate CSIG in production, where they typically rely on signals that are measured end-to-end (such as packet loss and delay) or on out-of-band signals from network devices (such as port utilization). CSIG provides a boost in performance, efficiency and debuggability by augmenting these existing use cases with explicit in-band measurements.

In this document, we present the use cases for the three signals defined in Section 5. At the crux of a signal is the definition of bottleneck. Over time we envision use cases for other signals that would define a bottleneck, e.g., the maximum number of co-sharing flows on a link. For each of these new signals, locator metadata can continue to provide attributes about the bottleneck port such as port capacity.

8.1. Congestion Control

CCAs can make use of CSIG signals in at least two different ways. First, existing CCAs can use CSIG values to address blind spots in end-to-end signals such as packet loss, delay, and delivery rate. This use case is immediately relevant because most production networks deploy some form of end-to-end congestion control, including Swift [SWIFT] and BBR [BBR]. Second, entirely new congestion control algorithms can be designed that use CSIG as their primary signal. We focus below on the former category.

E2E CCAs come in various forms; for simplicity, we describe the use cases taking Swift [SWIFT] as the baseline. Swift is a delay-based congestion control algorithm that uses accurate round-trip time (RTT) measurements taken via NIC hardware timestamps. The signals described here can be applied to other CCAs and are NOT limited to Swift.

The interpretation and application of CSIG for congestion control in lossless networks and in networks that use packet spraying are topics for future research.

8.1.1. Using maximum per-hop delay in E2E CC

E2E RTT measurements used in Swift include the queueing delays on all hops along the flow's path, on both the forward and reverse paths. A consequence of using a lumped delay signal is that a flow reduces its sending rate in response to delays that it may not be able to directly control. Furthermore, in deployments where there can be multiple congested links along the path of a flow, it is desirable to modulate the sending rate of a flow in response to just the maximum of the per-hop delays, max(PD), along the flow's path. Substituting the bottleneck delay for the end-to-end measured delay in Swift's equation yields the following:

// Reduce the congestion window when bottleneck hop delay
// exceeds a chosen target hop delay
if (max(PD) > target_delay) then
  md = beta * (max(PD) - target_delay) / max(PD)
  cwnd = (1 - md) * cwnd
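
The decrease rule above can be written as a small runnable function. This is a sketch of the pseudocode only: the function name, the units and the numeric values of `beta` and `target_delay_us` are illustrative assumptions, and additional safeguards that a production CCA would apply (for example, bounding the decrease) are omitted.

```python
def reduce_cwnd(cwnd: float, max_pd_us: float, target_delay_us: float, beta: float) -> float:
    """Multiplicative decrease driven by the bottleneck hop delay max(PD)."""
    if max_pd_us <= target_delay_us:
        return cwnd  # bottleneck delay is within target: no decrease
    md = beta * (max_pd_us - target_delay_us) / max_pd_us
    return (1.0 - md) * cwnd


# Illustrative numbers only: beta = 0.8, target = 25 us, observed max(PD) = 50 us.
new_cwnd = reduce_cwnd(cwnd=100.0, max_pd_us=50.0, target_delay_us=25.0, beta=0.8)
print(new_cwnd)  # md = 0.8 * 25 / 50 = 0.4, so cwnd shrinks from 100 to about 60
```

Because md depends only on the worst hop, delay accumulated on the reverse path or on other lightly loaded hops does not trigger a decrease.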

Poseidon [POSEIDON] is a CCA proposed in the literature that exemplifies the use of maximum per-hop delay in reducing the congestion window. By incorporating bottleneck information in the congestion control response, Poseidon flows achieve higher throughput in the presence of reverse path congestion and of congestion across multiple network hops. Algorithm 1 in [POSEIDON] details the use of maximum per-hop delay in both the increase and the decrease of the congestion window.

8.1.3. Using minimum available bandwidth in E2E CC

E2E CC uses heuristics to determine the initial transfer rate for newly established connections. Starting too slowly would cause the transfer to take longer than necessary while wasting available bandwidth, whereas starting too quickly would cause queue delays and packet drops. The same dilemma exists for transfers that are starting on a connection that has been idle for multiple round-trip times.

In networks where we know ahead of time that the degree of multiplexing is low, i.e., just a handful of flows co-exist on the link at any point in time, transfers complete quickly when they "jump-start" to use up all of the bottleneck bandwidth. This is especially helpful when transports employ robust loss recovery mechanisms, so that any packets lost to queue overflow can be quickly recovered.

As an example, on an empty network of 200Gbps, a single transfer can use up the entire 200Gbps in the second RTT, after the CSIG feedback in the first RTT indicates the availability of 200Gbps at the bottleneck link.

CSIG's min(ABW) bottleneck bandwidth allows transfers to start safely at line-rate.
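
As a rough sketch of the jump-start computation, the reported min(ABW) can be converted into a congestion window by multiplying it by the RTT. The function name, the 20 us RTT and the 4096-byte MSS below are assumptions chosen for illustration; only the 200Gbps figure comes from the example above.

```python
import math


def jump_start_cwnd_packets(min_abw_bps: float, rtt_s: float, mss_bytes: int) -> int:
    """Packets needed to fill the reported available bandwidth for one RTT."""
    bdp_bytes = min_abw_bps * rtt_s / 8.0
    return max(1, math.ceil(bdp_bytes / mss_bytes))


# 200 Gbps of available bandwidth, an assumed 20 us RTT and 4096-byte MSS:
# roughly 500,000 bytes in flight, which rounds up to 123 packets.
print(jump_start_cwnd_packets(200e9, 20e-6, 4096))
```

A transport could apply this window in the second RTT of a transfer, once the first RTT's CSIG feedback has arrived.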

8.2. Traffic Management

CSIG encodes the most notable information about the path for each flow by carrying bottleneck link signals and bottleneck locator metadata. This path-level information, which is obtained directly from application data packets rather than synthetic probes, is directly attributable to the flow and is valuable for traffic engineering and application performance debugging.

8.2.1. Load Balancing and Multipathing

Datacenter topologies offer a diverse set of paths between any source-destination pair. Transports employ techniques such as Protective Load Balancing [PLB] and Multipathing [RFC8684] to spread traffic across these paths. Load balancing and multipathing in transports use a combination of end-to-end signals and heuristics to select which paths to use and how much traffic to send on each path.

Using CSIG signals from bottleneck links along the diverse set of paths, load balancing and multipathing schemes can select high quality paths with lower congestion, and spread traffic across them in a congestion-aware manner.
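
A minimal sketch of congestion-aware spreading is to weight each path by the min(ABW) most recently reported for it. The proportional policy, the function name and the path labels are assumptions for illustration; real schemes would also need to account for signal staleness and for the traffic they themselves shift.

```python
def path_weights(min_abw_by_path: dict[str, float]) -> dict[str, float]:
    """Split traffic across paths in proportion to their reported min(ABW)."""
    total = sum(min_abw_by_path.values())
    if total <= 0:
        # No path reports spare bandwidth: fall back to an even split.
        return {path: 1.0 / len(min_abw_by_path) for path in min_abw_by_path}
    return {path: abw / total for path, abw in min_abw_by_path.items()}


# Hypothetical per-path min(ABW) reports in bits per second.
weights = path_weights({"path-a": 80e9, "path-b": 20e9, "path-c": 0.0})
print(weights)  # path-a gets about 80% of the traffic, path-c gets none
```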

Locator metadata can also be used to distinguish between incast congestion and core network congestion, which can then be used to adjust load balancing / multipathing actions. For instance, the stage of the bottleneck and link orientation attributes are enough to determine whether the last hop is the bottleneck or not. When the last hop is the bottleneck, flow-level load balancing / multipathing actions may not be effective and may, in fact, worsen incasts. Such cases may require application-level load balancing or job scheduling techniques to distribute traffic. However, when congestion is instead known to be in the core network, flow-level load balancing / multipathing actions can route around congested areas and improve performance.
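
The last-hop check described above could look like the following sketch. The encoding of the locator attributes is hypothetical: it assumes that stage 0 denotes the switch stage directly attached to hosts and that link orientation is reported as uplink or downlink. The actual attribute encodings are deployment-specific.

```python
from enum import Enum


class Orientation(Enum):
    UPLINK = "uplink"      # towards the core of the network
    DOWNLINK = "downlink"  # towards the hosts


# Assumption: stage 0 is the switch stage directly attached to hosts.
HOST_FACING_STAGE = 0


def is_last_hop_bottleneck(stage: int, orientation: Orientation) -> bool:
    """True when the bottleneck is the link from the last switch to the receiver."""
    return stage == HOST_FACING_STAGE and orientation is Orientation.DOWNLINK


def suggested_action(stage: int, orientation: Orientation) -> str:
    if is_last_hop_bottleneck(stage, orientation):
        # Repathing cannot avoid the receiver's own link (incast).
        return "defer to application-level load balancing or job scheduling"
    return "repath or rebalance at the flow level"


print(suggested_action(0, Orientation.DOWNLINK))
print(suggested_action(2, Orientation.UPLINK))
```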

8.2.2. Traffic Engineering

Traffic Engineering (TE) carves out paths with appropriate bandwidth across aggregate source-destination pairs. Examples within a datacenter include the Datacenter Network Interconnection Layer (DCNI) [JUPITEREVOL]. CSIG can be used to provide fine-grained path-level information, including short-timescale microburst congestion, to TE systems. By using summarized CSIG signals aggregated both spatially and temporally across flows, TE can select paths and balance traffic at the datacenter level to accommodate bursty traffic, e.g., from ML workloads.

8.3. Application Performance Debugging

Applications often complain that the network is slow, but it can be challenging to identify the specific segment of the network that is causing the problem. This is especially true with the scale of datacenters, where flows can traverse up to nine hops [JUPITEREVOL]. Figuring out where the bottleneck is and the timescales at which the path poses a bottleneck is like searching for a needle in a haystack for an application with thousands of flows across various source-destination pairs.

On application network flows, CSIG information, with its bottleneck locator, can quickly and precisely answer why the flows are slow and where the network / path bottlenecks are.
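
For an application with many flows, one simple way to use this information is to count how often each bottleneck locator appears in feedback that exceeds a delay threshold. The sketch below assumes feedback is available as (locator, max(PD)) pairs; the function name, locator labels and threshold are illustrative.

```python
from collections import Counter


def top_bottlenecks(samples, delay_threshold_us, n=3):
    """Rank bottleneck locators by how often they exceed a delay threshold.

    `samples` is an iterable of (locator, max_pd_us) pairs gathered from the
    CSIG feedback of an application's flows.
    """
    hot = Counter(loc for loc, max_pd_us in samples if max_pd_us > delay_threshold_us)
    return hot.most_common(n)


# Hypothetical feedback samples collected across several flows.
samples = [("port-1", 100), ("port-2", 5), ("port-1", 300), ("port-3", 150), ("port-1", 2)]
print(top_bottlenecks(samples, delay_threshold_us=50))
```

Here port-1 is reported above the threshold twice and port-3 once, which points troubleshooting at port-1 first.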

CSIG can also be enabled on mesh prober systems similar to [PINGMESH] to augment end-to-end probe measurements between any two servers with bottleneck information to aid troubleshooting.

9. Security Considerations

Only trusted sender hosts MUST be allowed to construct, initialize and insert a CSIG tag into packets for authorized flows. Depending on the deployment, the authorization can be enforced at the NICs or at the switches, akin to firewall rules. CSIG stripping may also be employed as a fencing rule at domain boundaries to ensure that unauthorized CSIG-tags do not cross these boundaries.

A rogue or broken network device in a private network might write arbitrary CSIG values, or insert a CSIG tag into packets at a transit node. We expect there to be checks and balances to identify non-functioning or rogue network devices and remove them from a private network, as they can cause greater harm than distributing misleading CSIG values.

10. IANA Considerations

There are no IANA considerations. The CSIG Tag Protocol Identifier (TPID) is requested from IEEE.

11. Conclusions

With the increased deployment of applications that are sensitive to delay and bandwidth usage in data centers, e.g., AI/ML/HPC workloads and RDMA-based applications, relying solely on end-to-end signals is insufficient under dynamically changing traffic patterns. Simple and timely signals from network devices to end-hosts can augment end-host transports so that they make optimal use of datacenter bandwidth. CSIG is a simple, practical and deployable protocol for distributing congestion information in networks; it builds on the successful aspects of prior work and is grounded in the use cases of congestion control, traffic management and network debuggability.

12. Acknowledgements

This work would not have been possible without the following individuals, whose various engineering and design contributions shaped CSIG and its use cases:

Christopher Alfeld, Neelesh Bansod, Jis Ben, Neal Cardwell, Yongzhou Chen, Yuchung Cheng, Dal Chand Choudhary, Mick Fingleton, Mahmudul Hasan, Jeffrey Ji, Marc De Kruijf, Praveen Kumar, Rich Lane, Chang Liu, Morley Mao, Carl Mauer, Sachin Menezes, Nipen Mody, Masoud Moshref, Alex Rumyantsev, Gerald Schmidt, Arjun Singh, Arjun Singhvi, Babru Thatikunta, Jeff Tikkanen, Frank Uyeda, Brian Vasquez, Rui Wang, Hassan Wassel, Yong Xia, Zhengxu Xia, Kevin Yang, Liangcheng Yu.

We would like to thank Arjun Singh, David Wetherall, Neal Cardwell, Akash Deshpande and Arvind Krishnamurthy for their feedback on several portions of this document.

13. Normative References

[BBR]
Cardwell, N., Cheng, Y., Gunn, C., Yeganeh, S., and V. Jacobson, "BBR: congestion-based congestion control", Communications of the ACM vol. 60, no. 2, pp. 58-66, DOI 10.1145/3009824, <https://doi.org/10.1145/3009824>.
[CONGA]
Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Lam, V., Matus, F., Pan, R., Yadav, N., and G. Varghese, "CONGA: distributed congestion-aware load balancing for datacenters", ACM SIGCOMM Computer Communication Review vol. 44, no. 4, pp. 503-514, DOI 10.1145/2740070.2626316, <https://doi.org/10.1145/2740070.2626316>.
[DCQCN]
Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M. Zhang, "Congestion Control for Large-Scale RDMA Deployments", ACM SIGCOMM Computer Communication Review vol. 45, no. 4, pp. 523-536, DOI 10.1145/2829988.2787484, <https://doi.org/10.1145/2829988.2787484>.
[HPCCPLUS]
"High-precision congestion control (HPCC++) deployment at Alibaba leveraging In-band Flow Analyzer (IFA)", n.d., <https://www.broadcom.com/blog/high-precision-congestion-control>.
[I-D.kumar-ippm-ifa]
Kumar, J., Anubolu, S., Lemon, J., Manur, R., Holbrook, H., Ghanwani, A., Cai, D., Ou, H., Li, Y., and X. Wang, "Inband Flow Analyzer", Work in Progress, Internet-Draft, draft-kumar-ippm-ifa-07, <https://datatracker.ietf.org/doc/html/draft-kumar-ippm-ifa-07>.
[I-D.miao-tsv-hpcc]
Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B., Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++: Enhanced High Precision Congestion Control", Work in Progress, Internet-Draft, draft-miao-tsv-hpcc-02, <https://datatracker.ietf.org/doc/html/draft-miao-tsv-hpcc-02>.
[JUPITEREVOL]
Poutievski, L., Mashayekhi, O., Ong, J., Singh, A., Tariq, M., Wang, R., Zhang, J., Beauregard, V., Conner, P., Gribble, S., Kapoor, R., Kratzer, S., Li, N., Liu, H., Nagaraj, K., Ornstein, J., Sawhney, S., Urata, R., Vicisano, L., Yasumura, K., Zhang, S., Zhou, J., and A. Vahdat, "Jupiter evolving: transforming google's datacenter network via optical circuit switches and software-defined networking", Proceedings of the ACM SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544265, <https://doi.org/10.1145/3544216.3544265>.
[P4-INT]
"In-band Network Telemetry (INT) Dataplane Specification", n.d., <https://p4.org/p4-spec/docs/INT_v2_1.pdf>.
[PINGMESH]
Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz, D., Liu, Z., Wang, V., Pang, B., Chen, H., Lin, Z., and V. Kurien, "Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis", ACM SIGCOMM Computer Communication Review vol. 45, no. 4, pp. 139-152, DOI 10.1145/2829988.2787496, <https://doi.org/10.1145/2829988.2787496>.
[PLB]
Qureshi, M., Cheng, Y., Yin, Q., Fu, Q., Kumar, G., Moshref, M., Yan, J., Jacobson, V., Wetherall, D., and A. Kabbani, "PLB: congestion signals are simple and effective for network load balancing", Proceedings of the ACM SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544226, <https://doi.org/10.1145/3544216.3544226>.
[PONYEXPRESS]
Marty, M., de Kruijf, M., Adriaens, J., Alfeld, C., Bauer, S., Contavalli, C., Dalton, M., Dukkipati, N., Evans, W., Gribble, S., Kidd, N., Kononov, R., Kumar, G., Mauer, C., Musick, E., Olson, L., Rubow, E., Ryan, M., Springborn, K., Turner, P., Valancius, V., Wang, X., and A. Vahdat, "Snap: a microkernel approach to host networking", Proceedings of the 27th ACM Symposium on Operating Systems Principles, DOI 10.1145/3341301.3359657, <https://doi.org/10.1145/3341301.3359657>.
[POSEIDON]
Wang, W., Moshref, M., Li, Y., Kumar, G., Ng, E., Cardwell, N., and N. Dukkipati, "Poseidon: Efficient, Robust, and Practical Datacenter CC via Deployable INT", <https://www.usenix.org/conference/nsdi23/presentation/wang-weitao>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/rfc/rfc3168>.
[RFC8257]
Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., and G. Judd, "Data Center TCP (DCTCP): TCP Congestion Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, <https://www.rfc-editor.org/rfc/rfc8257>.
[RFC8684]
Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C. Paasch, "TCP Extensions for Multipath Operation with Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, <https://www.rfc-editor.org/rfc/rfc8684>.
[RFC9000]
Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based Multiplexed and Secure Transport", RFC 9000, DOI 10.17487/RFC9000, <https://www.rfc-editor.org/rfc/rfc9000>.
[RFC9378]
Brockners, F., Ed., Bhandari, S., Ed., Bernier, D., and T. Mizrahi, Ed., "In Situ Operations, Administration, and Maintenance (IOAM) Deployment", RFC 9378, DOI 10.17487/RFC9378, <https://www.rfc-editor.org/rfc/rfc9378>.
[SWIFT]
Kumar, G., Dukkipati, N., Jang, K., Wassel, H., Wu, X., Montazeri, B., Wang, Y., Springborn, K., Alfeld, C., Ryan, M., Wetherall, D., and A. Vahdat, "Swift: Delay is Simple and Effective for Congestion Control in the Datacenter", Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, DOI 10.1145/3387514.3406591, <https://doi.org/10.1145/3387514.3406591>.
[TCP-INT]
Jereczek, G., Jepsen, T., Wass, S., Pujari, B., Zhen, J., and J. Lee, "TCP-INT: lightweight network telemetry with TCP transport", Proceedings of the SIGCOMM '22 Poster and Demo Sessions, DOI 10.1145/3546037.3546064, <https://doi.org/10.1145/3546037.3546064>.

Appendix A. Example encodings of CSIG signals

The following table demonstrates an example encoding of a 3-bit signal value. Note that this is an example ONLY. The encoding that is meaningful to a certain deployment is specific to the use cases in consideration.

Note that the CSIG-tag supports a 5-bit signal value in the compact format and a 20-bit signal value in the expanded format.

Table 2
Value   min(ABW/C)   min(ABW)    max(PD)
0x0     0%-1%        0-1Gbps     0-10us
0x1     1%-5%        1-5Gbps     10-50us
0x2     5%-10%       5-10Gbps    50-100us
0x3     10%-20%      10-20Gbps   100-200us
0x4     20%-50%      20-50Gbps   200-400us
0x5     50%-75%      50-75Gbps   400-800us
0x6     75%-90%      75-90Gbps   800-2000us
0x7     90%-100%     >90Gbps     >2000us
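
The max(PD) column of the table can be turned into an encoder and decoder with a sorted list of bucket edges, as in the sketch below. The function names are illustrative, and the treatment of values that fall exactly on an edge (assigned here to the lower bucket, consistent with the ">2000us" entry) is an assumption, since the example table does not specify it.

```python
from bisect import bisect_left

# Upper edges of codepoints 0x0..0x6 for max(PD) in the example table, in
# microseconds. Codepoint 0x7 covers everything above the last edge.
MAX_PD_EDGES_US = [10, 50, 100, 200, 400, 800, 2000]


def encode(value, edges):
    """Map a measured value to its bucket index (a value on an edge stays low)."""
    return bisect_left(edges, value)


def decode(code, edges):
    """Return the (low, high) range of a bucket; high is None for the last one."""
    low = 0 if code == 0 else edges[code - 1]
    high = edges[code] if code < len(edges) else None
    return low, high


code = encode(75, MAX_PD_EDGES_US)
print(hex(code), decode(code, MAX_PD_EDGES_US))  # 0x2 (50, 100)
```

Changing the list of edges is all that is needed to give a deployment finer granularity in the delay range it cares about.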

Contributors

Weida Huang
Google LLC
Tyler Griggs
UC Berkeley
Mohammad Jafar Akhbarizadeh
Google LLC
Jeongkeun Lee
Google LLC
Surendra Anubolu
Broadcom Inc.
Kok-Kiong Yap
Google LLC
Neal Cardwell
Google LLC

Authors' Addresses

Abhiram Ravi
Google LLC
Nandita Dukkipati
Google LLC
Naoshad Mehta
Google LLC
Jai Kumar
Broadcom Inc.