Congestion Signaling (CSIG)
draft-ravi-ippm-csig-01

Document	Type	Active Internet-Draft (individual)
	Authors	Abhiram Ravi , Nandita Dukkipati , Naoshad Mehta , Jai Kumar
	Last updated	2024-02-02
Networking Working Group                                         A. Ravi
Internet-Draft                                              N. Dukkipati
Intended status: Experimental                                   N. Mehta
Expires: 5 August 2024                                        Google LLC
                                                                J. Kumar
                                                           Broadcom Inc.
                                                         2 February 2024

                      Congestion Signaling (CSIG)
                        draft-ravi-ippm-csig-01

Abstract

   This document presents Congestion Signaling (CSIG), an in-band
   network telemetry protocol that allows end-hosts to obtain visibility
   into fine-grained network signals for congestion control, traffic
   management, and network debuggability.  CSIG provides
   a simple, low-overhead, and extensible packet header mechanism to
   obtain fixed-length summaries from bottleneck devices along a packet
   path.  This summarized information is collected over L2 CSIG-tags in
   a compare-and-replace manner across network devices along the path.
   Receivers can reflect this information back to senders via L4+ CSIG
   reflection headers.

   CSIG builds upon the successful aspects of prior work such as switch
   in-band network telemetry (INT) that incorporates multibit signals in
   live data packets.  At the same time, CSIG's end-to-end mechanism for
   carrying the signals via a fixed-size header is simple, practical,
   and deployable, akin to Explicit Congestion Notification (ECN).

   In addition to a detailed description of the end-to-end protocol,
   this document also motivates the use cases for CSIG and the rationale
   for design choices made in CSIG.  It describes a set of signals of
   interest to applications (minimum available bandwidth, maximum link
   utilization, and maximum hop delay), methods to compute these signals
   in network devices, and how these signals can be leveraged in
   applications.  Additionally, it describes how attributes about the
   bottleneck's location can be carried and made useful to applications.
   It also provides the framework to incorporate future signals.
   Finally, this document addresses incremental deployment, backward
   compatibility and nuances of CSIG's applicability in a range of
   scenarios.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

Ravi, et al.              Expires 5 August 2024                 [Page 1]
Internet-Draft                    CSIG                     February 2024

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 5 August 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   4
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   6
   2.  Design Principles . . . . . . . . . . . . . . . . . . . . . .   8
   3.  Conventions . . . . . . . . . . . . . . . . . . . . . . . . .   9
   4.  Congestion Signaling Protocol . . . . . . . . . . . . . . . .   9
     4.1.  CSIG-tag Header Format  . . . . . . . . . . . . . . . . .  10
       4.1.1.  Compact Format  . . . . . . . . . . . . . . . . . . .  11
       4.1.2.  Expanded Format . . . . . . . . . . . . . . . . . . .  11
       4.1.3.  CSIG-tag Data fields Description  . . . . . . . . . .  11
     4.2.  CSIG Reflection Header Format . . . . . . . . . . . . . .  14
       4.2.1.  Reflection in TCP . . . . . . . . . . . . . . . . . .  15
       4.2.2.  Reflection in non-TCP Transports  . . . . . . . . . .  15
     4.3.  CSIG Operation - Life of a packet . . . . . . . . . . . .  16
       4.3.1.  Forward Path  . . . . . . . . . . . . . . . . . . . .  16
       4.3.2.  Reverse Path  . . . . . . . . . . . . . . . . . . . .  17
       4.3.3.  Multiple signals  . . . . . . . . . . . . . . . . . .  17
     4.4.  Device Roles  . . . . . . . . . . . . . . . . . . . . . .  17
       4.4.1.  Sender host . . . . . . . . . . . . . . . . . . . . .  17
       4.4.2.  Transit device  . . . . . . . . . . . . . . . . . . .  18
       4.4.3.  Receiver host . . . . . . . . . . . . . . . . . . . .  18

       4.4.4.  Host roles for bidirectional flows  . . . . . . . . .  18
   5.  Signals in CSIG . . . . . . . . . . . . . . . . . . . . . . .  19
     5.1.  Minimum Available Bandwidth - min(ABW)  . . . . . . . . .  19
       5.1.1.  ABW Computation . . . . . . . . . . . . . . . . . . .  19
     5.2.  Maximum link utilization - max(U/C) or min(ABW/C) . . . .  20
       5.2.1.  ABW/C Computation . . . . . . . . . . . . . . . . . .  20
       5.2.2.  min(ABW) vs min(ABW/C) bottlenecks  . . . . . . . . .  21
     5.3.  Shared requirements for min(ABW) and min(ABW/C) . . . . .  21
       5.3.1.  Algorithm Requirements  . . . . . . . . . . . . . . .  21
       5.3.2.  Timescale and Accuracy Requirements . . . . . . . . .  21
       5.3.3.  Bucketing / Quantization Requirements . . . . . . . .  21
       5.3.4.  QoS requirements  . . . . . . . . . . . . . . . . . .  22
     5.4.  Maximum Per-hop Delay - max(PD) . . . . . . . . . . . . .  22
       5.4.1.  Per-hop Delay Computation . . . . . . . . . . . . . .  22
       5.4.2.  Requirements  . . . . . . . . . . . . . . . . . . . .  22
     5.5.  Locator Metadata Implementation . . . . . . . . . . . . .  23
       5.5.1.  Requirements  . . . . . . . . . . . . . . . . . . . .  24
       5.5.2.  Attributes  . . . . . . . . . . . . . . . . . . . . .  24
   6.  Incremental Deployment of CSIG. . . . . . . . . . . . . . . .  25
     6.1.  CSIG Stripping: A per egress-port primitive . . . . . . .  25
     6.2.  Levels of CSIG Support  . . . . . . . . . . . . . . . . .  26
       6.2.1.  Discard . . . . . . . . . . . . . . . . . . . . . . .  26
       6.2.2.  Pass-through  . . . . . . . . . . . . . . . . . . . .  26
       6.2.3.  Complete  . . . . . . . . . . . . . . . . . . . . . .  26
     6.3.  Interoperability in Brownfield Deployments  . . . . . . .  27
       6.3.1.  Requirements for interoperability . . . . . . . . . .  27
       6.3.2.  Forwarding  . . . . . . . . . . . . . . . . . . . . .  28
       6.3.3.  Negotiation . . . . . . . . . . . . . . . . . . . . .  28
     6.4.  Backward Compatibility via Software-assisted CSIG . . . .  30
     6.5.  Greenfield deployments  . . . . . . . . . . . . . . . . .  31
   7.  Design Rationale  . . . . . . . . . . . . . . . . . . . . . .  31
     7.1.  Choice of Layer 2 . . . . . . . . . . . . . . . . . . . .  31
     7.2.  Separation of headers for CSIG-tag and reflection . . . .  32
     7.3.  Fixed-size headers  . . . . . . . . . . . . . . . . . . .  32
     7.4.  Signal Design . . . . . . . . . . . . . . . . . . . . . .  33
   8.  Use Cases defined by Bottleneck Signals . . . . . . . . . . .  34
     8.1.  Congestion Control  . . . . . . . . . . . . . . . . . . .  34
       8.1.1.  Using maximum per-hop delay in E2E CC . . . . . . . .  34
       8.1.2.  Using maximum link utilization in E2E CC  . . . . . .  35
       8.1.3.  Using minimum available bandwidth in E2E CC . . . . .  36
     8.2.  Traffic Management  . . . . . . . . . . . . . . . . . . .  36
       8.2.1.  Load Balancing and Multipathing . . . . . . . . . . .  37
       8.2.2.  Traffic Engineering . . . . . . . . . . . . . . . . .  37
     8.3.  Application Performance Debugging . . . . . . . . . . . .  37
   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  38
   10. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  38
   11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  38
   12. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  39

   13. Normative References  . . . . . . . . . . . . . . . . . . . .  39
   Appendix A.  Example encodings of CSIG signals  . . . . . . . . .  42
   Contributors  . . . . . . . . . . . . . . . . . . . . . . . . . .  43
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  43

1.  Introduction

   Many network control loops, including Congestion Control, Traffic
   Engineering and Network Operations, make decisions based on the
   congestion experienced by application flows.  The signals used to
   determine congestion are often implicitly derived from end-to-end
   signals, approximated over larger timescales than desired, or
   obtained out-of-band from the network.  This can lead to suboptimal
   performance for applications or inefficiency in network usage.  CSIG
   (Congestion Signaling) provides direct, real-time, inband signals
   that network control loops can incorporate for performance and
   efficiency.

   A number of congestion control algorithms (CCA) are deployed in
   datacenters, including Swift [SWIFT], BBR [BBR], DCTCP [RFC8257],
   DCQCN [DCQCN] and HPCC++ [I-D.miao-tsv-hpcc].  These CCA vary in the
   congestion signals they use and in how they increase/decrease flow
   rates in response to the signals.  Swift uses precise measurements of
   round-trip time (RTT) to modulate its congestion window.  BBR uses a
   combination of flow's delivery rate and RTT measurements.  DCTCP and
   DCQCN rely on Explicit Congestion Notification (ECN [RFC3168]) from
   switches that indicate if the queue build up is above a threshold.
   HPCC++ leverages per-hop queue depth and transmit bytes along the
   flow's path, obtained via inband telemetry probes, to update flow
   rates.

   Despite the advances in sophisticated signals on when to slow down
   transfers, there continue to be blind-spots for CCA when it comes to
   increasing flow rates, e.g., What is the appropriate starting rate
   for a flow?  How quickly should a flow ramp up in the absence of
   congestion?  Without explicit information from the network, end-to-
   end CCA have come to rely on heuristics that can either undershoot or
   overshoot the bottleneck bandwidth, which can lead to slower Flow
   Completion Times (FCT) or increased round-trip times or packet
   losses.  At the same time, applications' appetite for fast network
   performance is rising: AI/ML applications are pushing for fast
   network transfers to avoid idling expensive Tensor Processing Units
   (TPUs) and Graphics Processing Units (GPUs).  Similarly, storage
   disaggregation needs fast transfers to make a remote storage device
   appear as a local device at the host.

   In this document we introduce Congestion Signaling (CSIG) to
   explicitly notify the hosts of the bottleneck link metrics.  There
   are several important use cases for CSIG, including:

   *  Congestion Control Algorithms for making decisions on sending
      rate: CCA at senders can use CSIG for quickly and safely ramping
      up to the maximum feasible rate as determined by the bottleneck
      link, and react with precision to the bottleneck hop both in the
      presence and absence of congestion.  The motivation for quick
      ramp-up stems from making maximal use of datacenter bandwidth, and
      decreasing latency even for large transfers.  There are several
      ways in which CSIG can help complete transfers quickly, e.g.,
      transfers belonging to an ML collective communication can ramp up
      quickly to maximally use all network bandwidth and complete close
      to the ideal transfer completion time.

   *  Traffic Management systems, including Traffic Engineering (TE),
      Load Balancing and Multipathing, also benefit from CSIG.  TE systems
      infer congested flows through an offline multi-minute process via
      superimposition of network traffic stats, topology and routing
      information.  With CSIG, TE has more up to date information on the
      congested points and the application flows experiencing
      congestion.  Using such finer-grained information can lead to more
      efficient and timely provisioning for bursty traffic.  Similarly,
      CSIG-enabled multipathed transport flows can choose paths in real
      time with the most available bandwidth.

   *  Troubleshooting and Performance Optimization.  We also envision
      CSIG to assist with debugging the network-level performance of
      datacenter applications.  Large-scale applications, including ML
      training workloads, open thousands of connections at the transport
      layer.  When the network is slow for an application, it is almost
      impossible to identify the bottleneck hops without joining many
      data sources across switches and hosts.  Because CSIG conveys the
      path bottleneck characteristics, it is valuable in pinpointing
      choke points in the network.  Knowledge of these choke points can
      lead to better bandwidth provisioning, timely repair processes,
      and real-time control, such as better load balancing.

   CSIG provides simple, fixed-length summaries of bottleneck links
   along a path, such as maximum hop delay, minimum available bandwidth,
   and maximum link utilization.  Information is collected at L2 from
   network devices along a packet path.  Each data receiver then returns
   the collected information to the data sender via L4 transport options
   or payloads.  CSIG uses a simple compare-and-replace operation at
   network devices, which allows it to scale with network topology, link
   speeds, and packet rates.

   CSIG builds on the successful aspects of prior explicit feedback
   schemes, but is more capable.  CSIG carries rich multi-bit switch
   telemetry in live data packets, drawing from the advancements in in-
   band network telemetry, also generally known as INT.  At the same
   time, CSIG retains the fixed-size headers and reflection in L4
   transports akin to Explicit Congestion Notification (ECN).  The
   industry has three key variants of INT: the one first specified in
   P4.org [P4-INT], the IOAM (In Situ Operations, Administration, and
   Maintenance) standard [RFC9378] in IETF and the Inband Flow Analyzer
   (IFA) spec [I-D.kumar-ippm-ifa] that is used in HPCC deployment
   [HPCCPLUS].  While they differ in the header definitions and
   encapsulation mechanisms, they all stack per-switch telemetry data
   for each hop in the path of a packet.  The packet size grows in
   proportion to the number of metrics per switch and the number of
   forwarding devices along its path.  Depending on the use case and
   header definition, the per-packet overhead ranges from 20B to above
   100B.  The large and variable header overhead makes it hard to
   conform to the end-to-end MTU limit and to parse the packet header
   data in the forwarding or receiving devices.

   There exist several efforts to address the challenges incurred in INT
   variants, including: 1) carrying INT data in synthetically generated
   non-data packets also known as probe packets, and 2) carrying only
   the fixed-size INT instructions (e.g., specifying which data to
   collect per hop) in data packets, while hop devices generate separate
   report packets that deliver the requested per-hop data.  While these
   techniques reduced the per-data-packet overhead, they did not
   fundamentally reduce the total amount of bytes or PPS overhead on the
   network devices or the data collector.  TCP-INT [TCP-INT] was
   developed in parallel to carry a fixed-size min/max/sum aggregate
   metric over the hops, together with a hop locator, in live data
   packets.  However, it is limited to TCP Options and hence is not
   applicable to various modern transports for AI/HPC; furthermore, it
   offers no flexible way to introduce a new metric.  CSIG's type-value
   format keeps the overhead constant in size while leaving room for
   future signals.  The guaranteed constant size is small enough to fit
   into a 4B or 8B tag, enabling the unique placement of CSIG in L2,
   which frees operators from concerns around tunneling and encryption
   when deploying CSIG.

   In the rest of the document, we describe the design of end-to-end
   CSIG at hosts and network devices.

1.1.  Terminology

   ABW:  Available Bandwidth

   AQM:  Active Queue Management

   CCA:  Congestion Control Algorithms

   Connection / Flow:  A 5-tuple transport connection, e.g., a TCP
      connection

   CSIG:  Congestion Signaling

   CSIG data fields:  Fields in the CSIG tag excluding the TPID.

   CSIG packets:  Packets that contain the CSIG-tag and optionally the
      CSIG reflection header

   CSIG-capable path:  Path is termed CSIG-capable if all transit
      devices along the path support the CSIG protocol and end hosts
      have at least pass-through support for CSIG packets

   CSIG-tagged packets:  Packets that contain the CSIG-tag in the packet
      header

   CSIG-domain:  Secure network deployment domain where all devices in
      the domain have complete CSIG support or pass-through CSIG support

   PD:  Per-hop delay

   E2E:  End-to-End

   IPSec:  Internet Protocol Security

   MTU:  Maximum Transmission Unit

   MSS:  Maximum Segment Size

   NIC:  Network Interface Card

   Packet Path:  The port-by-port network path taken by a given packet
      specified as a sequence of device interfaces

   PSP:  PSP Security Protocol

   TPID:  Tag Protocol ID

   TE:  Traffic Engineering

   Transit device:  Any switch, router or middlebox in the path of a
      CSIG packet

   WRR:  Weighted Round Robin

2.  Design Principles

   CSIG was conceived to address problems in congestion control, traffic
   management and network debuggability in production networks.  We
   describe below the design principles that shaped CSIG, with
   simplicity and ease of deployment being at the forefront.  Section 7
   discusses the rationale behind the specific design choices made in
   CSIG.

   *  Simple Signals driven by Use Cases: Simple device port or queue
      metrics that solve concrete use cases are at the heart of CSIG's
      design principles.  This simplicity is not only important to
      applications, but also keeps the area, power and cost of
      implementation low on network devices.  Signals in CSIG are
      designed to be implementable in ASICs at line rate.  Signals that
      track per-flow state at the switch, for example, are harder to
      implement and deploy, and are hence avoided in CSIG.  CSIG is also
      flexible enough to accommodate new signals and use cases beyond
      those described in this document.

   *  End-to-End Perspective: CSIG's design stems from an end-to-end
      perspective of requirements and trade-offs for both applications
      and the network.  This document covers the necessary end-to-end
      aspects and the resulting design choices that make CSIG both
      useful to applications and practical to deploy.

   *  Small and Fixed Packet Overhead: It is important that the packet
      size does not increase as it traverses the network, which means
      that the MTU does not need to be changed.  Any overhead that is
      introduced should be fixed and small, minimizing the cost of
      implementation in switch / NIC pipelines.  Low protocol overhead
      also means low bandwidth overhead for small packets, minimizing
      impact to packet-per-second (PPS) load and bandwidth efficiency.
      We make very few assumptions about which packets and devices CSIG
      is enabled on.  Device implementations must be able to process
      CSIG on packets at line rate with minimal CPU involvement.
      Keeping the overhead small and fixed allows for CSIG to be enabled
      on every single packet at line rate.  This is important because
      deployments may choose to enable CSIG on every packet rather than
      on a small sample of packets.

   *  Works easily under Tunneling and Encryption: Tunnels are broadly
      used in modern deployments; for example, traffic-engineering
      systems and cloud traffic frequently use tunnels.  CSIG is designed
      to easily support end-to-end signaling on devices even in the
      presence of complex tunneling deployments.  This is in contrast to
      other in-band telemetry schemes that put more pressure on the ASICs
      to relocate metadata across inner and outer headers to work in the
      presence of tunnels.  In addition, CSIG also works with encrypted
      packets, including PSP, IPSec and 802.1AE MAC Security.

   *  Incremental Deployability: CSIG allows incremental deployment,
      where the mechanism can be deployed gradually into domains where
      some devices may support the new protocol and others may not.
      This document addresses interoperability in heterogeneous
      networks, and addresses backward compatibility with legacy
      devices.  We envision CSIG to be broadly valuable across wired
      networks, although our target domain for initial usage is
      datacenter networks.  We make minimal assumptions about the
      network architecture around tunneling, number of hops (diameter),
      routing, topology etc.  Configuring CSIG for end-to-end
      consistency in a private network and deploying CSIG over the
      Internet are not in scope for this document.

3.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].
   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying significance described in RFC 2119.

4.  Congestion Signaling Protocol

   The CSIG protocol defines two components in the packet header to
   achieve end-to-end congestion signaling in a production network.

   *  CSIG-tag: An L2 protocol that end hosts and transit devices
      participate in.

   *  CSIG Reflection: A flexible L4+ protocol that only end hosts
      participate in.

   CSIG-tag is the core component of the CSIG specification.  It enables
   end hosts to request network signals of interest and for transit
   devices to provide these signals to end hosts over the specified
   packet header bits.

   However, to achieve end-to-end CSIG, CSIG-tag MAY be combined with
   the CSIG reflection protocol to expose the signals of interest to the
   relevant endpoints or consumers where the signals are needed.

   This section first describes the header formats for CSIG-tag and CSIG
   reflection.  Then it describes the life of a CSIG packet, outlining
   the different roles of network devices in the context of CSIG, and
   how these two packet header mechanisms work together to achieve end-
   to-end signaling.

4.1.  CSIG-tag Header Format

   The CSIG-tag is a fixed-size tag in the layer 2 header.

   CSIG-tag placement in various packet encapsulations is shown below
   for completeness.  It is always the last tag in the layer 2 header.

   ARPA: dstmac / srcmac / csig-tag / ethertype / payload

   802.1q: dstmac / srcmac / vlan-tag / csig-tag / ethertype / payload

   802.1ad: dstmac / srcmac / vlan-tag / vlan-tag / csig-tag / ethertype
   / payload

   802.1ad tunnel: dstmac / srcmac / vlan-tag / vlan-tag / vlan-tag /
   vlan-tag / csig-tag / ethertype / payload

   802.1ae: dstmac / srcmac / security-tag / vlan-tag / csig-tag /
   ethertype / payload

   Consequently, the placement / offset of the CSIG tag is not affected
   by the headers and payload at layers 3 and above.  Layer 2.5 headers,
   such as MPLS, are also placed after the CSIG tag and do not impact
   its offset.
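
   The placement rule above can be sketched as a small parser that
   walks the layer 2 tags until it reaches the CSIG-tag.  This is a
   minimal illustration, not part of the specification: the two CSIG
   TPID values are hypothetical placeholders (the document only says
   they are IEEE allocated), the function name is invented for the
   example, and only VLAN tags are skipped (the 802.1ae security tag is
   not handled).

```python
import struct

# Standard VLAN TPIDs (802.1Q C-tag and 802.1ad S-tag).
VLAN_TPIDS = {0x8100, 0x88A8}

# Hypothetical placeholders: the draft does not give TPID values.
CSIG_COMPACT_TPID = 0xFFF1
CSIG_EXPANDED_TPID = 0xFFF2


def find_csig_tag(frame: bytes):
    """Return (offset, tag_length) of the CSIG-tag, or None if absent.

    Starts after dstmac/srcmac (12 bytes) and skips 4-byte VLAN tags.
    The CSIG-tag is the last tag, so any other value ends the search.
    """
    offset = 12
    while offset + 2 <= len(frame):
        (tpid,) = struct.unpack_from("!H", frame, offset)
        if tpid in VLAN_TPIDS:
            offset += 4          # skip TPID + TCI
        elif tpid == CSIG_COMPACT_TPID:
            return offset, 4     # compact variant: 4-byte tag
        elif tpid == CSIG_EXPANDED_TPID:
            return offset, 8     # expanded variant: 8-byte tag
        else:
            return None          # reached the ethertype: no CSIG-tag
    return None


# One VLAN tag, then a compact CSIG-tag, then an IPv4 ethertype.
macs = bytes(12)
tagged = macs + struct.pack("!HHHHH", 0x8100, 1, CSIG_COMPACT_TPID, 0, 0x0800)
untagged = macs + struct.pack("!H", 0x0800)
```

   Because the search never looks past the ethertype, the result does
   not depend on anything at layer 2.5 or above.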

   CSIG-tag is defined in two variants - Compact and Expanded.  Each
   variant has a dedicated TPID codepoint to allow devices to infer
   which variant is in use.  Each variant supports a distinct set of
   requirements with respect to production deployment and identifies
   contrasting trade-off points in the solution space.  Deployment
   considerations are discussed in Section 6.

   Structurally, the compact CSIG-tag variant resembles a single VLAN
   tag and the expanded CSIG-tag variant resembles a double VLAN tag.
   This structural similarity is intentional and the reasons are
   elaborated in Section 6.4.

4.1.1.  Compact Format

   CSIG-tag compact format is as shown, with 2B allocated for the CSIG
   Tag Protocol ID (TPID) and 2B allocated for the data fields.

      0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |             TPID              |  T  |R|    S    |      LM     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      |0-15|  TPID  : IEEE allocated Tag Protocol ID for 4 Byte CSIG tag
      |16-18| T     : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
      |19|    R     : Reserved
      |20-24| S     : Signal Value: Bucketed (32 configurable buckets)
      |25-31| LM    : Locator Metadata of bottleneck device / port

                     Figure 1: CSIG-tag Compact version
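
   As an illustration of Figure 1, the following sketch packs and
   unpacks the compact tag in network byte order.  The TPID value and
   the function names are assumptions made for the example; the
   document does not assign a TPID codepoint.

```python
import struct

# Hypothetical placeholder: the TPID is IEEE allocated and not given here.
CSIG_COMPACT_TPID = 0xFFF1


def pack_compact(signal_type, signal_value, locator, tpid=CSIG_COMPACT_TPID):
    """Build the 4-byte compact CSIG-tag: TPID(16) T(3) R(1) S(5) LM(7)."""
    if not 0 <= signal_type < 8:
        raise ValueError("T is 3 bits")
    if not 0 <= signal_value < 32:
        raise ValueError("S is 5 bits")
    if not 0 <= locator < 128:
        raise ValueError("LM is 7 bits")
    # The reserved bit R (bit 19 of the tag) is left as zero.
    data = (signal_type << 13) | (signal_value << 7) | locator
    return struct.pack("!HH", tpid, data)


def unpack_compact(tag):
    """Split a 4-byte compact CSIG-tag into its fields."""
    tpid, data = struct.unpack("!HH", tag)
    return {
        "TPID": tpid,
        "T": data >> 13,
        "R": (data >> 12) & 0x1,
        "S": (data >> 7) & 0x1F,
        "LM": data & 0x7F,
    }


# Request max(PD) (T=2) with bucket 31 and locator 5.
tag = pack_compact(2, 31, 5)
fields = unpack_compact(tag)
```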

4.1.2.  Expanded Format

   CSIG-tag expanded format is as shown, with 2B allocated for the Tag
   Protocol ID (TPID) and 6B allocated for the data fields.

      0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |             TPID              |               LM              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |   T   |                  S                    |       R       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      |0-15|  TPID : IEEE allocated Tag Protocol ID for 8 Byte CSIG tag
      |16-31| LM   : Locator Metadata of bottleneck device / port
      |0-3|   T    : Signal Type (0:min(ABW), 1: min(ABW/C), 2:max(PD))
      |4-23|  S    : Signal Value: Uniformly quantized
      |24-31| R    : Reserved for future use

                    Figure 2: CSIG-tag Expanded version
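
   The expanded layout in Figure 2 can be sketched the same way.  Again
   the TPID value and function names are assumptions for illustration
   only.

```python
import struct

# Hypothetical placeholder: the TPID is IEEE allocated and not given here.
CSIG_EXPANDED_TPID = 0xFFF2


def pack_expanded(signal_type, signal_value, locator, tpid=CSIG_EXPANDED_TPID):
    """Build the 8-byte expanded CSIG-tag.

    First 32-bit word:  TPID(16) LM(16)
    Second 32-bit word: T(4) S(20) R(8)
    """
    if not 0 <= signal_type < 16:
        raise ValueError("T is 4 bits")
    if not 0 <= signal_value < (1 << 20):
        raise ValueError("S is 20 bits")
    if not 0 <= locator < (1 << 16):
        raise ValueError("LM is 16 bits")
    # The reserved byte R is left as zero.
    word = (signal_type << 28) | (signal_value << 8)
    return struct.pack("!HHI", tpid, locator, word)


def unpack_expanded(tag):
    """Split an 8-byte expanded CSIG-tag into its fields."""
    tpid, locator, word = struct.unpack("!HHI", tag)
    return {
        "TPID": tpid,
        "LM": locator,
        "T": word >> 28,
        "S": (word >> 8) & 0xFFFFF,
        "R": word & 0xFF,
    }


# Request min(ABW) (T=0) starting from the largest 20-bit value.
tag = pack_expanded(0, 0xFFFFF, 0x1234)
fields = unpack_expanded(tag)
```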

4.1.3.  CSIG-tag Data fields Description

   This section describes the format and usage of data fields within the
   CSIG-tag.

4.1.3.1.  Signal Type

   The Signal Type field T is three (four) bits long in the compact
   (expanded) format and indicates the type of signal being carried in
   the CSIG-tag.  End hosts set the signal type T and request it on each
   packet of interest.  Up to 8 signal types are supported in the
   compact format, and up to 16 signal types are supported in the
   expanded format.  This draft concretely defines three signals:
   min(ABW), min(ABW/C) and max(PD), elaborated in Section 5 and
   Section 8.  The remaining codepoints are reserved for future signals,
   and may be defined and used in future versions of CSIG.

   A single packet can carry at most one Congestion Signal.  However,
   end hosts MAY obtain multiple signals for a single 5-tuple flow by
   requesting different signal types on alternating packets of a flow or
   in a round-robin fashion across packets.  Therefore, end hosts need
   not tie a single flow to a specific signal type, and MAY obtain all
   supported CSIG signals for a single flow.
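
   One way a sender might rotate signal types across the packets of a
   flow is sketched below.  The codepoints 0, 1 and 2 come from Figures
   1 and 2; the helper name and the per-flow schedule are assumptions,
   since the document leaves the selection policy to the end host.

```python
from itertools import cycle, islice

# Signal Type codepoints as defined in the CSIG-tag formats.
SIGNAL_TYPES = {"min(ABW)": 0, "min(ABW/C)": 1, "max(PD)": 2}


def signal_type_schedule(requested):
    """Yield Signal Type codepoints in round-robin order, one per packet."""
    return cycle([SIGNAL_TYPES[name] for name in requested])


# A flow that wants all three signals requests them on successive packets.
schedule = signal_type_schedule(["min(ABW)", "min(ABW/C)", "max(PD)"])
first_seven = list(islice(schedule, 7))
```

   Each packet still carries exactly one signal; the flow as a whole
   observes every requested type once per rotation.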

4.1.3.2.  Signal Value

   The Signal Value field S is 5 bits (20 bits) long in the compact
   (expanded) format and captures the value of the signal specified by
   Signal Type T.  End hosts set the initial Signal Value S alongside
   the requested Signal Type T, and each transit device along the packet
   path in the network MAY modify S in accordance with the e2e signal
   being computed.  For example, for signals that are min()
   aggregations, end hosts set the initial value of S to the maximum
   allowable value of the signal, or the encoding thereof, and transit
   devices perform compare-and-replace to compute the min() across
   signals of individual devices on the packet path.
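
   The compare-and-replace step can be sketched as follows.  Several
   details are assumptions made for the example rather than statements
   of the specification: encoded values are assumed to be ordered the
   same way as the underlying signal, the initial value for a max()
   signal is assumed to be zero, the Locator Metadata is assumed to be
   overwritten together with the value, and unknown signal types are
   left untouched.

```python
MIN_SIGNALS = {0, 1}  # min(ABW), min(ABW/C)
MAX_SIGNALS = {2}     # max(PD)


def initial_value(signal_type, bits=5):
    """Value the sender writes into S before the packet enters the network."""
    if signal_type in MIN_SIGNALS:
        return (1 << bits) - 1  # maximum allowable encoded value
    return 0                    # assumption: max() signals start at zero


def compare_and_replace(signal_type, carried, carried_lm, local, local_lm):
    """Return the (S, LM) pair a transit device writes back to the tag."""
    if signal_type in MIN_SIGNALS:
        replace = local < carried
    elif signal_type in MAX_SIGNALS:
        replace = local > carried
    else:
        replace = False  # assumption: unknown types pass through unchanged
    return (local, local_lm) if replace else (carried, carried_lm)


def walk_path(signal_type, hops, bits=5):
    """Apply compare-and-replace at each (local_value, local_lm) hop."""
    value, lm = initial_value(signal_type, bits), 0
    for local_value, local_lm in hops:
        value, lm = compare_and_replace(signal_type, value, lm,
                                        local_value, local_lm)
    return value, lm
```

   For a three-hop path with encoded min(ABW) values 20, 7 and 12, the
   receiver sees the value 7 along with the locator of the second hop.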

   In the compact format, the 5-bit Signal Value is bucketed with 32
   fully configurable buckets.  Each bucket is configured with a (low,
   high) value range.  This configuration is specific to each Signal
   Type and MAY vary across Signal Types.  This allows the Signal Value
   representation to be tailored to the specific needs of each Signal
   Type.  For example, in typical use cases of available bandwidth, it
   is more useful to have higher granularity at lower values of the
   signal (i.e., when ABW is close to 0) than at higher values of the
   signal.  This is because lower values of ABW have a greater impact on
   application control decisions; e.g., knowing whether 0 Gbps or 1 Gbps
   is available on a path makes a larger difference than knowing whether
   399 Gbps or 400 Gbps is available.  Appendix A shows how the buckets
   could be defined to provide such a non-linear encoding of value
   ranges to buckets.  Such configurable encodings capture useful
   information about the signal with fewer bits and are a core feature
   of the compact CSIG format.
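
   As an illustration only, the following Python sketch shows one way a
   non-linear bucket layout could be built and applied.  The geometric
   spacing, the ratio of 1.3, the 400 Gbps maximum and all function
   names are assumptions made for this example; they are not the
   configuration defined in Appendix A.

```python
from bisect import bisect_right

NUM_BUCKETS = 32  # a 5-bit Signal Value selects one of 32 buckets


def geometric_lower_edges(max_value, ratio=1.3):
    """Hypothetical layout: bucket widths shrink geometrically toward 0."""
    edges = [0.0]  # edges[b] is the inclusive lower bound of bucket b
    for b in range(1, NUM_BUCKETS):
        edges.append(max_value / ratio ** (NUM_BUCKETS - b))
    return edges


def encode(value, edges):
    """Return the index of the bucket whose range contains value."""
    return bisect_right(edges, max(value, 0.0)) - 1


def bucket_range(index, edges, max_value):
    """Return the (low, high) range represented by a bucket index."""
    high = edges[index + 1] if index + 1 < NUM_BUCKETS else max_value
    return edges[index], high


# Example: lower bucket edges, in Gbps, for a 400 Gbps maximum.
ABW_EDGES_GBPS = geometric_lower_edges(400.0)
```

   With this layout, 0 Gbps and 1 Gbps fall into different buckets,
   while 399 Gbps and 400 Gbps share the top bucket.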

   In the expanded format, Signal Value is uniformly quantized into a 20
   bit value.  The unit of quantization is configurable on a per Signal
   Type basis, depending on the minimum and maximum value that needs to
   be represented with the given bits.  The higher bit length allows for
   enhanced signal granularity and fewer configuration knobs in domains
   where the expanded CSIG format is viable to deploy (Section 6.5).
   20 bits are sufficient to represent a wide range of values with high
   granularity.  As an example, with an 8 Mbps quantum for min(ABW), the
   Signal Value field can represent values up to about 8 Tbps (2^20 x
   8 Mbps).  With a 128 ns quantum for max(PD), the Signal Value field
   can represent values up to about 134 ms (2^20 x 128 ns).  More
   discussion on signal-specific quanta is in Appendix A.
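
   The arithmetic above can be checked with a short Python sketch.
   Rounding down to a whole number of quanta and saturating at the
   largest code are assumptions of this example, and the function names
   are hypothetical.

```python
VALUE_BITS = 20
MAX_CODE = (1 << VALUE_BITS) - 1  # largest 20-bit code: 1,048,575

ABW_QUANTUM_BPS = 8_000_000  # 8 Mbps quantum from the example above
PD_QUANTUM_NS = 128          # 128 ns quantum from the example above


def quantize(value, quantum):
    """Map a raw value to a 20-bit code (assumed: floor, then saturate)."""
    code = int(value // quantum)
    return max(0, min(code, MAX_CODE))


def dequantize(code, quantum):
    """Return the lower bound of the raw values represented by code."""
    return code * quantum
```

   Under these quanta, the largest code corresponds to 8,388,600 Mbps
   for min(ABW) and 134,217,600 ns for max(PD).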

   Signal quantization / bucketing parameters are configured directly at
   the transit devices where the signal is computed.  End hosts do not
   explicitly request or negotiate these parameters.  As described in
   Section 5, all devices MUST be configured with the same quantization
   / bucketing parameters for each signal type, in order to correctly
   compute the requested signal along packet paths.

4.1.3.3.  Locator Metadata

   The Locator Metadata field LM is optional and is 7 bits (16 bits)
   long in the compact (expanded) format.  It captures relevant metadata
   about the bottleneck port or device, where the notion of bottleneck
   is specific to individual signal types.  Locator Metadata MAY include
   compressed attributes of the bottleneck that are relevant for the use
   case, e.g., capacity of the bottleneck port, stage of the bottleneck
   device in the data center topology, and orientation of the bottleneck
   port (uplink / downlink).  LM MAY also include expanded attributes of
   the bottleneck (e.g., port ID, TTL).  This document provides
   recommendations for the type of information that locator metadata MAY
   carry, but it does not require any specific set of metadata to be
   supported.  Metadata that is useful and viable to support will depend
   on the production setting, which is out of scope for this document.
   Instances of CSIG deployment MAY include locator metadata with
   custom-defined metadata beyond those described in this document.
   Section 5.5 discusses requirements for supporting LM in devices.

   End hosts initialize LM to a default value.  Transit devices that do
   not update the Signal Value S on a given packet MUST NOT alter LM on
   the packet.  Transit devices that update S on a packet MUST update LM
   on the same packet.
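
   The update rules for S and LM can be sketched in Python as follows.
   This is a minimal illustration, not a device implementation: the
   string signal names, the CsigTag class and the choice of 0 as the
   initial value for max() aggregations are assumptions made for the
   example.

```python
from dataclasses import dataclass

# Aggregation per signal; the string names are placeholders, not the
# Signal Type codepoints.
AGGREGATION = {"min_abw": min, "min_abw_c": min, "max_pd": max}


@dataclass
class CsigTag:
    signal_type: str
    value: int  # Signal Value S (bucket index or quantized code)
    lm: int     # Locator Metadata LM


def initial_tag(signal_type, value_bits=5, default_lm=0):
    """End host: start min() signals at the maximum code, max() at 0."""
    max_code = (1 << value_bits) - 1
    start = max_code if AGGREGATION[signal_type] is min else 0
    return CsigTag(signal_type, start, default_lm)


def transit_update(tag, local_value, local_lm):
    """Transit device: compare-and-replace; LM changes only if S does."""
    aggregated = AGGREGATION[tag.signal_type](tag.value, local_value)
    if aggregated != tag.value:
        tag.value = aggregated
        tag.lm = local_lm
    return tag
```

   A device whose local value does not change S leaves LM untouched, so
   LM always describes the hop that last set S.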

4.2.  CSIG Reflection Header Format

   CSIG reflection enables consumption of tag data fields at the point
   where the signals are needed for telemetry or control.  This
   mechanism is particularly relevant for sender-driven / source-based
   telemetry and control.  For receiver-driven transports and
   controllers, CSIG reflection may not be necessary as the signals on
   the CSIG tag are available at the receiver without reflection (See
   Section 4.3).

   This document provides recommendations on how CSIG reflection SHOULD
   be implemented, and provides the framework to make the implementation
   deployment-specific.

   The CSIG reflection header is a separate header from the CSIG tag,
   implemented at layer 4 or above.  The location of the header and the
   choice of which packets carry the header are transport-specific.  As
   an example, the header can be carried on TCP ACK packets from the
   receiver back to the sender.  Note that the presence of ACK
   coalescing, piggybacked ACKs, Selective Acknowledgements (SACK) etc.
   can impact the behavior of CSIG reflection.  More generally, there
   may not be a 1:1 mapping between forward and reverse path packets.
   In a scenario where the transport implements ACK coalescing, the CSIG
   reflection header SHOULD reflect the latest CSIG-tag data fields
   received across the packets being acknowledged or a more advanced
   summary of the CSIG-tag data fields across the packets being
   acknowledged.  It is important to note that since Signal Type is
   chosen on a per-packet granularity, a coalesced ACK may acknowledge
   multiple packets that carry different signal types in their
   CSIG-tags.  In such a scenario, the reflection header MAY reflect
   only one of the signals.  The sender transport should choose the
   Signal Type for packets in a way that ensures that it continues to
   receive all signals of interest.
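
   As a minimal sketch of the "latest CSIG-tag data fields" policy for
   coalesced ACKs, a receiver transport could keep only the most recent
   fields and emit them on the next ACK.  The class and method names
   are hypothetical, and the string signal names stand in for Signal
   Type codepoints.

```python
class ReflectionState:
    """Receiver-side state for reflecting the latest CSIG-tag fields."""

    def __init__(self):
        self.latest = None  # fields of the newest tag not yet reflected

    def on_data_packet(self, signal_type, value, lm):
        # A newer tag replaces an older one, even of another signal type.
        self.latest = (signal_type, value, lm)

    def on_send_ack(self):
        """Return the fields to reflect on this ACK, or None if none."""
        fields, self.latest = self.latest, None
        return fields
```

   If two packets with different signal types are acknowledged
   together, only the later one is reflected under this policy, which
   is why the sender needs to spread its requests across packets.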

   The CSIG reflection header MAY include all CSIG data fields, i.e.,
   2B for the compact version and 6B for the expanded version.  However,
   one could optimize header space and include only a subset of the data
   fields if the consumer is interested only in a subset of signals or
   locator metadata.

   CSIG reflection is an end-host-only protocol and transit devices do
   not participate in it.  Therefore, the CSIG reflection header can be
   incorporated in portions of the packet that are e2e encrypted via PSP
   or IPsec.

   The following subsections discuss locations in the packet header
   where CSIG reflection could be implemented for different transports.

4.2.1.  Reflection in TCP

   Reflection in TCP is typically achieved via TCP options.  CSIG
   Reflection can be implemented via a new TCP Option, identified by a
   unique Kind.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Kind      |    Length     |       CSIG data fields        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Kind              : Unique codepoint to recognize TCP CSIG option
      Length            : Length in bytes of the CSIG data fields
                          carried in the options payload
      CSIG data fields  : Values reflected from receiver to sender

                    Figure 3: CSIG Reflection TCP Option
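
   The option layout in Figure 3 can be sketched in Python as follows.
   No Kind has been assigned for CSIG, so the sketch uses 253, one of
   the TCP option Kind values reserved for experiments, purely as a
   placeholder.  Length follows the definition given with Figure 3
   (bytes of CSIG data fields only); conventional TCP options instead
   count the Kind and Length octets as well.  Function names are
   hypothetical.

```python
import struct

CSIG_OPTION_KIND = 253  # placeholder: experimental Kind, not assigned


def build_option(data_fields: bytes) -> bytes:
    """Serialize Kind, Length and the reflected CSIG data fields."""
    header = struct.pack("!BB", CSIG_OPTION_KIND, len(data_fields))
    return header + data_fields


def parse_option(option: bytes):
    """Return the CSIG data fields, or None if this is not the option."""
    if len(option) < 2:
        return None
    kind, length = struct.unpack_from("!BB", option)
    if kind != CSIG_OPTION_KIND or len(option) < 2 + length:
        return None
    return option[2:2 + length]
```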

4.2.2.  Reflection in non-TCP Transports

   Several transports such as QUIC [RFC9000] and PonyExpress
   [PONYEXPRESS] are built atop UDP.  Reflection in UDP can be achieved
   by including CSIG data fields in the UDP payload from receiver to
   sender.  For unidirectional UDP traffic, an out-of-band reverse
   connection from the receiver to the sender may be necessary for CSIG
   reflection.

   As an example, PonyExpress [PONYEXPRESS] is a custom transport
   implemented within a userspace host networking stack.  It supports a
   flexible L4 wire protocol that periodically changes as new features
   are added (Sec 3.1 in Snap).  CSIG reflection can be implemented as
   additional bytes within this wire format.

                      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      |    Flags    | CSIG data fields|
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 4: PonyExpress CSIG Reflection header

   For simplicity and to avoid the need for negotiation, the CSIG
   reflection header can be carried on all packets independent of
   whether CSIG is enabled on them.  The Valid bit in the Flags field
   can be set to 1 for packets that carry valid data fields in the
   reflection header.  In certain deployments, negotiation is
   unavoidable for a variety of reasons.  Section 6.3.3 provides details
   regarding options for negotiation.

4.3.  CSIG Operation - Life of a packet

   This section describes the end-to-end operation of CSIG by walking
   through the life of a packet.  It assumes that all nodes in the path
   are CSIG-capable and omits the negotiation phase.  Details of
   negotiation are covered in Section 6.3.3.

                               Forward Path
        --------------------------------------------------------->

        <---------------------------------------------------------
                               Reverse Path

   +------+   +-----+   +------+  +------+  +------+  +-----+   +------+
   | Host +---+ ToR +---+ Aggr +--+ Core +--+ Aggr +--+ ToR +---+ Host |
   +------+   +-----+   +------+  +------+  +------+  +-----+   +------+

           C:   800G      100G      100G      100G      40G

         ABW:   100G       95G       70G       90G      20G
                                                        ---
       ABW/C:  12.5%       95%       70%       90%      50%
               -----
           D:   10us       3us      18us       5us      8us
                                    ----

        Figure 5: Life of a CSIG packet.  Underlined values show the
        forward path bottlenecks for the corresponding signal types

4.3.1.  Forward Path

   The sender end-host first constructs a CSIG-tagged packet for a flow
   of interest and sends out the packet with the tag data fields
   initialized.  The transport determines these initial values for the
   packet, including Signal Type to request and default values for the
   other data fields.  Each transit device performs a compare-and-
   replace on the CSIG-tag to optionally update the Signal Value and
   Locator Metadata fields on the tag.  As the packet traverses through
   the network, the CSIG-tag data fields accumulate the desired
   aggregation of the requested signal.
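
   The forward-path aggregation can be traced on the values of Figure 5
   with a short Python sketch.  It operates on raw values rather than
   bucketed or quantized codes, and the data layout and function name
   are assumptions made for the example.

```python
# Per-link values from Figure 5: capacity and ABW in Gbps, delay in us.
LINKS = [
    {"c": 800, "abw": 100, "d": 10},
    {"c": 100, "abw": 95, "d": 3},
    {"c": 100, "abw": 70, "d": 18},
    {"c": 100, "abw": 90, "d": 5},
    {"c": 40, "abw": 20, "d": 8},
]


def traverse(links, local_signal, aggregate, initial):
    """Apply compare-and-replace hop by hop.

    Returns the final value and the 1-based index of the link that
    last updated it (None if no link updated it).
    """
    value, bottleneck = initial, None
    for index, link in enumerate(links, start=1):
        local = local_signal(link)
        if aggregate(value, local) != value:
            value, bottleneck = local, index
    return value, bottleneck


MIN_ABW = traverse(LINKS, lambda l: l["abw"], min, float("inf"))
MIN_ABW_C = traverse(LINKS, lambda l: l["abw"] / l["c"], min, 1.0)
MAX_PD = traverse(LINKS, lambda l: l["d"], max, 0)
```

   The three results are 20 Gbps at the fifth link, 12.5% at the first
   link and 18 us at the third link, matching the underlined values in
   Figure 5.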

4.3.2.  Reverse Path

   When the CSIG-tagged packet reaches the receiver end-host, the data
   fields in the CSIG tag are extracted and delivered to the transport
   layer at the receiver.  The transport stores the data fields of the
   packet to be reflected, or a summary of these fields across packets.
   It reflects these data fields in the layer-4 CSIG reflection header
   on packets traversing the reverse path from receiver to sender.  The
   CSIG reflection header is unmodified as the packet travels from
   receiver to sender.  The sender extracts the CSIG data fields from
   the CSIG reflection header of the incoming packet, and hands it to
   the transport layer for use in applications at the sender.  As a
   result, the sender transport learns the desired signal for a flow
   within approximately one round-trip time.

4.3.3.  Multiple signals

   The transport layer has a significant role to play in making CSIG
   usable.  Although the CSIG data fields are carried on packets, the
   measurements are ultimately relevant at the flow / connection level
   for specific paths.  If the sender transport desires to obtain
   multiple signals for the same flow, it MAY choose Signal Type on a
   per-packet basis (e.g., in a round robin fashion across the flow's
   packets), and internally keep track of all of the requested signals
   as part of the flow's state variables.  This approach allows the
   sender transport to use all supported CSIG signals for use cases such
   as congestion control, load balancing and multipathing.
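
   A minimal Python sketch of this per-flow bookkeeping is shown below.
   The class, its method names and the string signal names are
   hypothetical; a real transport would use Signal Type codepoints and
   its own flow state.

```python
from itertools import cycle


class FlowCsigState:
    """Per-flow state: round-robin requests plus the latest signals."""

    def __init__(self, signal_types):
        self._next_type = cycle(signal_types)
        self.latest = {t: None for t in signal_types}

    def signal_type_for_next_packet(self):
        """Pick the Signal Type to request on the next outgoing packet."""
        return next(self._next_type)

    def on_reflection(self, signal_type, value, lm):
        """Record reflected data fields under their Signal Type."""
        self.latest[signal_type] = (value, lm)
```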

4.4.  Device Roles

   CSIG has three participating entities, each with their own roles and
   responsibilities for achieving end-to-end congestion signaling.

4.4.1.  Sender host

   The sender host is responsible for

   (i) Constructing CSIG-tagged packets for flows of interest and
   initializing the CSIG-tag data fields on each packet as specified by
   the transport, and

   (ii) Parsing the CSIG reflection header received in incoming packets
   and extracting CSIG data fields for use in the sender transport /
   applications.

   Only the sender is allowed to insert CSIG-tags into packets.

4.4.2.  Transit device

   Transit devices are responsible for

   (i) Computing and tracking congestion signals such as ABW and ABW/C
   of each port and hop delay per packet,

   (ii) Parsing the CSIG-tag based on the TPID code point on incoming
   packets to identify the Signal Type being requested, and

   (iii) Performing compare-and-replace on the Signal Value and Locator
   Metadata fields on the CSIG-tag based on the aggregation
   corresponding to the requested Signal Type (min / max).

   Transit devices MUST NOT add CSIG tags to incoming packets that are
   not already CSIG-tagged.  Transit devices MAY delete the CSIG tag
   before forwarding the packet.  This functionality can be exercised
   when downstream devices are not CSIG-capable.  Further discussion on
   this topic is in Section 6 on Incremental Deployment of CSIG.

4.4.3.  Receiver host

   The receiver host is responsible for

   (i) Extracting the CSIG-tag on incoming packets and exposing the data
   fields to the transport layer and/or receiver-driven applications,
   and

   (ii) Inserting and populating the CSIG Reflection header at the
   transport layer for packets traversing the reverse path to the
   sender.

4.4.4.  Host roles for bidirectional flows

   Note that for bi-directional flows, the Sender and Receiver are
   specific to each direction within the flow.  For a bi-directional
   flow between hosts A and B,

   (i) A plays the Sender host role and B plays the Receiver host role
   for data packets traveling from A to B, and similarly

   (ii) B plays the Sender host role and A plays the Receiver host role
   for data packets traveling from B to A.

   In this scenario, packets traversing from A to B contain both a CSIG-
   tag that captures the congestion signals on the forward A-->B path,
   and a CSIG reflection header that captures the CSIG data fields of
   the reverse B-->A path.  Similarly, packets traversing from B to A
   contain both a CSIG-tag that captures the congestion signals on the
   forward B-->A path, and a CSIG reflection header that captures the
   CSIG data fields of the reverse A-->B path.

5.  Signals in CSIG

   As described in the previous section, Signal Type indicates the type
   of congestion signal that the CSIG-tag carries on each packet.  The
   compact format supports up to 8 signal types and the expanded format
   supports up to 16.

   In this section, we concretely define three signals driven by use
   cases described in Section 8.  While Section 8 covers how these three
   signals are useful to applications, this section focuses on precise
   definitions of these signals and how they may be implemented on
   transit devices.

   Note for future extensions: Signals in CSIG are intended to be
   aggregation functions of individual per-hop or per-port signals
   across the path of a packet.  The typical definition of such signals
   with max / min aggregations captures the notion of a path bottleneck
   for different definitions of bottleneck.  However, structurally, the
   format supports arbitrary read-modify-write operations, including
   aggregations such as max, min, count and sum, allowing future use
   cases to leverage this structure for new signals.

5.1.  Minimum Available Bandwidth - min(ABW)

   min(ABW) captures the minimum absolute available bandwidth (in bps)
   across all the ports in the packet path.  Available bandwidth is
   defined per egress port on each device.

5.1.1.  ABW Computation

   ABW can be computed using one of many algorithm variants, each having
   implications on HW or SW implementation complexity, timescales of
   computation and accuracy of the signal.  In its rudimentary form, the
   raw ABW for a given egress port p over a time interval delta_t can be
   computed as follows:

   // delta_txbit is the number of bits that exited on the wire
   utilization_bps[p] = (delta_txbit[p]) / delta_t;
   // capacity_bps[p] captures the link speed of port p
   abw_bps[p] = capacity_bps[p] - utilization_bps[p];

   Implementation of these computations relies on at least one of the
   following capabilities in the devices:

   *  Timer-based computations: Most networking ASICs maintain hardware
      counters that track the number of bits that exit on each egress
      port.  To compute available bandwidth, a periodic-timer thread in
      SW or HW triggers the computation and update of available
      bandwidth every delta_t time interval, where delta_t is a
      configurable parameter.

   *  Per-packet computations: In this alternative, available bandwidth
      is computed and updated on every packet that is processed via the
      egress pipeline, typically in HW, e.g., via Exponential Weighted
      Moving Average (EWMA) estimation where the weights are
      configurable.  In this approach, delta_t is not an explicit
      parameter; it is implicitly determined by the EWMA weights.

   Variants such as Discounted Rate Estimator (DRE) [CONGA] use a
   combination of per-packet updates and timer-based approaches.
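
   The two capabilities above can be sketched in Python as follows.
   Both classes are illustrations under stated assumptions: counter
   wraparound is ignored, the EWMA variant takes its rate sample from
   the bits sent since the previous departure time, and all names are
   hypothetical.  DRE is not shown.

```python
class TimerAbw:
    """Timer-based variant: sample the tx bit counter every delta_t."""

    def __init__(self, capacity_bps, delta_t):
        self.capacity_bps = capacity_bps
        self.delta_t = delta_t  # seconds between timer ticks
        self.last_txbits = 0
        self.abw_bps = capacity_bps

    def on_timer(self, txbits_counter):
        delta_txbit = txbits_counter - self.last_txbits
        self.last_txbits = txbits_counter
        utilization_bps = delta_txbit / self.delta_t
        self.abw_bps = max(0.0, self.capacity_bps - utilization_bps)
        return self.abw_bps


class EwmaAbw:
    """Per-packet variant: EWMA over rate samples between departures."""

    def __init__(self, capacity_bps, weight, start_time=0.0):
        self.capacity_bps = capacity_bps
        self.weight = weight  # EWMA weight given to the newest sample
        self.last_time = start_time
        self.pending_bits = 0
        self.utilization_bps = 0.0

    def on_packet(self, now, packet_bits):
        self.pending_bits += packet_bits
        gap = now - self.last_time
        if gap > 0:
            sample_bps = self.pending_bits / gap
            self.utilization_bps += self.weight * (
                sample_bps - self.utilization_bps)
            self.pending_bits = 0
            self.last_time = now
        return max(0.0, self.capacity_bps - self.utilization_bps)
```

   For example, a 100 Gbps port that transmits 30,000,000 bits during a
   1 ms timer interval has an ABW of 70 Gbps under the timer-based
   variant.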

5.2.  Maximum link utilization - max(U/C) or min(ABW/C)

   ABW/C captures the fraction or percentage of available bandwidth on a
   given link relative to the link's capacity.  min(ABW/C) captures the
   link utilization bottleneck along the path of the packet.  This
   signal is most relevant in paths with heterogeneous link speeds,
   where it distinguishes itself from min(ABW). min(ABW/C) is equivalent
   to max(U/C), where

   U = utilization of a given egress port in bps
   C = capacity of a given egress port in bps
   ABW = available bandwidth of a given egress port in bps

   Since U = C - ABW, U/C = 1 - ABW/C.  Therefore,
   max(U/C) = max(1 - ABW/C) = 1 - min(ABW/C).

5.2.1.  ABW/C Computation

   ABW/C can be computed from ABW as follows:

   // Represents fraction of available bandwidth on port p
   // relative to the port's capacity.
   abwc_frac[p] = abw_bps[p] / capacity_bps[p];

   Algorithms for ABW computation described in Section 5.1.1 also apply
   to ABW/C computation, except that the resulting value is normalized
   by the port capacity.  Quantization / bucketing is performed after
   normalization.
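
   As a quick numeric check of the normalization and of the relation
   between max(U/C) and min(ABW/C), the following Python sketch uses
   the link values of Figure 5.  The function names are hypothetical
   and the values are raw fractions, before any bucketing.

```python
def abwc_frac(abw_bps, capacity_bps):
    """Fraction of the port capacity that is available."""
    return abw_bps / capacity_bps


def utilization_frac(abw_bps, capacity_bps):
    """Fraction of the port capacity in use, with U = C - ABW."""
    return (capacity_bps - abw_bps) / capacity_bps


# (capacity, ABW) in Gbps for the five links of Figure 5.
LINKS_GBPS = [(800, 100), (100, 95), (100, 70), (100, 90), (40, 20)]

MIN_ABWC = min(abwc_frac(abw, c) for c, abw in LINKS_GBPS)
MAX_UC = max(utilization_frac(abw, c) for c, abw in LINKS_GBPS)
```

   Here min(ABW/C) is 0.125 and max(U/C) is 0.875, so the two sum to 1
   as expected.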

5.2.2.  min(ABW) vs min(ABW/C) bottlenecks

   On paths with heterogeneous link speeds, min(ABW) and min(ABW/C)
   bottlenecks are not necessarily the same ports.  Figure 2 shows an
   example where these two bottlenecks are different.  Each type of
   bottleneck has its own value, as demonstrated in Section 8.

5.3.  Shared requirements for min(ABW) and min(ABW/C)

5.3.1.  Algorithm Requirements

   To support min(ABW) or min(ABW/C) in CSIG, the device SHOULD support
   raw ABW computation with a configurable delta_t, and MAY support
   additional algorithms such as EWMA or DRE.  This requirement enables
   the consistent interpretation of timescale over which available
   bandwidth is computed.  This consistent interpretation allows end-
   hosts to tune their control decisions based on this timescale e.g.,
   in relation to the flow's RTT.

5.3.2.  Timescale and Accuracy Requirements

   CSIG does not set strict requirements on the delta_t values to be
   supported by the implementation, except that it SHOULD be
   configurable to cover the range of RTTs in the network e.g., {10us,
   100us, 1ms, 10ms, 100ms, 1s etc.}.  Although one would expect all
   devices on a packet path to compute ABW at similar timescales to
   provide a consistent path-wide view, CSIG does NOT set strict
   requirements on the consistency of delta_t parameters chosen across
   the devices of a packet path.  Choices of signal accuracy and
   timescales are a function of the use case and are not enforced by
   CSIG.  End hosts MAY use EWMA across packets of a flow to calculate
   ABW or ABW/C over a longer timescale when CSIG on each packet carries
   ABW or ABW/C over shorter timescales.  This technique is useful when
   flows traversing a given egress port span a wide range of RTTs while
   ABW computation over the egress port is fixed to a chosen timescale
   at each transit device.

5.3.3.  Bucketing / Quantization Requirements

   The computed ABW or ABW/C values MUST be compressed to fit in the
   available Signal value bits on the CSIG-tag.  The device MUST support
   32 fully configurable ABW buckets and ABW/C buckets for compact CSIG,
   and configurable quanta for uniform quantization in expanded CSIG.
   All devices along the packet path MUST be configured with the same
   buckets / quanta per signal type in order to correctly compute
   min(ABW) or min(ABW/C) along the path.  Appendix A provides examples
   of these configurations.

   Each transit device performs a compare-and-replace, i.e., updates the
   signal value on the CSIG tag if the incoming ABW or ABW/C signal
   value on the packet is higher than the device's locally computed ABW
   or ABW/C value for the packet's egress port, post bucketization /
   quantization.  E.g.,

   // Update the signal value on the packet if the current hop is the
   // bottleneck, i.e., its bucketed / quantized ABW is lower
   pkt->csig_tag->abw = min(pkt->csig_tag->abw, egr_port->abw);

5.3.4.  QoS requirements

   min(ABW) and min(ABW/C) are unambiguous signals with low
   implementation complexity on network devices.  For simplicity, these
   definitions intentionally do NOT distinguish across QoS classes that
   may share the egress port.  Available bandwidth per QoS class on an
   egress port is complex to define and meaningfully interpret since it
   depends on the scheduling policy (Strict Priority / WRR / Deficit
   WRR), buffer carving configuration and other policies (e.g., AQM)
   associated with QoS.  Section 8 describes the applications of
   min(ABW) and min(ABW/C) as defined.  We leave QoS-based variations of
   these signals and their potential use cases as future work.

5.4.  Maximum Per-hop Delay - max(PD)

   max(PD) captures the maximum per-hop delay experienced by a packet
   among all the hops in the packet path.  Per-hop delay PD is the time
   spent by the packet in the device pipeline.  It MAY include link
   layer delays or it MAY only include the delays observed in the
   forwarding pipeline.

5.4.1.  Per-hop Delay Computation

   Unlike ABW and ABW/C which are per-port signals, PD is a per-packet
   signal.  It consists of PHY, MAC and switch pipeline delay
   experienced by the packet.  Pipeline delay is the most relevant
   component as it captures congestion related queueing delay.  Device
   implementations MAY track ingress and egress timestamps explicitly
   for each packet and perform a diff in the final stages of the
   pipeline.  Precise definitions of these stages depend on the
   architecture of the device.  For example, some devices could reuse
   existing tail-timestamping support for this purpose.
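
   A minimal Python sketch of the timestamp-difference approach is
   shown below.  The 48-bit free-running nanosecond counter, the
   handling of a single wraparound, the reuse of the 128 ns example
   quantum and the function names are all assumptions made for the
   illustration.

```python
PD_QUANTUM_NS = 128        # example quantum from Section 4.1.3.2
MAX_CODE = (1 << 20) - 1   # largest 20-bit Signal Value


def per_hop_delay_ns(ingress_ts_ns, egress_ts_ns, counter_bits=48):
    """Egress minus ingress timestamp, tolerating one counter wrap."""
    return (egress_ts_ns - ingress_ts_ns) % (1 << counter_bits)


def update_pd(tag_pd_code, ingress_ts_ns, egress_ts_ns):
    """Quantize the local delay, then keep the larger of the codes."""
    delay_ns = per_hop_delay_ns(ingress_ts_ns, egress_ts_ns)
    local_code = min(delay_ns // PD_QUANTUM_NS, MAX_CODE)
    return max(tag_pd_code, local_code)
```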

5.4.2.  Requirements

5.4.2.1.  Algorithm Requirements

   To support max(PD) in CSIG, the device SHOULD support per-packet
   tracking of delay experienced through the device.

5.4.2.2.  Accuracy Requirements

   It is desirable to have minimal gaps in the components of packet
   delays captured by the device.  However, CSIG does NOT set strict
   requirements on the accuracy of PD to be supported by the
   implementation.

5.4.2.3.  Bucketing / Quantization Requirements

   The computed delay values MUST be compressed to fit in the available
   Signal value bits on the CSIG-tag.  The device MUST support 32 fully
   configurable delay buckets for compact CSIG, and configurable quanta
   for uniform quantization in expanded CSIG.  All devices along the
   packet path MUST be configured with the same buckets / quanta to
   correctly compute max(PD) along the path.

   Each transit device performs a compare-and-replace, i.e., updates the
   signal value on the CSIG tag if the incoming delay signal value on
   the packet is lower than the device's locally computed delay for the
   packet, post bucketization / quantization.  E.g.,

   // Update the signal value on the packet if the current hop is the
   // bottleneck, i.e., its bucketed / quantized delay is higher
   pkt->csig_tag->pd = max(pkt->csig_tag->pd, device->pkt->pd);

5.4.2.4.  QoS requirements

   Delay experienced by the packet on a device, as defined, is
   implicitly a QoS-specific signal.  This is because the packet is
   subject to QoS policies as it traverses through the device pipeline,
   including prioritization, scheduling and buffering.  For example, a
   high priority packet may see smaller delays than low priority
   packets.  Therefore, the delay measured for the packet SHOULD include
   components in the pipeline where QoS policies are applied.

5.5.  Locator Metadata Implementation

   Locator metadata (LM) captures information about the bottleneck
   device or port, as described in Section 4.1.3.3.  In this section, we
   discuss requirements for supporting LM in CSIG, and provide
   recommendations for commonly useful attributes to carry in LM.

5.5.1.  Requirements

   A single deployment MAY choose a subset of the attributes in
   Section 5.5.2 and/or newly defined attributes beyond those listed in
   Section 5.5.2 to include in LM.  However, the total size of the
   individual attributes MUST be within 7 bits for Compact CSIG and
   within 16 bits for Expanded CSIG.

   CSIG does not set strict requirements on the LM internal format i.e.,
   how the individual attributes are organized among the available LM
   bits.  However, this LM internal format MUST be consistent across
   devices in the deployment domain so that the end hosts can
   consistently interpret these bits.  The LM internal format MAY be
   specific to each signal type.

   Devices SHOULD support configuring per-port values for LM to be
   written on the CSIG-tag.  Devices MAY provide more granular
   configurability of LM based on Signal type as well.  CSIG packets
   egressing on a given port that have their Signal Value updated by the
   device MUST be updated with the LM corresponding to the port and
   Signal Type.

5.5.2.  Attributes

   Attributes can be designed to capture the level of resolution desired
   by use cases for pinpointing the bottleneck.  Attributes may be
   encoded to fit within the limited number of LM bits available in
   CSIG.

   We separate the list of attributes into compact attributes and
   expanded attributes.  Compact attributes are motivated by the limited
   number of LM bits available in Compact CSIG, and therefore capture
   only the essential information about the bottleneck that is necessary
   for the use cases i.e., to inform control decisions or telemetry.
   Expanded attributes provide higher resolution information about the
   bottleneck, and can aid in directly pinpointing bottleneck devices or
   ports.  Expanded attributes typically require more bits and are hence
   more suited for Expanded CSIG.

   Examples of attributes are listed below.

5.5.2.1.  Compact Attributes

   *  Link capacity: Encodes the capacity of the bottleneck link.  In
      typical deployments, the set of deployed link speeds is small and
      can be encoded using <= 5 bits.

   *  Stage of the bottleneck: Encodes the stage of the topology where
      the bottleneck device / port is located.  For example, in a
      5-stage Clos topology, the stage of the device can be encoded with
      3 bits.

   *  Link orientation: Encodes the direction of a port in the context
      of the network topology.  For example, with three categories -
      uplinks, downlinks and side-links - link orientation can be
      encoded using 2 bits.
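
   As one hypothetical example of an LM internal format for Compact
   CSIG, the three attributes above could share the 7 LM bits as
   2 + 3 + 2 bits.  The split, the 2-bit capacity class (which assumes
   at most four deployed link speeds) and the function names are
   assumptions of this Python sketch, not a format defined by CSIG.

```python
CAPACITY_BITS, STAGE_BITS, ORIENTATION_BITS = 2, 3, 2  # hypothetical


def pack_lm(capacity_class, stage, orientation):
    """Pack three compact attributes into a 7-bit LM value."""
    fields = ((capacity_class, CAPACITY_BITS), (stage, STAGE_BITS),
              (orientation, ORIENTATION_BITS))
    lm = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("attribute does not fit in its LM bits")
        lm = (lm << width) | value
    return lm


def unpack_lm(lm):
    """Inverse of pack_lm: (capacity_class, stage, orientation)."""
    orientation = lm & ((1 << ORIENTATION_BITS) - 1)
    stage = (lm >> ORIENTATION_BITS) & ((1 << STAGE_BITS) - 1)
    capacity_class = lm >> (ORIENTATION_BITS + STAGE_BITS)
    return capacity_class, stage, orientation
```

   Because the layout must be interpreted identically by every end
   host, the same split would have to be configured across the whole
   deployment domain.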

5.5.2.2.  Expanded Attributes

   *  Port ID: Encodes a unique identifier for each port within a
      deployment domain.

   *  Device ID: Encodes a unique identifier for each device within a
      deployment domain.

   *  TTL (Time-to-live): Captures the TTL value of the packet at the
      bottleneck device, represented using 8 bits.  End hosts can use
      this attribute to infer the hop number at which the packet was
      bottlenecked.

   LM attributes and encoding schemes are ultimately deployment specific
   and use-case specific.  CSIG supports a flexible specification of LM
   to accommodate a variety of requirements and future applications.

6.  Incremental Deployment of CSIG

   Most production networks are heterogeneous, with a mix of network
   devices across generations.  This document addresses the brownfield
   deployment of CSIG in a heterogeneous network, where there may be a
   mix of devices that offer varying degrees of support for CSIG packet
   construction and processing.

6.1.  CSIG Stripping: A per egress-port primitive

   Before describing incremental deployment, we introduce the idea of
   CSIG stripping, an action primitive which is foundational to
   deploying CSIG in a heterogeneous environment.

   Devices that support CSIG MUST be capable of removing the CSIG tag
   before forwarding the packet.  Devices MUST allow configuring CSIG-
   stripping on a per egress-port basis.  If a port is configured to
   strip CSIG, then all CSIG-tagged packets that egress on this port
   must have the tag removed before being forwarded.

   In the following sections, we describe how this capability can enable
   incremental deployment.
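
   The per egress-port stripping primitive can be sketched in Python as
   follows.  Modeling a packet as a dictionary with an optional
   "csig_tag" key, and the port names used below, are assumptions made
   only for this illustration.

```python
def egress(packet, port, strip_ports):
    """Return the packet as forwarded on port.

    If the port is configured to strip CSIG, the tag is removed and the
    rest of the packet is forwarded unchanged.
    """
    if port in strip_ports and "csig_tag" in packet:
        packet = {k: v for k, v in packet.items() if k != "csig_tag"}
    return packet
```

   For example, with strip_ports = {"eth1"}, a tagged packet leaves
   "eth1" without its tag and leaves any other port with the tag
   intact.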

6.2.  Levels of CSIG Support

   We first classify devices into three simplified categories based on
   their level of CSIG support.  In the subsequent sections we describe
   how CSIG can interoperate with each category of device.  Note that
   the level of support is a function of the tag placement and whether
   the compact or expanded CSIG tag format is used as shown in
   Section 4.1.

6.2.1.  Discard

   Devices in this category are not capable of recognizing or parsing
   CSIG tagged packets.  If such packets are received, they will simply
   be dropped.

6.2.2.  Pass-through

   Devices in this category are able to recognize and parse CSIG-tagged
   packets, and transparently forward the packet with the tag intact or
   with the tag stripped to neighboring devices (in the case of transit
   devices) or to the end host transport layer (in the case of end
   hosts).  However, they do not support updating the CSIG data fields
   in the tag.

   Some devices that do not natively support CSIG may be configured to
   support pass-through mode for CSIG if they support VLAN tags with
   configurable TPIDs.  This is discussed in more detail in Section 6.4.

6.2.3.  Complete

   Devices in this category support the complete CSIG protocol,
   including recognition, parsing, forwarding, tag-stripping, signal
   computation, and signal updates on the tag.  However, only a subset
   of signal types may be supported.

6.2.3.1.  Software-assisted support

   In some devices that do not natively support CSIG, resources
   available for VLAN tag processing can be repurposed to support CSIG
   for certain signal types using a combination of software and
   hardware capabilities.  We refer to this level of support as
   software-assisted support.  This capability is discussed in more
   detail in Section 6.4.

6.2.3.2.  Native support

   Devices that natively support CSIG are explicitly equipped with the
   hardware capabilities required to implement the CSIG protocol.

   A CSIG domain is a deployment domain where all network devices have
   complete support or pass-through support for CSIG.

6.3.  Interoperability in Brownfield Deployments

   In this section, we first define the requirements for CSIG
   interoperability in brownfield deployments.  Then, we consider
   devices with each level of support described in Section 6.2 and
   describe how these devices MAY be configured to achieve
   interoperability.  Note that the following descriptions apply
   separately to both Compact and Expanded CSIG-tags.

         +==============+=======================================+
         | Device       | Interop support                       |
         | category     |                                       |
         +==============+=======================================+
         | Discard      | Upstream devices must strip CSIG tags |
         |              | before packets reach this device      |
         +--------------+---------------------------------------+
         | Pass-through | Device may strip tag or transparently |
         | support only | forward with tag unmodified depending |
         |              | on e2e signal accuracy requirements   |
         +--------------+---------------------------------------+
         | Native CSIG  | Device updates CSIG-tag as per        |
         | support      | protocol                              |
         +--------------+---------------------------------------+
         | SW-assisted  | Device updates CSIG-tag using VLAN    |
         | CSIG support | match/action with approximate signals |
         |              | computed in S/W agent                 |
         +--------------+---------------------------------------+

              Table 1: Interoperability with devices having
                     different levels of CSIG support

6.3.1.  Requirements for interoperability

   Forwarding: The fundamental requirement is that no CSIG-tagged packet
   should be dropped in the network due to a lack of CSIG support on a
   device.  This requirement means that packets with CSIG-tags MUST NOT
   reach devices in the Discard category; any CSIG-tag MUST be stripped
   before the packet reaches such a device.

   Negotiation: End hosts / flows SHOULD ensure that the path (including
   end hosts and transit devices) is CSIG-capable before enabling CSIG-
   tagging on packets.  Devices in the Discard category should not
   require any changes in order to achieve negotiation.  This
   requirement is to ensure correctness of data fields in end-to-end
   CSIG operation, and to interoperate with legacy devices or software
   stacks.

6.3.2.  Forwarding

   To meet the forwarding interoperability requirements for CSIG, CSIG
   stripping is applied as follows:

   *  When a neighboring device connected to a given egress port is a
      Discard device and cannot parse CSIG packets, this egress port
      MUST be configured to strip the tag on outgoing packets to ensure
      that the packet does not get dropped downstream.

   *  When a device supports Pass-through only or does not support the
      requested signal type on a CSIG packet, egress ports on this
      device MAY be configured to strip the tag on outgoing packets to
      ensure that CSIG does not carry inaccurate information.  In some
      use cases where it is acceptable for CSIG to miss capturing
      signals on certain hops, pass-through devices MAY transparently
      forward the packet with the CSIG tag intact.

   *  At the boundary of a CSIG domain, device ports that are connected
      to devices outside of the CSIG domain MUST strip the tag to ensure
      that packets exiting the domain do not contain CSIG-tags.
      CSIG-tags SHOULD be retained only on egress ports connected to
      devices within the CSIG domain.

   CSIG packets and non-CSIG packets can be used together in a
   brownfield setting.  Consequently, end hosts MUST be capable of
   transmitting and receiving both CSIG packets and non-CSIG packets,
   including within the same flow: a packet marked with a CSIG-tag at
   the sender host may arrive at the receiver host without the tag.  In
   addition, Compact CSIG and Expanded CSIG packets may be used
   together on the same network.
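
   The stripping rules above can be summarized as a per-port
   configuration decision.  The following Python sketch is illustrative
   only: the Support enumeration, the function names, and the
   tolerate_missed_hops policy flag are hypothetical, and signal-type
   support is simplified to a single per-device boolean.

```python
from enum import Enum


class Support(Enum):
    DISCARD = "discard"
    PASS_THROUGH = "pass-through"
    COMPLETE = "complete"


def egress_strip_required(neighbor: Support, neighbor_in_domain: bool) -> bool:
    """True when the egress port facing `neighbor` must strip CSIG-tags."""
    return neighbor is Support.DISCARD or not neighbor_in_domain


def egress_strip_configured(local: Support, supports_signal: bool,
                            neighbor: Support, neighbor_in_domain: bool,
                            tolerate_missed_hops: bool) -> bool:
    """Decide the strip setting for one egress port."""
    if egress_strip_required(neighbor, neighbor_in_domain):
        return True  # the MUST cases: Discard neighbor or domain boundary
    local_cannot_update = local is Support.PASS_THROUGH or not supports_signal
    # The MAY case: strip unless missed hops are acceptable to the operator.
    return local_cannot_update and not tolerate_missed_hops
```

   The two MUST cases take precedence; the optional case is left to
   operator policy, which is modeled here as a single flag.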

6.3.3.  Negotiation

   Support for sending and receiving CSIG-tagged packets may require
   software and/or hardware changes on transit devices and end hosts.
   In many deployments, particularly those requiring hardware upgrades
   to support CSIG (such as switch or NIC upgrades), version stragglers
   persist for long periods for a variety of reasons, and
   interoperability with such stragglers is a critical requirement.

   Without negotiation for CSIG capability, devices that are not
   CSIG-compliant may drop CSIG packets and thus blackhole traffic.
   Negotiating for CSIG-capability of a path is critical to ensure that
   the CSIG protocol operates safely end-to-end in a brownfield
   deployment.

   A path is considered CSIG-capable if end-hosts have at least Pass-
   through CSIG support and transit devices have Complete CSIG support
   (native or software-assisted).  Before sending CSIG-tagged packets on
   a network flow, end-hosts must negotiate for path CSIG-capability.
   We discuss one approach to negotiation for path CSIG-capability,
   which involves two parts: negotiation for transit device support and
   negotiation for end host support.

6.3.3.1.  Negotiation for transit device support

   In this section, we describe one simple approach to negotiate CSIG
   support on transit devices with CSIG stripping.

   CSIG stripping can be used to implicitly achieve negotiation by
   removing the CSIG-tag from the packet header at or before devices on
   the packet path that do not have the desired level of CSIG support.
   If the receiver end host receives a CSIG-tagged packet, it serves as
   an explicit indication that all devices on the packet path, including
   transit devices and end-hosts, have the desired CSIG support.  If the
   receiver end host receives a packet without a CSIG-tag, it is an
   indication that one or more devices do not have the desired CSIG
   support, or that the packet was not tagged at the sender to begin
   with.  This indication can be implicitly reported to the sender via
   an empty / invalid CSIG reflection header and the sender can
   determine whether the packet path was CSIG-capable.
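
   A minimal sketch of this implicit negotiation follows, written in
   Python.  It models the tag as an optional integer and an empty
   reflection as None; the reflect function and the SenderPathState
   class are hypothetical names, and the actual reflection header
   encoding is not modeled.

```python
from typing import Optional


def reflect(received_tag: Optional[int]) -> Optional[int]:
    """Receiver side: echo the tag value, or None for an empty reflection."""
    return received_tag


class SenderPathState:
    """Sender side: track whether the path currently appears CSIG-capable."""

    def __init__(self) -> None:
        self.csig_capable = False
        self.last_signal: Optional[int] = None

    def on_reflection(self, sent_with_tag: bool,
                      reflected: Optional[int]) -> None:
        if not sent_with_tag:
            return  # an untagged packet says nothing about the path
        self.csig_capable = reflected is not None
        if reflected is not None:
            self.last_signal = reflected
```

   The sender has to remember whether the original packet was tagged,
   since an empty reflection is ambiguous otherwise.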

   This approach assumes that each device has knowledge about the level
   of CSIG support in its immediate neighboring devices, which is viable
   through configuration in typical private SDN networks.  In the
   absence of centralized control, mechanisms such as a new LLDP TLV
   may be defined to advertise aspects of CSIG support on the device,
   including compact vs. expanded CSIG-tag support, the signal types
   that are supported, and pass-through vs. complete support.  We leave
   the details of such an LLDP extension to future extensions of the
   protocol.

6.3.3.2.  Negotiation for end host support

   A sender end host may need to explicitly negotiate with the remote
   end host to ensure that the host networking stack at the remote host
   has the desired level of CSIG support.  Ideally, such explicit CSIG
   negotiation should be performed during or before the initial
   connection handshake, so that CSIG can be enabled or disabled on
   packets once the connection is established.  It may also be
   necessary to explicitly negotiate the use of CSIG Reflection in
   transports, separately from the negotiation for path
   CSIG-capability.  For example, in TCP, negotiation is required to
   use the CSIG Reflection TCP Option.  We leave the details of such
   negotiation schemes to future extensions of the protocol.

6.4.  Backward Compatibility via Software-assisted CSIG

   Transit devices without native CSIG support MAY participate in the
   CSIG protocol via a software-assisted approach.  This allows
   brownfield deployments to reap incremental benefits of CSIG without
   having to upgrade a significant fraction of the device hardware on
   their networks.

   Since compact and expanded CSIG tags are structurally similar to
   single VLAN-tags and double VLAN-tags respectively, VLAN resources in
   a transit device can be repurposed to support CSIG updates.  More
   specifically, configurable TPIDs for VLAN tags can be used to treat
   CSIG tags as VLAN tags, and VLAN match/action resources for tag
   updates in the device can be leveraged to support updating CSIG data
   fields on the tag.

   For signals such as ABW and ABW/C, a software agent running on the
   CPU of a transit device can periodically compute these signals based
   on hardware byte counters, and program VLAN match/action rules in the
   dataplane to update CSIG data fields based on the computed signals.
   Since the match/action rules are in the dataplane, CSIG packets can
   be processed at line rate without CPU involvement.  However, the
   match/action rules themselves are updated at a slower cadence by the
   software agent, so the values written into the tag can lag the
   underlying counters by up to one update period.
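
   One iteration of such an agent can be sketched as follows.  This
   Python sketch rests on several assumptions that are not taken from
   this document: ABW is estimated as port capacity minus the transmit
   rate derived from two byte-counter readings, buckets are encoded so
   that a higher index means more available bandwidth, and the rule
   table is a plain mapping from incoming tag value to the value to
   write.  All function names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List


def available_bandwidth_bps(prev_bytes: int, curr_bytes: int,
                            interval_s: float, capacity_bps: float) -> float:
    """Estimate ABW from two readings of a port's transmit byte counter."""
    used_bps = (curr_bytes - prev_bytes) * 8 / interval_s
    return max(0.0, capacity_bps - used_bps)


def quantize(abw_bps: float, thresholds_bps: List[float]) -> int:
    """Map ABW to a bucket index; a higher index means more bandwidth."""
    return bisect_right(thresholds_bps, abw_bps)


def build_rules(local_bucket: int, num_buckets: int) -> Dict[int, int]:
    """Match on the incoming tag value; the mapped value is what to write.

    Compare-and-replace for a minimum signal: only tags that carry a
    larger value than the local bucket are rewritten.
    """
    return {v: local_bucket for v in range(local_bucket + 1, num_buckets)}
```

   The agent would rerun these steps once per update period and install
   the resulting rules; packets carrying a value at or below the local
   bucket match no rule and pass through unchanged.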

   Compact CSIG is designed to enable software-assisted backward
   compatibility while operating within the constraints of commonly
   available VLAN resources on transit devices.  Backward compatibility
   via software is a fundamental feature in the design of Compact CSIG.

   Note that it may not be possible to track signal types such as hop
   delay per packet in a software agent.  However, approximations of the
   signal based on available hardware counters and registers (such as
   latency histograms) can be implemented in the agent if software-
   assisted support is desired for such signal types.

6.5.  Greenfield deployments

   In greenfield deployments of CSIG domains, all devices in the domain
   natively support the CSIG protocol.

   Expanded CSIG is designed to leverage greenfield deployments where
   backward compatibility, negotiation and interoperability are not
   requirements.  It provides enhanced signal resolution via higher bit
   width for signal values and locator metadata in comparison to Compact
   CSIG.  Expanded CSIG can also support up to 16 signal types.

   Devices in Greenfield CSIG domains MUST support CSIG stripping at the
   domain boundary to ensure that CSIG packets don't exit the domain.

7.  Design Rationale

   CSIG's design choices are shaped by an end-to-end perspective of what
   matters to applications and where tradeoffs can be made towards
   simplicity and practicality.  In this section, we discuss the
   rationale behind CSIG's design and the advantages it provides over
   existing state of the art.

7.1.  Choice of Layer 2

   CSIG-tag offsets at layer 2 are independent of headers and payload at
   layer 3 and above, which means that only a small set of tag placement
   offsets need to be supported for reading and updating the header.
   This makes device implementations of CSIG simpler.  In contrast, in-
   band network telemetry schemes implemented at layer 3 or higher
   require support for a large set of packet formats as this set grows
   by the cross-product of formats / encapsulations at each layer.  This
   complexity forces device implementations to restrict support for only
   a fraction of packet formats / encapsulations, hindering the adoption
   and deployment of such schemes.  CSIG-tagging, on the other hand, is
   simpler to support and deploy since it is at layer 2 and has a fixed
   offset despite various formats / encapsulation at layer 3 and above.

   The choice of layer 2 also makes compatibility with in-network
   tunneling and encryption simpler, which are common features in data
   center deployments.

   *  CSIG-tags are, by design, compatible with PSP encrypted packets
      and IPsec encrypted packets, where Layer 4 headers and payloads
      may be encrypted.

   *  CSIG tags are carried through Layer 3 tunnels (e.g., IP-in-IP,
      VXLAN, Geneve) at a fixed offset in the packet header.  This
      avoids the need to copy and relocate CSIG tags across inner /
      outer headers during encapsulation and decapsulation of packets,
      which would be necessary if the tag were instead carried at
      layer 3 or higher.

   *  CSIG tags are placed as the last header in the Layer 2 header
      stack to ensure compatibility with layer 2 and layer 2.5 tunneled
      domains as well.  The placement of CSIG tags in MACsec and other
      Layer 2 encapsulations is shown in the table in Section 4.1.

   Most in-band network telemetry schemes are not backward compatible.
   However, the CSIG tag's structural similarity to VLAN tags enables
   backward compatibility with many devices that don't have native CSIG
   support, as described in Section 6.4.  This allows deployments to
   reap the benefits of CSIG without having to upgrade a significant
   portion of their network hardware.

   In addition, since expanded CSIG is limited to 8B, i.e., the size of
   double VLAN tags, the packet parsing depth required on devices to
   read and process headers at layer 3 and above is not affected.

   In summary, the choice of Layer 2 for CSIG-tag is a key part of
   CSIG's simplicity and efficiency, since it keeps device
   implementations simple while supporting multiple encapsulations and
   backward compatibility.

7.2.  Separation of headers for CSIG-tag and reflection

   CSIG's design separates the CSIG-tag and CSIG reflection headers into
   distinct layers.  This decoupling enables end hosts to develop
   different transport-specific implementations of CSIG reflection while
   sharing the underlying CSIG-tag mechanism.  This means that transit
   device behaviors are not impacted by innovations in CSIG reflection.

   In addition, this decoupling enables the separate tracking of forward
   and reverse path bottlenecks.  This is important since CCAs typically
   prefer to react to congestion on the forward path only and not react
   to congestion on the reverse path.  In contrast, in-band schemes that
   mix signaling and reflection into the same header do not provide
   distinctions between forward and reverse path.

7.3.  Fixed-size headers

   CSIG's fixed-size headers constitute less than 0.2% bandwidth
   overhead in packets with 4k or 9k MTU.  This means that there is no
   need for fragmentation or for increasing the MTU size to support
   multiple congestion signals.  Furthermore, the packets-per-second
   (PPS) performance of network devices is minimally impacted by the
   inclusion of CSIG tag and reflection headers.
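
   As a rough check of the tag's share of this overhead, the following
   Python sketch computes the tag size as a percentage of an MTU-sized
   packet.  It assumes a 4-byte compact tag (the size of a single VLAN
   tag), an 8-byte expanded tag, and reads "4k" as 4096 bytes; the
   reflection header is not counted because its size is not given in
   this section.

```python
def tag_overhead_percent(tag_bytes: int, mtu_bytes: int) -> float:
    """Tag size as a percentage of an MTU-sized packet."""
    return 100.0 * tag_bytes / mtu_bytes


for tag_bytes in (4, 8):
    for mtu_bytes in (4096, 9000):
        print(tag_bytes, mtu_bytes,
              round(tag_overhead_percent(tag_bytes, mtu_bytes), 3))
```

   Under these assumptions, even the 8-byte tag stays below 0.2% at
   either MTU.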

   The low overhead allows CSIG to be enabled on all live data packets,
   on explicit probe packets, or on sampled packets.  Enabling it on
   live data packets is an important capability because it allows for
   the direct quantification of the bottlenecks experienced by the data
   packets themselves instead of having to rely on probes.  However,
   leveraging CSIG on probes or sampled packets is still an option for
   deployments that require such visibility.

   CSIG is designed to perform compare-and-replace (or more generally
   read-modify-write for future extensions), with a fixed size header.
   Therefore, CSIG is not limited by the number of hops in a network
   path (i.e., diameter of the network) unlike schemes that append
   information at each hop.
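
   The hop-count independence can be seen in a small Python sketch of
   compare-and-replace.  The traverse function and its arguments are
   hypothetical; the tag is modeled as a single integer whose size does
   not change regardless of how many hops update it.

```python
from typing import Callable, Iterable


def traverse(initial: int, hop_values: Iterable[int],
             better: Callable[[int, int], bool]) -> int:
    """Carry one fixed-size value along the path, replacing it at each hop
    whose local value is a stronger bottleneck indication."""
    tag = initial
    for local in hop_values:
        if better(local, tag):
            tag = local
    return tag


# Maximum per-hop delay over a five-hop path (arbitrary units).
max_delay = traverse(0, [3, 12, 7, 9, 1], lambda local, tag: local > tag)
```

   A minimum-type signal such as available bandwidth uses the same loop
   with the comparison reversed and a large initial value.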

7.4.  Signal Design

   CSIG's signal design focuses on simple, aggregate signals that are
   driven by use cases, as demonstrated in Section 5 and Section 8.

   CSIG allows a single packet to carry only one congestion signal.  To
   obtain multiple signals at the end hosts, it takes advantage of the
   fact that the end host can request different signal types across
   multiple packets of a flow.  In contrast, other schemes tend to
   load each packet with metadata about multiple signals, which
   increases per-packet overhead.  Moreover, the CSIG-tag format is
   extensible, which means that it can be adapted to support
   additional signal types and locator metadata in the future without
   compromising the advantages of CSIG's design.
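
   One simple way for a sender to spread signal requests across the
   packets of a flow is a round-robin schedule, sketched below in
   Python.  The round-robin policy and the signal_schedule name are
   assumptions made for illustration; this document does not prescribe
   how senders choose the signal type for each packet.

```python
from itertools import cycle
from typing import Iterator, List


def signal_schedule(signal_types: List[str]) -> Iterator[str]:
    """Yield the signal type to request on each successive packet."""
    return cycle(signal_types)


schedule = signal_schedule(["max(U/C)", "min(ABW)", "max(PD)"])
first_six = [next(schedule) for _ in range(6)]
```

   A sender could equally weight the schedule toward whichever signal
   its congestion control algorithm consumes most often.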

   A unique feature of Compact CSIG's design is the ability to fully
   configure signal value buckets, which allows for efficient signal
   representations with a limited number of bits.  For example, the
   encodings can be adjusted to provide greater granularity at value
   ranges that are more important to the application, and lower
   granularity at ranges that are less important.  Similarly, locator
   metadata can be efficiently represented by carrying fewer bits of
   relevant compressed attributes of the bottleneck that are important
   to applications.  Expanded CSIG, on the other hand, uses uniform
   signal quantization for more accuracy and provides even more
   flexibility in defining signals and locator metadata with a larger
   bit width.
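
   The effect of configurable buckets can be illustrated with a Python
   sketch that quantizes link utilization (in percent) into eight
   buckets.  The bucket count and both sets of bounds are hypothetical
   values chosen for illustration, not encodings defined by this
   document.

```python
from bisect import bisect_right
from typing import List


def bucket_index(value: float, upper_bounds: List[float]) -> int:
    """Return the index of the first upper bound greater than `value`;
    values at or above the last bound fall in the final bucket."""
    return bisect_right(upper_bounds, value)


# Non-uniform bounds spend more buckets near saturation, where a
# congestion controller is assumed to need finer detail.
NON_UNIFORM = [50, 75, 90, 95, 98, 99, 100]
UNIFORM = [12.5, 25, 37.5, 50, 62.5, 75, 87.5]
```

   With these bounds, utilizations of 96% and 99.5% land in different
   buckets under the non-uniform encoding but share the top bucket
   under the uniform one.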

8.  Use Cases defined by Bottleneck Signals

   The use cases for CSIG are motivated by congestion control, traffic
   management and network debuggability.  These use cases existed in
   production before CSIG, often relying on signals that are measured
   end-to-end (such as packet loss and delay) or on out-of-band signals
   from network devices (such as port utilization).  CSIG provides a
   boost in performance, efficiency and debuggability by augmenting
   existing use cases with explicit in-band measurements.

   In this document, we present the use cases for the three signals
   defined in Section 5.  At the crux of a signal is the definition of
   a bottleneck.  Over time, we envision use cases for other signals
   that would define a bottleneck differently, e.g., by the maximum
   number of co-sharing flows on a link.  For each of these new
   signals, locator metadata can continue to provide attributes about
   the bottleneck port such as port capacity.

8.1.  Congestion Control

   CCAs can make use of CSIG signals in at least two different ways.
   First, existing CCAs can use CSIG values to address blind spots in
   end-to-end signals such as packet loss, delay, and delivery rates.
   This use case is immediately relevant as most production networks
   deploy some form of end-to-end congestion control, such as Swift
   [SWIFT] or BBR [BBR].  A second way to use CSIG is to design
   entirely new congestion control algorithms that use CSIG as their
   primary signal.  We focus below on the former category.

   E2E CCAs come in various forms; for simplicity, we describe the use
   cases taking Swift CC [SWIFT] as the baseline.  Swift is a
   delay-based congestion control algorithm that uses accurate
   round-trip time (RTT) measurements obtained from NIC hardware
   timestamps.  The CSIG signals described here can be applied to other
   CCAs and are not limited to Swift.

   The interpretation and applications of CSIG for congestion control in
   lossless networks and networks that use packet spraying is a topic
   for future research.

8.1.1.  Using maximum per-hop delay in E2E CC

   E2E RTT measurements used in Swift include the queueing delays on all
   hops along the flow's path, including the forward and reverse paths.
   A consequence of using a lumped delay signal is that a flow reduces
   its sending rate in response to delays that it may not be able to
   directly control.  Furthermore, in deployments where there can be
   multiple congested links along the path of a flow, it is desirable to
   modulate the sending rate of a flow in response to just the maximum
   of the per-hop delays, max(PD), along a flow's path.  Substituting
   the bottleneck delay for the end-to-end measured delay in Swift's
   equation yields the following:

   // Reduce the congestion window when bottleneck hop delay
   // exceeds a chosen target hop delay
   if (max(PD) > target_delay) then
     md = beta * (max(PD) - target_delay) / max(PD)
     cwnd = (1 - md) * cwnd
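
   For readers who want to experiment with this update, the pseudocode
   above can be written as a runnable Python function.  The function
   name and the min_cwnd floor are additions made for this sketch; the
   floor is not part of the pseudocode and only prevents the window
   from collapsing to zero.

```python
def decrease_cwnd(cwnd: float, max_pd: float, target_delay: float,
                  beta: float, min_cwnd: float = 1.0) -> float:
    """Multiplicative decrease driven by the bottleneck hop delay max(PD)."""
    if max_pd <= target_delay:
        return cwnd  # bottleneck delay is within target: no decrease
    md = beta * (max_pd - target_delay) / max_pd
    return max(min_cwnd, (1.0 - md) * cwnd)
```

   For example, with beta = 0.8, a target of 20 and max(PD) = 40, the
   decrease factor md is 0.4 and the window shrinks to 60% of its
   previous value.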

   Poseidon [POSEIDON] is a CCA proposed in the literature that
   exemplifies the use of maximum per-hop delay in reducing its
   congestion window.  By incorporating bottleneck information in its
   congestion control response, Poseidon flows achieve higher
   throughput in the presence of reverse path congestion and of
   congestion across multiple network hops.  Algorithm 1 in [POSEIDON]
   details the use of maximum per-hop delay in both the increase and
   the decrease of the congestion window.

8.1.2.  Using maximum link utilization in E2E CC

   E2E CC uses heuristics to determine by how much to increase the
   congestion window, e.g., in the case of Swift, when the measured
   round-trip time is lower than the target delay, Swift increments the
   congestion window by one per round-trip time.  BBR [BBR] increases
   the rate as a function of the flow's measured delivery rate.

   The problem with these heuristics is that they rarely get the rate
   or window adjustments exactly right and either undershoot or
   overshoot.  Undershooting the rate means that transfers take longer
   to complete even when the bottleneck link has a low utilization,
   while overshooting can cause an unnecessary increase in queueing
   delay and packet losses.

   In the following example, we integrate the maximum utilization signal
   into Swift's congestion window update equation to ramp up adaptively
   faster when the bottleneck link has low utilization.  The congestion
   window evolution is represented below:

   // Increase the congestion window in proportion
   // to the utilization headroom
   if (rtt < target_rtt) then
     fcwnd <-- fcwnd + additive_increment
               + kLambda * fcwnd * (1 - max(U/C))

   As an example, the fixed additive increase in Swift of rate <-- rate
   + Additive Increment means that it takes 200 RTTs to reach 80 Gbps
   of bandwidth with an Additive Increment of 400 Mbps (80 Gbps / 400
   Mbps = 200).  The fast ramp-up with CSIG using the bottleneck link
   utilization takes fewer than 10 RTTs to safely ramp to 80 Gbps.
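
   The difference between the two ramp-up behaviors can be explored
   with a small Python simulation.  This is a sketch under stated
   assumptions, not the authors' evaluation: it applies the update to a
   rate rather than a window, models a single flow on an otherwise idle
   link so that the flow's own rate stands in for max(U/C), and uses a
   hypothetical 100 Gbps capacity and kLambda of 1.0.

```python
def rtts_to_reach(target_mbps: float, additive_mbps: float,
                  k_lambda: float, capacity_mbps: float) -> int:
    """Count RTTs for one flow on an otherwise idle link to reach target.

    The flow's own rate is used as the link utilization, which is a
    simplifying assumption for a single-flow scenario.
    """
    rate, rtts = 0.0, 0
    while rate < target_mbps:
        headroom = max(0.0, 1.0 - rate / capacity_mbps)
        rate += additive_mbps + k_lambda * rate * headroom
        rtts += 1
    return rtts


additive_only = rtts_to_reach(80_000, 400, k_lambda=0.0, capacity_mbps=100_000)
with_csig = rtts_to_reach(80_000, 400, k_lambda=1.0, capacity_mbps=100_000)
```

   With kLambda set to zero the loop reduces to the fixed additive
   increase and reproduces the 200-RTT figure; a non-zero kLambda lets
   the rate grow roughly geometrically while headroom is large.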

8.1.3.  Using minimum available bandwidth in E2E CC

   E2E CC uses heuristics to determine the initial transfer rate for
   newly established connections.  Starting too slowly would cause the
   transfer to take longer than necessary while wasting available
   bandwidth, whereas starting too quickly would cause queue delays and
   packet drops.  The same dilemma exists for transfers that are
   starting on a connection that has been idle for multiple round-trip
   times.

   In networks where we know ahead of time that the degree of
   multiplexing is low, i.e., only a handful of flows coexist on the
   link at any point in time, transfers complete quickly when they
   "jump-start" to use up all of the bottleneck bandwidth.  This is
   especially helpful when transports employ robust loss recovery
   mechanisms such that even if the queue overflows, any lost packets
   can be quickly recovered.

   As an example, on an otherwise idle 200 Gbps network, a single
   transfer can use up the entire 200 Gbps in the second RTT, after the
   CSIG feedback in the first RTT indicates the availability of 200
   Gbps at the bottleneck link.

   CSIG's min(ABW) signal, which reports the available bandwidth at
   the bottleneck, allows transfers to start safely at line rate.
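
   One way a transport might turn the reported min(ABW) into a starting
   window is to size the window to the corresponding bandwidth-delay
   product.  The Python sketch below is an illustration under that
   assumption; the function name, the MSS value used in the example,
   and the one-packet floor are hypothetical.

```python
def initial_cwnd_packets(min_abw_bps: float, rtt_s: float,
                         mss_bytes: int, floor_packets: int = 1) -> int:
    """Size the first window to the bandwidth-delay product implied by
    the reported minimum available bandwidth."""
    bdp_bytes = min_abw_bps * rtt_s / 8
    return max(floor_packets, int(bdp_bytes // mss_bytes))
```

   For instance, 200 Gbps of available bandwidth over a 1 ms RTT
   corresponds to 25,000,000 bytes in flight, or 6103 full 4096-byte
   segments.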

8.2.  Traffic Management

   CSIG encodes the most notable information about the path for each
   flow by carrying bottleneck link signals and bottleneck locator
   metadata.  This path-level information, which is obtained directly
   from application data packets rather than synthetic probes, is
   directly attributable to the flow and is valuable for traffic
   engineering and application performance debugging.

8.2.1.  Load Balancing and Multipathing

   Datacenter topologies offer a diverse set of paths between any
   source-destination pair.  Transports employ techniques such as
   Protective Load Balancing [PLB] and Multipathing [RFC8684] to spread
   traffic across the multitude of paths.  Load balancing and
   multipathing in transports use a combination of end-to-end signals
   and heuristics to select which paths to use and how much traffic to
   send on each path.

   Using CSIG signals from bottleneck links along the diverse set of
   paths, load balancing and multipathing schemes can select high
   quality paths with lower congestion, and spread traffic across them
   in a congestion-aware manner.

   Locator metadata can also be used to distinguish between incast
   congestion and core network congestion, which can then be used to
   adjust load balancing / multipathing actions.  For instance, the
   stage of the bottleneck and link orientation attributes are enough to
   determine whether the last hop is the bottleneck or not.  When the
   last hop is the bottleneck, flow-level load balancing / multipathing
   actions may not be effective and may, in fact, worsen incasts.  Such
   cases may require application-level load balancing or job scheduling
   techniques to distribute traffic.  However, when congestion is
   instead known to be in the core network, flow-level load balancing /
   multipathing actions can route around congested areas and improve
   performance.
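
   The repathing decision described above can be sketched in Python as
   follows.  The PathSignal record, its fields, and the choose_repath
   function are hypothetical; in particular, the last-hop flag stands
   in for whatever a deployment derives from the stage and link
   orientation attributes of the locator metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PathSignal:
    path_id: int
    max_utilization: float        # bottleneck max(U/C) on this path, 0..1
    bottleneck_is_last_hop: bool  # derived from locator metadata


def choose_repath(current: PathSignal,
                  candidates: List[PathSignal]) -> Optional[int]:
    """Return a less congested path id, or None to stay on the current path."""
    if current.bottleneck_is_last_hop:
        return None  # incast: another path shares the same last hop
    best = min(candidates, key=lambda p: p.max_utilization, default=current)
    if best.max_utilization < current.max_utilization:
        return best.path_id
    return None
```

   Returning None in the last-hop case leaves the remedy to
   application-level load balancing or job scheduling.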

8.2.2.  Traffic Engineering

   Traffic Engineering (TE) carves out paths with appropriate bandwidth
   across aggregate source-destination pairs.  Examples within a
   datacenter include the Datacenter Network Interconnection Layer
   (DCNI) [JUPITEREVOL].  CSIG can be used to provide fine-grained
   path-level information, including short-timescale microburst
   congestion, to TE systems.  By using summarized CSIG signals
   aggregated both spatially and temporally across flows, TE can select
   paths and balance traffic at the datacenter level to accommodate
   bursty traffic, e.g., from ML workloads.

8.3.  Application Performance Debugging

   Applications often complain that the network is slow, but it can be
   challenging to identify the specific segment of the network that is
   causing the problem.  This is especially true with the scale of
   datacenters, where flows can traverse up to nine hops [JUPITEREVOL].
   Figuring out where the bottleneck is and the timescales at which the
   path poses a bottleneck is like searching for a needle in a haystack
   for an application with thousands of flows across various
   source-destination pairs.

   On application network flows, CSIG information, with its bottleneck
   locator, can quickly and precisely answer why the flows are slow and
   where the network / path bottlenecks are.

   CSIG can also be enabled on mesh prober systems similar to [PINGMESH]
   to augment end-to-end probe measurements between any two servers with
   bottleneck information to aid troubleshooting.

9.  Security Considerations

   Construction, initialization and insertion of a CSIG tag into
   packets MUST be restricted to trusted sender hosts and authorized
   flows.  Depending on the deployment, the authorization can be
   enforced at the NICs or at the switches, akin to firewall rules.
   CSIG stripping may also be employed as a fencing rule at domain
   boundaries to ensure that unauthorized CSIG-tags do not cross these
   boundaries.

   A rogue or broken network device in a private network might write
   arbitrary CSIG values, or insert a CSIG tag in packets on a transit
   node.  We expect there to be checks and balances to identify and take
   non-functioning or rogue network devices out of a private network, as
   they can cause greater harm than distributing misleading CSIG
   values.

10.  IANA Considerations

   There are no IANA considerations.  CSIG Tag Protocol Identifier
   (TPID) is requested from IEEE.

11.  Conclusions

   With the increased deployment of applications that are sensitive to
   delay and bandwidth usage in data centers, e.g., AI/ML/HPC workloads
   and RDMA-based applications, relying solely on end-to-end signals is
   insufficient under dynamically changing traffic patterns.  Simple and
   timely signals from network devices to end hosts can augment
   end-host transports so that they make better use of datacenter
   bandwidth.  CSIG is a simple, practical and deployable protocol for
   distributing congestion information in networks that builds on the
   successful aspects of prior work and is grounded in use cases of
   congestion control, traffic management and network debuggability.

12.  Acknowledgements

   This work would not have been possible without the following
   individuals, whose various engineering and design contributions
   shaped CSIG and its use cases:

   Christopher Alfeld, Neelesh Bansod, Jis Ben, Neal Cardwell, Yongzhou
   Chen, Yuchung Cheng, Dal Chand Choudhary, Mick Fingleton, Mahmudul
   Hasan, Jeffrey Ji, Marc De Kruijf, Praveen Kumar, Rich Lane, Chang
   Liu, Morley Mao, Carl Mauer, Sachin Menezes, Nipen Mody, Masoud
   Moshref, Alex Rumyantsev, Gerald Schmidt, Arjun Singh, Arjun Singhvi,
   Babru Thatikunta, Jeff Tikkanen, Frank Uyeda, Brian Vasquez, Rui
   Wang, Hassan Wassel, Yong Xia, Zhengxu Xia, Kevin Yang, Liangcheng
   Yu.

   We would like to thank Arjun Singh, David Wetherall, Neal Cardwell,
   Akash Deshpande and Arvind Krishnamurthy for their feedback on
   several portions of this document.

13.  Normative References

   [BBR]      Cardwell, N., Cheng, Y., Gunn, C., Yeganeh, S., and V.
              Jacobson, "BBR: congestion-based congestion control",
              Communications of the ACM vol. 60, no. 2, pp. 58-66,
              DOI 10.1145/3009824, January 2017,
              <https://doi.org/10.1145/3009824>.

   [CONGA]    Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan,
              R., Chu, K., Fingerhut, A., Lam, V., Matus, F., Pan, R.,
              Yadav, N., and G. Varghese, "CONGA: distributed
              congestion-aware load balancing for datacenters", ACM
              SIGCOMM Computer Communication Review vol. 44, no. 4, pp.
              503-514, DOI 10.1145/2740070.2626316, August 2014,
              <https://doi.org/10.1145/2740070.2626316>.

   [DCQCN]    Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM Computer Communication
              Review vol. 45, no. 4, pp. 523-536,
              DOI 10.1145/2829988.2787484, August 2015,
              <https://doi.org/10.1145/2829988.2787484>.

   [HPCCPLUS] "High-precision congestion control (HPCC++) deployment at
              Alibaba leveraging In-band Flow Analyzer (IFA)", n.d.,
              <https://www.broadcom.com/blog/high-precision-congestion-
              control>.

   [I-D.kumar-ippm-ifa]
              Kumar, J., Anubolu, S., Lemon, J., Manur, R., Holbrook,
              H., Ghanwani, A., Cai, D., Ou, H., Li, Y., and X. Wang,
              "Inband Flow Analyzer", Work in Progress, Internet-Draft,
              draft-kumar-ippm-ifa-07, 7 September 2023,
              <https://datatracker.ietf.org/doc/html/draft-kumar-ippm-
              ifa-07>.

   [I-D.miao-tsv-hpcc]
              Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
              Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++:
              Enhanced High Precision Congestion Control", Work in
              Progress, Internet-Draft, draft-miao-tsv-hpcc-02, 17 May
              2023, <https://datatracker.ietf.org/doc/html/draft-miao-
              tsv-hpcc-02>.

   [JUPITEREVOL]
              Poutievski, L., Mashayekhi, O., Ong, J., Singh, A., Tariq,
              M., Wang, R., Zhang, J., Beauregard, V., Conner, P.,
              Gribble, S., Kapoor, R., Kratzer, S., Li, N., Liu, H.,
              Nagaraj, K., Ornstein, J., Sawhney, S., Urata, R.,
              Vicisano, L., Yasumura, K., Zhang, S., Zhou, J., and A.
              Vahdat, "Jupiter evolving: transforming google's
              datacenter network via optical circuit switches and
              software-defined networking", Proceedings of the ACM
              SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544265,
              August 2022, <https://doi.org/10.1145/3544216.3544265>.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification",
              n.d., <https://p4.org/p4-spec/docs/INT_v2_1.pdf>.

   [PINGMESH] Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz,
              D., Liu, Z., Wang, V., Pang, B., Chen, H., Lin, Z., and V.
              Kurien, "Pingmesh: A Large-Scale System for Data Center
              Network Latency Measurement and Analysis", ACM SIGCOMM
              Computer Communication Review vol. 45, no. 4, pp. 139-152,
              DOI 10.1145/2829988.2787496, August 2015,
              <https://doi.org/10.1145/2829988.2787496>.

   [PLB]      Qureshi, M., Cheng, Y., Yin, Q., Fu, Q., Kumar, G.,
              Moshref, M., Yan, J., Jacobson, V., Wetherall, D., and A.
              Kabbani, "PLB: congestion signals are simple and effective
              for network load balancing", Proceedings of the ACM
              SIGCOMM 2022 Conference, DOI 10.1145/3544216.3544226,
              August 2022, <https://doi.org/10.1145/3544216.3544226>.

   [PONYEXPRESS]
              Marty, M., de Kruijf, M., Adriaens, J., Alfeld, C., Bauer,
              S., Contavalli, C., Dalton, M., Dukkipati, N., Evans, W.,
              Gribble, S., Kidd, N., Kononov, R., Kumar, G., Mauer, C.,
              Musick, E., Olson, L., Rubow, E., Ryan, M., Springborn,
              K., Turner, P., Valancius, V., Wang, X., and A. Vahdat,
              "Snap: a microkernel approach to host networking",
              Proceedings of the 27th ACM Symposium on Operating
              Systems Principles, DOI 10.1145/3341301.3359657, October
              2019, <https://doi.org/10.1145/3341301.3359657>.

   [POSEIDON] Wang, W., Moshref, M., Li, Y., Kumar, G., Ng, E.,
              Cardwell, N., and N. Dukkipati, "Poseidon: Efficient,
              Robust, and Practical Datacenter CC via Deployable INT",
              2023,
              <https://www.usenix.org/conference/nsdi23/presentation/
              wang-weitao>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/rfc/rfc3168>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/rfc/rfc8257>.

   [RFC8684]  Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C.
              Paasch, "TCP Extensions for Multipath Operation with
              Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March
              2020, <https://www.rfc-editor.org/rfc/rfc8684>.

   [RFC9000]  Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based
              Multiplexed and Secure Transport", RFC 9000,
              DOI 10.17487/RFC9000, May 2021,
              <https://www.rfc-editor.org/rfc/rfc9000>.

   [RFC9378]  Brockners, F., Ed., Bhandari, S., Ed., Bernier, D., and T.
              Mizrahi, Ed., "In Situ Operations, Administration, and
              Maintenance (IOAM) Deployment", RFC 9378,
              DOI 10.17487/RFC9378, April 2023,
              <https://www.rfc-editor.org/rfc/rfc9378>.

   [SWIFT]    Kumar, G., Dukkipati, N., Jang, K., Wassel, H., Wu, X.,
              Montazeri, B., Wang, Y., Springborn, K., Alfeld, C., Ryan,
              M., Wetherall, D., and A. Vahdat, "Swift: Delay is Simple
              and Effective for Congestion Control in the Datacenter",
              Proceedings of the Annual conference of the ACM Special
              Interest Group on Data Communication on the applications,
              technologies, architectures, and protocols for
              computer communication, DOI 10.1145/3387514.3406591, July
              2020, <https://doi.org/10.1145/3387514.3406591>.

   [TCP-INT]  Jereczek, G., Jepsen, T., Wass, S., Pujari, B., Zhen, J.,
              and J. Lee, "TCP-INT: lightweight network telemetry with
              TCP transport", Proceedings of the SIGCOMM '22 Poster and
              Demo Sessions, DOI 10.1145/3546037.3546064, August 2022,
              <https://doi.org/10.1145/3546037.3546064>.

Appendix A.  Example encodings of CSIG signals

   The following table shows an example encoding of a 3-bit signal
   value.  Note that this is an example ONLY.  The encoding that is
   meaningful for a given deployment depends on the use cases under
   consideration.

   Note that the CSIG tag supports a 5-bit signal value in the compact
   format and a 20-bit signal value in the expanded format.

              +=======+============+===========+============+
              | Value | min(ABW/C) | min(ABW)  | max(PD)    |
              +=======+============+===========+============+
              | 0x0   | 0%-1%      | 0-1Gbps   | 0-10us     |
              +-------+------------+-----------+------------+
              | 0x1   | 1%-5%      | 1-5Gbps   | 10-50us    |
              +-------+------------+-----------+------------+
              | 0x2   | 5%-10%     | 5-10Gbps  | 50-100us   |
              +-------+------------+-----------+------------+
              | 0x3   | 10%-20%    | 10-20Gbps | 100-200us  |
              +-------+------------+-----------+------------+
              | 0x4   | 20%-50%    | 20-50Gbps | 200-400us  |
              +-------+------------+-----------+------------+
              | 0x5   | 50%-75%    | 50-75Gbps | 400-800us  |
              +-------+------------+-----------+------------+
              | 0x6   | 75%-90%    | 75-90Gbps | 800-2000us |
              +-------+------------+-----------+------------+
              | 0x7   | 90%-100%   | >90Gbps   | >2000us    |
              +-------+------------+-----------+------------+

                                  Table 2
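
   The example encoding in the table above can be sketched in code.  The
   following Python fragment is illustrative only and is not part of the
   protocol: the function name encode_3bit, the edge lists, and the
   per-hop sample values are hypothetical, and the rule that a value
   lying exactly on a bucket boundary maps to the lower bucket is an
   assumption (chosen to agree with the ">90Gbps" and ">2000us" entries
   in the last row).  The sketch quantizes a raw measurement into a
   3-bit value and then summarizes several hops by taking the minimum
   or maximum of the encoded values.

```python
from bisect import bisect_left

# Upper bucket edges copied from the example table. Illustrative only;
# a deployment would choose boundaries that suit its own use cases.
ABW_FRACTION_EDGES = [0.01, 0.05, 0.10, 0.20, 0.50, 0.75, 0.90]  # min(ABW/C)
ABW_GBPS_EDGES = [1, 5, 10, 20, 50, 75, 90]                      # min(ABW)
PD_US_EDGES = [10, 50, 100, 200, 400, 800, 2000]                 # max(PD)


def encode_3bit(value, edges):
    """Quantize a raw measurement into a 3-bit value (0x0..0x7).

    Assumption: a value exactly on a boundary falls into the lower
    bucket. The table itself does not specify boundary handling.
    """
    if value < 0:
        raise ValueError("measurement must be non-negative")
    # Seven edges split the range into eight buckets, so the result
    # always fits in three bits.
    return bisect_left(edges, value)


# Hypothetical per-hop measurements along one path.
hop_abw_gbps = [40, 8, 60]
hop_pd_us = [5, 120, 30]

abw_codes = [encode_3bit(v, ABW_GBPS_EDGES) for v in hop_abw_gbps]
pd_codes = [encode_3bit(v, PD_US_EDGES) for v in hop_pd_us]

# The buckets are ordered, so the minimum (maximum) of the encoded
# values equals the encoding of the minimum (maximum) raw value.
path_min_abw = min(abw_codes)
path_max_pd = max(pd_codes)

print(f"min(ABW) = {path_min_abw:#x}, max(PD) = {path_max_pd:#x}")
```

   Because the buckets are ordered, a device in this sketch only needs
   to compare encoded values; it never has to recover the raw
   measurement that an upstream device observed.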

Contributors

   Weida Huang
   Google LLC

   Tyler Griggs
   UC Berkeley

   Mohammad Jafar Akhbarizadeh
   Google LLC

   Jeongkeun Lee
   Google LLC

   Surendra Anubolu
   Broadcom Inc.

   Kok-Kiong Yap
   Google LLC

   Neal Cardwell
   Google LLC

Authors' Addresses

   Abhiram Ravi
   Google LLC
   Email: abhiramr@google.com

   Nandita Dukkipati
   Google LLC
   Email: nanditad@google.com

   Naoshad Mehta
   Google LLC
   Email: naoshad@google.com

   Jai Kumar
   Broadcom Inc.
   Email: jai.kumar@broadcom.com
