RTGWG Working Group                                                X. He
Internet-Draft                                                   L. Deng
Intended status: Standards Track                           China Telecom
Expires: 18 August 2026                                 14 February 2026

     PFC PAUSE Frame Forwarded Transparently in Wide Area Networks
                       draft-he-rtgwg-wan-pfc-00

Abstract

   This document describes a solution for transparent forwarding of PFC
   PAUSE frames in wide area networks, which does not require the nodes
   in wide area networks to support PFC flow control capabilities.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 18 August 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

He & Deng                Expires 18 August 2026                 [Page 1]
Internet-Draft   PFC PAUSE Frame Forwarded Transparently   February 2026

Table of Contents

   1.  Introduction
   2.  Conventions
     2.1.  Requirements Language
     2.2.  Terminology
   3.  Transparent Forwarding of PFC PAUSE Frames in WANs
     3.1.  Flow Control Mechanism For PFC Frame
     3.2.  PFC PAUSE Frame Processing
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   Remote Direct Memory Access (RDMA) is a method of accessing memory on
   a remote system without interrupting the processing of the Central
   Processing Unit (CPU) on that system.  RDMA enables lower latency and
   higher throughput on the network and lower CPU utilization for the
   servers and storage systems.  Currently, RoCEv2 (RDMA over Converged
   Ethernet Version 2) is widely deployed in lossless networks in
   intelligent computing centers, providing packet loss free data
   transmission services for high-performance computing (HPC) and AI
   model training and inference scenarios.

   With the rapid growth in demand for computing and storage resources
   driven by large AI models and distributed storage, intelligent
   computing centers are being interconnected through wide area networks
   (WANs) so that multiple DCs can collaborate and compensate for the
   limited computing and storage resources of a single DC.  The
   interconnection of artificial intelligence Data Centers (AIDCs)
   through WANs is becoming a network structure gradually accepted by
   the industry, providing wide area lossless transmission for emerging
   application scenarios.  Priority-based Flow Control (PFC)
   [IEEE.802.1Q.2022] is widely deployed in RoCEv2 networks to avoid
   packet loss caused by congestion.  However, deploying PFC in WANs may
   lead to head-of-line blocking, deadlocks, and congestion spreading
   over a wider range, all of which degrade network performance.  In
   addition, WANs need to provide differentiated services for various
   applications, and nodes differ in buffering capacity while links
   differ in delay, so PFC parameters would have to be configured
   differently on each node, which complicates network operation and
   maintenance.  Therefore, the PFC mechanism is not suitable for
   large-scale deployment in WANs.

   This document describes a solution for transparent forwarding of PFC
   PAUSE frames in WANs, which does not require the nodes in WANs to
   support PFC flow control capabilities.  As a result, end-to-end flow
   control between AIDCs interconnected through WANs can be realized
   with minimal impact on network performance.

2.  Conventions

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.2.  Terminology

   Abbreviations used in this document:

   AI: Artificial Intelligence

   AIDC: Artificial Intelligence Data Center

   DC: Data Center

   MAC: Media Access Control

   P: Provider

   PE: Provider Edge

   PFC: Priority-based Flow Control

   RDMA: Remote Direct Memory Access

   RoCEv2: RDMA over Converged Ethernet version 2

   SR-MPLS: Segment Routing Based on Multiprotocol Label Switching

   SRv6: Segment Routing over IPv6

   VXLAN: Virtual Extensible Local Area Network

   WAN: Wide Area Network

3.  Transparent Forwarding of PFC PAUSE Frames in WANs

3.1.  Flow Control Mechanism For PFC Frame

   PFC is a classical hop-by-hop back-pressure mechanism based on a
   dedicated Ethernet pause frame.  It is widely deployed in RoCEv2
   networks to avoid packet loss caused by congestion.  The PFC PAUSE
   frame format is shown in Figure 1.

        +--------------------------+
 6Bytes |   DMAC(0180-C200-0001)   |
        +--------------------------+
 6Bytes |  SMAC(Sender Port MAC)   |
        +--------------------------+
 2Bytes |    Ethertype(0x8808)     |
        +--------------------------+
 2Bytes |      Opcode(0x0101)      |             +----------------------------------+
        +--------------------------+   high 8bit |              0x00                |
 2Bytes |    Class enable vector   | -->         +----------------------------------+
        +--------------------------+    low 8bit | e[7]e[6]e[5]e[4]e[3]e[2]e[1]e[0] |
 2Bytes |      PAUSE Time[0]       |             +----------------------------------+
        +--------------------------+      e[n]corresponds to different priority class
 2Bytes |      PAUSE Time[1]       |      e[n]=1,PAUSE Time valid
        +--------------------------+      e[n]=0,PAUSE Time invalid
 2Bytes |           ...            |
        +--------------------------+
 2Bytes |      PAUSE Time[7]       |
        +--------------------------+
 26Bytes|           Pad            |
        +--------------------------+
 4Bytes |           CRC            |
        +--------------------------+

                   Figure 1: PFC PAUSE Frame Format
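
   The layout in Figure 1 can be illustrated with a minimal Python
   sketch that builds and parses a PFC PAUSE frame without the trailing
   CRC.  The function names build_pfc_frame and parse_pfc_frame are
   hypothetical and are not defined by this document; the field values
   are taken from Figure 1.

```python
import struct

PFC_DMAC = bytes.fromhex("0180c2000001")  # destination MAC from Figure 1
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
PAD_LEN = 26  # pads the frame to 60 bytes; the 4-byte CRC is not built here


def build_pfc_frame(smac: bytes, pause_times: dict[int, int]) -> bytes:
    """Build a PFC PAUSE frame (without CRC).

    pause_times maps a priority (0..7) to its 16-bit PAUSE Time value.
    """
    if len(smac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    vector = 0
    times = [0] * 8
    for prio, pause_time in pause_times.items():
        if not 0 <= prio <= 7 or not 0 <= pause_time <= 0xFFFF:
            raise ValueError("priority or PAUSE Time out of range")
        vector |= 1 << prio  # e[prio] = 1: PAUSE Time[prio] is valid
        times[prio] = pause_time
    header = PFC_DMAC + smac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, vector)
    return header + struct.pack("!8H", *times) + bytes(PAD_LEN)


def parse_pfc_frame(frame: bytes) -> dict[int, int]:
    """Return {priority: PAUSE Time} for every enabled priority."""
    if len(frame) < 34:
        raise ValueError("frame too short for a PFC PAUSE frame")
    ethertype, opcode, vector = struct.unpack_from("!HHH", frame, 12)
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PFC_OPCODE:
        raise ValueError("not a PFC PAUSE frame")
    times = struct.unpack_from("!8H", frame, 18)
    return {p: times[p] for p in range(8) if vector & (1 << p)}
```

   For instance, build_pfc_frame(smac, {3: 0xFFFF}) sets only e[3] in
   the class enable vector, and parse_pfc_frame recovers the same
   priority and PAUSE Time from the resulting 60-byte frame.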

   With this flow control mechanism, a congested node asks its directly
   connected upstream node to pause the data traffic of one or more
   priorities by sending a dedicated Ethernet pause frame, called a PFC
   frame.  If that upstream node becomes congested in turn, it sends a
   PFC frame to its own directly connected upstream node, and so on hop
   by hop, until the most upstream network node asks the directly
   connected traffic sender to pause.  [IEEE.802.1Q.2022] details how
   this flow control mechanism works.

   Typically, when two AIDCs are interconnected through WANs, VPN
   tunnels (e.g., SR-MPLS, SRv6, VXLAN) are established between the
   ingress PE and egress PE to carry massive RDMA traffic between DCs,
   as shown in Figure 2.

     +----------+                                          +----------+
     |  AIDC 1  |                                          |  AIDC 2  |
     |          |                                          |          |
     +----------+                                          +----------+
         ^                                                        |
         |PFC Frame                                      PFC Frame|
         |                                                        v
     +-------+       +----+      +--------+       +----+      +--------+
     | DC1 GW|  -->  |PE1 |  --> |P1...Pn |  -->  |PE2 |  --> | DC2 GW |
     +-------+       +----+      +--------+       +----+      +--------+
        |               |                            |               |
        |<--------------|<---------------------------|<--------------|
           PFC Frame        PFC Frame Forwarding           PFC Frame

               Figure 2: AIDCs Interconnected Through WANs

3.2.  PFC PAUSE Frame Processing

   When congestion occurs in the destination AIDC, PFC frames are sent
   hop by hop until they reach the destination DC gateway.  Similarly,
   when congestion occurs at the receiving port of the destination DC
   gateway, the gateway asks its directly connected upstream node, the
   egress PE, to pause the data traffic by sending a PFC frame.  In
   Figure 2, AIDC 2 sends PFC frames to the DC2 gateway, and the DC2
   gateway in turn sends PFC frames to PE2.

   When the egress PE node of the WAN receives a PFC frame, it needs to
   parse the frame and determine that it is a valid PFC frame.  That
   is, in addition to having the correct frame format, its destination
   MAC address must be the multicast address 0180-C200-0001 and its
   source MAC address must be the MAC address of the directly connected
   downstream DC gateway port (some vendors use the device system MAC
   address instead).  Otherwise, the egress PE node must discard the
   invalid PFC frame.
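
   The validity check described above can be sketched as follows.  This
   is an illustration only: the function name is_valid_pfc_frame and
   the allowed_smacs parameter are hypothetical, and the frame is
   assumed to be handed over as raw bytes without the CRC.  The set
   allowed_smacs would hold the MAC address of the neighbor's port and,
   where the vendor uses it, the neighbor's device system MAC address.

```python
import struct

PFC_DMAC = bytes.fromhex("0180c2000001")  # multicast address 0180-C200-0001
MIN_FRAME_LEN = 60  # frame length from Figure 1 without the 4-byte CRC


def is_valid_pfc_frame(frame: bytes, allowed_smacs: set[bytes]) -> bool:
    """Return True if the frame passes the checks listed in the text."""
    if len(frame) < MIN_FRAME_LEN:
        return False
    dmac, smac = frame[0:6], frame[6:12]
    ethertype, opcode, vector = struct.unpack_from("!HHH", frame, 12)
    # Frame format checks: MAC Control Ethertype and PFC opcode.
    if ethertype != 0x8808 or opcode != 0x0101:
        return False
    # The high 8 bits of the class enable vector are 0x00 in Figure 1.
    if vector >> 8:
        return False
    # Destination MAC must be the reserved multicast address.
    if dmac != PFC_DMAC:
        return False
    # Source MAC must belong to the directly connected neighbor.
    return smac in allowed_smacs
```

   A node applying this sketch would discard any frame for which the
   function returns False.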

   The egress PE node encapsulates the PFC frame according to the tunnel
   encapsulation protocol and forwards it to the next transit node,
   which in turn forwards it transparently towards the upstream node,
   until it reaches the ingress PE node.

   The ingress PE node decapsulates the PFC frame and replaces the
   source MAC address in the original PFC frame with the MAC address of
   its port directly connected to the source DC gateway, then forwards
   it to the source DC gateway.
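
   The source MAC replacement at the ingress PE can be sketched as
   below.  The function name rewrite_source_mac is hypothetical, and
   the decapsulated frame is assumed to be available as raw bytes
   without the CRC.  Because the Ethernet CRC covers the source MAC
   field, it has to be recomputed when the rewritten frame is
   transmitted.

```python
def rewrite_source_mac(pfc_frame: bytes, local_port_mac: bytes) -> bytes:
    """Replace bytes 6..11 (the source MAC) of a decapsulated PFC frame.

    local_port_mac is the MAC address of the ingress PE port that is
    directly connected to the source DC gateway.
    """
    if len(pfc_frame) < 12:
        raise ValueError("frame too short to contain MAC addresses")
    if len(local_port_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # Keep the destination MAC and everything after the source MAC.
    return pfc_frame[:6] + local_port_mac + pfc_frame[12:]
```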

   To ensure that the PFC frames reach the ingress PE quickly, it is
   preferable to assign the highest priority to the encapsulated PFC
   frames so that they are not discarded in case of network congestion.

   Similarly, the source DC gateway needs to parse the forwarded PFC
   frame and determine that it is a valid PFC frame.  That is, in
   addition to having the correct frame format, its destination MAC
   address must be the multicast address 0180-C200-0001 and its source
   MAC address must be the MAC address of the directly connected
   ingress PE port (some vendors use the device system MAC address
   instead).  Otherwise, the source DC gateway must discard the invalid
   PFC frame.

   The source DC gateway sends PFC frames to the source AIDC (AIDC 1 in
   Figure 2) when congestion occurs at its receiving port.
   Consequently, end-to-end flow control between AIDCs can be realized
   across WANs.

   As an example, consider two AIDCs interconnected through an SRv6
   tunnel across the WAN.  The encapsulated PFC frame format is as
   follows:

     +-------------------------------+
     |          IPv6 Header          |
     +-------------------------------+
     |  IPv6 Extension Header (SRH)  |
     +-------------------------------+
     |     Original PFC Frame        |
     +-------------------------------+
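
   A minimal sketch of this encapsulation is given below, assuming the
   generic IPv6 header and SRH layouts, with Next Header 43 for the
   routing header, Routing Type 4 for the SRH, and Next Header 143 for
   an Ethernet payload.  The function name srv6_encapsulate, the hop
   limit of 64, and the default Traffic Class of 0xE0 (DSCP 56, used
   here to stand for "highest priority") are illustrative assumptions
   and are not specified by this document.

```python
import ipaddress
import struct

NH_ROUTING = 43       # IPv6 Next Header: Routing header
NH_ETHERNET = 143     # IPv6 Next Header: Ethernet payload
SRH_ROUTING_TYPE = 4  # Segment Routing Header


def srv6_encapsulate(frame: bytes, src: str, segments: list[str],
                     traffic_class: int = 0xE0) -> bytes:
    """Prepend an IPv6 header and an SRH to an original PFC frame.

    segments lists the SIDs in travel order; the SRH stores them in
    reverse order, so the first SID to visit is the last list entry.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    if not 0 <= traffic_class <= 0xFF:
        raise ValueError("traffic class must fit in 8 bits")
    seg_list = [ipaddress.IPv6Address(s).packed for s in reversed(segments)]
    n = len(seg_list)
    # Hdr Ext Len counts 8-octet units after the first 8 octets.
    srh = struct.pack("!BBBBBBH", NH_ETHERNET, 2 * n, SRH_ROUTING_TYPE,
                      n - 1,   # Segments Left: index of the active SID
                      n - 1,   # Last Entry: index of the last list entry
                      0, 0)    # Flags, Tag
    srh += b"".join(seg_list)
    first_word = (6 << 28) | (traffic_class << 20)  # version 6, flow label 0
    ipv6 = struct.pack("!IHBB", first_word, len(srh) + len(frame),
                       NH_ROUTING, 64)
    ipv6 += ipaddress.IPv6Address(src).packed
    ipv6 += ipaddress.IPv6Address(segments[0]).packed  # DA = first SID
    return ipv6 + srh + frame
```

   With two SIDs and a 60-byte PFC frame, the result is a 140-byte
   packet: a 40-byte IPv6 header, a 40-byte SRH, and the original
   frame.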

   Because transmission distances in WANs are much longer than inside a
   DC, the PFC frames forwarded from the egress PE to the ingress PE
   experience a significant transmission delay.  The destination DC
   gateway continues to receive the data traffic sent by the source DC
   gateway until the source DC gateway receives the PFC frames and
   pauses the traffic of the corresponding priority.  The amount of
   data received by the destination DC gateway in this interval is
   positively correlated with the transmission delay of the PFC frame.
   To avoid packet loss caused by overflow of the receiving port queue,
   the destination DC gateway needs to reserve additional buffer for
   the corresponding priority queue of the receiving port based on the
   WAN transmission delay of the PFC frame.

   The reserved buffer setting for the priority queue of the receiving
   port at the destination DC gateway is required to meet the following
   condition.

      buffer size of the priority queue reserved for the receiving port
         > (average receiving rate of the corresponding priority flow
            at the receiving port
            - average sending rate of the corresponding priority flow
              at the sending port)
           * forwarding delay of the PFC frame from the destination DC
             gateway to the source DC gateway
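
   The condition above can be evaluated with a short sketch.  The
   function name min_reserved_buffer_bytes and the choice of units
   (bit/s for rates, seconds for delay, bytes for the result) are
   assumptions made for illustration.  The reserved buffer needs to be
   strictly larger than the value returned.

```python
def min_reserved_buffer_bytes(recv_rate_bps: float,
                              send_rate_bps: float,
                              pfc_delay_s: float) -> float:
    """Lower bound on the reserved buffer for one priority queue.

    recv_rate_bps: average receiving rate of the priority flow at the
                   receiving port, in bit/s
    send_rate_bps: average sending rate of the priority flow at the
                   sending port, in bit/s
    pfc_delay_s:   forwarding delay of the PFC frame from the
                   destination DC gateway to the source DC gateway,
                   in seconds
    """
    if pfc_delay_s < 0:
        raise ValueError("delay must not be negative")
    # The queue only grows while traffic arrives faster than it leaves.
    excess_bps = max(recv_rate_bps - send_rate_bps, 0.0)
    return excess_bps * pfc_delay_s / 8  # convert bits to bytes
```

   For example, with an assumed receiving rate of 100 Gbit/s, a sending
   rate of 60 Gbit/s, and a PFC forwarding delay of 10 ms, the bound is
   about 50 megabytes (40 Gbit/s * 0.010 s = 400 Mbit).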

4.  IANA Considerations

   This document has no IANA actions.

5.  Security Considerations

   This document does not introduce any new security considerations.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017,
              <https://www.rfc-editor.org/info/rfc8126>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

6.2.  Informative References

   [IEEE.802.1Q.2022]
              IEEE, "IEEE Standard for Local and Metropolitan Area
              Networks--Bridges and Bridged Networks", IEEE Std
              802.1Q-2022, DOI 10.1109/IEEESTD.2022.10004498,
              30 December 2022,
              <https://ieeexplore.ieee.org/document/10004498>.

Authors' Addresses

   Xiaoming He
   China Telecom
   Email: hexm4@chinatelecom.cn

   Lijie Deng
   China Telecom
   Email: denglj4@chinatelecom.cn
