IDR                                                              K. Wang
Internet-Draft                                             M. Styszynski
Intended status: Standards Track                                  W. Lin
Expires: 4 June 2026                                      M. Subramaniam
                                                                     HPE
                                                                T. Kampa
                                                                    Audi
                                                                D. Singh
                                             Oracle Cloud Infrastructure
                                                         1 December 2025

                BGP Deterministic Path Forwarding (DPF)
                         draft-wang-idr-dpf-00

Abstract

   Modern data center (DC) fabrics typically employ Clos topologies with
   External BGP (EBGP) for plain IPv4/IPv6 routing.  While hop-by-hop
   EBGP routing is simple and scalable, it provides only a single best-
   effort forwarding service for all types of traffic.  This single
   best-effort service might be insufficient for increasingly diverse
   traffic requirements in modern DC environments.  For example, loss
   and latency sensitive AI/ML flows may demand stronger Service Level
   Agreements (SLA) than general purpose traffic.  Duplication schemes
   which are standardized through protocols such as Parallel Redundancy
   Protocol (PRP) require disjoint forwarding paths to avoid single
   points of failure.  Congestion avoidance may require more
   deterministic forwarding behavior.

   This document introduces BGP Deterministic Path Forwarding (DPF), a
   mechanism that partitions the physical fabric into multiple logical
   fabrics.  Flows can be mapped to different logical fabrics based on
   their specific requirements, enabling deterministic forwarding
   behavior within the data center.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

Wang, et al.               Expires 4 June 2026                  [Page 1]
Internet-Draft                     DPF                     December 2025

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 June 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  BGP DPF . . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  BGP Session Coloring  . . . . . . . . . . . . . . . . . .   4
       2.1.1.  Strict Mode . . . . . . . . . . . . . . . . . . . . .   4
       2.1.2.  Loose Mode  . . . . . . . . . . . . . . . . . . . . .   5
     2.2.  Route Coloring  . . . . . . . . . . . . . . . . . . . . .   6
       2.2.1.  Route Coloring at the Egress Leaf . . . . . . . . . .   6
       2.2.2.  Color Matching at the Spine and Super Spine . . . . .   7
       2.2.3.  Flow Mapping at the Ingress Leaf  . . . . . . . . . .   8
   3.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .   9
     3.1.  AI/ML backend training Data Center network  . . . . . . .   9
     3.2.  AI/ML frontend DC and the Inference network . . . . . . .  12
     3.3.  IP Storage networks with Fab-A/Fab-B path diversity . . .  13
     3.4.  DCI - Data Center Interconnect  . . . . . . . . . . . . .  14
     3.5.  Industrial/factory hybrid DC/Campus networks  . . . . . .  14
   4.  Operational Considerations  . . . . . . . . . . . . . . . . .  15
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  15
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  15
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  15
     7.1.  Normative References  . . . . . . . . . . . . . . . . . .  15
     7.2.  Informative References  . . . . . . . . . . . . . . . . .  16
   Appendix A.  Alternative Solutions  . . . . . . . . . . . . . . .  17
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .  17
   Contributors  . . . . . . . . . . . . . . . . . . . . . . . . .  17
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . .  17

1.  Introduction

   Modern data center (DC) fabrics typically employ Clos topologies with
   External BGP (EBGP) [RFC7938] for plain IPv4/IPv6 routing.  While
   hop-by-hop EBGP routing is simple and scalable, it provides only a
   single best-effort forwarding service for all types of traffic.  This
   single best-effort service might be insufficient for increasingly
   diverse traffic requirements in modern DC environments.  For example,
   loss and latency sensitive AI/ML flows may demand stronger Service
   Level Agreements (SLAs) than general purpose traffic.  Duplication
   schemes which are standardized through protocols such as Parallel
   Redundancy Protocol (PRP) [IEC62439-3] require disjoint forwarding
   paths to avoid single points of failure.  Congestion avoidance may
   require more deterministic forwarding behavior.

   Traditionally, traffic engineering requirements like these can be
   served using technologies like RSVP-TE [RFC3209] or Segment Routing
   [RFC8402] in MPLS networks.  However, according to the reasons stated
   in [RFC7938], modern data centers mostly use IP routing with EBGP as
   their sole routing protocol.  BGP DPF is a lightweight traffic
   engineering alternative designed specifically for the IP Clos fabrics
   with EBGP as the routing protocol.  It partitions the physical fabric
   into multiple logical fabrics by coloring the EBGP sessions running
   on the fabric links.  Routes are also colored so that they are only
   advertised and received over the matching colored EBGP sessions.
   Together, they provide a certain level of deterministic forwarding
   behavior for the flows to satisfy the diverse traffic requirements of
   today's data centers.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  BGP DPF

   BGP DPF uses BGP session coloring and route coloring to direct flows
   to different logical fabrics.

2.1.  BGP Session Coloring

   Figure 1 shows how a physical fabric is partitioned into two logical
   fabrics, the red fabric and the blue fabric.  Leaf1 and Leaf2 can
   communicate using the red fabric via Spine1, or using the blue fabric
   via Spine2.  Links Spine1-Leaf1 and Spine1-Leaf2 belong to the red
   fabric, and links Spine2-Leaf1 and Spine2-Leaf2 belong to the blue
   fabric.
   Instead of coloring the links directly, BGP DPF colors the EBGP
   sessions running on the corresponding links.  The color of an EBGP
   session is configured on both ends separately, using the Color
   Extended Community as defined in Section 4.3 of [RFC9012].

   There are two modes for session coloring, the strict mode and the
   loose mode.  In the strict mode, the EBGP session MUST NOT come to
   Established state unless both ends are configured with the same
   color.  In the loose mode, mismatched colors on both ends of an EBGP
   session SHALL NOT prevent the session from coming up.

                    +---------+           +---------+
                    | Spine 1 |           | Spine 2 |
                    |  (red)  |           |  (blue) |
                    +---------+           +---------+
                         | \                 / |
                         |    \           /    |
                         |   red \     / blue  |
                     red |          /          | blue
                         |       /    \        |
                         |    /           \    |
                         | /                 \ |
                    +---------+           +---------+
                    | Leaf 1  |           | Leaf 2  |
                    +---------+           +---------+

       Figure 1: Divide one physical fabric into two logical fabrics

2.1.1.  Strict Mode

   When running in the strict session coloring mode, a BGP speaker uses
   the Capability Advertisement procedures from [RFC5492] to determine
   whether the color configured locally matches the color configured on
   the remote end.  When a color is configured for an EBGP session
   locally, the BGP speaker sends the SESSION-COLOR capability in the
   OPEN message.  The fields in the Capability Optional Parameter are
   set as follows.  The Capability Code field is set to TBD.  The
   Capability Length field is set to 4.  The Capability Value field is
   set to the 4-octet Color Value of the Color Extended Community, as
   defined in Section 4.3 of [RFC9012].  Note that even though the BGP
   session is colored using a Color Extended Community, the only field
   useful is the Color Value of the Color Extended Community.  The Flags
   field is ignored.  That is why only the 4-octet Color Value is
   included in the SESSION-COLOR Capability.  The SESSION-COLOR
   capability format is shown in Figure 2:

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Cap Code = TBD |Cap Length = 4 |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                        Color Value                            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 2: SESSION-COLOR Capability
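   As a non-normative illustration, encoding and decoding this
   capability could be sketched as follows; the capability code 0xEE is
   a placeholder for the TBD IANA assignment:

```python
import struct

# Placeholder capability code: the real value is TBD (pending IANA
# assignment); 0xEE is used here only for illustration.
SESSION_COLOR_CAP_CODE = 0xEE

def encode_session_color(color_value: int) -> bytes:
    """Build the SESSION-COLOR capability from Figure 2:
    1-octet code, 1-octet length (always 4), 4-octet Color Value."""
    if not 0 <= color_value <= 0xFFFFFFFF:
        raise ValueError("Color Value must fit in 4 octets")
    return struct.pack("!BBI", SESSION_COLOR_CAP_CODE, 4, color_value)

def decode_session_color(data: bytes) -> int:
    """Parse a SESSION-COLOR capability, returning its Color Value."""
    code, length, color = struct.unpack("!BBI", data)
    if code != SESSION_COLOR_CAP_CODE or length != 4:
        raise ValueError("not a SESSION-COLOR capability")
    return color
```

   The length is fixed at 4 because only the Color Value, and not the
   Flags field, of the Color Extended Community is carried.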

   When receiving the OPEN message for an EBGP session, the BGP speaker
   matches the SESSION-COLOR capability against its locally configured
   session color.  Session color is considered as a match for one of the
   following conditions:

   No color on both ends:
      The received OPEN message has no SESSION-COLOR capability and the
      EBGP session is not configured with a color.

   Same color on both ends:
      The received OPEN message has a SESSION-COLOR capability and its
      color is the same as the session color configured locally for the
      EBGP session.

   All other cases MUST be considered as session color mismatch.  When a
   session color mismatch is detected, the BGP speaker MUST reject the
   session by sending a Color Mismatch Notification (code 2, subcode
   TBD) to the peer BGP speaker.
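   The matching rules above reduce to a simple comparison.  The
   following non-normative sketch models the Color Mismatch
   Notification as an exception:

```python
from typing import Optional

class ColorMismatch(Exception):
    """Models the Color Mismatch Notification (code 2, subcode TBD)."""

def check_session_color(local: Optional[int],
                        received: Optional[int]) -> None:
    """Strict-mode check on OPEN receipt.  A match is either no color
    on both ends (both None) or the same color on both ends; every
    other combination is a mismatch and the session is rejected."""
    if local != received:
        raise ColorMismatch(f"local={local} remote={received}")
```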

2.1.2.  Loose Mode

   The strict session coloring mode ensures that an Established EBGP
   session must have matching session colors on both ends.  This helps
   detect color misconfigurations early.  However, exchanging
   session colors through a Capability in BGP OPEN message requires BGP
   session flaps whenever session colors are changed.  To address this
   session flap issue, the loose session coloring mode is introduced.
   When running in the loose session coloring mode, session colors are
   not carried in the BGP OPEN message, so a change of the session
   color does not cause a session flap.  In this case, if the colors
   configured on both ends of the EBGP session mismatch, the routes
   received over the session will only match the color of the remote end
   but mismatch the color of the local end, as described in Section 2.2.
   A route received with mismatched color MUST NOT be accepted.

   [I-D.ietf-idr-dynamic-cap] allows Capabilities to be exchanged
   without flapping the session.  This may allow the loose mode to be
   gradually phased out once dynamic capability support is widely
   deployed.

2.2.  Route Coloring

   Once the EBGP sessions are colored accordingly, the physical fabric
   is partitioned into multiple logical fabrics.  Routes can also be
   colored at the egress leaves to indicate which EBGP sessions (or
   which logical fabrics) they should be advertised over.

2.2.1.  Route Coloring at the Egress Leaf

   There are several ways to color a route at an egress leaf:

   One color:
      When a route is configured with one color at the egress leaf, it
      is advertised over the same colored or uncolored EBGP sessions,
      with the corresponding Color Extended Community attached.  This is
      the easiest way to make use of the logical fabrics.

   One primary color and one backup color:
      When a route is configured with one primary color and one backup
      color at the egress leaf, it is advertised over the EBGP sessions
      of the primary color, with the primary Color Extended Community
      and an AIGP metric [RFC7311] of value zero.  It is also advertised
      over the EBGP sessions of the backup color, with the backup Color
      Extended Community.  In case there are uncolored sessions, the
      route is also advertised over the uncolored sessions, without
      Color Extended Community.  The AIGP metric will help the receiving
      node to identify the primary colored paths.  This allows traffic
      to fall back to the backup logical fabric when the primary logical
      fabric fails.

   One primary color and all-colors as backup colors:
      When a route is configured with one primary color and all-colors
      as backup colors at the egress leaf, it is advertised over the
      EBGP sessions of the primary color, with the primary Color
      Extended Community and an AIGP metric of value 0.  It is also
      advertised over the EBGP sessions of all other colors, with the
      Color Extended Community same as the corresponding session color.
      In case there are uncolored sessions, the route is also advertised
      over the uncolored sessions, without Color Extended Community.
      The AIGP metric will help the receiving nodes to identify the
      primary colored paths.  By specifying all-colors as backup colors,
      traffic can be spread over all remaining logical fabrics when the
      primary fabric fails.  In the single backup color approach,
      traffic from the failed primary logical fabric might congest the
      backup fabric.  By spreading the failed primary logical fabric
      traffic to all backup logical fabrics, the chance of congestion on
      the backup logical fabrics will be significantly reduced.

   All-colors:
      When a route is configured with all-colors at the egress leaf, it
      is advertised over the EBGP sessions with any color, with the
      Color Extended Community same as the corresponding session color.
      In case there are uncolored sessions, the route is also advertised
      over the uncolored sessions, without Color Extended Community.
      This allows the ingress router to map different flows of the route
      to different logical fabrics.

   No color:
      An uncolored route from the egress leaf can be advertised over
      EBGP sessions with any color or no color.  It is advertised
      without Color Extended Community.  Uncolored routes could be
      useful for carrying routing protocol PDUs, which consume little
      bandwidth but need to be sent over all links regardless of the
      logical fabrics.

   Since the AIGP metric is used in the primary/backup color cases, all
   BGP speakers MUST support AIGP when DPF primary/backup protection is
   required.
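   The egress-leaf advertisement rules above can be illustrated with
   the following non-normative sketch; the tuple encoding of the
   coloring modes is purely an assumption made for illustration:

```python
def egress_advertisements(coloring, sessions):
    """Illustrative sketch of the egress-leaf rules in Section 2.2.1.

    coloring is a tuple naming the configured mode:
      ("one", c)          - a single color c
      ("primary", p, b)   - primary color p, single backup color b
      ("primary-all", p)  - primary color p, all other colors as backup
      ("all",)            - all-colors
      ("none",)           - uncolored route
    sessions is an iterable of session colors (None = uncolored).

    Returns {session_color: (color_community, aigp_metric)} for every
    session the route is advertised over; an AIGP metric of 0 marks
    the primary-colored paths.
    """
    mode = coloring[0]
    adverts = {}
    for s in sessions:
        if mode == "one":
            c = coloring[1]
            if s is None or s == c:
                adverts[s] = (c, None)
        elif mode == "none":
            adverts[s] = (None, None)
        elif s is None:
            # Other modes: uncolored sessions get the route without a
            # Color Extended Community attached.
            adverts[s] = (None, None)
        elif mode == "primary":
            p, b = coloring[1], coloring[2]
            if s == p:
                adverts[s] = (p, 0)
            elif s == b:
                adverts[s] = (b, None)
        elif mode == "primary-all":
            adverts[s] = (s, 0 if s == coloring[1] else None)
        elif mode == "all":
            adverts[s] = (s, None)
    return adverts
```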

2.2.2.  Color Matching at the Spine and Super Spine

   At the transit nodes (Spines or Super Spines), the Color Extended
   Community of the route is used to match against the EBGP session
   color to decide whether the route should be advertised over the
   session:

   Advertising over an uncolored EBGP session:  If the session is
      uncolored, the route is re-advertised following the existing route
      advertisement rules defined in [RFC4271].

   Advertising over a colored BGP session:  If the active route has no
      Color Extended Community or a Color Extended Community which is
      the same as the session color, then the active route is advertised
      over the session.  If the active route has a Color Extended
      Community mismatching the session color, then check whether there
      is an inactive route with a Color Extended Community matching the
      session color.  If yes, advertise the active route to the session,
      except that the AIGP attribute (if any) MUST be stripped and the
      Color Extended Community MUST be replaced with the session's Color
      Extended Community.  Otherwise, don't advertise the route.

      Matching the session color against the inactive routes is
      necessary because a backup route needs to be re-advertised to the
      backup fabric.  So, when a packet arrives from the backup fabric,
      it is forwarded over the primary fabric to the destination, unless
      the primary fabric is down.
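   The transit-node decision procedure above can be illustrated with
   the following non-normative sketch, where routes are modeled as
   dictionaries with optional 'color' and 'aigp' keys (an assumption
   made for illustration):

```python
def advertise_over_session(session_color, active, inactive):
    """Sketch of the transit-node rules in Section 2.2.2.  Returns the
    route to advertise over the session (possibly rewritten) or None.
    """
    if session_color is None:
        # Uncolored session: ordinary RFC 4271 re-advertisement.
        return active
    color = active.get("color")
    if color is None or color == session_color:
        return active
    # The active route's color mismatches the session: advertise it
    # anyway if some inactive route carries the session's color, with
    # the AIGP attribute stripped and the color rewritten to the
    # session's color.
    if any(r.get("color") == session_color for r in inactive):
        rewritten = dict(active)
        rewritten.pop("aigp", None)
        rewritten["color"] = session_color
        return rewritten
    return None
```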

2.2.3.  Flow Mapping at the Ingress Leaf

   At the ingress leaf, flows can be mapped to different logical fabrics
   based on the route coloring approaches from the egress leaf:

   One color:  When a route is configured with one color at the egress
      leaf, the ingress leaf will receive the route from the EBGP
      session(s) with that color only.  Flows towards this destination
      will be mapped to the logical fabric of this color only.

   One primary color and one backup color:  When a route is configured
      with one color as primary color and one color as backup color at
      the egress leaf, the ingress leaf will receive the route from EBGP
      sessions of both the primary color and the backup color.  The
      routes received from the primary color sessions will be preferred
      due to AIGP.  The routes received from the backup color sessions
      can be used as the backup paths.  Flows towards this destination
      will be mapped to the primary logical fabric.  In case the primary
      logical fabric fails, flows towards this destination will be
      mapped to the backup logical fabrics.  Note that fallback to the
      backup logical fabric could happen at the ingress leaf as well as
      the spines and super spines.

   One primary color and all-colors as backup color:  When a route is
      configured with one color as primary color and all-colors as
      backup color at the egress leaf, the ingress leaf will receive the
      route from EBGP sessions of all colors.  The routes received from
      the primary color sessions will be preferred due to AIGP.  The
      routes received from all other colored sessions can be used as
      backup paths.  Flows towards this destination will be mapped to
      the primary logical fabric.  In case the primary logical fabric
      fails, flows towards this destination will be mapped to all backup
      logical fabrics.  Note fallback to backup logical fabrics could
      happen at the ingress leaf as well as the spines and super spines.

   All colors:  When a route is configured with all-colors at the egress
      leaf, the ingress leaf will receive the route from EBGP sessions
      of all colors.  The routes from all sessions can be used to forward
      traffic.  The ingress leaf can map flows towards this destination
      to routes with different Color Extended Communities, using
      mechanisms such as the Access Control List (ACL) filter.  The
      details of mapping different flows to different routes of the same
      destination is out of the scope of this document.
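   The AIGP-based preference described above can be illustrated with
   the following non-normative sketch, again modeling routes as
   dictionaries with optional 'aigp' and 'color' keys (an assumption
   made for illustration); routes carrying the AIGP attribute are
   preferred over routes without it, consistent with [RFC7311]:

```python
def select_paths(routes):
    """Sketch of ingress-leaf path preference: routes carrying the
    AIGP attribute (value 0 on the primary-colored paths) are
    preferred over routes without it; the non-preferred routes remain
    available as backups."""
    def rank(r):
        aigp = r.get("aigp")
        # (0, aigp) sorts before (1, _): AIGP-bearing routes win.
        return (0, aigp) if aigp is not None else (1, 0)
    best = min(rank(r) for r in routes)
    primary = [r for r in routes if rank(r) == best]
    backup = [r for r in routes if rank(r) != best]
    return primary, backup
```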

   Apart from mapping IP flows as described above, the ingress leaf
   could also map VPN flows, such as EVPN-VXLAN flows, to different
   logical fabrics.  For example, the egress leaf can advertise multiple
   VXLAN tunnel endpoint routes, each with its own color.  When a VXLAN
   tunnel endpoint is chosen for a MAC VRF at the ingress leaf, flows of
   that MAC VRF will be mapped to the logical fabric corresponding to
   the color of the tunnel endpoint route.

3.  Use Cases

   The most common use cases for BGP-DPF are:

   *  AI/ML backend training DC networks

   *  AI/ML frontend DC Inference networks

   *  IP Storage networks

   *  DCI - Data Center Interconnect

   *  Industrial hybrid DC/Campus networks

3.1.  AI/ML backend training Data Center network

   In the context of the AI/ML data centers (DC), especially where the
   training of LLM (Large Language Models) is the primary goal, there
   can be some challenges with traditional IP ECMP packet spraying,
   such as out-of-order packet delivery caused by the way load
   balancing is performed, or difficulty in maintaining consistent
   performance across different phases of job execution.  AI/ML
   training in a data center refers to the process of utilizing large-
   scale computing infrastructure to train machine learning models on
   massive datasets.  This process can take weeks or sometimes months
   for larger models.  LLM training is taking place in DCs with GPU-
   enabled servers interconnected in the Rail Optimized Design within
   the IP Clos scale-out fabrics.  In such architectures, every GPU of
   the server is linked to a 400G/800G NIC card, which connects to a
   different ToR (Top of Rack) leaf Ethernet switch node.  The typical
   AI training server uses eight GPUs, so each server requires eight NIC
   cards, each connecting to a different ToR.  A typical Rail is based
   on eight 400G/800G/1.6Tbps switches, and rail-to-rail communication
   between strips is achieved through multiple spine nodes (typically 32
   or more).

   The transport used by the GPU servers between the rails or within the
   rail is either based on ROCEv2, or UEC transport (UET) in the future.
   The number of these flows per GPU/NIC is sometimes limited.  A single
   ROCEv2 flow can utilize a massive bandwidth, and the characteristics
   of the flows may have very low entropy - the same source and
   destination UDP ports are used by the ROCEv2 transport between the
   GPU servers during the given Job-ID.  This may lead to short-term
   congestion at the spines, triggering the DCQCN reactive congestion
   control in the AI/DC fabric, with the PFC (Priority Flow Control) and
   ECN (Explicit Congestion Notification) mechanisms activated to
   prevent frame loss.  Consequently, these mechanisms slow down the AI/
   ML session by temporarily reducing the rate at the source GPU server
   and extending the time needed to complete the given Job-ID.  If
   congestion persists, frame loss may also occur, and the given Job-ID
   may need to be restarted to be synced across all GPUs participating
   in the collective communication.  With packet spraying techniques or
   flow-based Dynamic Load Balancing, this is a less common situation in
   a well-designed Ethernet/IP fabric, but the GPU servers' NIC cards
   must support the Out Of Order delivery.  Additionally, it may still
   reduce performance or cause instability between Job-IDs or between
   tenants connected to the same AI/DC fabric.

   This is where deterministic path pinning-based load balancing of
   flows can be applied, and where the BGP-DPF can be utilized to color
   the paths of a given tenant or a specific AI/ML workload, controlling
   how these paths are used.  When the given ROCEv2 traffic is
   identified through the destination QPAIR in the BTH header at the ToR
   Ethernet switch, it can be allocated to a specific DPF color ID using
   ingress enforcement rules or TCAM flow awareness at the ASIC level.
   The AI/ML flows can be load-balanced across different DPF fabric
   color IDs and remain on the specified fabric color for the duration
   of the AI/ML Job. Thanks to that, not only does the given AI workload
   get a dedicated fabric color ID, but it also becomes isolated from
   the other AI workloads, which offers more predictable performance
   results (consistent tail latency and same Job Completion Time (JCT))
   when compared to packet spraying based load balancing across all of
   the IP ECMP paths.

   In this case, the probability of encountering congestion is also
   lower, as the given workload is assigned a dedicated path and is not
   competing with other AI workloads.  Pinning the AI workload to a
   specific path also means that there will be no packet reordering at
   the destination/target server, as the ROCEv2/UET packets follow the
   same path from the beginning to the end of the given session.

   The Rail Optimized Design shown in Figure 3 may also run two LLM
   training sessions simultaneously from two different tenants.  This is
   also where IP path diversity of the DPF comes into play - by simply
   coloring the two workloads from the two LLMs, we can forward them
   across a different set of spine switches.

                     +-----------+
                +----|GPU-server1|---+
                |    +-----------+   |
                |       |            |
                |       |            |
                |       |            |
          +-----+-------+------------+-----rail1
          |  +--+---+ +-+----+      ++-----+  |
          |  +leaf1-+ +leaf2-+ .... +leaf8-+  |
          +---+----+-------------+--------+---+
    +---------+    |             |        +-----------+
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
   Fab-A          Fab-A         Fab-B                Fab-B
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
   ++--------+   +-+-------+   +-+-------+   +--------++
   |spine1   |...|spine16  |   |spine17  |...|spine32  |
   ++--------+   +-+-------+   +-+-------+   +--------++
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
   Fab-A          Fab-A         Fab-B               Fab-B
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
    |              |             |                    |
    +-------+      |             |         +----------+
           ++------+-------------+---------+---+
           |  +------+ +------+      +------+  |
           |  +leaf9-+ +leaf10+ .... +leaf16+  |
           +---+---------+--------------+---rail2
               |         |              |
               |         |              |
               |         |              |
               |    +----+------+       |
               +----|GPU-server2+-------+
                    +-----------+

            Figure 3: AI/ML backend training Data Center network

   For example, 16 spines are allocated to the LLM-A training, and the
   other 16 spines are mapped to the LLM-B.  Within each group of
   colored spines, IP ECMP with Dynamic Load Balancing can still operate
   on a per-flow or per-packet basis.  Each tenant LLM with this
   approach receives half of the fabric's capacity, and if required,
   this share can be reduced or increased.  The fabric colors fab-A and
   fab-B can also be allocated to the tenants
   enabled with EVPN-VXLAN overlays.

   In summary, using BGP-DPF in backend DC network could achieve:

   *  Predictable and more efficient load balancing of the AI/ML
      workloads with the path pinning (for example, the ROCEv2 Op Code-
      based pinning or the destination ROCEv2 QPAIR-based path pinning
      in case of the ROCEv2 traffic)

   *  Isolation of the tenants inside the larger-scale AI/ML IP Clos
      fabric

   *  Consistent performance and faster AI workload ramp time

   *  Eliminated or greatly reduced reliance on PFC/ECN in the lossless
      fabric

3.2.  AI/ML frontend DC and the Inference network

   In the context of an AI/ML data center, an inference network refers
   to the computing infrastructure and networking components optimized
   for running already trained machine learning models (inference) at
   scale.  Its primary purpose is to deliver low-latency, high-
   throughput predictions for both real-time and batch workloads.
   ChatGPT is an example of a large-scale inference application
   deployed in a data center environment that operates on real-time
   data, while employing a generative AI model, such as GPT, which has
   been trained for several weeks in the training domain, as explained
   in Section 3.1 above.

   This matters because cloud or service providers will, in many cases,
   run inference for multiple customers in parallel.  Multi-tenancy is
   likely to be used at the network
   level - for example, utilizing EVPN-VXLAN-based tenant isolation in
   the leaf/spine/super-spine IP Clos fabric, or using MAC-VRFs or Pure
   RT5 IPVPN.  In such cases, many inference applications can be enabled
   simultaneously within the same physical fabric.  In some cases, the
   tenant/customer may request to be fully isolated from the other
   tenants, not only from a control plane perspective but also from a
   data plane perspective when forwarding traffic between the two ToR
   switches.

   For example, tenant-A and tenant-B may each be allocated a different
   RT5 EVPN-VXLAN instance, with the two instances mapped to two
   different BGP-DPF color-ids.  With this approach, the overlays of
   tenant A and tenant B never overlap and utilize different fabric
   spines.  As a result, latency, which is critical for inference
   applications, becomes more predictable because the fabric paths of
   the two tenants are distinct, and each overlay is more tightly
   correlated with its underlay path.  In some cases, with an explicit
   backup color-id defined at the BGP-DPF level, fast convergence
   becomes an additional benefit for the frontend EVPN-VXLAN fabrics.
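   The tenant-to-color binding, including the explicit backup color,
   can be sketched as a simple table.  The VNI keys and color-id names
   below are hypothetical placeholders, not values defined by the
   draft:

```python
# Hypothetical sketch: bind each RT5 EVPN-VXLAN instance (keyed here by
# an assumed VNI) to a primary BGP-DPF color-id, with an explicit backup
# color-id used for fast convergence if the primary becomes unavailable.

TENANT_COLORS = {
    10010: {"primary": "color-100", "backup": "color-200"},  # tenant-A
    10020: {"primary": "color-200", "backup": "color-100"},  # tenant-B
}

def select_color(vni, primary_up=True):
    entry = TENANT_COLORS[vni]
    return entry["primary"] if primary_up else entry["backup"]
```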

3.3.  IP Storage networks with Fab-A/Fab-B path diversity

   In the context of the DC, storage networks are a key component of
   the infrastructure, providing servers with scalable block or object
   storage systems.  For block storage such as NVMe-oF (using NVMe/RDMA
   or NVMe/TCP), the Fab-A/Fab-B design is often used, where Fabric-A
   serves as the primary path and Fabric-B as the backup path for read
   and write operations on the remote storage arrays.  A given server
   inside the DC typically has dedicated storage NICs.  For redundancy,
   two NIC ports are generally used - one connected to Fab-A and the
   other to Fab-B.  As with traditional storage such as Fibre Channel
   (FC), the recommended approach is to ensure that the dedicated
   storage fabric supports complete path isolation, so that in case of
   failure at least one of the two fabrics remains available.

   This is also where BGP DPF can help, by explicitly defining the IP
   storage paths for Fab-A and Fab-B.  Besides storage redundancy,
   capacity planning is also essential here: after a failover from A to
   B, the same read and write capacity is offered to all IP Fabric-
   connected servers.  Fab A/B offers 100% capacity in the event of
   failure, while all operations are managed at the logical level using
   BGP DPF.
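   The Fab-A/Fab-B failover behavior described above can be captured in
   a minimal sketch.  The function and fabric names are assumptions
   used only to illustrate the primary/backup selection, under which
   the surviving fabric carries 100% of the read/write capacity:

```python
# Sketch (assumed names, not from the draft): select the active storage
# fabric.  Fab-A is the primary path; on its failure, Fab-B takes over
# all read/write operations at full capacity.

def active_fabric(fab_a_up, fab_b_up):
    if fab_a_up:
        return "Fab-A"          # primary path
    if fab_b_up:
        return "Fab-B"          # backup offers 100% capacity
    raise RuntimeError("both storage fabrics down")
```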


3.4.  DCI - Data Center Interconnect

   For critical applications, disaster recovery plans usually require a
   second availability zone for redundancy and resilience.  Typical
   concepts foresee replicating persistent storage data, running the
   same application in parallel in a backup location, or load balancing
   across multiple DCs.

   When replicating data or synchronizing application state between two
   sites, it is sometimes also necessary to isolate the paths across
   long-distance connectivity.  If the connection between DC1 and DC2
   uses a full or partial mesh of links, and the DCI solution uses
   EVPN-VXLAN or pure IP connections, some workloads may require more
   deterministic communication, obtained by correlating the underlay
   and overlay when both use BGP as the IP routing protocol.  One path
   may offer better latency and jitter than the other between the two
   remote locations, so the admin may decide to pin an EVPN-VXLAN
   instance (MAC-VRF and/or RT5 IPVPN) to a carefully selected underlay
   path over the dark fiber connection.  In this use case, we assume
   the DCI underlay uses IP EBGP [RFC7938], with some links colored
   using BGP-DPF.  EVPN-VXLAN to EVPN-VXLAN tunnel stitching can be
   used, with the DCI underlay links colored by BGP-DPF as red and blue
   paths.  Different MAC-VRFs and RT5 instances are then assigned to
   different DPF colors to control the forwarding of workloads between
   the two DC locations.

   The outcome of this use case is that the DCI admin can anticipate
   failovers and allocate EVPN-VXLAN-connected workloads based on the
   capacity and performance (including latency and jitter) of the DCI
   links.

3.5.  Industrial/factory hybrid DC/Campus networks

   Industrial and factory automation is increasingly adopting
   distributed computing concepts to leverage the benefits of
   virtualization and containerization.  This change often comes with a
   shift of applications into a remote DC, which imposes stringent
   requirements on the networking infrastructure between the DC and the
   respective process.  These hybrid DC/campus networks require a high
   level of resiliency against failures, as certain applications
   tolerate zero frame loss.  Duplication schemes like PRP [IEC62439-3]
   are leveraged in these scenarios to provide zero loss in the face of
   failures, but they require disjoint paths to avoid any single point
   of failure.

   When the campus and DC fabrics utilize modern solutions such as
   EVPN-VXLAN overlays, IP ECMP from leaf to spine is frequently
   employed.  This might lead to PRP duplicates being forwarded across

   the same spine, bringing processes to a standstill in case of spine
   maintenance or a physical failure.  This is where a BGP-DPF-based
   underlay network can guarantee that the EVPN-VXLAN overlays are
   always forwarded over their predefined nominal and backup paths,
   providing disjoint paths across the fabric.  The primary and backup
   paths taken by PRP frames are well-defined, enabling fault-tolerant
   communication, e.g., between robots on the shop floor and control
   applications running in a distributed environment in the DC.  With
   PRP frames destined for LAN A and LAN B being sent through EVPN-
   VXLAN MAC-VRF-A and MAC-VRF-B over the diverse paths DPF color-A and
   DPF color-B, critical communication flows are controlled in terms of
   forwarding and recovery, providing the deterministic behavior they
   require.
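   The PRP duplication over disjoint colors can be sketched as follows.
   The table keys and helper are assumptions for illustration: each
   duplicate of a frame is steered into a different MAC-VRF bound to a
   different DPF color, so the two copies never share a spine:

```python
# Illustrative sketch: a PRP frame is duplicated toward LAN A and LAN B,
# each copy steered into a distinct MAC-VRF bound to a disjoint DPF
# color (names are assumed placeholders, not defined by the draft).

PRP_PATHS = {
    "LAN-A": {"mac_vrf": "MAC-VRF-A", "dpf_color": "color-A"},
    "LAN-B": {"mac_vrf": "MAC-VRF-B", "dpf_color": "color-B"},
}

def forward_duplicates(frame):
    return [(lan, p["mac_vrf"], p["dpf_color"], frame)
            for lan, p in PRP_PATHS.items()]

copies = forward_duplicates("frame-1")
```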

4.  Operational Considerations

   When routes are colored with both primary and backup colors at the
   egress leaf, the network must be strictly staged to avoid potential
   routing and forwarding loops.  A strictly staged network ensures
   that a packet always moves to the next stage and never comes back.
   In a Clos topology with EBGP, staged routing is guaranteed by
   configuring the same AS number on all spines and super-spines of the
   same stage, while only the leaves have unique AS numbers.

5.  IANA Considerations

   A new BGP Capability will be requested from the "Capability Codes"
   registry within the "IETF Review" range [RFC5492].

   A new OPEN Message Error subcode named "Color mismatch" will be
   requested from the "OPEN Message Error subcodes" registry.

6.  Security Considerations

   An attacker modifying the Color Extended Community of a BGP UPDATE
   message could cause routes to be advertised to unintended logical
   fabrics.  This could lead to failed or suboptimal routing.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.


   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC5492]  Scudder, J. and R. Chandra, "Capabilities Advertisement
              with BGP-4", RFC 5492, DOI 10.17487/RFC5492, February
              2009, <https://www.rfc-editor.org/info/rfc5492>.

   [RFC9012]  Patel, K., Van de Velde, G., Sangli, S., and J. Scudder,
              "The BGP Tunnel Encapsulation Attribute", RFC 9012,
              DOI 10.17487/RFC9012, April 2021,
              <https://www.rfc-editor.org/info/rfc9012>.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <https://www.rfc-editor.org/info/rfc4271>.

   [RFC7311]  Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
              "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
              DOI 10.17487/RFC7311, August 2014,
              <https://www.rfc-editor.org/info/rfc7311>.

7.2.  Informative References

   [RFC3209]  Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V.,
              and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP
              Tunnels", RFC 3209, DOI 10.17487/RFC3209, December 2001,
              <https://www.rfc-editor.org/info/rfc3209>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018, <https://www.rfc-editor.org/info/rfc8402>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

   [I-D.ietf-idr-dynamic-cap]
              Chen, E. and S. R. Sangli, "Dynamic Capability for BGP-4",
              Work in Progress, Internet-Draft, draft-ietf-idr-dynamic-
              cap-17, 6 July 2025,
              <https://datatracker.ietf.org/doc/html/draft-ietf-idr-
              dynamic-cap-17>.


   [IEC62439-3]
              International Electrotechnical Commission, "Industrial
              communication networks – High availability automation
              networks – Part 3: Parallel Redundancy Protocol (PRP) and
              High-availability Seamless Redundancy (HSR)",
              IEC 62439-3:2016, 2016.

Appendix A.  Alternative Solutions

   An alternative way to achieve part of the BGP DPF functionality is
   to use BGP export and import policies.  Instead of coloring the EBGP
   sessions and routes, one could use export policies to specify the
   session(s) on which a route should be advertised.  On the receiving
   side, one could also use import policies to ensure a route is only
   accepted from certain EBGP sessions.  This alternative approach was
   not chosen for the following reasons:

   *  The policy configurations have to be done on each node and might
      need to change when new routes are added.

   *  Policy configurations are less intuitive than session coloring
      and could be prone to configuration mistakes.

   *  Certain DPF functionalities, such as the primary and backup
      logical fabrics, might not be achievable using commonly available
      policy mechanisms.

Acknowledgements

   TBD.

Contributors

   Jeffrey Haas
   HPE
   Email: jeffrey.haas@hpe.com

Authors' Addresses

   Kevin Wang
   HPE
   Email: kevin.wang@hpe.com

   Michal Styszynski
   HPE
   Email: mlstyszynski@juniper.net


   Wen Lin
   HPE
   Email: wen.lin@hpe.com

   Mahesh Subramaniam
   HPE
   Email: mahesh-kumar.subramaniam@hpe.com

   Thomas Kampa
   Audi
   Email: thomas.kampa@audi.de

   Diptanshu Singh
   Oracle Cloud Infrastructure
   Email: diptanshu.singh@oracle.com
