Multipath Traffic Engineering
draft-kompella-teas-mpte-00
This document is an Internet-Draft (I-D).
Anyone may submit an I-D to the IETF.
This I-D is not endorsed by the IETF and has no formal standing in the
IETF standards process.
The information below is for an old version of the document.
TEAS WG K. Kompella
Internet-Draft Juniper Networks
Intended status: Standards Track L. Jalil
Expires: 4 September 2025 Verizon
M. Khaddam
Cox Communications
A. Smith
Oracle Cloud Infrastructure
3 March 2025
Multipath Traffic Engineering
draft-kompella-teas-mpte-00
Abstract
Shortest path routing offers an easy-to-understand, easy-to-implement
method of establishing loop-free connectivity in a network, but
offers few other features. Equal-cost multipath (ECMP), a simple
extension, uses multiple equal-cost paths between any two points in a
network: at any node in a path (really, Directed Acyclic Graph),
traffic can be (typically equally) load-balanced among the next hops.
ECMP is easy to add on to shortest path routing, and offers a few
more features, such as resiliency and load distribution, but the
feature set is still quite limited.
Traffic Engineering (TE), on the other hand, offers a very rich
toolkit for managing traffic flows and the paths they take in a
network. A TE network can have link attributes such as bandwidth,
colors, risk groups and alternate metrics. A TE path can use these
attributes to include or avoid certain links, increase path
diversity, manage bandwidth reservations, improve service experience,
and offer protection paths. However, TE typically doesn't offer
multipathing as the tunnels used to implement TE usually take a
single path.
This memo proposes multipath traffic-engineering (MPTE), combining
the best of ECMP and TE. The multipathing proposed here need not be
strictly equal-cost, nor the load balancing equally weighted to each
next hop. Moreover, the desired destination may be reachable via
multiple egresses. The proposal includes a protocol for signaling
MPTE paths using various types of tunnels, some of which are better
suited to multipathing.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Kompella, et al. Expires 4 September 2025 [Page 1]
Internet-Draft MPTE March 2025
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 4 September 2025.
Copyright Notice
Copyright (c) 2025 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1. Definition of Commonly Used Terms . . . . . . . . . . 4
2. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Multipathing . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1. ECMP (slack 0) from node 0 to node 5 . . . . . . . . 6
2.1.2. nECMP from node 0 to node 5 with slack 10 . . . . . . 7
2.1.3. Multipathing from node 0 to egresses {5, 8} . . . . . 7
2.1.4. MPTED from ingresses {0, 1} to egresses {5, 8} . . . 7
2.2. Load balancing . . . . . . . . . . . . . . . . . . . . . 7
2.2.1. Flow-aware load balancing . . . . . . . . . . . . . . 8
2.2.2. Per-packet load balancing . . . . . . . . . . . . . . 8
2.3. Constraints . . . . . . . . . . . . . . . . . . . . . . . 9
2.4. Protection . . . . . . . . . . . . . . . . . . . . . . . 9
2.5. Tunnels . . . . . . . . . . . . . . . . . . . . . . . . . 10
3. Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1. MPTED . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2. Signaling overview . . . . . . . . . . . . . . . . . . . 13
3.3. Forwarding state . . . . . . . . . . . . . . . . . . . . 14
4. Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1. Message IDs . . . . . . . . . . . . . . . . . . . . . . . 14
4.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.1. MSGHDR . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.2. REFHDR . . . . . . . . . . . . . . . . . . . . . . . 15
4.3. Message types . . . . . . . . . . . . . . . . . . . . . . 15
4.3.1. OPEN . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3.2. HELLO . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3.3. JUNCTION . . . . . . . . . . . . . . . . . . . . . . 17
4.3.4. LABEL . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.5. NOTIFICATION . . . . . . . . . . . . . . . . . . . . 19
5. MPTEP Reflector . . . . . . . . . . . . . . . . . . . . . . . 19
6. Graceful Restart . . . . . . . . . . . . . . . . . . . . . . 20
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20
8. Security Considerations . . . . . . . . . . . . . . . . . . . 20
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 20
9.1. Normative References . . . . . . . . . . . . . . . . . . 20
9.2. Informative References . . . . . . . . . . . . . . . . . 21
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22
1. Introduction
Operators managing traffic within their networks have several tools,
among them:
1. Equal-cost Multipath (ECMP): balance traffic along multiple
paths. This yields some resilience and some traffic management,
as traffic can be load-balanced across multiple paths. To use
ECMP effectively, one may have to adjust link metrics to allow
multiple paths to have the same overall distance.
2. Traffic Engineering (TE): state constraints for a path from an
ingress router to an egress router, and let a path computation
engine compute it. This gives much greater control over the
nodes and links traversed, but is usually limited to finding a
single path from ingress to egress [RFC2702].
3. Multi-egress: allow traffic from an ingress router to a
destination dst to use several egress routers, all of which have
routes to that destination. dst may be an Internet prefix
[RFC4271], a VPN prefix [RFC4364], an EVPN address [RFC7432], a
VPLS site [RFC4761], [RFC4762] or some other service destination.
For BGP-signaled destinations, this requires that the BGP tie-
breaking algorithm yield multiple results (rather than a single
one), all of which become candidates for egress.
[RFC2702] describes requirements for MPLS-based TE, and is thus
relevant to this memo. At the same time, its authors appear to
believe that one can have either TE or multipathing, but not both.
This is further emphasized by the notion of a Label Switched Path,
which is used to implement MPLS-based TE. RSVP-TE ([RFC3209]), the
protocol designed to meet the requirements of [RFC2702], builds a
single path from one ingress to one egress (for unicast traffic).
In order to satisfy the constraints, TE often uses non-shortest
paths. To do so without looping packets, a tunnel is used. Such
tunnels have to be signaled. RSVP-TE is a signaling protocol for
MPLS-based tunnels.
In this memo, we introduce a new tool: multipath TE (MPTE). This
allows an operator to specify constraints for paths (as in TE),
specify multiple egresses, and use multiple paths to each egress.
Effectively, MPTE combines the advantages of the three tools above.
The resulting set of paths from an ingress to egresses is a Directed
Acyclic Graph (DAG), here called an MPTE DAG or MPTED. Finally, this
memo allows the use of multiple types of tunnels. The main
contribution of this memo is a protocol for signaling a (multipath)
unicast tunnel across an MPTED.
1.1. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
1.1.1. Definition of Commonly Used Terms
This section provides definitions for terms and abbreviations that
have a specific meaning to the MPTE protocol and that are used
throughout this memo.
constraints: desired properties of paths between ingresses and
egresses.
directed acyclic graph: a directed graph that has no cycles.
directed graph: a set of nodes and directed links. A network is
represented by a directed graph.
egress: an end node of an MPTE DAG.
ingress: a starting node of an MPTE DAG.
link: A (directed) edge between two nodes. A pair of nodes may have
0 or more links between them. A link between nodes u and v will
be denoted by (u, v, i), where i is u's oif for the link. A link
may have associated attributes, in particular, a metric.
metric: a link attribute denoted by met(u, v, i), a positive number.
node: a vertex of a graph. A node may have associated attributes.
outgoing interface: a unique number (oif) assigned by a node for
each outgoing link it has.
path length: the sum of the metrics of the links that constitute
path p, denoted by len(p)
shortest path: a path between a pair of nodes u, v with minimum
length. The set of shortest paths between u and v is a DAG,
denoted by sp(u, v). The length of a shortest path from u to v is
denoted by min(u, v)
slack: a path p from u to v is acceptable with slack s if len(p) <=
min(u, v) + s.
traffic trunk: a unidirectional aggregate of traffic flows from an
ingress to a set of egresses that is treated identically in the
forwarding plane.
The following abbreviations are used in this memo:
CSPF: constrained shortest path first. A modification to SPF to
take into account path constraints.
DAG: directed acyclic graph. The result of a multipath SPF or CSPF
computation is a DAG.
ECMP: equal-cost multipath.
FALB: flow-aware load balancing
LB: load balancing
LSP: label-switched path
MC: MPTED computer: the entity computing the MPTED, typically the
ingress (if there is a single ingress) or a Path Computation
Element
MPTE: multipath TE with path constraints (including a slack) using
nECMP paths from an ingress to one or more egresses.
MPTED: an MPTE DAG resulting from CSPF-type computation on MPTE
constraints.
MPTEP: MPTE protocol: the protocol used to signal MPTEDs.
nECMP: non-equal-cost multipath; generally qualified by "with slack
s", meaning within slack s of the minimum path length.
oif: outgoing interface index
PCE: Path Computation Element
SPF: shortest path first. Typically refers to Dijkstra's algorithm
for computing shortest paths between a given pair of nodes, or
pairwise between all nodes.
SRG: shared risk group -- nodes and/or links that share "risk"
(e.g., have common power, or use a common fiber conduit)
TE: traffic engineering
2. Overview
Consider Figure 1:
2 == 3 Link Metrics (symm): 0-2: 100; 0-4: 200; 0-6: 110
r/ r\ r\\ 1-2 (not shown): 110; 1-4 (not sh): 100; 1-6: 100
0 -- 4 -- 5 2-3: (100, 100); 2-4: 100; 3-5: (100, 110)
\ / \ / \ 4-5: 100; 4-6: 110; 4-7: 50
1 - 6 = 7 -- 8 5-7: 100; 5-8: 10; 6-7: (100, 110); 7-8: 50
r Node pairs 2-3, 3-5 and 6-7 each have two links.
Links marked with 'r' have color red.
Figure 1: Network 1
2.1. Multipathing
2.1.1. ECMP (slack 0) from node 0 to node 5
There are 4 ECMP paths from node 0 to node 5:
1. 0-2=3-5 (2 paths)
2. 0-2-4-5
3. 0-4-5
These 4 distinct paths all have length 300.
2.1.2. nECMP from node 0 to node 5 with slack 10
There are 7 nECMP paths with slack 10 to node 5:
1. 0-2=3=5 (4 paths)
2. 0-2-4-5
3. 0-4-5
4. 0-6-7-5
These 7 paths have lengths 300 or 310. Thus, allowing nECMP paths a
slack of 10 has yielded 3 additional paths, which provide increased
diversity and load balancing, and possibly decreased congestion.
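The counts in Sections 2.1.1 and 2.1.2 can be checked with a small sketch. This is an illustration only, not an algorithm specified by this memo. It assumes the symmetric link metrics of Figure 1 restricted to nodes 0 and 2 through 7 (nodes 1 and 8 and their links are left out), and the names LINKS, build_adjacency, shortest_length and count_paths are hypothetical.

```python
import heapq

# Symmetric links among nodes 0 and 2..7 of Figure 1 (nodes 1 and 8 omitted,
# an assumption of this sketch). Each entry is (u, v, metric); parallel links
# appear as separate entries.
LINKS = [
    (0, 2, 100), (0, 4, 200), (0, 6, 110),
    (2, 3, 100), (2, 3, 100), (2, 4, 100),
    (3, 5, 100), (3, 5, 110),
    (4, 5, 100), (4, 6, 110), (4, 7, 50),
    (5, 7, 100),
    (6, 7, 100), (6, 7, 110),
]


def build_adjacency(links):
    """Return {node: [(neighbor, metric), ...]}, adding both directions."""
    adj = {}
    for u, v, met in links:
        adj.setdefault(u, []).append((v, met))
        adj.setdefault(v, []).append((u, met))
    return adj


def shortest_length(adj, src, dst):
    """Dijkstra: length of a shortest path from src to dst, i.e. min(src, dst)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, met in adj[u]:
            nd = d + met
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    raise ValueError("dst is unreachable from src")


def count_paths(adj, src, dst, slack):
    """Count loop-free paths p with len(p) <= min(src, dst) + slack."""
    bound = shortest_length(adj, src, dst) + slack

    def dfs(u, length, visited):
        if u == dst:
            return 1
        total = 0
        for v, met in adj[u]:
            # Metrics are positive, so a prefix over the bound can be pruned.
            if v not in visited and length + met <= bound:
                total += dfs(v, length + met, visited | {v})
        return total

    return dfs(src, 0, {src})


ADJ = build_adjacency(LINKS)
ecmp_paths = count_paths(ADJ, 0, 5, slack=0)
necmp_paths = count_paths(ADJ, 0, 5, slack=10)
```

On this subset of the topology, ecmp_paths is 4 and necmp_paths is 7, which agrees with the two lists above. The sketch only counts paths; an MC would additionally merge them into a DAG.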
2.1.3. Multipathing from node 0 to egresses {5, 8}
If, for some traffic trunk that starts at node 0, nodes 5 and 8 are
equally good as egresses, then one can compute an ECMP DAG from 0 to
{5, 8}; this yields 4 paths to 5 and 6 paths to 8, for a total of 10
paths this traffic trunk can take. Similarly, an nECMP DAG to {5, 8}
with slack 10 has 15 paths, whereas one with slack 5 has the same 11
paths as with slack 0.
2.1.4. MPTED from ingresses {0, 1} to egresses {5, 8}
If traffic from node 0 to nodes {5, 8} and from node 1 to nodes {5,
8} have common characteristics, it may make sense to compute a single
DAG from {0, 1} to {5, 8}. Doing so allows the operator to view this
entire DAG as one logical entity; a nice side benefit is reduced
control and data plane state due to state sharing.
2.2. Load balancing
Nodes in a network have a Forwarding Information Base (FIB). A FIB
maps a packet's destination address da to one or more "next hops".
When a packet with address da arrives at a node n, n sends the packet
to one of the next hops. n typically will distribute packets in a
given ratio among the next hops. This is load balancing.
The main goal of ECMP/nECMP is to supply as many nodes as possible in
the MPTED with multiple next hops on which to forward the traffic
trunk. At such nodes, traffic belonging to the trunk can be
distributed among the next hops instead of going to a single next
hop. This has the potential to reduce congestion and provide better
utilization of available links.
2.2.1. Flow-aware load balancing
When load balancing packets from a traffic trunk, it is often
required that packets from a given flow be sent to the same next hop.
This improves the probability of in-order delivery of packets in that
flow, which is important for certain types of traffic. This is
called flow-aware load balancing (FALB). The most common flow in IP
traffic is defined by a 5-tuple consisting of the source IP address,
the destination IP address, the protocol, the source port and the
destination port. A 16- or 20-bit hash of this 5-tuple is called the
packet's entropy.
There are two common ways to achieve FALB of IP traffic. One is to
do a "deepish" packet inspection (dPI), find the relevant 5-tuple,
and use that to compute the packet's entropy. The entropy is then
used to ensure that packets in the flow are sent to the same next
hop. This memo suggests sending TE traffic over a tunnel (see
Section 2.5); this makes the identification of IP flows at transit
nodes expensive and error-prone.
Another way of accomplishing this is to insert the entropy in the
tunnel header. Many of the tunnels suggested in this memo have such
a field. The ingress is in a good position to identify flows, and,
when encapsulating the packet into the tunnel, can insert the entropy
in the header. The heavy lifting of identifying flows is thus placed
on the ingress. Transit nodes can simply use the entropy field to
correctly map packets in a flow to the same next hop, thus ensuring
FALB.
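The division of labor described above can be sketched as follows. This is a minimal illustration under stated assumptions: the memo does not prescribe a hash function or a mapping from entropy to next hop, so the CRC-32 fold, the integer weights and the function names entropy16 and pick_next_hop are all hypothetical.

```python
import zlib
from bisect import bisect_right
from itertools import accumulate


def entropy16(src_ip, dst_ip, protocol, src_port, dst_port):
    """Ingress side: hash a flow's 5-tuple to a 16-bit entropy value.

    The hash (CRC-32 folded to 16 bits) is an assumption of this sketch.
    """
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    crc = zlib.crc32(key)
    return (crc ^ (crc >> 16)) & 0xFFFF


def pick_next_hop(entropy, next_hops, weights):
    """Transit side: map an entropy value to a next hop.

    weights are positive integers giving the load balancing ratio; the
    same entropy always selects the same next hop.
    """
    cumulative = list(accumulate(weights))
    slot = entropy % cumulative[-1]
    return next_hops[bisect_right(cumulative, slot)]


# The ingress computes the entropy once and places it in the tunnel header;
# every transit node then makes its choice from the entropy alone.
flow_entropy = entropy16("192.0.2.1", "198.51.100.7", 6, 49152, 443)
chosen = pick_next_hop(flow_entropy, ["nh-a", "nh-b"], [3, 1])
```

Because the choice depends only on the entropy field, an ingress that wants a trunk to be load balanced per packet could instead place a random value in that field.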
2.2.2. Per-packet load balancing
FALB is often required and is a good default behavior, especially as
end applications may be expecting packets in a flow to be delivered
in order. However, FALB balances flows rather than traffic: it
(statistically) places flows on the outgoing links in roughly the
given ratio, but that may not place traffic in the same ratio, as
flows need not carry the same amount of traffic. In some cases
(typically when configured to), one can do per-packet load balancing
(PPLB), meaning that load balancing is no longer flow aware. This
can be done when the end applications do not require packets in a
flow to be in order, or if some (bookended) devices outside the
network put the packets back in order before delivering them to the
applications (typically by adding a sequence number). When feasible,
PPLB gives much better load distribution, and is currently the
subject of investigation, implementation and standardization.
One can achieve this by configuring each router in the DAG to do PPLB
for the traffic trunks in the DAG, or more simply by the ingress
router assigning entropy at random to the traffic it places in the
DAG. The latter approach keeps at the ingress the decision of which
DAGs (and corresponding traffic trunks) should be flow-aware and
which should not; all other nodes simply do what the entropy field
tells them to do.
2.3. Constraints
Constraints are an intent-based specification of acceptable paths
that a traffic trunk may take from ingress to egress(es).
Constraints are thus an abstract way to control the resources that a
particular traffic trunk uses.
One way to do this is to add "resource class attributes" or "colors"
[RFC2702] to links, and then specify "include" and "exclude" sets.
An include set means that all links that a path traverses must
contain at least one element of the include set. An exclude set
means that no link in the path can contain any color from the exclude
set.
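The include and exclude rules can be expressed as a per-link test. The following is a sketch only; the function names are hypothetical, and it assumes that an empty include set means "no include constraint", which the text above does not state.

```python
def link_allowed(link_colors, include=frozenset(), exclude=frozenset()):
    """Return True if a link with the given colors satisfies the constraints.

    include: the link must carry at least one of these colors
             (assumption: an empty include set imposes no constraint).
    exclude: the link must carry none of these colors.
    """
    colors = set(link_colors)
    if include and not (colors & set(include)):
        return False
    return not (colors & set(exclude))


def path_allowed(path_link_colors, include=frozenset(), exclude=frozenset()):
    """A path is acceptable only if every link it traverses is acceptable."""
    return all(link_allowed(c, include, exclude) for c in path_link_colors)
```

A CSPF-style computation would typically apply link_allowed as a filter before running the shortest path computation, so that only conforming links are considered.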
Another way is to specify a (maximum) bandwidth that a traffic trunk
can carry. This means that all links in the path must have that much
available capacity. Packets exceeding the bandwidth can be forwarded
normally, marked as droppable, or dropped.
Let's add some simple constraints to our DAG. We associate the color
red with one of the links from 2 to 3, and with the shorter of the
links from 6 to 7. Then, we constrain the paths to "exclude red",
meaning avoid links with color red. This yields the following:
* ECMP from node 0 to node 5 with constraints "include red or blue"
yields a single path.
2.4. Protection
One very useful aspect of TE is the ability to specify that a path
must be link- or node- or shared-risk-disjoint from another path.
That means that the two paths have no links, nodes, or "shared risk
groups" in common. Additionally, one can build protection paths for an
existing path to protect against link or node failures [RFC4090].
This is especially important as TE currently takes a single path
through the network, meaning that a link or node failure will result
in dropped traffic until the TE path is restored.
While protection is not quite as crucial in the case of an MPTED,
since ideally there will be multiple next hops at each node, there
will be cases
where a node has a single next hop, or all next hops share a common
failure mode. Identifying these cases and building protection paths
for such nodes will be described in a future version of this memo.
2.5. Tunnels
The shortest path first algorithm [SPF] is an easy-to-implement and
very efficient algorithm whereby all routers in a network can agree
on the path that a packet to a particular destination should take.
That means, if all routers are agreed (roughly) on the topology and
metrics of the network, they will forward packets in a loop-free
manner to all destinations -- without the need for signaling or
tunnels. However, an MPTED will not take the same paths -- some
paths may be rejected as they don't conform to the constraints, and
others may be used even though they are not shortest paths. Thus, to
route packets in a traffic trunk over a computed MPTED, a tunnel is
typically used. This tunnel will have to be signaled to the MPTED
nodes. The tunnel may be MPLS- or IP-based.
A few things are important about tunnels: whether they carry an
entropy field (EF), whether they have a "discriminator" (D) that
allows multiple tunnels between an ingress-egress pair, whether they
allow multiple egresses (ME), and whether they allow multiple
ingresses (MI). These will be discussed in the description of the
tunnels below.
In the memo, we consider the following tunnel types:
1. IP-in-IP: [RFC2003] encapsulation allows the creation of an
"outer" IP header to carry a payload packet (which is typically
an IP payload). The outer IP header's protocol field indicates
the "protocol" of the inner payload packet. The outer header of
an IP-in-IP tunnel doesn't contain an EF; transit nodes can either
spray packets across outgoing next hops, attempt to do dPI, or
use the same next hop for all packets. To accommodate ME, the
egresses have to have the same (anycast) IP address which would
be used as the destination IP of the tunnel. MI is not possible.
2. GRE: Generic Routing Encapsulation. We include in this
definition [RFC2784] and [RFC2890] with the Key Present (bit 2)
set to 0. This is similar to IP-in-IP; however, the payload is
not required to be IP. There is no EF in the header. D, ME and
MI same as for IP-in-IP.
3. GRE-E: GRE with Key Present; the Key value is the EF. D, ME and
MI same as for IP-in-IP.
4. GRE6: GRE with IPv6 addresses. The entropy is carried in the
Flow Label field of the IPv6 header. D, ME and MI same as for
IP-in-IP.
5. G-in-U: GRE-in-UDP [RFC8086]. The UDP source port is the EF; the
GRE Key, if present, can be ignored from a load balancing point
of view. D, ME and MI as in IP-in-IP.
6. MPLS-in-UDP [RFC7510]. The UDP source port is the EF; D, ME and
MI as in IP-in-IP.
7. SigLab (signaled label switching). The labels to be used are
signaled. Signaling proceeds from egress(es) to ingress(es). An
entropy label can be used as the EF. At each node, a different
label is used for each MPTED; this is the discriminator. ME and
MI are both allowed.
8. StatLab (static label). A single statically-assigned label
defines the tunnel throughout the MPTED. Here, a block of MPLS
labels is given to a label allocator; these labels MUST NOT be
allocated by any node in the network. EF, D, ME and MI are as
for SigLab. The MPTED computer (MC) must interact with the
allocator when creating or deleting an MPTED.
3. Operation
The starting point in building an MPTE DAG is to define the
properties of a traffic trunk from ingress to egress. Examples
include "BGP destinations with community xyz" or "gold class traffic
belonging to VPN foo". Next, define a set of constraints that
capture the types of paths permissible for this traffic trunk. These
include a metric to minimize, perhaps with slack (the metric could
capture delay or fiber length), as well as link colors, shared risk
groups (SRGs) and bandwidth. The desired outcome is an MPTED into
which the traffic
trunk can be mapped.
An MPTED is specified by defining:
1. a (non-empty) set of ingresses
2. a (non-empty) set of egresses
3. the metric to use and the slack
4. path constraints
5. whether or not the MPTED is "strict".
An MPTED is strict if all paths from all ingresses to all egresses
are within slack of the shortest path. An MPTED is loose if all
paths from a given ingress I to a given egress E are within slack of
each other, but paths from I to a different egress F may not be
within slack of the paths to E.
Computation (possibly using a variant of CSPF) of an MPTED is done by
the MC, which is either an ingress or a PCE [RFC4655]. (This memo
does not specify such an algorithm.) Signaling primarily occurs
between the MC and each junction node. Auxiliary signaling may occur
between a junction node and its phops.
3.1. MPTED
In this memo, a node is identified by its (16-octet) IPv6 loopback
address. A link from node u to node v is identified by u's loopback
address and its (4-octet) outgoing interface index (oif), a unique
identifier for the link allocated by u. oifs are usually exchanged in
the TE extensions of an IGP. (A link also has a (4-octet) incoming
interface index, the iif. For neighbors u and v, the correlation
between u's oif and v's iif is typically done by the IGP. iifs are
not used in this memo.) For now, this memo only deals with point-to-
point links; a future revision will describe the use of multi-access
links.
An MPTED is identified by a unique (4-octet) ID (the MID) assigned to
the MPTED by the MC. As an MPTED can change over its lifetime, it is
assigned a version number starting at 0 and incremented every time
the MPTED is recomputed. Thus, a full MPTED ID (the FID) consists of
<MC, MID, version>.
An MPTED consists of two or more "junction nodes". A junction node
can have one of five types:
1. a pure ingress node has zero incoming links and one or more
outgoing links in the MPTED. Traffic routed on an MPTED enters at
the ingress.
2. a pure egress node has one or more incoming links and zero
outgoing links in the MPTED. Traffic routed on an MPTED leaves at
an egress.
3. a transit ingress node where traffic can either enter the MPTED
or arrive from another ingress node to continue on in the MPTED.
4. a transit egress node where traffic can either exit the MPTED or
go on to another egress node.
5. a "regular" junction node has one or more incoming links and one
or more outgoing links. Traffic does not enter or leave at such
a node: it comes from a phop and goes to an nhop.
A junction node v consists of v, its previous hops (phops) and its
next hops (nhops). A phop is specified by an incoming link of v: (u,
v, oif1); an nhop by an outgoing link of v: (v, w, oif2). Note that,
since links are point-to-point, it is sufficient to specify (u, oif1)
for a phop and (v, oif2) for an nhop. The nodes u and w are loosely
referred to as a phop and an nhop of v, respectively, although
strictly speaking the link should be included. A pure ingress has no
phops
and a pure egress has no nhops.
The MPTED is broken down into a set of junction nodes. A junction
node v is specified by:
1. bandwidth (coming in to and going out of v)
2. a list of phops of v
3. a list of nhops of v, with corresponding load balancing splits
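As an illustration, the identifiers and the junction node specification of this section might be modeled as below. This is a sketch under assumptions: the class and field names are hypothetical, bandwidth is taken to be in Mbps, and load balancing splits are modeled as positive integer weights (this section does not say how splits are expressed).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A link is named by a node's IPv6 loopback address and that node's oif.
Link = Tuple[str, int]


@dataclass(frozen=True)
class FullMptedId:
    """The FID <MC, MID, version>."""
    mc: str           # IPv6 loopback address of the MC
    mid: int          # 4-octet MPTED ID assigned by the MC
    version: int = 0  # starts at 0

    def recomputed(self) -> "FullMptedId":
        """FID to use after the MPTED has been recomputed."""
        return FullMptedId(self.mc, self.mid, self.version + 1)


@dataclass
class Junction:
    """Specification of junction node v."""
    node: str
    bandwidth_mbps: int
    phops: List[Link] = field(default_factory=list)  # (u, u's oif) per phop
    nhops: List[Link] = field(default_factory=list)  # (v, v's oif) per nhop
    splits: List[int] = field(default_factory=list)  # one weight per nhop

    def __post_init__(self):
        if len(self.splits) != len(self.nhops):
            raise ValueError("need exactly one load balancing split per nhop")
        if any(w <= 0 for w in self.splits):
            raise ValueError("splits must be positive")

    def is_pure_ingress(self) -> bool:
        return not self.phops and bool(self.nhops)

    def is_pure_egress(self) -> bool:
        return bool(self.phops) and not self.nhops

    def split_fractions(self) -> List[float]:
        """Share of traffic sent to each nhop."""
        total = sum(self.splits)
        return [w / total for w in self.splits]
```

For example, a junction with two nhops and splits [3, 1] sends three quarters of the trunk's traffic to the first nhop. The transit ingress and transit egress types cannot be told apart from phops and nhops alone, so the sketch does not attempt to classify them.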
3.2. Signaling overview
The MC signals the creation or update of an MPTED by sending to each
junction node v a JUNCTION message consisting of:
1. the MPTED ID
2. the junction node specification
3. the tunnel type
4. some flags
After v parses this specification, for all tunnel types other than
SigLab, it installs FIB state for the junction.
For tunnel type SigLab, v allocates an incoming MPLS label L_u for
each phop u, and sends a LABEL message to u containing:
1. the MPTED ID
2. the phop (u's loopback + u's oif for the link)
3. the allocated label L_u
u records label L_u as part of its own junction state.
When v receives a LABEL message from all its nhops, it installs swap
state in its LFIB.
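The SigLab exchange above can be sketched as a small per-junction state holder. This is an illustration only: the class name, the starting label value 1000 and the shape of the returned swap state are assumptions of the sketch, and the MPTED ID carried in each LABEL message is omitted for brevity.

```python
from itertools import count


class SigLabJunction:
    """Label state kept by junction node v for one MPTED (tunnel type SigLab)."""

    def __init__(self, node, phops, nhops, first_label=1000):
        self.node = node
        self.phops = list(phops)   # links (u, u's oif) coming in to v
        self.nhops = list(nhops)   # links (v, v's oif) going out of v
        self._labels = count(first_label)  # hypothetical label allocator
        self.in_labels = {}        # phop -> label L_u allocated by v
        self.out_labels = {}       # nhop -> label learned from that nhop

    def allocate_in_labels(self):
        """Allocate L_u for each phop u; return (phop, label) LABEL contents."""
        messages = []
        for phop in self.phops:
            label = next(self._labels)
            self.in_labels[phop] = label
            messages.append((phop, label))
        return messages

    def on_label(self, nhop, label):
        """Record a LABEL message for one of v's outgoing links.

        Returns True once labels from all nhops have been received.
        """
        if nhop not in self.nhops:
            raise ValueError("LABEL message for an unknown outgoing link")
        self.out_labels[nhop] = label
        return self.ready()

    def ready(self):
        return set(self.out_labels) == set(self.nhops)

    def swap_state(self):
        """Swap entries: each incoming label maps to {nhop: outgoing label}."""
        if not self.ready():
            raise RuntimeError("still waiting for LABEL messages from nhops")
        return {l: dict(self.out_labels) for l in self.in_labels.values()}
```

A pure egress has no nhops, so it is ready immediately, which is consistent with signaling proceeding from egress(es) to ingress(es).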
3.3. Forwarding state
<TBD>
4. Protocol
MPTEP, the protocol used to create an MPTED, runs over TCP, and is
loosely modeled on BGP [RFC4271]. The following TCP sessions are
needed:
1. between any ingress acting as MC and all potential junction
nodes;
2. between the PCE and all potential junction nodes;
3. if tunnel type SigLab is used, between a junction node and all
its immediate neighbors.
Thus, there will be a full mesh of TCP sessions between all pairs of
potential junction nodes. For networks with several hundreds or
thousands of nodes, see Section 5 for an alternative solution.
4.1. Message IDs
Every semantically significant message (SSM) (i.e., one that causes
state to be created in a receiver) has a (4-octet) message ID
(msgID). msgID starts from 1 and counts up in a session. The last
processed and stored message ID is sent in a hello. This tells the
sender of the SSM that the receiver of the SSM (sender of the hello)
has finished processing the SSM. See Section 6.
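One way to picture the msgID mechanism is from the point of view of the sender of SSMs. The sketch below is illustrative only: the class and method names are hypothetical, it assumes that a hello acknowledges every SSM with a msgID up to and including the one it carries, and it ignores wrap-around of the 4-octet msgID.

```python
from collections import OrderedDict


class SsmSender:
    """Tracks SSMs sent on one session until the peer reports them processed."""

    def __init__(self):
        self.next_msg_id = 1          # msgID starts from 1 in a session
        self.pending = OrderedDict()  # msgID -> message awaiting confirmation

    def send(self, message):
        """Assign the next msgID to an SSM and remember it as pending."""
        msg_id = self.next_msg_id
        self.next_msg_id += 1
        self.pending[msg_id] = message
        return msg_id

    def on_hello(self, last_processed):
        """Drop pending SSMs that the peer's hello reports as processed.

        Assumption of this sketch: every msgID <= last_processed is covered.
        """
        for msg_id in [m for m in self.pending if m <= last_processed]:
            del self.pending[msg_id]
```

Anything still in pending after a hello has not yet been confirmed by the peer.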
4.2. Messages
An MPTEP message consists of a fixed-length message header (including
a message type) followed by a variable length body that depends on
the type. There are two types of message headers, MSGHDR and REFHDR.
MSGHDR MUST NOT be used for messages to or from a reflector. REFHDR
MUST be used for all messages to or from a reflector.
4.2.1. MSGHDR
A "normal" MPTEP message header has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Type (2 octets)        |       Length (2 octets)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: message type (2 octets)
Length: total length of the message (including header) in octets (2
octets)
4.2.2. REFHDR
An MPTEP message to or from an MPTEP reflector uses the following
header:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Type (2 octets)        |       Length (2 octets)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   MPTEP Sender (16 octets)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  MPTEP Receiver (16 octets)                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
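The two header layouts can be exercised with a short encoding sketch. This is not a reference implementation: it assumes network byte order, assumes that Length in REFHDR also counts the whole message including the header, and uses hypothetical function names. No message type values are defined in this section, so any value passed as msg_type is a placeholder.

```python
import ipaddress
import struct

MSGHDR = struct.Struct("!HH")        # Type, Length
REFHDR = struct.Struct("!HH16s16s")  # Type, Length, MPTEP Sender, MPTEP Receiver


def pack_msg(msg_type, body):
    """Prefix body with a MSGHDR; Length covers header plus body."""
    return MSGHDR.pack(msg_type, MSGHDR.size + len(body)) + body


def unpack_msg(data):
    """Split one MSGHDR-framed message into (type, body)."""
    msg_type, length = MSGHDR.unpack_from(data)
    if length < MSGHDR.size or length > len(data):
        raise ValueError("bad Length field")
    return msg_type, data[MSGHDR.size:length]


def pack_ref_msg(msg_type, sender, receiver, body):
    """Prefix body with a REFHDR; sender and receiver are IPv6 loopbacks."""
    s = ipaddress.IPv6Address(sender).packed
    r = ipaddress.IPv6Address(receiver).packed
    return REFHDR.pack(msg_type, REFHDR.size + len(body), s, r) + body


def unpack_ref_msg(data):
    """Split one REFHDR-framed message into (type, sender, receiver, body)."""
    msg_type, length, s, r = REFHDR.unpack_from(data)
    if length < REFHDR.size or length > len(data):
        raise ValueError("bad Length field")
    return (msg_type, str(ipaddress.IPv6Address(s)),
            str(ipaddress.IPv6Address(r)), data[REFHDR.size:length])
```

With these layouts a MSGHDR occupies 4 octets and a REFHDR 36 octets, so a receiver reading from the TCP stream can read the fixed header first and then the remaining Length minus header-size octets.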
4.3. Message types
4.3.1. OPEN
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version | Capabilities |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Hello Time | Keep Time |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sender Identifier (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Receiver Identifier (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Supported Tunnel Types (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Opt Param Len (2 octets) | Optional Parameters |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| |
| (variable) |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Version: 0 (MUST match between endpoints)
Capabilities: bit vector of sender's capabilities
0x1: node is capable of Graceful Restart (see Section 6)
Rest: SHOULD be zero on sending and ignored on receipt
Hello Time: time in seconds between hellos. If a hello is not
received in time, it is deemed to be missed. If three consecutive
hellos are missed, the session is torn down.
Keep Time: time that control plane and forwarding plane state
received from neighbor is kept after session teardown.
Sender Identifier, Receiver Identifier: IPv6 loopback addresses of
the two endpoints
Supported Tunnel Types: bit vector of tunnel types that the sender
can install. If the receiver is an MC, it MUST NOT send an MPTED
with a tunnel type that the sender does not implement.
Opt Param Len, Optional Parameters: none defined yet
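As an illustration only, the OPEN body above could be packed and parsed as in the following Python sketch. It assumes that Version, Capabilities, Hello Time and Keep Time are each 2 octets wide (the diagram suggests this but does not state it), that all fields are in network byte order, and that the common message header is handled elsewhere. The names encode_open, decode_open and OPEN_FIXED are hypothetical.

```python
import ipaddress
import struct

# Fixed part of the OPEN body. Version, Capabilities, Hello Time and
# Keep Time are ASSUMED to be 2 octets each; the identifiers are 16 octets,
# Supported Tunnel Types is 4 octets, Opt Param Len is 2 octets.
OPEN_FIXED = struct.Struct("!HHHH16s16sIH")

CAP_GRACEFUL_RESTART = 0x1  # Capabilities bit defined in the text above


def encode_open(version, capabilities, hello_time, keep_time,
                sender, receiver, tunnel_types, opt_params=b""):
    """Pack an OPEN body in network byte order (sketch, not normative)."""
    return OPEN_FIXED.pack(
        version, capabilities, hello_time, keep_time,
        ipaddress.IPv6Address(sender).packed,
        ipaddress.IPv6Address(receiver).packed,
        tunnel_types, len(opt_params),
    ) + opt_params


def decode_open(data):
    """Unpack an OPEN body; raise ValueError if it is truncated."""
    if len(data) < OPEN_FIXED.size:
        raise ValueError("OPEN body shorter than its fixed part")
    (version, caps, hello, keep, snd, rcv,
     tunnel_types, opt_len) = OPEN_FIXED.unpack_from(data)
    opt = data[OPEN_FIXED.size:OPEN_FIXED.size + opt_len]
    if len(opt) != opt_len:
        raise ValueError("Optional Parameters truncated")
    return {
        "version": version,
        "graceful_restart": bool(caps & CAP_GRACEFUL_RESTART),
        "hello_time": hello,
        "keep_time": keep,
        "sender": str(ipaddress.IPv6Address(snd)),
        "receiver": str(ipaddress.IPv6Address(rcv)),
        "tunnel_types": tunnel_types,
        "opt_params": opt,
    }


body = encode_open(0, CAP_GRACEFUL_RESTART, 10, 120,
                   "2001:db8::1", "2001:db8::2", 0x3)
msg = decode_open(body)
```

Under these width assumptions the fixed part is 46 octets long; the timer values and addresses in the usage lines are arbitrary examples.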
4.3.2. HELLO
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Last Processed MsgID (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Indication that the sender is alive and functioning; also, that the
sender has processed and safely stored state related to messages up
to and including the enclosed msgID; the receiver can throw away
signaling state for messages with a lower msgID.
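The hello handling described above and in the Hello Time definition might be tracked as in the following Python sketch. The class HelloTracker and its method names are hypothetical, the timer that detects a missed hello is assumed to be driven by the caller, and msgID wraparound is ignored for simplicity.

```python
class HelloTracker:
    """Per-session hello bookkeeping (illustrative sketch only)."""

    MAX_MISSED = 3  # three consecutive missed hellos tear the session down

    def __init__(self, hello_time):
        self.hello_time = hello_time  # seconds between hellos, from OPEN
        self.missed = 0
        self.up = True
        self.pending = {}  # msgID -> signaling state not yet confirmed

    def on_interval_expired(self):
        """Call when hello_time seconds pass without a HELLO arriving."""
        self.missed += 1
        if self.missed >= self.MAX_MISSED:
            self.up = False
        return self.up

    def on_hello(self, last_processed_msg_id):
        """A HELLO arrived: reset the miss counter and prune old state."""
        self.missed = 0
        # Following the text literally, only state for messages with a
        # strictly lower msgID is discarded.
        for msg_id in [m for m in self.pending if m < last_processed_msg_id]:
            del self.pending[msg_id]


tracker = HelloTracker(hello_time=10)
tracker.pending = {1: "junction-a", 2: "junction-b", 3: "label-c"}
tracker.on_hello(last_processed_msg_id=3)
```

After the final call only the state for msgID 3 remains pending.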
4.3.3. JUNCTION
A JUNCTION message has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MC ID (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MPTED ID (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MPTED Version (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Tunnel Type | Flags | TunInfLen |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Tunnel Information (TunInfLen octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Tunnel Bandwidth in Mbps (4 octets) | (?)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| # Ingresses (m) (2 octets) | # Egresses (n) (2 octets) | \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Ingress ID 1 (16 octets) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Ingress ID 2 (16 octets) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ... | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Ingress ID m (16 octets) | | (?)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Egress ID 1 (16 octets) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Egress ID 2 (16 octets) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ... | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Egress ID n (16 octets) | /
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| # phops (p) (2 octets) | # nhops (q) (2 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Junction bandwidth (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop node 1 ID (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop oif 1 (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop node p ID (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop oif p (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nhop oif 1 (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nhop share 1 (2 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nhop oif q (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nhop share q (2 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Tunnel Information field is to identify the type of tunnel to use
for the MPTED. For example, for an MPLS tunnel with a statically
assigned label, the Tunnel Information is the label. For IP-based
tunnels, the Tunnel Information is the source and destination IP
addresses (plus possibly other information). Details TBD.
The fields marked (?) may not be required in a Junction message; TBD.
4.3.3.1. Tunnel Flags (1 octet)
0x1: Junction is an ingress
0x2: Junction is an egress
(Pure vs. transit ingresses/egresses are distinguished by the number
of phops/nhops.)
Rest: Reserved (MUST be sent as 0 and ignored on receipt)
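A receiver might classify a junction from these flags and the phop/nhop counts as in the Python sketch below. The function junction_roles is hypothetical. The sketch assumes that a pure ingress has zero phops and, by symmetry, that a pure egress has zero nhops; the text above only says that the counts distinguish the cases.

```python
FLAG_INGRESS = 0x1  # Junction is an ingress
FLAG_EGRESS = 0x2   # Junction is an egress


def junction_roles(flags, num_phops, num_nhops):
    """Return the roles of a junction (sketch under stated assumptions)."""
    roles = []
    if flags & FLAG_INGRESS:
        # ASSUMPTION: a pure ingress has no previous hops.
        roles.append("pure ingress" if num_phops == 0 else "transit ingress")
    if flags & FLAG_EGRESS:
        # ASSUMPTION: a pure egress has no next hops.
        roles.append("pure egress" if num_nhops == 0 else "transit egress")
    # Reserved bits are ignored on receipt, as the text requires.
    return roles or ["transit"]


roles = junction_roles(FLAG_INGRESS, num_phops=0, num_nhops=2)
```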
4.3.3.2. Junction bandwidth
Bandwidth incoming to the junction, in megabits per second (Mbps), as
a 4-octet non-negative integer.
4.3.3.3. nhop share
2-octet share of the outgoing bandwidth. A Junction should attempt
to send a ratio of (share n)/(sum (share i)) of the incoming
bandwidth to nhop #n.
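The ratio above can be worked through with a short Python sketch. The function split_bandwidth is hypothetical, and exact fractions are used only to avoid rounding in the illustration; how a real junction quantizes the split is not specified here.

```python
from fractions import Fraction


def split_bandwidth(incoming_mbps, shares):
    """Target bandwidth per nhop: incoming * share_n / sum(shares)."""
    total = sum(shares)
    if total == 0:
        raise ValueError("at least one nhop share must be non-zero")
    return [Fraction(incoming_mbps * share, total) for share in shares]


# Example: 1000 Mbps incoming, three nhops with shares 1, 1 and 2.
targets = split_bandwidth(1000, [1, 1, 2])
```

With these example values the three nhops are targeted at 250, 250 and 500 Mbps respectively.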
4.3.4. LABEL
A LABEL message MUST only be used for MPTEDs of type SigLab. A LABEL
message is sent from an egress junction node to each of its phops.
Any other junction node MUST only send a LABEL message when it has
received a LABEL message from all of its nhops (cf. "Ordered Label
Distribution Control" [RFC3036], Section 2.6.1.2). A pure ingress
node never sends a LABEL message as it has no phops.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MC ID (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MPTED ID (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MPTED Version (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop node (16 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| phop oif (4 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Label (20 bits) | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
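The ordering rule for LABEL messages could be tracked per MPTED as in the following Python sketch. The class LabelGate and its methods are hypothetical, nodes are identified by plain strings for brevity, and the sketch only decides when LABEL messages may be sent; label allocation and message encoding are out of scope.

```python
class LabelGate:
    """Decides when a junction may send LABEL messages to its phops."""

    def __init__(self, phops, nhops, is_egress=False):
        self.phops = list(phops)
        self.nhops = set(nhops)
        self.is_egress = is_egress
        self.labels = {}  # nhop -> label received from that nhop

    def ready_phops(self):
        """phops to signal now, or an empty list if not yet allowed."""
        # An egress junction starts the process; any other junction waits
        # until a LABEL has been received from every nhop.
        if self.is_egress or set(self.labels) == self.nhops:
            return list(self.phops)
        return []

    def on_label(self, nhop, label):
        """Record a LABEL received from an nhop, then re-evaluate."""
        if nhop not in self.nhops:
            raise ValueError("LABEL received from a node that is not an nhop")
        self.labels[nhop] = label
        return self.ready_phops()


transit = LabelGate(phops=["A", "B"], nhops=["X", "Y"])
first = transit.on_label("X", 100)   # still waiting for Y
second = transit.on_label("Y", 200)  # all nhops heard from
```

A pure ingress has an empty phop list, so the same logic never yields anyone to signal.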
4.3.5. NOTIFICATION
<TBD>
5. MPTEP Reflector
Instead of establishing a full mesh of MPTEP connections among the
nodes participating in establishing MPTE DAGs, one could instead have
a small set of designated MPTEP Reflectors with whom all MPTEP nodes
establish connections. An MPTEP Reflector passes on an MPTEP message
from the sender to the (single) ultimate receiver. In this, an MPTEP
Reflector's function is different from that of a BGP Route Reflector:
an MPTEP Reflector sends a received message to exactly one destination
node.
The goal of having MPTEP Reflectors is simply to reduce the number of
MPTEP sessions that a node (typically, a router) has. In a network
of (say) 500 nodes and (say) 3 Reflectors, each of these 500 nodes
would only need 3 sessions with the Reflectors. The Reflectors
themselves would need 500 sessions with the router nodes, plus 2
sessions among themselves.
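The saving can be checked with a little arithmetic, sketched here in Python. The function names are hypothetical, and the sketch assumes that the Reflectors are separate from the router nodes and are fully meshed among themselves.

```python
def full_mesh_sessions(nodes):
    """Total sessions when every node peers with every other node."""
    return nodes * (nodes - 1) // 2


def reflector_sessions(nodes, reflectors):
    """Total sessions when nodes peer only with the Reflectors.

    ASSUMPTION: Reflectors are distinct from the router nodes and form a
    full mesh among themselves.
    """
    return nodes * reflectors + full_mesh_sessions(reflectors)


mesh_total = full_mesh_sessions(500)             # 124750 sessions
reflected_total = reflector_sessions(500, 3)     # 1503 sessions
per_node_mesh, per_node_reflected = 500 - 1, 3   # sessions per router node
per_reflector = 500 + (3 - 1)                    # sessions per Reflector
```

In this example a router node goes from 499 sessions to 3, while each Reflector carries 502.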
6. Graceful Restart
A node N is capable of Graceful Restart if a) it can maintain control
plane state across restarts; and b) it can maintain forwarding state
across restarts. If N is capable of Graceful Restart, an MPTE DAG
going through N can continue functioning while N restarts. While N
is restarting, new JUNCTION/LABEL messages will be dropped or
ignored; new MPTE DAGs passing through N will not be established.
Once restart is complete, N will send an OPEN message and
re-establish connections with all its peers (or all the MPTEP
Reflectors). Thereafter, N can participate in new DAGs passing
through it by processing received JUNCTION messages.
More details will be described in a future version.
7. IANA Considerations
TBD
8. Security Considerations
TBD
9. References
9.1. Normative References
[RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003,
DOI 10.17487/RFC2003, October 1996,
<https://www.rfc-editor.org/rfc/rfc2003>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
DOI 10.17487/RFC2784, March 2000,
<https://www.rfc-editor.org/rfc/rfc2784>.
[RFC2890] Dommety, G., "Key and Sequence Number Extensions to GRE",
RFC 2890, DOI 10.17487/RFC2890, September 2000,
<https://www.rfc-editor.org/rfc/rfc2890>.
[RFC7510] Xu, X., Sheth, N., Yong, L., Callon, R., and D. Black,
"Encapsulating MPLS in UDP", RFC 7510,
DOI 10.17487/RFC7510, April 2015,
<https://www.rfc-editor.org/rfc/rfc7510>.
[RFC8086] Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert, "GRE-
in-UDP Encapsulation", RFC 8086, DOI 10.17487/RFC8086,
March 2017, <https://www.rfc-editor.org/rfc/rfc8086>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
9.2. Informative References
[RFC2702] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and J.
McManus, "Requirements for Traffic Engineering Over MPLS",
RFC 2702, DOI 10.17487/RFC2702, September 1999,
<https://www.rfc-editor.org/rfc/rfc2702>.
[RFC3036] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and
B. Thomas, "LDP Specification", RFC 3036,
DOI 10.17487/RFC3036, January 2001,
<https://www.rfc-editor.org/rfc/rfc3036>.
[RFC3209] Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V.,
and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP
Tunnels", RFC 3209, DOI 10.17487/RFC3209, December 2001,
<https://www.rfc-editor.org/rfc/rfc3209>.
[RFC4090] Pan, P., Ed., Swallow, G., Ed., and A. Atlas, Ed., "Fast
Reroute Extensions to RSVP-TE for LSP Tunnels", RFC 4090,
DOI 10.17487/RFC4090, May 2005,
<https://www.rfc-editor.org/rfc/rfc4090>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006,
<https://www.rfc-editor.org/rfc/rfc4271>.
[RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February
2006, <https://www.rfc-editor.org/rfc/rfc4364>.
[RFC4655] Farrel, A., Vasseur, J.-P., and J. Ash, "A Path
Computation Element (PCE)-Based Architecture", RFC 4655,
DOI 10.17487/RFC4655, August 2006,
<https://www.rfc-editor.org/rfc/rfc4655>.
[RFC4761] Kompella, K., Ed. and Y. Rekhter, Ed., "Virtual Private
LAN Service (VPLS) Using BGP for Auto-Discovery and
Signaling", RFC 4761, DOI 10.17487/RFC4761, January 2007,
<https://www.rfc-editor.org/rfc/rfc4761>.
[RFC4762] Lasserre, M., Ed. and V. Kompella, Ed., "Virtual Private
LAN Service (VPLS) Using Label Distribution Protocol (LDP)
Signaling", RFC 4762, DOI 10.17487/RFC4762, January 2007,
<https://www.rfc-editor.org/rfc/rfc4762>.
[RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
2015, <https://www.rfc-editor.org/rfc/rfc7432>.
[SPF] Dijkstra, E. W., "A note on two problems in connexion with
graphs", 1 December 1959,
<https://doi.org/10.1007/BF01386390>.
Authors' Addresses
Kireeti Kompella
Juniper Networks
Sunnyvale, California 94089
United States of America
Email: kireeti.ietf@gmail.com
Luay Jalil
Verizon
Richardson, Texas 75081
United States of America
Email: luay.jalil@verizon.com
Mazen Khaddam
Cox Communications
Atlanta, Georgia 30328
United States of America
Email: mazen.khaddam@cox.com
Andy Smith
Oracle Cloud Infrastructure
Austin, Texas 78741
United States of America
Email: andy.j.smith@oracle.com