PWE3 Y(J). Stein
Internet-Draft I. Mendelsohn
Intended status: Standards Track R. Insler
Expires: May 6, 2009 RAD Data Communications
November 2, 2008
PW Bonding
draft-stein-pwe3-pwbonding-01.txt
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on May 6, 2009.
Copyright Notice
Copyright (C) The IETF Trust (2008).
Abstract
There are times when pseudowires must be transported over physical
links with limited bandwidth. We shall use the term "bonding" (also
variously known as inverse multiplexing, link aggregation, trunking,
teaming, etc.) to mean an efficient mechanism for distributing the PW
traffic over several links. Unlike load balancing and equal cost
multipath, bonding makes no assumption that the PW traffic can be
decomposed into distinguishable flows, and thus bonding requires
delay compensation and packet reordering. Furthermore, PW bonding
can optionally track bandwidth constraints in order to minimize
packet loss.
Stein, et al. Expires May 6, 2009 [Page 1]
Internet-Draft pwbond November 2008
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. PW Bonding mechanism . . . . . . . . . . . . . . . . . . . . . 5
3. PW Dynamic Bandwidth Allocation . . . . . . . . . . . . . . . 6
4. Protocol Extensions . . . . . . . . . . . . . . . . . . . . . 7
5. Partial Path PW Bonding . . . . . . . . . . . . . . . . . . . 8
6. Applicability . . . . . . . . . . . . . . . . . . . . . . . . 9
7. Security Considerations . . . . . . . . . . . . . . . . . . . 10
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 10
10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 10
10.1. Normative References . . . . . . . . . . . . . . . . . . 10
10.2. Informative References . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11
Intellectual Property and Copyright Statements . . . . . . . . . . 12
1. Introduction
Inverse multiplexing is any mechanism for transporting a single high
capacity traffic flow over multiple lower capacity paths. Inverse
multiplexing is also known as bonding, link load balancing, link
aggregation, trunking, teaming, concatenation, and multipath. In the
context of pseudowires we will use the term bonding.
Bonding has been defined for many transport technologies (and often
more than one mechanism has been developed for a single technology)
including TDM (contiguous and virtual concatenation, VCAT), ATM (ATM
Forum's IMA and ITU's G.998.1 multi-pair bonding), Ethernet (802.3
link aggregation, LAG, and EFM PME aggregation), xDSL (the previous two
and G.998.3 time domain inverse multiplexing, TDIM), PPP (MLPPP), and
in the context of IP transport, equal cost multipath (ECMP).
Regardless of the transport infrastructure, all bonding mechanisms
must confront a fundamental problem, namely that the constituent
paths will in general have different (and not necessarily constant)
propagation delays. Thus a mechanism must be employed to ensure in-
order delivery of the data units. Two solutions have been proposed
for this problem, namely performing differential delay compensation,
and decomposing the input into mutually distinct flows. Methods
using the former solution (e.g., VCAT, TDIM) buffer the data from
each path at egress (e.g., VCAT buffers up to 1/2 second), and
introduce protocol elements to synchronize the paths before
recombining them. Methods using the latter solution (LAG, ECMP) skirt
the problem by consistently mapping data units from a given flow onto
the same constituent path, assuming that there is only the need to
maintain order inside each flow, and not across flows.
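The flow-based approach can be illustrated with a minimal Python
sketch. The helper name path_for_flow, the use of CRC-32 as the hash,
and the flow identifier strings are assumptions made for illustration
only; they are not taken from the LAG or ECMP specifications.

```python
import zlib

def path_for_flow(flow_id: bytes, num_paths: int) -> int:
    """Map a flow identifier onto one constituent path (hypothetical helper).

    CRC-32 stands in for whatever hash a real implementation would use.
    """
    return zlib.crc32(flow_id) % num_paths

# Every packet of a given flow lands on the same path, so order inside
# the flow is preserved without any reordering buffer at egress.
packets = [b"flow-A", b"flow-B", b"flow-A", b"flow-C", b"flow-A"]
paths = [path_for_flow(f, 3) for f in packets]

# A single large flow, however, never uses more than one path.
single_flow_paths = {path_for_flow(b"flow-A", 3) for _ in range(1000)}
```

The last two lines show the limitation discussed in this section: with
only one identifiable flow, the remaining paths carry no traffic.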
Methods employing differential delay compensation tend to be more
complex and to require large buffers, but are universally applicable.
Methods decomposing the input into flows depend on the existence of
such flows and on inspecting the input to identify them. Thus if
the input is a single large flow, or if it is not possible to
identify flows (e.g., due to lower layer encryption), or if it is
undesirably complex to do so, these methods may not be applicable.
Furthermore, methods decomposing the input into flows tacitly assume
that the hashing of flow identifiers onto tunnels results in fair
distribution of traffic. This is generally a good assumption when
there are a very large number of independent flows. Incorrect
distribution causes some underlying paths to become congested and
drop packets, while others are relatively underutilized. With direct
inverse multiplexing and differential delay compensation one can
ensure fairness, and can in fact adapt to underlying paths with
unequal and even time-varying capacity.
In the context of pseudowires a decomposition mechanism has been
previously proposed [5]. The present draft proposes a PW bonding
mechanism based on direct inverse multiplexing with differential
delay compensation. In particular, the proposed mechanism may be
used when PWs are supported by DSL links.
The simplest scenario for PW-bonding is depicted in Figure 1. Here
the entire PW is transported edge to edge over separate PW
components, each inside a distinct transport tunnel. A somewhat more
complex scenario is partial path bonding, as depicted in Figure 2,
where only a portion of the PW path is bandwidth restricted. Here
only the PW components are shown, and not the tunnels into which they
are placed. Here it is required to separate the PW into components
in separate tunnels at some point inside the network. However, since
the P device where this happens is not PW aware, the PW components must
still be defined by the ingress PE.
+--------+ +--------+
| PE | | PE |
| | tunnel 1 | |
| X========================X |
| | PW component 1 | |
| X------------------------X |
| | | |
| X========================X |
| | | |
AC | | | | AC
-------o | | o-------
| | | |
| | tunnel 2 | |
| X========================X |
| | PW component 2 | |
| X------------------------X |
| | | |
| X========================X |
| | | |
+--------+ +--------+
Figure 1. edge-to-edge PW bonding - 2 PW components in tunnels
+------+ +-----+ +------+
| PE | | P | | PE |
| | | | PW component | |
| | | X================X |
| | | | | |
AC | | | | | | AC
------o | PW | | PW component | o------
| X==========X X================X |
| | | | | |
| | | | | |
| | | | PW component | |
| X | X================X |
| | | | | |
+------+ +-----+ +------+
Figure 2. partial path PW bonding - 3 PW components
Each PW component will normally receive a distinct PW label, and thus
seem to the network to be a distinct PW. Furthermore, PW components
MUST use the PW control word [2]. However, as we shall see in the
next section, the sequence number generation and processing is
different for PW components than for true PWs.
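Since the sequence number is carried in the control word of [2], a
minimal Python sketch of building and parsing that word may be
helpful. It assumes the preferred layout of RFC 4385 (first nibble
zero, 4 flag bits, 2 FRG bits, 6 length bits, 16-bit sequence
number); the function names are hypothetical and not part of any
specification.

```python
import struct

def pack_control_word(seq: int, flags: int = 0, frg: int = 0, length: int = 0) -> bytes:
    """Build the 4-octet preferred PW MPLS control word (sketch).

    The first nibble is left at zero, as required for a PW control word.
    """
    first16 = ((flags & 0xF) << 8) | ((frg & 0x3) << 6) | (length & 0x3F)
    return struct.pack("!HH", first16, seq & 0xFFFF)

def unpack_sequence_number(control_word: bytes) -> int:
    """Return the 16-bit sequence number carried in a control word."""
    first16, seq = struct.unpack("!HH", control_word[:4])
    if first16 >> 12 != 0:
        raise ValueError("first nibble of a PW control word must be 0")
    return seq

cw = pack_control_word(7)
seq = unpack_sequence_number(cw)
```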
2. PW Bonding mechanism
As discussed in the previous section, at the egress PE the traffic
from each PW component is buffered, and the protocol is responsible
for ensuring that packets constituting the PW are reassembled in
correct order. This is accomplished by mandating use of the PW
control word, and sharing a single sequence number space among all
PW components making up the PW. The sequence numbers are used by the
egress PE to ensure proper ordering. The idea is depicted in
Figure 3, for the simple case of edge-to-edge bonding. Here eight
packets are divided amongst three PW components by the ingress PE,
according to a bandwidth allocation algorithm to be described later.
Due to different link latencies, the packets arrive at the egress out
of order, but are easily reordered by the egress PE by observing the
sequence number.
+------+ +---------------+
| PE | | PE |
| | 1 2 7 | |
| X==========X |
| | | |
1 2 3 4 5 6 7 8| | |1 3 2 4 5 7 6 8|1 2 3 4 5 6 7 8
---------------o | 3 4 8 | o---------------
PW | X==========X |
| | | |
| | | |
| | 5 6 | |
| X==========X |
| | | |
+------+ +---------------+
Figure 3. Use of sequence numbers to ensure correct packet ordering
In order to enable reordering, the egress PE must allocate sufficient
buffer memory to sustain the largest expected differential delay.
The differential delay is added to the latencies of all packets,
making the effective latency equal to that of the slowest PW
component.
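The reordering step can be sketched in Python as follows. This is an
illustration under simplifying assumptions, not a definitive
implementation: the class name ReorderBuffer is hypothetical, sequence
number wraparound is ignored, the buffer is unbounded, and no timeout
is applied to packets lost in the network.

```python
class ReorderBuffer:
    """Egress-side resequencer shared by all PW components (sketch)."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq   # next sequence number to release
        self.pending = {}           # seq -> packet held back for reordering

    def receive(self, seq: int, packet):
        """Accept a packet from any component; return packets now in order."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

buf = ReorderBuffer()
arrival_order = [1, 3, 2, 4, 5, 7, 6, 8]   # arrival order shown in Figure 3
delivered = []
for seq in arrival_order:
    delivered.extend(buf.receive(seq, seq))
```

Packets 3 and 7 are held until packets 2 and 6 arrive, so the output
stream is again in sequence number order.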
3. PW Dynamic Bandwidth Allocation
In the simplest case, all packets to be sent over the various PW
components are of the same size, and all PW components support the
same data rates. For this case (but only for this case), a simple
round-robin algorithm for distributing the packets onto PW components
is optimal in the sense that it minimizes the probability of packet
loss due to buffer exhaustion.
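A minimal Python sketch of the round-robin distribution follows. The
function name round_robin and the zero-based component indices are
assumptions made for illustration.

```python
from itertools import cycle

def round_robin(packets, num_components):
    """Assign packets to PW components in strict rotation (sketch)."""
    assignment = [[] for _ in range(num_components)]
    for packet, n in zip(packets, cycle(range(num_components))):
        assignment[n].append(packet)
    return assignment

# Eight equal-size packets over three equal-rate components.
assignment = round_robin(range(1, 9), 3)
# → [[1, 4, 7], [2, 5, 8], [3, 6]]
```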
The simple round-robin algorithm is not optimal when the packets are
not all of the same size, or when the PW components do not all
support the same data rate, or both. In such cases we need to fairly
distribute data bytes over the components in such fashion as to
minimize the probability that a packet will be dropped due to over-
run of a component's buffer. While the packet sizes are always known
before transmission, the state of the buffers is usually unknown,
and in some cases the supported data rates may be unknown. The
following discussion will be for the edge-to-edge component case; the
partial path case is similar, but requires separate consideration of
the two directions.
If the packet size is not constant, and the component rates are
known, but we have no further information (e.g., we do not know the
size of the buffers, nor do we have feedback from the egress PE on
the actual fill states) the best algorithm for an ingress PE is based
on a leaky bucket scheme. In this scheme the ingress PE maintains,
for each PW component, a variable Bn that approximately tracks the
fill state of the egress PE's buffer for this component. The
variable Bn is continually decreased at a rate equal to the data rate
of the component n, but always remains non-negative. Each time a
packet is sent over PW component n, its size in bytes is added to Bn.
When a new packet needs to be sent, the ingress PE sends it on the PW
component with minimal Bn. This algorithm can also be used when it
can be assumed that the component rates are equal, or approximately
so.
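The leaky bucket scheme just described can be sketched in Python as
follows. The class name, the use of bytes per second for rates, and
the convention that the caller supplies the current time in seconds
are assumptions made for illustration; ties between equal Bn are
broken in favor of the lowest component index.

```python
class LeakyBucketScheduler:
    """Ingress-side estimate of per-component egress buffer fill (sketch)."""

    def __init__(self, rates_bytes_per_s):
        self.rates = list(rates_bytes_per_s)
        self.fill = [0.0] * len(self.rates)   # the variables Bn
        self.last_time = 0.0

    def _drain(self, now):
        """Decrease each Bn at its component rate, never below zero."""
        elapsed = now - self.last_time
        self.fill = [max(0.0, b - r * elapsed)
                     for b, r in zip(self.fill, self.rates)]
        self.last_time = now

    def send(self, size_bytes, now):
        """Pick the component with minimal Bn and charge the packet to it."""
        self._drain(now)
        n = min(range(len(self.fill)), key=self.fill.__getitem__)
        self.fill[n] += size_bytes
        return n

# Component 1 is twice as fast as component 0.
sched = LeakyBucketScheduler([1000, 2000])
times = [0.0, 0.0, 0.125, 0.25, 0.375]
choices = [sched.send(500, t) for t in times]
```

In this run the faster component drains more quickly and is therefore
chosen more often than the slower one.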
If in addition to packet size and PW component data rates, the
ingress PE knows the buffer size used for differential compensation,
a similar, but somewhat better, algorithm can be used. When deciding
over which component to send the packet, rather than choosing the
minimal Bn, the ingress PE chooses the maximal Bn to which the packet
size can be added without overflowing the given buffer size. In
practice some extra margin must be applied in order to account for
PDV.
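The selection rule of this second mode can be sketched in Python as
follows. The function name pick_component and the margin_bytes
parameter are hypothetical. Returning None when no component can
accept the packet is also an assumption, since the behavior in that
case is not specified here.

```python
def pick_component(fill, size_bytes, buffer_bytes, margin_bytes=0):
    """Choose the fullest component that can still take the packet (sketch).

    fill holds the leaky bucket variables Bn; margin_bytes is the extra
    headroom reserved to account for PDV.
    """
    limit = buffer_bytes - margin_bytes
    candidates = [n for n, b in enumerate(fill) if b + size_bytes <= limit]
    if not candidates:
        return None   # assumption: caller decides whether to drop or wait
    return max(candidates, key=lambda n: fill[n])

fill = [900, 400, 100]
medium = pick_component(fill, 600, buffer_bytes=1500, margin_bytes=100)
small = pick_component(fill, 200, buffer_bytes=1500, margin_bytes=100)
huge = pick_component(fill, 1400, buffer_bytes=1500, margin_bytes=100)
```

Choosing the maximal Bn that still fits keeps the emptier components
free for large packets that would not fit anywhere else.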
Finally, if the egress PE can send information on the actual state of
its buffers back to the ingress PE, then an algorithm that uses these
buffer states instead of the approximated leaky bucket ones can be
employed.
Any implementation MUST support the round-robin method, and SHOULD
support the first leaky bucket mode. Control protocol extensions are
needed to enable communication from egress back to ingress of the
additional information needed to support more optimal modes. If the
rates can be accurately known the first leaky bucket mode MUST be
used, and if further information is available then other mechanisms
MAY be used.
4. Protocol Extensions
In order to set up the PW components using the PWE3 control protocol
[3] a single PWid or generalized PWid is assigned to the logical PW,
and additional PWids or generalized PWids are allocated for the PW
components. All PW components are assigned an identical group ID, in
order to indicate their relationship, and to enable easy withdrawal
of the logical PW. First the logical PW is set up using a label
mapping message containing the interface parameters, and a new
"bonding" sub-TLV containing the group ID. Subsequently the PW
components are configured. Each PW component is assigned to a
distinct transport tunnel by mechanisms not specified here.
Attachment circuit faults are signaled via PW status messages
associated with the PWid or generalized PWid of the logical PW. PW
component faults and capacity indicators are sent via status messages
per PW component PWid or generalized PWid.
Enhancements to the PWE3 control protocol are needed in order to
associate PW components with distinct labels in distinct tunnels to a
single logical PW, and to communicate component capacity and status
information. The format of these LDP extensions will be detailed in
the next version of this draft.
Standard VCCV mechanisms [4] may be used independently for each PW
component, and the resulting connectivity information may be used by
the ingress PE in the process of distributing traffic over PW
components. VCCV for the partial path scenario is for further study.
5. Partial Path PW Bonding
When only a portion of the PW's path suffers from bandwidth
constriction, the partial path bonding scenario depicted in Figure 2
is used. As for the regular bonding case, the ingress PE decomposes
the input into multiple PW components, and performs the same
algorithm to decide into which component to send a given packet. For
those portions of the network where a single tunnel can support the
entire service bandwidth, the PW components may all be placed in
the same transport tunnel. For constricted bandwidth segments, each
PW component must be placed in a distinct tunnel. The distinct
transport tunnels are merged into the single tunnel using label
merging, per section 3.26.2 of [1].
Another case of practical interest is when the bandwidth is
restricted in a non-MPLS access network, and the PE terminating the
MPLS cannot inverse multiplex the traffic onto low capacity links
based on PW labels alone. This case arises for a DSLAM terminating
MPLS (or a PE terminating MPLS upstream from the DSLAM) and
forwarding to customers solely based on Ethernet MAC address (and
possibly VLAN ID). For such a case a double PW encapsulation may be
used. Through the core network we tunnel an Ethernet PW, which
itself carries the bonded PW components (which may be of any type
supported by PWE encapsulations), see Figure 4.
+--------------------+
| MPLS label stack |
+--------------------+
| exterior PW label |
+--------------------+
| Ethernet header |
+--------------------+
| interior PW label |
+--------------------+
| control word |
+--------------------+
| payload |
+--------------------+
Figure 4. packet format for DSL partial path scenario
The DSLAM (or PE immediately upstream from the DSLAM) terminates the
MPLS and exterior PW protocols, thus exposing the Ethernet header.
Under the Ethernet header there MAY be an MPLS header (which the CE
negotiates with the immediately upstream PE), and there MUST be an
interior PW label (which the CE negotiates with the remote CE or PE).
Based purely on the Ethernet addressing the DSLAM distributes the
traffic over multiple DSL links following the partition crafted by
the ingress PE. All of these DSL links terminate on a single CE
device, which terminates the Ethernet and exposes the interior PW labels
and the sequence numbers in the control word. Using these sequence
numbers the CE can piece together the original traffic stream.
6. Applicability
PW bonding is a useful mechanism when the bandwidth of available
physical links is insufficient to carry the user traffic, but several
links can be dedicated. Unlike load balancing and equal cost
multipath mechanisms, PW bonding makes no assumption that the PW
traffic can be decomposed into distinguishable flows. It is fully
applicable for non-IP or encrypted traffic. By using mechanisms
described above, PW bonding can approach full utilization of the
aggregate link bandwidth.
PW bonding involves delay compensation and packet reordering, and
thus requires allocation of sufficient memory at the egress PE. The
amount of memory needed is proportional to the link speed and to the
difference in propagation delay between the fastest and slowest
links. Thus PW bonding is most applicable when the link speeds are
low (e.g., supported by DSL lines), and the delay differences are
small.
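As a rough worked example of this proportionality, the following
Python sketch estimates the reordering memory per PW component as its
rate multiplied by its delay advantage over the slowest component.
The function name, the units (bit/s and microseconds), and this
per-component estimate are assumptions made for illustration; packet
overhead and PDV margin are ignored.

```python
def reorder_memory_bytes(rates_bps, delays_us):
    """Estimate egress buffer bytes needed per component (sketch).

    Data arriving on component n must wait until the slowest component
    catches up, so it needs about rate_n * (d_max - d_n) of storage.
    """
    slowest = max(delays_us)
    return [rate * (slowest - d) // 8_000_000
            for rate, d in zip(rates_bps, delays_us)]

# Two 2 Mbit/s DSL links with one-way delays of 5 ms and 25 ms.
per_component = reorder_memory_bytes([2_000_000, 2_000_000], [5_000, 25_000])
# → [5000, 0]
```

At these rates a 20 ms delay difference costs only about 5 kilobytes,
whereas the same difference on a 1 Gbit/s link would require roughly
2.5 megabytes.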
Only the PEs need to know that the PW components are not full PWs
(the only difference being the sequence number processing). Thus PW
bonding requires changes only to the PEs and does not require any
changes to the intervening PSN.
7. Security Considerations
PW bonding does not introduce security considerations beyond those
present for regular PWs. In particular, attacks based on sequence
number manipulation are of concern. For partial path cases where CE
devices participate in the PWE signaling, authentication is required.
8. IANA Considerations
Required extensions to the PWE3 control protocol, including the sub-
TLV type code for the PW component label, and new PW status codes,
will be detailed in the next version of this draft.
9. Acknowledgments
The authors would like to thank Gabriel Zigelboim for fruitful
discussions on optimal dynamic allocation mechanisms.
10. References
10.1. Normative References
[1] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol Label
Switching Architecture", RFC 3031, January 2001.
[2] Bryant, S., Swallow, G., Martini, L., and D. McPherson,
"Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for Use
over an MPLS PSN", RFC 4385, February 2006.
[3] Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G. Heron,
"Pseudowire Setup and Maintenance Using the Label Distribution
Protocol (LDP)", RFC 4447, April 2006.
[4] Nadeau, T. and C. Pignataro, "Pseudowire Virtual Circuit
Connectivity Verification (VCCV): A Control Channel for
Pseudowires", RFC 5085, December 2007.
10.2. Informative References
[5] Bryant, S., Filsfils, C., and U. Drafz, "Load Balancing Fat MPLS
Pseudowires", draft-bryant-filsfils-fat-pw-02 (work in
progress), July 2008.
Authors' Addresses
Yaakov (Jonathan) Stein
RAD Data Communications
24 Raoul Wallenberg St., Bldg C
Tel Aviv 69719
ISRAEL
Phone: +972 3 645-5389
Email: yaakov_s@rad.com
Itai Mendelsohn
RAD Data Communications
24 Raoul Wallenberg St., Bldg C
Tel Aviv 69719
ISRAEL
Phone: +972 3 645-5761
Email: itai_m@rad.com
Ron Insler
RAD Data Communications
24 Raoul Wallenberg St., Bldg C
Tel Aviv 69719
ISRAEL
Phone: +972 3 645-5445
Email: ron_i@rad.com
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).