PWE3                                                      S. Bryant, Ed.
Internet-Draft                                               C. Filsfils
Intended status: Standards Track                           Cisco Systems
Expires: January 10, 2009                                       U. Drafz
                                                        Deutsche Telekom
                                                            July 9, 2008


                  Load Balancing Fat MPLS Pseudowires
                    draft-bryant-filsfils-fat-pw-02

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 10, 2009.

Abstract

   Where the payload carried over a pseudowire contains a number of
   identifiable flows, it can in some circumstances be desirable to
   carry those flows over the equal cost multiple paths (ECMP) that
   exist in the packet switched network.  This draft describes a method
   of identifying the flows, or flow groups, to the label switched
   routers by including an additional label in the label stack.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",



Bryant, et al.          Expires January 10, 2009                [Page 1]


Internet-Draft                  LB-fat-pw                      July 2008


   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC2119 [1].


Table of Contents

   1.  Introduction
     1.1.  ECMP in Label Switched Routers
     1.2.  Load Balance Label
   2.  Native Service Processing Function
   3.  Pseudowire Forwarder
     3.1.  Encapsulation
   4.  Signaling Load Balance Label
     4.1.  Structure of Load Balance Label TLV
   5.  OAM
   6.  Applicability
   7.  Security Considerations
   8.  IANA Considerations
   9.  Congestion Considerations
   10. Acknowledgements
   11. References
     11.1. Normative References
     11.2. Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements


1.  Introduction

   A pseudowire is defined as a mechanism that carries the essential
   elements of an emulated service from one provider edge (PE) to one or
   more other PEs over a packet switched network (PSN) [10].

   A pseudowire is normally transported over one single network path,
   even if multiple Equal Cost Multiple Paths (ECMP) exist between the
   ingress and egress PEs [2] [3].  This is required to preserve the
   characteristics of the emulated service (e.g. to avoid misordering
   of SAToP pseudowire packets [4]).  Except in the extreme case
   described in Section 6, the new capability proposed in this draft
   does not change this default property of pseudowires.

   Some pseudowires are used to transport large volumes of IP traffic
   between routers at two locations.  One example of this is the use of
   an Ethernet pseudowire to create a virtual direct link between a pair
   of routers.  Such pseudowires may carry from hundreds of Mbps to
   Gbps of traffic.  Such pseudowires do not require strict ordering to
   be preserved between packets of the pseudowire.  They only require
   ordering to be preserved within the context of each individual
   transported IP flow.  Some operators have requested the ability to
   explicitly configure such a pseudowire to leverage the availability
   of multiple ECMP paths.  This allows for better capacity planning as
   the statistical multiplexing of a larger number of smaller flows is
   more efficient than with a smaller set of larger flows.  Although
   Ethernet is used as an example above, the mechanisms described in
   this draft are general mechanisms that may be applied to any
   pseudowire type in which there are identifiable flows, and in which
   there is no requirement to preserve the order between flows.

1.1.  ECMP in Label Switched Routers

   Label switched routers commonly hash the label stack or some elements
   of the label stack as a method of discriminating between flows, in
   order to distribute those flows over the available equal cost
   multiple paths that exist in the network.  Since the label at the
   bottom of stack is usually the label most closely associated with the
   flow, this normally provides the greatest entropy and hence is
   normally included in the hash.  This draft describes a method of
   adding an additional label at the bottom of stack in order to
   facilitate the load balancing of the flows within a pseudowire over
   the available ECMPs.  A similar design for general MPLS use has also
   been proposed [11].
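
   The hashing behaviour described above can be illustrated with a
   short Python sketch.  It is not taken from any LSR implementation:
   the function name select_ecmp_path, the use of CRC-32 and the choice
   to hash every label in the stack are assumptions made purely for
   illustration.  It shows that packets carrying the same label stack
   always select the same path, while a different bottom-of-stack label
   may select a different one.

```python
import zlib


def select_ecmp_path(label_stack, num_paths):
    """Return an ECMP path index chosen by hashing the label values.

    label_stack is a list of 20-bit label values, top of stack first.
    Hashing the whole stack with CRC-32 is an illustrative assumption;
    real LSRs use vendor-specific hash inputs and functions.
    """
    # Serialise each label value as 3 octets so the hash input is stable.
    key = b"".join(label.to_bytes(3, "big") for label in label_stack)
    return zlib.crc32(key) % num_paths


# Two flows in the same pseudowire: same tunnel and PW labels, but
# different (hypothetical) bottom-of-stack label values.
flow_a = [100, 2001, 70001]
flow_b = [100, 2001, 70002]

path_a = select_ecmp_path(flow_a, 4)
path_b = select_ecmp_path(flow_b, 4)
print(path_a, path_b)
```

   Because the only input that differs between the two flows is the
   bottom label, any spreading over the four paths in this sketch comes
   from that label alone.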

   An alternative method of load balancing by creating a number of
   pseudowires and distributing the flows amongst them was considered,
   but was rejected because:

   o  It did not introduce as much entropy as the load balance label
      method.

   o  It required additional pseudowires to be set up and maintained.

1.2.  Load Balance Label

   An additional label is interposed between the pseudowire label and
   the control word, or if the control word is not present, between the
   pseudowire label and the pseudowire payload.  This additional label
   is called the pseudowire load balancing label (LB label).
   Indivisible flows within the pseudowire MUST be mapped to the same
   pseudowire LB label by the ingress PE.  The pseudowire load balancing
   label stimulates the correct ECMP load balancing behaviour in the
   PSN.  On receipt of the pseudowire packet at the egress PE (which
   knows this additional label is present) the label is discarded
   without processing.

   Note that the LB label MUST NOT be an MPLS reserved label [5], but is
   otherwise unconstrained by the protocol.

   To ensure that the load balance label is not inadvertently used for
   forwarding, the load balance label MUST have a TTL of 0.
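
   The constraints above can be sketched as a label stack entry encoder
   in Python.  The 32-bit layout (20-bit label, 3-bit experimental
   field, 1-bit bottom-of-stack flag, 8-bit TTL) follows [5]; the
   function names are hypothetical, and setting the bottom-of-stack bit
   assumes, as described in Section 1.1, that the LB label is the last
   label in the stack.

```python
import struct

RESERVED_LABEL_MAX = 15  # label values 0-15 are reserved in [5]


def encode_lse(label, exp=0, bottom=False, ttl=0):
    """Encode one MPLS label stack entry as 4 octets (network order)."""
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | ((exp & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def encode_lb_label(label):
    """Encode a load balance label: not reserved, bottom of stack, TTL 0."""
    if label <= RESERVED_LABEL_MAX:
        raise ValueError("LB label MUST NOT be an MPLS reserved label")
    return encode_lse(label, bottom=True, ttl=0)


print(encode_lb_label(16).hex())  # prints 00010100
```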


2.  Native Service Processing Function

   The Native Service Processing (NSP) function is a component of a PE
   that has knowledge of the structure of the emulated service and is
   able to take action on the service outside the scope of the
   pseudowire.  In this case it is required that the NSP in the ingress
   PE identify flows, or groups of flows within the service, and
   indicate the flow (group) identity of each packet as it is passed to
   the pseudowire forwarder.  Since this is an NSP function, by
   definition, the method used to identify a flow is outside the scope
   of the pseudowire design.  Similarly, since the NSP is internal to
   the PE, the method of flow indication to the pseudowire forwarder is
   outside the scope of this document.


3.  Pseudowire Forwarder

   The pseudowire forwarder must be provided with a method of mapping
   flows to load balanced paths.

   The forwarder must generate a label for the flow or group of flows.
   How the load balance label values are determined is outside the scope
   of this document; however, the load balance label allocated to a flow
   MUST NOT be an MPLS reserved label and SHOULD remain constant.  It is
   recommended that the method chosen to generate the load balancing
   labels introduces a high degree of entropy in their values, to
   maximise the entropy presented to the ECMP path selection mechanism
   in the LSRs in the PSN, and hence distribute the flows as evenly as
   possible over the available PSN ECMP paths.  The forwarder at the
   ingress PE prepends the pseudowire control word (if applicable), then
   either prepends the pseudowire load balancing label followed by the
   pseudowire label or, alternatively, selects and prepends one of the
   allocated pseudowire labels.

   The forwarder at the egress PE uses the pseudowire label to identify
   the pseudowire.  If the label block approach is used, operation is
   identical to the current non-load-balanced case.  Alternatively, from
   the pseudowire context, the egress PE can determine whether a
   pseudowire load balancing label is present, and if one is present,
   the label is discarded.

   All other pseudowire forwarding operations are unmodified by the
   inclusion of the pseudowire load balancing label.
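
   The egress behaviour for the load balance label case can be sketched
   as follows.  This is an illustration only: the PwContext record, the
   contexts table keyed by pseudowire label and the function name are
   hypothetical, and the sketch assumes that the PSN tunnel labels have
   already been removed so that the pseudowire label is first in the
   packet.

```python
import struct
from typing import NamedTuple


class PwContext(NamedTuple):
    """Per-pseudowire state known to the egress PE (hypothetical)."""
    lb_label_present: bool
    control_word_present: bool


def egress_decapsulate(packet, contexts):
    """Return (pw_label, control_word, payload) for a received packet."""
    (pw_entry,) = struct.unpack_from("!I", packet, 0)
    pw_label = pw_entry >> 12          # top 20 bits carry the label value
    ctx = contexts[pw_label]           # the PW label identifies the PW
    offset = 4
    if ctx.lb_label_present:
        offset += 4                    # LB label is discarded unprocessed
    control_word = None
    if ctx.control_word_present:
        control_word = packet[offset:offset + 4]
        offset += 4
    return pw_label, control_word, packet[offset:]


contexts = {2001: PwContext(lb_label_present=True, control_word_present=True)}
packet = bytes.fromhex("007d1002" "11171100" "00000000") + b"frame"
print(egress_decapsulate(packet, contexts))
```

   Note that the value of the LB label is never examined: only the
   pseudowire context determines whether four octets are skipped.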

3.1.  Encapsulation

   The PWE3 Protocol Stack Reference Model, modified to include the
   pseudowire LB label, is shown in Figure 1 below.

      +-------------+                                +-------------+
      |  Emulated   |                                |  Emulated   |
      |  Ethernet   |                                |  Ethernet   |
      | (including  |         Emulated Service       | (including  |
      |  VLAN)      |<==============================>|  VLAN)      |
      |  Services   |                                |  Services   |
      +-------------+                                +-------------+
      | Load balance|                                | Load balance|
      +-------------+            Pseudowire          +-------------+
      |Demultiplexer|<==============================>|Demultiplexer|
      +-------------+                                +-------------+
      |    PSN      |            PSN Tunnel          |    PSN      |
      |   MPLS      |<==============================>|   MPLS      |
      +-------------+                                +-------------+
      |  Physical   |                                |  Physical   |
      +-----+-------+                                +-----+-------+


               Figure 1: PWE3 Protocol Stack Reference Model

   The encapsulation of a pseudowire with a pseudowire LB label is shown
   in Figure 2 below.

    +-------------------------------+
    |      MPLS Tunnel label(s)     | n*4 octets (four octets per label)
    +-------------------------------+
    |      PW label                 |  4 octets
    +-------------------------------+
    |      Load Balance label       |  4 octets
    +-------------------------------+
    |   Optional Control Word       |  4 octets
    +-------------------------------+
    |            Payload            |
    |                               |
    |                               |  n octets
    |                               |
    +-------------------------------+


      Figure 2: Encapsulation of a pseudowire with a pseudowire load
                              balancing label
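
   The layout in Figure 2 can be expressed as a short Python sketch of
   the ingress encapsulation.  The function names and the TTL of 255
   used for the tunnel and pseudowire labels are assumptions for
   illustration; only the ordering of the fields, the bottom-of-stack
   position of the LB label and its TTL of 0 are taken from this
   document.

```python
import struct


def _lse(label, bottom, ttl):
    """Encode one label stack entry; the experimental bits are left 0."""
    return struct.pack("!I", (label << 12) | (int(bottom) << 8) | ttl)


def ingress_encapsulate(tunnel_labels, pw_label, lb_label, payload,
                        control_word=None, ttl=255):
    """Build tunnel label(s), PW label, LB label, optional CW, payload."""
    if control_word is not None and len(control_word) != 4:
        raise ValueError("control word must be 4 octets")
    stack = [_lse(label, False, ttl) for label in tunnel_labels]
    stack.append(_lse(pw_label, False, ttl))  # no longer bottom of stack
    stack.append(_lse(lb_label, True, 0))     # bottom of stack, TTL 0
    header = b"".join(stack)
    if control_word is not None:
        header += control_word
    return header + payload


pkt = ingress_encapsulate([100], 2001, 70001, b"frame",
                          control_word=bytes(4))
print(len(pkt), pkt[:16].hex())
```

   With one tunnel label and a control word, the header in this example
   occupies 16 octets, four more than the same pseudowire without the
   LB label.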


4.  Signaling Load Balance Label

   When using the signalling procedures in [6], there is a Pseudowire
   Interface Parameter Sub-TLV type used to signal the desire to include
   the load balance label when advertising a VC label.

   The presence of this parameter indicates that the egress PE requests
   that the ingress PE place a load balance label between the pseudowire
   label and the control word (or, if the control word is not present,
   between the pseudowire label and the pseudowire payload).

   If the ingress PE recognises the load balance label indicator
   parameter but does not wish to include the load balance label, it
   need only issue its own label mapping message for the opposite
   direction without including the load balance label indicator.  This
   will prevent inclusion of the load balance label in either direction.

   If PWE3 signalling [6] is not in use for a pseudowire, then whether
   the load balance label is used MUST be identically provisioned in
   both PEs at the pseudowire endpoints.  If there is no provisioning
   support for this option, the default behaviour is not to include the
   load balance label.

   Note that what is signalled is the desire to include the load balance
   label in the label stack.  The value of the label is a local matter
   for the ingress PE, and the label value itself is not signalled.
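
   One reading of the signalled case above is that the load balance
   label is used only when both PEs include the indicator in their
   label mapping messages.  The following Python sketch captures that
   reading; the function and parameter names are hypothetical.

```python
def lb_label_in_use(local_sent_indicator, remote_sent_indicator):
    """Return True if the LB label is included on this pseudowire.

    Omitting the indicator in either direction prevents inclusion of
    the LB label in both directions.
    """
    return bool(local_sent_indicator and remote_sent_indicator)


for local in (False, True):
    for remote in (False, True):
        print(local, remote, lb_label_in_use(local, remote))
```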

4.1.  Structure of Load Balance Label TLV

   The structure of the load balance label TLV is shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | LBL           |    Length     |    must be zero               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 3: Load Balance Label TLV

   Where:

   o  LBL is the load balance label TLV identifier assigned by IANA.

   o  Length is the length of the TLV in octets and is 4.
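
   The TLV in Figure 3 can be encoded and recognised with the Python
   sketch below.  The type value is not yet assigned, so
   PLACEHOLDER_LBL_TYPE is a made-up value used only to make the
   example runnable; the function names are likewise hypothetical.
   The sketch does not reject a non-zero reserved field on receipt,
   since this document does not specify receiver behaviour for it.

```python
import struct

PLACEHOLDER_LBL_TYPE = 0x7F  # hypothetical; the real value is TBD by IANA
LBL_TLV_LENGTH = 4           # total TLV length in octets


def encode_lb_label_tlv(lbl_type=PLACEHOLDER_LBL_TYPE):
    """Encode type (1 octet), length (1 octet), must-be-zero (2 octets)."""
    return struct.pack("!BBH", lbl_type, LBL_TLV_LENGTH, 0)


def is_lb_label_tlv(data, lbl_type=PLACEHOLDER_LBL_TYPE):
    """Return True if data starts with a load balance label TLV."""
    if len(data) < LBL_TLV_LENGTH:
        return False
    tlv_type, length, _reserved = struct.unpack("!BBH", data[:LBL_TLV_LENGTH])
    return tlv_type == lbl_type and length == LBL_TLV_LENGTH


tlv = encode_lb_label_tlv()
print(tlv.hex(), is_lb_label_tlv(tlv))
```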


5.  OAM

   The following OAM considerations apply to this method of load
   balancing.

   Where the OAM is only to be used to perform a basic test that the
   pseudowires have been configured at the PEs, VCCV [12] messages may
   be sent using any load balance pseudowire path, i.e. over any of the
   multiple pseudowire labels, or using any pseudowire load balance
   label.

   Where it is required to verify that a pseudowire is fully functional
   for all flows, VCCV [12] connection verification messages MUST be
   sent over each ECMP path to the pseudowire egress PE.  This problem
   is difficult to solve and scales poorly.  We believe that this
   problem is addressed by the following two methods:

   1.  If a failure occurs within the PSN, this failure will normally be
       detected by the PSN's IGP (link or node failure detected through
       link-layer, BFD or IGP hello mechanisms), and the IGP convergence
       will naturally modify the ECMP set of network paths between the
       ingress and egress PEs.  Hence the PW is only impacted during the
       normal IGP convergence time.

   2.  If the failure is related to the individual corruption of an LFIB
       entry in a router, then only the network path using that specific
       entry is impacted.  If the PW is load balanced over multiple
       network paths, then this failure can only be detected if, by
       chance, the transported OAM flow is mapped onto the impacted
       network path, or all paths are tested.  This type of error may be
       better solved by other means such as LSP self test [13].

   To troubleshoot the MPLS PSN, including multiple paths, the
   techniques described in [7] and [8] can be used.


6.  Applicability

   The requirement to load-balance over multiple PSN paths occurs when
   the ratio between the PW access speed and the PSN's core link
   bandwidth is large (e.g. >= 0.1).  ATM and FR are unlikely to meet
   this property.  Ethernet does, and this is the reason why this
   document focuses on Ethernet.  Applications for other
   high-access-bandwidth PWs (e.g. Fibre Channel) may be defined in the
   future.

   This design applies to MPLS pseudowires where it is meaningful to
   deconstruct the packets presented to the ingress PE into flows.  The
   mechanism described in this document promotes the distribution of
   flows within the pseudowire over different network paths.  This in
   turn means that whilst packets within a flow are delivered in order
   (subject to normal IP delivery perturbations due to topology
   variation), order is not maintained amongst packets of different
   flows.  It is not proposed to associate a different sequence number
   with each flow.  If sequence number support is required this
   mechanism is not applicable.

   Where it is known that the traffic carried by the Ethernet pseudowire
   is IP, the methods of identifying the flows are well known and can be
   applied.  Such methods typically include hashing on the source and
   destination addresses, the protocol ID and higher-layer
   flow-dependent fields such as TCP/UDP ports, L2TPv3 Session IDs, etc.
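
   A minimal Python sketch of such a classification is given below.  It
   maps the IP 5-tuple of a packet to a load balance label value.  The
   function name, the use of SHA-256 and the way the hash is folded
   into the label space are all assumptions; this document leaves the
   method to the implementation and requires only that a flow always
   maps to the same label and that reserved labels are avoided.

```python
import hashlib

FIRST_UNRESERVED_LABEL = 16  # label values 0-15 are reserved in [5]
LABEL_SPACE = 1 << 20        # labels are 20 bits wide


def lb_label_for_ip_flow(src_ip, dst_ip, protocol, src_port=0, dst_port=0):
    """Derive a stable, non-reserved LB label from an IP flow identity."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:4], "big")
    usable = LABEL_SPACE - FIRST_UNRESERVED_LABEL
    return FIRST_UNRESERVED_LABEL + value % usable


label_1 = lb_label_for_ip_flow("192.0.2.1", "198.51.100.7", 6, 49152, 443)
label_2 = lb_label_for_ip_flow("192.0.2.1", "198.51.100.7", 6, 49153, 443)
print(label_1, label_2)
```

   Every packet of a given TCP session yields the same label, so
   ordering within the flow is preserved, while parallel sessions
   between the same hosts will normally receive different labels.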

   Where it is known that the traffic carried by the Ethernet pseudowire
   is non-IP, techniques used for link bundling between Ethernet
   switches may be reused.  In this case however the latency
   distribution would be larger than is found in the link bundle case.
   The acceptability of the increased latency is for further study.  Of
   particular importance, Ethernet control frames SHOULD always be
   mapped to the same PSN path to ensure in-order delivery.
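
   For non-IP traffic, one possible approach is sketched below in
   Python.  It hashes the destination and source MAC addresses, as is
   commonly done for link bundles, and pins control frames to a single
   label so that they follow one PSN path.  Treating frames addressed
   to the IEEE reserved group addresses 01-80-C2-00-00-00 to
   01-80-C2-00-00-0F as control frames, the fixed CONTROL_FRAME_LABEL
   value and the use of CRC-32 are assumptions made for illustration
   and are not defined by this document.

```python
import zlib

CONTROL_FRAME_LABEL = 16    # hypothetical fixed label for control frames
FIRST_DATA_LABEL = 17       # keeps data flows off the control label
LABEL_SPACE = 1 << 20       # labels are 20 bits wide
RESERVED_MAC_PREFIX = bytes.fromhex("0180c20000")


def is_control_frame(dst_mac):
    """Assumed test: destination is an IEEE reserved group address."""
    return dst_mac[:5] == RESERVED_MAC_PREFIX and dst_mac[5] <= 0x0F


def lb_label_for_frame(dst_mac, src_mac):
    """Map an Ethernet frame to an LB label using its MAC addresses."""
    if is_control_frame(dst_mac):
        return CONTROL_FRAME_LABEL
    usable = LABEL_SPACE - FIRST_DATA_LABEL
    return FIRST_DATA_LABEL + zlib.crc32(dst_mac + src_mac) % usable


stp_dst = bytes.fromhex("0180c2000000")
host_a = bytes.fromhex("020000000001")
host_b = bytes.fromhex("020000000002")
print(lb_label_for_frame(stp_dst, host_a), lb_label_for_frame(host_b, host_a))
```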

   If the payload of an Ethernet PW is made of a single inner flow (e.g.
   an encrypted connection between two routers), then the functionality
   described in this document does not give any benefits, though neither
   does it give any drawbacks.  This is unlikely to be a show-stopper
   for two reasons:

   o  Firstly, the customer of a high-bandwidth PW service has an
      incentive to get the best transport service because an inefficient
      use of the PSN leads to jitter and eventually to loss of the PW's
      payload.

   o  Secondly, the customer is usually able to tailor their
      applications to generate many flows in the PSN.  A well-known
      example is massive data transport between servers which use many
      parallel TCP sessions.  This same technique can be used by any
      transport protocol: multiple UDP ports, multiple L2TPv3 Session
      IDs, multiple GRE keys may be used to decompose a large flow into
      smaller components.  This approach may be applied to IPsec where
      multiple SPIs may be allocated to the same security association.

   A node within the PSN is not able to perform deep-packet-inspection
   (DPI) of the PW as the PW technology is not self-describing: the
   structure of the PW payload is only known to the ingress and egress
   PE devices.  The two methods proposed in this document solve this
   limitation.

   The methods described in this document are transparent to the PSN and
   as such do not require any new capability from the PSN.


7.  Security Considerations

   The pseudowire generic security considerations described in [10] and
   the security considerations applicable to a specific pseudowire type
   (for example, in the case of an Ethernet pseudowire, [9]) apply.

   The ingress PE should take steps to ensure that the load-balance
   label is not used as a covert channel.


8.  IANA Considerations

   IANA is requested to allocate the next available value from the IETF
   Consensus range in the Pseudowire Interface Parameters Sub-TLV type
   Registry as a Load Balance Label indicator.

   Parameter   Length   Description

   TBD         4        Load Balancing Label


9.  Congestion Considerations

   The congestion considerations applicable to pseudowires as described
   in [10] and any additional congestion considerations developed at the
   time of publication apply to this design.

   The ability to explicitly configure a PW to leverage the availability
   of multiple ECMP paths is beneficial to capacity planning as, all
   other parameters being constant, the statistical multiplexing of a
   larger number of smaller flows is more efficient than with a smaller
   number of larger flows.

   Note that if the classification into flows is only performed on IP
   packets the behaviour of those flows in the face of congestion will
   be as already defined by the IETF for packets of that type and no
   additional congestion processing is required.

   Where flows that are not IP are classified, pseudowire congestion
   avoidance must be applied to each non-IP load balance group.


10.  Acknowledgements

   The authors wish to thank Joerg Kuechemann, Wilfried Maas, Luca
   Martini, Mark Townsley, Kireeti Kompella and Shane Amante for
   valuable comments and contributions to this design.


11.  References

11.1.  Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [2]   Bryant, S., Swallow, G., Martini, L., and D. McPherson,
         "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for Use
         over an MPLS PSN", RFC 4385, February 2006.

   [3]   Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal Cost
         Multipath Treatment in MPLS Networks", BCP 128, RFC 4928,
         June 2007.

   [4]   Vainshtein, A. and YJ. Stein, "Structure-Agnostic Time Division
         Multiplexing (TDM) over Packet (SAToP)", RFC 4553, June 2006.

   [5]   Rosen, E., Tappan, D., Fedorkow, G., Rekhter, Y., Farinacci,
         D., Li, T., and A. Conta, "MPLS Label Stack Encoding",
         RFC 3032, January 2001.

   [6]   Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G. Heron,
         "Pseudowire Setup and Maintenance Using the Label Distribution
         Protocol (LDP)", RFC 4447, April 2006.

   [7]   Allan, D. and T. Nadeau, "A Framework for Multi-Protocol Label
         Switching (MPLS) Operations and Management (OAM)", RFC 4378,
         February 2006.

   [8]   Kompella, K. and G. Swallow, "Detecting Multi-Protocol Label
         Switched (MPLS) Data Plane Failures", RFC 4379, February 2006.

   [9]   Martini, L., Rosen, E., El-Aawar, N., and G. Heron,
         "Encapsulation Methods for Transport of Ethernet over MPLS
         Networks", RFC 4448, April 2006.

11.2.  Informative References

   [10]  Bryant, S. and P. Pate, "Pseudo Wire Emulation Edge-to-Edge
         (PWE3) Architecture", RFC 3985, March 2005.

   [11]  Kompella, K. and S. Amante, "The Use of Entropy Labels in MPLS
         Forwarding", draft-kompella-mpls-entropy-label-00 (work in
         progress), July 2008.

   [12]  Nadeau, T. and C. Pignataro, "Pseudowire Virtual Circuit
         Connectivity Verification (VCCV): A Control Channel for
         Pseudowires", draft-ietf-pwe3-vccv-15 (work in progress),
         September 2007.

   [13]  Swallow, G., "Label Switching Router Self-Test",
         draft-ietf-mpls-lsr-self-test-07 (work in progress), May 2007.


Authors' Addresses

   Stewart Bryant (editor)
   Cisco Systems
   250 Longwater Ave
   Reading  RG2 6GB
   United Kingdom

   Phone: +44-208-824-8828
   Email: stbryant@cisco.com


   Clarence Filsfils
   Cisco Systems
   Brussels
   Belgium

   Email: cfilsfil@cisco.com


   Ulrich Drafz
   Deutsche Telekom
   Muenster
   Germany

   Email: Ulrich.Drafz@t-com.net


Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.