BESS                                                            Z. Zhang
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                                K. Patel
Expires: April 18, 2016                                    Cisco Systems
                                                        October 16, 2015


                          BGP Based Multicast
                   draft-zzhang-bess-bgp-multicast-00

Abstract

   This document describes multicast signaling based on Border Gateway
   Protocol (BGP).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC2119.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 18, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect



Zhang & Patel            Expires April 18, 2016                 [Page 1]


Internet-Draft                  bgp-mcast                   October 2015


   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Motivation  . . . . . . . . . . . . . . . . . . . . . . .   2
     1.2.  Overview  . . . . . . . . . . . . . . . . . . . . . . . .   3
       1.2.1.  BGP Sessions  . . . . . . . . . . . . . . . . . . . .   3
       1.2.2.  LAN and Parallel Links  . . . . . . . . . . . . . . .   4
       1.2.3.  Source Discovery for ASM  . . . . . . . . . . . . . .   5
       1.2.4.  Bidirectional Trees . . . . . . . . . . . . . . . . .   6
       1.2.5.  Transition  . . . . . . . . . . . . . . . . . . . . .   6
   2.  Specification . . . . . . . . . . . . . . . . . . . . . . . .   7
     2.1.  BGP NLRIs and Attributes  . . . . . . . . . . . . . . . .   7
       2.1.1.  S-PMSI A-D Route  . . . . . . . . . . . . . . . . . .   7
       2.1.2.  Source Active A-D Route . . . . . . . . . . . . . . .   8
     2.2.  Procedures  . . . . . . . . . . . . . . . . . . . . . . .   8
       2.2.1.  Originating Tree Join Routes  . . . . . . . . . . . .   8
       2.2.2.  Receiving Tree Join Routes  . . . . . . . . . . . . .   9
       2.2.3.  Originating S-PMSI A-D Routes . . . . . . . . . . . .  10
       2.2.4.  Receiving S-PMSI A-D Routes . . . . . . . . . . . . .  10
   3.  Security Considerations . . . . . . . . . . . . . . . . . . .  11
   4.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  11
   5.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  11
     5.1.  Normative References  . . . . . . . . . . . . . . . . . .  11
     5.2.  Informative References  . . . . . . . . . . . . . . . . .  12
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  12

1.  Introduction

1.1.  Motivation

   Protocol Independent Multicast (PIM) has been the prevailing
   multicast protocol for many years.  Despite its success, it has two
   drawbacks:

   o  Complexity arising from the RPT/SPT switchover and the data-driven
      nature of PIM-ASM.

   o  Periodic protocol state refreshes due to its soft-state nature.

   While PIM-SSM removes the complexity of PIM-ASM, there has not been
   a good way of discovering sources, limiting its deployment.  PIM-Port
   (PIM over Reliable Transport) solves the soft state issue, though its
   deployment has also been limited.



   Partly because of the complexity concern, some Data Center operators
   have been avoiding deploying multicast in their networks.

   BGP-MVPN [RFC 6514] uses BGP to signal VPN customer multicast state
   over provider networks.  It removes the above-mentioned problems, and
   the deployment experience has been encouraging.
   [draft-ietf-bess-mvpn-pe-ce] adapts the concept of BGP-MVPN to PE-CE
   links, and this document extends it further to general topologies, so
   that it can be deployed in any network where BGP is running, or can
   be run, throughout or on most routers.  One target deployment would
   be a Data Center that requires multicast and that uses BGP as its
   only routing protocol.

1.2.  Overview

   In a nutshell, this is PIM with BGP based join/prune signaling, plus
   BGP based source discovery in case of ASM.  The same RPF procedures
   as in PIM are used for each router to determine the RPF neighbor for
   a particular source or RPA (in case of Bidirectional Tree).  Except
   in the Bidirectional Tree case, no (*,G) join is used - LHRs
   discover the sources for ASM and then join towards the sources
   directly.  Data driven mechanisms like PIM Assert are replaced by
   control driven mechanisms (Section 1.2.2).

   The joins are carried in BGP Updates with CMCAST SAFI and types of
   routes as defined in [draft-ietf-bess-mvpn-pe-ce].  CMCAST NLRIs are
   targeted at the upstream neighbor by use of Route Targets.

1.2.1.  BGP Sessions

   As specified in [draft-ietf-bess-mvpn-pe-ce-00], in order for two BGP
   speakers to exchange C-MCAST NLRI, they must use BGP Capabilities
   Advertisement [RFC5492] to ensure that they both are capable of
   properly processing the C-MCAST NLRI.  This is done as specified in
   [RFC4760], by using a capability code 1 (multiprotocol BGP) with an
   AFI of IPv4 (1) or IPv6 (2) and a SAFI of C-MCAST with a value to be
   assigned by IANA.
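
   As an illustration only, the capability check described above can be
   sketched as follows.  The SAFI value used here is an arbitrary
   placeholder (the real C-MCAST SAFI is yet to be assigned by IANA),
   and the function names are hypothetical.  The encoding follows the
   capability layout of [RFC5492] and the AFI/Reserved/SAFI value of
   [RFC4760].

```python
import struct

CAP_MULTIPROTOCOL = 1          # capability code 1 (multiprotocol BGP)
AFI_IPV4 = 1
AFI_IPV6 = 2
SAFI_CMCAST_PLACEHOLDER = 254  # hypothetical; the real value is TBD by IANA


def mp_capability(afi, safi):
    """Encode one capability: code, length, then AFI(2)/Reserved(1)/SAFI(1)."""
    value = struct.pack("!HBB", afi, 0, safi)
    return struct.pack("!BB", CAP_MULTIPROTOCOL, len(value)) + value


def peers_share_family(local_caps, remote_caps, afi, safi):
    """C-MCAST NLRI may be exchanged only if both speakers advertised it."""
    wanted = mp_capability(afi, safi)
    return wanted in local_caps and wanted in remote_caps


if __name__ == "__main__":
    mine = [mp_capability(AFI_IPV4, SAFI_CMCAST_PLACEHOLDER)]
    theirs = [mp_capability(AFI_IPV4, 1)]  # peer only offers IPv4 unicast
    print(peers_share_family(mine, theirs, AFI_IPV4, SAFI_CMCAST_PLACEHOLDER))
```

   A real implementation would parse the capabilities out of the OPEN
   message; the lists of encoded capabilities above stand in for that
   step.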

   How the BGP peer sessions are provisioned, whether EBGP or IBGP,
   whether statically, automatically (e.g., based on IGP neighbor
   discovery), or programmably via an external controller, is outside
   the scope of this document.

   In case of IBGP, every router could peer with Route Reflectors, or
   hop by hop IBGP sessions could be used to exchange CMCAST NLRIs for
   joins.  In the latter case, unless desired otherwise for reasons
   outside of the scope of this document, the hop by hop IBGP sessions
   MUST only be used to exchange CMCAST NLRIs.



   FHRs and LHRs also establish BGP sessions to some Route Reflectors
   for source discovery purposes (Section 1.2.3).

   With traditional PIM, the FHRs and LHRs refer to the PIM DRs on the
   source or receiver networks.  With BGP based multicast, PIM may not
   be running at all, and the FHRs and LHRs refer to the IGMP/MLD
   queriers in that case.

1.2.2.  LAN and Parallel Links

   There could be parallel links between two BGP peers.  A single multi-
   hop session, whether IBGP or EBGP, between loopback addresses may be
   used.  Except for LAN interfaces, any link between the two peers can
   be automatically used by a downstream peer to receive traffic from
   the upstream peer, and it is for the upstream peer to decide which
   link to use.  If one of the links goes down, the upstream peer
   switches to a different link and there is no change needed on the
   downstream peer.

   The upstream peer MAY prefer LAN interfaces to send traffic, since
   multiple downstream peers may be reached simultaneously, or it may
   make a decision based on local policy, e.g., for load balancing
   purposes.  Because different downstream peers might choose different
   upstream peers for RPF, when an upstream peer decides to use a LAN
   interface to send traffic, it originates an S-PMSI A-D route
   indicating that one or more LAN interfaces will be used.  The route
   carries Route Targets specific to the LANs so that all the peers on
   the LANs import the route.  If more than one router originates the
   route specifying the same LAN for the same (s,g) or (*,g) flow, then
   an assert procedure based on the S-PMSI A-D routes takes place and
   the assert losers stop sending traffic to the LAN.

   In this multihop session case, there needs to be a way to determine
   if two peers are directly connected, so that traffic can be sent
   natively when possible or tunneled when necessary.  Advertising
   attached interface addresses, like LDP does, could be one way.  Those
   advertisements can be limited to peers that are directly connected by
   use of Route Targets.  More details may be provided in a future
   revision, pending further consideration.

   Alternatively, multiple single-hop sessions between interface
   addresses, whether IBGP or EBGP, can be used.  This is especially
   suitable in DC scenarios.








1.2.3.  Source Discovery for ASM

   This document does not support ASM via shared trees (aka RP Tree, or
   RPT).  Instead, FHRs, RPs, and LHRs work together to propagate/
   discover source information via the control plane, and LHRs join
   source specific Shortest Path Trees (SPT) directly.

   The RPs are just Route Reflectors.  Multicast data traffic does not
   necessarily go through them, and redundancy can be easily achieved by
   having multiple RRs.  They do not participate in any multicast
   specific procedures, other than redistributing Source Active A-D
   routes.  An FHR originates Source Active A-D routes upon discovering
   sources for particular flows and advertises them to the RRs, carrying
   an IPv4 or IPv6 address specific Route Target.  The Global
   Administrator field is set to the group address of the flow, and the
   Local Administrator field is set to 0.  An LHR originates Route
   Target Constraint routes towards the RRs, with the Route Target field
   in the NLRI set accordingly, for the groups it wants to receive
   traffic for.  That way, the RRs maintain all source information but
   only distribute it to interested LHRs on demand.
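
   The on-demand distribution described above can be sketched as
   follows.  This is a toy model, not an implementation: the class and
   method names are hypothetical, Route Targets are modeled as
   (Global Administrator, Local Administrator) tuples rather than wire
   encodings, and BGP transport is replaced by return values.

```python
class RouteReflectorModel:
    """Toy RR: stores Source Active routes, hands them out per group RT."""

    def __init__(self):
        self.sa_routes = {}   # RT -> set of (source, group)
        self.interest = {}    # RT -> set of LHR names that sent an RTC route

    @staticmethod
    def group_rt(group):
        # Global Administrator = group address, Local Administrator = 0.
        return (group, 0)

    def receive_source_active(self, source, group):
        """An FHR advertises (source, group); return the LHRs to notify."""
        rt = self.group_rt(group)
        self.sa_routes.setdefault(rt, set()).add((source, group))
        return sorted(self.interest.get(rt, set()))

    def receive_rtc(self, lhr, group):
        """An LHR signals interest; return the SA routes to send to it."""
        rt = self.group_rt(group)
        self.interest.setdefault(rt, set()).add(lhr)
        return sorted(self.sa_routes.get(rt, set()))


if __name__ == "__main__":
    rr = RouteReflectorModel()
    rr.receive_source_active("10.0.0.1", "239.1.1.1")
    print(rr.receive_rtc("lhr1", "239.1.1.1"))
```

   Note that the RR keeps every Source Active route it learns but
   returns nothing to an LHR whose Route Target Constraint does not
   match.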

   Because the RPs are only used for distributing SA routes and not as
   data rendezvous points, a small number of them is enough and there
   is no need to have different RPs for different groups.  As a result,
   static configuration is sufficient - no need for dynamic RP learning
   protocols like BSR and Auto-RP.

1.2.3.1.  Integration with BGP-MVPN

   For each VPN, the RRs for that VPN can be completely separate from
   those for a different VPN.  The provider is not involved at all, as
   in the Inter-site Shared C-Tree model described in Section 13 of RFC
   6514.

   Alternatively, one or more PEs can serve as the RRs for their local
   sites for the purpose of distributing SA routes.  Compared to the
   approach in the previous paragraph, those PEs use a single session
   (vs. one session for each VPN) to exchange BGP-MVPN SA routes (MCAST-
   VPN SAFI) among themselves, following the procedures defined in
   Section 14 of RFC 6514.  That's in addition to exchanging BGP SA
   routes (CMCAST SAFI) between a PE and FHRs/LHRs that it is
   responsible for.  Note that RFC 6514 does not explicitly specify
   that an egress PE translate received BGP-MVPN SA A-D routes into PIM
   Null Register messages or MSDP SA routes (for the purpose of Anycast
   RP).  In this document, a PE acting as a RR for SA A-D routes does
   translate received BGP-MVPN SA A-D routes to BGP SA A-D routes, and
   vice versa.




1.2.4.  Bidirectional Trees

   For Bidirectional PIM, on transit LANs it is required that a DF is
   elected to forward traffic to/from the RPA direction.  This is based
   on DF messages exchanged rapidly among the BIDIR-PIM routers on the
   same LAN.  The procedure is complicated and may not be robust enough
   in all situations.  In a typical provider network, transit LANs are
   rarely used; therefore, for simplicity, this document does not
   support transit LANs for bidirectional trees.

   For resilience purposes, the RPA is typically a "virtual address" on
   a multi-access link and is not associated with any routers.  No DF
   election is needed on this RPL (Rendezvous Point Link), and all
   routers on the RPL forward traffic to/from the RPL.  With Bidir-PIM,
   the RPL routers terminate the Join/Prune messages from downstream
   neighbors and the same applies if BGP is used for signaling.

1.2.5.  Transition

   A network currently running PIM can be incrementally transitioned to
   BGP based multicast.  At any time, a router supporting BGP based
   multicast can use PIM with some neighbors (upstream or downstream)
   and BGP with some other neighbors.  PIM and BGP MUST NOT be used
   simultaneously between two neighbors for multicast purposes, and
   routers connected to the same LAN MUST be transitioned during the
   same maintenance window.

   In case of PIM-SSM, any router can be transitioned at any time
   (except on a LAN all routers must be transitioned together).  It may
   receive source tree joins from a mixed set of BGP and PIM downstream
   neighbors and send source tree joins to its upstream neighbor using
   either PIM or BGP signaling.

   In case of PIM-ASM, the RPs are first upgraded to support BGP based
   multicast.  They learn sources either via PIM procedures from PIM
   FHRs, or via Source Active A-D routes from BGP FHRs.  In the former
   case, the RPs can originate proxy Source Active A-D routes.  There
   may be a mixed set of RPs/RRs - some capable of both traditional PIM
   RP functionalities while some only redistribute SA routes.

   Then any routers can be transitioned incrementally.  A transitioned
   LHR will pull Source Active A-D routes from the RPs when it
   receives IGMP/MLD (*,G) joins for ASM groups, and may send either PIM
   (s,g) joins or BGP Source Tree Join routes.  A transitioned transit
   router may receive (*,g) PIM joins but only send source tree joins
   after pulling Source Active A-D routes from RPs.





2.  Specification

2.1.  BGP NLRIs and Attributes

   The same CMCAST SAFI and types of routes as defined in [draft-ietf-
   bess-mvpn-pe-ce] are used, except that the Source Prune A-D Route is
   not used, and an additional two types are defined.  In summary:

          3 -  S-PMSI A-D Route         [new]
          5 -  Source Active A-D Route  [new]
          6 -  Shared Tree Join Route   [existing]
          7 -  Source Tree Join Route   [existing]
          8 -  Source Prune A-D Route   [not used]

   Except for the Source Active A-D routes, the routes carry the
   NO_ADVERTISE community so that the receiving peer will not propagate
   them further.

2.1.1.  S-PMSI A-D Route

   Similar to that defined in RFC 6514, an S-PMSI A-D Route Type
   specific CMCAST NLRI consists of the following, though it does not
   have an RD:


         +-----------------------------------+
         | Multicast Source Length (1 octet) |
         +-----------------------------------+
         |  Multicast Source (variable)      |
         +-----------------------------------+
         |  Multicast Group Length (1 octet) |
         +-----------------------------------+
         |  Multicast Group   (variable)     |
         +-----------------------------------+
         |   Originating Router's IP Addr    |
         +-----------------------------------+

   If the Multicast Source (or Group) field contains an IPv4 address,
   then the value of the Multicast Source (or Group) Length field is 32.
   If the Multicast Source (or Group) field contains an IPv6 address,
   then the value of the Multicast Source (or Group) Length field is
   128.

   Usage of other values of the Multicast Source Length and Multicast
   Group Length fields is outside the scope of this document.

   Usage of S-PMSI A-D routes is described in Section 2.2.3 and
   Section 2.2.4.
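
   A minimal sketch of encoding and decoding the route type specific
   part of the S-PMSI A-D NLRI shown above follows.  The function names
   are hypothetical, the surrounding route type and length octets are
   not handled, and the Originating Router's IP Addr is assumed to
   occupy the remainder of the NLRI as a 4 or 16 octet address.

```python
import ipaddress


def encode_spmsi(source, group, originator):
    """Build: source length+address, group length+address, originator."""
    out = b""
    for text in (source, group):
        addr = ipaddress.ip_address(text)
        # Length is in bits: 32 for IPv4, 128 for IPv6.
        out += bytes([addr.max_prefixlen]) + addr.packed
    return out + ipaddress.ip_address(originator).packed


def decode_spmsi(data):
    """Return (source, group, originator) as strings."""
    fields, pos = [], 0
    for _ in range(2):
        bits = data[pos]
        if bits not in (32, 128):
            # Other lengths are outside the scope of the document.
            raise ValueError("unsupported length %d" % bits)
        size = bits // 8
        fields.append(str(ipaddress.ip_address(data[pos + 1:pos + 1 + size])))
        pos += 1 + size
    originator = str(ipaddress.ip_address(data[pos:]))
    return fields[0], fields[1], originator


if __name__ == "__main__":
    nlri = encode_spmsi("10.1.1.1", "232.1.1.1", "192.0.2.1")
    print(nlri.hex(), decode_spmsi(nlri))
```

   Dropping the trailing originator field yields the layout of the
   Source Active A-D route in Section 2.1.2.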




2.1.2.  Source Active A-D Route

   Similar to that defined in RFC 6514, a Source Active A-D Route Type
   specific CMCAST NLRI consists of the following:

         +-----------------------------------+
         | Multicast Source Length (1 octet) |
         +-----------------------------------+
         |   Multicast Source (variable)     |
         +-----------------------------------+
         |  Multicast Group Length (1 octet) |
         +-----------------------------------+
         |  Multicast Group (variable)       |
         +-----------------------------------+

   The definitions of the source/length and group/length fields are the
   same as in the S-PMSI A-D routes.

   Source Active A-D routes with a Multicast group belonging to the
   Source Specific Multicast (SSM) range (as defined in [RFC4607], and
   potentially extended locally on a router) MUST NOT be advertised by a
   router and MUST be discarded if received.
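
   The SSM check above could look like the following sketch.  The
   function names and the extra_ranges parameter (modeling a locally
   extended SSM range) are hypothetical; the default ranges are
   232.0.0.0/8 for IPv4 and FF3x::/32 for IPv6, per [RFC4607].

```python
import ipaddress


def is_ssm_group(group, extra_ranges=()):
    """True if the group falls in the SSM range or a local extension."""
    addr = ipaddress.ip_address(group)
    for text in extra_ranges:
        net = ipaddress.ip_network(text)
        if net.version == addr.version and addr in net:
            return True
    if addr.version == 4:
        return addr in ipaddress.ip_network("232.0.0.0/8")
    raw = addr.packed
    # FF3x::/32: first octet FF, flags nibble 3, any scope, next 16 bits 0.
    return raw[0] == 0xFF and (raw[1] >> 4) == 0x3 and raw[2:4] == b"\x00\x00"


def accept_source_active(group, extra_ranges=()):
    """Source Active A-D routes for SSM groups are neither sent nor kept."""
    return not is_ssm_group(group, extra_ranges)


if __name__ == "__main__":
    for g in ("232.1.1.1", "239.1.1.1", "ff3e::1234", "ff0e::1"):
        print(g, accept_source_active(g))
```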

   Usage of Source Active A-D routes is described in Section 1.2.3.

2.2.  Procedures

2.2.1.  Originating Tree Join Routes

   When a router learns from IGMP/MLD or a downstream PIM/BGP peer that
   it needs to join an SPT to receive traffic for a particular (s,g)
   flow, it determines the RPF neighbor with respect to the source
   following the same RPF procedures as defined for PIM.  If the RPF
   neighbor supports CMCAST SAFI, it originates a Source Tree Join Route
   and advertises the route to the RPF neighbor (in case of EBGP or
   hop-by-hop IBGP), or one or more RRs.

   When a router learns that it needs to join a bi-directional tree for
   a particular group, it determines the RPF neighbor with respect to
   the RPA.  If the neighbor supports CMCAST SAFI, it originates a
   Shared Tree Join Route and advertises the route to the RPF neighbor
   (in case of EBGP or hop-by-hop IBGP), or one or more RRs.

   When a router first learns that it needs to receive traffic for an
   ASM group, it originates an RTC route with the NLRI's AS field set to
   its AS number and the Route Target field set to an address based
   Route Target, with the Global Administrator field set to the group
   address and the Local Administrator field set to 0.  The route is
   advertised to the RRs, so that RRs can re-advertise the matching
   Source Active A-D routes to this router.  Upon receiving the Source
   Active A-D routes, the router originates Source Tree Join routes as
   described above, as long as it still needs to receive traffic for
   the flows (i.e., the corresponding IGMP/MLD membership exists or a
   join from a downstream PIM/BGP neighbor exists).
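
   For illustration, the RTC route for an IPv4 ASM group could be built
   as sketched below.  The function names are hypothetical.  The sketch
   assumes the IPv4 Address Specific Extended Community layout of RFC
   4360 (type 0x01, Route Target sub-type 0x02) and the Route Target
   membership NLRI layout of RFC 4684 (prefix length in bits, 4-octet
   origin AS, 8-octet Route Target); IPv6 groups are not covered.

```python
import ipaddress
import struct


def address_specific_rt(global_admin, local_admin=0):
    """8-octet RT: type, sub-type, IPv4 Global Admin, 2-octet Local Admin."""
    ga = ipaddress.IPv4Address(global_admin).packed
    return struct.pack("!BB4sH", 0x01, 0x02, ga, local_admin)


def rtc_nlri(origin_as, group):
    """RTC NLRI asking the RRs for Source Active routes of one group."""
    rt = address_specific_rt(group, 0)   # Local Administrator is 0
    prefix_len_bits = 8 * (4 + len(rt))  # full-length match: 96 bits
    return struct.pack("!BI", prefix_len_bits, origin_as) + rt


if __name__ == "__main__":
    print(rtc_nlri(65001, "239.1.1.1").hex())
```

   The same address_specific_rt() helper would also produce the RT
   carried by the Source Active A-D routes for that group, which is
   what lets the RRs match the two.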

   When a Source/Shared Tree Join route is originated by this router, it
   sets up corresponding forwarding state such that the expected
   incoming interface list includes all non-LAN interfaces directly
   connecting to the upstream neighbor.  LAN interfaces are added upon
   receiving the corresponding S-PMSI A-D route (Section 2.2.4).

   In this revision, it is assumed that the single-hop peering is used
   for DC deployments.  As discussed earlier, additional signaling could
   be used for a router to discover direct interfaces connected to its
   upstream or downstream neighbors.

   The Source/Shared Tree Join routes carry an Address Specific RT, with
   the Global Administrator field set to the upstream peer's address
   and the Local Administrator field set to 0.

2.2.2.  Receiving Tree Join Routes

   A router (auto-)configures Import RTs matching itself so that it can
   import tree join routes from its peers.

   When a router receives a tree join route from a downstream router and
   imports it, it determines if it needs to originate its own
   corresponding route and advertise it further upstream with respect to
   the source or RPA.  If it is itself the FHR or is on the RPL, then it
   does not need to.  Otherwise the procedures in Section 2.2.1 are
   followed.

   Additionally, the router sets up its corresponding forwarding state
   such that one of the interfaces that directly connects to the
   downstream neighbor is added to the outgoing interface list.  If
   there is a LAN interface connecting to the downstream neighbor, it
   MAY be preferred over non-LAN interfaces, but in that case an S-PMSI
   A-D route MUST be originated (Section 2.2.3).
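
   The decisions made on receipt of a tree join route can be summarized
   by the sketch below.  It is a simplification under stated
   assumptions: the function name, its arguments, and the returned
   dictionary are hypothetical, and picking the first candidate
   interface stands in for whatever local policy a router applies.

```python
def process_tree_join(links, is_fhr_or_on_rpl, prefer_lan=True):
    """Decide how to act on an imported tree join route.

    links: (interface_name, is_lan) pairs directly connecting this
    router to the downstream neighbor that originated the join.
    """
    if not links:
        raise ValueError("no direct interface to the downstream neighbor")
    lan = [name for name, is_lan in links if is_lan]
    p2p = [name for name, is_lan in links if not is_lan]
    # A LAN interface MAY be preferred; it is mandatory only if no
    # non-LAN interface reaches the neighbor.
    use_lan = bool(lan) and (prefer_lan or not p2p)
    return {
        "outgoing_interface": lan[0] if use_lan else p2p[0],
        # Sending on a LAN requires announcing it with an S-PMSI A-D route.
        "originate_spmsi": use_lan,
        # The FHR, or a router on the RPL, terminates the join.
        "originate_upstream_join": not is_fhr_or_on_rpl,
    }


if __name__ == "__main__":
    print(process_tree_join([("ge-0/0/1", False), ("lan0", True)], False))
```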

   In this revision, it is assumed that the single-hop peering is used
   for DC deployments.  As discussed earlier, additional signaling could
   be used for a peer to discover direct interfaces connected to its
   upstream or downstream neighbors.







2.2.3.  Originating S-PMSI A-D Routes

   If this router chooses to use a LAN interface to send traffic to its
   neighbors for a particular (s,g) or (*,g) flow, it MUST announce that
   by originating a corresponding S-PMSI A-D route.  The Tunnel Type in
   the PTA is set to 0 (No tunnel information present).  The LAN
   interface is identified by an IP address specific RT, with the Global
   Administrator field set to the LAN interface's address prefix and
   the Local Administrator field set to the prefix length.  The RT also
   serves the purpose of restricting the import of the route to the
   routers on the LAN.

   If multiple LAN interfaces are to be used (to reach different sets of
   neighbors), then the route will include multiple RTs, one for each
   used LAN interface as described above.
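
   A sketch of deriving these RTs from interface addresses follows.
   The function names are hypothetical, only IPv4 LANs are handled, and
   the 8-octet layout assumes the IPv4 Address Specific Extended
   Community of RFC 4360 with the Route Target sub-type.

```python
import ipaddress
import struct


def lan_route_target(interface):
    """RT for one LAN: Global Admin = prefix, Local Admin = prefix length."""
    net = ipaddress.IPv4Interface(interface).network
    return struct.pack("!BB4sH", 0x01, 0x02,
                       net.network_address.packed, net.prefixlen)


def lan_route_targets(interfaces):
    """One RT per LAN interface used, without duplicates, order preserved."""
    seen = []
    for interface in interfaces:
        rt = lan_route_target(interface)
        if rt not in seen:
            seen.append(rt)
    return seen


if __name__ == "__main__":
    # Two routers on the same LAN derive the same RT from their own addresses.
    print(lan_route_target("192.0.2.5/24") == lan_route_target("192.0.2.9/24"))
```

   Because the RT depends only on the prefix and its length, the same
   computation gives each router on the LAN its Import RT
   (Section 2.2.4).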

   The S-PMSI A-D routes may also be used to announce tunnels that could
   be used to send traffic to downstream neighbors that are not directly
   connected.  This is outside the scope of this revision.

2.2.4.  Receiving S-PMSI A-D Routes

   A router (auto-)configures an Import RT for each of its LAN
   interfaces over which BGP is used for multicast signaling.  The
   construction of the RT is described in the previous section.

   When a router imports an S-PMSI A-D route, it checks if it also
   originated the same route and if its own route has at least one RT in
   common with the received one.  If yes, it means both itself and the
   originator of the received route want to send to the same LANs.  This
   kicks off the assert procedure to elect a winner - the one with the
   highest next hop address wins.  An assert loser does not include the
   corresponding LAN interface in its outgoing interface list, but it
   keeps the S-PMSI A-D route that it originated.
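
   The assert comparison can be sketched as follows.  The function name
   and the dictionary representation of a route are hypothetical, RTs
   are modeled as opaque strings, and both next hops are assumed to be
   of the same address family.  Addresses are compared numerically
   rather than as text.

```python
import ipaddress


def lans_lost_in_assert(own_route, received_route):
    """Return the LAN RTs on which this router loses the assert.

    Both routes are for the same (s,g) or (*,g) flow and are given as
    {"next_hop": str, "rts": set_of_rt_identifiers}.
    """
    common = own_route["rts"] & received_route["rts"]
    if not common:
        return set()  # different LANs, so no assert takes place
    own_nh = ipaddress.ip_address(own_route["next_hop"])
    peer_nh = ipaddress.ip_address(received_route["next_hop"])
    # The highest next hop address wins; the loser stops sending on
    # the shared LANs but keeps its own S-PMSI A-D route.
    return common if peer_nh > own_nh else set()


if __name__ == "__main__":
    mine = {"next_hop": "192.0.2.9", "rts": {"lan-a", "lan-b"}}
    theirs = {"next_hop": "192.0.2.10", "rts": {"lan-a"}}
    print(lans_lost_in_assert(mine, theirs))
```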

   If this router does not have a matching S-PMSI route of its own with
   some common RTs, and the originator of the received S-PMSI route is a
   chosen upstream neighbor for the corresponding flow, then this router
   updates its forwarding state to include the LAN interface in the
   incoming interface list.  When the last S-PMSI route with an RT
   matching the LAN is withdrawn later, the LAN interface is removed
   from the incoming interface list.

   Note that a downstream router on the LAN does not participate in the
   assert procedure.  It adds/keeps the LAN interface in the expected
   incoming interfaces as long as its chosen upstream peer originates
   the S-PMSI A-D route.  It does not switch to the assert winner as its
   upstream.  An assert loser MAY keep sending joins upstream based on
   local policy even if it has no other downstream neighbors (this could
   be used for fast switchover in case the assert winner fails).

3.  Security Considerations

   This document does not introduce new security risks.

4.  Acknowledgements

   The authors thank Marco Rodrigues and Lenny Giuliano for their
   initial idea/ask of using BGP for multicast signaling beyond MVPN.
   We also thank Eric Rosen for his questions, suggestions, and help
   finding solutions to some issues.

5.  References

5.1.  Normative References

   [I-D.ietf-bess-mvpn-pe-ce]
              Patel, K., Rosen, E., and Y. Rekhter, "BGP as an MVPN PE-
              CE Protocol", draft-ietf-bess-mvpn-pe-ce-00 (work in
              progress), April 2015.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006,
              <http://www.rfc-editor.org/info/rfc4601>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast (BIDIR-
              PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007,
              <http://www.rfc-editor.org/info/rfc5015>.

   [RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
              Encodings and Procedures for Multicast in MPLS/BGP IP
              VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012,
              <http://www.rfc-editor.org/info/rfc6514>.








5.2.  Informative References

   [I-D.ietf-rtgwg-bgp-routing-large-dc]
              Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP for
              routing in large-scale data centers", draft-ietf-rtgwg-
              bgp-routing-large-dc-02 (work in progress), April 2015.

Authors' Addresses

   Zhaohui Zhang
   Juniper Networks

   EMail: zzhang@juniper.net


   Keyur Patel
   Cisco Systems

   EMail: keyupate@cisco.com































