BESS Z. Zhang
Internet-Draft Juniper Networks
Intended status: Standards Track K. Patel
Expires: April 18, 2016 Cisco Systems
October 16, 2015
BGP Based Multicast
draft-zzhang-bess-bgp-multicast-00
Abstract
This document describes multicast signaling based on Border Gateway
Protocol (BGP).
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC2119.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 18, 2016.
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Zhang & Patel Expires April 18, 2016 [Page 1]
Internet-Draft bgp-mcast October 2015
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Overview . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1. BGP Sessions . . . . . . . . . . . . . . . . . . . . 3
1.2.2. LAN and Parallel Links . . . . . . . . . . . . . . . 4
1.2.3. Source Discovery for ASM . . . . . . . . . . . . . . 5
1.2.4. Bidirectional Trees . . . . . . . . . . . . . . . . . 6
1.2.5. Transition . . . . . . . . . . . . . . . . . . . . . 6
2. Specification . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1. BGP NLRIs and Attributes . . . . . . . . . . . . . . . . 7
2.1.1. S-PMSI A-D Route . . . . . . . . . . . . . . . . . . 7
2.1.2. Source Active A-D Route . . . . . . . . . . . . . . . 8
2.2. Procedures . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1. Originating Tree Join Routes . . . . . . . . . . . . 8
2.2.2. Receiving Tree Join Routes . . . . . . . . . . . . . 9
2.2.3. Originating S-PMSI A-D Routes . . . . . . . . . . . . 10
2.2.4. Receiving S-PMSI A-D Routes . . . . . . . . . . . . . 10
3. Security Considerations . . . . . . . . . . . . . . . . . . . 11
4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11
5. References . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.1. Normative References . . . . . . . . . . . . . . . . . . 11
5.2. Informative References . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction
1.1. Motivation
Protocol Independent Multicast (PIM) has been the prevailing
multicast protocol for many years. Despite its success, it has two
drawbacks:
o Complexity originated from RPT/SPT switchover and data driven
nature for PIM-ASM.
o Periodical protocol state refreshes due to soft state nature.
While PIM-SSM removes the complexity of PIM-ASM, there have not been
a good way of discovering sources, limiting its deployment. PIM-Port
(PIM over Reliable Transport) solves the soft state issue, though its
deployment has also been limited.
Zhang & Patel Expires April 18, 2016 [Page 2]
Internet-Draft bgp-mcast October 2015
Partly because of the complexity concern, some Data Center operators
have been avoiding deploying multicast in their networks.
BGP-MVPN [RFC 6514] uses BGP to signal VPN customer multicast state
over provider networks. It removes the above mentioned problems, and
the deployment experiences have been encouraging. [draft-ietf-bess-
mvpn-pe-ce] adapts the concept of BGP-MVPN to PE-CE links, and this
document extends it further to general topologies, so that it can
deployed in any network where BGP is running, or can be run,
throughout or on most routers. One target deployment would be a Data
Center that requires multicast and that uses BGP as its only routing
protocol.
1.2. Overview
In a nut shell, this is PIM with BGP based join/prune signaling, plus
BGP based source discovery in case of ASM. The same RPF procedures
as in PIM are used for each router to determine the RPF neighbor for
a particular source or RPA (in case of Bidirectional Tree). Except
in the Bidirectional Tree case, no (*,G) join is used - LHR routers
discover the sources for ASM and then join towards the sources
directly. Data driven mechanisms like PIM Assert is replaced by
control driven mechanisms (Section 1.2.2).
The joins are carried in BGP Updates with CMCAST SAFI and types of
routes as defined in [draft-ietf-bess-mvpn-pe-ce]. CMCAST NLRIs are
targted at the upstream neighbor by use of Route Targets.
1.2.1. BGP Sessions
As specified in [draft-ietf-bess-mvpn-pe-ce-00], in order for two BGP
speakers to exchange C-MCAST NLRI, they must use BGP Capabilities
Advertisement [RFC5492] to ensure that they both are capable of
properly processing the C-MCAST NLRI. This is done as specified in
[RFC4760], by using a capability code 1 (multiprotocol BGP) with an
AFI of IPv4 (1) or IPv6 (2) and a SAFI of C-MCAST with a value to be
assigned by IANA.
How the BGP peer sessions are provisioned, whether EBGP or IBGP,
whether statically, automatically (e.g., based on IGP neighbor
discovery), or programmably via an external controller, is outside
the scope of this document.
In case of IBGP, it could be that every router peering with Route
Reflectors, or hop by hop IBGP sessions could be used to exchange
CMCAST NLRIs for joins. In the latter case, unless desired otherwise
for reasons outside of the scope of this document, the hop by hop
IBGP sessions MUST only be used to exchange CMCAST NLRIs.
Zhang & Patel Expires April 18, 2016 [Page 3]
Internet-Draft bgp-mcast October 2015
FHRs and LHRs also establish BGP sessions to some Route Reflectors
for source discovery purpose (Section 1.2.3).
With the traditional PIM, the FHRs and LHRs refer to the PIM DRs on
the source or receiver networks. With BGP based multicast, PIM may
not be running at all, and the FHRs and LHRs refer to the IGMP/MLD
queriers in that case.
1.2.2. LAN and Parallel Links
There could be parallel links between two BGP peers. A single multi-
hop session, whether IBGP or EBGP, between loopback addresses may be
used. Except for LAN interfaces, any link between the two peers can
be automatically used by a downstream peer to receive traffic from
the upstream peer, and it is for the upstream peer to decide which
link to use. If one of the link goes down, the upstream peer
switches to a different link and there is no change needed on the
downstream peer.
The upstream peer MAY prefer LAN interfaces to send traffic, since
multiple downstream peers may be reached simultaneously, or it may
make a decision based on local policy, e.g., for load balancing
purpose. Because different downstream peers might choose different
upstream peers for RPF, when an upstream peer decides to use a LAN
interface to send traffic, it originates an S-PMSI A-D route
indicating that one or more LAN interface will be used. The route
carries Route Targets specific to the LANs so that all the peers on
the LANs import the route. If more than one router originate the
route specifying the same LAN for the same (s,g) or (*,g) flow, then
assert procedure based on the S-PMSI A-D routes happens and assert
losers will stop sending traffic to the LAN.
In this multihop session case, there need be a way to determine if
two peers are directly connected, so that traffic can be sent
natively when possible or tunneled when necessary. Advertising
attached interface addresses, like LDP does, could be one way. Those
advertisements can be limited to peers that are directly connected by
using of Route Targets. More details may be provided in a future
revision, pending further consideration.
Alternatively, multiple single-hop sessions between interface
addresses, whether IBGP or EBPG, can be used. This is especially
suitable in DC scenarios.
Zhang & Patel Expires April 18, 2016 [Page 4]
Internet-Draft bgp-mcast October 2015
1.2.3. Source Discovery for ASM
This document does not support ASM via shared trees (aka RP Tree, or
RPT). Instead, FHRs, RPs, and LHRs work together to propagate/
discover source information via control plane and LHRs join source
specific Shortest Path Trees (SPT) directly.
The RPs are just Route Reflectors. Multicast data traffic does not
necessarily go through them, and redundancy can be easily achieved by
having multiple RRs. They do not participate in any multicast
specific procedures, besides that they redistribute Source Active A-D
routes. A FHR originates Source Active A-D routes upon discovery
sources for particular flows and advertise them to the RRs, carrying
an IPv4 or IPv6 address specific Route Target. The Global
Administrator field is set the group address of the flow, and the
Local Administrator field is set to 0. An LHR originate Route Target
Constraint routes towards the RRs, with the Route Target field in the
NLRI set accordingly, for the groups it wants to receive traffic for.
That way, RR maintains all source information but only distributes to
interested LHRs on demand.
Because the RPs are only used for distributing SA route and not as
data rendezvous points, a small number of them are enough and there
is no need to have different RPs for different groups. As a result,
static configuration is sufficient - no need for dynamic RP learning
protocols like BSR and Auto-RP.
1.2.3.1. Integration with BGP-MVPN
For each VPN, the RRs for that VPN can be completely separate from
those for a different VPN. The provider is not involved at all, as
in the Inter-site Shared C-Tree model described in Section 13 of RFC
6514.
Alternatively, one or more PEs can serve as the RRs for their local
sites for the purpose of distributing SA routes. Compared to the
approach in the previous paragraph, those PEs use a single session
(vs. one session for each VPN) to exchange BGP-MVPN SA routes (MCAST-
VPN SAFI) among themselves, following the procedures defined in
Section 14 of RFC 6514. That's in addition to exchanging BGP SA
routes (CMCAST SAFI) between a PE and FHRs/LHRs that it is
responsible for. Note that RFC 6514 does not explicictly specify
that an egress PE translate received BGP-MVPN SA A-D routes into PIM
Null Register messages or MSDP SA routes (for the purpose of Anycast
RP). In this document, a PE acting as a RR for SA A-D routes does
translate received BGP-MVPN SA A-D routes to BGP SA A-D routes, and
vice versa.
Zhang & Patel Expires April 18, 2016 [Page 5]
Internet-Draft bgp-mcast October 2015
1.2.4. Bidirectional Trees
For Bidirectional PIM, on transit LANs it is required that a DF is
elected to forward traffic to/from the RPA direction. This is based
on DF messages exchanged rapidly among the BIDIR-PIM routers on the
same LAN. The procedure is complicated and may not be robust enough
in all situations. In a typical provider network, transit LANs are
rarely used therefore for simplicy this document does not support
transit LANs for bidirectional trees.
For resilience purpose the RPA is typically a "virtual address" on a
multi-access link and is not associated with any routers. No DF
election is needed on this RPL (Rendezvous Point Link), and all
routers on the RPL forward traffic to/from the RPL. With Bidir-PIM,
the RPL routers terminate the Join/Prune messages from downstream
neighbors and the same applies if BGP is used for signaling.
1.2.5. Transition
A network currently running PIM can be incrementally transitioned to
BGP based multicast. At any time, a router supporting BGP based
multicast can use PIM with some neighbors (upstream or downstream)
and BGP with some other neighbors. PIM and BGP MUST not be used
simultaneously between two neighbors for multicast purpose, and
routers connected to the same LAN MUST be transitioned during the
same maintenance window.
In case of PIM-SSM, any router can be transitioned at any time
(except on a LAN all routers must be transitioned together). It may
receive source tree joins from a mixed set of BGP and PIM downstream
neighbors and send source tree joins to its upstream neighbor using
either PIM or BGP signaling.
In case of PIM-ASM, the RPs are first upgraded to support BGP based
multicast. They learn sources either via PIM procedures from PIM
FHRs, or via Source Active A-D routes from BGP FHRs. In the former
case, the RPs can originate proxy Source Active A-D routes. There
may be a mixed set of RPs/RRs - some capable of both traditional PIM
RP functionalities while some only redistribute SA routes.
Then any routers can be transitioned incrementally. A transitioned
LHR router will pull Source Active A-D routes from the RPs when they
receive IGMP/MLD (*,G) joins for ASM groups, and may send either PIM
(s,g) joins or BGP Source Tree Join routes. A transitioned transit
router may receive (*,g) PIM joins but only send source tree joins
after pulling Source Active A-D routes from RPs.
Zhang & Patel Expires April 18, 2016 [Page 6]
Internet-Draft bgp-mcast October 2015
2. Specification
2.1. BGP NLRIs and Attributes
The same CMCAST SAFI and types of routes as defined in [draft-ietf-
bess-mvpn-pe-ce] are used, except that the Source Prune A-D Route is
not used, and an additional two types are defined. In summary:
3 - S-PMSI A-D Route [new]
5 - Source Active A-D Route [new]
6 - Shared Tree Join Route [existing]
7 - Source Tree Join Route [existing]
8 - Source Prune A-D Route [not used]
Except for the Source Active A-D routes, the routes carry a NO-
ADVERTISE community so that the receiving peer will not propagate it
further.
2.1.1. S-PMSI A-D Route
Similar to defined in RFC 6514, an S-PMSI A-D Route Type specific
CMCAST NLRI consists of the following, though it does not have an RD:
+-----------------------------------+
| Multicast Source Length (1 octet) |
+-----------------------------------+
| Multicast Source (variable) |
+-----------------------------------+
| Multicast Group Length (1 octet) |
+-----------------------------------+
| Multicast Group (variable) |
+-----------------------------------+
| Originating Router's IP Addr |
+-----------------------------------+
If the Multicast Source (or Group) field contains an IPv4 address,
then the value of the Multicast Source (or Group) Length field is 32.
If the Multicast Source (or Group) field contains an IPv6 address,
then the value of the Multicast Source (or Group) Length field is
128.
Usage of other values of the Multicast Source Length and Multicast
Group Length fields is outside the scope of this document.
Usage of S-PMSI A-D routes is described in Section 2.2.3 and
Section 2.2.4.
Zhang & Patel Expires April 18, 2016 [Page 7]
Internet-Draft bgp-mcast October 2015
2.1.2. Source Active A-D Route
Similar to defined in RFC 6514, a Source Active A-D Route Type
specific MCAST NLRI consists of the following:
+-----------------------------------+
| Multicast Source Length (1 octet) |
+-----------------------------------+
| Multicast Source (variable) |
+-----------------------------------+
| Multicast Group Length (1 octet) |
+-----------------------------------+
| Multicast Group (variable) |
+-----------------------------------+
The definition of the source/length and group/length fields are the
same as in the S-PMSI A-D routes.
Source Active A-D routes with a Multicast group belonging to the
Source Specific Multicast (SSM) range (as defined in [RFC4607], and
potentially extended locally on a router) MUST NOT be advertised by a
router and MUST be discarded if received.
Usage of Source Active A-D routes is described in Section 1.2.3.
2.2. Procedures
2.2.1. Originating Tree Join Routes
When a router learns from IGMP/MLD or a downstream PIM/BGP peer that
it needs to join a SPT to receive traffic for a particular (s,g)
flow, it determines the RPF neighbor wrt the source following the
same RPF procedures as defined for PIM. If the RPF neighbor supports
CMCAST SAFI, it originates a Source Tree Join Route and advertises
the route to the RPF neighbor (in case of EBGP or hop-by-hop IBGP),
or one or more RRs.
When a router learns that it needs to join a bi-directional tree for
a particular group, it determines the RPF neighbor wrt the RPA. If
the neighbor supports CMCAST SAFI, it originates a Shared Tree Join
Route and advertises the route to the RPF neighbor (in case of EBGP
or hop-by-hop IBGP), or one or more RRs.
When a router first learns that it needs to receive traffic for an
ASM group, it originates a RTC route with the NLRI's AS field set to
its AS number and the Route Target field set to an address based
Route Target, with the Global Administrator field set to group
address and the Local Administrator field set to 0. The route is
Zhang & Patel Expires April 18, 2016 [Page 8]
Internet-Draft bgp-mcast October 2015
advertised to the RRs, so that RRs can re-advertise the matching
Source Active A-D routes to this router. Upon the receiving of the
Source Active A-D routes, the router originates Source Tree Join
routes as described above, as long as it still needs to receive
traffic for the flows (i.e., the corresponding IGMP/MLD membership
exists or join from downstream PIM/BGP neighbor exists).
When a Source/Shared Tree Join route is originated by this router, it
sets up corresponding forwarding state such that the expected
incoming interface list includes all non-LAN interfaces directly
connecting to the upstream neighbor. LAN interfaces are added upon
receiving corresponding S-PMSI A-D route (Section 2.2.4).
In this revision, it is assumed that the single-hop peering is used
for DC deployments. As discussed earlier, additional signaling could
be used for a router to discover direct interfaces connected to its
upstream or downstream neighbors.
The Source/Shared Tree Join routes carry an Address Specific RT, with
the global administrative field set to the upstream peer's address
and the local administrative field set to 0.
2.2.2. Receiving Tree Join Routes
A router (auto-)configures Import RTs matching itself so that it can
import tree join routes from their peers.
When a router receives a tree join route from a downstream router and
imports it, it determines if it needs to originate its own
corresponding route and advertise further upstream wrt the source or
RPA. If itself is the FHR or is on the RPL, then it does not need
to. Otherwise the procedures in Section 2.2.1 are followed.
Additionally, the router sets up its corresponding forwarding state
such that one of the interfaces that directly connects to the
downstream neighbor is added to outgoing interface list. If there is
a LAN interface connecting to the downstream neighbor, it MAY be
preferred over non-LAN interfaces, but an S-PMSI A-D route MUST be
originated (Section 2.2.3).
In this revision, it is assumed that the single-hop peering is used
for DC deployments. As discussed earlier, additional signaling could
be used for a peer to discover direct interfaces connected to its
upstream or downstream neighbors.
Zhang & Patel Expires April 18, 2016 [Page 9]
Internet-Draft bgp-mcast October 2015
2.2.3. Originating S-PMSI A-D Routes
If this router chooses to use a LAN interface to send traffic to its
neighbors for a particular (s,g) or (*,g) flow, it MUST announce that
by originating a corresponding S-PMSI A-D route. The Tunnel Type in
the PTA is set to 0 (no tunnel information Present). The LAN
interface is identified by an IP address specific RT, with the Global
Administrative Field set to the LAN interface's address prefix and
the Local Administrative Field set to the prefix length. The RT also
serves the purpose of restricting the importing of the route by all
routers on the LAN.
If multiple LAN interfaces are to be used (to reach different sets of
neighbors), then the route will include multiple RTs, one for each
used LAN interface as described above.
The S-PMSI A-D routes may also be used to announce tunnels that could
be used to send traffic to downstream neighbors that are not directly
connected. This is outside of the scope for now.
2.2.4. Receiving S-PMSI A-D Routes
A router (auto-)configures an Import RT for each of its LAN
interfaces over which BGP is used for multicast signaling. The
construction of the RT is described in the previous section.
When a router imports an S-PMSI A-D route, it checks if it also
originated the same route and if the route has at least one common RT
of the received one. If yes, it means both itself and the originator
of the receive route want to send to the same LANs. This kicks off
the assert procedure to elect a winner - the one with the highest
next hop address wins. The assert losers will not include the
corresponding LAN interface in its outgoing interface list, but it
keeps the S-PMSI A-D route that it originates.
If this router does not have a matching S-PMSI route of its own with
some common RTs, and the originator of the received S-PMSI route is a
chosen upstream neighbor for the corresponding flow, then this router
updates its forwarding state to include the LAN interface in the
incoming interface list. When the last S-PMSI route with a RT
matching the LAN is withdrawn later, the LAN interface is removed
from the incoming interface list.
Note that a downstream router on the LAN does not participate in the
assert procedure. It adds/keeps the LAN interface in the expected
incoming interfaces as long as its chosen upstream peer originates
the S-PMSI AD route. It does not switch to the assert winner as its
upstream. An assert loser MAY keep sending joins upstream based on
Zhang & Patel Expires April 18, 2016 [Page 10]
Internet-Draft bgp-mcast October 2015
local policy even if it has no other downstream neighbors (this could
be used for fast switch over in case the assert winner would fail).
3. Security Considerations
This document does not introduce new security risks.
4. Acknowledgements
The authors thank Marco Rodrigues and Lenny Giuliano for their
initial idea/ask of using BGP for multicast signaling beyond MVPN.
We also thank Eric Rosen for his questions, suggestions, and help
finding solutions to some issues.
5. References
5.1. Normative References
[I-D.ietf-bess-mvpn-pe-ce]
Patel, K., Rosen, E., and Y. Rekhter, "BGP as an MVPN PE-
CE Protocol", draft-ietf-bess-mvpn-pe-ce-00 (work in
progress), April 2015.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>.
[RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
"Protocol Independent Multicast - Sparse Mode (PIM-SM):
Protocol Specification (Revised)", RFC 4601,
DOI 10.17487/RFC4601, August 2006,
<http://www.rfc-editor.org/info/rfc4601>.
[RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
"Bidirectional Protocol Independent Multicast (BIDIR-
PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007,
<http://www.rfc-editor.org/info/rfc5015>.
[RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
Encodings and Procedures for Multicast in MPLS/BGP IP
VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012,
<http://www.rfc-editor.org/info/rfc6514>.
Zhang & Patel Expires April 18, 2016 [Page 11]
Internet-Draft bgp-mcast October 2015
5.2. Informative References
[I-D.ietf-rtgwg-bgp-routing-large-dc]
Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP for
routing in large-scale data centers", draft-ietf-rtgwg-
bgp-routing-large-dc-02 (work in progress), April 2015.
Authors' Addresses
Zhaohui Zhang
Juniper Networks
EMail: zzhang@juniper.net
Keyur Patel
Cisco Systems
EMail: keyupate@cisco.com
Zhang & Patel Expires April 18, 2016 [Page 12]