Internet Engineering Task Force R. Sharma, Ed.
Internet-Draft A. Banerjee
Intended status: Standards Track R. Sivaramu
Expires: January 18, 2018 A. Sajassi
Cisco Systems
July 17, 2017
Multi-site EVPN based VXLAN using Border Gateways
draft-sharma-multi-site-evpn-03
Abstract
This document describes the procedures for interconnecting two or
more BGP based Ethernet VPN (EVPN) sites in a scalable fashion over
an IP-only network. The motivation is to support extension of EVPN
sites without having to rely on typical Data Center Interconnect
(DCI) technologies like MPLS/VPLS for the interconnection. The
requirements for such a deployment are very similar to the ones
specified in RFC 7209 -- "Requirements for Ethernet VPN (EVPN)".
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 18, 2018.
Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Sharma, et al. Expires January 18, 2018 [Page 1]
Internet-Draft Multi-site EVPN July 2017
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Multi-Site EVPN Overview . . . . . . . . . . . . . . . . . . 4
3.1. MS-EVPN Interconnect Requirements . . . . . . . . . . . . 4
3.2. MS-EVPN Interconnect concept and framework . . . . . . . 5
4. Multi-site EVPN Interconnect Procedures . . . . . . . . . . . 8
4.1. Border Gateway Discovery . . . . . . . . . . . . . . . . 8
4.2. Border Gateway Provisioning . . . . . . . . . . . . . . . 11
4.2.1. Border Gateway Designated Forwarder Election . . . . 11
4.2.2. Anycast Border Gateway . . . . . . . . . . . . . . . 12
4.2.3. Multi-path Border Gateway . . . . . . . . . . . . . . 12
4.3. EVPN route processing at Border Gateway . . . . . . . . . 13
4.4. Multi-Destination tree between Border Gateways . . . . . 14
4.5. Inter-site Unicast traffic . . . . . . . . . . . . . . . 15
4.6. Inter-site Multi-destination traffic . . . . . . . . . . 15
4.7. Host Mobility . . . . . . . . . . . . . . . . . . . . . . 16
5. Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.1. Fabric to Border Gateway Failure . . . . . . . . . . . . 16
5.2. Border Gateway to Border Gateway Failures . . . . . . . . 16
6. Interoperability . . . . . . . . . . . . . . . . . . . . . . 16
7. Isolation of Fault Domains . . . . . . . . . . . . . . . . . 17
8. Loop detection and Prevention . . . . . . . . . . . . . . . . 17
9. MVPN with Multi-site EVPN . . . . . . . . . . . . . . . . . . 18
9.1. Inter-Site MI-PMSI . . . . . . . . . . . . . . . . . . . 18
9.2. Stitching of customer multicast trees across sites . . . 18
9.3. RPF resolution in Anycast Border Gateway model . . . . . 19
9.4. Inter-Site S-PMSI . . . . . . . . . . . . . . . . . . . . 19
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 19
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19
12. Security Considerations . . . . . . . . . . . . . . . . . . . 19
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 19
13.1. Normative References . . . . . . . . . . . . . . . . . . 19
13.2. Informative References . . . . . . . . . . . . . . . . . 20
Appendix A. Additional Stuff . . . . . . . . . . . . . . . . . . 20
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 20
Sharma, et al. Expires January 18, 2018 [Page 2]
Internet-Draft Multi-site EVPN July 2017
1. Introduction
BGP based Ethernet VPNs (EVPNs) are being used to support various VPN
topologies with the motivation and requirements being discussed in
detail in RFC7209 [RFC7209]. EVPN has been used to provide a Network
Virtualization Overlay (NVO) solution with a variety of tunnel
encapsulation options over IP as described in [DCI-EVPN-OVERLAY].
EVPN used for the Data center interconnect (DCI) at the WAN Edge is
discussed in [DCI-EVPN-OVERLAY]. The EVPN DCI procedures are defined
for IP and MPLS hand-off at the site boundaries.
In the current EVPN deployments, there is a need to segment the EVPN
domains within a Data Center (DC) primarily due to the service
architecture and the scaling requirements around it. The number of
routes, tunnel end-points, and next-hops needed in the DC are larger
than some of the hardware elements that are being deployed. Network
operators would like to ensure that they have means to have smaller
sites within the data center, if they so desire, without having to
have traditional DCI technologies to inter-connect them. In essence,
they want smaller multi-site EVPN domains with an IP backbone.
Network operators today are using the Virtual Network Identifier
(VNI) to designate a service. However, they would like to have this
service available to a smaller set of nodes within the DC for
administrative reasons; in essence they want to break up the EVPN
domain to multiple smaller sites. An advantage of having a smaller
footprint for these EVPN sites, implies that the various fault
isolation domains are now more constrained. It is also feasible to
have features that can re-use the VNI space across these sites if
desired. The above mentioned motivations for having smaller multi-
site EVPN domains are over and above the ones that are already
detailed in RFC7209 [RFC7209].
In this document we focus primarily on the VXLAN encapsulation for
EVPN deployments. We assume that the underlay provides simple IP
connectivity. We go into the details of the IP/VXLAN hand-off
mechanisms, to interconnect these smaller sites, within the data
center itself. We describe this deployment model as a scalable
multi-site EVPN (MS-EVPN) deployment. The procedures described here
go into substantial detail regarding interconnecting L2 and L3,
unicast and multicast domains across multiple EVPN sites.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Sharma, et al. Expires January 18, 2018 [Page 3]
Internet-Draft Multi-site EVPN July 2017
2. Terminology
o Border Gateway (BG): This is the node that interacts with nodes
within a site and with nodes that are external to the site. For
example, in a leaf-spine data center fabric, it can be a leaf, a
spine, or a separate device acting as gateway to interconnect the
sites.
o Anycast Border Gateway: A Virtual set of shared Border Gateways
(or Next-hops) acting as Multiple entry-exit points for a site.
o Multipath Border Gateway: A Virtual set of unique border Gateways
(or Next-hops) acting as a Multiple entry-exit points for a site.
o A-D: Auto-discovery.
3. Multi-Site EVPN Overview
In this section we describe the motivation, requirements, and
framework of the multi-site EVPN enhancements.
3.1. MS-EVPN Interconnect Requirements
In this section we discuss the requirements and motivation for
interconnecting different EVPN sites within a data center. In
general any interconnect technology has the following requirements:
a. Scalability: Multi-Site EVPN (MS-EVPN) should be able to
interconnect multiple sites in a scalable fashion. In other
words, interconnecting such sites should not lead to one giant
fabric with full mesh of end-to-end VXLAN tunnels across leafs in
different sites. This leads to scale issues with respect to
managing large number of tunnel end-points and a large number of
tunnel next-hops. Also a huge flat fabric rules out option of
ingress replication (IR) trees as number of replications becomes
practically unachievable due to the internal bandwidth needed in
hardware.
b. Multi-Destination traffic over unicast-only cloud: MS-EVPN
mechanisms should be able to provide an efficient forwarding
mechanism for multi-destination frames even if the underlay
inter-site network is not capable of forwarding multicast frames.
This requirement is meant to ensure that for the solution to work
there are no additional constraints being requested of the IP
network. This allows for use of existing network elements as-is.
c. Maintain Site-specific Administrative control: The MS-EVPN
technology should be able to interconnect fabrics from different
Sharma, et al. Expires January 18, 2018 [Page 4]
Internet-Draft Multi-site EVPN July 2017
Administrative domains. It is possible that different sites have
different VLAN-VNI mappings, use different underlay routing
protocols, and/or have different PIM-SM group ranges etc. It is
expected that the technology should not impose any additional
constraints on the various administrative domains.
d. Co-existence of other VPN Address Families: MS-EVPN should
coexists with other VPN address families like L3-VPN or MVPN. As
each VPN address family provides different set of services, MS-
EVPN architecture should allow for coexistence of such VPN
address families in the same topology.
e. Isolate fault domains: MS-EVPN technology hand-off should have
capability to isolate traffic cross site boundaries and prevent
defects to percolate from one site to another. As an example, a
broadcast storm in a site should not lead to meltdown of all
other sites.
f. Loop detection and prevention: In the scenarios where flood
domains are stretched across fabrics, interconnecting sites are
very vulnerable to loops and flood storms. There is a need to
provide comprehensive loop detection and prevention capabilities.
g. Plug-and-play and extensibility: Addition of new sites or
increasing capacity of existing sites should be achievable in a
completely plug-and-play fashion. This essentially means that
all control plane and forwarding states (L2 or L3 interconnect)
should be built in downstream allocation mode. MS-EVPN should
not pose any maximum requirements on the scale and capacity, it
should be easily extendable on those metrics.
3.2. MS-EVPN Interconnect concept and framework
EVPN with an IP-only interconnect is conceptualized as multiple site-
local EVPN control planes and IP forwarding domains interconnected
via a single common EVPN control and IP forwarding domain. Every
EVPN node is identified with a unique site-scope identifier. A site-
local EVPN domain consists of EVPN nodes with the same site
identifier. Border gateways on one hand are also part of site-
specific EVPN domain and on other hand part of a common EVPN domain
to interconnect with Border Gateways from other sites. Although a
border gateway has only a single explicit site-id (that of the site
it is a member of), it can be considered to also have a second
implicit site-id, that of the interconnect-domain which has
membership of all the BG's from all sites that are being
interconnected. This implicit site-id membership is derived by the
presence of the Border A-D route announced by that border gateway
node (please refer to Section 4.1 for details of the route format).
Sharma, et al. Expires January 18, 2018 [Page 5]
Internet-Draft Multi-site EVPN July 2017
These border gateways discover each other through EVPN Border A-D
routes and act as both control and forwarding plane gateway across
sites. This will facilitate site-specific nodes to visualize all
other sites to be reachable only via its Border Gateways.
____________________________
| ooo Encapsulation tunnel |
| X X X Leaf-spine fabric |
|__________________________|
Site A (EVPN site A) Site B (EVPN site B)
___________________________ ____________________________
| X X X X X X X X | | X X X X X X X X |
| X X X X | | X X X X |
| o o | | o o |
|BG-1 Site A BG-2 Site A| |BG-1 Site B BG-2 Site B|
___________________________ ____________________________
o o o o
o o o o
o o o o
o o o o
_______________________________________________
| |
| |
| Inter-site common EVPN site |
| |
| |
_______________________________________________
o o
o o
o o
o o
___________________________
| BG-1 Site C BG-2 Site C|
| X X X X |
| X X X X X X X X |
_____________________________
Site C (EVPN site C)
Figure 1
We describe the MS-EVPN deployment model using the topology above.
In the topology there are 3 sites, Site A, Site B, and Site C that
are inter-connected using IP. This entire topology is deemed to be
part of the same Data Center. In most deployments these sites can be
thought of as pods, which may span a rack, a row, or multiple rows in
Sharma, et al. Expires January 18, 2018 [Page 6]
Internet-Draft Multi-site EVPN July 2017
the data center, depending on the size of domain desired for scale
and fault and/or administrative isolation domains.
In this topology, site-local nodes are connected to each other by
iBGP EVPN peering and Border Gateways are connected by eBGP Muti-hop
EVPN peering via inter-site cloud. We explicitly spell this out to
ensure that we can re-use BGP semantics of route announcement between
and across the sites. There are other BGP mechanisms to instantiate
this and they are not discussed in this document. This implies that
each domain has its own AS number associated with it. In the
topology, only 2 border gateway per site are shown; this is more for
ease of illustration and explanation. The technology poses no such
limitation. As mentioned earlier, site-specific EVPN domain will
consists of only site-local nodes in the sites. A Border Gateway is
logically partitioned into site specific EVPN domain towards the site
and into common EVPN domain towards other sites. This facilitates
them to acts as control and forwarding plane gateway for forwarding
traffic across sites.
EVPN nodes with in a site will discover each other via regular EVPN
procedures and build site-local bidirectional VXLAN tunnels and
multi-destination trees from leaves to Border Gateways. Border
Gateways will discover each other by A-D routes with unique site-
identifiers (as described in Section 4.1) and build inter-site bi-
directional VXLAN tunnels and Multi-destination trees between them.
We thus build an end-to-end bidirectional forwarding path across all
sites by stitching (and not by stretching end-to-end) site-local
VXLAN tunnels with inter-site VXLAN tunnels.
In essence, a MS-EVPN fabric is proposed to be built in complete
downstream and modular fashion.
o Site-local tenant Bridging domains are interconnected ONLY via
Border Gateways with Bridging domains from other sites. Such
interconnect do not assume uniform mappings of mac-vrf VNI-VLAN
across sites and stitches such bridging domains in complete
downstream fashion using EVPN route advertisements.
o Site-local tenant Routing domains are interconnected ONLY via
Border Gateways with Routing domains from other sites. Such
interconnect do not assume uniform mappings of IP VRF-VNI across
sites and stitches such routing domains in complete downstream
fashion using EVPN route advertisements.
o Site-local tenant Flood domains are interconnected ONLY via Border
Gateways with flood domains from other sites. Such interconnect
do not assume uniform mappings of mac-vrf VNI across sites (or
mechanisms to build flood domains with in site) and stitches such
Sharma, et al. Expires January 18, 2018 [Page 7]
Internet-Draft Multi-site EVPN July 2017
flood domains in complete downstream fashion using EVPN route
advertisements. It however do not exclude possibility of building
an end-to-end flood domain, if desired for other reasons.
o Site-local tenant Multicast domains are interconnected ONLY via
Border Gateways by stitching tenant multicast trees via Border
Gateways.
o There could be potential use cases where border gateways should
behave as gateway for a subset of VXLAN tunnels (or VNIs) and an
underlay pass through for the rest. In other words, MS-EVPN
fabric can be built by stitching VXLAN tunnels at border gateways
while providing flexibility for other VXLAN (or VNI) tunnels to
pass through border gateways as native L3 underlay. The procedure
defined here provides flexibility to accommodate such use cases.
The above architecture satisfies the constraints laid out in
Section 3.1. For example, the size of a domain may be made dependent
on the route and next-hop scale that can be supported by the
deployment of the network nodes. There are no constraints on the
network that connects the nodes within the domain or across the
domains. In the event multicast capability is available and enabled,
the nodes can use those resources. In the event the underlay is
connected using unicast semantics, creation of ingress replication
lists ensure that multi-destination frames reach their destinations.
The domains may have their own deployment constraints, and the
overlay does not need any form of stretching. It is within the
control of the administrator with respect to containing fault
isolation domains. The automated discovery of the border nodes needs
no further configurations for existing deployed domains.
4. Multi-site EVPN Interconnect Procedures
In this section we describe the new functionalities in the Border
Gateway nodes for interconnecting EVPN sites within the DC.
4.1. Border Gateway Discovery
Border Gateway discovery will facilitate termination and re-
origination of inter-site VXLAN tunnels. Such discovery provides
flexibility for intra-site leaf-to-leaf VXLAN tunnels to co-exists
with inter-site tunnels terminating on Border Gateways. In other
words, border gateways discovery will facilitate learning of VXLAN
tunnel termination points while providing flexibility for such border
gateways to behave as native L3 transit for other VXLAN tunnels.
Border Gateways leverage the Type-1 A-D route type defined in RFC7432
[RFC7432]. Border Gateways in different sites will use Type-1 A-D
Sharma, et al. Expires January 18, 2018 [Page 8]
Internet-Draft Multi-site EVPN July 2017
routes with unique site-identifiers to announce themselves as
"Borders" to other border gateways. Nodes within the same site MUST
be configured or auto-generate to announce the same site-identifier.
Nodes that are not configured to be a border node will build VXLAN
tunnels only between each member of the site (which it is aware due
to the site-identifier that is additionally announced by them).
Border nodes will additionally build VXLAN tunnels between itself and
other border nodes that are announced with a different site
identifier. Note that the site-identifier is encoded within the ESI
label itself as described below.
In this specification, we define a new Ethernet Segment Type (as
described in Section 5 of RFC7432 [RFC7432]) that can be auto-
generated or configured by the operator.
o Type 6 (T=0x06) - This type indicates a multi-site router-ID ESI
Value that can be auto-generated or configured by the operator.
The ESI Value is constructed as follows:
* Site Identifier (4 octets): The Site Identifier and its value
MUST be encoded in the high-order 4 octets of the ESI Value
field. It is used to announce the site to which a border
gateway belongs. All border gateways MUST announce this value.
* Reserved (5 octets): The low-order octets of the ESI Value will
be set to 0 and will be ignored on receipt.
Along with the Type-1 A-D routes, border nodes MUST announce an ESI
label extended community with such A-D routes. They will also
announce the Type-4 Ethernet Segment routes with the ESI Label
extended community (defined in Section 7.5 of RFC7432 [RFC7432] and
shown below in Figure 2) in order to perform the Designated Forwarder
election among the Border gateways of the same site. These Type-4
routes and ESI Label extended community will carry a new bit in the
Flags field to indicate that the DF election is for Border gateways
as against the traditional Ethernet segment DF election. Routes with
such bits set are generated only by Border Gateways and imported by
all site-local leafs, site-local Border Gateways, and inter-site
Border gateways.
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type=0x06 | Sub-Type=0x01 | Flags(1 octet)| Reserved=0 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved=0 | ESI Label |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2
Sharma, et al. Expires January 18, 2018 [Page 9]
Internet-Draft Multi-site EVPN July 2017
The lowest order bit of Flags Octet in ESI Label extended community
has been defined to address multihoming with the Single-Active or
All-Active redundancy mode. In this specification, we define the the
Second Low order bit of Flag Octet in ESI Label extended Community.
It MUST be set to 1 by border gateway nodes if it is willing to take
part in the DF election for the VNI carried in the associated ESI
label.
Type-4 Ethernet Segment routes with the ESI Label extended community
will be leveraged to perform Designated Forwarder election among the
Border gateways of the same site. ESI label extended community
encoding will be same as described above for Type-1 A-D routes. Site
Identifier encoding in ESI label extended community will help border
gateways to negotiate DF winner with in a site and ignore Type-4
routes from other sites.
These Type 1 A-D routes are advertised with mac-VRF and IP-VRF RTs
depending on whether the VNI carried is a mac-VRF VNI or an IP VRF
VNI.
After a Border Gateway is provisioned, Border A-D routes will re-
originate EVPN routes after some delay interval. This will provide
sufficient time to learn Border A-D routes from Border Gateways of
different sites. Also, Border Gateways will not build VXLAN tunnels
between same-site Border Gateway members.
Once Border Gateways are discovered, any Type-2/Type-5 routes will be
terminated and re-originated on such Border Gateways. Similarly
Type-1, Type-3, Type-4 from other sites will be terminated at the
Border Gateways. (Also see section 8 for Type-1 handling for loop
detection and prevention across sites)
As has been defined in the specifications, Type 2, Type 3, and Type 5
routes carry downstream VNI labels. Type 3 routes will help to pre-
build VXLAN tunnels in the common EVPN domain for L2, L3, and Multi-
Destination traffic. As Border Gateway discovery is agnostic to
symmetric or downstream VNI provisioning, rewriting next-hop
attributes before re-advertising these routes from other sites to a
given site provides flexibility to keep different VNI-VLAN mapping in
different sites and still able to interconnect L3 and L2 domains.
All control plane and data plane states are interconnected in a
complete downstream fashion. For example, BGP import rules for a
Type 3 route should be able to extend a flood domain for a VNI and
flood traffic destined to advertised EVPN node should carry the VNI
which is announced in Type 3 route. Similarly Type 2, Type 5 control
and forwarding states should be interconnected in a complete
downstream fashion.
Sharma, et al. Expires January 18, 2018 [Page 10]
Internet-Draft Multi-site EVPN July 2017
4.2. Border Gateway Provisioning
Border Gateway nodes manage both the control-plane communications and
the data forwarding plane for any inter-site traffic. Border Gateway
functionality in an EVPN site SHOULD be enabled on more than one node
in the network for redundancy and high-availability purposes. Any
external Type-2/Type-5 routes that are received by the BGs of a site
are advertised to all the intra-site nodes by all the BGs. For
internal Type-2/Type-5 routes received by the BG's from the intra-
site nodes, all the BGs of a site would advertise them to the remote
BG's, so any L2/L3 known unicast traffic to internal destinations
could be sent to any one of the local BG's by remote sources. For
known L2 and L3 unicast traffic, all of the individual border gateway
nodes will behave either as single logical forwarding node or a set
of active forwarding nodes. This can be perceived by intra-site
nodes as multiple entry/exit points for inter-site traffic. For
unknown unicast/multi-destination traffic, there must be a designated
forwarder election mechanism to determine which node would perform
the primary forwarding role at any given point in time, to ensure
there is no duplication of traffic for any given flow (See
Section 4.2.1).
4.2.1. Border Gateway Designated Forwarder Election
In the presence of more than one Border Gateway nodes in a site,
forwarding of multi-destination L2 or L3 traffic both into the site
and out of the site needs to be carried out by a single node. Border
Gateways between same site will run a Designated forwarder election
per MAC-VRF VNI for multi-destination traffic across the site.
Border Type 4 Ethernet Segment routes received from border gateways
of different sites will not trigger DF election for the local side;
it will only be used to terminate VXLAN tunnels from such border
gateways.
Border Gateway DF election will leverage Type-4 EVPN route and
Ethernet segment DF election defined in RFC7432 [RFC7432]. Ethernet
segment and ESI label extended community will be encoded as explained
in Border Gateway discovery procedures. ESI label extended community
is MUST to be announced with such routes. DF election will ignore
such routes that are announced by border gateways which have a
different site identifier value in them.
This DF election could be done independently by each candidate border
gateway, by subjecting an ordered "candidate list" of all the BG's
present in the same site (identified by reception of the Border Type
4 Ethernet Segment route per-VNI with the same site-id as itself) to
a hash-function on a per-VNI basis. All the candidate border
gateways of the same site are required to use a uniform hash-function
Sharma, et al. Expires January 18, 2018 [Page 11]
Internet-Draft Multi-site EVPN July 2017
to yield the same result. Failure events which lead to a BG losing
all of its connectivity to the IP interconnect backbone should
trigger the BG to withdraw its Border Type 4 Ethernet Segment
route(s) and Type 1 A-D route, to indicate to other BG's of the same
site that it is no longer a candidate BG and to indicate BG's of
different sites that it is no longer a Border Gateway.
There are two modes proposed for Border gateway provisioning.
4.2.2. Anycast Border Gateway
In this mode all border gateways share same gateway IP and rewrite
EVPN next-hop attributes with a shared logical next-hop entity.
However, these Gateways will maintain unique gateway IP to facilitate
building IR trees from site-local nodes to forward Multi-Destination
traffic. EVPN Type 2, Type 5 routes will be advertised to the nodes
in the site from all border gateways and Border gateway will run DF
election per VNI for Multi destination traffic. Type 3 routes may be
advertised by the DF winner Border gateway for a given VNI so that
only DF will receive and forward inter-site traffic. It is also
possible to advertise and draw traffic by all Border Gateways at a
site to improve convergence properties of the network. In case of
multi-destination trees built by non-EVPN procedures (say PIM), all
border gateways will receive but only DF winner will forward traffic.
This mode is useful when there is no preference between different
border-gateways to forward traffic from different VNIs. Standard
data plane hashing of VXLAN header will load balance traffic among
Border Gateways.
Additionally, it is recommended that border gateway be enabled in the
Anycast mode wherein the BG functionality is available to the rest of
the network as a single logical entity (as in Anycast) for inter-site
communication. In the absence of capability for Anycast, the BG
could be enabled as individual gateways (Single-Active BG) wherein a
single node will perform the active BG role for a given flow at a
given time. As of now, the Border Gateway system mac of the other
border nodes belonging to the same site is expected to be configured
out-of-band.
4.2.3. Multi-path Border Gateway
In this mode, Border gateways will rewrite EVPN Next-hop attributes
with unique next-hop entities. This provides flexibility to apply
usual policies and pick per-VRF, per-VNI or per-flow primary/backup
border Gateways. Hence, an intra-site node will see each BG as a
next-hop for any external L2 or L3 unicast destination, and would
perform an ECMP path selection to load-balance traffic sent to
Sharma, et al. Expires January 18, 2018 [Page 12]
Internet-Draft Multi-site EVPN July 2017
external destinations. In case an intra-site node is not capable of
performing ECMP hash based path-selection (possibly some L2
forwarding implementations), the node is expected to choose one of
the BG's as its designated forwarder. EVPN Type 2, Type 5 routes
will be advertised to the nodes in the site from all border gateways
and Border gateway will run DF election per VNI for Multi destination
traffic. Type 3 routes will be advertised by DF winner Border
gateway for a given VNI so that only DF will receive and forward
inter-site traffic. It is also possible to advertise and draw
traffic by all Border Gateways at a site to improve convergence
properties of the network. In case of multi-destination trees built
by non-EVPN procedures (say PIM), all border gateways will receive
but only DF winner will forward traffic.
4.3. EVPN route processing at Border Gateway
Border gateways will build EVPN peering on processing Type 4 Ethernet
Segment route from other Border gateways. Route targets MAY be auto-
generated based on some site-specific identifier. If BGP AS number
is used as site-specific identifier, import and export route targets
can be auto-generated as explained in RFC7432 [RFC7432]. This will
facilitate site-local nodes to import routes from other nodes in same
site and from its Border Gateways. Also this will prevent routes
exchange between nodes from different sites. However, in this auto-
generated scheme, import mechanism on Border Gateway should be
relaxed to allow import of Border A-D routes from other border
gateways. As defined in section Section 4.1 , ESI label extended
community will carry a bit to facilitate the import of Type 1 A-D
routes for Border gateway Discovery. Also the routes which are
imported at Border Gateway and re-advertised should implement a
mechanism to avoid looping of updates should they come back at Border
Gateways.
Type 2/Type 5 EVPN routes will be rewritten with Border Gateway IP,
Border Gateway system mac as next-hop and re-advertised. Only EVPN
routes received from discovered Border gateways with different site
identifiers will be rewritten and re-advertised. This will avoid
rewriting every EVPN update if border gateways are also acting as
Route reflector (RR) for site-local EVPN peering. Also this will
help in interoperating MS-EVPN fabric with sites which do not have
Border Gateway functionality.
There are few mechanisms suggested below for re-advertising these
inter-site routes to a site and provide connectivity of inter-site
hosts and subnets.
o All routes everywhere : In this mode all inter-site EVPN Type2/
Type5 routes are downloaded on site-local leafs from Border
Sharma, et al. Expires January 18, 2018 [Page 13]
Internet-Draft Multi-site EVPN July 2017
Gateways. In other words, every leaf in the MS-EVPN fabric will
have routes from every intra-site and inter-site leafs. This
mechanism is best-fit for the scenarios where inter-site traffic
is as voluminous as intra-site flow traffic. Also this mechanism
preserves usual glean processing, silent host discovery and
unknown traffic handling at the leafs.
o Default bridging and routing to Border Gateways : In this mode,
all received inter-site EVPN Type 2/Type 5 routes will be
installed only at Border Gateways and will not be advertised in
the site. Border Gateways will inject Type 5 default routes to
site-local nodes and avoid re-advertising Type 2 from other sites.
This mode provides scaling advantage by not downloading all inter-
site routes to every leaf in MS-EVPN fabric. This mechanism MAY
require glean processing and unknown traffic handling to be
tailored to provide efficient traffic forwarding.
o Site-scope flow registry and discovery : This mechanism provides
scaling advantage by downloading inter-site routes on-demand. It
provides scaling advantages of default routing with out need to
tailor glean processing and unknown traffic handling at the leafs.
Leafs will create on-demand flow registry on their border Gateways
and based on this flow registry border gateways will advertise
Type 2 routes in a site. In other words, assuming that we have a
trigger to send the EVPN routes that are needed by the site for
conversational learning from the Border Gateways, we can optimize
on the control plane state that is needed at the various leaf
nodes. Hardware programming can be further optimized based on
actual conversations needed by the leaf, as opposed to to the ones
needed by the site. We will describe a mechanism in the appendix
with respect to ARP processing at the Border Gateway.
Type 3 routes will be imported and processed on border gateways from
other border gateways but MUST NOT be advertised again. In both
modes (Anycast and Multipath), Type 3 routes will be generated
locally and advertised by DF winner Border Gateway with unique
gateway IP. This will facilitate building fast converging flood
domain connectivity inter-site and intra-site and on same time
avoiding duplicate traffic by electing DF winner to forward multi-
destination inter-site traffic.
4.4. Multi-Destination tree between Border Gateways
The procedures described here recommends building an Ingress
Replication (IR) tree between Border Gateways. This will facilitate
every site to independently build site-specific Multi destination
trees. Multi-destination end-to-end trees between leafs could be PIM
(site 1) + IR (between border Gateways) + PIM(site 2) or IR-IR-IR or
Sharma, et al. Expires January 18, 2018 [Page 14]
Internet-Draft Multi-site EVPN July 2017
PIM-IR-IR. However this does not rule out using IR-PIM-IR or end-to-
end PIM to build multi-destination trees end-to-end.
Border Gateways will generate Type 3 routes with unique gateway IP
and advertise to Border Gateways of other sites. These Type 3 routes
will help in building IR trees between border gateways. However only
DF winner per VNI will forward multi-destination traffic across
sites.
As Border Gateways are part of both site-specific and inter-site
Multi-destination IR trees, split-horizon mechanism will be used to
avoid loops. Multi-destination tree with Border gateway as root to
other sites (or Border-Gateways) will be in a separate horizon group.
Similarity Multi-destination IR tree with Border Gateway as root to
site-local nodes will be in another split horizon group.
If PIM is used to build Multi-Destination trees in site-specific
domain, all Border gateway will join such PIM trees and draw multi-
destination traffic. However only DF Border Gateway will forward
traffic towards other sites.
4.5. Inter-site Unicast traffic
As site-local nodes will see all inter-site EVPN routes via Border
Gateways, VXLAN tunnels will be built between leafs and site-local
Border Gateways and Inter-site VXLAN tunnels will be built between
Border gateways in different sites. An end-to-end VXLAN
bidirectional forwarding path between inter-site leafs will consist
of VXLAN tunnel from leaf (say Site A) to its Border Gateway, another
VXLAN tunnel from Border Gateway to Border Gateway in another site
(say site B) and Border gateway to leaf (in site B). Such
arrangement of tunnels are very scalable as a full mesh of VXLAN
tunnels across inter-site leafs is substituted by combination of
intra-site and inter-site tunnels.
L2 and L3 unicast frames from site-local leafs will reach border
gateway using VXLAN encapsulation. At Border gateway, VXLAN header
is stripped out and another VXLAN header is pushed to sent frames to
destination site Border Gateway. Destination site Border gateway
will strip off VXLAN header and push another VXLAN header to send
frame to the destination site leaf.
4.6. Inter-site Multi-destination traffic
Multi-destination traffic will be forwarded from one site to other
site only by DF for that VNI. As frames reach Border Gateway from
site-local nodes, VXLAN header will be popped and another VXLAN
header (derived from downstream Type3 EVPN routes) will be pushed to
Sharma, et al. Expires January 18, 2018 [Page 15]
Internet-Draft Multi-site EVPN July 2017
forward frame to destination site border gateway. Similarly
destination site Border Gateway will strip off VXLAN header and
forward frame after pushing another VXLAN header towards the
destination leaf.
As explained in Section 4.4, split horizon mechanism will be used to
avoid looping of inter-site multi-destination frames.
4.7. Host Mobility
Host movement handling will be same as defined in RFC7432 [RFC7432].
When host moves, EVPN Type 2 routes with updated sequence number will
be propagated to every EVPN node. When a host moves inter-site, only
Border gateways may see EVPN updates with both next-hop attributes
and sequence number changes and leafs may see updates only with
updated sequence numbers. However in other cases both Border gateway
and leafs may see next-hop and sequence number changes.
5. Convergence
5.1. Fabric to Border Gateway Failure
If a Border Gateway is lost, Border gateway next-hop will be
withdrawn for Type 2 routes. Also per-VNI DF election will be
triggered to chose new DF. DF new winner will become forwarder of
Multi-destination inter-site traffic.
5.2. Border Gateway to Border Gateway Failures
In case where inter-site cloud has link failures, direct forwarding
path between border gateways can be lost. In this case, traffic from
one site can reach other site via border gateway of an intermediate
site. However this will be addressed like regular underlay failure
and traffic terminations end-points will still stay same for inter-
site traffic flows.
6. Interoperability
The procedures defined here are only for Border Gateways. Therefore
other EVPN nodes in the network should be RFC7432 [RFC7432] compliant
to operate in such topologies.
As the procedures described here are applicable only after receiving
Border A-D route, if other domains are connected which are not
capable of such multi-site gateway model, they can work in regular
EVPN mode. The exact procedures will be detailed in a future version
of the draft.
Sharma, et al. Expires January 18, 2018 [Page 16]
Internet-Draft Multi-site EVPN July 2017
The procedures here provides flexibility to connect non-EVPN VXLAN
sites by provisioning Border Gateways on such sites and inter-
connecting such Border Gateways by Border Gateways of other sites.
Such Border Gateways in non-EVPN VXLAN sites will play dual role of
EVPN gateway towards common EVPN domain and non-EVPN gateway towards
non-EVPN VXLAN site.
7. Isolation of Fault Domains
Isolation of network defects requires policies like storm control,
security ACLs etc to be implemented at site boundaries. Border
gateways should be capable of inspecting inner payload of packets
received from VXLAN tunnels and enforce configured policies to
prevent defects percolating from one part to rest of the network.
8. Loop detection and Prevention
Customer L2 network deploy some flavor of Spanning tree protocol
(STP) to detect and prevent loops. Also Customer L2 segments deploy
some form of multihoming to connect L2 segments to EVPN nodes or
VTEPs. Such multihoming connectivity takes care of preventing L2
loops by multihoming mechanisms at the VTEPs. However
misconfiguration or other unexpected events in the customer L2
segments can lead to inconsistent connectivity to VTEPs leading to L2
loops.
This specification leverages Type-2 encoding of ESI label extended
community in Type-1 A-D route type as defined in RFC7432 [RFC7432] to
exchange STP root bridge information among VTEPs. When VTEPs
discovers same STP root bridge from VTEPs which are not multihoming
VTEP peers for a given L2 segment, it signals possibility of loop and
forwarding engine prunes VNI from the server facing ports to cut down
loop. As root bridge conflict across VTEPs is resolved, forwarding
engine will reestablish VNI on the server facing ports. This
mechanism can coexist with other mechanism like fast mac move
detections and is recommended as additional protection to prevent L2
loops poised by inconsistent connectivity of customer L2 segments to
L3 MS-EVPN fabric.
Such route advertisement should be originated by every EVPN node and
terminated at the border gateways. However if there is possibility
of server facing L2 segments to be stretched across sites, such
routes can be terminated and re-originated with out modifications to
be received by every other EVPN node. This behavior is exception to
usual guideline of terminating (and re-originating if required) all
routes types at border gateway. However such exception will help in
detecting loops if a customer L2 segment is inconsistently connected
to VTEPs in different sites.
Sharma, et al. Expires January 18, 2018 [Page 17]
Internet-Draft Multi-site EVPN July 2017
Also as defined in Section 4.2.1 border gateways uses mechanisms like
Designated Forwarder and Split Horizon forwarding to prevent inter-
site loops in this network.
9. MVPN with Multi-site EVPN
BGP based MVPN as defined in RFC6513 [RFC6513] and RFC6514 [RFC6514]
will coexist with Multisite-EVPN with out any changes in route types
and encodings defined for MVPN route types in these RFCs. Route
Distinguisher and VRF route import extended communities will be
attached to MVPN routes as defined in the BGP MVPN RFCs. Import and
Export Route targets will be attached to MVPN routes either by Auto-
generating them from VNI or by explicit configuration per MVPN.
Since, BGP MVPN RFC adapts to any VPN address family to provide RPF
information to build C-Multicast trees, EVPN route types will be used
to provide required RPF information for Multicast sources in MVPNs.
In order to follow segmentation model of Multisite-EVPN, following
procedures are recommended to build provider and customer multicast
trees between sources and receivers across sites.
9.1. Inter-Site MI-PMSI
As defined in above mentioned MVPN RFCs, I-PMSI A-D routes are used
to signal a provider tunnel or MI-PMSI per MVPN. Multisite-EVPN
recommends EVPN Type-3 routes to build such MI-PMSI provider tunnel
per VPN between Border Gateways of different sites. Every MVPN node
will use its unique router identifier to build these MI-PMSI provider
tunnels. In Anycast Border gateway model also, these MI-PMSI
provider tunnels are built using unique router identifier of Border
gateways. In similar fashion, these Type-3 routes can be used to
build MI-PMSI provider tunnel per MVPN with in sites.
9.2. Stitching of customer multicast trees across sites
All Border Gateways will rewrite next-hop and re-originate MVPN
routes received from other sites to local site and from local site to
other sites. Therefore customer Multicast trees will be logically
built end-to-end across sites by stitching these trees via Border
gateways. A C-multicast join route (say Type 7 MVPN) will follow
EVPN RPF path to build C-multicast tree from leaf in a site to its
Border gateway and to source site leafs via source site Border
Gateways. Similarly Source-Active A-D MVPN route (Type 5 MVPN) will
be rewritten with next-hop and re-originated via Border gateways so
that source C-Multicast trees will be stitched via Border gateways.
Sharma, et al. Expires January 18, 2018 [Page 18]
Internet-Draft Multi-site EVPN July 2017
9.3. RPF resolution in Anycast Border Gateway model
In Ancast Border Gateway model, multicast sources will be re-
advertised in EVPN routes after rewritten by anycast Border gateway
nexthop. However VRF route import (VRI) extended community tied with
these EVPN routes will be advertised by unique nexthop (typically
local loopback IP) from every such Border Gateway. Therefore
receiver leafs should use these VRIs to uniquely identify one member
of the Anycast Border Gateway set to build tenant MVPN trees.
Similarly a BG in the receiver site should pick one member of the
anycast-set of BGs in the source-site as its upstream RPF neighbor,
using the unique next hop by referring to the next hop advertised in
the VRI extended community.
9.4. Inter-Site S-PMSI
As defined in BGP MVPN RFCs, S-PMSI A-D routes (Type 3 MVPN) will be
used to signal selective PMSI trees for high bandwidth C-Multicast
streams. These S-PMSI A-D routes will be signaled across sites via
Border gateways rewriting next-hop and re-originating them to other
sites. PMSI tunnel attribute in re-originated S-PMSI routes will be
adjusted to the provide tunnel types between Border gateways across
sites.
10. Acknowledgements
This authors would like to thank Max Ardica, Lukas Krattiger, Anuj
Mittal, Lilian Quan, Veera Ravinutala, Murali Garimella, Tarun Wadhwa
for their review and comments.
11. IANA Considerations
TBD.
12. Security Considerations
TBD.
13. References
13.1. Normative References
[DCI-EVPN-OVERLAY]
A. Sajassi et. al., "A Network Virtualization Overlay
Solution using EVPN", 2017, <https://tools.ietf.org/html/
draft-ietf-bess-evpn-overlay-02>.
Sharma, et al. Expires January 18, 2018 [Page 19]
Internet-Draft Multi-site EVPN July 2017
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>.
[RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
2015, <http://www.rfc-editor.org/info/rfc7432>.
13.2. Informative References
[RFC6513] Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in MPLS/
BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513, February
2012, <http://www.rfc-editor.org/info/rfc6513>.
[RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
Encodings and Procedures for Multicast in MPLS/BGP IP
VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012,
<http://www.rfc-editor.org/info/rfc6514>.
[RFC7209] Sajassi, A., Aggarwal, R., Uttaro, J., Bitar, N.,
Henderickx, W., and A. Isaac, "Requirements for Ethernet
VPN (EVPN)", RFC 7209, DOI 10.17487/RFC7209, May 2014,
<http://www.rfc-editor.org/info/rfc7209>.
Appendix A. Additional Stuff
TBD.
Authors' Addresses
Rajesh Sharma (editor)
Cisco Systems
170 W Tasman Drive
San Jose, CA
USA
Email: rajshr@cisco.com
Ayan Banerjee
Cisco Systems
170 W Tasman Drive
San Jose, CA
USA
Email: ayabaner@cisco.com
Sharma, et al. Expires January 18, 2018 [Page 20]
Internet-Draft Multi-site EVPN July 2017
Raghava Sivaramu
Cisco Systems
170 W Tasman Drive
San Jose, CA
USA
Email: raghavas@cisco.com
Ali Sajassi
Cisco Systems
170 W Tasman Drive
San Jose, CA
USA
Email: sajassi@cisco.com
Sharma, et al. Expires January 18, 2018 [Page 21]