Network Working Group X. Xu
Internet Draft Huawei Technologies
Category: Standards Track
Expires: January 2, 2011 July 2, 2010
Virtual Subnet: A Scalable Data Center Network Architecture
draft-xu-virtual-subnet-00
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on January 2, 2011.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.
Abstract
This document proposes a scalable data center network architecture
that, as an alternative to Spanning Tree Protocol (STP) bridged
networks, uses a Layer 3 routing infrastructure to provide scalable
virtual Layer 2 network connectivity services.
Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Table of Contents
1. Problem Statement
2. Terminology
3. Design Goals
4. Architecture Description
   4.1. Unicast
        4.1.1. Communications within a Service Domain
        4.1.2. Communications between Service Domains
   4.2. Multicast/Broadcast
   4.3. ARP Cache
   4.4. ARP Proxy
   4.5. DHCP Relay Agent
5. Conclusions
6. Limitations
7. Future Work
8. Security Considerations
9. IANA Considerations
10. Acknowledgements
11. References
   11.1. Normative References
   11.2. Informative References
Authors' Addresses
1. Problem Statement
With the popularity of cloud computing services, today's data
centers keep growing in scale, and data centers containing tens to
hundreds of thousands of servers are not uncommon. In most data
centers, server clusters consisting of large numbers of servers are
widely used for load balancing. Currently, many popular cluster
applications require that the servers within a given cluster be
located on the same IP subnet. Besides, Virtual Machine (VM)
migration technologies are widely used to achieve service agility,
which requires that VMs be able to migrate to any physical server
while keeping the same IP address. In addition, many data center
applications (e.g., server clustering and grid computing) result in
increased server-to-server traffic. Hence, a scalable and high-
bandwidth Layer 2 network connectivity service is desired for the
interconnection of servers within huge data centers.
Unfortunately, today's data center network architecture, which
relies on Spanning Tree Protocol (STP) bridging, cannot address the
above challenges (i.e., very large numbers of servers and high-
bandwidth demands for server-to-server interconnection) facing those
large-scale data centers. First, STP cannot maximize the utilization
of the total network bandwidth to provide enough capacity between
servers, since it calculates only a single forwarding tree for all
connected servers and cannot support Equal-Cost Multipath (ECMP).
Second, the scalability of the forwarding table becomes a big
concern as an existing large Layer 2 network scales even larger,
since STP bridge forwarding depends on flat MAC addresses. Third,
the impact of broadcast storms on network performance becomes far
more serious and unpredictable in a continually growing large-scale
STP bridged network.
2. Terminology
This memo makes use of the terms defined in [RFC4364], [MVPN],
[RFC2236] and [RFC2131]. The following term is specific to this
document:

- Service Domain: A group of servers dedicated to a given service.
  In most cases, the servers of a service domain are located on the
  same IP subnet.
3. Design Goals
To overcome the limitations of the STP bridged network, a new
network architecture for data centers SHOULD meet the following
design objectives:
- Bandwidth Utilization Maximization
To provide enough bandwidth between the servers, server-to-server
traffic SHOULD always travel along the shortest path, and Equal-Cost
Multipath (ECMP) SHOULD be supported.
- Layer 2 Semantics
To be backward compatible with current data center applications
(e.g., clustering, VM migration, etc.), the servers of a given
service domain SHOULD be connected as if they were on a Local Area
Network (LAN) or an IP subnet.
- Domain Isolation
For performance isolation and security reasons, servers of different
service domains SHOULD be isolated just as if they were on different
VLANs.
- Forwarding Table Scalability
To accommodate tens to hundreds of thousands of servers within a
given data center, the forwarding table of each forwarding element
in the data center network SHOULD scale to that size.
- Broadcast Storm Suppression

To reduce the impact of broadcast storms on network performance,
broadcast/multicast traffic SHOULD be confined to a very small scope
of the network.
4. Architecture Description
Generally speaking, this new data center network architecture
partially reuses MPLS/BGP VPN technology [RFC4364] to construct a
scalable, large IP subnet across the MPLS/IP backbone network of a
data center.

The following sections describe the architecture in detail.
4.1. Unicast
4.1.1. Communications within a Service Domain
BGP/MPLS VPN technology with some extensions, as an alternative to
STP bridging, is deployed in the data center network. Hosts, as
Customer Edge (CE) devices, are attached to Provider Edge (PE)
routers directly or through a Layer 2 switch. Here, different
service domains are mapped to distinct VPNs so as to achieve domain
isolation, and different sites of a particular VPN are configured
with an identical IP subnet to achieve service agility; that is to
say, all PEs attached to the same VPN are configured with an
identical IP subnet on the corresponding Virtual Routing and
Forwarding (VRF) attachment circuits. PEs automatically generate
connected host routes for each VRF according to the Address
Resolution Protocol (ARP) table of the corresponding VPN, and then
exchange their connected host routes with each other via BGP. ARP
proxy is enabled for each VPN on the PEs; thus, upon receiving an
ARP request from a local host for a remote host, the PE, acting as
an ARP proxy, returns one of its own MAC addresses as a response.
+--------------------+
+-----------------+ | | +------------------+
|VPN_A:10/8 | | | |VPN_A:10/8 |
| | | | | |
| +-----+ ++---+-+ +-+---++ +-----+ |
| | A +----+ PE-1 | | PE-2 +----+ B | |
| +-----+ ++-+-+-+ +-+---++ +-----+ |
| 10.1.1.1/32 | | | IP/MPLS Backbone | | 10.1.1.2/32 |
+-----------------+ | | | +------------------+
| +--------------------+
|
|
V
+-------+------------+--------+
|VRF ID |Destination |Next Hop|
+-------+------------+--------+
| VPN_A |10.1.1.1/32 | Local |
+-------+------------+--------+
| VPN_A |10.1.1.2/32 | PE-2 |
+-------+------------+--------+
Figure 1: Intra-domain Communication Example
As shown in Figure 1, host A sends an ARP request for host B before
communicating with B. Upon receiving this ARP request, PE-1 looks up
the associated VRF to find the host route for B. If one is found,
PE-1, acting as an ARP proxy, returns one of its own MAC addresses
as a response to that ARP request; otherwise, the PE SHOULD NOT send
any ARP response. After obtaining the ARP response from PE-1, A
sends an IP packet destined for B whose destination MAC address is
PE-1's MAC address. Upon receiving this IP packet, PE-1, acting as
an ingress PE, tunnels the packet towards PE-2 according to the
associated VRF. PE-2, as an egress PE, in turn forwards the packet
to B according to the associated VRF. In summary, this is a special
BGP/MPLS VPN application scenario in which the connected host routes
of each VRF are automatically generated according to the ARP table
of the corresponding VPN and are exchanged among the PEs attached to
the same VPN. A minimal sketch of this route origination follows.
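The following Python sketch is illustrative only: it shows how a PE
might derive connected /32 host routes from its per-VPN ARP tables
and hand them to BGP for advertisement. All function and variable
names here are hypothetical and not part of any specification.

   def generate_host_routes(vrf_arp_tables, bgp_advertise):
       """vrf_arp_tables: dict mapping VRF ID -> {host IP: host MAC};
       bgp_advertise: callback advertising a route to remote PEs."""
       for vrf_id, arp_table in vrf_arp_tables.items():
           for host_ip in arp_table:
               # Each locally learned host becomes a connected /32
               # host route in the corresponding VRF.
               bgp_advertise((vrf_id, host_ip + "/32", "Local"))

   # Example corresponding to Figure 1: PE-1 learns host A via ARP.
   generate_host_routes({"VPN_A": {"10.1.1.1": "aa:bb:cc:00:00:01"}},
                        bgp_advertise=print)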
4.1.2. Communications between Service Domains
For hosts located in different VPNs (i.e., service domains) to
communicate with each other, the VPNs SHOULD NOT have any
overlapping address spaces. Besides, each VPN SHOULD be configured
with at least one default route. For example, the default gateway
router of a given VPN is connected to a PE attached to that VPN, on
which a default route SHOULD be configured in the associated VRF and
then advertised to the other PEs of that VPN.
As shown in Figure 2, PE-1 and PE-3 are attached to one VPN (i.e.,
VPN A) while PE-2 and PE-4 are attached to another VPN (i.e., VPN B).
Host A and its default gateway router (i.e., GW-1) are connected to
PE-1 and PE-3, respectively. Host B and its default gateway router
(i.e., GW-2) are connected to PE-2 and PE-4, respectively. A sends
an ARP request for its default gateway (i.e., 10.1.1.1) before
communicating with B. Upon receiving this ARP request, PE-1 looks up
the associated VRF to find the host route for the default gateway.
If one is found, PE-1, as an ARP proxy, returns one of its own MAC
addresses as a response. After obtaining the ARP response, A
constructs an IP packet destined for B, encapsulates it in an
Ethernet frame whose destination MAC address is PE-1's MAC, and then
sends it out. Upon receiving this packet, PE-1, as an ingress PE,
tunnels it towards PE-3 according to the best-match route for that
packet (i.e., the default route) in the associated VRF. PE-3, as an
egress PE, in turn forwards this packet towards the default gateway
router (i.e., GW-1) according to the best-match route for that
packet (i.e., the configured default route). After the packet
arrives at the default gateway router of B (i.e., GW-2) via hop-by-
hop forwarding, GW-2 sends an ARP request for B. PE-4, as an ARP
proxy, returns one of its own MAC addresses as a response. GW-2
encapsulates the IP packet within an Ethernet frame whose
destination MAC address is PE-4's MAC address and then sends it out.
Upon receiving this packet, PE-4, as an ingress PE, tunnels it
towards PE-2 according to the associated VRF. PE-2, as an egress PE,
in turn forwards it towards B according to the associated VRF. A
minimal lookup sketch follows Figure 2.
+-------+------------+--------+ +-------+------------+--------+
|VRF_ID |Destination |Next Hop| |VRF_ID |Destination |Next Hop|
+-------+------------+--------+ +-------+------------+--------+
| VPN_A |10.1.1.2/32 | PE-1 | | VPN_B |20.1.1.2/32 | PE-2 |
+-------+------------+--------+ +-------+------------+--------+
| VPN_A |10.1.1.1/32 | Local | | VPN_B |20.1.1.1/32 | Local |
+-------+------------+--------+ +-------+------------+--------+
| VPN_A | 0.0.0.0/0 |10.1.1.1| | VPN_B | 0.0.0.0/0 |20.1.1.1|
+-------+------------+--------+ +-------+------------+--------+
^ ^
| +--------------------+ |
| | IP Network | |
| +----+-----------+---+ |
| +---+--+ +---+--+ |
| | GW-1 | | GW-2 | |
| +---+--+ +--+---+ |
|VPN A:10.1.1.1/32| |VPN B:20.1.1.1/32|
| | | |
+-------------+---+--+ +--+---+-------------
+-+ PE-3 +----+ PE-4 +-+
+-----------------+ | +------+ +------+ | +------------------+
|VPN A:10/8 | | | |VPN_B:20/8 |
| | | | | |
| +-----+ ++--+--+ +--+--++ +-----+ |
| | A +----+ PE-1 | | PE-2 +----+ B | |
| +-----+ ++-++--+ +--++-++ +-----+ |
| 10.1.1.2/32 | || IP/MPLS Backbone || | 20.1.1.2/32 |
+-----------------+ || || +------------------+
|+----------------------+|
| |
V V
+-------+------------+--------+ +-------+------------+--------+
|VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop|
+-------+------------+--------+ +-------+------------+--------+
| VPN_A |10.1.1.2/32 | Local | | VPN_B |20.1.1.2/32 | Local |
+-------+------------+--------+ +-------+------------+--------+
| VPN_A |10.1.1.1/32 | PE-3 | | VPN_B |20.1.1.1/32 | PE-4 |
+-------+------------+--------+ +-------+------------+--------+
| VPN_A | 0.0.0.0/0 | PE-3 | | VPN_B | 0.0.0.0/0 | PE-4 |
+-------+------------+--------+ +-------+------------+--------+
Figure 2: Inter-domain Communication Example
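The inter-domain forwarding above relies on an ordinary longest-
prefix-match lookup in the VRF, with the advertised default route
catching destinations outside the local subnet. The following Python
sketch is illustrative only; the table contents mirror PE-1's VRF in
Figure 2, and the function name is hypothetical.

   import ipaddress

   def best_match(vrf, dest_ip):
       """vrf: list of (prefix, next hop) tuples; returns the next
       hop of the longest matching prefix."""
       dest = ipaddress.ip_address(dest_ip)
       matches = [(ipaddress.ip_network(p), nh) for p, nh in vrf
                  if dest in ipaddress.ip_network(p)]
       # The most specific match wins; 0.0.0.0/0 catches the rest.
       return max(matches, key=lambda m: m[0].prefixlen)[1]

   vpn_a_on_pe1 = [("10.1.1.2/32", "Local"),
                   ("10.1.1.1/32", "PE-3"),
                   ("0.0.0.0/0",   "PE-3")]
   print(best_match(vpn_a_on_pe1, "20.1.1.2"))  # -> PE-3 (default)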
4.2. Multicast/Broadcast
The MVPN technology [MVPN], especially the Protocol Independent
Multicast (PIM) tree option with some extensions, is partially
reused here to support link-local multicast between hosts of a given
service domain (i.e., VPN). That is to say, the customer multicast
group addresses of a given VPN are mapped 1:1 or n:1 to the provider
multicast groups dedicated for that VPN when transporting the
customer multicast traffic across the backbone. For broadcast, a
dedicated provider multicast group is reserved for carrying
broadcast traffic across the IP/MPLS backbone; in other words,
customer broadcast is processed on PEs as a special customer
multicast group. Unless otherwise mentioned, the term "customer
multicast" below covers both customer multicast and broadcast. All
PEs attached to a given VPN SHOULD maintain identical mappings from
customer multicast group addresses to provider multicast group
addresses. To isolate the customer multicast traffic of different
VPNs traveling through the backbone, different VPNs SHOULD be
assigned distinct, non-overlapping provider multicast group address
ranges.
+--------------------+
+-----------------+ | | +------------------+
|VPN_A:10/8 | | | |VPN_A:10/8 |
| | | | | |
| +-----+ E0++---+-+ +-+---++ +-----+ |
| | A +----+ PE-1 | | PE-2 +----+ B | |
| +-----+ ++-+-+-+ +-+---++ +-----+ |
| 10.1.1.1/32 | | | IP/MPLS Backbone | | 10.1.1.2/32 |
+-----------------+ | | | +------------------+
| +--------------------+
|
|
V
+-------+---------------+----------+-------+--------+
|VRF ID | Customer G |Provider G| To PE | From PE|
+-------+---------------+----------+-------+--------+
| VPN_A | 224.1.1.1/32 | 239.1.1.1| True | True |
+-------+---------------+----------+-------+--------+
| VPN_A | 224.0.0.0/4 | 239.1.1.2| True | True |
+-------+---------------+----------+-------+--------+
| VPN_A |255.255.255.255| 239.1.1.3| True | True |
+-------+---------------+----------+-------+--------+
Figure 3: Link-local Multicast/Broadcast Communication Example
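The following Python sketch is illustrative only: it reproduces the
customer-to-provider group mapping of Figure 3, with the broadcast
address handled as a special customer multicast group. The function
name is hypothetical.

   import ipaddress

   def provider_group(mapping, customer_dest):
       """mapping: most-specific-first list of (customer group or
       prefix, provider group); returns the provider group or None."""
       dest = ipaddress.ip_address(customer_dest)
       for cust, prov in mapping:
           if dest in ipaddress.ip_network(cust):
               return prov
       return None  # not a multicast/broadcast destination

   vpn_a_map = [("224.1.1.1/32",       "239.1.1.1"),  # 1:1 mapping
                ("255.255.255.255/32", "239.1.1.3"),  # broadcast
                ("224.0.0.0/4",        "239.1.1.2")]  # n:1 catch-all
   print(provider_group(vpn_a_map, "224.0.0.2"))     # -> 239.1.1.2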
The multicast forwarding entries can be configured manually by the
network operators or generated dynamically according to the Internet
Group Management Protocol (IGMP) Membership Report/Leave messages
received from CEs or remote PEs. Ingress PEs forward customer
multicast packets to other PEs (i.e., egress PEs) of the same VPN
via a provider multicast distribution tree, according to the best-
match multicast forwarding entry of the associated VRF, provided
that the "To PE" field of that entry is set to True. Otherwise
(i.e., if that field is set to False), ingress PEs are not allowed
to forward the customer multicast packets to remote egress PEs.
Egress PEs forward customer multicast packets received from the
provider multicast distribution tree to CEs via VRF attachment
circuits, according to the best-match multicast forwarding entry of
the associated VRF, provided that the "From PE" field of that entry
is set to True. Otherwise (i.e., if that field is set to False),
egress PEs are not allowed to forward the customer multicast packets
to CEs (see the sketch below). For IGMP messages to be conveyed
successfully across the IP/MPLS backbone, multicast forwarding
entries for special multicast groups, including the all-routers
group (i.e., 224.0.0.2) and the all-systems group (i.e., 224.0.0.1),
SHOULD be configured in the corresponding VRF in advance. Besides,
according to the IGMP specification [RFC2236], Group-Specific Query
messages are sent to the group being queried and Membership Report
messages are sent to the group being reported. Upon receiving these
packets from CEs, the PE SHOULD convey them over the provider
multicast distribution tree dedicated for the all-systems group
(224.0.0.1) of the given VRF. To avoid IGMP Membership Report
suppression, Membership Report messages received from PEs or CEs
SHOULD NOT be forwarded to CEs. As an alternative to conveying IGMP
Report/Leave messages through the provider multicast distribution
tree, customer multicast routing information exchange among PEs can
also be achieved by using the approaches defined in [MVPN-BGP].
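A minimal Python sketch of that "To PE"/"From PE" gating, with
hypothetical names, might look as follows; "entry" stands for the
best-match multicast forwarding entry already found in the VRF.

   def forward_customer_multicast(entry, packet, from_backbone,
                                  send_to_pes, send_to_ces):
       """entry: dict with 'provider_g', 'to_pe', 'from_pe' fields."""
       if not from_backbone and entry["to_pe"]:
           # Ingress PE: encapsulate toward the remote PEs over the
           # provider multicast distribution tree.
           send_to_pes(packet, group=entry["provider_g"])
       elif from_backbone and entry["from_pe"]:
           # Egress PE: deliver to local CEs via the VRF attachment
           # circuits.
           send_to_ces(packet)
       # Otherwise the flags forbid forwarding; drop the packet.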
As shown in Figure 3, upon receiving a multicast/broadcast packet
from a CE (e.g., host A), if the packet is destined for 224.1.1.1,
PE-1 will encapsulate it into a provider multicast packet with a
destination IP address of 239.1.1.1; if it is destined for an IP
multicast address other than 224.1.1.1, PE-1 will encapsulate it
into a provider multicast packet with a destination IP address of
239.1.1.2; and if it is a broadcast packet, PE-1 will encapsulate it
into a provider multicast packet with a destination IP address of
239.1.1.3, which is dedicated for conveying the broadcast traffic of
that VPN.

The customer multicast forwarding entries, whether configured
manually or learned automatically from the IGMP Membership Reports
sent by local CEs, will automatically trigger PEs to join the
corresponding provider multicast groups in the MPLS/IP backbone. For
example, if PE-2 receives an IGMP Membership Report for a given
customer multicast group (e.g., 224.1.1.1) from a local CE (e.g.,
host B), it SHOULD automatically join the provider multicast group
(i.e., 239.1.1.1) corresponding to that customer multicast group.
4.3. ARP Cache
After rebooting, a PE SHOULD send ARP requests for every IP address
within the subnet of each VPN. Thus, after a round of ARP requests,
the PE will know all local hosts according to the received ARP
responses. After that, the PE will send unicast ARP requests to the
already-learned local hosts to keep the learned ARP entries from
expiring. When receiving a gratuitous ARP from a local host, the PE
SHOULD cache the corresponding entry in the ARP table of the
corresponding VPN if that entry does not yet exist.

When a PE receives a host route for one of its local hosts from a
remote PE, it SHOULD immediately send a unicast ARP request for that
local host to check whether the host is still connected locally. If
no response is received (consider the VM migration scenario), the PE
SHOULD delete the ARP entry for that host from its ARP table and
then withdraw the corresponding connected host route. If an ARP
response is received (consider the host multi-homing scenario), the
PE just needs to refresh the ARP entry for that local host as usual.
A sketch of this check follows.
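The following Python sketch of that liveness check is illustrative
only; unicast_arp_probe and withdraw_route are hypothetical hooks
standing in for the PE's ARP and BGP machinery.

   def on_remote_route_for_local_host(host_ip, arp_table,
                                      unicast_arp_probe,
                                      withdraw_route):
       """arp_table: {host IP: MAC} for this VPN on this PE;
       unicast_arp_probe returns the host's MAC or None on timeout."""
       mac = unicast_arp_probe(host_ip)
       if mac is None:
           # No answer: the host has likely migrated away, so delete
           # the ARP entry and withdraw the connected host route.
           arp_table.pop(host_ip, None)
           withdraw_route(host_ip + "/32")
       else:
           # The host is still attached locally (multi-homing case);
           # simply refresh the ARP entry as usual.
           arp_table[host_ip] = mac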
4.4. ARP Proxy
The PE, acting as an ARP proxy, will only respond to ARP requests
for remote hosts that have been learned from other PEs. That is to
say, the ARP proxy SHOULD NOT respond to ARP requests for local
hosts. Otherwise, if the ARP response from the PE overrides the one
from the actually requested local host, the packets destined for
that local host would have to be relayed by the PE, which is the so-
called hairpin issue. This decision is sketched below.
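A minimal Python sketch of this decision, with hypothetical names,
assuming the PE keeps a per-VPN local ARP table plus the host routes
learned from remote PEs:

   def handle_arp_request(target_ip, local_arp_table,
                          remote_host_routes, pe_mac):
       """Returns the MAC to place in the ARP reply, or None to stay
       silent."""
       if target_ip in local_arp_table:
           # Local target: stay silent and let the real host answer
           # directly, avoiding the hairpin issue.
           return None
       if target_ip in remote_host_routes:
           # Remote target: reply with one of the PE's own MACs.
           return pe_mac
       return None  # unknown target: no reply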
When the Virtual Router Redundancy Protocol (VRRP) [RFC2338] is
enabled together with the ARP proxy, only the VRRP master is
delegated to act as an ARP proxy, and it returns the VRRP virtual
MAC address as a response.
4.5. DHCP Relay Agent
To avoid Dynamic Host Configuration Protocol (DHCP) [RFC2131]
broadcast messages flooding the whole data center network, the DHCP
relay agent function can be enabled on PEs. In this way, DHCP
broadcast messages from DHCP clients (i.e., local CE hosts) are
transformed into DHCP unicast messages by the DHCP relay agents
(i.e., PEs) and then forwarded to the DHCP servers in unicast, as
sketched below.
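The following Python sketch is illustrative only: it shows, under
the usual DHCP relay agent model, how a PE might turn a client's
broadcast into a unicast toward a configured DHCP server. The
message layout and all names are hypothetical.

   def relay_dhcp(dhcp_message, pe_vrf_ip, dhcp_server_ip,
                  send_unicast):
       """dhcp_message: parsed DHCP payload received as a broadcast
       on a VRF attachment circuit."""
       if dhcp_message.get("giaddr", "0.0.0.0") == "0.0.0.0":
           # Record this PE's VRF interface address so that the
           # server can reply in unicast and pick the right subnet.
           dhcp_message["giaddr"] = pe_vrf_ip
       # Forward in unicast instead of re-broadcasting the message.
       send_unicast(dhcp_message, dst=dhcp_server_ip)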
5. Conclusions
By using Layer 3 routing in the backbone of the data center network
to replace STP bridge forwarding, the traffic between any two
servers is forwarded along the shortest path between them. Besides,
ECMP can easily be achieved in Layer 3 routed networks. Thus, the
total network bandwidth of the data center network is utilized to
the maximum extent.

By reusing BGP/MPLS VPN to exchange the host routes of a given VPN
among PEs, the servers of that VPN can communicate with each other
just as if they were located within a LAN or subnet.
Due to the tunnels used in MPLS/BGP VPN, the forwarding tables of P
routers just need to hold the reachability information of the tunnel
endpoints (i.e., PEs). Meanwhile, the forwarding tables of PE
routers can also be ensured to scale well by distributing VPNs among
different PEs; that is to say, a given PE router only needs to hold
the routing tables of those VPNs to which it is attached. Thus, the
forwarding table scalability issues of data center networks are
largely alleviated.
By enabling the ARP proxy function on PEs, ARP broadcast messages
from local CE hosts are terminated at the attached PEs and thus are
not flooded through the whole data center network. Besides, by
enabling the DHCP relay agent function on PEs, DHCP broadcast
messages from DHCP clients (i.e., local CE hosts) are transformed
into unicast messages by the DHCP relay agents (i.e., PEs) and then
forwarded to the DHCP servers in unicast. Thus, broadcast storms in
the data center network are largely suppressed.
6. Limitations
Since the data center network architecture described in this
document partially reuses the BGP/MPLS VPN technology to construct a
large-scale IP subnet, rather than a real LAN, non-IP traffic cannot
be supported in this architecture. However, we believe that IP is
the dominant communication protocol in today's data center networks
and that non-IP legacy applications will disappear from the data
center network over time.
7. Future Work
If necessary, IS-IS or OSPF could also be extended to support a
function similar to the special BGP/MPLS VPN application described
in this document. In addition, IPv6 data center networks will be
considered as part of future work.
8. Security Considerations
TBD.
9. IANA Considerations
There is no requirement for IANA.
10. Acknowledgements
Thanks to Dacheng Zhang for his editorial review.
11. References
11.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
11.2. Informative References
[RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
Networks (VPNs)", RFC 4364, February 2006.

[MVPN] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs",
draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress),
January 2010.

[MVPN-BGP] Aggarwal, R., Rosen, E., Morin, T., Rekhter, Y., and C.
Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP
VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in
progress), September 2009.

[RFC2338] Knight, S., et al., "Virtual Router Redundancy Protocol",
RFC 2338, April 1998.
[RFC2131] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131,
March 1997.
[RFC2236] Fenner, W., "Internet Group Management Protocol, Version
2", RFC 2236, November 1997.
Authors' Addresses
Xiaohu Xu
Huawei Technologies,
No.3 Xinxi Rd., Shang-Di Information Industry Base,
Hai-Dian District, Beijing 100085, P.R. China
Phone: +86 10 82836073
Email: xuxh@huawei.com