Network Working Group                                             X. Xu
Internet Draft                                      Huawei Technologies
Category: Standards Track
Expires: October 2012                                   August 27, 2011


      Virtual Subnet: A Scalable Data Center Interconnection Solution

                        draft-xu-virtual-subnet-06


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 27, 2011.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.






Xu                    Expires October 27, 2012                [Page 1]


Internet-Draft               Virtual Subnet                 August 2011

Abstract

   This document proposes a host-route-based, IP-only L2VPN solution
   called Virtual Subnet, which reuses BGP/MPLS IP VPN [RFC4364] and
   ARP proxy [RFC925][RFC1027] technologies. Virtual Subnet provides a
   much more scalable approach for interconnecting geographically
   dispersed data centers.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents


   1. Introduction
   2. Terminology
   3. Solution Description
      3.1. Unicast
         3.1.1. Intra-subnet Unicast
         3.1.2. Inter-subnet Unicast
      3.2. Multicast/Broadcast
      3.3. CE Host Discovery
      3.4. CE Multi-homing
      3.5. CE Host Mobility
      3.6. ARP Proxy
   4. Comparison with VPLS
   5. Future work
   6. Security Considerations
   7. IANA Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Authors' Addresses

1. Introduction

   To achieve service agility to the full extent of current virtual
   machine (VM) technology, cloud data center operators are demanding
   solutions for VM mobility across geographically dispersed data
   centers. In this challenging environment, a solution that enables
   fast, reliable, high-capacity and highly scalable data center
   interconnection is essential. Virtual Private LAN Service (VPLS)
   [RFC4761][RFC4762] appears to be an available technology for this
   demand. However, the scaling issues (e.g., ARP broadcast storms and
   unknown unicast flooding) that exist within a large Layer 2 Ethernet
   bridged network would badly impact network performance when such a
   flat Layer 2 network is extended across multiple data centers.

   This document describes a host-route-based, IP-only L2VPN solution
   called Virtual Subnet (VS), which reuses BGP/MPLS IP VPN [RFC4364]
   and ARP proxy [RFC925][RFC1027] technologies. VS provides a much
   more scalable approach for interconnecting geographically dispersed
   data centers. In contrast with existing VPLS solutions, VS greatly
   alleviates the impact of broadcast storms on network performance by
   partitioning the ARP broadcast and unknown unicast flooding domain
   associated with an IP subnet that has been extended across the
   MPLS/IP backbone into multiple isolated parts, one per data center
   location. VS also provides other desirable benefits that VPLS does
   not. For example, the MAC table capacity pressure on the large
   number of CE switches within data centers is greatly reduced. In
   addition, active-active data center exit capability can be achieved
   easily even in the case where path symmetry is required. Finally,
   the ARP table pressure on data center exit gateways can be reduced
   by several orders of magnitude.

   Note that non-IP traffic would not be supported in VS since VS just
   provides an IP-only L2VPN service.

2. Terminology

   This memo makes use of the terms defined in [RFC4364], [MVPN],
   [RFC2236] and [RFC2131].

3. Solution Description

3.1. Unicast

3.1.1. Intra-subnet Unicast

   As shown in Figure 1, CE hosts dispersed across different VPN sites
   of a given IP-only L2VPN instance are actually within a single IP
   subnet (e.g., 10.0.0.0/8). PE routers automatically discover their
   locally connected CE hosts by means such as ARP learning or ICMP
   echo, and create host routes for them accordingly. These host routes
   are distributed among PE routers with the existing BGP/MPLS IP VPN
   signaling. In addition, each PE router is configured with a null
   route for the VPN subnet, so that packets destined for nonexistent
   hosts within that subnet are not mistakenly forwarded according to
   the default route. Meanwhile, ARP proxy is enabled on the VRF
   interfaces of each PE router. Thus, upon receiving from a local CE
   host an ARP request for a known remote CE host, the ingress PE
   router returns its own MAC address as a response.

                          +--------------------+
    +-----------------+   |                    |   +------------------+
    |VPN_A:10.0.0.0/8 |   |                    |   |VPN_A:10.0.0.0/8  |
    |                 |   |                    |   |                  |
    |    +------+    ++---+-+                +-+---++    +------+     |
    |    |Host A+----+ PE-1 |                | PE-2 +----+Host B|     |
    |    +------+    ++-+-+-+                +-+-+-++    +------+     |
    |   10.1.1.1/8    | | |  IP/MPLS Backbone  | | |    10.1.1.2/8    |
    +-----------------+ | |                    | | +------------------+
                        | +--------------------+ |
                        |                        |
                        |                        |
                        V                        V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 |  Local |     | VPN_A |10.1.1.2/32 |  Local |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 |  PE-2  |     | VPN_A |10.1.1.1/32 |  PE-1  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.0.0.0/8  |  NULL  |     | VPN_A |10.0.0.0/8  |  NULL  |
    +-------+------------+--------+     +-------+------------+--------+

                      Figure 1: Intra-subnet Unicast

   Assume host A sends an ARP request for host B before communicating
   with host B. Upon receipt of this ARP request, the ingress PE
   router, PE-1, looks up the associated VRF table to find the
   corresponding host route for host B. If the host route is found and
   was learnt from a remote PE router, PE-1, acting as an ARP proxy,
   returns its own MAC address as a response to the ARP request.
   Otherwise, PE-1 does not respond to that ARP request. Upon receiving
   the ARP reply from PE-1, host A sends out an IP packet destined for
   host B, with the destination MAC address set to PE-1's MAC address
   as learnt through the above ARP resolution. Once this packet arrives
   at PE-1, PE-1 tunnels it towards the egress PE router (i.e., PE-2),
   which in turn forwards the packet to the destination CE host (i.e.,
   host B).
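
   The ARP proxy decision made by PE-1 in this walk-through can be
   sketched as follows. This is only an illustrative sketch: the
   dictionary layout, the function name proxy_arp_reply and the MAC
   address value are assumptions and are not defined by this document.

```python
# Hypothetical host routes in PE-1's VRF for VPN_A (values from Figure 1).
# Each value names where the host route was learnt: "Local" or a remote PE.
HOST_ROUTES = {
    "10.1.1.1": "Local",   # host A, attached to PE-1
    "10.1.1.2": "PE-2",    # host B, learnt from PE-2
}
PE1_MAC = "00:00:5e:00:53:01"  # illustrative MAC address (assumption)


def proxy_arp_reply(target_ip):
    """Return the MAC address to answer with, or None to stay silent."""
    origin = HOST_ROUTES.get(target_ip)
    if origin is None or origin == "Local":
        # No host route, or a local CE host that answers for itself.
        return None
    # Host route learnt from a remote PE: answer with PE-1's own MAC.
    return PE1_MAC


print(proxy_arp_reply("10.1.1.2"))  # request for host B: PE-1 answers
print(proxy_arp_reply("10.1.1.1"))  # request for a local host: None
```

   Only the request for host B yields a reply, so host A addresses its
   frames for host B to PE-1.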

3.1.2. Inter-subnet Unicast

   As shown in Figure 2, for a CE host (e.g., host A) to communicate
   with other hosts outside its own subnet, a PE router (e.g., PE-2)
   which is connected to a CE gateway router (e.g., GW) would be
   configured with a default route with the next-hop pointing to that
   CE gateway router, and this default route would be distributed to
   other PE routers.

                          +--------------------+
    +-----------------+   |                    |   +-------------+
    |VPN_A:10.0.0.0/8 |   |                    |   |VPN_A:       |
    |                 |   |                    |   |10.0.0.0/8   |
    |    +------+    ++---+-+                +-+---++       +----+--+
    |    |Host A+----+ PE-1 |                | PE-2 +-------+   GW  |
    |    +------+    ++-+-+-+                +-+-+-++       +----+--+
    |   10.1.1.1/8    | | |  IP/MPLS Backbone  | | |10.1.1.2/8   |
    +-----------------+ | |                    | | +-------------+
                        | +--------------------+ |
                        |                        |
                        |                        |
                        V                        V
    +-------+------------+--------+     +-------+------------+--------+
    |VRF ID |Destination |Next Hop|     |VRF ID |Destination |Next Hop|
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.1/32 |  Local |     | VPN_A |10.1.1.2/32 |  Local |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.1.1.2/32 |  PE-2  |     | VPN_A |10.1.1.1/32 |  PE-1  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |10.0.0.0/8  |  NULL  |     | VPN_A |10.0.0.0/8  |  NULL  |
    +-------+------------+--------+     +-------+------------+--------+
    | VPN_A |0.0.0.0/0   |  PE-2  |     | VPN_A |0.0.0.0/0   |   GW   |
    +-------+------------+--------+     +-------+------------+--------+

                      Figure 2: Inter-subnet Unicast

   Now host A sends an ARP request for its default gateway (i.e., GW)
   before communicating with a destination host outside its subnet.
   Upon receiving this ARP request, PE-1, acting as an ARP proxy,
   returns its own MAC address as a response in accordance with the
   rules described in Section 3.1.1. Host A then sends out an IP packet
   for that destination host with the destination MAC address set to
   PE-1's MAC address. Upon receiving this packet, PE-1 tunnels it
   towards PE-2 according to the default route learnt from PE-2. PE-2
   in turn forwards the packet to GW according to the configured
   default route.
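
   The forwarding steps above amount to a longest-prefix-match lookup
   in each PE router's VRF table. The following sketch replays them
   against the tables of Figure 2. The names VRFS, next_hop and trace
   are hypothetical, and 192.0.2.7 merely stands in for a destination
   outside the VPN subnet.

```python
import ipaddress

# Hypothetical per-PE VRF tables for VPN_A, copied from Figure 2.
VRFS = {
    "PE-1": {"10.1.1.1/32": "Local", "10.1.1.2/32": "PE-2",
             "10.0.0.0/8": "NULL", "0.0.0.0/0": "PE-2"},
    "PE-2": {"10.1.1.2/32": "Local", "10.1.1.1/32": "PE-1",
             "10.0.0.0/8": "NULL", "0.0.0.0/0": "GW"},
}


def next_hop(pe, destination):
    """Longest-prefix match of destination in the VRF table of pe."""
    addr = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(prefix) for prefix in VRFS[pe]
               if addr in ipaddress.ip_network(prefix)]
    best = max(matches, key=lambda net: net.prefixlen)
    return VRFS[pe][str(best)]


def trace(ingress, destination):
    """Follow next hops from the ingress PE until the packet leaves
    the set of PE routers (Local, GW or NULL)."""
    path = [ingress]
    while path[-1] in VRFS:
        path.append(next_hop(path[-1], destination))
    return path


print(trace("PE-1", "192.0.2.7"))  # ['PE-1', 'PE-2', 'GW']
print(trace("PE-1", "10.9.9.9"))   # ['PE-1', 'NULL']
```

   The second trace shows why the null route matters: without it, a
   packet for a nonexistent host inside 10.0.0.0/8 would follow the
   default route towards GW.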

   For CE gateway router redundancy, more than one CE gateway router
   could be connected to a given VPN subnet. In this case, the Virtual
   Router Redundancy Protocol (VRRP) [RFC2338] could optionally be
   enabled among these CE gateway routers, so that only the PE router
   connected to the VRRP master is entitled to announce a default
   route. To achieve that goal, the next-hop of the default route
   SHOULD be set to the corresponding Virtual Router IP address, and
   the default route SHOULD NOT be deemed valid unless there is a
   directly connected host route for the next-hop address. Because only
   the VRRP master is entitled to respond to ARP requests for the
   Virtual Router IP address and to broadcast gratuitous ARP requests
   or replies on behalf of the Virtual Router, only the PE router
   connected to the VRRP master can have an ARP entry for the Virtual
   Router IP address, and therefore a directly connected host route for
   it. In this way, packets destined for the outside of a given VPN
   subnet are sent to the current VRRP master.

   Alternatively, PE routers could intercept the VRRP messages received
   from their locally connected CE routers and prevent them from being
   flooded across the MPLS/IP backbone. As a result, each CE router
   acts as a VRRP master, and therefore each PE router connected to a
   CE gateway router announces a default route. In this way, inbound
   and outbound traffic of the VPN subnet is load-balanced across
   multiple CE gateway routers, and route optimization for that traffic
   is achieved at the same time.
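
   The validity rule for the default route when VRRP runs among the CE
   gateway routers can be sketched as follows. The Virtual Router IP
   address 10.1.1.254 and the per-PE sets of connected host routes are
   invented for illustration only.

```python
def default_route_is_valid(virtual_router_ip, connected_host_routes):
    """The default route is usable only if its next hop (the Virtual
    Router IP address) has a directly connected host route."""
    return virtual_router_ip in connected_host_routes


# Hypothetical state: the VRRP master sits behind PE-2, so only PE-2 has
# learnt an ARP entry, and thus a connected host route, for 10.1.1.254.
VIRTUAL_ROUTER_IP = "10.1.1.254"
connected = {
    "PE-1": {"10.1.1.1"},
    "PE-2": {"10.1.1.2", "10.1.1.254"},
}

announcers = [pe for pe, routes in connected.items()
              if default_route_is_valid(VIRTUAL_ROUTER_IP, routes)]
print(announcers)  # ['PE-2']
```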

3.2. Multicast/Broadcast

   The MVPN technology [MVPN], in particular the Protocol Independent
   Multicast (PIM) tree option with some extensions, could be reused
   here to support IP multicast and broadcast between CE hosts of the
   same VPN instance. For example, PE routers attached to a given VPN
   join a default provider multicast distribution tree which is
   dedicated for that VPN. Ingress PE routers, upon receiving customer
   multicast or broadcast traffic from their local CE hosts, tunnel
   such customer traffic towards remote PE routers of the same VPN over
   the corresponding default provider multicast distribution tree. When
   receiving customer multicast or broadcast traffic over a provider
   multicast distribution tree, egress PE routers forward such customer
   traffic via the corresponding VRF interfaces.

   More details about how to support multicast and broadcast in VS will
   be explored in a later version of this document.

3.3. CE Host Discovery

   When receiving an ARP request or reply from a local CE host, a PE
   router SHOULD cache or update the corresponding ARP entry for that
   CE host. In addition, a PE router SHOULD periodically send ARP
   requests (preferably unicast) to the discovered local CE hosts so as
   to keep the ARP entries fresh.

   To ensure that a PE router discovers all of its locally connected CE
   hosts in time, the PE router SHOULD perform an IP or ARP scan of its
   attached VPN site at least once after rebooting. One possible option
   is to use ICMP echo for host discovery. For example, a PE router
   could send an ICMP echo request to an IP broadcast address (e.g.,
   10.255.255.255); every CE host receiving that ICMP echo request
   would respond with an ICMP echo reply that carries its IP and MAC
   addresses. The PE router could thus discover all of its local CE
   hosts by inspecting the received ICMP echo replies. If the PE router
   is unable to process so many replies in a short period of time, the
   subnet could be partitioned into multiple segments and host
   discovery could be performed for each segment in turn.
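
   Partitioning the subnet for a staged scan can be sketched with the
   Python standard library. The segment size (/10) is an arbitrary
   assumption, and how each segment is actually probed is left open
   here, as it is in this document.

```python
import ipaddress


def scan_segments(subnet, segment_prefixlen):
    """Split the VPN subnet into equal segments to be scanned in turn."""
    network = ipaddress.ip_network(subnet)
    return list(network.subnets(new_prefix=segment_prefixlen))


segments = scan_segments("10.0.0.0/8", 10)
for segment in segments:
    # A PE router would run host discovery on this range, wait for the
    # replies to be processed, and only then move to the next segment.
    print(segment, segment.num_addresses)
```

   With these values the subnet is split into four segments of 4194304
   addresses each, starting at 10.0.0.0, 10.64.0.0, 10.128.0.0 and
   10.192.0.0.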

3.4. CE Multi-homing

   For PE router redundancy, a VPN site could be connected to more than
   one PE router. In this case, VRRP SHOULD run among these PE routers.
   Only the PE router that is the VRRP master responds to ARP requests
   from local CE hosts, and it MUST use the Virtual Router MAC address
   in any ARP packet it sends. To achieve active-active multi-homing
   for inbound traffic to a given multi-homed VPN site, the PE routers
   acting as VRRP backups could also perform the host discovery
   function and advertise host routes for local CE hosts accordingly.
   Note that this does not contravene the VRRP specification [RFC2338].

3.5. CE Host Mobility

   Once a CE host moves from one VPN site to another, it will usually
   send out a gratuitous ARP request or reply when attaching to a new
   VPN site. The PE router attached to the new VPN site will create a
   CE host route upon receiving that gratuitous ARP message and then
   advertise it to remote PE routers.

   When the PE router attached to the old VPN site receives a host
   route announcement for one of its local CE hosts from a remote PE
   router, it SHOULD immediately send an ARP request or ICMP echo
   request to that CE host to determine whether the CE host is still
   locally connected. If no reply is returned within a given period of
   time, the PE router deletes the ARP entry of that CE host and
   withdraws the corresponding host route. Meanwhile, the PE router
   broadcasts a gratuitous ARP on behalf of that CE host, with the
   sender hardware address field set to its own MAC address. As a
   result, the ARP entries for that CE host cached on other local CE
   hosts of the old VPN site are refreshed in a timely manner.
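
   The behaviour of the PE router at the old VPN site can be sketched
   as follows. The function name, the probe callback and the action
   strings are hypothetical; a real implementation would send the
   probe on the wire and apply a timeout.

```python
def handle_remote_host_route(host_ip, arp_table, probe):
    """React to a remote PE announcing a host route for host_ip.

    probe is a hypothetical callable that sends an ARP request or ICMP
    echo request and returns True if the host answers in time.
    Returns the list of actions taken by the old PE router.
    """
    if host_ip not in arp_table:
        return []                       # not one of our local CE hosts
    if probe(host_ip):
        return ["keep host route"]      # host is still attached locally
    del arp_table[host_ip]              # host has moved away
    return ["withdraw host route",
            "broadcast gratuitous ARP with PE MAC"]


arp_table = {"10.1.1.1": "00:00:5e:00:53:0a"}  # illustrative entry
actions = handle_remote_host_route("10.1.1.1", arp_table,
                                   probe=lambda ip: False)
print(actions)
```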

3.6. ARP Proxy

   A PE router, acting as an ARP proxy, SHOULD only respond to ARP
   requests for CE hosts that are attached to other PE routers. In
   other words, the PE router SHOULD NOT respond to ARP requests for
   its local CE hosts or for nonexistent CE hosts.

   When VRRP is configured on multiple PE routers attached to a given
   VPN site for redundancy, only the PE router that is the VRRP master
   is entitled to perform the ARP proxy function.

4. Comparison with VPLS

   Since VPLS simply extends a LAN across multiple sites and operates
   as an Ethernet bridge, most scaling issues (e.g., ARP broadcast
   storms and unknown unicast flooding) that exist within a large
   Ethernet bridged network are not addressed by VPLS. VS greatly
   alleviates the impact of broadcast storms on network performance by
   partitioning the ARP broadcast and unknown unicast flooding domain
   associated with a given subnet, which has been extended across the
   MPLS/IP backbone, into multiple isolated parts. For example, ARP
   broadcast traffic is limited to the scope of a VPN site. Similarly,
   unknown unicast traffic is not flooded across the MPLS/IP backbone.

   As for the MAC table capacity requirement on CE switches, CE
   switches in VPLS have to learn the MAC addresses of both local and
   remote CE hosts. In contrast, CE switches in VS only need to learn
   the MAC addresses of local CE hosts and local PE routers, due to the
   use of ARP proxy.

   Active-active DC exit is a highly desirable capability when
   considering route/path optimization for traffic to and from the
   outside of geographically dispersed data centers (e.g., the
   Internet). Normally, each DC site is connected to a default gateway
   (i.e., a DC exit router) that is responsible for forwarding traffic
   to and from the outside. However, since these default gateways are
   within a single subnet due to the use of a Layer 2 DCI, normally
   only one default gateway router (acting as the VRRP master) is
   allowed to forward traffic to and from the outside. This is
   obviously not optimal from the perspective of WAN bandwidth
   utilization. An active-active VRRP approach has been proposed for
   this case so that traffic destined for the outside can be forwarded
   by the local DC exit gateways. This is workable when path symmetry
   is not required. However, in most cases where firewall or NAT
   devices are deployed at the DC exits, path symmetry is a must, and
   active-active VRRP is therefore not usable.

   In contrast, if VS is used as the DCI solution, the source IP
   addresses of incoming traffic from the Internet could be NATed on
   the DC exit gateway through which the traffic enters a DC. Note that
   the DC exit gateways of geographically dispersed DCs are configured
   with different, non-overlapping IP address pools for source NAT. In
   addition, each DC exit gateway advertises the routes for its NAT
   address pool to its connected PE routers of the VS. Thus, when
   outgoing traffic destined for the Internet arrives at its local PE
   router, that PE router forwards the traffic according to the
   matching route for the above address pools. In this way, an
   active-active DC exit can be achieved easily even in the case where
   path symmetry is required.

   Another obvious advantage of VS over VPLS as a DCI solution is that
   it reduces the ARP table size on DC exit gateways by several orders
   of magnitude. Assume there are millions of CE hosts within a single
   VLAN/subnet. If VPLS is used as the DCI solution, DC exit gateways
   have to hold millions of ARP entries corresponding to these CE
   hosts. In contrast, with VS as the DCI solution, DC exit gateways
   are directly connected to the PE routers of the VS, which act as ARP
   proxies, so the MAC addresses in the ARP entries for CE hosts on a
   DC exit gateway are identical (i.e., the PE router's MAC address).
   These millions of ARP entries can therefore be aggregated into one
   entry (e.g., 10.0.0.0/8 -> the PE router's MAC address). That is to
   say, the exact-matching algorithm for ARP cache lookup is changed to
   a longest-matching algorithm. Of course, there is no free lunch. The
   side effect of this change is that DC exit gateways could send
   packets destined for nonexistent CE hosts to their connected PE
   routers of the VS. Fortunately, once those packets arrive at the PE
   router, the PE router drops them directly, since no host route
   matches them and the null route for the VPN subnet applies.
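
   The aggregated ARP entry on a DC exit gateway can be sketched as a
   prefix-keyed table that is searched by longest match. The table
   layout, the function name resolve and the MAC address value are
   assumptions made for illustration only.

```python
import ipaddress

PE_MAC = "00:00:5e:00:53:02"   # illustrative MAC address (assumption)

# With VS, one aggregated entry covers every CE host in the subnet.
AGGREGATED_ARP = {ipaddress.ip_network("10.0.0.0/8"): PE_MAC}


def resolve(ip):
    """Longest-match ARP lookup instead of one exact entry per host."""
    addr = ipaddress.ip_address(ip)
    candidates = [net for net in AGGREGATED_ARP if addr in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return AGGREGATED_ARP[best]


print(resolve("10.1.1.1"))    # any address in 10.0.0.0/8: the PE's MAC
print(resolve("10.200.3.4"))  # nonexistent hosts resolve as well
```

   The second lookup illustrates the side effect noted above: the
   gateway cannot tell whether the host exists, so the packet is sent
   to the PE router and dropped there.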



5. Future work

   How to support IPv6 CE hosts in VS is for future study.

6. Security Considerations

   TBD.

7. IANA Considerations

   There is no requirement for IANA.

8. Acknowledgements

   Thanks to Dino Farinacci, Himanshu Shah, Nabil Bitar and Giles Heron
   for their valuable comments on this document.

9. References

9.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [MVPN] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs",
             draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress),
             January 2010.

   [MVPN-BGP] Aggarwal, R., Rosen, E., Morin, T., Rekhter, Y. and C.
             Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP
             VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in
             progress), September 2009.

   [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol or
             Converting Network Protocol Addresses to 48-bit Ethernet
             Addresses for Transmission on Ethernet Hardware", RFC 826,
             Symbolics, November 1982.

   [RFC925] Postel, J., "Multi-LAN Address Resolution", RFC 925, USC
             Information Sciences Institute, October 1984.

   [RFC1027] Carl-Mitchell, S. and J. Quarterman, "Using ARP to
             Implement Transparent Subnet Gateways", RFC 1027, October
             1987.

   [RFC2338] Knight, S., et al., "Virtual Router Redundancy Protocol",
             RFC 2338, April 1998.

   [RFC2236] Fenner, W., "Internet Group Management Protocol, Version
             2", RFC 2236, November 1997.

   [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service
             (VPLS) Using BGP for Auto-Discovery and Signaling", RFC
             4761, January 2007.

   [RFC4762] Lasserre, M. and V. Kompella, "Virtual Private LAN Service
             (VPLS) Using Label Distribution Protocol (LDP) Signaling",
             RFC 4762, January 2007.

Authors' Addresses

   Xiaohu Xu
   Huawei Technologies,
   No.3 Xinxi Rd., Shang-Di Information Industry Base,
   Hai-Dian District, Beijing 100085, P.R. China
   Phone: +86 10 82882573
   Email: xuxh@huawei.com