Internet Engineering Task Force M. McBride
Internet-Draft H. Liu
Intended status: Informational Huawei Technologies
Expires: September 4, 2012 March 3, 2012
Multicast in the Data Center Overview
draft-mcbride-armd-mcast-overview-00
Abstract
There has been much interest in the issues surrounding massive numbers of hosts in the data center. Discussion in the ARMD working group included the issues with address resolution for non-ARP/ND multicast traffic in data centers with massive numbers of hosts. This document provides a quick survey of multicast in the data center and should serve as an aid to further discussion of issues related to large amounts of multicast in the data center.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 4, 2012.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Multicast Applications in the Data Center . . . . . . . . . . . 3
2.1. L3 Multicast Applications . . . . . . . . . . . . . . . . . 3
2.2. L2 Multicast Applications . . . . . . . . . . . . . . . . . 4
3. L2 Multicast Protocols in the Data Center . . . . . . . . . . . 5
4. L3 Multicast solutions in the Data Center . . . . . . . . . . . 6
5. Challenges of using multicast in the Data Center . . . . . . . 7
6. Layer 3 / Layer 2 Topological Variations . . . . . . . . . . . 8
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 9
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 9
9. Security Considerations . . . . . . . . . . . . . . . . . . . . 9
10. Informative References . . . . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 9
1. Introduction
Data center servers often use IP multicast to send data to clients or other application servers. IP multicast is expected to help conserve bandwidth in the data center and reduce the load on servers. Increased reliance on multicast, in next-generation data centers, requires higher performance and capacity, especially from the switches. If multicast is to continue to be used in the data center, it must scale well within and between data centers. There has been much interest in the issues surrounding massive numbers of hosts in the data center. Discussion in the ARMD working group included the issues with address resolution for non-ARP/ND multicast traffic in such data centers. This document provides a quick survey of multicast in the data center and should serve as an aid to further discussion of issues related to multicast in the data center.
ARP/ND issues are not addressed in this document; they are covered in [I-D.armd-problem-statement].
2. Multicast Applications in the Data Center
There are many data center operators who do not deploy multicast in their networks for scalability and stability reasons. There are also many operators for whom multicast is critical and is enabled on their data center switches and routers. For this latter group, there are several uses of multicast in their data centers. An understanding of those uses is important in order to properly support these applications in ever-evolving data centers. If, for instance, the majority of the applications are discovering/signaling each other using multicast, there may be better ways to support them than using multicast. If, however, the multicasting of data is occurring in large volumes, there is a need for very good data center underlay/overlay multicast support. The applications either fall into the category of those that leverage L2 multicast for discovery or those that require L3 support and likely span multiple subnets.
2.1. L3 Multicast Applications
IPTV servers use multicast to deliver content from the data center to
end users. IPTV is typically a one-to-many application where the
hosts are configured for IGMPv3, the switches are configured with
IGMP snooping, and the routers are running PIM-SSM mode. Often
redundant servers are sending multicast streams into the network and
the network is forwarding the data across diverse paths.
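The receiver side of such a deployment can be sketched in a few lines of Python (a minimal sketch; the group address and port are arbitrary examples). Joining the group is what causes the host to emit an IGMP membership report, which the switches snoop. For portability this shows an any-source join; an IGMPv3/SSM receiver would instead issue a source-specific join where the platform exposes one.

   # Minimal multicast receiver sketch (standard library only).
   # The group and port below are illustrative, not from this document.
   import socket
   import struct

   GROUP = "239.1.1.1"   # example administratively scoped group
   PORT = 5004           # example port

   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
   sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
   sock.bind(("", PORT))

   # struct ip_mreq: group address + local interface (INADDR_ANY).
   # This join triggers the IGMP report seen by snooping switches.
   mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                      socket.inet_aton("0.0.0.0"))
   sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

   data, src = sock.recvfrom(2048)
   print("received %d bytes from %s" % (len(data), src[0]))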
Windows Media servers send multicast streams to clients. Windows Media Services streams to an IP multicast address and all clients
subscribe to that IP address to receive the same stream. This allows a single stream to be played simultaneously by multiple clients, thus reducing bandwidth utilization.
Market data relies extensively on IP multicast to deliver stock
quotes from the data center to a financial services provider and then
to the stock analysts. The most critical requirement of a multicast
trading floor is that it be highly available. The network must be
designed with no single point of failure and in a way that the network can respond in a deterministic manner to any failure. Typically, redundant servers (in a primary/backup or live-live mode) are sending
multicast streams into the network and the network is forwarding the
data across diverse paths (when duplicate data is sent by multiple
servers).
With traditional (unicast) publish and subscribe servers, a separate message is sent to each subscriber of a publication. With multicast publish/subscribe, only
one message is sent, regardless of the number of subscribers. In a
publish/subscribe system, client applications, some of which are
publishers and some of which are subscribers, are connected to a
network of message brokers that receive publications on a number of
topics, and send the publications on to the subscribers for those
topics. The more subscribers there are in the publish/subscribe
system, the greater the improvement to network utilization there
might be with multicast.
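The publish side can be sketched as a single UDP send to a group address, independent of the subscriber count (a minimal sketch; the group, port, and payload are placeholders, and a real broker would add topic framing and reliability on top).

   # Minimal multicast publish sketch: one send reaches all subscribers.
   import socket

   GROUP, PORT = "239.1.1.1", 5004   # example group/port

   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
   # Keep the datagram within the data center; the TTL is an assumption.
   sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)

   payload = b"topic=quotes seq=1 ..."   # placeholder publication
   sock.sendto(payload, (GROUP, PORT))   # one message, N subscribers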
With first-hop redundancy and load balancing protocols, such as VRRP, routers communicate among themselves using a multicast address.
Overlays may use IP multicast to virtualize L2 multicast. VXLAN, for instance, is an encapsulation scheme to carry L2 frames over L3 networks. The VXLAN Tunnel End Point (VTEP) encapsulates frames inside an L3 tunnel. VXLANs are identified by a 24-bit VXLAN Network Identifier (VNI). The VTEP maintains a table of known destination MAC addresses and stores, for each, the IP address of the remote VTEP whose tunnel should be used. Unicast frames between VMs are sent directly to the unicast L3 address of the remote VTEP. Multicast frames are sent to a multicast IP group associated with the VNI. Underlying IP multicast protocols (PIM-SM/SSM/BIDIR) are used in the underlay to forward the overlay's multicast data.
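The VTEP behavior can be modeled, very roughly, as a lookup that selects a unicast tunnel for known destination MACs and the VNI's multicast group otherwise (a toy model; the table entries and VNI-to-group mapping below are invented for illustration).

   # Toy model of a VTEP forwarding decision, not a real VXLAN stack.

   # Learned mapping: destination MAC -> remote VTEP IP (example entries).
   mac_table = {
       "52:54:00:aa:bb:01": "10.0.0.11",
       "52:54:00:aa:bb:02": "10.0.0.12",
   }

   # Assumed operator-configured mapping: 24-bit VNI -> multicast group.
   vni_to_group = {10001: "239.10.10.1"}

   def outer_destination(vni, dst_mac):
       """Return the outer IP destination for an encapsulated frame."""
       if dst_mac in mac_table:        # known unicast: tunnel directly
           return mac_table[dst_mac]
       return vni_to_group[vni]        # BUM traffic: group for this VNI

   print(outer_destination(10001, "52:54:00:aa:bb:02"))  # 10.0.0.12
   print(outer_destination(10001, "ff:ff:ff:ff:ff:ff"))  # 239.10.10.1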
2.2. L2 Multicast Applications
Applications such as Ganglia use multicast for distributed monitoring of computing systems such as clusters and grids.
Windows Server cluster node exchange relies upon the use of multicast heartbeats between servers. Only interfaces in
the same multicast group use the data. Unlike broadcast, multicast
traffic does not need to be flooded throughout the network, reducing
the chance that unnecessary CPU cycles are expended filtering traffic
on nodes outside the cluster. As the number of nodes increases, the
ability to replace several unicast messages with a single multicast
message improves node performance and decreases network bandwidth
consumption. Multicast messages replace unicast messages in two
components of clustering:
o Heartbeats: The clustering failure detection engine is based on a
scheme whereby nodes send heartbeat messages to other nodes.
Specifically, for each network interface, a node sends a heartbeat
message to all other nodes with interfaces on that network.
Heartbeat messages are sent every 1.2 seconds. In the common case
where each node has an interface on each cluster network, there
are N * (N - 1) unicast heartbeats sent per network every 1.2
seconds in an N-node cluster. With multicast heartbeats, the
message count drops to N multicast heartbeats per network every
1.2 seconds, because each node sends one message instead of N - 1. This represents a reduction in processing cycles on the sending node and a reduction in network bandwidth consumed (see the sketch after this list).
o Regroup: The clustering membership engine executes a regroup
protocol during a membership view change. The regroup protocol
algorithm assumes the ability to broadcast messages to all cluster
nodes. To avoid unnecessary network flooding and to properly
authenticate messages, the broadcast primitive is implemented by a
sequence of unicast messages. Converting the unicast messages to
a single multicast message conserves processing power on the
sending node and reduces network bandwidth consumption.
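The heartbeat arithmetic in the first item above can be checked in a few lines (the cluster size is an arbitrary example):

   # Heartbeats per network every 1.2 seconds in an N-node cluster.
   N = 16                              # example cluster size
   unicast_heartbeats = N * (N - 1)    # each node sends N - 1 unicast messages
   multicast_heartbeats = N            # each node sends one multicast message
   print(unicast_heartbeats, multicast_heartbeats)   # 240 vs 16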
Multicast addresses in the 224.0.0.x range are considered link-local multicast addresses. They are used for protocol discovery and are flooded to every port. For example, OSPF uses 224.0.0.5 and 224.0.0.6 for neighbor and DR discovery. These addresses are reserved and will not be constrained by IGMP snooping. These addresses are not to be used by general applications.
These types of multicast applications should be supportable in data centers that enable multicast.
3. L2 Multicast Protocols in the Data Center
The switches in between the servers and the routers rely upon IGMP snooping to constrain multicast traffic to the ports leading to interested hosts and to L3 routers. A switch will, by default, flood multicast traffic to all the ports in a broadcast domain (VLAN). IGMP snooping
is designed to prevent hosts on a local network from receiving
traffic for a multicast group they have not explicitly joined. It
provides switches with a mechanism to prune multicast traffic from
links that do not contain a multicast listener (an IGMP client).
IGMP snooping is an L2 optimization for L3 IGMP.
IGMP snooping, with proxy reporting or report suppression, actively
filters IGMP packets in order to reduce load on the multicast router.
Joins and leaves heading upstream to the router are filtered so that
only the minimal quantity of information is sent. The switch is
trying to ensure the router only has a single entry for the group,
regardless of how many active listeners there are. If there are two
active listeners in a group and the first one leaves, then the switch
determines that the router does not need this information since it
does not affect the status of the group from the router's point of
view. However, the next time there is a routine query from the router
the switch will forward the reply from the remaining host, to prevent
the router from believing there are no active listeners. It follows
that in active IGMP snooping, the router will generally only know
about the most recently joined member of the group.
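A very rough model of this report-suppression behavior (with invented group and port identifiers) is sketched below: the switch keeps per-group member ports and forwards to the router only the reports and leaves that change whether the group has any member at all.

   # Toy model of IGMP snooping with report suppression, not a real switch.
   from collections import defaultdict

   group_members = defaultdict(set)   # group address -> set of member ports

   def host_report(group, port):
       """A host's IGMP membership report arrives on a switch port."""
       first_member = not group_members[group]
       group_members[group].add(port)
       # Only the report that creates the group entry reaches the router,
       # so the router holds a single entry per group.
       return "forward to router" if first_member else "suppress"

   def host_leave(group, port):
       """A host leaves; tell the router only when the last member goes."""
       group_members[group].discard(port)
       if not group_members[group]:
           return "forward leave to router"
       return "suppress"

   print(host_report("239.1.1.1", "port3"))   # forward to router
   print(host_report("239.1.1.1", "port7"))   # suppress
   print(host_leave("239.1.1.1", "port3"))    # suppress
   print(host_leave("239.1.1.1", "port7"))    # forward leave to router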
In order for IGMP, and thus IGMP snooping, to function, a multicast
router must exist on the network and generate IGMP queries. The
tables (holding the member ports for each multicast group) created
for snooping are associated with the querier. Without a querier the
tables are not created and snooping will not work. Furthermore, IGMP
general queries must be unconditionally forwarded by all switches
involved in IGMP snooping. Some IGMP snooping implementations
include full querier capability. Others are able to proxy and
retransmit queries from the multicast router.
In source-only networks, however, which presumably describe most data center networks, there are no IGMP hosts on switch ports to generate IGMP packets. Switch ports are connected to multicast source ports and multicast router ports. The switch typically learns about multicast groups from the multicast data stream itself, using a type of source-only learning (the port receives only multicast data and no IGMP packets). The switch forwards traffic only to the
multicast router ports. When the switch receives traffic for new IP
multicast groups, it will typically flood the packets to all ports in
the same VLAN. This unnecessary flooding can impact switch
performance.
4. L3 Multicast solutions in the Data Center
There are three flavors of PIM used for Multicast Routing in the Data
Center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR [RFC5015].
SSM provides the most efficient forwarding between sources and receivers and is most suitable for one-to-many types of multicast applications. State is built for each (S,G) channel; therefore, the more sources and groups there are, the more state there is in the network. BIDIR is the most efficient shared-tree solution because one tree is built for all (S,G)s, thereby saving state, but it does not provide the most efficient forwarding path between sources and receivers. SSM and BIDIR are optimizations of PIM-SM. PIM-SM is still the most widely deployed multicast routing protocol. PIM-SM can also be the most complex. PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree; the last-hop routers then either switch to the SPT (shortest path tree), similar to SSM, or stay on the shared tree (similar to BIDIR).
For massive numbers of hosts sending (and receiving) multicast, the shared tree (particularly with PIM-BIDIR) provides the best potential scaling, since no matter how many multicast sources exist within a VLAN, the number of trees stays the same. IGMP snooping, IGMP proxy, and PIM-BIDIR have the potential to scale to the huge numbers required in a data center.
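The difference in state can be illustrated with assumed numbers of sources and groups:

   # Rough multicast state comparison; the counts are assumptions.
   sources_per_group = 100   # e.g. many VMs sending into each group
   groups = 1000             # e.g. one group per logical L2 segment

   ssm_state = sources_per_group * groups   # one (S,G) entry per channel
   bidir_state = groups                     # one shared (*,G) tree per group

   print("PIM-SSM:   %d entries" % ssm_state)    # 100000
   print("PIM-BIDIR: %d entries" % bidir_state)  # 1000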
5. Challenges of using multicast in the Data Center
When IGMP/MLD snooping is not implemented, Ethernet switches will flood multicast frames out of all switch ports, which turns the traffic into something more like broadcast.
VRRP uses multicast heartbeats to communicate between routers. The communication between the host and the default gateway is unicast. The multicast heartbeats can be very chatty when there are thousands of VRRP pairs with sub-second heartbeats going back and forth.
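A rough calculation, with assumed numbers, gives a feel for that control-plane chatter:

   # Illustrative VRRP advertisement load; the counts are assumptions.
   vrrp_pairs = 2000    # virtual router instances in the data center
   interval_s = 0.25    # sub-second advertisement interval (example)

   msgs_per_second = vrrp_pairs / interval_s   # one master advertises per pair
   print("%d multicast advertisements per second" % msgs_per_second)  # 8000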
Link-local multicast should scale well within one IP subnet, particularly with a large Layer 3 domain extending down to the access or aggregation switches. But if multicast traverses beyond one IP subnet, which is necessary for an overlay like VXLAN, there could potentially be scaling concerns. If using a VXLAN overlay, it is necessary either to map the L2 multicast in the overlay to L3 multicast in the underlay or to do head-end replication in the overlay and receive duplicate frames on the first link from the router to the core switch. The solution could require potentially thousands of PIM messages to generate and maintain the required multicast state in the IP underlay. The behavior of the upper layer, with respect to broadcast/multicast, affects the choice between head-end replication and (*,G) or (S,G) replication in the underlay, which in turn affects the opex and capex of the entire solution. A VXLAN with thousands of logical groups maps either to head-end replication in the hypervisor or to IGMP from the hypervisor and then PIM between the ToR and core switches and the gateway router.
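The replication trade-off can be sketched numerically (the VTEP count is an assumption):

   # Copies a source VTEP must transmit for one overlay multicast frame.
   remote_vteps_with_receivers = 50   # assumed VTEPs hosting interested VMs

   head_end_copies = remote_vteps_with_receivers   # one unicast copy per VTEP
   underlay_multicast_copies = 1                   # the underlay tree replicates

   print(head_end_copies, underlay_multicast_copies)   # 50 vs 1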
Requiring IP multicast (especially PIM-BIDIR) from the network can prove challenging for data center operators, especially at the kind of scale that the VXLAN/NVGRE proposals require. This is also true when
the L2 topological domain is large and extended all the way to the L3
core. In data centers with highly virtualized servers, even small L2
domains may spread across many server racks (i.e. multiple switches
and router ports).
It is not uncommon for there to be 10-20 VMs per server in a virtualized environment. One vendor reported a customer requesting a scale of 400 VMs per server. For multicast to be a viable solution in this environment, the network needs to be able to scale to these numbers when these VMs are sending and receiving multicast.
A lot of switching/routing hardware has problems with IP multicast, particularly with regard to hardware support of PIM-BIDIR.
Sending L2 multicast over a campus or data center backbone, in any sort of significant way, is a new challenge enabled for the first time by overlays. There are interesting challenges when pushing large amounts of multicast traffic through a network; these have thus far been dealt with using purpose-built networks. While the overlay proposals have been careful not to impose new protocol requirements, they have not addressed the issues of performance and scalability, nor the large-scale availability of these protocols.
There is an unnecessary multicast stream flooding problem in the link
layer switches between the multicast source and the PIM First Hop
Router (FHR). The IGMP snooping switch will forward multicast streams to router ports, and the PIM FHR must receive all multicast streams even if there is no request from a receiver. This often leads to wasted switch cache and link bandwidth when the multicast streams are not actually required. [I-D.pim-umf-problem-statement]
details the problem and defines design goals for a generic mechanism
to restrain the unnecessary multicast stream flooding.
6. Layer 3 / Layer 2 Topological Variations
As discussed in [I-D.armd-problem-statement], there are a variety of
topological data center variations including L3 to Access Switches,
L3 to Aggregation Switches, and L3 in the Core only. Further analysis is needed in order to understand how these variations affect IP multicast scalability.
7. Acknowledgements
The authors would like to thank the many individuals who contributed
opinions on the ARMD wg mailing list about this topic: Linda Dunbar,
Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas
Narten.
8. IANA Considerations
This memo includes no request to IANA.
9. Security Considerations
No security considerations at this time.
10. Informative References
[I-D.armd-problem-statement]
Narten, T., Karir, M., and I. Foo,
"draft-ietf-armd-problem-statement (work in progress)", February 2012.
[I-D.pim-umf-problem-statement]
Zhou, D., Deng, H., Shi, Y., Liu, H., and I. Bhattacharya,
"draft-dizhou-pim-umf-problem-statement (work in progress)", October 2010.
[RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
"Protocol Independent Multicast - Sparse Mode (PIM-SM):
Protocol Specification (Revised)", RFC 4601, August 2006.
[RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for
IP", RFC 4607, August 2006.
[RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
"Bidirectional Protocol Independent Multicast (BIDIR-
PIM)", RFC 5015, October 2007.
Authors' Addresses
Mike McBride
Huawei Technologies
2330 Central Expressway
Santa Clara, CA 95050
USA
Email: michael.mcbride@huawei.com
Helen Liu
Huawei Technologies
Building Q14, No. 156, Beiqing Rd.
Beijing, 100095
China
Email: helen.liu@huawei.com