Internet Engineering Task Force M. Scott, Ed.
Internet-Draft D. Wagner-Hall
Intended status: Informational J. Crowcroft
Expires: April 21, 2011 University of Cambridge
October 18, 2010
Addressing the Scalability of Ethernet with MOOSE
draft-malc-armd-moose-00
Abstract
Ethernet does not scale well to large networks. The flat MAC address
space, whilst having obvious benefits for the user and administrator,
is the primary cause of this poor scalability; other recent efforts
to improve upon Ethernet's scalability have addressed symptoms,
rather than this underlying cause. MOOSE, Multi-level Origin-
Organised Scalable Ethernet, is an Ethernet switch architecture that
performs in-place rewriting of MAC addresses in order to impose a
hierarchy upon the address space without reconfiguration or
modification of connected devices. This removes the need for
switches to maintain large forwarding databases, is of direct use in
implementing improved routing, and allows for a variety of other
scalability and security innovations. MOOSE also includes a
globally-scalable, distributed and resilient protocol for the
automatic assignment of addresses to switches, and for detecting and
cheaply resolving addressing conflicts.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 21, 2011.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
Scott, et al. Expires April 21, 2011 [Page 1]
Internet-Draft MOOSE October 2010
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
1.1. Requirements Language
2. Ethernet's Underlying Problem
3. Related Work
4. MOOSE Architecture
4.1. Shortest Path Routing
4.2. Address Selection and Conflict Resolution
4.3. Broadcast and Multicast
4.4. Example
4.5. Directory Service
4.6. Mobility
5. Interoperability Considerations
5.1. Layer-violating Protocols
5.2. Edge Virtual Bridging
6. Prototype Implementation
7. Conclusions
8. IANA Considerations
9. Security Considerations
10. Informative References
Authors' Addresses
1. Introduction
Ethernet has lasted well since its inception in the 1970s, with the
Ethernet frame structure and addressing remaining ubiquitous in the
data centre environment as in many others. Alongside IP and IP-
transported services such as iSCSI, it is now commonplace to see
converged network services such as physical disk interfaces and
cluster interconnects layered directly over Ethernet (e.g. ATA-over-
Ethernet and variants of Infiniband). However, Ethernet exhibits
scalability issues on networks of more than a few thousand devices,
such as costly and energy-dense address table logic and storms of
broadcast traffic.
Aside from more physical devices, virtualised infrastructure further
increases the density of Ethernet addresses in data centres. Widely-
used layer-2 virtualisation [Cl05] mandates a unique Ethernet address
per virtual machine. This means that each physical machine in a data
centre may represent many tens of Ethernet devices.
The traditional method of avoiding such problems is the artificial
subdivision of a network, but this introduces an administrative
burden, requires significant routing equipment and also precludes
seamless migration--a necessity for virtualised infrastructure.
While IP Mobility [RFC3344] addresses the problem of maintaining
higher-layer connections when roaming between subnets, it requires
client support that is neither ubiquitous nor reliable. Common
practice sees the provision of one physical Ethernet network covering
an entire data centre, or even an entire WAN of data centres.
Our approach, Multi-level Origin-Organised Scalable Ethernet (MOOSE),
provides all the advantages of an Ethernet network without the
capital and running costs and administrative overhead of an IP router-
based approach. MOOSE does this by providing a hierarchical
addressing scheme without requiring host reconfiguration or
modification.
Ethernet's scalability is limited firstly by the forwarding database
that every switch in an Ethernet [802.1D] network must maintain. A
switch's forwarding database contains one entry per source address
seen in any frame passing through that switch, and stores that MAC
address together with the learnt location of that address--the port
on which packets from that address were last seen. This is later
used to determine on which port to transmit frames destined for that
address. Devices frequently broadcast frames throughout the network
(e.g. ARP queries) so active devices on the network are listed in
most switches' forwarding databases most of the time.
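The learning behaviour described above can be sketched as follows.
This is a minimal illustration only, not taken from any switch
implementation; the class and method names are assumptions made for
the example.

```python
class LearningSwitch:
    """Toy model of a traditional learning switch (illustrative names)."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.fdb = {}  # source MAC address -> port on which it was last seen

    def receive(self, src, dst, in_port):
        """Learn the sender's location, then return the output ports."""
        self.fdb[src] = in_port
        out = self.fdb.get(dst)
        if out is None:
            # Unknown (or broadcast) destination: flood to all other ports.
            return [p for p in self.ports if p != in_port]
        # Known destination: use the learnt port, never the arrival port.
        return [] if out == in_port else [out]


sw = LearningSwitch(ports=[1, 2, 3])
flooded = sw.receive("aa:aa:aa:00:00:01", "ff:ff:ff:ff:ff:ff", in_port=1)
unicast = sw.receive("bb:bb:bb:00:00:02", "aa:aa:aa:00:00:01", in_port=2)
```

Note that the table gains one entry per sending host, regardless of
where on the network that host is attached.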
In modern switches the capacity of this database is generally of the
order of 16,000 entries. (Higher-capacity forwarding databases exist
but are currently constrained to very high-end switches.) On a
moderately large network, full databases are a serious risk. If the
database becomes full, entries will be discarded; frames for unknown
addresses are flooded to all ports and the resulting traffic storm
could cause major problems, especially in the presence of low-
capacity edge links.
Traditionally the forwarding database has been stored in a content-
addressable memory (CAM) as lookups must be very fast, particularly
as 10 Gbit/s Ethernet becomes ubiquitous. As networks grow, the
number of entries in a switch's forwarding database must naturally
increase; however, increasing the capacity of CAMs without
sacrificing speed whilst constraining energy consumption is proving
to be challenging. Cheaper switches use DRAM in place of a CAM, but
this is likely to remain slower especially for large tables.
Secondly, Ethernet's inability to handle networks containing loops
also presents a scalability problem. The Rapid Spanning Tree
Protocol, RSTP, must remove loops by disabling any redundant links.
On a dense mesh network, RSTP will disable a large proportion of
links; this constrains frames to suboptimal routes and may introduce
bottlenecks in the network, particularly around the root of the
spanning tree. In a data centre environment, this potentially
amounts to a very large proportion of capacity being wasted wherever
redundant fibres are installed, e.g. between cabinet switches and
between data centres.
Thirdly, not only does Ethernet flood frames destined for unknown
hosts, but it also uses--and encourages higher-layer protocols to
use--broadcast for control messages. For example, ARP [RFC0826]
performs address resolution via broadcast queries, and DHCP [RFC2131]
uses broadcast messages for automatic configuration. It is
impractical to replace these protocols entirely as this would require
software upgrades to every device, but it would be desirable for the
network to minimise the amount of broadcast traffic required to be
forwarded.
In this document we identify the relevant underlying problems in the
design of Ethernet, review previous work and present the MOOSE switch
architecture, which addresses inadequacies in the fundamental
operation of Ethernet in a novel yet backwards-compatible way. By
revisiting the addressing scheme itself, rather than simply
addressing symptoms of the problem as many previous proposed
solutions have done, we can go about solving all of the above
scalability problems and more.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.
2. Ethernet's Underlying Problem
The original Ethernet was a shared-medium network, where every frame
was broadcast and no switching took place. Modern-day wired
Ethernet-based networks instead consist almost entirely of point-to-
point links; as a result of this, the distinction between unicast,
broadcast and multicast has become more important. 802.11 wireless
LANs are the one remaining vestige of Ethernet operating over shared
media, where one switch (access point) serves many hosts on the same
radio channel.
Ethernet's poor scalability arises in various guises, as outlined
above. It would seem at first glance that these are entirely
distinct and unrelated. However, there is a common underlying cause:
that MAC addresses provide no location information.
Globally-unique MAC addresses are structured such that the first
three bytes of a device's address contain an organisationally unique
identifier (OUI) allocated to the device's manufacturer by the IEEE,
with the remaining three bytes allocated by the manufacturer. This
hierarchy exists solely for the purpose of allocating unique
addresses in a decentralised fashion, and is of no use to Ethernet
switches, which must treat the unicast address space as flat.
A flat address space has the advantage that no configuration of
devices is required; a device can use its unique, manufacturer-
assigned MAC address anywhere on any network. However, this leaves
each switch with the task of discovering and storing the location of
every addressable device.
If the MAC address space were not flat, but instead contained enough
information to locate the device possessing the address, several
advantages would be gained. Firstly, large forwarding databases
would no longer have to be maintained on every switch. This location
information could instead be distributed across the network so that
frames are directed towards their destinations according to
successive stages of a hierarchy.
Secondly, a hierarchical MAC address space would also make the
addition of shortest-path routing considerably easier. Shortest-path
routing is clearly a desirable property for a network, yet it is one
that Ethernet does not provide. Flat addressing does not lend itself
to easy routing: any address can be located anywhere on the network,
which means either advertising every host's MAC address via the
routing protocol--which scales very poorly--or providing some other
location lookup service. The use of hierarchical addresses, with
each switch handling a block of sequential addresses akin to an IP
subnet, would reduce the routing problem to the one that routing
protocols were designed to solve.
Thirdly, this would allow for reduction of broadcast traffic in a
variety of different ways. Hierarchical MAC addresses could, for
example, be mapped directly and deterministically onto the IP address
space, if appropriate for the specific deployment. This would allow
switches to respond directly and simply to DHCP and ARP queries,
avoiding the need to forward the most common sources of broadcast
frames. Alternatively, a distributed directory service can be used,
which is less limiting and is thus our preferred approach as detailed
below.
The facility for network administrators to assign locally
administered addresses (LAAs) to devices has existed for as long as
Ethernet. However, configuring and maintaining the LAA on every
device based upon where they are connected would be a considerable
and unwelcome administrative overhead. We therefore present MOOSE, a
system for applying hierarchical addressing to an Ethernet
transparently and without any configuration to edge devices.
3. Related Work
It is well-known that traditional Ethernet scales poorly, and there
have been various attempts in recent years to rectify this. The most
widely-used of these in real-world networks is MPLS-VPLS [RFC3031]
(Multiprotocol Label Switching--Virtual Private LAN Service). This
connects Ethernet islands together through tunnels across an MPLS
cloud. MPLS works by adding one or more labels to the start of every
frame, i.e. encapsulating the frame inside its own protocol.
In MPLS-VPLS, the label edge routers (LERs) must determine the
frame's initial label(s) based upon the destination address via a
lookup table. Frames follow prenegotiated label-switched paths
(LSPs) that, unlike Ethernet, are not constrained to follow a
spanning tree; LSPs are precomputed at connection setup time and the
relevant next hop is stored in a lookup table on each intermediate
switch. Each switch must hence use each frame's label to index into
this lookup table to determine how to switch the frame.
The effect, once the connection has been negotiated, is to provide
what appears to be one or more large Ethernet networks, transparently
overlaid on the MPLS cloud. Whilst this effectively solves the
problem of shortest-path routing across the MPLS cloud, the overlay
Ethernets are still susceptible to the usual scalability problems--
and in fact VPLS adds further large lookup tables on every switch
that can in some configurations scale even worse than Ethernet's
forwarding databases. LERs must map every MAC address to an LSP;
label switch routers (LSRs) must store the next hop for every LSP in
which they participate, which in the core of the network could scale
as O(hosts^2).
A similar scheme is proposed by Hadzic [Ha01], with the difference
that Ethernet-inside-Ethernet encapsulation is used rather than a new
protocol. This has the advantage that less processing is required on
intermediate switches in the backbone network. However, routes
across the backbone are constrained to a spanning tree, and
encapsulating switches must obtain a new destination address for
every frame using a lookup table that--like Ethernet's forwarding
database--must contain every transmitting MAC address. Due to its
heavy basis on Ethernet, this shares many of Ethernet's scalability
problems.
SmartBridge [Ro00] and RBridges [Pe04] (TRILL [RFC5556]) both
encapsulate Ethernet frames in a new inter-switch protocol, and run a
link-state routing protocol between switches. The link state graph
includes the location of every MAC address--necessary because the
address space remains flat and any address could appear
anywhere--i.e. it again contains every host. Furthermore, switches
must perform expensive computation to update routing tables whenever
a MAC address joins or leaves the network.
Myers et al [My04] suggest that Ethernet's main failing is its
broadcast service, and propose a new architecture in which hosts make
explicit use of directory services operated by switches rather than
broadcasting queries. It is clear that switches' participation is
necessary in order to deal with the broadcast problem; however the
modifications to Ethernet suggested are not backwards-compatible and
would require at least software modifications to all connected
devices. Ethernet is, perhaps unfortunately, too widespread for this
to be practical; transparent interception of broadcast frames and
subsequent local handling or redirection via multicast or unicast
remains the only practical solution. The use of hierarchical
addressing is a useful stepping-stone to such a system, and our
architecture includes a transparent directory service (ELK) for this
purpose.
SEATTLE [Ki08] takes a more scalable approach. A routing protocol is
operated between switches, but in contrast to the approaches
described above and in common with MOOSE, the routing protocol only
propagates switch location information, rather than every MAC address
on the network. Flat MAC addresses are still used, and hence a
mechanism is required to look up the switch to which a given address
is connected. This is achieved by using a distributed hash table
(DHT) operating on participating switches with local caching to
alleviate load. This is certainly a step in the right direction but
introduces considerable complexity to switches, since they now must
maintain and update the DHT continually, and it is clear that a
SEATTLE switch would have a significant software component in the
data path. MOOSE alleviates some of the complexity of SEATTLE by a
combination of hierarchical addresses and delegation to a separate
directory service.
4. MOOSE Architecture
The basic operation of MOOSE is to assign a new hierarchical MAC
address to each host on the network, assigned dynamically and
automatically from the unicast LAA space. This dynamically-assigned
address is referred to as a MOOSE address to avoid confusion with
hosts' static, manufacturer-assigned MAC addresses.
Every frame entering the network has its source address rewritten in-
place to the sending host's MOOSE address by the first MOOSE-aware
switch it traverses. The switch that performs address rewriting for
a host--i.e. the closest MOOSE switch to that host--is the host's
home switch and is responsible for assigning a MOOSE address to that
host. (If non-MOOSE switches or hubs are in use, a host may have
more than one "closest" MOOSE switch, in which case an RSTP-like
protocol must be used to elect a switch to handle each edge segment.)
The destination address is left intact in the expectation that it
already is a MOOSE address. Hosts' ARP caches will already contain
the MOOSE addresses of any hosts being communicated with as any
packet received will already have had its source address rewritten; a
host's manufacturer-assigned MAC address is never seen outside of the
segment containing that host. This is a crucial point since
encapsulation-based technologies such as MPLS do not reveal to the
destination host the address used for routing; as a result, switches
must also convert destination as well as source addresses of frames
entering the network. In other words, once again switches must
maintain large tables of remote hosts on the network. The only
destination rewriting that MOOSE switches perform, however, is of the
destination addresses of frames destined for local hosts back to
their manufacturer-assigned MAC addresses; this is simple as the
required information is already known, and necessary because
otherwise that host's network interface card would discard the frame
as misaddressed.
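A home switch's two rewriting steps might be sketched as below. This
is a hedged illustration under simplifying assumptions: addresses are
colon-separated strings, host identifiers are assigned sequentially,
and the names HomeSwitch, ingress and egress are hypothetical rather
than part of MOOSE.

```python
def _host_suffix(n):
    """Format an integer host identifier as three colon-separated bytes."""
    return "%02x:%02x:%02x" % ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)


class HomeSwitch:
    """Toy home switch holding only its own hosts' address mappings."""

    def __init__(self, switch_id):
        self.switch_id = switch_id  # e.g. "02:22:22"
        self.moose_by_mac = {}      # real MAC -> MOOSE address
        self.mac_by_moose = {}      # MOOSE address -> real MAC

    def ingress(self, src_mac, dst):
        """Frame from a local host: rewrite the source address only."""
        moose = self.moose_by_mac.get(src_mac)
        if moose is None:
            # Assumption: sequential assignment of host identifiers.
            host_id = len(self.moose_by_mac) + 1
            moose = self.switch_id + ":" + _host_suffix(host_id)
            self.moose_by_mac[src_mac] = moose
            self.mac_by_moose[moose] = src_mac
        return moose, dst

    def egress(self, src, dst_moose):
        """Frame for a local host: restore the host's real MAC address."""
        return src, self.mac_by_moose[dst_moose]


sw = HomeSwitch("02:22:22")
sent = sw.ingress("00:16:3e:aa:bb:cc", "02:33:33:00:00:01")
delivered = sw.egress("02:33:33:00:00:01", "02:22:22:00:00:01")
```

The only per-host state is for directly-connected hosts; nothing is
stored about remote hosts.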
A MOOSE address consists of a switch identifier followed by a host
identifier. For our examples, we simply use a fixed three-byte
switch identifier followed by a fixed three-byte host identifier:
   +----------+     +----------+
   |  switch  |_____|  switch  |_ _ _ _ hosts 02:22:22:00:00:01,
   | 02:11:11 |     | 02:22:22 |              02:22:22:00:00:02, etc.
   +----------+     +----------+
        |
        |
   +----------+
   |  switch  |_ _ _ _ hosts 02:33:33:00:00:01,
   | 02:33:33 |              02:33:33:00:00:02, etc.
   +----------+
Since these two identifiers when concatenated must form a unicast
LAA, the settings of two bits in the first byte of the switch
identifier are fixed: the least significant bit must be 0 to indicate
a unicast address, and the second-least significant bit must be 1 to
indicate an LAA. To cater for variable-length switch identifiers,
some means of introducing separation between the switch and host
identifiers is required. Two possible implementations would be for:
1. the first three bits of the address to indicate how many of the
following 5-bit blocks make up the switch prefix;
2. some constant delimiter to appear between the switch identifier
and host identifier, with switch identifiers not allowed to
contain the delimiter.
The former is simple and gives eight classes of switch identifier.
Because the size of a MOOSE network is limited by the placement of IP
routers, these classes should be sufficient. Additionally, because
switches are free to change their identifiers, they may trivially
switch to a larger class if they have too many attached hosts, or if
a smaller class becomes full.
The latter removes the fixed classes, allowing for more flexibility
with the sizes of switch identifiers, at the cost of complexity, and
a reduction in the available address space.
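For the fixed three-byte split used in our examples, composing and
splitting a MOOSE address, and checking the two fixed bits of the
first byte, could look like the following sketch. The function names
are assumptions for illustration, and the variable-length schemes
above are not modelled.

```python
def is_unicast_laa(first_byte):
    """True if the least significant bit is 0 (unicast) and the
    second-least significant bit is 1 (locally administered)."""
    return (first_byte & 0x03) == 0x02


def make_moose_address(switch_id, host_id):
    """Concatenate a 3-byte switch identifier and a 3-byte host identifier."""
    if len(switch_id) != 3 or len(host_id) != 3:
        raise ValueError("this sketch assumes fixed three-byte identifiers")
    if not is_unicast_laa(switch_id[0]):
        raise ValueError("switch identifier must begin a unicast LAA")
    return bytes(switch_id) + bytes(host_id)


def split_moose_address(addr):
    """Return (switch identifier, host identifier) for a 6-byte address."""
    return addr[:3], addr[3:]


def fmt(addr):
    """Render bytes in the usual colon-separated hexadecimal form."""
    return ":".join("%02x" % b for b in addr)


addr = make_moose_address(bytes.fromhex("022222"), bytes.fromhex("000001"))
```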
Each switch can select for itself a unique switch identifier, as
identifier conflict resolution is cheap (see below). When first
joining the routing protocol, conflict should be very unlikely, as
the switch will in the process gain an up-to-date list of in-use
identifiers. Depending on requirements, the switch identifier may
itself be a hierarchical address--e.g. six bits to identify a network
area followed by two bytes to identify a switch within that area--
which could then be used to aid routing decisions.
Each host is assigned a host identifier by its home switch from the
pool of identifiers available to that switch. Only a host's home
switch ever bases a switching decision on the host identifier, so the
detail of how these are allocated can vary from switch to switch.
Suitable schemes include:
1. sequential assignment;
2. the port number followed by a sequential portion (to allow for
multiple hosts connected to one port);
3. a hash of the host's real MAC address.
The latter two approaches are preferable to a simple sequential
assignment, as they better isolate certain kinds of denial-of-service
attack in which a malicious host attempts to use up all available
host identifiers on the switch. They also require less state to be
shared between ports. The third option has the further advantage
that it is deterministic and hence can be recovered easily in the
event of a crash.
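The second and third schemes might be sketched as below, assuming
three-byte host identifiers. The choice of SHA-256, the one-byte port
field and the function names are assumptions made for this example;
hash collisions between hosts on the same switch would need handling
that is not shown.

```python
import hashlib


def host_id_from_port(port, seq):
    """Scheme 2: one byte of port number, then a two-byte sequence number."""
    if not (0 <= port < 256 and 0 <= seq < 65536):
        raise ValueError("port or sequence number out of range")
    return bytes([port]) + seq.to_bytes(2, "big")


def host_id_from_hash(real_mac):
    """Scheme 3: a deterministic identifier taken from a hash of the
    host's real MAC address (truncated to three bytes)."""
    return hashlib.sha256(real_mac).digest()[:3]


mac = bytes.fromhex("00163eaabbcc")
by_port = host_id_from_port(5, 1)
by_hash = host_id_from_hash(mac)
```

Because the hash-based identifier depends only on the host's real
address, a restarted switch would recompute the same value without
any stored state.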
It is hence possible to route frames through the network to remote
hosts by simply inspecting the switch identifier in the destination
address, and ignoring the host identifier until the frame reaches the
destination host's home switch. Switches no longer need to keep a
table of all MAC addresses seen recently; they only need store the
locations of other switches and of any directly-connected hosts.
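A forwarding decision of this kind could be sketched as follows,
again assuming the fixed three-byte split and colon-separated string
addresses. The function and table names are hypothetical.

```python
def output_port(dst, my_switch_id, switch_ports, host_ports):
    """Choose an output port for a frame addressed to MOOSE address dst.

    switch_ports maps remote switch identifiers to ports;
    host_ports maps local host identifiers to ports.
    Returns None when no entry is known.
    """
    switch_id, host_id = dst[:8], dst[9:]
    if switch_id != my_switch_id:
        # Remote host: only the switch identifier is consulted.
        return switch_ports.get(switch_id)
    # Local host: only now is the host identifier examined.
    return host_ports.get(host_id)


switch_ports = {"02:11:11": 1, "02:33:33": 1}
host_ports = {"00:00:01": 4}
remote = output_port("02:33:33:00:00:02", "02:22:22", switch_ports, host_ports)
local = output_port("02:22:22:00:00:01", "02:22:22", switch_ports, host_ports)
```

The size of switch_ports grows with the number of switches, and
host_ports with the number of directly-connected hosts, rather than
with the total number of hosts on the network.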
As well as reducing the amount of data that must be consulted in
order to make switching decisions, this provides extra resilience by
making this data much more predictable. The number of MAC addresses
in a network can increase unexpectedly in the event of an address
flooding attack or even under normal operation if the network
contains open wireless access points; relying on the MAC address list
for forwarding leads to some of the vulnerabilities of Ethernet. The
set of switch identifiers participating in MOOSE switching, on the
other hand, is kept predictable and manageable by ensuring that
neighbouring switches (discovered using LLDP [802.1AB]) are
authenticated before they can participate in the routing protocol.
This authentication can be achieved at layer 3 using the security
features found in most popular routing protocols and/or at layer 2
[802.1X]. As the switch identifier is the only address consulted for
forwarding decisions, a MOOSE switch is likely to remain reliable in
the face of attacks that could have brought down a traditional
Ethernet. Furthermore, any attacks based upon MAC address spoofing
cannot function on a MOOSE network as the user-provided MAC address
is translated immediately.
4.1. Shortest Path Routing
As described so far, MOOSE switches must still forward frames along a
spanning tree. As discussed above, this is an undesirable property
of Ethernet as it can cause frames to take a highly suboptimal path
through the network. The foundations are in place to do much better
than this using shortest-path routing.
For the purpose of frame forwarding, a MOOSE switch can be considered
akin to a layer 3 router; it has one locally-connected subnet--
containing all addresses starting with its switch identifier--and
delivers frames to other subnets by passing them to an appropriate
neighbouring switch. Bearing this in mind, the switch can run a
routing protocol of the kind normally used for IP, such as a variant
of OSPF [RFC2328]. This allows frames to be routed along the
shortest available path, rather than being constrained to a spanning
tree. A multipath variant such as OSPF-OMP may be particularly
desirable due to its ability to make use of multiple equal-cost
routing paths in order to improve performance.
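As an illustration of what such a routing protocol computes, the
sketch below runs Dijkstra's algorithm over a graph of switches and
returns the first hop towards each destination switch. It is not
OSPF itself; the graph, link costs and function name are assumptions
for the example, and equal-cost multipath is not modelled.

```python
import heapq


def next_hops(graph, source):
    """graph: {switch: {neighbour: cost}} with non-negative costs.
    Returns {destination switch: neighbour of source to forward to}."""
    dist = {source: 0}
    first = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry; a shorter path was found since
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # Inherit the first hop, unless nbr is adjacent to source.
                first[nbr] = nbr if node == source else first[node]
                heapq.heappush(heap, (nd, nbr))
    return first


# Hypothetical topology containing a loop A-B-C-A.
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"C": 1},
}
hops = next_hops(graph, "A")
```

Here the redundant A-C link remains available to the routing
computation instead of being disabled, as it might be under a
spanning tree.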
4.2. Address Selection and Conflict Resolution
For reasons akin to those of the flaws of Ethernet, it is undesirable
to guarantee universally unique pre-determined MOOSE switch
identifiers. Due to the reduced size of the switch ID space compared
to the MAC address space, this would also be infeasible. We
therefore propose that each switch selects an initial address for
itself during startup. This could result in more than one switch
claiming an address, which would be undesirable, so to mitigate the
potential for MOOSE addresses to find themselves in conflict we
additionally propose a simple and inexpensive conflict resolution
protocol.
Suppose two switches each have the same identifier. We note that if
these switches are on separate MOOSE networks (on disconnected
networks, or separated by an IP router), this situation brings no
issue. Should they be on the same MOOSE network, however, a conflict
exists and must be resolved. Any routing protocol would require a
switch to know which port other switches are connected to, for
instance by OSPF neighbour lists, or simply by receiving frames and
noting the switch port and source MOOSE address. When a switch
receives a MOOSE frame, it looks up the source switch in its
forwarding database, which is likely to be held in fast content-
addressable memory (CAM). If it finds that source switch to be on a
port other than
that which it recognises from its table, one of three situations may
be possible:
1. the source switch may be the same as the known switch, and have
physically moved, or a topology change has occurred;
2. the source switch may be a different one to the known switch, and
they are in conflict;
3. the source switch may be the same as the known switch, but is
sending frames down a different route to the last used route.
To avoid disruption to the network in the first case, and to give
scope for switches to migrate within the network, the switch which
detected the possible conflict should ascertain whether the known
switch is still alive and present. The conflict-resolving switch
therefore sends a unicast frame to the known switch, via the port
stored in the forwarding database, asking whether it is still there,
and repeats this at a regular interval until a timeout expires. If
the known switch is still present, this frame will reach it rather
than the new switch, because switches beyond that port cannot yet
have detected the conflict. We leave the nature of the timeout
unspecified; it can be implementation-specific. It may, for
instance, be a pre-defined constant, or it may
vary based on QoS information gathered if such capabilities are
supported. When a MOOSE switch receives such a frame, it should
promptly respond with an acknowledgement frame, showing that it is
alive.
If, within the timeout period, the conflict resolver finds the known
switch not to be alive, no conflict exists, so the switch updates its
view of the network by removing the old entry from its forwarding
database and triggering a routing protocol refresh.
If, on the other hand, the known switch is found to be alive, a
conflict
exists. The conflict resolver then sends a frame to the more
recently found switch indicating that it is in conflict and should
change its address. That switch, upon receiving this frame, changes
its address and sends a gratuitous ARP for each of its connected
hosts, so that the rest of the network is aware of the change. To
mitigate the risks of a denial of service attack, or faulty equipment
sending out conflict frames, an exponential backoff algorithm should
be used when receiving conflict notification frames.
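The detection procedure described above might be sketched as follows.
This is a simplified illustration: the probe callable stands in for
the repeated unicast query and its timeout, the returned strings
stand in for the actions taken, and all names are hypothetical. The
sketch does not distinguish the third situation listed above (the
same switch sending via a different route) from a true conflict.

```python
def check_source(fdb, switch_id, in_port, probe):
    """Decide what to do on seeing switch_id as a source on in_port.

    fdb maps switch identifiers to the port on which they were learnt.
    probe(switch_id, port) returns True if the known switch answers
    via that port before the timeout.
    """
    known_port = fdb.get(switch_id)
    if known_port is None:
        fdb[switch_id] = in_port
        return "learnt"
    if known_port == in_port:
        return "ok"
    if probe(switch_id, known_port):
        # Known switch is alive: notify the newer switch via in_port.
        return "conflict"
    # Known switch has gone: update the entry and refresh routing.
    fdb[switch_id] = in_port
    return "moved"


fdb = {"02:22:22": 1}
alive = check_source(fdb, "02:22:22", 3, probe=lambda sid, port: True)
gone = check_source(fdb, "02:22:22", 3, probe=lambda sid, port: False)
```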
A switch should maintain a timer, and a counter that determines the
starting value of that timer, both initialised to 0. When a conflict
notification frame is received, the counter is incremented (subject
to a saturation value to avoid excessive timeouts). After a conflict
has been resolved--i.e. the switch has changed its address--a timer
starts counting down from some time exponential in that counter;
subsequently the switch will only change its address if the timer has
returned to 0 by the time the conflict frame is received. The
counter should be reset to 0 when the timer reaches 0. Using this
scheme the event of true conflict is handled quickly, even in the
unlikely case that the newly acquired address is also in conflict.
Any node emitting malicious or erroneous conflict notifications,
however, is rate-limited enough that their damage potential is much
restricted, subject to a sufficient timer being chosen.
Pseudocode: Conflict resolution backoff:

   if timer > 0:
     if counter < counter_max:
       counter = counter + 1
     # Discard conflict notification frame
   else:
     timer = k^counter
     change_address()

Pseudocode: Conflict resolution timer:

   foreach clock tick do:
     if timer > 0:
       timer = timer - 1
     else:
       counter = 0
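A direct, runnable transcription of the pseudocode is given below as
a sketch. The class name, the default base k = 2 and the saturation
value counter_max = 8 are assumptions chosen for illustration, since
this document leaves them unspecified.

```python
class ConflictBackoff:
    """Backoff state for one switch (illustrative parameter values)."""

    def __init__(self, k=2, counter_max=8):
        self.k = k
        self.counter_max = counter_max
        self.timer = 0
        self.counter = 0
        self.address_changes = 0

    def on_conflict_notification(self):
        """Return True if the switch changes its address."""
        if self.timer > 0:
            if self.counter < self.counter_max:
                self.counter += 1
            return False  # notification discarded during backoff
        self.timer = self.k ** self.counter
        self.address_changes += 1  # stands in for change_address()
        return True

    def tick(self):
        """Called once per clock tick."""
        if self.timer > 0:
            self.timer -= 1
        else:
            self.counter = 0


b = ConflictBackoff()
first = b.on_conflict_notification()   # timer was 0: address changes
second = b.on_conflict_notification()  # within backoff: discarded
b.tick()
third = b.on_conflict_notification()   # timer expired: changes again
```

After the third notification the timer starts from k^1 rather than
k^0, so notifications arriving in quick succession lead to
progressively longer periods in which further changes are refused.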
This could be further enhanced by detecting repeated conflicts
involving the same switch or switches, in a manner similar to BGP
Route Flap Damping [RFC2439], and performing more aggressive steps to
avoid further conflicts--for example using a significantly increased
timeout, and/or having *both* switches in conflict select new
addresses.
The conflict resolution algorithm brings a marked improvement on the
equivalent vulnerability of Ethernet, that MAC addresses can be
spoofed. We build in a flexible, well-defined system of recovery.
The decentralised nature of the system makes it much less open to
denial of service attack than any centralised directory may be.
Having every MOOSE switch acting as a barrier to the propagation of
packets from addresses in conflict provides a strong separation
between recently bridged networks with conflicting addresses, so that
communication within the individual networks may continue without
modification, until bridge-crossing traffic appears, at which point
resolution quickly happens. We also remove the possibility of
forwarding databases frequently having to switch their entry for a
conflicted address, which can happen with MAC conflicts in
traditional Ethernet. Additionally, in the case of a switch
identifier spoofing attack, the conflict resolver acts as a hard
boundary for the effects of such an attack.
It is possible that the switch performing conflict resolution could
send a suggested replacement switch address to the switch in
conflict, known by the conflict resolver to have a low probability of
being present on the network (because it is not present in its
forwarding database). This would reduce the chance of repeated
collisions, and potentially allow for longer backoff periods, but may
be premature optimisation.
Because multi-path routing is often desirable, we could introduce an
extra datum during the source address rewriting performed by MOOSE
switches. When an ingress MOOSE switch rewrites the source address
of an Ethernet frame to a MOOSE address, it could also prepend some
hash of its manufacturer-assigned MAC address to the data field, and
increment the length field as necessary. The egress switch, when
rewriting the MOOSE destination address to a host's MAC address, then
strips out this added datum. This allows the conflict resolver to
check whether conflicts actually exist by local lookup, rather than
probing other switches, at the cost of added memory requirements in
every switch. This may push the frame to be larger than Ethernet's
maximum, so may require fragmenting the packet into two, at small
added cost. Alternatively, assuming jumbo frames are permitted by
the hardware, the maximum frame size could be marginally reduced to
allow for this in the same manner as for 802.1Q VLAN tags.
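As a rough sketch of the tagging idea described above, the following
Python fragment prepends a short hash of the ingress switch's
manufacturer-assigned MAC address to the payload and strips it again
at egress. The choice of SHA-256, the 4-byte tag length and the
function names are assumptions for illustration; this document does
not specify them. The 1500-byte limit is the standard Ethernet
payload size.

```python
import hashlib

TAG_LEN = 4          # assumed tag size in bytes (not specified in this document)
MAX_PAYLOAD = 1500   # standard Ethernet payload limit in bytes


def switch_tag(switch_mac: bytes) -> bytes:
    """Derive a short tag from the switch's manufacturer-assigned MAC address."""
    return hashlib.sha256(switch_mac).digest()[:TAG_LEN]


def add_tag(payload: bytes, switch_mac: bytes):
    """Ingress: prepend the tag; also report whether the payload is now oversized."""
    tagged = switch_tag(switch_mac) + payload
    return tagged, len(tagged) > MAX_PAYLOAD


def strip_tag(tagged: bytes):
    """Egress: split the tag from the original payload."""
    return tagged[:TAG_LEN], tagged[TAG_LEN:]
```

The second value returned by add_tag() marks the case discussed
above, in which the enlarged frame would exceed the normal maximum
and would need either fragmentation or a reduced maximum frame size.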
Because conflict resolution is cheap, certain other address
management tasks become simple. A switch is free to choose its
address when it joins the network however it wishes--by attempting to
re-use its last-used address, by choosing from a list of preferred
addresses, or by generating an address entirely at random. More intricate
addressing schemes may be used on managed networks if desired,
perhaps encapsulating deeper layers of hierarchy.
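One possible ordering of these choices is sketched below in Python.
The function name, the in_use set (standing in for whatever switch
identifiers the joining switch already knows about), the attempt
limit, and the assumption that a switch identifier is three octets
beginning with 02 (as in the example identifiers used in this
document) are all illustrative.

```python
import random


def choose_switch_id(in_use, last_used=None, preferred=(), rng=random, attempts=1000):
    """Return a 3-byte switch identifier not known to be in use, or None."""
    # First try the last-used address, then any preferred addresses, in order.
    for candidate in (last_used, *preferred):
        if candidate is not None and candidate not in in_use:
            return candidate
    # Otherwise fall back to random identifiers (first octet 02 is an assumption).
    for _ in range(attempts):
        candidate = bytes([0x02, rng.randrange(256), rng.randrange(256)])
        if candidate not in in_use:
            return candidate
    return None
```

Any remaining conflict with a switch that is not yet known locally
would still be caught by the conflict resolution mechanism.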
4.3. Broadcast and Multicast
Since Ethernet does still need to support arbitrary broadcast frames,
these must still be forwarded along a spanning tree in order that
they reach each host exactly once. An explicit spanning tree
protocol is not required however, as the tree can be deduced from the
routing table via reverse path forwarding in a similar manner to
Protocol-Independent Multicast (PIM) [RFC3973]. In other words,
broadcast packets are routed as if they had been sent to the all-
hosts multicast group.
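The reverse path forwarding check can be sketched as follows. This
is a simplified illustration without pruning; the function name and
the representation of the routing table as a mapping from switch
identifier to outgoing port are assumptions.

```python
def rpf_flood_ports(own_id, routing_table, ports, source_switch, in_port):
    """Return the ports on which to flood a broadcast frame ([] means drop).

    routing_table maps a switch identifier to the port used to reach it.
    """
    # Accept the frame only if it arrived on the port this switch would
    # itself use to reach the source switch (or if it originated locally).
    if source_switch != own_id and routing_table.get(source_switch) != in_port:
        return []
    return [p for p in ports if p != in_port]
```

A frame arriving on any port other than the one leading back towards
its source switch is dropped, which is what prevents loops in the
absence of an explicit spanning tree.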
More general multicast groups can be implemented using a combination
of IGMP snooping [RFC4541] as used by modern Ethernet switches, and
participation of the MOOSE switches in PIM routing.
4.4. Example
To illustrate the basic behaviour of MOOSE switches, before we go on
to describe further features, we will offer a simple example. We
will describe the steps involved in forwarding a broadcast frame
containing a query in some higher-layer IPv4-based protocol, and the
subsequent unicast frame containing the response, between two hosts A
and B via three MOOSE switches 02:11:11, 02:22:22 and 02:33:33.
4.4.1. Query
1. Host A transmits the broadcast query frame as it would on any
Ethernet network, with its own manufacturer-assigned MAC address
in the Ethernet header's source field and the broadcast address
(FF:FF:FF:FF:FF:FF) in the destination field.
2. The frame is received by switch 02:11:11, which observes the non-
MOOSE address in the frame's source field, and rewrites the
source field into a MOOSE address containing the switch
identifier and the appropriate host identifier. As this is Host
A's first frame, the switch must allocate a host identifier (in
this case 00:00:01, making Host A's complete MOOSE address
02:11:11:00:00:01).
3. The three switches broadcast the frame using reverse path
forwarding away from Host A.
4. The frame is received by Host B (and any other hosts on the
network) in its current form; no further rewriting is performed.
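The source rewriting in step 2 can be sketched in Python as follows.
The class and function names are hypothetical, and sequential
allocation of host identifiers starting from 00:00:01 is an
assumption that merely matches the example above.

```python
class IngressRewriter:
    """Maps hosts' manufacturer-assigned MAC addresses to MOOSE addresses."""

    def __init__(self, switch_id: bytes):
        self.switch_id = switch_id   # 3 bytes, e.g. 02:11:11
        self.host_ids = {}           # host MAC -> 3-byte host identifier
        self.next_host = 1           # sequential allocation is an assumption

    def moose_address(self, host_mac: bytes) -> bytes:
        """Return the MOOSE address for host_mac, allocating one on first use."""
        host_id = self.host_ids.get(host_mac)
        if host_id is None:
            host_id = self.next_host.to_bytes(3, "big")
            self.next_host += 1
            self.host_ids[host_mac] = host_id
        return self.switch_id + host_id


def fmt(addr: bytes) -> str:
    """Format an address in the colon-separated hexadecimal notation used here."""
    return ":".join(f"{b:02x}" for b in addr)
```

For switch 02:11:11, the first host seen would be given the address
formatted by fmt() as 02:11:11:00:00:01, and later frames from the
same host would reuse that address.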
4.4.2. Response
1. Host B looks up Host A's IP address in its ARP cache to determine
a suitable destination address for the response frame. Since the
rewritten query frame arrived at Host B with the source field
containing the MOOSE address 02:11:11:00:00:01, this is the
address returned by the cache lookup.
2. As above, switch 02:33:33 assigns a MOOSE address to Host B
(02:33:33:00:00:01) and rewrites the source address of the frame.
3. The frame is now routed through the network based solely on the
destination switch identifier--the host identifier is ignored for
now. The routing table is consulted for the location of switch
02:11:11 and the frame is forwarded accordingly.
4. On receiving the frame, switch 02:11:11 observes that it is
destined for a directly-connected host (02:11:11:00:00:01). It
prepares the frame for transmission along its final hop by
rewriting the destination address to Host A's manufacturer-
assigned MAC address. The source field of the frame is again
left as the MOOSE address of Host B in order that this address is
used for any further communication with Host B.
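Steps 3 and 4 amount to a forwarding decision keyed on the switch
identifier alone, with the host identifier consulted only at the
final switch. A minimal Python sketch follows; the function name,
the tuple return values and the dictionary-based tables are
assumptions made for illustration.

```python
def next_hop(own_id, routing_table, local_hosts, dst):
    """Decide how to handle a unicast frame addressed to MOOSE address dst.

    routing_table maps a 3-byte switch identifier to an outgoing port;
    local_hosts maps a 3-byte host identifier to the host's real MAC address.
    """
    switch_id, host_id = dst[:3], dst[3:]
    if switch_id != own_id:
        # Transit: only the switch identifier is examined.
        port = routing_table.get(switch_id)
        return ("forward", port) if port is not None else ("drop", None)
    # Final hop: the destination is rewritten to the host's real MAC address.
    mac = local_hosts.get(host_id)
    return ("deliver", mac) if mac is not None else ("drop", None)
```

Because transit switches key only on the switch identifier, their
tables in this sketch need one entry per switch rather than one per
host.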
4.5. Directory Service
A directory service, Enhanced Lookup (ELK), runs in conjunction with
the basic MOOSE switch described so far. ELK exists to handle ARP
and DHCP queries in a broadcast-free manner by learning mappings from
IP addresses to MOOSE addresses. The master ELK directory is served
by one or multiple systems for resilience and is reached using an
anycast MOOSE address; the layer-2 anycast feature is a convenient
side-effect of running a routing protocol. Slave copies of the
directory can be held nearer the edge of the network in order to take
load away from the masters; slaves can be reached for lookups via a
separate anycast address, and the entire herd of ELK can be kept
synchronised via the masters using a combination of multicast and
unicast.
MOOSE switches intercept ARP and DHCP packets broadcast by hosts and
convert them into anycast ELK queries to the nearest slave (for ARP)
or master (for DHCP). (DHCP handling could make use of the
protocol's existing DHCP relay mechanism.) The ELK slave answers ARP
queries directly using information in the directory; as it does so,
if the query is from a host not in the directory, it learns the
sender's IP address to MOOSE address mapping. The ELK master can
also act as a DHCP server, populating the ELK directory as it grants
IP address leases to clients.
The one case in which the ELK directory will not contain the answer
to a query is when answering an ARP request for a host that is not
configured to use DHCP and that has not yet itself sent an ARP packet
(i.e. has not yet communicated via IP). This must be dealt with by
flooding the query to every active switch port, in a manner akin to
current Ethernet switches, and caching the result in the ELK
directory. Although this is not ideal, it is necessary in order to
deal with this scenario in a compatible manner, and is unlikely to
happen frequently.
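The ARP handling described in this section, including the flooding
fallback, can be sketched as follows. The class name, the
string-valued addresses and the flood callable (a stand-in for
flooding the query to every active switch port and waiting for a
reply) are assumptions for illustration.

```python
class ElkDirectory:
    """Toy ELK directory: learns and serves IP to MOOSE address mappings."""

    def __init__(self):
        self.entries = {}  # IP address -> MOOSE address

    def handle_arp(self, sender_ip, sender_moose, target_ip, flood):
        """Answer an ARP query, learning the sender and flooding on a miss.

        flood(target_ip) is a hypothetical callback returning the target's
        MOOSE address, or None if no host answers.
        """
        # Learn the sender's mapping if it is not already in the directory.
        self.entries.setdefault(sender_ip, sender_moose)
        answer = self.entries.get(target_ip)
        if answer is None:
            answer = flood(target_ip)
            if answer is not None:
                self.entries[target_ip] = answer  # cache the flooded result
        return answer
```

Only the first query for a silent, statically configured host
triggers flooding in this sketch; later queries are answered from
the cached entry.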
4.6. Mobility
A consequence of introducing location-based hierarchy into MAC
addresses is the need to explicitly handle host mobility. In a
traditional Ethernet, hosts can migrate between switches as the
switches will learn the host's new location as soon as it sends a
frame. With MOOSE, if a host relocates to a new switch its address
changes and any ARP cache entries on other hosts pertaining to the
migrated host become incorrect; frames will continue to be sent to
the host's old location for a while. There are two strategies for
dealing with this, which can be used separately or in conjunction:
1. The previous home switch of the migrated host can forward frames
sent to the host's old address until outdated ARP cache entries
expire. This is similar to IP Mobility: the previous home switch
essentially becomes a care-of agent for the host. However,
unlike IP Mobility, it requires no host support. A handover
protocol is necessary for the old and new home switches to set up
such forwarding: on the arrival of a new host at a switch, that
switch would ask all other switches (via multicast) whether any
had seen this host before, identifying it using its manufacturer-
assigned MAC address, and would instruct such switches to
redirect frames.
2. A broadcast ARP announcement (or "gratuitous ARP") can be sent by
the new home switch to immediately update remote ARP caches and
the ELK directory with the new MOOSE address. This is the
technique used by Xen when migrating live virtual machines.
Unlike the previous approach, this works even if the previous
switch is no longer reachable, for example if the host migration
was the result of a switch failure. This is a simpler approach,
as a handover protocol is not required, but it results in additional
broadcast traffic.
Unless the frequency of host migrations is very high, the additional
load introduced by either mobility approach is expected to be
negligible.
Illustration of the two ways to handle a host A roaming onto another
switch whilst maintaining communication with another host B:
(1)               +--------+
 ##============== | Host B |   <=== ARP ===##  (2) gratuitous
 ||               +--------+               ||      ARP sent by
 ||                   |                    ||      new home switch
 ||                 +---+                  ||
 ||    .------------| X |------------.     ||
 ||   /             +---+             \    ||
 \/  |                                 |   ||
   +---+     (1) data forwarded      +---+
   | X | ==========================> | X |
   +---+     by care-of switch     ||+---+
     |                             \/  |
 + - - - +                         +--------+
 |       |- - host relocated to - >| Host A |
 + - - - +        new switch       +--------+
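The first strategy, in which the previous home switch forwards
frames for a limited period, can be sketched as a small redirect
table. The class name, the explicit now parameter and the lifetime
value (which would presumably be chosen to cover ARP cache expiry on
other hosts) are assumptions for illustration; the handover protocol
that would populate the table is not shown.

```python
class CareOfTable:
    """Redirects held by a host's previous home switch after it relocates."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # seconds to keep a redirect (assumed policy)
        self.redirects = {}        # old MOOSE address -> (new address, expiry time)

    def host_moved(self, old_addr, new_addr, now):
        """Record that the host formerly at old_addr is now at new_addr."""
        self.redirects[old_addr] = (new_addr, now + self.lifetime)

    def resolve(self, dst, now):
        """Return the address a frame for dst should actually be sent to."""
        entry = self.redirects.get(dst)
        if entry is None:
            return dst
        new_addr, expiry = entry
        if now >= expiry:
            del self.redirects[dst]   # redirect has expired
            return dst
        return new_addr
```

Once a redirect expires, frames for the old address are no longer
forwarded, on the assumption that remote ARP caches have been
refreshed by then.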
5. Interoperability Considerations
5.1. Layer-violating Protocols
In an ideal world, free from layering violations, all layer 3
protocols would operate correctly on top of MOOSE in exactly the same
way that they currently operate on top of Ethernet, with no protocol-
specific handling necessary in the switch. In reality, however,
protocols abound which use hosts' MAC addresses for purposes other
than layer 2 addressing or which place MAC addresses in the frame
payload. DHCP and ARP have already been mentioned as such protocols
which must be specifically handled by edge switches in order to
operate; luckily, the rewriting required for these important
protocols is simple.
Of particular concern are recent standards for layering on top of
Ethernet protocols which were previously used solely on dedicated
hardware interconnects, such as Fibre Channel over Ethernet (FCoE
[FC-BB-5]). In order to support FCoE and similar protocols on a
MOOSE network, each edge switch will need to be able to interpret and
rewrite individual protocols that are in use. A production MOOSE
switch would, therefore, need to be implemented such that it is
possible to add rewriting support for additional protocols after
manufacture, for example by loading an additional software or FPGA
configuration module.
Ultimately, in the general case, this problem could be addressed more
satisfactorily by extending the Ethernet standard to provide a
protocol-agnostic method for a layer 2 network to inform hosts of
their own addresses; LLDP [802.1AB] would make a good basis for this
extension. This would allow the use of network-assigned MAC
addresses for any protocol, with some rewriting performed either
partially (within the frame payload) or fully by the host itself, and
furthermore would allow higher-layer protocols to respond to changes
of the host's network-assigned address (e.g. due to mobility). Such
a mechanism could be deployed incrementally as needed, with switches
able to perform address rewriting for hosts which are not able to do
this themselves. This is, however, a very long-term solution, and
protocol-specific rewriting on the switch is likely to be required
for the foreseeable future.
FCoE in particular is unusual, however, as it already does its own
dynamic allocation of MAC addresses to devices. It is conceivable that
an extension to FCoE could be developed which allows a network-wide
dynamic address assignment scheme such as MOOSE to be exploited to
provide addresses directly to fibre channel devices.
5.2. Edge Virtual Bridging
The rise of virtualisation has caused an unanticipated proliferation
of software switches, usually in the host operating system or
hypervisor which provides network connectivity to multiple virtual
machines. Since software switches are almost always neither fast nor
centrally manageable in the same way as hardware switches, there is
ongoing work to standardise--by Cisco as Port Extension and by the
IEEE as Edge Virtual Bridging [P802.1Qbg]--a means of making these
software switches act merely as additional ports which are logically
part of a more central hardware switch. This reduces the work
required by a virtual edge switch: frames from local virtual edge
ports can be forwarded straight out via the uplink to a physical
switch without consideration, and frames from the uplink will arrive
simply tagged with a virtual edge port identifier.
(The scope of Port Extension in particular is greater than this, and
allows for physical port extenders to exist in place of switches
where a large number of ports but a small amount of processing is
required, but virtualisation is likely to be the most significant use
case.)
Edge Virtual Bridging and Port Extension require very little
adaptation to be implemented on a MOOSE switch. It is unlikely,
although too early in the standardisation process to say for certain,
that the virtual bridge will need to be MOOSE-aware. A virtual-
bridging-aware physical MOOSE switch will thus simply need to take
into account the possibility that one physical port may hide a large
number of virtual ports when allocating host identifiers, as it would
if it had an Ethernet switch connected on that port. If, however,
the virtual bridge is made MOOSE-aware, the hierarchical addressing
of MOOSE could be exploited to allow the virtual bridge to allocate
host identifiers itself, given that it is likely to be aware of the
exact number and nature of virtual edge ports. The parent MOOSE
switch would accordingly allocate an address prefix to each child
virtual bridge, and hosts' full MOOSE addresses could be formed as:
SWITCH ID : CHILD ID    : HOST ID
(parent)    (allocated    (allocated
             by parent)    by child)
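A concrete packing of this three-level address might look like the
following sketch. The field widths (3 bytes of switch identifier, 1
byte of child identifier, 2 bytes of host identifier) are purely an
assumption for illustration; this document does not fix how the
lower 24 bits would be divided between child and host.

```python
SWITCH_LEN, CHILD_LEN, HOST_LEN = 3, 1, 2   # assumed field widths in bytes


def compose(switch_id: bytes, child_id: int, host_id: int) -> bytes:
    """Build a 6-byte address from parent switch, child bridge and host parts."""
    if len(switch_id) != SWITCH_LEN:
        raise ValueError("switch identifier must be 3 bytes")
    # int.to_bytes raises OverflowError if a value does not fit its field.
    return (switch_id
            + child_id.to_bytes(CHILD_LEN, "big")
            + host_id.to_bytes(HOST_LEN, "big"))


def split(addr: bytes):
    """Split a 6-byte address into (switch_id, child_id, host_id)."""
    switch_id = addr[:SWITCH_LEN]
    child_id = int.from_bytes(addr[SWITCH_LEN:SWITCH_LEN + CHILD_LEN], "big")
    host_id = int.from_bytes(addr[SWITCH_LEN + CHILD_LEN:], "big")
    return switch_id, child_id, host_id
```

With these widths a parent switch could delegate up to 256 child
prefixes, each covering 65536 host identifiers.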
6. Prototype Implementation
We have implemented a MOOSE switch in OpenFlow and NOX, which can be
run on off-the-shelf switches. Details can be found in our paper
[Wa10].
7. Conclusions
Ethernet remains popular due to its simplicity and ubiquity, but is
showing its age and exhibits serious scalability issues in large
deployments. Previously-proposed improvements address either a few
of the problems in a simple way, or most of the problems in a highly
complex or backwards-incompatible way. We have demonstrated a
simple, novel and easily-implementable approach for significantly
boosting the scalability of Ethernet, which has a working prototype
switch firmware implementation.
8. IANA Considerations
This memo includes no request to IANA.
9. Security Considerations
Security will be considered in a later revision of this document.
10. Informative References
[802.1AB] IEEE, "802.1AB: Station and Media Access Control
Connectivity Discovery", 2009.
[802.1D] IEEE, "802.1D: Standard for Local and Metropolitan Area
Networks: Media Access Control (MAC)", 2004.
[802.1X] IEEE, "802.1X: Port Based Network Access Control", 2004.
[Cl05] Clark, C. and others, "Live Migration of Virtual
Machines", USENIX NSDI 2005.
[FC-BB-5] T11 FC-BB-5 working group, "Fibre Channel Backbone - 5",
June 2009.
[Ha01] Hadzic, I., "Hierarchical MAC Address Space in Public
Ethernet Networks", IEEE GLOBECOM vol 3, 2001.
[Ki08] Kim, C., Caesar, M., and J. Rexford, "Floodless in
SEATTLE: A Scalable Ethernet Architecture for Large
Enterprises", ACM SIGCOMM 2008.
[My04] Myers, A., Ng, E., and H. Zhang, "Rethinking the Service
Model: Scaling Ethernet to a Million Nodes", ACM SIGCOMM
Workshop on Hot Topics in Networking 2004, November 2004.
[P802.1Qbg]
Jeffree, A., Congdon, P., and J. Pelissier, "P802.1Qbg:
Edge Virtual Bridging", September 2009.
[Pe04] Perlman, R., "RBridges: Transparent Routing", Proc.
INFOCOM vol 2, 2005, March 2004.
[RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or
converting network protocol addresses to 48.bit Ethernet
address for transmission on Ethernet hardware", STD 37,
RFC 826, November 1982.
[RFC2131] Droms, R., "Dynamic Host Configuration Protocol",
RFC 2131, March 1997.
[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.
[RFC2439] Villamizar, C., Chandra, R., and R. Govindan, "BGP Route
Flap Damping", RFC 2439, November 1998.
[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
Label Switching Architecture", RFC 3031, January 2001.
[RFC3344] Perkins, C., "IP Mobility Support for IPv4", RFC 3344,
August 2002.
[RFC3973] Adams, A., Nicholas, J., and W. Siadak, "Protocol
Independent Multicast - Dense Mode (PIM-DM): Protocol
Specification (Revised)", RFC 3973, January 2005.
[RFC4541] Christensen, M., Kimball, K., and F. Solensky,
"Considerations for Internet Group Management Protocol
(IGMP) and Multicast Listener Discovery (MLD) Snooping
Switches", RFC 4541, May 2006.
[RFC5556] Touch, J. and R. Perlman, "Transparent Interconnection of
Lots of Links (TRILL): Problem and Applicability
Statement", RFC 5556, May 2009.
[Ro00] Rodeheffer, T., Thekkath, C., and D. Anderson,
"SmartBridge: A Scalable Bridge Architecture", ACM
SIGCOMM 2000.
[Wa10] Wagner-Hall, D., "A Prototype Implementation of MOOSE on a
NetFPGA/OpenFlow/NOX Stack", First European NetFPGA
Developers' Workshop Cambridge, September 2010.
Authors' Addresses
Malcolm Scott (editor)
University of Cambridge
15 JJ Thomson Ave
Cambridge, CB3 0FD
UK
Phone: +44 1223 763500
Fax: +44 1223 334678
Email: Malcolm.Scott@cl.cam.ac.uk
URI: http://www.cl.cam.ac.uk/~mas90/MOOSE/
Daniel Wagner-Hall
University of Cambridge
Email: dwh@cantab.net
Jon Crowcroft
University of Cambridge
15 JJ Thomson Ave
Cambridge, CB3 0FD
UK
Phone: +44 1223 763500
Fax: +44 1223 334678
Email: Jon.Crowcroft@cl.cam.ac.uk