Routing Area Working Group Bin Liu
Internet-Draft ZTE Inc.
Intended status: Informational Yantao Sun
Expires: June 22, 2014 Jing Cheng
Yichen Zhang
Beijing Jiaotong University
December 19, 2013
Generic Fault-avoidance Routing Protocol for Data Center Networks
draft-sl-rtgwg-far-dcn-00
Abstract
This draft proposed a generic routing method and protocol for a
regular data center network, named as the fault-avoidance routing
(FAR) protocol. FAR protocol provides a generic routing method for
all kinds of network architectures that are proposed for large-scale
cloud data centers over the past few years. FAR protocol is well
designed to fully leverage the regularity in the topology and compute
its routing table in a simple way. Fat-tree is taken as an example
architecture to illuminate how to apply FAR protocol in this draft.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 22, 2014.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
Bin Liu, et al. Expires June 22, 2014 [Page 1]
Internet-Draft FAR for DCN December 2013
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 3
2. Conventions used in this document . . . . . . . . . . . . . . 4
3. Problem Statement . . . . . . . . . . . . . . . . . . . . . 4
3.1. The Impact of Large-scale Networks on Route Calculation . 4
3.2. Network Addressing Issues . . . . . . . . . . . . . . . . 5
3.3. Big Routing Table Issues . . . . . . . . . . . . . . . . 5
3.4. Adaptivity Issues for Routing Algorithms . . . . . . . . 6
3.5. Virtual Machine Migration Issues . . . . . . . . . . . . 6
4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 6
5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 7
5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 10
6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.1. Neighbor & Link Detection Module(M1) . . . . . . . . . . 14
6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 14
6.3. Invisible Neighbor & Link Failure Inferring Module(M3) . 15
6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 15
6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 15
6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 15
6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 16
7. Application Example . . . . . . . . . . . . . . . . . . . . . 16
7.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 18
7.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 18
7.2.1. Single Link Failure . . . . . . . . . . . . . . . . . 19
7.2.2. A Group of Link Failures . . . . . . . . . . . . . . 20
7.2.3. Node Failures . . . . . . . . . . . . . . . . . . . . 20
7.3. Routing Procedure . . . . . . . . . . . . . . . . . . . . 21
8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 22
9. Reference . . . . . . . . . . . . . . . . . . . . . . . . . . 23
10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 23
11. Security Considerations . . . . . . . . . . . . . . . . . . . 23
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23
Bin Liu, et al. Expires June 22, 2014 [Page 2]
Internet-Draft FAR for DCN December 2013
1. Introduction
In recent years, with the rapid development of cloud computing
technologies, the widely deployed cloud services, such as Amazon EC2
and Google search, bring about huge challenges to data center
networking (DCN). Today!_s cloud data centers (DCs) require large-
scale networks with higher internal bandwidth and lower transfer
delay, but conventional networks cannot meet such requirements due to
limitations in their network architecture. To satisfy the
requirements of cloud computing services, many new network
architectures have been proposed for data centers, such as Fat-tree,
MatrixDCN, and BCube. These new architectures can support non-
blocking large-scale datacenter networks with more than tens of
thousands of physical servers.
Generic routing protocols such as OSPF and IS-IS cannot be scaled to
support a very large-scale datacenter network because of the problems
of network convergence and PDU overhead. For each type of
architecture, researchers designed a routing algorithm according to
the features of its topology. Because those routing algorithms have
great difference and bad compatibility with each other, it is very
difficult to develop a routing protocol for network routers
supporting multiple routing algorithms. Furthermore, the fault
tolerances in those routing algorithms are very complicated and have
low efficiency.
This draft proposed a generic routing method and protocol, fault-
avoidance routing (FAR) protocol, for DCNs. This method leverages
the regularity in the topologies of data center networks to simplify
routing learning and speed up the query of routing tables. This
routing method has good fault tolerance and can be applied to any DCN
with a regular topology.
1.1. Acronyms & Definitions
DCN - Data Center Network
FAR - Fault-avoidance Routing
BRT - Basic Routing Table
NRT - Negative Routing Table
NDT - Neighbor Devices Table
ADT - All Devices Table
LFT - Link failure Table
Bin Liu, et al. Expires June 22, 2014 [Page 3]
Internet-Draft FAR for DCN December 2013
DA - Device Announcement
LFA - Link Failure Announcement
DLR¨C Device and Link Request
IP - Internet Protocol
UDP - User Datagram Protocol
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC-2119
[RFC2119]. In this document, these words will appear with that
interpretation only when in ALL CAPS. Lower case uses of these words
are not to be interpreted as carrying RFC-2119 significance.
3. Problem Statement
The problem intended to be addressed by FAR is proposed in this
section. The expansion of Cloud data center networks has brought
great challenges to the existing routing technologies. FAR mainly
solves a series of routing problems faced by large-scale data center
network.
3.1. The Impact of Large-scale Networks on Route Calculation
In a large-scale cloud data center network, there may be thousands of
routers. Running OSPF and other routing protocols in such network
will encounter these two questions: a) Network convergence time is
too long, which will take a long time for creating and updating
routes. The response to network failures may not be timely; b) a
large number of routing protocol packets need to be sent. The
routing information consumes a lot of network bandwidth and CPU
resources, which easily leads to packet loss and makes the problem
(a) more prominent.
In order to solve these problems, the common practice is partitioning
the large network into some small areas, where the route calculation
runs independently within different areas. But nowadays the cloud
data centers typically require very large internal bandwidth. To
meet this requirement, a large number of parallel equivalent links
are deployed in the network, such as the Fat-tree network
architecture. Partitioning the network will affect the utilization
of routing algorithm on equivalent multi-path and reduce internal
network bandwidth.
Bin Liu, et al. Expires June 22, 2014 [Page 4]
Internet-Draft FAR for DCN December 2013
In the FAR routing calculation process, a Basic Routing Table (BRT)
is built on local network topology leveraging the regularity of the
network topologies.In addition to BRT, FAR also built a Negative
Routing Table (NRT). FAR gradually builds NRT in the process of
learning network link failure information, which does not require
learning the complete network fault information. FAR does not need
to wait for the completion of the network convergence in the process
of building these two tables. It avoids the problem of excessive
network convergence in the route calculation process. In addition,
FAR only needs to exchange a small amount of link change information
between routers, and consumes less network bandwidth.
3.2. Network Addressing Issues
Routers typically configure multiple network interfaces, each
connected to a subnet. OSPF and other routing algorithm require each
interface of the router must be configured with an IP address. A
large-scale data center network may contain thousands of routers.
Tens of thousands of IP addresses may be needed to configure for each
router with dozens of network interface. It will be a very complex
issue to configure and manage a large number of network interfaces.
Network maintenance is costly and error-prone. It will be difficult
to troubleshoot the problems.
In FAR, the device position information is encoded in the IP address
of the router. Each router only needs to be assigned a unique IP
address according its location, which greatly solves complex network
addressing issues in large-scale network.
3.3. Big Routing Table Issues
There are a large number of subnets in the large-scale data center
network. Routers may build a routing entry for each subnet, and
therefore the size of the routing tables on each router may be very
large. It will increase equipment cost and reduce the querying speed
of the routing table.
FAR uses two measures to reduce the size of the routing tables: a)
Builds a BRT on the regularity of the network topologies; b)
introduces a new routing table, i.e. a NRT. In this way FAR can
reduce the size of routing tables to only a few dozen routing
entries.
Bin Liu, et al. Expires June 22, 2014 [Page 5]
Internet-Draft FAR for DCN December 2013
3.4. Adaptivity Issues for Routing Algorithms
To implement efficient routing in large-scale datacenters, besides
FAR, some other routing methods are proposed for some specific
network architectures, such as Fat-tree and BCube. These routing
methods have great difference and bad compatibility with conventional
routing method and between each other on the ideas of design and
implementation, which brings big troubles to network equipment
providers to develop new routers supporting various new routing
methods.
FAR is a generic routing method. With slight modification, FAR
method can be applied to most of regular datacenter networks.
Furthermore, the structure of routing tables and querying a routing
table in FAR are the same as conventional routing method. If FAR is
adopted, the workload of developing a new type of router will be
greatly decreased.
3.5. Virtual Machine Migration Issues
Supporting VM migration is very important for cloud datacenter
networks, but to support layer-3 routing, routing methods including
OSPF and FAR require limiting VM migration within a subnet. For this
paradox, the mainstream methods still run layer-3 routing on routers
or switches, but transmit packets encapsulated by IPinIP or MACinIP
between hosts by tunnels passing through network to the destination
access switch, then extract original packet out and send it to the
destination host.
By means of the methods above, FAR can be applied to Fat-tree,
MatrixDCN or BCube networks with supporting VM migration in entire
network.
4. The FAR Framework
FAR requires that a DCN has a regular topology, and network devices,
including routers, switches, and servers, are assigned IP addresses
according to their locations in the network. In other word, we can
locate a device in the network according to its IP address.
FAR is a distributed routing method. To support FAR, each router
should be deployed a routing module that implement the FAR algorithm.
FAR algorithm is composed of three parts, i.e., link-state learning,
routing table building and routing table querying, as shown in Fig.
1.
Bin Liu, et al. Expires June 22, 2014 [Page 6]
Internet-Draft FAR for DCN December 2013
Link-state Learning |Routing Table | Routing Table
| Building
| | Querying
+--------+ /---------------\ | +--------+ | Packets
|2 Device|<--| 1 Neighbor & |----->| 5 BRT \ |
|Learning| | Link Detection| | |Building|\ | |
+--------+ \---------------/ | +--------+ \ | \|/
| | | \ |+--------------+
| | | /|| 7 Querying |
\|/ \|/ | / || Routing Table|
+-----------------------+| / |+--------------+
|3 Invisible Neighbor & || / |
|Link Failure Inferring || +--------- |
+-----------------------+|/| 6 NRT | |
| / |Building| |
\|/ /| +--------+ |
+--------------+ / | |
|4 Link Failure| / | |
| Learning | | |
+--------------+ | |
| |
Figure 1: The FAR framework
Link-state learning is responsible for a router to detect the states
of its connected links and learn the states of all the other links in
the entire network. The second part builds two routing tables, a
basic routing table (BRT) and an negative routing table (NRT),
according to the learned link states in the first part. The third
part queries the BRT and the NRT to decide a next forwarding hop for
arrived packets.
5. Data Format
5.1. Data Tables
Some data tables are maintained on each router in FAR. They are:
Neighbor Device Table (NDT): To store neighbor routers and related
links.
All Devices Table (ADT): To store all routers in the entire network.
Link Failures Table (LFT): To store all link failures in the entire
network.
Basic Routing Table (BRT): To store the candidate routes.
Bin Liu, et al. Expires June 22, 2014 [Page 7]
Internet-Draft FAR for DCN December 2013
Negative Routing Table(NRT): To store the avoiding routes.
The format of NDT
----------------------------------------------------------
Device ID | Device IP | Port ID | Link State | Update Time
----------------------------------------------------------
Device ID: The ID of a neighbor router.
Device IP: The IP address of a neighbor router.
Port ID: The port ID that a neighbor router is attached to.
Link State: The state of the link between a router and its neighbor
router. There are two states: Up and Down.
Update Time: The time of updating the entry.
The format of ADT
--------------------------------------------------
Device ID | Device IP | Type | State | Update Time
--------------------------------------------------
Device ID: The ID of a neighbor router.
Device IP: The IP address of a neighbor router.
Type: The type of a neighbor router.
State: The state of a neighbor router. There are two states: Up and
Down.
Update Time: The time of updating the entry.
The format of LFT
--------------------------------------------
No | Router 1 IP | Router 2 IP | Update Time
--------------------------------------------
No: The entry number.
Bin Liu, et al. Expires June 22, 2014 [Page 8]
Internet-Draft FAR for DCN December 2013
Router 1 IP: The IP address of one router that a failed link connects
to.
Router 2 IP: THe IP address of another router that a failed link
connects to.
Update Time: The time of updating the entry.
The format of BRT
-------------------------------------------------------
Destination | Mask | Next Hop | Interface | Update Time
-------------------------------------------------------
Destination: A destination network
Mask: The subnet mask of a destination network.
Next Hop: The IP address of a next hop for a destination.
Interface: The interface related to a next hop.
Update Time: The time of updating the entry.
The format of NRT
-------------------------------------------------------------------
Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp
-------------------------------------------------------------------
Destination: A destination network.
Mask: The subnet mask of a destination network.
Next Hop: The IP address of a next hop that should be avoided for a
destination.
Interface: The interface related to a next hop that should be
avoided.
Failed Link No: A group of failed link numbers divided by !o/!+/-,
for example 1/2/3.
Timestamp: The time of updating the entry.
Bin Liu, et al. Expires June 22, 2014 [Page 9]
Internet-Draft FAR for DCN December 2013
5.2. Messages
Some protocol messages are exchanged between routers in FAR.
Hello Message: Be exchanged between neighbor routers to learn
adjacency.
Device Announcement (DA): Synchronize the knowledge of routers
between routers.
Link Failure Announcement (LFA): Synchronize link failures between
routers.
Device and Link Request (DLR): When a router starts, it requests the
knowledge of routers and links from its neighbors by a DLR message.
A FAR Message is directly encapsulated in an IP packet. The protocol
field of IP header indicates an IP packet is an FAR message. The
protocol of IP for FAR should be assigned by IANA.
The four types of FAR messages have same format of packet header,
called FAR header.
|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
| Version |Message Type| Message Length |
+-----------+------------+------------------------+
| Checksum | AuType |
+------------------------+------------------------+
| Authentication |
+-------------------------------------------------+
| Authentication |
+-------------------------------------------------+
| Timestamp |
+-------------------------------------------------+
Figure 2: The format of FAR header
Version: FAR version
Message Type: the type of FAR message.
Packet Length: the packet length of the total FAR message.
Checksum: the checksum of an entire FAR message.
Bin Liu, et al. Expires June 22, 2014 [Page 10]
Internet-Draft FAR for DCN December 2013
AuType:Authentication type. 0: no authentication, 1: Plaintext
Authentication, 2: MD5 Authentication.
Authentication: Authentication information. 0: undefined, 1: Key, 2:
key ID, MD5 data length and packet number. MD5 data is appended to
the backend of the packet.
AuType and Authentication can refer to the definition of OSPF packet.
|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
| Version |Message Type| Message Length |
+-----------+------------+------------------------+
| Checksum | AuType |
+------------------------+------------------------+
| Authentication |
+-------------------------------------------------+
| Authentication |
+-------------------------------------------------+
| Timestamp |
+-------------------------------------------------+
| Router IP |
+------------------------+------------------------+
| HelloInterval | HelloDeadInterval |
+------------------------+------------------------+
| Neighbor Router IP |
+-------------------------------------------------+
| ... |
+-------------------------------------------------+
Figure 3: The format of Hello messages
For Hello messages, the Message Type in FAR header is set to
1.Besides FAR header, a Hello message requires the following fields:
Router IP: the router IP address.
HelloInterval: the interval of sending Hello messages to neighbor
routers.
RouterDeadInterval: The interval to set a neighbor router dead. If
in the interval time, a router doesn!_t receive a Hello message from
its neighbor router, the neighbor router is treated as dead.
Bin Liu, et al. Expires June 22, 2014 [Page 11]
Internet-Draft FAR for DCN December 2013
Neighbor Router IP: the IP address of a neighbor router. All the
neighbor router's addresses should be included in a Hello message.
|<----1---->| <----1---->|<----------2----------->|
+-----------+------------+------------------------+
| Version |Message Type| Message Length |
+-----------+------------+------------------------+
| Checksum | AuType |
+------------------------+------------------------+
| Authentication |
+-------------------------------------------------+
| Authentication |
+-------------------------------------------------+
| Timestamp |
+------------------------+------------------------+
| Router1 IP |
+-------------------------------------------------+
| ... |
+-------------------------------------------------+
Figure 4: The format of DA messages
For DA messages, the Message Type in FAR header is set to 2. Besides
FAR header, a DA message includes IP addresses of all the announced
routers.
Bin Liu, et al. Expires June 22, 2014 [Page 12]
Internet-Draft FAR for DCN December 2013
|<----1---->| <----1---->|<----------2----------->|
+-----------+------------+------------------------+
| Version |Message Type| Message Length |
+-----------+------------+------------------------+
| Checksum | AuType |
+------------------------+------------------------+
| Authentication |
+-------------------------------------------------+
| Authentication |
+-------------------------------------------------+
| Timestamp |
+------------------------+------------------------+
| Left IP |
+-------------------------------------------------+
| Right IP |
+------------------------+------------------------+
| State |
+-------------------------------------------------+
| ... |
+-------------------------------------------------+
Figure 5: The format of LFA messages
For LFA messages, the Message Type in FAR header is set to 3.
Besides FAR header, a LFA message includes all the announced link
failures.
Left IP: the IP address of the left endpoint router of a link.
Right IP: the IP address of the right endpoint router of a link.
State: link state. 0: Up, 1: down
Bin Liu, et al. Expires June 22, 2014 [Page 13]
Internet-Draft FAR for DCN December 2013
|<----1---->| <----1---->|<----------2----------->|
+-----------+------------+------------------------+
| Version |Message Type| Message Length |
+-----------+------------+------------------------+
| Checksum | AuType |
+------------------------+------------------------+
| Authentication |
+-------------------------------------------------+
| Authentication |
+-------------------------------------------------+
| Timestamp |
+-------------------------------------------------+
Figure 6: The format of DLR Messages
For DLR messages, the Message Type in FAR header is set to 1.Except
for FAR header, DLR has no additional fields.
6. FAR Modules
6.1. Neighbor & Link Detection Module(M1)
M1 is responsible for sending and receiving Hello messages, and
detecting directly-connected links and neighbor routers. Each Hello
message is encapsulated in a UDP packet. M1 sends Hello messages
periodically to all the active router ports and receives Hello
messages from its neighbor routers. M1 detects neighbor routers and
directly-connected links according to received Hello Messages and
stores these neighbors and links into a Neighbor Devices Table (NDT).
Additionally, M1 also stores the neighbor routers into an All Devices
Table (ADT).
6.2. Device Learning Module(M2)
M2 is responsible for sending, receiving, and forwarding device
announcement (DA) messages, learning all the routers in the whole
network, and deducing faulted routers. When a router starts, it
sends a DA message announcing itself to its neighbors and a DLR
message requesting the knowledge of routers and links from its
neighbors. If M2 module of a router receives a DA message, it checks
whether the router encapsulated in the message is in an ADT. If the
router is not in the ADT, M2 puts this router into the ADT and
forwards this DA message to all the active ports except for the
incoming one, otherwise, M2 discards this message directly. If M2
module of a router receives a DLR message, it replies a DA message
that encapsulates all the learned routers.
Bin Liu, et al. Expires June 22, 2014 [Page 14]
Internet-Draft FAR for DCN December 2013
6.3. Invisible Neighbor & Link Failure Inferring Module(M3)
M3 is responsible for inferring invisible neighbors of the current
router by means of the ADT. If the link between a router A and its
neighbor B breaks, which results in that M1 module of A cannot detect
the existence of B, then B is an invisible neighbor of A. Because a
device!_s location has been coded into its IP address, it can be
judged whether two routers are adjacent, according to their IP
addresses. Based on this idea, M3 infers all the invisible neighbors
of the current router and the related link failures. The result
stores into a NDT. Moreover, link failures also are added into a
link-failure table (LFT). LFT stores all the failed links in the
entire network.
6.4. Link Failure Learning Module(M4)
M4 is responsible for sending, receiving and forwarding link failure
announcement (LFA) and learning all the link failures in the whole
network. M4 broadcasts each newly inferred link failure to all the
routers in the network. Each link failure is encapsulated in a LFA
message and one link failure is broadcasted only once. If a router
receives a DLR request from its neighbor, it will reply a LFA message
that encapsulates all the learned link failures through M4 module.
If M4 receives a LFA message, it checks whether the link failure
encapsulated in the message is in a LFT. If the link failure is not
in the LFT, M4 puts this link failure into the LFT and forwards this
LFA message to all the active ports except for the incoming one,
otherwise, M4 discards this message directly.
6.5. BRT Building Module(M5)
M5 is responsible for building a BRT for the current router. By
leveraging the regularity in topology, M5 can calculate the routing
paths for any destination without the knowledge of the topology of
whole network, and then build the BRT based on a NDT. Since the IP
addresses of network devices are continuous, M5 only creates one
route entry for a group of destination addresses that have the same
network prefix by means of route aggregation technology. Usually,
the size of a BRT is very small. The detail of how to build a BRT is
described in section 5.
6.6. NRT Building Module(M6)
M6 is responsible for building a NRT for the current router. Because
M5 builds a BRT without considering link failures in network, the
routing paths calculated by the BRT cannot avoid failed links. To
solve this problem, a NRT is used to exclude the routing paths that
include some failed links from the paths calculated by a BRT. M6
Bin Liu, et al. Expires June 22, 2014 [Page 15]
Internet-Draft FAR for DCN December 2013
calculate the routing paths that include failed links and stored them
into the NRT. The detail of how to build a NRT is described in
section 5.
6.7. Routing Table Lookup(M7)
M7 is responsible for querying routing tables and deciding the next
hop for the forwarding packets. Firstly, M7 takes the destination
address of a forwarding packet as a criterion to look up route
entries in a BRT based on longest prefix match. All of the matched
entries are composed of a candidate hops list. Secondly, M7 look up
negative route entries in a NRT taking the destination address of the
forwarding packet as criteria. In this lookup, Not limited to
longest prefix match, any entry that matches the criteria would be
selected and composed of an avoiding hops list. Thirdly, the
candidate hops minus avoiding hops are composed of an applicable hops
list. At last, M7 sends the forwarding packet to any one of the
applicable hops. If the applicable list is empty, the forwarding
packet will be dropped.
7. Application Example
In this section, we take a Fat-tree network as an example to describe
how to apply FAR routing. Since M1 to M4 are very simple, we only
introduce how the modules M5, M6 & M7 work in a Fat-tree network.
A Fat-tree network is composed of 4 layers. The top layer is core
layer, and the other layers are aggregation layer, edge layer and
server layer. There are k pods, each one containing two layers of k/
2 switches. Each k-port switch in the edge layer is directly
connected to k/2 hosts. The remaining k/2 ports are connected to k/2
of the k-port switches in the aggregation layer. There are (k/2)2
k-port core switches. Each core switch has one port connected to
each of k pods.
Bin Liu, et al. Expires June 22, 2014 [Page 16]
Internet-Draft FAR for DCN December 2013
10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2
+--+ +--+ +--+ +--+
| | | | | | | |
+--+,_ +--+', +--+ ,,+--+
|`,`',`-., / | \ `. .'` - .-'``.` /|
| . `', `'., / | ' ' ,-`,' |`. .` ' |
| \ `', `-., .` / | `, .` ,' |
| `, `'. `'-,_ .'` ,' | ', / |
| . `'. `-.,-` / | \
| \ `'., .` `'., ` | `.
| `, .'`, ,`'., | ',
| . ,-` '., - `'-,_ | `.
| \ .` ,'., `|., .
| .'` / `-, | `'., `.
10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ',
+--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
| | | | | | | | | | | | | | | |
+--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
| \/ | | \/ | | \/ | | \/ |
+--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+
| | | | | | | | | | | | | | | |
+--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
/| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | |
/ | | \ / | | \ / | | \ / | | |
/ | | \ / | | \ / | | \ / | | |
/ | | \ / | | \ / | | \ / | | |
++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++
++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++
10.1.2.2 10.3.1.3
Figure 7: Fat-tree
Aggregation switches are given addresses of the form 10.pod.0.switch,
where pod denotes the pod number, and switch denotes the position of
that switch in the upper pod (in [1, k/2]). Edge switches are given
addresses of the form 10.pod.switch.1, where pod denotes the pod
number, and switch denotes the position of that switch in the lower
pod (in [1, k/2]). The core switches are given addresses of the form
10.0.j.i, where j and i denote that switch!_s coordinates in the (k/
2)2 core switch grid (each in[1, (k/2)], starting from top-left).
The address of a host follows the pod switch it is connected to;
hosts have addresses of the form: 10.pod.switch.ID, where ID is the
host's position in that subnet (in [2, k/2+1], starting from left to
right).
Bin Liu, et al. Expires June 22, 2014 [Page 17]
Internet-Draft FAR for DCN December 2013
7.1. BRT Building Procedure
By leveraging the topology's regularity, every switch clearly knows
how it forwards a packet. When a packet arrives at an edge switch,
if the destination of the packet lies in the same subnet with the
switch, then the switch directly forwards the packet to the
destination server through layer-2 switching; otherwise, the switch
forwards the packet to any of aggregation switches in the same pod.
When a packet arrives at an aggregation switch, if the destination of
the packet lies in the same pod, then the switch forwards the packet
to the corresponding edge switch, otherwise, the switch forwards the
packet to any of core switches that it is connected to. If a core
switch receives a packet, it forwards the packet to the corresponding
aggregation switch that lies in the destination pod.
The forwarding policy discussed above is easily expressed through a
BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the
following entries:
Destination/Mask Next hop
10.0.0.0/255.0.0.0 10.1.0.1
10.0.0.0/255.0.0.0 10.1.0.2
The BRT of an aggregation switch, such as 10.1.0.1, is composed of
the following entries:
Destination/Mask Next hop
10.1.1.0/255 255.255.0 10.1.1.1
10.1.2.0/255.255.255.0 10.1.2.1
10.0.0.0/255.0.0.0 10.0.1.1
10.0.0.0/255.0.0.0 10.0.1.2
The BRT of acore switch, such as 10.0.1.1, is composed of the
following entries:
Destination/Mask Next hop
10.1.0.0/255 255.0.0 10.1.0.1
10.2.0.0/255.255.0.0 10.2.0.1
10.3.0.0/255.255.0.0 10.3.0.1
10.4.0.0/255.255.0.0 10.4.0.1
7.2. NRT Building Procedure
The route entries in a NRT are related with link and node failures.
We conclude all kinds of cases into 3 catalogs.
Bin Liu, et al. Expires June 22, 2014 [Page 18]
Internet-Draft FAR for DCN December 2013
7.2.1. Single Link Failure
In Fat-tree, Links can be classified as 3 types by their locations:
1) servers to edge switches; 2) edge to aggregation switches; 3)
aggregation to core switches. Link failures between servers to edge
switches only affect the communication of the corresponding servers
and don't affect the routing tables of any switch, so we only discuss
the second and third type of links failures.
Edge to Aggregation Switches
Suppose that the link between an edge switch, such as 10.1.2.1 (A),
and an aggregation switch, such as 10.1.0.1(B),fails. This link
failure may affect 3 types of communications.
o Sources lie in the same subnet with A, and destinations do not. In
this case, the link failure will only affect the routing tables of A.
As this link is attached to A directly, A only needs to delete the
route entries whose next hop is B in its BRT and add no entries to
its NRT when A's M6 module detect the link failure.
o Destinations lie in the same subnet with A, and sources lie in
another subnet of the same pod. In this case, the link failure will
affect the routing tables of all the edge switches in the same pod
except for A. When an edge switch, such as 10.1.1.1, learns the link
failure, it will add a route entry to its NRT:
Destination/Mask Next hop
10.1.2.0/255.255.255.0 10.1.0.1
o Destinations lie in the same subnet with A, sources lie in another
pod. In this case, the link failure will affect the routing tables
of all the edge switches in the other pods. When an edge switch in
one other pod, such as 10.3.1.1, learns the link failure, because all
the routings that pass through 10.3.0.1 to A will certainly pass
through the link between A and B, 10.3.1.1 need add a route entry to
its NRT:
Destination/Mask Next hop
10.1.2.0/255.255.255.0 10.3.0.1
Aggregation to Core Switches
Suppose that the link between an aggregation switch, such as 10.1.0.1
(A), and a core switch, such as 10.0.1.2(B), fails. This link
failure may affect 2 types of communications.
Bin Liu, et al. Expires June 22, 2014 [Page 19]
Internet-Draft FAR for DCN December 2013
o Sources lie in the same pod (pod 1) with A, and destinations lie in
the other pods. In this case, the link failure will only affect the
routing tables of A. As this link is attached to A directly, A only
need to delete the route entries whose next hop is B in its BRT and
add no entries to its NRT when A!_s M6 module detect the link
failure.
o Destinations lie in the same pod (pod 1) with A, and sources lie in
another pod. In this case, the link failure will affect the routing
tables of all the aggregation switches in other pods except for pod
1. When an aggregation switch in one other pod, such as 10.3.0.1,
learns the link failure, because all the routings that pass through
10.0.1.2 to the pod 1 where A lies will certainly pass through the
link between A and B, 10.3.0.1 need add a route entry to its NRT:
Destination/Mask Next hop
10.1.0.0/255.255.0.0 10.0.1.2
7.2.2. A Group of Link Failures
If all the uplinks of an aggregation switch fail, then this switch
cannot forward packets, which will affect the routing of every edge
switches. Suppose that all the uplinks of the node A (10.1.0.1)
fail, it will affect two types of communications.
o Sources lie in the same pod (pod 1) with A, and destinations lie in
the other pods. In this case, the link failures will affect the
routing of the edge switches in the Pod of A. To avoid the node A,
each edge switch should remove the route entry !o10.0.0.0/
255.0.0.0GBP[not]10.1.0.1!+/- in which the next hop is the node A.
o Destinations lie in the same pod (pod 1) with A, and sources lie in
other pods. In this case, the link failures will affect the routing
of edge switches in other pods. For example, if the edge switch
10.3.1.1 communicates with some node in the pod of A, it should avoid
the node 10.3.0.1, because any communication through 10.3.0.1 to the
pod of A will pass through the node A. So a route entry should be
added to 10.3.1.1:
Destination/Mask Next hop
10.1.0.0/255.255.0.0 10.3.0.1
7.2.3. Node Failures
Bin Liu, et al. Expires June 22, 2014 [Page 20]
Internet-Draft FAR for DCN December 2013
At last, we discuss the effect of node failures to a NRT. There are
3 types of node failures: the failure of edge, aggregation and core
switches.
o An edge switch fails. The failure doesn't affect the routing table
of any switch.
o A core switch fails. Only when all the core switches connected to
the same aggregation switch fail, they will affect the routing of
other switches. This case is equal to the case that all the uplinks
of an aggregation switch fail, so the process of link failures can
cover it.
o An aggregation switch fails. This case is similar to the case that
all the uplinks of an aggregation switch fail. It affects the
routing of edge switches in other pods, but doesn't affect the
routing of edge switches in pod of the failed switch. The process of
this failure is same to the second case in section 6.2.2.
7.3. Routing Procedure
FAR decides a routing by looking up its BRT and NRT. We illuminate
the routing procedure by an example. In this example, we suppose
that the link between 10.3.1.1 and 10.3.0.2 and the link between
10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing
procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
(destination).
Step 1: The source 10.3.1.3 sends packets to its default router
10.3.1.1
Step 2: The routing of 10.3.1.1.
1) Calculate candidate hops
10.3.1.1 looks up its BRT and gets the following matched entriesGBPo
Destination/Mask Next hop
10.0.0.0/255.0.0.0 10.3.0.1
So the candidate hops = {10.3.0.1}
2) Calculate avoiding hops
Its NRT is empty, so the set of avoiding hop is empty too.
3) Calculate applicable hops
Bin Liu, et al. Expires June 22, 2014 [Page 21]
Internet-Draft FAR for DCN December 2013
The applicable hops are candidate hops minus avoiding hops, so:
The applicable hops = {10.3.0.1}
4) Forward packets to 10.3.0.1
Step 3: The routing of 10.3.0.1
1) Calculate candidate hops.
10.3. 0.1 looks up its BRT and gets the following matched entriesGBPo
Destination/Mask Next hop
10.1.0.0/255.255.0.0 10.0.1.1
10.1.0.0/255.255.0.0 10.0.1.2
So the candidate hops = {10.0.1.1, 10.0.1.2}
2) Calculate avoiding hops
Destination/Mask Next hop
10.1.0.0/255.255.0.0 10.0.1.2
So the avoiding hops = {10.0.1.2}
3) Calculate applicable hops
The applicable hops are candidate hops minus avoiding hops, so:
The applicable hops = {10.0.1.1}
4) Forward packets to 10.0.1.1
Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
routing tables.
Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
routing tables.
Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by
layer-2 switching.
8. Conclusion
Bin Liu, et al. Expires June 22, 2014 [Page 22]
Internet-Draft FAR for DCN December 2013
This draft introduces FAR protocol, a generic routing method and
protocol, for data centers that have a regular topology. It uses two
routing tables, a BRT and a NRT, to store the normal routing paths
and avoiding routing paths, respectively, which makes FAR very simple
and efficient. The sizes of two tables are very small. Usually, a
BRT has only several tens of entries and a NRT has only several or
about a dozen entries.
9. Reference
[FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A
Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM
2008.
10. Acknowledgments
This document is supported by ZTE Enterprise-University-Research
Joint Project.
11. Security Considerations
Authors' Addresses
Bin Liu
ZTE Inc.
ZTE Plaza, No.19 East Huayuan Road,Haidian District
Beijing 100191
China
Email: liu.bin21@zte.com.cn
Yantao Sun
Beijing Jiaotong University
No.3 Shang Yuan Cun, Hai Dian District
Beijing 100044
China
Email: ytsun@bjtu.edu.cn
Jing Cheng
Beijing Jiaotong University
No.3 Shang Yuan Cun, Hai Dian District
Beijing 100044
China
Email: yourney.j@gmail.com
Bin Liu, et al. Expires June 22, 2014 [Page 23]
Internet-Draft FAR for DCN December 2013
Yichen Zhang
Beijing Jiaotong University
No.3 Shang Yuan Cun, Hai Dian District
Beijing 100044
China
Email: snowfall_dan@sina.com
Bin Liu, et al. Expires June 22, 2014 [Page 24]