A Reliability Analysis of Datacenter Topologies
Abstract: The network infrastructure plays an important role for datacenter applications. Therefore, datacenter network architectures are designed with three main goals: bandwidth, latency, and reliability. This work focuses on the last goal and provides a comparative analysis of the topologies of prevalent datacenter architectures. These architectures use either a network based only on switches or a hybrid scheme of servers and switches to perform packet forwarding. We analyze failures of the main networking elements (link, server, and switch) to evaluate the tradeoffs of the different datacenter topologies. Considering only the network topology, our analysis provides a baseline study for the choice or design of a datacenter network with regard to reliability. Our results show that, as the number of failures increases, the considered hybrid topologies can substantially increase the path length, whereas servers on the switch-only topology tend to disconnect more quickly from the main network.

I. INTRODUCTION

Currently, the time needed to complete an Internet transaction is becoming a competitive factor among companies offering online services, such as web search, home banking, and shopping. The typical solution to reduce the response time of these services is distributed processing (e.g., MapReduce). This strategy is more efficient if more servers in the datacenter execute the parts of a single task. As a consequence, the number of servers in datacenters is growing steadily. Google, for instance, has a computing infrastructure of almost 1 million servers spread in datacenters around the world [1].

Distributed processing incurs communication between servers, which adds latency to the completion of a distributed task. Moreover, heavy communication between servers causes high link utilization, which may lead to buffer congestion in switches, further adding to latency. As data transfer is a potential slowdown for datacenter operations, distributed programming models use locality properties to choose the most appropriate server to store data. Ideally, one would limit data transfers to servers in a single rack. However, choosing the best server to store a specific piece of data is a difficult task, especially considering the ever-increasing number of servers in datacenter networks. Thus, significant effort has been devoted to the development of new datacenter architectures that improve networking performance while keeping the economic aspect in mind. One of the earliest datacenter architectures is Fat-Tree [2], which focuses on the utilization of off-the-shelf switches to avoid high costs. BCube [3] and DCell [4] are examples of architectures that use a combination of servers and switches to perform packet forwarding. The server-based forwarding allows those architectures to use switches with lower port density than Fat-Tree. Each architecture uses specific topologies and routing protocols.

For datacenters, networking performance is a function of three main metrics: bandwidth, latency, and reliability. Despite the high available bandwidth achieved by these architectures, datacenters are composed of tens of thousands of servers, which are prone to failures, as are the networking elements [5]. On the other hand, the datacenter must remain operational and present minimal impact to the user experience. To date, few studies compare the existent architectures considering failures of each one of the main networking elements, namely servers, switches, and physical links. Popa et al. [6] compare the different architectures in terms of cost and energy consumption, considering similar configurations to yield compatible performance. By analyzing the network capacity and maximum latency, they conclude that hybrid topologies (e.g., BCube) are cheaper than switch-only topologies (e.g., Fat-Tree). However, they foresee that switch-only topologies will become more cost-effective with the appearance of very low-cost switches in the near future. Guo et al. [3] address the reliability of the different topologies for specific traffic patterns and protocols, concluding that BCube is the most reliable.

In this work, we analyze the network topologies of three of the main existent datacenter architectures (Fat-Tree, BCube, and DCell) in terms of reliability, adding to the cost and bandwidth comparisons found in the literature. The present reliability analysis does not depend on the applications, routing algorithms, or traffic engineering strategies used by each architecture. Instead, it provides a baseline study by using metrics to quantify the reliability of the datacenter network. These metrics can be combined with cost and available bandwidth metrics to help the datacenter designer. For example, the framework proposed by Curtis et al. [7] designs a datacenter topology by optimizing metrics such as available bandwidth and latency; it could be improved by using the definition of reliability evaluated in this work. In our analysis, we model the datacenter topology as a graph, with servers and switches as nodes and network links connecting them. Using this model, we evaluate the impact of the failure of each networking component (server, switch, and link) on the entire network.

The results of our analysis show that, for all considered topologies, the network degrades through the removal of connected components with a relatively small number of servers as the number of failures increases. We also show that hybrid topologies such as BCube and DCell can substantially increase the average path length as the failures increase, whereas in Fat-Tree servers tend to disconnect more quickly.

This paper is organized as follows. Section II details the topology used in each architecture. Section III describes the methodology and the metrics used. Section IV shows the obtained results and Section V concludes this work.

Fig. 1. Fat-Tree topology.
Fig. 2. BCube topology.
II. DATACENTER NETWORK TOPOLOGIES

In this work, we consider three representative datacenter topologies found in the current literature: Fat-Tree, BCube, and DCell. Their main characteristics are explained below.

A. Fat-Tree

We refer to Fat-Tree as the topology proposed by Al-Fares et al. in [2]. The authors use the concept of a fat-tree, which is a special case of a Clos network, to define a datacenter topology organized as a k-ary tree. VL2 [8] also uses a Clos network and is not considered in our analysis due to its similarity to Fat-Tree. As shown in Figure 1, the topology has two sets of elements, the core and the pods. The first set is composed of switches that interconnect the pods. Each port of each switch in the core is connected to a different pod. A pod is composed of aggregation switches, edge switches, and datacenter servers. Aggregation switches connect the pod to the core by linking edge and core switches. Finally, each edge switch is connected to a different set of servers.

All switches are identical and have k ports. Consequently, the network has k pods, and each pod has k/2 aggregation switches and k/2 edge switches. In a single pod, each aggregation switch is connected to all edge switches, which are individually connected to k/2 different servers. Thus, the Fat-Tree topology can have (k/2) x (k/2) x k = k^3/4 servers. Figure 1 shows a Fat-Tree for k = 4. Note, as an example, that the server with index 0 in Pod 0 (S0.0) communicates with Server 1 (S1.0) in the same pod, and both of them are connected to the same edge switch. On the other hand, Server 3 of Pod 0 (S3.0) communicates with a server in a different pod, S3.1, requiring the use of core switches. Fat-Tree allows all servers to communicate at the same time using the total capacity of their network interfaces. In this topology, all networking elements are identical, avoiding expensive switches with high port density at the higher topology levels.

B. BCube

The BCube [3] topology was proposed for Modular Data Centers (MDCs), which are datacenters built inside shipping containers to allow simpler installation and physical migration procedures compared with regular datacenters. Datacenter migration is useful for energy saving, because it becomes easier to move the datacenter to regions with lower energy costs, and for strategic positioning, allowing placement close to regions with high service demands. As MDCs are built in sealed containers with a high equipment density, they need to be highly reliable. Furthermore, the performance of these networks has to degrade slowly as equipment failures occur. Also, as in the case of Fat-Tree, the network must have a high transmission capacity and a low cost. To this end, the BCube topology has layers of COTS (commodity off-the-shelf) mini-switches and servers, which participate in packet forwarding. These servers thus have several network interfaces, usually no more than five [3].

The main module of a BCube topology is BCube_0, which consists of a single switch with n ports connected to n servers. A BCube_1, in turn, is constructed using n BCube_0 networks and n switches. Each switch is connected to all BCube_0 networks through its connection with one server of each BCube_0. Figure 2 shows a BCube_1 network. More generally, a BCube_k (k >= 1) network consists of n BCube_{k-1} networks and n^k switches of n ports. To build a BCube_k, the n BCube_{k-1} networks are numbered from 0 to n - 1 and the servers of each one from 0 to n^k - 1. Next, the level-k port of the i-th server (i in [0, n^k - 1]) of the j-th BCube_{k-1} (j in [0, n - 1]) is connected to the j-th port of the i-th level-k switch. A BCube_k network can therefore have n^(k+1) servers.

Figure 2 shows that, in BCube_0 0, Server 0 communicates through a switch with Server 1. On the other hand, Server 1 of BCube_0 1 uses its local switch to forward its packets to Server 2, which can forward the packet to the destination, in this case Server 2 of the BCube_0 2 network. However, the communication between different BCubes at the same level may occur using only a higher-level switch, as in the case of Server 3 of BCube_0 2 with Server 3 of BCube_0 3. In a nutshell, BCube servers may or may not participate in packet forwarding, depending on the communicating pair.

C. DCell

Similarly to BCube, DCell is defined recursively and uses servers and mini-switches for packet forwarding. The main module of this topology is DCell_0 which, as BCube_0, is composed of a switch connected to n servers. A DCell_1 is built by connecting n + 1 DCell_0 networks, and each DCell_0 is connected to all other DCell_0 cells by one link from one of its servers to a server in another DCell_0. A DCell_1 network is illustrated in Figure 3. Note that communication inside a cell is performed locally using a switch, as shown in the communication between Server 2 and Server 3 of DCell_0 0. The communication between servers of different cells is performed directly, as the one between Server 1 in DCell_0 2
III. METHODOLOGY

We evaluate the reliability in terms of the network size and the variation of the path length when the datacenter network is subject to failures, with both quantities measured on the topology that remains after a failure event. The first metric identifies the number of servers that remain interconnected, whereas the second one is related to the number of hops between the network servers.
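As a rough illustration of these two views, the sketch below computes, for a possibly disconnected topology graph, the fraction of servers left in the largest connected component, the average number of servers in the remaining components, and the server-to-server diameter and average path length. The extract does not include the formal definitions used in Section IV (RSmax, Savg, DS, PS and Equations 1 and 2), so these functions are an assumption-based approximation of them rather than the authors' code; is_server relies on the node labels of the construction sketch above.

```python
import networkx as nx


def is_server(node):
    # Assumes the node-labeling convention of the construction sketch in Section II.
    return node[0] == "server"


def network_size_metrics(g, reference_servers):
    """Network-size view: fraction of the reference server count lying in the
    largest connected component, and average number of servers in the other
    components (None when no isolated components exist)."""
    sizes = sorted(
        (sum(1 for v in comp if is_server(v)) for comp in nx.connected_components(g)),
        reverse=True,
    )
    largest_fraction = sizes[0] / reference_servers
    others = [s for s in sizes[1:] if s > 0]  # ignore switch-only fragments
    avg_other = sum(others) / len(others) if others else None
    return largest_fraction, avg_other


def path_length_metrics(g):
    """Path-length view: diameter and average path length in hops, measured
    between servers only and restricted to the largest connected component."""
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    servers = [v for v in giant if is_server(v)]
    if len(servers) < 2:
        return 0, 0.0
    hops = []
    for src in servers:
        dist = nx.single_source_shortest_path_length(giant, src)
        hops.extend(dist[dst] for dst in servers if dst != src)
    return max(hops), sum(hops) / len(hops)
```

Restricting the path-length view to the largest component mirrors the fact that disconnected servers have no path at all and are accounted for by the network-size metrics instead.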
TABLE I

Topology   Switch ports   Server ports   Servers   Switches   Diameter   Avg. path length
Fat-Tree        24              1          3456       720         6            5.9
BCube2          58              2          3364       116         4            3.9
BCube3          15              3          3375       670         6            5.6
BCube5           5              5          3125      3125        10            8.0
DCell2          58              2          3422        59         5            4.9
DCell3           7              3          3192       456        11            8.2
IV. RESULTS

Our results are obtained using the topologies detailed in Section II, with some parameter variations to achieve a number of servers close to 3,400 for all topologies, for a fair comparison. We observed that this number is sufficiently large to disclose the differences between the topologies and, as these topologies have a regular structure, our results can be extended to a higher number of servers. Table I shows the name associated with each considered topology and its respective parameters: the number of switch ports and the number of server ports. Table I also gives the following properties of each topology: the number of servers, the number of switches, the network diameter, and the average path length between all pairs of servers. Using each topology of Table I, we average the outcomes obtained by repeating the methodology of Section III several times, using a confidence level of 95%.
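The repetition-and-averaging step can be pictured as the Monte Carlo loop below, which removes a random fraction of elements of one type, recomputes the largest-component metric, and aggregates independent runs into a mean and a 95% confidence margin. It reuses the hypothetical helpers from the earlier sketches; the sampling strategy and the interval construction are assumptions, not the authors' procedure.

```python
import random
from statistics import mean, stdev

# build_fat_tree and network_size_metrics are the illustrative helpers defined
# in the earlier sketches; none of this is the code used by the authors.


def one_run(graph, failure_ratio, element_kind, rng):
    """Remove a random fraction of links, switches, or servers and return the
    fraction of servers left in the largest connected component."""
    g = graph.copy()
    total_servers = sum(1 for v in graph if v[0] == "server")
    if element_kind == "link":
        pool = list(g.edges())
        g.remove_edges_from(rng.sample(pool, int(failure_ratio * len(pool))))
    elif element_kind == "switch":
        pool = [v for v in g if v[0] != "server"]  # every non-server node is a switch
        g.remove_nodes_from(rng.sample(pool, int(failure_ratio * len(pool))))
    else:  # "server"; Section IV-C instead normalizes by the servers that remain
        pool = [v for v in g if v[0] == "server"]
        g.remove_nodes_from(rng.sample(pool, int(failure_ratio * len(pool))))
    return network_size_metrics(g, total_servers)[0]


def averaged_metric(graph, failure_ratio, element_kind, runs=10, seed=0):
    """Mean value and a 95% confidence margin over independent runs (normal
    approximation; the paper does not state how its intervals are built)."""
    rng = random.Random(seed)
    samples = [one_run(graph, failure_ratio, element_kind, rng) for _ in range(runs)]
    return mean(samples), 1.96 * stdev(samples) / runs ** 0.5


# Example: mean largest-component fraction of a small Fat-Tree at a 30% link failure ratio.
# value, margin = averaged_metric(build_fat_tree(8), 0.3, "link")
```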
It is worth mentioning that, although some of these topologies can be incrementally deployed, we only consider complete topologies, in which all network interfaces of servers and switches are used. Furthermore, the number of switch ports was not limited to the number of ports often seen in commercially available equipment, in order to provide a similar number of servers for all topologies. As one of the key goals of the network topology of a datacenter is to provide processing capacity or storage redundancy, which increases with a higher number of servers, balancing the number of servers per topology is an attempt to provide an analysis as fair as possible.

Fig. 4. Metrics related to network size considering link failures. (a) Largest connected component. (b) Other components.

A. Link Failures
The decrease in Savg occurs because, as the number of connected components increases, the size of the largest component is reduced, increasing the removal probability of edges in the isolated components. Therefore, the peak value matches the inflection point of RSmax, because at this point the size of the largest component decreases faster.

Figure 5 shows the impact of link failures on the network path length. DS and PS present the same behavior, differing only in their absolute values. The results show that all curves have a peak stretch value, from which we can conclude that link failures remove shortest paths up to a point where the paths are shortened due to the decreasing network size. Also, the path length increases fast, becoming as high as four times the original average length. The best performance in this analysis is achieved by Fat-Tree, despite its worst performance considering RSmax. However, the number of servers in the main component of Fat-Tree is smaller than on the other topologies for a given failure ratio. Consequently, it is important to evaluate the reliability considering more than one metric.
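Assuming that Diameter Stretch (DS) and Path Stretch (PS) are the diameter and the average server-to-server path length of the surviving network normalized by their failure-free values, which is consistent with the "four times the original average length" remark above, they can be derived from the earlier helpers as follows (an illustrative sketch, not the paper's definition):

```python
def path_stretch_metrics(original_graph, failed_graph):
    """Diameter Stretch and Path Stretch as ratios to the failure-free values,
    reusing path_length_metrics from the Section III sketch."""
    base_diameter, base_avg = path_length_metrics(original_graph)
    diameter, avg = path_length_metrics(failed_graph)
    return diameter / base_diameter, avg / base_avg
```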
Fig. 6. Metrics related to network size considering switch failures. (a) Largest connected component. (b) Other components.

B. Switch Failures

Figure 6 plots the behavior of the network size considering switch failures. The curves of RSmax behave similarly to the case of link failures. However, the region after the inflection point is negligible. Although the curves of Savg have no relevant peaks, the Savg of DCell3 increases approximately 20 times after the elimination of most of its switches. Ignoring this behavior, as it represents an unrealistic failure ratio, we observe that switch failures produce small isolated components. It is interesting to note the reliability of DCell3 with respect to the network size. The RSmax of DCell3 decreases only for a failure ratio greater than 60%. Compared with the Fat-Tree and BCube3 topologies, which have a close number of switches in relation to the number of servers (Table I), DCell3 has a better performance. Also, Figure 6(b) shows that DCell3 has no points for small failure ratios, due to the lack of isolated components. This superior performance of DCell3 is related to its high dependence on servers, which is analyzed in the next section. As in our previous results, Figure 6 shows that the number of ports used by servers increases the reliability for a same topology type. This is because, as shown in Table I, topologies with a higher number of ports per server have a lower switch port density, being less dependent on switches. Finally, DCell2 has the same performance as BCube2.

Fig. 7. Metrics related to path length considering switch failures. (a) Diameter Stretch. (b) Path Stretch.

The results of path length in Figure 7 show that switch failures generally double the diameter, with peaks only at failure ratios close to 90% for most of the topologies. Moreover, the path stretch changes only slightly up to failure ratios of 90%. This indicates that switch failures have a lower impact on path stretch compared with the other failure types.

C. Server Failures

Figure 8 shows the results for server failures. It is important to note that, in our previous results, the term Σ_{i=1}^{n} |s_i| of Equations 1 and 2 is constant and represents the total number of servers, because there is no removal of this type of element. In this analysis, however, Σ_{i=1}^{n} |s_i| decreases by one unit for each server removal. Thus, as shown in Figure 8(a), Fat-Tree presents the maximum reliability (RSmax = 1). This happens because a server removal in this topology does not induce disconnected components, since Fat-Tree does not depend on servers to forward packets.
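The change in the Σ_{i=1}^{n} |s_i| term described above amounts to normalizing the size of the largest component by the servers that remain after the removal, instead of by the original total. A minimal sketch of this reading, using the same hypothetical node-labeling convention as before:

```python
import networkx as nx


def rs_max_under_server_failures(graph, removed_servers):
    """Largest-component server fraction when servers themselves fail: the
    reference count shrinks with every removed server, which is why a topology
    that never forwards through servers (Fat-Tree) stays at 1.0."""
    g = graph.copy()
    g.remove_nodes_from(removed_servers)
    remaining = sum(1 for v in g if v[0] == "server")
    largest = max(
        (sum(1 for v in comp if v[0] == "server") for comp in nx.connected_components(g)),
        default=0,
    )
    return largest / remaining if remaining else 0.0
```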
Fig. 8. Metrics related to network size considering server failures. (a) Largest connected component. (b) Other components.

Fig. 9. Metrics related to path length considering server failures. (a) Diameter Stretch. (b) Path Stretch.
The results also show that the other networks, except DCell3, are more reliable to server failures than to switch and link failures. In the case of DCell3, we showed in Section IV-B its high reliability considering switch failures, because a significant part of this network remains connected through its servers. Figure 8(a) confirms that conclusion, because DCell3 shows the earliest RSmax decrease when the server failure ratio increases. The results of Savg show that the isolated components are also small compared with the total number of servers, which is approximately 3,400. The absence of points in Figure 8(b) for failures up to a certain value, specific to each topology, shows that the topologies maintain a single connected component for a long range of failures, except for DCell3, which has the worst performance. Figure 9 shows that DS and PS have the same behavior as those in Section IV-A, showing the significant impact that server failures produce on hybrid topologies.

V. CONCLUSIONS AND FUTURE WORK

In this work we have evaluated the reliability of datacenter topologies proposed in the literature when subject to different element failures, revealing the tradeoffs of each topology design. We have observed that network degradation starts with the removal of connected components with a relatively small number of servers. Our results have also revealed that hybrid topologies degrade more smoothly than Fat-Tree with respect to the network size, in the case of link and switch failures. However, the reliability of BCube and DCell with respect to the network size is maintained at the cost of a substantially increased path length. With these results we can also conclude that the network size or the path stretch, taken in isolation, does not provide an efficient way to evaluate topology performance; these metrics should probably be combined. Also, they can be combined with other metrics, such as network cost and network capacity, in a future work. Another future direction is to consider multipath routing, by analyzing path diversity.

ACKNOWLEDGEMENT

This work was funded by CNPq, CAPES, CTIC/RNP, and FAPERJ. We thank professors Stefano Secci and Daniel R. Figueiredo and the anonymous reviewers for their comments.

REFERENCES

[1] R. Miller, "Google uses about 900,000 servers," http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/, Aug. 2011.
[2] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in ACM SIGCOMM, Aug. 2008, pp. 63-74.
[3] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: a high performance, server-centric network architecture for modular data centers," in ACM SIGCOMM, Aug. 2009, pp. 63-74.
[4] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, "DCell: a scalable and fault-tolerant network structure for data centers," in ACM SIGCOMM, Aug. 2008, pp. 75-86.
[5] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," in ACM SIGCOMM, Aug. 2011, pp. 350-361.
[6] L. Popa, S. Ratnasamy, G. Iannaccone, A. Krishnamurthy, and I. Stoica, "A cost comparison of datacenter network architectures," in ACM CoNEXT, Dec. 2010, pp. 16:1-16:12.
[7] A. Curtis, T. Carpenter, M. Elsheikh, A. Lopez-Ortiz, and S. Keshav, "REWIRE: An optimization-based framework for unstructured data center network design," in IEEE INFOCOM, Mar. 2012.
[8] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "VL2: a scalable and flexible data center network," in ACM SIGCOMM, Aug. 2009, pp. 51-62.
[9] A. Hagberg, P. Swart, and D. S. Chult, "Exploring network structure, dynamics, and function using NetworkX," Los Alamos National Laboratory (LANL), Tech. Rep., 2008.
[10] R. Albert, H. Jeong, and A. Barabasi, "Error and attack tolerance of complex networks," Nature, vol. 406, no. 6794, pp. 378-382, 2000.
[11] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. Mogul, "SPAIN: COTS data-center ethernet for multipathing over arbitrary topologies," in USENIX NSDI, 2010, pp. 18-18.