Proceedings of the IASTED International Conference
Parallel and Distributed Computing and Systems (PDCS 2012)
November 12 - 14, 2012 Las Vegas, USA
NETWORK DESIGN CONSIDERATIONS FOR EXASCALE SUPERCOMPUTERS
Rui Feng¹, Peng Zhang², Yuefan Deng²
¹ School of Computer Science and Engineering, Beihang University, Beijing, P.R. China
[email protected]
² Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States
{peng.zhang, yuefan.deng}@stonybrook.edu
ABSTRACT
We consider network design optimization for Exascale-class supercomputers by altering the widely analyzed and implemented torus networks. Our alteration scheme interlaces the torus networks with bypass links of lengths 6 hops, 9 hops, 12 hops, and mixed 6 and 12 hops. These bypass lengths are optimal, resulting from an exhaustive search of the many possibilities. Our case study is constructed by strategically coupling 288 racks of 6 × 6 × 36 nodes into a full system of 72 × 72 × 72 nodes. The peak performance of such a system is 0.56 Exaflops when CPU-GPU complexes capable of 1.5 Tflops are adopted as the node modules. Our design simultaneously optimizes the system performance, the performance-cost ratio, and the power efficiency. The network diameter and the average node-to-node network distance, regarded as the performance metrics, are reduced by 83.3% and 80.4%, respectively, from the original 3D torus network. Similarly, the performance-cost ratio and the power efficiency are increased 1.43 and 4.44 times, respectively.
KEY WORDS
Interconnection Network, Parallel Architecture, Network Diameter, Supercomputer
1. Introduction
Designing, constructing, and applying Exascale supercomputers represents the next frontier of computer and computational sciences [1]. Engineers must pack 1,000,000 nodes with 64,000,000 GB of memory and 1,064,000,000 GB of disk storage in a space no bigger than 10,000 square meters, and operate it at no more than 20 MW of electricity [2]. With all these constraints, application developers wish to deliver 10¹⁸ flops of performance.

Many efforts are currently under way at multiple research and development centers, examining multiple technological sectors [3]. Computer developers are focusing on system design optimization in mechanical, thermal, electrical, and electronics subsystems. Others focus on software and application algorithms.

Interconnection networks are the heart of supercomputers, and two categories of interconnection networks [4-6] are potential candidates for Exascale supercomputers. Federated networks, adopted by the Tianhe-1A and Nebulae supercomputers [3], connect nodes by off-the-shelf solutions such as InfiniBand. Cellular networks, developed for systems such as the IBM Blue Gene solutions [7-9] and the Cray XT series supercomputers [10], utilize proprietary interconnection techniques. The latest Blue Gene series, Blue Gene/Q [11], uses a 5D torus network to connect its 17-core nodes. The new network with a higher node degree improves performance at the expense of more engineering complications and higher monetary costs [12].

Recent development in adding bypass links to conventional networks has gained attention [13, 14]. Compared with the brand-new design of the 5D torus, the iBT network adds performance and reduces the implementation difficulties in evolving from the 3D torus network.

This article focuses on the analysis of various configurations of interconnection networks, with the objective of finding the optimal performance-cost (p/c) ratios. Section 2 introduces the overall system-level design. Section 3 presents the design considerations, including expansion schemes for a supercomputer with 373,248 nodes. In Section 4, we compare our design with the original torus systems in terms of network performance, p/c ratios, and power efficiency. Conclusions are summarized in Section 5.
2. iBT Interconnection Network
In [15], a new interconnection network, called iBT, was proposed by interlacing bypass rings into torus networks. A general 𝑑-dimensional iBT network constructed from a torus network of dimensions 𝑁_1 × ⋯ × 𝑁_𝑑 can be expressed as
iBT(𝑁_1 × ⋯ × 𝑁_𝑑; 𝐿 = 𝑚; 𝒃 = 〈𝑏_1, ⋯, 𝑏_𝑘〉),
where the bypass link sequence 𝒃 = 〈𝑏_1, ⋯, 𝑏_𝑘〉 is a vector whose component values increase monotonically, i.e., 𝑏_1 < 𝑏_2 < ⋯ < 𝑏_𝑘, indicating that 𝑏_𝑖-hop bypass rings (𝑖 = 1, …, 𝑘) are recursively interlaced into any 𝑚 of the 𝑑 dimensions (𝑚 ≤ 𝑑). The node degree of the resulting iBT network is 2𝑑 + 2, where 2𝑑 is the node degree of the original torus network and the added 2 is for the bypass connections. To determine the two bypass connections of a node 𝑝 = (𝑥_1, 𝑥_2, ⋯, 𝑥_𝑑), where 𝑥_𝑖 ∈ [0, 𝑁_𝑖 − 1] for 𝑖 = 1, …, 𝑑, we introduce three terms. The first two are a nodal bypass dimension 𝑏𝑑(𝑝) ∈ {1, 2, ⋯, 𝑚} and a nodal bypass length 𝑏𝑙(𝑝):
𝑏𝑑(𝑝) = [(∑_{𝑖=1}^{𝑑} 𝑥_𝑖) (mod 𝑚)] + 1 and 𝑏𝑙(𝑝) = 𝑏_ℎ, where ℎ = ⌊((∑_{𝑖=1}^{𝑑} 𝑥_𝑖) (mod 𝑚𝑘))/𝑚⌋ + 1 ∈ {1, …, 𝑘}.
The third is a nodal bypass species 𝑏𝑠(𝑝) = 〈𝑏𝑑(𝑝), 𝑏𝑙(𝑝)〉, indicating that two 𝑏𝑙(𝑝)-hop bypass links are added to the given node 𝑝, one in each direction along the dimension 𝑏𝑑(𝑝). For example,
iBT(32 × 32 × 16; 𝐿 = 2; 𝒃 = 〈4,16〉)
indicates interlacing 4-hop and 16-hop bypass rings in the 𝑥𝑦-plane of the 3D torus 𝑇(32 × 32 × 16). In this network, the node 𝑝 = (1,1,4) has 𝑏𝑑(𝑝) = 1 and 𝑏𝑙(𝑝) = 𝑏_2 = 16, and thus 𝑏𝑠(𝑝) = 〈1,16〉, indicating that 𝑝 has two 16-hop bypass links, one in each direction along the x-dimension.
Such an interconnection network is desirable for an Exascale supercomputer because of its achievable absolute network performance at low engineering complexity, together with a high performance-cost ratio and a high power efficiency. For a 3-D iBT network, the specific notation
iBT(𝑁_x × 𝑁_y × 𝑁_z; 𝒃_x, 𝒃_y, 𝒃_z)
indicates that bypass rings are interlaced into all three dimensions of the 3-D torus 𝑇(𝑁_x × 𝑁_y × 𝑁_z). The bypass configuration in each dimension is given by 𝒃_x, 𝒃_y, or 𝒃_z. For another example,
iBT(6 × 24 × 72; 𝒃_x = 〈0〉, 𝒃_y = 〈6〉, 𝒃_z = 〈6,12〉)
indicates that:
1. the base network is the 3-D torus 𝑇(6 × 24 × 72);
2. no bypass is used in the x-dimension, indicated by 𝒃_x = 〈0〉;
3. uniform 6-hop bypasses are used in the y-dimension, indicated by 𝒃_y = 〈6〉;
4. a mix of 6-hop and 12-hop bypasses is used in the z-dimension, indicated by 𝒃_z = 〈6,12〉.
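To make the construction above concrete, the following minimal Python sketch (ours, not from [15]; the function name and argument layout are illustrative) computes a node's bypass species from the formulas above and reproduces the worked example of iBT(32 × 32 × 16; 𝐿 = 2; 𝒃 = 〈4,16〉):

def bypass_species(p, m, b):
    """Return the bypass species <bd, bl> of node p.

    p : tuple of node coordinates (x_1, ..., x_d)
    m : number of dimensions carrying bypass rings (m <= d)
    b : monotonically increasing bypass-length vector <b_1, ..., b_k>
    """
    k = len(b)
    s = sum(p)                      # coordinate sum of the node
    bd = (s % m) + 1                # nodal bypass dimension, in {1, ..., m}
    h = (s % (m * k)) // m + 1      # selects the bypass length, in {1, ..., k}
    bl = b[h - 1]                   # nodal bypass length
    return bd, bl

# The worked example from the text: iBT(32 x 32 x 16; L = 2; b = <4, 16>).
print(bypass_species((1, 1, 4), m=2, b=(4, 16)))   # -> (1, 16)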
3. Design Analysis

Our design follows the common principles of optimizing system performance, enhancing modularity, reducing engineering complexity, and lowering monetary costs [1]. The lowest-level module of our design is a node, which performs two functions: computation and communication with its directly connected neighboring nodes. A multiple processing unit (MPU) is the second-level module, with the architecture iBT(6 × 6 × 6; 𝒃_x, 𝒃_y, 𝒃_z), containing 6 × 6 × 6 = 216 nodes in a given network architecture. The third-level module is a rack, constituted of six MPUs architected as iBT(6 × 6 × 36; 𝒃_x, 𝒃_y, 𝒃_z). The highest level, i.e., the fourth level, is the complete system, which can contain an appropriate number of such racks according to budget and performance requirements as well as engineering feasibility. For example, a bi-rack system architected as iBT(6 × 6 × 72; 𝒃_x, 𝒃_y, 𝒃_z) allows full connection of all z-dimensional inter-rack torus and bypass links. One can further expand to a system of desirable size by connecting such bi-rack entities in the x- or y-dimension, or both. As we will discuss, a system architected as iBT(72³; 𝒃_x, 𝒃_y, 𝒃_z) can achieve optimal performance at the Exascale; it will have 288 racks of nodes, each node capable of 1.5 Tflops of performance as of 2011.

After extensive studies, we perform analysis on four bypass configurations for iBT(72³; 𝒃_x, 𝒃_y, 𝒃_z), with bypass vectors {〈6〉, 〈9〉, 〈12〉, 〈6,12〉}. With such analysis, we determine the bypass configuration in each dimension.
3.1 Design Considerations
When evaluating a design, we consider two factors: the
network characteristics and the p/c ratio. The network
characteristics include the network diameter and average
node-to-node distance. The network performance is defined as the reciprocal of the average distance (A), i.e.,
network performance = 1/𝐴.
We use the extra external inter-rack wires in our iBT design over the Blue Gene design as our cost metric, defined as 𝐶_w ∙ 𝑙, where 𝑙 is the total external wire length and 𝐶_w is the wires' unit cost. Thus, the material cost of all of the external wires 𝑙_𝑖, ∀𝑖, is
material cost = 𝐶_w ∑_{∀𝑖} 𝑙_𝑖.
The p/c ratio, defined as the performance divided by the material cost, is written as
𝑓_pc = (network performance)/(material cost) = 1/((𝐶_w ∑_{∀𝑖} 𝑙_𝑖) ∙ 𝐴).
By analyzing the relationship of these factors, we
identify a class of optimized design plans with specific
requirements and trade-offs.
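As an illustration of how the network characteristics can be measured (this sketch is ours, not the paper's tooling), the following Python code computes the diameter and the average node-to-node distance by breadth-first search over an explicit adjacency list; bypass links could be added to the same adjacency structure using the rules of Section 2. The small torus below is only a sanity check:

from collections import deque
from itertools import product

def torus_graph(dims):
    """Adjacency list of a torus T(N_1 x ... x N_d); nodes are coordinate tuples."""
    adj = {p: set() for p in product(*(range(n) for n in dims))}
    for p in adj:
        for i, n in enumerate(dims):
            q = list(p)
            q[i] = (p[i] + 1) % n        # +1 neighbor along dimension i
            adj[p].add(tuple(q))
            adj[tuple(q)].add(p)         # torus links are bidirectional
    return adj

def metrics(adj):
    """Return (diameter, average distance) over all ordered node pairs."""
    diameter, total, pairs = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                     # BFS from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
        pairs += len(adj) - 1
    return diameter, total / pairs

adj = torus_graph((4, 4, 4))             # small torus as a check
d, a = metrics(adj)
print(d, round(a, 2), round(1 / a, 3))   # diameter of T(4 x 4 x 4) is 6; performance = 1/A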
3.2 MPU: Multiple Processing Unit
Fig. 1 shows the multiple processing unit (MPU) architecture, which has the following properties:
1. 216 nodes interconnected as a 3-D mesh (6 × 6 × 6);
2. 432 external bypass links, two links per node;
3. 216 external torus links, two links per node at the boundaries;
4. 648 external links in total for connecting to other MPUs, categorized into 18 groups with 36 links per group.
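As a quick sanity check of these counts (ours, not from the paper), note that torus links leave the 6 × 6 × 6 mesh only at boundary coordinates, while every 6-hop or 12-hop bypass link necessarily leaves a 6-node-wide MPU:

from itertools import product

nodes = list(product(range(6), repeat=3))

# one external torus link per boundary coordinate (0 or 5) of each node
external_torus = sum(coord in (0, 5) for p in nodes for coord in p)
external_bypass = 2 * len(nodes)     # two bypass links per node, both external

print(external_torus, external_bypass, external_torus + external_bypass)
# -> 216 432 648, matching the MPU properties listed above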
To classify the external links, we assign each node relative coordinates 𝑝 = (𝑥, 𝑦, 𝑧), where 𝑥, 𝑦, 𝑧 ∈ [0,5], and let 𝑠(𝑝) = 𝑥 + 𝑦 + 𝑧. Let the x-dimensional bypass configuration be 𝒃_x = 〈𝑏_1, 𝑏_2〉; the same notation holds for the y- and z-dimensions. With these, we have
𝑏𝑑(𝑝) = 𝑠(𝑝) (mod 3) + 1 ∈ {1,2,3} and 𝑏𝑙(𝑝) = 𝑏_ℎ, where ℎ = ⌊(𝑠(𝑝) (mod 6))/3⌋ + 1 ∈ {1,2}.
For identifying a node, we introduce an MPU number and a rack number in addition to the nodal relative coordinates. An MPU number identifies the MPU's position in a rack, while a rack number identifies the rack's position in the system.

A node is a boundary node if and only if one of its three coordinates equals 0 or 5. Accordingly, we classify the external links into 18 groups, of which 6 are torus groups and 12 are bypass groups, marked in Fig. 2. For example, the torus group 𝑡𝑥 contains one link of each node with coordinates (0, 𝑦, 𝑧), and a bypass group 𝑏 contains one link of each node with 𝑏𝑠(𝑝) = 〈2, 𝑏_ℎ〉 for a fixed ℎ; the other groups are defined similarly. Thirty-six external torus or bypass links are bundled into each group, and each group provides the inter-MPU connections.
Fig. 1. External torus and bypass links in three typical planes of an MPU
Fig. 2. External link groups of an MPU
3.3 Rack
A rack consists of six internally connected MPUs, with external links for inter-rack connections; in total it has 936 external torus links and 1,944 external bypass links, classified into 18 link sets. The six MPUs can be arranged in two configurations: iBT(6 × 12 × 18), as shown in Fig. 3(a), and iBT(6 × 6 × 36), as shown in Fig. 3(b). No bypass links are added in the x-dimension, i.e., 𝒃_x = 〈0〉, because this dimension has too few nodes.
Determination of the y- and z-bypasses requires numerical experiments, whose results are shown in Table 1. The bypass scheme 𝒃_z = 〈6,12〉 shows the best performance among the six possibilities, and thus iBT(6 × 6 × 36; 𝒃_x, 𝒃_y, 𝒃_z = 〈6,12〉) is selected as our rack configuration.

Fig. 3. Internal connections in a rack with the architectures: (a) iBT(6 × 12 × 18) and (b) iBT(6 × 6 × 36)

Table 1. Performance Metrics in a Single Rack

Base Network      𝒃_y      𝒃_z       Diameter   Average Distance   Deviation
𝑇(6 × 12 × 18)    〈6〉     〈6〉      10         5.56               1.64
𝑇(6 × 12 × 18)    〈6〉     〈9〉      11         6.16               1.92
𝑇(6 × 6 × 36)     〈0〉     〈6〉      11         5.94               1.82
𝑇(6 × 6 × 36)     〈0〉     〈9〉      12         6.23               2.00
𝑇(6 × 6 × 36)     〈0〉     〈12〉     13         6.65               2.24
𝑇(6 × 6 × 36)     〈0〉     〈6,12〉   10         5.69               1.60
3.4 Z-Expansion
To achieve the final system design goal of 72 nodes in each dimension, we start assembling the system from the z-dimension to form a rack pair with the architecture iBT(6 × 6 × 72; 𝒃_x, 𝒃_y, 𝒃_z = 〈6,12〉). Inter-rack connections are shown in Fig. 4.

This modular design packs all z-dimensional external links, i.e., 72 torus and 216 bypass links, into one rack pair, allowing us to duplicate the rack pair by connecting its x- and y-dimensional external links, i.e., 864 torus and 1,728 bypass links. We classify these links into 12 link sets.
Fig. 4. Inter-rack connections in a rack pair iBT(6 × 6 × 72; 𝒃_x, 𝒃_y, 𝒃_z = 〈6,12〉) for completing the z-dimensional links
3.5 Y-Expansion
We now expand the system along the y-dimension by arranging 12 rack pairs in a row, resulting in iBT(6 × 72 × 72). Similarly, we have four bypass configurations in the y-dimension: 𝒃_y ∈ {〈6〉, 〈9〉, 〈12〉, 〈6,12〉}. Fig. 5 shows the inter-rack connections, with the same torus connections but different bypass patterns.
Fig. 5. Y-dimensional external links in the architectures: (a) iBT(𝒃_y = 〈6〉), (b) iBT(𝒃_y = 〈12〉), and (c) iBT(𝒃_y ∈ {〈9〉, 〈6,12〉})
From Fig. 5 we can see that:
1. iBT(…, 𝒃_y = 〈6〉) has the same inter-rack connection pattern as a 3D torus network but twice as many wires per link bundle;
2. iBT(…, 𝒃_y = 〈12〉) has the longest wiring and the same number of wires per link bundle as iBT(𝒃_y = 〈6〉);
3. iBT(…, 𝒃_y = 〈9〉) and iBT(𝒃_y = 〈6,12〉) use the same inter-rack connection pattern, a mixture of the above two patterns.

Fig. 6 shows the connection of 12 rack pairs in a row in the implementation of iBT(6 × 72 × 72; 𝒃_x = 〈0〉, 𝒃_y = 𝒃_z = 〈6,12〉).

Fig. 6. Y-dimensional external links in the architecture iBT(6 × 72 × 72; 𝒃_x = 〈0〉, 𝒃_y = 〈6,12〉, 𝒃_z = 〈6,12〉). To simplify and clarify the connections, only the links in half of the racks are shown. The black lines show both the original 3D torus links and the additional y-directional bypass links, and the red lines show bypass links.

Fig. 7 compares the network performance and the p/c ratios of the four 𝒃_y configurations as the number of racks increases. We found two configurations that warrant further consideration:
1. The configuration with 𝒃_y = 〈6〉 always achieves the highest p/c ratio, as well as the best network performance when the system size is less than 8 racks;
2. The configuration with 𝒃_y = 〈6,12〉 reduces the diameter and the average distance by 11.8% and 9.4% relative to 𝒃_y = 〈6〉, but its p/c ratio is 13.9% lower.

Fig. 7. Y-Expansion in the architecture iBT(6 × 𝑁_y × 72; 𝒃_x = 〈0〉, 𝒃_y ∈ {〈6〉, 〈9〉, 〈12〉, 〈6,12〉}, 𝒃_z = 〈6,12〉): (a) diameters, (b) average distances, and (c) performance-cost ratios vs. number of racks
Thus, if maximizing the p/c ratio is the objective, 𝒃_y = 〈6〉 is the best choice; if maximizing network performance is the objective, 𝒃_y = 〈6,12〉 is the best choice. In the following sections we limit our discussion to the bypass configurations with 𝒃_y ∈ {〈6〉, 〈6,12〉}.
3.6 X-Expansion

After completing the one-row y-expansion, we assemble the system along the x-dimension by connecting multiple rows, of 12 rack pairs per row, to achieve the architecture iBT(𝑁_x × 72 × 72; 𝒃_x, 𝒃_y, 𝒃_z = 〈6,12〉), where 𝒃_x, 𝒃_y ∈ {〈6〉, 〈6,12〉}.

Fig. 8 shows the three architectures of the entire system with 288 racks:
iBT(12 × 72 × 72; 𝒃_x = 〈6〉, 𝒃_y = 𝒃_z = 〈6,12〉),
iBT(24 × 72 × 72; 𝒃_x = 𝒃_y = 〈6〉, 𝒃_z = 〈6,12〉),
iBT(36 × 72 × 72; 𝒃_x = 𝒃_y = 〈6〉, 𝒃_z = 〈6,12〉).

Fig. 8. External links for three partitions with different dimensions and different bypass schemes in the system iBT(72³; 𝒃 = 〈6,12〉)

We assume the distances between two adjacent rows and between two adjacent racks in a row to be 𝑎 = 1.22 and 𝑏 = 1.88 meters, respectively [16].
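For intuition about the wiring cost of these choices, the following toy sketch (ours; the Manhattan routing model and the rack indexing are assumptions, not the paper's cost model) estimates the external cable length between two racks from the spacings 𝑎 and 𝑏 above:

# A toy estimate (ours, not from the paper) of the external cable length
# between two racks, assuming a hypothetical Manhattan route over the floor
# plan: adjacent rows are a = 1.22 m apart and adjacent racks in a row are
# b = 1.88 m apart [16].

A_ROW = 1.22    # row spacing, in meters
B_RACK = 1.88   # rack spacing within a row, in meters

def cable_length(rack1, rack2):
    """rack = (row, position in row); returns the route length in meters."""
    (r1, c1), (r2, c2) = rack1, rack2
    return abs(r1 - r2) * A_ROW + abs(c1 - c2) * B_RACK

# Example: two racks six positions apart in the same row.
print(cable_length((0, 0), (0, 6)))   # -> about 11.28 meters under this model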
Fig. 9. X-Expansion in the architectures iBT(𝑁_x × 72 × 72; 𝒃_x, 𝒃_y ∈ {〈6〉, 〈6,12〉}, 𝒃_z = 〈6,12〉): (a) diameters, (b) average distances, and (c) performance-cost ratios vs. number of racks

Fig. 9 compares the network performance and the p/c ratios as the system expands with multiple rows. These results show that:
1. The configuration with 𝒃_x = 𝒃_y = 〈6〉 has the highest network p/c ratio but also the lowest performance;
2. The configuration with 𝒃_x = 𝒃_y = 〈6,12〉 has the highest performance but also the lowest p/c ratio, for 2 or more rows;
3. The configuration with 𝒃_x = 〈6〉, 𝒃_y = 〈6,12〉 has moderate performance and a moderate p/c ratio and is no better than the above two.
Orchestrating the considerations of the y- and x-expansions, we conclude that iBT(𝒃_x = 𝒃_y = 〈6〉, 𝒃_z = 〈6,12〉) is the optimal architecture if maximizing the network p/c ratio is the design objective, while iBT(𝒃_x = 𝒃_y = 𝒃_z = 〈6,12〉) is optimal if maximizing network performance is the design objective.
3.7 XY-Expansion

In addition to independent y- and x-expansions, we consider simultaneous expansion in both the x- and y-dimensions. For the architecture iBT(𝑁_x × 𝑁_y × 72; 𝒃_x, 𝒃_y, 𝒃_z = 〈6,12〉), we choose the bypass schemes for the x- and y-dimensions as:
𝒃_x = 〈6〉 if 𝑁_x ≤ 24, and 𝒃_x = 〈6,12〉 otherwise; 𝒃_y = 〈6〉 if 𝑁_y ≤ 24, and 𝒃_y = 〈6,12〉 otherwise.

Fig. 10 shows the variations of the performance and p/c ratios as we expand the system from 16 to 288 racks. For example, for 96 racks, iBT(48 × 36 × 72) outperforms iBT(72 × 24 × 72), iBT(36 × 48 × 72), and iBT(24 × 72 × 72).

Fig. 10. XY-Expansion in the architecture iBT(𝑁_x × 𝑁_y × 72; 𝒃_x, 𝒃_y ∈ {〈6〉, 〈6,12〉}, 𝒃_z = 〈6,12〉): (a) diameters, (b) average distances, and (c) performance-cost ratios vs. number of racks
4. Comparisons with Torus
For further comparison between the selected iBT configurations and the original 3D torus, we consider the operational cost of the electricity to power the system up and cool it off. Moving 𝑁_b bits on a copper wire consumes [2]
energy = 𝑟𝑁_b 𝑙/𝑎_w,
where 𝑟 is the bit rate, 𝑙 is the wire length, and 𝑎_w is the cross-sectional area of the wire. Thus, the energy cost of moving 𝑁_b/𝑁_w bits on each of the 𝑁_w external wires is
energy = 𝑟𝑁_b (∑_{∀𝑖} 𝑙_𝑖)/(𝑁_w 𝑎_w) = 𝐶_e (∑_{∀𝑖} 𝑙_𝑖)/𝑁_w,
where 𝐶_e = 𝑟𝑁_b/𝑎_w.
The power efficiency is defined as
𝑓_pe = (network performance)/energy = 𝑁_w/(𝐶_e (∑_{∀𝑖} 𝑙_𝑖) ∙ 𝐴).
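The following small sketch (ours; the constants 𝐶_w and 𝐶_e and the two designs below are placeholders, not measured data) shows how 𝑓_pc from Section 3.1 and 𝑓_pe above would be evaluated and compared:

C_W = 1.0   # wire unit cost C_w (placeholder unit)
C_E = 1.0   # energy constant C_e = r * N_b / a_w (placeholder unit)

def performance_cost(avg_distance, wire_lengths):
    # f_pc = 1 / (C_w * sum(l_i) * A)
    return 1.0 / (C_W * sum(wire_lengths) * avg_distance)

def power_efficiency(avg_distance, wire_lengths):
    # f_pe = N_w / (C_e * sum(l_i) * A)
    return len(wire_lengths) / (C_E * sum(wire_lengths) * avg_distance)

# Hypothetical designs; the numbers are made up for illustration only.
torus = dict(avg_distance=54.0, wire_lengths=[2.0] * 1000)
ibt = dict(avg_distance=10.6, wire_lengths=[2.5] * 1200)
print(power_efficiency(**ibt) / power_efficiency(**torus))   # iBT-over-torus ratio
print(performance_cost(**ibt) / performance_cost(**torus))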
The iBT(𝒃_x, 𝒃_y, 𝒃_z) networks, with very minimal design variation from the popular 3D torus, outperform the latter greatly in both categories. Figs. 11 to 13 compare the configurations with the best network performance and the best p/c ratio against the original 3D torus for the y- and x-expansions, respectively. Fig. 11 shows the network performance ratios of the iBT networks, defined as the diameter of the torus divided by the diameter of the iBT, a five-fold performance gain. Figs. 12 and 13 show the network performance-cost ratios and the power efficiency of iBT networks over a torus of the same network size, respectively.
Fig. 11. Network performance ratios of iBT over torus

Fig. 12. Network performance/material cost ratios of iBT over torus

Fig. 13. Network performance/energy cost ratios of iBT over torus

In Figs. 11-13, the compared configurations are iBT(𝒃_x = 𝒃_y = 〈6〉, 𝒃_z = 〈6,12〉) and iBT(𝒃_x = 𝒃_y = 𝒃_z = 〈6,12〉).
Summarizing the above, we see:
1. Both iBT architectures outperform the 3D torus of the same node count. The performance-optimized iBT reduces the network diameter and the average distance by 83.3% and 80.4% over the 3D torus, and the cost-optimized iBT reduces the diameter and the average distance by 79.6% and 77.0%, respectively;
2. The network performance-cost ratio of the comparable iBT network is 1.43 times that of the 3D torus with the same node dimensions of 72 × 72 × 72;
3. The network power efficiency of the comparable iBT network is 4.44 times that of the 3D torus with the same node dimensions of 72 × 72 × 72.

A supercomputer with our proposed iBT network, coupled with 72 × 72 × 72 CPU-GPU nodes capable of a peak performance of 1.5 Tflops per node, can achieve 0.56 Exaflops.
5. Conclusion

Through extensive analysis of a technique for altering the widely adopted 3D torus networks for supercomputers, we propose a much more monetarily and energy-efficient architecture. The new architecture results from adding 6-hop, 9-hop, 12-hop, or mixed 6- and 12-hop bypass links to the x-, y-, and z-directions of the corresponding 3D torus network. Our methodology can be applied to general architectures, and our case study of a system of 72 × 72 × 72 = 373,248 nodes demonstrates the procedure and the value of our analysis. We analyzed these four bypass configurations, while expanding the system, in terms of network diameter, average distance, distance deviation, total external link length, number of external links, and relative network costs (average distance × total link length) for optimal performance and costs. Our comparisons of the performance, price-performance ratios, and power efficiency between the iBT architectures and the original 3D torus show that: 1) all four configurations demonstrate significant gains in performance and price-performance ratio; 2) the configurations with parameters 𝒃_x = 𝒃_y = 〈6,12〉 and 𝒃_x = 𝒃_y = 〈6〉 are optimal for a performance-optimized system and a cost-optimized system, respectively.

Acknowledgements

This work is supported by the National High-Tech Research and Development Plan of China (Project 863) under Grant No. 2009AA012201, and the Shanghai Science and Technology Development Fund under Grant No. 08dz1501600.
References
[1] A. Geist and R. Lucas, "Major Computer Science Challenges at Exascale," International Journal of High Performance Computing Applications, vol. 23, pp. 427-436, Nov. 2009.
[2] H. Simon, Exascale Challenges for the Computational Science Community. Available: http://symposium2010.oscer.ou.edu/oksupercompsymp2010_talk_simon_20101006.pdf, accessed 6 Oct. 2010.
[3] TOP500. Top 500 Supercomputer Sites. Available: http://www.top500.org
[4] W. J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Transactions on Computers, vol. 39, pp. 775-785, 1990.
[5] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., 2003.
[6] J. Duato, et al., Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers Inc., 2002.
[7] A. Gara, et al., "Overview of the Blue Gene/L system architecture," IBM Journal of Research and Development, vol. 49, pp. 195-212, 2005.
[8] N. R. Adiga, et al., "Blue Gene/L Torus Interconnection Network," IBM Journal of Research and Development, vol. 49, March/May 2005.
[9] IBM Blue Gene Team, "Overview of the Blue Gene/P project," IBM Journal of Research and Development, vol. 52, pp. 199-220, 2008.
[10] R. Barrett, P. Worley, and J. Kuehn, "Early Evaluation of the Cray XT5," in Proceedings of the 51st Cray User Group Conference, Atlanta, 2009.
[11] IBM uncloaks 20 petaflops BlueGene/Q super. Available: http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/, accessed 22 Nov. 2010.
[12] GREEN500. Green 500 Supercomputer Sites. Available: http://www.green500.org
[13] Y. Inoguchi, et al., "SRT interconnection network on 3D stacked implementation by considering thermo-radiation," in Proceedings of the Second Annual IEEE International Conference on Innovative Systems in Silicon, pp. 41-51, 8-10 Oct. 1997.
[14] Y. Inoguchi and S. Horiguchi, "Shifted Recursive Torus Interconnection for High Performance Computing," in Proceedings of High-Performance Computing on the Information Superhighway, HPC-Asia '97, 1997.
[15] P. Zhang, et al., "Interlacing Bypass Rings to Torus Networks for More Efficient Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 287-295, 2011.
[16] IBM System Blue Gene/P Solution Installation Planning Guide. Available: http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/BGP%20Site%20Installation%20Planning%20Guide.pdf, accessed 16 Jan. 2009.