
Network design considerations for exascale supercomputers

2012

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2012), November 12-14, 2012, Las Vegas, USA. DOI: 10.2316/P.2012.789-001

NETWORK DESIGN CONSIDERATIONS FOR EXASCALE SUPERCOMPUTERS

Rui Feng (School of Computer Science and Engineering, Beihang University, Beijing, P.R. China, [email protected]); Peng Zhang and Yuefan Deng (Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States, {peng.zhang, yuefan.deng}@stonybrook.edu)

ABSTRACT
We consider network design optimization for Exascale-class supercomputers by altering the widely analyzed and implemented torus networks. Our alteration scheme interlaces the torus networks with bypass links of lengths 6 hops, 9 hops, 12 hops, and mixed 6 and 12 hops. These bypass lengths are optimal, resulting from an exhaustive search of a massive space of possibilities. Our case study strategically couples 288 racks of 6 × 6 × 36 nodes into a full system of 72 × 72 × 72 nodes. The peak performance of such a system is 0.56 Exaflops when CPU-GPU complexes capable of 1.5 Tflops are adopted as the node module. Our design simultaneously optimizes the system performance, the performance-cost ratio, and the power efficiency. The network diameter and the average node-to-node network distance, regarded as the performance metrics, are reduced from those of the original 3D torus network by 83.3% and 80.4%, respectively. Similarly, the performance-cost ratio and the power efficiency are increased 1.43 and 4.44 times, respectively.

KEY WORDS
Interconnection Network, Parallel Architecture, Network Diameter, Supercomputer

1. Introduction

Designing, constructing, and applying Exascale supercomputers represents the next frontier of computer and computational sciences [1]. Engineers must pack 1,000,000 nodes with 64,000,000 GB of memory and 1,000,000,000 GB of disk storage into a space no bigger than 10,000 square meters, and operate them at no more than 20 MW of electricity [2]. Within all these constraints, application developers wish to deliver 10^18 flops of performance. Many efforts currently under way at multiple research-and-development centers are examining multiple technological sectors [3]. Computer developers are focusing on system design optimization in mechanical, thermal, electrical, and electronic subsystems; others focus on software and application algorithms. Interconnection networks are the heart of supercomputers, and two categories of interconnection networks [4-6] are potential candidates for Exascale supercomputers. Federated networks, adopted by the Tianhe-1A and Nebulae supercomputers [3], connect nodes by off-the-shelf solutions such as InfiniBand. Cellular networks, developed for systems such as the IBM Blue Gene solutions [7-9] and the Cray XT series supercomputers [10], utilize proprietary interconnection techniques. The latest Blue Gene series machine, Blue Gene/Q [11], uses a 5D torus network to connect its 17-core nodes. The new network with a higher node degree improves the performance at the expense of more engineering complications and higher monetary costs [12]. Recent developments in adding bypass links to conventional networks have gained attention [13, 14]. Compared with the brand-new design of the 5D torus, the iBT network adds performance and reduces the implementation difficulties by evolving from the 3D torus network. This article focuses on the analysis of various configurations of the interconnection networks with the objective of finding the optimal performance-cost (p/c) ratios. Section 2 introduces the overall system-level design. Section 3 presents the design considerations, including expansion schemes for a supercomputer with 373,248 nodes. In Section 4, we compare our design with the original torus systems in terms of network performance, p/c ratios, and power efficiency. Conclusions are summarized in Section 5.
2. iBT Interconnection Network

In [15], a new interconnection network called iBT was proposed by interlacing bypass rings to torus networks. A general d-dimensional iBT network constructed from a torus network of dimensions N_1 × ⋯ × N_d can be expressed as iBT(N_1 × ⋯ × N_d; L = m; b = ⟨b_1, ⋯, b_k⟩), where the bypass link sequence b = ⟨b_1, ⋯, b_k⟩ is a vector whose component values increase monotonically, i.e., b_1 < b_2 < ⋯ < b_k, and it indicates that b_i-hop bypass rings (i = 1, …, k) are recursively interlaced into any m of the d dimensions (m ≤ d). The node degree of the resulting iBT network is 2d + 2, where 2d is the node degree of the original torus network and the added 2 accounts for the bypass connections. To determine the two bypass connections for a node p = (x_1, x_2, ⋯, x_d), where x_i ∈ [0, N_i − 1] for i = 1, …, d, we introduce three terms: a nodal bypass dimension

bd(p) = [(Σ_i x_i) (mod m)] + 1 ∈ {1, 2, ⋯, m},

a nodal bypass length

bl(p) = b_h, where h = ⌊((Σ_i x_i) (mod mk)) / m⌋ + 1 ∈ {1, …, k},

and thus a nodal bypass species bs(p) = ⟨bd(p), bl(p)⟩, indicating that two bl(p)-hop bypass links have been added to the given node p, one in each direction along the dimension bd(p).

These constructions scale to full systems. A bi-rack system architected as iBT(6 × 6 × 72; b_x, b_y, b_z) allows full connections of all z-dimensional inter-rack torus and bypass links, and one can further expand to a system of desirable size by connecting such bi-rack entities in the x- or y-dimension or both. As we will discuss, a system architected as iBT(72 × 72 × 72; b_x, b_y, b_z) can achieve optimal performance at the Exascale; it will have 288 racks of nodes capable of 1.5 Tflops per node in 2011. After extensive studies, we analyze four bypass configurations for iBT(72 × 72 × 72; b_x, b_y, b_z) with bypass vectors {⟨6⟩, ⟨9⟩, ⟨12⟩, ⟨6,12⟩}, and with such analysis we determine the bypass configuration in each dimension.
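As a concreteness check, the nodal bypass assignment bd(p), bl(p), bs(p) above can be sketched in a few lines; the function name `bypass_species` is ours, not the paper's, and the worked example is the one given in the text.

```python
# Nodal bypass assignment for an iBT network (sketch, following Sec. 2).
# For iBT(N1 x ... x Nd; L = m; b = <b_1, ..., b_k>), each node p = (x_1, ..., x_d)
# gets two bypass links of length bl(p) hops along dimension bd(p).

def bypass_species(p, m, b):
    """Return (bd, bl): bypass dimension (1-based) and bypass hop length."""
    k = len(b)
    s = sum(p)
    bd = s % m + 1                 # bd(p) = [(sum x_i) mod m] + 1
    h = (s % (m * k)) // m + 1     # h in {1, ..., k}
    bl = b[h - 1]                  # bl(p) = b_h
    return bd, bl

# Worked example from the text: in iBT(32 x 32 x 16; L = 2; b = <4, 16>),
# node p = (1, 1, 4) has bd(p) = 1 and bl(p) = b_2 = 16, i.e. bs(p) = <1, 16>.
print(bypass_species((1, 1, 4), m=2, b=(4, 16)))  # -> (1, 16)
```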
For example, iBT(32 × 32 × 16; L = 2; b = ⟨4,16⟩) indicates interlacing 4-hop and 16-hop bypass rings in the xy-plane of the 3D torus T(32 × 32 × 16). In this network, the node p = (1,1,4) has bd(p) = 1 and bl(p) = b_2 = 16, and thus bs(p) = ⟨1,16⟩, indicating that p has two 16-hop bypass links, one in each direction along the x-dimension. Such an interconnection network is desirable for an Exascale supercomputer because of its achievable absolute network performance at a low engineering complexity, together with a high performance-cost ratio and high power efficiency.

For a 3-D iBT network, we use a specific notation: iBT(N_x × N_y × N_z; b_x, b_y, b_z) indicates that bypass rings are interlaced into all three dimensions of a 3-D torus T(N_x × N_y × N_z), with the bypass configuration of each dimension given by b_x, b_y, or b_z. For another example, iBT(6 × 24 × 72; b_x = ⟨0⟩, b_y = ⟨6⟩, b_z = ⟨6,12⟩) indicates:
1. Base network: the 3-D torus T(6 × 24 × 72);
2. No bypass in the x-dimension, indicated by b_x = ⟨0⟩;
3. Uniform 6-hop bypass in the y-dimension, indicated by b_y = ⟨6⟩;
4. A mix of 6-hop and 12-hop bypasses in the z-dimension, indicated by b_z = ⟨6,12⟩.

3. Design Analysis

Our design follows the common principles of optimizing system performance, enhancing modularity, reducing engineering complexity, and lowering monetary costs [1]. The lowest-level module of our design is a node that performs two functions: computation and communication with its directly connected neighboring nodes. A multiple processing unit, or MPU, is the 2nd-level module with the architecture iBT(6 × 6 × 6; b_x, b_y, b_z), containing 6 × 6 × 6 = 216 nodes with a given network architecture. The 3rd-level module is a rack constituted of six MPUs architected as iBT(6 × 6 × 36; b_x, b_y, b_z). The highest level, i.e., the fourth level, is the complete system, which can contain an appropriate number of such racks according to budget and performance requirements as well as engineering feasibility.

3.1 Design Considerations

When evaluating a design, we consider two factors: the network characteristics and the p/c ratio. The network characteristics include the network diameter and the average node-to-node distance. The network performance is defined as the reciprocal of the average distance A, i.e., network performance = 1/A. We use the extra external inter-rack wires of our iBT design over the Blue Gene design as our cost metric, defined as C_w ∙ l, where l is the total external wire length and C_w is the wires' unit cost. Thus, the material cost of all of the external wires l_i, ∀i, is

material cost = Σ_{∀i} C_w l_i.

The p/c ratio, defined as the performance divided by the material cost, is written as

f_pc = network performance / material cost = 1 / (C_w Σ_{∀i} l_i ∙ A).

By analyzing the relationship of these factors, we identify a class of optimized design plans with specific requirements and trade-offs.

3.2 MPU: Multiple Processing Unit

Fig. 1 shows the multiple processing unit (MPU) architecture, which has the following properties:
1. 216 nodes interconnected as a 3-D mesh (6 × 6 × 6);
2. 432 external bypass links, two links per node;
3. 216 external torus links, two links per node at the boundaries;
4. 648 external links in total for connecting to other MPUs.
These 648 links are categorized into 18 groups with 36 links per group. To classify the external links, we assign each node relative coordinates p = (x, y, z), where x, y, z ∈ [0,5], and let s(p) = x + y + z. Let the x-dimensional bypass configuration be b_x = ⟨b_1, b_2⟩; the same notation holds for the y- and z-dimensions. With this, we have bd(p) = s(p) (mod 3) + 1 ∈ {1,2,3} and bl(p) = b_h, where h = ⌊(s(p) (mod 6)) / 3⌋ + 1 ∈ {1,2}.
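The MPU bookkeeping above can be sanity-checked with a small enumeration. This is our own sketch, not the paper's code; it assumes, as stated, two external bypass links per node, one external torus link per boundary face a node touches, and the s(p)-based species assignment splitting the nodes evenly.

```python
# Sanity check of the MPU link counts and species classification of Sec. 3.2.
from collections import Counter

def mpu_stats(n=6):
    """Enumerate an n x n x n MPU: external link counts and bypass species sizes."""
    torus_links, bypass_links = 0, 0
    species = Counter()
    for x in range(n):
        for y in range(n):
            for z in range(n):
                s = x + y + z
                bypass_links += 2                              # two bypass links per node
                torus_links += sum(c in (0, n - 1) for c in (x, y, z))  # boundary faces
                species[(s % 3 + 1, (s % 6) // 3 + 1)] += 1    # (bd(p), h)
    return torus_links, bypass_links, species

t, b, sp = mpu_stats()
print(t, b, t + b)          # -> 216 432 648 external links in total
print(sorted(sp.values()))  # six species of 36 nodes each, matching 36-link groups
```

The six equally sized species, with one bypass link per node per direction, yield the 12 bypass groups of 36 links each; the six boundary faces yield the 6 torus groups of 36.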
For identifying a node, we introduce an MPU number and a rack number in addition to the nodal relative coordinates: an MPU number identifies the MPU's position in a rack, while a rack number identifies the rack's position in the system. A node is a boundary node if and only if one of its three coordinates equals 0 or 5. Accordingly, we classify the external links into 18 groups, of which 6 are torus groups and 12 are bypass groups, as marked in Fig. 2. For example, the group t_x contains one link of each node with coordinates (0, y, z), and each bypass group contains one link, in one direction, of each node with a given bypass species; the other groups are defined in the same way. Thirty-six external torus or bypass links are bundled into each group, and these groups provide the inter-MPU connections.

Fig. 1. External torus and bypass links in three typical planes of a MPU.

Fig. 2. External link groups of a MPU.

3.3 Rack

A rack consists of six internally connected MPUs with external links for inter-rack connections; it has 936 external torus links and 1,944 external bypass links classified into 18 link sets. The six MPUs can be arranged in two configurations: iBT(6 × 12 × 18), shown in Fig. 3(a), and iBT(6 × 6 × 36), shown in Fig. 3(b). No bypass link is added in the x-dimension, i.e., b_x = ⟨0⟩, because this dimension has too few nodes. Determination of the y- and z-bypasses requires numerical experiments, whose results are shown in Table 1. The bypass scheme b_z = ⟨6,12⟩ shows the best performance among the six possibilities, and thus iBT(6 × 6 × 36; b_x, b_y, b_z = ⟨6,12⟩) is selected as our rack configuration.

Fig. 3. Internal connections in a rack with the architectures: (a) iBT(6 × 12 × 18) and (b) iBT(6 × 6 × 36).

Table 1. Performance metrics in a single rack

Base network      b_y     b_z       Diameter   Average distance   Deviation
T(6 × 12 × 18)    ⟨6⟩     ⟨6⟩       10         5.56               1.64
T(6 × 12 × 18)    ⟨6⟩     ⟨9⟩       11         6.16               1.92
T(6 × 6 × 36)     ⟨0⟩     ⟨6⟩       11         5.94               1.82
T(6 × 6 × 36)     ⟨0⟩     ⟨9⟩       12         6.23               2.00
T(6 × 6 × 36)     ⟨0⟩     ⟨12⟩      13         6.65               2.24
T(6 × 6 × 36)     ⟨0⟩     ⟨6,12⟩    10         5.69               1.60

3.4 Z-Expansion

To reach the final design goal of 72 nodes in each dimension, we start assembling the system from the z-dimension to form a rack pair with the architecture iBT(6 × 6 × 72; b_x, b_y, b_z = ⟨6,12⟩). The inter-rack connections are shown in Fig. 4. This modular design packs all z-dimensional external links, i.e., 72 torus and 216 bypass links, within one rack pair, allowing us to duplicate the rack pair by connecting its x- and y-dimensional external links, i.e., 864 torus and 1,728 bypass links, which we classify into 12 link sets.

Fig. 4. Inter-rack connections in a rack pair iBT(6 × 6 × 72; b_x, b_y, b_z = ⟨6,12⟩) for completing the z-dimensional links.

3.5 Y-Expansion

We now expand the system along the y-dimension by arranging 12 rack pairs in a row, resulting in iBT(6 × 72 × 72). Similarly, we have four candidate bypass configurations in the y-dimension: b_y ∈ {⟨6⟩, ⟨9⟩, ⟨12⟩, ⟨6,12⟩}. Fig. 5 shows the inter-rack connections, which share the same torus connections but differ in bypass patterns. From Fig. 5 we can see that:
1. iBT(…, b_y = ⟨6⟩) has the same inter-rack connection pattern as a 3D torus network but twice as many wires per link bundle;
2. iBT(…, b_y = ⟨12⟩) has the longest wirings and the same number of wires per link bundle as iBT(…, b_y = ⟨6⟩), while iBT(…, b_y = ⟨9⟩) and iBT(…, b_y = ⟨6,12⟩) use the same inter-rack connection pattern, a mixture of the above two patterns;
3. Fig. 6 shows the connection of 12 rack pairs in a row in the implementation of iBT(6 × 72 × 72; b_x = ⟨0⟩, b_y = b_z = ⟨6,12⟩).

Fig. 5. Y-dimensional external links in the architectures: (a) iBT(b_y = ⟨6⟩), (b) iBT(b_y = ⟨12⟩), and (c) iBT(b_y ∈ {⟨9⟩, ⟨6,12⟩}).

Fig. 6. Y-dimensional external links in the architecture iBT(6 × 72 × 72; b_x = ⟨0⟩, b_y = ⟨6,12⟩, b_z = ⟨6,12⟩). To simplify and clarify the connections, only the links in half of the racks are shown. The black lines show both the original 3D-torus links and the additional y-directional bypass links; the red lines show bypass links.

Fig. 7 compares the network performance and p/c ratios of the four b_y configurations as the number of racks increases. We found two configurations that warrant further consideration:
1. The configuration with b_y = ⟨6⟩ always achieves the highest p/c ratio, as well as the best network performance when the system is smaller than 8 racks;
2. The configuration with b_y = ⟨6,12⟩ reduces the diameter and the average distance by 11.8% and 9.4% relative to b_y = ⟨6⟩, but its p/c ratio is 13.9% lower.
Thus, if maximizing the p/c ratio is the objective, b_y = ⟨6⟩ is the best choice; if maximizing network performance is the objective, b_y = ⟨6,12⟩ is the best choice. In the following sections we limit our discussion to the bypass configurations with b_y ∈ {⟨6⟩, ⟨6,12⟩}.

Fig. 7. Y-expansion in the architecture iBT(6 × N_y × 72; b_x = ⟨0⟩, b_y ∈ {⟨6⟩, ⟨9⟩, ⟨12⟩, ⟨6,12⟩}, b_z = ⟨6,12⟩), with the plain torus as the baseline: (a) diameters, (b) average distances, and (c) performance-cost ratios vs. the number of racks.

3.6 X-Expansion

After completing the one-row y-expansion, we assemble the system along the x-dimension by connecting multiple rows of 12 rack pairs per row to achieve the architecture iBT(N_x × 72 × 72; b_x, b_y, b_z = ⟨6,12⟩), where b_x, b_y ∈ {⟨6⟩, ⟨6,12⟩}. Fig. 8 shows three partitions en route to the entire system of 288 racks: iBT(12 × 72 × 72; b_x = ⟨6⟩, b_y = b_z = ⟨6,12⟩), iBT(24 × 72 × 72; b_x = b_y = ⟨6⟩, b_z = ⟨6,12⟩), and iBT(36 × 72 × 72; b_x = b_y = ⟨6⟩, b_z = ⟨6,12⟩).

Fig. 8. External links of three partitions with different dimensions and different bypass schemes in the system iBT(72 × 72 × 72; b = ⟨6,12⟩).

We assume the distances between two adjacent rows and between two adjacent racks in a row to be a = 1.22 m and b = 1.88 m, respectively [16]. Fig. 9 compares the network performance and the p/c ratios as the system expands to multiple rows. These results show that:
1. The configuration with b_x = b_y = ⟨6⟩ has the highest network p/c ratio but also the lowest performance;
2. The configuration with b_x = b_y = ⟨6,12⟩ has the highest performance but also the lowest p/c ratio for 2 or more rows;
3. The configuration with b_x = ⟨6⟩, b_y = ⟨6,12⟩ has moderate performance and a moderate p/c ratio, and is no better than the above two.
Orchestrating the considerations of the y- and x-expansions, we conclude that iBT(b_x = b_y = ⟨6⟩, b_z = ⟨6,12⟩) is the optimal architecture if maximizing the network p/c ratio is the design objective, while iBT(b_x = b_y = ⟨6,12⟩, b_z = ⟨6,12⟩) is optimal if maximizing network performance is the design objective.

Fig. 9. X-expansion in the architectures iBT(N_x × 72 × 72; b_x, b_y ∈ {⟨6⟩, ⟨6,12⟩}, b_z = ⟨6,12⟩), with the plain torus as the baseline: (a) diameters, (b) average distances, and (c) performance-cost ratios vs. the number of racks.
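The two metrics driving all of these comparisons, network diameter and average node-to-node distance, can be computed by breadth-first search. The sketch below is a deliberately simplified 1-D analogue of our own making (a 36-node ring where every node carries a 6-hop bypass, unlike the interlaced 3-D iBT construction); it only illustrates how bypass rings shrink both metrics relative to a plain ring.

```python
# BFS diameter and average node-to-node distance for a ring with bypass hops.
# Simplified 1-D illustration, not the full 3-D iBT construction.
from collections import deque

def metrics(n, hops):
    """Return (diameter, average distance) of an n-node ring with the given hop set."""
    dist_sum, diam = 0, 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for h in hops:
                for v in ((u + h) % n, (u - h) % n):
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
        dist_sum += sum(dist.values())
        diam = max(diam, max(dist.values()))
    avg = dist_sum / (n * (n - 1))   # average over ordered node pairs
    return diam, avg

print(metrics(36, (1,)))    # plain 36-ring: diameter 18, average ~9.26
print(metrics(36, (1, 6)))  # with 6-hop bypass: diameter 5, average 3.0
```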
3.7 XY-Expansion

In addition to the independent y- and x-expansions, we consider simultaneous expansion in both the x- and y-dimensions. For the architecture iBT(N_x × N_y × 72; b_x, b_y, b_z = ⟨6,12⟩), we choose the bypass schemes for the x- and y-dimensions as

b_x = ⟨6⟩ if N_x ≤ 24, and ⟨6,12⟩ otherwise;
b_y = ⟨6⟩ if N_y ≤ 24, and ⟨6,12⟩ otherwise.

Fig. 10 shows the variations of the performance and p/c ratios as we expand the system from 16 to 288 racks. For example, for 96 racks, iBT(48 × 36 × 72) outperforms iBT(72 × 24 × 72), iBT(36 × 48 × 72), and iBT(24 × 72 × 72).

Fig. 10. XY-expansion in the architecture iBT(N_x × N_y × 72; b_x, b_y ∈ {⟨6⟩, ⟨6,12⟩}, b_z = ⟨6,12⟩), with the plain torus (b_x = b_y = b_z = ⟨0⟩) as the baseline: (a) diameters, (b) average distances, and (c) performance-cost ratios vs. the number of racks.

4. Comparisons with Torus

For further comparisons between the selected iBT configurations and the original 3D torus, we consider the operational cost of the electricity to power the system up and cool it off. Moving N_b bits over a copper wire consumes [2]

energy = r N_b l² / a,

where r is the bit rate, l is the wire length, and a is the cross-sectional area of the wire. Thus, the energy cost of moving N_b / N_w bits on each of the N_w external wires is

energy = r N_b Σ_{∀i} l_i² / (N_w a) = C_e Σ_{∀i} l_i² / N_w, where C_e = r N_b / a.

The power efficiency is then defined as

f_e = network performance / energy = N_w / (C_e Σ_{∀i} l_i² ∙ A).

The iBT(b_x, b_y, b_z) networks, with very minimal design variation over the popular 3D torus, outperform the latter greatly in both categories. Figs. 11 to 13 compare the configurations with the best network performance and the best p/c ratio against the original 3D torus for the y- and x-expansions, respectively. Fig. 11 shows the network performance ratios of the iBT networks, defined as the diameter of the torus divided by the diameter of the iBT, demonstrating a 5-fold performance gain. Figs. 12 and 13 show the network performance-cost ratios and the power efficiency of the iBT networks over tori of the same network size, respectively.

Fig. 11. Network performance ratios of iBT over torus.

Fig. 12. Network performance/material-cost ratios of iBT over torus.

Fig. 13. Network performance/energy-cost ratios of iBT over torus. In Figs. 11-13, the two curves correspond to iBT(b_x = b_y = ⟨6⟩, b_z = ⟨6,12⟩) and iBT(b_x = b_y = b_z = ⟨6,12⟩).

Summarizing the above, we see:
1. Both iBT architectures outperform the 3D torus of the same node count: the performance-optimized iBT reduces the network diameter and average distance by 83.3% and 80.4% over the 3D torus, and the cost-optimized iBT reduces them by 79.6% and 77.0%, respectively;
2. The network performance-cost ratio of the comparable iBT network is 1.43 times that of the 3D torus with the same node dimensions of 72 × 72 × 72;
3. The network power efficiency of the comparable iBT network is 4.44 times that of the 3D torus with the same node dimensions of 72 × 72 × 72.
A supercomputer with our proposed iBT network coupling 72 × 72 × 72 CPU-GPU nodes, each capable of a peak performance of 1.5 Tflops, can achieve 0.56 Exaflops.

5. Conclusion

Through extensive analyses of a technique that alters the widely adopted 3D torus networks for supercomputers, we propose a much more monetarily and energy efficient architecture. The new architecture results from adding 6-hop, 9-hop, 12-hop, or mixed 6- and 12-hop bypass links to the x-, y-, and z-directions of the corresponding 3D torus network. Our methodology applies to general architectures, and our case study of a system of 72 × 72 × 72 = 373,248 nodes demonstrates the procedure and the value of our analysis. We analyzed the four bypass configurations, while expanding the system, in terms of network diameter, average distance, distance deviation, total external link length, number of external links, and relative network cost (average distance × total link length) for optimal performance and costs. Our comparisons of the performance, price-performance ratios, and power efficiency between the iBT architectures and the original 3D torus show that: 1) all four configurations demonstrate significant gains in performance and price-performance ratio; and 2) the configurations with parameters b_x = b_y = ⟨6,12⟩ and b_x = b_y = ⟨6⟩ are optimal for a performance-optimized system and a cost-optimized system, respectively.

Acknowledgements

This work is supported by the National High-Tech Research and Development Plan of China (Project 863) under Grant No. 2009AA012201 and the Shanghai Science and Technology Development Fund under Grant No. 08dz1501600.

References

[1] A. Geist and R. Lucas, "Major Computer Science Challenges at Exascale," International Journal of High Performance Computing Applications, vol. 23, pp. 427-436, Nov. 2009.
[2] H. Simon, Exascale Challenges for the Computational Science Community. Available: http://symposium2010.oscer.ou.edu/oksupercompsymp2010_talk_simon_20101006.pdf, accessed 6 Oct. 2010.
[3] TOP500, Top 500 Supercomputer Sites. Available: http://www.top500.org
[4] W. J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Transactions on Computers, vol. 39, pp. 775-785, 1990.
[5] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., 2003.
[6] J. Duato et al., Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers Inc., 2002.
[7] A. Gara et al., "Overview of the Blue Gene/L system architecture," IBM Journal of Research and Development, vol. 49, pp. 195-212, 2005.
[8] N. R. Adiga et al., "Blue Gene/L Torus Interconnection Network," IBM Journal of Research and Development, vol. 49, March/May 2005.
[9] IBM Blue Gene Team, "Overview of the Blue Gene/P project," IBM Journal of Research and Development, vol. 52, pp. 199-220, 2008.
[10] P. Worley, R. Barrett, and J. Kuehn, "Early Evaluation of the Cray XT5," in Proceedings of the 51st Cray User Group Conference, Atlanta, 2009.
[11] IBM uncloaks 20 petaflops BlueGene/Q super. Available: http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/, accessed 22 Nov. 2010.
[12] GREEN500, Green 500 Supercomputer Sites. Available: http://www.green500.org
[13] Y. Inoguchi et al., "SRT interconnection network on 3D stacked implementation by considering thermo-radiation," in Proceedings of the Second Annual IEEE International Conference on Innovative Systems in Silicon, pp. 41-51, Oct. 1997.
[14] Y. Inoguchi and S. Horiguchi, "Shifted Recursive Torus Interconnection for High Performance Computing," in Proceedings of High-Performance Computing on the Information Superhighway (HPC-Asia '97), 1997.
[15] P. Zhang et al., "Interlacing Bypass Rings to Torus Networks for More Efficient Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 287-295, 2011.
[16] IBM System Blue Gene/P Solution: Installation Planning Guide. Available: http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/BGP%20Site%20Installation%20Planning%20Guide.pdf, accessed 16 Jan. 2009.