Abstract

As the complexity and size of challenges in science and engineering are continually increasing, it is highly important that applications are able to scale strongly to very large numbers of cores (>100,000 cores) to enable HPC systems to be utilised efficiently. This paper presents results of strong scaling tests performed with an MPI only and a hybrid MPI + OpenMP version of the Lattice QCD application BQCD on the European Tier-0 system SuperMUC at LRZ.
Introduction
Within the standard model of elementary particle physics, quantum chromodynamics (QCD) describes the strong interactions between quarks and gluons, the smallest building blocks of matter. The equations of this theory are so complicated that they cannot be solved by traditional perturbative methods of quantum field theory. The only computational ab initio approach for solving QCD is lattice QCD. In order to simulate QCD on HPC systems, one approximates the space-time continuum by a four-dimensional finite box divided into a discrete set of points, the lattice. Due to continuous algorithmic improvements and the advent of Petaflops-scale computing facilities, lattice QCD simulations have reached a level where realistic simulations with physical quark masses have become possible and allow for a first-principles study of strongly interacting elementary particles.
BQCD (Berlin Quantum ChromoDynamics) is a Hybrid Monte Carlo (HMC) code written in Fortran that simulates QCD with dynamical Wilson fermions. Beyond being widely used in the lattice QCD community, BQCD is also included in the Unified European Applications Benchmark Suite (UEABS) of PRACE, the Partnership for Advanced Computing in Europe. The kernel of the program is a standard conjugate gradient solver with even/odd preconditioning. In a typical HMC run between 80% and 95% of the total computing time is spent in the multiplication of a very large sparse matrix (the "hopping matrix") with a vector. At the single-CPU level QCD programs benefit from the fact that the basic operation is the multiplication of small complex matrices. QCD programs are parallelised by decomposing the lattice into regular domains. The nearest-neighbour structure of the hopping matrix implies that the boundary values (surfaces) of the input vector have to be exchanged between neighbouring processes in every iteration of the solver. The boundary exchange is communication intensive because the local lattices are typically small and the surface-to-volume ratio is quite large. BQCD has several communication methods implemented in the hopping matrix multiplication: MPI, OpenMP, a hybrid combination of both, as well as shmem (single-sided communication).
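The boundary exchange described above can be pictured with a short, self-contained sketch. The following C code is an illustration only, not BQCD's Fortran implementation; the spinor layout, buffer handling and message tags are assumptions. It exchanges the two boundary surfaces of one lattice dimension with non-blocking MPI calls on a Cartesian communicator.

```c
/* Minimal sketch of a nearest-neighbour halo exchange for one lattice
 * dimension, as needed by the hopping-matrix multiplication.  This is
 * an illustration, not BQCD's actual implementation: the spinor type,
 * the packing of boundary sites and the buffer sizes are placeholders. */
#include <mpi.h>
#include <complex.h>

typedef struct { double complex c[12]; } spinor;   /* 3 colours x 4 spins */

void exchange_dim(MPI_Comm cart, int dim, spinor *send_lo, spinor *send_hi,
                  spinor *recv_lo, spinor *recv_hi, int nsurf)
{
    int lo, hi;                       /* ranks of the two neighbours */
    MPI_Cart_shift(cart, dim, 1, &lo, &hi);

    MPI_Request req[4];
    int nbytes = nsurf * (int)sizeof(spinor);

    /* receive the neighbours' surfaces into the local halo buffers */
    MPI_Irecv(recv_lo, nbytes, MPI_BYTE, lo, 0, cart, &req[0]);
    MPI_Irecv(recv_hi, nbytes, MPI_BYTE, hi, 1, cart, &req[1]);
    /* send the local boundary surfaces to the neighbours */
    MPI_Isend(send_hi, nbytes, MPI_BYTE, hi, 0, cart, &req[2]);
    MPI_Isend(send_lo, nbytes, MPI_BYTE, lo, 1, cart, &req[3]);

    /* interior sites could be processed here to overlap computation
     * and communication before waiting for the halos */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```

A production code typically performs such an exchange in all four lattice dimensions and packs the boundary sites of the even/odd sublattices into contiguous buffers before sending.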
Execution Environment
The first dynamical lattice QCD computations at LRZ became possible with the first national supercomputer at LRZ, the Hitachi SR8000-F1 machine HLRB-I, installed in 2000 [1,2]. Since then BQCD has been ported to several machines and architectures [3,4]. On the next national supercomputer at LRZ, the SGI Altix 4700 machine HLRB-II, BQCD has shown good scaling on up to 8K cores for both the pure MPI and the hybrid version of the code.
In this paper we discuss the scaling behaviour of BQCD on the current 3 Pflop/s SuperMUC system at LRZ. SuperMUC [5] serves as a Tier-0 system within PRACE. It is an IBM System x iDataPlex machine based on Intel processors and InfiniBand technology. SuperMUC consists of 18 thin node islands equipped with Intel Sandy Bridge processors and one fat node island with Intel Westmere-EX processors. The thin node islands are connected via a fully non-blocking Mellanox FDR-10 InfiniBand network. Each thin node island consists of 512 nodes, each with two 8-core Sandy Bridge-EP Intel Xeon E5-2680 processors, providing 8192 cores per island.
On SuperMUC we have investigated the scaling of an MPI only and a hybrid MPI + OpenMP version of the code. BQCD was built with the Intel ifort 12.1 Fortran compiler and Intel MPI 4.1. All scaling tests have been performed on the thin node islands of SuperMUC.
Scaling has been studied for two lattice sizes, 96³ × 192 for the MPI only version and 64³ × 96 for the hybrid MPI + OpenMP version. The lattice sizes have been chosen such that the domain decomposition and the model parameters fit the memory size of the system. Similar lattice sizes have been analysed on the Blue Gene/P machine during the Extreme Scaling Workshop 2010 in Jülich [4].
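To illustrate how the domain decomposition determines the local problem size, the following sketch computes the local lattice extents and the surface-to-volume ratio for a given decomposition of the 96³ × 192 lattice. The 8 × 8 × 8 × 16 decomposition used here is a hypothetical example, not necessarily one of the decompositions listed in Table 1.

```c
/* Hypothetical illustration: local lattice volume and surface-to-volume
 * ratio for a global lattice decomposed over proc[d] domains per
 * dimension.  The example decomposition below is an assumption, not a
 * value taken from Table 1. */
#include <stdio.h>

int main(void)
{
    const int glob[4] = {96, 96, 96, 192};   /* global 96^3 x 192 lattice */
    const int proc[4] = {8, 8, 8, 16};       /* example: 8192 domains */

    long volume = 1, surface = 0;
    int  loc[4];
    for (int d = 0; d < 4; ++d) {
        loc[d] = glob[d] / proc[d];          /* local extent per dimension */
        volume *= loc[d];
    }
    for (int d = 0; d < 4; ++d)
        surface += 2 * (volume / loc[d]);    /* two faces per dimension */

    printf("local lattice: %d x %d x %d x %d, volume %ld sites\n",
           loc[0], loc[1], loc[2], loc[3], volume);
    printf("surface sites: %ld, surface/volume = %.2f\n",
           surface, (double)surface / volume);
    return 0;
}
```

Doubling the number of domains in one dimension halves the local volume but only reduces the surface in that dimension, so the surface-to-volume ratio grows as the strong-scaling limit is approached.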
Results
MPI only Version of BQCD for a 96³ × 192 Lattice
Performance results for the MPI only version of the code are summarised in Table 1. Figure 1 shows the strong scaling of BQCD on up to 16K cores for a lattice size of 96³ × 192. Unfortunately, we were unable to acquire data for 32K, 64K and 128K cores, because the application was either crashing or running extremely slowly for reasons that are still unknown. Within one thin node island of SuperMUC, i.e. for up to 8K cores, the scaling of BQCD is almost linear. For 16K cores (2 full islands) the overall performance drops significantly below linear.
Table 1
Lattice decomposition, total time, mean performance per core and overall performance of the conjugate gradient solver of BQCD for a 96³ × 192 lattice using the MPI only version.
Figure 1
Strong scaling of the conjugate gradient solver of BQCD using the MPI only version of the code for a 96³ × 192 lattice. Subfigure (a) shows the total time spent within the solver and (b) the overall performance as a function of the total number of cores. The straight dotted line indicates linear scaling.
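For reference, the comparison with linear scaling in Figure 1 amounts to computing the strong-scaling speed-up and parallel efficiency relative to the smallest core count measured. A minimal helper is sketched below; the core counts and timings are placeholders, not the measured values from Table 1.

```c
/* Strong-scaling speed-up and parallel efficiency relative to a
 * baseline run.  The core counts and timings below are placeholders,
 * not the measured values of Table 1. */
#include <stdio.h>

int main(void)
{
    const int    cores[]  = {1024, 2048, 4096, 8192, 16384};  /* assumed */
    const double time_s[] = {100.0, 50.5, 25.8, 13.2, 8.9};   /* assumed */
    const int    n = 5;

    for (int i = 0; i < n; ++i) {
        /* speed-up relative to the smallest run; efficiency is the
         * speed-up divided by the ideal (linear) speed-up */
        double speedup    = time_s[0] / time_s[i];
        double ideal      = (double)cores[i] / cores[0];
        double efficiency = speedup / ideal;
        printf("%6d cores: speed-up %6.2f, efficiency %5.1f %%\n",
               cores[i], speedup, efficiency * 100.0);
    }
    return 0;
}
```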
Hybrid MPI + OpenMP Version of BQCD for a 64³ × 96 Lattice
Performance results for the hybrid MPI + OpenMP version of the code are summarised in Table 2. Each MPI task executes 8 OpenMP threads. The plots in Figure 2 show that the hybrid version scales well within 2 islands (16384 cores). Negative scaling is observed from 64K to 128K cores. This performance decrease with the hybrid version of BQCD is due to the small local lattice sizes: the data from the relatively large surface of the small domains has to be communicated to the neighbouring processes, which challenges the communication network for large job sizes.
Table 2
Figure 2
Strong scaling of the conjugate gradient solver of BQCD using the hybrid MPI + OpenMP version of the code for a 64³ × 96 lattice. 8 OpenMP threads are executed per MPI task. Subfigure (a) shows the total time spent within the solver and (b) the overall performance as a function of the total number of cores. The dotted straight line indicates linear scaling.
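The structure of such a hybrid run can be sketched as follows: MPI decomposes the lattice across tasks, while each task's 8 OpenMP threads share the loop over the local lattice sites. This is a schematic illustration only; the names, the loop body and the local lattice size are placeholders, not BQCD's actual Fortran code.

```c
/* Schematic hybrid MPI + OpenMP structure of the solver kernel:
 * MPI decomposes the lattice into domains, OpenMP threads share the
 * loop over the local lattice sites.  Names, loop body and the local
 * lattice size are placeholders, not BQCD's actual Fortran code. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the master thread issues MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const long local_sites = 12L * 12 * 12 * 12;   /* assumed local lattice */
    double local_sum = 0.0, global_sum = 0.0;

    /* exchange_halos(...);  -- pure MPI, as in the earlier sketch */

    #pragma omp parallel for reduction(+ : local_sum)
    for (long s = 0; s < local_sites; ++s) {
        /* placeholder for applying the hopping matrix at site s */
        local_sum += 1.0;
    }

    /* global reduction, e.g. for the scalar products of the CG solver */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("processed %.0f sites in total\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Running such a binary with 8 threads per task (e.g. OMP_NUM_THREADS=8) reduces the number of MPI tasks, and hence the number of communicating domains, by a factor of eight compared with the MPI only version, at the price of shared-memory overheads within a node.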
Conclusion
We have shown that BQCD scales well on the SuperMUC machine at LRZ with both the MPI only and the hybrid MPI + OpenMP versions of the code within one (8K cores) or two (16K cores) thin node islands, respectively. The scaling properties of BQCD strongly depend on the local lattice decomposition, and the optimal values are highly dependent on the architecture of the HPC system used. It is important to tune the local lattice sizes to achieve optimal scaling performance and ensure efficient communication between the processors. There appear to be underlying issues which cause BQCD to crash or run extremely slowly for very large numbers of MPI tasks; this is not yet understood and requires further analysis. To be able to better compare the results obtained with the two versions of the code with each other, and also with results from other HPC systems, further runs at various lattice sizes will be performed in the future. More investigation of the weak scaling behaviour of BQCD on SuperMUC is also needed.