Proceedings of the 11th ACM Conference on Computing Frontiers - CF '14, 2014
Understanding workload behavior plays an important role in performance studies. The growing complexity of applications and architectures has widened the gap among application developers, performance engineers, and hardware designers. To reduce this gap, we propose SKOPE, a SKeleton framework for Performance Exploration, which produces a descriptive model of the semantic behavior of a workload. The model can be used to infer potential transformations and helps users understand how workloads may interact with and adapt to emerging hardware. SKOPE models can be shared, annotated, and studied by a community of performance engineers and system designers; they offer readability in the frontend and versatility in the backend. SKOPE can be used for performance analysis, tuning, and projection. We provide two example use cases. First, we project GPU performance from CPU code without GPU programming or access to the hardware, automatically exploring transformations; the projected best-achievable performance deviates from the measured performance by 18% on average. Second, we project the multi-node scaling trends of two scientific workloads and achieve a projection accuracy of 95%.
2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
Remarkable observational advances have established a compelling cross-validated model of the Universe. Yet two key pillars of this model, dark matter and dark energy, remain mysterious. Sky surveys that map billions of galaxies to explore the 'Dark Universe' demand a corresponding extreme-scale simulation capability; the HACC (Hybrid/Hardware Accelerated Cosmology Code) framework has been designed to deliver this level of performance now and into the future. With its novel algorithmic structure, HACC allows flexible tuning across diverse architectures, including accelerated and multi-core systems.
A varied collection of scientific and engineering codes has been adapted and enhanced to take advantage of the IBM Blue Gene®/Q architecture and thus enable research that was previously out of reach. Computational research teams from a number of disciplines collaborated with the staff of the Argonne Leadership Computing Facility to assess which of Blue Gene/Q's many novel features could be exploited for each application to equip it to tackle existing problem classes with greater fidelity and in some cases to address new phenomena. The quad floating-point units and the five-dimensional torus interconnect are among the features that were demonstrated to be effective for a number of important applications. Furthermore, data obtained from the hardware counters provided insights that were valuable in guiding the code modifications. Hardware features and programming techniques that were effective across multiple codes are documented as well. First, we have confirmed that there is no significant code rewrite needed to run today's production codes with good performance on Mira, an IBM Blue Gene/Q supercomputer. Performance improvements are already demonstrated, even though our measurements are all on pre-production software and hardware. The application domains included biology, materials science, combustion, chemistry, nuclear physics, and industrial-scale design of nuclear reactors, jet engines, and the efficiency of transportation systems.
2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013
The Argonne Leadership Computing Facility (ALCF) is home to Mira, a 10 PF Blue Gene/Q (BG/Q) system. The BG/Q system is the third generation of the Blue Gene architecture from IBM and, like its predecessors, combines system-on-chip technology with a proprietary interconnect (5-D torus). Each compute node has 16 augmented PowerPC A2 processor cores with support for simultaneous multithreading, 4-wide double-precision SIMD, and different data prefetching mechanisms. Mira offers several new opportunities for tuning and scaling scientific applications. This paper discusses our early experience with a subset of micro-benchmarks, MPI benchmarks, and a variety of science and engineering applications running at ALCF. Both performance and power are studied, and results on BG/Q are compared with those on its predecessor, BG/P. Several lessons gleaned from tuning applications on the BG/Q architecture for better performance and scalability are shared.
Papers by K. Kumaran