Skip to main content

J. Shalf

Followers

25

Following

25

Co-authors

21

Public Views

Interests

Uploads

Papers by J. Shalf

Cactus Framework: Black Holes to Gamma Ray Bursts

ArXiv, 2007

Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin.... more Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop scale computing platforms in the near future. Petascale computing will require additional ingredients over conventional parallelism. We consider some of the challenges which will be caused by future petascale architectures, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges in order to profit from these new architectures.

The roofline model: A pedagogical tool for program analysis and optimization

2008 IEEE Hot Chips 20 Symposium (HCS), 2008

 Motivation, Goals, Audience, etc…  Survey of multicore architectures  Description of the Roof... more  Motivation, Goals, Audience, etc…  Survey of multicore architectures  Description of the Roofline model  Introduction to Auto-tuning  Application of the roofline to auto-tuned kernels  Example #1-SpMV  Example #2-LBMHD  Conclusions 3 EECS Electrical Engineering and Computer Sciences BERKELEY PAR LAB Motivation  Multicore guarantees neither good scalability nor good (attained) performance  Performance and scalability can be extremely non-intuitive even to computer scientists  Success of the multicore paradigm seems to be premised upon their programmability  To that end, one must understand the limits to both scalability and efficiency.

Enabling Applications on the Grid: A Gridlab Overview

The International Journal of High Performance Computing Applications, 2003

Grid technology is widely emerging. Still, there is an eminent shortage of real Grid users, mostl... more Grid technology is widely emerging. Still, there is an eminent shortage of real Grid users, mostly due to the lack of a “critical mass” of widely deployed and reliable higher-level Grid services, tailored to application needs. The GridLab project aims to provide fundamentally new capabilities for applications to exploit the power of Grid computing, thus bridging the gap between application needs and existing Grid middleware. We present an overview of GridLab, a large-scale, EU-funded Grid project spanning over a dozen groups in Europe and the US. We first outline our vision of Grid-empowered applications and then discuss GridLab’s general architecture and its Grid Application Toolkit (GAT). We illustrate how applications can be Grid-enabled with the GAT and discuss GridLab’s scheduler as an example of GAT services.

Tiling as a Durable Abstraction for Parallelism and Data Locality

Tiling as a Durable Abstraction for Parallelism and Data Locality Didem Unat Cy Chan Weiqun Zhang... more Tiling as a Durable Abstraction for Parallelism and Data Locality Didem Unat Cy Chan Weiqun Zhang John Bell John Shalf Lawrence Berkeley National Laboratory 1 Cyclotron Rd, Berkeley, California, USA 94720 dunat, cychan, weiqunzhang, jbbell, jshalf @lbl.gov Abstract—Tiling is a useful loop transformation for expressing parallelism and data locality. Automated tiling transformations that preserve data-locality are increasingly important due to hardware trends towards massive parallelism and the increasing costs of data movement relative to the cost of computing. We propose TiDA as a durable tiling abstraction that centralizes parameterized tiling information within array data types with minimal changes to the source code. The data layout information can be used by the compiler and runtime to automatically manage parallelism, optimize data locality, and schedule tasks intelligently. In this paper, we present the design features and early interface of TiDA along with some preliminary re...

Software Design Space Exploration for Exascale Combustion Co-design

Lecture Notes in Computer Science, 2013

The design of hardware for next-generation exascale computing systems will require a deep underst... more The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.

Performance Modeling for 3D Visualization in a Heterogeneous Computing Environment

The visualization of large, remotely located data sets necessitates the development of a distribu... more The visualization of large, remotely located data sets necessitates the development of a distributed computing pipeline in order to reduce the data, in stages, to a manageable size. The required baseline infrastructure for launching such a distributed pipeline is becoming available, but few services support even marginally optimal resource selection and partitioning of the data analysis workflow. We explore a methodology for building a model of overall application performance using a composition of the analytic models of individual components that comprise the pipeline. The analytic models are shown to be accurate on a testbed of distributed heterogeneous systems. The prediction methodology will form the foundation of a more robust resource management service for future Grid-based visualization applications.

A Power Efficient Exaflop Computer Design for Global Cloud System Resolving Climate Models

Exascale computers would allow routine ensemble modeling of the global climate system at the clou... more Exascale computers would allow routine ensemble modeling of the global climate system at the cloud system resolving scale. Power and cost requirements of traditional architecture systems are likely to delay such capability for many years. We present an alternative route to the exascale using embedded processor technology to design a system optimized for ultra high resolution climate modeling. These power

The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment

The International Journal of High Performance Computing Applications, 2001

The ability to harness heterogeneous, dynamically available grid resources is attractive to typic... more The ability to harness heterogeneous, dynamically available grid resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate effectively in grid environments. To explore some of these issues in a practical setting, the authors are developing an experimental framework, called Cactus, that incorporates both adaptive application structures for dealing with changing resource characteristics and adaptive resource selection mechanisms that allow applications to change their resource allocations (e.g., via migration) when performance falls outside specified limits. The authors describe the adaptive resource selection mechanisms and describe how they are used to achieve automatic application migration to “better” resources foll...

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, 2011

Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture

Lecture Notes in Computer Science, 2009

Scientific Computing Kernels on the Cell Processor

International Journal of Parallel Programming, 2007

Software Roadmap to Plug and Play Petaflop/s

Hardware/software co-design of global cloud system resolving models

Journal of Advances in Modeling Earth Systems, 2011

We present an analysis of the performance aspects of an atmospheric general circulation model at ... more We present an analysis of the performance aspects of an atmospheric general circulation model at the ultra-high resolution required to resolve individual cloud systems and describe alternative technological paths to realize the integration of such a model in the relatively near future. Due to a superlinear scaling of the computational burden dictated by the Courant stability criterion, the solution of the equations of motion dominate the calculation at these ultra-high resolutions. From this extrapolation, it is estimated that a credible kilometer scale atmospheric model would require a sustained computational rate of at least 28 Petaflop/s to provide scientifically useful climate simulations. Our design study portends an alternate strategy for practical power-efficient implementations of next-generation ultra-scale systems. We demonstrate that hardware/software co-design of low-power embedded processor technology could be exploited to design a custom machine tailored to ultra-high resolution climate model specifications at relatively affordable cost and power considerations. A strawman machine design is presented consisting of in excess of 20 million processing elements that effectively exploits forthcoming many-core chips. The system pushes the limits of domain decomposition to increase explicit parallelism, and suggests that functional partitioning of sub-components of the climate code (much like the coarse-grained partitioning of computation between the atmospheric, ocean, land, and ice components of current coupled models) may be necessary for future performance scaling.

The Astrophysics Simulation Collaboratory Portal: a framework for effective distributed research

Future Generation Computer Systems, 2005

Grid Portals, based on standard web technologies, are emerging as important and useful user inter... more Grid Portals, based on standard web technologies, are emerging as important and useful user interfaces to computational and data Grids. Grid Portals enable Virtual Organizations, comprised of distributed researchers to collaborate and access resources more efficiently and seamlessly. The Astrophysics Simulation Collaboratory (ASC) Grid Portal provides a framework to enable researchers in the field of numerical relativity to study astrophysical phenomenon by making use of the Cactus computational toolkit. We examine user requirements and describe the design and implementation of the ASC Grid Portal.

Solving Einstein's equations on supercomputers

Computer, 1999

Global Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms

... Robert Preissl John Shalf Alice Koniges Lawrence Berkeley National Laboratory {rpreissl,jshal... more

The International Exascale Software Project roadmap

The International Journal of High Performance Computing Applications, 2011

Over the last 20 years, the open-source community has provided more and more software on which th... more Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/ exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics proces...

Auto-tuning Stencil Computations on Multicore and Accelerators

Cactus as benchmarking platform

Vector and Parallel Processing��VECPAR��2002, 5th Int. Conf

Cactus Framework: Black Holes to Gamma Ray Bursts

ArXiv, 2007

Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin.... more Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop scale computing platforms in the near future. Petascale computing will require additional ingredients over conventional parallelism. We consider some of the challenges which will be caused by future petascale architectures, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges in order to profit from these new architectures.

The roofline model: A pedagogical tool for program analysis and optimization

2008 IEEE Hot Chips 20 Symposium (HCS), 2008

 Motivation, Goals, Audience, etc…  Survey of multicore architectures  Description of the Roof... more  Motivation, Goals, Audience, etc…  Survey of multicore architectures  Description of the Roofline model  Introduction to Auto-tuning  Application of the roofline to auto-tuned kernels  Example #1-SpMV  Example #2-LBMHD  Conclusions 3 EECS Electrical Engineering and Computer Sciences BERKELEY PAR LAB Motivation  Multicore guarantees neither good scalability nor good (attained) performance  Performance and scalability can be extremely non-intuitive even to computer scientists  Success of the multicore paradigm seems to be premised upon their programmability  To that end, one must understand the limits to both scalability and efficiency.

Enabling Applications on the Grid: A Gridlab Overview

The International Journal of High Performance Computing Applications, 2003

Grid technology is widely emerging. Still, there is an eminent shortage of real Grid users, mostl... more Grid technology is widely emerging. Still, there is an eminent shortage of real Grid users, mostly due to the lack of a “critical mass” of widely deployed and reliable higher-level Grid services, tailored to application needs. The GridLab project aims to provide fundamentally new capabilities for applications to exploit the power of Grid computing, thus bridging the gap between application needs and existing Grid middleware. We present an overview of GridLab, a large-scale, EU-funded Grid project spanning over a dozen groups in Europe and the US. We first outline our vision of Grid-empowered applications and then discuss GridLab’s general architecture and its Grid Application Toolkit (GAT). We illustrate how applications can be Grid-enabled with the GAT and discuss GridLab’s scheduler as an example of GAT services.

Tiling as a Durable Abstraction for Parallelism and Data Locality

Tiling as a Durable Abstraction for Parallelism and Data Locality Didem Unat Cy Chan Weiqun Zhang... more Tiling as a Durable Abstraction for Parallelism and Data Locality Didem Unat Cy Chan Weiqun Zhang John Bell John Shalf Lawrence Berkeley National Laboratory 1 Cyclotron Rd, Berkeley, California, USA 94720 dunat, cychan, weiqunzhang, jbbell, jshalf @lbl.gov Abstract—Tiling is a useful loop transformation for expressing parallelism and data locality. Automated tiling transformations that preserve data-locality are increasingly important due to hardware trends towards massive parallelism and the increasing costs of data movement relative to the cost of computing. We propose TiDA as a durable tiling abstraction that centralizes parameterized tiling information within array data types with minimal changes to the source code. The data layout information can be used by the compiler and runtime to automatically manage parallelism, optimize data locality, and schedule tasks intelligently. In this paper, we present the design features and early interface of TiDA along with some preliminary re...

Software Design Space Exploration for Exascale Combustion Co-design

Lecture Notes in Computer Science, 2013

The design of hardware for next-generation exascale computing systems will require a deep underst... more The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.

Performance Modeling for 3D Visualization in a Heterogeneous Computing Environment

The visualization of large, remotely located data sets necessitates the development of a distribu... more The visualization of large, remotely located data sets necessitates the development of a distributed computing pipeline in order to reduce the data, in stages, to a manageable size. The required baseline infrastructure for launching such a distributed pipeline is becoming available, but few services support even marginally optimal resource selection and partitioning of the data analysis workflow. We explore a methodology for building a model of overall application performance using a composition of the analytic models of individual components that comprise the pipeline. The analytic models are shown to be accurate on a testbed of distributed heterogeneous systems. The prediction methodology will form the foundation of a more robust resource management service for future Grid-based visualization applications.

A Power Efficient Exaflop Computer Design for Global Cloud System Resolving Climate Models

Exascale computers would allow routine ensemble modeling of the global climate system at the clou... more Exascale computers would allow routine ensemble modeling of the global climate system at the cloud system resolving scale. Power and cost requirements of traditional architecture systems are likely to delay such capability for many years. We present an alternative route to the exascale using embedded processor technology to design a system optimized for ultra high resolution climate modeling. These power

The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment

The International Journal of High Performance Computing Applications, 2001

The ability to harness heterogeneous, dynamically available grid resources is attractive to typic... more The ability to harness heterogeneous, dynamically available grid resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate effectively in grid environments. To explore some of these issues in a practical setting, the authors are developing an experimental framework, called Cactus, that incorporates both adaptive application structures for dealing with changing resource characteristics and adaptive resource selection mechanisms that allow applications to change their resource allocations (e.g., via migration) when performance falls outside specified limits. The authors describe the adaptive resource selection mechanisms and describe how they are used to achieve automatic application migration to “better” resources foll...

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, 2011

Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture

Lecture Notes in Computer Science, 2009

Scientific Computing Kernels on the Cell Processor

International Journal of Parallel Programming, 2007

Software Roadmap to Plug and Play Petaflop/s

Hardware/software co-design of global cloud system resolving models

Journal of Advances in Modeling Earth Systems, 2011

We present an analysis of the performance aspects of an atmospheric general circulation model at ... more We present an analysis of the performance aspects of an atmospheric general circulation model at the ultra-high resolution required to resolve individual cloud systems and describe alternative technological paths to realize the integration of such a model in the relatively near future. Due to a superlinear scaling of the computational burden dictated by the Courant stability criterion, the solution of the equations of motion dominate the calculation at these ultra-high resolutions. From this extrapolation, it is estimated that a credible kilometer scale atmospheric model would require a sustained computational rate of at least 28 Petaflop/s to provide scientifically useful climate simulations. Our design study portends an alternate strategy for practical power-efficient implementations of next-generation ultra-scale systems. We demonstrate that hardware/software co-design of low-power embedded processor technology could be exploited to design a custom machine tailored to ultra-high resolution climate model specifications at relatively affordable cost and power considerations. A strawman machine design is presented consisting of in excess of 20 million processing elements that effectively exploits forthcoming many-core chips. The system pushes the limits of domain decomposition to increase explicit parallelism, and suggests that functional partitioning of sub-components of the climate code (much like the coarse-grained partitioning of computation between the atmospheric, ocean, land, and ice components of current coupled models) may be necessary for future performance scaling.

The Astrophysics Simulation Collaboratory Portal: a framework for effective distributed research

Future Generation Computer Systems, 2005

Grid Portals, based on standard web technologies, are emerging as important and useful user inter... more Grid Portals, based on standard web technologies, are emerging as important and useful user interfaces to computational and data Grids. Grid Portals enable Virtual Organizations, comprised of distributed researchers to collaborate and access resources more efficiently and seamlessly. The Astrophysics Simulation Collaboratory (ASC) Grid Portal provides a framework to enable researchers in the field of numerical relativity to study astrophysical phenomenon by making use of the Cactus computational toolkit. We examine user requirements and describe the design and implementation of the ASC Grid Portal.

Solving Einstein's equations on supercomputers

Computer, 1999

Global Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms

... Robert Preissl John Shalf Alice Koniges Lawrence Berkeley National Laboratory {rpreissl,jshal... more

The International Exascale Software Project roadmap

The International Journal of High Performance Computing Applications, 2011

Over the last 20 years, the open-source community has provided more and more software on which th... more Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/ exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics proces...

Auto-tuning Stencil Computations on Multicore and Accelerators

Cactus as benchmarking platform

Vector and Parallel Processing��VECPAR��2002, 5th Int. Conf