Oscar Hernandez

Papers by Oscar Hernandez

Dragon: A Static and Dynamic Tool for OpenMP

Lecture Notes in Computer Science, 2005

A program analysis tool can play an important role in helping users understand and improve OpenMP codes. Dragon is a robust interactive program analysis tool based on the Open64 compiler, an open source OpenMP, C/C++/Fortran77/90 compiler for Intel Itanium systems. We developed the Dragon tool on top of Open64 to exploit its powerful analyses in order to provide static as well as dynamic (feedback-based) information which can be used to develop or optimize OpenMP codes. Dragon enables users to visualize and print essential program structures and obtain runtime information on their applications. Current features include static/dynamic call graphs and control flow graphs, data dependence analysis, and interprocedural array region summaries, which help users understand procedure side effects within parallel loops. Ongoing work extends Dragon to display data access patterns at runtime and to provide support for runtime instrumentation and optimizations.

Analyzing the Energy and Power Consumption of Remote Memory Accesses in the OpenSHMEM Model

Lecture Notes in Computer Science, 2014

PGAS models like OpenSHMEM provide interfaces to explicitly initiate one-sided remote memory accesses among processes. In addition, the model provides synchronizing barriers to ensure a consistent view of the distributed memory at different phases of an application. The incorrect use of such interfaces affects the scalability achievable while using a parallel programming model. This study aims at understanding the effects of these constructs on the energy and power consumption behavior of OpenSHMEM applications. Our experiments show that the cost incurred in terms of the total energy and power consumed depends on multiple factors across the software and hardware stack. We conclude that there is a significant impact on the power consumed by the CPU and DRAM due to multiple factors including the design of the data transfer patterns within an application, the design of the communication protocols within a middleware, the architectural constraints imposed by the interconnect solutions, and also the levels of memory hierarchy within a compute node. This work motivates treating energy and power consumption as important factors while designing compute solutions for current and future distributed systems. Recent studies on the challenges facing the Exascale era express a need for understanding the energy profile of applications that depend on inter-process communication on large-scale systems. The amount of energy consumed due to data movement poses a serious threat to the usability of distributed memory models on future systems. One-sided communication in PGAS models is analogous to memory accesses in shared-memory models. However, its impact on performance and power consumption is different.

A component infrastructure for performance and power modeling of parallel scientific applications

Proceedings of the 2008 compFrame/HPC-GECO workshop on Component based high performance, 2008

Characterizing the performance of scientific applications is essential for effective code optimization, both by compilers and by high-level adaptive numerical algorithms. While maximizing power efficiency is becoming increasingly important in current high-performance architectures, there is little or no hardware or software support for detailed power measurements. Hardware counter-based power models are a promising method for guiding software-based techniques for reducing power. We present a component-based infrastructure for performance and power modeling of parallel scientific applications. The power model leverages on-chip performance hardware counters and is designed to model power consumption for modern multiprocessor/multicore systems. Our tool infrastructure includes application components as well as performance/power measurement and analysis components. We collect performance data using the TAU performance component and apply the power model in the performance and power analysis of a PETSc-based parallel fluid dynamics application by using the PerfExplorer component.
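
A hardware counter-based power model of the kind this abstract mentions can be sketched as a linear function of counter rates. The abstract does not state the form of the model, so the linear form, the function name `estimate_power`, the event names, and the coefficient values below are all assumptions chosen for illustration, not the authors' model.

```python
def estimate_power(idle_watts, weights, counter_rates):
    """Estimate power as P = P_idle + sum_i(w_i * r_i).

    idle_watts:    assumed baseline power in watts
    weights:       assumed per-event coefficients (watts per event/second)
    counter_rates: measured event rates (events/second), keyed by event name
    """
    missing = set(weights) - set(counter_rates)
    if missing:
        raise KeyError(f"no rate measured for: {sorted(missing)}")
    return idle_watts + sum(w * counter_rates[name] for name, w in weights.items())


if __name__ == "__main__":
    # Placeholder coefficients and rates; real values would come from
    # calibrating against measured power on a specific machine.
    weights = {"instructions": 2.0e-9, "l2_misses": 5.0e-8}
    rates = {"instructions": 1.5e9, "l2_misses": 2.0e7}
    print(estimate_power(40.0, weights, rates))
```

In a sketch like this, the coefficients would typically be fitted by regression against power readings, which is why the model is only as good as the calibration data behind it.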

Power Consumption Due to Data Movement in Distributed Programming Models

Lecture Notes in Computer Science, 2014

The amount of energy consumed due to data movement poses a serious challenge when implementing and using distributed programming models. Message-passing models like MPI provide the user with explicit interfaces to initiate data transfers among distributed processes. In this work, we establish the notion that, from a programmer's standpoint, design decisions like the size of the data payload to be transferred and the number of explicit MPI calls to service such transfers have a direct impact on the power signatures of communication kernels. On closer inspection, we additionally observe that the choice of the transport layer (along with the associated interconnect) and the design of the data transfer protocol both contribute to these signatures. This paper presents a fine-grained study of the power and energy consumption due to data movement in distributed programming models. We hope that the results discussed in this work will motivate application and system programmers to include energy consumption as one of the important design factors while targeting HPC systems.

Scalability Evaluation of Barrier Algorithms for OpenMP

Lecture Notes in Computer Science, 2009

OpenMP relies heavily on barrier synchronization to coordinate the work of threads that are performing the computations in a parallel region. A good implementation of barriers is thus an important part of any implementation of this API. As the number of cores in shared and distributed shared memory machines continues to grow, the quality of the barrier implementation is critical for application scalability. There are a number of known algorithms for providing barriers in software. In this paper, we consider some of the most widely used approaches for implementing barriers on large-scale shared-memory multiprocessor systems: a "blocking" implementation that de-schedules a waiting thread, a "centralized" busy wait, and three forms of distributed "busy" wait. We have implemented the barrier algorithms in the runtime library associated with a research compiler, OpenUH. We first compare the impact of these algorithms on the overheads incurred for OpenMP constructs that involve a barrier, possibly implicitly. We then show how the different barrier implementations influence the performance of two different OpenMP application codes.
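
To illustrate the "centralized" busy-wait approach named in this abstract, here is a minimal sketch of a centralized sense-reversing barrier. It is not the OpenUH runtime code: the abstract does not say which centralized variant was implemented, so sense reversal is an assumption, and the names `CentralizedBarrier` and `run_demo` are hypothetical. Python threads and a lock stand in for native threads and an atomic decrement.

```python
import threading
import time


class CentralizedBarrier:
    """Centralized sense-reversing barrier: one shared counter, one shared flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads        # threads still expected in this episode
        self.sense = False              # shared flag, flipped once per episode
        self.lock = threading.Lock()    # stands in for an atomic decrement
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on entry.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                # The last arriver resets the counter, then releases the others.
                self.count = self.num_threads
                self.sense = my_sense
        if not last:
            # Busy wait on the single shared flag.
            while self.sense != my_sense:
                time.sleep(0)  # yield so other Python threads can run


def run_demo(num_threads=4, phases=3):
    """Run workers through several barrier episodes and record any violations."""
    barrier = CentralizedBarrier(num_threads)
    arrived = [0] * phases
    violations = []
    lock = threading.Lock()

    def worker():
        for p in range(phases):
            with lock:
                arrived[p] += 1
            barrier.wait()
            # Past the barrier, every thread must already have arrived in phase p.
            if arrived[p] != num_threads:
                violations.append(p)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, violations


if __name__ == "__main__":
    print(run_demo())
```

Because every waiting thread spins on the same flag and decrements the same counter, contention on those two locations grows with the thread count, which is the usual motivation for the distributed variants.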

Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications

Lecture Notes in Computer Science, 2008

This article describes how the integration of the OpenUH OpenMP compiler with the KOJAK performance analysis tool can assist developers of OpenMP and hybrid codes in optimizing their applications with as little user intervention as possible. In particular, we (i) describe how the compiler's ability to automatically instrument user code down to the flow-graph level can improve the location of performance problems and (ii) outline how the performance feedback provided by KOJAK will direct the compiler's optimization decisions in the future. To demonstrate our methodology, we present experimental results showing how the reasons for the performance slowdown of the ASPCG benchmark could be identified.

Open Source Software Support for the OpenMP Runtime API for Profiling

2009 International Conference on Parallel Processing Workshops, 2009

OpenMP is a de facto standard API for shared-memory programming with widespread vendor support and a large user base. The OpenMP Architecture Review Board has sanctioned an interface specification known as the "OpenMP Runtime API for Profiling" to enable tools to collect performance data for OpenMP programs. This paper describes the interface and our experiences implementing it in OpenUH, an open source OpenMP compiler.

Dragon: an Open64-based interactive program analysis tool for large applications

Proceedings of the 8th International Scientific and Practical Conference of Students, Post-graduates and Young Scientists. Modern Technique and Technologies. MTT'2002 (Cat. No.02EX550)

A program analysis tool can play an important role in helping users understand and improve large application codes. Dragon is a robust interactive program analysis tool based on the Open64 compiler, which is an open source C/C++/Fortran77/90 compiler for Intel Itanium systems. We designed and developed the Dragon analysis tool to support manual optimization and parallelization of large applications by

Extending the OpenSHMEM Memory Model to Support User-Defined Spaces

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

OpenSHMEM is an open standard for SHMEM libraries. With the standardisation process complete, the community is looking towards extending the API for increasing programmer flexibility and extreme scalability. According to the current OpenSHMEM specification (revision 1.1), allocation of symmetric memory is collective across all PEs executing the application. For better work sharing and memory utilisation, we are proposing the concepts of teams and spaces for OpenSHMEM that together allow allocation of memory only across user-specified teams. Through our implementation we show that by using teams we can confine memory allocation and usage to only the PEs that actually communicate via symmetric memory. We provide preliminary results demonstrating that creating spaces for teams consumes less memory than the current alternative. We also examine the impact of our extensions on Scalable Synthetic Compact Applications #3 (SSCA3), which is a sensor processing and knowledge formation kernel involving file I/O, and show that up to 30% of symmetric memory allocation can be eliminated without affecting the correctness of the benchmark.

Zero-order suppression for two-photon holographic excitation

Optics Letters, 2014

Wavefront shaping with liquid crystal spatial light modulators (LC-SLMs) is frequently hindered by a remaining fraction of undiffracted light, the so-called "zero-order". This contribution is all the more detrimental in configurations for which the LC-SLM is Fourier conjugated to a sample by a lens, because in these cases this undiffracted light produces a diffraction-limited spot at the image focal plane. In this work we propose to minimize two-photon excitation of the sample resulting from this unmodulated light by introducing optical aberrations to the

When can temporally focused excitation be axially shifted by dispersion?

Optics Express, 2014

Temporal focusing (TF) allows for axially confined wide-field multi-photon excitation at the temporal focal plane. For temporally focused Gaussian beams, it was shown both theoretically and experimentally that the temporal focus plane can be shifted by applying a quadratic spectral phase to the incident beam. However, the case for more complex wavefronts is quite different. Here we study the temporal focus plane shift (TFS) for a broader class of excitation profiles, with particular emphasis on the case of temporally focused computer-generated holography (CGH), which allows for generation of arbitrary, yet speckled, 2D patterns. We present an analytical, numerical and experimental study of this phenomenon. The TFS is found to depend mainly on the autocorrelation of the CGH pattern in the direction of the beam dispersion after the grating in the TF setup. This provides a pathway for 3D control of multi-photon excitation patterns.

Cougar: Interactive Tool for Cluster Computing

Cougar Compiler is a tool designed to help the programmer understand the structure of a sequential or parallel Fortran program. We support the de facto standards OpenMP and MPI, as well as the mixed mode OpenMP/MPI model, which can be used to write programs for execution on SMP clusters. The user may query the system interactively and view the results obtained by our program analysis via a graphical interface. This analysis includes up-to-date dependence tests, array section analysis and parallel dataflow analysis. In addition to representing a program's structure, Cougar is able to automatically generate OpenMP code and assist in its optimization, as well as check for common parallel programming errors such as race conditions. We plan to make Cougar available to the community.

Web Application for Oceanographic Vessels (Aplicación Web para Buques Oceanográficos)
