Skip to main content

Biagio Cosenza

University of Salerno, Dipartimento di Informatica, Faculty Member

Followers

5

Following

1

Public Views

Uploads

Papers by Biagio Cosenza

Static optimization in PHP 7

PHP is a dynamically typed programming language commonly used for the server-side implementation ... more PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it hard to optimize statically. This paper reports on the implementation of purely static bytecode optimizations for PHP 7, the last major version of PHP. We discuss the challenge of integrating classical compiler optimizations, which have been developed in the context of statically-typed languages, into a programming language that is dynamically and weakly typed, and supports a plethora of dynamic language features. Based on a careful analysis of language semantics, we adapt static single assignment (SSA) form for use in PHP. Combined with type inference, this allows type-based specializatio...

Cost Modelling for Vectorization on ARM

2018 IEEE International Conference on Cluster Computing (CLUSTER)

When applying a code transformation to optimize for performance, compilers need to assess its pro... more When applying a code transformation to optimize for performance, compilers need to assess its profitability beforehand. For this purpose, they utilize cost models, which compare the cost, an abstract measure of the code, before and after the transformation. If the cost is lower after the transformation, it will be applied. Exact cost modelling is therefore critical to avoid slowdowns or missed opportunities for speedups. In this work, we analyze the accuracy of LLVM's loop-level vectorization (LLV) cost model, and show the benefit of modelling speedup instead of instruction costs for higher vectorization rates and smaller execution times. The presented approach is portable to other compilers and hardwares as well.

Static optimization in PHP 7

Proceedings of the 26th International Conference on Compiler Construction

PHP is a dynamically typed programming language commonly used for the server-side implementation ... more PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it hard to optimize statically. This paper reports on the implementation of purely static bytecode optimizations for PHP 7, the last major version of PHP. We discuss the challenge of integrating classical compiler optimizations, which have been developed in the context of statically-typed languages, into a programming language that is dynamically and weakly typed, and supports a plethora of dynamic language features. Based on a careful analysis of language semantics, we adapt static single assignment (SSA) form for use in PHP. Combined with type inference, this allows type-based specialization of instructions, as well as the application of various classical SSA-enabled compiler optimizations such as constant propagation or dead code elimination. We evaluate the impact of the proposed static optimizations on a wide collection of programs, including micro-benchmarks, libraries and web frameworks. Despite the dynamic nature of PHP, our approach achieves an average speedup of 50% on microbenchmarks, 13% on computationally intensive libraries, as well as 1.1% (MediaWiki) and 3.5% (WordPress) on web applications.

Experiences with Mesh-like computations using Prediction Binary Trees

Scalable Computing : Practice and Experience, 2009

Evaluation of Adaptive Subivision Schemas for Parallel Ray Tracing

Ray tracing algorithm is widely used for rendering images aiming at an high realism. Speeding up ... more Ray tracing algorithm is widely used for rendering images aiming at an high realism. Speeding up ray tracing for interactive use with parallel architectures has received a big impulse during last years. Despite several techniques are employed in order to amortize communication costs and manage load balancing, they still represents a bottleneck to the scalability. We consider a new way of manage load balancing based on adaptive subdivision, assuring higher scalability and performance, compatible with commonly used balancing techniques such as work stealing and work sharing.

Correlating Cost with Performance in LLVM

A common technique to exploit data level parallelism is code vectorization. When performed by a c... more A common technique to exploit data level parallelism is code vectorization. When performed by a compiler, it needs to find a valid vectorization and assess its benefit. In LLVM, this analysis is based on a cost calculation, which will approve the transformation if the cost of the vectorized code is lower than the original scalar code. However, the calculated cost does not correlate to the actual speedup gain. We therefore propose a pluggable cost model that correctly accounts for vectorization overheads and further features of the target hardware platform. Using such a platform specific model, the compiler can assess a code transformation’s impact on application performance, make safe choices whether to transform or not, and compare different optimization options.

The Italian research on HPC key technologies across EuroHPC

Proceedings of the 18th ACM International Conference on Computing Frontiers

High-Performance Computing (HPC) is one of the strategic priorities for research and innovation w... more

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

Euro-Par 2020: Parallel Processing

The SYCL standard promises to enable high productivity in heterogeneous programming of a broad ra... more The SYCL standard promises to enable high productivity in heterogeneous programming of a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs. Its modern and expressive C++ API design, as well as flexible task graph execution model give rise to ample optimization opportunities at run-time, such as the overlapping of data transfers and kernel execution. However, it is not clear which of the existing SYCL implementations perform such scheduling optimizations, and to what extent. Furthermore, SYCL's high level of abstraction may raise concerns about sacrificing performance for ease of use. Benchmarks are required to accurately assess the performance behavior of high-level programming models such as SYCL. To this end, we present SYCL-Bench, a versatile benchmark suite for device characterization and runtime benchmarking, written in SYCL. We experimentally demonstrate the effectiveness of SYCL-Bench by performing device characterization of the NVIDIA TITAN X GPU, and by evaluating the efficiency of the hipSYCL and ComputeCpp SYCL implementations.

ALONA: Automatic Loop Nest Approximation with Reconstruction and Space Pruning

Approximate computing comprises a large variety of techniques that trade the accuracy of an appli... more Approximate computing comprises a large variety of techniques that trade the accuracy of an application’s output for other metrics such as computing time or energy cost. Many existing approximation techniques focus on loops such as loop perforation, which skips iterations for faster, approximated computation. This paper introduces ALONA, a novel approach for automatic loop nest approximation based on polyhedral compilation. ALONA’s compilation framework applies a sequence of loop approximation transformations, generalizes state-of-the-art perforation techniques, and introduces new multi-dimensional approximation schemes. The framework includes a reconstruction technique that significantly improves the accuracy of the approximations and a transformation space pruning method based on Barvinok’s counting that removes inaccurate approximations. Evaluated on a collection of more than twenty applications from PolyBench/C, ALONA discovers new approximations that are better than state-of-th...

Accelerating the RICH Particle Detector Algorithm on Intel Xeon Phi

At the LHC, particles are collided in order to understand how the universe was created. Those col... more At the LHC, particles are collided in order to understand how the universe was created. Those collisions are called events and generate large quantities of data, which have to be pre-filtered before they are stored to hard disks. This paper presents a parallel implementation of these algorithms that is specifically designed for the Intel Xeon Phi Knights Landing platform, exploiting its 64 cores and AVX-512 instruction set. It shows that a linear speedup up until approximately 64 threads is attainable when vectorization is used, data is aligned to cache line boundaries, program execution is pinned to MCDRAM, mathematical expressions are transformed to a more efficient equivalent formulation, and OpenMP is used for parallelization. The code was transformed from being compute bound to memory bound. Overall, a speedup of 36.47x was reached while obtaining an error which is smaller than the detector resolution.

Approximating Memory-bound Applications on Mobile GPUs

Approximate computing techniques are often used to improve the performance of applications that c... more Approximate computing techniques are often used to improve the performance of applications that can tolerate some amount of impurity in the calculations or data. In the context of embedded and mobile systems, a broad number of applications have exploited approximation techniques to improve performance and overcome the limited capabilities of the hardware. On such systems, even small performance improvements can be sufficient to meet scheduled requirements such as hard real-time deadlines. We study the approximation of memory-bound applications on mobile GPUs using kernel perforation, an approximation technique that exploits the availability of fast GPU local memory to provide high performance with more accurate results. Using this approximation technique, we approximated six applications and evaluated them on two mobile GPU architectures with very different memory layouts: a Qualcomm Adreno 506 and an ARM Mali T860 MP2. Results show that, even when the local memory is not mapped to ...

Celerity: High-Level C++ for Accelerator Clusters

Lecture Notes in Computer Science

In the face of ever-slowing single-thread performance growth for CPUs, the scientific and enginee... more In the face of ever-slowing single-thread performance growth for CPUs, the scientific and engineering communities increasingly turn to accelerator parallelization to tackle growing application workloads. Existing means of targeting distributed memory accelerator clusters impose severe programmability barriers and maintenance burdens. The Celerity programming environment seeks to enable developers to scale C++ applications to accelerator clusters with relative ease, while leveraging and extending the SYCL domain-specific embedded language. By having users provide minimal information about how data is accessed within compute kernels, Celerity automatically distributes work and data. We introduce the Celerity C++ API as well as a prototype implementation, demonstrating that existing SYCL code can be brought to distributed memory clusters with only a small set of changes that follow established idioms. The Celerity prototype runtime implementation is shown to have comparable performance to more traditional approaches to distributed memory accelerator programming, such as MPI+OpenCL, with significantly lower implementation complexity.

Proceedings of the International Workshop on OpenCL

Predictable GPUs Frequency Scaling for Energy and Performance

Proceedings of the 48th International Conference on Parallel Processing

Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and ... more Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The possibility to manually set these frequencies is a great opportunity for application tuning, which can focus on the best applicationdependent setting. However, this task is not straightforward because of the large set of possible configurations and because of the multi-objective nature of the problem, which minimizes energy consumption and maximizes performance. This paper proposes a method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. Our modeling approach, based on machine learning, first predicts speedup and normalized energy over the default frequency configuration. Then, it combines the two models into a multi-objective one that predicts a Pareto-set of frequency configurations. The approach uses static code features, is built on a set of carefully designed microbenchmarks, and can predict the best frequency settings of a new kernel without executing it. Test results show that our modeling approach is very accurate on predicting extrema points and Pareto set for ten out of twelve test benchmarks, and discover frequency configurations that dominate the default configuration in either energy or performance. CCS CONCEPTS • Computer systems organization → Parallel architectures; • Hardware → Power and energy.

OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations

Euro-Par 2018: Parallel Processing

Agent-based simulations are becoming widespread among scientists from different areas, who use th... more Agent-based simulations are becoming widespread among scientists from different areas, who use them to model increasingly complex problems. To cope with the growing computational complexity, parallel and distributed implementations have been developed for a wide range of platforms. However, it is difficult to have simulations that are portable to different platforms while still achieving high performance. We present OpenABL, a domain-specific language for portable, highperformance, parallel agent modeling. It comprises an easy-to-program language that relies on high-level abstractions for programmability and explicitly exploits agent parallelism to deliver high performance. A sourceto-source compiler translates the input code to a high-level intermediate representation exposing parallelism, locality and synchronization, and, thanks to an architecture based on pluggable backends, generates target code for multi-core CPUs, GPUs, large clusters and cloud systems. OpenABL has been evaluated on six applications from various fields such as ecology, animation, and social sciences. The generated code scales to large clusters and performs similarly to handwritten target-specific code, while requiring significantly fewer lines of codes.

Autotuning Stencil Computations with Structural Ordinal Regression Learning

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Stencil computations expose a large and complex space of equivalent implementations. These comput... more Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may be unaffordable in many scenarios. Meanwhile, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. This paper proposes a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples (i.e., rankings), the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes and by analyzing the quality of the obtained ranking in terms of Kendall rank correlation coefficients.

Stencil Autotuning with Ordinal Regression

Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems

The increasing performance of today's computer architecture comes with an unprecedented augme... more The increasing performance of today's computer architecture comes with an unprecedented augment of hardware complexity. Unfortunately this results in difficult-to-tune software and consequentially in a gap between the potential peak performance and the actual performance. Automatic tuning is an emerging approach that assists the programmer in managing this complexity. State-of-the-art autotuners are limited, though: they either require long tuning times, e.g., due to iterative searches, or cannot tackle the complexity of the problem due to the limitation of the supervised machine learning (ML) methodologies used. In particular, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. We propose a new way of performing automatic tuning based on structural learning: the tuning problem is formulated as a version ranking prediction modeling and solved using ordinal regression. We demonstrate its potential on a well-known autotuning problem: stencil computations. We compare state-of-the-art iterative compilation methods with our ordinal regression approach and analyze the quality of the obtained ranking in terms of Kendall rank correlation coefficients.

A Performance Analysis of Vector Length Agnostic Code

2019 International Conference on High Performance Computing & Simulation (HPCS)

Vector extensions are a popular mean to exploit data parallelism in applications. Over recent yea... more Vector extensions are a popular mean to exploit data parallelism in applications. Over recent years, the most commonly used extensions have been growing in vector length and amount of vector instructions. However, code portability remains a problem when speaking about a compute continuum. Hence, vector length agnostic (VLA) architectures have been proposed for the future generations of ARM and RISC-V processors. With these architectures, code is vectorized independently of the vector length of the target hardware platform. It is therefore possible to tune software to a generic vector length. To understand the performance impact of VLA code compared to vector length specific code, we analyze the current capabilities of code generation for ARM's SVE architecture. Our experiments show that VLA code reaches about 90% of the performance of vector length specific code, i.e. a 10% overhead is inferred due to global predication of instructions. Furthermore, we show that code performance is not increasing proportionally with increasing vector lengths due to the higher memory demands.

Vectorization cost modeling for NEON, AVX and SVE

Performance Evaluation

Compiler optimization passes employ cost models to determine if a code transformation will yield ... more Compiler optimization passes employ cost models to determine if a code transformation will yield performance improvements. When this assessment is inaccurate, compilers apply transformations that are not beneficial, or refrain from applying ones that would have improved the code. We analyze the accuracy of the cost models used in LLVM's and GCC's vectorization passes for three different instruction set architectures, including both traditional SIMD architectures with a defined fixed vector register size (AVX2 and NEON), and novel instruction set with scalable vector size (SVE). In general, speedup is overestimated , resulting in mispredictions and a weak to medium correlation between predicted and actual performance gain. We therefore propose a novel cost model that is based on a code's intermediate representation with refined memory access pattern features. Using linear regression techniques, this platform independent model is fitted to an AVX2 and a NEON hardware, as well as an SVE simulator. Results show that the fitted model significantly improves the correlation between predicted and measured speedup (AVX2: +52% for training data, +13% for validation data), reduces the average error of the speedup prediction (SVE:-43% for training data,-36% for validation data), as well as the number of mispredictions (NEON:-88% for training data,-71% for validation data) for more than 80 code patterns.

Easy and efficient agent-based simulations with the OpenABL language and compiler

Future Generation Computer Systems

Abstract Agent-based simulations represent an effective scientific tool, with numerous applicatio... more Abstract Agent-based simulations represent an effective scientific tool, with numerous applications from social sciences to biology, which aims to emulate or predict complex phenomena through a set of simple rules performed by multiple agents. To simulate a large number of agents with complex models, practitioners have developed high-performance parallel implementations, often specialized for particular scenarios and target hardware. It is, however, difficult to obtain portable simulations, which achieve high performance and at the same time are easy to write and to reproduce on different hardware. This article gives a complete presentation of OpenABL , a domain-specific language and a compiler for agent-based simulations that enable users to achieve high-performance parallel and distributed agent simulations with a simple and portable programming environment. OpenABL is comprised of (1) an easy-to-program language, which relies on domain abstractions and explicitly exposes agent parallelism, synchronization and locality, (2) a source-to-source compiler, and (3) a set of pluggable compiler backends, which generate target code for multi-core CPUs, GPUs, and cloud-based systems. We evaluate OpenABL on simulations from different fields. In particular, our analysis includes predator-prey and keratinocyte, two complex simulations with multiple step functions, heterogeneous agent types, and dynamic creation and removal of agents. The results show that OpenABL -generated codes are portable to different platforms, perform similarly to manual target-specific implementations, and require significantly fewer lines of codes.

Static optimization in PHP 7

PHP is a dynamically typed programming language commonly used for the server-side implementation ... more PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it hard to optimize statically. This paper reports on the implementation of purely static bytecode optimizations for PHP 7, the last major version of PHP. We discuss the challenge of integrating classical compiler optimizations, which have been developed in the context of statically-typed languages, into a programming language that is dynamically and weakly typed, and supports a plethora of dynamic language features. Based on a careful analysis of language semantics, we adapt static single assignment (SSA) form for use in PHP. Combined with type inference, this allows type-based specializatio...

Cost Modelling for Vectorization on ARM

2018 IEEE International Conference on Cluster Computing (CLUSTER)

When applying a code transformation to optimize for performance, compilers need to assess its pro... more When applying a code transformation to optimize for performance, compilers need to assess its profitability beforehand. For this purpose, they utilize cost models, which compare the cost, an abstract measure of the code, before and after the transformation. If the cost is lower after the transformation, it will be applied. Exact cost modelling is therefore critical to avoid slowdowns or missed opportunities for speedups. In this work, we analyze the accuracy of LLVM's loop-level vectorization (LLV) cost model, and show the benefit of modelling speedup instead of instruction costs for higher vectorization rates and smaller execution times. The presented approach is portable to other compilers and hardwares as well.

Static optimization in PHP 7

Proceedings of the 26th International Conference on Compiler Construction

PHP is a dynamically typed programming language commonly used for the server-side implementation ... more PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it hard to optimize statically. This paper reports on the implementation of purely static bytecode optimizations for PHP 7, the last major version of PHP. We discuss the challenge of integrating classical compiler optimizations, which have been developed in the context of statically-typed languages, into a programming language that is dynamically and weakly typed, and supports a plethora of dynamic language features. Based on a careful analysis of language semantics, we adapt static single assignment (SSA) form for use in PHP. Combined with type inference, this allows type-based specialization of instructions, as well as the application of various classical SSA-enabled compiler optimizations such as constant propagation or dead code elimination. We evaluate the impact of the proposed static optimizations on a wide collection of programs, including micro-benchmarks, libraries and web frameworks. Despite the dynamic nature of PHP, our approach achieves an average speedup of 50% on microbenchmarks, 13% on computationally intensive libraries, as well as 1.1% (MediaWiki) and 3.5% (WordPress) on web applications.

Experiences with Mesh-like computations using Prediction Binary Trees

Scalable Computing : Practice and Experience, 2009

Evaluation of Adaptive Subivision Schemas for Parallel Ray Tracing

Ray tracing algorithm is widely used for rendering images aiming at an high realism. Speeding up ... more Ray tracing algorithm is widely used for rendering images aiming at an high realism. Speeding up ray tracing for interactive use with parallel architectures has received a big impulse during last years. Despite several techniques are employed in order to amortize communication costs and manage load balancing, they still represents a bottleneck to the scalability. We consider a new way of manage load balancing based on adaptive subdivision, assuring higher scalability and performance, compatible with commonly used balancing techniques such as work stealing and work sharing.

Correlating Cost with Performance in LLVM

A common technique to exploit data level parallelism is code vectorization. When performed by a c... more A common technique to exploit data level parallelism is code vectorization. When performed by a compiler, it needs to find a valid vectorization and assess its benefit. In LLVM, this analysis is based on a cost calculation, which will approve the transformation if the cost of the vectorized code is lower than the original scalar code. However, the calculated cost does not correlate to the actual speedup gain. We therefore propose a pluggable cost model that correctly accounts for vectorization overheads and further features of the target hardware platform. Using such a platform specific model, the compiler can assess a code transformation’s impact on application performance, make safe choices whether to transform or not, and compare different optimization options.

The Italian research on HPC key technologies across EuroHPC

Proceedings of the 18th ACM International Conference on Computing Frontiers

High-Performance Computing (HPC) is one of the strategic priorities for research and innovation w... more

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

Euro-Par 2020: Parallel Processing

The SYCL standard promises to enable high productivity in heterogeneous programming of a broad ra... more The SYCL standard promises to enable high productivity in heterogeneous programming of a broad range of parallel devices, including multicore CPUs, GPUs, and FPGAs. Its modern and expressive C++ API design, as well as flexible task graph execution model give rise to ample optimization opportunities at run-time, such as the overlapping of data transfers and kernel execution. However, it is not clear which of the existing SYCL implementations perform such scheduling optimizations, and to what extent. Furthermore, SYCL's high level of abstraction may raise concerns about sacrificing performance for ease of use. Benchmarks are required to accurately assess the performance behavior of high-level programming models such as SYCL. To this end, we present SYCL-Bench, a versatile benchmark suite for device characterization and runtime benchmarking, written in SYCL. We experimentally demonstrate the effectiveness of SYCL-Bench by performing device characterization of the NVIDIA TITAN X GPU, and by evaluating the efficiency of the hipSYCL and ComputeCpp SYCL implementations.

ALONA: Automatic Loop Nest Approximation with Reconstruction and Space Pruning

Approximate computing comprises a large variety of techniques that trade the accuracy of an appli... more Approximate computing comprises a large variety of techniques that trade the accuracy of an application’s output for other metrics such as computing time or energy cost. Many existing approximation techniques focus on loops such as loop perforation, which skips iterations for faster, approximated computation. This paper introduces ALONA, a novel approach for automatic loop nest approximation based on polyhedral compilation. ALONA’s compilation framework applies a sequence of loop approximation transformations, generalizes state-of-the-art perforation techniques, and introduces new multi-dimensional approximation schemes. The framework includes a reconstruction technique that significantly improves the accuracy of the approximations and a transformation space pruning method based on Barvinok’s counting that removes inaccurate approximations. Evaluated on a collection of more than twenty applications from PolyBench/C, ALONA discovers new approximations that are better than state-of-th...

Accelerating the RICH Particle Detector Algorithm on Intel Xeon Phi

At the LHC, particles are collided in order to understand how the universe was created. Those col... more At the LHC, particles are collided in order to understand how the universe was created. Those collisions are called events and generate large quantities of data, which have to be pre-filtered before they are stored to hard disks. This paper presents a parallel implementation of these algorithms that is specifically designed for the Intel Xeon Phi Knights Landing platform, exploiting its 64 cores and AVX-512 instruction set. It shows that a linear speedup up until approximately 64 threads is attainable when vectorization is used, data is aligned to cache line boundaries, program execution is pinned to MCDRAM, mathematical expressions are transformed to a more efficient equivalent formulation, and OpenMP is used for parallelization. The code was transformed from being compute bound to memory bound. Overall, a speedup of 36.47x was reached while obtaining an error which is smaller than the detector resolution.

Approximating Memory-bound Applications on Mobile GPUs

Approximate computing techniques are often used to improve the performance of applications that c... more Approximate computing techniques are often used to improve the performance of applications that can tolerate some amount of impurity in the calculations or data. In the context of embedded and mobile systems, a broad number of applications have exploited approximation techniques to improve performance and overcome the limited capabilities of the hardware. On such systems, even small performance improvements can be sufficient to meet scheduled requirements such as hard real-time deadlines. We study the approximation of memory-bound applications on mobile GPUs using kernel perforation, an approximation technique that exploits the availability of fast GPU local memory to provide high performance with more accurate results. Using this approximation technique, we approximated six applications and evaluated them on two mobile GPU architectures with very different memory layouts: a Qualcomm Adreno 506 and an ARM Mali T860 MP2. Results show that, even when the local memory is not mapped to ...

Celerity: High-Level C++ for Accelerator Clusters

Lecture Notes in Computer Science

In the face of ever-slowing single-thread performance growth for CPUs, the scientific and enginee... more In the face of ever-slowing single-thread performance growth for CPUs, the scientific and engineering communities increasingly turn to accelerator parallelization to tackle growing application workloads. Existing means of targeting distributed memory accelerator clusters impose severe programmability barriers and maintenance burdens. The Celerity programming environment seeks to enable developers to scale C++ applications to accelerator clusters with relative ease, while leveraging and extending the SYCL domain-specific embedded language. By having users provide minimal information about how data is accessed within compute kernels, Celerity automatically distributes work and data. We introduce the Celerity C++ API as well as a prototype implementation, demonstrating that existing SYCL code can be brought to distributed memory clusters with only a small set of changes that follow established idioms. The Celerity prototype runtime implementation is shown to have comparable performance to more traditional approaches to distributed memory accelerator programming, such as MPI+OpenCL, with significantly lower implementation complexity.

Proceedings of the International Workshop on OpenCL

Predictable GPUs Frequency Scaling for Energy and Performance

Proceedings of the 48th International Conference on Parallel Processing

Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and ... more Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The possibility to manually set these frequencies is a great opportunity for application tuning, which can focus on the best applicationdependent setting. However, this task is not straightforward because of the large set of possible configurations and because of the multi-objective nature of the problem, which minimizes energy consumption and maximizes performance. This paper proposes a method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. Our modeling approach, based on machine learning, first predicts speedup and normalized energy over the default frequency configuration. Then, it combines the two models into a multi-objective one that predicts a Pareto-set of frequency configurations. The approach uses static code features, is built on a set of carefully designed microbenchmarks, and can predict the best frequency settings of a new kernel without executing it. Test results show that our modeling approach is very accurate on predicting extrema points and Pareto set for ten out of twelve test benchmarks, and discover frequency configurations that dominate the default configuration in either energy or performance. CCS CONCEPTS • Computer systems organization → Parallel architectures; • Hardware → Power and energy.

OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations

Euro-Par 2018: Parallel Processing

Agent-based simulations are becoming widespread among scientists from different areas, who use th... more Agent-based simulations are becoming widespread among scientists from different areas, who use them to model increasingly complex problems. To cope with the growing computational complexity, parallel and distributed implementations have been developed for a wide range of platforms. However, it is difficult to have simulations that are portable to different platforms while still achieving high performance. We present OpenABL, a domain-specific language for portable, highperformance, parallel agent modeling. It comprises an easy-to-program language that relies on high-level abstractions for programmability and explicitly exploits agent parallelism to deliver high performance. A sourceto-source compiler translates the input code to a high-level intermediate representation exposing parallelism, locality and synchronization, and, thanks to an architecture based on pluggable backends, generates target code for multi-core CPUs, GPUs, large clusters and cloud systems. OpenABL has been evaluated on six applications from various fields such as ecology, animation, and social sciences. The generated code scales to large clusters and performs similarly to handwritten target-specific code, while requiring significantly fewer lines of codes.

Autotuning Stencil Computations with Structural Ordinal Regression Learning

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Stencil computations expose a large and complex space of equivalent implementations. These comput... more Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may be unaffordable in many scenarios. Meanwhile, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. This paper proposes a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples (i.e., rankings), the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes and by analyzing the quality of the obtained ranking in terms of Kendall rank correlation coefficients.

Stencil Autotuning with Ordinal Regression

Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems

The increasing performance of today's computer architecture comes with an unprecedented augme... more The increasing performance of today's computer architecture comes with an unprecedented augment of hardware complexity. Unfortunately this results in difficult-to-tune software and consequentially in a gap between the potential peak performance and the actual performance. Automatic tuning is an emerging approach that assists the programmer in managing this complexity. State-of-the-art autotuners are limited, though: they either require long tuning times, e.g., due to iterative searches, or cannot tackle the complexity of the problem due to the limitation of the supervised machine learning (ML) methodologies used. In particular, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. We propose a new way of performing automatic tuning based on structural learning: the tuning problem is formulated as a version ranking prediction modeling and solved using ordinal regression. We demonstrate its potential on a well-known autotuning problem: stencil computations. We compare state-of-the-art iterative compilation methods with our ordinal regression approach and analyze the quality of the obtained ranking in terms of Kendall rank correlation coefficients.

A Performance Analysis of Vector Length Agnostic Code

2019 International Conference on High Performance Computing & Simulation (HPCS)

Vector extensions are a popular mean to exploit data parallelism in applications. Over recent yea... more Vector extensions are a popular mean to exploit data parallelism in applications. Over recent years, the most commonly used extensions have been growing in vector length and amount of vector instructions. However, code portability remains a problem when speaking about a compute continuum. Hence, vector length agnostic (VLA) architectures have been proposed for the future generations of ARM and RISC-V processors. With these architectures, code is vectorized independently of the vector length of the target hardware platform. It is therefore possible to tune software to a generic vector length. To understand the performance impact of VLA code compared to vector length specific code, we analyze the current capabilities of code generation for ARM's SVE architecture. Our experiments show that VLA code reaches about 90% of the performance of vector length specific code, i.e. a 10% overhead is inferred due to global predication of instructions. Furthermore, we show that code performance is not increasing proportionally with increasing vector lengths due to the higher memory demands.

Vectorization cost modeling for NEON, AVX and SVE

Performance Evaluation

Compiler optimization passes employ cost models to determine if a code transformation will yield ... more Compiler optimization passes employ cost models to determine if a code transformation will yield performance improvements. When this assessment is inaccurate, compilers apply transformations that are not beneficial, or refrain from applying ones that would have improved the code. We analyze the accuracy of the cost models used in LLVM's and GCC's vectorization passes for three different instruction set architectures, including both traditional SIMD architectures with a defined fixed vector register size (AVX2 and NEON), and novel instruction set with scalable vector size (SVE). In general, speedup is overestimated , resulting in mispredictions and a weak to medium correlation between predicted and actual performance gain. We therefore propose a novel cost model that is based on a code's intermediate representation with refined memory access pattern features. Using linear regression techniques, this platform independent model is fitted to an AVX2 and a NEON hardware, as well as an SVE simulator. Results show that the fitted model significantly improves the correlation between predicted and measured speedup (AVX2: +52% for training data, +13% for validation data), reduces the average error of the speedup prediction (SVE:-43% for training data,-36% for validation data), as well as the number of mispredictions (NEON:-88% for training data,-71% for validation data) for more than 80 code patterns.

Easy and efficient agent-based simulations with the OpenABL language and compiler

Future Generation Computer Systems

Abstract Agent-based simulations represent an effective scientific tool, with numerous applicatio... more Abstract Agent-based simulations represent an effective scientific tool, with numerous applications from social sciences to biology, which aims to emulate or predict complex phenomena through a set of simple rules performed by multiple agents. To simulate a large number of agents with complex models, practitioners have developed high-performance parallel implementations, often specialized for particular scenarios and target hardware. It is, however, difficult to obtain portable simulations, which achieve high performance and at the same time are easy to write and to reproduce on different hardware. This article gives a complete presentation of OpenABL , a domain-specific language and a compiler for agent-based simulations that enable users to achieve high-performance parallel and distributed agent simulations with a simple and portable programming environment. OpenABL is comprised of (1) an easy-to-program language, which relies on domain abstractions and explicitly exposes agent parallelism, synchronization and locality, (2) a source-to-source compiler, and (3) a set of pluggable compiler backends, which generate target code for multi-core CPUs, GPUs, and cloud-based systems. We evaluate OpenABL on simulations from different fields. In particular, our analysis includes predator-prey and keratinocyte, two complex simulations with multiple step functions, heterogeneous agent types, and dynamic creation and removal of agents. The results show that OpenABL -generated codes are portable to different platforms, perform similarly to manual target-specific implementations, and require significantly fewer lines of codes.