Papers by Athanasios Stratikopoulos
SpringerBriefs in computer science, 2024
Proceedings of the 15th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages
The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels that can be consumed by OpenCL for compute kernels and by Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be produced and consumed by compilers and runtime systems, which are usually C/C++ programs built on the LLVM compiler ecosystem. However, not all programming environments, runtime systems, and language implementations are written in C/C++ or based on LLVM. This paper presents the Beehive SPIR-V Toolkit, a framework that can automatically generate a composable and functional Java library for dynamically building SPIR-V binary modules. The Beehive SPIR-V Toolkit can be used by optimizing compilers and runtime systems to generate and validate SPIR-V binary modules from managed runtime systems. Furthermore, our framework is architected to accommodate new SPIR-V releases in an easy-to-maintain manner, and it facilitates the automatic generation of Java libraries for other standards besides SPIR-V. The Beehive SPIR-V Toolkit also includes an assembler that emits SPIR-V binary modules from disassembled SPIR-V text files, and a disassembler that converts SPIR-V binary code into a text file. To the best of our knowledge, the Beehive SPIR-V Toolkit is the first Java programming framework that can dynamically generate SPIR-V binary modules.
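To make concrete what "dynamically building SPIR-V binary modules" involves at the lowest level, the sketch below emits the five-word SPIR-V module header defined by the Khronos SPIR-V specification (magic number, version word, generator ID, ID bound, reserved schema). The header layout comes from the specification itself; the class and method names here are illustrative and are not the Beehive SPIR-V Toolkit's actual API, which exposes a much richer composable builder.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SpirvHeader {
    // SPIR-V magic number, per the Khronos specification.
    public static final int MAGIC = 0x07230203;

    // Emit the five 32-bit words that open every SPIR-V module.
    public static byte[] emitHeader(int majorVersion, int minorVersion, int idBound) {
        ByteBuffer buf = ByteBuffer.allocate(5 * 4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(MAGIC);
        buf.putInt((majorVersion << 16) | (minorVersion << 8)); // version word: 0|major|minor|0
        buf.putInt(0);        // generator magic (0 = unregistered tool)
        buf.putInt(idBound);  // every result ID in the module is < idBound
        buf.putInt(0);        // reserved instruction schema, must be 0
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] header = emitHeader(1, 2, 8); // a SPIR-V 1.2 module with ID bound 8
        System.out.printf("first word = 0x%08X%n",
            ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt());
    }
}
```

A full module would append instruction words (capabilities, entry points, types, function bodies) after this header; the toolkit's value is generating a type-safe Java API for all of those instructions automatically from the SPIR-V grammar.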
Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes
Java benchmarking suites like DaCapo and Renaissance are employed by the research community to evaluate the performance of novel features in managed runtime systems. These suites encompass various applications with diverse behaviors in order to stress-test different subsystems of a managed runtime. Therefore, understanding and characterizing the behavior of these benchmarks is important when trying to interpret experimental results. This paper presents an in-depth study of the memory behavior of 30 DaCapo and Renaissance applications. To realize the study, a characterization methodology based on a two-faceted profiling process of the Java applications is employed. The two-faceted profiling offers comprehensive insights into the memory behavior of Java applications, as it is composed of high-level and low-level metrics obtained through a Java object profiler (NUMAProfiler) and a microarchitectural event profiler (PerfUtil) of MaxineVM, respectively. By using this profiling methodology we classify the DaCapo and Renaissance applications regarding their intensity in object allocations, object accesses, LLC pressure, and main memory pressure. In addition, several other aspects, such as the JVM impact on the memory behavior of the application, are discussed. CCS Concepts: • Software and its engineering → Memory management; Runtime environments; Object oriented languages.
arXiv (Cornell University), May 16, 2023
This paper presents the Beehive SPIR-V Toolkit, a framework that can automatically generate a composable and functional Java library for dynamically building SPIR-V binary modules. The Beehive SPIR-V Toolkit can be used by optimizing compilers and runtime systems to generate and validate SPIR-V binary modules from managed runtime systems, such as the Java Virtual Machine (JVM). Furthermore, our framework is architected to accommodate new SPIR-V releases in an easy-to-maintain manner, and it facilitates the automatic generation of Java libraries for other standards besides SPIR-V. The Beehive SPIR-V Toolkit also includes an assembler that emits SPIR-V binary modules from disassembled SPIR-V text files, a disassembler that converts SPIR-V binary code into a text file, and a console client application. To the best of our knowledge, the Beehive SPIR-V Toolkit is the first Java programming framework that can dynamically generate SPIR-V binary modules. To demonstrate the use of our framework, we showcase the integration of the Beehive SPIR-V Toolkit in the context of TornadoVM, a Java framework for automatically offloading and running Java programs on heterogeneous hardware. We show that, via the Beehive SPIR-V Toolkit, TornadoVM is able to compile code 3x faster than with its existing OpenCL C JIT compiler, and the generated code performs up to 1.52x faster than the existing OpenCL C backend in TornadoVM.
Zenodo (CERN European Organization for Nuclear Research), May 16, 2023
In this talk, we will present the newly EU-funded project AERO (Accelerated EU Cloud), whose mission is to bring up and optimize the software stack of cloud deployments on top of the EU processor. After providing an overview of the AERO project, we will expand on two main components of the software stack that enable seamless acceleration of various programming languages on RISC-V architectures: ComputeAorta, which enables the generation of RISC-V vector instructions from SPIR-V binary modules, and TornadoVM, which enables transparent hardware acceleration of managed applications. Finally, we will describe how the ongoing integration of ComputeAorta and TornadoVM will enable a plethora of applications from managed languages to harness RISC-V auto-vectorization completely transparently to developers.
arXiv (Cornell University), May 1, 2023
Ray tracing has typically been known as a graphics rendering method capable of producing highly realistic imagery and visual effects generated by computers. More recently, the performance improvements in Graphics Processing Units (GPUs) have enabled developers to exploit sufficient computing power to build a fair number of ray tracing applications that run in real time. Typically, real-time ray tracing is achieved by utilizing high-performance kernels written in CUDA, OpenCL, and Vulkan, which can be invoked by high-level languages via native bindings; a technique that fragments application code bases as well as limits portability. This paper presents a hardware-accelerated ray tracing rendering engine, fully written in Java, that can seamlessly harness the performance of underlying GPUs via the TornadoVM framework. Through this paper, we show the potential of Java and acceleration frameworks to process a compute-intensive application in real time. Our results indicate that it is possible to enable real-time ray tracing from Java by achieving up to 234, 152, and 45 frames per second in 720p, 1080p, and 4K resolutions, respectively. CCS CONCEPTS • General and reference → Computing standards, RFCs and guidelines; • Applied computing → Publishing; • Software and its engineering → Software libraries and repositories.
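The core primitive behind any ray tracing engine, including a Java one of the kind the paper describes, is the per-pixel ray-object intersection test; it is exactly this kind of data-parallel kernel that a framework like TornadoVM can offload to a GPU. Below is a textbook ray-sphere intersection in plain Java, shown as a minimal sketch; it is not code from the paper's engine.

```java
public class RaySphere {
    // Returns the distance t >= 0 along the ray (origin + t*dir) to the
    // nearest intersection with the sphere, or -1 if the ray misses.
    public static double hit(double[] origin, double[] dir,
                             double[] center, double radius) {
        double ox = origin[0] - center[0];
        double oy = origin[1] - center[1];
        double oz = origin[2] - center[2];
        double a = dir[0] * dir[0] + dir[1] * dir[1] + dir[2] * dir[2];
        double b = 2.0 * (ox * dir[0] + oy * dir[1] + oz * dir[2]);
        double c = ox * ox + oy * oy + oz * oz - radius * radius;
        double disc = b * b - 4.0 * a * c;       // discriminant of the quadratic
        if (disc < 0) return -1;                 // no real roots: ray misses
        double t = (-b - Math.sqrt(disc)) / (2.0 * a); // nearer root
        return t >= 0 ? t : -1;
    }

    public static void main(String[] args) {
        // Ray from the origin along +z toward a unit sphere at z = 5: hits at t = 4.
        double t = hit(new double[]{0, 0, 0}, new double[]{0, 0, 1},
                       new double[]{0, 0, 5}, 1.0);
        System.out.println(t);
    }
}
```

In a real-time renderer this test runs once per pixel per object per frame, which is why offloading it to GPU hardware pays off at 720p and above.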
Scaling up the performance of managed applications on Non-Uniform Memory Access (NUMA) architectures has been a challenging task, as it requires a good understanding of the underlying architecture and managed runtime environments (MREs). Prior work has studied this problem from the scope of specific components of the managed runtimes, such as the garbage collectors, as a means to increase the NUMA awareness in MREs. In this paper, we follow a different approach that complements prior work by studying the behavior of managed applications on NUMA architectures during mutation time. At first, we perform a characterization study that classifies several DaCapo and Renaissance applications as per their scalability-critical properties. Based on this study, we propose a novel lightweight mechanism in MREs for optimizing the scalability of managed applications on NUMA systems in an application-agnostic way. Our experimental results show that the proposed mechanism can result in relative performance ranging from 0.66x up to 3.29x, with a geometric mean of 1.11x, against a NUMA-agnostic execution. CCS Concepts: • Computer systems organization → Multicore architectures; • Software and its engineering → Object oriented languages; Runtime environments; Software design engineering.
arXiv (Cornell University), May 23, 2023
Quantum computers are driving a new computing paradigm to address important computational problems in science. For example, quantum computing can be the means to demystify complex mathematical formulas applied in cryptography, or complex models used in chemistry for biological systems. Due to the early stage in the development of quantum hardware, simulation is currently playing a prime role in research. To tackle the exponential cost of quantum simulation, state-of-the-art simulators are typically implemented using programming languages associated with High Performance Computing, while also providing the means for hardware acceleration on heterogeneous co-processors (e.g., GPUs). The vast majority of quantum simulators implement a part of the simulator in a platform-specific language (e.g., CUDA, OpenCL). This approach results in fragmented development, as developers have to manually specialize the code for custom execution across different devices or microarchitectures. In this article, we present TornadoQSim, an open-source quantum circuit simulation framework implemented in Java. The proposed framework has been designed to be modular and easily expandable for accommodating different user-defined simulation backends, such as the unitary matrix simulation technique. Furthermore, TornadoQSim features the ability to interchange simulation backends that can simulate arbitrary quantum circuits. Another novel aspect of TornadoQSim over other quantum simulators is the transparent hardware acceleration of the simulation backends on heterogeneous devices. TornadoQSim employs TornadoVM to automatically compile parts of the simulation backends onto heterogeneous hardware, thereby addressing the fragmentation in development caused by low-level heterogeneous programming models.
The evaluation of TornadoQSim has shown that the transparent utilization of GPU hardware can result in up to 506.5x performance speedup when compared to the vanilla Java code for a fully entangled quantum circuit of 11 qubits. Other evaluated quantum algorithms have been the Deutsch-Jozsa algorithm (493.10x speedup for an 11-qubit circuit) and the quantum Fourier transform algorithm (518.12x speedup for an 11-qubit circuit). Finally, the best TornadoQSim implementation of the unitary matrix technique has been evaluated against a semantically equivalent simulation via Qiskit. The comparative evaluation has shown that the simulation with TornadoQSim is faster for small circuits, while for large circuits Qiskit outperforms TornadoQSim by an order of magnitude. CCS Concepts: • Computer systems organization → Quantum computing; • Software and its engineering → Object oriented frameworks.
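The unitary matrix simulation technique mentioned above treats each gate as a unitary matrix applied to the state vector by matrix-vector multiplication. The plain-Java sketch below applies a Hadamard gate to a single-qubit state this way; it is the kind of loop-heavy kernel TornadoQSim would offload, but the class and method names are illustrative, not TornadoQSim's actual API (and complex amplitudes are omitted since the Hadamard matrix is real).

```java
public class UnitarySim {
    // Multiply a real-valued unitary u (2^n x 2^n) into state vector s.
    public static double[] apply(double[][] u, double[] s) {
        double[] out = new double[s.length];
        for (int i = 0; i < s.length; i++) {
            for (int j = 0; j < s.length; j++) {
                out[i] += u[i][j] * s[j]; // dense matrix-vector product
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double h = 1.0 / Math.sqrt(2.0);
        double[][] hadamard = { { h, h }, { h, -h } };
        double[] state = { 1.0, 0.0 };            // qubit in |0>
        double[] result = apply(hadamard, state); // (|0> + |1>) / sqrt(2)
        System.out.printf("%.4f %.4f%n", result[0], result[1]);
    }
}
```

The exponential cost the abstract refers to is visible here: for n qubits the state vector has 2^n amplitudes and a full-circuit unitary has 2^n x 2^n entries, which is why GPU offload of this multiplication yields the reported speedups.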
Companion Proceedings of the 7th International Conference on the Art, Science, and Engineering of Programming
Zenodo (CERN European Organization for Nuclear Research), Mar 13, 2023
Proceedings of the VLDB Endowment
The ever-increasing demand for high-performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into modern Big Data platforms. Currently, this integration comes at the cost of programmability, since the end-user Application Programming Interfaces (APIs) must be altered to access the underlying heterogeneous hardware. For example, current Big Data frameworks, such as Apache Spark, provide a new API that combines the existing Spark programming model with GPUs. For other Big Data frameworks, such as Flink, the integration of GPUs and FPGAs is achieved via external API calls that bypass their execution models completely. In this paper, we rethink current Big Data frameworks from a systems and programming language perspective, and introduce a novel co-designed approach for integrating hardware acceleration into their execution models. The...
In recent years, heterogeneous computing has emerged as the vital way to increase computers' performance and energy efficiency by combining diverse hardware devices, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The rationale behind this trend is that different parts of an application can be offloaded from the main CPU to diverse devices, which can efficiently execute these parts as co-processors. FPGAs are a subset of the most widely used co-processors, typically used for accelerating specific workloads due to their flexible hardware and energy-efficient characteristics. These characteristics have made them prevalent in a broad spectrum of computing systems, ranging from low-power embedded systems to high-end data centers and cloud infrastructures. However, these hardware characteristics come at the cost of programmability. Developers who create their applications using high-level programming languages (e.g., Java, Python, etc.) are required to...
Conference Companion of the 4th International Conference on Art, Science, and Engineering of Programming, 2020
Since the early conception of managed runtime systems with tiered JIT compilation, several research attempts have been made to accelerate bytecode execution. In this paper, we extend prior attempts by performing an initial analysis of whether heterogeneous hardware accelerators in the form of Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) can help towards achieving higher performance during the bytecode interpreter mode. To answer this question, we implemented a simple parallel Java bytecode interpreter written in OpenCL and executed it across a plethora of devices, including GPUs and FPGAs. Our preliminary evaluation shows that under specific workloads, hardware acceleration can yield up to 17x better performance compared to traditional optimized interpreters running on Intel CPUs, and up to 214x compared to ARM CPUs.
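To illustrate the kind of switch-dispatch interpreter loop that work like this ports to OpenCL, here is a minimal stack-machine interpreter in plain Java. The opcode set and numbering are invented for this sketch; the paper's interpreter handles actual Java bytecodes.

```java
public class TinyInterp {
    // Illustrative opcodes (not real Java bytecodes).
    static final byte PUSH = 0; // push the next code byte onto the stack
    static final byte ADD  = 1; // pop two values, push their sum
    static final byte MUL  = 2; // pop two values, push their product
    static final byte HALT = 3; // stop and return the top of stack

    public static int run(byte[] code) {
        int[] stack = new int[16];
        int sp = 0; // stack pointer
        int pc = 0; // program counter
        while (true) {
            switch (code[pc++]) {               // classic dispatch loop
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case MUL:  stack[sp - 2] *= stack[sp - 1]; sp--; break;
                case HALT: return stack[sp - 1];
            }
        }
    }

    public static void main(String[] args) {
        // Program computing (2 + 3) * 4.
        byte[] program = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT };
        System.out.println(run(program));
    }
}
```

The dispatch loop is branchy and sequential per program, which is exactly why the paper's question — whether GPUs and FPGAs can run many such interpreter instances in parallel profitably — is non-obvious and workload-dependent.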
Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2021
Although Graphics Processing Units (GPUs) have become pervasive for data-parallel workloads, the efficient exploitation of their tiered memory hierarchy requires explicit programming. The efficient utilization of different GPU memory tiers can yield higher performance at the expense of programmability, since developers must have extended knowledge of the architectural details in order to utilize them. In this paper, we propose an alternative approach based on Just-In-Time (JIT) compilation to automatically and transparently exploit local memory allocation and data locality on GPUs. In particular, we present a set of compiler extensions that allow arbitrary Java programs to utilize local memory on GPUs without explicit programming. We prototype and evaluate our proposed solution in the context of TornadoVM against a set of benchmarks and GPU architectures, showcasing performance speedups of up to 2.5x compared to equivalent baseline implementations that do not utilize local memory or ...
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has set the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. Prior research has focused on employing hardware accelerators as a means to meet this demand. This trend has driven software development to target heterogeneous execution, and several modern computing systems have incorporated a mixture of diverse computing components, including GPUs and FPGAs. However, the specialization of applications' code for heterogeneous execution is not a trivial task, as it requires developers to have hardware expertise in order to obtain high performance. The vast majority of the existing deep learning frameworks that support heterogeneous acceleration rely on the implementation of wrapper calls from a high-level programming language to a low-level accelerator backend, ...
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has increased the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. In recent years, hardware accelerators have been employed as a means to meet this demand, due to the high parallelism that these applications exhibit. Although this approach can yield high performance, the development of new deep learning neural networks on heterogeneous hardware requires a steep learning curve. The main reason is that existing deep learning engines support the static compilation of the accelerated code, which can be accessed via wrapper calls from a wide range of managed programming languages (e.g., Java, Python, Scala). Therefore, the development of high-performance neural network architectures is fragmented between programming models, thereby forcing developers to manually specialize the c...