Papers by Philippe Clauss
Combinatorics, Probability & Computing, 2001
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this work, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. It is based on the minimization of strides occurring while accessing array elements from an affine reference function. …
Procedia Computer Science, 2013
Speculative parallelization is a classic strategy for automatically parallelizing codes that cannot be handled at compile-time due to the use of dynamic data and control structures. Another motivation for being speculative is to adapt the code to the current execution context, by selecting at run-time an efficient parallel schedule. However, since this parallelization scheme requires on-the-fly semantics verification, it is in general difficult to perform advanced transformations for optimization and parallelism extraction. We propose a framework dedicated to the speculative parallelization of scientific nested loop kernels, able to transform the code at runtime by re-scheduling the iterations to exhibit parallelism and data locality. The run-time process includes a transformation selection guided by profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are interpolated to build a predictor of the forthcoming accesses. The collected addresses are also used to compute on-the-fly dependence distance vectors by tracking accesses to common addresses. Interpolating functions and distance vectors are then employed in dynamic dependence analysis and in selecting a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. In order to ensure that the rollback time overhead stays low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallelized version, a sequential original version, or an instrumented version. Moreover, such slicing of the execution provides the opportunity of transforming the code differently to adapt to the observed execution phases. Parallel code generation is achieved almost at no cost by using binary code patterns that are generated at compile-time and that are simply patched at run-time to result in the transformed code.
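The interpolation step can be pictured with a small sketch. The following C fragment is a minimal illustration under assumed simplifications, not the framework's actual code: all names (affine_fn, interpolate, predict_addr) are hypothetical, and it handles a single memory reference whose profiled addresses are assumed to follow a constant stride.

    /* Minimal sketch: fit an affine access function addr(i) = base + stride*i
     * to a few profiled addresses, then use it to predict future accesses.
     * All names are illustrative, not taken from the actual framework. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        intptr_t base;    /* interpolated address at iteration 0 */
        intptr_t stride;  /* constant difference between iterations */
        int      linear;  /* 1 if the samples fit an affine function */
    } affine_fn;

    /* Interpolate from profiled per-iteration addresses (n >= 2). */
    affine_fn interpolate(const intptr_t *addrs, int n)
    {
        affine_fn f = { addrs[0], addrs[1] - addrs[0], 1 };
        for (int i = 2; i < n; i++)             /* verify constant stride */
            if (addrs[i] - addrs[i - 1] != f.stride)
                f.linear = 0;
        return f;
    }

    /* Predicted address of a future iteration; valid only if f.linear. */
    intptr_t predict_addr(affine_fn f, int iter)
    {
        return f.base + f.stride * (intptr_t)iter;
    }

    int main(void)
    {
        intptr_t trace[] = { 0x1000, 0x1008, 0x1010, 0x1018 };
        affine_fn f = interpolate(trace, 4);
        if (f.linear)
            printf("iteration 10 -> %#lx\n", (long)predict_addr(f, 10));
        return 0;
    }

In the same spirit, a dependence distance between two such references follows from the iteration gap observed between accesses to a common address; the real system computes this on-the-fly during the profiling slices.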
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness, since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for the same array. This can significantly improve performance, since a priori incompatible references can be simultaneously optimized. Finally, the process considers not only the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.
USENIX Technical Conference, 1997
Optimizing parallel compilers need to be able to analyze nested loop programs with parametric affine loop bounds in order to derive efficient parallel programs. The iteration spaces of nested loop programs can be modeled by polyhedra and systems of linear constraints. Using this model, important program analyses such as computing the number of flops executed by a loop, computing the …
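For instance (a standard textbook example, not taken from the paper), the iteration count of a triangular loop nest with parametric bound n is given by an Ehrhart polynomial in n:

    % Worked illustration in LaTeX: counting the iterations of
    %   for (i = 0; i <= n; i++)
    %     for (j = 0; j <= i; j++)
    % over the parametric triangle {(i,j) : 0 <= j <= i <= n}.
    \[
      \#\{(i,j) \in \mathbb{Z}^2 \mid 0 \le j \le i \le n\}
      \;=\; \sum_{i=0}^{n} (i+1)
      \;=\; \frac{(n+1)(n+2)}{2}.
    \]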
An essential way to obtain high performance on today's computers consists in optimizing spatial and temporal locality of the data. Many scientific applications contain loop nests that operate on large multi-dimensional arrays whose sizes are often parameterized. In this work, temporal reuse occurring from several references in the same statement of a parameterized loop is modeled geometrically. More precisely, iterations are classified depending on the …
ACM SIGPLAN Notices, 2012
In this paper, we present a Thread-Level Speculation (TLS) framework whose main feature is to be able to speculatively parallelize a sequential loop nest in various ways, by re-scheduling its iterations. The transformation to be applied is selected at runtime with the goal of minimizing the number of rollbacks and maximizing performance. We perform code transformations by applying the polyhedral model, which we adapted for speculative and runtime code parallelization. For this purpose, we design a parallel code pattern which is patched by our runtime system according to the profiling information collected on some execution samples. Adaptability is ensured by considering chunks of code of various sizes that are launched successively, each of which is parallelized in a different manner or run sequentially, depending on the currently observed behavior for accessing memory. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
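The chunking scheme can be sketched as follows; this is an assumed structure with hypothetical stand-ins (select_version, run_chunk) for the real profiling, verification and code-patching machinery, not the authors' runtime itself.

    /* Sketch of chunked speculative execution; stand-in functions replace
     * the real profiling, verification and code-patching components. */
    #include <stdio.h>

    enum version { INSTRUMENTED, PARALLEL, SEQUENTIAL };

    /* Stand-in: the real system decides from the observed memory behavior. */
    enum version select_version(int chunk_id)
    {
        return chunk_id == 0 ? INSTRUMENTED : PARALLEL;
    }

    /* Stand-in: run iterations [lo, hi); returns 0 if speculation failed. */
    int run_chunk(enum version v, int lo, int hi)
    {
        printf("chunk [%d,%d) run as %s\n", lo, hi,
               v == PARALLEL ? "parallel" :
               v == SEQUENTIAL ? "sequential" : "instrumented");
        return 1; /* pretend speculation always succeeds in this sketch */
    }

    int main(void)
    {
        const int n = 100, chunk = 25;
        for (int lo = 0, id = 0; lo < n; lo += chunk, id++) {
            int hi = lo + chunk < n ? lo + chunk : n;
            if (!run_chunk(select_version(id), lo, hi)) /* misprediction:  */
                run_chunk(SEQUENTIAL, lo, hi);          /* replay it safely */
        }
        return 0;
    }

Bounding each speculative attempt to one chunk is what keeps the worst-case rollback cost proportional to the chunk size rather than to the whole loop.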
Lecture Notes in Computer Science, 2001
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. Such an objective classically translates into optimizing spatial and temporal data locality, especially for nested loops. In this paper, we focus on temporal data locality.
(IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, 2011
… Our experiments, conducted on the SPEC CPU 2006 and on the Pointer Intensive benchmark suite, with the O0 optimization level, reveal almost negligible overhead … The execution platform is a 3.4 GHz AMD Phenom II X4 965 microprocessor with 4 GB of RAM running Linux 2.6.32. …
Proceedings of International Conference on Application Specific Systems, Architectures and Processors: ASAP '96, 1996
In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and discrete geometry, linear programming, combinatorics and geometry of numbers.
ACM SIGARCH Computer Architecture News, 2000
One of the most efficient ways to improve program performance on today's computers is to optimize the way cache memories are used. In particular, many scientific applications contain loop nests that operate on large multi-dimensional arrays whose sizes are often parameterized. No special attention is paid to cache memory performance when such loops are written. In this work, we focus on spatial locality optimization such that all the data that are loaded as a block in the cache will be used successively by the program. Our method consists in providing a new array reference evaluation function to the compiler, such that the data layout corresponds exactly to the utilization order of these data. The computation of this function concerns the field of parameterized polyhedra and Ehrhart polynomials.
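As a rough illustration of the idea (plain C with made-up names, not the paper's Ehrhart-polynomial-based construction), remapping an array so that its storage order matches the loop's traversal order turns large strides into stride-1 accesses:

    /* A column-wise traversal of a row-major array jumps N elements per
     * access; copying the data into traversal order restores stride 1. */
    #include <stdlib.h>

    #define N 1024

    /* New reference evaluation function (illustrative): element (i,j) is
     * stored at its rank in the column-wise utilization order. */
    size_t remap(size_t i, size_t j) { return j * N + i; }

    void relayout(const double *a, double *b)
    {
        for (size_t j = 0; j < N; j++)          /* loop traversal order */
            for (size_t i = 0; i < N; i++)
                b[remap(i, j)] = a[i * N + j];  /* gather from row-major */
    }

    int main(void)
    {
        double *a = calloc(N * N, sizeof *a), *b = calloc(N * N, sizeof *b);
        if (a && b) { a[5 * N + 3] = 1.0; relayout(a, b); }
        free(a);
        free(b);
        return 0;
    }

After relayout, a column-wise computation scans b contiguously, so every cache line loaded is fully consumed before being evicted.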
IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP'06), 2006
In this paper, we propose to model memory access profile information as loop nests exhibiting useful characteristics of the memory behavior, such as periodicity, linearly linked memory access patterns and repetitions. It is shown that static analysis methods such as the polytope model approach can then be applied to the generated nested-loop representations. Moreover, the modeling loop nests can themselves be instrumented and run in order to generate further useful information that can also be modeled and analyzed.
Lecture Notes in Computer Science, 2008
Dynamic optimizers modify the binary code of programs at runtime by profiling and optimizing certain aspects of the execution. We present a completely software-based framework that dynamically optimizes programs for object-based Distributed Shared Memory (DSM) systems. In DSM systems, reducing the number of messages between nodes is crucial. Prefetching transfers data in advance from the storage node to the local node so that communication is minimized. Our framework uses a profiler and a dynamic binary rewriter that monitors the access behavior of the application and places prefetches where they are beneficial to speed up the application. In addition, we adapt the number of prefetches per request to best fit the application's behavior. Evaluation shows that the performance of our system is better than manual prefetching. The number of messages sent decreases by up to 89%. Performance gains of up to 73% can be observed on the benchmarks.
Chapman & Hall/CRC Computer & Information Science Series, 2011
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014
Data locality optimization is a well-known goal when handling programs that must run as fast as possible or use a minimum amount of energy. However, usual techniques never address the significant impact of the numerous stalled processor cycles that may occur when consecutive load and store instructions access the same memory location. We show that two versions of the same program may exhibit similar memory performance while performing very differently regarding their execution times, because of the stalled processor cycles generated by many pipeline hazards. We propose a new programming structure called "xfor", enabling explicit control of the way data locality is optimized in a program and thus of the number of stalled processor cycles. We show the benefits of xfor regarding execution time and energy saving.
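To give a flavor of the kind of control involved (plain C below, not the xfor syntax itself), interleaving two loop bodies with a chosen offset determines how far apart in time a store and the load that consumes it occur:

    /* Two loops writing then reading a[] are fused with an offset of 1:
     * each element is consumed right after it is produced, while still in
     * cache, yet the read of a[i-1] never targets a just-issued store to
     * the same location, which limits load/store pipeline hazards. */
    void fused(double *a, const double *x, double *y, int n)
    {
        for (int i = 0; i <= n; i++) {
            if (i < n)  a[i] = 2.0 * x[i];          /* first loop body  */
            if (i >= 1) y[i - 1] = a[i - 1] + 1.0;  /* second, offset 1 */
        }
    }

An xfor statement exposes such an offset as an explicit parameter of the loop header instead of hiding it in guards.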
Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization - CGO '08, 2008
This paper describes an algorithm that takes a trace (i.e., a sequence of numbers or vectors of numbers) as input, and from that produces a sequence of loop nests that, when run, produces exactly the original sequence. The input format is suitable for any kind of program execution trace, and the output conforms to standard models of loop nests. The first, most obvious, use of such an algorithm is for program behavior modeling for any measured quantity (memory accesses, number of cache misses, etc.). Finding loops amounts to detecting periodic behavior and provides an explanatory model. The second application is trace compression, i.e., storing the loop nests instead of the original trace. Decompression consists of running the loops, which is easy and fast. A third application is value prediction. Since the algorithm forms loops while reading input, it is able to extrapolate the loop under construction to predict further incoming values. Throughout the paper, we provide examples that explain our algorithms. Moreover, we evaluate trace compression and value prediction on a subset of the SPEC2000 benchmarks.
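The simplest one-dimensional case can be sketched as follows; this is an illustrative reduction of the algorithm that only recognizes flat affine runs, whereas the paper reconstructs full nested loops.

    /* Greedily compress a trace into (start, stride, count) runs; replaying
     * each run as a tiny affine loop regenerates the trace exactly. */
    #include <stdio.h>

    typedef struct { long start, stride, count; } run;

    /* Returns the number of runs written to out. */
    int compress(const long *t, int n, run *out)
    {
        int r = 0;
        for (int i = 0; i < n; ) {
            run cur = { t[i], 0, 1 };
            if (i + 1 < n) {
                cur.stride = t[i + 1] - t[i];
                while (i + cur.count < n &&
                       t[i + cur.count] == cur.start + cur.count * cur.stride)
                    cur.count++;
            }
            out[r++] = cur;
            i += cur.count;
        }
        return r;
    }

    int main(void)
    {
        long trace[] = { 4, 8, 12, 16, 100, 90, 80 };
        run runs[8];
        int r = compress(trace, 7, runs);
        for (int k = 0; k < r; k++)   /* each run is a one-line loop */
            printf("for c in [0,%ld): value = %ld + c*%ld\n",
                   runs[k].count, runs[k].start, runs[k].stride);
        return 0;
    }

Because a run is closed as soon as the affine pattern breaks, the same scan can also extrapolate the run under construction, which is exactly the value-prediction use mentioned above.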
2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012
This paper describes a tool using one or more executions of a sequential program to detect parallel portions of the program. The tool, called Parwiz, uses dynamic binary instrumentation, targets various forms of parallelism, and suggests distinct parallelization actions, ranging from simple directive tagging to elaborate loop transformations.
(IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, 2011
Memory profiling is useful for a variety of tasks, most notably to produce traces of memory accesses for cache simulation. However, instrumenting every memory access incurs a large overhead, in the amount of code injected in the original program as well as in execution time. This paper describes how static analysis of the binary code can be used to reduce the amount of instrumentation. The analysis extracts loops and memory access functions by tracking how memory addresses are computed from a small set of base registers holding, e.g., routine parameters and loop counters. Instrumenting these base registers instead of memory operands reduces the weight of instrumentation, first statically by reducing the amount of injected code, and second dynamically by reducing the amount of instrumentation code actually executed. Also, because the static analysis extracts intermediate-level program structures (loops and branches) and access functions in symbolic form, it is easy to transform the original executable into a skeleton program that consumes base register values and produces memory addresses. The first advantage of using a skeleton is to be able to overlap the execution of the instrumented program with that of the skeleton, thereby reducing the overhead of recomputing addresses. The second advantage is that the skeleton program and its shorter input trace can be saved and rerun as many times as necessary without requiring access to the original architecture, e.g., for cache design space exploration.
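The division of labor between the short input trace and the skeleton can be pictured like this (an idealized illustration with invented names, not the tool's generated code):

    /* Instead of tracing every address of  a[i] = b[i] + c,  the
     * instrumented program records the base registers once per loop
     * entry; this skeleton replays the loop and recomputes every
     * address, e.g. to feed a cache simulator offline. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uintptr_t base_a, base_b; long n; } loop_entry;

    /* Skeleton: consumes one record, produces the full address stream. */
    void skeleton(const loop_entry *e)
    {
        for (long i = 0; i < e->n; i++) {
            printf("load  %#lx\n", (unsigned long)(e->base_b + 8 * i));
            printf("store %#lx\n", (unsigned long)(e->base_a + 8 * i));
        }
    }

    int main(void)
    {
        loop_entry rec = { 0x7f0000001000u, 0x7f0000002000u, 4 };
        skeleton(&rec);  /* re-runnable without the original machine */
        return 0;
    }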