This paper presents a fast new algorithm for modeling and reasoning about interferences for variables in a program without constructing an interference graph. It then describes how to use this information to minimize copy insertion for φ-node instantiation during the conversion of the static single assignment (SSA) form into the control-flow graph (CFG), effectively yielding a new, very fast copy coalescing and live-range identification algorithm. This paper proves some properties of the SSA form that enable construction of data structures to compute interference information for variables that are considered for folding. The asymptotic complexity of our SSA-to-CFG conversion algorithm is O(nα(n)), where n is the number of instructions in the program. Performing copy folding during the SSA-to-CFG conversion eliminates the need for a separate coalescing phase while simplifying the intermediate code. This may make graph-coloring register allocation more practical in just-in-time (JIT) and other time-critical compilers; for example, Sun's HotSpot Server Compiler already employs a graph-coloring register allocator [10]. This paper also presents an improvement to the classical interference-graph-based coalescing optimization that decreases memory usage by up to three orders of magnitude and compilation time by a factor of two, while producing exactly the same results. We present experimental results demonstrating that our algorithm is almost as precise (within one percent on average) as the improved interference-graph-based coalescing algorithm, while requiring one third of the compilation time.
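The O(nα(n)) bound points to a union-find (disjoint-set) structure underneath the interference bookkeeping. As a rough illustration only (the interference predicate and all names below are assumptions, not the paper's data structures), copy folding over SSA names might be sketched as:

```python
# A minimal sketch of copy folding with a union-find structure, the
# classic source of an O(n * alpha(n)) bound. Illustrative only; the
# 'interferes' predicate stands in for the paper's interference model.

class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path compression keeps the amortized cost at alpha(n).
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def fold_copies(copies, interferes):
    """Merge copy-related SSA names whose merged live ranges do not
    interfere. copies: iterable of (dst, src); interferes(a, b) is a
    caller-supplied predicate over current live-range representatives."""
    sets = DisjointSet()
    for dst, src in copies:            # e.g., copies inserted for phi-nodes
        a, b = sets.find(dst), sets.find(src)
        if a != b and not interferes(a, b):
            sets.union(a, b)           # dst and src now share one live range
    return sets
```

Path compression in `find` is what yields the inverse-Ackermann factor α(n) in the complexity bound.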
Research over the past five years has shown significant performance improvements from a technique called adaptive compilation. An adaptive compiler uses a compile-execute-analyze feedback loop to find the combination of optimizations and parameters that minimizes some performance goal, such as code size or execution time. Despite its ability to improve performance, adaptive compilation has not seen widespread use because of two obstacles: the large amount of time that such systems spend performing the many compilations and executions prohibits most users from adopting them, and the complexity inherent in a feedback-driven adaptive system makes it difficult to build and hard to use. A significant portion of the adaptive compilation process is devoted to multiple executions of the code being compiled. We have developed a technique called virtual execution to address this problem. Virtual execution runs the program a single time and preserves information that allows us to accurately predict the performance of different optimization sequences without running the code again. Our prototype implementation of this technique significantly reduces the time required by our adaptive compiler. In conjunction with this performance boost, we have developed a graphical user interface (GUI) that provides a controlled view of the compilation process. By providing appropriate defaults, the interface limits the amount of information that the user must provide to get started. At the same time, it lets the experienced user exert fine-grained control over the parameters that govern the system.
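One plausible reading of the virtual-execution idea, sketched here purely for illustration (the block-frequency profile and per-block cost model are assumptions, not the paper's actual mechanism), is to profile the code once and then price each candidate optimization sequence against that profile instead of re-running it:

```python
# Heavily hedged sketch: estimate a variant's running time from a
# one-time execution profile rather than another execution.

def estimate_runtime(block_counts, block_costs):
    """block_counts: one-time profile mapping block -> execution count;
    block_costs: block -> estimated cycles in the transformed code."""
    return sum(block_counts[b] * block_costs[b] for b in block_counts)
```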
The ParaScope Editor is an interactive parallel programming tool that assists knowledgeable users in developing scientific Fortran programs. It displays the results of sophisticated program analyses, provides a set of powerful interactive transformations, and supports program editing. This paper summarizes the experiences of scientific programmers and tool designers using the ParaScope Editor. We evaluate existing features and describe enhancements in three key areas: user interface, analysis, and transformation. Many existing features prove crucial to successful program parallelization, including interprocedural array side-effect analysis and program and dependence view filtering. Desirable new functionality includes improved program navigation based on performance estimation, incorporation of user assertions in analysis, and more guidance in selecting transformations. These results offer insights for the authors of a variety of programming tools and parallelizing compilers.
Static single assignment (SSA) form is a program representation that is becoming increasingly popular for compiler-based code optimization. In this paper, we address three problems that have arisen in our use of SSA form. Two are variations to the SSA construction algorithms presented by Cytron et al. The first variation is a version of SSA
The iterative algorithm is widely used to solve instances of data-flow analysis problems. The algorithm is attractive because it is easy to implement and robust in its behavior. The theory behind the iterative algorithm establishes a set of conditions under which the algorithm runs in at most d(G)+3 passes over the graph - a round-robin algorithm, running a "rapid" framework, on
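For concreteness, a minimal round-robin solver for one classic data-flow problem, live variables, is sketched below; the block ordering and the use/def interface are assumptions made for illustration, not the paper's framework:

```python
# A minimal round-robin iterative data-flow solver, instantiated for
# live-variable analysis. Each outer loop is one pass over the graph.

def live_variables(blocks, succ, use, defs):
    """blocks: list in reverse postorder; succ: block -> successor list;
    use/defs: block -> sets of variable names. Iterates to a fixed point."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(blocks):       # backward problem: visit in postorder
            out = (set().union(*(live_in[s] for s in succ[b]))
                   if succ[b] else set())
            new_in = use[b] | (out - defs[b])
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in, live_out
```

For a "rapid" framework on a reducible graph, the round-robin order bounds the pass count by the loop-connectedness d(G) plus a small constant, which is the result the abstract cites.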
All graph-coloring register allocators rely on heuristics to arrive at a "good" answer to the NP-complete problem of register allocation, and the resulting suboptimal allocations produce spill code. We examine a post-pass to the allocator that removes unnecessary spill code by finding places where the availability of an unused register allows us to "promote" a spill to a register. We explain and correct an error in Briggs' spill-code insertion algorithm that sometimes inserts more spill instructions than necessary. This fix has an insignificant impact on the runtime of the compiler and never degrades the runtime of the code produced. We suggest minimizing the impact of the remaining spill code with a small separate memory dedicated to spills and under the exclusive control of the compiler. We show an algorithm and experimental results that suggest this hardware construct would significantly decrease the runtime of the code.
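A hedged sketch of the promotion idea follows; the interval representation and the occupancy test are assumptions for illustration, not the paper's algorithm. The point it shows: a spilled value whose entire live range fits in an otherwise unused register can have its spill code deleted.

```python
# Sketch: promote spills into registers that are free over the whole
# spilled live range. Intervals are (value, start, end) instruction
# indices; occupied() is a caller-supplied query over the allocation.

def promote_spills(spilled_ranges, num_regs, occupied):
    """occupied(reg, start, end) -> True if reg is in use anywhere
    in [start, end] under the existing allocation."""
    promotions = {}
    taken = []                           # (reg, start, end) claimed this pass
    for value, start, end in spilled_ranges:
        for reg in range(num_regs):
            clash = occupied(reg, start, end) or any(
                r == reg and not (end < s or e < start)
                for r, s, e in taken)    # avoid double-booking within the pass
            if not clash:
                promotions[value] = reg  # its spill loads/stores become dead
                taken.append((reg, start, end))
                break
    return promotions
```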
Most modern compilers operate by applying a fixed, program-independent sequence of optimizations to all programs. Compiler writers choose a single "compilation sequence", or perhaps a couple of compilation sequences. In choosing a sequence, they may consider the performance of benchmarks or other important codes. These sequences are intended as general-purpose tools, accessible through command-line flags such as -O2 and -O3. Specific compilation sequences make a significant difference in the quality of the generated code, whether compiling for speed, for space, or for other metrics. A single universal compilation sequence does not produce the best results over all programs [8, 10, 29, 32]. Finding an optimal program-specific compilation sequence is difficult because the space of potential sequences is huge and the interactions between optimizations are poorly understood. Moreover, there has been no systematic exploration of the costs and benefits of searching for good (i.e., within a certain percentage of optimal) program-specific compilation sequences. In this paper, we perform a large experimental study of the space of compilation sequences over a set of known benchmarks, using our prototype adaptive compiler. Our goal is to characterize these spaces and to determine whether it is cost-effective to construct custom compilation sequences. We report on five exhaustive enumerations which demonstrate that 80% of the local minima in the space are within 5 to 10% of the optimal solution. We describe three algorithms tailored to search such spaces and report on experiments that use these algorithms to find good compilation sequences. These experiments suggest that properties observed in the enumerations hold for larger search spaces and larger programs. Our findings indicate that for the cost of 200 to 4,550 compilations, we can find custom sequences that are 15 to 25% better than the human-designed fixed sequence originally used in our compiler.
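As a sketch of what searching this space involves (the pass names and the evaluation callback are assumptions; this does not reproduce the paper's three algorithms), a simple hill climber over fixed-length sequences looks like:

```python
# Sketch: hill climbing over compilation sequences. Each probe compiles
# and evaluates the program under one candidate sequence.

import random

def hill_climb(passes, length, evaluate, steps=200):
    """passes: available optimization names; evaluate(seq) -> cost,
    lower is better. Each step swaps one position to a random pass
    and keeps the change only if it improves the cost."""
    current = [random.choice(passes) for _ in range(length)]
    cost = evaluate(current)
    for _ in range(steps):
        candidate = list(current)
        candidate[random.randrange(length)] = random.choice(passes)
        c = evaluate(candidate)
        if c < cost:                     # keep strictly better neighbors
            current, cost = candidate, c
    return current, cost

# Hypothetical usage: pass names and cost function are invented.
# best, cost = hill_climb(["lvn", "gvn", "dead", "coalesce"], 10, my_cost)
```

The enumeration result quoted above (most local minima within 5 to 10% of optimal) is exactly the property that makes such local searches attractive.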
Keith D. Cooper, Timothy J. Harvey, and Linda Torczon, Rice University, 6100 S. Main MS 132, Houston, Texas 77005, USA. ... Across the entire suite of 169 routines from parts of the Spec benchmark suite and Forsythe, Malcolm, and Moler's small library of ...
Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a reasonable level of success. The primary limit on the compiler's ability to improve memory behavior is its imperfect knowledge about the run-time behavior of the program; the compiler cannot completely predict runtime access patterns. There is an exception to this rule. During the register allocation phase, the compiler often must insert substantial amounts of spill code, that is, instructions that move values from registers to memory and back again. Because the compiler itself inserts these memory instructions, it has more knowledge about them than about other memory operations in the program. Spill-code operations are disjoint from the memory manipulations required by the semantics of the program being compiled, and, indeed, the two can interfere in the cache. This paper proposes a hardware solution to the problem of increased spill costs: a small compiler-controlled memory (CCM) to hold spilled values. This small random-access memory can (and should) be placed in a distinct address space from the main memory hierarchy. The compiler can target spill instructions to use the CCM, moving most compiler-inserted memory traffic out of the pathway to main memory and eliminating any impact that those spill instructions would have on the state of the main memory hierarchy. Such memories already exist on some DSP microprocessors, and our techniques can be applied directly on those chips. This paper presents two compiler-based methods to exploit such a memory, along with experimental results showing that speedups from using the CCM may be sizable. It shows that using the register allocator's coloring paradigm to assign spilled values to memory can greatly reduce the amount of memory required by a program. Using the CCM for spills should shorten spill latencies and let the scheduler place the load for a spilled value next to its use, speeding up execution and shortening the live range created for the spilled value. Spilling to the CCM removes spill traffic from the path to main memory. If the system has a cache memory, spilling to the CCM should also eliminate any cache pollution introduced by spill operations: loads and stores that can interfere directly with the cache behavior "planned" by high-level, compiler-based transformations that exploit locality caused by regular accesses in loop nests [8, 27, 10].
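The closing observation, that a coloring-style assignment can pack spilled values into few memory locations, can be illustrated with a greedy interval-coloring sketch; the interval representation below is an assumption for illustration, not the paper's implementation:

```python
# Sketch: assign spilled values to CCM slots so that spills with
# non-overlapping live ranges share a slot, minimizing CCM footprint.

def assign_ccm_slots(spill_ranges):
    """spill_ranges: list of (value, start, end) instruction intervals.
    Greedy interval coloring: process by start point; each value takes
    the lowest-numbered slot that is free again at its start."""
    slots_end = []                       # slots_end[s] = last use of slot s
    assignment = {}
    for value, start, end in sorted(spill_ranges, key=lambda r: r[1]):
        for s, last in enumerate(slots_end):
            if last < start:             # slot s is free: reuse it
                slots_end[s] = end
                assignment[value] = s
                break
        else:
            assignment[value] = len(slots_end)   # open a new slot
            slots_end.append(end)
    return assignment                    # len(slots_end) slots suffice
```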
A growing body of literature on adaptive compilation suggests that using program-specific [7] or function-specific [24] compilation sequences can produce consistent improvements over compiling the same code with a traditional fixed-sequence compiler [18, 1, 27, 24]. The early work on this problem used genetic algorithms (GAs) [7]. GAs find good solutions to these problems; however, they must probe the search space thousands of times, and each probe compiles and evaluates the code. To build a practical compiler that discovers good compilation sequences, we need techniques that find good sequences with much less effort than the GAs require. To find such techniques, we embarked on a detailed study of the search spaces in which the compiler operates. By understanding the properties of these spaces, we can design more effective searches. This paper focuses on effective search algorithms for the problem of choosing compilation sequences: an ordered list of optimizations to apply to the input program. It summarizes the search-space properties that we discovered in our studies. It presents and evaluates two new search methods, designed with knowledge of the search-space properties, and compares them against the best sequence-finding GA that we have developed. Our new search methods can find good solutions with 400 to 600 probes; the first GA for sequence finding required 10,000 to 20,000 probes [7], and our most effective GA runs for 2,300 probes. The strength of these results validates our paradigm: learn about the spaces and use that knowledge to improve the search techniques.
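One of the simplest methods in this spirit is a greedy constructive search, sketched below under assumed interfaces (the evaluate callback and pass list are illustrative, and this is not necessarily either of the paper's two methods):

```python
# Sketch: greedy constructive search. Grow the sequence one pass at a
# time, keeping whichever extension evaluates best; probes grow as
# len(passes) * length rather than the thousands a GA may need.

def greedy_construct(passes, length, evaluate):
    """passes: available optimization names; evaluate(seq) -> cost,
    lower is better. Returns the constructed sequence and its cost."""
    seq = []
    best_cost = float("inf")
    for _ in range(length):
        best_pass = None
        best_cost = float("inf")
        for p in passes:                 # try each pass as the next step
            c = evaluate(seq + [p])
            if c < best_cost:
                best_pass, best_cost = p, c
        seq.append(best_pass)
    return seq, best_cost
```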
Inline substitution is an optimization that replaces a procedure call with the body of the procedure that it calls. Inlining has the immediate benefit of reducing the overhead associated with the call, including register saves and restores, parameter evaluation, and activation record setup and teardown. It has secondary benefits that arise from providing greater context for global optimizations. These benefits can be offset by the effects of increased code size and by deleterious interactions with other optimizations, such as register allocation. The difficult aspect of inline substitution is choosing which calls to inline. Previous work has focused on static, one-size-fits-all heuristics. This paper presents a feedback-driven adaptive scheme that derives a program-specific inlining heuristic. The key contributions of this work are: (1) a novel parameterization scheme for the inliner that makes it amenable to fine-grained external control, (2) a scheme for discretizing large integer parameter spaces, and (3) effective search techniques for the resulting search space. This work provides a proof of concept that can inform the design of adaptive controllers for other optimizations with complex decision heuristics. Our goal in this work is not to exhibit the world's best inliner; instead, we present evidence that a program-specific, adaptive scheme is needed to achieve the best results.
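To make contribution (1) concrete, a toy parameterized heuristic is sketched below; every field and threshold name is an invented assumption, meant only to show how externally tuned parameters can drive per-call-site decisions:

```python
# Sketch: a per-call-site inlining decision driven entirely by external
# parameters, so an adaptive controller can tune them per program.

def should_inline(site, params):
    """site: call-site features; params: externally tuned thresholds."""
    if site["callee_size"] > params["max_callee_size"]:
        return False                     # hard cap on code growth
    score = (site["call_count"] * params["count_weight"]
             + site["loop_depth"] * params["depth_weight"])
    return score >= params["inline_threshold"]

# Hypothetical usage: an adaptive search supplies params; the inliner
# itself only reads them.
params = {"max_callee_size": 120, "count_weight": 1.0,
          "depth_weight": 10.0, "inline_threshold": 25.0}
site = {"callee_size": 40, "call_count": 8, "loop_depth": 2}
print(should_inline(site, params))       # True under these invented values
```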
A variety of applications have arisen where it is worthwhile to apply code optimizations directly to the machine code (or assembly code) produced by a compiler. These include link-time whole-program analysis and optimization, code compression, binary-to-binary translation, and bit-transition reduction (for power). Many, if not most, optimizations assume the presence of a control-flow graph (cfg). Compiled, scheduled code has properties that can make cfg construction more complex than it is inside a typical compiler. In particular, branch-to-register operations can introduce spurious edges into the cfg, and if branch delay slots contain other branches, the classic algorithms for building a cfg produce incorrect results. This paper uses two simple examples to explain the problem. It presents an algorithm for building correct cfgs from scheduled assembly code that includes branches in branch delay slots. The algorithm works by building an approximate cfg and then refining it to reflect the actions of delayed branches. If all branches have explicit targets, the complexity of the refining step is linear in the number of branches in the code.
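For context, the "approximate cfg" that the refinement step starts from can be built with the classic leader-based construction, sketched here under an assumed instruction encoding (the delay-slot refinement itself is not shown):

```python
# Sketch: classic leader-based cfg construction over a linear
# instruction list. Encoding is assumed: each instruction is
# (opcode, branch_target_index_or_None); "br" is unconditional,
# "cbr" conditional. Assumes a non-empty, fully reachable list.

def build_cfg(instrs):
    leaders = {0}                        # first instruction starts a block
    for i, (op, target) in enumerate(instrs):
        if op in ("br", "cbr"):
            if target is not None:
                leaders.add(target)      # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # so does the fall-through point
    starts = sorted(leaders)
    ends = {s: next((x for x in starts if x > s), len(instrs))
            for s in starts}             # block s spans [s, ends[s])
    edges = []
    for s in starts:
        op, target = instrs[ends[s] - 1]
        if op in ("br", "cbr") and target is not None:
            edges.append((s, target))    # taken edge
        if op != "br" and ends[s] < len(instrs):
            edges.append((s, ends[s]))   # fall-through edge
    return starts, edges
```

A delayed branch breaks the assumption that the last instruction of a block determines its successors, which is precisely what the paper's refinement pass repairs.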
The problem of finding the dominators in a control-flow graph has a long history in the literature. The original algorithms suffered from a large asymptotic complexity but were easy to understand. Subsequent work improved the time bound, but generally sacrificed both simplicity and ease of implementation. This paper returns to a simple formulation of dominance as a global data-flow problem. Some insights into the nature of dominance lead to an implementation of an O(N^2) algorithm that runs faster, in practice, than the classic Lengauer-Tarjan algorithm, which has a time bound of O(E log N). We compare the algorithm to Lengauer-Tarjan because it is the best known and most widely used of the fast algorithms for dominance. Working from the same implementation insights, we also rederive (from earlier work on control dependence by Ferrante et al.) a method for calculating dominance frontiers that we show is faster than the original algorithm by Cytron et al. The aim of this paper is not to present a new algorithm but, rather, to make an argument, based on empirical evidence, that algorithms with discouraging asymptotic complexities can be faster in practice than those more commonly employed. We show that, in some cases, careful engineering of simple algorithms can overcome theoretical advantages, even when problems grow beyond realistic sizes. Further, we argue that the algorithms presented herein are intuitive and easily implemented, making them excellent teaching tools.
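The flavor of the data-flow formulation can be sketched as follows: immediate dominators are computed iteratively in reverse postorder, with a two-finger "intersect" that walks two dominator paths upward using postorder numbers. The node encoding is simplified for illustration, and every node is assumed reachable from the start node:

```python
# Sketch of the iterative dominator computation the paper advocates.
# rpo: nodes in reverse postorder (start node first); preds: node -> preds.

def compute_idoms(rpo, preds, start):
    po = {b: len(rpo) - 1 - i for i, b in enumerate(rpo)}  # postorder numbers
    idom = {start: start}

    def intersect(a, b):
        while a != b:                    # walk both fingers up the tree
            while po[a] < po[b]:
                a = idom[a]
            while po[b] < po[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for b in rpo[1:]:
            # at least one predecessor is already processed in RPO
            processed = [p for p in preds[b] if p in idom]
            new_idom = processed[0]
            for p in processed[1:]:
                new_idom = intersect(p, new_idom)
            if idom.get(b) != new_idom:
                idom[b] = new_idom
                changed = True
    return idom
```

The engineering point of the paper is visible here: the per-node work is a handful of array lookups and comparisons, which is why the simple algorithm wins in practice despite its weaker asymptotic bound.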