ACM Transactions on Architecture and Code Optimization, 2021
While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty, even for a level-one data cache (L1 DC) with a single-cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from an L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word-filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% ...
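A minimal sketch of the lazy word-fill policy can be written as a small cache model. This is a hypothetical Python illustration under assumed parameters (32B lines, a direct-mapped 512B DFC, loads only), not the authors' implementation:

```python
# Hypothetical model of a lazily word-filled data filter cache (DFC).
# Assumptions: 32B lines (8 x 4B words), direct-mapped 512B DFC (16 lines).

WORDS_PER_LINE = 8
NUM_LINES = 16  # 512B / 32B

class LazyDFC:
    def __init__(self, l1_read):
        self.l1_read = l1_read  # callable that reads one word from the L1 DC
        # Each entry: tag, per-word valid bits, per-word data.
        self.lines = [{"tag": None,
                       "valid": [False] * WORDS_PER_LINE,
                       "data": [None] * WORDS_PER_LINE}
                      for _ in range(NUM_LINES)]
        self.l1_accesses = 0

    def load_word(self, addr):
        word = (addr >> 2) % WORDS_PER_LINE   # word within the line
        index = (addr >> 5) % NUM_LINES       # set index
        tag = addr >> 9                       # 5 offset bits + 4 index bits
        line = self.lines[index]
        if line["tag"] != tag:
            # Allocate the line but fill no words yet (lazy fill).
            line["tag"] = tag
            line["valid"] = [False] * WORDS_PER_LINE
        if not line["valid"][word]:
            # First reference to this word: fetch just this word from the L1 DC.
            line["data"][word] = self.l1_read(addr)
            line["valid"][word] = True
            self.l1_accesses += 1
        return line["data"][word]
```

With an eager fill, all eight words would be read from the L1 DC on every line miss; the lazy policy touches only the words the program actually references, which is the source of the energy savings the abstract quantifies.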
2016 International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2016
Energy efficiency is a first-order design goal for nearly all classes of processors, but it is particularly important in mobile and embedded systems. Data caches in such systems account for a large portion of the processor's energy usage, and thus techniques to improve the energy efficiency of the cache hierarchy are likely to have high impact. Our prior work reduced data cache energy via a tagless access buffer (TAB) that sits at the top of the cache hierarchy. Strided memory references are redirected from the level-one data cache (L1D) to the smaller, more energy-efficient TAB. These references need not access the data translation lookaside buffer (DTLB), and they can avoid unnecessary transfers from lower levels of the memory hierarchy. The original TAB implementation requires changing the immediate field of load and store instructions, necessitating substantial ISA modifications. Here we present a new TAB design that requires minimal instruction set changes, gives software m...
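A rough sketch of the TAB idea: a strided reference stream is redirected to a small tagless buffer so that most accesses bypass both the L1D and the DTLB. The buffer size and interface below are illustrative assumptions, not the paper's design:

```python
# Hypothetical sketch of a tagless access buffer (TAB) holding one L1D line.
# The compiler has already identified the reference stream as strided, so no
# tag check or DTLB lookup is needed on a buffer hit.

LINE_SIZE = 32  # bytes (assumed)

class TAB:
    def __init__(self, l1_read_line):
        self.l1_read_line = l1_read_line  # fetch a whole L1D line (bytes)
        self.base = None                  # address of the buffered line
        self.line = None

    def load(self, addr):
        line_addr = addr & ~(LINE_SIZE - 1)
        if line_addr != self.base:
            # Buffer miss: fetch the next line once; subsequent strided
            # references within it avoid both the L1D and the DTLB.
            self.base = line_addr
            self.line = self.l1_read_line(line_addr)
        offset = addr - self.base
        return self.line[offset:offset + 4]
```

For a unit-stride loop over 4-byte elements, 7 of every 8 accesses would hit in the buffer with 32B lines, so only 1 in 8 references pays the L1D access energy.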
Applications in embedded systems often need to meet specified timing constraints. It is advantageous to not only calculate the Worst-Case Execution Time (WCET) of an application, but to also perform transformations that attempt to reduce the WCET, since an application with a lower WCET will be less likely to violate its timing constraints. A compiler has been integrated with a timing analyzer to obtain the WCET of a program on demand during compilation. This environment is used to investigate three different types of compiler optimization techniques to reduce WCET. First, an interactive compilation system has been developed that allows a user to interact with a compiler and get feedback regarding the WCET. In addition, a genetic algorithm is used to automatically search for an effective optimization phase sequence to reduce the WCET. Second, a WCET code positioning optimization has been investigated that uses worst-case path information to reorder basic blocks so that the branch pen...
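The genetic-algorithm search for WCET-reducing phase sequences can be sketched as follows. This is a hypothetical skeleton: the phase names, sequence length, and compile_and_get_wcet (here a deterministic dummy) stand in for the real compiler and timing-analyzer loop:

```python
import random

# Hypothetical skeleton of a genetic search for an optimization phase
# sequence that minimizes WCET.

PHASES = ["cse", "loop_inv_motion", "strength_red", "dead_code", "copy_prop"]
SEQ_LEN, POP, GENS = 8, 20, 50

def compile_and_get_wcet(seq):
    # Stand-in for invoking the compiler and timing analyzer; a
    # deterministic dummy cost so the skeleton runs end to end.
    return hash(tuple(seq)) % 10000

def evolve():
    pop = [[random.choice(PHASES) for _ in range(SEQ_LEN)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=compile_and_get_wcet)       # lower WCET is fitter
        survivors = pop[:POP // 2]
        children = []
        while len(survivors) + len(children) < POP:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, SEQ_LEN)   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:            # occasional mutation
                child[random.randrange(SEQ_LEN)] = random.choice(PHASES)
            children.append(child)
        pop = survivors + children
    return min(pop, key=compile_and_get_wcet)
```

Each fitness evaluation is a full compile-plus-timing-analysis run, which is why obtaining the WCET on demand during compilation matters for making this search practical.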
Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES'15), 2015
Statically pipelined processors offer a new way to improve performance beyond that of a traditional in-order pipeline while simultaneously reducing energy usage, by enabling the compiler to control more fine-grained details of the program execution. This paper describes how a compiler can exploit the features of the static pipeline architecture to apply optimizations on transfers of control that are not possible on a conventional architecture. The optimizations presented in this paper include hoisting the target address calculations for branches, jumps, and calls out of loops; performing branch chaining between calls and jumps; hoisting the setting of return addresses out of loops; and exploiting conditional calls and returns. The benefits of performing these transfer-of-control optimizations include a 6.8% reduction in execution time and a 3.6% decrease in estimated energy usage.
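The loop-invariant hoisting of target-address calculations can be illustrated on a toy IR. The tuple-based instructions and the calc_target name below are illustrative, not the actual static pipeline ISA; the sketch only shows the transformation's shape, moving a target calculation that depends solely on a label into the loop preheader:

```python
# Hypothetical mini-IR sketch of hoisting loop-invariant target-address
# calculations out of a loop, as described in the abstract above.

def hoist_target_calcs(preheader, loop_body):
    """Move loop-invariant target-address calculations into the preheader."""
    remaining = []
    for inst in loop_body:
        # calc_target uses only an immediate label, so it is loop-invariant.
        if inst[0] == "calc_target" and inst not in preheader:
            preheader.append(inst)   # compute the target once, before the loop
        else:
            remaining.append(inst)
    return preheader, remaining

pre, body = hoist_target_calcs(
    preheader=[],
    loop_body=[("calc_target", "seq1", "L_exit"),   # invariant: hoisted
               ("add", "r2", "r2", "r3"),
               ("branch", "seq1", "cond")])
# pre  -> [("calc_target", "seq1", "L_exit")]
# body -> [("add", ...), ("branch", ...)]
```

On a conventional architecture the target calculation is fused into the branch and cannot be hoisted; exposing it as a separate effect is what makes this optimization possible.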
This paper introduces a new method for instruction cache analysis that outperforms conventional trace-driven methods. The new method, static cache simulation, analyzes a program for a given cache configuration and determines prior to execution time whether an instruction reference will always result in a cache hit or miss. At run time, counters are incremented to provide the execution frequency of portions of code. In addition, the cache behavior is simulated for references that could not be predicted statically. The dynamic simulation employs a novel view of the cache by updating local state information associated with code portions. The total number of cache hits and misses can be inferred from the frequency counters at program exit. Measurements taken from a variety of programs show that this new method speeds up cache analysis over conventional trace-driven methods by almost an order of magnitude. Thus, cache analysis with static cache simulation makes it possible to analyze the instruction cache behavior of longer and more realistic program executions.
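The always-hit/always-miss classification at the core of this approach can be sketched for a deliberately simplified case: straight-line code and a direct-mapped instruction cache. The real analysis works over the control-flow graph; the sizes below are illustrative assumptions:

```python
# Minimal sketch of static hit/miss classification (hypothetical, greatly
# simplified). Assumptions: 16B cache lines, 64 sets -> 1KB direct-mapped.

LINE, SETS = 16, 64

def classify(instr_addrs):
    """Label each referenced program line: if no other program line maps to
    its cache set, its first reference misses and all later ones hit, so no
    run-time simulation is needed; otherwise it must be simulated via the
    instrumented counters described in the abstract."""
    by_set = {}
    for addr in instr_addrs:
        by_set.setdefault((addr // LINE) % SETS, set()).add(addr // LINE)
    labels = {}
    for addr in instr_addrs:
        s = (addr // LINE) % SETS
        if len(by_set[s]) == 1:
            labels[addr] = "first-miss/always-hit"
        else:
            labels[addr] = "conflict (simulate dynamically)"
    return labels

print(classify([0, 4, 16, 1040]))  # 16 and 1040 conflict in set 1
```

Because most references fall into the statically decided categories, only the residue needs run-time simulation, which is where the order-of-magnitude speedup over full trace-driven simulation comes from.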
Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems - OM '01, 2001
Embedded systems often have real-time constraints. Traditional timing analysis statically determines the maximum execution time of a task or a program in a real-time system. These systems typically depend on the worst-case execution time of tasks in order to make static scheduling decisions so that tasks can meet their deadlines. Static determination of worst-case execution times imposes numerous restrictions on real-time programs, which include that the maximum number of iterations of each loop must be known statically. These restrictions can significantly limit the class of programs that would be suitable for a real-time embedded system. This paper describes work in progress that uses static timing analysis to aid in making dynamic scheduling decisions. For instance, different algorithms with varying levels of accuracy may be selected based on the algorithm's predicted worst-case execution time and the time allotted for the task. We represent the worst-case execution time of a function or a loop as a formula, where the unknown values affecting the execution time are parameterized. This parametric timing analysis produces formulas that can then be quickly evaluated at run time so dynamic scheduling decisions can be made with little overhead. Benefits of this work include expanding the class of applications that can be used in a real-time system, improving the accuracy of dynamic scheduling decisions, and more effective utilization of system resources.
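As an illustration of such a parametric formula (the coefficients below are made up for the example), suppose the analyzer bounds a loop at 42 cycles per iteration plus 17 cycles of fixed overhead, with the iteration count n unknown until run time:

```python
# Hypothetical parametric WCET formula: WCET(n) = 17 + 42 * n cycles.
# Cheap to evaluate at run time, enabling the scheduling decision below.

def wcet_cycles(n):
    return 17 + 42 * n

def pick_algorithm(time_allotted_cycles, n):
    """Choose the more accurate algorithm only when its predicted WCET
    fits in the task's time allotment (the use case in the abstract)."""
    if wcet_cycles(n) <= time_allotted_cycles:
        return "accurate_algorithm"
    return "fast_approximate_algorithm"

print(pick_algorithm(5000, 100))  # 4217 <= 5000 -> "accurate_algorithm"
```

Evaluating the formula costs a couple of arithmetic operations, which is what keeps the dynamic scheduling overhead low.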
Hawaii International Conference on System Sciences, 1998
Most compiler optimizations focus on saving time and sometimes occur at the expense of increasing size. Yet processor speeds continue to increase at a faster rate than main memory and disk access times. Processors are now frequently being used in embedded systems that often have strict limitations on the size of programs they can execute. Also, reducing the size of ...
This paper describes xvpodb, a visualization tool developed to support the analysis of optimizations performed by the vpo optimizer. The tool is a graphical optimization viewer that can display the state of the program representation before and after sequences of changes, referred to as transformations, that result in semantically equivalent (and usually improved) code. The information and insight such visualization provides can simplify the debugging of problems with the optimizer. Unique features of xvpodb include reverse viewing (or undoing) of transformations and the ability to stop at breakpoints associated with the generated instructions. The viewer facilitates the retargeting of vpo to a new machine, supports experimentation with new optimizations, and has been used as a teaching aid in compiler classes.
Debugging is an integral part of the software development cycle and can account for up to 50% of the development time of an application. This paper discusses some of the challenges specific to real-time debugging. It explains how developing real-time applications can be supported by an environment which addresses the issues of time deadline monitoring and distortion due to the interference of debugging. The current implementation of this environment provides the elapsed time during debugging on request at breakpoints. This time information corresponds to the elapsed execution time since program initiation. Delays due to the interference of the debugger, for example input delays at breakpoints, are excluded from the time estimates. The environment includes a modified compiler and a static cache simulator, which together produce instrumented programs for the purpose of debugging. The instrumented program supports source-level debugging of optimized code and an efficient cache simulation to provide timing information at execution time. An instrumented program runs only approximately 1 to 4 times slower than the corresponding unoptimized program. Conventional hardware simulators could alternatively be used to obtain the same information but would run much slower. The environment facilitates the debugging of real-time applications. It allows the monitoring of deadlines, helps to locate the first task which misses a deadline, and supports the search for code portions which account for most of the execution time. This facilitates hand-tuning of selected tasks to make a schedule feasible.
ACM Transactions on Architecture and Code Optimization, 2013
Francisco Cazorla, Barcelona Supercomputing Center, Spain; Albert Cohen, INRIA, France; Alex Veidenbaum, UC Irvine, USA; David Whalley, Florida State University, USA; Derek Chiou, University of Texas at Austin, USA; Marcelo Cintra, University of Edinburgh, UK; Nikil Dutt, UC Irvine, USA; Rajeev Balasubramonian, University of Utah, USA; Rudolph Eigenmann, Purdue University, USA; Alex Ramirez, Barcelona Supercomputing Center, Spain; Alexandru Nicolau, UC Irvine, USA; Bruce Childers, University of Pittsburgh, USA; Bruce Jacob, University of Maryland, USA; David Gregg, Trinity College Dublin, Ireland; Francois Bodin, University of Rennes, France; Guang Gao, University of Delaware, USA; Jacqueline Chame, ISI, USA; Jaejin Lee, Seoul National University, South Korea; Mary Hall, University of Utah, USA; Mattan Erez, University of Texas at Austin, USA; Olivier Temam, INRIA, France; Ramon Canal, UPC, Spain; Robert Hundt, Google, USA; Sandhya Dwarkadas, University of Rochester, USA; Yoav Etsion, Technion, Israel; Abhishek Bhattacharjee, Rutgers University, USA; Andreas Moshovos, University of Toronto, Canada; Angelos Bilas, University of Crete, Greece; Babak Falsafi, EPFL, Switzerland; Bjorn De Sutter, Ghent University, Belgium; Björn Franke, University of Edinburgh, UK; Chandra Krintz, UC Santa Barbara, USA; David Kaeli, Northeastern University, USA; Glenn Reinman, UCLA, USA; Ian Watson, University of Manchester, UK; Iris Bahar, Brown University, USA; Lawrence Rauchwerger, Texas A&M University, USA; Lixin Zhang, Institute of Computing Technology, China; Luca Benini, Università di Bologna, Italy; Mahmut Kandemir, Penn State University, USA; Mikko Lipasti, University of Wisconsin–Madison, USA; Murali Annavaram, University of Southern California, USA; Nacho Navarro, BSC-UPC, Spain
ACM Transactions on Embedded Computing Systems, 2008
The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative features. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.
ACM Transactions on Embedded Computing Systems, 2010
Embedded systems with real-time constraints depend on a priori knowledge of worst-case execution times (WCETs) to determine if tasks meet deadlines. Static timing analysis derives bounds on WCETs but requires statically known loop bounds. This work removes the constraint on known loop bounds through parametric analysis, expressing WCETs as functions. Tighter WCETs are dynamically discovered, and the resulting slack is exploited by dynamic voltage scaling (DVS), saving 60% to 82% energy over DVS-oblivious techniques and achieving savings close to those of more costly dynamic-priority DVS algorithms. Overall, parametric analysis expands the class of real-time applications to programs with loop-invariant dynamic loop bounds while retaining tight WCET bounds.
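A minimal sketch of how such run-time slack might drive DVS, under assumed frequency levels (an illustration, not the paper's algorithm):

```python
# Hypothetical slack exploitation with DVS: scale frequency so the
# remaining worst case just fits the deadline.

FREQS_MHZ = [200, 400, 600, 800, 1000]  # assumed available levels

def pick_frequency(wcet_cycles_remaining, time_to_deadline_s):
    """Lowest frequency that still finishes the worst case on time."""
    for f in FREQS_MHZ:  # ascending: prefer the lowest (least energy)
        if wcet_cycles_remaining / (f * 1e6) <= time_to_deadline_s:
            return f
    return FREQS_MHZ[-1]  # no slack: run flat out

# With 3e8 cycles of worst case left and 0.5s to the deadline,
# pick_frequency(3e8, 0.5) returns 600 (MHz), since 3e8 / 6e8 = 0.5s.
```

The parametric formulas are what make wcet_cycles_remaining cheap to recompute as loop bounds become known at run time.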
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016
Conventional set-associative data cache accesses waste energy since tag and data arrays of several ways are simultaneously accessed to sustain pipeline speed. Different access techniques to avoid activating all cache ways have been previously proposed in an effort to reduce energy usage. However, a problem that many of these access techniques have in common is that they need to access different cache memory portions in a sequential manner, which is difficult to support with standard synchronous SRAM memory. We propose the speculative halt-tag access (SHA) approach, which accesses low-order tag bits, i.e., the halt tag, in the address generation stage instead of the SRAM access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it on average reduces data access energy by 25.6%.
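A toy model of the halt-tag check (sizes and names below are illustrative assumptions, not the actual implementation):

```python
# Hypothetical halt-tag filter: compare a few low-order tag bits per way
# early (address-generation stage) and only enable the ways that match.

HALT_BITS = 4  # low-order tag bits checked early (assumed width)
WAYS = 4

def ways_to_enable(addr_tag, halt_tags):
    """Return which ways must be fully accessed: a way whose stored halt
    tag differs cannot possibly hold the data, so it is halted."""
    halt = addr_tag & ((1 << HALT_BITS) - 1)
    return [w for w in range(WAYS) if halt_tags[w] == halt]

# If only ways 0 and 2 match, a 4-way access activates two tag arrays and
# two data arrays instead of four of each, saving energy.
print(ways_to_enable(0b101101, [0b1101, 0b0001, 0b1101, 0b0110]))  # [0, 2]
```

Because the comparison happens a stage early, the way-enable signals are ready in time for ordinary synchronous SRAMs, which is the practical advantage over sequential-access schemes.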
Indirect jumps from tables are traditionally only generated by compilers as an intermediate code generation decision when translating multiway selection statements. However, making this decision during intermediate code generation poses problems. The research described in this paper resolves these problems by using several types of static analysis as a framework for a code-improving transformation that exploits indirect jumps from tables. First, control-flow analysis is performed that provides...
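For reference, the shape of the table-based dispatch such a transformation produces can be sketched with Python standing in for the generated code (handler names and bounds are illustrative; the real transformation operates on low-level IR):

```python
# Illustrative table-based dispatch for a dense multiway selection:
# one bounds check plus one indirect jump replaces a cascade of branches.

def handle_a(): return "a"
def handle_b(): return "b"
def handle_c(): return "c"

JUMP_TABLE = [handle_a, handle_b, handle_c]
LOW, HIGH = 0, 2  # dense case-value range found by static value analysis

def dispatch(x):
    if LOW <= x <= HIGH:                 # single bounds check
        return JUMP_TABLE[x - LOW]()     # indirect jump through the table
    return None                          # default case
```

Performing this as a late, analysis-driven transformation rather than an intermediate code generation decision lets the compiler exploit tables in cases the front end could not recognize.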