Papers by Horia Calborean
In this paper we continue our work on detecting and predicting unbiased branches, extending the qualitative and quantitative analysis from procedural C benchmarks to Java benchmarks, which are entirely object-oriented programs. We focus on two directions: first, based on a simple example from the Perm benchmark of the Stanford suite, we show that by extending the context information some branches become fully biased in certain contexts, thus diminishing the frequency of unbiased branches at benchmark level. Second, we use some state-of-the-art branch predictors to predict the unbiased branches. Following these aims, we developed the ABPS tool (Advanced Branch Prediction Simulator), an original simulator written in Java that performs trace-driven simulation on 33 benchmarks from the Stanford, SPEC2000 and SPECJVM98 suites.
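As a hedged illustration of the detection step described above (not the actual ABPS code), the Java sketch below counts taken/not-taken outcomes per branch-and-history context and marks a context as unbiased when neither direction clearly dominates; the 0.95 polarization threshold and all class names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch: classify (branch, history) contexts as biased or unbiased
 *  from a taken/not-taken trace. Threshold and naming are illustrative only. */
public class BiasAnalyzer {
    // Assumed polarization threshold; ABPS may use a different value.
    private static final double POLARIZATION_THRESHOLD = 0.95;

    private static final class Counter { long taken, notTaken; }

    // Key = branch address combined with its (global) history context.
    private final Map<String, Counter> contexts = new HashMap<>();

    /** Record one dynamic branch outcome under a given history context. */
    public void record(long branchPc, long history, boolean taken) {
        String key = branchPc + ":" + history;
        Counter c = contexts.computeIfAbsent(key, k -> new Counter());
        if (taken) c.taken++; else c.notTaken++;
    }

    /** A context is unbiased if neither direction clearly dominates. */
    public long countUnbiasedContexts() {
        long unbiased = 0;
        for (Counter c : contexts.values()) {
            long total = c.taken + c.notTaken;
            double polarization = (double) Math.max(c.taken, c.notTaken) / total;
            if (polarization < POLARIZATION_THRESHOLD) unbiased++;
        }
        return unbiased;
    }
}
```

Extending the context (longer history) splits one context into several more specific ones, which is how some of them become fully biased as the abstract describes.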
ulbsibiu.ro
Multicore architectures are currently the most common solution for further increasing processing performance, since the methods for exploiting Instruction Level Parallelism (ILP) have reached a certain saturation point. However, we believe that multicores should still ...
IET Computers & Digital Techniques, 2012
This work extends an earlier manual design space exploration of our Selective Load Value Prediction based superscalar architecture to the L2 unified cache. We then perform an automatic design space exploration with a specially developed software tool by varying several architectural parameters. Our goal is to find optimal configurations in terms of CPI (cycles per instruction) and energy consumption. With the 19 architectural parameters we propose to vary, the design space exceeds 2.5 × 10^15 configurations, which means that only heuristic search can be considered. Therefore, we propose different methods of automatic design space exploration based on our FADSE tool, which allow us to evaluate only about 2500 configurations of this huge design space.
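To illustrate why only heuristic search is feasible here, the sketch below computes a design-space size as the product of the per-parameter value counts; with 19 parameters such a product quickly reaches the 10^15 scale mentioned above. The parameter names and counts are invented for the example and are not the ones varied in the paper.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

/** Illustrative only: a design space grows as the product of per-parameter
 *  value counts. The names and counts below are invented, not the paper's. */
public class DesignSpaceSize {
    public static void main(String[] args) {
        int[] valueCounts = {
            8,  // L1 data cache size
            8,  // L2 unified cache size
            4,  // cache associativity
            6,  // reorder buffer entries
            6,  // load/store queue entries
            4,  // fetch/decode/issue width
            5,  // load value predictor table size
            4   // branch predictor history length
        };
        BigInteger total = BigInteger.ONE;
        for (int count : valueCounts) {
            total = total.multiply(BigInteger.valueOf(count));
        }
        System.out.println("Full design space: " + total + " configurations");
        // A heuristic DSE that simulates ~2500 points touches only a tiny fraction.
        System.out.println("Fraction explored by 2500 evaluations: "
                + new BigDecimal(2500).divide(new BigDecimal(total), 12, RoundingMode.HALF_UP));
    }
}
```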
Advances in Intelligent Systems and Computing, 2013
In today's computer architectures the design spaces are huge, making it very difficult to find optimal configurations. One way to cope with this problem is to use Automatic Design Space Exploration (ADSE) techniques. We developed the Framework for Automatic Design Space Exploration (FADSE), which is focused on microarchitectural optimizations. This framework includes several state-of-the-art heuristic algorithms. In this paper we selected three of them, the genetic algorithms NSGA-II and SPEA2 as well as the particle swarm optimizer SMPSO, and compared their performance. As a test case we optimize the parameters of the Grid ALU Processor (GAP) microarchitecture, and then GAP together with the post-link code optimizer GAPtimize. An analysis of the simulation results shows very good performance for all three algorithms. SMPSO has the fastest convergence speed; a clear winner between NSGA-II and SPEA2 cannot be determined.
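All three algorithms rank candidate configurations by Pareto dominance over the objectives (for example CPI and energy). The Java sketch below shows the dominance test and the extraction of the first non-dominated front; it is a minimal illustration, not FADSE or jMetal code, and the objective values are invented.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of Pareto dominance and first-front extraction, the ranking
 *  idea shared by NSGA-II, SPEA2 and SMPSO. Objective vectors are {CPI, energy},
 *  both assumed to be minimized; values are made up for the example. */
public class ParetoFront {

    /** true if a dominates b: no worse in every objective, better in at least one. */
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    /** Return the configurations not dominated by any other (the first front). */
    static List<double[]> firstFront(List<double[]> population) {
        List<double[]> front = new ArrayList<>();
        for (double[] candidate : population) {
            boolean dominated = false;
            for (double[] other : population) {
                if (other != candidate && dominates(other, candidate)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) front.add(candidate);
        }
        return front;
    }

    public static void main(String[] args) {
        List<double[]> evaluated = List.of(
            new double[]{1.10, 35.0},   // {CPI, energy} of a simulated configuration
            new double[]{0.95, 42.0},
            new double[]{1.20, 50.0});  // dominated by the first point
        firstFront(evaluated).forEach(p ->
            System.out.println("CPI=" + p[0] + " energy=" + p[1]));
    }
}
```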
Lecture Notes in Computer Science, 2012
During development, processor architectures can be tuned and configured through many different parameters. For benchmarking, automatic design space explorations (DSEs) with heuristic algorithms are a helpful approach to find the best settings for these parameters according to multiple objectives, e.g. performance, energy consumption, or real-time constraints. But if the setup is slightly changed and a new DSE has to be performed, it starts from scratch, resulting in very long evaluation times. To reduce the evaluation times, in this article we extend the NSGA-II algorithm such that automatic DSEs can be supported by a set of transformation rules defined in a highly readable format, the Fuzzy Control Language (FCL). Rules can be specified by an engineer, thereby representing existing knowledge. Beyond this, a decision tree classifying high-quality configurations can be constructed automatically and translated into transformation rules. These rules can also be seen as a very valuable result of a DSE, because they allow drawing conclusions on the influence of parameters and describe regions of the design space with a high density of good configurations. Our evaluations show that automatically generated decision trees can classify near-optimal configurations for the hardware parameters of the Grid ALU Processor (GAP) and M-Sim 2. Further evaluations show that automatically constructed transformation rules can reduce the number of evaluations required to reach the same quality of results as without rules by 43%, saving about 25% of the exploration time. In the demonstrated example, using rules also leads to better results.
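The sketch below illustrates the transformation-rule idea in plain Java rather than FCL: after crossover and mutation, a rule encoding designer knowledge (or a path mined from the decision tree) nudges a candidate configuration toward a promising region. The rule, parameter names, and thresholds are invented for illustration and are not taken from the paper.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of rule-guided DSE: a transformation rule adjusts a candidate
 *  configuration after variation. The rule below is invented for illustration. */
public class RuleGuidedMutation {

    /** One transformation rule: if the condition holds, adjust the configuration. */
    interface TransformationRule {
        void apply(Map<String, Integer> config);
    }

    // Example rule: a wide issue width only pays off with a sufficiently large ROB.
    static final TransformationRule WIDE_ISSUE_NEEDS_ROB = config -> {
        if (config.getOrDefault("issueWidth", 2) >= 6
                && config.getOrDefault("robEntries", 32) < 128) {
            config.put("robEntries", 128);   // pull the candidate toward a promising region
        }
    };

    public static void main(String[] args) {
        Map<String, Integer> candidate = new HashMap<>();
        candidate.put("issueWidth", 8);
        candidate.put("robEntries", 64);

        WIDE_ISSUE_NEEDS_ROB.apply(candidate); // applied after crossover/mutation
        System.out.println(candidate);          // robEntries has been raised to 128
    }
}
```

A decision tree learned on already evaluated configurations can be translated into such rules by turning each path to a "good configuration" leaf into one condition-action pair.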
2011 International Conference on High Performance Computing & Simulation, 2011
Recent computer architectures can be configured in many different ways. To explore this huge design space, system simulators are typically used. Since performance is no longer the only decisive factor, and power consumption or the resource usage of the system also matter, it has become very hard for designers to select optimal configurations.
Acta Universitatis Cibiniensis …, 2011
Recent years have shown an increasing complexity of computer architectures, which has led to a growing number of configurable parameters. To explore this huge design space, system simulators are used. Since traditional manual exploration will not scale as the complexity grows, automatic tools are needed. We have developed a framework for automatic design space exploration called FADSE. In previous work we have already shown that FADSE is able to discover good design points on the GAP simulator using evolutionary algorithms such as NSGA-II, SPEA2 and SMPSO. This article presents an overview of our prior work together with an analysis of the reuse of previously simulated individuals.
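A minimal sketch of the reuse idea follows: configurations that the heuristic generates again are looked up instead of being re-simulated. FADSE persists its results in a database; the in-memory cache below, with its invented configuration key and objective values, only illustrates the mechanism.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Minimal sketch of reusing previously simulated individuals: identical
 *  configurations are looked up instead of re-simulated. Illustration only. */
public class SimulationCache {
    // Key: canonical string of the configuration; value: simulated objectives {CPI, energy}.
    private final Map<String, double[]> cache = new ConcurrentHashMap<>();

    /** Return cached objectives, or run the (expensive) simulator and store the result. */
    public double[] evaluate(String configKey, Function<String, double[]> simulator) {
        return cache.computeIfAbsent(configKey, simulator);
    }

    public static void main(String[] args) {
        SimulationCache cache = new SimulationCache();
        Function<String, double[]> simulator = key -> {
            System.out.println("simulating " + key);   // stands in for hours of simulation
            return new double[]{1.05, 40.0};
        };
        cache.evaluate("issueWidth=4;robEntries=128", simulator); // simulated
        cache.evaluate("issueWidth=4;robEntries=128", simulator); // reused, no simulation
    }
}
```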
Proceedings of the 18th …, 2011
One way to cope with a huge design space formed by several parameters is to use methods for Automatic Design Space Exploration (ADSE). Recently we developed a Framework for Automatic Design Space Exploration (FADSE) focused on micro-architectural optimizations. In this article we evaluate the influence of three different evolutionary algorithms on the performance of design space explorations. More precisely, we selected two genetic algorithms, NSGA-II and SPEA2, as well as SMPSO, a particle swarm optimization algorithm. With these algorithms we run FADSE to optimize the parameters of both the Grid ALU Processor (GAP) microarchitecture and the GAPtimize post-link code optimizer. An analysis of the simulation results shows very good performance of the SMPSO algorithm during the design space exploration of the GAP simulator. SPEA2 provides slightly better results than NSGA-II when both GAP and GAPtimize are explored together.
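As a reminder of what distinguishes SMPSO, the sketch below shows the speed-constrained velocity update as we recall it from the SMPSO literature: a constriction coefficient applied to the classic PSO update, with the velocity clamped per parameter. The coefficient ranges, inertia weight, and the one-dimensional view are simplifications for illustration, not FADSE code.

```java
import java.util.Random;

/** Sketch of the speed-constrained velocity update used by SMPSO-style PSO.
 *  Coefficients and the single-dimension view are simplified assumptions. */
public class SmpsoVelocitySketch {
    static final Random RNG = new Random(42);

    /** Constriction coefficient chi (1.0 when c1 + c2 <= 4). */
    static double constriction(double c1, double c2) {
        double phi = c1 + c2;
        if (phi <= 4.0) return 1.0;
        return 2.0 / Math.abs(2.0 - phi - Math.sqrt(phi * phi - 4.0 * phi));
    }

    /** One-dimensional velocity update, clamped to +/- (upper - lower) / 2. */
    static double updateVelocity(double v, double x, double pBest, double gBest,
                                 double lower, double upper) {
        double c1 = 1.5 + RNG.nextDouble();   // c1, c2 drawn in [1.5, 2.5]
        double c2 = 1.5 + RNG.nextDouble();
        double w = 0.1;                        // inertia weight (simplified)
        double newV = constriction(c1, c2)
                * (w * v + c1 * RNG.nextDouble() * (pBest - x)
                         + c2 * RNG.nextDouble() * (gBest - x));
        double delta = (upper - lower) / 2.0;  // per-parameter speed bound
        return Math.max(-delta, Math.min(delta, newV));
    }

    public static void main(String[] args) {
        // e.g. one architectural parameter encoded in [0, 512] (ROB entries)
        double v = updateVelocity(0.0, 64.0, 128.0, 256.0, 0.0, 512.0);
        System.out.println("new velocity = " + v);
    }
}
```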
Roedunet International Conference …, 2010
During the last years, especially due to the growth in computing system complexity, the need for tools that perform automatic design space exploration has become more and more stringent. This paper presents a newly initiated project whose main aim is to develop a software tool, called FADSE (Framework for Automatic Design Space Exploration), that meets this need. It is intended to provide out-of-the-box algorithms capable of solving single- and multi-objective optimization problems, and it focuses on automatic design space exploration for multicore and manycore systems. The tool is intended to be flexible and to provide easy development and portability.
ACACES 2010 Poster Abstracts, 2010
In recent years the complexity of computing systems has grown (many heterogeneous cores), and finding the best configuration in the resulting extremely large design space has become a major problem. Many performance indicators have to be considered (IPC, power consumption, integration area, etc.), so mature tools able to perform this multi-objective exploration automatically are needed. We present an automatic design space exploration framework called FADSE (Framework for Automatic Design Space Exploration) which meets this need. It includes several multi-objective evolutionary algorithms and is able to work with most existing computing system simulators.
IET Computers & Digital Techniques, 2012
This work extends an earlier manual design space exploration of our developed Selective Load Valu... more This work extends an earlier manual design space exploration of our developed Selective Load Value Prediction based superscalar architecture to the L2 unified cache. After that we perform an automatic design space exploration using a special developed software tool by varying several architectural parameters. Our goal is to find optimal configurations in terms of CPI (Cycles per Instruction) and energy consumption. By varying 19 architectural parameters, as we proposed, the design space is over 2.5 millions of billions configurations which obviously means that only heuristic search can be considered. Therefore, we propose different methods of automatic design space exploration based on our developed FADSE tool which allow us to evaluate only 2500 configurations of the above mentioned huge design space!
Concurrency and Computation: Practice and Experience, 2012
In the design process of computer systems or processor architectures, typically many different parameters are exposed to configure, tune, and optimize every component of a system. For evaluations and before production, it is desirable to know the best setting for all parameters. Processing speed is no longer the only objective that needs to be optimized; power consumption, area, and so on have become very important. Thus, the best configurations have to be found with respect to multiple objectives. In this article, we use a multi-objective design space exploration tool called Framework for Automatic Design Space Exploration (FADSE) to automatically find near-optimal configurations in the vast design space of a processor architecture together with a tool for code optimizations, and hence evaluate both automatically. As an example, we use the Grid ALU Processor (GAP) and its post-link optimizer GAPtimize, which can apply feedback-directed and platform-specific code optimizations. Our results show that FADSE is able to cope with both design spaces. Less than 25% of the maximal reasonable hardware effort for the scalable elements of the GAP is enough to achieve the processor's maximum performance. With a performance reduction tolerance of 10%, the necessary hardware complexity can be reduced further by about two-thirds. The high-quality configurations found are analyzed, exhibiting strong relationships between the parameters of the GAP, the distribution of complexity, and the total performance. These performance numbers can be improved by applying code optimizations concurrently with optimizing the hardware parameters. FADSE can find near-optimal configurations in a short time by effectively combining and selecting parameters for hardware and code optimizations. The maximum observed speedup is 15%. With the use of code optimizations, the maximum possible reduction of the hardware resources, while sustaining the same performance level, is 50%.
Proceedings of 9th …
In this paper we continue our work on detecting and predicting unbiased branches. We centered on two directions: first, based on a simple example from the Perm benchmark of the Stanford suite, we show that by extending the context information some branches in certain ...
The 6th EUROSIM …, 2007
Modern superscalar microarchitectures speculatively execute a great quantity of code; without branch prediction it would not be possible to aggressively exploit a program's instruction-level parallelism. Both the architectural and technological complexity of current processors emphasize the negative impact on performance of every branch misprediction. Because of this importance, branch prediction has become a core topic in Computer Architecture curricula. The fast development of computer science and information technology, and of computer architecture especially, has led many software tools that until recently were used only in research to be enhanced with interactive graphical interfaces and taught in Introductory Computer Organization and Computer Architecture courses. The lack of branch prediction simulators suited to didactic use, despite the many built for research, is the starting point of this paper. The main aim of this work is to identify the difficult-to-predict branches, to quantify them at benchmark level, and to find the relevant information needed to reduce their number. Finally, we evaluate the impact of these branches on three commonly used prediction contexts (local, global and path) and their corresponding predictors, ranging from classical two-level predictors to present-day neural predictors (Simple Perceptron and Fast Path-based Perceptron). The developed ABPS simulator provides a wide variety of configuration options. Besides statistics on the number of difficult-to-predict branches, the simulator generates graphical results illustrating the influence of different simulation parameters (number of prediction table entries, history length, etc.) on prediction accuracy and resource usage for every implemented predictor.
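For the neural predictors mentioned above, the sketch below outlines a simple perceptron predictor in Java: the prediction is the sign of the dot product between a weight vector selected by the branch address and the global history, and training occurs on a misprediction or when the output magnitude is below a threshold. The table size, history length, and threshold are illustrative assumptions, not the ABPS settings.

```java
/** Minimal sketch of a simple perceptron branch predictor of the kind the
 *  abstract refers to. Table size, history length and training threshold are
 *  illustrative choices, not the values used in ABPS. */
public class PerceptronPredictor {
    private static final int TABLE_SIZE = 1024;   // number of perceptrons
    private static final int HISTORY_LEN = 16;    // global history bits
    private static final int THRESHOLD = (int) (1.93 * HISTORY_LEN + 14); // common heuristic

    private final int[][] weights = new int[TABLE_SIZE][HISTORY_LEN + 1]; // +1 for bias
    private final int[] history = new int[HISTORY_LEN];                   // +1 taken, -1 not taken

    private int output(long pc) {
        int[] w = weights[(int) (pc % TABLE_SIZE)];
        int y = w[0];                              // bias weight
        for (int i = 0; i < HISTORY_LEN; i++) y += w[i + 1] * history[i];
        return y;
    }

    public boolean predict(long pc) { return output(pc) >= 0; }

    /** Train on the actual outcome, then shift it into the global history. */
    public void update(long pc, boolean taken) {
        int y = output(pc);
        int t = taken ? 1 : -1;
        if ((y >= 0) != taken || Math.abs(y) <= THRESHOLD) {
            int[] w = weights[(int) (pc % TABLE_SIZE)];
            w[0] += t;
            for (int i = 0; i < HISTORY_LEN; i++) w[i + 1] += t * history[i];
        }
        System.arraycopy(history, 0, history, 1, HISTORY_LEN - 1);
        history[0] = t;
    }
}
```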