As we approach billion-transistor processor chips, the need arises for a new architecture that makes efficient use of the increased transistor budget. Many studies have shown that significant amounts of parallelism exist at different granularities and have yet to be exploited. ...
International Journal of Computer Applications, 2012
Nowadays, Multi-Processor Systems-on-Chip (MPSoCs) have become an essential solution for embedded applications. In this paper we focus on MPSoCs using a shared-memory programming model, which facilitates the programmer's task. One of the main factors affecting the performance of such systems is the management of the cache-coherency problem. In this context, we propose a new cache-coherency protocol that is able to dynamically adapt its functioning mode according to variations in application memory access ...
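The abstract stops before describing the protocol itself, so the sketch below only illustrates the general pattern of a coherence controller that re-evaluates its operating mode from observed access behavior at the end of each monitoring epoch. The mode names (write-invalidate vs. write-update), the epoch length, and the switching threshold are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch of an adaptive coherence controller that switches its
# operating mode per epoch based on observed memory-access behavior.
# Mode names, thresholds, and counters are illustrative assumptions, not the
# protocol proposed in the paper.

class AdaptiveCoherenceController:
    EPOCH_LENGTH = 10_000          # accesses per monitoring epoch (assumed)
    UPDATE_THRESHOLD = 0.4         # switch to write-update above this sharing ratio

    def __init__(self):
        self.mode = "invalidate"   # start in write-invalidate mode
        self.accesses = 0
        self.shared_writes = 0     # writes to lines cached by other cores

    def record_access(self, is_write: bool, sharers: int) -> None:
        """Called by the cache controller on every coherent access."""
        self.accesses += 1
        if is_write and sharers > 0:
            self.shared_writes += 1
        if self.accesses >= self.EPOCH_LENGTH:
            self._adapt()

    def _adapt(self) -> None:
        sharing_ratio = self.shared_writes / self.accesses
        # Heavy producer/consumer sharing favors updates; mostly private
        # data favors invalidations (fewer useless update messages).
        self.mode = "update" if sharing_ratio > self.UPDATE_THRESHOLD else "invalidate"
        self.accesses = 0
        self.shared_writes = 0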
2008 IEEE International Conference on Computer Design, 2008
This paper addresses the problem of chip-level thermal profile estimation using runtime temperature sensor readings. We address the challenges of (a) the availability of only a few thermal sensors with constrained locations (sensors cannot be placed just anywhere) and (b) random on-chip power density characteristics due to unpredictable workloads and fabrication variability. First, we model the random power density as a probability density function. Given this random characteristic and the runtime thermal sensor readings, we exploit the correlation between the power dissipation of different chip modules to estimate the expected value of the temperature at each chip location. Our methods are optimal if the underlying power density is Gaussian. We also present a heuristic to generate chip-level thermal profile estimates when the underlying randomness is non-Gaussian. Experimental results indicate that our method generates highly accurate thermal profile estimates of the entire chip at runtime using only a few thermal sensors.
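For the Gaussian case the abstract alludes to, the expected temperature at unobserved locations given a few sensor readings is the standard conditional mean of a jointly Gaussian vector. The sketch below shows that computation with NumPy; the mean vector and covariance matrix are toy inputs standing in for the statistics the paper derives from the power-density model, and all names are hypothetical.

# Minimal sketch of the Gaussian case: if temperatures at all chip locations are
# modeled as jointly Gaussian, the expected temperature at the unobserved
# locations given a few sensor readings is the standard conditional mean.
# The mean vector and covariance matrix below are illustrative toy inputs.
import numpy as np

def estimate_thermal_profile(mu, cov, sensor_idx, sensor_readings):
    """Return E[T] at every location given readings at sensor_idx."""
    n = len(mu)
    others = np.setdiff1d(np.arange(n), sensor_idx)

    cov_ss = cov[np.ix_(sensor_idx, sensor_idx)]   # sensor-sensor covariance
    cov_os = cov[np.ix_(others, sensor_idx)]        # other-sensor covariance

    # Conditional mean of the unobserved locations (MMSE estimate when Gaussian).
    delta = sensor_readings - mu[sensor_idx]
    cond_mean = mu[others] + cov_os @ np.linalg.solve(cov_ss, delta)

    profile = np.empty(n)
    profile[sensor_idx] = sensor_readings
    profile[others] = cond_mean
    return profile

# Toy usage: 4 chip locations, thermal sensors at locations 0 and 3.
mu = np.array([60.0, 65.0, 70.0, 62.0])
cov = np.array([[4.0, 2.0, 1.0, 0.5],
                [2.0, 4.0, 2.0, 1.0],
                [1.0, 2.0, 4.0, 2.0],
                [0.5, 1.0, 2.0, 4.0]])
print(estimate_thermal_profile(mu, cov, np.array([0, 3]), np.array([63.0, 61.0])))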
The papers that follow comprise the proceedings of the first Reconfigurable and Adaptive Architecture Workshop (RAAW 2006), which was held in conjunction with the 39th International Symposium on Microarchitecture in Orlando, Florida.
9th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT'05), 2005
The memory system is one of the main performance-limiting factors in contemporary processors, due to the gap between memory speed and processor speed. This motivates moving as much memory as possible from off-chip to on-chip. Furthermore, the sustained increase in the number of devices that can be integrated per chip makes large on-chip memories feasible. However, cache memories are starting to give diminishing returns. One of the main reasons is the delay in writing back the data of a replaced block to memory or to the next-level cache; this makes block replacement time-consuming and therefore affects overall performance. In this paper, we present a hybrid compiler-microarchitecture technique for addressing the cache traffic problem. The microarchitecture part deals with bandwidth management: it predicts the time at which a dirty cache block will no longer be written before replacement and writes the block back to memory during periods of low traffic. Thus, when the block is replaced, it is clean and the replacement completes much faster. The compiler technique deals with bandwidth saving: the compiler detects values that are dead and hence need not be written to memory at all, reducing memory traffic and making replacement faster. We show that the proposed techniques reduce writebacks from the L1 cache by 24% for SpecINT and 18% for SpecFP. Moreover, around half of the dirty blocks are cleaned during low-traffic periods, before their actual replacement time.
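As a rough illustration of the microarchitectural half of this idea, the sketch below writes dirty blocks back early when bus traffic is low, so that a later replacement finds them clean. The paper's last-write predictor and the compiler's dead-value analysis are not reproduced; a simple idle-cycle counter stands in for the predictor, and the class names, fields, and thresholds are assumptions.

# Hedged sketch of eager writeback: during cycles when the memory bus is lightly
# used, dirty blocks predicted to receive no further writes are written back
# early, so a later replacement finds them clean.  The "no more writes"
# predictor is a simple idle-cycle heuristic standing in for the paper's.

class CacheBlock:
    def __init__(self, tag):
        self.tag = tag
        self.dirty = False
        self.cycles_since_last_write = 0   # crude last-write predictor state

class EagerWritebackCache:
    DEAD_WRITE_THRESHOLD = 500    # cycles with no write => assume write-dead (assumed)
    LOW_TRAFFIC_THRESHOLD = 2     # pending bus requests below this => "low traffic"

    def __init__(self):
        self.blocks = {}          # tag -> CacheBlock
        self.bus_queue = []       # outstanding bus requests

    def write(self, tag):
        blk = self.blocks.setdefault(tag, CacheBlock(tag))
        blk.dirty = True
        blk.cycles_since_last_write = 0

    def tick(self):
        """Advance one cycle; clean predicted-dead dirty blocks when traffic is low."""
        for blk in self.blocks.values():
            blk.cycles_since_last_write += 1
        if len(self.bus_queue) < self.LOW_TRAFFIC_THRESHOLD:
            for blk in self.blocks.values():
                if blk.dirty and blk.cycles_since_last_write >= self.DEAD_WRITE_THRESHOLD:
                    self.bus_queue.append(("writeback", blk.tag))
                    blk.dirty = False          # block is now clean; replacement is fast
                    break                      # at most one eager writeback per cycle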
2009 International SoC Design Conference (ISOCC), 2009
Cache coherence is one of the main factors that can complicate the design of a Network-on-Chip (NoC), due to the large volume of control traffic it generates. With the demand to build scalable multicore systems, this problem has become more urgent to resolve. The ...
Proceedings 21st International Conference on Computer Design, 2003
There is growing interest in the use of speculative multithreading to speed up the execution of a program. In the speculative multithreading model, threads are extracted from a sequential program and speculatively executed in parallel, without violating the sequential program semantics. To get the best performance from this model, a highly accurate thread selection scheme is needed to assign threads to processing elements (PEs) for parallel execution. This is done using a thread predictor that assigns threads to PEs sequentially. However, this in-order thread assignment has severe limitations. One limitation arises when the thread predictor is unable to predict the successor of a particular thread, which may cause successor PEs to remain idle for many cycles. Another limitation has to do with control independence: when a misprediction occurs, all threads starting from the misprediction point get squashed, although many of them may be control independent of the misprediction. In this paper we present a hierarchical technique for building threads, a non-sequential scheme for assigning them to PEs, and a selective approach to squashing threads on a misprediction in order to take advantage of control independence. This technique uses dynamic resizing and builds threads in two steps: statically, using the compiler, as well as dynamically at run time. Based on the dynamic behavior of the program, a thread can expand or shrink in size and can span several PEs. Detailed simulation results show that our dynamic-resizing-based approach yields an 11.6% average increase in speedup relative to a conventional speculative multithreaded processor.
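A small sketch of the selective-squash idea described above: on a thread misprediction, only threads that are control-dependent on the mispredicted one are discarded, while control-independent successors keep their work. How control independence is established (compiler analysis plus runtime support in the paper) is abstracted into a caller-supplied predicate; the names and fields here are hypothetical, not the paper's implementation.

# Illustrative sketch of selective squashing after a thread misprediction.
from dataclasses import dataclass

@dataclass
class SpecThread:
    sequence_id: int      # position of the speculative thread in program order
    start_pc: int         # first instruction of the thread

def selective_squash(active_threads, mispredicted, is_control_independent):
    """Partition active threads into (kept, squashed) after a misprediction."""
    kept, squashed = [], []
    for t in active_threads:
        if t.sequence_id <= mispredicted.sequence_id:
            kept.append(t)                    # older threads are unaffected
        elif is_control_independent(t, mispredicted):
            kept.append(t)                    # control-independent work survives
        else:
            squashed.append(t)                # work dependent on the bad prediction is discarded
    return kept, squashed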
Proceedings 16th International Parallel and Distributed Processing Symposium, 2002
Many studies have shown that significant amounts of parallelism exist at different granularities. Execution models such as superscalar and VLIW exploit parallelism from a single thread. Multithreaded processors make a step towards exploiting parallelism from different threads, but are not geared to exploit parallelism at different granularities (fine and medium grain). In this paper we present a feasibility study of a new execution model for exploiting both adjacent and distant parallelism in the dynamic instruction stream. Our model, called hierarchical multithreading, uses a two-level hierarchical arrangement of processing elements. The lower level of the hierarchy exploits instruction-level parallelism and fine-grain thread-level parallelism, whereas the upper level exploits more distant parallelism. Detailed simulation studies with a cycle-accurate simulator are presented, showing the feasibility of hierarchical multithreading. Conclusions are drawn about the best ways to obtain the most from the hierarchical multithreading scheme.
The spectacular increase in microprocessor speed has been handicapped by the modest evolution of memory speed. Thus, pressure is put on the cache hierarchy, and in particular on the last-level cache (LLC), to avoid costly accesses to main memory. Moreover, increasing both the LLC size and the number of cores per chip has created another problem: the non-uniformity of cache access. LLC banks and processors are non-uniformly distant, penalizing cores that access distant banks. We aim to improve block migration in NUCA (Non-Uniform Cache Access) architectures. The new NUCA controller monitors block accesses and acts according to each block's behavior. In this paper we present the first step of our approach, in which we observe block behavior in order to categorize blocks, with the goal of treating each category differently.
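The "first step" the abstract describes, observing and categorizing block behavior, could look roughly like the monitor sketched below: per-block access counts and the set of requesting cores are tracked, and each block is placed in a category that a migration policy could later treat differently. The category names and the threshold are assumptions for illustration, not the classification used in the paper.

# First-step sketch: observe per-block access behavior and place each block in a
# category a migration policy could later treat differently.  Categories and
# thresholds are illustrative assumptions.
from collections import defaultdict

class BlockMonitor:
    HOT_THRESHOLD = 32            # accesses per interval considered "hot" (assumed)

    def __init__(self):
        self.accesses = defaultdict(int)      # block address -> access count
        self.sharers = defaultdict(set)       # block address -> cores that touched it

    def record(self, block_addr, core_id):
        self.accesses[block_addr] += 1
        self.sharers[block_addr].add(core_id)

    def categorize(self, block_addr):
        count = self.accesses[block_addr]
        cores = len(self.sharers[block_addr])
        if count < self.HOT_THRESHOLD:
            return "cold"                     # rarely touched: leave in place
        return "hot-private" if cores == 1 else "hot-shared"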
With the major advances in process technology, several processors and a more sophisticated cache hierarchy can be embedded on-chip. This opens the door to many interesting and challenging opportunities by providing high on-chip bandwidth. However, ...
Cache replacement policy is a major design parameter of any memory hierarchy. The efficiency of the replacement policy affects both the hit rate and the access latency of a cache system. The higher the associativity of the cache, the more vital the replacement ...
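As a concrete reference point (not taken from the paper), the snippet below implements plain LRU for a single cache set; with higher associativity the policy must choose among more candidate victims per set, which is why the quality of the replacement decision grows in importance.

# Illustrative LRU replacement for one set of a set-associative cache.
from collections import OrderedDict

class LRUSet:
    def __init__(self, associativity):
        self.associativity = associativity
        self.lines = OrderedDict()            # tag -> payload, ordered by recency

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting the LRU line."""
        if tag in self.lines:
            self.lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.lines) >= self.associativity:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[tag] = None
        return False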
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
SMT processors are widely used in high-performance computing tasks. However, with the improved performance of the SMT architecture, the utilization of their functional units increases significantly, straining the processor's power budget. This increases not only the dynamic power consumption but also the leakage power consumption, due to the increased temperature. In this paper, a comparison of static and dynamic sleep-signal generation techniques for SMT processors is presented, conducted under various workloads to assess their effectiveness in leakage power management. Results show that the dynamic approach exhibits a threefold increase in leakage savings compared with the static approach for certain functional units.
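A minimal sketch of what a dynamic sleep-signal generator for one functional unit might look like: the unit is put into a low-leakage sleep state after a run of idle cycles and woken on demand at the cost of a wake-up delay. The idle threshold and wake-up latency are illustrative parameters, not the configurations evaluated in the paper.

# Hedged sketch of a dynamic sleep-signal generator for one functional unit.
class DynamicSleepController:
    def __init__(self, idle_threshold=16, wakeup_latency=2):
        self.idle_threshold = idle_threshold
        self.wakeup_latency = wakeup_latency
        self.idle_cycles = 0
        self.asleep = False
        self.wakeup_countdown = 0

    def tick(self, request: bool) -> bool:
        """Advance one cycle; return True if the unit executes a request this cycle."""
        if not request:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_threshold and not self.asleep:
                self.asleep = True             # assert sleep signal: cut leakage power
            return False
        # A request arrived.
        self.idle_cycles = 0
        if self.asleep:
            self.wakeup_countdown += 1         # count cycles spent waking the unit up
            if self.wakeup_countdown >= self.wakeup_latency:
                self.asleep = False
                self.wakeup_countdown = 0
            return False                       # request stalls while the unit wakes
        return True                            # unit awake: request executes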
IEEE Transactions on Information Forensics and Security, 2000
A trusted platform module (TPM) enhances the security of general-purpose computer systems by authenticating the platform at boot time. Security can often be compromised due to the presence of vulnerabilities in the trusted software that is executed on the system. Existing TPM architectures do not support runtime integrity checking, and this allows attackers to exploit these vulnerabilities to modify ...
Fifth Annual Boston Area Architecture Workshop, 2007
Cache Hierarchy for 100 On-Chip Cores. Mohamed Zahran, Department of Electrical Engineering, City University of New York (mzahran@ccny.cuny.edu). Abstract: The increase in the number of on-chip cores, as well as the sophistication of each core, places significant demands ...
Department of Electrical Engineering, City University of New York. ... The traffic from memory to the LLC occurs whenever there is a cache miss in the LLC and the corresponding block has to be fetched from memory. ...
The 2nd Workshop on Managed Multi-Core Systems …, 2009
Bushra Ahsan, Electrical Engineering Department, City University of New York. ... First, it shows the importance of off-chip traffic as a main design parameter that must be tackled as early in the design process as possible. ...
ACM Transactions on Design Automation of Electronic Systems, 2010
This paper addresses the problem of chip-level thermal profile estimation using runtime temperature sensor readings. We address the challenges of (a) the availability of only a few thermal sensors with constrained locations (sensors cannot be placed just anywhere) and (b) random chip power density characteristics due to unpredictable workloads and fabrication variability. First, we model the random power density as a probability density function. Given such statistical characteristics and the runtime thermal sensor readings, we exploit the correlation in power dissipation among different chip modules to estimate the expected value of the temperature at each chip location. Our methods are optimal if the underlying power density is Gaussian. We give a heuristic method to estimate the chip-level thermal profile when the underlying randomness is non-Gaussian. An extension of our method has also been proposed to address the dynamic case. Several speedup strategies are carefully investigated to improve the efficiency of the estimation algorithm. Experimental results indicate that, given only a few thermal sensors, our method can generate highly accurate chip-level thermal profile estimates within a few milliseconds.