As we approach billion-transistor processor chips, the need arises for a new architecture that makes efficient use of the increased transistor budget. Many studies have shown that significant amounts of parallelism exist at different granularities and have yet to be exploited. ...
International Journal of Computer Applications, 2012
Nowadays, Multi-Processor Systems-on-Chip (MPSoCs) have become an essential solution for embedded applications. In this paper we focus on MPSoCs using a shared-memory programming model, which facilitates the programmer's task. One of the main factors affecting the performance of such systems is the management of the cache-coherency problem. In this context, we propose a new cache-coherency protocol that is able to dynamically adapt its functioning mode according to variations in application memory access ...
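The abstract stops before describing the protocol itself, so the sketch below only illustrates the general pattern of a coherence controller that re-evaluates its operating mode from observed access behavior at the end of each monitoring epoch. The mode names (write-invalidate vs. write-update), the epoch length, and the switching threshold are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch of an adaptive coherence controller that switches its
# operating mode per epoch based on observed memory-access behavior.
# Mode names, thresholds, and counters are illustrative assumptions, not the
# protocol proposed in the paper.

class AdaptiveCoherenceController:
    EPOCH_LENGTH = 10_000          # accesses per monitoring epoch (assumed)
    UPDATE_THRESHOLD = 0.4         # switch to write-update above this sharing ratio

    def __init__(self):
        self.mode = "invalidate"   # start in write-invalidate mode
        self.accesses = 0
        self.shared_writes = 0     # writes to lines cached by other cores

    def record_access(self, is_write: bool, sharers: int) -> None:
        """Called by the cache controller on every coherent access."""
        self.accesses += 1
        if is_write and sharers > 0:
            self.shared_writes += 1
        if self.accesses >= self.EPOCH_LENGTH:
            self._adapt()

    def _adapt(self) -> None:
        sharing_ratio = self.shared_writes / self.accesses
        # Heavy producer/consumer sharing favors updates; mostly private
        # data favors invalidations (fewer useless update messages).
        self.mode = "update" if sharing_ratio > self.UPDATE_THRESHOLD else "invalidate"
        self.accesses = 0
        self.shared_writes = 0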
2008 IEEE International Conference on Computer Design, 2008
This paper addresses the problem of chip-level thermal profile estimation using runtime temperature sensor readings. We address the challenges of (a) the availability of only a few thermal sensors with constrained locations (sensors cannot be placed just anywhere) and (b) random on-chip power density characteristics due to unpredictable workloads and fabrication variability. First, we model the random power density as a probability density function. Given this random characteristic and the runtime thermal sensor readings, we exploit the correlation between the power dissipation of different chip modules to estimate the expected value of the temperature at each chip location. Our methods are optimal if the underlying power density is Gaussian. We also present a heuristic to generate chip-level thermal profile estimates when the underlying randomness is non-Gaussian. Experimental results indicate that our method generates highly accurate thermal profile estimates of the entire chip at runtime using only a few thermal sensors.
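For the Gaussian case the abstract alludes to, the expected temperature at unobserved locations given a few sensor readings is the standard conditional mean of a jointly Gaussian vector. The sketch below shows that computation with NumPy; the mean vector and covariance matrix are toy inputs standing in for the statistics the paper derives from the power-density model, and all names are hypothetical.

# Minimal sketch of the Gaussian case: if temperatures at all chip locations are
# modeled as jointly Gaussian, the expected temperature at the unobserved
# locations given a few sensor readings is the standard conditional mean.
# The mean vector and covariance matrix below are illustrative toy inputs.
import numpy as np

def estimate_thermal_profile(mu, cov, sensor_idx, sensor_readings):
    """Return E[T] at every location given readings at sensor_idx."""
    n = len(mu)
    others = np.setdiff1d(np.arange(n), sensor_idx)

    cov_ss = cov[np.ix_(sensor_idx, sensor_idx)]   # sensor-sensor covariance
    cov_os = cov[np.ix_(others, sensor_idx)]        # other-sensor covariance

    # Conditional mean of the unobserved locations (MMSE estimate when Gaussian).
    delta = sensor_readings - mu[sensor_idx]
    cond_mean = mu[others] + cov_os @ np.linalg.solve(cov_ss, delta)

    profile = np.empty(n)
    profile[sensor_idx] = sensor_readings
    profile[others] = cond_mean
    return profile

# Toy usage: 4 chip locations, thermal sensors at locations 0 and 3.
mu = np.array([60.0, 65.0, 70.0, 62.0])
cov = np.array([[4.0, 2.0, 1.0, 0.5],
                [2.0, 4.0, 2.0, 1.0],
                [1.0, 2.0, 4.0, 2.0],
                [0.5, 1.0, 2.0, 4.0]])
print(estimate_thermal_profile(mu, cov, np.array([0, 3]), np.array([63.0, 61.0])))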
The papers that follow comprise the proceedings of the first Reconfigurable and Adaptive Architecture Workshop (RAAW 2006), which was held in conjunction with the 39th International Symposium on Microarchitecture in Orlando, Florida.
9th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT'05), 2005
The memory system is one of the main performance-limiting factors in contemporary processors, due to the gap between memory speed and processor speed. This motivates moving as much memory as possible from off-chip to on-chip. Furthermore, the sustained increase in the number of devices that can be integrated per chip makes large on-chip memories feasible. However, cache memories are starting to give diminishing returns. One of the main reasons is the delay in writing back the data of a replaced block to memory or to the next-level cache; this makes block replacement time-consuming and therefore affects overall performance. In this paper, we present a hybrid compiler-microarchitecture technique for addressing the cache traffic problem. The microarchitecture part deals with bandwidth management: it predicts the time at which a dirty cache block will no longer be written before replacement and writes the block back to memory during periods of low traffic. Thus, when the block is replaced, it is clean and the replacement completes much faster. The compiler technique deals with bandwidth saving: the compiler detects values that are dead and hence need not be written to memory at all, reducing memory traffic and making replacement faster. We show that the proposed techniques reduce writebacks from the L1 cache by 24% for SpecINT and 18% for SpecFP. Moreover, around half of the dirty blocks are cleaned during low-traffic periods, before their actual replacement time.
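As a rough illustration of the microarchitectural half of this idea, the sketch below writes dirty blocks back early when bus traffic is low, so that a later replacement finds them clean. The paper's last-write predictor and the compiler's dead-value analysis are not reproduced; a simple idle-cycle counter stands in for the predictor, and the class names, fields, and thresholds are assumptions.

# Hedged sketch of eager writeback: during cycles when the memory bus is lightly
# used, dirty blocks predicted to receive no further writes are written back
# early, so a later replacement finds them clean.  The "no more writes"
# predictor is a simple idle-cycle heuristic standing in for the paper's.

class CacheBlock:
    def __init__(self, tag):
        self.tag = tag
        self.dirty = False
        self.cycles_since_last_write = 0   # crude last-write predictor state

class EagerWritebackCache:
    DEAD_WRITE_THRESHOLD = 500    # cycles with no write => assume write-dead (assumed)
    LOW_TRAFFIC_THRESHOLD = 2     # pending bus requests below this => "low traffic"

    def __init__(self):
        self.blocks = {}          # tag -> CacheBlock
        self.bus_queue = []       # outstanding bus requests

    def write(self, tag):
        blk = self.blocks.setdefault(tag, CacheBlock(tag))
        blk.dirty = True
        blk.cycles_since_last_write = 0

    def tick(self):
        """Advance one cycle; clean predicted-dead dirty blocks when traffic is low."""
        for blk in self.blocks.values():
            blk.cycles_since_last_write += 1
        if len(self.bus_queue) < self.LOW_TRAFFIC_THRESHOLD:
            for blk in self.blocks.values():
                if blk.dirty and blk.cycles_since_last_write >= self.DEAD_WRITE_THRESHOLD:
                    self.bus_queue.append(("writeback", blk.tag))
                    blk.dirty = False          # block is now clean; replacement is fast
                    break                      # at most one eager writeback per cycle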
2009 International SoC Design Conference (ISOCC), 2009
Cache coherence is one of the main factors that can complicate the design of a Network-on-Chip (NoC), due to the large volume of control traffic it generates. With the demand to build scalable multicore systems, this problem has become more urgent to resolve. The ...
Proceedings 21st International Conference on Computer Design, 2003
There is growing interest in the use of speculative multithreading to speed up the execution of a program. In the speculative multithreading model, threads are extracted from a sequential program and speculatively executed in parallel, without violating the sequential program semantics. To get the best performance from this model, a highly accurate thread selection scheme is needed to assign threads to processing elements (PEs) for parallel execution. This is done using a thread predictor that assigns threads to PEs sequentially. However, this in-order thread assignment has severe limitations. One limitation arises when the thread predictor is unable to predict the successor of a particular thread, which may cause successor PEs to remain idle for many cycles. Another limitation has to do with control independence: when a misprediction occurs, all threads starting from the misprediction point get squashed, although many of them may be control independent of the misprediction. In this paper we present a hierarchical technique for building threads, a non-sequential scheme for assigning them to PEs, and a selective approach to squashing threads on a misprediction in order to take advantage of control independence. This technique uses dynamic resizing and builds threads in two steps: statically, using the compiler, as well as dynamically at run time. Based on the dynamic behavior of the program, a thread can expand or shrink in size and can span several PEs. Detailed simulation results show that our dynamic-resizing-based approach yields an 11.6% average increase in speedup relative to a conventional speculative multithreaded processor.
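A small sketch of the selective-squash idea described above: on a thread misprediction, only threads that are control-dependent on the mispredicted one are discarded, while control-independent successors keep their work. How control independence is established (compiler analysis plus runtime support in the paper) is abstracted into a caller-supplied predicate; the names and fields here are hypothetical, not the paper's implementation.

# Illustrative sketch of selective squashing after a thread misprediction.
from dataclasses import dataclass

@dataclass
class SpecThread:
    sequence_id: int      # position of the speculative thread in program order
    start_pc: int         # first instruction of the thread

def selective_squash(active_threads, mispredicted, is_control_independent):
    """Partition active threads into (kept, squashed) after a misprediction."""
    kept, squashed = [], []
    for t in active_threads:
        if t.sequence_id <= mispredicted.sequence_id:
            kept.append(t)                    # older threads are unaffected
        elif is_control_independent(t, mispredicted):
            kept.append(t)                    # control-independent work survives
        else:
            squashed.append(t)                # work dependent on the bad prediction is discarded
    return kept, squashed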
Proceedings 16th International Parallel and Distributed Processing Symposium, 2002
Many studies have shown that significant amounts of parallelism exist at different granularities. Execution models such as superscalar and VLIW exploit parallelism from a single thread. Multithreaded processors make a step towards exploiting parallelism from different threads, but are not geared to exploit parallelism at different granularities (fine and medium grain). In this paper we present a feasibility study of a new execution model for exploiting both adjacent and distant parallelism in the dynamic instruction stream. Our model, called hierarchical multithreading, uses a two-level hierarchical arrangement of processing elements. The lower level of the hierarchy exploits instruction-level parallelism and fine-grain thread-level parallelism, whereas the upper level exploits more distant parallelism. Detailed simulation studies with a cycle-accurate simulator are presented, showing the feasibility of hierarchical multithreading. Conclusions are drawn about the best ways to obtain the most from the hierarchical multithreading scheme.
The spectacular increase in microprocessor speed has been handicapped by the modest evolution of memory speed. Thus, pressure is put on the cache hierarchy, and in particular on the last-level cache (LLC), to avoid costly accesses to main memory. Moreover, increasing both the LLC size and the number of cores per chip has created another problem: the non-uniformity of cache access. LLC banks and processors are non-uniformly distant, penalizing cores that access distant banks. We aim to improve block migration in NUCA (Non-Uniform Cache Access) architectures. The new NUCA controller monitors block accesses and acts according to each block's behavior. In this paper we present the first step of our approach, in which we observe block behavior in order to categorize blocks, with the goal of treating each category differently.
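The "first step" the abstract describes, observing and categorizing block behavior, could look roughly like the monitor sketched below: per-block access counts and the set of requesting cores are tracked, and each block is placed in a category that a migration policy could later treat differently. The category names and the threshold are assumptions for illustration, not the classification used in the paper.

# First-step sketch: observe per-block access behavior and place each block in a
# category a migration policy could later treat differently.  Categories and
# thresholds are illustrative assumptions.
from collections import defaultdict

class BlockMonitor:
    HOT_THRESHOLD = 32            # accesses per interval considered "hot" (assumed)

    def __init__(self):
        self.accesses = defaultdict(int)      # block address -> access count
        self.sharers = defaultdict(set)       # block address -> cores that touched it

    def record(self, block_addr, core_id):
        self.accesses[block_addr] += 1
        self.sharers[block_addr].add(core_id)

    def categorize(self, block_addr):
        count = self.accesses[block_addr]
        cores = len(self.sharers[block_addr])
        if count < self.HOT_THRESHOLD:
            return "cold"                     # rarely touched: leave in place
        return "hot-private" if cores == 1 else "hot-shared"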
With the major advances in process technology, several processors and a more sophisticated cache hierarchy can be embedded on-chip. This opens the door to many interesting and challenging opportunities by providing high on-chip bandwidth. However, ...
Cache replacement policy is a major design parameter of any memory hierarchy. The efficiency of the replacement policy affects both the hit rate and the access latency of a cache system. The higher the associativity of the cache, the more vital the replacement ...
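As a concrete reference point (not taken from the paper), the snippet below implements plain LRU for a single cache set; with higher associativity the policy must choose among more candidate victims per set, which is why the quality of the replacement decision grows in importance.

# Illustrative LRU replacement for one set of a set-associative cache.
from collections import OrderedDict

class LRUSet:
    def __init__(self, associativity):
        self.associativity = associativity
        self.lines = OrderedDict()            # tag -> payload, ordered by recency

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting the LRU line."""
        if tag in self.lines:
            self.lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.lines) >= self.associativity:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[tag] = None
        return False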
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
SMT processors are widely used in high-performance computing tasks. However, with the improved performance of the SMT architecture, the utilization of their functional units increases significantly, straining the processor's power budget. This increases not only the dynamic power consumption but also the leakage power consumption, due to the increased temperature. In this paper, a comparison of static and dynamic sleep-signal generation techniques for SMT processors is presented, conducted under various workloads to assess their effectiveness in leakage power management. Results show that the dynamic approach exhibits a threefold increase in leakage savings compared with the static approach for certain functional units.
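A minimal sketch of what a dynamic sleep-signal generator for one functional unit might look like: the unit is put into a low-leakage sleep state after a run of idle cycles and woken on demand at the cost of a wake-up delay. The idle threshold and wake-up latency are illustrative parameters, not the configurations evaluated in the paper.

# Hedged sketch of a dynamic sleep-signal generator for one functional unit.
class DynamicSleepController:
    def __init__(self, idle_threshold=16, wakeup_latency=2):
        self.idle_threshold = idle_threshold
        self.wakeup_latency = wakeup_latency
        self.idle_cycles = 0
        self.asleep = False
        self.wakeup_countdown = 0

    def tick(self, request: bool) -> bool:
        """Advance one cycle; return True if the unit executes a request this cycle."""
        if not request:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_threshold and not self.asleep:
                self.asleep = True             # assert sleep signal: cut leakage power
            return False
        # A request arrived.
        self.idle_cycles = 0
        if self.asleep:
            self.wakeup_countdown += 1         # count cycles spent waking the unit up
            if self.wakeup_countdown >= self.wakeup_latency:
                self.asleep = False
                self.wakeup_countdown = 0
            return False                       # request stalls while the unit wakes
        return True                            # unit awake: request executes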
IEEE Transactions on Information Forensics and Security, 2000
A trusted platform module (TPM) enhances the security of general-purpose computer systems by authenticating the platform at boot time. Security can often be compromised due to the presence of vulnerabilities in the trusted software that is executed on the system. Existing TPM architectures do not support runtime integrity checking, and this allows attackers to exploit these vulnerabilities to modify ...
Fifth Annual Boston Area Architecture Workshop, 2007
Cache Hierarchy for 100 On-Chip Cores. Mohamed Zahran, Department of Electrical Engineering, City University of New York (mzahran@ccny.cuny.edu). Abstract: The increase in the number of on-chip cores, as well as the sophistication of each core, places significant demands ...
Department of Electrical Engineering, City University of New York. ... The traffic from memory to the LLC occurs whenever there is a cache miss in the LLC and the corresponding block has to be fetched from memory. ...
The 2nd Workshop on Managed Multi-Core Systems …, 2009
Bushra Ahsan, Electrical Engineering Department, City University of New York. ... First, it shows the importance of off-chip traffic as a main design parameter that must be tackled as early in the design process as possible. ...
ACM Transactions on Design Automation of Electronic Systems, 2010
This paper addresses the problem of chip-level thermal profile estimation using runtime temperature sensor readings. We address the challenges of (a) the availability of only a few thermal sensors with constrained locations (sensors cannot be placed just anywhere) and (b) random chip power density characteristics due to unpredictable workloads and fabrication variability. First, we model the random power density as a probability density function. Given such statistical characteristics and the runtime thermal sensor readings, we exploit the correlation in power dissipation among different chip modules to estimate the expected value of the temperature at each chip location. Our methods are optimal if the underlying power density is Gaussian. We give a heuristic method to estimate the chip-level thermal profile when the underlying randomness is non-Gaussian. An extension of our method has also been proposed to address the dynamic case. Several speedup strategies are carefully investigated to improve the efficiency of the estimation algorithm. Experimental results indicate that, given only a few thermal sensors, our method can generate highly accurate chip-level thermal profile estimates within a few milliseconds.