Papers by Pravin Chandran
We propose a new modeling methodology using deep learning techniques for generating timing models for Static Timing Analysis (STA). Current device behavior is non-linear, non-monotonic and exhibits high sensitivity to Process-Voltage-Temperature (PVT) variation, which imposes a myriad of design challenges including the need for analysis at several PVT corners. While complete PVT coverage is crucial for detecting design issues early and achieving time-to-market goals with improved predictability, the number of PVT corners is growing exponentially and library generation has become a significant bottleneck in current design cycles. To this end, we have developed a novel methodology for timing library generation that uses data from sparse characterization in PVT space and generates delay models at the required sign-off corners. We employ deep neural nets with residual connections for delay modeling, and our methodology enables a single model to fully comprehend multiple cell types and PVT corners and to generate the required PVT timing libraries. The proposed library generator uses a novel inter-corner model to generate delay tables at 17 test corners using 7 corners as reference. In addition, we have developed an intra-corner model to generate dense 8x8 delay tables using delays from 10 slew/load points as reference. The results show that, using these models, we achieve key improvements, with over 98.7% of calculated delays within acceptable tolerance, while reducing characterization run-time for early milestones by up to 60%.
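As an illustration of the residual-network formulation described above, the following is a minimal sketch assuming PyTorch; the cell-type embedding, PVT-corner features and slew/load inputs are hypothetical placeholders, not the authors' actual feature set.

    # Sketch only: a delay model with residual connections (assumed, not the paper's code).
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, width):
            super().__init__()
            self.fc1 = nn.Linear(width, width)
            self.fc2 = nn.Linear(width, width)
            self.act = nn.ReLU()

        def forward(self, x):
            # skip connection lets the network refine a near-linear delay trend
            return self.act(x + self.fc2(self.act(self.fc1(x))))

    class DelayModel(nn.Module):
        def __init__(self, n_cell_types, n_corner_feats, width=128, depth=4):
            super().__init__()
            self.cell_emb = nn.Embedding(n_cell_types, 16)
            # input: cell-type embedding + PVT features + input slew + output load
            self.inp = nn.Linear(16 + n_corner_feats + 2, width)
            self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(depth)])
            self.out = nn.Linear(width, 1)  # predicted arc delay

        def forward(self, cell_id, corner_feats, slew, load):
            x = torch.cat([self.cell_emb(cell_id), corner_feats,
                           slew.unsqueeze(-1), load.unsqueeze(-1)], dim=-1)
            return self.out(self.blocks(torch.relu(self.inp(x)))).squeeze(-1)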
arXiv (Cornell University), Jun 28, 2021
Federated Learning allows training on data stored in distributed devices without the need for centralizing training data, thereby maintaining data privacy. Addressing the ability to handle data heterogeneity (non-identical and independent distribution, or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming the acknowledged FedAvg limitations in non-IID environments. We propose a novel use of a cosine-distance based weight-divergence metric to determine the exact point at which a deep learning network can be divided into class-agnostic initial layers and class-specific deep layers for performing Divide-and-Conquer training. We show that the methodology achieves trained-model accuracy at par with (and in certain cases exceeding) the numbers achieved by state-of-the-art aggregation algorithms like FedProx, FedMA, etc. We also show that this methodology leads to compute and bandwidth optimizations under certain documented conditions.
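The weight-divergence idea can be illustrated with a short Python sketch; the layer indexing, threshold value and helper names below are assumptions for illustration, not the paper's implementation.

    # Sketch only: per-layer cosine-distance divergence used to pick the split
    # point between class-agnostic and class-specific layers (assumed workflow).
    import numpy as np

    def cosine_distance(a, b):
        a, b = a.ravel(), b.ravel()
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def split_point(global_weights, client_weights, threshold=0.1):
        """Return the index of the first layer whose divergence across clients
        exceeds the threshold; earlier layers are treated as class-agnostic."""
        for idx, g in enumerate(global_weights):                    # one array per layer
            divergences = [cosine_distance(c[idx], g) for c in client_weights]
            if max(divergences) > threshold:
                return idx
        return len(global_weights)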
Lecture Notes in Computer Science, 2020
The performance of an end-to-end neural network on a given hardware platform is a function of its compute and memory signature, which, in turn, is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, precision, etc. Current benchmarking tools suffer from limitations such as: a) being too granular, like DeepBench [1]; b) mandating a working implementation that is framework specific, hardware-architecture specific, or both; or c) providing only high-level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework to effectively identify the memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware-specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish the baselines needed to understand the performance of an end-to-end neural network topology and make high-level architectural decisions. Further, integration of NTP with frameworks like TensorFlow allows performance comparison along several vectors: a) comparison of different frameworks on given hardware; b) comparison of different hardware using a given framework; and c) comparison across different heterogeneous hardware configurations for a given framework. These capabilities empower a researcher to effortlessly make the architectural decisions needed for achieving optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular speech recognition topology.
Federated Learning allows training on data stored in distributed devices without the need for centralizing training data, thereby maintaining data privacy. Addressing the ability to handle data heterogeneity (non-identical and independent distribution, or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming the acknowledged FedAvg limitations in non-IID environments. We propose a novel use of a cosine-distance based weight-divergence metric to determine the exact point at which a deep learning network can be divided into class-agnostic initial layers and class-specific deep layers for performing Divide-and-Conquer training. We show that the methodology achieves trained-model accuracy at par with (and in certain cases exceeding) the numbers achieved by state-of-the-art algorithms like FedProx, FedMA, etc. We also show that this methodology leads to compute and/or bandwidth optimizations under certain documented conditions.
arXiv (Cornell University), May 22, 2019
The performance of an end-to-end neural network on a given hardware platform is a function of its compute and memory signature, which, in turn, is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, precision, etc. Current benchmarking tools suffer from limitations such as: a) being too granular, like DeepBench [1]; b) mandating a working implementation that is framework specific, hardware-architecture specific, or both; or c) providing only high-level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework to effectively identify the memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware-specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish the baselines needed to understand the performance of an end-to-end neural network topology and make high-level architectural decisions. Further, integration of NTP with frameworks like TensorFlow, PyTorch, Intel OpenVINO, etc. allows performance comparison along several vectors: a) comparison of different frameworks on given hardware; b) comparison of different hardware using a given framework; and c) comparison across different heterogeneous hardware configurations for a given framework. These capabilities empower a researcher to effortlessly make the architectural decisions needed for achieving optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular speech recognition topology.
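The abstract does not detail NTP's interfaces, so the following is only a rough sketch of what a compute/memory signature estimate for a topology could look like; every function name and formula here is an assumption, not NTP's actual API.

    # Sketch only: per-layer compute/memory estimates aggregated into a topology signature.
    def conv2d_signature(c_in, c_out, k, h_out, w_out, bytes_per_elem=4):
        macs = c_in * c_out * k * k * h_out * w_out            # multiply-accumulates
        weight_bytes = c_in * c_out * k * k * bytes_per_elem
        activation_bytes = c_out * h_out * w_out * bytes_per_elem
        return {"macs": macs, "weight_bytes": weight_bytes,
                "activation_bytes": activation_bytes}

    def topology_signature(layers):
        """Aggregate per-layer estimates into an end-to-end signature."""
        total = {"macs": 0, "weight_bytes": 0, "activation_bytes": 0}
        for spec in layers:
            sig = conv2d_signature(**spec)
            for key in total:
                total[key] += sig[key]
        return total

    # Example: two convolution layers of a hypothetical topology
    print(topology_signature([
        dict(c_in=3, c_out=32, k=3, h_out=112, w_out=112),
        dict(c_in=32, c_out=64, k=3, h_out=56, w_out=56),
    ]))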
Statistical timing characterization for modeling On-Chip Variation (OCV) is critical in current technology nodes to avoid over-design and to improve design convergence and predictability. OCV characterization, however, is resource intensive, as it involves running millions of Monte Carlo SPICE simulations to cover the different timing arcs for multiple cells in a standard-cell library. We have developed a neural network model that fully comprehends multiple cell types to model cell propagation delays as well as OCV sigma at target process-voltage-temperature (PVT) corners with a significantly reduced number of simulations. The proposed method generates Liberty Variation Format (LVF) models, which are the latest and most accurate representation of OCV margin in the industry's standard tools and flows. On extensive testing with 7 million OCV delay values in a 10nm node, we attained a 60% reduction in runtime while maintaining a prediction error of less than 5% for 99.98% of arcs, which can be used for early timing integration.
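A small sketch of the reported tolerance metric, assuming NumPy; the array names are hypothetical stand-ins for per-arc model predictions and Monte Carlo references.

    # Sketch only: fraction of arcs whose predicted OCV value is within a 5% relative-error
    # tolerance of the Monte Carlo reference (assumed evaluation, not the authors' code).
    import numpy as np

    def fraction_within_tolerance(predicted, reference, tol=0.05):
        rel_err = np.abs(predicted - reference) / np.maximum(np.abs(reference), 1e-12)
        return float(np.mean(rel_err < tol))

    # Hypothetical per-arc sigma arrays (e.g., from LVF tables):
    # frac = fraction_within_tolerance(model_sigma, spice_sigma)   # e.g., 0.9998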
arXiv, 2021
Federated Learning (FL) solves many of this decade's concerns regarding data privacy and computation challenges. FL ensures no data leaves its source, as the model is trained where the data resides. However, FL comes with its own set of challenges. The communication of model weight updates in this distributed environment comes with significant network bandwidth costs. In this context, we propose a mechanism for compressing the weight updates using Autoencoders (AE), which learn the data features of the weight updates and subsequently perform compression. The encoder is set up on each of the nodes where training is performed, while the decoder is set up on the node where the weights are aggregated. This setup achieves compression through the encoder and recreates the weights at the end of every communication round using the decoder. This paper shows that the dynamic and orthogonal AE-based weight compression technique could serve as an advantageous alternative (or an add-on) in a...
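A minimal sketch of the encoder/decoder split described above, assuming PyTorch; the network shape, code size and variable names are illustrative assumptions, not the paper's architecture.

    # Sketch only: autoencoder compression of a flattened weight-update vector.
    # The encoder would run on each client; the decoder on the aggregating server.
    import torch
    import torch.nn as nn

    class WeightUpdateAE(nn.Module):
        def __init__(self, update_dim, code_dim):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(update_dim, code_dim), nn.Tanh())
            self.decoder = nn.Linear(code_dim, update_dim)

        def forward(self, delta):
            return self.decoder(self.encoder(delta))

    # Client side: send only the low-dimensional code instead of the full update.
    update_dim, code_dim = 10_000, 500             # hypothetical sizes
    ae = WeightUpdateAE(update_dim, code_dim)
    delta = torch.randn(1, update_dim)             # stand-in for a real weight update
    code = ae.encoder(delta)                       # transmitted over the network
    # Server side: reconstruct the update before aggregation.
    recovered = ae.decoder(code)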
Federated Learning allows training on data stored in distributed devices without the need for centralizing training data, thereby maintaining data privacy. Addressing the ability to handle data heterogeneity (non-identical and independent distribution, or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming the acknowledged FedAvg limitations in non-IID environments. We propose a novel use of a cosine-distance based weight-divergence metric to determine the exact point at which a deep learning network can be divided into class-agnostic initial layers and class-specific deep layers for performing Divide-and-Conquer training. We show that the methodology achieves trained-model accuracy at par with (and in certain cases exceeding) the numbers achieved by state-of-the-art algorithms like FedProx, FedMA, etc. We also show that this methodology leads to compute and/or bandwidth optimizations under certain documented conditions.
Benchmarking, Measuring, and Optimizing, 2020
The performance of an end-to-end neural network on a given hardware platform is a function of its compute and memory signature, which, in turn, is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, precision, etc. Current benchmarking tools suffer from limitations such as: a) being too granular, like DeepBench [1]; b) mandating a working implementation that is framework specific, hardware-architecture specific, or both; or c) providing only high-level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework to effectively identify the memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware-specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish the baselines needed to understand the performance of an end-to-end neural network topology and make high-level architectural decisions. Further, integration of NTP with frameworks like TensorFlow allows performance comparison along several vectors: a) comparison of different frameworks on given hardware; b) comparison of different hardware using a given framework; and c) comparison across different heterogeneous hardware configurations for a given framework. These capabilities empower a researcher to effortlessly make the architectural decisions needed for achieving optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular speech recognition topology.
The design of an ALU and a cache memory for use in a high-performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8-bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed to be used as cache memory and as a fast Look-Up Table. The ALU consists of stand-alone units for bit-parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge-Stone parallel-prefix hardware operating at 330MHz. A high-performance multiplier was built using a Radix-4 Modified Booth Encoder (MBE) and a Wallace Tree summation array. The multiplier requires a single clock cycle for 8-bit integer multiplication and operates at a maximum frequency of 100MHz. Multiplicative division hardware was built for executing both integer division and square root. The division hardware computes 8-bit division and square root in 4 clock cycles. The multiplier forms the basic building block of all these functional units, making a high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70MHz. A 6T CMOS SRAM cell measuring 90 µm² was designed using minimum-size transistors. The layout allows for horizontal overlap, resulting in an effective area of 76 µm² for an 8x8 array. By substituting the equivalent bit-line capacitance of the P4 L1 Cache, the memory was simulated to have a read time of 3.27ns.
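The Kogge-Stone prefix computation can be modelled in a few lines of Python; this is a behavioral illustration only, not the thesis RTL, and the helper name is hypothetical.

    # Sketch only: Kogge-Stone parallel-prefix carry computation for 8-bit addition.
    def kogge_stone_add(a, b, width=8):
        g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]   # generate
        p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]   # propagate
        G, P = g[:], p[:]
        d = 1
        while d < width:                        # log2(width) prefix stages
            G_new, P_new = G[:], P[:]
            for i in range(d, width):
                G_new[i] = G[i] | (P[i] & G[i - d])
                P_new[i] = P[i] & P[i - d]
            G, P = G_new, P_new
            d *= 2
        carries = [0] + G[:-1]                  # carry into each bit (c0 = 0)
        s = sum((p[i] ^ carries[i]) << i for i in range(width))
        return s, G[width - 1]                  # sum and carry-out

    # Example: kogge_stone_add(100, 57) -> (157, 0)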
A method for estimating power consumption of a design block of an integrated circuit includes obtaining power consumption data from designs of older-generation microprocessors, selecting a set of power consumption parameters, applying a curve-fitting technique to the obtained power consumption data for the selected set of power consumption parameters, creating a new power consumption model based on the curve-fitting technique and one or more of the power consumption parameters, and using the model ...
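A brief sketch of the curve-fitting step, assuming NumPy; the chosen parameters (frequency, gate count) and the sample data are hypothetical illustrations, not the patent's actual parameter set.

    # Sketch only: least-squares fit of a power model to data from older-generation designs.
    import numpy as np

    # Hypothetical historical data: (frequency in GHz, gate count in millions) -> watts
    freq = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
    gates = np.array([10.0, 12.0, 15.0, 18.0, 22.0])
    power = np.array([5.1, 7.9, 11.2, 15.4, 20.8])

    # Fit power ~= c0 + c1*freq + c2*gates
    A = np.column_stack([np.ones_like(freq), freq, gates])
    coeffs, *_ = np.linalg.lstsq(A, power, rcond=None)

    def estimate_power(new_freq, new_gates):
        return coeffs[0] + coeffs[1] * new_freq + coeffs[2] * new_gates

    # estimate_power(2.2, 16.0) -> projected power for a new design block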
AI, Machine Learning and Applications, 2021
Federated Learning allows training on data stored in distributed devices without the need for centralizing training data, thereby maintaining data privacy. Addressing the ability to handle data heterogeneity (non-identical and independent distribution, or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming the acknowledged FedAvg limitations in non-IID environments. We propose a novel use of a cosine-distance based weight-divergence metric to determine the exact point at which a deep learning network can be divided into class-agnostic initial layers and class-specific deep layers for performing Divide-and-Conquer training. We show that the methodology achieves trained-model accuracy at par with (and in certain cases exceeding) the numbers achieved by state-of-the-art algorithms like FedProx, FedMA, etc. We also show that this methodology leads to compute and/or bandwidth optimizations under certain documented conditions.