IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021
Mobile software is becoming increasingly feature-rich, commonly being accessorized with the powerful decision-making capabilities of machine learning (ML). To keep up with the consequently higher power and performance demands, system and hardware architects add specialized hardware units onto their systems-on-chip (SoCs), coupled with frameworks to delegate compute optimally. While these SoC innovations are rapidly improving ML model performance and power efficiency, the auxiliary data processing and supporting infrastructure needed to enable ML model execution can substantially alter the performance profile of a system. This work posits the existence of an AI tax: the time spent on non-model execution tasks. We characterize the execution pipelines of open-source ML benchmarks and Android applications in terms of AI tax and discuss where performance bottlenecks may unexpectedly arise.
Index Terms: mobile systems, Android, system-on-chip, hardware acceleration, machine learning, workload characterization.
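Concretely, the AI tax can be quantified by timing each pipeline stage and comparing model time against end-to-end time. The sketch below is a hypothetical C harness, not code from the paper: the stage functions are empty placeholders standing in for a real mobile inference pipeline.

```c
/* Hypothetical AI-tax measurement harness: time each stage and report
 * the share of end-to-end latency spent outside model execution.
 * Stage bodies are placeholders, not code from the paper. */
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Placeholder stages of a typical mobile inference pipeline. */
static void preprocess(void)  { /* e.g., image decode, resize, quantize */ }
static void run_model(void)   { /* the ML model inference itself       */ }
static void postprocess(void) { /* e.g., NMS, label mapping, rendering */ }

int main(void) {
    double t0 = now_sec(); preprocess();
    double t1 = now_sec(); run_model();
    double t2 = now_sec(); postprocess();
    double t3 = now_sec();

    double total = t3 - t0;
    double model = t2 - t1;
    /* AI tax: fraction of end-to-end time spent on non-model tasks. */
    printf("AI tax: %.1f%% of %.3f ms\n",
           100.0 * (total - model) / total, total * 1e3);
    return 0;
}
```

With such a breakdown, a pipeline whose model runs in 5 ms but whose decode and post-processing take 15 ms shows a 75% AI tax, which matches the paper's point that model-only speedups can be dwarfed by the surrounding work.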
Machine Learning (ML) is widely used today in many mobile applications. To preserve user privacy, there is a need to perform ML inference on mobile devices. Given that ML inference is a computationally intensive task, the common technique used in mobile devices is to offload the task to a neural accelerator. However, the speed-up gained from offloading these tasks to accelerators is limited by the overhead of frequent host-accelerator communication. In this paper, we propose a complete end-to-end solution that uses an in-pipeline machine learning processing unit to accelerate ML workloads. First, we introduce the software infrastructure we developed to support the compilation and execution of machine learning models used in the TensorFlow Lite framework. Then, we discuss the microarchitecture we plan to implement to support the execution of our vectorized machine learning kernels.
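For context, the host-side execution path that such a compilation flow plugs into is the standard TensorFlow Lite interpreter. The C sketch below uses the public TFLite C API to illustrate that flow; the model path and buffer sizes are placeholders, and the paper's own infrastructure is not reproduced here.

```c
/* Standard TensorFlow Lite C-API inference flow. Optimized kernels
 * (vectorized or accelerator-backed) dispatch inside Invoke().
 * "model.tflite" and the buffer sizes are placeholders. */
#include "tensorflow/lite/c/c_api.h"

int run_inference(const float* input, float* output,
                  size_t in_bytes, size_t out_bytes) {
    TfLiteModel* model = TfLiteModelCreateFromFile("model.tflite");
    TfLiteInterpreterOptions* opts = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreter* interp = TfLiteInterpreterCreate(model, opts);

    TfLiteInterpreterAllocateTensors(interp);
    TfLiteTensorCopyFromBuffer(
        TfLiteInterpreterGetInputTensor(interp, 0), input, in_bytes);
    TfLiteInterpreterInvoke(interp);  /* per-op kernels run here */
    TfLiteTensorCopyToBuffer(
        TfLiteInterpreterGetOutputTensor(interp, 0), output, out_bytes);

    TfLiteInterpreterDelete(interp);
    TfLiteInterpreterOptionsDelete(opts);
    TfLiteModelDelete(model);
    return 0;
}
```

An in-pipeline unit, unlike a discrete accelerator, avoids the buffer copies across a host-device boundary that this flow would otherwise incur on every Invoke().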
Objectives: Hospitals are psychologically demanding workplaces with a need for context-specific stress-preventive leadership interventions. A stress-preventive interprofessional leadership intervention for middle management has been developed. This phase-II study investigates its feasibility and outcomes, including work-related stress, well-being and transformational leadership.
Design: This is a mixed-methods study with three measurement points (T0: baseline; T1: after the last training session; T2: 3-month follow-up). Additionally, focus groups were conducted to assess participants' change in everyday work.
Setting: A tertiary hospital in Germany.
Participants: N=93 leaders of different professions.
Intervention: An interactive group-setting intervention divided into five separate sessions: (1) self-care as a leader, (2) leadership attitudes and behaviour, (3) motives, needs and stressors of employees, (4) strengthening the resource 'team', and (5) reflection and focus groups. The intervention was conducted…
Deep neural networks have been extensively adopted for a myriad of applications due to their ability to learn patterns from large amounts of data. The desire to preserve user privacy and reduce user-perceived latency has created the need to perform deep neural network inference tasks on low-power consumer edge devices. Since such tasks often tend to be computationally intensive, offloading this compute from the mobile/embedded CPU to a purpose-designed "Neural Processing Engine" is a commonly adopted solution for accelerating deep learning computations. While these accelerators offer significant speed-ups for key machine learning kernels, overheads resulting from frequent host-accelerator communication often diminish the net application-level benefit of this heterogeneous system. Our solution for accelerating such workloads involves developing ISA extensions customized for machine learning kernels and designing a custom in-pipeline execution unit for these specialized instructions. We base our ISA extensions on RISC-V: an open ISA specification that lends itself to such specializations. In this paper, we present the software infrastructure for optimizing neural network execution on RISC-V with ISA extensions. Our ISA extensions are derived from the RISC-V Vector ISA proposal, and we develop optimized implementations of critical kernels such as convolution and matrix multiplication using these instructions. These optimized functions are subsequently added to the TensorFlow Lite source code and cross-compiled for RISC-V. We find that only a small set of instruction extensions achieves coverage over a wide variety of deep neural networks designed for vision- and speech-related tasks. On average, our software implementation using the extended instruction set reduces the executed instruction count by 8X compared with the baseline implementation. In parallel, we are also working on the hardware design of the in-pipeline machine learning accelerator. We plan to open-source our…
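To illustrate the kernel style this implies, the sketch below writes a strip-mined dot product (the inner loop of matrix multiplication and im2col-based convolution) in C using the ratified RVV 1.0 intrinsics. This is an assumption-laden stand-in: the paper targeted an earlier draft of the vector proposal, so its actual instruction names and kernels differ. A vector-capable toolchain (e.g., -march=rv64gcv) is assumed.

```c
/* Strip-mined dot product with RVV 1.0 C intrinsics -- a sketch of the
 * vector-length-agnostic kernel style, not the paper's actual code. */
#include <riscv_vector.h>
#include <stddef.h>

float dot_f32(const float *a, const float *b, size_t n) {
    /* Scalar seed for the running sum, kept in element 0. */
    vfloat32m1_t acc = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);   /* hardware picks strip size */
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        vfloat32m8_t prod = __riscv_vfmul_vv_f32m8(va, vb, vl);
        /* acc[0] += sum(prod[0..vl-1]) */
        acc = __riscv_vfredusum_vs_f32m8_f32m1(prod, acc, vl);
        a += vl; b += vl; n -= vl;
    }
    return __riscv_vfmv_f_s_f32m1_f32(acc);
}
```

Because vsetvl negotiates the strip size at run time, one binary covers any hardware vector length, which is consistent with the paper's finding that a small instruction set covers many network shapes.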
In this paper, we introduce the BlackParrot multicore processor, a mainstream, industrial-strength, open-source implementation of the RISC-V RV64G architecture. BlackParrot is a clean-slate processor design with a lean, energy-efficient, and highly performant implementation. The key differentiator between BlackParrot and prior RISC-V processor efforts is that our goal is to distribute stewardship across industry and government stakeholders instead of adopting a "freemium" model in which the source is controlled by a private startup that does not release the actual code it tapes out. Our approach enables a pathway to creating the equivalent of Linux for RISC-V: a truly open RISC-V processor to power the open-source hardware revolution and the Age of Bespoke Silicon.
Keywords: open-source hardware; architecture; RISC-V.
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021
Query processing for data analytics with machine learning scoring involves executing heterogeneous operations in a pipelined fashion. Hardware acceleration is one approach to improving pipeline performance and freeing up processor resources by offloading computations to accelerators. However, the performance benefits of accelerators can be limited by compute and data offloading overheads. Although prior works have studied acceleration opportunities, including accelerators for machine learning operations, end-to-end application performance analysis has not been well studied, particularly for data analytics and model-scoring pipelines. In this paper, we study the speedups and overheads of using PCIe-based hardware accelerators in such pipelines. In particular, we analyze the effectiveness of using GPUs and FPGAs to accelerate scoring for random forest, a popular machine learning model, on tabular input data obtained from Microsoft SQL Server. We observe that the offloading decision, as well as the choice of the optimal hardware backend, should depend on at least the model complexity (e.g., number of features and tree depth), the scoring data size, and the overheads associated with data movement and invocation of the pipeline stages. We also highlight potential future research directions based on our findings.
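A first-order way to frame that offloading decision is a break-even cost model: offload only when the accelerator's per-row compute savings outweigh the PCIe transfer and invocation overheads. The C sketch below is hypothetical; its constants are illustrative placeholders, not measurements from the paper.

```c
/* Hypothetical first-order offload model for an analytics + scoring
 * pipeline. Constants are illustrative, not measured values. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double cpu_ns_per_row;    /* scoring cost on the host CPU       */
    double accel_ns_per_row;  /* scoring cost on the GPU/FPGA       */
    double pcie_ns_per_byte;  /* effective PCIe transfer cost       */
    double invoke_ns;         /* fixed kernel-launch/invoke latency */
} cost_model_t;

static bool should_offload(const cost_model_t* m,
                           size_t rows, size_t row_bytes) {
    double t_cpu   = m->cpu_ns_per_row * (double)rows;
    double t_accel = m->accel_ns_per_row * (double)rows
                   + m->pcie_ns_per_byte * (double)rows * (double)row_bytes
                   + m->invoke_ns;
    return t_accel < t_cpu;
}

int main(void) {
    cost_model_t m = {50.0, 2.0, 0.1, 2e6};  /* illustrative numbers */
    for (size_t rows = 1000; rows <= 1000000; rows *= 10)
        printf("%8zu rows -> %s\n", rows,
               should_offload(&m, rows, 64) ? "offload" : "stay on CPU");
    return 0;
}
```

With these example constants the break-even point falls near 50,000 rows: small scoring batches stay on the CPU because the fixed invocation cost dominates, mirroring the paper's observation that data size and movement overheads should drive the backend choice.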
This article introduces BlackParrot, which aims to be the default open-source, Linux-capable, cache-coherent, 64-bit RISC-V multicore used by the world. In pursuing this goal, our research aims to advance the world's knowledge about the "software engineering of hardware." Although originally bootstrapped by the University of Washington and Boston University via DARPA funding, BlackParrot strives to be community-driven and infrastructure-agnostic: a multicore that is Pareto-optimal in terms of power, performance, area, and complexity. To ensure BlackParrot is easy to use, extend, and, most importantly, trust, development is guided by three core principles: Be Tiny, Be Modular, and Be Friendly. Development efforts have prioritized intentional interfaces, modularity, and silicon validation as first-order design metrics, so that users can quickly get started and trust that their design will perform as expected.