Papers by Kaoutar El Maghraoui
In neural architecture search (NAS), training every sampled architecture is very time-consuming and should be avoided. Weight-sharing is a promising solution to speed up the evaluation process. However, training the supernetwork incurs many discrepancies between the actual ranking and the predicted one. Additionally, efficient deep-learning engineering processes require incorporating realistic hardware-performance metrics into the NAS evaluation process, also known as hardware-aware NAS (HW-NAS). In HW-NAS, estimating task-specific performance and hardware efficiency are both required. This paper proposes a supernetwork training methodology that preserves the Pareto ranking between its different subnetworks, resulting in more efficient and accurate neural networks for a variety of hardware platforms. The results show a 97% near-Pareto-front approximation in less than 2 GPU days of search, which provides a 2x speedup compared to state-of-the-art methods. We validate our methodology on NAS-Bench-201, DARTS, and ImageNet. Our optimal model achieves 77.2% accuracy (+1.7% compared to baseline) with an inference time of 3.68 ms on Edge GPU for ImageNet, which yields a 2.3x speedup. The training implementation can be found at https://github.com/IHIaadj/PRP-NAS
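The Pareto ranking this abstract refers to can be illustrated with a minimal non-dominated-sorting sketch. This is not the paper's code; the function names and the example subnetwork numbers are hypothetical, with objectives chosen as (accuracy, negated latency) so that both are maximized.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on every objective
    and strictly better on at least one (objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of (accuracy, -latency) tuples."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# Hypothetical subnetworks: (accuracy %, negated latency in ms).
subnets = [(77.2, -3.68), (75.5, -2.10), (74.0, -5.00), (76.0, -3.00)]
front = pareto_front(subnets)
```

A supernetwork trained to be ranking-preserving should keep this front stable whether objectives are measured on the supernetwork's weight-shared subnetworks or after standalone training.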
arXiv (Cornell University), Apr 5, 2021
We introduce the IBM Analog Hardware Acceleration Kit, a new and first-of-a-kind open-source toolkit to simulate analog crossbar arrays in a convenient fashion from within PyTorch (freely available at https://github.com/IBM/aihwkit). The toolkit is under active development and is centered around the concept of an "analog tile", which captures the computations performed on a crossbar array. Analog tiles are building blocks that can be used to extend existing network modules with analog components and compose arbitrary artificial neural networks (ANNs) using the flexibility of the PyTorch framework. Analog tiles can be conveniently configured to emulate a plethora of different analog hardware characteristics and their non-idealities, such as device-to-device and cycle-to-cycle variations, resistive device response curves, and weight and output noise. Additionally, the toolkit makes it possible to design custom unit cell configurations and to use advanced analog optimization algorithms such as Tiki-Taka. Moreover, the backward and update behavior can be set to "ideal" to enable hardware-aware training features for chips that target inference acceleration only. To evaluate the inference accuracy of such chips over time, we provide statistical programming-noise and drift models calibrated on phase-change memory hardware. Our new toolkit is fully GPU-accelerated and can be used to conveniently estimate the impact of material properties and non-idealities of future analog technology on the accuracy of arbitrary ANNs.
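The core computation an analog tile emulates can be sketched in plain Python: an ideal matrix-vector product plus additive output noise, one of the non-idealities mentioned above. This is an illustrative toy model, not the AIHWKit API, and the noise level is an arbitrary example value.

```python
import random

def analog_matvec(weights, x, out_noise_std=0.02, seed=0):
    """Toy model of an analog-tile matrix-vector product: the ideal dot
    product per row, perturbed by Gaussian output noise (illustrative)."""
    rng = random.Random(seed)
    return [sum(w * xi for w, xi in zip(row, x)) + rng.gauss(0.0, out_noise_std)
            for row in weights]

# Ideal result would be [0.0, 1.9]; the analog model returns noisy values.
W = [[0.5, -0.25], [0.1, 0.9]]
y = analog_matvec(W, [1.0, 2.0])
```

In the real toolkit such non-idealities are configured per tile rather than hard-coded, and many more effects (device response curves, update noise) are modeled.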
arXiv (Cornell University), May 17, 2018
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used in IBM. We begin by explaining how DL training workloads are different and what features ensure dependability in this context. We then describe the architecture, design, and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible, and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform and discuss tradeoffs between efficiency and dependability. Authors' names are listed in alphabetical order. The authors would like to thank Khoa Hyunh of IBM for his help evaluating DLaaS performance overhead.
IBM journal of research and development, 2017
IBM's Technical Support Services division runs remote support centers, where agents provide phone support for client problems related to IBM and non-IBM hardware and software products. Support center personnel use numerous pieces of information, including many searches, log files, and records of historical support tickets from disparate data sources, to recommend solutions for customer technical problems. We have built an advanced search system to assist support agents in resolving customer service requests and to improve our client experience. The system has been deployed and is used globally by thousands of support center personnel. In this paper, we describe the system's architecture, the technical challenges, and the innovative solution we have built. In addition, we discuss novel ideas to address the unique requirements and challenges of the support services domain. These ideas include using system logs and domain knowledge to automatically expand agent queries, incorporating implicit agent feedback, and selecting features to extract useful information from highly unstructured and noisy ticket data. Results on the effectiveness of the system are presented. We also discuss future work on enhancing the system's capability to automatically diagnose customer hardware and software problems and remediate them.
Welcome to the first workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW2013. The motivation behind INFLOW was to bring together researchers and practitioners working in systems across the hardware/software stack who are interested in the cross-cutting issues of NVM/Flash technologies, operating systems, and emerging workloads.
arXiv (Cornell University), Jul 18, 2023
Analog In-Memory Computing (AIMC) is a promising approach to reduce the latency and energy consumption of Deep Neural Network (DNN) inference and training. However, the noisy and non-linear device characteristics, and the non-ideal peripheral circuitry in AIMC chips, require adapting DNNs to be deployed on such hardware to achieve equivalent accuracy to digital computing. In this tutorial, we provide a deep dive into how such adaptations can be achieved and evaluated using the recently released IBM Analog Hardware Acceleration Kit (AIHWKit), freely available at https://github.com/IBM/aihwkit. The AIHWKit is a Python library that simulates inference and training of DNNs using AIMC. We present an in-depth description of the AIHWKit design, functionality, and best practices to properly perform inference and training. We also present an overview of the Analog AI Cloud Composer, a platform that provides the benefits of using the AIHWKit simulation in a fully managed cloud setting along with physical AIMC hardware access, freely available at https://aihw-composer.draco.res.ibm.com. Finally, we show examples on how users can expand and customize AIHWKit for their own needs.
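One effect that makes evaluating "inference accuracy over time" necessary is conductance drift in phase-change memory, which is commonly modeled as a power-law decay after programming. The sketch below illustrates the general form only; the parameter values are typical orders of magnitude, not the calibrated models shipped with AIHWKit.

```python
def drifted_weight(w0, t_seconds, t0=20.0, nu=0.05):
    """Illustrative power-law drift model: a programmed conductance w0
    decays as (t / t0) ** -nu, where t0 is the reference read time after
    programming and nu is the drift exponent (values are assumptions)."""
    return w0 * (t_seconds / t0) ** (-nu)

# A unit weight read one day (86,400 s) after programming has decayed.
w_day = drifted_weight(1.0, 86_400.0)
```

In a simulator, applying such a model to every weight before inference shows how a network's accuracy degrades over deployment time unless drift compensation is used.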
International Conference on e-Science, Dec 4, 2006
We have designed a maximum likelihood fitter using the actor model to distribute the computation over a heterogeneous network. The prototype implementation uses the SALSA programming language and the Internet Operating System middleware. We have used our fitter to perform a partial wave analysis of particle physics data. Preliminary measurements have shown good performance and scalability. We expect our approach to be applicable to other scientific domains, such as biology and astronomy, where maximum likelihood evaluation is an important technique. We also expect our performance results to scale to Internet-wide runtime infrastructures, given the high adaptability of our software framework.
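Maximum likelihood evaluation distributes naturally because the log-likelihood of independent events is a sum of per-event terms, so each actor can evaluate one slice of the data. The sketch below mimics that split with Python threads and a Gaussian likelihood; it is a stand-in for the SALSA/actor design, not the fitter itself, and all names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def chunk_loglik(data, mu=0.0, sigma=1.0):
    """Partial Gaussian log-likelihood over one slice of the events."""
    c = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(c - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def distributed_loglik(data, workers=3):
    """Split the likelihood sum across workers, as independent actors
    could each evaluate their own slice and report a partial sum."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_loglik, chunks))

ll = distributed_loglik([0.1, -0.2, 0.3, 0.0, 1.0, -1.0])
```

Because the partial sums are independent, slow or migrating workers change only when results arrive, not the final value, which is what makes the evaluation tolerant of a heterogeneous network.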
Society for Industrial and Applied Mathematics eBooks, 2006
Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly-coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous systems are increasingly common. Processing and communication capabilities can be nonuniform, non-dedicated, transient or unreliable. Even when targeting homogeneous computing environments, each environment may differ in the number of processors per node, the relative costs of computation, communication, and memory access, and the availability of programming paradigms and software tools. Architecture-aware computation requires knowledge of the computing environment and software performance characteristics, and tools to make use of this knowledge. These challenges may be addressed by compilers, low-level tools, dynamic load balancing or solution procedures, middleware layers, high-level software development techniques, and choice of programming languages and paradigms. Computation and communication may be reordered. Data or computation may be replicated or a load imbalance may be tolerated to avoid costly communication. This paper samples a variety of approaches to architecture-aware parallel computation.
Elsevier eBooks, 2005
Computational grids are appealing platforms for the execution of large scale applications among the scientific and engineering communities. However, designing new applications and deploying existing ones with the capability of exploiting this potential still remains a challenge. Computational grids are characterized by their dynamic, non-dedicated, and heterogeneous nature. Novel application-level and middleware-level techniques are needed to allow applications to reconfigure themselves and adapt automatically to their underlying execution environments. In this paper, we introduce a new software framework that enhances the performance of Message Passing Interface (MPI) applications through an adaptive middleware for load balancing that includes process checkpointing and migration. Fields as diverse as fluid dynamics, materials science, biomechanics, and ecology make use of parallel adaptive computation. Target architectures have traditionally been supercomputers and tightly coupled clusters. This framework is a first step in allowing these computations to use computational grids efficiently.
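Process checkpointing and migration, the mechanisms this framework builds on, reduce to serializing a process's state so a replacement process elsewhere can resume from it. The sketch below shows the bare idea with Python's pickle on local files; the real middleware checkpoints MPI processes and moves them between machines, and every name here is illustrative.

```python
import os
import pickle
import tempfile

def checkpoint(state, path):
    """Serialize the application state so a migrated replacement
    process can resume from it instead of restarting from scratch."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore(path):
    """Load a previously saved state on the destination resource."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Save mid-computation state, then "migrate" by restoring it.
state = {"iteration": 42, "grid": [0.0] * 8}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
checkpoint(state, path)
resumed = restore(path)
```

The design choice this illustrates is that migration costs one serialize/transfer/deserialize cycle, which pays off when the destination resource is enough faster or less loaded than the source.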
Cluster Computing, Jun 28, 2007
Iterative applications are known to run as slow as their slowest computational component. This paper introduces malleability, a new dynamic reconfiguration strategy to overcome this limitation. Malleability is the ability to dynamically change the data size and number of computational entities in an application. Malleability can be used by middleware to autonomously reconfigure an application in response to dynamic changes in resource availability in an architecture-aware manner, allowing applications to optimize the use of multiple processors and diverse memory hierarchies in heterogeneous environments. The modular Internet Operating System (IOS) was extended to reconfigure applications autonomously using malleability. Two different iterative applications were made malleable. The first, used in astronomical modeling and representative of maximum-likelihood applications, was made malleable in the SALSA programming language. The second models the diffusion of heat over a two-dimensional object and is representative of applications such as partial differential equations and some types of distributed simulations. Versions of the heat application were made malleable both in SALSA and MPI. Algorithms for concurrent data redistribution are given for each type of application. Results show that using malleability for reconfiguration is 10 to 100 times faster on the tested environments.
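The two ingredients of the heat application, an iterative stencil update and data redistribution when the number of workers changes, can be sketched in a few lines. This is a 1-D illustration with hypothetical names and an even-split policy, not the paper's concurrent redistribution algorithms.

```python
def heat_step(u, alpha=0.1):
    """One explicit heat-diffusion update on a 1-D rod, boundaries fixed."""
    return [u[0]] + [u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
                     for i in range(1, len(u) - 1)] + [u[-1]]

def redistribute(blocks, n_workers):
    """Malleability sketch: flatten the per-worker data blocks and
    re-split them evenly over a new number of workers."""
    flat = [x for b in blocks for x in b]
    q, r = divmod(len(flat), n_workers)
    out, i = [], 0
    for w in range(n_workers):
        size = q + (1 if w < r else 0)
        out.append(flat[i:i + size])
        i += size
    return out

# Grow from 1 worker to 3: the single block is split into three.
blocks = redistribute([[1.0, 2.0, 3.0, 4.0]], 3)
```

Changing the number of entities this way is what distinguishes malleability from migration, which only moves fixed-size entities between processors.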
Lecture Notes in Computer Science, 2006
With the proliferation of large scale dynamic execution environments such as grids, the need for providing efficient and scalable application adaptation strategies for long running parallel and distributed applications has emerged. Message passing interfaces were initially designed with a traditional machine model in mind, which assumes homogeneous and static environments. It is inevitable that long running message passing applications will require support for dynamic reconfiguration to maintain high performance under varying load conditions. In this paper we describe a framework that provides iterative MPI applications with reconfiguration capabilities. Our approach is based on integrating MPI applications with a middleware that supports process migration and large scale distributed application reconfiguration. We present our architecture for reconfiguring MPI applications, and verify our design with a heat diffusion application in a dynamic setting.
Hawaii International Conference on System Sciences, Jan 5, 2004
The Internet is constantly growing as a ubiquitous platform for high-performance distributed computing. In this paper, we propose a new software framework for distributed computing over large scale dynamic and heterogeneous systems. Our framework wraps computation into autonomous actors, self-organizing computing entities, which freely roam over the network to find their optimal target execution environments. We introduce the architecture of our worldwide computing framework, which consists of an actor-oriented programming language (SALSA), a distributed run time environment (WWC), and a middleware infrastructure for autonomous reconfiguration and load balancing (IO). Load balancing is completely transparent to application programmers. The middleware triggers actor migration based on profiling resources in a completely decentralized manner. Our infrastructure also allows for the dynamic addition and removal of nodes from the computation, while continuously balancing the load given the changing resources. To balance computational load, we introduce three variations of random work stealing: load-sensitive (RS), actor topology-sensitive (ARS), and network topology-sensitive (NRS) random stealing. We evaluated RS and ARS with several actor interconnection topologies in a local area network. While RS performed worse than static round-robin (RR) actor placement, ARS outperformed both RS and RR in the sparse connectivity and hypercube connectivity tests, by a full order of magnitude.
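The common skeleton behind the three stealing variants is that an idle worker picks a peer at random and takes part of its work; the variants differ in how the victim is chosen (load, actor topology, or network topology). A minimal load-style sketch, with hypothetical names and a steal-half policy rather than the paper's exact protocol:

```python
import random

def steal_step(queues, rng):
    """One random-stealing round: each idle worker picks a random peer
    and steals half of its tasks if that peer has work to spare."""
    idle = [i for i, q in enumerate(queues) if not q]
    for i in idle:
        victim = rng.randrange(len(queues))
        if victim != i and len(queues[victim]) > 1:
            half = len(queues[victim]) // 2
            queues[i], queues[victim] = queues[victim][:half], queues[victim][half:]
    return queues

# Three workers, one loaded: after a round, no task is lost or duplicated.
rng = random.Random(1)
qs = steal_step([["t1", "t2", "t3", "t4"], [], []], rng)
```

Work stealing is attractive in decentralized settings precisely because only idle workers generate traffic, so a fully loaded system pays no balancing overhead.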
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
IEEE Circuits and Systems Magazine
Parallel Processing for Scientific Computing, 2006
Proceedings of the first joint WOSP/SIPEW international conference on Performance engineering - WOSP/SIPEW '10, 2010
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
Simultaneous multithreading (SMT) increases CPU utilization and application performance in many circumstances, but it can be detrimental when performance is limited by application scalability or when there is significant contention for CPU resources. This paper describes an SMT-selection metric that predicts the change in application performance when the SMT level and number of application threads are varied. This metric is obtained online through hardware performance counters with little overhead, and allows the application or operating system to dynamically choose the best SMT level. We have validated the SMT-selection metric using a variety of benchmarks that capture various application characteristics on two different processor architectures. Our results show that the SMT-selection metric is capable of predicting the best SMT level for a given workload in 90% of the cases. The paper also shows that such a metric can be used with a scheduler or application optimizer to help guide its optimization decisions.
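The general shape of such a counter-based metric is to weigh useful work (instructions per cycle) against a contention signal. The sketch below is a toy stand-in only: the counter names, the miss-penalty weight, and the formula are assumptions for illustration, not the paper's SMT-selection metric.

```python
def smt_benefit_estimate(counters):
    """Toy counter-based score: high instructions-per-cycle suggests headroom
    for more SMT threads, while a high cache-miss rate signals contention
    that discounts the score (weights here are arbitrary assumptions)."""
    ipc = counters["instructions"] / counters["cycles"]
    misses_per_insn = counters["cache_misses"] / counters["instructions"]
    return ipc * (1.0 - min(1.0, 10.0 * misses_per_insn))

# Example counter sample read from a hypothetical profiling interval.
score = smt_benefit_estimate(
    {"instructions": 2_000_000, "cycles": 1_000_000, "cache_misses": 10_000})
```

A scheduler could sample such a score at different SMT levels and keep the configuration with the highest value, which is the online decision loop the paper describes.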
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012
Recovering from OS crashes has traditionally been done using reboot or checkpoint-restart mechanisms. Such techniques either fail to preserve the state before the crash happens or require modifications to applications. To eliminate these problems, we present a novel OS-hypervisor infrastructure for automated OS crash diagnosis and recovery in virtual servers. Our approach uses a small hidden OS repair image that is dynamically created from the healthy running OS instance. Upon an OS crash, the hypervisor automatically loads this repair image to perform diagnosis and repair. The offending process is then quarantined, and the fixed OS automatically resumes running without a reboot. Our experimental evaluations demonstrated that it takes less than 3 seconds to recover from an OS crash. This approach can significantly reduce downtime and maintenance costs in data centers. This is the first design and implementation of an OS-hypervisor combo capable of automatically resurrecting a crashed commercial server OS.