Skip to main content

Clement Farabet

New York University, Courant Institute of Mathematical Sciences, Department Member

Followers

122

Following

1

Public Views

Clément Farabet is VP of AI Infrastructure at NVIDIA. He received a PhD from Université Paris-Est in 2013, while at NYU, co-advised by Laurent Najman and Yann LeCun. His thesis focused on real-time image understanding, introducing multi-scale convolutional neural networks and a custom hardware arch for deep learning. He co-founded Madbits, a startup focused on web-scale image understanding, sold to Twitter in 2014. He cofounded Twitter Cortex, a team focused on building Twitter’s deep learning platform for recommendations/search/spam/nsfw/ads.

less

InterestsView All (8)

Uploads

Papers by Clement Farabet

Indoor semantic segmentation using depth information: 1st International Conference on Learning Representations, ICLR 2013

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area ... more This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. We obtain state-of-the-art on the NYU-v2 depth dataset with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in videos sequences that could be processed in real-time using appropriate hardware such as an FPGA.

NeuFlow: Dataflow vision processing system-on-a-chip

This paper presents neuFlow system-on-a-chip-a neuromorphic vision system-on-a-chip implemented i... more This paper presents neuFlow system-on-a-chip-a neuromorphic vision system-on-a-chip implemented in the IBM 45 nm SOI process. The neuFlow processor was designed to accelerate neural networks and other complex vision algorithms based on large numbers of convolutions and matrix-to-matrix operations. Post-layout characterization shows that the system delivers up to 160 GOPS with an average power consumption of 570 mW. The power-efficiency and portability of this system is ideal for embedded vision-based devices, such as driver assistance, and robotic vision.

An FPGA-based stream processor for embedded real-time vision with Convolutional Networks

Many recent visual recognition systems can be seen as being composed of multiple layers of convol... more Many recent visual recognition systems can be seen as being composed of multiple layers of convolutional filter banks, interspersed with various types of non-linearities. This includes Convolutional Networks, HMAX-type architectures, as well as systems based on dense SIFT features or Histogram of Gradients. This paper describes a highlycompact and low power embedded system that can run such vision systems at very high speed. A custom board built around a Xilinx Virtex-4 FPGA was built and tested. It measures 70 × 80 mm, and the complete system-FPGA, camera, memory chips, flash-consumes 15 watts in peak, and is capable of more than 4 × 10 9 multiply-accumulate operations per second in real vision application. This enables real-time implementations of object detection, object recognition, and vision-based navigation algorithms in small-size robots, micro-UAVs, and hand-held devices. Real-time face detection is demonstrated, with speeds of 10 frames per second at VGA resolution.

Large-Scale FPGA-Based Convolutional Networks

Cambridge University Press eBooks, Dec 30, 2011

Other models like HMAX-type models (Serre et al., 2005; Mutch and Lowe, 2006) and convolutional n... more Other models like HMAX-type models (Serre et al., 2005; Mutch and Lowe, 2006) and convolutional networks use two more layers of successive feature extractors. Different training algorithms have been used for learning the parameters of convolutional networks. In LeCun et al. (1998b) and Huang and LeCun (2006), pure supervised learning is used to update the parameters. However, recent works have focused on training with an auxiliary task (

Hardware accelerated visual attention algorithm

Hardware accelerated convolutional neural networks for synthetic vision systems

In this paper we present a scalable hardware architecture to implement large-scale convolutional ... more In this paper we present a scalable hardware architecture to implement large-scale convolutional neural networks and state-of-the-art multi-layered artificial vision systems. This system is fully digital and is a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images. We present a performance comparison between a software, FPGA and ASIC implementation that shows a speed up in custom hardware implementations.

Torch7: A Matlab-like Environment for Machine Learning

Neural Information Processing Systems, 2011

Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. ... more Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface. 1 Torch7 Overview With Torch7, we aim at providing a framework with three main advantages: (1) it should ease the development of numerical algorithms, (2) it should be easily extended (including the use of other libraries), and (3) it should be fast. We found that a scripting (interpreted) language with a good C API appears as a convenient solution to "satisfy" the constraint (2). First, a high-level language makes the process of developing a program simpler and more understandable than a low-level language. Second, if the programming language is interpreted, it becomes also easier to quickly try various ideas in an interactive manner. And finally, assuming a good C API, the scripting language becomes the "glue" between heterogeneous libraries: different structures of the same concept (coming from different libraries) can be hidden behind a unique structure in the scripting language, while keeping all the functionalities coming from all the different libraries. Among existing scripting languages 1 finding the ones that satisfy condition (3) severely restricted our choice. We chose Lua, the fastest interpreted language (with also the fastest Just In Time (JIT) compiler 2) we could find. Lua as also the advantage to have been designed to be easily embedded in a C application, and provides a great C API, based on a virtual stack to pass values to and from C. This unifies the interface to C/C++ and makes library wrapping trivial. Lua is intended to be used as a powerful, lightweight scripting language for any program that needs one. Lua is implemented as a library, written in clean C (that is, in the common subset of ANSI C and C++). Quoting Lua webpage 3 ,

Tracking with deep neural networks

Deep learning methods have become the dominant artificial f b d d l f vision system for object de... more

Causal graph-based video segmentation

arXiv (Cornell University), Jan 8, 2013

Numerous approaches in image processing and computer vision are making use of super-pixels as a p... more Numerous approaches in image processing and computer vision are making use of super-pixels as a preprocessing step. Among the different methods producing such over-segmentation of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time. The algorithm may be trivially extended to video segmentation by considering a video as a 3D volume, however, this can not be the case for causal segmentation, when subsequent frames are unknown. We propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, filling the need for causal and real time applications.

Implementing Neural Networks Efficiently

Lecture Notes in Computer Science, 2012

Neural networks and machine learning algorithms in general require a flexible environment where n... more Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with best possible computational performance. To that end, we provide a new framework called Torch7, that is especially suited to achieve both of these competing goals. Torch7 is a versatile numeric computing framework and machine learning library that extends a very lightweight and powerful programming language Lua. Its goal is to provide a flexible environment to design, train and deploy learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can also easily be interfaced to third-party software thanks to Lua's light C interface.

CNP: An FPGA-based processor for Convolutional Networks

Training Data Subset Search With Ensemble Active Learning

IEEE Transactions on Intelligent Transportation Systems, Sep 1, 2022

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's optimization. Modifying the training distribution in a way that excludes such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning (AL) methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), as well as an internal object detection benchmark for prototyping perception models for autonomous driving. Unlike existing studies, our experiments on object detection are at the scale required for production-ready autonomous driving systems. We provide insights on the impact of different initialization schemes, acquisition functions and ensemble configurations at this scale. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.

Inside NVIDIA’s {AI} Infrastructure for Self-driving Cars

Training Data Distribution Search with Ensemble Active Learning

arXiv (Cornell University), Sep 25, 2019

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's optimization. Modifying the training distribution in a way that excludes such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), analyzing the impact of initialization schemes, acquisition functions and ensemble configurations. We demonstrate that data subsets identified with a lightweight ResNet-18 ensemble remain effective when used to train deep models like ResNet-101 and DenseNet-121. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.

Active Learning for Deep Object Detection via Probabilistic Modeling

arXiv (Cornell University), Mar 30, 2021

Active learning aims to reduce labeling costs by selecting only the most informative samples on a... more Active learning aims to reduce labeling costs by selecting only the most informative samples on a dataset. Few existing works have addressed active learning for object detection. Most of these methods are based on multiple models or are straightforward extensions of classification methods, hence estimate an image's informativeness using only the classification head. In this paper, we propose a novel deep active learning approach for object detection. Our approach relies on mixture density networks that estimate a probabilistic distribution for each localization and classification head's output. We explicitly estimate the aleatoric and epistemic uncertainty in a single forward pass of a single model. Our method uses a scoring function that aggregates these two types of uncertainties for both heads to obtain every image's informativeness score. We demonstrate the efficacy of our approach in PASCAL VOC and MS-COCO datasets. Our approach outperforms single-model based methods and performs on par with multi-model based methods at a fraction of the computing cost. Code is available at https://github.com/NVlabs/AL-MDN. Deep neural network Localization head Classification head Mixture density networks Unlabeled pool Scoring function Localization uncertainty Classification uncertainty … Informativeness scores Lab trainin Annotate top-data points Annotator Deep neural network Localization head Classification head Mixture density networks Unlabeled pool Scoring function Localization uncertainty Classification uncertainty … Informativeness scores Labeled training set Annotator

Less is More: An Exploration of Data Redundancy with Active Dataset Subsampling

arXiv (Cornell University), May 29, 2019

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's performance. If there is a large number of such samples, subsampling the training dataset in a way that removes them could provide an effective solution to both improve performance and reduce training time. In this paper, we propose an approach called Active Dataset Subsampling (ADS), to identify favorable subsets within a dataset for training using ensemble based uncertainty estimation. When applied to three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet) we find that there are low uncertainty subsets, which can be as large as 50% of the full dataset, that negatively impact performance. These subsets are identified and removed with ADS. We demonstrate that datasets obtained using ADS with a lightweight ResNet-18 ensemble remain effective when used to train deeper models like ResNet-101. Our results provide strong empirical evidence that using all the available data for training can hurt performance on large scale vision tasks.

Deep Active Learning for Object Detection with Mixture Density Networks

Towards real-time image understanding with convolutional networks

One of the open questions of artificial computer vision is how to produce good internal represent... more One of the open questions of artificial computer vision is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter? More interestingly, how could an artificial vision system learn appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world? Another related question is that of computational tractability, and more precisely that of computational efficiency. Given a good visual representation, how efficiently can it be trained, and used to encode new sensorial data. Efficiency has several dimensions: power requirements, processing speed, and memory usage. In this thesis I present three new contributions to the field of computer vision: (1) a multiscale deep convolutional network architecture to easily capture long-distance relationships between input variables in image data, (2) a tree-based algorithm to efficiently explore multiple segmentation candidates, to produce maximally confident semantic segmentations of images, (3) a custom dataflow computer architecture optimized for the computation of convolutional networks, and similarly dense image processing models. All three contributions were produced with the common goal of getting us closer to real-time image understanding. Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. In the first part of this thesis, I propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In Contents List of Figures xi List of Tables xiii LIST OF TABLES 3.6 Performance comparison. 1-CPU: Intel DuoCore, 2.7GHz, optimized C code, 2-V6: neuFlow on Xilinx Virtex 6 FPGA-on board power and GOPs measurements; 3-IBM: neuFlow on IBM 45nm process: simulated results, the design was fully placed and routed; 4-mGPU/GPU: two GPU implementations, a low power GT335m and a high-end GTX480.. 85 xiv

Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos

HAL (Le Centre pour la Communication Scientifique Directe), Oct 1, 2014

HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific re... more HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L'archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d'enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

IFP Energies nouvelles 92500 Rueil Malmaison, France

Indoor semantic segmentation using depth information: 1st International Conference on Learning Representations, ICLR 2013

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area ... more This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. We obtain state-of-the-art on the NYU-v2 depth dataset with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in videos sequences that could be processed in real-time using appropriate hardware such as an FPGA.

NeuFlow: Dataflow vision processing system-on-a-chip

This paper presents neuFlow system-on-a-chip-a neuromorphic vision system-on-a-chip implemented i... more This paper presents neuFlow system-on-a-chip-a neuromorphic vision system-on-a-chip implemented in the IBM 45 nm SOI process. The neuFlow processor was designed to accelerate neural networks and other complex vision algorithms based on large numbers of convolutions and matrix-to-matrix operations. Post-layout characterization shows that the system delivers up to 160 GOPS with an average power consumption of 570 mW. The power-efficiency and portability of this system is ideal for embedded vision-based devices, such as driver assistance, and robotic vision.

An FPGA-based stream processor for embedded real-time vision with Convolutional Networks

Many recent visual recognition systems can be seen as being composed of multiple layers of convol... more Many recent visual recognition systems can be seen as being composed of multiple layers of convolutional filter banks, interspersed with various types of non-linearities. This includes Convolutional Networks, HMAX-type architectures, as well as systems based on dense SIFT features or Histogram of Gradients. This paper describes a highlycompact and low power embedded system that can run such vision systems at very high speed. A custom board built around a Xilinx Virtex-4 FPGA was built and tested. It measures 70 × 80 mm, and the complete system-FPGA, camera, memory chips, flash-consumes 15 watts in peak, and is capable of more than 4 × 10 9 multiply-accumulate operations per second in real vision application. This enables real-time implementations of object detection, object recognition, and vision-based navigation algorithms in small-size robots, micro-UAVs, and hand-held devices. Real-time face detection is demonstrated, with speeds of 10 frames per second at VGA resolution.

Large-Scale FPGA-Based Convolutional Networks

Cambridge University Press eBooks, Dec 30, 2011

Other models like HMAX-type models (Serre et al., 2005; Mutch and Lowe, 2006) and convolutional n... more Other models like HMAX-type models (Serre et al., 2005; Mutch and Lowe, 2006) and convolutional networks use two more layers of successive feature extractors. Different training algorithms have been used for learning the parameters of convolutional networks. In LeCun et al. (1998b) and Huang and LeCun (2006), pure supervised learning is used to update the parameters. However, recent works have focused on training with an auxiliary task (

Hardware accelerated visual attention algorithm

Hardware accelerated convolutional neural networks for synthetic vision systems

In this paper we present a scalable hardware architecture to implement large-scale convolutional ... more In this paper we present a scalable hardware architecture to implement large-scale convolutional neural networks and state-of-the-art multi-layered artificial vision systems. This system is fully digital and is a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images. We present a performance comparison between a software, FPGA and ASIC implementation that shows a speed up in custom hardware implementations.

Torch7: A Matlab-like Environment for Machine Learning

Neural Information Processing Systems, 2011

Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. ... more Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface. 1 Torch7 Overview With Torch7, we aim at providing a framework with three main advantages: (1) it should ease the development of numerical algorithms, (2) it should be easily extended (including the use of other libraries), and (3) it should be fast. We found that a scripting (interpreted) language with a good C API appears as a convenient solution to "satisfy" the constraint (2). First, a high-level language makes the process of developing a program simpler and more understandable than a low-level language. Second, if the programming language is interpreted, it becomes also easier to quickly try various ideas in an interactive manner. And finally, assuming a good C API, the scripting language becomes the "glue" between heterogeneous libraries: different structures of the same concept (coming from different libraries) can be hidden behind a unique structure in the scripting language, while keeping all the functionalities coming from all the different libraries. Among existing scripting languages 1 finding the ones that satisfy condition (3) severely restricted our choice. We chose Lua, the fastest interpreted language (with also the fastest Just In Time (JIT) compiler 2) we could find. Lua as also the advantage to have been designed to be easily embedded in a C application, and provides a great C API, based on a virtual stack to pass values to and from C. This unifies the interface to C/C++ and makes library wrapping trivial. Lua is intended to be used as a powerful, lightweight scripting language for any program that needs one. Lua is implemented as a library, written in clean C (that is, in the common subset of ANSI C and C++). Quoting Lua webpage 3 ,

Tracking with deep neural networks

Deep learning methods have become the dominant artificial f b d d l f vision system for object de... more

Causal graph-based video segmentation

arXiv (Cornell University), Jan 8, 2013

Numerous approaches in image processing and computer vision are making use of super-pixels as a p... more Numerous approaches in image processing and computer vision are making use of super-pixels as a preprocessing step. Among the different methods producing such over-segmentation of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time. The algorithm may be trivially extended to video segmentation by considering a video as a 3D volume, however, this can not be the case for causal segmentation, when subsequent frames are unknown. We propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, filling the need for causal and real time applications.

Implementing Neural Networks Efficiently

Lecture Notes in Computer Science, 2012

Neural networks and machine learning algorithms in general require a flexible environment where n... more Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with best possible computational performance. To that end, we provide a new framework called Torch7, that is especially suited to achieve both of these competing goals. Torch7 is a versatile numeric computing framework and machine learning library that extends a very lightweight and powerful programming language Lua. Its goal is to provide a flexible environment to design, train and deploy learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can also easily be interfaced to third-party software thanks to Lua's light C interface.

CNP: An FPGA-based processor for Convolutional Networks

Training Data Subset Search With Ensemble Active Learning

IEEE Transactions on Intelligent Transportation Systems, Sep 1, 2022

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's optimization. Modifying the training distribution in a way that excludes such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning (AL) methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), as well as an internal object detection benchmark for prototyping perception models for autonomous driving. Unlike existing studies, our experiments on object detection are at the scale required for production-ready autonomous driving systems. We provide insights on the impact of different initialization schemes, acquisition functions and ensemble configurations at this scale. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.

Inside NVIDIA’s {AI} Infrastructure for Self-driving Cars

Training Data Distribution Search with Ensemble Active Learning

arXiv (Cornell University), Sep 25, 2019

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's optimization. Modifying the training distribution in a way that excludes such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), analyzing the impact of initialization schemes, acquisition functions and ensemble configurations. We demonstrate that data subsets identified with a lightweight ResNet-18 ensemble remain effective when used to train deep models like ResNet-101 and DenseNet-121. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.

Active Learning for Deep Object Detection via Probabilistic Modeling

arXiv (Cornell University), Mar 30, 2021

Active learning aims to reduce labeling costs by selecting only the most informative samples on a... more Active learning aims to reduce labeling costs by selecting only the most informative samples on a dataset. Few existing works have addressed active learning for object detection. Most of these methods are based on multiple models or are straightforward extensions of classification methods, hence estimate an image's informativeness using only the classification head. In this paper, we propose a novel deep active learning approach for object detection. Our approach relies on mixture density networks that estimate a probabilistic distribution for each localization and classification head's output. We explicitly estimate the aleatoric and epistemic uncertainty in a single forward pass of a single model. Our method uses a scoring function that aggregates these two types of uncertainties for both heads to obtain every image's informativeness score. We demonstrate the efficacy of our approach in PASCAL VOC and MS-COCO datasets. Our approach outperforms single-model based methods and performs on par with multi-model based methods at a fraction of the computing cost. Code is available at https://github.com/NVlabs/AL-MDN. Deep neural network Localization head Classification head Mixture density networks Unlabeled pool Scoring function Localization uncertainty Classification uncertainty … Informativeness scores Lab trainin Annotate top-data points Annotator Deep neural network Localization head Classification head Mixture density networks Unlabeled pool Scoring function Localization uncertainty Classification uncertainty … Informativeness scores Labeled training set Annotator

Less is More: An Exploration of Data Redundancy with Active Dataset Subsampling

arXiv (Cornell University), May 29, 2019

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size ... more Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's performance. If there is a large number of such samples, subsampling the training dataset in a way that removes them could provide an effective solution to both improve performance and reduce training time. In this paper, we propose an approach called Active Dataset Subsampling (ADS), to identify favorable subsets within a dataset for training using ensemble based uncertainty estimation. When applied to three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet) we find that there are low uncertainty subsets, which can be as large as 50% of the full dataset, that negatively impact performance. These subsets are identified and removed with ADS. We demonstrate that datasets obtained using ADS with a lightweight ResNet-18 ensemble remain effective when used to train deeper models like ResNet-101. Our results provide strong empirical evidence that using all the available data for training can hurt performance on large scale vision tasks.

Deep Active Learning for Object Detection with Mixture Density Networks

Towards real-time image understanding with convolutional networks

One of the open questions of artificial computer vision is how to produce good internal represent... more One of the open questions of artificial computer vision is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter? More interestingly, how could an artificial vision system learn appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world? Another related question is that of computational tractability, and more precisely that of computational efficiency. Given a good visual representation, how efficiently can it be trained, and used to encode new sensorial data. Efficiency has several dimensions: power requirements, processing speed, and memory usage. In this thesis I present three new contributions to the field of computer vision: (1) a multiscale deep convolutional network architecture to easily capture long-distance relationships between input variables in image data, (2) a tree-based algorithm to efficiently explore multiple segmentation candidates, to produce maximally confident semantic segmentations of images, (3) a custom dataflow computer architecture optimized for the computation of convolutional networks, and similarly dense image processing models. All three contributions were produced with the common goal of getting us closer to real-time image understanding. Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. In the first part of this thesis, I propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In Contents List of Figures xi List of Tables xiii LIST OF TABLES 3.6 Performance comparison. 1-CPU: Intel DuoCore, 2.7GHz, optimized C code, 2-V6: neuFlow on Xilinx Virtex 6 FPGA-on board power and GOPs measurements; 3-IBM: neuFlow on IBM 45nm process: simulated results, the design was fully placed and routed; 4-mGPU/GPU: two GPU implementations, a low power GT335m and a high-end GTX480.. 85 xiv

Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos

HAL (Le Centre pour la Communication Scientifique Directe), Oct 1, 2014

HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific re... more HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L'archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d'enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

IFP Energies nouvelles 92500 Rueil Malmaison, France