Papers by Jenny Benois-Pineau
Multimedia Tools and Applications, Mar 11, 2024
Recognition of objects of a given category in visual content is one of the key problems in computer vision and multimedia. It is strongly needed in wearable video shooting for a wide range of important applications in society. Supervised learning approaches have proved to be the most efficient for this task. They require available ground truth for training models. This is specifically true for Deep Convolutional Networks, but also holds for other popular models such as SVMs on visual signatures. Annotating ground truth by drawing bounding boxes (BB) is a very tedious task requiring considerable human resources. Research in the prediction of visual attention in images and videos has attained maturity, specifically concerning bottom-up visual attention modeling. Hence, instead of annotating the ground truth manually with BBs, we propose to use automatically predicted salient areas as object locators for annotation. Such saliency prediction is not perfect, however. Hence, active contour models on saliency maps are used to isolate the most prominent areas covering the objects. The approach is tested in the framework of a well-studied supervised learning model, an SVM with psycho-visually weighted Bag-of-Words. The egocentric GTEA dataset was used in the experiment. The difference in mAP (mean average precision) is less than 10 percent, while the mean annotation time is 36% lower.
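A minimal sketch of the core idea above: turning a predicted saliency map into a bounding-box annotation. The paper isolates the salient area with active contours; the simple thresholding rule here is an illustrative stand-in, and the threshold fraction is an assumption.

```python
import numpy as np

def saliency_to_bbox(saliency, frac=0.5):
    """Return (x_min, y_min, x_max, y_max) covering all pixels whose
    saliency exceeds `frac` of the map's maximum. A simplified stand-in
    for the active-contour step described in the paper."""
    mask = saliency >= frac * saliency.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy saliency map: a bright 4x4 blob inside a 10x10 frame.
sal = np.zeros((10, 10))
sal[3:7, 2:6] = 1.0
print(saliency_to_bbox(sal))  # -> (2, 3, 5, 6)
```

In the annotation pipeline, such a box would replace the manually drawn BB for training the detector.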
Attention models in deep learning algorithms have gained popularity in recent years. In this work, we propose an attention mechanism based on visual saliency maps injected into a Deep Neural Network (DNN) to enhance regions in feature maps during forward-backward propagation in training, and forward propagation only in testing. The key idea is to spatially capture features associated with prominent regions in images and propagate them to deeper layers. During training, we take as backbone first the well-known AlexNet architecture and then the ResNet architecture to solve the task of identifying buildings of Mexican architecture. Our model, equipped with the "external" visual saliency-based attention mechanism, outperforms models armed with squeeze-and-excitation units and double-attention blocks.
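A sketch of the "external" attention idea, assuming the simplest injection: multiplying a convolutional feature tensor by a normalized saliency map so that salient locations dominate deeper layers. The exact layer placement and normalization in the paper may differ.

```python
import numpy as np

def saliency_attention(features, saliency):
    """Re-weight a (C, H, W) feature tensor by an (H, W) saliency map
    scaled to [0, 1], so salient regions are emphasized before being
    passed to deeper layers."""
    s = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-8)
    return features * s[None, :, :]

rng = np.random.default_rng(0)
feat = rng.random((8, 4, 4))              # e.g. one conv feature map
sal = np.zeros((4, 4)); sal[1:3, 1:3] = 1.0   # central region salient
out = saliency_attention(feat, sal)           # non-salient cells zeroed
```

Because the weighting is a plain elementwise product, it is differentiable and needs no extra trainable parameters, unlike squeeze-and-excitation units.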
Lecture Notes in Computer Science, Dec 24, 2019
Laparoscopic skill training and evaluation, as well as identifying technical errors in surgical procedures, have become important aspects of Surgical Quality Assessment (SQA). Typically performed in a manual, time-consuming and effortful post-surgical process, evaluating technical skills largely involves assessing proper instrument handling as the main cause of this type of error. Therefore, when attempting to improve upon this situation using computer vision approaches, the automatic identification of instruments in laparoscopy videos is the very first step toward a semi-automatic assessment procedure. In this work we summarize existing methodologies for instrument recognition, while proposing a state-of-the-art instance segmentation approach. As a first experiment in the domain of gynecology, our approach is able to segment instruments well, but much higher precision will be required, since this early step is critical before attempting any kind of skill recognition.
Pattern Recognition, Mar 1, 2022
HAL (Le Centre pour la Communication Scientifique Directe), Jun 19, 2017
The methods of content-based visual information indexing and retrieval are penetrating Healthcare and becoming popular in Computer-Aided Diagnosis. Multimedia in medical imaging means different imaging modalities, but also multiple views of the same physiological object, such as the human brain. In this paper we propose a multi-projection fusion approach with CNNs for the diagnosis of Alzheimer's Disease. Instead of working with the whole brain volume, it fuses CNNs from each brain projection (sagittal, coronal, and axial), each ingesting a 2D+ε limited volume we have previously proposed. Three binary classification tasks are considered, separating Alzheimer's Disease (AD) patients from Mild Cognitive Impairment (MCI) patients and Normal Control subjects (NC). Two fusion methods, at the FC layer and on the single-projection CNN outputs, show better performance, up to 91%, and results competitive with the state of the art (SOA) using heavier algorithmic chains.
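The FC-layer fusion above can be sketched as concatenating the per-projection FC features and feeding them to one shared classifier. The feature width, classifier, and weights below are placeholders, not the trained parameters of the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_fc(feats, W, b):
    """Late fusion at the FC layer: concatenate the FC features of the
    sagittal, coronal and axial CNNs, then apply one linear classifier."""
    x = np.concatenate(feats)          # shape (3 * d,)
    return softmax(W @ x + b)

rng = np.random.default_rng(1)
d = 16                                      # per-projection FC width (assumed)
feats = [rng.random(d) for _ in range(3)]   # sagittal, coronal, axial
W, b = rng.random((2, 3 * d)), np.zeros(2)  # binary task, e.g. AD vs NC
p = fuse_fc(feats, W, b)                    # class probabilities
```

The alternative fusion mentioned in the abstract would instead average or vote over the three single-projection CNN outputs.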
HAL (Le Centre pour la Communication Scientifique Directe), Jun 19, 2017
The automatic description of multimedia content was mainly developed for classification tasks, retrieval systems and the massive ordering of data. Preservation of cultural heritage is a highly important field of application for these methods. Our problem is the classification of architectural styles of buildings in digital photographs of Mexican cultural heritage. Selecting relevant content in the scene for training classification models allows them to be more precise in the classification task. Here we use a saliency-driven approach to predict visual attention in images and use it to train a Convolutional Neural Network to identify the architectural style of Mexican buildings. We also present an analysis of the behavior of models trained on traditionally cropped images versus saliency maps. We show that the performance of the saliency-based CNNs is better than that of traditional training, reaching a classification rate of 97% on the validation dataset. Style identification with this technique can contribute widely to video description tasks, specifically the automatic documentation of Mexican cultural heritage.
Incorporating user perception into visual content search and understanding tasks has become one of the major trends in multimedia retrieval. We tackle the problem of object recognition guided by user perception, as indicated by the user's gaze during visual exploration, in the application domain of assistance to upper-limb amputees. Although selecting the object to be grasped represents a task-driven visual search, human gaze recordings are noisy due to several physiological factors. Hence, since gaze does not always point to the object of interest, we use video-level weak annotations indicating the object to be grasped, and propose a video-level weak loss for classification with Deep CNNs. Our results show that the method achieves notably better performance than other approaches on a complex real-life dataset recorded specifically for this study, with optimal performance for fixation times of around 400–800 ms and a minimal impact on subjects' behavior.
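One common way to realize a video-level weak loss is to max-pool per-frame class scores over time, so that a single well-fixated frame can support the video label. This aggregation choice is an assumption for illustration; the paper's exact loss may differ.

```python
import numpy as np

def weak_video_loss(frame_logits, label):
    """Weak-supervision loss: max-pool per-frame class scores over the
    video, then apply cross-entropy against the video-level label, so
    only one frame needs to fixate the labeled object."""
    video_logits = frame_logits.max(axis=0)          # (n_classes,)
    z = video_logits - video_logits.max()
    log_probs = z - np.log(np.exp(z).sum())          # log-softmax
    return -log_probs[label]

# 5 frames, 3 object classes; only frame 2 fixates the true object (class 1),
# the rest are noisy saccades toward distractors.
logits = np.full((5, 3), -1.0)
logits[2, 1] = 4.0
assert weak_video_loss(logits, 1) < weak_video_loss(logits, 0)
```

This makes the loss robust to the noisy fixations described above: frames where gaze drifted off the object do not dominate the gradient.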
Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing – the basis for content-based search in a medical video archive – is still a great challenge due to the very particular video content. In this work, we investigate the segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument and background), we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with fairly high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
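Segmentation quality of the kind evaluated above is conventionally scored with mask Intersection-over-Union; a minimal sketch (the masks here are toy data, not from the study):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks, the standard
    score for instance segmentation quality."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

pred = np.zeros((6, 6), int); pred[1:4, 1:4] = 1   # predicted instrument mask
gt   = np.zeros((6, 6), int); gt[2:5, 2:5] = 1     # annotated mask
score = mask_iou(pred, gt)                         # 4 / 14, a partial overlap
```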
HAL (Le Centre pour la Communication Scientifique Directe), Jan 31, 2007
The paper presents the Argos evaluation campaign of video content analysis tools supported by the French Techno-Vision program. This project aims at developing the resources of a benchmark of content analysis methods and algorithms. The paper describes the types of tasks evaluated, the way the content set was produced, the metrics and tools developed for the evaluations, and the results obtained at the end of the first phase.
SpringerBriefs in Computer Science, 2020
In machine learning we distinguish various approaches between two extremes: unsupervised and supervised learning. The task of unsupervised learning consists in grouping similar data points in the description space, thus inducing a structure on it. The data model can then be expressed in terms of a space partition. Probably the most popular of such grouping algorithms in visual content mining is the K-means approach, introduced by MacQueen as early as 1967; at least, this is the approach used for the very popular Bag-of-Visual-Words model we mentioned in Chap. 1. The deep learning approach belongs to the family of supervised learning methods, designed for both classification and regression. In this very short chapter we will focus on the formal definition of the supervised learning approach, but also on the fundamentals of evaluating classification algorithms, as the evaluation metrics will be used further in the book.
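The evaluation fundamentals mentioned above reduce to a few counts from the confusion matrix; a minimal sketch for the binary case:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for a binary classifier,
    computed from confusion-matrix counts (TP, FP, FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

acc, prec, rec = classification_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(acc, prec, rec)  # -> 0.5 0.5 0.5
```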
In this paper, we develop a new semi-automated segmentation method to cancel the chaotic blood flow signal within the left ventricle (LV) in cardiac magnetic resonance (MR) images with parallel imaging. The segmentation is performed using a deformable model driven by a new external energy based on the estimated probability density function (pdf) of the MR signal in the LV. Using the noise distribution of the data allows us both to pull the contour toward the myocardium edges and to ensure the smoothness of the curve. Since the data for each slice are acquired with the GRAPPA parallel imaging technique, the spatial segmentation is followed by a temporal propagation to improve convergence in terms of quality and speed. Experiments demonstrate that the proposed model provides better results than the standard Active Contour, which should facilitate the use of the method for clinical purposes.
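As a sketch, a pdf-driven deformable model of this kind typically minimizes an energy of the following general form; the exact functional in the paper may differ. Here $E_{int}$ is the usual smoothness term, $I(x)$ the MR signal, and $p_{in}$, $p_{out}$ the estimated pdfs of the signal inside and outside the contour $C$:

```latex
E(C) = E_{int}(C) + E_{ext}(C), \qquad
E_{ext}(C) = -\int_{\Omega_{in}(C)} \log p_{in}\bigl(I(x)\bigr)\,dx
             \; - \int_{\Omega_{out}(C)} \log p_{out}\bigl(I(x)\bigr)\,dx
```

Minimizing $E_{ext}$ pulls the contour so that the enclosed region is maximally likely under the inside-signal distribution, which is how the noise model both attracts the curve to the myocardium edges and regularizes it.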
Pattern Recognition, Apr 1, 2019
We tackle the problem of predicting a grasping action in egocentric video for the assistance of upper-limb amputees. Our work is based on paradigms of neuroscience stating that human gaze expresses intention and anticipates actions. In our scenario, human gaze fixations are recorded by a glasses-worn eye-tracker and then used to predict grasping actions. We have studied two aspects of the problem: which object from a given taxonomy will be grasped, and when is the moment to trigger the grasping action. To recognize objects, we use gaze to guide Convolutional Neural Networks (CNNs) to focus on an object-to-grasp area. However, the acquired sequence of fixations is noisy due to saccades toward distractors and visual fatigue, and gaze is not always reliably directed toward the object of interest. To deal with this challenge, we use video-level annotations indicating the object to be grasped and a weak loss in Deep CNNs. To detect the moment when a person will take an object, we take advantage of the predictive power of Long Short-Term Memory (LSTM) networks to analyze gaze and visual dynamics. Results show that our method achieves better performance than other approaches on a real-life dataset.
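The temporal part above runs an LSTM over the gaze/visual dynamics; a minimal numpy sketch of one standard LSTM step (gate ordering, feature sizes and the random weights are assumptions, not the trained model):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell with gates stacked as i, f, o, g;
    a stand-in for the network that watches gaze dynamics and signals
    when a grasp is imminent."""
    z = W @ x + U @ h + b
    n = len(h)
    i, f, o = (1 / (1 + np.exp(-z[k * n:(k + 1) * n])) for k in range(3))
    g = np.tanh(z[3 * n:])
    c = f * c + i * g          # cell state update
    h = o * np.tanh(c)         # hidden state / output
    return h, c

rng = np.random.default_rng(4)
d, n = 6, 8                    # gaze-feature size, hidden size (assumed)
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h = c = np.zeros(n)
for t in range(10):            # a 10-step gaze/visual feature sequence
    h, c = lstm_step(rng.random(d), h, c, W, U, b)
```

In practice a small classifier on the final hidden state `h` would decide whether to trigger the grasp at the current frame.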
HAL (Le Centre pour la Communication Scientifique Directe), 2004
Data visualization techniques are penetrating various technological areas. In multimedia fields such as information search and retrieval in multimedia archives, or digital media production and post-production, data visualization methodologies based on large graphs give an exciting ...
HAL (Le Centre pour la Communication Scientifique Directe), Apr 1, 2012
This collective work identifies the latest developments in the field of the automatic processing and analysis of digital color images. For researchers and students, it represents a critical state of the art on the scientific issues raised by the various steps constituting the chain of color image processing. It covers a wide range of topics related to computational color imaging, including color filtering and segmentation, color texture characterization, color invariants for object recognition, color and motion analysis, as well as color image and video indexing and retrieval.
In recent years, the preservation and diffusion of culture in digital form has been a priority for governments in different countries, such as Mexico, with the objective of preserving and spreading culture through information technologies. Nowadays, a large amount of multimedia content is produced, so more efficient and accurate systems are required to organize it. In this work, we analyze the ability of a pre-trained residual network (ResNet) to describe information through the extracted deep features, and we analyze its behavior by grouping new data into clusters with the K-means method at different levels of compression obtained with the PCA algorithm, showing that the structuring of new input data can be done with the proposed method.
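The compress-then-cluster pipeline above can be sketched end to end: PCA via SVD, then a few Lloyd iterations of K-means. The random vectors stand in for ResNet deep features; dimensions, cluster count and initialization are illustrative assumptions.

```python
import numpy as np

def pca_compress(X, k):
    """Project features onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=20):
    """A few Lloyd iterations; a stand-in for the K-means grouping step."""
    centers = X[:k].copy()                  # naive init, fine for this toy
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Stand-ins for deep features of two groups of photographs (not real data).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (20, 64)), rng.normal(5, 0.1, (20, 64))])
labels = kmeans(pca_compress(X, 8), 2)      # the two groups separate cleanly
```

Varying the `k` passed to `pca_compress` corresponds to the different compression levels studied in the paper.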
Introducing visual saliency or interestingness into content selection for image classification tasks is an intensively researched topic. It has notably been applied to feature selection in feature-based methods. However, in today's winning classifiers for visual content, such as Deep Convolutional Neural Networks, visual saliency maps have not been introduced explicitly. Pooling features in CNNs is known as a good strategy to reduce data dimensionality and computational complexity and to summarize representative features for subsequent layers. In this paper we introduce visual saliency into network pooling layers to spatially filter relevant features for deeper layers. Our experiments are conducted on a specific task: identifying Mexican architectural styles. The results are promising: the proposed approach reduces model loss and training time while keeping the same accuracy as the baseline CNN.
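One plausible reading of "saliency in pooling layers", sketched minimally: weight each activation by the saliency of its location before max pooling, so pooled values come from salient positions. Window size and the exact weighting scheme are assumptions for illustration.

```python
import numpy as np

def saliency_max_pool(features, saliency, size=2):
    """Max pooling over non-overlapping size x size windows of a
    (C, H, W) feature tensor, after weighting activations by an (H, W)
    saliency map, so non-salient locations are filtered out."""
    C, H, W = features.shape
    weighted = features * saliency[None]
    blocks = weighted.reshape(C, H // size, size, W // size, size)
    return blocks.max(axis=(2, 4))

rng = np.random.default_rng(3)
feat = rng.random((4, 8, 8))
sal = np.zeros((8, 8)); sal[:, 4:] = 1.0   # only the right half is salient
pooled = saliency_max_pool(feat, sal)      # left-half outputs are suppressed
```

Like ordinary max pooling, this adds no trainable parameters, which is consistent with the reported reduction in training time at unchanged accuracy.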