Papers by Jean-Marc Odobez
Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing
This paper investigates two types of shape representations for individual Maya codical glyphs: traditional bag-of-words representations built on knowledge-driven local shape descriptors (HOOSC), and Convolutional Neural Network (CNN) based representations learned from data. For the CNN representations, we first evaluate the activations of typical CNNs pretrained on large-scale image datasets; second, we train a CNN from scratch with all the available individual segments. One of the main challenges when training CNNs is the limited amount of available data (and the related data imbalance issue). Here, we attempt to address this imbalance by introducing class weights into the loss computation during training. Another possibility is oversampling the minority-class samples during batch selection. We show that deep representations outperform the others, but that CNN training requires special care for small-scale unbalanced data, as is usually the case in the cultural heritage domain. CCS CONCEPTS • Applied computing → Arts and humanities; • Computing methodologies → Object recognition; Neural networks;
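The class-weighting idea mentioned above can be illustrated with a short sketch. This is a minimal PyTorch example assuming an inverse-frequency weighting scheme; it is not the authors' code, and the class counts are hypothetical.

```python
# Minimal sketch (not the authors' code): class-weighted cross-entropy for an
# imbalanced glyph classification problem, assuming inverse-frequency weights.
import torch
import torch.nn as nn

def class_weights_from_counts(counts):
    """Inverse-frequency ('balanced') weights: n_samples / (n_classes * n_c)."""
    counts = torch.tensor(counts, dtype=torch.float32)
    return counts.sum() / (len(counts) * counts)

# Example: three glyph classes with very different sample counts.
weights = class_weights_from_counts([1200, 150, 30])
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)            # batch of 8 predictions over 3 classes
targets = torch.randint(0, 3, (8,))   # ground-truth class indices
loss = criterion(logits, targets)     # minority classes contribute more per sample
```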
2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
This paper presents the MuMMER data set, a data set for human-robot interaction scenarios that is available for research purposes. It comprises 1 h 29 min of multimodal recordings of people interacting with the social robot Pepper in entertainment scenarios, such as quiz, chat, and route guidance. In the 33 clips (of 1 to 4 min each) recorded from the robot's point of view, the participants interact with the robot in an unconstrained manner. The data set exhibits interesting features and difficulties, such as people leaving the field of view, robot motion (head rotation, with the camera embedded in the head), and varying illumination conditions. The data set contains color and depth videos from a Kinect v2, an Intel D435, and the video from Pepper. All visual faces and identities in the data set were manually annotated, making the identities consistent across time and clips. The goal of the data set is to evaluate perception algorithms in multi-party human-robot interaction, in particular the re-identification part when a track is lost, as this ability is crucial for keeping the dialog history. The data set can easily be extended with other types of annotations. We also present a benchmark on this data set that should serve as a baseline for future comparison. The baseline system, IHPER (Idiap Human Perception system), is available for research and is evaluated on the MuMMER data set. We show that an identity precision and recall of ~80% and a MOTA score above 80% are obtained.
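For reference, the MOTA score reported in the benchmark is conventionally computed from per-frame error counts; the sketch below illustrates that standard definition with hypothetical counts and is not taken from the MuMMER baseline code.

```python
# Illustrative sketch of the standard MOTA definition used in such benchmarks:
# MOTA = 1 - (FN + FP + IDSW) / total number of ground-truth objects.
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / float(num_gt_objects)

# Hypothetical counts accumulated over a clip.
print(mota(false_negatives=120, false_positives=80, id_switches=5, num_gt_objects=1500))
```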
Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017
Gaze is an important non-verbal cue involved in many facets of social interactions such as communication, attentiveness, or attitudes. Nevertheless, extracting gaze directions visually and remotely usually suffers from large errors because of low-resolution images, inaccurate eye cropping, or large eye shape variations across the population, amongst others. This paper hypothesizes that these challenges can be addressed by exploiting multimodal social cues for gaze model adaptation on top of a head-pose-independent 3D gaze estimation framework. First, a robust eye cropping refinement is achieved by combining a semantic face model with eye landmark detections, and we investigate whether temporal smoothing can overcome the limitations of instantaneous refinement. Second, to study whether social interaction conventions can be used as priors for adaptation, we exploit the speaking status and head pose constraints to derive soft gaze labels and infer person-specific gaze bias using robust statistics. Experimental results on gaze coding in natural interactions from two different settings demonstrate that the two steps of our gaze adaptation method contribute to reducing gaze errors by a large margin over the baseline and can be generalized to several identities in challenging scenarios. CCS CONCEPTS • Computing methodologies → Tracking; Activity recognition and understanding;
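As a rough illustration of the second step, the sketch below estimates a person-specific gaze bias with a robust statistic (the median) from frames where social cues provide soft gaze labels; the formulation and variable names are assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed, not the paper's implementation): infer a
# person-specific gaze bias robustly from frames where social cues provide
# soft gaze labels, e.g. "looking at the current speaker".
import numpy as np

def estimate_gaze_bias(predicted_gaze, soft_label_gaze):
    """Both arrays are (N, 2): yaw/pitch angles in degrees for N frames."""
    residuals = soft_label_gaze - predicted_gaze
    return np.median(residuals, axis=0)   # median is robust to wrong soft labels

def adapt(predicted_gaze, bias):
    return predicted_gaze + bias          # apply the per-subject correction
```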
ACM Symposium on Eye Tracking Research and Applications, 2020
Gaze estimation allows robots to better understand users and thus to more precisely meet their needs. In this paper, we are interested in gaze sensing for analyzing collaborative tasks and manipulation behaviors in human-robot interaction (HRI), which differs from screen gazing and other communicative HRI settings. Our goal is to study the accuracy that remote vision-based gaze estimators can provide, as they are a promising alternative to current accurate but intrusive wearable sensors. In this view, our contributions are: 1) we collected and make public a labeled dataset involving manipulation tasks and gazing behaviors in an HRI context; 2) we evaluate the performance of a state-of-the-art gaze estimation system on this dataset. Our results show a low default accuracy, which is improved by calibration, but more research is needed if one wishes to distinguish gazing at one object amongst a dozen on a table. CCS CONCEPTS • Computing methodologies → Neural networks; Activity recognition and understanding; Tracking.
Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, 2019
Recognizing eye movements is important for gaze behavior understanding, for instance in human communication analysis (human-human or human-robot interactions) or for diagnosis (medical, reading impairments). In this paper, we address this task using remote RGB-D sensors to analyze people behaving in natural conditions. This is very challenging given that such sensors have a typical sampling rate of 30 Hz and provide low-resolution eye images (typically 36x60 pixels), and that natural scenarios introduce many variabilities in illumination, shadows, head pose, and dynamics. Hence, the gaze signals one can extract in these conditions have lower precision compared to dedicated IR eye trackers, rendering previous methods less appropriate for the task. To tackle these challenges, we propose a deep learning method that directly processes the eye image video streams to classify them into fixation, saccade, and blink classes, and that distinguishes irrelevant noise (illumination, low-resolution artifacts, inaccurate eye alignment, difficult eye shapes) from true eye motion signals. Experiments on natural 4-party interactions demonstrate the benefit of our approach compared to previous methods, including deep learning models applied to gaze outputs. CCS CONCEPTS • Computing methodologies → Tracking; Neural networks; Activity recognition and understanding.
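To make the approach concrete, a minimal sketch of a network that classifies short low-resolution eye-image sequences into fixation, saccade, and blink classes is given below; the architecture (a small CNN encoder followed by a GRU) is an assumption chosen for illustration, not the paper's model.

```python
# Minimal sketch (architecture is assumed, not the paper's) of a network that
# classifies eye-image sequences into fixation / saccade / blink per frame.
import torch
import torch.nn as nn

class EyeMovementNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Per-frame appearance encoder for low-resolution (36x60) eye crops.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # Temporal model over the per-frame embeddings.
        self.gru = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, frames):                  # frames: (B, T, 1, 36, 60)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, _ = self.gru(feats)
        return self.head(out)                   # per-frame class logits

logits = EyeMovementNet()(torch.randn(2, 30, 1, 36, 60))   # -> (2, 30, 3)
```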
We address the task of monocular visual head tracking in the context of applications that involve human-robot interactions, where both near-field and far-field tracking settings can occur and real-time constraints are imposed. The original contribution of this paper is a real-time multi-person tracking model that combines a priori texture and colour models for different head poses with face detectors for different face orientations. We show that such a combination improves tracker performance significantly. At the same time, the proposed model takes into account major difficulties related to real-time data processing (non-uniform observations, processing time restrictions). The model is evaluated on a set of realistic scenarios recorded on a humanoid robot that involve interactions between the robot and the participants, with robot motion, unconstrained displacement of the participants, lighting variations, etc. The algorithm runs in real time and shows significant improveme...
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
Visual Focus of Attention (VFOA) estimation in conversation is challenging, as it relies on information that is difficult to estimate (gaze) combined with scene features such as target positions and other contextual information (speaking status) that help disambiguate situations. Previous VFOA models fusing all these features are usually trained for a specific setup with a fixed number of interacting people, and must be retrained to be applied to another one, which limits their usability. To address these limitations, we propose a novel deep learning method that encodes all input features as a fixed number of 2D maps, which makes the input more naturally processed by a convolutional neural network, provides scene normalization, and allows an arbitrary number of targets to be considered. Experiments performed on two publicly available datasets demonstrate that the proposed method can be trained in a cross-dataset fashion without loss in VFOA accuracy compared to intra-dataset training.
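A rough sketch of the map-encoding idea follows: each scene feature (here, target positions) is rasterised as Gaussian blobs on a fixed-size 2D grid, so any number of targets fits into a fixed number of input channels. The map size and blob width below are assumptions, not values from the paper.

```python
# Sketch (assumptions: 64x64 maps, Gaussian blobs) of encoding target
# positions as 2D maps so a CNN can ingest an arbitrary number of targets.
import numpy as np

def position_map(xy, size=64, sigma=2.0):
    """Render a normalised (x, y) position in [0, 1]^2 as a Gaussian heat map."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = xy[0] * (size - 1), xy[1] * (size - 1)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# One channel per scene feature; any number of targets can share a map.
targets = [(0.2, 0.5), (0.7, 0.3), (0.9, 0.8)]
target_map = np.max([position_map(t) for t in targets], axis=0)
```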
2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012
In this paper, we deal with the estimation of body and head poses (i.e., orientations) in surveillance videos, and we make three main contributions. First, we address this issue as a joint model adaptation problem in a semi-supervised framework. Second, we propose to leverage the adaptation on multiple information sources (external labeled datasets, weak labels provided by the motion direction, the data manifold structure) and, in particular, on the coupling at the output level of the head and body classifiers, accounting for the restrictions in the configurations that the head and body pose can jointly take. Third, we propose a kernel formulation of this principle that can be efficiently solved using a global optimization scheme. The method is applied to body and head features computed from automatically extracted body and head location tracks. Thorough experiments on several datasets demonstrate the validity of our approach, the benefit of the coupled adaptation, and that the method performs similarly to or better than a state-of-the-art algorithm. Here, body pose refers to the upper-body orientation in the ground plane rather than the articulated spatial configuration of the human body. (This work was supported by the Integrated Project VANAHEIM (248907) of the European Union under the 7th framework program.)
Gaze estimation methods usually regress gaze directions directly from a single face or eye image. However, due to important variabilities in eye shapes and inner eye structures amongst individuals, universal models obtain limited accuracies and their outputs usually exhibit high variance as well as subject-dependent biases. Therefore, accuracy is usually increased through calibration, allowing gaze predictions for a subject to be mapped to his/her specific gaze. In this paper, we introduce a novel image differential method for gaze estimation. We propose to directly train a convolutional neural network to predict the gaze differences between two eye input images of the same subject. Then, given a set of subject-specific calibration images, we can use the inferred differences to predict the gaze direction of a novel eye sample. The assumption is that by allowing the comparison between two eye images, annoyance factors (alignment, eyelid closing, illumination perturbati...
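The inference step of such a differential scheme can be sketched as follows; this is an assumed formulation, not the paper's code, and `diff_net` is a hypothetical trained network.

```python
# Sketch (not the paper's code) of differential gaze inference: the network
# predicts gaze *differences* between a query eye image and each calibration
# image, and the query gaze is the average of (calibration gaze + difference).
import numpy as np

def predict_gaze(diff_net, query_image, calib_images, calib_gazes):
    """diff_net(a, b) is assumed to return the gaze difference gaze(a) - gaze(b)."""
    estimates = [g + diff_net(query_image, img)
                 for img, g in zip(calib_images, calib_gazes)]
    return np.mean(estimates, axis=0)
```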
We address the problem of 3D gaze estimation within a 3D environment from remote sensors, which is highly valuable for applications in human-human and human-robot interactions. Contrary to most previous works, which are limited to screen-gazing applications, we propose to leverage the depth data of RGB-D cameras to perform accurate head pose tracking, acquire head pose invariance through a 3D rectification process that renders head-pose-dependent eye images into a canonical viewpoint, and compute the line-of-sight in 3D space. To address the low-resolution issue of the eye images resulting from the use of remote sensors, we rely on the appearance-based gaze estimation paradigm, which has demonstrated robustness against this factor. In this context, we conduct a comparative study of recent appearance-based strategies within our framework, study the generalization of these methods to unseen individuals, and propose a cross-user eye image alignment technique relying on the direct reg...
Proceedings of the 17th International Conference on Mobile and Ubiquitous Multimedia, 2018
Eye gaze and facial expressions are central to face-to-face social interactions. These behavioral cues and their connections to first impressions have been widely studied in the psychology and computing literature, but typically in a single situation. Utilizing ubiquitous multimodal sensors coupled with advances in computer vision and machine learning, we investigate the connections between these behavioral cues and perceived soft skills in two diverse workplace situations (job interviews and reception desk). Pearson's correlation analysis shows a moderate connection between certain facial expressions, eye gaze cues, and perceived soft skills in the job interview (r ∈ [−.30, .30]) and desk (r ∈ [.20, .36]) situations. Results of our computational framework to infer perceived soft skills indicate a low predictive power of eye gaze, facial expressions, and their combination in both the interview (R² ∈ [0.02, 0.21]) and desk (R² ∈ [0.05, 0.15]) situations. Our work has important implications for employee training and behavioral feedback systems.
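The kind of correlation analysis described above can be reproduced with a few lines of Python; the feature and score values below are hypothetical.

```python
# Minimal sketch of a Pearson correlation between a behavioral cue and a
# perceived soft-skill score; the data are illustrative, not the study's.
from scipy.stats import pearsonr

gaze_ratio = [0.42, 0.55, 0.31, 0.60, 0.48]   # e.g. fraction of time gazing at the interlocutor
soft_skill_score = [3.8, 4.2, 3.1, 4.5, 3.9]  # perceived score from annotators

r, p_value = pearsonr(gaze_ratio, soft_skill_score)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```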
Journal on Computing and Cultural Heritage, 2018
Thanks to the digital preservation of cultural heritage materials, multimedia tools (e.g., based on automatic visual processing) considerably ease the work of scholars in the humanities and help them perform quantitative analyses of their data. In this context, this article assesses three different Convolutional Neural Network (CNN) architectures along with three learning approaches to train them for hieroglyph classification, which is a very challenging task due to the limited availability of segmented ancient Maya glyphs. More precisely, the first approach, the baseline, relies on pretrained networks as feature extractors. The second one investigates a transfer learning method by fine-tuning a pretrained network for our glyph classification task. The third approach considers directly training networks from scratch with our glyph data. The merits of three different network architectures are compared: a generic sequential model (i.e., LeNet), a sketch-specific sequential network (...
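The first two learning approaches (feature extraction versus fine-tuning) can be sketched with a torchvision backbone as below; the backbone choice and the number of glyph classes are assumptions for illustration, not the article's exact setup.

```python
# Sketch of feature extraction vs. fine-tuning of a pretrained CNN, using a
# torchvision ResNet-18 as an example backbone (an assumption, not the paper's).
import torch.nn as nn
from torchvision import models

num_glyph_classes = 150   # hypothetical number of glyph categories

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_glyph_classes)

# Option A (feature extractor): freeze the backbone, train only the new head.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

# Option B (fine-tuning): leave all parameters trainable and use a small
# learning rate so the pretrained filters are only gently adapted.
```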
Interspeech 2016, 2016
Current speech synthesizers typically lack backchannel tokens. Those synthesizers that do include backchannels typically support only a limited set of stereotypical functions. However, this does not mirror the subtleties of backchannels in spontaneous conversations. If we want to be able to build an artificial listener that can display degrees of attentiveness, we need a speech synthesizer with more fine-grained control over the prosodic realisations of its backchannels. In the current study, we used a corpus of three-party face-to-face discussions to sample backchannels produced under varying conversational dynamics. We wanted to understand i) which prosodic cues are relevant for the perception of varying degrees of attentiveness, ii) how much of a difference is necessary for people to perceive a difference in attentiveness, and iii) whether a preliminary classifier could be trained to distinguish between more and less attentive backchannel tokens.
IEEE Transactions on Multimedia, 2017
This paper focuses on the crowd-annotation of an ancient Maya glyph dataset derived from the three ancient codices that have survived to date. More precisely, non-expert annotators are asked to segment glyph-blocks into their constituent glyph entities. As a means of supervision, available glyph variants are provided to the annotators during the crowdsourcing task. Compared to object recognition in natural images or handwriting transcription tasks, designing an engaging task and dealing with crowd behavior is challenging in our case. This challenge originates from the inherent complexity of Maya writing and an incomplete understanding of the signs and semantics in the existing catalogs. We elaborate on the evolution of the crowdsourcing task design and discuss the choices for providing supervision during the task. We analyze the distributions of similarity and task difficulty scores, and the segmentation performance of the crowd. Thanks to this process, a unique dataset of over 9000 Maya glyphs from 291 categories, individually segmented from the three codices, was created and will be made publicly available. This dataset lends itself to automatic glyph classification tasks. We provide baseline methods for glyph classification using traditional shape descriptors and convolutional neural networks.
Journal on Computing and Cultural Heritage, 2016
Shape representations are critical for the visual analysis of cultural heritage materials. This article studies two types of shape representations in a bag-of-words-based pipeline to recognize Maya glyphs. The first is a knowledge-driven Histogram of Orientation Shape Context (HOOSC) representation, and the second is a data-driven representation obtained by applying an unsupervised Sparse Autoencoder (SA). In addition to the glyph data, the generalization ability of the descriptors is investigated on a larger-scale sketch dataset. The contributions of this article are four-fold: (1) the evaluation of the performance of a data-driven auto-encoder approach for shape representation; (2) a comparative study of the hand-designed HOOSC and the data-driven SA; (3) an experimental protocol to assess the effect of the different parameters of both representations; and (4) bridging humanities and computer vision/machine learning for Maya studies, specifically for the visual analysis of glyphs. From our experi...
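A generic bag-of-words pipeline over local shape descriptors (whether HOOSC or auto-encoder codes) can be sketched as below; descriptor extraction is assumed to be done beforehand, and the vocabulary size is an assumption.

```python
# Sketch of a generic bag-of-words pipeline: precomputed local shape
# descriptors are quantised against a k-means vocabulary and pooled into one
# histogram per glyph.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, vocab_size=500):
    """all_descriptors: (N, D) array of local descriptors pooled over the training set."""
    return KMeans(n_clusters=vocab_size, n_init="auto").fit(all_descriptors)

def bow_histogram(glyph_descriptors, vocabulary):
    words = vocabulary.predict(glyph_descriptors)   # assign each descriptor to a visual word
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)              # L1-normalised glyph representation
```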
2015 IEEE International Conference on Computer Vision Workshop (ICCVW), 2015
As a non-verbal communication means, head gestures play an important role in face-to-face conversation, and recognizing them is therefore of high value for social behavior analysis or Human-Robot Interaction (HRI) modelling. Among the various gestures, the head nod is the most common one and can convey agreement or emphasis. In this paper, we propose a novel nod detection approach based on a full 3D face-centered rotation model. Compared to previous approaches, we make two contributions. First, the head rotation dynamics are computed within the head coordinate system instead of the camera coordinate system, leading to pose-invariant gesture dynamics. Second, besides the rotation parameters, a feature related to the head rotation axis is proposed so that nod-like false positives due to body movements can be eliminated. Experiments on two-party and four-party conversations demonstrate the validity of the approach.
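The head-coordinate formulation can be sketched as below; this is an assumed formulation, not the paper's code. The relative rotation between consecutive head poses is expressed in the head frame, and its rotation vector yields both the angle and the rotation axis that can be used to discard nod-like body motion.

```python
# Sketch (assumed formulation) of head rotation dynamics in the head frame:
# the relative rotation between consecutive poses, expressed in head
# coordinates, is R_prev^T @ R_next; its rotation vector gives angle and axis.
import numpy as np
from scipy.spatial.transform import Rotation as R

def head_frame_rotation_velocity(R_prev, R_next):
    delta = R_prev.T @ R_next                   # relative rotation in head coordinates
    rotvec = R.from_matrix(delta).as_rotvec()   # axis * angle (radians)
    angle = np.linalg.norm(rotvec)
    axis = rotvec / angle if angle > 0 else rotvec
    return angle, axis   # nods correspond roughly to rotations about the head's lateral axis
```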
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, 2014
The EUMSSI project (Event Understanding through Multimodal Social Stream Interpretation) aims at developing technologies for aggregating data presented as unstructured information in sources of very different natures. The multimodal analytics will help organize, classify, and cluster cross-media streams by enriching their associated metadata in an interactive manner, so that the data resulting from analysing one medium helps reinforce the aggregation of information from other media, within a cross-modal semantic representation framework. Once all the available descriptive information has been collected, an interpretation component will dynamically reason over the semantic representation in order to derive implicit knowledge. Finally, the enriched information will be fed to a hybrid recommendation system, which will be at the basis of two well-motivated use cases. In this paper, we give a brief overview of EUMSSI's main goals and how we are approaching its implementation using UIMA to integrate and combine various layers of annotations coming from different sources.
Lecture Notes in Computer Science
The paper presents an evaluation of both head pose and visual focus of attention (VFOA) estimation algorithms in a meeting room environment. Head orientation is estimated using a Rao-Blackwellized mixed-state particle filter to achieve joint head localization and pose estimation. The output of this tracker is exploited in a Hidden Markov Model (HMM) to estimate people's VFOA. Contrary to previous studies on the topic, in our setup the potential VFOA of people is not restricted to the other meeting participants only, but includes environmental targets (table, slide screen), which renders the task more difficult due to more ambiguity between VFOA target directions. Relying on a corpus of 8 meetings of 8 minutes on average, featuring 4 persons involved in the discussion of statements projected on a slide screen, and for which head orientation ground truth was obtained using magnetic sensor devices, we thoroughly assess the performance of the above algorithms, demonstrating the validity of our approaches and pointing out further research directions.
IEEE Signal Processing Magazine, 2015
We present an integrated framework for multimedia access and analysis of ancient Maya epigraphic resources, developed as an interdisciplinary effort involving epigraphers and computer scientists. Our work includes several contributions: the definition of consistent conventions to generate high-quality representations of Maya hieroglyphs from the three most valuable ancient codices, currently residing in European museums and institutions; a digital repository system for glyph annotation and management; and automatic glyph retrieval and classification methods. We study the combination of statistical Maya language models and shape representations within a hieroglyph retrieval system, the impact of applying language models extracted from different hieroglyphic resources to various data types, and the effect of shape representation choices on glyph classification. A novel Maya hieroglyph dataset is contributed, which can be used for shape analysis benchmarks and also to study the ancient Maya writing system.
2014 IEEE International Conference on Image Processing (ICIP), 2014
In this paper, we investigate the influence of music on human walking behaviors in a public setting monitored by surveillance cameras. To this end, we propose a novel algorithm to characterize the frequency and phase of the walk. It relies on a tracking-by-detection framework for humans, along with a robust fitting of the head bobbing motion. Preliminary experiments conducted on more than 100 tracks show that an accuracy greater than 85% for foot strike estimation can be achieved, suggesting that large-scale analysis is within reach for finer studies of the relationship between music and walking behavior.
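As a rough illustration of how a walking frequency could be read off the head-bobbing signal, the snippet below picks the dominant FFT peak of the detrended vertical head trajectory within a plausible step-frequency band; the sampling rate and band limits are assumptions, and this is not the paper's algorithm.

```python
# Sketch (not the paper's algorithm): estimate walking frequency from the
# vertical head-bobbing signal of a track via the dominant FFT peak.
import numpy as np

def walking_frequency(head_y, fps=25.0):
    """head_y: vertical head position per frame, detrended before the FFT."""
    y = head_y - np.mean(head_y)
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fps)
    band = (freqs > 0.5) & (freqs < 4.0)   # plausible step-frequency band in Hz
    return freqs[band][np.argmax(spectrum[band])]
```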