Papers by Louis Chevallier
Several works have proposed to learn a two-path neural network that maps images and texts, respectively, to the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space in any input image, delivering state-of-the-art results on the visual grounding of phrases.
ACM Transactions on Graphics, Jul 19, 2021
our method to capture and control the full reflectance field of the person in the image. Most editing approaches rely on supervised learning using training data captured with setups such as light and camera stages. Such datasets are expensive to acquire, not readily available, and do not capture all the rich variations of in-the-wild portrait images. In addition, most supervised approaches only focus on relighting and do not allow camera viewpoint editing. Thus, they only capture and control a subset of the reflectance field. Recently, portrait editing has been demonstrated by operating in the generative model space of StyleGAN. While such approaches do not require direct supervision, there is a significant loss of quality when compared to the supervised approaches. In this paper, we present a method which learns from limited supervised training data. The training images only include people in a fixed neutral expression with eyes closed, without much hair or background variation. Each person is captured under 150 one-light-at-a-time conditions and under 8 camera poses. Instead of training directly in the image space, we design a supervised problem which learns transformations in the latent space of StyleGAN. This combines the best of supervised learning and generative adversarial modeling. We show that the StyleGAN prior allows for generalisation to different expressions, hairstyles and backgrounds. This produces high-quality photorealistic results for in-the-wild images and significantly outperforms existing methods. Our approach can edit the illumination and pose simultaneously, and runs at interactive rates.
HAL (Le Centre pour la Communication Scientifique Directe), Jun 26, 2018
In this paper, we propose a deep neural network that learns an alignment between images and their textual descriptions. Our architecture is based on a two-branch network: a visual branch, which benefits from recent pooling mechanisms, and a branch that encodes textual information. The whole network is trained end to end under the supervision of (image, caption) pairs, yielding a semantic representation that can be exploited in various contexts. Our system obtains state-of-the-art results on an important cross-modal image-text retrieval task. We also show its ability to locate concepts of the semantic space in images, which makes it possible to ground phrases on image regions.
The present invention relates to a method for navigating documents represented by identifiers displayed on a navigation menu. Each document is associated with a plurality of numerical values characterizing it according to a plurality of criteria. Each document identifier is placed at a position that depends on part of the parameter values associated with that document. The set of documents is divided into a determined number of zones. The sums of the values associated with all the documents of a zone and corresponding to a certain number of criteria are nearly equal for each zone. The outline of the zones is displayed so that a zone can be selected. Selecting a zone triggers the full-screen display of all the identifiers of the selected zone. The invention also relates to a display apparatus capable of executing the navigation method.
Several tasks in machine learning are evaluated using non-differentiable metrics such as mean average precision or Spearman correlation. However, their non-differentiability prevents their use as objective functions in a learning framework. Surrogate and relaxation methods exist but tend to be specific to a given metric. In the present work, we introduce a new method to learn approximations of such non-differentiable objective functions. Our approach is based on a deep architecture that approximates the sorting of arbitrary sets of scores. It is trained virtually for free using synthetic data. This sorting deep (SoDeep) net can then be combined in a plug-and-play manner with existing deep architectures. We demonstrate the interest of our approach in three different tasks that require ranking: cross-modal text-image retrieval, multilabel image classification and visual memorability ranking. Our approach yields very competitive results on these three tasks, which validates the merit and the flexibility of SoDeep as a proxy for the sorting operation in ranking-based losses.
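To make the non-differentiability mentioned in this abstract concrete, here is a minimal sketch in plain Python of the Spearman correlation computed from hard ranks. It is not the SoDeep sorter itself. The helper names `ranks` and `spearman` are hypothetical, and the closed-form formula used assumes tie-free inputs.

```python
def ranks(scores):
    """Return the 0-based rank of each score (0 = smallest).

    Hypothetical helper: ties are broken by position, so it is only
    meaningful for tie-free inputs.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(predicted, target):
    """Spearman correlation for tie-free inputs: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(predicted)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(predicted), ranks(target)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


target = [0.1, 0.4, 0.2, 0.9]
scores = [1.0, 3.0, 2.0, 5.0]
# A small perturbation of one score leaves every rank unchanged, so the
# metric is piecewise constant and its gradient is zero almost everywhere.
nudged = [1.0, 3.0, 2.0 + 1e-3, 5.0]

print(spearman(scores, target))
print(ranks(nudged) == ranks(scores))
```

Because the metric only changes when two scores swap order, gradient-based training cannot use it directly, which is the motivation the abstract gives for learning a differentiable approximation of sorting.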
HAL (Le Centre pour la Communication Scientifique Directe), Oct 30, 2013
HAL (Le Centre pour la Communication Scientifique Directe), Mar 16, 2016
Understanding human emotion when perceiving audiovisual content is an exciting and important research avenue. Thus, there have recently been emerging attempts to predict the emotion elicited by video clips or movies. While most existing approaches either focus on a single modality, i.e., only audio or visual data is exploited, or build on a multimodal scheme with late fusion, we propose a multimodal framework with an early fusion scheme and target an emotion classification task. Our proposed mechanism has the advantage of handling (1) the variation in video length, (2) the imbalance of audio and visual feature sizes, and (3) the middle-level fusion of audio and visual information such that a higher-level feature representation can be learned jointly from the two modalities for classification. We evaluate the performance of the proposed approach on an international benchmark, i.e., the MediaEval 2015 Affective Impact of Movies task, and show that it outperforms most state-of-the-art systems on arousal accuracy while using a much smaller feature size.
Which parts or objects are interesting in a piece of content? In this paper we first propose three computational models to automatically predict interestingness rankings of areas/objects inside a 2D picture. We base our modeling on previous experimental findings to ensure reliability of the prediction when compared to the human assessment of interestingness. Our first two models are based on low-level features, extracted from image regions, which have been reported as useful in the human interest process. A baseline model is built by estimating a linear regression from a small dataset of 49 images. The second model estimates a rewarding term based on additional experimental observations. By adding image semantics, we then construct a last model, which more generally benefits from a better understanding of the content. It also integrates notions such as unusualness or the presence of human beings, which have proven to play key roles in the interestingness process. Finally, targeting VR applications, we extend our models to immersive content, both images and videos, and propose an innovative application to guide the viewer in his/her navigation based on intuitive visual or audio cues.
Computer Graphics Forum, May 1, 2021
Figure 1: Our method takes as input an unconstrained monocular face image and estimates face attributes: 3D pose, geometry, diffuse, specular, roughness and illumination (left). The estimation is self-shadow aware and handles varied illumination conditions. We show several resulting style transfer applications: albedo, illumination and texture transfers from and into face portrait images (right).
Figure 1. Given a single image, our method achieves appealing 3D face reconstruction and estimates a dense detailed face geometry, spatially varying face reflectance (diffuse and specular albedos) and high-frequency scene illumination.
The Digital Production Symposium, Aug 11, 2020
Lecture Notes in Computer Science, 2003
This paper discusses the benefits of both indexing and classification techniques combined with natural user interfaces for building consumer browsing tools. It focuses on text indexing and classification techniques used for user profiling.
ACM Transactions on Graphics, Aug 1, 2021
arXiv (Cornell University), Oct 3, 2019
This paper focuses on so-called weighted variants of nonnegative matrix factorization (NMF) and more generally nonnegative tensor factorization (NTF) approximations. We consider multiplicative update (MU) rules to optimize these approximations, and we prove that under certain conditions the results on monotonicity of MU rules for NMF generalize to both the NTF and the weighted NTF (WNTF) cases.
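As an illustration of the kind of update this abstract refers to, the following is a minimal sketch of multiplicative updates for weighted NMF (the matrix case only). It assumes a weighted squared Euclidean cost, sum over i, j of M[i][j] * (V[i][j] - (WH)[i][j])^2, which may differ from the cost functions treated in the paper. All function names, the matrix sizes, and the small `eps` guard are hypothetical choices made for this example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hadamard(a, b):
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def weighted_loss(v, m, w, h):
    """Weighted squared Euclidean cost (the cost assumed in this sketch)."""
    wh = matmul(w, h)
    return sum(m[i][j] * (v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def mu_step(v, m, w, h, eps=1e-12):
    """One pass of multiplicative updates; eps only guards against division by zero."""
    mv = hadamard(m, v)
    # H <- H * (W^T (M.V)) / (W^T (M.(WH)))
    wt = transpose(w)
    num = matmul(wt, mv)
    den = matmul(wt, hadamard(m, matmul(w, h)))
    h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(len(h[0]))]
         for k in range(len(h))]
    # W <- W * ((M.V) H^T) / ((M.(WH)) H^T), using the freshly updated H
    ht = transpose(h)
    num = matmul(mv, ht)
    den = matmul(hadamard(m, matmul(w, h)), ht)
    w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(len(w[0]))]
         for i in range(len(w))]
    return w, h


def run(iterations=50, seed=0):
    """Factorize a small random nonnegative matrix and record the cost per iteration."""
    rng = random.Random(seed)
    rows, cols, rank = 6, 5, 2
    v = [[rng.random() for _ in range(cols)] for _ in range(rows)]
    m = [[rng.uniform(0.5, 2.0) for _ in range(cols)] for _ in range(rows)]  # weights
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(rows)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rank)]
    losses = [weighted_loss(v, m, w, h)]
    for _ in range(iterations):
        w, h = mu_step(v, m, w, h)
        losses.append(weighted_loss(v, m, w, h))
    return losses


losses = run()
print(losses[0], losses[-1])
```

Under this assumed cost, the recorded values should not increase from one iteration to the next (up to floating-point error), which is the monotonicity property the abstract discusses.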
HAL (Le Centre pour la Communication Scientifique Directe), Feb 12, 2013
The present invention relates to a method for creating a sound series of photographs. A sound sequence is generated by a photographic device and is audible to the user of the device. At various times during the sound sequence, the user takes pictures and the camera records the time at which each is taken. A table is created associating each photograph with the time in the sequence at which it was taken. The end of the sound sequence terminates the series of pictures. The sound sequence, the table and the data of the different photographs are assembled to form a sound series of photographs. During playback, the sound sequence is reproduced while each photograph is displayed at the time associated with it in the table. According to an improvement, the photographs are assembled to form a single panoramic image, a portion of which is then displayed by an animation. The invention also relates to a camera for creating the sound series of photographs, and to a device designed to reproduce it.
The invention concerns a method for processing audio-visual broadcasts to display a summary thereof. The method comprises a prior step of recording an audio-visual broadcast, a step of searching the recorded broadcasts for slow-motion sequences, and a step of displaying the sequences found. User controls make it possible to browse among the sequences found. The invention also concerns a receiver for audio-visual broadcasts provided with a storage unit for storing the slow-motion sequences of an audio-visual broadcast. The apparatus also includes controls enabling the user to display the stored sequences.