
PhotoApp: Photorealistic Appearance Editing of Head Portraits


arXiv:2103.07658v2 [cs.CV] 13 May 2021

MALLIKARJUN B R and AYUSH TEWARI, MPI for Informatics, SIC, Germany
ABDALLAH DIB, InterDigital R&I, France
TIM WEYRICH, University College London, UK
BERND BICKEL, IST Austria, Austria
HANS-PETER SEIDEL, MPI for Informatics, SIC, Germany
HANSPETER PFISTER, Harvard University, USA
WOJCIECH MATUSIK, MIT CSAIL, USA
LOUIS CHEVALLIER, InterDigital R&I, France
MOHAMED ELGHARIB and CHRISTIAN THEOBALT, MPI for Informatics, SIC, Germany

Fig. 1. We present a method for high-quality appearance editing of head portraits. Given an input image, our approach edits its appearance using a target environment map (see insets) and a target camera viewpoint. We achieve high-quality photorealistic results for in-the-wild images, capturing a wide variety of reflectance properties. Our method is trained on a light-stage dataset, using a combination of supervised learning and generative adversarial modeling, which allows for accurate editing as well as generalisation outside the dataset. Portrait images are from Shih et al. [2014] and environment maps are from [Gardner et al. 2017; Hold-Geoffroy et al. 2019].

Authors' addresses: Mallikarjun B R, [email protected]; Ayush Tewari, [email protected], MPI for Informatics, SIC, Germany; Abdallah Dib, [email protected], InterDigital R&I, France; Tim Weyrich, [email protected], University College London, UK; Bernd Bickel, [email protected], IST Austria, Austria; Hans-Peter Seidel, [email protected], MPI for Informatics, SIC, Germany; Hanspeter Pfister, [email protected], Harvard University, USA; Wojciech Matusik, [email protected], MIT CSAIL, USA; Louis Chevallier, [email protected], InterDigital R&I, France; Mohamed Elgharib, [email protected]; Christian Theobalt, [email protected], MPI for Informatics, SIC, Germany.

Photorealistic editing of head portraits is a challenging task as humans are very sensitive to inconsistencies in faces. We present an approach for high-quality intuitive editing of the camera viewpoint and scene illumination (parameterised with an environment map) in a portrait image. This requires our method to capture and control the full reflectance field of the person in the image. Most editing approaches rely on supervised learning using training data captured with setups such as light and camera stages. Such datasets are expensive to acquire, not readily available, and do not capture all the rich variations of in-the-wild portrait images. In addition, most supervised approaches only focus on relighting and do not allow camera viewpoint editing. Thus, they only capture and control a subset of the reflectance field. Recently, portrait editing has been demonstrated by operating in the generative model space of StyleGAN. While such approaches do not require direct supervision, there is a significant loss of quality when compared to the supervised approaches. In this paper, we present a method which learns from limited supervised training data. The training images only include people in a fixed neutral expression with eyes closed, without much hair or background variation. Each person is captured under 150 one-light-at-a-time conditions and under 8 camera poses. Instead of training directly in the image space, we design a supervised problem which learns transformations in the latent space of StyleGAN. This combines the best of supervised learning and generative adversarial modeling.
We show that the StyleGAN prior allows for generalisation to different expressions, hairstyles and backgrounds. This produces high-quality photorealistic results for in-the-wild images and significantly outperforms existing methods. Our approach can edit the illumination and pose simultaneously, and runs at interactive rates.

CCS Concepts: • Computing methodologies → Reflectance modeling; Image representations; Image manipulation.

Additional Key Words and Phrases: Portrait Editing, Relighting, Pose Editing, Neural Rendering

1 INTRODUCTION

Portrait photos are among the most important photographic depictions of humans and their loved ones. Even though the quality of cameras, and thus of the photographs, has improved dramatically, many situations arise in which people would like to change the scene illumination and camera pose after the image has been captured. Editing the appearance of the image after capture has applications in post-production, casual photography and virtual reality. Given a monocular portrait image and a target illumination and camera pose, we present a method for relighting the portrait and editing the camera pose in a photorealistic manner. This is a challenging task, as the appearance of the person in the image includes complex effects such as subsurface scattering and self-shadowing. Changing the camera requires reasoning about occluded surfaces. Humans are very sensitive to inconsistencies in portrait images, and a high level of photorealism is necessary for convincing editing. This requires our method to correctly reason about the interactions of the lights in the scene with the surface, and to edit them at photorealistic quality. We are interested in editing in-the-wild images with a very wide range of illumination and pose conditions. We only rely on a single image of an identity unseen during training. These constraints make the problem very challenging.

Several methods have been proposed for editing portrait appearance in the literature. One category of methods [Debevec et al. 2000; Ghosh et al. 2011; Weyrich et al. 2006] addresses this problem by explicitly modelling the reflectance of the human face [Kajiya 1986]. While these approaches provide well-defined, semantically meaningful reflectance output, they require the person to be captured under multi-view and multi-lit configurations. They also do not edit the full portrait image, just the inner face region, missing out on important portrait components such as hair and eyes. Recently, several deep learning-based methods have been proposed for appearance editing. These methods use large light-stage datasets which consist of a limited number of people illuminated by different light sources and captured from different camera viewpoints. A neural network is trained on such datasets, which enables inference from a single image. Some methods [Lattas et al. 2020; Yamaguchi et al. 2018] regress the reflectance of the face from a monocular image in the form of diffuse and specular components. Neural representations for face reflectance fields have also been explored recently [B R et al. 2020]. While these methods can work with a single image, they still only model the inner face region, missing out on important details such as hair and eyes. In contrast to the previous methods, several approaches only capture and edit a subset of the reflectance field. These approaches only allow for the editing of either scene illumination or camera pose.
Most relighting methods directly learn a mapping from the input image to its relit version using a light-stage training dataset [Nestmeyer et al. 2020; Sun et al. 2019, 2020]. The controlled setting and limited variety of such datasets limit performance when generalising to in-the-wild images. Zhou et al. [2019] attempted to break out from the complexity of capturing light-stage datasets and from their limited variations. Instead, they proposed to use a synthetic dataset of in-the-wild images, synthesised with different illuminations. Illumination is modeled using spherical harmonics. The use of synthetic data impacts the photorealism of the results. None of these approaches allows for changing the camera pose. Several methods exist for only editing the camera pose and expressions [Averbuch-Elor et al. 2017; Geng et al. 2018; Kim et al. 2018; Nagano et al. 2018; Siarohin et al. 2019; Wiles et al. 2018]. These methods are commonly trained on videos. While person-specific methods [Kim et al. 2018; Thies et al. 2019] can obtain high-quality results, methods which generalise to unseen identities [Siarohin et al. 2019; Wiles et al. 2018] are limited in terms of photorealism. In addition, none of them can edit the scene illumination.

Recently, Tewari et al. [2020b] proposed Portrait Image Embedding (PIE), an approach for editing the illumination and camera pose in portrait images by leveraging the StyleGAN generative model [Karras et al. 2019]. PIE computes a StyleGAN embedding for the input image which allows for editing of various face semantics. As StyleGAN represents a manifold of photorealistic portraits, PIE can edit the full image with high quality. However, due to the absence of labelled data, the supervision for the method is defined using a 3D reconstruction of the face. This supervision is indirect and not over the complete image, leading to results that still lack sufficient accuracy and photorealism. It uses a low-dimensional representation of the scene illumination and can thus not synthesise results with higher-frequency lights. Furthermore, PIE solves a computationally expensive optimisation problem, taking several minutes to compute the embedding.

We therefore propose a technique for high-quality intuitive editing of scene illumination and camera pose in a head portrait image. Our method combines the best of generative modeling and supervised learning approaches, and creates results of much higher quality compared to previous methods. We learn to transform the StyleGAN latent code of the input image into the latent code of the output. We perform this learning in a supervised manner by leveraging a light-stage dataset containing multiple identities shot from different viewpoints and under several illumination conditions. Learning in the StyleGAN space allows us to synthesise photorealistic results for general person identities seen under in-the-wild conditions. Our method can handle properties such as shadows and other complex appearance effects, and can synthesise full portrait images including hair, upper body and background. We inherit the high photorealism and diversity of the StyleGAN portrait manifold in our solution, which allows us to outperform methods that only use light-stage training data [Sun et al. 2019]. Our method has analogies to self-supervised discriminative methods [Jing and Tian 2020]. We show that the StyleGAN latent representation allows for generalisation even with very little training data.
Our method obtains high-quality results even when trained on just 15 identities. Our formulation does not make any prior assumptions on the underlying surface reflectance or scene illumination (other than it being distant) and rather directly predicts the appearance as a function of the target environment map and camera pose. This leads to significantly more photorealistic results compared to methods that use spherical-harmonic illumination representations [Abdal et al. 2020; Tewari et al. 2020b; Zhou et al. 2019], which are limited to only modeling low-frequency illumination conditions. Furthermore, directly supervising our method using a multi-view and multi-lit light-stage dataset allows us to produce significantly more photorealistic results than PIE [Tewari et al. 2020b]. Our method can additionally edit at a faster speed, using just a single feedforward pass, and also edit both illumination and pose simultaneously, unlike PIE. Compared to traditional relighting approaches [Sun et al. 2020; Zhou et al. 2019], we obtain higher-quality results as well as allow for changing the camera pose. In summary, we make the following contributions:

• We combine the strengths of supervised learning and generative adversarial modeling in a new way to develop a technique for high-quality editing of scene illumination and camera pose in portrait images. Both properties can be edited simultaneously.

• Our novel formulation allows for generalisation to in-the-wild images with significantly higher-quality results than related methods. It also allows for training with a limited amount of supervision.

2 RELATED WORK

In this section, we look at related works that can edit the scene parameters in a head portrait image. We refer the reader to the state-of-the-art report of Tewari et al. [2020c] for more details on neural rendering approaches.

The seminal work of Debevec et al. [2000] introduced a light-stage apparatus to capture the reflectance field of a human face, that is, its appearance under a multitude of different lighting directions. Through weighted superposition of images of the illumination conditions, their method recreates high-quality images of the face under any target illumination. By employing additional cameras and geometry reconstruction, and gathering data from the additional viewpoints, they further fit a simple bidirectional reflectance distribution function (BRDF), allowing for novel-light and novel-view renderings of the face. Their method, however, is limited to reproducing the specific face that was captured. Weyrich et al. [2006] extend this concept using a setup with a much larger number of cameras (16) and a reconstruction pipeline that extracts geometry and a detailed spatially-varying BRDF (SVBRDF) of a face. By scanning hundreds of subjects that way, they extract generalisable statistical information on appearance traits depending on age, gender and ethnicity. The generative power of the extracted quantities, however, is heavily constrained, and examples of semantic appearance editing were limited to subjects from within their face database. In our work, we revisit their original dataset using a state-of-the-art learning framework.

Another category of methods tries to infer geometry and reflectance properties from single, unconstrained images. Shu et al. [2017] and Sengupta et al. [2018] decompose the image into simple intrinsic components, that is, normals, diffuse albedo and shading.
With the assumption of Lambertian surface reflectance, these methods use spherical harmonics to model the scene illumination; however, this starkly simplified assumption ignores perceptually important reflectance properties, which leads to limited photorealism. Others infer more general surface reflectance, with fewer assumptions about incident illumination [B R et al. 2020; Lattas et al. 2020; Yamaguchi et al. 2018]. While such techniques can capture rich reflectance properties, they do not synthesise the full portrait, missing out on important components such as hair, eyes and mouth.

Recently, several methods addressed the simplified problem of only relighting a head portrait in the fixed input pose [Meka et al. 2019; Nestmeyer et al. 2020; Sun et al. 2019; Wang et al. 2020; Zhang et al. 2020; Zhou et al. 2019]. Nestmeyer et al. [2020] used a light-stage dataset to train a model that explicitly regresses a diffuse reflectance, as well as a residual component which accounts for specularity and other effects. Similarly, Wang et al. [2020] used a light-stage dataset to compute the ground truth diffuse albedo, normal, specularity and shadow images. A network is trained to regress each of these components, which are then used in another network to finally relight the portrait image. Instead of explicitly estimating the different reflectance components, methods such as Sun et al. [2019]; Zhou et al. [2019] directly regress the relit version of the portrait given the input image and target illumination. Here, the target illumination is parameterised either in the form of an environment map [Sun et al. 2019] or spherical harmonics [Zhou et al. 2019]. While Sun et al. [2019] used light-stage data to obtain their ground truth for supervised learning, Zhou et al. [2019] used a ratio image-based approach to generate synthetic training data. Recently, Zhang et al. [2020] proposed a method to remove harsh shadows from a monocular portrait image. They created synthetic data from in-the-wild images by augmenting shadows and trained a network to remove these shadows. Using a light-stage dataset, another network is trained to smooth the artifacts that could remain from the first network. While the methods of [Meka et al. 2019; Nestmeyer et al. 2020; Sun et al. 2019; Wang et al. 2020; Zhang et al. 2020; Zhou et al. 2019] can produce high-quality relighting results, they either focus on shadow removal [Zhang et al. 2020] or are limited by the spherical-harmonics illumination representation [Zhou et al. 2019]. In addition, methods trained on light-stage or synthetic datasets struggle to generalise to in-the-wild images. They are also limited to only relighting, as they cannot change the camera viewpoint.

There are several methods for editing the head pose of portrait images [Averbuch-Elor et al. 2017; Geng et al. 2018; Kim et al. 2018; Nagano et al. 2018; Siarohin et al. 2019; Wiles et al. 2018].
While Kim et al. [2018] require a training video of the examined subject, the techniques of Averbuch-Elor et al. [2017]; Geng et al. [2018]; Nagano et al. [2018]; Siarohin et al. [2019]; Wiles et al. [2018] can directly operate on a single image. However, Nagano et al. [2018] does not synthesise the hair, and the approaches of Siarohin et al. [2019]; Wiles et al. [2018] lack explicit 3D modeling and only allow for control using a driving video. The approaches of Averbuch-Elor et al. [2017]; Geng et al. [2018] rely on warping of the image guided by face mesh deformations, and are thus limited to very small edits in pose. Furthermore, these approaches cannot change the scene illumination.

Fig. 2. Our method allows for editing the scene illumination $E_t$ and camera pose $\omega_t$ in an input source image $I_s$. We learn to map the StyleGAN [Karras et al. 2020] latent code $L_s$ of the source image, estimated using pSpNet [Richardson et al. 2020], to the latent code $L_t$ of the output image. StyleGAN [Karras et al. 2020] is then used to synthesise the final output $I_t$. Our method is trained in a supervised manner using a light-stage dataset with multiple cameras and light sources. For training, we use a latent loss and a perceptual loss defined using a pretrained network $\phi$. Supervised learning in the latent space of StyleGAN allows for high-quality editing which can generalise to in-the-wild images. Portrait images are from Weyrich et al. [2006] and the environment map is from [Gardner et al. 2017; Hold-Geoffroy et al. 2019].

Recently, Tewari et al. [2020b] proposed PIE, a method which can relight, change expressions and synthesise novel views of the portrait image using a generative model. PIE is based on StyleRig [Tewari et al. 2020a], which maps the control space of a 3D morphable face model to the latent space of StyleGAN [Karras et al. 2019] in a self-supervised manner. It further imposes an identity preservation loss to ensure the source identity is maintained during editing. Even though PIE inherits the high photorealism of the StyleGAN portrait manifold, its lack of direct supervision for appearance editing limits its performance and impacts the overall photorealism. The scene illumination is parameterised using spherical harmonics, as PIE relies on a monocular 3D reconstruction approach to define its control space. Thus, it only allows for rendering under low-frequency scene illumination. In addition, PIE cannot edit the illumination and pose simultaneously, but rather one at a time. PIE solves an expensive optimisation for the image which is time-consuming, taking around 10 minutes per image on an NVIDIA V100 GPU.

In work concurrent to ours, Abdal et al. [2020] also propose a method for semantic editing of portrait images using latent space transformations of StyleGAN. They use an invertible network based on continuous normalising flows to map semantic input parameters such as head pose and scene illumination into the StyleGAN latent vectors. The input parametrisation for the illumination is spherical harmonics, like PIE, which limits its relighting capabilities. This method is also trained without explicit supervision, i.e., without images of the same person under different scene parameters. This limits the quality of the results. While there are several other approaches which demonstrate transformations of StyleGAN latent vectors for semantic manipulation [Collins et al. 2020; Härkönen et al. 2020; Shen et al. 2020; Tewari et al. 2020a], these methods focus on StyleGAN-generated images, and do not produce high-quality and high-resolution results for real existing images.

3 METHOD

Our method takes as input an in-the-wild portrait image, a target illumination and the target camera pose. The output is a portrait image of the same identity, synthesised with the target camera and lit by the target illumination. Given a light-stage dataset of multiple independent illumination sources and viewpoints, a naive approach would be to learn the transformations directly in image space. Instead, we propose to learn the mapping in the latent space of StyleGAN [Karras et al. 2020]. We show that learning using this latent representation helps in generalisation to in-the-wild images with high photorealism. We use StyleGAN2 in our implementation, referred to as StyleGAN for simplicity.
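To make this concrete, the following is a minimal PyTorch sketch of the resulting editing pipeline: a pretrained encoder maps the portrait to a W+ latent code, a small mapping network (detailed in Sec. 3.2) predicts a latent displacement from the target scene parameters, and the pretrained StyleGAN generator synthesises the edited image. The dummy encoder/generator, the function names and the zero mapping used in the usage example are illustrative placeholders under our reading of the pipeline, not the released pSpNet or StyleGAN2 code.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the pretrained, frozen networks. In the actual
# method these are pSpNet [Richardson et al. 2020] and StyleGAN2
# [Karras et al. 2020]; here they only reproduce the tensor shapes.
class DummyEncoder(nn.Module):
    def forward(self, image):                        # (B, 3, 1024, 1024)
        return torch.randn(image.shape[0], 18, 512)  # W+ latent code L_s

class DummyGenerator(nn.Module):
    def forward(self, latent):                       # (B, 18, 512)
        return torch.rand(latent.shape[0], 3, 1024, 1024)  # image I_t

def edit_portrait(image, env_map, pose, p, encoder, generator, mapping_net):
    """Edit illumination and viewpoint by transforming the StyleGAN latent.

    env_map: (B, 450) vectorised target environment map (150 RGB lights)
    pose:    (B, 3)   target camera pose as Euler angles
    p:       (B, 1)   binary flag: 1 if the target pose differs from the input
    """
    with torch.no_grad():                  # encoder and generator stay frozen
        latent_src = encoder(image)        # L_s
    delta = mapping_net(latent_src, env_map, pose, p)  # predicted displacement
    return generator(latent_src + delta)   # I_t = StyleGAN(L_t)

# Illustrative call with a zero mapping (i.e., no edit is actually applied).
img = torch.rand(1, 3, 1024, 1024)
out = edit_portrait(img, torch.rand(1, 450), torch.rand(1, 3), torch.ones(1, 1),
                    DummyEncoder(), DummyGenerator(),
                    lambda L, e, w, p: torch.zeros_like(L))
```

Only the small mapping network is trained; keeping the encoder and generator fixed is what lets the approach inherit the photorealism of the StyleGAN manifold.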
Fig. 3. Visualisation of the camera poses in the training dataset. Images are from Weyrich et al. [2006].

3.1 Dataset

We make use of a light-stage [Weyrich et al. 2006] dataset for training our solution. This dataset contains 341 identities captured with 8 different cameras placed in the frontal hemisphere of the face. The available camera poses are shown in Fig. 3. The dataset also contains 150 light sources evenly distributed on a sphere. Using this setup, each image is captured under one-light-at-a-time (OLAT) illumination. Given the 150 OLAT images of a person for a specific camera pose, we can linearly combine them using an environment map to obtain relit portrait images [Debevec et al. 2000]. We use 205 HDR environment maps from the Laval Outdoor [Hold-Geoffroy et al. 2019] and 2233 from the Laval Indoor [Gardner et al. 2017] dataset for generating naturally lit images. Camera poses for the images are estimated using the approach of Yang et al. [2019]. Out of the 341 identities, we use 300 for training and the rest for testing. We synthesise 300 transformed images for each identity with randomly selected environment maps and camera viewpoints. Our training set consists of input-ground truth pairs of the same identity along with the target pose and environment map. The camera viewpoint of the ground truth is kept identical to the input for a quarter of the training data. In the remaining data, this camera viewpoint is randomly selected. The test set includes pairs from the test identities for quantitative evaluations, as well as in-the-wild images for qualitative evaluations, see Sec. 4.
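The relit training images are thus simple weighted sums of the OLAT captures. The NumPy sketch below illustrates this step; the array shapes and the reduction of the environment map to 150 RGB weights are illustrative assumptions, not the exact data-generation code.

```python
import numpy as np

def relight_from_olat(olat_images, env_weights):
    """Relight a portrait as a linear combination of OLAT captures
    [Debevec et al. 2000].

    olat_images: (150, H, W, 3) HDR one-light-at-a-time images of one
                 subject from one camera.
    env_weights: (150, 3) RGB weight of each light source, e.g. obtained by
                 integrating the HDR environment map over the solid angle
                 covered by that source (an assumption for this sketch).
    """
    # Weighted sum over the light dimension; each light contributes its
    # image scaled by the corresponding RGB environment-map energy.
    return np.einsum('lhwc,lc->hwc', olat_images, env_weights)

# Illustrative usage with random data standing in for real captures.
olat = np.random.rand(150, 256, 256, 3).astype(np.float32)
weights = np.random.rand(150, 3).astype(np.float32)
relit = relight_from_olat(olat, weights)   # (256, 256, 3) relit portrait
```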
3.2 Network Architecture

Fig. 2 shows an overview of our method. Our approach takes as input a source image $I_s$, a target illumination $E_t$ and camera pose $\omega_t$, and a binary input $p$. The value of $p$ is set to 0 when the target pose is the same as that of the input, and to 1 when they are different. This conditioning input helps in better preservation of the input camera pose for relighting results. The ground truth image for training is denoted $\hat{I}_t$. The camera pose is parameterised using Euler angles. We represent the illumination $E_t$ as a 450-dimensional vectorised environment map, corresponding to the 150 RGB discrete light sources.

A core component of our approach is the PhotoAppNet neural network, which maps the input image to the edited output image in the latent space of StyleGAN (see Fig. 2). We first compute the latent representations of $I_s$ and $\hat{I}_t$ as $L_s$ and $\hat{L}_t$ using the pretrained network of Richardson et al. [2020] (pSpNet in Fig. 2). The latent space used is 18 × 512 dimensional, corresponding to the W+ space of StyleGAN. The output of PhotoAppNet is a displacement to the input in the StyleGAN latent space. This is added to the input latent code to compute $L_t$, which is used by StyleGAN to generate the output image $I_t$. We only train PhotoAppNet, while pSpNet and StyleGAN are pretrained and fixed. We use an MLP-based architecture with a single hidden layer of length 512, followed by a ReLU activation. We use independent networks for each of the 18 latent vectors of length 512, corresponding to different resolutions. This is motivated by the design of the StyleGAN network, where each 512-dimensional latent code controls a different frequency of image features. The output of each independent network is the output latent code corresponding to the same resolution.

3.3 Loss Function

We use multiple loss terms to train our network:

$\mathcal{L}(I_t, L_t, \hat{I}_t, \hat{L}_t, \theta_n) = \mathcal{L}_l(L_t, \hat{L}_t, \theta_n) + \mathcal{L}_p(I_t, \hat{I}_t, \theta_n)$.  (1)

Here, $\theta_n$ denotes the network parameters of PhotoAppNet. Both terms are weighed equally. The first term is a StyleGAN latent loss defined as $\mathcal{L}_l(L_t, \hat{L}_t, \theta_n) = \lVert L_t - \hat{L}_t \rVert_2^2$. It enforces the StyleGAN latent code of the output image, $L_t$, to be close to the ground truth latent code $\hat{L}_t$. The second term is a perceptual loss defined as $\mathcal{L}_p(I_t, \hat{I}_t, \theta_n) = \lVert \phi(I_t) - \phi(\hat{I}_t) \rVert_2^2$. Here, we employ the learned perceptual similarity metric LPIPS [Zhang et al. 2018]. An $\ell_2$ loss is used to compare the AlexNet [Krizhevsky et al. 2012] features $\phi(\cdot)$ of the synthesised output and the ground truth images.

3.4 Network Training

We implement our method in PyTorch and optimise for the weights of PhotoAppNet by minimising the loss function in Eq. 1. We use the Adam solver with a learning rate of 0.0001 and default hyperparameters. As mentioned earlier, the StyleGAN encoder (pSpNet) and generator [Karras et al. 2020; Richardson et al. 2020] are pretrained and fixed during training. We optimise over our training set samples using a batch size of 1. Since in-the-wild images are very different from the light-stage data, it is difficult to assess convergence using a light-stage validation dataset. We therefore assess convergence qualitatively on an in-the-wild validation set. Our network takes around 10 hours to train on a single NVIDIA Quadro RTX 8000 GPU.
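As a concrete illustration of Sections 3.2–3.4, the PyTorch sketch below defines a mapping network with 18 independent single-hidden-layer MLPs, the combined latent and perceptual loss, and one Adam update. The way the conditioning vector is concatenated, the placeholder feature extractor standing in for the LPIPS/AlexNet features, and all variable names are assumptions made for illustration; this is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhotoAppNetSketch(nn.Module):
    """Sketch of the latent mapping network: one small MLP per W+ level.

    Each of the 18 StyleGAN latent vectors (length 512) is processed by an
    independent MLP with a single hidden layer of width 512 and a ReLU,
    conditioned on the 450-D environment map, the 3 Euler angles and the
    binary flag p. Each MLP predicts a displacement added to its input latent.
    """
    def __init__(self, n_levels=18, latent_dim=512, cond_dim=450 + 3 + 1):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim + cond_dim, 512),
                          nn.ReLU(),
                          nn.Linear(512, latent_dim))
            for _ in range(n_levels)])

    def forward(self, latent_src, env_map, pose, p):
        cond = torch.cat([env_map, pose, p], dim=-1)              # (B, 454)
        deltas = [mlp(torch.cat([latent_src[:, i], cond], dim=-1))
                  for i, mlp in enumerate(self.mlps)]
        return latent_src + torch.stack(deltas, dim=1)            # L_t

def photoapp_loss(latent_out, latent_gt, image_out, image_gt, phi):
    """Equally weighted latent loss and perceptual loss, as in Eq. 1."""
    latent_loss = ((latent_out - latent_gt) ** 2).sum()
    perceptual_loss = ((phi(image_out) - phi(image_gt)) ** 2).sum()
    return latent_loss + perceptual_loss

# One illustrative training step with random tensors standing in for data.
# phi is a crude placeholder; the paper uses LPIPS with AlexNet features.
phi = lambda img: F.avg_pool2d(img, 32)
net = PhotoAppNetSketch()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)   # batch size 1 in the paper

L_s, L_gt = torch.randn(1, 18, 512), torch.randn(1, 18, 512)
E_t, w_t, p = torch.rand(1, 450), torch.rand(1, 3), torch.ones(1, 1)
I_out, I_gt = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)

L_t = net(L_s, E_t, w_t, p)
# In training, I_out would be StyleGAN(L_t); here it is random, so only the
# latent term contributes gradients to the mapping network.
loss = photoapp_loss(L_t, L_gt, I_out, I_gt, phi)
opt.zero_grad()
loss.backward()
opt.step()
```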
3.5 Discussion

Existing image-based relighting approaches such as Sun et al. [2020]; Zhou et al. [2019] rely on much larger trainable networks with several loss functions, such as losses on the input environment map and adversarial losses. Approaches for pose editing such as Kim et al. [2018]; Siarohin et al. [2019]; Thies et al. [2019] rely on conditional generative networks trained with a combination of photometric and adversarial losses. Since we rely on a pretrained generator as our backend renderer, our training is much simpler than existing approaches. We do not need an adversarial loss, as the pretrained generator already synthesises results at a high quality. As such, our training is more stable than approaches operating in image space. In addition, the StyleGAN latent representation allows for high-quality generalisation, even when trained on a dataset with as few as 3 identities (Sec. 4.4).

Many existing methods use specialised network architectures for editing the pose, such as landmark-based warping of features [Siarohin et al. 2019] or rendering of a coarse 3D face model [Kim et al. 2018; Thies et al. 2019]. Similarly, common relighting networks are designed in a task-specific manner where the illumination is predicted at the bottleneck of the architecture [Sun et al. 2020; Zhou et al. 2019]. Our design results in a compact PhotoAppNet network that is convenient to train and does not require any sophisticated or specialised network components. In addition, our method is also faster to train than these approaches, since the task we solve is only to transform the latent representation of images, unlike end-to-end approaches which also learn to synthesise high-quality images. When trained with 15 identities, our network only takes around 6 hours to train on a single RTX 8000 GPU. In contrast, the method of Sun et al. [2020] takes around 26 hours on 4 V100 GPUs for training at the same resolution.

Fig. 4. Qualitative illumination and viewpoint editing results. The environment map of the target illumination is shown in the insets. We visualise the StyleGAN projection of the input image (second column). Our method produces photorealistic editing results even under challenging high-frequency light conditions. Portrait images are from Shih et al. [2014] (first and third row) and from Livingstone and Russo [2018] (second and fourth row). Environment maps are from [Gardner et al. 2017; Hold-Geoffroy et al. 2019].

4 RESULTS

We evaluate our technique both qualitatively and quantitatively on a large set of diverse images. The role of the different loss terms is studied in Sec. 4.2. We compare against several related techniques in Sec. 4.3 – the high-quality relighting approaches of Sun et al. [2019] and Zhou et al. [2019], as well as the recent StyleGAN-based image editing approaches of Tewari et al. [2020b] and Abdal et al. [2020] (the latter is concurrent to ours). Furthermore, we show that our method allows for learning from limited supervised training data by conducting extensive experiments in Sec. 4.4.

Fig. 5. Qualitative illumination and viewpoint editing results. In the first row, we show relighting results where the camera is fixed as in the input. The second row shows results where both illumination and camera pose are edited. The last row shows results with a moving camera under fixed scene illumination. Please note the local shading effects such as shadows, as well as view-dependent effects such as specularities in the image. Portrait images are from Shih et al. [2014] (first part) and Karras et al. [2019] (second part). Environment maps are from [Gardner et al. 2017; Hold-Geoffroy et al. 2019].

Fig. 6. Qualitative illumination and viewpoint editing results. In the first row, we show relighting results where the camera is fixed as in the input. The second row shows results where both illumination and camera pose are edited. The last row shows results with a moving camera under fixed scene illumination. Please note the local shading effects such as shadows, as well as view-dependent effects such as specularities in the image. Portrait images are from Shih et al. [2014] and environment maps are from [Gardner et al. 2017; Hold-Geoffroy et al. 2019].

Data Preparation. We evaluate our approach on portrait images captured in the wild [Karras et al. 2019; Shih et al. 2014]. All data in our work (including the training data) are cropped and preprocessed as described in Karras et al. [2019]. The images are resized to a resolution of 1024×1024. Since we need ground truth images for quantitative evaluations, we use the test portion of our light-stage dataset, composed of images of 41 identities unseen during training. We create two test sets: Set1 has the input and ground truth pairs captured from the same viewpoint, while Set2 includes pairs captured from different viewpoints.
The HDR environment maps, randomly sampled from the Laval Outdoor and Laval Indoor datasets [Gardner et al. 2017; Hold-Geoffroy et al. 2019], are used to synthesise the pairs with natural illumination conditions. Viewpoints are randomly sampled from the 8 cameras of the light-stage setup. The input and ground truth images are computed using the same environment map in Set2, for evaluating the viewpoint editing results, while the pairs in Set1 use different environment maps for relighting evaluations. Set1 includes 883 and Set2 includes 792 image pairs, after finding a common set of images that works for all the methods. For each pair, we additionally provide a reference image, which is used by related methods to estimate the target illumination and pose in the representation they work with [Abdal et al. 2020; Sun et al. 2019; Tewari et al. 2020b; Zhou et al. 2019]. In Set1, the reference image is of an identity different from the input identity. The ground truth image is directly taken as the reference image for Set2, since there can be slight pose variations between different identities for the same camera.

4.1 High-Fidelity Appearance Editing

Figs. 4, 5, and 6 show simultaneous viewpoint and illumination editing results of our method for various subjects. We also show the StyleGAN projection of the input images estimated by Richardson et al. [2020]. Our approach produces high-quality photorealistic results and synthesises the full portrait, including hair, eyes, mouth, torso and the background, while preserving the identity, expression and other properties (such as facial hair). Our method works well on people of different races. Additionally, the results show that our method can preserve a variety of reflectance properties, resulting in effects such as specularities and subsurface scattering. Please note the view-dependent effects such as specularities in the results (nose, forehead...). Our method can synthesise results even under high-frequency light conditions resulting in shadows, even though the StyleGAN network is trained on a dataset of natural images. In Figs. 5 and 6, we show more detailed editing results. As can be seen, the relighting preserves the input pose and identity. Also, our method can change the viewpoint under a fixed environment map (third row for each subject).

4.2 Ablation Study

In this section, we evaluate the importance of the different loss terms of our objective function (Eq. 1). Results are shown in Fig. 7. The target illumination and viewpoint are visualised using a reference image (second column) with the same scene parameters. Removing the latent loss leads to clear distortions of the head geometry. Only using the perceptual loss leads to results with closed-eye expressions, as our training data only consists of people captured with closed eyes. We found that the latent loss term helps in generalisation to unseen expressions. However, using only the latent loss is not sufficient for high-quality results. In such cases, the facial identity and facial hair (see row 1) are not well preserved, and the relighting is not very accurate (see rows 1, 2, 6). A combination of both terms is essential for high quality.

4.3 Comparisons to Related Methods

We compare our method with several state-of-the-art portrait editing approaches. We evaluate qualitatively on in-the-wild data, as well as quantitatively on the test set of the light-stage data. We compare with the following approaches:

• The relighting approach of Sun et al. [2019], which is a data-driven technique trained on a light-stage dataset. It can only edit the scene illumination.
• The relighting approach of Zhou et al. [2019], which is trained on synthetic data. It can also only edit the scene illumination.

• PIE [Tewari et al. 2020b], a method which computes a StyleGAN embedding used to edit the image. It can edit the head pose and scene illumination sequentially (unlike ours, which can perform the edits simultaneously). It is trained without supervised image pairs.

• StyleFlow [Abdal et al. 2020], which, like PIE, can edit images by projecting them onto the StyleGAN latent space. It is also trained without supervised image pairs. Please note that this work is concurrent to ours; we provide comparisons for completeness.

We show the relighting comparisons on in-the-wild data in Fig. 8. Here, the reference image in the second column is used to visualise the target illumination. Both the light-stage data-driven approach of Sun et al. [2019] and the synthetic data-driven approach of Zhou et al. [2019] produce noticeable artifacts. The approach of Zhou et al. [2019] only uses a single-channel illumination as input and can thus not capture the overall color tone of the illumination. The StyleGAN-based approach of Abdal et al. [2020] produces fewer artifacts; however, the quality of relighting is worse than that of other approaches, as it mostly preserves the input lighting. In addition, similar to Zhou et al. [2019], this approach cannot capture the color tone of the environment map. PIE [Tewari et al. 2020b] produces better results, but it does not capture local illumination effects like our approach (e.g., rows 5, 6, 7, 8) and can produce significant artifacts in some cases (e.g., row 8). Our approach clearly outperforms all existing methods, demonstrating the effectiveness of a combination of supervised learning and generative modeling. It can capture the global color tone as well as local effects such as shadows and specularities. It can synthesise the image under harsh lighting (e.g., rows 7, 8) and remove source-lighting-related specularities on the glasses (e.g., row 5). We also compare our method with Sun et al. [2020] on the ground truth light-stage images in Fig. 11. Our method achieves higher-quality results, closer to the ground truth.

Fig. 7. Ablative study on the loss functions. The reference images visualise the target illumination and viewpoint. Removing the latent loss results in distortion of the head geometry and lower-quality results. Removing the perceptual term leads to a loss of facial hair and identity preservation, such as beards (e.g., row 1, and rows 4, 5 in light+viewpoint). It also often produces lower-quality results (e.g., rows 1, 2, 6). Both terms are necessary for high-quality results. Images are from Shih et al. [2014] (first column) and Weyrich et al. [2006] (second column).

Tab. 1 shows the quantitative comparisons with these methods on the light-stage test set (Set1). We use the scale-invariant MSE (Si-MSE) [Zhou et al. 2019] and SSIM [Zhou Wang et al. 2004] metrics. The Si-MSE metric does not penalise global scale offsets between the ground truth and results. Our method outperforms all methods using this metric. The method of Sun et al. [2019] outperforms other methods on SSIM. Since this method uses a U-Net architecture, it can more easily copy details from the input image and maintain the pixel correspondences. However, visual results clearly show that our approach outperforms all related methods, including that of Sun et al. [2019] (see Fig. 8).
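For reference, the scale-invariant MSE discounts a single global scale between the result and the ground truth before computing the error. The NumPy sketch below shows how such a metric can be computed; the closed-form least-squares scale used here is an illustrative choice and may differ in detail from the exact protocol of Zhou et al. [2019].

```python
import numpy as np

def si_mse(pred, gt):
    """Scale-invariant MSE between a predicted and a ground-truth image.

    A single global scale is fitted to the prediction in closed form
    (least squares) before computing the MSE, so a global brightness
    offset between result and ground truth is not penalised.
    """
    pred = pred.astype(np.float64).ravel()
    gt = gt.astype(np.float64).ravel()
    alpha = np.dot(pred, gt) / max(np.dot(pred, pred), 1e-12)  # optimal scale
    return np.mean((alpha * pred - gt) ** 2)

# Example: a prediction that is a globally scaled copy of the ground truth
# scores (close to) zero.
gt = np.random.rand(256, 256, 3)
print(si_mse(0.5 * gt, gt))   # ~0.0
```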
Fig. 9 shows joint editing of the camera viewpoint and scene illumination for in-the-wild images. The target viewpoint and illumination are visualised using reference images (see second column). While PIE [Tewari et al. 2020b] can change the viewpoint, it often distorts the face in an unnatural way (e.g., rows 1, 2, 7). It also does not capture local shading effects correctly (e.g., rows 1, 2, 6) and can produce strong artifacts (e.g., row 4). PIE solves an optimisation problem to obtain the embedding for each image, which is slow, taking about 10 minutes per image. In contrast, our method is interactive, requiring only 160 ms to compute the embedding and edit it. StyleFlow [Abdal et al. 2020] can preserve the identity better than PIE, but produces less photorealistic results compared to our method. In addition, the relighting results of StyleFlow often fail to capture the input environment map. Our approach clearly outperforms both methods, both in terms of photorealism and the quality of editing. Tab. 2 quantitatively compares the joint editing of camera viewpoint and scene illumination. We use the Si-MSE and SSIM metrics and evaluate on Set2 of the light-stage test data. Our approach outperforms all methods here in both metrics.

Fig. 8. Relighting comparisons. Target illumination is visualised using reference images. Our approach clearly outperforms all existing approaches. Here, we compare our method with the approaches of Tewari et al. [2020b], Sun et al. [2019], Abdal et al. [2020] and Zhou et al. [2019]. Images are from Karras et al. [2019] (first column, rows 1, 5, 8), Shih et al. [2014] (first column, rows 2, 3, 4, 6, 7) and Weyrich et al. [2006] (second column).

Fig. 9. Comparisons to PIE [Tewari et al. 2020b] and StyleFlow [Abdal et al. 2020]. The reference images visualise the target illumination and viewpoint. Our approach produces higher-quality results and clearly outperforms these methods. Images are from Shih et al. [2014] (first column, rows 1, 2, 3, 4, 5, 7), Livingstone and Russo [2018] (first column, row 6) and Weyrich et al. [2006] (second column).

4.4 Generalisation with Limited Supervision

The combination of generative modeling and supervised learning allows us to train from very limited supervised data. We show results of training with different numbers of identities in Fig. 10. Results of PIE [Tewari et al. 2020b], StyleFlow [Abdal et al. 2020] and Sun et al. [2020] are also demonstrated.
Our relighting results outperform related methods both in terms of realism and quality of editing, even when trained with as little as 3 identities. We also consistently outperform PIE and StyleFlow when editing both viewpoint and illumination, even when trained with 30 identities. More identities during training help with better preservation of the facial identity during viewpoint editing. However, very little training data is sufficient for relighting. We also quantitatively evaluate these results in Tables 1 and 2. In both tables, our method outperforms all related approaches using the Si-MSE metric, even when trained with just 3 identities. All versions of our approach perform similarly in terms of SSIM. These evaluations show that while larger datasets lead to better results, only limited supervised data is required to outperform the state of the art. Finally, despite the limited expressivity of the training dataset (subjects in a single expression with eyes and mouth closed, see Fig. 3), our method is able to generalise to different expressions, as shown in our results (mouth and eyes open, smiling, etc.).

Fig. 10. Our method allows for training with very limited supervision. We show editing results when trained with 3, 15, 30, 150 and 300 identities. Our approach produces photorealistic results, and outperforms existing methods even with limited training data. Here, we also compare with the approaches of Sun et al. [2019], Tewari et al. [2020b] and Abdal et al. [2020]. Images are from Shih et al. [2014] (first column) and Weyrich et al. [2006] (second column).

Fig. 11. Relighting results on the light-stage dataset in comparison with Sun et al. [2019]. Our method obtains higher-quality results which are closer to the ground truth. Images are from Weyrich et al. [2006].

Table 1. Quantitative comparison with relighting methods. Our approach achieves the lowest Si-MSE numbers. While Sun et al. [2019] achieves the highest SSIM score, qualitative results show that our method significantly outperforms all existing techniques on in-the-wild images (see Fig. 8).

Method                  Si-MSE ↓             SSIM ↑
[Zhou et al. 2019]      0.0037 (σ=0.0031)    0.9197 (σ=0.0744)
[Sun et al. 2019]       0.0026 (σ=0.0024)    0.9591 (σ=0.0237)
[Tewari et al. 2020b]   0.0051 (σ=0.0036)    0.922 (σ=0.029)
[Abdal et al. 2020]     0.0082 (σ=0.0056)    0.8909 (σ=0.04)
Ours                    0.002 (σ=0.001)      0.9199 (σ=0.0297)
Ours (150 id.)          0.002 (σ=0.001)      0.9192 (σ=0.0351)
Ours (30 id.)           0.002 (σ=0.001)      0.9188 (σ=0.0300)
Ours (15 id.)           0.002 (σ=0.001)      0.9191 (σ=0.0306)
Ours (3 id.)            0.002 (σ=0.001)      0.9193 (σ=0.0293)

Table 2. Quantitative evaluation with illumination and pose editing methods using Set2 of the light-stage test set. Our approach outperforms both competing methods, as also clearly illustrated by the qualitative results (see Fig. 9).

Method                  Si-MSE ↓             SSIM ↑
[Tewari et al. 2020b]   0.0067 (σ=0.0044)    0.9005 (σ=0.0363)
[Abdal et al. 2020]     0.0104 (σ=0.0071)    0.8812 (σ=0.0469)
Ours                    0.0039 (σ=0.0029)    0.9086 (σ=0.0307)
Ours (150 id.)          0.0035 (σ=0.0020)    0.9050 (σ=0.0340)
Ours (30 id.)           0.0040 (σ=0.0033)    0.9021 (σ=0.0312)
Ours (15 id.)           0.0046 (σ=0.0027)    0.8974 (σ=0.0315)
Ours (3 id.)            0.0048 (σ=0.0031)    0.9000 (σ=0.0316)

4.5 Preserving the Input Illumination

Our method can be easily extended for editing the viewpoint while preserving the input illumination, see Fig. 12. Here, we modify the network architecture in Fig. 2 by providing another binary input similar to $p$, which is set to 0 when the target illumination is the same as the input illumination, and to 1 when they are different. This design helps in editing both viewpoint and illumination in isolation.

Fig. 12. Our method can be easily extended for editing the viewpoint while preserving the input illumination condition. Images are from Shih et al. [2014].

4.6 Supplemental Material

In the supplemental video, we show results on videos processed on a per-frame basis. We can synthesise the input video from different camera poses and under different scene illumination while preserving the expressions in the video. We also show additional results on a large number of images in the supplemental material.
5 LIMITATIONS

While we demonstrate high-quality results of our approach, several limitations exist, see Fig. 13. Our method can fail to preserve accessories such as caps and glasses in some cases. Background clutter can lead to a degradation of quality in the results. Our method struggles to preserve the facial identity under large edits for both camera pose and illumination. While we can preserve the input pose in the results, our method cannot edit the camera viewpoint without changing the illumination. In the future, we can use methods that estimate illumination from portrait images [LeGendre et al. 2020] to preserve the input illumination when editing the viewpoint. While we show several high-quality results on video sequences, slight flicker and instability remain. A temporal treatment of videos could lead to smoother results.

Fig. 13. Our method struggles in the presence of accessories such as caps and sunglasses, and background clutter. Extreme pose and illumination editing is also difficult for our method. Images are from Karras et al. [2019].

6 CONCLUSION

We presented PhotoApp, a method for editing the scene illumination and camera pose in head portraits. Our method exploits the advantages of both supervised learning and generative adversarial modeling. By designing a supervised learning problem in the latent space of StyleGAN, we achieve high-quality editing results which generalise to in-the-wild images with significantly more diversity than the training data. Through extensive evaluations, we demonstrated that our method outperforms all related techniques, both in terms of realism and editing accuracy. We further demonstrated that our method can learn from very limited supervised data, achieving high-quality results when trained with as little as 3 identities captured in a single expression. While several limitations still exist, we hope that our contributions inspire future work on using generative representations for synthesis applications.

ACKNOWLEDGMENTS

This work was supported by the ERC Consolidator Grant 4DReply (770784). We also acknowledge support from Technicolor and InterDigital. We thank Tiancheng Sun for kindly helping us with the comparisons with Sun et al. [2019].

REFERENCES

Rameen Abdal, Peihao Zhu, Niloy Mitra, and Peter Wonka. 2020. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. arXiv e-prints (2020).
Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017. Bringing Portraits to Life. ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia) 36, 6 (2017).
Mallikarjun B R, Ayush Tewari, Tae-Hyun Oh, Tim Weyrich, Bernd Bickel, Hans-Peter Seidel, Hanspeter Pfister, Wojciech Matusik, Mohamed Elgharib, and Christian Theobalt. 2020. Monocular Reconstruction of Neural Face Reflectance Fields. arXiv:2008.10247.
Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. 2020. Editing in style: Uncovering the local semantics of gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5771–5780.
Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. 2000. Acquiring the reflectance field of a human face. In Annual Conference on Computer Graphics and Interactive Techniques.
Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, and Jean-François Lalonde. 2017. Learning to predict indoor illumination from a single image. ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia) 36, 6, Article 176 (2017).
Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. 2018. Warp-Guided GANs for Single-Photo Facial Animation. ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia) 37, 6 (2018).
Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview Face Capture Using Polarized Spherical Gradient Illumination. ACM Trans. on Graph. 30, 6 (Dec. 2011), 1–10.
Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. 2020. GANSpace: Discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546 (2020).
Yannick Hold-Geoffroy, Akshaya Athawale, and Jean-François Lalonde. 2019. Deep sky modeling for single image outdoor lighting estimation. In Computer Vision and Pattern Recognition (CVPR).
Longlong Jing and Yingli Tian. 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
James T. Kajiya. 1986. The Rendering Equation. SIGGRAPH Computer Graphics 20, 4 (1986), 143–150. https://doi.org/10.1145/15886.15902
T. Karras, S. Laine, and T. Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In Computer Vision and Pattern Recognition (CVPR).
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Computer Vision and Pattern Recognition (CVPR).
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018. Deep Video Portraits. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 37, 4 (2018), 163.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. 2020. AvatarMe: Realistically Renderable 3D Facial Reconstruction "in-the-wild". In Computer Vision and Pattern Recognition (CVPR).
Chloe LeGendre, Wan-Chun Ma, Rohit Pandey, Sean Fanello, Christoph Rhemann, Jason Dourgarian, Jay Busch, and Paul Debevec. 2020. Learning Illumination from Diverse Portraits. In SIGGRAPH Asia 2020 Technical Communications. 1–4.
Steven R. Livingstone and Frank A. Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). https://doi.org/10.5281/zenodo.1188976
Abhimitra Meka, Christian Häne, Rohit Pandey, Michael Zollhöfer, Sean Fanello, Graham Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, Peter Denny, Sofien Bouaziz, Peter Lincoln, Matt Whalen, Geoff Harvey, Jonathan Taylor, Shahram Izadi, Andrea Tagliasacchi, Paul Debevec, Christian Theobalt, Julien Valentin, and Christoph Rhemann. 2019. Deep Reflectance Fields: High-Quality Facial Reflectance Field Inference from Color Gradient Illumination. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 38, 4 (2019).
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: real-time avatars using dynamic textures. In ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia). 258.
Thomas Nestmeyer, Jean-François Lalonde, Iain Matthews, and Andreas M. Lehrmann. 2020. Learning Physics-guided Face Relighting under Directional Light. In Computer Vision and Pattern Recognition (CVPR).
Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2020. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation. arXiv preprint arXiv:2008.00951 (2020).
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D. Castillo, and David W. Jacobs. 2018. SfSNet: Learning Shape, Reflectance and Illuminance of Faces in the Wild. In Computer Vision and Pattern Recognition (CVPR).
Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. 2020. Interpreting the Latent Space of GANs for Semantic Face Editing. In CVPR.
YiChang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Frédo Durand. 2014. Style Transfer for Headshot Portraits. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 33, 4, Article 148 (2014), 14 pages.
Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. 2017. Neural Face Editing with Intrinsic Image Disentangling. In Computer Vision and Pattern Recognition (CVPR). 5444–5453.
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First Order Motion Model for Image Animation. In Conference on Neural Information Processing Systems (NeurIPS).
Tiancheng Sun, Jonathan T. Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul Debevec, and Ravi Ramamoorthi. 2019. Single Image Portrait Relighting. ACM Trans. on Graph. 38, 4, Article 79 (July 2019).
Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, and Ravi Ramamoorthi. 2020. Light Stage Super-Resolution: Continuous High-Frequency Relighting. In ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia).
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. 2020a. StyleRig: Rigging StyleGAN for 3D Control over Portrait Images. In Computer Vision and Pattern Recognition (CVPR).
Ayush Tewari, Mohamed Elgharib, Mallikarjun B R, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. 2020b. PIE: Portrait Image Embedding for Semantic Control. ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia) 39, 6.
Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. 2020c. State of the art on neural rendering. In Computer Graphics Forum, Vol. 39. Wiley Online Library, 701–727.
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–12.
Zhibo Wang, Xin Yu, Ming Lu, Quan Wang, Chen Qian, and Feng Xu. 2020. Single Image Portrait Relighting via Explicit Multiple Reflectance Channel Modeling. ACM Trans. on Graph. (Proceedings of SIGGRAPH Asia) 39, 6, Article 220 (2020).
Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner, Chien Tu, Janet McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, and Markus Gross. 2006. Analysis of Human Faces using a Measurement-Based Skin Reflectance Model. ACM Trans. on Graphics (Proceedings of SIGGRAPH) 25, 3 (2006), 1013–1024.
Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. 2018. X2Face: A network for controlling face generation using images, audio, and pose codes. In European Conference on Computer Vision (ECCV).
Shuco Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Trans. on Graph. (Proceedings of SIGGRAPH) 37, 4, Article 162 (2018).
Tsun-Yi Yang, Yi-Ting Chen, Yen-Yu Lin, and Yung-Yu Chuang. 2019. FSA-Net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1087–1096.
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Xuaner Zhang, Jonathan T. Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren Ng, and David E. Jacobs. 2020. Portrait Shadow Manipulation. ACM Transactions on Graphics (TOG) 39, 4.
Hao Zhou, Sunil Hadap, Kalyan Sunkavalli, and David W. Jacobs. 2019. Deep Single-Image Portrait Relighting. In International Conference on Computer Vision (ICCV).
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.