
EUROGRAPHICS 2022 / R. Chaine and M. H. Kim (Guest Editors)
Volume 41 (2022), Number 2

Real-time Virtual-Try-On from a Single Example Image through Deep Inverse Graphics and Learned Differentiable Renderers

R. Kips 1,2, R. Jiang 3, S. Ba 1, B. Duke 3, M. Perrot 1, P. Gori 2 and I. Bloch 2,4


arXiv:2205.06305v1 [cs.CV] 12 May 2022

1 L’Oréal Research and Innovation, France   2 LTCI, Télécom Paris, Institut Polytechnique de Paris, France   3 Modiface, Canada   4 Sorbonne Université, CNRS, LIP6, France

Abstract
Augmented reality applications have rapidly spread across online retail platforms and social media, allowing consumers to virtually try on a large variety of products, such as makeup, hair dyeing, or shoes. However, parametrizing a renderer to synthesize realistic images of a given product remains a challenging task that requires expert knowledge. While recent work has introduced neural rendering methods for virtual try-on from example images, current approaches are based on large generative models that cannot be used in real-time on mobile devices. This calls for a hybrid method that combines the advantages of computer graphics and neural rendering approaches. In this paper, we propose a novel framework based on deep learning to build a real-time inverse graphics encoder that learns to map a single example image into the parameter space of a given augmented reality rendering engine. Our method leverages self-supervised learning and does not require labeled training data, which makes it extendable to many virtual try-on applications. Furthermore, most augmented reality renderers are not differentiable in practice due to algorithmic choices or implementation constraints to reach real-time on portable devices. To relax the need for a graphics-based differentiable renderer in inverse graphics problems, we introduce a trainable imitator module. Our imitator is a generative network that learns to accurately reproduce the behavior of a given non-differentiable renderer. We propose a novel rendering sensitivity loss to train the imitator, which ensures that the network learns an accurate and continuous representation for each rendering parameter. Automatically learning a differentiable renderer, as proposed here, could be beneficial for various inverse graphics tasks. Our framework enables novel applications where consumers can virtually try on a novel unknown product from an inspirational reference image on social media. It can also be used by computer graphics artists to automatically create realistic renderings from a reference product image.
CCS Concepts
• Computing methodologies → Computer vision; Machine learning; Computer graphics;

1. Introduction

The increasing development of digital sales and online shopping has been accompanied by the development of various technologies to improve the digital customer experience. In particular, augmented reality (AR) has rapidly spread across online retail platforms and social media, allowing consumers to virtually try on a large variety of products like makeup, hair dyeing, glasses, or shoes. To reach a realistic experience, such augmented reality renderers are expected to run in real-time with the limited resources of portable devices. Generally, commercial virtual try-on applications need to render not a single product, but an entire range of items with various appearances. For this reason, AR renderers are commonly parameterized by a set of graphics parameters that control the variation of appearance across the various products to render. In practice, setting these parameters to obtain realistic rendering for hundreds of products in a digital store is a tedious task that requires expert knowledge in computer graphics.

The rapidly emerging field of inverse graphics provides various solutions for estimating graphics parameters from natural images using differentiable rendering [KBM∗20]. Most AR renderers that run in real-time on portable devices are not differentiable in practice due to algorithmic choices or implementation constraints [GVS12]. Indeed, replacing non-differentiable operations with their differentiable approximations leads to less optimized computation speed, and would potentially lead to large re-implementation costs to ensure compatibility on multiple platforms. This makes conventional inverse graphics solutions unsuitable for our problem.

Recently, a novel family of methods based on neural rendering has introduced the task of image-based virtual try-on, which brings new perspectives for this problem [TFT∗20]. This task consists in extracting a product appearance from a single reference image and synthesizing it on the image of another person.


Figure 1: Our hybrid framework uses a deep learning based inverse graphics encoder that learns to map an example reference image into
the parameter space of a computer graphics renderer. The renderer can then be used to render a virtual try-on in real-time on mobile devices.
We illustrate the performance of our framework on makeup (lipstick and eye shadow) and hair color virtual try-on.

However, existing methods in this domain are often based on large generative networks that suffer from temporal inconsistencies, and cannot be used to process a video stream in real-time on mobile devices [KJB∗21, LQD∗18].

To address this issue, we propose a hybrid solution that leverages the speed and portability of computer graphics based renderers, together with the appearance extraction capabilities of neural-based methods. Our contributions can be summarized as follows:

• To relax the need for a graphics-based differentiable renderer in inverse graphics problems, we introduce a trainable imitator module. Our imitator is a generative network that learns to accurately reproduce the behavior of a given non-differentiable renderer. To train the imitator, we propose a novel rendering sensitivity loss which ensures that the network learns an accurate and continuous representation for each rendering parameter. This method for automatically learning a differentiable renderer could be beneficial for various inverse graphics tasks.
• We introduce a novel framework for image-based virtual try-on, using an inverse graphics encoder module that learns to map a single example image into the space of parameters of a rendering engine. This model is trained using an imitator network and a self-supervised approach which does not require labeled training data.
• We assess the effectiveness of our approach by investigating two well-established problems: makeup and hair color virtual try-on (see Figure 1). These problems are based on very different rendering principles, respectively physically-based computer graphics and pixel statistics manipulation.

Our method enables new applications where consumers can virtually try on a novel unknown product from a reference inspirational image on social media. It can also be used by computer graphics artists to automatically create realistic renderings from a reference product image. We believe that our framework can be easily adapted to other augmented reality applications since it only requires the availability of a conventional parametrized renderer.

2. Related works

2.1. Inverse Graphics and Differentiable Rendering

Given a natural image, inverse graphics approaches aim to estimate features that are typically used in computer graphics scene representations, such as HDR environment maps [SF19, SK21] or meshes of 3D objects such as faces [LBB∗17]. The idea of using a neural network for learning to map images to rendering parameters was first introduced in [KWKT15] and is now widely spread. Similarly to our problem, some applications focus on solutions to assist computer graphics artists and accelerate their work, such as spatially-varying bidirectional reflectance distribution function (SVBRDF) estimation from smartphone flash images [DAD∗18, HDMR21].

Most inverse graphics methods build on the rapidly growing field of differentiable rendering. This area focuses on developing differentiable operations for replacing the non-differentiable modules of computer graphics rendering pipelines [LADL18, LHK∗20]. Nowadays, differentiable renderers are key components for solving inverse graphics problems. Indeed, the vanilla inverse graphics approach consists in estimating the set of rendering parameters by minimizing photometric or perceptual distances [WSB03, ZIE∗18] between a reference image and the synthesized image using stochastic gradient descent, as done in [LHK∗20]. However, in addition to the requirement of a differentiable renderer, such an approach is slow since a gradient descent needs to be computed at inference for each new reference image. Thus, these methods are not suited to real-time applications. More recent methods, such as [HDMR21], avoid using gradient descent during inference by training an end-to-end neural network, while a differentiable renderer is used to provide feedback during training. This approach results in faster inference since only a single forward pass of the model is needed. In this paper, we propose to build upon this approach. Finally, another method consists in training graphics parameter estimators by using datasets composed of physical ground truth measurements, such as HDR maps [SK21] or OLAT images [PEL∗21]. Even though these approaches led to successful applications, they are difficult to generalize to many different inverse rendering problems as ground truth measurements are extremely costly to acquire at a large scale.


Figure 2: Our inference and training pipeline. The inverse graphics encoder E is trained using a self-supervised approach based on the
random sampling of graphics vectors g sent to a renderer R to generate training data. For each synthetic image, a graphics loss function
Lgraphics enforces E to estimate the corresponding rendering parameters. To ensure additional supervision in the image space, our learned
imitator module is used to compute a differentiable rendering estimate from the encoder output. A rendering loss function Lrendering is then
computed using a perceptual distance between the rendered image and its reconstructed estimate. For inference, the imitator is discarded
and we use the original renderer to reach real-time performance.

2.2. Augmented Reality Renderers

Augmented reality renderers are a particular category of computer graphics pipelines where the objective is to realistically synthesize an object in an image or video of a real scene. In general, AR renderers are composed of one or several scene perception modules whose role is to estimate relevant scene information from the source image, which is then passed to a rendering module. For instance, many portrait-based AR applications rely on facial landmark estimation [KS14] to compute the position of a synthetic object, such as glasses, that is then blended onto the face image [ADSDA16]. Similarly, other popular scene perception methods for augmented reality focus on hand tracking [WMB∗20, ZBV∗20], body pose estimation [BGR∗20], hair segmentation [LCP∗18], or scene depth estimation [KRH21]. Furthermore, most augmented reality applications target video-based problems and deployment on mobile devices. For this reason, reaching real-time with limited computation resources is usually an important focus of research in this field, as illustrated in [BKV∗19, BGR∗20, KRH21, LPDK19]. Many AR applications focus on virtual try-on tasks in order to enhance the consumer experience in digital stores. Popular applications introduce virtual try-on for lipstick [SK19], hair color [TB19], or nail polish [DAP∗19], reaching realistic results in real-time on mobile devices. However, for such methods, the renderer needs to be manually parametrized by an artist to obtain a realistic rendering of a targeted product. Users are restricted to selecting a product within a pre-defined range, and cannot virtually try a novel product from a given new reference image.

2.3. Image-Based Virtual Try-On

While conventional AR renderers render objects that are previously created by computer graphics artists, a recent research direction based on neural rendering has proposed the novel task of image-based virtual try-on. The objective consists in extracting a product appearance from a given reference image and realistically synthesizing this product in the image of another person. Most methods in this field are built on a similar approach: a neural network encodes the target product features into a latent space. Then, this product representation is decoded and rendered on the source image using generative models such as generative adversarial networks (GANs) [GPAM∗20] or variational auto-encoders (VAEs) [KW14]. In particular, this idea has been successfully used for makeup transfer [LQD∗18, JLG∗20] and hair synthesis [SDS∗21, KCP∗21], and is rapidly emerging in the field of fashion articles [JB17]. Recent methods attempt to provide controllable rendering [KPGB20], or propose to leverage additional scene information in their models, such as segmentation masks for fashion items [CPLC21, GSG∗21] or UV maps for makeup [NTH21]. Another category of models leverages user input instead of an example image to control the image synthesis. In particular, for hair virtual try-on, the model from [XYH∗21] uses sketch input to explicitly control the hairstyle, while StyleCLIP [PWS∗21] uses text input to control hair and makeup attributes in an image.

However, current methods for virtual try-on from an example image suffer from several limitations. First, these neural rendering methods are based on large generative models that cannot be used to produce high-resolution renderings in real-time on mobile devices. Furthermore, such models often lead to poor results when used on videos, since generative models are known to produce temporal inconsistency artifacts. Even though recent works attempt to address this issue by training post-processing models [CXM∗20, TDKP21], they cannot be used in real-time. More recently, an inverse graphics approach has been introduced for makeup [KJB∗21], bringing interesting perspectives for reaching real-time image-based virtual try-on. In this paper, we build on this approach to propose a more robust and general framework.


3. Framework Overview

In this section, we present an overview of our framework building


blocks (see Figure 2). First, the main requirement of our method is
to have access to a parametrized augmented reality renderer. Such
a renderer R can render in a source image an object whose appear-
ance is parametrized by a vector of graphics parameters g. Thus,
for each frame of a video, renderer R takes as input a frame X as
well as a vector of parameters g and synthesizes an output frame
where the rendered object is realistically inserted. Furthermore, we
assume that R is not differentiable, as it is the case for most AR
rendering pipeline implementations. In the context of this paper,
we illustrate our approach on two examples of AR renderers for
virtual try-on, a makeup renderer (lipstick and eye-shadow), and a
hair color renderer, that are illustrated in Figure 3 and described in
detail in Section 4.
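This renderer interface is the only assumption our framework makes about R. As a minimal illustration (the Python names below are ours, not part of any released code), it can be summarized as:

```python
from typing import Iterable, Iterator, Protocol
import numpy as np

class ARRenderer(Protocol):
    """Interface assumed for the non-differentiable AR renderer R: it takes a
    video frame X and a graphics parameter vector g, and returns the frame
    with the product realistically rendered."""
    def __call__(self, frame: np.ndarray, g: np.ndarray) -> np.ndarray: ...

def render_video(renderer: ARRenderer, frames: Iterable[np.ndarray],
                 g: np.ndarray) -> Iterator[np.ndarray]:
    """Per-frame usage: g encodes one product, stays fixed for the whole
    stream, and only the renderer runs in the real-time loop."""
    for frame in frames:
        yield renderer(frame, g)
```

Because g is fixed for a given product, everything that is expensive (estimating g from an example image) can happen once, outside the per-frame loop.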

Given a particular AR renderer R, we propose to build a specialized graphics encoder E that learns to map a single example image into the parameter space of the renderer. Our graphics encoder is trained using a two-step self-supervised approach, which relaxes the need for a labeled training dataset. This makes our framework easily adaptable to many AR rendering applications. First, we propose a differentiable imitator module I that learns to reproduce the behavior of the non-differentiable renderer R using generative learning. The imitator can then be used as a differentiable surrogate of R. In order to ensure that the entire parameter space of the renderer is correctly modeled by the imitator network, we introduce a novel parameter sensitivity loss. The training procedure of our imitator is illustrated in Figure 2 and described in detail in Section 5. This method for automatically learning a differentiable renderer could be beneficial to various inverse graphics tasks, in particular for renderers using non-differentiable operations such as path-tracing.

Secondly, the learned imitator is used to train the graphics encoder module E by computing a rendering loss in the image space, providing differentiable feedback to optimize the graphics encoder weights. To better constrain our problem, we also use a graphics loss in the space of rendering parameters. This training pipeline is illustrated in Figure 2 and described in Section 6.

Finally, at inference time, the imitator module is discarded and replaced by the original renderer R, which is faster and more portable. Then, given an example image, the encoder E estimates the corresponding vector of parameters that must be passed to R to produce a realistic rendering, as depicted in Figure 2. The proposed example-based virtual try-on framework can be run in real-time, as the rendering parameters g are computed only once at the beginning and then kept fixed for each frame of the video stream to render.

Figure 3: Description of our makeup and hair color augmented reality renderers. They are composed of scene perception modules that compute scene information which is then passed to a computer graphics renderer R. The lipstick renderer uses physically-based rendering while the hair color renderer uses pixel statistics manipulations. For both renderers, the appearance of the product to render is controlled by a vector of graphics parameters g. The eye shadow and lipstick renderers are based on similar principles and parameters.

4. Augmented Reality Renderers

In this section, we describe the two AR renderers that we chose to illustrate the framework proposed in this paper. We selected two renderers that address popular virtual try-on categories with different rendering principles: physically-based computer graphics for makeup, and pixel statistics manipulation for hair color.

4.1. Makeup AR Renderer

Following similar procedures as in [LYP∗19], we use a graphics-based makeup renderer that takes as input makeup color and texture parameters and generates realistic images in real-time. As shown in Figure 3, the complete pipeline for lipstick includes a lip mesh prediction model and an illuminant estimation module. The makeup parameters are manually set for each product. Specifically, the lipstick rendering is done in two steps: 1) re-coloring; 2) texture rendering. For re-coloring, the input lipstick color (R, G, B values) is adjusted to better fit the background illumination, based on gray histograms from the input lip. In the second step, the texture of the lipstick (e.g., glossiness, sparkles) is applied using the environment reflection estimation, the estimated lip mesh, and a simple material model. In this paper, as described in Table 1, we only consider the main rendering parameters, and leave other material parameters at their default values.

4.2. Hair Color AR Renderer

The hair AR rendering pipeline we use consists of a sequence of image processing primitives combined with hair mask estimation. First, given a set of swatch parameters (Table 2), a re-coloring step computes a pixel-wise color transformation on a reference hair swatch image. From this re-colored swatch, a reference histogram for R, G, B and gray values is extracted.
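The paper does not specify the exact transfer function of each swatch adjustment; the following NumPy sketch illustrates one plausible pixel-wise re-coloring and the histogram extraction step, using textbook definitions of brightness, contrast, exposure and gamma (hue and saturation are omitted for brevity, and all names are illustrative):

```python
import numpy as np

def recolor_swatch(swatch: np.ndarray, brightness: float = 0.0, contrast: float = 1.0,
                   exposure: float = 1.0, gamma: float = 1.0) -> np.ndarray:
    """Pixel-wise re-coloring of a reference hair swatch (RGB values in [0, 1]).
    Hue and saturation adjustments would additionally require an RGB <-> HSV
    conversion and are not shown here."""
    img = swatch.astype(np.float32)
    img = img * exposure                    # exposure scaling
    img = (img - 0.5) * contrast + 0.5      # contrast around mid-gray
    img = img + brightness                  # additive brightness
    img = np.clip(img, 0.0, 1.0) ** gamma   # gamma curve
    return np.clip(img, 0.0, 1.0)

def reference_histograms(swatch: np.ndarray, bins: int = 256):
    """Histograms of the re-colored swatch used as histogram-matching targets:
    one per R, G, B channel plus one for gray values."""
    gray = swatch.mean(axis=-1)
    channels = [swatch[..., c] for c in range(3)] + [gray]
    return [np.histogram(c, bins=bins, range=(0.0, 1.0))[0] for c in channels]
```

The swatch parameters correspond to the ranges listed in Table 2 below.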


Table 1: Main parameters used in the makeup rendering process. The complete rendering process includes a total of 17 parameters in order to achieve more sophisticated lipstick looks, including sparkles.

Parameter                          Range
Makeup opacity                     [0, 1]
R, G, B                            [0, 255]
Amount of gloss on the makeup      [0, +∞)
Gloss roughness                    [0, 1]
Reflection intensity               [0, 1]

Table 2: Parameters used in the hair color rendering process. The re-coloring step uses swatch parameters to modify the color of a reference swatch image, which is then used as a target for histogram matching. Non-swatch parameters affect post histogram-matching steps in the rendering process. Blend effect refers to applying a faded look, and the parameter controls the gradient offset from the top of the hair.

Parameter                Range
Swatch parameters
Brightness               [−0.3, 0.3]
Contrast                 [0.5, 2]
Exposure                 [0.5, 2]
Gamma                    [0.5, 3]
Hue                      [0, 1]
Saturation               [0, 3]
Non-swatch parameters
Blend effect             [−1, 1]
Intensity                [0, 1]
Shine                    [0, 1]

The second step in the hair AR rendering pipeline, the rendering process, uses both this swatch histogram and additional non-swatch parameters from the shade matching process to render hair color. In the rendering process, a hair segmentation model first estimates a hair mask from the input image. The detected hair region is then transformed so that its histogram matches the histogram of the re-colored swatch image. Finally, shine and contrast effects boost global and local contrast to improve texture, and a blending effect can add an optional faded look. In this paper, we consider both swatch and non-swatch parameters, as they both affect all steps of the rendering process.


Figure 4: The training procedure of our differentiable imitator module I that learns to reproduce the behavior of the renderer R. The
imitation loss function Limitation enforces perceptual similarity between R and I outputs on randomly sampled graphics parameter vectors g.
The sensitivity loss function Lsens ensures that a random shift in any dimension of the graphics parameter vector is correctly modeled by the
imitator.

5. Learned Imitator Module

5.1. Imitator motivation

Our objective is to build an inverse graphics encoder which can be used to perform virtual try-on from a single example image in real-time on mobile devices, using a conventional, non-differentiable AR renderer. When training such a model, the ground truth rendering parameters are known during the encoder training and can be used to supervise the encoder directly, as in [KJB∗21]. However, this assumes that a distance in the space of graphics parameters is a good measure of appearance. This distance can be misleading as some parameters (e.g., RGB for makeup) have a very large contribution to the rendering results, while other parameters (e.g., gloss roughness) have a more limited impact. Similarly, reaching a realistic shine effect for a given product often requires very high accuracy in the setting of the shine parameters, where slight variations can lead to perceptually very different textures. Thus, the graphics loss function, computed in the space of rendering parameters, does not provide an optimal signal for supervising the inverse graphics encoder training. Therefore, we propose to use an additional rendering loss function in the image space, by leveraging an imitator module and a perceptual distance. In this section, we detail the first step of our framework, which consists in training a differentiable imitator module that will later be used to train our inverse graphics encoder. Our imitator takes the form of a generative neural network that automatically learns to reproduce the behavior of a given renderer. This network can then be used as a differentiable surrogate of the initial renderer, for solving various inverse graphics problems.

The idea of using a generative network to build a differentiable estimate of a renderer was first proposed in the field of automatic avatar creation [WTP17, SYF∗19, Shi20]. However, compared to fixed-camera avatar renderers, AR renderers are more complex functions that are usually composed of computer vision and computer graphics modules. Previous renderer imitator methods directly apply the conventional generative adversarial network method for image-to-image translation established by [IZZE17]. This approach consists in training the imitator to reproduce the renderer output by minimizing a perceptual loss between the renderer and imitator outputs on a set of example rendering images. However, this method does not leverage the specificity of the renderer imitation problem, where the training phase does not depend on a fixed set of training images, but can be dynamically generated using the original renderer.

We propose to leverage this advantage to introduce a more constrained formulation of the imitator problem, based on a novel rendering sensitivity loss function. This additional loss term is motivated by two observations. First, the imitator network is required to accurately model each of the renderer parameters. However, this is not explicitly enforced by the conventional imitator approach, where parameters that only impact a small portion of the rendered images have a limited weight in the perceptual loss function. Secondly, in order to accurately solve inverse graphics problems, the imitator needs to learn a continuous representation for each parameter, where a shift in a given parameter will lead to changes in the rendered image which are similar to the ones obtained using the actual AR renderer. This is particularly challenging as generative networks are known for the difficulties they encounter in accurately modeling the entire training data distribution (i.e., the mode collapse problem) [LTZQ19, GAA∗17, ACB17]. Our sensitivity loss, which provides an answer to these problems, is inspired by the expression of the finite difference and forces the imitator to learn a rendering function whose derivative with respect to the rendering parameters is the same as the approximated derivative of the non-differentiable renderer. For this reason, this loss does not operate in the image space, as the rendering loss does, but in the image difference space. Thus, when sampling rendering parameters, the sensitivity loss forces the generator to modify the same pixels as the non-differentiable renderer, and in the same proportion.

5.2. Imitator objective functions

In this section, we detail the training procedure of our imitator model, illustrated in Figure 4. Our objective is to train an imitator network I that learns to reproduce the behavior of the renderer R, and for which derivatives with respect to g can be computed. We generate training data by randomly sampling n graphics vectors g_i, i = 1...n, and rendering them through R with a randomly associated portrait image X_i. We propose to train I using a combination of two loss functions. First, we use an imitation loss function that enforces the imitator network to produce outputs that are perceptually similar to the renderer for a given image X_i and graphics vector g_i. The perceptual similarity is computed using a perceptual distance based on deep features [KLL16], and can be written as follows:

\mathcal{L}_{perceptual}(x, y) = \lVert E_{VGG}(x) - E_{VGG}(y) \rVert_2^2 \quad (1)

where E_{VGG} is the feature encoder of a pre-trained VGG neural network. Then, our imitation loss function is the following:

\mathcal{L}_{imitation} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{perceptual}\big(R(X_i, g_i), I(X_i, g_i)\big) \quad (2)

However, this imitation constraint is not sufficient in practice. We introduce a novel sensitivity loss function based on two observations: (a) the imitator must be able to correctly model all the dimensions of the graphics vector g, and (b) the imitator must learn a continuous representation of each dimension of g, where a given shift in g will be modeled by changes of the correct magnitude in the synthesized image. Our sensitivity loss term enforces additional constraints to satisfy these properties. For a given synthesized image R(X_i, g_i), each element j, j = 1...m, of the graphics vector g_i is randomly re-sampled independently. Each time, the new sampled vector, noted g'_{i,j}, is passed to the imitator and the renderer, and the imitator is explicitly constrained to modify the synthesized image in the same proportion as the renderer did. The sensitivity loss function can be written as follows:

\mathcal{L}_{sens} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\lVert [R(X_i, g_i) - R(X_i, g'_{i,j})] - [I(X_i, g_i) - I(X_i, g'_{i,j})] \big\rVert_2^2

Finally, we use the conventional adversarial GAN loss function, where D is a discriminator module trained with the gradient penalty loss from [GAA∗17]:

\mathcal{L}_{GAN} = -\frac{1}{n} \sum_{i=1}^{n} D\big(I(X_i, g_i)\big) \quad (3)

In total, our imitator is trained to minimize the following loss function, where λ1 and λ2 are weighting factors that are set experimentally:

\mathcal{L}^{I}_{total} = \lambda_1 \mathcal{L}_{imitation} + \lambda_2 \mathcal{L}_{sens} + \mathcal{L}_{GAN} \quad (4)
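To make equations (1)-(4) concrete, the following framework-agnostic Python sketch computes the imitator objective for one batch. The callables renderer, imitator, vgg_features and discriminator stand for R, I, E_VGG and D and are assumed to be provided by the user; an actual implementation would operate on GPU tensors with automatic differentiation rather than NumPy arrays:

```python
import numpy as np

def perceptual(vgg_features, x, y):
    # Equation (1): squared L2 distance between deep VGG features.
    return np.sum((vgg_features(x) - vgg_features(y)) ** 2)

def imitator_objective(renderer, imitator, vgg_features, discriminator,
                       images, g, g_shifted, lambda1=100.0, lambda2=1000.0):
    """images: list of portraits X_i; g: (n, m) graphics vectors;
    g_shifted[i][j]: g_i with only its j-th component re-sampled."""
    n, m = g.shape
    l_imitation, l_sens, l_gan = 0.0, 0.0, 0.0
    for i in range(n):
        r_out = renderer(images[i], g[i])
        i_out = imitator(images[i], g[i])
        # Equation (2): perceptual imitation loss.
        l_imitation += perceptual(vgg_features, r_out, i_out)
        # Generator side of the adversarial term, equation (3).
        l_gan += -discriminator(i_out)
        for j in range(m):
            # Sensitivity term: the imitator must change the image by the
            # same amount as the renderer when one parameter is shifted.
            dr = r_out - renderer(images[i], g_shifted[i][j])
            di = i_out - imitator(images[i], g_shifted[i][j])
            l_sens += np.sum((dr - di) ** 2)
    l_imitation, l_sens, l_gan = l_imitation / n, l_sens / n, l_gan / n
    # Equation (4): total imitator objective.
    return lambda1 * l_imitation + lambda2 * l_sens + l_gan
```

The default weights λ1 = 100 and λ2 = 1000 are the values reported later in Section 7.2.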


6. Inverse Graphics Encoder Module

6.1. Self-supervised training procedure

Our final objective is to train an inverse graphics encoder network E that learns to map a single reference image to the set of parameters of a given renderer R. The training procedure of our inverse graphics encoder module is summarized in Figure 2.

We propose to use a self-supervised training approach where a graphics vector g is randomly sampled and rendered with a random portrait image. Then, we constrain the graphics encoder to recover the original graphics vector from the synthesized image using a graphics loss function in the space of rendering parameters:

\mathcal{L}_{graphics} = \frac{1}{n} \sum_{i=1}^{n} \lVert g_i - E(R(X_i, g_i)) \rVert_2^2 \quad (5)

However, using only a graphics loss function suffers from limitations. A distance in the space of rendering parameters might not reflect well a perceptual distance in the image space. For instance, some parameters are central in the rendered image appearance, such as R, G, B values for makeup synthesis, while other parameters will only affect the appearance marginally. Thus, the graphics loss term does not provide an optimal signal for supervising the inverse graphics encoder training. Therefore, we propose to use an additional rendering loss function in the image space, by leveraging our imitator module. For each training image, the encoder module estimates a graphics vector, and the imitator computes the corresponding rendered image. A perceptual distance is then calculated between this reconstructed rendering and the original rendered image that was passed to the graphics encoder, as illustrated in Figure 2. The rendering loss can be written as follows:

\mathcal{L}_{rendering} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{perceptual}\big(R(X_i, g_i), I(X_i, E(R(X_i, g_i)))\big) \quad (6)

Since the imitator is differentiable, we can compute the gradients of this loss function with respect to the encoder weights, and train the encoder network using the conventional stochastic gradient descent procedure. In total, our inverse graphics encoder is trained to minimize the following loss, where λ3 is a weighting factor to balance the two components of the loss function:

\mathcal{L}^{E}_{total} = \mathcal{L}_{graphics} + \lambda_3 \mathcal{L}_{rendering}

6.2. Encoder Inference

At inference time, the imitator module is discarded, as it is less optimized for inference on portable devices compared to the initial augmented reality renderer R. Given a single reference image X_ref, the encoder network directly estimates the set of corresponding rendering parameters ĝ = E(X_ref). Thus, for each frame of a video, R can be used to render the virtual try-on in real-time using ĝ, as illustrated in Figure 2. Since the rendering parameters are fixed for each frame, the encoder inference needs to be run only once and does not impact the real-time efficiency of the renderer.

7. Implementation

7.1. Controlling Data Distribution

A specificity of our framework is that we fully control the distribution of graphics vectors used to train our inverse graphics encoder. We propose to leverage this distinctive trait to construct a distribution that will reinforce the model performance on extreme examples. To obtain a realistic graphics parameters sampling, we fit a multivariate normal distribution on a set of rendering parameters previously set by experts to simulate a range of existing products. We choose a Gaussian distribution as it seemed adapted to the empirical distribution of our data. In addition, we reinforce the diversity of the sampled graphics vectors by alternately sampling from a uniform distribution, as seen in Figure 5. Even though this might lead to non-realistic synthetic images, the increased diversity makes the framework more robust to extreme examples that might occur in practice, such as blue lipstick or purple hair.

Figure 5: Example of the sampling distribution for the R parameter of our lipstick renderer. Starting from a distribution fitted on natural data, we reinforce the diversity of the training by sampling from a uniform distribution.

Table 3: The architecture of our imitator and inverse graphics encoder module.

7.2. Model Architectures and Training

The architectures of our imitator and inverse graphics encoder module are described in detail in Table 3. Our learned imitator module is implemented using an architecture inspired by the StarGAN [CCK∗18] generator. The generator input is constructed by concatenating the graphics parameters g to the source image X_i as additional channels. We use instance normalization in each layer, as well as residual blocks composed of two convolutional layers with 4 × 4 kernels and a skip connection. Furthermore, in the final layer, the generator outputs a pixel difference map that is added to the source image to obtain the generated image. This architecture allows for a better preservation of the input image details, as the entire image does not need to be encoded in the generator bottleneck. The encoder architecture is composed of a simple encoder network based on convolution blocks with 4 × 4 kernels and ReLU activation outputs. The final layer of the encoder is composed of linear activation outputs of the same size as the graphics vector.
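The paper's models are implemented in TensorFlow and their exact layer list is given in Table 3, which is not reproduced here; the following PyTorch-style sketch only illustrates the structure described above, with placeholder widths and block counts:

```python
import torch
import torch.nn as nn

def conv4x4(in_ch, out_ch, stride=1):
    # 4x4 convolution; asymmetric zero padding keeps the spatial size
    # unchanged when stride=1 and halves it when stride=2.
    pad = nn.ZeroPad2d((1, 2, 1, 2)) if stride == 1 else nn.ZeroPad2d(1)
    return nn.Sequential(pad, nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride))

class ResidualBlock(nn.Module):
    """Two 4x4 convolutions with instance normalization and a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            conv4x4(ch, ch), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            conv4x4(ch, ch), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class Imitator(nn.Module):
    """StarGAN-style generator: g is broadcast and concatenated to the image as
    extra channels, and the network predicts a pixel-difference map that is
    added back to the source image."""
    def __init__(self, g_dim, width=64, n_res=4):
        super().__init__()
        self.net = nn.Sequential(
            conv4x4(3 + g_dim, width), nn.InstanceNorm2d(width), nn.ReLU(inplace=True),
            *[ResidualBlock(width) for _ in range(n_res)],
            conv4x4(width, 3))
    def forward(self, img, g):
        g_map = g[:, :, None, None].expand(-1, -1, img.shape[2], img.shape[3])
        return img + self.net(torch.cat([img, g_map], dim=1))

class GraphicsEncoder(nn.Module):
    """Convolution blocks with ReLU, followed by a linear head of the same
    size as the graphics vector."""
    def __init__(self, g_dim, width=32, n_blocks=5):
        super().__init__()
        blocks, ch = [], 3
        for _ in range(n_blocks):
            blocks += [conv4x4(ch, width, stride=2), nn.ReLU(inplace=True)]
            ch, width = width, width * 2
        self.features = nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(ch, g_dim)
    def forward(self, img):
        return self.head(self.features(img))
```

The pixel-difference output of the imitator is what preserves the high-frequency detail of the source portrait, since only the edited regions need to be generated.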

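The graphics vectors used to build the training sets follow the sampling scheme of Section 7.1; a minimal NumPy sketch, assuming a matrix of expert-set presets, per-parameter bounds, and an illustrative 50/50 mixing proportion (the exact proportion is not stated in the paper), is:

```python
import numpy as np

def sample_graphics_vectors(expert_params: np.ndarray, n: int, ranges: np.ndarray,
                            p_uniform: float = 0.5, seed: int = 0) -> np.ndarray:
    """expert_params: (k, m) parameter vectors set by artists for real products;
    ranges: (m, 2) min/max bounds per parameter (Tables 1 and 2)."""
    rng = np.random.default_rng(seed)
    mean = expert_params.mean(axis=0)
    cov = np.cov(expert_params, rowvar=False)
    samples = []
    for _ in range(n):
        if rng.random() < p_uniform:
            g = rng.uniform(ranges[:, 0], ranges[:, 1])   # diversity / extreme products
        else:
            g = rng.multivariate_normal(mean, cov)        # realism, fitted on expert presets
        samples.append(np.clip(g, ranges[:, 0], ranges[:, 1]))
    return np.stack(samples)
```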

The training datasets are synthesized by sampling n = 15000 graphics parameter vectors for lipstick and hair color and rendering them on random portrait images from the FFHQ dataset [KLA19], using the renderers described in Section 4. To improve the proportion of relevant pixels in the lipstick images, we crop the portraits around the lips before feeding them to the imitator and encoder models. We empirically set the loss weighting factor values to λ1 = 100, λ2 = 1000 and λ3 = 20 for both experiments. The imitator and inverse graphics encoder are trained with a batch size of 16 over 300 epochs using an Adam optimizer with a fixed learning rate of 5e−5. The code and training data will be released upon acceptance of this paper.

Figure 6: Our imitator module learns to accurately reproduce the behavior of complex augmented reality renderers, editing the right pixels and reproducing perceptually similar appearances.

Figure 7: A qualitative ablation study illustrates the impact of our novel rendering sensitivity loss. The accuracy in rendering parameters such as shine or color is significantly improved.

8. Results

8.1. Qualitative Evaluation

8.1.1. Imitator

To illustrate the performance of our learned imitator module, we show a qualitative comparison of the imitator I and renderer R outputs in Figure 6. Even though the considered AR renderers are composed of multiple complex modules for segmentation and computer graphics rendering, our imitator is able to learn how to accurately reproduce their behavior. For both renderers, the imitator modifies only relevant image regions, and the material appearance of makeup and hair is correctly reproduced.

Furthermore, we perform a qualitative ablation study to emphasize the impact of our novel rendering sensitivity loss in Figure 7. It is worth noting that, compared to the standard imitator approach, our imitator module trained with the sensitivity loss learns a better representation for parameters that affect a smaller portion of the image, such as the shine level in the lipstick rendering. It also leads to more accurate rendering of colors in the case of hair color virtual try-on.

8.1.2. Graphics encoder

We use our inverse graphics encoder to perform virtual try-on from example images, and compare our results against popular neural-based models for hair and makeup synthesis. The results are illustrated in Figure 8 for makeup and Figure 9 for hair. The makeup and hair appearances are correctly extracted from the example image and rendered by our framework in high resolution. Furthermore, our encoder is able to model complex appearance attributes, such as shiny lipstick reflections and brown hair with blond highlights, as can be seen in Figure 1. Additional video examples are provided as supplementary material. However, it can be observed that the quality of the results seems lower for the hair color virtual try-on. As can be seen in Figures 1 and 9, our framework can reproduce the general hair color, but sometimes fails to accurately model reflection color.

Compared to other methods, it can be observed that, for makeup, our models achieve more realistic results with higher resolution. In addition, compared to MIG [KJB∗21], our imitator approach allows us to better model complex textures with shine. For hair color virtual try-on, we compare our results with MichiGAN, the generative-based method from [TCC∗20]. Our approach produces results of comparable realism, while reaching real-time speed on mobile devices, which is not the case for generative models, as illustrated in Section 8.4. However, our method is limited to hair color virtual try-on, due to our renderer limitations, while slower generative approaches such as [PWS∗21] and [XYH∗21] can also control hair style.


Figure 8: Qualitative comparison of our approach against neural rendering methods for example-based makeup virtual try-on. Our model achieves more realistic results with high resolution. Furthermore, using an imitator allows us to achieve better shine modeling than in MIG.

Figure 9: Comparison against generative-based methods for hair color virtual try-on. In addition to reaching real-time on mobile devices, our model produces more realistic hair color results.

Figure 10: Qualitative ablation study on synthetic data. These results confirm that training the inverse graphics encoder using an imitator based on the sensitivity loss leads to improved realism in image-based virtual try-on.

8.2. Quantitative Evaluation

8.2.1. Ablation study

In order to analyze the role played by each component of our framework, we perform a quantitative ablation study. We build a synthetic experiment by sampling 3000 graphics vectors and rendering each of them on two portrait images randomly drawn each time from the FFHQ dataset, for a total of 3000 different sampled pairs of portrait images of different persons. During the experiment, we extract the appearance from the first rendered image using our graphics encoder, and transfer it to the second portrait before rendering. Ultimately, we compare our estimated rendering with the ground-truth rendering on the second portrait. This experiment is illustrated in the supplementary materials and the results are reported in Table 4. Several example results for lipstick virtual try-on are presented in Figure 10. These results tend to demonstrate the usefulness of using an imitator module, as opposed to a single loss in the space of rendering parameters as in [KJB∗21], reducing the average perceptual distance from 0.071 to 0.054 for lipsticks. Furthermore, compared to the standard imitator approach, using our novel rendering sensitivity loss during the imitator training leads to a more accurate inverse graphics encoder. Lastly, using a combination of graphics loss and rendering loss terms achieves the best performance on most metrics by a small margin; we thus suggest keeping these two loss terms for training.

8.2.2. Evaluation on real makeup data

To quantitatively compare our method against other image-based virtual try-on methods, we perform an experiment based on real images. We reproduce the quantitative evaluation of lipstick virtual try-on from example images introduced in [KPGB20]. In particular, we use the dataset provided by the authors, with 300 triplets of reference portraits with lipstick, source portraits without makeup, and associated ground-truth images of the same person with the reference lipstick. We compare our approach against state-of-the-art generative methods for makeup synthesis from an example image. The results of this experiment are presented in Table 5, and confirm that our framework achieves a more realistic virtual try-on according to all metrics, in addition to reaching real-time. We do not consider an equivalent experiment for hair color, as collecting a similar dataset would require multiple panelists to dye their hair to the same color, which is difficult to achieve in practice.

8.3. User Experiment

We also conduct a user study in order to compare our model to renderer parametrizations set by expert artists. We build a validation dataset of 2500 images of volunteers wearing a total of 327 different lipsticks. For each of these lipsticks, artists have carefully set the rendering parameters to reproduce the makeup product appearance.

Table 4: Ablation study on lips synthetic data.


Lipstick experiment
Inverse graphics encoder loss   Imitator loss   PSNR ↑ (mean ± std)   SSIM ↑ (mean ± std)   Perceptual dist. ↓ (mean ± std)
graphics loss MIG [KJB∗ 21] - 44.87 ± 4.92 0.997 ± 0.001 0.071 ± 0.036
rendering loss imitation loss 47.13 ± 5.02 0.998 ± 0.001 0.059 ± 0.033
rendering loss imitation loss + sensitivity loss 47.85 ± 5.15 0.999 ± 0.001 0.055 ± 0.032
graphics loss + rendering loss imitation loss + sensitivity loss 47.90 ± 5.12 0.999 ± 0.001 0.054 ± 0.032
Hair color experiment
Inverse graphics encoder loss   Imitator loss   PSNR ↑ (mean ± std)   SSIM ↑ (mean ± std)   Perceptual dist. ↓ (mean ± std)
graphics loss MIG [KJB∗ 21] - 28.19 ± 5.34 0.927 ± 0.056 0.797 ± 0.308
rendering loss imitation loss 26.44 ± 5.81 0.917 ± 0.068 0.810 ± 0.323
rendering loss imitation loss + sensitivity loss 27.49 ± 5.82 0.924 ± 0.064 0.786 ± 0.325
graphics loss + rendering loss imitation loss + sensitivity loss 28.73 ± 5.90 0.933 ± 0.055 0.756 ± 0.320
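The synthetic transfer protocol behind Table 4 can be sketched as follows; renderer and encoder are assumed callables, PSNR is written out explicitly, and SSIM and the VGG-based perceptual distance reported in the table would be computed analogously:

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ablation_protocol(renderer, encoder, graphics_vectors, portrait_pairs):
    """For each sampled g: render it on portrait A, re-estimate the parameters
    from that image with the encoder, render them on portrait B, and compare
    against the ground-truth rendering R(B, g)."""
    scores = []
    for g, (portrait_a, portrait_b) in zip(graphics_vectors, portrait_pairs):
        reference = renderer(portrait_a, g)          # appearance source
        g_hat = encoder(reference)                   # extracted parameters
        estimated = renderer(portrait_b, g_hat)      # transferred rendering
        ground_truth = renderer(portrait_b, g)       # target rendering
        scores.append(psnr(estimated, ground_truth))
    return float(np.mean(scores)), float(np.std(scores))
```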

Table 5: Quantitative evaluation of the makeup transfer performance using a dataset of ground-truth triplet images.
Model   PSNR ↑ (mean ± std)   SSIM ↑ (mean ± std)   Perceptual dist. ↓ (mean ± std)
BeautyGAN [LQD∗ 18] 17.44 ± 3.43 0.609 ± 0.094 0.093 ± 0.018
CA-GAN [KPGB20] 17.92 ± 2.93 0.621 ± 0.033 0.077 ± 0.019
PSGAN [JLG∗ 20] 16.11 ± 2.42 0.360 ± 0.098 0.062 ± 0.018
CPM [NTH21] 17.87 ± 3.65 0.655 ± 0.089 0.065 ± 0.022
MIG [KJB∗ 21] 17.82 ± 2.80 0.663 ± 0.096 0.062 ± 0.016
Ours 18.35 ± 2.63 0.672 ± 0.100 0.060 ± 0.016

We use 2000 images to estimate the rendering parameters for each of the considered lipsticks using our models, computing the median when multiple images per lipstick are available. The remaining 500 images are kept for validation.

Using this dataset, we conduct a user study with six human evaluators. Each of them is presented with an image from the validation set, and the two associated renderings of the same lipstick, based on artists' rendering parameters and on parameters estimated by our model. Each rendering image is randomly denoted as rendering A or B to limit bias in the evaluation. Each evaluator must choose among the categories "both renderings are valid", "only rendering A is valid", "only rendering B is valid", and "both renderings are invalid". All images are labeled by three different evaluators. Finally, we removed images for which a majority vote was not reached among the evaluators (19%). We also removed images where both renderings were considered unrealistic (14%), assuming this was more due to the renderer limitations than to an inaccurate rendering parametrization. The results of this experiment are presented in Table 6.

Table 6: Results of our user study comparing our system to manual renderer parametrization made by artists. Each judge is asked to identify which rendering is the most realistic compared to a real reference image.

Both renderings valid            19.68%
Only artists rendering valid     31.80%
Only our rendering valid         48.52%

Results indicate that in 48.5% of cases our system outperforms a manual rendering parametrization, while it performs equally in 19.7% of the labeled examples. However, for 31.8% of the images, our system failed to produce a realistic rendering while an artist could manually obtain a convincing result. In particular, our framework seems to fail to correctly model very dark lipsticks, which were not encountered in our training distribution. This study tends to demonstrate that our system can also be used to help artists create more realistic renderings, by accelerating the currently manual rendering parametrization using example images.

8.4. Inference Speed

One of the advantages of our hybrid method combining deep learning and classical computer graphics is that it does not use neural rendering at inference. Indeed, a commonly recognized challenge of generative methods is that they cannot be deployed for real-time video applications. In this section, we profile and report the inference speed of our inverse graphics encoder and our lipstick renderer. Our trained models of the inverse graphics encoder and lip detection are converted from TensorFlow to NCNN [Ten18] and TensorFlow.js to make them runnable on mobile platforms and mobile web browsers. As shown in Table 7, our method is able to achieve real-time speed even on mobile platforms (iPhone8 Plus, Safari). Furthermore, the slow inference speed of our learned differentiable renderer confirms that current mobile device hardware does not allow the use of generative networks for real-time video applications in the browser. This reinforces the usefulness of our approach compared to purely generative models such as MichiGAN [TCC∗20].
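The timing protocol of Table 7 (skip warm-up frames, then average) can be sketched as below; the actual in-browser measurements are made in JavaScript, so this Python version is only illustrative:

```python
import time

def profile_stage(run_stage, frames, warmup: int = 100, measured: int = 500) -> float:
    """Average per-frame latency in ms for one pipeline stage, following the
    protocol of Table 7: skip the first `warmup` frames, then average over the
    next `measured` frames. `run_stage` is any callable that processes a single
    frame (e.g., encoder inference, landmark detection, or rendering)."""
    timings = []
    for idx, frame in enumerate(frames):
        start = time.perf_counter()
        run_stage(frame)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if idx >= warmup:
            timings.append(elapsed_ms)
        if len(timings) == measured:
            break
    return sum(timings) / max(len(timings), 1)
```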


Table 7: Profiling results of our graphics lipstick rendering pipeline on mobile devices in the Safari web browser. To get accurate results, we skip the first 100 frames and average the results of the next 500 frames for each device.

Device         Inverse Encoder   Landmarks Detection   Rendering & Display   Learned Differentiable Renderer (Imitator)
iPhone8 Plus   26.98ms           38.50ms               52.91ms               835.66ms
iPhoneX        27ms              38.46ms               57.57ms               841.78ms

9. Conclusion

In this paper, we presented a novel framework for real-time virtual try-on from a single example image. Our method is based on a hybrid approach combining the advantages of neural rendering and classical computer graphics. We proposed to train an inverse graphics encoder module that learns to map an example image to the parameter space of a renderer. The estimated parameters are then passed to the computer graphics renderer module, which can render the extracted appearance in real-time on mobile devices in high resolution.

Finally, we introduced a learned differentiable imitator module, which relaxes the need for a differentiable renderer in inverse graphics problems. This imitator approach could be useful for most inverse graphics tasks, in particular when based on renderers using non-differentiable operations such as path-tracing.

Our framework can be easily adapted to other AR renderers since it uses a self-supervised approach that does not require a labeled training set, but only access to a parametrized renderer. We illustrated the performance of our framework on two popular virtual try-on categories, makeup and hair color. Furthermore, we believe that our method could be applied to many other augmented reality problems, in particular when the object of interest is of homogeneous texture and color. Thus, as future work, our framework could be extended to other virtual try-on categories such as glasses, nail polish, or hats.

References

[ACB17] ARJOVSKY M., CHINTALA S., BOTTOU L.: Wasserstein generative adversarial networks. In ICML (2017), pp. 214–223.
[ADSDA16] AZEVEDO P., DOS SANTOS T. O., DE AGUIAR E.: An augmented reality virtual glasses try-on system. In 2016 XVIII Symposium on Virtual and Augmented Reality (SVR) (2016), IEEE.
[BGR∗20] BAZAREVSKY V., GRISHCHENKO I., RAVEENDRAN K., ZHU T., ZHANG F., GRUNDMANN M.: BlazePose: On-device real-time body pose tracking. CVPR Workshop on Vision for AR/VR (2020). arXiv:2006.10204.
[BKV∗19] BAZAREVSKY V., KARTYNNIK Y., VAKUNOV A., RAVEENDRAN K., GRUNDMANN M.: BlazeFace: Sub-millisecond neural face detection on mobile GPUs. In CVPR Workshop on Vision for AR/VR (2019). URL: https://arxiv.org/abs/1907.05047.
[CCK∗18] CHOI Y., CHOI M., KIM M., HA J.-W., KIM S., CHOO J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR (2018).
[CPLC21] CHOI S., PARK S., LEE M., CHOO J.: VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In CVPR (2021).
[CXM∗20] CHU M., XIE Y., MAYER J., LEAL-TAIXÉ L., THUEREY N.: Learning temporal coherence via self-supervision for GAN-based video generation. TOG (2020).
[DAD∗18] DESCHAINTRE V., AITTALA M., DURAND F., DRETTAKIS G., BOUSSEAU A.: Single-image SVBRDF capture with a rendering-aware deep network. TOG 37, 4 (2018).
[DAP∗19] DUKE B., AHMED A., PHUNG E., KEZELE I., AARABI P.: Nail polish try-on: Realtime semantic segmentation of small objects for native and browser smartphone AR applications. CVPR Workshop on Vision for AR/VR (2019). arXiv:1906.02222.
[GAA∗17] GULRAJANI I., AHMED F., ARJOVSKY M., DUMOULIN V., COURVILLE A. C.: Improved training of Wasserstein GANs. In NIPS (2017), pp. 5767–5777.
[GPAM∗20] GOODFELLOW I., POUGET-ABADIE J., MIRZA M., XU B., WARDE-FARLEY D., OZAIR S., COURVILLE A., BENGIO Y.: Generative adversarial networks. Communications of the ACM 63, 11 (2020), 139–144.
[GSG∗21] GE C., SONG Y., GE Y., YANG H., LIU W., LUO P.: Disentangled cycle consistency for highly-realistic virtual try-on. In CVPR (2021).
[GVS12] GOMES J., VELHO L., SOUSA M. C.: Computer graphics: theory and practice. CRC Press, 2012.
[HDMR21] HENZLER P., DESCHAINTRE V., MITRA N. J., RITSCHEL T.: Generative modelling of BRDF textures from flash images. URL: http://arxiv.org/abs/2102.11861.
[IZZE17] ISOLA P., ZHU J. Y., ZHOU T., EFROS A. A.: Image-to-image translation with conditional adversarial networks. CVPR (2017), 5967–5976. arXiv:1611.07004, doi:10.1109/CVPR.2017.632.
[JB17] JETCHEV N., BERGMANN U.: The conditional analogy GAN: Swapping fashion articles on people images. In ICCV Workshops (2017), pp. 2287–2292.
[JLG∗20] JIANG W., LIU S., GAO C., CAO J., HE R., FENG J., YAN S.: PSGAN: Pose and expression robust spatial-aware GAN for customizable makeup transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).
[KBM∗20] KATO H., BEKER D., MORARIU M., ANDO T., MATSUOKA T., KEHL W., GAIDON A.: Differentiable rendering: A survey. arXiv preprint arXiv:2006.12057 (2020).
[KCP∗21] KIM T., CHUNG C., PARK S., GU G., NAM K., CHOE W., LEE J., CHOO J.: K-hairstyle: A large-scale Korean hairstyle dataset for virtual hair editing and hairstyle classification. arXiv preprint arXiv:2102.06288 (2021).
[KJB∗21] KIPS R., JIANG R., BA S., PHUNG E., AARABI P., GORI P., PERROT M., BLOCH I.: Deep graphics encoder for real-time video makeup synthesis from example. In CVPR Workshop CVFAD (2021).
[KLA19] KARRAS T., LAINE S., AILA T.: A style-based generator architecture for generative adversarial networks. In CVPR (2019).
[KLL16] KIM J., LEE J. K., LEE K. M.: Accurate image super-resolution using very deep convolutional networks. In CVPR (2016), pp. 1646–1654.
[KPGB20] KIPS R., PERROT M., GORI P., BLOCH I.: CA-GAN: Weakly supervised color aware GAN for controllable makeup transfer. In ECCV Workshop AIM (2020).
[KRH21] KOPF J., RONG X., HUANG J.-B.: Robust consistent video depth estimation. CVPR (2021).
[KS14] KAZEMI V., SULLIVAN J.: One millisecond face alignment with an ensemble of regression trees. In CVPR (2014), pp. 1867–1874.
[KW14] KINGMA D. P., WELLING M.: Auto-encoding variational Bayes. ICLR (2014).
[KWKT15] KULKARNI T. D., WHITNEY W. F., KOHLI P., TENENBAUM J. B.: Deep convolutional inverse graphics network. In NIPS (2015).


[LADL18] LI T. M., AITTALA M., DURAND F., LEHTINEN J.: Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics, SIGGRAPH Asia 37, 6 (2018). doi:10.1145/3272127.3275109.
[LBB∗17] LI T., BOLKART T., BLACK M. J., LI H., ROMERO J.: Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, SIGGRAPH Asia 36, 6 (2017), 194:1–194:17. URL: https://doi.org/10.1145/3130800.3130813.
[LCP∗18] LEVINSHTEIN A., CHANG C., PHUNG E., KEZELE I., GUO W., AARABI P.: Real-time deep hair matting on mobile devices. In 2018 15th Conference on Computer and Robot Vision (CRV) (2018), IEEE, pp. 1–7.
[LHK∗20] LAINE S., HELLSTEN J., KARRAS T., SEOL Y., LEHTINEN J., AILA T.: Modular primitives for high-performance differentiable rendering. TOG 39 (2020). arXiv:2011.03277, doi:10.1145/3414685.3417861.
[LPDK19] LI T., PHUNG E., DUKE B., KEZELE I.: Lightweight real-time makeup try-on in mobile browsers with tiny CNN models for facial tracking. CVPR Workshop on Vision for AR/VR (2019), 1–4.
[LQD∗18] LI T., QIAN R., DONG C., LIU S., YAN Q., ZHU W., LIN L.: BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network. In ACMMM (2018).
[LTZQ19] LIU K., TANG W., ZHOU F., QIU G.: Spectral regularization for combating mode collapse in GANs. In ICCV (2019), pp. 6382–6390.
[LYP∗19] LI T., YU Z., PHUNG E., DUKE B., KEZELE I., AARABI P.: Lightweight real-time makeup try-on in mobile browsers with tiny CNN models for facial tracking. CVPR Workshop on Vision for AR/VR (2019).
[NTH21] NGUYEN T., TRAN A., HOAI M.: Lipstick ain't enough: Beyond color matching for in-the-wild makeup transfer. In CVPR (2021).
[PEL∗21] PANDEY R., ESCOLANO S. O., LEGENDRE C., HÄNE C., BOUAZIZ S., RHEMANN C., DEBEVEC P., FANELLO S.: Total relighting: Learning to relight portraits for background replacement. TOG 40, 4 (2021), 1–21.
[PWS∗21] PATASHNIK O., WU Z., SHECHTMAN E., COHEN-OR D., LISCHINSKI D.: StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2021), pp. 2085–2094.
[SDS∗21] SAHA R., DUKE B., SHKURTI F., TAYLOR G. W., AARABI P.: LOHO: Latent optimization of hairstyles via orthogonalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 1984–1993.
[SF19] SONG S., FUNKHOUSER T.: Neural illumination: Lighting prediction for indoor environments. In CVPR (2019), pp. 6918–6926.
[Shi20] Fast and robust face-to-parameter translation for game character auto-creation. AAAI 34, 02 (2020), 1733–1740. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5537, doi:10.1609/aaai.v34i02.5537.
[SK19] SOKAL K., KIBALCHICH I.: High-quality AR lipstick simulation via image filtering techniques. CVPR Workshop on Vision for AR/VR (2019).
[SK21] SOMANATH G., KURZ D.: HDR environment map estimation for real-time augmented reality. In CVPR (2021).
[SYF∗19] SHI T., YUAN Y., FAN C., ZOU Z., SHI Z., LIU Y.: Face-to-parameter translation for game character auto-creation. In ICCV (2019). doi:10.1109/ICCV.2019.00025.
[TB19] TKACHENKA A., BAZAREVSKY V.: Real-time hair segmentation and recoloring on mobile GPUs. CVPR Workshop on Vision for AR/VR (2019).
[TCC∗20] TAN Z., CHAI M., CHEN D., LIAO J., CHU Q., YUAN L., TULYAKOV S., YU N.: MichiGAN: Multi-input-conditioned hair image generation for portrait editing. ACM Transactions on Graphics (TOG) 39, 4 (2020), 95–1.
[TDKP21] THIMONIER H., DESPOIS J., KIPS R., PERROT M.: Learning long term style preserving blind video temporal consistency. In ICME (2021).
[Ten18] TENCENT: NCNN, high-performance neural network inference framework optimized for the mobile platform. https://github.com/Tencent/ncnn, 2018.
[TFT∗20] TEWARI A., FRIED O., THIES J., SITZMANN V., LOMBARDI S., SUNKAVALLI K., MARTIN-BRUALLA R., SIMON T., SARAGIH J., NIESSNER M., ET AL.: State of the art on neural rendering. Computer Graphics Forum 39, 2 (2020), 701–727.
[WMB∗20] WANG J., MUELLER F., BERNARD F., SORLI S., SOTNYCHENKO O., QIAN N., OTADUY M. A., CASAS D., THEOBALT C.: RGB2Hands: Real-time tracking of 3D hand interactions from monocular RGB video. TOG 39, 6 (2020), 1–16.
[WSB03] WANG Z., SIMONCELLI E. P., BOVIK A. C.: Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers (2003).
[WTP17] WOLF L., TAIGMAN Y., POLYAK A.: Unsupervised creation of parameterized avatars. In CVPR (2017), pp. 1539–1547.
[XYH∗21] XIAO C., YU D., HAN X., ZHENG Y., FU H.: SketchHairSalon: Deep sketch-based hair image synthesis. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2021) 40, 6 (2021).
[ZBV∗20] ZHANG F., BAZAREVSKY V., VAKUNOV A., TKACHENKA A., SUNG G., CHANG C. L., GRUNDMANN M.: MediaPipe Hands: On-device real-time hand tracking. CVPR Workshop on Vision for AR/VR (2020). arXiv:2006.10214.
[ZIE∗18] ZHANG R., ISOLA P., EFROS A. A., SHECHTMAN E., WANG O.: The unreasonable effectiveness of deep features as a perceptual metric. In CVPR (2018), pp. 586–595.

