Virtual Try On

1 L'Oréal Research and Innovation, France
2 LTCI, Télécom Paris, Institut Polytechnique de Paris, France
3 Modiface, Canada
4 Sorbonne Université, CNRS, LIP6, France

Abstract
Augmented reality applications have rapidly spread across online retail platforms and social media, allowing consumers to virtually try on a large variety of products, such as makeup, hair dyeing, or shoes. However, parametrizing a renderer to synthesize realistic images of a given product remains a challenging task that requires expert knowledge. While recent work has introduced neural rendering methods for virtual try-on from example images, current approaches are based on large generative models that cannot be used in real-time on mobile devices. This calls for a hybrid method that combines the advantages of computer graphics and neural rendering approaches. In this paper, we propose a novel framework based on deep learning to build a real-time inverse graphics encoder that learns to map a single example image into the parameter space of a given augmented reality rendering engine. Our method leverages self-supervised learning and does not require labeled training data, which makes it extendable to many virtual try-on applications. Furthermore, most augmented reality renderers are not differentiable in practice, due to algorithmic choices or implementation constraints imposed by real-time inference on portable devices. To relax the need for a graphics-based differentiable renderer in inverse graphics problems, we introduce a trainable imitator module. Our imitator is a generative network that learns to accurately reproduce the behavior of a given non-differentiable renderer. We propose a novel rendering sensitivity loss to train the imitator, which ensures that the network learns an accurate and continuous representation for each rendering parameter. Automatically learning a differentiable renderer, as proposed here, could be beneficial for various inverse graphics tasks. Our framework enables novel applications where consumers can virtually try on a novel, unknown product from an inspirational reference image on social media. It can also be used by computer graphics artists to automatically create realistic renderings from a reference product image.
CCS Concepts
• Computing methodologies → Computer vision; Machine learning; Computer graphics;
Figure 1: Our hybrid framework uses a deep learning based inverse graphics encoder that learns to map an example reference image into
the parameter space of a computer graphics renderer. The renderer can then be used to render a virtual try-on in real-time on mobile devices.
We illustrate the performance of our framework on makeup (lipstick and eye shadow) and hair color virtual try-on.
…generative networks that suffer from temporal inconsistencies, and cannot be used to process a video stream in real-time on mobile devices [KJB∗21, LQD∗18].

2. Related works

2.1. Inverse Graphics and Differentiable Rendering
Figure 2: Our inference and training pipeline. The inverse graphics encoder E is trained using a self-supervised approach based on the random sampling of graphics vectors g sent to a renderer R to generate training data. For each synthetic image, a graphics loss function L_graphics enforces E to estimate the corresponding rendering parameters. To provide additional supervision in the image space, our learned imitator module is used to compute a differentiable rendering estimate from the encoder output. A rendering loss function L_rendering is then computed using a perceptual distance between the rendered image and its reconstructed estimate. At inference, the imitator is discarded and we use the original renderer to reach real-time performance.
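To make the pipeline above concrete, the following minimal sketch walks through one self-supervised training iteration. All names (sample_graphics_vector, renderer_R, imitator_I, encoder_E, perceptual_distance) and the toy implementations are ours, standing in for the paper's actual modules; only the structure of the two losses follows the description in the caption.

    import numpy as np

    N_PARAMS = 17  # e.g. the lipstick graphics vector of Table 1

    def sample_graphics_vector(rng):
        """Randomly sample a graphics vector g in a normalized [0, 1] space."""
        return rng.uniform(0.0, 1.0, size=N_PARAMS)

    def renderer_R(image, g):
        """Placeholder for the non-differentiable AR renderer."""
        return np.clip(image + 0.1 * g.mean(), 0.0, 1.0)

    def imitator_I(image, g):
        """Placeholder for the learned differentiable imitator."""
        return np.clip(image + 0.1 * g.mean(), 0.0, 1.0)

    def encoder_E(rendered):
        """Placeholder inverse graphics encoder: image -> graphics vector."""
        return np.full(N_PARAMS, rendered.mean())

    def perceptual_distance(a, b):
        """Placeholder for a deep-feature perceptual distance."""
        return float(np.mean((a - b) ** 2))

    def encoder_training_losses(portrait, rng):
        g = sample_graphics_vector(rng)        # randomly sampled ground-truth parameters
        rendered = renderer_R(portrait, g)     # synthetic training image
        g_hat = encoder_E(rendered)            # encoder estimate of the parameters
        loss_graphics = float(np.mean((g_hat - g) ** 2))       # parameter-space loss
        reconstructed = imitator_I(portrait, g_hat)             # differentiable rendering estimate
        loss_rendering = perceptual_distance(rendered, reconstructed)
        return loss_graphics, loss_rendering

    rng = np.random.default_rng(0)
    portrait = rng.uniform(0.0, 1.0, size=(256, 256, 3))
    print(encoder_training_losses(portrait, rng))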
2.2. Augmented Reality Renderers

Augmented reality renderers are a particular category of computer graphics pipelines whose objective is to realistically synthesize an object in an image or video of a real scene. In general, AR renderers are composed of one or several scene perception modules, whose role is to estimate relevant scene information from the source image, which is then passed to a rendering module. For instance, many portrait-based AR applications rely on facial landmark estimation [KS14] to compute the position of a synthetic object, such as glasses, that is then blended onto the face image [ADSDA16]. Similarly, other popular scene perception methods for augmented reality focus on hand tracking [WMB∗20, ZBV∗20], body pose estimation [BGR∗20], hair segmentation [LCP∗18], or scene depth estimation [KRH21]. Furthermore, most augmented reality applications target video-based problems and deployment on mobile devices. For this reason, reaching real-time performance with limited computation resources is usually an important focus of research in this field, as illustrated in [BKV∗19, BGR∗20, KRH21, LPDK19]. Many AR applications focus on virtual try-on tasks in order to enhance the consumer experience in digital stores. Popular applications introduce virtual try-on for lipstick [SK19], hair color [TB19], or nail polish [DAP∗19], reaching realistic results in real-time on mobile devices. However, for such methods, the renderer needs to be manually parametrized by an artist to obtain a realistic rendering of a targeted product. Users are restricted to selecting a product within a pre-defined range, and cannot virtually try a novel product from a given new reference image.

2.3. Image-Based Virtual Try-On

While conventional AR renderers render objects that are previously created by computer graphics artists, a recent research direction based on neural rendering has proposed the novel task of image-based virtual try-on. The objective consists in extracting a product appearance from a given reference image and realistically synthesizing this product in the image of another person. Most methods in this field follow a similar approach, training a neural network that encodes the target product features into a latent space. Then, this product representation is decoded and rendered on the source image using generative models such as generative adversarial networks (GANs) [GPAM∗20] or variational auto-encoders (VAEs) [KW14]. In particular, this idea has been successfully used for makeup transfer [LQD∗18, JLG∗20] and hair synthesis [SDS∗21, KCP∗21], and is rapidly emerging in the field of fashion articles [JB17]. Recent methods attempt to provide controllable rendering [KPGB20], or propose to leverage additional scene information in their models, such as segmentation masks for fashion items [CPLC21, GSG∗21] or UV maps for makeup [NTH21]. Another category of models leverages user input instead of an example image to control the image synthesis. In particular, for hair virtual try-on, the model from [XYH∗21] uses sketch input to explicitly control the hairstyle, while StyleCLIP [PWS∗21] uses text input to control hair and makeup attributes in an image.

However, current methods for virtual try-on from example images suffer from several limitations. First, these neural rendering methods are based on large generative models that cannot be used to produce high-resolution renderings in real-time on mobile devices. Furthermore, such models often lead to poor results when used on videos, since generative models are known to produce temporal inconsistency artifacts. Even though recent works attempt to address this issue by training post-processing models [CXM∗20, TDKP21], they cannot be used in real-time. More recently, an inverse graphics approach has been introduced for makeup [KJB∗21], bringing interesting perspectives for reaching real-time image-based virtual try-on. In this paper, we build on this approach and propose a more robust and general framework.
3. Framework Overview
Table 1: Main parameters used in the makeup rendering process. The complete rendering process includes a total of 17 parameters in order to achieve more sophisticated lipstick looks, including sparkles.

Parameter                        Range
Makeup opacity                   [0, 1]
R, G, B                          [0, 255]
Amount of gloss on the makeup    [0, +∞)
Gloss roughness                  [0, 1]
Reflection intensity             [0, 1]
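For illustration only, a lipstick graphics vector g following the ranges of Table 1 could be stored as a plain dictionary before being normalized for the encoder; the key names and the normalization shown here are our assumptions, not the renderer's actual interface.

    # Hypothetical lipstick parameters, each within the ranges of Table 1.
    lipstick_g = {
        "opacity": 0.85,               # [0, 1]
        "rgb": (142, 30, 45),          # [0, 255] per channel
        "gloss_amount": 2.5,           # [0, +inf)
        "gloss_roughness": 0.30,       # [0, 1]
        "reflection_intensity": 0.60,  # [0, 1]
    }

    # Map the color channels to [0, 1] so every bounded parameter lives in the
    # same range (an assumption; the paper does not specify the normalization).
    normalized_rgb = tuple(c / 255.0 for c in lipstick_g["rgb"])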
Table 2: Parameters used in the hair color rendering process. The re-coloring step uses swatch parameters to modify the color of a reference swatch image, which is then used as a target for histogram matching. Non-swatch parameters affect post-histogram-matching steps in the rendering process. The blend effect refers to applying a faded look, and its parameter controls the gradient offset from the top of the hair.

Parameter                Range
Swatch parameters
  Brightness             [−0.3, 0.3]
  Contrast               [0.5, 2]
  Exposure               [0.5, 2]
  Gamma                  [0.5, 3]
  Hue                    [0, 1]
  Saturation             [0, 3]
Non-swatch parameters
  Blend effect           [−1, 1]
  Intensity              [0, 1]
  Shine                  [0, 1]
…shade matching process to render hair color. In the rendering process, a hair segmentation model first estimates a hair mask from the input image. The detected hair region is then transformed so that its histogram matches the histogram of the re-colored swatch image. Finally, shine and contrast effects boost global and local contrast to improve texture, and a blending effect can add an optional faded look. In this paper, we consider both swatch and non-swatch parameters, as they both affect all steps of the rendering process.
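As a rough sketch of the re-coloring step described above, the masked hair region can be histogram-matched against an adjusted swatch. The swatch adjustment shown here (brightness and gamma only) and all function names are simplifications of ours; the actual renderer applies the full set of swatch and non-swatch parameters of Table 2.

    import numpy as np

    def match_channel(source, reference):
        """Monotonically remap 'source' values so their histogram matches 'reference'."""
        s_values, s_counts = np.unique(source.ravel(), return_counts=True)
        r_values, r_counts = np.unique(reference.ravel(), return_counts=True)
        s_cdf = np.cumsum(s_counts) / source.size
        r_cdf = np.cumsum(r_counts) / reference.size
        return np.interp(source, s_values, np.interp(s_cdf, r_cdf, r_values))

    def recolor_hair(image, hair_mask, swatch, brightness=0.0, gamma=1.0):
        """Match the histogram of the masked hair region to an adjusted swatch image."""
        swatch = np.clip(swatch + brightness, 0.0, 1.0) ** gamma  # simplified swatch edit
        out = image.copy()
        for c in range(3):  # per-channel histogram matching of the hair pixels only
            out[..., c][hair_mask] = match_channel(image[..., c][hair_mask], swatch[..., c])
        return np.clip(out, 0.0, 1.0)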
5. Learned Imitator Module

5.1. Imitator motivation

Our objective is to build an inverse graphics encoder that can be used to perform virtual try-on from a single example image, in real-time on mobile devices, using a conventional, non-differentiable AR renderer. When training such a model, the ground-truth rendering parameters are known during the encoder training and can be used to supervise the encoder directly, as in [KJB∗21]. However, this assumes that a distance in the space of graphics parameters is a good measure of appearance. This distance can be misleading, as some parameters (e.g. RGB for makeup) have a very large contribution to the rendering result, while other parameters (e.g. gloss roughness) have a more limited impact. Similarly, reaching a realistic shine effect for a given product often requires very high accuracy in the setting of the shine parameters, where slight variations can lead to perceptually very different textures. Thus, the graphics loss function, computed in the space of rendering parameters, does not provide an optimal signal for supervising the inverse graphics encoder training. Therefore, we propose to use an additional rendering loss function in the image space, by leveraging an imitator module and a perceptual distance. In this section, we detail the first step of our framework, which consists in training a differentiable imitator module that will later be used to train our inverse graphics encoder. Our imitator takes the form of a generative neural network that automatically learns to reproduce the behavior of a given renderer. This network can then be used as a differentiable surrogate of the initial renderer for solving various inverse graphics problems.

The idea of using a generative network to build a differentiable estimate of a renderer was first proposed in the field of automatic avatar creation [WTP17, SYF∗19, Shi20]. However, compared to fixed-camera avatar renderers, AR renderers are more complex functions that are usually composed of computer vision and computer graphics modules. Previous renderer imitator methods directly apply the conventional generative adversarial network method for image-to-image translation established by [IZZE17]. This approach consists in training the imitator to reproduce the renderer output by minimizing a perceptual loss between the renderer and imitator outputs on a set of example rendering images. However, this method does not leverage the specificity of the renderer imitation problem, in which the training phase does not depend on a fixed set of training images, since these can be dynamically generated using the original renderer.

We propose to leverage this advantage to introduce a more constrained formulation of the imitator problem, based on a novel rendering sensitivity loss function. This additional loss term is motivated by two observations. First, the imitator network is required to accurately model each of the renderer parameters. However, this is not explicitly enforced by the conventional imitator approach, where parameters that only impact a small portion of the rendered images have a limited weight in the perceptual loss function. Second, in order to accurately solve inverse graphics problems, the imitator needs to learn a continuous representation for each parameter, where a shift in a given parameter leads to changes in the rendered image that are similar to the ones obtained using the actual AR renderer. This is particularly challenging, as generative networks are known for the difficulties they encounter in accurately modeling the entire training data distribution (i.e., the mode collapse problem) [LTZQ19, GAA∗17, ACB17]. Our sensitivity loss, which provides an answer to these problems, is inspired by the expression of the finite difference and forces the imitator to learn a rendering function whose derivative with respect to the rendering parameters matches the approximated derivative of the non-differentiable renderer. For this reason, this loss does not operate in the image space, like the rendering loss, but in the image-difference space. Thus, when sampling rendering parameters, the sensitivity loss forces the generator to modify the same pixels as the non-differentiable renderer, and in the same proportion.
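In our own notation (this relation only restates the idea above and is not an equation from the paper): for a portrait X, a graphics vector g, and a shift \delta applied along the j-th parameter direction e_j, the imitator is encouraged to satisfy, for every parameter j,

    I(X, g + \delta e_j) - I(X, g) \;\approx\; R(X, g + \delta e_j) - R(X, g),

which is the finite-difference counterpart of requiring \partial I / \partial g_j \approx \partial R / \partial g_j.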
Figure 4: The training procedure of our differentiable imitator module I that learns to reproduce the behavior of the renderer R. The imitation loss function L_imitation enforces perceptual similarity between R and I outputs on randomly sampled graphics parameter vectors g. The sensitivity loss function L_sens ensures that a random shift in any dimension of the graphics parameter vector is correctly modeled by the imitator.
5.2. Imitator objective functions

In this section, we detail the training procedure of our imitator model, illustrated in Figure 4. Our objective is to train an imitator network I that learns to reproduce the behavior of the renderer R, and for which derivatives with respect to g can be computed. We generate training data by randomly sampling n graphics vectors g_i, i = 1…n, and render them through R with a randomly associated portrait image X_i. We propose to train I using a combination of two loss functions. First, we use an imitation loss function that enforces the imitator network to produce outputs that are perceptually similar to those of the renderer for a given image X_i and graphics vector g_i. The perceptual similarity is computed using a perceptual distance based on deep features [KLL16], and can be written as follows:
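As a sketch of this objective in our notation (restating the description above rather than reproducing the paper's exact equation), with P denoting the perceptual distance and n the number of sampled training pairs:

    \mathcal{L}_{\text{imitation}} = \frac{1}{n} \sum_{i=1}^{n} P\big( R(X_i, g_i),\, I(X_i, g_i) \big)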
The sensitivity loss function can then be written as follows, where g'_{i,j} denotes the graphics vector g_i with a random shift applied to one of its dimensions (cf. Figure 4):

    \mathcal{L}_{\text{sens}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\| \big[ R(X_i, g_i) - R(X_i, g'_{i,j}) \big] - \big[ I(X_i, g_i) - I(X_i, g'_{i,j}) \big] \big\|_2^2

Finally, we use the conventional adversarial GAN loss function, where D is a discriminator module trained with the gradient penalty loss from [GAA∗17]:

    \mathcal{L}_{\text{GAN}} = - \frac{1}{n} \sum_{i=1}^{n} D\big( I(X_i, g_i) \big) \qquad (3)
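For completeness, the listing below gives a compact, self-contained sketch of the three imitator objectives. The toy renderer, imitator, discriminator, and perceptual distance are placeholders of ours (the real modules are neural networks), and the random per-dimension shift stands in for the g'_{i,j} sampling described above.

    import numpy as np

    rng = np.random.default_rng(0)
    N_PARAMS, N_SHIFTS = 17, 4  # illustrative sizes (m random shifts per sample)

    # Placeholder renderer, imitator, discriminator and perceptual distance.
    def R(x, g): return np.clip(x + 0.1 * g.mean(), 0.0, 1.0)
    def I(x, g): return np.clip(x + 0.1 * g.mean() + 0.01 * g[0], 0.0, 1.0)
    def D(img):  return float(img.mean())                    # stand-in discriminator score
    def pdist(a, b): return float(np.mean((a - b) ** 2))     # stand-in perceptual distance

    def imitator_losses(images, gs):
        n = len(images)
        l_imit = np.mean([pdist(R(x, g), I(x, g)) for x, g in zip(images, gs)])
        l_sens = 0.0
        for x, g in zip(images, gs):
            for _ in range(N_SHIFTS):
                g_shift = g.copy()
                j = rng.integers(N_PARAMS)
                g_shift[j] = rng.uniform(0.0, 1.0)   # random shift of dimension j
                dr = R(x, g) - R(x, g_shift)         # renderer image difference
                di = I(x, g) - I(x, g_shift)         # imitator image difference
                l_sens += np.mean((dr - di) ** 2)    # mean squared gap (the paper uses a squared L2 norm)
        l_sens /= n
        l_gan = -np.mean([D(I(x, g)) for x, g in zip(images, gs)])
        return l_imit, l_sens, l_gan

    images = [rng.uniform(0, 1, size=(64, 64, 3)) for _ in range(2)]
    gs = [rng.uniform(0, 1, size=N_PARAMS) for _ in range(2)]
    print(imitator_losses(images, gs))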
Figure 8: Qualitative comparison of our approach against neural rendering methods for example-based makeup virtual try-on. Our model achieves more realistic results with high resolution. Furthermore, using an imitator allows us to achieve better shine modeling than in MIG [KJB∗21].
Figure 9: Comparison against generative-based methods for hair color virtual try-on. In addition to reaching real-time on mobile devices, our model produces more realistic hair color results.

Figure 10: Qualitative ablation study on synthetic data. These results confirm that training the inverse graphics encoder using an imitator based on the sensitivity loss leads to improved realism in image-based virtual try-on.
Table 5: Quantitative evaluation of the makeup transfer performance using a dataset of ground-truth triplet images.

Model                 PSNR ↑ (mean ± std)    SSIM ↑ (mean ± std)    Perceptual dist. ↓ (mean ± std)
BeautyGAN [LQD∗ 18] 17.44 ± 3.43 0.609 ± 0.094 0.093 ± 0.018
CA-GAN [KPGB20] 17.92 ± 2.93 0.621 ± 0.033 0.077 ± 0.019
PSGAN [JLG∗ 20] 16.11 ± 2.42 0.360 ± 0.098 0.062 ± 0.018
CPM [NTH21] 17.87 ± 3.65 0.655 ± 0.089 0.065 ± 0.022
MIG [KJB∗ 21] 17.82 ± 2.80 0.663 ± 0.096 0.062 ± 0.016
Ours 18.35 ± 2.63 0.672 ± 0.100 0.060 ± 0.016
…each of the considered lipsticks using our models, computing the median when multiple images per lipstick are available. The remaining 500 images are kept for validation.

Using this dataset, we conduct a user study with six human evaluators. Each of them is presented with an image from the validation set and the two associated renderings of the same lipstick, based on the artist's rendering parameters and on the parameters estimated by our model. Each rendering image is randomly denoted as rendering A or B to limit bias in the evaluation. Each evaluator must choose among the categories "both renderings are valid", "only rendering A is valid", "only rendering B is valid", and "both renderings are invalid". All images are labeled by three different evaluators. Finally, we removed images for which a majority vote was not reached among the evaluators (19%). We also removed images where both renderings were considered unrealistic (14%), assuming this was due more to the renderer limitations than to an inaccurate rendering parametrization. The results of this experiment are presented in Table 6.

Table 6: Results of our user study comparing our system to manual renderer parametrization made by artists. Each judge is asked to identify which rendering is the most realistic compared to a real reference image.

Both renderings valid           19.68%
Only artists' rendering valid   31.80%
Only our rendering valid        48.52%

Results indicate that in 48.5% of cases our system outperforms a manual rendering parametrization, while it performs equally in 19.7% of the labeled examples. However, for 31.8% of the images, our system failed to produce a realistic rendering while an artist could manually obtain a convincing result. In particular, our framework seems to fail to correctly model very dark lipsticks, which were not encountered in our training distribution. This study tends to demonstrate that our system can also be used to help artists create more realistic renderings, by accelerating the currently manual rendering parametrization using example images.

8.4. Inference Speed

One of the advantages of our hybrid method combining deep learning and classical computer graphics is that it does not use neural rendering at inference. Indeed, a commonly recognized challenge of generative methods is that they cannot be deployed for real-time video applications. In this section, we profile and report the inference speed of our inverse graphics encoder and our lipstick renderer. Our trained models for the inverse graphics encoder and lip detection are converted from TensorFlow to NCNN [Ten18] and TensorFlow.js to make them runnable on mobile platforms and mobile web browsers. As shown in Table 7, our method is able to achieve real-time speed even on mobile platforms (iPhone 8 Plus, Safari). Furthermore, the slow inference speed of our learned differentiable renderer confirms that current mobile device hardware does not allow the use of generative networks for real-time video applications in the browser. This reinforces the usefulness of our approach compared to purely generative models such as MichiGAN [TCC∗20].
Table 7: Profiling results of our graphics lipstick rendering pipeline on mobile devices in the Safari web browser. To get accurate results, we skip the first 100 frames and average the results of the next 500 frames for each device.

Device          Inverse Encoder   Landmarks Detection   Rendering & Display   Learned Differentiable Renderer (Imitator)
iPhone 8 Plus   26.98 ms          38.50 ms              52.91 ms              835.66 ms
iPhone X        27.00 ms          38.46 ms              57.57 ms              841.78 ms
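A minimal sketch of the per-stage profiling protocol from the caption of Table 7; the stage function and frame source are hypothetical, and the actual measurements were taken in the Safari web browser rather than in Python.

    import time

    def mean_latency_ms(stage_fn, frames, warmup=100, measured=500):
        """Skip the first `warmup` frames, then average the latency of the next `measured`."""
        timings = []
        for i, frame in enumerate(frames[: warmup + measured]):
            start = time.perf_counter()
            stage_fn(frame)
            if i >= warmup:
                timings.append(time.perf_counter() - start)
        return 1000.0 * sum(timings) / len(timings)

    # Usage with a dummy stage: 600 synthetic frames of 1000 pixels each.
    print(mean_latency_ms(lambda frame: sum(frame), [[0.0] * 1000] * 600))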
9. Conclusion

In this paper, we present a novel framework for real-time virtual try-on from example images. Our method is based on a hybrid approach combining the advantages of neural rendering and classical computer graphics. We proposed to train an inverse graphics encoder module that learns to map an example image to the parameter space of a renderer. The estimated parameters are then passed to the computer graphics renderer module, which can render the extracted appearance in real-time on mobile devices in high resolution.

Finally, we introduced a learned differentiable imitator module which relaxes the need for a differentiable renderer in inverse graphics problems. This imitator approach could be useful for most inverse graphics tasks, in particular when based on renderers using non-differentiable operations such as path tracing.

Our framework can be easily adapted to other AR renderers since it uses a self-supervised approach that does not require a labeled training set, but only access to a parametrized renderer. We illustrated the performance of our framework on two popular virtual try-on categories, makeup and hair color. Furthermore, we believe that our method could be applied to many other augmented reality problems, in particular when the object of interest is of homogeneous texture and color. Thus, as future work, our framework could be extended to other virtual try-on categories such as glasses, nail polish, or hats.
References

[ACB17] Arjovsky M., Chintala S., Bottou L.: Wasserstein generative adversarial networks. In ICML (2017), pp. 214–223.
[ADSDA16] Azevedo P., Dos Santos T. O., De Aguiar E.: An augmented reality virtual glasses try-on system. In 2016 XVIII Symposium on Virtual and Augmented Reality (SVR) (2016), IEEE.
[BGR∗20] Bazarevsky V., Grishchenko I., Raveendran K., Zhu T., Zhang F., Grundmann M.: BlazePose: On-device real-time body pose tracking. CVPR Workshop on Vision for AR/VR (2020). arXiv:2006.10204.
[BKV∗19] Bazarevsky V., Kartynnik Y., Vakunov A., Raveendran K., Grundmann M.: BlazeFace: Sub-millisecond neural face detection on mobile GPUs. In CVPR Workshop on Vision for AR/VR (2019). URL: https://arxiv.org/abs/1907.05047.
[CCK∗18] Choi Y., Choi M., Kim M., Ha J.-W., Kim S., Choo J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR (2018).
[CPLC21] Choi S., Park S., Lee M., Choo J.: VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In CVPR (2021).
[CXM∗20] Chu M., Xie Y., Mayer J., Leal-Taixé L., Thuerey N.: Learning temporal coherence via self-supervision for GAN-based video generation. TOG (2020).
[DAD∗18] Deschaintre V., Aittala M., Durand F., Drettakis G., Bousseau A.: Single-image SVBRDF capture with a rendering-aware deep network. TOG 37, 4 (2018).
[DAP∗19] Duke B., Ahmed A., Phung E., Kezele I., Aarabi P.: Nail polish try-on: Realtime semantic segmentation of small objects for native and browser smartphone AR applications. CVPR Workshop on Vision for AR/VR (2019). arXiv:1906.02222.
[GAA∗17] Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A. C.: Improved training of Wasserstein GANs. In NIPS (2017), pp. 5767–5777.
[GPAM∗20] Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.: Generative adversarial networks. Communications of the ACM 63, 11 (2020), 139–144.
[GSG∗21] Ge C., Song Y., Ge Y., Yang H., Liu W., Luo P.: Disentangled cycle consistency for highly-realistic virtual try-on. In CVPR (2021).
[GVS12] Gomes J., Velho L., Sousa M. C.: Computer Graphics: Theory and Practice. CRC Press, 2012.
[HDMR21] Henzler P., Deschaintre V., Mitra N. J., Ritschel T.: Generative modelling of BRDF textures from flash images. URL: http://arxiv.org/abs/2102.11861.
[IZZE17] Isola P., Zhu J. Y., Zhou T., Efros A. A.: Image-to-image translation with conditional adversarial networks. CVPR (2017), 5967–5976. arXiv:1611.07004, doi:10.1109/CVPR.2017.632.
[JB17] Jetchev N., Bergmann U.: The conditional analogy GAN: Swapping fashion articles on people images. In ICCV Workshops (2017), pp. 2287–2292.
[JLG∗20] Jiang W., Liu S., Gao C., Cao J., He R., Feng J., Yan S.: PSGAN: Pose and expression robust spatial-aware GAN for customizable makeup transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).
[KBM∗20] Kato H., Beker D., Morariu M., Ando T., Matsuoka T., Kehl W., Gaidon A.: Differentiable rendering: A survey. arXiv preprint arXiv:2006.12057 (2020).
[KCP∗21] Kim T., Chung C., Park S., Gu G., Nam K., Choe W., Lee J., Choo J.: K-Hairstyle: A large-scale Korean hairstyle dataset for virtual hair editing and hairstyle classification. arXiv preprint arXiv:2102.06288 (2021).
[KJB∗21] Kips R., Jiang R., Ba S., Phung E., Aarabi P., Gori P., Perrot M., Bloch I.: Deep graphics encoder for real-time video makeup synthesis from example. In CVPR Workshop CVFAD (2021).
[KLA19] Karras T., Laine S., Aila T.: A style-based generator architecture for generative adversarial networks. In CVPR (2019).
[KLL16] Kim J., Lee J. K., Lee K. M.: Accurate image super-resolution using very deep convolutional networks. In CVPR (2016), pp. 1646–1654.
[KPGB20] Kips R., Perrot M., Gori P., Bloch I.: CA-GAN: Weakly supervised color aware GAN for controllable makeup transfer. In ECCV Workshop AIM (2020).
[KRH21] Kopf J., Rong X., Huang J.-B.: Robust consistent video depth estimation. CVPR (2021).
[KS14] Kazemi V., Sullivan J.: One millisecond face alignment with an ensemble of regression trees. In CVPR (2014), pp. 1867–1874.
[KW14] Kingma D. P., Welling M.: Auto-encoding variational Bayes. ICLR (2014).
[KWKT15] Kulkarni T. D., Whitney W. F., Kohli P., Tenenbaum J. B.: Deep convolutional inverse graphics network. In NIPS (2015).
[LADL18] Li T. M., Aittala M., Durand F., Lehtinen J.: Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics, SIGGRAPH Asia 37, 6 (2018). doi:10.1145/3272127.3275109.
[LBB∗17] Li T., Bolkart T., Black M. J., Li H., Romero J.: Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, SIGGRAPH Asia 36, 6 (2017), 194:1–194:17. URL: https://doi.org/10.1145/3130800.3130813.
[LCP∗18] Levinshtein A., Chang C., Phung E., Kezele I., Guo W., Aarabi P.: Real-time deep hair matting on mobile devices. In 2018 15th Conference on Computer and Robot Vision (CRV) (2018), IEEE, pp. 1–7.
[LHK∗20] Laine S., Hellsten J., Karras T., Seol Y., Lehtinen J., Aila T.: Modular primitives for high-performance differentiable rendering. TOG 39 (2020). arXiv:2011.03277, doi:10.1145/3414685.3417861.
[LPDK19] Li T., Phung E., Duke B., Kezele I.: Lightweight real-time makeup try-on in mobile browsers with tiny CNN models for facial tracking. CVPR Workshop on Vision for AR/VR (2019), 1–4.
[LQD∗18] Li T., Qian R., Dong C., Liu S., Yan Q., Zhu W., Lin L.: BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network. In ACMMM (2018).
[LTZQ19] Liu K., Tang W., Zhou F., Qiu G.: Spectral regularization for combating mode collapse in GANs. In ICCV (2019), pp. 6382–6390.
[LYP∗19] Li T., Yu Z., Phung E., Duke B., Kezele I., Aarabi P.: Lightweight real-time makeup try-on in mobile browsers with tiny CNN models for facial tracking. CVPR Workshop on Vision for AR/VR (2019).
[NTH21] Nguyen T., Tran A., Hoai M.: Lipstick ain't enough: Beyond color matching for in-the-wild makeup transfer. In CVPR (2021).
[PEL∗21] Pandey R., Escolano S. O., Legendre C., Häne C., Bouaziz S., Rhemann C., Debevec P., Fanello S.: Total relighting: Learning to relight portraits for background replacement. TOG 40, 4 (2021), 1–21.
[PWS∗21] Patashnik O., Wu Z., Shechtman E., Cohen-Or D., Lischinski D.: StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2021), pp. 2085–2094.
[SDS∗21] Saha R., Duke B., Shkurti F., Taylor G. W., Aarabi P.: LOHO: Latent optimization of hairstyles via orthogonalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 1984–1993.
[SF19] Song S., Funkhouser T.: Neural illumination: Lighting prediction for indoor environments. In CVPR (2019), pp. 6918–6926.
[Shi20] Fast and robust face-to-parameter translation for game character auto-creation. AAAI 34, 02 (2020), 1733–1740. URL: https://aaai.org/ojs/index.php/AAAI/article/view/5537, doi:10.1609/aaai.v34i02.5537.
[SK19] Sokal K., Kibalchich I.: High-quality AR lipstick simulation via image filtering techniques. CVPR Workshop on Vision for AR/VR (2019).
[SK21] Somanath G., Kurz D.: HDR environment map estimation for real-time augmented reality. In CVPR (2021).
[SYF∗19] Shi T., Yuan Y., Fan C., Zou Z., Shi Z., Liu Y.: Face-to-parameter translation for game character auto-creation. In ICCV (2019). doi:10.1109/ICCV.2019.00025.
[TB19] Tkachenka A., Bazarevsky V.: Real-time hair segmentation and recoloring on mobile GPUs. CVPR Workshop on Vision for AR/VR (2019).
[TCC∗20] Tan Z., Chai M., Chen D., Liao J., Chu Q., Yuan L., Tulyakov S., Yu N.: MichiGAN: Multi-input-conditioned hair image generation for portrait editing. ACM Transactions on Graphics (TOG) 39, 4 (2020), 95–1.
[TDKP21] Thimonier H., Despois J., Kips R., Perrot M.: Learning long term style preserving blind video temporal consistency. In ICME (2021).
[Ten18] Tencent: NCNN, high-performance neural network inference framework optimized for the mobile platform. https://github.com/Tencent/ncnn, 2018.
[TFT∗20] Tewari A., Fried O., Thies J., Sitzmann V., Lombardi S., Sunkavalli K., Martin-Brualla R., Simon T., Saragih J., Niessner M., et al.: State of the art on neural rendering. Computer Graphics Forum 39, 2 (2020), 701–727.
[WMB∗20] Wang J., Mueller F., Bernard F., Sorli S., Sotnychenko O., Qian N., Otaduy M. A., Casas D., Theobalt C.: RGB2Hands: Real-time tracking of 3D hand interactions from monocular RGB video. TOG 39, 6 (2020), 1–16.
[WSB03] Wang Z., Simoncelli E. P., Bovik A. C.: Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers (2003).
[WTP17] Wolf L., Taigman Y., Polyak A.: Unsupervised creation of parameterized avatars. In CVPR (2017), pp. 1539–1547.
[XYH∗21] Xiao C., Yu D., Han X., Zheng Y., Fu H.: SketchHairSalon: Deep sketch-based hair image synthesis. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2021) 40, 6 (2021).
[ZBV∗20] Zhang F., Bazarevsky V., Vakunov A., Tkachenka A., Sung G., Chang C. L., Grundmann M.: MediaPipe Hands: On-device real-time hand tracking. CVPR Workshop on Vision for AR/VR (2020). arXiv:2006.10214.
[ZIE∗18] Zhang R., Isola P., Efros A. A., Shechtman E., Wang O.: The unreasonable effectiveness of deep features as a perceptual metric. In CVPR (2018), pp. 586–595.