Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Elad Richardson1  Yuval Alaluf1,2  Or Patashnik1,2  Yotam Nitzan2  Yaniv Azar1  Stav Shapiro1  Daniel Cohen-Or2
1 Penta-AI   2 Tel-Aviv University

arXiv:2008.00951v2 [cs.CV] 21 Apr 2021

Figure 1. The proposed pixel2style2pixel framework can be used to solve a wide variety of image-to-image translation tasks. Here we show
results of pSp on StyleGAN inversion, multi-modal conditional image synthesis, facial frontalization, inpainting and super-resolution.

Abstract

We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. Next, we propose utilizing our encoder to directly solve image-to-image translation tasks, defining them as encoding problems from some input domain into the latent domain. By deviating from the standard "invert first, edit later" methodology used with previous StyleGAN encoders, our approach can handle a variety of tasks even when the input image is not represented in the StyleGAN domain. We show that solving translation tasks through StyleGAN significantly simplifies the training process, as no adversary is required, has better support for solving tasks without pixel-to-pixel correspondence, and inherently supports multi-modal synthesis via the resampling of styles. Finally, we demonstrate the potential of our framework on a variety of facial image-to-image translation tasks, even when compared to state-of-the-art solutions designed specifically for a single task, and further show that it can be extended beyond the human facial domain. Code is available at https://github.com/eladrich/pixel2style2pixel.

1. Introduction

In recent years, Generative Adversarial Networks (GANs) have significantly advanced image synthesis, particularly on face images. State-of-the-art image generation methods have achieved high visual quality and fidelity, and can now generate images with phenomenal realism. Most notably, StyleGAN [21, 22] proposes a novel style-based generator architecture and attains state-of-the-art visual quality on high-resolution images. Moreover, it has been demonstrated to have a disentangled latent space, W [44, 7, 39], which offers control and editing capabilities.

Recently, numerous methods have shown competence in controlling StyleGAN's latent space and performing meaningful manipulations in W [17, 39, 41, 13]. These methods follow an "invert first, edit later" approach, where one first inverts an image into StyleGAN's latent space and then edits the latent code in a semantically meaningful manner to obtain a new code that is then used by StyleGAN to generate the output image. However, it has been shown that inverting a real image into a 512-dimensional vector w ∈ W does not lead to an accurate reconstruction. Motivated by this, it has become common practice [1, 2, 4, 48, 3] to encode real images into an extended latent space, W+, defined by the concatenation of 18 different 512-dimensional w vectors, one for each input layer of StyleGAN. These works usually resort to per-image optimization over W+, requiring several minutes for a single image. To accelerate this optimization, some methods [4, 48] train an encoder to infer an approximate vector in W+ that serves as a good initial point from which additional optimization proceeds. However, a fast and accurate inversion of real images into W+ remains a challenge.

In this paper, we first introduce a novel encoder architecture tasked with encoding an arbitrary image directly into W+. The encoder is based on a Feature Pyramid Network [26], where style vectors are extracted from different pyramid scales and inserted directly into a fixed, pretrained StyleGAN generator in correspondence to their spatial scales. We show that our encoder can directly reconstruct real input images, allowing one to perform latent space manipulations without requiring time-consuming optimization. While these manipulations allow for extensive editing of real images, they are inherently limited: the input image must be invertible, i.e., there must exist a latent code that reconstructs it. This requirement is a severe limitation for tasks, such as conditional image generation, where the input image does not reside in the StyleGAN domain. To overcome this limitation, we propose using our encoder together with the pretrained StyleGAN generator as a complete image-to-image translation framework. In this formulation, input images are directly encoded into the desired output latents, which are then fed into StyleGAN to generate the desired output images. This allows one to utilize StyleGAN for image-to-image translation even when the input and output images are not from the same domain.

While many previous approaches to image-to-image translation involve dedicated architectures specific to a single problem, we follow the spirit of pix2pix [16] and define a generic framework able to solve a wide range of tasks, all using the same architecture. Besides simplifying the training process, as no adversary discriminator needs to be trained, using a pretrained StyleGAN generator offers several intriguing advantages over previous works. For example, many image-to-image architectures explicitly feed the generator with residual feature maps from the encoder [16, 43], creating a strong locality bias [37]. In contrast, our generator is governed only by the styles, with no direct spatial input. Another notable advantage of the intermediate style representation is the inherent support for multi-modal synthesis in ambiguous tasks such as image generation from sketches, segmentation maps, or low-resolution images. In such tasks, the generated styles can be resampled to create variations of the output image with no change to the architecture or training process. In a sense, our method performs pixel2style2pixel translation, as every image is first encoded into style vectors and then into an image, and is therefore dubbed pSp.

The main contributions of this paper are: (i) a novel StyleGAN encoder able to directly encode real images into the W+ latent domain; and (ii) a new methodology for utilizing a pretrained StyleGAN generator to solve image-to-image translation tasks.

2. Related Work

GAN Inversion. With the rapid evolution of GANs, many works have tried to understand and control their latent space. A specific task that has received substantial attention is GAN inversion, first introduced by Zhu et al. [49]. In this task, one seeks the latent vector from which a pretrained GAN most accurately reconstructs a given, known image. Motivated by its state-of-the-art image quality and latent space semantic richness, many recent works have used StyleGAN [21, 22] for this task. Generally, inversion methods either directly optimize the latent vector to minimize the error for the given image [27, 8, 1, 2], train an encoder to map the given image to the latent space [35, 8, 36, 12, 33], or use a hybrid approach combining both [4, 48]. Typically, methods performing optimization are superior in reconstruction quality to a learned encoder mapping, but require substantially longer runtimes. Unlike the above methods, our encoder can accurately and efficiently embed a given face image into the extended latent space W+ with no further optimization.

Latent Space Manipulation. Recently, numerous papers have presented diverse methods for learning semantic edits of the latent code. One popular approach is to find linear directions that correspond to changes in a given binary labeled attribute, such as young ↔ old or no-smile ↔ smile [39, 11, 10, 3]. Tewari et al. [41] utilize a pretrained 3DMM to learn semantic face edits in the latent space. Jahanian et al. [17] find latent space paths that correspond to a specific transformation, such as zoom or rotation, in a self-supervised manner.
Figure 2. Our pSp architecture. Feature maps are first extracted using a standard feature pyramid over a ResNet backbone. For each of the 18 target styles, a small mapping network is trained to extract the learned styles from the corresponding feature map, where styles (0-2) are generated from the small feature map, (3-6) from the medium feature map, and (7-18) from the largest feature map. The mapping network, map2style, is a small fully convolutional network, which gradually reduces spatial size using a set of 2-strided convolutions followed by LeakyReLU activations. Each generated 512-dimensional vector is fed into StyleGAN, starting from its matching affine transformation, A.
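For illustration, the following is a minimal PyTorch-style sketch of such a map2style block, assuming a 512-channel input feature map whose spatial size is a power of two; the module name, channel counts, and LeakyReLU slope are assumptions for the sketch, not the released implementation:

```python
import math
import torch.nn as nn

class Map2Style(nn.Module):
    """Sketch of a map2style block: a small fully convolutional network that
    halves the spatial size with 2-strided convolutions + LeakyReLU until a
    single 1x1x512 style vector remains."""
    def __init__(self, in_channels=512, spatial=16, style_dim=512):
        super().__init__()
        layers = []
        channels = in_channels
        for _ in range(int(math.log2(spatial))):  # e.g. 16x16 -> four stride-2 convolutions
            layers += [nn.Conv2d(channels, style_dim, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            channels = style_dim
        self.convs = nn.Sequential(*layers)

    def forward(self, feature_map):
        # (B, C, H, W) -> (B, style_dim, 1, 1) -> (B, style_dim)
        return self.convs(feature_map).flatten(start_dim=1)
```

In the full encoder, one such block would be trained per target style, with styles 0-2 read from the smallest feature map, 3-6 from the medium one, and 7-18 from the largest, matching the grouping described in the caption above.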

Härkönen et al. [13] find useful paths in an unsupervised manner by using the principal component axes of an intermediate activation space. Collins et al. [7] perform local semantic editing by manipulating corresponding components of the latent code. These methods generally follow an "invert first, edit later" procedure, where an image is first embedded into the latent space and its latent is then edited in a semantically meaningful manner. This differs from our approach, which directly encodes input images into the corresponding output latents, allowing one to also handle inputs that do not reside in the StyleGAN domain.

Image-to-Image. Image-to-image translation techniques aim at learning a conditional image generation function that maps an input image of a source domain to a corresponding image of a target domain. Isola et al. [16] first introduced the use of conditional GANs to solve various image-to-image translation tasks. Since then, their work has been extended to many scenarios: high-resolution synthesis [43], unsupervised learning [30, 50, 23, 28], multi-modal image synthesis [51, 14, 6], and conditional image synthesis [34, 25, 31, 52, 5]. The aforementioned works construct dedicated architectures, which require training the generator network and generally do not generalize to other translation tasks. This is in contrast to our method, which uses the same architecture for solving a variety of tasks.

3. The pSp Framework

Our pSp framework builds upon the representative power of a pretrained StyleGAN generator and the W+ latent space. To utilize this representation, one needs a strong encoder that is able to match each input image to an accurate encoding in the latent domain. A simple technique to embed into this domain is to directly encode a given input image into W+ using a single 512-dimensional vector obtained from the last layer of the encoder network, thereby learning all 18 style vectors together. However, such an architecture presents a strong bottleneck, making it difficult to fully represent the finer details of the original image, and is therefore limited in reconstruction quality.

In StyleGAN, the authors have shown that the different style inputs correspond to different levels of detail, which are roughly divided into three groups: coarse, medium, and fine. Following this observation, in pSp we extend an encoder backbone with a feature pyramid [26], generating three levels of feature maps from which styles are extracted using a simple intermediate network, map2style, shown in Figure 2. The styles, aligned with this hierarchical representation, are then fed into the generator in correspondence to their scale to generate the output image, thus completing the translation from input pixels to output pixels through the intermediate style representation. The complete architecture is illustrated in Figure 2.

As in StyleGAN, we further define w̄ to be the average style vector of the pretrained generator. Given an input image, x, the output of our model is then defined as

pSp(x) := G(E(x) + w̄),

where E(·) and G(·) denote the encoder and StyleGAN generator, respectively. In this formulation, our encoder aims to learn the latent code with respect to the average style vector. We find that this results in better initialization.
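As a rough sketch of this formulation (the encoder and generator callables and their calling conventions are placeholders, not the released API):

```python
import torch

def psp_forward(x, encoder, generator, w_avg):
    """pSp(x) := G(E(x) + w_avg): the encoder output is interpreted as an
    offset from the generator's average style vector."""
    codes = encoder(x)                    # (B, 18, 512) predicted offsets in W+
    codes = codes + w_avg.view(1, 1, -1)  # shift by the average style vector
    return generator(codes)               # decoded 1024x1024 output image
```

Only the encoder receives gradient updates during training; the StyleGAN generator and the average style vector remain fixed throughout.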
3.1. Loss Functions

While the style-based translation is the core part of our framework, the choice of losses is also crucial. Our encoder is trained using a weighted combination of several objectives. First, we utilize the pixel-wise L2 loss,

L2(x) = ||x − pSp(x)||2.   (1)

In addition, to learn perceptual similarities, we utilize the LPIPS [46] loss, which has been shown to better preserve image quality [12] compared to the more standard perceptual loss [18]:

LLPIPS(x) = ||F(x) − F(pSp(x))||2,   (2)

where F(·) denotes the perceptual feature extractor.

To encourage the encoder to output latent style vectors closer to the average latent vector, we additionally define the following regularization loss:

Lreg(x) = ||E(x) − w̄||2.   (3)

Similar to the truncation trick introduced in StyleGAN, we find that adding this regularization when training our encoder improves image quality without harming the fidelity of our outputs, especially in some of the more ambiguous tasks explored below.

Finally, a common challenge when handling the specific task of encoding facial images is the preservation of the input identity. To tackle this, we incorporate a dedicated recognition loss measuring the cosine similarity between the output image and its source,

LID(x) = 1 − ⟨R(x), R(pSp(x))⟩,   (4)

where R is the pretrained ArcFace [9] network.

In summary, the total loss function is defined as

L(x) = λ1 L2(x) + λ2 LLPIPS(x) + λ3 LID(x) + λ4 Lreg(x),

where λ1, λ2, λ3, λ4 are constants defining the loss weights. This curated set of loss functions allows for more accurate encoding into StyleGAN compared to previous works and can be easily tuned for different encoding tasks according to their nature. Constants and other implementation details can be found in Appendix A.
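The following sketch shows how these objectives might be combined in training code; lpips_net and id_net stand in for an LPIPS network and the ArcFace feature extractor, and the default weights follow the facial-domain values in Appendix A. This is an illustration, not the released training loop:

```python
import torch
import torch.nn.functional as F

def psp_loss(x, y_hat, latent, w_avg, lpips_net, id_net,
             lambdas=(1.0, 0.8, 0.1, 0.0)):
    """Weighted combination of the pSp objectives: pixel-wise L2 (Eq. 1),
    LPIPS (Eq. 2), W+ regularization toward the average style (Eq. 3),
    and identity via cosine similarity of recognition features (Eq. 4)."""
    lam1, lam2, lam3, lam4 = lambdas          # lambda_4 = 0.005 for the non-inversion tasks
    l2 = F.mse_loss(y_hat, x)
    perceptual = lpips_net(y_hat, x).mean()
    identity = (1.0 - F.cosine_similarity(id_net(x), id_net(y_hat), dim=-1)).mean()
    reg = ((latent - w_avg) ** 2).mean()      # pull the predicted latent toward w_avg
    return lam1 * l2 + lam2 * perceptual + lam3 * identity + lam4 * reg
```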
3.2. The Benefits of The StyleGAN Domain

The translation between images through the style domain differentiates pSp from many standard image-to-image translation frameworks, as it makes our model operate globally instead of locally, without requiring pixel-to-pixel correspondence. This is a desired property, as it has been shown that the locality bias limits current methods when handling non-local transformations [37]. Additionally, previous works [21, 7] have demonstrated that the disentanglement of semantic objects learned by StyleGAN is due to its layer-wise representation. This ability to independently manipulate semantic attributes leads to another desired property: the support for multi-modal synthesis. As some translation tasks are ambiguous, where a single input image may correspond to several outputs, it is desirable to be able to sample these possible outputs. While this requires specialized changes in standard image-to-image architectures [51, 14], our framework inherently supports this by simply sampling style vectors. In practice, this is done by randomly sampling a vector w ∈ R512 and generating a corresponding latent code in W+ by replicating w. Style mixing is then performed by replacing select layers of the computed latent with those of the randomly generated latent, possibly with an α parameter for blending between the two styles. This approach is illustrated in Figure 3.

Figure 3. Style-mixing for multi-modal generation.
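A sketch of this sampling step is given below; the latent shape, 1-indexed layer numbering, and the way the random w vector is drawn are assumptions consistent with the description above (in Section 4.3 layers 8-18 are replaced for the conditional tasks, and in Appendix C layers 4-7 with α = 0.5 for super-resolution):

```python
import torch

def style_mix(codes, w_random, layers, alpha=1.0):
    """Replace the given (1-indexed) style layers of a predicted W+ code with a
    randomly sampled style vector, optionally blending with weight alpha."""
    mixed = codes.clone()                            # (B, 18, 512) from the pSp encoder
    idx = [l - 1 for l in layers]
    rand = w_random.view(1, 1, -1).expand_as(mixed)  # replicate w across all 18 entries
    mixed[:, idx] = alpha * rand[:, idx] + (1.0 - alpha) * mixed[:, idx]
    return mixed

# e.g. resample fine styles of a sketch-to-face prediction (w_rand drawn as described above):
# outputs = generator(style_mix(codes, w_rand, layers=range(8, 19)))
```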
4. Applications and Experiments

To explore the effectiveness of our approach, we evaluate pSp on numerous image-to-image translation tasks.

4.1. StyleGAN Inversion

We start by evaluating the pSp framework for StyleGAN inversion, that is, finding the latent code of real images in the latent domain. We compare our method to the optimization technique from Karras et al. [22], the ALAE encoder [36], and the encoder from IDInvert [48]. The ALAE method proposes a StyleGAN-based autoencoder, where the encoder is trained alongside the generator to generate latent codes. In IDInvert, images are embedded into the latent domain of a pretrained StyleGAN by first encoding the image into W+ and then directly optimizing over the generated image to tune the latent. For a fair comparison, we compare with IDInvert where no further optimization is performed after encoding.
Figure 4. Results of pSp for StyleGAN inversion compared to other encoders on CelebA-HQ.

Method | ↑ Similarity | ↓ LPIPS | ↓ MSE | ↓ Runtime (s)
Karras et al. [22] | 0.77 | 0.11 | 0.02 | 182.1
ALAE [36] | 0.06 | 0.32 | 0.15 | 0.207
IDInvert [48] | 0.18 | 0.22 | 0.06 | 0.032
W Encoder | 0.35 | 0.23 | 0.06 | 0.064
Naive W+ | 0.49 | 0.19 | 0.04 | 0.064
pSp w/o ID | 0.19 | 0.17 | 0.03 | 0.105
pSp | 0.56 | 0.17 | 0.03 | 0.105
Table 1. Quantitative results for image reconstruction (↑ higher is better, ↓ lower is better).

Results. Figure 4 shows a qualitative comparison between the methods. One can see that the ALAE method, operating in the W domain, cannot accurately reconstruct the input images. While IDInvert [48] better preserves image attributes, it still fails to accurately preserve the identity and finer details of the input image. In contrast, our method is able to preserve identity while also reconstructing fine details such as lighting, hairstyle, and glasses.

Next, we conduct an ablation study to analyze the effectiveness of the pSp architecture. We compare our architecture to two simpler variations. First, we define an encoder generating a single 512-dimensional style vector in the W latent domain, extracted from the last layer of the encoder network. We then expand this and define an encoder with an additional layer that transforms the 512-dimensional feature vector into a full 18 × 512 W+ vector. Figure 5 shows that while this simple extension into W+ significantly improves the results, it still cannot preserve the finer details generated by our architecture. In Figure 6 we show the importance of the identity loss in the reconstruction task.

Finally, Table 1 presents a quantitative evaluation of the different inversion methods. Compared to other encoders, pSp is able to better preserve the original images in terms of both perceptual similarity and identity. To make sure the similarity score is independent of our loss function, we utilize the CurricularFace [15] method for evaluation.

Figure 5. Ablation of the pSp encoder over CelebA-HQ.
Figure 6. The importance of identity loss.
Method | ↑ Similarity (90° / 70° / 50° / 30°) | ↓ Runtime
R&R | 0.34 / 0.56 / 0.66 / 0.7 | 1.5s
pSp | 0.32 / 0.52 / 0.60 / 0.63 | 0.1s
Table 2. Results for face frontalization on the FEI Face Database, split by the rotation angle of the face in the input.

Figure 7. Comparison of face frontalization methods.

4.2. Face Frontalization

Face frontalization is a challenging task for image-to-image translation frameworks due to the required non-local transformations and the lack of paired training data. Rotate-and-Render (R&R) [47] overcomes this challenge by incorporating a geometric 3D alignment process before the translation process. Alternatively, we show that our style-based translation mechanism is able to overcome these challenges even when trained with no labeled data.

Methodology. For this task, training is the same as the encoder formulation with two important changes. First, we randomly flip the target image during training, effectively forcing the model to output an image that is close to both the original image and the mirrored one. The underlying idea behind this augmentation is that it guides the model to converge to a fixed frontal pose. Next, we increase LID and decrease the L2 and LLPIPS losses for the outer part of the image. This change is based on the fact that, for frontalization, we are less interested in preserving the background region than the face region and the facial identity.
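As a rough illustration of these two changes, the sketch below applies the random target flip and builds region-dependent weights for the L2 and LPIPS terms, assuming a precomputed binary mask of the inner face region; the mask and helper are hypothetical, while the weight values follow Appendix A:

```python
import random
import torch

def frontalization_target_and_weights(target, inner_face_mask):
    """Randomly mirror the target so the model converges to a fixed frontal
    pose, and weight the pixel losses more heavily inside the face region."""
    if random.random() < 0.5:
        target = torch.flip(target, dims=[-1])   # horizontal flip of the target image
    mask = inner_face_mask.float()               # 1 inside the face region, 0 elsewhere
    # Appendix A: lambda_1 = 0.01, lambda_2 = 0.8 inside the face region,
    # lambda_1 = 0.001, lambda_2 = 0.08 elsewhere (identity weight lambda_3 = 1).
    l2_weight = 0.01 * mask + 0.001 * (1.0 - mask)
    lpips_weight = 0.8 * mask + 0.08 * (1.0 - mask)
    return target, l2_weight, lpips_weight
```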
Results. Results are illustrated in Figure 7. When trained with the same data and methodology, pix2pixHD is unable to converge to satisfying results, as it is much more dependent on the correspondence between the input and output pairs. Conversely, our method is able to handle the task successfully, generating realistic frontal faces that are comparable to the more involved R&R approach. This shows the benefit of using a pretrained StyleGAN for image translation, as it allows us to achieve visually pleasing results even with weak supervision. Table 2 provides a quantitative evaluation on the FEI Database [42]. While R&R outperforms pSp, our simple approach provides a fast and elegant alternative, without requiring specialized modules such as R&R's 3DMM fitting and inpainting steps.

4.3. Conditional Image Synthesis

Conditional image synthesis aims at generating photo-realistic images conditioned on certain input types. In this section, our pSp architecture is tested on two conditional image generation tasks: generating high-quality face images from sketches and from semantic segmentation maps. We demonstrate that, with only minimal changes, our encoder successfully utilizes the expressiveness of StyleGAN to generate high-quality and diverse outputs.

Methodology and details. The training of the two conditional generation tasks is similar to that of the encoder, where the input is the conditioned image and the target is the corresponding real image. To generate multiple images at inference time, we perform style-mixing on the fine-level features, taking layers (1-7) from the latent code of the input image and layers (8-18) from a randomly drawn w vector.

4.3.1 Face From Sketch

Common approaches for sketch-to-image synthesis incorporate hard constraints that require pixel-wise correspondence between the input sketch and the generated image, making them ill-suited for incomplete, sparse sketches. DeepFaceDrawing [5] addresses this using a set of dedicated mapping networks. We show that pSp provides a simple alternative to past approaches. As there are currently no publicly available datasets representative of hand-drawn face sketches, we elect to construct our own dataset, which we describe in Appendix B.
Figure 8. Comparison of sketches presented in DeepFaceDrawing [5].
Figure 9. Comparisons to other label-to-image methods.

Results. Figure 8 compares the results of our method to those of pix2pixHD and DeepFaceDrawing. As no code release is available for DeepFaceDrawing, we compare directly with the sketches and results published in their paper. While DeepFaceDrawing obtains more visually pleasing results than pix2pixHD, its outputs are still limited in their diversity. Conversely, although our model is trained on a different dataset, we are still able to generalize well to their sketches. Notably, we obtain more diverse outputs that better retain finer details (e.g. facial hair). Additional results, including those on non-frontal sketches, are provided in the Appendix.

Task | pix2pixHD | SPADE | CC FPSE
Segmentation | 94.72% | 95.25% | 93.06%
Sketch | 93.34% | N/A | N/A
Table 3. Human evaluation results on CelebA-HQ for conditional image synthesis tasks. Each cell denotes the percentage of users who favored pSp over the listed method.

Human Perceptual Study. We additionally perform a human evaluation to compare the visual quality of each method presented above. Each worker is given two images synthesized by different methods from the same input and unlimited time to select which output looks more realistic. Each of our three workers reviews approximately 2,800 pairs for each task, resulting in over 8,400 human judgements for each method. Table 3 shows that pSp significantly outperforms the other respective methods in both synthesis tasks.

4.3.2 Face from Segmentation Map

Here, we evaluate using pSp to synthesize face images from segmentation maps. In addition to pix2pixHD, we compare our approach to two additional state-of-the-art label-to-image methods: SPADE [34] and CC FPSE [31], both of which are based on pix2pixHD.

Results. In Figure 9 we provide a visual comparison of the competing approaches on the CelebAMask-HQ dataset containing 19 semantic categories. As the competing methods are based on pix2pixHD, the results of all three suffer from similar artifacts. Conversely, our approach is able to generate high-quality outputs across a wide range of inputs with various poses and expressions. Additionally, using our multi-modal technique, pSp can easily generate multiple possible outputs with the same pose and attributes but varying fine styles for a single input semantic map or sketch image. We provide examples in Figure 1, with additional results in the Appendix.

4.4. Extending to Other Applications

Besides the applications presented above, we have found pSp to be applicable to a wide variety of additional tasks with minimal changes to the training process. Specifically, we present samples of super-resolution and inpainting results using pSp in Figure 1, with further details and results presented in Appendix C. For both tasks, paired data is generated and training is performed in a supervised fashion. Additionally, we show multi-modal support for super-resolution via style-mixing on medium-level features and evaluate pSp on several image editing tasks, including image interpolation and local patch editing.
Figure 10. Results of pSp on the AFHQ Dataset for StyleGAN inversion and the sketch-to-image tasks. For reconstruction, the input (left) is shown alongside the reconstructed output (right). For sketch-to-image, multiple outputs are generated via style-mixing.

4.5. Going Beyond the Facial Domain

In this section we show that our pSp framework can be trained to solve the various tasks explored above without relying on the advantages provided by the identity loss in the facial domain. While our method does require a pretrained StyleGAN generator, recent works [20, 38] have shown that such a generator can be trained with significantly fewer examples than previously required.

Figure 20 shows the results on the AFHQ Cat and AFHQ Dog datasets [6] for the StyleGAN inversion and sketch-to-image tasks. For these tasks, we use a pretrained StyleGAN-ADA [20] model for each of the two domains and train our pSp encoder using only the L2, LLPIPS, and Lreg losses with the same λ values as those used for the facial domain. As shown, we are able to generalize well to the examined domains, obtaining high-quality, accurate reconstruction results while also supporting multi-modal synthesis via our style-mixing approach. The accompanying Appendix provides additional results for super-resolution and inpainting on these domains.

Figure 11. Challenging cases for StyleGAN inversion.

5. Discussion

Although our suggested framework for image-to-image translation achieves compelling results in various applications, it has some inherent assumptions that should be considered. First, the high-quality images generated by utilizing the pretrained StyleGAN come at a cost: the method is limited to images that can be generated by StyleGAN. Thus, generating faces that are far from frontal or that carry certain expressions may be challenging if such examples were not available when training the StyleGAN model. Also, the global approach of pSp, although advantageous for many tasks, does introduce a challenge in preserving finer details of the input image, such as earrings or background details. This is especially significant in tasks such as inpainting or super-resolution, where standard image-to-image architectures can simply propagate local information. Figure 11 presents some examples of such reconstruction failures.
6. Conclusion

In this work, we propose a novel encoder architecture that can be used to directly map a real image into the W+ latent space with no optimization required. There, styles are extracted in a hierarchical fashion and fed into the corresponding inputs of a fixed StyleGAN generator. Combining our encoder with a StyleGAN decoder, we present a generic framework for solving various image-to-image translation tasks, all using the same architecture. Notably, in contrast to the "invert first, edit later" approach of previous StyleGAN encoders, we show that pSp can be used to directly encode these translation tasks into StyleGAN, thereby supporting input images that do not reside in the StyleGAN domain. Additionally, differing from previous works that typically rely on dedicated architectures for solving a single translation task, we show pSp to be capable of solving a wide variety of problems, requiring only minimal changes to the training losses and methodology. We hope that the ease of use of our approach will encourage further research into utilizing StyleGAN for real image-to-image translation tasks.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE International Conference on Computer Vision, pages 4432–4441, 2019.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8305, 2020.
[3] Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. arXiv preprint, 2020.
[4] Baylies. stylegan-encoder. https://github.com/pbaylies/stylegan-encoder, 2019. Accessed: April 2020.
[5] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. DeepFaceDrawing: Deep generation of face images from sketches. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2020), 39(4):72:1–72:16, 2020.
[6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020.
[7] Edo Collins, Raja Bala, Bob Price, and Sabine Süsstrunk. Editing in style: Uncovering the local semantics of GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020.
[8] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2018.
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[10] Emily Denton, Ben Hutchinson, Margaret Mitchell, and Timnit Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[11] Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip Isola. GANalyze: Toward visual definitions of cognitive image properties, 2019.
[12] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collaborative learning for faster StyleGAN embedding. arXiv preprint arXiv:2007.01758, 2020.
[13] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546, 2020.
[14] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[15] Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. CurricularFace: Adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5901–5910, 2020.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[17] Ali Jahanian, Lucy Chai, and Phillip Isola. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171, 2019.
[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[20] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[22] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[23] Oren Katzir, Dani Lischinski, and Daniel Cohen-Or. Cross-domain cascaded deep feature translation. arXiv preprint, 2019.
[24] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S. Huang. Interactive facial feature localization. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 679–692, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[25] Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. LinesToFacePhoto: Face photo generation from lines with conditional self-attention generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2323–2331, 2019.
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[27] Zachary C. Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[28] Wallace Lira, Johannes Merz, Daniel Ritchie, Daniel Cohen-Or, and Hao Zhang. GANHopper: Multi-hop GAN for unsupervised image-to-image translation. arXiv preprint arXiv:2002.10102, 2020.
[29] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[30] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[31] Xihui Liu, Guojun Yin, Jing Shao, Xiaogang Wang, et al. Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Advances in Neural Information Processing Systems, pages 570–580, 2019.
[32] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. PULSE: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[33] Yotam Nitzan, Amit Bermano, Yangyan Li, and Daniel Cohen-Or. Disentangling in latent space by harnessing a pretrained generator. arXiv preprint arXiv:2005.07728, 2020.
[34] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[35] Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[36] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.
[37] Eitan Richardson and Yair Weiss. The surprising effectiveness of linear unsupervised image-to-image translation. arXiv preprint arXiv:2007.12568, 2020.
[38] Esther Robb, Wen-Sheng Chu, Abhishek Kumar, and Jia-Bin Huang. Few-shot adaptation of generative adversarial networks. arXiv preprint, 2020.
[39] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020.
[40] Edgar Simo-Serra, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. Learning to simplify: Fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics, 35:1–11, 2016.
[41] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. StyleRig: Rigging StyleGAN for 3D control over portrait images. arXiv preprint arXiv:2004.00121, 2020.
[42] Carlos Eduardo Thomaz and Gilson Antonio Giraldi. A new ranking method for principal components analysis and its application to face image analysis. Image and Vision Computing, 28(6):902–913, 2010.
[43] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[44] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic hierarchy emerges in deep generative representations for scene synthesis, 2019.
[45] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9597–9608, 2019.
[46] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[47] Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, and Xiaogang Wang. Rotate-and-Render: Unsupervised photorealistic face rotation from single-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5911–5920, 2020.
[48] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.
[49] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[50] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[51] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
[52] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5104–5113, 2020.
A. Implementation Details
For our backbone network we use the ResNet-IR architecture from [9] pretrained on face recognition, which accelerated convergence. We use a fixed StyleGAN2 generator trained on the FFHQ [21] dataset; that is, only the pSp encoder network is trained on the given translation task. For all applications, the input image resolution is 256 × 256, and the generated 1024 × 1024 output is resized before being fed into the loss functions. Specifically, for LID, the images are cropped around the face region and resized to 112 × 112 before being fed into the recognition network.

For training, we use the Ranger optimizer, a combination of Rectified Adam [29] with the Lookahead technique [45], with a constant learning rate of 0.001. Only horizontal flips are used as augmentations. All experiments are performed using a single NVIDIA Tesla P40 GPU.
For the StyleGAN inversion task, the λ values are set as
λ1 = 1, λ2 = 0.8, and λ3 = 0.1. For face frontalization,
we increase the weight of the LID , setting λ3 = 1 and de-
crease the L2 and LLPIPS loss functions, setting λ1 = 0.01,
λ2 = 0.8 over the inner part of the face and λ1 = 0.001,
λ2 = 0.08 elsewhere. Additionally, the constants used in
the conditional image synthesis tasks are identical to those
used in the inversion task except for the omission of the
identity loss (i.e. λ3 = 0). Finally, λ4 is set to 0.005 in all
applications except for the StyleGAN inversion task, which
does not utilize the regularization loss.
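For reference, the weights listed above can be collected into a small configuration; the dictionary below simply restates these values and is not a file from the released code:

```python
# lambda_1 (L2), lambda_2 (LPIPS), lambda_3 (ID), lambda_4 (W+ regularization)
LOSS_WEIGHTS = {
    "stylegan_inversion":    {"l2": 1.0,   "lpips": 0.8,  "id": 0.1, "reg": 0.0},
    "frontalization_inner":  {"l2": 0.01,  "lpips": 0.8,  "id": 1.0, "reg": 0.005},
    "frontalization_outer":  {"l2": 0.001, "lpips": 0.08, "id": 1.0, "reg": 0.005},
    "conditional_synthesis": {"l2": 1.0,   "lpips": 0.8,  "id": 0.0, "reg": 0.005},
}
```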
B. Dataset Details
We conduct our experiments on the CelebA-HQ dataset [19], which contains 30,000 high-quality images. We use a standard train-test split of the dataset, resulting in approximately 24,000 training images. The FFHQ dataset from [21], which contains 70,000 face images, is used for the StyleGAN inversion and face frontalization tasks.

For the generation of real images from sketches, we construct a dataset representative of hand-drawn sketches using the CelebA-HQ dataset. Given an input image, we first apply a "pencil sketch" filter, which retains most facial details of the original image while removing the remaining noise. We then apply the sketch-simplification method of [40], resulting in images resembling hand-drawn sketches. The same approach is also used for generating the sketch images on the AFHQ Cat and AFHQ Dog datasets [6].
C. Application Details

C.1. Super Resolution

In super resolution, the pSp framework is used to construct high-resolution (HR) images from corresponding low-resolution (LR) input images. PULSE [32] approaches this task in an unsupervised manner by traversing the HR image manifold in search of an image that downsamples to the input LR image.

Methodology and details. We train both our model and pix2pixHD [43] in a supervised fashion, where for each input we perform random bi-cubic down-sampling of ×1 (i.e. no down-sampling), ×2, ×4, ×8, ×16, or ×32 and set the original, full-resolution image as the target.
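A minimal sketch of this pairing step, operating on a (C, H, W) tensor with bicubic resampling; the factor set follows the text, while the interpolation call itself is an assumption rather than the released data loader:

```python
import random
import torch.nn.functional as F

def make_super_resolution_pair(hr_image):
    """Randomly down-sample a full-resolution image by x1..x32 with bicubic
    interpolation; the LR result is the input and the HR image the target."""
    factor = random.choice([1, 2, 4, 8, 16, 32])
    if factor == 1:                         # x1 means no down-sampling
        return hr_image, hr_image
    lr_image = F.interpolate(hr_image.unsqueeze(0), scale_factor=1.0 / factor,
                             mode="bicubic", align_corners=False).squeeze(0)
    return lr_image, hr_image
```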
Results. Figures 12-14 demonstrate the visual quality of the resulting images from our method along with those of the previous approaches. Although PULSE is able to achieve very high-quality results due to its usage of StyleGAN to generate images, it is unable to accurately reconstruct the original image even when performing down-sampling of ×8 to a resolution of 32 × 32. By learning a pixel-wise correspondence between the LR and HR images, pix2pixHD is able to obtain satisfying results even when down-sampled to a resolution of 16 × 16 (i.e. ×16 down-sampling); visually, however, its results appear less photo-realistic. Contrary to these previous works, we are able to obtain high-quality results even when down-sampling to resolutions of 16 × 16 and 8 × 8. Finally, in Figure 15 we generate multiple outputs for a given LR image using our multi-modal technique by performing style-mixing with a randomly sampled w vector on layers (4-7) with an α value of 0.5. Doing so alters medium-level styles that mainly control facial features.

Figure 12. Comparison of super-resolution approaches with ×8 down-sampling on the CelebA-HQ [19] test set.
Figure 13. Comparison of super-resolution approaches with ×16 down-sampling on the CelebA-HQ [19] test set.
Figure 14. Comparison of super-resolution approaches with ×32 down-sampling on the CelebA-HQ [19] test set.
Figure 15. Multi-modal synthesis for super-resolution using pSp with style-mixing.
C.2. Inpainting

In the task of inpainting, we wish to reconstruct missing or occluded regions in a given image. Due to their local nature, pix2pix [16] and other local-based translation methods have shown success in tackling this problem, as they can simply propagate non-occluded regions.

Methodology and details. We train both pSp and pix2pixHD [43] in a supervised fashion, where each input image is occluded with a symmetric triangular mask.

Results. Figure 16 presents results for both our method and pix2pixHD. As shown, due to the lack of information in the occluded regions, pix2pixHD is unable to accurately reconstruct the original image and incurs many artifacts. In contrast, since pSp is trained to encode images into realistic face latents, it is able to accurately reconstruct the occluded region, resulting in high-quality outputs with no artifacts.

Figure 16. Image inpainting results using pSp and pix2pixHD [43] on the CelebA-HQ [19] test set.
C.3. Local Editing

Our framework allows for a simple approach to local image editing using a trained pSp encoder, where altering specific attributes of an input sketch (e.g. eyes, smile) or segmentation map (e.g. hair) results in local edits of the generated images. We can further extend this and perform local patch editing on real face images. As shown in Figure 18, pSp is able to seamlessly merge the desired patch into the original image.
C.4. Face Interpolation

Given two real images, one can obtain their respective latent codes w1, w2 ∈ W+ by feeding the images through our encoder. We can then naturally interpolate between the two images by computing their intermediate latent code w′ = αw1 + (1 − α)w2 for 0 ≤ α ≤ 1 and generating the corresponding image using the new code w′.

In the task of inpainting we wish to reconstruct missing


or occluded regions in a given image. Due to their local
nature, pix2pix [16] and other local-based translation meth-
ods, have shown success in tackling this problem as they
can simply propagate non-occluded regions.
Figure 17. Sketch-Based Local Editing

Figure 18. Local patch editing results using pSp on real images.

Figure 19. Image interpolation results using pSp on the CelebA-HQ [19] test set.
Figure 20. Results of pSp on the AFHQ Cat and Dog datasets [6] on super resolution, inpainting, and image generation from sketches.
Figure 21. Additional StyleGAN inversion results using pSp on the CelebA-HQ [19] test set.
Figure 22. Additional face frontalization results using pSp on the CelebA-HQ [19] test set.
Input
pSp
pix2pixHD

Figure 23. Even for challenging, non-frontal face sketches, pSp is able to obtain high-quality, diverse outputs.
Figure 24. Additional results using pSp for the generation of face images from sketches on the CelebA-HQ [19] test dataset.
Figure 25. Additional results on the Helen Faces [24] dataset using our proposed segmentation-to-image method.
Figure 26. Additional results on the CelebAMask-HQ [19] test set using our proposed segmentation-to-image method.
Figure 27. Conditional image synthesis results from sketches and segmentation maps displaying the multi-modal property of our approach.
