pixel2style2pixel (pSp)
Figure 1. The proposed pixel2style2pixel framework can be used to solve a wide variety of image-to-image translation tasks. Here we show
results of pSp on StyleGAN inversion, multi-modal conditional image synthesis, facial frontalization, inpainting and super-resolution.
Figure 2. Our pSp architecture. Feature maps are first extracted using a standard feature pyramid over a ResNet backbone. For each of the
18 target styles, a small mapping network is trained to extract the learned styles from the corresponding feature map, where styles (0-2) are
generated from the small feature map, (3-6) from the medium feature map, and (7-18) from the largest feature map. The mapping network,
map2style, is a small fully convolutional network, which gradually reduces spatial size using a set of 2-strided convolutions followed by
LeakyReLU activations. Each generated 512-dimensional style vector is fed into StyleGAN, starting from its matching affine transformation, A.
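To make the caption concrete, the following is a minimal PyTorch-style sketch of a single map2style block. The channel counts, number of downsampling steps, and input resolution are illustrative assumptions rather than the authors' released configuration; only the overall structure (2-strided convolutions with LeakyReLU activations reducing a feature map to one 512-dimensional style vector) follows the description above.

import math
import torch
from torch import nn

class Map2Style(nn.Module):
    """Illustrative sketch of a map2style block: a small fully convolutional
    network that reduces a feature map to a single 512-d style vector using
    2-strided convolutions followed by LeakyReLU activations."""

    def __init__(self, in_channels: int, spatial: int, out_dim: int = 512):
        super().__init__()
        layers = []
        channels = in_channels
        # Halve the spatial size at every step until reaching 1x1.
        for _ in range(int(math.log2(spatial))):
            layers += [
                nn.Conv2d(channels, out_dim, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(),
            ]
            channels = out_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 512, 1, 1) -> (B, 512)
        return self.net(x).flatten(start_dim=1)

# Example (sizes are assumptions): extract one style from a 16x16 pyramid level.
# style = Map2Style(in_channels=512, spatial=16)(feature_map)  # -> (B, 512)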
…specific transformation, such as zoom or rotation, in a self-supervised manner. Härkönen et al. [13] find useful paths in an unsupervised manner by using the principal component axes of an intermediate activation space. Collins et al. [7] perform local semantic editing by manipulating corresponding components of the latent code. These methods generally follow an "invert first, edit later" procedure, where an image is first embedded into the latent space, and then its latent is edited in a semantically meaningful manner. This differs from our approach which directly encodes input images into the corresponding output latents, allowing one to also handle inputs that do not reside in the StyleGAN domain.

Image-to-Image. Image-to-image translation techniques aim at learning a conditional image generation function that maps an input image of a source domain to a corresponding image of a target domain. Isola et al. [16] first introduced the use of conditional GANs to solve various image-to-image translation tasks. Since then, their work has been extended for many scenarios: high-resolution synthesis [43], unsupervised learning [30, 50, 23, 28], multi-modal image synthesis [51, 14, 6], and conditional image synthesis [34, 25, 31, 52, 5]. The aforementioned works have constructed dedicated architectures, which require training the generator network and generally do not generalize to other translation tasks. This is in contrast to our method that uses the same architecture for solving a variety of tasks.

3. The pSp Framework

Our pSp framework builds upon the representative power of a pretrained StyleGAN generator and the W+ latent space. To utilize this representation one needs a strong encoder that is able to match each input image to an accurate encoding in the latent domain. A simple technique to embed into this domain is directly encoding a given input image into W+ using a single 512-dimensional vector obtained from the last layer of the encoder network, thereby learning all 18 style vectors together. However, such an architecture presents a strong bottleneck making it difficult to fully represent the finer details of the original image and is therefore limited in reconstruction quality.

In StyleGAN, the authors have shown that the different style inputs correspond to different levels of detail, which are roughly divided into three groups — coarse, medium, and fine. Following this observation, in pSp we extend an encoder backbone with a feature pyramid [26], generating three levels of feature maps from which styles are extracted using a simple intermediate network — map2style — shown in Figure 2. The styles, aligned with the hierarchical representation, are then fed into the generator in correspondence to their scale to generate the output image, thus completing the translation from input pixels to output pixels through the intermediate style representation. The complete architecture is illustrated in Figure 2.

As in StyleGAN, we further define w̄ to be the average style vector of the pretrained generator. Given an input image, x, the output of our model is then defined as

pSp(x) := G(E(x) + w̄)

where E(·) and G(·) denote the encoder and StyleGAN generator, respectively. In this formulation, our encoder aims to learn the latent code with respect to the average style vector. We find that this results in better initialization.
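In code, this formulation amounts to a single residual step in latent space before synthesis. The sketch below is a minimal illustration under assumed interfaces (an encoder returning all 18 style vectors and a StyleGAN generator that accepts W+ codes directly); it is not the released implementation.

import torch

def psp_forward(x, encoder, generator, w_avg):
    """Sketch of pSp(x) := G(E(x) + w_avg).

    Assumed shapes (illustrative):
      encoder(x) -> (B, 18, 512) style vectors in W+
      w_avg      -> (512,) average style vector of the pretrained generator
      generator  -> takes a (B, 18, 512) W+ code and returns an image
    """
    codes = encoder(x) + w_avg   # encoder predicts offsets from the average style
    return generator(codes)      # synthesize the output image from the W+ code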
[Figure 3: style mixing with pSp. The pSp encoder and a randomly sampled 1×(1×512) latent each provide styles to the StyleGAN generator, with layers (1-7) and layers (8-18) driven by different sources, starting from the 4×4 constant input.]

3.1. Loss Functions

While the style-based translation is the core part of our framework, the choice of losses is also crucial. Our encoder is trained using a weighted combination of objectives, the first of which is the pixel-wise L2 loss,

L2(x) = ||x − pSp(x)||_2 .   (1)
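As a small illustration, a training step that uses only the pixel-wise term in Eq. (1) could look as follows; the optimizer setup, data handling, and any further loss terms used by the full method are omitted, and all names are placeholders.

import torch.nn.functional as F

def l2_step(x, encoder, generator, w_avg, optimizer):
    """One optimization step with only the pixel-wise term of Eq. (1),
    L2(x) = ||x - pSp(x)||_2, here computed as a pixel-averaged squared error."""
    optimizer.zero_grad()
    y_hat = generator(encoder(x) + w_avg)   # pSp(x)
    loss = F.mse_loss(y_hat, x)             # squared, pixel-averaged variant of Eq. (1)
    loss.backward()
    optimizer.step()
    return loss.item()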
Figure 4. Results of pSp for StyleGAN inversion compared to other encoders on CelebA-HQ.
Results. Figure 8 compares the results of our method to those of pix2pixHD and DeepFaceDrawing. As no code release is available for DeepFaceDrawing, we compare directly with sketches and results published in their paper. While DeepFaceDrawing obtains more visually pleasing results compared to pix2pixHD, they are still limited in their diversity. Conversely, although our model is trained on a different dataset, we are still able to generalize well to their sketches. Notably, we observe our ability to obtain more diverse outputs that better retain finer details (e.g. facial hair). Additional results, including those on non-frontal sketches, are provided in the Appendix.

4.3.2 Face from Segmentation Map

Here, we evaluate using pSp for synthesizing face images from segmentation maps. In addition to pix2pixHD, we compare our approach to two additional state-of-the-art label-to-image methods: SPADE [34] and CC FPSE [31], both of which are based on pix2pixHD.

Results. In Figure 9 we provide a visual comparison of the competing approaches on the CelebAMask-HQ dataset containing 19 semantic categories. As the competing methods are based on pix2pixHD, the results of all three suffer from similar artifacts. Conversely, our approach is able to generate high-quality outputs across a wide range of inputs of various poses and expressions. Additionally, using our multi-modal technique, pSp can easily generate various possible outputs with the same pose and attributes but varying fine styles for a single input semantic map or sketch image. We provide examples in Figure 1 with additional results in the Appendix.

Human Perceptual Study. We additionally perform a human evaluation to compare the visual quality of each method presented above. Each worker is given two images synthesized by different methods on the same input and is given unlimited time to select which output looks more realistic. Each of our three workers reviews approximately 2,800 pairs for each task, resulting in over 8,400 human judgements for each method. Table 3 shows that pSp significantly outperforms the other respective methods in both synthesis tasks.

Task           pix2pixHD   SPADE    CC FPSE
Segmentation   94.72%      95.25%   93.06%
Sketch         93.34%      N/A      N/A
Table 3. Human evaluation results on CelebA-HQ for conditional image synthesis tasks. Each cell denotes the percentage of users who favored pSp over the listed method.

4.4. Extending to Other Applications

Besides the applications presented above, we have found pSp to be applicable to a wide variety of additional tasks with minimal changes to the training process. Specifically, we present samples of super-resolution and inpainting results using pSp in Figure 1, with further details and results presented in Appendix C. For both tasks, paired data is generated and training is performed in a supervised fashion. Additionally, we show multi-modal support for super-resolution via style-mixing on medium-level features and evaluate pSp on several image editing tasks, including image interpolation and local patch editing.
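For reference, the style-mixing idea behind this multi-modal behavior can be sketched as follows: keep the encoder's styles for the layers that control structure and resample the remaining layers from the generator's latent distribution. The layer split, shapes, and interfaces below are illustrative assumptions (the appropriate split depends on the task), not the exact released code.

import torch

def multimodal_samples(x, encoder, generator, mapping, w_avg, n_samples=3, mix_from=8):
    """Generate several plausible outputs for a single input via style mixing.

    Assumed interfaces (illustrative):
      encoder(x) -> (1, 18, 512) W+ offsets;  mapping(z) -> (1, 512) latent w
      generator  -> takes a (1, 18, 512) W+ code and returns an image
      mix_from   -> first layer whose style is replaced by a random sample
    """
    codes = encoder(x) + w_avg                  # W+ code describing the input
    outputs = []
    for _ in range(n_samples):
        w_rand = mapping(torch.randn(1, 512))   # random style for the mixed layers
        mixed = codes.clone()
        mixed[:, mix_from:] = w_rand            # broadcast (1, 512) over the mixed layers
        outputs.append(generator(mixed))
    return outputs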
Figure 10. Results of pSp on the AFHQ dataset for the StyleGAN inversion and sketch-to-image tasks (panels: StyleGAN Inversion and Reconstruction; Image Generation from Sketches). For reconstruction, the input (left) is shown alongside the reconstructed output (right). For sketch-to-image, multiple outputs are generated via style-mixing.

Figure 11. Challenging cases for StyleGAN Inversion.

5. Discussion

Although our suggested framework for image-to-image translation achieves compelling results in various applications, it has some inherent assumptions that should be considered. First, the high-quality images that are generated by utilizing the pretrained StyleGAN come with a cost — the method is limited to images that can be generated by StyleGAN. Thus, generating faces which are not close to frontal, or have certain expressions, may be challenging if such examples were not available when training the StyleGAN model. Also, the global approach of pSp, although advantageous for many tasks, does introduce a challenge in preserving finer details of the input image, such as earrings or background details. This is especially significant in tasks such as inpainting or super-resolution where standard image-to-image architectures can simply propagate local information. Figure 11 presents some examples of such reconstruction failures.
Figure 18. Local patch editing results using pSp on real images.
Figure 19. Image interpolation results using pSp on the CelebA-HQ [19] test set.
Figure 20. Results of pSp on the AFHQ Cat and Dog datasets [6] on super resolution, inpainting, and image generation from sketches.
Figure 21. Additional StyleGAN inversion results using pSp on the CelebA-HQ [19] test set.
Figure 22. Additional face frontalization results using pSp on the CelebA-HQ [19] test set.
Figure 23. Even for challenging, non-frontal face sketches, pSp is able to obtain high-quality, diverse outputs (shown alongside the input and the pix2pixHD results).
Figure 24. Additional results using pSp for the generation of face images from sketches on the CelebA-HQ [19] test dataset.
Figure 25. Additional results on the Helen Faces [24] dataset using our proposed segmentation-to-image method.
Figure 26. Additional results on the CelebAMask-HQ [19] test set using our proposed segmentation-to-image method.
Figure 27. Conditional image synthesis results from sketches and segmentation maps displaying the multi-modal property of our approach.