Toward Characteristic-Preserving Image-based Virtual Try-On Network
Bochao Wang et al.
Sun Yat-sen University, SenseTime Group Limited
{wangboch,zhhuab}@mail2.sysu.edu.cn, [email protected],
[email protected], [email protected], [email protected]
1 Introduction
application on this challenging virtual try-on task in the wild. One reason is the
poor capability of preserving details when facing large geometric changes, e.g.,
when conditioned on unaligned images [23]. The best practice in image-conditional
virtual try-on is still the two-stage pipeline VITON [10], but its performance is
far from plausible and desirable generation, as illustrated in Fig. 1. We argue
that the main reason lies in the imperfect shape-context matching for aligning
clothes and body shape, and in the inferior appearance merging strategy.
To address the aforementioned challenges, we present a new image-based
method that successfully achieves plausible try-on image synthesis while
preserving clothing characteristics, such as texture, logo and text, named the
Characteristic-Preserving Image-based Virtual Try-On Network (CP-VTON). In
particular, distinguished from the hand-crafted shape context matching, we pro-
pose a new learnable thin-plate spline transformation via a tailored convolutional
neural network in order to align well the in-shop clothes with the target person.
The network parameters are trained from paired images of in-shop clothes and a
wearer, without the need of any explicit correspondences of interest points. Sec-
ond, our model takes the aligned clothes and clothing-agnostic yet descriptive
person representation proposed in [10] as inputs, and generates a pose-coherent
image and a composition mask which indicates the details of aligned clothes kept
in the synthesized image. The composition mask tends to utilize the information
of aligned clothes and balances the smoothness of the synthesized image. Ex-
tensive experiments show that the proposed model handles well the large shape
and pose transformations and achieves state-of-the-art results on the dataset
collected by Han et al. [10] in the image-based virtual try-on task.
Our contributions can be summarized as follows:
– We propose a new Characteristic-Preserving image-based Virtual Try-On
Network (CP-VTON) that addresses the characteristic-preserving issue when
facing the large spatial deformation challenge in the realistic virtual try-on task.
– Different from the hand-crafted shape context matching, our CP-VTON in-
corporates a fully learnable thin-plate spline transformation via a new Geo-
metric Matching Module to obtain more robust and powerful alignment.
– Given aligned images, a new Try-On Module is introduced to dynamically
merge rendered results and warped results.
– Significantly superior performance in the image-based virtual try-on task
achieved by our CP-VTON has been extensively demonstrated by experiments
on the dataset collected by Han et al. [10].
2 Related Work
2.1 Image synthesis
Generative adversarial networks (GANs) [9] aim to model the real image distri-
bution by forcing the generated samples to be indistinguishable from the real
images. Conditional generative adversarial networks (cGANs) have shown im-
pressive results on image-to-image translation, whose goal is to translate an in-
put image from one domain to another domain [12,38,5,34,18,19,35]. Compared
with the L1/L2 loss, which often leads to blurry images, the adversarial loss has become
a popular choice for many image-to-image tasks. Recently, Chen and Koltun [3]
suggest that the adversarial loss might be unstable for high-resolution image
generation. We find that the adversarial loss brings little improvement to our model. In
image-to-image translation tasks, there exists an implicit assumption that the in-
put and output are roughly aligned with each other and they represent the same
underlying structure. However, most of these methods have some problems when
dealing with large spatial deformations between the conditioned image and the
target one. Most image-to-image translation tasks conditioned on unaligned
images [10,23,37] adopt a coarse-to-fine manner to enhance the quality of the final
results. To address the misalignment of conditioned images, Siarohin et al. [31]
introduced deformable skip connections in GANs, using the correspondences
of pose points. VITON [10] computes a shape-context thin-plate spline (TPS)
transformation [2] between the mask of the in-shop clothes and the predicted fore-
ground mask. Shape context is a hand-crafted shape descriptor, and matching
two shapes with it is time-consuming. Besides, the computed TPS transformations
are vulnerable to errors in the predicted mask. Inspired by Rocco et al. [27], we design a
convolutional neural network (CNN) to estimate a TPS transformation between
in-shop clothes and the target image without any explicit correspondences of
interest points.
Fig. 2. An overview of our CP-VTON, containing two main modules. (a) Geometric
Matching Module: the in-shop clothes c and input image representation p are aligned
via a learnable matching module. (b) Try-On Module: it generates a composition mask
M and a rendered person Ir. The final result Io is composed of the warped clothes ĉ
and the rendered person Ir using the composition mask M.
descriptive person representation. They did not take pose variation into consider-
ation, and during inference they required paired images of in-shop clothes
and a wearer, which limits their practical scenarios. The most related work is
VITON [10]. Both works aim to synthesize photo-realistic images directly from 2D
images. VITON addresses this problem with a coarse-to-fine framework and
expects to capture the cloth deformation by a shape-context TPS transforma-
tion. We propose an alignment network and a single-pass generative framework,
which preserves the characteristics of in-shop clothes.
The original cloth-agnostic person representation [10] aims at leaving out the
effects of the old clothes ci, such as their color, texture and shape, while preserving
information about the input person Ii as much as possible, including the person's face, hair,
body shape and pose. It contains three components: an 18-channel pose heatmap,
a 1-channel body shape mask, and a 3-channel RGB image of the reserved regions (face and hair).
These feature maps are all scaled to a fixed resolution of 256×192 and concatenated
together to form the cloth-agnostic person representation map p of k channels,
where k = 18 + 1 + 3 = 22. We also utilize this representation in both our
matching module and our try-on module.
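For illustration, the following is a minimal sketch of assembling such a representation, assuming the pose keypoints have already been rendered as per-channel heatmaps and that all inputs are resized to 256×192; the function and argument names are hypothetical and not part of the original implementation.

```python
import torch

def build_person_representation(pose_heatmap, body_shape_mask, reserved_rgb):
    """Concatenate the cloth-agnostic person representation p.

    Assumed (hypothetical) input shapes, all at the fixed 256x192 resolution:
      pose_heatmap:    (18, 256, 192)  one heatmap per pose keypoint
      body_shape_mask: (1, 256, 192)   blurred binary mask of the body shape
      reserved_rgb:    (3, 256, 192)   RGB of the reserved regions (face, hair)
    Returns the (22, 256, 192) map p, with k = 18 + 1 + 3 = 22 channels.
    """
    p = torch.cat([pose_heatmap, body_shape_mask, reserved_rgb], dim=0)
    assert p.shape[0] == 22
    return p
```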
The classical approach for the geometry estimation task of image matching con-
sists of three stages: (1) local descriptors (e.g. shape context [2], SIFT [22])
are extracted from both input images, (2) the descriptors are matched across
images to form a set of tentative correspondences, and (3) these correspondences
are used to robustly estimate the parameters of the geometric model, e.g. with
RANSAC [7] or Hough voting [16].
Rocco et al. [27] replace this hand-crafted pipeline with learnable components,
and our Geometric Matching Module (GMM) follows this idea. The key differences
between our approach and Rocco et al. [27] are three-fold. First, we train from
scratch rather than using a pretrained VGG network. Second, our training ground
truths are acquired from the wearer's real clothes rather than synthesized by
simulated warping. Most importantly, our GMM is directly supervised with a
pixel-wise L1 loss between the warping output and the ground truth.
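To make the supervision concrete, below is a sketch of a GMM-style training step under stated assumptions: feature_extractor_p, feature_extractor_c, regressor and tps_grid are hypothetical stand-ins for the two convolutional encoders, the TPS-parameter regressor and the TPS grid generator, not the paper's actual modules.

```python
import torch
import torch.nn.functional as F

def gmm_training_step(person_repr, in_shop_clothes, worn_clothes_gt,
                      feature_extractor_p, feature_extractor_c,
                      regressor, tps_grid):
    # 1. Extract features from the person representation p and the clothes c.
    fp = feature_extractor_p(person_repr)      # (B, C, H, W)
    fc = feature_extractor_c(in_shop_clothes)  # (B, C, H, W)

    # 2. Correlation matching between the two feature maps.
    b, c, h, w = fp.shape
    corr = torch.bmm(fc.view(b, c, h * w).transpose(1, 2),
                     fp.view(b, c, h * w))     # (B, H*W, H*W)
    corr = corr.view(b, h * w, h, w)

    # 3. Regress TPS parameters and turn them into a sampling grid.
    theta = regressor(corr)
    grid = tps_grid(theta, in_shop_clothes.shape)  # (B, H, W, 2)

    # 4. Warp the in-shop clothes and supervise with a pixel-wise L1 loss
    #    against the clothes actually worn by the person.
    warped = F.grid_sample(in_shop_clothes, grid, align_corners=True)
    return F.l1_loss(warped, worn_clothes_gt)
```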
Now that the warped clothes ĉ are roughly aligned with the body shape of the
target person, the goal of our Try-On Module is to fuse ĉ with the target person
and synthesize the final try-on result.
One straightforward solution is directly pasting ĉ onto target person image
It . It has the advantage that the characteristics of warped clothes are fully pre-
served, but leads to an unnatural appearance at the boundary regions of clothes
and undesirable occlusion of some body parts (e.g. hair, arms). Another solution
widely adopted in conditional image generation is translating inputs to outputs
by a single forward pass of some encoder-decoder networks, such as UNet [28],
which is desirable for rendering seamless, smooth images. However, it is impos-
sible to perfectly align clothes with target body shape. Lacking explicit spatial
deformation ability, even minor misalignment could make the UNet-rendered
output blurry.
Our Try-On Module aims to combine the advantages of both approaches
above. As illustrated in Fig. 2, given a concatenated input of person represen-
tation p and the warped clothes ĉ, UNet simultaneously renders a person image
Ir and predicts a composition mask M . The rendered person Ir and the warped
clothes ĉ are then fused together using the composition mask M to synthesize
the final try-on result Io:
Io = M ⊙ ĉ + (1 − M) ⊙ Ir    (2)
where ⊙ denotes element-wise multiplication. At training time, given the ground-truth
image It of the person wearing the target clothes, the output Io is supervised by an
L1 loss and a VGG perceptual loss. The perceptual loss is defined as

LVGG(Io, It) = Σ_{i=1}^{5} λi ||φi(Io) − φi(It)||1,    (3)

where φi(I) denotes the feature map of image I at the i-th layer of the visual
perception network φ, which is a VGG19 [32] pre-trained on ImageNet. The layers
φi for i = 1, ..., 5 stand for 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2', 'conv5_2', respectively.
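As one possible realization of the perception network φ, the sketch below slices an ImageNet-pretrained VGG19 from torchvision at 'conv1_2' through 'conv5_2'; the slice indices, the frozen weights and the torchvision API usage are implementation assumptions rather than details stated in the paper.

```python
import torch.nn as nn
import torchvision.models as models

class VGG19Features(nn.Module):
    """Return the conv1_2 ... conv5_2 activations (after ReLU) of VGG19."""

    def __init__(self):
        super().__init__()
        # Requires torchvision >= 0.13 for the weights enum.
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        cuts = [0, 4, 9, 14, 23, 32]  # boundaries just after each conv*_2 + ReLU
        self.slices = nn.ModuleList(
            [nn.Sequential(*[vgg[j] for j in range(cuts[i], cuts[i + 1])])
             for i in range(5)])
        for p in self.parameters():
            p.requires_grad = False  # the perception network stays frozen

    def forward(self, x):
        feats = []
        for s in self.slices:
            x = s(x)
            feats.append(x)
        return feats
```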
Towards our goal of characteristic preserving, we bias the composition mask
M to select the warped clothes as much as possible by applying an L1 regularization
||1 − M||1 on M. The overall loss function for the Try-On Module (TOM) is:

LTOM = λL1 ||Io − It||1 + λvgg LVGG(Io, It) + λmask ||1 − M||1.    (4)
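Putting Eq. 2 through Eq. 4 together, a minimal sketch of the TOM objective could look as follows; vgg_features may be the VGG19Features module sketched above, and the λ weights and per-layer weights are illustrative placeholders rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def tom_loss(rendered_person, mask, warped_clothes, target_person, vgg_features,
             lambda_l1=1.0, lambda_vgg=1.0, lambda_mask=1.0, layer_weights=None):
    # Eq. 2: compose the final result from warped clothes and rendered person.
    output = mask * warped_clothes + (1 - mask) * rendered_person

    # Pixel-wise L1 between the try-on result and the ground-truth person.
    l1 = F.l1_loss(output, target_person)

    # Eq. 3: VGG perceptual loss over the five feature layers.
    if layer_weights is None:
        layer_weights = [1.0, 1.0, 1.0, 1.0, 1.0]  # placeholder per-layer weights
    feats_out = vgg_features(output)
    feats_gt = vgg_features(target_person)
    vgg = sum(w * F.l1_loss(fo, fg)
              for w, fo, fg in zip(layer_weights, feats_out, feats_gt))

    # Eq. 4: bias the composition mask toward the warped clothes via ||1 - M||_1.
    mask_reg = torch.mean(torch.abs(1 - mask))

    return lambda_l1 * l1 + lambda_vgg * vgg + lambda_mask * mask_reg
```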
Fig. 3. From top to bottom, the TV norm values increase. Each row shows clothes at
the same level.
Fig. 4. Matching results of SCMM and GMM. Warped clothes are directly pasted
onto target persons for visual checking. Our method is comparable with SCMM and
produces fewer visually unnatural results.
The numbers of filters for the down-sampling convolutional layers are 64, 128, 256, 512, 512, 512.
The numbers of filters for the up-sampling convolutional layers are 512, 512, 256, 128, 64, 4.
Each convolutional layer is followed by an Instance Normalization layer [33] and
a Leaky ReLU [24] with slope 0.2.
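The layer configuration above could be sketched as below; the 4×4 stride-2 (transposed) convolutions, the omission of the UNet skip connections, and keeping the normalization and activation on the final 4-channel layer are simplifying assumptions for illustration only.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, down=True):
    # One (transposed) convolution followed by InstanceNorm and LeakyReLU(0.2).
    conv = (nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
            if down else
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))
    return nn.Sequential(conv, nn.InstanceNorm2d(out_ch), nn.LeakyReLU(0.2))

def build_encoder_decoder(in_channels):
    down_filters = [64, 128, 256, 512, 512, 512]   # down-sampling path
    up_filters = [512, 512, 256, 128, 64, 4]       # up-sampling path
    layers, ch = [], in_channels
    for f in down_filters:
        layers.append(conv_block(ch, f, down=True))
        ch = f
    for f in up_filters:
        layers.append(conv_block(ch, f, down=False))
        ch = f
    return nn.Sequential(*layers)
```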
Fig. 5 (panels: In-shop Clothes, Target Person, VITON, CP-VTON).
determinate TPS transformation parameters and more robust for large shape
differences.
Quantitative results It is difficult to evaluate directly the quantitative per-
formance of matching modules due to the lack of ground truth in the testing
phase. Nevertheless, we can simply paste the warped clothes onto the original
person image as a non-parametric warped synthesis method in [10]. We conduct
a perceptual user study following the protocol described in Sec. 4.2, for these two
warped synthesis methods. The images synthesized by GMM are rated more realistic
in 49.5% and 42.0% of comparisons for LARGE and SMALL respectively, which indicates that GMM is
comparable to SCMM for shape alignment.
Fig. 6. An example of VITON stage II (panels: In-shop Clothes, Target Person, Coarse
Result, Warped Clothes, Composition Mask, Refined Result). The composition mask
tends to ignore the details of coarsely aligned clothes.
clothes, despite the regularization of the composition mask (Eq. 4). VITON's
“ragged” masks shown in Fig. 6 confirm this argument.
Our pipeline does not address the aforementioned issue by improving match-
ing results, but rather sidesteps it by simultaneously learning to produce a UNet
rendered person image and a composition mask. Before the rendered person
image becomes favorable to the loss function, the central clothing region of the
composition mask is biased towards warped clothes because it agrees more with the ground
truth in the early training stage. It is now the warped clothes rather than the
rendered person image that takes the early advantage in the competition of mask
selection. After that, the UNet learns to adaptively expose regions where UNet
rendering is more suitable than directly pasting. Once exposed, the regions of hair
and arms are rendered and seamlessly fused with the warped clothes.
Quantitative results The first column of Table 1 shows that our pipeline
surpasses VITON in preserving the details of clothes using an identical person
representation. According to the table, our approach performs better than other
methods when dealing with clothes rich in details.
Fig. 7. Ablation studies on the composition mask and the mask L1 loss. Without mask
composition, the UNet cannot handle even minor misalignment well and produces undesirable
try-on results. Without L1 regularization on the mask, the model tends to select the
UNet-rendered person, leading to blurry results as well.
Failure cases Fig. 9 shows three failure cases of our CP-VTON method caused
by (1) improperly preserved shape information of old clothes, (2) rare poses and
(3) the inner side of the clothes being indistinguishable from the outer side, respectively.
5 Conclusions
In this paper, we propose a fully learnable image-based virtual try-on pipeline
towards characteristic-preserving image generation, named CP-VTON,
including a new geometric matching module and a try-on module with a
new merging strategy. The geometric matching module aims at aligning in-shop
clothes and the target person's body under large spatial displacement. Given aligned
clothes, the try-on module learns to preserve well the detailed characteristics of
clothes. Extensive experiments show the overall CP-VTON pipeline produces
high-fidelity virtual try-on results that retain well key characteristics of in-shop
clothes. Our CP-VTON achieves state-of-the-art performance on the dataset
collected by Han et al. [10] both qualitatively and quantitatively.
References
1. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape:
shape completion and animation of people. In: ACM transactions on graphics
(TOG). vol. 24, pp. 408–416. ACM (2005)
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using
shape contexts. IEEE transactions on pattern analysis and machine intelligence
24(4), 509–522 (2002)
3. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement
networks. In: The IEEE International Conference on Computer Vision (ICCV).
vol. 1 (2017)
4. Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., Lischinski, D., Cohen-Or,
D., Chen, B.: Synthesizing training images for boosting human 3d pose estimation.
In: 3D Vision (3DV), 2016 Fourth International Conference on. pp. 479–488. IEEE
(2016)
5. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified gen-
erative adversarial networks for multi-domain image-to-image translation. arXiv
preprint arXiv:1711.09020 (2017)
6. Deng, Z., Zhang, H., Liang, X., Yang, L., Xu, S., Zhu, J., Xing, E.P.: Structured
generative adversarial networks. In: Advances in Neural Information Processing
Systems. pp. 3899–3909 (2017)
7. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fit-
ting with applications to image analysis and automated cartography. In: Readings
in computer vision, pp. 726–740. Elsevier (1987)
8. Gong, K., Liang, X., Shen, X., Lin, L.: Look into person: Self-supervised structure-
sensitive learning and a new benchmark for human parsing. arXiv preprint
arXiv:1703.05446 (2017)
9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural
information processing systems. pp. 2672–2680 (2014)
10. Han, X., Wu, Z., Wu, Z., Yu, R., Davis, L.S.: Viton: An image-based virtual try-on
network. arXiv preprint arXiv:1711.08447 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR. pp. 770–778 (2016)
12. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. arXiv preprint (2017)
13. Jetchev, N., Bergmann, U.: The conditional analogy gan: Swapping fashion articles
on people images. arXiv preprint arXiv:1709.04695 (2017)
14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and
super-resolution. In: ECCV. pp. 694–711 (2016)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International
Conference on Learning Representations (ICLR) (2015)
16. Lamdan, Y., Schwartz, J.T., Wolfson, H.J.: Object recognition by affine invari-
ant matching. In: Computer Vision and Pattern Recognition, 1988. Proceedings
CVPR’88., Computer Society Conference on. pp. 335–344. IEEE (1988)
17. Lassner, C., Pons-Moll, G., Gehler, P.V.: A generative model of people in clothing.
arXiv preprint arXiv:1705.04098 (2017)
18. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adver-
sarial networks for small object detection. In: IEEE CVPR (2017)
19. Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion gan for future-flow embedded
video prediction. In: IEEE International Conference on Computer Vision (ICCV).
vol. 1 (2017)
20. Liang, X., Zhang, H., Xing, E.P.: Generative semantic manipulation with contrast-
ing gan. arXiv preprint arXiv:1708.00315 (2017)
21. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: CVPR. pp. 3431–3440 (2015)
22. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna-
tional journal of computer vision 60(2), 91–110 (2004)
23. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided
person image generation. In: Advances in Neural Information Processing Systems.
pp. 405–415 (2017)
24. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural net-
work acoustic models. In: Proc. icml. vol. 30, p. 3 (2013)
25. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts.
Distill 1(10), e3 (2016)
26. Pons-Moll, G., Pujades, S., Hu, S., Black, M.J.: Clothcap: Seamless 4d clothing
capture and retargeting. ACM Transactions on Graphics (TOG) 36(4), 73 (2017)
27. Rocco, I., Arandjelovic, R., Sivic, J.: Convolutional neural network architecture for
geometric matching. In: Proc. CVPR. vol. 2 (2017)
28. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi-
cal image segmentation. In: International Conference on Medical image computing
and computer-assisted intervention. pp. 234–241. Springer (2015)
29. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.:
Improved techniques for training gans. In: NIPS. pp. 2234–2242 (2016)
30. Sekine, M., Sugita, K., Perbet, F., Stenger, B., Nishiyama, M.: Virtual fitting by
single-shot body shape estimation. In: Int. Conf. on 3D Body Scanning Technolo-
gies. pp. 406–413. Citeseer (2014)
31. Siarohin, A., Sangineto, E., Lathuiliere, S., Sebe, N.: Deformable gans for pose-
based human image generation. arXiv preprint arXiv:1801.00055 (2017)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: Maximizing
quality and diversity in feed-forward stylization and texture synthesis. In: Proc.
CVPR (2017)
34. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-
resolution image synthesis and semantic manipulation with conditional gans. arXiv
preprint arXiv:1711.11585 (2017)
35. Yang, L., Liang, X., Xing, E.: Unsupervised real-to-virtual domain unification for
end-to-end highway driving. arXiv preprint arXiv:1801.03458 (2018)
36. Yoo, D., Kim, N., Park, S., Paek, A.S., Kweon, I.S.: Pixel-level domain transfer.
In: European Conference on Computer Vision. pp. 517–532. Springer (2016)
37. Zhao, B., Wu, X., Cheng, Z.Q., Liu, H., Feng, J.: Multi-view image generation
from a single-view. arXiv preprint arXiv:1704.04886 (2017)
38. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation us-
ing cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017)
39. Zhu, S., Fidler, S., Urtasun, R., Lin, D., Loy, C.C.: Be your own prada: Fashion
synthesis with structural coherence. arXiv preprint arXiv:1710.07346 (2017)