Figure 1. We propose Cloth2Tex, a novel pipeline for converting 2D images of clothing to high-quality 3D textured meshes that can be
draped onto 3D humans. In contrast to previous methods, Cloth2Tex supports a variety of clothing types. Results of 3D textured meshes
produced by our method as well as the corresponding input images are shown above.
4. Experiments
Our goal is to generate 3D garments from 2D catalog images. We verify the effectiveness of Cloth2Tex via thorough evaluation and comparison with state-of-the-art baselines. Furthermore, we conduct a detailed ablation study to demonstrate the effects of individual components.

Figure 5. Comparison with Pix2Surf [20] and Warping [19] on T-shirts. Please zoom in for more details.
4.1. Comparison with SOTA
We first compare our method with SOTA virtual try-on algorithms, both 3D and 2D approaches.

Comparison with 3D SOTA: We compare Cloth2Tex with SOTA methods that produce 3D mesh textures from 2D clothing images, including model-based Pix2Surf [20] and TPS-based Warping [19] (we replace the original MADF with our UV-constrained Navier-Stokes method; the differences between our UV-constrained Navier-Stokes variant and the original version are described in the supplementary material). As shown in Fig. 5, our method produces high-fidelity 3D textures with sharp, high-frequency details of the patterns on clothing, such as the leaves and characters in the top row. In addition, our method accurately preserves the spatial configuration of the garment, particularly the overall aspect ratio of the patterns and the relative locations of the logos. In contrast, the baseline method Pix2Surf [20] tends to produce blurry textures due to its smooth mapping network, and the Warping [19] baseline introduces undesired spatial distortions (e.g., second row in Fig. 5) due to sparse correspondences.

Comparison with 2D SOTA: We further compare Cloth2Tex with 2D virtual try-on methods: flow-based DAFlow [2] and StyleGAN-enhanced Deep Generative Projection (DGP) [8]. As shown in Fig. 6, Cloth2Tex achieves better quality than the 2D virtual try-on methods in sharpness and semantic consistency. More importantly, our outputs, namely 3D textured clothing meshes, are naturally compatible with cloth physics simulation, allowing the synthesis of realistic try-on effects in various body poses. In contrast, 2D methods rely on priors learned from training images and are hence limited in their generalization to extreme poses outside the training distribution.
User Study: Finally, we conduct a user study to evaluate the overall perceptual quality of our method and the 2D and 3D baselines, as well as their consistency with the provided input catalog images. We use DGP as the 2D baseline and TPS as the 3D baseline, as they perform best among existing work. Each participant is shown three randomly selected pairs of results, one produced by our method and the other by one of the baseline methods, and is asked to choose the one that appears more realistic and better matches the reference clothing image. In total, we received 643 responses from 72 users aged between 15 and 60. The results are reported in Fig. 7. Compared to DGP [8] and TPS, Cloth2Tex is favored by the participants with preference rates of 74.60% and 81.65%, respectively. This user study verifies the quality and consistency of our method.

Figure 7. User preferences among 643 responses from 72 participants. Our method is favored by significantly more users.
4.2. Ablation Study
To demonstrate the effect of individual components, we perform an ablation study on both stages of our pipeline.

Neural Rendering vs. TPS Warping: TPS warping has been widely used in previous work on generating 3D garment textures. However, we found that it suffers from the challenging cases illustrated in Fig. 2, so we propose a new pipeline based on neural rendering. We compare our method with TPS warping quantitatively to verify this design choice. Our test set consists of 10+ clothing categories, including T-shirts, polos, sweatshirts, jackets, hoodies, shorts, trousers, and skirts, with 500 samples per category. We report the structural similarity (SSIM [36]) and peak signal-to-noise ratio (PSNR) between the recovered textures and the ground-truth textures.

As shown in Tab. 1, our neural rendering-based pipeline achieves superior SSIM and PSNR compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture. A comprehensive comparison of inpainting methods is provided in the supplementary material.

Table 1. Neural Rendering vs. TPS Warping. We evaluate the texture quality of neural rendering and TPS-based warping, with and without inpainting.

Baseline   Inpainting   SSIM ↑   PSNR ↑
TPS        None         0.70     20.29
TPS        Pix2PixHD    0.76     23.81
Phase I    None         0.80     21.72
Phase I    Pix2PixHD    0.83     24.56
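For concreteness, the sketch below shows one straightforward way to compute these two metrics on a pair of texture maps with scikit-image; the file paths are placeholders and this is not necessarily the exact evaluation script used for Tab. 1.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_texture(pred_path: str, gt_path: str):
    """Compute SSIM and PSNR between a recovered texture and its ground truth."""
    pred = np.asarray(Image.open(pred_path).convert("RGB"), dtype=np.float64) / 255.0
    gt = np.asarray(Image.open(gt_path).convert("RGB"), dtype=np.float64) / 255.0
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    return ssim, psnr

# Example usage with placeholder paths:
# ssim, psnr = evaluate_texture("recovered_uv.png", "gt_uv.png")
```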
Total Variation Loss & Automatic Scaling (Phase I): As shown in Fig. 8, when the total variation loss Etv and automatic scaling are dropped, the textures are incomplete and cannot maintain a semantically correct layout. With Etv, Cloth2Tex produces more complete textures by exploiting the local consistency of textures. Further applying automatic scaling results in better alignment between the template mesh and the input images, and hence a more semantically correct texture map.

Figure 8. Ablation study on Phase I. From left to right: base, base + total variation loss Etv, base + Etv + automatic scaling.
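As a minimal sketch (assuming the common anisotropic formulation; the exact Etv used in the paper may differ), such a total variation term can be written in PyTorch as:

```python
import torch

def total_variation_loss(texture: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of a (B, C, H, W) texture map.

    Penalizing differences between neighbouring texels encourages locally
    smooth, hole-free textures during the Phase I optimization.
    """
    dh = (texture[..., 1:, :] - texture[..., :-1, :]).abs().mean()
    dw = (texture[..., :, 1:] - texture[..., :, :-1]).abs().mean()
    return dh + dw
```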
Inpainting Methods (Phase II): Next, to demonstrate the need for training an inpainting model specifically for UV clothing textures, we compare our task-specific inpainting model with general-purpose inpainting algorithms, including the Navier-Stokes [4] algorithm and off-the-shelf deep learning models with pre-trained checkpoints: LaMa [30], MADF [40], and Stable Diffusion v2 [24]. Here, we modify the traditional Navier-Stokes [4] algorithm into a UV-constrained version because the texture map occupies only part of the square image grid, and the large non-UV regions have an adverse effect on texture inpainting (please see the supplementary material for a comparison).

As shown in Fig. 9, our method, trained on the synthetic dataset generated by the diffusion model, outperforms general-purpose inpainting methods in the task of refining and completing clothing textures, especially in terms of the color consistency between inpainted regions and the original image.

Figure 9. Comparison with SOTA inpainting methods (Navier-Stokes [4], LaMa [30], MADF [40] and Stable Diffusion v2 [24]) on texture inpainting. The upper left corner of each column shows the conditional mask input. Blue in the first column shows that our method is capable of maintaining a consistent boundary and curvature w.r.t. the reference image, while green highlights the blank regions that need inpainting.
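To make the UV-constrained modification concrete, the sketch below restricts OpenCV's Navier-Stokes inpainting to the UV islands of the texture atlas. It is an illustrative approximation under assumed mask conventions, not the authors' exact implementation (see the supplementary material for their variant).

```python
import cv2
import numpy as np

def uv_constrained_ns_inpaint(texture_bgr, uv_mask, hole_mask, radius=3):
    """Navier-Stokes inpainting restricted to the UV islands of a texture map.

    texture_bgr: (H, W, 3) uint8 partial texture.
    uv_mask:     (H, W) uint8, 255 inside valid UV regions, 0 elsewhere.
    hole_mask:   (H, W) uint8, 255 where texels are missing.
    """
    # Only fill holes that lie inside the UV islands, ignoring the empty
    # background of the square texture atlas.
    fill_mask = cv2.bitwise_and(hole_mask, uv_mask)
    filled = cv2.inpaint(texture_bgr, fill_mask, radius, cv2.INPAINT_NS)
    # Keep the non-UV background untouched.
    filled[uv_mask == 0] = texture_bgr[uv_mask == 0]
    return filled
```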
4.3. Limitations
As shown in Fig. 10, Cloth2Tex can produce high-quality textures for common garments, e.g., T-shirts, shorts, and trousers (blue bounding box (bbox)). However, we have observed that it has difficulty recovering textures for garments with complex patterns: e.g., inaccurate and inconsistent local textures (belt, collarband) occur on the windbreaker (red bbox). We attribute this to the extra accessories on the garment, which inevitably add partial texture in addition to the main UV.

Another imperfection is that our method cannot maintain the uniformity of checked shirts with densely assembled grids: as shown in the second row of Fig. 6, our method is inferior to 2D VTON methods in preserving textures composed of thousands of fine, tiny checkerboard-like grids; checked shirts and pleated skirts are representative examples of such garments. We attribute this to subtle position changes during deformation graph optimization, which eventually make the template mesh less uniform, since the regularization term, i.e., as-rigid-as-possible, is not a strong enough constraint to obtain a conformal mesh. We acknowledge this challenge and leave the generation of a homogeneous mesh with uniformly spaced triangles to future work.
Figure 10. Visualization of 3D virtual try-on. We obtain textured 3D meshes from 2D reference images shown on the left. The 3D meshes
are then draped onto 3D humans.
5. Conclusion

This paper presents a novel pipeline, Cloth2Tex, for synthesizing high-quality textures for 3D meshes from pictures taken from only front and back views. Cloth2Tex adopts a two-stage process to obtain visually appealing textures, where Phase I performs coarse texture generation and Phase II performs texture refinement. Training a generalized texture inpainting network is non-trivial due to the high topological variability of UV space; therefore, obtaining paired data under such circumstances is important. To the best of our knowledge, this is the first study to combine a diffusion model with a 3D engine (Blender) to collect coarse-fine paired textures for 3D texturing tasks. We show the generalizability of this approach on a variety of examples.

To avoid distortion and stretching artifacts across clothes, we automatically adjust the scale of the template mesh vertices and thus best prepare them for the subsequent image-based optimization, which effectively guides the implicitly learned texture toward a complete and distortion-free structure. Extensive experiments demonstrate that our method can effectively synthesize consistent and highly detailed textures for typical clothes without extra manual effort.

In summary, we hope our work can inspire more future research in 3D texture synthesis and shed some light on this area.
References

[1] AUTOMATIC1111. Stable diffusion web ui. https://github.com/AUTOMATIC1111/stable-diffusion-webui, 2022.
[2] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 409–425. Springer, 2022.
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
[4] Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, pages I–I. IEEE, 2001.
[5] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023.
[6] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[8] Ruili Feng, Cheng Ma, Chengji Shen, Xin Gao, Zhenjiang Liu, Xiaobo Li, Kairi Ou, Deli Zhao, and Zheng-Jun Zha. Weakly supervised high-fidelity clothing model generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3440–3449, 2022.
[9] Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin: A pytorch library for accelerating 3d deep learning research. https://github.com/NVIDIAGameWorks/kaolin, 2022.
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Mikhail Konstantinov, Alex Shonenkov, Daria Bakshandaeva, and Ksenia Ivanova. Deepfloyd: Text-to-image model with a high degree of photorealism and language understanding. https://deepfloyd.ai/, 2023.
[15] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[16] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. Lavis: A library for language-vision intelligence, 2022.
[17] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In The IEEE International Conference on Computer Vision (ICCV), 2019.
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[19] Sahib Majithia, Sandeep N Parameswaran, Sadbhavana Babar, Vikram Garg, Astitva Srivastava, and Avinash Sharma. Robust 3d garment digitization from monocular 2d images for 3d virtual try-on systems. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3428–3438, 2022.
[20] Aymen Mir, Thiemo Alldieck, and Gerard Pons-Moll. Learning to transfer texture from clothing images to 3d humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7023–7034, 2020.
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[22] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[23] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[25] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[26] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[27] Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Texturify: Generating textures on 3d shape surfaces. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 72–88. Springer, 2022.
[28] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, pages 109–116, 2007.
[29] Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es. 2007.
[30] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[31] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023.
[32] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[33] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[34] Tuanfeng Y. Wang, Duygu Ceylan, Jovan Popovic, and Niloy J. Mitra. Learning a shared shape space for multimodal garment design. ACM Trans. Graph., 37(6):1:1–1:14, 2018.
[35] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4271–4280, 2018.
[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[37] Yi Xu, Shanglin Yang, Wei Sun, Li Tan, Kefeng Li, and Hui Zhou. 3d virtual garment modeling from rgb images. In 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 37–45. IEEE, 2019.
[38] Rui Yu, Yue Dong, Pieter Peers, and Xin Tong. Learning texture generators for 3d shape collections from internet photo sets. In British Machine Vision Conference, 2021.
[39] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[40] Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021.
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
Supplementary Material
6. Implementation Details
In Phase I, we fix the optimization steps of both silhouette matching and image-based optimization to 1,000, which makes each coarse texture generation process take less than 1 minute on an NVIDIA Ampere A100 (80 GB VRAM). The initial weights of the energy terms are w_sil = 50, w_lmk = 0.01, w_arap = 50, w_norm = 10, w_img = 100, and w_tv = 1; we then use a cosine scheduler to decay w_arap and w_norm to 5 and 1, respectively.
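A minimal sketch of such a schedule (assuming a standard cosine decay over the 1,000 optimization steps; the exact scheduler used in Cloth2Tex may differ) is:

```python
import math

def cosine_decay(step: int, total_steps: int, w_start: float, w_end: float) -> float:
    """Cosine decay from w_start to w_end over total_steps."""
    t = min(step, total_steps) / total_steps
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * t))

TOTAL_STEPS = 1000
weights = {"sil": 50.0, "lmk": 0.01, "img": 100.0, "tv": 1.0}

for step in range(TOTAL_STEPS):
    w_arap = cosine_decay(step, TOTAL_STEPS, 50.0, 5.0)
    w_norm = cosine_decay(step, TOTAL_STEPS, 10.0, 1.0)
    # The total energy would then be assembled as, e.g.:
    # loss = (weights["sil"] * E_sil + weights["lmk"] * E_lmk + w_arap * E_arap
    #         + w_norm * E_norm + weights["img"] * E_img + weights["tv"] * E_tv)
```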
During the Blender-enhanced rendering process, we augment the data by randomly sampling the blendshapes of the upper cloth in the range [0.1, 1.0]. The synthetic images were rendered with the Blender EEVEE engine at a resolution of 512², using emission-only shading (to disentangle the texture from the effect of shading, which is a notoriously difficult problem, as dissected in Text2Tex [5]).
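The following bpy sketch illustrates one way to set up such an emission-only EEVEE render; the object name, texture path, and blendshape handling are hypothetical and the authors' actual Blender scripts may differ.

```python
import random
import bpy

scene = bpy.context.scene
scene.render.engine = 'BLENDER_EEVEE'
scene.render.resolution_x = 512
scene.render.resolution_y = 512

obj = bpy.data.objects["garment_template"]          # hypothetical object name
# Random blendshape augmentation in [0.1, 1.0] (skip the Basis key).
if obj.data.shape_keys:
    for kb in list(obj.data.shape_keys.key_blocks)[1:]:
        kb.value = random.uniform(0.1, 1.0)

# Emission-only material so the texture is rendered without shading.
mat = bpy.data.materials.new(name="EmissionOnly")
mat.use_nodes = True
nodes, links = mat.node_tree.nodes, mat.node_tree.links
nodes.clear()
tex = nodes.new("ShaderNodeTexImage")
tex.image = bpy.data.images.load("/path/to/texture.png")   # hypothetical path
emit = nodes.new("ShaderNodeEmission")
out = nodes.new("ShaderNodeOutputMaterial")
links.new(tex.outputs["Color"], emit.inputs["Color"])
links.new(emit.outputs["Emission"], out.inputs["Surface"])
obj.data.materials.clear()
obj.data.materials.append(mat)

scene.render.filepath = "/tmp/render.png"
bpy.ops.render.render(write_still=True)
```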
Figure 11. Visualization of the Navier-Stokes method on the UV template. Our locally constrained NS method fills the blanks thoroughly (though with less precision) compared to the original global counterpart.

The synthetic data used for training the texture inpainting network are generated from a pretrained ControlNet using prompts (produced by Lavis-BLIP [16], OFA [32], and MPlug [15]) and UV templates (UV maps manually crafted by artists), as shown in Fig. 14, which covers more garment types than previous methods, e.g., Pix2Surf [20] (4) and Warping [19] (2).
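A minimal sketch of this kind of Canny-conditioned texture synthesis with the diffusers library is shown below; the checkpoints, prompt, and UV-template path are illustrative examples, not the exact models and prompts used to build our dataset.

```python
import numpy as np
import torch
import cv2
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Canny-edge conditioning extracted from a UV template (hypothetical path).
uv_template = np.array(Image.open("uv_template_tshirt.png").convert("RGB"))
edges = cv2.Canny(uv_template, 100, 200)
cond = Image.fromarray(np.repeat(edges[..., None], 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# A caption produced by an image-captioning model (e.g. BLIP) would go here.
prompt = "a green floral short-sleeve t-shirt, flat texture, high quality"
texture = pipe(prompt, image=cond, num_inference_steps=30).images[0]
texture.save("synthetic_texture.png")
```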
The only trainable component in Phase II, Pix2PixHD, is optimized with Adam [13] (lr = 2e-4) for 200 epochs. Our implementation is built on top of PyTorch [21], alongside PyTorch3D [23] for silhouette matching, rendering, and inpainting.

The detailed parameters of the template meshes in Cloth2Tex are summarized in Tab. 4; sketches of all template meshes and UV maps are shown in Fig. 12 and Fig. 13, respectively.
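As a rough illustration of the silhouette-matching setup mentioned above, the following PyTorch3D snippet renders a soft silhouette of a template mesh; the mesh path is a placeholder, and the camera and rasterization settings are assumptions rather than the paper's exact configuration.

```python
import math
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer,
    MeshRasterizer, SoftSilhouetteShader, BlendParams, look_at_view_transform,
)

device = torch.device("cuda")
mesh = load_objs_as_meshes(["template_tshirt.obj"], device=device)  # hypothetical asset

R, T = look_at_view_transform(dist=2.5, elev=0.0, azim=0.0)
cameras = FoVPerspectiveCameras(device=device, R=R, T=T)

blend = BlendParams(sigma=1e-4, gamma=1e-4)
raster_settings = RasterizationSettings(
    image_size=512,
    blur_radius=math.log(1.0 / 1e-4 - 1.0) * blend.sigma,  # soft edges for gradients
    faces_per_pixel=50,
)
silhouette_renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(blend_params=blend),
)
# The alpha channel of the rendered image is the differentiable soft silhouette.
silhouette = silhouette_renderer(mesh)[..., 3]
```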
Figure 14. Texture maps for training the instance-map-guided Pix2PixHD, synthesized by ControlNet with canny-edge conditioning.

Figure 15. Comparison with representative image-to-image methods with conditional input: autoencoder-based TransUNet [6] (we modify the base model and add an extra branch for the UV map, aiming to train it on all garment types together), diffusion-based ControlNet [39], and GAN-based Pix2PixHD [33]. It is rather obvious that the prompt-sensitive ControlNet is limited in recovering globally color-consistent texture maps. The upper right corner of each method shows the conditional input.