
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On

Daiheng Gao1∗  Xu Chen2,3∗  Xindi Zhang1  Qi Wang1  Ke Sun1  Bang Zhang1  Liefeng Bo1  Qixing Huang4

1 Alibaba XR Lab   2 ETH Zurich, Department of Computer Science   3 Max Planck Institute for Intelligent Systems   4 The University of Texas at Austin

Figure 1. We propose Cloth2Tex, a novel pipeline for converting 2D images of clothing to high-quality 3D textured meshes that can be
draped onto 3D humans. In contrast to previous methods, Cloth2Tex supports a variety of clothing types. Results of 3D textured meshes
produced by our method as well as the corresponding input images are shown above.

Abstract

Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. This necessitates a simple and straightforward pipeline to obtain high-quality textures from simple input, such as 2D reference images. Traditional warping-based texture generation methods require a significant number of control points to be manually selected for each type of garment, which is a time-consuming and tedious process. We propose a novel method, called Cloth2Tex, which eliminates this human burden. Cloth2Tex is a self-supervised method that generates texture maps with reasonable layout and structural consistency. Another key feature of Cloth2Tex is that it can be used to support high-fidelity texture inpainting, which is done by combining Cloth2Tex with a prevailing latent diffusion model. We evaluate our approach both qualitatively and quantitatively and demonstrate that Cloth2Tex can generate high-quality texture maps and achieve the best visual effects in comparison to other methods. Project page: xxx

1. Introduction

The advancement of AR/VR and 3D graphics has opened up new possibilities for the fashion e-commerce industry. Customers can now virtually try on clothes on their avatars in 3D, which can help them make more informed purchase decisions. However, most clothing assets are currently presented in 2D catalog images, which are incompatible with 3D graphics pipelines. It is therefore critical to produce 3D clothing assets automatically from these existing 2D images, aiming at making 3D virtual try-on accessible to everyone.

Towards this goal, the research community has been developing algorithms [19, 20, 37] that can transfer 2D images into 3D textures of clothing mesh models.
The key to producing 3D textures from 2D images is to determine the correspondences between the catalog images and the UV textures. Conventionally, this is achieved via the Thin-Plate-Spline (TPS) method [3], which approximates the dense correspondences from a small set of corresponding key points. In industrial applications, these key points are annotated manually and densely for each clothing instance to achieve good quality. With deep learning models, automatic key point detectors [19, 35] have been proposed to detect key points for clothing. However, as seen in Fig. 2, the inherent self-occlusions (e.g. sleeves occluded by the main fabric) are intractable for TPS warping-based approaches, leading to erroneous and incomplete texture maps. Several works have attempted to use generative models to refine texture maps. However, such a refinement strategy has demonstrated success only on a small set of clothing types, i.e. T-shirts, pants, and shorts. This is because TPS cannot produce satisfactory initial texture maps on all clothing types, and a large training dataset covering high-quality texture maps of diverse clothing types is missing. Pix2Surf [20], a SMPL [18]-based virtual try-on algorithm, has automated the process of texture generation with no apparent cavity or void. However, due to its clothing-specific model, Pix2Surf is limited in its ability to generalize to clothes with arbitrary shapes.

Figure 2. Problem of warping-based texture generation algorithms: partially filled UV texture maps with large missing holes, as highlighted in yellow.

This paper aims to automatically convert 2D reference clothing images into 3D textured clothing meshes for a larger diversity of clothing types. To this end, we first contribute template mesh models for 10+ different clothing types (well beyond current SOTAs: Pix2Surf (4) and [19] (2)). Next, instead of using the Thin-Plate-Spline (TPS) method as previous methods do, we incorporate neural mesh rendering [17] to directly establish dense correspondences between 2D catalog images and the UV textures of the meshes. This results in higher-quality initial texture maps for all clothing types. We achieve this by optimizing the 3D clothing mesh models and textures to align with the catalog images' color, silhouette, and key points.

Although the texture maps from neural rendering are of higher quality, they still need refinement due to missing regions. Learning to refine these texture maps across different clothing types requires a large dataset of high-quality 3D textures, which is infeasible to acquire. We tackle this problem by leveraging the recently emerging latent diffusion model (LDM) [24] as a data simulator. Specifically, we use the canny edge version of ControlNet [39] to generate large-scale, high-quality texture maps with various patterns and colors. In addition to the high-quality ground-truth textures, the refinement network requires the corresponding initial defective texture maps obtained from neural rendering. To get such data, we render the high-quality texture maps into catalog images and then run our neural rendering pipeline to re-obtain the texture maps from the catalog images, which now contain defects as desired. With these pairs of high-quality complete texture maps and defective texture maps from the neural renderer, we train a high-resolution image translation model that refines the defective texture maps.

Our method can produce high-quality 3D textured clothing from 2D catalog images of various clothing types. In our experiments, we compare our approach with state-of-the-art techniques for inferring 3D clothing textures and find that our method supports more clothing types and demonstrates superior texture quality. In addition, we carefully verify the effectiveness of individual components via a thorough ablation study.

In summary, we contribute Cloth2Tex, a pipeline that can produce high-quality 3D textured clothing of various types based on 2D catalog images, which is achieved via
• a) 3D parametric clothing mesh models of 10+ different categories that will be publicly available,
• b) an approach based on neural mesh rendering to transfer 2D catalog images into texture maps of clothing meshes,
• c) a data simulation approach for training a texture refinement network, built on top of blendshape-driven meshes and LDM-based textures.

2. Related Works

Learning 3D Textures. Our method is related to learning texture maps for 3D meshes. Texturify [27] learns to generate high-fidelity texture maps by rendering multiple 2D images from different viewpoints and aligning the distribution of rendered images and real image observations. Yu et al. [38] adopt a similar method, rendering images from different viewpoints and then discriminating the images by separate discriminators.
With the emergence of diffusion models [7, 31], the recent work Text2Tex [5] exploits 2D diffusion models for 3D texture synthesis. Due to the strong generalization ability of diffusion models [11, 24] trained on the large corpus LAION-5B [26], i.e. Stable Diffusion [24], the textured meshes generated by Text2Tex are of superior quality and contain rich details. Our method is related to these approaches in that we also utilize diffusion models for 3D texture learning. However, different from previous approaches, we use latent diffusion models only to generate synthetic texture maps to train our texture inpainting model, and our focus lies in learning 3D textures corresponding to a specific pair of 2D reference images instead of random or text-guided generation.

Texture-based 3D Virtual Try-On. Wang et al. [34] provide a sketch-based network that infers both 2D garment sewing patterns and the draped 3D garment mesh from 2D sketches. In practice, however, many applications require inferring 3D garments and their textures from 2D catalog images. To achieve this goal, Pix2Surf [20] is the first work that creates textured 3D garments automatically from front/back view images of a garment. This is achieved by predicting dense correspondences between the 2D images and the 3D mesh template using a trained network. However, due to erroneous correspondence predictions, particularly on unseen test samples, Pix2Surf has difficulty preserving high-frequency details and tends to blur out fine-grained details such as thin lines and logos.

To avoid this problem, Majithia et al. [19] propose to use a warping-based method (TPS) [3] instead, followed by a deep texture inpainting network built upon MADF [40]. However, as mentioned in the introduction, warping-based methods generally require dense and accurate corresponding key points in images and UV maps and have only demonstrated successful results on two simple clothing categories, T-shirts and trousers. In contrast to previous work, Cloth2Tex aims to achieve automatic high-quality texture learning for a broader range of garment categories. To this end, we use neural rendering instead of warping, which yields better texture quality on more complex garment categories. We further utilize latent diffusion models (LDMs) to synthesize high-quality texture maps of various clothing categories to train the inpainting network.

3. Method

We propose Cloth2Tex, a two-stage approach that converts 2D images into textured 3D garments. The garments are represented as polygon meshes, which can be draped and simulated on 3D human bodies. The overall pipeline is illustrated in Fig. 3. The pipeline's first stage (Phase I) determines the 3D garment shape and coarse texture. We do this by registering our parametric garment meshes onto catalog images using a neural mesh renderer. The pipeline's second stage (Phase II) recovers fine textures from the coarse estimate. We use image translation networks trained on large-scale data synthesized by pre-trained latent diffusion models. The mesh templates for individual clothing categories are a pre-requirement for our pipeline; we obtain these templates by manual artist design and will make them publicly available. Implementation details are placed in the supp. material due to the page limit.

3.1. Pre-requirement: Template Meshes

For the sake of both practicality and convenience, we design a cloth template mesh (with fixed topology) M for each common garment type (e.g., T-shirts, sweatshirts, baseball jackets, trousers, shorts, and skirts). We then build a deformation graph D [29] to optimize the template mesh vertices, because per-vertex image-based optimization is subject to errors and artifacts due to the high degrees of freedom. Specifically, we construct D with k nodes, which are parameterized with axis angles A ∈ R3 and translations T ∈ R3. The vertex displacements are then derived from the deformation nodes (the number of nodes k depends on the garment type, since different templates have different numbers of vertices and faces). We also manually select several vertices on the mesh templates as landmarks K. The specific requirements for a template mesh are as follows: fewer than 10,000 vertices V, uniform mesh topology, and integrity of the UV. The vertex count of the templates ranges from the skirt (6,116) to the windbreaker (9,881). For uniformity, we set the downsampling factor of D to 20 for all templates (details of the template meshes are placed in the supp. material). Integrity of the UV means that the UV should be laid out as a whole in terms of front and back, without further subdivision as used in traditional computer graphics. Fabricating an integral UV is not complicated, and such UVs are well suited for the later diffusion-based texture generation. See Sec. 3.3.1 for more details.
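To make the deformation-graph parameterization concrete, the following is a minimal PyTorch sketch of how per-vertex displacements could be derived from k nodes with axis angles and translations, in the spirit of embedded deformation [29]; the node positions, vertex-to-node weights and all function names are illustrative assumptions, not our released code.

```python
# Hedged sketch: deformation-graph-driven vertex displacement in the spirit of
# embedded deformation [29]. Node positions g, vertex-to-node weights W and all
# names below are illustrative assumptions, not the actual Cloth2Tex code.
import torch
from pytorch3d.transforms import axis_angle_to_matrix

def deform_vertices(V, g, W, A, T):
    """V: (n,3) template vertices, g: (k,3) node positions,
    W: (n,k) normalized skinning weights, A: (k,3) axis angles, T: (k,3) translations."""
    R = axis_angle_to_matrix(A)                         # (k,3,3) per-node rotations
    local = V[:, None, :] - g[None, :, :]               # vertex offsets from each node, (n,k,3)
    rotated = torch.einsum('kij,nkj->nki', R, local)    # rotate offsets per node
    per_node = rotated + g[None, :, :] + T[None, :, :]  # rigid transform around each node
    return (W[..., None] * per_node).sum(dim=1)         # blend node predictions, (n,3)

# A and T start at zero (identity) and are the variables optimized in Phase I, e.g.
# A = torch.zeros(k, 3, requires_grad=True); T = torch.zeros(k, 3, requires_grad=True)
```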
3.2. Phase I: Shape and Coarse Texture Generation

The goal of Phase I is to determine the garment shape and a coarse estimate of the UV texture T from the input catalog images (front & back views). We adopt a differentiable rendering approach [17] to determine the UV texture in a self-supervised way, without involving trained neural networks. Precisely, we fit our template model to the catalog images by minimizing the difference between the 2D renderings of our mesh model and the target images. The fitting procedure consists of two stages, namely Silhouette Matching and Image-based Optimization, which we elaborate on below.
Figure 3. Method overview: Cloth2Tex consists of two stages. In Phase I, we determine the 3D garment shape and coarse texture by
registering our parametric garment meshes onto catalog images using a neural mesh renderer. Next, in Phase II, we refine the coarse
estimate of the texture to obtain high-quality fine textures using image translation networks trained on large-scale data synthesized by
pre-trained latent diffusion models. Note that the only component that requires training is the inpainting network. Please watch our video
on the project page for an animated explanation of Cloth2Tex.

3.2.1 Silhouette Matching

We first align the corresponding template mesh to the 2D images based on the 2D landmarks and silhouette. Here, we use BCRNN [35] to detect the landmarks L2d and DenseCLIP [22] to extract the silhouette mask M̄. To fit our various types of garments, we finetune BCRNN with 2,000+ manually annotated clothing images per type.

After the mask and landmarks of the input images are obtained, we first perform a global rigid alignment using an automatic cloth scaling method that adjusts the scaling factor of the mesh vertices according to the overlap between the initial silhouettes of the mesh and the input images, which ensures a rough agreement of the resulting texture map (see Fig. 8). Specifically, we implement this mechanism by checking the silhouette overlap between the rendered and reference images and then enlarging or shrinking the scale of the mesh vertices accordingly. Once the optimal Intersection over Union (IoU) has been reached, we fix the coefficient and send the scaled template to the next step.
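As an illustration of the automatic cloth scaling, a simple search over a global scale factor could look as follows; the candidate scale range and the render_silhouette helper are assumptions, and the actual implementation may search differently.

```python
# Hedged sketch of automatic cloth scaling: search a global vertex scale that
# maximizes the IoU between the rendered and reference silhouettes.
# render_silhouette() and the scale range are illustrative assumptions.
import torch

def mask_iou(a, b, eps=1e-6):
    inter = (a * b).sum()
    union = a.sum() + b.sum() - inter
    return inter / (union + eps)

def auto_scale(verts, ref_mask, render_silhouette, scales=torch.linspace(0.7, 1.3, 25)):
    center = verts.mean(dim=0, keepdim=True)
    best_scale, best_iou = 1.0, -1.0
    for s in scales:
        scaled = (verts - center) * s + center            # scale vertices about the centroid
        iou = mask_iou(render_silhouette(scaled), ref_mask)
        if iou > best_iou:
            best_scale, best_iou = float(s), float(iou)
    return best_scale                                      # fixed before the deformation fitting
```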
We then fit the silhouette and the landmarks of the template mesh (the landmarks on the template mesh are pre-defined as described in Sec. 3.1) to those detected from the 2D catalog images. To this end, we optimize the deformations of the nodes in the deformation graph by minimizing the following energy terms.

2D Landmark Alignment. Elmk measures the distance between the 2D landmarks L2d detected by BCRNN and the 2D projection of the 3D template mesh keypoints:

Elmk = ∥Π(K) − L2d∥2   (1)

where Π denotes the 2D projection of the 3D keypoints.

2D Silhouette Alignment. Esil measures the overlap between the silhouette of M and the mask M̄ predicted by DenseCLIP:

Esil = MaskIoU(Sproj(M), M̄)   (2)

where Sproj(M) is the silhouette rendered by the differentiable mesh renderer SoftRas [17] and the MaskIoU loss is derived from Kaolin [9].

Merely minimizing Elmk and Esil does not lead to satisfactory results, and the optimization procedure can easily get trapped in local minima. To alleviate this issue, we introduce a couple of regularization terms. We first regularize the deformation using the as-rigid-as-possible loss Earap [28], which penalizes the deviation of estimated local surface deformations from rigid transformations. Moreover, we further enforce the normal consistency Enorm, which measures the normal consistency of each pair of neighboring faces. The overall optimization objective is given as

wsil Esil + wlmk Elmk + warap Earap + wnorm Enorm   (3)

where w∗ are the respective weights of the losses.
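For reference, a differentiable silhouette can be rendered with PyTorch3D's soft silhouette shader roughly as sketched below; here a plain soft IoU stands in for Kaolin's MaskIoU loss, and the camera and rasterization settings are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: differentiable silhouette rendering with PyTorch3D and a soft
# IoU term standing in for Kaolin's MaskIoU. Camera/raster settings are assumptions.
import torch
from pytorch3d.renderer import (FoVPerspectiveCameras, RasterizationSettings,
                                MeshRasterizer, MeshRenderer, SoftSilhouetteShader)

def build_silhouette_renderer(device, image_size=512):
    cameras = FoVPerspectiveCameras(device=device)
    raster_settings = RasterizationSettings(image_size=image_size,
                                            blur_radius=1e-4, faces_per_pixel=50)
    return MeshRenderer(rasterizer=MeshRasterizer(cameras=cameras,
                                                  raster_settings=raster_settings),
                        shader=SoftSilhouetteShader())

def silhouette_loss(renderer, mesh, ref_mask, eps=1e-6):
    sil = renderer(mesh)[..., 3]                 # alpha channel = soft silhouette
    inter = (sil * ref_mask).sum()
    union = sil.sum() + ref_mask.sum() - inter
    return 1.0 - inter / (union + eps)           # 1 - IoU, to be minimized
```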
We set large regularization weights warap, wnorm at the initial iterations. We then reduce their values progressively during the optimization procedure, so that the final rendered texture aligns with the input images. Please refer to the supp. material for more details.
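A compact view of the silhouette-matching loop, with the regularization weights decayed over iterations, might look like the sketch below; the individual energy callables and the weight schedule are assumptions standing in for the terms defined above.

```python
# Hedged sketch of the silhouette-matching optimization (Eq. 3). The energy
# callables E_* and weight_schedule are assumed to be provided elsewhere; this
# only illustrates the loop structure, not the released implementation.
import torch

def fit_deformation(A, T, E_sil, E_lmk, E_arap, E_norm, weight_schedule, steps=1000):
    opt = torch.optim.Adam([A, T], lr=1e-2)
    for it in range(steps):
        # weight_schedule returns (w_sil, w_lmk, w_arap, w_norm) for this step,
        # with w_arap and w_norm decayed progressively (see supp. material).
        w_sil, w_lmk, w_arap, w_norm = weight_schedule(it, steps)
        loss = (w_sil * E_sil(A, T) + w_lmk * E_lmk(A, T)
                + w_arap * E_arap(A, T) + w_norm * E_norm(A, T))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return A.detach(), T.detach()
```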

3.2.2 Image-based Optimization

After the shape of the template mesh is aligned with the image silhouette, we optimize the UV texture map T to minimize the difference between the rendered image Irend = Srend(M, T) and the given input catalog images Iin from both sides simultaneously. To avoid any outside interference during the optimization, we only preserve the ambient color and set both the diffuse and specular components to zero in the settings of SoftRas [17] and PyTorch3D [23].

Since the front and back views do not cover the full clothing texture, e.g. the seams between the front and back bodice cannot be recovered well due to occlusions, we use the total variation method [25] to fill in the blanks of the seam-affected UV areas. The total variation loss Etv is defined as the norm of the spatial gradients of the rendered image, ∇x Irend and ∇y Irend:

Etv = ∥∇x Irend∥2 + ∥∇y Irend∥2   (4)

In summary, the energy function for the image-based optimization is defined as below:

wimg ∥Iin − Irend∥2 + wtv Etv   (5)

where Iin and Irend are the reference and rendered images. As shown in Fig. 3, T implicitly changes towards the final coarse texture Tcoarse, which ensures that the final rendering is as similar as possible to the input. Please refer to our attached video for a vivid illustration.
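The total variation and image terms (Eqs. 4 and 5) translate directly into a few lines of PyTorch; this is a minimal sketch assuming the images are (B, C, H, W) tensors produced by the differentiable renderer.

```python
# Hedged sketch of the image-based optimization objective (Eqs. 4 and 5),
# assuming I_rend and I_in are (B, C, H, W) tensors in [0, 1].
import torch

def tv_loss(img):
    # L2 norm of the horizontal and vertical spatial gradients of the rendering
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def image_energy(I_rend, I_in, w_img=100.0, w_tv=1.0):
    return w_img * ((I_in - I_rend) ** 2).mean() + w_tv * tv_loss(I_rend)

# The UV texture T is the optimized variable; gradients flow through the
# differentiable renderer that produced I_rend = S_rend(M, T).
```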
3.3. Phase II: Fine texture generation

In Phase II, we refine the coarse texture from Sec. 3.2 and fill in the missing regions. Our approach takes inspiration from the strong and comprehensive capacity of Stable Diffusion (SD), which is a powerful model in image inpainting, completion, and text2image tasks. In fact, there is an entire, growing ecosystem around it: LoRA [12], ControlNet [39], textual inversion [10] and Stable Diffusion WebUI [1]. Therefore, a straightforward idea is to resolve our texture completion via SD.
However, we find poor content consistency between the inpainted blanks and the original textured UV. This is because UV data of our kind rarely appears in LAION-5B [26], the training dataset of SD; in other words, the semantic compositions of LAION-5B and UV (cloth) textures are quite different, which is challenging for SD to generalize to.

To address this issue, we first leverage ControlNet [39] to generate ∼2,000+ HQ complete textures per template and render emission-only images under the front and back views. Next, we use Phase I again to recover the corresponding coarse textures. After collecting the pairs of coarse and fine textures, we train an inpainting network to fill the missing regions in the coarse texture maps.

3.3.1 Diffusion-based Data Generation

We employ diffusion models [7, 24, 39] to generate realistic and diverse training data.

We generate texture maps following the UV template configuration, adopting the pre-trained ControlNet with edge maps as input conditions. ControlNet finetunes text-to-image diffusion models to incorporate additional structural conditions as input. The input edge maps are obtained through canny edge detection on the clothing-specific UVs, and the input text prompts are generated by applying image captioning models, namely Lavis-BLIP [16], OFA [32] and MPlug [15], to tens of thousands of clothes crawled from Amazon and Taobao.

After generating the fine UV texture maps, we are already able to generate synthetic front and back 2D catalog images, which will be used to train the inpainting network. We leverage the rendering power of Blender's native EEVEE engine to get the best visual result. A critical step of our approach is to perform data augmentation so that the inpainting network captures invariant features instead of details that differ between synthetic and testing images, which do not generalize. To this end, we vary the blendshape parameters of the template mesh to generate 2D catalog images in different shape and pose configurations and simulate self-occlusions, which frequently exist in reality and lead to erroneous textures as shown in Fig. 2. We handcraft three common blendshapes (Fig. 4) that are enough to simulate the diverse cloth-sleeve correlations/layouts found in reality.

Next, we run Phase I to produce coarse textures from the rendered synthetic 2D catalog images, yielding the coarse, defective textures corresponding to the fine textures. These pairs of coarse-fine textures serve as the training data for the subsequent inpainting network.
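The data generation step can be reproduced with the public diffusers ControlNet pipeline roughly as below; the checkpoint identifiers, file path, and prompt are illustrative assumptions, and our actual prompts come from the captioning models listed above.

```python
# Hedged sketch of canny-conditioned texture generation with a public
# ControlNet pipeline. Checkpoint IDs, path and prompt are illustrative only.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

uv_template = np.array(Image.open("tshirt_uv_template.png").convert("RGB"))  # assumed path
edges = cv2.Canny(uv_template, 100, 200)                  # edge map of the clothing UV
condition = Image.fromarray(np.stack([edges] * 3, axis=-1))

texture = pipe("a red t-shirt texture with floral pattern",  # caption-style prompt
               image=condition, num_inference_steps=30).images[0]
texture.save("synthetic_fine_texture.png")
```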
Figure 4. Illustration of the three sleeve-related blendshapes of our template mesh model. These blendshapes allow rendering clothing
images in diverse pose configurations to facilitate simulating real-world clothing image layouts.

3.3.2 Texture Inpainting

Given the training data simulated by LDMs, we then train our inpainting network. Note that we train a single network for all clothing categories, making it general-purpose.

For the inpainting network, we choose Pix2PixHD [33], which shows better results than alternative approaches such as conditional TransUNet [6] and ControlNet. One issue of Pix2PixHD is that it produces color-consistent output To, in contrast to prompt-guided ControlNet (please check our supp. material for a visual comparison). These results are compared with the input full UV as the condition. To address this issue, we first locate the missing holes, continuous edges and lines in the coarse UV as the residual mask Mr (left corner of the bottom row of Fig. 9). We then linearly blend those blank areas with the model's output during texture repair.

Formally speaking, we compute the output as below:

Tfine = BilateralFilter(Tcoarse + Mr ∗ To)   (6)

where BilateralFilter is a non-linear filter that blurs the irregular and rough seams between Tcoarse and To while keeping edges fairly sharp. More details can be seen in our attached video.
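Eq. 6 amounts to compositing the network output into the masked holes (where the coarse texture is blank, adding and compositing coincide) and then smoothing the seams; a minimal OpenCV/NumPy sketch, assuming 8-bit images and a binary residual mask, is given below.

```python
# Hedged sketch of the blending in Eq. 6: paste the network output T_o into the
# residual-mask regions of T_coarse, then bilateral-filter the seams.
# Assumes uint8 images of identical size and a binary hole mask M_r.
import cv2
import numpy as np

def blend_texture(T_coarse, T_o, M_r, d=9, sigma_color=75, sigma_space=75):
    m = (M_r > 0).astype(np.float32)[..., None]           # 1 inside holes, 0 elsewhere
    blended = T_coarse.astype(np.float32) * (1 - m) + T_o.astype(np.float32) * m
    blended = blended.astype(np.uint8)
    # The bilateral filter softens irregular seams while keeping edges sharp.
    return cv2.bilateralFilter(blended, d, sigma_color, sigma_space)
```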

4. Experiments

Our goal is to generate 3D garments from 2D catalog images. We verify the effectiveness of Cloth2Tex via thorough evaluation and comparison with state-of-the-art baselines. Furthermore, we conduct a detailed ablation study to demonstrate the effects of individual components.

4.1. Comparison with SOTA

We first compare our method with SOTA virtual try-on algorithms, covering both 3D and 2D approaches.

Comparison with 3D SOTA: We compare Cloth2Tex with SOTA methods that produce 3D mesh textures from 2D clothing images, including model-based Pix2Surf [20] and TPS-based Warping [19] (we replace the original MADF with a locally changed, UV-constrained Navier-Stokes method; the differences between our UV-constrained Navier-Stokes and the original version are described in the supp. material). As shown in Fig. 5, our method produces high-fidelity 3D textures with sharp, high-frequency details of the patterns on the clothing, such as the leaves and characters in the top row. In addition, our method accurately preserves the spatial configuration of the garment, particularly the overall aspect ratio of the patterns and the relative locations of the logos. In contrast, the baseline method Pix2Surf [20] tends to produce blurry textures due to a smooth mapping network, and the Warping [19] baseline introduces undesired spatial distortions (e.g., second row in Fig. 5) due to sparse correspondences.

Figure 5. Comparison with Pix2Surf [20] and Warping [19] on T-shirts. Please zoom in for more details.

Comparison with 2D SOTA: We further compare Cloth2Tex with 2D virtual try-on methods: flow-based DAFlow [2] and the StyleGAN-enhanced Deep-Generative-Projection (DGP) [8]. As shown in Fig. 6, Cloth2Tex achieves better quality than 2D virtual try-on methods in sharpness and semantic consistency. More importantly, our outputs, namely 3D textured clothing meshes, are naturally compatible with cloth physics simulation, allowing the synthesis of realistic try-on effects in various body poses. In contrast, 2D methods rely on priors learned from training images and are hence limited in their generalization ability to extreme poses outside the training distribution.

Figure 6. Comparison with 2D virtual try-on methods, including DAFlow [2] and DGP [8].

User Study: Finally, we conduct a user study to evaluate the overall perceptual quality of our results and their consistency with the provided input catalog images, compared against 2D and 3D baselines. We consider DGP the 2D baseline and TPS the 3D baseline due to their best performance among existing work. Each participant is shown three randomly selected pairs of results, one produced by our method and the other by one of the baseline methods. The participant is requested to choose the one that appears more realistic and matches the reference clothing image better. In total, we received 643 responses from 72 users aged between 15 and 60. The results are reported in Fig. 7. Compared to DGP [8] and TPS, Cloth2Tex is favored by the participants with preference rates of 74.60% and 81.65%, respectively. This user study verifies the quality and consistency of our method.

Figure 7. User preferences among 643 responses from 72 participants. Our method is favored by significantly more users.

4.2. Ablation Study

To demonstrate the effect of individual components in our pipeline, we perform an ablation study for both stages.

Neural Rendering vs. TPS Warping: TPS warping has been widely used in previous work on generating 3D garment textures. However, we found that it suffers from the challenging cases illustrated in Fig. 2, so we propose a new pipeline based on neural rendering. We compare our method with TPS warping quantitatively to verify this design choice. Our test set consists of 10+ clothing categories, including T-shirts, Polos, sweatshirts, jackets, hoodies, shorts, trousers, and skirts, with 500 samples per category. We report the structural similarity (SSIM [36]) and peak signal-to-noise ratio (PSNR) between the recovered textures and the ground-truth textures.

As shown in Tab. 1, our neural rendering-based pipeline achieves superior SSIM and PSNR compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture. We conduct a comprehensive comparison study of various inpainting methods in the supp. material.
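For completeness, the two metrics can be computed with scikit-image as in the short sketch below; the uint8 image arrays and the data range are assumptions about the evaluation setup.

```python
# Hedged sketch of the evaluation metrics: SSIM and PSNR between a recovered
# texture and its ground truth, both assumed to be uint8 RGB arrays.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_texture(recovered: np.ndarray, ground_truth: np.ndarray):
    ssim = structural_similarity(recovered, ground_truth,
                                 channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(ground_truth, recovered, data_range=255)
    return ssim, psnr
```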
Table 1. Neural Rendering vs. TPS Warping. We evaluate the texture quality of neural rendering and TPS-based warping, with and without inpainting.

Baseline   Inpainting   SSIM ↑   PSNR ↑
TPS        None         0.70     20.29
TPS        Pix2PixHD    0.76     23.81
Phase I    None         0.80     21.72
Phase I    Pix2PixHD    0.83     24.56

Figure 8. Ablation Study on Phase I. From left to right: base, base + total variation loss Etv, base + Etv + automatic scaling.

Total Variation Loss & Automatic Scaling (Phase I): As shown in Fig. 8, when dropping the total variation loss Etv and automatic scaling, the textures are incomplete and cannot maintain a semantically correct layout. With Etv, Cloth2Tex produces more complete textures by exploiting the local consistency of textures. Further applying automatic scaling results in better alignment between the template mesh and the input images, yielding a more semantically correct texture map.
Figure 9. Comparison with SOTA inpainting methods (Navier-Stokes [4], LaMa [30], MADF [40] and Stable Diffusion v2 [24]) on texture inpainting. The upper left corner of each column shows the conditional mask input. Blue in the first column shows that our method is capable of maintaining consistent boundary and curvature w.r.t. the reference image, while green highlights the blank regions that need inpainting.

Inpainting Methods (Phase II): Next, to demonstrate the need for training an inpainting model specifically for UV clothing textures, we compare our task-specific inpainting model with general-purpose inpainting algorithms, including the Navier-Stokes [4] algorithm and off-the-shelf deep learning models, namely LaMa [30], MADF [40] and Stable Diffusion v2 [24] with pre-trained checkpoints. Here, we modify the traditional Navier-Stokes [4] algorithm into a UV-constrained version, because a texture map occupies only part of the whole squared image grid and the plentiful non-UV regions adversely affect texture inpainting (please see the supp. material for a comparison).

As shown in Fig. 9, our method, trained on our synthetic dataset generated by the diffusion model, outperforms general-purpose inpainting methods in the task of refining and completing clothing textures, especially in terms of the color consistency between the inpainted regions and the original image.

4.3. Limitations

As shown in Fig. 10, Cloth2Tex can produce high-quality textures for common garments, e.g. T-shirts, shorts and trousers (blue bounding box (bbox)). However, we have observed that it has difficulty recovering textures for garments with complex patterns: e.g. inaccurate and inconsistent local textures (belt, collarband) occur on the windbreaker (red bbox). We attribute this to the extra accessories on the garment, which inevitably add partial textures on top of the main UV.

Another imperfection is that our method cannot maintain the uniformity of checked shirts with densely assembled grids: as shown in the second row of Fig. 6, our method is inferior to 2D VTON methods in preserving textures comprised of thousands of fine and tiny checkerboard-like grids; checked shirts and pleated skirts are representative garments of this kind.

We boil this down to the subtle position changes during the deformation graph optimization, which eventually cause the template mesh to become less uniform, since the regularization terms, i.e. as-rigid-as-possible, are not very strong constraints for obtaining a conformal mesh. We acknowledge this challenge and leave it to future work to explore the possibility of generating a homogeneous mesh with uniformly-spaced triangles.

5. Conclusion

This paper presents a novel pipeline, Cloth2Tex, for synthesizing high-quality textures for 3D meshes from pictures taken from only the front and back views. Cloth2Tex adopts a two-stage process to obtain visually appealing textures, where Phase I offers coarse texture generation and Phase II performs texture refinement. Training a generalized texture inpainting network is non-trivial due to the high topological variability of UV space; therefore, obtaining paired data under such circumstances is important. To the best of our knowledge, this is the first study to combine a diffusion model with a 3D engine (Blender) to collect coarse-fine paired textures for 3D texturing tasks. We show the generalizability of this approach on a variety of examples.
Figure 10. Visualization of 3D virtual try-on. We obtain textured 3D meshes from 2D reference images shown on the left. The 3D meshes
are then draped onto 3D humans.
To avoid distortion and stretched artifacts across clothes, we automatically adjust the scale of the vertices of the template meshes and thus best prepare them for the later image-based optimization, which effectively guides the implicitly learned texture towards a complete and distortion-free structure. Extensive experiments demonstrate that our method can effectively synthesize consistent and highly detailed textures for typical clothes without extra manual effort.

In summary, we hope our work can inspire more future research in 3D texture synthesis and shed some light on this area.
References

[1] AUTOMATIC1111. Stable Diffusion web UI. https://github.com/AUTOMATIC1111/stable-diffusion-webui, 2022. 5
[2] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 409–425. Springer, 2022. 6, 7
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. 2, 3
[4] Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), pages I–I. IEEE, 2001. 8, 1
[5] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2Tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023. 3, 1
[6] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. 5, 1, 4
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3, 5
[8] Ruili Feng, Cheng Ma, Chengji Shen, Xin Gao, Zhenjiang Liu, Xiaobo Li, Kairi Ou, Deli Zhao, and Zheng-Jun Zha. Weakly supervised high-fidelity clothing model generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3440–3449, 2022. 6, 7
[9] Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. Kaolin: A PyTorch library for accelerating 3D deep learning research. https://github.com/NVIDIAGameWorks/kaolin, 2022. 4
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 5
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 5
[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 1
[14] Mikhail Konstantinov, Alex Shonenkov, Daria Bakshandaeva, and Ksenia Ivanova. DeepFloyd: Text-to-image model with a high degree of photorealism and language understanding. https://deepfloyd.ai/, 2023. 1
[15] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022. 5, 1
[16] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. LAVIS: A library for language-vision intelligence, 2022. 5, 1
[17] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. In The IEEE International Conference on Computer Vision (ICCV), 2019. 2, 3, 4, 5
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 2
[19] Sahib Majithia, Sandeep N Parameswaran, Sadbhavana Babar, Vikram Garg, Astitva Srivastava, and Avinash Sharma. Robust 3D garment digitization from monocular 2D images for 3D virtual try-on systems. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3428–3438, 2022. 1, 2, 3, 6
[20] Aymen Mir, Thiemo Alldieck, and Gerard Pons-Moll. Learning to transfer texture from clothing images to 3D humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7023–7034, 2020. 1, 2, 3, 6
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017. 1
[22] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 4
[23] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501, 2020. 5, 1
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 5, 8, 1
[25] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992. 5
[26] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022. 3, 5
[27] Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Texturify: Generating textures on 3D shape surfaces. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 72–88. Springer, 2022. 2
[28] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, pages 109–116, 2007. 4
[29] Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es, 2007. 3
[30] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161, 2021. 8, 1
[31] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023. 3
[32] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022. 5, 1
[33] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018. 5, 1, 4
[34] Tuanfeng Y. Wang, Duygu Ceylan, Jovan Popovic, and Niloy J. Mitra. Learning a shared shape space for multimodal garment design. ACM Transactions on Graphics, 37(6):1:1–1:14, 2018. 3
[35] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4271–4280, 2018. 2, 4
[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 7
[37] Yi Xu, Shanglin Yang, Wei Sun, Li Tan, Kefeng Li, and Hui Zhou. 3D virtual garment modeling from RGB images. In 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 37–45. IEEE, 2019. 1
[38] Rui Yu, Yue Dong, Pieter Peers, and Xin Tong. Learning texture generators for 3D shape collections from internet photo sets. In British Machine Vision Conference, 2021. 2
[39] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 2, 5, 1, 4
[40] Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021. 3, 8
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
Supplementary Material
6. Implementation Details

In Phase I, we fix the number of optimization steps for both silhouette matching and image-based optimization to 1,000, so that each coarse texture generation process takes less than 1 minute to complete on an NVIDIA Ampere A100 (80GB VRAM). The initial weights of the energy terms are wsil = 50, wlmk = 0.01, warap = 50, wnorm = 10, wimg = 100, wtv = 1; we then use a cosine scheduler to decay warap and wnorm to 5 and 1, respectively.

During the Blender-enhanced rendering process, we augment the data by randomly sampling the blendshapes of the upper cloth in the range [0.1, 1.0]. The synthetic images were rendered with the Blender EEVEE engine at a resolution of 512², emission only (disentangled from the impact of shading, which is a notoriously difficult problem, as dissected in Text2Tex [5]).

The synthetic data used for training the texture inpainting network are produced by a pretrained ControlNet from prompts (generated by Lavis-BLIP [16], OFA [32] and MPlug [15]) and UV templates (UV maps manually crafted by artists), as shown in Fig. 14; they cover more garment types than previous methods, e.g. Pix2Surf [20] (4) and Warping [19] (2).

The only trainable component, Pix2PixHD in Phase II, is optimized by Adam [13] with lr = 2e−4 for 200 epochs. Our implementation is built on top of PyTorch [21] alongside PyTorch3D [23] for silhouette matching, rendering and inpainting.

The detailed parameters of the template meshes in Cloth2Tex are summarized in Tab. 4; sketches of all template meshes and their UV maps are shown in Fig. 12 and Fig. 13, respectively.
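The cosine decay of warap and wnorm described above can be written, for instance, as the small helper below; the exact scheduler form is an assumption, and only the endpoint values are taken from the text.

```python
# Hedged sketch of the cosine decay used for w_arap and w_norm: from their
# initial values (50, 10) down to (5, 1) over the optimization steps.
import math

def cosine_decay(step, total_steps, w_start, w_end):
    t = min(step, total_steps) / total_steps
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * t))

def weight_schedule(step, total_steps=1000):
    w_sil, w_lmk = 50.0, 0.01                       # kept fixed
    w_arap = cosine_decay(step, total_steps, 50.0, 5.0)
    w_norm = cosine_decay(step, total_steps, 10.0, 1.0)
    return w_sil, w_lmk, w_arap, w_norm
```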

Table 2. SOTA inpainting methods acting on our synthetic data.

Baseline   Inpainting                    SSIM ↑
Phase I    None                          0.80
Phase I    Navier-Stokes [4]             0.80
Phase I    LaMa [30]                     0.78
Phase I    Stable Diffusion (v2) [24]    0.77
Phase I    Deep Floyd [14]               0.80

Table 3. Inpainting methods trained on our synthetic data.

Baseline   Inpainting            SSIM ↑
Phase I    None                  0.80
Phase I    Cond-TransUNet [6]    0.78
Phase I    ControlNet [39]       0.77
Phase I    Pix2PixHD [33]        0.83

7. Self-modified UV-constrained Navier-Stokes Method

Figure 11. Visualization of the Navier-Stokes method on the UV template. Our locally constrained NS method fills the blanks thoroughly (though with a lack of precision) compared to the original global counterpart.

In Fig. 11, we display the results of our self-modified UV-constrained Navier-Stokes (NS) method (local) versus the original NS (global) method. Specifically, we add a reference branch (the UV template) to NS and thus confine the inpainting-affected region to the given UV template of each garment, which contributes directly to the interpolation result. Our locally constrained NS method allows blanks to be filled thoroughly compared to the original global NS method.

The sole aim of modifying the original global NS method is to conduct a fair comparison with the deep learning based methods, as depicted in the main paper.

Noteworthily, for small blank areas (e.g. columns 1 and 3 of Fig. 11), texture uniformity and consistency are well preserved, producing plausible textures.
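A UV-constrained variant of OpenCV's Navier-Stokes inpainting can be sketched as follows: the inpainting mask is intersected with the UV-template mask so that only blanks inside the UV layout are filled. The mask conventions and the blank-pixel test are assumptions made for illustration.

```python
# Hedged sketch of UV-constrained Navier-Stokes inpainting: restrict the
# inpainting mask to blank pixels that lie inside the garment's UV template,
# so non-UV background does not pollute the interpolation.
import cv2
import numpy as np

def uv_constrained_ns(coarse_texture, uv_template_mask, inpaint_radius=3):
    """coarse_texture: uint8 BGR image; uv_template_mask: uint8, 255 inside the UV layout."""
    blank = (coarse_texture.max(axis=2) == 0).astype(np.uint8) * 255   # unfilled pixels
    holes = cv2.bitwise_and(blank, uv_template_mask)                   # blanks inside the UV only
    filled = cv2.inpaint(coarse_texture, holes, inpaint_radius, cv2.INPAINT_NS)
    # Keep the result strictly inside the UV layout.
    return cv2.bitwise_and(filled, filled, mask=uv_template_mask)
```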
Figure 12. Visualization of all template meshes used in Cloth2Tex.

Figure 13. All UV maps of template meshes used in Cloth2Tex.


Table 4. Detailed parameters of the template meshes in Cloth2Tex. As shown in the table, each template has fewer than 10,000 vertices and all are animatable by means of Style3D, which is the best-fit software for clothing animation.

Category Vertices Faces Key Nodes (Deformation Graph) Animatable


T-shirts 8,523 16,039 427 ✓
Polo 8,922 16,968 447 ✓
Shorts 8,767 14,845 435 ✓
Trousers 9,323 16,995 466 ✓
Dress 7,752 14,959 388 ✓
Skirt 6,116 11,764 306 ✓
Windbreaker 9,881 17,341 494 ✓
Jacket 8,168 15,184 409 ✓
Hoodie (Zipup) 8,537 15,874 427 ✓
Sweatshirt 9,648 18,209 483 ✓
One-piece Dress 9,102 17,111 455 ✓

Figure 14. Texture maps for training the instance-map-guided Pix2PixHD, synthesized by ControlNet from canny edges.
Figure 15. Comparison with representative image2image methods with conditional input: the autoencoder-based TransUNet [6] (we modify the base model and add an extra branch for the UV map, aiming to train it on all garment types together), the diffusion-based ControlNet [39] and the GAN-based Pix2PixHD [33]. It is rather obvious that the prompt-sensitive ControlNet is limited in recovering globally color-consistent texture maps. The upper right corner of each method shows the conditional input.

8. Efficiency of Mainstream Inpainting Methods

As depicted in the main paper, our neural rendering based pipeline achieves superior SSIM compared to TPS warping. This improvement is also preserved after inpainting and refinement, leading to a much better quality of the final texture.

Free from the page limit of the main paper, here we conduct a comprehensive comparison study of various inpainting methods acting directly on the coarse texture maps derived from Phase I, to assess the efficiency of mainstream inpainting methods.

First, we compare the state-of-the-art inpainting methods quantitatively on our synthetic coarse-fine paired dataset. Note that the checkpoints of all deep learning based inpainting methods are openly and freely available; no finetuning or modification is involved in this comparison. As described in Tab. 2, none of these methods produces a noticeable positive impact on the SSIM score compared to the original coarse texture (the "None" row).

Next, we revise TransUNet [6] to take a conditional UV map as input, for unity of input and output with ControlNet [39] and Pix2PixHD [33]. We then train cond-TransUNet, ControlNet, and Pix2PixHD on the synthetic data for a fair comparison. All three take the original coarse texture maps and the conditional UV maps as input and output fine texture maps. The choice of TransUNet, ControlNet, and Pix2PixHD covers the main generative paradigms: TransUNet is a basic autoencoder-based supervised image2image model, ControlNet is a diffusion-based generative model, and Pix2PixHD is a GAN-based generative model. We want to explore the feasibility of these methods for our task. As depicted in Tab. 3 and Fig. 15, Pix2PixHD is superior in obtaining satisfactory texture maps from both qualitative and quantitative views.
