3D Virtual Garment Modeling from RGB Images

Yi Xu* (OPPO US Research Center)    Shanglin Yang† (JD.COM American Technologies Corporation)    Wei Sun‡ (North Carolina State University)    Li Tan§ (JD.COM)
Kefeng Li¶ (JD.COM)    Hui Zhou|| (JD.COM American Technologies Corporation)

*e-mail: [email protected], currently with OPPO US Research Center. The work was done when Yi Xu was with JD.
†e-mail: [email protected]
‡e-mail: [email protected]. The work was done when Wei Sun was with JD.
§e-mail: [email protected]
¶e-mail: [email protected]
||e-mail: [email protected]

ABSTRACT

We present a novel approach that constructs 3D virtual garment models from photos. Unlike previous methods that require photos of a garment on a human model or a mannequin, our approach can work with various states of the garment: on a model, on a mannequin, or on a flat surface. To construct a complete 3D virtual model, our approach only requires two images as input, one front view and one back view. We first apply a multi-task learning network called JFNet that jointly predicts fashion landmarks and parses a garment image into semantic parts. The predicted landmarks are used for estimating sizing information of the garment. Then, a template garment mesh is deformed based on the sizing information to generate the final 3D model. The semantic parts are utilized for extracting color textures from the input images. The results of our approach can be used in various Virtual Reality and Mixed Reality applications.

Index Terms: Computing methodologies—Computer graphics—Graphics systems and interfaces; Computing methodologies—Artificial intelligence—Computer vision—Computer vision problems;

1 INTRODUCTION

Building 3D models of fashion items has many applications in Virtual Reality, Mixed Reality, and Computer-Aided Design (CAD) for the apparel industry. A lot of commercial effort has been put into this field. For example, there are a few CAD software systems created for 3D garment design, such as Marvelous Designer and Optitex, but most of them focus on creating 3D garment models based on 2D sewing patterns. Recently, a few e-commerce platforms have begun to use 3D virtual garments to enhance online shopping experiences. However, large variation, short fashion product life cycles, and high modeling costs make it difficult to use virtual garments on a regular basis. This necessitates a simple yet effective approach to 3D garment modeling.

There has been a lot of research on creating 3D virtual garment models. Some methods use specialized multi-camera setups to capture the 4D evolving shape of garments [7, 26]. These setups are complicated, which limits their usage. Other methods take 2D sewing patterns [6] or 2D sketches [29] as input and build 3D models that can be easily manufactured. Although these methods use 2D images as input, they still rely on careful and lengthy design by expert users. Another group of methods deforms/reshapes 3D template meshes to design garments that best fit 3D digital human models [23]. This can be overkill in applications where an accurate design is not needed. Recently, there have been some methods that create 3D garment models from a single image or a pair of images [10, 17, 40, 45]. All of these methods assume the garment is worn by a human model or a mannequin and therefore do not provide the convenience of working with readily available photos.

Figure 1: Two product photo sets (left) on an e-commerce site and 3D textured models (right) computed using two photos from each input set.

We propose a method that can construct 3D virtual garment models from photos that are available on the web, especially on e-commerce sites. Fig. 1 shows two examples. Each photo set displays several different views of a piece of garment on a fashion model, on a mannequin, or flattened on a support surface. To generate a 3D virtual model, a user needs to specify one front and one back image of the garment. The generated 3D model is up to a scale, but can have absolute scale if the user specifies a real-world measurement (e.g., sleeve length in meters).

We train a multi-task learning network, called JFNet, to predict fashion landmarks and segment a garment image into semantic parts (i.e., left sleeve, front piece, etc.). Based on the landmark predictions, we estimate sizing information of the garment and deform a template mesh to match the estimated measurements. We then deform the semantic parts onto a 2D reference texture to lift textures. It is worth noting that our method is capable of using a single image as input if front-back symmetry is assumed for a garment. Our contributions are as follows:

• We present a complete and easy-to-use approach that generates a 3D textured garment model from a product photo set. T-shirts and pants are modeled in this paper; however, our approach can be extended to other garment types.
• We propose a multi-task learning framework that predicts fashion landmarks and segments a garment image into semantic parts.

• We present algorithms for size estimation and texture extraction from garment images.

2 RELATED WORK

In this section, we discuss related work in garment modeling, joint human body and garment shape estimation, semantic parsing of fashion images, and image-based virtual try-on.

2.1 Garment Modeling and Capturing

Garment modeling methods can be classified into the following three categories: geometric approaches, image-based 3D reconstruction, and image-based template reshaping.

2.1.1 Geometric Approaches

Methods in this category typically have roots in the CAD community. Wang et al. [35] automated the Made-to-Measure (MtM) process by fitting 3D feature templates of garments onto different body shapes. Meng et al. [23] proposed a method that preserves the shape of user-defined features on apparel products during the automatic MtM process.

Other methods use 2D sketches or patterns as input. For example, Decaudin et al. [12] fitted garment panels to contours and seam-lines that are sketched around a virtual mannequin. These panels are then approximated with developable surfaces for garment manufacturing. Robson et al. [29] created 3D garments that are suitable for virtual environments from simple user sketches using context-aware sketch interpretation. Berthouzoz et al. [6] proposed an approach that parses existing sewing patterns and converts them into 3D models. Wang et al. [36] presented a system that is capable of estimating garment and body shape parameters interactively using a learning approach. All of these methods rely on a certain level of tailoring expertise from users.

2.1.2 Image-based 3D Reconstruction

Some approaches aim to create 3D models directly from input images and/or videos of a garment. Early work by White et al. [38] used a custom set of color markers printed on the cloth surface to recover a 3D mesh of dynamic cloth with consistent connectivity. Markerless approaches were also developed using a multi-camera setup [7], multi-view 3D scans with active stereo [26], or depth cameras [9]. These methods require specialized hardware and do not work with existing garment photos.

2.1.3 Shape Parameter Estimation

Our approach is most similar to methods that utilize parametric models of humans and/or garments. Zhou et al. [45] took a single image of a human wearing a garment as input. Their approach first estimates human pose and shape from the image using parameter reshaping. Then, a semi-automatic approach is used to create an initial 3D mesh for the garment. Finally, shape-from-shading is used to recover details. Their method requires user input for pose estimation and garment outline labeling, assumes the garment is front-back symmetric, and does not extract textures from the input image.

Jeong et al. [17] fitted parameterized pattern drafts to input images by analyzing silhouettes. However, their method requires input images of a mannequin both with and without the garment from the same viewpoint. Yang et al. [40] used semi-automatic processing to extract semantic information from a single image of a model wearing the garment and used optimization with a physics-inspired objective function to estimate garment parameters. Compared to this method, our method provides a more advanced joint learning model for semantic parsing.

The DeepGarment framework proposed by Daněřek et al. [10] learns a mapping from garment images to a 3D model using Convolutional Neural Networks (CNNs). More specifically, the learned network can predict displacements of vertices from a template mesh. However, garment texture is not learned.

2.2 Joint Human Body and Garment Shape Estimation

There have been many efforts that address the challenging problem of joint human body and garment shape estimation.

Alldieck et al. [3] reconstructed detailed shape and texture of a clothed human by transforming a large number of dynamic human silhouettes from a single RGB sequence to a common reference frame. Later, the same authors introduced a learning approach that only requires a few RGB frames as input [1]. Natsume et al. [24] reconstructed a complete and textured 3D model of a clothed person using just one image. In their work, a deep visual hull algorithm is used to predict 3D shape from silhouettes, and a Generative Adversarial Network (GAN) is used to infer the appearance of the back of the human subject. Habermann et al. [14] presented a system for real-time tracking of human performance, but relied on a personalized and textured 3D model that was captured during a pre-processing step. These works do not separate the underlying body shape from the garment geometry.

Using an RGBD camera as the input device, body shape and garment shape can be separated. For example, Zhang et al. [43] reconstructed naked human shape under clothing. Yu et al. [41] used a double-layer representation to reconstruct the geometry of both body and clothing. Physics-based cloth simulation can also be incorporated into the framework to better track human performance [42].

2.3 Fashion Semantic Parsing

In this section, we review related work in fashion landmark prediction, semantic segmentation, and multi-task learning.

2.3.1 Fashion Landmark Prediction

Fashion landmark prediction is a structured prediction problem for detecting functional key points, such as corners of cuffs, collars, etc. Despite being a relatively new topic [21], it has roots in a related problem: human pose estimation. Early work on human pose estimation used pictorial structures to model spatial correlation between human body parts [4]. Such methods only work well when all body parts are visible, so that the structure can be modeled by graphical models. Later on, hierarchical models were used to model part relationships at multiple scales [33]. Spatial relationships can also be learned implicitly using a sequential prediction framework, such as Pose Machines [27]. CNNs can also be integrated into Pose Machines to jointly learn image features and spatial context features [37].

Different from human pose, fashion landmark detection predicts functional key points of fashion items. Liu et al. proposed a Deep Fashion Alignment (DFA) [21] framework that cascades CNNs in three stages similar to DeepPose [34]. To achieve scale invariance and remove background clutter, DFA assumes that bounding boxes are known during training and testing, thus limiting its usage. This constraint was later removed in the Deep LAndmark Network (DLAN) [39]. It is worth noting that the landmarks defined in these approaches cannot be used for texture extraction. For example, a mid-point on the cuff is a landmark defined in their work. In our work, the two corners of the cuff are predicted, and they carry critical information for texture extraction.
Figure 2: System Overview. For each input image, we jointly predict landmark locations and segment the garment into semantic parts using the proposed JFNet. The predicted landmarks are used to guide the deformation of a 3D template mesh. The segmented parts are used to extract garment textures. Finally, a 3D textured garment model is produced.

2.3.2 Semantic Segmentation

Semantic segmentation assigns a semantic label to each pixel. CNNs have been successfully applied to this task. Long et al. proposed Fully Convolutional Networks (FCNs) for semantic segmentation [22], which achieved significant improvements over methods that relied on hand-crafted features. Built upon FCNs, encoder-decoder architectures have shown great success [5, 30]. Such an architecture typically has an encoder that reduces the feature map and a decoder that maps the encoded information back to the input resolution. Spatial Pyramid Pooling (SPP) can also be applied at several scales to leverage multi-scale information [44]. DeepLabV3+ [8] combines the benefits of both SPP and the encoder-decoder architecture to achieve state-of-the-art results. Our part segmentation sub-network is based on the DeepLabV3+ architecture. Similar to our work, Alldieck et al. [2] also used human semantic part segmentation to extract detailed textures from RGB sequences.

2.3.3 Multi-task Learning

Multi-task learning (MTL) has been used successfully for many applications due to the inductive bias it achieves when training a model to perform multiple tasks. Recently, it has been applied to several computer vision tasks. Kokkinos introduced UberNet [18], which can jointly handle multiple computer vision tasks, ranging from semantic segmentation and human parts to object detection. Ranjan et al. proposed HyperFace [28] for simultaneously detecting faces, localizing landmarks, estimating head pose, and identifying gender. Perhaps the most similar work to ours is JPPNet [20]. It is a joint human parsing and pose estimation network, while our work uses MTL for garment image analysis. Another MTL work on human parsing from the same group is [13], where semantic part segmentation and instance-aware edge detection are jointly learned.

2.4 Image-based Virtual Try-on

As an alternative to 3D modeling, image-based virtual try-on has also been explored. Neverova et al. [25] used a two-stream network where a data-driven predicted image and a surface-based warped image are combined, and the whole network is learned end-to-end to generate a new pose of a person. Lassner et al. [19] used only image information to predict images of new people in different clothing items. VITON [15], on the other hand, transfers the image of a new garment onto a photo of a person.

3 OUR APPROACH

In this section, we explain our approaches to garment image parsing, 3D model creation, and texture extraction. Fig. 2 shows an overview of our approach.

3.1 Data Annotation

To train JFNet, we built a dataset with both fashion landmarks and pixel-level segmentation annotations. We collected 3,000 images of tops (including T-shirts) and another 3,000 images of pants from the web. For each type of garment, a set of landmarks is defined based on fashion design. 13 landmarks are defined for tops, including the center and corners of the neckline, the corners of both cuffs, the end points of the hemline, and the armpits. 7 landmarks are defined for pants, including the end points of the waistband, the crotch, and the end points of the bottom.

For part segmentation, we defined a set of labels and asked the annotators to provide pixel-level labeling. For tops, we used 5 labels: left-sleeve, right-sleeve, collar, torso, and hat. For pants, we used 2 labels: left-part and right-part. Some labeling examples are shown in Fig. 3.

3.2 Garment Image Parsing

Our joint garment parsing network JFNet is built upon Convolutional Pose Machines (CPMs) [37] for landmark prediction and DeepLabV3+ [8] for semantic segmentation.

The network architecture of JFNet is illustrated in Fig. 4. We use ResNet-101 [16] as our backbone network to extract low-level features. Then we use two branching networks to obtain landmark prediction and part segmentation. Finally, we use a refinement network to refine the prediction results.
3.2.1 Landmark Prediction

For landmark prediction (bottom half of Fig. 4), we use a learning network with T stages similar to that of [37]. At the first stage, we extract the second-stage outputs of ResNet-101 (Res-2) followed by a 3x3 convolutional layer as low-level features from the input image. Then, we use two 1x1 convolutional layers to predict the landmark heatmap at the first stage. At each of the subsequent stages, we concatenate the landmark heatmap predicted at the previous stage with the shared low-level features from Res-2. Then we use five convolutional layers followed by two 1x1 convolutional layers to predict the heatmap at the current stage. The architecture repeats this process for T stages, where the size of the receptive field increases with each stage. This is crucial for learning long-range relationships between fashion landmarks. The heatmap at each stage is compared against the labeled ground truth and counted towards the total training loss.
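To make the stage-wise design concrete, the sketch below (in PyTorch) mirrors the description above: stage 1 predicts heatmaps from the shared Res-2 features, and every later stage re-predicts them from the previous heatmaps concatenated with those features, with a loss at every stage. The channel widths, the StageBlock/LandmarkBranch names, and the use of an MSE heatmap loss are illustrative assumptions, not the exact JFNet configuration.

```python
import torch
import torch.nn as nn

class StageBlock(nn.Module):
    """One refinement stage: previous heatmaps + shared features -> new heatmaps.
    Five convolutions followed by two 1x1 convolutions, as described in the text;
    kernel sizes and widths are illustrative."""
    def __init__(self, feat_ch, num_landmarks, mid_ch=128):
        super().__init__()
        in_ch = feat_ch + num_landmarks
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_landmarks, 1),
        )

    def forward(self, prev_heatmap, shared_feat):
        return self.convs(torch.cat([prev_heatmap, shared_feat], dim=1))

class LandmarkBranch(nn.Module):
    def __init__(self, feat_ch=256, num_landmarks=13, num_stages=3):
        super().__init__()
        # Stage 1: two 1x1 convolutions on the shared low-level features.
        self.stage1 = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_landmarks, 1),
        )
        self.stages = nn.ModuleList(
            [StageBlock(feat_ch, num_landmarks) for _ in range(num_stages - 1)]
        )

    def forward(self, shared_feat):
        heatmaps = [self.stage1(shared_feat)]
        for stage in self.stages:
            heatmaps.append(stage(heatmaps[-1], shared_feat))
        return heatmaps  # one heatmap tensor per stage (intermediate supervision)

def landmark_loss(heatmaps, gt_heatmap):
    # Sum of per-stage losses against the labeled ground-truth heatmap.
    # MSE is a common choice for heatmap regression; the paper does not
    # state the exact loss form.
    return sum(nn.functional.mse_loss(h, gt_heatmap) for h in heatmaps)

branch = LandmarkBranch(num_landmarks=13)    # 13 landmarks for tops
maps = branch(torch.randn(1, 256, 64, 64))   # list of per-stage heatmaps
```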
3.2.2 Garment Part Segmentation

For semantic garment part segmentation (top half of Fig. 4), we follow the encoder architecture of DeepLabV3+ [8]. An Atrous Spatial Pyramid Pooling (ASPP) module, which can learn context information at multiple scales effectively, is applied after the last-stage output of ResNet-101, followed by one 1x1 convolutional layer and up-sampling.
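As a rough illustration of this branch, the following sketch builds a DeepLabV3+-style ASPP head on top of last-stage backbone features. The dilation rates (a 1x1 branch plus rates 6, 12, 18 and image pooling) follow the common DeepLabV3+ configuration suggested by Fig. 4; the channel widths and the extra background class are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """DeepLabV3+-style Atrous Spatial Pyramid Pooling head (sketch)."""
    def __init__(self, in_ch=2048, mid_ch=256, num_classes=6):
        super().__init__()
        rates = (6, 12, 18)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 1)]
            + [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, mid_ch, 1)
        )
        self.project = nn.Conv2d(mid_ch * 5, mid_ch, 1)
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, x, out_size):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        y = self.project(torch.cat(feats + [pooled], dim=1))
        # 1x1 classification followed by up-sampling to the input resolution.
        return F.interpolate(self.classifier(y), size=out_size,
                             mode="bilinear", align_corners=False)

# Example: last-stage ResNet-101 features for a 256x256 input (stride 16 -> 16x16).
aspp = ASPP(num_classes=6)  # 5 garment labels for tops + background (assumed)
logits = aspp(torch.randn(1, 2048, 16, 16), out_size=(256, 256))
```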
3.2.3 Refinement

To refine landmark prediction and part segmentation, and to let the two tasks promote each other, we concatenate the landmark prediction result from the T-th stage of the landmark sub-network, the part segmentation result from the segmentation sub-network, and the shared low-level features. We then apply a 3x3 convolutional layer for landmark prediction and for part segmentation, respectively. The sum of the losses from both branches is used to jointly train the network end-to-end.

3.2.4 Training Details

We load ResNet-101 parameters that are pre-trained on the ImageNet classification task. During training, random crops and random rotations between -10 and 10 degrees are applied for data augmentation, and the input image is resized to 256x256. We adopt the SGD optimizer with a momentum of 0.9. The learning rate is initially set to 0.001 and decayed with the "poly" schedule [44] to 10−6 over 100 total training epochs.
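The optimization setup can be written out as below. Since the text states only the initial (0.001) and final (10−6) learning rates, this sketch assumes the usual "poly" form lr = base_lr · (1 − step/total_steps)^0.9 from [44]; the stand-in model and step counts are placeholders.

```python
import torch
import torch.nn as nn

def poly_lr(base_lr: float, final_lr: float, step: int, total_steps: int,
            power: float = 0.9) -> float:
    """'Poly' learning-rate schedule [44]; power=0.9 is the commonly used value.
    The paper only states the initial (1e-3) and final (1e-6) rates."""
    lr = base_lr * (1.0 - step / float(total_steps)) ** power
    return max(lr, final_lr)

# Stand-in model; in practice this would be the JFNet described above.
model = nn.Conv2d(3, 20, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

epochs, steps_per_epoch = 100, 500          # 100 epochs as in the paper; steps assumed
total_steps = epochs * steps_per_epoch

for step in range(total_steps):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(1e-3, 1e-6, step, total_steps)
    # forward pass on a random 256x256 crop (rotated by -10..10 degrees),
    # joint landmark + segmentation loss, loss.backward(), optimizer.step()
    optimizer.zero_grad()
```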
3.3 3D Model Construction

Our approach uses fashion landmarks to estimate the sizing information and to guide the deformation of a template mesh. Textures are extracted from the input images and mapped onto the 3D garment model. In this section, we first discuss the garment templates used in our system. Then, we discuss our 3D modeling and texturing approaches.

3.3.1 Garment Templates

We use 3D garment models from the Berkeley Garment Library [11] as templates. For each garment type, a coarse base mesh and a finer, isotropically refined mesh are provided by the library. We use the refined mesh in its world-space configuration as our base model. In addition, the texture coordinates of the refined mesh store the material coordinates that refer to a planar reference mesh. We use this 2D reference mesh for texture extraction. Currently, our system supports two garment types: T-shirt and pants, as shown in Fig. 5.

3.3.2 3D Model Deformation

To create 3D garment models that conform to the sizing information from the input images, we apply Free-Form Deformation (FFD) [32] to deform a garment template. We chose FFD because it can be applied to 3D models locally while maintaining derivative continuity with adjacent regions of the model. For two-view data (front and back), FFD is a plausible solution. When there are multi-view images, videos, or 4D scans of garments, other mesh fitting techniques can be used to generate more accurate results.

For each garment template, we impose a grid of control points Pijk (0 ≤ i < l, 0 ≤ j < m, 0 ≤ k < n) on a lattice. The deformation of the template is achieved by moving each control point Pijk from its original position. Control points are carefully chosen to facilitate deformation of individual parts so that a variety of garment shapes can be modeled. For the T-shirt, as shown in Fig. 6 (a, b), we use l = 4, m = 2, n = 4. For the pants, as shown in Fig. 6 (c, d), we use control points with l = 3, m = 2, n = 3.

If metric scale of the resulting 3D model is desired, we ask the user to specify a measurement l in world space (e.g., sleeve length). Otherwise, a default value is assigned to l. Based on the ratio between the image-space sleeve length and l, we can convert any image-space distance to a world-space distance.

FFD control points do not directly correspond to image landmarks. Instead, we compute 2D distances between garment landmarks and use them to compute 3D distances between control points. Tab. 1 shows how to calculate the control point distances for the T-shirt type.
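A minimal sketch of the trivariate Bernstein FFD of [32] is shown below. The lattice placement, which control points are moved, and how the distances in Tab. 1 translate into control-point displacements are not reproduced here; they would follow the per-garment design described above.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))

def ffd_deform(vertices, control_points, lattice_origin, lattice_axes):
    """Trivariate Bernstein free-form deformation in the spirit of [32].

    vertices:       (V, 3) template mesh vertices in world space
    control_points: (l, m, n, 3) displaced control-point positions
    lattice_origin: (3,) origin of the deformation lattice
    lattice_axes:   (3, 3) rows are the axes spanning the lattice
    """
    l, m, n, _ = control_points.shape
    # Local (s, t, u) coordinates of each vertex inside the lattice, in [0, 1].
    local = (vertices - lattice_origin) @ np.linalg.inv(lattice_axes)
    out = np.zeros_like(vertices)
    for i in range(l):
        bi = bernstein(l - 1, i, local[:, 0])
        for j in range(m):
            bj = bernstein(m - 1, j, local[:, 1])
            for k in range(n):
                bk = bernstein(n - 1, k, local[:, 2])
                out += (bi * bj * bk)[:, None] * control_points[i, j, k]
    return out

# Example: a 4 x 2 x 4 lattice, as used for the T-shirt template.
verts = np.random.rand(1000, 3)
ctrl = np.stack(np.meshgrid(np.linspace(0, 1, 4),
                            np.linspace(0, 1, 2),
                            np.linspace(0, 1, 4), indexing="ij"), axis=-1)
deformed = ffd_deform(verts, ctrl, lattice_origin=np.zeros(3),
                      lattice_axes=np.eye(3))
```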

Table 1: Control Point Distances from Landmarks for the T-shirt

  Distance          How to calculate
  D(P0jk, P1jk)     left sleeve length * cos(α)
  D(P1jk, P2jk)     chest width (armpit left to armpit right)
  D(P2jk, P3jk)     right sleeve length * cos(β)
  D(Pij0, Pij1)     distance from armpit to hemline
  D(Pij1, Pij2)     distance from armpit to shoulder
  D(Pij0, Pij3)     distance from neck to hemline
  D(Pi0k, Pi1k)     D(Pij1, Pij2) * S, where S = D(Pi0k, Pi1k)/D(Pij1, Pij2) on the un-displaced control grid

Figure 3: Annotation Examples. Top and bottom show landmark and part labeling for tops (including T-shirt) and pants, respectively.
[Figure 4 diagram omitted: ResNet-101 backbone; segmentation branch with ASPP (3x3 convolutions at rates 1, 6, 12, and 18, plus pooling, concatenation, and a 1x1 convolution) followed by up-sampling; landmark branch with T stages of 7x7, 3x3, and 1x1 convolutions.]

Figure 4: JFNet. Our proposed multi-task learning model uses ResNet-101 as the backbone network to extract shared low-level features. For landmark prediction (bottom half), we apply T-stage CNNs. Each stage refines the prediction iteratively. For garment part segmentation, Atrous Spatial Pyramid Pooling (ASPP) is applied on the ResNet output, followed by 1x1 convolution and up-sampling. At the last stage of the network, results from the two branches are concatenated together for joint learning.

Constants α and β are the angles between the horizontal direction and the left sleeve and between the horizontal direction and the right sleeve, respectively. They are measured from the template T-shirt mesh. The distances are then used to compute new locations of control points for template mesh deformation.

Since the T-shirt template resembles the shape of a T-shirt on a mannequin, using photos of T-shirts on mannequins achieves the most accurate results. On such images, the distance between the two armpits corresponds to the chest width of the mannequin. When a T-shirt lies on a flat surface, the distance between the two armpits corresponds to half the perimeter of the chest. In this case, we fit an ellipse to the horizontal section of the chest. We then compute the width of the horizontal section as the major axis of the ellipse using the perimeter measurement. Images of fashion models are not suitable for garment size estimation due to self-occlusion, wrinkles, etc. Tab. 2 shows the calculation of control points for the pants.
Table 2: Computing Control Point Distances for the Pants

  Control Points    How to calculate
  D(P0jk, P1jk)     un-displaced distance * S*
  D(P1jk, P2jk)     un-displaced distance * S*
  D(Pij0, Pij1)     distance from crotch to bottom
  D(Pij1, Pij2)     distance from crotch to waistline
  D(Pi0k, Pi1k)     un-displaced distance * S*

  *S is the ratio between the new waist girth and the template waist girth.
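The flat-lay chest-width computation described above can be sketched as follows. The ellipse aspect ratio and the use of Ramanujan's perimeter approximation are assumptions, since the text does not state how the ellipse is parameterized.

```python
import math

def chest_width_from_flat_lay(armpit_to_armpit: float, aspect_ratio: float = 0.5) -> float:
    """Estimate chest width (major axis 2a) from a flat-lay T-shirt photo.

    On a flat-lay image the armpit-to-armpit distance is half the chest
    perimeter. We model the horizontal chest section as an ellipse with
    semi-axes a and b = aspect_ratio * a and recover a from the perimeter.
    The aspect ratio and Ramanujan's perimeter approximation are assumptions.
    """
    perimeter = 2.0 * armpit_to_armpit
    r = aspect_ratio
    # Ramanujan: P ~= pi * (3(a + b) - sqrt((3a + b)(a + 3b))); with b = r*a
    # the perimeter is linear in a, so a = P / (pi * c(r)).
    c = 3.0 * (1.0 + r) - math.sqrt((3.0 + r) * (1.0 + 3.0 * r))
    a = perimeter / (math.pi * c)
    return 2.0 * a  # major axis = chest width of the horizontal section

# Example: 50 cm between armpits on a flat-lay photo (about a 100 cm chest girth).
print(round(chest_width_from_flat_lay(50.0), 1))
```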
3.4 Texture Extraction

The texture coordinates in the 3D mesh refer to the vertices in the planar 2D reference mesh. This allows us to perform 3D texture mapping by mapping input images onto the 2D reference mesh as a surrogate. The different pieces in the reference mesh correspond to different garment segmentation parts. This is the reason semantic segmentation is performed during garment image analysis. Texture mapping becomes an image deformation problem where the source is a garment part (e.g., left sleeve) and the target is its corresponding piece on the reference mesh.

On the reference mesh, we manually label the landmarks (Fig. 7 (b), red circles). This only needs to be done once for each garment type. In this way, we establish feature correspondences between predicted landmarks on the source image and manually labeled landmarks on the target image. However, using a sparse set of control points leads to large local deformation, especially around contours. To mitigate this, we map each landmark point onto the contour of the part by finding the closest point on the part contour. Then, between each pair of adjacent landmarks, we sample N additional points uniformly along the contour. We do this for both the input garment image and the reference mesh (green circles in Fig. 7). The corresponding points are then used by the Moving Least Squares (MLS) method with similarity deformation [31] to deform textures from the input image to the reference mesh. Alternatively, a Thin Plate Spline (TPS) based approach similar to that used in VITON [15] can also be used for image warping.

Before image deformation, each garment segment is eroded slightly to accommodate for segmentation artifacts. Then, color texture is extrapolated from the garment to the surrounding area to remove background color after deformation. Fig. 7 shows the process of deforming the front segment of a T-shirt to the desired location on its 2D reference mesh, and Fig. 8 shows the same for the right leg of the pants. Note that, to better illustrate the idea, we use a small value of N = 10 in Fig. 7 and 8. In our experiments, we found that a denser control point set (e.g., N = 50) works better.

In our current implementation, the back piece around the neck/collar is often included in the front piece segmentation result. To handle this, we cut out the back piece automatically. JFNet predicts the front middle point of the neck as a landmark. We then correct the front piece segmentation by tracing the edge from the two shoulder points to the middle neck point.
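The contour densification step can be sketched as below: it snaps landmarks to the part contour and inserts N uniformly spaced points between adjacent landmarks, leaving the MLS similarity warp itself to [31]. The array-based contour representation and helper name are illustrative.

```python
import numpy as np

def densify_correspondences(contour, landmarks, n_extra=50):
    """Snap landmarks to a part contour and add n_extra points between each
    pair of adjacent landmarks, uniformly spaced along the contour.

    contour:   (C, 2) closed polygon of the segmented part (image coordinates)
    landmarks: (L, 2) predicted (or manually labeled) landmark positions
    Returns an (L * (n_extra + 1), 2) array of control points.
    """
    # 1. Snap each landmark to its closest contour vertex (ordered along the contour).
    d = np.linalg.norm(contour[None, :, :] - landmarks[:, None, :], axis=2)
    idx = np.sort(d.argmin(axis=1))

    # 2. Cumulative arc length along the closed contour.
    seg = np.linalg.norm(np.diff(np.vstack([contour, contour[:1]]), axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    total = arc[-1]

    points = []
    for a, b in zip(idx, np.roll(idx, -1)):
        s0 = arc[a]
        s1 = arc[b] if b > a else arc[b] + total      # wrap around the contour
        for t in np.linspace(s0, s1, n_extra + 2)[:-1]:  # drop duplicate endpoint
            t = t % total
            j = np.searchsorted(arc, t, side="right") - 1
            j2 = (j + 1) % len(contour)
            w = (t - arc[j]) / max(seg[j], 1e-9)
            points.append((1.0 - w) * contour[j] + w * contour[j2])
    return np.asarray(points)

# Example: a square "part" contour with 4 landmarks near its corners.
square = np.array([[x, 0] for x in range(100)] + [[99, y] for y in range(100)] +
                  [[x, 99] for x in range(99, -1, -1)] +
                  [[0, y] for y in range(99, -1, -1)], dtype=float)
lm = np.array([[1.0, 1.0], [98.0, 2.0], [97.0, 98.0], [2.0, 97.0]])
ctrl = densify_correspondences(square, lm, n_extra=10)
```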

4 EXPERIMENTS

In this section, we show quantitative experimental results for JFNet. We also show results on 3D modeling.

4.1 Evaluation of JFNet

Our model requires both landmark and segmentation annotations; thus, we cannot compare our results directly with other SOTAs by training our model on a public dataset. Nevertheless, we have trained CPM and DeepLabV3+ on our dataset and compare them with JFNet.

We trained JFNet for tops and pants separately. For each model, 2,000 images are used for training and 500 images for validation. Evaluation is performed on the remaining 500 images. We used the standard intersection over union (IoU) criterion and mean IoU (mIOU) accuracy for segmentation evaluation and the normalized error (NE) metric [21] for landmark prediction evaluation. NE refers to the distance between predicted landmarks and ground-truth locations in the normalized coordinate space (i.e., normalized with respect to the width of the image), and it is a commonly used evaluation metric.
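For reference, the NE metric as described above can be computed as follows; the handling of any missing or invisible landmarks is an assumption not specified here.

```python
import numpy as np

def normalized_error(pred, gt, image_width):
    """Normalized error (NE) for fashion landmark prediction [21].

    pred, gt:    (L, 2) predicted and ground-truth landmark coordinates (pixels)
    image_width: image width used for normalization

    NE is the mean Euclidean distance between predicted and ground-truth
    landmarks in normalized coordinates (both axes divided by the image width,
    as described in the text).
    """
    dist = np.linalg.norm((pred - gt) / float(image_width), axis=1)
    return float(dist.mean())

# Example: predictions a few pixels off in a 256-pixel-wide image.
gt = np.array([[64.0, 32.0], [128.0, 40.0], [200.0, 36.0]])
pred = gt + np.array([[2.0, -1.0], [3.0, 2.0], [-2.0, 1.0]])
print(round(normalized_error(pred, gt, 256), 4))   # ~0.01
```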
Figure 5: Our approach uses garment templates for modeling and texturing. (a) The template mesh for T-shirt, whose texture coordinates match the vertex coordinates of the (b) reference mesh. (c) Template mesh for pants, and the corresponding (d) reference mesh.

Tab. 3 shows the performance of the different methods. For both tops and pants, JFNet achieves better performance on both landmark prediction and garment part segmentation. Our landmark prediction on tops greatly outperforms CPM (0.031 vs. 0.075). This shows that constraints and guidance from the segmentation task have helped landmark prediction. Landmark prediction performance on pants also improves, but not as much, because landmarks of pants are less complex than those of tops. Part segmentation is a more complex task. Thus, it is reasonable that our model does not boost the segmentation task as much. Nevertheless, JFNet still improves upon DeepLabV3+.

It is worth noting that the purpose of the proposed model is to handle multiple tasks simultaneously with a performance improvement compared to the individual tasks. Thus, our method focuses on information sharing and multi-task training while other SOTAs focus on network structure and training for each individual task. In the future, we can also incorporate other SOTA networks into our joint learning model.

Figure 6: Template Deformation. (a) The original template for T-shirt with control grid. (b) Deformed template that captures a different shape. (c) The original template for pants. (d) Deformed template.
Figure 7: Texture Extraction for T-Shirt. (a) The extrapolated T-shirt image with control points computed along the contour of the front segment. (b) The front segment is deformed to the desired location on the 2D reference mesh.
Figure 8: Texture Extraction for Pants. (a) The extrapolated pants image with control points. (b) The image segment is deformed to the desired location on the 2D reference mesh.

Table 3: Landmark Prediction and Garment Segmentation Performance Comparison

                        Tops                Pants
  Methods             NE       mIOU       NE       mIOU
  CPM [37]            0.075    −          0.034    −
  DeepLabV3+ [8]      −        0.721      −        0.964
  JFNet               0.031    0.725      0.022    0.968

4.2 3D Modeling Results

We applied our 3D garment modeling algorithm on various input images, and the results are shown in Fig. 9. Our approach utilizes the sizing information estimated from fashion landmarks to model different styles of garments (e.g., different leg lengths or different fits of T-shirts). For example, the 3rd T-shirt is slightly longer, the 2nd T-shirt is slightly wider, and the 1st T-shirt has narrower sleeves. These correspond to the characteristics of the input garment images. Our approach can also extract textures from garment images and map them onto different parts of the constructed 3D model.

Quantitatively evaluating our 3D modeling is expensive: it involves capturing 2D images of various garments and scanning them into 3D models. An alternative is to use synthetic data with ground truth to evaluate the accuracy of size estimation and 3D reconstruction. We leave these for future work. Nevertheless, the 3D modeling results of our approach are visually plausible for applications where the accuracy requirement is not strict.

5 CONCLUSION

We present a complete system that takes photos of a garment as input and creates a 3D textured virtual model. We propose a multi-task network called JFNet to predict fashion landmarks and segment the garment into parts. The landmark prediction results are used to guide template-based deformation. The semantic part segmentation results are used for texture extraction. We show that our system can create 3D virtual models for T-shirts and pants effectively.

6 LIMITATION

One limitation is due to the representation power of the templates. Because our model is deformed from a template, the shape of the template limits the range of garments we can model. For example, our pants template is a regular fit; modeling slim or skinny pants will be impractical. Our approach recovers the shape, but not the pose, of the garment. To learn the 3D pose of garments, more data and annotations are required.

Another limitation is that we only use two photos (front and back view) for texture extraction. This leads to excessive local deformation when source and target contours are very different (see the stickers on the jeans in the last row of Fig. 9).

The photo sets for testing our 3D modeling approach are from online shopping sites. Two occlusion-free images can always be selected from each set. In general, occlusion can pose a problem for texture extraction. However, missing textures can be mitigated using image in-painting, and missing landmarks can be mitigated using symmetry-based landmark completion.

Finally, our system only supports T-shirts and pants now, and we only address a simplified version of the garment modeling problem, which usually involves wrinkles, folds, and pleats.

7 FUTURE WORK

Currently, 2D proportions from the photos are transferred to the 3D model. In the future, we want to use a garment modeling approach that uses sewing patterns [17]. We can fit the shape of each individual 2D sewing pattern using image part segmentation. Then, these 2D patterns can be assembled in 3D space as in the commercial garment design process. In this way, we can better transfer the shapes from 2D images to 3D models.

We also want to investigate whether more than two images can be used together to texture a 3D model [2]. The distorted textures along the silhouettes of the front and back views can be filled in by a side-view photo.

For applications that require accurate 3D information, we would like to perform a quantitative evaluation of our 3D modeling algorithm.

Finally, by incorporating more garment templates, more garment types can be supported. Since we only need to create a template once for each type/fit, the overhead is small if used at large scale. There are certain garments that are not suitable for our approach (e.g., fancy dresses with customized designs). A possible approach is to use a hybrid system where template-based deformation generates a base model and 3D details are added via other methods. Part segmentation in its current state is not suitable for open jackets. It would be interesting to see whether a semantic segmentation model with more data and annotations can distinguish between the back side and the front side.

ACKNOWLEDGMENTS

The authors wish to thank the reviewers for their insightful comments and suggestions.
Figure 9: 3D Modeling Results. On each row we show the front image and its landmark prediction and part segmentation, followed by the back image and its landmark and part segmentation results. The final two columns show 3D textured models from two viewpoints.

REFERENCES

[1] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
[2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pp. 98–109, Sep. 2018. doi: 10.1109/3DV.2018.00022
[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8387–8397, June 2018. doi: 10.1109/CVPR.2018.00875
[4] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1014–1021, June 2009. doi: 10.1109/CVPR.2009.5206754
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, Dec 2017. doi: 10.1109/TPAMI.2016.2644615
[6] F. Berthouzoz, A. Garg, D. M. Kaufman, E. Grinspun, and M. Agrawala. Parsing sewing patterns into 3d garments. ACM Trans. Graph., 32(4):85:1–85:12, July 2013. doi: 10.1145/2461912.2461975
[7] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):99:1–99:9, Aug. 2008. doi: 10.1145/1360612.1360698
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
[9] X. Chen, B. Zhou, F. Lu, L. Wang, L. Bi, and P. Tan. Garment modeling with a depth camera. ACM Trans. Graph., 34(6):203:1–203:12, Oct. 2015. doi: 10.1145/2816795.2818059
[10] R. Daněřek, E. Dibra, C. Öztireli, R. Ziegler, and M. Gross. Deepgarment: 3d garment shape estimation from a single image. Comput. Graph. Forum, 36(2):269–280, May 2017. doi: 10.1111/cgf.13125
[11] J. M. de Joya, R. Narain, J. F. O'Brien, A. Samii, and V. Zordan. Berkeley Garment Library, 2012. Available at http://graphics.berkeley.edu/resources/GarmentLibrary/index.html.
[12] P. Decaudin, D. Julius, J. Wither, L. Boissieux, A. Sheffer, and M.-P. Cani. Virtual garments: A fully geometric approach for clothing design. Computer Graphics Forum, 25(3):625–634. doi: 10.1111/j.1467-8659.2006.00982.x
[13] K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin. Instance-level human parsing via part grouping network. In The European Conference on Computer Vision (ECCV), September 2018.
[14] M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt. LiveCap: Real-time human performance capture from monocular video. ACM Trans. Graph., 38(2):14:1–14:17, Mar. 2019. doi: 10.1145/3311970
[15] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis. Viton: An image-based virtual try-on network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7543–7552, June 2018. doi: 10.1109/CVPR.2018.00787
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016. doi: 10.1109/CVPR.2016.90
[17] M.-H. Jeong, D.-H. Han, and H.-S. Ko. Garment capture from a photograph. Computer Animation and Virtual Worlds, 26(3-4):291–300. doi: 10.1002/cav.1653
[18] I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5454–5463, July 2017. doi: 10.1109/CVPR.2017.579
[19] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[20] X. Liang, K. Gong, X. Shen, and L. Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis & Machine Intelligence, p. 1, 2018. doi: 10.1109/TPAMI.2018.2820063
[21] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion landmark detection in the wild. In European Conference on Computer Vision (ECCV), October 2016.
[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, June 2015. doi: 10.1109/CVPR.2015.7298965
[23] Y. Meng, C. C. L. Wang, and X. Jin. Flexible shape control for automatic resizing of apparel products. Comput. Aided Des., 44(1):68–76, Jan. 2012. doi: 10.1016/j.cad.2010.11.008
[24] R. Natsume, S. Saito, W. Chen, Z. Huang, C. Ma, H. Li, and S. Morishima. SiCloPe: Silhouette-based clothed people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
[25] N. Neverova, R. Alp Guler, and I. Kokkinos. Dense pose transfer. In The European Conference on Computer Vision (ECCV), September 2018.
[26] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Trans. Graph., 36(4):73:1–73:15, July 2017. doi: 10.1145/3072959.3073711
[27] V. Ramakrishna, D. Munoz, M. Hebert, J. Andrew Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds., Computer Vision – ECCV 2014, pp. 33–47. Springer International Publishing, Cham, 2014.
[28] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2018. doi: 10.1109/TPAMI.2017.2781233
[29] C. Robson, R. Maharik, A. Sheffer, and N. Carr. Context-aware garment modeling from sketches. Comput. Graph., 35(3):604–613, June 2011. doi: 10.1016/j.cag.2011.03.002
[30] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Springer International Publishing, Cham, 2015.
[31] S. Schaefer, T. McPhail, and J. Warren. Image deformation using moving least squares. ACM Trans. Graph., 25(3):533–540, July 2006. doi: 10.1145/1141911.1141920
[32] T. W. Sederberg and S. R. Parry. Free-form deformation of solid geometric models. SIGGRAPH Comput. Graph., 20(4):151–160, Aug. 1986. doi: 10.1145/15886.15903
[33] Y. Tian, C. L. Zitnick, and S. G. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds., Computer Vision – ECCV 2012, pp. 256–269. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[34] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, June 2014. doi: 10.1109/CVPR.2014.214
[35] C. C. L. Wang, Y. Wang, and M. M. F. Yuen. Design automation for customized apparel products. Comput. Aided Des., 37(7):675–691, June 2005. doi: 10.1016/j.cad.2004.08.007
[36] T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra. Learning a shared shape space for multimodal garment design. ACM Trans. Graph., 37(6):1:1–1:14, 2018. doi: 10.1145/3272127.3275074
[37] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732, June 2016. doi: 10.1109/CVPR.2016.511
[38] R. White, K. Crane, and D. A. Forsyth. Capturing and animating occluded cloth. ACM Trans. Graph., 26(3), July 2007. doi: 10.1145/1276377.1276420
[39] S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Unconstrained fashion landmark detection via hierarchical recurrent transformer networks. In Proceedings of the 2017 ACM on Multimedia Conference, MM '17, pp. 172–180. ACM, New York, NY, USA, 2017. doi: 10.1145/3123266.3123276
[40] S. Yang, T. Ambert, Z. Pan, K. Wang, L. Yu, T. L. Berg, and M. C. Lin. Physics-inspired garment recovery from a single-view image. ACM Trans. Graph., 2018.
[41] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu. DoubleFusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[42] T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu. SimulCap: Single-view human performance capture with cloth simulation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[43] C. Zhang, S. Pujades, M. Black, and G. Pons-Moll. Detailed, accurate, human shape estimation from clothed 3d scan sequences. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5484–5493, July 2017. doi: 10.1109/CVPR.2017.582
[44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6230–6239, July 2017. doi: 10.1109/CVPR.2017.660
[45] B. Zhou, X. Chen, Q. Fu, K. Guo, and P. Tan. Garment modeling from a single image. Computer Graphics Forum, 32(7):85–91. doi: 10.1111/cgf.12215
