3D Virtual Garment Modeling From RGB Images
Abstract
We present a novel approach that constructs 3D virtual garment
models from photos. Unlike previous methods that require photos of
a garment on a human model or a mannequin, our approach can work
with various states of the garment: on a model, on a mannequin,
Figure 2: System Overview. For each input image, we jointly predict landmark locations and segment the garment into semantic parts using the proposed JFNet. The predicted landmarks are used to guide the deformation of a 3D template mesh. The segmented parts are used to extract garment textures. Finally, a 3D textured garment model is produced.
Fully Convolutional Networks (FCNs), proposed for semantic segmentation [22], achieved significant improvements over methods that relied on hand-crafted features. Built upon FCNs, encoder-decoder architectures have shown great success [5, 30]. Such an architecture typically has an encoder that reduces the feature map and a decoder that maps the encoded information back to the input resolution. Spatial Pyramid Pooling (SPP) can also be applied at several scales to leverage multi-scale information [44]. DeepLabV3+ [8] combines the benefits of both SPP and the encoder-decoder architecture to achieve state-of-the-art results. Our part segmentation sub-network is based on the DeepLabV3+ architecture. Similar to our work, Alldieck et al. [2] also used human semantic part segmentation to extract detailed textures from RGB sequences.

2.3.3 Multi-task Learning
Multi-task learning (MTL) has been used successfully for many applications due to the inductive bias it achieves when training a model to perform multiple tasks. Recently, it has been applied to several computer vision tasks. Kokkinos introduced UberNet [18], which can jointly handle multiple computer vision tasks, ranging from semantic segmentation and human part parsing to object detection. Ranjan et al. proposed HyperFace [28] for simultaneously detecting faces, localizing landmarks, estimating head pose, and identifying gender. Perhaps the most similar work to ours is JPPNet [20], a joint human parsing and pose estimation network, whereas our work uses MTL for garment image analysis. Another MTL work on human parsing from the same group is [13], where semantic part segmentation and instance-aware edge detection are jointly learned.

2.4 Image-based Virtual Try-on
As an alternative to 3D modeling, image-based virtual try-on has also been explored. Neverova et al. [25] used a two-stream network in which a data-driven predicted image and a surface-based warped image are combined, and the whole network is learned end-to-end to generate a new pose of a person. Lassner et al. [19] used only image information to predict images of new people in different clothing items. VITON [15], on the other hand, transfers the image of a new garment onto a photo of a person.

3 Our Approach
In this section, we explain our approaches to garment image parsing, 3D model creation, and texture extraction. Fig. 2 shows an overview of our approach.

3.1 Data Annotation
To train JFNet, we built a dataset with both fashion landmarks and pixel-level segmentation annotations. We collected 3,000 images of tops (including T-shirts) and another 3,000 images of pants from the web. For each type of garment, a set of landmarks is defined based on fashion design. 13 landmarks are defined for tops, including the center and corners of the neckline, the corners of both cuffs, the end points of the hemline, and the armpits. 7 landmarks are defined for pants, including the end points of the waistband, the crotch, and the end points of the bottom.

For part segmentation, we defined a set of labels and asked the annotators to provide pixel-level labeling. For tops, we used 5 labels: left-sleeve, right-sleeve, collar, torso, and hat. For pants, we used 2 labels: left-part and right-part. Some labeling examples are shown in Fig. 3.

3.2 Garment Image Parsing
Our joint garment parsing network JFNet is built upon Convolutional Pose Machines (CPMs) [37] for landmark prediction and DeepLabV3+ [8] for semantic segmentation.

The network architecture of JFNet is illustrated in Fig. 4. We use ResNet-101 [16] as our backbone network to extract shared low-level features. Then we use two branching networks to obtain landmark prediction and part segmentation. Finally, we use a refinement network to refine the prediction results.
3.2.1 Landmark Prediction
For landmark prediction (bottom half of Fig. 4), we use a learning network with T stages similar to that of [37]. At the first stage, we extract the second-stage outputs of ResNet-101 (Res-2), followed by a 3x3 convolutional layer, as low-level features from the input image. Then, we use two 1x1 convolutional layers to predict the landmark heatmap at the first stage. At each subsequent stage, we concatenate the landmark heatmap predicted at the previous stage with the shared low-level features from Res-2. Then we use five convolutional layers followed by two 1x1 convolutional layers to predict the heatmap at the current stage. The architecture repeats this process for T stages, where the size of the receptive field increases with each stage. This is crucial for learning long-range relationships between fashion landmarks. The heatmap at each stage is compared against the labeled ground truth and contributes to the total training loss.
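For concreteness, the following PyTorch-style sketch illustrates the stage-wise structure described above. The intermediate channel width, the kernel size of the five refinement convolutions, and the ReLU nonlinearities are illustrative assumptions rather than the exact JFNet settings; only the overall wiring (shared Res-2 features, two 1x1 convolutions at stage 1, and concatenation of the previous heatmap with the shared features at later stages) follows the description.

    import torch
    import torch.nn as nn

    class LandmarkBranch(nn.Module):
        """CPM-style T-stage landmark heatmap predictor (illustrative sketch)."""
        def __init__(self, feat_ch=256, num_landmarks=13, T=3, mid_ch=128):
            super().__init__()
            # Stage 1: two 1x1 convolutions on the shared low-level features.
            self.stage1 = nn.Sequential(
                nn.Conv2d(feat_ch, mid_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, num_landmarks, 1))
            # Stages 2..T: concat(previous heatmaps, shared features), then five
            # convolutions (growing the receptive field) and two 1x1 convolutions.
            def refine_stage():
                layers, in_ch = [], feat_ch + num_landmarks
                for _ in range(5):
                    layers += [nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
                    in_ch = mid_ch
                layers += [nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
                           nn.Conv2d(mid_ch, num_landmarks, 1)]
                return nn.Sequential(*layers)
            self.stages = nn.ModuleList([refine_stage() for _ in range(T - 1)])

        def forward(self, shared_feats):
            heatmaps = [self.stage1(shared_feats)]          # stage-1 prediction
            for stage in self.stages:                       # stages 2..T
                x = torch.cat([heatmaps[-1], shared_feats], dim=1)
                heatmaps.append(stage(x))
            return heatmaps  # one heatmap per stage, each supervised against the ground truth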
3.2.2 Garment Part Segmentation
For semantic garment part segmentation (top half of Fig. 4), we followed the encoder architecture of DeepLabV3+ [8]. An Atrous Spatial Pyramid Pooling (ASPP) module, which can learn context information at multiple scales effectively, is applied after the last-stage output of ResNet-101, followed by one 1x1 convolutional layer and up-sampling.
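The exact head configuration is not spelled out beyond this description (the diagram in Fig. 4 annotates atrous rates 6 and 12), so the following is only a minimal sketch of such a DeepLabV3+-style segmentation head; the channel counts, the full rate set, and the presence of a background class are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPPSegHead(nn.Module):
        """ASPP over the last ResNet-101 stage, then 1x1 conv and up-sampling (sketch)."""
        def __init__(self, in_ch=2048, mid_ch=256, num_classes=6, rates=(6, 12, 18)):
            super().__init__()
            # Parallel branches: a 1x1 convolution plus atrous 3x3 convolutions.
            self.branches = nn.ModuleList(
                [nn.Conv2d(in_ch, mid_ch, 1)] +
                [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in rates])
            self.project = nn.Conv2d(mid_ch * (1 + len(rates)), mid_ch, 1)
            self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

        def forward(self, res5_feats, out_size):
            x = torch.cat([b(res5_feats) for b in self.branches], dim=1)
            logits = self.classifier(self.project(x))
            # Up-sample the per-pixel class scores to the target resolution.
            return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)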
3.2.3 Refinement
To refine landmark prediction and part segmentation, and to let the two tasks promote each other, we concatenate the landmark prediction result from the T-th stage of the landmark sub-network, the part segmentation result from the segmentation sub-network, and the shared low-level features. We then apply a 3x3 convolutional layer for landmark prediction and part segmentation, respectively. The sum of the losses from both branches is used to train the network jointly, end-to-end.
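The text states only that the two branch losses are summed; a minimal sketch of that joint objective is below. The choice of an L2 heatmap loss and an unweighted per-pixel cross-entropy loss is our assumption, since the exact loss functions and weights are not specified.

    import torch.nn.functional as F

    def joint_loss(stage_heatmaps, gt_heatmap, seg_logits, gt_parts):
        """Sum of landmark and segmentation losses for end-to-end training (sketch)."""
        # Heatmap regression loss accumulated over every stage (intermediate supervision).
        landmark_loss = sum(F.mse_loss(h, gt_heatmap) for h in stage_heatmaps)
        # Per-pixel cross-entropy for the part labels.
        seg_loss = F.cross_entropy(seg_logits, gt_parts)
        return landmark_loss + seg_loss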
3.2.4 Training Details
We load ResNet-101 parameters that are pre-trained on the ImageNet classification task. During training, random cropping and random rotation between -10 and 10 degrees are applied for data augmentation, and the input image is resized to 256x256. We adopt the SGD optimizer with a momentum of 0.9. The learning rate is initially set to 0.001 and decays with the "poly" schedule [44] to 10^-6 over 100 total training epochs.
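The "poly" schedule of [44] decays the learning rate as base_lr * (1 - iter/max_iter)^power; a sketch is shown below. The start value 0.001 and the floor of 10^-6 match the numbers above, while the exponent (0.9 is the value commonly used with this policy) and the clamping behaviour are assumptions.

    def poly_lr(base_lr, cur_iter, max_iter, power=0.9, min_lr=1e-6):
        """'Poly' learning-rate decay (sketch): base_lr * (1 - t/T)^power, floored at min_lr."""
        lr = base_lr * (1.0 - cur_iter / float(max_iter)) ** power
        return max(lr, min_lr)

    # Example: applied to an SGD optimizer with momentum 0.9 once per iteration.
    # for group in optimizer.param_groups:
    #     group['lr'] = poly_lr(0.001, cur_iter, total_iters)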
3.3 3D Model Construction
Our approach uses fashion landmarks to estimate the sizing information and to guide the deformation of a template mesh. Textures are extracted from the input images and mapped onto the 3D garment model. In this section, we first discuss the garment templates used in our system. Then, we discuss our 3D modeling and texturing approaches.

3.3.1 Garment Templates
We use 3D garment models from the Berkeley Garment Library [11] as templates. For each garment type, a coarse base mesh and a finer, isotropically refined mesh are provided by the library. We use the refined mesh in its world-space configuration as our base model. In addition, the texture coordinates of the refined mesh store the material coordinates that refer to a planar reference mesh. We use this 2D reference mesh for texture extraction. Currently, our system supports two garment types, T-shirt and pants, as shown in Fig. 5.

3.3.2 3D Model Deformation
To create 3D garment models that conform to the sizing information from the input images, we apply Free-Form Deformation (FFD) [32] to deform a garment template. We chose FFD because it can be applied to 3D models locally while maintaining derivative continuity with adjacent regions of the model. For two-view data (front and back), FFD is a plausible solution. When multi-view images, videos, or 4D scans of garments are available, other mesh fitting techniques can be used to generate more accurate results.

For each garment template, we impose a grid of control points P_ijk (0 <= i < l, 0 <= j < m, 0 <= k < n) on a lattice. The deformation of the template is achieved by moving each control point P_ijk from its original position. Control points are carefully chosen to facilitate the deformation of individual parts so that a variety of garment shapes can be modeled. For the T-shirt, as shown in Fig. 6 (a, b), we use l = 4, m = 2, n = 4. For the pants, as shown in Fig. 6 (c, d), we use control points with l = 3, m = 2, n = 3.
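For reference, the deformed vertex positions under FFD [32] are the trivariate Bernstein blend of the displaced control points; a sketch of this evaluation is given below, assuming the template vertices have already been expressed in local lattice coordinates (s, t, u) in [0, 1].

    import numpy as np
    from math import comb

    def ffd(points_stu, control_points):
        """Trivariate Bernstein free-form deformation (Sederberg and Parry [32]); sketch.

        points_stu:     (V, 3) vertex coordinates in the local lattice frame, each in [0, 1].
        control_points: (l, m, n, 3) array of displaced control points P_ijk.
        Returns the deformed (V, 3) vertex positions.
        """
        l, m, n, _ = control_points.shape

        def bernstein(deg, idx, x):                     # B_idx^deg(x)
            return comb(deg, idx) * (x ** idx) * ((1.0 - x) ** (deg - idx))

        s, t, u = points_stu[:, 0], points_stu[:, 1], points_stu[:, 2]
        deformed = np.zeros_like(points_stu)
        for i in range(l):
            for j in range(m):
                for k in range(n):
                    w = (bernstein(l - 1, i, s) *
                         bernstein(m - 1, j, t) *
                         bernstein(n - 1, k, u))        # per-vertex blending weight
                    deformed += w[:, None] * control_points[i, j, k]
        return deformed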
If the metric scale of the resulting 3D model is desired, we ask the user to specify a measurement l in world space (e.g., the sleeve length). Otherwise, a default value is assigned to l. Based on the ratio between the image-space sleeve length and l, we can convert any image-space distance to a world-space distance.
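In other words, the user-supplied measurement acts as a single global scale factor; a trivial sketch (variable names are ours, not the paper's):

    def to_world(dist_px, ref_px, ref_world):
        """Convert an image-space distance to world space using a reference length
        (e.g., the sleeve length measured in pixels and in world units)."""
        return dist_px * (ref_world / ref_px)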
FFD control points do not directly correspond to image landmarks. Instead, we compute 2D distances between garment landmarks and use them to compute 3D distances between control points.
Figure 4: JFNet. Our proposed multi-task learning model uses ResNet-101 as the backbone network to extract shared low-level features. For landmark prediction (bottom half), we apply T-stage CNNs; each stage refines the prediction iteratively. For garment part segmentation, Atrous Spatial Pyramid Pooling (ASPP) is applied to the ResNet output, followed by a 1x1 convolution and up-sampling. At the last stage of the network, the results from the two branches are concatenated for joint learning.
Tab. 1 shows how to calculate the control point distances for the T-shirt type. The constants α and β are the angle between the horizontal direction and the left sleeve and the angle between the horizontal direction and the right sleeve, respectively. They are measured from the template T-shirt mesh. The distances are then used to compute new locations of control points for template mesh deformation.

Since the T-shirt template resembles the shape of a T-shirt on a mannequin, using photos of T-shirts on mannequins achieves the most accurate results. In such images, the distance between the two armpits corresponds to the chest width of the mannequin. When a T-shirt lies on a flat surface, the distance between the two armpits corresponds to half the perimeter of the chest. In this case, we fit an ellipse to the horizontal section of the chest. We then compute the width of the horizontal section as the major axis of the ellipse using the perimeter measurement. Images of fashion models are not suitable for garment size estimation due to self-occlusion, wrinkles, etc. Tab. 2 shows the calculation of control point distances for the pants.

Table 2: Computing Control Point Distances for Pants

  Control points       How to calculate
  D(P_0jk, P_1jk)      un-displaced distance * S
  D(P_1jk, P_2jk)      un-displaced distance * S
  D(P_ij0, P_ij1)      distance from crotch to bottom
  D(P_ij1, P_ij2)      distance from crotch to waist line
  D(P_i0k, P_i1k)      un-displaced distance * S

  (S is the ratio between the new waist girth and the template waist girth.)
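One way to carry out the ellipse-based chest-width estimate described above for the flat-lying case is sketched below. It assumes the ellipse aspect ratio (minor over major axis) is known, e.g., taken from the template mesh, and inverts Ramanujan's perimeter approximation, which is homogeneous in the axes; the ratio value used here is purely illustrative.

    import math

    def chest_width_from_flat(armpit_dist, aspect_ratio=0.5):
        """Estimate the chest width (ellipse major axis) from the armpit-to-armpit
        distance of a flat-lying T-shirt, which equals half the chest perimeter (sketch)."""
        r = aspect_ratio                       # minor/major axis ratio, assumed known
        # Ramanujan: P ~= pi * [3(a + b) - sqrt((3a + b)(a + 3b))] with b = r * a,
        # so P ~= a * f(r), and the full perimeter P is twice the armpit distance.
        f = math.pi * (3.0 * (1.0 + r) - math.sqrt((3.0 + r) * (1.0 + 3.0 * r)))
        a = 2.0 * armpit_dist / f              # semi-major axis
        return 2.0 * a                         # width of the horizontal section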
3.4 Texture Extraction
The texture coordinates in the 3D mesh refer to the vertices in the planar 2D reference mesh. This allows us to perform 3D texture mapping by mapping input images onto the 2D reference mesh as a surrogate. The different pieces in the reference mesh correspond to different garment segmentation parts; this is the reason semantic segmentation is performed during garment image analysis. Texture mapping becomes an image deformation problem where the source is a garment part (e.g., the left sleeve) and the target is its corresponding piece on the reference mesh.

On the reference mesh, we manually label the landmarks (red circles in Fig. 7 (b)). This only needs to be done once for each garment type. In this way, we establish feature correspondence between predicted landmarks on the source image and manually labeled landmarks on the target image. However, using a sparse set of control points leads to large local deformation, especially around contours. To mitigate this, we map each landmark point onto the contour of the part by finding the closest point on the part contour. Then, between each pair of adjacent landmarks, we sample N additional points uniformly along the contour. We do this for both the input garment image and the reference mesh (green circles in Fig. 7). The corresponding points are then used by the Moving Least Squares (MLS) method with similarity deformation [31] to deform textures from the input image to the reference mesh. Alternatively, a Thin Plate Spline (TPS) based approach similar to that used in VITON [15] can also be used for image warping.
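A sketch of the densification step is shown below: each landmark is snapped to its nearest point on the part contour, and N further points are sampled between adjacent landmarks. Sampling uniformly by contour index (rather than by arc length) and the wrap-around handling are our simplifications; the paired point sets from the image and the reference mesh would then drive the MLS similarity warp [31].

    import numpy as np

    def densify_correspondences(contour, landmarks, n_between=50):
        """Snap landmarks to a closed part contour and add points between them (sketch).

        contour:   (C, 2) ordered points along the closed contour of one garment part.
        landmarks: (K, 2) landmark positions (predicted, or manually labeled on the
                   reference mesh).
        Returns (K * (n_between + 1), 2) control points.
        """
        # 1. Snap each landmark to the index of its closest contour point.
        d = np.linalg.norm(contour[None, :, :] - landmarks[:, None, :], axis=-1)
        idx = np.sort(d.argmin(axis=1))
        # 2. Between each pair of adjacent landmarks (wrapping around the closed
        #    contour), sample n_between additional points along the contour.
        C, pieces = len(contour), []
        for a, b in zip(idx, np.roll(idx, -1)):
            span = (b - a) % C or C
            steps = (a + span * np.arange(n_between + 1) // (n_between + 1)) % C
            pieces.append(contour[steps])
        return np.concatenate(pieces, axis=0)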
Before image deformation, each garment segment is eroded (see Fig. 7 and 8). In our experiments, we found that a denser control point set (e.g., N = 50) works better.
In our current implementation, the back piece around the
neck/collar is often included in the front piece segmentation re-
sult. To handle this, we cut out the back piece automatically. JFNet
predicts the front middle point of the neck as a landmark. We then
correct the front piece segmentation by tracing the edge from two
shoulder points to the middle neck point.
4 Experiments
In this section, we show quantitative experimental results for JFNet.
We also show results on 3D modeling.
In the future, we can also incorporate other SOTA networks into our joint learning model.

Table 3: Landmark Prediction and Garment Segmentation Performance Comparison

                        Tops                Pants
  Methods               NE       mIOU       NE       mIOU
  CPM [37]              0.075    -          0.034    -
  DeepLabV3+ [8]        -        0.721      -        0.964
  JFNet                 0.031    0.725      0.022    0.968
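For reference, the two metrics in Table 3 can be computed as sketched below. We read "NE" as the mean landmark-to-ground-truth distance normalized by the image size, as in the fashion-landmark literature [21, 39]; the exact normalization used for the table is not stated, so treat this as an assumption. mIOU is the standard mean intersection-over-union over the part labels.

    import numpy as np

    def normalized_error(pred_pts, gt_pts, img_size):
        """Mean landmark error normalized by image size (assumed definition of NE)."""
        return float(np.mean(np.linalg.norm(pred_pts - gt_pts, axis=-1) / img_size))

    def mean_iou(pred_mask, gt_mask, num_classes):
        """Standard mean intersection-over-union over part labels."""
        ious = []
        for c in range(num_classes):
            p, g = pred_mask == c, gt_mask == c
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(np.logical_and(p, g).sum() / union)
        return float(np.mean(ious))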
4.2 3D Modeling Results
We applied our 3D garment modeling algorithm to various input images, and the results are shown in Fig. 9. Our approach utilizes the sizing information estimated from fashion landmarks to model different styles of garments (e.g., different leg lengths or different fits of a T-shirt). For example, the 3rd T-shirt is slightly longer, the 2nd T-shirt is slightly wider, and the 1st T-shirt has narrower sleeves. These correspond to the characteristics of the input garment images. Our approach can also extract textures from garment images and map them onto different parts of the constructed 3D model.

Quantitatively evaluating our 3D modeling is expensive: it involves capturing 2D images of various garments and scanning them into 3D models. An alternative is to use synthetic data with ground truth to evaluate the accuracy of size estimation and 3D reconstruction. We leave these for future work. Nevertheless, the 3D modeling results of our approach are visually plausible for applications where the accuracy requirement is not strict.

Figure 9: 3D Modeling Results. On each row we show the front image with its landmark prediction and part segmentation, followed by the back image with its landmark prediction and part segmentation. The final two columns show the 3D textured model from two viewpoints.
5 Conclusion
We present a complete system that takes photos of a garment as input and creates a 3D textured virtual model. We propose a multi-task network called JFNet to predict fashion landmarks and segment the garment into parts. The landmark prediction results are used to guide template-based deformation. The semantic part segmentation results are used for texture extraction. We show that our system can create 3D virtual models for T-shirts and pants effectively.
6 Limitation
One limitation is due to the representation power of the templates. Because our model is deformed from a template, the shape of the template limits the range of garments we can model. For example, our pants template is a regular fit; modeling slim or skinny pants with it is therefore less accurate.
7 Future Work
Currently, 2D proportions from the photos are transferred to the 3D model. In the future, we want to use a garment modeling approach that uses sewing patterns [17]. We can fit the shape of each individual 2D sewing pattern using image part segmentation. Then, these 2D patterns can be assembled in 3D space as in the commercial garment design process. In this way, we can better transfer the shapes from 2D images to 3D models.

We also want to investigate if more than two images can be used together to texture a 3D model [2]. The distorted textures along the silhouettes of the front and back views can be filled in by a side-view photo.

For applications that require accurate 3D information, we would like to perform a quantitative evaluation of our 3D modeling algorithm.

Finally, by incorporating more garment templates, more garment types can be supported. Since we only need to create a template once for each type/fit, the overhead is small if used at large scale. There are certain garments that are not suitable for our approach (e.g., fancy dresses with customized designs). A possible approach is to use a hybrid system where template-based deformation generates a base model and 3D details are added via other methods. Part segmentation in its current state is not suitable for open jackets. It would be interesting to see if a semantic segmentation model with more data and annotation can distinguish between the back side and the front side.

Acknowledgments
The authors wish to thank the reviewers for their insightful comments and suggestions.

References
[1] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
[2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pp. 98–109, Sep. 2018. doi: 10.1109/3DV.2018.00022
[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8387–8397, June 2018. doi: 10.1109/CVPR.2018.00875
[4] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1014–1021, June 2009. doi: 10.1109/CVPR.2009.5206754
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, Dec 2017. doi: 10.1109/TPAMI.2016.2644615
[6] F. Berthouzoz, A. Garg, D. M. Kaufman, E. Grinspun, and M. Agrawala. Parsing sewing patterns into 3d garments. ACM Trans. Graph., 32(4):85:1–85:12, July 2013. doi: 10.1145/2461912.2461975
[7] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):99:1–99:9, Aug. 2008. doi: 10.1145/1360612.1360698
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
[9] X. Chen, B. Zhou, F. Lu, L. Wang, L. Bi, and P. Tan. Garment modeling with a depth camera. ACM Trans. Graph., 34(6):203:1–203:12, Oct. 2015. doi: 10.1145/2816795.2818059
[10] R. Daněřek, E. Dibra, C. Öztireli, R. Ziegler, and M. Gross. DeepGarment: 3d garment shape estimation from a single image. Comput. Graph. Forum, 36(2):269–280, May 2017. doi: 10.1111/cgf.13125
[11] J. M. de Joya, R. Narain, J. F. O'Brien, A. Samii, and V. Zordan. Berkeley Garment Library, 2012. Available at http://graphics.berkeley.edu/resources/GarmentLibrary/index.html.
[12] P. Decaudin, D. Julius, J. Wither, L. Boissieux, A. Sheffer, and M.-P. Cani. Virtual garments: A fully geometric approach for clothing design. Computer Graphics Forum, 25(3):625–634. doi: 10.1111/j.1467-8659.2006.00982.x
[13] K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin. Instance-level human parsing via part grouping network. In The European Conference on Computer Vision (ECCV), September 2018.
[14] M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt. LiveCap: Real-time human performance capture from monocular video. ACM Trans. Graph., 38(2):14:1–14:17, Mar. 2019. doi: 10.1145/3311970
[15] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis. VITON: An image-based virtual try-on network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7543–7552, June 2018. doi: 10.1109/CVPR.2018.00787
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016. doi: 10.1109/CVPR.2016.90
[17] M.-H. Jeong, D.-H. Han, and H.-S. Ko. Garment capture from a photograph. Computer Animation and Virtual Worlds, 26(3-4):291–300. doi: 10.1002/cav.1653
[18] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5454–5463, July 2017. doi: 10.1109/CVPR.2017.579
[19] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[20] X. Liang, K. Gong, X. Shen, and L. Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis & Machine Intelligence, p. 1, 2018. doi: 10.1109/TPAMI.2018.2820063
[21] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion landmark detection in the wild. In European Conference on Computer Vision (ECCV), October 2016.
[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, June 2015. doi: 10.1109/CVPR.2015.7298965
[23] Y. Meng, C. C. L. Wang, and X. Jin. Flexible shape control for automatic resizing of apparel products. Comput. Aided Des., 44(1):68–76, Jan. 2012. doi: 10.1016/j.cad.2010.11.008
[24] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima. SiCloPe: Silhouette-based clothed people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
[25] N. Neverova, R. Alp Guler, and I. Kokkinos. Dense pose transfer. In The European Conference on Computer Vision (ECCV), September 2018.
[26] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black. ClothCap: Seamless 4d clothing capture and retargeting. ACM Trans. Graph., 36(4):73:1–73:15, July 2017. doi: 10.1145/3072959.3073711
[27] V. Ramakrishna, D. Munoz, M. Hebert, J. Andrew Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds., Computer Vision – ECCV 2014, pp. 33–47. Springer International Publishing, Cham, 2014.
[28] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2018. doi: 10.1109/TPAMI.2017.2781233
[29] C. Robson, R. Maharik, A. Sheffer, and N. Carr. Context-aware garment modeling from sketches. Comput. Graph., 35(3):604–613, June 2011. doi: 10.1016/j.cag.2011.03.002
[30] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Springer International Publishing, Cham, 2015.
[31] S. Schaefer, T. McPhail, and J. Warren. Image deformation using moving least squares. ACM Trans. Graph., 25(3):533–540, July 2006. doi: 10.1145/1141911.1141920
[32] T. W. Sederberg and S. R. Parry. Free-form deformation of solid geometric models. SIGGRAPH Comput. Graph., 20(4):151–160, Aug. 1986. doi: 10.1145/15886.15903
[33] Y. Tian, C. L. Zitnick, and S. G. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds., Computer Vision – ECCV 2012, pp. 256–269. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[34] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, June 2014. doi: 10.1109/CVPR.2014.214
[35] C. C. L. Wang, Y. Wang, and M. M. F. Yuen. Design automation for customized apparel products. Comput. Aided Des., 37(7):675–691, June 2005. doi: 10.1016/j.cad.2004.08.007
[36] T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra. Learning a shared shape space for multimodal garment design. ACM Trans. Graph., 37(6):1:1–1:14, 2018. doi: 10.1145/3272127.3275074
[37] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732, June 2016. doi: 10.1109/CVPR.2016.511
[38] R. White, K. Crane, and D. A. Forsyth. Capturing and animating occluded cloth. ACM Trans. Graph., 26(3), July 2007. doi: 10.1145/1276377.1276420
[39] S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Unconstrained fashion landmark detection via hierarchical recurrent transformer networks. In Proceedings of the 2017 ACM on Multimedia Conference, MM '17, pp. 172–180. ACM, New York, NY, USA, 2017. doi: 10.1145/3123266.3123276
[40] S. Yang, T. Ambert, Z. Pan, K. Wang, L. Yu, T. L. Berg, and M. C. Lin. Physics-inspired garment recovery from a single-view image. ACM Trans. Graph., 2018.
[41] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu. DoubleFusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[42] T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu. SimulCap: Single-view human performance capture with cloth simulation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[43] C. Zhang, S. Pujades, M. Black, and G. Pons-Moll. Detailed, accurate, human shape estimation from clothed 3d scan sequences. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5484–5493, July 2017. doi: 10.1109/CVPR.2017.582
[44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6230–6239, July 2017. doi: 10.1109/CVPR.2017.660
[45] B. Zhou, X. Chen, Q. Fu, K. Guo, and P. Tan. Garment modeling from a single image. Computer Graphics Forum, 32(7):85–91. doi: 10.1111/cgf.12215