InstantGeoAvatar: Animatable Avatars from Monocular Video

A. Budria et al.
[email protected]

1 Institut de Robòtica i Informàtica Industrial (CSIC-UPC)   2 Vody   3 Floorfy   4 Amazon
1 Introduction
Enabling the reconstruction and animation of 3D clothed avatars is a key step
to unlock the potential of emerging technologies in fields such as augmented re-
ality (AR), virtual reality (VR), 3D graphics and robotics. Interactivity and fast
iteration over reconstructions and intermediate results can help speed up work-
flows for designers and practitioners. Different sensors are available for learn-
ing clothed avatars, including monocular RGB cameras, depth sensors and 4D
scanners. RGB videos are the most widely available, yet provide the weakest
supervisory signal, making this the most challenging configuration.
*Work done while at Institut de Robòtica i Informàtica Industrial (CSIC-UPC).
The project website is at github.com/alvaro-budria/InstantGeoAvatar.
The advent of neural radiance fields (NeRFs) [70] enabled techniques for
novel view synthesis and animation of human avatars from RGB image supervi-
sion only [26, 38, 93, 109]. Volume rendering-based approaches typically learn a
canonical representation that is deformed with linear blend skinning [65] and an
additional non-rigid deformation [77, 104, 115]. Despite producing good render-
ings of human avatars, these techniques lack awareness of the underlying geome-
try. In parallel, some works have adopted a signed distance function (SDF) [111]
as basic primitive for learning clothed human avatars from 3D scans [5, 85, 107].
To remove the need for 3D supervision, some works have embedded SDFs within
a volumetric rendering pipeline [29,77,103]. Significant steps have been taken to
speed up training of NeRF-based approaches [26, 38] by leveraging an efficient
hash-grid spatial encoding [72]. Subsequent work has tried to improve training
on such unstable and noisy hash grids [23, 56] with some success. To date, however, fast geometry learning in general, and the effective use of hash grid-based representations in particular, for clothed human avatars with RGB supervision only remain elusive.
The challenges faced by NeRF- and SDF-based approaches trained with volume rendering can be succinctly summarized as: capturing realistic non-rigid deformations, dealing with noisy pose and camera estimates, slow training, and, in the case of hash grid-based methods, unstable optimization. In this paper, we focus specifically on the last two challenges and aim to take a significant step towards interactive human avatar modelling.
We propose InstantGeoAvatar, a system capable of yielding good rendering and reconstruction quality in as little as 5 minutes of training, down from the several hours required by prior work. Building on recent advances in fast training of NeRF-based systems [38, 72] and efficient training of hash grid encodings [23, 56], we demonstrate that even in combination these prior improvements are insufficient for fast and effective learning of 3D clothed humans. Thus we
propose a simple yet effective regularization scheme that imposes a local ge-
ometric consistency prior during optimization, effectively removing undesired
artifacts and defects on the surface. The proposed approach, which effectively
constrains surface curvature and torsion over continuous SDFs along ray direc-
tions, is easy to implement, fits neatly within the volume rendering pipeline, and
delivers noticeable improvements over our base model without additional cost.
2 Related Work
Template and mesh-based approaches can yield reasonable results even in the
low-data regime by leveraging a low-rank human shape prior. On the other hand,
implicit representations are continuous by design and have been used to produce
detailed reconstructions of a clothed human body. A specific subline of work
aims at speeding up training of radiance field-based methods while maintaining
good reconstruction quality. More recently, methods for human body modelling
based on Gaussian Splats have shown compelling results at fast rendering times.
Explicit Representations. Mesh-based techniques typically represent cloth defor-
mations as deformation offsets [2–4, 6, 7, 10, 12, 28, 44, 53, 71] added to a mini-
mally clothed parametric human body model prior (e.g. SMPL [65]) that has
been previously rigged to a particular subject. Despite being highly compatible
with parametric human body models, these approaches struggle with garment
deformations and are limited to a low resolution topology. Point cloud-based
methods [59–61, 66, 68, 80, 98, 116, 117, 120, 121] have shown promising results by
combining the advantages of a representation that is explicit while still being able to model varying topologies.
Neural Volumetric Rendering and Implicit Representations. Previous works have
successfully applied neural radiance fields [70] and signed distance functions
(SDFs) [75, 101, 111] as basic primitives for modelling human avatars. Image
based methods trained on 3D scans [5,31,32,36,78,85,86,107,108,122] and meth-
ods based on RGB-D and pointcloud data [6,10,19,21,51,55,67,90,100,118,119]
have difficulties with out-of-distribution poses, can fail to produce temporally
consistent reconstructions, and suffer from incomplete observations in the case
of RGB-D based methods. While some methods relying solely on 2D images
attempt to train models that generalize to multiple subjects and can perform
inference of novel views of a human directly [15, 18, 46], most such methods in-
volve volumetric rendering and learn a per-subject representation in a canonical
space which can then be deformed to a given pose [11, 13, 16, 17, 24–26, 29, 38,
42, 54, 57, 62, 69, 77, 81, 88, 89, 92, 93, 102–104, 106, 109, 110, 115, 123, 124, 127] or
simply train a per-subject model without any canonical space [21,37,47,48,125].
Significant effort has been devoted to improving the performance of volume
rendering-based methods on various fronts. [29, 39, 93, 103, 104] jointly optimize
neural network and body pose parameters. [11,16,17,42,69,81,88,103,110] focus
on the computation of correspondences between posed and canonical space. To
improve the posing of the canonical representation, [16, 17, 24, 25, 54, 77, 81, 115]
learn or refine skinning weights during training. [24, 25, 29, 39, 54, 77, 103, 104,
109] add a non-rigid deformation module on top of the rigid deformation of a
parametric body model. [25, 57, 69, 89, 92, 93, 115] leverage pose encodings that
help improve the results of pose-conditioned components. [29] allows segmenting
a human on a video without mask supervision. Despite their success, these works
do not demonstrate high geometry reconstruction quality at fast speeds.
An additional line of work aims at speeding up the fitting of models to each
scene. Some works [24,26,38,102,106,127] have employed a multiresolution hash
grid spatial encoding [72]. Despite its effectiveness in accelerating training, it has
been observed [23,33,56,63,126] that it lacks the implicit regularization towards
lower frequencies that MLPs enjoy [83]. Even though the Fourier positional encoding commonly paired with coordinate-based MLPs partially counteracts this bias [84], its frequency bands can be modulated to maintain a desirable level of smoothness, as is common practice in the works cited above. Some works have attempted to offset the lack of regularization in hash
grids by considering the behaviour of gradients [33, 56], implementing coarse-
to-fine strategies [56, 63] and including a hybrid positional encoding to recover
the regularizing effect [23, 126]. As shown in our experiments, these improve-
ments are insufficient for obtaining quality reconstructions, hence the need for a more suitable training scheme, which we deliver in the form of an additional regularization term within the volume rendering pipeline.
Gaussian Splats [43] have emerged as a powerful alternative primitive for volu-
metric rendering. Recent work has shown impressive results at the task of novel
view synthesis of human avatars [35, 40, 45, 58, 76, 82, 87, 99], even achieving re-
markably short training times [52], but we are unaware of any works specifically
focusing on modeling the geometry of human avatars from monocular video.
3 Method
3.1 Preliminaries
We next describe the fundamental building blocks of the proposed pipeline,
including an accelerated signed distance and texture field to model shape and
appearance in canonical space and a canonicalization module combined with a
non-rigid deformation to map this canonical representation into deformed space.
Canonical Signed Distance Field. We model human geometry in a canonical
space with a signed distance function f sdf that assigns a distance value and a
feature vector to each point in the 3D canonical space:
\boldsymbol{f}_{sdf}: \mathbb{R}^3 \rightarrow \mathbb{R} \times \mathbb{R}^{16}, \quad x_c \mapsto (\textbf{d}, \textbf{v}) \label{eq:sdf} (1, 2)
The body shape S in canonical space is then the zero-level set of f_sdf:
\mathcal{S} = \{\, x_c \,|\, \boldsymbol{f}_{sdf}(x_c) = 0 \,\} (3)
Following [38] we use the multiresolution hash feature grid encoding from
Instant-NGP [72] to parameterize f sdf .
Canonical Texture Field. We learn a texture field f rgb in canonical space that
models the subject’s appearance, conditioned on the SDF’s predicted features:
\boldsymbol{f}_{rgb}: \mathbb{R}^3 \times \mathbb{R}^{16} \rightarrow \mathbb{R}^3, \quad (x_c, \textbf{v}) \mapsto \textbf{c} \label{eq:rgb} (4, 5)
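To make the two canonical fields concrete, the following is a minimal PyTorch sketch under our own naming conventions. The multiresolution hash-grid encoder is abstracted behind a hypothetical encoding_fn argument (standing in for an Instant-NGP style encoding [72]); the 16-dimensional feature vector matches the formulation above, while layer widths and activations are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    # f_sdf: R^3 -> (R, R^16). Sketch only; encoding_fn is a hypothetical
    # stand-in for a multiresolution hash-grid encoding of x_c.
    def __init__(self, encoding_fn, enc_dim, hidden=64, feat_dim=16):
        super().__init__()
        self.encode = encoding_fn                # maps (N, 3) -> (N, enc_dim)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1 + feat_dim),     # signed distance + feature vector
        )

    def forward(self, x_c):
        out = self.mlp(self.encode(x_c))
        return out[..., :1], out[..., 1:]        # d (distance), v (features)

class CanonicalTexture(nn.Module):
    # f_rgb: (R^3, R^16) -> R^3, conditioned on the SDF feature vector v.
    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB color in [0, 1]
        )

    def forward(self, x_c, v):
        return self.mlp(torch.cat([x_c, v], dim=-1))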
Articulating the Canonical Representation. We leverage the SMPL [65] parametric body model to map a canonical point x_c to a deformed point x_d via linear blend skinning (LBS), according to a set of bone transformations B_i derived from the body pose θ:
x_d = LBS(x_c) = \sum_{i=1}^{n_b} w_i \boldsymbol{B}_i x_c, \label{eq:lbs} (6)
where w_i are the skinning weights associated with x_c and n_b is the number of bones. To capture pose-dependent non-rigid effects such as cloth deformation, we additionally learn a deformation field that predicts a per-point offset Δx conditioned on the body pose θ:
\boldsymbol{f}_{\Delta x}: \mathbb{R}^3 \times \mathbb{R}^{69} \rightarrow \mathbb{R}^3, \quad (x_c, \theta) \mapsto \Delta x \label{eq:nonrigid} (7, 8)
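A compact sketch of the articulation step follows, assuming the per-point skinning weights w and the per-bone transformations B have already been obtained from the SMPL model and the pose θ; function names and the small offset MLP are our own illustrative choices, not the authors' code.

import torch
import torch.nn as nn

def lbs_deform(x_c, w, B):
    # Linear blend skinning, Eq. (6).
    # x_c: (N, 3) canonical points, w: (N, n_b) skinning weights,
    # B: (n_b, 4, 4) bone transformations derived from pose theta.
    x_h = torch.cat([x_c, torch.ones_like(x_c[..., :1])], dim=-1)  # homogeneous coordinates
    T = torch.einsum('nb,bij->nij', w, B)                          # per-point blended transform
    return torch.einsum('nij,nj->ni', T, x_h)[..., :3]             # deformed points x_d

class NonRigidOffset(nn.Module):
    # Pose-conditioned offset field f_dx, Eqs. (7)-(8).
    def __init__(self, pose_dim=69, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_c, theta):
        theta = theta.expand(x_c.shape[0], -1)   # broadcast pose to every point
        return self.mlp(torch.cat([x_c, theta], dim=-1))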
3.2 Volume Rendering
Following VolSDF [111], the signed distance at a canonical point x_c is converted into a volume density
\sigma(x_c) = \alpha \Bigl( \frac{1}{2} + \frac{1}{2}\,\mathrm{sgn}\bigl(-\boldsymbol{f}_{sdf}(x_c)\bigr)\Bigl(1 - \exp\Bigl(-\frac{|\boldsymbol{f}_{sdf}(x_c)|}{\beta}\Bigr)\Bigr) \Bigr), \label{eq:density} (9)
where α and β control the scale and sharpness of the density around the surface. The color of a pixel is then obtained by compositing the N_p samples along its camera ray:
\hat{C} = \sum_{i=1}^{N_p} \alpha_i \prod_{j<i} (1-\alpha_j)\,\boldsymbol{c}_i, \quad \text{with } \alpha_i = 1 - \exp(-\sigma_i \delta_i), (10)
where δ_i = ||x_c^{i+1} - x_c^{i}|| is the distance between consecutive samples.
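For illustration, a small sketch of Eqs. (9)-(10): signed distances sampled along a ray are mapped to densities and the samples are alpha-composited. The parameter names alpha_param and beta_param are our assumptions for the two scale parameters; this is not the authors' implementation.

import torch

def sdf_to_density(sdf, alpha_param, beta_param):
    # Eq. (9): Laplace-CDF style mapping from signed distance to volume density.
    return alpha_param * (0.5 + 0.5 * torch.sign(-sdf)
                          * (1.0 - torch.exp(-sdf.abs() / beta_param)))

def composite_rays(sigma, colors, deltas):
    # Eq. (10): alpha compositing of N_p samples per ray.
    # sigma, deltas: (R, N_p); colors: (R, N_p, 3).
    alphas = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas[:, :-1]], dim=-1), dim=-1)
    weights = alphas * trans                                  # alpha_i * prod_{j<i}(1 - alpha_j)
    return (weights.unsqueeze(-1) * colors).sum(dim=1)        # (R, 3) rendered pixel colors

The overall training objective combines a photometric term, a mask term, the Eikonal term and our smoothness term: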
\mathcal {L} = \lambda _{rgb} \mathcal {L}_{rgb} + \lambda _{\alpha } \mathcal {L}_{\alpha } + \lambda _{Eik} \mathcal {L}_{Eik} + \lambda _{smooth} \mathcal {L}_{smooth}. \label {eq:loss-rgb} (11)
L_rgb is the photometric loss on rendered pixel colors, L_α is an L1 loss between the ground-truth and rendered masks that helps guide the reconstruction, and L_Eik is the Eikonal loss term [27], which encourages the learned SDF to have unit-norm gradients and thus evenly spaced isosurfaces. The loss weightings are (λ_rgb, λ_α, λ_Eik, λ_smooth) = (10, 0.1, 0.1, 1.0); λ_smooth can be flexibly set within the range [0.5, 1.5] with a good balance between smoothing and preservation of detail.
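For completeness, a sketch of how these four terms could be combined with the weights above; l_rgb, l_mask, l_eik and l_smooth denote already-computed scalar losses and are placeholder names of ours.

# Loss weighting of Eq. (11); lambda_smooth may be varied within [0.5, 1.5].
LAMBDAS = {"rgb": 10.0, "alpha": 0.1, "eik": 0.1, "smooth": 1.0}

def total_loss(l_rgb, l_mask, l_eik, l_smooth, lambdas=LAMBDAS):
    return (lambdas["rgb"] * l_rgb + lambdas["alpha"] * l_mask
            + lambdas["eik"] * l_eik + lambdas["smooth"] * l_smooth)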
Surface Regularization. Directly substituting the NeRF [70] rendering module with the VolSDF [111] rendering scheme explained in Sec. 3.2 produces noisy, striped surfaces (Fig. 7a). The photometric loss only constrains the ray integral to render the desired color, and there are infinitely many shape variations that satisfy this condition. MLP-based architectures [29, 103] are naturally biased towards low-frequency solutions [83], but a hash grid-based representation lacks such an implicit bias.
Fig. 2: Non-local updates of the hash grid features. We consider a 1D hash grid
encoding segment to illustrate how the proposed regularization affects backpropagation
updates. Vanilla Eikonal loss (a) performs backpropagation updates on a single local
hash grid cell resulting in discontinuous and spatially disconnected updates. (b) [56]
used numerical gradients to distribute backpropagation updates to other cells in the
grid, resulting in more spatially coherent learned features. Our proposed smooth surface
regularization (c) also distributes backpropagation updates.
The difference between surface normals at consecutive samples along a ray approximates the directional derivative of the unit normal N along that ray, which by the Frenet–Serret formulas [95] decomposes as
\frac{dN}{ds} = -\kappa T + \tau B, (12)
with T the surface tangent vector, B the surface binormal vector, κ the curvature and τ the torsion. Thus, by minimizing this term, we directly penalize the amount of curvature (how much the surface deviates from a straight line) and torsion (how much the surface deviates from a planar curve) that the surface presents along this direction, effectively enforcing the surface to be well-behaved.
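As a quick check of this relation (our own one-line derivation, valid because the Frenet frame vectors T and B are orthonormal):
\left\lVert \frac{dN}{ds} \right\rVert^2 = \lVert -\kappa T + \tau B \rVert^2 = \kappa^2 \lVert T \rVert^2 + \tau^2 \lVert B \rVert^2 - 2\kappa\tau \langle T, B \rangle = \kappa^2 + \tau^2,
so a small change of the normal between consecutive ray samples bounds both the curvature and the torsion of the surface along that direction.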
We incorporate this regularization strategy into volume rendering without
additional sampling by computing finite differences on the estimated normals
∇x d(xc ) at each sampled point xc in canonical space, which we already obtained
for the Eikonal loss term LEik :
\mathcal {L}_{smooth} = \frac {1}{N_r} \sum _{i=1}^{N_r} \frac {1}{N^i_s} \sum _{j=1}^{N^i_s - 1} || \overline {\textbf {n}}_i^j - \hat {\textbf {n}}_i^j||_2 + || \overline {\textbf {n}}_i^j - \hat {\textbf {n}}_i^{j+1}||_2\label {eq:loss-smooth} (13)
where \hat{\textbf{n}}_i^j = \frac{\nabla_x d(x)}{||\nabla_x d(x)||_2} is the normalized estimated surface normal at point x_i^j, \overline{\textbf{n}}_i^j = \frac{1}{2}(\hat{\textbf{n}}_i^j + \hat{\textbf{n}}_i^{j+1}) is the estimated normal at the midpoint of the interval (x_i^j, x_i^{j+1}), N_s^i is the number of samples on the i-th ray, and N_r is the number of rays.
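The smoothness term can be implemented directly on top of the analytic normals already computed for the Eikonal loss. Below is a minimal PyTorch sketch of Eq. (13) under our own naming, assuming an equal number of samples per ray; it is an illustration, not the authors' code.

import torch
import torch.nn.functional as F

def smoothness_loss(normals):
    # Eq. (13): penalize the change of surface normals between consecutive
    # samples along each ray. normals: (N_r, N_s, 3) analytic SDF gradients.
    n = F.normalize(normals, dim=-1)             # unit normals n_hat
    n_mid = 0.5 * (n[:, :-1] + n[:, 1:])         # midpoint normals n_bar
    term = ((n_mid - n[:, :-1]).norm(dim=-1)     # ||n_bar - n_hat_j||
            + (n_mid - n[:, 1:]).norm(dim=-1))   # ||n_bar - n_hat_{j+1}||
    return term.mean()                           # average over rays and samples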
Moreover, it has been noted [56] that surface normals ∇x d(x) computed on
hash grid-based representations only depend on the voxel features that imme-
diately surround the point x. When learning with gradients from the Eikonal
loss LEik , this results in spatially discontinuous and incoherent updates that are
limited to a single grid cell (Fig. 2a). Thus, by combining normals computed at
different points with our approach (Fig. 2c), a greater number of adjacent and
nearby grid cells are involved in the computations, and the resulting updates
from the loss term are more stable and spatially coherent.
Discussion. Both on a conceptual and computational level, our proposed term
is significantly different from the Eikonal loss [27], the use of finite differences
for surface normal computation [56], and the curvature loss [56] from previous
works. The Eikonal loss L_Eik has a variational motivation and is derived from a wave propagation PDE. It acts as a global constraint on the SDF, ensuring its level sets expand through space like concentric layers, without spurious isolated components. Our constraint, in contrast, is geometric and differential: it operates on local derivatives and penalizes not the magnitude of the gradient at a fixed point, but the change of the normals along ray directions.
Our approach is similar to that of [56] in that we also propagate gradient
updates to multiple cells for more stable training. But unlike them, we achieve
this effect by considering the surface normals at consecutive ray points, and use
more precise analytical gradients to compute those normals. We only take finite
differences along ray directions to compute the derivative of the surface normal.
The curvature loss from [56] computes the Laplacian considering how much
an SDF value at each sample point diverges from the SDF values of neighbors
at a fixed distance, which in three-dimensional space is not explicitly related
to the curvature and torsion of the surface. Moreover, computing it involves
6 additional samples and relies on a fixed scale on the finite differences which
needs to be scheduled over training. On the other hand, our approach explicitly
relates the directional derivative of the surface normals to geometric quantities of
curvature and torsion. Since the rays’ direction of incidence on the surface varies
with pixel location and camera location, the regularization is effectively applied
multi-scale over training (Fig. 4), without the need for any scale scheduling nor
any additional samples.
4 Experiments
Geometry reconstruction comparison per sequence (CD ↓, NC ↑, IoU ↑):

Sequence | Metric |  V2A  | V2A10 |  IA   | Ours  ||  V2A  | V2A10 |  IA   | Ours
25       | CD ↓   | 0.304 | 0.602 | 0.901 | 0.579 || 0.751 | 0.892 | 1.72  | 0.881
         | NC ↑   | 0.919 | 0.894 | 0.688 | 0.910 || 0.889 | 0.871 | 0.699 | 0.879
         | IoU ↑  | 0.977 | 0.920 | 0.842 | 0.974 || 0.931 | 0.928 | 0.845 | 0.930
28       | CD ↓   | 0.393 | 0.842 | 0.851 | 0.600 || 1.096 | 1.445 | 2.452 | 1.089
         | NC ↑   | 0.922 | 0.871 | 0.727 | 0.914 || 0.860 | 0.842 | 0.720 | 0.885
         | IoU ↑  | 0.963 | 0.902 | 0.755 | 0.947 || 0.911 | 0.901 | 0.761 | 0.909
35       | CD ↓   | 0.417 | 0.641 | 1.073 | 0.557 || 0.705 | 0.804 | 1.213 | 0.800
         | NC ↑   | 0.912 | 0.879 | 0.673 | 0.891 || 0.883 | 0.869 | 0.678 | 0.871
         | IoU ↑  | 0.944 | 0.902 | 0.798 | 0.923 || 0.945 | 0.912 | 0.794 | 0.896
36       | CD ↓   | 0.572 | 0.748 | 1.494 | 0.691 || 0.760 | 0.903 | 2.23  | 0.871
         | NC ↑   | 0.844 | 0.817 | 0.576 | 0.838 || 0.842 | 0.830 | 0.579 | 0.845
         | IoU ↑  | 0.950 | 0.915 | 0.771 | 0.936 || 0.910 | 0.899 | 0.768 | 0.922
58       | CD ↓   | 0.311 | 0.512 | 1.693 | 0.597 || 0.732 | 0.785 | 1.892 | 0.842
         | NC ↑   | 0.930 | 0.887 | 0.681 | 0.875 || 0.835 | 0.824 | 0.682 | 0.838
         | IoU ↑  | 0.973 | 0.964 | 0.704 | 0.950 || 0.898 | 0.863 | 0.705 | 0.877
4.1 Datasets
4.2 Baselines
Rendering quality comparison per sequence (PSNR ↑, SSIM ↑, LPIPS ↓):

Sequence | Metric  |  V2A  | V2A10 |  IA   | Ours  ||  V2A  | V2A10 |  IA   | Ours
25       | PSNR ↑  | 29.82 | 25.12 | 30.05 | 29.87 || 23.09 | 19.66 | 23.32 | 23.56
         | SSIM ↑  | 0.991 | 0.960 | 0.986 | 0.983 || 0.965 | 0.937 | 0.957 | 0.964
         | LPIPS ↓ | 0.016 | 0.033 | 0.012 | 0.020 || 0.025 | 0.045 | 0.032 | 0.027
28       | PSNR ↑  | 28.20 | 24.35 | 27.63 | 27.74 || 24.50 | 20.93 | 24.56 | 24.77
         | SSIM ↑  | 0.980 | 0.952 | 0.982 | 0.976 || 0.963 | 0.942 | 0.964 | 0.966
         | LPIPS ↓ | 0.017 | 0.025 | 0.019 | 0.032 || 0.028 | 0.038 | 0.030 | 0.033
35       | PSNR ↑  | 29.01 | 25.49 | 29.47 | 28.68 || 25.68 | 21.34 | 25.23 | 25.21
         | SSIM ↑  | 0.984 | 0.961 | 0.988 | 0.979 || 0.960 | 0.952 | 0.955 | 0.958
         | LPIPS ↓ | 0.019 | 0.032 | 0.011 | 0.028 || 0.031 | 0.048 | 0.027 | 0.039
36       | PSNR ↑  | 29.59 | 26.77 | 31.74 | 30.97 || 26.41 | 22.05 | 25.07 | 26.33
         | SSIM ↑  | 0.973 | 0.957 | 0.981 | 0.974 || 0.953 | 0.947 | 0.950 | 0.955
         | LPIPS ↓ | 0.015 | 0.030 | 0.010 | 0.021 || 0.036 | 0.040 | 0.028 | 0.031
58       | PSNR ↑  | 30.32 | 26.93 | 31.05 | 30.54 || 23.88 | 21.40 | 23.82 | 24.05
         | SSIM ↑  | 0.986 | 0.959 | 0.989 | 0.985 || 0.964 | 0.951 | 0.957 | 0.960
         | LPIPS ↓ | 0.018 | 0.026 | 0.009 | 0.016 || 0.025 | 0.040 | 0.033 | 0.029
Alternative regularization schemes based on [56]:

Eikonal w/ Finite Diffs. [56]         | 28.60 | 0.960 | 2.45 | 0.657
Curvature Loss [56]                   | 28.51 | 0.957 | 2.03 | 0.536
Eik. w/ F. Diffs. and Curv. Loss [56] | 28.47 | 0.959 | 5.03 | 0.52
The proposed SDF smoothing technique significantly boosts quality (see Table 6, and
Figures 5 and 7). Even after training for several hours, the numerical gradients
and the curvature loss fail to yield satisfactory results (Figure 3). Tweaking the
weight of the Eikonal loss term is insufficient for good reconstruction quality
(Figure 5). Table 7 shows that our approach keeps the same speed as the base
model and incurs less memory overhead than other approaches.
5 Conclusions
We proposed InstantGeoAvatar, a method that can model geometry and appear-
ance of animatable human avatars from monocular videos in less than 10 minutes.
Building on previous work, we notice that hash grid-based representations lack
the implicit regularization that MLP-based architectures enjoy, and introduce
a surface regularization term in the optimization that effectively enhances the
learned representation without additional computational cost.
Acknowledgements
This work has been supported by the project MOHUCO PID2020-120049RB-I00
funded by MCIU/AEI/10.13039/501100011033, and by the project GRAVATAR
PID2023-151184OB-I00 funded by MCIU/AEI/10.13039/501100011033 and by
ERDF, UE.
References
1. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Per-
formance capture from sparse multi-view video. In: Proc. SIGGRAPH (2008)
2. Alldieck, T., Magnor, M., Bhatnagar, B.L., Theobalt, C., Pons-Moll, G.: Learning
to reconstruct people in clothing from a single RGB camera. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (jun 2019)
3. Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Video based
reconstruction of 3d people models. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2018)
4. Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Detailed human
avatars from monocular video. In: 2018 International Conference on 3D Vision
(3DV). pp. 98–109. IEEE (2018)
5. Alldieck, T., Zanfir, M., Sminchisescu, C.: Photorealistic monocular 3d reconstruc-
tion of humans wearing clothing. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) (2022)
6. Bhatnagar, B.L., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Combining im-
plicit function learning and parametric models for 3d human reconstruction. In:
European Conference on Computer Vision (ECCV). Springer (August 2020)
7. Bhatnagar, B.L., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Loopreg: Self-
supervised learning of implicit surface correspondences, pose and shape for 3d
human mesh registration. Advances in Neural Information Processing Systems
33, 12909–12922 (2020)
8. Bozic, A., Palafox, P., Zollhofer, M., Thies, J., Dai, A., Niessner, M.: Neural
deformation graphs for globally-consistent non-rigid reconstruction. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 1450–1459 (June 2021)
9. Bradley, D., Popa, T., Sheffer, A., Heidrich, W., Boubekeur, T.: Markerless gar-
ment capture. ACM Transactions on Graphics (Proc. SIGGRAPH 2008) 27(3),
99 (2008)
10. Burov, A., Nießner, M., Thies, J.: Dynamic surface function networks for clothed
human bodies. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 10754–10764 (2021)
11. Cai, H., Feng, W., Feng, X., Wang, Y., Zhang, J.: Neural surface reconstruction
of dynamic scenes with monocular rgb-d camera. Advances in Neural Information
Processing Systems 35, 967–981 (2022)
12. Casado-Elvira, A., Comino Trinidad, M., Casas, D.: PERGAMO: Personalized
3d garments from monocular video. Computer Graphics Forum (Proc. of SCA),
2022 (2022)
13. Chen, D., Lu, H., Feldmann, I., Schreer, O., Eisert, P.: Dynamic multi-view scene
reconstruction using neural implicit surface. In: ICASSP 2023-2023 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5.
IEEE (2023)
14. Chen, J., Zhang, Y., Kang, D., Zhe, X., Bao, L., Jia, X., Lu, H.: Animatable
neural radiance fields from monocular rgb videos (2021)
15. Chen, M., Zhang, J., Xu, X., Liu, L., Cai, Y., Feng, J., Yan, S.: Geometry-guided
progressive nerf for generalizable and efficient neural human rendering. In: Euro-
pean Conference on Computer Vision. pp. 222–239. Springer (2022)
16. Chen, X., Jiang, T., Song, J., Rietmann, M., Geiger, A., Black, M.J., Hilliges,
O.: Fast-snarf: A fast deformer for articulated neural fields. IEEE Transactions
on Pattern Analysis and Machine Intelligence 45(10), 11796–11809 (2023)
17. Chen, X., Zheng, Y., Black, M.J., Hilliges, O., Geiger, A.: Snarf: Differentiable
forward skinning for animating non-rigid neural implicit shapes. In: International
Conference on Computer Vision (ICCV) (2021)
18. Chen, Y., Wang, X., Chen, X., Zhang, Q., Li, X., Guo, Y., Wang, J., Wang,
F.: Uv volumes for real-time rendering of editable free-view human performance.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 16621–16631 (2023)
19. Chibane, J., Alldieck, T., Pons-Moll, G.: Implicit functions in feature space for
3d shape reconstruction and completion. In: Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition. pp. 6970–6981 (2020)
20. Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe,
H., Kirk, A., Sullivan, S.: High-quality streamable free-viewpoint video. In: ACM
Trans. Graph. vol. 34. Association for Computing Machinery, New York, NY,
USA (jul 2015)
21. Dong, Z., Guo, C., Song, J., Chen, X., Geiger, A., Hilliges, O.: Pina: Learning a
personalized implicit neural avatar from a single rgb-d video sequence. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 20470–20480 (2022)
22. Dou, M., Davidson, P., Fanello, S.R., Khamis, S., Kowdle, A., Rhemann, C.,
Tankovich, V., Izadi, S.: Motion2fusion: real-time volumetric performance cap-
ture. ACM Trans. Graph. 36(6) (nov 2017)
23. Engelhardt, A., Raj, A., Boss, M., Zhang, Y., Kar, A., Li, Y., Sun, D., Brualla,
R.M., Barron, J.T., Lensch, H., et al.: Shinobi: Shape and illumination using
neural object decomposition via brdf optimization in-the-wild. arXiv preprint
arXiv:2401.10171 (2024)
24. Fan, J., Zhang, J., Hou, Z., Tao, D.: Anipixel: Towards animatable pixel-aligned
human avatar. In: Proceedings of the 31st ACM International Conference on
Multimedia. p. 8626–8634. MM ’23, Association for Computing Machinery, New
York, NY, USA (2023)
25. Gao, Q., Wang, Y., Liu, L., Liu, L., Theobalt, C., Chen, B.: Neural novel actor:
Learning a generalized animatable neural representation for human actors. IEEE
Transactions on Visualization and Computer Graphics (2023)
26. Geng, C., Peng, S., Xu, Z., Bao, H., Zhou, X.: Learning neural volumetric repre-
sentations of dynamic humans in minutes. In: CVPR (2023)
27. Gropp, A., Yariv, L., Haim, N., Atzmon, M., Lipman, Y.: Implicit geometric
regularization for learning shapes. In: III, H.D., Singh, A. (eds.) Proceedings of
the 37th International Conference on Machine Learning. Proceedings of Machine
Learning Research, vol. 119, pp. 3789–3799. PMLR (13–18 Jul 2020), https:
//proceedings.mlr.press/v119/gropp20a.html
28. Guo, C., Chen, X., Song, J., Hilliges, O.: Human performance capture from
monocular video in the wild. In: 2021 International Conference on 3D Vision
(3DV). pp. 889–898 (2021)
29. Guo, C., Jiang, T., Chen, X., Song, J., Hilliges, O.: Vid2avatar: 3d avatar re-
construction from videos in the wild via self-supervised scene decomposition.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) (June 2023)
30. Guo, K., Lincoln, P., Davidson, P.L., Busch, J., Yu, X., Whalen, M., Harvey, G.,
Orts-Escolano, S., Pandey, R., Dourgarian, J., Tang, D., Tkach, A., Kowdle, A.,
Cooper, E., Dou, M., Fanello, S.R., Fyffe, G., Rhemann, C., Taylor, J., Debevec,
P.E., Izadi, S.: The relightables: volumetric performance capture of humans with
realistic relighting. ACM Trans. Graph. 38(6), 217:1–217:19 (2019)
31. He, T., Collomosse, J., Jin, H., Soatto, S.: Geo-pifu: Geometry and pixel aligned
implicit functions for single-view human reconstruction. Advances in Neural In-
formation Processing Systems 33, 9276–9287 (2020)
32. He, T., Xu, Y., Saito, S., Soatto, S., Tung, T.: Arch++: Animation-ready clothed
human reconstruction revisited. In: 2021 IEEE/CVF International Conference on
Computer Vision (ICCV). pp. 11026–11036 (2021). https://doi.org/10.1109/
ICCV48922.2021.01086
33. Heo, H., Kim, T., Lee, J., Lee, J., Kim, S., Kim, H.J., Kim, J.H.: Robust camera
pose refinement for multi-resolution hash encoding. In: Proceedings of the 40th
International Conference on Machine Learning. ICML’23, JMLR.org (2023)
34. Hilton, A., Starck, J.: Multiple view reconstruction of people. In: Proceedings
of the 2nd International Symposium on 3D Data Processing, Visualization and
Transmission, 2004. 3DPVT 2004. pp. 357–364 (2004)
35. Hu, L., Zhang, H., Zhang, Y., Zhou, B., Liu, B., Zhang, S., Nie, L.: Gaussiana-
vatar: Towards realistic human avatar modeling from a single video via animat-
able 3d gaussians. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) (2024)
36. Huang, Z., Xu, Y., Lassner, C., Li, H., Tung, T.: Arch: Animatable reconstruction
of clothed humans. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (June 2020)
37. Iqbal, U., Caliskan, A., Nagano, K., Khamis, S., Molchanov, P., Kautz, J.: Rana:
Relightable articulated neural avatars. In: ICCV (2023)
38. Jiang, T., Chen, X., Song, J., Hilliges, O.: Instantavatar: Learning avatars from
monocular video in 60 seconds. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 16922–16932 (2023)
39. Jiang, W., Yi, K.M., Samei, G., Tuzel, O., Ranjan, A.: Neuman: Neural human
radiance field from a single video. In: European Conference on Computer Vision.
pp. 402–418. Springer (2022)
40. Jiang, Y., Shen, Z., Wang, P., Su, Z., Hong, Y., Zhang, Y., Yu, J., Xu, L.:
Hifi4g: High-fidelity human performance rendering via compact gaussian splat-
ting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 19734–19745 (June 2024)
41. Kanade, T., Rander, P., Narayanan, P.: Virtualized reality: constructing virtual
worlds from real scenes. vol. 4, pp. 34–47 (1997). https://doi.org/10.1109/93.
580394
42. Kant, Y., Siarohin, A., Guler, R.A., Chai, M., Ren, J., Tulyakov, S., Gilitschenski,
I.: Invertible neural skinning. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 8715–8725 (2023)
43. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
44. Kim, H., Nam, H., Kim, J., Park, J., Lee, S.: Laplacianfusion: Detailed 3d clothed-
human body reconstruction. ACM Trans. Graph. 41(6) (nov 2022)
45. Kocabas, M., Chang, J.H.R., Gabriel, J., Tuzel, O., Ranjan, A.: HUGS: Human
gaussian splatting. In: 2024 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR) (2024), https://arxiv.org/abs/2311.17910
46. Kwon, Y., Kim, D., Ceylan, D., Fuchs, H.: Neural human performer: Learning
generalizable radiance fields for human performance rendering. Advances in Neu-
ral Information Processing Systems 34, 24741–24752 (2021)
47. Kwon, Y., Kim, D., Ceylan, D., Fuchs, H.: Neural human performer: Learning
generalizable radiance fields for human performance rendering. Advances in Neu-
ral Information Processing Systems 34 (2021)
48. Kwon, Y., Liu, L., Fuchs, H., Habermann, M., Theobalt, C.: Deliffas: Deformable
light fields for fast avatar synthesis. Advances in Neural Information Processing
Systems (2023)
49. Leroy, V., Franco, J.S., Boyer, E.: Volume sweeping: Learning photoconsistency
for multi-view shape reconstruction. In: International Journal of Computer Vision.
vol. 129, pp. 1–16 (02 2021)
50. Li, C., Zhao, Z., Guo, X.: Articulatedfusion: Real-time reconstruction of motion,
geometry and segmentation using a single depth camera. In: Proceedings of the
European Conference on Computer Vision (ECCV) (September 2018)
51. Li, D., Shao, T., Wu, H., Zhou, K.: Shape completion from a single rgbd image.
IEEE Transactions on Visualization and Computer Graphics 23(7), 1809–1822
(2017)
52. Li, M., Tao, J., Yang, Z., Yang, Y.: Human101: Training 100+fps human gaussians
in 100s from 1 view (2023)
53. Li, R., Dumery, C., Guillard, B., Fua, P.: Garment recovery with shape and de-
formation priors (2023)
54. Li, R., Tanke, J., Vo, M., Zollhöfer, M., Gall, J., Kanazawa, A., Lassner, C.:
Tava: Template-free animatable volumetric actors. In: European Conference on
Computer Vision. pp. 419–436. Springer (2022)
55. Li, X., Fan, Y., Xu, D., He, W., Lv, G., Liu, S.: Sfnet: Clothed human 3d re-
construction via single side-to-front view rgb-d image. In: 2022 8th International
Conference on Virtual Reality (ICVR). pp. 15–20 (2022)
56. Li, Z., Müller, T., Evans, A., Taylor, R.H., Unberath, M., Liu, M.Y., Lin, C.H.:
Neuralangelo: High-fidelity neural surface reconstruction. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2023)
57. Li, Z., Zheng, Z., Liu, Y., Zhou, B., Liu, Y.: Posevocab: Learning joint-structured
pose embeddings for human avatar modeling. In: ACM SIGGRAPH Conference
Proceedings (2023)
58. Li, Z., Zheng, Z., Wang, L., Liu, Y.: Animatable gaussians: Learning pose-
dependent gaussian maps for high-fidelity human avatar modeling. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) (2024)
59. Li, Z., Zheng, Z., Zhang, H., Ji, C., Liu, Y.: Avatarcap: Animatable avatar condi-
tioned monocular human volumetric capture. In: European Conference on Com-
puter Vision. pp. 322–341. Springer (2022)
60. Lin, L., Zhu, J.: Semantic-preserved point-based human avatar. arXiv preprint
arXiv:2311.11614 (2023)
61. Lin, S., Zhang, H., Zheng, Z., Shao, R., Liu, Y.: Learning implicit templates
for point-based clothed human modeling. In: European Conference on Computer
Vision. pp. 210–228. Springer (2022)
62. Lin, W., Zheng, C., Yong, J.H., Xu, F.: Relightable and animatable neural avatars
from videos. AAAI (2024)
63. Liu, S., Lin, S., Lu, J., Saha, S., Supikov, A., Yip, M.: Baa-ngp: Bundle-adjusting
accelerated neural graphics primitives. arXiv preprint arXiv:2306.04166 (2023)
64. Liu, Y., Dai, Q., Xu, W.: A point-cloud-based multiview stereo algorithm for free-
viewpoint video. In: IEEE Transactions on Visualization and Computer Graphics.
vol. 16, pp. 407–418 (2010)
65. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A
Skinned Multi-Person Linear Model. Association for Computing Machinery, New
York, NY, USA, 1 edn. (2023)
66. Ma, Q., Saito, S., Yang, J., Tang, S., Black, M.J.: Scale: Modeling clothed hu-
mans with a surface codec of articulated local elements. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16082–
16093 (2021)
67. Ma, Q., Yang, J., Black, M.J., Tang, S.: Neural point-based shape modeling of
humans in challenging clothing. In: 2022 International Conference on 3D Vision
(3DV). pp. 679–689. IEEE (2022)
68. Ma, Q., Yang, J., Tang, S., Black, M.J.: The power of points for modeling hu-
mans in clothing. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 10974–10984 (2021)
69. Mihajlovic, M., Zhang, Y., Black, M.J., Tang, S.: Leap: Learning articulated oc-
cupancy of people. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 10461–10471 (2021)
70. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In:
ECCV (2020)
71. Moon, G., Nam, H., Shiratori, T., Lee, K.M.: 3d clothed human reconstruction
in the wild. In: European conference on computer vision. pp. 184–200. Springer
(2022)
72. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives
with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15
(Jul 2022)
73. Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and track-
ing of non-rigid scenes in real-time. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (June 2015)
74. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J.,
Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense
surface mapping and tracking. In: 2011 10th IEEE International Symposium on
Mixed and Augmented Reality. pp. 127–136 (2011). https://doi.org/10.1109/
ISMAR.2011.6092378
75. Oechsle, M., Peng, S., Geiger, A.: Unisurf: Unifying neural implicit surfaces and
radiance fields for multi-view reconstruction. In: International Conference on
Computer Vision (ICCV) (2021)
76. Pang, H., Zhu, H., Kortylewski, A., Theobalt, C., Habermann, M.: Ash: Animat-
able gaussian splats for efficient and photoreal human rendering. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 1165–1175 (June 2024)
77. Peng, S., Xu, Z., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Bao, H., Zhou, X.: An-
imatable implicit neural representations for creating realistic avatars from videos.
TPAMI (2024)
78. Pesavento, M., Volino, M., Hilton, A.: Super-resolution 3d human shape from
a single low-resolution image. In: Computer Vision–ECCV 2022: 17th European
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II. pp. 447–
464. Springer (2022)
79. Pons-Moll, G., Pujades, S., Hu, S., Black, M.: Clothcap: Seamless 4d clothing
capture and retargeting. ACM Transactions on Graphics, (Proc. SIGGRAPH)
36(4) (2017)
80. Prokudin, S., Ma, Q., Raafat, M., Valentin, J., Tang, S.: Dynamic point fields.
arXiv preprint arXiv:2304.02626 (2023)
81. Qian, S., Xu, J., Liu, Z., Ma, L., Gao, S.: Unif: United neural implicit functions
for clothed human reconstruction and animation. In: European Conference on
Computer Vision. pp. 121–137. Springer (2022)
82. Qian, Z., Wang, S., Mihajlovic, M., Geiger, A., Tang, S.: 3dgs-avatar: Animatable
avatars via deformable 3d gaussian splatting (2024)
83. Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Ben-
gio, Y., Courville, A.: On the spectral bias of neural networks. In: International
Conference on Machine Learning. pp. 5301–5310. PMLR (2019)
84. Ramasinghe, S., MacDonald, L.E., Lucey, S.: On the frequency-bias of coordinate-
mlps. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A.
(eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 796–809
(2022)
85. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu:
Pixel-aligned implicit function for high-resolution clothed human digitization. In:
Proceedings of the IEEE/CVF international conference on computer vision. pp.
2304–2314 (2019)
86. Saito, S., Simon, T., Saragih, J., Joo, H.: Pifuhd: Multi-level pixel-aligned im-
plicit function for high-resolution 3d human digitization. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 84–93
(2020)
87. Shao, Z., Wang, Z., Li, Z., Wang, D., Lin, X., Zhang, Y., Fan, M., Wang, Z.: Splat-
tingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian
Splatting. In: Computer Vision and Pattern Recognition (CVPR) (2024)
88. Shen, K., Guo, C., Kaufmann, M., Zarate, J.J., Valentin, J., Song, J., Hilliges,
O.: X-avatar: Expressive human avatars. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition. pp. 16911–16921 (2023)
89. Song, C., Wandt, B., Helge, R.: Pose modulated avatars from video. In: Proceed-
ings of the International Conference on Learning Representations (ICLR) (2023)
90. Song, D.Y., Lee, H., Seo, J., Cho, D.: Difu: Depth-guided implicit function for
clothed human reconstruction (2023)
91. Starck, J., Hilton, A.: Surface capture for performance-based animation. vol. 27,
pp. 21–31 (2007). https://doi.org/10.1109/MCG.2007.68
92. Su, S.Y., Bagautdinov, T., Rhodin, H.: Danbo: Disentangled articulated neural
body representations via graph neural networks. In: European Conference on
Computer Vision (2022)
93. Su, S.Y., Yu, F., Zollhöfer, M., Rhodin, H.: A-nerf: Articulated neural radiance
fields for learning human shape, appearance, and pose. In: Advances in Neural
Information Processing Systems (2021)
94. Tsiminaki, V., Franco, J.S., Boyer, E.: High resolution 3d shape texture from
multiple videos. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (June 2014)
95. Tu, L.W.: Differential geometry: Connections, curvature, and characteristic
classes. Springer (2017)
96. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Ma-
tusik, W.: Dynamic shape capture using multi-view photometric stereo. In: ACM
SIGGRAPH Asia. SIGGRAPH Asia ’09, Association for Computing Machin-
ery, New York, NY, USA (2009). https://doi.org/10.1145/1661412.1618520,
https://doi.org/10.1145/1661412.1618520
97. Wand, M., Adams, B., Ovsjanikov, M., Berner, A., Bokeloh, M., Jenke, P., Guibas,
L., Seidel, H.P., Schilling, A.: Efficient reconstruction of nonrigid shape and mo-
tion from real-time 3d scanner data. vol. 28. Association for Computing Machin-
ery, New York, NY, USA (may 2009)
98. Wang, C., Kang, D., Cao, Y.P., Bao, L., Shan, Y., Zhang, S.H.: Neural point-based
volumetric avatar: Surface-guided neural points for efficient and photorealistic
volumetric head avatar. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–12
(2023)
99. Wang, L., Zhao, X., Sun, J., Zhang, Y., Zhang, H., Yu, T., Liu, Y.: Stylea-
vatar: Real-time photo-realistic portrait avatar from a single video. In: ACM
SIGGRAPH 2023 Conference Proceedings (2023)
100. Wang, L., Zhao, X., Yu, T., Wang, S., Liu, Y.: Normalgan: Learning detailed 3d
human from a single rgb-d image. In: ECCV (2020)
101. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: Neus: Learn-
ing neural implicit surfaces by volume rendering for multi-view reconstruction.
NeurIPS (2021)
102. Wang, S., Antić, B., Geiger, A., Tang, S.: Intrinsicavatar: Physically based inverse
rendering of dynamic humans from monocular videos via explicit ray tracing.
arXiv.org 2312.05210 (2023)
103. Wang, S., Schwarz, K., Geiger, A., Tang, S.: Arah: Animatable volume rendering
of articulated human sdfs. In: European Conference on Computer Vision (2022)
104. Weng, C.Y., Curless, B., Srinivasan, P.P., Barron, J.T., Kemelmacher-Shlizerman,
I.: HumanNeRF: Free-viewpoint rendering of moving people from monocular
video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 16210–16220 (June 2022)
105. Wu, C., Varanasi, K., Liu, Y., Seidel, H.P., Theobalt, C.: Shading-based dy-
namic shape refinement from multi-view video under general illumination. In:
2011 International Conference on Computer Vision. pp. 1108–1115 (2011). https:
//doi.org/10.1109/ICCV.2011.6126358
106. Xiang, T., Sun, A., Wu, J., Adeli, E., Fei-Fei, L.: Rendering humans from object-
occluded monocular videos. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision. pp. 3239–3250 (2023)
107. Xiu, Y., Yang, J., Cao, X., Tzionas, D., Black, M.J.: ECON: Explicit Clothed
humans Optimized via Normal integration. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) (June 2023)
108. Xiu, Y., Yang, J., Tzionas, D., Black, M.J.: ICON: Implicit Clothed humans Ob-
tained from Normals. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 13296–13306 (June 2022)
109. Xu, H., Alldieck, T., Sminchisescu, C.: H-nerf: Neural radiance fields for ren-
dering and temporal reconstruction of humans in motion. Advances in Neural
Information Processing Systems 34, 14955–14966 (2021)
110. Xu, T., Fujita, Y., Matsumoto, E.: Surface-aligned neural radiance fields for con-
trollable 3d human synthesis. In: CVPR (2022)
111. Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit
surfaces. In: Thirty-Fifth Conference on Neural Information Processing Systems
(2021)
112. Yu, T., Guo, K., Xu, F., Dong, Y., Su, Z., Zhao, J., Li, J., Dai, Q., Liu, Y.:
Bodyfusion: Real-time capture of human motion and surface geometry using a
single depth camera. In: 2017 IEEE International Conference on Computer Vision
(ICCV). pp. 910–919 (2017)
113. Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., Liu, Y.: Function4d: Real-time human
volumetric capture from very sparse consumer rgbd sensors. In: Proceedings of the