2405.17421v2

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
Jiahui Lei1 Yĳia Weng2 Adam W. Harley2 Leonidas Guibas2 Kostas Daniilidis1,3
1 University of Pennsylvania 2 Stanford University 3 Archimedes, Athena RC
{leijh, kostas}@cis.upenn.edu, {yijiaw, aharley, guibas}@cs.stanford.edu

arXiv:2405.17421v2 [cs.CV] 29 Nov 2024
Figure 1. MoSca reconstructs renderable dynamic scenes from monocular casual videos.
Abstract camera parameters—the most typical data format for such

a system in the wild. Robust 4D scene reconstruction from
We introduce 4D Motion Scaffolds (MoSca), a modern 4D such input is increasingly vital for constructing datasets for
reconstruction system designed to reconstruct and synthe- future AGI models, content creation for spatial computing
size novel views of dynamic scenes from monocular videos and VR/MR/AR, and building embodied agents to perceive
captured casually in the wild. To address such a challenging and learn from real video data. However, this task is known
and ill-posed inverse problem, we leverage prior knowl- to be highly challenging and inherently ill-posed [30, 51, 66]
edge from foundational vision models and lift the video due to the limited availability of multi-view stereo cues in
data to a novel Motion Scaffold (MoSca) representation, casual video footage.
which compactly and smoothly encodes the underlying mo-
tions/deformations. The scene geometry and appearance To tackle this challenging task, our first insight is to
are then disentangled from the deformation field and are leverage the recent advances of pretrained vision models
encoded by globally fusing the Gaussians anchored onto (Sec. 3.2.1), which today are very effective at fundamental
the MoSca and optimized via Gaussian Splatting. Addi- computer vision tasks such as tracking and depth estimation.
tionally, camera focal length and poses can be solved using While this knowledge provides a critical boost to understand-
bundle adjustment without the need of any other pose es- ing the complete dynamic scene, it is inherently insufficient,
timation tools. Experiments demonstrate state-of-the-art as it fails to capture occluded parts of the scene and it is
performance on dynamic rendering benchmarks and its ef- usually noisy, local, and partial. Our second insight is to de-
fectiveness on real videos. Project page and code: https: sign a deformation representation, MoSca, derived from the
//www.cis.upenn.edu/~leijh/projects/mosca above foundational priors, exploiting a physical deformation
prior. Although the real-world geometry and appearance are
complex and include high-frequency details, the underlying
1. Introduction deformation that drives these geometries is usually compact
(low-rank) and smooth. MoSca leverages this property by
This paper presents 4D Motion Scaffolds (MoSca), a fully disentangling the 3D geometry and motion, representing the
automated system for reconstructing and rendering dynamic deformation with sparse graph nodes that can be smoothly
scenes from casual monocular video inputs with unknown interpolated (Sec. 3.1). Another physical prior we exploit
1
is the as-rigid-as-possible (ARAP) deformation, which can struction of non-rigidly deforming scenes from a single
be efficiently applied via the trajectory topology of MoSca. camera is a long-standing problem. [7, 8, 81, 107, 108, 121]
Two important benefits arise from the above two insights: focus on specific object categories or articulated shapes and
firstly, MoSca can be lifted into 3D and optimized from the register observations to template models [8]. [10, 19, 23,
inferred 2D foundational priors (Sec. 3.2.3), and secondly, 24, 31, 53, 71] warp, align, and fuse scans of generic scenes.
the observations from all timesteps can be globally fused and To model non-rigid deformations, state-of-the-art methods
rendered for any query time (Sec. 3.2.4). Gaussian fusion [10, 23, 71, 121] use Embedded Deformation Graphs [89],
happens when we deform all Gaussians observed at different where dense transformations over the space are modeled
times to the query time, forming a complete reconstruction, with a sparse set of basis transformations. In MoSca, we
which can be supervised through Gaussian Splatting [44]. extend classic Embedded Graphs to connect priors from 2D
Furthermore, our system estimates the camera poses and foundation models to dynamic Gaussian splatting.
focal lengths via a bundle adjustment and the photometric 2D Vision Foundation Models. Recent years have wit-
optimization (Sec. 3.2.2), obviating the need for other poes nessed great progress in large-scale pretrained vision foun-
estimators such as COLMAP. dation models [9, 47, 72, 73, 80] that serve various down-
In summary, our main contributions can be summarized as: stream tasks, ranging from image-level tasks such as vi-
(1) An automatic 4D reconstruction system that works in the sual question answering [62, 63, 72] to pixel-level tasks
real world for pose-free monocular videos. (2) A novel Mo- including segmentation [47], dense tracking [32, 40], and
tion Scaffold deformation representation, which we build monocular depth estimation [6, 76, 109]. These models
using knowledge from 2D foundational models, and op- encode strong data priors particularly useful in monocular
timize via physically-inspired deformation regularization. video-based dynamic reconstruction, as they help disam-
(3) An efficient and explicit Gaussian-based dynamic scene biguate partial observations. While most previous meth-
representation, driven by MoSca, which globally fuses ob- ods [18, 29, 51, 56, 58, 64, 86, 99, 118] directly use the 2D
servations across an input video to render this data into any priors for regularization in image space, and often in isola-
new viewpoint and query time of choice. (4) State-of-the-art tion from each other, we propose to lift these 2D priors to
performance on dynamic scene rendering benchmarks. 3D and fuse them in a coordinated way.
2. Related Works 3. Method

Dynamic Novel-View Synthesis. Novel-view synthe-
Overview. Given a casual monocular video of a dynamic
sis of dynamic scenes is challenging. Many existing
scene with 𝑇 frames I = [𝐼1 , 𝐼2 , . . . 𝐼𝑇 ], our fully automatic
works [2, 3, 5, 13, 28, 55, 60, 67, 78, 87, 120] assume
system reconstructs the geometry and appearance of the
available synchronized multi-view video inputs. Another
scene with a set of dynamic Gaussians and recovers the focal
line of works [11, 29, 56, 59, 66, 68, 85, 94, 96, 97, 104,
length and poses of the camera if they are unknown .Our key
105, 110, 112, 113] tackles the more practical setting of
idea is to lift the 2D video input to a novel 4D dynamic scene
monocular inputs, where ambiguities from limited observa-
representation, which we name Motion Scaffolds (MoSca),
tions further complicate the problem. As [30] pointed out,
where all the observations are fused globally and geometri-
most methods struggle with realistic single-view videos. To
cally. Fig. 2 provides an overview of our approach. We first
disambiguate, some works [1, 16, 33, 35, 48, 52, 54, 65, 79,
introduce the deformation representation MoSca in Sec.3.1
82, 84, 90, 101, 102] target specific scenes and exploit do-
and then, detail each step of our reconstruction system in
main knowledge like template models [8, 95]. A few recent
Sec. 3.2.
works [51, 58, 118, 119] fuse information across frames, but
only from a small temporal window. 3.1. Deformation Representation with MoSca
Neural radiance fields [4, 14, 27, 69, 70, 74, 75] and 3D
Gaussian Splatting [44–46, 114] are promising approaches A fundamental challenge in real-world 4D reconstruction
to novel view synthesis. The latter’s explicit point-based is the high dimensionality of the potential solution space
representation fits particularly well into the dynamic set- compared to the extremely limited spatiotemporal observa-
ting [18, 21, 25, 26, 37, 42, 50, 57, 59, 61, 67, 103, 110, 111]. tions. However, real-world motion typically behaves rigidly,
We employ 3D Gaussians for long-term, global aggregation. smoothly, and compactly, meaning that the actual solution is
Compared to concurrent works [64, 83, 86, 99], MoSca has a low-rank and driven by a few key “eigen” motions. With this
more structured deformation representation exploiting pow- insight, we model the underlying deformation of the scene
erful 2D foundation models, and is a full-stack automated using an explicit, compact, and structured graph (V, E),
system that directly outputs 4D reconstruction from an un- named 4D Motion Scaffold (MoSca), which encodes these
posed RGB video. local “eigen” motions and interpolates the dense deforma-
Non-Rigid Structure-from-Motion. Geometric recon- tion field.
2
Figure 2. Overview: (A) Given a monocular casual video, we infer pre-trained 2D vision foundation models (Sec. 3.2.1). (B) The camera
intrinsics and poses are initialized using tracklet-based bundle adjustment (Sec. 3.2.2). (C) Our proposed Motion Scaffold (MoSca) is lifted
from 2D predictions and optimized with physics-inspired regularizations (Sec. 3.2.3). (D) Gaussians are initialized from all timesteps,
deformed with MoSca (Sec. 3.1), and fused globally to model the dynamic scene. The entire representation is rendered with Gaussian
Splatting and optimized with photometric losses (Sec. 3.2.4).
Motion Scaffold Graph Definition. Intuitively, the MoSca where q̂ is the dual quaternion representation of Q and | · | 𝐷𝑄
graph nodes V = {v (𝑚) } 𝑚=1
𝑀 are 6-DoF trajectories that cap- denotes the dual norm [43]. Unlike linear blend skinning
ture the underlying low-rank, smooth motion of the scene. (LBS), DQB is a manifold interpolation that always produces
The number of nodes 𝑀 is significantly smaller (e.g., see an interpolated element in 𝑆𝐸 (3). Consider any query po-
Tab. 7) than the number of points required to represent the sition x in 3D space at time 𝑡src . Denote its nearest node at
∗
scene. Specifically, each node v (𝑚) ∈ V consists of per- 𝑡 src as v (𝑚 ) where 𝑚 ∗ = arg min𝑚 ||t𝑡(𝑚)
𝑠𝑟 𝑐
− x|| and t𝑡(𝑚)
𝑠𝑟 𝑐
is the
timestep rigid transformations Q𝑡(𝑚) and a global control translation part of node 𝑚’s transformation at time 𝑡 𝑠𝑟 𝑐 .
radius 𝑟 (𝑚) , which parameterizes a radial basis function We can efficiently compute its 𝑆𝐸 (3) deformation to the
∗
(RBF) describing its influence on nearby space: query time 𝑡 dst using nodes in the neighborhood of 𝑣 (𝑚 ) .
Formally, the deformation field W from time 𝑡 src to time 𝑡 dst
\mb v^{(m)} = ([\mb Q^{(m)}_1,\mb Q^{(m)}_2, \ldots ,\mb Q^{(m)}_T],r^{(m)}), (1) is:
where Q (𝑚) = [R (𝑚) , t (𝑚) ] ∈ 𝑆𝐸 (3) and 𝑟 (𝑚) ∈ R+ is the \mc W (\mb x, \mb w; t_{\text {src}}, t_{\text {dst}}) = \text {DQB}\left (\{w_i, \Delta \mb Q^{(i)} \}_{i\in \mc {E}(m^*)}\right ), \label {eq: warping} (4)
radius. To properly interpolate the node-encoded trajectories
and regularize the deformation, we organize the nodes v (𝑚) where ΔQ (𝑖) = Q𝑡(𝑖) dst
(Q𝑡(𝑖) src
) −1 and w = {𝑤 𝑖 } are skinning
into a topology. We define the MoSca graph edges E as: weights computed from RBFs parameterized by radius 𝑟 (𝑖) :
w_i(\mb x, t_{\text {src}}) =\exp {(-{\|\mb x-\mb t^{(i)}_{t_{\text {src}}}\|_2^2}/{2r^{(i)}})} \in \mathbb R^+ \label {eq: skinning}. (5)
&\mc {E}(m) = \text {KNN}_{n\in \{1,\ldots , M\}}\left [D_{\text {curve}}(m,n)\right ], \nonumber \\ & D_{\text {curve}}(m,n) = \max _{t=1,2,\ldots , T} \|\mb t^{(m)}_t - \mb t^{(n)}_t\|, \label {eq: def_topo}
In summary, MoSca (V, E) encodes the deformation field
(2) through skinning on a structured, sparse trajectory graph. In
the following sections, we will demonstrate how to recon-
where KNN denotes the K-nearest neighbors under the curve struct MoSca and attach Gaussians onto it to produce the
distance metric 𝐷 curve . This metric captures the global prox- final 4D reconstruction.
imity between trajectories across all timesteps and accounts
for topological changes (e.g., opening a door does not con- 3.2. Reconstruction System
nect the door and wall). 3.2.1 Leveraging Priors from 2D Foundation Models
SE(3) Deformation Field. Given MoSca (V, E), we can
4D reconstruction from monocular videos is highly ill-
derive a dense deformation field by interpolating motions
posed; therefore, it is essential to leverage prior knowl-
from nodes near the query point. We use Dual Quaternion
edge to constrain the solution space. In the first step of
Blending (DQB) [43] to mix multiple 𝑆𝐸 (3) elements on
our system, we exploit the priors provided by large vision
the 𝑆𝐸 (3) manifold. Similar to the unit quaternion represen-
foundation models pre-trained on massive datasets. Specif-
tation of 𝑆𝑂 (3), the unit dual quaternion represents 𝑆𝐸 (3)
ically, we utilize off-the-shelf pre-trained models to obtain:
using eight numbers by including a dual part. Please refer
1) Depth estimations [34, 36, 76] D = [𝐷 1 , 𝐷 2 , . . . , 𝐷 𝑇 ]
to [20, 38, 43] for details. Given 𝐿 rigid transformations
that are relatively consistent across frames; 2) Long-
Q𝑖 ∈ 𝑆𝐸 (3) and their blending weights 𝑤 𝑖 , the interpolated
term 2D pixel trajectories [22, 41, 106] T = {𝜏 (𝑖) =
motion is:
[( 𝑝 1(𝑖) , 𝑣 1(𝑖) ), ( 𝑝 2(𝑖) , 𝑣 2(𝑖) ), . . . , ( 𝑝 𝑇(𝑖) , 𝑣 𝑇(𝑖) )]}𝑖 , where 𝑝 𝑡(𝑖) and
\text {DQB}(\{(w_i, \mb Q_i)\}_{i=1}^L) = \frac {\sum _{i=1}^Lw_i \hat {\mb q}_i}{\|\sum _{i=1}^Lw_i \hat {\mb q}_i\|_{DQ}} \in SE(3) , (3)
𝑣 𝑡(𝑖) represent the 𝑖-th trajectory’s 2D image coordinate
3
and visibility at frame 𝑡; 3) Per-frame epipolar error maps of this paper is the seamless integration of MoSca with pow-
M = [𝐸 1 , 𝐸 2 , . . . , 𝐸𝑇 ] [66] computed from RAFT[91] erful 2D foundational models. Specifically, the long-term
dense optical flow predictions, which indicate the likelihood 2D tracking T , together with the depth estimates D, pro-
of being in the dynamic foreground. These inferred results vide strong cues for constructing V. However, there is still
provide critical cues about geometry and correspondence. a gap due to missing information when tracks are invisible
However, such raw information is partial, local, and noisy, and because the local rotation component of MoSca is also
and does not constitute a complete solution. We are going unknown. We address this gap by incorporating physics-
to fuse and optimize these initial cues to produce a coherent inspired regularization into the optimization of MoSca.
and global 4D reconstruction. 3D Lift and Initialization. Similar to the camera initializa-
tion, we identify foreground 2D tracks by thresholding the
3.2.2 Camera Initializaition
maximum epipolar error 𝑒(𝜏) of each tracklet. We then lift
To enable 4D reconstruction in the wild, our system must the foreground tracklets into 3D using depth estimates D
operate on dynamic scene videos with unknown camera pa- at visible timesteps and linearly interpolate between nearby
rameters. Therefore, in the second step of our reconstruction observations at occluded timesteps. Formally, we compute
pipeline, we propose a tracklet-based bundle adjustment to the lifted 3D position h𝑡 at timestep 𝑡 from the 2D track
robustly initialize the camera focal lengths and poses. Given 𝜏 = [( 𝑝 𝑡 , 𝑣 𝑡 )] 𝑇𝑡=1 as
the inferred 2D tracks T and epipolar error maps M, we
first compute the maximum epipolar error of each tracklet
as 𝑒(𝜏) = max𝑡=1...𝑇 𝐸 𝑡 [ 𝑝 𝑡 ] · 𝑣 𝑡 across visible timesteps.
We identify confident background tracklets by thresholding \mb h_t = \begin {cases} & \mb W_t \pi ^{-1}_{\mb K}(p_t, D_t[p_t]), \quad \text {if} \quad v_t=1, \\ & \text {LinearInterp}(\mb h_{\text {left}}, \mb h_{\text {right}}), \quad \text {if} \quad v_t=0, \end {cases} \label {eq: lift} (8)
𝑒(𝜏) with a predefined small threshold. Starting with a pre-
defined initial camera focal length, we optimize the camera
poses and intrinsics jointly by minimizing the reprojection where 𝜋K −1 refers to back-projection with camera intrinsics
errors on these confident static tracks: K, W𝑡 refers to the camera pose, and hleft , hright refer to the
& \mc L_{proj} = \sum _{i\in |\mc T_{\text {static}}|} \sum _{a,b \in [1,T]} (v^{(i)}_{a}v^{(i)}_{b}) \\& \cdot \left \| \pi _{\mb K}\left (\mb W^{-1}_{b} \mb W_{a} \pi _{\mb K}^{-1}(p^{(i)}_{a}, D_{a}[p^{(i)}_{a}])\right ) - p^{(i)}_b \right \|, \nonumber \label {eq: loss_ba_proj} lifted 3D positions from the nearest visible timesteps before
and after 𝑡. From each track, we initialize a MoSca node
v (𝑖) using the lifted positions h𝑡 as the translation part and
the identity as the rotation, i.e., Q𝑡(𝑖) = [I, h𝑡(𝑖) ], along with
a predefined control radius 𝑟 init . In practice, we retain only
where 𝑝 𝑎 and 𝑝 𝑏 are pixel locations, 𝜋K denotes projection
a subset of the densely inferred 2D tracklets by uniformly
with intrinsics K, and W𝑡 is the camera pose at time 𝑡. To
resampling nodes based on the curve distance (Eq. 2).
account for errors in the depth estimation—particularly scale
misalignment—we jointly optimize a correction to the depth Geometry Optimization. Starting from the initialized rota-
𝐷 𝑎 [ 𝑝 𝑎 ], which consists of per-frame global scaling factors tions and the invisible lines, we propagate the visible infor-
and small per-pixel corrections, using a depth alignment mation to the unknowns through the MoSca topology E by
loss: optimizing a physics-inspired as-rigid-as-possible (ARAP)
& \mc L_{z} = \sum _{i\in |\mc T_{\text {static}}|} \sum _{a,b \in [1,T]} (v^{(i)}_{a}v^{(i)}_{b}) \\ &D_\text {scale-inv} \left ( \left [\mb W^{-1}_{b} \mb W_{a} \pi _{\mb K}^{-1}(p^{(i)}_{a}, D_{a}[p^{(i)}_{a}])\right ]_z, D_{b}[p^{(i)}_b] \right ), \nonumber \label {eq: loss_ba_dep} \vspace {-1em} loss. Given two timesteps separated by a time interval Δ, we
define the ARAP loss Larap as:
where [·] 𝑧 takes the 𝑧 coordinate, and 𝐷 scale-inv (𝑥, 𝑦) = \mc L_{\text {arap}}&= \sum _{t=1}^T \sum _{m=1}^M \sum _{n\in \hat {\mc E}(m)} \lambda _{\text {l}}\left | \| \mb t_t^{(m)} - \mb t_t^{(n)}\| - \| \mb t_{t+\Delta }^{(m)} - \mb t_{t+\Delta }^{(n)}\| \right | \nonumber \\ &+ \lambda _{\text {c}} \left \| \mb Q^{-1 \, (n)}_t \mb t^{(m)}_t - \mb Q^{-1 \, (n)}_{t+\Delta } \mb t^{(m)}_{t+\Delta } \right \|, \label {eq: loss_arap}
|𝑥/𝑦 − 1| + |𝑦/𝑥 − 1|. The overall bundle adjustment loss is
LBA = 𝜆proj L 𝑝𝑟 𝑜 𝑗 + 𝜆z L 𝑧 , and the solved camera poses W𝑡 (9)
will be refined during later rendering phases. While camera
solving is not our primary contribution, our system achieves
state-of-the-art camera pose accuracy on dynamic videos where Ê refers to a multi-level sub-sampled topology pyra-
(Sec. 4.2); more details are provided in the Supplemental mid from E in MoSca (detailed in the Supplemental Ma-
Material. terial). The first term encourages the preservation of local
3.2.3 Geometric Optimization of MoSca distances in the neighborhood, and the second term pre-
serves the local coordinates by involving the local frame Q
After inferring the 2D foundational models and initializing in the optimization. We also enforce the temporal smooth-
the camera, we are ready to geometrically construct MoSca ness of the deformation by regularizing the velocity and
(V, E) in the third step of our system. A key contribution acceleration:
where the first five attributes are the center, rotation,
non-isotropic scales, opacity, and spherical harmonics of
3DGS [44], and the latter two are tailored for MoSca. Specif-
ically, 𝑡 ref
𝑗 is the reference timestep—that is, the timestep at
which the Gaussian is initialized from the back-projected
depth; and Δw 𝑗 ∈ R𝐾 is the per-Gaussian learnable skin-
ning weight correction. To obtain the complete geometry
at a query timestep 𝑡, Gaussians from all timesteps are de-
formed to the query time 𝑡 and fused:
\mc G(t) &= \{(\mb T_j(t) \mu _j, \mb T_j(t) R_j, s_j, o_j, c_j)\, |\, \nonumber \\ &\mb T_j(t) = \mc W (\mu _j, \mb w (\mu _j, t^{\text {ref}}_j) + \Delta \mb w_j; t^{\text {ref}}_j, t) \}_{j=1}^N \label {eq: final_repr}
(12)
where W is the deformation field defined in Eq.4, and w

is the base RBF skinning weight defined in Eq.5. The
static background is also represented as a standard 3DGS
H = (𝜇 𝑗 , 𝑅 𝑗 , 𝑠 𝑗 , 𝑜 𝑗 , 𝑐 𝑗 ) 𝐻
𝑗=1 , which can be initialized by
back-projecting the depth map using known camera param-
eters. Therefore, the final renderable dynamic scene at time
𝑡 can be approximated by the union G(𝑡) ∪ H .
Photometric Optimization. The Gaussians described
above can be rendered using a Gaussian Splatting-based
differentiable renderer and optimized with depth and RGB
rendering losses, along with the regularization losses from
Sec. 3.2.3. To fully exploit the inferred tracklets, we also
\mc G = \{(\mu _j, R_j, s_j, o_j, c_j; t^{\text {ref}}_j, \Delta \mb w_j) \}_{j=1}^N, \label {eq: dyn_gs} (11) render a flow/track map by rasterizing the XYZ coordinates
5
Figure 4. Visual comparison on DyCheck [30] under the settings with or without camera pose.
(replacing the RGB color with XYZ values) of each Gaussian Table 1. Comparison on DyCheck [30], group w-pose and w/o-
at different timesteps. We supervise the flow/track map with pose means with or without camera pose and are averaged over all
the inferred 2D tracklets as a regularization loss Ltrack [99]. 7 scenes on the standard 2x resolution. Group SOM-5-1x means
The final photometric step has a total objective: using the 5 scenes and 1x res. as in Shape-of-Motion [99].
\mc L &= \lambda _\text {rgb}\mc L_\text {rgb} + \lambda _\text {dep}\mc L_\text {dep} + \lambda _\text {track}\mc L_\text {track} \nonumber \\ &+ \lambda _\text {arap}\mc L_\text {arap} + \lambda _\text {acc}\mc L_\text {acc}+ \lambda _\text {vel}\mc L_\text {vel}. Method mPSNR↑ mSSIM↑ mLPIPS↓
T-NeRF [30] 16.96 0.577 0.379
(13) NSFF [56] 15.46 0.551 0.396
Nerfies [74] 16.45 0.570 0.339
HyperNeRF [75] 16.81 0.569 0.332
Node Control. Similar to standard 3DGS Gaussian control PGDVS [118] 15.88 0.548 0.340
techniques including gradient-based densification and reset- DyPoint [119] 16.89 0.573 -
pruning simplification, we propose a novel control policy DpDy [98] - 0.559 0.516
over the proposed MoSca nodes. To periodically densify Dyn.Gauss. [67] 7.29 - 0.692
w-pose
4D GS [103] 13.64 - 0.428
nodes, we select Gaussians with high tracking-loss Ltrack Gauss.Marbles [86] 16.72 - 0.413
induced gradients, subsample them, and convert them into DyBluRF [11] 17.37 0.591 0.373
new MoSca nodes. To clean the representation and prune CTNeRF [68] 17.69 0.531 -
D-NPC [39] 16.41 0.582 0.319
the structure, we also periodically copy the dynamic fore- Shape-of-Motion [99] 17.32 0.598 0.296
ground Gaussians from a randomly selected timestep into Ours 19.32 0.706 0.264
the static background and reset the foreground Gaussians RobustDynrf [66] 17.10 0.534 0.517
Dyn.Gaussians [67] 7.60 - 0.704
to a low opacity. This simplifies unnecessary foreground 4D GS [103] 13.11 - 0.726
Gaussians. We then prune nodes whose skinning weights w/o-pose
Gaussian Marbles [86] 15.79 - 0.430
toward all Gaussians fall below a threshold, indicating a Ours 18.84 0.676 0.289
limited contribution to deformation modeling. Ours (w. focal) 19.02 0.683 0.279
Shape-of-Motion [99] 16.72 0.63 0.45
SOM-5-1x
Ours 18.40 0.67 0.42
4. Experiments
ternet videos, SORA-generated videos, and DAVIS[77]
4.1. Novel View Synthesis videos—demonstrating the effectiveness of MoSca.
In-the-wild. One of the most significant results of MoSca DyCheck. To quantitatively evaluate our rendering results,
is demonstrating that such an automatic dynamic render- we compare our method to others on the currently most
ing system can work effectively in real-world scenarios. challenging dataset – the iPhone DyCheck [30]. DyCheck
In Fig. 3, we showcase reconstruction results on diverse features generic, diverse dynamic scenes captured with a
in-the-wild monocular videos—including movie clips, in- handheld iPhone using realistic camera motions for train-
6
Figure 5. Visual comparison on NVIDIA dataset [112].
ing, and utilizes two static cameras at significantly different as shown in the bottom group of Tab. 1.
poses from the training views for testing. For a fair compar- NVIDIA. We also evaluate MoSca on the widely used
ison with previous methods that exploit noisy LiDAR depth NVIDIA video dataset [112], following the protocol in Ro-
from the dataset, we use the iPhone’s noisy LiDAR depth DynRF [66]. As reported in Tab. 2 and Fig. 5, we achieve
as the metric depth D and employ BootsTAPIR [22] for high PSNR and very competitive LPIPS results. Since the
tracking. Since the camera parameters are optimized during facing-forward, the small-baseline setting is relatively eas-
training, during inference, we fix the scene representation ier compared to the realistic DyCheck dataset, where most
and adjust the test camera poses to find the correct view- areas of the dynamic scene are visible in neighboring time
points. The quantitative results are reported in Tab. 1, and frames, reducing the need for strong regularization and fu-
qualitative results are shown in Fig. 1. Due to the large devi- sion of information in occluded areas – the advantages of
ation of the testing views from the training camera trajectory, MoSca are not fully showcased on NVIDIA videos.
most per-frame depth warping methods fail directly (e.g., see
Fig.10 of Casual-FVS [51]). Similarly, local fusion methods 4.2. Camera and Correspondence
exhibit large missing areas (e.g., PGDVS [118], Gaussian
Camera Pose. Another advantage of MoSca is its nat-
Marbles [86]), even though these missing areas are visible
ural integration of camera solving, both geometrically
in other time steps. Some recent Gaussian-based methods
through tracklet-based bundle adjustment and photomet-
like 4D-GS [103] also fail because they depend on strong
rically through rendering-based refinement. We quantita-
multi-view stereo cues to reconstruct the scene. As shown
tively evaluate the camera pose estimation, a byproduct of
in Tab. 1, we outperform all other methods by a large mar-
our system, following MonST3R [115] on the SLAM dataset
gin. We attribute this improvement to two factors: firstly,
TUM-dynamics [88] and the synthetic Sintel dataset [12].
by leveraging powerful pre-trained 2D long-term trackers,
The camera pose errors are shown in Table 3. Although
our MoSca representation models long-term motion trajecto-
camera pose estimation is not the main focus of MoSca, it
ries, enabling the global aggregation of observations across
still achieves comparable or even superior performance com-
all timesteps, which leads to a more complete reconstruc-
tion. Secondly, the structured sparse motion graph design of
MoSca facilitates optimization. Compared to dense Gaus-
sian geometries, its compact and smoothly interpolated mo-
tion nodes significantly reduce the optimization space. Its
topology enables the effective propagation of information
to unobserved regions through ARAP regularization. Note
that our system still performs well under the pose-free setup,
Table 2. Comparison on NVIDIA [112], averaged over all scenes.

“w/o” means without camera pose.
Method PSNR LPIPS Method PSNR LPIPS
D-NeRF [78] 21.49 0.232 CTNeRF [68] 26.13 0.082
NR-NeRF [96] 19.69 0.323 DynPoint [119] 26.53 0.068
TiNeuVox [27] 19.74 0.285 D-NPC [39] 25.64 0.109
HyperNeRF [75] 17.60 0.367 RoDynRF [66] 25.89 0.067
NSFF [56] 24.33 0.199 RoDynRF [66] w/o 25.38 0.079
DynNeRF [29] 26.10 0.082 GaussianMarbles [86] 22.32 0.129
MonoNeRF [94] 25.62 0.106 Ours 26.72 0.070
4DGS [103] 21.45 0.199 Ours w/o 26.54 0.073
Casual-FVS [51] 24.57 0.081
Figure 6. Application of MoSca reconstructed 4D scenes.
7
Table 3. Camera pose accuracy (∗ requires ground truth camera Table 5. Ablation study on different components of the system.
intrinsics as input) Components mPSNR mSSIM mLPIPs
Sintel [12] TUM-dynamics [88] Full model 19.32 0.706 0.264
Method ATE ↓ RPE trans ↓ RPE rot ↓ ATE ↓ RPE trans ↓ RPE rot ↓
No node control 19.28 0.707 0.267
DROID-SLAM∗ [92] 0.175 0.084 1.912 - - -
DPVO∗ [93] 0.115 0.072 1.975 - - - No learnable skinning correction 19.27 0.707 0.267
ParticleSfM [117] 0.129 0.031 0.535 - - - No dual quaternion blending 19.18 0.701 0.276
LEAP-VO∗ [15] 0.089 0.066 1.250 0.068 0.008 1.686 No multi-level topology 19.14 0.701 0.270
Robust-CVD [49] 0.360 0.154 3.443 0.153 0.026 3.528
CasualSAM [116] 0.141 0.035 0.615 0.071 0.010 1.712 No geometric optimizaiton stage 18.85 0.693 0.287
DUSt3R [100] w/ mask 0.417 0.250 5.796 0.083 0.017 3.567 No photometric optimization stage 13.71 0.480 0.763
MonST3R [115] 0.108 0.042 0.732 0.063 0.009 1.217 Only fuse 4 neighboring frames 16.96 0.663 0.344
Ours 0.090 0.034 0.312 0.031 0.011 0.426
Only fuse 8 neighboring frames 17.26 0.664 0.346
Table 6. Ablation study on different priors on DyCheck [30].

Tracker BootsTAPIR [22] CoTracker-v3 [41] SpaTracker [106]
Depth mPSNR mLPIPs mPSNR mLPIPs mPSNR mLPIPs
LIDAR 19.32 0.264 19.55 0.243 19.32 0.259
Metric3D-v2 [34] 17.05 0.331 17.02 0.320 17.60 0.301
UniDepth [76] 17.12 0.323 17.42 0.299 17.61 0.300
Figure 7. Visual comparison of ablation.
Table 7. More specs of MoSca on DyCheck [30] (averaged)
pared to camera-pose-tailored SLAM-based and DuST3R-
FPS (2x res) Num of fg GS Num of nodes Ratio: #GS/#nodes
based methods. Notably, some of the SLAM systems in the 37.823 106596 3177 46.105
table require known camera intrinsics, whereas MoSca does
not.
Correspondence. One feature of MoSca is its ability to the scene.
perform global fusion and provide dense correspondence. 4.4. Applications
We quantitatively evaluate the correspondence tracking ac-
curacy following DyCheck [30] and Gaussian Marbles [86]. In-the-wild 4D reconstruction enables many interesting ap-
Tab. 4 shows our state-of-the-art accuracy. Notably, MoSca plications, as shown in Fig. 6. For example, we can remove
is optimized starting from BootsTAPIR [22] on DyCheck, the moving foreground (Figure 6-A), or remove occluders in
and we observe a significant improvement over the raw an extremely challenging cup-game video to look through
tracker after reconstruction optimization. and see where the ball goes (Figure 6-B). Video object seg-
mentation from DEVA [17] can be lifted and baked into 4D
4.3. Ablation Study to produce novel view semantic videos (Figure 6-C). Finally,
the 4D video can be edited in flexible ways, as shown in Fig-
We assess the effects of different components in our system
ure 6-D. We believe that MoSca will provide the community
in Tab. 5 and Fig. 7. We observe that both the geometric
with many more possibilities for future applications.
optimization and photometric optimization phases are crit-
ical. DQB contributes to smooth results, the multi-level 5. Limitations and Conclusion
topology pyramid enhances global rigidity and shape, and
Limitations. While MoSca achieves state-of-the-art perfor-
node control along with learnable skinning further improves
mance on standard benchmarks and can operate on some in-
the expressiveness of our system. Additionally, our system
the-wild videos, several limitations remain. (1) Our method
benefits from the global fusion of observations from ev-
relies on accurate 2D long-term tracks and depth estima-
ery frame. We also verify the effectiveness of the tracking
tion, indicating that improvements in these areas are crucial
loss Ltrack. When Ltrack is not used, the PCK-T accu-
for enhancing our performance. (2) Our current framework
racy decreases from 0.824 to 0.737. In Tab. 6, we study
only reconstructs areas that are visible at some point in the
how different foundation models affect performance. Note
video; it would be advantageous to incorporate large-scale
that Metric3D-v2 [34] and UniDepth [76] are entirely RGB-
2D/video diffusion priors to hallucinate areas that are never
based and do not use LiDAR sensor information, leading
visible. (3) Another important issue for future work is the
to a reasonable decrease in performance. We report more
correct modeling of lighting effects such as shadows, reflec-
specifications of our system in Tab. 7, where we observe near
tions, liquids, and changes in exposure. These effects cannot
real-time inference FPS and the compactness of the MoSca
be explained by deformation alone and may cause artifacts
nodes compared to the actual foreground GS used to model
in the background.
Table 4. Correspondence on DyCheck [30] with PCK-T @0.05%
In summary, this paper takes a step toward reconstruction
Methods Nerfies[74] HyperNeRf[75] Dyn. Gauss. [67] 4D Gauss. [103]
PCK-T ↑ 0.4 0.453 0.079 0.073
and rendering from monocular in-the-wild casual videos We
Methods CoTracker[40] Gauss.Marbles[86] BootsTAPIR [22] Ours hope this small step could inspire future exploration toward
PCK-T ↑ 0.803 0.806 0.779 0.824 understanding our dynamic physical world.
8
Acknowledgements. The authors appreciate the support [10] Aljaz Bozic, Pablo Palafox, Michael Zollöfer, Angela Dai,
of the gift from AWS AI to Penn Engineering’s ASSET Justus Thies, and Matthias Nießner. Neural non-rigid track-
Center for Trustworthy AI; and the support of the following ing. In Advances in Neural Information Processing Systems,
grants: NSF IIS-RI 2212433, NSF FRR 2220868 awarded pages 18765–18775, 2020. 2
to UPenn, ARL grant W911NF-21-2-0104 and a Vannevar [11] Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, and
Munchurl Kim. Dyblurf: Dynamic deblurring neural ra-
Bush Faculty Fellowship awarded to Stanford University.
diance fields for blurry monocular video. arXiv preprint
The authors thank Minh-Quan Viet Bui and the authors arXiv:2312.13528, 2023. 2, 6
of DyBluRF, Xiaoming Zhao and the authors of PGDVS for [12] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and
providing their per-scene evaluation metrics on DyCheck Michael J Black. A naturalistic open source movie for opti-
dataset. cal flow evaluation. In Computer Vision–ECCV 2012: 12th
European Conference on Computer Vision, Florence, Italy,
References October 7-13, 2012, Proceedings, Part VI 12, pages 611–
625. Springer, 2012. 7, 8
[1] ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli [13] Ang Cao and Justin Johnson. Hexplane: A fast representa-
Shechtman, and Zhixin Shu. Rignerf: Fully controllable tion for dynamic scenes. arXiv preprint arXiv:2301.09632,
neural 3d portraits. In Proceedings of the IEEE/CVF Con- 2023. 2
ference on Computer Vision and Pattern Recognition, pages [14] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and
20364–20373, 2022. 2 Hao Su. Tensorf: Tensorial radiance fields. In Computer
[2] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Vision–ECCV 2022: 17th European Conference, Tel Aviv,
Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Israel, October 23–27, 2022, Proceedings, Part XXXII,
Kim. Hyperreel: High-fidelity 6-dof video with ray- pages 333–350. Springer, 2022. 2
conditioned sampling. In Proceedings of the IEEE/CVF [15] Weirong Chen, Le Chen, Rui Wang, and Marc Pollefeys.
Conference on Computer Vision and Pattern Recognition Leap-vo: Long-term effective any point tracking for visual
(CVPR), 2023. 2 odometry. In IEEE/CVF Conference on Computer Vision
[3] Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and and Pattern Recognition (CVPR), 2024. 8
Srinivasa Narasimhan. 4d visualization of dynamic events [16] Yufan Chen, Lizhen Wang, Qĳing Li, Hongjiang Xiao,
from unconstrained multi-view videos. In Proceedings of Shengping Zhang, Hongxun Yao, and Yebin Liu. Mono-
the IEEE/CVF Conference on Computer Vision and Pattern gaussianavatar: Monocular gaussian point-based head
Recognition (CVPR), 2020. 2 avatar. arXiv preprint arXiv:2312.04558, 2023. 2
[4] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter [17] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander
Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Schwing, and Joon-Young Lee. Tracking anything with
Mip-nerf: A multiscale representation for anti-aliasing neu- decoupled video segmentation. In ICCV, 2023. 8
ral radiance fields. In Proceedings of the IEEE/CVF Interna- [18] Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dream-
tional Conference on Computer Vision, pages 5855–5864, scene4d: Dynamic multi-object scene generation from
2021. 2 monocular videos. arXiv preprint arXiv:2405.02280, 2024.
2
[5] Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel,
[19] Brian Curless and Marc Levoy. A volumetric method for
and Tobias Ritschel. X-fields: Implicit neural view-, light-
building complex models from range images. In Proceed-
and time-image interpolation. SIGGRAPH Asia, 2020. 2
ings of the 23rd annual conference on Computer graphics
[6] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter and interactive techniques, pages 303–312, 1996. 2
Wonka, and Matthias Müller. Zoedepth: Zero-shot trans-
[20] Konstantinos Daniilidis. Hand-eye calibration using dual
fer by combining relative and metric depth. arXiv preprint
quaternions. The International Journal of Robotics Re-
arXiv:2302.12288, 2023. 2
search, 18(3):286–298, 1999. 3
[7] Volker Blanz and Thomas Vetter. A morphable model for [21] Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy
the synthesis of 3d faces. In Proceedings of the 26th Annual Ilg, and Jan Eric Lenssen. Neural parametric gaussians for
Conference on Computer Graphics and Interactive Tech- monocular non-rigid object reconstruction. arXiv preprint
niques, pages 187–194, 1999. 2 arXiv:2312.01196, 2023. 2
[8] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter [22] Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda
Gehler, Javier Romero, and Michael J Black. Keep it smpl: Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco,
Automatic estimation of 3d human pose and shape from a Ross Goroshin, Joao Carreira, and Andrew Zisserman.
single image. In European Conference on Computer Vision, Bootstap: Bootstrapped training for tracking-any-point.
pages 561–578. Springer, 2016. 2 Asian Conference on Computer Vision, 2024. 3, 7, 8
[9] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Alt- [23] Mingsong Dou, Jonathan Taylor, Henry Fuchs, Andrew
man, Simran Arora, Sydney von Arx, Michael S Bernstein, Fitzgibbon, and Shahram Izadi. 3d scanning deformable
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. objects with a single rgbd sensor. In Proceedings of the
On the opportunities and risks of foundation models. arXiv IEEE Conference on Computer Vision and Pattern Recog-
preprint arXiv:2108.07258, 2021. 2 nition, pages 493–501, 2015. 2
9
[24] Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B Tenen- gaussian splatting for editable dynamic scenes. arXiv
baum, and Jiajun Wu. Neural radiance flow for 4d view preprint arXiv:2312.14937, 2023. 2
synthesis and video processing. In 2021 IEEE/CVF In- [38] Yan-Bin Jia. Dual quaternions. 3
ternational Conference on Computer Vision (ICCV), pages [39] Moritz Kappel, Florian Hahlbohm, Timon Scholz, Susana
14304–14314. IEEE Computer Society, 2021. 2 Castillo, Christian Theobalt, Martin Eisemann, Vladislav
[25] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wen- Golyanik, and Marcus Magnor. D-npc: Dynamic neural
zheng Chen, and Baoquan Chen. 4d gaussian splatting: point clouds for non-rigid view synthesis from monocular
Towards efficient novel view synthesis for dynamic scenes. video. arXiv preprint arXiv:2406.10078, 2024. 6, 7
arXiv preprint arXiv:2402.03307, 2024. 2 [40] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia
[26] Bardienus P Duisterhof, Zhao Mandi, Yunchao Yao, Jia- Neverova, Andrea Vedaldi, and Christian Rupprecht. Co-
Wei Liu, Mike Zheng Shou, Shuran Song, and Jeffrey Ich- tracker: It is better to track together. arXiv preprint
nowski. Md-splatting: Learning metric deformation from arXiv:2307.07635, 2023. 2, 8
4d gaussians in highly deformable scenes. arXiv preprint
[41] Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia
arXiv:2312.00583, 2023. 2
Neverova, Andrea Vedaldi, and Christian Rupprecht. Co-
[27] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xi- tracker3: Simpler and better point tracking by pseudo-
aopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. labelling real videos. 2024. 3, 8
Fast dynamic radiance fields with time-aware neural voxels.
[42] Kai Katsumata, Duc Minh Vo, and Hideki Nakayama. An
In SIGGRAPH Asia 2022 Conference Papers, pages 1–9,
efficient 3d gaussian representation for monocular/multi-
2022. 2, 7
view dynamic scenes. arXiv preprint arXiv:2311.12897,
[28] Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, 2023. 2
Benjamin Recht, and Angjoo Kanazawa. K-planes: Ex-
[43] Ladislav Kavan, Steven Collins, Jiří Žára, and Carol
plicit radiance fields in space, time, and appearance. arXiv
O’Sullivan. Skinning with dual quaternions. In Proceed-
preprint arXiv:2301.10241, 2023. 2
ings of the 2007 symposium on Interactive 3D graphics and
[29] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang.
games, pages 39–46, 2007. 3
Dynamic view synthesis from dynamic monocular video. In
[44] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler,
Proceedings of the IEEE/CVF International Conference on
and George Drettakis. 3d gaussian splatting for real-time
Computer Vision, pages 5712–5721, 2021. 2, 7
radiance field rendering. 2023. 2, 5
[30] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell,
and Angjoo Kanazawa. Monocular dynamic view synthesis: [45] Leonid Keselman and Martial Hebert. Approximate dif-
A reality check. Advances in Neural Information Processing ferentiable rendering with algebraic surfaces. In European
Systems, 35:33768–33780, 2022. 1, 2, 6, 8 Conference on Computer Vision, pages 596–614. Springer,
2022.
[31] Wei Gao and Russ Tedrake. Surfelwarp: Efficient non-
volumetric single view dynamic reconstruction. In Robotics: [46] Leonid Keselman and Martial Hebert. Flexible tech-
Science and Systems (RSS), 2018. 2 niques for differentiable rendering with 3d gaussians. arXiv
preprint arXiv:2308.14737, 2023. 2
[32] Adam W. Harley, Zhaoyuan Fang, and Katerina Fragkiadaki.
Particle video revisited: Tracking through occlusions using [47] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao,
point trajectories. 2022. 2 Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White-
[33] Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao head, Alexander C Berg, Wan-Yen Lo, et al. Segment any-
Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. thing. In Proceedings of the IEEE/CVF International Con-
Gaussianavatar: Towards realistic human avatar modeling ference on Computer Vision, pages 4015–4026, 2023. 2
from a single video via animatable 3d gaussians. arXiv [48] Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel,
preprint arXiv:2312.02134, 2023. 2 Oncel Tuzel, and Anurag Ranjan. Hugs: Human gaussian
[34] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, splats. arXiv preprint arXiv:2311.17910, 2023. 2
Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and [49] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust
Shaojie Shen. Metric3d v2: A versatile monocular geomet- consistent video depth estimation. In IEEE/CVF Conference
ric foundation model for zero-shot metric depth and surface on Computer Vision and Pattern Recognition, 2021. 8
normal estimation. 2024. 3, 8 [50] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis.
[35] Shoukang Hu and Ziwei Liu. Gauhuman: Articulated gaus- Dynmf: Neural motion factorization for real-time dynamic
sian splatting from monocular human videos. arXiv preprint view synthesis with 3d gaussian splatting. arXiv preprint
arXiv:2312.02973, 2023. 2 arXiv:2312.00112, 2023. 2
[36] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sĳie Zhao, Xi- [51] Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen,
aodong Cun, Yong Zhang, Long Quan, and Ying Shan. Simon Niklaus, Jianming Zhang, Jia-Bin Huang, and Feng
Depthcrafter: Generating consistent long depth sequences Liu. Fast view synthesis of casual videos. arXiv preprint
for open-world videos. arXiv preprint arXiv:2409.02095, arXiv:2312.02135, 2023. 1, 2, 7
2024. 3 [52] Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu,
[37] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, and Kostas Daniilidis. Gart: Gaussian articulated template
Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled models. arXiv preprint arXiv:2311.16099, 2023. 2
10
[53] Hao Li, Robert W Sumner, and Mark Pauly. Global corre- by persistent dynamic view synthesis. arXiv preprint
spondence optimization for non-rigid registration of depth arXiv:2308.09713, 2023. 2, 6, 8
scans. Computer Graphics Forum, 27(5):1421–1430, 2008. [68] Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan
2 Wan, Yang Long, and Yefeng Zheng. Ctnerf: Cross-time
[54] Mingwei Li, Jiachen Tao, Zongxin Yang, and Yi Yang. transformer for dynamic neural radiance field from monoc-
Human101: Training 100+ fps human gaussians in 100s ular video. arXiv preprint arXiv:2401.04861, 2024. 2, 6,
from 1 view. arXiv preprint arXiv:2312.15258, 2023. 2 7
[55] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon [69] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
Steven Lovegrove, Michael Goesele, Richard Newcombe, Representing scenes as neural radiance fields for view syn-
et al. Neural 3d video synthesis from multi-view video. thesis. Communications of the ACM, 65(1):99–106, 2021.
In Proceedings of the IEEE/CVF Conference on Computer 2
Vision and Pattern Recognition (CVPR), 2022. 2 [70] Thomas Müller, Alex Evans, Christoph Schied, and
[56] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Alexander Keller. Instant neural graphics primitives
Wang. Neural scene flow fields for space-time view syn- with a multiresolution hash encoding. arXiv preprint
thesis of dynamic scenes. In Proceedings of the IEEE/CVF arXiv:2201.05989, 2022. 2
Conference on Computer Vision and Pattern Recognition, [71] Richard A Newcombe, Dieter Fox, and Steven M Seitz.
pages 6498–6508, 2021. 2, 6, 7 Dynamicfusion: Reconstruction and tracking of non-rigid
[57] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime scenes in real-time. In Proceedings of the IEEE conference
gaussian feature splatting for real-time dynamic view syn- on computer vision and pattern recognition, pages 343–352,
thesis. arXiv preprint arXiv:2312.16812, 2023. 2 2015. 2
[58] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, [72] OpenAI. Gpt-4 technical report, 2023.
and Noah Snavely. Dynibar: Neural dynamic image-based https://openai.com/research/gpt-4. 2
rendering, 2023. 2 [73] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy
[59] Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen- Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez,
Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al.
Gaufre: Gaussian deformation fields for real-time dynamic Dinov2: Learning robust visual features without supervi-
novel view synthesis. arXiv preprint arXiv:2312.11458, sion. arXiv preprint arXiv:2304.07193, 2023. 2
2023. 2 [74] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien
[60] Haotong Lin, Sida Peng, Zhen Xu, Tao Xie, Xingyi He, Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo
Hujun Bao, and Xiaowei Zhou. High-fidelity and real-time Martin-Brualla. Nerfies: Deformable neural radiance fields.
novel view synthesis for dynamic scenes. In SIGGRAPH In Proceedings of the IEEE/CVF International Conference
Asia Conference Proceedings, 2023. 2 on Computer Vision, pages 5865–5874, 2021. 2, 6, 8
[61] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. [75] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T
Gaussian-flow: 4d reconstruction with dynamic 3d gaus- Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-
sian particle. arXiv preprint arXiv:2312.03431, 2023. 2 Brualla, and Steven M Seitz. Hypernerf: A higher-
[62] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. dimensional representation for topologically varying neural
Improved baselines with visual instruction tuning. arXiv radiance fields. arXiv preprint arXiv:2106.13228, 2021. 2,
preprint arXiv:2310.03744, 2023. 2 6, 7, 8
[63] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. [76] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mat-
Visual instruction tuning. Advances in neural information tia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu.
processing systems, 36, 2024. 2 Unidepth: Universal monocular metric depth estimation.
[64] Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lv, arXiv preprint arXiv:2403.18913, 2024. 2, 3, 8
Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- [77] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar-
namic Gaussian Splatting from Causally-Captured Monoc- beláez, Alex Sorkine-Hornung, and Luc Van Gool. The
ular Videos. arXiv preprint arXiv:2406.00434, 2024. 2 2017 davis challenge on video object segmentation. arXiv
[65] Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, preprint arXiv:1704.00675, 2017. 6
Chen Zhao, Haocheng Feng, Errui Ding, and Jingdong [78] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and
Wang. Gva: Reconstructing vivid 3d gaussian avatars from Francesc Moreno-Noguer. D-nerf: Neural radiance fields
monocular videos, 2024. 2 for dynamic scenes. In Proceedings of the IEEE/CVF Con-
[66] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu ference on Computer Vision and Pattern Recognition, pages
Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Jo- 10318–10327, 2021. 2, 7
hannes Kopf, and Jia-Bin Huang. Robust dynamic radiance [79] Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas
fields. In Proceedings of the IEEE/CVF Conference on Com- Geiger, and Siyu Tang. 3dgs-avatar: Animatable avatars
puter Vision and Pattern Recognition, pages 13–23, 2023. via deformable 3d gaussian splatting. arXiv preprint
1, 2, 4, 6, 7 arXiv:2312.09228, 2023. 2
[67] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and [80] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Deva Ramanan. Dynamic 3d gaussians: Tracking Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
11
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- [94] Fengrui Tian, Shaoyi Du, and Yueqi Duan. Monon-
ing transferable visual models from natural language super- erf: Learning a generalizable dynamic radiance field from
vision. In International conference on machine learning, monocular videos. In Proceedings of the IEEE/CVF In-
pages 8748–8763. PMLR, 2021. 2 ternational Conference on Computer Vision, pages 17903–
[81] Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Re- 17913, 2023. 2, 7
constructing 3d human pose from 2d image landmarks. In [95] Luan Tran and Xiaoming Liu. Nonlinear 3d face morphable
European Conference on Computer Vision, pages 573–586. model. In Proceedings of the IEEE conference on computer
Springer, 2012. 2 vision and pattern recognition, pages 7346–7355, 2018. 2
[82] Alfredo Rivero, ShahRukh Athar, Zhixin Shu, and Dimitris [96] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael
Samaras. Rig3dgs: Creating controllable portraits from Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-
casual monocular videos. arXiv preprint arXiv:2402.03723, rigid neural radiance fields: Reconstruction and novel view
2024. 2 synthesis of a dynamic scene from monocular video. In
[83] Jenny Seidenschwarz, Qunjie Zhou, Bardienus Duisterhof, Proceedings of the IEEE/CVF International Conference on
Deva Ramanan, and Laura Leal-Taixé. Dynomo: Online Computer Vision, pages 12959–12970, 2021. 2, 7
point tracking by dynamic online monocular gaussian re- [97] Chaoyang Wang, Ben Eckart, Simon Lucey, and Orazio
construction. arXiv preprint arXiv:2409.02104, 2024. 2 Gallo. Neural trajectory fields for dynamic novel view syn-
[84] Zhĳing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, thesis. arXiv preprint arXiv:2105.05994, 2021. 2
Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu [98] Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli
Wang. Splattingavatar: Realistic real-time human avatars Cao, Guocheng Qian, Hsin-Ying Lee, and Sergey Tulyakov.
with mesh-embedded gaussian splatting. arXiv preprint Diffusion priors for dynamic view synthesis from monocular
arXiv:2403.05087, 2024. 2 videos. arXiv preprint arXiv:2401.05583, 2024. 6
[85] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele [99] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi
Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. Nerf- Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruc-
player: A streamable dynamic scene representation with tion from a single video. 2024. 2, 6
decomposed neural radiance fields. IEEE Transactions on [100] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
Visualization and Computer Graphics, 2023. 2 Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3D
[86] Colton Stearns, Adam W. Harley, Mikaela Uy, Florian Du- Vision Made Easy. In Proceedings of the IEEE/CVF Con-
bost, Federico Tombari, Gordon Wetzstein, and Leonidas ference on Computer Vision and Pattern Recognition, pages
Guibas. Dynamic gaussian marbles for novel view synthe- 20697–20709, 2024. 8
sis of casual monocular videos. In ArXiv, 2024. 2, 6, 7, [101] Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexan-
8 der G Schwing, and Shenlong Wang. Gomavatar: Effi-
[87] Timo Stich, Christian Linz, Georgia Albuquerque, and Mar- cient animatable human modeling from monocular video
cus Magnor. View and time interpolation in image space. using gaussians-on-mesh. arXiv preprint arXiv:2404.07991,
Computer Graphics Forum, 2008. 2 2024. 2
[88] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram [102] Chung-Yi Weng, Brian Curless, Pratul P Srinivasan,
Burgard, and Daniel Cremers. A benchmark for the evalua- Jonathan T Barron, and Ira Kemelmacher-Shlizerman. Hu-
tion of rgb-d slam systems. In 2012 IEEE/RSJ international mannerf: Free-viewpoint rendering of moving people from
conference on intelligent robots and systems, pages 573– monocular video. In Proceedings of the IEEE/CVF Confer-
580. IEEE, 2012. 7, 8 ence on Computer Vision and Pattern Recognition, pages
[89] Robert W Sumner, Johannes Schmid, and Mark Pauly. Em- 16210–16220, 2022. 2
bedded deformation for shape manipulation. In ACM sig- [103] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng
graph 2007 papers, pages 80–es. 2007. 2 Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang.
[90] David Svitov, Pietro Morerio, Lourdes Agapito, and Alessio 4d gaussian splatting for real-time dynamic scene rendering.
Del Bue. Haha: Highly articulated gaussian human avatars arXiv preprint arXiv:2310.08528, 2023. 2, 6, 7, 8
with textured mesh prior. arXiv preprint arXiv:2404.01053, [104] Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, For-
2024. 2 rester Cole, and Cengiz Oztireli. D2 nerf: Self-supervised
[91] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field decoupling of dynamic and static objects from a monocular
transforms for optical flow. In Computer Vision–ECCV video. arXiv preprint arXiv:2205.15838, 2022. 2
2020: 16th European Conference, Glasgow, UK, August [105] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil
23–28, 2020, Proceedings, Part II 16, pages 402–419. Kim. Space-time neural irradiance fields for free-viewpoint
Springer, 2020. 4 video. In Proceedings of the IEEE/CVF Conference on Com-
[92] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam puter Vision and Pattern Recognition, pages 9421–9431,
for monocular, stereo, and rgb-d cameras. Advances in 2021. 2
neural information processing systems, 34:16558–16569, [106] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue,
2021. 8 Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker:
[93] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch vi- Tracking any 2d pixels in 3d space. In Proceedings of
sual odometry. Advances in Neural Information Processing the IEEE/CVF Conference on Computer Vision and Pattern
Systems, 36, 2024. 8 Recognition (CVPR), 2024. 3, 8
12
[107] Gengshan Yang, Deqing Sun, Varun Jampani, Daniel point: Dynamic neural point for view synthesis. Advances
Vlasic, Forrester Cole, Huiwen Chang, Deva Ramanan, in Neural Information Processing Systems, 36, 2024. 2, 6, 7
William T Freeman, and Ce Liu. Lasr: Learning articulated [120] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uytten-
shape reconstruction from a monocular video. In Proceed- daele, Simon Winder, and Richard Szeliski. High-quality
ings of the IEEE/CVF Conference on Computer Vision and video view interpolation using a layered representation.
Pattern Recognition, pages 4925–4935, 2021. 2 ACM Transactions on Graphics (TOG), 2004. 2
[108] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ra- [121] Michael Zollhöfer, Matthias Nießner, Shahram Izadi,
manan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Build- Christoph Rehmann, Christopher Zach, Matthew Fisher,
ing animatable 3d neural models from many casual videos. Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian
In Proceedings of the IEEE/CVF Conference on Computer Theobalt, et al. Real-time non-rigid reconstruction using an
Vision and Pattern Recognition, pages 22247–22257, 2022. rgb-d camera. ACM Transactions on Graphics (ToG), 33(4):
2 1–12, 2014. 2
[109] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Ji-
ashi Feng, and Hengshuang Zhao. Depth anything: Un-
leashing the power of large-scale unlabeled data. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 10371–10381, 2024. 2
[110] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing
Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-
fidelity monocular dynamic scene reconstruction. arXiv
preprint arXiv:2309.13101, 2023. 2
[111] Zeyu Yang, Hongye Yang, Zĳie Pan, Xiatian Zhu, and Li
Zhang. Real-time photorealistic dynamic scene representa-
tion and rendering with 4d gaussian splatting. arXiv preprint
arXiv:2310.10642, 2023. 2
[112] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park,
and Jan Kautz. Novel view synthesis of dynamic scenes
with globally coherent depths from a monocular camera.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5336–5345, 2020. 2,
7
[113] Meng You and Junhui Hou. Decoupling dynamic monoc-
ular videos for dynamic view synthesis. arXiv preprint
arXiv:2304.01716, 2023. 2
[114] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian
opacity fields: Efficient and compact surface reconstruction
in unbounded scenes. arXiv preprint arXiv:2404.10772,
2024. 2
[115] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jam-
pani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-
Hsuan Yang. Monst3r: A simple approach for estimat-
ing geometry in the presence of motion. arXiv preprint
arxiv:2410.03825, 2024. 7, 8
[116] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Ru-
binstein, Noah Snavely, and William T Freeman. Structure
and motion from casual videos. In European Conference on
Computer Vision, pages 20–37. Springer, 2022. 8
[117] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang,
and Yong-Jin Liu. ParticleSfM: Exploiting Dense Point
Trajectories for Localizing Moving Cameras in the Wild. In
European conference on computer vision (ECCV), 2022. 8
[118] Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel An-
gel Bautista, Joshua M. Susskind, and Alexander G.
Schwing. Pseudo-generalized dynamic view synthesis from
a video, 2024. 2, 6, 7
[119] Kaichen Zhou, Jia-Xing Zhong, Sangyun Shin, Kai Lu,
Yiyuan Yang, Andrew Markham, and Niki Trigoni. Dyn-
13

2405.17421v2

Uploaded by

Copyright:

Available Formats

2405.17421v2

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2405.17421v2

Uploaded by

Copyright:

Available Formats

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

{leijh, kostas}@cis.upenn.edu, {yijiaw, aharley, guibas}@cs.stanford.edu

Abstract camera parameters—the most typical data format for such

2. Related Works 3. Method

where W is the deformation field defined in Eq.4, and w

Table 2. Comparison on NVIDIA [112], averaged over all scenes.

Table 6. Ablation study on different priors on DyCheck [30].

You might also like