Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting

Vladimir Yugay Yue Li Theo Gevers Martin R. Oswald


University of Amsterdam, Netherlands
{vladimir.yugay, y.li6, th.gevers, m.r.oswald}@uva.nl
vladimiryugay.github.io/gaussian_slam

Figure 1 (image panels, left to right): ESLAM [34], Point-SLAM [51], Gaussian-SLAM (Ours).

Figure 1. Rendering Results of Gaussian-SLAM in Comparison. Embedded into a dense SLAM pipeline, the Gaussian splatting-based
scene representation allows for very fast, photo-realistic rendering of scene views. This leads to unprecedented rendering quality, especially
on real-world data like this TUM-RGBD [59] frame that contains many high-frequency details that other methods struggle to capture.

Abstract

We present a new dense simultaneous localization and mapping (SLAM) method that uses Gaussian splats as a scene representation. The new representation enables interactive-time reconstruction and photo-realistic rendering of real-world and synthetic scenes. We propose novel strategies for seeding and optimizing Gaussian splats to extend their use from multiview offline scenarios to sequential monocular RGBD input data setups. In addition, we extend Gaussian splats to encode geometry and experiment with tracking against this scene representation. Our method achieves state-of-the-art rendering quality on both real-world and synthetic datasets while being competitive in reconstruction performance and runtime.

1. Introduction

Simultaneous localization and mapping (SLAM) has been an active research topic for the past two decades [16, 23]. A major byproduct of that journey has been the investigation of various scene representations to either push the tracking performance and mapping capabilities or to adapt it for more complex downstream tasks like path planning or semantic understanding. Specifically, earlier works focus on tracking using various scene representations like feature point clouds [15, 26, 40], surfels [53, 71], depth maps [43, 58], or implicit representations [14, 42, 44]. Later works focused more on map quality and density. With the advent of powerful neural scene representations like neural radiance fields [38] that allow for high-fidelity view synthesis, a rapidly growing body of dense neural SLAM methods [19, 34, 51, 60, 62, 64, 81, 84] has been developed. Despite their impressive gains in scene representation quality, these methods are still limited to small synthetic scenes and their re-rendering results are still far from being photo-realistic.

Recently, a novel scene representation based on Gaussian splatting [25] has been shown to deliver on-par or even better rendering performance than NeRFs while being an order of magnitude faster in rendering and optimization. Besides being faster, this scene representation is directly interpretable and allows for direct scene editing, which is a desirable property for many downstream tasks.
With these advantages, the Gaussian splatting representation lends itself to be applied in an online SLAM system with real-time demands and as such opens the door towards photo-realistic dense SLAM.

In this paper, we introduce Gaussian-SLAM, a dense RGBD SLAM system using Gaussian splats as scene representation, which allows for almost photo-realistic re-rendering at interactive runtimes. An example of the high-fidelity rendering output of Gaussian-SLAM is depicted in Fig. 1. This paper further reveals and discusses a variety of limitations of Gaussian splatting for SLAM and proposes solutions on how to tackle them. In summary, our contributions include:
• A dense RGBD SLAM approach that uses Gaussian splats as a scene representation, allowing SOTA rendering results on real-world scenes in combination with substantially faster rendering.
• An extension of Gaussian splatting to better encode geometry, allowing reconstruction beyond radiance fields in a monocular setup.
• As the adaptation of the original Gaussian splatting from an offline to an online approach is by no means straightforward, we propose an online learning method for Gaussian splats that splits the map into sub-maps and introduces efficient seeding and optimization strategies.
• We further investigate frame-to-model tracking with Gaussian splatting via photometric error minimization and compare it to off-the-shelf frame-to-frame tracking.
All source code and data will be made publicly available.

2. Related Work

Dense Visual SLAM and Online Mapping. The seminal work of Curless and Levoy [12] set the stage for a variety of 3D reconstruction methods using truncated signed distance functions (TSDF). A line of works was built upon it, improving speed [42] through efficient implementation and volume integration, scalability through voxel hashing [21, 44, 46] and octree data structures [55], and tracking with sparse image features [5] and loop closure [6, 14, 43, 53]. Tackling the problem of unreliable depth maps, RoutedFusion [68] introduced a learning-based fusion network for updating the TSDF in volumetric grids. This concept was further evolved by NeuralFusion [69] and DI-Fusion [19], which adopt implicit learning for scene representation, enhancing their robustness against outliers. Recent research has successfully achieved dense online reconstruction using solely RGB cameras [4, 9, 27, 41, 52, 56, 61], bypassing the need for depth data.

Recently, test-time optimization methods have become popular due to their ability to adapt to unseen scenes on the fly. Continuous Neural Mapping [74], for instance, employs a continual mapping strategy from a series of depth maps to learn the scene representation. Inspired by Neural Radiance Fields [38], there has been huge progress in dense surface reconstruction [45, 65] and accurate pose estimation [3, 29, 50, 67]. These efforts have led to the development of comprehensive dense SLAM systems [34, 60, 75, 81, 84, 85], showing a trend in the pursuit of precise and reliable visual SLAM. A comprehensive survey on online RGBD reconstruction can be found in [86].

While the latest neural methods show impressive rendering capabilities on synthetic data, they struggle when applied to real-world data. Further, these methods are not yet practical for real-world applications due to computation requirements, slow speed, and the challenges in incorporating pose updates, as the neural representations rely on positional encoding. In contrast, our method shows impressive performance on real-world data, has a competitive runtime, and uses a scene representation that naturally allows pose updates.

Scene Representations for SLAM. The majority of dense SLAM 3D scene representations are either grid-based, point-based, network-based, or hybrid. Among these, grid-based techniques are perhaps the most extensively researched. They further divide into methods using dense grids [4, 10, 12, 28, 42, 61, 68–71, 82–84], hierarchical octrees [7, 30, 31, 35, 55, 75] and voxel hashing [14, 21, 39, 44, 65] for efficient memory management. Grids offer the advantage of simple and quick neighborhood lookups and context integration. However, a key limitation is the need to predefine the grid resolution, which is not easily adjustable during reconstruction. This can result in inefficient memory usage in empty areas while failing to capture finer details due to resolution constraints.

Point-based approaches address some of the grid-related challenges and have been effectively utilized in 3D reconstruction [6, 8, 11, 22, 24, 53, 71, 79]. Unlike grid resolution, the density of points in these methods does not have to be predetermined and can naturally vary throughout the scene. Moreover, point sets can be efficiently concentrated around surfaces, not spending memory on modeling empty space. The trade-off for this adaptability is the complexity of finding neighboring points, as point sets lack structured connectivity. In dense SLAM, this challenge can be mitigated by transforming the 3D neighborhood search into a 2D problem via projection onto keyframes [53, 71], or by organizing points within a grid structure for expedited searching [73].

Network-based methods for dense 3D reconstruction provide a continuous scene representation by implicitly modeling it with coordinate-based networks [1, 27, 36, 47, 60, 64, 65, 74, 76, 81]. This representation can capture high-quality maps and textures. However, they are generally unsuitable for online scene reconstruction due to their inability to update local scene regions and to scale to larger scenes. More recently, a hybrid representation combining the advantages of point-based and neural-based representations was proposed [51].
While addressing some of the issues of both representations, it struggles with real-world scenes and cannot seamlessly integrate trajectory updates into the scene representation.

Outside these three primary categories, some studies have explored alternative representations like surfels [17, 37] and neural planes [34, 49]. Parameterized surface elements are generally not great at modeling a flexible shape template, while feature planes struggle with scene reconstructions containing multiple surfaces, due to their overly compressed 2D representation. Recently, Kerbl et al. [25] proposed to represent a scene with 3D Gaussians. The scene representation is optimized via differentiable rendering with multi-view supervision. While being very efficient and achieving impressive rendering results, this representation is tailored for fully-observed multi-view environments and does not encode geometry well. Concurrent with our work, [72, 77, 78] focus on dynamic scene reconstruction, and [33] on tracking. However, they are all offline methods and are not suited for monocular dense SLAM setups.

Unlike others, we use 3D Gaussians in an online scenario with a monocular RGBD input stream. Moreover, we extend 3D Gaussians to encode accurate geometry while being optimized in a single-view setup for a small number of iterations.

3. Gaussian Splatting and its Limitations

In this section we revisit the original offline Gaussian Splatting [25] approach, while following the mathematical formulation of [33]. Additionally, we provide an analysis of Gaussian splatting properties that make it challenging to apply the original implementation directly in a SLAM setting. We pinpoint and discuss specific cases where this method fails or proves ineffective, thereby identifying the potential need for modifications or alternative approaches in the context of SLAM applications.

Gaussian Splatting [25] is an effective method for representing 3D scenes with novel-view synthesis capability. This approach is notable for its speed, without compromising on rendering quality. Originally, 3D Gaussians are initialized from a sparse SfM point cloud of a scene. Given a set of images observing the scene from different angles, the Gaussian parameters are optimized using differentiable rendering, while 3D Gaussians are adaptively added to or removed from the representation based on a set of heuristics.

The influence of a single 3D Gaussian on a physical point p \in \mathbb{R}^3 in 3D space is evaluated with the function:

f^{3D}(p) = \mathrm{sigmoid}(o) \, \exp\!\left(-\tfrac{1}{2}(p-\mu)^T \Sigma^{-1} (p-\mu)\right), \quad (1)

where \mu \in \mathbb{R}^3 is the 3D Gaussian mean, \Sigma = R S S^T R^T \in \mathbb{R}^{3\times 3} is the covariance matrix computed from scaling S \in \mathbb{R}^3 and rotation R \in \mathbb{R}^{3\times 3} components, and o \in \mathbb{R} is the opacity. The 3D Gaussians are projected to the image plane using the equations [87]:

\mu^{2D} = K \, (E\mu) / (E\mu)_z, \quad (2)
\Sigma^{2D} = J E \Sigma E^T J^T, \quad (3)

where K is the intrinsic matrix, E is the extrinsic matrix in camera-to-world coordinates, and J is the Jacobian of the point projection in Eq. (2), i.e. \partial\mu^{2D}/\partial\mu. The pixel color is influenced by all Gaussians which intersect with the ray cast from that specific pixel. The color is computed as a weighted average of the projected 3D Gaussians:

C_{\mathrm{pix}} = \sum_{i \in V} c_i \, f^{2D}_{i,\mathrm{pix}} \prod_{j=1}^{i-1} \left(1 - f^{2D}_{j,\mathrm{pix}}\right), \quad (4)

where V is the set of the Gaussians influencing the pixel, f^{2D}_{i,\mathrm{pix}} is the equivalent of the formula for f^{3D} except with the 3D means and covariance matrices replaced with their 2D splatted versions, and c_i is the RGB color of each Gaussian. Every term along the ray cast for a pixel is down-weighted by the transmittance term, which accounts for the influence of the previously encountered Gaussians.
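To make the splatting model above concrete, the following is a minimal NumPy sketch of Eqs. (1)-(4): evaluating the influence of a single 3D Gaussian, splatting its mean and covariance to the image plane, and compositing one pixel front to back. The function names and the use of the rotational part of E in Eq. (3) are our own illustrative choices, and E is assumed to map world points into the camera frame; the actual system relies on the CUDA rasterizer of [25].

```python
import numpy as np

def gaussian_influence_3d(p, mu, Sigma, o):
    """Eq. (1): influence of one 3D Gaussian (mean mu, covariance Sigma,
    raw opacity o) on a 3D point p."""
    d = p - mu
    opacity = 1.0 / (1.0 + np.exp(-o))                      # sigmoid(o)
    return opacity * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

def project_gaussian(mu, Sigma, K, E):
    """Eqs. (2)-(3): splat a 3D Gaussian to the image plane. E is a 4x4 matrix
    mapping world points into the camera frame here (a simplifying assumption),
    K is the 3x3 intrinsic matrix."""
    mu_cam = (E @ np.append(mu, 1.0))[:3]
    mu_2d = (K @ mu_cam)[:2] / mu_cam[2]                    # Eq. (2)
    fx, fy = K[0, 0], K[1, 1]
    x, y, z = mu_cam
    # Jacobian of the perspective projection evaluated at mu_cam
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    W = E[:3, :3]                                           # rotational part of E
    Sigma_2d = J @ W @ Sigma @ W.T @ J.T                    # Eq. (3)
    return mu_2d, Sigma_2d

def composite_pixel(pix, splatted):
    """Eq. (4): front-to-back alpha compositing of the splatted Gaussians that
    influence a pixel, sorted nearest-first. Each entry holds the 2D mean,
    2D covariance, activated opacity and RGB color of one Gaussian."""
    color, transmittance = np.zeros(3), 1.0
    for mu_2d, Sigma_2d, opacity, c in splatted:
        d = pix - mu_2d
        f_2d = opacity * np.exp(-0.5 * d @ np.linalg.solve(Sigma_2d, d))
        color += c * f_2d * transmittance                   # weighted contribution
        transmittance *= (1.0 - f_2d)                       # down-weight later terms
    return color
```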
Limitations of Gaussian Splatting for SLAM. Due to its speed advantages, 3D Gaussian splatting [25] seems a great match for the requirements of a dense SLAM system. However, being designed for multi-camera environments with good coverage of observations, its application to monocular SLAM presents multiple unique challenges.
▷ Seeding strategy for online SLAM: As stated above, the original seeding strategy builds upon a sparse point cloud of surface points and adaptively creates and removes Gaussians during the optimization. Such iterative dynamic behavior potentially leads to large variations in mapping iterations and compute time, which is rather undesirable for SLAM.
▷ Online optimization: A vanilla online implementation might just optimize over all frames, but for longer sequences, this becomes too slow for interactive frame rates.
▷ Catastrophic forgetting in online optimization: To avoid the linear growth in per-frame mapping with every new frame, an alternative is to optimize the Gaussian scene representation only with the current frame. While the optimization will quickly converge to fit the new training frame very well, previously mapped views will be severely degraded. This applies to the Gaussian shapes, encompassing both scale and orientation, as well as to the spherical harmonic color encoding, where local function adjustments can significantly alter function values in other areas of the spherical domain, as illustrated by the cyan colors in Figs. 2a, 2e.
▷ Highly randomized solutions: For both offline and online cases, the result of a splatting optimization highly depends on the initialization of the Gaussians. Also, during the optimization, Gaussians may grow suddenly in different directions depending on the neighboring Gaussians. Finally, the inherent symmetries of the 3D Gaussians allow parameter alterations without affecting the loss function, resulting in non-unique solutions, a generally undesirable characteristic in optimization.
▷ Poor extrapolation capabilities: Related to the previous issue, Gaussians often grow uncontrollably into unobserved areas. While good view coverage in an offline setting constrains most Gaussians well, novel views in a sparse-view SLAM setting often contain artifacts resulting from previously under-constrained Gaussians. This is shown in Figs. 2a-2d and is especially harmful for model-based tracking.
▷ Limited geometric accuracy: Gaussian splatting is not adept at encoding precise geometry when used in a monocular setup. While geometry estimation is relatively good in a well-constrained setup with multiple views, the resulting depth maps from a single-camera setup are ineffective for 3D reconstruction, as shown in Fig. 2f.

Figure 2 (image panels): (a) training view (step t), (b) novel view (step t+10), (c) training view, (d) neighboring novel view, (e) exploded scaling, (f) large depth error.
Figure 2. Naive usage of Gaussian splatting for SLAM. Fig. 2a and Fig. 2b show renderings of the naive SLAM pipeline for a training view and a novel view with a pose from 10 steps ahead. Fig. 2c and Fig. 2d show training and novel views from directly neighboring frames for which color artifacts occur. Fig. 2e shows uncontrolled growth of 3D Gaussian parameters. Fig. 2f depicts larger depth errors after optimizing 3D Gaussians to a single frame seeded at GT depth.

4. Method

The key idea of our approach is to use the 3D Gaussian scene representation [25] to enhance dense monocular SLAM. We extend the traditional Gaussian splatting representation to encode not only the radiance fields but also detailed geometry. Furthermore, we tailor the mapping process to yield state-of-the-art rendering in a sequential monocular setup, a very challenging scenario for 3D Gaussian Splatting. Finally, we experiment with frame-to-model tracking utilizing the 3D Gaussian scene representation. Fig. 3 provides an overview of our method. We now explain our pipeline, starting with map construction and optimization, followed by geometry encoding, and tracking.

4.1. 3D Gaussian-based Map

To preserve novel view synthesis, avoid catastrophic forgetting, and make the Gaussian mapping computationally feasible, we split the input sequence into sub-maps. Every sub-map consists of several keyframes and is represented with a separate 3D Gaussian point cloud. Formally, we define a sub-map Gaussian point cloud P^s as a collection of N 3D Gaussians

P^s = \{ G(\mu_i^s, S_i^s, R_i^s, o_i^s, \mathrm{SH}_i^s) \mid i = 1, \ldots, N \}, \quad (5)

each with a mean \mu_i^s \in \mathbb{R}^3, scale S_i^s \in \mathbb{R}^3, rotation R_i^s \in \mathbb{R}^4, opacity o_i^s \in \mathbb{R}, and spherical harmonics parameters \mathrm{SH}_i^s \in \mathbb{R}^{48}. At any period of time, we grow and optimize only one active sub-map. We start a new sub-map after a fixed amount of keyframes. Every new keyframe adds additional 3D Gaussians to the active sub-map to account for the newly observed parts of the scene.

Building Sub-maps Progressively. Seeding the new Gaussians is a crucial step in building sub-maps. At every keyframe, we compute a dense point cloud from the input RGBD frame and the pose estimated by the tracker. We sample M_u points uniformly and M_c points in high color gradient regions from the input point cloud. For each sampled point we seed two Gaussians: one on the surface at the measured depth and one slightly behind the surface in the viewing direction. This anchoring strategy is similar to [51], with the difference that we do not seed additional Gaussians in front of the surface points. The selected points serve as the initial mean locations for the new Gaussians. The new Gaussians are anisotropic since their scale is defined based on the proximity of the neighboring Gaussians within the active sub-map, similar to [25].
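A simplified sketch of this seeding step is given below, assuming per-point viewing directions and a precomputed color-gradient magnitude; the fixed 2 cm offset behind the surface and the use of a SciPy KD-tree in place of the FAISS GPU lookup are illustrative assumptions, not the exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree  # stand-in for the FAISS GPU lookup

def seed_new_gaussians(points, colors, view_dirs, color_grad,
                       M_u, M_c, rho, existing_means, behind_offset=0.02):
    """Pick seed locations from the keyframe point cloud and place two Gaussians
    per sample: one at the measured depth and one slightly behind the surface
    along the viewing direction (the offset value is illustrative)."""
    n = points.shape[0]
    uniform_idx = np.random.choice(n, size=min(M_u, n), replace=False)
    grad_idx = np.argsort(-color_grad)[:M_c]        # high color-gradient regions
    idx = np.unique(np.concatenate([uniform_idx, grad_idx]))

    # only keep points farther than the addition radius rho
    # from the Gaussians already present in the active sub-map
    if len(existing_means) > 0:
        dist, _ = cKDTree(existing_means).query(points[idx])
        idx = idx[dist > rho]

    surface = points[idx]
    behind = points[idx] + behind_offset * view_dirs[idx]
    means = np.concatenate([surface, behind], axis=0)
    seed_colors = np.concatenate([colors[idx], colors[idx]], axis=0)
    return means, seed_colors
```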
Sub-map Optimization. We optimize the active sub-map every time new Gaussians are added to it. When a new sub-map is created, we optimize it for a fixed number of iterations I_s w.r.t. depth and color. During optimization, we freeze the means of the Gaussians to not distort the geometry obtained from the depth sensor. We do not clone or prune the Gaussians as done in [25], to preserve the geometry obtained from the depth sensor and to speed up optimization.
Figure 3. Gaussian-SLAM Architecture. Given an estimated camera pose, mapping is performed as follows. An input point cloud is
subsampled and the locations for the new 3D Gaussians in the active sub-map are estimated based on the density of the sub-map. After the
sparse set of new 3D Gaussians is added to the active sub-map Gaussian point cloud, they are jointly optimized. Depth maps and color
images of all the keyframes that contributed to the active sub-map are rendered using a differential rasterizer. The 3D Gaussians’ parameters
are optimized by imposing depth, color re-rendering, and regularization losses against the input RGBD frame.

To counteract catastrophic forgetting, we optimize the active sub-map to be able to render the depth and color of all the keyframes seen in a given sub-map. In the original work [25], the scene representation is optimized for many iterations, and in every iteration a new view is sampled randomly. However, this approach does not suit the SLAM setup where the amount of iterations is limited. Following the naive strategy leads to insufficient optimization of the new views in the sub-map or excessive time spent on optimization. We achieve the best results by iterating over all the keyframes in the active sub-map, but spending at least 50% of the iterations on the current keyframe.
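A small sketch of this keyframe schedule is shown below, under the assumption that the remaining iterations cycle round-robin over the sub-map keyframes and that the order is shuffled; the exact sampling order is not specified in the text.

```python
import random

def keyframe_schedule(submap_keyframes, current_keyframe, n_iters):
    """Assign a supervising keyframe to every optimization iteration: at least
    half of the iterations use the newest keyframe, the remaining ones cycle
    over the keyframes already collected in the active sub-map."""
    n_current = (n_iters + 1) // 2                 # >= 50% on the current keyframe
    others = list(submap_keyframes) or [current_keyframe]
    schedule = [current_keyframe] * n_current
    schedule += [others[i % len(others)] for i in range(n_iters - n_current)]
    random.shuffle(schedule)                       # interleave old and new views
    return schedule
```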
4.2. Encoding Geometry and Color

3D Gaussians provide a natural way to render color [25]. After the 3D Gaussians are splatted to the image plane, the color loss between the input color image and the rendered image is computed and optimized w.r.t. the Gaussian parameters. For the color supervision, we use a weighted combination of L1 and SSIM [66] losses:

L_{\mathrm{color}} = (1-\lambda)\,|\hat{I} - I|_1 + \lambda \left(1 - \mathrm{SSIM}(\hat{I}, I)\right), \quad (6)

where I is the original image, \hat{I} is the rendered image, and \lambda = 0.2. We estimate the depth at every pixel similarly to color rendering, as the sum of the z coordinates of the Gaussians affecting this pixel, weighted by the transmittance factor:

D_{\mathrm{pix}} = \sum_{i \in S} z_i \, f^{2D}_{i,\mathrm{pix}} \prod_{j=1}^{i-1} \left(1 - f^{2D}_{j,\mathrm{pix}}\right), \quad (7)

where z_i is the Z coordinate of the Gaussian mean \mu_i in the camera coordinate system. The loss is optimized with respect to the Gaussian parameters, supervised by an input depth map. The gradients with respect to the depth are derived similarly as for color. However, unlike color optimization, we do not update the means of the Gaussians during depth optimization, to preserve the geometry obtained from the depth sensors. We use an L1 loss for depth optimization:

L_{\mathrm{depth}} = |\hat{D} - D|_1, \quad (8)

where D, \hat{D} \in \mathbb{R}_{\geq 0} are the ground-truth and reconstructed depth maps, respectively. To prevent the 3D Gaussians from scale explosion, we add a regularization loss L_{\mathrm{reg}}. It is an L2 loss applied to the scales of all the 3D Gaussians whose scale is larger than a threshold \gamma = 10. Finally, we optimize the color, depth, and regularization terms together:

L = L_{\mathrm{color}} + \alpha \cdot L_{\mathrm{depth}} + \beta \cdot L_{\mathrm{reg}}, \quad (9)

with \alpha and \beta both set to 1.
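A compact PyTorch sketch of the combined objective in Eqs. (6)-(9) is shown below. The SSIM term is delegated to an external implementation passed as ssim_fn, the L1 terms are averaged over pixels, and penalizing the full squared scale above the threshold is one possible reading of the regularizer; all of these are assumptions of the sketch.

```python
import torch

def mapping_loss(rendered_rgb, gt_rgb, rendered_depth, gt_depth, scales,
                 ssim_fn, lam=0.2, alpha=1.0, beta=1.0, scale_thresh=10.0):
    """Combined mapping objective of Eqs. (6)-(9). ssim_fn(a, b) is expected to
    return an SSIM value in [0, 1] from an external implementation."""
    # Eq. (6): weighted L1 + SSIM color term with lambda = 0.2
    l_color = (1.0 - lam) * torch.abs(rendered_rgb - gt_rgb).mean() \
              + lam * (1.0 - ssim_fn(rendered_rgb, gt_rgb))
    # Eq. (8): L1 depth term against the (inpainted) sensor depth
    l_depth = torch.abs(rendered_depth - gt_depth).mean()
    # Scale regularizer: L2 penalty on scales above the threshold (one reading
    # of the text; penalizing only the excess over the threshold is another)
    too_large = scales[scales > scale_thresh]
    l_reg = (too_large ** 2).mean() if too_large.numel() > 0 \
        else scales.new_zeros(())
    # Eq. (9): total loss with alpha = beta = 1
    return l_color + alpha * l_depth + beta * l_reg
```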
For color and depth rendering we cannot sample separate pixels, since a single splatted Gaussian has a non-local influence on the rendered image. Therefore, for depth supervision, we fill in the missing values in the depth map with Navier-Stokes inpainting [2]. We implement depth rendering and gradient computation in CUDA, which makes it efficient enough for SLAM applications.
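For the inpainting step, a minimal sketch using OpenCV's Navier-Stokes inpainting could look as follows; the inpainting radius and the mask definition (non-positive depth counts as missing) are illustrative choices.

```python
import cv2
import numpy as np

def inpaint_depth(depth, radius=5):
    """Fill missing depth values with Navier-Stokes inpainting [2] before using
    the depth map for supervision (radius and mask rule are illustrative)."""
    mask = (depth <= 0).astype(np.uint8)
    # recent OpenCV releases accept 32-bit float single-channel input;
    # older versions may require converting the depth to 8/16-bit first
    return cv2.inpaint(depth.astype(np.float32), mask, radius, cv2.INPAINT_NS)
```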
ing the relative transformation applied to the Gaussians:
4.3. Tracking

During tracking, the pose is initialized with multi-scale RGBD odometry [48]. The estimated relative transformations \{\Delta R', \Delta t'\} are applied globally to the Gaussian positions and orientations of the active sub-map, and the image is rendered from the last estimated pose. The transformation from the last frame to the current frame is then obtained by inverting the relative transformation applied to the Gaussians:

\Delta R = (\Delta R')^{-1}, \quad (10)
\Delta t = -\Delta R \cdot \Delta t'. \quad (11)

The re-rendered color loss is optimized with respect to the relative transformation.
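A minimal sketch of this tracking step is given below: the pose is initialized with Open3D RGBD odometry and the relative transform applied to the Gaussians is inverted as in Eqs. (10)-(11). The source/target ordering and the pose-composition convention are assumptions of the sketch, and the photometric refinement against the rendered sub-map is omitted.

```python
import numpy as np
import open3d as o3d

def track_frame(prev_rgbd, cur_rgbd, intrinsic, last_pose_w2c):
    """Initialize the relative motion with Open3D multi-scale RGBD odometry and
    apply the inversion of Eqs. (10)-(11). Returns the new world-to-camera pose."""
    ok, delta, _ = o3d.pipelines.odometry.compute_rgbd_odometry(
        cur_rgbd, prev_rgbd, intrinsic, np.eye(4),
        o3d.pipelines.odometry.RGBDOdometryJacobianFromHybridTerm(),
        o3d.pipelines.odometry.OdometryOption())
    if not ok:
        delta = np.eye(4)                         # fall back to a constant pose
    dR_prime, dt_prime = delta[:3, :3], delta[:3, 3]
    dR = dR_prime.T                               # Eq. (10): (dR')^-1 for a rotation
    dt = -dR @ dt_prime                           # Eq. (11)
    rel = np.eye(4)
    rel[:3, :3], rel[:3, 3] = dR, dt
    return rel @ last_pose_w2c                    # composition convention assumed
```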
Our experiments show that 3D Gaussians exhibit limited extrapolation capabilities, as evidenced by rendering artifacts when deviating from the initial trajectory and difficulties in reconstructing accurate depth maps at geometry discontinuities and within unseen regions. Optimizing the re-rendering loss for pose refinement shows worse tracking performance compared to other neural SLAM methods that use a similar approach but use an implicit representation, as shown in Tab. 5.

To assess the impact of these limitations, we carry out oracle experiments, where Gaussian sub-maps are first mapped using ground truth poses and then used for tracking. The oracle method achieves better accuracy than all other dense neural SLAM methods, including DROID-SLAM [63]. This supports our claim that the tracking accuracy is limited by the extrapolation capability of the 3D Gaussian splats.

As frame-to-model tracking is currently unsatisfactory, we utilize the trajectories of DROID-SLAM [63] to obtain all the results on rendering and reconstruction. We believe that enhancements in depth map rendering will further improve the frame-to-model tracking accuracy of our method, as evidenced by the oracle experiment.

5. Experiments

We first describe our experimental setup and then evaluate our method against state-of-the-art dense neural RGBD SLAM methods on Replica [57] as well as the real-world TUM-RGBD [59] and ScanNet [13] datasets. Further experiments and details are provided in the appendix.

Implementation Details. We set M_u = 15000 and M_c = 20000 for Replica, M_u = 10000 and M_c = 15000 for TUM-RGBD, and M_u = 55000 and M_c = 20000 for ScanNet. The number of iterations for the first frame in a sub-map is set to 1000 for all the datasets. The number of iterations for all the other frames in a sub-map is set to 500 for Replica and TUM-RGBD, and 1000 for ScanNet. Every 5th frame is considered a keyframe for all the datasets. We use the FAISS [20] GPU implementation for finding nearest neighbors when deciding which points to add as new Gaussians. We use a Gaussian addition radius ρ = 0.03 m for Replica and TUM-RGBD, and ρ = 0.07 m for ScanNet. For all the datasets we spend at least 50% of iterations on the newly incoming keyframe. To mesh the scene, we render depth and color every fifth frame over the estimated trajectory and use TSDF Fusion [12] with voxel size 1 cm, similar to [51]. Please refer to the appendix for more details.
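As a sketch of the meshing step, the rendered depth and color frames can be fused with Open3D's scalable TSDF volume at 1 cm voxel size; the truncation distance, depth scale and frame container format used here are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

def mesh_from_renders(frames, intrinsic, voxel_size=0.01, every_nth=5):
    """Fuse rendered depth/color along the estimated trajectory with a scalable
    TSDF volume (1 cm voxels, every fifth frame) and extract a triangle mesh.
    frames yields (color, depth, pose_c2w) with Open3D images and metric depth."""
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_size,
        sdf_trunc=0.04,  # truncation distance: an illustrative choice
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for i, (color, depth, pose_c2w) in enumerate(frames):
        if i % every_nth != 0:
            continue
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_scale=1.0, depth_trunc=8.0,
            convert_rgb_to_intensity=False)
        # integrate() expects the world-to-camera extrinsic
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose_c2w))
    return volume.extract_triangle_mesh()
```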
Table 1. Rendering Performance on ScanNet [13]. We outperform existing dense neural RGBD methods on the commonly reported rendering metrics. For NICE-SLAM [84] and Vox-Fusion [75] we take the numbers from [85]. For qualitative results, see Fig. 4.

Method | Metric | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
NICE-SLAM [84] | PSNR↑ | 18.71 | 16.55 | 17.29 | 18.75 | 15.56 | 18.38 | 17.54
NICE-SLAM [84] | SSIM↑ | 0.641 | 0.605 | 0.646 | 0.629 | 0.562 | 0.646 | 0.621
NICE-SLAM [84] | LPIPS↓ | 0.561 | 0.534 | 0.510 | 0.534 | 0.602 | 0.552 | 0.548
Vox-Fusion∗ [75] | PSNR↑ | 19.06 | 16.38 | 18.46 | 18.69 | 16.75 | 19.66 | 18.17
Vox-Fusion∗ [75] | SSIM↑ | 0.662 | 0.615 | 0.753 | 0.650 | 0.666 | 0.696 | 0.673
Vox-Fusion∗ [75] | LPIPS↓ | 0.515 | 0.528 | 0.439 | 0.513 | 0.532 | 0.500 | 0.504
ESLAM [34] | PSNR↑ | 15.70 | 14.48 | 15.44 | 14.56 | 14.22 | 17.32 | 15.29
ESLAM [34] | SSIM↑ | 0.687 | 0.632 | 0.628 | 0.656 | 0.696 | 0.653 | 0.658
ESLAM [34] | LPIPS↓ | 0.449 | 0.450 | 0.529 | 0.486 | 0.482 | 0.534 | 0.488
Point-SLAM [51] | PSNR↑ | 21.30 | 19.48 | 16.80 | 18.53 | 22.27 | 20.56 | 19.82
Point-SLAM [51] | SSIM↑ | 0.806 | 0.765 | 0.676 | 0.686 | 0.823 | 0.750 | 0.751
Point-SLAM [51] | LPIPS↓ | 0.485 | 0.499 | 0.544 | 0.542 | 0.471 | 0.544 | 0.514
Gaussian-SLAM (ours) | PSNR↑ | 37.94 | 36.87 | 35.95 | 36.86 | 38.22 | 38.85 | 37.45
Gaussian-SLAM (ours) | SSIM↑ | 0.984 | 0.988 | 0.986 | 0.981 | 0.983 | 0.982 | 0.984
Gaussian-SLAM (ours) | LPIPS↓ | 0.066 | 0.058 | 0.067 | 0.072 | 0.074 | 0.071 | 0.068

Table 2. Rendering Performance on TUM-RGBD [58]. We outperform existing dense neural RGBD methods on the commonly reported rendering metrics by a huge margin. For NICE-SLAM [84] and Vox-Fusion [75] we take the numbers from [85]. For qualitative results, see Fig. 4.

Method | Metric | fr1/desk | fr1/desk2 | fr1/room | fr2/xyz | fr3/office | Avg.
NICE-SLAM [84] | PSNR↑ | 13.83 | 12.00 | 11.39 | 17.87 | 12.89 | 13.59
NICE-SLAM [84] | SSIM↑ | 0.569 | 0.514 | 0.373 | 0.718 | 0.554 | 0.545
NICE-SLAM [84] | LPIPS↓ | 0.482 | 0.520 | 0.629 | 0.344 | 0.498 | 0.494
Vox-Fusion∗ [75] | PSNR↑ | 15.79 | 14.12 | 14.20 | 16.32 | 17.27 | 15.54
Vox-Fusion∗ [75] | SSIM↑ | 0.647 | 0.568 | 0.566 | 0.706 | 0.677 | 0.632
Vox-Fusion∗ [75] | LPIPS↓ | 0.523 | 0.541 | 0.559 | 0.433 | 0.456 | 0.502
ESLAM [34] | PSNR↑ | 11.29 | 12.30 | 9.06 | 17.46 | 17.02 | 13.42
ESLAM [34] | SSIM↑ | 0.666 | 0.634 | 0.929 | 0.310 | 0.457 | 0.599
ESLAM [34] | LPIPS↓ | 0.358 | 0.421 | 0.192 | 0.698 | 0.652 | 0.464
Point-SLAM [51] | PSNR↑ | 13.87 | 14.12 | 14.16 | 17.56 | 18.43 | 15.63
Point-SLAM [51] | SSIM↑ | 0.627 | 0.592 | 0.645 | 0.708 | 0.754 | 0.665
Point-SLAM [51] | LPIPS↓ | 0.544 | 0.568 | 0.546 | 0.585 | 0.448 | 0.538
Gaussian-SLAM (ours) | PSNR↑ | 31.18 | 31.81 | 25.99 | 26.37 | 33.08 | 29.69
Gaussian-SLAM (ours) | SSIM↑ | 0.980 | 0.981 | 0.968 | 0.946 | 0.984 | 0.971
Gaussian-SLAM (ours) | LPIPS↓ | 0.067 | 0.068 | 0.125 | 0.148 | 0.060 | 0.093

Table 3. Rendering Performance on Replica [57]. We outperform all existing dense neural RGBD methods on the commonly reported rendering metrics. The numbers for NICE-SLAM [84] and Vox-Fusion [75] are taken from [85].

Method | Metric | Rm0 | Rm1 | Rm2 | Off0 | Off1 | Off2 | Off3 | Off4 | Avg.
NICE-SLAM [84] | PSNR↑ | 22.12 | 22.47 | 24.52 | 29.07 | 30.34 | 19.66 | 22.23 | 24.94 | 24.42
NICE-SLAM [84] | SSIM↑ | 0.689 | 0.757 | 0.814 | 0.874 | 0.886 | 0.797 | 0.801 | 0.856 | 0.809
NICE-SLAM [84] | LPIPS↓ | 0.330 | 0.271 | 0.208 | 0.229 | 0.181 | 0.235 | 0.209 | 0.198 | 0.233
Vox-Fusion∗ [75] | PSNR↑ | 22.39 | 22.36 | 23.92 | 27.79 | 29.83 | 20.33 | 23.47 | 25.21 | 24.41
Vox-Fusion∗ [75] | SSIM↑ | 0.683 | 0.751 | 0.798 | 0.857 | 0.876 | 0.794 | 0.803 | 0.847 | 0.801
Vox-Fusion∗ [75] | LPIPS↓ | 0.303 | 0.269 | 0.234 | 0.241 | 0.184 | 0.243 | 0.213 | 0.199 | 0.236
ESLAM [34] | PSNR↑ | 25.25 | 25.31 | 28.09 | 30.33 | 27.04 | 27.99 | 29.27 | 29.15 | 27.80
ESLAM [34] | SSIM↑ | 0.874 | 0.245 | 0.935 | 0.934 | 0.910 | 0.942 | 0.953 | 0.948 | 0.921
ESLAM [34] | LPIPS↓ | 0.315 | 0.296 | 0.245 | 0.213 | 0.254 | 0.238 | 0.186 | 0.210 | 0.245
Point-SLAM [51] | PSNR↑ | 32.40 | 34.08 | 35.50 | 38.26 | 39.16 | 33.99 | 33.48 | 33.49 | 35.17
Point-SLAM [51] | SSIM↑ | 0.974 | 0.977 | 0.982 | 0.983 | 0.986 | 0.960 | 0.960 | 0.979 | 0.975
Point-SLAM [51] | LPIPS↓ | 0.113 | 0.116 | 0.111 | 0.100 | 0.118 | 0.156 | 0.132 | 0.142 | 0.124
Gaussian-SLAM (ours) | PSNR↑ | 34.31 | 37.28 | 38.18 | 43.97 | 43.56 | 37.39 | 36.48 | 40.19 | 38.90
Gaussian-SLAM (ours) | SSIM↑ | 0.988 | 0.992 | 0.993 | 0.996 | 0.995 | 0.992 | 0.992 | 0.999 | 0.993
Gaussian-SLAM (ours) | LPIPS↓ | 0.082 | 0.068 | 0.074 | 0.045 | 0.066 | 0.078 | 0.079 | 0.066 | 0.069

Datasets. The Replica dataset [57] comprises high-quality 3D reconstructions of a variety of indoor scenes.
We utilize the publicly available dataset collected by Sucar et al. [60], which provides trajectories from an RGBD sensor. Further, we demonstrate that our framework can handle real-world data by using the TUM-RGBD dataset [59], as well as the ScanNet dataset [13]. The poses for TUM-RGBD were captured using an external motion capture system, while ScanNet uses poses from BundleFusion [14].

Evaluation Metrics. The meshes, produced by marching cubes [32], are evaluated using the F-score, the harmonic mean of Precision (P) and Recall (R). We use a distance threshold of 1 cm for all evaluations. We further provide the depth L1 metric for unseen views as in [84]. For tracking accuracy, we use ATE RMSE [59], and for rendering we provide the peak signal-to-noise ratio (PSNR), SSIM [66] and LPIPS [80]. Our rendering metrics are evaluated by rendering the full-resolution image along the estimated trajectory at the mapping interval, similar to [51].

Baseline Methods. We primarily compare our method to existing state-of-the-art dense neural RGBD SLAM methods such as NICE-SLAM [84], Vox-Fusion [75], ESLAM [34] and Point-SLAM [51]. We reproduce the results from [75] using the open source code and report the results as Vox-Fusion∗. For NICE-SLAM, we mesh the scene at resolution 1 cm for a fair comparison. We could not obtain the renderings from GO-SLAM [81] for the evaluation since the rendering function was not available at the time of submission.

5.1. Rendering Performance

Tab. 1 and Tab. 2 show our state-of-the-art rendering performance on real-world datasets. Tab. 3 compares rendering performance and shows improvements over all the existing dense neural RGBD SLAM methods on synthetic data. Fig. 4 shows exemplary full-resolution renderings where Gaussian-SLAM yields more accurate details.

5.2. Reconstruction Performance

In Tab. 4 we compare our method to NICE-SLAM [84], Vox-Fusion [75], ESLAM [34], GO-SLAM [81] and Point-SLAM [51] in terms of geometric reconstruction accuracy on the Replica dataset [57]. Our method performs on par with other existing dense SLAM methods.

5.3. Tracking Performance

Tab. 5 presents our tracking results together with oracle performance on the Replica dataset [57], comparing against other SOTA dense neural RGBD SLAM methods and DROID-SLAM [63]. Poses are first estimated with RGBD odometry implemented in Open3D based on [48, 54]. As discussed in Sec. 4.3, our tracking falls behind in accuracy, while the oracle method consistently outperforms all other dense neural RGBD SLAM methods and also surpasses DROID-SLAM [63]. For all experiments, the depth loss is not used due to the negative impact of the inaccurate depth map. More tracking results are provided in appendix D.

5.4. Further Statistical Evaluation

Runtime Analysis. We report runtime usage on the Replica office0 scene in Tab. 6. The mapping time is reported per iteration and per frame, averaged over the full trajectory.

Limitations. Although we have adapted Gaussian splatting for online dense SLAM, tracking remains challenging. This is caused by the insufficient extrapolation ability of the 3D Gaussians, and by unconstrained Gaussian parameters in unobserved regions. We also believe that some of our empirical hyperparameters can be made test-time adaptive; e.g., our keyframe selection strategy is currently rather simplistic.

Table 4. Reconstruction Performance on Replica [57]. Our method performs on par with existing methods. ∗ Depth L1 for GO-SLAM shows our reproduced results from random poses (GO-SLAM evaluates on ground truth poses).

Method | Metric | Rm0 | Rm1 | Rm2 | Off0 | Off1 | Off2 | Off3 | Off4 | Avg.
NICE-SLAM [84] | Depth L1 [cm]↓ | 1.81 | 1.44 | 2.04 | 1.39 | 1.76 | 8.33 | 4.99 | 2.01 | 2.97
NICE-SLAM [84] | F1 [%]↑ | 45.0 | 44.8 | 43.6 | 50.0 | 51.9 | 39.2 | 39.9 | 36.5 | 43.9
Vox-Fusion [75] | Depth L1 [cm]↓ | 1.09 | 1.90 | 2.21 | 2.32 | 3.40 | 4.19 | 2.96 | 1.61 | 2.46
Vox-Fusion [75] | F1 [%]↑ | 69.9 | 34.4 | 59.7 | 46.5 | 40.8 | 51.0 | 64.6 | 50.7 | 52.2
ESLAM [34] | Depth L1 [cm]↓ | 0.97 | 1.07 | 1.28 | 0.86 | 1.26 | 1.71 | 1.43 | 1.06 | 1.18
ESLAM [34] | F1 [%]↑ | 81.0 | 82.2 | 83.9 | 78.4 | 75.5 | 77.1 | 75.5 | 79.1 | 79.1
Co-SLAM [64] | Depth L1 [cm]↓ | 1.05 | 0.85 | 2.37 | 1.24 | 1.48 | 1.86 | 1.66 | 1.54 | 1.51
Co-SLAM [64] | F1 [%]↑ | - | - | - | - | - | - | - | - | -
GO-SLAM [81] | Depth L1 [cm]↓ | - | - | - | - | - | - | - | - | 3.38
GO-SLAM [81] | Depth L1 [cm]↓ ∗ | 4.56 | 1.97 | 3.43 | 2.47 | 3.03 | 10.3 | 7.31 | 4.34 | 4.68
GO-SLAM [81] | F1 [%]↑ | 17.3 | 33.4 | 24.0 | 43.0 | 31.8 | 21.8 | 17.3 | 22.0 | 26.3
Point-SLAM [51] | Depth L1 [cm]↓ | 0.53 | 0.22 | 0.46 | 0.30 | 0.57 | 0.49 | 0.51 | 0.46 | 0.44
Point-SLAM [51] | F1 [%]↑ | 86.9 | 92.3 | 90.8 | 93.8 | 91.6 | 89.0 | 88.2 | 85.6 | 89.8
Ours | Depth L1 [cm]↓ | 0.69 | 0.33 | 0.83 | 0.52 | 0.46 | 5.43 | 3.74 | 0.47 | 1.56
Ours | F1 [%]↑ | 88.0 | 90.2 | 88.9 | 92.0 | 90.5 | 83.9 | 81.4 | 86.2 | 87.6

Table 5. Frame-to-model tracking with 3D Gaussians on Replica [57] (ATE RMSE ↓ [cm]). We summarize our frame-to-Gaussian-model (F2GM) tracking experiments. We use the classical frame-to-frame tracker [48] (F2F) as the baseline. Then we evaluate the upper-bound performance of the F2GM tracker, which is initialized with the F2F poses and conducts refinement using pre-mapped segments obtained with ground-truth poses (F2GM + F2F oracle). This approach reaches SOTA performance on Replica as novel views for tracking are never obstructed by unconstrained Gaussians. Further, we evaluate the performance of the F2GM + F2F tracker with refinement on online mapped segments. This experiment highlights the limitation of 3D Gaussians for frame-to-model tracking.

Method | Rm0 | Rm1 | Rm2 | Off0 | Off1 | Off2 | Off3 | Off4 | Avg.
NICE-SLAM [84] | 0.97 | 1.31 | 1.07 | 0.88 | 1.00 | 1.06 | 1.10 | 1.13 | 1.06
ESLAM [34] | 0.71 | 0.70 | 0.52 | 0.57 | 0.55 | 0.58 | 0.72 | 0.63 | 0.63
Point-SLAM [51] | 0.61 | 0.41 | 0.37 | 0.38 | 0.48 | 0.54 | 0.69 | 0.72 | 0.52
DROID-SLAM [63] | 0.53 | 0.38 | 0.45 | 0.35 | 0.24 | 0.36 | 0.33 | 0.43 | 0.38
F2F [48] | 1.64 | 1.92 | 2.8 | 2.48 | 0.8 | 4.55 | 2.64 | 2.27 | 2.38
F2GM + F2F | 3.35 | 8.74 | 3.13 | 1.11 | 0.81 | 0.78 | 1.08 | 7.21 | 3.27
F2GM + F2F oracle | 0.13 | 1.11 | 0.38 | 0.07 | 0.06 | 0.09 | 0.08 | 0.30 | 0.28
Figure 4 (image grid; rows: scene_0000, scene_0207, fr1-desk, fr3-office; columns: NICE-SLAM [84], ESLAM [34], Point-SLAM [51], Gaussian-SLAM (ours)).
Figure 4. Rendering performance on ScanNet [13] and TUM-RGBD [58]. Thanks to 3D Gaussian splatting, Gaussian-SLAM is able to encode more high-frequency details and substantially increase the quality of the renderings. This is also supported by the quantitative results in Tab. 1 and Tab. 2.

Table 6. Average Mapping and Rendering Speed on Replica office0. Our mapping time is competitive, while our rendering time allows for exploring mapped environments in real time. All metrics are computed using a single NVIDIA A6000.

Method | Mapping/Iteration | Mapping/Frame | Rendering/Frame
NICE-SLAM [84] | 182 ms | 10.92 s | 2.64 s
Vox-Fusion [75] | 55 ms | 0.55 s | 1.63 s
ESLAM [34] | 29 ms | 0.44 s | 0.63 s
GO-SLAM [81] | - | 0.125 s | -
Point-SLAM [51] | 33 ms | 2.97 s | 2.96 s
Gaussian-SLAM (ours) | 34 ms | 1.66 s | 7.15E-07

6. Conclusion

We proposed Gaussian-SLAM, a dense SLAM system that extends and utilizes 3D Gaussian splatting for scene representation. Smart seeding and data-driven anchoring of 3D Gaussians allow us to make the optimization process efficient enough for modern neural SLAM applications. Overall, this leads to a better balance of memory and compute resource usage and the accuracy of the estimated 3D scene representation. Our experiments demonstrate that Gaussian-SLAM substantially outperforms existing solutions regarding rendering accuracy while being competitive with respect to geometry reconstruction and runtime.

Acknowledgements. This work was supported by TomTom, the University of Amsterdam and the allowance of Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.
A. Limitations of 3D Gaussian Splatting for SLAM

To demonstrate the limitations discussed in Sec. 3, we conduct full-scale experiments. When naively optimizing a global Gaussian point cloud for the whole sequence, we run out of memory even on A100 NVIDIA GPUs. The cases when geometry was not encoded, or when only every single frame was optimized, are shown in Tab. 7.

Table 7. The naive approach of optimizing only for a single incoming frame or only for color gives poor reconstruction and rendering results. For single-frame optimization, the geometry of the 3D Gaussians becomes so distorted that the final mesh cannot be reconstructed from the noisy depth maps rendered from it. All metrics were averaged over the 8 sequences of the Replica [57] dataset.

Method | PSNR↑ | SSIM↑ | LPIPS↓ | F1↑ | L1↓
Single-frame optimization | 3.15 | 0.14 | 0.16 | - | -
Color-only optimization | 29.40 | 0.74 | 0.05 | 3 | 3.22
Ours | 38.90 | 0.993 | 0.069 | 87.6 | 1.56

B. Implementation Details

We use PyTorch 1.12 and Python 3.7 to implement the pipeline. Training is done with the Adam optimizer and the default hyperparameters betas = (0.9, 0.999), eps = 1e-08 and weight_decay = 0. The results are gathered using NVIDIA A6000 and A100 GPUs. Different learning rates are used for different Gaussian parameters. The learning rate is set to 0.0025 for the first 3 channels of the spherical harmonics features and is 20 times smaller for the rest of the features. Learning rates of 0.05, 0.005, and 0.001 are set for the opacity, scaling, and rotation parameters. During training, we set the spherical harmonics degree to 0 since we assume that the colors are not view-dependent within every sub-map.
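These per-parameter learning rates map directly onto Adam parameter groups; a sketch with illustrative tensor shapes (the 48 spherical-harmonics values split into 3 DC channels and 45 higher-order features) is shown below.

```python
import torch

# Illustrative parameter tensors for one sub-map of N Gaussians; the shapes
# and the DC/rest split of the spherical harmonics are assumptions of this sketch.
N = 1000
params = {
    "sh_dc":    torch.zeros(N, 3,  requires_grad=True),
    "sh_rest":  torch.zeros(N, 45, requires_grad=True),
    "opacity":  torch.zeros(N, 1,  requires_grad=True),
    "scaling":  torch.zeros(N, 3,  requires_grad=True),
    "rotation": torch.zeros(N, 4,  requires_grad=True),
}

# One Adam optimizer with a separate learning rate per parameter group,
# matching the values listed above.
optimizer = torch.optim.Adam(
    [
        {"params": [params["sh_dc"]],    "lr": 0.0025},
        {"params": [params["sh_rest"]],  "lr": 0.0025 / 20},
        {"params": [params["opacity"]],  "lr": 0.05},
        {"params": [params["scaling"]],  "lr": 0.005},
        {"params": [params["rotation"]], "lr": 0.001},
    ],
    betas=(0.9, 0.999), eps=1e-08, weight_decay=0,
)
```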

C. Evaluation Metrics

Mapping. We use the following five metrics to quantify the reconstruction performance. We compare the ground truth mesh to the predicted mesh. The F-score is defined as the harmonic mean between Precision (P) and Recall (R), F = 2PR/(P+R). Precision is defined as the percentage of points on the predicted mesh that lie within some distance τ from a point on the ground truth mesh. Vice versa, Recall is defined as the percentage of points on the ground truth mesh that lie within the same distance τ from a point on the predicted mesh. In all our experiments, we use the distance threshold τ = 0.01 m. Before Precision and Recall are computed, the input meshes are globally aligned with the iterative closest point (ICP) algorithm. We use the evaluation script provided by the authors of [51] (https://github.com/tfy14esa/evaluate_3d_reconstruction_lib). Finally, we report the depth L1 metric, which renders depth maps from randomly sampled viewpoints from the reconstructed and ground truth meshes. The depth maps are then compared, and the L1 error is reported and averaged over 1000 sampled views. We use the evaluation code provided by [84].
points on the predicted mesh that lie within some distance τ
from a point on the ground truth mesh. Vice versa, Recall Influence of the sub-map Size. Another important param-
is defined as the percentage of points on the ground truth eter is the sub-map size. It influences the runtime and the
mesh that lies within the same distance τ from a point on the rendering and reconstruction quality. The sub-map size does
predicted mesh. In all our experiments, we use the distance not correlate much with the rendering metrics but can im-
threshold τ = 0.01 m. Before the Precision and Recall are prove the runtime since more iterations are spent on the first
computed, the input meshes are globally aligned with the frames of the new sub-map. This holds for both synthetic
iterative closest point (ICP) algorithm. We use the evaluation and real-world datasets as shown in 7 and 6.
script provided by the authors of [51]1 . Finally, we report the Additional Tracking Experiments. Tab. 8 provides ad-
1 https : / / github . com / tfy14esa / evaluate _ 3d _ ditional results comparing tracking performance on TUM-
reconstruction_lib RGBD [58] for frame-to-frame tracker (F2F), frame-to-

9
Qualitative Reconstruction Results. Fig. 8 shows reconstructed meshes on the Replica dataset with a normal map shader to highlight the differences. Fig. 9 compares colored meshes on ScanNet [13] and TUM-RGBD [58] scenes. Gaussian-SLAM is able to recover more geometric and color details in real-world reconstructions.

Table 8. Frame-to-model Tracking with 3D Gaussians on TUM-RGBD [59] (ATE RMSE ↓ [cm]). Currently the frame-to-Gaussian-model tracking in Gaussian-SLAM yields large drift, while the oracle performance shows great potential. In parentheses, the average over only the successful runs is reported. Part of the numbers are taken from [51].

Method | fr1/desk | fr1/desk2 | fr1/room | fr2/xyz | fr3/office | Avg.
BAD-SLAM [53] | 1.7 | N/A | N/A | 1.1 | 1.7 | N/A
Kintinuous [70] | 3.7 | 7.1 | 7.5 | 2.9 | 3.0 | 4.84
ORB-SLAM2 [40] | 1.6 | 2.2 | 4.7 | 0.4 | 1.0 | 1.98
ElasticFusion [71] | 2.53 | 6.83 | 21.49 | 1.17 | 2.52 | 6.91
DROID-SLAM [63] | 1.80 | 2.11 | 3.13 | 0.33 | 1.32 | 1.74
NICE-SLAM [84] | 4.26 | 4.99 | 34.49 | 31.73 (6.19) | 3.87 | 15.87 (10.76)
Vox-Fusion∗ [75] | 3.52 | 6.00 | 19.53 | 1.49 | 26.01 | 11.31
Point-SLAM [51] | 4.34 | 4.54 | 30.92 | 1.31 | 3.48 | 8.92
F2F | 3.25 | 6.56 | 28.49 | 5.34 | 25.15 | 13.76
F2GM + F2F | 14.43 | 24.55 | 44.83 | 2.75 | 6.71 | 18.66
F2GM + F2F oracle | 3.04 | 10.85 | 3.03 | 0.28 | 1.07 | 3.65

Figure 6. Ablation over different sizes of the map segments on the TUM-RGBD [58] dataset. Segment size is a hyperparameter allowing for finding a balance between speed and rendering quality. PSNR is divided by 50 to be in the [0, 1] range; LPIPS in the diagram is the inverse of the original LPIPS; Depth L1 is on the log scale.

Figure 7. Ablation over different sizes of the map segments on the Replica [57] dataset. Segment size is a hyperparameter allowing for finding a balance between speed and rendering and reconstruction quality. PSNR is divided by 50; LPIPS in the diagram is the inverse of the original LPIPS; Depth L1 is on the log scale.
Figure 8 (image grid; rows: room 0, room 1, office 0, office 4; columns: ESLAM [34], GO-SLAM [81], Point-SLAM [51], Gaussian-SLAM (ours), Ground Truth).
Figure 8. Qualitative Reconstruction Comparison on the Replica dataset [57]. Gaussian-SLAM achieves comparable reconstruction performance with the state-of-the-art dense neural SLAM methods.

Figure 9 (image grid; rows: 0059, 0169, desk1, office; columns: NICE-SLAM [84], ESLAM [34], Point-SLAM [51], Gaussian-SLAM (ours), Ground Truth).
Figure 9. Qualitative Mesh-based Comparison on the ScanNet [13] and TUM-RGBD [58] datasets. For TUM-RGBD, the ground truth is obtained by TSDF fusion. NICE-SLAM [84] shows over-smoothed surfaces. Point-SLAM [51] has duplicated geometry. ESLAM [34] improves the reconstruction quality, while Gaussian-SLAM is moderately better in recovering geometric details, see the chairs in scene_0059 for example.
References

[1] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022. 2
[2] Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), pages I–355. IEEE, 2001. 5
[3] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. arXiv preprint arXiv:2212.07388, 2022. 2
[4] Aljaž Božič, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers. arXiv preprint arXiv:2107.02191, 2021. 2
[5] E. Bylow, C. Olsson, and F. Kahl. Robust online 3d reconstruction combining a depth sensor and sparse feature points. Unknown Journal, Unknown Volume(Unknown Number): Unknown Pages, Unknown Year. 2
[6] Yan-Pei Cao, Leif Kobbelt, and Shi-Min Hu. Real-time high-accuracy three-dimensional reconstruction with consumer rgb-d cameras. ACM Transactions on Graphics (TOG), 37(5):1–16, 2018. 2
[7] Jiawen Chen, Dennis Bautembach, and Shahram Izadi. Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics (ToG), 32(4):1–16, 2013. 2
[8] Hae Min Cho, HyungGi Jo, and Euntai Kim. Sp-slam: Surfel-point simultaneous localization and mapping. IEEE/ASME Transactions on Mechatronics, 27(5):2568–2579, 2021. 2
[9] Jaesung Choe, Sunghoon Im, Francois Rameau, Minjun Kang, and In So Kweon. Volumefusion: Deep depth fusion for 3d scene reconstruction. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 16086–16095, 2021. 2
[10] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5556–5565, 2015. 2
[11] Chi-Ming Chung, Yang-Che Tseng, Ya-Ching Hsu, Xiang-Qian Shi, Yun-Hung Hua, Jia-Fong Yeh, Wen-Chin Chen, Yi-Ting Chen, and Winston H Hsu. Orbeez-slam: A real-time monocular visual slam with orb features and nerf-realized mapping. arXiv preprint arXiv:2209.13274, 2022. 2
[12] Brian Curless and Marc Levoy. Volumetric method for building complex models from range images. In SIGGRAPH Conference on Computer Graphics. ACM, 1996. 2, 6
[13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, 2017. 6, 7, 8, 10, 11
[14] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG), 36(4):1, 2017. 1, 2, 7
[15] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007. 1
[16] Jorge Fuentes-Pacheco, José Ruiz-Ascencio, and Juan Manuel Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43:55–81, 2015. 1
[17] Yiming Gao, Yan-Pei Cao, and Ying Shan. Surfelnerf: Neural surfel radiance fields for online photorealistic reconstruction of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 108–118, 2023. 3
[18] Berthold KP Horn. Closed-form solution of absolute orientation using unit quaternions. JOSA A, 4(4):629–642, 1987. 9
[19] Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, and Shi-Min Hu. Di-fusion: Online implicit 3d reconstruction with deep priors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8932–8941, 2021. 1, 2
[20] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019. 6
[21] Olaf Kähler, Victor Prisacariu, Julien Valentin, and David Murray. Hierarchical voxel block hashing for efficient integration of depth images. IEEE Robotics and Automation Letters, 1(1):192–197, 2015. 2
[22] Olaf Kähler, Victor Adrian Prisacariu, Carl Yuheng Ren, Xin Sun, Philip H. S. Torr, and David William Murray. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph., 21(11):1241–1250, 2015. 2
[23] Iman Abaspur Kazerouni, Luke Fitzgerald, Gerard Dooly, and Daniel Toal. A survey of state-of-the-art on visual slam. Expert Systems with Applications, 205:117734, 2022. 1
[24] Maik Keller, Damien Lefloch, Martin Lambers, Shahram Izadi, Tim Weyrich, and Andreas Kolb. Real-time 3d reconstruction in dynamic scenes using point-based fusion. In International Conference on 3D Vision (3DV), pages 1–8. IEEE, 2013. 2
[25] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023. 1, 3, 4, 5
[26] Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234. IEEE, 2007. 1
[27] Heng Li, Xiaodong Gu, Weihao Yuan, Luwei Yang, Zilong Dong, and Ping Tan. Dense rgb slam with neural implicit maps. arXiv preprint arXiv:2301.08930, 2023. 2
[28] Kejie Li, Yansong Tang, Victor Adrian Prisacariu, and Philip HS Torr. Bnv-fusion: Dense 3d reconstruction using bi-level neural volume fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6166–6175, 2022. 2
[29] Chen Hsuan Lin, Wei Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-Adjusting Neural Radiance Fields. In International Conference on Computer Vision (ICCV). IEEE/CVF, 2021. 2
[30] Dongrui Liu, Chuanchaun Chen, Changqing Xu, Robert C Qiu, and Lei Chu. Self-supervised point cloud registration with deep versatile descriptors for intelligent driving. IEEE Transactions on Intelligent Transportation Systems, 2023. 2
[31] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems, pages 15651–15663, 2020. 2
[32] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987. 7
[33] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis, 2023. 3
[34] Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. arXiv e-prints, pages arXiv–2211, 2022. 1, 2, 3, 6, 7, 8, 11
[35] Nico Marniok, Ole Johannsen, and Bastian Goldluecke. An efficient octree design for local variational range image fusion. In German Conference on Pattern Recognition (GCPR), pages 401–412. Springer, 2017. 2
[36] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019. 2
[37] Marko Mihajlovic, Silvan Weder, Marc Pollefeys, and Martin R Oswald. Deepsurfels: Learning online appearance fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14524–14535, 2021. 3
[38] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In European Conference on Computer Vision (ECCV). CVF, 2020. 1, 2
[39] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022. 2
[40] Raul Mur-Artal and Juan D. Tardos. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017. 1, 10
[41] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII, pages 414–431. Springer, 2020. 2
[42] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011. 1, 2
[43] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011. 1, 2
[44] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32, 2013. 1, 2
[45] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction. In International Conference on Computer Vision (ICCV). IEEE/CVF, 2021. 2
[46] Helen Oleynikova, Zachary Taylor, Marius Fehr, Roland Siegwart, and Juan I. Nieto. Voxblox: Incremental 3d euclidean signed distance fields for on-board MAV planning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017), Vancouver, BC, Canada, September 24-28, 2017, pages 1366–1373. IEEE, 2017. 2
[47] Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar, David Novotny, Michael Zollhoefer, and Mustafa Mukadam. isdf: Real-time neural signed distance fields for robot perception. arXiv preprint arXiv:2204.02296, 2022. 2
[48] Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Colored point cloud registration revisited. In Proceedings of the IEEE International Conference on Computer Vision, pages 143–152, 2017. 5, 7
[49] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional Occupancy Networks. In European Conference on Computer Vision (ECCV). CVF, 2020. 3
[50] Antoni Rosinol, John J. Leonard, and Luca Carlone. NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields. arXiv, 2022. 2
[51] Erik Sandström, Yue Li, Luc Van Gool, and Martin R Oswald. Point-slam: Dense neural point cloud-based slam. In International Conference on Computer Vision (ICCV). IEEE/CVF, 2023. 1, 2, 4, 6, 7, 8, 9, 10, 11
[52] Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions. In European Conference on Computer Vision, pages 1–19. Springer, 2022. 2
[53] Thomas Schops, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In CVF/IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1, 2, 10
[54] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers. Real-time visual odometry from dense rgb-d images. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 719–722. IEEE, 2011. 7
[55] Frank Steinbrucker, Christian Kerl, and Daniel Cremers. Large-scale multi-resolution surface reconstruction from rgb-d sequences. In IEEE International Conference on Computer Vision, pages 3264–3271, 2013. 2
[56] Noah Stier, Alexander Rich, Pradeep Sen, and Tobias Höllerer. Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion. In 2021 International Conference on 3D Vision (3DV), pages 320–330. IEEE, 2021. 2
[57] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 6, 7, 9, 10, 11
[58] Jan Stühmer, Stefan Gumhold, and Daniel Cremers. Real-time dense geometry from a handheld camera. In Joint Pattern Recognition Symposium, pages 11–20. Springer, 2010. 1, 6, 8, 9, 10, 11
[59] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robots and Systems (IROS). IEEE/RSJ, 2012. 1, 6, 7, 9, 10
[60] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J. Davison. iMAP: Implicit Mapping and Positioning in Real-Time. In International Conference on Computer Vision (ICCV). IEEE/CVF, 2021. 1, 2, 7
[61] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15598–15607, 2021. 2
[62] Yijie Tang, Jiazhao Zhang, Zhinan Yu, He Wang, and Kai Xu. Mips-fusion: Multi-implicit-submaps for scalable and robust online neural rgb-d reconstruction. arXiv preprint arXiv:2308.08741, 2023. 1
[63] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021. 6, 7, 10
[64] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13293–13302, 2023. 1, 2, 7
[65] Jiepeng Wang, Peng Wang, Xiaoxiao Long, Christian Theobalt, Taku Komura, Lingjie Liu, and Wenping Wang. Neuris: Neural reconstruction of indoor scenes using normal priors. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 139–155. Springer, 2022. 2
[66] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 5, 7
[67] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021. 2
[68] Silvan Weder, Johannes Schonberger, Marc Pollefeys, and Martin R Oswald. Routedfusion: Learning real-time depth map fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4887–4897, 2020. 2
[69] Silvan Weder, Johannes L Schonberger, Marc Pollefeys, and Martin R Oswald. Neuralfusion: Online depth fusion in latent space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3162–3172, 2021. 2
[70] Thomas Whelan, John McDonald, Michael Kaess, Maurice Fallon, Hordur Johannsson, and John J. Leonard. Kintinuous: Spatially extended kinectfusion. In Proceedings of the RSS '12 Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012. 10
[71] Thomas Whelan, Stefan Leutenegger, Renato Salas-Moreno, Ben Glocker, and Andrew Davison. Elasticfusion: Dense slam without a pose graph. In Robotics: Science and Systems (RSS), 2015. 1, 2, 10
[72] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering, 2023. 3
[73] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022. 2
[74] Zike Yan, Yuxin Tian, Xuesong Shi, Ping Guo, Peng Wang, and Hongbin Zha. Continual neural mapping: Learning an implicit scene representation from sequential observations. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 15782–15792, 2021. 2
[75] Xingrui Yang, Hai Li, Hongjia Zhai, Yuhang Ming, Yuqian Liu, and Guofeng Zhang. Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 499–507. IEEE, 2022. 2, 6, 7, 8, 10
[76] Xingrui Yang, Yuhang Ming, Zhaopeng Cui, and Andrew Calway. Fd-slam: 3-d reconstruction using features and dense matching. In 2022 International Conference on Robotics and Automation (ICRA), pages 8040–8046. IEEE, 2022. 2
[77] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction, 2023. 3
[78] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting, 2023. 3
[79] Heng Zhang, Guodong Chen, Zheng Wang, Zhenhua Wang, and Lining Sun. Dense 3d mapping for indoor environment based on feature-point slam method. In 2020 4th International Conference on Innovation in Artificial Intelligence, pages 42–46, 2020. 2
[80] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 7
[81] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. Go-slam: Global optimization for consistent 3d instant reconstruction. arXiv preprint arXiv:2309.02436, 2023. 1, 2, 7, 8, 11
[82] Qian-Yi Zhou and Vladlen Koltun. Dense scene reconstruction with points of interest. ACM Transactions on Graphics (TOG), 32(4):120, 2013. 2
[83] Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. Elastic
fragments for dense scene reconstruction. Proceedings of the
IEEE International Conference on Computer Vision, pages
2726–2733, 2013.
[84] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hu-
jun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Polle-
feys. Nice-slam: Neural implicit scalable encoding for slam.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 12786–12796, 2022. 1, 2, 6, 7, 8, 9, 10,
11
[85] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui,
Martin R Oswald, Andreas Geiger, and Marc Pollefeys. Nicer-
slam: Neural implicit scene encoding for rgb slam. arXiv
preprint arXiv:2302.03594, 2023. 2, 6
[86] Michael Zollhöfer, Patrick Stotko, Andreas Görlitz, Christian
Theobalt, Matthias Nießner, Reinhard Klein, and Andreas
Kolb. State of the art on 3d reconstruction with rgb-d cameras.
In Computer graphics forum, pages 625–652. Wiley Online
Library, 2018. 2
[87] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and
Markus Gross. Surface splatting. In Proceedings of the 28th
annual conference on Computer graphics and interactive
techniques, pages 371–378, 2001. 3
