Gaussian Splatting SLAM


Hidenobu Matsuki1∗ Riku Murai2∗ Paul H. J. Kelly2 Andrew J. Davison1


1 Dyson Robotics Laboratory, Imperial College London
2 Software Performance Optimisation Group, Imperial College London

Website: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

Video: https://youtu.be/x604ghp9R_Q/

Figure 1. From a single monocular camera, we reconstruct a high fidelity 3D scene live at 3fps. For every incoming RGB frame, 3D
Gaussians are incrementally formed and optimised together with the camera poses. We show both the rasterised Gaussians (left) and
Gaussians shaded to highlight the geometry (right). Notice the details and the complex material properties (e.g. transparency) captured.
Thin structures such as wires are accurately represented by numerous small, elongated Gaussians, and transparent objects are effectively
represented by placing the Gaussians along the rim. Our system significantly advances the fidelity a live monocular SLAM system can
capture.

*Authors contributed equally to this work.

Abstract

We present the first application of 3D Gaussian Splatting to incremental 3D reconstruction using a single moving monocular or RGB-D camera. Our Simultaneous Localisation and Mapping (SLAM) method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, mapping, and high-quality rendering.

Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First, to move beyond the original 3DGS algorithm, which requires accurate poses from an offline Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians, and show that this enables fast and robust tracking with a wide basin of convergence. Second, by utilising the explicit nature of the Gaussians, we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally, we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation, but also reconstruction of tiny and even transparent objects.

1. Introduction

A long-term goal of online reconstruction with a single moving camera is near-photorealistic fidelity, which will surely allow new levels of performance in many areas of Spatial AI and robotics as well as opening up a whole range of new applications. While we increasingly see the benefit of applying powerful pre-trained priors to 3D reconstruction, a key avenue for progress is still the invention and development of core 3D representations with advantageous properties. While many "layered" SLAM systems exist which combine multiple representations, the most interesting advances come when a new unified dense representation can be used for all aspects of a system's operation: local representation of detail, large-scale geometric mapping and also camera tracking by direct alignment.

In this paper we present the first online visual SLAM system based solely on the 3D Gaussian Splatting (3DGS) representation [10], which has recently made a big impact in offline scene reconstruction. In 3DGS a scene is represented by a large number of Gaussian blobs with orientation, elongation, colour and opacity. Other previous world/map-centric scene representations used for visual SLAM include occupancy or Signed Distance Function (SDF) voxel grids [23]; meshes [28]; point or surfel clouds [9, 29]; and recently neural fields [33]. Each of these has disadvantages: grids use significant memory and have bounded resolution, and even if octrees or hashing allow more efficiency they cannot be flexibly warped for large corrections [25, 37]; meshes require difficult, irregular topology to fuse new information; surfel clouds are discontinuous and difficult to fuse and optimise; and neural fields require expensive per-pixel raycasting to render. We show that 3DGS has none of these weaknesses. As a SLAM representation, it is most similar to point and surfel clouds, and inherits their efficiency, locality and ability to be easily warped or modified. However, it also represents geometry in a smooth, continuously differentiable way: a dense cloud of Gaussians merges together and jointly defines a continuous volumetric function. And crucially, the design of modern graphics cards means that a large number of Gaussians can be efficiently rendered via "splatting" rasterisation, at up to 200fps at 1080p. This rapid, differentiable rendering is integral to the tracking and map optimisation loops in our system.

The 3DGS representation has up until now only been used in offline systems for 3D reconstruction with known camera poses, and we present several innovations to enable online SLAM. We first derive the analytic Jacobian of camera pose with respect to a 3D Gaussian map, and show that this can be seamlessly integrated into the existing differentiable rasterisation pipeline to enable camera poses to be optimised alongside scene geometry. Second, we introduce a novel Gaussian shape regularisation to ensure geometric consistency, which we have found is important for incremental reconstruction. Third, we propose a novel Gaussian resource allocation and pruning method to keep the geometry clean and enable accurate camera tracking. Our experimental results demonstrate photorealistic online local scene reconstruction, as well as state-of-the-art camera trajectory estimation and mapping for larger scenes compared to other rendering-based SLAM methods. We further show unique properties of the Gaussian-based SLAM method, such as an extremely large camera pose convergence basin, which can also be useful for map-based camera localisation. Our method works with only monocular input, one of the most challenging scenarios in SLAM, but we show that it can also incorporate depth measurements when available.

In summary, our contributions are as follows:
• The first near real-time SLAM system which works with 3DGS as the only underlying scene representation.
• Novel techniques within the SLAM framework, including the analytic Jacobian for camera pose estimation, Gaussian shape regularisation and geometric verification.
• Extensive evaluations on a variety of datasets for both monocular and RGB-D settings, demonstrating competitive performance, particularly in real-world scenarios.

2. Related Work

Dense SLAM: Dense visual SLAM focuses on reconstructing detailed 3D maps, unlike sparse SLAM methods which excel in pose estimation [4, 5, 21] but typically yield maps useful mainly for localisation. In contrast, dense SLAM creates interactive maps beneficial for broader applications, including AR and robotics. Dense SLAM methods are generally divided into two primary categories: Frame-centric and Map-centric. Frame-centric SLAM minimises photometric error across consecutive frames, jointly estimating per-frame depth and frame-to-frame camera motion. Frame-centric approaches [1, 36] are efficient, as individual frames host local rather than global geometry (e.g. depth maps), and are attractive for long-session SLAM, but if a dense global map is needed, it must be constructed on demand by assembling all of these parts, which are not necessarily fully consistent. In contrast, Map-centric SLAM uses a unified 3D representation across the SLAM pipeline, enabling a compact and streamlined system. Compared to purely local frame-to-frame tracking, a map-centric approach leverages global information by tracking against the reconstructed, 3D-consistent map. Classical map-centric approaches often use voxel grids [2, 23, 26, 40] or points [9, 29, 41] as the underlying 3D representation.

While voxels enable a fast look-up of features in 3D, the representation is expensive, and the fixed voxel resolution and distribution are problematic when the spatial characteristics of the environment are not known in advance. On the other hand, a point-based map representation, such as surfel clouds, enables adaptive changes in resolution and spatial distribution by dynamic allocation of point primitives in 3D space. Such flexibility benefits online applications such as SLAM with deformation-based loop closure [29, 41]. However, optimising the representation to capture high fidelity is challenging due to the lack of correlation among the primitives.

Recently, in addition to classical graphics primitives, neural network-based map representations have emerged as a promising alternative. iMAP [33] demonstrated the interesting properties of neural representations, such as sensible hole filling of unobserved geometry.
Many recent approaches combine the classical and neural representations to capture finer details [8, 27, 46, 47]; however, the large amount of computation required for neural rendering makes the live operation of such systems challenging.

Differentiable Rendering: The classical method for creating a 3D representation was to unproject 2D observations into 3D space and to fuse them via weighted averaging [16, 23]. Such an averaging scheme suffers from an over-smooth representation and lacks the expressiveness to capture high-quality details.

To capture a scene with photo-realistic quality, differentiable volumetric rendering [24] has recently been popularised with Neural Radiance Fields (NeRF) [17]. Using a single Multi-Layer Perceptron (MLP) as a scene representation, NeRF performs volume rendering by marching along pixel rays, querying the MLP for opacity and colour. Since volume rendering is naturally differentiable, the MLP representation is optimised to minimise the rendering loss using multiview information to achieve high-quality novel view synthesis. The main weakness of NeRF is its training speed. Recent developments have introduced explicit volume structures such as multi-resolution voxel grids [6, 14, 34] or hash functions [19] to improve performance. Interestingly, we can infer from these projects that the main contributor to high-quality novel view synthesis is not the neural network but rather differentiable volumetric rendering, and that it is possible to avoid the use of an MLP and yet achieve rendering quality comparable to NeRF [6]. However, even in these systems, per-pixel ray marching remains a significant bottleneck for rendering speed. This issue is particularly critical in SLAM, where immediate interaction with the map is essential for tracking. In contrast to NeRF, 3DGS performs differentiable rasterisation. Similar to regular graphics rasterisation, by iterating over the primitives to be rasterised rather than marching along rays, 3DGS leverages the natural sparsity of a 3D scene and achieves a representation which is expressive enough to capture high-fidelity 3D scenes while offering significantly faster rendering. Several works have applied 3D Gaussians and differentiable rendering to static scene capture [11, 38], and in particular more recent works utilise 3DGS and demonstrate superior results in vision tasks such as dynamic scene capture [15, 42, 44] and 3D generation [35, 45].

Our method adopts a Map-centric approach, utilising 3D Gaussians as the only SLAM representation. Similar to surfel-based SLAM, we dynamically allocate the 3D Gaussians, enabling us to model an arbitrary spatial distribution in the scene. Unlike other methods such as ElasticFusion [41] and PointFusion [9], however, by using differentiable rasterisation, our SLAM system can capture high-fidelity scene details and represent challenging object properties by direct optimisation against information from every pixel.

3. Method

3.1. Gaussian Splatting

Our SLAM representation is 3DGS, mapping the scene with a set of anisotropic Gaussians G. Each Gaussian G_i contains optical properties: colour c_i and opacity \alpha_i. For a continuous 3D representation, the mean \mu^i_W and covariance \Sigma^i_W, defined in the world coordinate frame, represent the Gaussian's position and its ellipsoidal shape. For simplicity and speed, in our work we omit the spherical harmonics representing view-dependent radiance. Since 3DGS uses volume rendering, explicit extraction of the surface is not required. Instead, by splatting and blending N Gaussians, a pixel colour C_p is synthesised:

C_p = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) .  (1)

3DGS performs rasterisation, iterating over the Gaussians rather than marching along the camera rays; hence, free space is ignored during rendering. During rasterisation, the contributions of \alpha are decayed via a Gaussian function, based on the 2D Gaussian formed by splatting a 3D Gaussian. The 3D Gaussians N(\mu_W, \Sigma_W) in world coordinates are related to the 2D Gaussians N(\mu_I, \Sigma_I) on the image plane through a projective transformation:

\mu_I = \pi(T_{CW} \cdot \mu_W) , \quad \Sigma_I = J W \Sigma_W W^T J^T ,  (2)

where \pi is the projection operation and T_{CW} \in SE(3) is the camera pose of the viewpoint. J is the Jacobian of the linear approximation of the projective transformation and W is the rotational component of T_{CW}. This formulation makes the 3D Gaussians differentiable, and the blending operation provides gradient flow to the Gaussians. Using first-order gradient descent [12], the Gaussians gradually refine both their optical and geometric parameters to represent the captured scene with high fidelity.
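To make the blending in Eq. (1) concrete, below is a minimal PyTorch sketch of the front-to-back compositing for a single pixel. It assumes the splats overlapping the pixel are already depth-sorted and that their opacities have already been decayed by the 2D Gaussian falloff of Eq. (2); the actual system performs this inside a tile-based CUDA rasteriser, not per pixel in Python.

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of N depth-sorted splats for one pixel (Eq. 1).

    colors: (N, 3) colours c_i of the Gaussians overlapping the pixel, sorted near-to-far.
    alphas: (N,)   opacities alpha_i already decayed by the 2D Gaussian falloff.
    Returns the composited RGB colour C_p.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumulative product.
    one_minus = 1.0 - alphas
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), one_minus[:-1]]), dim=0
    )
    weights = alphas * transmittance            # w_i = alpha_i * T_i
    return (weights[:, None] * colors).sum(dim=0)

# Toy example: three splats covering the pixel.
c = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
a = torch.tensor([0.6, 0.5, 0.9])
print(composite_pixel(c, a))
```

The same exclusive product of (1 - \alpha_j) gives the per-splat weights used for depth rasterisation in Eq. (9) below, with z_i in place of c_i.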
[Figure 2 diagram: the input video (RGB or RGB-D) feeds Tracking (camera pose estimation, Sec. 3.3.1); tracked frames pass a co-visibility check ("Is KF?") into Keyframing (keyframe management, Sec. 3.3.2); keyframes drive Mapping (window optimisation with Gaussian insertion and pruning, Sec. 3.3.3); all modules operate on the shared 3D Gaussian map.]
Figure 2. SLAM System Overview: Our SLAM system uses 3D Gaussians as the only representation, unifying all components of SLAM,
including tracking, mapping, keyframe management, and novel view synthesis.

3.2. Camera Pose Optimisation

To achieve accurate tracking, we typically require at least 50 iterations of gradient descent per frame. This requirement emphasises the necessity of a representation with computationally efficient view synthesis and gradient computation, making the choice of 3D representation a crucial part of designing a SLAM system.

In order to avoid the overhead of automatic differentiation, 3DGS implements rasterisation in CUDA with derivatives for all parameters calculated explicitly. Since rasterisation is performance critical, we similarly derive the camera Jacobians explicitly. To the best of our knowledge, we provide the first analytical Jacobian of the SE(3) camera pose with respect to the 3D Gaussians used in EWA splatting [48] and 3DGS. This opens up new applications of 3DGS beyond SLAM.

We use Lie algebra to derive the minimal Jacobians, ensuring that the dimensionality of the Jacobians matches the degrees of freedom, eliminating any redundant computations. The terms of Eq. (2) are differentiable with respect to the camera pose T_{CW}; using the chain rule:

\frac{\partial \mu_I}{\partial T_{CW}} = \frac{\partial \mu_I}{\partial \mu_C} \frac{D \mu_C}{D T_{CW}} ,  (3)

\frac{\partial \Sigma_I}{\partial T_{CW}} = \frac{\partial \Sigma_I}{\partial J} \frac{\partial J}{\partial \mu_C} \frac{D \mu_C}{D T_{CW}} + \frac{\partial \Sigma_I}{\partial W} \frac{D W}{D T_{CW}} .  (4)

We take the derivatives on the manifold to derive a minimal parameterisation. Borrowing the notation from [30], let T \in SE(3) and \tau \in \mathfrak{se}(3). We define the partial derivative on the manifold as:

\frac{D f(T)}{D T} \triangleq \lim_{\tau \to 0} \frac{\mathrm{Log}\big( f(\mathrm{Exp}(\tau) \circ T) \circ f(T)^{-1} \big)}{\tau} ,  (5)

where \circ is the group composition, and Exp, Log are the exponential and logarithmic mappings between the Lie algebra and the Lie group. With this, we derive the following:

\frac{D \mu_C}{D T_{CW}} = \begin{bmatrix} I & -\mu_C^{\times} \end{bmatrix} , \quad \frac{D W}{D T_{CW}} = \begin{bmatrix} 0 & -W_{:,1}^{\times} \\ 0 & -W_{:,2}^{\times} \\ 0 & -W_{:,3}^{\times} \end{bmatrix} ,  (6)

where \times denotes the skew-symmetric matrix of a 3D vector, and W_{:,i} refers to the ith column of the matrix.

3.3. SLAM

In this section, we present the details of the full SLAM framework. The overview of the system is summarised in Fig. 2. Please refer to the supplementary material for further parameter details.

3.3.1 Tracking

In tracking, only the current camera pose is optimised, without updates to the map representation. In the monocular case, we minimise the following photometric residual:

E_{pho} = \big\| I(\mathcal{G}, T_{CW}) - \bar{I} \big\|_1 ,  (7)

where I(G, T_{CW}) renders the Gaussians G from T_{CW}, and \bar{I} is an observed image.

We further optimise affine brightness parameters for varying exposure. When depth observations are available, we define the geometric residual as:

E_{geo} = \big\| D(\mathcal{G}, T_{CW}) - \bar{D} \big\|_1 ,  (8)

where D(G, T_{CW}) is the depth rasterisation and \bar{D} is the observed depth. Rather than simply using the depth measurements to initialise the Gaussians, we minimise both photometric and geometric residuals: \lambda_{pho} E_{pho} + (1 - \lambda_{pho}) E_{geo}, where \lambda_{pho} is a hyperparameter.

As in Eq. (1), per-pixel depth is rasterised by alpha-blending:

D_p = \sum_{i \in N} z_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) ,  (9)

where z_i is the distance along the camera ray to the mean \mu_W of Gaussian i. We derive analytical Jacobians for the camera pose optimisation in a similar manner to Eq. (3), (4).
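The sketch below illustrates the tracking step under simplifying assumptions: `render_rgb` and `render_depth` are hypothetical stand-ins for the differentiable rasteriser outputs I(G, T_CW) and D(G, T_CW), and autograd takes the place of the analytic CUDA Jacobians of Eq. (3)-(6). The 1e-4 early-termination threshold follows the value quoted in the supplementary material.

```python
import torch

def hat(w: torch.Tensor) -> torch.Tensor:
    """3-vector -> skew-symmetric matrix (the 'x' operator in Eq. 6)."""
    wx, wy, wz = w[0], w[1], w[2]
    zero = torch.zeros((), dtype=w.dtype)
    return torch.stack([
        torch.stack([zero, -wz, wy]),
        torch.stack([wz, zero, -wx]),
        torch.stack([-wy, wx, zero]),
    ])

def apply_update(T_CW: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Left-multiply the 4x4 pose T_CW by Exp(tau), tau = (rho, theta) in R^6."""
    rho, theta = tau[:3], tau[3:]
    R = torch.linalg.matrix_exp(hat(theta))
    top = torch.cat([R, rho.unsqueeze(1)], dim=1)                  # 3x4
    bottom = torch.tensor([[0.0, 0.0, 0.0, 1.0]], dtype=tau.dtype)
    return torch.cat([top, bottom], dim=0) @ T_CW

def track_frame(render_rgb, render_depth, T_CW, image, depth=None,
                lambda_pho=0.9, iters=100, lr=1e-3, tol=1e-4):
    """Sketch of per-frame tracking (Eqs. 7-9, 12): only the camera pose is optimised."""
    tau = torch.zeros(6, requires_grad=True)     # se(3) increment around the current pose
    opt = torch.optim.Adam([tau], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        T = apply_update(T_CW, tau)
        loss = (render_rgb(T) - image).abs().mean()                # E_pho (L1)
        if depth is not None:
            e_geo = (render_depth(T) - depth).abs().mean()         # E_geo (L1)
            loss = lambda_pho * loss + (1.0 - lambda_pho) * e_geo
        loss.backward()
        before = tau.detach().clone()
        opt.step()
        if (tau.detach() - before).norm() < tol:                   # early termination
            break
    return apply_update(T_CW, tau.detach())
```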

3.3.2 Keyframing

Since using all the images from a video stream to jointly optimise the Gaussians and camera poses online is infeasible, we maintain a small window W_k consisting of carefully selected keyframes based on inter-frame covisibility. Ideal keyframe management will select non-redundant keyframes observing the same area, spanning a wide baseline to provide better multiview constraints.

Selection and Management Every tracked frame is checked for keyframe registration based on our simple yet effective criteria. We measure the covisibility via the intersection over union of the observed Gaussians between the current frame i and the last keyframe j. If the covisibility drops below a threshold, or if the relative translation t_{ij} is large with respect to the median depth, frame i is registered as a keyframe. For efficiency, we maintain only a small number of keyframes in the current window W_k, following the keyframe management heuristics of DSO [4]. The main difference is that a keyframe is removed from the current window if the overlap coefficient with the latest keyframe drops below a threshold. The parameters are detailed in the supplementary material (Sec. 7.1.2).

Gaussian Covisibility An accurate estimate of covisibility simplifies keyframe selection and management. 3DGS respects visibility ordering since the 3D Gaussians are sorted along the camera ray. This property is desirable for covisibility estimation as occlusions are handled by design. A Gaussian is marked as visible from a view if it is used in the rasterisation and if the ray's accumulated \alpha has not yet reached 0.5. This enables our estimated covisibility to handle occlusions without requiring additional heuristics.

Gaussian Insertion and Pruning At every keyframe, new Gaussians are inserted into the scene to capture newly visible scene elements and to refine fine details. When depth measurements are available, Gaussian means \mu_W are initialised by back-projecting the depth. In the monocular case, we render the depth at the current frame. For pixels with depth estimates, \mu_W are initialised around those depths with low variance; for pixels without depth estimates, we initialise \mu_W around the median depth of the rendered image with high variance (parameters are in the supplementary material, Sec. 7.1.2).

In the monocular case, the positions of many newly inserted Gaussians are incorrect. While the majority will quickly vanish during optimisation as they violate multiview consistency, we further prune the excess Gaussians by checking visibility amongst the current window W_k. If the Gaussians inserted within the last 3 keyframes are unobserved by at least 3 other frames, we prune them out as they are geometrically unstable.

Figure 3. Effect of isotropic regularisation E_iso: Top: Rendering close to a training view (looking at the keyboard). Bottom: Rendering 3D Gaussians far from the training views (view from a side of the keyboard) without (left) and with (right) the isotropic loss. When the photometric constraints are insufficient, the Gaussians tend to elongate along the viewing direction, creating artefacts in the novel views and affecting the camera tracking.

3.3.3 Mapping

The purpose of mapping is to maintain a coherent 3D structure and to optimise the newly inserted Gaussians. During mapping, the keyframes in W_k are used to reconstruct currently visible regions. Additionally, two random past keyframes W_r are selected per iteration to avoid forgetting the global map. Rasterisation of 3DGS imposes no constraint on the Gaussians along the viewing ray direction, even with a depth observation. This is not a problem when sufficient, carefully selected viewpoints are provided (e.g. in the novel view synthesis case); however, in continuous SLAM this causes many artefacts, making tracking challenging. We therefore introduce an isotropic regularisation:

E_{iso} = \sum_{i=1}^{|\mathcal{G}|} \big\| s_i - \tilde{s}_i \cdot \mathbf{1} \big\|_1  (10)

to penalise the scaling parameters s_i (i.e. the stretch of the ellipsoid) by their difference from the mean \tilde{s}_i. As shown in Fig. 3, this encourages sphericality and avoids the problem of Gaussians which are highly elongated along the viewing direction creating artefacts. Let the union of the keyframes in the current window and the randomly selected ones be W = W_k \cup W_r. For mapping, we solve the following problem:

\min_{T^k_{CW} \in SE(3) \,\forall k \in W, \; \mathcal{G}} \; \sum_{\forall k \in W} E^k_{pho} + \lambda_{iso} E_{iso} .  (11)

If depth observations are available, as in tracking, the geometric residual of Eq. (8) is added to the optimisation problem.
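Below is a minimal sketch of the isotropic term in Eq. (10). It assumes the per-axis scales are stored in log space and activated with exp, as in the original 3DGS parameterisation; that storage convention is an assumption on our part rather than a detail stated in the text.

```python
import torch

def isotropic_loss(log_scales: torch.Tensor) -> torch.Tensor:
    """E_iso (Eq. 10): penalise each Gaussian's scale vector s_i for deviating from
    its own mean s~_i, discouraging needle-like Gaussians along the viewing ray.

    log_scales: (|G|, 3) per-axis scales, assumed stored in log space.
    """
    s = torch.exp(log_scales)                  # (|G|, 3) ellipsoid extents s_i
    s_mean = s.mean(dim=1, keepdim=True)       # s~_i
    return (s - s_mean).abs().sum()            # sum_i || s_i - s~_i * 1 ||_1

# Toy check: an elongated Gaussian is penalised, a near-spherical one is not.
print(isotropic_loss(torch.log(torch.tensor([[0.01, 0.01, 0.50]]))))
print(isotropic_loss(torch.log(torch.tensor([[0.10, 0.10, 0.10]]))))
```

In the mapping objective of Eq. (11) this term is simply added to the sum of per-keyframe photometric residuals with weight \lambda_{iso}.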
4. Evaluation

We conduct a comprehensive evaluation of our system across a range of both real and synthetic datasets. Additionally, we perform an ablation study to justify our design choices. Finally, we present qualitative results of our system operating live using a monocular camera, illustrating its practicality and high-fidelity reconstruction.

4.1. Experimental Setup

Datasets For our quantitative analysis, we evaluate our method on the TUM RGB-D dataset [32] (3 sequences) and the Replica dataset [31] (8 sequences), following the evaluation in [33]. For qualitative results, we use self-captured real-world sequences recorded with an Intel RealSense D455. Since the Replica dataset is designed for RGB-D SLAM evaluation, it contains challenging purely rotational camera motions. We hence use the Replica dataset for RGB-D evaluation only. The TUM RGB-D dataset is used for both monocular and RGB-D evaluation.

Implementation Details We run our SLAM on a desktop with an Intel Core i9 12900K (3.50GHz) and a single NVIDIA GeForce RTX 4090. We present results from our multi-process implementation aimed at real-time applications. For a fair comparison with other methods on Replica, we additionally report results for a single-process implementation which performs more mapping iterations. As with 3DGS, time-critical rasterisation and gradient computation are implemented in CUDA. The rest of the SLAM pipeline is developed with PyTorch. Details of hyperparameters are provided in the supplementary material.

Metrics For camera tracking accuracy, we report the Root Mean Square Error (RMSE) of the Absolute Trajectory Error (ATE) of the keyframes. To evaluate map quality, we report standard photometric rendering quality metrics (PSNR, SSIM and LPIPS), following the evaluation protocol used in [27]: rendering metrics are computed on every fifth frame, but for fairness we exclude the keyframes (training views). We report the average across three runs for all our evaluations. In the tables, the best result is in bold, and the second best is underlined.
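For reference, the sketch below shows one way to compute the keyframe ATE RMSE with a closed-form (Umeyama) alignment: a similarity transform for the scale-free monocular trajectories and a rigid transform for RGB-D, mirroring the protocol described above. The exact evaluation script used by the authors may differ.

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray, with_scale: bool) -> float:
    """ATE RMSE between estimated and ground-truth keyframe positions, both (N, 3)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance of centred points
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # handle reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = (D * S.diagonal()).sum() / E.var(0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    err = gt - (s * (R @ est.T).T + t)             # residual after alignment
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```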
Baseline Methods We primarily benchmark our SLAM method against other approaches that, like ours, do not have explicit loop closure. In monocular settings, we compare with state-of-the-art classical and learning-based direct visual odometry (VO) methods. Specifically, we compare DSO [4], DepthCov [3], and DROID-SLAM [36] in VO configurations. These methods are selected based on their public reporting of results on the benchmark (TUM dataset) or the availability of their source code for obtaining the benchmark result. Since one of our focuses is online scale estimation under monocular scale ambiguity, methods which use ground-truth poses for system initialisation, such as [13], are not considered for the comparison. In the RGB-D case, we compare against neural-implicit SLAM methods [7, 8, 27, 33, 39, 43, 46] which are also map-centric, rendering-based and do not perform loop closure.

4.2. Quantitative Evaluation

Camera Tracking Accuracy Table 1 shows the tracking results on the TUM RGB-D dataset. In the monocular setting, our method surpasses the other baselines without requiring any deep priors. Furthermore, our performance is comparable to systems which perform explicit loop closure. This clearly highlights that there still remains potential for enhancing the tracking of monocular SLAM by exploring fundamental SLAM representations.

Our RGB-D method shows better performance than any other baseline method. Notably, our system surpasses ORB-SLAM in the fr1 sequences, bridging the gap between map-centric SLAM and the state-of-the-art sparse frame-centric methods. Table 2 reports results on the synthetic Replica dataset. Our single-process implementation shows competitive performance and achieves the best result in 4 out of 8 sequences. Our multi-process implementation, which performs fewer mapping iterations, still performs comparably. In contrast to other methods, our system demonstrates higher performance on real-world data, as it flexibly handles real sensor noise by direct optimisation of the Gaussian positions against information from every pixel.

Table 1. Camera tracking results on TUM for monocular and RGB-D input. ATE RMSE in cm is reported. We divide systems into those with and without explicit loop closure. In both the monocular and RGB-D cases, we achieve state-of-the-art performance. In particular, in the monocular case, not only do we outperform systems which use deep priors, but we achieve performance comparable with many of the RGB-D systems.

Input      Loop closure  Method            fr1/desk  fr2/xyz  fr3/office  Avg.
Monocular  w/o           DSO [4]           22.4      1.10     9.50        11.0
                         DROID-VO [36]     5.20      10.7     7.30        7.73
                         DepthCov [3]      5.60      1.20     68.8        25.2
                         Ours              4.15      4.79     4.39        4.44
           w/            DROID-SLAM [36]   1.80      0.50     2.80        1.70
                         ORB-SLAM2 [20]    2.00      0.60     2.30        1.60
RGB-D      w/o           iMAP [33]         4.90      2.00     5.80        4.23
                         NICE-SLAM [46]    4.26      6.19     6.87        5.77
                         DI-Fusion [7]     4.40      2.00     5.80        4.07
                         Vox-Fusion [43]   3.52      1.49     26.01       10.34
                         ESLAM [8]         2.47      1.11     2.42        2.00
                         Co-SLAM [39]      2.40      1.70     2.40        2.17
                         Point-SLAM [27]   4.34      1.31     3.48        3.04
                         Ours              1.52      1.58     1.65        1.58
           w/            BAD-SLAM [29]     1.70      1.10     1.70        1.50
                         Kintinuous [40]   3.70      2.90     3.00        3.20
                         ORB-SLAM2 [20]    1.60      0.40     1.00        1.00

Table 2. Camera tracking results on Replica for RGB-D SLAM. ATE RMSE in cm is reported. We achieve the best performance across most sequences. Here, Ours is our multi-process implementation and Ours* is the single-process implementation which performs more mapping iterations.

Method           r0    r1    r2    o0    o1    o2    o3    o4    Avg.
iMAP [33]        3.12  2.54  2.31  1.69  1.03  3.99  4.05  1.93  2.58
NICE-SLAM [46]   0.97  1.31  1.07  0.88  1.00  1.06  1.10  1.13  1.07
Vox-Fusion [43]  1.37  4.70  1.47  8.48  2.04  2.58  1.11  2.94  3.09
ESLAM [8]        0.71  0.70  0.52  0.57  0.55  0.58  0.72  0.63  0.63
Point-SLAM [27]  0.61  0.41  0.37  0.38  0.48  0.54  0.69  0.72  0.53
Ours             0.47  0.43  0.31  0.70  0.57  0.31  0.31  3.20  0.79
Ours*            0.76  0.37  0.23  0.66  0.72  0.30  0.19  1.46  0.58

Novel View Rendering Table 5 summarises the novel view rendering performance of our method with RGB-D input. We consistently show the best performance across most sequences and are at least second best in the remainder. Our rendering FPS is hundreds of times faster than that of other methods, offering a significant advantage for applications which require real-time map interaction. While Point-SLAM is competitive, that method focuses on view synthesis rather than novel-view synthesis: its view synthesis is conditional on the availability of depth due to depth-guided ray sampling, making novel-view synthesis challenging. On the other hand, our rasterisation-based approach does not require depth guidance and achieves efficient, high-quality novel view synthesis. Fig. 4 provides a qualitative comparison of the renderings of ours and Point-SLAM (with depth guidance).

Figure 4. Rendering examples on Replica (Point-SLAM, Ours, GT). Due to the stochastic nature of ray sampling, Point-SLAM struggles with rendering fine details.

Ablative Analysis In Table 3, we perform an ablation to confirm our design choices. Isotropic regularisation and the geometric residual improve the tracking of monocular and RGB-D SLAM respectively, as they aid in constraining the geometry when photometric signals are weak. For both cases, keyframe selection significantly improves system performance, as it automatically chooses suitable keyframes based on our occlusion-aware keyframe selection and management. We further compare the memory usage of different 3D representations in Table 4. MLP-based iMAP is clearly more memory efficient, but it struggles to express high-fidelity 3D scenes due to the limited capacity of its small MLP. Compared with the voxel grid of features used in NICE-SLAM, our method uses significantly less memory.

Table 3. Ablation study on the TUM RGB-D dataset. We analyse the usefulness of isotropic regularisation, the geometric residual, and keyframe selection to our SLAM system.

Input   Method            fr1/desk  fr2/xyz  fr3/office  Avg.
Mono    w/o E_iso         4.54      4.87     5.1         4.84
        w/o kf selection  48.5      4.36     8.70        20.5
        Ours              4.15      4.79     4.39        4.44
RGB-D   w/o E_geo         1.66      1.51     2.45        1.87
        w/o kf selection  1.93      1.46     4.07        2.49
        Ours              1.52      1.58     1.65        1.58

Table 4. Memory analysis on the TUM RGB-D dataset. We compare the size of our Gaussian map to other methods. Baseline numbers are taken from [39].

Memory Usage:  iMAP [33] 0.2M,  NICE-SLAM [46] 101.6M,  Co-SLAM [39] 1.6M,  Ours (Mono) 2.6MB,  Ours (RGB-D) 3.97MB
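For intuition about what these map sizes imply, the arithmetic below estimates the implied number of Gaussians, assuming each Gaussian stores its mean (3), rotation quaternion (4), scale (3), opacity (1) and RGB colour (3) as 32-bit floats. The exact parameterisation is an assumption on our part; the text only states that spherical harmonics are not stored.

```python
# Back-of-envelope Gaussian count implied by the map sizes in Table 4,
# under the assumed 14-float (56-byte) per-Gaussian parameterisation.
bytes_per_gaussian = 14 * 4
for label, megabytes in [("Ours (Mono)", 2.6), ("Ours (RGB-D)", 3.97)]:
    n = megabytes * 1e6 / bytes_per_gaussian
    print(f"{label}: roughly {n / 1e3:.0f}k Gaussians")
```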
Table 5. Rendering performance comparison of RGB-D SLAM methods on Replica. Our method outperforms existing methods on most of the rendering metrics. Note that Point-SLAM uses sensor depth (ground-truth depth in Replica) to guide sampling along rays, which limits its rendering performance to existing views. The numbers for the baselines are taken from [27].

Method           Metric      room0  room1  room2  office0  office1  office2  office3  office4  Avg.   Rendering FPS
NICE-SLAM [46]   PSNR[dB]↑   22.12  22.47  24.52  29.07    30.34    19.66    22.23    24.94    24.42  0.54
                 SSIM↑       0.689  0.757  0.814  0.874    0.886    0.797    0.801    0.856    0.809
                 LPIPS↓      0.33   0.271  0.208  0.229    0.181    0.235    0.209    0.198    0.233
Vox-Fusion [43]  PSNR[dB]↑   22.39  22.36  23.92  27.79    29.83    20.33    23.47    25.21    24.41  2.17
                 SSIM↑       0.683  0.751  0.798  0.857    0.876    0.794    0.803    0.847    0.801
                 LPIPS↓      0.303  0.269  0.234  0.241    0.184    0.243    0.213    0.199    0.236
Point-SLAM [27]  PSNR[dB]↑   32.40  34.08  35.5   38.26    39.16    33.99    33.48    33.49    35.17  1.33
                 SSIM↑       0.974  0.977  0.982  0.983    0.986    0.96     0.960    0.979    0.975
                 LPIPS↓      0.113  0.116  0.111  0.1      0.118    0.156    0.132    0.142    0.124
Ours             PSNR[dB]↑   34.83  36.43  37.49  39.95    42.09    36.24    36.7     36.07    37.50  769
                 SSIM↑       0.954  0.959  0.965  0.971    0.977    0.964    0.963    0.957    0.960
                 LPIPS↓      0.068  0.076  0.075  0.072    0.055    0.078    0.065    0.099    0.070

Figure 6. Monocular SLAM result on the fr1/desk sequence: We show the reconstructed 3D Gaussian map (left) and a novel view synthesis result (right).

Figure 7. Self-captured scenes: Challenging scenes and objects, for example transparent glasses and the crinkled texture of a salad, are captured by our monocular SLAM running live.

Figure 5. Convergence basin analysis: Left: 3D Gaussians reconstructed using the training views (yellow, "Camera Layout") and visualisation of the test poses (red). We measure the convergence basin of the target pose (blue) by performing localisation from the test poses. Right: Visualisation of the convergence basin of our method (top: "Ours w/ depth" and "Ours w/o depth") and of other representations ("Hash Grid SDF" and "MLP SDF", bottom). A green circle marks successful convergence, and a red cross marks failure.

Convergence Basin Analysis In our SLAM experiments, we discovered that 3D Gaussian maps have a notably large convergence basin for camera localisation. To investigate further, we conducted a convergence funnel analysis, an evaluation methodology proposed in [18] and used in [22]. Here, we train a 3D representation (e.g. 3DGS) using 9 fixed views arranged in a square. We set the viewpoint in the middle of the square to be the target view. As shown in Fig. 5, we uniformly sample positions, creating a funnel. From each sampled position, given the RGB image of the target view, we perform camera pose optimisation for 1000 iterations. The optimisation is successful if it converges to within 1cm of the target view within the fixed number of iterations.

We compare our Gaussian approach with Co-SLAM [39]'s network (Hash Grid SDF) and iMAP's [33] network with Co-SLAM's SDF loss for further geometric accuracy (MLP Neural SDF). We render the training views using the synthetic Replica dataset and create three sequences for testing (seq1, seq2 and seq3). The width of the square formed by the training views is 0.5m, and the test cameras are distributed with radii ranging from 0.2m to 1.2m, covering a larger area than the training views. When training the map, three of the methods—Ours w/ depth, Hash Grid SDF, and MLP SDF—use RGB-D images, whereas Ours w/o depth utilises only colour images. Fig. 5 shows the qualitative results and Table 6 reports the success rate. Both with and without depth for training, our method shows better convergence. Unlike hashing and positional encoding, which can lead to signal conflict, anisotropic Gaussians form a smooth gradient in 3D space, increasing the convergence basin. Further experimental details are available in the supplementary material (Sec. 8.3).

Table 6. Camera convergence analysis. We report the ratio of successful camera convergence for the different sequences, across different differentiable 3D representations.

Method                  seq1  seq2  seq3  Avg.
Neural SDF (Hash Grid)  0.13  0.15  0.16  0.14
Neural SDF (MLP)        0.40  0.38  0.22  0.33
Ours w/o depth          0.82  0.91  0.65  0.79
Ours w/ depth           0.83  1.0   0.65  0.82
4.3. Qualitative Results
We report 3D reconstructions of both the SLAM datasets and self-captured sequences. In Fig. 6, we visualise the monocular SLAM reconstruction of fr1/desk. The placements of the Gaussians are geometrically sensible and 3D coherent, and our rendering from different viewpoints highlights the quality of our system's novel view synthesis. In Fig. 7, we self-capture challenging scenes for monocular SLAM. By not explicitly modelling a surface, our system naturally handles transparent objects, which are challenging for many other SLAM systems.

5. Conclusion
We have proposed the first SLAM method using 3D Gaussians as the SLAM representation. Via efficient volume rendering, our system significantly advances the fidelity and the diversity of object materials a live SLAM system can capture. Our system achieves state-of-the-art performance across benchmarks for both the monocular and RGB-D cases. Interesting directions for future research are the integration of loop closure for handling large-scale scenes and the extraction of geometry such as surface normals, as Gaussians do not explicitly represent the surface.

6. Acknowledgement
Research presented in this paper has been supported by
Dyson Technology Ltd. We are very grateful to Eric Dex-
heimer, Kirill Mazur, Xin Kong, Marwan Taher, Ignacio
Alzugaray, Gwangbin Bae, Aalok Patwardhan, and mem-
bers of the Dyson Robotics Lab for their advice and insight-
ful discussions.
Supplementary Material

7. Implementation Details

7.1. System Details and Hyperparameters

7.1.1 Tracking and Mapping (Sec. 3.3.1 and 3.3.3)

Learning Rates We use the Adam optimiser for both the camera pose and Gaussian parameter optimisation. For camera poses, we use 0.003 for rotation and 0.001 for translation. For the 3D Gaussians, we use the default learning parameters of the original Gaussian Splatting implementation [10], apart from in the monocular setting, where we increase the learning rate of the Gaussian positions \mu_W by a factor of 10.

Iteration Numbers 100 tracking iterations are performed per frame across all experiments. However, we terminate the iterations early if the magnitude of the pose update becomes less than 10^{-4}. For mapping, 150 iterations are used for the single-process implementation.

Loss Weights Given a depth observation, for tracking we minimise both the photometric residual Eq. (7) and the geometric residual Eq. (8) as:

\min_{T_{CW} \in SE(3)} \; \lambda_{pho} E_{pho} + (1 - \lambda_{pho}) E_{geo} ,  (12)

and similarly, for mapping we modify Eq. (11) to:

\min_{T^k_{CW} \in SE(3) \,\forall k \in W, \; \mathcal{G}} \; \sum_{\forall k \in W} \big( \lambda_{pho} E^k_{pho} + (1 - \lambda_{pho}) E^k_{geo} \big) + \lambda_{iso} E_{iso} .  (13)

We set \lambda_{pho} = 0.9 for all RGB-D experiments, and \lambda_{iso} = 10 for both the monocular and RGB-D experiments.
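A minimal sketch of this optimiser configuration is shown below. The `gaussian_groups` argument is a placeholder for the map parameter groups, assumed to follow the 3DGS convention of named Adam parameter groups with per-group learning rates; only the pose learning rates and the x10 monocular position boost are taken from the text.

```python
import torch

def build_optimisers(cam_rot, cam_trans, gaussian_groups, monocular: bool):
    """Adam optimisers for the camera pose and the Gaussian map parameters."""
    pose_opt = torch.optim.Adam([
        {"params": [cam_rot], "lr": 0.003},      # rotation learning rate
        {"params": [cam_trans], "lr": 0.001},    # translation learning rate
    ])
    if monocular:
        for group in gaussian_groups:
            if group.get("name") == "xyz":       # Gaussian positions mu_W
                group["lr"] *= 10.0              # monocular: boost position lr by 10x
    map_opt = torch.optim.Adam(gaussian_groups)
    return pose_opt, map_opt

# Example usage with a single (hypothetical) position group; 1.6e-4 is the
# default initial position learning rate of the original 3DGS implementation.
rot = torch.zeros(3, requires_grad=True)
trans = torch.zeros(3, requires_grad=True)
groups = [{"name": "xyz", "params": [torch.zeros(10, 3, requires_grad=True)], "lr": 1.6e-4}]
pose_opt, map_opt = build_optimisers(rot, trans, groups, monocular=True)
```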
7.1.2 Keyframing (Sec. 3.3.2)

Gaussian Covisibility Check (Sec. 3.3.2) As described in Sec. 3.3.2, keyframe selection is based on the covisibility of the Gaussians. Between two keyframes i, j, we define the covisibility using the Intersection over Union (IOU) and the Overlap Coefficient (OC):

IOU_{cov}(i, j) = \frac{ |\mathcal{G}^v_i \cap \mathcal{G}^v_j| }{ |\mathcal{G}^v_i \cup \mathcal{G}^v_j| } ,  (14)

OC_{cov}(i, j) = \frac{ |\mathcal{G}^v_i \cap \mathcal{G}^v_j| }{ \min(|\mathcal{G}^v_i|, |\mathcal{G}^v_j|) } ,  (15)

where \mathcal{G}^v_i is the set of Gaussians visible in keyframe i, based on the visibility check described in Sec. 3.3.2 (Gaussian Covisibility). A keyframe i is added to the keyframe window W_k if, given the last keyframe j, IOU_{cov}(i, j) < kf_{cov} or if the relative translation t_{ij} > kf_m \hat{D}_i, where \hat{D}_i is the median depth of frame i. For Replica we use kf_{cov} = 0.95, kf_m = 0.04, and for TUM kf_{cov} = 0.90, kf_m = 0.08. We remove a registered keyframe j from W_k if OC_{cov}(i, j) < kf_c, where keyframe i is the latest added keyframe. For both Replica and TUM, we set the cutoff to kf_c = 0.3. We set the size of the keyframe window to |W_k| = 10 for Replica and |W_k| = 8 for TUM.
in Sec. 3.3.2, keyframe selection is based on the covisibility mated trajectory and ground truth without scale adjustment.
of the Gaussians. Between two keyframes i, j, we define the
covisibility using Intersection of Union (IOU) and Overlap
Coefficient (OC): 8.1.2 Baseline Results
Table 1 Numbers for monocular DROID-SLAM [36]
|Giv ∩ Gjv |
IOUcov (i, j) = v , (14) and ORB-SLAM [20] is taken from [13]. We have lo-
|Gi ∪ Gjv | cally run DSO [4], DepthCov [3] and DROID-VO [36] –
|Giv ∩ Gjv | which is DROID-SLAM without loop closure and global
OCcov (i, j) = , (15) bundle adjustment. For the RGB-D case, numbers for
min(|Giv |, |Gjv |)
NICE-SLAM [46], DI-Fusion [7], Vox-Fusion [43], Point-
where Giv is the Gaussians visible in keyframe i, based on SLAM [27] are taken from Point-SLAM [27], and all the
visibility check described in Section 3.3.2, Gaussian Covisi- other baselines: iMAP [33], ESLAM [8], Co-SLAM [39],
bility. A keyframe i is added to the keyframe window Wk if BAD-SLAM [29], Kintinous [40], ORB-SLAM [20]
Table 2 and Table 5 We took the numbers from the Point-SLAM [27] paper.

Table 4 The numbers are from the Co-SLAM [39] paper.

8.2. Rendering FPS (Table 5)

In Table 5, we reported the photometric quality metrics (PSNR, SSIM and LPIPS) and the rendering fps of our method. We demonstrated that our rendering fps (769) is much higher than that of other existing methods (Vox-Fusion is the second best with 2.17fps). Here we describe how we measured the fps. The rendering time refers to the duration necessary for a full-resolution rendering (1200 x 680 for the Replica sequences). For each method, we perform 100 renderings and report the average time taken per rendering. The reported rendering fps is the reciprocal of the average rendering time. We summarise the numbers in Table 7. Note that the "rendering fps" refers only to the forward rendering, which differs from the end-to-end system fps reported in Tables 8 and 9.

Table 7. Further detail of the rendering FPS and rendering time comparison based on Table 5.

Method           Rendering FPS ↑   Rendering time per image [s] ↓
NICE-SLAM [46]   0.54              1.85
Vox-Fusion [43]  2.17              0.46
Point-SLAM [27]  1.33              0.75
Ours             769               0.0013
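The timing protocol above can be reproduced with a small harness along the following lines; `render_fn` is a placeholder for one full-resolution forward render, and the CUDA synchronisation is needed so asynchronous kernels are included in the measurement.

```python
import time
import torch

def measure_render_fps(render_fn, n: int = 100) -> tuple[float, float]:
    """Average the wall-clock time of n renders and report its reciprocal as FPS."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        render_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    avg_time = (time.perf_counter() - start) / n
    return avg_time, 1.0 / avg_time
```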
8.3. Convergence Basin Analysis (Table 6 and Fig. 5)

8.3.1 Details of the Benchmark Dataset

For the convergence basin analysis, we create three datasets by rendering the synthetic Replica dataset. In addition to the qualitative visualisation in Figure 5, we report more detailed camera pose distributions in Figure 8. Figure 8 shows the camera view frustums of the test (red), training (yellow) and target (blue) views. As mentioned in the main paper, we set the training views in the shape of a square with a width of 0.5m, and the test views are distributed with radii ranging from 0.2m to 1.2m, covering a larger area than the training views. We only apply displacements to the camera translation, not to the rotation. For each sequence, we use a total of 67 test views.

Figure 8. 2D visualisation of the camera pose distributions used for the convergence basin analysis in Figure 5 (test views, training views, target view, and overlaid views).

8.3.2 Training Setup

For each method, the 3D representation is trained for 30000 iterations using the training views. Here, we detail the training setup of each of the methods:

Ours We evaluated our method under two settings, "w/ depth" and "w/o depth", where we train the initial 3D Gaussian map G_init with and without depth supervision. In the "w/o depth" setting, the 3D Gaussians' positions are randomly initialised, and we minimise the monocular mapping cost Eq. (11) for the 3D Gaussian training while keeping the camera poses fixed. Specifically, letting k index the training views and G be the 3D Gaussians, we find G_init by:

\mathcal{G}_{init} = \arg\min_{\mathcal{G}} \; \sum_{\forall k \in W} E^k_{pho} + \lambda_{iso} E_{iso} .  (16)

Note that the training views' camera poses T^k_{CW} are fixed during the optimisation.

In the "w/ depth" setting, we train the Gaussian map by minimising the same cost function as our RGB-D SLAM system:

\mathcal{G}_{init} = \arg\min_{\mathcal{G}} \; \sum_{\forall k \in W} \big( \lambda_{pho} E^k_{pho} + (1 - \lambda_{pho}) E^k_{geo} \big) + \lambda_{iso} E_{iso} ,  (17)

where we use \lambda_{pho} = 0.9 and \lambda_{iso} = 10 for all the experiments.
Baseline Methods For Hash Grid SDF, we trained the same network architecture as Co-SLAM [39]. For MLP SDF, we trained the network of iMAP [33]. For both baselines, we supervised the networks with the same loss functions as Co-SLAM, which are the colour rendering loss L_rgb, depth rendering loss L_depth, SDF loss L_sdf, free-space loss L_fs, and smoothness loss L_smooth. Please refer to the original Co-SLAM paper for the exact formulation (equations (6)-(9)). All the training hyperparameters (e.g. the network learning rate, the number of sampling points, and the loss weights) are the same as Co-SLAM's default configuration for the Replica dataset. While Co-SLAM stores training view information by downsampling the colour and depth images, we store the full pixel information because the number of training views is small.

8.3.3 Testing Setup

For testing, we localise the camera pose by minimising only the photometric error against the ground-truth colour image of the target view.

Ours Given the camera pose T_{CW} \in SE(3) and the initial 3D Gaussians G_init, the localised camera pose T^{est}_{CW} is found by:

T^{est}_{CW} = \arg\min_{T_{CW}} \; \big\| I(\mathcal{G}_{init}, T_{CW}) - \bar{I}_{target} \big\|_1 .  (18)

Note that G_init is fixed during the optimisation. We initialise T_{CW} at one of the test views' positions, and the optimisation is performed for 1000 iterations. We perform this localisation process for all the test views and measure the success rate. Camera localisation is successful if the estimated pose converges to within 1cm of the target view within the 1000 iterations.
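A sketch of one funnel trial is shown below. For brevity the pose is reduced to the camera translation, matching the test protocol in which only translations are displaced; the full system optimises the complete SE(3) pose, and `render_rgb(t)` is a hypothetical stand-in for rendering G_init from a pose with translation t.

```python
import torch

def localisation_success(render_rgb, t_init, t_target, image_target,
                         iters=1000, lr=1e-3, tol_m=0.01) -> bool:
    """Run the photometric localisation of Eq. (18) from a displaced start and
    declare success if the final position lands within 1 cm of the target."""
    t = t_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = (render_rgb(t) - image_target).abs().mean()   # L1 photometric error
        loss.backward()
        opt.step()
    return torch.linalg.norm(t.detach() - t_target).item() < tol_m

# The success rate in Table 6 is the mean of this flag over the 67 test views.
```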
Baseline Methods For the baseline methods, camera localisation is performed by minimising the colour volume rendering loss L_rgb, while all the other trainable network parameters are fixed. The learning rates of the pose optimiser are also the same as Co-SLAM's default configuration for the Replica dataset.

9. Further Ablation Analysis (Table 3)

9.1. Pruning Ablation (Monocular Input)

In Table 10, we report an ablation study of our proposed Gaussian pruning, which effectively prunes randomly initialised 3D Gaussians in the monocular SLAM setting. As the result shows, Gaussian pruning plays a significant role in enhancing camera tracking performance. This improvement is primarily because, without pruning, randomly initialised Gaussians persist in 3D space, potentially leading to incorrect initial geometry for other views.

Table 10. Pruning ablation study on the TUM RGB-D dataset (monocular input). Numbers are camera tracking error (ATE RMSE) in cm.

Input  Method       fr1/desk  fr2/xyz  fr3/office  Avg.
Mono   w/o pruning  77.4      12.0     129.0       72.9
       Ours         4.15      4.79     4.39        4.44

9.2. Isotropic Loss Ablation (RGB-D Input)

Tables 11 and 12 report ablation studies of the effect of the isotropic loss E_iso for RGB-D input. On TUM, as Table 11 shows, isotropic regularisation does not improve performance and only makes a marginal difference. However, for Replica, as summarised in Table 12, the isotropic loss significantly improves camera tracking performance. Even with depth measurements, rasterisation does not constrain the elongation along the viewing axis; isotropic regularisation is therefore required to prevent the Gaussians from over-stretching, especially in textureless regions, which are common in Replica.

Table 11. Isotropic loss ablation study on the TUM RGB-D dataset (RGB-D input). Numbers are camera tracking error (ATE RMSE) in cm.

Input  Method     fr1/desk  fr2/xyz  fr3/office  Avg.
RGB-D  w/o E_iso  1.60      1.54     1.53        1.56
       Ours       1.52      1.58     1.65        1.58

Table 12. Isotropic loss ablation study on the Replica dataset (RGB-D input). Numbers are camera tracking error (ATE RMSE) in cm.

Method     r0    r1    r2    o0    o1    o2    o3    o4    Avg.
w/o E_iso  0.69  0.53  0.39  4.30  2.01  1.24  0.32  3.58  1.63
Ours       0.47  0.43  0.31  0.70  0.57  0.31  0.31  3.20  0.79
9.3. Memory Consumption and Frame Rate (Table 4)

9.3.1 Memory Analysis

For the memory consumption analysis in Table 4, we measure the final size of the created Gaussians. The memory footprint of our system is lower than that of the original Gaussian Splatting, which uses approximately 300-700MB for the standard novel view synthesis benchmark dataset [10], as we only maintain well-constrained Gaussians via pruning and do not store the spherical harmonics.

9.3.2 Timing Analysis

To analyse the processing time of our monocular/RGB-D SLAM system, we measure the total time required to process all frames in the TUM RGB-D fr3/office sequence. This approach assesses the performance of our system as a whole, rather than isolating individual components. By adopting this approach, we gain a more realistic understanding of the system's true performance, which better reflects real-world operating conditions, as it avoids the assumption of an idealised, sequential interleaving of the tracking and mapping processes. As shown in Table 8, our system operates at 3.2 FPS with monocular input and 2.5 FPS with depth. The FPS is found by dividing the number of processed frames by the total time. We conducted a similar analysis with the Replica office2 sequence (Table 9). Here, we compare the RGB-D method with and without multiprocessing. As expected, the single-process implementation takes longer as it performs more mapping iterations.

Table 8. Performance analysis using fr3/office. Both the monocular and RGB-D implementations use multiprocessing. We report the total execution time of our system, the FPS computed by dividing the total number of processed frames by the total time, and the average number of mapping iterations per added keyframe.

Method     Total Time [s]  FPS  Avg. Map. Iter.
Monocular  798.9           3.2  88.1
RGB-D      986.7           2.5  81.0

Table 9. Performance analysis using replica/office2. RGB-D uses the multi-process implementation and RGB-D* is the single-process implementation. We report the total execution time of our system, the FPS computed by dividing the total number of processed frames by the total time, and the average number of mapping iterations per added keyframe.

Method  Total Time [s]  FPS  Avg. Map. Iter.
RGB-D   1002.7          2.0  27.5
RGB-D*  1878.1          1.1  150

10. Camera Pose Jacobian

The use of a 3D Gaussian as a primitive while performing camera pose optimisation is discussed in [11]; however, that method assumes a smaller number of Gaussians and is based on ray intersection rather than splatting, and hence is not applicable to 3DGS. While many applications of 3DGS exist, for example dynamic tracking and 4D scene representation [15, 42], they assume offline operation and require accurate camera positions. In contrast, we perform camera pose optimisation by deriving the minimal analytical Jacobians, and for completeness, we provide the derivation of the Jacobians presented in Eq. (6).

\frac{D \mu_C}{D T_{CW}} = \lim_{\tau \to 0} \frac{\mathrm{Exp}(\tau) \cdot \mu_C - \mu_C}{\tau}  (19)
= \lim_{\tau \to 0} \frac{(I + \tau^{\wedge}) \cdot \mu_C - \mu_C}{\tau}  (20)
= \lim_{\tau \to 0} \frac{\tau^{\wedge} \cdot \mu_C}{\tau}  (21)
= \lim_{\tau \to 0} \frac{\theta^{\times} \mu_C + \rho}{\tau}  (22)
= \lim_{\tau \to 0} \frac{-\mu_C^{\times} \theta + \rho}{\tau}  (23)
= \begin{bmatrix} I & -\mu_C^{\times} \end{bmatrix} ,  (24)

where T \cdot x is the group action of T \in SE(3) on x \in R^3.

Similarly, we compute the Jacobian with respect to W. Since the translational component is not involved, we only consider the rotational part R_{CW} of T_{CW}:

\frac{D W}{D R_{CW}} = \lim_{\theta \to 0} \frac{\mathrm{Exp}(\theta) \circ W - W}{\theta}  (25)
= \lim_{\theta \to 0} \frac{(I + \theta^{\wedge}) \circ W - W}{\theta}  (26)
= \lim_{\theta \to 0} \frac{\theta^{\wedge}}{\theta} \circ W  (27)
= \lim_{\theta \to 0} \frac{\theta^{\times}}{\theta} \circ W .  (28)

Since the skew-symmetric matrix is:

\theta^{\times} = \begin{bmatrix} 0 & -\theta_z & \theta_y \\ \theta_z & 0 & -\theta_x \\ -\theta_y & \theta_x & 0 \end{bmatrix} ,  (29)

the partial derivative with respect to one of its components (e.g. \theta_x) is:

\frac{\partial \theta^{\times}}{\partial \theta_x} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} = e_1^{\times} ,  (30)

where e_1 = [1, 0, 0]^T, e_2 = [0, 1, 0]^T, e_3 = [0, 0, 1]^T. Hence:

\frac{\partial W}{\partial \theta_x} = e_1^{\times} W = \begin{bmatrix} 0_{1 \times 3} \\ -W_{3,:} \\ W_{2,:} \end{bmatrix} ,  (31)

\frac{\partial W}{\partial \theta_y} = e_2^{\times} W = \begin{bmatrix} W_{3,:} \\ 0_{1 \times 3} \\ -W_{1,:} \end{bmatrix} ,  (32)

\frac{\partial W}{\partial \theta_z} = e_3^{\times} W = \begin{bmatrix} -W_{2,:} \\ W_{1,:} \\ 0_{1 \times 3} \end{bmatrix} ,  (33)

where W_{i,:} refers to the ith row of the matrix. After column-wise vectorisation of Eq. (31), (32), (33) and stacking horizontally, we get:

\frac{D W}{D R_{CW}} = \begin{bmatrix} -W_{:,1}^{\times} \\ -W_{:,2}^{\times} \\ -W_{:,3}^{\times} \end{bmatrix} ,  (34)

where W_{:,i} refers to the ith column of the matrix. Since the translational part is all zeros, with this we obtain Eq. (6).
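As a sanity check of the point Jacobian in Eq. (24), the sketch below compares the analytic [I | -mu_C^x] against central finite differences of the SE(3) exponential map, using the (rho, theta) ordering implied by Eq. (6). This is an illustrative verification, not the authors' code.

```python
import numpy as np

def hat3(v):
    """so(3) hat operator: 3-vector -> skew-symmetric matrix (Eq. 29)."""
    x, y, z = v
    return np.array([[0., -z, y], [z, 0., -x], [-y, x, 0.]])

def exp_se3(tau):
    """Closed-form exponential of tau = (rho, theta) in R^6 to a 4x4 SE(3) matrix."""
    rho, theta = tau[:3], tau[3:]
    angle = np.linalg.norm(theta)
    K = hat3(theta)
    if angle < 1e-12:
        R, V = np.eye(3) + K, np.eye(3) + 0.5 * K
    else:
        a, b = np.sin(angle) / angle, (1 - np.cos(angle)) / angle**2
        c = (angle - np.sin(angle)) / angle**3
        R = np.eye(3) + a * K + b * (K @ K)
        V = np.eye(3) + b * K + c * (K @ K)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

mu_c = np.array([0.3, -1.2, 2.5])
J_analytic = np.hstack([np.eye(3), -hat3(mu_c)])   # Eq. (24)

eps, J_num = 1e-6, np.zeros((3, 6))
for k in range(6):
    d = np.zeros(6); d[k] = eps
    plus = (exp_se3(d) @ np.append(mu_c, 1.0))[:3]
    minus = (exp_se3(-d) @ np.append(mu_c, 1.0))[:3]
    J_num[:, k] = (plus - minus) / (2 * eps)

print(np.abs(J_num - J_analytic).max())   # tiny value: the analytic Jacobian matches
```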

11. Additional Qualitative Results


We urge readers to view our supplementary video for convincing qualitative results. In Fig. 9 - Fig. 16, we show further qualitative results and visually compare against other state-of-the-art SLAM methods that use differentiable rendering (Point-SLAM [27] and ESLAM [8]).

12. Limitations of this Work

Although our novel Gaussian Splatting SLAM shows competitive performance in our experiments, the method also has several limitations.
• Currently, the proposed method is tested only on room-scale scenes. For larger real-world scenes, trajectory drift is inevitable. This could be addressed by integrating a loop closure module into our existing pipeline.
• Although we achieve interactive live operation, hard real-time operation on the benchmark datasets (30 fps on the TUM sequences) is not achieved in this work. To improve speed, exploring a second-order optimiser would be an interesting direction.
Figure 9. Novel view rendering and Gaussian visualisations on TUM fr1/desk (monocular and RGB-D).

Figure 10. Rendering comparison on TUM fr1/desk (ESLAM, Point-SLAM, Ours (Mono), Ours (RGB-D), GT).

Figure 11. Novel view rendering and Gaussian visualisations on TUM fr2/xyz (monocular and RGB-D).

Figure 12. Rendering comparison on TUM fr2/xyz (ESLAM, Point-SLAM, Ours (Mono), Ours (RGB-D), GT).

Figure 13. Novel view rendering and Gaussian visualisations on TUM fr3/office (monocular and RGB-D).

Figure 14. Rendering comparison on TUM fr3/office (ESLAM, Point-SLAM, Ours (Mono), Ours (RGB-D), GT).

Figure 15. Novel view rendering and Gaussian visualisations on Replica.

Figure 16. Rendering comparison on Replica (ESLAM, Point-SLAM, Ours, GT).


References

[1] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters (RAL), 5(2):721–728, 2020.
[2] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics (TOG), 36(3):24:1–24:18, 2017.
[3] Eric Dexheimer and Andrew J. Davison. Learning a depth covariance function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[4] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017.
[5] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014.
[6] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[7] Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, and Shi-Min Hu. DI-Fusion: Online implicit 3D reconstruction with deep priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[8] M. M. Johari, C. Carta, and F. Fleuret. ESLAM: Efficient dense SLAM system based on hybrid representation of signed distance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[9] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In Proceedings of the Joint 3DIM/3DPVT Conference (3DV), 2013.
[10] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023.
[11] Leonid Keselman and Martial Hebert. Approximate differentiable rendering with algebraic surfaces. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[13] Heng Li, Xiaodong Gu, Weihao Yuan, Luwei Yang, Zilong Dong, and Ping Tan. Dense RGB SLAM with neural implicit maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
[14] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020.
[15] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. 3DV, 2024.
[16] J. McCormac, A. Handa, A. J. Davison, and S. Leutenegger. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2017.
[17] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[18] N. J. Mitra, N. Gelfand, H. Pottmann, and L. J. Guibas. Registration of point cloud data from a geometric optimization perspective. In Proceedings of the Symposium on Geometry Processing, 2004.
[19] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 2022.
[20] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics (T-RO), 33(5):1255–1262, 2017.
[21] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics (T-RO), 31(5):1147–1163, 2015.
[22] R. A. Newcombe. Dense Visual SLAM. PhD thesis, Imperial College London, 2012.
[23] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), 2011.
[24] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. In Proceedings of SIGGRAPH, 2013.
[26] Victor Adrian Prisacariu, Olaf Kähler, Ming-Ming Cheng, Carl Yuheng Ren, Julien P. C. Valentin, Philip H. S. Torr, Ian D. Reid, and David W. Murray. A framework for the volumetric integration of depth images. CoRR, abs/1410.0925, 2014.
[27] Erik Sandström, Yue Li, Luc Van Gool, and Martin R. Oswald. Point-SLAM: Dense neural point cloud-based SLAM. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
[28] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. SurfelMeshing: Online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2020.
[29] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[30] J. Solà, J. Deray, and D. Atchuthan. A micro Lie theory for state estimation in robotics. arXiv:1812.01537, 2018.
[31] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, and Richard Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
[32] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2012.
[33] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison. iMAP: Implicit mapping and positioning in real-time. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
[34] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[35] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. arXiv preprint arXiv:2309.16653, 2023.
[36] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In Neural Information Processing Systems (NIPS), 2021.
[37] Emanuele Vespa, Nikolay Nikolov, Marius Grimm, Luigi Nardi, Paul H. J. Kelly, and Stefan Leutenegger. Efficient octree-based volumetric SLAM supporting signed-distance and occupancy mapping. IEEE Robotics and Automation Letters (RAL), 2018.
[38] Angtian Wang, Peng Wang, Jian Sun, Adam Kortylewski, and Alan Yuille. VoGE: A differentiable volume renderer using Gaussian ellipsoids for analysis-by-synthesis. 2022.
[39] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[40] T. Whelan, M. Kaess, H. Johannsson, M. F. Fallon, J. J. Leonard, and J. B. McDonald. Real-time large scale dense RGB-D SLAM with volumetric fusion. International Journal of Robotics Research (IJRR), 34(4-5):598–626, 2015.
[41] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison. ElasticFusion: Dense SLAM without a pose graph. In Proceedings of Robotics: Science and Systems (RSS), 2015.
[42] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
[43] Xingrui Yang, Hai Li, Hongjia Zhai, Yuhang Ming, Yuqian Liu, and Guofeng Zhang. Vox-Fusion: Dense tracking and mapping with voxel-based neural implicit representation. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), 2022.
[44] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4D Gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
[45] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. GaussianDreamer: Fast generation from text to 3D Gaussian splatting with point cloud priors. arXiv:2310.08529, 2023.
[46] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[47] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. arXiv preprint arXiv:2302.03594, 2023.
[48] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. EWA splatting. IEEE Transactions on Visualization and Computer Graphics, 8(3):223–238, 2002.
