Gaussian Splatting SLAM
Website: https://rmurai.co.uk/projects/GaussianSplattingSLAM/
arXiv:2312.06741v1 [cs.CV] 11 Dec 2023
Video: https://youtu.be/x604ghp9R_Q/
Figure 1. From a single monocular camera, we reconstruct a high fidelity 3D scene live at 3fps. For every incoming RGB frame, 3D
Gaussians are incrementally formed and optimised together with the camera poses. We show both the rasterised Gaussians (left) and
Gaussians shaded to highlight the geometry (right). Notice the details and the complex material properties (e.g. transparency) captured.
Thin structures such as wires are accurately represented by numerous small, elongated Gaussians, and transparent objects are effectively
represented by placing the Gaussians along the rim. Our system significantly advances the fidelity a live monocular SLAM system can
capture.
Figure 2. SLAM System Overview: Our SLAM system uses 3D Gaussians as the only representation, unifying all components of SLAM,
including tracking, mapping, keyframe management, and novel view synthesis.
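The interaction of these components can be illustrated with the following sketch of the main loop; all callables and names are illustrative placeholders for the components described in Sec. 3.3, not the system's actual API.

```python
def slam_loop(camera_stream, gaussian_map, keyframe_window,
              track, is_keyframe, insert_gaussians, optimise_map, prune):
    """Illustrative main loop for the system of Fig. 2. Every frame is tracked
    against the 3D Gaussian map; selected frames become keyframes, trigger the
    insertion of new Gaussians, and drive joint map/pose optimisation."""
    for frame in camera_stream:
        pose = track(gaussian_map, frame)                  # Sec. 3.3.1: optimise the current pose only
        if is_keyframe(frame, pose, keyframe_window):      # Sec. 3.3.2: covisibility-based selection
            keyframe_window.add(frame, pose)
            insert_gaussians(gaussian_map, frame, pose)    # capture newly visible regions
            optimise_map(gaussian_map, keyframe_window)    # Sec. 3.3.3: joint Gaussian and pose refinement
            prune(gaussian_map, keyframe_window)           # drop geometrically unstable Gaussians
    return gaussian_map
```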
3D Gaussians used in EWA splatting [48] and 3DGS. This opens up new applications of 3DGS beyond SLAM.

We use Lie algebra to derive the minimal Jacobians, ensuring that the dimensionality of the Jacobians matches the degrees of freedom, eliminating any redundant computations. The terms of Eq. (2) are differentiable with respect to the camera pose T_CW; using the chain rule:

\frac{\partial \mu_I}{\partial T_{CW}} = \frac{\partial \mu_I}{\partial \mu_C} \frac{D \mu_C}{D T_{CW}} ,  (3)

\frac{\partial \Sigma_I}{\partial T_{CW}} = \frac{\partial \Sigma_I}{\partial J} \frac{\partial J}{\partial \mu_C} \frac{D \mu_C}{D T_{CW}} + \frac{\partial \Sigma_I}{\partial W} \frac{D W}{D T_{CW}} .  (4)

We take the derivatives on the manifold to derive the minimal parameterisation. Borrowing the notation from [30], let T ∈ SE(3) and τ ∈ se(3). We define the partial derivative on the manifold as:

\frac{D f(T)}{D T} \triangleq \lim_{\tau \to 0} \frac{\mathrm{Log}\left( f(\mathrm{Exp}(\tau) \circ T) \circ f(T)^{-1} \right)}{\tau} ,  (5)

where ∘ is the group composition, and Exp, Log are the exponential and logarithmic mappings between the Lie algebra and the Lie group. With this, we derive the following:

\frac{D \mu_C}{D T_{CW}} = \begin{bmatrix} I & -\mu_C^{\times} \end{bmatrix} , \qquad
\frac{D W}{D T_{CW}} = \begin{bmatrix} 0 & -W_{:,1}^{\times} \\ 0 & -W_{:,2}^{\times} \\ 0 & -W_{:,3}^{\times} \end{bmatrix} ,  (6)

where × denotes the skew-symmetric matrix of a 3D vector, and W_{:,i} refers to the i-th column of the matrix.
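For illustration, Eq. (6) can be written as the following PyTorch sketch; this is not the CUDA implementation used by the system, and it assumes the 6-DoF perturbation is ordered translation-first, matching the block layout of Eq. (6).

```python
import torch

def skew(v: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix v^x of a 3D vector, so that skew(v) @ u = cross(v, u)."""
    x, y, z = v.unbind()
    zero = torch.zeros_like(x)
    return torch.stack([
        torch.stack([zero, -z, y]),
        torch.stack([z, zero, -x]),
        torch.stack([-y, x, zero]),
    ])

def d_muC_d_TCW(mu_C: torch.Tensor) -> torch.Tensor:
    """Left block of Eq. (6): D mu_C / D T_CW = [ I | -mu_C^x ], a 3x6 Jacobian
    with respect to the (translation, rotation) perturbation."""
    return torch.cat([torch.eye(3, dtype=mu_C.dtype), -skew(mu_C)], dim=1)

def d_W_d_TCW(W: torch.Tensor) -> torch.Tensor:
    """Right block of Eq. (6): each column W[:, i] contributes a [ 0 | -W[:, i]^x ]
    block, stacked into a 9x6 Jacobian of the rotation part W."""
    blocks = [torch.cat([torch.zeros(3, 3, dtype=W.dtype), -skew(W[:, i])], dim=1)
              for i in range(3)]
    return torch.cat(blocks, dim=0)
```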
3.3. SLAM

In this section, we present the details of the full SLAM framework. The overview of the system is summarised in Fig. 2. Please refer to the supplementary material for further parameter details.

3.3.1 Tracking

In tracking, only the current camera pose is optimised, without updates to the map representation. In the monocular case, we minimise the following photometric residual:

E_{pho} = \left\| I(\mathcal{G}, T_{CW}) - \bar{I} \right\|_1 ,  (7)

where I(G, T_CW) renders the Gaussians G from T_CW, and Ī is an observed image. We further optimise affine brightness parameters for varying exposure. When depth observations are available, we define the geometric residual as:

E_{geo} = \left\| D(\mathcal{G}, T_{CW}) - \bar{D} \right\|_1 ,  (8)

where D(G, T_CW) is the depth rasterisation and D̄ is the observed depth. Rather than simply using the depth measurements to initialise the Gaussians, we minimise both photometric and geometric residuals: λ_pho E_pho + (1 − λ_pho) E_geo, where λ_pho is a hyperparameter. As in Eq. (1), per-pixel depth is rasterised by alpha-blending:

D_p = \sum_{i \in \mathcal{N}} z_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) ,  (9)

where z_i is the distance to the mean μ_W of Gaussian i along the camera ray. We derive analytical Jacobians for the camera pose optimisation in a similar manner to Eqs. (3), (4).

3.3.2 Keyframing

Since using all the images from a video stream to jointly optimise the Gaussians and camera poses online is infeasible, we maintain a small window W_k consisting of carefully selected keyframes based on inter-frame covisibility. Ideal keyframe management will select non-redundant keyframes observing the same area, spanning a wide baseline to provide better multiview constraints.
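The combined tracking objective can be sketched as below. The rendering function and the affine brightness model are assumptions (the exact exposure model is not specified here), and λ_pho = 0.9 follows Sec. 7.1.1.

```python
import torch

def tracking_loss(render_fn, gaussians, T_CW, image_obs, depth_obs=None,
                  exp_a=None, exp_b=None, lambda_pho=0.9):
    """Tracking objective: the Eq. (7) photometric residual, plus the Eq. (8)
    geometric residual when a depth observation is available.
    `render_fn` stands in for a differentiable rasteriser returning a
    (colour, depth) pair for the Gaussians seen from pose T_CW."""
    colour, depth = render_fn(gaussians, T_CW)
    if exp_a is not None:  # one common affine brightness form: exp(a) * I + b (assumption)
        colour = torch.exp(exp_a) * colour + exp_b
    e_pho = (colour - image_obs).abs().mean()            # L1 photometric residual
    if depth_obs is None:
        return e_pho
    valid = depth_obs > 0                                # skip pixels without a depth reading
    e_geo = (depth - depth_obs)[valid].abs().mean()      # L1 geometric residual
    return lambda_pho * e_pho + (1.0 - lambda_pho) * e_geo
```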
Gaussian Covisibility An accurate estimate of covisibility simplifies keyframe selection and management. 3DGS respects visibility ordering since the 3D Gaussians are sorted along the camera ray. This property is desirable for covisibility estimation as occlusions are handled by design. A Gaussian is marked to be visible from a view if it is used in the rasterisation and if the ray's accumulated α has not yet reached 0.5. This enables our estimated covisibility to handle occlusions without requiring additional heuristics.
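As an illustrative sketch, per-view visibility and covisibility could be computed as follows, assuming the rasteriser reports a hypothetical per-Gaussian counter n_touched of pixels the Gaussian contributed to before the ray's accumulated α reached 0.5.

```python
import torch

def visible_gaussian_ids(n_touched: torch.Tensor) -> set:
    """A Gaussian counts as visible in a view if it contributed to at least one
    pixel before that ray's accumulated alpha reached 0.5 (n_touched > 0).
    `n_touched` is an assumed per-Gaussian statistic of shape (num_gaussians,)."""
    return set(torch.nonzero(n_touched > 0).flatten().tolist())

def covisibility(visible_i: set, visible_j: set) -> float:
    """Intersection-over-union of two keyframes' visibility sets; used as the
    occlusion-aware covisibility measure for keyframe management."""
    if not visible_i or not visible_j:
        return 0.0
    return len(visible_i & visible_j) / len(visible_i | visible_j)
```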
Gaussian Insertion and Pruning At every keyframe, new Gaussians are inserted into the scene to capture newly visible scene elements and to refine the fine details. When depth measurements are available, Gaussian means μ_W are initialised by back-projecting the depth. In the monocular case, we render the depth at the current frame. For pixels with depth estimates, μ_W are initialised around those depths with low variance; for pixels without depth estimates, we initialise μ_W around the median depth of the rendered image with high variance (parameters are in supplementary Sec. 7.1.2).

In the monocular case, the positions of many newly inserted Gaussians are incorrect. While the majority will quickly vanish during optimisation as they violate multiview consistency, we further prune the excess Gaussians by checking the visibility amongst the current window W_k. If the Gaussians inserted within the last 3 keyframes are unobserved by at least 3 other frames, we prune them out as they are geometrically unstable.

3.3.3 Mapping

The purpose of mapping is to maintain a coherent 3D structure and to optimise the newly inserted Gaussians. During mapping, the keyframes in W_k are used to reconstruct currently visible regions. Additionally, two random past keyframes W_r are selected per iteration to avoid forgetting the global map. Rasterisation of 3DGS imposes no constraint on the Gaussians along the viewing ray direction, even with a depth observation. This is not a problem when sufficient carefully selected viewpoints are provided (e.g. in the novel view synthesis case); however, in continuous SLAM this causes many artefacts, making tracking challenging. We therefore introduce an isotropic regularisation:

E_{iso} = \sum_{i=1}^{|\mathcal{G}|} \left\| s_i - \tilde{s}_i \cdot \mathbf{1} \right\|_1  (10)

to penalise the scaling parameters s_i (i.e. the stretch of the ellipsoid) by their difference from the mean s̃_i. As shown in Fig. 3, this encourages sphericality and avoids the artefacts created by Gaussians which are highly elongated along the viewing direction. Let the union of the keyframes in the current window and the randomly selected ones be W = W_k ∪ W_r. For mapping, we solve the following problem:

\min_{T^k_{CW} \in SE(3)\ \forall k \in W,\; \mathcal{G}} \;\; \sum_{\forall k \in W} E^k_{pho} + \lambda_{iso} E_{iso} .  (11)

If depth observations are available, as in tracking, geometric residuals Eq. (8) are added to the optimisation problem.
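A minimal sketch of Eq. (10), assuming `scales` holds the activated per-Gaussian scale vectors (not their log-parameterisation) and that the weighting by λ_iso is applied by the caller.

```python
import torch

def isotropic_loss(scales: torch.Tensor) -> torch.Tensor:
    """Eq. (10): for each Gaussian, penalise the L1 deviation of its scale vector
    s_i (rows of a tensor of shape (N, 3)) from its own per-Gaussian mean s~_i,
    which discourages Gaussians from stretching along the viewing ray."""
    s_mean = scales.mean(dim=1, keepdim=True)         # s~_i, one value per Gaussian
    return (scales - s_mean).abs().sum(dim=1).sum()   # sum over axes, then over Gaussians
```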
4. Evaluation

We conduct a comprehensive evaluation of our system across a range of both real and synthetic datasets. Additionally, we perform an ablation study to justify our design choices. Finally, we present qualitative results of our system operating live using a monocular camera, illustrating its practicality and high-fidelity reconstruction.

4.1. Experimental Setup

Datasets For our quantitative analysis, we evaluate our method on the TUM RGB-D dataset [32] (3 sequences) and the Replica dataset [31] (8 sequences), following the evaluation in [33]. For qualitative results, we use self-captured real-world sequences recorded by an Intel RealSense D455. Since the Replica dataset is designed for RGB-D SLAM evaluation, it contains challenging purely rotational camera motions. We hence use the Replica dataset for RGB-D evaluation only. The TUM RGB-D dataset is used for both monocular and RGB-D evaluation.

Implementation Details We run our SLAM on a desktop with an Intel Core i9 12900K 3.50GHz and a single NVIDIA GeForce RTX 4090. We present results from our multi-process implementation aimed at real-time applications. For a fair comparison with other methods on Replica, we additionally report results for a single-process implementation which performs more mapping iterations. As with 3DGS, time-critical rasterisation and gradient computation are implemented using CUDA. The rest of the SLAM pipeline is developed with PyTorch. Details of hyperparameters are provided in the supplementary material.

Metrics For camera tracking accuracy, we report the Root Mean Square Error (RMSE) of the Absolute Trajectory Error (ATE) of the keyframes. To evaluate map quality, we report standard photometric rendering quality metrics (PSNR, SSIM and LPIPS), following the evaluation protocol used in [27]: rendering metrics are computed on every fifth frame; however, for fairness, we exclude the keyframes (training views). We report the average across three runs for all our evaluations. In the tables, the best result is in bold, and the second best is underlined.

Baseline Methods We primarily benchmark our SLAM method against other approaches that, like ours, do not have explicit loop closure. In monocular settings, we compare with state-of-the-art classical and learning-based direct visual odometry (VO) methods. Specifically, we compare DSO [4], DepthCov [3], and DROID-SLAM [36] in VO configurations. These methods are selected based on their public reporting of results on the benchmark (TUM dataset) or the availability of their source code for obtaining the benchmark result. Since one of our focuses is online scale estimation under monocular scale ambiguity, methods which use ground-truth poses for system initialisation, such as [13], are not considered for the comparison. In the RGB-D case, we compare against neural-implicit SLAM methods [7, 8, 27, 33, 39, 43, 46] which are also map-centric, rendering-based and do not perform loop closure.

4.2. Quantitative Evaluation

Camera Tracking Accuracy Table 1 shows the tracking results on the TUM RGB-D dataset. In the monocular setting, our method surpasses other baselines without requiring any deep priors. Furthermore, our performance is comparable to systems which perform explicit loop closure. This clearly highlights that there still remains potential for enhancing the tracking of monocular SLAM by exploring fundamental SLAM representations. Our RGB-D method shows better performance than any other baseline method. Notably, our system surpasses ORB-SLAM in the fr1 sequences, bridging the gap between map-centric SLAM and the state-of-the-art sparse frame-centric methods. Table 2 reports results on the synthetic Replica dataset. Our single-process implementation shows competitive performance and achieves the best result in 4 out of 8 sequences. Our multi-process implementation, which performs fewer mapping iterations, still performs comparably. In contrast to other methods, our system demonstrates higher performance on real-world data, as our system flexibly handles real sensor noise by direct optimisation of the Gaussian positions against information from every pixel.

Novel View Rendering Table 5 summarises the novel view rendering performance of our method with RGB-D input. We consistently show the best performance across most sequences and are at least second best on the rest. Our rendering FPS is hundreds of times faster than that of other methods, offering a significant advantage for applications which require real-time map interaction. While Point-SLAM is competitive, that method focuses on view synthesis rather than novel-view synthesis. Their view synthesis is conditional on the availability of depth due to the depth-guided ray sampling, making novel-view synthesis challenging. On the other hand, our rasterisation-based approach does not require depth guidance and achieves efficient, high-quality novel view synthesis. Fig. 4 provides a qualitative comparison of the rendering of ours and Point-SLAM (with depth guidance).

Ablative Analysis In Table 3, we perform an ablation to confirm our design choices. Isotropic regularisation and the geometric residual improve the tracking of monocular and RGB-D SLAM respectively, as they aid in constraining the geometry when photometric signals are weak. For both cases, keyframe selection significantly improves system performance, as it automatically chooses suitable keyframes based on our occlusion-aware keyframe selection and management. We further compare the memory usage of different 3D representations in Table 4. MLP-based iMAP is clearly more memory efficient, but it struggles to express high-fidelity 3D scenes due to the limited capacity of the small MLP. Compared with the voxel grid of features used in NICE-SLAM, our method uses significantly less memory.

Convergence Basin Analysis In our SLAM experiments, we discovered that 3D Gaussian maps have a notably large convergence basin for camera localisation. To investigate further, we conducted a convergence funnel analysis.

Input | Loop closure | Method | fr1/desk | fr2/xyz | fr3/office | Avg.
Monocular | w/o | DSO [4] | 22.4 | 1.10 | 9.50 | 11.0
Monocular | w/o | DROID-VO [36] | 5.20 | 10.7 | 7.30 | 7.73
Monocular | w/o | DepthCov [3] | 5.60 | 1.20 | 68.8 | 25.2
Monocular | w/o | Ours | 4.15 | 4.79 | 4.39 | 4.44
Monocular | w/ | DROID-SLAM [36] | 1.80 | 0.50 | 2.80 | 1.70
Monocular | w/ | ORB-SLAM2 [20] | 2.00 | 0.60 | 2.30 | 1.60
RGB-D | w/o | iMAP [33] | 4.90 | 2.00 | 5.80 | 4.23
RGB-D | w/o | NICE-SLAM [46] | 4.26 | 6.19 | 6.87 | 5.77
RGB-D | w/o | DI-Fusion [7] | 4.40 | 2.00 | 5.80 | 4.07
RGB-D | w/o | Vox-Fusion [43] | 3.52 | 1.49 | 26.01 | 10.34
RGB-D | w/o | ESLAM [8] | 2.47 | 1.11 | 2.42 | 2.00

Table 1. Camera tracking result on TUM for monocular and RGB-D. ATE RMSE in cm is reported. We divide systems into those with and without explicit loop closure. In both monocular and RGB-D cases, we achieve state-of-the-art performance. In particular, in the monocular case, not only do we outperform systems which use a deep prior, but we achieve comparable performance with many of the RGB-D systems.

Figure 4. Rendering examples on Replica (panels: Point-SLAM, Ours, GT). Due to the stochastic nature of ray sampling, Point-SLAM struggles with rendering fine details.

Method | r0 | r1 | r2 | o0 | o1 | o2 | o3 | o4 | Avg.
iMAP [33] | 3.12 | 2.54 | 2.31 | 1.69 | 1.03 | 3.99 | 4.05 | 1.93 | 2.58
NICE-SLAM | 0.97 | 1.31 | 1.07 | 0.88 | 1.00 | 1.06 | 1.10 | 1.13 | 1.07
Vox-Fusion [43] | 1.37 | 4.70 | 1.47 | 8.48 | 2.04 | 2.58 | 1.11 | 2.94 | 3.09
ESLAM [8] | 0.71 | 0.70 | 0.52 | 0.57 | 0.55 | 0.58 | 0.72 | 0.63 | 0.63
Point-SLAM [27] | 0.61 | 0.41 | 0.37 | 0.38 | 0.48 | 0.54 | 0.69 | 0.72 | 0.53
Ours | 0.47 | 0.43 | 0.31 | 0.70 | 0.57 | 0.31 | 0.31 | 3.2 | 0.79
Ours* | 0.76 | 0.37 | 0.23 | 0.66 | 0.72 | 0.30 | 0.19 | 1.46 | 0.58

Table 2. Camera tracking result on Replica for RGB-D SLAM. ATE RMSE in cm is reported. We achieve the best performance across most sequences. Here, Ours is our multi-process implementation and Ours* is the single-process implementation which performs more mapping iterations.

Figure 5. Convergence basin analysis (panels: Camera Layout; Ours w/ depth; Ours w/o depth; Hash Grid SDF; MLP SDF). Left: 3D Gaussians reconstructed using the training views (yellow) and visualisation of the test poses (red). We measure the convergence basin of the target pose (blue) by performing localisation from the test poses. Right: Visualisation of the convergence basin of our method (top, with and without depth for training) and other representations (bottom). The green circle marks successful convergence, and the red cross marks failure.

Input | Method | fr1/desk | fr2/xyz | fr3/office | Avg.
Mono | w/o E_iso | 4.54 | 4.87 | 5.1 | 4.84
Mono | w/o kf selection | 48.5 | 4.36 | 8.70 | 20.5
Mono | Ours | 4.15 | 4.79 | 4.39 | 4.44
RGB-D | w/o E_geo | 1.66 | 1.51 | 2.45 | 1.87
RGB-D | w/o kf selection | 1.93 | 1.46 | 4.07 | 2.49
RGB-D | Ours | 1.52 | 1.58 | 1.65 | 1.58

Table 3. Ablation Study on TUM RGB-D dataset. We analyse the usefulness of isotropic regularisation, the geometric residual, and keyframe selection to our SLAM system. Numbers are ATE RMSE in cm.

Method | iMAP [33] | NICE-SLAM [46] | Co-SLAM [39] | Ours (Mono) | Ours (RGB-D)
Memory Usage | 0.2M | 101.6M | 1.6M | 2.6MB | 3.97MB

Table 4. Memory Analysis on TUM RGB-D dataset. We compare the size of our Gaussian map to other methods. Baseline numbers are taken from [39].
Method | Metric | room0 | room1 | room2 | office0 | office1 | office2 | office3 | office4 | Avg. | Rendering FPS
NICE-SLAM [46] | PSNR [dB] ↑ | 22.12 | 22.47 | 24.52 | 29.07 | 30.34 | 19.66 | 22.23 | 24.94 | 24.42 | 0.54
NICE-SLAM [46] | SSIM ↑ | 0.689 | 0.757 | 0.814 | 0.874 | 0.886 | 0.797 | 0.801 | 0.856 | 0.809 |
NICE-SLAM [46] | LPIPS ↓ | 0.33 | 0.271 | 0.208 | 0.229 | 0.181 | 0.235 | 0.209 | 0.198 | 0.233 |
Vox-Fusion [43] | PSNR [dB] ↑ | 22.39 | 22.36 | 23.92 | 27.79 | 29.83 | 20.33 | 23.47 | 25.21 | 24.41 | 2.17
Vox-Fusion [43] | SSIM ↑ | 0.683 | 0.751 | 0.798 | 0.857 | 0.876 | 0.794 | 0.803 | 0.847 | 0.801 |
Vox-Fusion [43] | LPIPS ↓ | 0.303 | 0.269 | 0.234 | 0.241 | 0.184 | 0.243 | 0.213 | 0.199 | 0.236 |
Point-SLAM [27] | PSNR [dB] ↑ | 32.40 | 34.08 | 35.5 | 38.26 | 39.16 | 33.99 | 33.48 | 33.49 | 35.17 | 1.33
Point-SLAM [27] | SSIM ↑ | 0.974 | 0.977 | 0.982 | 0.983 | 0.986 | 0.96 | 0.960 | 0.979 | 0.975 |
Point-SLAM [27] | LPIPS ↓ | 0.113 | 0.116 | 0.111 | 0.1 | 0.118 | 0.156 | 0.132 | 0.142 | 0.124 |
Ours | PSNR [dB] ↑ | 34.83 | 36.43 | 37.49 | 39.95 | 42.09 | 36.24 | 36.7 | 36.07 | 37.50 | 769
Ours | SSIM ↑ | 0.954 | 0.959 | 0.965 | 0.971 | 0.977 | 0.964 | 0.963 | 0.957 | 0.960 |
Ours | LPIPS ↓ | 0.068 | 0.076 | 0.075 | 0.072 | 0.055 | 0.078 | 0.065 | 0.099 | 0.070 |

Table 5. Rendering performance comparison of RGB-D SLAM methods on Replica. Our method outperforms existing methods on most of the rendering metrics. Note that Point-SLAM uses sensor depth (ground-truth depth in Replica) to guide sampling along rays, which limits the rendering performance to existing views. The numbers for the baselines are taken from [27].
5. Conclusion
We have proposed the first SLAM method using 3D Gaus-
sians as a SLAM representation. Via efficient volume ren-
dering, our system significantly advances the fidelity and di-
versity of object materials a live SLAM system can capture.
Our system achieves state-of-the-art performance across
benchmarks for both monocular and RGB-D cases. Inter-
esting directions for future research are the integration of
loop closure for handling large-scale scenes and extraction
of geometry such as surface normals, since Gaussians do not explicitly represent surfaces.
6. Acknowledgement
Research presented in this paper has been supported by
Dyson Technology Ltd. We are very grateful to Eric Dex-
heimer, Kirill Mazur, Xin Kong, Marwan Taher, Ignacio
Alzugaray, Gwangbin Bae, Aalok Patwardhan, and mem-
bers of the Dyson Robotics Lab for their advice and insight-
ful discussions.
Supplementary Material

7. Implementation Details

7.1. System Details and Hyperparameters

7.1.1 Tracking and Mapping (Sec. 3.3.1 and 3.3.3)

Learning Rates We use the Adam optimiser for both the camera pose and Gaussian parameter optimisation. For camera poses, we used 0.003 for rotation and 0.001 for translation. For the 3D Gaussians, we used the default learning parameters of the original Gaussian Splatting implementation [10], apart from in the monocular setting, where we increase the learning rate of the Gaussian positions μ_W by a factor of 10.

Iteration numbers 100 tracking iterations are performed per frame across all experiments. However, we terminate the iterations early if the magnitude of the pose update becomes less than 10^{-4}. For mapping, 150 iterations are used for the single-process implementation.

Loss Weights Given a depth observation, for tracking we minimise both the photometric Eq. (7) and geometric Eq. (8) residuals as:

\min_{T_{CW} \in SE(3)} \; \lambda_{pho} E_{pho} + (1 - \lambda_{pho}) E_{geo} ,  (12)

and similarly, for mapping we modify Eq. (11) to:

\min_{T^k_{CW} \in SE(3)\ \forall k \in W,\; \mathcal{G}} \;\; \sum_{\forall k \in W} \left( \lambda_{pho} E^k_{pho} + (1 - \lambda_{pho}) E^k_{geo} \right) + \lambda_{iso} E_{iso} .  (13)

We set λ_pho = 0.9 for all RGB-D experiments, and λ_iso = 10 for both the monocular and RGB-D experiments.
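A sketch of the corresponding optimiser setup; the variable names are illustrative, and the quoted default 3DGS position learning rate is an assumption taken from the public 3DGS code rather than from this paper.

```python
import torch

# Hypothetical pose parameters: an se(3)-style split into rotation and translation updates.
cam_rot_delta = torch.zeros(3, requires_grad=True)
cam_trans_delta = torch.zeros(3, requires_grad=True)

pose_optimizer = torch.optim.Adam([
    {"params": [cam_rot_delta], "lr": 0.003},    # rotation learning rate (Sec. 7.1.1)
    {"params": [cam_trans_delta], "lr": 0.001},  # translation learning rate
])

# Gaussian parameters follow the default 3DGS learning rates; in the monocular
# setting the position learning rate is increased by a factor of 10.
default_xyz_lr = 0.00016   # assumed 3DGS default position learning rate
monocular = True
gaussian_xyz = torch.zeros(1000, 3, requires_grad=True)   # placeholder Gaussian means
map_optimizer = torch.optim.Adam([
    {"params": [gaussian_xyz], "lr": default_xyz_lr * (10 if monocular else 1)},
])
```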
7.1.2 Keyframing (Sec. 3.3.2)

Gaussian Covisibility Check (Sec. 3.3.2) As described in Sec. 3.3.2, keyframe selection is based on the covisibility of the Gaussians. Between two keyframes i and j, we define the covisibility using the Intersection over Union (IOU) and the Overlap Coefficient (OC):

IOU_{cov}(i, j) = \frac{|\mathcal{G}^v_i \cap \mathcal{G}^v_j|}{|\mathcal{G}^v_i \cup \mathcal{G}^v_j|} ,  (14)

OC_{cov}(i, j) = \frac{|\mathcal{G}^v_i \cap \mathcal{G}^v_j|}{\min(|\mathcal{G}^v_i|, |\mathcal{G}^v_j|)} ,  (15)

where G^v_i is the set of Gaussians visible in keyframe i, based on the visibility check described in Section 3.3.2, Gaussian Covisibility. A keyframe i is added to the keyframe window W_k if, given the last keyframe j, IOU_cov(i, j) < kf_cov or if the relative translation t_ij > kf_m D̂_i, where D̂_i is the median depth of frame i. For Replica, kf_cov = 0.95 and kf_m = 0.04; for TUM, kf_cov = 0.90 and kf_m = 0.08. We remove a registered keyframe j from W_k if OC_cov(i, j) < kf_c, where keyframe i is the latest added keyframe. For both Replica and TUM, we set the cutoff to kf_c = 0.3. We set the size of the keyframe window to |W_k| = 10 for Replica and |W_k| = 8 for TUM.
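These rules can be summarised as the following sketch, with the TUM thresholds as defaults; the covisibility values are assumed to be computed from the visibility sets as in Eqs. (14)-(15).

```python
def should_add_keyframe(iou_cov: float, rel_translation: float,
                        median_depth: float, kf_cov: float = 0.90,
                        kf_m: float = 0.08) -> bool:
    """Add the current frame i as a keyframe if its covisibility with the last
    keyframe j drops below kf_cov, or if the relative translation t_ij exceeds
    kf_m times the median depth of frame i."""
    return iou_cov < kf_cov or rel_translation > kf_m * median_depth

def should_remove_keyframe(oc_cov: float, kf_c: float = 0.3) -> bool:
    """Remove a registered keyframe j from the window once its overlap
    coefficient with the latest keyframe i falls below the cutoff kf_c."""
    return oc_cov < kf_c
```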
Gaussian Insertion and Pruning (Sec. 3.3.2) As we optimise the positions of the Gaussians and prune geometrically unstable Gaussians, we do not require any strong prior such as a depth observation for Gaussian initialisation. When inserting new Gaussians in the monocular setting, we randomly sample the Gaussian positions μ_W using the rendered depth D. Since the estimated depth may sometimes be incorrect, we account for this by initialising the Gaussians with some variance. For a pixel p where the rendered depth D_p exists, we sample the depth from N(D_p, 0.2σ_D). Otherwise, for unobserved regions, we initialise the Gaussians by sampling from N(D̂, 0.5σ_D), where D̂ is the median of D. For pruning, as described in Section 3.3.2, we perform visibility-based pruning: if new Gaussians inserted within the last 3 keyframes are not observed by at least 3 other frames, they are pruned. We only perform visibility-based pruning once the keyframe window W_k is full. Additionally, we prune all Gaussians with an opacity of less than 0.7.
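An illustrative sketch of the monocular depth sampling above, assuming σ_D denotes the standard deviation of the rendered depth and reading the second argument of N(·, ·) as a standard deviation.

```python
import torch

def sample_insertion_depths(rendered_depth: torch.Tensor) -> torch.Tensor:
    """Per-pixel depth samples for newly inserted Gaussians (monocular case).
    Pixels with a rendered depth D_p are sampled tightly, N(D_p, 0.2 * sigma_D);
    pixels without one fall back to a broad sample around the median rendered
    depth, N(D_hat, 0.5 * sigma_D)."""
    valid = rendered_depth > 0
    d_hat = rendered_depth[valid].median()          # median of the rendered depth
    sigma_d = rendered_depth[valid].std()           # spread of the rendered depth
    noise = torch.randn_like(rendered_depth)
    near = rendered_depth + 0.2 * sigma_d * noise   # around the rendered depth
    far = d_hat + 0.5 * sigma_d * noise             # around the median depth
    return torch.where(valid, near, far).clamp(min=1e-3)
```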
8. Evaluation details

8.1. Camera Tracking Accuracy (Table 1 and Table 2)

8.1.1 Evaluation Metric

We measured the keyframe absolute trajectory error (ATE) RMSE. For monocular evaluation, we perform scale alignment between the estimated scale-free and ground-truth trajectories. For RGB-D evaluation, we only align the estimated trajectory and the ground truth, without scale adjustment.

8.1.2 Baseline Results

Table 1 Numbers for monocular DROID-SLAM [36] and ORB-SLAM [20] are taken from [13]. We have locally run DSO [4], DepthCov [3] and DROID-VO [36], which is DROID-SLAM without loop closure and global bundle adjustment. For the RGB-D case, numbers for NICE-SLAM [46], DI-Fusion [7], Vox-Fusion [43] and Point-SLAM [27] are taken from Point-SLAM [27], and all the other baselines: iMAP [33], ESLAM [8], Co-SLAM [39], BAD-SLAM [29], Kintinuous [40], ORB-SLAM [20]

Table 2 and 5 We took the numbers from the Point-SLAM [27] paper.

Method | Rendering FPS ↑ | Rendering time per image [s] ↓
NICE-SLAM [46] | 0.54 | 1.85
Vox-Fusion [43] | 2.17 | 0.46
Point-SLAM [27] | 1.33 | 0.75
Ours | 769 | 0.0013

Table 7. Further detail of the Rendering FPS and rendering time comparison based on Table 5.
8.3. Convergence Basin Analysis

For the convergence basin analysis, we create three datasets by rendering the synthetic Replica dataset. In addition to the qualitative visualisation in Figure 5, we report more detailed camera pose distributions in Figure 8. Figure 8 shows the camera view frustums of the test (red), training (yellow) and target (blue) views. As we mentioned in the main paper, we set the training views in the shape of a square with a width of 0.5 m, and the test views are distributed with radii ranging from 0.2 m to 1.2 m, covering a larger area than the training views. We only apply displacements to the camera translation but not to the rotation. For each sequence, we use a total of 67 test views.
8.3.2 Training setup

For each method, the 3D representation is trained for 30000 iterations using the training views. Here, we detail the training setup of each of the methods:

Ours Note that the training views' camera poses T^k_CW are fixed during the optimisation. In the "w/ depth" setting, we train the Gaussian map by minimising the same cost function as our RGB-D SLAM system:

\mathcal{G}_{init} = \arg\min_{\mathcal{G}} \; \sum_{\forall k \in W} \left( \lambda_{pho} E^k_{pho} + (1 - \lambda_{pho}) E^k_{geo} \right) + \lambda_{iso} E_{iso} ,  (17)

where we use λ_pho = 0.9 and λ_iso = 10 for all the experiments.

Baseline Methods For Hash Grid SDF, we trained the same network architecture as Co-SLAM [39]. For MLP SDF, we trained the network of iMAP [33]. For both baselines, we supervised the networks with the same loss functions as Co-SLAM, which are the colour rendering loss L_rgb, depth rendering loss L_depth, SDF loss L_sdf, free-space loss L_fs, and smoothness loss L_smooth. Please refer to the original Co-SLAM paper for the exact formulation (equations (6)-(9)). All the training hyperparameters (e.g. learning rate of the network, number of sampling points, loss weights) are the same as Co-SLAM's default configuration for the Replica dataset. While Co-SLAM stores training view information by downsampling the colour and depth images, we store the full pixel information because the number of training views is small.

Method | Total Time [s] | FPS | Avg. Map. Iter.
Monocular | 798.9 | 3.2 | 88.1
RGB-D | 986.7 | 2.5 | 81.0

Table 8. Performance Analysis using fr3/office. Both the monocular and RGB-D implementations use multiprocessing. We report the total execution time of our system, the FPS computed by dividing the total number of processed frames by the total time, and the average number of mapping iterations per added keyframe.

Method | Total Time [s] | FPS | Avg. Map. Iter.
RGB-D | 1002.7 | 2.0 | 27.5
RGB-D* | 1878.1 | 1.1 | 150

Table 9. Performance Analysis using replica/office2. RGB-D uses the multi-process implementation and RGB-D* is the single-process implementation. We report the total execution time of our system, the FPS computed by dividing the total number of processed frames by the total time, and the average number of mapping iterations per added keyframe.
8.3.3 Testing Setup

For testing, we localise the camera pose by minimising only the photometric error against the ground-truth colour image of the target view.

Ours Let T_CW ∈ SE(3) be the camera pose and G_init the initial 3D Gaussians; the localised camera pose T^est_CW is found by:

T^{est}_{CW} = \arg\min_{T_{CW}} \; \left\| I(\mathcal{G}_{init}, T_{CW}) - \bar{I}_{target} \right\|_1 .  (18)

Note that G_init is fixed during the optimisation. We initialise T_CW at one of the test view's positions, and the optimisation is performed for 1000 iterations. We perform this localisation process for all the test views and measure the success rate. Camera localisation is successful if the estimated pose converges to within 1 cm of the target view within the 1000 iterations.
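The localisation protocol can be sketched as follows; the renderer and the se(3) pose parameterisation are placeholders, not the paper's implementation.

```python
import torch

def localise(render_fn, gaussians, T_init, image_target, num_iters=1000, lr=1e-3):
    """Convergence-funnel test: starting from a test pose T_init, minimise only
    the photometric error of Eq. (18) against the target image while the
    Gaussians stay fixed. `render_fn(gaussians, T_init, xi)` is a placeholder
    for a differentiable renderer applying the se(3) increment `xi` to T_init."""
    xi = torch.zeros(6, requires_grad=True)            # se(3) perturbation around T_init
    opt = torch.optim.Adam([xi], lr=lr)
    for _ in range(num_iters):
        opt.zero_grad()
        rendered = render_fn(gaussians, T_init, xi)
        loss = (rendered - image_target).abs().mean()  # L1 photometric error only
        loss.backward()
        opt.step()
    return xi.detach()  # success if the final pose lies within 1 cm of the target
```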
Baseline Methods For the baseline methods, the camera localisation is performed by minimising the colour volume rendering loss L_rgb, while all the other trainable network parameters are fixed. The learning rates of the pose optimiser are also the same as Co-SLAM's default configuration for the Replica dataset.

9. Further Ablation Analysis (Table 3)

9.1. Pruning Ablation (Monocular input)

In Table 10, we report the ablation study of our proposed Gaussian pruning, which prunes randomly initialised 3D Gaussians effectively in the monocular SLAM setting. As the result shows, Gaussian pruning plays a significant role in enhancing camera tracking performance. This improvement is primarily because, without pruning, randomly initialised Gaussians persist in the 3D space, potentially leading to incorrect initial geometry for other views.

Input | Method | fr1/desk | fr2/xyz | fr3/office | Avg.
Mono | w/o pruning | 77.4 | 12.0 | 129.0 | 72.9
Mono | Ours | 4.15 | 4.79 | 4.39 | 4.44

Table 10. Pruning Ablation Study on TUM RGB-D dataset (Monocular Input). Numbers are camera tracking error (ATE RMSE) in cm.

9.2. Isotropic Loss Ablation (RGB-D input)

Tables 11 and 12 report the ablation study of the effect of the isotropic loss E_iso for RGB-D input. On TUM, as Table 11 shows, isotropic regularisation does not improve the performance and only makes a marginal difference. However, for Replica, as summarised in Table 12, the isotropic loss significantly improves camera tracking performance. Even with a depth measurement, rasterisation does not constrain the elongation along the viewing axis, so isotropic regularisation is required to prevent the Gaussians from over-stretching, especially in textureless regions, which are common in Replica.

Input | Method | fr1/desk | fr2/xyz | fr3/office | Avg.
RGB-D | w/o E_iso | 1.60 | 1.54 | 1.53 | 1.56
RGB-D | Ours | 1.52 | 1.58 | 1.65 | 1.58

Table 11. Isotropic Loss Ablation Study on TUM RGB-D dataset (RGB-D input). Numbers are camera tracking error (ATE RMSE) in cm.

Method | r0 | r1 | r2 | o0 | o1 | o2 | o3 | o4 | Avg.
w/o E_iso | 0.69 | 0.53 | 0.39 | 4.30 | 2.01 | 1.24 | 0.32 | 3.58 | 1.63
Ours | 0.47 | 0.43 | 0.31 | 0.70 | 0.57 | 0.31 | 0.31 | 3.20 | 0.79

Table 12. Isotropic Loss Ablation Study on Replica dataset (RGB-D input). Numbers are camera tracking error (ATE RMSE) in cm.
9.3. Memory Consumption and Frame Rate (Table 4)

9.3.1 Memory Analysis

In the memory consumption analysis for Table 4, we measure the final size of the created Gaussians. The memory footprint of our system is lower than that of the original Gaussian Splatting, which uses approximately 300-700 MB for the standard novel view synthesis benchmark dataset [10], as we only maintain well-constrained Gaussians via pruning and do not store the spherical harmonics.

The following derives the result presented in Eq. (6):

\frac{D \mu_C}{D T_{CW}} = \lim_{\tau \to 0} \frac{\mathrm{Exp}(\tau) \cdot \mu_C - \mu_C}{\tau}  (19)
= \lim_{\tau \to 0} \frac{(I + \tau^{\wedge}) \cdot \mu_C - \mu_C}{\tau}  (20)
= \lim_{\tau \to 0} \frac{\tau^{\wedge} \cdot \mu_C}{\tau}  (21)
= \lim_{\tau \to 0} \frac{\theta^{\times} \mu_C + \rho}{\tau}  (22)
= \lim_{\tau \to 0} \frac{-\mu_C^{\times} \theta + \rho}{\tau}  (23)
= \begin{bmatrix} I & -\mu_C^{\times} \end{bmatrix}  (24)
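The first-order argument of Eqs. (19)-(24) can be checked numerically with the following standalone test; it assumes τ stacks the translation ρ before the rotation θ, matching the block order of Eq. (24), and is not part of the paper's code.

```python
import torch

torch.set_default_dtype(torch.float64)  # double precision for a tight finite-difference check

def skew(v):
    """Skew-symmetric matrix v^x of a 3D vector."""
    x, y, z = v.unbind()
    zero = torch.zeros_like(x)
    return torch.stack([torch.stack([zero, -z, y]),
                        torch.stack([z, zero, -x]),
                        torch.stack([-y, x, zero])])

def exp_se3_first_order(tau):
    """First-order approximation Exp(tau) ~ I + tau^ used in Eq. (20),
    with tau = (rho, theta): translation first, then rotation."""
    rho, theta = tau[:3], tau[3:]
    T = torch.eye(4)
    T[:3, :3] = T[:3, :3] + skew(theta)
    T[:3, 3] = rho
    return T

def numerical_jacobian(mu_C, eps=1e-6):
    """Finite-difference form of Eq. (19): perturb each of the six components
    of tau and measure the change of Exp(tau) * mu_C."""
    J = torch.zeros(3, 6)
    mu_h = torch.cat([mu_C, torch.ones(1)])   # homogeneous point
    for k in range(6):
        tau = torch.zeros(6)
        tau[k] = eps
        J[:, k] = ((exp_se3_first_order(tau) @ mu_h)[:3] - mu_C) / eps
    return J

mu_C = torch.tensor([0.3, -1.2, 2.5])
analytic = torch.cat([torch.eye(3), -skew(mu_C)], dim=1)   # Eq. (24): [ I | -mu_C^x ]
assert torch.allclose(numerical_jacobian(mu_C), analytic, atol=1e-6)
```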
Figure 11. Novel view rendering and Gaussian visualizations on TUM fr2/xyz
Figure 13. Novel view rendering and Gaussian visualizations on TUM fr3/office