\[ p = \pi^{-1}(\mathbf{x}, Z_1(\mathbf{x})) = \left( \frac{u + c_x}{f_x},\; \frac{v + c_y}{f_y},\; 1 \right)^\top Z_1(\mathbf{x}), \quad (3) \]
where $Z_1(\mathbf{x})$ is the depth of the pixel, and $f_x, f_y$ and $c_x, c_y$ denote the focal length and optical center of the pinhole camera model, respectively. In the coordinate frame of the second camera, the point $p$ is rotated and translated according to the rigid body motion $g \in SE(3)$. A rigid body motion comprises a rotation represented as a $3 \times 3$ orthogonal matrix $R \in SO(3)$ and a translation represented as a $3 \times 1$ vector $t \in \mathbb{R}^3$. Correspondingly, the point $p$ in the frame of the second camera is given as
\[ T(g, p) = Rp + t. \quad (4) \]
To obtain a minimal parametrization of $g$ we use twist coordinates, i.e.,
\[ \xi = (\nu_1, \nu_2, \nu_3, \omega_1, \omega_2, \omega_3)^\top \in \mathbb{R}^6, \quad (5) \]
where $\nu_1, \nu_2, \nu_3$ are also called the linear velocity and $\omega_1, \omega_2, \omega_3$ the angular velocity of the motion. The rotation matrix and translation vector of $g$ can be calculated from $\xi$ with the exponential map relating the Lie algebra $\mathfrak{se}(3)$ to the Lie group $SE(3)$:
\[ g(\xi) = \exp(\hat{\xi}). \quad (6) \]
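For illustration, the following Python sketch (our own, not part of the paper; it assumes numpy and scipy and the twist ordering of (5)) builds the $4 \times 4$ twist matrix $\hat{\xi}$ and evaluates (6) numerically:

```python
import numpy as np
from scipy.linalg import expm

def twist_to_rt(xi):
    """Map twist coordinates xi = (v1, v2, v3, w1, w2, w3) to (R, t).

    Builds the 4x4 matrix xi_hat in se(3) and evaluates the matrix
    exponential, yielding a rigid body motion g in SE(3), cf. (6).
    """
    v, w = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    xi_hat = np.array([[0.0, -w[2],  w[1], v[0]],
                       [w[2],  0.0, -w[0], v[1]],
                       [-w[1], w[0],  0.0, v[2]],
                       [0.0,   0.0,  0.0,  0.0]])
    g = expm(xi_hat)               # g(xi) = exp(xi_hat)
    return g[:3, :3], g[:3, 3]     # rotation R, translation t
```

A generic matrix exponential is used here for clarity; the closed-form solution mentioned next is cheaper.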
A closed form solution exists to compute the matrix exponential $\exp(\hat{\xi})$. After the transformation, a 3D point $p = (x, y, z)^\top$ is projected onto the image plane of the second camera using the projection function $\pi$, i.e.,
\[ \pi(p) = \left( \frac{f_x\, x}{z} - c_x,\; \frac{f_y\, y}{z} - c_y \right)^\top. \quad (7) \]
To summarize, the full warping function is given by
\[ \tau(\xi, \mathbf{x}) = \pi(T(g(\xi), p)) \quad (8) \]
\[ \phantom{\tau(\xi, \mathbf{x})} = \pi(T(g(\xi), \pi^{-1}(\mathbf{x}, Z_1(\mathbf{x})))). \quad (9) \]
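To make the warp concrete, here is a minimal per-pixel sketch of (8)-(9), assuming the twist_to_rt helper above and our own variable names for the intrinsics; it illustrates the equations rather than reproducing the authors' implementation:

```python
import numpy as np

def warp_pixel(xi, u, v, z, fx, fy, cx, cy):
    """tau(xi, x): warp pixel x = (u, v) with depth z = Z_1(x), cf. (8)-(9)."""
    # pi_inv: back-project the pixel to a 3D point, cf. (3)
    p = np.array([(u + cx) / fx, (v + cy) / fy, 1.0]) * z
    # T(g, p) = R p + t: transform into the second camera frame, cf. (4)
    R, t = twist_to_rt(xi)
    q = R @ p + t
    # pi: project onto the image plane of the second camera, cf. (7)
    return np.array([fx * q[0] / q[2] - cx, fy * q[1] / q[2] - cy])
```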
B. Likelihood function
For the moment, we assume that the photo-consistency assumption as stated in (1) holds equally for all $n$ pixels $\mathbf{x}_i$ with $i = 1, \ldots, n$ in the image. We define the residual of the $i$-th pixel as the difference in brightness between the first and the warped second image, i.e.,
\[ r_i(\xi) := I_2(\tau(\xi, \mathbf{x}_i)) - I_1(\mathbf{x}_i). \quad (10) \]
In Fig. 1a and 1b two exemplary input images are shown. Their residual image is depicted in Fig. 1c, where brighter pixels indicate larger errors. Ideally, the residuals would be zero; however, due to sensor noise, the residuals will be distributed according to the probabilistic sensor model $p(r_i \mid \xi)$. By assuming that the noise of all pixels is independent and identically distributed, the likelihood of observing the whole residual image $r = (r_1, \ldots, r_n)^\top$ becomes
\[ p(r \mid \xi) = \prod_i p(r_i \mid \xi). \quad (11) \]
Using Bayes' rule, we obtain the a posteriori likelihood of a camera motion $\xi$ given a residual image $r$, i.e.,
\[ p(\xi \mid r) = \frac{p(r \mid \xi)\, p(\xi)}{p(r)}. \quad (12) \]
Note that $p(\xi)$ denotes the prior distribution over camera motions. Possible choices include a uniform prior (all camera motions are equally likely), a motion prior from an additional sensor such as an IMU, or the prediction of a Kalman filter.
C. Maximum A Posteriori (MAP) estimation
We now seek the camera motion $\xi$ that maximizes the posterior probability, i.e.,
\[ \xi_{\mathrm{MAP}} = \arg\max_\xi\, p(\xi \mid r). \quad (13) \]
By plugging in (12) and (11), and dropping the term $p(r)$ as it does not depend on $\xi$, we obtain
\[ \xi_{\mathrm{MAP}} = \arg\max_\xi\, \prod_i p(r_i \mid \xi)\, p(\xi). \quad (14) \]
By minimizing instead the negative log-likelihood, we can equivalently write
\[ \xi_{\mathrm{MAP}} = \arg\min_\xi\, -\sum_i \log p(r_i \mid \xi) - \log p(\xi). \quad (15) \]
To avoid clutter in the notation, we drop the motion prior $\log p(\xi)$ in the remainder of this section. We discuss it in more detail in the next section.
The minimum is found by taking the derivative of the log-likelihood and setting it to zero, i.e.,
\[ \sum_i \frac{\partial \log p(r_i \mid \xi)}{\partial \xi} = \sum_i \frac{\partial \log p(r_i)}{\partial r_i} \frac{\partial r_i}{\partial \xi} = 0. \quad (16) \]
By defining $w(r_i) = \frac{\partial \log p(r_i)}{\partial r_i} \cdot \frac{1}{r_i}$, we obtain
\[ \sum_i \frac{\partial r_i}{\partial \xi}\, w(r_i)\, r_i = 0, \quad (17) \]
[Fig. 3 graphic: (a) weighted error $w(r)\, r^2$ over residual $r$ for the normal distribution, Tukey weights, and the t-distribution; (b) residual histogram, probability $p(r)$ over residual $r$, for the normal distribution, a robustly fitted normal distribution, and the t-distribution.]
Fig. 3: (a) Depending on the weight function, residuals have different influence on the optimization. (b) We found that the distribution over residuals (gray) is not well approximated by a Gaussian distribution (red, green). In contrast, a t-distribution matches the observed residuals nicely (blue).
which minimizes the weighted least squares problem
\[ \xi_{\mathrm{MAP}} = \arg\min_\xi\, \sum_i w(r_i)\, (r_i(\xi))^2. \quad (18) \]
The function $w(r_i)$ is often called the weighting function, as it describes how strongly a particular residual is considered during minimization. Note that if $p(r_i) \propto \exp(-r_i^2/\sigma^2)$ is normally distributed, then $w(r_i)$ is constant, leading to standard least squares minimization. For the non-Gaussian error models that we consider in the next section, the weighting function is non-constant. Fig. 3a shows the weighted quadratic error for different weight functions. The advantage of this formulation is that other sensor models can also be applied to allow for non-Gaussian noise (e.g., outliers). To solve this minimization problem, we apply the iteratively re-weighted least squares (IRLS) algorithm, in which the computation of the weights and the estimation of $\xi$ are alternated until convergence.
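The following sketch shows the IRLS scheme for the linearized residual model $r(\xi) = r(0) + J\xi$ (a minimal formulation of our own; weight_fn stands for any weighting function, such as the t-distribution weights derived below):

```python
import numpy as np

def irls(J, r0, weight_fn, n_iters=20, tol=1e-10):
    """Iteratively re-weighted least squares for r(xi) = r0 + J @ xi.

    Alternates between computing the weights w(r_i) from the current
    residuals and solving the weighted normal equations for xi,
    until the estimate stops changing, cf. (17)-(18).
    """
    xi = np.zeros(J.shape[1])
    for _ in range(n_iters):
        w = weight_fn(r0 + J @ xi)          # re-weight with current residuals
        A = J.T @ (w[:, None] * J)          # J^T W J
        b = -J.T @ (w * r0)                 # -J^T W r(0)
        xi_new = np.linalg.solve(A, b)
        if np.linalg.norm(xi_new - xi) < tol:
            return xi_new
        xi = xi_new
    return xi
```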
D. Linearization
To find the best camera motion, we have to solve (17). As the residuals $r_i(\xi)$ are non-linear in $\xi$, we use the Gauss-Newton method to iteratively build and solve a system of linear equations. For this, we need the first order Taylor approximation of $r_i(\xi)$:
\[ r_{\mathrm{lin}}(\xi, \mathbf{x}_i) = r(0, \mathbf{x}_i) + \left.\frac{\partial r(\tau(\xi, \mathbf{x}_i))}{\partial \xi}\right|_{\xi = 0} \xi \quad (19) \]
\[ \phantom{r_{\mathrm{lin}}(\xi, \mathbf{x}_i)} = r(0, \mathbf{x}_i) + J_i\, \xi, \quad (20) \]
where $J_i \in \mathbb{R}^{1 \times 6}$ is the Jacobian of the $i$-th pixel with respect to the 6-DOF camera motion. By plugging this into (17) and writing all constraints in matrix notation, we obtain the normal equations
\[ J^\top W J\, \xi = -J^\top W\, r(0), \quad (21) \]
where $J \in \mathbb{R}^{n \times 6}$ is the stacked matrix of all pixel-wise Jacobians $J_i$ and $W$ is the diagonal matrix of weights with $W_{ii} = w(r_i)$. In Fig. 1d the first column of the Jacobian matrix for the two example images is given, corresponding to the derivative along the camera x-axis. The linear system of equations in (21) can be efficiently solved, for example using the Cholesky decomposition. At each iteration $k$ of the Gauss-Newton algorithm, we compute an increment $\xi^{(k)}$ using (21), with which we update our motion estimate using $\xi^{(k+1)} = \log\big(\exp(\hat{\xi}^{(k)})\, \exp(\hat{\xi})\big)$. Note that we do not need to re-linearize (17) at every iteration, because we warp the image $I_2$ with our current estimate towards $I_1$. Therefore, we only need the Jacobian matrix at the identity.
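A sketch of one such Gauss-Newton iteration, under our assumptions (a precomputed Jacobian $J$ at the identity, residuals $r(0)$, and weights $w$), with a Cholesky solve and the update composed on $SE(3)$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, expm, logm

def hat(xi):
    """4x4 twist matrix xi_hat from twist coordinates, cf. (5)-(6)."""
    v, w = xi[:3], xi[3:]
    return np.array([[0.0, -w[2],  w[1], v[0]],
                     [w[2],  0.0, -w[0], v[1]],
                     [-w[1], w[0],  0.0, v[2]],
                     [0.0,   0.0,  0.0,  0.0]])

def vee(xi_hat):
    """Inverse of hat: extract the six twist coordinates."""
    return np.array([xi_hat[0, 3], xi_hat[1, 3], xi_hat[2, 3],
                     xi_hat[2, 1], xi_hat[0, 2], xi_hat[1, 0]])

def gauss_newton_step(J, r0, w, xi):
    """One Gauss-Newton iteration: solve (21) and update on SE(3)."""
    A = J.T @ (w[:, None] * J)              # J^T W J (6x6)
    b = -J.T @ (w * r0)                     # -J^T W r(0)
    inc = cho_solve(cho_factor(A), b)       # increment, Cholesky solve
    # compose on the group: log(exp(inc_hat) exp(xi_hat))
    return vee(np.real(logm(expm(hat(inc)) @ expm(hat(xi)))))
```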
As the linearization is only valid for small $\xi$, we apply a coarse-to-fine scheme: first, we build an image pyramid in which we halve the image resolution at each level. Then we estimate the camera motion at each pyramid level as described above and use it as the initialization for the next level. In this way, even large translational and rotational motions can be handled.
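A minimal sketch of this coarse-to-fine loop; estimate_motion is a hypothetical routine running the Gauss-Newton scheme at one resolution, and depth-aware downsampling as well as the per-level scaling of the intrinsics are simplified away:

```python
import numpy as np
import cv2  # used only for pyrDown; any resampling would do

def coarse_to_fine(I1, I2, D1, n_levels=4):
    """Coarse-to-fine motion estimation on an image pyramid.

    Each level halves the resolution; the estimate from a coarser
    level initializes the next finer one.
    """
    pyramid = [(I1, I2, D1)]
    for _ in range(n_levels - 1):           # halve the resolution per level
        I1, I2, D1 = cv2.pyrDown(I1), cv2.pyrDown(I2), cv2.pyrDown(D1)
        pyramid.append((I1, I2, D1))
    xi = np.zeros(6)
    for i1, i2, d1 in reversed(pyramid):    # coarsest level first
        xi = estimate_motion(i1, i2, d1, xi_init=xi)
    return xi
```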
IV. ROBUST MOTION ESTIMATION
The approach described in Section III is very flexible, as it allows us to choose a suitable sensor model $p(r \mid \xi)$ and motion prior $p(\xi)$. For example, the robustness of motion estimation can be increased by using a sensor model that allows for outliers. Furthermore, if additional information on the camera motion is available, it can be plugged into the motion model.
A. Sensor Model
In [10], no weighting is used. This is equivalent to assuming normally distributed errors $r$. In our recent studies, however, we found that this assumption is often violated: this is exemplified in Fig. 3b, where a typical residual histogram from the fr1/desk sequence is depicted (gray bars). As can be seen from this plot, the normal distribution (depicted in red) fits the data poorly. [11] recently proposed to use an M-estimator to fit the Gaussian distribution more robustly (depicted in green). In our analysis, however, we found that the normal distribution, independent of the choice of $\sigma$, does not fit the residual distribution well. The problem is that the normal distribution assigns too low probabilities to large and very small residuals, but too high probabilities in between. Therefore, we follow [30], [31] and assume t-distributed errors, where in addition to the mean $\mu$ and variance $\sigma^2$ also the so-called degrees of freedom $\nu$ of the distribution can be specified. The t-distribution is suited to model data distributions with outliers because of its heavy tails, which cover the outliers with low probability. As can be seen from the figure, the fitted t-distribution (depicted in blue) matches the residual distribution nicely.
The weight function $w(r)$ derived from the t-distribution is
\[ w(r_i) = \frac{\partial \log p(r_i)}{\partial r_i} \cdot \frac{1}{r_i} = \frac{\nu + 1}{\nu + \left(\frac{r_i}{\sigma}\right)^2}. \quad (22) \]
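In code, these weights are a one-liner (numpy sketch, naming ours):

```python
import numpy as np

def t_weights(r, sigma, nu=5.0):
    """Per-pixel weights of the t-distribution, cf. (22)."""
    return (nu + 1.0) / (nu + (r / sigma) ** 2)
```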
Based on our experiments, we set the degrees of freedom to $\nu = 5$. In each iteration of IRLS, we compute the
[Fig. 4 panels: (a) scene, (b) residuals, (c) weights]
Fig. 4: In this experiment, a hand moves through the scene (a), which causes large residuals (b). By using a robust weighting function, the outlier pixels are ignored (dark) for motion estimation (c).
[Fig. 5 graphic: estimated trajectories, x [m] versus y [m], comparing ground truth, weights, and weights+temporal]
Fig. 5: A temporal prior stabilizes motion estimation significantly.
variance $\sigma^2$ using
\[ \sigma^2 = \frac{1}{n} \sum_i r_i^2\, \frac{\nu + 1}{\nu + \left(\frac{r_i}{\sigma}\right)^2}. \quad (23) \]
This equation has to be solved iteratively, because it is recursive, but it converges in a few iterations.
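A sketch of this fixed-point iteration for $\sigma$ (initialization and iteration count are our own choices):

```python
import numpy as np

def estimate_sigma(r, nu=5.0, n_iters=10):
    """Fixed-point iteration for the scale sigma, cf. (23).

    sigma appears on both sides of (23), so we iterate; a handful
    of iterations is enough in practice.
    """
    sigma = np.std(r) + 1e-12               # initial guess (our choice)
    for _ in range(n_iters):
        w = (nu + 1.0) / (nu + (r / sigma) ** 2)
        sigma = np.sqrt(np.mean(w * r ** 2))
    return sigma
```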
The effect of weighting is illustrated in Fig. 4, where a hand moves in a different direction than the camera, causing outliers in the residuals (cf. Fig. 4b). The outliers get down-weighted, as can be seen in Fig. 4c, where darker pixels indicate lower weights.
B. Motion Prior
With the approach described so far, the camera motion can be estimated accurately when the RGB-D frames contain sufficient texture and structure. However, feature-poor input images, motion blur, or dynamic objects may lead to increased noise and even divergence of the motion estimate. In particular, in our previous implementation [10], we treated all motions as equally likely. As a result, we occasionally observed jumps in the estimated trajectories, as illustrated in Fig. 5.
We assume a constant velocity model with a normal distribution, i.e., $p(\xi_t) = \mathcal{N}(\xi_{t-1}, \Sigma)$, where $\xi_{t-1}$ is the camera speed from the previous time step and $\Sigma \in \mathbb{R}^{6 \times 6}$ is a diagonal covariance matrix that defines how quickly it may change.
When we derive the normal equations similar to (21) from (15) including the motion prior, we obtain
\[ (J^\top W J + \Sigma^{-1})\, \xi = -J^\top W\, r(0) + \Sigma^{-1}(\xi_{t-1} - \xi_t^{(k)}), \quad (24) \]
where $\xi_t^{(k)}$ is the motion estimate after the $k$-th iteration at time step $t$. As can be seen from this equation, a large covariance matrix $\Sigma$ will decrease the influence of the motion prior with respect to the image-based residuals, and vice versa.
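A sketch of one iteration with the prior folded in, following (24) (names are ours; Sigma is the diagonal covariance of the constant velocity model):

```python
import numpy as np

def prior_augmented_step(J, r0, w, Sigma, xi_prev, xi_k):
    """Solve the prior-augmented normal equations (24) for xi.

    A large Sigma (loose prior) lets the image residuals dominate;
    a small Sigma pulls the update toward the previous velocity.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    A = J.T @ (w[:, None] * J) + Sigma_inv
    b = -J.T @ (w * r0) + Sigma_inv @ (xi_prev - xi_k)
    return np.linalg.solve(A, b)
```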
V. EVALUATION
In this section, we demonstrate in a series of experiments that the robust sensor model and a temporal motion model strongly improve the accuracy of the estimated odometry. We evaluated our approach on the TUM RGB-D benchmark [32] and on self-generated synthetic datasets with perfect ground truth. For each experiment we calculated the root mean square error (RMSE) of the drift in meters per second. Furthermore, the average runtime for matching a pair of images was measured. All timing results were obtained on a PC with an Intel i5 670 CPU (3.46 GHz) and 4 GB RAM. It should be noted that only one CPU core was utilized for the computations, and the memory footprint is less than four times the size of the input images (for storing the image pyramid and the image gradient at each level).
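For orientation, translational drift per second can be computed along the following lines (a simplified sketch in the spirit of the benchmark's relative pose error, not the benchmark's official evaluation tool):

```python
import numpy as np

def drift_rmse(gt_poses, est_poses):
    """RMSE of translational drift in m/s (simplified sketch).

    gt_poses, est_poses: lists of 4x4 camera-to-world poses sampled
    at 1 Hz. Compares relative motions over one-second intervals.
    """
    errors = []
    for i in range(len(gt_poses) - 1):
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + 1]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + 1]
        err = np.linalg.inv(rel_gt) @ rel_est      # residual relative motion
        errors.append(np.linalg.norm(err[:3, 3]))  # translational part [m]
    return np.sqrt(np.mean(np.square(errors)))
```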
For comparison, we also computed the camera trajectories with our previous version [10] and with a re-implementation of [11] using the Tukey weight function. We ran our previous implementation with parameters optimized for speed, denoted as reference in the tables.
A. Synthetic Sequences
We generated two synthetic sequences with perfect ground truth, one of a static scene (static) and one containing a small, moving object (moving). To generate the images, we simulated a camera moving along a sampled trajectory over a real RGB-D frame from the fr1/desk sequence. The motion in the camera plane was sampled from a uniform distribution between ±0.01 m to emulate a camera moving at a speed similar to those in [32], i.e., 0.3 m/s. Additionally, a rotation around the camera axis was sampled from a uniform distribution between ±5