
Visual Features for Vehicle Localization and Ego-Motion Estimation

Oliver Pink, Frank Moosmann and Alexander Bachmann


Institut für Mess- und Regelungstechnik
Universität Karlsruhe (TH), 76128 Karlsruhe, Germany
{pink,moosmann,bachmann}@mrt.uka.de

Abstract - This paper introduces a novel method for vehicle pose estimation and motion tracking using visual features. The method combines ideas from research on visual odometry with a feature map that is automatically generated from aerial images into a Visual Navigation System. Given an initial pose estimate, e.g. from a GPS receiver, the system is capable of robustly tracking the vehicle pose in geographical coordinates over time, using image data as the only input.

Experiments on real image data have shown that the precision of the position estimate with respect to the feature map typically lies within only a few centimeters. This makes the algorithm interesting for a wide range of applications like navigation, path planning or lane keeping.

I. INTRODUCTION

A precise digital representation of the road network is crucial for autonomous navigation. At present, coverage of such high-precision digital maps is very low and mainly focused on some major cities, as the generation of these maps is costly and time-consuming.

Highly detailed aerial images, which are publicly available for virtually any region of the world, present a good alternative to manual map generation. In this paper, we present a novel method for vehicle localization by matching data from an on-board stereo camera rig to a digital feature map which is automatically created from aerial imagery. Except for one GPS measurement for position initialization, the method does not rely on any sensor data other than on-board and aerial imagery.

As a main advantage over existing visual odometry and visual SLAM approaches, our method delivers a pose estimate not only relative to a local coordinate frame, but in geographical coordinates. Furthermore, problems of error integration over time and of loop closure are avoided by the use of the pre-built digital map.

A. Related Work

In recent years, research on the simultaneous localization and mapping (SLAM) problem has been brought from indoor applications to large-scale outdoor scenes [18], which makes such methods interesting for driver assistance applications. While SLAM methods were originally based on range measurements from a laser range finder, recent work has also focused on developing camera-based approaches to SLAM [7][15][22].

For SLAM, it is desirable to obtain a robust estimate of the ego-motion, which is commonly realized with on-board inertial sensors. Current research in a field often referred to as visual odometry has shown the possibility of estimating the motion from an on-board camera only [20][19][12].

However, visual odometry suffers from the same problems as inertial-sensor based odometry, i.e. the vehicle pose error integrates over time. Similarly, a typical problem of SLAM approaches is the loop closure when an already mapped point is visited again. Furthermore, building the map from scratch is not suitable for long-distance path planning in large-scale environments, and building an entire road network from sensor data is very time-consuming.

B. Objective Formulation

To overcome the limitations of existing SLAM approaches for large-scale outdoor scenes, we introduce a pre-built feature map of the environment instead of building up the map from on-board sensor data only. This feature map is generated from widely available geographically referenced aerial imagery and is matched to visual features from the on-board camera platform.

The basic idea of localization by matching on-board camera features to a global map has already been demonstrated in [21]. In this work, we combine this idea with additional information from a visual odometry system for increased robustness and precision of the pose estimate.

Similar to conventional Inertial Navigation Systems, which fuse motion estimates from inertial sensors with an absolute GPS pose estimate [2], the vehicle pose estimate from map matching is fused with the vehicle motion estimates from visual odometry by a Kalman filter. Since our approach relies only on visual data from aerial and on-board cameras - except for one GPS measurement for pose initialization - we dub it a Visual Navigation System.

Figure 1 shows examples of the aerial images and on-board camera images that will be matched for vehicle localization.

Fig. 1. Example vehicle camera image (a) and aerial image (b).

The remainder of this paper details the components of the Visual Navigation System.



Section II-A describes the generation of the digital map from aerial imagery. In section II-B, feature detection from on-board cameras for map matching and visual odometry is described. The actual vehicle pose estimation in section III is explained in two steps: matching the on-board features to the pre-built map is discussed in section III-A, and the Kalman filter for data fusion with the underlying model for vehicular motion is introduced in section III-B. Experimental results of the overall system are given in section IV, and section V concludes.

II. IMAGE PROCESSING

This section deals with the necessary processing of both on-board and aerial images. The images will be matched using some kind of feature that has to be extracted from both views. To distinguish between low-level image features like corners or edges and features used for matching, the latter will be denoted landmarks in the following.

The landmarks have to be visible from both views and ideally should not have too large dimensions or perspective distortion. Our approach makes use of lane markings as landmarks, since they are clearly visible from both views and since their detection is well discussed in the literature [8][13][14]. However, the method is applicable to any other kind of landmark. Introducing other classes of landmarks may even improve the results further.

The landmark extraction from aerial images is detailed in section II-A. Both the landmark extraction and the determination of the vehicle motion from on-board camera images are described in section II-B.

A. Aerial Images

The landmark extraction from aerial images is performed by classifying each pixel as to whether it belongs to a lane marking or not. This is done by a Support Vector Machine classifier [9], which is trained by manually selecting some positive and negative samples from the aerial images. The manual training and the comparably long computation times are tolerable, since landmark extraction from aerial images can be performed off-line, and only a very small training set is required to classify large map areas.

As the aerial imagery contains no height information, the classification result is a planar map, i.e. all detected features are assumed to lie in one ground plane.

Figure 2 shows a detail of an aerial image and the corresponding classification result.

Fig. 2. Original aerial image (a) and classification result (b).

The lane marking pixels are finally clustered to lane markings according to their distances, and the centroid of each lane marking is stored in the feature map. Some additional properties like length and orientation of each marking are also stored for visualization purposes but are not used for localization.
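As an illustration of this processing chain, a minimal sketch is given below. It is not the authors' implementation: it assumes raw RGB values as the per-pixel feature, uses an RBF-kernel SVM from scikit-learn as the classifier, and substitutes connected-component labelling for the distance-based clustering; all function and variable names are our own.

```python
import numpy as np
from sklearn.svm import SVC
from scipy import ndimage

def train_pixel_classifier(pos_pixels, neg_pixels):
    """Train an SVM on manually selected lane-marking / background sample pixels.
    pos_pixels, neg_pixels: arrays of shape (N, 3) holding RGB values."""
    X = np.vstack([pos_pixels, neg_pixels]).astype(float)
    y = np.hstack([np.ones(len(pos_pixels)), np.zeros(len(neg_pixels))])
    return SVC(kernel="rbf", gamma="scale").fit(X, y)

def extract_landmarks(aerial_rgb, clf):
    """Classify every pixel and reduce each lane-marking blob to its centroid."""
    h, w, _ = aerial_rgb.shape
    mask = clf.predict(aerial_rgb.reshape(-1, 3).astype(float)).reshape(h, w) > 0.5
    # Group neighbouring lane-marking pixels (stand-in for the distance-based clustering).
    blobs, n = ndimage.label(mask)
    # One centroid (row, col) per detected lane marking, in pixel coordinates.
    return ndimage.center_of_mass(mask, blobs, range(1, n + 1))
```

The pixel centroids can then be converted to geographical coordinates via the georeference of the aerial image.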
B. Vehicle Camera

The images from the on-board camera platform serve two purposes. First, lane markings have to be detected and their 3D position has to be estimated for map matching. Second, the 3D vehicle motion has to be estimated from the 2D motion of image features.

For the latter, the image displacement, i.e. the 2D motion of image points between consecutive frames, is computed for a salient set of image points. These image points are determined using an interest point detector such as the Harris corner detector [11]. Given the displacement, the motion components in 3D space are computed using depth information from a stereo image pair. In the following, it is assumed that the optical flow is determined from the right camera image and that the disparity is determined with respect to the right camera image. Furthermore, the camera platform is assumed to be fully calibrated, and all image coordinates are assumed to be given in normalized coordinates, i.e. with focal length f = 1 and the image center located at c = (0, 0)^T.

For a given 3D scene point X = (X, Y, Z)^T, the corresponding position x in the image plane is

\mathbf{x} = \begin{pmatrix} y \\ z \end{pmatrix} = \pi(\mathbf{X}) = \frac{f}{X} \begin{pmatrix} Y \\ Z \end{pmatrix}.    (1)

The depth information of the interest points can be recovered by measuring the horizontal separation of two corresponding points in a rectified stereo image pair.

Fig. 3. The model of the stereo rig. The coordinate system of the right camera coincides with the global coordinate system. For any 3D scene point X, its projection π(X) into the image plane can be described by eq. (1).

Given the image coordinates y_L in the left and y_R in the right image along an epipolar line, the disparity is related to the depth X of a scene point by

X = \frac{b}{\delta},    (2)

where δ = (y_L - y_R) is the disparity and b is the base length of the stereo rig. Figure 3 shows the model of the stereo rig.
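As a concrete illustration of eqs. (1) and (2), the sketch below projects a scene point into normalized image coordinates and recovers its depth from the disparity of a stereo correspondence. The stereo geometry (left camera offset by the baseline b along the Y axis) and all names are assumptions made for this example only.

```python
import numpy as np

def project(X, f=1.0):
    """Eq. (1): normalized image coordinates (y, z) of a scene point X = (X, Y, Z)."""
    depth, Y, Z = X
    return np.array([f * Y / depth, f * Z / depth])

def depth_from_disparity(y_left, y_right, b):
    """Eq. (2): depth X = b / delta with disparity delta = y_L - y_R."""
    return b / (y_left - y_right)

# Example: a point 10 m ahead, 2 m to the side, 1 m up; stereo baseline b = 0.3 m.
X = np.array([10.0, 2.0, 1.0])
y_r, z_r = project(X)                              # seen from the right camera
y_l, _ = project(X + np.array([0.0, 0.3, 0.0]))    # same point in the left camera frame
print(depth_from_disparity(y_l, y_r, b=0.3))       # recovers ~10.0
```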
Assuming a rigid scene, the motion of all scene points is described by the same translational velocity v = (v_x, v_y, v_z)^T and rotational velocity ω = (ω_x, ω_y, ω_z)^T. With the Longuet-Higgins equations [16], the relation between the 6-DoF velocity and the 2D displacement can be described adequately. Defining the displacement of an image point as its motion across the image plane, (ẏ, ż)^T = (u, v)^T, the Longuet-Higgins equations for the case of translational and rotational rigid motion can be written as

\dot{y} = \frac{\dot{Y}}{X} - \frac{Y}{X^2}\dot{X} = -\frac{v_y}{X} + \frac{v_x}{X} y + \omega_x z + \omega_y y z - \omega_z (1 + y^2),    (3)

\dot{z} = \frac{\dot{Z}}{X} - \frac{Z}{X^2}\dot{X} = -\frac{v_z}{X} + \frac{v_x}{X} z - \omega_x y + \omega_y (1 + z^2) - \omega_z y z.    (4)

Substitution of equation (2) into equations (3) and (4) and separation of the translational and the rotational motion components delivers the 2D image displacement (u_i, v_i) of the i-th interest point for a given camera motion

\begin{pmatrix} u_i \\ v_i \end{pmatrix} = H_i \, (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)^T    (5)

with

H_i = \begin{pmatrix} \frac{y_i \delta_i}{b} & -\frac{\delta_i}{b} & 0 & z_i & y_i z_i & -(1 + y_i^2) \\ \frac{z_i \delta_i}{b} & 0 & -\frac{\delta_i}{b} & -y_i & 1 + z_i^2 & -y_i z_i \end{pmatrix}.
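Written out in code, eq. (5) is a single 2x6 matrix per interest point. The sketch below is a direct transcription of the reconstruction above and nothing more; in particular, the sign convention should be verified against the actual camera setup.

```python
import numpy as np

def observation_matrix(y, z, delta, b):
    """H_i of eq. (5): maps (vx, vy, vz, wx, wy, wz) to the displacement (u_i, v_i)."""
    s = delta / b   # inverse depth 1/X, from eq. (2)
    return np.array([
        [y * s,  -s, 0.0,   z,        y * z, -(1.0 + y * y)],
        [z * s, 0.0,  -s,  -y, 1.0 + z * z,          -y * z],
    ])
```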
Stacking equation (5) for N interest points with displacements (u_i, v_i)^T delivers a linear observation model with 2N equations,

\begin{pmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{pmatrix} + e = \begin{pmatrix} u_1 \\ v_1 \\ u_2 \\ v_2 \\ \vdots \\ u_N \\ v_N \end{pmatrix} =: d,    (6)

which can be solved by straightforward least squares estimation:

\begin{pmatrix} v \\ \omega \end{pmatrix} = \left( H^T H \right)^{-1} H^T d.    (7)

However, the least squares estimate is very sensitive to outliers, which, in this case, are mainly due to violations of the rigid scene assumption. Other causes are e.g. errors in the disparity or displacement calculation. Especially for small disparities δ ≈ 0, the observation matrix H becomes unobservable. Therefore, interest points with a disparity below a certain threshold are discarded.

Furthermore, instead of using all interest points for the least squares estimate, a maximum set of N inliers is determined using the RANSAC estimator [10], which has proved to yield good results even with a large number of outliers present.

For the displacement estimation, the hierarchical Lucas-Kanade algorithm ([17], [4]) is used. Disparity is computed by area-based block matching ([6]). For the subsequent Kalman filtering (section III-B), the residuals of both estimation procedures are used to determine the measurement noise.

Figure 4 shows an example result for optical flow estimation and figure 5 shows an example disparity matching result. Only interest points with sufficiently low residuals and a disparity above the threshold are shown. Only the green interest points are used for least squares motion estimation, while the red points were rejected by the RANSAC algorithm.

Fig. 4. Optical flow vectors for the detected feature points. Green points are used for ego-motion estimation.

Fig. 5. Correspondences between left and right camera image. Green points are used for ego-motion estimation.
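The stacked system (6), its least-squares solution (7), and the RANSAC selection of a consensus set could be sketched as follows. This is a minimal illustration, not the authors' code: the observation matrix repeats eq. (5) so the snippet is self-contained, and the sample size, inlier threshold and iteration count are arbitrary choices.

```python
import numpy as np

def observation_matrix(y, z, delta, b):
    # Same 2x6 matrix H_i as in eq. (5), repeated here for self-containedness.
    s = delta / b
    return np.array([[y * s, -s, 0, z, y * z, -(1 + y * y)],
                     [z * s, 0, -s, -y, 1 + z * z, -y * z]])

def stack_system(points, b):
    """Eq. (6): points is a list of tuples (y, z, delta, u, v), one per interest point."""
    H = np.vstack([observation_matrix(y, z, d, b) for y, z, d, _, _ in points])
    d = np.hstack([(u, v) for _, _, _, u, v in points])
    return H, d

def least_squares_motion(points, b):
    """Eq. (7): 6-DoF motion (vx, vy, vz, wx, wy, wz) minimizing the stacked residual."""
    H, d = stack_system(points, b)
    return np.linalg.lstsq(H, d, rcond=None)[0]

def ransac_motion(points, b, n_iter=100, sample_size=3, inlier_thresh=0.01, seed=0):
    """Estimate the motion from a maximum consensus set of interest points."""
    rng = np.random.default_rng(seed)
    best_inliers = []
    for _ in range(n_iter):
        idx = rng.choice(len(points), size=sample_size, replace=False)
        motion = least_squares_motion([points[i] for i in idx], b)
        # A point is an inlier if its predicted displacement fits the hypothesis
        # (threshold in normalized image coordinates).
        inliers = [p for p in points
                   if np.linalg.norm(observation_matrix(*p[:3], b) @ motion - p[3:]) < inlier_thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final least-squares estimate over the consensus set; fall back to all points if none found.
    return least_squares_motion(best_inliers if best_inliers else points, b)
```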
The visual landmarks for map matching are yet to be determined. For this purpose, we make use of the same combination of Harris interest point detection and disparity estimation that was already used for motion estimation: in the lower part of the camera image, the majority of the Harris corner responses lies on corners of the lane markings. With equations (1) and (2), the 3D scene coordinates X of a given interest point with image coordinates x = (y, z)^T and disparity δ are given by

\mathbf{X} = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \frac{b}{\delta} \begin{pmatrix} 1 \\ y \\ z \end{pmatrix}.    (8)

Figure 6(a) shows an example corner detection result for the lower part of the image. Unlike in figure 4, corners that were rejected for motion estimation due to a high optical flow residual are also displayed.

Typically, up to four corners belong to one lane marking, thus the corners have to be clustered accordingly. This is accomplished by a Canny-like contour tracing and computing the lane marking centroid from all clustered edge coordinates. Figure 6(b) shows an example result of the clustered edges and figure 6(c) shows the resulting centroids which will be used for matching.

Fig. 6. Example landmark detection results. (a) Detected interest points. (b) Canny contours; each cluster is marked with a different color. (c) Centroids of detected landmarks.
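The clustering of corner responses into lane-marking landmarks could look roughly as sketched below, using OpenCV's Canny and contour extraction as a stand-in for the Canny-like contour tracing described above; the thresholds and the use of cv2 are assumptions of this example and not taken from the paper.

```python
import numpy as np
import cv2

def lane_marking_centroids(gray_lower_image):
    """Cluster lane-marking pixels via contours and return one centroid per marking."""
    edges = cv2.Canny(gray_lower_image, 50, 150)
    # OpenCV 4.x return signature: (contours, hierarchy).
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    centroids = []
    for c in contours:
        pts = c.reshape(-1, 2).astype(float)
        centroids.append(pts.mean(axis=0))   # centroid of all clustered edge coordinates
    return centroids
```

Together with its disparity, each centroid can then be lifted to a 3D landmark position via eq. (8) before matching against the map.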

III. VEHICLE POSE ESTIMATION

To obtain an estimate of the vehicle pose, the 3D coordinates of the lane markings from the camera view have to be matched to the lane markings in the digital map. This is a standard point pattern matching problem, for which many solutions have been proposed in recent years. These solutions mainly differ in their complexity and their robustness to outliers. We decided to use the iterative closest point algorithm [3] as a very fast and simple method for point pattern matching.

A major drawback of this method is that it only minimizes a local objective function and is therefore very likely to get stuck in nearby local minima. It is therefore necessary to have a good initial position estimate. For vehicle pose tracking, the pose from the previous time step is in most cases sufficient; for initialization, however, a single GPS measurement is needed.

Other methods like the ones proposed by van Wamelen et al. [23] or Caetano et al. [5] do not have these restrictions; however, they require a large amount of memory and computing time to find the global optimum. Therefore, these methods are not suitable for real-time vehicle localization.

After having obtained an estimate for the current vehicle pose, the result is fused with the previous vehicle pose, vehicular motion constraints and the vehicle motion estimate from visual odometry by a Kalman filter. The underlying system model is presented in section III-B. In the following, the vehicle pose estimation using the iterative closest point method is described in detail.

A. Iterative Point Matching

Given a set of m scene points in R^3 from the digital map in world coordinates and n template points in R^3 from the camera view in vehicle coordinates, the optimal transformation between the two coordinate systems has to be determined. The scene points are given as S = {s_1^W, s_2^W, ..., s_m^W} and the template points as T = {t_1^V, t_2^V, ..., t_n^V}, where the superscripts W and V denote the world and vehicle coordinate system, respectively.

Assuming that the optimal transformation, i.e. the vehicle position x^W and orientation matrix R, is known, the transformation of the scene points from world coordinates to vehicle coordinates is

s_i^V = R^{-1} \left( s_i^W - x^W \right).    (9)

As stated before, for the iterative closest point method, the initial vehicle position has to be known up to some position and orientation error Δx and ΔR. Given the initial pose estimate x̃^W and R̃, the optimal transformation is

x^W = \tilde{x}^W - \Delta x,    (10)

R = \Delta R^{-1} \tilde{R}.    (11)

The estimated position of the scene points in vehicle coordinates is

\tilde{s}_i^V = \tilde{R}^{-1} \left( s_i^W - \tilde{x}^W \right).    (12)
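Equations (9) and (12) are plain changes of coordinates; a small helper suffices. The sketch below assumes the pose is given as a 3x3 rotation matrix and a 3-vector, which is merely our representation choice for this example.

```python
import numpy as np

def world_to_vehicle(points_w, R, x_w):
    """Eq. (9)/(12): express scene points given in world coordinates in the
    (possibly only estimated) vehicle frame with rotation R and position x_w."""
    points_w = np.asarray(points_w, dtype=float)
    # For a rotation matrix R^-1 = R^T, so R^-1 (s - x_w) per point is one matrix product.
    return (points_w - np.asarray(x_w, dtype=float)) @ R
```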

The goal of the ICP algorithm is now to recover the position and orientation errors Δx and ΔR given the pose estimate x̃^W and R̃.

Assuming that for each template point t_i^V the corresponding scene point m_i^W = s_j^W is known, the optimal rotation and translation can be determined by a translation of the centroid position and a rotation around the centroid, which can be solved efficiently by a singular value decomposition (SVD) of the matrix

W = \sum_{i=1}^{n} \bar{m}_i \bar{t}_i^T = U \Sigma V^T,    (13)

where t̄_i and m̄_i are the coordinates of the template and matched scene points relative to their respective centroid position:

\bar{t}_i = t_i^V - \frac{1}{n} \sum_{k=1}^{n} t_k^V,    (14)

\bar{m}_i = \tilde{m}_i^V - \frac{1}{n} \sum_{k=1}^{n} \tilde{m}_k^V.    (15)

The desired rotation matrix is

\Delta R = U V^T    (16)

and the translation is

\Delta x = \frac{1}{n} \sum_{k=1}^{n} \left( \tilde{m}_k^V - t_k^V \right).    (17)

It is shown in [1] that this solution minimizes the sum of the squared residuals.

As the correspondences of the scene and template points are not known in advance, each template point is paired with the closest scene point, i.e. the one that minimizes the squared Mahalanobis distance

d^2(t_i, s_j) = (t_i - s_j)^T \Sigma^{-1} (t_i - s_j),    (18)

where Σ is the covariance matrix of the vehicle position estimate.

The point pairing and the computation of the rotation and translation errors ΔR and Δx are repeated until the vehicle pose converges.

Finally, to refine the vehicle position estimate further and to cope with outliers that are mainly due to false lane marking detections, the iterative closest point algorithm is repeated with only a smaller subset of the matches m̃_j. This subset is determined by finding a maximum consensus for a rotation and translation error ΔR and Δx with the RANSAC algorithm [10]. Again, the pairing and pose computation are repeated until the pose converges.

The combination of a full least squares optimization with a subsequent RANSAC optimization has shown to give better convergence results in practice than a RANSAC optimization alone, while retaining the robustness to outliers of the RANSAC algorithm. Especially for large initial pose deviations, only a small set of points is paired correctly. In these cases, RANSAC tends to reject the correct matches, not leading to any convergence at all. However, for a good initial pose estimate, the RANSAC stage is very well suited to reject all false correspondences.
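One iteration of the point matching, i.e. the nearest-neighbour pairing of eq. (18) followed by the closed-form alignment of eqs. (13)-(17), could be sketched as follows. This is an illustrative reimplementation of the reconstruction above, not the authors' code; the covariance Σ, the outer convergence loop and the RANSAC refinement are left to the caller.

```python
import numpy as np

def pair_points(template, scene, sigma_inv):
    """Eq. (18): pair each template point with the scene point of minimal Mahalanobis distance."""
    pairs = []
    for t in template:
        d2 = [(t - s) @ sigma_inv @ (t - s) for s in scene]
        pairs.append(scene[int(np.argmin(d2))])
    return np.array(pairs)

def align(template, matched):
    """Eqs. (13)-(17): rotation and translation error for the paired point sets."""
    t_mean, m_mean = template.mean(axis=0), matched.mean(axis=0)
    t_bar, m_bar = template - t_mean, matched - m_mean
    W = m_bar.T @ t_bar                 # eq. (13): sum of outer products m_bar_i t_bar_i^T
    U, _, Vt = np.linalg.svd(W)
    dR = U @ Vt                         # eq. (16)
    dx = m_mean - t_mean                # eq. (17): difference of the two centroids
    return dR, dx
```

Alternating pair_points and align until the pose converges, and then re-running the loop on a RANSAC-selected subset of the matches, yields the two-stage scheme described above.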
B. Motion Tracking

Similar to conventional Inertial Navigation Systems, the Visual Navigation System obtains an estimate of the current vehicle pose by fusing vehicle motion estimates from visual odometry and the pose estimates from map matching in a Kalman filter structure. The underlying system model assumes vehicle motion with constant velocity and yaw rate in the x,y-ground plane of the world coordinate system. Figure 7 illustrates the vehicle motion. The vehicle pitch and roll angles as well as the camera pose are modeled as constant.

Fig. 7. Vehicle motion with constant yaw rate ω and instantaneous center of rotation P. The z-axis points into the image plane.

The vehicle position at time instant k+1 relative to the vehicle coordinate system at time instant k is given by

x_{k+1}^V = x_k^V + \begin{pmatrix} 0 \\ r \\ 0 \end{pmatrix} + R_\omega \begin{pmatrix} 0 \\ -r \\ 0 \end{pmatrix},    (19)

where r = v/ω is the radius of the circular path and R_ω is the rotation matrix around the z-axis

R_\omega = \begin{pmatrix} \cos(\omega T) & -\sin(\omega T) & 0 \\ \sin(\omega T) & \cos(\omega T) & 0 \\ 0 & 0 & 1 \end{pmatrix}.    (20)

Transformation to world coordinates gives the system model

x_{k+1}^W = x_k^W + R_V^W \begin{pmatrix} \frac{v}{\omega} \sin(\omega T) \\ \frac{v}{\omega} \left( 1 - \cos(\omega T) \right) \\ 0 \end{pmatrix},    (21)

where T is the duration of one time step and R_V^W denotes the rotation from vehicle to world coordinates.

The velocity and rotation vectors v and ω from visual odometry (see section II-B) and the pose estimates in geographical coordinates x^W, R from map matching (see section III-A) serve as measurements for the Kalman filter. The output of the Kalman filter is a vehicle pose estimate in world coordinates and a covariance matrix as a quality measure for the estimate.
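The system model of eq. (21) is the prediction step of the Kalman filter. The sketch below writes it for a planar pose (x, y, heading); the straight-line fallback for very small yaw rates is our own addition to avoid a division by zero and is not part of the reconstruction above.

```python
import numpy as np

def predict_pose(x_w, y_w, psi, v, omega, T):
    """Constant-velocity / constant-yaw-rate prediction of the planar vehicle pose.
    (x_w, y_w): world position, psi: heading, v: speed, omega: yaw rate, T: time step."""
    if abs(omega) < 1e-9:
        dx_v, dy_v = v * T, 0.0                          # straight-line limit
    else:
        dx_v = (v / omega) * np.sin(omega * T)           # displacement in vehicle coordinates,
        dy_v = (v / omega) * (1.0 - np.cos(omega * T))   # cf. eq. (21)
    # Rotate the displacement from vehicle into world coordinates (R_V^W in eq. (21)).
    c, s = np.cos(psi), np.sin(psi)
    return x_w + c * dx_v - s * dy_v, y_w + s * dx_v + c * dy_v, psi + omega * T
```

In the filter, the visual-odometry velocities and the map-matching pose then enter as measurements, as described above.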

IV. EXPERIMENTAL RESULTS

For experimental evaluation, the visual odometry results, the results from map matching and the fused pose estimates are compared.

Figure 8 shows an overlay of the yellow lane markings from the feature map with an example sequence of vehicle camera images for the different pose estimates. All sequences were initialized with the same exact pose estimate at t = 0. Only every 25th frame is shown, i.e. the i-th row shows the results at t = i seconds. Figure 9 shows all 50 pose estimates for the first 2 seconds relative to the initial pose¹.

Fig. 8. Overlay of the feature map (yellow) with vehicle camera images for different vehicle pose estimates and different time steps. The i-th row shows the pose estimate after i seconds, i.e. every 25th image is shown. Column (a): Integration of the vehicle motion estimates. Column (b): Unfiltered ICP matching results. Column (c): Filtered pose estimate with vehicle motion and map matching result as measurement data.

Figures 8(a) and 9 clearly show the accumulation of a positioning error over time when only the motion estimates are integrated. After 2 seconds, i.e. after 50 frames, the absolute accumulated position error is 0.56 m. On the other hand, the overlays from the map matching pose estimates in figure 8(b) match the real lane markings very well, except for some cases where a larger number of outliers is present. The image in the third row of 8(b) shows such an example. In figure 9, these cases can be noticed as blue dots that do not lie on the trajectory. These intermediate pose errors typically lie within 0.5 m and can be filtered very well by the extended Kalman filter, as can be seen in 8(c). The map overlay matches the real lane marking position in all cases.

Fig. 9. Vehicle pose estimates for the first 50 frames (i.e. 2 seconds). The y-axis points to the east, the x-axis to the south. The units are meters relative to the initial vehicle pose. Red: integrated vehicle motion estimates. Blue: map matching pose estimates. Green: filtered pose estimates.

It is worth noticing that, according to our previous experiments in [21], map matching alone can cope with position errors of up to 2 m. In combination with visual odometry, the pose matching remains reliable even if the map matching fails for a longer sequence of frames, e.g. if no lane markings are present at all. As long as the overall accumulated pose error remains below 2 m, the first valid matching instantaneously yields a correct position estimate. A quality measure for the current pose estimate can be obtained directly from the covariance matrix of the underlying Kalman filter.

Finally, figure 10 shows an overlay of the original aerial image, the extracted yellow lane markings and the vehicle for the pose estimate at t = 2 s. It clearly illustrates that the pose estimate in combination with the feature map can provide information about the environment even outside the current field of view. This information may be useful for other applications, e.g. determining drivable area or road curvature.

Fig. 10. Overlay of vehicle position, aerial image and lane markings (yellow) from the feature map.

¹ The coordinate system relative to the initial pose instead of geographical coordinates was chosen for reasons of comparability. Transformation to latitude/longitude using the geographically referenced aerial images is straightforward.

V. CONCLUSION

We described a system that matches features from an on-board camera to a previously generated feature map to obtain a precise vehicle localization result. Robustness of the vehicle pose estimate is increased by fusing the localization result with an ego-motion estimate which is also obtained from the on-board camera platform. The resulting pose estimates are accurate within several centimeters with respect to the feature map, even when a large number of false detections is present. Except for one pose estimate for initialization, the proposed Visual Navigation System relies on images as the only input data.

Future work will evaluate possibilities of using globally optimal methods for map matching. However, these methods are comparably slow and are very likely not suitable for real-time vehicle applications. Alternatively, a globally optimal method could be used for initialization and the proposed iterative method for pose tracking. Meanwhile, initialization with GPS appears to be a reasonable alternative for most practical purposes.

VI. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the contribution of the German collaborative research center on Cognitive Automobiles (SFB/Tr28), granted by Deutsche Forschungsgemeinschaft, and of the Karlsruhe School of Optics and Photonics (KSOP). Aerial images courtesy of Stadt Karlsruhe, VLW Geodaten.

REFERENCES

[1] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698-700, 1987.
[2] Y. Bar-Shalom, T. Kirubarajan, and X.-R. Li. Estimation with Applications to Tracking and Navigation. Wiley-Interscience, 2001.
[3] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.
[4] J. Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker. Technical report, included in the OpenCV library, 2002.

[5] T. S. Caetano, T. Caelli, D. Schuurmans, and D. A. C. Barone. Graphical models and point pattern matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1646-1663, Oct. 2006.
[6] T. Dang, C. Hoffmann, and C. Stiller. Self-calibration for active automotive stereo vision. In Proceedings of the IEEE Intelligent Vehicles Symposium, Tokyo, 2006.
[7] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052-1067, June 2007.
[8] C. Duchow. A novel, signal model based approach to lane detection for use in intersection assistance. In Proceedings of the IEEE Intelligent Transport Systems Conference, pages 1162-1167, 2006.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2nd edition, 2001.
[10] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, (24):381-395, 1981.
[11] C. Harris and M. Stephens. A combined corner and edge detector. In The Fourth Alvey Vision Conference, pages 147-151, 1988.
[12] J. Horn, A. Bachmann, and T. Dang. Stereo vision based ego-motion estimation with sensor supported subset validation. In Proceedings of the IEEE Intelligent Vehicles Symposium 2007, pages 741-748, Istanbul, Turkey, June 2007.
[13] S.-S. Ieng, J.-P. Tarel, and R. Labayrade. On the design of a single lane-markings detector regardless the on-board camera's position. In Proceedings of the IEEE Intelligent Vehicles Symposium 2003, pages 564-569, 2003.
[14] Z. Kim. Realtime lane tracking of curved local road. In Proceedings of the IEEE Intelligent Transport Systems Conference, pages 1149-1155, 2006.
[15] T. Lemaire, C. Berger, I. Jung, and S. Lacroix. Vision-based SLAM: Stereo and monocular approaches. International Journal of Computer Vision, 74(3):343-364, September 2007.
[16] H. C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proceedings of the Royal Society of London, Series B, 208:385-397, 1980.
[17] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop, pages 121-130, 1981.
[18] R. Martinez-Cantin and J. Castellanos. Unscented SLAM for large-scale outdoor environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2005, pages 3427-3432, Aug. 2005.
[19] R. Munguia and A. Grau. Monocular SLAM for visual odometry. In IEEE International Symposium on Intelligent Signal Processing (WISP) 2007, pages 1-6, Oct. 2007.
[20] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2004, volume 1, pages I-652 - I-659, June 2004.
[21] O. Pink. Visual map matching and localization using a global feature map. In CVPR Workshop on Visual Localization for Mobile Platforms, 2008.
[22] R. Sim, P. Elinas, and J. J. Little. A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. International Journal of Computer Vision, 74(3):303-318, 2007.
[23] P. B. van Wamelen, Z. Li, and S. S. Iyengar. A fast expected time algorithm for the 2-D point pattern matching problem. Pattern Recognition, 37(8):1699-1711, 2004.

