Visual Features For Vehicle Localization and Ego-Motion Estimation
Abstract: This paper introduces a novel method for vehicle pose estimation and motion tracking using visual features. The method combines ideas from research on visual odometry with a feature map that is automatically generated from aerial images into a Visual Navigation System. Given an initial pose estimate, e.g. from a GPS receiver, the system is capable of robustly tracking the vehicle pose in geographical coordinates over time, using image data as the only input.
Experiments on real image data have shown that the precision of the position estimate with respect to the feature map typically lies within only several centimeters. This makes the algorithm interesting for a wide range of applications like navigation, path planning or lane keeping.

I. INTRODUCTION

A precise digital representation of the road network is crucial for autonomous navigation. At present, the coverage of such high-precision digital maps is very low and mainly focused on some major cities, as the generation of these maps is costly and time-consuming.
Highly detailed aerial images, which are publicly available for virtually any region of the world, present a good alternative to manual map generation. In this paper, we present a novel method for vehicle localization that matches data from an on-board stereo camera rig to a digital feature map which is automatically created from aerial imagery. Except for one GPS measurement for position initialization, the method does not rely on any sensor data other than on-board and aerial imagery.
As a main advantage over existing visual odometry and visual SLAM approaches, our method delivers a pose estimate not only relative to a local coordinate frame, but in geographical coordinates. Furthermore, error integration over time and loop-closure problems are avoided by the use of the pre-built digital map.

A. Related Work

In recent years, research on the simultaneous localization and mapping (SLAM) problem has been brought from indoor applications to large-scale outdoor scenes [18], which makes these methods interesting for driver assistance applications. While SLAM methods were originally based on range measurements from a laser range finder, recent work has also focused on developing camera-based approaches to SLAM [7][15][22].
For SLAM, it is desirable to obtain a robust estimate of the ego-motion, which is commonly realized with on-board inertial sensors. Current research in a field often referred to as visual odometry has shown the possibility of estimating the motion from an on-board camera only [20][19][12]. However, visual odometry suffers from the same problems as inertial-sensor based odometry, i.e. the vehicle pose error integrates over time. Similarly, a typical problem of SLAM approaches is the loop closure when an already mapped point is visited again. Furthermore, a map built from scratch is not suitable for long-distance path planning in large-scale environments, and building an entire road network from on-board sensor data is very time-consuming.

B. Objective Formulation

To overcome the limitations of existing SLAM approaches for large-scale outdoor scenes, we introduce a pre-built feature map of the environment instead of building up the map from on-board sensor data only. This feature map is generated from widely available, geographically referenced aerial imagery and is matched to visual features from the on-board camera platform.
The basic idea of localization by matching on-board camera features to a global map has already been demonstrated in [21]. In this work, we combine this idea with additional information from a visual odometry system for increased robustness and precision of the pose estimate.
Similar to conventional Inertial Navigation Systems, which fuse motion estimates from inertial sensors with an absolute GPS pose estimate [2], the vehicle pose estimate from map matching is fused with the vehicle motion estimates from visual odometry by a Kalman filter. Since our approach relies only on visual data from aerial and on-board cameras, except for one GPS measurement for pose initialization, we dub it a Visual Navigation System.
Figure 1 shows examples of the aerial images and on-board camera images that will be matched for vehicle localization.

Fig. 1. Example vehicle camera image (a) and aerial image (b).

The remainder of this paper details the components of the Visual Navigation System. Section II-A describes the generation of the digital map from aerial imagery.
This section deals with the necessary processing of both on-board and aerial images. The images will be matched using features that have to be extracted from both views. To distinguish between low-level image features like corners or edges and the features used for matching, the latter will be denoted landmarks in the following.
The landmarks have to be visible from both views and ideally should not exhibit large dimensions or perspective distortion. Our approach makes use of lane markings as landmarks, since they are clearly visible from both views and since their detection is well covered in the literature [8][13][14]. However, the method is applicable to any other kind of landmark, and introducing further landmark classes may even improve the results.
The landmark extraction from aerial images is detailed in section II-A. The landmark extraction from the on-board camera images and the determination of the vehicle motion are described in section II-B.

A. Aerial Images

The landmark extraction from aerial images is performed by classifying each pixel as to whether it belongs to a lane marking or not. This is done by a Support Vector Machine classifier [9], which is trained by manually selecting some positive and negative samples from the aerial images. The manual training and the comparably long computation times are tolerable, since the landmark extraction from aerial images can be performed off-line, and only a very small training set is required to classify large map areas.
As the aerial imagery contains no height information, the classification result is a planar map, i.e. all detected features are assumed to lie in one ground plane.
Figure 2 shows a detail of an aerial image and the corresponding classification result.

Fig. 2. Original aerial image (a) and classification result (b).

The lane marking pixels are finally clustered into lane markings according to their distances, and the centroid of each lane marking is stored in the feature map. Some additional properties like the length and orientation of each marking are also stored for visualization purposes, but they are not used for localization.
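The following minimal Python sketch illustrates this map-building step. It assumes an already classified binary marking mask and an aerial image with a known geo-reference and ground resolution; connected-component labelling is used here as a simple stand-in for the distance-based clustering, and all names are illustrative.

```python
# Sketch of the off-line map generation: classified lane-marking pixels are
# grouped into individual markings and reduced to their centroids, which are
# then converted to geo-referenced map coordinates.
import numpy as np
from scipy import ndimage


def build_feature_map(marking_mask, origin_east_north, metres_per_pixel):
    """marking_mask: 2D boolean array from the pixel classifier (True = lane marking)."""
    labels, num_markings = ndimage.label(marking_mask)        # group adjacent marking pixels
    centroids_px = ndimage.center_of_mass(
        marking_mask, labels, index=range(1, num_markings + 1))
    feature_map = []
    for row, col in centroids_px:
        # convert pixel coordinates to the geo-referenced map frame
        east = origin_east_north[0] + col * metres_per_pixel
        north = origin_east_north[1] - row * metres_per_pixel
        feature_map.append((east, north))
    return np.asarray(feature_map)
```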
B. Vehicle Camera

The images from the on-board camera platform serve two purposes. First, lane markings have to be detected and their 3D position has to be estimated for map matching. Second, the 3D vehicle motion has to be estimated from the 2D motion of image features.
For the latter, the image displacement, i.e. the 2D motion of image points between consecutive frames, is computed for a salient set of image points. These image points are determined using an interest point detector such as the Harris corner detector [11].
Given the displacement, the motion components in 3D space are computed using depth information from a stereo image pair. In the following, it is assumed that both the optical flow and the disparity are determined with respect to the right camera image. Furthermore, the camera platform is assumed to be fully calibrated, and all image coordinates are assumed to be given in normalized coordinates, i.e. with focal length f = 1 and the image center located at c = (0, 0)^T.
For a given 3D scene point X = (X, Y, Z)^T, the corresponding position x in the image plane is

x = \begin{pmatrix} y \\ z \end{pmatrix} = \pi(X) = \frac{f}{X} \begin{pmatrix} Y \\ Z \end{pmatrix}.    (1)

The depth information of the interest points can be recovered by measuring the horizontal separation of two corresponding points in a rectified stereo image pair. Given the image coordinates y_L in the left and y_R in the right image along an epipolar line, the disparity is related to
the depth X of a scene point by

X = \frac{b}{\delta},    (2)

where \delta = y_L - y_R is the disparity and b is the base length of the stereo rig. Figure 3 shows the model of the stereo rig.

Fig. 3. The model of the stereo rig. The coordinate system of the right camera coincides with the global coordinate system. For any 3D scene point X, its projection \pi(X) into the image plane can be described by eq. (1).

Assuming a rigid scene, the motion of all scene points is described by the same translational velocity v = (v_x, v_y, v_z)^T and rotational velocity \omega = (\omega_x, \omega_y, \omega_z)^T. With the Longuet-Higgins equations [16], the relation between the 6 DoF velocity and the 2D displacement can be described adequately. Defining the displacement of an image point as its motion across the image plane, (\dot{y}, \dot{z})^T = (u, v)^T, the Longuet-Higgins equations for the case of translational and rotational rigid motion can be written as

\dot{y} = \frac{\dot{Y}}{X} - \frac{Y}{X^2}\dot{X} = \frac{y v_x - v_y}{X} + \omega_x z + \omega_y y z - \omega_z (1 + y^2),    (3)

\dot{z} = \frac{\dot{Z}}{X} - \frac{Z}{X^2}\dot{X} = \frac{z v_x - v_z}{X} - \omega_x y + \omega_y (1 + z^2) - \omega_z y z.    (4)

Substitution of equation (2) into equations (3) and (4) and separation of the translational and the rotational motion components delivers the 2D image displacement (u_i, v_i) of the i-th interest point for a given camera motion,

\begin{pmatrix} u_i \\ v_i \end{pmatrix} = H_i (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)^T    (5)

with

H_i = \begin{pmatrix} y_i \delta_i / b & -\delta_i / b & 0 & z_i & y_i z_i & -(1 + y_i^2) \\ z_i \delta_i / b & 0 & -\delta_i / b & -y_i & 1 + z_i^2 & -y_i z_i \end{pmatrix}.

Stacking equation (5) for N interest points with displacements (u_i, v_i)^T delivers a linear observation model with 2N equations,

\begin{pmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{pmatrix} + e = \begin{pmatrix} u_1 \\ v_1 \\ u_2 \\ v_2 \\ \vdots \\ u_N \\ v_N \end{pmatrix},    (6)

which can be solved using straightforward least squares estimation,

\begin{pmatrix} v \\ \omega \end{pmatrix} = (H^T H)^{-1} H^T u.    (7)

However, the least squares estimate is very sensitive to outliers, which, in this case, are mainly due to violations of the rigid scene assumption. Other causes are e.g. errors in the disparity or displacement calculation. Especially for small disparities \delta \approx 0, the observation matrix H becomes ill-conditioned; therefore, interest points with a disparity below a certain threshold are discarded. Furthermore, instead of using all interest points for the least squares estimate, a maximum set of N inliers is determined using the RANSAC estimator [10], which has proved to yield good results even with a large number of outliers present.
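As an illustration of equations (5)-(7), the following Python sketch builds the two observation rows per interest point, discards points with small disparities, and solves the stacked system in the least-squares sense. The RANSAC wrapper around this estimate is omitted, and the interface, threshold value and variable names are assumptions rather than the original implementation.

```python
# Least-squares ego-motion from stacked Longuet-Higgins observations, cf. eqs. (5)-(7).
import numpy as np


def observation_rows(y, z, delta, b):
    """Two rows of H_i for one interest point with normalized coordinates (y, z)."""
    return np.array([
        [y * delta / b, -delta / b, 0.0, z, y * z, -(1.0 + y**2)],
        [z * delta / b, 0.0, -delta / b, -y, 1.0 + z**2, -y * z],
    ])


def estimate_motion(points, displacements, disparities, b, min_disparity=0.01):
    """Solve eq. (7) for the 6-DoF velocity (vx, vy, vz, wx, wy, wz)."""
    H_blocks, u_blocks = [], []
    for (y, z), (u, v), delta in zip(points, displacements, disparities):
        if delta < min_disparity:       # near-zero disparity makes the rows ill-conditioned
            continue
        H_blocks.append(observation_rows(y, z, delta, b))
        u_blocks.append([u, v])
    H = np.vstack(H_blocks)             # 2N x 6 observation matrix of eq. (6)
    u = np.concatenate(u_blocks)        # stacked displacement measurements
    motion, *_ = np.linalg.lstsq(H, u, rcond=None)
    return motion
```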
For the displacement estimation, the hierarchical Lucas-Kanade algorithm ([17], [4]) is used. The disparity is computed by area-based block matching ([6]). For the subsequent Kalman filtering (section III-B), the residuals of both estimation procedures are used to determine the measurement noise.
Figure 4 shows an example result of the optical flow estimation and figure 5 shows an example disparity matching result. Only interest points with sufficiently low residuals and a disparity above the threshold are shown. Only the green interest points are used for the least squares motion estimation, while the red points were rejected by the RANSAC algorithm.

Fig. 4. Optical flow vectors for the detected feature points. Green points are used for ego-motion estimation.

Fig. 5. Correspondences between left and right camera image. Green points are used for ego-motion estimation.
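A possible realization of these two measurement steps with off-the-shelf OpenCV building blocks is sketched below; it is not the authors' implementation. The pyramidal Lucas-Kanade tracker and the block-matching stereo correlator stand in for the cited components, the file names and parameter values are purely illustrative, and OpenCV's StereoBM references the left image, whereas the paper uses the right image as reference.

```python
# Displacement (optical flow) and disparity measurements with OpenCV.
import cv2
import numpy as np

prev_right = cv2.imread("right_t0.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
curr_right = cv2.imread("right_t1.png", cv2.IMREAD_GRAYSCALE)
curr_left = cv2.imread("left_t1.png", cv2.IMREAD_GRAYSCALE)

# Harris-based interest points in the previous right image.
points = cv2.goodFeaturesToTrack(prev_right, maxCorners=500, qualityLevel=0.01,
                                 minDistance=10, useHarrisDetector=True)

# Hierarchical (pyramidal) Lucas-Kanade tracking between consecutive frames.
tracked, status, err = cv2.calcOpticalFlowPyrLK(prev_right, curr_right, points, None)
displacement = (tracked - points)[status.ravel() == 1]

# Dense disparity map via area-based block matching (left-referenced in OpenCV).
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(curr_left, curr_right).astype(np.float32) / 16.0
```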
The visual landmarks for map matching are yet to be determined. For this purpose, we make use of the same combination of Harris interest point detection and disparity estimation that was already used for the motion estimation: in the lower part of the camera image, the majority of the Harris corner responses lies on corners of the lane markings. With equations (1) and (2), the 3D scene coordinates X of a given interest point with image coordinates x = (y, z)^T and disparity \delta are

X = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \frac{b}{\delta} \begin{pmatrix} 1 \\ y \\ z \end{pmatrix}.    (8)

Figure 6(a) shows an example corner detection result for the lower part of the image. Unlike in figure 4, corners that were rejected for motion estimation due to a high optical flow residual are also displayed.
Typically, up to four corners belong to one lane marking, thus the corners have to be clustered accordingly.
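The following small helper illustrates equation (8): a detected corner with normalized image coordinates (y, z) and measured disparity is lifted to a 3D point in the right-camera frame, and a cluster of corners is reduced to one landmark position. The numbers in the usage example are made up for illustration.

```python
# Back-projection of lane-marking corners according to eqs. (1), (2) and (8).
import numpy as np


def backproject(y, z, delta, b):
    """3D scene point X = (X, Y, Z)^T for normalized image coordinates and disparity."""
    depth = b / delta                      # eq. (2): X = b / delta
    return depth * np.array([1.0, y, z])   # eq. (8)


# Example: two corners of one marking, 0.5 m baseline, roughly 10 m ahead.
corners = [(0.12, -0.31, 0.050), (0.13, -0.30, 0.048)]   # (y, z, disparity)
points_3d = np.array([backproject(y, z, d, b=0.5) for y, z, d in corners])
landmark = points_3d.mean(axis=0)          # one landmark position per corner cluster
```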
The goal of the ICP algorithm is now to recover the position and orientation errors \Delta x and \Delta R, given the pose estimate x^W and R. Assuming that for each template point t_i^V the corresponding scene point m_i^W = s_j^W is known, the optimal rotation and translation can be determined by a translation of the centroid position and a rotation around the centroid, which can be computed efficiently from a singular value decomposition (SVD) of the matrix

W = \sum_{i=1}^{n} \tilde{m}_i \tilde{t}_i^T = U \Sigma V^T,    (13)

where \tilde{t}_i and \tilde{m}_i are the coordinates of the template and matched scene points relative to their respective centroid positions:

\tilde{t}_i = t_i^V - \frac{1}{n} \sum_{k=1}^{n} t_k^V,    (14)

\tilde{m}_i = m_i^W - \frac{1}{n} \sum_{k=1}^{n} m_k^W.    (15)

For a good initial pose estimate, the RANSAC stage is very well suited to reject all false correspondences.
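A compact sketch of this centroid/SVD alignment step (in the spirit of Arun et al. [1]) is given below, assuming that the matched template and scene points are available as 2D ground-plane coordinates; the correspondence search and the RANSAC rejection stage are not shown.

```python
# Optimal rotation and translation between matched point sets via SVD, cf. eqs. (13)-(15).
import numpy as np


def align(template, scene):
    """template, scene: (n, 2) arrays of matched points; returns R, t with scene ~ R @ template + t."""
    t_centroid = template.mean(axis=0)
    m_centroid = scene.mean(axis=0)
    t_rel = template - t_centroid            # eq. (14)
    m_rel = scene - m_centroid               # eq. (15)
    W = m_rel.T @ t_rel                      # eq. (13): cross-covariance matrix
    U, _, Vt = np.linalg.svd(W)
    if np.linalg.det(U @ Vt) < 0:            # guard against a reflection
        U[:, -1] *= -1
    R = U @ Vt                               # optimal rotation
    t = m_centroid - R @ t_centroid          # translation between the centroids
    return R, t
```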
B. Motion Tracking

Similar to conventional Inertial Navigation Systems, the Visual Navigation System obtains an estimate of the current vehicle pose by fusing the vehicle motion estimates from visual odometry and the pose estimates from map matching in a Kalman filter structure. The underlying system model assumes vehicle motion with constant velocity and yaw rate in the x,y-ground plane of the world coordinate system. Figure 7 illustrates the vehicle motion. The vehicle pitch and roll angles as well as the camera pose are modeled as constant.
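The prediction step of such a constant-velocity, constant-yaw-rate model is sketched below. The state layout is an assumption for illustration only and does not reproduce the authors' exact filter design; the map-matching pose and the visual-odometry motion estimate would enter as measurements in the (omitted) update step.

```python
# Prediction step for a constant-velocity, constant-yaw-rate vehicle model.
import numpy as np


def predict(state, dt):
    """state = (x, y, heading, speed, yaw_rate) in the world x,y-ground plane."""
    x, y, psi, v, psi_dot = state
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi += psi_dot * dt
    return np.array([x, y, psi, v, psi_dot])   # speed and yaw rate stay constant


# One second of prediction at the 25 Hz camera rate from an initial pose estimate.
state = np.array([0.0, 0.0, 0.0, 10.0, 0.05])
for _ in range(25):
    state = predict(state, dt=1.0 / 25.0)
```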
IV. EXPERIMENTAL RESULTS

For the experimental evaluation, the visual odometry results, the results from map matching and the fused pose estimates are compared.
Figure 8 shows an overlay of the yellow lane markings from the feature map with an example sequence of vehicle camera images for the different pose estimates. All sequences were initialized with the same exact pose estimate at t = 0. Only every 25th frame is shown, i.e. the i-th row shows the results at t = i seconds. Figure 9 shows all 50 pose estimates for the first 2 seconds relative to the initial pose.¹

¹The coordinate system relative to the initial pose was chosen instead of geographical coordinates for reasons of comparability. The transformation to latitude/longitude using the geographically referenced aerial images is straightforward.

Fig. 8. Overlay of the feature map (yellow) with vehicle camera images for different vehicle pose estimates and different time steps. The i-th row shows the pose estimate after i seconds, i.e. every 25th image is shown. Column (a): integration of the vehicle motion estimates. Column (b): unfiltered ICP matching results. Column (c): filtered pose estimate with vehicle motion and map matching result as measurement data.

Figures 8(a) and 9 clearly show the accumulation of a positioning error over time when only the motion estimates are integrated: after 2 seconds (50 frames), the absolute accumulated position error is 0.56 m. On the other hand, the overlays from the map matching pose estimates in figure 8(b) match the real lane markings very well, except for some cases where a larger number of outliers is present; the image in the third row of figure 8(b) shows such an example. In figure 9, these cases can be noticed as blue dots that do not lie on the trajectory. These intermediate pose errors typically lie within 0.5 m and can be filtered very well by the extended Kalman filter, as can be seen in figure 8(c), where the map overlay matches the real lane marking position in all cases.

Fig. 9. Vehicle pose estimates for the first 50 frames (i.e. 2 seconds). The y-axis points to the east, the x-axis to the south. The units are meters relative to the initial vehicle pose. Red: integrated vehicle motion estimates. Blue: map matching pose estimates. Green: filtered pose estimates.

It is worth noticing that, according to our previous experiments in [21], map matching alone can cope with position errors of up to 2 m. In combination with visual odometry, the pose estimate remains reliable even if the map matching fails for a longer sequence of frames, e.g. if no lane markings are present at all. As long as the overall accumulated pose error remains below 2 m, the first valid match instantaneously yields a correct position estimate. A quality measure for the current pose estimate can be obtained directly from the covariance matrix of the underlying Kalman filter.
Finally, figure 10 shows an overlay of the original aerial image, the extracted yellow lane markings and the vehicle for the pose estimate at t = 2 s. It clearly illustrates that the pose estimate in combination with the feature map can provide information about the environment even outside the current field of view. This information may be useful for other applications, e.g. determining the drivable area or the road curvature.

Fig. 10. Overlay of vehicle position, aerial image and lane markings (yellow) from the feature map.

V. CONCLUSION

We described a system that matches features from an on-board camera to a previously generated feature map to obtain a precise vehicle localization result. Robustness of the vehicle pose estimate is increased by fusing the localization result with an ego-motion estimate which is also obtained from the on-board camera platform. The resulting pose estimates are accurate to within several centimeters with respect to the feature map, even when a large number of false detections is present. Except for one pose estimate for initialization, the proposed Visual Navigation System relies on images as the only input data.
Future work will evaluate possibilities of using globally optimal methods for map matching. However, these methods are comparably slow and are very likely not suitable for real-time vehicle applications. Alternatively, a globally optimal method could be used for initialization and the proposed iterative method for pose tracking. Meanwhile, initialization with GPS appears to be a reasonable alternative for most practical purposes.

VI. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the contribution of the German collaborative research center on Cognitive Automobiles (SFB/Tr28), granted by Deutsche Forschungsgemeinschaft, and of the Karlsruhe School of Optics and Photonics (KSOP). Aerial images courtesy of Stadt Karlsruhe, VLW Geodaten.
REFERENCES

[1] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698-700, 1987.
[2] Y. Bar-Shalom, T. Kirubarajan, and X.-R. Li. Estimation with Applications to Tracking and Navigation. Wiley-Interscience, 2001.
[3] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.
[4] J. Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker. Technical report, included in the OpenCV library, 2002.
[5] T. S. Caetano, T. Caelli, D. Schuurmans, and D. A. C. Barone. Graphical models and point pattern matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1646-1663, Oct. 2006.
[6] T. Dang, C. Hoffmann, and C. Stiller. Self-calibration for active automotive stereo vision. In Proceedings of the IEEE Intelligent Vehicles Symposium, Tokyo, 2006.
[7] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052-1067, June 2007.
[8] C. Duchow. A novel, signal model based approach to lane detection for use in intersection assistance. In Proceedings of the IEEE Intelligent Transport Systems Conference, pages 1162-1167, 2006.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2nd edition, 2001.
[10] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
[11] C. Harris and M. Stephens. A combined corner and edge detector. In The Fourth Alvey Vision Conference, pages 147-151, 1988.
[12] J. Horn, A. Bachmann, and T. Dang. Stereo vision based ego motion estimation with sensor supported subset validation. In Proceedings of the IEEE Intelligent Vehicles Symposium 2007, pages 741-748, Istanbul, Turkey, June 2007.
[13] S.-S. Ieng, J.-P. Tarel, and R. Labayrade. On the design of a single lane-markings detector regardless the on-board camera's position. In Proceedings of the IEEE Intelligent Vehicles Symposium 2003, pages 564-569, 2003.
[14] Z. Kim. Realtime lane tracking of curved local road. In Proceedings of the IEEE Intelligent Transport Systems Conference, pages 1149-1155, 2006.
[15] T. Lemaire, C. Berger, I. Jung, and S. Lacroix. Vision-based SLAM: Stereo and monocular approaches. International Journal of Computer Vision, 74(3):343-364, September 2007.
[16] H. C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proceedings of the Royal Society of London, Series B, 208:385-397, 1980.
[17] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop, pages 121-130, 1981.
[18] R. Martinez-Cantin and J. Castellanos. Unscented SLAM for large-scale outdoor environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2005, pages 3427-3432, Aug. 2005.
[19] R. Munguia and A. Grau. Monocular SLAM for visual odometry. In IEEE International Symposium on Intelligent Signal Processing (WISP) 2007, pages 1-6, Oct. 2007.
[20] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2004, volume 1, pages I-652-I-659, June-July 2004.
[21] O. Pink. Visual map matching and localization using a global feature map. In CVPR Workshop on Visual Localization for Mobile Platforms, 2008.
[22] R. Sim, P. Elinas, and J. J. Little. A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. International Journal of Computer Vision, 74(3):303-318, 2007.
[23] P. B. van Wamelen, Z. Li, and S. S. Iyengar. A fast expected time algorithm for the 2-D point pattern matching problem. Pattern Recognition, 37(8):1699-1711, 2004.