INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Applications of non-metric vision to some
visually guided robotics tasks
Luc Robert, Cyril Zeller, Olivier Faugeras
and
Martial Hébert
N˚ 2584
Juin 1995
PROGRAMME 4
ISSN 0249-6399
Applications of non-metric vision to some visually guided
robotics tasks
Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert
Programme 4 — Robotique, image et vision
Projet Robotvis
Rapport de recherche n˚2584 — Juin 1995 — 54 pages
Abstract: We usually think of the physical space as being embedded in a three-dimensional Euclidean space where measurements of lengths and angles do make sense. It turns out that for artificial
systems, such as robots, this is not a mandatory viewpoint and that it is sometimes sufficient to think
of the physical space as being embedded in an affine or even projective space. The question then
arises of how to relate these geometric models to image measurements and to geometric properties
of sets of cameras.
We first consider that the world is modelled as a projective space and determine how projective
invariant information can be recovered from the images and used in applications. Next we consider
that the world is an affine space and determine how affine invariant information can be recovered
from the images and used in applications. Finally, we do not move to the Euclidean layer, because this
is the layer in which everybody else has been working from the early days on, but rather to
an intermediate level between the affine and Euclidean ones. For each of the three layers we explain
various calibration procedures, from fully automatic ones to ones that use some a priori information.
The calibration increases in difficulty from the projective to the Euclidean layer at the same time as
the information that can be recovered from the images becomes more and more specific and detailed.
The two main applications that we consider are the detection of obstacles and the navigation of a
robot vehicle.
Key-words: projective, affine, Euclidean geometry, stereo, motion, self-calibration, robot navigation, obstacle avoidance
This work was partially funded by the EEC under Esprit projects 6448 (VIVA) and 8878 (REALISE).
Email: {lucr,faugeras,zeller}@sophia.inria.fr
Email: [email protected]
Unité de recherche INRIA Sophia-Antipolis
2004 route des Lucioles, BP 93, 06902 SOPHIA-ANTIPOLIS Cedex (France)
Téléphone : (33) 93 65 77 77 – Télécopie : (33) 93 65 77 65
Applications of non-metric vision to vision-guided robotics tasks
Résumé : We usually consider that physical space can be represented by a three-dimensional Euclidean space in which the notions of length and angle are meaningful. It turns out that for artificial systems such as robots this is not necessary, and that it is sometimes sufficient to consider physical space as represented by an affine or even projective space. The problem then arises of how to relate these geometric models to the measurements made in the images and to the geometric properties of the cameras.
We first consider that the world is represented by projective space, and we determine how projectively invariant information can be extracted from the images and used in a number of applications. Next, we consider that the world is an affine space, and we determine how affinely invariant information can be extracted from the images and used in robotics applications. Finally, we do not go all the way to the Euclidean level, since this is the level that everybody has considered from the beginning, but restrict ourselves to an intermediate level between the affine and Euclidean ones. For each of these three levels, we describe a number of calibration procedures, some fully automatic, others using a priori knowledge. When going from the projective to the Euclidean level the calibration becomes more and more difficult, but the information that can be extracted from the images becomes more and more specific and detailed. The two main applications that we consider are obstacle detection and mobile robot navigation.
Mots-clé : projective, affine, Euclidean geometry, stereo, motion, self-calibration, mobile robotics, obstacle avoidance
1 Introduction
Many visual tasks require recovering 3-D information from sequences of images. This chapter takes
the natural point of view that, depending on the task at hand, some geometric information is relevant
and some is not. Therefore, the questions of exactly what kind of information is necessary for a given task, how it can be computed from the data, after which preprocessing steps, are central to our
discussion. Since we are dealing with geometric information, a very natural question that arises from
the previous ones is the question of the invariance of this information under various transformations.
An obvious example is viewpoint invariance which is of course of direct concern to us.
It turns out that viewpoint invariance can be separated into three components: invariance to changes
of internal parameters of the cameras, i.e., to some changes of coordinates in the images, invariance to
some transformations of space, and invariance to perspective projection via the imaging process.
Thus, the question of viewpoint invariance is mainly concerned with the invariance of geometric information to certain two- and three-dimensional transformations. It turns out that a neat way
to classify geometric transformations is by considering the projective, affine, and Euclidean groups
of transformations, each being a subgroup of the next (Euclidean of affine, affine of projective), and each one can be thought of
as determining an action on geometric configurations. For example, applying a rigid displacement
to a camera does not change the distances between points in the scene but in general changes their
distances in the images. These actions determine three natural layers, or strata, in the processing of visual
information. This has the advantages of 1) clearly identifying the information that needs to be collected
from the images in order to "calibrate" the vision system with respect to each of the three strata, and
2) identifying the 3-D information that can thereafter be recovered from those images.
Point 1) can be considered as the definition of the preprocessing which is necessary in order to
be able to recover 3-D geometric information which is invariant to transformations of the given subgroup. Point 2) is the study of how such information can effectively be recovered from the images.
This viewpoint has been adopted in [4]. In this chapter we follow the same track and enrich it considerably on two counts. From the theoretical viewpoint, the analysis is broadened to include a detailed
study of the relations between the images and a number of 3-D planes which are then used in the
development of the second viewpoint (absent in [4]): the viewpoint of the applications.
To summarize, we will first consider that the world is modeled as a projective space and determine
how projective invariant information can be recovered from the images and used in applications. Next
we will consider that the world is an affine space and determine how affine invariant information can
be recovered from the images and used in applications. Finally, we will not move to the Euclidean
layer, because this is the layer in which everybody else has been working from the early days on,
but rather to an intermediate level between the affine and Euclidean ones. For each of the three layers
we explain various calibration procedures, from fully automatic ones to ones that use some a priori
information. Clearly, the calibration increases in difficulty from the projective to the Euclidean layer
at the same time as the information that can be recovered from the images becomes more and more
specific and detailed. The two main applications that we consider are the detection of obstacles and
the navigation of a robot vehicle.
Section (2) describes the model used for the camera and its relation to the three-dimensional
scene. After deriving from this model a number of relations between two views, we analyze the links
between the partial knowledge of the model's parameters and the invariant properties of the scene
reconstructed from those two views. Section (3) describes techniques to compute some of the model's
parameters without assuming full calibration of the cameras. Section (4) describes the technique of
rectification with respect to a plane. This technique, which does not require full calibration of the
cameras either, makes it possible to compute information on the structure of the scene and is at the basis
of all the remaining sections. Section (5) shows how to locate 3-D points with respect to a plane.
Section (6) shows how to compute local surface orientations. Lastly, section (7) presents several obstacle
avoidance and navigation applications based on a partially calibrated stereo rig.
2 Stratification of the reconstruction process
In this section we investigate the relations between the three-dimensional structure of the scene and its
images taken by one or several cameras. We define three types of three-dimensional reconstructions
that can be obtained from such views. These reconstructions are obtained modulo the action of one of
the three groups, Euclidean, affine, and projective considered as acting on the scene. For example, to
say that we have obtained a projective (resp. affine, Euclidean) reconstruction of the scene means that
the real scene can be obtained from this reconstruction by applying to it an unknown projective (resp.
affine, Euclidean) transformation. Therefore the only properties of the scene that can be recovered
from this reconstruction are those which are invariant under the group of projective (resp. affine,
Euclidean) transformations. A detailed analysis of this stratification can be found in [4].
We also relate the possibility of obtaining such reconstructions to the amount of information that
needs to be known about the set of cameras in a quantitative manner, through a set of geometric parameters such as the fundamental matrix [5] of a pair of cameras, the collineation of the plane at infinity,
and the intrinsic and extrinsic parameters of the cameras.
We first recall some properties of the classical pinhole camera model, which we will use in the
remainder of the chapter. Then, we analyze the dissimilarity (disparity) between two pinhole images
of a scene, and its relation to three-dimensional structure.
2.1 Notations
We assume that the reader has some familiarity with projective geometry at the level of [6, 23] and
with some of its basic applications to computer vision such as the use of the fundamental matrix [5].
We will be using the following notation. Geometric entities such as points, lines, planes, etc., are
represented by normal latin or greek letters; upper-case letters usually represent 3-D objects, lower-case
letters 2-D (image based) objects. When these geometric entities are represented by vectors or matrices,
they appear in boldface. For example, m represents a pixel and m its coordinate vector, M represents a
3-D point and M its coordinate vector.

The line going through M and N is represented by <M, N>. For a three-dimensional vector x, we
note [x]_x the 3x3 antisymmetric matrix such that x × y = [x]_x y for all vectors y, where × indicates
the cross-product. I_3 represents the 3x3 identity matrix, 0_3 the 3x1 null vector.

We will also be using projective, affine and Euclidean coordinate frames. They are denoted by
the letter F, usually indexed by a point, as in F_C; this notation means that the frame F_C is either an
affine or a Euclidean frame of origin C. To indicate that the coordinates of a vector M are expressed
in the frame F, we write M_/F. Given two coordinate frames F_1 and F_2, we note K_{F_1->F_2} the
matrix of change of coordinates from frame F_1 to frame F_2, i.e. we have M_/F_2 = K_{F_1->F_2} M_/F_1.
Note that K_{F_2->F_1} = (K_{F_1->F_2})^{-1}.
2.2 The camera
The camera model that we use is the classical pinhole model. Widely used in computer vision, it captures quite accurately the actual geometry of many real imaging devices. It is also very general, and
encompasses many camera models used in computer vision, such as perspective, weak-perspective,
paraperspective, affine, parallel or orthographic projection. It can be described mathematically as
follows: the object space is considered to be the three-dimensional Euclidean space R^3 embedded in
the usual way in the three-dimensional projective space P^3, and the image space to be the two-dimensional
Euclidean space R^2 embedded in the usual way in the two-dimensional projective space P^2; the camera
is then described as a linear projective application from P^3 to P^2 (see [6]). We can write the projection
matrix in any object frame F_O of P^3:

    P~ = A P_0 K_{F_O->F_C},   with   P_0 = [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ]        (1)

where A is the matrix of the intrinsic parameters and C the optical center (see figure 1). The special
frame F_C in which the projection matrix of the camera is equal to P_0 is called the normalized
camera frame.

Figure 1: The pinhole model.

In particular, the projection equation, relating a point not in the focal plane, M_/F_C = [x_C, y_C, z_C, t_C]^T,
expressed in the normalized camera frame, to its projection m_/F_I = [x, y, 1]^T, expressed in the
image frame and written m for simplicity, is

    z_C m = A P_0 M_/F_C        (2)
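For illustration, the projection of equations (1)-(2) can be exercised numerically as in the following minimal Python sketch; the intrinsic values and the pose used in the example are placeholders, not values from the report.

import numpy as np

def project(A, R, t, M_world):
    """Pinhole projection of equations (1)-(2).
    A    : 3x3 intrinsic parameter matrix
    R, t : change of frame from the world frame to the normalized camera frame F_C
    Returns the pixel m = [x, y, 1] and the projective depth z_C."""
    M_C = R @ M_world + t            # coordinates in F_C
    zm = A @ M_C                     # A P_0 M_/F_C
    z_C = zm[2]
    return zm / z_C, z_C

# Toy example (illustrative values only)
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
m, z_C = project(A, np.eye(3), np.zeros(3), np.array([0.1, -0.2, 2.0]))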
2.3 Disparity between two views
We now consider two views of the scene, obtained from either two cameras or one camera in motion.
If the two images have not been acquired simultaneously, we make the further assumption that no
object of the scene has moved in the meantime.
The optical centers corresponding to the views are denoted by C for the first and C' for the second,
the intrinsic parameter matrices by A and A' respectively, and the normalized camera frames respectively
by F_C and F_C'. The matrix of change of frame from F_C to F_C' is a matrix of displacement
defined by a rotation matrix R and a translation vector t:

    K_{F_C->F_C'} = [ R  t ; 0_3^T  1 ]        (3)

More precisely, given a point m, image of an object point M, we are interested in establishing the disparity
equation of m for the two views, that is the equation relating the projection m' of M in the second
view to the projection m of M in the first view.

2.3.1 The general case

Assuming that M is not in either one of the two focal planes corresponding to the first and second
views, we have, from equations (2) and (3):

    z'_C m' = A' P_0 M_/F_C' = A' [ R  t ] M_/F_C = z_C A' R A^{-1} m + t_C A' t

This is the general disparity equation relating m' to m, which we rewrite as:

    z'_C m' = z_C H_inf m + t_C e'        (4)

where we have introduced the following notations:

    H_inf = A' R A^{-1}   and   e' = A' t        (5)

H_inf represents the collineation of the plane at infinity, as it will become clear below in section (2.3.3).
e' is a vector representing the epipole in the second view, that is, the image of C in the second view.
Indeed, this image is

    A' P_0 C_/F_C' = A' [ R  t ] C_/F_C = A' t = e'

since C_/F_C = [0, 0, 0, 1]^T. Similarly,

    e = -A R^T t        (6)

is a vector representing the epipole in the first view.

Equation (4) means that m' lies on the line going through e' and the point represented by H_inf m,
which is the epipolar line of m. This line is represented by the vector

    l'_m = F m        (7)

where

    F = [e']_x H_inf        (8)

or equivalently (1)

    F = A'^{-T} [t]_x R A^{-1}        (9)

F is the fundamental matrix which describes the correspondence between an image point in the first
view and its epipolar line in the second (see [5]).

(1) using the algebraic identity [Bx]_x = B* [x]_x B^{-1} (valid if det(B) != 0), where B* = det(B) B^{-T} is the adjoint
matrix of B.
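As a small numerical companion to equations (5), (7) and (9), the following Python sketch builds F from known motion and intrinsics and computes an epipolar line; all names and values are illustrative assumptions, not part of the report.

import numpy as np

def skew(x):
    """[x]_x, the antisymmetric matrix such that [x]_x y = x × y."""
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])

def fundamental_from_motion(A, A2, R, t):
    """Equation (9): F = A'^{-T} [t]_x R A^{-1}."""
    return np.linalg.inv(A2).T @ skew(t) @ R @ np.linalg.inv(A)

def epipolar_line(F, m):
    """Equation (7): epipolar line of m (homogeneous pixel) in the second view."""
    return F @ m

# For any correspondence (m, m') of the same 3-D point, m'^T F m should be ~0.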
2.3.2 The case of coplanar points
Let us now consider the special case of points lying in a plane Pi. The plane is represented in F_C by
the vector pi = [n^T, -d]^T, where n is its unit normal in F_C and d the distance of C to the plane.
Its equation is pi^T M_/F_C = 0, which can be written, using equation (2),

    n^T P_0 M_/F_C - t_C d = z_C n^T A^{-1} m - t_C d = 0        (10)

If we first assume that d != 0, that is the plane does not go through C, we obtain the new form of the
disparity equation:

    z'_C m' = z_C H m        (11)

where

    H = H_inf + (1/d) e' n^T A^{-1}        (12)

This equation defines the projective linear mapping, represented by H, the H-matrix of the plane,
relating the images of the points of the plane in the first view to their images in the second. It is at the
basis of the idea which consists of segmenting the scene into planar structures given by their respective
H-matrices and, using this segmentation, computing motion and structure (see [7] or [29]).

If the plane does not go through C' either, its H-matrix represents a collineation (det(H) != 0) (2)
and its inverse is given by

    H^{-1} = H_inf^{-1} + (1/d') e n'^T A'^{-1}        (13)

where n' is the unit normal in F_C' and d' the distance from the plane to C'. If the plane goes through
only one of the two points C or C', its H-matrix is still defined by the one of the two equations (12)
or (13) which remains valid, but is no longer a collineation; equation (10) shows that the plane then
projects in one of the two views into a line represented by the vector

    A^{-T} n   or   A'^{-T} n'        (14)

If the plane is an epipolar plane, i.e. goes through both C and C', its H-matrix is undefined.

Finally, equations (5) and (6) show that e and e' always verify equation (11), as expected, since
e and e' are the images of the intersection of the line <C, C'> with the plane.

(2) using the algebraic identity det(B + x y^T) = det(B) (1 + y^T B^{-1} x).
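Equation (12) is easy to evaluate when the motion and the plane are known. The sketch below is only an illustration of that formula; the argument names are ours and the plane is given by its unit normal n and distance d in F_C, as in the text.

import numpy as np

def plane_homography(A, A2, R, t, n, d):
    """Equation (12): H = H_inf + (1/d) e' n^T A^{-1},
    with H_inf = A' R A^{-1} and e' = A' t (equation (5))."""
    A_inv = np.linalg.inv(A)
    H_inf = A2 @ R @ A_inv
    e2 = A2 @ t
    return H_inf + np.outer(e2, n) @ A_inv / d

def transfer(H, m):
    """Image in view 2 of a point of the plane seen at m in view 1 (equation (11))."""
    m2 = H @ m
    return m2 / m2[2]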
2.3.3 The case of points at infinity
For the points of the plane at infinity, represented by [x, y, z, 0]^T, thus of equation t_C = 0, the
disparity equation becomes

    z'_C m' = z_C H_inf m        (15)

Thus, H_inf is indeed the H-matrix of the plane at infinity. Equation (15) is also the limit of
equation (11) when d tends to infinity, which is compatible with the fact that the points at infinity
correspond to the remote points of the scene.
2.4 Reconstruction
Reconstruction is the process of computing three-dimensional structure from two-dimensional image
measurements. The three-dimensional structure of the scene can be captured only up to a group of
transformations in space, related to the degree of knowledge of the imaging parameters. For instance, with a calibrated stereo rig (i.e., for which intrinsic and extrinsic parameters are known), it is
well known that structure can be captured up to a rigid displacement in space. This has been used for
a long time in photogrammetry. It has been shown more recently [14] that with non-calibrated affine
cameras (i.e. that perform orthographic projection), structure can be recovered only up to an affine
transformation. Then, the case of uncalibrated projective cameras has been addressed [2, 11, 28] and
it has been shown that in this case, three-dimensional structure can be recovered only up to a projective transformation.
We will now use the formalism introduced above to describe these three cases in more detail.
2.4.1 Euclidean reconstruction
Here we suppose that we know the intrinsic parameters of the cameras, A and A', and the extrinsic
parameters of the rig, R and t. This is the case when cameras have been calibrated. For clarity we call
it the strong calibration case. Through equation (5) we can compute H_inf and e'. Equation (4) gives
us

    z_C = t_C ( (e' x m') . (m' x H_inf m) ) / || m' x H_inf m ||^2

and equation (2)

    [x_C, y_C, z_C]^T = z_C A^{-1} m

Thus, we have computed the coordinates of M with respect to F_C.

The projection matrices for the first and second views, expressed in their respective image frames
and in F_C, are then written

    A [ I_3  0_3 ]   and   A' [ R  t ]

These matrices and the coordinates of M are thus known up to an unknown displacement K_{F->F_C},
corresponding to an arbitrary change of the Euclidean reference frame.
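The depth computation above amounts to a least-squares elimination of z'_C in equation (4); a minimal Python sketch (with our own function names, assuming homogeneous pixel vectors m, m' with third coordinate 1 and t_C = 1) is:

import numpy as np

def depth_from_disparity(H_inf, e2, m, m2, t_C=1.0):
    """Solve equation (4) for z_C: z_C (m' × H_inf m) = t_C (e' × m')."""
    a = np.cross(m2, H_inf @ m)
    b = np.cross(e2, m2)
    return t_C * np.dot(b, a) / np.dot(a, a)

def reconstruct_point(A, H_inf, e2, m, m2):
    """Coordinates of M in the normalized frame F_C (strong calibration case)."""
    z_C = depth_from_disparity(H_inf, e2, m, m2)
    return z_C * np.linalg.inv(A) @ m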
2.4.2 Affine reconstruction
We now assume that both the fundamental matrix and the homography of the plane at infinity are
known, but the intrinsic parameters of the cameras are unknown.
We show below that by applying an affine transformation of space, i.e., a transformation of P^3
which leaves the plane at infinity invariant, we can compensate for the unknown parameters of the
camera system. The guiding idea is to choose the affine transformation in such a way that the projection
matrix of the first camera is equal to P_0, as in [21]. We can then use the same reconstruction
equations as in the Euclidean case (strong calibration). Since structure is known up to this unknown
affine transformation, we call this case the affine calibration case. Let us now describe the operations
in detail.

Suppose then that we have estimated H_inf (see section 3.4), thus we know it up to an unknown
scale factor. Let us denote by H^_inf one of the possible representations of H_inf:

    H^_inf = alpha_inf H_inf

where alpha_inf is an unknown nonzero scalar. Suppose also that we have estimated the fundamental matrix
F (see section 3.2), which is of rank 2, i.e. its null-space is of dimension 1. Equation (8) shows that e'
is in the null-space of F^T, hence e' is known up to a nonzero scalar and we write, in analogy with
the previous case:

    e^' = alpha_e' e'        (16)

Neither equation (2) nor equation (4) is usable since A, H_inf and e' are unknown. Both equations
can be rewritten in another frame F_A defined by the matrix of change of frame K_{F_C->F_A}:

    K_{F_C->F_A} = [ A  0_3 ; 0_3^T  alpha ],   hence   K_{F_A->F_C} = [ A^{-1}  0_3 ; 0_3^T  1/alpha ]        (17)

where alpha = alpha_inf / alpha_e'. If M_/F_A = [x_A, y_A, z_A, t_A]^T is the vector representing M in F_A,
equation (2) becomes z_C m = P_0 M_/F_A, so that z_A = z_C, and equation (4), written in frame F_A,
becomes

    z'_A m' = z_A H^_inf m + t_A e^'        (18)

with z'_A = alpha_inf z'_C. Exactly as in the Euclidean case, equation (18) yields

    z_A = t_A ( (e^' x m') . (m' x H^_inf m) ) / || m' x H^_inf m ||^2

and equation (2), written in F_A, gives [x_A, y_A, z_A]^T = z_A m.

Thus, we have computed the coordinates of M with respect to F_A.

It is easy to verify that the projection matrices for the first and second views, expressed in their
respective image frames and in F_A, are then written

    [ I_3  0_3 ] = P_0   and   [ H^_inf  e^' ]

They are thus known up to the unknown affine transformation (3) K_{F->F_A}, corresponding to an arbitrary
change of the affine reference frame.

(3) It is affine because it does not change the plane at infinity.

2.4.3 Projective reconstruction
We now address the case when only the fundamental matrix is known. This is known as the weak
calibration case. The representation e^' of the epipole e' is also known up to a nonzero scalar factor, as
belonging to the null-space of F^T. Neither equation (2) nor equation (4) is usable since A, H_inf and
e' are unknown. As in the previous paragraph, we eliminate the unknown parameters by applying a
projective transformation of space. Here, the plane at infinity is not (necessarily) left invariant by the
transformation: it is mapped to an arbitrary plane. Let us now go into more details.

Let us first assume that we know, up to a nonzero scalar factor alpha_Pi, the H-matrix of a plane Pi not
going through the optical center C of the first camera, as defined in section (2.3.2):

    H^ = alpha_Pi ( H_inf + (1/d) e' n^T A^{-1} )        (19)

where n is the unit normal, expressed in F_C, of the plane and d, with d != 0, the distance of C to the
plane. We define a frame F_P by the matrix of change of frame from F_C:

    K_{F_C->F_P} = [ A  0_3 ; -(alpha_Pi / (alpha_e' d)) n^T  alpha_Pi / alpha_e' ],
    hence   K_{F_P->F_C} = [ A^{-1}  0_3 ; (1/d) n^T A^{-1}  alpha_e' / alpha_Pi ]        (20)

In this frame the plane Pi is mapped to the plane at infinity, while the plane at infinity of the scene is
represented in F_P by the vector [ (1/d) A^{-T} n ; alpha_e' / alpha_Pi ]. If M_/F_P = [x_P, y_P, z_P, t_P]^T
is the vector representing M in F_P, we have, using equations (10) and (2),

    t_P = -(alpha_Pi / (alpha_e' d)) ( n^T P_0 M_/F_C - t_C d )   and   z_C m = P_0 M_/F_P        (21)

so that z_P = z_C. Eliminating t_C from equation (4) then gives

    z'_C m' = z_C ( H_inf + (1/d) e' n^T A^{-1} ) m + (alpha_e' / alpha_Pi) t_P e'        (22)

which is rewritten, in F_P, as

    z'_P m' = z_P H^ m + t_P e^'        (23)

with z'_P = alpha_Pi z'_C. As for equation (2), it is written in F_P

    z_P m = P_0 M_/F_P        (24)

Equation (23) then gives us, exactly as before,

    z_P = t_P ( (e^' x m') . (m' x H^ m) ) / || m' x H^ m ||^2

and equation (24), [x_P, y_P, z_P]^T = z_P m.

Thus, we have computed the coordinates of M with respect to frame F_P.

The projection matrices for the first and second views, expressed in their respective image frames
and in F_P, are then written

    [ I_3  0_3 ]   and   [ H^  e^' ]

Indeed, the projection matrix for the second view is

    A' P_0 K_{F_C->F_C'} K_{F_P->F_C} = A' [ R  t ] K_{F_P->F_C}
        = [ H_inf + (1/d) e' n^T A^{-1}   (alpha_e' / alpha_Pi) e' ] = (1/alpha_Pi) [ H^  e^' ]

and is actually of rank 3 as the product of a 3x4 matrix of rank 3 and a 4x4 matrix of rank 4.

Both projection matrices and the coordinates of M are thus known up to the unknown collineation
K_{F->F_P}, corresponding to an arbitrary change of the projective reference frame. This result had already
been found in a quite different manner in [2, 12].

The reconstruction described above is possible as soon as the H-matrix of a plane which does
not go through C is known. In particular, when F is known, one is always available, as suggested by
equations (8) and (22). It is defined by

    H^ = [e^']_x F        (25)

which gives, using equation (8) (4),

    H^ = [e^']_x [e']_x H_inf = alpha_e' ( e' e'^T - ||e'||^2 I_3 ) H_inf

which is of the form (19), with (1/d) n^T A^{-1} = -e'^T H_inf / ||e'||^2. The equation, expressed in F_C,
of the corresponding plane is

    e'^T A' P_0 M_/F_C' = 0

which shows, using equation (2), that this plane is the plane going through C which projects, in the
second view, to the line represented by e', as already noticed in [21].

(4) using the algebraic identity [x]_x [x]_x = x x^T - ||x||^2 I_3.
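The canonical pair of projection matrices obtained in the weak calibration case can be formed directly from F, as a short Python sketch illustrates below; the epipole is taken from the null-space of F^T and the plane of equation (25) is used. Function names are ours.

import numpy as np

def skew(x):
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])

def epipole_second_view(F):
    """A representative e^' of the epipole: the null-space of F^T (equation (8))."""
    _, _, Vt = np.linalg.svd(F.T)
    return Vt[-1]

def projective_cameras(F):
    """Projection matrices [I_3 0_3] and [H^ e^'], with H^ = [e^']_x F (equation (25))."""
    e2 = epipole_second_view(F)
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew(e2) @ F, e2.reshape(3, 1)])
    return P1, P2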
3 Computing the geometric parameters
Now that we have established which parameters are necessary to deduce information on the structure
of the scene, we describe methods to compute these parameters, from real images.
If no a priori knowledge is assumed, the only source of information is the images themselves and
the correspondences established between them.
After showing how accurate and reliable point correspondences can be obtained in general from
the images, we describe how they can be used for estimating the fundamental matrix on the one hand,
plane collineations on the other hand.
3.1 Finding correspondences
Matching is done using the image intensity function I(x, y). A criterion, usually depending on the
local values of I(x, y) in both images, is chosen to decide whether a point m_1 of the first image and
a point m_2 of the second are the images of the same point of the scene. It is generally based on
a physical model of the scene. A classical measure for similarity between the two images within a
given area is the cross-correlation coefficient

    C(m_1, m_2) = ( (w(m_1) - w~(m_1)) . (w(m_2) - w~(m_2)) ) / ( ||w(m_1) - w~(m_1)|| ||w(m_2) - w~(m_2)|| )

where w(m) is the vector of the image intensity values in a neighborhood around the point m and w~(m) its
mean in this neighborhood.
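This zero-mean normalized cross-correlation is straightforward to compute; a minimal Python sketch, with our own function name and square windows of half-size `half` centred on integer pixel positions, is:

import numpy as np

def zncc(I1, I2, p1, p2, half):
    """Cross-correlation coefficient between the windows of I1 around p1=(row, col)
    and of I2 around p2 (both windows assumed to lie inside the images)."""
    (r1, c1), (r2, c2) = p1, p2
    w1 = I1[r1-half:r1+half+1, c1-half:c1+half+1].astype(float).ravel()
    w2 = I2[r2-half:r2+half+1, c2-half:c2+half+1].astype(float).ravel()
    w1 -= w1.mean()
    w2 -= w2.mean()
    denom = np.linalg.norm(w1) * np.linalg.norm(w2)
    return float(w1 @ w2 / denom) if denom > 0 else 0.0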
The context in which the views have been taken plays a significant role. Two main cases have
to be considered: the case where the views are very similar and the opposite case. The first case
usually corresponds to consecutive views of a sequence taken by one camera, the second, to views
taken by a stereo rig with a large baseline. In the first case, the distance between the images of a point
in two consecutive frames is small. This makes it possible to limit the search space when trying to find point
correspondences. Below, we briefly describe a simple point tracker which, relying on this property,
provides robust correspondences at a relatively low computational cost. In the second case, corresponding
points may have quite different positions in the two images. Thus, point matching requires
more sophisticated techniques. This is the price to pay if we want to manipulate pairs of images taken
simultaneously from different viewpoints, which allow general reconstruction of the scene without
worrying about the motion of the observed objects, as mentioned in section (2.3).

In both cases, the criterion that we use for estimating the similarity between image points is not
computed for all of them, but only for points of interest. These points are usually intensity corners
in the image, obtained as the maxima of some operators applied to I(x, y). Indeed, they are the most
likely to be invariant to view changes for these operators, since they usually correspond to object markings.
The corner detector. The operator that we use is the one presented in [10], which is a slightly modified
version of the Plessey corner detector:

    det(C~) - k ( trace(C~) )^2

where

    C = [ I_x^2   I_x I_y ; I_x I_y   I_y^2 ]

I_x and I_y denote the first derivatives of the image intensity and C~ denotes a smoothed version of C.
Based on experiments, Harris suggests to set k = 0.04 for best results. The operator is computed at each
point of the image, and points for which it is larger than a given threshold are retained as corners.
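A compact Python sketch of this corner response (using a Gaussian smoothing of the derivative products, which is one possible choice of smoothing; names and defaults are ours) could be:

import numpy as np
from scipy.ndimage import gaussian_filter

def corner_response(I, sigma=1.0, k=0.04):
    """Plessey/Harris-type response det(C~) - k * trace(C~)^2."""
    I = I.astype(float)
    g0, g1 = np.gradient(I)               # derivatives along the two image axes
    Sxx = gaussian_filter(g1 * g1, sigma)
    Sxy = gaussian_filter(g1 * g0, sigma)
    Syy = gaussian_filter(g0 * g0, sigma)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2

# Corners are then retained as local maxima of the response above a threshold.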
The points tracker. The implementation has been strongly influenced by the corner tracker described in [18].
It works as follows: First, corners are extracted in both images. Then, for a given corner of the
first image, the following operation is performed: its neighborhood is searched for corners of the
second image; the criterion C is computed for each pair made of the corner of the first image and one of
the possible matches in the second image; the pair with the best score is retained as a correspondence
if the score is above a fixed threshold.
Then, for each corner of the second image for which a corresponding point in the first image has
been found, the preceding operation is applied from the second image to the first. If the corresponding
point found by this operation is the same as the previous one, it is then definitely taken as valid.
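The forward-backward validation just described can be sketched in a few lines of Python; the function below is an illustration only, it assumes corners given as (row, col) tuples and takes any similarity function (for instance the zncc sketch above) as an argument.

def mutual_best_matches(corners1, corners2, score, radius, threshold):
    """Keep (p, q) only if q is p's best match within the search radius and,
    conversely, p is q's best match (mutual consistency check)."""
    def near(p, pts):
        return [q for q in pts
                if abs(q[0] - p[0]) <= radius and abs(q[1] - p[1]) <= radius]
    matches = []
    for p in corners1:
        cand = near(p, corners2)
        if not cand:
            continue
        q = max(cand, key=lambda c: score(p, c))
        if score(p, q) < threshold:
            continue
        back = near(q, corners1)
        p_back = max(back, key=lambda c: score(c, q))
        if p_back == p:
            matches.append((p, q))
    return matches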
The stereo points matcher. The method described in the previous section no longer works as soon
as the views are quite different. More precisely, the correlation criterion is not selective enough:
for a given point of an image, several points of the other image lead to a good correlation score, and the
best of them is not necessarily the true corresponding point. To achieve correspondence matching, the
process must then keep all those potentially good but conflicting correspondences and invoke global
techniques to decide between them: a classical relaxation technique is used to converge towards a
globally coherent set of point correspondences, given some constraints of
uniqueness and continuity (see [30]).
3.2 The fundamental matrix
Once some image point correspondences, represented in the image frames by (m_i, m'_i), have been
found, the fundamental matrix F is computed, up to a nonzero scalar factor, as the unique solution
of the system of equations, derived from the disparity equations,

    m'_i^T F m_i = 0        (26)

This system can be solved as soon as seven such correspondences are available: only eight coefficients
of F need to be computed, since F is defined up to a nonzero scalar factor, while equation (26)
supplies one scalar equation per correspondence and det(F) = 0, the eighth. If there are more correspondences
available, which are not exact, as is the case in practice, the goal of the computation
is to find the matrix which best approximates the solution of this system according to a given least
squares criterion.
A study of the computation of the fundamental matrix from image point correspondences can be
found in [20]. Here, we just mention our particular implementation, which consists, on the one hand,
of a direct computation considering that all the correspondences are valid and, on the other hand, of a
method for rejecting possible outliers among the correspondences.

The direct computation computes the matrix F which minimizes the following criterion:

    sum_i [ d(m'_i, F m_i)^2 + d(m_i, F^T m'_i)^2 ]

which is the sum of the squares of the distance of m'_i to the epipolar line of m_i and of the distance of
m_i to the epipolar line of m'_i. Minimization is performed with the classical Levenberg-Marquardt
method (see [26]). In order to take into account both its definition up to a scale factor and the fact
that it is of rank 2, a parametrization of F with seven parameters is used, which parametrizes all the
3x3 matrices of rank strictly less than 3. These parameters are computed in the following way:
a row i (respectively, a column j) of F is chosen and written as a linear combination of the other two
rows (respectively, columns); the four coefficients of these two combinations are taken as parameters;
among the four coefficients not belonging to row i or column j, the three smallest, in absolute value, are divided
by the biggest and taken as the last three parameters. Row i and column j are chosen in order to maximize the rank
of the derivative of F with respect to the parameters. Denoting the parameters by lambda_1, ..., lambda_7
and assuming, for instance, that i and j are equal to 1 and that the bottom right coefficient is the
normalizing coefficient, leads to the following matrix:

    F = [ lambda_6 (lambda_4 lambda_1 + lambda_5 lambda_3) + lambda_7 (lambda_4 lambda_2 + lambda_5)    lambda_6 lambda_1 + lambda_7 lambda_2    lambda_6 lambda_3 + lambda_7 ;
          lambda_4 lambda_1 + lambda_5 lambda_3    lambda_1    lambda_3 ;
          lambda_4 lambda_2 + lambda_5    lambda_2    1 ]

During the minimization process, the parametrization of F can change: the parametrization chosen
for the matrix at the beginning of the process is not necessarily the most suitable for the final matrix.

The outlier rejection method used is a classical least median of squares method. It is described
in detail in [30].
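The implementation described above relies on a non-linear Levenberg-Marquardt minimization with the seven-parameter form of F and a least-median-of-squares outlier rejection. As a simpler, purely linear illustration of equation (26) (often used to initialize such a minimization), the following Python sketch computes the normalized eight-point estimate; it is not the report's own method.

import numpy as np

def eight_point(m1, m2):
    """Linear estimate of F from n >= 8 correspondences m1[i] <-> m2[i]
    given as (n, 2) arrays of pixel coordinates."""
    def normalise(pts):
        c = pts.mean(axis=0)
        s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        ph = np.column_stack([pts, np.ones(len(pts))])
        return (T @ ph.T).T, T

    x1, T1 = normalise(np.asarray(m1, float))
    x2, T2 = normalise(np.asarray(m2, float))
    # One row per correspondence, from m'^T F m = 0 (equation (26)).
    Aeq = np.column_stack([x2[:, 0:1] * x1, x2[:, 1:2] * x1, x1])
    _, _, Vt = np.linalg.svd(Aeq)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                        # enforce det(F) = 0
    F = T2.T @ (U @ np.diag(S) @ Vt) @ T1
    return F / np.linalg.norm(F)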
3.3 The H-matrix of a plane
If we have at our disposal correspondences, represented in the image frames by (m_i, m'_i), of points
belonging to a plane, the H-matrix H of this plane is computed, up to a nonzero scalar factor, as the
unique solution of the system of equations (11),

    z'_{C,i} m'_i = z_{C,i} H m_i

This system can be solved as soon as four such correspondences are available: only eight coefficients
of H need to be computed, since H is defined up to a nonzero scalar factor, while equation (11)
supplies two scalar equations for each correspondence. If there are more correspondences available,
which are not exact, as is the case in practice, the goal of the computation is to find the matrix
which best approximates the solution of this system according to a given criterion: a study of the
computation of plane H-matrices from image point correspondences can be found in [6].

Since e and e' verify equation (11), three point correspondences are in general sufficient for defining
H. In fact, this is true as long as the homography is defined, i.e., when the three points are not
aligned in either image (a proof can be found in [3]). If the plane is defined by one point M and a line L,
given by its projections (l, l'), so that m does not belong to l and m' does not belong to l', its H-matrix
is computable in the same way, as soon as we know the fundamental matrix. Indeed, the projections of
two other points P and Q of the plane are given by choosing two points p and q on l, which amounts
to choosing P and Q on L: the corresponding points p' and q' are then given by intersecting l' with
the epipolar line of p and the epipolar line of q, given by the fundamental matrix.

As an application of this idea, we have a purely image-based way of solving the following problem:
given a point correspondence (m, m') defining a 3-D point M, and a line correspondence
(l, l') defining a 3-D line L, find the H-matrix of the plane going through M and L. In particular,
if L is at infinity, it defines a direction of planes (all planes going through L are parallel) and we can
find the H-matrix of the plane going through M and parallel to that direction. This will be used in
section 6.2.

Given the H-matrix H of a plane Pi and the correspondences (m, m') and (n, n') of two points M
and N, it is possible to directly compute in the images the correspondence (p, p') of the intersection P
of the line <M, N> with Pi. Indeed, p' belongs both to <m', n'> and to the image of <m, n> by H, so:

    p' = (m' x n') x (H m x H n)   (see [27])

Similarly, given two planes Pi_1 and Pi_2 represented by their H-matrices H_1 and H_2, it is possible
to directly compute in the images the correspondences of the intersection of Pi_1 with Pi_2. Indeed,
the correspondences of two points of the intersection are computed, for example, as intersections of two lines L_1
and L_2 of Pi_1 with Pi_2; the correspondences of such lines are obtained by choosing two lines in the
first image, represented by the vectors l_1 and l_2, their corresponding lines in the second image being
given by H_1^{-T} l_1 and H_1^{-T} l_2.
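For the case of four or more coplanar point correspondences, a standard linear (DLT-type) estimate of the H-matrix can serve as an illustration; the sketch below is one possible implementation, not necessarily the one used in [6].

import numpy as np

def homography_dlt(m1, m2):
    """Estimate H, up to scale, from n >= 4 correspondences of coplanar points
    given as (n, 2) arrays of pixel coordinates; each m' ~ H m gives two equations."""
    rows = []
    for (u, v), (u2, v2) in zip(np.asarray(m1, float), np.asarray(m2, float)):
        rows.append([u, v, 1, 0, 0, 0, -u2 * u, -u2 * v, -u2])
        rows.append([0, 0, 0, u, v, 1, -v2 * u, -v2 * v, -v2])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]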
3.4 The homography of the plane at infinity
To compute the homography of the plane at infinity H_inf, we can no longer use the disparity equation (4)
with correspondences of points not at infinity, even if we know the fundamental matrix, since
z_C, z'_C and t_C are not known. We must, thus, know correspondences (m_i, m'_i) of points at infinity
and compute H_inf like any other plane H-matrix, as described in section 3.3.

The only way to obtain correspondences of points at infinity is to assume some additional knowledge.

First, we can assume that we have some additional knowledge of the observed scene that allows us
to identify, in the images, some projections of points at infinity, like, for instance, the vanishing points
of parallel lines of the scene, or the images of some points on the horizon, which provide sufficiently
good approximations to points at infinity.

Another way to proceed is to assume that we have an additional pair of views. More precisely, if
this second pair differs from the first only by a translation of the rig, any pair (M, N) of stationary
object points (see figure 2), seen in the first pair of views as (m_1, m'_1) and (n_1, n'_1), and in the second as
(m_2, m'_2) and (n_2, n'_2), gives us the images (v_1, v'_1) and (v_2, v'_2) in the four images of the intersection
V of the line <M, N> with the plane at infinity.

Figure 2: Determining the projections of points at infinity (see section 3.4).

Indeed, on one hand, since V is at infinity and the stationarity of M and N implies the stationarity of V,
we have, from equations (15) and (4),

    z_2 v_2 = z_1 H_inf^{12} v_1   and   z'_2 v'_2 = z'_1 H'_inf^{12} v'_1

where H_inf^{12} (respectively, H'_inf^{12}) is the homography of the plane at infinity between the first
(respectively, second) view of the first pair and its corresponding view in the second pair. In the case
where the two pairs of views differ only by a translation, R_12 = R'_12 = I_3, and we have, by equation (5),

    H_inf^{12} = A A^{-1} = I_3   and   H'_inf^{12} = A' A'^{-1} = I_3

which implies that v_1 = v_2 = v and v'_1 = v'_2 = v'. On the other hand, as V lies on <M, N>, v lies on
<m_1, n_1> and on <m_2, n_2>, and v' lies on <m'_1, n'_1> and on <m'_2, n'_2>. Consequently, v and v' are obtained
as the intersections of <m_1, n_1> with <m_2, n_2> and of <m'_1, n'_1> with <m'_2, n'_2>, respectively:

    v = (m_1 x n_1) x (m_2 x n_2)   and   v' = (m'_1 x n'_1) x (m'_2 x n'_2)
Once H_inf has been obtained, the ratio of the lengths of any two aligned segments of the scene
can be computed directly in the images. Indeed, given three points M_1, M_2 and M_3 on a line, as in
figure 3, from their images (m_1, m'_1), (m_2, m'_2) and (m_3, m'_3), we can compute the images (v, v')
of the intersection V of this line with the plane at infinity, using H_inf, as explained in section 3.3. We can
compute in each image the cross-ratio of those four points. As a projective invariant, this cross-ratio
is then exactly equal to the ratio of the signed lengths M_1 M_3 and M_2 M_3. More precisely:

    M_1 M_3 / M_2 M_3 = { m_1, m_2 ; m_3, v } = ( m_1 m_3 . m_2 v ) / ( m_2 m_3 . m_1 v )

where m_1 m_3, etc., denote signed lengths measured along the image line.

Figure 3: Determining ratios of lengths in affine calibration (see text).
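A minimal Python sketch of this cross-ratio computation follows; the four arguments are homogeneous 3-vectors of collinear image points, with v the image of the point at infinity of the line (obtained from H_inf as explained above), and the function name is ours.

import numpy as np

def length_ratio(m1, m2, m3, v):
    """Cross-ratio {m1, m2; m3, v}; when v is the image of the point at infinity
    of the line, this equals the ratio of signed lengths M1M3 / M2M3."""
    d = (v / v[2])[:2] - (m1 / m1[2])[:2]
    d = d / np.linalg.norm(d)                    # unit direction of the image line
    def s(p, q):                                 # signed length from p to q
        return np.dot((q / q[2])[:2] - (p / p[2])[:2], d)
    return (s(m1, m3) / s(m2, m3)) * (s(m2, v) / s(m1, v))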
4 The rectification with respect to a plane
In this section, we assume that we know the epipolar geometry. This allows us to rectify the images
with respect to a plane of the scene. This process, explained below, makes it possible not only to compute a map
of image point correspondences, but also to assign to each correspondence a scalar that represents a measure
of the disparity between its two projections.
4.1 The process of rectification
Like in section 2.4.3, we assume that we know, up to nonzero scale factors, F, thus the epipoles e^
and e^' (equation (16)), and the H-matrix H^ of a plane Pi, given by equation (19). Let us then choose two
homographies, represented by the matrices Q and Q', such that

    Q e^ = [1, 0, 0]^T        (27)

    Q' H^ = kappa Q        (28)

where kappa is any nonzero scalar. Since e^ and e^' correspond under H^ (end of section 2.3.2), equations
(27) and (28) also imply that Q' e^' is proportional to [1, 0, 0]^T, and equation (23) can then be rewritten

    z'_r m'_r = z_r m_r + t_r [1, 0, 0]^T        (29)

where t_r is proportional to t_P, m_r = [x_r, y_r, 1]^T, m'_r = [x'_r, y'_r, 1]^T and

    z_r m_r = kappa z_P Q m   and   z'_r m'_r = z'_P Q' m'        (30)

The rectification with respect to a plane consists of applying such matrices, called the rectification
matrices, Q to the first image and Q' to the second.
Equation (29) shows that the corresponding point m'_r, in the second rectified image, of a point m_r
of the first rectified image lies on the line parallel to the x-axis and going through m_r. Applying a
correlation criterion to m_r and each point of this line thus makes it possible to determine m'_r, if the image is not too
distorted through the process of rectification. Equations (27) and (28) do not completely determine
Q and Q': this indetermination is used to minimize the distortion of the images, as explained in
section 4.3.

Once m'_r has been determined, a measure of the disparity between m and m' with respect to this
plane is given by x'_r - x_r. If M belongs to Pi, it is equal to zero, since t_r is proportional to t_P, which vanishes
by equations (10) and (21); otherwise, its interpretation depends on the information available for the
model, as explained in section 5.
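Applying the rectification matrices to a correspondence and reading off the scalar disparity of equation (29) is straightforward; the Python sketch below assumes homogeneous pixel vectors and rectification matrices Q, Q' already computed as described above.

import numpy as np

def rectify_pair(Q, Q2, m, m2):
    """Return the rectified points and the scalar disparity x'_r - x_r."""
    mr = Q @ m
    mr = mr / mr[2]
    mr2 = Q2 @ m2
    mr2 = mr2 / mr2[2]
    # After rectification the two points share (up to noise) the same y coordinate.
    disparity = mr2[0] - mr[0]
    return mr, mr2, disparity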
4.2 Geometric interpretation of the rectification
From the QR-decomposition [9], any nonsingular matrix Q can be decomposed as:

    Q = U B        (31)

where U is a rotation matrix and B a nonsingular upper triangular matrix. Decomposing Q^{-1} as in
equation (31), inverting it, and noticing that the inverse of an upper triangular matrix is also an
upper triangular matrix, we see that Q can also be decomposed as:

    Q = B U        (32)

where U is a rotation matrix and B a nonsingular upper triangular matrix.

To give a geometric interpretation of the rectification, we decompose Q and Q' in the following
way: applying equation (32) to the nonsingular matrices Q A and Q' A', there exist two scalars
sigma and sigma', two rotation matrices R_r and R'_r, and two upper triangular matrices A_r and A'_r of the same
form as A in equation (1), such that

    Q = (1/sigma) A_r R_r A^{-1}   and   Q' = (1/sigma') A'_r R'_r A'^{-1}        (33)

We then study how the constraints on Q and Q' given by equations (27) and (28) propagate to A_r,
A'_r, R_r and R'_r (equations (34) to (38)). One finds, on one hand, that R_r maps the direction of the
baseline C C' onto the x-axis and, on the other hand, that the rotations R_r and R'_r bring the two camera
frames to a common orientation, while the new intrinsic parameter matrices A_r and A'_r are constrained so
that the homography of Pi between the two rectified images is the identity. With these results, the
equations (30) become

    z_r m_r ∝ A_r R_r A^{-1} m   and   z'_r m'_r ∝ A'_r R'_r A'^{-1} m'

They are interpreted as the disparity equations of two pairs of views (see figure 4): the first pair is
composed of the view of optical center C, camera frame F_C, retinal plane rho and intrinsic parameters
matrix A, and of its rectified view of optical center C, camera frame F_{C,r}, retinal plane rho_r and intrinsic
parameters matrix A_r; similarly, the second pair is composed of the view of optical center C', camera
frame F_{C'}, retinal plane rho' and intrinsic parameters matrix A', and of its rectified view of optical center
C', camera frame F_{C',r}, retinal plane rho'_r and intrinsic parameters matrix A'_r. The basis of F_{C,r} is the
image of the basis of F_C by the rotation of matrix R_r. Similarly, the basis of F_{C',r} is the image of the
basis of F_{C'} by the rotation of matrix R'_r. Furthermore, the matrix of change of frame between the two
rectified camera frames is

    K_{F_{C,r}->F_{C',r}} = [ I_3   t_b ; 0_3^T   1 ]   with   t_b proportional to [1, 0, 0]^T

which shows that F_{C,r} and F_{C',r} have the same basis and that the x-axis of this basis is parallel to
C C'. Lastly, for the two rectified views, the epipole of the second view is the point at infinity [1, 0, 0]^T
of the x-axis and, by equation (28), the homography of Pi between the two rectified images is the identity.

Figure 4: The rectification with respect to a plane Pi.

In summary, the process of rectification consists of projecting the first image onto a retinal plane rho_r
and the second image onto a retinal plane rho'_r such that rho_r and rho'_r are parallel, and choosing the rectified
image frames such that the x-axes of the two rectified images are parallel and the homography of Pi
for the two rectified images is the identity.
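The decomposition of equation (32), used above to interpret a rectification matrix as a new intrinsic matrix times a rotation, has a direct numerical counterpart via the QR decomposition of the inverse; the Python sketch below is an illustration with our own conventions (the matrix is negated first if its determinant is negative, which does not change the homography it represents).

import numpy as np

def triangular_times_rotation(M):
    """Equation (32): M = B U, with B upper triangular (positive diagonal)
    and U a rotation matrix."""
    if np.linalg.det(M) < 0:
        M = -M
    Qf, Rf = np.linalg.qr(np.linalg.inv(M))   # equation (31) applied to M^{-1}
    B = np.linalg.inv(Rf)                     # inverse of upper triangular is upper triangular
    U = Qf.T                                  # M = (Qf Rf)^{-1} = Rf^{-1} Qf^T = B U
    S = np.diag(np.sign(np.diag(B)))          # fix signs, S^2 = I
    return B @ S, S @ U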
4.3 Minimizing image distortion
We now examine the distortion caused by Q and Q' to the images.
4.3.1 How many degrees of freedom are left ?
H^ being known, equation (28) shows that Q' is completely determined as soon as Q is. So, all the
degrees of freedom left are concentrated in Q. Only eight coefficients of Q need to be computed,
since Q is defined up to a nonzero scale factor, and equation (27) supplies two scalar equations: six
degrees of freedom remain, but how many of them are really involved in the distortion?

To answer this question, we propose two approaches: the first one decomposes Q and the second
one, Q^{-1}. In each case, we propose a method for computing the values of the parameters which
minimize the image distortion.
4.3.2 The decomposition of Q

According to equation (32), there exist two matrices B and U such that

    Q = B U

where B is an upper triangular matrix and U a rotation matrix. If we decompose U into a product of
rotations around the x-, y- and z-axes and extract from B the translation and scaling components, we
can write

    Q = [ 1 0 t_x ; 0 1 t_y ; 0 0 1 ] [ s_x sigma 0 ; 0 s_y 0 ; 0 0 1 ] R_z(theta_z) R_y(theta_y) R_x(theta_x)        (39)

Based on equation (39), theta_y is chosen so as to cancel out the third coordinate of the transformed
epipole, as required by equation (27) (making the epipolar lines parallel), and theta_z so as to cancel
out its second coordinate (making the epipolar lines parallel to the x-axis). The translation terms t_x and t_y
are not involved in the distortion; four degrees of freedom are left, given by the two scaling factors s_x and s_y,
the skew sigma and the rotation angle theta_x.

Minimizing distortion using a criterion based on areas. The criterion to be minimized is the ratio
of the area of the rectangle with sides parallel to the x- and y-axes circumscribing the rectified image
to the area of the rectified image (see figure 5).

Figure 5: The area-based criterion: minimizing the relative area of the filled region.

This criterion is valid as soon as these areas are not infinite, that is, as soon as the line l (resp. l'),
which is mapped by Q (resp. Q') onto the line at infinity, does not go through any point of the first
(resp. second) image. If the epipole does not lie in the first (resp. second) image, Q (resp. Q') can
be chosen to verify this constraint, since equation (27) (resp. (28)) shows that l (resp. l'), which is
represented by the last row of Q (resp. Q'), is only constrained to go through the epipole.

Q is decomposed as explained in the paragraph above, so that the criterion is a scalar function of
s_x, s_y, sigma and the angle theta_x. Since the criterion is non-linear and its derivatives are not easily
computable, a direction-set method is used, namely Powell's method. s_x and s_y are initialized
to 1, and sigma and theta_x to 0. At the end of the minimization, s_x, s_y, t_x and t_y are adjusted so that the
rectified image is of the same size and at the same position in the plane as the initial image.
4.3.3 The decomposition of ✒
✑
✱
✿
.
Here we present another approach in which a particular parametrization of ✒
allows us to isolate
the parameters responsible for image distorsion, and estimate their values so as to minimize distorsion.
For simplicity, we express image points coordinates with respect to a normalized coordinate system, in which the image occupies the unit square. Using homogeneous coordinates, we denote by
✙ ✙ ✙ the coordinates of the epipole ✙ . We now describe a parametrization of ✒
that explicitly introduces two free rectification parameters. The other parameters correspond to two scaling
factors (one horizontal and one vertical), and one horizontal translation which can be applied to both
rectified images. These parameters can be set arbitrarily, and represent the magnification and clipping
of the rectified images.
Let us now see how, using the mapping of four particular points, we define a parameterization for
.
✒
✑
✱
✿
✑
✱
✝
✌
✁
✞
✞
✍
✿
✑
✱
✿
✘ ✘
☎✄ onto the epipole. This is the condition for the epipolar lines to be
maps point
1. ✒
horizontal in the rectified images.
✑
✱
✌
✩
✞
✞
✍
✿
INRIA
25
Applications of non-metric vision to some visually guided robotics tasks
✿ ✱ ✌ ✞ ✞✤✩✛✍ ✙
2. We impose that the origin of the rectified image be mapped onto the origin of the image. This
sets two translation parameters in the rectified image plane). In other words, ✒
.
✑
✘
✁
✌ ✞ ✞✤✩ ✍
✘
✘
✄
✘
✿✱
✄
maps horizontal lines onto epipolar lines, we impose that the top-right corner of the
3. Since ✒
✙ ✙ ✙ of the image, intersection of the epipolar line of
image be mapped onto point ✁
the left corner with the right edge of the image (Figure 6). This sets the horizontal scale factor
on the rectified image coordinates.
✑
✙ ✌ ✞ ✞ ✍
✁
✞
4. Fourth, we impose that the low-lefthand corner of the rectified image be mapped onto the epipolar line of the low-lefthand corner of the image. This sets the vertical scale factor of the
rectified image coordinates.
✒
From the first three points, we infer that matrix
✒
✿✱✙
✑
✙
✑
✿✱
is of the form:
✂
✍✎
✙
✜✢
✘
✂
✙
✘
✙
✂
✁
✝
✙
☎
✝
✼ ✒ ✿ ✱ ✌ ✞✤✩ ✞✤✩ ✍ ✾ ✔✼ ✖ ✓ ✌ ✞✤✩ ✞ ✩ ✍ ✾ ✙ . In other words, ✒ ✿ ✱ ✌ ✞✤✩ ✞✤✩✛✍
✌ ✞ ✩ ✞✤✩✛✍ , so there exist ✞ such that
✑
From the fourth point, we have
is a linear combination of e and
✒
✘
✄
✘
✄
✘
✄
✄
✏
☎✄
✄
✿✱✙
✑
✑
✘
✘
✙
✍✎
✙
✏
✙
✏
✙
✁
✜✢
✘
✙ ✹✏✆✄
✙ ✆
✏ ✄ ✏ ✙
✘
✁
✏
✝
✝
✝
☎
✙
✙
✙
☎
✝
✏
Assuming that the rectification plane is known (homography
fication matrix for the two images.
), any choice of
☛
✞
defines a recti-
☎✄
✞
✏
Minimizing distorsion using orthogonality. We choose ☎✄ so as to introduce as little image distortion as possible. Since there is no absolute measure of global distortion for images, the criterion
that we use is based on the following remark: In the rectified images, epipolar lines are orthogonal to
pixel columns. If the rectification transformation induced no deformation, it would preserve orthoof lines along pixel columns would be orthogonal to epipolar lines.
gonality, so the image by ✒
Let us now consider one scanline ✁ of the rectified image, and two points ✞✝ which are respectively the top, bottom points of a vertical line
of the rectified image. The epipolar line ✁
corresponds to epipolar lines
in the initial images. The two lines ✁ and
are orthogonal.
Assuming that the rectification transformation preserves orthogonality, lines ✒
✒ ☎✝ and
☎
✝
should be orthogonal (see Figure 7), as well as lines ✒
and
.
✒
For a given value of parameters ☎✄ , we define the residual
✿✱
✑
✆ ✡
✆ ✡✑✞ ✆ ✡
✆ ✡
✆
✄
✆
✄
✞
✼ ✞ ✾✻
✙ ✼✹✼
✑
✑
✄
✏
☎✄
✆
✆
✏
✂
✆ ✡
✆
✑
✑
✆
✍
✆✱ ✡
✆
✿ ✼ ✾✞ ✿ ✱✼ ✾✡
✱
✱
✆ ✿ ✼ ✾✞ ✿ ✼ ✾✡ ✆ ✡
✼ ✒ ✿ ✱ ✝ ✾ ✓✟✼ ✒ ✿ ✱ ✾✹✾ ✳
✆
✄
✆
✞
✆✡
✆
✠✟☛✡
✄
✾
✑
✰
✟
✑
✌☞
Since cameras are in a horizontal configuration, this intersection point exists. However, the same method can be easily
adapted to other cases.
RR n˚2584
26
Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert
✖✖
✁ ✂ ✂ ✱
✖
☎✝
✁ ✱✄✂ ✂ ✱
☎✞✝
✖✖
✖✖
✁ ✱✄✂ ✂
✁ ✂ ✂ ✱
✆☎✞✝
✖
✖
left image
✘✘
✱
☎✞✝
☎✌✝
rectified image
✑
☎✝
✍✎✑✏✓✒ ✁ ✂ ✱✔✂ ✱
✁ ✂ ✱✄✂ ✱
☎✝
✁ ✟✡✠ ✂ ☞✟ ☛ ✂ ✟✆✠
Figure 6: ✿ maps three corners of the rectified image onto particular points of the image, and the
point at infinity ✌ ✩ ✞ ✞ ✍☎✄ onto the epipole .
✒
An analogous term $\epsilon'(\lambda)$ can be defined in the right image, using the rectification transformation of the right image, which is obtained by composing $\mathcal{R}^{-1}$ with the homography $H$ of the rectification plane.
To determine the rectification transformations, we compute the value of $\lambda$ which minimizes the sum $\epsilon(\lambda) + \epsilon'(\lambda)$ computed for both the left and the right columns of the rectified image, i.e. $b$ and $b'$ having respective values $[0,0,1]^{\top}$ and $[0,1,1]^{\top}$ on the one hand, $[1,0,1]^{\top}$ and $[1,1,1]^{\top}$ on the other hand. One can easily see that each term of the resulting sum is the square of an expression which is linear in $\lambda$, so it can be minimized very efficiently using standard linear least-squares techniques.
The two epipolar lines $\bar{l}$ and $\bar{l}'$ are chosen arbitrarily, so as to represent an "average direction" of the epipolar lines in the images. In practice, the pair of epipolar lines defined by the center of the left image provides satisfactory results.
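To make the distortion criterion concrete, here is a minimal sketch of the orthogonality residual for one candidate rectification matrix; the matrix value, the helper names and the choice of an "average" epipolar direction are illustrative assumptions, not the report's implementation.

```python
import numpy as np

def line_direction(p, q):
    """Unit direction of the line joining two points given in homogeneous coordinates."""
    p = p[:2] / p[2]
    q = q[:2] / q[2]
    d = q - p
    return d / np.linalg.norm(d)

def orthogonality_residual(R_inv, b_top, b_bottom, epipolar_dir):
    """Squared dot product between the image of a pixel column and an epipolar direction.

    R_inv maps rectified-image points onto original-image points; b_top and b_bottom
    are the endpoints (homogeneous) of a vertical line of the rectified image.
    A value of 0 means the warped column stays orthogonal to the epipolar line.
    """
    p = R_inv @ b_top
    q = R_inv @ b_bottom
    return float(np.dot(line_direction(p, q), epipolar_dir) ** 2)

# Example: sum the residual over the left and right columns of a unit-square
# rectified image, for an illustrative candidate R_inv and epipolar direction.
R_inv = np.array([[1.0, 0.05, 0.0],
                  [0.02, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
epi_dir = np.array([1.0, 0.03])
epi_dir /= np.linalg.norm(epi_dir)
columns = [(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])),   # left column
           (np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0]))]   # right column
print(sum(orthogonality_residual(R_inv, t, b, epi_dir) for t, b in columns))
```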
Figure 7: Lines involved in the determination of the rectification transformation (see text).
5 Positioning points with respect to a plane
Measuring the positions of points with respect to a reference plane is essential for robot navigation.
We will show in sections 6 and 7 several applications which are based on this measurement. In this
section we study how to compare distances of points to a reference plane under minimal calibration
assumptions.
5.1 Comparing point distances to a reference plane
For convenience we will adopt the terminology which corresponds to the particular application of
section 7.3 where the reference plane is the ground plane, and the robot needs to estimate the relative
heights of visible points, i.e., their relative distances to the ground plane.
The distance of a point $M$ to the reference plane is related to the location of $M$ along the direction orthogonal to the plane. This notion clearly cannot be captured at the projective level. Let us now see under which calibration assumptions we will be able to compare point heights.

Let us introduce an (arbitrary) reference point $Q$ which does not belong to the ground plane. In practice this point is defined by its two image projections $q$ and $q'$, chosen arbitrarily so as to satisfy the following constraints:

- $q$ and $q'$ satisfy the epipolar constraint,
- $q$ and $q'$ lie outside of the images,
- $q$ and $q'$ do not satisfy the homographic relation of the reference plane.

This guarantees that point $Q$ does not lie on the plane and is different from any observable point.
We now consider the line $\Delta$ orthogonal to the reference plane and passing through $Q$. Denoting by $Q_0$ the intersection between $\Delta$ and the ground plane, the height of a point $M$ is in fact equal to the signed distance $\overline{Q_0 M_{\Delta}}$, where $M_{\Delta}$ is obtained by projecting $M$ on $\Delta$ parallel to the reference plane.

From simple affine geometric properties, the ratios of signed distances measured along $\Delta$ and along any line parallel to it are equal. Thus, if we consider an arbitrary point $U$ which we declare to be at height one from the reference plane, the height of $M$ in terms of this unit can be expressed as (see Figure 8):
$$ h \;=\; \frac{\overline{Q_0 M_{\Delta}}}{\overline{Q_0 U_{\Delta}}}, \qquad (40) $$
where $U_{\Delta}$ is defined from $U$ in the same way as $M_{\Delta}$ from $M$.
Affine projection: In practice, we cannot directly compute distances between 3-D points; however, we can compute their projections on the image planes. Ratios of distances along a line are affine invariants, so if we assume that, say, the right camera performs affine projection, the ratio of equation (40) can be measured directly between the corresponding image points:
$$ h \;=\; \frac{\overline{q_0\, m_{\Delta}}}{\overline{q_0\, u_{\Delta}}}, $$
where $q_0$, $m_{\Delta}$ and $u_{\Delta}$ are the right-image projections of $Q_0$, $M_{\Delta}$ and $U_{\Delta}$.
Under affine viewing, this definition of height is exact, in the sense that $h$ is proportional to the distance between $M$ and the reference plane. Otherwise, this formula is only an approximation.
Figure 8: Computation of relative heights with respect to the unit point $U$ under affine projection (see text).
At any rate, it turns out to be accurate enough for some navigation applications, in which the points whose heights have to be compared are at relatively long range from the camera and within a relatively shallow depth of field (cf. Section 7.3).
Perspective projection: If the affine approximation is not valid, we need to relate relative heights
to projective invariants. Instead of considering ratios of distances, we consider cross-ratios, which
are invariant by projection onto the images.
Let us assume that we know the homography of a plane $\Pi_{\parallel}$ parallel to the ground plane (this is in fact equivalent to knowing the line at infinity of the ground plane). Intersecting with $\Pi_{\parallel}$ the lines used in the construction above yields, on each of these lines, a fourth point aligned with the three points already defined. On each line we can therefore form a cross-ratio of four aligned points and, by simple affine properties, the denominator of this cross-ratio only involves the projection parallel to $\Pi_{\parallel}$, i.e., the intersection of $\Pi_{\parallel}$ with the line. Similarly, an analogous cross-ratio can be formed for the unit point $U$. As a consequence, the ratio of these two cross-ratios is equal to $h$ (as defined in Equation 40). Since cross-ratios are invariant by projection onto the images, this ratio can be computed directly from the image points, which expresses the height of $M$ with respect to $U$ without any metric reconstruction.
We remark that the ratio of heights with respect to a plane can be captured at a calibration level
which is intermediate between projective and affine: knowing the plane at infinity is not necessary,
one only needs to know the line at infinity of the reference plane.
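As an illustration of the projective ingredient used above, the following sketch computes the cross-ratio of four collinear points given in homogeneous image coordinates; the numerical values are made up, and the chart used (the x coordinate) is an assumption valid for non-vertical lines.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio {a, b; c, d} of four collinear points in homogeneous coordinates.

    Computed from 2x2 determinants in the (x, w) coordinates, so the result does not
    depend on the homogeneous scaling of the points and is invariant under projection.
    """
    def det(p, q):
        return p[0] * q[2] - q[0] * p[2]
    return (det(c, a) * det(d, b)) / (det(c, b) * det(d, a))

# Four collinear points (on the line y = x here), purely illustrative values.
pts = [np.array([t, t, 1.0]) for t in (0.0, 1.0, 2.0, 5.0)]
print(cross_ratio(*pts))   # 1.6 for these values
```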
5.2 Interpreting disparities
In this section we assume that images have been rectified with respect to the reference plane, and relate the positions of points relative to the plane to image disparities.
The measure of the disparity assigned to a point correspondence after the rectification with respect to a plane and the correlation along the epipolar lines, described in Section 4, is in turn related to the position of the corresponding point of the scene with respect to the plane.
Indeed, with the notations of Section 4, the disparity associated with a correspondence is the difference between the horizontal coordinates of the two rectified image points; so, according to equations (29) and (38), we obtain an expression of this disparity (equation (41)).
In order to interpret this expression, we introduce the signed distance $d(M, \Pi)$ of a point $M = [X, Y, Z]^{\top}$ to a plane $\Pi$ defined by its unit normal $n$ and its distance $d_{\Pi}$ to the origin:
$$ d(M, \Pi) \;=\; n^{\top} M - d_{\Pi}. $$
The sign of $d(M, \Pi)$ is the same for all the points $M$ located on the same side of $\Pi$, and $|d(M, \Pi)|$ is equal to the distance of $M$ to $\Pi$. Similarly, we introduce the signed distance $d(m, l)$ of a point $m = [x, y, 1]^{\top}$ to a line $l$ defined by its unit normal $\nu$ and its distance $d_{l}$ to the origin:
$$ d(m, l) \;=\; \nu^{\top} [x, y]^{\top} - d_{l}. $$
The sign of $d(m, l)$ is the same for all the points $m$ located on the same side of $l$, and $|d(m, l)|$ is equal to the distance of $m$ to $l$. According to equations (21) and (3), the quantities appearing in equation (41) can then be expressed in terms of the signed distances of $M$ to the focal plane $\Pi_{2}$ of the second view and to the focal plane $\Pi_{R}$ of the rectified views (see Figure 4), and of the signed distance of the image point to the line $\Delta_{\infty}$ whose image by the rectifying transformation is the line at infinity.
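A minimal sketch of the two signed distances just introduced, assuming the plane is written as $n \cdot X = d$ with $n$ a unit normal (the sign convention is one possible choice); names are illustrative.

```python
import numpy as np

def signed_distance_point_plane(X, n, d):
    """Signed distance of a 3-D point X to the plane { P : n . P = d }, n a unit normal."""
    return float(np.dot(n, X) - d)

def signed_distance_point_line(x, nu, mu):
    """Signed distance of an image point x = (u, v) to the line { p : nu . p = mu }, nu a unit normal."""
    return float(np.dot(nu, x) - mu)

# Points on the same side of the plane get the same sign; the magnitude is the distance.
n = np.array([0.0, 0.0, 1.0])          # plane z = 2
print(signed_distance_point_plane(np.array([1.0, 1.0, 5.0]), n, 2.0))   #  3.0
print(signed_distance_point_plane(np.array([0.0, 0.0, 0.5]), n, 2.0))   # -1.5
```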
Now, using these signed distances, the disparity can be written in three different ways (equations (42), (43) and (44)), each time as the product of a factor which does not depend on the scene point $M$ with a ratio of signed distances involving $d(M, \Pi)$. Since these factors do not depend on $M$, we deduce from these equations the following three interpretations:
- From equation (42), we deduce that, if $M$ is a visible point (which implies that $d(M, \Pi_{2}) > 0$), the sign of the disparity gives its position with respect to $\Pi$. Furthermore, the disparity is proportional to the ratio of the distance of $M$ to $\Pi$ to the distance of $M$ to $\Pi_{2}$.

- From equation (43), we deduce that, if $M$ is a visible point, the sign of the disparity usually gives its position with respect to $\Pi$: indeed, the line $\Delta_{\infty}$ usually does not go through any point of the image, so the sign of the corresponding signed distance is usually the same for all the points considered. In fact, $\Delta_{\infty}$ is usually far away from the image, so that this signed distance hardly varies over the points considered, and the disparity is again approximately proportional to the ratio of the distance of $M$ to $\Pi$ to the distance of $M$ to the focal plane.

- From equation (44), we deduce that the sign of the disparity gives the position of $M$ with respect to $\Pi$, and that the disparity is proportional to the ratio of the distance of $M$ to $\Pi$ to the distance of $M$ to $\Pi_{R}$. According to equation (27), $\Delta_{\infty}$ is an epipolar line; it is thus the image in the second view of an epipolar plane $P$. Now, the image of $P$ in the rectified view is the line at infinity, so $P$ is parallel to $\Pi_{R}$. Since $P$ is an epipolar plane and $\Delta_{\infty}$ is usually far away from the image, $P$ and $\Pi_{R}$ are approximately parallel around the image, so that $\Pi_{R}$ may be approximated by $P$ for the points considered, and we are back to the preceding interpretation.
When $\Pi$ is the plane at infinity, according to equation (21), the factor involving $d(M, \Pi)$ becomes constant. Thus, in that case, the disparity is inversely proportional to the distance of $M$ to the focal plane of the second view, approximately inversely proportional to the distance of $M$ to the plane introduced in the second interpretation, and inversely proportional to the distance of $M$ to the focal plane of the rectified views.
6 Computing local terrain orientations using collineations
Once a point correspondence is obtained through the process of rectification and correlation along
the epipolar lines described in section 4, it is possible to estimate, in addition to a measure of the
disparity, a measure of local surface orientations, by using the image intensity function.
The traditional approach to computing such surface properties is to first build a metric model of
the observed surfaces from the stereo matches, and then to compute local surface properties using
standard tools from Euclidean geometry. This approach has two major drawbacks. First, reconstructing the geometry of the observed surfaces can be expensive because it requires not only applying geometric transformations to the image pixels and their disparity in order to recover three-dimensional
coordinates, but also interpolating a sparse 3D-map in space to get dense three-dimensional information. Second, reconstructing the metric surface requires having full knowledge of the geometry of the
camera system through exact calibration. In addition, surface properties such as slope are particularly
sensitive to the calibration parameters, thus putting more demand on the quality of the calibration.
Here, we investigate algorithms for evaluating terrain orientations from pairs of stereo images
using limited calibration information. More precisely, we want to obtain an image in which the value
of each pixel is a measure of the difference in orientation relative to some reference orientation, e.g.
the orientation of the ground plane, assuming that the only accurate calibration information is the
epipolar geometry of the cameras.
We investigate two approaches based on these geometrical tools. In the first approach (Section 6.2),
we compute the Sum of Squared Differences (SSD) at a pixel for all the possible skewing configurations of the windows. The skewing parameters of the window which produce the minimum of SSD
correspond to the most likely orientation at that point. This approach uses only knowledge of the
epipolar geometry but does not allow the full recovery of the slopes. Rather, it permits the comparison of the slope at every pixel with a reference slope, e.g. the orientation of the ground plane for a
mobile robot.
The second approach (Section 6.3) involves relating the actual orientations in space with window skewing parameters. Specifically, we parameterize the space of all possible windows at a given
pixel by the corresponding directions on the unit sphere. This provides more information than in the
previous case but requires additional calibration information, i.e., the knowledge of the approximate
intrinsic parameters of one of the cameras and of point correspondences in the plane at infinity.
6.1 The principle
The guiding principle of this section is the following: the collineation that represents a planar surface
is the one that best warps the first image onto the second one.
This principle is used implicitly in all area-based stereo techniques in which the images are rectified and the scene is supposed to be locally fronto-parallel (i.e., parallel to the cameras) at each point
[8, 25, 22]. In this case, homographies are simple translations. A rectangular window in image 1
maps onto a similar window in image 2, whose horizontal offset (disparity) characterizes the position of the plane in space. The pixel-to-pixel mapping defined by the homography makes it possible to compute
a similarity measure on the pixel intensities inside the windows, based on a cross-correlation of the
intensity vectors or a SSD of the pixel intensities.
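The sketch below illustrates this guiding principle under simplifying assumptions: it warps a square window of image 1 into image 2 through a candidate homography (bilinear sampling) and returns the SSD; the fronto-parallel example at the end uses a pure translation, for which the score should be near zero. Function names and parameters are illustrative, not the report's code.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinear interpolation of a grayscale image at a (possibly fractional) location."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy) + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy + img[y0 + 1, x0 + 1] * dx * dy)

def ssd_under_homography(img1, img2, H, center, half_win=3):
    """SSD between a square window of image 1 and its image under H in image 2.

    H maps homogeneous pixel coordinates of image 1 onto image 2; the homography
    that truly represents the local planar surface should give a low score.
    """
    cx, cy = center
    total = 0.0
    for v in range(-half_win, half_win + 1):
        for u in range(-half_win, half_win + 1):
            p = np.array([cx + u, cy + v, 1.0])
            q = H @ p
            x2, y2 = q[0] / q[2], q[1] / q[2]
            total += (float(img1[cy + v, cx + u]) - float(bilinear(img2, x2, y2))) ** 2
    return total

# Fronto-parallel case: H is a pure horizontal translation (a disparity of 2 pixels).
img1 = np.random.rand(64, 64)
img2 = np.roll(img1, 2, axis=1)
H = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(ssd_under_homography(img1, img2, H, center=(30, 30)))   # close to 0
```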
Another example of this concept is the use of windows of varying shapes in area-based stereo, which compensate for the effects of foreshortening due to the orientation of the plane with respect to the
camera. In the TELEOS system [24], for example, several window shapes are used in the computation
of the disparity.
We use this principle for choosing, among all homographies that represent planes of various orientations passing through a surface point $M$, the one that best represents the surface at $M$. We use a standard window-based correlation algorithm to establish the correspondences $(m, m')$ between the images of $M$. Since the methods presented above are sensitive to the disparity estimates, we also use a simple sub-pixel disparity estimator.
6.2 Window-based representation
When applying a homography to a rectangular window in image 1, one obtains in general a skewed window in image 2. Since the homographies that we study map $m$ onto $m'$, two other pairs of points are sufficient for describing them (see Section 3.3). This allows us to introduce a description of the plane orientation by two parameters measured directly in the images.
The standard configuration In order to simplify the presentation, we first describe the relations in the case of cameras in standard configuration, i.e. whose optical axes are aligned and such that the axes of the image planes are also aligned. We also assume that the camera intrinsic parameters are known; considering metric coordinates in the retinal plane instead of pixel coordinates, we end up with $A = A' = I$ (using the same notations as in Section 2). Choosing the frame attached to the first camera as the reference frame, the translation between the cameras is assumed to be $t = [t_1, 0, 0]^{\top}$. Although we describe the approach in this simplified case, the principle remains the same in the general case, though interpreting the equations is more complicated.
In the current case, Equation (12) gives us, for a plane with unit normal $n = [n_1, n_2, n_3]^{\top}$ at distance $d$ from the origin:
$$ H \;=\; \begin{pmatrix} 1 + \dfrac{t_1 n_1}{d} & \dfrac{t_1 n_2}{d} & \dfrac{t_1 n_3}{d} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (45) $$
The distance parameter $d$ can be obtained from the image points as follows: the images $m$ and $m'$ of the three-dimensional point $M$ are known, and related by a horizontal disparity $\Delta$. From the projection geometry, the disparity is related to the depth of $M$ (equation (46)), so we obtain a simple expression of $d$ as a function of the known variables $m$ and $\Delta$ (equation (47)).
Figure 9: Parametrization of window deformation; (a): left image. (b): right image if the cameras
are aligned. (c): right window in general camera configuration.
Substituting (47) in (45), $H$ can be expressed as a function of the disparity $\Delta$ and of two parameters $\alpha$ and $\beta$ measured in the images (equation (48)). The disparity only has a translational effect, so it has no influence on the resulting window shape. The two parameters $\alpha$ and $\beta$ fully characterize the effect of $H$ on the window shape. Geometrically, $\alpha$ is the horizontal displacement of the center points of the left and right edges of the correlation window, and $\beta$ is the displacement of the centers of the top and bottom edges of the window (see Figure 9 (b)).
The general case To simplify the equations, this discussion of window skewing was presented in the case in which the cameras are aligned. The property remains essentially the same when the cameras are in a general configuration. In that case, the window shape can still be described by a pair $(\alpha, \beta)$ that represents the displacements of the centers of the window edges along the epipolar lines given by the fundamental matrix, instead of along the rows of the image in the case of aligned cameras (see Figure 9 (c)). Also, we have not included the intrinsic parameter matrices in the computations. It can easily be seen that including these matrices does not change the general form of equation (48); it changes only the relation between $(\alpha, \beta)$ and the orientation of the plane.
We have shown how to parametrize window shape from the corresponding homography. It is important to note that the reasoning can be reversed, in that a given arbitrary value of $(\alpha, \beta)$ corresponds to a unique homography at $m$, which itself corresponds to a unique plane at $M$.
Slope Computation Using Window Shape: The parameterization of window shape as a function of planar orientations suggests a simple algorithm for finding the slope at a pixel. First, choose a set of values for $\alpha$ and $\beta$. Then, compute a measure (correlation or SSD) for each possible $(\alpha, \beta)$, and find the best slope $(\alpha^{*}, \beta^{*})$ as corresponding to the minimum of the measure. This is very similar to the approach investigated in [1].
If all the parameters of the cameras were known, $(\alpha^{*}, \beta^{*})$ could be converted to the Euclidean description of the corresponding plane in space.
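A sketch of this search, under the assumption of a simple local model in which the horizontal offset varies linearly across the window (so that $\alpha$ and $\beta$ are the edge-center displacements described above); the sampling, window size and test data are illustrative.

```python
import numpy as np

def skewed_window_ssd(img1, img2, x, y, disp, alpha, beta, half_win=3):
    """SSD between a square left window and a skewed right window.

    The right-image sample for the pixel at offset (u, v) inside the window is shifted
    horizontally by disp + alpha*u/half_win + beta*v/half_win, so alpha (resp. beta) is
    the horizontal displacement of the centers of the left/right (resp. top/bottom) edges.
    """
    total = 0.0
    for v in range(-half_win, half_win + 1):
        for u in range(-half_win, half_win + 1):
            shift = disp + alpha * u / half_win + beta * v / half_win
            xr = int(round(x + u + shift))
            total += (float(img1[y + v, x + u]) - float(img2[y + v, xr])) ** 2
    return total

def best_slope(img1, img2, x, y, disp, values):
    """Exhaustive search of the (alpha, beta) grid; returns the pair with minimal SSD."""
    scores = {(a, b): skewed_window_ssd(img1, img2, x, y, disp, a, b)
              for a in values for b in values}
    return min(scores, key=scores.get)

img1 = np.random.rand(64, 64)
img2 = np.roll(img1, 1, axis=1)          # plain translation: best skew should be (0, 0)
print(best_slope(img1, img2, 32, 32, disp=1, values=np.linspace(-2, 2, 5)))
```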
Figure 10: Left: Selected pixels in image 1. Center (resp. right): SSD error as a function of $(\alpha, \beta)$ at the road (resp. tree) pixel. The reference orientation $(\alpha_0, \beta_0)$ is approximately the orientation of the road, and is represented by the vertical dotted lines. Both surfaces have a sharp minimum, but only one (the road point) is located at $(\alpha_0, \beta_0)$.
Otherwise, it is still possible to compare the computed orientation with the orientation of a reference plane $\Pi_0$ whose line at infinity is known. Indeed, the homography of the plane passing through $M$ and parallel to $\Pi_0$ can be computed (Section 3.3), and its parameters $(\alpha_0, \beta_0)$ derived; the distance between $(\alpha, \beta)$ and $(\alpha_0, \beta_0)$ is a measure of the difference between the slope of the terrain at $M$ and the reference slope. Figure 10 shows an example in the case of two single pixel correspondences selected in a pair of images. The SSD is computed from the Laplacian of the images rather than from the images themselves, in order to eliminate the offset between the intensities of the two images.
In practice, $\Pi_0$ can be defined by three point correspondences in a pair of training images, two of which lie far enough from the camera to be considered as lying at infinity, thus defining the line at infinity of the plane. Another way to proceed would be to estimate $\Pi_0$ from any three point correspondences, and to compute its intersection with the plane at infinity, estimated with three non-aligned correspondences corresponding to remote points.
Limitations: There are two problems with the $(\alpha, \beta)$ representation of plane directions. First, depending on the position of the point in space, the discretization of the parameters may lead to very different results, due to the non-uniform distribution of the plane directions in $(\alpha, \beta)$-space. In particular, the discrimination between different plane directions becomes poorer as the range to the surface increases. This problem becomes particularly severe when the surfaces are at a relatively long range from the camera and when the variation of range across the image is significant.
The second problem is that it is difficult to interpret consistently the distance between $(\alpha, \beta)$ and $(\alpha_0, \beta_0)$ across the image. Specifically, a particular value of this distance corresponds to different angular distances between planes depending on the disparity, but also on the position of the point in the image.
6.3 Normal-based representation
Based on the limitations identified above, we now develop an alternate parameterization, that consists
of discretizing the set of all possible orientations in space, and then of evaluating the corresponding
Figure 11: Distribution of SSD at the two selected pixels of Figure 10 (Left: road; Right: trees). The SSD distribution is plotted with respect to two coordinates of the 3-D normal. The reference orientation (close to the road) is represented as a plain surface patch. On the left diagram, the computed SSD values are low in the neighborhood of the reference orientation. On the other one, the low SSD values are further from the reference orientation.
homographies. This assumes that the intrinsic parameters of the first camera are known as well as
the collineation of the plane at infinity.
Slope computation using normal representation: Assuming that we know the intrinsic parameters $A$ of the first camera and the homography $H_{\infty}$ of the plane at infinity, we can compute the homography of a plane defined by a given pair of corresponding points $(m, m')$ and a normal vector $n$ in space. Indeed, the plane $\Pi_n$ that is orthogonal to $n$ and contains the optical center $C$ projects in the first image onto a line $l_{\infty}$, and Equation (14) tells us that
$$ l_{\infty} \;=\; A^{-\top}\, n. \qquad (49) $$
This line is the projection in the first image of any line of the plane $\Pi_n$ that does not contain $C$; so, unless $C$ is at infinity, it is the image of the line at infinity of $\Pi_n$, and thus of the plane through the observed point with normal $n$, since both planes are parallel. Given the homography $H_{\infty}$ of the plane at infinity, we can then compute the corresponding line $l'_{\infty}$ in the second image. Finally, according to Section 3.3, $(m, m')$, $l_{\infty}$, $l'_{\infty}$ and the knowledge of the epipolar geometry allow us to compute the homography of the plane.
By sampling the set of possible orientations in a uniform manner, we generate a set of homographies that represent planes of well-distributed orientations at a given point $M$. Then the algorithm of the previous section is used directly to evaluate each orientation $n$. In the current implementation, we sample the sphere of unit normals into 40 different orientations using a regular subdivision of the icosahedron.
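One simple way to obtain roughly uniformly distributed unit normals is sketched below; the report only states that a regular subdivision of the icosahedron is used, so taking the 12 vertices plus the 30 edge midpoints (42 directions) is an assumption made for illustration.

```python
import numpy as np
from itertools import combinations

def icosahedron_directions():
    """Return roughly uniformly distributed unit vectors.

    Uses the 12 vertices of a regular icosahedron plus the midpoints of its 30 edges
    (normalized onto the sphere), one simple flavour of "regular subdivision".
    """
    phi = (1.0 + np.sqrt(5.0)) / 2.0
    verts = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            verts += [np.array([0.0, a, b]), np.array([a, b, 0.0]), np.array([b, 0.0, a])]
    verts = [v / np.linalg.norm(v) for v in verts]
    edge_len = min(np.linalg.norm(u - v) for u, v in combinations(verts, 2)) + 1e-9
    mids = [(u + v) / 2.0 for u, v in combinations(verts, 2)
            if np.linalg.norm(u - v) <= edge_len]
    return np.array(verts + [m / np.linalg.norm(m) for m in mids])

print(len(icosahedron_directions()))   # 12 vertices + 30 edge midpoints = 42 directions
```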
Figure 11 shows the SSD distributions in the case of the two pixels studied in Figure 10. Though
correct, the results point out one remaining problem of this approach: the SSD distribution may be
flat because of the lack of signal variation in the window. This is a problem with any area-based
technique. In Section 6.4, we present a probabilistic framework which enables us to address this
problem.
It is important to note that the orientation of the cameras does not need to be known and that the
coordinate system in which the orientations are expressed is unimportant. In fact, we express all the
orientations in a reference frame attached to the first camera, which is sufficient since all we need is
to compare orientations, which does not require the use of a specific reference frame. Consequently,
it is not necessary to use the complete metric calibration of the cameras.
A-priori geometric knowledge: In practice, $H_{\infty}$ is estimated as described at the end of the previous section. Since this only gives an approximation, the lines $l_{\infty}$ and $l'_{\infty}$ that we compute from a given orientation do not really represent a line at infinity. Thus, the planes that correspond to this orientation rotate around a fixed line in space instead of being parallel. For practical purposes, the line is far enough so that this discrepancy does not introduce significant errors.
The matrix $A$ represents the intrinsic parameters of the first image. Since we are interested in the slopes in the image relative to some reference plane, it is not necessary to know $A$ precisely. Specifically, an error in $A$ introduces a consistent error in the computation of the homographies which is the same for the reference plane and for an arbitrary plane, and does not affect the output of the algorithm dramatically.
We finally remark that if $A$ is modified by changing the scale in the first image, the results remain unchanged. This geometric property, observed by Koenderink [16], implies that only the aspect ratio, the angle of the pixel axes and the principal point of the first camera need to be known.
6.4 Application to estimation of terrain traversability
Although the accuracy of the slopes computed using the algorithms of the previous section is not
sufficient to, for example, reconstruct terrain shape, it provides a valuable indication of the traversability of the terrain. Specifically, we define the traversability at a point as the probability that the
angle between a reference vertical direction and the normal to the terrain surface is lower than a given
angular threshold. The term traversability comes from mobile robot navigation in which the angular
threshold controls the range of slopes that the robot can navigate safely.
Estimating traversability involves converting the distribution of SSD values $\mathrm{SSD}(n)$ at a pixel into a function $P(n)$ which can be interpreted as the likelihood that $n$ corresponds to the terrain orientation at that pixel. We then define the traversability measure $T$ as the probability that this orientation is within a neighbourhood $\mathcal{N}$ around the direction of the reference plane:
$$ T \;=\; \sum_{n \in \mathcal{N}} P(n). $$
We use a formalism similar to the one presented in [22] in order to define $P$. Assuming that the pixel values in both images are normally distributed with standard deviation $\sigma$, the distribution of $n$ is given by:
$$ P(n) \;=\; \frac{1}{Z} \exp\!\left(-\frac{\mathrm{SSD}(n)}{2\sigma^{2}}\right) \qquad (50) $$
where $Z$ is a normalizing factor.
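A minimal sketch of this conversion, assuming the SSD scores have already been computed for a list of candidate unit normals; the value of sigma, the admissible cone and the test data are illustrative.

```python
import numpy as np

def traversability(ssd, normals, reference, max_angle_deg, sigma=10.0):
    """Probability-style traversability from a distribution of SSD scores.

    ssd[i] is the SSD obtained for candidate unit normal normals[i]; scores are
    converted to likelihoods with a Gaussian model (equation (50)) and summed over
    the normals within max_angle_deg of the reference direction.
    """
    ssd = np.asarray(ssd, dtype=float)
    p = np.exp(-ssd / (2.0 * sigma ** 2))
    p /= p.sum()                                    # normalizing factor Z
    cosines = normals @ reference
    admissible = cosines >= np.cos(np.deg2rad(max_angle_deg))
    return float(p[admissible].sum())

# Toy example: 5 candidate normals around the vertical, lowest SSD near the vertical.
normals = np.array([[0, 0, 1], [0.3, 0, 0.95], [-0.3, 0, 0.95], [0.7, 0, 0.7], [1, 0, 0]], float)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
ssd = [5.0, 40.0, 45.0, 300.0, 500.0]
print(traversability(ssd, normals, reference=np.array([0.0, 0.0, 1.0]), max_angle_deg=30))
```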
This definition of $P$ has two advantages. First, it integrates the confidence values computed for all the slope estimates (50) into one traversability measure. In particular, if the distribution of $\mathrm{SSD}(n)$ is relatively flat, $T$ has a low value, reflecting the fact that the confidence in the position of the minimum of $\mathrm{SSD}(n)$ is low. This situation occurs when there is not enough signal variation in the images, or when the pixel is the projection of a scene point that is far from the cameras.
The second advantage of this definition of traversability is that the sensitivity of the algorithm can be adjusted in a natural way. For example, if $\mathcal{N}$ is defined as the set of plane orientations which are at an angle less than $\theta$ from the reference direction, the sensitivity of $T$ increases as $\theta$ decreases.
Figure 12: Examples of traversability maps computed on two pairs of images (see text).
Figure 12 shows the results on two pairs of images of outdoor scenes. The first image of each
pair is displayed on the left. The center images show the complete traversability maps. Once again,
the influence of the signal is noticeable. In particular, in the top example, a large part of the road has
a rather low traversability, because there is little signal in the images. On the contrary, the values
corresponding to the highly textured sidewalk are very high.
The right image shows the regions that have a probability greater than the value we would obtain
if there were no signal in the images, i.e. the regions that could be considered as traversable. In both
cases, the obstacles have low traversability values.
Figure 13 shows the result of evaluating the traversability for three different values of the angular threshold. Only the traversable regions are shown. As the threshold increases, the influence of the signal becomes less noticeable, and the likelihood of a region to be traversable increases.
The measure of traversability can be easily integrated into navigation systems such as the one presented in Section 7.
Figure 13: Traversability map from the distribution of slopes on real data. Left: small admissible region; Center: medium admissible region; Right: large admissible region.
7 Navigating
In this section we show three robotic applications of the geometric properties presented in the above
sections. In the first one, stereo is used to detect close obstacles. In the second one, the robot uses
affine geometry to follow the middle of a corridor. In the third one, relative heights and orientations
with respect to the ground plane are used for trajectory planning.
7.1 Detecting obstacles
This section describes how to use the previous results to provide a robot with the ability to detect
obstacles. The only requirement is that the robot is equipped with a stereo rig which can be very
simply calibrated, as explained next.
Let us imagine the following calibration steps:

- as described in Section 3.1, some correspondences between two views taken by the cameras are found;
- these correspondences are used to compute the fundamental matrix, as described in Section 3.2;
- three particular correspondences are given to the system; they correspond to three object points defining a virtual plane $\Pi$ in front of the robot;
- the H-matrix of $\Pi$ is computed as described in Section 3.3.

The fundamental matrix, as well as the plane H-matrix, remain the same for any other pair of views taken by the system, as long as the intrinsic parameters of both cameras and the attitude of one camera with respect to the other do not change.
According to Section 5, by repeatedly performing rectifications with respect to $\Pi$, the robot knows whether there are points in front, between itself and $\Pi$, by looking at the sign of their disparity, and can act accordingly. If the distance of the robot to $\Pi$ is known, the robot may, for example, move forward by that distance. Furthermore, if $\Pi$ and the focal plane intersect sufficiently far away from the cameras, it can detect whether the points are moving away from or towards itself. Indeed, $\Pi$ and the focal plane may then be considered as parallel around the images, so that, for the points considered, the disparity is approximately proportional to the ratio of the distance of the point to $\Pi$ over its distance to the focal plane; according to equation (42), it is thus a monotonic function of the distance of the point to $\Pi$.
Finally, since we are only interested in the points near the plane, which have a disparity close to zero, we can limit the search along the epipolar line of the correspondent of any point $m$ to an interval around $m$, which significantly reduces the computation time.
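A sketch of the detection step under these assumptions: the images are already rectified with respect to the virtual plane, the search is restricted to a small disparity interval around zero, and the sign of the best disparity classifies the point; window size, interval and the synthetic example are illustrative.

```python
import numpy as np

def disparity_near_plane(left, right, x, y, half_win=3, max_disp=8):
    """Search a small disparity interval around zero along the scanline.

    left/right are rectified grayscale images (float arrays); returns the signed
    disparity of the best SSD match for the pixel (x, y) of the left image.
    One sign means the point lies in front of the reference plane, the other
    that it lies beyond it (up to the sign convention of the rectification).
    """
    patch = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
    best_d, best_ssd = 0, np.inf
    for d in range(-max_disp, max_disp + 1):
        xr = x + d
        cand = right[y - half_win:y + half_win + 1, xr - half_win:xr + half_win + 1]
        if cand.shape != patch.shape:
            continue
        ssd = float(np.sum((patch - cand) ** 2))
        if ssd < best_ssd:
            best_ssd, best_d = ssd, d
    return best_d

# Toy example: a synthetic pair where the right image is shifted by -2 pixels.
left = np.random.rand(64, 64)
right = np.roll(left, -2, axis=1)
print(disparity_near_plane(left, right, 32, 32))   # expected: -2
```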
An example is given in Figures 14, 15, 16 and 17. Figure 14 shows as dark square boxes the
points used to define a plane and the image of a fist taken by the left camera. Figure 15 shows the left
Figure 14: The paper used to define the plane and the left image of the fist taken as an example.
Figure 15: The left and right rectified images of the fist.
and right images once rectified with respect to this plane. Figure 16 shows the disparity map obtained
by correlation. Figure 17 shows the segmentation of the disparity map in two parts. On the left side,
points with negative disparities, that is points in front of the reference plane, are shown. The intensity
encodes closeness to the camera. Similarly, the right side of the figure shows the points with positive
disparities, that is the points which are beyond the reference plane.
Figure 16: The disparity map obtained from the rectified images of Figure 15.
Figure 17: The absolute value of the negative disparities on the left, showing that the fist and a portion
of the arm are between the robot and the plane of rectification, and the positive disparities on the right,
corresponding to the points located beyond the plane.
Figure 18: Determining a point at the middle of a corridor (see Section 7.2).
7.2 Navigating along the middle of a corridor
If, in addition to the fundamental matrix, we also compute during the calibration stage the homography of the plane at infinity, using the method described in Section 3.4, the robot becomes able to compute ratios of three aligned points, and thus, for example, to locate the middle of a corridor and do visual servoing.
Indeed, let us represent the projections of the sides $D_1$ and $D_2$ of the corridor by $d_1$ and $d_2$ in the first image and $d'_1$ and $d'_2$ in the second image (see Figure 18), and choose any point $m_1$ of $d_1$ and any point $m_2$ of $d_2$, projections in the first image of a point $M_1$ of $D_1$ and a point $M_2$ of $D_2$. The corresponding points $m'_1$ and $m'_2$ in the second image are computed as the intersections, respectively, of the epipolar line of $m_1$ with $d'_1$ and of the epipolar line of $m_2$ with $d'_2$. Having $(m_1, m'_1)$ and $(m_2, m'_2)$ allows us to compute the projections $(m, m')$ of the midpoint $M$ of $M_1$ and $M_2$. If we consider $D_1$ and $D_2$ as locally parallel, then $M$ lies on the local middle line of the corridor, and computing the projections of another point of this line in the same way as for $M$ gives the projections of this line in the two images.
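One way to carry out this construction, consistent with the affine calibration described here but not necessarily the authors' exact computation, is sketched below: the image of the point at infinity of the 3-D line is obtained with the infinity homography, and the midpoint image is its harmonic conjugate with respect to the two endpoint images.

```python
import numpy as np

def midpoint_image(a, b, H_inf, a2, b2):
    """Image (in view 1) of the 3-D midpoint of two points seen at a, b (view 1) and a2, b2 (view 2).

    The vanishing point v of the 3-D line is the point of line <a,b> whose image by the
    infinity homography H_inf lies on <a2,b2>; the midpoint image is then the harmonic
    conjugate of v with respect to a and b.  All points are homogeneous 3-vectors.
    """
    l1 = np.cross(a, b)
    l2 = np.cross(a2, b2)
    v = np.cross(l1, H_inf.T @ l2)        # on l1, and mapped by H_inf onto l2
    # express v in the projective basis (a, b) of the line: v ~ alpha*a + beta*b
    alpha, beta = np.linalg.lstsq(np.stack([a, b], axis=1), v, rcond=None)[0]
    return alpha * a - beta * b           # harmonic conjugate of v w.r.t. (a, b)

# Toy check with an affine-like setup: identity infinity homography and two views
# related by a pure horizontal shift, so the result is the Euclidean midpoint.
a  = np.array([10.0, 20.0, 1.0]);  b  = np.array([30.0, 60.0, 1.0])
a2 = a + np.array([5.0, 0.0, 0.0]); b2 = b + np.array([5.0, 0.0, 0.0])
m = midpoint_image(a, b, np.eye(3), a2, b2)
print(m / m[2])    # expected close to (20, 40, 1)
```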
Figure 19 shows some real sequences used to perform the affine calibration of a stereoscopic system. Six strong correspondences between the four images have been extracted, from which fifteen
correspondences of points at infinity have been computed to finally get the homography of the plane
at infinity. Figure 20 shows some midpoints obtained once the system is calibrated: the endpoints are
represented as black squares and the midpoints as black crosses. Figure 21 shows the midline of a
corridor obtained from another affinely calibrated system: the endpoints are represented as numbered
oblique dark crosses, the midpoints as black crosses and the midline as a black line.
7.3 Trajectory evaluation using relative elevation
A limitation of the conventional approach to stereo driving is that it relies on precise metric calibration with respect to an external calibration target in order to convert matches to 3-D points. From a
practical standpoint, this is a serious limitation in scenarios in which the sensing hardware cannot be
physically accessed, such as in the case of planetary exploration. In particular, this limitation implies
that the vision system must remain perfectly calibrated over the course of an entire mission. Ne-
Figure 19: The top images correspond to a first pair of views taken by a stereoscopic system and the
bottom images to a second pair taken by exactly the same system after a translation. Among the 297
detected corners of the top left image and the 276 of the top right image, 157 point correspondences have been found by stereo point matching (see Section 3.1), among which 7 outliers have been rejected when computing the fundamental matrix (see Section 3.2). The top-to-bottom correspondence matching has been obtained by tracking (see Section 3.1).
Figure 20: Midpoints obtained after affine calibration (see Section 3.4).
Figure 21: Midline of a corridor obtained after affine calibration (see Section 3.4).
vertheless, navigation should not require the precise knowledge of the 3-D position of points in the
scene: What is important is how much a point deviates from the reference ground plane, not its exact
position. Based on these observations, we have developed an approach which relies on the measure
of relative height with respect to a ground plane (see Section 5.1).
7.3.1 The driving approach
We give only an overview of the approach since a detailed description of the driving system is outside
the scope of this book. A detailed description of the stereo driving system can be found in [17]. The
autonomous navigation architecture is described in [19] and [13].
In autonomous driving, the problem is to use the data in the stereo images for computing the best
steering command for the vehicle, and to update the steering command every time a new image is
taken. Our basic approach is to evaluate a set of possible steering directions based on the relative
heights computed at a set of points which project onto a regular grid in the image. Specifically, a
given steering radius corresponds to an arc which can be projected into a curve in the image. This
curve traces the trajectory that the vehicle would follow in the image if it used this steering radius.
Given the points of the measurement grid and the set of steering radii, we compute a vote for every arc
and every point of the grid which reflects how drivable the arc is. The computed value lies between
-1 (high obstacle) and +1 (no obstacle) (Figure 22).
For a given steering radius, the votes from all the grid points are aggregated into a single vote by
taking the minimum of the votes computed from the individual grid points. The output is, therefore, a
distribution of votes between -1 and +1, -1 being inhibitory, for the set of possible steering arcs. This
distribution is then sent to an external module which combines the distribution of votes from stereo
with distributions from other modules in order to generate the steering command corresponding to the
highest combined vote. Figure 23 shows examples of vote distributions computed in front of visible
obstacles.
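A minimal sketch of the aggregation step, assuming the per-point votes have already been computed; the shapes and values are illustrative.

```python
import numpy as np

def aggregate_votes(point_votes):
    """Combine per-grid-point votes into one vote per steering arc.

    point_votes has shape (n_arcs, n_points) with values in [-1, +1]
    (-1: high obstacle on the arc at that point, +1: no obstacle);
    the arc vote is the most pessimistic (minimum) of its point votes.
    """
    return np.min(np.asarray(point_votes, dtype=float), axis=1)

# Toy example: 3 candidate arcs evaluated at 4 grid points.
votes = [[0.9, 0.8, 1.0, 0.7],     # clear arc
         [0.9, -1.0, 0.8, 0.6],    # an obstacle inhibits this arc
         [0.2, 0.4, 0.1, 0.3]]     # marginal arc
print(aggregate_votes(votes))      # [ 0.7 -1.   0.1]
```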
Characterization of the reference ground plane: The homography of the reference ground plane
is estimated from a number of point correspondences related to scene points on this plane. These point
correspondences are obtained by selecting points in the first image. Their corresponding points in the
second image are computed automatically through the process of rectification and correlation along
the epipolar lines, described in Section 4.
Measuring obstacle heights Let us consider one point of the grid in the first image, for which
a corresponding point in the second image has been found by the stereo process. Based on the results
of Section 5.1, we can compute its height with respect to the ground. The unit height is defined by
a point of the scene, selected manually in one of the two images at the beginning of the experiment
and matched automatically in the other image.
This measurement is not sufficient, since we aim at measuring heights along trajectories which
are estimated in the ground plane. So, after determining the elevation of a point selected in one image,
we determine the point of the ground plane to which this elevation has to be assigned, by projecting
the measured 3D point on the (horizontal) ground plane, along the vertical direction. This means
RR n˚2584
48
Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert
computing the intersection between the ground plane and the vertical line passing through the 3D
point. To apply the method of Section 3.3, we need to compute the images of the vertical line
passing through the observed point. For this, we compute the images of the point at infinity in the
vertical direction (also called vertical vanishing point) in both images. First, we select manually four
points representing two vertical lines in the left image. Matching two of these points we obtain one
of the two corresponding lines in the right image. The left vertical vanishing point is obtained by
intersecting the two lines in the left image. Computing the intersection of its epipolar line with the
line in the right image, we obtain the right vertical vanishing point.
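A sketch of this vanishing-point construction with cross products of homogeneous coordinates; it assumes the convention in which the epipolar line of a left-image point x is Fx, and the coordinates and fundamental matrix used in the example are made up.

```python
import numpy as np

def vertical_vanishing_points(p1, p2, p3, p4, q1, q2, F):
    """Vanishing points of the vertical direction in both images.

    p1..p4 are the manually selected points of two vertical lines in the left image
    ((p1, p2) and (p3, p4)); q1, q2 are the matches of p1, p2 in the right image.
    All points are homogeneous 3-vectors and F is the fundamental matrix
    (convention: the epipolar line of a left point x is F @ x).
    """
    v_left = np.cross(np.cross(p1, p2), np.cross(p3, p4))   # intersection of the two left lines
    line_right = np.cross(q1, q2)                            # corresponding vertical line, right image
    v_right = np.cross(F @ v_left, line_right)               # epipolar line of v_left meets that line
    return v_left, v_right

# Illustrative call with made-up coordinates and a made-up fundamental matrix.
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
vl, vr = vertical_vanishing_points(np.array([100.0, 0.0, 1.0]), np.array([110.0, 400.0, 1.0]),
                                   np.array([300.0, 10.0, 1.0]), np.array([305.0, 380.0, 1.0]),
                                   np.array([95.0, 0.0, 1.0]), np.array([105.0, 400.0, 1.0]), F)
print(vl / vl[2], vr / vr[2])
```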
Computing image trajectories This approach assumes that a transformation is available for projecting the steering arcs onto the image plane. Such a transformation can be computed from features in sequences of images using an approach related to the estimation algorithm described above for computing the homography induced by the ground plane.
We first introduce a system of coordinates in the ground plane, attached to the rover, which we call "rover coordinates". At each time, we know in rover coordinates the trajectory which will be followed by the rover for each potential steering command. Furthermore, for a given motion/steering command sent to the robot, we know from the mechanical design the expected change of rover coordinates from the final position to the initial one. We can even estimate the actual motion using dead-reckoning. Since this transformation is a change of coordinates in the plane, it can be represented by a $3 \times 3$ matrix operating on homogeneous coordinates.
The transformation which we compute is the homography $T$ which maps pixel coordinates in the left image onto rover coordinates. The inverse of this matrix then allows us to map potential rover trajectories onto the left image plane.
Computation of $T$ is done by tracking points across the left images taken at various rover positions. Let us consider two images acquired at positions 1 and 2, with a known rover motion $D_{12}$. Given a point correspondence $(p_1, p_2)$, we have the following equation (up to a scale factor):
$$ T\, p_1 \;=\; D_{12}\, T\, p_2 . $$
This can also be written
$$ T\, p_1 \times (D_{12}\, T\, p_2) \;=\; 0 . $$
This yields a system of two independent quadratic equations in the coefficients of $T$. Given a set of displacements and point coordinates, we can write a large system of such equations, which we solve in the least-squares sense using the Levenberg-Marquardt technique.
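A sketch of this estimation, assuming the convention $T p_1 \simeq D_{12} T p_2$ written above; it uses scipy's Levenberg-Marquardt solver on synthetic data with two planar motions, and the scale ambiguity of $T$ is fixed by an extra residual. All names and values are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(t_flat, correspondences):
    """Stack the cross-product constraints  (T p1) x (D12 T p2) = 0 for all pairs."""
    T = t_flat.reshape(3, 3)
    res = []
    for p1, p2, D12 in correspondences:
        c = np.cross(T @ p1, D12 @ (T @ p2))
        res.extend([c[0], c[1]])                   # only two equations are independent
    res.append(np.linalg.norm(t_flat) - 1.0)       # fix the free scale of the homography
    return np.array(res)

def estimate_T(correspondences, T0):
    sol = least_squares(residuals, T0.ravel(), args=(correspondences,), method="lm")
    return sol.x.reshape(3, 3)

# Synthetic sanity check: ground-truth T, two planar motions, tracked points.
T_true = np.array([[0.02, 0.0, -3.0], [0.0, 0.02, 1.0], [0.0, 0.0, 1.0]])
def planar_motion(theta, tx, ty):
    return np.array([[np.cos(theta), -np.sin(theta), tx],
                     [np.sin(theta),  np.cos(theta), ty],
                     [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
pairs = []
for D12 in (planar_motion(0.1, 0.5, 0.2), planar_motion(-0.2, 0.8, -0.1)):
    for _ in range(15):
        p1 = np.array([rng.uniform(0, 500), rng.uniform(0, 500), 1.0])
        X2 = np.linalg.solve(D12, T_true @ p1)     # rover coordinates seen from the new pose
        p2 = np.linalg.solve(T_true, X2)           # corresponding pixel at position 2
        pairs.append((p1, p2 / p2[2], D12))
T0 = T_true / np.linalg.norm(T_true) + 0.01 * rng.standard_normal((3, 3))  # rough initial guess
T_est = estimate_T(pairs, T0)
print(T_est / T_est[2, 2])   # compare with T_true (defined up to a global scale)
```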
Using heights to speed up stereo matching The relative height is also used for limiting the search in the stereo matching. More precisely, we define an interval $[h_{\min}, h_{\max}]$ of heights which we anticipate in a typical terrain. This interval is converted at each pixel to a disparity range $[d_{\min}, d_{\max}]$. This is an effective way of limiting the search by searching only for disparities that are physically meaningful at each pixel.
Figure 22: Evaluating steering at individual points; (top) Three points selected in a stereo pair and their projections; (bottom) Corresponding votes for all steering directions.
7.3.2 Experimental results
This algorithm has been successfully used for arc evaluation in initial experiments on the CMU HMMWV [13],
a converted truck for autonomous navigation. In this case, a 400-point grid was used. The combination of stereo computation and arc evaluation was done at an average of 0.5 s on a Sparc-10 workstation. New steering directions were issued to the vehicle at that rate. This update rate is comparable
to what can be achieved using a laser range finder [19].
An important aspect of the system is that we are able to navigate even though a relatively small
number of points is processed in the stereo images. This is in contrast with the more conventional
approach in which a dense elevation map is necessary, thus dramatically increasing the computation
time. Using such a small number of points is justified because it has been shown that the set of points
needed for driving is a small fraction of the entire data set, independently of the sensor used [15], and
because we have designed our stereo matcher to compute matches at specific points.
Figure 23: Evaluating steering directions in the case of a large obstacle (left) and a small obstacle (right). (Top):
Regular grid of points (dots) and corresponding projections (squares); (Bottom): Distribution of votes (see text).
8 Conclusion
In this chapter we have pushed a little further the idea that only the information necessary to solve a given visual task needs to be recovered from the images, and that this attitude pays off by considerably reducing the complexity of the processing.
Our guiding light has been to exploit the natural mathematical idea of invariance under a group
of transformations. This has led us to consider the three usual groups of transformations of the 3-D
space, the projective, affine and Euclidean groups which determine a three-layer stratification of that
space in which we found it convenient to think about and solve a number of vision problems related
to robotics applications.
We believe that this path, even though it may look a bit arduous from a non-mathematically inclined reader's point of view, offers enough practical advantages to make it worth investigating further. In
particular we are convinced that, apart from the robotics applications that have been described in the
paper and for which we believe our ideas have been successful, the approach can be used in other
areas such as the representation and retrieval of images from digital libraries.
References
[1] Frédéric Devernay and Olivier Faugeras. Computing differential properties of 3-D shapes from
stereoscopic images without 3-D models. In Proceedings of the International Conference on
Computer Vision and Pattern Recognition, pages 208–213, Seattle, WA, June 1994. IEEE.
[2] O.D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig ? In
G. Sandini, editor, Proceedings of the 2nd European Conference on Computer Vision, pages
563–578, Santa Margherita Ligure, Italy, May 1992. Springer-Verlag.
[3] O.D. Faugeras and L. Robert. What can two images tell us about a third one ? In J-O. Eklundh, editor, Proceedings of the 3rd European Conference on Computer Vision, pages 485–
492, Stockholm, Sweden, May 1994. Springer-Verlag.
[4] Olivier Faugeras. Stratification of 3-d vision: projective, affine, and metric representations.
Journal of the Optical Society of America A, 12(3):465–484, March 1995.
[5] Olivier Faugeras, Tuan Luong, and Steven Maybank. Camera self-calibration: theory and experiments. In G. Sandini, editor, Proceedings of the 2nd European Conference on Computer
Vision, volume 588 of Lecture Notes in Computer Science, pages 321–334, Santa Margherita
Ligure, Italy, May 1992. Springer-Verlag.
[6] Olivier D. Faugeras. Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press,
1993.
[7] Olivier D. Faugeras and Francis Lustman. Let us suppose that the world is piecewise planar. In
O. D. Faugeras and Georges Giralt, editors, Robotics Research, The Third International Symposium, pages 33–40. MIT Press, 1986.
[8] P. Fua. Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991.
[9] Gene H. Golub and Charles F. Van Loan. Matrix computations. The Johns Hopkins University
Press, Baltimore, Maryland, 1983.
[10] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In Proceedings 4th Alvey
Conference, pages 147–151, Manchester, August 1988.
[11] R. Hartley, R. Gupta, and T. Chang. Stereo from uncalibrated cameras. In Proceedings of the
Conference on International Conference on Computer Vision and Pattern Recognition, pages
761–764, Urbana Champaign, IL, June 1992. IEEE.
[12] Richard Hartley, Rajiv Gupta, and Tom Chang. Stereo from Uncalibrated Cameras. In Proceedings of CVPR92, Champaign, Illinois, pages 761–764, June 1992.
INRIA
Applications of non-metric vision to some visually guided robotics tasks
53
[13] M. Hebert, D. Pomerleau, A. Stentz, and C. Thorpe. A behavior-based approach to autonomous navigation systems: The CMU UGV project. To appear in IEEE Expert, 1994.
[14] J. J. Koenderink and A. J. Van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991.
[15] A. Kelly. A partial analysis of the high speed autonomous navigation problem. Technical Report
CMU-RI-TR-94-16, The Robotics Institute, Carnegie Mellon, 1994.
[16] J. J. Koenderink and A. J. Van Doorn. Geometry of binocular vision and a model for stereopsis.
Journal Biol. Cybern., 21:29–35, 1976.
[17] E. Krotkov, M. Hebert, M. Buffa, F.G. Cozman, and L. Robert. Stereo driving and position estimation for autonomous planetary rovers. In 2nd International Workshop on Robotics in Space,
pages 320–328, Montreal, Quebec, July 1994.
[18] L. S. Shapiro, H. Wang, and J. M. Brady. A matching and tracking strategy for independently-moving, non-rigid objects. In Proceedings of BMVC, 1992.
[19] D. Langer, J. Rosenblatt, and M. Hebert. A reactive system for autonomous navigation in unstructured environments. In Proc. International Conference on Robotics and Automation, San
Diego, 1994.
[20] Q.-T. Luong, R. Deriche, O.D. Faugeras, and T. Papadopoulo. On determining the Fundamental
matrix: analysis of different methods and experimental results. Technical Report RR-1894,
INRIA, 1993.
[21] Q.-T. Luong and T. Viéville. Canonic representations for the geometries of multiple projective
views. Technical Report UCB/CSD-93-772, University of California at Berkeley, Sept 1993.
[22] L. Matthies. Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation. The International Journal of Computer Vision, 1(8), July 1992.
[23] J. L. Mundy and A. Zisserman, editors. Geometric Invariance In Computer Vision. MIT Press,
1992.
[24] H. K. Nishihara. RTVS-3: Real-time binocular stereo and optical flow measurement system. System description manuscript. Technical report, Teleos, Palo Alto, CA, July 1990.
[25] M. Okutomi and T. Kanade. A multiple-baseline stereo. In Proceedings of the Conference on
International Conference on Computer Vision and Pattern Recognition, pages 63–69, Lahaina,
Hawaii, June 1991. IEEE.
[26] W. H. Press, B. P. Flannery, S.A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C.
Cambridge University Press, 1988.
[27] L. Robert and O.D. Faugeras. Relative 3d positioning and 3d convex hull computation from a
weakly calibrated stereo pair. Image and Vision Computing, 13(3):189–197, 1995.
RR n˚2584
54
Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert
[28] A. Shashua. Projective structure from two uncalibrated images: Structure from motion and
recognition. Technical Report A.I. Memo No. 1363, MIT, September 1992.
[29] T. Viéville, C. Zeller, and L. Robert. Recovering motion and structure from a set of planar
patches in an uncalibrated image sequence. In Proceedings of ICPR94, Jerusalem, Israel, Oct
1994.
[30] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 1994. to appear. Also INRIA Research Report No.2273,
May 1994.