Vision Systems: Applications
Edited by
Goro Obinata and Ashish Dutta
Abstracting and non-profit use of the material is permitted with credit to the source. Statements and
opinions expressed in the chapters are those of the individual contributors and not necessarily those of
the editors or publisher. No responsibility is accepted for the accuracy of information contained in the
published articles. The publisher assumes no responsibility or liability for any damage or injury to persons or
property arising out of the use of any materials, instructions, methods or ideas contained inside. After
this work has been published by Advanced Robotic Systems International, authors have the right to
republish it, in whole or in part, in any publication of which they are an author or editor, and to make
other personal use of the work.
2007 I-Tech Education and Publishing
www.ars-journal.com
Additional copies can be obtained from:
[email protected]
First published June 2007
Printed in Croatia
A catalog record for this book is available from the Austrian Library.
Vision Systems: Applications, Edited by Goro Obinata and Ashish Dutta
p. cm.
ISBN 978-3-902613-01-1
1. Vision Systems. 2. Applications. 3. Obinata & Dutta.
Preface
Computer vision is a key technology for developing autonomous systems that navigate and interact with the environment. It also leads us to marvel at the functioning of our own vision system. In this book we have collected the latest applications of vision research from around the world. It covers both conventional research areas, such as mobile robot navigation and map building, and more recent applications, such as micro vision.
The first seven chapters contain the newer applications of vision, such as micro vision, grasping using vision, behavior-based perception, inspection of railways, and humanitarian demining. The later chapters deal with applications of vision in mobile robot navigation, camera calibration, object detection in visual search, map building, etc.
We would like to thank all the authors for submitting their chapters and the anonymous reviewers for their excellent work.
Sincere thanks are also due to the editorial members of Advanced Robotic Systems publications for all their help during the various stages of review, correspondence with authors, and publication.
We hope that you will enjoy reading this book and that it will serve both as a reference and as study material.
Editors
Goro Obinata
Centre for Cooperative Research in Advanced Science and Technology
Nagoya University, Japan
Ashish Dutta
Dept. of Mechanical Science and Engineering
Nagoya University, Japan
Contents
Preface
1. Micro Vision
   Kohtaro Ohba and Kenichi Ohara
2. Active Vision based Regrasp Planning for Capture of a Deforming Object using Genetic Algorithms
   Ashish Dutta, Goro Obinata and Shota Terachi
3. Multi-Focal Visual Servoing Strategies
   Kolja Kühnlenz and Martin Buss
4. Grasping Points Determination Using Visual Features
   Madjid Boudaba, Alicia Casals and Heinz Woern
5. Behavior-Based Perception for Soccer Robots
   Floris Mantz and Pieter Jonker
6. A Real-Time Framework for the Vision Subsystem in Autonomous Mobile Robots
   Paulo Pedreiras, Filipe Teixeira, Nelson Ferreira and Luis Almeida
7. Extraction of Roads From Outdoor Images
   Alejandro Forero Guzman and Carlos Parra
8. ViSyR: a Vision System for Real-Time Infrastructure Inspection
   Francescomaria Marino and Ettore Stella
9. Bearing-Only Vision SLAM with Distinguishable Image Features
   Patric Jensfelt, Danica Kragic and John Folkesson
1
Micro Vision
Kohtaro Ohba and Kenichi Ohara
National Institute of Advanced Industrial Science and Technology (AIST)
Japan
1. Introduction
Observation and measurement systems for manipulating objects in the micro world are becoming necessary in many fields, such as manufacturing, e.g. the Micro Factory (Fig. 1), one of the past Japanese national projects, and medical applications such as micro surgery. Most past research on micro environments has focused only on micro manipulation, not on micro observation and measurement, which are very important for operation. Micro operation involves scale effects; for example, van der Waals forces become larger than gravitational (Newtonian) forces in the micro environment. Furthermore, micro vision is subject to optical scale effects, i.e. the small depth of focus of the microscope, which prevents the operator from perceiving the micro environment intuitively.
For example, if the microscope is focused on an object, the actuator hands cannot be observed in the same view at the same time. On the other hand, if the focus is on the actuator hands, the object cannot be observed. Figure 2 shows a simple 3D construction example: a micro scarecrow, 20 um in height, built from six 4 um glass balls and one glass bar on a micro wafer. Figure 3 shows the two typical microscopic views while putting the second glass ball onto the first one. Figure 3(a) shows the first glass ball in focus, with the gripper blurred at almost the same position because of the different depth, and Figure 3(b) shows the gripper in focus. Therefore, the operator has to change the focal distance of the microscope repeatedly to observe the scene while operating the micro actuator.
Figure 3. Typical Microscopic Images with Micro Manipulation: (a) the first glass ball in focus, (b) the gripper in focus
Despite the importance of micro vision for micro operation, there has been little computer vision research specifically for micro environments. Micro vision systems can be categorized into two areas:
1. micro measurement techniques, which measure the position of micro objects for micro operation, and
2. micro observation techniques, which present a 3D image to the human operator to understand the objects of interest.
These two areas are summarized in the following two sections.
The relation
\frac{1}{X} + \frac{1}{x} = \frac{1}{f}    (1)
is well known as the Gaussian lens law. In this equation, X, x and f depict the object
distance, the image distance and the focal distance of the lens, respectively.
Then, the depth of focus \Delta x is defined as the range of distance of the focal plane which holds the image in focus on the focal plane, as shown in Fig. 4(a).
Infinity: \quad \Delta x = \frac{2 \delta f}{D}    (2)
Finite: \quad \Delta x = \frac{2 \delta f}{D'}    (3)
where D and D' are the diameters of the lens and the iris, respectively. The depth of focus obviously depends on the radius of the circle of confusion \delta, which is determined by the resolution of the camera sensor device.
The depth of a photographic subject \Delta X is defined as the range of distance between the object and the lens which holds the sharpness on the focal plane, as shown in Fig. 4(b):
\Delta X = \frac{2 X f D \delta (X - f)}{f^2 D^2 - \delta^2 (X - f)^2}    (4)
In this equation, the depth of a photographic subject obviously depends on the focal length f and on the distance X between object and lens. Equations (2) and (3) determine the resolution of the object distance according to the depth-of-focus criterion, and equation (4) is utilized in the calibration between the object distance and the image plane distance.
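As a quick numerical illustration of equations (2) and (4), the short Python sketch below evaluates the depth of focus and the depth of a photographic subject; the focal length, aperture, circle of confusion and object distance used here are assumed example values, not parameters of the actual microscope.

```python
# Illustrative evaluation of equations (2) and (4); all parameter values below
# are assumed for demonstration and are not those of the actual system.

f = 0.02        # focal length [m] (assumed)
D = 0.01        # lens aperture diameter [m] (assumed)
delta = 5e-6    # circle of confusion, set by the sensor pixel size [m] (assumed)
X = 0.5         # object distance [m] (assumed)

# equation (2): depth of focus for an object at infinity
depth_of_focus = 2 * delta * f / D

# equation (4): depth of the photographic subject (depth of field)
depth_of_field = (2 * X * f * D * delta * (X - f)) / (f**2 * D**2 - delta**2 * (X - f)**2)

print(f"depth of focus : {depth_of_focus * 1e6:.1f} um")   # 20.0 um
print(f"depth of field : {depth_of_field * 1e3:.2f} mm")   # about 12.0 mm
```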
To determine the optimal focal distance from images, the Image Quality Measure (IQM), which detects the in-focus areas in an image, is defined by the following equation:
\mathrm{IQM} = \frac{1}{D} \sum_{x = x_i}^{x_f} \sum_{y = y_i}^{y_f} \sum_{p = -L_c}^{L_c} \sum_{q = -L_r}^{L_r} \left| I(x, y) - I(x + p, y + q) \right|    (5)
where (x_i, y_i) and (x_f, y_f) delimit the evaluated image region, (-L_c, -L_r) and (L_c, L_r) define the neighbourhood window used for the frequency evaluation and the smoothing, respectively [7], and D is the total number of pixels, which normalizes the image quality measure by the number of pixels in the area.
By varying the focus value, once a peak of the IQM value is detected at a particular pixel position, the optimal in-focus point at each pixel can easily be determined. The corresponding local intensity values and focus values then yield the all-in-focus image and the depth map, respectively.
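A minimal NumPy sketch of equation (5) is given below; the window sizes and the test image are placeholders, image borders are handled by simple wrap-around, and none of the optimisations of the real-time hardware described later are included.

```python
import numpy as np

def iqm(image, Lc=1, Lr=1):
    """Image Quality Measure of eq. (5): sum of absolute intensity differences
    between each pixel and its (2*Lc+1) x (2*Lr+1) neighbourhood, normalized
    by the number of pixels D.  A high value indicates an in-focus region."""
    img = image.astype(np.float64)
    acc = np.zeros_like(img)
    for p in range(-Lc, Lc + 1):
        for q in range(-Lr, Lr + 1):
            # shift by (p, q); np.roll wraps around at the image border
            acc += np.abs(img - np.roll(np.roll(img, p, axis=1), q, axis=0))
    D = img.size
    return acc.sum() / D

# example with a random test image (placeholder for a microscope frame)
frame = np.random.randint(0, 256, (64, 64))
print(iqm(frame, Lc=2, Lr=2))
```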
2.3 Real-time micro VR camera system
To realize the real-time micro VR camera system based on the depth-from-focus criterion above, two major issues have to be solved:
1. how to capture and process high-frame-rate image sequences (vision part), and
2. how to change the focal distance with high frequency and high accuracy (optical part).
Unfortunately, most vision systems are based on the video frame rate of 30 frames/sec. This frame rate is good enough for human vision, but not as a sensor system.
To realize a real-time micro VR camera with the depth-from-focus criterion mentioned before, a high-speed image capture and processing system is required. For example, if eight images are used to obtain each depth map and all-in-focus image at 30 frames/sec, an image sequence of 240 frames/sec must be captured and processed.
Furthermore, a motor control system could be used to change the focal distance of the microscope, but the bandwidth of such a motor system is not sufficient for the real-time micro VR camera system.
In the next sections, we describe prototypes of the real-time micro VR camera system; finally, the specification of the product version of the Micro VR Camera System is given.
2.3.1 First Prototype
At first, we developed the micro VR camera system shown in Fig. 6, with a dynamic focusing lens [8] as shown in Fig. 7 and a smart sensor, an IVP CMOS vision chip (MAPP2200) with a resolution of 256x256 pixels, a column-parallel ADC architecture, and DSP processing.
A sample object in Fig. 8 and its images at four particular focal distances are shown in Fig. 9. The object for demonstration was constructed in a four-step pyramidal shape (first stage 3mm-5mm). In real-usage cases, such as objects smaller than 1 mm, the IQM value can be obtained from the original texture of the object without any artificial texture.
The spatial resolution depends on the optical setting. For this demonstration, the view area is almost 16 mm square with 256 pixels, and the spatial resolution is 62.5 um. The depth resolution is 1.67 mm (21 frames over a 35 mm depth range, with 3 V input steps from -30 V to +30 V to charge the PZT), which depends directly on the number of input frames over the variable-focusing range. The all-in-focus image and the micro VR environment from one image sequence are shown in Figs. 10 and 11, respectively. The all-in-focus image gives a clear view of the whole object. However, the depth resolution without any interpolation in Fig. 11 does not seem sufficient. A simple way to increase the depth resolution is to capture more images at other focal distances, which would, however, also increase the computational cost.
(a) Processing Part
The processing time with the IVP chip is almost 2 sec for one final VR output. This is because the ADC/processing performance of the vision chip MAPP2200 is not good enough for gray-level intensities. MAPP2200 does, however, perform well for binary images, at more than 2000 frames/sec.
(b) Optical Part
Changing the focus with a usual optical configuration is quite difficult to actuate because of its dynamics. We had developed a compact and quick-response dynamic focusing lens, consisting of a PZT bimorph actuator and a glass diaphragm, shown in Fig. 7. This lens can act as a convex or a concave lens depending on the voltage driving the PZT bimorph, and its robustness was evaluated at high frequencies of more than 150 Hz; see [10] for details. We applied this lens in combination with a micro zoom lens.
Figure 11. Sample of the Depth Image for Sample Object with Texture Based on All-in-Focus
Image
2.3.2 Second Prototype
This section describes the second prototype of the micro VR camera system.
(a) Processing Part
Recently, large-scale FPGAs (Field Programmable Gate Arrays) have dramatically improved in performance and are widely used because of their programmability. In the second system, shown in Fig. 12, one FPGA (APEX EP20K600E, ALTERA) and SDRAM on an image-processing test board (iLS-BEV800, INNOTECH Co.) are used to calculate the IQM of equation (5) at every pixel of a 512x480 pixel, 8-bit image at 240 Hz; a TMDS (Transition Minimized Differential Signaling) interface connects the sensor part and the processing part as shown in Fig. 13. The 512x480 image data are captured over two parallel interfaces, and a high-speed transmission of about 60 Mbyte/sec (512x480x240 Hz) from the HSV to the dual-port SDRAM is realized. As a result, the FPGA is fast enough to calculate the IQM value at 240 Hz, and the total execution load is less than 20% of the FPGA capacity.
the sensor output from the controller. Actually, this system has no scaling factor on the
images with different depth, because the image distance is changed with the scanner,
whereas the focal distance was changed in our earlier system with the variable focusing lens.
You can observe two glass fabrics 4 um, each located in micro-3D environments with
[Equations (6) and (7) define the ghost-filtering rule, which classifies each pixel as object or background ("background: otherwise") according to its IQM value.]
where IQM(x,y,f) is the image quality value at image location (x,y) for focal distance f. Figures 19 and 20 show the results obtained with this ghost-filtering technique.
2.3.4 Product System
The Micro VR Camera System has now been commercialized by Photron Co., Ltd. in Japan [9] and is shown in Fig. 21. The resolution is improved to 512x512 pixels for both the depth image and the all-in-focus image. Moreover, the system can measure object depth in steps of about 3 um when a piezo actuator with a 100 um travel range is used.
Figure 22 shows sample results of the all-in-focus and depth images obtained with the latest system. This output is produced as a real-time movie. Even when the operator brings the gripper into view, the all-in-focus image allows the object and the gripper to be observed simultaneously, although they are located at different depths.
Figure 22. Sample View of MEMS device with the product system
3. Micro Observation
In the previous section, micro measurement was described. The micro observation system is introduced in this section.
Generally speaking, because the small depth of focus is such a strong effect, the microscopic view is quite different from macroscopic images. In order to understand micro phenomena and the shape of micro objects, the operator has to change the focal distance of the microscope very often and mentally integrate the information of the individual images. If this integration could be performed automatically, the operator could easily and intuitively understand the phenomena and the shape of micro objects.
This section mainly focuses on the algorithm for obtaining the all-in-focus image in the Micro VR Camera System, and on a 3D voxel image, with an (R,G,B,alpha) parameter for each voxel, based on the Micro VR Camera System.
3.1 All-in-Focus Image
In the previous section, an overview of how to obtain the depth image and the all-in-focus image was given. In this subsection, the detailed algorithm is described as the following steps; a code sketch is given after the list.
1. Acquire a sequence of images while changing the focus distance using the PZT actuator.
2. Calculate the Image Quality Measure (IQM) value (eq. (5)) at each pixel of all acquired images, which serves as an index of in-focus versus out-of-focus.
3. Find the maximum of the IQM value over the different focus distances at each pixel location (x,y).
4. Integrate the in-focus pixel values at the maximum IQM points into an all-in-focus image, and the corresponding focus distances into the depth map.
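The following NumPy sketch, referenced above, strings the four steps together for an image stack acquired at known focus distances. It uses a per-pixel variant of the IQM of equation (5) and is only a plain software illustration under assumed array shapes, not the FPGA-based implementation of the actual system.

```python
import numpy as np

def iqm_map(img, Lc=1, Lr=1):
    # per-pixel sum of absolute differences to the (2Lc+1)x(2Lr+1) neighbourhood
    img = img.astype(np.float64)
    acc = np.zeros_like(img)
    for p in range(-Lc, Lc + 1):
        for q in range(-Lr, Lr + 1):
            acc += np.abs(img - np.roll(np.roll(img, p, axis=1), q, axis=0))
    return acc

def all_in_focus(stack, focus_distances, Lc=1, Lr=1):
    """stack: (N, H, W) gray images acquired at N focus distances (step 1).
    Returns the all-in-focus image and the depth map (steps 2-4)."""
    stack = np.asarray(stack, dtype=np.float64)
    N, H, W = stack.shape
    iqm_stack = np.stack([iqm_map(stack[k], Lc, Lr) for k in range(N)])  # step 2
    best = iqm_stack.argmax(axis=0)                                      # step 3
    rows, cols = np.indices((H, W))
    aif_image = stack[best, rows, cols]                                  # step 4
    depth_map = np.asarray(focus_distances, dtype=np.float64)[best]
    return aif_image, depth_map
```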
This system has mainly been applied to industrial usage as shown in Fig. 23. However, the algorithm has one big drawback in the case of transparent objects, such as crystals and cells in biomedical applications, because there may be several candidate in-focus points on a transparent object, as shown in Fig. 24. In other words, if this system is applied to biomedical use, the depth information may be quite noisy; sometimes the maximum IQM value lies on the top surface of the object, but in other cases the in-focus point lies on the rear surface.
As a matter of fact, in our previous research the image intensity data set is already obtained while moving the focus distance in the micro VR camera system. Therefore, a new volume-rendering-based method, shown in Fig. 25, is proposed in this section; it maps the IQM value to a transparency parameter P through a Look-Up Table (LUT) and visualizes the 3D transparent object, exploiting the fact that the IQM value is an index of the in-focus area.
To realize the proposed method, the image intensity data and the IQM value at each focus distance from the previous micro VR camera system are stored as V[RGB|P] at each voxel (x,y,z) in the volume rendering system VOLUME-PRO 500, and the volume rendering visualization is obtained using the VGL library.
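A minimal sketch of this look-up-table idea is given below, assuming the gray-level image stack and the per-pixel IQM values are already available; the linear opacity mapping and the array layout are illustrative choices, whereas the actual system stores V[RGB|P] on the VOLUME-PRO 500 board and renders it with the VGL library.

```python
import numpy as np

def build_rgba_volume(stack, iqm_stack, alpha_max=1.0):
    """Build an (N, H, W, 4) RGBA voxel volume: the gray intensity is copied to
    the R, G and B channels, and the normalized IQM value is used as the
    transparency parameter P, so in-focus voxels are rendered more opaque."""
    gray = np.asarray(stack, dtype=np.float64) / 255.0
    p = np.asarray(iqm_stack, dtype=np.float64)
    p = p - p.min()
    p = alpha_max * p / (p.max() + 1e-12)        # simple linear look-up table
    return np.stack([gray, gray, gray, p], axis=-1)
```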
To show the validity of the proposed method, a 4 um glass ball containing one bubble is used. Fig. 26 shows the microscopic views of the glass ball at several focus distances, acquired with the AIF system while moving the focus distance.
The center view in Fig. 27 displays the visualization result of the proposed method, and the left side shows the slice view at a particular point. The shape of the glass ball can be intuitively perceived in this figure, and even the bubble can be observed, although no de-convolution technique is applied. The volume rendering tool VGStudio MAX 1.1 is used in Fig. 27. In this viewer, the view angle and the slice position for the slice view can be changed as desired.
4. Conclusion
In the field of micro vision, there has been little research compared with the macro environment. However, by applying the results of macro computer vision techniques, the micro environment can be measured and observed. Moreover, based on the particular effects of the micro environment, it may be possible to discover new theories and new techniques.
5. References
Daisuke Miyazaki, Megumi Saito, Yoichi Sato, and Katsushi Ikeuchi. (2002). Determining
surface orientations of transparent objects based on polarization degrees in visible
and infrared wavelengths. Journal of Optical Society of America A (JOSA A). Vol. 19,
No. 4, pp.687-694.
Megumi Saito, Yoichi Sato, Katsushi Ikeuchi, Hiroshi Kashiwagi. (1999). Measurement of
surface orientations of transparent objects using polarization in highlight.
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'99).
pp. 381-386.
Kazuya Kodama, Kiyoharu Aizawa, and Mitsutoshi Hatori. (1999). Acquisition of an All-Focused Image by the Use of Multiple Differently Focused Images. The Transactions of the Institute of Electronics, Information and Communication Engineers, D-II, Vol. J80-D-II, No. 9, pp. 2298-2307.
Masahiro Watanabe and Shree K. Nayar. (1996). Minimal Operator Set for Passive Depth from Defocus. CVPR'96, pp. 431-438.
Shree K. Nayar, Masahiro Watanabe, and Minoru Noguchi. (1995). Real-Time Focus Range Sensor. ICCV'95, pp. 995-1001.
Shree K. Nayar and Yasuo Nakagawa. (1994). Shape from Focus. IEEE Trans. on PAMI, Vol. 16, No. 8, pp. 824-831.
Sridhar R. Kundur and Daniel Raviv. (1996). Novel Active-Vision-Based Visual-Threat-Cue for Autonomous Navigation Tasks. Proc. CVPR'96, pp. 606-612.
Takashi Kaneko, Takahiro Ohmi, Nobuyuki Ohya, Nobuaki Kawahara, and Tadashi Hattori.
(1997). A New, Compact and Quick-Response Dynamic Focusing Lens.
Transducers'97.
Photron co.ltd. http://www.photron.com/
2
Active Vision based Regrasp Planning for
Capture of a Deforming Object using
Genetic Algorithms
Ashish Dutta, Goro Obinata and Shota Terachi
Nagoya University
Japan
1. Introduction
The ability to efficiently grasp an object is the basic need of any robotic system. This research
aims to develop an active vision based regrasp planning algorithm for grasping a deforming
2D prismatic object using genetic algorithms (GA). The possible applications of the
proposed method are in areas of grasping biological tissues or organs, partially occluded
objects and objects whose boundaries change slowly. Most previous studies on robotic
grasping mainly deal with formulating the necessary conditions for testing grasp points for
static objects (Blake (1995), Chinellato et al. (2003), Gatla et al. (2004), Mirtich & Canny (1994)).
Nguyen (1989) has suggested a strategy for constructing an optimal grasp using finger
stiffness grasp potentials. A detailed review of multifinger grasping of rigid objects is
presented in Bicchi and Kumar (2000). There are few studies on grasping of deformable objects; for example, Hirai et al. (2001) present a control strategy for grasping and manipulation of a deformable object using a vision system, in which the object deforms on application of fingertip forces, the deformation is recorded by a vision system, and the object motion is controlled based on the amount of deformation. Studies relating to searching for and tracking grasping configurations of deforming objects are rare. Deforming objects are
those that deform by themselves without application of external forces. Mishra et al. (2006)
have proposed a method of finding the optimal grasp points for a slowly deforming object
using a population based stochastic search strategy. Using this method it is possible to find
the optimal grasp points satisfying force closure for 2D prismatic deforming objects. This
method minimizes the distance between the intersection of fingertip normals and the object
centre of gravity, and maximizes the area formed by the fingertip contact points. However, their method fails in cases where the fingertip normals do not intersect at a point (as in the case of a square object).
The problem of grasping deforming objects is a very challenging problem as the object
shape changes with deformation. Hence the optimal grasp points have to be continuously
found for each new shape. This process of recalculating the fingertip grasp points due to
object shape change, slide or roll is called regrasping. The best method of determining the
change in shape of an object is by using a vision system. A vision system not only captures
the new shape but can also be used to track a moving object. The main objectives of this
research are to use a vision system to capture the shape of a deforming object, divide the
object boundary into a number of discrete points (pixels) and then find the optimal grasp
points satisfying form closure. As the object changes shape the new shape is continuously
updated by the vision system and the optimal grasp points are found. Once the solution for
the first frame is obtained this solution is used as the initial guess in subsequent cases for
finding the optimal grasp points. This enables faster solutions for later frames recording the
deformation of the object. It is assumed that the object deforms slowly, the contact between
the fingertip and the object is frictionless and the fingers do not cause deformation of the
object. Hence four fingers are required to grasp a prismatic object in 2D. Simulations were
carried out on 200 synthetic shapes that deformed slowly and the optimal grasp points
found. An experiment was conducted in which a deforming object was simulated by a piece
of black cloth that was held from below and deformed. The shape change of the cloth was
captured by a camera and for each shape the optimum gasp points were obtained.
Experimental results prove that the proposed method can be used in real time to find the
optimal grasp points for a deforming object. In section 2 the algorithm used for determining
the optimal grasp points is explained. The procedure for obtaining the regrasp solutions is
discussed in section 3. Simulation results are explained in section 4, while the experimental
setup is given in section 5. The experimental results are shown in section 6 and conclusions
are drawn in section 7.
The relation between the fingertip forces and the external wrench acting on the object is
T = D\lambda , \qquad D = \begin{bmatrix} a_1 & \cdots & a_n \\ p_1 \times a_1 & \cdots & p_n \times a_n \end{bmatrix}    (1)
where T is the vector of external forces and moments (six in total) acting at the centre of the object, a_i is the unit normal directed into the object at the i-th fingertip contact point, p_i is the position vector of the i-th fingertip contact point on the object, and \lambda = [\lambda_1 ... \lambda_n]^T is the vector of fingertip forces (n = total number of fingers). Necessary and sufficient conditions for form closure are (i) rank D = 6 and (ii) equation (1) has a solution with \lambda > 0 (all forces positive). Hence, to obtain form closure we need seven contact points in 3D and four contact points in 2D. In this research we propose a geometrical method for finding form closure grasps based on the concept of the accessibility angle. The freedom angle (\theta) of a two-dimensional object is defined as the angular region along which it can
be translated away from the contact. The concept of the freedom angle is illustrated in Figure 1(a), which shows an object grasped with three contact points; for each contact point we define the direction (range) along which the object can move away from that contact. The three freedom angles are marked in the figure. Figure 1(b) shows that after combining all the freedom angles there is still an angle left (the escape angle) through which the object can escape; hence the object is not in form closure.
Figure 2(a) shows the same object with four contact points and the corresponding freedom angles. In Figure 2(b) it can be seen that the full 360 degrees is covered and hence the object is in form closure. If x_i represents the position vector of the i-th point on the object boundary, then the freedom angle \theta_i at that point is computed as
\theta_i = \angle\{(x_{i+1} - x_i),\; (x_i - x_{i-1})\} , \qquad \alpha = \bigcap_{i=1}^{n} \theta_i
The accessibility angle (\alpha) is the common angle among all the freedom angles and is calculated as shown in Figure 2(b) (Sharma et al. (2006)). An object is in form closure if the accessibility angle is the null set (i.e. the escape angle is zero). This means that there is no way the object can move away (translate or rotate) from the gripper points.
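The sketch below implements a simplified version of this escape-angle test: it samples unit translation directions and checks whether any of them moves the object away from, or slides along, every frictionless contact. It works from contact normals rather than the boundary-point construction used in the chapter and only tests translational escape, not rotations, so it illustrates the idea rather than reproducing the full form closure check.

```python
import numpy as np

def translational_escape_check(inward_normals, n_dirs=720, tol=1e-9):
    """Return (closed, escape_angles): closed is True if no sampled translation
    direction can separate the object from all frictionless point contacts.
    A direction d is a first-order escape direction if d . a_i >= 0 for every
    inward unit normal a_i (it never pushes into a finger)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # (n_dirs, 2)
    A = np.asarray(inward_normals, dtype=np.float64)
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    escape = np.all(dirs @ A.T > -tol, axis=1)
    return not escape.any(), angles[escape]

# three fingers on the bottom, left and top sides of a square: the object can
# still slide out towards +x (an escape wedge remains)
normals3 = [(0.0, 1.0), (1.0, 0.0), (0.0, -1.0)]
print(translational_escape_check(normals3)[0])    # False

# a fourth finger on the right side removes every escape direction
normals4 = normals3 + [(-1.0, 0.0)]
print(translational_escape_check(normals4)[0])    # True
```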
Figure 1. (a) The freedom angles showing the directions in which the object can move with
respect to each individual finger contact, (b) direction in which the object is free to escape
Figure 2. (a) The direction in which the object can move with respect to each finger contact,
(b) the object cannot escape in any direction as there is no escape angle
Hence the method essentially searches for the best form closure grasp by comparing all sets of four grasp points that satisfy the form closure conditions. As the object boundary consists of a very large number of points (pixels) and a good form closure grasp is desired, this search is quite complex. Since the search is over discrete points, an efficient way to solve the problem is to use genetic algorithms.
A GA is used to maximize an objective function subject to constraints. A traditional GA, such as Gordy (1996), performs three operations on a population of genomes: selection, crossover and mutation. The length of the binary string equals the number of discrete points on the object boundary; a bit is 1 if a finger is present at that point and 0 otherwise. The binary string encoding the object boundary is shown in Figure 3.
010000000010000000000100.0000001000000
Figure 3. Binary string (1 means finger present at that location, 0 means no finger present)
Selection is performed to choose better individuals for crossover. In Gordy (1996), selection is performed using the roulette wheel procedure: the fitter an individual, the higher its probability of being selected. In this selection process, the cumulative sum of the fitness of all individuals in the population is calculated and normalized by the total fitness of the population. A random number between 0 and 1 is chosen; if it lies within the span of the normalized cumulative sum of an individual, that individual is selected. An individual can be selected multiple times depending on how fit it is. Once a number of individuals equal to the original population size has been selected into the mating pool, single-point crossover is performed: a split point is randomly generated and the contents of two individuals are swapped about this point. After crossover, mutation is performed with a very low probability: each individual is scanned and a gene is flipped if a random draw falls below the mutation probability. Thus a new population is obtained and the individual fitness values are computed. Finally, elitism is invoked by replacing the worst individual of the new population with the best of the previous population.
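A compact sketch of this selection/crossover/mutation/elitism cycle is given below. It is a generic binary GA written in NumPy, not the Gordy (1996) Matlab routine used in the chapter, and the fitness function is supplied by the caller.

```python
import numpy as np

def run_ga(fitness, n_bits, pop_size=60, p_cross=0.8, p_mut=0.01,
           n_generations=200, seed=0):
    """Generic binary GA with roulette-wheel selection, single-point crossover,
    bit-flip mutation and elitism, as described in the text."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    fit = np.array([fitness(ind) for ind in pop])
    for _ in range(n_generations):
        # roulette-wheel selection on shifted (non-negative) fitness values
        w = fit - fit.min() + 1e-12
        parents = pop[rng.choice(pop_size, size=pop_size, p=w / w.sum())]
        # single-point crossover on consecutive pairs of the mating pool
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = int(rng.integers(1, n_bits))
                parents[i, cut:], parents[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # bit-flip mutation with a low per-gene probability
        flips = rng.random(parents.shape) < p_mut
        parents[flips] ^= 1
        new_fit = np.array([fitness(ind) for ind in parents])
        # elitism: the best previous individual replaces the worst new one
        worst, best = new_fit.argmin(), fit.argmax()
        parents[worst], new_fit[worst] = pop[best], fit[best]
        pop, fit = parents, new_fit
    return pop[fit.argmax()], fit.max()

# usage sketch: maximise the number of ones in a 40-bit string
# best_string, best_value = run_ga(lambda s: s.sum(), n_bits=40)
```

The elitism step guarantees that the best fitness value never decreases from one generation to the next.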
Two conditions must be satisfied to obtain a good grasp: a) the fingertips must be capable of resisting all external forces and moments acting on the object, and b) the fingers should be placed such that the applied moments are minimal. The proposed objective function maximizes the moment that the fingertips can resist by considering different combinations of fingertip positions, four discrete points at a time. The constraint uses the accessibility angle to ensure that all feasible solutions satisfy form closure: if the accessibility angle is the null set, the object is in form closure. If the constraint is not met, a very high penalty is placed on the function value, which eliminates non-feasible solutions. The objective function used is given by
f = \frac{M_{cw} + M_{ccw}}{\left| N_{cw} - N_{ccw} \right| + \varepsilon} + \sum_{i=1}^{4} u_i v_i    (2)
The first term on the right-hand side of equation (2) is the objective function, while the second term represents the constraints. M_cw is the total clockwise moment and M_ccw the total anticlockwise moment applied by the fingers; these terms ensure that the moments are maximized in both directions, which indirectly leads to minimum normal forces at the contacts. N_cw and N_ccw are the numbers of fingers applying clockwise and anticlockwise moments; this ensures that the fingers are placed all around the object and do not become concentrated at one location. A small term \varepsilon = 0.01 is added to ensure that the denominator does not become zero when N_cw and N_ccw are equal. The constraints used are U = [u_i] and V = [v_i], which are given as:
1. u1 = 0 if the total number of contact points is four, else u1 = 1;
2. u2 = 1 if the area formed by the contact points equals zero, else u2 = 0;
3. u3 = 0 if both clockwise and anticlockwise moments exist, else u3 = 1;
4. u4 = 0 if the object is in form closure, else u4 = 1;
v_i = -1x10^20 (i = 1...4); hence, if a constraint is not met, the function takes a very large negative value and that particular solution is rejected. Typical function values for feasible grasp points are approximately 6.5x10^3, so the large negative value of v_i ensures that non-feasible solutions are rejected. In this way feasible solutions move towards the feasible space and non-feasible solutions are eliminated.
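The sketch below mirrors equation (2) for one candidate grasp: it computes the clockwise and anticlockwise moments of unit normal forces about the object centroid, adds the penalty terms u_i v_i, and relies on a caller-supplied form closure predicate (for example the escape-angle test sketched earlier). The function names, the degeneracy test and the moment convention are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

PENALTY = -1e20      # v_i of the chapter: rejects infeasible candidate grasps
EPS = 0.01           # small term keeping the denominator of eq. (2) non-zero

def grasp_fitness(contacts, normals, centroid, in_form_closure):
    """Objective of eq. (2) for a candidate set of fingertip contacts.
    contacts, normals: (n, 2) arrays of contact positions and inward unit
    normals; in_form_closure: predicate deciding constraint u4."""
    contacts = np.asarray(contacts, dtype=np.float64)
    normals = np.asarray(normals, dtype=np.float64)
    r = contacts - np.asarray(centroid, dtype=np.float64)
    # z-component of r x a for a unit force along each inward normal
    m = r[:, 0] * normals[:, 1] - r[:, 1] * normals[:, 0]
    m_cw, m_ccw = -m[m < 0].sum(), m[m > 0].sum()
    n_cw, n_ccw = int((m < 0).sum()), int((m > 0).sum())

    penalty = 0.0
    if len(contacts) != 4:                                   # u1
        penalty += PENALTY
    x, y = contacts[:, 0], contacts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    if np.isclose(area, 0.0):                                # u2 (degenerate set)
        penalty += PENALTY
    if n_cw == 0 or n_ccw == 0:                              # u3
        penalty += PENALTY
    if not in_form_closure(contacts, normals):               # u4
        penalty += PENALTY

    return (m_cw + m_ccw) / (abs(n_cw - n_ccw) + EPS) + penalty
```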
4. Simulation
The proposed regrasp algorithm has been tested on 200 types of synthetic shapes that undergo slow deformation. Simulations were performed on a 1.86 GHz laptop computer with 512 MB RAM. We assume that the objects deform slowly, as the algorithm takes time (seconds) to obtain a solution. An example of slow deformation is a rectangle that slowly expands each side to become an octagon, etc.; a rectangle cannot suddenly become a circle. This assumption is practical, as an expanding object like a balloon does not change shape suddenly. The simulation was implemented in Matlab, in which a closed object was constructed using straight line segments. Each time, a side of the object was expanded by dividing it into two or more segments and expanding them. For real objects the sides can be approximated by straight lines, so this method can be used to approximately simulate deformable objects. A few sample cases of an expanding object are shown in Figure 4. As shown, object (a) deforms to object (b), then (c), etc. by expanding one side at a time (not all intermediate steps are shown). The GA parameters used are:
1. Size of generation: 60
2. Crossover: 0.80
3. Mutation: 0.12
4. Maximum number of iterations: 5000
5. Maximum number of gains before stopping: 1000
The time required to find the regrasp points was measured in two ways for each object. In Case I the time was found independently for each deforming object, with no initial guess supplied to the algorithm. In Case II the time was measured when the previous solution was supplied as the initial guess. In both cases the four optimal finger positions were the same for the same objects, but the time to obtain a solution differed, as shown in Table 1. In Case II the time required for each solution was much lower than in Case I. This can be explained by the fact that as the object deforms, the optimal grasp points are not random but are related to the shape of the object.
Figure 4. (a-f) Optimal grasp points for a slowly deforming object (the fingertip contact
points are indicated by solid circles) x and y axis are in mm
Object No.   Case I Time (secs)   Case II Time (secs)
1            53                   53
2            75                   37
3            61                   23
4            67                   31
5            74                   21
6            73                   24
Table 1. Comparison of time taken for calculating optimal grasp points for Case I and II
5. Experimental details
The experimental system (shown in Fig. 5) consists of a vision camera, a slowly deforming object, a PC with image processing software, and a laptop PC running the GA based algorithm. The deforming object was a piece of black cloth that was held from below and deformed. The images were captured by a black and white CCD camera, model SB-6B, manufactured by Wireless Tsukamoto Co., Japan. The camera captures frames at a rate of 30 fps, and each frame has a resolution of 100x100 pixels. The number of pixels determines the total number of discrete points on the object boundary that are considered by the binary string in the GA algorithm. Increasing the number of pixels per frame increases the resolution of the picture, but it also increases the computation time, as the binary string becomes longer. Using 100x100 pixels per frame gave satisfactory results, and images were captured at intervals of 10 seconds. The sequence of images captured of the deforming object is shown in Figure 6. Thresholding was used to segment each image into foreground and background based on pixel intensities: the input was a grey scale image and the output a binary image representing the segmentation. The boundary of the segmented image was then obtained by edge detection, as shown in Figure 7. After edge detection, the coordinates of all boundary pixels were determined with respect to a reference coordinate frame and passed to the GA based algorithm for calculating the best grasp points.
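A minimal OpenCV sketch of this image-processing step is given below; Otsu thresholding and contour extraction are reasonable stand-ins for the thresholding and edge detection described above, not the exact software used in the experiment, and the file name in the usage comment is hypothetical.

```python
import cv2

def boundary_pixels(gray_frame):
    """Segment the dark cloth from the background and return the (x, y) pixel
    coordinates of the object boundary for the GA-based planner."""
    # thresholding (Otsu) separates foreground and background intensities;
    # the inverted mode makes the dark cloth the white foreground
    _, binary = cv2.threshold(gray_frame, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # the largest external contour approximates the object boundary
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).reshape(-1, 2)
    return boundary

# usage sketch:
#   frame = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
#   pts = boundary_pixels(frame)    # (n_pixels, 2) boundary coordinates
```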
Figure 5. Experimental setup: camera observing the deforming object, a PC with image processing software, and a laptop with the GA algorithm
Figure 6. (a-d) Sequence of images of the deforming object captured by the camera
Figure 7. (a-d) Boundaries of the deforming object obtained by segmentation and edge detection
6. Experimental results
The pixel coordinates of the boundary of the deforming object as obtained by the image
processing software was input to the GA based grasping algorithm. Computations were
performed on a 1.86 GHz laptop computer with 512 RAM. The results of the experiments
are as shown in Figure 8. Each figure corresponds to the frame obtained by the vision
camera in Figure 6. The GA parameters used in the algorithm are same as those used in the
simulations. The optimal grasp points for the first frame were obtained by using a random
initial guess solution in the GA algorithm. Subsequent solutions were obtained by using the
previous solutions as the initial guess. Table 2 shows the time required to get each solution
and it is again seen that the first frame required the most time.
Object No.   Time (secs)
(a)          46
(b)          22
(c)          27
(d)          25
Table 2. Time required to obtain the optimal grasp points for the objects in Figure 8
Figure 8. (a-d) The optimal grasp points for each of the deforming objects (x and y
coordinates in mm)
6.1 Real time application
One of the potential uses of the proposed method is an application in which an autonomous
multifinger robot with a vision camera has to capture a deforming object. In such
applications the time from image capture to obtaining the optimal grasp points has be done
in real time (in a few seconds). As shown earlier, the time required to get the first solution
was the highest as it depended on parameters like, initial guess solution, total number of
iterations and the total iterations before stopping if gains are not exceeded. Hence faster
solutions can be obtained by dynamically tuning these parameters. Figure 9 shows two
solutions for the same object obtained by varying the GA parameters. The final objective
function values indicated that solution (a), with a function value of 6.8x10^3 (iteration 5000, number of gains before stop 200), is better than solution (b), with a function value of 6.1x10^3
(iteration 1000, number of gains before stop 100). The solutions were obtained in 6 seconds
and 2 seconds respectively. Hence it is possible to obtain faster solutions in real time by
dynamically tuning the GA parameters based on required function value or number of
iterations, and also using a faster computer for running the algorithm. It is however not
clear how the function value varies with different shapes and parameter values. In future,
we hope to study how to adjust the GA parameters dynamically to obtain the fastest
solutions in real time.
Figure 9. (a-b) Finger positions for the same object for different function values
7. Conclusion
The main contribution of this research is an effective vision based method, using GA, to compute the optimal grasp points for a deforming 2D prismatic object. The simulation and experimental results show that the algorithm can be applied in practical cases to find the optimal grasp points. In future we hope to integrate the method into a multifinger robotic hand to grasp different types of deforming objects autonomously.
8. References
Bicchi, A. & Kumar, V. (2000). Robot Grasping and Control: A review, Proceedings of the IEEE
International Conference on Robotics and Automation, pp. 348-353, ISSN 1050-4729.
Blake, A. (1995). A symmetric theory of planar grasp, The International Journal of Robotics
Research, vol. 14, no. 5, pp. 425-444, ISSN 0278-3649.
Chinellato, E., Fisher, R.B., Morales, A. & del Pobil, A. P. (2003). Ranking planar grasp
configurations for a three finger hand, Proceedings of the IEEE International
Conference on Robotics and Automation, pp. 1133-1138, ISSN 1050 4729.
Gatla, C., Lumia, R., Wood, J. & Starr, G.(2004). An efficient method to compute three
fingered planar object grasps using active contour models, Proceedings of the
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3674-3679,
ISBN 0-7803-8463-6.
Gordy, M. (1996) A Matlab routine for function maximization using Genetic Algorithm.
Matlab Codes: GA.
Hirai, S., Tsuboi, T. & Wada, T. (2001) Robust grasping manipulation of deformable objects,
Proceedings of the IEEE International Conference on Assembly and Task Planning, pp.
411-416, ISBN 07803-7004.
32
Sharma, P., Saxena, A. & Dutta, A. (2006). Multi agent form closure capture of a generic 2D
polygonal object based on projective path planning, Proceedings of the ASME 2006
International Design Engineering Technical Conferences, pp.1-8, ISBN 07918-3784.
Mishra T., Guha, P., Dutta, A. & Venkatesh K. S. (2006). Efficient continuous re-grasp
planning for moving and deforming planar objects, Proceedings of the IEEE
International Conference on Robotics and Automation, pp. 2472-2477, ISSN 1050-4729.
Mirtich, B. & Canny, J. (1994). Easily computable optimum grasps in 2D and 3D, Proceedings
of the IEEE International Conference on Robotics and Automation, pp. 739-747.
Nguyen, V.D. (1989). Constructing stable force-closure grasps, International Journal of
Robotics Research, vol. 8, no. 1, pp. 26-37, ISSN 0278-3649.
Yoshikawa, T. (1996). Passive and active closures by constraining mechanisms, Proceedings of
the IEEE International Conference on Robotics and Automation, pp. 1477-1484, ISBN
07803-2988.
3
Multi-Focal Visual Servoing Strategies
Kolja Kühnlenz and Martin Buss
Institute of Automatic Control Engineering (LSR), Technische Universität München
Germany
1. Introduction
Multi-focal vision provides two or more vision devices with different fields of view and
measurement accuracies. A main advantage of this concept is a flexible allocation of these
sensor resources accounting for the current situational and task performance requirements.
Particularly, vision devices with large fields of view and low accuracies can be used
together. Thereby, a coarse overview of the scene is provided, e.g. in order to be able to
perceive activities or structures of potential interest in the local surroundings. Selected
smaller regions can be observed with high-accuracy vision devices in order to improve task
performance, e.g. localization accuracy, or examine objects of interest. Potential target
systems and applications cover the whole range of machine vision from visual perception
over active vision and vision-based control to higher-level attention functions.
This chapter is concerned with multi-focal vision on the vision-based feedback control level.
Novel vision-based control concepts for multi-focal active vision systems are presented. Of
particular interest is the performance of multi-focal approaches in contrast to conventional
approaches which is assessed in comparative studies on selected problems.
In vision-based feedback control of the active vision system pose, several options to make
use of the individual vision devices of a multi-focal system exist: a) only one of the vision
devices is used at a time by switching between the vision devices, b) two or more vision
devices are used at the same time, or c) the latter option is combined with individual
switching of one or several of the devices. A major benefit of these strategies is an improvement of the control quality, e.g. tracking performance, compared to conventional
methods. A particular advantage of the switching strategies is the possible avoidance of
singular configurations due to field of view limitations and an instantaneous improvement
of measurement sensitivity which is beneficial near singular configurations of the visual
controller and for increasing distances to observed objects. Another advantage is the
possibility to dynamically switch to a different vision device, e.g. in case of sensor
breakdown or if the one currently active is to be used otherwise.
The chapter is organized as follows: In Section 2 the general configuration, application areas,
data fusion approaches, and measurement performance of multi-focal vision systems are
discussed; the focus of Section 3 are vision-based strategies to control the pose of multi-focal
active vision systems and comparative evaluation studies assessing their performance in
contrast to conventional approaches; conclusions are given in Section 4.
2. Multi-Focal Vision
2.1 General Vision System Structure
A multi-focal vision system comprises several vision devices with different fields of view
and measurement accuracies. The field of view and accuracy of an individual vision device
is mainly determined by the focal-length of the optics in good approximation and by the
size and quantization (pixel sizes) of the sensor-chip. Neglecting the gathered quantity of
light, choosing a finer quantization has approximately the same effect as choosing a larger
focal-length. Therefore, sensor quantization is considered fixed and equal for all vision
devices in this chapter. The projections of an environment point or motion vector on the
image planes of the individual vision devices are scaled differently depending on the
respective focal-lengths. Figure 1 schematically shows a general multi-focal vision system
configuration and the projections of a motion vector.
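This scaling can be illustrated with a simple pinhole model: the image of the same Cartesian motion vector grows linearly with the focal length of the observing device. The focal lengths and the distance in the sketch below are arbitrary example values.

```python
import numpy as np

def project(points_3d, focal_length):
    """Pinhole projection onto the image plane; image coordinates scale
    linearly with the focal length."""
    P = np.asarray(points_3d, dtype=np.float64)
    return focal_length * P[:, :2] / P[:, 2:3]

# the same 1 cm motion at 1 m distance seen by two devices of a multi-focal system
motion = np.array([[0.00, 0.0, 1.0],
                   [0.01, 0.0, 1.0]])
for f in (0.010, 0.040):                        # 10 mm wide-angle, 40 mm telephoto
    du = np.diff(project(motion, f), axis=0)[0, 0]
    print(f"f = {f * 1000:.0f} mm -> image displacement {du * 1e6:.0f} um")
```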
2.2 Systems and Applications
Cameras consisting of a CCD- or CMOS-sensor and lens or mirror optics are the most
common vision devices used in multi-focal vision. Typical embodiments of multi-focal
vision systems are foveated (bi-focal) systems of humanoid robots with two different cameras
combined in each eye which are aligned in parallel, e.g. (Brooks et al., 1999; Ude et al., 2006;
Vijayakumar et al., 2004). Such systems are the most common types of multi-focal systems.
Systems for ground vehicles, e.g. (Apostoloff & Zelinsky, 2002; Maurer et al., 1996) are
another prominent class whereas the works of (Pellkofer & Dickmanns, 2000) covering
situation-dependent coordination of the individual vision devices are probably the most
advanced implementations known. An upcoming area are surveillance systems which
strongly benefit from the combination of large scene overview and selective observation
with high accuracy, e.g. (Bodor et al., 2004; Davis & Chen, 2003; Elder et al., 2004; Jankovic &
Naish, 2005; Horaud et al., 2006).
An embodiment with independent motion control of three vision devices and a total of 6
degrees-of-freedom (DoF) is the camera head of the humanoid robot LOLA developed at our
laboratory which is shown in Figure 2, cf. e.g. (Kühnlenz et al., 2006). It provides a flexible
allocation of these vision devices and, due to directly driven gimbals, very fast camera
saccades outperforming known systems.
Most known methods for active vision control in the field of multi-focal vision are
concerned with decision-based mechanisms to coordinate the view direction of a telephoto
vision device based on evaluations of visual data of a wide-angle device. For a survey on
existing methods cf. (Kühnlenz, 2007).
Nagahashi, 2006). The measurement performance of multi-focal vision systems has recently
been investigated by (Kühnlenz, 2007).
2.4 Measurement Performance of Multi-Focal Vision Systems
The different focal-lengths of the individual vision devices result in different abilities
(sensitivities) to resolve Cartesian information. The combination of several vision devices
with different focal-lengths raises the question on the overall measurement performance of
the total system. Evaluation studies for single- and multi-camera configurations with equal
vision device characteristics have been conducted by (Nelson & Khosla, 1993) assessing the
overall sensitivity of the vision system. Generalizing investigations considering multi-focal
vision system configurations and first comparative studies have recently been conducted in
our laboratory (Kühnlenz, 2007).
Figure 4. Visual servoing scenario with multi-focal active vision system consisting of a wideangle camera (h1) and a telephoto camera (h2); two vision system poses with switch of active
vision device
The proposed sensor switching control strategy is visualized in Figure 5. Assumed is a
physical vision device mapping observed feature points concatenated in vector r to an
image space vector \xi,
\xi = h(r, x(q)) ,    (1)
at some Cartesian sensor pose x relative to the observed feature points which is dependent
on the joint angle configuration q of the active vision device. Consider further a velocity
relationship between image space coordinates \xi and joint space coordinates q,
\dot{\xi} = J(q)\,\dot{q} ,    (2)
and a visual controller computing the joint torques
\tau = J^{+} K_p (\xi_d - \xi) - K_v \dot{q} + g ,    (3)
with positive semi-definite control gain matrices K_p and K_v, a desired feature point configuration \xi_d, joint angles q, gravitational torques g, and joint torques \tau.
torques are fed into the dynamics of the active vision system which can be written
M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q) = \tau ,    (4)
with the inertia matrix M, the matrix C summarizing Coriolis and friction forces, gravitational torques g, joint angles q, and joint torques \tau.
Now consider a set of n vision devices H = {h_1, h_2, ..., h_n} mounted on the same manipulator and the corresponding set of differential kinematics \mathcal{J} = {J_1, J_2, ..., J_n}. An active vision controller is proposed which substitutes the conventional visual controller by a switching controller
\tau = J_s^{+} K_p (\xi_d - \xi) - K_v \dot{q} + g ,    (5)
J_s \in \mathcal{J} = \{J_1, J_2, \ldots, J_n\} ,    (6)
where J_s denotes the differential kinematics of the currently selected vision device.
Figure 5. Block diagram of multi-focal switching visual servoing strategy; vision devices are
switched directly or by conditions on field of view and/or control performance
This switching control strategy has been shown locally asymptotically stable by proving the existence of a common Lyapunov function under the assumption that no parameter perturbations exist (Kühnlenz, 2007). In case of parameter perturbations, e.g. if focal-lengths or control gains are not known exactly, stability can be assured by, e.g., invoking multiple Lyapunov functions and the dwell-time approach (Kühnlenz, 2007).
A major benefit of the proposed control strategy is the possibility to dynamically switch
between several vision devices if the control performance decreases. This is, e.g., the case at
or near singular configurations of the visual controller. Most important cases are the
exceedance of the image plane limits by observed feature points and large distances
between vision device and observed environmental structure. In these cases a vision device
with a larger field of view or a larger focal-length, respectively, can be selected.
The main conditions for switching of vision devices and visual controller may consider requirements on control performance and field of view. A straightforward formulation dynamically selects the vision device with the sensitivity necessary to provide sufficient control performance, e.g. evaluating the pose error variance, in the current situation. As a side-condition, field of view requirements can be considered, e.g. always selecting the vision device that provides sufficient control performance with the maximum field of view. Alternatively, if no measurements of the vision device pose are available, the sensitivity or condition of the visual controller can be evaluated. A discussion of selected switching conditions is given in (Kühnlenz, 2007).
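One possible implementation of such a switching condition is sketched below: among the devices whose predicted pose-error variance stays within a given bound, the one with the largest field of view is selected. The device description and the variance estimator are placeholders and not part of the original implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class VisionDevice:
    name: str
    focal_length: float     # [m]
    field_of_view: float    # [rad]

def select_device(devices: Sequence[VisionDevice],
                  predicted_variance: Callable[[VisionDevice], float],
                  variance_bound: float) -> VisionDevice:
    """Pick the device with the largest field of view among those meeting the
    pose-error variance bound; fall back to the most sensitive device
    (largest focal length) if none qualifies."""
    feasible = [d for d in devices if predicted_variance(d) <= variance_bound]
    if feasible:
        return max(feasible, key=lambda d: d.field_of_view)
    return max(devices, key=lambda d: d.focal_length)

# usage sketch with made-up numbers:
# cams = [VisionDevice("wide", 0.010, 0.9), VisionDevice("tele", 0.040, 0.2)]
# active = select_device(cams, lambda d: 1e-5 * (0.010 / d.focal_length) ** 2, 6.25e-6)
```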
The desired pose trajectory of the evaluation study is
x_d(t) = \left[\, 0 \;\; 0 \;\; z_d(t) \;\; 0 \;\; 0 \;\; \varphi_{z,d}(t) \,\right]^T ,    (7)
with a sinusoidal translation along the optical axes and a uniform rotation around the
optical axes. The corresponding desired feature point vector d is computed using a pinhole
camera model.
Figure 6. Tracking errors epose,i and trajectory xpose,i of visual servoing trajectory following
task; sinusoidal translation along optical (xz-)axis with uniform rotation (x,z) ; focal-lengths
a) 10mm, b) 20mm, c) 40mm
For comparison, the task is performed with each of the vision devices independently and afterwards utilizing the proposed switching strategy. A switching condition is defined with a pose error variance band of \sigma^2 = 6.25x10^-6 m^2 and a side-condition to provide a maximum field of view. Thus, whenever this variance band is exceeded, the next vision device providing the maximum possible field of view is selected.
Figure 7. Corresponding tracking error standard deviation estimates for trajectory following
tasks (Figure 6) with different cameras; three samples estimation window
Figure 8. Results of sensor switching visual servoing strategy with multi-focal vision;
sinusoidal translation along optical (xz-)axis with uniform rotation (x,z); a) tracking errors,
b) tracking error standard deviation estimates, c) current focal-length, d) pose trajectory
Figure 6 shows the resulting tracking errors of the trajectory following task for each of the individual vision devices. In spite of very low control error variances in image space of about 0.01 pixels^2, large pose error variances in Cartesian space can be noted, which vary over the whole operating distance as shown in Figure 7. The distance-dependent sensitivity of the visual controller and quantization effects result in varying pose error variances over the operating range caused by sensor noise. These effects remain a particular problem for wide-range visual servoing, rendering conventional visual servoing strategies unsuitable.
Figure 8 shows the results of the switching control strategy. The standard deviation (Figure
8b) is kept within a small band reaching from about 0.004m to 0.008m. The overall
variability is significantly lower compared to the single-camera tasks (Figure 7). The spikes,
which can be noted in the standard deviation diagram, are caused by the switches due to the
delay of the feedback signal. After a switch the desired feature value changes with the
sensor, but the current value is still taken from the previous sensor. Thus, the control error
at this time instance jumps. This effect can be reduced by mapping the previous value of the
feature vector to the image space of the new sensor or by definition of a narrower variance
band as switching condition.
Figure 9 exemplarily illustrates the progression of the fields of view over time for a uniform
single-camera translation task and the corresponding camera switching task. The field of
view is defined by the visible part of the plane extending the surface of the observed object
in x-direction. The variability achieved with the switching strategy is significantly lower.
The effectiveness of the proposed multi-focal switching strategy has been shown
successfully. The contributions of this novel approach are a guaranteed control performance
by means of a bounded pose error variance, a low variability of the performance over the
whole operating range, and the consideration of situational side-conditions as, e.g., a
maximum field of view.
3.4 Multi-Camera Control Strategy
If two or more vision devices of a multi-focal system are available simultaneously these
devices can be used together in order to control the pose of the vision system. In this section
a multi-focal multi-camera strategy is proposed in order to make use of several available
vision devices with different fields of view and measurement accuracies. Major benefit is an
improved control performance compared to single-camera strategies whereas only a partial
observation of the reference object with high accuracy is necessary.
Figure 9. Progression of the extension of the field of view orthogonal to the optical axis of
the observing vision device; uniform translation along optical (xz-)axis; a) single-camera
tasks, b) sensor switching strategy with multi-focal vision, c) pose trajectory
A vision-based controller computing joint torques from a control error in image space
requires sufficient observed feature points to be mapped to the six Cartesian degrees of
freedom. A minimum of three feature points composed of two elements in image space is
needed in order to render the controller full rank. If the field of view of the observing vision
device is too small to cover all feature points the controller becomes singular. However,
high-sensitivity sensors needed in order to achieve high control performance only provide
small fields of view.
A multi-camera strategy is proposed combining the advantages of vision devices with
different characteristics. High-sensitivity devices are used for improving control
performance and wide-field-of-view devices in order to observe the required number of
remaining feature points to render the controller full rank.
Figure 10. Visual servoing scenario with multi-focal active vision system consisting of a
wide-angle camera (h1) and a telephoto camera (h2); both vision devices are observing
different feature points of a reference object accounting for field of view constraints
The sensor equation (1) extends such that individual feature points are observed with
different vision sensors
\left[\,\xi_1^T \cdots \xi_i^T \;\; \xi_j^T \cdots\,\right]^T = \left[\, h_1\!\left([r_1^T \cdots r_i^T]^T,\, x_1(q)\right)^T \;\; h_2\!\left([r_j^T \cdots]^T,\, x_2(q)\right)^T \right]^T,   (8)
where a Cartesian point rk is mapped to an image point ξl by vision device hm. The proposed visual controller is given by
\tau = \left[\, J_1^T \cdots J_1^T \;\; J_2^T \cdots \,\right]^{T+} K_p (\xi_d - \xi) - K_v \dot{q} + g(q),   (9)
with image feature vector ξ = [ξ1 ... ξi ξj ...]T and differential kinematics Jm corresponding to vision device hm.
Substituting the composition of individual differential kinematics Jm by a generalized differential kinematics J*, the proposed control strategy can be expressed by

\tau = J^{*+} K_p (\xi_d - \xi) - K_v \dot{q} + g(q),   (10)
which has been proven locally asymptotically stable (Kelly et al., 2000).
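As a numerical illustration of how the stacked-Jacobian control law (9)/(10) could be evaluated, the following sketch composes per-feature image Jacobians into a generalized differential kinematics, takes its pseudoinverse and computes the joint torques. The function name, gain matrices and the random Jacobians are placeholders and not taken from the chapter; only the structure of the law is reproduced, assuming NumPy.

```python
import numpy as np

def multi_camera_torque(J_blocks, xi_d, xi, q_dot, Kp, Kv, gravity):
    """Evaluate tau = J*+ Kp (xi_d - xi) - Kv dq + g for a multi-camera visual servoing task.

    J_blocks : list of (2 x n) image Jacobians, one per observed feature point, each already
               expressed for the vision device observing that point.
    xi_d, xi : stacked desired and current image feature vectors (length 2k).
    """
    J_star = np.vstack(J_blocks)        # generalized differential kinematics J*
    J_pinv = np.linalg.pinv(J_star)     # Moore-Penrose pseudoinverse J*+
    error = xi_d - xi                   # image-space control error
    return J_pinv @ (Kp @ error) - Kv @ q_dot + gravity

# Toy example: three feature points (six image coordinates) and a six-joint robot.
rng = np.random.default_rng(0)
J_blocks = [rng.standard_normal((2, 6)) for _ in range(3)]
xi_d, xi = rng.standard_normal(6), rng.standard_normal(6)
tau = multi_camera_torque(J_blocks, xi_d, xi, q_dot=np.zeros(6),
                          Kp=0.5 * np.eye(6), Kv=0.1 * np.eye(6), gravity=np.zeros(6))
print(tau.shape)  # (6,)
```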
Utilizing the proposed multi-camera strategy, an improved control performance is achieved even though only parts of the observed reference structure are visible to the high-sensitivity vision devices. This multi-camera strategy can be combined with the switching
strategy discussed in Section 3.2 allowing switches of the individual vision devices of a
multi-focal vision system. Such a multi-camera switching strategy is discussed in the
following section.
3.5 Multi-Camera Switching Control Strategy
In the previous sections two concepts to make use of the individual vision devices of a
multi-focal vision system have been presented: a sensor switching and a multi-camera
vision-based control strategy. This section proposes the integration of both strategies, thus,
allowing switches of one or more vision devices observing parts of a reference structure
simultaneously. Thereby, the benefits of both strategies are combined.
The sensor equation (8) is extended writing

\left[\,\xi_1^T \cdots \xi_i^T \;\; \xi_j^T \cdots\,\right]^T = \left[\, h_1\!\left([r_1^T \cdots r_i^T]^T,\, x_1(q)\right)^T \;\; h_2\!\left([r_j^T \cdots]^T,\, x_2(q)\right)^T \right]^T,   (11)
allowing the hm of (8) to be selected dynamically from a set H = {h1, h2, ..., hn}. The visual controllers (5) and (10) are integrated writing
\tau = J^{*+} K_p (\xi_d - \xi) - K_v \dot{q} + g(q),   (12)

where J* is composed of the individual differential kinematics Jm,

J^{*} = \left[\, J_1^T \cdots J_1^T \;\; J_2^T \cdots \,\right]^T.   (13)
A lower standard deviation is achieved by the multi-camera task (second camera with telephoto characteristics) compared to the wide-angle task. The multi-camera switching task additionally achieves a lower variability of the standard deviation of the tracking error.
Figure 11. Tracking error of multi-focal two-camera visual servoing task with wide-angle and switchable wide-angle/telephoto camera; desired trajectory x_{z,d}(t) = -0.2 m/s · t - 1 m
Figure 12. Standard deviation estimates of tracking error of unswitched single-camera task
(wide-angle), of unswitched multi-focal multi-camera task with one feature point observed
by additional telephoto camera, and of switched multi-focal multi-camera task with
additional camera switching from wide-angle to telephoto characteristics at t=2.6s
Figure 13. Sensitivities of the visual servoing controller along the optical axis of the central
wide-angle camera corresponding to the tasks in Figure 12
Figure 13 shows the sensitivity (szvz) of the visual controller for all three tasks along the
optical axis of the wide-angle camera. It can be noted that the multi-camera strategies result
in a better sensitivity of the controller compared to the wide-angle task.
In summary, the simulations clearly show the benefits of the proposed multi-camera control
strategies for multi-focal vision systems: an exploitation of the field of view and sensitivity
characteristics in order to achieve improved control performance and a lower variability of
the performance by switching of individual vision devices.
4. Conclusion
In this chapter novel visual servoing strategies have been proposed based on multi-focal
active vision systems able to overcome common drawbacks of conventional approaches: a
tradeoff between field of view and sensitivity of vision devices and a large variability of the
control performance due to distance dependency and singular configurations of the visual
controller. Several control approaches to exploit the benefits of multi-focal vision have been
proposed and evaluated in simulations: Serial switching between vision devices with
different characteristics based on performance- and field-of-view-dependent switching
conditions, usage of several of these vision devices at the same time observing different
parts of a reference structure, and individual switching of one or more of these
simultaneously used sensors. Stability has been discussed utilizing common and multiple
Lyapunov functions.
It has been shown that each of the proposed strategies significantly improves the visual
servoing performance by reduction of the pose error variance. Depending on the
application scenario several guidelines for using multi-focal vision can be given. If only one
vision sensor at a time is selectable then a dynamical sensor selection satisfying desired
performance constraints and side-conditions is proposed. If several vision sensors can be
used simultaneously, selected features of a reference object can be observed with high-sensitivity sensors while a large field-of-view sensor ensures observation of a sufficient
number of features in order to render the visual controller full rank. The high-sensitivity
sensors should preferably be focused on those feature points resulting in the highest
sensitivity of the controller.
5. Acknowledgments
The authors would like to gratefully thank Dr. Nicholas Gans and Prof. Seth Hutchinson for
inspiring discussions and reference simulation code for performance comparison. This work
has been supported in part by the German Research Foundation (DFG) grant BU-1043/5-1
and the DFG excellence initiative research cluster Cognition for Technical Systems - CoTeSys,
see also www.cotesys.org.
6. References
Apostoloff, N. & Zelinsky, A. (2002). Vision in and out of vehicles: Integrated driver and
road scene monitoring, Proceedings of the 8th International Symposium on Experimental
Robotics (ISER), 2002, Sant'Angelo d'Ischia, Italy
Bodor, R.; Morlok, R. & Papanikolopoulos, N. (2004). Dual-camera system for multi-level
activity recognition, Proceedings of the 2004 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2004, Sendai, Japan
Brooks, R. A.; Breazeal, C.; Marjanovic, M.; Scasselati, B. & Williamson, M. M. (1999). The
Cog Project: Building a Humanoid Robot, In: Computation for Metaphors, Analogy,
and Agents, C. Nehaniv, (Ed.), Springer, Germany
Chaumette, F.; Hashimoto, K.; Malis, E. & Martinet, P. (2004). TTP4: Tutorial on Advanced
Visual Servoing, Tutorial Notes, IEEE/RSJ IROS, 2004
Corke, P. I. (1994). Visual Control of Robot Manipulators - A Review, In: Visual Servoing, K.
Hashimoto, (Ed.), World Scientific, 1994
Cowan, N. (2002). Binocular visual servoing with a limited field of view, In: Mathematical
Theory of Networks and Systems, Notre Dame, IN, USA, 2002
Dickmanns, E. D. (2003). An advanced vision system for ground vehicles, Proceedings of the
International Workshop on In-Vehicle Cognitive Computer Vision Systems (IVC2VS),
2003, Graz, Austria
Dorai, C.; Wang, G.; Jain, A. K. & Mercer, C. (1998). Registration and Integration of Multiple
Object Views for 3D Model Construction, In: IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 20, No. 1, 1998
Elder, J. H.; Dornaika, F.; Hou, B. & Goldstein, R. (2004). Attentive wide-field sensing for
visual telepresence and surveillance, In: Neurobiology of Attention, L. Itti, G. Rees & J.
Tsotsos, (Eds.), 2004, Academic Press, Elsevier
Flandin, G.; Chaumette, F. & Marchand, E. (2000). Eye-in-hand/eye-to-hand cooperation for
visual servoing, Proceedings of the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), 2003
Hager, G. D. (1995). Calibration-free visual control using projective invariance, Proceedings of
the 5th International Conference on Computer Vision (ICCV), 1995
Hall, D. L. & Llinas, J. (2001). Handbook of Multisensor Data Fusion, CRC Press, 2001, Boca
Raton, FL, USA
Hartley, R. I. & Zisserman, A. (2004). Multiple View Geometry in Computer Vision, Cambridge
University Press, 2004, NY, USA
Hayman, E. (2000). The use of zoom within active vision, Ph.D. Thesis, University of Oxford,
2000, Oxford, UK
Hollinghurst, N. & Cipolla, R. (1994). Uncalibrated stereo hand-eye coordination, In: Image
and Vision Computing, Vol.12, No. 3, 1994
Horaud, R.; Knossow, D. & Michaelis, M. (2006). Camera cooperation for achieving visual
attention, In: Machine Vision and Applications, Vol. 15, No. 6, 2006, pp. 331-342
Hosoda, K.; Moriyama, H. & Asada, M. (1995). Visual servoing utilizing zoom mechanism,
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA),
1995
Hutchinson, S.; Hager, G. D. & Corke, P. I. (1996). A tutorial on visual servo control, In: IEEE
Transaction on Robotics and Automation, Vol. 12, No. 5, 1996
Jankovic, N. D. & Naish, M. D. (2005). Developing a modular spherical vision system,
Proceedings of the 2005 IEEE International Conference on Robotics and Automation
(ICRA), pp. 1246-1251, 2005, Barcelona, Spain
Kelly, R.; Carelli, R.; Nasisi, O.; Kuchen, B. & Reyes, F. (2000). Stable visual servoing of camera-in-hand robotic systems, In: IEEE Transactions on Mechatronics, Vol. 5, No. 1, 2000
Kragic, D. & Christensen, H. I. (2002). Survey on Visual Servoing for Manipulation, Technical
Report, Stockholms Universitet, ISRN KTH/NA/P-02/01-SE, CVAP259, 2002
Kühnlenz, K. & Buss, M. (2005). Towards multi-focal visual servoing, Proceedings of the
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2005
Kühnlenz, K. & Buss, M. (2006). A multi-camera view stabilization strategy, Proceedings of the
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006
Kühnlenz, K. (2007). Aspects of multi-focal vision, Ph.D. Thesis, Institute of Automatic Control
Engineering, Technische Universität München, 2007, Munich, Germany
Kühnlenz, K.; Bachmayer, M. & Buss, M. (2006). A multi-focal high-performance vision
system, Proceedings of the 2006 IEEE International Conference on Robotics and
Automation (ICRA), 2006, Orlando, FL, USA
Lipiello, V.; Siciliano, B. & Villani, L. (2005). Eye-in-hand/eye-to-hand multi-camera visual
servoing, Proceedings of the IEEE International Conference on Decision and Control
(CDC), 2005
Malis, E. (2001). Visual servoing invariant to changes in camera intrinsic parameters,
Proceedings of the 8th International Conference on Computer Vision (ICCV), 2001
Malis, E.; Chaumette, F. & Boudet, S. (2000). Multi-cameras visual servoing, Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA), 2000
Maurer, M.; Behringer, R.; Fürst, S.; Thomanek, F. & Dickmanns, E. D. (1996). A compact
vision system for road vehicle guidance, Proceedings of the 13th International
Conference on Pattern Recognition (ICPR), 1996
Morooka, K. & Nagahashi H. (2006). A Method for Integrating Range Images in Different
Resolutions for 3-D Model Construction, Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA), 2006
Nelson, B. & Khosla, P. (1993). The resolvability ellipsoid for visually guided manipulation,
Technical Report, CMU-RI-TR-93-28, The Robotics Institute, Carnegie Mellon
University, 1993, Pittsburgh, PA, USA
Nelson, B. & Khosla, P. (1995). An extendable framework for expectation-based visual
servoing using environment models, Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA), 1995
Pellkofer, M. & Dickmanns, E. D. (2000). EMS-Vision: Gaze control in Autonomous vehicles,
Proceedings of the IEEE Intelligent Vehicles Symposium, 2000, Dearborn, MI, USA
Sagawa, R.; Nishino, K. & Ikeuchi, K. (2001). Robust and Adaptive Integration of Multiple
Range Images with Photometric Attributes, Proceedings of the IEEE International
Conference on Computer Vision and Pattern Recognition (CVPR), 2001
Soucy, M. & Laurendeau, D. (1992). Multi-Resolution Surface Modelling from Multiple
Range Views, Proceedings of the IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), 1992
Ude, A.; Gaskett, C. & Cheng, G. (2006). Foveated Vision Systems with Two Cameras Per
Eye, Proceedings of the 2006 IEEE International Conference on Robotics and Automation
(ICRA), 2006, Orlando, FL, USA
Vijayakumar, S.; Inoue, M. & Souza, A. D. (2004). Maveric Oculomotor experimental vision
head, http://homepages.inf.ed.ac.uk/svijayak/projects/maveric/index.html, 2004
4. Grasping Points Determination Using Visual Features
Madjid Boudaba1, Alicia Casals2 and Heinz Woern3
1 Design Center, TES Electronic Solutions GmbH, Stuttgart
2 GRINS: Research Group on Intelligent Robots and Systems, Technical University of Catalonia, Barcelona
3 Institute of Process Control and Robotics (IPR), University of Karlsruhe
1,3 Germany, 2 Spain
1. Introduction
This paper discusses some issues in generating points of contact using visual features. To address these issues, the paper is divided into two sections: visual features extraction and grasp planning. In order to provide a suitable description of the object contour, a method for grouping visual features is proposed. A very important aspect of this method is the way knowledge about grasping regions is represented in the extraction process, which is also used as a filtering process to exclude all undesirable grasping points (unstable points) and all line segments that do not fit the fingertip position. Fingertips are modelled as point contacts with friction using the theory of polyhedral convex cones. Our approach uses three-finger contact for grasping planar objects. Each set of three candidate grasping points is formulated as linear constraints and solved using linear programming solvers. Finally, we briefly describe some experiments on a humanoid robot with a stereo camera head and an anthropomorphic robot hand within the Centre of Excellence on Humanoid Robots: Learning and Co-operating Systems at the University of Karlsruhe and Forschungszentrum Karlsruhe.
2. Related work
Grasping by multi-fingered robot hands has been an active research area in recent years. Several important studies including grasp planning, manipulation and stability analysis have been carried out. Most of these works assume that the geometry of the object to be grasped is known, that the fingertip touches the object in a point contact without rolling, and that the positions of the contact points are estimated based on the geometrical constraints of the grasping system. These assumptions reduce the complexity of the mathematical model of the grasp (see [Park and Starr, 1992], [Ferrari and Canny, 1992], [Ponce and Faverjon, 1995], [Bicchi and Kumar, 2000], [J. W. Li and Liu, 2003]). Little work, however, has been done on integrating vision sensors for grasping and manipulation tasks. To place our approach in perspective, we review existing methods for sensor-based grasp planning. The existing literature can be broadly classified into two categories: vision-based and tactile-based. For both categories, the extracted image features of concern vary from geometric primitives such as edges, lines, vertices, and circles to optical flow estimates. The first category uses visual features to estimate the robot's motion with respect to the object pose [Maekawa et al., 1995], [Smith and Papanikolopoulos, 1996], [Allen et al., 1999]. Once the robot hand is aligned with the object, it only needs to know where the fingers are placed on the object. The second category uses tactile features to estimate the touch sensing area that is in contact with the object [Berger and Khosla, 1991], [Chen et al., 1995], [Lee and Nicholls, 1999]. A practical drawback is that the grasp execution is hardly reactive to sensing errors such as finger positioning errors. A vision sensor, meanwhile, is unable to handle occlusions. Since an object is grasped according to its CAD model [Koller et al., 1993], [Wunsch et al., 1997], [Sanz et al., 1998], [N. Giordana and Spindler, 2000], [Kragic et al., 2001], an image also contains redundant information that could become a source of errors and inefficiency in the processing.
This paper is an extension of our previous works [Boudaba and Casals, 2005], [Boudaba et al., 2005], and [Boudaba and Casals, 2006] on grasp planning using visual features. In this work, we demonstrate its utility in the context of grasp (or finger) positioning. Consider the problem of selecting and executing a grasp. In most tasks, one can expect various uncertainties. To grasp an object implies building a relationship between the robot hand and the object model. The latter is often unavailable or poorly known, so selecting a grasp position from such a model can be imprecise or impracticable in real-time applications. In our approach, we avoid using any object model and instead work directly from image features. In order to avoid finger positioning errors, a set of grasping regions is defined that represents the features of the grasping contact points. This not only avoids detection/localization errors but also saves computations that could affect the reliability of the system. Our approach can play the critical role of forcing the fingers to desired positions before the task of grasping is executed.
The proposed work can be highlighted in two major phases:
1. Visual information phase: In this phase, a set of visual features such as object size, center of mass, main axis for orientation, and object boundary is extracted. For the purpose of grasping region determination, the extraction of straight segments is of concern, using basic results from contour-based shape representation techniques. We will focus on the class of techniques that attempt to represent the object's contour as a model graph, which preserves the topological relationships between features.
2. Grasp planning phase: The grasping points are generated in the planning task taking as input the visual features extracted in the first phase. So a relationship between visual features and grasp planning is proposed. Then a set of geometrical functions is analysed to find a feasible solution for grasping. The result of grasp planning is a database containing a list of:
Best grasps: a criterion for measuring grasp quality is used to evaluate the best grasps from a list of valid grasps.
Rejected grasps: those grasps that do not fulfill the grasp condition.
The remainder of this chapter is organized as follows: Section 3 gives some background on grasping. The friction cone modeling and the condition of force-closure grasps are discussed. In Section 4, a vision system framework is presented. The vision system is divided into two parts: the first part concerning 2D grasping and the second part
3. Grasp Background
Our discussion is based on [Hirai, 2002]. Given a grasp, which is characterized by a set of contact points and the associated contact models, determine whether the grasp is force-closure. For point contact, a commonly used model is point contact with friction (PCWF). In this model, fingers can exert any force pointing into the friction cone at the edge of contact (we use edge contact instead of point contact; it can be described as the convex sum of proper point contacts). To fully analyze the grasp feasibility, we need to examine the full space of forces acting on the object. Forming the convex hull of this space is difficult due to the nonlinear friction cone constraints imposed by the contact models. In this section, we only focus on precision grasps, where only the fingertips are in contact with the object. After discussing the friction cone modeling, a formalism is used for analysing force-closure grasps using the theory of polyhedral convex cones.
3.1 Modeling the Point of Contact
A point of contact with friction (sometimes referred to as a hard finger) imposes nonlinear constraints on the forces inside its friction cone. For the analysis of the contact forces in planar grasps, we simplify the problem by modeling the friction cones as convex polytopes using the theory of polyhedral convex cones attributed to [Goldman and Tucker, 1956]. In order to construct the convex polytope from the primitive contact forces, the following theorem states that a polyhedral convex cone (PCC) can be generated by a set of basic directional vectors.
Figure 1. Point Contact Modelling
Theorem 1. A convex cone is polyhedral if and only if it is finitely generated, that is, the cone is generated by a finite number of vectors u1, ..., um:

C = \{\, \alpha_1 u_1 + \alpha_2 u_2 + \cdots + \alpha_m u_m \,\},   (1)
where the coefficients αi are all non-negative. Since the vectors u1 through um span the cone, we write (1) simply as C = span{u1, u2, ..., um}. The cone spanned by a set of vectors is the set of all non-negative linear combinations of its vectors. A proof of this theorem can be found in [Goldman and Tucker, 1956].
Given a polyhedral convex set C, let vert(P) = {u1, u2, ..., um} stand for the vertices of a polytope P, while face(P) = {F1, ..., FM} denotes its faces. In the plane, a cone has the appearance shown in Figure 1(b). This means that we can reduce the number of cone sides, m = 6, to one face, Ci. Let us denote by P the convex polytope of a modelled cone, and by {u1, u2, u3} its three vertices. We can define such a polytope as

P = \left\{\, x \in \mathbb{R}^n \;:\; x = \sum_{i=1}^{u_p} \lambda_i u_i,\; \lambda_i \ge 0,\; \sum_{i=1}^{u_p} \lambda_i = 1 \,\right\},   (2)

where ui denotes the i-th vertex of P, and up is the total number of vertices; n = 2 in the case of a 2D plane.
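To make the polyhedral cone modelling above concrete, the sketch below constructs the two edge vectors spanning a planar friction cone and checks whether a given contact force is a non-negative combination of them, in the sense of Theorem 1. The half-angle parameterization via atan(mu) and all function names are illustrative assumptions, not definitions from the chapter.

```python
import numpy as np

def friction_cone_edges(contact_normal, mu):
    """Unit edge vectors u1, u2 spanning a planar friction cone of half-angle atan(mu)."""
    n = np.asarray(contact_normal, dtype=float)
    n /= np.linalg.norm(n)
    beta = np.arctan(mu)
    c, s = np.cos(beta), np.sin(beta)
    rot_plus = np.array([[c, -s], [s, c]])     # rotation by +beta
    rot_minus = np.array([[c, s], [-s, c]])    # rotation by -beta
    return rot_plus @ n, rot_minus @ n

def in_cone(force, u1, u2, tol=1e-9):
    """True if force = a1*u1 + a2*u2 with a1, a2 >= 0 (cf. Theorem 1)."""
    a = np.linalg.solve(np.column_stack([u1, u2]), np.asarray(force, float))
    return bool(np.all(a >= -tol))

# Friction cone around the inward normal (0, 1) with an 8.5 degree half-angle.
u1, u2 = friction_cone_edges([0.0, 1.0], mu=np.tan(np.radians(8.5)))
print(in_cone([0.05, 1.0], u1, u2))   # True: nearly along the normal
print(in_cone([1.0, 0.2], u1, u2))    # False: outside the cone
```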
3.2 Force-Closure Grasps
The force-closure of a grasp is evaluated by analysing its convex cone. For a set of friction cone intersections, the full space can be defined by

\mathcal{C} = \bigcap_{i=1}^{k} C_i,   (3)

where k is the number of grasping contacts. Note that the result of (3) is a set of polytope intersections and produces either an empty set or a bounded convex polytope. Therefore, the solution of (3) can be expressed in terms of its extreme vertices (4).
Figure 2. Feasible solution of a three-fingered grasp
suggested: first, if the location of a fingertip is not a solution to the grasp, it is possible to move it along its grasping region. Such a displacement is defined by ui = ui0 + λi ti, where λi is constrained by 0 ≤ λi ≤ li and ui is a pointed vertex of Ci. Second, we define a ray passing through the pointed vertex ui by a function vci. The vector ci = [cix, ciy] ∈ R2 varies from the lower to the upper side of the spanned cone Ci. This allows us to check whether the feasible solution remains for all vci in the cone spanned by u2 and u3 (see Figure 1(b)).
Testing the force-closure of a grasp now becomes the problem of finding the solutions to (4). In other words, finding the parameters of (3) such that (4) is a bounded convex polytope.
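As a hedged illustration of such a feasibility test, the sketch below uses a standard wrench-space criterion instead of the chapter's exact polytope-intersection formulation of (3)-(4): a planar grasp with friction is force-closure when the origin lies strictly inside the convex hull of the primitive contact wrenches of all friction-cone edges. It assumes SciPy is available, and the contact layout in the example is invented.

```python
import numpy as np
from scipy.spatial import ConvexHull

def contact_wrenches(p, n, mu):
    """Primitive wrenches [fx, fy, tau] of the two friction-cone edges at a planar contact."""
    n = np.asarray(n, float)
    n = n / np.linalg.norm(n)
    beta = np.arctan(mu)
    wrenches = []
    for sign in (+1.0, -1.0):
        c, s = np.cos(beta), sign * np.sin(beta)
        f = np.array([c * n[0] - s * n[1], s * n[0] + c * n[1]])  # normal rotated by +/-beta
        tau = p[0] * f[1] - p[1] * f[0]                           # planar moment r x f
        wrenches.append(np.array([f[0], f[1], tau]))
    return wrenches

def is_force_closure(contacts, mu, eps=1e-6):
    """Force closure <=> origin strictly inside the convex hull of all primitive wrenches."""
    W = np.array([w for p, n in contacts for w in contact_wrenches(np.asarray(p, float), n, mu)])
    hull = ConvexHull(W)
    # hull.equations rows are [a, b] with a.x + b <= 0 inside the hull; at the origin this is b.
    return bool(np.all(hull.equations[:, -1] < -eps))

# Three fingertip contacts on a unit square: (position, inward normal).
contacts = [((-1.0, 0.0), (1.0, 0.0)),
            ((1.0, 0.0), (-1.0, 0.0)),
            ((0.0, -1.0), (0.0, 1.0))]
print(is_force_closure(contacts, mu=0.3))                     # True
print(is_force_closure([contacts[0], contacts[2]], mu=0.05))  # False
```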
4. System Description
We are currently developing a robotic system that can operate autonomously in an unknown environment. In this case, the main objective is the capability of the system to (1) locate and measure objects, (2) plan its own actions, and (3) execute grasps in a self-adaptable way. The architecture of the whole system is organized into several modules, which are embedded in a distributed object communication framework. There are mainly three modules that are of concern in this development: the extraction of visual information and its interpretation, grasp planning using the robot hand, and the control and execution of grasps.
is to determine all vertices/segments belonging to the object's boundary, which we will need later for the determination of the grasping point positions.
Figure 4. Object shape representation: (a) images from original industrial objects; (b) extraction of grasping regions
Extraction of Grasping Regions
Grasping regions are determined by grouping consecutive edge points from a binary edge image. This is usually a preliminary step before grasping takes place, and may not be as time critical as the task of grasping point determination. We deal with (5), whose result forms an ordered list of connected boundary vertices. We then need to store the parameters of these primitives instead of discrete points (or vertices), i.e. to fit a line segment to a set of vertex points that lie along a line. The aim of this step is to determine all salient segments that preserve the shape of the object contour. Figure 4(b) shows grasp regions on the object's contour. Afterwards, each grasping region is extracted as a straight segment. The size of a grasping region should be long enough for positioning the robot fingers. Curve fitting (as shown in Figure 5(a)) describes the process of finding a minimum set of curve segments to approximate the object's contour by a set of line segments with minimum distortion. Once the line segments have been approximated, the merging method (as shown in Figure 5(b)) is used to merge two line segments that satisfy the merging threshold.
The final result of the algorithm is a list of consecutive line segments with a specified tolerance which preserves the object's contour. Briefly, the merging method (1) uses the first two vertex points to define a line segment, (2) adds a new vertex if it does not deviate too much from the current line segment, (3) updates the parameters of the line segment using a least-squares measure, and (4) starts a new line segment when edge points deviate too much from the current line segment. We define such a list by

s_{list} = \{ s_1, s_2, \ldots, s_m \},   (6)

where a segment si is defined by its ending vertices νi = (xi, yi) and νi+1 = (xi+1, yi+1), which represent the location of the segment in the plane, and m is the number of segments contained in the list slist.
To assure the robustness of contact placement, we make some assumptions in (6). First, we avoid undesirable contacts at convex vertices and convex corners (see the position of fingers f1, f2 in Figure 6(a)), which are generally not robust to small uncertainties during the grasping phase. We also avoid concave vertices whose size of concavity is smaller than the size of the fingertip, using the reachability conditions (see the position of finger f3 in Figure 6(a)).
Second, we model a fingertip as a sphere with radius fr (see Figure 6). The grasping regions must be large enough for positioning the fingertip on them. Hence, a preprocessing (or prefiltering) step is applied to (6) to discard those segments with length less than the diameter of the sphere.
Based on both assumptions, we define a small margin ε at the endpoints of each segment, as shown in Figure 6, with ε = fr. If a segment si contains all possible contact points from νi to νi+1, then any grasping point must satisfy (7). Using the grasp criteria of (7), including the condition that the size of the grasp region must be large enough to place a finger on it, gi ≥ 2fr (see Figure 6), equation (6) becomes (8), where glist is a linked list ordered in counterclockwise direction (see Figure 6(b)) and updated from the condition of (7).
Equation (8) is the result of a filtering test which excludes all grasping candidates that do not belong to the grasping regions, thereby reducing the time consumed during grasp point generation.
Let (9) be a function defining the parameters of a grasping region gi, where pi = (xi, yi) and pi+1 = (xi+1, yi+1) represent its location in the plane, lgi its length, gci its center (midpoint), and di the perpendicular distance from gci to the object's center of mass, com. The relationship between the center of mass and a grasping region is given in Figure 7(b). The sign of the area defines the orientation of the grasping region. The elements of (9) verify the following equations. The last equation, gi, is a linear equality constraint of a given grasping region in the plane (see Figure 7). An additional criterion should be added to avoid two or three fingers being placed on the same contact point. In this paper we only assign one finger to each grasping region.
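The sketch below computes the parameters just listed for one grasping region: its length lgi, midpoint gci, the perpendicular distance di from the centre of mass, and the sign of the area used for the orientation. The returned dictionary is an illustrative layout, not the chapter's data structure.

```python
import numpy as np

def region_parameters(p_i, p_next, com):
    """Parameters of a grasping region gi given its endpoints and the object's centre of mass."""
    p_i, p_next, com = (np.asarray(v, float) for v in (p_i, p_next, com))
    direction = p_next - p_i
    length = np.linalg.norm(direction)        # l_gi
    midpoint = 0.5 * (p_i + p_next)           # g_ci
    # Signed area of the triangle (p_i, p_next, com): its sign gives the orientation of the
    # region, and twice its magnitude divided by the length is the perpendicular distance d_i.
    signed_area = 0.5 * (direction[0] * (com[1] - p_i[1]) - direction[1] * (com[0] - p_i[0]))
    distance = 2.0 * abs(signed_area) / length
    return {"l_gi": length, "g_ci": midpoint, "d_i": distance,
            "orientation": int(np.sign(signed_area))}

# Example: a horizontal region with the centre of mass 3 units above it.
print(region_parameters((2.0, 0.0), (8.0, 0.0), com=(5.0, 3.0)))
```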
Figure 7. Projection of grasping region gi in the image plane (c, r)
(11)
where Gs, Gb, and Gr are the selected, best, and rejected grasps, respectively, and is, ib, and ir are the numbers of selected, best, and rejected grasps, respectively.
For a three-finger grasp, the selected grasps Gs are given in the following form:
A similar form can be given for representing the best grasps Gb and the rejected grasps Gr.
5.2 The Algorithm
The algorithm is divided into three parts: the visual features part, which is regrouped in (5) and (8); the grasp planning part, which is defined by (10) and (11); and the testing part, which corresponds to (4). In the visual features part, the compact representation of the object's contour is obtained, which includes the grasping regions and local parameters, by using a standard image processing library. In the grasp planning part, a relationship between visual features and the location of the contact points is obtained for selecting a valid grasp. In the testing part, the force-closure condition is based on determining the feasible solution of a grasp. We first model the friction cones as convex polytopes. Then, we solve the problem of (3) and (4) for a given location of the grasp contacts using linear programming solvers, as well as for computing the polytope convex cones, the extreme vertices of the polytopes, and projections. One of the advantages of the proposed algorithm is that it does not require a geometrical model of the object and can grasp unknown objects.
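A high-level sketch of how these three parts could be chained is given below. Every callable (contact-point selection, force-closure test, quality measure) is a placeholder standing in for the corresponding step of the text, so the example only shows the control flow, not the actual tests.

```python
from itertools import combinations

def plan_grasps(grasp_regions, contact_point, force_closure, quality):
    """Enumerate three-finger candidates (one finger per grasping region), test force-closure,
    and split the outcome into selected, best and rejected grasps in the spirit of (11)."""
    G_s, G_r = [], []
    for regions in combinations(grasp_regions, 3):        # one finger per region
        contacts = tuple(contact_point(r) for r in regions)
        (G_s if force_closure(contacts) else G_r).append(contacts)
    G_b = sorted(G_s, key=quality, reverse=True)[:3]       # keep a few best grasps
    return G_s, G_b, G_r

# Toy usage with stand-in callables (not the chapter's actual tests).
regions = [((0, 0), (2, 0)), ((4, 3), (4, 5)), ((0, 6), (2, 6)), ((5, 0), (5, 2))]
midpoint = lambda r: ((r[0][0] + r[1][0]) / 2.0, (r[0][1] + r[1][1]) / 2.0)
spread = lambda c: sum(abs(a - b) for p, q in combinations(c, 2) for a, b in zip(p, q))
G_s, G_b, G_r = plan_grasps(regions, midpoint, force_closure=lambda c: True, quality=spread)
print(len(G_s), len(G_b), len(G_r))  # 4 3 0
```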
Figure 8. Grasp point generation. The fingertip range defined here is the range of its corresponding grasping region, and the midpoint is its optimal contact position, called the initial pose
The whole algorithm is divided into several procedures and operates as follows:
1. Visual features procedure
3.
4. Check whether the polytope given by (4) is bounded; if so, stop and save the selected grasps to Gs.
The last step of the algorithm consists of selecting the best grasps from the range of valid grasps, from lower to upper acceptance measures, using the parameter measures given in Table 1. Save them to Gb.
5.3 Implementation
We have implemented the visual features extraction and grasp planning algorithms in a Matlab environment for computing the feasible solution of a three-fingered grasp. We have experimented with two different kinds of objects: a 3D object and a planar object. For both objects, the extracted images are saved in two JPEG files with resolutions of 320x240 and 160x220 pixels, respectively. Table 1 summarizes the results of the grasp planning algorithm. Three and four feasible grasp configurations have been selected from a total of 25 and 24 grasping regions generated on the boundaries of objects obj1 and obj2, respectively. d1, d2 and d3 are the distances of finger positions f1, f2 and f3 from the object's center of mass. x1, x2 are the coordinates of the focus point F in the plane. d is the measured distance between the focus point and the center of mass. R is the radius of the ball centered at F. The objects' centers of mass are located at com = (121.000, 98.000) and com = (115.000, 75.000), respectively. The angle of the friction cone is fixed to 8.5 degrees for all grasp configurations. Figure 9 illustrates the grasp planning for object obj1. Three fingers are in contact with the object, which is viewed from the top by the stereo vision head placed above the table. For the second object (obj2), the visual features are extracted from a single camera. The friction cone modelling and linear constraint programming have been implemented using [M. Kvasnica, 2005]. We further developed auxiliary functions to compute various data such as the extraction of visual features of the object, the extraction of grasping regions, the friction cone modelling, and the grasp configurations.
which inserts a valid candidate grasp into the library, while the remove function deletes invalid grasps from the library. The results in this paper show the potential of combining vision and grasping in a unified way to resemble the dexterity of human manipulation.
The second part of our visual processing (general flow diagram), generating 3D grasps on unknown objects, will be future work and includes implementation on a humanoid robot with a stereo camera head and an anthropomorphic robot hand (as shown in Figure 3).
7. Acknowledgments
The authors would like to thank Prof. Dr. H. Woern and his co-workers from the IPR
institute for their support in providing the facilities and the anthropomorphic robot hand for
testing the proposed approach.
8. References
Allen, P., Miller, A., Oh, P., and Leibowitz, B. (1999). Integration of vision, force and tactile sensing for grasping. Int. Journal of Intell. Mechatronics, 4(1):129-149. [Allen et al., 1999]
Berger, A. D. and Khosla, P. K. (1991). Using tactile data for real-time feedback. Int. Journal of Robot. Res. (IJR91), 2(10):88-102. [Berger and Khosla, 1991]
Bicchi, A. and Kumar, V. (2000). Robotic grasping and contacts: A review. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 348-353. [Bicchi and Kumar, 2000]
Boudaba, M. and Casals, A. (2005). Polyhedral convex cones for computing feasible grasping regions from vision. In Proc. IEEE Symposium on Computational Intelligence in Robotics and Automation (CIRA'05), pages 607-613, Helsinki, Finland. [Boudaba and Casals, 2005]
Boudaba, M. and Casals, A. (2006). Grasping of planar objects using visual perception. In Proc. IEEE 6th International Conference on Humanoid Robots (HUMANOIDS'06), pages 605-611, Genova, Italy. [Boudaba and Casals, 2006]
Boudaba, M., Casals, A., Osswald, D., and Woern, H. (2005). Vision-based grasping point determination on objects grasping by multifingered hands. In Proc. IEEE 6th International Conference on Field and Service Robotics (FRS'05), pages 261-272, Australia. [Boudaba et al., 2005]
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698. [Canny, 1986]
Chen, N., Rink, R. E., and Zhang, H. (1995). Edge tracking using tactile servo. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'95), pages 84-99. [Chen et al., 1995]
Costa, L. and Cesar, R. (2001). Shape Analysis and Classification: Theory and Practice. CRC Press, Florida, USA, 1st edition. [Costa and Cesar, 2001]
Ferrari, C. and Canny, J. (1992). Planning optimal grasps. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2290-2295, Nice, France. [Ferrari and Canny, 1992]
Goldman, A. J. and Tucker, A. W. (1956). Polyhedral convex cones, in linear inequalities and related systems. Annals of Mathematics Studies, Princeton, 38:19-40. [Goldman and Tucker, 1956]
Hirai, S. (2002). Kinematics of manipulation using the theory of polyhedral convex cones and its application to grasping and assembly operations. Trans. of the Society of Inst. and Control Eng., 2:10-17. [Hirai, 2002]
J. W. Li, M. H. J. and Liu, H. (2003). A new algorithm for three-finger force-closure grasp of polygonal objects. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1800-1804. [J. W. Li and Liu, 2003]
Koller, D., Danilidis, K., and Nagel, H. H. (1993). Model-based object tracking in monocular image sequences of road traffic scenes. Int. Journal of Comp. Vision (IJCV'93), 3(10):257-281. [Koller et al., 1993]
Kragic, D., Miller, A., and Allen, P. (2001). Real-time tracking meets online grasp planning. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2001), pages 2460-2465, Seoul, Korea. [Kragic et al., 2001]
Lee, M. H. and Nicholls, H. R. (1999). Tactile sensing for mechatronics - a state of the art survey. Mechatronics, 9:1-31. [Lee and Nicholls, 1999]
M. Kvasnica, P. Grieder, M. B. F. J. C. (2005). Multiparametric toolbox, user's guide. [M. Kvasnica, 2005]
M. Marji, P. S. (2003). A new algorithm for dominant points detection and polygonization of digital curves. Journal of the Pattern Recognition Society, 36:2239-2251. [M. Marji, 2003]
Maekawa, H., Tanie, K., and Komoriya, K. (1995). Tactile sensor based manipulation of an unknown object by a multifingered hand with rolling contact. In Proc. IEEE International Conference on Robotics and Automation (ICRA'95), pages 743-750. [Maekawa et al., 1995]
Mokhtarian, F. and Mackworth, A. (1986). Scale-based description and recognition of planar curves and two-dimensional shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8:34-43. [Mokhtarian and Mackworth, 1986]
N. Giordana, P. Bouthemy, F. C. and Spindler, F. (2000). Two-dimensional model-based tracking of complex shapes for visual servoing tasks. In: M. Vincze and G. Hager, editors, Robust vision for vision-based control of motion, pages 67-77. [N. Giordana and Spindler, 2000]
Park, Y. C. and Starr, J. P. (1992). Grasp synthesis of polygonal objects using a three-fingered robot hand. International Journal of Robotics Research, 11(3):163-184. [Park and Starr, 1992]
Ponce, J. and Faverjon, B. (1995). On computing three-finger force-closure grasps of polygonal objects. IEEE Transactions on Robotics and Automation, 11(6):868-881. [Ponce and Faverjon, 1995]
Rosin, P. L. (1997). Techniques for assessing polygonal approximation of curves. IEEE Trans. on Pattern Analysis and Machine Intell., 19:659-666. [Rosin, 1997]
Sanz, P., del Pobil, A., Iñesta, J., and Recatalá, G. (1998). Vision-guided grasping of unknown objects for service robots. In Proc. IEEE International Conference on Robotics and Automation (ICRA'98), pages 3018-3025, Leuven, Belgium. [Sanz et al., 1998]
Smith, C. and Papanikolopoulos, N. (1996). Vision-guided robotic grasping: Issues and experiments. In Proc. IEEE International Conference on Robotics and Automation (ICRA'96), pages 3203-3208. [Smith and Papanikolopoulos, 1996]
Wunsch, P., Winkler, S., and Hirzinger, G. (1997). Real-time pose estimation of 3D objects from camera images using neural networks. In Proc. IEEE International Conference on Robotics and Automation (ICRA'97), pages 3232-3237. [Wunsch et al., 1997]
5. Behavior-Based Perception for Soccer Robots
Floris Mantz and Pieter Jonker
Delft University of Technology
The Netherlands
1. Introduction
The mission of this chapter is to show the possibility of boosting the performance of the vision system of autonomous perception-based robots by implementing a behavior-based software architecture with multiple independent sense-think-act loops. This research stems from a wider view of future robots having layered modular architectures, with higher layers controlling lower layers, in which all parts of the robot's tasks (perception, behavior, motion) are behavior specific, and preferably all input-output mappings are learned. The work in this chapter focuses only on improving the perception of robots. By implementing a behavior-based perception system for a goalie in a team of 4-legged soccer robots, we have increased its performance on localization and goal-clearing by more than 50%. On top of that, we have significantly increased the performance of the image processing by making it entirely object specific, with a different color-table and set of grid-lines for each different object searched for. All improvements combined allow the robot to localize in various conditions where this was previously not possible.
Figure 1. The sense-think-act loop: image processing (sense), artificial intelligence (think) and control engineering (act), interacting with the world
Figure 2. Behavior hierarchy for the striker role: Be_Striker decomposes into Dribble_Ball_to_Goal and Shoot_Ball_to_Goal, which build on Push_Ball, Goto_Ball and Kick_Ball; each module obtains the poses of the goal, the ball and the robot itself and gets/sets the robot speed
architecture should be set up in which higher layers control the lower layers, either by invoking actions from the lower layers or by promoting or suppressing behaviors from those lower layers. All modules run principally in parallel and in turn invoke, promote or suppress actions of modules lower in the abstraction hierarchy.
Figure 2 shows the design for a soccer robot, detailed for its role of striker. Figures 3 and 4 show the same hierarchy as Figure 2, but in more detail. Figure 3 shows more detail on the behavior (act) part of the hierarchy, whereas Figure 4 shows more detail on the perception (sense) part of the hierarchy.
Figure 3. Soccer playing robot in the role of striker; behavior viewpoint
A striker can either dribble or shoot the ball to the goal. The striker module decides on the best position (P1...P7) near the goal to dribble the ball to, from where it can successfully execute (X) a shot to goal. Both for dribbling and shooting it needs to go to the best position behind the ball. For dribbling to the goal it needs to push the ball without losing it (avoiding others); for shooting to goal it needs to execute a kick.
To perform these three behaviors it needs to perceive the pose (position and orientation) of the ball and the goal with respect to itself, i.e. given by vectors (R, φ), and to set and measure the forward and angular speed of the robot (Vf, Vφ). For kicking, one needs to specify the kicking force (F). The poses of the ball, the goal and itself are measured using the vision system and the odometry (RPM of the wheels for mobile robots or steps for walking robots). Figure 4 shows that all perception modules, e.g. the one minding one's own pose, can be split into a module to discover the pose when it is unknown and a module to track the pose when it is well known.
Programming the behaviors of an autonomous system that needs to function under all circumstances in any environment is often similar to maintaining a house of cards. Moreover, as a designer cannot foresee all possible states that the system encounters in its life, learning the behaviors, e.g. based on reinforcement (Sutton & Barto, 1998); (Takahashi & Asada, 2004), is a valuable solution to overcome and learn from unknown situations. However, when the dimension of the input-output / state-action space becomes too high (>8), learning becomes cumbersome (Jonker et al, 2004); (Dietterich, 2000). Hence, even and especially when reinforcement learning methods are used, one should aim for layered, modular sense-think-act architectures in which we can learn the basic behaviors and perhaps even the perceptions.
Figure 4. Soccer playing robot in the role of striker; perception viewpoint: Mind_Self, Mind_Robot, Mind_Ball and Mind_Goal modules, each split into Find_Percepts and Track_Percepts, fed by the Vision and Odometer modules
Because the perception problem can differ so widely over different behaviors, we have developed a software architecture for a team of soccer robots, with a behavior-based hierarchy of modules (Lenser et al, 2002) in which each module is treated and implemented as a separate sense-think-act loop. We will show that this architecture performs similarly to or better than an architecture based on monolithic discipline-based modules, even when we omit learning.
With this new architecture we expect the following improvements:
1. That each (sense-think-act) module is simpler and hence can be better understood and used to design other behaviors (copy-paste-modify) by other developers.
2. That effectively less, and less complicated, code is running in the new situation than in the old situation. The crux here is that in the old situation the code that was running did not always contribute to the behavior, but was merely there for general-purposeness.
3. That our goalkeeper performs better and more robustly because it can use information on its location and behavior (action).
Location and behavior information (point 3) can be used either to improve the self localisation algorithms, which we call behavior-specific self localisation, or directly to optimize the image processing algorithms, which we call behavior-specific image processing. Below we will discuss both options.
3.1 Behavior specific self localisation
With behavior-specific self localisation, we make the self localisation algorithms specific for different behaviors. The first reason why behavior-specific self localisation can increase performance is that it can use information on the kind of action the robot is performing. E.g. when particle filters are used for self localization, one always has to make a trade-off between robustness and speed. If the particles are updated slowly on new sensor inputs, the system is more robust against false sensor inputs. If the particles are updated fast, the system can be accurate despite unmodeled movements, such as uncertainty in the odometry evaluation, collisions, or a pickup (kidnap) by the referee. With behavior-specific self localisation we can go for speed or robustness when required. When a robot is positioning itself (e.g. a goalie standing in the goal, or a field player walking around), the sensor input is qualitatively high and accurate localization is our aim; hence we use a fast update of the particles. When a robot is handling a ball, the sensor input has a low quality and the updating of the robot's pose is less urgent; hence we use a slow update of the particles.
Secondly, behavior-specific self localisation can increase performance by using the location information for a certain behavior. If a position is already well known, the self locator could (partly) discard percepts indicating a totally different position. The self locator could also directly be told which percepts it should put more or less emphasis on. E.g. for the goalie, the self localisation could always make less use of perceptions of its own goal; for a striker, the self localisation could put extra emphasis on detections of the opponent's goal.
In most situations, the best way to implement behavior-specific self localisation is to build one general self locator that takes parameters that can be set from the behaviors. These parameters could indicate the overall update rate of all particles, the rate of rejecting outlier measurements, and a weight for each possible detected object (blue goal, yellow goal, lines, blue flag etc.). If the situation or requirement in a certain behavior is really different from that in other behaviors, one could decide to implement an entirely new self locator algorithm.
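A minimal sketch of this "one general self locator with behavior-set parameters" idea is shown below; the parameter names, numeric values and behavior labels are illustrative assumptions rather than the actual DT2004 interface.

```python
from dataclasses import dataclass, field

@dataclass
class LocatorParams:
    """Parameters a behavior can push into the general self locator (illustrative values)."""
    update_rate: float                  # how strongly new percepts move the particles
    outlier_rejection: float            # how aggressively inconsistent percepts are discarded
    object_weights: dict = field(default_factory=dict)   # emphasis per landmark type

# Behavior-specific settings, following the goalie and striker examples in the text.
BEHAVIOR_PARAMS = {
    "goalie_position": LocatorParams(0.8, 0.2, {"own_goal": 0.5, "own_flags": 1.0, "lines": 1.5}),
    "goalie_clear_ball": LocatorParams(0.2, 0.8, {"own_flags": 0.5, "lines": 0.5}),
    "striker_walk": LocatorParams(0.8, 0.3, {"opponent_goal": 1.5, "lines": 1.0}),
}

class StubLocator:
    """Stand-in for the general self locator; it only reports the parameters it receives."""
    def configure(self, params):
        print(f"update_rate={params.update_rate}, weights={params.object_weights}")

def configure_locator(behavior, locator):
    """Push the active behavior's parameters into the self locator."""
    locator.configure(BEHAVIOR_PARAMS[behavior])

configure_locator("goalie_clear_ball", StubLocator())
```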
Figure 5. False positive. The robot not only detects the yellow goal, but also mistakes some
blue in the playing field for a blue goal
Note that discarding unexpected objects could also be done in the self locator. The advantage of discarding them at an earlier stage, i.e. in the image processing stage, is that the locator algorithms don't need to be executed, which saves CPU cycles.
Secondly, behavior-specific image processing can be used to allow for different detection schemes for the same object, using e.g. distance information. A goalie, for example, will see the opponent flag at a far distance (Fig. 6a), while an attacker might come much closer to the same flag (Fig. 6b). Using different algorithms for the two situations could improve the performance of the detection.
modules and can be called independently. When an algorithm is called it takes a parameter,
indicating e.g. the color of the object (blue/yellow), and the size (far/near). Every cycle,
when the central image processing module is called, it will call a set of image processing
algorithms, dependent on the behavior. In chapter 6 we will show the other advantages we
found by making image processing completely modular.
3.3 Drawbacks of behavior based vision
There are limits and drawbacks to applying multiple sense-think-act loops to the vision
system of robots.
The first thing to consider is that the use of location information in the image processing and self localization for discarding unexpected objects gives rise to the chance of entering a local loop: if the robot discards information based on a wrong assumption about its own position, it could happen that the robot is not able to retrieve its correct position. To avoid local loops, periodic checking mechanisms on the own position are required (at a lower pace). One could also restrict the runtime of behaviors in which much information is discarded and invoke some relocation behavior to be executed periodically.
The second drawback is that, due to less reusability and more implementations of optimized code, the overall size of the system will grow. This influences the time it will take to port code to a new robot, or to build new robot software from scratch.
The third drawback is that for every improvement of the system (for every sense-think-act loop), some knowledge is needed of the principles of image processing, mechanical engineering, control theory, AI and software engineering. Because of this, behavior designers will probably be reluctant to use the behavior-specific vision system. Note, however, that even if behavior designers are not using behavior-dependent vision, the vision system can still be implemented. In the worst case a behavior designer can choose to select the general version of the vision system for all behaviors, and the performance will be the same as before.
Figure 7. Simplified software architecture for a soccer-playing Aibo robot in the Dutch Aibo
Team
In this paragraph, an overview is given of the software architecture of the soccer robots (Sony Aibo ERS-7) in the Dutch Aibo Team (Oomes et al, 2004), which was adapted in 2004 from the code of the German Team of 2003 (Röfer et al, 2003). This software was used as a starting point for implementing the behavior-based vision system described in the next paragraph. The DT2004 software was also used for testing the performance of the new systems.
In Fig. 7 a simplified overview of the DT2004 software architecture is depicted. The architecture can be seen as one big sense-think-act loop. Sensor measurements are processed by Image Processing, Self Localisation, Behavior Control and Motion Control sequentially, in order to plan the motions of the actuators. Note that this simplified architecture only depicts the modules most essential to our research. Other modules, e.g. for detecting obstacles or other players, and modules for controlling LEDs and generating sounds, are omitted from the picture.
4.1 Image Processing
The image processing is the software that generates percepts (such as goals, flags, lines and
the ball) from the sensor input (camera images). In the DT2004 software, the image
processing uses a grid-based state machine (Bruce et al, 2000), with segmentation primarily
done on color and secondarily by shapes of objects.
Using a color table
A camera image consists of 208*160 pixels. Each of these pixels has a three-dimensional value p(Y,U,V). Y represents the intensity; U and V contain color information, each having an integer value between 0 and 254. In order to simplify the image processing problem, all these 254*254*254 possible pixel values are mapped onto only 10 possible colors: white, black, yellow, blue, sky-blue, red, orange, green, grey and pink, the possible colors of objects in the playing field. This mapping makes use of a color-table, a big 3-dimensional matrix which stores which pixel value corresponds to which color. This color-table is calibrated manually before a game of soccer.
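The sketch below illustrates such a color-table lookup, down-sampling YUV space into 8x8x8 blocks so the table stays small; the calibration boxes and pixel values are invented for illustration and do not reproduce the DT2004 table.

```python
import numpy as np

# Color classes used on the playing field (the index in this list is stored in the table).
COLORS = ["none", "white", "black", "yellow", "blue", "sky-blue",
          "red", "orange", "green", "grey", "pink"]

# Down-sampled color table: one class per 8x8x8 block of YUV space (32*32*32 entries).
color_table = np.zeros((32, 32, 32), dtype=np.uint8)

def calibrate(y_range, u_range, v_range, color):
    """Mark a YUV box as belonging to one color class (stand-in for manual calibration)."""
    (y0, y1), (u0, u1), (v0, v1) = ((a // 8, b // 8) for a, b in (y_range, u_range, v_range))
    color_table[y0:y1 + 1, u0:u1 + 1, v0:v1 + 1] = COLORS.index(color)

def classify(pixel):
    """Map a raw (Y, U, V) pixel to its calibrated color class."""
    y, u, v = (int(c) // 8 for c in pixel)
    return COLORS[color_table[y, u, v]]

# Example calibration: a green box for the field and an orange box for the ball.
calibrate((40, 160), (80, 120), (60, 110), "green")
calibrate((120, 220), (100, 150), (160, 210), "orange")
print(classify((100, 100, 80)))    # green
print(classify((180, 120, 190)))   # orange
```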
Grid-based image processing
The image processing is grid-based. For every image, first the horizon is calculated from the known angles of the robot's head. Then a number of scan-lines is calculated perpendicular to that horizon. Each scan-line is then scanned for sequences of colored pixels. When a certain sequence of pixels indicates a specific object, the pixel is added to a cluster for that possible object. Every cluster is evaluated to finally determine whether or not an object was detected. This determination step uses shape information, such as the width and length of the detected cluster, and the position relative to the robot.
Grid-based image processing is useful not only because it processes only a limited number of pixels, saving CPU cycles, but also because each image is scanned relative to the horizon. Therefore, processing is independent of the position of the robot's head (which varies widely for an Aibo robot).
4.2 Self Localisation
The self localisation is the software that obtains the robot's pose (x, y, θ) from the output of the image processing, i.e. the found percepts. The approach used in the Dutch Aibo Team is particle filtering, or Monte Carlo localization, a probability-based method (Thrun, 2002); (Thrun et al, 2001); (Röfer & Jüngel, 2003). The self locator keeps track of a number of particles, e.g. 50 or 100.
Each particle basically consists of a possible pose of the robot and a probability. Each processing cycle consists of two steps: updating the particles and re-sampling them. The updating step starts by moving all particles in the direction that the robot has moved (odometry), adding a random offset. Next, each particle updates its probability using information on the percepts (flags, goals, lines) generated by the image processing. Also in this step the pose of the particles can be slightly updated, e.g. using the calculated distance to the nearest lines. In the second step, all particles are re-sampled. Particles with high probabilities are multiplied; particles with low probabilities are removed.
A representation of all 50 particles is depicted in Figure 8.
Figure 8. The self localization at initialization; 100 samples are randomly divided over the field. Each sample has a position x, y, and heading in absolute playing-field coordinates. The robot's pose (yellow robot) is evaluated by averaging over the largest cluster of samples.
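A minimal sketch of this two-step cycle (motion update with a random offset, probability update from a percept, then resampling proportional to probability) is given below, assuming NumPy. The Gaussian distance likelihood and all numeric settings are placeholders, not the DT2004 measurement model.

```python
import numpy as np

rng = np.random.default_rng(1)
FIELD = np.array([6.0, 4.0])            # field size in metres (illustrative)

def init_particles(n=100):
    """Random poses (x, y, theta) spread over the field, with uniform probabilities."""
    poses = np.column_stack([rng.uniform(0, FIELD[0], n),
                             rng.uniform(0, FIELD[1], n),
                             rng.uniform(-np.pi, np.pi, n)])
    return poses, np.full(n, 1.0 / n)

def motion_update(poses, odometry, noise=(0.05, 0.05, 0.05)):
    """Move every particle by the odometry estimate plus a random offset."""
    return poses + odometry + rng.normal(0.0, noise, poses.shape)

def measurement_update(probs, poses, landmark, measured_dist, sigma=0.3, rate=0.8):
    """Blend each particle's probability with the likelihood of a measured landmark distance;
    'rate' plays the role of the behavior-specific update rate of Section 3.1."""
    expected = np.linalg.norm(poses[:, :2] - landmark, axis=1)
    likelihood = np.exp(-0.5 * ((expected - measured_dist) / sigma) ** 2)
    probs = (1.0 - rate) * probs + rate * likelihood
    return probs / probs.sum()

def resample(poses, probs):
    """Multiply particles with high probability and drop the unlikely ones."""
    idx = rng.choice(len(probs), size=len(probs), p=probs)
    return poses[idx], np.full(len(probs), 1.0 / len(probs))

poses, probs = init_particles()
poses = motion_update(poses, odometry=np.array([0.1, 0.0, 0.0]))
probs = measurement_update(probs, poses, landmark=np.array([6.0, 2.0]), measured_dist=2.5)
poses, probs = resample(poses, probs)
print(poses.mean(axis=0))               # crude pose estimate; DT2004 averages the largest cluster
```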
4.3 Behavior Control
Figure 9. General simplified layout of the first layers of the behavior architecture of the DT2004 soccer agent. The rectangular shapes indicate options; the circular shape indicates a basic behavior. When the robot is in the penalized state and standing, all the dark-blue options are active
Behavior control can be seen as the upper command of the robot. As input, behavior control takes high-level information about the world, such as the robot's own pose, the position of the ball and of other players. Dependent on its state, behavior control then gives commands to motion control, such as "walk with speed x", "look in direction y", etc. Behavior control in the DT2004 software is implemented as one gigantic state machine, written in XABSL (Lötzsch et al, 2004), an XML-based behavior description language. The state machine distinguishes between options, states and basic behaviors. Each option is a separate XABSL file. Within one option, the behavior control can be in different states. E.g. in Figure 9, the robot is in the penalized state of the play-soccer option, and therefore calls the penalized option. Basic behaviors are those behaviors that directly control the low-level motion. The stand behavior in Figure 9 is an example of a basic behavior.
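The sketch below mimics this option/state/basic-behavior structure in a few lines of Python. The option names follow Figure 9, but the transition logic and the execution interface are invented for illustration and are not the actual XABSL semantics.

```python
class Option:
    """A small state machine: each state either switches to another state, descends into a
    sub-option, or executes a basic behavior (a callable acting on motion control)."""
    def __init__(self, name, states, initial):
        self.name, self.states, self.state = name, states, initial

    def execute(self, world, motion):
        action = self.states[self.state](world)   # each state decides what to do this cycle
        if isinstance(action, Option):
            action.execute(world, motion)          # call a sub-option
        elif callable(action):
            action(motion)                         # basic behavior controls motion directly
        elif isinstance(action, str):
            self.state = action                    # state transition inside this option

def stand(motion):                                 # a basic behavior
    motion.append("stand")

penalized = Option("penalized", {"wait": lambda w: stand}, initial="wait")

play_soccer = Option(
    "play_soccer",
    {
        "penalized": lambda w: penalized if w["penalized"] else "playing",
        "playing": lambda w: "penalized" if w["penalized"] else stand,   # placeholder for play
    },
    initial="penalized",
)

motion_requests = []
play_soccer.execute({"penalized": True}, motion_requests)
print(play_soccer.state, motion_requests)          # penalized ['stand']
```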
4.4 Motion control
Motion control is the part that calculates the joint values of the robot's joints. Three types of motion can be identified in the DT2004 software:
Special actions
A special action is a predefined set of joint values that is executed sequentially, controlling both leg and head joints. All kicking motions, get-up actions and other special movements are special actions.
Walking engine
All walking motions make use of an inverse kinematics walking engine. The engine takes a large set of parameters (approx. 20) that result in walking motions. These parameters can be changed by the designer. The walking engine mainly controls the leg joints.
Head motion
The head joints are controlled by head control, independently from the leg joints. The head motions are mainly (combinations of) predefined loops of head joint values. The active head motion can be controlled by behavior control.
Figure 10) that may invoke them. We will call this meta-behavior the goalie's governing behavior.
Figure 10. Cut-out of the hierarchy of behaviors of a soccer robot, with emphasis on the
goalkeeper role. Each behavior (e.g. position) is an independently written sense-think-act loop
a)
b)
c)
Figure 11. Basic goalie behaviors: a) Goalie-return-to-goal, b) Goalie-position, c) Goalie-clear-ball. For each behavior a different vision system and a different particle filter setting is used
5.2 Specific perception used for each behavior
For each of the three behaviors identified in Figures 10 and 11, we have adapted both the image
processing and the self-localization algorithms in order to improve localization performance.
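How such behavior-dependent adaptation could be organised is illustrated by the hypothetical sketch below: each goalie behavior selects its own perception configuration (which percepts to search for and how the particle filter treats them). The field names and values are illustrative and do not reflect the actual tuning of the robot.

```cpp
// Hypothetical per-behavior perception configuration (illustration of the idea only).
#include <cstdio>

enum class GoalieBehavior { ReturnToGoal, Position, ClearBall };

struct PerceptionConfig {
    bool searchOwnGoal;          // look for the own goal in the image
    bool searchOwnFlags;         // look for the two corner flags near the own goal
    bool searchGoalAreaLines;    // detect the lines of the goal area
    double flagUpdateRate;       // how strongly flag percepts move the pose estimate
    bool rejectFarOffLines;      // ignore lines that would imply a very different pose
};

// Each behavior activates only the detectors and filter settings it needs.
PerceptionConfig configFor(GoalieBehavior b) {
    switch (b) {
        case GoalieBehavior::ReturnToGoal: return {true,  true,  false, 1.0, false};
        case GoalieBehavior::Position:     return {false, true,  true,  1.0, false};
        case GoalieBehavior::ClearBall:    return {false, true,  true,  0.3, true };
    }
    return {};
}

int main() {
    PerceptionConfig c = configFor(GoalieBehavior::ClearBall);
    std::printf("clear-ball: reject far-off lines = %d\n", c.rejectFarOffLines);
}
```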
Goalie-return-to-goal. When the goalie is not in his goal area, he has to return to it. The
goalie walks around scanning the horizon. When he has determined his own position on the
field, the goalie tries to walk straight back to goal - avoiding obstacles - keeping an eye on
his own goal. The perception algorithms greatly resemble the ones of the general image
processor, with some minor adjustments.
Image-processing searches for the own goal, line-points, border-points and the two corner
flags near the own goal. The opponent goal and flags are ignored.
For localisation, an adjusted version of the old DT2004 particle filter is used, in which a
detected own goal is used twice when updating the particles.
Goalie-position. The goalie is in the centre of its goal when no ball is near. It sees the
field-lines of the goal area often and at least one of the two nearest corner flags regularly.
Localisation is mainly based on the detection of the goal-lines; the flags are used only to
correct the estimate if the orientation is more than 45° off. This is necessary because the
robot has no way (yet) to distinguish between the four lines surrounding the goal.
Image processing is used to detect the lines of the goal-area and for detecting the flags. The
distance and angle to goal-lines are detected by applying a Hough transform on detected
line-points.
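The implementation of this step is not given in the chapter; the sketch below shows one common way to do it: a rho-theta Hough accumulator over the detected line-points, whose strongest cell gives the distance and angle of the dominant goal-line. The resolutions and the search range are assumed values.

```cpp
// Minimal rho-theta Hough transform over detected line-points (illustrative sketch).
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct Point { double x, y; };                   // line-points in robot-centric coordinates (mm)

// Returns (distance, angle) of the strongest straight line through the given points.
std::pair<double, double> dominantLine(const std::vector<Point>& pts) {
    const double pi = 3.14159265358979;
    const int nTheta = 180;                      // 1 degree angular resolution
    const double rhoStep = 50.0;                 // 5 cm distance resolution (assumed)
    const double rhoMax = 4000.0;                // search lines up to 4 m away (assumed)
    const int nRho = static_cast<int>(2 * rhoMax / rhoStep) + 1;
    std::vector<int> acc(nTheta * nRho, 0);      // the Hough accumulator

    for (const auto& p : pts)
        for (int t = 0; t < nTheta; ++t) {
            double theta = t * pi / nTheta;
            double rho = p.x * std::cos(theta) + p.y * std::sin(theta);
            int r = static_cast<int>((rho + rhoMax) / rhoStep);
            if (r >= 0 && r < nRho) ++acc[t * nRho + r];   // each point votes for all lines through it
        }

    int best = 0;                                // cell with the most votes = dominant line
    for (int i = 1; i < static_cast<int>(acc.size()); ++i)
        if (acc[i] > acc[best]) best = i;
    return {(best % nRho) * rhoStep - rhoMax, (best / nRho) * pi / nTheta};
}

int main() {
    std::vector<Point> pts = {{1000, 0}, {1000, 200}, {1000, -300}};   // points on the line x = 1000 mm
    auto [dist, angle] = dominantLine(pts);
    std::printf("distance %.0f mm, angle %.2f rad\n", dist, angle);
}
```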
For the detection of the own flags a normal flag-detection algorithm is used, with the
adjustment that flags that appear too small are rejected, since the flags are expected to be relatively near.
For self-localization, a special particle filter was used that localizes only on the detected
lines and flags. A background process verifies the 'in goal' assumption based on the average
number of detected lines and flags.
Goalie-clear-ball. If the ball enters the goal area, the goalie will clear the ball.
The image processing in this behavior is identical to that in the goalie-position behavior. The
goalie searches for the angles and distances to the goal-lines, and detects the flags nearest to
the own goal.
However, the self localization for the clear_ball behavior is different from that of the position
behavior. When the goalie starts clearing the ball, the quality of the perception input will be
very low. We have used this information both for processing detected lines and for
processing detected flags.
For flags we have used a lower update rate: it will take longer before the detection of flags at
a different orientation will result in the robot changing its pose. Lines detected at far off
angles or distances, resulting in a far different robot pose, are ignored. The main reason for
this is that while clearing the ball the goalie could end up outside its penalty area. In this
case we don't want the robot to mistake a border line or the middle-line for a line belonging
to the goal area.
When the goalie clears a ball, there is no mechanism to check the 'in goal'
assumption, as there was in the position behavior. When the goalie has finished clearing the ball
and has returned to the position behavior, this assumption will be checked again.
A further advantage of object-specific color-tables is that it takes much less time to calibrate them. Making a color table as in
Figure 14b, which has to work for all algorithms, can take a very long time.
Figure 12. General versus object-specific image processing. On the left one can see the general
image processing: a single grid and color-table is used for detecting all candidates for all
objects. In the modular image processing (right), the entire process of image processing is
object specific
Figure 13. Object-specific image processing: a) for line detection we scan the image below
the horizon, using a green-white color table; b) for yellow flag detection we scan above the
horizon using a yellow-white-pink color table; c) 2 lines and 1 flag detected in the image
Figure 14. a) camera image; b) segmented with a general color-table; c) segmented with a
blue/green color-table; d) segmented with a blue/white/pink color-table for the detection
of a blue flag; e) segmented with a green/white color-table; f) segmented with a
yellow/green color-table for the detection of the yellow goal
7. Performance Measurements
7.1 General setup of the measurements
In order to prove our hypothesis that a goalie with a behavior-based vision system is more
robust, we have performed measurements on the behavior of our new goalie.
The localisation performance is commonly evaluated in terms of accuracy and/or
reactiveness of localisation in test environments dealing with noisy (Gaussian) sensor
measurements (Röfer & Jüngel, 2003). We, however, are mainly interested in the
system's reliability when dealing with more serious problems such as large amounts of false
sensor data input, or limited amounts of correct sensor input.
The ultimate test is: how many goals does the new goalie prevent under game conditions in
comparison with the old goalie? Due to the hustle and chaotic play around the goal when
there is an attack, the goalie easily loses track of where he is. Our ultimate test is therefore
twofold:
1. How fast can the new goalie recover its position in the middle of the goal on a
crowded field, in comparison with the old goalie?
2. How many goals can the new goalie prevent on a crowded field within a certain time
slot, in comparison with the old goalie?
All algorithms for the new goalie are made object-specific, as described in chapter 4. Since
we also want to know the results of using behavior-based perception, results of all real-world
scenarios are compared not only to results obtained with the DT2004 system, but also
with a general vision system that does implement all the object-specific algorithms.
The improvements due to object-specific algorithms are also tested offline on sets of images.
7.2 Influence of Object-Specific Image Processing
We have compared the original DT2004 image processing with a general version of our
NEW image processing, meaning that the latter does not (yet) use behavior-specific image
processing nor self-localization. In contrast with the DT2004 code, the NEW approach does
use object-specific grids and color tables. Our tests consisted of searching for the 2 goals, the
4 flags, and all possible line- and border-points. The image sequences were captured with
the robot's camera, under a large variety of lighting conditions (Figure 15). A few images
from all but one of these lighting-condition sequences were used to calibrate the Color
Tables (CTs). For the original DT2004 code, a single general CT was calibrated for all colors
that are meaningful in the scene, i.e. blue, yellow, white, green, orange and pink. This
calibration took three hours. For the NEW image processing code we calibrated five 3-color
CTs (for the white-on-green lines, blue-goal, blue-flag, yellow-goal, and yellow-flag
respectively). This took only one hour for all tables, i.e. about 30% of the original time.
Figure 15. Images taken by the robot's camera under different lighting conditions: a) Tube-light; b) Natural-light; c) Tube-light + 4 floodlights + natural light.
For all image sequences that we had acquired, we have counted the number of objects that
were detected correctly (N true) and detected falsely (N false). We have also calculated the
correctly accepted rate (CAR), i.e. the number of objects that were correctly detected
divided by the number of objects that were in principle visible. Table 1 shows the results on
detecting flags and lines. The old DT2004 image processor uses a general grid and a single
color table; the NEW modular image processor uses object-specific grids and color-tables
per object. The calculation of the correctly accepted rate is based on 120 flags/goals that
were in principle visible in the first 5 image sequences and on 360 flags/goals in principle
visible in the set for which no calibration was made. The image sequences for line
detection each contained on average 31-33 line-points per frame.
[Table 1 data: for each lighting condition (1 flood light; tube light; 4 flood lights; natural light; tube + flood + natural light; non-calibration set) the number of correctly detected goals/flags (N true), the number of false detections (N false) and the correctly accepted rate (CAR, %) obtained with the DT2004 and the NEW image processors, together with the correctly accepted rate of the line detection (%).]
Table 1. The influence of object-specific algorithms for goal, flag and line detection
Table 1 shows that due to using object-specific grids and color tables, the performance of the
image processing increased considerably. The correctly accepted rate (CAR) goes up from about
45% to about 75%, while the number of false positives is reduced. Moreover, it takes less
time to calibrate the color-tables. The correctly accepted rate of the line detection even goes
up to over 90%, even when a very limited amount of light is available (1 flood light).
7.4 Influence of behavior-based perception
In the previous tests we have shown the improvement due to the use of object-specific grids
and color tables. Below we show the performance improvement due to behavior-based
switching of the image processing and the self-localization algorithm (the particle filter). We
used the following real-world scenarios.
Localize in the penalty area. The robot is put into the penalty area and has to return to a
predefined spot as many times as possible within 2 minutes.
Return to goal. The robot is manually put onto a predefined spot outside the penalty
area and has to return to the return-spot as often as possible within 3 minutes.
Clear ball. The robot starts in the return spot; the ball is manually put in the penalty
area every time the robot is in the return spot. It has to clear the ball as often as possible
in 2 minutes.
Clear ball with obstacles on the field. We have repeated the clear ball tests but then with
many strange objects and robots placed in the playing field, to simulate a more natural
playing environment.
Figure 16. Results for localisation in the penalty area. The number of times the robot can re-localise in the penalty area within 2 minutes. The old DT2004 vision system cannot localise
when there is little light (TL). The performance of the object specific image processing
(without specific self localisation) is shown by the flags and lines bars. In contrast with the
DT2004 code, the striker uses object specific image processing. The goalie uses object specific
image processing, behavior based image processing and behavior based self localisation
In order to be able to distinguish between the performance increase due to object-specific
grids and color-tables, and the performance increase due to behavior-dependent image
processing and self localisation, we used 3 different configurations.
DT2004: The old image processing code with the old general particle filter.
Striker: The new object-specific image processing used in combination with the old
general particle filter of which the settings are not altered during the test.
Goalie: The new object-specific image processing used in combination with object-specific
algorithms for detecting the field lines, and with a particle filter of which the
settings are altered during the test, depending on the behavior that is executed (as
described in chapter 5).
The results can be found in Figures 16-19.
Figure 17. Results of the return to goal test. The robot has to return to its own goal as many
times as possible within 3 minutes. The striker vision system works significantly better
than the DT2004 vision system. There is no very significant difference in overall
performance between the striker (no behavior-dependence) and the goalie (behavior
dependence). This shows that the checking mechanism of the 'in goal' assumption works
correctly
Figure 18. (left). Results of the clear ball test. The robot has to clear the ball from the goal
area as often as he can in 2 minutes. Both the striker and the goalie vision systems are more
robust in a larger variety of lighting conditions than the DT2004 vision system (that uses a
single color table). The goalie's self-locator, using detected lines and the yellow flags, works
up to 50% better than the striker's self-locator, which localises on all line-points, all flags and
goals
Figure 18 (right). Results of the clear ball with obstacles on the field test. The goalie vision
system, which uses location information to disregard blue flags/goals and only detects large
yellow flags, is very robust when many unexpected obstacles are visible in or around the
playing field.
8. Results
The impact of behavior-based perception can be seen from the localization test in the
penalty area (Figure 16) and from the clear-ball tests (Figure 18). The vision system of
the goalie, with behavior-based vision and self-localisation, performs more than 50% better on
the same task than a striker robot with a vision system without behavior-based perception.
With object-specific grids and color-tables, the performance of the image processing
(reliability) under variable lighting conditions has increased by 75-100% on sets of off-line
images, while the color calibration time was reduced to 30%.
Behavior-based perception and object-specific image processing combined allow for
localization in badly lit conditions, e.g. with tube light (TL) only (Figures 16-18).
The impact of discarding unexpected objects on the reliability of the system can most
clearly be seen from the clear ball behavior test with obstacles on the field (Figure 18,
right). With TL + Floods, the striker apparently sees unexpected objects and is unable to
localize, whereas the goalie can localize in all situations.
Using all object specific image processing algorithms at the same time requires the same
CPU load as the old general DT2004 image processor. Searching for a limited number of
objects in a specific behavior can therefore reduce the CPU load considerably.
Due to the new architecture, the code is cleaner and more understandable, and hence easier
to maintain and extend. The main drawback is that one has to educate complete
system engineers instead of pure image-processing, software, AI, and mechanical
experts.
9. References
Arkin, R.C. (1998). Behavior based robotics, MIT press, ISBN 0-262-01165-4
Brooks, R.A. (1991). Intelligence without Representation. Artificial Intelligence, Vol.47, 1991,
pp.139-159.
Bruce, J.; Balch, T. & Veloso, M (2000). Fast and inexpensive color image segmentation for
interactive robots. In Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '00), volume 3, pages 2061-2066.
Dietterich, T.G (2000). Hierarchical reinforcement learning with the MAXQ value function
decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000
Jonker, P.P.; Terwijn, B; Kuznetsov, J & van Driel, B (2004). The Algorithmic foundation of
the Clockwork Orange Robot Soccer Team, WAFR '04 (Proc. 6th Int. Workshop on the
Algorithmic Foundations of Robotics, Zeist/Utrecht, July), 2004, 1-10.
Lenser, S.; Bruce, J. & Veloso, M. (2002). A Modular Hierarchical Behavior-Based Architecture,
in RoboCup-2001, Springer Verlag, Berlin, 2002.
Lötzsch, M.; Bach, J.; Burkhard, H.-D. & Jüngel, M. (2004). Designing agent behavior with the
extensible agent behavior specification language XABSL. In: 7th International Workshop on RoboCup 2003 (Robot World Cup Soccer Games and Conferences in Artificial
Intelligence), Padova, Italy, 2004.
Mantz, F. (2005). A behavior-based vision system on a legged robot. MSc Thesis, Delft
University of Technology, Delft, the Netherlands.
Mantz, F.; Jonker, P. & Caarls, W. (2005). Behavior-based vision on a 4-legged soccer robot.
RoboCup 2005, pp. 480-487.
Oomes, S.; Jonker, P.P.; Poel, M.; Visser, A. & Wiering, M. (2004). The Dutch AIBO Team 2004,
Proc. RoboCup 2004 Symposium (July 4-5, Lisboa, Portugal, Instituto Superior
Tecnico), 2004, 1-5. See also http://aibo.cs.uu.nl
Parker, L.E. (1996). On the design of behavior-based multi-robot teams. Journal of Advanced
Robotics, 10(6).
Pfeifer, R. & Scheier, C. (1999). Understanding Intelligence. The MIT Press, Cambridge,
Massachusetts, ISBN 0-262-16181-8.
Röfer, T.; von Stryk, O.; Brunn, R.; Kallnik, M. and many others (2003). GermanTeam 2003.
Technical report (178 pages, only available online:
http://www.germanteam.org/GT2003.pdf)
Röfer, T. & Jüngel, M. (2003). Vision-based fast and reactive Monte-Carlo localization. In The
IEEE International Conference on Robotics and Automation, pages 856-861, 2003, Taipei,
Taiwan.
Sutton, R.S. & Barto, A.G. (1998). Reinforcement Learning: An Introduction, MIT Press, 1998.
ISBN 0-262-19398-1.
Takahashi, Y. & Asada, M. (2004). Modular Learning Systems for Soccer Robot. 2004, Osaka, Japan.
Thrun, S.; Fox, D.; Burgard, W. & Dellaert, F. (2001). Robust Monte Carlo localization for
mobile robots. Artificial Intelligence, Vol. 128, No. 1-2, pp. 99-141, 2001,
ISSN 0004-3702.
Thrun, S. (2002). Particle filters in robotics. In The 17th Annual Conference on Uncertainty in AI
(UAI), 2002
6
A Real-Time Framework for the Vision
Subsystem in Autonomous Mobile Robots
Paulo Pedreiras1, Filipe Teixeira2, Nelson Ferreira2, Luís Almeida1,
Armando Pinho1 and Frederico Santos3
1LSE-IEETA/DETI, University of Aveiro, Portugal
1. Introduction
Interest in using mobile autonomous agents has been growing (Weiss, G., 2000), (Kitano, K.;
Asada, M.; Kuniyoshi, Y.; Noda, I. & Osawa, E., 1997) due to their capacity to gather
information on their operating environment in diverse situations, from rescue to demining
and security. In many of these applications, the environments are inherently unstructured
and dynamic, and the agents depend mostly on visual information to perceive and interact
with the environment. In this scope, computer vision in a broad sense can be considered as
the key technology for deploying systems with a higher degree of autonomy, since it is the
basis for activities like object recognition, navigation and object tracking.
Gathering information from such type of environments through visual perception is an
extremely processor-demanding activity with hard to predict execution times (Davison, J.,
2005). To further complicate the situation many of the activities carried out by the mobile
agents are subject to real-time requirements with different levels of criticality, importance
and dynamics. For instance, the capability to timely detect obstacles near the agent is a hard
real-time activity, since failures can result in injured people or damaged equipment, while activities
like self-localization, although important for the agent performance, are inherently soft, since
extra delays in these activities simply cause performance degradation. Therefore, the
capability to timely process the image at rates high enough to allow visual-guided control or
decision-making, called real-time computer vision (RTCV) (Blake, A; Curwen, R. &
Zisserman, A., 1993), plays a crucial role in the performance of mobile autonomous agents
operating in open and dynamic environments.
This chapter describes a new architectural solution for the vision subsystem of mobile
autonomous agents that substantially improves its reactivity by dynamically assigning
computational resources to the most important tasks. The vision-processing activities are
broken into separated elementary real-time tasks, which are then associated with adequate
real-time properties (e.g. priority, activation rate, precedence constraints). This separation
allows avoiding the blocking of higher priority tasks by lower priority ones as well as to set
independent activation rates, related with the dynamics of the features or objects being
processed, together with offsets that de-phase the activation instants of the tasks to further
reduce mutual interference. The framework is built around three services:
Shared Data Buffer (SDB), allowing different processes to process in parallel a set of
image buffers;
Process Manager (PMan), which carries out the activation of the vision-dependent real-time tasks;
Quality of Service manager (QoS), which dynamically updates the real-time properties
of the tasks.
The SDB service keeps track of the number of processes that are connected to each image
buffer. Buffers may be updated only when there are no processes attached to them, thus
ensuring that processes have consistent data independently of the time required to complete
the image analysis.
The process activation is carried out by a PMan service that keeps, in a database, the process
properties, e.g. priority, period and phase. For each new image frame, the process manager
scans the database, identifies which processes should be activated and sends them wake-up
signals. This framework allows reducing the image processing latency, since processes are
activated immediately upon the arrival of new images. Standard OS services are used to
implement preemption among tasks.
The QoS manager monitors continuously the input data and updates the real-time
properties (e.g. the activation rate) of the real-time tasks. This service permits adapting the
computational resources granted to each task, ensuring that at each instant the most
important ones, i.e. the ones that have a greater value for the particular task being carried
out, receive the best possible QoS.
The performance of the real-time framework herein described is assessed in the scope of the
CAMBADA middle-size robotic soccer team, being developed at the University of Aveiro,
Portugal, and its effectiveness is experimentally proven.
[Figure: general hardware/software architecture of the robotic agents — a main processor with external communication (IEEE 802.11b) and high-bandwidth sensors, a coordination layer, a low-level control layer, and a distributed sensing/actuation system]
system by putting together well defined subsystems, and dependability, by using nodes to
ease the definition of error-containment regions. This architecture relies strongly on the
network, which must support real-time communication. For this purpose, it uses the CAN
(Controller Area Network) protocol (CAN, 1992), which has a deterministic medium access
control, a good bandwidth efficiency with small packets and a high resilience to external
interferences. Currently, the interconnection between CAN and the PC is carried out by
means of a gateway, either through a serial port operating at 115 Kbaud or through a serial-to-USB adapter.
[Figure: flowchart of the vision process — Initializations (read configuration files for cameras and objects; open and set up the camera devices; initialize data structures; initialize SDL), then sleep until a new image is ready, followed by Search Obstacles, Search Ball, Search Goals, Search Posts and Update RTDB]
[Figure: histogram of the execution time of the ball-tracking activity; horizontal axis: time (ms), 0-40 ms]
execution times since in some cases the objects may be found almost immediately, when
their position between successive images does not change significantly, or it may be
necessary to explore the whole image and expand a substantial amount of regions of
interest, e.g. when the object disappears from the robot field of vision (Davison, J., 2005).
This expectation is in fact confirmed in reality, as depicted in Figure 3, which presents a
histogram of the execution time of the ball tracking alone. Frequently the ball is located
almost immediately, with 76.1% of the instances taking less than 5ms to complete. However,
a significant amount of instances (13.9%) require between 25ms and 35ms to complete and
the maximum observed execution time was 38.752 ms, which represents 77.5% of the inter-frame period just to process a single object.
process takes a couple of frames to execute, its output is still meaningful. Furthermore,
prediction methods (Iannizzotto, G., La Rosa, F. & Lo Bello, L., 2004) combined with
odometry data may also be effectively used to obtain estimates of object positions between
updates. Another aspect to consider is that the pole localization activity should not block the
more frequent obstacle detection. This set of requirements calls for the encapsulation of each
object tracking activity in different processes as well as for the use of preemption and
appropriate scheduling policies, giving higher priority to most stringent processes. These
are basically the techniques that were applied to the CAMBADA vision subsystem as
described in the following section.
4.2 A Modular Software Architecture
Figure 4 describes the software modular architecture adopted for the CAMBADA vision
subsystem. Standard Linux services are used to implement priority scheduling, preemption
and data sharing.
Associated to each camera there is one process (ReadXC) which transfers the image frame
data to a shared memory region where the image frames are stored. The availability of a
new image is fed to a process manager, which activates the object detection processes. Each
object detection process (e.g. obstacle, ball), generically designated by proc_obj:x, x = {1, 2, ..., n}
in Figure 4, is triggered according to the attributes (period, phase) stored in a process
database. Once started, each process gets a link to the most recent image frame available and
starts tracking the respective object. Once finished, the resulting information (e.g. object
detected or not, position, degree of confidence, etc.) is placed in a real-time database
(Almeida, L.; Santos, F.; Facchinetti; Pedreiras, P.; Silva, V. & Lopes, L., 2004), identified by
the label Object info, similarly located in a shared memory region. This database may be
accessed by any other processes on the system, e.g. to carry out control actions. A display
process may also be executed, which is useful mainly for debugging purposes.
4.2.1 Process Manager
For process management a custom library called PMan was developed. This library keeps a
database where the relevant process properties are stored. For each new image frame, the
process manager scans the database, identifies which processes should be activated and
sends them pre-defined wake-up signals.
Table 1 shows the information about each process that is stored in the PMan database.
The process name and process pid fields allow a proper process identification, being used to
associate each field with a process and to send OS signals to the processes, respectively. The
period and phase fields are used to trigger the processes at adequate instants. The period is
expressed in number of frames, allowing each process to be triggered every n frames. The
phase field permits de-phasing the process activations in order to balance the CPU load over
time, with potential benefits in terms of process jitter. The deadline field is optional and
permits, when necessary, to carry out sanity checks regarding critical processes, e.g. if the
high-priority obstacle detection does not finish within a given amount of time appropriate
actions may be required to avoid jeopardizing the integrity of the robot. The following
section of the PMan table is devoted to the collection of statistical data, useful for profiling
purposes. Finally, the status field keeps track of the instantaneous process state (idle,
executing).
Process identification
  PROC_name            Process ID string
  PROC_pid             Process id
Generic temporal properties
  PROC_period          Period (frames)
  PROC_phase           Phase (frames)
  PROC_deadline        Deadline (s)
QoS management
  PROC_qosdata         QoS attributes
  PROC_qosupdateflag   QoS change flag
Statistical data
  PROC_laststart       Activation instant of last instance
  PROC_lastfinish      Finish instant of last instance
  PROC_nact            Number of activations
  PROC_ndm             Number of deadline misses
Process status
  PROC_status          Process status
Table 1. Information about each process stored in the PMan database
The access to the PMan library is made through the following calls:
PMAN_init: allocates resources (shared memory, semaphores, etc) and initializes the
PMan data structures;
PMAN_epilogue: signals that a process has terminated the execution of one instance;
PMAN_tick: called upon the availability of every new frame, triggering the activation
of processes.
The PMan service should be initialized before use, via the init function. The service uses OS
resources that require proper shutdown procedures, e.g. shared memory and semaphores,
and the close function should be called before terminating the application. To register in the
PMan table, a process should call the add function and afterwards the attach function. This
separation permits a higher flexibility since it becomes possible to have each process
registering itself completely or to have a third process managing the overall properties of
the different processes. During runtime the QoS allocated to each process may be changed
with an appropriate call to the QoSupd function. Similarly, the temporal properties of one
process can also be changed dynamically by means of the TPupd function. When a process
terminates executing one instance it should report this event via the epilogue call. This
action permits maintaining the statistical data associated with each process as well as
becoming aware of deadline violations. The query call allows accessing the statistical data of
each process registered in the database. This information can be used by the application for
different purposes like profiling, load management, etc. Finally, the tick call is triggered by
the process that interacts with the camera and signals that a new frame is ready for
processing. As a consequence of this call the PMan database is scanned and the adequate
processes activated.
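The exact signatures of the PMan calls are not listed in the chapter, so the fragment below is a simplified stand-in rather than the real library: a small process table of (name, pid, period, phase) entries and a tick routine that, for every new frame, wakes up the processes whose period and phase match. The example periods (in frames) correspond to 50 ms and 200 ms tasks at 20 fps (one frame every 50 ms).

```cpp
// Simplified stand-in for the PMan activation logic (illustrative, not the real library).
#include <cstdio>
#include <signal.h>
#include <string>
#include <unistd.h>
#include <vector>

struct ProcEntry {
    std::string name;   // PROC_name
    pid_t pid;          // PROC_pid (0 in this demo: print instead of signalling)
    int period;         // PROC_period, expressed in frames
    int phase;          // PROC_phase, expressed in frames
};

std::vector<ProcEntry> table;

// Register a process in the table (the real library splits this into add + attach calls).
void pmanAdd(const std::string& name, pid_t pid, int period, int phase) {
    table.push_back({name, pid, period, phase});
}

// Called on the arrival of every new frame: activate the processes whose period/phase match.
void pmanTick(long frame) {
    for (const auto& p : table)
        if (frame >= p.phase && (frame - p.phase) % p.period == 0) {
            if (p.pid > 0) kill(p.pid, SIGUSR1);                  // pre-defined wake-up signal
            else std::printf("frame %ld: activate %s\n", frame, p.name.c_str());
        }
}

int main() {
    pmanAdd("Avoid_Om", 0, 1, 0);   // every frame (50 ms period at 20 fps)
    pmanAdd("Ball_Fr",  0, 1, 0);
    pmanAdd("BGoal",    0, 4, 1);   // every 4th frame (200 ms), de-phased by one frame
    for (long frame = 0; frame < 8; ++frame) pmanTick(frame);
}
```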
4.2.2 Shared Data Buffers
As discussed previously, the robot application is composed of several processes that
operate concurrently, each seeking particular features in a given frame. The complexity
of these activities is very dissimilar and consequently the distinct processes exhibit
distinctive execution times. On the other hand the execution time of each process may also
vary significantly from instance to instance, depending on the particular strategy followed,
on the object dynamics, etc. Consequently, the particular activation instants of the processes
cannot be predicted beforehand. To facilitate the sharing of image buffers in this framework
a mechanism called Shared Data Buffers (SDB) was implemented. This mechanism is similar
to the Cyclic Asynchronous Buffers (Buttazzo, G.; Conticelli, F.; Lamastra, G. & Lipari, G.,
1997), and permits an asynchronous non-blocking access to the image buffers. When the
processes request access to an image buffer automatically receive a pointer to the most
recent data. Associated to each buffer there is a link count which accounts for the number of
processes that are attached to each buffer. This mechanism ensures that the buffers are only
recycled when there are no processes attached to them, and so the processes have no
practical limit to the time during which they can hold a buffer.
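A minimal single-address-space sketch of such a link-counted buffer pool is shown below; the real SDB service keeps the buffers in shared memory and protects the counters with semaphores, which is omitted here for brevity.

```cpp
// Sketch of a shared-data-buffer pool with link counting (simplified illustration of the SDB idea).
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct ImageBuffer {
    std::vector<std::uint8_t> pixels;
    long frameNumber = -1;
    int links = 0;              // number of processes currently attached to this buffer
};

class SharedDataBuffers {
    std::array<ImageBuffer, 4> bufs_;
    int latest_ = -1;           // index of the most recent complete frame
public:
    // The camera reader picks a buffer with no attached readers and fills it.
    int acquireForWrite() {
        for (int i = 0; i < static_cast<int>(bufs_.size()); ++i)
            if (i != latest_ && bufs_[i].links == 0) return i;
        return -1;              // all buffers busy (avoided by providing enough buffers)
    }
    void publish(int i, long frameNumber) { bufs_[i].frameNumber = frameNumber; latest_ = i; }

    // Object-tracking processes always receive the most recent frame and pin it.
    ImageBuffer* attachLatest() {
        if (latest_ < 0) return nullptr;
        ++bufs_[latest_].links;
        return &bufs_[latest_];
    }
    // Buffers are recycled only when no process is attached to them any more.
    void detach(ImageBuffer* b) { assert(b->links > 0); --b->links; }
};

int main() {
    SharedDataBuffers sdb;
    int w = sdb.acquireForWrite();          // camera reader obtains a free buffer ...
    sdb.publish(w, 0);                      // ... fills it and publishes it as frame 0
    ImageBuffer* img = sdb.attachLatest();  // a tracking process pins the newest frame
    sdb.detach(img);                        // and releases it when the analysis is done
}
```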
The access to the SDB library is made through the following calls:
SDB_init: reserves and initializes the diverse data structures (shared memory,
semaphores, etc);
SDB_update: signals that a given buffer was updated with new data;
runtime. Coping efficiently with this kind of environment requires support for dynamic
reconfiguration and on-line QoS management (Burns, A; Jeffay, K.; Jones, M. et al, 1996).
These features are generally useful to increase the efficiency in the utilization of system
resources (Buttazzo, G.; Lipari, G., Caccamo, M. & Abeni. L., 2002) since typically there is a
direct relationship between resource utilization and delivered QoS. In several applications,
assigning a higher CPU share to tasks increases the QoS delivered to the application. This is true, for
example, in control applications (Buttazzo, G. & Abeni, L., 2000), at least within certain
ranges (Marti, P., 2002), and in multimedia applications (Lee, C.; Rajkumar, R. & Mercer, C.,
1996). Therefore, managing the resources assigned to tasks, e.g. by controlling their
execution rate or priority, allows a dynamic control of the delivered QoS. Efficiency gains
can be achieved in two situations: either maximizing the utilization of system resources to
achieve a best possible QoS for different load scenarios or adjusting the resource utilization
according to the application instantaneous QoS requirements, i.e. using only the resources
required at each instant.
Process        Period (ms)   Priority   Offset (ms)   Purpose
Ball_Fr        50            35         0             Ball tracking (front camera)
BGoal / YGoal  200           25         50/150        Blue / Yellow Goal tracking
BPost / YPost  800           15         100/200       Blue / Yellow Post tracking
Avoid_Fr       50            45         0             Obstacle avoidance (front cam.)
Ball_Om        50            40         0             Ball tracking (omni camera)
Avoid_Om       50            45         0             Obstacle avoidance (omni camera)
Line           400           20         0             Line tracking and identification
Table 2. Periods, priorities and offsets assigned to the image-processing processes
5. Experimental Results
In order to assess the performance of the modular approach and compare it with the initial
monolithic one, several experiments were conducted, using a PC with an Intel Pentium III
CPU, running at 550MHz, with 256MB of RAM. This PC has lower capacity than those
typically used on the robots but allows a better illustration of the problem addressed in this
chapter. The PC runs a Linux 2.4.27 kernel, patched with RTAI 3.0r4. The image-capture
devices are Logitech Quickcams, with a Philips chipset. The cameras were set-up to produce
320×240 images at a rate of 20 frames per second (fps). The time instants were measured
accessing the Pentium TSC. To allow a fair comparison all the tests have been executed over
the same pre-recorded image sequence.
Process       Max. (ms)   Min. (ms)   Avg. (ms)   St.Dev. (ms)
FrontVision   143         29          58          24
OmniVision    197         17          69          31
Table 3. Inter-activation times of the FrontVision and OmniVision processes of the monolithic architecture
Figure 5. Histogram of the inter-activation time of the FrontVision (top) and OmniVision
(bottom) processes
The response time of both processes exhibits a substantial variance, with inter-activation
times ranging from 17ms to near 200ms and an average inter-activation time of 58ms and
69ms, respectively. Remembering that the image acquisition rate is 20 fps, corresponding to
50ms between frames, these figures indicate a poor performance. In fact the image
processing is part of the control loop and so the high jitter leads to a poor control
performance, a situation further aggravated by the significant amount of dropped frames,
which correspond to time lapses during which the robot is completely non-responsive to the
environment.
5.2 Modular Architecture
The different image-processing activities have been separated and wrapped in different
Linux processes, as described in Section 4. Table 2 shows the periods, offsets and priorities
assigned to each one of the processes.
The obstacle avoidance processes are the most critical ones since they are responsible for
alerting the control software of the presence of any obstacles in the vicinity of the robot,
allowing it to take appropriate measures when necessary, e.g. evasive maneuvers or
immobilization.
Therefore these processes are triggered at a rate equal to the camera frame rate and receive
the highest priority, ensuring a response-time as short as possible. It should be noted that
these processes scan restricted image regions only, looking for specific features, thus their
execution time is bounded and relatively short. In the experiments the measured execution
time was below 5 ms for each one of the processes, therefore this architecture allows
ensuring that every frame will be scanned for the presence of obstacles.
Figure 6. Front (left) and omni-directional (right) obstacle detection processes inter-activation intervals
Figure 7. Omni-directional (left) and frontal (right) camera ball tracking processes inter-activation intervals
Figure 8. Line (left) and yellow post (right) tracking processes inter-activation intervals
The second level of priority is granted to the Ball_Om process, which tracks the ball in the
omni-directional camera. This information is used when approaching, dribbling and kicking
the ball, activities that require a low latency and high update rate for good performance.
Therefore this process should, if possible, be executed on every image frame, thus its period
was also set to 50ms.
The third level of priority is assigned to the Ball_Fr process, responsible for locating the ball
in the front camera. This information is used mainly to approach the ball when it is at
medium to far distance from the robot. Being able to approach the ball quickly and
smoothly is important for the robot performance but this process is more delay tolerant than
the Ball_Om process, thus it is assigned a lower priority.
Process    Max. (ms)   Min. (ms)   Average (ms)   Standard deviation (ms)
Avoid_Fr   60.1        48.9        50.0           0.5
Avoid_Om   60.1        45.9        50.0           1.6
Ball_Om    60.1        46.0        50.0           1.6
Ball_Fr    80.0        19.9        50.0           2.1
YGoal      362.2       61.1        207.9          58.3
BGoal      383.9       60.9        208.4          66.6
Line       564.7       235.6       399.9          71.9
BPost      1055.8      567.9       799.9          87.2
YPost      1156.4      454.4       799.6          114.3
Table 4. Inter-activation intervals of the processes of the modular architecture
narrow inter-activation variance, since they are not blocked and preempt other processes
that may be running. Figure 7 shows the inter-activation intervals of the ball tracking
processes. As stated above, the ball tracking process on the omni-directional camera has a
higher priority since its data is used by more time sensitive activities. For this reason its
inter-activation interval is narrower than the ball tracking process related to the front
camera. As discussed in Section 4, the ball-tracking processes exhibit a significant execution
time variance, since in some cases they are able to find the ball almost immediately while in
other cases the whole image is scanned. For this reason the lower-priority ball-tracking
process (frontal camera) exhibits a significantly higher inter-activation jitter than the higher-priority one. The same behavior is observed for the remaining processes, which see their
inter-activation jitter increase as their relative priorities diminish.
Table 4 shows statistical data regarding the inter-activation intervals of these processes,
which confirm, in a more rigorous way, the behavior observed above. The processes are
sorted by decreasing priorities exhibiting, from top to bottom, a steady increase in the gap
between maximum and minimum values observed as well as in the standard deviation. This
is expected since higher priority processes, if necessary, preempt lower priority ones
increasing their response-time.
Comparing the figures in Table 3 and Table 4, a major improvement can be observed with
respect to the activation jitter of the most time-sensitive processes, which, for the most
important tasks was reduced to 10ms (object avoidance and omni-directional ball tracking)
and 30ms (frontal ball tracking). Furthermore, the standard deviation of the activation jitter
of these processes is much lower (between 0.5ms and 2.1ms) and no frame drops have
occurred, a situation that may have a significant impact on control performance.
During runtime higher-priority processes preempt the lower-priority ones, delaying their
execution. This effect is clear in Table 4, with the goal, post and line processes exhibiting a
much higher variability in their inter-activation times. Therefore, it can be concluded that
the modular approach is effective, being able to privilege the execution of the processes that
have higher impact on the global system performance.
5.3 Dynamic QoS adaptation
During runtime the robotic soccer players have to perform different roles, e.g., when a
defender robot gets the ball possession and has a clear way in the direction of the opposite
team goal it should assume an attacker role and some other team mate should take the
defender role in its place. The relative importance of the image processing activities depends
on the particular role that the robots are playing, e.g., in the situation described above the
robot that is taking the defender role does not need to look for the ball in its vicinity, since
this one is in the possession of a team mate, while it could benefit from a higher accuracy on
the localization, achieved by tracking the field lines more often. Therefore, having the ability
to change the image-processing attributes during runtime has the potential to increase the
robot performance.
Another aspect that should not be neglected is that the environment strongly influences the
image processing time since, depending on its richness, the algorithms may have to explore
more or less regions of interest. As a result it is possible for the robotic players to perform
differently in distinct environments or even in different times in the same environment, e.g.,
due to illumination variation. In these cases it may be interesting to manage the execution
A Real-Time Framework
for the Vision Subsystem in Autonomous Mobile Robots
97
rates of the image-processing activities in order to make the best possible use of the CPU,
without incurring overloads that penalize the control performance.
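As a hypothetical illustration, the role-dependent part of such a policy can be reduced to a table of desired periods that a QoS manager pushes into the process database whenever the role changes. The values below follow the mode change used in the experiments below (front-camera ball tracking from 50 ms to 100 ms, line tracking from 400 ms to 50 ms, at 20 fps; see Figure 10), while the mapping itself is only a sketch.

```cpp
// Hypothetical role-dependent QoS table (periods expressed in frames at 20 fps).
#include <cstdio>
#include <map>
#include <string>

enum class Role { Attacker, Defender };

// Desired activation periods per process for each role; a QoS manager would push these
// values into the process database (e.g. through a QoSupd-style call) on every role change.
std::map<std::string, int> periodsFor(Role r) {
    if (r == Role::Attacker)
        return {{"Ball_Fr", 1}, {"Line", 8}};   // chase the ball, localise only now and then
    return {{"Ball_Fr", 2}, {"Line", 1}};       // defender: localise every frame, track the ball less often
}

int main() {
    for (const auto& [name, period] : periodsFor(Role::Defender))
        std::printf("%s: every %d frame(s)\n", name.c_str(), period);
}
```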
Figure 9. Inter-activation time of the high-priority frontal (left) and omni-directional (right)
avoid processes during a mode change affecting lower-priority processes only
As discussed in Section 4.2.1, the PMAN library permits changing the QoS properties of the
processes, namely the period, phase, deadline and priority. To observe the impact of this
service a situation was created in which the decision level requested a change in the role of a
robot, from attacker to defender, as described before. Furthermore, a CPU overload was
detected and thus the need to remove a lower importance process. The resulting actions
were:
to execute the front camera ball tracking process only once in each two frames;
to raise its priority to 40, i.e., just below the obstacle avoidance processes.
Figure 9 and Figure 10 depict the inter-arrival time of the avoid, frontal camera ball-tracking
and line tracking processes.
The first fact to be observed is that the higher priority processes are not affected, except for a
small glitch on the instant of the QoS update, of similar magnitude as the jitter already
observed (less than 10ms, see Table 4). This glitch may be explained by the need to access
the PMAN table in exclusive mode and to call the Linux primitive sched_setscheduler() to
change the priority of the line process. These operations are made within the PMAN_tick
call, before the activation of the processes.
The second fact to be observed is that the line and frontal ball-tracking processes started to
behave as expected immediately after the mode change, with periods of one and two
frames, respectively.
The third fact to be observed is that the overload was controlled, and all the processes
started to behave more regularly. This effect can be observed in medium priority processes
(e.g. ball tracking) as well as in lower-priority processes (e.g. post seeking).
Therefore, it can be concluded that the PMAN services permit changing the process
attributes at run-time, allowing both mode changes and CPU load management without
disturbing the behavior of other processes not directly involved in the adaptation process
and, consequently, it is possible to carry out the reconfiguration dynamically, since there is
no service disruption.
Figure 10. Inter-activation time of the frontal ball-tracking (left) and line (right) processes
during a mode change in which the period of the former process was increased (50ms to
100ms) and the period of the latter was reduced (400ms to 50ms)
6. Conclusion
Computer vision applied to guidance of autonomous robots has been generating large
interest in the research community as a natural and rich way to sense the environment and
extract from it the necessary features. However, due to the robot's motion, vision-based
sensing becomes a real-time activity that must meet deadlines in order to support adequate
control performance and avoid collisions. Unfortunately, most vision-based systems do not
rely on real-time techniques and exhibit poor temporal behavior, with large variations in
execution time that may lead to control performance degradation and even sensing blackout periods caused by skipped image frames.
In this chapter, the referred problem is identified in the scope of the CAMBADA middle-size
robotic soccer team, being developed at the University of Aveiro, Portugal. Then, a new
architectural solution for the vision subsystem is presented that substantially improves its
reactivity, reducing jitter and frame skipping.
The proposed architecture separates the vision-based object-tracking activities in several
independent processes. This separation allows, transparently and relying solely on
operating system services, to avoid the blocking of higher-priority processes by lower
priority ones as well as to set independent activation rates, related with the dynamics of the
objects being tracked and with their impact on the control performance, together with offsets
that de-phase the activation instants of the processes to further reduce mutual interference.
As a consequence, it becomes possible to guarantee the execution of critical activities, e.g.
obstacle avoidance, and to privilege the execution of others that, although not critical, have
greater impact on the robot performance, e.g., ball tracking.
Finally, many robotic applications are deployed in open environments that are hard to
characterize accurately at pre-runtime. The architecture herein proposed permits managing
dynamically the resources assigned to tasks, e.g. by controlling their execution rate or
priority, allowing a dynamic control of the delivered QoS. This approach permits either
maximizing the utilization of system resources to achieve a best possible QoS for different
load scenarios or adjusting the resource utilization according to the application
instantaneous requirements, granting a higher QoS to the tasks that have higher impact on
the global system performance.
The work described in this chapter is focused on robotic soccer, but the results and
approach are relevant for a wider class of robotic applications in which the vision subsystem
is part of their control loop.
7. References
Almeida, L.; Santos, F.; Facchinetti; Pedreiras, P.; Silva, V. & Lopes, L. (2004). Coordinating
distributed autonomous agents with a real-time database: The CAMBADA project.
Lecture Notes in Computer Science, Volume 3280/2004, pp. 876-886, ISSN 0302-9743.
Assad, C.; Hartmann, M. & Lewis, M. (2001). Introduction to the Special Issue on
Biomorphic Robotics. Autonomous Robots, Volume 11, pp. 195-200, ISSN 0929-5593.
Blake, A.; Curwen, R. & Zisserman, A. (1993). A framework for spatio-temporal control in
the tracking of visual contours. International Journal of Computer Vision, Vol. 11, No. 2,
pp. 127-145, ISSN 0920-5691.
Burns, A.; Jeffay, K.; Jones, M. et al (1996). Strategic directions in real-time and embedded
systems. ACM Computing Surveys, Vol. 28, No. 4, pp. 751-763, ISSN 0360-0300.
Buttazzo, G.; Conticelli, F.; Lamastra, G. & Lipari, G. (1997). Robot control in hard real-time
environment. Proceedings of the 4th International Workshop on Real-Time Computing
Systems and Applications, pp. 152-159, ISBN 0-8186-8073-3, Taipei, Taiwan, Oct. 1997.
Buttazzo, G. & Abeni, L. (2000). Adaptive rate control through elastic scheduling.
Proceedings of the 39th IEEE Conference on Decision and Control, pp. 4883-4888, ISBN
0-7803-6638-7, Dec. 2000, Sydney, Australia.
Buttazzo, G.; Lipari, G., Caccamo, M. & Abeni. L. (2002). Elastic scheduling for flexible
workload management. IEEE Transactions on Computers, Vol. 51, No. 3, pp. 289-302,
ISSN: 0018-9340.
CAN (1992). Controller Area Network - CAN2.0. Technical Specification, Robert Bosch, 1992.
Davison, J. (2005). Active search for real-time vision, Proceedings of the 10th IEEE International
Conference on Computer Vision, Volume 1, pp. 66-73, ISBN 0-7695-2334-X.
De Souza, G. & Kak, A.( 2004). A Subsumptive, Hierarchical, and Distributed Vision-Based
Architecture for Smart Robotics. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 34, pp. 1988-2002, ISSN 1083-4419.
Gibson, J. (1979). The Ecological Approach to Visual Perception, Lawrence Erlbaum Associates,
Inc., ISBN 0-89859-959-8, Boston, MA.
Hirai, S.; Zakouji, M & Tsuboi, T. (2003). Implementing Image Processing Algorithms on
FPGA-based Realtime Vision System, Proceedings of the 11th Synthesis and System
Integration of Mixed Information Technologies, pp.378-385, March 2003, Hiroshima.
Iannizzotto, G.; La Rosa, F. & Lo Bello, L. (2004). Real-time issues in vision-based Human-Computer Interaction. Technical Report, VisiLab, University of Messina, Italy.
Kitano, K.; Asada, M.; Kuniyoshi, Y.; Noda, I. & Osawa E. (1997). RoboCup: The Robot
World Cup Initiative, Proceedings of the First International Conference on Autonomous
Agents (Agents'97), W. Lewis Johnson and Barbara Hayes-Roth (Eds.), pp. 340-347,
ISBN 0-89791-877-0, USA, Aug. 1997, ACM Press, N.Y.
Kopetz, H. (1997). Real-Time Systems Design Principles for Distributed Embedded Applications,
Kluwer Academic Publishers, ISBN 0-7923-9894-7, Boston, MA.
Lee, C.; Rajkumar, R. & Mercer, C. (1996). Experiences with processor reservation and
dynamic qos in real-time Mach. In Multimedia Japan 96, Japan, April 1996.
Marti, P. (2002). Analysis and Design of Real-Time Control Systems with Varying Control
Timing Constraints. PhD thesis, Universitat Politecnica de Catalunya, Barcelona,
Spain, July 2002.
RTAI (2007), RTAI for Linux, Available from http://www.aero.polimi.it/~rtai/, accessed:
2007-01-31.
Santos, F.; Almeida, L.; Pedreiras, P.; Lopes, S. & Facchinnetti, T. (2004). An Adaptive
TDMA Protocol for Soft Real-Time Wireless Communication Among Mobile
Computing Agents, Proceedings of the Workshop on Architectures for Cooperative
Embedded Real-Time Systems (satellite of RTSS 2004). Lisboa, Portugal, Dec. 2004.
SDL (2007), Simple DirectMedia Layer, Available from http://www.libsdl.org/index.php,
accessed: 2007-01-31.
Weiss, G. (2000). Multiagent systems. A Modern Approach to Distributed Artificial Intelligence,
MIT Press, ISBN 0-262-23203-0, Cambridge, MA.
7
Extraction of Roads From Outdoor Images
Alejandro Forero Guzmán, M.Sc. and Carlos Parra, Ph.D.
Departamento de Ingeniería Electrónica, Pontificia Universidad Javeriana, Bogotá
Colombia
1. Introduction
The humanitarian demining process is very slow and expensive and, most importantly, because it
is done manually it puts human lives at risk. Deminers are exposed to permanent danger
and accidents. Even with the help of dogs, the demining process has not improved much
during recent years (UNICEF, 2000).
Figure 1. Left side: Ursula Robot. Right side: Amaranta Robot Project
A few separate initiatives from the robotics community to design and prove an automated
mechanical solution have taken place. Here at the Pontificia Universidad Javeriana, in
Colombia, we are working on this problem: in a previous project the mobile robot Ursula
was developed (Rizo et al., 2003), and now we are working on a new mobile robot called
Amaranta (Figure 1). One part of the humanitarian demining problem is navigation; in
both projects the autonomous navigation task is executed based on a vision system that
uses a camera mounted on the robot.
Landmines could be placed in any type of terrain: deserts, mountains, swamps, roads,
forests, etc. This means that when trying to build a robot for demining operations, its
workspace has to be previously defined and limited. In this work humanitarian demining is
limited to places that have been modified by man and that are of great
importance for a community, for example roads or paths. Specifically, the systems have
been designed for the Colombian territory (Rizo et al., 2003).
This chapter presents two approaches developed for the vision system in order to enable
autonomous navigation in outdoor environments, both based on the road-following
principle.
Therefore, almost every vehicle facing the problem of autonomous navigation using vision
systems solves this problem by following the track. However, this technique is widely
implemented only over structured roads, because painted lines over the road are a reliable
characteristic to exploit. When there are no painted roads to follow or simply no road at all,
autonomous navigation based on visual systems is usually reduced to obstacle avoidance.
In the last two decades, autonomous navigation has been a goal sought by different
authors (Turk et al., 1988); (Thorpe et al., 1988), yet today it is still an open area for research
(Thrun et al., 2006).
Due to the danger involved in demining efforts, a cheap robot is required, equipped with an
autonomous navigation system, in order to minimize the risk for the humanitarian
demining team; and an architecture capable of supporting multiple sensors to acquire the
most reliable information about the surrounding area. These limitations, along with the
special terrain conditions of the Colombian topography, guide the present research.
Below, some concepts from the classic theory are presented and then the complete approach
taken by the authors is presented, from the problem of navigation to the extraction of roads or
paths in outdoor images as an essential part of the autonomous navigation.
2. Important Concepts
2.1 Colour Spaces
In general, colour is the perceptual result of light in the visible region of the spectrum (Jain,
1989). There are many colour spaces reported in the literature, each one with its own
characteristics. For image processing, colour is usually described by the distribution of the
three components R (Red), G (Green) and B (Blue); moreover, many other
attributes can also be calculated from these components. Colour analysis is more
difficult using the three components directly. In image processing of exterior scenes, the illumination
is a very critical parameter. A first approach is to select a colour space expressed as two
colour components and one intensity/luminance component (Ohta, HSI, LUV, L*a*b*)
(Aviña-Cervantes, 2005).
For example, the CIE L*a*b* colour space was presented by the International Commission on
Illumination. The model was based on two properties of an older colour space called CIE
XYZ. The first of these properties is that the standard was created from the frequency
response of several observers' eyes, making the system independent from electronic devices.
The second property, taken from CIE XYZ, is that the mathematical representation of the
space allows separating the luminance from the chrominance.
On the other hand, the originality of this colour space is that it introduces the concept of
perceptual uniformity. This means that if two colours are perceptually similar to each other, in CIE L*a*b*
they are close, and this distance is measured by the Euclidean metric. In the CIE L*a*b* space
L* represents the luminance, a* codifies the reddish and greenish sensation, while b* codifies
the yellowish and bluish sensation.
The space transformation from RGB to L*a*b* has two steps. The first one is a linear
transformation from RGB to CIE XYZ (Eq. 1).
(1)
The second transformation is a non-linear transformation from CIE XYZ to CIE L*a*b* (Eq.
2).
L^* = \begin{cases} 116\,(Y/Y_n)^{1/3} - 16, & Y/Y_n > 0.008856 \\ 903.3\,(Y/Y_n), & Y/Y_n \le 0.008856 \end{cases}
a^* = 500\,[\,f(X/X_n) - f(Y/Y_n)\,]
b^* = 200\,[\,f(Y/Y_n) - f(Z/Z_n)\,]
(2)
with
f(t) = \begin{cases} t^{1/3}, & t > 0.008856 \\ 7.787\,t + 16/116, & t \le 0.008856 \end{cases}
The X_n, Y_n, Z_n values are the tristimulus values of the reference white point. In some cases they can be
measured, but there are also standards according to the light conditions. In this case, due to
the weather conditions at the capturing moment, the CIE D65 (daylight) standard (Eq. 3)
is normally selected (Broek & Rikxoort, 2004).
X_n = 0.9502, \quad Y_n = 1, \quad Z_n = 1.0884
(3)
This method is not simple to implement on a computer: the non-linear transformation takes
a significant amount of processing time.
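As an illustration of Eqs. (1)-(3), a minimal C++ sketch of the conversion is given below. It
assumes the standard sRGB/D65 matrix and channel values normalized to [0, 1]; the exact
constants used by the authors may differ slightly.

#include <cmath>

// Sketch of the RGB -> CIE L*a*b* conversion of Eqs. (1)-(3),
// assuming the standard sRGB/D65 primaries and channels in [0,1].
struct Lab { double L, a, b; };

static double f_lab(double t) {
    // non-linear mapping of Eq. (2)
    return (t > 0.008856) ? std::cbrt(t) : (7.787 * t + 16.0 / 116.0);
}

Lab rgb_to_lab(double R, double G, double B) {
    // Eq. (1): linear transformation RGB -> CIE XYZ
    double X = 0.4125 * R + 0.3576 * G + 0.1804 * B;
    double Y = 0.2127 * R + 0.7152 * G + 0.0722 * B;
    double Z = 0.0193 * R + 0.1192 * G + 0.9503 * B;

    // Eq. (3): D65 reference white
    const double Xn = 0.9502, Yn = 1.0, Zn = 1.0884;

    double yr = Y / Yn;
    Lab out;
    out.L = (yr > 0.008856) ? 116.0 * std::cbrt(yr) - 16.0 : 903.3 * yr;
    out.a = 500.0 * (f_lab(X / Xn) - f_lab(Y / Yn));
    out.b = 200.0 * (f_lab(Y / Yn) - f_lab(Z / Zn));
    return out;
}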
A second approach uses the relationships between the RGB components, for example R/G
and B/G. This approach is also applied to the YCbCr colour space, which is used in video
systems; Y is the brightness component and Cb and Cr are the blue and red chrominance
components.
2.2 Semantic Model
This is an abstract representation; it gives a label, corresponding to a class, to each entity
found in the scene (i.e. sky, road, tree, etc.) (Murrieta-Cid et al., 2002). In the semantic
model, the classification is based on a priori knowledge given to the system (Fan et al.,
2001).
This knowledge consists of:
A list of possible classes that the robot identifies in the environment.
Learning attributes for each class. The region characterization is developed by using several
attributes computed from the colour information. Other attributes are texture and
geometrical information.
The kind of environment to be analyzed; the nature of a region is obtained by comparing a
vector of features with a database composed of the different classes resulting from the
learning process. The database is a function of the environment and of the problem
restrictions.
3. General System
The approach used to solve the problem of navigation presented here is very similar to the
one used with structured roads: find a path and then follow it (Bertozzi et al., 2002)
(Avia-Cervantes et al., 2003). Once the road or path has been identified, following it is a
well-known problem in robotics. For this reason, the focus of this exposition is centered on
the identification of the area that represents the path the robot will follow, and on the
extraction of some parameters necessary for the control stage.
Figure 2 shows the kind of images processed by the system. All of these images have certain
characteristics in common: a set of pixels, mostly connected, represents the road; in a close
range image (5 to 25 meters) it is highly probable to have only one road; a road that can be
followed goes from the bottom of the picture to some point in the upper middle area of the
picture.
Along with these ideas, other facts are implicit: the picture is taken horizontally to the
ground, the sky is in the upper portion of the picture, there is sufficient light to distinguish
the road or path, and there are no objects obstructing the view of the road. All of these
assumptions are easily fulfilled in real conditions. Each of these characteristics is used as
semantic information and helps to delimit the scope of the problem. The processing of the
images is done to exploit all the semantic characteristics mentioned before. At the end,
semantic rules are applied to extract the essential information in the image: the route over
the navigable terrain.
4. Algorithms
Two approaches were developed to solve the problem; both of them use colour
segmentation along with semantic information. The first algorithm, which works in the RGB
space, is the initial design (Forero & Parra, 2004). It was tested in Matlab and designed to
work along with the first implementation of the robot Ursula.
The second algorithm was conceived to work in an embedded system and captures the
image in the YCbCr colour space. In Figure 4, four pictures show the image in different
colour spaces. Next, both algorithms are explained in more detail.
Figure 5. From top to bottom, left to right: original RGB image, R-B plane, inverse R-B,
result of a median filter, result of the first morphological filter, thickened image, result of
the second morphological filter, image segmented by connectivity, selected path
After one region is selected as the path, its centroid and the direction of the path are
extracted. This enables the control system to plan the path and carry out autonomous
navigation. Figure 7 illustrates the algorithm with a flow diagram, and Figure 8 illustrates
the different stages of the algorithm.
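A minimal sketch of this extraction step is given below. It assumes a binary mask of the
selected region and computes the centroid and principal orientation from the first- and
second-order image moments; the function and type names are illustrative, not taken from
the original implementation.

#include <cmath>
#include <vector>

// Centroid and orientation of a binary path region from its image moments.
// The mask is a row-major vector of 0/1 values of size width x height.
struct PathInfo { double cx, cy, angle; };

PathInfo region_moments(const std::vector<int>& mask, int width, int height) {
    double m00 = 0, m10 = 0, m01 = 0, mu20 = 0, mu02 = 0, mu11 = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (mask[y * width + x]) { m00 += 1; m10 += x; m01 += y; }

    PathInfo p{0, 0, 0};
    if (m00 == 0) return p;
    p.cx = m10 / m00;                          // centroid coordinates
    p.cy = m01 / m00;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (mask[y * width + x]) {
                double dx = x - p.cx, dy = y - p.cy;
                mu20 += dx * dx; mu02 += dy * dy; mu11 += dx * dy;
            }
    // principal axis of the region, i.e. the direction of the path
    p.angle = 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
    return p;
}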
calculates the moments of the region of interest to extract the centroid and the orientation of
the path.
In the last step, Transmission, the information concerning the path (centroid and
orientation) is transmitted through an RS-232 serial interface to a navigation module.
Besides these functions, other considerations had to be taken into account to run the
algorithm in the embedded system:
New data types were created in C++ in order to be compatible with the ADSP EZ-KIT Lite
BF533. These data structures manage the information in the image and handle all the
parameters that the algorithm uses.
All the variables are defined according to the size and physical position that each one will
take in the physical memory of the development kit. This arrangement allows a better use of
the hardware resources and enables simultaneous processing of two images: one image is
acquired by the DMA while the other is processed in the CPU.
Finally, the Blackfin's ALU only handles fixed-point values, so floating-point values have to
be avoided in order to maintain the performance of the whole system, as sketched below.
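As a hedged illustration of this last constraint, the sketch below shows the kind of Q15
fixed-point arithmetic that can replace floating point; the actual data types defined for the
EZ-KIT Lite BF533 are not reproduced here.

#include <cstdint>

// Minimal Q15 fixed-point sketch of the arithmetic used to avoid floating
// point on the Blackfin; names and format are assumptions, not the
// original embedded data types.
typedef int16_t q15_t;

inline q15_t q15_from_double(double v) { return (q15_t)(v * 32767.0); }
inline q15_t q15_mul(q15_t a, q15_t b) {
    // 16x16 -> 32 bit multiply, then rescale back to Q15
    return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
}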
6. Conclusion
Even though there has been extensive work on road detection and road following during
the last two decades, most of it is focused on well-structured roads, which makes it difficult
to use for humanitarian demining activities. The present work shows a way to use the
natural information in outdoor environments to extract road or path characteristics, which
can be used as landmarks for the navigation process.
Another important observation is that the information combines two colour components
(i.e. the R-B projection, or the Cb and Cr channels), hence reducing the harmful effect of the
changing illumination in natural environments.
Good results were also achieved in the path planning process. The robot executes a 2-D
trajectory planning, which facilitates the work of the vision system because only the close
range segmentation has to be correct for the path planning to be successful.
With regard to the semantic information, the results show how semantic characteristics
make it possible to use low-level operations to extract the required information without
spending too much time and hardware resources.
Finally, the system implemented is part of a visual exploration strategy which is being
implemented in the robot Amaranta, and which has other visual perception functions such
as the detection of buried objects by colour and texture analysis. When the whole system is
functional, it will integrate visual navigation and control techniques and will be a useful
tool to test how all the subsystems can work together (Coronado et al., 2005).
7. References
Avia-Cervantes, G. (2005). Navigation visuelle d'un robot mobile dans un environnement
d'extérieur semi-structuré. Ph.D. Thesis, INP Toulouse, France.
Broek, E.L. van den & Rikxoort, E.M. van (2004). Evaluation of color representation for
texture analysis. Proceedings of the 16th Belgium-Netherlands Conference on Artificial
Intelligence, University of Groningen, 21-22 October 2004.
UNICEF-Colombia (2000). Sembrando Minas, Cosechando Muerte. UNICEF, Bogotá,
Colombia, September 2000.
8. Acknowledgement
The present work was partially funded by Colciencias and the Ecos-Nord Program.
8
ViSyR: a Vision System for
Real-Time Infrastructure Inspection
Francescomaria Marino1 and Ettore Stella2
1Dipartimento
2Istituto
1. Introduction
Railway maintenance is a particular application context in which periodical surface
inspection of the rolling plane is required in order to prevent any dangerous situation.
Usually, this task is performed by trained personnel who periodically walk along the
railway network searching for visual anomalies. In practice, this manual inspection is slow,
laborious and potentially hazardous, and the results are strictly dependent on the capability
of the observer to detect possible anomalies and to recognize critical situations.
With the growth of high-speed railway traffic, companies all over the world are interested
in developing automatic inspection systems able to detect rail defects, sleeper anomalies, as
well as missing fastening elements. These systems could increase the ability to detect defects
and reduce the inspection time, in order to guarantee more frequent maintenance of the
railway network.
This book chapter presents ViSyR: a patented, fully automatic and configurable FPGA-based
vision system for real-time infrastructure inspection, able to analyze defects of the rails and
to detect the presence/absence of the fastening bolts that fix the rails to the sleepers.
Besides its accuracy, ViSyR achieves impressive performance in terms of inspection velocity.
In fact, it is able to perform inspections approximately at velocities of 450 km/h (Jump
search) and of 5 km/h (Exhaustive search), with a composite velocity higher than 160 km/h
for typical video sequences. Jump and Exhaustive searches are two different modalities of
inspection, which are performed in different situations. This computing power has been
made possible by the implementation onto FPGAs. ViSyR is not only affordable, but also
highly flexible and configurable, being based on classifiers that can easily be reconfigured
for different types of rails.
More in detail, ViSyR's functionality can be described by three blocks: the Rail Detection &
Tracking Block (RD&TB), the Bolts Detection Block (BDB) and the Defects Analysis Block
(DAB).
RD&TB is devoted to detecting and tracking the rail head in the acquired video. In so
doing, it strongly reduces the windows to be effectively inspected by the other blocks. It is
based on Principal Component Analysis and Singular Value Decomposition. This technique
allows the detection of the coordinates of the center of the rail by analyzing a single row of
the acquired video sequence (and not a rectangular window having more rows), in order to
keep the time for I/O extremely low. Nevertheless, it allows an accuracy of 98.5%.
BDB, thanks to the knowledge of the rail geometry, analyses only those windows candidate
to contain the fastening elements. It classifies them in terms of presence/absence of the
bolts. This classification is performed by combining in a logical AND two classifiers based
on different preprocessing. This cross-validated response practically eliminates false
positives, and reveals the presence/absence of the fastening bolts with an accuracy of 99.6%
in detecting visible bolts and of 95% in detecting missing bolts. The cases of two different
kinds of fastening elements (hook bolts and hexagonal bolts) have been implemented.
DAB focuses its analysis on a particular class of surface defects of the rail: the so-called rail
corrugation, which causes an undulated shape on the head of the rail. Detecting (and
replacing) corrugated rails is a main topic in railway maintenance, since on high-speed
trains they induce harmful vibrations on the wheel and on its components, reducing their
lifetime. DAB mainly performs a texture analysis. In particular, it derives as significant
attributes (features) the mean and variance of four different Gabor filter responses, and
classifies them using a Support Vector Machine (SVM), getting 100% reliability in detecting
corrugated rails, as measured on a very large validation set. The choice of Gabor filters
derives from a comparative study of several approaches to texture feature extraction (Gabor
Filters, Wavelet Transforms and Gabor Wavelet Transforms).
Details on the artificial vision techniques underlying the employed algorithms, on the
parallel architectures implementing RD&TB and BDB, as well as on the experiments and
tests performed in order to define and tune the design of ViSyR, are presented in this
chapter. Several Appendixes are finally enclosed, which briefly recall theoretical issues
referred to during the chapter.
2. System Overview
ViSyR acquires images of the rail by means of a DALSA PIRANHA 2 line scan camera
[Matrox] having 1024 pixels of resolution (maximum line rate of 67 kLine/s) and using the
Cameralink protocol [MachineVision]. Furthermore, it is provided with a PC-CAMLINK
frame grabber (Imaging Technology CORECO) [Coreco]. In order to reduce the effects of
variable natural lighting conditions, an appropriate illumination setup equipped with six
OSRAM 41850 FL light sources has been installed too. In this way the system is robust
against changes in the natural illumination. Moreover, in order to synchronize data
acquisition, the line scan camera is triggered by the wheel encoder. This trigger sets the
resolution along y (main motion direction) at 3 mm, independently from the train velocity;
the pixel resolution along the orthogonal direction x is 1 mm. The acquisition system is
installed under a diagnostic train during its maintenance route. A top-level logical scheme
of ViSyR is represented in Figure 1, while Figure 2 reports the hardware and a screenshot of
ViSyR's monitor.
A long video sequence captured by the acquisition system is fed into Prediction Algorithm
Block (PAB), which receives a feedback from BDB, as well as the coordinates of the railways
geometry by RD&TB. PAB exploits this knowledge for extracting 24x100 pixel windows
where the presence of a bolt is expected (some examples are shown in Figure 3).
These windows are provided to the 2-D DWT Preprocessing Block (DWTPB). DWTPB
reduces these windows into two sets of 150 coefficients (i.e., D_LL2 and H_LL2), resulting
respectively from a Daubechies DWT (DDWT) and a Haar DWT (HDWT). D_LL2 and H_LL2
are then provided respectively to the Daubechies Classifier (DC) and to the Haar Classifier
(HC). The outputs from DC and HC are combined in a logical AND in order to produce the
output of the MLPN Classification Block (MLPNCB). MLPNCB reveals the presence/absence
of bolts and produces a Pass/Alarm signal that is displayed online (see the squares in Figure
2.b) and, in case of alarm (i.e. absence of the bolts), recorded with the position into a log file.
Figure 1. Top-level logical scheme of ViSyR. The acquisition system provides a long video
sequence to the Prediction Algorithm Block (PAB) and the Sampling Block (SB). The Bolts
Detection Block (BDB, on a Xilinx Virtex II Pro XC2VP20) processes 24x100 pixel windows
candidate to contain bolts: a Daubechies DWT (DDWT) and a Haar DWT (HDWT) each
produce 150 coefficients (the LL2 subbands D_LL2 and H_LL2), which are classified by the
Daubechies Classifier (DC) and the Haar Classifier (HC). An 800-pixel row image feeds the
Principal Component Analysis Block (PCAB), whose 12-coefficient feature vector is
classified by the MLPN Classification Block (MLPNCB), yielding the rail coordinates (xc)
and a Pass/Alarm signal. Four filter responses feed the Features Extraction Block (FEB),
whose 8-coefficient feature vector is classified by the SVM Block (SVMB) to produce the
corrugation state report.
RD&TB and BDB have been implemented onto an Altera Stratix EP1S60 and a Xilinx Virtex
II Pro XC2VP20 FPGA, respectively. SB, PAB and DAB are software tools developed in MS Visual
C++ 6.0 on a Work Station equipped with an AMD Opteron 250 CPU at 2.4 GHz and 4 GB
RAM.
Figure 2. ViSyR: (a) hardware and (b) screenshot
Figure 3. Examples of 24x100 windows extracted from the video sequence containing
hexagonal headed bolts. Resolutions along x and y are different because of the acquisition
setup
the rail may appear in different forms (UIC 50, UIC 60 and so on);
the defects of the rail surface might modify the rail geometry;
in the presence of switches, the system should correctly follow the principal rail.
In order to satisfy all of the above requirements, we have derived and tested different
approaches, based respectively on correlation, on a gradient-based neural network, on
Principal Component Analysis (PCA, see Appendix A) with a threshold, and on PCA with a
neural network classifier.
Briefly, these methods extract a window ("patch") from the video sequence and decide
whether or not it is centred on the rail head. If the "patch" appears as "centred on the rail
head", its median coordinate x is assigned to the coordinate of the centre of the rail xc;
otherwise, the processing is iterated on a new patch, obtained by shifting the former "patch"
along x.
Even though it has a high computational cost, PCA with a neural network classifier
outperformed the other methods in terms of reliability. It is worth noting that ViSyR's
design, based on an FPGA implementation, makes the computational cost required by this
approach affordable. Moreover, we have found that PCA with a neural network classifier is
the only method able to correctly perform its decision using as "patches" windows
constituted by a single row of pixels. This circumstance is remarkable, since it makes the
method far less dependent than the others on the I/O bandwidth. Consequently, we have
embedded into ViSyR a rail tracking algorithm based on PCA with an MLPN classifier. This
algorithm consists of two steps:
a data reduction phase based on PCA, in which the intensities are mapped into a
reduced suitable space (Component Space);
a neural network-based supervised classification phase, for detecting the rail in the
Component Space.
3.1 Data Reduction Phase.
Due to the setup of ViSyR's acquisition, the linescan TV camera digitises lines of 1024 pixels.
In order to detect the centre of the rail head, we discarded the border pixels, considering
rows of only 800 pixels. In the set-up employed during our experiments, rails having widths
up to 400 pixels have been encompassed.
Matrices A and C were derived according to equations (A.1) and (A.4) in Appendix A, using
450 example vectors. We have selected L=12 for our purposes, after having verified that a
component space of 12 eigenvectors and eigenvalues was sufficient to represent 91% of the
information content of the input data.
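A minimal sketch of this data reduction step is given below, assuming that the mean vector
and the L = 12 eigenvectors have already been computed off-line; names are illustrative.

#include <vector>

// Projection of a row vector r onto the L retained eigenvectors (see Eq. A.7):
// a[l] = (r - mu) . u_l. mu and u are assumed precomputed.
std::vector<double> pca_project(const std::vector<double>& r,
                                const std::vector<double>& mu,
                                const std::vector<std::vector<double>>& u) {
    std::vector<double> a(u.size(), 0.0);        // component-space vector
    for (size_t l = 0; l < u.size(); ++l)
        for (size_t i = 0; i < r.size(); ++i)
            a[l] += (r[i] - mu[i]) * u[l][i];    // inner product (r - mu) . u_l
    return a;
}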
3.2 Classification Phase
The rail detection stage consists of classifying the vector a (determined as shown in (A.8))
in order to discriminate whether or not it derives from a vector r centred on the rail head.
We have implemented this classification step using a Multi Layer Perceptron Neural
(MLPN) Network Classifier, since:
contrarily to the id-tree, neural networks have a topology very suitable for hardware
implementation.
Among neural classifiers, we have chosen the MLP, after having verified experimentally
that it is more precise than its RBF counterpart in the considered application, and we have
adopted a 12:8:1 MLPN constituted by three layers of neurons (input, hidden and output
layer), respectively with 12 neurons n1,m (m=0..11) corresponding to the coefficients of a
derived from r according to (A.7); 8 neurons n2,k (k=0..7):
\[
n_{2,k} = f\!\left(bias_{1,k} + \sum_{m=0}^{11} w_{1,m,k}\, n_{1,m}\right)
\tag{1}
\]
and a unique neuron n3,0 at the output layer (indicating a measure of confidence on the fact
that the analyzed vector r is centered or not on the rail head):
\[
n_{3,0} = f\!\left(bias_{2,0} + \sum_{k=0}^{7} w_{2,k,0}\, n_{2,k}\right)
\tag{2}
\]
In (1) and (2), the adopted activation function f(x), having range ]0, 1[, has been:
\[
f(x) = \frac{1}{1 + e^{-x}}
\tag{3}
\]
while the weights w1,m,k and w2,k,0 have been determined using the Error Back Propagation
algorithm with an adaptive learning rate [Bishop (1995)] and a training set of more than 800
samples (see Paragraph 7.3).
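The following sketch shows the corresponding 12:8:1 forward pass, assuming weights and
biases already trained with Error Back Propagation; it is an illustration of Eqs. (1)-(3), not
the authors' implementation.

#include <cmath>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }   // Eq. (3)

// a[12]: PCA coefficients; w1[12][8], bias1[8]: first layer; w2[8], bias2: output.
double mlp_12_8_1(const double a[12],
                  const double w1[12][8], const double bias1[8],
                  const double w2[8], double bias2) {
    double n2[8];
    for (int k = 0; k < 8; ++k) {                 // hidden layer, Eq. (1)
        double x = bias1[k];
        for (int m = 0; m < 12; ++m) x += w1[m][k] * a[m];
        n2[k] = sigmoid(x);
    }
    double x = bias2;                             // output neuron, Eq. (2)
    for (int k = 0; k < 8; ++k) x += w2[k] * n2[k];
    return sigmoid(x);                            // confidence in ]0, 1[
}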
3.3 Rail Detection and Tracking Algorithm
The Rail Detection and Tracking Algorithm consists of determining which extracted vector
r is centred on the rail.
Instead of setting the classifier with a high threshold at the last level and halting the search
as soon as a vector is classified as centred on the rail ("rail vector"), we have verified that
better precision can be reached using a different approach.
We have chosen a relatively low threshold (T=0.7). This threshold classifies as "rail vector" a
relatively wide set of vectors r, even when these are not effectively centred on the rail
(though they contain it). In this approach, we therefore halt the process not as soon as the
first "rail vector" has been detected, but when, after having detected a certain number of
contiguous "rail vectors", the classification detects a "no rail". At this point we select as the
true "rail vector" the median of this contiguous set. In other words, we accept as "rail
vectors" a relatively wide interval of contiguous vectors, and then select as xC the median of
such interval.
In order to speed-up the search process, we analyse each row of the image, starting from a
vector r centered on the last detected coordinate of the rail centre xC. This analysis is
performed moving on left and on right with respect to this origin and classifying the
119
vectors, until the begin (xB) and the end (xE) of the "rail vectors" interval are detected. The
algorithm is proposed in Figure 4.
xC = 512;   // presetting of the coordinate of the centre of the rail
do Start image sequence to End image sequence;
    set r (400-pixel row) centred on xC;
    do:
        determine a (12 coefficients) from r
        input a to the classifier and classify r
        set the new r by shifting the previous r 1 pixel left
    while (r is classified as rail)
    // exit from the do-while means the begin of the "rail vectors" interval has been reached
    xB = median coordinate of r;
    set r (400-pixel row) centred on xC;
    do:
        determine a (12 coefficients) from r
        input a to the classifier and classify r
        set the new r by shifting the previous r 1 pixel right
    while (r is classified as rail)
    // exit from the do-while means the end of the "rail vectors" interval has been reached
    xE = median coordinate of r;
    output xC = (xB + xE)/2;
end do
Figure 4. Algorithm for searching the rail center coordinates
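A compact C-style rendering of the search of Figure 4 is sketched below; classify_centered()
is an assumed helper that builds the vector r centred on a given column, projects it with
PCA and classifies it with the MLPN classifier (threshold 0.7).

// Sketch of the rail-center search of Figure 4 for one image row.
extern bool classify_centered(const unsigned char* row, int x);   // assumed helper

int find_rail_center(const unsigned char* row, int width, int xC_prev) {
    int xB = xC_prev, xE = xC_prev;
    while (xB > 0 && classify_centered(row, xB - 1)) --xB;          // slide left
    while (xE < width - 1 && classify_centered(row, xE + 1)) ++xE;  // slide right
    return (xB + xE) / 2;            // median of the "rail vector" interval
}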
4. Bolts Detection
Usually two kinds of fastening elements are used to secure the rail to the sleepers:
hexagonal-headed bolts and hook bolts. They essentially differ by shape: the first has a
regular hexagonal shape with random orientation, the second has a more complex hook
shape that can be found oriented only in one direction.
In this paragraph the case of hexagonal-headed bolts is discussed.
It is worth noting that they present more difficulties than fastening elements of more
complex shape (e.g., hook bolts), because of the similarity of the hexagonal bolts to the
shape of the stones in the background. The detection of hook bolts is instead discussed in
Paragraph 7.6.
Even if some works have been carried out which deal with railway problems -such as track
profile measurement (e.g., [Alippi et al. (2000)]), obstruction detection (e.g., [Sato et al.
(1998)]), braking control (e.g., [Xishi et al. (1992)]), rail defect recognition (e.g., [Cybernetix
Group], [Benntec Systemtechnik Gmbh]), ballast reconstruction (e.g., [Cybernetix Group]),
switch status detection (e.g., [Rubaai (2003)]), control and activation of signals near stations
(e.g., [Yinghua (1994)]), etc.- to the best of our knowledge there are no references in the
literature on the specific problem of fastening element recognition. The only approaches
found are commercial vision systems [Cybernetix Group], which consider only fastening
elements having a regular geometrical shape (like hexagonal bolts) and use geometrical
approaches to pattern recognition to solve the problem. Moreover, these systems are
strongly interactive: in order to reach the best performance, they require a human operator
for tuning every threshold. When a different fastening element is considered, the tuning
phase has to be re-executed.
Contrariwise, ViSyR is completely automatic and needs no tuning phase. The human
operator has only the task of selecting images of the fastening elements to manage. No
assumption about the shape of the fastening elements is required, since the method is
suitable for both geometric and generic shapes.
ViSyR's bolt detection is based on MLPNCs and consists of:
a prediction phase for identifying the image areas (windows) candidate to contain the
patterns to be detected;
an Exhaustive search;
a Jump search.
Figure 5. Geometry of a rail: the left and right bolts lie at a horizontal distance Dx from the
rail head, and consecutive sleepers are separated by Dy along the rail. A correct expectation
for Dx and Dy notably reduces the computational load
In the first kind of search, a window exhaustively slides over the areas at a (well-known)
distance Dx from the rail-head coordinate (as detected by RD&TB) until it finds
simultaneously (at the same y) the first occurrence of the left and of the right bolts. At this
point, it determines and stores this position (A) and continues in this way until it finds the
second occurrence of both bolts (position B). It then calculates the distance along y between
B and A (Dy), and the process switches to the Jump search. In fact, the distance along y
between two adjacent sleepers is constant and known. Therefore, the Jump search uses Dy
to jump only to those areas candidate to enclose the windows containing the
hexagonal-headed bolts, saving computational time and speeding up the performance of the
whole system. If, during the Jump search, ViSyR does not find the bolts in the position
where it expects them, it stores the position of the fault (this is a cause of alarm) in a log file
and restarts the Exhaustive search. A pseudo-code describing how the Exhaustive search
and the Jump search commutate is shown in Figure 6.
do Start image sequence to End image sequence;
    repeat
        Exhaustive search;
        if the first left and right bolts are found, store this position (A);
    until the second left and right bolts are found;
    store this position (B);
    determine the distance along y (Dy) between B and A;
    repeat
        Jump search
    until the bolts are not detected where they were expected;
end do
Figure 6. Pseudo code for the Exhaustive search - Jump search commutation
4.2 Data Reduction Phase
For reducing the input space size, ViSyR uses a feature extraction algorithm that is able to
preserve all the important information about the input patterns in a small set of coefficients.
This algorithm is based on 2-D DWTs [Daubechies (1988), Mallat (1989), Daubechies (1990
a), Antonini et al. (1992)], since the DWT concentrates the significant variations of the input
patterns in a reduced number of coefficients. Specifically, both a compact wavelet
introduced by Daubechies [Daubechies (1988)] and the Haar DWT (also known as the Haar
Transform [Strang & Nguyen (1996)]) are used simultaneously, since we have verified that,
for our specific application, the logical AND of these two approaches almost completely
avoids false positive detections (see Paragraph 7.5).
In pattern recognition, input images are generally pre-processed in order to extract their
intrinsic features. We have found [Stella et al. (2002), Mazzeo et al. (2004)] that orthonormal
bases of compactly supported wavelets introduced by Daubechies [Daubechies (1988)] are
an excellent tool for characterizing hexagonal-headed bolts by means of a small number of
features1 containing the most discriminating information, gaining in computational time. As
an example, Figure 7 shows how two decomposition levels are applied on an image of a
bolt.
Figure 7. Application of two levels of 2-D DWT on a subimage containing a hexagonal-headed
bolt (the LL2, LH2, HL2, HH2, HL1, LH1 and HH1 subbands are shown)
1 These are the coefficients of the LL subband of a given decomposition level l; l depending on the image
resolution and equal to 2 in the case of VISyR's set-up.
Due to the setup of ViSyR's acquisition, PAB provides DWTPB with windows of 24x100
pixels to be examined (Figure 3). Different DWTs have been experimented with, varying the
number of decomposition levels, in order to reduce this number without losing accuracy.
The best compromise has been reached with the LL2 subband, consisting of only 6x25
coefficients. Therefore, BDB has been devoted to computing the LL2 subbands both of a
Haar DWT [Strang & Nguyen (1996)] and of a Daubechies DWT, since we have found that
the cross validation of two classifiers (processing respectively D_LL2 and H_LL2, i.e., the
outputs of DDWT and HDWT, see Figure 1) practically avoids false positive detections (see
Paragraph 7.5). BDB, using the classification strategy described in the following Paragraph,
gets an accuracy of 99.9% in recognizing bolts in the primitive windows.
4.3 Classification Phase
ViSyR's BDB employs two MLPNCs (DC and HC in Figure 1), trained respectively on the
DDWT and HDWT outputs. DC and HC have an identical three-layer topology 150:10:1
(they differ only in the values of the weights). In the following, DC is described; the
functionality of HC can be straightforwardly derived.
The input layer is composed of 150 neurons D_n'_m (m=0..149), corresponding to the
coefficients D_LL2(i, j) of the subband D_LL2 according to:
\[
D\_n'_{m} = D\_LL2(i, j), \qquad m = 25\,i + j \quad (i = 0..5,\ j = 0..24)
\tag{4}
\]
The hidden layer of DC consists of 10 neurons D_n''_k (k=0..9); they derive from the
propagation of the first layer according to:
\[
D\_n''_{k} = f\!\left(\sum_{m=0}^{149} D\_w'_{m,k}\, D\_n'_{m}\right)
\tag{5}
\]
whilst the unique neuron D_n'''_0 at the output layer is given by:
\[
D\_n'''_{0} = f\!\left(\sum_{k=0}^{9} D\_w''_{k,0}\, D\_n''_{k}\right)
\tag{6}
\]
where D_w'_{m,k} and D_w''_{k,0} are the weights respectively between the first/second and
second/third layers. The activation function f(x) is the same as in (3).
In this scenario, D_n'''_0 ranges from 0 to 1 and indicates a measure of confidence on the
presence of the object to detect in the current image window, according to DC.
The outputs from DC and HC (D_n'''_0 and H_n'''_0) are combined in a logical AND:
\[
\text{BDB output} = \bigl(D\_n'''_{0} > T\bigr) \wedge \bigl(H\_n'''_{0} > T\bigr)
\tag{7}
\]
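A minimal sketch of this cross-validated decision is given below; the single threshold T is an
assumption, since the chapter only specifies that the two responses are combined in a logical
AND.

// Cross-validated bolt decision: both classifiers must agree (see Eq. 7).
bool bolt_present(double dc_out, double hc_out, double T = 0.5) {
    return (dc_out > T) && (hc_out > T);
}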
Figure 8. (a) Examples of rail head; (b) examples of rail head affected by corrugation
A wide variety of texture analysis methods based on local spatial patterns of intensity have
been proposed in the literature [Bovik et al. (1990), Daubechies (1990 b)]. Most signal
processing approaches submit the textured image to a filter bank model followed by some
energy measures. In this context, we have tested three filtering approaches to texture feature
extraction that have already provided excellent results in the artificial vision community
[Gong et al. (2001), Jain et al. (2000)] (Gabor Filters, Wavelet Transform and Gabor Wavelet
Transform), and classified the extracted features by means both of a k-nearest neighbour
classifier and of an SVM, in order to detect the best combination "feature
extractor"/"classifier".
DAB is currently a "work in progress". Further steps could deal with the analysis of other
defects (e.g., cracking, welding, shelling, blobs, spots, etc.). The study of these defects is
already in progress, mainly exploiting the fact that some of them (such as cracking, welding
and shelling) present a privileged orientation. The final step will be the hardware
implementation of DAB onto FPGA as well.
5.1 Feature Extraction
For our experiments we have used a training set of 400 rail images of 400x128 pixels
centered on the rail-head, containing both corrugated and "good" rails, and explored three
different approaches, which are briefly recalled in Appendixes B, C and D.
Gabor Filters. In our applicative context, we have considered only circularly symmetric
Gaussians (i.e., σx = σy = σ), adopting a scheme similar to the texture segmentation approach
suggested in [Jain & Farrokhnia (1990)], which approximates the characteristics of certain
cells in the visual cortex of some mammals [Porat & Zeevi (1988)].
We have applied to each example of the training set a Gabor filter bank with orientations 0,
π/4, π/2 and 3π/4 (see Figure 9), σ = 2 and radial discrete frequency F = 2 2 3. We have
discarded other frequencies since they were found too low or too high for discriminating
the textures of our applicative context.
Figure 9. Gabor Filters at different orientations: (a) 0; (b) π/4; (c) π/2; (d) 3π/4
The resulting images i_θ(x, y) (see Figure 10) represent the convolution of the input image
i(x, y) with the Gabor filters h_θ(x, y), where the subscript θ indicates the orientation:
\[
i_\theta(x, y) = i(x, y) * h_\theta(x, y)
\tag{8}
\]
Figure 10. Examples of Gabor Filters (F = 2 2 3, σ = 2) applied to a corrugated image
Wavelet Transform. We have applied a Daubechies 1 (i.e., Haar) Discrete Wavelet
Transform to our data set, and we have verified that, for the employed resolution, more
than three decomposition levels would not have provided additional discrimination.
Figure 11 shows how three decomposition levels are applied to an image of a corrugated
rail.
(Figure: jet of Gabor Wavelet responses i_{l,n}(x, y))
The mean (Eq. 9) and variance (Eq. 10)
\[
\mu_p = \frac{1}{MN}\iint i_p(x, y)\,dx\,dy
\tag{9}
\]
\[
\sigma_p^2 = \frac{1}{MN}\iint \bigl(i_p(x, y) - \mu_p\bigr)^2\,dx\,dy
\tag{10}
\]
of each pre-processed image i_p(x, y) have therefore been used to build the feature vectors to
be fed as input to the classification process.
5.2 Classification
We have classified the extracted features using two different classifiers, as described in
Paragraph 7.8. Considering the results obtained both with the k-Nearest Neighbour and
with the Support Vector Machine (see Appendix E), Gabor filters perform better than the
other feature extractors. In this context, we have discarded Neural Networks in order to
better control the internal dynamics.
Moreover, the Gabor filter bank has been found preferable even considering the number of
feature images extracted to form the feature vector for each filtering approach. In fact, the
problem in using Wavelet and Gabor Wavelet texture analysis is that the number of feature
images tends to become large: feature vectors with dimensions 8, 18 and 24 have been used
for the Gabor, Wavelet and Gabor Wavelet filters, respectively. In addition, its simplicity, its
optimum joint spatial/spatial-frequency localization and its ability to model the frequency
and orientation sensitivity typical of the HVS have made the Gabor filter bank an excellent
choice for our aim of detecting the presence/absence of a particular class of surface defects
such as corrugation.
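The following sketch illustrates how the 8-element feature vector used by DAB can be
assembled from the four Gabor responses (mean and variance per orientation); it assumes
the filtered images (e.g., response magnitudes) are already available, and the container types
are illustrative.

#include <vector>

// Build the 8-element feature vector: mean and variance of each of the
// four Gabor filter response images (one per orientation).
std::vector<double> gabor_features(const std::vector<std::vector<double>>& responses) {
    std::vector<double> feat;
    for (const auto& img : responses) {            // one image per orientation
        if (img.empty()) { feat.push_back(0.0); feat.push_back(0.0); continue; }
        double mean = 0.0;
        for (double v : img) mean += v;
        mean /= img.size();
        double var = 0.0;
        for (double v : img) var += (v - mean) * (v - mean);
        var /= img.size();
        feat.push_back(mean);
        feat.push_back(var);
    }
    return feat;                                   // 8 features for 4 orientations
}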
Top-level schematics of RD&TB and BDB are provided in Figure 13.a and 13.b respectively,
while Figure 14 shows the FPGA floorplans.
Figure 13. Top-level schematics of (a) RD&TB and (b) BDB, as they can be displayed on
Altera's Quartus II CAD tool
Therefore, even if FPGAs were initially created for developing small glue logic, they
currently often represent the core of various systems in different fields.
Figure 14. Floorplans of (a) the Altera Stratix EP1S60 and (b) the Xilinx Virtex II Pro 20 after
being configured
6.1 RD&TB: Modules Functionalities
The architecture can be seen as a memory: the task starts when the host writes an 800-pixel
row to be analyzed. In this phase, the host addresses two shift registers inside the
DOUBLE_WAY_SLIDING_MEMORY (pins address[12..0]) and sends the 800 bytes via the
input line DataIn[31..0] in the form of 200 words of 32 bits.
As soon as the machine has completed its job, the output line irq signals that the results are
ready. At this point, the host reads them by addressing the FIFO memories inside the
OUTPUT_INTERFACE.
A more detailed description of the modules is provided in the following.
Input Interface
The PCI Interface (not explicitly shown in Figure 13.a) sends the input data to the
INPUT_INTERFACE block through DataIn[63..0]. INPUT_INTERFACE separates the input
phase from the processing phase, mainly in order to make the processing phase
synchronous and independent of delays that might occur during the PCI input. Moreover,
it allows working at a higher frequency (clkHW signal) than the I/O (clkPCI signal).
Double Way Sliding Memory
As soon as the 800-pixel row is received by INPUT_INTERFACE, it is forwarded to the
DOUBLE_WAY_SLIDING_MEMORY.
(Figure 15. Architecture of the neural classifier: the MAC1,k multiplier-accumulators
(k=0..7) with the LUT1,k storing the weights w1,m,k, the AF_LUT implementing the
activation function, and MAC2,0 with LUT2,0 storing the weights w2,k,0, followed by a
comparison against the threshold T.)
The activation step
\[
n = f(x)
\tag{12}
\]
is realized by means of a Look Up Table (as far as neurons n2,k are concerned) and
comparators (as far as neuron n3,0 is concerned). More in detail:
neurons n2,k, step (a): PCAC has been provided with 8 Multiplier-and-ACcumulators
(MACs), i.e., MAC1,k (k=0..7), each one initialized with biask. As soon as a coefficient al
(l=1..12) is produced by PREPROCESSING_PCA, the multipliers MAC1, k multiply it in
parallel by w1,m,k (m=l+1, k=0..7). These weights have been preloaded in 8 LUTs during
the setup, LUT1, k being related to MAC1, k and storing 12 weights. The accumulation
takes 12 ccs, one cc for each coefficient al coming from PREPROCESSING_PCA; at the
end of the computation, any MAC1, k will contain the value xk.
neurons n2,k, step (b): The values xk are provided as addresses to AF_LUT through a
parallel input/serial output shift register. AF_LUT is a Look up Table which maps at
any address x the value of the Activation Function f(x). The adopted precision and
sampling rate are discussed in Paragraph 7.4.
neuron n3,0, step (a): This step is similar to that of the previous layer, but it is performed
using a unique MAC2, 0 which multiplies n2,k (k=0..7) by the corresponding w2,k,0 at the
rate of 1 data/cc.
neuron n3,0, step (b): Since we are interested not in the actual value of n3,0, but in whether
it is greater than a given threshold T=0.7 (the result of this comparison constitutes the
response of the classification process), we implement step (b) simply by comparing the
value accumulated by MAC2,0 with f^{-1}(T).
Output Interface
Because of its latency, PCAC classifies each pattern 5 ccs after the last coefficient is provided
by PREPROCESSING_PCA. At this point, the single-bit output from the comparator is sent
to OUTPUT_INTERFACE via PCACOut.
This bit is used as a stop signal for two counters. Specifically, as soon as a value "1" is
received on PCACOut, a first counter CB is halted and its value is used to determine which
shift position of the DOUBLE_WAY_SLIDING_MEMORY is the one centered at the begin of
the "rail vector" interval. Afterwards, as soon as a value "0" is received from PCACOut, a
second counter CE is halted, signaling the end of the "rail vector" interval. At this point, Irq
signals that the results are ready, and the values of CB and CE, packed in a 64-bit word, are
sent on DataOut[63..0]. Finally, the host can request and receive these results (signal read).
6.2 BDB: Modules Functionalities
Similarly to RD&TB, BDB too can be interpreted as a memory which starts its job when the
host writes a 24x100 pixel window to be analysed. In this phase, the host addresses the dual
port memories inside the INPUT_INTERFACE2 (pins address[9..0]) and sends the 2400
bytes via the input line data[63..0] in the form of 300 words of 64 bits. As soon as the
machine has completed its job, the output line irq signals that the results are ready. At this
point, the host reads them by addressing the FIFO memories inside the OUTPUT_INTERFACE.
2 In addition, this INPUT_INTERFACE pursues the same goal of decoupling the input phase from the
processing phase, as previously said in the case of RD&TB.
Given the adopted 1-D Daubechies low-pass filter
\[
[\,0.03523,\ -0.08544,\ -0.13501,\ 0.45988,\ 0.80689,\ 0.33267\,]
\tag{13}
\]
the LL2 subband can be computed in only one bi-dimensional filtering step (instead of the
classical twice-iterated two monodimensional steps shown in Figure 23 in Appendix C),
followed by a decimation by 4 along both rows and columns. Figure 16 reports the applied
symmetrical 16x16 kernel.
Figure 16. Symmetrical 16x16 kernel for directly computing in one 2-D step the LL2 subband
of the DWT based on the 1-D low-pass filter of Eq. (13). The filtering has to be followed by
decimation by 4 along both rows and columns
We decided to compute LL2 directly in only one 2-D step (a software counterpart is
sketched below), because:
this requires a much simpler controller than the one used by the separable approach
(Figure 23, in Appendix C);
the separable approach is very efficient in computing all four subbands of each level,
but ViSyR's classification process does not need any subband other than LL2;
when fixed-point precision is employed, each step of the separable approach produces
results with a different dynamic range, so the hardware used at a certain step becomes
unusable for implementing the further steps;
the error (due to the fixed-point precision) generated in a unique step does not
propagate and can be easily controlled. Conversely, propagation occurs along four
different steps when LL2 is computed by means of the separable approach.
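The sketch below is a plain software counterpart of this one-step computation; the 16x16
kernel of Figure 16 is taken as an input parameter (its values, derived from Eq. 13, are not
reproduced here), and zero-padding at the window border is an assumption.

// One-step 2-D computation of the 6x25 LL2 subband of a 24x100 window:
// convolution with the 16x16 kernel K of Figure 16, decimated by 4 along
// both rows and columns.
void daub_ll2(const unsigned char win[24][100], const double K[16][16],
              double LL2[6][25]) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 25; ++j) {
            double acc = 0.0;
            for (int r = 0; r < 16; ++r)
                for (int c = 0; c < 16; ++c) {
                    int y = 4 * i + r, x = 4 * j + c;   // decimation by 4
                    if (y < 24 && x < 100)              // border handling is an assumption
                        acc += K[r][c] * win[y][x];
                }
            LL2[i][j] = acc;
        }
}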
In this scenario, SHIFTREGISTERS implements a 16x16 array which slides on the 24x100
input window shifting by 4 along columns at any clock cycle (cc). This shift along columns is
realized by a routing among the cells as that one shown in Figure 17, that represents the jth
row (j=0..15) of SHIFTREGISTERS.
Figure 17. The jth row of the array of 16x16 shift registers in the SHIFTREGISTERS block.
Each square represents an 8-bit register
The shift by 4 along the rows is performed by INPUT_INTERFACE which feeds into the jth
row of the array only the pixels p(m, n) of the 24x100 input window (m=0..23, n=0..99)
where:
j mod 4=m mod 4
(14)
At any cc, sixteen contiguous rows of the input window are fed in parallel into
SHIFTREGISTERS at the rate of 64 bytes/cc (4 bytes of each row for 16 rows) through
IN[511..0]. Simultaneously, all the 256 bytes latched in the 16x16 array are input in parallel
into DAUB_LL2_FILTER through OutToDaubLL256bytes[2047..0].
DAUB_LL2_FILTER exploits the symmetry of the kernel (see Figure 16), adding the pixels
coming from the cells (j, l) to those ones coming from the cells (l, j) (j=0..15, l=0..15);
afterwards, it computes the products of these sums and of the diagonal elements of the
array by the related filter coefficients, and, finally, it accumulates these products.
As a result, DAUB_LL2_FILTER produces the LL2 coefficients after a latency of 11 ccs and at
the rate of 1 coefficient/cc. These are now expressed in 35 bits, because of the growth of the
dynamic range, and are input into MLPN_CLASSIFIER via InFromDaub[34..0].
We are not interested in a higher throughput, since -because of the FPGA hardware
resources- our neural classifier employs 10 multipliers and can manage 1 coefficient per cc.
Haar DWT Preprocessing
Computationally, Haar Transform is a very simple DWT since its 1-D filters are: L=[1/2,
1/2] and H=[1/2, -1/2]. Therefore, any coefficient H_LL2(i, j) can be computed in one step
according to:
\[
H\_LL2(i, j) = \frac{1}{16}\sum_{l=0}^{3}\sum_{k=0}^{3} p(4i + k,\ 4j + l)
\tag{15}
\]
In order to compute (15), we exploit the same SHIFTREGISTERS block used for performing
Daubechies DWT and a HAAR_LL2_FILTER block. HAAR_LL2_FILTER trivially adds[3] the
data coming from OutToHaar16bytes[255..0] which are the values of the pixels p(m, n) of the
4x4 window centered on the 16x16 sliding array implemented by SHIFTREGISTERS.
By this way, after a latency of 2 ccs, HAAR_LL2_FILTER produces 1 coefficient (expressed
by 12 bits) per cc and provides it to MLPN_CLASSIFIER via HaarLL2[11..0]. Higher
performance is unnecessary, since the data flow of this block is parallel to that of
DAUB_LL2_FILTER.
[3] The scaling by 16 is simply performed by a shift of the fixed point by 4 positions.
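For reference, a direct software computation of Eq. (15) might look as follows (a sketch, not
the hardware implementation):

// Each Haar LL2 coefficient is the average of a 4x4 block of the 24x100 window.
void haar_ll2(const unsigned char win[24][100], double H_LL2[6][25]) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 25; ++j) {
            int sum = 0;
            for (int k = 0; k < 4; ++k)
                for (int l = 0; l < 4; ++l)
                    sum += win[4 * i + k][4 * j + l];
            H_LL2[i][j] = sum / 16.0;   // in hardware, a fixed-point shift
        }
}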
Multi Layer Perceptron Neural Classifier
As we have seen in Paragraph 4, the MLPN_CLASSIFIER implements two classifiers (DC
and HC, see Figure 1). Their structure is similar to the one already described in Figure 15.
The logical AND of their outputs is sent to the OUTPUT_INTERFACE via DCOutXHCOut.
Output Interface
The result of the classification is extended into a 64-bit word by the OUTPUT_INTERFACE
and sent to the host on DataOut[63..0].
Figure 18. (a) Real and estimated coordinates of xC; (b) error. RS denotes rail switch
\[
f(x) = \begin{cases}
0 & \text{if } x < -5 \\
\text{AF\_LUT}[x'] & \text{if } -5 \le x \le 5 \\
1 & \text{if } x > 5
\end{cases}
\tag{16}
\]
AF_LUT was filled using words of 5 bits, which was found to be the best compromise in
terms of detection accuracy and hardware cost.
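A software sketch of the mechanism of Eq. (16) is given below; the table size and the
quantization are illustrative, whereas the hardware stores 5-bit words.

#include <cmath>
#include <vector>

// Tabulated sigmoid on [-5, 5] with saturation outside, as in Eq. (16).
struct SigmoidLUT {
    std::vector<double> table;
    explicit SigmoidLUT(int size = 256) : table(size) {
        for (int i = 0; i < size; ++i) {
            double x = -5.0 + 10.0 * i / (size - 1);
            table[i] = 1.0 / (1.0 + std::exp(-x));
        }
    }
    double operator()(double x) const {
        if (x < -5.0) return 0.0;                 // saturation to 0
        if (x >  5.0) return 1.0;                 // saturation to 1
        int idx = (int)((x + 5.0) / 10.0 * (table.size() - 1));
        return table[idx];
    }
};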
7.5 False Positive Elimination
In defining the preprocessing strategy, we observed that, though the classifier DC, based on
the Daubechies DWT, reached a very high detection rate (see Paragraph 7.9), it also
produced a certain number of False Positives (FPs) during the Exhaustive search.
In order to reduce these errors, a cross-validation strategy was introduced. Because of its
very low computational overhead, the Haar DWT was taken into account and tested. HC, a
neural classifier working on the LL2 subband of the Haar DWT, was designed and trained:
HC reached the same detection rate as DC, though producing many more FPs.
Nevertheless, the FPs resulting from HC originated from different features (windows) than
those causing the FPs output by DC. This phenomenon is evidenced in Figure 19, where a
spike denotes a detection (either a true or a false positive) at a certain line of the video
sequence revealed by DC (Figure 19.a) and by HC (Figure 19.b), while they analyzed 4,500
lines of video sequence in Exhaustive search (i.e., without jumps between couples of bolts).
Figure 19.c shows the logical AND between the detections (both True and False Positives) of
DC and HC; in other words, it shows the results of (7).
Figure 19. Detected couples of bolts vs video sequence, analyzed in Exhaustive search (i.e.,
without jump between couples of detected bolts). (a) Daubechies Classifier; (b) Haar
Classifier; (c) Crossed validation
                            Haar DWT     Daubechies DWT    AND (Daubechies, Haar)
True Positive (TP)          22 (100%)    22 (100%)         22 (100%)
False Positive (FP)         90           26                2
FP/TP                       409%         118%              9%
FP per 10,000 analyzed lines  200.0        57.8              4.4
Figure 20. Sample image patterns of the (a) right hook bolts and (b) left hook bolts
Two training sets, TS1 and TS2, were extracted. They contained 421 negative examples, and
respectively 172 positive examples of left hook bolts (TS1) and 172 of right hook bolts (TS2).
TS1 and TS2 were therefore used to train the MLPN Classifiers devoted to inspecting,
respectively, the left and the right side of the rail. Finally, the remaining video sequence was
used to test the ability of ViSyR in detecting hook bolts as well.
7.7 Hardware Design Definition
The report (log file) obtained from the above experiment was used as a term of comparison
for the reports of similar experiments aimed at defining the number of bits per word to be
used in the hardware design. The fully-software prototype of ViSyR was modified by
changing the floating-point operating mode into fixed-point mode. Different versions of
ViSyR were compiled with different precisions (i.e., numbers of bits). As far as RD&TB is
concerned, 12 bits for the eigenvector coefficients and 28 bits for the weights of the classifier
allowed an accuracy only 0.6% lower than that achievable using floating-point precision,
while 23 bits for the filter coefficients and 25 bits for the weights of both classifiers led to
detecting visible bolts with an accuracy only 0.3% lower than that obtained using
floating-point precision. These settings were considered acceptable, and the hardware
design was developed using these specifications.
7.8 Rail Corrugation Analysis and Classification Strategy
As said in Paragraph 5, the feature vectors have been determined considering, respectively,
the mean and variance of:
each Gabor filter output image i_θ(x, y), one per orientation (0, π/4, π/2, 3π/4), getting a
feature vector composed of 8 features;
each HL, LH and HH subband of each decomposition level, getting a feature vector
composed of 18 features;
each image of the jet (consisting of three decomposition levels -as in the wavelet
transform case- per four orientations -as in the Gabor Filter case-), getting a feature
vector composed of 24 features.
In order to test the performance of a k-Nearest Neighbour classifier, we have used a
leave-one-out (LOO) procedure. Table 2 shows the number of misclassifications for different
values of K, for a training set of Gabor filtered images (GF), Wavelet filtered images (WF)
and Gabor-Wavelet filtered images (GWF).
(Table 2: number of misclassifications of the k-NN classifier, for different values of K, on the
GF, WF and GWF feature sets; the SVM results were obtained with C = 150 and kernel
K(x, y) = [(x·y)/k].)
(Table: composition of the data sets used to evaluate RD&TB (rail and non-rail vectors),
BDB (visible, occluded and absent hexagonal bolts, left hook bolts and right hook bolts) and
DAB (corrugated and non-corrugated rails).)
Computing performance was also measured for the functionality of RD&TB and BDB (i.e.
the ViSyR modules already implemented in hardware). In particular, more than 15,000
couples of bolts have been detected in more than 3,000,000 lines at a velocity of 166 km/h.
This performance is given by the combination of the Jump search (performed at
approximately 444 km/h) and of the Exhaustive search (performed at approximately 4
km/h), and obviously depends on the distribution of the two kinds of search over the
inspected video sequence. For instance, Figure 21 shows how the two types of search
commutate during the process, for the tested video sequence.
Figure 21. The way in which the system commutates during (a) the Exhaustive search and
(b) the Jump search (elapsed time vs. number of triggers)
The maximum time spent in the Exhaustive search is less than 3; this means that, in the
worst cases, the Exhaustive search finds a couple of bolts (left and right) after less than 3.
At this point the control switches to the Jump search which, because of its philosophy, is
much faster. Once activated, the Jump search works uninterruptedly for up to 17 for the
analyzed sequence (Figure 21.b). Obviously, if the system remains in the Jump phase for a
long time, performance increases accordingly. Future work will be addressed in this
direction, for example by automatically skipping those areas where the fastening elements
are covered by asphalt (i.e., level crossings, where the Exhaustive search is executed
continuously).
8. Conclusive Remarks
This chapter has presented ViSyR, a visual system able to autonomously detect the bolts
that secure the rail to the sleepers and to monitor the rail condition.
Thanks to an FPGA-based hardware implementation, it performs its inspection at velocities
that can reach 460 km/h. In addition to this computing power, ViSyR is also characterized
by an impressive accuracy and is highly flexible and configurable, since the decision levels
of RD&TB, BDB and DAB are based on classifiers that can easily be reconfigured for the
different types of rails and bolts to be inspected and detected.
ViSyR constitutes a significant aid to the personnel involved in railway safety because of its
high reliability, robustness and accuracy. Moreover, its computing performance allows a
more frequent maintenance of the entire railway network.
A demonstrative video of ViSyR is available at:
http://ftp-dee.poliba.it:8000/Marino/ViSyR_Demo.MOD
9. References
Alippi C., Casagrande E., Scotti F., & Piuri V. (2000) Composite Real-Time Image Processing
for Railways Track Profile Measurement, IEEE Trans. Instrumentation and
Measurement, vol. 49, N. 3, pp. 559-564 (June 2000).
Antonini M., Barlaud M., Mathieu P. & Daubechies I. (1992). Image Coding Using Wavelet
Transform, IEEE Trans. Image Processing, Vol. 1, pp. 205-220. (1992).
Bahlmann C., Haasdonk B. & Burkhardt H. (2002). On-line Handwriting Recognition using
Support Vector Machines - A kernel approach, In Int. Workshop on Frontiers in
Handwriting Recognition (IWFHR) 2002, Niagara-on-the-Lake, Canada (August
2002).
Benntec Systemtechnik Gmbh, RAILCHECK: image processing for rail analysis, internal
documentation, http://www.benntec.com
Bishop C.M. (1995). Neural Networks for Pattern Recognition, Oxford University Press, New
York, pp. 164-191.
Bovik AC, Clark M, Geisler WS (1990), Multichannel texture analysis Using Localized
Spatial Filters. IEEE Trans On PAMI 12: 55-73
Coreco. http://www.coreco.com
Cybernetix Group (France), IVOIRE: a system for rail inspection, internal documentation,
http://www.cybernetix.fr
Daubechies I. (1988). Orthonormal bases of compactly supported wavelets, Comm. Pure &
Appl. Math., vol. 41, pp. 909-996. (1988).
Daubechies I. (1990 a). The Wavelet Transform, Time Frequency, Localization and Signal
Analysis, IEEE Trans. on Information Theory, vol. 36, n. 5, pp. 961-1005. (Sept. 1990).
Daubechies I (1990 b), Ten Lectures on Wavelets. Capital City Press, Montpellier, Vermont
Drucker H., Burges C., Kaufman L., Smola A. & Vapnik V. (1997). Support Vector Regression
Machines, in: M. Mozer, M. Jordan, and T. Petsche (eds.), Neural Information
Processing Systems, Vol. 9, MIT Press, Cambridge, MA.
Gong S. et al. (2001). Dynamic Vision: From Images to Face Recognition, Imperial College Press.
140
Jain A., Duin R., & Mao J. (2000). Statistical Pattern Recognition: A Review, IEEE Transactions
on Pattern Analysis and Machine Intelligenve, vol. 22, no.1, pp.4-37, 2000.
Jain AK, Farrokhnia F (1990). Unsupervised texture segmentation using Gabor filters. Pattern
Recognition, 24: 1167-1186
Lee T.S. (1996). Image Representation Using 2D Gabor Wavelets , IEEE Trans. on PAMI , Vol.
18 no. 10, 1996
Ma W. Y., Manjunath B.S. (1995) A comparison of wavelet transform features for texture
image annotation, Proc. Second International Conference on Image Processing (ICIP'95),
Washington, D.C., vol. 2, pp. 256-259. (Nov. 1995).
MachineVision. CAMERALINK: specification for camera link interface standard for digital
cameras and frame grabbers, www.machinevisiononline.org
Mallat S.G. (1989). A theory for multiresolution signal decomposition: the wavelet
representation, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, pp.
674-693 (1989).
Matrox. http://www.matrox.com/imaging/products/odyssey_xcl/home.cfm
Mazzeo P.L., Nitti M., Stella E. & Distante A. (2004). Visual recognition of fastening bolts for
railroad maintenance, Pattern Recognition Letters, vol. 25 n. 6, pp. 669-677 (2004).
Osuna E., Freund R. & Girosi F. (1997) Training Support Vector Machines: an Application to
Face Detection, Proceedings of CVPR'97, Puerto Rico. (1997).
Papageorgiou C. & Poggio T. (1999). A Pattern Classification Approach to Dynamical Object
Detection, Proceedings of ICCV, pp. 1223-1228 (1999).
Porat M, Zeevi YY (1988), The generalized Gabor scheme of image representation in
biological and machine vision, IEEE Trans Pattern Anal Machine Intell 10: 452-468
Rubaai A. (2003). A neural-net-based device for monitoring Amtrak railroad track system,
IEEE Transactions on Industry Applications, vol. 39, N. 2 , pp. 374-381 (March-April
2003).
Sato K., Arai H., Shimuzu T., & Takada M. (1998). Obstruction Detector Using Ultrasonic
Sensors for Upgrading the Safety of a Level Crossing, Proceedings of the IEE
International Conference on Developments in Mass Transit Systems, pp. 190-195 (April
1998).
Stella E., Mazzeo P.L., Nitti M., Cicirelli G., Distante A. & D'Orazio T. (2002). Visual
recognition of missing fastening elements for railroad maintenance, IEEE-ITSC
International Conference on Intelligent Transportation Systems, pp. 94-99, Singapore
(2002).
Strang G. & Nguyen T. (1996). Wavelets and Filter Banks, Wellesley-Cambridge Press.
Vapnik V.N. (1998). Statistical Learning Theory, New York: John Wiley & Sons Inc.
Wen J, Zhisheng Y, Hui L (1994), Segment the Metallograph Images Using Gabor Filter,
International Symposium on Speech Image Processing and Neural Networks pp 25-28,
Hong Kong
Xishi W., Bin N., & Yinhang C. (1992). A new microprocessor based approach to an
automatic control system for railway safety, Proceedings of the IEEE International
Symposium on Industrial Electronics, vol. 2, pp. 842-843 (May 1992).
Yinghua M., Yutang Z., Zhongcheng L., & Cheng Ye Y. (1994). A fail-safe microprocessor-based
system for interlocking on railways, Proceedings of the Annual Symposium on
Reliability and Maintainability, pp. 415-420 (Jan. 1994).
Collecting the zero-mean training vectors into the matrix
\[
A = [\,h_1, h_2, \ldots, h_Q\,]
\tag{A.1}
\]
with
\[
h_k = r_k - \mu
\tag{A.2}
\]
where
\[
\mu = [\,\mu_1, \ldots, \mu_P\,]^T
\tag{A.3}
\]
is the average of the training vectors, the matrix
\[
C = A^T A
\tag{A.4}
\]
can be built. The QxQ matrix C contains information about mutual relationships among rail
images rk.
In Principal Component Analysis [Gong et al. (2001), Jain et al. (2000)] the eigenvectors uj
(j=1..N) of C define a new reference space in which the variance among the data is
maximized. Moreover, an ordering relationship on the uj can be induced by sorting the
eigenvectors uj in such a way that:
\[
\lambda_q > \lambda_{q+1}
\tag{A.5}
\]
where the eigenvalues λj of C represent the variances along each of the uj. In other words,
(A.5) means that the set of projections of the input data on uq has a variance higher than that
of the set of projections of the input data on uq+1.
By thresholding the eigenvalues λj it is possible to select the L<Q corresponding
eigenvectors sufficient to represent the biggest part of the informative content of the input
data. Let ul (l=1..L, L<Q) be the selected components; a generic vector r can then be
expressed as:
\[
r \approx \sum_{l=1}^{L} a_l\, u_l + \mu
\tag{A.6}
\]
where μ is the average vector. From a computational point of view, the eigenvectors and
eigenvalues of C can be estimated by a Singular Value Decomposition (SVD) of the matrix
A, and the coefficients al are evaluated by the inner product:
\[
a_l = (r - \mu)\, u_l^T
\tag{A.7}
\]
The coefficients are collected in the vector
\[
a = [\,a_1, a_2, \ldots, a_L\,]^T
\tag{A.8}
\]
(B.1)
where
g(x, y) =
1
2x y
1 x2 y2
2+ 2
2 x y
(B.2)
(B.3)
x and y are the standard deviations of Gaussian envelope along the x and y directions, F
frequency of sinusoidal plane and and is the orientation [Wen at al. (1994)].
Thus (B.1) is a complex sinusoidal grating modulated by a 2D Gaussian function [25].
Gabor functions have been found useful because they reach the lower bounds of the uncertainty
inequalities Δx·Δu ≥ 1/(4π) and Δy·Δv ≥ 1/(4π) and achieve optimal joint resolution in space and
spatial frequency [Bovik et al. (1990)].
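As an illustration of a Gabor filter of the kind described above, the sketch below builds one complex kernel and filters an image with it. The kernel size, the use of scipy.signal.convolve2d and all parameter values are assumptions for the example only, not the authors' settings.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, sigma_x, sigma_y, F, theta):
    """Complex Gabor kernel: Gaussian envelope modulated by a sinusoidal grating."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2)) / (2 * np.pi * sigma_x * sigma_y)
    grating = np.exp(2j * np.pi * F * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * grating

def gabor_response(image, size, sigma_x, sigma_y, F, theta):
    """Magnitude of the Gabor-filtered image."""
    kernel = gabor_kernel(size, sigma_x, sigma_y, F, theta)
    real = convolve2d(image, kernel.real, mode='same')
    imag = convolve2d(image, kernel.imag, mode='same')
    return np.hypot(real, imag)

# illustrative call on an image patch:
# response = gabor_response(patch, size=21, sigma_x=4.0, sigma_y=4.0, F=0.1, theta=0.0)
```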
W(a, b) = (1/√a) ∫ x(t) ψ((t − b)/a) dt    (C.1)
It can be seen as the further decomposition of a 2-D data set LLj-1 (LL0 being the original
input image) into four subbands LLj, LHj, HLj and HHj. The capital letters and their position
are related respectively to the applied mono-dimensional filters (L for low-pass filter, H for
high-pass filter) and to the direction (first letter for horizontal, second letter for vertical).
The band LLj is a coarser approximation of LLj-1. The bands LHj and HLj record the changes
along the horizontal and vertical directions of LLj-1, respectively, whilst HHj contains the high
frequency components. Because of the decimation occurring at each level along both
directions, any subband at level j is composed of Nj×Mj elements, where Nj = N0/2^j and
Mj = M0/2^j.
(Figure: one level of the 2-D wavelet decomposition — LLj-1, of Mj-1×Nj-1 samples output from level j-1, is filtered along rows and columns by the 1-D low-pass (L) and high-pass (H) filters, each followed by decimation by 2, producing LLj, LHj, HLj and HHj.)
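One level of this subband decomposition can be reproduced, for illustration, with PyWavelets; the Haar wavelet used here is only a placeholder choice, not necessarily the filter bank used by the authors.

```python
import pywt  # PyWavelets

def decompose(image, levels):
    """Successive 2-D DWT: each level splits LL_{j-1} into LL_j, LH_j, HL_j and HH_j."""
    subbands = []
    ll = image
    for j in range(1, levels + 1):
        # rows and columns are filtered by the 1-D low/high-pass pair and decimated by 2
        ll, (lh, hl, hh) = pywt.dwt2(ll, 'haar')
        subbands.append({'level': j, 'LH': lh, 'HL': hl, 'HH': hh})
    return ll, subbands  # coarsest approximation plus the detail subbands
```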
ψn(x, y) = (sx sy / (2π)) · exp{ −(1/2) ( [sx((x − cx) cos θ + (y − cy) sin θ)]² + [sy(−(x − cx) sin θ + (y − cy) cos θ)]² ) } · exp{ 2πjF sx((x − cx) cos θ + (y − cy) sin θ) }    (D.1)

ψn(x, y) = (2^l sx sy / (2π)) · exp{ −(1/2) ( [2^l((x − k·2^l) cos θ + (y − k·2^l) sin θ)]² + [2^l(−(x − k·2^l) sin θ + (y − k·2^l) cos θ)]² ) } · exp{ 2πjF·2^l [(x − k·2^l) cos θ + (y − k·2^l) sin θ] }    (D.2)
i_{l,n}(x, y) = ψn(x, y) ∗ i(x, y)    (D.3)
w*·v + b* = 0    (E.1)
maximizing the margin and minimizing the number of misclassified patterns. In (E.1), the
optimal weight vector is expressed as a linear combination of the examples of the training set
S:
w* = Σ_{i=1..N} αi* yi vi    (E.2)
where the αi* are the coefficients resulting from the training. The decision function is then
f(v) = w*·v + b* = Σ_{i=1..N} αi* yi (vi · v) + b*    (E.3)
meaning that v can be classified by evaluating the dot product between v and some
elements (support vectors) of the training set S.
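A minimal sketch of evaluating the decision function (E.3) directly from the support vectors; the training step that produces the multipliers αi* and the bias b* is assumed to have been carried out elsewhere.

```python
import numpy as np

def svm_decision(v, support_vectors, alphas, labels, bias):
    """Evaluate f(v) = sum_i alpha_i* y_i (v_i . v) + b*, as in (E.3)."""
    dots = support_vectors @ v                 # dot products v_i . v
    return float(np.sum(alphas * labels * dots) + bias)

def classify(v, support_vectors, alphas, labels, bias):
    """A pattern is assigned to the positive class when f(v) > 0."""
    return 1 if svm_decision(v, support_vectors, alphas, labels, bias) > 0 else -1
```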
9
Bearing-Only Vision SLAM with
Distinguishable Image Features
Patric Jensfelt1, Danica Kragic1 and John Folkesson2
1Centre
1. Introduction
One of the key competences for autonomous mobile robots is the ability to build a map of
the environment using natural landmarks and to use it for localization (Thrun et al., 1998,
Castellanos et al, 1999, Dissanayake et al, 2001, Tardos et al. 2002, Thrun et al., 2004). Most
successful systems presented so far in the literature have relied on range sensors such as
laser scanners and sonar sensors. For large scale, complex environments with natural
landmarks, SLAM is still an open research problem. Recently, the use of
vision as the only exteroceptive sensor has become one of the most active areas of research
in SLAM (Davison, 2003, Folkesson et al., 2005, Goncavles et al., 2005, Sim et al., 2005,
Newman & Ho., 2005).
In this chapter, we present a SLAM system that builds maps with point landmarks using a
single camera. We deal with a set of open research issues such as how to identify and extract
stable and well-localized landmarks and how to match them robustly to perform accurate
reconstruction and loop closing. All of these issues are central to success, especially when an
estimator such as the Extended Kalman Filter (EKF) is used. Robust matching is required for
most recursive formulations of SLAM where decisions are final. Even for methods that
allow the data associations to change over time, e.g. (Folkesson & Christensen, 2004, Frese &
Schröder 2006), reliable matching is very important.
One of the big disadvantages with the laser scanner is that it is a very expensive sensor.
Cameras, on the other hand, are relatively cheap. Another aspect of using cameras for
SLAM is the much greater richness of the sensor information as compared to that from, for
example, a range sensor. Using a camera it is possible to recognize features based on their
appearance. This provides the means for dealing with one of the most difficult problems in
SLAM, namely data association.
The main contributions of this work are i) a method for the initialisation of visual landmarks
for SLAM, ii) a robust and precise feature detector, iii) the management of the measurement
to make on-line estimation possible, and iv) the demonstration of how this framework
2. Related Work
Working with a single camera, the measurements will be of bearing only type. This means
that a single observation of a landmark is not enough to estimate its full pose since the depth
is unknown. This problem is typically addressed by combining the observations from
multiple views as in the structure-from-motion (SFM) approaches in computer vision. The
biggest difference between SLAM and SFM is that SFM considers mostly batch processing
while SLAM typically requires on-line, real-time performance.
The fact that the full pose of a landmark cannot be estimated from a single observation leads
to one of the most important problems that has to be addressed in bearing only SLAM;
landmark initialisation. Several approaches have been presented in the literature. In
(Davison, 2003) a particle filter was used to represent the unknown initial depth of features.
The drawback of the approach is that the initial distribution of particles has to cover all
possible depth values for a landmark, which makes it difficult to use when the number of
detected features is large. A similar approach has been presented in (Kwok et al.,
2005) where the initial state is approximated using a Gaussian Sum Filter for which the
computational load grows exponentially with the number of landmarks. The work in (Lemaire
et al. 2005) proposes an approximation with additive growth. It uses a weighted Gaussian
sum approximation for the depth estimate of uninitialised landmarks. Gaussians in the sum
are deleted when they no longer are supported by subsequent observations. When a single
Gaussian remains, the landmark is initialised given that a few other conditions are fulfilled.
Another, more practical problem associated with landmark initialisation comes from the
limited field of view of a normal perspective camera in combination with the robot typically
moving along the optical axis as pointed out in (Goncavles et al., 2005). To cope with the
reconstruction problem, a stereo-based SLAM method was presented in (Sim et al., 2005)
where Difference-of-Gaussians (DoG) is used to detect distinctive features which are then
matched using SIFT descriptors. An important problem mentioned is that their particle filter
based approach is inappropriate for large-scale and textured environments. One of the
contributions of our work is that we deal with this problem by identifying only a few high
quality features in the scene to perform SLAM.
Another problem mentioned in (Sim et al., 2005) is related to the time-consuming feature
matching. We address this by using a KD-tree to make our matching process very fast. The
visual feature detector used in our work is the Harris corner detector across different scales
represented by a Laplacian pyramid, similar to what is suggested in (Mikolajczyk & Schmid
2003). For feature matching, we use a modified SIFT descriptor in combination with KDtrees.
Working in indoor environments means that the floor is typically flat and the SLAM
problem can be simplified by assuming that the robot is constrained to a plane. However,
there are many repetitive features stemming from, for example, right angle corners. A single
SIFT descriptor is not discriminative enough in an image to solve the data association
problem. To address this, ``chunks'' of SIFT points were used to represent landmarks in an
outdoor environment in (Luke et al., 2005). This was motivated by the success that SIFT has
had in recognition applications where the object/scene was represented as a set of SIFT
points. In our approach, the position of a landmark is defined by a series of SIFT points
representing different views of the landmark. Each such point is accompanied with a chunk
of descriptors that make the matching/recognition of landmarks more robust. Our
experimental evaluation shows also that our approach performs successful matching even
with a narrow field of view, which was mentioned as a problem in (Goncavles et al., 2005,
Sim et al., 2005).
Yet another problem in SLAM is loop closing, that is the ability to detect when the robot
comes back to a position it has been to previously and thereby closing a loop. (Newman &
Ho, 2005) argue for using laser for the geometric mapping but to rely on visual input to
solve the loop-closing problem. The message is that robustness is best achieved if the same
mechanism is not used for the mapping and the loop closing detection. In (Newman & Ho,
2005) visually salient, so called ``maximally stable extremal regions'' or MSERs, encoded
using SIFT descriptors, are used to detect when the robot is revisiting an area. In (Gutmann
& Konolige, 1999) scan matching is used to detect when loops are closed. We show in this
chapter that our framework also can be used for loop closing detection.
In the remainder of this chapter we will make a distinction between recognition and location
features. A single location feature will be associated with several recognition features. The
recognition features' descriptors then give robustness to the match between the location
features in the map and the features in the current image. The key idea is to use a few high
quality features to define the location of landmarks and then use the other features for
recognition. This contributes to a low complexity (few location features) while maintaining
highly robust matching (many recognition features).
3. Feature Description
The SIFT descriptor (Lowe, 1999) has been used frequently in both computer vision and
various robot vision applications. It has been shown in (Mikolajczyk & Schmid 2003) to be
the most robust descriptor regarding scale and illumination changes. The original version of
the SIFT descriptor uses feature points determined by the peaks of a series of Difference of
Gaussians (DoG) on varying scales. In our system, peaks are found using Harris-Laplace
features, (Mikolajczyk & Schmid 2001) since they respond to regions of high curvature,
instead of blob-like image structures obtained by series of DoG. This leads to features
accurately localized spatially, which is essential when features are used for reconstruction
and localization, instead of just recognition.
In a sparse, indoor environment many of the detected features originate from corner
features. The original SIFT descriptor assigns canonical orientation at the peak of smoothed
gradient histograms. This means that similar corners but with a significant rotation
difference can have similar descriptors. This may potentially lead to many false matches. For
example, the four corners of the waste bin in Figure 2. may all match if rotated. Therefore,
we use a rotationally 'variant' SIFT descriptor where we avoid the canonical orientation at
the peak of smoothed gradient histogram and leave the gradient histogram as it is.
measurements are delayed N frames with respect to the input side of the buffer. Figure 1.
illustrates this idea.
Figure 1. A buffer of N image frames is used for matching, selection & triangulation
Figure 2. Many structures in indoor environments look similar even when rotated
The benefit of this is that the SLAM process can be fed with few and high quality
landmarks. In addition, since an estimate of the 3D position of landmarks can be supplied
with the first measurement of a landmark, the landmarks can immediately be fully
initialised in the SLAM process. This allows immediate linearisation without the need to
apply multiple hypotheses (Lemaire et al., 2005) or particle filtering (Davison, 2003)
techniques to estimate the depth. It is important to point out that the approximate 3D
position found from the buffer of frames is only used for initialising the point landmark at
the correct depth with respect to the camera at the first observation. The uncertainty in
depth is still assumed to be very high, as problems with incorporating information twice
would otherwise occur. Compared to a multiple hypothesis approach, it is like knowing
which of the multiple hypotheses about the depth is correct right away, which saves
computations. Having the correct depth allows us, as said before, to reduce the linearisation
errors that would result from having a completely wrong estimate of the depth.
Assuming that the delay caused by the length of the buffer is not too large, it is possible to
make a quite accurate estimate of the current robot pose by using dead reckoning
information to predict forward from the pose estimated by the SLAM process. For typical
values of N, the addition to the robot position error caused by the dead reckoning is small
and we believe that the benefits of being able to initialise landmarks using bearing-only
information and perform feature quality checks are more significant. The prediction
forward in time is done in each iteration from the latest pose estimated by SLAM. This way
there is no accumulation of dead reckoning errors other than over the short distances
corresponding to the size of the buffer.
In addition to requiring that features can be tracked over more than a certain predefined
number of frames, we require that the image positions of the feature allow good
triangulation and that the resulting 3D point is stable over time in the image. Requiring that
the feature can be tracked over several frames removes noise and moving targets that could
otherwise severely damage the estimation process. Good triangulations rule out features
that have a high triangulation uncertainty, typically because of small baseline or having
bearings near the direction of motion. The third requirement removes features that lack
sharp positions in all images due to parallax or a lack of a strong maximum in scale space.
Differences in the scales of the images can also cause apparent motion of features, such as, for
example, a corner of a non-textured object.
We have used a fixed value for N, i.e. the length of the buffer, in our tests. The values
between 10 and 50 have been tested. A buffer with all frames acquired from the same
camera pose would be of little use for triangulation. Therefore, a new frame is added to the
buffer when the camera has moved enough since the last added frame. This way, it is likely
that there is enough baseline for estimating the location. The value of N depends very much
on the motion of the robot/camera and the camera parameters. For a narrow field of view
camera mounted in the direction of motion of the robot, as in our case, the effective baseline
will be quite small. An omnidirectional camera would offer one way to deal with the small
field of view. Another idea is to actively control the direction of the camera as in (Vidal-Calleja et al., 2006).
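A rough sketch of the frame-buffer policy described above: a frame is stored only when the camera has moved enough since the last stored frame, so that the buffered views provide some baseline for triangulation. The buffer length and the translation threshold are illustrative values, not those used by the authors.

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """Keep the feature points of the last N frames, adding a new frame only when the
    camera has moved enough since the last stored one, so that triangulation of new
    landmarks has a usable baseline."""

    def __init__(self, n=20, min_translation=0.05):
        self.frames = deque(maxlen=n)           # entries: (camera_position, feature_points)
        self.min_translation = min_translation  # metres; illustrative value
        self._last_position = None

    def maybe_add(self, camera_position, feature_points):
        position = np.asarray(camera_position, dtype=float)
        if (self._last_position is None or
                np.linalg.norm(position - self._last_position) > self.min_translation):
            self.frames.append((position, feature_points))
            self._last_position = position
            return True   # frame accepted into the buffer
        return False      # camera has not moved enough yet
```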
5. Feature Tracking
The buffer with data from the past N frames does not contain the whole images, but rather
the feature points that have been extracted in each frame. An even higher reduction of space
could be achieved by using an indexing scheme as in (Nistér & Stewénius, 2006). The feature
points are tracked over consecutive frames. To estimate if two feature points match, we use
the distance between the descriptors, i.e. between the 128-dimensional vectors associated
with the SIFT descriptors. On the left hand side of Figure 3. the organization of the frame
memory is shown. Notice the lists that store the associations between the points for the
different frames in the buffer. Ideally, each association list corresponds to one landmark in
the world and denotes how the projection of this landmark moves in the image as the robot
moves.
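A small sketch of descriptor-based tracking with a KD-tree, as an illustration of the matching step just described; the distance-ratio test and its threshold are common practice but are assumptions here, not the authors' exact criterion.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(old_descriptors, new_descriptors, ratio=0.8):
    """Match 128-D SIFT descriptors between consecutive frames with a KD-tree.

    Returns a list of (old_index, new_index) pairs that pass the distance-ratio test.
    """
    tree = cKDTree(np.asarray(old_descriptors))
    distances, indices = tree.query(np.asarray(new_descriptors), k=2)
    matches = []
    for new_idx, (d, idx) in enumerate(zip(distances, indices)):
        if d[0] < ratio * d[1]:              # best match clearly better than second best
            matches.append((int(idx[0]), new_idx))
    return matches
```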
The SIFT descriptor is invariant to changes in scale and view angle but only up to a certain
degree. The change between two consecutive observations in the buffer is however typically
quite small and makes tracking possible. The different descriptors in the list correspond to
different viewpoints of the same landmark.
As was previously described, the buffer is used to sort the good from the bad landmarks.
The output from the frame memory is a small selection of all the feature points in the oldest
frame. These points are the ones that are judged to be the best with respect to the criteria
mentioned earlier. Some of these points correspond to observations of already existing
landmarks and some to the first observation of a new landmark. For each new landmark
observation, an estimate of the 3D position is obtained by triangulating the points in the
corresponding association list. The number of points that are used as observations in each
frame is typically only a small fraction of all points in that frame. This helps reduce the
complexity. The time to perform the tracking over frames has constant complexity, assuming
that the number of features in each frame is bounded.
Only using the similarity of the point descriptor for tracking has two problems. First, it
requires that all points in the image are tested for similarity, which is computationally
expensive, and second, it can lead to false matches in cases where there are similar structures
in multiple places in the image. To address these issues we predict the approximate image
location for the old point features in the new frame using odometry and optical flow
estimates. The predicted image location allows us to narrow the search region for each
feature match and thus increase efficiency. Notice that the buffer allows us to predict feature
points observed not only in the very last frame but also further back. This increases the
robustness in the tracking, as some feature points are not present in every frame.
landmark in the database are provided by the frame memory. For every new observation of
a landmark the descriptor is compared to the existing ones and used to augment the
descriptor list if it is different enough.
The SIFT point descriptors are not globally unique (see Figure 2. again) and thus matching a
single observation to a landmark is doomed to cause false matches in a realistic indoor
environment. However, using a large number of SIFT descriptors has proven to give robust
matching results in object recognition applications. This is why we store, along with the
landmark descriptor associated with the location of the landmark, the rest of the descriptors
extracted from the same frame and use these for verification. We refer to the rest of the
feature points in a frame as recognition features to distinguish them from the location
feature associated with the location of the landmark.
The structure of the database is shown on the right hand side in Figure 3. Each landmark
F1, F2, ..., FN has a set of location descriptors shown in the dashed box. A KD-tree
representation and a Best-Bin-First (Beis & Lowe, 1997) search allow for real-time matching
between new image feature descriptors and those in the database. Each location descriptor
has a set of recognition descriptors shown to the right.
When we match to the database, we first look for a match between a single descriptor in the
new frame and the location descriptors of the landmarks (dashed box Figure 3.). As a second
step, we match all descriptors in the new frame to the recognition descriptors associated
with candidate location descriptors for verification. As a final test, we require that the
displacement in image coordinates for the two location features (new frame and database) is
consistent with the transformation between the two frames estimated from the matched
recognition descriptors (new frame and database). This assures that it is not just two similar
structures in the same scene but that they are at the same position as well. Currently, the
calculation is simplified by checking the 2D image point displacement. This final
confirmation eliminates matches that are close in the environment and thus share
recognition descriptors such as would be the case with the glass windows in Figure 2.
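The three-step verification can be sketched as follows; the data layout (location and recognition descriptors with their image points per landmark), the thresholds and the simplified 2-D displacement check are hypothetical stand-ins for the actual database structures.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_to_database(frame_points, frame_descriptors, landmarks,
                      max_descriptor_dist=0.3, min_recognition_hits=10, max_residual=20.0):
    """Three-step matching of the new frame against the landmark database.

    frame_points:       (M, 2) image coordinates of the features in the new frame
    frame_descriptors:  (M, D) descriptors of those features
    landmarks:          list of dicts with 'location_descriptor', 'location_point',
                        'recognition_descriptors', 'recognition_points' (hypothetical field names)
    """
    matches = []
    for lm_id, lm in enumerate(landmarks):
        # step 1: does any descriptor in the frame match the landmark's location descriptor?
        d_loc = np.linalg.norm(frame_descriptors - lm['location_descriptor'], axis=1)
        best = int(np.argmin(d_loc))
        if d_loc[best] > max_descriptor_dist:
            continue
        # step 2: verify with the recognition descriptors stored for this landmark
        tree = cKDTree(lm['recognition_descriptors'])
        dists, idx = tree.query(frame_descriptors, k=1)
        hits = dists < max_descriptor_dist
        if hits.sum() < min_recognition_hits:
            continue
        # step 3: the displacement of the location feature must be consistent with the
        # average displacement of the matched recognition features (simplified 2-D check)
        mean_shift = (frame_points[hits] - lm['recognition_points'][idx[hits]]).mean(axis=0)
        location_shift = frame_points[best] - lm['location_point']
        if np.linalg.norm(location_shift - mean_shift) < max_residual:
            matches.append((lm_id, best))
    return matches
```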
7. SLAM
The previous sections have explained how we track features between frames to be able to
determine which make good landmarks and how these are added to, represented in and
matched to the database. In our current system, we use an EKF-based implementation of
SLAM. It is however important to point out that the output from the frame memory could
be used as input to any number of different SLAM algorithms. It is possible to use a normal
EKF despite its limitations regarding complexity, since most features extracted from the
frames have been discarded by the matching and quality assessment process in the frame
memory. Even though hundreds of features are extracted in each frame only a fraction of
these are used for estimation. We are also able to supply the approximate 3D location of new
landmarks, so that no special arrangement for this has to be added to the SLAM algorithm.
This also makes plug-and-play of SLAM algorithms easier.
We use the same implementation for SLAM that was used in (Folkesson et al, 2005). This is
part of the freely available CURE/toolbox software package. In (Folkesson et al, 2005) it was
used for vision SLAM with a camera pointing up in the ceiling.
To summarize, the division is such that the SLAM process is responsible for estimating the
location of a landmark and the database for its appearance.
8. Experimental Evaluation
Figure 8. The line picked up by the laser scanner is the lower part of the bench where people
sit and not the wall behind it. This explains why many of the points in this area do not fall
on the laser-based line. Some of the spread of the points can also be explained by the small
baseline. The depth error is inversely proportional to the baseline (Hartley & Zisserman,
2000).
Figure 5. The landmark map with the trajectory and reference laser based map
Figure 6. Situation when the first loop is closed. Lines show matched points
Another observation that can be made is that the final map contained 113 landmarks and
that most of these (98) were added to the map during the first loop. This indicates that
landmarks were later matched to the database rather than being added to the map again. Had this
not been the case, one would have expected to see roughly 3 times the number of landmarks.
As many as half of the features in each frame typically do not match any of the old features
in the frame memory and are thus matched to the database. A typical landmark in the
database has around 10 descriptors acquired from different viewing angles. The first step of
the matching to the database uses the KD-tree, which makes it fast and often
results in only a few possible matching candidates.
Figure 7. One of the matched points in the first loop detection (compare to Figure 6)
10. References
Beis, J.S.; Lowe, D.G. (1997). Shape indexing using approximate nearest-neighbour search in
high-dimensional spaces. Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp.1000-1006.
Castellanos, J.A.; Tardós, J.D. (1999). Mobile Robot Localization and Map Building: A
Multisensor Fusion Approach, Kluwer Academic Publishers.
Davison A.J. (2003). Real-time simultaneous localisation and mapping with a single camera.
Proceedings of the International Conference on Computer Vision (ICCV).
Dissanayake, G.; Newman, P.; Clark, S.; Durrant-Whyte, H.F.; Csorba, M. (2001). A solution
to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17, 3,
229-241
Folkesson, J.; Christensen, H.I. (2004). Graphical SLAM - A Self-Correcting Map, Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA)
Folkesson, J.; Jensfelt, P.; Christensen; H.I. (2005) Vision slam in the measurement subspace.
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Frese, U.; Schröder, L. (2006) Closing a Million-Landmarks Loop, Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS)
Goncavles, L.; di Bernardo, E.; Benson, D.; Svedman, M.; Ostrovski, J.; Karlsson, N.;
Pirjanian, P. (2005) A visual front-end for simultaneous localization and mapping.
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp.
44-49.
Gutmann, J.; Konolige, K. (1999) Incremental mapping of large cyclic environments.
Proceedings of the IEEE International Symposium on Computational Intelligence in
Robotics and Automation, pp. 318-325
Hartley, R.; Zisserman, A. (2000). Multiple View Geometry in Computer Vision, Cambridge
University Press, ISBN: 0521623049
Kwok, N.M.; Dissanayake, G.; Ha, Q.P. (2005) Bearing only SLAM using a SPRT based
gaussian sum filter. Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA).
Lemaire, T.; Lacroix, S.; Sola, J. (2005) A practical 3D bearing-only SLAM algorithm,
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp 2757-2762
Lowe, D.G. (1999) Object recognition from local scale-invariant features. Proceedings of the
IEEE International Conference on Computer Vision (ICCV), pp. 1150-57
Luke, R.H.; Keller, J.M.; Skubic, M.; Senger, S. (2005) Acquiring and maintaining abstract
landmark chunks for cognitive robot navigation. Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS)
Mikolajczyk, K.; Schmid, C. (2001) Indexing based on scale invariant interest points.
Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 525-531
Mikolajczyk, K.; Schmid, C. (2003) A performance evaluation of local descriptors.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 257-263.
Newman, P.; Ho, K. (2005) SLAM-loop closing with visually salient features, Proceedings of
the IEEE International Conference on Robotics and Automation (ICRA). pp. 644-651.
Nistér, D.; Stewénius, H. (2006) Scalable Recognition with a Vocabulary Tree, Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Se, S.; Lowe, D.G.; Little, J. (2002) Mobile robot localization and mapping with uncertainty
using scale-invariant visual landmarks. Journal of Robotics Research, 21, 8, 735-758
Sim, R.; Elinas, P.; Griffin, M.; Little, J. (2005) Vision-based slam using the rao-blackwellised
particle filter, Proceedings of the Workshop on Reasoning with Uncertainty in Robotics
(IJCAI)
Tardós, J.D.; Neira, J.; Newman, P.M.; Leonard, J.J. (2002) Robust mapping and localization
in indoor environments using sonar data, Journal of Robotics Research, 4
Thrun, S.; Fox, D.; Burgard, W. (1998) A probabilistic approach to concurrent mapping and
localization for mobile robots, Autonomous Robots, 5, 253-271
Thrun, S.; Liu, Y.; Koller, D.; Ng, A.; Ghahramani, Z.; Durrant-Whyte, H. (2004) SLAM with
sparse extended information filters, Journal of Robotics Research, 23, 8, 690-717
Vidal-Calleja, T.; Davison, A.J.; Andrade-Cetto, J.; Murray, D.W. (2006) Active Control for
Single Camera SLAM, Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA), pp. 1930-36
10
An Effective 3D Target Recognition Imitating
Robust Methods of the Human Visual System
Sungho Kim and In So Kweon
Korea Advanced Institute of Science and Technology
Korea
1. Introduction
Object recognition is an important research topic in computer vision. Not only is it the
ultimate goal of computer vision, but it is also useful in many applications, such as automatic
target recognition (ATR), mobile robot localization, visual servoing, and guiding visually
impaired people.
Great progress in this field has been made during the last 30 years. During 1970~1990, the
research focused on the recognition of machine parts or polyhedral objects using edge or
line information (Lowe, 2006, Faugeras & Hebert, 1986). A 2D invariant feature and hashing-based object recognition was popular during the 1990s (Mundy & Zisserman, 1992,
Rothwell, 1993). Since the mid 1990s, view or appearance-based methods have become a
popular approach in computer vision (Murase & Nayar, 1995). Current issues cover how to
select a feature, handle occlusion, and cope with image variations in photometric and
geometric distortions. Recently, object recognition methods based on a local visual patch
showed successful performance in those environmental changes (Lowe, 2004, Rothganger et
al., 2004, Fergus et al., 2003). But these approaches only work on textured, complex objects and
do not provide 3D pose information about the objects of interest.
The goal of our research is to get the identification and pose information of 3D objects or
targets from either a visible or infrared band sensor in a cluttered environment. The
conventional approaches mentioned above do not provide satisfactory results. To achieve
this goal more effectively, we pay attention to the perception mechanism of the human
visual system (HVS), which shows the best efficiency and robustness to the above-mentioned
problems. In particular, we focus on the components of HVS robustness.
properties of V4 receptive field, where the convex part is used to represent visual
information (Pasupathy & Connor, 2001). A part-based representation is very robust to
occlusion and background clutter. We represent visual appearance by a set of robust visual
part.
Motivated by these facts, many computational models were proposed in computer vision.
Researchers of model-based vision regarded bottom-up/top-down processes as
hypothesis/verification paradigms (Kuno et al., 1988, Zhu et al., 2000). To reduce
computational complexity, visual attention mechanism is used (Milanese et al. 1994). Topdown constraint is used to recognize face and pose (Kumar, 2002). Recently, an interesting
computational model (HMAX) was proposed based on the tuning and max operation of a
simple cell and a complex cell, respectively (Serre & Riesenhuber, 2004). In the computer
vision community, Tu et al. proposed a method of unifying segmentation, detection and
recognition using boosting and prior information obtained by learning (Tu et al., 2005). Although
these approaches have their own advantages, they model only partial evidence of
human visual perception, and do not look more closely at the robust properties of the
HVS.
In this paper, we propose a computationally plausible model of 3D object recognition,
imitating the above properties of the HVS. Bottom-up and top-down information is
processed by a visual attention mechanism and integrated under a statistical framework.
P(θ | I) ∝ P(ZL | θ, ZC) · P(θ | ZC)    (1)
where I = {ZL, ZC}
Here θ denotes the parameter set explained above and I denotes the input image, which is composed
of two sets: ZL for object-related local features and ZC for place- or scene-related contextual
features. In the likelihood of equation (1), the first factor P(ZL | θ, ZC) represents the
posterior distribution of the local features, such as local structural patches and edge information, given the
parameters and the contextual information. There is a lot of contextual information, but we
restrict it to the place context and a 3D global shape for our final goal. This
information reduces the search space and provides accurate pose information. The second
factor P(θ | ZC) provides context-based priors on object ID and pose, which are related to the
(Figure: graphical model connecting the place context and 3D shape context ZC with each object Oi, Oj, its view indices Vi1...Vik, Vj1...Vjk and part indices Pi1, Pi2, Pj1, Pj2, ..., and the input features ZL1, ZL2, ..., ZLn.)
advantage of the MCMC is that its samples are guaranteed to asymptotically approximate
those which form the posterior. A particular implementation of the MCMC is the
Metropolis-Hastings algorithm (Robert & Casella, 1999). The original algorithm is as
follows:
Algorithm 1: Metropolis-Hastings algorithm
Draw an initial point θ0 from a starting distribution P(θ).
For i = 1..N
Draw a candidate point θ* from the jumping distribution Ji(θ* | θi−1)
Calculate the ratio
r = [f(θ*) · Ji(θi−1 | θ*)] / [f(θi−1) · Ji(θ* | θi−1)]
Accept the candidate, θi = θ*, with probability min(r, 1); otherwise set θi = θi−1
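A generic sketch of this sampler follows; the target f, the jumping distribution and the parameter representation are left abstract and passed in as callables, so nothing here is specific to the authors' object model.

```python
import random

def metropolis_hastings(f, jump, jump_density, theta0, n_iterations):
    """Generic Metropolis-Hastings sampler (Algorithm 1).

    f:            unnormalized target density, f(theta)
    jump:         draws a candidate theta* given the previous sample
    jump_density: evaluates J(theta_a | theta_b) for arguments (theta_a, theta_b)
    """
    samples = [theta0]
    theta = theta0
    for _ in range(n_iterations):
        candidate = jump(theta)
        # acceptance ratio r = f(theta*) J(theta | theta*) / (f(theta) J(theta* | theta))
        r = (f(candidate) * jump_density(theta, candidate)) / (f(theta) * jump_density(candidate, theta))
        if random.random() < min(1.0, r):
            theta = candidate               # accept the candidate
        samples.append(theta)               # otherwise keep the previous point
    return samples
```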
shape matching. The decision information is fed back to the bottom-up process for another
object recognition in the same scene. Algorithm 2 summarizes the overall recognition steps.
(Block diagram: the input image is processed bottom-up by local structural feature extraction and discriminative voting on scene and object against the scene DB, yielding the scene context and the initial object parameters; the top-down path renders the global shape from the 3D CAD model DB and the view-based DB under the jumping distribution Ji(θ | θi−1) and performs shape matching.)
Figure 2. Overall functional model of the object recognition motivated by the robust
properties of the HVS
Algorithm 2: Domain knowledge & context-based 3D object recognition algorithm
Stage I: Initialization by bottom-up process
Step 1: Extract HCM, CEM in scale space
Step 2: Find salient interest points through scale space analysis
Step 3: Bind feature maps by relating salient HCM and the corresponding CEM
Step 4: Extract local edge patches and calculate local Zernike moments
Step 5: Discriminate scene ID through direct voting
Step 6: Calculate the likelihood of object parameters from scene context and object
discrimination by direct voting
Step 7: Sort candidate parameters θ0 = {θID,0, θC,0, θP,0, θD,0}
Stage II: Optimization by top-down process
Step 1: Extract GMM and GOM
Step 2: Set initial point θ0 = {θID,0, θC,0, θP,0, θD,0} from Stage I
Step 3: Optimize parameters by MCMC sampling with feature map binding
For t = 0, ..., T
Draw a candidate point θ* from the jumping distribution Jt(θ* | θt−1)
Render the 3D CAD model based on shape context and θ*
Calculate the cost function f(θ*) by focusing on the rendered model
and the integrated feature maps (GMM+GOM)
Calculate the ratio
r = [f(θ*) · Jt(θt−1 | θ*)] / [f(θt−1) · Jt(θ* | θt−1)]
Accept the candidate, θt = θ*, with probability min(r, 1); otherwise set θt = θt−1
End for
Step 4: If f(θT) < ε, recognition is finished and the result is fed back to Step 6 in Stage I.
Else reject θ0 and go to Step 2 with the next candidate θ0
Figure 3. Configuration of the database: scene context + 3D CAD model + part-based view
representation
Figure 4. Block of local structural feature extraction: Canny Edge Map and Harris Corner
Map are extracted in scale space which is bound by spatial attention on salient corners. Each
local structural patch is represented using Zernike moments
Figure 4 shows the overall process for feature generation. We extract separate low-level
feature maps such as Canny Edge Maps (called CEM) and Harris Corner Maps (called
HCM) in scale space. Then a perceptually salient corner and its characteristic scale are calculated
(Lindeberg, 1998). Locally structural visual parts are extracted by attending to the CEM around
salient corner points and the scale-tuned regions of the HCM. The scale tuning process is
supported by neuro-physiological evidence, as explained in Section 2. Each patch, whose
size is normalized to 20×20, is represented by local Zernike moments introduced in (Kim &
Kweon, 2005).
Step 1: Generation of separate feature maps
In the bottom-up process, we assume that an object is composed of local structures.
In (Parkhurst et al., 2002), Parkhurst et al. experimentally showed that the
bottom-up saliency map-based attention of Itti's model is not suitable for learned object
recognition. So, we adopt another spatial attention approach: the HVS usually attends
to high curvature points (Feldman & Singh, 2005). Although the HVS also attends to
symmetrical points (Reisfeld et al., 1995), we only use the high curvature points for visual
attention, since they are robust to viewpoint changes and computationally easy to detect. We detect
high curvature points directly from an intensity image using a scale-reflected Harris corner
detector, which shows the highest repeatability under photometric and geometric distortions and which
contains enough information (Harris & Stephens, 1988, Schmid et al., 2000). A conventional
Harris corner detector detects many clusters around noisy and textured regions. However,
this does not matter, since the scale-reflected Harris detector extracts corners from noise-removed
images obtained by Gaussian scale space. Furthermore, since salient corners are selected in
scale space, corner clusters are rarely found, as in Figure 5. The Canny edge detector is used to
extract an edge map, which reflects processing similar to a center-surround detection
mechanism (Canny, 1986). The CEM is accurate and robust to noise. Both low level maps are
extracted pre-attentively.
Step 2: Feature integration by attending on salient corners
Local visual parts are selected by giving spatial attention to a salient corner. We use the scale
space maxima concept to detect salient corners. We define a corner as salient if the
measure of convexity (here, the Laplacian) of the corner shows a local maximum along the scale axis. A
computationally suitable algorithm is the scale-adapted Harris-Laplace method, which is the
most robust to image variations (Schmid et al., 2000). Figure 5 shows the salient corner
detection results. To detect a salient corner, we first build a corner scale space by changing
the smoothing factor (σ). Then the convexities of the corners are compared along the scale axis.
Finally, salient corners are selected by taking the maximum convexity measure among the
corners tracked in scale space. As a by-product, a scale-tuned region can be obtained, as in
Figure 5. This image patch corresponds to a local object structure.
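A rough sketch of selecting salient corners as scale-space maxima of the Laplacian over a Harris corner mask, in the spirit of the Harris-Laplace selection described above; the scale set, the thresholds and the use of OpenCV are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def salient_corners(gray, sigmas=(1.0, 1.4, 2.0, 2.8, 4.0), harris_thresh=0.01):
    """Pick corners whose scale-normalized Laplacian is a local maximum over scale."""
    gray = np.float32(gray)
    harris_stack, laplace_stack = [], []
    for s in sigmas:
        smoothed = cv2.GaussianBlur(gray, (0, 0), s)
        harris_stack.append(cv2.cornerHarris(smoothed, blockSize=3, ksize=3, k=0.04))
        laplace_stack.append((s ** 2) * np.abs(cv2.Laplacian(smoothed, cv2.CV_32F)))
    harris = np.stack(harris_stack)          # (scales, H, W)
    laplace = np.stack(laplace_stack)
    corners = harris > harris_thresh * harris.max()
    keypoints = []
    for i in range(1, len(sigmas) - 1):      # interior scales only
        peak = (laplace[i] >= laplace[i - 1]) & (laplace[i] >= laplace[i + 1])
        ys, xs = np.nonzero(corners[i] & peak)
        keypoints.extend((x, y, sigmas[i]) for x, y in zip(xs, ys))
    return keypoints                          # (x, y, characteristic scale)
```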
Step 3: Local visual parts description by Zernike moments
The local visual parts are represented using modified Zernike moments introduced in (Kim
& Kweon, 2005). The Zernike moments were used to represent characters because they are
inherently rotation invariant, as well as possessing superior image representation
properties, information redundancy, and noise characteristics. A normalized edge part is
represented as 20 dimensional vectors where each element is the magnitude of a Zernike
moment. Although we do not know how the HVS represents local visual images, we utilize
the local Zernike moments, since this feature is robust to scale, rotation and illumination
changes.
The performance is evaluated in terms of the interest region selector and the region descriptor using
ROC curves (Mikolajczyk & Schmid, 2003). We used 20 object images as a reference, and
made test images by changing the scale factor by 0.8 times, the planar rotation by 45°, the view angle by 25°,
and the illumination by a reduction to 0.7 times the reference. For the comparison of the visual
part detectors, we used the same number of scale-space levels, the same Zernike moment descriptor and the image
homography to check the correct matches. For the comparison of the descriptors, we used the
same scale space, the same salient corner part detector and the image homography for the same reason.
The scale-tuned region detector based on the salient corner part detector outperforms the SIFT
(DoG-based) detector in almost all regions, as in Figure 6 (a). In the descriptor comparison graph, SIFT and PCA show
better performance than Zernike, as in Figure 6 (b). But this region of low false positive
rate is of little use, because few features are found there. In a noisy environment, our descriptor
(Zernike) shows better performance. Figure 7 shows several matching examples using the
salient corner with Zernike moments. Note the robust matching results in various
environments.
Figure 6. (a) Performance comparison of interest part selector: Salient corner vs. SIFT, (b)
performance comparison of local descriptor: SIFT, Zernike, and PCA
Figure 7. Examples of feature matching using a salient corner part detector and a Zernike
moments descriptor in illumination, occlusion, rotation, scale and view angle changes
5.2 Initial parameter estimation by discriminative method
The initial parameters of an object are estimated using a discriminative method, 1-nearest
neighbor based voting. In the first step, scene identity is found using direct voting. This
scene context provides the information of probable object ID. In the next step, other initial
pose, position, and scale parameters are estimated for the object, using the same voting
method.
Step 1: Discriminative method on scene recognition
In equation (1), the scene context term P(θ | ZC) provides object-related priors, especially
the object ID. If we assume one object per scene for simplification, then the initial object ID can be
estimated directly from the scene discrimination process as in equation (2).
P(θID | ZC) ≈ P(s | ZC)    (2)
P(s | ZC) ∝ Σ_{i=1..NZC} P(s | ZCi)    (3)
where the local feature ZCi belongs to the scene feature set ZC, which usually corresponds to
background features, s is a scene label and NZC is the number of input scene features. The
posterior P(s | ZC) is approximated by the sum rule. We use the following binary
probability model to design P(s | ZCi):
P(s | ZCi) = 1 if L(ZCi) = s and KE(ZCi, Zi) ≥ ε, and 0 otherwise    (4)
where L(ZCi) denotes the label of the feature ZCi found by a 1-nearest neighbor search and
KE(ZCi, Zi) is the Gaussian kernel of the Euclidean distance between the input feature ZCi and
the corresponding scene DB feature Zi. The kernel threshold ε is usually set to 0.7~0.8. The final
scene discrimination result provides the scene context, i.e. prior information on the object ID.
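A minimal sketch of this 1-nearest-neighbour voting, in the spirit of equations (2)-(4); the kernel width and threshold values are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def discriminate_scene(scene_features, db_features, db_labels, n_scenes,
                       kernel_sigma=1.0, kernel_threshold=0.75):
    """1-nearest-neighbour voting over scene labels."""
    tree = cKDTree(db_features)
    distances, indices = tree.query(scene_features, k=1)
    votes = np.zeros(n_scenes)
    for dist, idx in zip(distances, indices):
        # Gaussian kernel of the Euclidean distance gates each vote, as in (4)
        if np.exp(-0.5 * (dist / kernel_sigma) ** 2) >= kernel_threshold:
            votes[db_labels[idx]] += 1
    votes /= max(votes.sum(), 1)           # normalized votes approximate P(s | ZC)
    return int(np.argmax(votes)), votes
```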
Step 2: Discriminative method on initial object parameters
The initial object ID is directly estimated from the scene context as in step 1. Other object-related
parameters are estimated by the same voting on the view-based object DB. In equation (1), the
initial parameters used in P(ZL | θ, ZC) can be directly discriminated as in step 1, i.e. by the voting
scheme. Since we already know the initial object ID, the search space of the other parameters is
reduced enormously. The only difference is that the voting spaces depend on the
parameters. For example, if we want to estimate the initial pose θC, we vote the nearest
match pairs into the corresponding pose space (azimuth, elevation), as in equation (3), and
select the maximum. Given the initial object ID and pose, the initial object scale θD and position
θP are estimated easily, since our part detectors extract the characteristic part scale with its
position in the image (see Figure 5). So, the initial scale is just the average of the
characteristic scale ratios between the scene and the model image, and the initial object position is
the mean of the matching feature pairs (see Figure 5). Since the object parameters are estimated
based on salient features and the scene context, which reduce the search space, there is no
increase in estimation error. Figure 8 shows the sample scene database and the scene
discrimination result obtained by direct voting for the test image. In this test, we used 20 scenes at
canonical viewpoints for the database, and the test image was captured from a different viewpoint.
Scene 16 is selected by the max operation on the voting result. This scene contains the
object of interest. So, we can initialize the object ID parameter from this scene context.
Figure 9 shows a bottom-up result, where the 3D CAD model is overlaid using the initial
parameters. There are some pose, scale and location errors. In addition, we cannot trust the
estimated object ID. These ambiguities are solved through a top-down process using 3D
shape context information.
Figure 8. (a) Examples of scene DB and test image on the right, (b) Scene context: nearest
neighbor-based direct voting
and gradient orientation map (GOM). The model parameter prediction and shape matching
are processed iteratively for statistical optimization.
6.1 Generation of model parameters
The posterior in equation (1) is approximated statistically by MCMC sampling. Based on the
initial parameters obtained in the bottom-up process, the next samples are generated from
the jumping distribution Ji(θi | θi−1). It is referred to as the proposal or candidate-generation
function because of its role. Generally, random samples are generated to prevent local maxima.
However, we utilize the bottom-up information and the top-down verification result for
suitable sample generation. In this paper, we use three kinds of jumps, i.e., object
addition, deletion and refinement, as listed in Table 1.
The first type inserts a new object and its parameters, depending on the result of the
bottom-up process. The second removes a tested model and its parameters, as determined
by the result of the top-down recognition. A jump of the third type is exemplified by equation
(5): the next state depends on the current state and a random gain. This gain has a uniform distribution
(U) in the range of ±30°, because the view sphere is quantized with this step. Here, θC,0 is
initialized by the result of the bottom-up process.
θC,t = θC,t−1 + δθC    (5)
Figure 10. 3D shape context-based top-down shape matching using MCMC: the 3D CAD
model is rendered using the initial object parameters, then meaningful shape matching is
performed by attending to the rendered 2D shape location and the GMM and GOM. The final decision
is made based on the MCMC optimization value
Jump type J1 - Object addition; parameters: θID, θC, θD, θP; candidates are taken from the bottom-up result
Jump type J2 - Object deletion; parameters: θID, θC, θD, θP; determined by the top-down result
Jump type J3 - Fine tuning of parameters; parameters: dθC, dθD, dθP; jumping distributions: dθroll ~ U(−10°, 10°), dθD ~ U(θD − θD/5, θD + θD/5), dθP = {dθx, dθy} with dθx ~ U(−40, 40), dθy ~ U(−40, 40)
Table 1. Jumping types and corresponding distributions
6.2 Robust shape matching
A predicted 3D CAD model generated by the jumping distribution is rendered onto the GMM
and GOM images. Attending to the shape model points, the information of both maps is combined
as in Figure 10. The scoring function used in the MCMC algorithm is defined by the shape
matching. The shape matching between the rendered 2D shape and both maps is based on
computational gestalt theory (Desolneux et al., 2004). We propose a novel ε-meaningful
shape matching method motivated by this theory.
Two important concepts of the theory are as follows:
Helmholtz principle: This principle provides a suitable mathematical tool for modeling
computational Gestalt. Basically, it assumes that an image has a random distribution of
pixel values or orientations. If some pixels break the randomness, then these pixels have
a certain pattern, called gestalt.
Pixel intensity: u(x, y)
Gradient magnitude: |Du(x, y)|, with Du(x, y) = (∂u/∂x, ∂u/∂y)(x, y)
Gradient orientation: θ(x, y) = (1 / |Du(x, y)|) · (−∂u/∂y, ∂u/∂x)(x, y)
The last two components are useful for shape matching, since they are robust to illumination
and noise. If we assume the image is random, then we can measure the structural alignment
to a certain pattern. We can think of a matching at x_i that satisfies both the image gradient
and orientation. If the rendered shape model is compatible with the image gradient and
orientation simultaneously, then this matching is meaningful.
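For illustration, the gradient magnitude map (GMM), the gradient orientation map (GOM) and a crude per-point alignment check can be computed as below; the thresholds and the alignment measure are simplified stand-ins for the ε-meaningful matching, not the authors' scoring function.

```python
import numpy as np

def gradient_maps(image):
    """Gradient magnitude map (GMM) and gradient orientation map (GOM) of an intensity image."""
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    gmm = np.hypot(gx, gy)                   # |Du(x, y)|
    gom = np.arctan2(gy, gx)                 # orientation of the gradient, in radians
    return gmm, gom

def alignment_score(gmm, gom, shape_points, shape_normals, mag_thresh=10.0, angle_thresh=np.pi / 8):
    """Fraction of rendered shape points supported by both maps (a rough stand-in for C(x))."""
    hits = 0
    for (x, y), normal_angle in zip(shape_points, shape_normals):
        angle_diff = np.angle(np.exp(1j * (gom[y, x] - normal_angle)))   # wrapped to (-pi, pi]
        if gmm[y, x] > mag_thresh and abs(angle_diff) < angle_thresh:
            hits += 1
    return hits / max(len(shape_points), 1)
```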
where H(x, ε) = (1/8) · num of {x | C(x) ≥ ε}    (6)
f(θ) = N · H(x, ε)^l    (7)
where N is the number of tests. The smaller this value is, the better the shape matching is.
We use this ε-meaningful shape matching as a scoring function for the MCMC
optimization method because this function provides a measure of shape matching. The
scoring or cost function acts as a means of measuring the goodness of the proposed model
parameters. Generated samples are accepted or rejected based on this function.
Figure 11 shows the effectiveness of feature map binding in the top-down process. To show the
power of feature map binding, we added Gaussian noise with a standard deviation of 8.
Binding the GMM with the GOM outperforms shape matching based on a single map.
Figure 12. Shape matching results for temperature-varying FLIR sequences. The proposed
method is very robust to temperature changes. The last image shows a false matching result
where the roof target is hardly detectable by human eyes
Figure 13. Parameter optimization by the top-down process: (a) the CAD model overlaid with the
initial parameters, (b) after 10 iterations, (c) after 40 iterations for the visible object
7. Experimental Results
In this paper, our main goal is to recognize man-made architecture such as buildings,
bridges, containers, etc. using a FLIR camera. As an initial test, we experimented on a
polyhedral object using a CCD camera. Then we evaluated the system on a FLIR dataset.
Figure 14 shows the overall interface of the target recognition system. This automatic target
recognition system estimates the initial object parameters using the scene and object DBs. Then
optimal parameter tuning is performed by top-down meaningful shape matching. From this
result, the system makes a decision and feeds it back to the bottom-up process.
Figure 14. System interface - (upper left): input image with the final result overlaid, (upper
right): rendered 3D CAD model generated from the bottom-up result and the jump distribution, (lower
left): bottom-up process result, (lower right): top-down process result, which shows the
optimal parameters
7.1 Test on visible database
First, we tested the algorithm on objects captured using the CCD camera. We made a
database of quantized views as explained. Figure 9 shows some results of the bottom-up
process; we obtain proper initial parameter values. Figure 13 shows the projection of a
model with parameters refined by the top-down process for each object placed in a general
environment. The overall computation time is 2 sec (0.5 sec for the bottom-up process) on
average on an AMD 2400+.
Figure 15. FLIR targets to recognize: cars, building, container, and tower
The test images are shown in Figure 16. They are composed of three types for an accurate
performance evaluation oriented towards practical use. The system has to recognize the targets in the DB
with a high recognition rate and be able to reject clutter objects or natural scenes.
Figure 16. The composition of test images: targets in DB, targets not in DB, and natural
scenes
Figure 17. Evaluation of target recognition performance: the proposed method, GMM only,
and GOM only
Figure 17 summarizes our results compared with the methods using GMM only and GOM
only. We used the correct positive rate vs. the false positive rate as the performance measure. In
target recognition, the false positive rate is a very important factor for a practical system,
because false detections cause enormous damage. So, a good target recognition system has
to have a high correct detection rate and a very low false detection rate. During the performance
comparison, we used the same bottom-up process with different top-down methods. We
took all test images into consideration for the optimal parameter tuning. Our method
outperforms the other two, with a correct detection rate of 93.75% and a false detection rate of only
2.85%. The GOM-based method shows the worst performance. Figure 18 shows visual object
recognition results for each object.
Figure 19 shows a typical failure case of the proposed system. The failures occurred from a
bottom-up failure due to severe noise and a top-down failure due to low contrast.
Figure 19. Failure case: the top-down process fails due to low contrast
8. Conclusions
We propose a new object recognition paradigm based on the robust properties of the HVS,
especially the use of scene context and 3D shape context information in a bottom-up and a top-down process. Furthermore, we also propose cooperative feature map binding by
utilizing both bottom-up and top-down processes and validate the system performance with
various experiments. The test results on several images demonstrate efficiency in optimal
matching as well as feasibility of the proposed recognition paradigm. The same paradigm
will be extended to the general object recognition problem by changing the model
representation.
9. Acknowledgments
This work was supported in part by the Korea Science and Engineering Foundation
(KOSEF) grant funded by the Korea government (MOST) (No. M1-0302-00-0064), by the MIC
for the project, Development of Cooperative Network-based Humanoids Technology of
Korea.
10. References
Bar, M. (2004). Visual objects in context. Nature Reviews: Neuroscience, Vol. 5, 617-629.
Biederman, I. (1987). Recognition by Components: A Theory of Human Image
Understanding. Psychol. Review, Vol. 94, No. 2, 115-147.
Borgelt, C. & Kruse, R. (2001). Graphical models: methods for data analysis and mining. Wiley,
New York, 1-12.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 8, No.6, 679-698.
Desolneux, A.; Moisan, L. & Morel, J.M. (2004). Gestalt theory and computer vision. In
Carsetti A. Seeing, Thinking and Knowing, Kluwer Academic Publishers, New York,
71-101.
Doucet, A.; Freitas, N.D. & Gordon, N. (2001). Sequential Monte Carlo methods in practice,
Springer, New York, 432-444, 3-13.
Edelman, S. & Bülthoff, H. (1992). Orientation dependence in the recognition of familiar and
novel views of 3D objects. Vision Research, Vol. 32, 2385-2400.
Faugeras, O.D. & Hebert, M. (1986). The representation, recognition, and locating of 3-D
objects. International Journal of Robotics Research, Vol. 5, No. 3, 27-52.
Feldman, J. & Singh, M. (2005). Information along contours and object boundaries.
Psychological Review, Vol. 112, No. 1, 243-252.
Fergus, R.; Perona, P. & Zisserman, A. (2003). Object class recognition by unsupervised
scale-invariant learning. Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 264-271, Madison, Wisconsin, June.
Fiser, J.; Subramaniam, S. & Biederman, I. (2001). Size Tuning in the absence of spatial
frequency tuning in object recognition. Vision Research, Vol. 41, No. 15, 1931-1950.
Green, P. (1996). Reversible jump Markov Chain Monte Carlo computation and Bayesian Model
Determination, Champman and Hall, London.
Harris, C.J. & Stephens, M. (1988). A combined corner and edge detector. In Proceedings of
4th Alvey Vision Conference, 147-151, Manchester.
Kim, S. & Kweon, I.S. (2005). Automatic model-based 3D object recognition by combining
feature matching with tracking. Machine Vision and Applications, Vol. 16, No. 5, 267-272.
Kumar, V.P. (2002). Towards trainable man-machine interfaces: combining top-down
constraints with bottom-up learning in facial analysis. Ph.D Thesis, MIT.
Kuno, Y.; Ikeuchi, K. & Kanade, T. (1988). Model-based vision by cooperative processing of
evidence and hypotheses using configuration spaces. SPIE Digital and Optical Shape
Representation and Pattern Recognition, Vol. 938, 444-453.
Lindeberg, T. (1998). Feature detection with automatic scale selection. International Journal of
Computer Vision, Vol. 30, No. 2, 77-116.
Tu, Z.; Chen, X.; Yuille, A. & Zhu, S.C. (2005). Image parsing: unifying segmentation,
detection, and object recognition. (Marr Prize Issue, a short version appeared in
ICCV 2003), International Journal of Computer Vision, Vol. 63, No. 2, 113-140.
VanRullen, R. (2003). Visual saliency and spike timing in the ventral visual pathway. Journal
of Physiology (Paris) 97, 365-377.
Zhu, S.C.; Zhang, R. & Tu Z. (2000). Integrating bottom-up/top-down for object recognition
by data driven Markov Chain Monte Carlo. Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition, 738-745, Hilton Head, SC, June.
11
3D Cameras:
3D Computer Vision of wide Scope
Stefan May, Kai Pervoelz and Hartmut Surmann
Fraunhofer Institute for Intelligent Analysis and Information Systems
Germany
1. Introduction
The human visual sense is the one among all our senses that gathers the most information we
receive. Evolution has optimized our visual system to negotiate one's way in three
dimensions even through cluttered environments. For perceiving 3D information, the
human brain uses three important principles: stereo vision, motion parallax and a-priori
knowledge about the perspective appearance of objects in dependency of their distance.
These tasks have posed a challenge to computer vision for decades. Today the most common
techniques for 3D sensing are based on CCD or CMOS cameras, laser scanners or 3D time-of-flight
cameras. Even though evolution has shown a predominance of passive stereo vision
systems, three additional problems remain for 3D perception compared with the two
active vision systems mentioned above. First, the computation needs a great deal of
processing power, since the correlation between two images from different points of view has to be
found. Second, distances to structureless surfaces cannot be measured if the perspective
projection of the object is larger than the camera's field of view. This problem is often called
aperture problem. Finally, a passive visual sensor has to cope with shadowing effects and
changes in illumination over time.
That is why for mapping purposes mostly active vision systems like laser scanners are used,
e.g. [Thrun et al., 2000], [Wulf & Wagner, 2003], [Surmann et al., 2003]. But these approaches
are usually not applicable to tasks considering environment dynamics.
Due to this restriction, 3D cameras [CSEM SA, 2007], [PMDTec, 2007] have attracted
attention since their invention nearly a decade ago. Distance measurements are also based
on a time-of-flight principle but with an important difference. Instead of sampling laser
beams serially to acquire distance data point-wise, the entire scene is measured in parallel
with a modulated surface. This principle allows for higher frame rates and thus enables the
consideration of environment dynamics.
The first part of this chapter discusses the physical principles of 3D sensors, which are
commonly used in the robotics community for typical problems like mapping and
navigation. The second part concentrates on 3D cameras, their assets, drawbacks and
perspectives. Based on these examinations, some solutions are discussed that handle common problems occurring in dynamic environments with changing lighting conditions. Finally, the last part of this chapter shows how 3D cameras can be applied to mapping, object localization and feature tracking tasks.
2. Range Sensing
Before focusing on 3D cameras and their applications, a short comparison of range sensors
and their underlying principles is given. Since there are many different types of sensors for
range sensing, the section focuses on those that are most common in the domain of robotics,
i.e. stereo vision systems, 3D laser scanners and of course 3D cameras. The section first
introduces the underlying measurement principles before describing real sensor systems in more detail.
2.1 Range Measurement Principles
Different types of sensors are based on different measurement principles. The two main
principles for technical systems are triangulation and time-of-flight. Both principles can
further be separated into two subcategories: active and passive triangulation, and pulsed or phase-shifted time-of-flight, respectively.
2.1.1 Triangulation
This technique is called triangulation since the object whose distance should be measured
forms a triangle with two parts of the sensor (cf. Figure 1). If the sensor consists of one
receiver part and one active transmitter part, the measurement principle is called active
triangulation. If it consists only of two passive receivers, it is called passive triangulation.
Figure 1. Left image: Working principle of active triangulation. Right image: Working
principle of passive triangulation
Active triangulation. The configuration of a simple active triangulation sensor can be seen in figure 1. A light source projects a single point onto the object and the reflection of the light point is measured by the receiver part of the sensor. This receiver is a position sensitive device, which can determine the point where the light reflection has hit the receiver. Knowing the position of the sensor's optics, the distance x between transmitter and receiver, their distance h to the optics and the position x' where the light reflection hits the receiver, it is possible to calculate the distance d of the object by the formula:

\[ d = h \, \frac{x}{x'} \tag{1} \]
Such a simple sensor configuration restricts the distance measurement capability to a single point. To determine the shape of an object, either the sensor or the object itself must be moved and several measurements have to be taken. More sophisticated triangulation sensor systems use two-dimensional light sources as well as two-dimensional receivers. They project a light pattern onto the object, which is received by, e.g., a 2D camera system. Such a system directly provides 3D shape information of the measured objects.
Passive triangulation. This principle is well known in nature and has been improved over millions of years, since it is the basic principle of the human visual sense; more precisely, it is the basis of human depth perception. Like a technical passive triangulation sensor, it consists of two receivers (the eyes or two cameras, respectively) which observe an overlapping area (cf. figure 1). If a specific point p is in the field of view of both receivers, it is possible to determine its distance d to the sensor. For this purpose each receiver determines the angle between the line from the sensor to point p and the optical axis of the receiver. In combination with the distance x between the two receivers, the distance to the point p is calculated by

\[ d = \frac{x}{\frac{1}{\tan\alpha} + \frac{1}{\tan\beta}} \tag{2} \]

where α is the angle from receiver A to the point p and β is the angle from receiver B to the point p. This formula assumes that the optical axes of receivers A and B are parallel. The most important task here is to find distinctive points in the images and to assign correspondences between the points in the two images. Each point in the image of receiver A has to be correctly identified in the image of receiver B. Wrong assignments will result in wrong distance measurements.
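To make the two triangulation relations concrete, the following Python sketch simply evaluates equations (1) and (2); the numeric values are illustrative only and do not describe any particular sensor.

```python
import math

def active_triangulation_distance(h, x, x_prime):
    """Equation (1): object distance from the baseline x between transmitter
    and receiver, their distance h to the optics, and the position x' where
    the reflected light hits the position sensitive device."""
    return h * x / x_prime

def passive_triangulation_distance(x, alpha, beta):
    """Equation (2): object distance from the baseline x between the two
    receivers and the angles alpha and beta (in radians) towards point p,
    assuming parallel optical axes."""
    return x / (1.0 / math.tan(alpha) + 1.0 / math.tan(beta))

# Illustrative numbers only; no real sensor geometry is modeled here.
print(active_triangulation_distance(h=0.05, x=0.10, x_prime=0.002))    # 2.5 m
print(passive_triangulation_distance(x=0.12, alpha=math.radians(85),
                                     beta=math.radians(84)))           # ~0.62 m
```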
2.1.2 Time-of-Flight
As the name already implies, this principle utilizes the time a specific signal needs to travel
from the sensor to the object and back. For calculating this time, different methods can be
used. In the following, two of them will be described in more detail, namely the impulse
time-of-flight method and the phase difference method.
Impulse Time-of-Flight. This method is the most obvious one, since a timer is started when
a signal is sent to the object and stopped when its reflection is received. By knowing the
speed of the signal, the distance to the object can be calculated directly. In practice most
often a short laser impulse is sent out to the object and the time until it is detected by an
optical receiver is measured.
From that travel time t, the distance d can be calculated by the formula

\[ d = \frac{1}{2} \, c \, t \tag{3} \]

where c denotes the speed of the signal (the speed of light for a laser impulse) and the factor 1/2 accounts for the signal traveling to the object and back.
Phase difference. With this method, a continuously modulated signal is emitted and compared with a reference signal. Since the measurement signal travels the distance to the object and back, its path is longer than that of the reference signal by twice the object distance:

\[ d' = d_{\mathrm{ref}} + 2d \tag{4} \]
Since the second signal has traveled a longer distance than the reference signal, the phase of the incoming signal is different. With this measured phase difference Δφ and the wavelength λ of the signal modulation, the distance d of the object can be calculated by

\[ d = \frac{\Delta\varphi}{360°} \cdot \frac{\lambda}{2} \tag{5} \]
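As an illustration of the phase difference principle, the following Python sketch converts a measured phase shift into a distance according to equation (5); the modulation frequency of 20 MHz is chosen only because it appears as an example later in this chapter.

```python
C = 299_792_458.0   # speed of light in m/s

def modulation_wavelength(f_mod):
    """Wavelength of the modulation signal for a given modulation frequency."""
    return C / f_mod

def distance_from_phase(delta_phi_deg, wavelength):
    """Equation (5): object distance from the measured phase difference
    (in degrees) and the wavelength of the signal modulation."""
    return (delta_phi_deg / 360.0) * (wavelength / 2.0)

# A 20 MHz modulation gives a wavelength of about 15 m; a phase shift of
# 90 degrees then corresponds to a distance of roughly 1.9 m.
lam = modulation_wavelength(20e6)
print(distance_from_phase(90.0, lam))
```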
Figure 3. Left image: The 3D laser scanner system 3DLS with a SICK LMS291 laser scanner.
Right image: 3D scan taken with the 3DLS
The 3DLS is available as an indoor and an outdoor version, which mainly differ in the operating temperature and the maximum range of the scanner. The maximum range is almost exclusively limited by the amount of light which is reflected by the measured object.
Theoretically, the maximum range of both versions is 80 m, but with an object reflectivity of only 10% (e.g. black cardboard) the maximum range is specified as 10 m for the indoor version and 30 m for the outdoor version. Another property of the 3DLS which influences the quality of the resulting data is the diameter of the emitted laser impulse, which increases over the traveled distance. For the 3DLS, the diameter of the laser impulse increases from around 1 cm at the scanner to 15 cm at a distance of 30 m. This causes inaccuracies if the laser hits an edge and is therefore partially reflected from different distances.
The result of a 3D laser scan is a 3D point cloud, which can be seen in figure 3. The technical data of the 3DLS are summarized in the following table.
Table 1. Technical data of the 3DLS (resolution, field of view, range, frame rate, dimensions, weight, power supply)
Figure 4. Left image: Hokuyo 3D scanner. Right image: Scan taken with the Hokuyo 3D scanner. Note that a field of view of 248° is reached with only one rotating servo
Similar to the 3DLS, this scanner is based on a 2D laser scanner, the Hokuyo URG-04LX [Kawata et al., 2005], [Hokuyo Automatic, 2007]. Since it is very small and lightweight, it is directly mounted on a servo drive to obtain the additional rotation axis. By using a pan-tilt head (cf. figure 4), different scanning setups are possible. In contrast to the 3DLS, this scanner measures the range by using the phase difference principle. For generating the modulated light signal, an infrared laser diode with a wavelength of λ = 785 nm is used. As described in section 2.1.2, it is not possible to detect whether the measured phase difference exceeds one cycle period and is therefore outside the maximum measurement range. To handle that problem, two laser signals with different modulation frequencies are emitted alternately. Both phase differences are measured separately and used for determining the real distance of the measured object.
The maximum apex angle of this 3D laser scanner is 270° horizontally and 248° vertically with a resolution of 0.36°. Depending on the measured distance, the precision is at least 2% of the distance. A full resolution scan takes 50 seconds. The technical data are summarized in the following table.
Table 2. Technical data of the Hokuyo-based 3D scanner (resolution, field of view, range, frame rate, dimensions, weight, modulation frequencies)
Figure 5. Left image: SwissRanger SR-2 device mounted on a pan-tilt unit. Right image:
Sample image captured from a SwissRanger SR-2 device. The image is color coded (see color
bar on the right side)
The SwissRanger SR-2 provides amplitude data, intensity data and distance data. All measurements are organized by an FPGA, which provides a USB interface to access the data. The FPGA can be configured by setting one or more of its eleven registers. The most important register concerns the adjustment of the integration time, since the SR-2 does not provide an automatic integration time controller by itself (the follow-up model SR-3000 does). Its value ranges from 1 to 255, in multiples of 255 µs. Finding the optimal value will be investigated in section 3.2.
Note table 3 below and compare it with table 1. The comparison of the SwissRanger SR-2 and the Sick LMS device is important for the experiment in section 4.1.

Resolution: 124 x 160 pixels
Field of view: 43° x 46° (hor. x vert.)
Range: 7.5 m
Frame rate: up to 30 fps
Dimensions (mm): 135 (W) x 45 (H) x 32 (D)
Weight: 0.2 kg

Table 3. Technical data of the SwissRanger SR-2
The maximum unambiguous measurement range of a 3D camera is determined by the modulation frequency:

\[ R = \frac{c}{2 f_m} \tag{6} \]

where R is the unambiguity interval, c the speed of light and f_m the modulation frequency. A camera with a frequency of 20 MHz provides an unambiguous range of 7.5 m. A lower frequency provides a higher range but less accuracy. To satisfy both criteria, multiple frequencies can be used. For instance, this technology is currently used by the PMD[vision] A2 from PMDTec.
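The following Python sketch evaluates equation (6) and illustrates one possible way of combining two modulation frequencies: candidate distances are enumerated for both wrapped measurements and the best agreeing pair is selected. This brute-force matching is only an assumed illustration of the idea, not the algorithm implemented in the Hokuyo scanner or the PMD[vision] A2.

```python
C = 299_792_458.0   # speed of light in m/s

def unambiguity_interval(f_mod):
    """Equation (6): unambiguous range R = c / (2 * f_mod)."""
    return C / (2.0 * f_mod)

def resolve_two_frequencies(d1, d2, f1, f2, d_max=25.0, tol=0.05):
    """Brute-force disambiguation: the true distance must equal d1 + k1*R1
    and d2 + k2*R2 for integer wrap counts k1, k2; return the candidate pair
    that agrees best within the tolerance, or None if none agrees.  d_max
    should stay within the combined unambiguous range of both frequencies."""
    r1, r2 = unambiguity_interval(f1), unambiguity_interval(f2)
    best = None
    for k1 in range(int(d_max / r1) + 1):
        c1 = d1 + k1 * r1
        for k2 in range(int(d_max / r2) + 1):
            c2 = d2 + k2 * r2
            if abs(c1 - c2) < tol and (best is None or abs(c1 - c2) < best[0]):
                best = (abs(c1 - c2), 0.5 * (c1 + c2))
    return None if best is None else best[1]

print(unambiguity_interval(20e6))                          # about 7.5 m
# Wrapped measurements of a target at roughly 12 m, seen at 20 MHz and 15 MHz.
print(resolve_two_frequencies(4.505, 2.007, 20e6, 15e6))   # about 12 m
```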
Since the principle is based on integrating discharged electrons from incoming light, the optical power also influences the reachable accuracy. These electrons are collected within a conversion capacity, which can result in oversaturation if the integration time is too high [Lange, 2000]. Both manufacturers mentioned in this chapter use a burst mode to increase the power output for short intervals at the same energy level over time.
For an application, the best measurement capability has to be adjusted via the integration time. This value has to be high enough to provide a high signal level, but low enough to avoid oversaturation. Oversaturation is indicated by both the intensity and the amplitude data. Theoretically the relation between intensity and amplitude is constant, as shown in figure 6, but unfortunately it shows a small deviation due to the non-ideal sinusoidal wave emitted by the sensor's LEDs [Lange, 2000].
Figure 6. Relation between intensity and amplitude data of the SwissRanger SR-2 device in dependency of the integration time. Note that a higher integration time is also indicated by a higher intensity value. The amplitude rises linearly until oversaturation occurs
There are also a number of other noise sources which theoretically influence the reachable accuracy. It is out of the scope of this chapter to explain all noise effects. A good theoretical work explaining them in detail can be found in [Lange, 2000]. This work also describes the dominance of shot noise, which cannot be suppressed and therefore limits the theoretically reachable signal-to-noise ratio and thus the accuracy. Hence, the standard deviation ΔR is approximately given as

\[ \Delta R = \frac{c}{4 f_m} \cdot \frac{\sqrt{I_l + I_b}}{\sqrt{2}\, A} \tag{7} \]

where A is the amplitude and I = I_l + I_b the intensity. The intensity value is composed of the reflected constant component I_l of the LED illumination and the background illumination I_b.
As a rule, it can be said that the proper saturation of a pixel's capacitance provides its best accuracy. The emitted light is distributed (only approximately uniformly, see [Gut, 2004]) over a surface that grows with the square of the distance. Therefore, the reflected intensity is inversely proportional to the squared distance, whereas the received background light (caused by sunlight) is independent of it [Schneider, 2003]. For both constituent parts the standard deviation yields a different dependency on the object's distance r. If one of these components is dominant, the standard deviation has the following characteristic¹:

\[ \Delta R \sim \begin{cases} r, & \text{if } I_l \gg I_b \\ r^2, & \text{if } I_l \ll I_b \end{cases} \tag{8} \]
For indoor applications with little background illumination, the first case can be assumed. This relation will be relevant for the experiments mentioned below.
¹ Note that the amplitude also has a quadratic dependency on the distance.
3.2. Tackling Environment Dynamics
Figure 7 shows the same scene taken twice with the SwissRanger SR-2 device at two different integration times. The integration time for the measurement shown in the left figure was not adjusted properly with respect to the near object. The disturbed area of the hand indicates that effect. However, this measurement might provide better values in the background area.
Figure 7. Two sample distance measurements of a close hand in false color representation. The left measurement was taken with an integration time of 15 ms, which is definitely too high for the near object. The right measurement was taken with an integration time of 4 ms and fits the scene better
3.2.1 Setting up the integration time
The diffuse reflectivity of objects is an important parameter for precise measurements. Typically, a scene comprises objects with different reflectivity. It can vary from 2% for a black rubber tire up to 100% for white paper at a wavelength of 900 nm [Lange, 2000]. To explain the compromise that has to be met, let us assume a scenario containing only a small single object with high reflectance close to the sensor. Since the integration time can only be adjusted for all pixel elements together, one might guess that the best strategy is to prevent every pixel from oversaturating. Focusing on the small object, however, will most likely decrease the accuracy for the remaining scene. This also means that the signal level for objects with low diffuse reflectivity will be low if objects with high reflectivity are in the same range of vision during the measurement.
One suitable method is to merge multiple captures taken at different integration times. This reduces the frame rate but increases the dynamic range.
In [May, 2006] we have presented an alternative integration time controller based on mean intensity measurements. This solution was found empirically and showed a suitable dynamic range for our experiments without affecting the frame rate. It also alleviates the effects of small disturbed areas. The averaged amplitude in dependency of the intensity can be seen in figure 8.
Figure 8. Relation between mean amplitude and mean intensity. Note that the characteristic is now a mixture of the characteristics of the single pixels (cf. figure 6)
We used a proportional closed-loop controller to adjust the integration time from one frame to the next. The control variable is updated as c_{t+1} = V_p D_t + c_t for grabbing the next frame, where c_t and c_{t+1} are the integration times of two consecutive frames, D_t the control deviation, V_p the proportional closed-loop gain and c_0 a suitable initial value. The control deviation variable I_a was assigned a value of 15000 for the illustrations in this chapter. It has been chosen conservatively with respect to the characteristic shown in figure 8.
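A minimal sketch of such a proportional controller is given below. It assumes that the control deviation D_t is the difference between the mean intensity set-point I_a and the measured mean intensity of the current frame, and that the result is clamped to the SR-2 register range; the gain value is purely illustrative.

```python
def update_integration_time(c_t, mean_intensity, i_a=15000, v_p=0.002,
                            c_min=1, c_max=255):
    """One proportional control step c_{t+1} = V_p * D_t + c_t.  The control
    deviation D_t is assumed to be the difference between the mean intensity
    set-point i_a and the measured mean intensity; the result is clamped to
    the valid SR-2 register range (multiples of 255 microseconds)."""
    d_t = i_a - mean_intensity
    c_next = v_p * d_t + c_t
    return int(min(max(round(c_next), c_min), c_max))

# A too bright frame (mean intensity above the set-point) shortens the
# integration time for the next frame; a too dark frame lengthens it.
print(update_integration_time(c_t=40, mean_intensity=30000))   # 10
print(update_integration_time(c_t=40, mean_intensity=5000))    # 60
```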
Independent of the chosen control method, the integration time always has to be adjusted with respect to the application. A change of the integration time causes an apparent motion in the distance measurement values. Therefore, an application using an automatic integration time controller has to take the presence of the control deviation into account.
The newest model from Mesa Imaging, the SwissRanger SR-3000, provides an automatic integration time exposure based on the amplitude values. For most scenes it works properly. In some cases of fast scene changes it can occur that a proper integration time cannot be found. This is due to the missing intensity information caused by the on-chip backlight suppression: the amplitude diagram does not provide an unambiguous working point. A short discussion of the backlight suppression is given in section 3.3.
3.2.2 Consideration of accuracy
It is not possible to guarantee certain accuracies for measurements of unknown scenes, since they are affected by the influences mentioned above. However, the possibility to derive the accuracy information for each pixel eases that circumstance. In section 4 two examples using this information will be explained. For determining the accuracy, equation (7) is used. Assuming that the parameters of the camera (in general this is the integration time for users) are optimally adjusted, the accuracy only depends on the object's distance and its reflectivity. For indoor applications with little background illumination, the accuracy decreases linearly with distance (see equation (8)). Applying a simple threshold is one option for filtering out inaccurate parts of an image. Setting a suitable threshold primarily depends on the application. Lange stated with respect to the dependency between accuracy and distance [Lange, 2000]: "This is an important fact for navigation applications, where a high accuracy is often only needed close to the target." This statement does not hold for every other application, such as mapping, where unambiguousness is essential for registration. Unambiguous tokens are often distributed over the entire scene. Larger distances between these tokens provide geometrically higher accuracies for the alignment of two scans. After this consideration, increasing the threshold linearly with the distance suggests itself for indoor applications. This approach enlarges the information gain from the background and can be seen in figure 9.
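The following Python sketch illustrates this filtering strategy: the per-pixel standard deviation is estimated with equation (7) and compared against a threshold that grows linearly with distance. The modulation frequency, base threshold and slope are assumed values for illustration.

```python
import numpy as np

C = 299_792_458.0   # speed of light in m/s

def accuracy_map(amplitude, intensity, f_mod=20e6):
    """Per-pixel standard deviation following equation (7):
    delta_R = c / (4 * f_mod) * sqrt(I_l + I_b) / (sqrt(2) * A)."""
    return C / (4.0 * f_mod) * np.sqrt(intensity) / (np.sqrt(2.0) * amplitude)

def accuracy_filter(points, amplitude, intensity, distance,
                    base_threshold=0.05, slope=0.01, f_mod=20e6):
    """Keep only points whose estimated standard deviation stays below a
    threshold that grows linearly with distance (indoor case of equation (8));
    base_threshold and slope are illustrative values."""
    delta_r = accuracy_map(amplitude, intensity, f_mod)
    keep = delta_r < (base_threshold + slope * distance)
    return points[keep], keep

# Illustrative usage with random data standing in for real SR-2 frames.
amp = np.random.uniform(50.0, 2000.0, size=124 * 160)
inten = np.random.uniform(1e3, 6e4, size=124 * 160)
dist = np.random.uniform(0.5, 7.5, size=124 * 160)
points = np.stack([dist, dist, dist], axis=1)          # stand-in 3D points
filtered, mask = accuracy_filter(points, amp, inten, dist)
```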
A light source in the scene decreases the reachable accuracy. The influence of the accuracy threshold can be seen in figure 10. Disturbed areas are reliably removed. The figure also shows that the small disturbing area of the lamp does not much influence the integration time controller based on mean intensity values, so that the surrounding area is still determined precisely.
Figure 9. Two images taken with a SwissRanger SR-2 device of the same scene. Left image: without filtering. Right image: with accuracy filter. Only data points with an accuracy better than 50 mm remain
Figure 10. Influence of light emitting sources. Top row: the light source is switched off. Bottom row: the light source is switched on. Note that the disturbed area could reliably be detected
3.3. Latest Improvements and Expected Future Innovations
Considering equation (7), a large background illumination (I_b >> I_l) highly affects the sensor's accuracy by increasing the shot noise and lowering its dynamics. Some sensors nowadays are equipped with background light suppression functionalities, e.g. spectral filters or circuits for constant component suppression, which increase the signal-to-noise ratio [Moeller et al., 2005], [Buettgen et al., 2006].
Suppressing the background signal has one drawback. The amplitude then represents the infrared reflectivity and not the reflectivity we sense as human beings. This might affect computer vision systems inspired by the human visual sense, e.g. [Frintrop, 2006].
Some earlier works also proposed a circuit structure with pixel-wise integration capability [Schneider, 2003], [Lehmann, 2004]. Unfortunately, this technology did not become widely accepted due to a lower fill factor. Lange explained the importance of the optical fill factor as follows [Lange, 2000]: "The optical power of the modulated illumination source is both expensive and limited by eye-safety regulations. This requires the best possible optical fill factor for an efficient use of the optical power and hence a high measurement resolution."
4. 3D Vision Applications
This section investigates the practical influence of the above considerations by presenting some typical applications in the domain of autonomous robotics currently investigated by us. Since 3D cameras are comparatively new compared with other 3D sensors like laser scanners or stereo cameras, the porting of algorithms defines a novelty per se; e.g. one of the first 3D maps created with registration approaches, which up to now have mostly been applied to laser scanner systems, was presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems in 2006 [Ohno, 2006]. The difficulties encountered with these sensors are discussed in this section. Furthermore, a first examination of the capabilities for tackling environment dynamics follows.
4.1. Registration of 3D Measurements
One suitable registration method for range data sets is the Iterative Closest Points (ICP) algorithm, which was introduced by Besl and McKay in 1992 [Besl & McKay, 1992]. For the reader's convenience, a brief description of this algorithm is repeated in this section.
Given two independently acquired sets of 3D points, M (model set) and D (data set), which correspond to a single shape, we aim to find the transformation consisting of a rotation R and a translation t which minimizes the following cost function:

\[ E(R, t) = \sum_{i=1}^{|M|} \sum_{j=1}^{|D|} \omega_{i,j} \left\| m_i - (R d_j + t) \right\|^2 \tag{9} \]

ω_{i,j} is assigned 1 if the i-th point of M describes the same point in space as the j-th point of D. Otherwise ω_{i,j} is 0. Two things have to be calculated: first, the corresponding points, and second, the transformation (R, t) that minimizes E(R, t) on the basis of the correspondences. The ICP algorithm calculates the point correspondences iteratively. In each iteration step, the algorithm selects the closest points as correspondences and calculates the transformation (R, t) minimizing equation (9). The assumption is that in the final iteration step the point correspondences are correct. Besl and McKay prove that the method terminates in a minimum [Besl & McKay, 1992]. However, this theorem does not hold in our case, since we use a maximum tolerable distance d_max for associating the scan data. Such a threshold is required though, given that 3D scans overlap only partially. The distance and the degree of overlap have a non-negligible influence on the registration accuracy.
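A minimal sketch of the ICP loop described above is given below; it pairs points with a k-d tree, rejects pairs beyond d_max and estimates (R, t) with the SVD-based closed-form solution, which is one common way of minimizing equation (9) and not necessarily the exact variant used in our system.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(model, data, d_max=0.25, iterations=30):
    """Minimal ICP sketch: iteratively pair each data point with its closest
    model point (rejecting pairs farther than d_max) and estimate the rigid
    transform (R, t) via the SVD of the cross-covariance matrix."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(model)
    current = np.asarray(data, dtype=float).copy()
    for _ in range(iterations):
        dist, idx = tree.query(current)
        mask = dist < d_max                      # maximum tolerable distance
        if mask.sum() < 3:
            break
        m, d = model[idx[mask]], current[mask]
        mc, dc = m.mean(axis=0), d.mean(axis=0)  # centroids
        H = (d - dc).T @ (m - mc)                # cross-covariance
        U, _, Vt = np.linalg.svd(H)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:            # avoid reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mc - R_step @ dc
        current = current @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```

Registering one frame against a previous one then amounts to calling icp(previous_points, current_points) with both point clouds given as N x 3 arrays.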
4.2. 3D Mapping Invading the Domain of Laser Scanners
The ICP approach is one of the standard registration approaches used for data from 3D laser scanners. Since the degree of overlap is important for the registration accuracy, the large field of view and the long range of laser scanners are advantages over 3D cameras (compare table 1 with table 3). The following section describes our mapping experiments with the SwissRanger SR-2 device.
The image in figure 11 shows a single scan taken with the IAIS 3D laser scanner. The scan provides a 180 degree field of view. In this example, the entire scene can be brought into the range of vision by taking only two scans, while a sufficient overlap for registering both scans is still guaranteed. Of course there are some uncovered areas due to shadowing effects, but that is not the important point for comparing the quality of registration. A smaller field of view makes it necessary to take more scans to cover the same area within the range of vision. The image in figure 12 shows an identical scene taken with a SwissRanger SR-2 device. 18 3D images were necessary for a circumferential view with sufficient overlap. Each 3D image was registered with its previous 3D image using the ICP approach.
Figure 12. 3D map created from multiple SwissRanger SR-2 3D images. The map was registered with the ICP approach. Note the gap at the bottom of the image, which indicates the accumulated error
4.2.1. Closing the Loop
The registration of 3D image sequences causes a non-negligible accumulation error. This effect is represented by the large gap at the bottom of the image in figure 12. These effects have also been investigated in detail for large 3D maps taken with 3D laser scanners, e.g. in [Surmann et al., 2004], [Cole & Newman, 2006]. For a smaller field of view these effects occur faster, because of the smaller size of the integration steps. Detecting the closure of a loop can be used in these cases to distribute the overall error over all 3D images. This implies that the presently captured scene has to be recognized as one of the previously captured scenes.
4.3. 3D Object Localization
Using such an object localization system in real-world applications brings some challenges, which are discussed in the next subsection.
4.3.1 Challenges
The first challenge is the pose ambiguity of many objects. Figure 13 shows a typical object for a home service-robot application, a box of instant mashed potatoes. The cuboid shape of the box has three planes of symmetry, which results in ambiguities of the pose. Considering only the shape of the object, the result of the object localization is very often not a single pose but a set of possible poses, depending on the number of symmetry planes. For determining the real pose of an object, other information than only range data is required, for example the texture. Most 3D cameras additionally provide gray-scale images which give information about the texture, but with the provided resolution of around 26,000 pixels and an aperture angle of around 45° the resolution is not sufficient for stable texture identification. Instead, e.g., a color camera system can be used to resolve this ambiguity. This requires a close cooperation between the object localization system and another classification system which uses the color camera images, as well as a calibration between the two sensor systems. As soon as future 3D cameras provide higher resolutions and maybe also color images, object identification and localization can be done using only data from a 3D camera.
Figure 13. An instant mashed potatoes box. Because of the symmetry planes of the cuboid shape, the pose determination gives a set of possible poses. Left: color image from a digital camera. Right: 3D range image from the SwissRanger SR-2
Another challenge is closely related to the properties of 3D cameras and the resulting ability to provide precise range images of the objects. It was shown that the ICP based scan matching algorithm is very reliable and precise with data from a 3D laser scanner, which always provides a full point cloud of the scanned scene [Nuechter, 2006], [Mueller, 2006]. Its accuracy is static or at least proportional to the distance. As described in section 3.2.2, the accuracy of 3D camera data is influenced by several factors. One of these factors, for example, is the reflectivity of the measured objects. The camera is designed for measuring diffuse light reflections, but many objects are made of a mixture of specular and diffuse reflecting materials. Figure 14 shows color images from a digital camera and range images from the SwissRanger SR-2 of a tin from different viewpoints. The front view gives reliable range data of the tin, since the cover of the tin is made of paper, which gives diffuse reflections. In the second image the cameras are located a little bit above, and the paper cover as well as the highly reflective metal top is visible in the color image. The range image does not show the top, since the calculated accuracy of these data points is worse than 30 mm. This is a loss of information which highly influences the result of the ICP matching algorithm.
Figure 14. Images of a tin from different viewpoints. Depending on the reflectivity of the object's material, the range data accuracy differs. In the range images, all data points with a calculated accuracy worse than 30 mm are rejected. Left: the front view gives good 3D data since the tin cover reflects diffusely. Middle: from a viewpoint above the tin, the cover as well as the metal top is visible. The high reflectivity of the top results in bad accuracy, so that only the cover part is visible in the range image. Right: from this point of view, only the highly reflective metal top is visible. In the range image only some small parts of the tin are visible
4.4. 3D Feature Tracking
Using 3D cameras to full capacity necessitates taking advantage of their high frame rate. This enables the consideration of environment dynamics. In this subsection a feature tracking application is presented as an example of applications that demand high frame rates. Most existing approaches are based on 2D grayscale images from 2D cameras, since these were the only affordable sensor type with a high update rate and resolution in the past. An important assumption for the calculation of features in grayscale images is the intensity constancy assumption: changes in intensity are caused only by motion. The displacement between two images is also called optical flow. An extension to 3D can be found in [Vedula et al., 1999] and [Spies et al., 2002], where the intensity constancy assumption is combined with a depth constancy assumption so that the displacement between two images can be calculated more robustly. This section will not handle scene flow. However, the depth values of features in the amplitude space should be examined so that the following two questions are answered: Is the resolution and quality of the amplitude images from 3D cameras good enough to
Max Dev [m] (features 1 to 10): 0.068, 0.028, 0.030, 0.169, 0.158, 0.037, 0.196, 0.119, 0.018, 0.025
Table 4. Distance values and deviation of the first ten features calculated from the scene shown in the left image of figure 15 with no threshold applied
Experimental standard deviation = 0.017 m, threshold ΔR = 50 mm

Feature # | Considered | Mean Dist [m] | Min Dev [m] | Max Dev [m]
1  | Yes | -2.592 | -0.110 | 0.056
2  | Yes | -2.684 | -0.017 | 0.029
3  | Yes | -2.881 | -0.031 | 0.017
4  | No  | -2.901 | -0.158 | 0.125
5  | Yes | -2.733 | -0.176 | 0.118
6  | Yes | -2.751 | -0.025 | 0.030
7  | No  | -2.863 | -0.185 | 0.146
8  | No  | -2.697 | -0.169 | 0.134
9  | Yes | -2.760 | -0.019 | 0.015
10 | Yes | -2.711 | -0.017 | 0.020

Table 5. Distance values and deviation of the first ten features calculated from the scene shown in the left image of figure 15 with a threshold of 50 mm
The reason for the high standard deviation is the noise behavior at edges: the signal reflected by an edge is a mixture of the background and the object signal. A description of this effect is given in [Gut, 2004]. Applying an accuracy threshold alleviates this effect, and the standard deviation is decreased significantly. This approach has to be balanced with the number of features found in an image: applying a more restrictive threshold might decrease the number of features too much. For the example described in this section, an accuracy threshold of ΔR = 10 mm decreases the number of features to 2 and the experimental standard deviation to 0.01 m.
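As an illustration, the following sketch applies OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker (Shi & Tomasi, 1994; Lucas & Kanade, 1981) to amplitude images and reads out the depth of every successfully tracked feature. The 8-bit scaling as well as the detector and tracker parameters are assumptions for illustration.

```python
import cv2
import numpy as np

def track_amplitude_features(amp_prev, amp_next, depth_next, max_features=50):
    """Detect Shi-Tomasi features in the previous amplitude image, track them
    into the next one with pyramidal Lucas-Kanade, and read out the depth of
    each successfully tracked feature.  Amplitude images are scaled to 8 bit
    first; the scaling and the tracker parameters are illustrative."""
    prev8 = cv2.normalize(amp_prev, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    next8 = cv2.normalize(amp_next, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    pts = cv2.goodFeaturesToTrack(prev8, maxCorners=max_features,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return []
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev8, next8, pts, None)
    tracked = []
    for p, q, ok in zip(pts.reshape(-1, 2), nxt.reshape(-1, 2), status.ravel()):
        if ok:
            x, y = int(round(q[0])), int(round(q[1]))
            if 0 <= y < depth_next.shape[0] and 0 <= x < depth_next.shape[1]:
                tracked.append((p, q, float(depth_next[y, x])))
    return tracked
```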
Figure 15. Left image: amplitude image showing the tracking of KLT features between two consecutive frames. Right image: side view of a 3D point cloud. Note the appearance of jump edges at the border area
6. References
Besl, P. & McKay, N. (1992). A Method for Registration of 3-D Shapes, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 14, No. 2, (February 1992) pp. 239-256,
ISSN: 0162-8828
Buettgen, B.; Oggier, T.; Lehmann, M.; Kaufmann, R.; Neukom, S.; Richter, M.; Schweizer,
M.; Beyeler, D.; Cook, R.; Gimkiewicz, C.; Urban, C.; Metzler, P.; Seitz, P.;
Lustenberger, F. (2006). High-speed and high-sensitive demodulation pixel for 3D
imaging, In: Three-Dimensional Image Capture and Applications VII. Proceedings of
SPIE, Vol. 6056, (January 2006) pp. 22-33, DOI: 10.1117/12.642305
Cole, M. D. & Newman P. M. (2006). Using Laser Range Data for 3D SLAM in Outdoor
Environments, In Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA), pp. 1556-1563, Orlando, Florida, USA, May 2006
CSEM SA (2007), SwissRanger SR-3000 - miniature 3D time-of-flight range camera, Retrieved
January 31, 2007, from http://www.swissranger.ch
Frintrop, S. (2006). A Visual Attention System for Object Detection and Goal-Directed Search,
Springer-Verlag, ISBN: 3540327592, Berlin/Heidelberg
Fraunhofer IAIS (2007). 3D-Laser-Scanner, Fraunhofer Institute for Intelligent Analysis and
Information Systems, Retrieved January 31, 2007, from http://www.3d-scanner.net
Gut, O. (2004). Untersuchungen des 3D-Sensors SwissRanger, Eidgenössische Technische
Hochschule Zürich, Retrieved January 21, 2007, from
http://www.geometh.ethz.ch/publicat/diploma/gut2004/Fehlereinfluesse/index
_fe.html
Hokuyo Automatic (2007), Scanning laser range finder for robotics URG-04LX, Retrieved
January 31, 2007, from http://www.hokuyo-aut.jp/products/urg/urg.htm
Ibeo Automobile Sensor GmbH (2007), Ibeo ALASCA XT Educational System, Retrieved
January 31, 2007, from
http://www.ibeo-as.com/deutsch/products_alascaxtsingle_educational.asp
Kawata, H.; Ohya, A.; Yuta, S.; Santosh, W. & Mori, T. (2005). Development of ultra-small
lightweight optical range sensor system, International Conference on Intelligent Robots
and Systems 2005, Edmonton, Alberta, Canada, August 2005.
Lange, R. (2000). 3D time-of-flight distance measurement with custom solid-state image
sensors in CMOS/CCD-technology, Dissertation, University of Siegen, 2000
Lehmann, M.; Buettgen, B.; Kaufmann, R.; Oggier, T.; Stamm, M.; Richter, M.; Schweizer,
M.; Metzler, P.; Lustenberger, F.; Blanc, N. (2004). CSEM Scientific & technical Report
2004, CSEM Centre Suisse d'Electronique et de Microtechnique SA, Retrieved
January 20, 2007, from
http://www.csem.ch/corporate/Report2004/pdf/SR04-photonics.pdf
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints, International
Journal of Computer Vision, Vol. 60, No. 2, (November 2004) pp. 91-110, ISSN: 0920-5691
Lucas, B. D. & Kanade, T. (1981). An Iterative Image Registration Technique with an
Application to Stereo Vision, In Proceedings of the 7th International Conference on
Artificial Intelligence (IJCAI), pp. 674-679, Vancouver, British Columbia, August 1981
May, S.; Werner, B.; Surmann, H.; Pervoelz, K. (2006). 3D time-of-flight cameras for mobile
robotics, In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pp. 790-795, Beijing, China, October 2006
Moeller, T.; Kraft H.; Frey, J.; Albrecht, M.; Lange, R. (2005). Robust 3D Measurement with
PMD Sensors, PMDTechnologies GmbH. Retrieved January 20, 2007, from
http://www.pmdtec.com/inhalt/download/documents/RIM2005-PMDTechRobust3DMeasurements.pdf.
Mueller, M.; Surmann, H.; Pervoelz, K. & May, S. (2006). The Accuracy of 6D SLAM using
the AIS 3D Laser Scanner, In Proceedings of the IEEE International Conference on
Multisensor Fusion and Integration for Intelligent Systems (MFI), Heidelberg,
Germany, September 3-6, 2006
Nuechter A., Lingemann K., Hertzberg J. & Surmann, H. (2005). Accurate Object
Localization in 3D Laser Range Scans, In Proceedings of the 12th International
Conference on Advanced Robotics (ICAR '05), ISBN 0-7803-9178-0, pages 665 - 672,
Seattle, USA, July 2005.
Nuechter A. (2006). Semantische dreidimensionale Karten für autonome mobile Roboter,
Dissertation, Akademische Verlagsgesellschaft Aka, ISBN: 3-89838-303-2, Berlin
Ohno, K.; Nomura, T.; Tadokoro, S. (2006). Real-Time Robot Trajectory Estimation and 3D
Map Construction using 3D Camera, In Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 5279-5285, Beijing, China,
October 2006
PMD Technologies (2007), PMD Cameras, Retrieved January 31, 2007, from
http://www.pmdtec.com/e_inhalt/produkte/kamera.htm
RTS Echtzeitsysteme (2007), Mobile Serviceroboter, Retrieved January 31, 2007, from
http://www.rts.uni-hannover.de/index.php/Mobile_Serviceroboter
Schneider, B. (2003). Der Photomischdetektor zur schnellen 3D-Vermessung für
Sicherheitssysteme und zur Informationsübertragung im Automobil, Dissertation,
University of Siegen, 2003
Shi, J. & Tomasi, C. (1994). Good Features to Track, In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 595-600, Seattle, June 1994
Spies, H.; Jaehne, B.; Barron, J. L. (2002). Range Flow Estimation, Computer Vision Image
Understanding (CVIU2002) 85:3, pp.209-231, March, 2002
Surmann, H.; Nuechter, A.; Lingemann K. & Hertzberg, J. (2003). An autonomous mobile
robot with a 3D laser range finder for 3D exploration and digitalization of indoor
environments, Robotics and Autonomous Systems, 45, (December 2003) pp. 181-198
Surmann, H.; Nuechter, A.; Lingemann, K. & Hertzberg, J. (2004). 6D SLAM - A Preliminary
Report on Closing the Loop in Six Dimensions, In Proceedings of the 5th IFAC
Symposium on Intelligent Autonomous Vehicles (IAV), Lisbon, Portugal, July 2004
Thrun, S.; Fox, D. & Burgard, W. (2000). A real-time algorithm for mobile robot mapping
with application to multi-robot and 3D mapping, In Proceedings of the IEEE
International Conference on Robotics and Automation (ICRA), pp. 321-328, ISBN: 0-7803-5886-4, San Francisco, April 2000
Vedula, S.; Baker, S.; Rander, P.; Collins, R. & Kanade, T. (1999). Three-Dimensional Scene
Flow, In Proceedings of the 7th International Conference on Computer Vision (ICCV), pp.
722-729, Corfu, Greece, September 1999
Wulf, O. & Wagner, B. (2003). Fast 3d-scanning methods for laser measurement systems, In
Proceedings of International Conference on Control Systems and Computer Science
(CSCS14), Bucharest, Romania, February 2003
12
A Visual Based Extended Monte Carlo
Localization for Autonomous Mobile Robots
Wen Shang1 and Dong Sun2
1Suzhou
2Department
1. Introduction
Over the past decades, there has been tremendous research on mobile robots aiming at increasing the autonomy of mobile robot systems. As a basic problem in mobile robots, self-localization plays a key role in various autonomous tasks (Kortenkamp et al., 1998). Considerable research has been done on self-localization of mobile robots (Borenstein et al., 1996; Chenavier & Crowley, 1992; Jensfelt & Kristensen, 2001; Tardos et al., 2002), with the goal of estimating the robot's pose (position and orientation) by proprioceptive sensors and exteroceptive techniques. Since proprioceptive sensors (e.g., dead-reckoning) are generally not sufficient to locate a mobile robot, exteroceptive techniques have to be used to estimate the robot's configuration more accurately. Some range sensors, such as sonar sensors (Drumheller, 1987; Tardos et al., 2002; Wijk & Christensen, 2000) and laser range finders (Castellanos & Tardos, 1996), can be employed for robot localization. However, the data obtained from sonar sensors is usually noisy due to specular reflections, and laser scanners are generally expensive. As a result, other sensory systems with more reliable sensing feedback and lower price, such as visual sensors (Chenavier & Crowley, 1992; Dellaert et al., 1999; Gaspar et al., 2000), are more in demand for mobile robot localization.
Probabilistic localization algorithms (Chenavier & Crowley, 1992; Fox et al., 1999b; Nourbakhsh et al., 1995) are a useful systematic method in sensor-based localization, providing a good framework by iteratively updating the posterior distribution of the pose space. As a state estimation problem, pose estimation with a linear Gaussian distribution (unimodal) can be done by Kalman filters for pose tracking (Chenavier & Crowley, 1992; Leonard & Durrant-Whyte, 1991), which exhibit good performance under the condition that the initial robot pose is known. The nonlinear non-Gaussian (multimodal) case can be solved by multi-hypothesis Kalman filters (Jensfelt & Kristensen, 2001) or Markov methods (Fox et al., 1999b; Nourbakhsh et al., 1995) for global localization. Multi-hypothesis Kalman filters use mixtures of Gaussians and suffer from the drawbacks inherent in Kalman filters. Markov methods employ piecewise constant functions (histograms) over the space of all possible poses, so the computational burden and localization precision depend on the discretization of the pose space.
By representing probability densities with sets of samples and using sequential Monte Carlo importance sampling, Monte Carlo localization (MCL) (Dellaert et al., 1999; Fox et al., 1999a) represents non-linear and non-Gaussian models with great robustness and can focus the computational resources on regions with high likelihood. Hence MCL has attracted considerable attention and has been applied in many robot systems. MCL shares a similar idea to that of particle filters (Doucet, 1998) and condensation algorithms (Isard & Blake, 1998) in computer vision.
As a sample based method with a stochastic nature, MCL can suffer from the observation deviation or over-convergence problem when the sample size is small or when some poorly modeled events are encountered (to be discussed in detail in Section 2.2) (Carpenter et al., 1999; Thrun et al., 2001). Many approaches have been proposed to improve the efficiency of the MCL algorithm. A method with an adaptive sample size, varying according to the uncertainty of the sample distribution, was presented in (Fox, 2003). However, the sample size of this method must meet a condition on the error bound of the distribution, which becomes a bottleneck for real global localization. A resampling process introducing a uniform distribution of samples was further applied for the case of non-modeled movements (Fox et al., 1999a). Likewise, a sensor resetting localization algorithm (Lenser & Veloso, 2000) was implemented using a resampling process from visual feedback, based on the assumption that visual features with range and bearing are distinguishable. Such a method may be applicable to RoboCup, but not to a general office environment. Several other visual based Monte Carlo methods (Kraetzschmar & Enderle, 2002; Rofer & Jungel, 2003) were implemented under the condition that the environment features must be unique. A mixture MCL (Thrun et al., 2001) and condensation with planned sampling (Jensfelt et al., 2000) incorporated the resampling process into MCL for efficiency improvement, which requires a fast sensor sampling rate in every cycle.
In order to achieve higher localization precision and improve the efficiency of MCL, a new approach to an extended Monte Carlo localization (EMCL) algorithm is presented here. The basic idea is to introduce two validation mechanisms to check abnormities (e.g., the observation deviation and the over-convergence phenomenon) of the distribution of weight values of the sample sets, and then to employ a resampling strategy to reduce their influences. According to the verification, different resampling processes are employed, in which samples extracted either from importance resampling or from the observation model form the true posterior distribution. This strategy can effectively prevent premature convergence and can be realized with a smaller sample size. A visual-based extended MCL is further implemented. The common polyhedron visual features in office environments are recognized by a Bayesian network that combines perceptual organization and a color model. This recognition is robust with respect to individual low-level features and can be conveniently transferred to similar environments. Resampling from the observation model is achieved by a triangulation method in the pose constraint region.
The remainder of this chapter is organized as follows. Section 2 introduces the conventional MCL algorithm and discusses the existing problems when applied to real situations. Section 3 proposes the extended MCL (EMCL) with brief implementation explanations showing the difference from conventional MCL, followed by the implementation details of a visual-based EMCL application example in Section 4. Section 5 presents experiments conducted on a mobile robot system to verify the proposed approach. Finally, conclusions of this work are given in Section 6.
The robot's belief about its pose at time k is represented by a set of weighted samples s_k^(i) = <X_k^(i), w_k^(i)>, where X_k^(i) denotes a pose hypothesis, w_k^(i) denotes the likelihood of X_k^(i), i is the index of the weighted samples, and N_k denotes the number of samples (or sample size). It is assumed that Σ_{i=1}^{N_k} w_k^(i) = 1, so that the weight values can be regarded as probabilities.
During the localization process, MCL is initialized with a set of samples reflecting the initial knowledge of the robot's pose. It is usually assumed that the distribution is uniform for global localization when the initial pose is unknown, and a narrow Gaussian distribution when the initial pose is known. Then the samples are recursively updated by executing the following three steps (see Table 1).
Step 1: Sample update with robot motion (prediction step)
The probabilistic proposal distribution of the robot pose in the motion update is

\[ q_k = p(X_k \mid X_{k-1}, u_{k-1}) \cdot Bel(X_{k-1}) \tag{1} \]

where p(X_k | X_{k-1}, u_{k-1}) denotes the probabilistic density of the motion that takes into account robot properties such as drift, translational and rotational errors, u_{k-1} = [Δx_{k-1}, Δy_{k-1}, Δθ_{k-1}]^T denotes the variation of the robot pose at time k-1, and Bel(X_{k-1}) denotes the posterior distribution of the robot pose X_{k-1}. Then, a new sample set S̄_k with elements <X̄_k^(i), w̄_k^(i)> is extracted from the proposal distribution q_k by applying the above motion update to the posterior distribution, where X̄_k^(i) and w̄_k^(i) denote the extracted pose and weight after the motion update, respectively.
Step 2: Belief update with observations (sensor update step)
The robot's belief about its pose is updated with observations, mostly from range sensors. Introduce a probabilistic observation model p(Z_k | X̄_k^(i)), where Z_k denotes the measurements from the sensor. Re-weight all samples of S̄_k extracted in the prediction step; we then have
Algorithm Conventional MCL
Prediction step:
  for each i = 1, ..., N_k
    draw sample X̄_k^(i) from S_{k-1} according to (1)
    w̄_k^(i) = 1/N_k
    w̃_k^(i) = w̄_k^(i) / Σ_{j=1}^{N_k} w̄_k^(j)
    <X̄_k^(i), w̃_k^(i)> → S̃_k
  end for
Resampling step (importance resampling):
  for each s̃_k^(i) = <X̄_k^(i), w̃_k^(i)> in S̃_k
    cw(s̃_k^(i)) = Σ_{j=1}^{i} w̃_k^(j)        {cumulative distribution}
  end for
  for each i = 1, ..., N_k
    r = rand(0, 1)                             {random number r}
    j = 1
    while (j ≤ N_k) do
      if (cw(s̃_k^(j)) > r)
        X_k^(i) = X̄_k^(j)
        w_k^(i) = 1/N_k
        <X_k^(i), w_k^(i)> → S_k, break
      else j = j + 1
      end if
    end while
  end for
Table 1. Conventional MCL algorithm
\[ \hat{w}_k^{(i)} = \bar{w}_k^{(i)} \cdot p(Z_k \mid \bar{X}_k^{(i)}) \tag{2} \]

where ŵ_k^(i) denotes the non-normalized weight during the sensor update. The weights are normalized as follows to ensure that all beliefs sum up to 1:

\[ \tilde{w}_k^{(i)} = \frac{\hat{w}_k^{(i)}}{\sum_{j=1}^{N_k} \hat{w}_k^{(j)}} \tag{3} \]

Then the sample set after the sensor update, denoted by S̃_k with elements <X̄_k^(i), w̃_k^(i)>, is obtained. The observation model p(Z_k | X̄_k^(i)) is also named the importance factor (Doucet, 1998), which reflects the mismatch between the probabilistic distribution q_k after the prediction step and the current observations from the sensor.
Step 3: Resampling step
The resampling step is to reduce the variance of the distribution of weight values of the samples and to focus computational resources on samples with high likelihood. A new sample set S_k is extracted whose samples are located near the robot's true pose. This step is effective for localization, since samples with lower weights are ignored while those with higher weights are replicated. As the samples are drawn based on the importance factors, this step is usually called importance resampling (Konolige, 2001). The implementation of such importance resampling is shown in Table 1.
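A compact NumPy sketch of this importance resampling step is shown below; it replaces the linear scan of Table 1 with a binary search over the cumulative weights, which is an implementation choice rather than part of the algorithm.

```python
import numpy as np

def importance_resampling(poses, weights, rng=None):
    """Draw N new samples with replacement according to the normalized
    weights and reset every weight to 1/N (resampling step of Table 1)."""
    rng = np.random.default_rng() if rng is None else rng
    poses = np.asarray(poses, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalization, cf. equation (3)
    cumulative = np.cumsum(w)            # cumulative distribution cw
    r = rng.random(len(poses))           # one random number per new sample
    idx = np.searchsorted(cumulative, r)
    return poses[idx], np.full(len(poses), 1.0 / len(poses))

# poses is an (N, 3) array of [x, y, theta] hypotheses.
poses = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.1], [2.0, 1.0, 0.2]])
new_poses, new_weights = importance_resampling(poses, [0.1, 0.7, 0.2])
```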
2.2 Problems of Conventional MCL
When applied to real situations, the conventional MCL algorithm suffers from some shortcomings. The samples are actually extracted from a proposal distribution (here the motion model). If the observation density deviates from the proposal distribution, the (non-normalized) weight values of most of the samples become small. This leads to poor or even erroneous localization results. Such a phenomenon results from two possible reasons. One is that a too small sample size is used, and the other is due to poorly modeled events such as a kidnapped movement (Thrun et al., 2001). To solve the problem, either a large sample size is employed to represent the true posterior density to ensure stable and precise localization, or a new strategy is employed to address the poorly modeled events.
Another problem when using conventional MCL is that samples often converge too quickly to a single or a few high-likelihood poses (Luo & Hong, 2004). This is undesirable for localization in symmetric environments, where multiple distinct hypotheses have to be tracked for periods of time. This over-convergence phenomenon is caused by the use of a too small sample size, as well as a small sensor noise level. The viewpoint that the smaller the sensor noise level is, the more likely over-convergence occurs, is a bit counter-intuitive, but it actually leads to poor performance. Due to the negative influences of a small sample size and poorly modeled events, the implementation of conventional MCL in real situations is not trivial.
Since the sensing capabilities of most MCL implementations are achieved by sonar sensors or laser scanners, the third problem is how to effectively realize MCL with visual technology, which can more accurately reflect the true perceptual mode of natural environments.
In the proposed extended MCL algorithm, besides the prediction and sensor update steps, which are the same as in the conventional MCL, two validation mechanisms are introduced in the resampling step for checking abnormities of the distribution of weight values of the sample sets. According to the validation, different resampling processes are employed, where samples are extracted either from importance resampling or from the observation model. Table 2 gives the procedure of the proposed extended MCL algorithm.
Algorithm Extended MCL
Prediction step:
Sensor update step:
  Same as in the conventional MCL algorithm.
Resampling step (different from conventional MCL):
  Quantitatively describe the distribution of (normalized and non-normalized) weight values of the sample set.
  Two validation mechanisms:
  if (over-convergence)                               {over-convergence validation}
    sample size n_s for resampling from observations
    for each i = 1, ..., N_k - n_s
      importance resampling of X_k^(i) from S̃_k
      w_k^(i) = 1/N_k
      <X_k^(i), w_k^(i)> → S_k
    end for
    for each i = N_k - n_s + 1, ..., N_k              {sensor based resampling}
      X_k^(i) ~ p(X_k | Z_k)
      w_k^(i) = 1/N_k
      <X_k^(i), w_k^(i)> → S_k
    end for
  else if (sum of (non-normalized) weights < W_th)    {uniformity validation}
    resampling size n_s = N_k
    sensor based resampling (same as above)
  else
    importance resampling
  end if
Table 2. Extended MCL algorithm
Two Validation Mechanisms
The two validation mechanisms are the uniformity validation and the over-convergence validation.
The uniformity validation utilizes the summation of all non-normalized weight values of the sample set after the sensor update to check for the observation deviation phenomenon, in which the non-normalized weight values of the distribution are uniformly low because the observation density deviates from the proposal distribution due to some poorly modeled events. Since the samples are uniformly weighted after the prediction step and re-weighted through the sensor update step, the summation W of the non-normalized weight values of all samples can, according to (2), be expressed as

\[ W = \frac{1}{N_k} \sum_{i=1}^{N_k} p(Z_k \mid \bar{X}_k^{(i)}) \tag{4} \]

If W is lower than a threshold W_th, the observation deviation is confirmed.
The over-convergence validation is based on the effective sample size (ESS):
\[ ESS = \frac{N_k}{1 + cv^2} \tag{5} \]

where N_k denotes the sample size at time index k, and cv² denotes the squared coefficient of variation of the weight values of the samples, derived by

\[ cv^2 = \frac{\mathrm{var}(w(i))}{E^2(w(i))} = \frac{1}{N_k} \sum_{i=1}^{N_k} \left( N_k \, w(i) - 1 \right)^2 \tag{6} \]

in which E(w(i)) and var(w(i)) denote the mean and the variance of the distribution of weight values of the samples, respectively.
If the effective sample size is lower than a given threshold (a percentage of the sample size), the over-convergence phenomenon is confirmed. It is then necessary to introduce new samples if the relative change of entropy satisfies
\[ \frac{|H_c - H_p|}{H_p} > \varepsilon \tag{7} \]
where H_p and H_c denote the entropy of the distribution of weight values before and after the sensor update, respectively, and ε ∈ (0,1) is a benchmark to check the relative change of entropy, which decreases as H_p increases. The larger the difference is, the more likely over-convergence occurs. When over-convergence is confirmed in this manner, the number of new samples to be introduced is n_s = (1 − ε)(N_k − ESS).
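The following Python sketch illustrates both validation mechanisms: the uniformity check on the mean non-normalized weight (equation (4)), the effective sample size of equations (5) and (6), and an entropy-based over-convergence check in the spirit of equation (7). The exact combination of the two over-convergence criteria and the threshold values are assumptions for illustration.

```python
import numpy as np

def uniformity_check(raw_weights, w_th):
    """Uniformity validation: W from equation (4) is the mean non-normalized
    weight; observation deviation is flagged when W falls below W_th."""
    return float(np.mean(raw_weights)) < w_th

def effective_sample_size(weights):
    """Equations (5) and (6): ESS = N_k / (1 + cv^2) with normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    cv_sq = np.mean((n * w - 1.0) ** 2)
    return n / (1.0 + cv_sq)

def entropy(weights):
    """Entropy of the normalized weight distribution."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(-np.sum(w * np.log(w + 1e-12)))

def overconvergence_check(weights_before, weights_after, eps=0.2, ess_ratio=0.5):
    """Over-convergence validation: the ESS drops below a fraction of N_k and
    the relative entropy change exceeds eps.  Returns the flag and the number
    of new samples n_s = (1 - eps) * (N_k - ESS)."""
    n = len(weights_after)
    ess = effective_sample_size(weights_after)
    h_p, h_c = entropy(weights_before), entropy(weights_after)
    over = (ess < ess_ratio * n) and (abs(h_c - h_p) / max(h_p, 1e-12) > eps)
    n_s = int((1.0 - eps) * (n - ess)) if over else 0
    return over, n_s
```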
Through the analysis of the distribution of weight values of the sample set, the abnormity cases can be effectively checked by the two validation mechanisms, and thereby premature convergence and the deviation problem caused by non-modeled events can be prevented. In addition, real-time requirements can be better satisfied with a smaller sample size. Furthermore, the strategy of employing different resampling processes constructs the true posterior distribution by treating the observation model as part of the proposal distribution, which is guaranteed to be consistent with the observations even when using a smaller sample size or more precise sensors.
The motion of the robot is modeled as

X_k = f(X_{k-1}, u_{k-1}, v_{k-1})

where v_{k-1} denotes the sensor noise during the motion. Note that u_{k-1} consists of the translation Δs_{k-1} and the rotation Δθ_{k-1}, which are independent of each other and can be modeled with the odometry model (Rekleitis, 2003b).
When the robot rotates by an angle Δθ_{k-1}, the noise caused by the odometry error is modeled as a Gaussian with zero mean and a sigma proportional to Δθ_{k-1}. Therefore, the heading angle of the robot is updated by

\[ \theta_k = \theta_{k-1} + \Delta\theta_{k-1} + \delta_{\theta_{k-1}} \tag{8} \]
where δ_{θ_{k-1}} is a random noise drawn from the heading error model N(0, σ_rot Δθ_{k-1}), and σ_rot is a scale factor obtained experimentally (Rekleitis, 2003a). Likewise, there exists a translation error denoted by δ_{s_{k-1}}. Furthermore, the change in orientation during the forward translation leads to a heading deviation δ_{drift_{k-1}}. The pose of the samples can then be updated by

\[ X_k = \begin{bmatrix} x_k \\ y_k \\ \theta_k \end{bmatrix} = f(X_{k-1}, u_{k-1}, v_{k-1}) = \begin{bmatrix} x_{k-1} + (\Delta s_{k-1} + \delta_{s_{k-1}}) \cos(\theta_k) \\ y_{k-1} + (\Delta s_{k-1} + \delta_{s_{k-1}}) \sin(\theta_k) \\ \theta_{k-1} + \Delta\theta_{k-1} + \delta_{\theta_{k-1}} + \delta_{drift_{k-1}} \end{bmatrix} \tag{9} \]

where δ_{s_{k-1}} and δ_{drift_{k-1}} are random noises drawn from the error models N(0, σ_trans Δs_{k-1}) and N(0, σ_drift Δs_{k-1}), respectively, and σ_trans and σ_drift are the scale factors of these Gaussian models (Rekleitis, 2003a). The sensor noise v_{k-1} thus includes the random noise δ_{θ_{k-1}} estimated by the heading error model N(0, σ_rot Δθ_{k-1}), as well as the translational error δ_{s_{k-1}} and the drift error δ_{drift_{k-1}}, all with zero mean.
Figure 1. Sample distribution of a straight line motion with errors σ_trans = 5 and σ_drift = 1
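A minimal sketch of this sampling-based motion update is given below; the noise scale factors are illustrative placeholders for the experimentally obtained values σ_trans, σ_drift and σ_rot, not the parameters used in figure 1.

```python
import numpy as np

def sample_motion(pose, delta_s, delta_theta,
                  sigma_trans=0.05, sigma_drift=0.01, sigma_rot=0.05, rng=None):
    """Propagate a single sample through the odometry model of equations (8)
    and (9); the sigma values are arbitrary placeholders for the experimentally
    obtained scale factors."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, theta = pose
    d_theta = rng.normal(0.0, sigma_rot * abs(delta_theta))    # heading error
    d_s = rng.normal(0.0, sigma_trans * abs(delta_s))          # translation error
    d_drift = rng.normal(0.0, sigma_drift * abs(delta_s))      # drift error
    theta_k = theta + delta_theta + d_theta                    # equation (8)
    x_k = x + (delta_s + d_s) * np.cos(theta_k)                # equation (9)
    y_k = y + (delta_s + d_s) * np.sin(theta_k)
    return np.array([x_k, y_k, theta_k + d_drift])

# Propagate 1000 samples through a straight-line motion of 800 mm.
samples = [sample_motion((0.0, 0.0, 0.0), 800.0, 0.0) for _ in range(1000)]
```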
Figure 2. Compartment model (a) and Bayesian network for compartment recognition (b)
Consider a set of visual features Z_k = [z_k^1, z_k^2, ..., z_k^m] to be observed. The eigenvector of each visual feature j, denoted by z_k^j = [t_k^j, φ_k^j]^T, is composed of the feature type t_k^j and the visual angle φ_k^j relative to the camera system. These features are used to construct the observation model p(Z_k | X_k^(i)). It is assumed that the features are detected solely depending on the robot's pose. Therefore, the observation model can be specified as

\[ p(Z_k \mid X_k^{(i)}) = p(z_k^1, \ldots, z_k^m \mid X_k^{(i)}) = p(z_k^1 \mid X_k^{(i)}) \cdots p(z_k^m \mid X_k^{(i)}) \tag{10} \]
The observation model for each specific feature j is constructed based on the matching of the feature type and the deviation of the visual angle, i.e.,

\[ p(z_k^j \mid X_k^{(i)}) = \delta(t_k^j - t_k^{j0}) \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(\varphi_k^j - \varphi_k^{j0})^2}{2\sigma^2} \right) \tag{11} \]
where t_k^j and t_k^{j0} are the feature types of the current and the predictive observations, respectively, and φ_k^j and φ_k^{j0} are the corresponding visual angles of the current observations as well as the predictive ones. When the number of predictive features is larger than that of the observed ones, only the subset of predictive features with the same number as the observed features is extracted after sorting by the visual angle, and then the maximum likelihood is applied.
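The following sketch evaluates the observation model for one sample, assuming a Gaussian visual-angle likelihood for equation (11) and an already established association between observed and predicted features; the standard deviation σ is an assumed parameter.

```python
import numpy as np

def feature_likelihood(obs_type, obs_angle, pred_type, pred_angle, sigma=0.05):
    """Per-feature likelihood in the spirit of equation (11): feature types
    must match, and the visual-angle deviation is weighted by a Gaussian with
    an assumed standard deviation sigma (radians)."""
    if obs_type != pred_type:
        return 0.0
    return (np.exp(-(obs_angle - pred_angle) ** 2 / (2.0 * sigma ** 2))
            / (np.sqrt(2.0 * np.pi) * sigma))

def observation_likelihood(observed, predicted, sigma=0.05):
    """Equation (10): the observation model factorizes over the m features;
    observed and predicted are lists of (type, visual_angle) pairs that have
    already been associated with each other."""
    p = 1.0
    for (t_o, phi_o), (t_p, phi_p) in zip(observed, predicted):
        p *= feature_likelihood(t_o, phi_o, t_p, phi_p, sigma)
    return p

# Two compartment features observed at slightly different visual angles
# than predicted for a given sample pose.
print(observation_likelihood([("compartment", 0.40), ("compartment", -0.25)],
                             [("compartment", 0.38), ("compartment", -0.27)]))
```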
4.3 Resampling Step
As discussed in Section 3, the two validation mechanisms in the resampling step are first applied to check for abnormities of the distribution of weight values of the sample sets. Then, according to the validation, different resampling processes are employed, where samples are extracted either from importance resampling or from the observation model. Importance resampling has been illustrated in Section 2 (see Table 1). Here we discuss the resampling method from the visual observation model.
As mentioned, the threshold W_th for the uniformity validation should be appropriately selected. For our application example, the threshold W_th is determined as follows based on the observation model (11):
follows based on the observation model (11):
m
Wth = k w
i =1
1
2
(12)
Gonzalez, 1998; Yuen & MacDonald, 2005). Ideally, the robot can be uniquely localized with
at least three features, as shown in Figure 3 (a), where the numbers 1~3 denote the indices of the
features. In practice, however, there exists uncertainty in the pose estimation due to observation
errors and image processing uncertainty. The pose constraint region C0 shown in Figure 3 (b)
illustrates the uncertain area of the robot pose under an uncertain visual angle, where
φ⁻ = φ − Δφ, φ⁺ = φ + Δφ, and φ and Δφ are the visual angle and its uncertainty, respectively.
The uncertain pose region provides a space for resampling; incorrect samples can be gradually
excluded as the update process goes on.
Figure 3. Triangulation-based localization: (a) in the ideal case; (b) in the case with visual angle error
In the existing triangulation methods, visual features are usually limited to vertical edges
(Mouaddib & Marhic, 2000), which are quite similar to one another and large in number. In our
application, by contrast, polyhedron features recognized by Bayesian networks combine perceptual
organization and a color model, which reduces the number of features and simplifies the search.
In addition, sub-features of the polyhedrons, such as vertical edges of recognized compartments,
can also be used for triangulation.
In the process of searching features by an interpretation tree (IT), the following optimizations
can be applied:
1. Consider each polyhedron as a whole, and take the central position of the polyhedron as the
position of the corresponding feature.
2. As the visual angle of the camera system is limited, form feature groups that consist of several
adjacent features according to their spatial layout. The number of features in each feature group
should be no less than the number of observed features; the search area of the interpretation tree
is restricted to each feature group.
3. Search for matches in terms of the feature type, and then verify the features satisfying the type
validation by the triangulation method, i.e., check whether the pose constraint regions, each formed
by the visual angles of two features, intersect, as shown by the pose constraint region C0 in
Figure 3 (b).
Then, the random samples (x_k^(i), y_k^(i)) can be extracted from the pose constraint region. The
orientations of the samples are given by

θ_k^(i) = (1/m) · Σ_{j=1}^{m} ( arctan((y_k^j − y_k^(i)) / (x_k^j − x_k^(i))) − φ_k^j )        (13)
where x_k^j, y_k^j and φ_k^j are the position and visual angle of feature j, respectively, and m is the
number of observed features.
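The triangulation-based resampling step can be sketched as follows: positions are drawn inside the pose constraint region C0 (here by rejection sampling over a bounding box) and each sample's orientation is computed with Eq. (13), using arctan2 for quadrant-correct angles. The helper `region_test` and the bounding-box margin are assumptions standing in for the environment-specific construction of C0.

```python
import numpy as np

def triangulation_resample(features, n_samples, region_test, margin=3000.0):
    """Draw pose samples from the constraint region C0.  `features` is a list
    of (x_j, y_j, phi_j) tuples for the observed features; `region_test(x, y)`
    must return True when (x, y) lies inside the intersection of the
    visual-angle constraint regions."""
    xs = [f[0] for f in features]
    ys = [f[1] for f in features]
    lo_x, hi_x = min(xs) - margin, max(xs) + margin
    lo_y, hi_y = min(ys) - margin, max(ys) + margin

    samples = []
    while len(samples) < n_samples:
        x = np.random.uniform(lo_x, hi_x)
        y = np.random.uniform(lo_y, hi_y)
        if not region_test(x, y):
            continue                               # rejection sampling inside C0
        # Eq. (13): average over all observed features
        theta = np.mean([np.arctan2(y_j - y, x_j - x) - phi_j
                         for (x_j, y_j, phi_j) in features])
        samples.append((x, y, theta))
    return np.array(samples)
```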
Figure 4 (a) illustrates the sample distribution after sampling from two observed features f1 and
f2, for a robot pose of (3800 mm, 4500 mm, -120°). The visual angle error is about five percent of
the visual angle. About 1000 generated samples are sparsely distributed in the intersection region
formed by the observed features. Figure 4 (b) illustrates the sampling results from three features
f1, f2 and f3, for a robot pose of (3000 mm, 4200 mm, -100°). It can be seen that all extracted
samples are located in the pose constraint region and are close to the true robot pose.
5. Experiments
To verify the proposed extended MCL method, experiments were carried out on a Pioneer 2/DX
mobile robot with a color CCD camera and sixteen sonar sensors, as shown in Figure 5. The camera,
which has a maximum view angle of 48.8 degrees, is used for image acquisition and feature
recognition; the sonar sensors are mainly for collision avoidance. Experiments were performed in a
general indoor office environment, as shown in Figure 6 (a). Features in this environment are
compartments (diagonal shading), refrigerators (crossed shading) and a door (short thick line at the
bottom). The layout of the features is shown in Figure 6 (b).
In the experiments, the sample size was set to a constant of 400. Parameters of the extended MCL
were: the percentage threshold of the effective sample size was 10%, the constant c was 0.8, σ was
0.15~0.25, and the scale factor k_w was 50%. Parameters of the conventional MCL (with random
resampling) were: the percentage threshold of the effective sample size was again 10%, c = 0.3,
σ = 0.35, and k_w = 30%.
[Figure 4: sample distributions from triangulation-based resampling, (a) from two observed features f1 and f2, (b) from three observed features f1, f2 and f3; axes X, Y in mm]
Figure 5. Pioneer 2/DX mobile robot, equipped with a color CCD camera and sixteen sonar
sensors around
Figure 6. (a) Experimental environment model. (b) Layout of environmental features
5.1 Global Localization with Kidnapped Movement
First, the proposed extended MCL method was applied to global localization as well as to a
non-modeled movement (i.e., the kidnapped-robot problem), with time-variant sample
distributions, entropy and effective sample size. Figure 7 illustrates the whole motion trajectory:
the robot started its motion from position a and moved to position b, was then kidnapped to
position c, and then continued to move to the end position d. Figures 8 and 9 illustrate results
obtained with a sample size of 400. Figure 8 shows that the effective sample size and entropy are
time varying. Sample distributions at different iterations are shown in Figure 9, where (a) ~ (f)
correspond to the initial, the 1st, the 8th, the 9th, the 17th, the 18th and the 26th iterations,
respectively. As shown in Figure 9 (a), the initial distribution was
uniform. At the first iteration, when two compartments f 4 and f 5 were observed, entropy
after sensor update decreased, as shown in Figure 8 (b), and both importance resampling
and triangulation-based resampling were applied. Due to the existence of multiple matches,
importance resampling was applied in all successive iterations, as indicated by the multiple clusters shown in
Figure 9 (b). At the 9th iteration, the feature of the door f 1 was observed, and a single
sample cluster shown in Figure 9 (c) was obtained until reaching position b, where all
samples were distributed nearby the true pose of the robot, as seen in Figure 9 (d). At the
18th iteration, the robot was kidnapped to the position c with a largely-changed heading
angle. At this moment, the effective sample size and the entropy decreased greatly, as
shown in Figure 8 (a) and (b). With observations of the features of the door f 1 , the
refrigerator f 10 , and the compartment f 9 , sample distribution was obtained as shown in
Figure 9 (e), after applying importance resampling and triangulation-based resampling,
until the end position d was reached, where the sample distribution is shown in Figure 9 (f).
From the above localization process, it can be seen that the sensor-based resampling, after an
effective examination of the weight values of the samples by over-convergence validation, can
effectively solve the kidnapped-robot problem. Due to the many similar features in the
environment, there were still some samples with higher weights after the kidnapped
motion, and therefore only over-convergence validation was executed without the
uniformity validation. If the robot is kidnapped to a region without similar features, the
uniformity validation can be executed.
5.2 Comparison of Localization Errors
By applying the strategy of different resampling processes in the extended MCL, the localization
error becomes smaller than that of the conventional MCL, especially in the case of non-modeled
(kidnapped) movements.
Figure 7. Motion trajectory in localization process (a bit enlarged relative to Figure 6 (a))
Figure 8. (a) Effective sample size in localization process. (b) Entropy before and after
sensor update
Figure 11 (a) illustrates the total update time of the resampling process with respect to different
numbers of samples. It can be seen that when the sample size is less than 1500, the update time is
low and increases slowly as the sample size increases; when the sample size is more than 2000, the
update time is high and increases quickly with the sample size. This indicates a high computational
efficiency, since a large number of samples is not required and update time can thus be saved.
Figure 9. Sample distributions at different iterations (a) initial time and (b) ~ (f) 8th, 9th, 17th,
18th and 26th iterations
[Figure 10: pose error in x (mm) versus iterations for the triangulation-based MCL and the random-resampling MCL]
Figure 11. (a) Update time (b) Percentage of sensor-based resampling time with respect to
the total update time
Figure 11 (b) further illustrates the percentage of the sensor-based resampling time with respect
to the total update time for different numbers of samples. The percentage of resampling time
decreases as the sample size increases, i.e., from 48.6% with 200 samples to 35.9% with 8000
samples. Since few samples are deleted in a large sample set, the sensor-based resampling does
not necessarily have to be performed, and the process without resampling dominates the whole
update. When the sample size increases beyond a certain extent, the percentage of resampling
time does not change noticeably. This implies that the extended MCL with a smaller sample size
has a localization performance similar to that with a relatively larger sample size. Although the
percentage of sensor-based resampling time in the whole update time is higher for a smaller
sample size than for a larger one, the total update time is reduced when using the smaller sample
size.
6. Conclusion
An extended Monte Carlo localization (EMCL) method is proposed in this chapter by introducing
two validation mechanisms and a resampling strategy into conventional MCL. The two validation
mechanisms, uniformity validation and over-convergence validation, are used to check
abnormalities in the distribution of the weight values of the sample set, e.g., observation
deviations or the over-convergence problem. A strategy of employing different resampling
processes is proposed to construct a posterior distribution that is more consistent with the
observations. This new approach aims to improve the localization performance, particularly with
a smaller sample size and in the presence of non-modeled robot movements, and thus to achieve
global localization more efficiently. A vision-based extended MCL is further implemented,
utilizing triangulation-based resampling from visual features in a constraint region of the pose
space. Experiments conducted on a mobile robot with a color CCD camera and sixteen sonar
sensors verify the efficiency of the extended MCL method.
7. Acknowledgement
This work was supported in part by a grant from Research Grants Council of the Hong
Kong Special Administrative Region, China [Reference No. CityU 119705], and a grant from
City University of Hong Kong (project no. 7002127).
8. References
Borenstein, J.; Everett, H. & Feng, L. (1996). Where am I? Sensors and Methods for Autonomous
Mobile Robot Positioning, A K Peters, Wellesley, MA.
Carpenter, J.; Clifford, P. & Fearnhead, P. (1999). Improved particle filter for nonlinear
problems, IEE Proceedings - Radar, Sonar and Navigation, pp. 2-7.
Castellanos, J. & Tardos, J. (1996). Laser-based segmentation and localization for a mobile
robot, Proceedings of the 6th Symposium (ISRAM), pp. 101-108, ASME Press, New
York, NY.
Chenavier, F. & Crowley, J. (1992). Position estimation for a mobile robot using vision and
odometry, Proceedings of the IEEE International Conference on Robotics and Automation,
pp. 2588-2593.
Dellaert, F.; Burgard, W. & Fox, D. (1999). Using the condensation algorithm for robust,
vision-based mobile robot localization, Proceedings of the IEEE International
Conference on CVPR, pp. 588-594.
Doucet, A. (1998). On sequential simulation-based methods for Bayesian filtering,
Cambridge University, Department of Engineering, Cambridge, UK, Technical
Report. CUED/FINFENG/ TR 310.
Drumheller, M. (1987). Mobile robot localization using sonar, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. PAMI-9, No. 2, 325-332.
Fox, D. (2003). Adapting the sample size in particle filters through KLD-sampling,
International Journal of Robotics Research.
Fox, D.; Burgard, W. & Dellaert, F. (1999a). Monte Carlo localization: efficient position
estimation for mobile robots, Proceedings of AAAI-99, pp. 343-349, Orlando,
Florida.
Fox, D.; Burgard, W. & Thrun, S. (1999b). Markov localization for mobile robots in dynamic
environments, Journal of Artificial Intelligence Research, Vol. 11, 391-427.
Gaspar, J.; Winters, N. & Santos-Victor, J. (2000). Vision-based navigation and
environmental representations with an omnidirectional camera, IEEE Transactions
on Robotics and Automation, Vol. 16, No. 6, 890-898.
Luo, R. & Hong, B. (2004). Coevolution based adaptive Monte Carlo localization (CEAMCL),
International Journal of Advanced Robotic Systems. Vol. 1, No. 3, 183-190.
Isard, M. & Blake, A. (1998). Condensation-conditional density propagation for visual
tracking, International Journal of Computer Vision, Vol. 29, No. 1.
Jensfelt, P. & Kristensen, S. (2001). Active global localization for a mobile robot using
multiple hypothesis tracking, IEEE Transactions on Robotics and Automation, Vol. 17,
No. 5, 748-760.
Jensfelt, P.; Wijk, O.; Austin, D. & Anderson, M. (2000). Experiments on augmenting
condensation for mobile robot localization, Proceedings of the IEEE International
Conference on Robotics and Automation.
Konolige, K. (2001). Robot motion: probabilistic model; sampling and Gaussian
implementations; Markov localization, AI Center, SRI International, Technical Note.
Kortenkamp, D.; Bonasso, R. & Murphy, R. (1998). (Eds.), AI-based Mobile Robots: Case Studies
of Successful Robot Systems, MIT Press, Cambridge, MA.
Kraetzschmar, G. & Enderle, S. (2002). Self-localization using sporadic features, Robotics and
Autonomous Systems, Vol. 40, 111-119.
Krotkov, E. (1989). Mobile robot localization using a single image, Proceedings of the IEEE
International Conference on Robotics and Automation, pp. 978-983.
Lenser, S. & Veloso, M. (2000). Sensor resetting localization for poorly modeled mobile
robots, Proceedings of the IEEE International Conference on Robotics and Automation,
pp. 1225-1232.
Leonard, J. & Durrant-Whyte, H. (1991). Mobile robot localization by tracking geometric
beacons, IEEE Transactions on Robotics and Automation, Vol. 7, 89-97.
Liu, J.; Chen, R. & Logvinenko, T. (2001). A theoretical framework for sequential importance
sampling and resampling, In Sequential Monte Carlo in Practice, Springer-Verlag,
New York.
Mouaddib, E. & Marhic, B. (2000). Geometrical matching for mobile robot localization, IEEE
Transactions on Robotics and Automation, Vol. 16, No. 5, 542-552.
13
Optical Correlator based Optical Flow
Processor for Real Time Visual Navigation
Valerij Tchernykh, Martin Beck, Klaus Janschek
Technische Universität Dresden
Germany
1. Introduction
Autonomous visual navigation, i.e. the determination of position, attitude and velocity (ego-motion)
by processing the images from onboard camera(s), is essential for mobile robot control even in the
presence of GPS networks, as the accuracy of the GPS data and/or of the available map of the
surroundings can be insufficient. Besides, GPS signal reception can be unstable in many locations
(inside buildings, in tunnels, in narrow streets and canyons, under trees, etc.).
Up to now, most practical visual navigation solutions have been developed for ground robots
moving in cooperative and/or well-determined environments. Future generations of mobile robots,
however, should also be capable of operating in complex and non-cooperative 3D environments.
Visual navigation in such conditions is much more challenging, especially for flying robots, where
the full 6-DOF pose/motion has to be determined. Generally, 3D environment perception is
required in this case, i.e., the determination of a local depth map of the visible scene.
3D scene information can be obtained by stereo imaging; however, this solution has certain
limitations. It requires at least two cameras, precisely mounted with a certain stereo base (which
can be critical for small vehicles). Due to the fixed stereo base, the range of depth determination
with stereo imaging is limited. A more universal solution with lower hardware requirements can
be achieved with optical flow processing of sequential images from a single onboard camera.
The ego motion of a camera rigidly mounted on a vehicle is mapped into the motion of
image pixels in the camera focal plane. This image motion is commonly understood as
image flow or optical flow (OF) (Horn & Schunck, 1981). This vector field of 2D image
motion can be used efficiently for 3D environment perception (mapping) and vehicle
pose/motion determination as well as for obstacle avoidance or visual servoing. The big
challenge for using the optical flow in real applications is its computability in terms of its
density (sparse vs. dense optical flow), accuracy, robustness to dark and noisy images and
its real-time determination. The general problem of optical flow determination can be
formulated as the extraction of the two-dimensional projection of the 3D relative motion into
the image plane in the form of a field of correspondences (motion vectors) between points in
consecutive image frames.
This article addresses a real-time solution for high precision optical flow computation based
on 2D correlation of image fragments on the basis of an optical correlator. It exploits the
principle of Joint Transform Correlation (JTC) in an optoelectronic setup using the Optical
Fourier Transform (Goodman, 1968). Based on the experience of the authors with different
successful optical processor developments (Tchernykh et al., 2004, Janschek et al., 2004a,
Tchernykh et al., 2000, Janschek et al., 2005a) a new optical processor design is presented,
which makes use of advanced optoelectronic technology. The proposed optoelectronic optical flow
processor (OE-OFP) proves to be very compact, with low mass and low power consumption, and
provides the high performance needed for navigation applications in the field of robotics (ground,
aerial, marine) and space flight (satellites, landing vehicles). The paper briefly recalls the
underlying principles of optical flow computation and optical correlation, shows the system layout
and the conceptual design of the optoelectronic optical flow processor, and gives preliminary
performance results based on a high-fidelity simulation of the complete optical processing chain.
technology (Bruhn et al., 2003, Bruhn et al., 2005, Diaz et al., 2006) additionally consume a lot of
power, mass and volume, which does not fit the requirements of mobile robotics, especially if
application onboard a flying platform is considered. The recently developed and currently very
popular SIFT approach (Lowe, 1999, Se et al., 2001) allows a computationally efficient
determination of more or less sparse OF fields in well-structured environments. Some specialized
high-speed OF sensors based on hybrid analogue-digital technology (Barrows & Neely, 2000,
Zufferey, 2005) provide even super real-time performance but do not reach the OF vector accuracy
required for navigation purposes. A most robust approach is area correlation, applied originally for
image registration (Pratt, 1974). Area correlation uses the fundamental property of the
cross-correlation function of two images: the location of the correlation peak is directly
proportional to the displacement vector of the original image shift.
For each pair of sequential images the OF field is determined by subdividing both images
into small fragments and 2D correlation of corresponding fragments. As a result of each
correlation the local shift vector at the specific location is determined; a whole set of local
shift vectors forms an optical flow matrix (Figure 1).
The main advantages of this approach are a low dependency on image texture properties (no
specific texture features required) and a high robustness to image noise (making it suitable for
short exposures and/or poor illumination conditions).
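The area-correlation principle can be emulated digitally as a simple block-matching scheme: both images are subdivided into equally sized fragments and the shift of each fragment is taken from the peak of the 2D cross-correlation, computed here in the Fourier domain. This NumPy sketch is a software stand-in for the optical correlator described below, not the authors' hardware implementation; the 24-pixel default fragment size is only an example.

```python
import numpy as np

def fragment_shift(f1, f2):
    """Shift between two equally sized image fragments, taken from the peak
    of their circular 2D cross-correlation (computed via FFT)."""
    F1 = np.fft.fft2(f1 - f1.mean())
    F2 = np.fft.fft2(f2 - f2.mean())
    corr = np.fft.ifft2(F1 * np.conj(F2)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices above N/2 correspond to negative shifts (circular correlation)
    dy = peak[0] if peak[0] <= f1.shape[0] // 2 else peak[0] - f1.shape[0]
    dx = peak[1] if peak[1] <= f1.shape[1] // 2 else peak[1] - f1.shape[1]
    return dx, dy

def optical_flow_matrix(img1, img2, frag=24):
    """Optical-flow matrix: one local shift vector per frag x frag fragment."""
    rows, cols = img1.shape[0] // frag, img1.shape[1] // frag
    flow = np.zeros((rows, cols, 2))
    for r in range(rows):
        for c in range(cols):
            a = img1[r * frag:(r + 1) * frag, c * frag:(c + 1) * frag]
            b = img2[r * frag:(r + 1) * frag, c * frag:(c + 1) * frag]
            flow[r, c] = fragment_shift(a, b)
    return flow
```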
The two image fragments f1 and f2 to be correlated are combined side by side into a joint image
I(x, y) (Figure 2). A first Fourier transform yields the joint power spectrum of f1 and f2 in the joint
image I. A second Fourier transform of the squared joint power spectrum yields the correlation
image: its central part C_ff(x, y) represents the auto-correlation function of each input image,
whereas the two spatially shifted correlation functions C_ff(x ± G_x, y ± G_y) represent the
cross-correlation functions of the input images. The shift vector G contains both the constructional
shift introduced when building the joint image I(x, y) and the shift of the image contents due to
the image motion. If the two input images f1 and f2 contain identical (but shifted) image contents,
the cross-correlation peaks are present, and their mutual spatial shift equals twice the shift
vector G.
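The joint-transform principle itself can also be emulated in software: the two fragments are placed side by side in a joint image, its power spectrum is computed, and a second Fourier transform of that power spectrum yields the correlation plane with the central auto-correlation term and the two symmetric cross-correlation peaks. The sketch below is a digital illustration of the principle only; in the actual device both transforms are performed optically.

```python
import numpy as np

def jtc_correlation_plane(f1, f2):
    """Digital emulation of Joint Transform Correlation for two equally
    sized fragments f1, f2.  Returns the (shifted) correlation plane whose
    two off-centre peaks encode the mutual shift of the fragments."""
    h, w = f1.shape
    joint = np.zeros((h, 2 * w))
    joint[:, :w] = f1                            # combined input image I(x, y)
    joint[:, w:] = f2
    power = np.abs(np.fft.fft2(joint)) ** 2      # first FT -> joint power spectrum
    corr = np.abs(np.fft.fft2(power))            # second FT -> correlation plane
    return np.fft.fftshift(corr)
```

Measuring the positions of the two off-centre peaks relative to the centre, and subtracting the known constructional offset of the joint image, gives the image shift.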
This principle can be realized in hardware by a specific optoelectronic setup, named Joint
Transform Optical Correlator (JTOC). The required 2D Fourier transforms are performed by
means of diffraction-based phenomena, incorporating a so-called optical Fourier processor
(Goodman, 1968). A laser diode (Figure 3) generates a diverging beam of coherent light
which passes a single collimating lens focusing the light to infinity. The result is a beam of
parallel light with plane wave fronts. The amplitude of the plane wave front is modulated
by a transmissive or reflective spatial light modulator (SLM). The SLM actually works as a
diffraction grid and the resulting diffraction pattern can be made visible in the focal plane of
a second lens (Fourier lens). Under certain geometric conditions the energy distribution of
this pattern is equal to the squared Fourier transform (power spectrum) of the modulated
wave front. The power spectrum can be read by a CCD or CMOS image sensor located in
the focal plane of the Fourier lens of the optical Fourier processor. The position of the
correlation peaks in the second power spectrum (correlation image) and the associated shift
value can be measured with sub-pixel accuracy using e.g. standard algorithms for centre of
mass calculation.
Optical processing thus allows unique real time processing performances of high frame rate
video streams.
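A straightforward version of the centre-of-mass peak localization mentioned above is sketched here: the brightest pixel of the correlation image is found first, and the sub-pixel position is refined by a weighted average over a small window around it. The window half-size is an assumed parameter.

```python
import numpy as np

def subpixel_peak(corr, half_win=2):
    """Locate the correlation peak with sub-pixel accuracy using a
    centre-of-mass calculation in a small window around the maximum."""
    py, px = np.unravel_index(np.argmax(corr), corr.shape)
    y0, y1 = max(py - half_win, 0), min(py + half_win + 1, corr.shape[0])
    x0, x1 = max(px - half_win, 0), min(px + half_win + 1, corr.shape[1])
    win = corr[y0:y1, x0:x1].astype(float)
    win -= win.min()                      # suppress the local background level
    total = win.sum()
    if total == 0.0:                      # degenerate flat window
        return float(px), float(py)
    ys, xs = np.mgrid[y0:y1, x0:x1]
    return (xs * win).sum() / total, (ys * win).sum() / total
```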
This advanced technology and its applications have been studied over recent years at the Institute
of Automation of the Technische Universität Dresden (Tchernykh et al., 2004, Janschek et al.,
2004a, Janschek et al., 2007). Different hardware models have been manufactured, e.g. under
European Space Agency (ESA) contracts (Figure 4). Due to special design solutions owned by TU
Dresden, the devices are robust to mechanical loads and deformations (Tchernykh et al., 2000,
Janschek et al., 2005a). One of the models has been
successfully tested in an airborne test campaign, where very promising performances have
been shown (Tchernykh et al., 2004).
[Figure 3: scheme of the Joint Transform Optical Correlator (laser diode, SLM, image sensor and digital processor); the combined input image of the two fragments to be correlated is transformed into a spectrum image and then into a correlation image, whose correlation peaks give the image shift]
Figure 6. Results of the optical flow sensitivity with respect to correlation window size
5.2 Simulation experiment results
The OF fields have been determined with correlated fragment sizes varying from 8x8 to 64x64
pixels and have been compared with a reference (ideal) OF field to determine the OF errors. The
reference OF field has been produced directly from the reference trajectory data and the known 3D
model of the landing site. Figure 6 shows the results of the optical flow accuracy sensitivity for
different correlation window sizes. Images in the left column represent the 2D patterns of the OF
vector magnitudes (brighter pixels represent larger OF vectors), the middle column contains the
error patterns, determined as the difference between the reference (ideal) and test OF fields, and
the RMS error values are shown in the diagram at the bottom right corner.
According to this sensitivity analysis, minimal OF errors are expected for a window size of
24x24 pixels.
For the selected window size (24x24 pixels) the sensitivity of the OF error to additive and
multiplicative image noise has been investigated. It has been found that random noise with a
standard deviation within 8% of the average image brightness (signal-to-noise ratio above 12 dB)
has little influence on the OF field accuracy. Starting from σ = 8%, however, the effect of image
noise increases rapidly. According to these results, the limit of acceptable image noise for optical
flow determination with a fragment size of 24x24 can be set to σ = 8% of the average image
brightness.
The main purpose of the OE-OFP is the real time determination of the optical flow field for
the visible surrounding environment. Figure 7 shows the general data flow chart for the
optical flow computation according to the Joint Transform Correlation (JTC) principle.
Image acquisition → fragment cut → fragment loading to the Fourier processor → 2D Fourier transform → spectrum readout / loading to the Fourier processor → 2D Fourier transform → correlation image readout → correlation image processing
Figure 7. Main operations for optical flow determination according to JTC principle
The 2D Fourier transform operations are the most time- and resource-consuming part of a digital
realization of the OF processor. With an optical realization, however, the Fourier transform is
performed practically instantly (at the speed of light) and requires power only for the SLM
illumination. In this case, the other operations in the data processing chain (image readout/loading,
fragment cut and correlation image processing) practically determine the limits of performance
improvement and of size/mass/power minimization. Therefore, optimization of these operations is
particularly essential for an optimal design of the optoelectronic OF processor.
The concept of the OE-OFP has been developed assuming a realization of the input/output and
digital processing operations directly on the image sensor and SLM chips. This solution eliminates
the need for dedicated digital processing electronics and dramatically reduces the power
consumption.
The unpackaged chips can be mounted close to each other on a single substrate (Chip-on-Board,
COB, mounting). The small distance between the dies (Figure 8) allows direct chip-to-chip
connections. This avoids the need for powerful buffers inside the processor and consequently
further reduces the OFP power consumption. As the processor outputs only the OF vector
coordinates, the output data rate and therefore the power consumption of the output buffers are
also limited.
[Figure 8: Chip-on-Board layout of the OE-OFP with laser diode, lens, SLM and image sensor (SCIS) along the light path; approximate dimensions 8 mm x 20 mm x 50 mm]

[Table 1: expected parameters of the OE-OFP]
Optical-flow field resolution: 64x64 = 4096 vectors/field or 8x8 = 64 vectors/field
Field rate: 10 fields/s (full resolution) up to 500 fields/s (reduced resolution)
Processing time: one frame (0.002 - 0.1 s), about 50000 correlations/s
Optical-flow accuracy: σ = 0.1 - 0.25 pixels
Dimensions: 50 x 20 x 8 mm (w/o lens)
Mass: within 20 g (w/o lens)
Power consumption: within 2 W
A comparison of Table 1 with the requirements listed in Section 2 shows that the proposed
optoelectronic optical flow processor is expected to satisfy them. To compare the proposed
processor with other currently available solutions for real-time optical flow determination, it is
however important to evaluate a performance measure related to mobility, which also takes into
account the processor power consumption and volume in relation to the computing performance
in terms of flow vectors per second and accuracy.
Figure 10 shows these performance-to-mobility measures taking into account also the power
consumption and the volume of the optical-flow processor module. It follows that the
proposed optoelectronic optical flow processor design (OE-OFP) shows unique
performances in comparison with the fastest digital optical-flow computation solutions currently
available (Bruhn et al., 2005, Diaz et al., 2006).
[Figure 10: performance-to-mobility comparison of the OE-OFP with the digital optical-flow solutions of Diaz (2006) and Bruhn (2005)]
7. Application areas
The proposed optical flow processor is intended to be used mainly in the field of visual navigation
of mobile robots (ground, aerial, marine) and space flight (satellites, landing vehicles). The small
size, mass and power consumption make the proposed OE-OFP particularly suitable for application
onboard micro air vehicles (MAVs).
From the obtained optical flow, 3D information can be extracted and a 3D model of the visible
environment can be produced. The considerably high resolution (up to 64x64 OF vectors) and very
high accuracy (errors below 0.25 pixels) of the determined optical flow make such 3D environment
models detailed and accurate. These 3D environment models can be used for 3D navigation in
complex environments (Janschek et al., 2004b) and also for 3D mapping, making the proposed OF
processor ideally suited for 3D visual SLAM. The applicability of the optical flow data derived with
the proposed principles (joint transform correlation) and technology (optical correlator) to
real-world navigation solutions, even under unfavourable constraints (inclined trajectories with
considerably large perspective distortions), has been proved by the authors in recent work
(Janschek et al., 2005b, Tchernykh et al., 2006); some simulation results are also given in the next
section.
The anticipated real-time performance of the processor (up to 500 frames/s with reduced OF field
resolution) provides a wide range of opportunities for using the obtained optical flow for many
additional tasks beyond localization and mapping, e.g. vehicle stabilization, collision avoidance,
visual odometry, and landing and take-off control of MAVs.
Figure 11. Simulation environment with UAV trajectory (side and top views)
Simulated navigation camera images (Figure 12) have been rendered for a single nadir-looking
camera with a wide-angle (fisheye) lens (field of view 220°), taking into account the simulated
UAV trajectory.
A reference 3D model of the terrain has been produced in the form of a Digital Elevation Model
(DEM) by stereo processing of two high-altitude images (simulating standard aerial mapping).
Such a model can be represented by a 2D pseudo-image in which the brightness of each pixel
corresponds to the local height above the base plane.
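Such a pseudo-image can be produced by a simple linear mapping of terrain height to 8-bit brightness, as in the small sketch below (an illustration, not the authors' tool chain).

```python
import numpy as np

def dem_to_pseudo_image(heights):
    """Encode a DEM (2D array of heights over the base plane) as an 8-bit
    grayscale pseudo-image: brighter pixels mean larger local heights."""
    h = np.asarray(heights, dtype=float)
    span = h.max() - h.min()
    if span == 0.0:
        return np.zeros(h.shape, dtype=np.uint8)
    return np.round(255.0 * (h - h.min()) / span).astype(np.uint8)
```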
The optical flow determination has been performed with a detailed simulation model of the
optical correlator. The correlator model produces the optical flow fields for each pair of
simulated navigation camera images, simulating the operation of the real optical hardware.
Figure 13 shows an example of the optical flow field. The 3D surface models have been first
reconstructed as local distance maps in a camera-fixed coordinate frame (Figure 13), then
converted into DEMs in a surface-fixed frame using the estimated position and attitude of
the vehicle. Figure 14 shows an example of both the reconstructed and reference DEMs.
Figure 13. Example of an optical flow field and corresponding distance map
Navigation data (position, attitude and velocity of the robot) have been extracted from the
results of the matching of the reconstructed and reference models and compared with the
reference trajectory data to estimate the navigation errors. As a result of the test, the RMS
position error for the translation part of the trajectory was 0.20 m and the RMS attitude error
was 0.45 degrees. These have been obtained by instantaneous processing of the optical flow
data, i.e. without any time filtering and without any additional navigation aids (except the DEM
reference map). The navigation accuracy can be further improved by filtering and by using data
from an inertial measurement unit.
[Figure 14: reference DEM and reconstructed DEM]
10. References
Barrows, G. & Neely, C. (2000). Mixed-mode VLSI optic flow sensors for in-flight control of
a micro air vehicle, Proc. SPIE Vol. 4109, Critical Technologies for the Future of
Computing, pp. 52-63, 2000.
Beauchemin, S.S. & Barron, J.L. (1995). The computation of optical flow, ACM Computing
Surveys (CSUR), Vol. 27, no. 3, (September 1995), pp. 433-466.
Bruhn, A., Weickert, J., Feddern, C., Kohlberger, T. & Schnörr, C. (2003). Real-Time Optic
Flow Computation with Variational Methods, CAIP 2003, LNCS, Vol. 2756, (2003),
pp. 222-229.
Bruhn, A., Weickert, J., Feddern, C., Kohlberger, T. & Schnörr, C. (2005). Variational Optical
Flow Computation in Real Time. IEEE Transactions on Image Processing, vol. 14, no.
5, (May 2005).
Díaz, J., Ros, E., Pelayo, F., Ortigosa, E.M. & Mota, S. (2006). FPGA-Based Real-Time
Optical-Flow System, IEEE Transactions on Circuits and Systems for Video Technology, vol. 16,
no. 2, (February 2006).
Goodman, J.W. (1968). Introduction to Fourier optics, McGraw-Hill, New York.
Horn, B.K.P. & Schunck, B.G. (1981). Determining Optical Flow, Artificial Intelligence, Vol. 17
(1981), pp. 185-203.
Janschek, K., Tchernykh, V. & Dyblenko, S. (2004a). Opto-Mechatronic Image Stabilization
for a Compact Space Camera, Preprints of the 3rd IFAC Conference on Mechatronic
Systems - Mechatronics 2004, pp.547-552, Sydney, Australia, 6-8 September 2004,.
(Congress Best Paper Award).
Janschek, K., Tchernykh, V. & Beck, M. (2004b). Optical Flow based Navigation for Mobile
Robots using an Embedded Optical Correlator, Preprints of the 3rd IFAC Conference
on Mechatronic Systems - Mechatronics 2004, pp.793-798, Sydney, Australia, 6-8
September 2004.
Janschek, K., Tchernykh, V. & Dyblenko, S. (2005a) Verfahren zur automatischen Korrektur
von durch Verformungen hervorgerufenen Fehlern Optischer Korrelatoren und
Selbstkorrigierender Optischer Korrelator vom Typ JTC, Deutsches Patent Nr. 100
47 504 B4, Erteilt: 03.03.2005.
Janschek, K., Tchernykh, V. & Beck, M. (2005b). An Optical Flow Approach for Precise
Visual Navigation of a Planetary Lander, Proceedings of the 6th International ESA
Conference on Guidance, Navigation and Control Systems, Loutraki, Greece, 17 20
October 2005.
Janschek, K., Tchernykh, V. & Dyblenko, S. (2007). Performance analysis of optomechatronic image stabilization for a compact space camera, Control Engineering
Practice, Volume 15, Issue 3, March 2007, pages 333-347
Jutamulia, S. (1992). Joint transform correlators and their applications, Proceedings SPIE, 1812
(1992), pp. 233-243.
Liu, H., Hong, T.H., Herman, M., Camus, T. & Chellappa, R. (1998). Accuracy vs Efficiency
Trade-offs in Optical Flow Algorithms, Computer Vision and Image Understanding,
vol. 72, no. 3, (1998), pp. 271-286.
Lowe, D.G. (1999). Object recognition from local scale-invariant features, Proceedings of the
Seventh International Conference on Computer Vision (ICCV'99), pp. 1150-1157,
Kerkyra, Greece, September 1999.
McCane, B., Galvin, B. & Novins, K. (1998) On the Evaluation of Optical Flow Algorithms,
Proceedings of 5th International Conference on Control, Automation, Robotics and Vision,
pp. 1563-1567, Singapore, 1998.
Pratt, W.K. (1974) Correlation techniques of image registration, IEEE Transactions on
Aerospace Electronic Systems, vol. 10, (May 1974), pp. 353-358.
Se, S., Lowe, D.G. & Little, J. (2001) Vision-based mobile robot localization and mapping
using scale-invariant features, Proceedings 2001 ICRA - IEEE International Conference
on Robotics and Automation, vol. 2, pp. 2051 2058, 2001.
Tchernykh, V., Janschek, K. & Dyblenko, S. (2000). Space application of a self-calibrating
optical processor for harsh mechanical environment, Proceedings of 1st IFAC
Conference on Mechatronic Systems - Mechatronics 2000, Vol 3, pp.309-314, Darmstadt,
Germany, September 18-20, 2000, Pergamon-Elsevier Science.
Tchernykh, V., Dyblenko, S., Janschek, K., Seifart, K. & Harnisch, B. (2004). Airborne test
results for a smart pushbroom imaging system with optoelectronic image
correction. In: Sensors, Systems and Next-Generation Satellites VII, Proceedings of SPIE,
Vol. 5234 (2004), pp.550-559.
Tchernykh, V., Beck, M. & Janschek, K. (2006). Optical flow navigation for an outdoor UAV
using a wide angle mono camera and DEM matching, submitted to 4th IFAC
Symposium on Mechatronic Systems Mechatronics 2006, Heidelberg, Germany.
Zufferey, J.C. (2005). Bio-inspired Vision-based Flying Robots, Thèse no. 3194, Faculté Sciences
et Techniques de l'Ingénieur, EPFL, 2005.
14
Simulation of Visual Servoing Control and
Performance Tests of 6R Robot Using Image-Based and Position-Based Approaches
M. H. Korayem and F. S. Heidari
Robotic Research Laboratory, College of Mechanical Engineering, Iran University of
Science & Technology, Tehran
Iran
1. Introduction
Visual control of robots using vision systems and cameras has been studied since the 1980s. Visual
(image-based) features such as points, lines and regions can be used, for example, to enable the
alignment of a manipulator or gripping mechanism with an object. Hence, vision is a part of a
control system in which it provides feedback about the state of the environment. In general, this
method involves the vision system cameras snapping images of the target-object and the robotic
end effector, analyzing them and reporting a pose for the robot to achieve. Therefore, 'look and
move' involves no real-time correction of the robot path. This method is ideal for a wide array of
applications that do not require real-time correction, since it places much lighter demands on
computational horsepower as well as on communication bandwidth, and has thus become feasible
outside the laboratory. The obvious drawback is that if the part moves between the look and move
functions, the vision system has no way of knowing this; in reality, this does not happen very often
for fixtured parts. Yet another drawback is lower accuracy: with the 'look and move' concept, the
final accuracy of the calculated part pose is directly related to the accuracy of the 'hand-eye'
calibration (an offline calibration that relates camera space to robot space). If the calibration is
erroneous, so is the calculated part pose.
A closed-loop control of a robot system usually consists of two intertwined processes: image
tracking and control of the robot's end effector. Image tracking provides a continuous estimation
and update of features during the robot or target-object motion. Based on this sensory input, a
control sequence is generated.
Y. Shirai and H. Inoue first described a method for 'visual control' of a robotic manipulator using
a vision feedback loop. Gilbert describes an automatic rocket-tracking camera that keeps the target
centered in the camera's image plane by means of pan/tilt controls (Gilbert et al., 1983). Weiss
proposed the use of adaptive control for the non-linear, time-varying relationship between robot
pose and image features in image-based servoing; detailed simulations of image-based visual
servoing are described for a variety of 3-DOF manipulator structures (Webber & Hollis, 1988).
Mana Saedan and M. H. Ang worked on relative target-object (rigid body) pose estimation
for vision-based control of industrial robots. They developed and implemented an
estimation algorithm for closed form target pose (Saedan & Marcelo, 2001).
Image-based visual control of robots has been considered by many researchers, who used a closed
loop to control the robot joints. Feddema uses an explicit feature-space trajectory generator and
closed-loop joint control to overcome problems due to the low visual sampling rate. Experimental
work demonstrates image-based visual servoing for 4 DOF (Kelly &
Shirkey, 2001). Rives et al. describe a similar approach using the task function method and
show experimental results for robot positioning using a target with four circle features
(Rives et al. 1991). Hashimoto et al. present simulations to compare position-based and
image-based approaches (Hashimoto et al., 1991).
Korayem et al. designed and simulated vision-based control and performance tests for a 3P robot
using Visual C++. They minimized the positioning error of the end effector, analyzed the error
according to the ISO 9283 and ANSI/RIA R15.05-2 standards, and suggested methods to reduce it
(Korayem et al., 2005, 2006). One stationary camera was installed on the ground and another was
mounted on the end effector of the robot to find a target; this vision system was designed using
image-based visual servoing. The vision-based control in our work, in contrast, is implemented on
the 6R robot using both IBVS and PBVS methods. In the case where the cameras are fixed to the
ground, i.e., the cameras observe the robot, the system is called "out-hand" (the term "stand-alone"
is generally used in the literature); when a camera is installed on the end effector, the configuration
is called "in-hand". The closed-form target pose estimation is discussed and used in the
position-based visual control. The advantage of this approach is that the servo control structure is
independent from the target pose coordinates; to construct the pose of a target-object from the
two-dimensional image planes, two cameras are used. This method has the ability to deal with
real-time changes in the relative position of the target-object with respect to the robot, as well as
greater accuracy.
Collision detection along with the related problem of determining minimum distance has a
long history. It has been considered in both static and dynamic (moving objects) versions.
Cameron in his work mentioned three different approaches for dynamic collision detection
(Cameron, 1985, 1986). Some algorithms such as Boyse's and then Canny's solve the problem
for computer animation (Boyse, 1979) and (Canny, 1986); while others do not easily produce
the exact collision points and contact normal direction for collision response (Lin, 1993). For
curved objects, Herzen etc have described a general algorithm based on time dependent
parametric surfaces (Herzen et al. 1990). Gilbert et al. computed the minimum distance
between two convex objects with an expected linear time algorithm and used it for collision
detection (Gilbert & Foo, 1990). One of the approaches mentioned by Cameron is to perform static
collision detection repetitively at discrete time steps (Cameron & Culley, 1986).
Using linear-time preprocessing, Dobkin and Kirkpatrick were able to solve the collision
detection problem as well as compute the separation between two convex polytopes in
O(log|A|.log|B|) where A and B are polyhedra and |.| denotes the total number of faces
(Canny, 1986). This approach uses a hierarchical description of the convex objects and
extension of their previous work (Lin, 1993). This is one of the best-known theoretical
bounds.
The technique used in our work is a simple and efficient algorithm for collision detection between
links of the 6R robot undergoing rigid motion: it determines whether or not two objects intersect
by checking whether the distance between their centers has fallen to zero.
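The chapter describes this test only at a high level; a minimal sketch is given below under the assumption that each link is approximated by a bounding sphere around its centre, so that a collision is reported when the centre distance drops to (or below) the sum of the bounding radii. This is an illustrative approximation, not the authors' exact geometric test.

```python
import math

def links_collide(center_a, radius_a, center_b, radius_b):
    """Bounding-sphere collision test between two links (assumed
    approximation): the links are considered intersecting when the distance
    between their centres is not larger than the sum of the radii."""
    return math.dist(center_a, center_b) <= radius_a + radius_b
```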
Due to the undefined geometric shape of the end effector of the robot, we have described and used
a color-based object recognition algorithm in the simulation software to specify and recognize the
end effector and the target-object in the image planes of the two cameras. In addition, the
capability and performance of this algorithm in recognizing the end effector and the target-object
and in providing 3D pose information about them are shown.
In this chapter, the 6R robot designed and constructed in the IUST robotic research laboratory is
modeled and simulated. The direct and inverse kinematics equations of the robot are then derived
and simulated. After discussing the simulation software of the 6R robot, we simulate the control
and performance tests of the robot, and finally the test results are analyzed according to the
ISO 9283 and ANSI/RIA R15.05-2 standards using MATLAB.
To simulate the control and testing of the 6R robot, the object-oriented software Visual C++ 6 was
used. This programming language was chosen because of its speed and because the code can easily
be adapted to the real experimental situation. In this software, pictures are taken in bitmap format
through two stationary cameras, which are mounted on the ground, in the capture frame module,
and each image is returned in the form of an array of pixels. Both cameras take a picture after
switching the view. After image processing, the objects in the pictures are saved separately,
features are extracted, and the target-object and end effector are recognized among them according
to their features and characteristics. Then the 3D position coordinates of the target-object and the
end effector are estimated. After each motion of the joints, a new picture of the end effector is
taken, and this procedure is repeated until the end effector reaches the target-object.
A schematic view of the simulator software and the 6R robot in its home position is depicted in
Figure 2. In this figure, the 6R robot is in its home position and the target-object is the red sphere.
The aim of the control process is to guide the end effector to reach the target-object within an
acceptable accuracy.
Get Target: By pushing this button, the control process that guides the end effector to the target is
performed.
Direct Kinematics: Performance tests for direct kinematics are carried out. The joint variables are
specified in a text file by the user.
Inverse Kinematics: Inverse kinematics tests for the robot are carried out. The transformation
matrix is defined by the user in a text file; this file is read, the joint variables are determined, and
the joints rotate so that the end effector reaches the desired pose.
Continuous Path: Guides the end effector along continuous paths such as a circle, line or rectangle
to simulate performance tests. The path properties are defined in a text file by the user.
Look At: By pushing this button, the observer camera looks at the robot at any pose.
Camera Switch: Changes the view between the two stationary cameras, i.e., switches from
camera 1 to camera 2 or vice versa.
GetOrient: Changes the orientation of the selected camera frame.
Figure 6. Robot at last step of control process reached to target-object in view of camera1
and camera2
Test steps:
1. Initialize the simulator environment by clicking the Init button.
2. Select frame object No. 1 from the Selected Object box.
3. Specify its 3D x, y, z position in Frame Position and click the Set icon.
4. Click the Get Target icon to carry out the control process.
Problems 1-2:
Set the target-object at the four corners of a rectangle with coordinates (3, 0, -2), (3, 0, 2),
(-3, 0, -2), (-3, 0, 2) and guide the end effector to attain the target-object.
For the end effector to reach the position (2, -1, 2), compute the joint angles and compare them
with the actual joint angles at the end of the control process.
4.2 Mapping points in image plane to 3D system
As mentioned before, a neural network has been used to transform 2D image-plane coordinates
into 3D coordinates in the global reference frame. The collection of training points is obtained by
moving the end effector of the robot through different points whose coordinates in the global
reference system are known; their coordinates in the image planes of the two cameras are
computed in pixels by the VisionAction module of the simulator software. The position of the end
effector is recognized at any time by the two cameras, which are fixed at a certain distance from
each other. Camera No. 1 determines the target coordinates in its 2D image plane in pixels; the
third coordinate of the object is computed from the information provided by the second camera.
The neural network used is a two-layer back-propagation perceptron. The input layer (first layer)
has 4 input nodes, namely the image-plane pixel coordinates from the two fixed cameras; to fit the
strongly nonlinear mapping, 10 neurons with tan-sigmoid transfer functions are used in this layer.
The second (output) layer has 3 neurons connected to the 10 hidden neurons (30 weights), whose
outputs are the 3D coordinates x, y and z of the object in the earth reference system; the transfer
function of this layer is linear.
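A sketch of this 4-10-3 network is shown below using scikit-learn's MLPRegressor with a tanh (tan-sigmoid-like) hidden layer and a linear output layer; the original work trained an equivalent back-propagation network in its own environment, and the random arrays here are placeholders for the training points collected by the VisionAction module.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder training data: pixel coordinates (u1, v1) from camera 1 and
# (u2, v2) from camera 2, with the corresponding known world coordinates.
pixels_2cams = np.random.rand(500, 4)      # [u1, v1, u2, v2]
world_xyz = np.random.rand(500, 3)         # [x, y, z] in the earth reference frame

net = MLPRegressor(hidden_layer_sizes=(10,),   # 10 hidden neurons
                   activation="tanh",          # tan-sigmoid-like transfer function
                   solver="lbfgs",
                   max_iter=2000)
net.fit(pixels_2cams, world_xyz)               # linear output layer by default

# Map a new stereo observation to an estimated 3D position.
estimate = net.predict([[0.42, 0.31, 0.40, 0.33]])
```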
This network can be used as a general function approximator: given sufficient neurons in the
hidden layer and tan-sigmoid transfer functions, it can approximate the 3D coordinates of any
point seen in the image planes of the two cameras arbitrarily well. As shown by the training results
in Figure 7, the performance of the trained net is 0.089374 after less than 40 iterations (epochs).
This net approximates the 3D coordinates of the points well enough.
[Figure 7: training performance of the neural network (performance 0.0893742, goal 0, 200 epochs)]
A regression analysis between the network outputs and the corresponding targets has been
performed. The network outputs are plotted versus the targets as open circles (Figure 8). The best
linear fit is indicated by a dashed line, and the perfect fit (output equal to target) by the solid line.
For this trained net, it is difficult to distinguish the best linear fit line from the perfect fit line,
because the fit is so accurate; this is a measure of how well the variation in the output is explained
by the targets, and there is a perfect correlation between targets and outputs. Results for the x, y
and z directions are shown in Figure 8.
The system analysis is based on stereo-vision theory and line-matching technology, using the two
images captured by the two cameras. The vision procedure includes four stages, namely
calibration, sampling, image processing and calculation of the needed parameters.
Figure 10. Robot at step 2 of control process in view of camera1 and camera2
Figure 11. Robot at step 5 of control process in view of camera1 and camera2
First, the precision of this measuring system must be determined for the simulator software. To
maintain robot accuracy, calibration equipment is needed. In this simulator software, a
self-calibrating measuring system based on a camera in the robot hand and a known reference
object in the robot workspace is used. A collection of images of the reference target-object is
obtained. From these, the positions and orientations of the camera and the end effector are
estimated using image processing, image recognition and photogrammetric techniques. The
essential geometrical and optical camera parameters are derived from the redundancy in the
measurements. By camera calibration, we can obtain the world coordinates of the starting points of
the robot's motion and the relation between images of the target-object and its relative distance to
the end effector. Thus the amount and direction of the end effector's motion are estimated, and the
feedback for the visual servo system is obtained.
Figure 12. Robot at step 10 of control process in view of camera1 and camera2
Figure 13. Robot at last step of control process reached to target-object in view of camera1
and camera2
At the first control step, since the position of the target-object in the 3D global reference system is
not yet known, the end effector of the robot is moved to a pose in which the target-object becomes
visible in the eye-in-hand camera view; that is, the end effector searches for the target-object within
the robot's workspace. For this purpose, the hand and wrist of the 6R robot rotate to bring the end
effector to the top point of the workspace. Once the target-object is found, the robot moves toward
it. At each step, the two cameras take a picture of the target and compare the features in these
images with the reference image to assess the required motion of each joint of the 6R robot. This
procedure is repeated until the camera mounted on the end effector observes the target-object in
the middle of its image plane at the desired size. In this algorithm, the pictures taken by the two
cameras are saved as arrays of pixels, and after thresholding, segmentation and labeling, the
objects in the pictures are extracted and each single frame is stored separately with its number. The
distance between the end effector and the target-object is estimated; using the inverse kinematics
equations of the 6R robot, each joint angle is computed, and by rotating the joints the end effector
approaches the target. The control procedure of the robot for reaching the target-object is briefly
shown in Figures 9 to 13.
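The described look-and-move procedure can be summarized by the following Python-style loop. The robot object, the image-capture and object-recognition callbacks and the tolerance are hypothetical placeholders for the corresponding simulator modules.

```python
import numpy as np

def pbvs_loop(robot, net, capture_stereo, recognize_objects,
              tolerance=0.5, max_steps=50):
    """Position-based look-and-move loop for the simulated 6R robot.
    `capture_stereo()` returns the two camera images, `recognize_objects()`
    returns the pixel coordinates of the target and the end effector, and
    `net` is the trained pixel-to-3D mapping network."""
    for _ in range(max_steps):
        img1, img2 = capture_stereo()
        target_px, effector_px = recognize_objects(img1, img2)
        target_xyz = net.predict([target_px])[0]       # 3D position of the target
        effector_xyz = net.predict([effector_px])[0]   # 3D position of the end effector
        if np.linalg.norm(target_xyz - effector_xyz) < tolerance:
            return True                                # target reached
        joint_angles = robot.inverse_kinematics(target_xyz)
        robot.rotate_joints(joint_angles)              # move, then look again
    return False
```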
4.4 Comparing IB and PB visual servoing approaches
Vision based control can be classified into two main categories. The first approach, feature
based visual control, uses image features of a target object from image (sensor) space to
compute error signals directly. The error signals are then used to compute the required
actuation signals for the robot. The control law is also expressed in the image space. Many
researchers in this approach use a mapping function (called the image Jacobian) from the
image space to the Cartesian space. The image Jacobian, generally, is a function of the focal
length of the lens of the camera, depth (distance between camera (sensor) frame and target
features), and the image features. In contrast, position-based visual control constructs the spatial
relationship (the target pose) between the camera frame and the target-object frame from the
target image features.
In this chapter, both position-based and image-based approaches were used to simulate control of
the 6R robot. The advantage of the position-based approach is that the servo control structure is
independent from the target pose reconstruction. Usually, the desired control values are specified
in Cartesian space, so they are easy to visualize. In the position-based approach, the target pose is
estimated explicitly. In the image-based approach, by contrast, the 3D pose of the target-object and
the end effector is not estimated directly; instead, the robot is guided by structural features
extracted from the image (e.g., an edge or the color of pixels), defined as reference image features
when the camera and end effector reach the target, and camera calibration for the visual system is
necessary.
To construct the 3D pose of a target object from 2D image feature points, two cameras are
needed. Image feature points in each of the two images have to be matched and 3D
information of the coordinates of the target object and its feature points can then be
computed by triangulation. The distance between the feature points in the target object, for
example, can be used to help compute the 3D position and orientation of the target with
respect to the camera. However, in systems with many DOF, using the image-based approach and
camera calibration to guide the robot becomes complicated; in our position-based approach we
have instead used a trained neural net to transform coordinates. The image-based approach may
reduce the computational delay, eliminate the necessity for image interpretation and eliminate
errors in sensor modeling and camera calibration. However, it does present a
significant challenge to controller design since the process is non-linear and highly coupled.
In addition, in image-based approach, guiding the end effector to reach target will be
completed in some steps but in position-based, the end effector is guided directly toward
the target-object. The main advantage of position-based approach is that it directly controls
the camera trajectory in Cartesian space. However, since there is no control in the image, the
image features used in the pose estimation may leave the image (especially if the robot or
the camera are coarsely calibrated), which thus leads to servoing failure. Also if the camera
is coarse calibrated, or if errors exist in the 3D model of the target, the current and desired
camera pose will not be accurately estimated. Nevertheless, image based visual servoing is
249
known to be robust not only with respect to camera but also to robot calibration errors.
However, its convergence is theoretically ensured only in a region (quite difficult to
determine analytically) around the desired position. Except in very simple cases, the
analysis of the stability with respect to calibration errors seems to be impossible, since the
system is coupled and non-linear.
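A minimal sketch of this triangulation step, assuming calibrated 3x4 projection matrices for the two cameras (the chapter itself replaces explicit calibration with a trained neural network):

    import numpy as np

    def triangulate_point(P1, P2, uv1, uv2):
        # Linear (DLT-style) triangulation of one matched feature point seen by
        # two cameras with projection matrices P1 and P2 (each 3 x 4).
        (u1, v1), (u2, v2) = uv1, uv2
        A = np.vstack([
            u1 * P1[2] - P1[0],
            v1 * P1[2] - P1[1],
            u2 * P2[2] - P2[0],
            v2 * P2[2] - P2[1],
        ])
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]          # Euclidean 3D coordinates of the feature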
In this simulator software, controlling the 6R robot with both the position-based and the feature-based approach showed that the position-based approach was faster but the feature-based approach more accurate. For industrial robots with many degrees of freedom the position-based approach is used more often; in particular, for performance testing of robots we need to specify the 3D pose of the end effector at each step, so position-based visual servo control is preferred.
Results comparing the two visual servo control processes, PBVS and IBVS, are summarized in Table 1. The two approaches are used to guide the end effector of the robot to reach a target at a fixed distance from the end effector. The final pose of the wrist is determined and compared with the target-object pose, so the positioning error and accuracy are computed. The duration of each process is also measured, and the control speed is compared in this way.
Visual Servoing Method    Control Accuracy (min error)    Performance Speed (process duration)    Computation delay    Controller design
PBVS Control              0.04 mm                         20 sec                                  30 sec               simple
IBVS Control              0.01 mm                         60 sec                                  10 sec               highly coupled
Table 1. Comparison of the PBVS and IBVS control of the 6R robot
Results of the direct kinematics tests are depicted in Figure 14. The joint angles are defined by the user in a .txt file; this file is read by the software and, through the RotateJoint function, each joint rotates to its desired value.
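For illustration only, a driver for such a test might read the file and rotate the joints one by one; the file layout and the rotate_joint call are assumptions, not the simulator's actual interface:

    def rotate_joint(joint, theta):
        # Stub standing in for the simulator's joint rotation routine.
        pass

    def run_direct_kinematics_tests(path="angles.txt"):
        # One test per line: six joint angles in radians, comma separated.
        with open(path) as f:
            for test_no, line in enumerate(f, start=1):
                angles = [float(v) for v in line.split(",")]
                for joint, theta in enumerate(angles, start=1):
                    rotate_joint(joint, theta)
                print("test", test_no, "joint angles", angles)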
Figure 14. The error schematics in x, y, z directions for direct kinematics tests
b) Inverse kinematics test
In this stage, a desired pose of the end effector is given to the robot. The transformation matrix containing the position and orientation of the wrist frame is given by the user in a .txt file. By computing the joint angles from the inverse kinematics equations and rotating the joints, the end effector goes to the desired pose. By taking pictures with the two fixed cameras and using the trained neural network, the position coordinates of the end effector in the 3D global reference frame are obtained. By comparing the desired position and orientation of the wrist frame with the attained pose, the positioning error is determined. The positioning error in the x, y and z directions for 10 inverse kinematics tests is shown in Figure 15.
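A minimal sketch of the error computation for one test, assuming the desired pose is given as a 4x4 transformation matrix and the attained position comes from the camera/neural-network measurement:

    import numpy as np

    def positioning_error(T_desired, p_attained):
        # Compare the commanded wrist position (translation part of the 4x4
        # transformation matrix) with the position measured by the vision
        # system, both expressed in the global reference frame.
        p_desired = np.asarray(T_desired, float)[:3, 3]
        e = np.asarray(p_attained, float) - p_desired   # (ex, ey, ez)
        return e, float(np.linalg.norm(e))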
Figure 15. The error schematics in x, y, z directions for inverse kinematics tests
(Plots of the desired and actual paths traversed by the end effector in the simulated continuous path tests; x and y in mm.)
Problem-3
In the direct kinematics test, rotate the joints to the angles given in Table 2. Compute the positioning error for each test in the x, y and z directions and for the rotation of each joint, and draw its graph.
Table 2. Joint angles (rad) for the direct kinematics tests
Problem-4
In the inverse kinematics test, define the transformation matrix T with the wrist positions given in Table 3. The orientation of the end effector can be defined by the approach, normal and sliding vectors. Compute the positioning error and accuracy of the robot in each test and compare these errors with the direct kinematics test results.
5.2.3 Continuous path test
To determine the accuracy of the robot in traversing continuous paths, the wrist of the robot is guided along different paths. In the simulator software, three standard paths are tested (direct line, circular and rectangular paths).
Specify the type of path (c for circular, r for rectangular and l for linear) and its specification in the path.txt file and save it. Each path must be entered on a separate line, together with its parameters. For example, for a linear path, x1, y1, z1 are the coordinates of the start point and x2, y2, z2 those of the end point:
l, x1, y1, z1, x2, y2, z2 (Fig. 17)
Figure 18. Coordinates specified for rectangular path in performance tests of the robot
For a circular path, x, y are the coordinates of the center of the circle and r is its radius:
c, x, y, r (Figure 19-c)
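As a sketch, the file could be parsed as follows; only the linear and circular line formats quoted above are handled, since the rectangular parameter list is not reproduced here:

    def read_paths(filename="path.txt"):
        # 'l, x1, y1, z1, x2, y2, z2' -> linear path from start to end point
        # 'c, x, y, r'               -> circular path with center (x, y) and radius r
        paths = []
        with open(filename) as f:
            for line in f:
                parts = [p.strip() for p in line.strip().rstrip(".").split(",") if p.strip()]
                if not parts:
                    continue
                kind, values = parts[0], [float(v) for v in parts[1:]]
                if kind == "l":
                    paths.append(("linear", values[:3], values[3:6]))
                elif kind == "c":
                    paths.append(("circular", values[:2], values[2]))
        return paths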
The approach vector is normal to the direction of the path, i.e. the wrist is always normal to its path. From the pose of the end effector and the inverse kinematics equations of the robot, the joint angles are computed; the joints rotate and the end effector is positioned along its path. The coordinates of the end effector in the global reference frame are determined by taking pictures with the two fixed cameras and using the trained neural network.
Compute the path accuracy and error of the robot from the data saved in the out_path.txt file.
Problem-5
Test the motion of the end effector for the paths given below, and draw the path traversed by the end effector and the desired path in one graph to compare them.
(Path coordinates shown in Figure 19: (4, 1, 1), (-2, -4, 1), (-2, 4, 1), (2, 4, 1), (2, -4, 1); circle parameters (-2, 2), (0, 0), (3, -2), r = 3.)
Figure 19. Continuous paths for performance tests of the robot
Figure 20. Position accuracy for the 6R robot in direct and inverse kinematics tests
(Figure 21: distance accuracy error (mm) of the 6R robot over the tests.)
Figure 22. Cornering round off error and cornering overshoot in rectangular path tests
according to ISO9283
6.2 Error analysis according to the ANSI-RIA standard
Results of the simulated tests in the previous sections are analyzed according to the ANSI-RIA standard and compared with the ISO9283 results.
TEST         AC (max)    AC (mean)    CR                        CO
line         0.67        0.21         -                         -
rectangle    0.47        0.09         0.47, 0.20, 0.17, 0.15    1.03, 0.95, 0.17, 0.26
circle       0.48        0.25         -                         -
Cornering round off error CR in this standard is defined as the minimum distance between the corner point and any point on the attained path. Cornering overshoot CO is defined as the largest deviation outside the reference path after the robot has passed the corner. For the rectangular path test of the 6R robot the values of CR and CO are calculated (Table 4). The tests were repeated 10 times (n = 10). Two cameras, observing the end effector at a fixed distance at specified intervals, take pictures of the end effector and its environment. Its coordinates are obtained from the image plane with the position-based visual system.
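A simplified sketch of how CR and CO could be computed from the sampled attained path (the corner point and a distance-to-reference-path function are assumed to be available):

    import numpy as np

    def cornering_errors(attained_path, corner, dist_outside_reference):
        # attained_path: (N, 2) measured wrist positions; corner: programmed corner
        # point; dist_outside_reference(p): deviation of p outside the reference
        # path (0 if on or inside it).  CR is the minimum distance between the
        # corner and the attained path; CO is the largest outside deviation after
        # the corner has been passed.
        pts = np.asarray(attained_path, float)
        c = np.asarray(corner, float)
        d = np.linalg.norm(pts - c, axis=1)
        cr = float(d.min())
        after = pts[int(d.argmin()):]      # samples from the closest approach onward
        co = float(max(dist_outside_reference(p) for p in after))
        return cr, co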
To transform the coordinates of the wrist of the robot to the reference frame, as mentioned before, we have used neural networks in this work. Using neural networks we map coordinates from the image plane into the reference system, in order to obtain real distances. The maximum and mean path accuracy FOM and, for the rectangular path tests, the corner deviation error (CR) and cornering overshoot (CO) are listed in Table 4.
The poses of the target-object and end effector are estimated. After each motion of the joints, a new picture is taken of the end effector, and this procedure is repeated until the end of the process. To determine the accuracy of the robot in traversing continuous paths, the wrist of the robot is guided along different paths. In the experimental tests, three standard paths are tested.
a) Direct line
To move the end effector along a direct line, its start and end points must be determined. The approach vector is normal to the direction of the line path, i.e. the wrist is always normal to its path. From the pose of the end effector and the inverse kinematics equations of the robot, the joint angles are computed; the joints rotate and the end effector is positioned along its path. At each step, the two stationary cameras take images of the robot and its workspace. From these pictures and the trained neural network, the coordinates of the wrist in the global reference frame are determined. The positioning error is determined by comparing the ideal pose with the actual one. The error of the robot in traversing the direct line path is shown in Figure 25-a.
b) Circular path
We investigate the accuracy, repeatability and error of the robot in traversing a circular continuous path. The circle lies in a horizontal plane, i.e. the height of the end effector above the ground is constant. The wrist is oriented so that the end effector is always in the horizontal plane and normal to the circular path, and the wrist slides along the perimeter of the circle. In this way the sliding, approach and normal vectors are determined and the inverse kinematics equations can be solved. During the motion of the wrist along the path, images of the end effector are taken using the two webcams. In this way, the end effector coordinates in the image plane are collected. Using the neural network, the image plane coordinates of the points are transformed to the reference frame. The desired path and the actual path traversed by the robot are shown in Figure 25-b.
Figure 23. Webcam positions in the experimental tests of the robot (front and right cameras)
Figure 24. The error schematics in x, y, z directions for direct kinematics tests of the 6R robot
(Figure 25. Desired and actual paths of the end effector in the experimental tests: (a) direct line, (b) circular path, (c) rectangular path; X and Y in cm.)
The main points in a simulation of collision among objects can be separated into three parts: collision detection, contact area determination, and collision response (Ponamgi et al). In this research, we have considered the first part, to prevent penetration of the links of the 6R robot into each other during their motion.
To determine whether or not two objects intersect, we must check whether the distance between their border edges is equal to zero; the lower bound for the distance between each pair of objects is therefore zero. In this work the collision detection technique uses spheres attached to different parts of the robot that move together with them. These spheres are arranged compactly enough to fit the robot shape, so a large number of spheres is used.
In an environment with D moving objects and S stationary objects, the number of possible collisions between pairs of objects is

D(D - 1)/2 + DS

Determining all of them would be time consuming when D and S are large. By considering the robot geometry and its joint rotations we can determine which pairs of spheres may come into contact and which may not. The total number of pairwise collision checks therefore decreases, and much time is saved.
In Figure 26 the schematic shape of the 6R robot and the bounding spheres on its different parts are shown. The diameter of each sphere is determined according to the size of the object bounded by the sphere.
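A minimal sketch of the sphere test described above; the sphere centers, radii and the pruned list of candidate pairs are assumed to come from the robot geometry:

    import numpy as np

    def spheres_collide(c1, r1, c2, r2):
        # Two bounding spheres intersect when the distance between their centers
        # does not exceed the sum of their radii.
        return np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float)) <= r1 + r2

    def any_collision(spheres, candidate_pairs):
        # Check only the pairs that the joint geometry allows to meet, instead of
        # all D(D - 1)/2 + DS combinations.
        for i, j in candidate_pairs:
            (ci, ri), (cj, rj) = spheres[i], spheres[j]
            if spheres_collide(ci, ri, cj, rj):
                return True
        return False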
Figure 28. Performance of the simulated color-based object recognition algorithm in determining the pose of (a) the end effector and (b) the target-object
The object recognition algorithm has two steps: first, to find the objects of interest in the pictures taken by the two cameras, and then to provide the required information (e.g. pose) about these objects. For the first step, the model or properties of the objects of interest are provided to the vision system. As noted before, the end effector does not have a basic geometric shape and, due to its roll and pitch rotations, its dimensions and appearance are not invariant in the two camera views at each instant. So we cannot use a set of dimensions or distances to recognize the end effector. We must instead identify image features that are invariant with respect to image scaling, translation and rotation, partially invariant with respect to illumination changes, and minimally affected by noise and small distortions. Lindeberg showed that, under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis (Low). To achieve rotation invariance and a high level of efficiency, we have defined two special RGB colors, one for the target-object and one for the end effector of the 6R robot. By image processing, the RGB value of each pixel is found and, if it matches the RGB value of the object of interest, the coordinates of that pixel are saved; the center position of these pixels in the two image planes is determined and then, using the neural network, the 3D coordinates of the target-object and the end effector in the global reference frame are obtained. The results of the simulation of the color-based object recognition algorithm for the end effector and target-object are presented in Figure 28, which shows the error of the position estimates of the end effector and target-object in the x, y and z directions.
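A hedged sketch of the color-matching step (an exact match with a small tolerance is assumed; mapping the two image-plane centers to 3D coordinates through the trained neural network is outside this snippet):

    import numpy as np

    def color_centroid(image, target_rgb, tol=10):
        # Return the center (row, col) of all pixels whose RGB value matches the
        # special color assigned to the target-object or to the end effector.
        img = np.asarray(image, dtype=int)                       # H x W x 3
        mask = np.all(np.abs(img - np.asarray(target_rgb)) <= tol, axis=-1)
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            return None                                          # object not visible
        return float(ys.mean()), float(xs.mean())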
10. Conclusion
In this chapter, both position-based and image-based approaches were used to simulate control of the 6R robot. The IBVS control approach uses image features of a target-object from image (sensor) space to compute error signals directly. The error signals are then used to compute the required actuation signals for the robot. The control law is also expressed in the image space. Many researchers in this approach use a mapping function (called the image Jacobian) from the image space to the Cartesian space. The image Jacobian, generally, is a function of the focal length of the camera lens, the depth (distance between the camera (sensor) frame and the target features), and the image features. In contrast, the PBVS control constructs the spatial relationship (target pose) between the camera frame and the target-object frame from target image features. Many construction algorithms have been proposed.
The advantage of the position-based approach is that the servo control structure is independent of the target pose reconstruction. Usually, the desired control values are specified in the Cartesian space, so they are easy to visualize. In the position-based approach, the target pose is estimated. In the image-based approach, the 3D pose of the target-object and end effector is not estimated directly; instead, the robot is guided by structural features extracted from the image (e.g., an edge or the color of pixels), defined as reference image features when the camera and end effector reach the target, and camera calibration of the visual system is necessary.
Test errors have been analyzed using different standards and MATLAB to compute performance parameters of the 6R robot such as accuracy, repeatability, and cornering overshoot. Performance parameters computed according to the ANSI and ISO standards are fairly close to each other. Statistical quantities computed with MATLAB also confirm the standards-based analysis. In the simulator environment, we have detected collisions between parts of the robot by using a bounding-spheres algorithm. To improve the accuracy of the collision detection we have used very small bounding spheres, breaking the links of the robot into several parts and enclosing each of them in a bounding sphere of its own.
Finally, simulation results of the color-based object recognition algorithm used to provide the required information (e.g. pose) about the target-object and the end effector were presented.
11. References
American National Standard for Industrial Robots and Robot Systems Path-Related and Dynamic
Performance Characteristics Evaluation. ANSI/RIA R15.05-2. 2002.
Boyse, J. W. (1979) Interference detection among solids and surfaces. ACM, 22(1):3-9.
Cameron, S.A. (1985) A study of the clash detection problem in robotics. Proc. IEEE ICRA, pp. 488-493.
Cameron, S.A. & Culley, R.K. (1986) Determining the minimum translational distance between two convex polyhedra. Proc. IEEE ICRA, pp. 591-596.
Canny, J. (1986) Collision detection for moving polyhedra. IEEE Trans. PAMI, 8, pp. 200-209.
Gilbert, E. & Foo, C.P. (1990) Computing the distance between general convex objects in three dimensional space. IEEE Trans. Robotics Aut., 6(1).
Gilbert, A., Giles, M., Flachs, G., Rogers, R. & Yee, H. (1983). A real time video tracking system, IEEE Trans. Pattern Anal. Mach. Intell., 2(1), pp. 47-56.
Hashimoto, H., Kimoto, T. & Ebin, T. (1991). Manipulator control with image based visual servoing, In Proc. IEEE Conf. Robotics and Automation, pp. 2267-2272.
Herzen, B.V., Barr, A.H. & Zatz, H.R. (1990) Geometric collisions for time dependent parametric surfaces. ACM Computer Graphics, 24(4), August.
ISO9283 (1998) Manipulating industrial robots - performance criteria & related test methods.
Kelly, R. Shirkey, P. & Spong, M. (2001). Fixed camera visual servo control for planar robots.
Korayem, M. H. Khoshhal, K. and Aliakbarpour, H. (2005) Vision Based Robot Simulation
and Experiment for Performance Tests of Robot, International J. of AMT, Vol.25, No.
11-12, pp. 1218-1231.
Korayem, M H. Shiehbeiki, N. & Khanali, T. (2006). Design, Manufacturing and
Experimental Tests of Prismatic Robot for Assembly Line, International J. of AMT,
Vol.29, No. 3-4, pp. 379-388.
Lin, M.C. (1993) Efficient Collision Detection for Animation and Robotics. PhD thesis, Department of Electrical Engineering and Computer Science, University of California, Berkeley.
Ponamgi, M.K., Manocha, D. & Lin, M.C. Incremental algorithms for collision detection between solid models, Department of Computer Science, University of North Carolina.
Rives, P. Chaumette, F. & B. Espiau. (1991) Positioning of a robot with respect to an object,
tracking it and estimating its velocity by visual servoing. In Proc. IEEE Int. Conf.
Robotics and Automation, pp 2248-2253.
Saedan M. & Ang M Jr. (2001) 3D Vision-Based Control of an Industrial Robot, Proceedings of
the IASTED Int. Conf. on Robotics and Applications, Florida, USA, pp. 152-157.
Webber, T. & Hollis, R. (1988) A vision based correlator to actively damp vibrations of a coarse-fine manipulator, Watson Research Center.
15
Image Magnification based on the Human Visual
Processing
Sung-Kwan Je1, Kwang-Baek Kim2, Jae-Hyun Cho3 and Doo-Heon Song4
1. Introduction
Image magnification is among the basic image processing operations and has many applications in various areas. Recently, multimedia techniques have advanced to provide various multimedia data such as digital images and VOD, and the digital multimedia contents market has been growing rapidly. In education, many studies have used e-learning techniques. Various devices - imaging equipment, CCD cameras, digital cameras and cellular phones - are used to make multimedia contents. They are now widespread and, as a result, computer users can buy them and acquire as many digital images as desired. This is why the need to display and print them also increases (Battiato & Mancuso, 2001; Battiato et al., 2002).
To obtain high-resolution images, optical lenses are used. Such lenses are not only expensive but also too big to carry, so digital zooming is used together with the lenses to solve the problem. The digital zooming method generally uses nearest neighbor interpolation, which is simpler and faster than other methods, but it has drawbacks such as the blocking phenomenon when an image is enlarged. To improve on these drawbacks, bilinear interpolation and cubic convolution interpolation are used commercially in the software market. The bilinear method uses the average of 4 neighborhood pixels. It avoids the blocking phenomenon but brings a loss of image quality, such as the blurring phenomenon, when the image is enlarged. Cubic convolution interpolation improves on the image loss of nearest neighbor and bilinear interpolation, but it is slow because it uses the offsets of 16 neighborhood pixels (Aoyama & Ishii, 1993; Candocia & Principe, 1999; Biancardi et al., 2002).
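For reference, these conventional magnifications can be reproduced with standard resampling orders; the sketch below uses scipy, where order 0 is nearest neighbor, order 1 is bilinear and order 3 is a cubic spline (close to, but not identical with, the cubic convolution kernel):

    import numpy as np
    from scipy.ndimage import zoom

    def magnify(image, factor=2, method="bilinear"):
        # Enlarge a grayscale image with one of the conventional interpolations.
        order = {"nearest": 0, "bilinear": 1, "cubic": 3}[method]
        return zoom(np.asarray(image, dtype=float), factor, order=order)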
A number of methods for magnifying images have been proposed to solve such problems. However, the proposed magnification methods have the disadvantage that either the sharpness of the edges cannot be preserved or some highly visible artifacts are produced in the magnified image. Although previous methods show high performance in special environments, the basic problems remain. Recently, research on human vision processing has progressed rapidly. In addition, a large number of models of the human vision system have been proposed to solve the drawbacks of
machine vision, such as object recognition and object detection (Suyung, 2001). In the field of optical neural systems, much research has been conducted in relation to physiology and biology to address the problem of human information processing. Features of biological visual systems at the retinal level serve to motivate the design of electronic sensors. Although commercially available machine vision sensors are beginning to approach the photoreceptor densities found in primate retinas, they are still outperformed by biological visual systems in terms of dynamic range and the strategies of information processing employed at the sensor level (Shah & Levine, 1993). However, most retina models have focused only on the characteristic functions of the retina, generalizing its mechanisms for the researcher's convenience or even by intuition. Although such a system is efficient for achieving a specific goal in its current environment, it is difficult to use it to analyze and understand the visual scene as a human does. Current visual systems are used in very restricted ways due to the insufficient performance of algorithms and hardware.
Recently, there has been much active research aimed at maximizing the performance of computer vision technology and developing artificial vision through the modeling of human visual processing. Artificial vision aims to reproduce the information processing procedures of the human visual system based on its biological characteristics. Compared with machine vision technology, it can be applied effectively in industry. By investing over 20 billion yen between 1997 and 2016, Japan is conducting research on machine intelligence, voice recognition and artificial vision based on the information processing mechanism of the brain. Through the National Science Foundation (NSF) and the Application of Neural Networks for Industries in Europe (ANNIE), America and Europe are also conducting research on artificial vision, as well as artificial intelligence and voice recognition, using models of the brain's information processing (Dobelle, 2000).
This paper presents a method for magnifying images that produces high-quality results based on properties of human vision, namely image reduction in the retinal cells and magnification of the input image information in the visual cortex. The rest of this paper is organized as follows. Section 2 presents the properties of the human visual system and related work on image magnification. Section 3 presents our method, which extracts edge information using a wavelet transform and uses the edge information based on the properties of human visual processing. Section 4 presents the results of the experiment, and some concluding remarks are made in Section 5.
Another method is bilinear interpolation, which determines the value of a new pixel based on a weighted average of the 4 pixels in the nearest 2x2 neighborhood of the pixel in the original image (Gonzalez & Richard, 2001). This method produces relatively smooth edges with hardly any blocking and is better than the nearest neighbor method, but it exhibits the blurring phenomenon. It is defined as equation (2):

Z(i, 2j) = I(k, l),   Z(i, 2j + 1) = 1/2 [I(k, l) + I(k, l + 1)]
Z(2i, j) = Ii(k, l),  Z(2i + 1, j) = 1/2 [Ii(k, l) + Ii(k + 1, l)]                  (2)
f(x) = a|x|³ - 5a|x|² + 8a|x| - 4a,   1 ≤ |x| < 2   (where a = -0.5 or -1)
f(x) = 0, elsewhere                                                                  (3)
Recently, research on image interpolation that takes edges into account has gained much attention. Salisbury et al. (1996) proposed methods that search for edges in the input images and use them to ensure that the interpolation does not cross them. The problem is how to define and find the important edges in the input image.
Other edge-adaptive methods have been proposed by Li & Orchard (2001). The commercial software Genuine Fractals also uses an edge-adaptive method to magnify images, but the details of the algorithm are not provided. Currently, the methods presented in Muresan & Parks (2004) are the most widely known edge-adaptive methods. They can avoid jagged edges well enough, but have the limitation that they sometimes introduce highly visible artifacts into the magnified images, especially in areas with small repetitive patterns (Johan & Nishita, 2004).
In Section 3, we propose an efficient method based on image reduction and edge enhancement derived from the properties of human visual processing.
2.2 Human visual processing
In the field of computer vision, much research has been conducted on edge information to solve the problem of magnification. In the human visual system, image information received by the retina is not transmitted directly to the cerebrum when we recognize it, because many types of cells take part in the processing (Bruce, 2002).
First, the visual process begins when visible light enters the eye and forms images on the retina, a thin layer of neurons lining the back of the eye. The retina consists of a number of different types of neurons, including the rod and cone receptors, which transform light energy into electrical energy, and fibers that transmit electrical energy out of the retina in the optic nerve. Second, the signals generated in the receptors trigger electrical signals in the next layer of the retina, the bipolar cells, and these signals are transmitted through the various neurons in the retina until eventually they are transmitted out of the eye by ganglion cell fibers. These ganglion cell fibers flow out of the back of the eye and become fibers in the optic nerve. Ganglion cells can be divided into P-cells and M-cells. P-cells carry the main image information related to 'what', whereas M-cells carry the edge information of images. That is, information related to perceiving 'what' is transmitted by P-cells, which comprise 80% of all ganglion cells and minimize the loss during transmission, whereas information related to 'where' is sent to M-cells, which comprise 20% of all ganglion cells (Duncan, 1984).
The biological retina is more than just a simple video camera. It not only converts optical information to electrical signals but performs considerable processing on the visual signal itself before transmitting it to higher levels. Various local adaptation mechanisms extend the retina's dynamic range by several orders of magnitude. In order to meet the transmission bottleneck at the optic nerve, the retina extracts only those features required at later stages of visual information processing (Suyung, 2001).
(Diagram of the visual pathway: photoreceptors, bipolar cells and ganglion cells in the retina, then the LGN and the visual cortex.)
(Figure 2. Schematic diagram of the proposed method: edge detection of the input image with the DoG function, edge emphasis with the high-boost filter, computation of the directional (vertical, horizontal and diagonal) components of the image information and the edge information, their combination and decomposition according to the quantity of information, and normalization of the magnified image.)
The human visual system does not transfer image information from the retina to the visual cortex in the brain directly. Owing to the properties of the retinal cells, information is reduced when visual information is transferred from the receptors to the ganglion cells. In addition, the reduced information from the retina is transferred to the visual cortex, where the information is magnified. We propose a magnification method based on these properties. The proposed magnification uses edge information, which is not used in interpolation-based image processing. In image processing, the interpolated magnified image is computed from the average or offset of neighborhood pixels; this is not ideal, since it uses only neighborhood pixels.
Edge information is important for distinguishing the background from the object. If only neighborhood pixels are considered, a pixel carrying edge information cannot be used to distinguish background from object, so such local detection of edge information is insufficient. In this paper, we therefore calculate the edge information of the whole image. In order to solve the problem of magnification, the direction of the edge information is also considered. The schematic diagram of the method is shown in Fig. 2.
3.1 Edge Detection
First, we calculate the edge information from the input image. There are many edge detection methods, such as the Laplacian operator, the Sobel operator and Gaussian operators. In this paper, we calculate the edge information using the DoG (Difference of two Gaussians) function, which models processing in the human visual system. Wilson proposed a model in which edge information is detected in the retina of the human visual system using the second derivative function ∇²G (LoG, Laplacian of a Gaussian). According to Marr and Hildreth, the DoG function gives results similar to ∇²G, and it is faster and more effective for detecting intensity changes in the image than ∇²G (Dowling, 1987; Suyung, 2001). In this paper, with r the distance from the center, equations (4)-(6) give the response of the input image to the Gaussian filters:
Ge(σe, r) = (1 / (2πσe²)) · exp(−r² / 2σe²)                                          (4)

Gi(σi, r) = (1 / (2πσi²)) · exp(−r² / 2σi²)                                          (5)

DoG(σi, σe, r) = (1 / (2πσe²)) · exp(−r² / 2σe²) − (1 / (2πσi²)) · exp(−r² / 2σi²)   (6)
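A sketch of the DoG edge detector of equations (4)-(6), implemented as the difference of two Gaussian smoothings; the particular values of σe and σi are assumptions, since they are not stated here:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_edges(image, sigma_e=1.0, sigma_i=1.6):
        # Difference of two Gaussians: the narrow (excitatory) response minus the
        # wide (inhibitory) response approximates the Laplacian of a Gaussian.
        img = np.asarray(image, dtype=float)
        return gaussian_filter(img, sigma_e) - gaussian_filter(img, sigma_i)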
However, the DoG gives less edge information than the other derivative operators (the Laplacian and Sobel operators) that are mostly used in image processing. To solve this problem, we propose an algorithm that emphasizes the contrast regions of the image. The unsharp mask tool is normally used to emphasize an image in image processing; however, it causes a loss of image quality, adds noise, and drastically reduces the intensity gradient when the spatial edges of the image, namely the contrast regions, are sharpened. To avoid this, we instead add the convolved high-boost filter response and the edge information:

MImage[i, j] = G[i, j] + HB[i, j]                                                    (7)

where HB is the image convolved with the 3x3 high-boost mask

    (1/9) · [ −1  −1  −1
              −1   w  −1
              −1  −1  −1 ],    w = 9A − 1

HB is a high-boost filter; it sharpens the image and is added to G[i, j]. By setting w = 9A − 1 (A ≥ 1) in equation (7), one obtains the enhanced edge information MImage[i, j]. The proposed method improves the sensitivity of the edge detection.
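Equation (7) can be sketched as a convolution with the high-boost mask followed by addition of the edge map G; the boost coefficient A is an assumption:

    import numpy as np
    from scipy.ndimage import convolve

    def emphasize_edges(image, G, A=1.2):
        # 3x3 high-boost mask with center weight w = 9A - 1, scaled by 1/9; the
        # sharpened image HB is added to the DoG edge information G (equation (7)).
        w = 9.0 * A - 1.0
        hb_mask = np.array([[-1.0, -1.0, -1.0],
                            [-1.0,   w,  -1.0],
                            [-1.0, -1.0, -1.0]]) / 9.0
        HB = convolve(np.asarray(image, dtype=float), hb_mask, mode="nearest")
        return np.asarray(G, dtype=float) + HB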
The directional components of the input image are computed as

PSy[i, j] = PImage[i + 1, j] − PImage[i, j]                                          (8)

PSx[i, j] = PImage[i, j + 1] − PImage[i, j]                                          (9)
With the input image denoted PImage, equation (8) gives the vertical direction component of the input image, PSy. In the same way, equation (9) gives the horizontal direction component, PSx, where i and j are the vertical and horizontal indices of the image.
The diagonal direction uses the vertical and horizontal directions. Denoting the diagonal component by PSz, equation (10) applies the AND operation to the vertical and horizontal components:

PSz[i, j] = PSx[i, j] & PSy[i, j]                                                    (10)
Information on the vertical, horizontal and diagonal directions of the input image is calculated with equations (8), (9) and (10). In the same way, information on the vertical, horizontal and diagonal directions of the detected edge information is calculated using equation (11):

MSy[i, j] = MImage[i, j] − MImage[i + 1, j]
MSx[i, j] = MImage[i, j] − MImage[i, j + 1]                                          (11)
MSz[i, j] = MSx[i, j] & MSy[i, j]

In equation (11), MImage is the detected edge information, MSy is its vertical direction component, MSx its horizontal direction component and MSz its diagonal direction component.
Thus, we calculate 7 pieces of information from the input image: the detected edge information, the vertical, horizontal and diagonal directions of the input image, and the vertical, horizontal and diagonal directions of the detected edge information. They all carry position, direction and edge information, but they hold different quantities of information with regard to the edge; in particular, the vertical direction of the input image and of the detected edge information hold different quantities of information. The difference in the quantity of information between the vertical and horizontal directions is due to the edge. By the difference operation in equation (8), the detected edge information has an error on the left-hand side of the ideal edge; in the same way, by equation (11), it has an error on the right-hand side of the ideal edge. To solve this problem, we apply the ADD operation to the corresponding directions of the detected edge information, and we process the combination and decomposition considering the quantity of image information (pixel intensity) and edge information in each direction of the input image and the 7 pieces of information. Most of the information is contained in the input image, the vertical direction of the input image and the vertical direction of the detected edge information.
Denoting the larger quantity of image information and direction by VCcomplexY and the smaller quantity of image information and direction by VCcomplexX, we process the combination and decomposition as follows:
VCcomplexY[i, j] = PImage[i, j] + (PSy[i, j] + MSy[i, j])
VCcomplexY[i + 1, j] = PImage[i, j] − (PSy[i, j] + MSy[i, j])                        (12)
With the input image denoted PImage, the vertical direction of the input image PSy and the vertical direction of the detected edge information MSy, equation (12) gives the larger quantity of image information and direction, denoted VCcomplexY. VCcomplexY is a combination in which the input image is added to the vertical direction of the input image and of the detected edge information. When the combination carrying the larger quantity of image information is created, we apply the ADD operation; in the same way, for the decomposition carrying the smaller quantity, we apply the difference operation. Accordingly, we emphasize the edge information by using the ADD and difference operations for the combination and decomposition.
First, we apply the ADD operation to the corresponding directions of the input image and the calculated edge information. VCcomplexX is the combination in the horizontal direction, obtained by adding the horizontal direction of the input image and of the calculated edge information. When the combination of the larger quantity of images is formed, we use the ADD operation:
VCcomplexX[i, j] = (PSx[i, j] + MSx[i, j]) + (PSz[i, j] + MSz[i, j])
VCcomplexX[i, j + 1] = (PSx[i, j] + MSx[i, j]) − (PSz[i, j] + MSz[i, j])             (13)
With the horizontal direction of the input image denoted PSx, the diagonal direction of the input image PSz, the horizontal direction of the detected edge information MSx and the diagonal direction of the detected edge information MSz, equation (13) gives the smaller quantity of image information and its direction, VCcomplexX. VCcomplexX is a combination of the horizontal and diagonal directions, obtained by adding the horizontal and diagonal directions of the input image and of the detected edge information. In the same way as for equation (12), the decomposition of the smaller quantity of images is processed with the difference operation. Likewise, we emphasize the edge information by using the ADD and difference operations for the combination and decomposition. The magnified image is obtained through this combination and decomposition, which solves the problem of the loss of high frequencies. However, the magnified image contains too much high-frequency information in VCcomplexX and VCcomplexY. To reduce the risk of errors in the edge information at high frequencies, we apply a normalizing operation using the Gaussian operator. The Gaussian operator is commonly used in analyzing brain waves in the visual cortex; once a suitable mask has been calculated, the Gaussian smoothing can be performed using standard convolution methods.
VCHypercomplex[i, j] = ((VCcomplexX[i, j] + VCcomplexY[i, j]) / (2πσ²)) · exp(−(i² + j²) / 2σ²)
VCHypercomplex[i, j + 1] = ((VCcomplexX[i, j] − VCcomplexY[i, j]) / (2πσ²)) · exp(−(i² + j²) / 2σ²)    (14)

where exp(−(i² + j²)/2σ²)/(2πσ²) is the Gaussian operator; thus one obtains the normalized magnified image.
In summary, we first calculate the edge information using the DoG function and emphasize the contrast regions using the enhanced unsharp (high-boost) mask. We then calculate each direction of the input image and of the edge information to reduce the risk of errors in the edge information. To evaluate the performance of the proposed algorithm, we compare it with the previous algorithms: nearest neighbor interpolation, bilinear interpolation and cubic convolution interpolation.
4. Experimental results
We used Matlab 6.5 on a Pentium 2.4 GHz PC with 512 MB of memory under Windows XP, and simulated the computational retina model based on human visual information processing that is proposed in this paper. We used the SIPI Image Database and the HIPR package, which are used regularly in other papers on image processing. SIPI is an organized research unit within the School of Engineering at USC, founded in 1971, that serves as a focus for broad fundamental research in signal and image processing techniques; it studies all aspects of signal and image processing and provides the SIPI Image Database, SIPI technical reports and various image processing services. HIPR (Hypermedia Image Processing Reference) provides on-line assistance for users of image processing. The HIPR package contains a large number of images which can be used as a general-purpose image library for image processing experiments; it was developed at the Department of Artificial Intelligence of the University of Edinburgh in order to provide a set of computer-based tutorial materials for use in taught courses on image processing and machine vision. In this paper, we propose magnification using edge information to solve the image loss problems, such as the blocking and blurring phenomena, that occur when an image is enlarged. For performance evaluation, human visual judgement is the best, but it is a subjective way of evaluating the algorithm. We therefore calculate the PSNR and the correlation between the original image and the magnified image to obtain an objective comparison with the other algorithms.
First, we measured the processing time taken to enlarge the 256x256 Lena image to a size of 512x512. As shown in Fig. 3, the nearest neighbor interpolation is very fast (0.145 s), but it loses parts of the image due to the blocking phenomenon. The bilinear interpolation is relatively fast (0.307 s), but it also loses parts of the image due to the blurring phenomenon. The cubic convolution interpolation does not suffer from image loss caused by the blocking and blurring phenomena, but it is slow (0.680 s) because it uses 16 neighborhood pixels. The proposed algorithm solves the problem of image loss and is faster than the cubic convolution interpolation (0.436 s).
Figure 3. Comparison of the processing time of each algorithm
To evaluate the performance as perceived by human vision, Fig. 4 shows a reduction of the 512x512 Lena image to 256x256 by averaging 3x3 windows, followed by an enlargement back to 512x512 using each algorithm. We enlarged the central part of the image 8 times to evaluate the visual quality. In Fig. 4, the blocking phenomenon is visible in the nearest neighbor interpolation (b), and the blurring phenomenon is visible in the bilinear interpolation (c). The proposed algorithm gives a better resolution than the cubic convolution interpolation, as seen in Fig. 4(d, e).
We calculated the PSNR for an objective evaluation. With the original image denoted X and the magnified image X*, equation (15) gives the PSNR:

PSNR = 20 log10 ( 255 / √MSE ),   MSE = (1/(N·M)) Σi=0..N−1 Σj=0..M−1 ( X(i, j) − X*(i, j) )²     (15)
The MSE is the mean square error between the original image and the magnified image. Generally, PSNR values lie in the range 20-40 dB, and in human vision no difference can be seen between the cubic convolution interpolation and the proposed algorithm; in Table 1, however, the two algorithms differ. The bilinear interpolation loses image quality due to the blurring phenomenon, yet its PSNR value of 29.92 is better than that of the cubic convolution interpolation, 29.86. This is because the reduction was performed by the averaging method, which is similar to bilinear interpolation. We can conclude from Table 1 that the proposed algorithm is better than the other algorithms, with a PSNR value of 31.35.
X̄ = X − Average(X),   X̄* = X* − Average(X*)

Cross correlation(X, X*) = Σi=0..n X̄i X̄*i / √( Σi=0..n X̄i² · Σi=0..n X̄*i² )          (16)
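A sketch of the two objective measures of equations (15) and (16):

    import numpy as np

    def psnr(original, magnified):
        # Peak signal-to-noise ratio between the original and the magnified image.
        x = np.asarray(original, dtype=float)
        y = np.asarray(magnified, dtype=float)
        mse = np.mean((x - y) ** 2)
        return 20.0 * np.log10(255.0 / np.sqrt(mse))

    def cross_correlation(original, magnified):
        # Normalized cross-correlation after removing each image's mean value.
        x = np.asarray(original, dtype=float).ravel()
        y = np.asarray(magnified, dtype=float).ravel()
        x = x - x.mean()
        y = y - y.mean()
        return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))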
Magnification method               PSNR (db)    Cross-correlation
Nearest neighbor interpolation     19.54        0.978983
Bilinear interpolation             29.92        0.985436
Cubic convolution interpolation    29.86        0.985248
Proposed algorithm                 31.35        0.990109
Table 1. Comparison of the PSNR and cross-correlation of each algorithm

Magnification method               PSNR (db)    Cross-correlation
Nearest neighbor interpolation     29.86        0.987359
Bilinear interpolation             30.72        0.989846
Cubic convolution interpolation    31.27        0.991336
Proposed algorithm                 31.67        0.991363
Table 2. Comparison of the PSNR and cross-correlation of each algorithm
Magnification method               Peppers    Aerial    Airplane    Boat     -
Nearest neighbor interpolation     20.38      26.79     22.62       32.55    25.50
Bilinear interpolation             23.00      31.10     25.46       33.44    25.50
Cubic convolution interpolation    23.64      31.93     26.64       33.72    29.39
Proposed algorithm                 23.81      32.04     27.65       34.52    30.27
Table 3. Comparison of the PSNR of our method and general methods in several images
Magnification method               Peppers     Aerial      Airplane    Boat        -
Nearest neighbor interpolation     0.834635    0.976500    0.885775    0.966545    0.857975
Bilinear interpolation             0.905645    0.991354    0.940814    0.973788    0.977980
Cubic convolution interpolation    0.918702    0.992803    0.954027    0.975561    0.982747
Proposed algorithm                 0.921496    0.993167    0.963795    0.976768    0.986024
Table 4. Comparison of the correlation value of our method and general methods in several images
5. Conclusions
In image processing, interpolation-based magnification brings about image loss problems, such as the blocking and blurring phenomena, when the image is enlarged. In this paper, we proposed a magnification method that considers the properties of human visual processing to solve such problems. As a result, our method is faster than the other algorithms that are capable of removing the blocking and blurring phenomena when the image is enlarged. The cubic convolution interpolation can produce a high-resolution enlarged image, but the processing is slow because it uses 16 neighbor pixels. The proposed algorithm is better than the cubic convolution interpolation in both processing time and performance. In the future, to reduce the error ratio, we will enhance the normalization filter, since the Gaussian filter is a low-pass filter and introduces some blurring.
6. References
Battiato, S. and Mancuso, M. (2001) An introduction to the digital still camera Technology,
ST Journal of System Research, Special Issue on Image Processing for Digital Still Camera,
Vol. 2, No.2
Battiato, S., Gallo, G. and Stanco, F. (2002) A Locally Adaptive Zooming Algorithm for Digital Images, Image and Vision Computing, Elsevier Science B.V., Vol. 20, pp. 805-812, 0262-8856
Aoyama, K. and Ishii, R. (1993) Image magnification by using Spectrum Extrapolation, IEEE
Proceedings of the IECON, Vol. 3, pp. 2266 -2271, 0-7803-0891-3, Maui, HI, USA, Nov.
1993, IEEE
Candocia, F. M. and Principe, J. C. (1999) Superresolution of Images based on Local Correlations, IEEE Transactions on Neural Networks, Vol. 10, No. 2, pp. 372-380, 1045-9227
Biancardi, A., Cinque, L. and Lombardi, L. (2002) Improvements to Image Magnification,
Pattern Recognition, Elsevier Science B.V., Vol. 35, No. 3, pp. 677-687, 0031-3203
Suyung, L. (2001) A study on Artificial vision and hearing based on brain information
processing, BSRC Research Report: 98-J04-01-01-A-01, KAIST, Korea
Shah, S. and Levine, M. D. (1993) Visual Information Processing in Primate Retinal Cone
Pathways: A Model, IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol.
26, Issue. 2, pp. 259-274, 1083-4419
Shah, S. and Levine, M. D. (1993) Visual Information Processing in Primate Retina:
Experiments and results, IEEE Transactions on Systems, Man and Cybernetics, Part B,
Vol. 26, Issue. 2, pp. 275-289, 1083-4419
Dobelle, W. H. (2000) Artificial Vision for the Blind by Connecting a Television Camera to
the Visual Cortex, ASAIO journal, Vol. 46, No. 1, pp. 3-9, 1058-2916
Gonzalez, R. C., and Richard E. W. (2001) Digital Image Processing, Second edition, Prentice
Hall, 0201180758
Keys, R. G. (1981) Cubic Convolution Interpolation for Digital Image Processing, IEEE
Transaction on Acoustics, Speech, and Signal Processing, Vol. 29, No. 6, pp. 1153-1160,
0096-3518
Salisbury, M., Anderson, C., Lischinski, D., and Salesin, D. H. (1996) Scale-dependent reproduction of pen-and-ink illustration, In Proceedings of SIGGRAPH 96, pp. 461-468, 0-89791-746-4, ACM Press, New York, NY, USA
Li, X., and Orchard, M. T. (2001) New edge-directed interpolation, IEEE Transactions on
Image Processing, Vol. 10, Issue. 10, pp. 1521-1527, 1057-7149
Muresan, D. D., and Parks, T. W. (2004) Adaptively quadratic image interpolation, IEEE
Transaction on Image Processing, Vol. 13, Issue. 5, pp. 690-698, 1057-7149
Johan, H., and Nishita, T. (2004) A Progressive Refinement Approach for Image
Magnification, In Proceedings of the 12th Pacific Conference on Computer Graphics and
Applications, pp. 351-360, 1550-4085
Bruce, G. E. (2002) Sensation and Perception, Sixth edition, Wadsworth Pub Co., 0534639917
Duncan, J. (1975) Selective Attention and the Organization of Visual Information, Journal of Experimental Psychology: General, American Psychological Assn., Vol. 113, pp. 501-517, 0096-3445
278
Bernardino, A. (2004) Binocular Head Control with Foveal Vision: Methods and Applications, Ph.D. thesis in Robot Vision, Dept. of Electrical and Computer Engineering, Instituto Superior Técnico, Portugal
Dowling, J.E. (1987) The Retina: An Approachable Part of the Brain, Belknap Press of Harvard
University Press, Cambridge, MA, 0-674-76680-6
Hildreth, E. C. (1980) A Theory of Edge Detection, Technical Report: AITR-579, Massachusetts
Institute of Technology Cambridge, MA, USA
Schultz, R. R. and Stevenson, R. L. (1994) A Bayesian Approach to Image Expansion for
Improved Definition, IEEE Transaction of Image Processing, Vol. 3, No. 3, pp. 233-242,
1057-7149
Shapiro, J. M. (1993) Embedded Image coding using zerotrees of wavelet coefficients, IEEE Trans. on Signal Processing, Vol. 41, No. 12, pp. 3445-3462, Dec. 1993
The HIPR Image Library, http://homepages.inf.ed.ac.uk/rbf/HIPR2/
The USC-SIPI Image Database, http://sipi.usc.edu/services/database
16
Methods of the Definition Analysis
of Fine Details of Images
S.V. Sai
Pacific National University
Russia
1. Introduction
Definition is one of the most important parameters of color image quality and is determined by the resolution of the brightness and chromaticity channels. System resolution is traditionally measured as a number of television lines, calculated from the maximal spatial frequency at which the threshold contrast of the reproduced image is still provided.
Traditional methods of definition analysis were developed for standard analog color TV systems. A specific kind of distortion in digital vision systems is associated with the restrictions imposed by the particular compression algorithm used for handling static and dynamic images.
Such distortions may lead to an inconsistency between a subjective estimate of the decoded image quality and a programmed estimate based on the standard calculation methods.
Until now, the most reliable way of estimating image quality has been subjective estimation, which assesses the serviceability of a vision system on the basis of visual perception of the decoded image. Procedures of subjective estimation demand a great number of tests and a lot of time; in practice, this method is quite laborious and complicates control, tuning and optimization of the codec parameters.
The most frequently used criterion for the analysis of static image quality, the root-mean-square (RMS) criterion, does not always correspond to the subjective estimation of fine-detail definition, since the human visual system processes an image by local characteristic features rather than averaging it elementwise. In particular, the RMS criterion can give "good" quality estimates even when fine details of low contrast disappear after digital compression.
A number of leading firms offer hardware and software for the objective analysis of dynamic image quality in the MPEG standard (Glasman, 2004): for example, the Tektronix PQA 300 analyzer, the Snell & Wilcox Mosalina software, and the Pixelmetrix DVStation device. The principles of image quality estimation in these devices vary.
For example, the PQA 300 analyzer measures image quality with the Just Noticeable Difference (JND) algorithm developed by the Sarnoff Corporation. The PQA 300 analyzer carries out a series of measurements for each test sequence of images and forms a common PQR estimate on the basis of the JND measurements, which is close to subjective estimates.
For the objective analysis of image quality, Snell & Wilcox offers the PAR (Picture Appraisal Rating) method. PAR technology systems control artifacts created by compression under the MPEG-2 standard. The Pixelmetrix analyzer estimates a series of images and determines the definition and visibility of block structure errors and the PSNR of the brightness and chromaticity signals.
The review of objective methods of measurement shows that high-contrast images are usually used in test tables, while distortions of fine details with low contrast, which are most common after digital compression, are not taken into account.
Thus, at present there is no uniform and reliable technology for estimating the definition of fine image details in digital vision systems.
In this chapter new methods for the definition analysis of image fine details are offered. Mathematical models and criteria of definition estimation in three-dimensional color space are given. A description of test tables for static and dynamic images is presented. The influence of noise on the results of the estimations is investigated. The investigation results and recommendations on high-definition adjustment in vision systems using the JPEG, JPEG-2000 and MPEG-4 algorithms are given.
where α is the object angular size, t is the object presentation time, Co is the object color coordinates, Cb is the background color coordinates, and σ is the root-mean-square value of the noise.
To solve this task it was first necessary to find a metric space in which unit changes of the signals would correspond to thresholds of visual recognition throughout the whole color space, both for static and for dynamic fine details.
One of the most widespread ways of estimating the color difference of large details of static images is the transformation of RGB space into an equal-contrast space, in which the region of dispersion of the color coordinates is transformed from an ellipsoid into a sphere of fixed radius over the whole color space (Krivosheev & Kustarev, 1990).
In this case the threshold size is equal to the minimum perceptible color difference (MPCD) and keeps a constant value independently of the object color coordinates.
The color error in an equal color space, for example in the ICI 1964 W*U*V* system (Wyszecki, 1975), is determined by the length of the radius vector in this coordinate system and is estimated by the number of MPCD:

E = √( (Wo* − W̃o*)² + (Uo* − Ũo*)² + (Vo* − Ṽo*)² )                                  (1)

where Wo*, Uo*, Vo* are the color coordinates of a large object in the test image and W̃o*, Ũo*, Ṽo* are the coordinates of the same object in the decoded image.
E = √( (W1* − W2*)² + (U1* − U2*)² + (V1* − V2*)² )                                  (2)

where W1*, U1*, V1* are the color coordinates of the first object and W2*, U2*, V2* are the color coordinates of the second object.
In contrast to (1), equation (2) determines the color difference between objects of one image, rather than between objects of the images "before" and "after" digital processing.
If one of the objects is the background, the object-background color contrast is determined as follows:

E = √( (Wo* − Wb*)² + (Uo* − Ub*)² + (Vo* − Vb*)² )                                  (3)

or, in difference coordinates,

E = √( (ΔW*)² + (ΔU*)² + (ΔV*)² )                                                    (4)

The object is distinguishable from the background if

E ≥ Eth                                                                              (5)

where Eth = 2-3 MPCD is the threshold contrast, which keeps a constant value within the limits of the whole color space.
Further, we substitute (4) into (5) and convert it to the following:

(ΔW*/Eth)² + (ΔU*/Eth)² + (ΔV*/Eth)² ≥ 1                                             (6)
The contrast sensitivity of human vision decreases as the size of details decreases, and the threshold value Eth becomes dependent on the object size α, both in brightness and in chromaticity. Thus the criterion for the visibility of fine details is defined as

(ΔW*/Wth*(α))² + (ΔU*/Uth*(α))² + (ΔV*/Vth*(α))² ≥ 1                                  (7)

where Wth*, Uth* and Vth* are the threshold values of the brightness and chromaticity indices, which usually depend on the background color coordinates, the time of object presentation and the noise level.
Equation (7) can be written in the following way:

(ΔŴ*)² + (ΔÛ*)² + (ΔV̂*)² ≥ 1                                                          (8)

where ΔŴ* = ΔW*/Wth*, ΔÛ* = ΔU*/Uth* and ΔV̂* = ΔV*/Vth* are the normalized values of the object-background contrast. Provided condition (8) holds, the color difference between object and background is visible to the eye, and hence the fine details are perceptible.
Thus, the transition from the equal color space to the normalized equal color space allows, on the basis of criterion (8), an objective estimate of the color difference of both large and fine details under the preset conditions of color image observation.
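As a sketch, criterion (8) amounts to normalizing each color-difference component by its size-dependent threshold (taken, for example, from Table 1 below) and testing the length of the resulting vector:

    import numpy as np

    def fine_detail_visible(dW, dU, dV, W_th, U_th, V_th):
        # Criterion (8): the object-background contrast, normalized by the
        # size-dependent visual thresholds, must reach unit length.
        n = np.array([dW / W_th, dU / U_th, dV / V_th], dtype=float)
        return float(np.sum(n ** 2)) >= 1.0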
In vision systems where the receiver of the decoded images is an automatic device, so that the properties of human vision are not taken into account, the criterion of fine-detail difference can be written directly in the three-dimensional space of the RGB signals:

(ΔR/Kth)² + (ΔG/Kth)² + (ΔB/Kth)² ≥ 1                                                 (9)

where Kth is the threshold contrast value, which depends on the device sensitivity and the noise level at the output of the system.
In order to use criterion (8) in practice it is necessary to determine the numerical values of the fine-detail threshold contrast at which details are visible to the eye, depending on the size of the details, for the given observation conditions.
To solve this task it was required:
1. To develop a synthesis algorithm for a test image consisting of small static and dynamic objects with contrast regulated in MPCD values.
2. To develop an experimental procedure and, on the basis of subjective estimations, to determine the threshold values of fine-detail contrast.
At the first stage, the form, sizes, spatial position and color coordinates (W*, U*, V*) of the objects and of the background are set for the basic (first) frame of the test sequence. Motion vectors are set for the subsequent frames.
At the second stage, the transformation {W*_{m,i,j}, U*_{m,i,j}, V*_{m,i,j}} → {R_{m,i,j}, G_{m,i,j}, B_{m,i,j}}, which is necessary for visualizing the initial sequence on the screen and for feeding digital RGB signals to the input of the system under study, is carried out for each frame of the test sequence on the basis of the mathematical model that has been developed; here m is the frame number and i, j are the pixel indices in the image columns and lines.
At the third stage, cyclic regeneration of the M frames with the set frame rate (f_frame) is carried out. When the test sequence is reproduced, the dynamic objects move along the preset trajectory by the number of pixels determined by the motion vector.
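A minimal sketch of how such a sequence might be generated is given below; the conversion function wuv_to_rgb stands in for the mathematical model mentioned above, and the object shapes, colours and motion vectors are placeholders rather than the actual test-table content:

```python
import numpy as np

def synthesize_sequence(background_wuv, objects, M, wuv_to_rgb):
    """objects: list of dicts with keys 'rect' (x, y, w, h), 'wuv' colour and
    'motion' (dx, dy) in pixels per frame.  Returns M RGB frames."""
    H, W = 288, 360                       # test-table size used in the text (360x288)
    frames = []
    for m in range(M):
        wuv = np.empty((H, W, 3), dtype=float)
        wuv[:] = background_wuv           # grey unpainted background
        for obj in objects:
            x, y, w, h = obj['rect']
            dx, dy = obj['motion']
            x, y = x + m * dx, y + m * dy     # move along the preset trajectory
            wuv[y:y + h, x:x + w] = obj['wuv']
        frames.append(wuv_to_rgb(wuv))    # stage 2: W*U*V* -> RGB for display/coding
    return frames
```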
On the basis of the algorithm described above, a test table and test video sequences were developed which include all the elements necessary for the quality analysis of fine details in static and dynamic images.
Let us consider the basic characteristics of the test table developed for the quality analysis of static images.
The table is an image of CIF format (360×288) broken into 6 identical fragments (120×144). Each fragment of the table contains the following objects: a) horizontal, vertical and inclined lines with stripe widths of 1, 2, 3 and more than 3 pixels; b) single small details of rectangular form. The objects are located on a grey unpainted background.
The object–background chromaticity contrast ΔU* = 3·(U*_o − U*_b) is set at ΔW* = 0 and ΔV* = 0. The object–background chromaticity contrast ΔV* is set by the MPCD number for the 5-th and 6-th fragments.
1. A test image with adjustable color contrast, with a step of 1 MPCD on the W* axis and a step of 2 MPCD on the U* and V* axes, was offered to the observer.
2. During the experiment the observer increased the contrast value, beginning with the minimal one, until the stripes became distinct.
3. As an estimation criterion of the threshold contrast the following condition was set: the stripes should be distinguishable by eye in comparison with the previous image, i.e. the image whose contrast was one step lower.
4. Under condition 3 the observer fixed the contrast value at which, in his opinion, sufficient "perceptibility" of the lines was provided.
Students and employees of Khabarovsk State Technical University (Pacific National University) participated in the experiments.
ε (pixels)        1     2     3     >3
ΔW*_th (MPCD)     6     –     –     –
ΔU*_th (MPCD)     72    48    34    26
ΔV*_th (MPCD)     76    52    36    24
Table 1 shows the subjective average estimations of the threshold contrast as a function of the size ε of the objects, for a background brightness of W*_b = 80 MPCD; the arithmetic-mean values were obtained from the estimation results of 20 observers.
In the table the size of the objects is given in pixels and the threshold values in MPCD. For example, for lines of minimal size (ε = 1) the average visual threshold on the brightness index is 6 MPCD, while on the chromaticity indexes it is 72 and 76 MPCD.
The results of the experiments show that the threshold contrast values on an unpainted background are approximately identical on the U* and V* axes and exceed the thresholds on the W* axis by a factor of 10–13. A change of the background brightness from 70 to 90 MPCD does not essentially influence the thresholds of visual perception of fine details.
Experimental estimations of the color thresholds in the L*u*v* system show that the estimations on the chromaticity coordinates u* and v* differ by a factor of 1.5–1.8. Therefore the use of the W*U*V* system is preferable.
The threshold contrast values for mobile details of the test sequence were obtained experimentally with the help of subjective estimations by the following technique.
During the experiment the observer increased the contrast value, beginning with the minimal one, until the mobile objects became distinct.
The results of the experiments show that, when the objects move, the threshold contrast values increase in comparison with the data of Table 1, depending on Δt, according to the function f(Δt) = 1/(1 − e^(−Δt/τ)), where τ = 0.05 s is the time constant of vision inertia and Δt is the time interval during which the object moves by the number of pixels set by the motion vector.
In particular, at Δt = 0.033 s (f_frame = 30 Hz) the threshold contrast values of fine details increase approximately by a factor of 1.8–2.
Thus, the experimental data obtained allow criterion (8) to be used in practice as an objective estimation of the transfer accuracy of both static and dynamic fine details of the test image.
At the third stage, after decoding, the test sequence is restored and the R̃_{m,i,j}, G̃_{m,i,j}, B̃_{m,i,j} signals are transformed into W̃*_{m,i,j}, Ũ*_{m,i,j}, Ṽ*_{m,i,j} signals for each frame. All 12 frames of the restored sequence are written into the RAM of the analyzer.
At the fourth stage, the contrast and the distortions of fine details are measured over local fragments of the restored image, and the definition estimate is obtained by the objective criteria.
Let us consider the method of measuring the stripes contrast of the first image frame.
To estimate the definition impairment it is necessary to measure, for each fragment of the decoded image, the contrast of the stripes of a fixed size and to compare the obtained value with the threshold value. We assume that the stripes are distinguished by the observer if the following condition is satisfied:

Ẽ(ε,k) = (ΔW̃*(ε,k)/ΔW*_th(ε))² + (ΔŨ*(ε,k)/ΔU*_th(ε))² + (ΔṼ*(ε,k)/ΔV*_th(ε))² ≥ 1,   (9)
where Ẽ(ε,k) is the normalized value of the stripes contrast averaged over the k-th "window" area of the image; ΔW̃*, ΔŨ* and ΔṼ* are the average contrast values on the brightness and chromaticity indexes; k is the parameter determining the type of the "window" under analysis (k = 0 – vertical stripes, k = 1 – horizontal, k = 2 – sloping); ΔW*_th(ε), ΔU*_th(ε) and ΔV*_th(ε) are the threshold values for the given stripe size. In addition, the contrast is checked separately on each color coordinate:

E_ΔW*(ε) = ΔW̃*(ε)/ΔW*_th(ε) ≥ 1,   E_ΔU*(ε) = ΔŨ*(ε)/ΔU*_th(ε) ≥ 1,   E_ΔV*(ε) = ΔṼ*(ε)/ΔV*_th(ε) ≥ 1,   (10)
When making the calculations, the minimal contrast value among the three (k) "windows" under analysis is chosen for each color coordinate, which allows the influence of the spatial orientation of the lines on the decoding accuracy to be taken into account.
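By way of illustration, the window-averaged contrasts and the checks of criteria (10) could be computed as in the following sketch (how the three "windows" are extracted from the decoded frame, and the array names, are assumptions):

```python
import numpy as np

def stripe_contrast_ok(windows_W, windows_U, windows_V, W_th, U_th, V_th):
    """windows_*: lists of three 2-D arrays (k = 0 vertical, 1 horizontal,
    2 sloping stripes) holding the decoded contrast values of one fragment.
    For each colour coordinate the minimal window-averaged contrast over the
    three orientations is compared with the threshold for this stripe size."""
    def min_avg(windows):
        return min(float(np.mean(np.abs(w))) for w in windows)

    E_W = min_avg(windows_W) / W_th
    E_U = min_avg(windows_U) / U_th
    E_V = min_avg(windows_V) / V_th
    # Criteria (10): all three normalised contrasts must reach 1
    return (E_W >= 1.0) and (E_U >= 1.0) and (E_V >= 1.0)
```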
Figure 2 shows an example of the spatial position of the image-fragment "windows" under analysis on the brightness index, with a contrast ΔW* = 18 MPCD, which is three times the threshold value for the finest details (ε = 1).
If there are no distortions, the average contrast values on the brightness and chromaticity indexes are equal to the initial values. In this case the contrast of all the analysed "windows" of the test image exceeds the threshold values three times and, hence, the definition does not become worse.
If there are distortions, the average contrast value of the analysed "windows" of the test image decreases. If the contrast on the brightness or chromaticity index becomes less than the threshold value, i.e. conditions (10) are not satisfied, the conclusion is made that the observer does not distinguish the fine details.
Finally, the minimal size of stripes whose contrast satisfies criteria (10) makes it possible to determine the maximum number of distinct elements of the image, which constitutes the image definition estimate on brightness and chromaticity.
It is obvious that the estimation by criteria (10) depends on the initial image contrast. In particular, a decrease of the stripes contrast by 1–2 thresholds gives "bad" results when a test image with low contrast is used, whereas if the initial contrast exceeds the threshold values 10 times, no definition impairment is observed for the same contrast decrease.
Thus, criterion (8) gives an objective estimation of the definition impairment of fine details of a low-contrast image.
To exclude the influence of the initial contrast on the uncertainty of the estimations, the following equation is used for the brightness index:

ΔW̄*(ε) = [1/(ΔW*_th(ε)·N)] · Σ_{i=0}^{N−1} |ΔW*(ε) − ΔW̃*_i(ε)| < Q_ΔW*,   (11)

where ΔW̄*(ε) is the threshold-normalized deviation of the contrast from its initial value.
Similar conditions are written for the chromaticity indexes; together with (11) they constitute criteria (12).
The experimental results of image quality analysis in different compression systems show that, when these criteria are met, the reduction of the visual sharpness of fine details is only barely visible or almost imperceptible.
The developed method differs from the known ones in that the contrast of fine details at the output of a system is estimated by the threshold-normalized average value over the "window" area of the analysed stripes image, and not by the amplitude of the first harmonic of the brightness and chromaticity signals.
The initial object–background contrast is also set not at the maximal value but at two to three times the threshold value, which allows the effectiveness of the coding system to be estimated in the near-threshold region where the distortions are most essential.
Thus, the offered method allows the reduction of visual sharpness to be estimated objectively, since it takes into account the thresholds of visual perception of fine details and the possible fluctuations of the color coordinates caused by linear distortions of the signals and by the noise present in a digital system.
In digital image coding systems using nonlinear transformations, not only a linear reduction of the high-frequency components of the decoded RGB signals is possible, but nonlinear distortions may also occur. Therefore, in some cases, the estimation of the contrast reduction by criteria (12) can lead to incorrect results.
To take the influence of nonlinear distortions on the objectivity of the estimations into account, the following solution is offered. In addition to estimations (12), the normalized average deviation of the reproduced color coordinates relative to the initial ones within the image window is estimated, for example on brightness:
δW*(ε) = [1/(ΔW*_th(ε)·N)] · Σ_{i=0}^{N−1} |W̃*_i(ε) − W*_i(ε)|.   (13)
It is shown in (Sai, 2003) that, in order to provide high-quality reproduction of fine details with a rating of not less than 6–7 points, it is necessary to satisfy the following conditions in addition to criteria (12): a) the root-mean-square deviation of the brightness coordinates in all the analysed "windows" must not exceed 30 %; b) the root-mean-square deviation of the chromaticity coordinates must not exceed 50 % for details not less than three pixels in size.   (14)
Let us now consider the method of estimating the distortions of fine single details of rectangular form.
For a fragment of the test image, for example on brightness, the normalized average deviation of the object contrast from its initial value is found over the object area:

δ_oW*(ε) = [1/(ΔW*_th(ε)·N)] · Σ_{i=0}^{N−1} |ΔW̃*_i(ε) − ΔW*(ε)|.   (15)

Unlike (11), the number N is determined by the image window that contains a single object. For example, for the analysis of the distortions of a point object the window size is 1×1 pixels; for an object of 2×2 pixels the window size is 2×2, etc.
It follows from the experiments that, in order to ensure high-quality reproduction of fine details with a rating of not less than 6–7 points, it is necessary to satisfy the following conditions: a) the root-mean-square deviation on brightness must not exceed 1.5 for all details; b) the root-mean-square deviation on chromaticity must not exceed 0.8 for details of 3 or more pixels in size.   (16)
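The common computational core of parameters (11), (13) and (15) is an average deviation normalised to the visual threshold; a minimal helper might look like this (whether the original uses absolute or signed deviations is an assumption of the sketch):

```python
import numpy as np

def normalized_deviation(reference, restored, threshold):
    """Average deviation of the restored values from the reference values,
    normalised to the visual threshold for the given detail size -- the
    general form behind parameters (11), (13) and (15)."""
    reference = np.asarray(reference, dtype=float)
    restored = np.asarray(restored, dtype=float)
    return float(np.mean(np.abs(reference - restored))) / threshold
```

With such a helper, (11) would compare the initial and decoded stripe contrasts, (13) the reproduced and initial colour coordinates, and (15) the decoded and initial contrast of a single object over its window.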
Thus, a program analyzer can estimate the visual quality of the reproduction of stripes and fine details of the test image by criteria (12), (14) and (16).
Table 1 shows the experimental dependence of parameters (11), (13) and (15) on the quality rating. The results were obtained after JPEG compression of the image in Adobe Photoshop 5 using a ten-point quality scale, for a test image whose fine-detail contrast exceeds the threshold values two times. Thus, according to the table, it is possible to estimate the quality rating for each of the six parameters.
The average quality rating of each frame of the test sequence is calculated as follows:
Q_m = (1/6) · Σ_{i=1}^{6} Q_i .

Quality rating Q (from Low to Maximum):
                          1      2      3      4      5      6      7      8      9      10
ΔW̄*  (11), brightness    1.006  0.966  0.948  0.690  0.498  0.225  0.099  0.071  0.012  0.013
δW*  (13), brightness    0.690  0.700  0.686  0.627  0.519  0.338  0.240  0.145  0.083  0.015
δ_oW* (15), brightness   2.039  2.039  1.722  1.617  1.512  1.409  0.998  0.295  0.097  0.001
ΔŪ*  (11), chromaticity  1.528  1.617  1.569  1.073  0.772  0.557  0.391  0.241  0.009  0.002
δU*  (13), chromaticity  0.960  0.955  0.917  0.688  0.505  0.432  0.331  0.238  0.143  0.053
δ_oU* (15), chromaticity 1.124  1.070  1.024  1.143  0.456  0.460  0.477  0.299  0.124  0.047
(detail sizes: ε = 1 and ε = 3)

For the dynamic test sequence the reduction of the stripes contrast on brightness is averaged over the M frames:

ΔW̄*(ε) = (1/M) · Σ_{m=1}^{M} [1/(ΔW*_th(ε)·f(Δt)·N)] · Σ_{i=0}^{N−1} |ΔW*(ε) − ΔW̃*_i(ε,m)|,   (17)
where M = 12 is the number of frames and f(Δt) is the function taking into account the decrease of the contrast-sensitivity characteristic of vision depending on the object presentation time.
The reduction of the stripes contrast on chromaticity is calculated similarly.
After the calculations (17) have been made, conditions (12) are checked. If (14) is satisfied on brightness and chromaticity, the decision is made that the observer distinguishes fine mobile details and that the definition reduction is only slightly visible.
For the estimation of parameters (13) and (15), average values over the 12 frames of the test sequence are calculated by analogy with equation (17).
{R_{m,i,j}, G_{m,i,j}, B_{m,i,j}} → {W*_{m,i,j}, U*_{m,i,j}, V*_{m,i,j}}.
Hence, the values of the equal-contrast coordinates become random variables with root-mean-square deviations σ_W*, σ_U*, σ_V*.
The dispersions of the W*, U* and V* coordinates are obtained with the help of a linearization method (Ventzel & Ovtharov, 2000) applied to the functions W* = 25·Y^(1/3) − 17, U* = 13·W*·(u − u₀) and V* = 13·W*·(v − v₀).
Let us define the dispersion of the brightness index W* = 25·Y^(1/3) − 17. The linearization of this function is determined by the linear transformation

σ²_W* = (∂W*/∂Y)² · σ²_Y = (25/(3·Y^(2/3)))² · σ²_Y ,

therefore σ_W* = 25·σ_Y/(3·Y^(2/3)).
Let us define the dispersion of the chromaticity index U* with the help of the linearization method applied to the function U* = 13·W*·(u − u₀):

σ²_U* = (∂U*/∂R)²·σ²_R + (∂U*/∂G)²·σ²_G + (∂U*/∂B)²·σ²_B ,

where

∂U*/∂R = 13·[ 25·a₄·(u − u₀)/(3·Y^(2/3)) + (a₁·T − b₁·U)·W*/T² ];
∂U*/∂G = 13·[ 25·a₅·(u − u₀)/(3·Y^(2/3)) + (a₂·T − b₂·U)·W*/T² ];
∂U*/∂B = 13·[ 25·a₆·(u − u₀)/(3·Y^(2/3)) + (a₃·T − b₃·U)·W*/T² ],

where T = U + V + W; u = U/T; v = V/T; U = a₁R + a₂G + a₃B; V = a₄R + a₅G + a₆B; W = a₇R + a₈G + a₉B; a₁ … a₉ are constants, and b₁ = a₁ + a₄ + a₇; b₂ = a₂ + a₅ + a₈; b₃ = a₃ + a₆ + a₉.
σ_ΔW* ≈ √(σ²_{W*o} + σ²_{W*b}), at ΔU* = 0 and ΔV* = 0;
σ_ΔU* ≈ √(σ²_{U*o} + σ²_{U*b}), at ΔW* = 0 and ΔV* = 0;
σ_ΔV* ≈ √(σ²_{V*o} + σ²_{V*b}), at ΔW* = 0 and ΔU* = 0,

where the subscripts o and b refer to the object and background coordinates.
Let us define the criterion at which the observer distinguishes the image stripes in the presence of noise.
The well-known three-sigma rule is used to solve this task: the probability that a random variable X deviates from its mean value by three sigma or more, provided the distribution law is close to normal, does not exceed 1/9.
The criterion at which the observer distinguishes the stripes in the noisy test image on brightness is found in the following way:
ΔW*(ε) − 3·σ_ΔW*·ψ(ε) ≥ ΔW*_th(ε).   (18)
Experimental research has shown that the maximal value of the weight function (ψ(ε) = 1) corresponds to the minimal stripe size (ε = 1), and that the weight-function values decrease as the stripe size increases, proportionally to the reduction of the threshold values (Table 1).
The numerical solution of the developed mathematical model allows the influence of additive noise on the definition reduction to be estimated as a function of the root-mean-square noise values in the RGB signals at the system output.
The dependences of the root-mean-square deviations of the color coordinates on brightness and chromaticity on σ (σ_R ≈ σ_G ≈ σ_B) are shown in Table 2, provided that the noise levels in the R, G and B signals are approximately identical. The value σ is given as a percentage of the maximal amplitude of the R, G and B signals.
The object color coordinates of the test image with a contrast equal to the threshold value for details of minimal size are used in the calculations, i.e. W*_b = 80 MPCD and ΔW* = 6 MPCD.
σ, %      0.2   0.4   0.6   0.8   1.0   1.2   1.4   1.6   1.8   2.0
σ_ΔW*     0.31  0.61  0.92  1.22  1.53  1.84  2.14  2.45  2.76  3.06
σ_ΔU*     1.20  2.40  3.60  4.81  6.01  7.19  8.43  9.61  10.8  12.1
σ_ΔV*     2.10  4.20  6.30  8.41  10.5  12.6  14.8  16.8  18.9  21.0

Table 2. Root-mean-square deviations of the color coordinates (MPCD) versus the relative noise level σ of the RGB signals
The dispersion of the contrast is estimated over the analysed window as

σ²_ΔW* = (1/N²) · Σ_{i=1}^{N} Σ_{j=1}^{N} (ΔW*_{i,j} − ΔW̄*)².
7. Practical Results
The developed methods are used in practice for the analysis and adjustment of video system parameters in order to obtain high-quality transfer and reproduction of fine image details.
The results of the analysis are given below, together with recommendations on adjusting vision systems that use the JPEG, JPEG2000 and MPEG-4 algorithms for high definition.
The experimental analysis of coding quality of static images is carried out by the following
technique.
At the first stage the influence of the coder parameters on the decoded image quality of the
test table is analyzed with the help of a computer analyzer.
The computer analyzer calculated the following dependences of image quality parameters
on the coder adjustment parameters: a) reduction of stripes contrast (11) on brightness and
chromaticity; b) average deviation (13) of stripes color coordinates; c) average deviation (15)
of a single object contrast.
At the second stage, the coder parameter for which the results of the analysis correspond to a high quality rating (Q = 6–7) is selected.
At the third stage, the efficiency of the digital compression of test images is estimated. Original test photo images containing 50–70 percent of thin structural elements were used in the experiment.
Quality analysis of JPEG and JPEG2000 images
The results of quality analysis of JPEG and JPEG2000 images coded in Adobe Photoshop CS
are given below. Table 3 shows the reproduction quality parameters of the stripes and fine details of the test image on brightness for low, medium and high quality ratings. The column (Var) gives the values of the image-quality adjustment parameter used in Adobe Photoshop CS.
Fragments of the test image with various quality ratings are shown in Figure 3.
          Q     ΔW̄* (11)   δW* (13)   δ_oW* (15)   Var
JPEG      2     0.97       0.70       2.04         4
          4     0.69       0.63       1.62         7
          7     0.10       0.24       0.99         9
JPEG2000  2     1.09       0.69       1.72         25
          4     0.93       0.57       1.62         30
          7     0.19       0.13       0.31         65

Table 3. Reproduction quality parameters of stripes and fine details on brightness for JPEG and JPEG2000 coding in Adobe Photoshop CS
Compression coefficients (Cf) obtained for the test photo images "Lena" and "Barbara":

                                     Lena   Barbara
Adobe Photoshop CS, JPEG (9)         6.4    3.1
Adobe Photoshop CS, JPEG2000 (65)    8.5    7.0
ACDSee 8, JPEG (80)                  12     13
ACDSee 8, JPEG2000 (30)              27     30
Figure 3. Fragments of the test image after JPEG and JPEG2000 coding at quality ratings Q = 7, Q = 4 and Q = 2
Decoded test frames at Var = 90 %, 70 % and 50 % (quality from Q = High to Q = Low) are shown in the corresponding figure. The measured results are:
Var = 90 % (Q = High): 6.8; 29.6.  Var = 70 % (Q = Medium): 4.3; 62.6.  Var = 50 % (Q = Low): 2.7; 123.5.
8. Conclusion
The developed objective methods for the definition analysis of fine image details have proved effective in practice and can be used for the adjustment and optimization of codec parameters for high visual sharpness in various vision systems.
The main distinctive features of the developed methods should be noted in summary.
1. The visual sharpness analysis is carried out on a test image with low contrast, the initial contrast of the fine details being set to a number of minimal color vision thresholds two to three times exceeding the average threshold values.
2. The reduction in visual sharpness is estimated by the average deviation of the fine-detail contrast of the test image from its initial value, normalized to the vision thresholds, in the three-dimensional equal-contrast color space.
3. Image noise is estimated by the root-mean-square deviations of the W*, U* and V* color coordinates, which are calculated in terms of the number of minimal color vision thresholds.
Providing high-quality reproduction of fine image details is a task of paramount importance in the design of vision systems for various applications. The author hopes that the methods offered in this work will help designers of vision systems to solve this task more effectively.
9. References
Glasman, K. (2004). MPEG-2 and Measurements. 625, No. 1, pp. 34-46, ISSN 0869-7914.
Krivosheev, M.I. & Kustarev, A.K. (1990). Color Measurements. Energoatom, Moscow, ISBN 5-283-00545-3.
Mac Adam, D.L. (1974). Uniform Color Scales. JOSA, Vol. 64, pp. 1691-1702.
Novakovsky, S.V. (1988). Color in Color TV. Radio and Communication, Moscow, ISBN 5-256-00090-X.
Pratt, W.K. (2001). Digital Image Processing. Wiley, ISBN 0471374075.
Sai, S.V. (2002). Definition Analysis of Color Static Images in Equal Contrast Space. Digital Signals Processing, No. 1, pp. 6-9, ISSN 1684-2634.
Sai, S.V. (2003). The Quality of Transmission and Reproduction of Fine Details in Color Television Images. Dalnauka, Vladivostok, ISBN 5-8044-0345-1.
Sai, S.V. (2006). Quality Analysis of MPEG-4 Video Images. Pattern Recognition and Image Analysis, Vol. 16, No. 1, pp. 50-51, ISSN 1054-6618.
Ventzel, E.S. & Ovtharov, L.A. (2000). Probability Theory and its Engineering Application. Higher School, Moscow, ISBN 5-06-003830-0.
Wyszecki, G. (1975). Uniform Color Scales: CIE 1964 U*V*W* Conversion of OSA Committee Selection. JOSA, Vol. 65, pp. 456-460.
17
A Practical Toolbox for Calibrating
Omnidirectional Cameras
Davide Scaramuzza and Roland Siegwart
Swiss Federal Institute of Technology, Zurich
Switzerland
1. Introduction
An omnidirectional camera is a vision system providing a 360° panoramic view of the scene. Such an enhanced field of view can be achieved either by using catadioptric systems, which opportunely combine mirrors and conventional cameras, or by employing purely dioptric fish-eye lenses. Omnidirectional cameras can be classified into two classes, central and non-central, depending on whether they satisfy the single effective viewpoint property or not (Baker & Nayar, 1998). As noted in (Svoboda et al., 1997), it is highly desirable that
such imaging systems have a single effective viewpoint. When this property is verified,
there exists a single center of projection, that is, every pixel in the sensed images measures
the irradiance of the light passing through the same viewpoint in one particular direction.
The reason a single viewpoint is so desirable is that it allows the user to generate
geometrically correct perspective images from the pictures captured by an omnidirectional
camera. Moreover, it allows applying the known theory of epipolar geometry, which easily
allows the user to perform ego-motion estimation and structure from motion from image
correspondences only.
As shown in (Baker & Nayar, 1998), central catadioptric systems can be built by combining
an orthographic camera with a parabolic mirror, or a perspective camera with a hyperbolic
or elliptical mirror. Conversely, panoramic cameras using fish-eye lenses cannot in general
be considered central systems, but the single viewpoint property holds approximately true
for some camera models (Micusik & Pajdla, 2003).
In this chapter, we focus on calibration of central omnidirectional cameras, both dioptric and
catadioptric. After outlining previous works on omnidirectional camera calibration, we
describe our novel procedure and provide a practical Matlab Toolbox, which allows any
inexpert user to easily calibrate his own camera.
Accurate calibration of a vision system is necessary for any computer vision task requiring
extracting metric information of the environment from 2D images, like in ego-motion
estimation and structure from motion. While a number of calibration methods has been
developed for standard perspective cameras (Zhang, 2000), little work on omnidirectional
cameras has been done. The first part of this chapter will present a short overview about
previous methods for calibration of omnidirectional cameras. In particular, their limitations
will be pointed out. The second part of this chapter will present our calibration technique
whose performance is evaluated through calibration experiments. Then, we will present our
Matlab toolbox (that is freely available on-line), which implements the proposed calibration
procedure. We will also describe features and use of our toolbox.
2. Related Work
Previous works on omnidirectional camera calibration can be classified into two different
categories. The first one includes methods which exploit prior knowledge about the scene,
such as the presence of calibration patterns (Cauchois et al., 2000; Bakstein & Pajdla, 2002) or
plumb lines (Geyer & Daniilidis, 2002). The second group covers techniques that do not use
this knowledge. The latter includes calibration methods from pure rotation or planar motion
of the camera (Gluckman & Nayar, 1998), and self-calibration procedures, which are
performed from point correspondences and epipolar constraint through minimizing an
objective function (Kang, 2000; Micusik & Pajdla, 2003).
All mentioned techniques allow obtaining accurate calibration results, but primarily focus
on particular sensor types (e.g. hyperbolic and parabolic mirrors or fish-eye lenses).
Moreover, some of them require special setting of the scene and expensive equipment
(Bakstein & Pajdla, 2002; Gluckman & Nayar, 1998). For instance, in (Bakstein & Pajdla,
2002), a fish-eye lens with a 183° field of view is used as an omnidirectional sensor. Then,
the calibration is performed by using a half-cylindrical calibration pattern perpendicular to
the camera sensor, which rotates on a turntable.
In (Geyer & Daniilidis, 2002; Kang, 2000), the authors treat the case of a parabolic mirror. In
(Geyer & Daniilidis, 2002), it is shown that vanishing points lie on a conic section which
encodes the entire calibration information. Thus, the projections of two sets of parallel lines
suffice for the intrinsic camera calibration. However, this property does not apply to non-parabolic mirrors. Therefore, the proposed technique cannot be easily generalized to other
kinds of sensors.
In contrast with the techniques mentioned so far, the methods described in (Kang, 2000;
Micusik & Pajdla, 2003; Micusik et al., 2004) fall in the self-calibration category. These
methods require no calibration pattern, nor a priori knowledge about the scene. The only
assumption is the capability to automatically find point correspondences in a set of
panoramic images of the same scene. Then, calibration is directly performed by epipolar
geometry by minimizing an objective function. In (Kang, 2000), this is done by employing a
parabolic mirror, while in (Micusik & Pajdla, 2003; Micusik et al., 2004) a fish-eye lens with a
view angle greater than 180° is used. However, besides focusing on particular sensor types, the mentioned self-calibration techniques may suffer in case of tracking difficulties and of a small number of feature points (Bougnoux, 1998).
The calibration methods described so far focus on particular sensor types, such as parabolic
and hyperbolic mirrors or fish-eye lenses. In contrast with these methods, in the last years,
novel calibration techniques have been developed, which apply to any central
omnidirectional camera. For instance, in (Micusik & Pajdla, 2004), the authors extend the
geometric distortion model and the self-calibration procedure described in (Micusik &
Pajdla, 2003), including mirrors, fish-eye lenses, and non-central cameras. In (Ying & Hu,
2004; Barreto & Araujo, 2005), the authors describe a method for central catadioptric
cameras using geometric invariants. They show that any central catadioptric system can be
fully calibrated from an image of three or more lines.
The work described in this chapter also handles with calibration of any central
omnidirectional camera but aims at providing a technique that is very easy to apply also for
the inexpert user. Indeed, our technique requires the use of a chessboard-like pattern that is
shown by the user at a few different positions and orientations. Then, the user is only asked
to click on the corner points of the images of the pattern.
The strong point of our technique resides in the use of a new camera model that adapts
according to the appearance of the pattern in the omnidirectional images. The peculiarity of
this model is that it can also handle the cases where the single effective viewpoint property
is not perfectly satisfied. Indeed, although several omnidirectional cameras directly manufactured to have this property exist, for a catadioptric system this requires accurately aligning the camera and mirror axes. In addition, the focus point of the mirror has to coincide with the optical center of the camera. Since it is very difficult to avoid camera–mirror misalignments, an incorrectly aligned catadioptric sensor can lead to only a quasi single-viewpoint system (Swaminathan & Grossberg, 2001).
The method described in this chapter was first introduced in (Scaramuzza et al., 2006). In
that work, we proposed a generalized parametric model of the sensor, which is suitable to
different kinds of omnidirectional vision systems, both catadioptric and dioptric. In that
model, we assume that the imaging function, which manages the projection of a 3D real
point onto a pixel of the image plane, can be described by a Taylor series expansion whose
coefficients are the parameters to be calibrated.
In this chapter, we will first summarize the generalized camera model (section 3) and the
calibration method introduced in our previous work (section 4). Then, in section 5, we will
introduce our Matlab Toolbox (named OcamCalib Toolbox). There, we will outline the
features of the toolbox, with particular regard to the automatic detection of the center of the
omnidirectional camera. Indeed, in previous works, the detection of the center is performed
by exploiting the visibility of the circular external boundary of the mirror. In those works,
the mirror boundary is first enhanced by using an edge detector, and then, a circle is fitted
to the edge points to identify the location of the center. In our approach, we no longer need
the visibility of the mirror boundary. The algorithm described in this chapter is based on an
iterative procedure that uses only the points selected by the user.
In section 6, the performance of our toolbox will be evaluated through calibration
experiments.
incorporates the digitizing process and small axes misalignments; thus u'' = A·u' + t, where A ∈ R^(2×2) and t ∈ R^(2×1).
At this point, we can introduce the imaging function g, which captures the relationship between a point u'' in the sensor plane and the vector p emanating from the viewpoint O to a scene point X (see Figure 1.a). By doing so, the relation between a pixel point u and a scene point X is:

λ·p = λ·g(u'') = λ·g(A·u' + t) = P·X,   λ > 0,   (1)

where g(u'', v'') = (u'', v'', f(u'', v''))ᵀ.   (2)
Furthermore, we assume that the function f depends on u'' and v'' only through ρ'' = √(u''² + v''²). This hypothesis corresponds to assuming that the function g is rotationally symmetric with respect to the sensor axis.
Figure 1. (a) Coordinate system in the catadioptric case. (b) Sensor plane, in metric coordinates. (c) Camera image plane, expressed in pixel coordinates. (b) and (c) are related by an affine transformation
Function f can have various forms depending on the mirror or the lens construction. These
functions can be found in (Kumler & Bauer, 2000), (Micusik et al., 2004), and (Svoboda &
Pajdla, 2002). Rather than using a specific model for the sensor in use, we choose to apply a generalized parametric model of f, which is suitable for different kinds of sensors. The reason for doing so is that we want this model to compensate for any misalignment
between the focus point of the mirror (or the fisheye lens) and the camera optical center.
Furthermore, we desire our generalized function to approximately hold with those sensors
where the single viewpoint property is not exactly verified (e.g. generic fisheye cameras).
We propose the following polynomial form for f
f(u'', v'') = a₀ + a₁·ρ'' + a₂·ρ''² + … + a_N·ρ''^N,   (3)
where the coefficients a_i, i = 0, 1, 2, …, N, and the polynomial degree N are the calibration parameters that we want to determine. This polynomial description of f can be further simplified by observing that all the previous definitions of f satisfy

df/dρ'' |_{ρ''=0} = 0.   (4)
This property holds for hyperbolic and parabolic mirrors as well as for fisheye cameras (see (Kumler & Bauer, 2000), (Micusik et al., 2004), and (Svoboda & Pajdla, 2002)).
This simplification allows us to assume a₁ = 0, and thus (3) can be rewritten as:

f(u'', v'') = a₀ + a₂·ρ''² + … + a_N·ρ''^N.   (5)
Substituting into (1), the projection equation becomes

λ·[ A·u' + t ,  f(u'', v'') ]ᵀ = P·X,   λ > 0.   (6)
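To make the camera model concrete, the sketch below back-projects a sensor-plane point into the 3-D ray direction implied by (5)–(6); the coefficient values are purely illustrative and the code is not part of the toolbox:

```python
import numpy as np

def cam2ray(u, v, a):
    """Back-project a sensor-plane point (u'', v'') to a 3-D ray direction
    (up to the positive factor lambda in (6)) using the polynomial model
    f(rho) = a0 + a2*rho^2 + ... + aN*rho^N, with a[1] kept at zero as in (5)."""
    rho = np.hypot(u, v)
    z = np.polyval(a[::-1], rho)        # a[0] + a[1]*rho + a[2]*rho^2 + ...
    ray = np.array([u, v, z], dtype=float)
    return ray / np.linalg.norm(ray)

# Illustrative coefficients for a hypothetical fisheye/catadioptric sensor
a = np.array([-200.0, 0.0, 1.2e-3, 0.0, 1.0e-9])
print(cam2ray(120.0, -35.0, a))
```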
4. Camera Calibration
4.1 Solving for intrinsic and extrinsic parameters
According to what we have said so far, to calibrate an omnidirectional camera we have to
estimate the parameters A, t, a 0 , a 2 ,..., and a N .
In our approach, we decided to separate the estimation of these parameters into two stages.
In one, we estimate the affine parameters A and t. In the other one, we estimate the
coefficients a 0 , a 2 ,..., and a N .
The parameters A and t describe the affine transformation that relates the sensor plane to the
camera plane (figures 1.b and 1.c). A is the stretch matrix and t is the translation vector
I c O c (figure 1.c). To estimate A and t we introduce a method, which, unlike other previous
works, does not require the visibility of the circular external boundary of the mirror
(sketched by the ellipse in figure 1.c). This method is based on an iterative procedure, which
starts by setting A to the identity matrix Eye and t=0. This assumption means that the
camera plane and the sensor plane initially coincide. The correct elements of A will be
estimated afterwards by non linear refinement, while t will be estimated by an iterative
search algorithm. This approach will be detailed in section 4.3.
According to this, from now on we assume A = Eye and t = 0, which means u'' = u'. Thus, by substituting this relation into (6) and using (5), we obtain the following projection equation:

λ·[ u' ,  v' ,  a₀ + a₂·ρ'² + … + a_N·ρ'^N ]ᵀ = P·X,   λ > 0,   (7)

where u' and v' are now the pixel coordinates of an image point with respect to the image center, and ρ' is its Euclidean distance from the center. Also, observe that now only N parameters (a₀, a₂, …, a_N) need to be estimated. From now on, we will refer to these parameters as intrinsic parameters.
During the calibration procedure, a planar pattern of known geometry is shown at different unknown positions, which are related to the sensor coordinate system by a rotation matrix R ∈ R^(3×3) and a translation vector T ∈ R^(3×1). R and T will be referred to as extrinsic parameters.
Let I^i be an observed image of the calibration pattern, M_ij = [X_ij, Y_ij, Z_ij]ᵀ the 3D coordinates of its points in the pattern coordinate system, and m_ij = [u_ij, v_ij]ᵀ the corresponding pixel coordinates in the image plane. Since we assumed the pattern to be planar, without loss of
generality we have Z_ij = 0. Then, equation (7) becomes:

λ_ij·p_ij = λ_ij·[ u_ij , v_ij , a₀ + a₂·ρ_ij² + … + a_N·ρ_ij^N ]ᵀ = P^i·X_ij = [ r₁^i  r₂^i  r₃^i  T^i ]·[ X_ij , Y_ij , 0 , 1 ]ᵀ = [ r₁^i  r₂^i  T^i ]·[ X_ij , Y_ij , 1 ]ᵀ,   (8)
from which, eliminating the depth factor λ, each calibration point gives the three equations

v_j·( r₃₁·X_j + r₃₂·Y_j + t₃ ) − f(ρ_j)·( r₂₁·X_j + r₂₂·Y_j + t₂ ) = 0,   (9.1)
f(ρ_j)·( r₁₁·X_j + r₁₂·Y_j + t₁ ) − u_j·( r₃₁·X_j + r₃₂·Y_j + t₃ ) = 0,   (9.2)
u_j·( r₂₁·X_j + r₂₂·Y_j + t₂ ) − v_j·( r₁₁·X_j + r₁₂·Y_j + t₁ ) = 0,   (9.3)
where the sub-index i has been removed to lighten the notation, and t 1 ,t 2 and t 3 are the
elements of T.
Observe that in (9), X j , Y j and Z j are known, and so are u j , v j . Also, observe that only (9.3)
is linear in the unknown r11 , r12 , r21 , r22 , t1 , t 2 .
From now on, the details for the resolution of equation (9) can be found in (Scaramuzza et
al., 2006). The principle of the technique consists first in solving for the parameters
r11 , r12 , r21 , r22 , t1 , and t 2 by linearly solving equation (9.3). Next, we use the solution of (9.3) as
input to (9.1) and (9.2), and solve for the remaining parameters a 0 , a 2 ,..., a N and t 3 . In both
steps, the solution is achieved by using linear least-square minimization.
Up to now, we did not specify which polynomial degree N one should use. To compute the best N, we actually start from N = 2. Then, we increase N by unitary steps and compute the average reprojection error of all the calibration points. The procedure stops when a minimum error is found. Typical empirical values for N are N = 3 or N = 4.
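Schematically, this search over the polynomial degree could be written as follows; calibrate and reprojection_error are placeholders for the linear estimation of this section and for the average reprojection error, not toolbox functions:

```python
def find_best_degree(points_2d, points_3d, calibrate, reprojection_error, max_N=6):
    """Increase the polynomial degree N starting from 2 and keep the value
    giving the smallest average reprojection error."""
    best_N, best_err, best_params = None, float('inf'), None
    for N in range(2, max_N + 1):
        params = calibrate(points_2d, points_3d, N)
        err = reprojection_error(params, points_2d, points_3d)
        if err >= best_err:            # error no longer decreases: stop
            break
        best_N, best_err, best_params = N, err, params
    return best_N, best_params, best_err
```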
4.2 Detection of the Image Center
As stated in sections 1 and 2, a peculiarity of our calibration toolbox is that it requires minimal user interaction. One of the tools that accomplishes this task is its capability of identifying the center O_c of the omnidirectional image (figure 1.c) even when the external
boundary of the sensor is not visible in the image.
At the beginning of section 4.1, we made the following assumptions for A and t, namely
A=Eye and t=0. Then, we derived the equations for solving for the intrinsic and extrinsic
parameters that are valid only under those assumptions.
Figure 2. When the position of the center is correct, the 3D points of the checker board project correctly (green rounds) onto the calibration points (red crosses) (a). Conversely, when the position of the center is wrong, the points do not project onto the real calibration points (b)
In figure 2.a, the reader can see what happens when the position of the center is correct. The
red crosses are the input calibration points selected by the user. The green rounds are the 3D
points reprojected onto the images according to the intrinsic and extrinsic parameters
estimated by the calibration. As the reader can see, the 3D points perfectly overlay the input
points, meaning that the calibration worked properly. Figure 2.b shows the result when the
input position of the center is wrong, that is, the reprojection error is large. Motivated by
this observation, we performed many trials of our calibration procedure for different center
locations, and, for each trial, we computed the Sum of Squared Reprojection Errors (SSRE).
As a result, we verified that the SSRE always has a global minimum at the correct center
location.
This result leads us to an exhaustive search of the center O_c, which stops when the difference between two consecutive potential center locations is smaller than a set threshold (we used 0.5 pixels). The algorithm is the following:
1. At each step of this iterative search, a fixed number of candidate center locations is
uniformly selected from a given image region (see figure 3).
2. For each of these points, calibration is performed by using that point as a potential
center location and SSRE is computed.
3. The point providing the minimum SSRE is taken as a potential center.
4. The search proceeds by selecting other candidate locations in the region around that
point, and steps 1, 2 and 3 are repeated until the stop-condition is satisfied.
Observe that the computational cost of this iterative search is so low that it takes less than 3
seconds to stop.
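A possible (illustrative) implementation of this coarse-to-fine search is sketched below; calibrate_with_center is a placeholder that runs the calibration with a candidate centre and returns the SSRE, and the grid size and shrink factor are assumptions:

```python
import numpy as np

def find_center(calibrate_with_center, x_range, y_range, eps=0.5, grid=5):
    """Exhaustive coarse-to-fine search of the image centre Oc.  At every
    iteration a grid of candidate centres is tested; the search region is
    then re-centred on the best candidate and shrunk until it is smaller
    than eps pixels."""
    (x0, x1), (y0, y1) = x_range, y_range
    best = (0.5 * (x0 + x1), 0.5 * (y0 + y1))
    while max(x1 - x0, y1 - y0) > eps:
        xs = np.linspace(x0, x1, grid)
        ys = np.linspace(y0, y1, grid)
        candidates = [(x, y) for x in xs for y in ys]
        best = min(candidates, key=lambda c: calibrate_with_center(c))
        dx, dy = (x1 - x0) / grid, (y1 - y0) / grid
        x0, x1 = best[0] - dx, best[0] + dx      # shrink around the best point
        y0, y1 = best[1] - dy, best[1] + dy
    return best
```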
At this point, the reader might be wondering how we estimate the elements of matrix A. In fact, at the beginning we assumed A = Eye. The iterative algorithm mentioned above exhaustively searches the location of the center (namely O_c) while leaving A unchanged. The reason for doing so is that the eccentricity of the external boundary of an omnidirectional image is usually close to zero, which means A ≈ Eye. Therefore, we chose to estimate A in a second stage by using a non-linear minimization method, which is described in section 4.3.
The non-linear refinement minimizes the sum of squared reprojection errors

E = Σ_i Σ_j ‖ m_ij − m̂( R^i, T^i, A, O_c, a₀, a₂, …, a_N, M_ij ) ‖²,   (10)

where m̂ is the reprojection of the 3D point M_ij according to equation (1). R^i and T^i are the rotation and translation matrices of each plane pose; R^i is parameterized by a vector of 3 parameters related to R^i by the Rodrigues formula. Observe that now we incorporate into the functional both the stretch matrix A and the center of the omnidirectional image O_c.
By minimizing the functional defined in (10), we actually find the calibration parameters which minimize the reprojection error. In order to speed up the convergence, we decided to split the non-linear minimization into two steps. The first one refines the extrinsic parameters, ignoring the intrinsic ones. Then, the second step uses the extrinsic parameters just estimated, and refines the intrinsic ones. By performing many simulations, we found that this splitting does not affect the final result with respect to a global minimization.
To minimize (10), we used the Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), as implemented in the Matlab function lsqnonlin. The algorithm requires an initial guess for the parameters. These initial parameters are the ones obtained using the linear technique described in section 4.1. As a first guess for A, we used the identity matrix, while for O_c we used the position estimated through the iterative procedure explained in subsection 4.2.
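Outside Matlab, the same two-step refinement could be sketched, for example, with SciPy's least_squares; the residual functions are placeholders standing for the reprojection error of (10) restricted to the extrinsic and to the intrinsic parameters respectively:

```python
import numpy as np
from scipy.optimize import least_squares

def refine(extrinsics0, intrinsics0, residual_extrinsic, residual_intrinsic):
    """Two-step non-linear refinement: first the extrinsic parameters are
    refined with the intrinsics frozen, then the intrinsic parameters
    (polynomial coefficients, A and Oc) with the new extrinsics frozen."""
    res1 = least_squares(residual_extrinsic, np.asarray(extrinsics0, float),
                         args=(np.asarray(intrinsics0, float),), method='lm')
    res2 = least_squares(residual_intrinsic, np.asarray(intrinsics0, float),
                         args=(res1.x,), method='lm')
    return res1.x, res2.x
```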
Visual feedback about the quality of the calibration result by reprojecting the 3D points
onto the input images.
Computer-assisted selection of the input points. Indeed, the selection of the corner points on the calibration pattern is assisted by a corner detector.
Figure 5. Some pictures with the checker board used as a calibration grid. In our
experiments, we used at least 5 or more images with the grid shown all around the camera
The user interface of the toolbox is depicted in figure 4. After having collected a few pictures
of a chessboard shown all around the omnidirectional camera (see figure 5), the images can
be loaded for calibration (i.e. use Read names). In the second step, the user can start
selecting the corner points of the pattern using the Extracting grid corners tool. By this
tool, the user is asked to click on all the corner points following the left-to-right order. To achieve high accuracy in the selection of the input points, the clicking is assisted by a Harris-based corner detector (Harris & Stephens, 1988).
In the third step, the calibration can be done by means of two tools. The Calibration tool will ask the user to specify the position of the center in case he knows it; if not, the user can directly use the Find center tool, which automatically applies the iterative search algorithm described in 4.2. In both cases, the calibration is performed using the linear estimation technique mentioned in 4.1. The optimal calibration parameters in the maximum-likelihood sense can then be estimated with the Calibration Refinement tool, which implements the non-linear minimization described in 4.3. After the previous steps, the user can choose among several tools:
Show Extrinsic visualizes the reconstructed 3D poses of the grid in the camera reference frame (figure 6).
Analyze error visualizes the reprojection error of each calibration point along the x and y axes.
Reproject on images reprojects all the 3D points onto the images according to the calibrated parameters.
Figure 6. A picture of our simulator showing several calibration patterns and the virtual
omnidirectional camera at the axis origin
After the calibration, all the parameters can be accessed through the structure ocam_model. The calibrated camera model can then be used in other applications by means of the following two functions:
m = world2cam(M, ocam_model), which reprojects a 3D point (M) onto the image and returns its pixel coordinates (m).
6. Results
We evaluated the performance of our toolbox through calibration experiments both on
synthetic and real images. In particular, we used synthetic images to study the robustness of
our calibration technique in case of inaccuracy in detecting the calibration points. To this
end, we generated several synthetic poses of a calibration pattern. Then, Gaussian noise
with zero mean and standard deviation was added to the projected image points. We
varied the noise level from =0.1 to =3.0 pixels, and, for each noise level, we performed 100
independent calibration trials and computed the mean reprojection error. Figure 7 shows the
plot of the reprojection error as a function of . Observe that we separated the results
obtained by using the linear minimization alone from the results of the non linear
refinement. As the reader can see, in both cases the average error increases linearly with the
noise level. Furthermore, the reprojection error of the non linear estimation keeps always
smaller than the error computed by the linear method. Finally, notice that when =1.0,
which is larger than the normal noise in practical situations, the average reprojection error
of the non linear method is lower than 0.4 pixels.
Figure 7. The reprojection error versus σ. The dashed line represents the results obtained by using the linear minimization alone; the solid line shows the results after the non-linear refinement. Both axes are in pixels
An indirect method to evaluate the quality of the calibration of a real camera consists in
reconstructing the 3D structure of an object from its images and checking then the quality of
the reconstruction. This problem is known by the computer vision community as structure
from motion. The object we used in this experiment is a trihedron made up of three
orthogonal chessboard-like patterns of known geometry (see figure 8.a). Our omnidirectional camera is a KAIDAN 360 One VR with a hyperbolic mirror.
After having calibrated the camera, we took two images of the trihedron from two different
unknown positions (see figure 8.b). Next, several point matches were hand selected from
both views of the object and the Eight Point algorithm was applied (Longuet-Higgins, 1981).
In order to obtain good reconstruction results, more than eight points (we used 135 points)
were used. The method mentioned so far gives a first good 3D reconstruction of the points.
A better estimation of the 3D structure can be obtained by densely using all the pixels of the
images. To accomplish this task, we used the first estimation along with normalized cross
correlation to automatically match all the points of the image pair. Finally, all matches were
used to compute the structure. The results of the reconstruction are shown in figure 8.c.
As the reconstruction with one single camera can be done up to a scale factor, we recovered
the scale factor by comparing the average size of a reconstructed checker with the real size
on the trihedron. In the end, we computed the angles between the three planes fitted to the reconstructed points and found the following values: 94.6°, 86.8° and 85.3°. Moreover, the average distances of the points from the fitted planes were respectively 0.05 cm, 0.75 cm and 0.07 cm. Finally, the size of each checker being 6.0 cm × 6.0 cm, we also calculated the dimension of every reconstructed checker and found an average error of 0.3 cm.
These results comply with the expected orthogonality of the surfaces and the size of the
checkers in the ground truth.
Figure 8. (a) The object to be reconstructed. (b) Two omnidirectional pictures of the object taken from two unknown positions. (c) Dense 3D reconstruction of the object. The reconstruction is very good, meaning that the model of the camera was well estimated
7. Conclusion
In this chapter, we presented a method for calibrating any central omnidirectional camera, both dioptric and catadioptric. The method relies on a generalized parametric function that
describes the relation between a given pixel point and the correspondent 3D vector
emanating from the single effective view point of the camera. We describe this function by
means of a polynomial expansion whose coefficients are the parameters to be calibrated.
Furthermore, we presented a toolbox for Matlab (named OcamCalib) that implements the
mentioned calibration procedure. The toolbox is available on-line. We described the tools
and the main features of our toolbox, one of which being the capability to automatically
identify the center of the omnidirectional image. The toolbox relies on the use of a
chessboard-like calibration pattern that is shown by the user at a few different positions and
orientations. Then, the user is only asked to click on the corner points of the patterns. The
performance of the toolbox was finally evaluated through experiments both on synthetic
and real images. Because of its ease of use, the toolbox turns out to be very practical, and
allows any inexpert user to calibrate his own omnidirectional camera.
8. Acknowledgements
This work was conducted within the EU Integrated Projects COGNIRON (The Cognitive
Robot Companion) and BACS (Bayesian Approach to Cognitive Systems). It was funded
by the European Commission Division FP6-IST Future and Emerging Technologies under
the contracts FP6-IST-002020 and FP6-IST-027140 respectively.
We want to thank Zoran Zivkovic and Olaf Booij, from the Intelligent Systems Laboratory of
Amsterdam (University of Amsterdam), for providing the sample images included in our
toolbox.
Furthermore, we want to thank Dr. Jean-Yves Bouguet, from Intel Corporation, for
providing some functions used by our toolbox.
9. References
Baker, S. & Nayar, S.K. (1998). A theory of catadioptric image formation. Proceedings of the 6th International Conference on Computer Vision, pp. 35-42, ISBN 81-7319-221-9, India, January 1998, IEEE Computer Society, Bombay.
Svoboda, T., Pajdla T. & Hlavac, V. (1997). Central panoramic cameras: Geometry and
design. Research report K335/97/147, Czech Technical University, Faculty of
Electrical Engineering, Center for Machine Perception, Czech Republic, December
1997. Praha.
Micusik, B. & Pajdla, T. (2003). Estimation of omnidirectional camera model from epipolar
geometry. Proceedings of the International Conference on Computer Vision and Pattern
Recognition. ISBN 0-7695-1900-8, US, June 2003, IEEE Computer Society, Madison.
Zhang, Z (2000). A Flexible New Technique for Camera Calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence. Volume 22, No. 11, November 2000, ISSN
0162-8828.
Kumler, J. & Bauer, M. (2000). Fisheye lens designs and their relative performance.
Proceedings of SPIE Conference. Vol. 4093. pp. 360-369. 2000.
Micusik, B., Martinec, D. & Pajdla, T. (2004). 3D Metric Reconstruction from Uncalibrated
Omnidirectional Images. Proceedings of the Asian Conference on Computer Vision.
January 2004, Korea.
Svoboda, T. & Pajdla, T. (2001). Epipolar Geometry for Central Catadioptric Cameras. In: Panoramic Vision: Sensors, Theory and Applications, Benosman, R. & Kang, S.B. (Eds.), pp. 85-114, Springer.
Cauchois, C., Brassart, E., Delahoche, L. & Delhommelle, T. (2000). Reconstruction with the calibrated SYCLOP sensor. Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 1493-1498, ISBN 0-7803-6348-5, Japan, October 2000, IEEE Computer Society, Takamatsu.
Bakstein, H. & Pajdla, T. (2002). Panoramic mosaicing with a 180° field of view lens. Proceedings of the IEEE Workshop on Omnidirectional Vision, pp. 60-67, ISBN 0-7695-1629-7, Denmark, June 2002, IEEE Computer Society, Copenhagen.
Geyer, C. & Daniilidis, K. (2002). Paracatadioptric camera calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 24, No. 5, May 2002, pp. 687-695, ISSN
0162-8828.
Gluckman, J. & Nayar, S. K. (1998). Ego-motion and omnidirectional cameras. Proceedings of
the 6th International Conference on Computer Vision, pp. 999-1005, ISBN 81-7319-221-9,
India, January 1998, IEEE Computer Society, Bombay.
Kang, S.B. (2000). Catadioptric self-calibration. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 201-207, ISBN 0-7695-0662-3, USA, June 2000, IEEE Computer Society, Hilton Head Island.
Micusik, B., & Pajdla, T. (2004). Para-catadioptric Camera Auto-calibration from Epipolar
Geometry. Proceedings of the Asian Conference on Computer Vision. January 2004,
Korea.
Bougnoux, S. (1998). From projective to Euclidean space under any practical situation, a criticism of self-calibration. Proceedings of the 6th International Conference on Computer Vision, pp. 790-796, ISBN 81-7319-221-9, India, January 1998, IEEE Computer Society, Bombay.
Swaminathan, R., Grossberg, M.D. & Nayar, S.K. (2001). Caustics of catadioptric cameras. Proceedings of the IEEE International Conference on Computer Vision, pp. 2-9, ISBN 0-7695-1143-0, Canada, July 2001, IEEE Computer Society, Vancouver.
Ying, X. & Hu, Z. (2004). Catadioptric Camera Calibration Using Geometric Invariants, IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 10, October
2004, pp. 1260-1271, ISSN: 0162-8828.
Ying, X. & Hu, Z. (2004). Can We Consider Central Catadioptric Cameras and Fisheye Cameras within a Unified Imaging Model? Proceedings of the European Conference on Computer Vision, pp. 442-455, Lecture Notes in Computer Science, May 2004, Prague, Czech Republic.
Barreto, J. & Araujo, H. (2005). Geometric Properties of Central Catadioptric Line Images and their Application in Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1327-1333, August 2005.
Sturm, P. & Ramalingam, S. (2004). A Generic Concept for Camera Calibration. Proceedings of the European Conference on Computer Vision, pp. 1-13, Lecture Notes in Computer Science, May 2004, Prague, Czech Republic.
Scaramuzza, D., Martinelli, A. & Siegwart, R. (2006). A Toolbox for Easily Calibrating Omnidirectional Cameras. Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 5695-5701, China, October 2006, Beijing.
Scaramuzza, D., Martinelli, A. & Siegwart, R. (2006). A Flexible Technique for Accurate
Omnidirectional Camera Calibration and Structure from Motion, Proceedings of
IEEE International Conference on Computer Vision Systems, USA, January 2006, New
York.
Levenberg, K. (1944). A Method for the Solution of Certain Problems in Least Squares,
Quarterly of Applied Mathematics, Vol. 2, No. 2, pp. 164-168, July 1944.
Marquardt, D. (1963). An Algorithm for Least-Squares Estimation of Nonlinear Parameters,
SIAM Journal on Applied Mathematics. Vol. 11, No. 2, pp. 431-441, 1963.
Harris, C. & Stephens, M.J. (1988). A combined corner and edge detector, Proceedings of The
Fourth Alvey Vision Conference, pp. 147-151, UK, 1988, Manchester.
Longuet-Higgins, H.C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, Vol. 293, September 1981, pp. 133-135.
18
Dynamic 3D-Vision
K.-D. Kuhnert , M. Langer, M. Stommel and A. Kolb
University Siegen
Germany
1. Introduction
Measuring and recognising the surfaces of the surrounding world forms a ubiquitous
problem in automation and robotics. The knowledge of the environment allows a flexible
and autonomous behaviour in different situations. Stereo vision belongs to the most popular
techniques for gathering this information because it provides the dense depth information
necessary for complex grasping tasks. Compared to laser scanners stereo cameras also have
the advantage of higher framerates, so they are widely used for mobile robots. Porta (2005)
e.g. uses the Small Vision stereo system (Konolige, 1997) to enhance localisation of a mobile
robot: Features from depth maps are used additionally to appearance based intensity
features. Other examples include Zhu et al. (2004) or Kang et al. (1995). However, the main
problems of stereo vision remain speed and robustness. In order to accelerate the time-consuming registration of the stereo images and avoid specialised hardware, Sun (2002)
employs an intelligent subregioning mechanism which reduces the search space of the
correspondence analysis. Another approach builds upon the usage of modern SIMD
processor instructions as documented by Sunyoto et al. (2004). Kim et al. (2005) on the other
hand achieve real-time behaviour by segmenting foreground objects from the background.
Depth information is then only updated for moving objects. Of course this approach is
problematic for mobile robots. The lack of robustness of stereo analysis for particular scenes
mainly arises from depth discontinuities and ambiguous surface texture. Kang et al. (1995)
avoid these ambiguities by projecting textured light on the scene, but this is no general
solution. Kim et al. (2005) made experiments with an adaptive matching window to increase
the accuracy near edges. Zhao and Katupitiya (2006) examined the effect of occlusion and
developed a method that detects occlusion areas and adapts a matching window
appropriately. To evaluate and compare the robustness of different stereo algorithms,
Scharstein et al. (2001) propose a taxonomy for different stereo algorithms and create a
testbed including stereo images with groundtruth. Using this testbed we will document the
results of the software system for the computation of dense disparity maps presented here.
Our stereo system unites some of the speed optimisations mentioned above and hence
achieves real-time behaviour. The calibration procedure and some comments on the
brightness change constraint will be given. We will also present results for the distance
measurements with a PMD camera (Photonic Mixer Device, Schwarte (2001), Kraft et al.
(2004)) which is a technique for measuring the distance of an object by the time of flight of
an active infrared illumination. The calibration procedure and the specifics of the
measurements will be described, especially for scenes with surfaces almost parallel to the
optical axis of the system. Finally the results on combining the stereo and PMD technique
will be given, discussing the advantages and disadvantages.
2. Stereo
Depth measurement by a stereo vision algorithm is done by computing a so called (sparse or dense) disparity map. Disparity maps encode the depth for each reference pixel in the stereo images. Using the well-known stereo geometry formula (Eq. 12) and the disparity map, one can
easily calculate the depth value of any given pixel. The calculation of disparity maps leads
to the so called stereo correspondence problem.
The stereo correspondence problem can be formulated as the problem of efficiently traversing a two-dimensional search space consisting of the intensity values at all positions in the
stereo images. Algorithms finding corresponding pixel pairs in acceptable time will be
presented later.
When one pixel of one stereo image is compared with one pixel of the other stereo image,
the degree of correspondence between these pixels is calculated using a certain metric. Two
of the most prominent metrics are the sum of absolute differences (SAD) and the sum of squared
differences (SSD). Both metrics have in common that they calculate the correspondence value
over a certain block size. This block size is usually represented by a rectangular region around the pixels that are to be matched. SAD sums the absolute values of the pixel-wise differences between the two regions, whereas SSD sums the squared pixel-wise differences. Both functions can be efficiently computed using
modern SIMD (Single Instruction, Multiple Data) processor instruction sets.
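As a minimal sketch (not the authors' SIMD-optimised implementation), the two metrics can be written as follows in Python/NumPy; the block size, the coordinates and the synthetic images are placeholders chosen only for illustration.

import numpy as np

def sad(block_left, block_right):
    # sum of absolute pixel-wise differences between two equally sized blocks
    return np.abs(block_left.astype(np.int32) - block_right.astype(np.int32)).sum()

def ssd(block_left, block_right):
    # sum of squared pixel-wise differences between two equally sized blocks
    diff = block_left.astype(np.int32) - block_right.astype(np.int32)
    return (diff * diff).sum()

# toy example: compare an 11x11 block of the left image with the block of the
# right image shifted by a disparity candidate d (synthetic data)
left = np.random.randint(0, 256, (375, 450), dtype=np.uint8)
right = np.roll(left, -5, axis=1)            # synthetic right image, disparity 5
y, x, half, d = 100, 200, 5, 5
lb = left[y - half:y + half + 1, x - half:x + half + 1]
rb = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
print(sad(lb, rb), ssd(lb, rb))              # both 0 for the matching disparity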
2.1 Review
In this part we review various stereo correspondence algorithms for traversing the two-dimensional search space mentioned earlier. First these algorithms are briefly
described, then concrete implementations of these algorithms are discussed.
All of the reviewed algorithms belong to the class of block-matching methods along a
horizontal scanline. Therefore, the images must be - at least approximately - rectified if the optical axes of the cameras are not adjusted in parallel. Due to the rectification and the
epipolar constraint (see Faugeras, 1993), it can be presumed that corresponding pixels in
stereo image pairs can be found on the same horizontal lines in both images. Then we define
a three-dimensional correspondence candidate matrix (also called cost matrix) C(x,y,d(x,y))
holding all disparity value candidates d of a given reference pixel at position (x,y). After
building this matrix, one needs efficient algorithms for retrieving the optimal disparity value for any given reference pixel. Several geometric and object specific
constraints reduce the search space and lead to an increase in both speed and quality of the
results of the correspondence analysis. A set of the most important constraints is given in
the following.
The so called brightness change constraint states that if a pixel in one of the stereo images has a
corresponding pixel in the other image, the intensity values of these pixels need to be the same. Fulfilling this requirement is what makes reliable pixel matching possible in the first place, because all of the
later discussed algorithms work intensity based. This is actually one of the most important
constraints concerning the quality of the correspondence analysis results.
Figure 1. Incoming light ray (marked red) with solid angle ω_i = (θ_i, φ_i) and outgoing/reflected light ray (marked green) with solid angle ω_o = (θ_o, φ_o) at point x
Object surfaces need to be piecewise smooth. This prerequisite ensures that corresponding
pixel pairs in both images have almost the same disparity values. This enables a further
reduction of the search space, because corresponding pixels can be found around a certain
offset on a scanline in the analyzed image.
The object surfaces are highly textured. The more textured the surfaces are, the more reliably
corresponding pixels can be found. If for instance an object surface has only one colour, it is
almost impossible to distinguish corresponding pixels from non-corresponding pixels on
these surfaces because they look identical.
The monotony constraint demands that, if the stereo images are rectified, corresponding pixel
pairs are to be searched in the same direction on a scanline. That means that, if the monotony constraint is fulfilled, the order of the pixel pairs is preserved along the scanlines.
As a last important constraint the objects are supposed to be equally visible for both cameras
of the stereo-system. That means, objects in a scene are not partially occluded. Partial
occlusion leads to different projections of the same object onto the image planes of the
stereo-cameras, because the cameras look at the objects from slightly different angles. The
consequence of that effect is that the left and the right stereo image possess different, even
mutually exclusive information of the object in the scene. Hence errors can be expected
when trying to match pixels in one image along a scanline in the other image where the
other image actually has no information about these pixels at all.
As stated above, the brightness change constraint belongs to the most important constraints
for the set of intensity based stereo matching algorithms. The majority of errors occurring in
the matching process can be classified as violations of this constraint. The reason why this
constraint is so easily violated is mainly due to the angle-dependent reflection properties of an
object as well as the problem of subpixel edge shift.
The reflection properties of an object mainly depend on the surface of the object as well as
on the light reaching the object surface. These two factors determine the equation of the
reflected light intensity
L_o(x, \omega_o) = \int_{\Omega} f(x, \omega_i \to \omega_o) \, L_i(x, \omega_i) \cos\theta_i \, d\omega_i ,    (1)
where ω_i = (θ_i, φ_i) and ω_o = (θ_o, φ_o) denote the solid angles of the incoming and reflected light rays at point x as shown in Figure 1.
L_i and L_o denote the light intensities of the incoming and outgoing light rays. The term f(x, ω_i → ω_o) represents the so called bidirectional reflectance distribution function
(BRDF) measuring the physical reflectance behaviour of the surface material. For any light ray with a given entry angle hitting the object at point x, the BRDF yields the quotient of the reflected radiance and the incident irradiance. The model of the BRDF is based on the
concept of so called micro facets. Micro facets are microscopically small mirrors randomly
aligned and distributed all over the object's surface. The alignment of these mirrors is
determined by the probability distribution of eq. 2:
f(x, \omega_i \to \omega_o) = \frac{F \cdot D \cdot G}{(\omega_o \cdot n)(\omega_i \cdot n)}    (2)
F denotes the Fresnel factor which models the reflection properties of the object material, D is the probability distribution function of the micro facets and G is the geometry factor modelling
shading between the micro facets. An in-depth look into the complex mathematical
deduction of these parameters can be found in related textbooks. To yield the overall light
intensity reflected from a point x, one has to integrate over the entire solid angle Ω. This
complex procedure requires extensive computation time.
The problem of subpixel edge shift results from the inherently limited resolution of cameras. To measure the light intensity of a pixel of the recorded scene, the camera has to integrate the light intensity over a certain area of the scene predetermined by the camera
resolution. This causes problems on edges in the scene, because edges typically mark an
abrupt change in light intensity values. So the integration process averages the light
intensities in the given areas over the edges. Due to high angle dependency of the projection
of edges (as described earlier) in such a subpixel integration area, the intensity values of
corresponding pixels may differ significantly. This problem can be reduced just slightly by
using high resolution cameras.
Hence, when performing intensity based stereo correspondence analysis, one always has to
consider these inherent problems. There is no general solution to address these problems
and the results need to be interpreted accordingly. In the following sections some of the
most popular approaches of stereo correspondence analysis are presented.
2.1.1 Winner Takes it All (WTA)
One of the simplest algorithms for searching in the matching matrix is the Winner Takes it All
method. WTA works as follows: For a given reference pixel WTA walks through the cost
matrix selecting that pixel which has the lowest difference to the reference pixel. This is a
simple minimum search on a given set of numbers. It is also a local method, because the
algorithm operates only on one vector of the matching matrix for each reference pixel.
The biggest advantage of such a simple local method is its ease of implementation. Furthermore, WTA can most easily be optimized. The disadvantage on the other hand is that local methods depend heavily on the constraints discussed earlier. There
is also the possibility that identical pixels of the cost matrix are assigned to reference pixels
more than once.
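The WTA minimum search itself reduces to an argmin over the disparity axis of the cost matrix C(x, y, d). The following sketch assumes a precomputed cost volume; its shape and contents are placeholders.

import numpy as np

def wta_disparity(cost):
    # cost has shape (height, width, d_max + 1): one matching cost per pixel
    # and per disparity candidate; WTA simply picks the cheapest candidate
    return np.argmin(cost, axis=2)

# toy cost volume: 375 x 450 pixels, 64 disparity candidates
C = np.random.rand(375, 450, 64).astype(np.float32)
disparity_map = wta_disparity(C)             # shape (375, 450), values 0..63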
E(d) = E_d(d) + \lambda \, E_s(d)    (3)
as it is described in (Scharstein & Szeliski, 2001). The term E_d(d) measures how well the disparity function d matches the stereo image pair which is to be evaluated. This is done by eq. 4, which takes all possible corresponding pixel pairs into account. λ is a global constant denoting a weight for E_s(d).
E_d(d) = \sum_{(x,y)} C(x, y, d(x,y))    (4)
BP = \frac{1}{N} \sum_{(x,y)} \bigl( |d_C(x,y) - d_T(x,y)| > \delta_d \bigr)    (5)
The variable d_C denotes one element of the calculated disparity map, whereas d_T is an element of the groundtruth map at position (x,y), and N denotes the number of pixels considered. δ_d is a fault tolerance value; we set it to 1.0.
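A direct transcription of Eq. 5 might look as follows; the array names are placeholders and the two maps are assumed to be registered and of equal size.

import numpy as np

def bad_pixel_rate(d_computed, d_truth, delta_d=1.0):
    # fraction of pixels whose disparity deviates from the ground truth by
    # more than the tolerance delta_d (Eq. 5)
    wrong = np.abs(d_computed.astype(np.float64) - d_truth.astype(np.float64)) > delta_d
    return wrong.sum() / wrong.size

# bp = bad_pixel_rate(disparity_map, groundtruth_map, delta_d=1.0)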
Method                      Schar/Szel    optimized
WTA, SAD                    4.60          7.47
WTA, SSD                    4.85          6.14
DP/SO, SAD                  6.46          6.87
DP/SO, SSD                  6.63          6.96
SA, SAD                     4.35          5.74
SA, SSD                     4.54          5.14
building matching-matrix    N/A           N/A
Table 1 shows the run-time behaviour as well as the quality (measured as the rate of falsely computed disparity values) of the implementation computing the disparity maps of the pictures shown in Figure 2. The pictures are 32 bit truecolour images with image sizes of 450 x 375 pixels. Figure 3 shows the corresponding disparity map and the difference image between the groundtruth image and the computed stereo image.
The difference picture of the calculated disparity maps (Figure 3 right side) shows that
errors mainly occur on edges due to partial occlusion. The brightness change constraint is preserved in this case and hence causes no additional errors. It can be expected that under normal conditions the brightness change constraint will be violated and DP/SO may yield better results than the simple WTA, because DP/SO is more robust in this case.
Figure 3. Computed disparity map (using WTA with SSD method) and difference map
according to the groundtruth map
We also used other images, mainly taken from the Stereo Vision Research Page (2007). The results for those pictures resemble the results presented in Table 1.
In Figure 4 we present a reconstruction of the original scene using the stereo-images and
their corresponding disparity map. This of course cannot be a full 3D-view of the scene
because there is only one stereo image pair made from a certain angle. Hence we call this a
2.5D-view of the original scene. The viewer can clearly distinguish which objects were closer
to the camera and which were lying in the background.
2.3 Variable Block Size and Interest Operators
Some cameras offer the possibility to automatically control the brightness of the image.
Concerning single camera images this is often an improvement. However, for the matching
of two images from different cameras this is a clear disadvantage. In a stereo setup the
length of the baseline between the cameras causes deviations in the fields of view of the
cameras, so only a part of the scene can be seen in both images. The rest of the images is
different for the two cameras and influences the brightness control. As a result, even the common image parts have a different brightness. In that case we obtained good results by matching the derivative of the image instead of the raw image intensities.
Figure 6. Disparities for a block size of 30x10 pixels and 60x20 pixels
The derivative can also be used as an interest operator, to accelerate matching and in
particular to determine the size of the image regions being matched. The following
explanations refer to a WTA block-matching algorithm with SSD as similarity measure. The
algorithm finds matches by comparing a block from the left image to all positions on the
corresponding epipolar line in the right image. Interest operators were introduced by
Moravec (1977). Their purpose is to limit image processing operations to the relevant image
area and save computational cost. A review of different methods can be found in Bähr and Vögtle (1991). We use the horizontal derivative to find textured image positions in the left
image which can be matched robustly to the right image, and to find image positions in the
right image which are candidates for a correspondence. Other positions in the right image
are not compared, which heavily reduces the computational effort. Since the robustness of
stereo analysis relies on texture, we examined the use of maxima of the derivative as
positions for matching. The result is a comparatively sparse disparity map because for a
given block in the left image there is only one matching position per edge in the right image.
The number of positions depends also on the threshold for maxima detection and often a
correct maximum is not recognized because the gradient falls below a given threshold.
However, the results are quite robust. To increase the number of correspondences, we
replaced the maxima detection by simple thresholds for the gradient magnitude. We use a
higher threshold for the left image and a lower threshold for the right image, to make sure
we obtain all corresponding positions in the right image despite the image noise. The results
are both more dense and more robust.
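The described thresholding of the horizontal derivative can be sketched as follows; the concrete threshold values and image sizes are placeholders, not the values used in our experiments.

import numpy as np

def textured_positions(image, threshold):
    # boolean mask of pixels whose horizontal gradient magnitude exceeds the
    # threshold; only these positions take part in the matching
    grad_x = np.abs(np.diff(image.astype(np.int32), axis=1))
    mask = np.zeros(image.shape, dtype=bool)
    mask[:, 1:] = grad_x > threshold
    return mask

left = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
right = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
# stricter threshold in the reference (left) image, looser threshold in the
# search (right) image so that no true correspondence is lost to noise
left_mask = textured_positions(left, threshold=30)
right_mask = textured_positions(right, threshold=15)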
Figure 7. Adaptive block size for the cross method and resulting disparities
Figure 8. Adaptive block size for the region method and resulting disparities
The stability of the stereo analysis depends also on the chosen block size. Figure 6 shows
the disparity maps for the stereo image pair from Figure 5 for different block sizes. Light
values indicate a higher disparity, dark areas a lower disparity. Missing values are marked
as black. They result from the plausibility test proposed by Faugeras (1993): The matching
result in the right image is searched for in the left image. The resulting second match should
be the original block in the left image. If this test fails, the disparity value is discarded. For a
block size of 30x10 pixels 53 percent of all possible disparity values could be computed. For
a block size of 60x20 pixels this value increases to 59 percent. As can be seen from the
picture, this is an improvement especially for homogeneous regions. The drawbacks of an
increased block size are the additional computational effort and the lower accuracy, in
particular near depth discontinuities.
To overcome these disadvantages, a locally adaptive block size is introduced. In
homogeneous regions a higher block size is chosen than in textured regions to make
matching more robust if there are few characteristic image features. To reduce the inaccuracy near depth discontinuities, bigger blocks are not allowed to cross sharp
edges. For every image position the block size is determined by evaluating the gradient
magnitude along the respective image row and column. Starting with an initially small
block, the block size is increased horizontally and vertically until the borders of the block
reach an edge in the row or column through the centre pixel. If there are no edges the block
growth is stopped after a certain maximum size has been reached. The search along a cross
was chosen because size determination can be performed very quickly. Figure 7 shows some
of the blocks as rectangles. Little dots indicate the edges which stopped the block growth.
The right part of Figure 7 shows the resulting disparity map for a maximum block size of
60x60 pixels. In comparison to the disparity map in Figure 6 the results for homogeneous
regions have improved. On the other hand, some of the stable but inaccurate disparities near
the edges of the monitor are missing now. This is a result of the smaller block size in
textured image regions and is entirely correct, taking into account the occlusion effects in
that area. A drawback of this method is that only edges lying on a cross through the block
center have an influence on the block size. The disparity values near the corners of the
monitor hence are still inaccurate. To improve that, a second strategy for the size
determination is examined. Starting from an initial block size the block is expanded alternately by one row or one column if there is no edge in this new row or
column. If the block growth is stopped in one dimension, it continues in the other dimension
until also there an edge has been reached. The results can be seen in Figure 8. Concerning
the corners of the monitor, the disparities are more accurate now, while the remaining
values are similar to the results before. For both methods of size determination the gain in density and accuracy of the disparity maps comes at the price of a higher computational cost, which is primarily a result of the increased block size for homogeneous
regions. The computation of the disparity map by the cross method took three times longer
than for a fixed block size of 30x10 pixels. The method, which tested the whole area of a
block for edges, was even four times slower. It should also be mentioned that the matching
results for homogeneous regions are comparatively unstable even for large blocks. A soft
edge at the border of a block caused by a depth change in the scene together with smoothing
etc. can dominate the whole structure inside that block and thus influence the resulting
disparity value.
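The cross method for block size determination can be sketched as follows, assuming a precomputed gradient magnitude image; the edge threshold, the initial size and the maximum size are placeholders.

import numpy as np

def block_size_cross(grad_mag, y, x, edge_thresh=30, init=5, max_half=30):
    # grow the block from an initial half size until the block border reaches
    # an edge in the row or column through the centre pixel (y, x)
    h, w = grad_mag.shape
    half_x, half_y = init, init
    while half_x < max_half and x - half_x > 0 and x + half_x < w - 1 \
            and grad_mag[y, x - half_x] < edge_thresh \
            and grad_mag[y, x + half_x] < edge_thresh:
        half_x += 1
    while half_y < max_half and y - half_y > 0 and y + half_y < h - 1 \
            and grad_mag[y - half_y, x] < edge_thresh \
            and grad_mag[y + half_y, x] < edge_thresh:
        half_y += 1
    return 2 * half_x + 1, 2 * half_y + 1    # block width, block height

image = np.random.randint(0, 256, (480, 640), dtype=np.uint8).astype(np.int32)
gy, gx = np.gradient(image)                  # vertical and horizontal derivatives
grad_mag = np.abs(gx) + np.abs(gy)
width, height = block_size_cross(grad_mag, y=240, x=320)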
2.4 Post Processing: Disparity Histogram and Subpixel Accurate Disparity
Ideally, disparity maps consist of big areas with steadily changing disparities for flat
surfaces and abrupt changes for depth discontinuities. In homogeneous regions the results
deviate strongly from this ideal. After removing most of these unstable disparities using a
suitable interest operator, mid-size areas of homogeneous disparity remain. These areas are
surrounded by areas without results. The remaining false matches appear as single
disparities deviating much from their neighbourhood. These observations motivate the
following assumptions: (a) A correct disparity value belongs to a certain homogeneous
surface. It probably appears there multiple times. (b) Errors are rare. (c) Wrong disparities
do not belong to a certain surface and hence take on arbitrary values. The same wrong value
appears probably only a few times. That means in reverse: Frequent disparity values are
usually right, while rare values are often wrong. For our experiments we computed the
histogram of disparity values and used a threshold on the histogram to discard uncommon
values. Our experiments lead to good results for thresholds in the range of a few tenths of a
percent. With these thresholds sometimes up to 50 percent of the removed values were
actually wrong disparities. Of course, the results depend also on the image contents and
perhaps a comparison with neighbouring disparity values could lead to further
improvements. In general this is a good supporting method if most of the unstable
disparities are already filtered out by an interest operator. Then this method discards a high
percentage of wrong values at almost no computational cost without removing too many
disparity values.
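The histogram-based filtering can be sketched as follows; the invalid marker and the threshold of a few tenths of a percent are placeholders.

import numpy as np

def filter_rare_disparities(disparity, invalid=-1, min_fraction=0.002):
    # discard disparity values that occur in less than min_fraction of all
    # valid pixels, marking them as invalid
    valid = disparity != invalid
    values, counts = np.unique(disparity[valid], return_counts=True)
    rare = values[counts < min_fraction * valid.sum()]
    cleaned = disparity.copy()
    cleaned[np.isin(disparity, rare)] = invalid
    return cleaned

# cleaned_map = filter_rare_disparities(disparity_map)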
For applications which require a high precision rather than high speed the disparity can be
computed with subpixel accuracy. A robust way to determine the subpixel shift between two corresponding blocks is to minimise the squared error
e = \sum_x \bigl( g(x+s) - f(x) \bigr)^2    (8)
between the pixel intensities g of the block in the left image and the intensities f of the corresponding block in the right image. The variable x denotes a pixel position inside the block
and the variable s denotes a subpixel shift along the epipolar line. The intensity of a block
with subpixel shift is linearly interpolated using the image derivative according to the
formula
g(x+s) = g(x) + s \, g'(x) .    (9)
The subpixel shift is then obtained by finding the root of the derivative of the error, i.e. the value of s for which
\frac{de}{ds} = \sum_x 2 \, g'(x) \bigl( g(x) + s \, g'(x) - f(x) \bigr) = 0 .    (10)
Solving for s yields
s = \frac{\sum_x g'(x) f(x) - \sum_x g(x) g'(x)}{\sum_x g'(x)^2} .    (11)
In practice, this method leads to smooth subpixel shifts in areas of uniform disparity.
Outliers occur only where the disparity value already deviates from the neighbourhood for
pixel accuracy. Since at wrong positions the subpixel shift often is greater than half a pixel,
the subpixel shift is well suited to indicate wrong disparity values. Since such great values
correspond to a neighbouring matching position with a lower subpixel shift, the disparity
must already be wrong at pixel level and thus can be discarded.
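A direct transcription of Eq. 11 might look as follows; the half-pixel test in the last line reflects the plausibility check described above, and the function name is ours.

import numpy as np

def subpixel_shift(g, f):
    # least-squares subpixel shift between block g (left image) and the
    # corresponding block f (right image), cf. Eq. 11:
    #   s = sum(g' * (f - g)) / sum(g'^2), with g' the horizontal derivative
    g = g.astype(np.float64)
    f = f.astype(np.float64)
    g_prime = np.gradient(g, axis=1)         # horizontal image derivative
    denom = (g_prime * g_prime).sum()
    if denom == 0.0:
        return 0.0, False                    # untextured block, no estimate
    s = (g_prime * (f - g)).sum() / denom
    # shifts above half a pixel indicate a disparity that is wrong at pixel level
    return s, abs(s) <= 0.5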
2.5 Camera Calibration and Accuracy of the Distance Measurements
We use a stereo setup with two DFK 21F04 cameras by The Imaging Source and
Cosmicar/Pentax lenses to compute the accuracy of the distance measurements. These
cameras provide images with a resolution of 640 x 480 pixels. They are mounted on separate
10mm aluminium plates which can be adjusted in yaw, pitch and roll angle. The baseline
length of the camera setup is 20 cm. The lens aperture is set to 5.6, the focus to infinity. The
lens of the right camera has a focal length of 8mm.
To save computation time during operation the stereo images are not rectified in software.
Instead we rely on a careful manual adjustment of the camera orientation. By changing the
zoom of the lens of the left camera, the image sizes are brought into accordance with pixel
accuracy. Because the stereo algorithms we use belong to the category of scanline matching,
the roll and pitch angles of the cameras are adjusted in a way that the line correspondence
between the left and the right image is maximised. We obtain an error of less than 1/100
pixels for the roll angle and less than 1 pixel for the pitch angle. For reasons of simplicity,
stereo systems are often built with parallel optical axes. But for a working distance of 1.5 m to 4 m, which was chosen with regard to a later data fusion with the PMD camera, the images had
a common field of view too small for stereo analysis. Therefore, the optical axes are directed
towards each other, so that both camera images centre an object at a distance of 4m.
Assuming a pinhole camera model, we can compute the distance z of a point in 3D-space by
the well-known formula
z = \frac{b \cdot f}{x_l - x_r} ,    (12)
where f denotes the focal length of both cameras and xr and xl denote the horizontal
coordinate of the corresponding position in the right and the left camera image. The variable
b denotes the length of the baseline. For the proposed stereo setup the formula changes to
z = \frac{b}{\tan\left(\alpha + \arctan\frac{u \, x_l}{f_1}\right) + \tan\left(\beta + \arctan\frac{u \, x_r}{f_2}\right)} .    (13)
Here, α and β denote the deviation of the yaw angles of the optical axes from a parallel
setup. The values f1 and f2 are used because the focal length is not necessarily the same for
both cameras. The variable u denotes the size of one pixel on the CCD-sensor. It is given by the Sony ICX098BQ data sheet as 5.6 µm/pixel.
Parameters determined for the stereo setup (B, F1, F2, ...): 0.0024691, 0.3045662, 0.3041429, 0.7583599, 0.7580368
From the averaged disparity value the absolute coordinates xr and xl can be computed by adding an
arbitrary offset to the disparity, keeping in mind that the disparity is valid for a larger image
region. Some optimisation methods do not take this into account and thus find parameter
sets with a significantly lower standard deviation, but then the distance values are only
plausible for the chosen offset.
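For the simpler parallel-axis case of Eq. 12 the conversion from a pixel disparity to a metric depth can be sketched as follows; the default values correspond to the setup described above (20 cm baseline, 8 mm focal length, 5.6 µm pixels), but the function and its name are only an illustration.

def depth_from_disparity(disparity_px, baseline_m=0.20,
                         focal_length_m=0.008, pixel_size_m=5.6e-6):
    # pinhole depth for a parallel-axis stereo rig (cf. Eq. 12):
    # z = b * f / (u * d), with the disparity d = x_l - x_r given in pixels
    return baseline_m * focal_length_m / (pixel_size_m * disparity_px)

# with the values above a disparity of 40 pixels corresponds to about 7.1 m
z = depth_from_disparity(40.0)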
3. PMD
The advent of the photonic mixer device (PMD) leads to new possibilities in real-time depth
measurement. Compared to the use of stereo correspondence analysis, laser scanning or
other depth-measuring technologies, PMD cameras have the advantage of recording intensity images and depth images of a scene at once. No further time-consuming computation needs to be done. PMD cameras integrate the sensor hardware plus
the needed software for gathering the images in one device. Figure 9 shows a PMD camera.
readout signals, which directly correspond to the phase difference. Hence the mixing is
done by mixing electrons of the incoming signals with the signal of the push-pull voltage.
The incoming signal is integrated over its amplitude and the measured result is put into its
corresponding pot. The difference of the capacity of both pots is directly related to the phase
difference which is linearly related to the distance of the camera to the recorded object.
Figure 11. Left: Sample measurements (dots) and linear approximation of the distance
function. Right: Standard deviation of the measurements as function of the distance
3.2 Calibration
In our experiments we use a 1K PMD camera with ambient light suppression, which has a
resolution of 16x64 pixels. The camera provides a measurement of the phase difference for
every pixel and the modulation ratio which is a measure for the signal quality. The phase
difference corresponds to the distance between the surface of an object and the camera. For
ease of use we transform the distance data to a representation in Cartesian coordinates. To
this end the camera geometry is approximated by central projection and the illumination by
a point source at the centre of projection. With these approximations we obtain directly
distance measurements in polar coordinates: The angles of the coordinates are built between
the rays from the centre of projection through the grid cells of the sensor array. The distance
is given by the measurements themselves. This representation is then converted to Cartesian
coordinates. A side effect of this procedure is that a slight increase of the lateral resolution
towards the image borders is visible now. It is caused by the large aperture angle of 70.5 degrees of the camera. To increase the accuracy of our approximations, lens distortion was
corrected during the coordinate transformation. Besides that, the coordinate system was
shifted by 1.3 pixels horizontally and 2.5 pixels vertically to account for an offset between
the optical axis and the centre of the sensor chip.
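The conversion from the polar measurements to Cartesian coordinates can be sketched as follows under the central-projection approximation; the vertical aperture angle, the interpolation of the ray angles and the sign convention of the pixel offsets are assumptions, and the lens distortion correction is omitted here.

import numpy as np

def pmd_to_cartesian(distance, fov_x_deg=70.5, fov_y_deg=17.6,
                     offset_x=1.3, offset_y=2.5):
    # convert a PMD distance matrix (distances along the pixel rays) into
    # Cartesian x, y, z, assuming central projection with the light source
    # at the centre of projection
    rows, cols = distance.shape
    ax = np.deg2rad((np.arange(cols) - (cols - 1) / 2 - offset_x)
                    * fov_x_deg / cols)
    ay = np.deg2rad((np.arange(rows) - (rows - 1) / 2 - offset_y)
                    * fov_y_deg / rows)
    ax, ay = np.meshgrid(ax, ay)
    dx = np.cos(ay) * np.sin(ax)             # unit ray directions per sensor cell
    dy = np.sin(ay)
    dz = np.cos(ay) * np.cos(ax)
    return distance * dx, distance * dy, distance * dz

# usage with a 16 x 64 distance matrix in metres
dist = np.full((16, 64), 3.0)
x, y, z = pmd_to_cartesian(dist)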
To determine the relation between the phase difference and the distance, a flat surface with
high reflectance was recorded for several distances. The phase difference was averaged over
the middle 5 pixels of the array because there our approximations have the smallest error.
The distance function z(φ) was then approximated by the linear function
z(\varphi) = 5.4022 \, \varphi + 672.5143
(14)
with a remaining maximum error of about three percent (see Figure 11 left). Besides the distance function, the relation between distance and standard deviation is also of major importance because the accuracy of the measurements depends heavily on the amount of
light received by the camera. Since the illumination decreases quadratically with the
distance, a second order polynomial was fitted to the data. The resulting function
(15)
can be seen in Figure 11 right. Figure 12 shows a calibrated distance matrix of the test
surface used for camera calibration as well as two nearer marker objects.
Figure 13. Calibrated distance measurements for a near flat surface (1m distance)
The remaining errors are mainly caused by model inaccuracy for small distances, by too low
signal strength and by difficulties with the recorded scene. It turned out that for distances
smaller than 1.5 meters the camera model is not appropriate and the light source should
better be modelled by a transmitter with finite area (see Figure 13). The problem of a too low
signal occurs mainly at the borders of the sensor array and for very distant objects. The reflectance of the objects and the angle between the light source and the optical axis also play a role. Besides that, there is a nonlinear relation between the phase and the signal
amplitude due to deviations from the ideal sine wave shape. To account for these
problems, a threshold for the modulation ratio which indicates the signal strength is
introduced. Pixels with a modulation ratio below 30 percent are ignored. This procedure
ensures that the measurements are in good accordance with Eq. 15.
Beside the reflectance of the recorded objects also their geometry can cause inaccuracies. If
the border of an object is mapped to one sensor element, the sensor receives a mixed signal
from the object in the front and the background. Then the recorded signal is a linear combination of two sine waves which are weighted by the reflectance and distance of both objects. It is also possible that the signal is composed of more sine waves if there are more
objects occluding each other. For these sensor elements the true distance cannot be
computed. A reasonable assumption then is that the real distance is somewhere between the
neighbouring distances. A simple way of handling this is to introduce a minimum and a
maximum depth map. The minimum depth map is the result of a 3x3 minimum operator on the
distance matrix, the maximum map is the result of an analogue 3x3 maximum operator. This
is a fast method and accounts for the distance dependency of the error.
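A sketch of these minimum and maximum depth maps using SciPy's rank filters (the array name is a placeholder):

import numpy as np
from scipy.ndimage import minimum_filter, maximum_filter

def depth_interval_maps(distance):
    # 3x3 minimum and maximum operators on the distance matrix: at occluding
    # edges the true distance lies somewhere between the two results
    return minimum_filter(distance, size=3), maximum_filter(distance, size=3)

# d_min, d_max = depth_interval_maps(distance_matrix)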
3.3 Experimental Results
We present in this section some results from experiments that were conducted to find out
more about the real world behaviour of the PMD camera. The goal of the experiments was
to gain a calibrating function for any pixel yielding a mapping from a measured value to a
standardised value. First we tried to do the calibration process in front of a white wall, but
later a board was taken instead of the wall. The reason for that will be addressed later in this
section.
Figure 14. Phase differences of the wall-scene and the board-scene (green line)
The camera was placed in front of the wall at distances of 1m to 6m with steps of 1m, which is within the measurement range of 7.5m (see Section 4.1). Each time a full measurement was taken by the
camera. For all the experiments a tape measure was used instead of the internal alignment
function of the camera to measure exactly the distance of the camera to the wall. To
suppress the effect of noise 2500 measurements were taken and every pixel was integrated
over a 5x5 area at each distance step.
Up to a distance of 3m the phase difference values show the expected linear behaviour. But at
distances beyond 3m the values start decreasing again (see Figure 14). This might be due to
the fact that the surface of the wall is too smooth so that the light is not reflected in a diffuse
manner. Hence the modulated light might not find its way back to the PMD sensor. Further
tests showed that this effect is almost independent of the exposure time of the camera.
As an alternative we chose a board with a fine but rough surface resulting in a better diffuse light reflection. With this, the camera showed the expected behaviour. The phase difference
values are linearly increasing even beyond the point of 3m (green line). It has to be said that
the calibration process could not measure distances of 5m and 6m, because above 4m the
board was not big enough to cover the entire image plane of the camera.
The measured signal decreases rapidly in the border area of the two walls, not allowing any reliable measurements. The reason is the flat angle transporting the light away from the camera instead of reflecting it right back to the sensors. Hence, for reliable results one has to make sure the camera is aligned orthogonally towards flat objects. The phase difference picture for Figure 16 a) is presented in Figure 18.
Figure 16. Edge scene: a) camera looking at the edge of the wall orthogonally, b) camera looking at the edge of the wall at an oblique angle
Figure 20. Intersecting the depth intervals of the stereo and the PMD camera
For the PMD-camera the minimum and maximum depth maps are computed by taking the
minimum and maximum value of a 3x3 environment around every point in the depth
image. This is because for occluding edges a mixed signal of the nearer and farther object
surface is received. Then the depth maps are corrected by the measurement error. To
compensate for different aperture angles of the camera systems and different lateral
resolutions the depth map of the PMD camera is scaled by appropriate factors in vertical
and horizontal direction.
The depth maps now form arrays which give a depth interval for every x- and y-coordinate.
Data fusion is then done by intersecting the intervals of the PMD-camera with the intervals
of the stereo camera. Figure 20 illustrates three different cases. The first case is that the depth
intervals partially overlap. The area marked as final depth is the intersection of the two
depth intervals. The second case shown is that one depth interval completely covers the
other one. Here, both sensors deliver the same depth value, but with different measurement
accuracy. The third case is that there is no overlap. In that case one sensor or both deliver
wrong values. Without taking further assumptions, nothing more can be said here about the
true depth value, so the final value is marked as missing.
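The interval intersection can be sketched as follows; taking the centre of the intersection as a point estimate for the fused depth is our simplification, the text only specifies the interval itself.

import numpy as np

def fuse_depth_intervals(stereo_min, stereo_max, pmd_min, pmd_max):
    # fuse per-pixel depth intervals from stereo and PMD by intersection;
    # where the intervals do not overlap, the result is marked as missing (NaN)
    lo = np.maximum(stereo_min, pmd_min)
    hi = np.minimum(stereo_max, pmd_max)
    overlap = lo <= hi
    fused = np.where(overlap, 0.5 * (lo + hi), np.nan)
    interval_size = np.where(overlap, hi - lo, np.nan)
    return fused, interval_size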
4.2 Experimental Results
Figure 21 shows a scheme of the experimental setup and the recorded scene. The
experiments were conducted in a corridor of the university. The only changes to the original
scene are the person we asked to stand in the corridor and the low carton placed in the
foreground. In particular, we did not facilitate the depth recognition by e.g. hanging up
highly textured
posters. The environment can thus be considered as a natural indoor scene (for office
buildings). The distance between the cameras and the objects of the scene was 3m to 4m. The
stereo camera was placed behind the PMD camera because of its smaller aperture angle of 23 degrees compared to the 70.5 degrees of the PMD camera. It was also placed 30cm higher to prevent the PMD camera from being visible in the stereo images.
Figure 22 shows a pair of images from the stereo camera. These images are problematic for
stereo analysis in many ways. First, there are big homogeneous regions, primarily the white
walls and the white column. Scanline stereo or block matching fails in these regions due to
the lack of characteristic image features. Secondly, the shirt of the person shows a repetitive
pattern causing multiple solutions to the correspondence problem. In the remaining parts of
the image stereo analysis is affected by occlusion. This concerns mainly the ceiling and the
environment around the person. Figure 23 shows the resulting disparity and depth maps.
Since most parts of the images lead to unstable results, a Sobel operator was applied to the
images to find areas where stable results can be expected. Disparity values in unstable
regions are discarded. In the given depth map these areas are marked black. They are
replaced by the mentioned 10m interval before fusion with the PMD camera.
Figure 21. Left: Top view of the experimental setup. Right: Side view of the experimental
setup
Figure 23. Left: Disparity map. Right: Resulting depth map (in image coordinates)
Figure 24. Size of the overlapping depth intervals (left) and average depth (right)
The results of the sensor fusion can be seen in Figure 24. The left picture shows the size of
the overlapping depth intervals, the right image shows the mean depth. The coarse block
structure results from the low resolution of the PMD camera. Coordinates without
overlapping depth intervals are marked as white. The bright blocks in the left figure indicate
big depth intervals around the person, at the edges of the column and around the box. This
inaccuracy results from depth discontinuities because there the PMD camera receives a
signal both from the far and the near surface. By contrast, the stereo camera is quite accurate
for these positions and delivers in general the distance to the nearer object. Although the
depth map from the stereo analysis is relatively sparse, the result after sensor fusion is
dense. Missing results from stereo are replaced by values from the PMD camera because the
stereo system provides sufficiently large depth intervals when a result is unknown. The few
missing values in the fused depth map occur when both systems report a high accuracy for
their results, although one of the depth values or the accuracy itself is wrong. Although the
fusion mechanism itself is comparatively straightforward, it seems to preserve the advantages of both 3D sensors while avoiding their disadvantages.
6. Conclusion
Common state-of-the-art mechanisms for the measurement of the surrounding environment in real time usually pose a trade-off between high speed, robustness and accuracy. With applications for mobile robots in mind, this work focuses on the faster methods, stereo
analysis and PMD camera. Our research aims at the computation of robust and dense depth
maps in real time.
First, the performance of three standard stereo algorithms is examined with regard to two
different measurements of similarity. The subsequent optimisation of the standard methods
by using modern SIMD instructions and programming techniques like e.g. recursive
subdivision leads to an increase of speed by a factor of four. As a result, for the Winner-Takes-It-All algorithm we achieve a computation time of 250ms (plus 230ms for building the
cost matrix), which can be considered real time. The accuracy of our stereo setup is
determined experimentally and a scene reconstructed from stereo data is shown.
To improve robustness and speed the image derivative is evaluated. In order to deal with
poorly structured environments experiments with an adaptive block size are conducted. The
resulting disparity maps are more dense but the resulting block sizes for homogeneous
regions slow down the correspondence analysis. Hence, this approach is not suitable for real
time. A fast post processing step dealing with a disparity histogram is introduced to discard
wrong matches. The subpixel disparity is computed as a measure of plausibility.
As a comparatively new technique the PMD camera is used for distance measurement. The
PMD camera directly provides the depth information for every pixel without the intensive
computation that characterises stereo analysis. The camera was calibrated with an accuracy
of 5 percent for distances over 1.5m. For smaller distances a more complex model than a
pinhole camera with a point light source is needed. We observe that the measurement error
increases quadratically with the distance, which is an effect of the reduced amount of light
received from distant surfaces. Other inaccuracies result from the reflectance properties of
the recorded surfaces or extreme geometric arrangements of the scene. A big advantage of
the PMD camera is that it does not rely on the texture of a surface or the visibility of objects
in a second camera like a stereo camera. With an (adjustable) integration time of 80ms per
image it is also much faster. On the other hand, the stereo camera has a higher image
resolution as well as higher depth accuracy. Especially the behaviour on and near edges is
better. We thus made experiments combining both methods and they turn out to
compensate the disadvantages of each other very well. As a result, we obtain robust and
dense depth information in real time.
7. References
Bähr, H.-P. & Vögtle, T. (1991). Digitale Bildverarbeitung. Anwendung in Photogrammetrie,
Kartographie und Fernerkundung, Wichmann, Karlsruhe, ISBN 3-87907-224-8
Faugeras, O. (1993). Three-Dimensional Computer Vision A Geometric Viewpoint, MIT Press,
Cambridge, Mass., ISBN 0-262-06158-9
Kang, S.; Webb, J.A.; Zitnick, C. & Kanade, T. (1995). A multibaseline stereo system with
active illumination and real-time image acquisition, Proceedings of the Fifth
International Conference on Computer Vision (ICCV '95), June 1995, pp. 88-93.
Kim, H.; Min, D.B. & Sohn, K. (2005). Real-Time Stereo Using Foreground Segmentation and
Hierarchical Disparity Estimation, in Ho, Y.S. & Kim, H.J. (Eds.): PCM 2005, Part I,
LNCS 3767, Springer-Verlag, Berlin, Heidelberg, pp. 384-395.
Konolige, K. (1997). Small Vision System: Hardware and Implementation, In Proceedings of
the 8th International Symposium on Robotics Research, Japan.
Kraft, H.; Frey, J.; Moeller, T.; Albrecht, M.; Grothof, M.; Schink, B.; Hess, H. & Buxbaum, B.
(2004). 3D-Camera of high 3D-frame rate, depth-resolution and background light
elimination based on improved PMD (photonic mixer device)-technologies, OPTO,
Nuernberg, May 2004.
Moravec, H.P. (1977). Towards automatic visual obstacle avoidance, Proceedings of the Fifth International Joint Conference on Artificial Intelligence, IJCAI-77.
Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines, Journal of Chemical Physics, Vol. 21, No. 6, pp. 1087-1092.
Porta, J.M.; Verbeek, J.J. & Kröse, B.J.A. (2005). Active appearance-based robot localization using stereo vision, Autonomous Robots, Vol. 18, No. 1, pp. 59-80.
Scharstein, D.; Szeliski, R. & Zabih, R. (2001). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, HI, Dec. 2001.
Schwarte, R. (2001). Dynamic 3D-Vision, IEEE Int. Symp. on Electron Devices for Microwave and
Opto-electronic Applications, EDMO 2001, Wien.
Stereo Vision Research Page (2007). http://cat.middlebury.edu/stereo/newdata.html, Middlebury College, 30.1.2007.
Sun, Ch. (2002). Fast Stereo Matching using Rectangular Subregioning and 3D Maximum
Surface Techniques, International Journal of Computer Vision, 47(1/2/3), May 2002
Sunyoto, H.; van der Mark, W. & Gavrila, D.M. (2004). A Comparative Study of Fast Dense Stereo Vision Algorithms, Proceedings of the 2004 IEEE Intelligent Vehicles Symposium, Parma, Italy, June 2004.
Zhao, J. & Katupitiya, J. (2006a). A Fast Stereo Vision Algorithm with Improved
Performance at Object Borders, IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS06), Beijing, China, October 9-15, 2006.
Zhao, J. & Katupitiya, J. (2006b). A Dynamic Programming Approach Based Stereo Vision
Algorithm Improving Object Border Performance, IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS06), Beijing, China, October 9-15, 2006.
Zhu, Z.; Karuppiah, D.R.; Riseman, E. & Hanson, A. (2004). Adaptive panoramic stereo
vision for human tracking with cooperative mobile robots, Robotics and Automation
Magazine, Special Issue on Panoramic Robots, 14 (10): pages 69-78.
19
Bearing-only Simultaneous Localization and
Mapping for Vision-Based Mobile Robots
Henry Huang, Frederic Maire and Narongdech Keeratipranon
Faculty of Information Technology, Queensland University of Technology
Australia
1. Introduction
To navigate successfully, a mobile robot must be able to estimate the spatial relationships of
the objects of interest accurately. A SLAM (Simultaneous Localization and Mapping) system
employs its sensor data to build incrementally a map of an unknown environment and to
localize itself in the map simultaneously. Thanks to recent advances in computer vision and
cheaper cameras, vision sensors have become popular to solve SLAM problems (Bailey,
2003; Costa et al., 2004; Davison et al., 2004; Goncavles et al., 2005; Jensfelt et al., 2006;
Mouragnon et al., 2006). The proposed bearing-only SLAM system requires only a single
camera which is simple and affordable for the navigation of domestic robots such as
autonomous lawn-mowers and vacuum cleaners.
Solutions to SLAM problems when the mobile robot is equipped with a sensor that provides
both range and bearing measurements to landmarks are well developed (Leonard &
Durrant-Whyte, 1991; Zunino & Christensen, 2001; Spero, 2005; Bailey & Durrant-Whyte,
2006). With a single camera, landmark bearings can be derived relatively easily from a
grabbed image, however it is much more difficult to obtain accurate range estimates. Due to
the low confidence in range estimates from vision data, it is desirable to solve SLAM
problems with bearing-only measurements.
One of the fundamental tasks of a SLAM system is the estimation of the landmark positions
in an unknown environment. This task is called Landmark Initialization. A typical bearing-only SLAM system requires multiple observations for landmark initialization through
triangulation. With only one observation, a stereo vision system can provide range measurements because its multiple cameras grab images from slightly different viewpoints. However, the reliable vision range of a stereo system is limited by the distance between the two cameras. Several observations at different locations are required to provide a robust range
estimate.
Structure From Motion (SFM) is a process to construct the map of an environment with the
video input from a moving camera. SFM allows a single camera to grab images at some
vantage points for landmark initialization, such as a sufficient baseline and a straight
movement. The requirements of SFM are well satisfied by a mobile robot, and some recent works have applied SFM to bearing-only SLAM (Goncavles et al., 2005; Jensfelt et al., 2006). Our method to bearing-only SLAM is inspired by the techniques used in both stereo vision
and SFM.
2. Related work
The term SLAM was first introduced by Leonard and Durrant-Whyte (1991); it refers to Simultaneous Localization and Mapping. SLAM is one of the fundamental tasks in the
navigation of an autonomous mobile robot. In robotic navigation, a map is a representation
of the spatial relationship between the objects of interest in the environment. A map usually
contains the positions of certain objects of interest, such as landmarks and obstacles. The
process of a robot to determine its position in a map is called localization. GPS (Global
Positioning System) is a popular localization system, in which the map is given for
navigation. GPS is well suited for vehicles to navigate in a large scale outdoors environment,
for instance, to navigate from city to city. For a domestic robot, however, a GPS is not
accurate enough and does not work properly indoors and in some built-up areas. Further,
the map of a particular environment may not always be available. A domestic robot cannot
localize itself without a map. A SLAM system needs to build incrementally a map while it
explores the environment and to determine its location in the map simultaneously.
For localization the robot needs to know where the landmarks are, whereas to estimate the
landmark positions the robot needs to know its own position with respect to the map. The
problem of SLAM is therefore considered a chicken-and-egg problem (Siegwart & Nourbakhsh,
2004). To predict the position of the robot, conventional SLAM systems rely on odometry.
Unfortunately, the accumulation of odometric errors (due mainly to wheel slippage) makes
the accuracy of the position estimates based only on odometry deteriorate rapidly. Updating
the estimates with other sensory input is needed if the robot navigates for a long time.
Solutions to SLAM can be found if both range and bearing measurements to landmarks are
obtained through sensors (Leonard & Durrant-Whyte, 1991; Zunino & Christensen, 2001;
Spero, 2005; Bailey & Durrant-Whyte, 2006). Such a sensor reading refers to a Full
Observation. A full observation can be achieved by either a single sensor (i.e., a laser range
finder) or a combination of two sensors (i.e., a sonar sensor and a camera). Range and bearing
measurements constitute a full state of the environment. The sensors which observe the full
state of the environment (i.e., both range and bearing) are called range-bearing sensors. A full
observation is sufficient to form an estimate, such as an uncertainty region, of a landmark
position. A typical uncertainty region is a Gaussian distribution over the possible positions
of a landmark. Updating an estimate can be achieved by fusing the estimates from the
subsequent observations. However, a range-bearing sensor is too expensive for a domestic
robot. Solving the SLAM problems with a cheaper sensor is desirable.
A sensor reading with either range-only or bearing-only measurement to a landmark is
called a Partial Observation. A partial observation is insufficient to completely determine a
landmark position. A partial observation generates only a non-Gaussian distribution over an
unbounded region for the landmark position (Bailey & Durrant-Whyte, 2006). Multiple
observations from several vantage points are required to estimate the landmark position. A
sensor reading obtained from a single camera constitutes only a partial observation because
it provides bearing measurements but does not provide accurate range measurements. In
general, a vision sensor is relatively cheaper than a range-bearing sensor. We wish to solve
SLAM problems with bearing-only measurements. The next section reviews related work on vision-based navigation for bearing-only SLAM.
2.1 Vision based navigation
Vision-based navigation for a mobile robot has been investigated since the early nineties of the last century. Levitt and Lawton (1990) developed a Qualitative Navigator based on vision
sensors. This navigator was able to explore the environment and to determine the relative
positions of all the objects of interest. In general, an image contains very rich information for
mapping the corresponding environment. A certain feature can be recognized through its
specific color, shape and size. A frame rate of up to 30 Hz from a video camera is also beneficial for SLAM, in particular for solving the data association problem.
Landmark bearings can be derived from a panoramic image taken by an omni-directional
vision sensor (for example, a single camera aiming at a catadioptric mirror). A panoramic
image offers a 360° view of the environment. Because of the robustness of bearing estimates
and the complete view of the environment, previous works have utilized omni-directional
vision sensors in robotic navigation (Rizzi & Cassinis, 2001; Usher et al., 2003; Menegatti et
al., 2004; Huang et al., 2005b).
Stereo vision is another option used in robotic navigation. In addition to the bearing
information, a stereo vision sensor can also measure the depth to a landmark (Murray &
Jennings, 1998; Se et al., 2002; Sabe et al., 2004). A typical stereo vision sensor consists of two
cameras, also known as a Binocular Vision. The disparity of the images taken from two
slightly different viewpoints determines the landmark's range through triangulation. A
Baseline in stereo vision is a line segment connecting the centres of the two camera lenses. Some stereo vision systems consist of three cameras; they are called Trinocular Visions. A common
configuration of a trinocular vision is to put the three cameras on a right-angled polygonal line. A
trinocular vision can achieve better results than a binocular vision because the second pair
of cameras can resolve situations that are ambiguous to the first pair of cameras (Se et al.,
2002; Wooden, 2006).
The length of the baseline is essential in stereo vision, because it affects the precision of
depth estimation and the exterior design of robotic hardware. QRIO (Sabe et al., 2004) is a humanoid robot with a 5cm baseline in its binocular vision. The error of its depth measurement at a distance of 1.5m is over 80mm, and depth estimates of objects at distances of 2m or more are omitted. LAGR (Wooden, 2006) is an outdoor robot equipped with a Point Grey Bumblebee trinocular vision system. A maximum vision range of 6m was
reported with a baseline of 12cm. To maximize the vision range of a stereo vision sensor, a
longer baseline is required. Based on the mobility of a mobile robot, it is possible to extend
the distance of any two viewpoints of a single camera (called a Monocular Vision). If a robot
can move straight, the estimation of a landmark range from a monocular vision will be the
same as the estimation in a stereo vision. Such approach was first proposed by Huang et al.
(2005a). In this paper, a localization method with two observed bearings along a linear
trajectory was presented. The method is particularly useful and accurate if the robot can
move straight, i.e., when the robot's yaw is directed toward a specific landmark.
In computer vision, Structure From Motion (SFM) refers to the process of building a 3D map
of a static environment from the video input from a moving camera. This is very similar to
stereo vision where a 3D map is built from two simultaneous images of the same landmark.
In both cases, the same landmark is taken into multiple images and the disparities of images
are used to compute the landmark location. In stereo vision, the images are taken at
different viewpoints simultaneously. In SFM, due to the robot's motion, the images are
taken at different viewpoints at different time steps. Visual odometry (Nister et al., 2004)
employs SFM to estimate the motion of a stereo head or a single moving camera based on
video data. The front end of this system is a feature tracker. Point features are matched
between pairs of frames and linked into image trajectories at video rate. SFM presents
significant advantages compared with a stereo vision due to the low cost of a monocular
vision and the flexible baseline. However, SFM can build a map of a static environment only, because the images are obtained at different time steps.
Goncavles et al. (2005) presented a framework to bearing-only SLAM based on SFM from
three observations. They utilized a wall corner as the landmark for guiding the robot to
move straight. Three images were taken while the robot was moving toward the wall
corner. Each image was taken when the robot had moved approximately 20cm. A similar work (Jensfelt et al., 2006) used N images for landmark initialization, where N is a
sufficient number to obtain an accurate estimation. To ensure a proper triangulation, the
images were discarded if the robot had not moved more than 3cm (i.e., baseline under 3cm)
or turned more than 1 degree (i.e., not a straight movement). Both of the approaches solve
the bearing-only SLAM problem using a monocular vision. However, they require an
odometer to determine the robot's motion. Our method to bearing-only SLAM is similar to SFM
with a monocular vision, but does not rely on odometric information.
hypotheses distributed along the direction of the bearing measurement. The validity of each
hypothesis was then evaluated based on the Sequential Probability Ratio Test (SPRT). Sola et
al. (2005) gave a new insight to the problem and presented a method by initializing the
whole vision cone (see Figure 4(a)) that characterizes the direction of the landmark. This
cone is covered by a sequence of ellipses that represent the likelihood of the landmark.
Undelayed method of landmark initialization is efficient to identify the directions of all
landmarks when the first bearing measurements are made. It does not state explicitly the
locations of the landmarks. Further observations are still required to initialize the landmark
positions. Lemaire et al. (2005) applied an undelayed initialization method to a 3D bearing-only SLAM. The landmark initialization is similar to the method proposed in (Kwok &
Dissanayake, 2004) by maintaining a mixture of Gaussians. The updating process was done
by comparing the likelihoods of subsequent observations. If the likelihood falls below a
certain threshold then the Gaussian is removed. Once only a single Gaussian is left in the
cone, the landmark is initialized and added into the map for EKF-SLAM.
2.3 Our approaches
This chapter presents two methods to compute the spatial uncertainties of the objects based
solely on bearing measurements only: namely a geometric method and a probabilistic
method. These methods are similar to the approach of the undelayed method of landmark
initialization. Since the estimate based on a partial observation (known bearing but
unknown range) is insufficient to completely determine a landmark position, a second
observation from a vantage position is required to generate an explicit estimate.
In the geometric method, we manipulate directly each vision cone as a polyhedron instead
of a sequence of Gaussians. Each cone contains a landmark position. After a second
observation in a linear trajectory, the uncertainty region (the set of possible landmark
positions that are consistent with the first and second observations) of the landmark
becomes the intersection of two cones rooted at the two observation points, see Figure 4(b).
Depending on the difference of bearings, the intersection is either a quadrangle (a four-sided
polygon) or an unbounded polyhedron. For each estimation, we change the bases from the
local frame (the robot-based frame, denoted by FR) into the global frame (the landmark-based frame, denoted by FL). The uncertainties of all objects are re-computed with respect to
FL by the change of bases. A global map with the estimated positions of all objects and their
associated uncertainties can be gradually built while the robot explores its environment.
In the probabilistic method, a landmark position is represented by a PDF p(r, θ) in polar
coordinates where r and θ are independent. Formulas are derived for computing the PDF
of the landmark position when an initial observation is made. The updating of the PDF with the
subsequent observations can be done by computing directly from the formulas. We select a
number of sample points in the uncertainty region (computed from the geometric method)
by the rejection method (Leydold, 1998). These sample points are used to represent the PDF
in FR . By changing the bases from FR to FL , the PDFs of all object positions in the global
frame FL can also be computed.
Without range measurement, we assume the probability density of an object location is
constant along the range. It is a more realistic assumption than the one made by other
existing methods (Davison, 2003; Davison et al., 2004; Kwok & Dissanayake, 2004; Sola et al.,
2005) which assume that the probability density of the object location is a Gaussian or a
mixture of Gaussians. Indeed, if only bearing information is given, the probability that the
landmark is between 4 and 5 metres should be the same as the probability that the landmark
is between 5 and 6 metres. A representation with a Gaussian or a mixture of Gaussians
fails to satisfy this constraint. With our PDF representation, the probability that the landmark is
between 4 and 5 metres will be the same as the probability that the landmark is between 5
and 6 metres.
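As a quick check, using the range density f(r) = 1/((Rmax − Rmin) r) derived in Section 5.1: with the polar area element r dr dθ, the probability mass assigned to any range interval [a, b] is ∫_a^b f(r) r dr = (b − a)/(Rmax − Rmin), which depends only on the interval length b − a. The intervals [4, 5] and [5, 6] metres therefore receive exactly the same mass, whereas a density that is Gaussian in r necessarily favours the interval closer to its mean.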
3. A direct localization method using only the bearings extracted from two
panoramic views along a linear trajectory
In this section, we describe a direct method (in the sense it does not use an iterative search)
based solely on vision for localizing a mobile robot in a 2-dimensional space. This method
relies only on the bearings derived from two images taken along a linear trajectory. We only
assume that the robot can visually identify landmarks and measure their bearings. This
method does not require any other sensors (like range sensors or wheel encoders) or the
prior knowledge of relative distances among the objects. This method can be adopted in a
localization system which utilizes only a single camera as the sensor for navigation. Given
its low cost, such a system is well suited for domestic robots such as autonomous lawnmowers and vacuum cleaners.
3.1 Method description
In order to describe our method we need to introduce some notation. The robot position at
i th observation point is denoted by O i . The position of j th landmark is denoted by L j . The
notation θij denotes the bearing measurement at Oi with respect to Lj. The line going
through two points x1 and x2 will be denoted by ( x1 , x 2 ) . This section shows how to
compute the Cartesian coordinates of landmarks and the robot positions from the bearings
measured at O1 and O2 relative to L1 and L2. We consider two right-handed coordinate
systems, the robot-based frame denoted by FR , and the landmark-based frame denoted by
FL . In Figure 1(a), the coordinates of O1 and O 2 in FR are respectively [0,0]' and [1,0]' .
Similarly, in FL (see Figure 1(b)), the coordinates of L1 and L 2 are respectively [0,0]' and
[1,0]' . The frame FL is a global frame since all the landmarks are assumed static in the
environment. The distance || L1 L2 || is taken as the measurement unit for the localization
system.
While the robot moves in a linear trajectory, two images are taken at O1 and O 2 respectively.
The landmark bearings are derived from these two images. The position L j in FR is
computed as the intersection of the two lines (O1, Lj) and (O2, Lj). The equations of these two
lines can be obtained from the bearings θ1j and θ2j, and the coordinates of O1 and O2 in FR.
Once L j is available in FR , we can determine the affine transformation that relates the
coordinates X L (x ) and X R (x ) of a point x in the two coordinate systems FL and FR . That
is, an expression of the form X_L(x) = a * X_R(x) + b, where a is a matrix and b is a vector. The
coordinates of O1 and O 2 in FL are then easily derived.
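The following Python sketch illustrates the computation described above; the function and variable names are ours, and the bearings are assumed to be measured counterclockwise from the direction of motion O1 to O2 (the x-axis of FR).

import numpy as np

def localize_from_two_views(th11, th12, th21, th22):
    # th_ij: bearing (radians) measured at O_i towards landmark L_j.
    # O1 = [0,0]' and O2 = [1,0]' in the robot-based frame FR
    # (the unit of length is the displacement ||O1 O2||).
    def ray_intersection(o1, a1, o2, a2):
        # Intersection of the line through o1 with direction angle a1
        # and the line through o2 with direction angle a2.
        d1 = np.array([np.cos(a1), np.sin(a1)])
        d2 = np.array([np.cos(a2), np.sin(a2)])
        A = np.column_stack([d1, -d2])
        t = np.linalg.solve(A, o2 - o1)
        return o1 + t[0] * d1

    O1, O2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
    L1_R = ray_intersection(O1, th11, O2, th21)   # L1 expressed in FR
    L2_R = ray_intersection(O1, th12, O2, th22)   # L2 expressed in FR

    # Affine change of bases X_L(x) = a * X_R(x) + b that sends
    # L1 -> [0,0]' and L2 -> [1,0]' (both frames right-handed).
    u = L2_R - L1_R
    a = np.linalg.inv(np.column_stack([u, [-u[1], u[0]]]))
    b = -a @ L1_R
    to_FL = lambda x: a @ x + b
    return to_FL(O1), to_FL(O2)                   # robot positions in FL

The matrix a is the inverse of the matrix whose columns are the FL axes expressed in FR (the vector from L1 to L2 and its 90-degree rotation), so the returned coordinates are expressed in units of ||L1 L2||, as in the text.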
Figure 1. From two landmark bearings observed at points O1 and O 2 , the coordinates of
L1 and L 2 in FR are computed. Then a simple change of bases gives the positions of O1 and
O 2 in the global frame FL
Experiments in simulation and on a real robot (see Section 3.2) indicate that the accuracy of
this localization system is sensitive to the relative difference of bearings. Let e be the
estimated error of the landmark position. Figure 2 shows that d0 = d sin(θ22) = e sin(Δθ), where
Δθ = θ22 − θ12, d0 is the distance between L2 and the line (O2', L2'), and d = ||O2 O2'||. We have

e = d sin(θ22) / sin(θ22 − θ12)

That is, e is proportional to the landmark range d and to the inverse of the relative
difference in bearings. Assuming the landmark range is fixed and the bearing angles are
small, the ratio e/d is approximately θ22 / (θ22 − θ12). This implies
that a large relative change in bearings should give a more accurate position estimate.
3.2 Empirical Evaluation
Our localization method was evaluated on a Khepera robot equipped with a color camera
(176 x 255 resolution). The average error between the measured and actual bearings is about
2 degrees. In this experiment, the second landmark was placed 20 centimetres apart from
the first landmark. Four different starting points were used, and 20 trials at each point were
conducted. The moving distance in all cases was 30 centimetres. The moving directions were
westwards parallel to the landmarks. The results are shown in Figure 3. In this figure,
landmarks are denoted by stars, trajectories are shown as arrows, and the estimated
positions by our localization method are displayed as scatter points.
The localization errors (the average distance between the estimated and actual
positions) at positions a, b, c, and d (in Figure 3) are respectively 0.6, 1.2, 2.2, and 2.8
centimetres. The errors are small compared to the diameter of the robot (6 centimetres).
Other experimental results have confirmed that the error is inversely proportional to the
relative difference in bearings.
Figure 3. Estimated positions of the robot determined by the proposed localization method
When more than two landmarks are present, the localization accuracy can be further
improved by fusing the estimated positions, giving more importance to the estimation
returned by the pair of landmarks that has a larger relative difference in landmark bearings.
In summary, we have introduced a novel effective approach for robot self-localization using
only the bearings of two landmarks. This technique can be viewed as a form of stereo-vision.
The method we propose is well suited for real-time systems as it requires very little
computation.
When more than two landmarks exist in the environment, the robot can determine the
relative positions of the landmarks provided some weak visibility constraints are satisfied.
Indeed, suppose there are two pairs of landmarks {L1 , L 2 } and {L 3 , L 4 } visible from a
segment O1O2 (notice that {L1 , L 2 } and {L3 , L 4 } do not have to be in direct line of sight).
Then using three different bases, the first one B0 attached to O1O2 , the second one
B1, 2 attached to {L1 , L2 } , and the third one B3, 4 attached to {L3 , L 4 } , we can determine the
change of basis matrices M_{B0, B1,2} and M_{B0, B3,4}. The product M_{B1,2, B0} M_{B0, B3,4} allows us to
compute the positions of the pairs of landmarks {L1 , L 2 } and {L3 , L 4 } relatively to each
other.
This method enables a mobile robot to localize itself with only two observed bearings of two
landmarks. Such a localization system will be invaluable to an indoor robot as well, since the
bearings of the sides of a door frame can play the roles of the landmarks L1 and L2 and tell
the robot exactly where it stands relative to the door. In the next section, we employ this method
to solve the landmark initialization problem in bearing-only SLAM.
observation point Oi, a landmark position Lj is contained in the vision cone which is formed
by two rays rooted at Oi. The first ray is defined by Oi and the bearing θij + ε; the second
ray is defined by Oi and the bearing θij − ε. Figure 4(a) shows the vision cones in the robot-based frame FR based on the reading of the landmark bearings from O1.
After reading the bearing measurements from both O1 and O2, the uncertainty region A_Lj
becomes the intersection of two cones rooted at O1 and O2 respectively. Figure 4(b) shows
that a typical intersection is a 4-sided polygon. If the cones are almost parallel, their
intersection can be an unbounded polyhedron.
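As an illustration of the geometric method, the sketch below (our own naming, not the authors' implementation) clips a large bounding square against the four half-planes that bound the two vision cones; the result is the polygon A_Lj, or a truncated version of it when the true intersection is unbounded.

import numpy as np

def clip_halfplane(poly, p, n):
    # Keep the part of polygon 'poly' (list of 2D points) lying on the side
    # of the line through p where n . (x - p) >= 0 (Sutherland-Hodgman step).
    out = []
    for i in range(len(poly)):
        a, b = poly[i], poly[(i + 1) % len(poly)]
        da, db = np.dot(n, a - p), np.dot(n, b - p)
        if da >= 0:
            out.append(a)
        if (da >= 0) != (db >= 0):            # the edge crosses the boundary
            t = da / (da - db)
            out.append(a + t * (b - a))
    return out

def cone_halfplanes(o, bearing, eps):
    # The vision cone at o is bounded by the rays at bearings (bearing - eps)
    # and (bearing + eps); return the two inward normals of those rays.
    planes = []
    for a, sign in [(bearing - eps, +1), (bearing + eps, -1)]:
        d = np.array([np.cos(a), np.sin(a)])
        planes.append((o, sign * np.array([-d[1], d[0]])))
    return planes

def uncertainty_region(o1, th1, o2, th2, eps, bound=100.0):
    # Start from a large square (standing in for an unbounded region) and
    # clip it by the four half-planes of the two cones.
    poly = [np.array(v, float) for v in
            [(-bound, -bound), (bound, -bound), (bound, bound), (-bound, bound)]]
    for p, n in cone_halfplanes(np.asarray(o1, float), th1, eps) + \
               cone_halfplanes(np.asarray(o2, float), th2, eps):
        poly = clip_halfplane(poly, p, n)
    return poly   # vertices of A_Lj (possibly truncated by the bounding square)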
The spatial relationships in Figure 4(b) are expressed in FR . Since the robot is moving over
time, the base of FR changes too. Therefore, it is necessary to change coordinate systems to
express all positions in the global frame FL . Figure 5 illustrates the difficulty of expressing
the robot-centred information in the global frame FL. The uncertainty on the landmarks
prevents us from directly applying a change of bases. In the next section, we will show how to
solve this problem.
Figure 4. (a) The vision cones rooted at O1 contain the landmarks. Each cone represents the
unbounded uncertainty of the associated landmark. The diagram is drawn with respect to
the robot-based frame FR. (b) The intersections of the vision cones form the uncertainty
regions A_Lj
Figure 5. When L1 and L 2 are not certain, a simple change of bases does not induce correct
uncertainty regions of L3 and L 4
Figure 6. (b) After the change of frames, the uncertainty regions of O1 and O2 are denoted by
A_O1 and A_O2. We obtain O2^k from Equation (4) with respect to L2^k, k = 1...4
The uncertainty regions of O1 and O2 with respect to FL can be obtained by considering all
possible hypotheses for the location of L 2 consistent with the observations. That is, we
consider the set of possible coordinate vectors X R (L2 ) of L 2 in FR . For each hypothesis
X_R(L2) = h2, a standard change of bases returns the coordinates X_L(O1) and X_L(O2) of
O1 and O2. The sets of these coordinates over all hypotheses create the polyhedra A_O1 and
A_O2 of uncertainty regions with respect to FL (see Figure 6(b)).
In the general case, when uncertainty exists for both L1 and L 2 , to transfer the information
from FR to FL , we consider simultaneously all the possible locations of L1 and L 2 consistent
with the observations. We hypothesize

X_R(L1) = h1, and X_R(L2) = h2    (1)

Let Φ_{h1,h2} be the affine transformation function for changing frames from FR to FL. That is,

X_L(L1) = Φ_{h1,h2}(X_R(L1)) = [0,0]'    (2)

X_L(L2) = Φ_{h1,h2}(X_R(L2)) = [1,0]'    (3)

The above constraints completely characterize Φ_{h1,h2}. For any point x, the coordinates of x
in FL under the hypothesis (h1, h2) are

X_L(x) = Φ_{h1,h2}(X_R(x))    (4)

Considering all hypotheses consistent with the observations, the uncertainty region of Oi in FL is

X_L(A_Oi) = ∪_{h1 ∈ A_L1, h2 ∈ A_L2} Φ_{h1,h2}(X_R(Oi))    (5)
Figure 7. The uncertainty regions in FL are derived from the uncertainty regions in FR (see
Figure 5). The centroids of the uncertainty regions are used to represent the estimated
positions of different objects. The areas of the polyhedra quantify how uncertain the
estimates are
The uncertainty regions A_Lj for L3 and L4 in FL are computed similarly,

X_L(A_Lj) = ∪_{h1 ∈ A_L1, h2 ∈ A_L2} Φ_{h1,h2}(X_R(A_Lj))    (6)
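Equations (5) and (6) can be evaluated directly with the four extreme points of A_L1 and A_L2, as the following sketch (with our own helper names) illustrates; taking the convex hull of the resulting image points as the polyhedral approximation is an assumption consistent with Figure 7.

import numpy as np
from itertools import product

def change_of_frame(h1, h2):
    # Affine map Phi_{h1,h2} of Equation (4): it sends the hypothesized
    # landmark coordinates h1 -> [0,0]' and h2 -> [1,0]' (frame FR to FL).
    u = h2 - h1
    M = np.column_stack([u, [-u[1], u[0]]])     # FL axes expressed in FR
    Minv = np.linalg.inv(M)
    return lambda x: Minv @ (np.asarray(x, float) - h1)

def region_in_FL(AL1, AL2, XR_points):
    # Image of a set of FR points (e.g. the vertices of A_Oi or A_Lj)
    # under all vertex hypotheses h1 in A_L1, h2 in A_L2, as in (5)-(6).
    images = []
    for h1, h2 in product(AL1, AL2):
        phi = change_of_frame(np.asarray(h1, float), np.asarray(h2, float))
        images.extend(phi(x) for x in XR_points)
    return np.array(images)      # its convex hull approximates X_L(A)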
To carry out the computation in this example, we take the four vertices (the extreme points) L1^k and L2^k
(k = 1...4) from A_L1 and A_L2. Figure 7 shows the estimated uncertainties of O1, O2, L3 and L4;
the polyhedron X_L(A_Lj) approximates the set of all consistent points for Lj, although the exact
uncertainty region is not necessarily a polyhedron. We have tested the proposed method both in simulation and on a real robot.
These results are presented in the next section.
4.3 Simulation
We tested the proposed method in simulation in an environment with four landmarks (at
positions unknown to the localization system). The robot moves in a polygonal line around
the centre with some randomness. Since we focus on landmark initialization, Figure 8 shows
only the estimated positions of the landmarks.
Figure 8. The uncertainty regions A_L of the landmarks, with the true and estimated landmark
positions and the observation points O1 and O2, over successive pairs of observations
When a new pair of bearing measurements is available after the robot has moved again, new regions
A_L3 and A_L4 are obtained in the same manner. The estimated positions from all movements are
unifiable since they are expressed with respect to the same frame FL. Figure 8 shows how the
uncertainty regions are refined after several movements. The polyhedra A_L3 and A_L4 shrink
gradually. A global map with the estimated positions of all objects can thus be built gradually.
Figure 9. (a) Estimated landmark positions after 10 pairs of observations. (b) The
uncertainties of the third and fourth landmarks
The uncertainties of L3 and L4 decrease rapidly in the first few observations and do not
change much after the third observation, as shown in the top chart of Figure 9(b). The bottom
chart of Figure 9(b) shows that the estimated errors of L3 and L4 are 2 centimetres and 3
centimetres respectively. The measurement unit in this chart equals 20 centimetres.
We carried out another experiment to study the sensitivity to the vision error ε. The top chart
of Figure 10 shows the relationship between ε and the uncertainty of L3. The amount of
the vision error ε was varied from 2 to 7 degrees. The uncertainties grow with ε
in a linear manner. The bottom chart of Figure 10 shows that the estimated error might not
decrease monotonically. This is because we assign the centroid of the uncertainty region as
the estimated landmark position.
In this section, we introduced a method for analyzing how uncertainty propagates when
information is transferred from one observer attached to a robot-based frame to an observer
attached to a landmark-based frame. The accuracy of this method was demonstrated both in
simulation and on a real robot. In the next section we will employ a probabilistic method to
compute the uncertainties of object positions.
Figure 10. Uncertainties and estimated errors at different amounts of the vision error ε, in
degrees
r is independent of θ. Formulas are derived for computing the PDF (Probability Density
Function) when an initial observation is made. The updating of the PDF when further
observations arrive is explained in Section 5.2.A.
5.1 Method description
Let p(r, θ) be the PDF of the landmark position in polar coordinates when only one
observation has been made. We characterize p(r, θ) when r and θ are independent. Let θ̄
denote the measured landmark bearing. Assume that the error range for the bearing is
±ε. The landmark position is contained in the vision cone which is formed by two rays
rooted at the observation point with respect to the two bearings θ̄ − ε and θ̄ + ε (see Figure 11).
Figure 11. The vision cone is rooted at the observation point. The surface of the hashed area
is approximately r dr dθ for small dθ and dr
The surface of the hashed area in Figure 11 for small dr and dθ can be computed as

[(r + dr)² − r²] dθ/2 = (1/2)[2 r dr + (dr)²] dθ ≈ r dr dθ
Because the probability of the landmark being in the vision cone is 1, we have
∫_{θ̄−ε}^{θ̄+ε} ∫_{Rmin}^{Rmax} p(r, θ) r dr dθ = 1    (7)

In Equation (7), Rmax and Rmin are the bounds of the vision range interval. We define F(R) as
the probability of the landmark being in the area {(r, θ) | r ∈ [Rmin, R], θ ∈ [θ̄ − ε, θ̄ + ε]},

F(R) = ∫_{θ̄−ε}^{θ̄+ε} ∫_{Rmin}^{R} p(r, θ) r dr dθ    (8)
We define δ(R, Δ) as the probability of the landmark being in the dotted area in Figure 11.
Since δ(R, Δ) = F(R + Δ) − F(R), we have

δ(R, Δ) = ∫_{θ̄−ε}^{θ̄+ε} ∫_{R}^{R+Δ} p(r, θ) r dr dθ    (9)

If the range r and the angle θ are independent, then δ(R, Δ) is constant with respect
to R. That is, ∂δ(R, Δ)/∂R = 0. From Equation (9), we derive

∂F(R + Δ)/∂R = ∂F(R)/∂R    (10)

Independence also allows the PDF to be factored as

p(r, θ) = f(r) g(θ),  with  ∫_{θ̄−ε}^{θ̄+ε} g(θ) dθ = 1    (11)

so that F(R) = ∫_{Rmin}^{R} f(r) r dr and

∂F(R)/∂R = f(R) R    (12)

Equation (10) then gives f(R + Δ)(R + Δ) = f(R) R. Therefore,

[f(R + Δ) − f(R)] R / Δ + f(R + Δ) = 0

By making Δ go to zero, we obtain R f'(R) + f(R) = 0, whose solution satisfies

log(f(R)) = −log(R) + c = log(1/R) + c = log(e^c / R)

where c is a constant. Letting α = e^c, we obtain f(R) = α / R. The normalization
∫_{Rmin}^{Rmax} f(r) r dr = 1 then yields α = 1/(Rmax − Rmin), hence

f(r) = 1 / [(Rmax − Rmin) r]  and  p(r, θ) = g(θ) / [(Rmax − Rmin) r]    (13)
If we use a Gaussian function for g(θ) with mean θ̄ and standard deviation σ, the PDF
p(r, θ) can be re-written as Equation (14). Figure 12 shows the resulting PDF p(r, θ).

p(r, θ) = 1 / [(Rmax − Rmin) r] · 1 / (σ √(2π)) · exp( −(θ − θ̄)² / (2σ²) )    (14)
Figure 12. The PDF of the landmark position following Equation (14)
5.2 Utilization of the PDF for bearing-only SLAM
We illustrate the application of the PDF for our bearing-only SLAM system. Section 5.2-A
describes how the PDF can be updated with a second observation. In Section 5.2-B, we
present experimental results on a real robot.
A. Updating the PDF with a second observation
When a second observation is made after a linear motion, the landmark position falls in the
uncertainty region which is the intersection of the two vision cones rooted at the first and the
second observation points O1 and O2. We denote by p1 and p2 the PDFs of the landmark
positions computed from Equation (14) with respect to O1 and O2 respectively. Let p
denote the PDF of the landmark position after fusing the sensory readings from O1 and O2.
From the work of Stroupe et al. (2000), the fused PDF p is proportional to the product p1 p2.
In order to approximate p by a Gaussian distribution q, we generate a set S according to the PDF p
by the Rejection Method (Leydold, 1998). We determine the maximum probability density p_max
of p by computing p1 p2 at the intersection of the two measured bearings. The sampling process
selects uniformly a sample point s and a random number η ∈ [0, 1]. If p(s)/p_max < η, s is rejected;
otherwise s is accepted and added to S. The sampling process is repeated until enough points are
accepted. Figure 13 shows the generated samples in the uncertainty regions of four landmarks.
The mean x̄ and the covariance matrix C of q are obtained by computing the mean and the
covariance matrix of S, as previously done by Smith & Cheeseman (1986) and Stroupe et al.
(2000). In Figure 13, the contour plots present the PDFs of the landmark positions.
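A minimal sketch of this sampling and fitting step is given below (the function names are ours); it assumes p1 and p2 are callables implementing Equation (14) at a 2D point expressed in Cartesian coordinates of FR, and, for simplicity, it bounds p_max by the largest product found at the region vertices rather than at the exact bearing intersection.

import numpy as np

def sample_fused_pdf(p1, p2, region_poly, n_samples=500, rng=None):
    # Rejection sampling of the fused PDF p = p1*p2 over the uncertainty
    # region, followed by the Gaussian approximation q (mean and covariance).
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(region_poly, float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)     # bounding box of the region

    p = lambda s: p1(s) * p2(s)
    p_max = max(p(v) for v in pts)                # simplified bound on p_max

    S = []
    while len(S) < n_samples:
        s = lo + rng.random(2) * (hi - lo)        # uniform candidate point
        if rng.random() < p(s) / p_max:           # accept with prob p(s)/p_max
            S.append(s)
    S = np.array(S)
    mean = S.mean(axis=0)                         # mean of q
    cov = np.cov(S.T)                             # covariance matrix C of q
    return S, mean, cov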
The estimated PDFs in Figure 13 are expressed in the robot-based frame FR . Since the robot
is moving over time, its frame changes too. Therefore, it is necessary to change the
coordinate systems to express all the estimations in the global frame FL . We use the method
introduced in Section 4 to transfer the samples in S from FR to FL . After the change of
frames, the uncertainties of L1 and L 2 are transferred to other objects. The samples of other
objects are taken to approximate the PDFs of the object positions in FL . Figure 14 shows the
distribution of the samples after the change of frames. The contour plots present the PDFs of
the object positions in the global frame FL associated to the points (L1 , L2 ) .
Figure 13. The PDFs and the contour plots of four landmarks in the robot-based frame FR ; in
this example, the uncertainty region of each landmark is a bounded polygon. The generated
samples are distributed in the uncertainty regions
Figure 14. After the change of frames from FR to FL, the PDFs and the contour plots of O1, O2, L3 and L4
Figure 15. Experimental setup, the landmarks are the vertical tubes
Figure 16. (a) Diagram of the experimental setup. (b) The errors of the estimated landmark
positions
We made another experiment to test the sensitivity of the errors of the landmark positions
with respect to different directions of the robot's moving trajectories. We let the robot
move in four different directions with respect to three landmarks. In Figure 17(a), stars
denote the landmark positions and arrows denote the moving trajectories. The robot
repeated 10 iterations for each trajectory.
The errors on L3 in four trajectories after the tenth iteration were 2.12, 1.17, 1.51, and 13.99
centimetres respectively. The error of the fourth trajectory is large because the robot moves
along a line that is close to L3 . Therefore, the vision cones at the first and the second
observations are nearly identical.
The estimation of the landmark position is more accurate when the intersection of two
vision cones is small. This is the case of the second trajectory where the intersection is the
smallest among all trajectories.
Figure 17. (a) Trajectories of the robot for the experiment to study the relationship between
moving directions and estimated errors of landmark positions. (b) The errors on L3 at each
iteration
Although the intersecting area of L3 for the first and the third trajectories is the same, the
intersecting areas of L1 and L2 for the first trajectory are much bigger than those from
the third trajectory. This is why the estimated error from the third trajectory is
smaller than the one for the first trajectory.
6. Conclusion
In this chapter, we proposed a vision-based approach to bearing-only SLAM in a 2-dimensional space. We assumed the environment contained several visually distinguishable
landmarks. This approach is inspired by techniques used in stereo vision and Structure
From Motion. Our landmark initialization method relies solely on the bearing
measurements from a single camera. This method does not require information from an
odometer or a range sensor. All the object positions can be estimated in a landmark-based
frame. The trade-off is that this method requires the robot to be able to move in a straight
line for a short while to initialize the landmarks. The proposed method is particularly
accurate and useful when the robot can guide itself in a straight line by visually locking on
static objects.
Since the method does not rely on odometry and range information, the induced map is up
to a scale factor only. In our method, the distance || L1 L 2 || of two landmarks is taken as
the measurement unit of the map. The selection of L1 and L 2 is critical for the accuracy of
the map. In Section 3.1, the mathematical derivation shows that the estimated error of a
landmark position is proportional to the range of the landmark and the inverse of the
relative change in landmark bearings. Choosing L1 and L 2 with larger change in bearings
produces a more accurate mapping of the environment.
In the sensitivity analysis, we showed how the uncertainties of the objects' positions are
affected by a change of frames. We determined how an observer attached to a landmark-based frame FL can deduce the uncertainties in FL from the uncertainties transmitted by an
observer attached to the robot-based frame FR. Each estimate of landmark uncertainties
requires a pair of observations in a straight movement. The simulation in Section 4.3
shows how the uncertainties of landmark positions are refined when the robot moves in a
polygonal line.
With dead reckoning, the error of the estimated robot's location increases with time because
of accumulated odometric errors. In our method, we set O1 and O2 (a pair of observation
points in a straight movement) at [0,0]' and [1,0]' in FR. There is no dead-reckoning error
on O1 and O2 by construction. In practice, the robot's movement may not be perfectly
straight. However, the non-straight nature of the trajectory can be compensated by
increasing the size of the confidence interval of the bearing.
The induced map created by our method can be refined with EKF or PF algorithms. With
EKF, the uncertainty region computed from the geometric method can be translated into a
Gaussian PDF. With PF, the weights of the samples can be computed with the formulas
derived in Section 5.1. Since the samples are drawn from the uncertainty region, the
number of samples is minimized.
The accuracy of our method was evaluated with simulations and experiments on a real
robot. Experimental results demonstrate the usefulness of this approach for a bearing-only
SLAM system. We are currently working on the unknown data association when all
landmarks are visually identical. In future work, we will deal with the problems of object
occlusion and non-planar environments. That is, the system will be extended from a 2-dimensional to a 3-dimensional space.
7. References
Bailey, T. (2003). Constrained Initialisation for Bearing-Only SLAM, Proceedings of IEEE
International Conference on Robotics and Automation, pp. 1966 - 1971, 1050-4729
Bailey, T. & Durrant-Whyte, H. (2006). Simultaneous Localization and Mapping (SLAM):
Part II, IEEE Robotics and Automation Magazine, page numbers (108-117), 1070-9932
Costa, A.; Kantor, G. & Choset, H. (2004). Bearing-only Landmark Initialization with
Unknown Data Association, Proceedings of IEEE International Conference on Robotics
and Automation, pp. 1764 - 1770
Davison, A. (2003). Real-time simultaneous localization and mapping with a single camera,
Proceedings of International Conference on Computer Vision, pp. 1403-1410, Nice,
October
Davison, A.; Cid, Y. & Kita, N. (2004). Real-time 3D SLAM with wide-angle vision,
Proceedings of IAV2004 - 5th IFAC/EURON Symposium on Intelligent Autonomous
Vehicles, Lisboa, Portugal, July
Fox, D.; Burgard, W.; Dellaert, F. & Thrun, S. (1999). Monte Carlo localization: Efficient
position estimation for mobile robots, Proceedings of National Conference on Artificial
Intelligence, pp. 343-349
Goncavles, L.; Bernardo, E. d.; Benson, D.; Svedman, M.; Karlsson, N.; Ostrovski, J. &
Pirjanian, P. (2005). A visual front-end for simultaneous localization and mapping,
Proceedings of IEEE International Conference on Robotics and Automation, pp. 44-49
Huang, H.; Maire, F. & Keeratipranon, N. (2005a). A Direct Localization Method Using only
the Bearings Extracted from Two Panoramic Views Along a Linear Trajectory,
Proceedings of Autonomous Minirobots for Research and Edutainment (AMiRE), pp. 201-206, Fukui, Japan
Huang, H.; Maire, F. & Keeratipranon, N. (2005b). Uncertainty Analysis of a Landmark
Initialization Method for Simultaneous Localization and Mapping, Proceedings of
Australian Conference on Robotics and Automation, Sydney, Australia
Jensfelt, P.; Kragic, D.; Folkesson, J. & Bjorkman, M. (2006). A Framework for Vision Based
Bearing Only 3D SLAM, Proceedings of IEEE International Conference on Robotics and
Automation, pp. 1944-1950, 0-7803-9505-0, Orlando, FL
Kwok, N. M. & Dissanayake, G. (2004). An Efficient Multiple Hypothesis Filter for Bearing-Only SLAM, Proceedings of IEEE International Conference on Intelligent Robots and
Systems, pp. 736-741, 0-7803-8463-6
Lemaire, T.; Lacroix, S. & Sola, J. (2005). A practical 3D Bearing-Only SLAM algorithm,
Proceedings of IEEE International Conference on Intelligent Robots and Systems, pp.
2757-2762
Leonard, J. J. & Durrant-Whyte, H. F. (1991). Simultaneous localization for an autonomous
mobile robot, Proceedings of IEEE Intelligent Robots and System, pp. 1442-1447, Osaka,
Japan
Levitt, T. S. & Lawton, D. M. (1990). Qualitative navigation for mobile robots, Artificial
Intelligence, Vol. 44, No. 3, page numbers (305-360)
Leydold, J. (1998). A Rejection Technique for Sampling from log-Concave Multivariate
Distributions, Modelling and Computer Simulation, Vol. 8, No. 3, page numbers (254-280)
Menegatti, E.; Zoccarato, M.; Pagello, E. & Ishiguro, H. (2004). Image-based Monte Carlo
localisation with omnidirectional images, Robotics and Autonomous Systems, Vol. 48,
No. 1, page numbers (17-30)
Montemerlo, M.; Thrun, S.; Koller, D. & Wegbreit, B. (2003). FastSLAM 2.0: An improved
particle filtering algorithm for simultaneous localization and mapping that
provably converges, Proceedings of International Joint Conferences on Artificial
Intelligence, pp. 1151-1156, Morgan Kaufmann, IJCAI
Mouragnon, E.; Lhuillier, M.; Dhome, M.; Dekeyser, F. & Sayd, P. (2006). 3D Reconstruction
of complex structures with bundle adjustment: an incremental approach,
Proceedings of IEEE International Conference on Robotics and Automation, pp. 3055 3061, Orlando, USA
Murray, D. & Jennings, C. (1998). Stereo vision based mapping and navigation for mobile
robots., Proceedings of IEEE International Conference on Robotics and Automation, pp.
1694 - 1699, New Mexico
Nister, D.; Naroditsky, O. & Bergen, J. (2004). Visual Odometry, Proceedings of IEEE Computer
Society Conference, pp. I-652 - I-659, Princeton
20
Object Recognition for Obstacles-free
Trajectories Applied to Navigation Control
W. Medina-Meléndez, L. Fermín, J. Cappelletto, P. Estévez,
G. Fernández-López and J. C. Grieco
Universidad Simón Bolívar, Grupo de Mecatrónica, Valle de Sartenejas, Caracas
Venezuela
1. Introduction
In this chapter, applications of image and video processing to the navigation of mobile robots
are presented. During the last years some impressive real-time applications have been
shown to the world, such as the NASA missions to explore the surface of Mars with
autonomous vehicles; in those missions, video and image processing played an important
role in guiding the vehicles.
Algorithms based on the processing of video or images provided by CCD sensors or video
cameras have been used in the solution of the navigation problem of autonomous vehicles.
In one of those approaches, a velocity field is designed in order to guide the orientation and
motion of the autonomous vehicle. A particular approach to the solution of the navigation
problem of an autonomous vehicle is presented here. In the first section of this introduction
a state-of-the-art review is presented; after it, the proposed algorithm is summarized. The
following sections present the procedure. Finally, some experimental results are shown at
the end of the chapter.
1.1 Review of Navigation of Autonomous Robots using Vision Techniques.
In the area of autonomous navigation using vision techniques, the previous works (Santos-Victor & Sandini, 1997) and (Nasisi & Carelli, 2003) are cornerstones. In the first mentioned
study, robot control techniques are explored, using both cameras on board the robots and
external cameras. That work shows that it is possible to accomplish effective control
actions without a complete processing of the captured image and without calibrating the
camera. In the work of Nasisi & Carelli, a set of equations is obtained that relates a
two-dimensional image captured by a video camera to the corresponding three-dimensional
scene, an equation set that is important when a single camera is being used. The works of
S. Skaar et al. (who participated in the 2003 Mars Exploration Rover experiment of NASA)
on the concept of Camera Space Manipulation (CSM) (Skaar et al., 1992) and the concept of
Mobile Camera Space Manipulation (MCSM) (Seelinger et al., 2002) must also be cited. The
MCSM method consists of estimating the relationship between the characteristic positions
of the manipulator and the corresponding points in the image spaces of the two cameras
mounted on the robot; the CSM concept is quite similar but with more restrictions. Both
methods, the CSM and the MCSM, require not only the parameters of the cameras to be
completely known, but also the kinematic model of the manipulator, even if they do not
require the complete calibration of the camera and the manipulator. These methods require
a set of cameras, while the methodology proposed in (Santos-Victor & Sandini, 1997) and
(Nasisi & Carelli, 2003) involves only one.
One vital characteristic of every navigation strategy is the way decisions are taken when the
sensory system indicates the presence of obstacles on the robot's trajectory. Different obstacle
avoidance strategies have been presented; among them, the use of electrostatic fields - where
the robot and the obstacles are modeled as positive electric charges and the goal of the
trajectory as a negative charge - must be mentioned (Dudek & Jenkin, 2000), (Khatib, 1985);
the Coulomb law is applied to determine the trajectory of the mobile robot.
In 1995, Li and Horowitz presented the concept of velocity fields (Li & Horowitz, 1995); in
this formulation, a velocity vector is defined for each possible position along the trajectory
of the robot, encoding a specific task. Using the velocity field as the reference for the system
has allowed the problem to be approached with different control schemes instead of the
classic trajectory-following formulation. In this approach, the improvement of coordination
and synchronization among the degrees of freedom of the robot is much more important
than the time needed to execute a given task. Figure 1 presents an example of this strategy.
(Moreno & Kelly, 2003b) and (Moreno & Kelly, 2003c) proposed different control schemes
using velocity fields, focusing mainly on the asymptotic stability conditions.
Other investigations related to velocity fields have focused on other areas, such as adaptive
control (Li, 1999). In that work, Li proposed an adaptive control scheme using velocity fields
as references for robots with unknown inertia parameters. Also, in (Dixon et al., 2005) an
adaptive derivative control is presented in order to follow a velocity field. In almost all the
previous cases, the works had as an objective the design of control schemes taking a velocity
field as a time-invariant reference, obtained through a theoretical or analytical procedure. Only
a few works in the literature have considered the case when the field is time-dependent.
augmented dynamic system using an inertia flywheel, in order to avoid the problems of
velocity field following control. Recently, Kelly and collaborators in (Kelly et al., 2004a)
and (Kelly et al., 2004b) used a camera as the main sensor in a velocity field control scheme
in order to control a mobile robot and a robot manipulator. In particular, (Kelly et al.,
2004a) presents a velocity field controller for a manipulator with two degrees of
freedom that incorporates an external camera to capture the manipulator movements and
the work environment. On the captured image, a theoretical velocity field that the
manipulator is capable of following is pre-defined. Figure 2 presents an example of this case.
In (Kelly et al., 2004b), the control is applied to a wheeled mobile robot where the video
camera is located above the robot. As in the previous case, the camera is placed in such a
way that it covers the environment and the robot itself. Figure 3 shows the system under study.
1.2. The proposed approach
During the last years, as mentioned, the use of artificial vision in robot tasks has been
increasing. In many applications, such as rescue jobs, the use of vision to generate velocity
fields is under study. The dynamic modification and generation of the velocity field itself,
changing it with the environment modifications, is a new research area. If timing is not an
issue in the control objectives, then velocity field control, where the speed can be adjusted as
desired, is a potentially important technique. For instance, if the objective is to control a
vehicle's trajectory at 80 km/h, the velocity field can be adjusted to that speed and modified
in order to avoid obstacles detected by a camera (either on board or external), while keeping
close to the original trajectory in order to reach the final objective. This approach could be of
crucial importance in rescue tasks where a flying machine could interpret the
environment and change the velocity field references for another robot on the ground.
Another potential application is in automatic warehouses where, by changing the velocity field
references, different and changing tasks could be assigned to the manipulator. In the future,
the dynamic velocity field generator that is presented in this work will be modified in
order to allow the generation of a 3-dimensional velocity field. Also, the algorithm is going
to be used to coordinate the tasks of cooperative robots for Futbot, submarine vehicle
coordination, cooperative multi-robot observation, etc.
In order to perform the tests, an experimental setup has been integrated. This experimental
setup consists of:
1. A Hemisson differential mobile robot, created by the Swiss company K-Team
Corporation.
2. A 2.4 GHz wireless video camera, model XC10A, and its wireless receiver VR31A. This is
the only sensor employed in the experiments in order to detect the robot and the
obstacles. The camera has a resolution of 640 x 480 pixels and offers up to 30 frames per
second (fps).
3. An image acquisition card, model NI-PXI-1407, using the RS-170 standard for
monochromatic video signals. All its characteristics can be seen in Table 1.
4. A PXI module from National Instruments employed to process the data. The module
is the NI-PXI-1000B.
Description: Standard monochromatic
Bus: PXI/CompactPCI
Video Inputs: 1
Spatial Resolution: 640 x 480 (RS-170), 768 x 576 (CCIR)
Pixel depth: 8 bits
Video input Standard: RS-170, CCIR
Digital I/O: 1
Table 1. Characteristics of the NI-PXI-1407 image acquisition card
The final navigation field is obtained by adding the initial field to the evading fields, weighting
each one properly.
Before testing the system on the experimental setup, the results were validated using the
Easy Java Simulations platform. These results are also presented in this chapter.
d²_{f,t}(u, v) = Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} [f(x, y) − t(x − u, y − v)]²    (1)

where f is the input image of size L x M, and the sum is done over (x, y) in the window
containing the sub-image t localized at (u, v), of size J x K. By expanding d² we obtain:

d²_{f,t}(u, v) = Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} [f²(x, y) − 2 f(x, y) t(x − u, y − v) + t²(x − u, y − v)]    (2)

where Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} t²(x − u, y − v) is a constant. If Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} f²(x, y)
is approximately constant, the remaining term is the cross correlation, which is the measure of
similitude or concordance between the image and the pattern. For this case u = 0, 1, ..., M−1,
v = 0, 1, ..., L−1, J = M and K = L.

C(u, v) = Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} f(x, y) t(x − u, y − v)    (3)
Figure 5 shows the correlation process assuming that the origin of f is placed in the upper
left corner. The correlation process therefore consists of moving the template t over the
input image and calculating the value of C. By this method, the maximum value
of C indicates the position of the highest similitude between t and f.
R(u, v) = Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} [f(x, y) − f̄_{u,v}] [t(x − u, y − v) − t̄] /
{ Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} [f(x, y) − f̄_{u,v}]² Σ_{x=0}^{L−1} Σ_{y=0}^{M−1} [t(x − u, y − v) − t̄]² }^{1/2}    (4)

where t̄ is the average intensity value of the pixels in the template and f̄_{u,v} is the average of
f(x, y) in the region under the template. The obtained value of R is normalized between -1 and 1,
and is independent of intensity changes in the f and t images.
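The normalized correlation of Equation (4) can be prototyped as in the following sketch (a direct, unoptimized Python transcription written only for clarity; the actual system relies on the NI IMAQ pattern matching tools rather than on this code):

import numpy as np

def normalized_cross_correlation(f, t):
    # Slide template t over image f and score each placement between -1 and 1,
    # independently of the absolute intensities of f and t (Equation (4)).
    f = f.astype(float)
    t = t.astype(float)
    J, K = t.shape
    H, W = f.shape
    t_zero = t - t.mean()                         # t(x,y) - t_bar
    t_norm = np.sqrt((t_zero ** 2).sum())
    R = np.zeros((H - J + 1, W - K + 1))
    for u in range(H - J + 1):
        for v in range(W - K + 1):
            win = f[u:u + J, v:v + K]
            w_zero = win - win.mean()             # f(x,y) - f_bar_{u,v}
            denom = np.sqrt((w_zero ** 2).sum()) * t_norm
            R[u, v] = (w_zero * t_zero).sum() / denom if denom > 0 else 0.0
    return R

# The best match position is the argmax of R:
# u_best, v_best = np.unravel_index(np.argmax(R), R.shape)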
2.1.2. Pattern recognition process
The pattern detection process is divided into two main subsystems: learning and the recognition
itself.
The learning phase analyzes the template image in order to extract features that can
improve the recognition process compared to the standard approach.
The pattern learning algorithm can be described as follows (National, 2004):
borders could be lost. On the other hand, completely random sampling can produce
large areas with poor sampling or over-sampled areas.
The utilization of a circular intensity profile obtained in the learning phase, in order to
localize rotated and displaced versions of that profile throughout the image.
The utilization of the pixels obtained in the pseudo-random sub-sampling, which are
employed in a correlation process with the candidates previously identified, giving
each one a score that is used to determine whether that candidate can be classified as a pattern match.
2.1.3. Pattern recognition subsystem implementation
As stated in the previous section, the pattern recognition subsystem is separated into two
stages: learning and matching; each of them was implemented as a VI in LabVIEW v7.1.
The VI for the matching stage also contains the implementation of the
obstacle detection. However, only the pattern recognition system is described in this
section.
Pattern Learning
The learning stage starts by loading an image that contains the pattern to be recognized. For
this implementation, the desired recognition target is a Hemisson mobile robot. Then
the pattern learning is performed and the resulting information is stored in a PNG image
file.
Figure 6 shows the block diagram of the pattern learning process, described in more
detail in the following sections.
Image file loading: the image file containing the desired pattern is loaded into the
previously created image.
Image conversion: the image is converted into a grayscale one, to be processed by the
learning module.
Storing: the information related to the pattern learning is stored in a new PNG image file.
Initialization: two blank 8 bpp images are generated. One image corresponds to
the video capture and the other one is for loading the pattern image. The
video acquisition through the NI IMAQ 1407 module is also started.
Image capture: a real image of the workspace is acquired; the resulting image is in
grayscale.
Image resizing: in order to improve position calculation and detection speed, the
captured image is resized from 640 x 480 to 480 x 480 by cropping it. This image size
corresponds to a physical working area of 4 m2.
Information loading: the information related to the learned pattern, contained in the
PNG image file stored by the previous subsystem, is loaded.
Matching module configuration: the module is configured to perform a rotation-invariant pattern search.
Matching process: this process is performed with the acquired image and the image
loaded with the learning information. In case of a successful detection of the desired
pattern, the position of the match in the working space and its rotation are obtained,
as shown in Figure 4.7.
Classification Process
The classification process involves the pre-processing of the input image, the feature
extraction and the classification, as shown in Figure 10.

D(X, C_j) = min_i d(X, X_i^j)    (5)

d(X, Y) = Σ_k |x_k − y_k|    (6)
Classifier configuration: features are selected according to the specific requirements of the
classification process. The thresholding method, called clustering, can be controlled;
it consists in sorting the image histogram into a discrete number of classes according
to the number of phases detected in the image.
The classifier is also configured to detect bright figures, because many of the objects used
have high white levels.
In the engine options, the desired classification method (Nearest Neighbour, NN) and the
distance metric (Manhattan) are selected.
Sample collection: the files containing the patterns to be classified are loaded. If the image
file contains more than one pattern, the desired one can be enclosed in a rectangle.
Then the class to which the object belongs is identified, and a tag is added. Once the
tag is specified, one can proceed to add the sample.
Training: once the sample has been added, the feature vector of the
sample, which uniquely identifies a class, is computed. This process must be repeated for each
added class.
Storing: at this stage, all the obtained data is saved into a file with the information of each
class.
Based on the file obtained in the previous step, it is possible to create a classification
session in a LabVIEW VI that performs the object classification for the images acquired
through the camera.
Objects detection and classification
As stated in Section 2.1.3, this module is implemented together with the pattern matching
module in the same VI file, so the first three steps are almost the same; the only difference is
that this specific module requires four white images.
This module takes as inputs the Brightness, Contrast and Gamma of the acquired image, whose
default values are 50, 45 and 1 respectively. According to the existing illumination in the
scene, those values can be adjusted so the vision system can see the objects in the visual
field. Usually it is enough to increase the Brightness value.
Figure 12. Block Diagram of the Object Detection and Classification Process
The detection and classification process is illustrated in Figure 12, and is implemented in the
following way:
Image pre-processing: a level inversion is applied to the grayscale image, which is then
converted into a binary image via thresholding for the particle analysis. Then,
the image is filtered through the morphological process of erosion in order to delete
meaningless particles from the particle analysis.
Classification: the particles in the binary images are classified using the particle
positions and the classification sessions created in the learning process.
Obstacle parameters: once the object has been classified, the diagonal of
the smallest non-rotated rectangle that encloses the figure is calculated. For the cases of rectangle
and square detection, the object orientation is also extracted by using Rotation Detect
from IMAQ Vision.
Results: the final outputs of the object detection and classification system are the
positions of the detected figures in the original image, values obtained in the particle
detection phase. Those values are sorted into an array for each corresponding figure,
thus giving three different arrays for object positions. The rotation angles and diagonal
lengths are also sorted according to the related figure, giving three more arrays for
the angles and another three for the calculated diagonals.
Using the proposed vision system, consisting of a single camera watching the workspace of
the robot, it is possible to detect the obstacles present on the robot's trajectory. The application
of velocity field generation based on artificial vision constitutes a valuable contribution to
the state of the art.
In the following paragraphs a strategy to generate obstacle-free velocity trajectories is
presented. The problem was divided into two stages linked through the vision system
implementation. These stages are: the generation of an initial velocity field, and the
generation of an evader velocity field for each detected object.
3.1.1. Initial velocity Field Generation
The system allows a user-defined trajectory to be followed by the robot. It can be a hand-drawn
one or a set of straight lines. The algorithm developed was tested for 41x41 velocity fields
and can be described as follows.
The vision system takes a snapshot of the robot's workspace. This image is cropped to hold
only the ROI, which is subsampled to a 41x41 image (this resolution of the velocity field
offers 1 vector every 5 cm, which is less than half of the robot dimensions). Over it, the user
traces the desired trajectory.
When the user finishes defining the desired trajectory, the coordinates of the pixels that are part of
the trajectory are extracted by a thresholding process and stored in an N-size 1D array of
(X, Y) coordinate pairs, where N is the number of points or pixels of the initial trajectory.
Trajectories can be open or closed. In both cases a sorting process is performed, establishing
as sorting parameter the Euclidean distance from one point to another, organizing them
from closest to most distant. When the trajectory is open, it is necessary to know where it
begins and where it ends. By studying the neighbours of each element of the sorted array, the start
and end points are obtained.
Then an approximation vector field is defined. For that, a 2D array of
41x41 elements containing the coordinates (X, Y) of all pixels of a 41x41 image is considered,
i.e. element (i, j) has a value of (i, j). For each element of the 2D array, the closest element of the
trajectory is searched based on the Euclidean distance from point (i, j) of the 2D array to each
element of the 1D array containing the coordinates of the trajectory. Each approximation vector is
defined as the normalized vector whose direction is obtained from the subtraction of the
closest point of the trajectory and the point of the 2D array being analyzed.
A tangent vector field is also defined. Each vector is obtained from the subtraction of the
element p + 1 and the element p − 1, where p is the index of the closest point of the trajectory to
the point (i, j) being studied. When the closest point is the point where the trajectory starts,
the subtraction is performed between elements p + 2 and p (p = 0), whereas if it is the ending
one, the points subtracted are p and p − 2 (p = N − 1). With this assumption, tangent vectors
will always have congruent senses. Tangent vectors are normalized too.
The initial velocity field is obtained by performing a weighted sum, expressed in (7), between
the approximation and tangent vector fields. The selection of the weights depends directly on
the distance between point (i, j) and the closest point of the trajectory. As a weight function a
sigmoid was chosen. If point (i, j) is close to the trajectory, the tangent vector will prevail over
the approximation one and vice versa.
V_ij = Va_ij f1(d_ij) + Vt_ij f2(d_ij)    (7)

where V_ij is the vector of the final velocity field, Va_ij and Vt_ij are the approximation and
tangent vectors at (i, j), respectively, and d_ij is the Euclidean distance from point (i, j) to the
desired trajectory. The weighting functions are

f1(d_ij) = 2 / (1 + e^(−0.4 d_ij)) − 1    (8)

f2(d_ij) = 1 − f1(d_ij)    (9)

The parameter 0.4 was chosen because this value allows an important attenuation of the
tangent vectors when d_ij > 6 (3 times the dimensions of the robot). Figure 15 shows the effect
of the weighting functions expressed in (8) and (9).
Figure 15. Effect of the weighting functions f1 and f2. Note that for small distances the tangent
vector is greater than the approximation one, and vice versa
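The construction of the initial field can be summarized in the following sketch (our own code, not the authors' LabVIEW implementation); it uses the weighting of Equations (7)-(9) and treats the grid indices directly as coordinates, i.e. one cell per 5 cm.

import numpy as np

def initial_velocity_field(traj, size=41, k=0.4):
    # traj: ordered array of (x, y) points of the desired trajectory
    # (already sorted as described above; at least 3 distinct points).
    traj = np.asarray(traj, float)
    V = np.zeros((size, size, 2))
    N = len(traj)
    for i in range(size):
        for j in range(size):
            p_ij = np.array([i, j], float)
            d = np.linalg.norm(traj - p_ij, axis=1)
            p = int(np.argmin(d))                       # closest trajectory point
            d_ij = d[p]
            # approximation vector: towards the closest trajectory point
            va = traj[p] - p_ij
            va = va / np.linalg.norm(va) if d_ij > 0 else np.zeros(2)
            # tangent vector: central difference along the trajectory,
            # with one-sided differences at the two endpoints
            a, b = (p + 2, p) if p == 0 else ((p, p - 2) if p == N - 1 else (p + 1, p - 1))
            vt = traj[a] - traj[b]
            vt = vt / np.linalg.norm(vt)
            # weighted sum of Equations (7)-(9)
            f1 = 2.0 / (1.0 + np.exp(-k * d_ij)) - 1.0  # approximation weight
            V[i, j] = f1 * va + (1.0 - f1) * vt
    return V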
3.1.2. Dynamic Velocity Field Modification
The evader velocity field generation module takes the information provided by the vision
system, parameterizes the corresponding ellipse for each obstacle and creates a velocity field
that surrounds the object.
The proposed algorithm contemplates dividing the ellipse into four regions: one for entry,
one for exit and two for transitions.
In the transition regions the velocity field is chosen to be parallel to the trajectory given by
the ellipse contour, i.e. tangent to the ellipse. The general tangent line equation at any point
(X0, Y0) is given by (10).
(X − P)(X0 − P)/A² + (Y − Q)(Y0 − Q)/B² = 1    (10)
where (P, Q) are the coordinates of the centre of the ellipse and A and B represent the semi-major
and semi-minor axes respectively. From (10), the unit tangent vector at any point
(X0,Y0) can be deduced to be
Vt(X0, Y0) = [Vtx, Vty]    (11)

Vtx(X0, Y0) = −A²(Y0 − Q) / √( A⁴(Y0 − Q)² + B⁴(X0 − P)² )    (12)

Vty(X0, Y0) = B²(X0 − P) / √( A⁴(Y0 − Q)² + B⁴(X0 − P)² )    (13)
In the entry region the field is defined in the direction and sense toward the ellipse contour
and is turned aside smoothly until it converges to the tangent vector as the point gets closer to
the transition region. This is achieved through a weighted sum of the approximation and
tangent vectors to the ellipse, where the weights depend on the distance
from a given point to the edge between the entry and transition regions.
Entry and exit regions are always of the same size, and so are the transition regions. The size
(angle) of each region is chosen such that it has an area equal to a quarter of the ellipse's area.
To accomplish this requirement, the area of the entry (or exit) region is given by (14), where 2α
is the angular size of the region and γ its starting polar angle,

∫_{γ}^{γ+2α} ∫_{0}^{r(φ)} ρ dρ dφ = π A B / 4    (14)
Consider Figure 16, where the angles related to an ellipse are defined. Based on it, it is possible to
obtain the relations shown in (15).
Figure 16. Definition of Angles and coordinates of the point belonging to the ellipse
Xc = A cos(θ)    Xe = r cos(φ)
Yc = B sin(θ)    Ye = r sin(φ) = B sin(θ)    (15)
r is defined by (16) in terms of θ or, considering (17), it can be defined by (18) in terms of φ.

r(θ) = √( A² cos²(θ) + B² sin²(θ) )    (16)

tan(θ) = (A/B) tan(φ)    (17)

r(φ) = A √( (1 + tan²(φ)) / (1 + (A/B)² tan²(φ)) )    (18)
Solving (14), the size (angle) of the entry and exit regions is given by (19).

2α = 2 arctan( ((B/A)² + tan²(γ)) / (1 + (B/A)² tan²(γ)) )    (19)
The orientation of the regions is given by the angle of the original field at the point where the
object is located. The regions must be rotated to achieve an entry region aligned with the
original velocity field.
For the exit region the same approach used for the entry region is employed. However, in
this case, the field is defined leaving the ellipse.
The approximation vector at any point (X0, Y0) is given by (20)

Va(X0, Y0) = [Vax, Vay]    (20)

Vax(X0, Y0) = −B²(X0 − P) / √( A⁴(Y0 − Q)² + B⁴(X0 − P)² )    (21)

Vay(X0, Y0) = −A²(Y0 − Q) / √( A⁴(Y0 − Q)² + B⁴(X0 − P)² )    (22)
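The two unit vectors can be evaluated as in the sketch below (our own code); the circulation sense of the tangent, the inward sense of the approximation vector and the simple blending weight are assumptions, since the signs and the exact weighting are not fully recoverable from the printed equations.

import numpy as np

def evader_vectors(x0, y0, P, Q, A, B):
    # Unit tangent and approximation vectors of Equations (11)-(13) and
    # (20)-(22) for the ellipse centred at (P, Q) with semi-axes A and B.
    dx, dy = x0 - P, y0 - Q
    den = np.sqrt(A**4 * dy**2 + B**4 * dx**2)
    vt = np.array([-A**2 * dy,  B**2 * dx]) / den     # tangent to the ellipse
    va = np.array([-B**2 * dx, -A**2 * dy]) / den     # towards the contour
    return vt, va

def evader_field_at(x0, y0, P, Q, A, B, w_tangent):
    # Blend of approximation and tangent vectors used in the entry region;
    # w_tangent in [0, 1] grows towards the edge with the transition region.
    vt, va = evader_vectors(x0, y0, P, Q, A, B)
    v = (1.0 - w_tangent) * va + w_tangent * vt
    return v / np.linalg.norm(v)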
The division into the defined regions aims to achieve an evader velocity
field similar to the sketch shown in Figure 17.
Figure 17. Sketch of the evader field. Following the direction and sense of the initial field at
the point where the obstacle is located, the different regions are defined. Note the deviations
of the field in the entry and exit regions
Figure 18. Velocity field. Note the hand-drawn desired trajectory highlighted in black
The evader algorithm responds to an arbitrary object as shown in Figure 19.
Figure 19. Evader velocity field for an arbitrary object. The four predefined regions are
shown and the behavior of the algorithm can be observed
Figure 19.a shows an object located at a place where the original field has an orientation of
75 degrees and the circumscribed ellipse has 25 degrees. Figure 19.b presents the evader field for an object
whose circumscribed ellipse has 90 degrees and the original field at the point where the object is placed
has 55 degrees. In both figures the imposed exponential fading is shown. This effect ensures that the
evader field only affects the neighbourhood of the object.
A test of the system with the three modules combined is now presented. Figure 14
summarizes the results obtained.
d²_{f,t}(u, v) = Σ_{x,y} [f(x, y) − t(x − u, y − v)]²    (23)

where f is the image and the sum is performed over (x, y) in the window containing the
sub-image t located at (u, v). Expanding d² and making some considerations (Bolaños et al.,
2006), the remaining term is the cross correlation

C(u, v) = Σ_{x,y} f(x, y) t(x − u, y − v)    (24)
Cropping: the captured image is cropped to 480x480 pixels (from 640x480 pixels).
This image size allows the visualization of a workspace of 6.25 m2.
Information load: the information (related to pattern learning) contained in the PNG
image stored in the learning process is loaded.
Pattern matching module setup: the pattern matching module is set to rotation-invariant mode so it can detect the desired pattern regardless of its rotation.
Matching: the matching process is done, according to the configuration described above,
between the captured image and the loaded image (with the information from the
learning process). If the desired pattern is located, the result will be its position within
the image and its orientation.
v_left = v_linear + ω a
v_right = v_linear − ω a    (25)
A. Linear Controller
This controller was a conventional PI where the error and the sum of previous errors are
added to generate the control signal. The inputs of the control system are distance and angle
errors and the outputs are references for right and left velocities directly. The control scheme
employed is shown in Figure 22.
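A minimal sketch of such a controller, combined with the wheel-speed mapping of Equation (25), is shown below; the gains, the error conventions and the interpretation of a as half the distance between the wheels are our assumptions, since they are not specified above.

class PIPositionController:
    # PI controller: the distance and angle errors, plus their accumulated
    # sums, generate the linear and angular commands, which Equation (25)
    # turns into left and right wheel speed references.

    def __init__(self, kp_d, ki_d, kp_a, ki_a, a):
        self.kp_d, self.ki_d = kp_d, ki_d       # gains for the distance error
        self.kp_a, self.ki_a = kp_a, ki_a       # gains for the angle error
        self.a = a                              # assumed half wheelbase
        self.sum_d = 0.0
        self.sum_a = 0.0

    def update(self, dist_error, angle_error):
        self.sum_d += dist_error
        self.sum_a += angle_error
        v_linear = self.kp_d * dist_error + self.ki_d * self.sum_d
        w = self.kp_a * angle_error + self.ki_a * self.sum_a
        v_left = v_linear + w * self.a          # Equation (25)
        v_right = v_linear - w * self.a
        return v_left, v_right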
Angle Error (degrees): Small Positive (0° - 90°), Big Positive (90° - 180°), Small Negative
(180° - 270°) and Big Negative (270° - 360°).
Distance to the destination: Far (150 cm - 800 cm), Close (20 cm - 150 cm) and Very Close
(0 cm - 40 cm).
Velocities (left and right wheels): Fast (8.5 cm/s), Medium (5 cm/s), Slow (3.5 cm/s) and
Very Slow (0.8 cm/s).
With these membership functions, the fuzzy inference system is based on the Mamdani
aggregation method (Ying, 2000), with 9 fuzzy rules, and a defuzzification technique based
on the centre of gravity.
The base of rules was formed as shown in Table 2.

Distance     | Angle Error     | Left Velocity | Right Velocity
Very Close   | X               | Very Slow     | Very Slow
Close        | Small Positive  | Slow          | Very Slow
Close        | Small Negative  | Very Slow     | Slow
Close        | Big Positive    | Medium        | Very Slow
Close        | Big Negative    | Very Slow     | Medium
Far          | Small Positive  | Fast          | Medium
Far          | Small Negative  | Medium        | Fast
Far          | Big Positive    | Fast          | Slow
Far          | Big Negative    | Slow          | Fast
Table 2. Base of fuzzy rules
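The rule base of Table 2 can be prototyped as in the following sketch (our own code). It assumes user-supplied membership functions for the two inputs and approximates the Mamdani inference with centre-of-gravity defuzzification by a weighted average of the crisp speed values listed above.

SPEED = {"Very Slow": 0.8, "Slow": 3.5, "Medium": 5.0, "Fast": 8.5}   # cm/s

RULES = [  # (distance set, angle set or None for "X", left speed, right speed)
    ("Very Close", None,             "Very Slow", "Very Slow"),
    ("Close",      "Small Positive", "Slow",      "Very Slow"),
    ("Close",      "Small Negative", "Very Slow", "Slow"),
    ("Close",      "Big Positive",   "Medium",    "Very Slow"),
    ("Close",      "Big Negative",   "Very Slow", "Medium"),
    ("Far",        "Small Positive", "Fast",      "Medium"),
    ("Far",        "Small Negative", "Medium",    "Fast"),
    ("Far",        "Big Positive",   "Fast",      "Slow"),
    ("Far",        "Big Negative",   "Slow",      "Fast"),
]

def fuzzy_wheel_speeds(distance_cm, angle_error_deg, dist_mf, angle_mf):
    # dist_mf and angle_mf map a crisp input to a dict of
    # {fuzzy set name: membership grade in [0, 1]}.
    mu_d = dist_mf(distance_cm)
    mu_a = angle_mf(angle_error_deg)
    num_left = num_right = den = 0.0
    for d_set, a_set, left, right in RULES:
        w = mu_d.get(d_set, 0.0)
        if a_set is not None:
            w = min(w, mu_a.get(a_set, 0.0))   # AND of the two antecedents
        num_left += w * SPEED[left]
        num_right += w * SPEED[right]
        den += w
    if den == 0.0:
        return 0.0, 0.0                        # no rule fired
    return num_left / den, num_right / den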
proposed linear controller was used, and it can be observed how the robot oscillates near the
final position. The oscillations around the end point were very strong and impossible to
eliminate, since controller parameters that worked well far from the destination did not
give good results in its proximity. The controlled variables saturated in some circumstances,
affecting the response of the controller. The tuning of the controller was also very difficult
due to the interdependence between v_linear and ω.
5. References
Bolaños, J. M.; Medina-Meléndez, W.; Fermín, L.; Cappelletto, J.; Fernández-López, G. & Grieco, J. C. (2006). Object recognition for obstacle avoidance in mobile robots. Artificial Intelligence and Soft Computing, ICAISC 2006, Lecture Notes in Computer Science, pp. 722–731, ISBN: 3-5403-5748-3, Zakopane, Poland, June 2006.
Cervantes, I.; Kelly, R.; Alvarez, J. & Moreno, J. (2002). A robust velocity field control. IEEE Transactions on Control Systems Technology, Vol. 10, No. 6, (November 2002) 888–894, ISSN: 1063-6536.
Dixon, W. E.; Galluzo, T.; Hu, G. & Crane, C. (2005). Adaptive velocity field control of a wheeled mobile robot. Proceedings of the 5th International Workshop on Robot Motion and Control, RoMoCo'05, pp. 145–150, ISBN: 83-7143-266-6, Poznan, Poland, June 2005.
Dudek, G. & Jenkin, M. (2000). Computational Principles of Mobile Robotics, Cambridge University Press, ISBN: 0-5215-6876-5, U.S.A.
Kelly, R.; Moreno, J. & Campa, R. (2004). Visual servoing of planar robots via velocity fields. Proceedings of the IEEE 43rd Conference on Decision and Control, pp. 4028–4033, ISBN: 0-7803-8682-5, Atlantis, Paradise Island, Bahamas, December 2004.
Kelly, R.; Bugarín, E. & Campa, R. (2004). Application of velocity field control to visual navigation of mobile robots. Proceedings of the 5th IFAC Symposium on Intelligent Autonomous Vehicles, on CD, Lisbon, Portugal, June 2004.
Khatib, O. (1985). Real-time obstacle avoidance for manipulators and mobile robots. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 500–505, St. Louis, U.S.A., March 1985.
Li, P. Y. & Horowitz, R. (1995). Passive velocity field control of mechanical manipulators. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2764–2770, ISBN: 0-7803-1965-6, Nagoya, Japan, April 1995.
Li, P. Y. (1999). Adaptive passive velocity field control. Proceedings of the 1999 American Control Conference, Vol. 2, pp. 774–779, ISBN: 0-7803-4990-3, San Diego, U.S.A., June 1999.
Li, P. Y. & Horowitz, R. (1999). Passive velocity field control of mechanical manipulators. IEEE Transactions on Robotics and Automation, Vol. 15, No. 4, (August 1999) 751–763, ISSN: 1042-296X.
Li, P. Y. & Horowitz, R. (2001). Passive velocity field control (PVFC): Part I - Geometry and robustness. IEEE Transactions on Automatic Control, Vol. 46, No. 9, (September 2001) 1346–1359, ISSN: 0018-9286.
Li, P. Y. & Horowitz, R. (2001). Passive velocity field control (PVFC): Part II - Application to contour following. IEEE Transactions on Automatic Control, Vol. 46, No. 9, (September 2001) 1360–1371, ISSN: 0018-9286.
Mitchell, T. (1997). Machine Learning, McGraw-Hill Science/Engineering/Math, ISBN: 0-07-042807-7, U.S.A.
Moreno, J. & Kelly, R. (2003a). On manipulator control via velocity fields. Proceedings of the 15th IFAC World Congress, pp. 1420–1427, Barcelona, Spain, July 2003.
Moreno, J. & Kelly, R. (2003b). Velocity control of robot manipulators: Analysis and experiments. International Journal of Control, Vol. 76, No. 14, (September 2003) 1420–1427, ISSN: 0020-7179.
Moreno, J. & Kelly, R. (2003c). Hierarchical velocity field control for robot manipulators. Proceedings of the IEEE International Conference on Robotics and Automation ICRA'03, Vol. 3, pp. 4374–4379, ISBN: 0-7803-7736-2, Taipei, Taiwan, September 2003.
National Instruments (2004). IMAQ Vision Concepts Manual, National Instruments.
Santos-Victor, J. & Sandini, G. (1997). Visual behaviors for docking. Computer Vision and Image Understanding: CVIU, Vol. 67, No. 3, (September 1997) 223–238, ISSN: 1077-3142.
Seelinger, M.; Yoder, J-D.; Baumgartner, E. T. & Skaar, S. B. (2002). High-precision visual control of mobile manipulators. IEEE Transactions on Robotics and Automation, Vol. 18, No. 6, (December 2002) 957–965, ISSN: 1042-296X.
Skaar, S. B.; Yalda-Mooshabad, I. & Brockman, W. H. (1992). Nonholonomic camera-space manipulation. IEEE Transactions on Robotics and Automation, Vol. 8, No. 4, (August 1992) 464–478, ISSN: 1042-296X.
Yamakita, M. & Suh, J. H. (2000). Adaptive generation of desired velocity field for cooperative mobile robots with decentralized PVFC. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems IROS'00, pp. 1841–1846, ISBN: 0-7803-6348-5, Takamatsu, Japan, Oct./Nov. 2000.
Yamakita, M. & Suh, J. H. (2001). Adaptive generation of desired velocity field for leader-follower type mobile robots with decentralized PVFC. Proceedings of the IEEE International Conference on Robotics and Automation ICRA'01, pp. 3495–3501, ISBN: 0-7803-6576-3, Seoul, Korea, May 2001.
Ying, H. (2000). Fuzzy Control and Modeling: Analytical Foundations and Applications, IEEE Press Series on Biomedical Engineering, Wiley-IEEE Press, ISBN: 0-7803-3497-7, U.S.A.
21
Omnidirectional Vision-Based Control From
Homography
Youcef Mezouar, Hicham Hadj Abdelkader and Philippe Martinet
LASMEA / Blaise Pascal University
France
1. Introduction
Vision-based servoing schemes are flexible and effective methods to control robot motions
from camera observations (Hutchinson et al 1996). They are traditionally classified into
three groups, namely position-based, image-based and hybrid-based control (Espiau et al
1992), (Hutchinson et al 1996), (Malis et al 1999). These three schemes make assumptions on
the link between the initial, current and desired images, since they require correspondences
between the visual features extracted from the initial image and those obtained from the
desired one. These features are then tracked during the camera (and/or the object) motion.
If these steps fail, the visually based robotic task cannot be achieved. Typical cases of failure
arise when matching joint image features is impossible (for example when no joint feature
belongs to the initial and desired images) or when some parts of the visual features get out of
the field of view during the servoing. Some methods have been investigated to resolve this
deficiency, based on path planning (Mezouar et al 2002), switching control (Chesi et al 2003)
or zoom adjustment (Benhimane et al 2003). However, such strategies are sometimes delicate
to adapt to a generic setup.
Conventional cameras thus suffer from a restricted field of view, and there is significant
motivation for increasing the field of view of the cameras. Many applications in vision-based
robotics, such as mobile robot localization (Blaer et al 2002) and navigation (Winter et
al 2000), can benefit from the panoramic field of view provided by omnidirectional cameras. In
the literature, several methods have been proposed for increasing the field of view of
camera systems (Benosman et al 2000). One effective way is to combine mirrors with a
conventional imaging system. The obtained sensors are referred to as catadioptric imaging
systems. The resulting imaging systems are termed central catadioptric when a single
projection center describes the world-image mapping. From a theoretical and practical
viewpoint, a single center of projection is a desirable property for an imaging system (Baker et al
1999). Baker and Nayar in (Baker et al 1999) derive the entire class of catadioptric systems
with a single viewpoint. Clearly, visual servoing applications can also benefit from such
sensors since they naturally overcome the visibility constraint. Vision-based control of
robotic arms, of a single mobile robot or of formations of mobile robots thus appears in the literature
with omnidirectional cameras (refer for example to (Barreto et al 2002), (Vidal et al 2003),
(Mezouar et al 2004)). Image-based visual servoing with central catadioptric cameras using
points has been studied in (Barreto et al 2002). The use of straight lines has also been
investigated in (Mezouar et al 2004).
This paper is concerned with homography-based visual servo control techniques with
central catadioptric cameras. This framework (called 2 1/2 D visual servoing) was first
proposed by Malis and Chaumette in (Malis et al 1999). The 2 1/2 D visual servoing scheme
exploits a combination of reconstructed Euclidean information and image-space information
in the control design. The 3D information is extracted from a homography matrix relating
two views of a reference plane. As a consequence, the 2 1/2 D visual servoing scheme does
not require any 3D model of the target. The resulting interaction matrix is triangular with
interesting decoupling properties and it has no singularity in the whole task space.
Unfortunately, in such an approach the image of the target is not guaranteed to remain in the
camera field of view. Motivated by the desire to overcome this deficiency, we extend in this
paper homography-based visual servo control techniques to an entire class of
omnidirectional cameras. We describe how to obtain a generic homography matrix related
to a reference plane for central catadioptric cameras using imaged points or lines. Then the
3D information obtained from the homography is used to develop 2 1/2 D visual servoing
schemes based on point and line features. Simulations as well as experimental results on a
six degrees of freedom robotic arm illustrate the efficiency of omnidirectional vision-based
control with homography.
The centers C and M of Fc and Fm will be termed the optical center and the principal projection center
respectively. The optical center C has coordinates [0 0 −ξ]^T with respect to Fm and the image
plane Z = f(ψ−2ξ) is orthogonal to the Z-axis, where f is the focal length of the conventional
camera and ξ and ψ describe the type of sensor and the shape of the mirror (they are functions
of the mirror shape parameters, refer to (Barreto et al 2002b)).
Consider the virtual unitary sphere centered in M as shown in Figure 1 and let X be a 3D
point with coordinates X = [X Y Z]^T with respect to Fm. The world point X is projected in the
image plane into the point of homogeneous coordinates x̄i = [xi yi 1]^T. The image formation
process can be split in three steps:
First step: the 3D world point X is first projected on the unit sphere surface into a point
of coordinates in Fm:

$$\mathbf{X}_m = \frac{1}{\|\mathbf{X}\|}\,[X \;\; Y \;\; Z]^T$$
The projective ray Xm passes through the principal projection center M and the world point
X.
Second step: the point Xm lying on the unitary sphere is then perspectively projected
on the normalized image plane Z = 1 − ξ. This projection is a point of homogeneous
coordinates x̄ = [x^T 1]^T = f(X) (where x = [x y]^T):

$$\mathbf{x} = f(\mathbf{X}) = \left[\frac{X}{Z+\xi\|\mathbf{X}\|} \;\;\; \frac{Y}{Z+\xi\|\mathbf{X}\|}\right]^T \qquad (1)$$

Third step: finally the point of homogeneous coordinates x̄i in the image plane is
obtained after a plane-to-plane collineation K of the 2D projective point x̄:

$$\bar{\mathbf{x}}_i = \mathbf{K}\,\bar{\mathbf{x}}$$
The matrix K can be written as K=Kc M where the upper triangular matrix Kc contains the
conventional camera intrinsic parameters, and the diagonal matrix M contains the mirror
intrinsic parameters:
$$\mathbf{M} = \begin{bmatrix}\psi-\xi & 0 & 0\\ 0 & \psi-\xi & 0\\ 0 & 0 & 1\end{bmatrix} \quad\text{and}\quad \mathbf{K}_c = \begin{bmatrix}f_u & s & u_0\\ 0 & f_v & v_0\\ 0 & 0 & 1\end{bmatrix}$$
Note that, setting ξ = 0, the general projection model becomes the well known perspective
projection model.
In the sequel, we assume that Z ≠ 0. Let us denote η = s‖X‖/Z = s√(1 + X²/Z² + Y²/Z²),
where s is the sign of Z. The coordinates of the image point can be rewritten as:

$$x = \frac{X/Z}{1+\xi\eta} \quad\text{and}\quad y = \frac{Y/Z}{1+\xi\eta}$$
By combining the two previous equations, it is easy to show that η is the solution of the
following second order equation:

$$\eta^2 - (x^2+y^2)(1+\xi\eta)^2 - 1 = 0$$

Noticing that the sign of η is equal to the sign of Z, it can be shown that the exact solution is:

$$\eta = \frac{-\gamma - \xi(x^2+y^2)}{\xi^2(x^2+y^2)-1} \qquad (2)$$

where γ = √(1 + (1−ξ²)(x²+y²)). Equation (2) shows that η can be computed as a function of
the image coordinates x and of the sensor parameter ξ. Noticing that:

$$\mathbf{X}_m = (\eta^{-1}+\xi)\,\bar{\mathbf{x}} \qquad (3)$$

where x̄ = [x^T  1/(1+ξη)]^T.
The unit vector h̄ normal to the interpretation plane defined by a 3D line L and the principal
projection center fully characterizes the projection of that line; in the perspective case the
corresponding image line l satisfies:

$$(\mathbf{K}^{-T}\bar{\mathbf{h}})^T\,\bar{\mathbf{x}} = \mathbf{l}^T\bar{\mathbf{x}} = 0 \qquad (4)$$
If the line is imaged with a central catadioptric camera then the 3D points lying on the 3D line L
are mapped into points x̄i in the catadioptric image lying on a conic curve:

$$\bar{\mathbf{x}}_i^T\,\mathbf{K}^{-T}\boldsymbol{\Omega}\,\mathbf{K}^{-1}\,\bar{\mathbf{x}}_i = \bar{\mathbf{x}}_i^T\,\boldsymbol{\Omega}_i\,\bar{\mathbf{x}}_i = 0 \qquad (5)$$

where Ωi = K^{-T} Ω K^{-1} and:

$$\boldsymbol{\Omega} = \begin{bmatrix} h_x^2-\xi^2(1-h_y^2) & h_xh_y(1-\xi^2) & h_xh_z\\ h_xh_y(1-\xi^2) & h_y^2-\xi^2(1-h_x^2) & h_yh_z\\ h_xh_z & h_yh_z & h_z^2 \end{bmatrix}$$
The polar line li of the image center

$$\mathbf{O}_i = \mathbf{K}\,[0\;\;0\;\;1]^T \qquad (6)$$

with respect to the conic Ωi allows the normal vector to the interpretation plane to be recovered as:

$$\bar{\mathbf{h}} = \frac{\mathbf{K}^{T}\mathbf{l}_i}{\|\mathbf{K}^{T}\mathbf{l}_i\|} \qquad (7)$$
It is thus clear that the polar line li contains the coordinates of the projection of the 3D line L
in an image plane of an equivalent (virtual) perspective camera defined by the frame Fv =
Fm (see Figure 2) with internal parameters chosen equal to the internal parameters of the
catadioptric camera (i.e. Kv = Kc M). This result is fundamental since it allows us to represent
the physical projection of a 3D line in a catadioptric camera by a simple (polar) line in a
virtual perspective camera rather than a conic. Knowing only the optical center Oi, it is thus
possible to use the linear pin-hole model for the projection of a 3D line instead of the
nonlinear central catadioptric projection model.
Let (π) be a 3D reference plane given in F*m by the vector π*^T = [n*^T  −d*], where n* is its
unitary normal in F*m and d* is the distance from (π) to the origin of F*m.
$$\mathbf{X}_m = (\eta^{-1}+\xi)\,\bar{\mathbf{x}} = \frac{1}{\rho}[X\;\;Y\;\;Z]^T \quad\text{and}\quad \mathbf{X}^*_m = (\eta^{*-1}+\xi)\,\bar{\mathbf{x}}^* = \frac{1}{\rho^*}[X^*\;\;Y^*\;\;Z^*]^T$$

so that

$$\rho(\eta^{-1}+\xi)\,\bar{\mathbf{x}} = [\,\mathbf{I}_3\;\;\mathbf{0}\,]\,\bar{\mathbf{X}} = [\,\mathbf{R}\;\;\mathbf{t}\,]\,\bar{\mathbf{X}}^* \qquad (8)$$

The distance d(X, π) from the world point X to the plane (π) is given by the scalar product π*^T·X̄*:

$$d(\mathbf{X}^*,\pi^*) = \rho^*(\eta^{*-1}+\xi)\,\mathbf{n}^{*T}\bar{\mathbf{x}}^* - d^* \qquad (9)$$

The homogeneous coordinates of X with respect to F*m can thus be rewritten as:

$$\bar{\mathbf{X}}^* = \rho^*(\eta^{*-1}+\xi)\,\mathbf{A}^*\,\bar{\mathbf{x}}^* + \mathbf{b}^* \qquad (10)$$

and, combining with (8),

$$\rho(\eta^{-1}+\xi)\,\bar{\mathbf{x}} = [\,\mathbf{R}\;\;\mathbf{t}\,]\left(\rho^*(\eta^{*-1}+\xi)\,\mathbf{A}^*\,\bar{\mathbf{x}}^* + \mathbf{b}^*\right) \qquad (11)$$

where

$$\mathbf{A}^* = \begin{bmatrix}\mathbf{I}_3\\ \mathbf{n}^{*T}/d^*\end{bmatrix} \quad\text{and}\quad \mathbf{b}^* = \begin{bmatrix}\mathbf{0}_{3\times 1}\\ d(\mathbf{X},\pi)/d^*\end{bmatrix} \qquad (12)$$

with H = R + (t/d*) n*^T and α = d(X, π)/d*. H is the Euclidean homography matrix written as a
function of the camera displacement and of the plane coordinates with respect to F*m. It has
the same form as in the conventional perspective case (it is decomposed into a rotation
matrix and a rank 1 matrix). If the world point X belongs to the reference plane (π) (i.e. α = 0)
then the relation above reduces to:

$$\bar{\mathbf{x}} \propto \mathbf{H}\,\bar{\mathbf{x}}^* \qquad (13)$$
Note that Equation (13) can be turned into a linear homogeneous equation x̄ × H x̄* = 0
(where × denotes the cross-product). As usual, the homography matrix related to (π) can
thus be estimated up to a scale factor, using four couples of coordinates (x̄k; x̄*k), k = 1 … 4,
corresponding to the projection in the image space of world points Xk belonging to (π). If
only three points belonging to (π) are available then at least five supplementary points are
necessary to estimate the homography matrix by using for example the linear algorithm
proposed in (Malis et al 2000). From the estimated homography matrix, the camera motion
parameters (that is the rotation R and the scaled translation t_{d*} = t/d*) and the structure of the
observed scene (for example the vector n*) can thus be determined (refer to (Faugeras et al
1988)). It can also be shown that the ratio σ = ρ/ρ* can be estimated as follows:

$$\sigma = \frac{\rho}{\rho^*} = \left(1+\mathbf{n}^{*T}\mathbf{R}^T\mathbf{t}_{d^*}\right)\frac{(\eta^{*-1}+\xi)\,\mathbf{n}^{*T}\bar{\mathbf{x}}^*}{(\eta^{-1}+\xi)\,\mathbf{n}^{*T}\mathbf{R}^T\bar{\mathbf{x}}} \qquad (14)$$
This parameter is used in our 2 1/2 D visual servoing control scheme from points.
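As a sketch of the estimation step described above, the following NumPy code recovers a homography up to scale from n ≥ 4 correspondences by solving the linear system obtained from x̄_k × H x̄*_k = 0 (a standard DLT formulation). The point normalisation and the catadioptric lifting of the image points are omitted, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def estimate_homography(x, x_star):
    """DLT estimate of H (up to scale) from the constraint  x_k x (H x*_k) = 0.

    x, x_star : (n, 3) arrays of homogeneous coordinates, n >= 4."""
    rows = []
    for (u, v, w), xs in zip(x, x_star):
        # two independent rows of the cross-product constraint per correspondence
        rows.append(np.hstack([np.zeros(3), -w * xs,  v * xs]))
        rows.append(np.hstack([w * xs,  np.zeros(3), -u * xs]))
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]           # fix the scale for readability

# synthetic check: points mapped by a known homography are recovered
H_true = np.array([[1.0, 0.1, 0.2], [0.0, 0.9, -0.1], [0.05, 0.0, 1.0]])
pts_star = np.array([[0.1, 0.2, 1], [0.4, -0.3, 1], [-0.2, 0.5, 1], [0.3, 0.3, 1], [0.0, -0.1, 1]])
pts = (H_true @ pts_star.T).T
print(np.round(estimate_homography(pts, pts_star), 3))
```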
3.2 Homography matrix from lines
Let L be a 3D straight line with binormalized Euclidean Plücker coordinates [h̄^T ū^T h]^T with
respect to Fm and with coordinates [h̄*^T ū*^T h*]^T with respect to F*m. Consider that the 3D
line L lies in the 3D reference plane (π) defined above. Let X1 and X2 be two points in the 3D
space lying on the line L. The central catadioptric projection of the 3D line L is fully defined by
the normal vector h̄ to the interpretation plane. The vector h̄ can be defined by two points of
the 3D line as h̄ = (X1 × X2)/‖X1 × X2‖. Noticing that [Hx̄*1]_× = det(H) H^{-T} [x̄*1]_× H^{-1}
([Hx̄*1]_× being the skew-symmetric matrix associated to the vector Hx̄*1) and according to (3)
and (13), h̄ can be written as follows:

$$\bar{\mathbf{h}} \propto \det(\mathbf{H})\,\mathbf{H}^{-T}\,\frac{\bar{\mathbf{x}}^*_1\times\bar{\mathbf{x}}^*_2}{\|\bar{\mathbf{x}}^*_1\times\bar{\mathbf{x}}^*_2\|} \qquad (18)$$

Since h̄* = (x̄*1 × x̄*2)/‖x̄*1 × x̄*2‖ is the normal vector to the interpretation plane expressed in the
frame F*m, the relationship between the two views of the 3D line can be written as:

$$\bar{\mathbf{h}} \propto \mathbf{H}^{-T}\,\bar{\mathbf{h}}^* \qquad (15)$$
The expression of the homography matrix in the pixel space can hence be derived using the
polar lines. As depicted above, each conic, corresponding to the projection of a 3D line in the
omnidirectional image, can be explored through its polar line. Let li and l*i be the polar
lines of the image center Oi with respect to the conics Ωi and Ω*i respectively, in the two
positions Fm and F*m of the catadioptric camera. From equation (6), the relationship given
in equation (15) can be rewritten as:

$$\mathbf{l}_i \propto \mathbf{G}^{-T}\,\mathbf{l}^*_i \qquad (16)$$

where G = K(R + (t/d*) n*^T) K^{-1}. As in the case of points, the homography matrix
related to (π) can be linearly estimated: Equation (16) can be rewritten as li × G^{-T} l*i = 0 and
G can thus be estimated using at least four couples of coordinates (l_{ik}, l*_{ik}), k = 1 … 4. The
ratio of the line depths can then be obtained as:

$$\frac{h}{h^*} = \left(1+\mathbf{t}_{d^*}^T\mathbf{R}\,\mathbf{n}^*\right)\frac{\mathbf{n}^{*T}\mathbf{K}^T\mathbf{l}^*_i}{\mathbf{n}^{*T}\mathbf{R}\,\mathbf{K}^T\mathbf{l}_i} \qquad (17)$$
These parameters are important since they are used in the design of our control scheme with
imaged lines. In the next section, we propose a vision control scheme which allows rotational
and translational motions to be fully decoupled.
4. Control schemes
In order to design a hybrid visual servoing scheme, the features used as input of the
control law combine 2D and 3D information. We propose to derive this information from
imaged points or polar lines and from the homography matrix computed and decomposed as
described in the last section. Let us first define the input of the proposed hybrid control
scheme as follows:

$$\mathbf{s} = [\,\mathbf{s}_v^T\;\;\mathbf{s}_\omega^T\,]^T \qquad (18)$$
The vector s_v depends on the chosen image features. The vector s_ω is chosen as s_ω = θu,
where u and θ are respectively the axis and the rotation angle extracted from R (i.e. the
rotation matrix between the mirror frame when the camera is in its current and in its desired
positions). The task function e to regulate to 0 is then given by:

$$\mathbf{e} = \mathbf{s}-\mathbf{s}^* = \begin{bmatrix}\mathbf{s}_v-\mathbf{s}_v^*\\ \mathbf{s}_\omega-\mathbf{s}_\omega^*\end{bmatrix} = \begin{bmatrix}\mathbf{s}_v-\mathbf{s}_v^*\\ \theta\mathbf{u}\end{bmatrix} \qquad (19)$$

where s* is the desired value of s. Note that the rotational part of the task function can be
estimated using the partial Euclidean reconstruction from the homography matrix derived in
Section 3. The exponential convergence of e can be obtained by imposing ė = −λe; the
corresponding control law is:

$$\boldsymbol{\tau} = -\lambda\,\mathbf{L}^{-1}(\mathbf{s}-\mathbf{s}^*) \qquad (20)$$

where τ = [v^T ω^T]^T (v and ω being respectively the linear and angular velocities), λ tunes the
convergence rate and L is the interaction matrix which links the variation of the feature vector s
to the camera velocity: ṡ = Lτ.
The time derivative of s_ω = θu can be expressed as a function of the camera velocity as:

$$\frac{d(\theta\mathbf{u})}{dt} = [\,\mathbf{0}_3\;\;\mathbf{L}_\omega\,]\,\boldsymbol{\tau}$$

with

$$\mathbf{L}_\omega = \mathbf{I}_3 - \frac{\theta}{2}[\mathbf{u}]_\times + \left(1-\frac{\mathrm{sinc}(\theta)}{\mathrm{sinc}^2(\theta/2)}\right)[\mathbf{u}]_\times^2 \qquad (21)$$

where sinc(θ) = sin(θ)/θ and [u]_× is the antisymmetric matrix associated to the rotation
axis u.
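A minimal sketch of the control law (20) together with the rotational interaction matrix L_ω of Equation (21). The rotation R would in practice come from the homography decomposition; here it is converted to θu with a basic axis-angle extraction (a standard formula, assumed rather than taken from the chapter), and the identity matrix is used as a placeholder for the translational block of L.

```python
import numpy as np

def skew(u):
    return np.array([[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]])

def theta_u_from_R(R):
    """Axis-angle vector theta*u extracted from a rotation matrix."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros(3)
    u = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]]) / (2 * np.sin(theta))
    return theta * u

def L_omega(theta, u):
    """Rotational interaction matrix of Equation (21)."""
    sinc = lambda x: 1.0 if np.isclose(x, 0.0) else np.sin(x) / x
    S = skew(u)
    return np.eye(3) - (theta / 2) * S + (1 - sinc(theta) / sinc(theta / 2) ** 2) * (S @ S)

def control_law(L, s, s_star, lam=0.5):
    """tau = -lambda * L^{-1} (s - s*)  as in Equation (20)."""
    return -lam * np.linalg.solve(L, s - s_star)

# toy example: pure rotational error of 90 degrees about the z-axis
theta_u = theta_u_from_R(np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float))
theta = np.linalg.norm(theta_u)
# identity used as a placeholder for the translational block of L
L = np.block([[np.eye(3), np.zeros((3, 3))],
              [np.zeros((3, 3)), L_omega(theta, theta_u / theta)]])
s = np.hstack([np.zeros(3), theta_u])
print(np.round(control_law(L, s, np.zeros(6)), 3))
```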
4.1 Using points to define s_v
To control the 3 translational degrees of freedom, the visual observations and the ratio σ
expressed in (14) are used:

$$\mathbf{s}_v = [\,x\;\;y\;\;\Gamma\,]^T \qquad (22)$$

where x and y are the current coordinates of a chosen catadioptric image point given by
Equation (1) and Γ = log(ρ). The translational part of the task function is thus:

$$\mathbf{e}_v = \mathbf{s}_v-\mathbf{s}_v^* = [\,x-x^*\;\;\;y-y^*\;\;\;\Gamma-\Gamma^*\,]^T \qquad (23)$$

where Γ − Γ* = log(ρ/ρ*) = log(σ). The first two components of s_v − s_v* are computed from the
normalized current and desired catadioptric images, and the last component can be
estimated using Equation (14).
Consider a 3D point X, lying on the reference plane (π), as the reference point. The time
derivative of its coordinates, with respect to the current catadioptric frame Fm, is given by:

$$\dot{\mathbf{X}} = [\,-\mathbf{I}_3\;\;\;[\mathbf{X}]_\times\,]\,\boldsymbol{\tau} \qquad (24)$$

and

$$\dot{\mathbf{s}}_v = \frac{\partial \mathbf{s}_v}{\partial \mathbf{X}}\,\dot{\mathbf{X}} \qquad (25)$$

with:

$$\frac{\partial \mathbf{s}_v}{\partial \mathbf{X}} = \frac{1}{\rho(Z+\xi\rho)^2}\begin{bmatrix} \rho Z+\xi(Y^2+Z^2) & -\xi XY & -X(\rho+\xi Z)\\ -\xi XY & \rho Z+\xi(X^2+Z^2) & -Y(\rho+\xi Z)\\ \dfrac{X(Z+\xi\rho)^2}{\rho} & \dfrac{Y(Z+\xi\rho)^2}{\rho} & \dfrac{Z(Z+\xi\rho)^2}{\rho}\end{bmatrix}$$
By combining equations (24), (25) and (14), it can be shown that:

$$\dot{\mathbf{s}}_v = [\,\mathbf{A}\;\;\mathbf{B}\,]\,\boldsymbol{\tau} \qquad (26)$$

where A and B are 3x3 matrices whose entries depend only on the coordinates (x, y) of the
image point, on the sensor parameter ξ and on the ratio σ estimated from Equation (14). The
task function e (Equation (19)) can thus be regulated to 0 using the control law (Equation (20))
with the following interaction matrix L:

$$\mathbf{L} = \begin{bmatrix}\mathbf{A} & \mathbf{B}\\ \mathbf{0}_3 & \mathbf{L}_\omega\end{bmatrix} \qquad (27)$$
To control the 3 translational degrees of freedom with imaged lines, the chosen visual
observation vector is:
$$\mathbf{s}_v = [\,\log(h_1)\;\;\log(h_2)\;\;\log(h_3)\,]^T \qquad (28)$$

where h1, h2 and h3 are the depths of three co-planar lines. From the time derivative of the
line depth expressed as a function of the camera velocity (Andreff et al 2002), given
by ḣ_k = (ū_k × h̄_k)^T v, it can be shown that:

$$\frac{d(\log(h_k))}{dt} = \left[\,\frac{1}{h_k}(\bar{\mathbf{u}}_k\times\bar{\mathbf{h}}_k)^T\;\;\;\mathbf{0}_3\,\right]\boldsymbol{\tau} \qquad (29)$$

According to (6) and (29), the time derivative of the vector s_v is thus given by:

$$\dot{\mathbf{s}}_v = [\,\mathbf{L}_v\;\;\mathbf{0}_3\,]\,\boldsymbol{\tau}$$

where:

$$\mathbf{L}_v = \begin{bmatrix} (\bar{\mathbf{u}}_1\times\mathbf{K}^T\mathbf{l}_{i1})^T/(h_1\|\mathbf{K}^T\mathbf{l}_{i1}\|)\\ (\bar{\mathbf{u}}_2\times\mathbf{K}^T\mathbf{l}_{i2})^T/(h_2\|\mathbf{K}^T\mathbf{l}_{i2}\|)\\ (\bar{\mathbf{u}}_3\times\mathbf{K}^T\mathbf{l}_{i3})^T/(h_3\|\mathbf{K}^T\mathbf{l}_{i3}\|)\end{bmatrix} \qquad (30)$$
Note that the time derivative of s_v does not depend on the camera angular velocity. It is
also clear that L_v is singular only if the principal point M of the mirror frame lies in the 3D
reference plane (π). The task function e can thus be regulated to zero using the control law
(20) with the following square block-diagonal interaction matrix:

$$\mathbf{L} = \begin{bmatrix}\mathbf{L}_v & \mathbf{0}\\ \mathbf{0} & \mathbf{L}_\omega\end{bmatrix} \qquad (31)$$
As can be seen in equation (30), the unknown depths h_i and the unitary orientations ū_i with
respect to the catadioptric camera frame have to be introduced in the interaction matrix. The
orientation ū_i can be estimated as follows:

$$\bar{\mathbf{u}} = \frac{\mathbf{K}^T\mathbf{l}_i\times\mathbf{R}\,\mathbf{K}^T\mathbf{l}^*_i}{\|\mathbf{K}^T\mathbf{l}_i\times\mathbf{R}\,\mathbf{K}^T\mathbf{l}^*_i\|}$$
Furthermore, if the camera is calibrated and ĥ_i is chosen to approximate h_i, then it is clear
that L_v L̂_v^{-1} is a diagonal matrix with ĥ_i/h_i, i = 1, 2, 3, as entries. The only point of equilibrium
is thus s* and the control law is asymptotically stable in the neighbourhood of s* if ĥ_i is
chosen positive. In practice, an approximated matrix L̂*^{-1}, computed at the desired position, is used to
compute the camera velocity vector and the rotational part of the interaction matrix can be
set to L_ω^{-1} = I_3 (Malis et al 1999). Finally, the control law is thus given by:

$$\boldsymbol{\tau} = -\lambda\begin{bmatrix}\hat{\mathbf{L}}_v^{*-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{I}_3\end{bmatrix}\begin{bmatrix}\mathbf{s}_v-\mathbf{s}_v^*\\ \theta\mathbf{u}\end{bmatrix} \qquad (32)$$
5. Results
5.1 Simulation Results
We now present results concerning a positioning task of a six-degrees-of-freedom robotic
arm with a catadioptric camera in an eye-in-hand configuration. The catadioptric camera used
is a hyperbolic mirror combined with a perspective camera (similar results are obtained
with a catadioptric camera combining a parabolic mirror and an orthographic lens; these
results are not presented in this paper). From an initial position, the catadioptric camera has
to reach the desired position. This means that the task function (refer to equation (19)),
computed from the homography matrix between the current and desired images, converges
to zero. To be close to a real setup, image noise has been added (additive noise with
maximum amplitude of 2 pixels) to the image and the interaction matrix is computed using
erroneous internal camera parameters. The first simulation concerns imaged points while
the second simulation concerns imaged lines.
5.1.a Imaged points
The initial and desired attitudes of the catadioptric camera are plotted in Figure 3. This
figure also shows the 3D camera trajectory from its initial position to the desired one.
Figure 3 shows the initial (blue *) and desired (red *) images of the observed target. It also shows
the trajectory of the point (green trace) in the image plane (the controlled image point
has a black trace trajectory). The norm of the error vector is given in Figure 4(b). As can
be seen in Figures 4(c) and 4(d), showing the errors between the desired and current
observation vectors, the task is correctly realized. The translational and rotational camera
velocities are given in Figures 4(e) and 4(f) respectively.
5.1.b Imaged lines
Figure 2 shows the spatial configuration of the 3D lines as well as the 3D trajectory of the
central catadioptric camera. The images corresponding to the initial and desired positions are
shown by figures 5(c) and 5(d). These figures show the projected 3D lines (conics) and the
associated polar lines. The trajectories of the conics and of the corresponding polar lines in
the image plane are given in Figures 5(a) and 5(b) respectively. These trajectories confirm
that the initial images (conics and polar lines) reach the desired images. Figures 5(e) and 5(f)
show respectively the translational and rotational velocities of the catadioptric camera. As
shown in Figures 5(g) and 5(h), the error vector e between the current and desired
observation vectors is well regulated to zero, and thus the positioning task is correctly
realized.
Figure 3. 3D trajectories of the catadioptric camera [meters]: (left) the case of points, (right)
the case of lines
Figure 4. (a) Trajectories in the image of the target points [pixels], (b) norm of the error vector, (c) error vector [meters], (d) rotation vector θu [rad], (e) translational velocity [m/s], (f) rotational velocity [rad/s]
Figure 5. Visual servoing with para-catadioptric camera: (a) initial image, (b) desired image, (c) trajectory of the conics in the image plane, (d) trajectory of the polar lines, (e) translational velocities [m/s], (f) rotational velocities [rad/s], (g) θu errors [rad], (h) s_v − s_v* vector errors
Figure 7. Visual servoing with lines: (a) initial image, (b) desired image and trajectory of the conics in the image plane, (c) translational velocities [m/s], (d) rotational velocities [rad/s], (e) s_v − s_v* and θu errors
Figure 8. Visual servoing with lines: (a) initial image, (b) desired image and trajectory of the conics in the image plane, (c) s_v − s_v* errors, (d) θu errors [rad], (e) translational velocities [m/s], (f) rotational velocities [rad/s]
6. Conclusion
In this paper, hybrid vision-based control schemes valid for the entire class of central
catadioptric cameras were presented. The geometrical relationships between imaged points and lines were
exploited to estimate a generic homography matrix from which a partial Euclidean
reconstruction can be obtained. The information extracted from the homography matrix
was then used to design vision-based control laws. Results with simulated data confirmed
the relevance of the approach. In future work, the robustness and stability analysis with respect to
calibration errors must be studied.
7. References
S. Baker & S. K. Nayar (1999). A theory of single-viewpoint catadioptric image formation. International Journal of Computer Vision, 35(2):1–22, November 1999.
J. Barreto & H. Araujo (2002). Geometric properties of central catadioptric line images. In 7th European Conference on Computer Vision, ECCV'02, pages 237–251, Copenhagen, Denmark, May 2002.
R. Benosman & S. Kang (2000). Panoramic Vision. Springer Verlag, ISBN 0-387-95111-3, 2000.
P. Blaer & P. K. Allen (2002). Topological mobile robot localization using fast vision techniques. In IEEE International Conference on Robotics and Automation, pages 1031–1036, Washington, USA, May 2002.
G. Chesi, K. Hashimoto, D. Prattichizzo & A. Vicino (2003). A switching control law for keeping features in the field of view in eye-in-hand visual servoing. In IEEE International Conference on Robotics and Automation, pages 3929–3934, Taipei, Taiwan, September 2003.
B. Espiau, F. Chaumette & P. Rives (1992). A new approach to visual servoing in robotics. IEEE Transactions on Robotics and Automation, 8(3):313–326, June 1992.
S. Hutchinson, G. D. Hager & P. I. Corke (1996). A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651–670, October 1996.
E. Malis, F. Chaumette & S. Boudet (1999). 2 1/2 D visual servoing. IEEE Transactions on Robotics and Automation, 15(2):238–250, April 1999.
Y. Mezouar & F. Chaumette (2002). Path planning for robust image-based control. IEEE Transactions on Robotics and Automation, 18(4):534–549, August 2002.
Y. Mezouar, H. Haj Abdelkader, P. Martinet & F. Chaumette (2004). Central catadioptric visual servoing from 3d straight lines. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS'04, volume 1, pages 343–349, Sendai, Japan, September 2004.
E. Malis & S. Benhimane (2003). Vision-based control with respect to planar and non-planar objects using a zooming camera. In IEEE International Conference on Advanced Robotics, pages 863–869, July 2003.
R. Vidal, O. Shakernia & S. Sastry (2003). Formation control of nonholonomic mobile robots with omnidirectional visual servoing and motion segmentation. In IEEE International Conference on Robotics and Automation, pages 584–589, Taipei, Taiwan, September 2003.
N. Winter, J. Gaspar, G. Lacey & J. Santos-Victor (2000). Omnidirectional vision for robot navigation. In Proc. IEEE Workshop on Omnidirectional Vision, OMNIVIS, pages 21–28, South Carolina, USA, June 2000.
22
Industrial Vision Systems, Real Time and
Demanding Environment: a Working Case for
Quality Control
J.C. Rodríguez-Rodríguez, A. Quesada-Arencibia and R. Moreno-Díaz jr
Institute for Cybernetics (IUCTC), Universidad de Las Palmas de Gran Canaria
Spain
1. Introduction
This chapter presents an OCR (Optical Character Recognition) procedure able to work at
very high speeds.
The architecture of the pattern recognition algorithm we present here includes certain
concepts and results which were developed in previous publications [3,4]. We consider a
production line in the beverage canning industry where cans with a faulty imprinted use-by date
or serial number have to be discharged from the line immediately.
The problem is well known in the industrial scene. A code or a set of characters is registered
on the surfaces (can bottoms) at very high speed. The registration can fail, can take place
only partially, or can print something wrong. It is important to know with certainty what
has happened. The most general solution is to read what has been printed immediately after
the printing itself.
In our particular case the surfaces are metallic (tinplate/aluminium) can bottoms and the code
denotes the use-by limit of the product (beer or similar).
Our goal is to build an application capable of processing 120,000 cans per hour (about 33 cans per
second). Nowadays, there is no application on the market able to manage this speed.
Therefore, our purpose has to do with an OCR that is confronted with two challenges:
1. Extraction of characters registered on a difficult surface.
2. Processing at very high speed.
Our keys to approach the problem have been:
1. Capable hardware
2. Domain restrictions
3. Intensive calibration
4. Parallel architecture
5. A progressive and aggressive reduction process of the interest area
6. Simple operations in integer arithmetic
7. Two independent tasks: Validation and Training
Here is a brief explanation of these keys.
1. Capable hardware: the critical requirement is that the system copes, in real time and independently of the can position, with the text analysis of the full production rate (120,000 cans per hour). The system has to be reliable and it relies on specific hardware. Thus, a high-speed acquisition camera, an efficient acquisition board, a strong multiprocessing system and considerable main-memory bandwidth are the basic requirements.
2. Domain restrictions: a specialized environment which is very well known and restricted diminishes the number of contingencies to consider and therefore allows assumptions to be made easily. See sections [1] and [2].
3. Intensive calibration: there are two types of calibration of the system. The first type focuses on guaranteeing sufficient image quality for processing; it mainly affects physical parameters of the system. The second type has to do with the training of the system: the system should be trained with the expected shapes for comparison.
4. Parallel architecture: intensive use of multithreading at several architecture layers on a strong multiprocessing system.
5. A progressive and aggressive reduction process of the interest area: reducing the input domain helps reduce the processing time.
6. Simple operations in integer arithmetic: sums, subtractions, multiplications of integers and integer divisions are the absolutely dominant operations. In general-purpose computers, the integer arithmetic unit is faster than the floating-point unit. All procedures described in this chapter avoid floating-point arithmetic in the validation computation; the floating-point operations that remain essential are kept outside the continuous validation cycle.
7. Two independent tasks, validation and training: only the validation stage is described in this chapter; there is another stage in which the knowledge base is built.
2. Visual Scenario
We will work in a dedicated industrial environment; therefore the scene of events is
restricted. Reality will be mapped to two-dimensional matrices with values between
0 and 255 representing different levels of grey. The possible states can be anticipated and listed:
they are neither too numerous nor very complex.
1. There can be nothing but cans and background in our field of vision.
2. The background has fixed and unchangeable visual characteristics [very dark colour].
3. There is NO background inside the border of a can.
4. Outside the border of a can there can only be background or other cans, but nothing else.
We are only interested in processing a single can within the frame of our camera. Therefore
the physical acquisition platform will reinforce the presence of a single clean can in each
capture. On the other hand, the hardware/software system will have to guarantee that no
can is lost during acquisition: every can processed by the installation will also be processed
by our system.
3. System Preconditions
The code is associated with a single orientation and processing direction. In order to compare
the expected code with the acquired code during validation both codes must have the same
orientation and processing direction. The question is: How do we know the orientation and
processing direction of the acquired code?
We have the following facts in our favour:
1. Once the print head and the acquisition source camera are fixed, orientation and
processing direction are fixed for all cans.
2. The print head and the camera can be freely rotated if it is more convenient to our
purposes. There are no other factors to consider except our own satisfaction.
3. Due to the previous facts, we can impose an orientation and processing direction
for the code. Therefore, these are known from the beginning, before processing starts; it
is not necessary to make a specific computation to obtain them.
As we will see soon, having the orientation parallel to the natural vertical and
horizontal axes will make the computation easier, and this is what we aim for.
perceptiveness to certain stimuli. A purposeful task can be the simple stimulus presence
detection.
The stimulus is the can. The task is to determine if a can is in the vision field or not. If so,
estimate the position of its centre within the vision field.
The facts which support our procedure are:
1. The cans always are shown up at the vision field like can bottoms.
2. The can bottoms are always brighter than the background.
3. The can bottoms are nearly circular.
4. The can bottoms run a restricted and predictable trajectory inside the vision field.
The idea is to look for any measure that has maximum or minimum values according to fact
that the can is present or absent. Arbitrary noise in the image should not provoke false
positives or false negative easily. On the other hand, the measure should be computed fast.
The histogram is a classification and count of levels of grey steps. Provide the distribution of
colour of a collection of points. It is a very fast operation because only it implies queries,
comparisons and sums.
The fact 2) establishes that the presence/absence of a can modifies the distribution of color.
Therefore, the fact 2) determines the sensibility to histograms.
We should look for an optimal collection of points that provides very different histograms
according to the fact that the can is present or not. The collection of points should be
efficient and avoiding unnecessary queries.
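A rough sketch of the presence test discussed above: grey levels are sampled along an ellipse and their histogram is compared against the dark background. The ellipse parameters, the sampling density and the brightness threshold are all assumptions for illustration, not the calibrated values used in the installation.

```python
import numpy as np

def ellipse_samples(cx, cy, a, b, n=72):
    """Pixel coordinates sampled along an ellipse centred at (cx, cy)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return (cy + b * np.sin(t)).astype(int), (cx + a * np.cos(t)).astype(int)

def can_present(image, cx, cy, a, b, bright_level=100, min_bright_fraction=0.6):
    """Histogram-style test: a can bottom is much brighter than the background,
    so most samples on an ellipse covering the can should exceed bright_level."""
    rows, cols = ellipse_samples(cx, cy, a, b)
    rows = np.clip(rows, 0, image.shape[0] - 1)
    cols = np.clip(cols, 0, image.shape[1] - 1)
    values = image[rows, cols]
    hist, _ = np.histogram(values, bins=256, range=(0, 256))
    bright = hist[bright_level:].sum() / values.size
    return bright >= min_bright_fraction

# toy frame: dark background with a bright circular "can bottom"
frame = np.full((480, 640), 20, dtype=np.uint8)
yy, xx = np.ogrid[:480, :640]
frame[(xx - 320) ** 2 + (yy - 240) ** 2 < 120 ** 2] = 180
print(can_present(frame, 320, 240, 80, 80), can_present(frame, 60, 60, 80, 80))  # True False
```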
Figure 4. A can could be located at any position, sweeping the image with the described receptive fields
Figure 5. The nature of the can-filling chain imposes that cans move along a restricted line, so the search can be restricted to the exploration of that axis
Figure 6. Processing first the circles with greater probability of being active can improve the speed
Why elliptical histograms? The frames produced by the camera are distorted due to the
natural limitations of the quality of the optics. This distortion does not appreciably affect
the readability of the code, but the cans lose their circular shape as they get close to the
boundaries of the image.
5. Thickening
The thickening pre-process improves the general quality of the image at a low computational
cost. It removes superfluous details of the image without harming the readability
of the code.
Its main advantage is that it provides a code with less fragmented and more consistent
characters.
Its main disadvantage is that characters can erroneously stick to each other.
The procedure involves applying a limited convolution to the winning ellipse of the
elliptical-histogram algorithm.
Figure 11. The convolution mask; hence there are no complex operations involved
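A simple sketch of a thickening step of this kind, implemented as a dilation-like pass with a small 3x3 mask over the binary ink image; the all-ones 3x3 mask is an assumption, since the chapter's actual convolution mask (Figure 11) is not reproduced here.

```python
import numpy as np

def thicken(ink, mask_size=3):
    """Thicken a binary 'ink' image (1 = printed stroke) by a dilation-like pass:
    a pixel becomes ink if any pixel under the mask is ink."""
    pad = mask_size // 2
    padded = np.pad(ink, pad, mode="constant")
    out = np.zeros_like(ink)
    h, w = ink.shape
    for dy in range(mask_size):
        for dx in range(mask_size):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# toy example: a fragmented vertical stroke becomes a solid, thicker one
stroke = np.zeros((7, 7), dtype=np.uint8)
stroke[1:6:2, 3] = 1           # broken stroke (every second pixel missing)
print(thicken(stroke))
```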
6. Segmentation by flooding
The reduction of the number of queries is the reason that justifies the application of this
technique. A query is the reading/writing of a pixel of the actual image.
It is possible to go through every pixel of the image checking if they are part of the code. A
systematic tour will give us the best results. However, its very important to avoid
unnecessary operations/queries. The Flooding Techniques are one of the possible solutions
to use.
The mechanism of the flooding technique main goal (also known as Pixel Progressive
Addition Method) is to obtain which is result as the systematic tour, but with less
operations/queries.
Its use should be success full because it depends on two easily contrastable principles:
1. The majority of queries will give a negative result concerning their belonging to the
code.
2. The queries results with positive code are those positions which have only a shortdistance from each other positions.
The flooding as well as the segregation of points that are part of the code from those points
that are not, groups the located points of code in useful sets. The pixels that form the main
semantic object scene (the code) can be grouped in sub-semantic objects (characters).
Therefore, this is the decisive step of transforming pixels into abstract entities (characters of
code).
2. Choose a not-yet-discarded seed; if none is available, exit. If the seed has not yet been
marked as visited, mark it and go to step 3. Otherwise discard it and repeat step 2.
3. If there is no ink in the seed, declare it sterile and discard it. Return to step 2.
Otherwise go to step 4.
4. Open up a flooding point. The seed is declared the flooding starting point.
5. Look up the immediate-proximity neighbours, marking all performed queries as
visited neighbours. Go to step 6.
6. Declare every neighbour marked with ink as a flooding point. Return to step 5. If all
neighbours with ink have been processed, go to step 7.
7. The flooded zone is discarded if it has collected too many points (too big to be a character, so it is just noise).
Flooding Completed
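A sketch of the flooding idea: starting from seed positions, ink pixels are grown into connected zones while every query is marked as visited so it is never repeated, and zones that collect too many points are rejected as noise. The 4-neighbourhood, the seed grid spacing and the size threshold are illustrative assumptions.

```python
from collections import deque
import numpy as np

def flood_zones(ink, seed_step=4, max_points=200):
    """Group ink pixels (binary image, 1 = ink) into connected flooded zones.

    Seeds are taken on a sparse grid; most of them are sterile (no ink), which is
    what keeps the number of queries low compared with a full systematic scan."""
    visited = np.zeros_like(ink, dtype=bool)
    zones = []
    h, w = ink.shape
    for sy in range(0, h, seed_step):
        for sx in range(0, w, seed_step):
            if visited[sy, sx] or ink[sy, sx] == 0:
                visited[sy, sx] = True          # sterile or already-used seed
                continue
            queue, zone = deque([(sy, sx)]), []
            visited[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                zone.append((y, x))
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx]:
                        visited[ny, nx] = True   # every query is marked as visited
                        if ink[ny, nx]:
                            queue.append((ny, nx))
            if len(zone) <= max_points:          # too many points -> just noise, discard
                zones.append(zone)
    return zones

ink = np.zeros((20, 30), dtype=np.uint8)
ink[5:12, 4:7] = 1     # one character-like blob
ink[5:12, 10:13] = 1   # another one
print([len(z) for z in flood_zones(ink)])   # two zones of 21 pixels each
```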
7. Grouping
The main goal of grouping is to distribute the flooded zones obtained in the previous stage
among the real characters of the code. It is composed of three steps.
7.1 Division in Bands
This technique tries to separate the lines of printed code. Additionally, useful
information for noise suppression can be obtained. The requirement is that the orientation of the
expected code has to be known previously. This condition holds because it was
established in section [2].
The purpose of this technique is to divide the image into bands. A band is a longitudinal
section of an image which is parallel to the code orientation axis. This technique tries to
contain each of the different processed code lines in a band. Therefore, a band can be empty
(or filled with noise) or contain a code line. Bands are described by their boundaries.
Ideally, the band boundaries are straight lines. However, a band boundary cannot cut
flooded zones, so it has to surround the flooded zones in an optimal way. Band
boundaries have their starting points separated at a certain distance from each other, and
should not cross each other.
The technique takes advantage of the interline space of the code: it is hoped that the
interline space can be easily followed by a band boundary.
The band boundaries are generated by the following procedure:
1. Assume an origin: the beginning of the boundary, and the direction of movement.
2. Check whether the end of the image has been reached or whether there has been a cross
between boundaries (a band boundary can never cross another boundary). If the check
is positive, exit.
3. Apply a collision test, in the direction of movement, with a flooded zone.
3.1. If the test is negative, advance a step in a straight line (direction of movement).
3.2. If the test is positive, surround the flooded zone, choosing the direction of
movement with the least resistance. If there are several options of movement with equal
resistance, choose one randomly.
4. Return to step 2.
2. The overlap dimension of a flooded zone is equal to the greatest distance between
the projected points of the flooded zone on the axis perpendicular to the overlapped
axis.
3. The segment traced between the borderline points of the overlap dimension is denoted
the overlap segment of the flooded zone.
4. Two flooded zones overlap in relation to a suitable overlay axis when it is
possible to draw at least one line parallel to this axis that cuts their two overlap segments.
5. The overlay property applies equally to all implicated flooded zones.
6. The overlay among flooded zones from different bands is always zero.
7. A flooded zone can overlap with multiple flooded zones.
8. The number of non-coincidental parallel lines of overlap is the dimension of the
overlay between two flooded zones. Although the dimension of the overlay is the same
for the two flooded zones, the degree of overlay of each flooded zone is the quotient
between the aforementioned dimension and the dimension of its overlapped axis.
9. A flooded zone is included by another one that it overlaps if all its points
are intersected by the overlapping rays. Two or more flooded zones can include
each other mutually.
10. The distance between the overlap segment midpoints should not exceed a certain
threshold for the overlay to be accepted.
Here is the procedure description:
1. A table of overlay of the flooded zones is built.
The overlay table is a collection of all the degrees of overlay between the existing flooded
zones. It is calculated by confronting all the zones among themselves, applying the
described rules.
The table construction requires establishing the overlay axis perpendicular to the code
orientation axis.
Table: example of the overlay table confronting Zones I to IV, containing the degree of overlay (in %) between every pair of flooded zones.
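A small sketch of the overlay computation used to build such a table: each flooded zone is projected onto the axis perpendicular to the overlap axis, and the degree of overlay is the length of the intersection of the two projected intervals divided by the zone's own projection length. The axis handling is simplified to the vertical/horizontal case implied by the known code orientation.

```python
def projection_interval(zone, axis=0):
    """Projection of a flooded zone (list of (row, col) points) onto an axis:
    axis=0 projects onto rows, axis=1 onto columns."""
    coords = [p[axis] for p in zone]
    return min(coords), max(coords)

def overlay_degree(zone_a, zone_b, axis=0):
    """Degree of overlay of zone_a with respect to zone_b (0..1):
    intersection of the projected intervals over zone_a's own interval length."""
    a_lo, a_hi = projection_interval(zone_a, axis)
    b_lo, b_hi = projection_interval(zone_b, axis)
    inter = min(a_hi, b_hi) - max(a_lo, b_lo)
    length = a_hi - a_lo
    if inter <= 0 or length == 0:
        return 0.0
    return inter / length

# two character-like zones sharing most of their row span -> high overlay degree
zone_i = [(r, c) for r in range(5, 12) for c in range(4, 7)]
zone_ii = [(r, c) for r in range(6, 12) for c in range(10, 13)]
print(round(overlay_degree(zone_i, zone_ii), 2), round(overlay_degree(zone_ii, zone_i), 2))
```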
8. Validation
The steps of validation are:
1. Labels of character
It is possible to find the coordinates of the flooded zones in terms of the expected codes
because the orientation and processing direction of the printed signs are known.
We will label each flooded zone with the pair (line index, position of the character
inside the line). In this way (1, 4) means second line (zero is the first), fourth character.
2. Retrieval of expected character
By means of the described pair, it is possible to recover the family of morphologies of the
expected character from our base of morphologies of learned characters.
Figure: overview of the inspection pipeline: acquisition (camera and illumination system), followed by the text irregularities test, the legibility test and the code validation (syntax and semantic tests) that produce the final decision.
10. References
Alemán-Flores, M., Leibovic, K.N. & Moreno-Díaz jr, R. (1997). A Computational Model for Visual Size, Location and Movement. Springer Lecture Notes in Computer Science, Vol. 1333, Springer-Verlag, Berlin Heidelberg New York, 406-419.
Quesada-Arencibia, A., Moreno-Díaz jr, R., Alemán-Flores, M. & Leibovic, K.N. (2001). Two Parallel Channel CAST Vision System for Motion Analysis. Springer Lecture Notes in Computer Science, Vol. 2178, Springer-Verlag, Heidelberg New York, 316-327.
Quesada-Arencibia, A. (2001). Un Sistema Bioinspirado de Análisis y Seguimiento Visual de Movimiento. PhD Thesis, Universidad de Las Palmas de Gran Canaria.
Rodríguez Rodríguez, J.C., Quesada-Arencibia, A., Moreno-Díaz jr, R. & Leibovic, K.N. (2003). On Parallel Channel Modelling of Retinal Processes. Springer Lecture Notes in Computer Science, Vol. 2809, Springer-Verlag, Berlin Heidelberg New York, 471-481.
Leibovic, K.N. (1990). Science of Vision, Springer Verlag, New York.
23
New Types of Keypoints for Detecting Known
Objects in Visual Search Tasks
Andrzej Sluzek1,2 and Md Saiful Islam1,3
1Nanyang
1. Introduction
Visual exploration of unknown environments is considered a typical and highly important
task in intelligent robotics. Although robots with visual capabilities comparable to human
skills (e.g. mushroom-picking robots or bird-viewing robots) are apparently unachievable in
the near future, the concept of robots able to search for known objects in unknown
surroundings is one of the ultimate goals for machine vision applications. In the scenarios
that are currently envisaged, the expectations should be realistically limited. Nevertheless,
one can expect that a robot, after a visual presentation of an object of interest, should be able
to learn it and, subsequently, to detect the same object in complex scenes which may be
degraded by typical effects, i.e. partial visibility of the objects (due to occlusions and/or
poor illumination) and their unpredictable locations. The purpose of this chapter is to
propose a novel mechanism that is potentially useful (it has been confirmed by promising
preliminary results) in such applications.
Several theories exist explaining the human perception of objects (e.g. Edelman, 1997). Some
researchers promote the importance of multiple model views (e.g. Tarr et al., 1997) others
(e.g. Biederman, 1987) postulate viewpoint invariants in the form of shape primitives
(geons). From all the theories, however, the practical conclusion is that vision systems
detecting objects in a human-like manner should use locally-perceived features as the
fundamental tool for matching the scene content to the models of known objects.
The idea of using local features (keypoints, local visual saliencies, interest points,
characteristic points, corner points; several almost equivalent names exist) in machine
vision can be traced back to the 80s (e.g. Moravec, 1983; Harris & Stephens, 1988). Although
stereovision and motion tracking were initially the most typical applications, it was later
found that the same approach can be used in more challenging tasks (e.g. matching images
in order to detect partially hidden objects). A well-known Harris-Plessey operator (Harris &
Stephens, 1988) was combined with local descriptors of detected points to solve object
recognition problems in which local features from analysed images are matched against a
database of images depicting known objects (e.g. Schmid & Mohr, 1995). The intention was
to retrieve images containing arbitrarily rotated and partially occluded objects.
Subsequently developed keypoint detectors address the issues of scale changes (this was the
weakest point of the original detectors) and perspective distortions. Generally, to achieve
2. Approximation-based Keypoints
2.1 Pattern-based Approximations
Recently (in Sluzek, 2005) a method has been proposed for approximating circular images
with selected predefined patterns. Although corners and corner-like patterns (e.g. junctions)
are particularly popular and important, the method is applicable to any parameter-defined
patterns (both grey-level and colour ones, though the latter are not discussed in this
chapter).
We assume that a grey-level circular pattern is modelled by several configuration parameters
and intensity parameters (as shown in exemplary patterns given in Figure 1). Typical
patterns are specified by 2-3 configuration parameters and 2-3 intensities. The radius R of a
pattern can be arbitrarily selected. Thus, if a configuration parameter is a length (e.g. the length parameter in
Figure 1B, or the two length parameters in Figure 1C), it should be measured both absolutely and relative to
the radius.
$$\theta_2 = \arctan 2\left(m_{01},\,m_{10}\right) \qquad (1)$$

$$\theta_1 = 2\arcsin\sqrt{1-\frac{16\left[(m_{20}-m_{02})^2+4m_{11}^2\right]}{9R^2\left(m_{10}^2+m_{01}^2\right)}} \qquad (2)$$

For T-junctions (Figure 1D) the angular width θ1 and the orientation angle θ2 can be found from equations (3) and (4), the latter being:

$$m_{01}\cos\theta_2 - m_{10}\sin\theta_2 = \frac{4}{3R}\sqrt{(m_{20}-m_{02})^2+4m_{11}^2} \qquad (4)$$
(4)
where m10, m20, etc. are moments of the corresponding orders computed in the coordinate system
attached to the centre of circular windows.
The intensities of such approximations can be estimated using other moment-based expressions
(details in Sluzek, 2005).
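A sketch of how the corner parameters of equations (1) and (2) can be obtained from the moments of a circular window. The moment computation over a disc mask and the synthetic corner image are assumptions made for the example, and the symbol names θ1, θ2 follow the reconstruction above.

```python
import numpy as np

def window_moments(img, cx, cy, R):
    """Low-order moments m_pq of a circular window, in coordinates attached to its centre."""
    ys, xs = np.mgrid[-R:R + 1, -R:R + 1]
    mask = xs ** 2 + ys ** 2 <= R ** 2
    patch = img[cy - R:cy + R + 1, cx - R:cx + R + 1].astype(float)
    m = {}
    for p in range(3):
        for q in range(3 - p):
            m[(p, q)] = np.sum((xs ** p) * (ys ** q) * patch * mask)
    return m

def corner_parameters(m, R):
    """Equations (1) and (2): orientation theta2 and angular width theta1 of the
    best corner approximation of the window content."""
    theta2 = np.arctan2(m[(0, 1)], m[(1, 0)])
    num = 16.0 * ((m[(2, 0)] - m[(0, 2)]) ** 2 + 4.0 * m[(1, 1)] ** 2)
    den = 9.0 * R ** 2 * (m[(1, 0)] ** 2 + m[(0, 1)] ** 2)
    theta1 = 2.0 * np.arcsin(np.sqrt(np.clip(1.0 - num / den, 0.0, 1.0)))
    return theta1, theta2

# synthetic 60-degree bright corner on a dark background, centred in the window
R = 20
ys, xs = np.mgrid[-R:R + 1, -R:R + 1]
angle = np.arctan2(ys, xs)
img = np.where(np.abs(angle) <= np.deg2rad(30), 200, 40).astype(np.uint8)
t1, t2 = corner_parameters(window_moments(img, R, R, R), R)
print(np.degrees(t1), np.degrees(t2))  # roughly 60 and 0
```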
Exemplary circular windows (containing actual corners, T-junctions and more random
contents) are given in the top row of Figure 2. The bottom row shows the optimum corner or
T-junction approximations. For some irregular images the approximations do not exist, i.e.
the corresponding equations have no solutions.
(i.e. the approximation with the highest similarity) indicates how accurately the pattern of
interest is actually seen in the image.
Alternative methods of quantifying similarity between an image and its pattern
approximation have been given in past papers (Sluzek, 2005; Sluzek, 2006). Unfortunately,
the complexity of both methods is as high as the complexity of building the approximations.
It has been eventually found that highly satisfactory results can be achieved in a simpler
way by comparing moments of circular images (these moment have to be computed
anyway) and moments of their approximations (those moments can be immediately
calculated from the parameters of the approximation). Thus, the similarity between a
circular image I and its approximation AI can be quantified using one of the following
similarity functions:
sim 1 ( I , AI ) = K
(5)
(6)
sim 2 ( I , AI ) = K
m 20 + m 02 + abs ( m 11 )
where mpq and mapq are moments of I and AI (respectively) and K, are arbitrarily selected
positive values.
Figure 3. Top row: a sequence of windows moving across a high-quality corner image. Bottom
row: corresponding corner approximations and the similarity levels (for the last window the
corner approximation does not exist)
If at a certain location an image contains a fragment similar to the pattern of interest, a high
level of similarity between the content of a scanning window located there and its
approximation is expected. However, a high similarity level would be found not only for the
actual location but also for neighbouring locations. The similarity, nevertheless, reaches a
local maximum at the actual location. Figure 3 illustrates this effect.
Moreover, if an image contains a certain pattern, the similarity between the window content
and the approximations exists for a certain range of radii of the scanning window and the
approximations are consistent over this range of radii (instead, the scanning window may
remain the same, but the image is resized correspondingly over the range of scales). An
example showing such a consistency both for the configuration and intensity parameters for
a selected fragment of a digital image (containing a T-junction) is given in Figure 4.
Thus, our proposal of the novel type of keypoints is based on the above discussion.
Figure 4. Top row: a sequence of 15-pixel windows over a gradually enlarged image of a T-junction. Bottom row: corresponding T-junction approximations and the similarity levels
Definition 1
For a given image, pixel (x,y) is (subject to additional requirements explained below) an
approximation-based model keypoint (shortly model keypoint) defined by a circular pattern TP if
for the scanning windows located at (x,y):
1. TP pattern-based approximation exists for each radius R from a certain range (R1, R2).
2. The approximations have consistently similar parameters over the whole range of radii
(R1, R2).
3. If several neighbouring pixels satisfying (1) and (2) exist, the model keypoint is located
at the pixel where the similarity between the scanning windows and their
approximations reaches a local maximum.
Typically recommended additional requirements (introduced for practical reasons) are as
follows:
Pattern-specific constraints may exist. For example, the angular width of a corner
approximation should not be too close to 180° (it becomes an edge then) or to 0° (it
effectively becomes a line tip).
Figure 6. Corner-based model keypoints (A) and 90° T-junction-based model keypoints (B) detected in a more complex image of normal quality. Scanning window radii range from 5 to 20 pixels
2.3 Scene Keypoints for Object Detection
Computational complexity of model keypoints may be too high for real-time applications of
machine vision. If, however, similar keypoints can be detected online in inspected images,
model keypoints would be very reliable references for matching content of images to the
available models. Therefore, we propose a simplified variant of model keypoints, so-called
scene keypoints. The definition of scene keypoints is very similar to Def. 1.
Definition 2
For a given image, pixel (x,y) is an approximation-based scene keypoint of radius R (shortly scene
keypoint) defined by a circular pattern TP if for the scanning windows located at (x,y):
1. The approximations by TP pattern exist for the scanning radius R and for another
radius Rsub, where Rsub is a predefined constant percentage of R (the recommended
value for Rsub is approx. 70% of R).
2. The approximation parameters obtained for the R and Rsub radii are similar.
3. If several neighbouring pixels satisfying (1) and (2) exist, the scene keypoint is located
at the pixel where the similarity between both scanning windows and their
approximations reaches a local maximum.
Usually, the practical constraints defined and explained after Def. 1 are also applicable to
the above definition.
Computational complexity of detecting scene keypoints is much lower. Moments of only
two windows (R and Rsub radius) are computed at each location, and reusability of moment
calculations both at the current location and for neighbour pixels can be exploited. The
equations for parameter identification are also used only twice.
Figs 7 and 8 contain exemplary images with scene keypoints detected (for corners and 90° T-junctions) using windows of radii 10 and 7 pixels. Obviously, for the same images the
number of scene keypoints is larger than the number of model keypoints because the
detection algorithm is much less restrictive. Even though for perfect-quality images
(compare Figure 5 to Figure 7) we would expect the same keypoints, the presence of
additional keypoints can be explained by digital effects and mathematical properties of the
moments used. Nevertheless, each model keypoint is also always detected as a scene
keypoint.
Figure 7. Corner-based scene keypoints (A) and 90° T-junction-based scene keypoints (B) detected in the images from Figure 5. Scanning window radii are 7 and 10 pixels
Figure 8. Corner-based scene keypoints (A) and 90° T-junction-based scene keypoints (B) detected in the image from Figure 6. Scanning window radii are 7 and 10 pixels
Matching scene keypoints extracted from analysed images to the database model keypoints
is the fundamental operation in the proposed object detection framework. The following
section discusses practical aspects of matching. In particular, the adaptability of the method
(through selection of thresholds and matching rules) is highlighted.
with just a few objects captured from a reasonable number of viewpoints can easily reach
tens of thousands. Thus, the matching procedures become a serious computational problem.
In order to optimise the matching and to avoid too many potential matches, researches
either propose multidimensional descriptors of the keypoints and/or use carefully designed
matching schemes. For example, 128 gradient-based directional descriptors are used in
Lowe, 2004, while in Islam, 2006 only five moment descriptors are used but an efficient
hashing technique has been developed to speed up keypoint matching.
In the proposed method, the abovementioned problems are significantly simplified. Even if
the overall number of model keypoints is large, they are divided into different categories
(defined by different patterns) that can be handled independently. Scene keypoints are
similarly divided into the same categories (even though the total number of scene keypoints
for typical images may look larger than the numbers seen in other works). Eventually, each
scene keypoint is only matched to the model keypoints in the same category, which greatly
reduces the computational effort and allows parallelisation of the matching process.
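A minimal sketch of this category-restricted matching is given below. The dictionary-based keypoint layout (a 'category' label plus a 'descriptor' vector of approximation parameters) and the distance threshold are illustrative assumptions, not the chapter's actual data structures.

from collections import defaultdict
import numpy as np

def match_by_category(scene_kps, model_kps, max_dist=0.2):
    """Match each scene keypoint only against model keypoints of the same pattern
    category (e.g. 'corner', 't_junction'); categories can be processed in parallel."""
    models = defaultdict(list)
    for mk in model_kps:
        models[mk['category']].append(mk)
    matches = []
    for sk in scene_kps:
        candidates = models.get(sk['category'], [])   # only the same category is searched
        for mk in candidates:
            d = np.linalg.norm(np.asarray(sk['descriptor']) - np.asarray(mk['descriptor']))
            if d <= max_dist:
                matches.append((sk, mk, d))
    return matches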
Descriptors of both model and scene keypoints are obviously parameters of the
corresponding pattern approximations. Such descriptors can be used more selectively than
other descriptors (e.g. Koenderink & van Doorn, 1987; Lowe, 2004; Islam, 2006, etc.) that are
based on general properties of image intensities. Generally, the processes of keypoint
detection and matching can be adaptively tuned to various applications. Three issues are
highlighted below (the problem of scale invariance is separately discussed in Subsection 3.1).
Thresholds
The number of extracted keypoints depends on several threshold values (see Subsections 2.2
and 2.3) defining the acceptable accuracy of pattern approximations and the minimum
levels of visual prominence (contrasts) of the scene keypoints. It is possible, for example, to
demand high accuracy and to accept very low contrasts. Then the method would be able to
identify only those image fragments that are very accurately approximated by the patterns
used. However, such fragments may not be even visible to a human eye. Typically, such
requirements can be used for search in poorly illuminated scenes (detection of frauds in
images may be another application). Alternatively, only highly-contrasted approximations
could be accepted as keypoints with less demands regarding the accuracy of the
approximations. This would be potentially useful for detecting objects that may be seen
differently than in the database images (but the scenes are expected to be well illuminated).
Moreover, the level of acceptable differences between the descriptors of model keypoints
and scene keypoints determines the overall behaviour of the method (high numbers of
keypoints with possibly many false positives versus high confidence keypoints only).
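The trade-off described above can be captured by a small set of tuning thresholds, as sketched below; the field names and numeric values are hypothetical and only illustrate the two opposite settings mentioned in the text.

from dataclasses import dataclass

@dataclass
class KeypointThresholds:
    """Illustrative tuning knobs for the detection/matching trade-offs (hypothetical names)."""
    min_approx_accuracy: float   # how well the pattern must approximate the window
    min_contrast: float          # minimum visual prominence of a keypoint
    max_descriptor_diff: float   # tolerance when comparing scene and model descriptors

# High accuracy, very low contrast: e.g. search in poorly illuminated scenes
dark_scene = KeypointThresholds(min_approx_accuracy=0.95, min_contrast=0.02,
                                max_descriptor_diff=0.10)
# Lower accuracy, high contrast: objects that may look different from the database images
well_lit = KeypointThresholds(min_approx_accuracy=0.70, min_contrast=0.30,
                              max_descriptor_diff=0.25)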
Configuration parameters
The configuration parameters of keypoint approximations have a higher priority as they
specify the geometry of the local structures of the observed scenes. However, the parameters
defining rotations of the patterns (e.g. in Figs 1A, 1B and 1D) should be carefully
used for matching (unless the search is for objects at certain orientations). Generally, the
orientation parameters are used only in later stages (see Section 4). Moreover, parameters
indicating distances (e.g. in Figs 1B and 1C) should be measured both absolutely and
relative to the window radius (for scale invariance; more in Subsection 3.1).
Figure 9. Examples of windows matched using different patterns and for diversified
conditions (optimum approximations also shown for reference):
(A) Low accuracy of approximations acceptable. High similarity for angular widths and intensities required. Orientation match ignored.
(B) Low accuracy of approximations acceptable. High similarity between angular widths required. Only relational match for intensities. Orientation match ignored.
(C) Low accuracy of approximations acceptable. Similarity between angular widths ignored. Only relational match for intensities. Orientation match required.
(D) No contrast thresholds in approximations. Low accuracy of approximations acceptable. All configuration parameters matched. Intensity matching not used.
(E) High accuracy of approximations required. All configuration parameters matched. Intensities matched proportionally.
(F) Good accuracy of approximations required. High similarity between line widths required. Intensity matching not used.
Intensity parameters
The intensity parameters of keypoint approximations can be used more selectively than the
configuration parameters (and their significance is usually lower). In the extreme scenarios
they are not used in the matching process at all (i.e. only the local structures of the objects
are important), although the other extreme is to match them accurately (to detect keypoints
viewed in the same illumination conditions). Typically, either only relations between the
intensities are verified (e.g. a scene corner keypoint can match a given model corner
keypoint if the acute section is lighter than the obtuse one, see Figure 1A) or the
proportions between the intensities of keypoints should match to a certain level.
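These three ways of using the intensity parameters can be summarised in a single helper, sketched below; the tolerance for the proportional mode is an assumed value.

def intensities_match(scene_int, model_int, mode="relational", tol=0.25):
    """Simplified sketch of the three intensity-matching policies discussed above.
    scene_int and model_int are the two intensity parameters of a keypoint
    approximation (e.g. brighter and darker section of a corner)."""
    s1, s2 = scene_int
    m1, m2 = model_int
    if mode == "ignore":          # only the local structure matters
        return True
    if mode == "relational":      # e.g. the acute section must stay the lighter one
        return (s1 > s2) == (m1 > m2)
    if mode == "proportional":    # intensity ratios must agree up to a tolerance
        return abs((s1 + 1e-9) / (s2 + 1e-9) - (m1 + 1e-9) / (m2 + 1e-9)) <= tol
    raise ValueError(mode)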
To illustrate the above issues, Figure 9 presents exemplary pairs of circular windows (shown
in the same scale, since scale invariance is discussed separately in Subsection 3.1) that can be matched
under various (sometimes not very realistic) assumptions. The windows are already placed
at the local maxima of similarity functions so that if keypoints are extracted they would be
found at the same locations. The corresponding pattern approximations are also given to
highlight that matching is actually performed between the approximations rather than
between the original contents of windows.
3.1 Scale Invariance in Keypoint Matching
Although the examples given in Figure 9 address the issue of matching circular windows of
the same radius, the same approach can be used for matching scene keypoints of the same
size. The only difference is that the match should be satisfactorily established both for the
outer windows (of radius R) and for their sub-windows (of radius Rsub). However, matching
objects shown in arbitrary scales to their models (i.e. matching scene keypoints to the model
keypoints) can be done only under additional assumptions.
If a visual correspondence between a fragment of a model image and a fragment in an
inspected image exists, it can be generally confirmed by a match between the corresponding
model keypoint (defined for the radius range (R1, R2), see Def. 1) and the scene keypoint
(defined by radii R and Rsub) only when:
R1 ≤ σ·Rsub    and    σ·R ≤ R2    (7)
where σ is the relative scale between the model image and the processed image.
The relative scale defines how much the size of an object (measured in the image units) has
been changed against the size of the same object in the model image. The relative scale is
jointly determined by the image resolution, the camera-object distance and the camera focal
length. A detailed analysis of relative scale issues in the context of object detection is given in
(Saiful, 2006).
In Section 2, we extract exemplary model keypoints using the range of radii (R1, R2) from 5
to 20 pixels, while exemplary scene keypoints are found using 10 and 15 pixels. From Eq. (7)
we can immediately calculate that for such conditions images of objects of interest can be
prospectively matched to the model images if the relative scale changes from 1.3 to 0.33. It
means that the objects can be only insignificantly enlarged, but they can be reduced in size
up to three times. These results correspond to requirements of typical applications (e.g. in
mobile robotics) where exemplary objects of interest are available so that their images can be
captured from a close proximity. In the actual search operations, however, those objects
would be usually seen from a longer distance, i.e. the size reduction in captured images is
more likely to happen.
Moreover, the approximation parameters representing distances (e.g. in Figure 1C)
should be matched in a special way. They are invariant in terms of absolute distances under
usage of variable-radius windows, but they are not invariant relative to the radius.
Thus, if a scene keypoint is captured in an unknown scale, such parameters cannot be
directly matched to the values in model keypoints. However, they can be later used for
verifying the validity of the matches (see Sub-section 4.1).
It should be finally remarked that the selection of radius ranges over which the model
keypoints are built affects both scale-sensitivity and robustness of object detection. With
wider (R1, R2) the scale invariance is obviously expanded to more scales. However, the
number of model keypoints can be reduced as the pattern approximations must be stable
over a wider range of radii. Therefore, the abilities to detect objects (both fully and partially
visible ones) deteriorate. For occluded objects, fewer locations corresponding to model
keypoints are seen, while for fully visible objects fewer correspondences can be found to
verify hypotheses about the presence of objects. Limited (R1, R2) results in the opposite
effects, i.e. the scale invariance is reduced to a narrower range, but the method is potentially
able to detect objects under stronger occlusions and/or in poorer visibility conditions.
Figure 10. Model images (A and C) successfully matched to test images (B and D,
respectively). Clusters of matching keypoints are shown
The minimum number of nodes in the subgraphs (i.e. the number of consistently matched
keypoints) required for confirmation of the object's identity may depend on the set of objects
under consideration. However, our experiments and statistical analysis show that usually 5
keypoints are enough. It can be noticed that the incorrect match between Figure 10C and
Figure 10D is confirmed only by three keypoints.
The hypotheses verification can be additionally supported by the analysis of configuration
parameters of scene keypoints. In particular, only those keypoints from a single cluster
would be used for building a scene-graph which are consistently rotated with respect to the
corresponding model keypoints (see the last column of Table 2). This is a very powerful
constraint that greatly reduces the complexity of the hypothesis verification procedure.
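A hedged sketch of this verification step is given below: a cluster confirms an object only if at least five of its matched keypoints are rotated consistently with respect to their model keypoints. The spread threshold and the match record layout (an 'orientation_diff' field in degrees, cf. the last column of Table 2) are assumptions.

import numpy as np

def verify_cluster(matches, min_keypoints=5, max_rotation_spread_deg=15.0):
    """Accept a cluster of keypoint matches only if enough of them agree on the
    rotation between the scene object and its model (assumed thresholds)."""
    if not matches:
        return False
    diffs = np.array([m['orientation_diff'] for m in matches], dtype=float)
    # keep only keypoints whose rotation agrees with the cluster's dominant rotation
    dominant = np.median(diffs)
    consistent = diffs[np.abs(diffs - dominant) <= max_rotation_spread_deg]
    return consistent.size >= min_keypoints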
Figure 11. The model image (A) and the test image (B). Corner scene keypoints are shown in (C)
and 90° T-junction scene keypoints are given in (D)
An exemplary model image and a test image are given in Figure 11. Locations of corner scene
keypoints and 90° T-junction scene keypoints detected in the test image are also shown.
The selected example deliberately uses a piece of cloth as the object of interest to show that
the method has a potential to deal with some non-rigid objects as well. Match results have
been obtained using only two types of scene keypoints shown in Figs 11C and 11D. To
compensate for non-rigidity of the object, the shape/scene graph labels have been
compared only for the longest edges (so that minor local shape distortions do not affect the
hypothesis verification). The additional assumptions are as follows:
- Intensity parameters in scene keypoints and the corresponding model keypoints differ in approximately the same way.
- Angular widths in the corner scene keypoints are similar to the angles in the matching model keypoints.
Figure 12. Matched keypoints in the model and test images (corner keypoints and 90° T-junction keypoints)
Keypoint pair     Scene-graph distance / shape-graph distance = ratio
(1, 2)            40.0 / 69.7 = 0.57
(1, 3)            37.6 / 68.8 = 0.55
(1, 4)            73.0 / 128 = 0.57
(1, 5)            63.0 / 104.5 = 0.59
(1, 6)            89.8 / 145.3 = 0.61
(2, 3)            60.1 / 105.6 = 0.57
(2, 4)            72.4 / 130 = 0.56
(2, 5)            30.1 / 52.5 = 0.57
(2, 6)            51.9 / 81 = 0.63
(3, 4)            46.4 / 71.9 = 0.63
(3, 5)            69.1 / 109 = 0.63
(3, 6)            98.1 / 152.4 = 0.64
(4, 5)            57.4 / 99.7 = 0.57
(4, 6)            79 / 134.9 = 0.59
(5, 6)            31.4 / 53.3 = 0.59
Table 1. Distance ratios between the six most distant matched keypoints (numbered 1 to 6) for the
corresponding fragment of the shape graph (denominator values) and the scene graph (numerator
values) for the Figure 12 images; the full table is symmetric in the two keypoints of each pair
Figure 12 presents pairs of finally matched keypoints, and Table 1 shows a fragment of the
shape/scene graph (only the most distant keypoints are included). Although certain
variations of the ratio between the corresponding distances in the shape and scene graphs
can be noticed, the average ratio is consistently near 0.6, which can be taken as an
approximation of the relative scale between the model image and the test one. This value
corresponds to the visual assessment of Figs 11A and 11B.
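The same estimate can be computed directly from the matched keypoint coordinates, as in the short sketch below, which takes the median of the pairwise scene-to-shape distance ratios (the quantity tabulated in Table 1).

import numpy as np
from itertools import combinations

def relative_scale_estimate(scene_pts, model_pts):
    """Estimate the relative scale as the typical ratio between corresponding pairwise
    distances in the scene graph and in the shape graph. Points are matched (x, y)
    pairs given in the same order."""
    ratios = []
    for i, j in combinations(range(len(scene_pts)), 2):
        d_scene = np.linalg.norm(np.subtract(scene_pts[i], scene_pts[j]))
        d_model = np.linalg.norm(np.subtract(model_pts[i], model_pts[j]))
        if d_model > 0:
            ratios.append(d_scene / d_model)
    return float(np.median(ratios)) if ratios else None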
Corner approximations
Keypoint   Intensities (model)   Intensities (scene)   Angular width (model / scene)   Orientation difference
A          187 and 57            160 and 60            134 / 149                       32
B          194 and 46            165 and 60            90 / 104                        24
C          186 and 45            151 and 44            153 / 157                       25
D          187 and 48            135 and 38            147 / 146                       29
E          121 and 18            91 and 20             140 / 154                       41
F          162 and 35            136 and 56            149 / 152                       36
G          162 and 38            117 and 21            151 / 154                       36
H          171 and 53            137 and 31            142 / 153                       24
I          154 and 10            123 and 7             142 / 151                       26
J          174 and 48            143 and 35            145 / 156                       37
K          26 and 172            19 and 149            158 / 158                       29
L          20 and 169            14 and 139            158 / 160                       32
90° T-junction approximations
Keypoint   Intensities (model)   Intensities (scene)   Orientation difference
X          -                     -                     41
Y          -                     -                     32
Table 2. Approximation parameters for the model and scene keypoints used for the match
shown in Figure 12
As a further reference, Table 2 compares parameters of corner approximations and T-junction
approximations obtained for model and scene keypoints used for the hypothesis
confirmation. It shows a relatively high consistency for the orientation differences (ranging
from 24 to 42) for all keypoints and a high level of similarity for the angular widths of corner
keypoints. The differences between the corresponding intensities are wider (which is
unavoidable for images captured in different conditions) but they are consistent as well. In
particular, if the intensities differ they change in a similar way for all intensities of a given
approximation.
5. Concluding Remarks
We have presented principles and exemplary results of a novel technique for detection of
known objects in inspected images. The method is based on new types of keypoints which
are the focus of this paper. The proposed keypoints are significantly different from typical
gradient-based keypoints used in the alternative techniques. Our keypoints are based on
moment-derived pattern approximations of circular patches. Though currently only a few
patterns are used (i.e. corners, T-junctions and round tips of thick lines) a wide range of
other patterns can be added using the approach presented in our previous works (e.g.
Sluzek, 2005). The keypoints are characterised by intensity and configuration descriptors
(e.g. angular widths and orientation of the approximations) that are generally robust under
illumination changes, noise, texturisation, and other typical real-world effects. More
importantly, the keypoints are also scale-invariant within a certain range of scales. This has
been obtained by using two different methods for keypoint building in model images and in
analysed images.
Model images of database (known) objects are processed in multiple scales in order to
identify model keypoints that are invariantly characterised within the assumed range of
scales. The operation may be computationally expensive, but it is typically performed either
offline or in the preliminary phase of deployment when timing constraints are not critical.
However, the scene keypoints extracted from inspected images are based (unlike keypoints
used in other scale-invariant techniques) on a single-scale image scanning and processing.
Additionally, the efficiency of keypoint matching is improved by a simultaneous usage of
several keypoint types. Even if the overall number of keypoints (both model and scene ones)
is comparable to the numbers typically extracted and used by other methods, scene
keypoints of a certain category are matched only against the corresponding subset of model
keypoints of the same category. Therefore, the computational costs of image analysis are
relatively low and the method is suitable for real-time applications (e.g. for exploratory
robotics which is considered the primary application area).
Several improvements of the method are currently envisaged. First, we propose to enhance
the efficiency of keypoint matching by adding (without any significant computational costs)
more keypoint descriptors. For that purpose, moment-based expressions invariant under
similarity transformations and linear intensity changes are considered. Although generally
such invariants (proposed for colour images and areas of arbitrary shapes in Mindru et al.,
2004) are rather complex, we intend to apply them to circular images only. For circular
images, the following expressions have been found invariant under similarity
transformations and linear intensity changes. For other shapes of the processed areas they
are not invariant, however.
(m10² + m01²) / R²        and        (m10² + m01²) / R        (8)
6. References
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding.
Psychological Review, Vol.94, No.2 (Apr. 1987), pp. 115-147, ISSN: 0033-295X.
Edelman, S. (1997). Computational theories of object recognition. Trends in Cognitive Sciences,
Vol.1, No.8 (Nov. 1997), pp. 298-309, ISSN: 1364-6613.
Harris, C. & Stephens, M. (1988). A combined corner and edge detector. Proceedings of 4th
Alvey Vision Conference, pp. 147-151, Manchester, Sep. 1988.
Häusler, G. & Ritter, D. (1999). Feature-based object recognition and localization in 3D-space,
using a single video image. Computer Vision & Image Understanding, Vol.73,
No.1 (Jan. 1999), pp 64-81, ISSN: 1077-3142.
Hueckel, M.H. (1973). A local visual operator which recognizes edges and lines, Journal of
ACM, Vol.20, No.2 (Apr. 1973), pp 350, ISSN: 0004-5411.
Huttenlocher, D.P. & Ullman, S. (1990). Recognizing solid objects by alignment with an
image. Int. Journal of Computer Vision, Vol.5, No.2 (Nov. 1990), pp 195-212, ISSN:
0920-5691.
Islam, M.S.; Sluzek, A. & Zhu, L. (2005). Detecting and matching interest points in relative
scale. Machine Graphics & Vision, Vol.14, No.3 (Nov. 2005), pp. 259-283, ISSN: 1230-0535.
Islam, M.S. (2006). Recognition and localization of objects in relative scale for robotic
applications. PhD Thesis, School of Comp. Engineering, Nanyang Technological
University (Dec. 2006), Singapore.
Koenderink, J.J. & van Doorn, A.J. (1987). Representation of local geometry in the visual
system. Biological Cybernetics, Vol.55, No.6 (March 1987), pp. 367-375, ISSN: 0340-1200.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. Int. Journal of
Computer Vision, Vol.60, No.2 (Nov. 2004) pp. 91-110, ISSN: 0920-5691.
Mikolajczyk, K. & Schmid, C. (2004). Scale & affine invariant interest point detectors. Int.
Journal of Computer Vision, Vol.60, No.1 (Oct. 2004) pp. 63-86, ISSN: 0920-5691.
Mindru, F.; Tuytelaars, T.; van Gool, L. & Moons, Th. (2004). Moment invariants for
recognition under changing viewpoint and illumination. Computer Vision & Image
Understanding, Vol.94, No.1-3 (April 2004) pp 3-27, ISSN: 1077-3142.
Moravec, H. (1983). Stanford cart and the CMU rover. Proceeding of the IEEE, Vol.71, No.7
(July 1983), pp. 872-884, ISSN: 0018-9219.
Prasad, B.G.; Biswas, K.K. & Gupta, S.K. (2004). Region-based image retrieval using
integrated color, shape, and location index. Computer Vision & Image Understanding,
Vol.94, No.1-3 (April 2004) pp 193-233, ISSN: 1077-3142.
Rosin, P.L. (1997). Measuring corner properties. Computer Vision & Image Understanding,
Vol.73, No.2 (Feb. 1999), pp 291-307, ISSN: 1077-3142.
Schmid, C. & Mohr, R. (1995). Local grayvalue invariants for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol.19, No.5 (May 1997),
pp. 530-535, ISSN: 0162-8828.
Sluzek, A. (2005). On moment-based local operators for detecting image patterns. Image and
Vision Computing, Vol.23, No.3 (March 2005), pp 287-298, ISSN: 0262-8856.
Sluzek, A; Palaniappan, A. & Islam, M.S. (2005). A wireless sensor network for visual
detection and classification of intrusions, WSEAS Transactions on Circuits and
Systems, Vol.4, No.12 (Dec. 2005), pp 1855-1860, ISSN: 1109-2734.
Sluzek, A. (2006). An improved detection algorithm for local features in gray-level images.
In: Computer Vision and Graphics (Computational Imaging and Vision, vol.32),
K.Wojciechowski et al. (Eds.), pp 406-412, Springer, ISBN: 1-4020-4178-0, Dordrecht.
Tarr, M.J.; Bülthoff, H.H.; Zabinski, M. & Blanz, V. (1997). To what extent do unique parts
influence recognition across changes in viewpoint? Psychological Science, Vol.8, No.4
(July 1997), pp. 282-289, ISSN: 0956-7976.
Ulrich, M.; Steger, C. & Baumgartner, A. (2003). Real-time object recognition using a
modified generalized Hough transform. Pattern Recognition, Vol.36, No.11 (Nov.
2003), pp. 2557-2570, ISSN: 0031-3203.
Wolfson, H.J. & Rigoutsos, I. (1997). Geometric hashing: an overview. IEEE Computational
Science & Engineering, Vol.4, No.4 (Oct. 1997), pp. 10-21, ISSN: 1070-9924.
24
Biologically Inspired Vision Architectures:
a Software/Hardware Perspective
Francesco S. Fabiano, Antonio Gentile, Marco La Cascia
and Roberto Pirrone
Dipartimento di Ingegneria Informatica, Università di Palermo
Italy
1. Introduction
Even though the field of computer vision has seen huge improvements in the last few decades,
computer vision systems still lack, in most cases, the efficiency of biological vision systems.
In fact biological vision systems routinely accomplish complex visual tasks such as object
recognition, obstacle avoidance, and target tracking, which continue to challenge artificial
systems. The study of biological vision systems remains a strong cue for the design of devices
exhibiting intelligent behaviour in visually sensed environments but current artificial
systems are vastly different from biological ones for various reasons. First of all, biologically
inspired vision architectures, which are continuous-time and parallel in nature, do not map
well onto conventional processors, which are discrete-time and serial. Moreover, the
neurobiological representations of visual modalities like colour, shape, depth, and motion
are quite different from those usually employed by conventional computer vision systems.
Despite these inherent difficulties in the last decade several biologically motivated vision
techniques have been proposed to accomplish common tasks. For example Siagian & Itti [14]
developed an algorithm to compute the gist of a scene as a low-dimensional signature of an
image, in the form of an 80-dimensional feature vector that summarizes the entire scene. The
same authors also developed a biologically-inspired technique for face detection [13].
Interesting results have also been reported in generic object recognition and classification
(see for example [15] [16] [12] [11]). Also on the sensor side the biological vision systems are
amazingly efficient in terms of speed, robustness and accuracy. In natural systems visual
information processing starts at the retina where the light intensity is converted into
electrical signals through cones and rods. In the outer layers of the retina the photoreceptors
are connected to the horizontal and bipolar cells. The horizontal cells produce a spatially
smoothed version of the incoming signal while the bipolar cells are sensitive to the edges in
the image. Signals output from the cells are then used for higher level processing. Several
architectures have been proposed to mimic in part the biological system and to extract
information ranging from low to high level. For example Higgins [10] proposed a sensor
able to act as an elementary visual motion detector. Other researchers proposed sensors to
detect mid-level image features like corners or junctions [4] or even to perform higher level
tasks such as tracking [6] or texture classification [5]. Robotics represents a typical field of
application for hardware implementations of biologically inspired vision architectures.
Robot vision routines such as self localization, or 3D perception via calibrated cameras
require large computing capabilities. Autonomous robot platforms have limited space to
dedicate to such high level tasks because on-board computers are busy most of the time with
motor control and sensorial data acquisition. Even more limited embedded hardware is
available on small wheeled robots for which almost all sensory computation is delegated to
remote machines. Also in the case of robots equipped with onboard computer, most
processing focuses on motion control, and low level sensorial data elaboration while heavy
computer vision tasks, like image segmentation and object recognition, are performed in
background, via fast connections to a host computer. Emerging gigascale integration
technologies offer the opportunity to explore alternative approaches to domain specific
computing architectures that can deliver a significant boost to on-board computing when
implemented in embedded, reconfigurable devices. This paper describes the mapping of
low level feature extraction on a reconfigurable platform based on the Georgia Tech SIMD
Pixel Processor (SIMPil).
In particular, an adaptation of the Boundary webs Extractor (BWE) has been implemented
on SIMPil exploiting the large amount of data parallelism inherently present in this
application. The BWE [1] is derived from Grossberg's original Boundary Contour
System (BCS) and extracts a dense map of iso-luminance contours from the input image.
This map contains actual edges along with a compact representation of local surface
shading, and it is useful for high level vision tasks like Shape-From-Stereo. The Fast
Boundary Web Extraction (fBWE) algorithm has been implemented in fixed point as a feed-forward processing pipeline, thus avoiding the BWE feedback loop and achieving a
considerable speed-up when compared against the standard algorithm. Application
components and their mapping details are provided in this contribution along with a
detailed analysis of their performance. Results are shown that illustrate the significant gain
over a sequential implementation, and most importantly, the execution times in the order of
170 μsec for a 256000 pixel image. These results allow ample room for real-time processing
of typical subsequent tasks in a complete robot vision system. The rest of this chapter is
organized as follows. Section II introduces the Georgia Tech SIMPil architecture, and
implementation efforts on FPGA. Section III provides some remarks on the original
Grossberg's BCS, and its derived BWE model. In section IV the fBWE system is described,
and its mapping onto SIMPil detailed. Section V reports extensive experiments with the
fBWE compared with the BWE results, while in section VI some conclusions are drawn.
used in the simulations. This system is capable of delivering a peak throughput of about 5
Tops/sec in a monolithic device, enabling image and video processing applications that are
currently unapproachable using today's portable DSP technology. The SIMPil architecture is
designed for image and video processing applications. In general, this class of applications
is very computationally intensive and requires high throughput to handle the massive data
flow in real-time. However, these applications are also characterized by a large degree of
data parallelism, which is maximally exploited by focal plane processing. Image frames are
available simultaneously at each PE in the system, while retaining their spatial correlation.
Image streams can therefore be processed at frame rate, with only a nominal amount of
memory required at each PE [8]. The performance and efficiency of the SIMPil have been
tested on a large application suite that spans the target workload.
register file to the boundary output port. An upload/download operation (one word per
node) takes only one clock cycle. Both boundary input and output operations are enabled by
a single instruction, XFERB. When a NEWS transfer instruction arrives, it needs only one
clock cycle to transfer the data word from each node to a neighbour one, in a specified
direction. The SIMPil-K platform can be reconfigured by varying a number of architectural
parameters, as detailed in Table 2. This allows for experimentation with a large set of
different system configurations, which is instrumental to determine the appropriate system
characteristics for each application environment. The AW and RAW parameters set the address
space of the register file and memory, respectively. PPE specifies the number of image pixels
mapped to each PE. The Influence parameter toggles between a fixed instruction width (24
bit) and a variable one (8+K bits). The interface of a processing element is depicted in Figure
3, below. There are two input ports for clock signals, a reset input port and the instruction
stream port. NEWS transfers are carried through the three bidirectional dedicated ports
(NEWS ports) which drive three NEWS buses, namely the North/West Bus, East Bus and
South Bus.
Image Transforms: Discrete Fourier Transform, Discrete Cosine Transform, Discrete Wavelet Transform, Image Rotation
Image Enhancement: Intensity Level Slicing, Convolution, Magnification, Median Filtering
Image/Video Compression: Quantization, Vector Quantization, Entropy Coding, JPEG Compression, Motion Estimation, MPEG Compression
Image Analysis: Morphological Processing, Region Representation, Region Autofocus, K-means Classification
Table 1. SIMPil Application Suite
Boundary data input and output are carried through the two dedicated boundary ports. The
processing element parameterized architecture is described in Figure 4. There are four
communication buses shared by the functional units. All functional units can be
reconfigured based on the datapath width selected. A single PE can perform integer
operations on K-bits. Dedicated barrel shift unit and multiply-accumulate unit are
instrumental to speed-up most image processing kernels. The Sleep Unit verifies and
updates the node activity state, thus allowing execution flow control based on each PE local
data. The SIMPil-K system has been simulated and synthesized on FPGA; synthesis statistics
about employed resources have been generated and analyzed. Figure 5 shows the resource use
percentages achieved by implementing several 16-bit SIMPil-K versions on an eight-million-gate
FPGA: in particular, 2-by-2, 4-by-4 and 8-by-8 16-bit SIMPil-K arrays have resource
use percentages of 3.3%, 13.3% and 53.3%, respectively.
Parameter       Function                           Values and constraints                                   Default
K               Word width                         {8, 16, 32, 64}                                          16
X               Array columns                      X ∈ N, X = 2^j, j ∈ Z                                    4
Y               Array rows                         Y ∈ N, Y = 2^j, j ∈ Z                                    4
AW              Register file address width        AW ∈ N ∩ [1, 16]; AW ≤ 4 (I = off), AW ≤ K/4 (I = on)    -
RAW             Local RAM address width            RAW ≤ K/4                                                -
PPE             Pixels per processing element      PPE ∈ N, PPE = p², p ∈ N, PPE ≥ 2                        -
Influence (I)   Instruction format change enable   I ∈ {on, off}                                            off
Table 2. SIMPil-K architectural parameters
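For illustration, the reconfigurable parameter set of Table 2 can be captured in a small configuration object, as sketched below; the defaults for AW, RAW and PPE and the exact validation checks are assumptions, since only part of the table's constraints could be recovered from the source.

from dataclasses import dataclass
import math

@dataclass
class SimpilKConfig:
    """Illustrative container for the SIMPil-K parameters of Table 2 (not an official API)."""
    K: int = 16              # word width
    X: int = 4               # array columns
    Y: int = 4               # array rows
    AW: int = 4              # register file address width (placeholder default)
    RAW: int = 4             # local RAM address width (placeholder default)
    PPE: int = 4             # pixels per processing element (placeholder default)
    influence: bool = False  # variable (8+K bit) vs fixed (24 bit) instruction format

    def validate(self):
        assert self.K in (8, 16, 32, 64), "word width must be 8, 16, 32 or 64"
        assert self.X > 0 and (self.X & (self.X - 1)) == 0, "array columns must be a power of two"
        assert self.Y > 0 and (self.Y & (self.Y - 1)) == 0, "array rows must be a power of two"
        assert 1 <= self.AW <= 16, "register file address width out of range"
        r = math.isqrt(self.PPE)
        assert r * r == self.PPE and self.PPE >= 2, "PPE must be a square number >= 2"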
colors. In this formulation, the BCS is a multi-layer recurrent network trained using a
competitive/cooperative scheme until an equilibrium state is reached. BCS units have
dynamic activations that are expressed using differential equations with respect to time. The
network takes the input from a gray-level image, with a lattice of receptive fields computing
local contrast in small areas. Output is provided as a 2D map of vectors, with the same
spatial displacement of the input receptive fields, which are called boundary webs, and
describe changes in brightness over the image. A boundary web is locally oriented along a
constant brightness line, meaning that image contrast changes along the orthogonal
direction. The amplitude of each boundary web is related to the strength of the local
contrast. Boundary webs form a piecewise linear approximation of all image contours, while
they follow iso-luminance paths inside smoothly shaded surfaces: consequently, they can be
regarded as a compact description of image shading. A typical BCS analysis is described in
Figure 7(b), while Figure 6 reports an outline of the BCS architecture. The network consists
of an input stage used to collect contrast information, the so called OC Filter, and of three
layers: Competition I, Competition II and Cooperation. The OC Filter is used to collect local
image contrast along different directions without taking into account contrast orientation.
two symmetric halves Lijk and Rijk defined for each mask at the k-th orientation. Assuming
that [x]+ = max(x, 0), the resulting activation at position (i, j) and orientation k is:
(1)
where Uijk and Vijk are the summed inputs in the mask's halves, while the two remaining
coefficients are suitable constants. The first competitive layer enforces local winner activations via the feedback
signal and the input from the OC Filter, while tending to decrease activation in neighboring
cells with the same orientation. In case of strong aligned activations induced by image
contours, the aim of the first competitive stage is to reduce the activation diffusion beyond
contours endpoints.
and the equilibrium activation wijk for each cell in this stage is computed as:
(3)
where vijk is the feedback signal, Apqij are the coefficients of a small kernel with cylindrical
shape, while I and B are suitable constants. In the following equations, capital letters without
indexes are constant values used to tune the model. The second competitive stage performs
competition among orientations inside the same cell: this is a local contour refinement
mechanism which will be enhanced by the cooperative stage. The activation law has the
following form:
(4)
where capital indexes refer to the orthogonal direction with respect to the current one.
The cooperative stage performs long range cooperation between cells with the same
orientation that are displaced in a wide neighborhood. In this way long contours completion
is enabled. Considering the vector d connecting the position (i, j) with a generic neighbor
(p, q), the following quantities can be defined: Npqij = |d| and Qpqij, the orientation angle of d.
The cooperative activation law is:
(5)
where:
This very complex kernel has the form of two elongated blobs aligned with the orientation k,
and exponentially decreasing towards 0. In particular, P represents the optimal distance
from the cooperative cell at which maximum input activation is collected. Finally, feedback
is provided from the cooperative stage to the first competitive one, in order to enforce those
activations that are aligned with emergent contours and decrease spurious ones. The form of
the feedback signal is:
(6)
where Wpqij are the coefficients of a small cylinder-shaped kernel. BCS provides a compact
description of image shading at selectable resolution levels: shading, in turn, can be used to
perform shape estimation, while boundary webs can be used as low level features for
contour extraction, alignment, or stereo matching. Possible uses of BCS have been explored
by some of the authors resulting in a software implementation of the BCS, called Boundary
Web Extractor (BWE) which has been used as a low level feature extraction module in
different vision systems. In particular, a neural shape estimation model has been proposed
[1] coupling BWE analysis with a backpropagation network trained to classify BWE patterns
as belonging to superquadrics surface patches. Input image surfaces are processed by BWE,
and the BWE output pertaining to different ROIs is modeled in terms of superquadrics.
Another approach [2] performs BWE analysis on stereo couples. Input images are analyzed
both with standard correlation operator over pixels intensities, and with BWE as a
supplementary feature. Candidate points are labeled using a measure of the matching
probability with respect to both the preprocessing operators. Finally, a relaxation labeling
algorithm provides matches for almost all points in the image, and disparities are obtained.
The high resolution achievable by the BWE analysis enables dense depth maps. The main
objective of BWE is to perform local brightness gradient estimation, without taking into
account the support for perception theories. In this perspective BWE has been slightly
modified with respect to BCS, to obtain sharp contrast estimation and emergent contours
alignment. In particular, N couples of dually oriented Gabor masks have been used as
receptive fields to obtain N activation values, discarding, for each couple, the mask providing a
negative output. The resulting OC Filter is described by the following equation:
(7)
where Uijk and Vijk are the outputs of two dual Gabor masks. The generic Gabor filter has
been selected in our implementation with a width w equal to 8 pixels and 2N = 24. The filter
equation is:
(8)
Here s is the application step of the masks; the ... parameters have been heuristically
tuned. The kernels in eqs. (3) and (6) have been selected with Gaussian shape, the
subtractive term in the exponential part of the cooperative kernel has been suppressed, and all
constant values in the equations have been suitably tuned. To ensure that the kernel is
symmetric, its central value has been forced to 0 in order to avoid the exponential
function giving a positive value when Npqij = 0. Finally, we can give a formulation of the
BWE structure as a 3D matrix containing, at each location (i, j), 2N activation values
belonging to a star of vectors.
(9)
Each vector represents the value of the image contrast along the orthogonal direction with
respect to its phase. As a consequence of the modified OC Filter behaviour, the location Bij
of the BW matrix contains N couples, each of them having a null vector that corresponds to
the negative output of the filter at the same orientation.
(10)
For computer vision purposes the average boundary webs are noteworthy because they
provide a single estimation of the local image contrast at each spatial location, both as
intensity and direction. The average process is computed using a suitable average function
fav:
(11)
The average function can be selected according to several criteria: the maximum value or the
vector sum of all the elements at each location; we selected a form of fav that weights each
intensity with the cosine of the angle between the phase value and a mean phase angle,
obtained by weighting each phase with the respective intensity.
(12)
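A minimal sketch of an averaging function in the spirit of fav is given below: the mean phase is an intensity-weighted (circular) mean of the phases, and each intensity is then weighted by the cosine of its angular distance from that mean. The exact normalisation of Eq. (12) is not reproduced here.

import numpy as np

def average_boundary_web(intensities, phases):
    """Return an average intensity and mean phase for one location of the BW matrix
    (simplified sketch; phases in radians)."""
    w = np.asarray(intensities, dtype=float)
    phi = np.asarray(phases, dtype=float)
    # intensity-weighted circular mean of the phases
    mean_phase = np.arctan2(np.sum(w * np.sin(phi)), np.sum(w * np.cos(phi)))
    # each intensity weighted by the cosine of its distance from the mean phase
    avg_intensity = np.sum(w * np.cos(phi - mean_phase))
    return avg_intensity, mean_phase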
Figure 7 makes a comparison between the original BCS and BWE both for the actual output,
and for the average one.
OC Filter have been implemented using equation (8), and have been provided to the PE
array as a suitable gray-level image. The original floating point values obtained for the
weights have been approximated to 8-bit integer values, and the minimum value has been
added to each of them to obtain the correct dynamics in the range [0, 255]. The set of Gabor
masks is depicted in Figure 9. The same mask is loaded into all the PEs in one column.
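The weight quantisation step can be sketched as follows; shifting by the minimum value follows the description above, while the scaling to the full [0, 255] range is an assumption.

import numpy as np

def quantise_weights_to_uint8(weights):
    """Approximate floating point Gabor weights with 8-bit integers spanning [0, 255]
    (simplified sketch; the original rounding/scaling policy is not known)."""
    w = np.asarray(weights, dtype=float)
    shifted = w - w.min()                         # add the minimum value to each weight
    scaled = 255.0 * shifted / (shifted.max() + 1e-12)
    return np.round(scaled).astype(np.uint8)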
direction the fBWE activations bijk are almost of the same magnitude and a sort of little star is
visualized in the output. To avoid this behaviour the uniform activations suppression acts
according to the following rule:
(13)
Here the threshold value of 0.8 has been selected on the basis of a trial and error process.
After uniform activations suppression, the maximum values are selected at each location for
the left and right branches, thus obtaining two average boundary web images, using max(·)
in place of the averaging function fav.
to extremely different dynamics in the two pipeline branches, so a gain element has been
placed in the high pass branch to normalize these ranges. The gain factor has been
determined as
(14)
In all our experiments As assumed values between 6 and 7. Before the conjunction of the two
branches with the union pixel by pixel of the left (WL) and right (WR) image, a sharp
threshold S has been applied in order to join exactly WL and WR. The value of S has been
selected as the 30% of the maximum activation in WL, and all the values in WR that are over
the value of S are joined with all the values of WR that are beneath the same threshold. The
joined image WJ can be defined as WJ = [(WJ,ij, kij)] where for each location (i, j) the
amplitude and the relative orientation value are defined. The last step is the cooperative
filtering that generates the fBWE image W, and is aimed at enforcing aligned neighboring
activations.
An activation is enforced if its orientation is only slightly different from the one of the location at
the center of the filter mask, otherwise it is decreased. The generic weight Mpq of the filter
applied to the location (i, j) is defined as:
(15)
Also in this case it is necessary for each PE to obtain 12 values from its eight neighbors.
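The behaviour of this cooperative filtering stage can be sketched as follows. This is not the actual weight mask Mpq of Eq. (15); it only reproduces the described effect of enforcing activations whose orientation is close to that of the surrounding neighbourhood and damping the others, with assumed tolerance and gain values.

import numpy as np

def cooperative_filter(amplitude, orientation, align_tol=np.deg2rad(20), boost=1.2, damp=0.8):
    """Boost activations aligned with their 8-neighbourhood, damp the others
    (behavioural sketch; orientations treated as pi-periodic)."""
    out = amplitude.copy()
    H, W = amplitude.shape
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            centre = orientation[i, j]
            neigh = orientation[i - 1:i + 2, j - 1:j + 2]
            diff = np.abs(np.angle(np.exp(2j * (neigh - centre)))) / 2
            aligned = int(np.count_nonzero(diff <= align_tol)) - 1  # exclude the centre
            out[i, j] = amplitude[i, j] * (boost if aligned >= 4 else damp)  # assumed rule
    return out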
5. Experimental Results
Several experiments have been conducted on a set of images with different pictorial
features: real images with a lot of shading, highly textured images, high contrast ones, and
artificial pictures with both high dynamics (like cartoons) and poor one (Kanizsa figures). In
Figure 10 the BWE and fBWE images are reported along with a diagram of the local
. It can be noticed that the two
orientation differences
implementations are perceptually equivalent, and the major differences are present in the
uniform brightness regions. In these parts of the image the BWE exhibits some small
residual activations due to the feedback-based stabilization process, while the fBWE
suppresses them completely. In the case of Kanizsa figures with a few well distinct gray levels (see
Figure 11) the OC Filter alone performs better than the fBWE, so it has been selected as the
system output. As regards the performance, the BWE execution time in our experiments
ranges from 14.94 sec. in the case of the Kanizsa figure to 68.54 sec. for the Lena and Tank
images, while fBWE has a constant execution time of 0.168 msec. This is an obvious finding
because the fBWE is a feed-forward architecture, while the BWE is not, and its convergence
to a steady state depends on the input brightness structure.
Figure 10. Experimental results for (a) Lena, (b) Tank, (c) Gear and (d) Cartoon; from top to
bottom: input image, BWE output, fBWE output, difference
Figure 11. Experimental results on a Kanizsa figure: (a) input image, (b) fBWE output, (c)
OC Filter output
6. Conclusion
A Fast Boundary Web Extraction (fBWE) algorithm was presented in this paper as a fixed-point,
data-parallel implementation of the BWE. fBWE was mapped on the SIMPil-K
reconfigurable FPGA-based platform.
Application components and their mapping details were provided along with a detailed
analysis of their performance. Experimental results on an extensive data set illustrate the
significant gain achieved over the traditional BWE implementation. Execution times are in
the order of 170 μsec for a 256000 pixel image, thus allowing ample room for real-time
processing of typical subsequent tasks in a complete robot vision system.
7. References
E. Ardizzone, A. Chella, R. Pirrone, and F. Sorbello. Recovering 3-D Form Features by a
Connectionist Architecture. Pattern Recognition Letters, 15:77-85, 1994. [1]
E. Ardizzone, D. Molinelli, and R. Pirrone. A Fast Robust BCS Application to the Stereo
Vision. In M. Marinaro and R. Tagliaferri, editors, Neural Nets WIRN Vietri-95
Proceedings of the 7th Italian Workshop on Neural Nets, pages 215-225, Vietri Sul Mare
(SA), Italy, 1995. World Scientific Pu., Singapore. [2]
H.H. Cat, A. Gentile, J.C. Eble, M. Lee, O. Vendier, Y.J. Joo, D.S. Wills, M. Brooke, N.M.
Jokerst, and A.S. Brown. SIMPil: An OE Integrated SIMD Architecture for Focal
Plane Processing Applications. In Proceedings of the Third IEEE International
Conference on Massively Parallel Processing using Optical Interconnection (MPPOI-96),
pages 44-52, Maui Hawaii, USA, 1996. [3]
J. Van der Spiegel and M. Nishimura. Biologically inspired vision sensor for the detection of
higher-level image features. In 2003 IEEE Conference on Electron Devices and Solid-State Circuits, pages 11-16. IEEE Computer Society, Washington DC, USA,
December 16-18 2003. [4]
25
Robot Vision in the Language of Geometric
Algebra
Gerald Sommer and Christian Gebken
Department of Computer Science, Christian-Albrechts-University, Kiel
Germany
1. Introduction
In recent years, robot vision became an attractive scientific discipline. From a technological
point of view, its aim is to endow robots with visual capabilities comparable to those of
human beings. Although there is considerable endeavour, the progress is only slowly
proceeding, especially in comparison to the level of behavior of human beings in natural
environments. This has its reason in lacking insight into the organization principles of
cognitive systems. Therefore, from a scientific point of view, robot vision is a test bed for
understanding more on cognitive architectures and the mutual support of vision and action
in cognitive systems. While in natural systems self-organization of structures and data flow
is responsible for their success, in case of technical systems, the designer has to model
cognitive systems. Modeling needs a theoretical base which is rooted in the state-of-the-art
knowledge in science, mathematics and engineering.
The most difficult problem to be solved is the design of a useful cognitive architecture. This
concerns e.g. the gathering and use of world knowledge, controlling the interplay of
perception and action, the representation of equivalence classes, invariants and concepts.
Besides, hard real-time requirements have to be considered. The most attractive approach to
the design of a cognitive architecture is the framework of behavior-based systems (Sommer,
1997). A behavior is represented by a perception-action cycle. Remarkable features of such
an architecture are the tight coupling of perception and action, and learning the required
competences (Pauli, 2001) from experience.
Another problem to be coped with in designing robot vision systems is the diversity of
contributing disciplines. These are signal theory and image processing, pattern recognition
including learning theory, robotics, computer vision and computing science. Because these
disciplines developed separately, they are using different mathematical languages as
modeling frameworks. Besides, their modeling capabilities are limited. These limitations are
caused to a large extent by the dominant use of vector algebra. Fortunately, geometric
algebras (GA) as the geometrically interpreted version of Clifford algebras (CA) (Hestenes
& Sobczyk, 1984) deliver a reasonable alternative to vector algebra.
The aim of this contribution is to promote the use of geometric algebra in robot vision
systems based on our own successful experience over one decade of research. The application of
GA within a behavior based design of cognitive systems is the long-term research topic of
the Kiel Cognitive Systems Group (Sommer, 1999). Such a coherent system has to be an
embodiment of the geometry and the stochastic nature of the external world. That is, it
should enable both internal processes converging at reasonable interpretations of the world
and performing useful actions in the environment. We will report on some novel results
achieved within the last years which extend the survey papers (Sommer, 2004; Sommer,
2005).
Our main contributions to applications of geometric algebra in robot vision are focussing on
the following problems:
Knowledge based neural learning by using algebraic constraints (Buchholz & Sommer,
2006)
Higher-order statistics (Buchholz & Le Bihan, 2006) and estimations (Perwass et al.,
2006) in GA.
More details of the results contributed by the Kiel Cognitive Systems Group can be found in
the publications and reports on the website http://www.ks.informatik.uni-kiel.de. Here we
will report from an engineer's point of view. But the reader should be aware that GA
constitutes a framework which has to be adapted to the problem at hand. Therefore, the
system designer has to shape this mathematical language in a task related manner. This is
both a challenge and a chance at the same time.
In section 2, we will present a bird's eye view on geometric algebra and will also motivate
its use in robot vision. Special emphasis will be on the conformal geometric algebra (CGA).
A novel approach to local image analysis based on embedding the curvature tensor of
differential geometry into a Clifford analysis setting will be presented in section 3. Sections
4 and 5 are dedicated to our recent progress on estimations from uncertain data in CGA. We
will handle uncertainty for geometric entities and kinematic operations as well. Parameter
estimation methods, based on the principle of least squares adjustment, will be used for
evaluating multi-vectors and their respective uncertainties. Also, in section 5 we will focus
on the problem of pose estimation in case of uncertain omnidirectional vision. In addition,
we will present a novel generalized camera model, the so-called inversion camera model.
Again, we will take advantage of the representation power of CGA.
product, destroys the direction information originally represented in the pair of vectors by
mapping them to a scalar. Second, we are mostly interested in vector spaces with Euclidean
norm. The basic geometric entities of Euclidean spaces are points. A Euclidean vector space
can thus be interpreted as an infinite set of points. There is no possibility of formulating
useful subspace concepts in the vector space other than set-based ones. Third, a cognitive system is
reasoning and acting on global geometric entities, like a tea pot. It makes no sense to
decompose the world phenomena into point-like entities. Fourth, the most important
transformation in robot vision, that is rigid body motion (RBM), has no linear representation
in Euclidean space. Instead, if we are interested in describing the RBM of points, we have to take
advantage of an algebraic trick such as extending the dimension of the space in order to remain
within linear operations. There is no general way for generalizing this trick within the
vector space concept to other geometric entities (as a pair of points or a line). Therefore,
most of the basic disciplines of robot vision are getting stuck in non-linearities. The resulting
iterative solutions are intractable in real-time applications. Finally, besides translation, all
other operational entities acting on a vector are not themselves elements of the algebra. This
makes the description of actions based on certain transformation groups a difficult task.
Geometric algebra enables us to overcome most of those problems, at least to a certain extent.
In fact, if not specified, the term geometric algebra represents a whole family of geometric
algebras. The designer has to select the right one for the problem at hand or has to design a
special one with the desired features. Hence, its use enables a knowledge based system
design in an algebraic framework which can represent the geometry of interest.
Representing geometry in an algebraic framework means thinking in a Kleinian sense
(Brannan et al., 1999). Any GA has the following features:
1. It is a linear space, which can be mapped to a vector space again. Its elements are multivectors, that is directed numbers of mixed grade. It has a rich subspace structure with
each subspace having algebraic properties and interpretations in a geometric or
operational sense of representing entities of a certain grade, e.g. of higher order.
2. It represents a geometry of interest. That means, it models geometric spaces equipped
with basic geometric entities and a range of higher order geometric entities with useful
algebraic properties. Besides, it represents a Clifford group the elements of which are
linear operational entities. This makes non-linear operations in vector spaces to linear
ones in the chosen GA. That is, both geometric and operational entities are elements of
the algebra.
3. A geometric algebra is equipped with a geometric product the action of which on multivectors not only enables mappings into certain subspaces but from which also incidence
algebraic operations between subspaces can be derived.
This as a whole makes GA a powerful tool for modeling in robot vision and beyond.
2.2 Basic Structure of Geometric Algebra
Here we will only present a sketch of the rich structure represented by a geometric algebra.
For more details see (Hestenes & Sobczyk, 1984) or the introduction paper (Hestenes et al.,
2001), respectively the tutorial report (Perwass & Hildenbrand, 2003).
A geometric algebra is a linear space of dimension 2^n constructed from a vector space
with signature (p, q, r), n = p + q + r, which we denote R^(p,q,r). The algebra is
built by applying the geometric product to the basis vectors ei, i = 1, . . ., n, of R^(p,q,r),
(1)
The GA is called Euclidean for n = p and pseudo-Euclidean for n = p + q. In the case of
r ≠ 0, its metric is degenerate. The signature (p, q, r) is the key for selecting certain geometric
properties of the GA. The geometric product is linear and associative but not commutative.
The linear space of a GA is split into a rich subspace structure represented by a set of blades
Bk of grade k. Given k independent vectors1 ai, i = 1, . . . , k, a k-blade is defined for k = 1,...,n
by
(2)
Here ∧ indicates the outer product. There are (n choose k) different k-blades, each having its
own direction given by the constituting vectors. Hence, k-blades constitute directed linear
subspaces of the GA. In figure 1 we visualize the blade structure of G3, that is the GA of R3.
By considering next the simple example of the geometric product of two vectors a,b
we will get an inductive access to the construction rule of multi-vectors as the algebraic
entities of a geometric algebra.
We use lower case letters, as a, for algebra vectors or for vector space elements.
a separation of the geometric product into the sum of the inner product
(4)
and the outer product
(5)
The geometric product, a b, results in the sum of a scalar, (ab)0, and a bivector, (ab)2.
In contrast to the scalar product of vector algebra, the geometric product of geometric
algebra is both grade-decreasing and grade-increasing. In general the multi-vector A is a
mixture of k-vectors, Ak,
(6)
with
(7)
For the geometric product of homogeneous multi-vectors of grades s and r we get a
multi-vector C with a certain spectrum of different k-vectors,
(8)
with the pure inner product Ar · Bs = (ArBs)|r-s| and the pure outer product Ar ∧ Bs =
(ArBs)r+s. Hence, the other components result from mixing the inner and outer product.
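As a small numeric illustration of Eqs. (4)-(5), the geometric product of two vectors of R^3 can be computed as the sum of a scalar part (the inner product) and a bivector part (the outer product); the sketch below returns the scalar and the three bivector coefficients on e12, e13 and e23.

import numpy as np

def geometric_product_vectors(a, b):
    """Geometric product ab of two 3D vectors: scalar part <ab>0 = a.b and
    bivector part <ab>2 = a^b (coefficients on e12, e13, e23)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    scalar = float(np.dot(a, b))                      # inner product, grade 0
    wedge = np.array([a[0] * b[1] - a[1] * b[0],      # e1^e2 coefficient
                      a[0] * b[2] - a[2] * b[0],      # e1^e3 coefficient
                      a[1] * b[2] - a[2] * b[1]])     # e2^e3 coefficient
    return scalar, wedge

# Orthogonal vectors have a vanishing scalar part, parallel vectors a vanishing bivector:
s, B = geometric_product_vectors([1, 0, 0], [0, 1, 0])   # s == 0.0, B == [1, 0, 0]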
The blades of grade n are called pseudoscalars, P,
(9)
with I being the unit pseudoscalar with I² = ±1 if r = 0, and a scalar which equals the
determinant known from matrix algebra. Because I = Ik In-k, a blade Bk is related to its dual one, Bn-k, by
(10)
This is a useful operation for switching between different representations of a multi-vector.
There are several main algebra involutions in GA; in the case of complex numbers the only
existing one is conjugation. Let us mention as an example the reversion. If Ak is a
k-vector, then its reverse is defined as
(11)
and the reverse of a multi-vector A is defined as
(12)
are related to shape modeling, projective geometry and kinematics. Conformal geometric
algebra (CGA) delivers a non-linear representation of a Euclidean space with remarkable
features:
First, CGA constitutes a unique framework for affine, projective and Euclidean geometry.
Because the special Euclidean transformation (RBM) is a special affine transformation, we
can handle either kinematic, projective or metric aspects of the problem at hand in the same
algebraic frame. Second, the basic geometric entities of conformal geometry are spheres of
dimension n. Other geometric entities such as points, planes, lines, circles, ... may be easily
constructed. These entities are no longer set concepts of a vector space but elements of CGA.
Third, the special Euclidean group is a subgroup of the conformal group, which is in CGA
an orthogonal group. Therefore, its action on the above mentioned geometric entities will be
a linear operation. Fourth, the inversion operation is another subgroup of the conformal
group which can be advantageously used in robot vision. Fifth, CGA generalizes the
incidence algebra of projective geometry with respect to the above mentioned geometric
entities.
Before we enlighten the structure and features of CGA in more detail, we will have a short
look on G3, the geometric algebra of the Euclidean 3D-space R3. This will be the starting
point for the mentioned non-linear representation in CGA. Additionally, G3 is the
embedding framework for image analysis, which will be described in section 3. The basis of
its 8-dimensional space is given by
(18)
with e1, e2 and e3 being the basis vectors of R3 with ei² = 1. Here the ei constitute the unit
1-blades and the eij = ei∧ej constitute the unit 2-blades, see figure 1. The unit pseudoscalar
squares according to I² = -1.
The even subalgebra G3+ is isomorphic to the quaternion algebra according to equation
(17). Its dimension is four and the basis is given by
(19)
where i, j, k are the imaginary unit vectors of a quaternion.
The conformal geometric algebra $\mathbb{R}_{4,1}$ is built by extending $\mathbb{R}^3$ with a so-called
Minkowski plane $\mathbb{R}^{1,1}$, resulting in the embedding space $\mathbb{R}^{4,1}$. Originally, this construction of the CGA of a
pseudo-Euclidean space $\mathbb{R}^{p,q}$, which results in $\mathbb{R}_{p+1,q+1}$, was proposed and analyzed by
(Angles, 1980). Only the work of (Li et al., 2001a) has been recognized by the robot vision
community as valuable access to the interesting phenomena in a unique framework. The
same authors presented also a CGA for spherical geometry (Li et al., 2001b) and a further
generalization to cope with Euclidean, spherical and hyperbolic geometry (Li et al., 2001c).
But the last two cases have not yet been studied in robot vision.
The basis of the Euclidean CGA $\mathbb{R}_{4,1}$ is of dimension 32. That one of the extended space
contains as additional basis vectors $e_+$ and $e_-$ with $e_+^2 = 1$ and $e_-^2 = -1$.
Both basis vectors constitute the so-called orthonormal basis of the Minkowski plane. More
attractive is to switch to the so-called null-basis $\{e_0, e_\infty\}$ with $e_0 = \tfrac{1}{2}(e_- - e_+)$ and
$e_\infty = e_- + e_+$. This has two reasons. First, both the origin of $\mathbb{R}^3$, represented by $e_0$, and the
point at infinity, represented by $e_\infty$, are explicitly accessible. Second, a point $x \in \mathbb{R}^3$ of the
Euclidean 3D-vector space is mapped to a conformal point (null vector) $X \in \mathbb{R}^{4,1}$, with $X^2 = 0$,
by the embedding function

    $X = K(x) = x + \tfrac{1}{2}x^2 e_\infty + e_0 .$                                            (20)
We denote these special vectors by capital letters as well. The mapping K builds a
homogeneous representation of a stereographically projected point (Rosenhahn & Sommer,
2005a). As a grade-1 entity, a point is a special sphere, S, (also of grade one) with radius
zero. The dual representation of a sphere,

    $S^{*} = X_1 \wedge X_2 \wedge X_3 \wedge X_4 ,$                                             (21)

is of grade four and is defined by the outer product of four points. A circle as a 2-dimensional
sphere, C or $C^{*}$, is defined by

    $C^{*} = X_1 \wedge X_2 \wedge X_3 .$                                                        (22)
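As a quick numerical check of the embedding (20) (our own sketch, assuming the coordinate ordering (e1, e2, e3, e_inf, e_0) and the standard CGA metric with e_inf . e_0 = -1), one can verify that embedded points are null vectors and that their inner product encodes the Euclidean distance, $X_1 \cdot X_2 = -\tfrac{1}{2}\|x_1 - x_2\|^2$:

```python
import numpy as np

# coordinates stored as (x1, x2, x3, c_inf, c_0) in the null basis
def embed(x):
    """Conformal embedding X = x + 0.5*|x|^2 e_inf + e_0, cf. equation (20)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [0.5 * x.dot(x), 1.0]])

def cga_inner(X, Y):
    """Inner product for the basis (e1,e2,e3,e_inf,e_0):
    e_i.e_i = 1, e_inf.e_inf = e_0.e_0 = 0, e_inf.e_0 = -1 (standard convention)."""
    return X[:3].dot(Y[:3]) - X[3] * Y[4] - X[4] * Y[3]

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([-0.5, 0.0, 1.0])
X1, X2 = embed(x1), embed(x2)
print(cga_inner(X1, X1))                 # ~0: embedded points are null vectors
print(cga_inner(X1, X2))                 # equals -0.5*|x1 - x2|^2
print(-0.5 * np.sum((x1 - x2) ** 2))
```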
By replacing one point in the defining equations (21) or (22) by the point at infinity, $e_\infty$, a
plane, a line or a point pair (a one-dimensional sphere) may be derived. Most interesting for
robot vision is the orthogonal representation in $\mathbb{R}_{4,1}$ of the elements of the conformal group
C(3). All transformations belonging to the conformal group are linear ones and the null
cone, that is the set of all null vectors, is invariant with respect to them. Let $V$ be an
element of the conformal group and $O$ any entity which has to be transformed by $V$. Then

    $O' = V\, O\, \widetilde{V}$                                                                 (23)

describes this transformation as a (bi-)linear mapping. In general, all algebraic entities with
such a sandwich product are called versors (Hestenes et al., 2001). Given some conditions,
certain versors are called spinors (representing rotation and dilation) and normalized
spinors are called rotors (representing pure rotation). Interestingly, also translation has a
rotor representation (called translator) in CGA. But the most interesting transformation
belonging to the conformal group is inversion, see (Needham, 1997), because all other
transformations can be derived from it. Let $S$ be a unit sphere located at
the origin $e_0$; then the inversion of any conformal point $X$ in the unit sphere is
written

    $X' = S\, X\, \widetilde{S} .$                                                               (24)

The elements of the rigid body motion in CGA are called motors, M. They connect
rotation, represented by a rotor R, and translation, represented by a translator T, in a
multiplicative way,

    $M = T R ,$                                                                                  (25)

and can be interpreted as a general rotation (Rosenhahn & Sommer, 2005a). As all versors,
they are concatenated multiplicatively. Let $M = M_2 M_1$ be a sequence of two motors, then

    $M X \widetilde{M} = M_2 \big( M_1 X \widetilde{M}_1 \big) \widetilde{M}_2$                  (26)

for all entities X. Another important feature of linear operations in GA also applies for
versors in CGA. It is the preservation of the outer product under linear transformation,
shows why our algorithms do not noticeably differ in the output when using an
exact embedding or its approximation. We evaluate the corresponding 5x5 covariance
matrix $\Sigma_X$ for the embedded point $X = K(x)$ by means of error propagation and find

    $\Sigma_X = J_K\, \Sigma_x\, J_K^{T} ,$                                                      (30)

where we used the Jacobian $J_K = \partial K(x)/\partial x$ of the embedding function,

    $J_K = \begin{pmatrix} I_{3\times 3} \\ x^{T} \\ 0 \end{pmatrix} .$                          (31)
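A compact numerical sketch of this first-order error propagation (ours, using the same assumed (e1, e2, e3, e_inf, e_0) ordering and an illustrative pixel-level covariance; all names are assumptions, not the chapter's code):

```python
import numpy as np

def embed(x):
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [0.5 * x.dot(x), 1.0]])

def jacobian_embed(x):
    """Jacobian of K(x) = x + 0.5|x|^2 e_inf + e_0 with respect to x, cf. equation (31)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([np.eye(3), x.reshape(1, 3), np.zeros((1, 3))])

x = np.array([0.2, -0.1, 1.5])
Sigma_x = np.diag([1e-4, 1e-4, 4e-4])      # assumed uncertainty of the Euclidean point
J = jacobian_embed(x)
Sigma_X = J @ Sigma_x @ J.T                # equation (30): 5x5 covariance of X = K(x)
print(Sigma_X.shape, np.allclose(Sigma_X, Sigma_X.T))
```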
In both cases the local phase constitutes a vector that includes also the orientation as a
geometric feature.
In case of i2D structures, an angle of intersection can be derived from the derivations of
phase angles.
All these efforts have been made because of the advantages of phase based image analysis
for getting access to geometry and because of the illumination invariance of phase
information.
3.2 Monogenic Curvature Tensor
The image representation we want to model should have some invariances:
Invariance with respect to intrinsic dimension: Both i1D and i2D structures can be
modeled. This is possible by the curvature tensor of differential geometry (Koenderink
& van Doorn, 1987).
Invariance with respect to parity symmetry: Both even and odd symmetric structures
can be represented. This is possible by designing quadrature phase filters, whose
harmonic conjugate component is in quadrature phase relation to the real valued
component (Sommer & Zang, 2007). The way to get this is applying a (generalized)
holomorphic extension of a real valued multi-dimensional function by a (generalized)
Hilbert transform.
Invariance with respect to rotation: This becomes possible by specifying the generalized
holomorphic extension by a monogenic extension (Felsberg & Sommer, 2001), whose
operator realization is given by the Riesz transform (Stein & Weiss, 1971).
Invariance with respect to scale: This requires embedding of the image representation,
respectively of the operator which derives it, into a monogenic scale-space (Felsberg &
Sommer, 2004).
Having these invariances in the image representation, in a second step of analysis the
corresponding variances can be computed. These are intrinsic dimension, parity symmetry,
rotation angle, angle of intersection and intrinsic scale at which these features exist.
We will interpret a 2D-image as a surface in $\mathbb{R}^3$. Let T be the curvature tensor of the second
fundamental form of differential geometry. Its Monge patch representation is given by

    $T(\mathbf{x}) = \dfrac{1}{\sqrt{1 + |\nabla f(\mathbf{x})|^2}}\; H_f(\mathbf{x})$           (32)

with the Hesse matrix

    $H_f(\mathbf{x}) = \begin{pmatrix} f_{x_1 x_1} & f_{x_1 x_2} \\ f_{x_1 x_2} & f_{x_2 x_2} \end{pmatrix} .$   (33)

Then the Gaussian curvature, K, and the mean curvature, H, are
spanning a basis in which the local signal f(x) can be classified according to its intrinsic
dimension according to table 1.
Table 1. Classification of the local signal by mean and Gaussian curvature

    Type               Mean curvature H     Gaussian curvature K
    Elliptic (i2D)                          K > 0
    Hyperbolic (i2D)                        K < 0
    Parabolic (i1D)    |H| != 0             K = 0
    Planar (i0D)       |H| = 0              K = 0
As equations (35) and (40) reveal, the Riesz transform is identical to the first order circular
harmonic,

    $H_R(\mathbf{u}) = C_1(\mathbf{u}) .$                                                        (41)

What remains for fulfilling the scale invariance requirement is embedding the monogenic
Hessian operator into the monogenic scale-space (Felsberg & Sommer, 2004). This is
achieved by replacing the radial component of the circular harmonics, $C_n(\mathbf{u})$, by a Difference-of-Poisson
kernel, $H_{DOP}$,

    $H_{DOP}(\mathbf{u}) = \exp(-2\pi |\mathbf{u}|\, s_1) - \exp(-2\pi |\mathbf{u}|\, s_2)$       (42)

with $s_1 < s_2$ being two different scale parameters. This results in circular harmonic bandpass
functions

    $C_n^{DOP}(\mathbf{u}) = C_n(\mathbf{u})\, H_{DOP}(\mathbf{u}) .$                            (43)

Finally, we get the monogenic curvature tensor as

    (44)
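To make the frequency-domain construction above concrete, here is a small FFT-based sketch (ours, not the chapter's implementation) that combines a Difference-of-Poisson bandpass with the first-order Riesz transform and derives local amplitude, phase and orientation of the monogenic signal; the function name and the scale values are assumptions:

```python
import numpy as np

def monogenic(image, s1=1.0, s2=3.0):
    """Difference-of-Poisson bandpass + Riesz transform (monogenic signal), cf. (41)-(43)."""
    rows, cols = image.shape
    u = np.fft.fftfreq(cols)[None, :]
    v = np.fft.fftfreq(rows)[:, None]
    radius = np.sqrt(u**2 + v**2)
    dop = np.exp(-2*np.pi*radius*s1) - np.exp(-2*np.pi*radius*s2)   # equation (42); zero at DC
    safe = np.where(radius == 0, 1.0, radius)                       # avoid division by zero
    F = np.fft.fft2(image) * dop
    f_p = np.real(np.fft.ifft2(F))                                  # even (bandpassed) part
    r1 = np.real(np.fft.ifft2(F * (-1j) * u / safe))                # odd parts: Riesz components
    r2 = np.real(np.fft.ifft2(F * (-1j) * v / safe))
    amplitude = np.sqrt(f_p**2 + r1**2 + r2**2)
    phase = np.arctan2(np.hypot(r1, r2), f_p)                       # local phase
    orientation = np.arctan2(r2, r1)                                # local orientation
    return amplitude, phase, orientation

# toy i1D test pattern: an oriented cosine grating
y, x = np.mgrid[0:128, 0:128]
img = np.cos(0.25 * (x * np.cos(0.4) + y * np.sin(0.4)))
amp, ph, ori = monogenic(img)
print(amp.mean(), ori[64, 64])
```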
In the case of $f_{i1D}$, the phase angle $\theta(\mathbf{x})$ indicates the orientation of the i1D structure within the image
plane and in the case of $f_{i2D}$, $2\theta(\mathbf{x})$ represents the local main orientation of the i2D structure
in a double angle representation which results from the eigenvector decomposition of the
structure tensor (Felsberg, 2002). Hence, phase analysis delivers also the orientation
information as a consequence of the monogenic representation of the curvature tensor.
In Figure 2, an example signal is analyzed with respect to its local spectral representations.
The monogenic curvature tensor is obviously invariant with respect to rotation. In figure 3,
two patterns of even and odd symmetric structures are analyzed with respect to local
amplitudes and local phases for $f_{i1D}$ and $f_{i2D}$, respectively. The invariances of the
monogenic curvature tensor with respect to the intrinsic dimension, parity symmetry and
angle of intersection can clearly be seen.
We will not discuss in detail the scale-space properties (Zang & Sommer, 2006a). It should
only be mentioned that the embedding of the curvature tensor into a monogenic scale-space
results in an improved corner detection based on a novel two-dimensional phase
congruency method (Zang & Sommer, 2006b) and delivers superior estimates of the optical
flow field based on a phase constrained variational approach (Zang et al., 2007).
expressions and thanks to the bilinearity of the always involved geometric product, the
corresponding equations are linear or at most quadratic in the multi-vector components. In
section 2.3 we have introduced a simple way to represent geometric algebra operations in
terms of a tensor notation, where the term tensor denotes the classical extension of matrix
theory to higher dimensions. This allows us to use well-tried and efficient algorithms
without leaving the algebra. Moreover, it paves the way for using stochastics: standard
error propagation, for example, is exact for the geometric product and makes it easily
possible to keep track of the uncertainties while doing operations like an intersection.
Figure 2. Top: original image (left), even and odd components of fi2D (middle and right).
Bottom: local amplitude (left), local phase (middle) and local orientation (right)
Figure 3. From left to right: original images, local amplitudes and local phases of the
monogenic signal fi1D, local amplitudes and local phases of the generalized monogenic
Gaussian curvature signal fi2D
The stochastic is one of the fundamental aspects of this section. To account for the
uncertainties in observational data we consequently decided on a least squares adjustment
parameter estimation. We use the Gauss-Markov and the Gauss-Helmert method. Each of
them provides an estimate together with a suitable covariance matrix. Hence, further
calculations can be carried out with these uncertain elements, as mentioned above.
This text builds on previous works by (Heuel, 2004) where uncertain points, lines and
planes were treated in a unified manner, but not in GA. The linear estimation of rotation
operators in GA was previously discussed in (Perwass & Sommer, 2002), albeit without
taking account of uncertainty. In (Perwass et al., 2005) the estimation of uncertain general
operators was introduced.
The structure of this section is as follows: first, we explain the underlying parameter
estimation methods. We then present two applications. For each, we demonstrate in which
way we profit from the expressiveness of CGA and we explain how our method can be
applied within that framework.
4.1 Stochastic Estimation Method
In the field of parameter estimation one usually parameterizes some physical process in
terms of a model and a suitable parameter vector p. The components of p are then to be
estimated from a set of observations originating from the process.
Here, we introduce our two parameter estimation methods, the common Gauss-Markov
method and the most generalized case of least squares adjustment, the Gauss-Helmert
method. Both are founded on the respective homonymic linear models, cf. (Koch, 1997). The
word adjustment puts emphasis on the fact that an estimation has to handle redundancy in
observational data appropriately, i.e. to weight unreliable data to a lesser extent. In order to
overcome the inherent noisiness of measurements one typically introduces a redundancy by
taking many more measurements than necessary to describe the process. Each observation
must have its own covariance matrix describing the corresponding Gaussian probability
density function that is assumed to model the observational error. The determination of
these covariance matrices is inferred from the knowledge of the underlying measurement process. The matrices
serve as weights and thereby introduce a local error metric.
The principle of least squares adjustment, i.e. to minimize the sum of squared weighted
errors, is often denoted as

    $\sum_i \Delta y_i^{T}\, \Sigma_{y_i}^{-1}\, \Delta y_i \;\longrightarrow\; \min ,$          (53)

where $\Delta y_i$ denotes the error of the i-th condition and $\Sigma_{y_i}$ the corresponding covariance matrix.
Note that most of the fitting problems in these sections are not linear but quadratic, i.e. the
condition equations require a linearization and estimation becomes an iterative process. An
important issue is thus the search for an initial estimate (starting point). If we already know a
good estimate, we can make a linearization of the G-constraint around it; collecting the
Jacobian with respect to the parameters and the negative residual, we obtain a condition that is
linear in the parameter corrections and exactly matches the linear Gauss-Markov model. The
minimization of equation (53) in conjunction with the Gauss-Markov model leads to the best
linear unbiased estimator. Note that we have to leave the weighting out in equation (53),
since our covariance matrices belong to the observations and do not match the errors of these
condition equations. Subsequently, we consider a model which includes the weighting.
If we take our observations as estimates, i.e. $\hat b_i = b_i$, we can make a Taylor series
expansion of first order at $(\hat b_i, \hat p)$, yielding

    $g(\hat b_i, \hat p) + U_i\, \Delta p + V_i\, \Delta b_i \;\approx\; 0 .$                    (54)

Similarly, with the Jacobians $U_i = \partial g / \partial p$ and $V_i = \partial g / \partial b_i$, we obtain a system which exactly matches the
linear Gauss-Helmert model. Note that the error term $y_i$ has been replaced by the linear
combination $V_i \Delta b_i$; the Gauss-Helmert differs from the Gauss-Markov model in
that the observations have become random variables and are thus allowed to undergo small
changes $\Delta b_i$ to compensate for errors. But changes have to be kept minimal, as observations
represent the best available. This is achieved by replacing equation (53) with

    $\sum_i \Delta b_i^{T}\, \Sigma_{b_i}^{-1}\, \Delta b_i \;\longrightarrow\; \min ,$          (55)

where $\Delta b_i$ is now considered as error vector.
The minimization of (55) subject to the Gauss-Helmert model can be done using Lagrange
multipliers. By introducing multipliers for the linearized G-constraints and for the H-constraint,
the Lagrange function, which is now to be minimized, becomes

    (56)

The last summand in (56) corresponds to the linearized H-constraint. That term can be omitted
if the model has no functional dependencies. A differentiation of the Lagrange function with
respect to all variables gives an extensive matrix equation, which could already be solved.
Nevertheless, it can be considerably reduced with suitable substitutions; the resulting normal
equation is free from the Lagrange multipliers and the corrections $\Delta b_i$ and can be solved for $\Delta p$,

    (57)

For the corrections $\Delta b_i$, which minimize the weighted
distance (55), we compute

    (58)
It is an important by-product that the (pseudo-)inverse of the quadratic matrix in equation
(57) contains the covariance matrix belonging to the estimate. The similar solution for the
Gauss-Markov model and the corresponding proofs and derivations can be found in (Koch,
1997). Due to outstanding convergence properties we start iterating with the Gauss-Markov
method. At the optimum we start the slower Gauss-Helmert method, which ultimately
adjusts the estimate according to the uncertainties $\Sigma_{b_i}$.
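For intuition, here is a compact iterative least-squares-adjustment sketch in the Gauss-Markov spirit (ours, not the chapter's estimator and without the observation corrections of the Gauss-Helmert step): a nonlinear condition g(b_i, p) = 0 is linearized at the current estimate and weighted normal equations are solved for the parameter correction. The toy model (a circle in the plane) and all names are illustrative assumptions:

```python
import numpy as np

def estimate_gauss_markov(observations, covariances, g, J_p, p0, iters=10):
    """Iterative weighted least squares: linearize g(b_i, p) = 0 at p and solve
    the weighted normal equations (sum J^T W J) dp = -(sum J^T W g)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        N = np.zeros((p.size, p.size))
        h = np.zeros(p.size)
        for b, Sigma in zip(observations, covariances):
            r = np.atleast_1d(g(b, p))                # residual of the condition equation
            J = np.atleast_2d(J_p(b, p))              # Jacobian w.r.t. the parameters
            W = np.linalg.inv(np.atleast_2d(Sigma))   # weight = inverse covariance
            N += J.T @ W @ J
            h += J.T @ W @ r
        p = p - np.linalg.solve(N, h)
    return p, np.linalg.inv(N)                        # estimate and its covariance

# toy example: fit a circle (cx, cy, r) in the plane to noisy points
g   = lambda b, p: (b[0] - p[0])**2 + (b[1] - p[1])**2 - p[2]**2
J_p = lambda b, p: np.array([[-2*(b[0] - p[0]), -2*(b[1] - p[1]), -2*p[2]]])
rng = np.random.default_rng(0)
t = rng.uniform(0, 2*np.pi, 30)
pts = np.c_[2 + 3*np.cos(t), -1 + 3*np.sin(t)] + 0.05*rng.standard_normal((30, 2))
p0 = [pts[:, 0].mean(), pts[:, 1].mean(), 1.0]
p_hat, Sigma_p = estimate_gauss_markov(pts, [0.01*np.eye(1)]*30, g, J_p, p0)
print(p_hat)
```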
4.2 Fitting a Circle in 3D
Now we show how the estimation method can be used in CGA to fit a circle in 3D-space to a
set of N data points {b1..N}. Each data point is given with its mean $b_i$ and covariance matrix
$\Sigma_{b_i}$. In order to apply the estimation methods as described, we need a G-constraint and
possibly an H-constraint. We therefore give an introduction to circles in CGA.
We represent a circle by the inner product null space of a 2-blade C. That space consists
of all conformal points X, the inner product of which with the circle C is zero, i.e.
$X \cdot C = 0$. To understand this relationship, consider the inner product
null space of a sphere $S_r$ with radius r and center m. It can be created from the conformal point
$M = K(m)$ by subtracting the term $\tfrac{1}{2} r^2 e_\infty$. The sphere is thus
given by $S_r = M - \tfrac{1}{2} r^2 e_\infty$. For some vector x it can be verified that
$X \cdot S_r = 0$ iff $\|x - m\| = r$. Now, consider two intersecting spheres $S_1$ and $S_2$. A
circle intuitively consists of all points X lying on $S_1$ and $S_2$. Intersection can be expressed by
the outer product and in fact the circle definition is $C = S_1 \wedge S_2$. For a justification examine
the inner product $X \cdot C$:

    $X \cdot C = X \cdot (S_1 \wedge S_2) = (X \cdot S_1)\, S_2 - (X \cdot S_2)\, S_1 .$         (59)

The terms cannot cancel each other if $S_1$ and $S_2$ are linearly independent, i.e. if they do
not represent the same sphere. The upper equation is therefore zero iff X is located on $S_1$ and
on $S_2$ as well.
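Continuing the small numerical sketch from section 2 (our own illustration with the same assumed (e1, e2, e3, e_inf, e_0) coordinates), the inner product null space property of a sphere, X . S = 0 iff ||x - m|| = r, can be checked directly:

```python
import numpy as np

def embed(x):
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [0.5 * x.dot(x), 1.0]])

def cga_inner(X, Y):
    return X[:3].dot(Y[:3]) - X[3] * Y[4] - X[4] * Y[3]

def sphere(center, radius):
    """S = M - 0.5*r^2 e_inf in the (e1,e2,e3,e_inf,e_0) coordinates."""
    S = embed(center)
    S[3] -= 0.5 * radius**2
    return S

m, r = np.array([1.0, -2.0, 0.5]), 2.0
S = sphere(m, r)
on_sphere  = m + r * np.array([0.0, 0.6, 0.8])    # a point at distance r from m
off_sphere = m + np.array([1.0, 1.0, 1.0])
print(cga_inner(embed(on_sphere), S))             # ~0: the point lies on the sphere
print(cga_inner(embed(off_sphere), S))            # != 0 otherwise
```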
Remarkably, we have found an appropriate G-constraint right from the definition of the
circle's inner product null space itself. It remains to transfer the inner product expression
$X \cdot C$ to an equivalent matrix expression. As there are ten basis blades of grade two in $\mathbb{R}_{4,1}$,
the circle C is described by ten components. The points {b1..N} are embedded and mapped to
their conformal counterparts {B1..N}. Note that our condition equation (59) yields a vector, being
defined by five components in $\mathbb{R}_{4,1}$. Consequently, we obtain

    (60)

which can be differentiated easily. Thus, the required Jacobians follow from
the bilinearity of geometric algebra products in an implicit manner.
Because a circle in 3D-space can be described by a minimum number of six parameters, we
face a functional dependency of degree 4 = 10 - 6 within the parameterization. As mentioned in section 4.1, we
have to introduce constraints on the parameters, namely the H-constraint. We enforce C
to be a circle by requiring a suitable blade condition, which can be shown to be sufficient. In almost the
same way as for the G-constraint, the usage of the tensor notation allows us to derive the corresponding H-matrix. Being in
the possession of all necessary matrices, we are able to run the estimation in order to solve
for the corrections $\Delta p$ and $\Delta b_i$.
We remain with this stage and refer the reader to the next estimation example. There, we
explicitly derive the constraint functions in terms of the tensor notation.
As mentioned earlier, our method provides the covariance matrix of the estimated entity
as well. It shows up to which degree the model fits the observations and how
advantageously they were initially distributed. It does not reflect to which extent the
estimate deviates from a potentially perfect fit, i.e. it is no quality measure for our method.
Figure 4 exemplarily shows the uncertainty of an estimated circle. The surrounding tubes,
indicated by slices, show the standard deviation of the estimates.
4.3 Fitting two Point Clouds in 3D
In this part, we describe how the proposed methods can be used to estimate an RBM; it
extends a rotation, given by a rotor, by a translational component along the axis of rotation.
Hence, we can think of it as a screw motion, cf. (Rosenhahn, 2003). In geometric algebra an
RBM is represented by an operator called motor. In the scope of pose estimation, the pose is
uniquely characterized by an RBM. The estimation of motors is thus a first step towards the
perspective pose estimation problem.
Let {a1..N} and {b1..N} be two sets of N Euclidean points each. The latter represent the
observations for which we have the covariance matrices $\Sigma_{b_i}$. The set {a1..N} is assumed
to have no uncertainty. Let $A_i$ and $B_i$ denote the conformal embeddings
of $a_i$ and $b_i$, respectively. We search for the motor M which best transforms all points in
{A1..N} to the respective points in {B1..N}. The scenario is shown in figure 5.
Using geometric algebra, we can easily write $B_i = M A_i \widetilde{M}$, cf. (Perwass & Sommer, 2002).
Note that a motor is a unitary versor, i.e. it has to satisfy $M \widetilde{M} = 1$. Exploiting this fact, we
rearrange the previous formula and obtain the G-constraint

    $M A_i - B_i M = 0 ,$                                                                        (61)

where we used the unitarity of M. The geometric product itself is encoded by a
corresponding tensor in this matrix formulation. In order to evaluate the Jacobian matrices, we differentiate equation (61)
with respect to the components of M and of the observations, respectively.
Since an RBM is defined by six rather than eight parameters, we need the H-constraint. We
again exploit unitarity and choose $H(M) = M \widetilde{M} - 1$. The reverse operation is
encoded by a corresponding tensor, and the constant part is zero except for the scalar component. Differentiation
yields the required H-matrix. The estimate for M can now be computed by simply
substituting these matrices into the respective equations given in the
theoretical part.
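For comparison, the same point-cloud registration task can be solved in classical linear algebra with the well-known SVD (Kabsch/Horn) method; the sketch below is our own baseline, not the chapter's motor-based estimator, and it ignores the covariance weighting discussed above. It recovers a rotation R and translation t with b_i ~ R a_i + t:

```python
import numpy as np

def fit_rigid_motion(A, B):
    """Least-squares rigid alignment B ~ R A + t via the SVD (Kabsch) method."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                       # cross-covariance of the centered clouds
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
B = A @ R_true.T + np.array([0.5, -1.0, 2.0]) + 0.01 * rng.standard_normal((50, 3))
R, t = fit_rigid_motion(A, B)
print(np.allclose(R, R_true, atol=0.05), t)
```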
Figure 5. Fitting two point clouds: the rotation of the motor M is indicated by the partial
disc. The translational part is specified by the arrow attached to it
camera model, which has the ability to model a variety of distinct camera systems thereby
taking image distortions into account.
Pose estimation certainly is a well-studied subject, but not in case of an omnidirectional
vision system. Hence, our objective was to develop accurate pose estimation for
omnidirectional vision, given imprecise image features, i.e. 2D-sensory data. Note that these
features can readily be detected by the method proposed in section 3.
Comparable to triangulation, the accuracy of an estimated pose benefits when landmarks
can be seen in clearly different directions. But the most significant advantages of
omnidirectional vision are related to navigation, since the objects remain on the image plane
under most camera movements. We consider a single viewpoint catadioptric vision sensor.
It combines a customary camera with a parabolic mirror and provides a panoramic view
of 360 degrees.
We assume that we have 3D-models of the objects of interest that we observe in the
images. Secondly, we assume to know the one-to-one correspondences between the model
features and the image features. Note that a model consists of 3D-lines, which mostly
represent object edges, which in turn, are likely to generate a line under imaging;
consequently, we have lines as image features. We herewith extend our previous work
where we had been employing point features and point models.
Figure 6. Fitting a triangle model to the projection planes spanned by R1, R2 and R3
5.1 Omnidirectional 2D-3D Pose Estimation
Roughly speaking, rigidly moving an object in 3D such that it comes into agreement with
2D-sensory data of a camera is called 2D-3D pose estimation (Grimson, 1990). Specifically,
we estimate an RBM in 3D, such that the model lines come to lie on the projection planes of
the underlying image lines, see figure 6.
The method to be proposed comprises three steps: from those pixels corresponding to
visible model lines, we estimate projection planes with associated uncertainties. In a second
step, a simple algorithm is used to do prior rotation estimation being a first and rough guess
at the rotational part of the desired RBM. As a result the model will be aligned such that its
lines are nearly parallel to the respective projection planes. We finally estimate the entire
pose taking the computed plane uncertainties into account as well.
Before we explain those steps in detail, we give a sketch of catadioptric imaging.
Figure 7. Left: mapping (cross-section) of a world point Pw: the image planes 1 and 2 are
identical. Right: mapping of line L to L via great circle LS on S. As an example, scattered
image data belonging to L is shown
5.3 Estimating Projection Planes
We must come up with observations in the form of planes for a line-plane fitting; we
compute a projection plane for each set of image points that corresponds to a visible model
line. To be more specific, we estimate the planes from the stereographically back-projected
image points. Hence, the points have to be moved to the projection sphere S, see figure 7.
This is done by an inversion of the image points in a certain sphere SI. Note that the
(uncertain) image points, initially identically 2D-distributed, thereby obtain distinct 3D-uncertainties, which reflect the imaging geometry. The uncertainties are computed using
error propagation, where we profit from the inversion being a linear operation in CGA. The
plane estimation can now be done by restricting the circle estimation, see section 4.2, to the
three parameters describing the circle's plane. Recall that we obtain a covariance matrix for
each estimated plane.
Figure 8. Mapping schemes (cross-section) in terms of the inversion camera model. Left:
setup reflecting a pinhole camera. Right: setup modeling a real lens by taking radial
distortions into account. Namings are in concordance with figure 7; C denotes the center of
SI
The aim of image rectification is to undo distortions which originate from a variety of
optical imperfections. The right side of figure 8 shows the problem. The ray belonging to
world point Pw was subjected to distortion which lead to the ray Rdev that eventually
produced p. However, Rdev deviates from the geometrically true ray R in a non-linear
manner depending on the angle to the optical axis. Hence a mapping has to be found that
corrects the position of point p, within the image plane, such that it comes to lie on its
projection ray R again.
We denote the rectified point prect. In (Perwass & Sommer, 2006), the authors discovered that
moving off the inversion sphere SI from F , which distinguishes the mapping schemes in
figure 8, results in a mapping suitable to model distortions. It consists of two parts. First a
versor K, which essentially does the inversion of the image point p, is applied. Next, the
corresponding ray R is constructed and intersected with image plane to give prect.
Our subsequent considerations require a right-handed coordinate system. The vertical axis
denotes the optical axis. It points upwards and is incident with F. The horizontal axis points to the
right and is aligned with the image plane. Hence, all image points lie on the corresponding
coordinate plane.
The inversion sphere $S_I$ of radius r is defined by $S_I = C - \tfrac{1}{2} r^2 e_\infty$, where we used
the abbreviation C for its conformally embedded center. One of the simplest forms K can take on is $K = S_I D$. In order
to handle scaling and for numerically well-balanced equations, the inversion in SI is
preceded by the dilator D (isotropic scaling operator). The dilation operator D for a scaling
6. References
Angles, P. (1980). Construction de revêtements du groupe conforme d'un espace vectoriel
muni d'une métrique de type (p,q). Ann. Inst. Henri Poincaré, 33(1):33-51.
Brackx, F.; Delanghe, R. & Sommen F. (1982). Clifford Analysis, volume 76 of Research Notes
in Mathematics. Pitman Advanced Publishing Program, Boston, MA.
Brannan, D.A.; Esplen, M.F. & Gray, J.J. (1999). Geometry. Cambridge University Press,
Cambridge.
Koenderink, J.J. & van Doorn, A.J. (1987). Representation of local geometry in the visual system.
Biol. Cybern., 55(6):367-375.
Li, H.; Hestenes, D. & Rockwood, A. (2001a). Generalized homogeneous coordinates for
computational geometry. In G. Sommer, editor, Geometric Computing with Clifford
Algebras, pages 27-59, Springer-Verlag, Heidelberg, Berlin.
Li, H.; Hestenes, D. & Rockwood, A. (2001b). Spherical conformal geometry with geometric
algebra. In G. Sommer, editor, Geometric Computing with Clifford Algebras, pages 61-75, Springer-Verlag, Heidelberg, Berlin.
Li, H.; Hestenes, D. & Rockwood, A. (2001c). A universal model for conformal geometries of
euclidean, spherical and double-hyperbolic spaces. In G. Sommer, editor, Geometric
Computing with Clifford Algebras, pages 77-104, Springer-Verlag, Heidelberg, Berlin.
Needham, T. (1997). Visual Complex Analysis. Clarendon Press, Oxford.
Pauli, J. (2001). Learning-Based Robot Vision, volume 2048 of Lecture Notes in Computer
Science. Springer-Verlag, Heidelberg, Berlin.
Perwass, C.; Gebken, C. & Sommer, G. (2005). Estimation of geometric entities and operators
from uncertain data. In 27. Symposium für Mustererkennung, DAGM 2005, Wien,
29.8.-2.9.2005, number 3663 in LNCS. Springer-Verlag, Heidelberg, Berlin.
Perwass, C.; Gebken, C. & Sommer, G. (2006). Geometry and kinematics with uncertain
data. In A. Leonardis, H. Bischof, and A. Pinz, editors, 9th European Conference on
Computer Vision, ECCV 2006, May 2006, Graz, Austria, number 3951 in LNCS, pages
225-237. Springer-Verlag, Heidelberg, Berlin.
Perwass, C. & Hildenbrand, D. (2003). Aspects of geometric algebra in euclidean, projective
and conformal space. Technical Report Number 0310, Christian-Albrechts-Universität zu Kiel, Institut für Informatik und Praktische Mathematik, September.
Perwass, C. & Sommer, G. (2002). Numerical evaluation of versors with Clifford algebra. In
Leo Dorst, Chris Doran, and Joan Lasenby, editors, Applications of Geometric Algebra
in Computer Science and Engineering, pages 341-349. Birkhauser.
Perwass, C. & Sommer, G. (2006). The inversion camera model. In 28. Symposium für
Mustererkennung, DAGM 2006, Berlin, 12.-14.09.2006, number 4174 in LNCS, pages
647-656. Springer-Verlag, Heidelberg, Berlin.
Porteous, I.R. (1995). Clifford Algebras and the Classical Groups. Cambridge Stud. Adv. Math.,
Cambridge University Press, Cambridge.
Rosenhahn, B. (2003). Pose Estimation Revisited. Technical Report Number 0308, Christian-Albrechts-Universität zu Kiel, Institut für Informatik und Praktische Mathematik,
September.
Rosenhahn, B. & Sommer, G. (2005a). Pose estimation in conformal geometric algebra, part I:
The stratification of mathematical spaces. Journal of Mathematical Imaging and Vision,
22:27-48.
Rosenhahn, B. & Sommer, G. (2005b). Pose estimation in conformal geometric algebra, part II:
Realtime pose estimation using extended feature concepts. Journal of Mathematical
Imaging and Vision, 22:49-70.
Sommer, G. (1997). Algebraic aspects of designing behavior based systems. In G. Sommer
and J.J. Koenderink, editors, Algebraic Frames for the Perception and Action Cycle,
volume 1315 of Lecture Notes in Computer Science, pages 1-28. Proc. Int.Workshop
AFPAC97, Kiel, Springer-Verlag, Heidelberg, Berlin.
Sommer, G. (1999). The global algebraic frame of the perception-action cycle. In B. Jähne, H.
Haussecker, and P. Geissler, editors, Handbook of Computer Vision and Applications,
volume 3, pages 221-264. Academic Press, San Diego.
Sommer, G. (2004). A geometric algebra approach to some problems of robot vision. In J.
Byrnes, editor, Computational Noncommutative Algebra and Applications, number 136
in NATO Science Series II, pages 309-338. Kluwer Academic Publishers, Dordrecht.
Sommer, G. (2005). Applications of geometric algebra in robot vision. In H. Li, P.J. Olver,
and G. Sommer, editors, Computer Algebra and Geometric Algebra with Applications,
volume 3519 of LNCS, pages 258-277. 6th International Workshop IWMM 2004,
Shanghai, China and International Workshop GIAE 2004, Xi'an, China, Springer-Verlag, Heidelberg, Berlin.
Sommer, G. & Zang, D. (2007). Parity symmetry in multi-dimensional signals.
Communications in Pure and Applied Analysis, accepted.
Stein, E.M. & Weiss G. (1971). Introduction to Fourier Analysis on Euclidean Spaces, volume
32 of Princeton Mathematical Series. Princeton University Press, Princeton, N.J.
Zang, D. & Sommer, G. (2006a). The monogenic curvature scale-space. In R. Reulke, U.
Eckardt, B. Flach, U. Knauer, and K. Polthier, editors, 11th International Workshop on
Combinatorial Image Analysis, IWCIA06, Berlin, volume 4040 of LNCS, pages 320-332.
Springer-Verlag, Heidelberg, Berlin.
Zang, D. & Sommer, G. (2006b). Detecting intrinsically two-dimensional image structures
using local phase. In K. Franke, K. Müller, B. Nickolay, and R. Schäfer, editors, 28.
Symposium für Mustererkennung, DAGM 2006, Berlin, 12.9.-14.9.2006, number 4174
in LNCS, pages 222-231. Springer-Verlag, Heidelberg, Berlin.
Zang, D. & Sommer, G. (2007). Signal modeling for two-dimensional image structures.
Journal of Visual Communication and Image Representation, 18(1):81-99.
Zang, D.; Wietzke, L.; Schmaltz, C. & Sommer, G. (2007). Dense optical flow estimation from
the monogenic curvature tensor. In Int. Conf. on Scale-Space and Variational Methods
(SSVM), Ischia, Italy.
26
Algebraic Reconstruction and Post-processing
in Incomplete Data Computed Tomography:
From X-rays to Laser Beams
Alexander B. Konovalov, Dmitry V. Mogilenskikh, Vitaly V. Vlasov and
Andrey N. Kiselev
Russian Federal Nuclear Centre Zababakhin Institute of Applied Physics
Russia
1. Introduction
Methods of computed tomography are well developed and widely used in medicine and
industry. If tomographic data are complete, it is possible to reconstruct the images with submillimeter resolution. If the data are incomplete, tomograms may blur, i.e. their resolution
degrades, noise increases and artifacts form. The situation is worst if measurement data are
so poor that the system of equations which describe the discrete reconstruction problem
appears to be strongly underdetermined. In this situation, images of acceptable quality can
be obtained with algorithms that regularize the solution and use a priori information about
the object, and do post-processing of reconstructed tomograms also with the use of a priori
information, as a rule. This chapter provides two examples demonstrating the
reconstruction of the internal structure of an object from strongly incomplete measurement
data: few-view computed tomography (FVCT) and diffuse optical tomography (DOT) of
strongly scattering media. The problem of reconstruction from a small number of
views (<10) arises, for example, in experimental plasma research (Pickalov & Melnikova,
1995) or nondestructive testing (Subbarao et al., 1997). DOT is now deemed to hold much
promise for cancer detection (Arridge, 1999; Hawrysz & Sevick-Muraca, 2000; Yodh &
Chance, 1995). Here the strong incompleteness of data is caused by the fact that the number
of source-receiver relations that define the number of measurements is strictly limited.
Although these types of tomography use different wavelength bands (X-ray and near
infrared) and different mathematical models (linear and non-linear), we think it is not only
possible, but also interesting to consider them together because in both cases we successfully
use similar reconstruction algorithms and similar post-processing methods. The unique
possibility to do that comes from the fact that in case of DOT, we use a simplified
reconstruction method (Konovalov et al., 2003; 2006b; 2007; Lyubimov et al., 2002; 2003)
reducing the inverse problem to a solution of the integral equation with integration along a
conditional photon average trajectory (PAT), an analog of the Radon transform in
projection tomography.
In case of FVCT, we use actual data from measurements in a simple experimental
radiography setup (Konovalov et al., 2006). The FVCT procedure is simulated by rotation
of the object from exposure to exposure about the centre of the reconstruction region. For
objects, we use a spatial resolution test and an iron sphere with quasi-symmetric cracks
resulting from shock compression.
In case of DOT, we use model data from the numerical solution of a time-dependent
diffusion equation with an instantaneous point source (time-domain measurement
technique). We consider a traditional geometry where sources and receivers are on the
boundary of a scattering object in the form of a flat layer (Konovalov et al., 2006b). The
object contains periodic structures created by circular absorbing inhomogeneities.
In both cases, the inverse problem is solved using algebraic reconstruction techniques
(additive and multiplicative) which we modernized to attain better convergence of the
iterative reconstruction process (Konovalov et al., 2006; 2006b). Procedures used to
calculate the weight matrices are described in detail. Solution correction formulas are
modified with respect to distributions of weight sums and solution correction numbers over
image elements. Weighted smoothing is performed at each iteration of solution
approximation. We use a priori information on whether the solution is non-negative and on
the presence of structure-free zones in the reconstruction region.
For post-processing of reconstructed tomograms, we use space-varying restoration
(Konovalov et al., 2007), methods for enhancing the informativity of images based on their
nonlinear color interpretation (Mogilenskikh, 2000) and methods for estimating image
informativity based on binary operations and visualization algorithms (Mogilenskikh &
Pavlov, 2002; Mogilenskikh, 2003).
Results of the investigation help determine how spatial resolution depends on the degree of data
incompleteness and allow inferences on whether the modified reconstruction techniques are
effective and whether the investigated post-processing methods are capable of making
tomograms more informative.
The chapter is organized as follows. Section 2 gives a general formulation of the
tomography problem. It is shown that the inverse problem of DOT, like the problem of
reconstruction from X-ray projections, can be reduced to a solution of an integral equation
with integration along the trajectory. The Section describes a discrete model of a 2D
reconstruction problem and modernized algebraic techniques. Section 3 gives examples of
2D reconstruction from experimental radiographic data and model diffusion projections
from optical inhomogeneities. The Section makes a quantitative analysis of the spatial
resolution of tomograms reconstructed from strongly incomplete data. Section 4 describes
post-processing methods and gives examples of their use. Section 5 draws inferences and
outlines further research in the area.
    $g = \int_V w(\mathbf{r})\, f(\mathbf{r})\, d^3 r ,$                                         (1)
where w(r ) is a weighting function which depends on source and receiver positions in
space, the type of actual physical measurements and the way of data recording.
In transmission X-ray tomography, where the spatial distribution of the extinction coefficient
$\mu(\mathbf{r})$ is reconstructed, it is usually assumed that the weighting function is unity along a line
L connecting a point source and a point receiver, and zero elsewhere. Then expression (1)
turns into the linear integral

    $g = \int_L \mu(\mathbf{r})\, dl .$                                                          (2)
In computed tomography, it is known as the Radon transform. Integral (2) is inverted with a
linear reconstruction model implemented with the use of both integral algorithms (Kak &
Slaney, 1988) and algebraic techniques (Herman, 1980).
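As a minimal few-view illustration of the Radon transform (2) (our own sketch, assuming scikit-image and NumPy are available; the number of views is chosen arbitrarily), projections of a test image can be computed and inverted as follows:

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

image = resize(shepp_logan_phantom(), (128, 128))     # simple test object
angles = np.linspace(0.0, 180.0, 8, endpoint=False)   # few-view setting: 8 views only
sinogram = radon(image, theta=angles)                 # discrete counterpart of equation (2)
recon = iradon(sinogram, theta=angles)                # integral (filtered backprojection) inversion
print(sinogram.shape, recon.shape)
```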
Divergence of the probing beam in, for example, proton (Hanson, 1981; 1982) or diffraction
(Devaney, 1983) tomography makes it necessary to consider not a line but a narrow 3D strip
of a finite length. In this case, it may be necessary to change from the linear integration (2) to the
volume integration (1) and to pose restrictions on the use of the linear reconstruction model.
Diffuse optical tomography (DOT) of strongly scattering media is the most demonstrative
example of non-linear tomography. Laser beams used for probing undergo multiple
scattering, so photon trajectories are not regular and photons are distributed in the entire
volume V under study. As a result, each point in the volume significantly contributes to the
detected signal. If, for example, we deal with absorbing inhomogeneities of tissues
examined by pulsed probing with the time-domain measurement technique, integral (1), in
the approximation of the perturbation theory by Born or Rytov, takes the form (Lyubimov et
al., 2002; 2003)
    $g(t) = \int_V \int_0^{t} v\, P[\mathbf{r}, \tau \mid (\mathbf{r}_s, 0) \to (\mathbf{r}_d, t)]\; \mu_a(\mathbf{r})\, d\tau\, d^3 r ,$        (3)

where t is the time-gating delay of the receiver recording the signal, v is the light velocity
in the media, $P[\mathbf{r}, \tau \mid (\mathbf{r}_s, 0) \to (\mathbf{r}_d, t)]$ is the density of the conditional probability that a
photon migrating from a space-time source point $(\mathbf{r}_s, 0)$ to a space-time receiver point
$(\mathbf{r}_d, t)$ visits the point $\mathbf{r}$ at the time $\tau$, and $\mu_a(\mathbf{r})$ is the distribution of the absorbing inhomogeneities. By
means of the conditional probability density P, Lyubimov et al. (2002; 2003) proved that
integral (3) could be presented as

    $g(t) = \int \dfrac{\langle \mu_a(\mathbf{r}) \rangle_P}{v(l)}\; dl ,$                        (4)

where the integration is carried out along the trajectory of the mass center of the distribution P,

    $\mathbf{R}(\tau) = \int_V \mathbf{r}\; P[\mathbf{r}, \tau \mid (\mathbf{r}_s, 0) \to (\mathbf{r}_d, t)]\, d^3 r ,$                          (5)

which we call a photon average trajectory (PAT). Here l is the distance along the PAT, v(l) is
the relative velocity of the mass center of the distribution P along the PAT as a function of
l, and $\langle \cdot \rangle_P$ is the operator of averaging over the spatial distribution P. Integral equation (4) is
a fundamental equation of the photon average trajectories method (PAT method) in case of
time-domain measurement technique. It is an analog of Radon transform (2) and can be
inverted with the fast algorithms of projection tomography. In other words, converting (3)
into (4) offers an opportunity to change from multi-step to one-step reconstruction in the
sense that the system of algebraic equations describing the discrete reconstruction model is
only inverted once and hence, to achieve significant savings in computational time.
Equation (4) has definitely a number of differences from equation (2), specifically:
(a) Integration is performed along not a straight but curved line;
(b) Under integral (4), there is a weighting distribution 1 / v(l ) which depends on spatial
coordinates; and
(c) Trajectory integration is applied not to the object function itself, but to a function
averaged over the spatial distribution P .
The latter means that the reconstructed image is degraded by a priori blur which requires
additional work, i.e. post-processing of tomogram. With the above differences, it becomes
clear that the inversion of equation (4) with the linear reconstruction model requires certain
assumptions which may affect the quality of reconstructed images. Nevertheless, our earlier
studies (Konovalov et al., 2003; 2006b; 2007; Lyubimov et al., 2002; 2003) and results
presented in Sections 3 and 4 show that the PAT method is quite effective in the context of
the tomogram quality versus reconstruction speed trade-off.
2.2 Discrete image reconstruction model
In medical applications of X-ray computed tomography, equation (2) is usually inverted by
means of integral reconstruction algorithms such as the backprojection algorithm with
convolution filtering (Kak & Slaney, 1988). In FVCT where the number of views is small,
reconstruction with the integral algorithms gives aliasing artifacts which are present on
tomograms as rays tangential to reproduced structures (Palamodov, 1990). Different
smoothing and regularization methods can be applied to remove these artifacts which
strongly restrict the resolution of small details. But the quality of reconstructed images still
remains far from satisfactory.
It is also difficult to invert equation (4) with integral algorithms. Here problems arise from
not only incomplete data, but also from curved PATs. Our attempts to implement the
backprojection algorithm for diffusion tomograms (Konovalov et al., 2003; 2007; Lyubimov
et al., 2003) are based on the assumption that the PATs are almost straight lines inside the
scattering object. But with this approach it is impossible to reconstruct the spatial
distribution of absorbing inhomogeneities near boundaries where photons escape from the
object like an avalanche and the PATs strongly bend.
In this case, both in FVCT and in DOT, it is appropriate to use iterative algebraic algorithms
implementing a discrete reconstruction model. In this chapter, without loss of generality, we
will only consider examples of 2D reconstruction, i.e. reconstructions of 2D images. The
generalized discrete model of 2D reconstruction is formulated traditionally (Herman, 1980).
Let us establish a Cartesian grid for square image elements so that it covers the object.
Assume that the reconstructed object function takes a constant value f kl in an element with
indices k and l (hereafter, (k , l ) -cell). Let Lij be a straight line or PAT connecting
i -source and j -receiver, and gij be a projection measured by j -receiver from i -source.
Then the discrete reconstruction model can be characterized by a system of linear algebraic
equations
    $g_{ij} = \sum_{k,l} W_{ijkl}\, f_{kl} ,$                                                    (6)
where Wijkl is the weight contributed by the (k , l ) -cell to the measured value gij . In the
traditional setup of 2D reconstruction, the weight Wijkl is proportional to the length of
intersection of the trajectory Lij with the (k , l ) -cell (Herman, 1980; Lyubimov et al., 2002).
Figure 1. Calculation of the weights $W_{ijkl}$: the (i, j)-strip between the i-source and the j-receiver $D_j$ intersects the (k, l)-cell bounded by $x_k$, $x_{k+1}$, $y_l$, $y_{l+1}$ over the area $S_{ijkl}$ ((a) trapezoidal strip for X-ray tomography, (b) banana-shaped strip for DOT)
convergence of algorithms used to solve system (6) that is strongly underdetermined due to
incomplete data. To reduce the number of zero elements in the matrix, we modernized the
method for calculation of $W_{ijkl}$ by replacing the infinitely narrow trajectory with a strip of a
finite width (Konovalov et al., 2006; 2006b).
In X-ray tomography, the strip is a long trapezoid (Figure 1(a)). Its bases are the source aperture
(the linear size of the focal spot) and the receiver aperture (as a rule, the intrinsic resolution of
the recording system). In this case, the weights can be calculated with the formula

    $W_{ijkl} = S_{ijkl} / \Delta ,$                                                             (7)

where $S_{ijkl}$ is the area of intersection of the strip corresponding to the i-source and the j-receiver
(hereafter, the (i, j)-strip) with the (k, l)-cell, and $\Delta$ is the linear size of the cell. It is obvious
that the calculation of $S_{ijkl}$ for trapezoidal strips does not cause difficulty.
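A compact way to obtain such area weights numerically (our own sketch, not the authors' code) is to clip the strip polygon against each square cell with the Sutherland-Hodgman algorithm and divide the clipped area by the cell size, as in equation (7):

```python
import numpy as np

def _clip_halfplane(poly, inside, intersect):
    out = []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        if inside(p):
            out.append(p)
            if not inside(q):
                out.append(intersect(p, q))
        elif inside(q):
            out.append(intersect(p, q))
    return out

def clip_to_cell(poly, xmin, ymin, xmax, ymax):
    """Sutherland-Hodgman clipping of a convex polygon against an axis-aligned cell."""
    def ix(v):   # intersection with a vertical line x = v
        return lambda p, q: (v, p[1] + (q[1] - p[1]) * (v - p[0]) / (q[0] - p[0]))
    def iy(v):   # intersection with a horizontal line y = v
        return lambda p, q: (p[0] + (q[0] - p[0]) * (v - p[1]) / (q[1] - p[1]), v)
    steps = [(lambda p: p[0] >= xmin, ix(xmin)),
             (lambda p: p[0] <= xmax, ix(xmax)),
             (lambda p: p[1] >= ymin, iy(ymin)),
             (lambda p: p[1] <= ymax, iy(ymax))]
    for inside, intersect in steps:
        poly = _clip_halfplane(poly, inside, intersect)
        if not poly:
            return []
    return poly

def area(poly):
    """Shoelace formula."""
    if len(poly) < 3:
        return 0.0
    x, y = np.array(poly).T
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# weight of a (k,l)-cell, cf. equation (7): W = S / delta
strip = [(0.0, 0.2), (10.0, 0.1), (10.0, 1.4), (0.0, 0.8)]   # a trapezoidal (i,j)-strip
delta = 1.0                                                  # linear cell size
cell = (3.0, 0.0, 4.0, 1.0)                                  # the (k,l)-cell
S = area(clip_to_cell(strip, *cell))
print("W_ijkl =", S / delta)
```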
The situation is more complicated in DOT. The configuration and size of the appropriate
strip must be selected with account for the spatial distribution of the trajectories of photons
migrating from the point $(\mathbf{r}_s, 0)$ to the point $(\mathbf{r}_d, t)$. According to the above statistical
model, the most probable trajectories are distributed in a zone defined by the standard root-mean-square deviation (RMSD) from the PAT in accordance with the formula

    $\sigma(\tau) = \left[\, \int_V |\mathbf{r} - \mathbf{R}(\tau)|^2\; P[\mathbf{r}, \tau \mid (\mathbf{r}_s, 0) \to (\mathbf{r}_d, t)]\, d^3 r \,\right]^{1/2} .$        (8)

This zone is shaped as a banana (Lyubimov et al., 2002; Volkonskii et al., 1999) with vertices at the
points of source and receiver localizations on the boundary of the scattering object.
Therefore, for the (i, j)-strip we take a banana-shaped strip (Figure 1(b)) whose width is
directly proportional to the RMSD: $\zeta(\tau) = \beta\, \sigma(\tau)$. The problem is thus reduced to finding
statistical characteristics (5) and (8) of photon trajectories. Note that the exact analytical
calculation of R ( ) and ( ) is difficult for even simple configurations such as a circle or a
flat layer. The use of numerical techniques is undesirable because of the necessity to save
computational time. Therefore, a number of simplifying assumptions should be made.
Lyubimov et al. (2002) and Volkonskii et al. (1999) propose to approximate the PAT by a
three-segment broken line whose end segments are orthogonal to the boundary of the
scattering object and the middle segment connects the end ones. This approach is effective if
inhomogeneities are located inside the object, but causes distortions if inhomogeneities are
near the boundaries where the PATs bend. In this chapter we configure banana-shaped
strips in the geometry of a flat layer using a simplified analytical approach based on the
analysis of PAT bending near a plane boundary. The approach uses the time-dependent
radiation transport equation in the diffusion approximation. Konovalov et al. (2006b)
showed that in the case where an instantaneous point source was in a homogeneous half-space (a half-plane in 2D) $y \ge 0$ at a point $(0, y_0)$ and a receiver was at a point $(x_0, 0)$ on
the boundary y = 0, the coordinates of the mass center of the distribution P, moving from the
source point to the receiver point could be expressed as
    $X(\tau) = x_0\, \tau / t ,$                                                                 (9)

while $Y(\tau)$ is given by a closed-form expression composed of $\operatorname{erf}(\cdot)$ and $\exp(\cdot)$ functions of $\tau$, t, $y_0$
and the parameter $\delta = 4Kvt / y_0^2$, where K is the diffusion coefficient of the media and erf(.) is the probability
integral. If we assume that PAT bending near the plane (straight line) of a source S is similar to
bending near the plane (straight line) of a receiver D and there is no influence of the
opposite boundary, analytical expressions (9) can easily be used to construct the PAT for the
flat layer geometry (Figure 2). Indeed, the mass center passes the distance SO and the
distance OD during the time t/2. If the mass center moved in the half-space $y \ge 0$ from a
point $S_0$ to the point D through the point O, the time t/2 would correspond to the
distance $S_0 O$. Since the component of the velocity along the X-axis is constant, the point $S_0$ lies on
the perpendicular through S to the media boundaries. The distance $S_0 S$ can be found through
the numerical solution of the equation $Y(\tau)\big|_{\tau = t/2} = d/2$, where d is the width of the layer, for
$y_0$ (see expressions (9)). After that the distance OD is calculated with (9) and the distance
SO is obtained through its symmetric reflection about the point O.
Figure 2. Construction of the photon average trajectory in the flat layer geometry (points S, S0, O, D; axes X, Y in cm)
Figure 3. Geometry of data recording: positions of the sources (S) and receivers (D) on the opposite boundaries of the object and examples of photon average trajectories (axes X, Y in cm)
Figure 3 shows the geometry of data recording we chose for the simulations. Red triangles
denote the positions of sources and blue circles the positions of receivers. It also shows,
as examples, six average trajectories reproduced with the above algorithm for t = 3000 ps
and optical parameters K = 0.066 cm and v = 0.0214 cm/ps. Blue lines show piecewise-linear approximations of the PATs. Coordinates of the indicated sources and receivers (in
centimeters) are as follows: S5 (-2.52, 4), D17 (-5, -4), D20 (-3.06, -4), D23 (-1.13, -4),
D26 (0.81, -4), D29 (2.74, -4), D32 (4.68, -4). In this chapter we study the probing regime
in transmission, i.e. only relations between sources and receivers located on the opposite
boundaries of the object are considered. The total number of average trajectories therefore
equals 32 x 16 = 512 (32 sources and 16 receivers). In the reconstruction we will vary the number
of sources to study how the spatial resolution depends on the degree of data
incompleteness.
High accuracy of RMSD calculation is not crucial for the construction of banana-shaped
strips. Therefore, in accordance with the inference of Volkonskii et al. (1999) that RMSD is
actually independent of the form of the object, we can use the following simple formula for
infinite space:

    $\sigma(\tau) \approx \left[\, 2Kv\,\tau\,(t - \tau)/t \,\right]^{1/2} .$                    (10)
Boundaries of the strips are thus defined by piecewise-linear functions. To calculate the
areas S ijkl , we find the points where the strip boundaries intersect the sides of the cell. A
polygon with vertices at the obtained points and cell nodes is treated as the intersection of
the (i, j)-strip and the (k, l)-cell (Figure 1(b)). Weights are calculated with the formula

    $W_{ijkl} = \dfrac{S_{ijkl}}{\Delta\, v_{ijkl}} ,$                                           (11)

where $v_{ijkl}$ is the discrete velocity of the mass center of the distribution P for the
(i, j)-strip and the (k, l)-cell. Analytically, the velocities $v(\tau_p)$ are determined through
differentiation of expressions (9). The array of discrete values {vijkl } is defined with the
following algorithm.
(a) Define a set of discrete times $\{\tau_p\}$.
(b) Construct perpendiculars to the tangential lines at the points of $L_{ij}$ corresponding to the
times $\{\tau_p\}$ (Figure 5).
(c) Run a loop over p, in which the following sequence of steps is performed:
Find the cells where the (i, j)-strip intersects the strip created by two neighbouring
perpendiculars corresponding to the times $\tau_p$ and $\tau_{p+1}$ (in Figure 5, these cells
are marked) and assign to them the velocity value for this time interval.
If some value $v_{ijkl}^{old}$ has already been assigned to a cell, it is updated with the
formula

    $v_{ijkl} = \dfrac{v_{ijkl}^{old}\, N + v_{ijkl}^{new}}{N + 1} ,$                            (12)

where $v_{ijkl}^{new}$ is the new value and N is the number of previous updates.
Figure 6. The area of the object filled with banana-shaped strips for different values of the
coefficient beta: (a) 0; (b) 0.05; (c) 0.15; and (d) 0.25
(d) All PATs are searched sequentially and, for each of them, the procedure is repeated
beginning from step (b).
The proportionality coefficient $\beta \in (0, 1)$ which defines the width of the banana-shaped
strip is selected from a condition dictating that all strips must sufficiently fill the area of the
object. Figure 6 shows the filling of the rectangular object presented in Figure 3 for a ratio of
sources and receivers (hereafter, measurement ratio) of 32 x 16 and $\beta$ equal to 0, 0.05, 0.15, and
0.25. In Figure 6(a), (b), and (c), there are extended regions with no strips (shown in blue).
This means that, if the grid is of high resolution, there are cells where corrections won't be
introduced during the process of reconstruction. In Figure 6(d) these regions are very small
in size, which minimizes the probability that dead cells will appear. That is why we
reconstruct the absorbing inhomogeneities embedded in the scattering object shown in
Figure 3 using banana-shaped strips whose width is $\zeta(\tau) = 0.25\, \sigma(\tau)$.
It should be noted that the problem of area filling in FVCT is not as decisive as in DOT even
if the strips are very narrow. Despite the small number of views, the number of strips
corresponding to one view is rather large (> 100).
2.3 Algebraic reconstruction techniques and methods of their modification
When selecting an algorithm to invert system (6), we must remember that in case of very
incomplete data, the system appears to be strongly underdetermined. That is why the
problem of solution regularization is of great importance in the context of the need to
approximate the solution correctly and hence, to obtain tomograms which are free of
artifacts. It is well known that the minimum of artifacts corresponds to the minimum of
information contained in images. Under these circumstances, it seems appropriate to do
reconstruction with an approach based on entropy optimization (Levine & Tribus, 1978). In
this chapter we study the multiplicative algebraic reconstruction technique (MART) which
implements the entropy maximum method. The problem of solution regularization is
formulated as follows. Find the array of values { f kl } which satisfies system (6) and the
conditions
f kl 0,
f kl ln f kl max .
k ,l
(13)
For the purpose of comparison and to demonstrate advantages of the MART, we also
consider a well-known additive algebraic reconstruction technique (AART) which does not
optimize entropy.
Both MART and AART are based on an iterative procedure of correction of a certain initial
approximation $\{f_{kl}^{(0)}\}$. At each (s + 1)-iteration, trajectories (strips) from one source only are
considered. Thus, the correction is introduced into those elements of the approximation
$\{f_{kl}^{(s)}\}$ which correspond to the cells intersected by the given strips. Upon a transition from
one iteration to another, the sources are searched cyclically. Original formulas for the
correction of the s -th approximation to the solution are written as follows (Herman, 1980)
    MART:  $f_{kl}^{(s+1)} = f_{kl}^{(s)} \left( \dfrac{g_{ij}}{\sum_{k,l} W_{ijkl}\, f_{kl}^{(s)}} \right)^{\lambda W_{ijkl}}$

    AART:  $f_{kl}^{(s+1)} = f_{kl}^{(s)} + \lambda\; \dfrac{g_{ij} - \sum_{k,l} W_{ijkl}\, f_{kl}^{(s)}}{\| W_{ij} \|_F^2}\; W_{ijkl} ,$        (14)

where $\lambda \in (0, 1)$ is the parameter which controls the rate of convergence of the iterative process and
$\| W_{ij} \|_F^2 = \sum_{k,l} W_{ijkl}^2$ is the squared norm of the weights of the (i, j)-strip.
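To make the update rules concrete, here is a small sketch of one additive and one multiplicative correction cycle on a toy system g = W f (our own illustration following standard ART/MART formulations with a normalized exponent, not the authors' modified code; the relaxation parameter and data are arbitrary):

```python
import numpy as np

def aart_sweep(f, W, g, lam=0.5):
    """One cycle of additive corrections (classical ART/AART form)."""
    for i in range(W.shape[0]):
        w = W[i]
        norm2 = w.dot(w)
        if norm2 > 0:
            f = f + lam * (g[i] - w.dot(f)) / norm2 * w
    return np.maximum(f, 0.0)        # a priori non-negativity, as used with the AART

def mart_sweep(f, W, g, lam=0.5):
    """One cycle of multiplicative corrections (MART with a normalized exponent)."""
    for i in range(W.shape[0]):
        w = W[i]
        denom = w.dot(f)
        if denom > 0 and g[i] > 0:
            f = f * (g[i] / denom) ** (lam * w / (w.max() + 1e-12))
    return f

rng = np.random.default_rng(3)
f_true = rng.uniform(0.1, 1.0, 25)           # 5x5 "image", flattened
W = rng.uniform(0.0, 1.0, (12, 25))          # strongly underdetermined system (12 strips)
g = W @ f_true
f_a = np.full(25, 0.5)
f_m = np.full(25, 0.5)
for _ in range(50):
    f_a = aart_sweep(f_a, W, g)
    f_m = mart_sweep(f_m, W, g)
print(np.linalg.norm(W @ f_a - g), np.linalg.norm(W @ f_m - g))
```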
Our experience of using the algebraic techniques in FVCT (Konovalov et al., 2006a) and
DOT (Konovalov et al., 2006b; Lyubimov et al., 2002) suggests that a number of
modifications to formulas (14) are needed to improve convergence in case of strongly
incomplete data. So, expressions (14) do not allow for
(a) the non-uniform distributions of weight sums and solution correction numbers over the
cells; and
(b) any a priori information on the spatial distribution of reproduced structures.
As a result, both algorithms including the MART with regularization (13) often converge to
a wrong solution. Because of the incorrect redistribution of intensity, images exhibit distinct
artifacts which are often present in the regions where the structures are actually absent.
To avoid these shortcomings, we here use the following formulas for modified algebraic
techniques
Step 1

    MART:  $\tilde f_{kl}^{(s+1)} = w_{kl}\, f_{kl}^{(s)} \left( \dfrac{g_{ij}}{\sum_{k,l} W_{ijkl}\, f_{kl}^{(s)}} \right)^{\lambda W_{ijkl} / \widetilde W_{kl}}$

    AART:  $\tilde f_{kl}^{(s+1)} = w_{kl} \left[ f_{kl}^{(s)} + \lambda\; \dfrac{g_{ij} - \sum_{k,l} W_{ijkl}\, f_{kl}^{(s)}}{\| W_{ij} \|_F^2}\; \dfrac{W_{ijkl}}{\widetilde W_{kl}} \right] ,$        (15)

where $\widetilde W_{kl} = \sum_{i,j} W_{ijkl} / N_\Sigma$ is the reduced weight sum, $N_\Sigma$ is the total
number of strips used in reconstruction, and w is the matrix of correction factors which
allow for a priori information on the object function (see below).

Step 2

    $f_{kl}^{(s+1)} = \dfrac{1}{(2r+1)^2} \sum_{m=-r}^{r} \sum_{n=-r}^{r} c_{k+m,\,l+n}\; \tilde f_{k+m,\,l+n}^{(s+1)} ,$        (16)

where the integer r specifies the half-width of the (2r+1) x (2r+1) smoothing window, the weighting
coefficients $c_{kl}$ are built from the normalized distributions of the reduced weight sums and of the
correction numbers, $A_{kl}$ is the number of corrections to the solution element corresponding to the (k, l)-cell, and

    (17)

denotes the operator which normalizes the distributions $\{\widetilde W_{kl}\}$ and $\{A_{kl}\}$.
Accounting for the distributions of reduced weight sums and correction numbers over the
cells is most crucial for DOT where they are markedly non-uniform (Figure 7). Figure 8
shows an example of reconstruction of the scattering object with two circular absorbing
inhomogeneities 0.8 cm in diameter (see Section 3.2). Hereafter, the red triangles
represent the localizations of the sources used for reconstruction. The Figure demonstrates the
advantages of the modified MART. We have bad results without taking into account the
distributions $\{\widetilde W_{kl}\}$ and $\{A_{kl}\}$.
Figure 7. Distributions of reduced weight sums (a) and solution correction numbers (b) over
the 137 x 100 grid which covers the object shown in Figure 3
of the middle section of the iron sphere compressed by an explosion from radiographic data
(see Section 3.1). The algorithm is described by the following sequence of steps:
(a) Reconstruct the image $\{f_{kl}^1\}$ from projections corresponding to the first source only
(Figure 9(a)).
Figure 9. Generation of a useful part of the tomogram: (a) the image $\{f_{kl}^1\}$; (b) the image
$\{\tilde f_{kl}^2\}$; (c) the image $\{\tilde f_{kl}^{24}\}$; (d) the set of multilevel regions
(b) Reconstruct the image $\{f_{kl}^2\}$ from projections corresponding to the second source only
and compare it with the result obtained at step (a). Following the result of the
comparison, form the image $\tilde f_{kl}^2 = \min(f_{kl}^1, f_{kl}^2)$ for each (k, l)-cell
(Figure 9(b)).
(c) Repeat step (b) for each following i-source, forming the image $\{\tilde f_{kl}^i\}$ such that
$\tilde f_{kl}^i = \min(\tilde f_{kl}^{i-1}, f_{kl}^i)$ (Figure 9(c)). Search all given sources.
(d) For the last image $\{\tilde f_{kl}^{last}\}$, define a certain ascending sequence of relative thresholds
$\{\tau_m\}_{m=1}^{M}$, the largest of which does not exceed, for example, 0.1-0.2, and determine the correction
factors $\{w_{kl}\}$ using the following relations (a small numerical sketch is given below):

    $w_{kl} = 0,$        if  $\tilde f_{kl}^{last} < \tau_1 \max\{\tilde f_{kl}^{last}\}$,
    $w_{kl} = m / M,$    if  $\tau_m \max\{\tilde f_{kl}^{last}\} \le \tilde f_{kl}^{last} < \tau_{m+1} \max\{\tilde f_{kl}^{last}\}$,        (18)
    $w_{kl} = 1,$        if  $\tilde f_{kl}^{last} \ge \tau_M \max\{\tilde f_{kl}^{last}\}$.
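A minimal sketch of relation (18) in code (ours; the threshold values are arbitrary):

```python
import numpy as np

def correction_factors(f_last, thresholds):
    """Multilevel correction factors w_kl according to relation (18)."""
    tau = np.asarray(sorted(thresholds), dtype=float)
    M = len(tau)
    levels = tau * f_last.max()                 # absolute thresholds
    w = np.zeros_like(f_last)
    for m in range(M):                          # w = m/M inside the m-th band
        w[f_last >= levels[m]] = (m + 1) / M
    return w                                    # 0 below tau_1, 1 at and above tau_M

f_last = np.abs(np.random.default_rng(7).standard_normal((64, 64)))
w = correction_factors(f_last, thresholds=[0.02, 0.05, 0.1, 0.15])
print(np.unique(w))
```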
Figure 10. Reconstructions of the sphere section from 24 views by the MART without (a) and
with (b) the correction factors {wkl }
Such a definition of the set of multilevel regions with values that monotonically decrease
from unity to zero (Figure 9(d)) allows artifacts to be avoided in the structure-free zones, i.e.
where the object function must be zero or close to zero. The effect of accounting for {wkl } is
demonstrated in Figure 10 which illustrates the reconstruction of the section of a sphere
from 24 views by the MART. For visual demonstration, reconstructions are presented as
surface plots.
It should be noted that, in the case of the AART, it is also appropriate to use the a priori
information that the reconstructed object function is non-negative. To this end, all negative
elements in the solution approximation are replaced by zeros at each iteration. In the case of
the MART, this is not needed because the algorithm works with a priori positive values.
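The multilevel correction factors of steps (a)-(d) and relation (18) lend themselves to a compact sketch; the per-source reconstructions, the threshold sequence and the weight assigned to each intermediate level are assumptions of this illustration rather than the chapter's exact recipe.

```python
import numpy as np

def correction_factors(per_source_images, eps_levels):
    """Steps (a)-(d): build a priori correction factors {w_kl} (schematic).

    per_source_images : list of 2-D arrays, one reconstruction per source
    eps_levels        : ascending relative thresholds, e.g. np.linspace(0.02, 0.15, 8)
    """
    # (a)-(c): element-wise minimum over all single-source reconstructions
    f_last = per_source_images[0]
    for f_i in per_source_images[1:]:
        f_last = np.minimum(f_last, f_i)

    f_max = f_last.max()
    w = np.ones(f_last.shape)                       # default: keep the cell (w = 1)
    w[f_last < eps_levels[0] * f_max] = 0.0         # structure-free zone
    # intermediate multilevel regions: weight grows with the threshold level
    for m in range(len(eps_levels) - 1):
        mask = (f_last >= eps_levels[m] * f_max) & (f_last < eps_levels[m + 1] * f_max)
        w[mask] = eps_levels[m] / eps_levels[-1]
    return w
```

For the AART, the non-negativity constraint mentioned above amounts to replacing f with np.maximum(f, 0.0) after each iteration.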
Two test objects are used:
(a) a foam plastic cylinder 6 cm in diameter with periodic spatial structures in the form of
rows of coaxial thin steel rods whose diameters are 1.5, 2.5, 5 and 8 mm, and
(b) an iron sphere 4.8 cm in diameter with numerous internal damages from shock compression.
X-ray projections are detected with a simple experimental setup (Figure 11 (a)). The
radiation source is a pocket-size betatron with a small focal spot (about 1 mm) and a
relatively small effective energy of the photon spectrum (about 2 MeV). The recording
system combines a luminescent amplifying screen and an X-ray film. The object is placed
between the source and the recording system so as to ensure that the film fully covers the
object's shadow. To determine the parameters of the characteristic curve of the recording system
(photometric density versus exposure), we register the image of a step lead wedge together with
the object, as shown in Figure 11. Distances between the source and the object and between the
source and the recording system are, respectively, 150 and 220 cm for the cylinder with
periodic structures and 120 and 180 cm for the shocked sphere.
1 - Recording system
2 - Wedge
3 - Test object
4 - Radiation source
Figure 11. Experimental setup (a) and X-ray photograph of the shocked iron sphere (b)
To collect information, each film with the X-ray image is scanned using a laser scanner with
a small focal spot. Digital data collected are converted from scanner counts into film
exposures with a technique (Kozlovskii, 2006) developed and experimentally adjusted at the
Russian Federal Nuclear Center (Zababakhin Institute of Applied Physics). The technique is
based on the approximation of the characteristic curve by the relation
I = I_0 + I_{max} \exp\!\left( -a\,(b - \lg H)^{c} \right),    (19)

where I is the photometric density, H is the exposure, I_0 is a parameter which
characterizes the density of film fogging, I_max is a parameter which characterizes the
maximum density the film permits, a and c are inclination and shape parameters, and b
is a parameter which defines the sensitivity of the recording system. The characteristic curve
parameters I_0, I_max, a, b and c are found by solving the optimization problem for the
objective function

\left[ \sum_{i=1}^{Z} \left( I_i - I_i^{meas} \right)^2 \right]^{1/2} \to \min,    (20)
where I_i is the photometric density calculated by expression (19) for the i-th step of the wedge,
I_i^{meas} is the experimental density found from the image of the step wedge (Figure 11(b)), and
Z is the number of steps on the wedge.
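Purely as an illustration of the fit described by (19) and (20), and assuming the functional form of (19) as reconstructed above, the characteristic-curve parameters can be estimated with a generic least-squares routine; the initial guess and the use of scipy.optimize.least_squares are arbitrary choices of this sketch, not the authors' procedure.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_characteristic_curve(lgH_steps, I_meas, p0=(0.2, 3.0, 1.0, 2.0, 1.5)):
    """Fit I = I0 + Imax * exp(-a * (b - lg H)**c) to step-wedge measurements.

    lgH_steps : known log-exposures of the Z wedge steps
    I_meas    : measured photometric densities for those steps
    p0        : initial guess for (I0, Imax, a, b, c)
    """
    def model(p, lgH):
        I0, Imax, a, b, c = p
        # abs() guards against negative bases for non-integer c; the exact
        # domain handling is not specified in the text
        return I0 + Imax * np.exp(-a * np.abs(b - lgH) ** c)

    def residuals(p):
        return model(p, lgH_steps) - I_meas       # objective (20) is their L2 norm

    sol = least_squares(residuals, p0)
    return sol.x
```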
Figure 12. Tomograms of a cross section of the cylinder with periodic structures
reconstructed by the MART and the AART from 12, 8, 6, and 4 views
Figure 13. A photograph of the middle section of the sphere (a) and its reconstructions by
the modified MART from 24 (b), 12 (c), and 8 (d) views
We assume that each X-ray in the conic beam is detected by a conventional receiver whose
aperture is larger than the size of one cell of the digitized X-ray photograph. It is appropriate
to take the aperture to be equal to the intrinsic resolution of the recording system. So, in
order to calculate projections, we must average the exposures H over aperture areas.
Projections are calculated as

g = -\log(H / H_0),    (21)
The quality of the images reconstructed by the MART is a bit better than that of the images
reconstructed by the AART. For the images shown in Figure 12, the visual resolution limit
seems to be close to 1.5 mm because the row of 1.5-mm-diam rods is clearly seen in the upper
images (MART, 12 and 8 views; AART, 12 views) and hardly distinguishable in the others.
The quantitative analysis of spatial resolution is given in Section 3.3.
Figure 13 shows the tomograms of the middle section of the shocked sphere reconstructed by
the modified MART in comparison with a photo taken after the sphere was cut with an
electroerosion machine. It is seen from Figure 13 (a) and (b) that 24 views allow a quite
accurate reproduction of the fine fracture pattern (characterizing the reproduction of
high-frequency structures) to be obtained. The images in Figure 13 (c) and (d) reproduce the
fracture pattern well on the whole, but small details are reproduced much worse compared with
Figure 13(b).
The tomograms presented in Figure 13 qualitatively differ from those in Figure 12: the intensity
of the spatial structures drops toward the center of the sphere in the reconstruction, i.e. the
structures in the center are reproduced less intensively than the structures near its boundary.
This is caused by the effect of beam hardening (Kak & Slaney, 1988), which distinctly manifests
itself in the reconstruction of strongly absorbing objects. This proves that the tomograms need
post-processing.
3.2 Reconstruction of optical inhomogeneities embedded in strongly scattering media
from model diffusion projections
To demonstrate the efficiency of the modified algebraic algorithms for one-step reconstruction
of diffuse optical tomograms, we conduct a numerical experiment in which we simulate
scattering objects with absorbing inhomogeneities and calculate diffusion projections. Four
scattering objects 11 x 8 cm^2 in size (Figure 3) are considered. The light velocity in the media
and the diffusion and absorption coefficients are 0.0214 cm/ps, 0.066 cm, and 0.05 cm^{-1},
respectively. Each object has two circular inhomogeneities of identical diameter; they are
located near the center at a distance of one diameter from each other. The diameters of the
inhomogeneities in the different objects are 1.2, 1.0, 0.8 and 0.6 cm. The inhomogeneity
absorption coefficient is equal to 0.075 cm^{-1}. To simulate the diffusion projections, we solve
the time-dependent diffusion
equation with the instantaneous point source
\frac{1}{v}\,\frac{\partial \varphi(\mathbf{r},\tau)}{\partial \tau} - K \nabla^2 \varphi(\mathbf{r},\tau) + \left[ \mu_a + \delta\mu_a(\mathbf{r}) \right] \varphi(\mathbf{r},\tau) = \delta(\mathbf{r} - \mathbf{r}_s, \tau)    (22)
for the photon density \varphi(\mathbf{r},\tau) by the finite element method. The signal of the receiver is
found with the formula

J(\mathbf{r}_d, t) = -K \left. \frac{\partial \varphi(\mathbf{r},\tau)}{\partial n} \right|_{\mathbf{r}=\mathbf{r}_d,\ \tau=t}    (23)

where \partial/\partial n is the derivative in the direction of the outer normal to the boundary of the
object at the receiver point \mathbf{r} = \mathbf{r}_d. Accordingly, the diffusion projection g(t) is found as the
logarithm of the ratio of the non-perturbed signal J_0(t), calculated for the homogeneous
medium, to the signal J(t) perturbed due to the inhomogeneities.
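The chapter solves (22) with the finite element method; the sketch below instead time-steps a 2-D version of (22) with a simple explicit finite-difference scheme, purely to illustrate how model receiver signals and the diffusion projection g(t) = log(J_0(t)/J(t)) can be produced. Grid size, boundary handling and the receiver read-out are arbitrary simplifications of this sketch.

```python
import numpy as np

def diffuse(mu_a_map, v=0.0214, K=0.066, h=0.1, dt=0.05, n_steps=400, src=(10, 0)):
    """Explicit time stepping of (1/v) dphi/dtau = K lap(phi) - mu_a(r) phi + source.

    mu_a_map : 2-D array of absorption coefficients [1/cm]
    v, K     : light velocity [cm/ps] and diffusion coefficient [cm]
    h, dt    : grid spacing [cm] and time step [ps]
    Returns the final photon density and a crude boundary signal history.
    """
    ny, nx = mu_a_map.shape
    phi = np.zeros((ny, nx))
    phi[src] = 1.0 / h**2                       # instantaneous point source at tau = 0
    signal = []
    for _ in range(n_steps):
        # np.roll gives periodic boundaries - a simplification of the real boundary condition
        lap = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4 * phi) / h**2
        phi = phi + v * dt * (K * lap - mu_a_map * phi)
        signal.append(K * phi[-2, nx // 2] / h)  # crude stand-in for the normal derivative (23)
    return phi, np.array(signal)

# diffusion projection: g = np.log(J0 / J), with J0 computed from a homogeneous
# mu_a map and J from the map containing the inhomogeneities
```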
Figure 14. Reconstructions by the MART and the AART of the objects with inhomogeneity
diameters 1.2, 1.0, 0.8, and 0.6 cm
Figure 15. Reconstructions of the object with 0.8-cm-diam inhomogeneities for measurement
ratios 16 x 16, 8 x 16, and 4 x 16
                              Reconstruction technique
                              MART          AART
FVCT    12 views              1.0           1.5
        8 views               1.2           1.6
        6 views               1.4           2.5
        4 views               1.5           2.6
DOT     32 x 16               7.0 (6.0)     8.6
        16 x 16               8.1 (6.4)     10.0
        8 x 16                8.2 (6.8)     10.1
        4 x 16                9.0 (7.0)     12.6
Table 1. Estimated spatial resolution limit (in millimeters) for FVCT and DOT
Figure 16. FVCT: MTFs for the MART (red lines) and the AART (black lines) for (a) 12, (b) 8,
(c) 6, and (d) 4 views; abscissa: spatial frequency, cycles/cm
Figure 17. DOT: MTFs for the MART (red lines), the AART (black lines), and the MART after
restoration (blue lines) for (a) 32 x 16, (b) 16 x 16, (c) 8 x 16, and (d) 4 x 16 sources and
receivers; abscissa: spatial frequency, cycles/cm
Analysis of the data presented in the Table suggests that the use of the modified MART in FVCT
helps approach the resolution of medical X-ray tomography, which uses the full set of views.
As for DOT, the resolution achieved by the PAT method with the modified MART is only slightly
worse than the resolution of tomograms reconstructed by multi-step reconstruction algorithms
(Arridge, 1999), and there is still hope of improving it through post-processing.
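The resolution estimates of Table 1 are obtained from MTF curves such as those in Figures 16 and 17 (Section 3.3); as a rough illustration, a sampled MTF can be turned into a resolution limit as follows, where the 0.1 modulation threshold and the half-period convention are assumptions of this sketch, not values from the chapter.

```python
import numpy as np

def resolution_limit_mm(freqs_cpcm, mtf, threshold=0.1):
    """Resolution limit (mm) from a sampled MTF, read at a fixed modulation level.

    Assumes the MTF decreases monotonically with spatial frequency (cycles/cm).
    """
    # invert the monotone curve: frequency at which the MTF equals `threshold`
    f_cut = np.interp(threshold, mtf[::-1], freqs_cpcm[::-1])
    # half-period convention: one resolvable line pair spans 1 / f_cut cm
    return 10.0 / (2.0 * f_cut)
```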
4. Post-processing of tomograms
4.1 Space-varying restoration
As mentioned in Section 3.3, a strict description of the visualization system, both in FVCT
and in DOT, can only be made with a spatially variant blur model. In FVCT, the spatial variance
can, to a rough approximation, be neglected because the size of the object is small compared to
the source-object and object-receiver distances. In DOT, the strong dependence of structure
reconstruction accuracy on structure localization follows directly from expressions (8) and
(10), which characterize the theoretical limit of spatial resolution. The theoretical resolution
tends to zero near the boundaries. In the center, the resolution is worst and depends on the
degree of data incompleteness (Table 1).
The traditional approach (Fish et al., 1996) to the restoration of images degraded by spatially
variant blur is based on the assumption that the blur is approximately spatially invariant in
small regions of the image. Each such region is restored with its own spatially invariant
point spread function (PSF), and the results are then sewn together to obtain the full true image.
This approach produces blocking artifacts at the region boundaries, which have to be removed
by some means or other. In this chapter we restore diffusion tomograms using the blur
model of Nagy et al. (2004). In accordance with this model, the image is divided into a
number of regions where the PSF is approximately spatially invariant. However, instead of
deblurring each region separately and then combining the restoration results, the method
interpolates the individual invariant PSFs and restores the entire image. The discrete restoration
problem for a tomogram with blur f is described by a system of linear algebraic equations

f = Q z,    (24)
(c) A tomogram of the PSF is reconstructed from the obtained model projections by the modified
algebraic techniques described above.
For the inversion of system (24), we selected the iterative residual norm steepest descent
algorithm (Kaufman, 1993), which converges rather fast and exhibits semi-convergence with
respect to the relative error \| z^{(s)} - z \| / \| z \|, where z^{(s)} is the approximation to z at the
s-th iteration. This is of great importance for obtaining a regularized solution. Here we omit the
details of the procedure used to restore diffuse optical tomograms reconstructed by the PAT
method; they can be found in (Konovalov et al., 2007).
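A minimal sketch of a residual-norm steepest descent iteration for the system f = Qz, assuming Q is available as a matrix or sparse operator; stopping after a small, fixed number of iterations is what exploits the semi-convergence mentioned above. This is an illustration, not the restoration code of (Konovalov et al., 2007).

```python
import numpy as np

def rnsd(Q, f, n_iter=50, z0=None):
    """Residual-norm steepest descent for min ||Q z - f||^2.

    Early stopping (small n_iter) exploits semi-convergence: the relative
    error to the true image first decreases and then grows again.
    """
    z = np.zeros(Q.shape[1]) if z0 is None else z0.copy()
    for _ in range(n_iter):
        r = f - Q @ z                 # residual
        d = Q.T @ r                   # steepest-descent direction for the residual norm
        Qd = Q @ d
        denom = Qd @ Qd
        if denom == 0:
            break
        alpha = (d @ d) / denom       # exact line search along d
        z = z + alpha * d
    return z
```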
Figure 19. Results of application of linear (left) and nonlinear (right) palette to the images
presented in Figure 13(b), (c), and (d)
To flatten the image intensity in order to compensate for the beam hardening effect, we use a
simplified approach based on color interpretation methods. We consider methods for creating
nonlinear color palettes and nonlinear statistical and analytical functions of correspondence
between image intensity and color space. A detailed description of the algorithms is given in
(Mogilenskikh, 2000). This chapter is mainly focused on illustrative examples of their
application.
The color palette is an ordered set of colors from the color space in which each color
corresponds to its own ordinal number. If the palette is nonlinear, the set of colors forms a
curvilinear trajectory in the color space. For image visualization with the use of the color
palette, we should form a law of correspondence between the image intensities in the cells and
the colors (hereafter, the correspondence function, CF). The argument of such a function is the
image intensity, and its value is the color or the color index in the palette. The linear CF is
usually applied. Figure 19 shows the result of applying the linear black-and-white palette and
the linear CF (left), as well as a nonlinear palette including four basic colors (blue, yellow,
red, and green) and the linear CF (right), to the tomograms given in Figure 13. It is seen that
the fracture area is more obviously revealed in the second case.
However, the linear CF does not always allow data interpretation to be informative enough.
To enhance the image informativeness, we use nonlinear statistical and analytical CFs. The
algorithm for creating the nonlinear statistical CF can be briefly described by the following
sequence of steps.
(a) Form the linear CF and put the color G(f_kl) in correspondence with the image intensity f_kl
in the (k, l)-cell.
(b) Compute the weight

W_G(f_{kl}) = N_{col}\, \mathrm{norm}_G \left[ \frac{N^{cells}_G(f_{kl}) + 1}{N^{cells}} \right],    (25)

where N_col is the number of colors in the palette, N^cells is the full number of image cells, and
N^cells_G(f_kl) is the number of cells with the color G(f_kl).
(c) Accumulate the weights into the statistical CF.    (26)
(d) Form the nonlinear CF through addition of the statistical CF (step (c)) and the linear CF
(step (a)).
The left column of images in Figure 20 demonstrates an example of the application of such a
nonlinear CF to the tomograms given in Figure 13. In this way it is possible to automatically
distinguish the informative contours of fractures and simultaneously preserve the intensity
shades inside the image.
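Since (25) and (26) are only partly recoverable here, the following sketch illustrates the general idea of the statistical CF as read from steps (a)-(d): a histogram-based weight per intensity level is accumulated into a colour-index mapping and blended with the linear CF. The normalisation and the blending weight are assumptions of this illustration.

```python
import numpy as np

def statistical_cf(image, n_colors=256, blend=0.5):
    """Map image intensities to palette indices with a statistical CF (schematic).

    blend : weight of the statistical CF when it is added to the linear CF (step (d)).
    """
    img = np.asarray(image, dtype=float)
    rng = float(img.max() - img.min()) or 1.0                 # avoid division by zero
    levels = ((img - img.min()) / rng * (n_colors - 1)).astype(int)

    hist = np.bincount(levels.ravel(), minlength=n_colors)    # cells per intensity level
    weights = (hist + 1) / levels.size                         # cf. the weight in (25)
    stat_cf = np.cumsum(weights)                               # statistical CF (step (c))
    stat_cf = (n_colors - 1) * stat_cf / stat_cf[-1]

    linear_cf = np.arange(n_colors, dtype=float)               # linear CF (step (a))
    cf = (1.0 - blend) * linear_cf + blend * stat_cf           # nonlinear CF (step (d))
    return cf[levels].astype(int)                              # palette index per cell
```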
The essence of the analytical CF is in applying nonlinear color coordinate scales to attain the
correspondence between the color and the intensity in the cells. Elementary functions and
their algebraic combinations are used for this. The right column of images in Figure 20 shows
the result of applying the exponential CF G(f) = exp(60 f) to the images given on the left of
Figure 19. The type of the function is selected on the basis of a priori information on the
homogeneity of the high-density structures of the object, which helps to present the internal
fracture pattern in a palette of two colors, black and white. This allows the informative
regions of cracks and their boundaries to be strongly distinguished.
Figure 20. Results of application of the statistical (left) and analytical (right) CF to the images
presented in Figures 13 and 19 (on the left), respectively
Figure 21. The processed photograph of sphere section (a) and the results of binary
operations with processed tomograms reconstructed from 24 (b), 12 (c), and 8 (d) views
To estimate the accuracy of fracture pattern reproduction, we compare the results of
tomogram post-processing with the etalon image obtained through processing of the photo
presented in Figure 13(a). For comparison, a variety of methods based on binary operations
and visualization algorithms (Mogilenskikh & Pavlov, 2002; Mogilenskikh, 2003) can be
used. In our case, processing of the photo includes the construction of same-level isolines,
clearing of the half-tones between the isolines, and filling of the isoline-bounded areas with
black (Figure 21(a)). The processed photo is superimposed onto the processed tomograms given
in Figure 20 on the right. As a result of the binary operations, one obtains the three-tone
images presented in Figure 21(b), (c), and (d), where gray marks the difference, and black and
white mark the coincidence. The ratios of the gray area to the black area of the etalon image
are equal to 0.03, 0.19, and 0.28, respectively. These quantitative estimates and the visual
analysis of Figure 21 show that the accuracy of reproduction of the fine fracture pattern seems
to be unsatisfactory for the reconstructions from 12 (Figure 21(c)) and
8 views (Figure 21(d)). This conclusion is in agreement with the results of Table 1, which
show that the spatial resolution limit is worse than 1.0 mm when the number of views does
not exceed 12.
The methods for creating the nonlinear CF are also efficient in the case of diffusion tomogram
post-processing. The space-varying restoration of the tomograms obviously improves them but
still reproduces an incomplete profile of the inhomogeneities. As follows from Figure 22,
nonlinear-CF-based processing of the restored tomogram of the scattering object with
0.8-cm-diam inhomogeneities makes it possible to approach the flat region of the true profile.
Figure 22. MART tomogram with 0.8-cm-diam inhomogeneities and its profile after
restoration (top), application of the exponential CF G(f) = exp(14 f) (center), and application
of the nonlinear statistical CF (bottom)
5. Conclusion
In this chapter we consider two examples of algebraic reconstruction in incomplete data
computed tomography: few-view X-ray computed tomography and one-step diffuse optical
tomography. The multiplicative algebraic reconstruction technique, which optimizes the
entropy, allows better-quality tomograms to be obtained. It is shown that, to enhance the
convergence of the iterative reconstruction procedure and to minimize the artifact level in the
tomograms, the conventional formulas of solution correction should be modified. The presented
reconstruction results demonstrate the efficiency of the following modifications:
(a) To calculate the weight matrix, we use not lines but narrow strips, which provide
sufficient filling of the reconstruction area.
(b) We take into account the non-uniformity of the distributions of the weight sums and the
solution correction numbers over the image elements.
(c) We calculate correction factors which account for a priori information on the non-negativity
of the solution and on the presence of structure-free zones in the reconstruction area.
To increase the accuracy of spatial structure reproduction under conditions of strong data
incompleteness, it is advisable to post-process the reconstructed tomograms with the use of
a priori information about the object. We demonstrate the efficiency of the methods of
space-varying restoration and of post-processing with a nonlinear palette and a nonlinear
function of correspondence between the palette color and the image intensity in the cells. As a
result, we obtain a reproduction quality close to that of medical tomograms in the case of
few-view tomography, and close to the quality of diffusion tomograms reconstructed by
well-designed multi-step algorithms in the case of diffuse optical tomography.
In conclusion it should be noted that, for the calculations, we use a rather slow software
environment (MATLAB) and a Windows XP Intel PC with a 1.7-GHz Pentium 4 processor and
256 MB of RAM. The computational time of the reconstruction-restoration procedure for diffuse
optical tomograms is 1.5-2.5 minutes. These figures are better than those for multi-step
reconstruction, but they do not satisfy the requirements of real-time medical examinations. In
the future, it would be interesting to optimize the duration of diffuse optical image production.
The implementation of a spatially variant blur model in few-view computed tomography is
also the subject of our short-term interest.
6. Acknowledgements
The authors would like to thank S. P. Antipinskii, E. A. Averyaskin, S. A. Brichikov, V. V.
Fedorov, D. M. Gorbachev, S. V. Kolchugin, E. V. Kovalev, E. A. Kozlov, V. M. Kryukov,
I. V. Matveenko, A. V. Mikhaylov, L. A. Panchenko, V. N. Povyshev, G. N. Rykovanov, V. V.
Smirnov, T. V. Stavrietskaya, V. I. Stavrietskii, T. A. Strizhenok, A. B. Zalozhenkov, and
M. N. Zakharov for collaboration in X-ray radiography and few-view computed
tomography. The authors also thank A. G. Kalintsev, O. V. Kravtsenyuk, V. V. Lyubimov,
A. G. Murzin, and L. N. Soms whose contribution to theory of the photon average
trajectories method cannot be overemphasized.
7. References
Alvarez, R. E. & Macovski, A. (1976). Energy-selective reconstruction in X-ray computerized
tomography. Physics in Medicine & Biology, Vol. 21, No. 5, September 1976, pp. 733-744.
Arridge, S. R. (1999). Optical tomography in medical imaging. Inverse Problems, Vol. 15, No. 2,
April 1999, pp. R41-R93.
Brooks, R. A. & Di Chiro, G. (1976). Beam hardening in X-ray reconstructive tomography.
Physics in Medicine & Biology, Vol. 21, No. 3, May 1976, pp. 390-398.
Devaney, A. J. (1983). A computer simulation study of diffraction tomography. IEEE
Transaction on Biomedical Engineering, Vol. 30, No. 7, July 1983, pp. 377-386.
Elbakri, I. A. & Fessler, J. A. (2002). Statistical image reconstruction for polyenergetic X-ray
computed tomography. IEEE Transactions on Medical Imaging, Vol. 21, No. 2, February
2002, pp. 89-99.
Fish, D. A.; Grochmalicki, J. E. & Pike, R. (1996). Scanning singular-value-decomposition
method for restoration of images with space-variant blur. Journal of the Optical Society
of America A: Optics, Image Science & Vision, Vol. 13, No. 3, March 1996, pp. 464-469.
Hanson, K. M.; Bradbury, J. N; Cannon, T. M.; Hutson, R. L.; Laubacher, D. B.; Macek, R. J.;
Paciotti, M. A. & Taylor, C. A. (1981). Computed tomography using proton energy
loss. Physics in Medicine & Biology, Vol. 26, No. 6, November 1981, pp. 965-983.
Hanson, K. M.; Bradbury, J. N; Koeppe, R. A.; Macek, R. J.; Machen, D. R.; Morgado, R.;
Paciotti, M. A.; Sandford, S. A. & Steward, V. W. (1982). Proton computed
tomography of human specimens. Physics in Medicine & Biology, Vol. 27, No. 1,
January 1982, pp. 25-36.
Hawrysz, D. J. & Sevick-Muraca, E. M. (2000). Developments toward diagnostic breast cancer
imaging using near-infrared optical measurements and fluorescent contrast agents.
Neoplasia, Vol. 2, No. 5, September 2000, pp. 388-417.
Herman, G. T. (1980). Image Reconstruction from Projections: The Fundamentals of Computerized
Tomography, Academic, New York.
Kak, A. C. & Slaney, M. (1988). Principles of Computerized Tomographic Imaging, IEEE Press,
New York.
Kamm, J. & Nagy, J. G. (1998). Kronecker product and SVD approximation in image
restoration. Linear Algebra & Its Applications, Vol. 284, No. 1-3, November 1998,
pp. 177-192.
Kaufman, L. (1993). Maximum likelihood, least squares, and penalized least squares for PET.
IEEE Transactions on Medical Imaging, Vol. 12, No. 2, February 1993, pp. 200-214.
Konovalov, A. B.; Volegov, P. L.; Kochegarova, L. P. & Dmitrakov, Yu. L. (1999).
Determination of component concentrations in mixtures of organic liquids using a
computer tomograph. Journal of Analytical Chemistry, Vol. 54, No. 4, April 1999,
pp. 315-319.
Konovalov, A. B.; Volegov, P. L. & Dmitrakov, Yu. L. (2000). A simple method for CT-scanner
calibration against effective photon energy. Instruments & Experimental Techniques,
Vol. 43, No. 3, May 2000, pp. 398-402.
Konovalov, A. B.; Lyubimov, V. V.; Kutuzov, I. I.; Kravtsenyuk, O. V.; Murzin, A. G.;
Mordvinov, G. B.; Soms, L. N. & Yavorskaya, L. M. (2003). Application of the
transform algorithms to high-resolution image reconstruction in optical diffusion
tomography of strongly scattering media. Journal of Electronic Imaging, Vol. 12, No. 4,
October 2003, pp. 602-612.
Konovalov, A. B.; Kiselev, A. N. & Vlasov, V. V. (2006a). Spatial resolution of few-view
computed tomography using algebraic reconstruction techniques. Pattern Recognition
& Image Analysis, Vol. 16, No. 2, April 2006, pp. 249-255.
Konovalov, A. B.; Vlasov, V. V.; Kalintsev, A. G.; Kravtsenyuk, O. V. & Lyubimov, V. V.
(2006b). Time-domain diffuse optical tomography using analytic statistical
characteristics of photon trajectories. Quantum Electronics, Vol. 36, No. 11, November
2006, pp. 1048-1055.
Konovalov, A. B.; Vlasov, V. V.; Kravtsenyuk, O. V. & Lyubimov, V. V. (2007). Space-varying
iterative restoration of diffuse optical tomograms reconstructed by the photon
average trajectories method. EURASIP Journal on Advances in Signal Processing,
Vol. 2007, No. 1, January 2007, ID 34747.
Kozlovskii, V. N. (2006). Information in Pulsed Radiography, RFNCVNIITF publisher, Snezhinsk
(in Russian).
Levine, R. D. & Tribus, M. (1978). The Maximum Entropy Formalism, MIT, Cambridge, MA.
Lyubimov, V. V.; Kalintsev, A. G.; Konovalov, A. B.; Lyamtsev, O. V.; Kravtsenyuk, O.V.;
Murzin, A. G.; Golubkina, O. V.; Mordvinov, G. B.; Soms, L. N. & Yavorskaya, L. M.
(2002). Application of photon average trajectories method to real-time reconstruction
of tissue inhomogeneities in diffuse optical tomography of strongly scattering media.
Physics in Medicine & Biology, Vol. 47, No. 12, June 2002, pp. 2109-2128.
Lyubimov, V. V.; Kravtsenyuk, O. V.; Kalintsev, A. G.; Murzin, A. G.; Soms, L. N.; Konovalov,
A. B.; Kutuzov, I. I.; Golubkina, O. V. & Yavorskaya, L. M. (2003). The possibility of
increasing the spatial resolution in diffusion optical tomography. Journal of Optical
Technology, Vol. 70, No. 10, October 2003, pp. 715-720.
McDavid, W. D.; Waggener, R. G.; Payne, W. H. & Dennis, M. J. (1977). Correction of spectral
artifacts in cross-sectional reconstruction from X-rays. Medical Physics, Vol. 4, No. 1,
January 1977, pp. 54-57.
Mogilenskikh, D. V. (2000). Nonlinear color interpretation of physical processes, Proceedings of
International Conference on Computer Graphics Graphicon2000, pp. 201-211, Moscow,
August-September 2000, Moscow State University publisher, Moscow.
Mogilenskikh, D. V. & Pavlov, I. V. (2002). Color interpolation algorithms in visualizing results
of numerical simulations, In: Visualization and Imaging in Transport Phenomena,
Sideman, S. & Landesberg, A. (Eds.), Annals of the New York Academy of Sciences,
Vol. 972, Part. 1, pp. 43-52, New York Academy of Sciences, New York.
Mogilenskikh, D. V. (2003). CONTOUR algorithm for finding and visualizing flat sections of
3D-objects, In: Computer Science and Its Applications, Kumar, V. et al. (Eds.), Lecture
Notes in Computer Science, Vol. 2669, pp. 407-417, Springer-Verlag, Berlin/Heidelberg.
Nagy, J. G.; Palmer, K. & Perrone, L. (2004). Iterative methods for image deblurring: a Matlab
object oriented approach. Numerical Algorithms, Vol. 36, No. 1, May 2004, pp. 73-93.
Ng, M. K.; Chan, R. H. & Tang, W.-C. (1999). A fast algorithm for deblurring models with
Neumann boundary conditions. SIAM Journal on Scientific Computing, Vol. 21, No. 3,
November-December 1999, pp. 851-866.
Palamodov, V. P. (1990). Some singular problems in tomography, In: Mathematical Problems of
Tomography, Gelfand, I. M. et al. (Eds.), Transactions of Mathematical Monographs,
Vol. 81, pp. 123-139, American Mathematical Society, Providence, R. I.
Papoulis, A. (1968). Systems and Transforms with Applications in Optics, McGraw-Hill, New
York.
Pickalov, V. V. & Melnikova, T. S. (1995). Plasma Tomography, Nauka, Novosibirsk (in Russian).
Subbarao, P. M. V.; Munshi, P. & Muralidhar, K. (1997). Performance of iterative tomographic
algorithms applied to non-destructive evaluation with limited data. Nondestructive
Testing & Evaluation International, Vol. 30, No. 6, December 1997, pp. 359-370.
Volkonskii, V. B.; Kravtsenyuk, O. V.; Lyubimov, V. V.; Mironov, E. P. & Murzin, A. G. (1999).
The use of the statistical characteristics of the photons trajectories for the tomographic
studies of the optical macroheterogeneities in strongly scattering objects. Optics &
Spectroscopy, Vol. 86, No. 2, February 1999, pp. 253-260.
Yan, C. H.; Whalen, R. T.; Beaupre, G. S.; Yen, S. Y. & Napel, S. (2000). Reconstruction
algorithm for polychromatic CT imaging: application to beam hardening correction.
IEEE Transactions on Medical Imaging, Vol. 19, No. 1, January 2000, pp. 1-11.
Yodh, A. & Chance, B. (1995). Spectroscopy and imaging with diffusing light. Physics Today,
Vol. 48, No. 3, March 1995, pp. 34-40.
27
AMR Vision System for Perception, Job
Detection and Identification in Manufacturing
Sarbari Datta and Ranjit Ray
Robotics and Automation Group, Central Mechanical Engineering Research Institute
India
1. Introduction
Autonomous mobile robots are becoming an integral part of flexible manufacturing systems,
especially for material transport, cleaning and assembly purposes. The advantage of this type
of robot is that the existing manufacturing environment need not be altered or modified, as in
the case of conventional AGVs, where permanent cable layouts or markers are required for
navigation. These robots are also used extensively for survey, inspection, surveillance, bomb
and mine disposal, underwater inspection and space robotics. For autonomous navigation,
proprioceptive and exteroceptive sensors are mounted on these mobile robots. Since
proprioceptive sensors measure the kinematic state of the robot, they accrue error over time;
they are therefore supplemented by exteroceptive sensors such as ultrasonic and laser range
finders, cameras and global positioning systems, which provide the knowledge of the local
environment that the robot subsequently uses to navigate. Here we describe the vision system
of the first indigenous autonomous mobile robot, AMR, with a manipulator, for environment
perception during navigation and for the job detection and identification required for material
handling in a manufacturing environment.
1.1 Autonomous Mobile Robot System (AMR)
The ultimate goal of research on autonomous navigation of mobile robots is to endow these
robots with some practical intelligence so that they can relieve or replace human operators in
tedious and repetitive tasks, and for this reason manufacturing is one area where mobile robots
are becoming a necessity.
Among on-going research on autonomous mobile robots for applications related to
manufacturing, the University of Massachusetts Amherst is developing a mobile robot with a
comprehensive suite of sensors that includes an LRF and vision along with a dexterous
manipulator, as mobility extends the workspace of the manipulator and poses new challenges
by permitting the robot to operate in unstructured environments (Katz et al., 2006).
Bundeswehr University Munich is developing vision-guided intelligent robots for automated
manufacturing, materials handling and services, where the vision-guided mobile robots
ATHENE I and II navigate in structured environments based on the recognition of the current
situation and a calibration-free manipulator handles various objects using a stereo-vision
system (Bischoff & Graefe, 1998).
In this country, mobile robots are being developed in some research institutes in
collaboration with academic institutes and private sectors. One such mobile robot is
SmartNav built by Zenn Systems, Ahmedabad in collaboration with IIT, Kanpur and BARC
(Sen et al., 2004). Our mobile robot with manipulator, AMR, is especially tailored for
material handling and transport in a manufacturing environment. The vehicle navigates
autonomously and transports jobs and tools from one workstation to another workstation.
Figure 1 shows the AMR with all the mounted sensors. Among the sensors, a stereovision
camera is mounted in front of AMR for environment perception. Another CMOS camera
mounted on the wrist of the manipulator is used for material detection and identification
required for pick-and-place operations. Laser and sonar range finders are used for
localization through map building and for obstacle avoidance, respectively, during
navigation (Datta et al., 2006). The AMR stands on a distributed architecture for performing
various tasks without any perceptible delay and for safeguarding the total system against the
major failures that may occur when the total burden rests on a single point of operation
(Datta et al., 2007).
2. AMR Perception
2.1 Prior Art
Color image transmission from the robot while navigating in its workspace has become
very important in the field of mobile robotics, not only for localization by feature
identification but also for monitoring of the robot environment through reconstructed
images at multiple points in the robot work area. The robot work area can be a huge shop
floor or a warehouse encompassing an area of about 200 meters, for effective control from a
remote host through a single WLAN Access Point.
Figure 1. AMR
Traditional transfer of bitmap images is quite cumbersome: a large image transferred over
the Ethernet can take several seconds. To alleviate this problem, progressive image
transmission schemes are used, in which image fidelity, taking advantage of such popular
image file formats as JPEG (Joint Photographic Experts Group) and GIF (Graphics Interchange
Format), is gradually built up so that the viewer can see an approximated image as a whole
without the need to wait for all the data to be received (Tong & Zhang, 1998). But the gradual
building up of an image in a constantly changing environment becomes a hindrance for
mobile robot perception, as high-speed image transmission is absolutely necessary while
navigating in order to capture the changing scenario.
Similarly, several methods exist for reconstruction of transmitted data. Two approaches that
provide robust image transmission through reconstruction are decoder-based adaptive
reconstruction and reconstruction-optimized source coding (Hemami, 1995).
Among decoder-based adaptive reconstruction, Smoothing Criterion Reconstruction (SCR),
an adaptive algorithm, is designed to exploit the characteristics of the compressed visual
information, which reconstructs the lost information of the image using image
characteristics such as spatial and temporal correlation (Hemami & Meng, 1995). As such,
SCR generally requires extensive computation power, which thwarts the purpose of online
viewing of the robot environment through fast reconstruction during navigation. Another
approach, Vector Quantized Linear Interpolation (VQLI) (Hemami & Gray, 1994), provides
reconstruction of equivalent visual quality with less than 10% transmission overhead.
Vector Quantization (VQ) is used at the encoder to set appropriate weights for image
compression, which is decoded for reconstruction. This approach provides reconstruction
capability without the extensive computational burden of the previous case, but restricts
coding of the image for transmission to a proprietary format.
In reconstruction-optimized source coding, a block-based source coding technique, the Lapped
Orthogonal Transform (LOT), is designed to maximize the reconstruction performance
(Hemami, 1996). Mean-reconstruction, in which a missing coefficient block is replaced with
the average of its available neighbors, is selected, and a reconstruction criterion is defined
for equal distribution of the reconstruction errors across all transform coefficients; a family
of LOTs is then designed to meet this criterion. The overall performance can be gauged by
considering both the coding gain and the reconstruction gain. Although the
reconstruction-optimized LOT family provides excellent reconstruction capability, the matrix
manipulation it requires is inconvenient for instant viewing of the robot environment through
fast reconstruction.
2.2 AMR Image Transmission Network
Reliable transmission and reception of images is imperative for mobile robot perception.
Most transmission schemes take advantage of popular JPEG or GIF image format as
responsiveness gained from rapid image transmission is more important than perfect image
fidelity. Robustness is therefore vital for rapid image transmission and reconstruction in a
mobile robot network. Hence, we also take advantage of the most popular and widely
supported JPEG image file format (Wallace, 1991; Schaefer, 2001) for transmitting full color
images frame by frame from the AMR to multiple clients, set at different monitoring points
within the AMR work area in a manufacturing environment, and for reconstructing these
images for viewing almost without any perceptible delay.
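The AMR implementation itself is built on MFC (CSocket/CSocketFile for writing the JPEG stream and COleLoadPicture with the IPicture interface on the viewer, as indicated in the transmission flow later in this section); purely to illustrate the frame-by-frame idea, the sketch below length-prefixes each JPEG frame and pushes it over a plain TCP socket. Host, port and framing are arbitrary choices of this illustration.

```python
import socket
import struct

def send_jpeg_frames(jpeg_frames, host="192.168.0.10", port=5005):
    """Send JPEG-encoded frames one by one, each prefixed with a 4-byte length.

    jpeg_frames : iterable of bytes objects, each a complete JPEG image
    host, port  : address of a viewer client (placeholder values)
    """
    with socket.create_connection((host, port)) as sock:
        for frame in jpeg_frames:
            sock.sendall(struct.pack("!I", len(frame)))   # length header
            sock.sendall(frame)                            # JPEG payload

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket (or return None on disconnect)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf

def recv_jpeg_frame(sock):
    """Viewer side: receive one length-prefixed JPEG frame."""
    header = _recv_exact(sock, 4)
    if header is None:
        return None
    (length,) = struct.unpack("!I", header)
    return _recv_exact(sock, length)
```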
A neighborhood around the highest peak of the histogram is chosen as the threshold region.
A novel variation of Otsu's method (Otsu, 1979) is proposed for faster online processing,
which chooses the optimal threshold by maximizing the between-class variance with a
heuristic search method for adaptive thresholding. Once the job parameters are calculated,
the gripper picks the job and puts it on the platform. In this way, the jobs are stacked on the
vehicle and are transported to the next workstation, where they are unloaded using the reverse
operation, and the AMR continues with its next mission.
(Flow diagram of the image transmission scheme: the image is rendered in JPEG format and
written to a CSocket using CSocketFile, transferred over Ethernet/WLAN, and loaded on the
viewer side with COleLoadPicture for display through the IPicture method)
many sign changes between adjacent d_ij, which is the motivation behind the use of sign
change criteria as a basis for measures of similarity between images.
But the most prevalent method for measure of similarity is the correlation measure. The
correlation between the template and the image window has been used as a measure of
similarity in template matching and image registration since the 1970s (Rosenfeld, 1969).
For digitized images {f} and {g} of size A, the normalized correlation coefficient (corr)
between {f} and {g} is defined as

corr(f, g) = \frac{E[(f - E[f])\,(g - E[g])]}{sd[f]\; sd[g]} = \frac{E[f g] - E[f]\, E[g]}{sd[f]\; sd[g]},

where E[x] is the expected value or mean of a data set {x} and sd[x] is the standard deviation
of {x}. The correlation coefficient takes a value in the range of -1.0 to +1.0, providing a
quantitative measure of similarity between two data sets.
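In array terms, the normalized correlation coefficient between a template and an equally sized image window can be computed directly from either of the two forms above; the small helper below is only a restatement of the definition, with variable names of our choosing.

```python
import numpy as np

def ncc(f, g):
    """Normalized correlation coefficient between two equally sized arrays."""
    f = np.asarray(f, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    num = np.mean(f * g) - f.mean() * g.mean()   # E[fg] - E[f]E[g]
    den = f.std() * g.std()                       # sd[f] * sd[g]
    return float(num / den) if den > 0 else 0.0
```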
Though the advantages of the correlation coefficient approach are its reliability and accuracy,
computing the correlation coefficient is extremely expensive. The calculation of
correlation coefficients for every possible search point during template matching is
extremely time consuming. Thus a search method with both high speed and accuracy is
required to make the correlation coefficient method computationally reasonable.
Among fast template matching techniques, bounded partial correlation (BPC), based on the
normalized cross-correlation (NCC) is used for finding global distortion minimum or
correlation maximum (Stefano & Mattoccia, 2003). It is an extension of successive
elimination algorithm (SEA) (Li & Salari, 1995) (Wang & Mersereau, 1999) and partial
distortion elimination (PDE) (Bei and Gray, 1985), which allow for notably speeding up the
computation required by an exhaustive-search template-matching process. Since BPC is a
data dependent optimization technique, the computational benefit depends on the image,
the template, the position of the template within the image, the correlation threshold, as well
as on whether or not one deploys information concerning the expected matching position.
(Yoshimura & Kanade, 1994) suggest using multi-resolution eigenimages for fast template
matching based on normalized correlation. This method allows both the location and the
orientation of the object in a scene to be detected accurately, at a faster rate than applying
conventional template matching to the rotated object.
Another existing template matching technique is the use of the sum-of-squared-differences
(SSD) measure to determine the best match. Unfortunately, this measure is sensitive to
outliers and is not robust to variations in the template, such as those that occur at occluding
boundaries in the image. To compensate for these drawbacks, techniques such as subpixel
localization, uncertainty estimation and optimal feature selection are used to obtain a robust
measure (Olson, 2000).
Another traditional technique for template matching using cross-correlation and an
exhaustive search is based on Fast Fourier Transform (FFT) operations, which can be used to
calculate the cross-correlation surface (Anuta, 1970). In order to use an FFT, the image
dimensions (N) must be powers of 2. Therefore it is necessary to pad the template with zeroes
in order to make it the same size as the image.
(Vanderburg & Rosenfeld, 1977), using the sum of absolute valued differences (SAVD), and
(Goshtasby et al., 1984), using cross-correlation, describe two-stage template matching for
reducing the computation required for template matching. This two-stage template matching
is refined to coarse-fine template matching (Rosenfeld & Vanderburg, 1977), where a
low-resolution template is applied in the first stage, followed by the full-resolution template
where the match threshold is exceeded. Another class of fast search algorithms is the three-step
search (Jain, 1981), which is widely used in motion estimation for digital video compression
and processing. In the first search step, a search step size of 4 pixels is used. Once an
optimal point is found, the step size is reduced to 2 pixels to evaluate the neighborhood of
this previously determined optimal point and choose the next search point. In the third step,
all the neighboring points of the second search point are evaluated to find the final
best-matched point. Certainly, this fast search method can speed up the search process, but
mismatches or suboptimal matches can occur.
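A schematic sketch of the three-step search just described, with the fixed 4/2/1-pixel schedule from the text; the boundary handling and tie-breaking are choices of this sketch, and any similarity score (for example the NCC helper above) can be plugged in.

```python
def three_step_search(image, template, start, score):
    """Three-step search around start = (row, col) with step sizes 4, 2 and 1 pixels.

    score : similarity function taking (window, template), e.g. the NCC above.
    """
    th, tw = template.shape
    best = tuple(start)
    for step in (4, 2, 1):
        candidates = [(best[0] + dr * step, best[1] + dc * step)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
        scored = [(score(image[r:r + th, c:c + tw], template), (r, c))
                  for r, c in candidates
                  if 0 <= r <= image.shape[0] - th and 0 <= c <= image.shape[1] - tw]
        if scored:
            best = max(scored)[1]       # keep the best-matched point for the next step
    return best
```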
Based on the template autocorrelation, the CAPS method justifiably selects a set of horizontal
and vertical search steps, rather than a consecutive point-to-point search, for faster job
detection, as shown in Figure 7. For a particular cut-off coefficient Vc from the autocorrelation
graph in Figure 7, the horizontal and vertical widths, Hs and Vs, are chosen as step sizes for the
coarse search. Figure 8 shows the CAPS search lattice with the CAPS horizontal and vertical
step sizes Hs and Vs. In our case, Hs and Vs are determined for Vc = 0.6, which yields
satisfactory results.
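How Hs and Vs might be derived from the template autocorrelation for a cut-off Vc = 0.6 is sketched below; the FFT-based autocorrelation, its normalisation and the way the central-lobe widths are measured are assumptions about the CAPS scheme of (Shijun et al., 2003), not its published implementation.

```python
import numpy as np

def caps_step_sizes(template, vc=0.6):
    """Rough horizontal/vertical coarse-search steps from the template autocorrelation."""
    t = template.astype(float) - template.mean()
    # zero-padded 2-D autocorrelation via FFT, normalised to 1 at zero lag
    T = np.fft.fft2(t, s=(2 * t.shape[0], 2 * t.shape[1]))
    ac = np.fft.fftshift(np.real(np.fft.ifft2(T * np.conj(T))))
    ac /= ac.max()
    cy, cx = np.array(ac.shape) // 2
    # width of the central lobe along each axis where the autocorrelation >= vc
    hs = int(np.sum(ac[cy, :] >= vc))     # horizontal step Hs
    vs = int(np.sum(ac[:, cx] >= vc))     # vertical step Vs
    return max(hs, 1), max(vs, 1)
```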
The choice of the threshold depends on the distribution of grey levels in the captured image.
Thresholding techniques can be categorised into two classes: global thresholding and local
(adaptive) thresholding. In global thresholding, a single threshold value is used to separate
the foreground and the background of an image. It is attractive because it is simple and is
sufficient in a fixed, structured environment. However, in the case of the AMR navigating its
way from one workstation to another, due to uneven illumination, local thresholding is more
appropriate for segregating the object from the background for proper gripping of the object
through parametric calculations using Freeman chain coding (Freeman, 1961).
Entropy-based approaches have also been proposed for thresholding: (Pun, 1980) and
(Kapur et al., 1985) propose approaches which maximize the a posteriori entropy to measure the
homogeneity of the threshold classes, and (Sahoo et al., 1997) proposes a variation of this
approach based on Renyi entropy. However, these methods are computationally intensive and
hence time consuming, and thus not suitable for real-time computation. (Sahoo et al., 1988), in
their study on global thresholding, concluded that Otsu's method is one of the better threshold
selection methods for general real-world images with regard to uniformity and shape measures.
Otsu's method chooses the optimal thresholds by maximizing the between-class variance with an
exhaustive search (Otsu, 1979). The drawback of Otsu's method is that, as the number of classes
of an image increases, it exceeds the time limit for multilevel thresholding in real time. To
overcome this, (Liao et al., 2001) proposes a modified approach based on a heuristic search
method for faster multi-level thresholding.
For our AMR, we have selected one-dimensional bi-level thresholding using a maximum
eighteen-step on-line heuristic search on a gray-scale image, based on Otsu's method, for
determining the proper image threshold.
3.2.1 Eighteen Step Algorithm for on-line thresholding
Following Otsu's method for image thresholding, an image is a 2D grayscale intensity
function containing N pixels with gray levels from 1 to L. The number of pixels with gray
level i is denoted f_i, giving a probability of gray level i in the image of

p_i = f_i / N.    (1)
In the case of bi-level thresholding of an image, the pixels are divided into two classes, C1
with gray levels [1, ..., t] and C2 with gray levels [t+1, ..., L]. Then, the gray level probability
distributions for the two classes are C1: p_1/\omega_1(t), ..., p_t/\omega_1(t) and
C2: p_{t+1}/\omega_2(t), p_{t+2}/\omega_2(t), ..., p_L/\omega_2(t), where

\omega_1(t) = \sum_{i=1}^{t} p_i    (2)

and

\omega_2(t) = \sum_{i=t+1}^{L} p_i.    (3)

The mean gray levels of the two classes are

\mu_1 = \sum_{i=1}^{t} i\, p_i / \omega_1(t)    (4)

and

\mu_2 = \sum_{i=t+1}^{L} i\, p_i / \omega_2(t).    (5)

Let \mu_T be the mean intensity for the whole image. It is easy to show that

\omega_1 \mu_1 + \omega_2 \mu_2 = \mu_T,    (6)

\omega_1 + \omega_2 = 1.    (7)
Using discriminant analysis, Otsu defined the between-class variance of the thresholded
image as

\sigma_B^2 = \omega_1 (\mu_1 - \mu_T)^2 + \omega_2 (\mu_2 - \mu_T)^2.    (8)

For bi-level thresholding, Otsu verified that the optimal threshold t^* is chosen so that the
between-class variance \sigma_B^2 is maximized; that is,

t^* = \arg\max_{1 \le t < L} \sigma_B^2(t).    (9)

As an alternative formulation of Otsu's method, using Eqs. (6) and (7), the between-class
variance in Eq. (8) of the thresholded image can be rewritten as

\sigma_B^2 = \omega_1 \mu_1^2 + \omega_2 \mu_2^2 - \mu_T^2.    (10)

As the last term of Eq. (10) is independent of the choice of the threshold, the optimal bi-level
threshold is chosen by maximizing a modified between-class variance (\sigma_B'^2), defined as

\sigma_B'^2 = \omega_1 \mu_1^2 + \omega_2 \mu_2^2,    (11)

where

\mu_T = \sum_{i=1}^{L} i\, p_i.    (12)
From Eqs. (6) and (7), the modified between-class variance (\sigma_B'^2) can be written as

\sigma_B'^2 = \omega_1 \mu_1^2 + \frac{(\mu_T - \omega_1 \mu_1)^2}{1 - \omega_1}.    (13)

Comparing Eq. (13) with Eq. (8), we find that the \sigma_B'^2 value can be calculated directly,
ignoring Eqs. (3), (5) and (6). Again, from Eq. (9), by Otsu's method the optimal bi-level
threshold is chosen by maximizing the modified between-class variance (\sigma_B'^2) over the
gray levels from 1 to L. According to the criteria of both Eq. (9) for \sigma_B^2 and Eq. (13) for
\sigma_B'^2 for finding the optimal threshold by Otsu's method, the search range for the maximal
\sigma_B^2 and the maximal \sigma_B'^2 is 1 \le t < L. This exhaustive search involves (L-1)
possible combinations, which is computationally intensive for on-line processing. Using the
eighteen-step method, we have reduced the computational time by reducing the exhaustive
search from (L-1) possible combinations to a maximum of eighteen combinations for detecting
the proper threshold. The flowchart in Figure 11 delineates this process.
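The exact eighteen-step schedule is defined by the flowchart in Figure 11, which is not reproduced here; as a stand-in, the sketch below evaluates the modified between-class variance (13) and locates the bi-level threshold with a simple coarse-to-fine search over the gray levels.

```python
import numpy as np

def sigma_b2(hist, t, L=256):
    """Modified between-class variance of (13) for threshold t (1 <= t < L)."""
    p = hist / hist.sum()
    i = np.arange(1, L + 1)
    mu_T = np.sum(i * p)
    w1 = np.sum(p[:t])
    if w1 <= 0 or w1 >= 1:
        return 0.0
    s1 = np.sum(i[:t] * p[:t])                 # equals omega_1 * mu_1
    return s1 ** 2 / w1 + (mu_T - s1) ** 2 / (1 - w1)

def coarse_to_fine_threshold(gray, L=256, coarse_step=16):
    """Bi-level threshold by maximising (13): coarse grid first, then a local refinement.

    gray : 2-D array of integer gray levels 0..L-1
    """
    hist = np.bincount(gray.ravel(), minlength=L).astype(float)
    coarse = range(coarse_step, L, coarse_step)
    t0 = max(coarse, key=lambda t: sigma_b2(hist, t, L))
    fine = range(max(1, t0 - coarse_step), min(L, t0 + coarse_step))
    return max(fine, key=lambda t: sigma_b2(hist, t, L))
```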
Figure 12. Time for sending a 768x576 image frame over Ethernet
Figure 13. Time for sending a 768x576 image frame over WLAN
Figure 14. Time for grabbing and converting a 1024 x 768 raster image to JPEG
Frame               Search stage     Time (seconds)   Bi-level threshold
Frame 1 (Fig. 9a)   Coarse search    0.0156           155
Frame 2 (Fig. 9b)   Coarse search    0.0156           130
Frame 3 (Fig. 9c)   Coarse search    0.125            121
Frame 4 (Fig. 9d)   Fine search      1.422            118
Figure 16. Plot of modified between-class variance against the corresponding gray level
value
7. Conclusion
Vision is becoming an integral part of robotic systems, not only for navigation but also for
job identification for material handling, as the camera is the only sensor that imparts a feel of
spatial awareness through 3-D sensing, which is lacking in other sensors like laser or ultrasonic
range finders. In the AMR, vision plays an integral role in all aspects. It plays a pivotal role in
mobile robot perception of the environment while navigating, through rapid transmission of
images from the mobile robot to remote viewers. Moreover, vision also plays a crucial role in
online job identification for job handling. For all these tasks, the sheer volume of information
to be processed online can become a hindrance, as it was in the past, but we have shown that,
by coming up with novel concepts based on existing knowledge and ideas, and with the
continuing advancement of computer architecture, especially the powerful modern processors
available today, we can not only overcome these difficulties but turn vision's unique features
to our advantage.
8. Acknowledgement
The authors would like to thank the members of the project team for their help and support
in developing an Autonomous Mobile Robot for Manufacturing Environment under the
aegis of Council of Scientific and Industrial Research (CSIR, India) network project on
Advanced Manufacturing Technology. The authors would like to express their sincere
gratitude to the Director of the Institute for his assent in publishing this work.
9. References
Anuta, P. E. (1970). Spatial registration of multispectral and multitemporal digital imagery
using fast fourier transform techniques, IEEE Trans. on Pattern Analysis and Machine
Intelligence, Vol. 8, No. 4, October 1970, pp. 353-368, ISSN: 0359-4237
Barnea, D.I. & Silverman, H.F. (1972). A class of algorithms for fast digital image
registration, IEEE Trans. on Computers, Vol. C-21, February 1972, pp. 179-86
Bei, C.D. & Gray, R.M. (1985). An improvement of the minimum distortion encoding
algorithm for vector quantization, IEEE Trans. on Communications, Vol. 33, No. 10,
October 1985, pp. 1132-33, ISSN: 0096-2244
Bischoff, R. & Graefe, V. (1998). Vision-guided intelligent robots for automating
manufacturing, material handling and services, WESIC98 Workshop on European
Scientific and Industrial Collaboration on Promoting Advanced Technologies in
Manufacturing, Girona, June 1998
Datta, S.; Ray, R. & Banerji, D. (2007). Development of autonomous mobile robot with
manipulator for manufacturing environment, accepted for publication in
International Journal of Advanced Manufacturing Technology, Springer Publication
Datta, S.; Banerji, D. & Mukherjee, R. (2006). Mobile robot localization with map building
and obstacle avoidance for indoor navigation, Proc. IEEE International Conference on
Industrial Technology, Vol. 3, pp. 2535-2540, ISBN 1-4244-0726-5, December 15-17
2006, Mumbai, India
Freeman, H. (1961). On the encoding of arbitrary geometric configurations, IRE Trans.
Electronic Computers, Vol. EC-10, pp. 260-268, June 1961
Goshtasby, A.; Gage, S.H. & Bartholic, J.F. (1984). A two-stage cross correlation approach to
template matching, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.
PAMI-6, pp. 374-78, May 1984
Hemami, S.S. (1996). Reconstruction-optimized lapped orthogonal transforms for robust
image transmission, IEEE Trans. On Circuits and Systems for Video Technology, Vol. 6,
No. 2, April 1996, pp. 168-181, ISSN: 1051-8215
Hemami, S.S. (1995). Digital image coding for robust multimedia transmission, Symposium
on Multimedia Communications & Video Coding, New York, October, 1995
Hemami, S.S. & Meng, T.H.Y. (1995). Transform-coded image reconstruction exploiting
interblock correlation, IEEE Trans. on Image Processing, Vol. 4, No. 7, July 1995, pp.
1023-27, ISSN: 1057-7149
Hemami, S.S. & Gray, R.M. (1994). Image reconstruction using vector quantized linear
interpolation, Proc. ICASSP 94, Adelaide, Australia, April 1994, Vol. 5, pp. 629-32
Jain, J.R. & Jain, A.K. (1981). Displacement measurement and its application in interframe
image coding, IEEE Trans on Communication, Vol. 29, No.12, December 1981, pp.
1799-1808, ISSN:0096-2244
Kapur, J.N.; Sahoo, P.K. & Wong, A.K.C. (1985). A new method for gray-level picture
thresholding using the entropy of the histogram, Computer Vision Graphics Image
Processing, Vol. 29, 1985, pp. 273-285
Katz, D. et al. (2006). The Umass Mobile Manipulator UMan: An experimental platform for
autonomous mobile manipulator, Workshop on Manipulator for Human Environment
at Robotics: Science and Systems, Philadelphia, USA, August 2006
Li, W. & Salari, E. (1995). Successive elimination algorithm for motion estimation, IEEE
Trans. on Image Processing, Vol. 4 No. 1, Jan. 1995, pp. 105-107, ISSN:1057-7149
Liao, P.; Chen, T. & Chung, P. (2001). A fast algorithm for multilevel thresholding, Journal of
Information Science and Engineering, 17, pp. 713-727, 2001
Olson, C. F. (2000). Maximum-likelihood template matching, Proc. IEEE Conference on CVPR,
Vol. 2, pp. 52-57, 13-15 June 2000, Hilton Head, South Carolina, USA
Otsu, N. (1979). A threshold selection method from gray-level histogram, IEEE Trans. System
Man Cybernetics, Vol. 9, March 1979, pp. 62-66
Pun, T. (1980). A new method for gray-level picture thresholding using the entropy of the
histogram, Signal Processing, Vol.2, 1980, pp. 223-237
Rosenfeld, A. (1969). Picture processing by computer, New York: Academic Press, 1969
Rosenfeld, A. & Vanderburg, G. J. (1977). Coarse-fine template matching, IEEE Trans. on
Systems, Man and Cybernetics, Vol. 7, 1977, pp. 104-107
Sahoo, P.K.; Wilkins, C. & Yeager, J. (1997). Threshold selection using Renyis entropy,
Pattern Recognition, Vol. 30, No. 1, pp. 71-84, 1997, Elsevier Science Ltd.
Sahoo, P.K. et al. (1988). A survey of thresholding techniques, Computer Vision Graphics
Image Processing, Vol. 41, 1988, pp. 223-237
Schaefer, G. (2001). JPEG image retrieval by simple operators, CBMI 01, pp. 207-214,
Brescia, Italy, September 19-21, 2001
Sen, S; Taktawala, P. K. & Pal, P. K. (2004). Development of a range-sensing, indoor, mobile
robot with wireless Ethernet connectivity, Proceedings of the National Conference on
Advanced Manufacturing & Robotics, pp. 3-10, ISBN 81-7764-671-0, CMERI, January
2004, Allied Publishers Pvt. Ltd., Durgapur
Shijun, S. et al. (2003). Fast template matching using correlation-based adaptive predictive
search, International Journal of Imaging System Technology, Vol. 13, pp. 169-178, 2003
Stefano. L.D. & Mattoccia, S. (2003). Fast template matching using bounded partial
correlation, Machine Vision and Applications,Vol. 13, pp. 213-21, ISSN: 0932-8092,
Springer-Verlag, 2003
Tong, H.F. & Zhang, D. (1998). A new progressive colour image transmission scheme for the
World Wide Web, Computer Networks and ISDN Systems, 30 (1998) pp. 2059-2064
Vanderburg, G.J. & Rosenfeld, A. (1977). Two stage template matching, IEEE Trans. on
Computers, Vol. C-26, pp. 384-93, April 1977
Venot, A.; Lebruchec, J.F. & Roucayrol, J.C. (1984). A new class of similarity measures for
robust image registration, Computer Vision, Graphics and Image Processing, Vol. 28,
pp. 176-84, 1984
Wallace, G.K. (1991). The JPEG Still Picture Compression Standard, Communications of the
ACM, Vol. 34 No.4, pp. 30-44, 1991
Wang, H. & Mersereau, R. (1999). Fast algorithm for the estimation of motion vectors, IEEE
Trans. on Image Processing, Vol. 8, No. 3, March 1999, pp. 435-438, ISSN: 1057-7149
Yoshimura, S. & Kanade, T. (1994). Fast template matching based on the normalized
correlation by using multiresolution eigenimages, Proc. IROS 94, Vol. 3, pp. 2086-2093,
September 12-16 1994, Munich, Germany
28
Symmetry Signatures for Image-Based
Applications in Robotics
Kai Huebner1 and Jianwei Zhang2
1Comp.
2Technical
1. Introduction
The robots that are to find their way in our future households and everyday lives
necessarily have to be mobile and self-dependent. For such autonomous systems it becomes
more and more important to efficiently process the incoming data and to thereby exhibit
what we might call "intelligent behaviour".
While intelligence in terms of plans and goals is an abstract metaphor for each robot's
decision process, the perception of the local environment has to be a central issue. Dealing
with this demand in the context of intelligent systems shows plainly how sophisticated human
visual perception is. The creators and developers of artificial systems therefore build up a
construction kit with construction blocks that try to reproduce cognitive perception mechanisms
by machine algorithms.
In this chapter, we will show how a construction block for symmetry perception can be
added to this set. The main issues are discussed in three layers that describe this block from its
origin in biological motivation up to its application in intelligent systems:
1. Symmetry as a Feature: The first layer addresses the basic motivation of symmetry as a
feature. To this end, references to symmetry in diverse domains are given and new methods
are developed that provide description and application of symmetry as an image feature. These
include two main symmetry measures that offer a variety of symmetry properties for
higher-level image processing tasks.
2. Regional Symmetry Features: The second layer proceeds to the application in relation to modern
regional image features. The three steps of detection, description and robust matching of
regional symmetry features form the necessary links between the basic motivation and the
practical application of symmetry. Symmetry features are evaluated and compared to
state-of-the-art features considering their robustness w.r.t. common image transformations.
3. Integration and Application: A practical example from the area of mobile robot navigation is
proposed in the third layer to demonstrate the capability of the developed symmetry
features in applications. For this purpose, the mobile service robot TASeR from the working
group of Technical Aspects of Multimodal Systems (University of Hamburg, Germany) is
used. The application provides the links to higher-level construction blocks from the set of
visual object analysis and robot navigation.
The main issue of this chapter will focus on a conclusion of our work on a fundamental
analysis of symmetry as an image feature, followed by a framework on the development of
a robust visual symmetry feature detector and on the implementation of symmetry in robot
applications. We motivate our work in section 2. In section 3, we show our work on finding
symmetry measures valuable for our goal of robotic applications. Section 4 describes the
implementation of the developed symmetry measures into a regional feature detector and
its evaluation. We propose some preliminary work on exemplarily integrating those regional
features into robotic image processing for egomotion classification in section 5, before we
conclude this chapter in section 6.
2. Motivation
Nowadays, robots are not only meant to sort and stack parcels in unenlivened storage depots. They are supposed to wash dishes, to guide visitors through museum halls or even to play soccer in interaction with humans. For these tasks, a robot must be able to act mobile and self-dependent. It must adapt to its changing environment instead of letting humans adapt a constant environment to the robot. An inflexible model of the world is useless in a world of motion and dynamics. Robots thus have to be equipped with methods that allow them to build their own world model to localize within. They should be able to cope with dynamic or unknown environments by constructing, adapting and expanding their models of the world with a large degree of autonomy. However, interaction with a world necessarily starts with the perception of the things or objects inside it. In many applications of our field of research, the human visual system gives broad inspiration for the solution of common robotic problems and tasks, e.g. distance estimation to objects, object and situation recognition, and localisation. We can also observe that a robot's sensor configuration depends both on the application and on the financial means of constructors, developers and customers. A camera has the advantages of having become a versatile and cheap visual sensor over the last years and of being a system that is very close to our own human visual perception.
In the work presented here, we focus on the camera as the only sensor.
If we restrict ourselves to visual data only, the problem of selecting suitable visual features comes up, as images are high-dimensional and thus complex to process. Additionally, images are highly sensitive to unpredictable interferences like rotation, scaling and occlusion of objects, illumination influence, perspective warp and viewpoint change. According to (Jepson & Richards, 1993), what makes a good visual feature, i.e. one that separates the core of information from the clutter, basically depends on the application itself. Other definitions of what constitutes an image feature have also been proposed in the literature.
Though the imaginable visual features are numerous and manifold in type, they can be divided into three main classes according to their focus. Common global features that describe general properties of an entire image scene are rather inappropriate for the task of visual scene interpretation. While images of single objects can be generalized easily by simple global attributes, e.g. size, colour or texture, it is more difficult to find stable and repeatable features for conglomerate scenes. However, global features give very compact representations of significant image properties.
Many higher-level tasks like scene exploration or object classification and object tracking in
complex scenes are therefore grounded on local features. Being related to human visual
perception, local visual features like edges and corners give clues for efficient scene
exploration and allow focusing on well-located interest points. The Scale-Invariant Feature
Transform (Lowe, 2004) and the Harris-Laplacian (Mikolajczyk & Schmid, 2004) are popular
methods of local feature detection, approaching robustness to rotation and scale. As the
exploration of invariant features is an active field of research, well-elaborated comparisons
of various local feature detectors and descriptors concerning a set of common
transformations have been published (Mikolajczyk & Schmid, 2004; Schmid et al., 2000;
Mikolajczyk & Schmid, 2005).
Due to the different characteristics of global and local features, some applications benefit
from the combination of both approaches into regional features, where a region is defined as
an arbitrary subset of the image. The extraction of Maximally Stable Extremal Regions
(Matas et al., 2004) highlights the advantage of region-based detectors that produce both
sparse and robust features particularly covariant to viewpoint change and affine
transformations.
If we consider these different natural visual features, we find local features like edges or corners, regional features like colour or intensity blobs, and global features like colour histograms in the literature. A rather unnoticed type of feature for robotic applications is symmetry, though symmetry is present everywhere in our everyday lives. Many objects of our world show a high degree of some symmetric property, and humans are usually surrounded by symmetric objects. Plants and animals grow up in a somehow symmetric manner. But also in many other domains like mathematics, art, architecture or manufacturing, symmetry plays a major role.
Let some psychological citations from (Locher & Nodine, 1989) describe the high influence of symmetry on human visual perception:
Symmetry is a property of a visual stimulus which catches the eye in the earliest stages of vision.
Most perceptual theories assume that the eye-brain system uses the axis of symmetry as an anchoring point for visual exploration and analysis.
Symmetry comes along with attention and interest, which are supposed to be necessary for a useful natural image feature. We therefore claim that symmetry is worth considering as a feature in the context of robot vision. In the next section, we will start by asking how we can obtain a description of symmetry from the visual data.
3. Symmetry as a Feature
As mentioned above, symmetry is a fundamental feature that has been studied in several domains, e.g. architecture, art and nature (see Figure 1). Many aspects that concentrate on nature and mathematics are discussed in the book Fearful Symmetry: Is God a Geometer? (Stewart & Golubitsky, 1992). It has been shown in the biological domain that some animals prefer mates that stand out by their symmetric appearance (Enquist & Arak, 1994; Kirkpatrick & Rosenthal, 1994), and that pigeons are able to distinguish between symmetric and asymmetric patterns (Delius & Nowak, 1982). A very good introduction to several types and appearances of general symmetry can be found in the book Symmetry - A Unifying Concept (Hargittai & Hargittai, 1994).
3.1 Definition of Symmetry
Besides this introduction, we also find descriptions of various types of symmetry in that work (Hargittai & Hargittai, 1994). Each type of symmetry can be assigned to a corresponding action that fulfils the basic property of symmetry: keeping the shape after having performed the action. Thus, we can reflect shapes that are mirror-symmetric along an axis, rotate shapes that are rotationally symmetric, or even shift shapes that are translationally symmetric, without changing their shape. Here, we focus on the first two types of symmetry by giving the following definitions:
Reflectional symmetry: A shape is symmetric w.r.t. reflection about an axis. Reflecting the shape about this axis does not result in a change of its appearance.
Rotational symmetry: A shape is symmetric w.r.t. rotation about a point by a certain angle α. Rotating the shape about the point by α does not result in a change of its appearance. A shape is n-times rotationally symmetric with n = 2π/α.
These definitions lean on more detailed and more general definitions of two-dimensional symmetry types by Zabrodsky et al. (Zabrodsky et al., 1995), which are based on exact invariance. However, almost no object of our world shows invariant symmetry properties from this point of view. For example, faces are highly symmetric, but both halves of one face are never exactly the same. Therefore, our definitions above differ by using the term change of appearance. A common face is thus reflectionally symmetric, as the reflection does not change perception or appearance for the viewer. Following this definition, we find that our world consists of many symmetric objects.
3.2 Symmetry in Human Perception
How the existence of symmetry influences the human visual system and how it is used for visual scene exploration was evaluated in psychophysical experiments (Locher & Nodine, 1989). An important result of those experiments is that especially reflectional symmetries and their orientations are of significant importance for human vision. Eye-tracking experiments show that humans quickly detect and take advantage of horizontal and vertical symmetries. Figure 1 gives two samples of such visual explorations. While the viewer has fully explored the asymmetric shape on the left hand side, the focus clearly concentrates on just one half of the symmetric shape on the right. Hereby, we get a clue that the human eye is able to detect and use symmetry as an anchoring point for the visual exploration of objects and scenes. Palmer and Hemenway (Palmer & Hemenway, 1978) consider the detection time of arbitrarily skewed symmetric shapes in similar experiments. They conclude that vertical reflective symmetry is detected more often and more quickly than horizontal reflective symmetry, which in turn is detected better than arbitrarily skewed reflective symmetries.
Those two references are exemplary for others that also motivate symmetry as a visual feature from the biological and psychophysical point of view (Barlow & Reeves, 1979; Csathó et al., 2003; Ferguson, 2000; Tyler, 1994).
3.3 Symmetry in Computer Vision
Besides its influence on human visual perception, symmetry has also been investigated in computer vision. There are several references that motivate symmetry as a feature in very versatile tasks (Ferguson, 2000; Liu, 2000; Reisfeld et al., 1995; Zabrodsky, 1990). Early work in the area of symmetry axis extraction for object description, like the Symmetry Axis Transform (SAT) (Blum & Nagel, 1978) and the Smoothed Local Symmetries (SLS) (Brady & Asada, 1984), is closely related to the Medial Axis Transform (MAT), offering the main axes of a shape. The idea of using symmetry as a feature has been advanced over the last decades. In the following, some recent and related work is referenced.
Figure 1. Left: Examples for symmetric structures in architecture, nature and art. Right:
Visual explorations both on an asymmetric and a symmetric shape. The path of visual focus
covers the whole shape for the asymmetric shape, but only one half of the symmetric shape
Sun (Sun, 1995) and Sun & Si (Sun & Si, 1999) present a fast algorithm to detect the
symmetry axis of a shape by gradient histograms. A similar approach analyzing an energy
function of the input image is proposed by Scognamillo et al. (Scognamillo et al., 2003). The
task of these methods is to detect the main symmetry axis of one shape, thus only images
with a single object on a uniform background are useful. An application to symmetry as a
feature in an arbitrary scene would therefore need prior segmentation.
Reisfeld et al. (Reisfeld et al., 1995) define a generalized symmetry transform that uses symmetry to extract regions of interest in a scene. The two-dimensional operator includes both symmetry and gradient information. Regions that show a high degree of symmetry but low contrast, e.g. walls, are therefore not extracted. Di Gesù and Valenti present the Discrete Symmetry Transform (DST), which is sped up by the selection of non-uniform image regions (Di Gesù & Valenti, 1995). The resulting symmetry image is used for several tasks of face recognition, image segmentation and object classification, as well as motion analysis (Di Gesù & Valenti, 1996). These approaches suffer from their generality, which causes higher effort in calculation time and parametrization. Similar results using symmetry as a detector of interest have been shown by analyzing the frequency components of an image (Kovesi, 1997).
Chetverikov (Chetverikov, 1999; Chetverikov, 2003) analyzes the surroundings of each image point with regard to its anisotropy. Based on this result, a symmetry structure is calculated that represents symmetric texture orientation. The extracted feature thereby describes the texture and the image, respectively, as a whole. Liu et al. (Liu et al., 2004) describe patterns considering their symmetry properties, including translational symmetries. Regions that correspond with regard to their structure of symmetric shape even after perspective warp are deeply investigated by Tuytelaars et al. (Tuytelaars et al., 2003). However, these methods require a number of pre-processing steps that influence the run-time of each feature detection.
Face recognition based on symmetry description is found in a model-based work by
Zabrodsky et al. (Zabrodsky et al., 1993; Zabrodsky et al., 1995). Another model-based
approach to segment objects from the visual data by symmetry is proposed by Liu et al. (Liu
et al., 1998). Johansson et al. (Johansson et al., 2000) detect rotational symmetries by
particularly defined rotational operators, while Loy and Zelinsky (Loy & Zelinsky, 2003)
present an efficient and real-time capable feature detector based on radial symmetries.
We find that all these approaches differ both in the methods applied and in their results, though all of them handle the problem of detecting symmetries in visual data. Some describe symmetry properties for a pre-segmented object (Chetverikov, 1999; Liu et al., 2004; Sun, 1995) and are thereby inadequate for the extraction of feature points from cluttered scenes. Some include reflective symmetries of arbitrary orientation (Chetverikov, 1999; Di Gesù & Valenti, 1995; Di Gesù & Valenti, 1996; Kovesi, 1997; Reisfeld et al., 1995; Sun, 1995; Zabrodsky et al., 1995), offer methods to extract rotational symmetries (Johansson et al., 2000; Loy & Zelinsky, 2003; Zabrodsky et al., 1995) or use pre-processing steps (Liu et al., 2004; Tuytelaars et al., 2003) and thereby need additional computing time.
For our scenario, we prefer an approach that extracts symmetric features from the raw visual data without such pre-processing steps, similar to the work by Reisfeld et al. (Reisfeld et al., 1995) and Di Gesù and Valenti (Di Gesù & Valenti, 1995; Di Gesù & Valenti, 1996). A time-line of the mentioned literature is presented in Table 1.
The diagonal line highlights the trend towards detection of very general, different and
complex descriptions of symmetry in computer vision. However, a real-time application in
robotic systems suffers from this evolution, as more complex algorithms need more
processing time.
We found that most image processing operators available for our needs of bilateral
symmetry detection in cluttered scenes have the crucial demerit of being large and complex.
In our first approach, we therefore proposed a simple, fast and compact operator to extract
the regions of interest from images (Huebner, 2003). The psychophysically motivated simple
symmetry operator detects horizontal and vertical reflective symmetries only. Resulting
symmetry images offer multiple feature extraction methods.
In particular, binary images derived from symmetry axis detection are interesting for further
image processing steps. As we show in the next section, the fast operator can be applied to
arbitrary images without prior adaptation and without thresholds. The only parameters to
specify are the size of the operator mask and the resolution of symmetry data.
Table 1. Time-line of the referenced symmetry detection approaches, listing for each the method (e.g. SAT, SLS), an approach key (CF, MOD, FQ, CFQ), the supported symmetry types (reflective, rotational, translational), the number of features (b, mn or 1) and the type of features delivered (object skeleton, symmetry values, main symmetry axis/axes, symmetry axis points, symmetry segments, regularity values, reconstructions, symmetry groups, classification).

3.4 Qualitative Symmetry Extraction
The simple symmetry operator of (Huebner, 2003) assigns each image point pi a symmetry value with respect to a given horizontal or vertical axis:
(1)
where m>0 is the size of the neighbourhood of pi along the direction perpendicular to the axis of symmetry. The symmetry value of pi shall be detected with respect to this axis. The complete number of pixels considered is 2m. c is a normalization constant which depends on the colour space used and on a radial weighting function of j and m. The difference between two opposing points pi-j, pi+j is determined by a gradient function g(pi-j, pi+j), which typically is the Euclidean distance of the corresponding colour vectors. A few example results are presented in Figure 2, demonstrating that the choice of m is important for the performance of the algorithm. Setting m to a low value works well for the symmetry axes of small objects, while those of bigger objects are enlarged. A large value of m, in contrast, is better at detecting the symmetry axes of bigger objects. Note that the border regions of the images (left and right for vertical symmetry) are influenced strongly by the effect of fading where the operator reaches out of the image, but symmetry axis points (maxima of the values) are quite stable and independent of m. However, for this operator and for other techniques from the literature, a symmetry value for an image point is detected by a static operator covering a surrounding region around that point. These operators return relative values, i.e. qualities, of symmetry that describe symmetry as low or high inside a pre-determined, fixed region. We call these approaches qualitative or strength-based, as a quality of symmetry is their output. Results depend on the operator size chosen and are thus not comparable if two different sizes have been used for symmetry feature extraction.
Figure 2. Example image (left). Vertical symmetry image calculated with small operator (m
= 10; centre) and with large operator (m = 50; right). Brightness corresponds to symmetry.
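To make the idea of such a strength-based operator concrete, the following Python sketch computes vertical symmetry values row by row for a grey-value image. It is only an illustration of the principle described above, not the original operator of (Huebner, 2003): the 1/j radial weighting, the absolute grey-value difference as gradient function g and the normalization are assumptions of this sketch.

import numpy as np

def vertical_symmetry(image, m=10):
    # Strength-based (qualitative) symmetry sketch: for every pixel, compare the m left
    # and m right neighbours in its row and map the accumulated weighted difference to a
    # symmetry value in [0, 1], where 1 means perfectly mirror-symmetric.
    img = image.astype(float)
    h, w = img.shape
    sym = np.zeros((h, w))
    weights = 1.0 / np.arange(1, m + 1)          # assumed radial weighting, decaying with distance j
    c = weights.sum() * 255.0                    # assumed normalization for 8-bit grey values
    for y in range(h):
        for x in range(m, w - m):
            left = img[y, x - m:x][::-1]         # p_{i-1}, ..., p_{i-m}
            right = img[y, x + 1:x + m + 1]      # p_{i+1}, ..., p_{i+m}
            diff = np.abs(left - right)          # gradient function g as absolute difference
            sym[y, x] = 1.0 - np.dot(weights, diff) / c
    return sym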
3.5 Quantitative Symmetry Extraction using Dynamic Programming
Having identified these disadvantages of qualitative operators, we claim that it is more relevant to get quantitative or range-based information about the size of the symmetric region instead of its degree. We have therefore proposed a novel approach to symmetry extraction based on Dynamic Programming (Huebner et al., 2005), which we briefly describe in this section.
To keep the motivation of psychophysical work on symmetry perception (Locher & Nodine, 1989; Palmer & Hemenway, 1978), we still restrict our symmetry detection to horizontal and vertical symmetry, i.e. reflection with respect to a horizontal or vertical axis. Using this restriction, the problem is to estimate the range around an image point, within its row or column, in which symmetry is still detectable, i.e. in which the assignment of opposing points is linear and not erroneous.
The assignment of points is therefore seen as an optimization problem of finding the best correspondence between the two opposing patterns. Dynamic Programming offers the global optimum for such problems under the assumption that the order of pattern elements is kept. See the example in Figure 3.
The example shows two patterns R = R0,...,R4 and L = L0,...,L4 for which the best correspondence between their elements shall be found. The solution of this problem is equal to finding the best path in a two-dimensional search space spanned by L and R (Ohta & Kanade, 1985). Each path ranging from (R0, L0) to (Rrmax, Llmax) inside this search space describes a possible mapping of feature points, as long as the order of elements is kept. This property is only ensured by path elements reaching from one cell to the right, the top-right or the top neighbouring cell in the search space. The optimal path can then be found by Dynamic Programming starting from point (R0, L0), using simple error measures (Žganec et al., 1992), as can be seen in Figure 4.
Note that the structure of the path and the overall costs are dependent on the patterns'
symmetric correspondence. As the example shows, high symmetry of the patterns results in
a diagonal path and low costs. In contrast, the comparison of asymmetric shapes will result
in a non-linear path and high costs.
(2)
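The following minimal Python sketch illustrates this Dynamic Programming step. It assumes a simple absolute-difference cost between elements and the three admissible path moves (right, top, top-right) that keep the element order; it is a simplified stand-in for the actual cost terms and the symmetry-range extraction of (Huebner et al., 2005).

import numpy as np

def dp_match_cost(L, R):
    # Globally optimal alignment cost of two opposing patterns L and R. Only moves to the
    # right, top or top-right neighbour are allowed, so the order of pattern elements is kept.
    # Symmetric patterns yield a near-diagonal path and low cost; asymmetric patterns yield
    # a distorted path and high cost.
    L, R = np.asarray(L, float), np.asarray(R, float)
    n, m = len(L), len(R)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(L[0] - R[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = prev + abs(L[i] - R[j])
    return D[n - 1, m - 1]

# A mirrored (symmetric) pair scores lower than an asymmetric one:
print(dp_match_cost([1, 5, 3, 2, 8], [1, 5, 3, 2, 8]))   # low cost
print(dp_match_cost([1, 5, 3, 2, 8], [8, 0, 7, 1, 4]))   # higher cost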
4. Regional Symmetry Features
Robust detection of features is a crucial task for applications that deal with visual
information. Image data is high-dimensional, complex and particularly sensitive to a
multitude of changes which are mostly unpredictable and may greatly influence the image
representation of one and the same object or scene. Therefore, a good feature detection is
strongly required in dynamic and unrestricted real world environments. Preferably, this
detection is invariant to a number of transformations, namely
rotation,
scale change,
image noise,
occlusion,
image flow.
A visual feature is referred to as good if it separates the core of information from the
clutter. This basically depends on the application at hand and on the context it is used in
(Jepson & Richards, 1993). For our research on vision systems for mobile robots, we define a
good feature to be both independent of the transformations above and distinctively
repeatable in dynamic environments.
Most features applied in the literature are commonly classified as either global, local or
regional. Concerning our definition of a good feature, common global features that describe
general properties of an entire image scene are rather inappropriate for our task of robot
scene interpretation. While single objects can be generalized easily by simple global features,
e.g. size, colour or texture attributes, finding stable and repeatable features is more complex
for conglomerate scenes. However, such global features give very compact representations
of significant image properties. Therefore, global features are mainly used in image-based
applications like image retrieval or image annotation.
Many higher-level tasks like scene exploration or object classification and object tracking in complex scenes are grounded on local features. Being related to human visual perception, local visual features give clues for efficient scene exploration. They allow focusing on well-located interest points. Therefore, a variety of local features have been applied in a range of vision tasks, aiming at high robustness and repeatability. The Scale-Invariant Feature Transform (SIFT) proposed by Lowe (Lowe, 2004) and the Harris-Laplacian by Mikolajczyk and Schmid (Mikolajczyk & Schmid, 2002) are two popular methods of local feature detection. While the SIFT uses local extrema of Difference-of-Gaussian (DoG) filters in scale-space to produce scale-invariant features, the Harris-Laplace operator joins rotationally invariant Harris features (Harris & Stephens, 1988) and Laplacian scale-space analysis into an affine invariant interest point detector. As the exploration of invariant features is an active field of research, well-elaborated comparisons of various feature detectors and descriptors under a set of common transformations have been published by Schmid et al. (Schmid et al., 2000) and Mikolajczyk and Schmid (Mikolajczyk & Schmid, 2004; Mikolajczyk & Schmid, 2005).
Due to the different characteristics of local and global features, it is beneficial for some applications to combine both approaches. Lisin et al. (Lisin et al., 2005) show two methods where combining local and global features improves the accuracy of a classification task. A further hybrid-like approach is found in the detection of regional features, in which regions are defined as arbitrary subsets of the image. The extraction of Maximally Stable Extremal Regions (MSERs) by Matas et al. (Matas et al., 2004) highlights the advantage of a region-based approach: it produces both sparse and robust features that are particularly covariant to viewpoint change and affine transformations. Mikolajczyk et al. compare and evaluate a set of recent affine region detectors in (Mikolajczyk et al., 2005).
Regional features combine the merits of focus point localisation from local features with the image-describing methods of global features. Symmetry supports the idea of regional features, especially in the context of range-based symmetry description. We reference and compare the regional symmetry features to known state-of-the-art affine region detectors. Therefore, we refer to Harris-affine regions, Hessian-affine regions, intensity-based regions (IBR), entropy-based regions and Maximally Stable Extremal Regions (MSER), which are summarized in (Mikolajczyk & Schmid, 2005). Following those recent state-of-the-art detectors, the symmetry descriptions at hand shall be integrated into a robust and stable regional feature detector in this section.
A time-line overview of selected work on local, regional and global features, as well as on feature evaluation, is presented in Table 2:
Table 2. Time-line of selected work on local, regional and global features and on feature evaluation; each entry is keyed as a global, local or regional feature approach or as an evaluation. The listed work comprises, among others, rotational invariance, KLT feature tracking, an evaluation of local detectors (Harris, Improved Harris, Heitger, Horaud, Cottier, Förstner), the Fourier-Mellin transform, the SIFT (DoG) detector (Lowe, 2004), edge-based regions (EBR), scale invariance, intensity-based regions (IBR), the Harris-affine detector (Harris-affine, Harris-Laplace, Harris-affine regions) (Mikolajczyk & Schmid, 2004), an optimized SIFT detector, the MSER detector (Matas et al., 2004), the salient-region detector (Kadir et al., 2004), further evaluations of local detectors and descriptors (e.g. GLOH, SIFT, PCA-SIFT), global feature densities, the combination of global and local features, and an evaluation of regional detectors (Harris-affine, Hessian-affine, EBR, IBR, salient regions, MSER).
yi = ( centre point, orientation, 1st semi-axis, 2nd semi-axis )    (3)
where v and h refer to the orientations of the intersecting vertical and horizontal segments. See Figure 5 for an illustration of the main processing steps. Caused by line segmentation, intersections might miss the ideal symmetry maximum point, thus s'v(yi) and s'h(yi) are computed by finding the maximum vertical sv(x) and horizontal sh(x) in a small neighbourhood of yi. As a representation similar to the quadratic equation of central conics, each feature ellipse can also be formulated as
Fi = { (x, y) ∈ R^2 | AiDi (x - x(yi))^2 + 2 BiDi (x - x(yi))(y - y(yi)) + CiDi (y - y(yi))^2 = 1 },    (4)
where Ai, Bi, Ci and Di are conic coefficients derived from the orientation and semi-axes of the feature ellipse.
Figure 5. From the left: Vertical and horizontal qualitative symmetry axis points, symmetry
segment selection and final regional symmetric features built with range-based symmetry
4.2 Symmetry Feature Description
For a part of our experiments, we use the SIFT descriptor and matching (Lowe, 2004). As another model, we introduced a distribution-based colour descriptor as a very simple form of the Mean-Shift colour model (Comaniciu et al., 2000). The colour distribution of a feature fi is represented by the kernel-weighted histogram
p'u(fi) = ci Σx Ki(yi, x) δ[v(x) - u]    (5)
with the kernel function Ki(yi, x) describing a weighting over locations x with regard to the kernel centre yi. The Kronecker δ-function compares the bins u and v(x) for equality, where v(x) ∈ {1..m} maps the colour feature of location x to its corresponding histogram bin. Finally, ci is a normalization constant ensuring that all p'u(fi) sum up to 1. We now derive an
elliptical target frame Gi and a Gaussian kernel function Ki for each detected image feature
yi directly from its elliptic feature representation (4). The frame Gi enclosing each x in Fi can
easily be defined by widening the representation to
Gi = { x | AiDi (x(x) - x(yi))^2 + 2 BiDi (x(x) - x(yi))(y(x) - y(yi)) + CiDi (y(x) - y(yi))^2 ≤ 1 }.    (6)
Introducing the 2-dimensional Gaussian kernel function Ki(yi, x), the correlation matrix Mi
that fits the elliptical feature shape in orientation and ratio of the semi axes is given by
Mi = (l^2 / 2) [ Ci  -Bi ; -Bi  Ai ],    (7)
where l can be used for scaling both standard deviations of the Gaussian function. Figure 6
shows two kernel shapes of l = 0.5 and l = 1.0 for an exemplary feature yi.
Figure 6. Gaussian distributed kernel functions for an exemplary elliptical region feature yi
with l = 1.0 (left) and l = 0.5 (right)
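As an illustration of how such a descriptor can be computed, the following Python sketch builds the kernel-weighted colour histogram of one elliptical feature from a single-channel colour image (e.g. hue). The bin mapping v(x), the use of Mi from (7) as covariance of the Gaussian kernel and the input format are assumptions of this sketch, not a definitive implementation.

import numpy as np

def colour_histogram(colour_img, centre, A, B, C, D, l=1.0, m_bins=16):
    # Kernel-weighted colour histogram p'_u of one elliptical feature region.
    # colour_img: 2D array of colour values in [0, 1); centre = (x0, y0);
    # A, B, C, D: conic coefficients of the feature ellipse, cf. (4) and (6);
    # l scales the standard deviations of the Gaussian kernel, cf. (7) and Figure 6.
    x0, y0 = centre
    M = (l ** 2 / 2.0) * np.array([[C, -B], [-B, A]])      # kernel correlation matrix, cf. (7)
    Minv = np.linalg.inv(M)
    hist = np.zeros(m_bins)
    h, w = colour_img.shape
    for y in range(h):
        for x in range(w):
            dx, dy = x - x0, y - y0
            if A * D * dx ** 2 + 2 * B * D * dx * dy + C * D * dy ** 2 > 1:
                continue                                    # outside the target frame G_i, cf. (6)
            k = np.exp(-0.5 * np.array([dx, dy]) @ Minv @ np.array([dx, dy]))
            u = int(colour_img[y, x] * m_bins) % m_bins     # assumed bin mapping v(x)
            hist[u] += k                                    # kernel-weighted vote, cf. (5)
    return hist / max(hist.sum(), 1e-12)                    # normalization constant c_i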
4.3 Symmetry Feature Matching for Mean-Shift Description
After the detection and colour histogram description of symmetry-based regions, a measure of correspondence has to be defined to map the most correlated features. Each feature is mainly characterized by its descriptor vector; we therefore use the Bhattacharyya coefficient
ρ(fi, gj) = Σu=1..m ( p'u(fi) p'u(gj) )^(1/2)    (8)
to compute the similarity of two features fi and gj. The common application of feature matching is given by comparing a feature fi from one scene with a set of features g = {gk}k=1..n derived from a second scene. The best match for fi is thus given by
fi → gj  with  j = argmax k=1..n ρ(fi, gk).    (9)
Feature matching experiments usually describe correspondences between two feature sets f and g. Depending on the final application, different matching strategies may be reasonable, namely non-injective and injective matching. Non-injective matching allows several fi to be assigned to one gj. This mapping is adequate for applications like the classification of multiple features into a number of classes. In applications where features are meant to be non-ambiguous, one feature from one set should be assigned to at most one feature of the other set. These assignments describe the symmetric subset of injective feature matches between f and g. We found that the Mean-Shift descriptor is better suited for classification tasks between features, while the SIFT descriptor is better for distinctive matches.
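A minimal Python sketch of this matching step is given below; it implements the Bhattacharyya coefficient (8) and the best-match rule (9), plus a simple greedy one-to-one variant as one possible way to obtain injective matches. The greedy strategy and the dictionary output format are assumptions of this sketch.

import numpy as np

def bhattacharyya(p, q):
    # Similarity of two normalized histograms, cf. (8).
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))))

def match_features(F, G, injective=False):
    # Match descriptor set F against G. Non-injective: every f_i is assigned its best g_j, cf. (9).
    # Injective: accept globally best pairs first so that each g_j is used at most once
    # (a simple greedy approximation of one-to-one matching).
    S = np.array([[bhattacharyya(f, g) for g in G] for f in F])
    if not injective:
        return {i: int(np.argmax(S[i])) for i in range(len(F))}
    matches, used = {}, set()
    for s, i, j in sorted(((S[i, j], i, j) for i in range(len(F))
                           for j in range(len(G))), reverse=True):
        if i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    return matches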
4.4 Panoramic Evaluation on Single Images
In this section, we follow the experiments on affine region detectors in (Mikolajczyk et al., 2005) by evaluating the proposed symmetry feature detector in relation to other well-elaborated feature detectors. We compare symmetry features of a set of panoramic images to Harris-Affine and Hessian-Affine Regions (Mikolajczyk & Schmid, 2005), Intensity-Based Regions (IBRs) (Tuytelaars & van Gool, 2000) and Maximally Stable Extremal Regions (MSERs) (Matas et al., 2004). While Hessian-Affine and Harris-Affine offer edge-based regions, IBRs, MSERs and the Symmetry approach are oriented towards area-based regions.
We compute these regional feature types for the 1440 x 288 panoramic image in Figure 7 (right). Results are depicted in Figure 8. The histogram in Figure 9 shows a very common distribution of image feature sizes, where the size of an elliptical region is computed as the mean value of its semi-axes. Symmetry, MSER and IBR provide few and sparse features with larger mean feature size, while Harris-Affine and Hessian-Affine detect many small features. For our symmetry detector, the feature count and the run-time do not depend on image size only, but also on the symmetric image structure. The main effort is spent on the quantitative symmetry detection, where a growing search space for each image point has to be established. We can conclude that symmetry offers the sparsest set of features with the largest mean feature size. Additionally, the whole process of feature description and matching depends on the feature count, so symmetry features can be described and matched fastest.
Related approaches emphasize being covariant under affine transformations like change of scale, rotation and perspective view. Covariance means that the elliptical representations of a feature cover the same region in different images. Range-based symmetry intuitively illustrates the concept of scale robustness, as the symmetry range is highly proportional to scale. However, as we have only used horizontal and vertical symmetry, the detection of features is not rotationally invariant. Symmetry axes of horizontal and vertical operators are able to approximate slightly skewed axes of symmetry, but are rotationally invariant only for rotations of nπ/2. We found that this causes symmetry to be comparatively weak in covariance under affine transformations compared to other approaches (Huebner et al., 2006). Nevertheless, no multiple scale analysis or scale selection is needed, since scale emerges from symmetry.
Figure 7. The first and the last image of a panoramic sequence of 37 images (1440 x 288)
Figure 8. Regional features of the image from Figure 7 (right). (a) Harris-Affine. (b) Hessian-Affine. (c) Intensity-Based. (d) Maximally Stable Extremal Regions. (e) Symmetry Features
one of the other screens in the image and truly is incorrect. In this homography H(1,3) there
are few eye-catching outliers for IBR, Harris-Affine, MSER, Symmetry and Hessian-Affine.
Figure 10. For the comparison of all detectors, features of image 1 are matched to those of
images 2-37 of the sequence
The mean deviation of matches from the homographies along the whole sequence is depicted in Figure 10(f). It can be seen that the matching correctness decreases for each method the more image i differs from the reference image 1, but IBR and Symmetry provide the best matching correctness for the analysed image sequence.
Concluding, the experiments show that regional symmetry features are successfully
applicable for feature detection and matching during panoramic warp. No multiple scale
analysis or scale selection is needed, as scale emerges directly from the range-based
symmetry component of the detector. The detector offers comparatively few and significant
features that support fast description and matching. Matched features are highly stable,
distinctive and correct in combination with the SIFT descriptor. Another advantage of the
symmetry approach is the strong relationship of extracted features to objects in the scene.
Walls, doors, monitors and cabinets are frequently included by one feature.
(10)
with the sign taking the value 1 where x(j) exceeds a small positive threshold, -1 where x(j) falls below its negative, and 0 otherwise.
The difference d(x) between d1(x) and d2(x) is maximal for the x in the moving direction. If d(x) is larger than a small threshold t1, the movement direction can be calculated as
φ = argmax x d(x)  with  d(x) = d1(x) - d2(x).    (11)
Figure 11. The four basic movements Move φ, Turn right, Turn left, and No move. Typical feature shifts for those movements are shown in the centre column. In the right column, the corresponding homography sectors are depicted, e.g. a Turn right action results in feature shifts in negative x-direction only
Additionally, the product d1(x)d2(x) is useful, as it can distinguish a sinusoidal homography graph from a constant one. Therefore, we establish two measures c1 and c2 as
c1 = Σx d1(x) d2(x) / w,  and  c2 = -c1 if d1(φ) < 0, c2 = c1 otherwise.    (12)
For c1 < 0, we can assume a movement of the robot in direction φ. For c1 > 0, a turn action or no movement is probable. To distinguish between these two, we use a second threshold t2. c2 finally helps to distinguish between Turn left and Turn right actions.
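A compact Python sketch of this decision rule is given below. The homography graphs d1(x) and d2(x), the threshold values and the mapping of the sign of c2 to left/right turns are assumptions chosen for illustration; only the structure of the rule follows the description above.

import numpy as np

def classify_egomotion(d1, d2, t1=50.0, t2=0.0):
    # Classify robot egomotion from two homography graphs d1(x), d2(x) sampled over
    # image columns x. Returns the action and, for 'Move', the estimated direction index.
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    d = d1 - d2
    phi = int(np.argmax(d))                 # candidate movement direction, cf. (11)
    w = d1.size
    c1 = float(np.sum(d1 * d2)) / w         # cf. (12)
    c2 = -c1 if d1[phi] < 0 else c1
    if c1 < 0 and d[phi] > t1:
        return 'Move', phi                  # forward motion in direction phi
    if c1 > t2:                             # turn action, direction decided by the sign of c2
        return ('Turn right', None) if c2 > 0 else ('Turn left', None)
    return 'No move', None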
Figure 12. Top: Between images 122 and 123, the algorithm computes d(φ) = -215, c1 = c2 = 0 and correctly returns No move. Bottom: Between images 24 and 25, the algorithm computes d(φ) = -428, c1 = -95, c2 = 95 and returns Move in direction 340
The resulting algorithm is applied to a sequence of 200 images that were recorded by the TASeR robot. The odometry sensors allow for a final comparison of the real movements with the estimated ones. For each neighbouring image pair, symmetric regional features are computed and described by the SIFT descriptor for distinctive matching. The symmetry homography is computed as described in section 4.5 and classified by the algorithm. From this sequence, we show two examples for No move and Move left in Figure 12. As can be seen from the matching, the homography graphs and the measures d, c1 and c2, the algorithm also offers robust results for homography graphs that are influenced by false matches. The same robustness is also shown by the results on the whole image sequence, as presented in Figure 13. Comparing the real robot route with the estimation, we find a very high correctness of movement classification. Although correctly classified, there is uncertainty in the estimation of the movement direction samples φ, which ideally should all be 0.
Figure 13. Left: The TASeR robot. Right: The experimental route map derived from the odometry (top) and the movement classification with our simple algorithm (bottom).
6. Conclusion
In this work, bilateral symmetry has been proposed as a concept for the extraction of
features from the visual data and their application to robot navigation tasks. Symmetry in
shape and vision is strongly motivated by biological and psychophysical aspects. It is a
natural feature that can be found in many scenes, whether they show structured indoor or
unstructured outdoor environments. We conclude with a review on the three main topics:
1. Symmetry as a Feature: Symmetry has been investigated in several domains like biology,
psychophysics, architecture and art. Accordingly, symmetry has also been applied as a
valuable attentional feature for the extraction of regions of interest or for object description
by symmetric properties in computer vision.
Motivated by psychophysical experiments on symmetry perception, a fast and compact one-dimensional operator was proposed earlier (Huebner, 2003) to handle horizontal and vertical bilateral symmetry measures only. The operator overcomes the problem of symmetry detection methods in the literature that use large operators which are mostly unsuitable for robotic real-time tasks. However, each of these strength-based operators returns a relative, commonly normalized value of symmetry for each image element. To overcome this limitation, a novel method to generate robust range-based symmetry values was
proposed that produces symmetry range information for each image point (Huebner et al.,
2005). This approach is based on an algorithm computing bilateral quantitative symmetry information using an adapted Dynamic Programming technique. Qualitative and quantitative symmetry measures offer a variety of symmetry representations - especially those of symmetry axes - for higher-level image processing tasks.
It was shown how broadly and versatilely symmetry can be used as a feature. Even beyond the context of image processing and visual data, symmetry can be used as a general feature of structure. A further task in this topic would be the further refinement of the quantitative symmetry approach. The calculation of the Dynamic Programming symmetry search spaces might be optimized to reduce computing time. Another open issue is the use of the search space path structure for quantitative symmetry computation.
2. Regional Symmetry Features: In this part, a new regional symmetry feature matching
approach was proposed. It comprises several modular techniques for detection, description
and matching of image features based on the symmetry types developed in the previous
section. While the qualitative symmetry operator describes symmetry as a relative degree
and the quantitative operator describes symmetry as a range, advantages of both were
combined in a stable regional feature detector. In combination with descriptors, symmetry
features can robustly be matched. The descriptors used were the well-known gradient-based SIFT and a Mean-Shift approach that was adapted to the task of feature description. The
evaluation including state-of-the-art regional feature detectors shows that the symmetry
feature approach is well applicable for robust feature recognition, especially for panoramic
image warp. Description and matching of symmetry features is very robust and faster than
other approaches that derive larger feature sets. Additionally, symmetry features are
strongly related to objects in the scene. Walls, doors, monitors and cabinets are frequently
included by one feature.
Besides the advantages of regional symmetry features, their sensitivity to rotation is mainly due to this work's concentration on horizontal and vertical symmetry measures. Rotation invariance would be an important step to support the task-spanning robustness of the approach; the measures of covariance and overlap might benefit from an additional rotation invariance of the proposed features. Therefore, a further task is to efficiently find a robust orientation measure of symmetry and symmetric features. Along and perpendicular to this orientation, the proposed twofold quantitative symmetry measures could be used.
3. Integration and Application: The third topic addressed the applicability of the developed
symmetry features for robot navigation by panoramic vision. For this purpose, the mobile
service robot TASeR from the Working Group Technical Aspects of Multimodal Systems at
the University of Hamburg was used. The capability of symmetry feature matching with
regard to simple classification of robot egomotion was presented. The integration of the
developed visual symmetry features into high-level object recognition and robot navigation
tasks in dynamic environments is thereby motivated.
It is important to state that the step from one or more features to an object has not been
made in this work. Features are natural low-level points or regions of attention that are
supposed to describe significant visual information and thus might be interesting to be
analysed. Objects are understood as higher-level entities filled with semantic descriptions.
Those are embedded in higher-level applications like object recognition or autonomous
robot navigation of intelligent systems. These tasks are themselves wide areas of research, as is the detection of robust and natural image features that has been treated in our work.
As presented in this work, symmetry can support these tasks.
Returning to the image of a construction set that has been used in the introduction,
symmetry is just one of the construction blocks that might help intelligent systems to
perceive and act in dynamic environments.
7. References
Barlow, H. B. & Reeves, B. C. (1979). The Versatility and Absolute Efficiency of Detecting Mirror Symmetry in Random Dot Displays. Vision Research, 19, pp. 783-793.
Blum, H. & Nagel, R. (1978). Shape Description Using Weighted Symmetric Axis Features. Pattern Recognition, 10(3): 167-180.
Brady, M. & Asada, H. (1984). Smoothed Local Symmetries and Their Implementation. The International Journal of Robotics Research, 3(3): 36-61.
Chetverikov, D. (1999). Fundamental Structural Features in the Visual World. In Proc. of the Int. Workshop on Fundamental Structural Properties in Image and Pattern Analysis, pp. 47-58.
Chetverikov, D. & Jankó, Z. (2003). Skewed Symmetry of Bidirectional Textures. In Proc. of the 27th Workshop of the Austrian Association for Pattern Recognition, pp. 97-102.
Comaniciu, D.; Ramesh, V. & Meer, P. (2000). Real-Time Tracking of Non-Rigid Objects using Mean Shift. In Proc. of Computer Vision and Pattern Recognition, vol. 2, pp. 142-149.
Cross, A. D. J. & Hancock, E. R. (1999). Scale space vector fields for symmetry detection. Image and Vision Computing, 17(5-6): 337-345.
Csathó, Á.; Van der Vloed, G. & Van der Helm, P. A. (2003). Blobs strengthen repetition but weaken symmetry. Vision Research, 43, pp. 993-1007.
Delius, J. D. & Nowak, B. (1982). Visual Symmetry Recognition by Pigeons. Psychological Research, vol. 44, pp. 199-212.
Di Gesù, V. & Valenti, C. (1995). The Discrete Symmetry Transform in Computer Vision. Technical report, DMA Università di Palermo.
Di Gesù, V. & Valenti, C. (1996). A New Symmetry Operator for the Analysis of Sequences of Images.
Dufournaud, Y.; Schmid, C. & Horaud, R. (2000). Matching Images with Different Resolutions. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 612-618.
Enquist, M. & Arak, A. (1994). Symmetry, beauty and evolution. Nature, 372, pp. 169-172.
Ferguson, R. W. (2000). Modeling Orientation Effects in Symmetry Detection: The Role of Visual Structure. Proceedings of the 22nd Conference of the Cognitive Science Society.
Hargittai, I. & Hargittai, M. (1994). Symmetry - A Unifying Concept. Shelter. ISBN 093607017X.
Harris, C. & Stephens, M. (1988). A Combined Corner and Edge Detector. In 4th ALVEY Vision Conference, pp. 147-151.
Huebner, K. (2003). A 1-Dimensional Symmetry Operator for Image Feature Extraction in Robot Applications. The 16th International Conference on Vision Interface, pp. 286-291.
Huebner, K.; Westhoff, D. & Zhang, J. (2005). Optimized Quantitative Bilateral Symmetry Detection. International Journal of Information Acquisition, 2(3): 241-249.
Huebner, K.; Westhoff, D. & Zhang, J. (2006). A Comparison of Regional Feature Detectors in Panoramic Images. In Proceedings of the IEEE Int. Conf. on Information Acquisition.
Huebner, K. & Zhang, J. (2006). Stable Symmetry Feature Detection and Classification in Panoramic Robot Vision Systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3429-3434.
Jepson, A. D. & Richards, W. (1993). What Makes a Good Feature? Proceedings of the 1991 York Conference on Spatial Vision in Humans and Robots, pp. 89-125.
Johansson, B.; Knutsson, H. & Granlund, G. (2000). Detecting Rotational Symmetries using Normalized Convolution. In Proceedings of the 15th International Conference on Pattern Recognition, vol. 3, pp. 500-504.
Kadir, T.; Zisserman, A. & Brady, M. (2004). An Affine Invariant Salient Region Detector. In European Conference on Computer Vision, pp. 228-241.
Kirkpatrick, M. & Rosenthal, G. G. (1994). Symmetry without fear. Nature, 372, pp. 134-135.
Kovesi, P. D. (1997). Symmetry and Asymmetry From Local Phase. In Proceedings of the 10th Australian Joint Conference on Artificial Intelligence, pp. 185-190.
Lisin, D.; Mattar, M. A.; Blaschko, M. B.; Benfield, M. C. & Learned-Miller, E. G. (2005). Combining Local and Global Image Features for Object Class Recognition. Proceedings of the IEEE Workshop on Learning in Computer Vision and Pattern Recognition.
Liu, T.-L.; Geiger, D. & Yuille, A. (1998). Segmenting by Seeking the Symmetry Axis. In Proceedings of the 14th International Conference on Pattern Recognition, pp. 994-998.
Liu, Y. (2000). Computational Symmetry, chapter 21 of Symmetry 2000, pp. 231-245.
Liu, Y.; Collins, R. & Tsin, Y. (2004). A Computational Model for Periodic Pattern Perception Based on Frieze and Wallpaper Groups. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3): 354-371.
Locher, P. J. & Nodine, C. F. (1989). The Perceptual Value of Symmetry. Computers and Mathematics with Applications, 17, pp. 475-484.
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), pp. 91-110.
Loy, G. & Zelinsky, A. (2003). Fast Radial Symmetry for Detecting Points of Interest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8): 959-973.
Matas, J.; Chum, O.; Urban, M. & Pajdla, T. (2004). Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. Image and Vision Computing, 22(10), pp. 761-767.
Mellor, M. & Brady, M. (2005). A New Technique for Local Symmetry Estimation. In Scale-Space 2005, pp. 38-49.
Milanese, R.; Cherbuliez, M. & Pun, T. (1998). Invariant Content-Based Image Retrieval Using the Fourier-Mellin Transform. In S. Singh (ed.), Proceedings of the International Conference on Advances in Pattern Recognition, pp. 73-82. Springer.
Mikolajczyk, K. & Schmid, C. (2001). Indexing Based on Scale Invariant Interest Points. In 8th International Conference on Computer Vision, pp. 525-531.
Mikolajczyk, K. & Schmid, C. (2002). An Affine Invariant Interest Point Detector. In European Conference on Computer Vision, pp. 128-142. Springer.
Mikolajczyk, K. & Schmid, C. (2004). Scale and Affine Invariant Interest Point Detectors. International Journal of Computer Vision, 60(1), pp. 63-86.
Mikolajczyk, K. & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), pp. 1615-1630.
Mikolajczyk, K.; Tuytelaars, T.; Schmid, C.; Zisserman, A.; Matas, J.; Schaffalitzky, F.; Kadir, T. & van Gool, L. (2005). A Comparison of Affine Region Detectors. International Journal of Computer Vision.
Ohta, Y. & Kanade, T. (1985). Stereo by Intra- and Inter-Scanline Search Using Dynamic Programming. IEEE Trans. on Pattern Analysis and Machine Intelligence, 7(2): 139-154.
Palmer, S. E. & Hemenway, K. (1978). Orientation and Symmetry: Effects of Multiple, Rotational, and Near Symmetries. Journal of Experimental Psychology: Human Perception and Performance, 4(4), pp. 691-702.
Pratt, W. K. (2001). Digital Image Processing. John Wiley, New York, 3rd edition.
Reisfeld, D.; Wolfson, H. & Yeshurun, Y. (1995). Context Free Attentional Operators: the Generalized Symmetry Transform. Int. Journal of Computer Vision, 14: 119-130.
Schmid, C.; Mohr, R. & Bauckhage, C. (2000). Evaluation of Interest Point Detectors. International Journal of Computer Vision, 37(2), pp. 151-172.
Scognamillo, R.; Rhodes, G.; Morrone, C. & Burr, D. (2003). A feature based model of symmetry detection. In Proc. of the Royal Society of London, vol. 270 B, pp. 1727-1733.
Shi, J. & Tomasi, C. (1994). Good Features to Track. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600.
Stewart, I. & Golubitsky, M. (1992). Denkt Gott symmetrisch? German translation of the book Fearful Symmetry: Is God a Geometer?, ISBN 3-7643-2783-9.
Sun, C. (1995). Symmetry detection using gradient information. Pattern Recognition Letters, 16: 987-996.
Sun, C. & Si, D. (1999). Fast Reflectional Symmetry Detection Using Orientation Histograms. Real-Time Imaging, 5(1): 63-74.
Taraborelli, D. (2003). What is a Feature? A Fast and Frugal Approach to the Study of Visual Properties. In Proceedings of the Eighth International Colloquium on Cognitive Science.
Trucco, E. & Verri, A. (1998). Introductory Techniques for 3-D Computer Vision. Prentice Hall.
Tuytelaars, T. & van Gool, L. J. (1999). Content-Based Image Retrieval Based on Local Affinely Invariant Regions. In Visual Information and Information Systems, pp. 493-500.
Tuytelaars, T. & van Gool, L. J. (2000). Wide Baseline Stereo Matching based on Local, Affinely Invariant Regions. In British Machine Vision Conference, pp. 412-422.
Tuytelaars, T.; Turina, A. & Van Gool, L. J. (2003). Noncombinatorial Detection of Regular Repetitions under Perspective Skew. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4): 418-432.
Tyler, C. W. (editor) (1994). Spatial Vision, volume 8(4), Special Issue on The Perception of Symmetry. VNU Science Press.
Yavlinsky, A.; Schofield, E. & Rüger, S. (2005). Automated Image Annotation Using Global Features and Robust Nonparametric Density Estimation. In Proceedings of the 4th Int. Conference on Image and Video Retrieval, vol. 3568 of LNCS, pp. 507-517. Springer.
Zabrodsky, H. (1990). Symmetry - A Review. Technical report, Department of Computer Science, The Hebrew University of Jerusalem, May 1990.
Zabrodsky, H.; Peleg, S. & Avnir, D. (1993). Completion of Occluded Shapes using Symmetry. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 678-679.
Zabrodsky, H.; Peleg, S. & Avnir, D. (1995). Symmetry as a Continuous Feature. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12), pp. 1154-1166.
Žganec, M.; Pavešić, N. & Kovačić, S. (1992). Stereo-matching by dynamic programming. Proceedings of the Slovenian-German Workshop Image Processing and Stereo Analysis, 26(1), pp. 37-51.
29
Stereo Vision Based SLAM
Issues and Solutions
D.C. Herath, K.R.S. Kodagoda and G. Dissanayake
ARC Centre of Excellence for Autonomous Systems, University of Technology, Sydney
Australia
1. Introduction
Simultaneous Localization and Mapping (SLAM)1 has been one of the active research areas in the robotics research community for the past few years. When a robot is placed in an unknown environment, a SLAM solution attempts to build a map of the environment while simultaneously localising the robot with respect to this map. Traditionally, SLAM has utilised exteroceptive sensor data in the process. Successful SLAM implementations using laser (Guivant and Nebot, 2002), sonar and radar (Clark and Dissanayake, 1999) can be found in the literature, which prove the possibility of using SLAM for extended periods of time in indoor and outdoor environments with well-bounded results.
Recent extensions to the general SLAM problem have looked into the possibility of using 3-dimensional features and alternative sensors to the traditionally used lasers and radars. Cameras are competitive alternatives owing to the low cost and rich information content they provide. Despite the recent developments in camera sensors and computing, there are still formidable challenges to be resolved before successful vision based SLAM implementations are realised in realistic scenarios. Monocular camera based SLAM is widely researched (Davison et al., 2004; Kwok et al., 2005); however, binocular camera based SLAM is mostly overlooked. Some of the noted stereo implementations can be found in (Davison and Murray, 2002) and, more recently, in (Jung, 2004). The lack of enthusiasm for research in this direction could possibly be attributed to the misconception that the range and bearing information provided by a stereo vision system is directly utilizable, yielding a simplistic and academically less appealing solution to SLAM, or to the apparent success of single camera SLAM implementations.
However, after rigorous analysis and sensor modelling, we found that standard extended Kalman filter (EKF) based SLAM with small baseline stereo vision systems can easily become inconsistent (Herath et al., 2006a).
This chapter attempts to provide readers with an understanding of the SLAM problem and
its solutions in the context of stereo vision. The chapter introduces the Extended Kalman
Filter as applied to the generic SLAM Problem. Then, while identifying the prevailing issues
inherent in solutions to the SLAM problem in stereo vision context, our solutions are
presented with simulated and experimental evaluations. Several components of the stereo vision system, including outlier rejection, sensor modelling, inconsistency analysis and alternate formulations of SLAM, are discussed.
1 This work is supported by the ARC Centre of Excellence program, funded by the Australian Research Council and the New South Wales State Government.
The linear process model can be written as
x(k) = F(k) x(k-1) + B(k) u(k) + G(k) v(k)    (1)
Where x(k) - state at time k, u(k) - control input vector at time k, v(k) - additive process noise,
B(k) - control input transition matrix, G(k) - noise transition matrix and F(k) - state transition
matrix. The linear observation equation can be written as
z(k) = H(k) x(k) + w(k)    (2)
where z(k) - observation made at time k, x(k) - state at time k, H(k) - observation model and
w(k) - additive observation noise. Process and observation noise are assumed to be zero-mean and independent. Thus
E[vi vjT] = δij Qi ,   E[wi wjT] = δij Ri
The estimate of the state at time k given all information up to time k is written as x(k/k), and the estimate of the state at time k given information up to time k-1 is written as x(k/k-1) and is called the prediction. Thus, given the estimate at time step (k-1), the prediction equations for the state and its covariance at time step k can be written as
x(k/k-1) = F(k) x(k-1/k-1) + B(k) u(k)    (3)
P(k/k-1) = F(k) P(k-1/k-1) FT(k) + G(k) Q(k) GT(k)    (4)
Then the unbiased (the conditional expected error between estimate and true state is zero) linear estimate is
x(k/k) = x(k/k-1) + W(k) [ z(k) - H(k) x(k/k-1) ]    (5)
where W(k) is the Kalman gain at time step k. This is calculated as
W(k) = P(k/k-1) HT(k) S-1(k)    (6)
where S(k) is called the innovation variance at time step k and is given by
S(k) = H(k) P(k/k-1) HT(k) + R(k).    (7)
The estimate covariance is updated as
P(k/k) = P(k/k-1) - W(k) S(k) WT(k)    (8)
Essentially, the Kalman filter takes a weighted average of the prediction x(k/k-1), based on the previous estimate x(k-1/k-1), and a new observation z(k) to estimate the state of interest x(k/k). This cycle is repeated at every time step.
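For illustration, a minimal numpy sketch of one such prediction/update cycle is given below, using the matrices defined in (1)-(8); matrix dimensions are assumed to be compatible.

import numpy as np

def kf_cycle(x, P, u, z, F, B, G, H, Q, R):
    # One prediction/update cycle of the linear Kalman filter.
    x_pred = F @ x + B @ u                      # state prediction, cf. (3)
    P_pred = F @ P @ F.T + G @ Q @ G.T          # covariance prediction, cf. (4)
    S = H @ P_pred @ H.T + R                    # innovation variance, cf. (7)
    W = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain, cf. (6)
    x_new = x_pred + W @ (z - H @ x_pred)       # state update, cf. (5)
    P_new = P_pred - W @ S @ W.T                # covariance update, cf. (8)
    return x_new, P_new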
2.2 The Extended Kalman Filter
Although the Kalman filter is the optimal minimum mean squared error (MMSE) estimator for a linear system, one would hardly find such a system in reality. In fact, the systems considered in this chapter are non-linear. A solution is found in the Extended Kalman Filter (EKF), which uses a linearised approximation to the non-linear models. The extended Kalman filter algorithm is very similar to the linear Kalman filter algorithm with the substitutions F(k) → ∇fx(k) and H(k) → ∇hx(k), where f and h are the non-linear process and observation models (functions of both state and time step) and ∇fx(k), ∇hx(k) are their Jacobians with respect to the state, evaluated at the current estimate. Therefore, the main equations of the EKF can be summarized as follows:
1. Prediction equations
x(k/k-1) = f( x(k-1/k-1), u(k) )    (9)
P(k/k-1) = ∇fx(k) P(k-1/k-1) ∇fxT(k) + G(k) Q(k) GT(k)    (10)
2. Update equations
x(k/k) = x(k/k-1) + W(k) [ z(k) - h( x(k/k-1) ) ]    (11)
P(k/k) = P(k/k-1) - W(k) S(k) WT(k)    (12)
W(k) = P(k/k-1) ∇hxT(k) S-1(k)    (13)
with S(k) = ∇hx(k) P(k/k-1) ∇hxT(k) + R(k).
A standard measure of filter consistency is the normalised estimation error squared (NEES),
ε(k) = ( x(k) - x(k/k) )T P-1(k/k) ( x(k) - x(k/k) )    (14)
where x(k) denotes the true state.
Under the hypothesis that the filter is consistent and linear Gaussian, ε(k) is chi-square distributed with nx degrees of freedom, where nx is the dimension of x(k). Thus
E[ε(k)] = nx    (15)
Using multiple Monte Carlo simulations to generate N independent samples, the average NEES can be calculated as
ε_avg(k) = (1/N) Σ i=1..N εi(k)    (16)
Then, under the previous hypothesis, N ε_avg(k) will have a chi-square density with N nx degrees of freedom. The hypothesis is accepted if
ε_avg(k) ∈ [r1, r2]    (17)
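A small Python sketch of this consistency test is shown below; it assumes that the true states are known (as in simulation) and that scipy is available for the chi-square quantiles.

import numpy as np
from scipy.stats import chi2

def average_nees(errors, covariances):
    # Average NEES over N Monte Carlo runs at one time step, cf. (14) and (16).
    # errors: N state-error vectors (true minus estimated state); covariances: matching P(k/k).
    nees = [e @ np.linalg.inv(P) @ e for e, P in zip(errors, covariances)]
    return float(np.mean(nees))

def nees_bounds(N, nx, alpha=0.05):
    # Two-sided (1 - alpha) acceptance interval [r1, r2] for the average NEES, cf. (17).
    return (chi2.ppf(alpha / 2, N * nx) / N,
            chi2.ppf(1 - alpha / 2, N * nx) / N)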
The robot pose is represented as xr = [xr yr ψr]T, where (xr, yr) is the position of the robot centre with respect to a global coordinate frame and ψr is the heading with reference to the x-axis of the same coordinate system. Landmarks are modelled as point features, pi = [xi yi zi]T. The vehicle process model is
xr(k+1) = xr(k) + T V(k) cos(ψr(k))
yr(k+1) = yr(k) + T V(k) sin(ψr(k))
ψr(k+1) = ψr(k) + T ω(k)    (18)
T is the time step, V(k) is the instantaneous velocity and ω(k) is the instantaneous turn-rate. The observation model for a feature p_fi(k) = [x_fi(k), y_fi(k), z_fi(k)]ᵀ can be represented as
Z(k+1) = [z_x(k+1), z_y(k+1), z_z(k+1)]ᵀ = [a, b, z_fi(k+1)]ᵀ    (19)
where
a = (x_fi(k+1) - x_r(k+1))·cos(θ_r(k)) + (y_fi(k+1) - y_r(k+1))·sin(θ_r(k))
b = -(x_fi(k+1) - x_r(k+1))·sin(θ_r(k)) + (y_fi(k+1) - y_r(k+1))·cos(θ_r(k))
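For reference, the process model (18) and the observation model (19) can be written as plain functions. The sketch below assumes the state ordering [x_r, y_r, θ_r] and a single feature [x_f, y_f, z_f]; it only restates the equations above.

import numpy as np

def predict_pose(pose, V, omega, T):
    # Vehicle process model, eq. (18)
    x, y, th = pose
    return np.array([x + T * V * np.cos(th),
                     y + T * V * np.sin(th),
                     th + T * omega])

def observe_feature(pose, feature):
    # Observation model, eq. (19): the feature expressed in the vehicle frame
    x, y, th = pose
    xf, yf, zf = feature
    dx, dy = xf - x, yf - y
    a = dx * np.cos(th) + dy * np.sin(th)
    b = -dx * np.sin(th) + dy * np.cos(th)
    return np.array([a, b, zf])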
Figure 2. (a) Simulated environment: solid line: true path; dashed line: odometry path; *: features. (b) State errors with estimated 2-sigma bounds for a well-tuned filter. (c) An inconsistent filter
3. Stereo Vision
Generally, the more precise the sensors used in SLAM, the more tractable and practical the solution is. The underlying characteristics of the sensor play an important role in determining the scale and practical use of the SLAM algorithm. Sensors such as lasers have proven to be very precise and have been shown to work well in large environments for extended periods of time (Guivant, 2002; 2003; Wang, 2004). However, vision is yet to prove itself in similar environments. In vision, successful implementations to date have used either large-baseline stereo cameras (Davison, 1998; Jung, 2004), camera configurations with more than two cameras (Se et al., 2002) providing refined observations, or single-camera bearing-only methods (Kwok and Dissanayake, 2003; 2004). The principal aim of this section is to assess the performance of a small-baseline binocular stereo camera equipped with wide-angle lenses in the context of robotic SLAM.
3.1 The Sensor
Stereopsis, or stereoscopic vision, is the process of perceiving depth or distances to objects in the environment. As a strand of computer vision research, stereo vision algorithms have advanced noticeably in the past few decades, to the point where semi-commercial products are available as off-the-shelf devices. However, a more augmented approach is needed to realise a sensor useful for SLAM. The following list is an attempt to enumerate the essential components of such a sensor in the context of SLAM:
1. Stereo camera: hardware for acquiring stereo images.
2. Calibration information: the intrinsic and extrinsic information about the camera necessary for correcting image distortion and for depth calculation.
3. Interest point (feature) selection algorithm: the mechanism through which naturally occurring features in the environment are selected for integration into the state vector.
4. Feature tracking algorithm: the image-based mechanism used for data association.
5. Stereo correspondence algorithm: estimates the disparity at corresponding pixels.
6. Filtering: mechanisms used to remove spurious data.
A schematic of the components, along with the interactions amongst them, is outlined in Fig. 3.
3.2 Sensor Error Analysis
As mentioned at the beginning of the chapter, the characteristics of a sensor dictate the limits of its applications. In the following sections, an empirical study of the particular sensor of interest is discussed, based on two representative experiments. It was found that even though quantitative error analyses of stereo based on static cameras are available in the literature, they do not necessarily represent the effects of a moving camera. This study fills a void in the specific characterisation of the noise performance of a small-baseline, large field-of-view camera with respect to SLAM. In this context, several robotic mapping experiments were carried out in order to understand the behaviour of the sensor noise.
From the previous section on camera modelling, the triplet z = [u, v, d]ᵀ represents an observation z made by the sensor, where (u, v) are the image coordinates of a feature and d is the disparity. Assuming the errors in the observations to be additive, z can be written as
z = z_true + η(·, z_true)    (20)
where z_true is the true state of the observation and η is the additive noise component, dependent on the sensor characteristics and on the true state itself, as will be shown empirically later. Modelling and understanding the behaviour of η is the subject of discussion in sections 3.4 and 3.5. In section 3.6 the discussion continues on modelling the error behaviour of the projected form z_c of this observation into the 3D coordinate frame.
Figure 4. Rectified images overlaid with features at (a) 5.4 m (b) 3.9 m (c) 2.4 m (d) 0.9 m
Figure 5. Disparity error: (a) distribution; (b) zero-mean error distribution with depth; (c) zero-mean standard deviation (log scale). The spike in standard deviation is due to a stereo mismatch that was not detected by any of the heuristics applied in the stereo correspondence algorithm
Several observations can be made. Firstly, the data still contain many visible outliers that are difficult to eliminate with the various smoothing operations. Secondly, a rather intuitive observation is the correlation of the variance of the disparity distribution with the expected disparity. As would be expected, the variance is smaller for features seen from afar and increases gradually for nearby features. For faraway features the disparity is small and the discriminatory information contained within the correlation area is higher compared to a closer observation. This is especially true for environments where a lack of texture persists. This gives a higher confidence to the disparity values estimated for features afar as opposed to closer ones. This is a better interpretation of the variance in disparity, and based on this interpretation it is better to assume a varying disparity standard deviation correlated with the estimated disparity value, as opposed to the general practice of assuming a constant disparity standard deviation. The observation standard deviation is shown in Fig. 5 (c).
It is difficult to estimate an exact relationship between the disparity variance and the estimated disparity. Thus an empirically generated curve based on the results shown in Fig. 5 (b) is used. It was also observed that the variance estimated in this way is slightly higher than the one shown in Fig. 5 (a). This stems from the fact that local biases are present in the data shown in Fig. 5 (b). This can be illustrated by scrutinising the local distributions present in the disparity data corresponding to each feature location at a given depth. Fig. 8 shows an example of the local distributions contributing to the overall distribution at a given depth. As can be noticed, there are independent local distributions dispersed around the true expected mean. These are a combination of local biases in the image, stereo mismatches and any misalignments of the stereo hardware and the reference system. For practical purposes correcting these errors is difficult, and an all-encompassing error model is thus adopted.
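One simple way of applying such a varying disparity standard deviation is to interpolate the empirical curve. The sketch below assumes a small table of (disparity, standard deviation) pairs; the numbers are placeholders for illustration, not the values measured in this study.

import numpy as np

# Hypothetical empirical curve: disparity (pixels) vs. measured sigma_d (pixels)
DISPARITY_GRID = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
SIGMA_D_GRID   = np.array([0.2, 0.3, 0.4, 0.6, 0.9, 1.4])

def sigma_d(disparity):
    # Interpolate the empirical curve instead of assuming a constant
    # disparity standard deviation for all observations.
    return np.interp(disparity, DISPARITY_GRID, SIGMA_D_GRID)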
3.3 Uncertainty in u and v
In order to model the errors in u and v for SLAM, a dynamic camera error model needs to be studied, one which includes the behaviour of the tracking algorithm as well as the other dynamics involved with the camera motion. From experiments 1 and 2 it is possible to extract a representative set of data for this purpose. Again, as discussed for the case of the disparity error, u and v also carry components of local bias due to distortion effects and other misalignments. In addition, the effects of the feature tracker also contribute when the augmented sensor representation is considered.
For this analysis only a single image is considered at each depth. These images are assembled from a depth of 1 m to 6 m. Sixteen features covering the entire image plane are initialised in the image corresponding to 1 m depth and are then consecutively tracked through to the image at 6 m depth. In this way, a set of features at fixed locations in space maps to varying (u, v) coordinates while being tracked, which captures the overall behaviour of u and v across the entire image plane.
Fig. 6 shows the results for both parameters, where the cumulative data for each point are offset by the expected values at that point and then combined. Qualitatively these results resemble Gaussian distributions. However, it is possible to observe various artefacts appearing in the tails of the distributions, indicating that a considerable amount of spurious data is present for the reasons discussed earlier. This spuriousness in u, v and d poses considerable challenges to a successful implementation of a SLAM algorithm. Various issues arising from these observations are discussed in the next section.
Figure 6. Error distribution (a) in u with standard deviation = 1.34 (b) in v with standard
deviation = 1.53
Figure 7. An image from a wide-angle lens: (a) raw; (b) rectified
Figure 8. Example of the local disparity distributions (frequency versus disparity in pixels) contributing to the overall distribution at a given depth
The minimum number of features per image is also dictated partially by the environment the application operates in, as well as by the accuracy of the stereo algorithm. As shown earlier, the depth accuracy correlates with the measured depth. Thus it is necessary to observe both features that are close to the camera, for short-term translational accuracy, and ones that are further away, for long-term rotational accuracy. An issue with most feature selectors is that they tend to cluster around small patches of highly textured areas in a scene. This may or may not satisfy the condition stated above. In our experience, the best value for the minimum number of features is therefore selected by repeated experimentation in the intended environment.
4.3 Spurious Features
Spurious features occur not only due to structure (e.g. occlusion) but also due to gross errors in the stereo calculations. For instance, in Fig. 9 (a) the pole marked with the arrow and the horizontal edge of the partition in the foreground are two distinct, disjoint entities. However, on the image plane the apparent intersection of the two edges is a positive feature location. Such occlusions result in physically non-existent features. These features are catastrophic in a SLAM implementation. A possible method for identifying such occlusions by a measure of dissimilarity was discussed in (Shi and Tomasi, 1994).
Figure 9. Spurious observations. (a) A rectified image showing several edge profiles. (b)
Disparity image (c) Close-up view of the depth profile with a mismatch (see discussion for
details)
Figure 10. Robot pose error with 2-sigma error bounds: (a) effects of spurious data; (b) with the RanSaC-like filter applied
Depending on the image composition it is possible to generate occasional mismatches (Fig. 9 (c)) in stereo correspondence. Most stereo algorithms include multiple heuristics (Konolige, 1997) to alleviate this issue. However, it is still advisable to include a statistical validation gate (Cox, 1993) for the occasional mismatch that is not filtered by such heuristics.
A third set of spurious features was observed due to the feature tracking mechanism used. These features tend to drift arbitrarily in the image plane. Such features are not only harder to detect with conventional statistical validation gates, but also tend to contribute to filter inconsistencies. A solution to such spurious features, based on the RanSaC (Fischler and Bolles, 1981) algorithm, was discussed in (Herath et al., 2006b). Fig. 10 shows SLAM results for a real data set with (Fig. 10 (b)) and without (Fig. 10 (a)) the RanSaC-like filter, while keeping the other filter parameters identical. In this instance the consistency has improved; however, inflated observation noise parameters are used in both cases to accommodate the nonlinearities (see 4.5) in the observation model.
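The statistical validation gate mentioned above (Cox, 1993) amounts to a chi-square test on the normalised innovation. The RanSaC-like filter of (Herath et al., 2006b) is more involved; the sketch below therefore shows only the basic gate, with the gate threshold left as an assumed parameter.

import numpy as np

def passes_gate(z, z_pred, S, gate=11.34):
    # Mahalanobis distance of the innovation; reject the observation if it
    # falls outside the gate (11.34 is roughly the 99% chi-square value for
    # a 3-dimensional observation such as [u, v, d]).
    nu = z - z_pred
    d2 = nu @ np.linalg.inv(S) @ nu
    return d2 <= gate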
4.4 Static vs. Dynamic Noise Parameters
Most researchers tend to use static noise parameters in their SLAM implementations. These are the noise parameters obtained by observing static features through a static camera. However, a more realistic set of values can be obtained by estimating these parameters from data obtained with a moving camera, preferably in the same application environment. An experiment of this nature was discussed in (Herath et al., 2006a). This encompasses not only the error variation in the camera, but also the error variations of the feature tracker and other dynamic factors that are difficult to quantify. This invariably tends to increase the stereo noise parameters, and in some cases they are much higher than the theoretical sub-pixel accuracies quoted for stereo algorithms.
Another aspect of the noise parameters was illustrated in section 3.2. For a better estimate of the noise parameters it is possible to utilise the empirical knowledge of the variation of the disparity standard deviation with measured depth. (Jung and Lacroix, 2003) presented another observation, where the variation in the disparity standard deviation is correlated with the curvature of the similarity score curve at its peak. This knowledge can enhance the quality of the estimation process.
4.5 Nonlinearity Issues
Realistic SLAM problems are inherently non-linear. While EKF implementations have been shown to be able to handle this nonlinearity, an emerging debate in recent years suggests that the nonlinearity could lead to filter inconsistency (Bailey et al., 2006; Huang and Dissanayake, 2006; Julier and Uhlmann, 2001).
These studies concentrate on the eventual failure of the filter in large-scale and/or long-term SLAM implementations. On the other hand, the few stereo vision based EKF solutions present in the literature neglect the filter consistency analysis altogether. It is well known that the standard geometric projection equations used in stereo vision are highly nonlinear and suffer from inherent bias (Sibley et al., 2006; 2005). It is imperative, then, that an analysis is carried out to estimate the effects of this nonlinearity in the context of EKF SLAM. For this reason a set of Monte Carlo simulations was conducted and analysed using the NEES criterion presented in section 2.3. The simulated environment presented in section 2.4 (Fig. 2 (a)) was used throughout these Monte Carlo runs. N [=50] runs were carried out for each implementation, with [2.36, 3.72] being the 95% probability concentration region for ε̄(k), since the dimensionality of the robot pose is 3.
In Fig. 11 (a) the average NEES for the example in 2.4 is shown to be well bounded. This
indicates that for the small loop considered in this example a standard EKF yields consistent
results.
For this simulation, the observation noise R(k) has components (σ_x = σ_y = σ_z = 0.05 m), and the process noise Q(k) remains at (σ_v = 0.05 m/s, σ_w = 5 deg/s) for all the simulations.
In the second simulation while adhering to the previous formulation, the observations are
now subjected to the geometric transformations of a standard stereo vision sensor.
x=
Bf
Bu
Bv
;y =
;z =
d
d
d
(21)
where B is the camera baseline and f the focal length. As discussed in the previous section, Gaussian noise can be assumed for (u, v, d), and a transformed noise matrix must therefore be used for R(k) (Herath et al., 2006a). For all the simulations, the noise values (σ_u = 1.34, σ_v = 1.53, σ_d = 0.65) estimated from the experimental analysis were used. The average NEES results for this simulation are presented in Fig. 11 (b). The unacceptably large values of the statistic indicate that a straightforward SLAM implementation does not yield consistent results. An important parameter in this experiment is the small baseline (B) used. At a nominal 9 cm, it corresponds to the commercially available stereo head on which most of our real experiments are based. It is possible to show through simulation that larger baselines give rise to lower nonlinearity effects. However, the small baseline remains a key factor for most stereo heads used in indoor and outdoor scenarios.
To further illustrate this phenomenon, consider the Gaussian random variable [d, u]ᵀ (only two components are used for clarity) representing the disparity and horizontal image coordinate of a given feature at x_c = 10 m and y_c = 1 m. With B = 0.09 m and f = 150 pixels, this translates to a mean disparity d of 1.32 pixels and a mean u of 15 pixels. A Monte Carlo simulation can be carried out using (21) to transform the Gaussian distributed [d, u]ᵀ into [z_x, z_y]ᵀ. Fig. 12 (a) and (b) show the resulting distributions with 0.09 m and 0.5 m baselines respectively. This clearly indicates the non-Gaussian nature of the transformed observations when a small-baseline camera is used (Fig. 12 (a)). The smaller the baseline, the shorter the range at which the nonlinear effects manifest.
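The effect shown in Fig. 12 can be reproduced with a short Monte Carlo sketch: Gaussian samples of (u, d) are pushed through the projection (21) and the asymmetry of the resulting (z_x, z_y) samples is inspected. The geometry and noise values below are the ones quoted in the text; everything else is illustrative.

import numpy as np

B, f = 0.09, 150.0              # baseline (m) and focal length (pixels)
x_c, y_c = 10.0, 1.0            # assumed true feature position (m)
sigma_u, sigma_d = 1.34, 0.65   # image noise values quoted in the text

d_true = B * f / x_c            # roughly 1.3 pixels for B = 0.09 m
u_true = f * y_c / x_c          # 15 pixels

rng = np.random.default_rng(0)
d = d_true + sigma_d * rng.standard_normal(100000)
u = u_true + sigma_u * rng.standard_normal(100000)

z_x = B * f / d                 # projection of eq. (21)
z_y = B * u / d

# The asymmetric spread of the percentiles around the median reveals the
# non-Gaussian shape; rerunning with B = 0.5 m gives a nearly symmetric cloud.
print(np.percentile(z_x, [5, 50, 95]))
print(np.percentile(z_y, [5, 50, 95]))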
Figure 11. Average NEES of the robot pose (a) Standard EKF (b) Standard EKF with stereo
observations
Figure 12. Errors in projective mapping (a) B=0.09m (b) B=0.5m, linearized and
approximated Gaussians are overlapping
A different choice of observation model is now tested that yields improved results. As shown above, the main cause of the inconsistency is the highly nonlinear projective mapping. It is possible to derive a formulation where the principal observation becomes (u, v, d) instead of the widely used (x, y, z), as follows (compare this with (19)):
Z(k+1) = [z_u(k+1), z_v(k+1), z_d(k+1)]ᵀ = (f/x)·[y, z, B]ᵀ    (22)
where
x = (x_fi(k+1) - x_r(k+1))·cos(θ_r(k)) + (y_fi(k+1) - y_r(k+1))·sin(θ_r(k))
y = -(x_fi(k+1) - x_r(k+1))·sin(θ_r(k)) + (y_fi(k+1) - y_r(k+1))·cos(θ_r(k))
z = z_fi(k+1)
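A sketch of the (u, v, d) observation function (22), under the same assumptions as the earlier model sketch (state ordering [x_r, y_r, θ_r], a single feature [x_f, y_f, z_f], and known f and B):

import numpy as np

def observe_uvd(pose, feature, f, B):
    # (u, v, d) observation model of eq. (22)
    xr, yr, th = pose
    xf, yf, zf = feature
    dx, dy = xf - xr, yf - yr
    x_cam = dx * np.cos(th) + dy * np.sin(th)     # range along the optical axis
    y_cam = -dx * np.sin(th) + dy * np.cos(th)
    return (f / x_cam) * np.array([y_cam, zf, B])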
This alleviates the necessity of the linearized transformation of the noise matrix R(k), as the measurements are well represented by Gaussian models. Simulation results with the new observation model for the average NEES are presented in Fig. 13 (a). Although the improvement over the previous model is apparent, the filter still remains optimistic. Finally, the unscented Kalman filter (UKF) (Julier and Uhlmann, 2004) is implemented with the previous observation model. The UKF performs a derivative-free transform of the states, resulting in better estimates, and is known to work well with highly non-linear systems. However, the Monte Carlo simulation results indicate (Fig. 13 (b)) that the improvement in consistency is minimal.
These observations lead us to believe that standard SLAM implementations could yield inconsistent results even for comparatively small loops when small-baseline stereo cameras are used, an observation that has hitherto not been studied. Current solutions to this issue remain either the use of wider-baseline cameras or the implementation of small loops with submap-like ideas (Williams, 2001). Better consistency could also be expected by improving the overall noise performance of the vision system. This includes improving the stereo correspondence and the resolution of the images, as well as improving the stability of the mobile platform.
Figure 13. Average NEES of the robot pose (a) (uvd)-observation model (b) UKF
5. Conclusion
In this chapter we have made an attempt to analyse the issues in stereo vision based SLAM and to propose plausible solutions. Correct sensor modelling is vital in any SLAM implementation. Therefore, we have analysed the stereo vision sensor behaviour experimentally to understand its noise characteristics and statistics. It was verified that the stereo observations in their natural form (i.e. [u, v, d]) can safely be assumed to be Gaussian distributed. Several SLAM implementation strategies using stereo vision were then discussed. Issues related to the limited field of view of the sensor, the number of features, spurious features, noise parameters and nonlinearity in the observation model were discussed. It was shown that the filter inconsistency is mainly due to the inherent nonlinearity present in the small-baseline stereo vision sensor. Since the UKF is more capable of handling nonlinearity than the EKF, a UKF SLAM implementation was also tested for consistency. However, it too led to inconsistencies. This shows that even implementations that circumvent the critical linearization mechanism of standard EKF SLAM, as the UKF does, cannot resolve the nonlinearity issue in stereo vision based SLAM. In order to address the filter inconsistency, a more elegant solution based on smoothing algorithms is currently being researched and shows promising results.
In conclusion, this chapter dwelt on some less studied issues pertaining to stereo vision SLAM and the work being done to solve them.
6. References
Bailey, Tim, Juan Nieto, Jose Guivant, Michael Stevens and Eduardo Nebot. (2006).
Consistency of the EKF-SLAM Algorithm. In International Conference on Intelligent
Robots and Systems (IROS 2006). Beijing, China.
Bar-Shalom, Yaakov, X.-Rong Li and Thiagalingam Kirubarajan. (2001). Estimation with
Applications to Tracking and Navigation. Somerset, New Jersey: Wiley InterScience.
Clark, S. and G. Dissanayake. (1999). Simultaneous localisation and map building using
millimetre wave radar to extract natural features. In IEEE International Conference on
Robotics and Automation: IEEE.
Cox, Ingemar J. (1993). A review of statistical data association techniques for motion
correspondence. International Journal of Computer Vision 10(1):53-66.
581
Davison, A.J. and D.W. Murray. (2002). Simultaneous localization and map-building using
active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7):865
- 880
Davison, Andrew J. (1998). Mobile Robot Navigation Using Active Vision. Thesis: University
of Oxford.
Davison, Andrew J., Yolanda Gonzalez Cid and Nobuyuki Kita. (2004). Real-Time 3D Slam
with Wide-Angle Vision. In IFAC Symposium on Intelligent Autonomous Vehicles.
Lisbon.
Dissanayake, M.W.M.Gamini, Paul Newman, Steven Clark, Hugh F. Durrant-Whyte and M.
Csorba. (2001). A Solution to the Simultaneous Localization and Map Building
(SLAM) Problem. IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION
17(3):229-241.
Fischler, Martin A. and Robert C. Bolles. (1981). Random Sample Consensus: A paradigm for
model fitting with applications to image analysis and automated cartography.
Communications of the ACM 24(6):381 - 395.
Guivant, Jose E. (2002). Efficient Simultaneous Localization and Mapping in Large
Environments. Thesis. Sydney: University of Sydney.
Guivant, Jose and Eduardo Nebot. (2002). Simultaneous Localization and Map Building: Test
case for Outdoor Applications. Sydney: Australian Center for Field Robotics,
Mechanical and Mechatronic Engineering, The University of Sydney.
Guivant, Jose, Juan Nieto, Favio Masson and Eduardo Nebot. (2003). Navigation and
Mapping in Large Unstructured Environments. International Journal of Robotics
Research 23(4/5): 449-472.
Herath, D. C., K. R. S. Kodagoda and Gamini Dissanayake. (2006a). Modeling Errors in
Small Baseline Stereo for SLAM. In The 9th International Conference on Control,
Automation, Robotics and Vision (ICARCV 2006). Singapore.
Herath, D.C., Sarath Kodagoda and G. Dissanayake. (2006b). Simultaneous Localisation and
Mapping: A Stereo Vision Based Approach. In IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2006). Beijing, China: IEEE.
Huang, Shoudong and Gamini Dissanayake. (2006). Convergence Analysis for Extended
Kalman Filter based SLAM. In IEEE International Conference on Robotics and
Automation (ICRA 2006). Orlando, Florida.
Julier, S. J. and J. K. Uhlmann. (2001). A counter example to the theory of simultaneous
localization and map building. In IEEE International Conference on Robotics and
Automation, ICRA 2001.
Julier, S. J. and J. K. Uhlmann. (2004). Unscented filtering and nonlinear estimation.
Proceedings of the IEEE 92(3):401-422.
Jung, I.K. (2004). Simultaneous localization and mapping in 3D environments with
stereovision. Thesis. Toulouse: Institut National Polytechnique.
Jung, Il-Kyun and Simon Lacroix. (2003). High resolution terrain mapping using low
altitude aerial stereo imagery. In Ninth IEEE International Conference on Computer
Vision (ICCV'03).
Konolige, Kurt. (1997). Small Vision Systems: Hardware and Implementation. In Eighth
International Symposium on Robotics Research.
30
Shortest Path Homography-Based Visual
Control for Differential Drive Robots
G. López-Nicolás, C. Sagüés and J.J. Guerrero
Universidad de Zaragoza
Spain
1 Introduction
It is generally accepted that machine vision is one of the most important sensory modalities
for navigation purposes. Visual control, also called visual servoing, is a very extensive and
mature field of research where many important contributions have been presented in the
last decade [Malis et al.,1999, Corke and Hutchinson, 2001, Conticelli and Allotta, 2001,
Tsakiris et al., 1998, Ma et al., 1999]. Two interesting surveys on this topic are [De Souza and
Kak, 2002] and [Hutchinson et al., 1996]. In this work we present a new visual servoing approach for mobile robots with a fixed monocular system on board. The idea of visual servoing is used here in the sense of homing, where the desired robot position is defined by a target image taken at that position. Using the images taken during the navigation, the robot is led to the target.
A traditional approach is to perform the motion by using the epipolar geometry [Basri et al., 1999, Rives, 2000, Lopez-Nicolas et al., 2006]. These approaches have the drawback that the estimation of the epipolar geometry becomes ill-conditioned with short baselines or planar scenes, which are usual in human environments. A natural way to overcome this problem is to use the homography model. In [Malis and Chaumette, 2000] a method is proposed based on the estimation of the homography matrix related to a virtual plane attached to an object. This method provides a more stable estimation when the epipolar geometry degenerates. In [Benhimane et al., 2005] a visual tracking system for car platooning is presented, based on estimating the homography of a selected reference template attached to the leading vehicle. A significant issue with monocular camera-based vision systems is the lack of depth information. In [Fang et al., 2005] the asymptotic regulation of the position and orientation of a mobile robot is proposed by exploiting homography-based visual servo control strategies, where the unknown time-varying depth information is related to a constant depth-related parameter.
These homography-based methods usually require the homography decomposition, which
is not a trivial issue. Two examples of approaches which do not use the decomposition of
the homography are [Sagues and Guerrero, 2005] which is based on a 2D homography and
[Benhimane and Malis, 2006] which presents an uncalibrated approach for manipulators.
We present a novel homography-based approach by performing the control directly on the
elements of the homography matrix. This approach, denoted as Shortest Path Control, is based on the design of a specific robot trajectory which consists in following a straight line towards the target. This motion planning allows a control law to be defined that decouples rotation and translation by using the homography elements. This approach needs neither the homography decomposition nor depth estimation. In this work we have developed three similar methods based on the particular selection of the homography elements. Each method is suitable for different situations.
The chapter is organized as follows. Section 2 presents the homography model, developing its elements as a function of the system parameters to be used in the design of the controllers. Section 3 presents the Shortest Path Control with three different methods based on the elements of the homography. Sections 4 and 5 present the stability analysis of the controllers and the experimental results, respectively. Section 6 gives the conclusions.
2. Homography Model
The camera is modelled by the internal calibration matrix
K = [ αx   s    x0
       0   αy   y0
       0    0    1 ]    (1)
where αx and αy are the focal lengths of the camera in pixel units in the x and y directions respectively, s is the skew parameter and (x0, y0) are the coordinates of the principal point. We have αx = f·mx and αy = f·my, where f is the focal length and mx, my are the number of pixels per distance unit. In practice, we assume that the principal point is at the centre of the image (x0 = 0, y0 = 0) and that there is no skew (s = 0).
A 3D point in the world can be represented in the projective plane with homogeneous coordinates as p = (x, y, 1)ᵀ. A projective transformation H exists between matched points belonging to a plane, in such a way that p2 = H·p1. The homography between the current and target images can be computed from the matched points, and a robust method like RANSAC should be used to account for the existence of outliers [Hartley and Zisserman, 2004]. Taking advantage of the planar motion constraint, the homography can be computed from three correspondences instead of four, reducing the processing time.
Let us suppose two images obtained with the same camera, whose projection matrices in a common reference system are P1 = K[I | 0] and P2 = K[R | -Rc], R being the camera rotation and c the translation between the optical centres of the two cameras. A homography H can be related to the camera motion (Figure 1a) in such a way that
H = K·R·(I - c·nᵀ/d)·K⁻¹    (2)
where n = (nx, ny, nz)ᵀ is the normal to the plane that generates the homography and d is the distance between the plane and the origin of the global reference.
We consider a mobile robot in planar motion (Figure 1b). In this case the robot position is defined by the state vector (x, z, φ) and the planar motion constraint gives:
(3)
Taking this into account, the homography corresponding to a planar motion scheme can be written as
(4)
The second row of the matrix will be ignored in the design of the control law, as it does not give useful information. Developing expression (2) we obtain the homography elements as a function of the parameters involved:
(5)
The analysis of these homography elements will lead to the control law design. After computing the homography from the image point matches, it has to be normalized. We normalize by dividing H by h22, given that h22 is never zero due to the planar motion constraint.
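As a sketch of this step, the homography can be estimated robustly from the matched points with RANSAC and then normalised by h22. The OpenCV call below is one possible way of doing it and is not necessarily the procedure used by the authors.

import cv2
import numpy as np

def planar_homography(pts_current, pts_target, ransac_thresh=2.0):
    # pts_* are N x 2 float arrays of matched image points (current vs. target view)
    H, inliers = cv2.findHomography(pts_current, pts_target,
                                    cv2.RANSAC, ransac_thresh)
    # Normalise so that h22 = 1 (h22 is never zero under planar motion)
    return H / H[1, 1]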
Figure 1. (a) Homography from a plane between two views. (b) Coordinate system
(7)
In the Shortest Path Control approach, we propose decoupling the motion into rotation and translation by following a specific trajectory. Then, we design a navigation scheme in such a way that the robot can correct rotation and translation in a decoupled way. The resulting path of this motion is shown in Figure 2.
Next, we deduce the particular expressions of the homography that hold during each phase of the navigation. When the motion starts, the initial homography is the general case (5). It can be seen in Figure 2 that during the second step the robot moves in a straight line with a constant angle with respect to the global reference (φ = φt). From our reference system we can obtain the geometrical expression x = z·tan(φt). Using this expression in (5), we obtain the particular form of the homography that holds during the straight motion of the second step.
(8)
At the end of the second step the robot has an orientation error and no translation error (x = 0, z = 0, φ = φt). Taking this into account, the homography matrix that results at the end of the second step (i.e. in the target position up to the orientation error) is
(9)
This previous expression also implies that det(H) = 1. Finally, at the end of the navigation, when the robot reaches the target pose with the desired orientation, the homography will be the identity matrix,
(10)
The particular expressions of the homography just deduced are related graphically with their corresponding positions in Figure 3. It can be seen that the goal of each step is to move the robot using as reference the next desired expression of the homography.
Figure 3. The number below each figure denotes the equation of the homography that holds in that position. In each step, the numbers give the homography equations at the start and at the end of the step
Now we briefly introduce the expressions used to define the controllers of the three different methods of the Shortest Path Control. These are detailed in the following subsections. From the previous particular expressions of the homography, we can define the conditions that will be used in each step of the navigation to drive the robot. In the first step we want to reach the orientation φ = φt, where the robot points to the target. The forward velocity is set to zero (v = 0) and, from (8), we could use h11, h12 or h13 to set the angular velocity of the robot in a proportional control:
(11)
(12)
(13)
In this step we have rejected the elements h31, h32 and h33 because they require knowledge about the plane and the robot position, which are unknown. Any one of the expressions (11), (12) or (13) can be used to correct the rotation in the first step. The selection of the expressions for each of the three methods, depending on the calibration hypothesis, is explained below. In Method I the camera calibration is supposed to be known, while in Methods II and III no specific calibration is required.
Once the orientation φt is gained, the second step aims to make the translation to the target equal to zero (x = z = 0), keeping the orientation constant during the motion (φ = φt). In this case we could use the parameters h31, h32 or h33 from (9) to set the robot velocity as
(14)
(15)
(16)
In this second step we have rejected the elements h11, h12 and h13 for the correction of v because the value of these elements is constant during this step. Any of the expressions (14), (15) or (16) can be used to compute v during this step. Odometry drift and image noise appear in real situations, so the orientation is also corrected to avoid possible errors. Thus, in the three methods the rotation during the second step is corrected with the same control as in the first step.
In the last step the robot has zero translation error and only needs to perform a rotation in order to reach the target orientation,
(17)
(18)
Then, the velocity is set to zero in this step (v = 0) and the rotation can be corrected with the expressions of (17) or (18). We have selected ω = kω·h13 for the three methods because of the robustness to noise of h13 with respect to the rest of the homography elements. The experimental results presented support this decision.
The control loop of the scheme presented is shown in the diagram of Figure 4. An image at the current position is taken in each loop of the control. The homography that links it with the target image is computed from the feature matching. Using the homography, the control performs the three steps. When the homography-based control loop finishes, the robot is in the target position, the current and the target images are the same, and the homography is the identity matrix. A skeleton of this loop is sketched below. Next, the three methods of the Shortest Path Control for homography-based visual servoing of mobile robots are presented in detail.
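The following skeleton illustrates that loop. The specific proportional expressions (11)-(18) differ between Methods I-III, so the per-step error terms are passed in as caller-supplied functions; the gains, tolerance and stopping tests are illustrative placeholders rather than the actual control laws of the chapter.

def shortest_path_control(get_homography, send_velocities,
                          step1_error, step2_error,
                          k_w=0.5, k_v=0.5, tol=1e-3):
    """Three-step loop of Figure 4. step1_error/step2_error return the
    homography-element error used in (11)-(13) and (14)-(16); they are
    supplied by the caller because they differ between Methods I-III."""
    step = 1
    while True:
        H = get_homography()                  # current -> target, normalised so h22 = 1
        if step == 1:                         # rotate until the robot points to the target
            v, w = 0.0, k_w * step1_error(H)
            if abs(step1_error(H)) < tol:
                step = 2
        elif step == 2:                       # straight-line translation towards the target
            v = k_v * step2_error(H)
            w = k_w * step1_error(H)          # reject odometry drift and noise
            if abs(step2_error(H)) < tol:
                step = 3
        else:                                 # final rotation: w proportional to h13
            v, w = 0.0, k_w * H[0, 2]
            if abs(H[0, 2]) < tol:
                send_velocities(0.0, 0.0)
                return
        send_velocities(v, w)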
(19)
Using this expression we do not need to know the value of φt to correct the orientation in the first step, and this is corrected until (19) is satisfied. In step two, the orientation is corrected with the same expression to take into account odometry drift and noise. The velocity v in the second step is corrected using (16), which is combined with h11 from (9) to remove the unknown parameter φt from the expression of the control. The third step is based on (17). Then, we define Method I as
(20)
(22)
When the robot is close to the target position and the translation is nearly zero, all the points in the scene can be related by the homography. In this situation the parallax is not useful to correct the orientation. Before this happens, we change the orientation control at the end of step 2 to the expression (11). This expression needs the value of φt, which can be computed previously with the same equation while the rotation is corrected with the parallax procedure. Here, we use neither expression (15), because vertical planes can easily be found in human environments, nor expression (19), because it needs specific calibration. We can easily detect when the parallax is no longer useful by measuring the parallax of the points not belonging to the plane of the homography. If the result is under a threshold, the parallax procedure is not used any more. In the simulations presented with this approach the threshold is set to 5 pixels.
In the three methods presented, the homography is not decomposed, and neither the robot coordinates nor the normal of the plane are computed. This approach requires the selection of the signs of some of the control gains, depending on where the initial robot position is and what the orientation of the detected plane is. This can easily be done by taking advantage of the parallax relative to the plane, computing it once at the start. Thus, the sign of the gains is easily determined.
4. Stability Analysis
We define the common Lyapunov function, expressing the robot position in polar coordinates (r(t), α(t), φ(t)) with the reference origin in the target and angles positive from the z-axis anticlockwise, as
(23)
This is a positive definite function, where rGi, αGi and φGi denote the desired value of the parameter at the subgoal position for each step (i = 1, 2, 3). Due to the designed path, the value of α is constant during the navigation. Although in the case of noisy data the value of α could vary, it does not affect the control, because the path is defined towards the target independently of the value of α; thus V̇α = 0. After differentiating we obtain:
(24)
We analyze the derivative of the Lyapunov candidate function in each step to show that it is strictly negative. This analysis is valid whether the goal is behind or in front of the initial position.
Step 1. Here the robot performs a rotation with v = 0, so only the orientation term of V̇ needs to be considered. The desired orientation is φG1 = φt. V̇ < 0 is guaranteed if (φ - φG1) > 0 and then φ̇ < 0, or else if (φ - φG1) < 0 and then φ̇ > 0. In Methods I and II, the sign of φ̇ is guaranteed to be correct, given that the sign of kω is selected as previously explained. In Method III, ω = kω·ecx and, when (φ - φG1) > 0, then ecx > 0 and φ̇ < 0, or ecx < 0 and φ̇ > 0 when (φ - φG1) < 0. Therefore V̇ < 0.
Step 2. In this step the robot moves towards the target in a straight-line path, and the range term of V̇ has to be considered. The sign of (r - rG2) is always positive. Then, with cos(α) < 0 we must have v > 0, and with cos(α) > 0 we must have v < 0. In Method II, the sign of v is guaranteed to be correct, given that the sign of kv is properly selected. In Methods I and III, the velocity given by the control together with (8) is v = kv·z·nz / (d·cos φt), which gives the expected signs. Therefore ṙ < 0. For the orientation term we have the same reasoning as in step 1.
Step 3. Similarly to the reasoning of step 1, in this case the sign of ω can easily be checked taking into account that φG3 = 0 and h13 = αx·sin φt. Therefore V̇ < 0.
So, we have shown that V̇ < 0 for the controllers of the three methods. We also have asymptotic stability, given that V̇ is negative definite in all the steps.
5. Experimental Results
Several experiments have been carried out with the controllers of the three methods presented, using virtual data. The simulated data is obtained by generating a virtual planar scene consisting of a distribution of random 3D points. The scene is projected onto the image plane using a virtual camera; the size of the images is 640×480 pixels. In each loop of the control, the homography between the current and target images is computed from the matched points and the control law sends the velocities (v, ω) to the robot. In the experiments, we assume that the camera is centred on the robot pointing forwards. Figure 5 shows the resulting paths from different initial positions. The target is placed at (x(m), z(m), φ(deg)) = (0, 0, 0). The different initial positions behind the target are: (3, 10, 30), (0, 8, 40) and (6, 6, 0). The results also show that the method works properly when the target is behind the initial robot position, moving the robot backwards in that case. The different initial positions used in this case are: (6, 4, 20), (6, 8, 10) and (5, 2, 50).
Figure 5. Simulations with target position at (0,0,0) and different initial positions
The performance of the three methods is exactly the same when using perfect data, and quite similar when there is image noise. In Figure 6 two simulations are compared: one without noise, and the other adding white noise to the image points with a standard deviation of σ = 1 pixel, using Method III. The evolution over time of the robot position and of the homography elements is drawn.
We have tested the controllers with odometry drift and with different values of image noise. The first row of Figure 7 shows the resulting evolution of the robot position when there is an odometry drift in rotation of 1 deg/m. As can be seen, the controllers cope properly with the drift error. Simulations with each method have been carried out using different levels of image noise. The results are shown in the second row of Figure 7, and it can be seen that the methods converge properly in spite of the image noise.
The control law of Method I needs the calibration parameter αx of the camera, whereas Methods II and III do not use it. In Figure 8 we show the sensitivity of the control to calibration errors. The value of the focal length used in the controllers is fixed to f = 6 mm while its real value is modified, to see the final position error obtained for each method (first row of Figure 8). Besides, we have assumed that the principal point is in the centre of the image. In the second row of Figure 8, the value of x0 used in the controllers is supposed to be zero while its real value is changed. The performance of Method I is sensitive to calibration errors, as expected; this is because this control law is related directly with αx and depends highly on its accuracy. The simulations show that Method II works properly in spite of calibration errors. Finally, the results using Method III show that a rough calibration is enough for convergence, because it is robust to focal length inaccuracy and is only affected by calibration errors in the principal point.
Figure 6. Simulation without noise (thick line) and with image white noise of σ = 1 pixel (thin line); panels (d)-(i) show the evolution of the homography elements h11, h12, h13, h31, h32 and h33. The initial position is (x, z, φ) = (3, 10, 30) and the target (0,0,0)
The performance of the methods can be spoiled in some cases by the particular plane that generates the homography. Simulations using different planes are presented in Table 1. The planes are defined by the normal vector n = (nx, ny, nz)ᵀ, and a list of unitary normal vectors with ||n|| = 1 is selected to carry out the simulations. The final error obtained with each method is shown. The initial position is (3, 10, 30) and the target is (0, 0, 0). The results show that Methods I and III need nz ≠ 0 to work properly. On the other hand, Method II needs ny ≠ 0. This is because the methods are directly related with these parameters of n. Vertical planes are common in human environments; besides, in our monocular system, planes in front of the robot with dominant nz will be detected more easily. Methods I and III work properly in this case. If we constrain the detected homography plane to be the floor or the ceiling (any plane with ny ≠ 0 is enough), Method II will also work properly.
Figure 7. (First row) Simulations with an odometry drift of 1 deg/m. The evolution of one simulation in x, z and φ is shown for each method. (Second row) Final error of different simulations varying the image noise
Figure 8. Final error for each method in x, z and φ varying the focal length (first row) and varying the principal point coordinates (second row)
        n                       Method I                   Method II                  Method III
 nx     ny     nz         x       z       φ          x       z       φ          x       z       φ
 0      0     -1.00       0       0      -0.09      -3.00  -10.00   -3.12       0       0      -0.09
-0.20   0.57  -0.80       0.03   -0.00   -0.09      -0.00   -0.00   -0.09       0       0      -0.09
-0.40   0.69  -0.60      -0.00   -0.00   -0.09      -0.00   -0.00   -0.09      -0.00   -0.00   -0.09
-0.60   0.69  -0.40      -0.00   -0.01   -0.09      -0.00   -0.00   -0.09      -0.00   -0.01   -0.09
-0.80   0.57  -0.20      -0.10   -0.34   -0.03      -0.00   -0.00   -0.09      -0.10   -0.34   -0.03
-1.00   0      0         -3.00  -10.00    0         -3.00  -10.00    0         -3.00  -10.00    0
 1.00   0      0         -3.00  -10.00    0         -3.00  -10.00    0         -3.00  -10.00    0
 0.98  -0.20   0         -3.00  -10.00    0         -0.15   -0.62    0         -3.00  -10.00    0
 0.92  -0.40   0         -3.00  -10.00    0         -0.01   -0.04   -0.09      -3.00  -10.00    0
 0.80  -0.60   0         -3.00  -10.00    0         -0.00   -0.00   -0.09      -3.00  -10.00    0
 0.60  -0.80   0         -3.00  -10.00    0          0      -0.00   -0.09      -3.00  -10.00    0
 0     -1.00   0         -3.00  -10.00    0          0       0      -0.09      -3.00  -10.00    0
 0     -1.00   0         -3.00  -10.00    0          0       0      -0.09      -3.00  -10.00    0
 0.57  -0.80  -0.20      -0.10   -0.34   -0.03       0      -0.00   -0.09      -0.10   -0.34   -0.03
 0.69  -0.60  -0.40      -0.00   -0.01   -0.09      -0.00   -0.00   -0.09      -0.00   -0.01   -0.09
 0.69  -0.40  -0.60      -0.00   -0.00   -0.09      -0.01   -0.04   -0.10      -0.00   -0.00   -0.09
 0.57  -0.20  -0.80       0       0      -0.09      -0.15   -0.62   -0.15       0       0      -0.09
 0      0     -1.00       0       0      -0.09      -3.00  -10.00   -3.12       0       0      -0.09
Table 1. Final error for each method in x(m), z(m) and φ(deg), varying the normal of the plane that generates the homography, n = (nx, ny, nz)ᵀ
6. Conclusions
We have presented a new homography-based approach for the visual control of mobile robots. The control design is directly based on the homography elements and deals with the motion constraints of the differential drive vehicle. In our approach, called Shortest Path Control, the motion is designed to follow a straight-line path. Taking advantage of this specific trajectory, we have proposed a control law decoupling rotation and translation. Three different methods have been designed by choosing different homography elements. Their performance depends on the conditions of the plane or the calibration. The methods use neither the homography decomposition nor any measure of the 3D scene. Simulations show the performance of the methods with odometry drift, image noise and calibration errors. The influence of the plane that generates the homography is also studied.
7. References
Basri, R., Rivlin, E., and Shimshoni, I. (1999). Visual homing: Surfing on the epipoles. International Journal of Computer Vision, 33(2):117-137. [Basri et al., 1999]
Benhimane, S. and Malis, E. (2006). Homography-based 2D visual servoing. IEEE International Conference on Robotics and Automation, pages 2397-2402. [Benhimane and Malis, 2006]
Benhimane, S., Malis, E., Rives, P., and Azinheira, J. R. (2005). Vision-based control for car platooning using homography decomposition. In IEEE International Conference on Robotics and Automation, Barcelona, Spain, pages 2173-2178. [Benhimane et al., 2005]
Blanc, G., Mezouar, Y., and Martinet, P. (2005). Indoor navigation of a wheeled mobile robot along visual routes. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 05, pages 3365-3370. [Blanc et al., 2005]
Conticelli, F. and Allotta, B. (2001). Nonlinear controllability and stability analysis of adaptive image-based systems. IEEE Transactions on Robotics and Automation, 17(2):208-214. [Conticelli and Allotta, 2001]
Corke, P. I. and Hutchinson, S. A. (2001). A new partitioned approach to image-based visual servo control. IEEE Transactions on Robotics and Automation, 17(4):507-515. [Corke and Hutchinson, 2001]
De Souza, G. N. and Kak, A. C. (2002). Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):237-267. [De Souza and Kak, 2002]
Fang, Y., Dixon, W. E., Dawson, D. M., and Chawda, P. (2005). Homography-based visual servo regulation of mobile robots. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(5):1041-1050. [Fang et al., 2005]
Hartley, R. I. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518. [Hartley and Zisserman, 2004]
Hutchinson, S., Hager, G., and Corke, P. (1996). A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651-670. [Hutchinson et al., 1996]
Liang, B. and Pears, N. (2002). Visual navigation using planar homographies. In IEEE Conference on Robotics and Automation, pages 205-210. [Liang and Pears, 2002]
Lopez-Nicolas, G., Sagues, C., Guerrero, J., Kragic, D., and Jensfelt, P. (2006). Nonholonomic epipolar visual servoing. IEEE International Conference on Robotics and Automation, pages 2378-2384. [Lopez-Nicolas et al., 2006]
Ma, Y., Kosecka, J., and Sastry, S. (1999). Vision guided navigation for a nonholonomic mobile robot. IEEE Transactions on Robotics and Automation, 15(3):521-537. [Ma et al., 1999]
Malis, E. and Chaumette, F. (2000). 2-1/2-D visual servoing with respect to unknown objects through a new estimation scheme of camera displacement. International Journal of Computer Vision, 37(1):79-97. [Malis and Chaumette, 2000]
Malis, E., Chaumette, F., and Boudet, S. (1999). 2-1/2-D visual servoing. IEEE Transactions on Robotics and Automation, 15(2):234-246. [Malis et al., 1999]
Rives, P. (2000). Visual servoing based on epipolar geometry. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 1, pages 602-607. [Rives, 2000]
Sagues, C. and Guerrero, J. (2005). Visual correction for mobile robot homing. Robotics and Autonomous Systems, 50(1):41-49. [Sagues and Guerrero, 2005]
Tsakiris, D., Rives, P., and Samson, C. (1998). Extending visual servoing techniques to nonholonomic mobile robots. In Hager, G., Kriegman, D., and Morse, S., editors, The Confluence of Vision and Control, Lecture Notes in Control and Information Sciences (LNCIS). Springer-Verlag. [Tsakiris et al., 1998]
31
Correlation Error Reduction of Images in Stereo
Vision with Fuzzy Method and its Application on
Cartesian Robot
Mehdi Ghayoumi and Mohammad Shayganfar
Islamic Azad University Shahr-e-Rey Branch
Iran
1. Introduction
Stereo vision is one of the most active research topics in machine vision. Finding corresponding points in different images of the same scene is a difficult step of depth extraction in this field. Correlation is one of the most common approaches applied in this procedure. There are also methods that have been presented to reduce some of the errors associated with this approach. Here, a fuzzy model is demonstrated. The experimental results are presented for a 3P laboratory robot, and the improvements are illustrated by comparison with a neural network method through simulation outcomes.
Vision methods were first used for estimating robot errors more than a decade ago. Since then, different companies and research centers have used vision for robot positioning, calibration, error estimation and error compensation with genetic algorithms, neural networks and fuzzy control algorithms. In general, recognition of 3D objects requires two or more appropriately defined 2D images. With this approach many methods have been proposed, such as structure from motion (Seitz et al., 1995) (Taylor & Kriegman, 1995) and stereo lens correspondence and shape (Grosso et al., 1996) (Haralick & Shapiro, 1992). Achour and Benkhelif present a new approach for 3D scene reconstruction based on projective geometry without camera calibration. The contribution is to reduce the number of reference points to four by exploiting some geometrical shapes contained in the scene (Achour & Benkhelif, 2001). In online applications, these methods have some problems. There is a difficulty in finding the correspondence between one image and the others. The most important step in stereo vision is to find corresponding points in two or more images. A general correlation approach, including its errors, is discussed in (Lopez & Plat, 2000). Fuzzy logic has also been applied in areas such as process control, decision support systems, optimization and a large class of robotic manipulators and other mechanical systems (Hsu et al., 2001).
Here, a fuzzy approach is applied to reduce the existing errors, with the aim of improving correlation-based stereo vision by reducing errors on a set of points. The experiments are carried out on a Cartesian robot. So far, a neural network approach has been used to get the optimum point in world coordinates for the 3P robot (Korayem et al., 2001). Clearly there is no magic panacea for selecting a neural network with the best generalization, and because of the structure and foundation of neural networks, it has some errors. A fuzzy approach can be used to reduce these.
z = λB / (x2 + x1)    (1)
where x1, x2 are the x coordinates of one point of the real world in each of the two camera images, and λ and B are the focal distance and the distance between the two focal points, respectively. In the 3P robot the two cameras are not in the same direction. This means that the x and z axes must be rotated, and the two rotation matrices can be concatenated into a single matrix:
R = Rφ·Rθ    (2)
R = [ cos θ          -sin θ          0
      sin θ cos φ     cos θ cos φ   -sin φ
      sin θ sin φ     cos θ sin φ    cos φ ]    (3)
where θ and φ are the rotation angles about the z and x axes, respectively. The displacement of the axes can be represented by the following matrix:
C = [ 1  0  0  0
      0  1  0  0
      0  0  1  z0 ]    (4)
where z0 is the offset along the z axis between the two coordinate frames.
x1 = λ(X cos θ + Y sin θ) / (X sin θ sin φ + Y cos θ sin φ - (Z - Z0) cos φ + λ)    (5)
and, for the second camera,
x2 = …    (6)
where X, Y and Z are the coordinates of the point in the real world. It is noted that these equations reduce to Eq. 1 when X0 = Y0 = Z0 = 0, r1 = r2 = r3 = 0 and θ = φ = 0. Fig. 1 shows the method of stereo vision in the 3P robot.
3. Correlation
Correlation is one of the applied methods in stereo vision and it is discussed in this section.
3.1 Correlation method
Although the correlation method can be expressed in vector form, working with a window is more common. In its simplest form, the correlation between two real functions w(x, y) and f(x, y) is given by (Paulino et al., 2001):
c(s, t) = Σ_x Σ_y f(x, y)·w(x - s, y - t),    t = 0, 1, ..., N-1,  s = 0, 1, ..., M-1    (7)
where f(x, y) is a digital image of size M×N and w(x, y) is a similar region of size J×K (J ≤ M and K ≤ N). The correlation function given in Eq. 7 has a drawback: it is sensitive to scale changes in the amplitude of f(x, y) and w(x, y). A method frequently used to overcome this difficulty is to perform matching via the correlation coefficient, defined as
γ(s, t) = Σ_x Σ_y [f(x, y) - f̄(x, y)]·[w(x - s, y - t) - w̄] / {Σ_x Σ_y [f(x, y) - f̄(x, y)]² · Σ_x Σ_y [w(x - s, y - t) - w̄]²}^(1/2)    (8)
t = 0, 1, ..., N-1,  s = 0, 1, ..., M-1
where w̄ is the average intensity of the mask (this value is computed only once), f̄(x, y) is the average value of f(x, y) in the region coincident with w(x, y), and the summations are taken over the coordinates common to both f and w. It is not difficult to show that γ(s, t) is scaled to the range from -1 to 1, independent of scale changes in the amplitude of f(x, y) and w(x, y). If the functions are of the same size, this approach can be more efficient than a direct implementation of correlation in the spatial domain. It is important to note that the dimension of w(x, y) is usually smaller than that of f(x, y) when implementing Eq. 7.
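A direct sketch of equation (8) in Python/NumPy, evaluated at a single offset (s, t); it is enough to show the normalisation that removes the sensitivity to amplitude changes, and is not optimised code.

import numpy as np

def correlation_coefficient(f, w, s, t):
    # gamma(s, t) of eq. (8): w is the J x K template, f the M x N image;
    # (s, t) must keep the window entirely inside the image.
    J, K = w.shape
    region = f[s:s + J, t:t + K].astype(float)   # part of f coincident with w
    fr = region - region.mean()                  # subtract the local mean of f
    wr = w.astype(float) - w.mean()              # subtract the template mean
    denom = np.sqrt((fr ** 2).sum() * (wr ** 2).sum())
    return (fr * wr).sum() / denom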
3.2 Problems of using correlation based stereo vision
object, which suggests the maximum disparity. The position with the highest correlation value determines the pixel that corresponds to the pixel of interest. Larger correlation windows increase the reliability by averaging over a larger area, besides reducing the effects of noise. Generally, the choice of the correlation window size is a trade-off between increasing reliability in areas of constant depth and decreasing errors where the depth changes. The use of a smaller correlation window reduces the problem, because a smaller window does not overlap the depth discontinuity to the same extent (Paulino et al., 2001).
4. Fuzzy System
A fuzzy logic controller utilizes fuzzy logic to convert a linguistic control strategy based on expert knowledge into an automatic control strategy. This section describes the design of the fuzzy system for the vision of the 3P robot. It also discusses the heuristic approach that has been applied to determine the number of necessary fuzzy input and output sets. In the first step, the border points are obtained by exploiting some geometrical relations. Then the fuzzy system is applied to the points of the correlation area. The best point achieved with the heuristic method is shown in Fig. 3.
correlation area. Then, Eq. 9 gives the distance of the correlation from these four points as follows:
||X1 - X2|| = [(x1 - x2)² + (y1 - y2)² + (z1 - z2)²]^(1/2)    (9)
4.2.1 Fuzzification
The computational technique for the inputs of the fuzzy system is demonstrated in Fig. 5. The distances of the centers of the images from the best point are d1 and d2, respectively. Triangular membership functions have been used. The four inputs of the fuzzy system are the distances of the center of each area from the images (Fig. 5).
The fuzzy controller employs four inputs obtained with the Euclidean distance of Eq. 9, taken from each point at each partition of the correlation area to the image center. This fuzzy controller has only one control output.
Figure 5. Distances of the centers from the best point of the images (heuristic method)
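As an illustration only (not from the chapter), the distance inputs of Eq. 9 and a triangular membership function could be sketched in Python as follows; the function names and membership parameters are assumptions.

    import math

    def euclidean_distance(p, q):
        # Eq. 9: Euclidean distance between two 3-D points p and q.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def triangular(x, a, b, c):
        # Triangular membership function with feet at a and c and peak at b.
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    # Illustrative use: the fuzzy inputs as distances of the partition centres
    # of the correlation area from an image centre (both hypothetical variables):
    # inputs = [euclidean_distance(centre, image_centre) for centre in partition_centres]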
4.2.2 Fuzzy Rule Base
The IF part of the rule base includes the ratio of the distances of the central point of each area from the image center. The THEN part of these rules suggests a center for the correlation area:

    IF input r = d1/d2 > 1 THEN output is the center of area (1) or the center of area (2).
    IF input r = d1/d2 = 1 THEN output is the center of the correlation area.
    IF input r = d1/d2 < 1 THEN output is the center of area (3) or the center of area (4).

The above rules are instantiated for each area, so the total number of rules is 12.
4.2.3 Defuzzification
Eq. 10 is applied to defuzzify the fuzzy control rules in the defuzzification step. The defuzzifier used is the center of gravity:

    ȳ = ( Σ_{l=1}^{M} ȳ^l w^l ) / ( Σ_{l=1}^{M} w^l )        (10)

where ȳ^l is the center of the l-th output set, w^l is its firing strength, and M is the number of rules.
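A minimal Python sketch (not the chapter's code) of the rule firing and the center-of-gravity defuzzifier of Eq. 10 follows; the membership parameters and the candidate outputs are assumptions, and the rules act on the ratio r = d1/d2 one coordinate at a time.

    def defuzzify_cog(outputs, weights):
        # Eq. 10: weighted average of candidate outputs y^l with firing strengths w^l.
        total = sum(weights)
        return sum(y * w for y, w in zip(outputs, weights)) / total if total > 0 else 0.0

    def evaluate_rules(d1, d2, y_area12, y_corr, y_area34):
        # Fire the three rules of one partition on the ratio r = d1/d2, using the
        # triangular() sketch above, then defuzzify by center of gravity.
        r = d1 / d2
        w_gt = triangular(r, 1.0, 1.5, 2.0)   # r > 1 -> centers of areas (1)/(2)
        w_eq = triangular(r, 0.5, 1.0, 1.5)   # r = 1 -> center of the correlation area
        w_lt = triangular(r, 0.0, 0.5, 1.0)   # r < 1 -> centers of areas (3)/(4)
        return defuzzify_cog([y_area12, y_corr, y_area34], [w_gt, w_eq, w_lt])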
5. Algorithm
The flow diagram of our approach is presented in this section. All the processes described above are shown as an algorithm in Fig. 7.
Figure 8. (Left to right) images of camera 1 and camera 2, binary images 1 and 2, and the correlation of the images
7. Simulation
First, the implemented software captures two images from the two cameras mounted on the robot. The correlation algorithm then yields a set of points to be used in the next processing steps. Next, the heuristic method takes the correlated points produced by the correlation process and determines four extreme points in order to compute the required distances. The fuzzy method is then used to find the best point. Fig. 10 shows the steps to obtain the correlation points, and the best coordinates are finally available (Fig. 11).
8. Experimental Results
In this study, the accuracy of our approach was compared with neural-network-based approaches on 3P robots (Lopez & Plat, 2000; Wang, 1992; Hahnel et al., 2003), which contain some errors owing to the structure and foundations of neural networks. Table 1 compares the neural network and fuzzy methods for stereo vision on a 3P robot and demonstrates that the fuzzy method is more reliable.
Type of Object        Neural Network    Fuzzy System
Circle                98.3%             100%
Cylinder              98%               100%
Cubic Rectangle       97%               100%
Cubic Square          97.2%             100%

Table 1. Comparison of the neural network and fuzzy methods (percentage of correct results)
9. Conclusions
In this chapter, the application of a fuzzy model to the stereo vision of a 3P robot has been presented. According to the simulation results, the correlation error is reduced: the best result obtained on a 3P robot with neural networks is about 97% correctness, whereas the fuzzy approach achieves up to 100%. It should be noted that all these results were obtained with simulation software, and various kinds of errors could occur in a real environment; some of them are discussed in (Korayem et al., 2005). This fuzzy model can be applied to a large class of robotic manipulators.
10. References
Achour, K., and Benkhelif, M. (2001). A New Approach to 3D Reconstruction without Camera Calibration, Pattern Recognition, Vol. 34, pp. 2467-2476.
Alexander. (1996). Distributed Fuzzy Control of Multivariable Systems, Kluwer Academic Publishers.
Bender, E. A. (1996). Mathematical Methods in Artificial Intelligence, IEEE Computer Society Press.
Gonzalez, R. C., and Woods, R. E. (1998). Digital Image Processing, Tennessee University Press.
Guisser, L., Payrissat, R., and Castan, S. (2000). PGSD: An Accurate 3D Vision System Using a Projected Grid for Surface Descriptions, Image and Vision Computing, Vol. 18, pp. 463-491.
Grosso, E., Metta, G., Oddera, A., and Sandini, G. (1996). Robust Visual Servoing in 3-D Reaching Tasks, IEEE Trans. on Robotics and Automation, Vol. 12, pp. 732-742.
Haralick, R. M., and Shapiro, L. G. (1992). Computer and Robot Vision, Addison-Wesley.
Hahnel, D., Burgard, W., and Thrun, S. (2003). Learning Compact 3D Models of Indoor and Outdoor Environments with a Mobile Robot, Robotics and Autonomous Systems, Vol. 44, pp. 15-27.
Hsu, Y., Chen, G., and Li, H. (2001). A Fuzzy Adaptive Variable Structure Controller with Applications to Robot Manipulators, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 31, pp. 331-340.
Hirschmuller, H., Innocent, P. R., and Garibaldi, J. (2000). Real-Time Correlation-Based Stereo Vision with Reduced Border Errors, International Journal of Computer Vision.
Konolige, K. (1997). Small Vision Systems: Hardware and Implementation, International Symposium on Robotics Research, London, Springer, pp. 203-212.
Korayem, M. H., Khoshhal, K., and Aliakbarpour, H. (2005). Vision Based Robot Simulation and Experiment for Performance Tests of Robot, International Journal of AMT, Vol. 25, pp. 1218-1231.
Korayem, M. H., Shiehbeiki, N., and Khanali, T. (2005). Design, Manufacturing and Experimental Tests of a Prismatic Robot for Assembly Line, Paper accepted for publication in International Journal of AMT.
Lopez, A., and Plat, F. (2000). Dealing with Segmentation Errors in Region-Based Stereo Matching, Pattern Recognition, Vol. 33, pp. 1325-1338.
Matthies, L., Kelly, A., Litwin, T., and Tharp, G. (1995). Obstacle Detection for Unmanned Ground Vehicles: A Progress Report, International Symposium of Robotics Research, Munich, Germany.
Paulino, A., Batista, J., and Araujo, H. (2001). Maintaining the Relative Position and Orientation of Multiple Robots Using Vision, Pattern Recognition Letters, Vol. 22, pp. 1331-1335.
Seitz, S. M., and Dyer, C. R. (1995). Complete Scene Structure from Four Point Correspondences, 5th Int. Conf. on Computer Vision, Cambridge, MA, pp. 330-337.
Taylor, C. J., and Kriegman, D. J. (1995). Structure and Motion from Line Segments in Multiple Images, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 17, pp. 1021-1033.
Volpe, R., Balaram, J., Ohm, T., and Ivlev, R. (1996). The Rocky 7 Mars Rover Prototype, International Conference on Intelligent Robots and Systems, Vol. 3, pp. 1558-1564.
Wang, L. (1997). A Course in Fuzzy Systems and Control, Prentice Hall, pp. 151-156.
Wang, L. X. (1992). Fuzzy Systems are Universal Approximators, Proceedings of the First IEEE Conference on Fuzzy Systems, pp. 1163-1170.
Zhang, J., Knoll, A., and Schwert, V. (1999). Situated Neuro-Fuzzy Control for Vision-Based Robot Localization, Robotics and Autonomous Systems.