MACHINE VISION
This book is an accessible and comprehensive introduction to machine vision. It provides all
the necessary theoretical tools and shows how they are applied in actual image processing
and machine vision systems. A key feature is the inclusion of many programming exercises
that give insights into the development of practical image processing algorithms.
The authors begin with a review of mathematical principles and go on to discuss key
issues in image processing such as the description and characterization of images, edge
detection, feature extraction, segmentation, texture, and shape. They also discuss image
matching, statistical pattern recognition, syntactic pattern recognition, clustering, diffusion,
adaptive contours, parametric transforms, and consistent labeling. Important applications
are described, including automatic target recognition. Two recurrent themes in the book are
consistency (a principal philosophical construct for solving machine vision problems) and
optimization (the mathematical tool used to implement those methods).
Software and data used in the book can be found at www.cambridge.org/9780521830461.
The book is aimed at graduate students in electrical engineering, computer science,
and mathematics. It will also be a useful reference for practitioners.
Wesley E. Snyder received his Ph.D. from the University of Illinois, and is currently
Professor of Electrical and Computer Engineering at North Carolina State University. He has
written over 100 scientific papers and is the author of the book Industrial Robots. He was
a founder of both the IEEE Robotics and Automation Society and the IEEE Neural
Networks Council. He has served as an advisor to the National Science Foundation,
NASA, Sandia Laboratories, and the US Army Research Office.
Hairong Qi received her Ph.D. from North Carolina State University and is currently an
Assistant Professor of Electrical and Computer Engineering at the University of
Tennessee, Knoxville.
MACHINE VISION
Wesley E. Snyder
North Carolina State University, Raleigh
Hairong Qi
University of Tennessee, Knoxville
Contents

To the instructor
Acknowledgements
Introduction
Image representations; the digital image; a variation on sampling (hexagonal pixels); other types of iconic representations
Edge detectors: the Canny edge detector; improvements to edge detection; inferring line segments from edge points; space/frequency representations
Relaxation; restoration; the MAP approach; mean field annealing
Mathematical morphology: binary morphology; gray-scale morphology; the distance transform; computing erosion and dilation efficiently; the morphological sampling theorem; choosing a structuring element; closing gaps in edges and surfaces
Segmentation: texture segmentation; segmentation of images using edges; motion segmentation; color segmentation; segmentation using MAP methods; human segmentation
Shape: linear transformations; transformation methods based on the covariance matrix; simple features; moments; chain codes; Fourier descriptors; the medial axis; deformable templates; quadric surfaces; surface harmonic representations; superquadrics and hyperquadrics; generalized cylinders (GCs); finding the diameter of nonconvex regions; inferring 3D shape from images; motion analysis and tracking
Consistent labeling: consistency; relaxation labeling
Parametric transforms: finding parabolae; finding the peak; the Gauss map; parametric consistency in stereopsis
Graphs: properties of graphs; implementing graph structures; the region adjacency graph; the subgraph isomorphism problem; aspect graphs
Image matching: springs and templates revisited; neural networks for object recognition; image indexing; matching geometric invariants
Design of a classifier: Bayes rule and the maximum likelihood classifier; decision regions and the probability of error; conditional risk; the quadratic classifier; the minimax rule; nearest neighbor methods
Clustering
Terminology; types of grammars
Applications
Author index
Index
To the instructor
Sample syllabus: a table of lectures, listing for each the topics covered, the programming assignments (with suggested durations, in weeks, in parentheses), and the reading assignments.
The assignments are projects which must include a formal report. Since there is usually programming involved, we allow more time to accomplish these assignments; suggested times are given in parentheses in the syllabus. It is also possible, by careful selection of the students and the topics, to use this book in an advanced undergraduate course.
For advanced students, the Topics sections of this book should serve as a collection of pointers to the literature. Be sure to emphasize to your students (as we do in the text) that no textbook can provide the details available in the literature, and any real (that is, for a paying customer) machine vision project will require the development engineer to go to the published journal and conference literature.
As stated above, the two recurrent themes throughout this book are consistency and optimization. The concept of consistency occurs throughout the discipline as a principal philosophical construct for solving machine vision problems. When confronted with a machine vision application, the engineer should seek to find ways to determine sources of information which are consistent. Optimization is the principal mathematical tool for solving machine vision problems, including determining consistency. At the end of each chapter which introduces techniques, we remind the student where consistency fits into the problems of that chapter, as well as where and which optimization methods are used.
Acknowledgements
I'd like to express my sincere thanks to Dr. Wesley Snyder for inviting me to coauthor this book. I have greatly enjoyed this collaboration and have gained valuable experience.
The final delivery of the book was scheduled around Christmas when my parents were visiting me from China. Instead of touring around the city and enjoying the holidays, they simply stayed with me and supported me through the final submission of the book. I owe my deepest gratitude to them. And to Feiyi, my forever technical support and emergency reliever.
HQ
Introduction
We have written this book at two levels, the principal level being introductory. "Introductory" does not mean "easy" or "simple" or "doesn't require math." Rather, the introductory topics are those which need to be mastered before the advanced topics can be understood.
In addition, the book is intended to be useful as a reference. When you have to
study a topic in more detail than is covered here, in order, for example, to implement a
practical system, we have tried to provide adequate citations to the relevant literature
to get you off to a good start.
We have tried to write in a style aimed directly toward the student and in a
conversational tone.
We have also tried to make the text readable and entertaining. Words which are
deluberately missppelled for humorous affects should be ubvious. Some of the humor
runs to exaggeration and to puns; we hope you forgive us.
We did not attempt to cover every topic in the machine vision area. In particular, nearly all papers in the general areas of optical character recognition and face
recognition have been omitted; not to slight these very important and very successful application areas, but rather because the papers tend to be rather specialized; in
addition, we simply cannot cover everything.
There are two themes which run through this book: consistency and optimization. Consistency is a conceptual tool, implemented as a variety of algorithms, which helps machines to recognize images: they fuse information from local measurements to make global conclusions about the image. Optimization is the mathematical mechanism used in virtually every chapter to accomplish the objectives of that chapter, be they pattern classification or image matching.
These two topics, consistency and optimization, are so important and so pervasive,
that we point out to the student, in the conclusion of each chapter, exactly where those
concepts turned up in that chapter. So read the chapter conclusions. Who knows, it
might be on a test.
The target audience for this book is graduate students or advanced undergraduates
in electrical engineering, computer engineering, computer science, math, statistics,
or physics. To do the work in this book, you must have had a graduate-level course
in advanced calculus, and in statistics and/or probability. You need either a formal
course or experience in linear algebra.
Many of the homeworks will be projects of sorts, and will be computer-based.
To complete these assignments, you will need a hardware and software environment
capable of
1.3.1
Image processing
Many people consider the content of this course as part of the discipline of image
processing. However, a better use of the term is to distinguish between image processing and machine vision by the intent. Image processing strives to make images
look better, and the output of an image processing system is an image. The output
of a machine vision system is information about the content of the image. The
functions of an image processing system may include enhancement, coding, compression, restoration, and reconstruction.
Enhancement
Enhancement systems perform operations which make the image look better, as
perceived by a human observer. Typical operations include contrast stretching
(including functions like histogram equalization), brightness scaling, edge sharpening, etc.
Coding
Coding is the process of finding efficient and effective ways to represent the information in an image. These include quantization methods and redundancy removal.
Coding may also include methods for making the representation robust to bit-errors
which occur when the image is transmitted or stored.
Compression
Compression includes many of the same techniques as coding, but with the specific
objective of reducing the number of bits required to store and/or transmit the image.
Restoration
Restoration concerns itself with fixing what is wrong with the image. It is unlike
enhancement, which is just concerned with making images look better. In order
to correct an image, there must be some model of the image degradation. It is
common in restoration applications to assume a deterministic blur operator, followed
by additive random noise.
Reconstruction
Reconstruction usually refers to the process of constructing an image from several partial images. For example, in computed tomography (CT),2 we make a large
number, say 360, of x-ray projections through the subject. From this set of one-dimensional signals, we can compute the actual x-ray absorption at each point in the
two-dimensional image. Similar methods are used in positron emission tomography
(PET), magnetic resonance imagery (MRI), and in several shape-from-X algorithms
which we will discuss later in this course.
1.3.2
Machine vision
Machine vision is the process whereby a machine, usually a digital computer, automatically processes an image and reports what is in the image. That is, it recognizes
the content of the image. Often the content may be a machined part, and the objective
is not only to locate the part, but to inspect it as well. We will in this book discuss
several applications of machine vision in detail, such as automatic target recognition
(ATR), and industrial inspection. There are a wide variety of other applications, such
as determining the flow equations from observations of fluid flow [1.1], which time
and space do not allow us to cover.
The terms computer vision and image understanding are often also used to
denote machine vision.
Machine vision includes two components: measurement of features and pattern classification based on those features.
Measurement of features
The measurement of features is the principal focus of this book. Except for
Chapters 14 and 15, in this book, we focus on processing the elements of images
(pixels) and from those pixels and collections of pixels, extract sets of measurements
which characterize either the entire image or some component thereof.
Pattern classification
Pattern classification assigns a measurement, or a set of measurements, to one of a set of possible classes. For example, the set of possible classes might be men and women, and one measurement which we could make to distinguish men from women would be height (clearly, height is not a very good measurement to use to distinguish men from women, for if our decision is that anyone over five foot six is male we will surely be wrong in many instances).
(Sometimes, CT is referred to as CAT scanning. In that case, CAT stands for computed axial tomography. There are other types of tomography as well.)
Fig. 1.1. A pattern recognition system: raw data passes through feature measurement to produce a feature vector, which the pattern classifier converts to a class identity.
Fig. 1.2. Some components of a feature characterization system. Many machine vision
applications do not use every block, and information often ows in other ways. For
example, it is possible to perform matching directly on the image data.
We will learn many different operations to perform on images. The emphasis in this course is image analysis, or computer vision, or machine vision, or image understanding. All these phrases mean the same thing. We are interested in making measurements on images with the objective of providing our machine (usually, but not always, a computer) with the ability to recognize what is in the image. This process includes several steps:
- denoising: all images are noisy, most are blurred, many have other distortions as well. These distortions need to be removed or reduced before any further operations can be carried out. We discuss two general approaches for denoising in Chapters 6 and 7.
- segmentation: we must segment the image into meaningful regions. Segmentation is covered in Chapter 8.
- feature extraction: making measurements, geometric or otherwise, on those regions is discussed in Chapter 9.
So turn to the next chapter. (Did you notice? No homework assignments in this chapter? Don't worry. We'll fix that in future chapters.)
Reference
[1.1] C. Shu and R. Jain, Vector Field Analysis for Oriented Patterns, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(9), 1994.
Pr(a | b) = Pr(a, b) / Pr(b)   (2.1)
In Eq. (2.1), the symbols a and b represent events, e.g., the rolling of a six. Pr (b) is the
probability of such an event occurring, and Pr (a | b) is the conditional probability
of event a occurring, given that event b has occurred.
In Fig. 2.1, we tabulate all the possible ways of rolling two dice, and show the
resulting number of different ways that the numbers from 2 to 12 can occur. We
note that 6 different events can lead to a 7 being rolled. Since each of these events
is equally probable (1 in 36), then a 7 is the most likely roll of two dice. In Fig. 2.2
the information from Fig. 2.1 is presented in graphical form.
In pattern classication, we are most often interested in the probability of a particular measurement occurring. We have a problem, however, when we try to plot a
graph such as Fig. 2.2 for a continuously-valued function. For example, how do we
ask the question: What is the probability that a man is six feet tall? Clearly, the
answer is zero, for an infinite number of possibilities could occur (we might equally
well ask, What is the probability that a man is (exactly) 6.314 159 267 feet tall?).
Still, we know intuitively that the likelihood of a man being six feet tall is higher
than the likelihood of his being ten feet tall. We need some way of quantifying this
intuitive notion of likelihood.
Fig. 2.1. The number of ways each sum can be rolled with two dice.

Sum   Combinations                        Number of ways
 2    1-1                                 1
 3    2-1, 1-2                            2
 4    1-3, 3-1, 2-2                       3
 5    2-3, 3-2, 4-1, 1-4                  4
 6    1-5, 5-1, 2-4, 4-2, 3-3             5
 7    3-4, 4-3, 2-5, 5-2, 1-6, 6-1        6
 8    2-6, 6-2, 3-5, 5-3, 4-4             5
 9    3-6, 6-3, 4-5, 5-4                  4
10    4-6, 6-4, 5-5                       3
11    6-5, 5-6                            2
12    6-6                                 1
Fig. 2.2. The information of Fig. 2.1, in graphical form.
One question that does make sense is, What is the probability that a man is
less than six feet tall? Such a function is referred to as a probability distribution
function
P(x) = Pr (z < x)
(2.2)
and its derivative, the probability density function,

p(x) = d/dx P(x).   (2.3)
Fig. 2.3. The probability distribution of Fig. 2.2, showing the probability of rolling two dice to get
a number LESS than x. Note that the curve is steeper at the more likely numbers.
p(x) has all the properties that we desire. It is well defined for continuously-valued measurements and it has a maximum value for those values of the measurement which are intuitively most likely. Furthermore,

∫ p(x) dx = 1,   (2.4)

where the integral is taken over all possible values of x.
In this section, we very briey review vector and matrix operations. Generally, we
denote vectors in boldface, scalars in lowercase Roman, and matrices in uppercase
Roman.
Vectors are always considered to be column vectors. If we need to write one
horizontally for the purpose of saving space in a document, we use transpose notation.
For example, we denote a vector which consists of three scalar elements as v = [x₁ x₂ x₃]ᵀ.
The inner product of two vectors is a scalar, v = aᵀb. Its value is the sum of products of the corresponding elements of the two vectors.
You will also sometimes see the notation ⟨x, y⟩ used for inner product. We do not like this because it looks like an expected value of a random variable. One sometimes also sees the dot product notation x · y for inner product.
The magnitude of a vector is |x| = √(xᵀx). If |x| = 1, x is said to be a unit vector. If xᵀy = 0, then x and y are orthogonal. If x and y are orthogonal unit vectors, they are orthonormal.
The concept of orthogonality can easily be extended to continuous functions by simply thinking of a function as an infinite-dimensional vector. Just list all the values of f(x) as x varies between, say, a and b. If x is continuous, then there are an infinite number of possible values of x between a and b. But that should not stop us: we cannot enumerate them, but we can still think of a vector containing all the values of f(x). Now, the concept of summation which we defined for finite-dimensional vectors turns into integration, and an inner product may be written
vectors turns into integration, and an inner product may be written
b
f (x), g(x) =
f (x)g(x) d x.
(2.5)
The concepts of orthogonality and orthonormality hold for this definition of the inner product as well. If the integral is equal to zero, we say the two functions are orthogonal. So the transition from orthogonal vectors to orthogonal functions is not that difficult. With an infinite number of dimensions, it is impossible to visualize orthogonal as perpendicular, of course, so you need to give up on thinking about things being perpendicular. Just recall the definition and use it.
Suppose we have n vectors x₁, x₂, ..., xₙ; if we can write v = a₁x₁ + a₂x₂ + ⋯ + aₙxₙ, then v is said to be a linear combination of x₁, x₂, ..., xₙ.
A set of vectors x₁, x₂, ..., xₙ is said to be linearly independent if it is impossible to write any of the vectors as a linear combination of the others.
Given d linearly independent vectors of d dimensions, x₁, x₂, ..., x_d, defined on ℜᵈ, any vector in ℜᵈ may be written as a linear combination of them, and the set is said to form a basis. For example, one basis for ℜ² is

x₁ = [0 1]ᵀ   and   x₂ = [1 0]ᵀ.
Fig. 2.4. x1 and x2 are orthonormal bases. The projection of y onto x1 has length a1 .
This is the familiar Cartesian coordinate system. Here's another basis set for ℜ²:

x₁ = [1 1]ᵀ   x₂ = [1 −1]ᵀ.
If x₁, x₂, ..., x_d span ℜᵈ, and y = a₁x₁ + a₂x₂ + ⋯ + a_d x_d, then the components of y may be found by

aᵢ = yᵀxᵢ.   (2.6)
2.2.1
Linear transformations
A linear transformation, A, is simply a matrix. Suppose A is m × d. If applied to a vector x ∈ ℜᵈ, y = Ax, then y ∈ ℜᵐ. So A took a vector from one vector space, ℜᵈ, and produced a vector in ℜᵐ. If that vector y could have been produced by applying A to one and only one vector in ℜᵈ, then A is said to be one-to-one. Now suppose that there are no vectors in ℜᵐ that cannot be produced by applying A to some vector in ℜᵈ. In that case, A is said to be onto. (What does this say about m and d?) If A is one-to-one and onto, then A⁻¹ exists. Two matrices A and B are conformable if the matrix multiplication C = AB makes sense.
Some important (and often forgotten) properties: If A and B are conformable,
then
(AB)T = B T AT
(2.7)
(AB)⁻¹ = B⁻¹A⁻¹   (2.8)
and

tr(AB) = tr(BA).   (2.9)

If AᵀA = I, then obviously the transpose of the matrix is the inverse as well, and A is said to
be an orthonormal transformation (OT), which will correspond geometrically to
a rotation. If A is a d × d orthonormal transformation, then the columns of A are orthonormal, linearly independent, and form a basis spanning the space of ℜᵈ. For ℜ³, three convenient OTs (some example orthonormal transformations) are the rotations about the Cartesian axes:

Rx = | 1     0       0    |    Ry = |  cos θ   0   sin θ |    Rz = | cos θ  −sin θ   0 |
     | 0   cos θ  −sin θ  |         |    0     1     0   |         | sin θ   cos θ   0 |
     | 0   sin θ   cos θ  |         | −sin θ   0   cos θ |         |   0       0     1 |
                                                                                   (2.10)
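As a quick illustration (not from the text), here is a small C sketch that builds Rz of Eq. (2.10) for an arbitrary angle and applies it to a vector; the angle and the vector are made-up values. Multiplying Rz by its transpose reproduces the identity, which is one way to check that a matrix is an orthonormal transformation.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double theta = 0.5;                 /* rotation angle in radians, an arbitrary choice */
    double Rz[3][3] = {
        { cos(theta), -sin(theta), 0.0 },
        { sin(theta),  cos(theta), 0.0 },
        { 0.0,         0.0,        1.0 }
    };
    double x[3] = { 1.0, 2.0, 3.0 };    /* vector to be rotated */
    double y[3] = { 0.0, 0.0, 0.0 };
    int i, j;

    for (i = 0; i < 3; i++)             /* y = Rz x */
        for (j = 0; j < 3; j++)
            y[i] += Rz[i][j] * x[j];
    printf("y = [%f %f %f]^T\n", y[0], y[1], y[2]);
    return 0;
}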
One matrix derivative we will need later is

d/dx (xᵀAx) = (A + Aᵀ)x.
Since we mentioned derivatives, we might as well mention a couple of other vector
calculus things:
Suppose f is a scalar function of x, x ∈ ℜᵈ; then

df/dx = [∂f/∂x₁  ∂f/∂x₂  ⋯  ∂f/∂x_d]ᵀ,   (2.11)

and is called the gradient. This will often be used when we talk about edges in images, and f(x) will be the brightness as a function of the two spatial directions.
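As a concrete illustration (not from the text), the gradient of a sampled image is commonly approximated with finite differences. The sketch below assumes a tiny made-up brightness array; real code would read an image instead.

#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void)
{
    /* a small, made-up brightness image with a vertical edge */
    float f[ROWS][COLS] = {
        { 0, 0, 10, 10 },
        { 0, 0, 10, 10 },
        { 0, 0, 10, 10 },
        { 0, 0, 10, 10 }
    };
    float dfdx, dfdy;
    int r, c;

    /* approximate the gradient [df/dx  df/dy]^T with forward differences */
    for (r = 0; r < ROWS - 1; r++)
        for (c = 0; c < COLS - 1; c++) {
            dfdx = f[r][c + 1] - f[r][c];    /* horizontal difference */
            dfdy = f[r + 1][c] - f[r][c];    /* vertical difference */
            printf("(%d,%d): gradient = [%5.1f %5.1f]^T\n", r, c, dfdx, dfdy);
        }
    return 0;
}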
If f is a vector-valued function of x, f = [f₁ f₂ ⋯ f_m]ᵀ, then the matrix of its first derivatives,

dfᵀ/dx = | ∂f₁/∂x₁    ∂f₂/∂x₁    ⋯   ∂f_m/∂x₁  |
         |    ⋮           ⋮               ⋮     |
         | ∂f₁/∂x_d   ∂f₂/∂x_d   ⋯   ∂f_m/∂x_d |   (2.12)

is called the Jacobian.
One more: If f is scalar-valued, the matrix of second derivatives,

| ∂²f/∂x₁²       ∂²f/∂x₁∂x₂    ⋯   ∂²f/∂x₁∂x_d |
|     ⋮              ⋮                  ⋮      |
| ∂²f/∂x_d∂x₁   ∂²f/∂x_d∂x₂    ⋯   ∂²f/∂x_d²   |   (2.13)

is called the Hessian.
2.2.2
Derivative operators
Here, we introduce a new notation, a vector containing only derivative operators:

∇ = [∂/∂x₁  ∂/∂x₂  ⋯  ∂/∂x_d]ᵀ.   (2.14)
It is important to note that this is an OPERATOR, not a vector. We will do linear
algebra sorts of things with it, but by itself, it has no value, not even really any
meaning it must be applied to something to have any meaning. For most of this
book, we will deal with two-dimensional images, and with the two-dimensional form
of this operator,
∇ = [∂/∂x  ∂/∂y]ᵀ.   (2.15)
Apply this operator to a scalar, f, and we get a vector which does have meaning, the gradient of f:

∇f = [∂f/∂x  ∂f/∂y]ᵀ.   (2.16)
Similarly, we may define the divergence using the inner (dot) product (in all the following definitions, only the two-dimensional form of the del operator defined in Eq. (2.15) is used; however, remember that the same concepts apply to operators of arbitrary dimension):

div f = ∇ · f = [∂/∂x  ∂/∂y] [f₁  f₂]ᵀ = ∂f₁/∂x + ∂f₂/∂y.   (2.17)
We will also have opportunity to use the outer product, which produces a matrix:

∇fᵀ = | ∂/∂x | [f₁  f₂] = | ∂f₁/∂x   ∂f₂/∂x |
      | ∂/∂y |            | ∂f₁/∂y   ∂f₂/∂y |.   (2.18)

2.2.3
Eigenvalues and eigenvectors
"Eigen-" is the German prefix meaning "principal" or "most important." These are NOT named for Mr Eigen.
We denote the value of x which yields the minimal H as

x̂ = arg minₓ H(x).   (2.21)

(The authors get VERY annoyed at improper use of the word "optimal." If you didn't solve a formal optimization problem to get your result, you didn't come up with the optimal anything.)
The most straightforward way to minimize a function is to set its derivative to zero:

∇H(x) = 0,   (2.22)

where ∇ is the gradient operator, the set of partial derivatives. Eq. (2.22) results in a set of equations, one for each element of x, which must be solved simultaneously:

∂H(x)/∂x₁ = 0
∂H(x)/∂x₂ = 0
   ⋮
∂H(x)/∂x_d = 0.   (2.23)

Such an approach is practical only if the system of Eq. (2.23) is solvable. This may be true if d = 1, or if H is at most quadratic in x.
EXERCISE
Consider H(x) = ax₁² + bx₁ + cx₂² + dx₃². Setting the derivatives to zero,

∂H/∂x₁ = 2ax₁ + b = 0
∂H/∂x₂ = 2cx₂ = 0
∂H/∂x₃ = 2dx₃ = 0,

so H is minimized by x₃ = x₂ = 0, x₁ = −b/(2a).
If H is some function of order higher than two, or is transcendental, the technique of setting the derivative equal to zero will not work (at least, not in general) and we must resort to numerical techniques. The first of these is gradient descent.
In one dimension, the utility of the gradient is easy to see. At a point x^(k) (Fig. 2.5), the derivative points AWAY FROM the minimum. That is, in one dimension, its sign will be positive on an uphill slope.
Fig. 2.5. The sign of the derivative is always away from the minimum.

Thus, to move toward the minimum we step in the direction opposite to the derivative evaluated at the current point,

∂H/∂x evaluated at x^(k),   (2.24)

giving the update rule

x^(k+1) = x^(k) − α (∂H/∂x)|_{x^(k)}.   (2.25)
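The update of Eq. (2.25) is easy to code. Here is a minimal one-dimensional sketch in C; the objective H(x) = (x − 3)² and the step size α = 0.1 are illustrative choices, not taken from the text. With a step size that is too large (try α = 1.1) the iterates diverge, which is exactly the stability issue discussed next.

#include <stdio.h>

/* derivative of the made-up objective H(x) = (x - 3)^2 */
double dH(double x)
{
    return 2.0 * (x - 3.0);
}

int main(void)
{
    double x = 0.0;        /* initial guess x^(0) */
    double alpha = 0.1;    /* step size */
    int k;

    for (k = 0; k < 100; k++)
        x = x - alpha * dH(x);   /* x^(k+1) = x^(k) - alpha * dH/dx at x^(k) */
    printf("estimated minimizer: %f\n", x);
    return 0;
}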
2.3.1
Newton-Raphson
It is not immediately obvious in Eq. (2.25) how to choose the step size α. If α is too small, the iteration of Eq. (2.25) will take too long to converge. If α is too large, the algorithm may become unstable and never find the minimum.
We can find an estimate for α by considering the well-known Newton-Raphson method for finding roots. (In one dimension), we expand the function H(x) in a Taylor series about the point x^(k) and truncate, assuming all higher order terms are zero:

H(x^(k+1)) = H(x^(k)) + (x^(k+1) − x^(k)) H′(x^(k)).

Since we want x^(k+1) to be a zero of H, we set

H(x^(k)) + (x^(k+1) − x^(k)) H′(x^(k)) = 0,   (2.26)

so that

x^(k+1) = x^(k) − H(x^(k)) / H′(x^(k)).   (2.27)
In optimization, however, we are not finding roots but rather minimizing a function, so how does knowing how to find roots help us? The minima of the function are the roots of its derivative, and our algorithm becomes

x^(k+1) = x^(k) − H′(x^(k)) / H″(x^(k)).   (2.28)
(2.29)

In higher dimensions, the role of the second derivative is played by the matrix of second derivatives,

H_{ij} = ∂²H(x) / (∂xᵢ ∂xⱼ).   (2.30)
EXAMPLE
Fit a function of the form

y = a e^{bx}   (2.31)

to a set of data points (xᵢ, yᵢ) by minimizing

H(a, b) = Σᵢ (ln yᵢ − ln a − b xᵢ)².   (2.32)

Solution
We can solve this problem with the linear approach by observing that ln y = ln a + bx and re-defining variables g = ln y and r = ln a. With these substitutions, Eq. (2.32) becomes

H(r, b) = Σᵢ (gᵢ − r − b xᵢ)².   (2.33)
∂H/∂b = 2 Σᵢ (gᵢ − r − b xᵢ)(−xᵢ)
∂H/∂r = 2 Σᵢ (gᵢ − r − b xᵢ)(−1).   (2.34)

Setting the first of Eq. (2.34) to zero, we have

Σᵢ gᵢ xᵢ − r Σᵢ xᵢ − b Σᵢ xᵢ² = 0   (2.35)

or

r Σᵢ xᵢ + b Σᵢ xᵢ² = Σᵢ gᵢ xᵢ.   (2.37)

Similarly, setting the second to zero,

Σᵢ gᵢ − r Σᵢ 1 − b Σᵢ xᵢ = 0   (2.38)

or

N r + b Σᵢ xᵢ = Σᵢ gᵢ,   (2.39)

where N is the number of data points. Eqs. (2.37) and (2.39) are two simultaneous linear equations in two unknowns which are readily solved. (See [2.2, 2.3, 2.4] for more sophisticated descent techniques such as the conjugate gradient method.)
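A minimal C sketch of this fit (the data points are made up, not from the text): it accumulates the sums appearing in Eqs. (2.37) and (2.39), solves the resulting 2 × 2 linear system for r and b, and recovers a = e^r. The made-up data roughly follow y = 2e^{0.5x}, so the program should report a ≈ 2 and b ≈ 0.5.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* made-up data, roughly following y = 2 exp(0.5 x) */
    double x[5] = { 0.0, 1.0, 2.0, 3.0, 4.0 };
    double y[5] = { 2.0, 3.3, 5.4, 9.0, 14.8 };
    int N = 5, i;
    double Sx = 0.0, Sxx = 0.0, Sg = 0.0, Sgx = 0.0, g, det, r, b;

    for (i = 0; i < N; i++) {
        g = log(y[i]);                 /* g = ln y */
        Sx += x[i];  Sxx += x[i] * x[i];
        Sg += g;     Sgx += g * x[i];
    }
    /* Eqs. (2.37) and (2.39):  r*Sx + b*Sxx = Sgx,   N*r + b*Sx = Sg */
    det = Sx * Sx - (double)N * Sxx;
    r = (Sgx * Sx - Sg * Sxx) / det;
    b = (Sg * Sx - (double)N * Sgx) / det;
    printf("a = %f  b = %f\n", exp(r), b);
    return 0;
}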
2.3.2
2.3.3
Simulated annealing
We will base much of the following discussion of minimization techniques on an
algorithm known as simulated annealing (SA) which proceeds as follows. (See
the book by Aarts and Van Laarhoven for more detail [2.1].)
Algorithm: Simulated annealing
Choose (at random) an initial value of x, and an initial value of T > 0.
While T > Tmin , do
(1) Generate a point y which is a neighbor of x. (The exact definition of neighbor will be discussed soon.)
(2) If H (y) < H (x) then replace x with y.
(3) Else compute P_y = exp(−(H(y) − H(x))/T). If P_y ≥ R, then replace x with y, where R is a random number uniformly distributed between 0 and 1.
(4) Decrease T slightly and go to step 1.
Simulated annealing is most easily understood in the context of combinatorial optimization. In this case, the neighbor of a vector x is another vector x₂, such that only one of the elements of x is changed (discretely) to create x₂.² Thus, if x is binary and of dimension d, one may choose a neighboring y = x ⊕ z, where z is a binary vector in which exactly one element is nonzero, that element is chosen at random, and ⊕ represents exclusive OR.
In step 2 of the algorithm, we perform a descent. Thus we always fall down hill. In step 3, we provide a mechanism for sometimes making uphill moves. Initially, we ignore the parameter T and note that if y represents an uphill move, the probability of accepting y is proportional to e^{−(H(y)−H(x))}. Thus, uphill moves can occur, but are exponentially less likely to occur as the size of the uphill move becomes larger. The likelihood of an uphill move is, however, strongly influenced by T. Consider the case that T is very large. Then (H(y) − H(x))/T ≪ 1 and P_y ≈ 1. Thus, all moves will be accepted. As T is gradually reduced, uphill moves become gradually less likely until, for low values of T (T ≪ (H(y) − H(x))), such moves are essentially impossible.
One may consider an analogy to physical processes in which the state of each
variable (one or zero) is analogous to the spin of a particle (up or down). At high temperatures, particles randomly change state, and if temperature is gradually reduced,
minimum energy states are achieved. The parameter T in step 4 is thus analogous to
(and often referred to as) temperature, and this minimization technique is therefore
called simulated annealing.
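A minimal C sketch of the algorithm above. The objective (distance from an alternating 0101... pattern), the cooling schedule, and all constants are illustrative assumptions, not from the text.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define D 16   /* dimension of the binary vector x */

/* made-up objective: how many elements differ from an alternating 0101... pattern */
double H(int x[])
{
    int i, cost = 0;
    for (i = 0; i < D; i++)
        if (x[i] != (i % 2))
            cost++;
    return (double)cost;
}

int main(void)
{
    int x[D], y[D], i;
    double T = 10.0, Tmin = 0.01, dH, R;

    srand(1);
    for (i = 0; i < D; i++)
        x[i] = rand() % 2;                 /* random initial x */
    while (T > Tmin) {
        for (i = 0; i < D; i++) y[i] = x[i];
        y[rand() % D] ^= 1;                /* neighbor: flip one randomly chosen element */
        dH = H(y) - H(x);
        R = (double)rand() / RAND_MAX;     /* uniform on [0, 1] */
        if (dH < 0.0 || exp(-dH / T) >= R) /* downhill always; uphill with prob. exp(-dH/T) */
            for (i = 0; i < D; i++) x[i] = y[i];
        T *= 0.99;                         /* decrease T slightly */
    }
    printf("final H = %f\n", H(x));
    return 0;
}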
P(y(t) ∈ w_k)   (2.40)
This is an awkward bit of notation, but here is what it means. When you see the term y(t), think "the symbol received at time t." When you see the ∈, think "is." When you see w_k, think "1" or think "0" or think "whatever might be received at time k." For example, we might ask what is the probability that you receive at time k the symbol 1, when the last four symbols received previously were 0110? Which, in our notation, is to ask what is

P(y(k) ∈ w₁ | (y(k−1) ∈ w₀, y(k−2) ∈ w₁, y(k−3) ∈ w₁, y(k−4) ∈ w₀)).
It is possible that in order to compute this probability, we must know all of the
history, or it is possible that we need only know the class of the last few symbols.
One particularly interesting case is when we need only know the class of the last
symbol received. In that case, we could say that the probability of class assignments
for symbol y(t), given all of the history, is precisely the same as the probability
knowing only the last symbol:
P(y(k) | y(k−1)) = P(y(k) | (y(k−1), y(k−2), . . .)),   (2.41)
where we have simplied the notation slightly by omitting the set element symbols.
That is, y(k) does not denote the fact that the kth symbol was received, but rather
that the kth symbol belongs to some particular class. If this is the case that the
probability conditioned on all of history is identical to the probability conditioned
on the last symbol received we refer to this as a Markov process.3
This relationship implies that

P(y(N) ∈ w_N, . . . , y(1) ∈ w₁) = ∏_{t=2}^{N} P(y(t) ∈ w_t | y(t−1) ∈ w_{t−1}) · P(y(1) ∈ w₁).
³ To be perfectly correct, this is a first-order Markov process, but we will not be dealing with any other types in this chapter.
Suppose there are only two classes possible, say 0 and 1. Then we need to know only
four possible transition probabilities, which we dene using subscripts as follows:
P(y(t) = 0 | y(t−1) = 0) ≡ P₀₀
P(y(t) = 0 | y(t−1) = 1) ≡ P₀₁
P(y(t) = 1 | y(t−1) = 0) ≡ P₁₀
P(y(t) = 1 | y(t−1) = 1) ≡ P₁₁.

In general, there could be more than two classes, so we denote the transition probabilities by P_{ij}, and can therefore describe a Markov chain by a c × c matrix P whose elements are P_{ij}.
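As a small illustration (all numbers made up), the factorization above lets us score a short binary sequence directly from the transition matrix; the indexing convention P[i][j] = P(y(t) = i | y(t−1) = j) matches the definitions above.

#include <stdio.h>

int main(void)
{
    /* P[i][j] = P(y(t) = i | y(t-1) = j); made-up values, columns sum to 1 */
    double P[2][2] = { { 0.9, 0.3 },
                       { 0.1, 0.7 } };
    double P1[2] = { 0.5, 0.5 };      /* assumed distribution of the first symbol */
    int seq[4] = { 0, 1, 1, 0 };      /* the sequence whose probability we want */
    int N = 4, t;
    double prob = P1[seq[0]];

    for (t = 1; t < N; t++)
        prob *= P[seq[t]][seq[t - 1]];   /* P(y(t) | y(t-1)) */
    printf("P(0110) = %f\n", prob);
    return 0;
}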
We will take another look at Markov processes when we think about Markov random fields in Chapter 6.
Assignment
2.4.1
2.1
Is the matrix P symmetric? Why or why not? Does P have
any interesting properties? Do its rows (or columns)
add up to anything interesting?
Fig. 2.6. A hidden Markov model may be viewed as a process which switches randomly
between two signals.
The switch setting may be thought of as the output of a finite state machine (FSM), which at each time instant may stay in the same state or may switch, as shown in Fig. 2.7.
Here is our problem: We observe a sequence of symbols
Y = [y(t = 1), y(t = 2), . . .] = [y(1), y(2), . . .].
What can we infer? The transition probabilities? The state sequence? The structure
of the FSM? The rules governing the FSM? Let's begin by estimating the state sequence.
Let s(t), t = 1, . . . , N denote the state associated with measurement y(t), and denote the sequence of states S = [s(1), s(2), . . . , s(N)], where each s(t) ∈ {s₁, s₂, . . . , s_m}.
We seek a sequence of states, S, which maximizes the conditional probability that
the sequence is correct, given the measurements; P(S|Y ).
Using Bayes' rule,

P(S|Y) = p(Y|S) P(S) / p(Y).   (2.42)
By the Markov property,

P(S) = ∏_{t=2}^{N} P_{s(t),s(t−1)} · P_{s(0)}.   (2.43)
Now, let's make a temporarily unbelievable assumption, that the probability density of the output depends only on the state. Denote that relationship by p(y(t)|s(t)). Then the posterior conditional probability of the sequence can be written:

p(Y|S) P(S) = ∏_{t=1}^{N} p(y(t)|s(t)) · ∏_{t=2}^{N} P_{s(t),s(t−1)} · P_{s(0)}   (2.44)
= ∏_{t=1}^{N} p(y(t)|s(t)) P_{s(t),s(t−1)}.   (2.45)
Now look back at Eq. (2.42). The choice of S does not affect the denominator, so all we need to do is find the sequence S which maximizes

E = ∏_{t=1}^{N} p(y(t)|s(t)) P_{s(t),s(t−1)}.   (2.46)
2.4.2
Taking the logarithm of Eq. (2.46) turns the product into a sum,

∑_{t=1}^{N} [ln p(y(t)|s(t)) + ln P_{s(t),s(t−1)}],   (2.47)

which we seek to maximize over all possible state sequences.
Fig. 2.8. Every possible sequence of states can be thought of as a path through such a graph.
Fig. 2.9. A path through a problem with four states and four time values.
(2.48)
The best path to node j at time t + 1 is the maximum of these. When we nally
reach time step N , the node which terminates the best path is the nal node.
The computational complexity of this algorithm is thus Nm², which is a lot less than m^N, the complexity of a simple exhaustive search of all possible paths.
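A minimal C sketch of this dynamic-programming search for a two-state model. The transition probabilities, output probabilities, observation sequence, and the assumption of equally likely initial states are all made-up illustrations, not from the text; the inner pair of loops over i and j is the m² work done at each of the N time steps.

#include <stdio.h>

#define M 2   /* number of states */
#define N 5   /* length of the observation sequence */

int main(void)
{
    /* A[i][j] = P(s(t) = j | s(t-1) = i), made-up values */
    double A[M][M] = { { 0.8, 0.2 },
                       { 0.3, 0.7 } };
    /* B[j][k] = p(y = k | s = j) for binary outputs, made-up values */
    double B[M][2] = { { 0.9, 0.1 },
                       { 0.2, 0.8 } };
    int y[N] = { 0, 0, 1, 1, 1 };   /* observed symbols */
    double best[N][M];              /* probability of the best path ending in each state */
    int back[N][M];                 /* backpointers for recovering the path */
    int s[N], t, i, j;
    double p;

    for (j = 0; j < M; j++)
        best[0][j] = 0.5 * B[j][y[0]];      /* assume equally likely initial states */
    for (t = 1; t < N; t++)
        for (j = 0; j < M; j++) {
            best[t][j] = 0.0;
            for (i = 0; i < M; i++) {       /* m^2 work per time step */
                p = best[t - 1][i] * A[i][j] * B[j][y[t]];
                if (p > best[t][j]) { best[t][j] = p; back[t][j] = i; }
            }
        }
    /* terminate at the most probable final state, then trace back */
    s[N - 1] = (best[N - 1][0] > best[N - 1][1]) ? 0 : 1;
    for (t = N - 1; t > 0; t--)
        s[t - 1] = back[t][s[t]];
    for (t = 0; t < N; t++) printf("%d ", s[t]);
    printf("\n");
    return 0;
}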
2.4.3
Markov outputs
In the description above, we assumed that the probability of a particular output
depended only on the state. We do not have to be that restrictive; we could allow the
outputs themselves to be the Markov processes.
Assume if the state changes, the first output depends only on the state, as before.
But afterward, if the state remains the same, the outputs obey a Markov chain. We
can formulate this problem in the same way, and solve it with the Viterbi algorithm.
2.4.4
P_{ij|Y}(t) ≡ P(s(t−1) = sᵢ, s(t) = sⱼ | Y)   (2.49)
That is, given the observation sequence, what is the probability that we went from
state i to state j at time t? We can compute that quantity using the methods of
section 2.4.2 if we know the transition probabilities P_{i,j} and the output probabilities. Suppose we do know those. Then, we estimate the transition probability by averaging the probabilities over all the inputs.
P̂_{i,j} = ∑_{t=2}^{N} P_{ij|Y}(t) / ∑_{t=2}^{N} P_{j|Y}(t),   (2.50)

where, since in order to go into state j, the system had to go there from somewhere,

P_{j|Y}(t) = ∑_{i=1}^{N} P_{ij|Y}(t).   (2.51)
Then we estimate the probability of the observation by again averaging all the observations.
(2.52)
At each iteration, we use Eqs. (2.50) and (2.52) to update the parameters. We then
use Eqs. (2.49) and (2.51) to update the conditional probabilities. The process then
repeats until it converges.
2.4.5
Applications of HMMs
Hidden Markov models have found many applications in speech recognition and
document content recognition [17.29].
Assignment
2.2
(Trivia question) In what novel did a character named
Markov Chaney occur?
Assignment
2.3
Find the OT corresponding to a rotation of 30° about the z axis. Prove the columns of the resulting matrix are a basis for ℜ³.
Assignment
2.4
Prove Eq. (2.10). Hint: Use Eq. (2.7); Eq. (2.9) might
be useful as well.
Assignment
2.5
A positive definite matrix has positive eigenvalues.
Prove this. (For that matter, is it even true?)
Assignment
2.6
Does the function y = x e^x have a unique value x which
minimizes y? If so, can you find it by taking a
derivative and setting it equal to zero? Suppose
this problem requires gradient descent to solve.
Write the algorithm you would use to find the x which
minimizes y.
Assignment
2.7
We need to solve a minimization problem using gradient
descent. The function we are minimizing is sin x + ln y.
Which of the following is the expression for the
gradient which you need in order to do gradient
descent?
(a) cos x + 1/y      (b) …      (c) …      (d) [cos x   1/y]ᵀ      (e) sin x + ln y

Assignment
2.8
(a) Write the algorithm which uses gradient descent to find the vector [x, y]ᵀ which minimizes the function
2.9
Determine whether the functions sin x and sin 2x might
be orthonormal or orthogonal functions.
References
[2.1] E.H.L. Aarts and P.J.M. van Laarhoven. Simulated Annealing: Theory and Applications,
Dordrecht, Holland, Reidel, 1987.
[2.2] R.L. Burden, J.D. Faires, and A.C. Reynolds. Numerical Analysis, Boston, MA, Prindle,
Weber and Schmidt, 1981.
[2.3] G. Dahlquist and A. Bjorck. Numerical Methods, Englewood Cliffs, NJ, Prentice-Hall,
1974.
[2.4] B. Gottfried and J. Weisman. Introduction to Optimization Theory, Englewood Cliffs,
NJ, Prentice-Hall, 1973.
Computer Science is not about computers any more than astronomy is about telescopes
E. W. Dijkstra
One may take two approaches to writing software for image analysis, depending on
what one is required to optimize. One may write in a style which optimizes/minimizes
programmer time, or one may write to minimize computer time. In this course,
computer time will not be a concern (at least not usually), but your time will be far
more valuable. For that reason, we want to follow a programming philosophy which
produces correct, operational code in a minimal amount of programmer time.
The programming assignments in this book are specified to be written in C or C++, rather than in MATLAB or JAVA. This is a conscious and deliberate decision.
MATLAB in particular hides many of the details of data structures and data manipulation from the user. In the course of teaching variations of this course for many
years, the authors have found that many of those details are precisely the details
that students need to grasp in order to effectively understand what image processing
(particularly at the pixel level) is all about.
- IFS supports any data type, including char, unsigned char, short, unsigned short, int, unsigned int, float, double, complex float, complex double, complex short, and structure.
- IFS supports any image size, and any number of dimensions. One may do signal processing by simply considering a signal as a one-dimensional image.
- IFS is available on most current computer systems, including Windows on the PC, Linux on the PC, Unix on the SUN, and OS-X on the Macintosh. (Regrettably, IFS does not support Macintosh operating systems prior to OS-X.) Files written on one platform may be read on any of the other platforms. Conversion to the format native to the platform is done by the read routine, without user intervention.
- A large collection of functions are available, including two-dimensional Fourier transforms, filters, segmenters, etc.

3.1.1
3.1.2
Now, you should look at the IFS manual. Read especially carefully through the first two example programs in the front of the manual.
then a floating point number will be returned independent of the data type of the image. That is, the subroutine will do data conversions for you. Similarly, ifsigp will return an integer, no matter what the internal data type is. This can, of course, get you in trouble. Suppose the internal data type is float, and you have an image consisting of numbers less than one. Then, the process of the conversion from float to int will truncate all your values to zero.
For some projects you will have three-dimensional data. That means you must
access the images using a set of different subroutines, ifsigp3d, ifsfgp3d, ifsipp3d,
and ifsfpp3d. For example,
y = ifsigp3d(img,frame,row,col)
3.1.3
Common problems
Two common problems usually occur when students first use IFS software.
(1) ifsipp(img,x,y,exp(-t*t)) will give you trouble because ifsipp expects a fourth
argument which is an integer, and exp will return a double. You should use
ifsfpp.
(2) ifsigp(img,x,y,z) is improperly formed. ifsigp expects three arguments, and does
not check the dimension of the input image to determine number of arguments (it could, however; sounds like a good student project . . . ). To access a three-dimensional image, either use pointers, or use ifsigp3d(img,x,y,z), where the second argument is the frame number.
......
int row, col, frame;
......
for (frame = 0; frame < 224; frame++)
{
for ( row = 0; row < 128; row++)
{
for( col = 0; col < 128; col++)
{
/* pixel processing */
......
}
......
}
}
In this example, we use two integers (row and col) as the indices to the row and column of the image. By increasing row and col in steps of one, we are actually scanning the image pixel-wise from left to right, top to bottom.
If the image has more than two dimensions (e.g. hyperspectral images), a third integer is then used as the index to the dimensions, and correspondingly, three nested for-loops are needed, as shown in Fig. 3.2.
for (<cond>) {
<body>
}
(a) the K&R style
for (<cond>)
{
<body>
}
for (<cond>)
{
<body>
}
for (<cond>)
{
<body>
}
/* Example1.c
This program thresholds an image. It uses a fixed image size.
Written by Harry Putter, October, 2006
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
IFSIMG img1, img2;                    /* Declare pointers to headers */
int len[3];                           /* len is an array of dimensions, used by ifscreate */
int threshold;                        /* threshold is an int here */
int row,col;                          /* counters */
int v;
/* read in image */
img1 = ifspin("infile.ifs");
/* reconstruction from here on: assumes a two-dimensional 128 x 128 input image */
len[0] = 2; len[1] = 128; len[2] = 128;   /* two dimensions, 128 columns, 128 rows */
img2 = ifscreate(img1->ifsdt,len,IFS_CR_ALL,0);   /* output image is the same type as the input */
threshold = 55;                       /* set some value to threshold */
for (row = 0; row < 128; row++)
    for (col = 0; col < 128; col++)
    {
        v = ifsigp(img1,row,col);     /* read a pixel as an int */
        if (v > threshold)
            ifsipp(img2,row,col,255); /* write an int */
        else
            ifsipp(img2,row,col,0);
    }
ifspot(img2, "img2.ifs");             /* write image 2 to disk */
}
Fig. 3.4. Example IFS program to threshold an image using specified values of dimensions and predetermined data type.
3.5 Makefiles
You really should use makefiles. They are far superior to just typing commands. If you are doing your software development using Microsoft C++, Lcc, or some other compiler, then the makefiles are sort of hidden from you, but it is helpful to know how they operate. Basically, a makefile specifies how to build your project, as illustrated by the example makefile in Fig. 3.6.
The example in Fig. 3.6 is just about as simple a makefile as one can write. It states that the executable named myprogram depends on only one thing, the object module myprogram.o. It then shows how to make myprogram from myprogram.o and the IFS library.
Similarly, myprogram.o is made by compiling (but not linking) the source file, myprogram.c, utilizing header files found in an include directory on the CDROM, named hdr. Note: To specify a library, as in the link step, one must specify the library
/* Example2.c
Thresholds an image using information about its data type and the dimensionality.
Written by Sherlock Holmes, May 16, 1885
*/
#include <stdio.h>
#include <ifs.h>
main( )
{
IFSIMG img1, img2;
/* Declare pointers to headers */
int *len;
/* len is an array of dimensions, used by ifscreate */
int frame, row, col;
/* counters */
float threshold, v;
/* threshold is a float here */
img1 = ifspin("infile.ifs");
/*read in file by this name*/
len = ifssiz(img1);
/* ifssiz returns a pointer to an array of image dimensions*/
img2 = ifscreate(img1->ifsdt,len,IFS_CR_ALL,0);
/* output image is to be the same type as the input */
threshold = 55;
/* set some value to threshold */
/* check for one, two or three dimensions */
switch (len[0]) {
case 1:
/* 1d signal */
for (col = 0; col < len[1]; col++)
{
v = ifsfgp(img1,0,col);
/* read a pixel as a float */
if (v > threshold)
ifsfpp(img2,0,col,255.0);
/* write a float */
else
/* if img2 not float, will be converted*/
ifsfpp(img2,0,col,0.0);
}
break;
case 2:
/* 2d picture */
for (row = 0; row < len[2]; row++)
for (col = 0; col < len[1]; col++)
{
v = ifsfgp(img1,row,col);
/* read a pixel as a float */
if (v > threshold)
ifsfpp(img2,row,col,255.0); /* store a float */
else
ifsfpp(img2,row,col,0.0);
}
break;
case 3:
/* 3d volume */
for (frame = 0; frame < len[3];frame++)
for (row = 0; row < len[2]; row++)
for (col = 0; col < len[1]; col++)
{
v = ifsfgp3d(img1,frame,row,col);
/* read a pixel as a float */
if (v > threshold)
ifsfpp3d(img2,frame,row,col,255.0);
else
ifsfpp3d(img2,frame,row,col,0.0);
}
break;
default:
printf("Sorry I cannot do 4 or more dimensions\n");
} /* end of switch */
ifspot(img2, "img2.ifs"); /* write image 2 to disk */
}
Fig. 3.5. An example IFS program to threshold an image using number of dimensions, size of
dimensions, and data type determined by the input image.
myprogram: myprogram.o
cc -o myprogram myprogram.o /CDROM/Solaris/ifslib/libifs.a
myprogram.o: myprogram.c
cc -c myprogram.c -I/CDROM/Solaris/hdr
Fig. 3.6. An example makefile which compiles a program and links it with the IFS library.

name (e.g. libifs.a), but to specify an include file (e.g. ifs.h), one specifies only the directory in which that file is located, since the file name was given in the #include preprocessor directive.
In WIN32 the makefiles look like the example shown in Fig. 3.7. Here, many of the symbolic definition capabilities of the make program are demonstrated, and the location of the compiler is specified explicitly.
The programs generated by IFS are (with the exception of ifsview) console-based.
That is, you need to run them inside an MSDOS window on the PC, inside a terminal
window under Linux, Solaris, or on the Mac, using OS-X.
In this chapter, we describe how images are formed and how they are represented.
Representations include both mathematical representations for the information contained in an image and for the ways in which images are stored and manipulated in a
digital machine. In this chapter, we also introduce a way of thinking about images as surfaces with varying height, which we will find to be a powerful way to describe both the properties of images as well as operations on those images.
4.1.1
2D brightness images, also called luminance images: the things you are used to calling images. These might be color or gray-scale. (Be careful with the words "black and white," as that might be interpreted as binary.) We usually denote the brightness at a point x, y as f(x, y). Note: x and y could be integers (in this case, we are referring to discrete points in a sampled image; these points are called "pixels," short for "picture elements"), or real numbers (in this case, we are thinking of the image as a function).
You have a homework involving files named (e.g.) site1.c.ifs. The c indicates that correction has been done.
4.1.2
(4.1)

or a quadric:

ax² + by² + cz² + dxy + exz + fyz + gx + hy + iz + j = 0.   (4.2)
The form given in Eq. (4.1), in which one variable is defined in terms of the others, is often referred to as an explicit representation, whereas the form of Eq. (4.2) is an implicit representation [4.23], which may be equivalently represented in terms of the zero set, {(x, y, z) : f(x, y, z) = 0}. Implicit polynomials have some convenient properties. For example, consider a point (x₀, y₀) which is not in the zero set of f(x, y), that is, the set of points x, y which satisfy

f(x, y) ≡ x² + y² − R² = 0.   (4.3)
If we substitute x0 and y0 into the equation for f (x, y), we know we get a nonzero
result (since we said this point is not in the zero set); if that value is negative, we
40
know that (x0 , y0 ) is inside the curve, otherwise, outside [4.3]. This inside/outside
property holds for all closed curves (and surfaces) representable by polynomials.
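A tiny C illustration of this inside/outside test for the circle of Eq. (4.3); the radius and the test points are made up.

#include <stdio.h>

/* f(x, y) = x^2 + y^2 - R^2: negative inside the circle, positive outside */
double f(double x, double y, double R)
{
    return x * x + y * y - R * R;
}

int main(void)
{
    double R = 2.0;
    double pts[3][2] = { { 0.5, 0.5 }, { 3.0, 0.0 }, { 0.0, 2.0 } };
    double v;
    int i;

    for (i = 0; i < 3; i++) {
        v = f(pts[i][0], pts[i][1], R);
        printf("(%g, %g): %s\n", pts[i][0], pts[i][1],
               v < 0.0 ? "inside" : (v > 0.0 ? "outside" : "on the curve"));
    }
    return 0;
}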
4.1.3
Throughout this book, we
use the transpose
whenever we write a
vector on a row, because
we think of vectors as
column vectors.
4.1.4

We will discuss this in more detail later, but the basic idea is to unwind the image into a vector, and then talk about doing vector-matrix operations on such a representation. For example, the 2 × 2 image with brightness values

5  10
6   4

could be written as a vector f = [5 10 6 4]ᵀ.
Probabilistic representations
In Chapter 6 we will represent an image as the output of a random process which
generates images. We can in that way make use of a powerful set of mathematical
tools for estimating the best version of a particular image, given a measurement of
a corrupted, noisy image.
4.1.5
Fig. 4.1. (a) An image with lower horizontal frequency content. (b) An image with higher
horizontal frequency content.
Fig. 4.2. (L) An image. (R) A low-frequency iconic representation of that image. The right-hand
image is blurred in the horizontal direction only, a blur which results when the camera
is panned while taking the picture. Notice that horizontal edges are sharp.
The spatial frequency content of an image can be modified by filters which block specific frequency ranges. For example, Fig. 4.2 illustrates an original image and an iconic representation of that image which has been passed through a low-pass filter, that is, a filter which permits low frequencies to pass from input to output, but which blocks higher frequencies. As you can see, the frequency response is one way of characterizing sharpness. Images with lots of high-frequency content are perceived as sharp.
Although we will make little use of frequency domain representations for
images in this course, you should be aware of a few aspects of frequency domain
representations.
First, as you should have already observed, spatial frequencies differ with direction. Fig. 4.1 illustrates much more rapid variation, higher spatial frequencies in the
vertical direction than in the horizontal. Furthermore, in general, an image contains
many spatial frequencies. We can extract the spatial frequency content of an image
using the two-dimensional Fourier transform, given by
F(u, v) = (1/K) ∑ₓ ∑_y f(x, y) exp(−i2π(ux + vy)).   (4.4)
The second observation made here is that spatial frequencies vary over an image. That is, if one were to take subimages of an image, one would find significant variation in the Fourier transforms of those subimages.
Third, take a look at the computational complexity implied in Eq. (4.4). The Fourier transform of an image is a function of spatial frequencies, u and v, and may thus be considered as an image (its values are complex, but that should not worry you). If our image is N × N, we must sum over x and y to get a SINGLE u, v value, a complexity of N². If the frequency domain space is also sampled at N × N, we have a total complexity of N⁴ to compute the Fourier transform. BUT, there exists an algorithm called the fast Fourier transform which very cleverly computes a single u, v value in N log₂ N rather than N², resulting in a significant saving. Thus it is sometimes faster to compute things in the frequency domain.
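To make the N⁴ count concrete, here is a C sketch (not from the text) of the direct evaluation of Eq. (4.4) on a small made-up image, using the usual discrete form in which u and v are sampled at multiples of 1/N; the four nested loops are the N² output samples times the N² sum computed for each.

#include <stdio.h>
#include <math.h>

#define N 8   /* a small image; the direct method below is O(N^4) */

int main(void)
{
    const double PI = 3.14159265358979;
    double f[N][N];                 /* made-up image: a vertical edge */
    double Fre[N][N], Fim[N][N];    /* real and imaginary parts of F(u, v) */
    double ang;
    int x, y, u, v;

    for (y = 0; y < N; y++)
        for (x = 0; x < N; x++)
            f[y][x] = (x < N / 2) ? 0.0 : 10.0;

    for (u = 0; u < N; u++)
        for (v = 0; v < N; v++) {
            Fre[u][v] = Fim[u][v] = 0.0;
            for (x = 0; x < N; x++)          /* the N^2 sum of Eq. (4.4) ... */
                for (y = 0; y < N; y++) {    /* ... done once per (u, v)    */
                    ang = -2.0 * PI * (u * x + v * y) / (double)N;
                    Fre[u][v] += f[y][x] * cos(ang);
                    Fim[u][v] += f[y][x] * sin(ang);
                }
            Fre[u][v] /= (double)(N * N);    /* the 1/K normalization, with K = N^2 */
            Fim[u][v] /= (double)(N * N);
        }
    printf("|F(0,0)| = %f\n",
           sqrt(Fre[0][0] * Fre[0][0] + Fim[0][0] * Fim[0][0]));
    return 0;
}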
Finally, there is an equivalence between convolution, which we will discuss in
the next chapter and multiplication in the frequency domain. More on that in section
5.8.
4.1.6
4.2.1
A lens is used to form an image on the surface of the CCD. When a photon of the appropriate wavelength strikes the special material of the device, a quantum of charge is created (an electron-hole pair). Since the conductivity of the material is quite low, these charges tend to remain in the same general area where they were created. Thus, to a good approximation, the charge, q, in a local area of the CCD follows

q = ∫₀^{t_f} i dt,

where i is the incident light intensity, measured in photons per second. If the incident light is a constant over the integration time, then q = i t_f, where t_f is called the frame time.
In vidicon-like devices, the accumulated (positive) charge is cancelled by a scanning electron beam. The cancellation process produces a current which is amplified and becomes the video signal. In a CCD, charge is shifted from one cell to the next
synchronously with a digital clock. The mechanism for reading out the charge, be it
electron beam, or charge coupling, is always designed so that as much of the charge
is set to zero as possible. We start the integration process with zero accumulated
charge, build up the charge at a rate proportional to local light intensity, and then
read it out. Thus, the signal measured at a point will be proportional to both the light
intensity at that point and to the amount of time between read operations.
Since we are interested only in the intensities and not in the integration time, we
remove the effect of integration time by making it the same everywhere in the picture.
This process, called scanning, requires that each point on the device be interrogated
and its charge accumulation zeroed, repetitively and cyclically. Probably the most
straightforward, and certainly the most common way in which to accomplish this is
in a top-to-bottom, left-to-right scanning process called raster scanning (Fig. 4.3).
Fig. 4.3. Raster scanning: Active video is indicated by a solid line, blanking (retrace) by a dashed line. In an electron beam device, the beam is turned off as it is repositioned. Blanking has no physical meaning in a CCD, but is imposed for compatibility. This simplified figure represents noninterlaced scanning.
Fig. 4.4. Composite and noncomposite outputs of a television camera, voltage as a function
of time.
To be consistent with standards put in place when scanning was done by electron
beam devices (and there needed to be a time when the beam was shut off), the
television signal has a pause at the end of each scan line called blanking. While
charge is being shifted out from the bottom of the detector, charge is once again built
up at the top. Since charge continues to accumulate over the entire surface of the
detector at all times, it is necessary for the read/shift process to return immediately
to the top of the detector and begin shifting again. This scanning process is repeated
many times each second. In American television, the entire faceplate is scanned once
every 33.33 ms (in Europe, the frame time is 40 ms).
To compute exactly how fast the electron beam is moving, we compute

(1 s / 30 frames) × (1 frame / 525 lines) = 63.5 µs/line.   (4.5)
Using the European standard of 625 lines and 25 frames per second, we arrive at almost exactly the same answer, 64 µs per line. This 63.5 µs includes not only the active video signal but also the blanking period, approximately 18 percent of the line time. Subtracting this dead time, we arrive at the active video time, 52 µs per line.
Fig. 4.4 shows the output of a television camera as it scans three successive lines.
One immediately observes that the raster scanning process effectively converts a
picture from a two-dimensional signal to a one-dimensional signal, where voltage is
a function of time. Fig. 4.4 shows both composite and noncomposite video signals,
45
that is, whether the signal does or does not include the sync and blanking timing
pulses.
The sync signal, while critical to operation of conventional television, is not
particularly relevant to our understanding of digital image processing at this time.
The blanking signal, however, is the single most important timing signal in a raster
scan system. Blanking refers to the time that there is no video. There are two distinct
blanking events: horizontal blanking, which occurs at the end of each line, and
vertical blanking, which occurs at the bottom of the picture. In a digital system, both
blanking events may be represented by pulses on separate digital wires. Composite
video is constructed by shifting these special timing pulses negative and adding them
to the video signal.
Now that we recognize that horizontal blanking signifies the beginning of a new line of video data, we can concentrate on that line and learn how a computer might acquire the brightness information encoded in that voltage.
Resolution
The number of samples on a single line defines the horizontal resolution of a video
system. Similarly, the number of lines in a single image defines the vertical resolution.
It is interesting to note that European television, with 625 lines per picture,
has a greater vertical resolution than American. This is why viewers observe that
European TV has a better picture than American.
The term resolution may also refer to the physical size of the smallest thing the
imaging system can clearly image. For example, the resolution of mammographic
x-ray film is around 50 microns, meaning that a dot on the film as small as that may
be discerned.
For computer monitors, there are many resolution standards, and we will not list
them all here. However the approach to calculating clock rates is the same.
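To make the clock-rate bookkeeping concrete, here is a minimal sketch (in Python) of the calculation; the 52 μs active line time comes from the discussion above, while the horizontal sample count of 512 is simply an assumed example value.

```python
# Minimal sketch: pixel clock implied by the active line time.
# The 52 microsecond active-video time comes from the text; the
# horizontal sample count (512) is an assumed example value.
active_line_time_s = 52e-6      # active video per line, from the text
samples_per_line = 512          # assumed horizontal resolution

pixel_period_s = active_line_time_s / samples_per_line
pixel_clock_hz = 1.0 / pixel_period_s

print(f"pixel period : {pixel_period_s * 1e9:.1f} ns")
print(f"pixel clock  : {pixel_clock_hz / 1e6:.2f} MHz")   # about 9.85 MHz
```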
Dynamic range
The sampled analog signal is converted to digital form by the quantization process,
as shown in Fig. 4.7. The digital representation of any signal can have only a finite
number of possible values, which are defined by the number of bits in the output
word. Video signals are often quantized to 8 bits of accuracy, thus allowing a signal
to be represented as one of 256 possible values.
One definition of the dynamic range of the imaging system is the number of
bits of the digital representation. An alternative definition specifies the dynamic
range as the range of input signal over which a camera successfully operates. Both
meanings are accepted and are in common use, but they differ according to the
context.
Since a digital image is raster scanned and sampled, there is a one-to-one relationship between time and space. That is, if we refer to the sampling time, we must speak
of it relative to the top-of-picture signal (vertical blanking). That timing relationship
identies a unique position on the screen.
The sampling theorem
An interesting question arises if we wish to sample and store an analog signal and
we wish to reconstruct that signal exactly from the sampled version. It can be shown
Fig. 4.8. Image of a face represented using 16 shades of gray (4 bits) on left, and with eight
shades (3 bits) on right.
that exact reconstruction requires a sampling rate of at least twice the highest frequency in the signal.
In machine vision, we are usually not concerned with exact reconstruction of the
most subtle details of the image, but wish to extract just the information we need to
accomplish the task at hand.
Quantization error is the term used to refer to the fact that information is lost
whenever the continuously valued analog signal is partitioned into discrete ranges.
Quantization error is often observed as contouring, as illustrated in Fig. 4.8.
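A minimal sketch of the quantization step follows, using an assumed synthetic ramp image in place of the face of Fig. 4.8; reducing the number of bits produces the contouring described above.

```python
import numpy as np

# Minimal sketch: quantize a smooth signal to n bits and observe the
# loss of gray levels. The ramp image is a synthetic stand-in for the
# face image of Fig. 4.8.
def quantize(image, bits):
    levels = 2 ** bits
    # Map [0, 1] onto integer codes 0 .. levels-1, then back to gray values.
    codes = np.floor(image * levels).clip(0, levels - 1)
    return codes / (levels - 1)

x = np.linspace(0.0, 1.0, 256)
ramp = np.tile(x, (64, 1))          # smooth horizontal brightness ramp

coarse = quantize(ramp, 3)          # 8 gray levels: visible contours
fine = quantize(ramp, 8)            # 256 gray levels: visually smooth
print("distinct levels, 3 bits:", np.unique(coarse).size)   # 8
print("distinct levels, 8 bits:", np.unique(fine).size)     # 256
```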
4.2.2
Stereopsis
Fig. 4.9. From the
images extracted from
two cameras, it is
possible to compute
the location in 3-space
of any point, provided
we can solve the
correspondence
problem.
Most animals have two eyes, and we know from experience that with two views, we
can extract three-dimensional information. It is not hard to work that out geometrically (Fig. 4.9).
If we know the distance between the cameras, the angle of observation of each
camera (in most stereopsis systems, the cameras are set up so their midlines are
parallel), and we can measure where particular points in the scene appear in both
images, then we can calculate distance to the object, which will be referred to as
range. If we can do this in the general case, that is, if we can always identify which
point in the left image corresponds to which point in the right image, we have solved
the correspondence problem.
The correspondence problem is one of the fundamental problems of machine vision! Some people would say it is THE problem in machine vision.
An often-used simplifying assumption is that the two cameras are set up so that
they are exactly parallel, and if a point occurs on a particular line in the left image
then it will appear on the same epipolar line in the right image. In other words,
the epipolar line connects a point and its correspondence. This assumption makes it
possible to reduce the complexity of the correspondence problem dramatically.
The literature is filled with papers on approaches to the correspondence problem.
Most of them focus on point matching. That is, find a point on the epipolar line in the
second image which in some way resembles a point in the first image. For example,
Bokil and Khotanzad [4.5] extended the work of Marr and Poggio [4.27], which
makes use of epipolar assumption. They accomplish point matching by establishing
a gray level compatibility matrix (GLCM). The pixel values of the left and right
images are labels at the bottom and left of the matrix. The i, j element of the matrix
is determined by computing the absolute value of the difference between brightness
values in the ith row of the left-hand image and the jth column in the right-hand
image. The GLCM values are then normalized. Row-to-row correlations are then
established and a best match is selected.
The correspondence problem may be made easier by hierarchical matching [4.26]
(two low-level features like epipolar edges correspond only if they belong to regions
which correspond).
There are methods for finding curves in 3-space which do not explicitly require a
solution to the correspondence problem. For example, Cohen and Wang [4.10, 4.11]
solve for the best matching curve rather than for the individual points.
Camera calibration [4.28, 4.37] is important for stereo [4.31], and at the same
time, stereopsis can be used for camera calibration, since it establishes a relationship
between the two-dimensional image and the three-dimensional world. A great deal
of effort has been expended to determine the minimum set of correspondences [4.1,
4.32] or other relationships [4.33, 4.34] required to calibrate the cameras.
A set of correspondences implies a transformation determining the pose of an
object (the position and orientation of the object in 3-space is known as the pose).
In a given scene, there may be multiple sets of correspondences [4.20, 4.38].
There are many special case applications [4.13], including how to obtain stereo
information from panoramic cameras [4.30].
We will revisit stereopsis in Chapter 11 after we have learned how the concepts of
parametric transformations will help in the solution of the correspondence problem.
Structured illumination
One variation eliminates the correspondence problem: replace one camera with
a light source (e.g. a laser beam passing through a cylindrical mirror). However,
this is really no longer stereopsis. Instead, it is a method referred to as structured
illumination. To see how this works, look back at Fig. 4.9, and think of one of
Fig. 4.10. Structured illumination.
those cameras as being replaced by a projector which shines a very narrow, very
bright slit on the scene, as illustrated in Fig. 4.10. Now, one angle is known
from the projector; the other angle is measured by finding the bright spot in the
camera image, counting over pixels, and knowing the relationship between pixels
and angle. Finally, knowledge of the distance between cameras, d, makes the triangle
solvable.
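The triangle of Fig. 4.10 can be solved directly. The following is a minimal sketch under the assumption that both angles are measured from the baseline toward the lit point; the variable names are ours, not notation from the figure.

```python
import math

# Minimal sketch of the structured-illumination triangle of Fig. 4.10.
# The projector and camera are separated by a baseline d; each makes a
# known angle with the baseline toward the bright spot. The names
# theta_proj and theta_cam are assumptions, not the book's notation.
def triangulate(d, theta_proj, theta_cam):
    """Return (x, range) of the lit point.

    d           -- baseline between projector and camera
    theta_proj  -- angle at the projector, measured from the baseline
    theta_cam   -- angle at the camera, measured from the baseline
    """
    t1, t2 = math.tan(theta_proj), math.tan(theta_cam)
    x = d * t2 / (t1 + t2)     # distance along the baseline from the projector
    rng = x * t1               # perpendicular distance (the "range")
    return x, rng

# Example: 0.5 m baseline, both angles 60 degrees.
x, rng = triangulate(0.5, math.radians(60), math.radians(60))
print(f"x = {x:.3f} m, range = {rng:.3f} m")   # range is about 0.433 m
```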
One observation seems relevant to this point in describing images. An interesting
problem occurs when one uses structured illumination to look at specular reflectors
such as metal surfaces. With specular reflectors, either not enough or too much light
may be reflected (polarization filters help [4.29]).
We will see more about using structured illumination when we get to the shape-from-X sections of this book.
A measurement system corrupts the input image to produce the measured image:
g(x, y) = D( f (x, y)),
(4.6)
where D is some distortion function which typically includes some random noise
process.
$$g(x, y) = \int\!\!\int f(x', y')\, h(x - x',\, y - y')\, dx'\, dy'. \tag{4.7}$$
This is the convolution integral. We can show that for any distortion operator D, if
D is linear and is the same wherever it is applied (space-invariant), then there is a
convolution in the form of Eq. (4.7) which produces distortion identical to D. Here is
how it happens: The following derivation of the convolution integral is done in one
dimension, only for convenience. The extension to two dimensions is trivial. First,
observe that any function f, evaluated at a point x, can be written as
$$f(x) = \int f(x')\,\delta(x - x')\, dx' \tag{4.8}$$

First assumption: D is linear.

where δ(r) represents the delta function, which is equal to zero when its argument
is nonzero, equal to infinity when its argument is zero, and has an integral of one.
Equation (4.8) is not profound; it simply defines the way the delta samples a
function. However, now let us suppose our function f is corrupted by some operator
which changes f at every point x. Then,

$$D(f(x)) = D\!\left(\int f(x')\,\delta(x - x')\, dx'\right). \tag{4.9}$$
Now if D is a linear operator, then we can interchange the operator D and the integral
to obtain
$$D(f(x)) = \int D\big(f(x')\,\delta(x - x')\big)\, dx'. \tag{4.10}$$

Since f(x') is simply a scalar at each x', it may be moved outside the operator, giving

$$D(f(x)) = \int f(x')\, D\big(\delta(x - x')\big)\, dx'. \tag{4.11}$$
Now observe that D may depend on x, or it may depend on the difference between x
and x', but in any case, it is the distortion operator applied to just the delta function.
So any LINEAR distortion of f can be written as an integral of the product of f with a
function which is the distortion applied to the delta. Since the delta function is really
just a very bright spot, with infinite height and zero width, in one dimension, we call
it an impulse, and call D(δ(x − x')) the impulse response. The two-dimensional
delta function is a point of light, so we call the result of applying the distortion to it
the point spread function. The impulse response and the point spread function are
precisely the same thing, the only difference is in usage.
Since the impulse response might depend on both x and x', we introduce a new
notation which we call h:

$$h(x, x') = D(\delta(x - x')). \tag{4.12}$$

If we make another assumption, we can get a simpler expression: Let's assume that
D depends not on x, but only on the difference between x and x'. In that case, we
can write h(x, x') = h(x − x'), and Eq. (4.11) simplifies to

$$g(x) = D(f(x)) = \int f(x')\, h(x - x')\, dx' \tag{4.13}$$
where we have introduced g, the output of the system. This, you will come to
recognize as the convolution integral. This integral is very important for a variety
of reasons, including the fact that it can be computed rapidly using the fast Fourier
transform (FFT). Even more significant is the observation that ANY distortion of
an image (as long as it is linear and space-invariant) can be computed by an integral
like this.
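The following minimal sketch illustrates the point in one (discrete) dimension: probing an unknown linear, space-invariant operator with an impulse recovers a kernel that reproduces the operator exactly. The particular blur used for D is, of course, just an assumed stand-in.

```python
import numpy as np

# Minimal sketch: a linear, space-invariant operator is completely
# characterized by its impulse response. The "unknown" distortion D is
# itself built from a small blur kernel, purely so that we have something
# concrete to probe with an impulse.
def D(signal):
    # some linear, space-invariant distortion (a small blur)
    return np.convolve(signal, [0.25, 0.5, 0.25], mode="same")

# 1. Probe D with a unit impulse to measure its impulse response.
delta = np.zeros(9)
delta[4] = 1.0
h = D(delta)                      # impulse response of D

# 2. Apply D to an arbitrary signal, and separately convolve with h.
f = np.array([0, 1, 4, 2, 7, 3, 0, 5, 1], dtype=float)
g_direct = D(f)
g_conv = np.convolve(f, h[3:6], mode="same")   # h is zero outside 3 samples

print(np.allclose(g_direct, g_conv))           # True
```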
4.4.1
Isophotes
Consider the value of f (x, y) as a surface in space, described by z = f (x, y). Then
the ordered triple [x, y, f (x, y)]T describes this surface. For every point, (x, y), there
is a corresponding value in the third dimension. It is important to observe that there
is just ONE such z value for any x, y pair ( f (x, y) is a function). Therefore, z is a
surface.
Consider the set of all points satisfying f (x, y) = C for some constant C. If f
represents brightness, then this is a set of points, all of which have the same
brightness. We therefore refer to this set as an isophote.
Theorem
At any image point, (x, y), the isophote passing through that point is perpendicular
to the gradient.2
The gradient vector is defined in Eq. 2.11 and elaborated upon in Eq. 5.22.
Fig. 4.11. Contour lines on an elevation map are equivalent to isophotes. The gradient vector at
a point is perpendicular to the isophote at that point.
4.4.2
Ridges
Now let's think about z(x, y), a surface in space, as a mountain (see Fig. 4.11). If
we draw a geological contour map of this mountain, the lines on the map are lines
of equal elevation. However, if we think of elevation denoting brightness, then
the contour lines are isophotes. Stand at a point on this mountain, and look in the
direction of the gradient. The direction you are looking is the way you would go to
undertake the steepest ascent.
Look to your right or left and you are looking along the isophote. Note that the
direction of the gradient is the steepest direction at that particular point. It does not
necessarily point at the peak.
Let's climb this mountain by taking small steps in the direction of the local
gradient. What happens at the ridge line? How would you know you were on a
ridge? How can you describe this process mathematically?
Think about taking steps in the direction of the gradient. Your steps are generally
in the same direction, until you reach the ridge, then, the direction radically shifts.
So, one useful definition of a ridge is the locus of points which are local maxima of
the rate of change of gradient direction. That is, denoting the gradient direction by θ,
we need to find the points where ∂θ/∂v is maximized. Here, ∂/∂v represents a derivative taken in the direction of the
gradient. In Cartesian coordinates,

$$\frac{\partial\theta}{\partial v} = \frac{2 f_x f_y f_{xy} - f_y^2 f_{xx} - f_x^2 f_{yy}}{\left(f_x^2 + f_y^2\right)^{3/2}}. \tag{4.14}$$

Maintz et al. [4.24] point out that it is essentially equivalent to a slightly simpler
formulation based on simply the second derivative of brightness in the v direction,
which leads to maximizing

$$\frac{f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy}}{f_x^2 + f_y^2}, \tag{4.15}$$
where the subscript denotes the partial derivative with respect to that variable. In three-dimensional
data, the concepts of ridges are the same, just harder to visualize. In that case, the
gradient is a 3-vector, pointing in the direction of increasing density. Isophotes are
surfaces instead of curves. In that same paper [4.24], Maintz et al. also consider the
concept of ridges in three-dimensional data; check it out if you have to implement
such things.
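As an illustration, here is a minimal sketch of the measure of Eq. (4.15) computed with finite-difference derivatives; the small epsilon added to the denominator and the synthetic "roof" image are our own choices, not part of the formulation above.

```python
import numpy as np

# Minimal sketch of the ridge measure of Eq. (4.15), using finite
# differences for the partial derivatives. Points where this measure is
# locally maximal are candidate ridge points.
def ridge_measure(f):
    fy, fx = np.gradient(f.astype(float))    # derivatives along rows (y) and columns (x)
    fyy, fyx = np.gradient(fy)
    fxy, fxx = np.gradient(fx)
    num = fy**2 * fxx - 2.0 * fx * fy * fxy + fx**2 * fyy
    den = fx**2 + fy**2
    return num / (den + 1e-12)               # epsilon avoids divide-by-zero on flat regions

# Example: a "roof" profile whose crest runs along the column x = 16.
xx = np.arange(32)
img = np.tile(-np.abs(xx - 16.0), (32, 1))
r = ridge_measure(img)
print(r.shape)   # (32, 32)
```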
4.4.3
Compared to the number of pixels in the image, the number where the gradient is strong, the boundaries of regions, is a very small percentage. In fact, as the resolution goes up, the percentage gets smaller and smaller. This is called a set of measure zero.
We may define neighborhoods in a variety of ways, but the most common and most
intuitive way is to say that two pixels are neighbors if they share a side (4-connected)
or they share either a side or a vertex (8-connected). The neighborhood of a pixel
is the set of pixels which are neighbors (surprise!). The 4-neighbors of the center
point are illustrated in Fig. 4.12. Denote the neighborhood of a point s by N_s. Later
we will discuss operations on sets of points, neighborhoods of points, and on sets of
neighborhoods. For example, let A and B be sets of points in the image, and let s be
a point in the set A. We may define the aura [4.15] of a set A with respect to a set B,
for a neighborhood structure N_s, by

$$O_B(A) = \bigcup_{s \in A} \left( N_s \cap B \right).$$
That is, the aura of a set of pixels A relative to a set B is the collection of all points
in B that are neighbors of pixels in A, where the concept of neighbor is given
by a problem-specific definition. Fig. 4.13 illustrates an image containing (a) a set
A (defined to be the set of shaded pixels), a set B (defined to be the set of blank
pixels), a neighborhood relation given by (b), and the aura of set B with respect to
set A in (c) [4.15]. We will see more about relationships like this when we discuss
morphology.
Fig. 4.13. (a) Set A (shaded pixels) and set B (blank pixels); (b) a neighborhood relation.
The shaded pixels are, by definition, neighbors of the center pixel. (c) The aura of
the set of white pixels in (a) relative to the set of shaded pixels is given by the dark
pixels. It is important to observe that, although this example uses the standard 4-connected
definition of neighbor, there is no requirement that neighbors even be spatially
adjacent.
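A minimal sketch of the aura computation follows, using 4-connected neighborhoods and representing A and B as sets of (row, column) pairs; the set-based representation is a choice of convenience, not the only possibility.

```python
# Minimal sketch of the aura O_B(A): all pixels of B that are neighbors
# (here, 4-connected) of some pixel of A. A and B are sets of (row, col)
# tuples; this mirrors the definition above, not any particular library.
def aura(A, B, neighborhood=((-1, 0), (1, 0), (0, -1), (0, 1))):
    result = set()
    for (r, c) in A:
        for (dr, dc) in neighborhood:
            if (r + dr, c + dc) in B:
                result.add((r + dr, c + dc))
    return result

# Tiny example: A is a 2 x 2 block, B is everything else in a 4 x 4 grid.
grid = {(r, c) for r in range(4) for c in range(4)}
A = {(1, 1), (1, 2), (2, 1), (2, 2)}
B = grid - A
print(sorted(aura(A, B)))
# The 4-connected border of the block: (0,1), (0,2), (1,0), (1,3), ...
```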
origin to the center of the image using a move command before you do the roll). Use the program ifs
stack to convert these six two-dimensional images
into a single three-dimensional image. If you are
using Unix, view that image using imp, and demonstrate how to use imp to display the rotation as a
movie (Hint: Use the volume button). If you are
using a PC, you may convert the three-dimensional
image produced by stack into an AVI image. For
this, use ifs2avi. The AVI image can be viewed by
any of a large collection of PC programs. A double
click on the icon of the .avi image should do the
job.
(5) Now, learn how3 to use the program ifs spin. Demonstrate that you can use it to generate a sophisticated movie. NOTE: ifs spin actually runs viewpoint.
The Unix version will generate quite a large set
of temporary files, which it deletes when done. Be
aware of a need for temporary disk space.
Write up your results, and show your instructor a demo.
Hint: On your CDROM, in the leadhole directory, you
will find an image named spinout.avi. The image you produced should vaguely resemble the output.
4.6 Conclusion
In this chapter, you have been introduced to a variety of ways to represent images,
and the information in images. In subsequent chapters, we will build on these representations, developing algorithms which extract and categorize that information.
4.7 Vocabulary
You should know the meanings of the following terms.
Correspondence problem
Curvature
3
To learn how to use an IFS program, either type program name -h or look it up in the manual.
Dynamic range
Functional representation
Graph
Iconic representation
Isophote
Linear system
Medial axis
Probabilistic representation
Quantization
Range image
Raster scan
Ridge
Resolution
Sampling
Spatial frequency
Stereo
Structured illumination
Topic 4A  Image representations
4A.1
Fig. 4.14. A coordinate system which is natural to hexagonal tessellation of the plane. The
u and v directions are not orthogonal. Unit vectors u and v describe this coordinate
system.
Traditionally, electronic imaging sensors have been arranged in rectangular arrays mainly
because an electron beam needed to be swept in a raster-scan way, and more recently because
it is slightly more convenient to arrange charge-coupled devices in a rectangular organization.
Rectangular arrays, however, introduce an ambiguity in attempts to define neighborhoods.
On the other hand, we see no connectivity paradoxes in hexagonal connectivity analysis:
Every pixel has exactly six neighbors, whether foreground, background, or other colors.
Notation
We denote a point in R² by p = u u + v v, where the unbolded character denotes the magnitude
in the direction of the unit vector denoted by the bold character. In the case that we discuss two
or more points, we will denote different vectors by using subscripts, with the same subscripts
on the components, e.g., p_i = u_i u + v_i v.
We will also use column vector notation for such points:

$$P_i = [u_i,\, v_i]^T. \tag{4.16}$$

In some cases, we will be interested in the location of points in the familiar Cartesian
representation, [x, y]ᵀ. In this case, we will denote points by subscripts as well, e.g.
P_i = [u_i, v_i]ᵀ = [x_i, y_i]ᵀ, with corresponding values for u, v, x, and y.
Lemma 1
Any ordered pair [u, v] corresponds to exactly one pair [x, y].

Proof
Using simple trigonometry, and noting that the cosine of 60 degrees is 1/2, it is straightforward
to derive that

$$x = u + \frac{v}{2} \qquad \text{and} \qquad y = \frac{\sqrt{3}\, v}{2}. \tag{4.17}$$
Lemma 2
Any ordered pair of Cartesian coordinates [x, y] corresponds to exactly one pair [u, v].

Proof
By solving Eq. (4.17) for u and v, we find

$$u = x - \frac{y}{\sqrt{3}} \qquad \text{and} \qquad v = \frac{2y}{\sqrt{3}}. \tag{4.18}$$
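A minimal sketch of Eqs. (4.17) and (4.18) as code, confirming that the two conversions are inverses; the test point is arbitrary.

```python
import math

# Minimal sketch of Eqs. (4.17) and (4.18): converting between hexagonal
# (u, v) coordinates and Cartesian (x, y), and checking that the two
# lemmas really are inverses of one another.
def hex_to_cart(u, v):
    return u + v / 2.0, math.sqrt(3.0) * v / 2.0               # Eq. (4.17)

def cart_to_hex(x, y):
    return x - y / math.sqrt(3.0), 2.0 * y / math.sqrt(3.0)    # Eq. (4.18)

u, v = 3.0, -2.0
x, y = hex_to_cart(u, v)
u2, v2 = cart_to_hex(x, y)
print((round(u2, 10), round(v2, 10)))   # (3.0, -2.0): round trip recovers (u, v)
```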
in Eq. (4.19), where the inner product of u and v in Cartesian coordinates does not equal
zero:

$$\mathbf{u} = \mathbf{x}, \qquad \mathbf{v} = \frac{1}{2}\mathbf{x} + \frac{\sqrt{3}}{2}\mathbf{y}, \qquad \text{so} \quad \mathbf{u}^T\mathbf{v} = [1 \;\; 0]\begin{bmatrix} \tfrac{1}{2} \\ \tfrac{\sqrt{3}}{2} \end{bmatrix} = \frac{1}{2}. \tag{4.19}$$
Theorem
The vectors u and v form a basis for R².

Proof
Since x and y obviously are a basis for R², we can write any point p in R² as an ordered pair
p = [x, y]ᵀ = x x + y y. But from Eq. (4.19), we have

$$p = x\,\mathbf{u} + y\,\frac{(2\mathbf{v} - \mathbf{u})}{\sqrt{3}} = \left(x - \frac{y}{\sqrt{3}}\right)\mathbf{u} + \frac{2y}{\sqrt{3}}\,\mathbf{v}.$$
4A.1.1
Fig. 4.15. The pixel at (u, v) and its six hexagonal neighbors: (u−1, v), (u−1, v+1), (u, v−1), (u, v+1), (u+1, v−1), and (u+1, v).
4A.2
4A.2.1
Curvature
The computation of local curvature could be performed at every point in an image. For 2½-D
images (surfaces), the curvature cannot be described adequately by a single scalar, but
rather takes the form of a matrix. (See doCarmo's book [4.12] or other texts on differential
geometry for details.)
$$K = \begin{bmatrix} E & F \\ F & G \end{bmatrix}^{-1} \begin{bmatrix} e & f \\ f & g \end{bmatrix} \tag{4.20}$$

where

$$E = 1 + \left(\frac{\partial z}{\partial x}\right)^2, \qquad F = \frac{\partial z}{\partial x}\,\frac{\partial z}{\partial y}, \qquad G = 1 + \left(\frac{\partial z}{\partial y}\right)^2,$$

$$e = \frac{\partial^2 z}{\partial x^2}\bigg/ H, \qquad f = \frac{\partial^2 z}{\partial x\,\partial y}\bigg/ H, \qquad g = \frac{\partial^2 z}{\partial y^2}\bigg/ H,$$

and finally,

$$H = \sqrt{\left(\frac{\partial z}{\partial x}\right)^2 + \left(\frac{\partial z}{\partial y}\right)^2 + 1}.$$
The principal curvatures K₁ and K₂ are defined as the two eigenvalues of the matrix K, and
the corresponding eigenvectors determine the directions of the curvature.
For many of our purposes, we will need scalar measurements of curvature which are
invariant to viewpoint. Two such scalars are easily defined, the mean curvature

$$K_m = \frac{1}{2}(K_1 + K_2) = \frac{1}{2}\,\mathrm{Tr}(K) \tag{4.21}$$

and the Gauss curvature

$$K_g = K_1 K_2. \tag{4.22}$$
Since it is a product, the Gauss curvature is zero whenever either of the two principal curvatures
is zero, a condition which routinely occurs with industrial parts. For this reason, we seldom
use the Gauss curvature.
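For readers who want to compute these quantities from a sampled range image z(x, y), here is a minimal sketch that evaluates the mean and Gauss curvatures directly from finite-difference estimates of the partial derivatives; the closed-form expressions used are the standard Monge-patch formulas, stated here as an assumption rather than taken from the text.

```python
import numpy as np

# Minimal sketch: mean and Gauss curvature of a surface z(x, y) sampled on
# a grid, written directly in terms of the first and second partials rather
# than by forming the matrix K explicitly. Derivatives are finite differences.
def curvatures(z):
    zy, zx = np.gradient(z.astype(float))
    zyy, zyx = np.gradient(zy)
    zxy, zxx = np.gradient(zx)
    h2 = 1.0 + zx**2 + zy**2
    gauss = (zxx * zyy - zxy**2) / h2**2
    mean = ((1.0 + zy**2) * zxx - 2.0 * zx * zy * zxy
            + (1.0 + zx**2) * zyy) / (2.0 * h2**1.5)
    return mean, gauss

# Example: a hemisphere of radius 10; away from the rim both principal
# curvatures have magnitude 1/10, so |mean| ~ 0.1 and Gauss ~ 0.01.
y, x = np.mgrid[-6:7, -6:7]
z = np.sqrt(100.0 - x**2 - y**2)
mean, gauss = curvatures(z)
print(round(float(mean[6, 6]), 3), round(float(gauss[6, 6]), 4))
```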
4A.2.2
Texture
Texture is one of those words that everybody seems to know, but knows without a definition. There are at least two different definitions of texture: natural textures, which are
best characterized by random process descriptions, and regular textures, which are best
characterized by frequency-domain representations.
Haralick and Shapiro [4.18] describe textures as having one or more of the properties of
fineness, coarseness, smoothness, granulation, randomness, lineation, or as being mottled,
irregular, or hummocky. In detecting that clusters of pixels are different, many features may
be used including moments of the power spectrum [4.4, 4.14], fractal dimension [8.12], and
the cepstrum [4.35]. Texture segmentation involves representing [4.17, 4.21] an image in a
way which incorporates both spatial and spatial-frequency information, and then using that
information to identify regions with similar characteristics [4.7, 4.14, 4.16].
The fact that textures can be effectively represented by self-similar (fractal) processes is
addressed in a number of papers, the first of which was presented in the classic work by
Mandelbrot and Van Ness [4.25]. Kaplan and Kuo [4.22] point out that true textures do not
necessarily keep the same exact textures over scale, and the concept of self-similarity should
be modied.
Fig. 4.16. (a) Examples of wool textures [4.2]. (b) Examples of tree bark textures [4.2].
(c) Examples comparing natural and regular textures [4.4]. Used with permission.

Assignment
4.A1
Suppose the image f(x,y) is describable by f(x,y) = x⁴/4 − x³ + y². At the
point x = 1, y = 2, which of the following is a unit vector which points
along the isophote passing through that point?
(a)–(f) [six candidate vectors, several scaled by 1/√5]

Assignment
4.A2
Imagine you are standing on a surface. You cannot see the entire surface, but you can see a fairly large portion. If you
measure the curvature at all the points you can see, you find
that one of the two principal curvatures is zero. The other
principal curvature varies monotonically in one direction.
You cannot measure it precisely, but you suspect that variation of curvature is linear in that one direction. On what
type of surface are you standing?
References
[4.1] T. Alter, 3-D Pose from 3 Points Using Weak-perspective, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(8), 1994.
[4.2] D. Badler, J. JaJa, and R. Chellappa, Scalable Data Parallel Algorithms for
Texture Synthesis and Compression using Gibbs Random Fields, IEEE Transactions
on Image Processing, 4(10), 1995.
[4.3] R. Bajcsy and F. Solina, Three Dimensional Object Representation Revisited,
International Conference on Computer Vision, London, May, 1987.
[4.4] J. Bigun and J. du Buf, N-folded Symmetries by Complex Moments in Gabor Space
and Their Application to Unsupervised Texture Segmentation, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.5] A. Bokil and A. Khotanzad, A Constraint Learning Feedback Dynamic Model for
Stereopsis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11),
1995.
[4.6] K. Castleman, Digital Image Processing, Englewood Cliffs, NJ, Prentice-Hall, 1996.
[4.7] J. Chen and A. Kundu, Rotation and Gray Scale Transformation Invariant Texture Identification using Wavelet Decomposition and Hidden Markov Models, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[4.8] R. Chien and W. Snyder, Hardware for Visual Image Processing, IEEE Transactions
on Circuits and Systems, 22(6), 1975.
[4.9] D. Clausi, Texture Segmentation Example, Web publication,
http://www.eng.uwaterloo.ca/dclausi/texture.html, Spring 2001.
[4.10] F. Cohen and J. Wang, Part I: Modeling Image Curves Using Invariant 3-D Object Curve Models - A Path to 3-D Reconstruction and Shape Estimation from Image Contours Using B-Splines, Shape Invariant Matching and Neural Network, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.11] F. Cohen and J. Wang, Part II: 3-D Object Recognition and Shape Estimation from Image Contours, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.12] M. doCarmo, Differential Geometry of Curves and Surfaces, Englewood Cliffs, NJ, Prentice-Hall, 1976.
[4.13] U. Dhond and J. Aggarwal, Stereo Matching in the Presence of Narrow Occluding Objects using Dynamic Disparity Search, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), 1995.
[4.14] D. Dunn, W. Higgins, and J. Wakeley, Texture Segmentation using 2-D Gabor Elementary Functions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[4.15] I. Elfadel and R. Picard, Gibbs Random Fields, Co-occurrences, and Texture Modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.16] H. Greenspan, R. Goodman, R. Chellappa, and C. Anderson, Learning Texture Discrimination Rules in a Multiresolution System, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[4.17] M. Gurelli and L. Onural, On a Parameter Estimation Method for Gibbs-Markov Random Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[4.18] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume I, Reading, MA, Addison-Wesley, 1992.
[4.19] G. Healey and R. Kondepudy, Radiometric CCD Camera Calibration and Noise Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[4.20] Y. Hel-Or and M. Werman, Pose Estimation by Fusing Noisy Data of Different Dimensions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[4.21] A. Jain and K. Karu, Learning Texture Discrimination Masks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[4.22] L. Kaplan and C. Kuo, Texture Roughness Analysis and Synthesis via Extended Self-similar (ESS) Model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[4.23] D. Keren, D. Cooper, and J. Subrahmonia, Describing Complicated Objects by Implicit Polynomials, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[4.24] J. Maintz, P. van den Elsen, and M. Viergever, Evaluation of Ridge Seeking Operators for Multimodality Medical Image Matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 1996.
[4.25] B. Mandelbrot and J. Van Ness, Fractional Brownian Motions, Fractional Noises, and Applications, SIAM Review, 10, October, 1968.
[4.26] S. Marapan and M. Trivedi, Multi-primitive Hierarchical (MPH) Stereo Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[4.27] D. Marr and T. Poggio, Cooperative Computation of Stereo Disparity, Science, 194,
pp. 283-287, October, 1976.
[4.28] P. McLauchlan and D. Murray, Active Camera Calibration for a Head-eye Platform
Using Variable State-dimension Filter, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(1), 1996.
[4.29] N. Page, W. Snyder, and S. Rajala, Turbine Blade Image Processing System,
In Advanced Software for Robotics, ed. A Danthine, Amsterdam, North-Holland,
1984.
[4.30] S. Peleg, M. Ben-Ezra, and Y. Pritch, Omnistereo: Panoramic Stereo Imaging, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[4.31] L. Quan, Invariants of Six Points and Projective Reconstruction from Three Uncalibrated Images, IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995
[4.32] A. Shashua, Projective Structure from Uncalibrated Images: Structure from Motion
and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(8), 1994.
[4.33] A. Shashua, Algebraic Functions for Recognition, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(8), 1995.
[4.34] A. Shashua and N. Navab, Relative Affine Structure: Canonical Model for 3D From
2D Geometry and Applications, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
[4.35] P. Smith and N. Nandhakumar, An Improved Power Cepstrum Based Stereo Correspondence Method for Textured Scenes, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(3), 1996.
[4.36] W. Snyder, H. Qi, and W. Sander, A Hexagonal Coordinate System, SPIE Medical
Imaging: Image Processing, Pt. 12, pp. 716-727, February, 1999.
[4.37] G. Wei and S. Ma, Implicit and Explicit Camera Calibration: Theory and Experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[4.38] X. Zhuang and Y. Huang, Robust 3-D 3-D Pose Estimation, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(8), 1994.
$$D(\alpha f_1 + \beta f_2) = \alpha D(f_1) + \beta D(f_2) \tag{5.1}$$

where f₁ and f₂ are images, and α and β are scalar multipliers, then we say that D is a
linear operator.

A gedankenexperiment

Consider the operator D(f) = af + b, for scalar constants a and b. Is D a linear operator?
We suggest you work this out for yourself before reading the solution. It certainly
LOOKS linear. Multiplication by a constant followed by addition of a constant. If
f were a scalar variable, then D describes the equation of a line, which SURELY is
linear (isn't it?)! OK. Let's prove it. Using Eq. (5.1), we evaluate

D(αf₁ + βf₂) = a(αf₁ + βf₂) + b = aαf₁ + aβf₂ + b,
1
The authors are grateful to Bilge Karacali, Rajeev Ramanath, and Lena Soderberg for their assistance in
producing the images used in this chapter.
This case occurs when both α and β take on only the values −1, 0, and 1. In this
case, Eq. (5.2) expands to

g(x, y) = f(x − 1, y − 1)h(−1, −1) + f(x, y − 1)h(0, −1)
        + f(x + 1, y − 1)h(1, −1) + f(x − 1, y)h(−1, 0) + ⋯ (5.3)

Remember, y = 0 is at the TOP of the image and y increases as you go down in the image.
Fig. 5.1. A one-dimensional image with five pixels and a one-dimensional kernel with three
pixels. The subscript is the x-coordinate of the pixel.
Table 5.1.
h(−1, −1)   h(0, −1)   h(1, −1)
h(−1, 0)    h(0, 0)    h(1, 0)
h(−1, 1)    h(0, 1)    h(1, 1)
To better capture the essence of Eq. (5.3), let us write h as a 3 × 3 grid of numbers
(yes, we used the word grid rather than array intentionally), as in Table 5.1.
Now we imagine that we place this grid down on top of the image so that the
center of the grid is directly over pixel f (x, y); then each h value in the grid is
multiplied by the corresponding point in the image. We will refer to the grid of h
values henceforth as a kernel.
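A minimal sketch of this "place the grid over the image" operation follows; the choice to skip border pixels is ours, and the example kernel is the derivative estimator of Eq. (5.6).

```python
import numpy as np

# Minimal sketch: at each pixel, multiply the 3 x 3 kernel by the
# corresponding neighborhood and sum. Border pixels are simply skipped here.
def kernel_operator(f, h):
    rows, cols = f.shape
    g = np.zeros_like(f, dtype=float)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neighborhood = f[r - 1:r + 2, c - 1:c + 2]
            g[r, c] = np.sum(neighborhood * h)     # sum of products
    return g

f = np.arange(25, dtype=float).reshape(5, 5)       # a simple brightness ramp
h = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]]) / 6.0                   # the kernel of Eq. (5.6)
print(kernel_operator(f, h))                       # interior values are all 1.0
```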
5.2.1
Mathematically,
convolution and
correlation differ in the
left-right order of
coordinates.
5.2.2
$$g(x, y) = \sum_{\alpha}\sum_{\beta} f(x - \alpha,\, y - \beta)\, h(\alpha, \beta). \tag{5.5}$$
The observant student will have noticed a discrepancy in order between Eqs. (5.4)
and (5.5). In formal convolution, as given by Eq. (5.5), the arguments reverse: the
right-most pixel of the kernel (h₁) is multiplied by the left-most pixel in the corresponding region of the image (f₂). However, in Eq. (5.4), we think of placing the
kernel down over the image and multiplying corresponding pixels. If we multiply
corresponding pixels, left-left and right-right, we have correlation. There is, unfortunately, a misnomer in much of the literature: both may be called convolution.
We advise the student to watch for this. In many publications, the authors use the
term convolution when they really mean sum of products. In order to avoid confusion, in this book, we will avoid the use of the word convolve unless we really
do mean the application of Eq. (5.5), and instead use the term kernel operator,
when we mean Eq. (5.4).
Derivative estimated at this point

But this kernel is aesthetically unpleasing: the estimate at x depends on the value
at x and at x + 1, but not at x − 1; why? We actually like a symmetric definition
better, such as

$$\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0 - \Delta x)}{2\,\Delta x},$$

which, sampled with Δx = 1, corresponds to the one-dimensional kernel [−1/2  0  1/2].
Estimates like this may be averaged over adjacent rows to reduce the effects of noise,
producing, for example,

$$\left.\frac{\partial f}{\partial x}\right|_{x_0} = \frac{1}{6}\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \otimes f \tag{5.6}$$
where we have introduced a new symbol, ⊗, which will denote the sum-of-products
implementation described above. The literature abounds with kernels like this one.
All of them combine the concept of estimating the derivative by differences, and
then averaging the result in some way to compensate for noise. Probably the best
known of these ad hoc kernels is the Sobel:
1
0
1
f
1
(5.7)
= 2 0 2 f .
x x0
8
1 0 1
The Sobel operator has the benefit of being center-weighted.
f(x, y) = ax + by + c. (5.8)

Then, we may consider the edge strength using the two numbers ∂f/∂x = a,
∂f/∂y = b, and the rate of change of brightness at the point (x, y) is represented by
the gradient vector

$$\nabla f = \left[\frac{\partial f}{\partial x} \;\; \frac{\partial f}{\partial y}\right]^T = [a \;\; b]^T. \tag{5.9}$$
The approach followed here is to find a, b, and c given some noisy, blurred measurement of f, and the assumption of Eq. (5.8).
To find those parameters, first observe that Eq. (5.8) may be written as f(x, y) = AᵀX,
where the vectors A and X are Aᵀ = [a b c] and Xᵀ = [x y 1].
Suppose we have measured brightness values g(x, y) at a collection of points
in Z × Z (Z is the set of integers) in the image. Over that set of points, we
wish to find the plane which best fits the data. To accomplish this objective, write
the error as a function of the measurement and the (currently unknown) function
f(x, y).
A sum-squared error
objective function

$$E = \sum \big(f(x, y) - g(x, y)\big)^2 = \sum \big(A^T X - g(x, y)\big)^2.$$

Fig. 5.2. The brightness in an image can be thought of as a surface, a function of two variables.
The slopes of the tangent plane are the two spatial partial derivatives.

Expanding the square, eliminating the functional notation for simplicity, and carrying the sum
through, we have

$$E = \sum A^T X X^T A - 2\sum A^T X g + \sum g^2 = A^T\Big(\sum X X^T\Big)A - 2A^T \sum X g + \sum g^2. \tag{5.10}$$
Let's call Σ X Xᵀ ≡ S (it is the scatter matrix) and see what Eq. (5.10) means:
consider a neighborhood which is symmetric about the origin. In that neighborhood, suppose x and y only take on values of −1, 0, and 1. Then

$$S = \sum X X^T = \sum \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} [x \;\; y \;\; 1] = \sum \begin{bmatrix} x^2 & xy & x \\ xy & y^2 & y \\ x & y & 1 \end{bmatrix}$$

which, for the neighborhood described, is

$$\begin{bmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 9 \end{bmatrix}.$$
More detail on how the
elements of the scatter
matrix are derived. Do not
forget the positive
direction for y is down.
Taking the derivative of E with respect to A and setting it equal to zero gives

$$2\begin{bmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 9 \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix} = 2\begin{bmatrix} \sum g(x, y)\,x \\ \sum g(x, y)\,y \\ \sum g(x, y) \end{bmatrix}.$$
From the first row,

$$\frac{\partial f}{\partial x} = a = \frac{1}{6}\sum g(x, y)\,x.$$

So, to compute the derivative from a fit to a neighborhood at each of the nine points in
the neighborhood, take the measured value at that point, multiply by its x coordinate,
and add them up. Let's write down the x coordinates in tabular form:

−1  0  1
−1  0  1
−1  0  1
That is precisely the kernel of Eq. (5.6), which we derived intuitively. Now, we have
it derived formally. Doesn't it give you a warm fuzzy feeling when theory agrees with
intuition?!? (Whoops, we forgot to multiply each term by 1/6, but we can simply
factor the 1/6 out, and when we get the answer, we will just divide by 6.)
We accomplished this by using an optimization method, in this case, minimizing
the squared error, to find the coefficients of a function f (x) in an equation of the
form y = f (x), where f is polynomial. Recall from section 4.1.2 that this form is
referred to as an explicit functional representation.
One more terminology issue: In future material, we will use the expression radius
of a kernel. The radius is the number of pixels from the center to the nearest edge.
For example, a 3 × 3 kernel has a radius of one. A 5 × 5 kernel has a radius of 2,
etc. It is possible to design kernels which are circular, but most of the time, we use
squares.
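The derivation above can also be checked numerically. The following minimal sketch fits the plane by least squares over a 3 × 3 neighborhood and recovers exactly the weights of Eq. (5.6).

```python
import numpy as np

# Minimal sketch verifying the derivation above: fitting a plane
# f(x, y) = ax + by + c to a 3 x 3 neighborhood by least squares gives, for
# the coefficient a, exactly the weights (1/6)[-1 0 1; -1 0 1; -1 0 1].
coords = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]
X = np.array([[x, y, 1.0] for (x, y) in coords])      # one row [x y 1] per pixel

# A = (X^T X)^{-1} X^T g, so the weights applied to g for a = df/dx are the
# first row of the pseudo-inverse.
weights_for_a = np.linalg.pinv(X)[0].reshape(3, 3)
print(np.round(weights_for_a, 4))
# [[-0.1667  0.      0.1667]
#  [-0.1667  0.      0.1667]
#  [-0.1667  0.      0.1667]]
```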
In this section, we find image gradients again, in exactly the same way, but this
time with hexagonally arranged pixels. Refer to section 4A.1 for a discussion of the
coordinate system. It is a different presentation of the same material, and if you read
both presentations carefully, you will understand the concepts more clearly.
To find the gradient of intensity in an image, we will fit a plane to the data
in a small neighborhood. This plane will be represented in the form of Eq. (5.8).
We then take partial derivatives with respect to u and v to find the gradient of
intensity in those corresponding directions. We choose a neighborhood of six points,
surrounding a central point, and fit the plane to them. Define the set of data points as
z_i, (i = 1, . . . , 6). Then the following expression represents the error in fitting these
six points to a plane parameterized by a, b, and c:
$$E = \sum_{i=1}^{6} \big(z_i - (a u_i + b v_i + c)\big)^2. \tag{5.11}$$
$$E = \sum_{i=1}^{6} \big(z_i - A^T Z_i\big)^2 \tag{5.12}$$

$$= \sum_{i=1}^{6} \Big(z_i^2 - 2 z_i A^T Z_i + A^T Z_i\, A^T Z_i\Big). \tag{5.13}$$
$$S = \sum \begin{bmatrix} u_i^2 & u_i v_i & u_i \\ u_i v_i & v_i^2 & v_i \\ u_i & v_i & 1 \end{bmatrix} = \begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & 0 \\ 0 & 0 & 6 \end{bmatrix}.$$
(One finds the numerical values by summing the u and v coordinates over each of
the pixels in the neighborhood, as illustrated in Fig. 4.15, assuming the center pixel
is at location 0, 0.)
Define

$$\sum z_i u_i \equiv \Sigma_u, \qquad \sum z_i v_i \equiv \Sigma_v,$$

and set the partial derivatives of Eq. (5.15) equal to zero to produce a pair of
simultaneous equations,

$$4a - 2b = \Sigma_u, \qquad -4a + 8b = 2\Sigma_v, \tag{5.16}$$

with solution

$$b = \frac{1}{6}\left(2\Sigma_v + \Sigma_u\right) \tag{5.17}$$

$$a = \frac{1}{6}\left(\Sigma_v + 2\Sigma_u\right). \tag{5.18}$$
Similarly,
$$f(x, y) = \begin{bmatrix} 1 & 2 & 4 & 1 \\ 7 & 3 & 2 & 8 \\ 9 & 2 & 1 & 4 \\ 4 & 1 & 2 & 3 \end{bmatrix}, \qquad F = [1\;2\;4\;1\;7\;3\;2\;8\;9\;2\;1\;4\;4\;1\;2\;3]^T.$$
Fig. 5.3. The kernels used to estimate the gradient of brightness in the u direction, and in the v
direction.
Consider, as an example, the kernel

$$h = \begin{bmatrix} 1 & 0 & 2 \\ 2 & 0 & 4 \\ 3 & 9 & 1 \end{bmatrix}.$$

Applying this kernel at pixel (1, 1) of the 4 × 4 image is equivalent to taking the inner
product of F with the vector

H₅ = [1 0 2 0  2 0 4 0  3 9 1 0  0 0 0 0]ᵀ.

Now you try it. Determine what vector to use to apply this kernel at (2, 2). Did you
get this?

H₁₀ = [0 0 0 0  0 1 0 2  0 2 0 4  0 3 9 1]ᵀ.

Similarly, for the pixel at (2, 1),

H₆ = [0 1 0 2  0 2 0 4  0 3 9 1  0 0 0 0]ᵀ.
Compare H5 at (1, 1) and H6 at (2, 1). They are the same except for a rotation.
We could convolve the entire image by constructing a matrix in which each column
is one such H; its columns include the vectors H₅ and H₆ shown above. By
producing the product G = HᵀF, G will be the (vector form of) convolution of
image F with kernel h.
Some observations about this process:
$$V = \sum_{i=1}^{9} a_i \mathbf{u}_i,$$

where one obvious choice of basis is

u₁ = [1 0 0 0 0 0 0 0 0]ᵀ
u₂ = [0 1 0 0 0 0 0 0 0]ᵀ
⋮
u₉ = [0 0 0 0 0 0 0 0 1]ᵀ,
which, while convenient and simple, does not help us at all here. Could another basis
be more useful? (Answer: yes.) Before we figure out what, do you remember how
many possible basis vectors there are for this real-valued 9-space? The answer is
a zillion. With so many choices, we should be able to pick some good ones. To
accomplish that, recall the role of the coefficients a_i. Recall that if some particular a_i
is much larger than all the other a's, it means that V is very similar to u_i. Computing
the a's then allows us a means to find which of a set of prototype neighborhoods a
particular image most resembles.
Fig. 5.4 illustrates a set of prototype neighborhoods developed by Frei and Chen
[5.12]. Notice that neighborhood (u1 ) is negative below and positive above the
horizontal center line, and therefore is indicative of a horizontal edge, or a point
where ∂f/∂y is large.
Now recall how to compute the projection ai . The scalar-valued projection of a
vector V onto a basis vector u_i is the inner product a_i = Vᵀu_i.
Fig. 5.4. The nine prototype neighborhoods (basis vectors) u₁ through u₉ of Frei and Chen [5.12], each written as a 3 × 3 kernel.
One way to determine how similar a neighborhood about some point is to a vertical
edge is to compute the inner product of the neighborhood vector with the vertical
edge basis vector. One final question: What is the difference between calculating
this projection and convolving the image at that point with a kernel which estimates
∂f/∂x? The answer is left as an exercise to the student. (Don't you wish they were
all this easy?)
So now you know all there is to know (almost) about linear operators and kernel
operators. Let's move on to an application to which we have already alluded: finding
edges.
$$h_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \tag{5.19}$$

estimates ∂f/∂x, and

$$h_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \tag{5.20}$$

estimates ∂f/∂y.
Some other forms have appeared in the literature that you should know about for
historical purposes.

Remember the Sobel operator?

$$h_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \tag{5.21}$$
Important
(This will give you trouble for the entire semester, so you may as well start now.)
In software implementations, the positive y direction is DOWN! This results from
the fact that scanning is top-to-bottom, left-to-right. So pixel (0, 0) is the upper left
corner of the image. Furthermore, numbering starts at zero, not one. We find the best
way to avoid confusion is to never use the words "x" and "y" in writing programs,
but instead use "row" and "column," remembering that row 0 is on top.
However, in these notes, we will use conventional Cartesian coordinates in order
to get the math right, and to further confuse the student (which is, after all, what
Professors are there for. Right?).
Having cleared up that muddle, let us proceed. Given the gradient vector

$$\nabla f = \left[\frac{\partial f}{\partial x} \;\; \frac{\partial f}{\partial y}\right]^T \equiv [G_x \;\; G_y]^T \tag{5.22}$$

we are interested in its magnitude

$$|\nabla f| = \sqrt{G_x^2 + G_y^2} \tag{5.23}$$
$$d = \frac{\partial}{\partial x}(g \otimes h) \tag{5.24}$$

where now g is the measured image, h is a Gaussian, and d will be our new derivative
estimate image. Now, a crucial point from linear systems theory:
For linear operators D and ⊗,

$$D(g \otimes h) = D(h) \otimes g. \tag{5.25}$$
Equation (5.25) means we do not have to do blurring in one step and differentiation
in the next; instead, we can pre-compute the derivative of the blur kernel and simply
apply the resultant kernel.
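A minimal sketch of Eq. (5.25) in one dimension: convolving the derivative operator into the Gaussian once, and applying the combined kernel, gives the same result as blurring and then differentiating. The particular σ and the central-difference derivative are assumed example choices.

```python
import numpy as np

# Minimal sketch of Eq. (5.25): instead of blurring with a Gaussian and then
# differentiating, we may fold the derivative operator into the Gaussian once
# and apply the resulting derivative-of-Gaussian kernel directly.
sigma = 2.0
x = np.arange(-8, 9, dtype=float)
gauss = np.exp(-x**2 / (2 * sigma**2))
gauss /= gauss.sum()                        # discrete blur kernel
deriv = np.array([0.5, 0.0, -0.5])          # central-difference derivative

rng = np.random.default_rng(0)
g = rng.random(200)                         # a 1-D "image"

# Route 1: blur, then differentiate.
route1 = np.convolve(np.convolve(g, gauss), deriv)
# Route 2: differentiate the blur kernel once, then apply it to the image.
dog = np.convolve(deriv, gauss)             # derivative-of-Gaussian kernel
route2 = np.convolve(g, dog)

print(np.allclose(route1, route2))          # True: the two routes agree
```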
Let's see if we can remember how to take a derivative of a 2D Gaussian (did you
forget it is a 2D function?).
A d-dimensional multivariate Gaussian has the general form

$$\frac{1}{(2\pi)^{d/2}|K|^{1/2}} \exp\!\left(-\frac{[\mathbf{x} - \mu]^T K^{-1} [\mathbf{x} - \mu]}{2}\right) \tag{5.26}$$
where K is the covariance matrix and μ is the mean vector. Since we want a Gaussian
centered at the origin (which will be the center pixel), μ = 0, and since we have no
reason to prefer one direction over another, we choose K to be diagonal (isotropic):

$$K = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} = \sigma^2 I. \tag{5.27}$$
$$h(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \tag{5.28}$$

and

$$\frac{\partial}{\partial x}h(x, y) = -\frac{x}{2\pi\sigma^4} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right). \tag{5.29}$$
(5.29)
If our objective is edge detection, we are done. However, if our objective is precise
estimation of derivatives, particularly higher order derivatives, use of a Gaussian
kernel, since it blurs the image, clearly introduces errors which can only be partially
compensated for [5.39]. Nevertheless, this is one of the most simple ways to develop
effective derivative kernels.
For future reference, here are a few of the derivatives of the one-dimensional
Gaussian. Even though there is no particular need for the normalizing √(2π) for most
of our needs (it just ensures that the Gaussian integrates to one), we have included it.
That way these formulae are in agreement with the literature. The subscript notation
is used here to denote derivatives. That is,

$$G_{xx}(\sigma, x) \equiv \frac{\partial^2}{\partial x^2} G(\sigma, x),$$

where G(σ, x) is a Gaussian function of x with mean of zero and standard deviation σ.
$$G(\sigma, x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$$

$$G_x(\sigma, x) = -\frac{x}{\sqrt{2\pi}\,\sigma^3} \exp\!\left(-\frac{x^2}{2\sigma^2}\right) \tag{5.30}$$

$$G_{xx}(\sigma, x) = \left(\frac{x^2}{\sqrt{2\pi}\,\sigma^5} - \frac{1}{\sqrt{2\pi}\,\sigma^3}\right) \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$$

$$G_{xxx}(\sigma, x) = \frac{x}{\sqrt{2\pi}\,\sigma^5}\left(3 - \frac{x^2}{\sigma^2}\right) \exp\!\left(-\frac{x^2}{2\sigma^2}\right).$$
Let's look in a bit more detail at how to make use of these formulae and their
two-dimensional equivalents to derive kernels.
The simplest way to get the kernel values for the derivatives of a Gaussian is
to simply substitute x = 0, 1, 2, etc., along with their negative values, which yields
numbers for the kernel. The first problem to arise is: what should σ be? To address these questions, we will derive the elements of the kernel used for the second
derivative of a one-dimensional Gaussian. The other derivatives can be developed
using the same philosophy. Take a look at Fig. 5.6(a) and ask, is there a value of σ
such that the maximum of the second derivative occurs at x = −1 and x = 1?
Clearly there is, and its value is σ = 1/√3. Given this value of σ, we can compute
the values of the second derivative of a Gaussian at the integer points x = {−1, 0, 1}.
At x = 0, we find G_xx(1/√3, 0) = −2.07, and at x = 1, G_xx(1/√3, 1) = 0.9251.
So are we finished? That wasn't so hard, was it? Unfortunately, we are not done. It is
very important that the elements of the kernel sum to zero. If they don't, then iterative
algorithms like those described in Chapter 6 will not maintain the proper brightness
levels over many iterations. The kernel also needs to be symmetric. That essentially
defines the second derivative of a Gaussian. The most reasonable set of values close
to those given, which satisfy symmetry and summation to zero, are {1, −2, 1}.
However, this does not teach us very much. Let's look at a 5 × 1 kernel and see
if we can learn a bit more. We require the following.
- The elements of the kernel should approximate the values of the appropriate derivative of a Gaussian as closely as possible.
- The elements must sum to zero.
- The kernel should be symmetric about its center, unless you want to do special processing.
It is very important that the kernel values integrate to zero, and not quite so important that the
actual values be precise. So what do we do in a case like this? We use constrained
optimization. One strategy is to set up a problem to find a second derivative of a
Gaussian, which has these values as closely as possible, but which integrates to
zero. For more complex problems, the authors use Interopt [5.3] to solve numerical
optimization problems, but you can solve this problem without using numerical
methods. This is accomplished as follows. First, understand the problem (presented
for the case of five points given above): We wish to find five numbers as close as
possible to [0.0565, 0.9251, −2.0730, 0.9251, 0.0565] which satisfy the constraint
that the five sum to zero. By symmetry, we actually only have three numbers, which
we will denote [a, b, c]. For notational convenience, introduce three constants α =
0.0565, β = 0.9251, γ = −2.073. Thus, to find a, b, and c which resemble these
numbers, we write the mean squared error (MSE) form
$$H_0(a, b, c) = 2(a - \alpha)^2 + 2(b - \beta)^2 + (c - \gamma)^2. \tag{5.31}$$

H is the constrained version of H₀.
Using the concept of Lagrange multipliers, we can find the best choice of a, b, and
c by minimizing a different objective function

$$H(a, b, c) = 2(a - \alpha)^2 + 2(b - \beta)^2 + (c - \gamma)^2 + \lambda(2a + 2b + c). \tag{5.32}$$
A few words of explanation are in order for those students who are not familiar with
constrained optimization using Lagrange multipliers. The term with the λ in front
(λ is the Lagrange multiplier) is the constraint. It is formulated such that it is exactly
equal to zero if we should find the proper a, b, and c. By minimizing H, we will find
the parameters which minimize H₀ while simultaneously satisfying the constraint.
To minimize H, take the partials and set them equal to zero:

$$\frac{\partial H}{\partial a} = 4a - 4\alpha + 2\lambda, \qquad \frac{\partial H}{\partial b} = 4b - 4\beta + 2\lambda, \qquad \frac{\partial H}{\partial c} = 2c - 2\gamma + \lambda. \tag{5.33}$$
Setting the partial derivatives equal to zero, simplifying, and adding the constraint,
we find the following set of linear equations:

$$a = \alpha - \frac{\lambda}{2}, \qquad b = \beta - \frac{\lambda}{2}, \qquad c = \gamma - \frac{\lambda}{2}, \qquad 2a + 2b + c = 0. \tag{5.34}$$
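The small system of Eq. (5.34) can be solved numerically; here is a minimal sketch that does so and confirms the resulting five-element kernel sums to zero. The matrix arrangement of the unknowns (a, b, c, λ) is our own bookkeeping.

```python
import numpy as np

# Minimal sketch: solve the constrained fit of Eq. (5.34) and check the
# result. Unknowns are a, b, c, and the Lagrange multiplier lam.
alpha, beta, gamma = 0.0565, 0.9251, -2.0730    # target kernel values

# From Eq. (5.34): a + lam/2 = alpha, etc., plus the constraint 2a + 2b + c = 0.
M = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.5],
              [2.0, 2.0, 1.0, 0.0]])
rhs = np.array([alpha, beta, gamma, 0.0])
a, b, c, lam = np.linalg.solve(M, rhs)

kernel = np.array([a, b, c, b, a])
print(np.round(kernel, 4))      # close to the target values
print(round(kernel.sum(), 10))  # 0.0: the constrained kernel sums to zero
```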
We can proceed in the same way to compute the kernels to estimate the partial
derivatives using Gaussians in two dimensions.
One implementation of the first derivative with respect to x, assuming an isotropic
Gaussian, is presented in Fig. 5.7. You will have the opportunity to derive others as
homeworks.
In this chapter, we have explored the idea of edge operators based on kernel operators. We discovered that no matter what, noisy images result in edges which are
imperfect in several ways. That is just life; we cannot do any better with simple kernels. In Chapter 6, we
will explore some approaches to these problems.
As we hope you have guessed, there are other ways of finding edges in images
besides simply thresholding a derivative. In later sections, we will mention a few of
them.
5.7.1
$$\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} \tag{5.35}$$
The product of two transforms means, for each spatial frequency value (each
combination of ω_x and ω_y), multiply the values of the two functions. (Just in case
you do not remember the details, in general, these values are complex numbers.)
- Transform f: N² log N.
- Transform h: L² log L.
- Perform appropriate operations, such as padding, to get H and F the same size.
- Multiply H by F: N².
- Inverse transform the result: N² log N.
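A minimal sketch of the transform-multiply-invert procedure, checked against a direct sum-of-products convolution; the image and kernel sizes are assumed example values.

```python
import numpy as np

# Minimal sketch: convolve an N x N image with an L x L kernel by padding
# both to a common size, multiplying their FFTs, and inverse transforming.
# The result matches a direct (full) convolution.
rng = np.random.default_rng(1)
N, L = 64, 9
f = rng.random((N, N))
h = rng.random((L, L))

size = (N + L - 1, N + L - 1)                   # padded size for a full convolution
F = np.fft.fft2(f, size)
H = np.fft.fft2(h, size)
g_fft = np.real(np.fft.ifft2(F * H))            # multiply transforms, invert

# Direct full convolution, written out as a (slow) sum of products.
g_direct = np.zeros(size)
for i in range(L):
    for j in range(L):
        g_direct[i:i + N, j:j + N] += h[i, j] * f

print(np.allclose(g_fft, g_direct))             # True
```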
Scale space is a recent addition to the well-known concept of image pyramids, first
used in picture processing by Kelly [5.19] and later extended in a number of ways
(see [5.5, 5.8, 5.30, 5.32], and many others). In a pyramid, a series of representations
of the same image are generated, each created by a 2 : 1 subsampling (or averaging)
of the image at the next higher level (Fig. 5.9).
In Fig. 5.10, a Gaussian pyramid is illustrated. It is generated by blurring each
level with a Gaussian prior to 2 :1 subsampling. An interesting question should arise
as you look at this gure. Could you, from all the data in this pyramid, reconstruct
the original image? The answer is no, because at each level, you are throwing away
high-frequency information.
Although the Gaussian pyramid alone does not contain sufficient information
to reconstruct the original image, we could construct a pyramid that does contain
sufficient information. To do that, we use a Laplacian pyramid, constructed by
computing a similar representation of the image; this preserves the high-frequency
information (Fig. 5.11). Combining the two pyramid representations allows reconstruction of the original image.
In a modern scale space representation we preserve the concept that each level
is a blurring of the previous level, but do not subsample each level is the same
size as the previous level, but more blurred. Normally, each level is generated by
convolving the original image with a Gaussian of variance σ², and σ varies from
one level to the next. This variance then becomes the scale parameter. Clearly, at
high levels of scale (σ large), only the largest features are visible. We will see more
about scale space later in this chapter, when we talk about wavelets.
Fig. 5.9. A pyramid is a data structure which is a series of images, in which each pixel is the
average of four pixels at the next lower level.
Fig. 5.10. A Gaussian pyramid, constructed by blurring each level with a Gaussian and then 2 : 1
subsampling.
5.9.1
Quad trees
A quad tree [5.21] is a data structure in which images are recursively broken into
four blocks, corresponding to nodes in a tree. The four blocks are designated NW
(northwest), NE, SW, and SE. The correspondence between the nodes in the tree
and the image are best illustrated by an example (see Fig. 5.12).
In encoding binary images, it is straightforward to come up with a scheme for
generating the quad tree for an image: If the quadrant is homogeneous (either solid
black or solid white), then make it a leaf, otherwise divide it into four quadrants and
add another layer to the tree. Repeat recursively until the blocks either reach pixel
size or are homogeneous.
It is easy to make a quad tree representation into a pyramid. It is only necessary to
keep, at each node, the average of the values of its children. Then, all the information
in a pyramid is stored in the quad tree.
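A minimal sketch of this recursive construction for a binary image whose side is a power of two; representing internal nodes as dictionaries keyed NW, NE, SW, SE is our own choice of data structure.

```python
import numpy as np

# Minimal sketch of the recursive quad tree construction described above.
# Homogeneous blocks become leaves; other blocks are split into
# NW, NE, SW, SE children.
def quadtree(img):
    if img.min() == img.max():                   # homogeneous: make a leaf
        return int(img[0, 0])
    h, w = img.shape
    return {"NW": quadtree(img[:h // 2, :w // 2]),
            "NE": quadtree(img[:h // 2, w // 2:]),
            "SW": quadtree(img[h // 2:, :w // 2]),
            "SE": quadtree(img[h // 2:, w // 2:])}

img = np.zeros((8, 8), dtype=int)
img[0:4, 4:8] = 1                                # solid NE quadrant
img[6, 1] = 1                                    # one stray pixel in SW
tree = quadtree(img)
print(tree["NE"])    # 1   (a single leaf)
print(tree["NW"])    # 0
print(tree["SW"])    # nested dictionaries down to the stray pixel
```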
Fig. 5.12. An image is divided into four blocks. Each inhomogeneous block is further divided.
This partitioning may be represented by a tree.

If an image has large homogeneous regions, a quad tree would seem to be an
efficient way to store and transmit an image. However, experiments with a variety
of images, even images which were the difference between two frames in a video
87
sequence, have shown that this is not true. Since the difference image is only nonzero
where things are moving, it seems obvious that this, mostly zero, image would be
efciently stored in a quad tree. Not so. Even in that case, the overhead of managing
the tree overwhelms the storage gains. So, surprisingly, the quad tree is not an
efcient image compression technique. When used as a means for representing a
pyramid, it does, however, have advantages as a way of representing scale space.
Another disadvantage of using quad trees is that a slight movement of an object
can result in radically different tree representations, that is, the tree representation
is not rotation or translation invariant. In fact, it is not even robust. Here, robust
means a small translation of an object results in a correspondingly small change in the
representation. One can get around this problem, to some extent, by not representing
the entire image, but instead, representing each object subimage with a quad tree.
The generalization of the quad tree to three dimensions is called an octree. The
same principles apply.
5.9.2
A good way to remember
large and small scale is
that at large scale, only
large objects may be
distinguished.
The last property is certainly debatable, since convolution formally requires a linear, space-invariant operator. One interesting approach to scale space which violates
this requirement is to produce a scale space by using gray scale morphological
88
smoothing (we will discuss this later) with larger and larger structuring elements
[5.16].
You could use scale space concepts to represent texture [4.16] or even a probability density function (in which case, your scale space representation becomes a
clustering algorithm [5.24]) as well as brightness. We will see applications of scale
representations as we proceed through the course.
One of the most interesting aspects of scale space representations is the behavior
of our old friend, the Gaussian. The second derivative of the Gaussian (in two
dimensions, the Laplacian of Gaussian: LOG) has been shown [5.27] to have some
very nice properties when used as a kernel. In particular, the zero crossings of the
LOG are good indicators of the location of an edge. One might be inclined to ask,
Is the Gaussian the best smoothing operator to use to develop a kernel like this?
Said another way: We want a kernel whose second derivative never generates a new
zero crossing as we move to larger scale. In fact, we could state this desire in the
following more general form.
Let our concept of a feature be a point where some operator has an extremum,
either maximum or minimum. The concept of scale space causality says that as
scale increases, as images become more blurred, new features are never created. The
Gaussian is the ONLY kernel (linear operator) with this property [5.1, 5.2]. Studies
of nonlinear operators have been done to see under what conditions these operators
are scale space causal [5.22].
This idea of scale space causality is illustrated in the following example. Fig. 5.13
illustrates the brightness prole along a single line from an image, and the scale space
created by blurring that single line with one-dimensional Gaussians of increasing
variance. In Fig. 5.14, we see the Laplacian of the Gaussian, and the points where
the Laplacian changes sign. The features in this example, the zero crossings (which
are good candidates for edges) are indicated in the right image. Observe that
feature points (in this case, zero crossings) are never created as scale increases. As
we go from top (low scale) to bottom (high scale), some features
disappear, but no new ones are created.
One obvious application of this idea is to identify the important edges in the image
first. We can do that by going up in scale, finding those few edges, and then tracking
them down to lower scale.
Fig. 5.13. (a) Brightness profile of a scanline through an image. (b) Scale space representation
of that scanline. Scale increases toward the bottom, so no new features should be
created as one goes from top to bottom.
Fig. 5.14. Laplacian of the scale space representation, and the zero crossings of the Laplacian.
Since this is one-dimensional data, there is no difference between the Laplacian and
the second derivative.
$$I_a = \frac{1}{N}\sum_{i=1}^{N} \frac{I_i}{1 + d_i^2} \tag{5.36}$$
We were going to put a dead cat joke here, something like One of the cats died during the procedure, but its
behavior was unchanged, but the publisher told us people would be offended, so we had to remove it.
Fig. 5.15. Gabor filter. Note that the positive/negative response is very similar to those that we
derived earlier in this chapter.
Fig. 5.16. Comparison of a section through a Gabor filter and a similar section through a fourth
derivative of a Gaussian.
through a Gabor and a fourth derivative of a Gaussian. You can see noticeable differences, the principal one being that the Gabor goes on forever, whereas the fourth derivative has only three extrema. However, to the precision available to neurology experiments, they are the same. The problem is simply that the measurement of the data is not sufficiently accurate, and it is possible to fit a variety of curves to it.
Bottom line: We have barely a clue how the brain works, and we don't really know very much about the retina. There are two or three mathematical models which adequately model the behavior of receptive fields.
5.12 Conclusion
Consistency in edge
detection.
Explicit use of consistency has not been made in this chapter. However, in Assignment 10.1, you will see an application of consistency to edge detection. In that problem, you will be asked to develop an algorithm which makes use of the fact that adjacent edge pixels have parallel gradients. That is, if pixel A is a neighbor of pixel B, and the gradient at pixel A is parallel (or nearly parallel) to the gradient at pixel B, this increases the confidence that both pixels are members of the same edge.
In this chapter, we have looked at several ways to derive kernel operators which,
when applied to images, result in strong responses for types of edges.
Minimize the sum-squared error.
Constrained optimization and Lagrange multipliers.
5.13 Vocabulary
You should know the meanings of the following terms.
Basis vector
Convolution
Correlation
Gabor filter
Image gradient
Inner product
Kernel operator
Lagrange multiplier
Lexicographic
Linear operator
LOG
Projection
Pyramid
Quad tree
Scale space
Sum-squared error
Assignment
5.1
The previous section showed how to estimate the first
derivative by fitting a plane. Clearly that will
not work for the second derivative, since the second
derivative of a plane is zero everywhere. Use the same
approach, but use a biquadratic
f(x,y) = ax² + by² + cx + dy + e.
Then [a b c d e]^T = A X, where X = [x²  y²  x  y  1]^T.
Find the kernel which estimates ∂²f/∂x² at the center point.
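A sketch of the fit in numpy (an assumption; the book's own software is not used here): build the design matrix for the biquadratic over a 5 × 5 neighborhood (the size is an illustrative choice), form the pseudoinverse, and read off the kernel that estimates ∂²f/∂x² = 2a.

```python
import numpy as np

# Coordinates of a 5 x 5 neighborhood, row-major, centered at the origin.
coords = [(x, y) for y in range(-2, 3) for x in range(-2, 3)]
A = np.array([[x * x, y * y, x, y, 1.0] for (x, y) in coords])   # 25 x 5 design matrix

# Least-squares fit: [a b c d e]^T = pinv(A) @ (pixel values in the same order).
P = np.linalg.pinv(A)                                            # 5 x 25

# d2f/dx2 = 2a, so the estimating kernel is twice the row of P that yields 'a'.
kernel_d2f_dx2 = (2.0 * P[0]).reshape(5, 5)
print(np.round(kernel_d2f_dx2, 4))
```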
Assignment
5.2
Oh no! Another part of the assignment! (Hey, this one is much easier than the one above -- a real piece of cake.) Using the same methods, find the 5 × 5 kernel which estimates ∂f/∂x at the center point, using the equation of a plane.
Assignment
5.3
Determine whether u1 and u2 in Fig. 5.4 are in fact orthonormal. If not, recommend a modification or other approach which will allow all the u's to be used as basis functions.
Assignment
5.4
(1) Write a program to generate an image which is 64 × 64, as illustrated below. (The figure shows rectangular regions of uniform brightness; the labeled values include brightnesses of 50 and 100 and dimensions of 25, 10, 32, 16, 5, and 32 pixels.) The image should contain areas of uniform brightness, with dimensions and brightness as shown. Save this in a file and call it SYNTH1.
(2) Write a program which reads in SYNTH1, and applies
the blurring kernel
1/10
5.5
Write a program to apply the following two kernels
(referred to in the literature as the Sobel operators) to images SYNTH1, BLUR1, and BLUR1.V1.
hx =
hy =
(1) Apply hx to the input, save the result as a temporary image in memory (remember that the numbers CAN
be negative).
(2) Apply hy to the input, save the result as another
array in memory.
(3) Compute a third array in which each point in the
array is the sum of the squares of the corresponding
points in the two arrays you just saved. Finally,
take the square root of each point. Save the result.
(4) Examine the values you get. Presumably, high values
are indicative of edges. Choose a threshold value
and compute a new image, which is one whenever the
edge strength exceeds your threshold and is zero
otherwise.
(5) Apply steps (1)--(4) to the blurred and noisy images
as well.
(6) Write a report. Include a printout of all three
binary output images. Are any edge points lost? Are
any points artificially created? Are any edges too
thick? Discuss sensitivity of the result to noise,
blur, and choice of threshold.
Be thorough; this is a research course which
requires creativity and exploring new ideas, as
well as correctly doing the minimum required by the
assignment.
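A minimal sketch of steps (1)-(4), assuming numpy/scipy and the standard Sobel kernels (the hx and hy matrices themselves are not reproduced above, so their values here are an assumption):

```python
import numpy as np
from scipy.ndimage import convolve

def sobel_edges(image, threshold):
    hx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)   # assumed Sobel kernels
    hy = hx.T
    gx = convolve(image.astype(float), hx)     # step (1): values may be negative
    gy = convolve(image.astype(float), hy)     # step (2)
    mag = np.sqrt(gx ** 2 + gy ** 2)           # step (3): gradient magnitude
    return (mag > threshold).astype(np.uint8)  # step (4): binary edge image
```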
Assignment
5.6
In Assignment 5.2, you derived a 5 × 5 kernel. Repeat Assignment 5.5 using that kernel for ∂/∂x, and the appropriate version for ∂/∂y.
Assignment
5.7
(1) Verify the mathematics we did in Eq. (5.30). Find a 3 × 3 kernel which implements the derivative-of-Gaussian vertical edge operator of Eq. (5.30). Use σ = 1 and σ = 2 and determine two kernels. Repeat for a 5 × 5 kernel. Discuss the impact of the choice of σ and its relation to kernel size. Assume the kernel may contain real (floating point) numbers.
Assignment
5.8
In section 5.7, parameters useful for developing discrete Gaussian kernels were discussed. Find the value of σ such that the maxima of the second derivative occur at x = 1 and x = -1. Is σ = 1/√3?
Assignment
5.9
Use the method of fitting a polynomial to estimate ∂²f/∂y². Which of the following polynomials would be most appropriate to choose?
(a) f = ax² + by + cxy
(b) f = ax² + by² + cxy + d
(c) f = ax³ + by³ + cxy
(d) f = ax + by + c
Assignment
5.10
Fit the following expression to pixel data in a 3 × 3 neighborhood: f(x,y) = ax² + bx + cy + d. From this fit, determine a kernel which will estimate the second derivative with respect to x.
Assignment
5.11
Use the function f = ax² + by² + cxy to find a 3 × 3 kernel which estimates ∂²f/∂y². Which of the following is the kernel which results? (Note: The following answers do not include the scale factor. Thus, the best choice below will be the one proportional to the correct answer.)
(a)-(f): six candidate 3 × 3 kernels.
Assignment
5.12
Suppose the kernel that estimates ∂²f/∂x∂y is
estimates ∂²f/∂x∂y? Explain your answer.
Topic 5A
Edge detectors
The process of edge detection includes more than simply thresholding the gradient. We want to know the location of the edge more precisely than simple gradient thresholds reveal. Two methods which seem to have acquired a substantial reputation in this area are the so-called Canny edge detector [5.6] and the facet model [4.18]. Here, we describe only the Canny edge detector.
5A.1
This is sometimes written using the plural, "nonmaxima suppression." The expression is ambiguous. It could be a compression of "suppress every point which is not a maximum," or "suppress all points which are not maxima." We choose to use the singular.
of the gradient, and one pixel in the reverse direction. If M(x, y) (the point in question) is
not the maximum of these three, set its value in N (x, y) to zero. Otherwise, the value of N is
unchanged.
After NMS, we have edges which are properly located and are only one pixel wide. These
new edges, however, still suffer from the problems we identified earlier: extra edge points due to noise (false hits) and missing edge points due either to blur or to noise (false misses). Some improvement can be gained by using a dual-threshold approach. Two thresholds are used, τ1 and τ2, where τ2 is significantly larger than τ1. Application of these two different
thresholds to N (x, y) produces two binary edge images, denoted T1 and T2 respectively. Since
T1 was created using a lower threshold, it will contain more false hits than T2 . Points in T2
are therefore considered to be parts of true edges. Connected points in T2 are copied to the
output edge image. When the end of an edge is found, points are sought in T1 which could
be continuations of the edge. This continuation proceeds until it connects with another T2
edge point, or no connected T1 points are found.
In [5.6] Canny also illustrates some clever approximations which provide significant speedups.
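A sketch of the dual-threshold linking, implemented with connected components rather than explicit edge following (a variant, not Canny's own code); N is the magnitude image after nonmaximum suppression and tau1 < tau2 are the two thresholds:

```python
import numpy as np
from scipy.ndimage import label

def hysteresis_link(N, tau1, tau2):
    t1 = N >= tau1                      # low threshold: more false hits
    t2 = N >= tau2                      # high threshold: assumed true edge points
    labels, n = label(t1)               # connected groups of low-threshold points
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(labels[t2])] = True  # keep any group that touches a strong point
    keep[0] = False                     # label 0 is background
    return keep[labels]
```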
5A.2
The derivative of g(x) is
the second derivative of
the image, so look for
zero crossings.
Examination of the literature in biological imaging systems, all the way back to the pioneering work of Hubel and Wiesel in the 1960s [5.13], suggests that biological systems analyze images by making local measurements which quantify orientation, scale, and motion. Keeping this in mind, suppose we wish to ask a question like "is there an edge at orientation θ at this point?" How might we construct a kernel that is specifically sensitive to edges at that orientation? A straightforward approach [5.37] is to construct a weighted sum of the two Gaussian first derivative kernels, G_x and G_y, using a weighting something like
G_θ = G_x cos θ + G_y sin θ.    (5.38)
Could you calculate the orientation selectivity? What is the smallest angular difference you could detect with a 3 × 3 kernel determined in this way?
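A sketch of Eq. (5.38) in numpy (kernel size and sigma are illustrative assumptions):

```python
import numpy as np

def oriented_kernel(theta, sigma=1.0, size=7):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    gx = -x / sigma ** 2 * g            # Gaussian first derivative in x
    gy = -y / sigma ** 2 * g            # Gaussian first derivative in y
    return np.cos(theta) * gx + np.sin(theta) * gy   # G_theta of Eq. (5.38)
```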
Unfortunately, unless quite large kernels are used, the kernels obtained in this way have rather poor orientation selectivity. In the event that we wish to differentiate across scale, the problem is even worse, since a scale space representation is normally computed rather coarsely, to minimize computation time. Perona [5.31] provides an approach to solving these problems.
5A.3
5A.4
Space/frequency representations
Wavelets are very important, but a thorough examination of this area is beyond the scope of this book. Therefore we present only a rather superficial description here, and provide some pointers to the literature. For example, Castleman [4.6] has a readable chapter on wavelets.
5A.4.1
Why wavelets?
Consider the image illustrated in Fig. 5.17. Clearly, the spatial frequencies appear to be different, depending on where one looks in the image. The Fourier transform has no mechanism for
capturing this intuitive need to represent both frequency and location. The Fourier transform
of this image will be a two-dimensional array of numbers, representing the amount of energy
in the entire image at each spatial frequency. Clearly, since the Fourier transform is invertible,
it captures all this spatial and frequency information, but there is no obvious way to answer
the question: at each position, what are the local spatial frequencies?
The wavelet approach is to add a degree of freedom to the representation. Since the
Fourier transform is complete and invertible, all we really need to characterize the image is a
single two-dimensional array. Instead however, following the space/frequency philosophy as
described in section 5.8, we use a three- (or higher) dimensional data structure. In this sense, the space/frequency representation is redundant (or "overcomplete"), and requires significantly more storage than the Fourier transform.
5A.4.2
W_f(a, b) = (1/√a) ∫ f(x) ψ((x − b)/a) dx.    (5.41)
Fig. 5.19. Original and the result of the inner product of the original and three different two-dimensional wavelets (three slices through the wavelet transform).
Observe that the transform is a function of the scale and the shift. The same idea holds in two
dimensions, where the equation for the inner product becomes
W_f(a, b_x, b_y) = (1/a) ∫∫ f(x, y) ψ((x − b_x)/a, (y − b_y)/a) dx dy.    (5.42)
Fig. 5.19 illustrates a cross section of W for different values of a. Clearly, this process produces a scale space representation.
Lee [5.23] takes the neurophysiological evidence and derives a mother wavelet of the following form:
ψ(x, y) = (1/√(2π)) exp(−(4x² + y²)/8) [exp(iκx) − exp(−κ²/2)],    (5.43)
where κ is a constant whose value depends on assumptions about the bandwidth, but is approximately equal to 3.5. Scaling and translating this mother wavelet produces a collection of filters which (in the same way that the Frei-Chen basis set did) represents how much of an image neighborhood resembles the feature, and from which the image can be completely reconstructed.
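A sketch of the mother wavelet as reconstructed in Eq. (5.43) above (the exact normalization is an assumption), evaluated on a grid so it can be scaled, shifted, and correlated with an image neighborhood:

```python
import numpy as np

def lee_mother_wavelet(x, y, kappa=3.5):
    envelope = np.exp(-(4.0 * x ** 2 + y ** 2) / 8.0)
    carrier = np.exp(1j * kappa * x) - np.exp(-kappa ** 2 / 2.0)  # complex carrier with DC correction
    return envelope * carrier / np.sqrt(2.0 * np.pi)

yy, xx = np.mgrid[-8:9, -8:9].astype(float)
psi = lee_mother_wavelet(xx, yy)      # one 17 x 17 sample of the mother wavelet
```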
5A.5
Vocabulary
You should know the meaning of the following terms.
Canny edge detector
Nonmaximum suppression
Wavelet
Assignment
5.A1
At high levels of scale, only _______ objects are visible. (Fill in the blank.)
Assignment
5.A2
What does the following expression estimate? ( f ⊗ h1 )² + ( f ⊗ h2 )² = E, where the kernel h1 is defined by
0.05  0.08  0.05  0.08  0.05
Assignment
5.A3
You are to use the idea of differentiating a Gaussian to derive a kernel. What variance does a one-dimensional (zero-mean) Gaussian need to have so that the extrema of its first derivative occur at x = ±1?
Assignment
5.A4
What is the advantage of the quadratic variation as compared
to the Laplacian?
Assignment
5.A5
Let E = (f − Hg)^T (f − Hg). Using the equivalence between the kernel form of a linear operator and the matrix form, write an expression for E using kernel notation.
Assignment
5.A6
The purpose of this assignment is to walk you through the
construction of an image pyramid, so that you will more fully
understand the potential utility of this data structure, and
can use it in a coding and transmission application. Along
the way, you will pick up some additional understanding of
the general area of image coding. Coding is not a principal learning objective of this course, but in your research
career, you are sure to encounter people who deal with coding
References
[5.1] V. Anh, J. Shi, and H. Tsai, Scaling Theorems for Zero Crossings of Bandlimited
Signals, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3).
1996.
[5.2] J. Babaud, A. Witkin, M. Baudin, and R. Duda, Uniqueness of the Gaussian Kernel for Scale-space Filtering, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 8(1), 1986.
[5.3] G. Bilbro and W. Snyder, Optimization of Functions with Many Minima, IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[5.4] K. Boyer and S. Sarkar, On the Localization Performance Measure and Optimal Edge
Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1),
1994.
[5.5] P. Burt and E. Adelson, The Laplacian Pyramid as a Compact Image Code, Computer Vision, Graphics, and Image Processing, 16, pp. 20-51, 1981.
[5.6] J. Canny, A Computational Approach to Edge Detection, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 8(6), 1986.
[5.7] W. Chojnacki, M. Brooks, A. van der Hengel, and D. Gawley, On the Fitting of Surfaces to Data with Covariances, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[5.8] J. Crowley, A Representation for Visual Information, Ph.D. Thesis, Carnegie-Mellon
University, 1981.
[5.9] J. Daugman, Two-dimensional Spectral Analysis of Cortical Receptive Fields, Vision Research, 20, pp. 847-856, 1980.
[5.10] J. Daugman, Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-dimensional Visual Cortical Filters, Journal of the Optical Society of America, 2(7), 1985.
[5.11] W. Deng and S. Iyengar, A New Probabilistic Scheme and Its Application to Edge
Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[5.12] W. Frei and C. Chen, Fast Boundary Detection: A Generalization and a New Algorithm, IEEE Transactions on Computers, 26(2), 1977.
[5.13] D. Hubel and T. Wiesel, Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat's Visual Cortex, Journal of Physiology (London), 160, pp. 106-154, 1962.
[5.14] D. Hubel and T. Wiesel, Functional Architecture of Macaque Monkey Visual Cortex, Proceedings of the Royal Society of London, B, 198, pp. 1-59, 1977.
[5.15] L. Iverson and S. Zucker, Logical/Linear Operators for Image Curves, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[5.16] P. Jackway and M. Deriche, Scale-space Properties of the Multiscale Morphological Dilation-Erosion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1996.
[5.17] J. Jones and L. Palmer, An Evaluation of the Two-dimensional Gabor Filter Model of Simple Receptive Fields in the Cat Striate Cortex, Journal of Neurophysiology, 58, pp. 1233-1258, 1987.
[5.18] E. Joseph and T. Pavlidis, Bar Code Waveform Recognition using Peak Locations,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 1994.
[5.19] M. Kelly, in Machine Intelligence, volume 6, University of Edinburgh Press,
1971.
[5.20] M. Kisworo, S. Venkatesh, and G. West, Modeling Edges at Subpixel Accuracy using
the Local Energy Approach, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(4), 1994.
[5.21] A. Klinger, Pattern and Search Statistics, in Optimizing Methods in Statistics,
New York, Academic Press, 1971.
[5.22] P. Kube and P. Perona, Scale-space Properties of Quadratic Feature Detectors, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.23] T. Lee, Image Representation Using 2-D Gabor Wavelets, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[5.24] Y. Leung, J. Zhang, and Z. Xu, Clustering by Scale-space Filtering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[5.25] T. Lindeberg, Scale-space for Discrete Signals, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(3), 1990.
[5.26] T. Lindeberg, Scale-space Theory, A Basic Tool for Analysing Structures at Different
Scales, Journal of Applied Statistics, 21(2), 1994.
[5.27] D. Marr and E. Hildreth, Theory of Edge Detection, Proceedings of the Royal Society of London, B, 207, pp. 187-217, 1980.
[5.28] D. Marr and T. Poggio, A Computational Theory of Human Stereo Vision, Proceedings of the Royal Society of London, B, 204, pp. 301-328, 1979.
[5.29] R. Nelson, Finding Line Segments by Stick Growing, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[5.30] E. Pauwels, L. Van Gool, P. Fiddelaers, and T. Moons, An Extended Class of Scale-invariant and Recursive Scale Space Filters, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), 1995.
[5.31] P. Perona, Deformable Kernels for Early Vision, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(5), 1995.
[5.32] P. Perona and J. Malik, Scale-space and Edge Detection using Anisotropic Diffusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp. 629-639, 1990.
[5.33] W. Pratt, Digital Image Processing, Chichester, John Wiley and Sons, 1978.
[5.34] H. Tagare and R. deFigueiredo, Reply to "On the Localization Performance Measure and Optimal Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[5.35] A. Taratorin and S. Sideman, Constrained Regularized Differentiation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 1994.
[5.36] F. van der Heijden, Edge and Line Feature Extraction Based on Covariance Models,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[5.37] M. Van Horn, W. Snyder, and D. Herrington, A Radial Filtering Scheme Applied to
Intracoronary Ultrasound Images, Computers in Cardiology, September, 1993.
[5.38] P. Verbeek and L. van Vliet, On the Location Error of Curved Edges in Low-pass
Filtered 2-D and 3-D Images, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[5.39] I. Weiss, High-order Differentiation Filters that Work, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(7), 1994.
[5.40] M. Werman and Z. Geyzel, Fitting a Second Degree Curve in the Presence of Error,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[5.41] R. Young, The Gaussian Derivative Model for Spatial Vision: I. Retinal Mechanisms, Spatial Vision, 2, pp. 273-293, 1987.
To change, and to change for the better are two different things
German proverb
In this chapter, we move toward developing techniques which remove noise and
degradations so that features can be derived more cleanly for segmentation. The
techniques of a posteriori image restoration and iterative image feature extraction
are described and compared. While image restoration methods remove degradations
from an image [6.3], image feature extraction methods extract features such as edges
from noisy images. Both are shown to perform the same basic operation: image relaxation. In the advanced topics section, image feature extraction methods, known
as graduated nonconvexity (GNC) and variable conductance diffusion (VCD), are compared with a restoration/feature extraction method known as mean field annealing (MFA). This equivalence shows the relationship between energy minimization
methods and spatial analysis methods and between their respective parameters of
temperature and scale. The chapter concludes by discussing the general philosophy
of extracting features from images.
6.1 Relaxation
The term relaxation was originally used to describe a collection of iterative numerical techniques for solving simultaneous nonlinear equations (see [6.18] for a review).
The term was extended to a set of iterative classification methods by Rosenfeld and Kak [6.64] because of their similarity. Here, we provide a general definition of the term which will encompass these methods as well as those more recent techniques which are the emphasis of this discussion.
A relaxation process is an
iteration.
Definition
A relaxation process is a multistep algorithm with the property that (1) the output of
a single step is of the same form as the input, so that the algorithm may be applied
iteratively, and (2) it converges to a bounded result. Some researchers also require
that the operation on any element (any pixel, in our application) be dependent only
on the state of the pixels in some well defined, finite neighborhood of that element.
We will see that all the algorithms discussed here are relaxation processes, according
to these criteria.
6.2 Restoration
In an image restoration problem, we assume that an ideal image, f, has been corrupted
to create the measured image, g. The usual model for the corruption is a distortion
operation, denoted by D, followed by the addition of random noise
g = D( f ) + n,
(6.1)
where g = [g1, . . . , gN]^T and gi denotes the ith pixel in a column vector representation of the image g. f and n are similarly defined. The restoration problem, then, is the problem of finding a best estimate of f given the measurement, g, some knowledge of the distortion (often called blur), and the statistics of the noise.
Restoration is often referred to as an inverse problem. That is, we have a process
(in this case blur) which takes an input and produces an output. We can only measure
the output, and we wish to infer the input.
6.2.1
Definition of an ill-posed problem.
If these three conditions do not all hold, the problem is said to be ill-posed. Ill-posedness is normally caused by the ill-conditioning of the problem. Conditioning of
a mathematical problem is measured by the sensitivity of output to changes in input.
For a well-conditioned problem, a small change of input does not affect the output
much; while for an ill-conditioned problem, a small change of input can change the
output a great deal.
Condition number is the measurement of the conditioning of a problem. Generally, it is defined as in Eq. (6.2). The larger the condition number, the more ill-conditioned the problem is:
condition number = change in output / change in input.    (6.2)
For a linear system characterized by a matrix A, the condition number is
K = ||A|| · ||A⁻¹||,    (6.3)
where ||·|| usually indicates the 2-norm. K is in the range [1, ∞). When K >> 1, the linear system is ill-conditioned.
Table 6.1. The blurred image values in terms of the original pixels a, . . . , i:
ga = 0.5b + 0.5d               gb = 0.33a + 0.33c + 0.33e            gc = 0.5b + 0.5f
gd = 0.33a + 0.33e + 0.33g     ge = 0.25b + 0.25d + 0.25f + 0.25h    gf = 0.33c + 0.33e + 0.33i
gg = 0.5d + 0.5h               gh = 0.33e + 0.33g + 0.33i            gi = 0.5f + 0.5h
In Eq. (6.1), suppose we know how the blur process works. Then it would seem
that we should be able to undo it. We will see why that might not be the case.
As an example, look at a very simple image, just about the simplest image we can come up with, a 3 × 3 image, and give each pixel a name, using the letters a, . . . , i. Now, suppose this image is blurred by a linear blur process in which each pixel is replaced by the average of its neighbors (suppose the 4-neighbor definition is used). In the case of edge or corner pixels, fewer pixels are in the neighborhood. This new,
blurred image has values ga, . . . , gi, and these are related to the original values by
the system of linear equations shown in Table 6.1.
a  b  c
d  e  f
g  h  i

            a      b      c      d      e      f      g      h      i
     a  [   0     0.5     0     0.5     0      0      0      0      0   ]
     b  [ 0.33     0    0.33     0    0.33     0      0      0      0   ]
     c  [   0     0.5     0      0      0     0.5     0      0      0   ]
     d  [ 0.33     0      0      0    0.33     0    0.33     0      0   ]
H =  e  [   0     0.25    0    0.25     0    0.25     0     0.25    0   ]
     f  [   0      0    0.33     0    0.33     0      0      0    0.33  ]
     g  [   0      0      0     0.5     0      0      0     0.5     0   ]
     h  [   0      0      0      0    0.33     0    0.33     0    0.33  ]
     i  [   0      0      0      0      0     0.5     0     0.5     0   ]
Then, we may represent the process of blur by G = H F and solve for the values before the blur using F = H⁻¹G. It appears that this should be simple. If we have a model for the distortion process (the model in this case is the matrix H), all we need do is invert it, and multiply. Let us look to see why that might not work too well. First, calculate the inverse of H numerically. Whoops! Our matrix inverter program
First, calculate the inverse of H numerically. Whoops! Our matrix inverter program
tells us that the matrix is singular. Guess that will not work. Is it a bad choice of blur
operations?
It is hard, it turns out, to contrive a numerical example which is not singular. It is possible, of course, but difficult. Here is the real key: Even if the distortion matrix turns out to be nonsingular, the problem is still likely to be ill-conditioned.
That is, (let's review) we measure ga, . . . , gi, and use matrix multiplication to determine a, . . . , i. If H is not singular, then in a perfect world it should work. As engineers know, however, in fact there is always noise, so instead of measuring ga, we actually measure ga + ε, where ε is some perturbation due to noise. If the system, that is, the distortion matrix, is ill-conditioned (and it is), then this small change in ga may produce large differences in the estimates of a, . . . , i. For this reason, simple matrix inversion does not work, even if the system is linear.
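The point is easy to check numerically; a sketch (numpy assumed) that builds the 4-neighbor averaging matrix for a 3 × 3 image and confirms that it cannot be inverted:

```python
import numpy as np

H = np.zeros((9, 9))
for r in range(3):
    for c in range(3):
        i = 3 * r + c
        nbrs = [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < 3 and 0 <= c + dc < 3]
        for rr, cc in nbrs:
            H[i, 3 * rr + cc] = 1.0 / len(nbrs)   # each pixel becomes the mean of its neighbors

print(np.linalg.matrix_rank(H))   # less than 9, so H is singular and F = H^-1 G fails
```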
Another, perhaps simpler example of ill-conditioning [6.36] is as follows: Consider the linear system described by a blur A, an unknown image f, and a measurement g, where
g = A f,   A = [ 1   1
                 1  1.01 ],   f = [f1  f2]^T,   g = [1  1]^T.    (6.4)
The condition number for matrix A is 402.0075, which is much larger than 1. This system has solution f1 = 1, f2 = 0. Now, suppose the measurement, g, is corrupted by noise, producing g = [1  1.01]^T. Then, the solution is f1 = 0, f2 = 1. A trivially small change in the measured data caused a dramatic change in the solution.
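The same numbers can be reproduced with a few lines of numpy:

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.01]])
print(np.linalg.cond(A))                # about 402: badly conditioned

print(np.linalg.solve(A, [1.0, 1.0]))   # [1, 0] for the clean measurement
print(np.linalg.solve(A, [1.0, 1.01]))  # [0, 1] after a 0.01 perturbation
```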
There are many ways to approach these ill-posed restoration problems. They all share a common structure: regularization theory. Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates the original ill-posed problem [6.57].
The first approach one might think of is to produce an image estimate which has the minimum expected mean square error. That is, find the unknown image f which minimizes
E = Σ_i (g_i − (f ⊗ h)_i)²,    (6.5)
where the sum is over all the pixels in the image, and the distortion is represented by application of a kernel operator corresponding to a blur h. Simply minimizing E does not work, as the problem is still ill-conditioned. Making some assumptions about the noise can give us a bit better performance. If the distortion is linear, space-invariant, and the noise is stationary, the Wiener filter gives the optimal solution according to this criterion. (See [6.28] for a tutorial presentation.)
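For reference, a frequency-domain sketch of a Wiener-style restoration (not the book's code): the constant noise-to-signal ratio nsr stands in for the full spectral ratio, which is a common simplifying assumption.

```python
import numpy as np

def wiener_restore(g, h, nsr):
    """Restore image g blurred by kernel h: F = H* G / (|H|^2 + nsr)."""
    G = np.fft.fft2(g)
    H = np.fft.fft2(h, s=g.shape)
    F = np.conj(H) * G / (np.abs(H) ** 2 + nsr)
    return np.real(np.fft.ifft2(F))
```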
6.3.1
Bayes rule
Bayes rule concerns three probability functions: the a priori probability density,
p( f ), the conditional probability density, p(g| f ), and the a posteriori conditional
probability density, p( f |g).
We define p( f ) to represent the a priori probability density that some particular
picture f occurs. (If we consider the brightness values as continuous, then we need
to use a probability density rather than a probability. Which we use has no impact
on the resulting form of the derivations given below.) That is, it is the probability
of picture f occurring before any measurements are made. As a discrete example of
the a priori probability, suppose we have a factory which manufactures flanges and gaskets, but makes ten times as many flanges as gaskets. Flanges and gaskets may come down the conveyor at random times. But because of our a priori knowledge that the plant manufactures ten times as many flanges as gaskets, we know that we are much more likely to see a flange than a gasket if we choose to look at the conveyor at some random time. Thus the a priori probability that our camera sees a flange is 0.9 and the a priori probability of gaskets is 0.1.
We define p(g| f ) to represent the conditional probability density that image g
is measured, given that the measured image is known to be some corruption of
image f. The probability density function may be characterized in several possible
ways. One is by simply tabulating the number of times a particular value occurs
for each possible value of the variable, in this case, length. Such a tabulation is
referred to as a histogram of the variables. Properly normalized, a histogram can
be a useful representation of a probability density function. Unfortunately, it is
difficult to represent the density function of images as a histogram. One may also
describe a density function in a parametric way using some analytic function (e.g.,
the Gaussian).
We define p( f |g) to represent the a posteriori conditional probability (density) that the observed image g is really a corruption of image f. p( f |g) is what we are looking for. We will use it as our decision rule or, more correctly, as our discrimination function. Our decision rule will then be as follows.
For a measurement g made on an unknown image, compute p( f_i |g) for each possible f_i. Then decide that the unknown is the image f_i for which p( f_i |g) is greater than p( f_j |g) for all j ≠ i. When we make a classification decision based on p( f_i |g), we are using a maximum a posteriori (MAP) image processing algorithm.
We can relate the three probability functions just defined by using Bayes rule:
p( f |g) = p(g| f ) p( f ) / something    (6.6)
something = p(g).    (6.7)
In Eq. (6.6) we used "something" to represent the denominator for the conditional probability density. We used the word "something" to call attention to the fact that this number represents the probability density of that value of g occurring, independent of the original, uncorrupted image. Since this number is independent of f, and is the same for all possible f's, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number p( f |g) has the desirable properties of a probability; that is,
it lies between 0 and 1 and sums to 1 when summed over all possible images (that is,
the observed object belongs to at least one of the classes which we are considering).
We wish to maximize the a posteriori probability density p( f |g) of the unknown
correct image f given measured image g. We will be relating the probabilities of the
entire image to the probabilities of individual pixels. Using Bayes rule, we have the
proportionality
p( f_i |g_i) ∝ p(g_i | f_i) p( f_i).    (6.8)
Since the only difference between the measurement gi and the true but unknown
image pixel f i is the noise, and if we assume a Gaussian noise model, we can
replace the conditional density of the individual pixels with the density of the noise,
producing
p(g| f ) = Π_i (1/(√(2π) σ)) exp(−n_i²/(2σ²)).    (6.10)
Moving the product inside the exponential allows us to write [6.8, 6.27, 6.37, 6.38, 6.80]:
p(g| f ) = (1/(√(2π) σ))^N exp(−Σ_i ( f_i − g_i )²/(2σ²)).    (6.11)
p( f_i ) ∝ exp(−Σ_{j∈ηi} V_ij / T).    (6.12)
The sum is taken over ηi, the neighborhood of pixel i. Recall from Chapter 4 that the aura of a set of pixels A relative to a set B is the collection of all points in B that are neighbors of pixels in A, where the concept of neighbor is given by a problem-specific definition. Here, the concept is similar, except that we consider only the neighborhood of a single pixel rather than the neighborhood of a set. Like the definition of aura, the definition of neighborhood is allowed to be problem-dependent, and in this sense two pixels do not have to be adjacent or even necessarily close in the image to be considered neighbors. In essentially all practical applications, however, the neighbors of a particular pixel are those pixels which are adjacent. T is an adjustable width parameter, and the Vs are potential functions which are, in general, functions of the pixels in the neighborhood.
The way we formulate the prior probability for the entire image is to once again
write a product
p( f ) = Π_i p( f_i ).    (6.13)
Check this out! The
answer which maximized
the conditional
probability is the same as
the answer which
minimized the squared
error! This happens
because the noise is
additive Gaussian.
Forming the product of Eqs. (6.11) and (6.12) as indicated in Eq. (6.8) and eliminating
the constant1 term, we then take natural logarithms and change the sign thereby
converting the problem from maximizing a probability to minimizing an objective
function.
H( f, g) = Σ_i ( f_i − g_i )²/(2σ²) + β Σ_i Σ_{j∈ηi} V_ij.    (6.14)
We will refer to the first, conditional, term of Eq. (6.14) as the noise term [6.27] and to the second as the prior. This gives the following form:
H( f, g) = Hn( f, g) + Hp( f ).    (6.15)
These terms do not affect the location of the minimum. The β which remains in Eq. (6.14) allows us to weight the relative importance of the noise and prior terms.
having a brightness which varies smoothly, except at boundaries [6.9, 6.16]; or the
most common: having brightness which is constant in local areas and discontinuous
at boundaries [6.16, 6.38].
6.3.2
We are performing an operation that some (but not we) refer to as "denoising."
6.3.3
An image which
minimizes the quadratic
form will be blurry.
Fig. 6.1. The penalty should be stronger for higher noise. Assuming local variations in brightness
are due to noise, a larger variation is penalized more. However, local brightness
variations can also be due to edges, which should not be penalized (otherwise, they
will be blurred). Therefore we want our penalty function to have an upper limit.
Hp( f ) = −(b/(√(2π) τ)) Σ_i exp(−∇_i²/(2τ²)).    (6.16)
In Eq. (6.16), the constants are irrelevant. We include them only so that it looks like a Gaussian. τ is a soft threshold: It represents a priori knowledge of the roughness of the surface. The form of the prior makes this knowledge explicit. It is hoped that the spatial derivatives ∇_i will become small almost everywhere as the algorithm proceeds. This concept will be explored in more detail in the next section.
Combining Eqs. (6.15) and (6.16), we have an objective function which, if minimized, will result in a restored image which resembles the data (in the least squares sense) while at the same time consisting of regions of uniform brightness separated by abrupt edges. We could use mean field annealing (MFA) to minimize this objective function.
Here we have a (not very interesting) image consisting of only two pixels, f1 and f2, which have been corrupted by noise to result in measured pixel values g1 and g2. The prior term chosen in this case encourages solutions in which f1 = f2. The principal result of the MFA derivations, when applied to a function of the type described by Eq. (6.17), is the replacement of τ by τ + T, where T is a parameter (called temperature in the literature) which is initialized to a large value and gradually reduced.
∂HT( f1, f2)/∂f = [ 2( f1 − g1) + (1/(τ + T)) exp(−( f1 − f2)²/(τ + T)²) · 2( f1 − f2)/(τ + T)²
                    2( f2 − g2) + (1/(τ + T)) exp(−( f1 − f2)²/(τ + T)²) · 2( f2 − f1)/(τ + T)² ].    (6.19)
The MFA approach is as follows.
(1) Set T = T_initial (a problem-dependent parameter).
(2) Using Eq. (6.19), perform gradient descent or some other minimization technique, and find the f which minimizes HT.
(3) Reduce T.
(4) If T > T_final, go to (2).
The simplest version of gradient descent applicable here is the iteration
[ f1, f2 ]^(k+1) = [ f1, f2 ]^(k) − α ∂HT( f1^k, f2^k)/∂f    (6.20)
for some small scalar constant α. To see how this works, consider the case of large T: Making T large results in
min_f HT( f1, f2) → min_f (( f1 − g1)² + ( f2 − g2)²)   as T → ∞.    (6.21)
Fig. 6.2. Continuous deformation of a function which is initially convex to find the global minimum of a nonconvex function. (The curves are labeled by their values of T + τ: 4.5, 5, 5.5, 6, and 10.)
with scalar constants k and l. At high values of T (T + τ = 10), the curve is completely convex, and as T is reduced, it assumes its true form. At each iteration, the minimum is tracked, as indicated by the arrows, concluding in the global minimum.
6.4.1
A good exam question would be "why is the noise term called the noise term?" or "what assumptions were involved in deriving the noise term from conditional probabilities of noise?"
where (D( f ))i denotes some distortion of image f in the vicinity of pixel i. Finding
the image f which minimizes this term produces the image which, when distorted,
most closely (in the sum-squared error sense) resembles the measurement g. Now,
let's look at the prior term.
Writing the prior in a slightly more general way,
Hp( f ) = −Σ_i (1/(√(2π) τ)) exp(−(R( f ))_i²/(2τ²)),    (6.23)
where the term (R( f ))_i denotes some function of the (unknown) image f at pixel i, and τ here incorporates the function of τ + T in Eq. (6.18). What kind of image minimizes this term? Let's look at what it means.
First, observe the minus sign in front of the prior term. With that sign there, to minimize the function, we should find the image which causes the exponential to be maximized. What kind of image maximizes the exponential? Now look at the argument of the exponential. Observe the minus sign, and observe that both terms are squared, and therefore always positive. Thus, the argument of the exponential
EXAMPLE
Piecewise-constant images
R²( f ) = (∂f/∂x)² + (∂f/∂y)².    (6.24)
In order for this term to be zero, both partial derivatives must be zero. The only type of surface which satisfies this condition is one which does not vary in either direction: flat. But not flat everywhere. To see why the solution is piecewise-constant rather than completely constant, you need to recognize that the total function being minimized is the sum of both the prior and the noise terms. The prior term seeks a constant solution, but the noise term seeks a solution which is faithful to the measurement. The optimal solution to this problem is a solution which is flat in segments, as illustrated in one dimension in Fig. 6.3. The function R( f ) is nonzero only for the points where f undergoes an abrupt edge. To see more clearly what this produces, consider the extension to continuous functions. If x is continuous, then the summation in Eq. (6.23) becomes an integral. The argument of the integral is nonzero at only a small, finite number of points (referred to as a set of measure zero), which is insignificant compared to the rest of the integral.
Fig. 6.3. A piecewise-constant solution to the fitting of a surface. The solution has a derivative equal to zero at almost every point. Nonzero derivatives exist only at the points where steps exist.
EXAMPLE
Piecewise-planar images
This is the quadratic
variation. The Laplacian
also involves second
derivatives. Ask yourself
how it differs from the
quadratic variation.
Let's try another example and see what that does. Consider
R²( f ) = (∂²f/∂x²)² + (∂²f/∂y²)² + 2(∂²f/∂x∂y)².    (6.25)
What does this do? What kind of function has a zero for all its second derivatives? Answer: a plane. Thus, using this form for R( f ) will produce an image which is planar, but still maintains fidelity to the data: a piecewise-planar image. Another alternative operator which is also based on the second derivative is the Laplacian,
∂²f/∂x² + ∂²f/∂y².
We saw both of these back in Chapter 5.
You might ask your instructor, "Breaking a brightness image into piecewise-linear segments is the same as assuming the actual surfaces are planar, right?" To which you would probably get "Yes, except for variations in lighting, reflectance, and albedo." Ignoring that, you charge on, saying "But real surfaces aren't all planar." The answer is twofold: First is the trivial and useless observation that all surfaces are planar, you just have to examine a sufficiently small region. More seriously, whether breaking an image into planar patches makes sense depends on the application. For example [6.14, 6.74], one could do a piecewise-constant segmentation to remove noise and then treat each patch as a plane and get improved estimates of optic flow [6.41] or stereo [6.72] that way. Some of the underlying theory of planar approximations to images may be found in [6.62].
Interestingly, Yi and Chelberg [6.83] observe that second-order priors like this one require a great deal more computation than first-order priors, and that first-order priors can be made approximately invariant. However, in our own experiments, we have not found that second-order priors impose such a severe computational penalty, and they do provide more flexibility in reconstruction.
A one-dimensional example of such a solution is shown in Fig. 6.4. The idea of modeling an image as piecewise-planar has recently received additional support from some work by Elder and co-workers [6.22, 6.23, 6.24] which suggests that the edge representation of an image is, to a good approximation, invertible. They accomplish this remarkable result by assuming that except at edges, an image satisfies the Laplace equation ∇²f(x, y) = 0.
In conclusion, you can choose any function for the argument of the exponential that you wish, as long as the image you want is produced by setting the argument to zero. Some more general properties of prior models have been stated by Li [6.51] as a function of the local image gradient η: A prior h should (1) be continuous in its first derivative; (2) be even (h(η) = h(−η)); (3) be positive (h(η) > 0); and (4) have
Fig. 6.4. A piecewise-linear solution to the data of Fig. 6.3. Clearly, the piecewise-linear solution preserves fidelity to the data more accurately than the piecewise-constant.
6.4.2
Hp = −Σ_i Σ_{j∈ηi} δ( f_i − f_j ).    (6.27)
The delta function is not convenient, since it is not differentiable, and we will want to use gradient descent to solve this problem. But there is another problem with this formulation: If the image is continuously valued (or even if it is represented in floating point), what does it mean for f_i to equal f_j? How close should they be before they are considered equal? How about | f_i − f_j | < 0.01? Is this small enough? How about 0.001? OK? Do you accept that? So we insist that two points which differ by more than 0.001 contribute to the error. What about two points that differ by 0.000 999? That pair does not contribute at all. Does that make sense?
What we have generated is very similar to the problem of describing the probability that a measurement has a particular value. The probability, for example, of being exactly 6.000 000 (for an arbitrary number of zeros) is zero, and we therefore resort to a different representation for the concept of likelihood, and we invent the probability density. In a similar way, in this problem, we pursue the same philosophy: instead of using the Kronecker delta, we replace the delta function with a continuous, differentiable function which represents the same intuition,
Hp = −Σ_i Σ_{j∈ηi} (1/(√(2π) τ)) exp(−( f_i − f_j )²/(2τ²)).    (6.29)
This form also allows the concept of annealing: Start with τ large and reduce it until it approaches zero. The square root of a constant is not particularly meaningful, but it does serve to ensure that the function remains bounded in the appropriate way. The details of why this process of annealing avoids local minima are described elsewhere [6.8, 6.11, 6.12], but result from the analogy to simulated annealing [6.27].
Initial value of τ.    (6.30)
Decreasing τ.
MFA is based upon the mathematics of simulated annealing. One can show that in simulated annealing, a global minimum can be achieved by following a logarithmic annealing schedule like
τ_K = 1/ln K,    (6.31)
where K is the iteration number. This schedule decreases extremely slowly; so slowly as to be impractical. Instead, one could choose a schedule like
τ_K = 0.99 τ_(K−1),    (6.32)
which has been shown to work satisfactorily in many applications, and reduces much faster than the logarithmic schedule.
6.4.3
Taking a one-dimensional example, assume f_i is a pixel from the original (unknown) image, g_i is a pixel from the measured image, and h is the horizontal blur kernel with a finite kernel size of 5, as shown in Fig. 6.5. We now explain the derivative of the noise term, Eq. (6.33), in gradient descent in detail. First write out all the terms involving a pixel ( f_4 ) at which a measurement (g_4 ) was made in the noise term Hn above:
E_4 = (( f ⊗ h)_2 − g_2)² + (( f ⊗ h)_3 − g_3)² + (( f ⊗ h)_4 − g_4)² + (( f ⊗ h)_5 − g_5)² + (( f ⊗ h)_6 − g_6)²
    = ( f_0 h_−2 + f_1 h_−1 + f_2 h_0 + f_3 h_1 + f_4 h_2 − g_2)²
    + ( f_1 h_−2 + f_2 h_−1 + f_3 h_0 + f_4 h_1 + f_5 h_2 − g_3)²
    + ( f_2 h_−2 + f_3 h_−1 + f_4 h_0 + f_5 h_1 + f_6 h_2 − g_4)²
    + ( f_3 h_−2 + f_4 h_−1 + f_5 h_0 + f_6 h_1 + f_7 h_2 − g_5)²
    + ( f_4 h_−2 + f_5 h_−1 + f_6 h_0 + f_7 h_1 + f_8 h_2 − g_6)²    (6.34)
where ( f ⊗ h)_i denotes the application of kernel h to image f with the origin of h (in this case, the center) located at pixel f_i. The derivative of Hn with respect to pixel f_4 can then be derived as Eq. (6.35), and further generalized as Eq. (6.36):
∂Hn/∂f_4 = 2(( f ⊗ h)_2 − g_2) h_2 + 2(( f ⊗ h)_3 − g_3) h_1 + 2(( f ⊗ h)_4 − g_4) h_0
         + 2(( f ⊗ h)_5 − g_5) h_−1 + 2(( f ⊗ h)_6 − g_6) h_−2    (6.35)
∂Hn/∂f_4 = 2((( f ⊗ h) − g) ⊗ h_rev)_4    (6.36)
where h_rev = (h_2, h_1, h_0, h_−1, h_−2), and (( f ⊗ h) − g) is computed at all points. The reverse kernel is illustrated in Fig. 6.6.
Fig. 6.5. The blur kernel h, with elements h_2, h_1, h_0, h_−1, h_−2, positioned over the image pixels f_2, f_3, f_4, f_5, f_6.
Fig. 6.6. The reverse kernel in the derivation of the noise term.
h = | h_{0,−1}  h_{0,0}  h_{0,1} |        h_rev = | h_{1,1}  h_{1,0}  h_{1,−1} |
    | h_{1,−1}  h_{1,0}  h_{1,1} |,               | h_{0,1}  h_{0,0}  h_{0,−1} |.    (6.37)
Before we discuss the prior term, let's conclude the general form of differentiation when a kernel function is involved. Let R( f ⊗ h) be some differentiable function. The derivative with respect to f is
∂/∂f R( f ⊗ h) = (R′( f ⊗ h)) ⊗ h_rev.    (6.38)
∂Hp/∂f = [2( f ⊗ r) exp(−( f ⊗ r)²)] ⊗ r_rev.    (6.40)
Recall that the derivative of the prior, ∂Hp/∂f, is itself an image, and that image is derived by applying r to f, multiplying (pixel by pixel) with the exponential to produce another image, and applying the reverse of r to that image.
6.4.4
Hp( f ) = −Σ_i exp(−(R( f ))_i²/(2τ²)),
where (R( f ))_i will be the quadratic variation of Eq. (6.25) at pixel i. Of course, in specific applications, different priors may be indicated. To perform gradient descent, we must find the derivative with respect to f. This problem becomes complicated when we recognize that the numerator of the argument of the exponential varies in both x and y. We have two choices.
(1) We could recognize that R is a sum of three terms, that the exponential of a sum is the product of the exponentials, and use the product rule of derivatives to construct a rather complicated expression for the derivative.
(2) We could say, instead of putting the summation in the argument of the exponential, let's just add the exponentials.
Of course, these are not equivalent expressions. However, minimizing either gets at the same idea: a piecewise-linear image. Since the second is simpler to implement, being engineers, we choose the second option. We know how to do the derivatives, so we end up with the following algorithm.
The derivative of the noise term is trivial: On each iteration, simply change pixel i by dnoise_i = ( f_i − g_i )/σ².
To determine the derivative of the prior requires a bit more work: according to Eq. (6.40), the derivative of the prior term is
[ ((∇ ⊗ f )/τ²) exp(−(∇ ⊗ f )²/(2τ²)) ] ⊗ ∇_rev.
Define three kernel operators which we will use to estimate the three partial second derivatives in the quadratic variation:
∇_xx = (1/6) | 0  0  0 |     ∇_yy = (1/6) | 0  1  0 |     ∇_xy = |  0.25  0  −0.25 |
             | 1 −2  1 |                  | 0 −2  0 |            |   0    0    0   |
             | 0  0  0 |,                 | 0  1  0 |,           | −0.25  0   0.25 |.
Notice that these kernels are symmetric. Therefore ∇ = ∇_rev.
Compute three images, where the ith pixels of those images are
r_i^xx = (∇_xx ⊗ f )_i,   r_i^yy = (∇_yy ⊗ f )_i,   and   r_i^xy = (∇_xy ⊗ f )_i.
Then compute s_i^xx = r_i^xx exp(−(r_i^xx)²/(2τ²)), and similarly create s^yy and s^xy.
For the purposes of gradient descent, the change in pixel i from the prior term is then
dprior_i = β((∇_xx ⊗ s^xx)_i + (∇_yy ⊗ s^yy)_i + (∇_xy ⊗ s^xy)_i).
The gradient descent rule says to change each element of f using f_i ← f_i − α d_i, where d_i = dnoise_i + dprior_i.
The learning coefficient should be α = ε σ / RMS(d_i), where ε is a small dimensionless number, like 0.04; RMS(d) is the root mean square norm of the gradient d; σ can be determined as the variance of the noise in the image (note that this is NOT a good estimate in synthetic images). We observe that in this form, α changes every iteration.
The coefficient β is on the order of σ, and choosing β = σ is usually adequate.
Implementing this algorithm and annealing τ over a couple of orders of magnitude should give reductions of noise similar to those illustrated in Figs. 6.10-6.13.
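A compact sketch of the loop just described (numpy/scipy assumed). The s-images, the step-size rule, and the choice β = σ follow the reconstruction above and should be treated as assumptions rather than the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def mfa_denoise(g, sigma, tau0=4.0, beta=None, eps=0.04, iters_per_tau=10, tau_steps=20):
    """Sketch of gradient descent with annealing on the MFA objective described above."""
    dxx = np.array([[0, 0, 0], [1, -2, 1], [0, 0, 0]]) / 6.0
    dyy = dxx.T
    dxy = np.array([[0.25, 0, -0.25], [0, 0, 0], [-0.25, 0, 0.25]])
    beta = sigma if beta is None else beta            # beta on the order of sigma
    f = g.astype(float).copy()
    tau = tau0
    for _ in range(tau_steps):
        for _ in range(iters_per_tau):
            d_noise = (f - g) / sigma ** 2
            d_prior = np.zeros_like(f)
            for k in (dxx, dyy, dxy):                 # kernels are symmetric: k == k_rev
                r = convolve(f, k)
                s = r * np.exp(-r ** 2 / (2 * tau ** 2))
                d_prior += convolve(s, k)
            d = d_noise + beta * d_prior
            alpha = eps * sigma / (np.sqrt(np.mean(d ** 2)) + 1e-12)
            f -= alpha * d
        tau *= 0.8                                    # anneal tau over ~two decades
    return f
```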
6.5 Conclusion
chapter, we use gradient descent with annealing, but other, more sophisticated and
faster techniques, such as conjugate gradient, could be used.
6.6 Vocabulary
You should know the meanings of the following terms.
Anisotropic diffusion (see 6A.2)
Annealing
Bayes rule
GNC (see 6A.1)
Inverse problem
MAP algorithm
Relaxation
Restoration
Assignment
6.1
Equation (6.34) illustrates the partial derivative of an expression involving a kernel, by expanding the kernel into a sum. Use this approach to prove that Eq. (6.40) can be derived from Eq. (6.39). Do your proof using a one-dimensional problem, and use a kernel which is 3 × 1 (denote the elements of the kernel as h_−1, h_0, and h_1).
Assignment
6.2
Implement Eq. (6.65) on the image angio.ifs, or some
other image which your instructor selects. Experiment
with various run times and parameter settings.
Assignment
6.3
In Eq. (6.25), the quadratic variation is presented as
a prior term. A very similar prior would be the Laplacian. What is the difference? That is, are there image
features which would minimize the Laplacian and not
minimize the quadratic variation? Vice versa?
Assignment
6.4
Which of the following expressions represents a Laplacian?
(a)-(f): six candidate expressions, each formed from first and second partial derivatives of f with respect to x and y.
Assignment
6.5
One form of the diffusion equation is written df/dt = hx ⊗ (c · (hx ⊗ f)) + hy ⊗ (c · (hy ⊗ f)), where hx and hy estimate the first derivatives in the x and y directions, respectively. This suggests that four kernels must be applied to compute this result. Simple algebra, however, suggests that this could be rewritten as df/dt = c · (hxx ⊗ f + hyy ⊗ f), which requires application of only two kernels. Is this simplification of the algorithm correct? If not, explain why not, or under what conditions it would be true.
Assignment
6.6
Consider the following image Hamiltonian
H(f) = Σ_i ( f_i − g_i )²/(2σ²) − Σ_i exp(−(h ⊗ f)_i²/(2τ²)) ≡ Hn(f) + Hp(f)
where ⊗ denotes application of a kernel operator, the pixels in the image are lexicographically indexed by i, and the kernel h is
| 1 2 1 |
| 2 4 2 |
| 1 2 1 |.
Let Gp(f_k) denote the partial derivative of Hp with respect to pixel k: Gp(f_k) = ∂Hp(f)/∂f_k. Write an expression for Gp(f_k). Use kernel notation.
Assignment
6.7
Continuing problem Assignment 6.6, you are to consider
ONLY the prior term. Write an equation which describes
the change in image brightness at pixel k as one iteration of a simple gradient descent algorithm. Denote the
gradient by G p (fk) and use that in this answer.
Assignment
6.8
Continuing problem Assignment 6.7, expand this differential equation (assume the brightness varies only in
the x direction) by substituting the form for G p (fk)
which you derived. Is this a type of diffusion equation? Discuss. (Hint: Replace the application of kernels with appropriate derivatives.)
Assignment
6.9
In a diffusion problem, you are asked to diffuse a VECTOR quantity, instead of the brightness which you did
in your project. Replace the terms in the diffusion
equation with the appropriate vector quantities, and
write the new differential equation. (Hint: You may
find the algebra easier if you denote the vector as
[a,b]T .)
Assignment
6.10
The time that the diffusion runs is somehow related to
blur. This is why some people refer to diffusions of
this type as scale space. Discuss this use of terminology.
Topic 6A
6A.1
Fig. 6.7. Prior energy Hp^MFA as a function of ∇_i(f), for various values of T. Smaller T results in sharper peaks.
pose the MFA problem in a manner even more similar to GNC, a similarity first noted by
Geiger and Girosi [6.25]. In the weak membrane application of GNC, the minimization
problem is
min_{f,l} H_GNC    (6.42)
where
H_GNC = Hn + S + P,   S = λ² Σ_i |∇_i(f)|² (1 − l_i),   and   P = α Σ_i l_i    (6.43)
and the notation ∇_i(f) is interpreted to mean the gradient of the image at point i. Here, l_i ∈ {0, 1} denotes a discontinuity in f at the ith pixel. That is, if l_i = 1, the pixel at point i is interpreted as an edge point. Similarly, f_i will denote the brightness of the ith pixel. It has
been shown [6.16] that minimizing HGNC can be reduced to the following problem, which
involves only continuous variables
min_f ( Hn + Σ_i v(∇_i(f)) ).    (6.44)
In Eqs. (6.43) and (6.44), |·| represents any operator which returns a scalar measure of the local edginess of the image, such as (∂f/∂x)² + (∂f/∂y)², and the v function of Eq. (6.44) is the clipped parabola illustrated in Fig. 6.8.
The minimization problem posed by Eq. (6.44) is unsolvable by techniques such as gradient descent, since the function defined by Eq. (6.44) is in general not convex. That is, it may possess many minima. Instead, GNC approximates v with the piecewise-smooth function
v*(t) = λ² t²                 if |t| < q
      = α − c*(|t| − r)²/2    if q ≤ |t| < r    (6.45)
      = α                     if |t| ≥ r
Fig. 6.9. Smoothed approximations v* to the energy of Fig. 6.8. Smaller values of p result in approximations which are closer to the desired prior. The t used here (in GNC) is equivalent to the edge strength of MFA.
where
r² = α (2/c* + 1/λ²),   and   q = α/(λ² r).    (6.46)
Equations (6.45) and (6.46) then define the algorithm. Reducing the parameter p from 1 to 0 steadily changes v* until it becomes precisely equal to v. This produces a family of prior energies, illustrated in Fig. 6.9.
The process of gradually reducing p begins by minimizing a function which is convex, and therefore has a unique minimum. Then, from that minimum, the local minimum is tracked continuously as p is reduced from 1 to 0.
6A.2
Relation of spatial
derivatives (RHS) to
temporal derivatives
(LHS).
∂f_i/∂t = ∇ · (c_i ∇_i f ),    (6.47)
where t is time, and ∇_i f denotes the spatial gradient of f at pixel i. The diffusion equation models the flow of some quantity (the most commonly used example is heat) through a material with conductivity (e.g., thermal conductivity) c.
If c_i is constant, independent of the pixel number i (c_i = c), then the partial differential Eq. (6.47) has a solution which is the same as convolution with a Gaussian, in which the variance of that Gaussian depends on c and on the time over which the diffusion is run. Specifically, let f, a function of space and time, be described by a specific partial differential equation (PDE). If it is possible to write f in the following form:
f(x, t) = ∫ G(x, x′, t) f(x′, 0) dx′,    (6.48)
then we say that G(x, x′, t) is the Green's function of the PDE. The special case of isotropic diffusion may be stated formally in one dimension:
Theorem
The Gaussian is the Green's function of the PDE
∂f/∂t = c ∂²f/∂x².    (6.49)
Proof
This is accomplished by writing the Gaussian as
G(x, x′, t) = (1/σ) exp(−(x − x′)²/(2σ²)),
where σ will turn out to be a function of time (we have omitted the 1/√(2π) since it occurs on both sides of the PDE and cancels out).
Substitute Eq. (6.48) into Eq. (6.49), producing on the left-hand side (LHS) the partial with respect to t of an integral in which σ is a function of t. Taking this partial makes the LHS
(∂σ/∂t) [ (x − x′)²/σ⁴ − 1/σ² ] exp(−(x − x′)²/(2σ²)).    (6.50)
Similarly, we can take the second partial derivative with respect to x to create the right-hand side (RHS):
(c/σ) [ (x − x′)²/σ⁴ − 1/σ² ] exp(−(x − x′)²/(2σ²)).    (6.51)
Equating Eqs. (6.50) and (6.51) produces the equation
c/σ = ∂σ/∂t,    (6.52)
whose solution is
σ² = 2ct.  QED.    (6.53)
In the case of VCD, the conductance becomes a function of the spatial coordinates, in
this instance parameterized by i. In particular it becomes a property of the local image
intensities themselves. The conductance ci is usefully seen as a factor by which space is
locally compressed.
To smooth, except at edges, we let c_i be small if i is an edge pixel, i.e., if a selected image property is locally nonuniform. If c_i is small (in the heat transfer analogy), little heat flows (space is stretched), and in the image, little smoothing occurs. If, on the other hand, c_i is large, then much smoothing is allowed in the vicinity of pixel i (space is compressed). VCD then, just as the forms of MFA and GNC discussed, implements an operation which after repetition produces a nearly piecewise uniform result.
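A one-step sketch in numpy; the exponential conductance is one common choice and the constants dt and k are illustrative assumptions.

```python
import numpy as np

def vcd_step(f, dt=0.2, k=10.0):
    fy, fx = np.gradient(f)                        # spatial gradient
    c = np.exp(-(fx ** 2 + fy ** 2) / k ** 2)      # conductance is small at edges
    div = np.gradient(c * fy, axis=0) + np.gradient(c * fx, axis=1)
    return f + dt * div                            # df/dt = div(c grad f)
```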
6A.3
(6.54)
If there is an edge in the image, we want to remove noise on both sides of the edge, but not
blur the edge. Therefore it makes sense to diffuse in a direction tangent to the edge. The
normal and tangent vectors to an edge in a two-dimensional image are given by
N = [fx fy]ᵀ / √(fx² + fy²)   and   T = [−fy fx]ᵀ / √(fx² + fy²).
Consider now, the second partial derivatives of f taken in the N and T directions: f N N and
fT T .
Since the Laplacian is rotation-invariant, we can write the diffusion PDE (Eq. (6.54)) in
the new coordinates by f t = c( f N N + f T T ).
One may derive the following relationships between the partials
f_NN = (fx² fxx + 2 fx fy fxy + fy² fyy)/(fx² + fy²)
f_TT = (fy² fxx − 2 fx fy fxy + fx² fyy)/(fx² + fy²).
(6.55)
Substituting this form into Eq. (6.54) and subtracting the normal flow, we end up with a PDE
which smooths along edges without smoothing across edges:
f_t = (fy² fxx − 2 fx fy fxy + fx² fyy)/(fx² + fy²).
(6.56)
6A.4
6A.4.1
The equivalence of the techniques is demonstrated in the next section and proven formally
in [6.12], to which we refer the reader for in-depth analysis. The following experiments are
also described in that paper and are reprinted here to assist the reader in understanding the
action of the algorithms.
The two approaches were each used to restore the same image with various signal-to-noise ratios (SNRs). On each application of MFA and GNC to a noisy image, the respective
parameters were varied to achieve the best possible image restoration. Several hundred runs
with distinct parameter values were completed for each algorithm. We found that for each
noisy image there existed some parameter set for each algorithm such that the restored images
were of comparable quality.
The resulting image quality achieved is depicted above with the original image (Fig. 6.10),
the image corrupted with SNR = 2 (Fig. 6.11), the MFA restored image (Fig. 6.12), and the
GNC restored image (Fig. 6.13).
Coding of the GNC algorithm can be found in [6.16], which performs descent using
successive over-relaxation (SOR). By using an implementation of MFA which also uses
SOR, we found the execution times of MFA to be roughly ten times faster than GNC for high
noise cases (SNR < 3). For cleaner images, SNR ≥ 4, the GNC execution times were faster.
6A.4.2
∂f/∂x = lim(Δx→0) (f(x + Δx) − f(x))/Δx
(6.57)
In a sampled image (as all digital images are), however, the limiting process makes no sense,
for as ter Haar Romeny et al. point out [6.73], one cannot take differences at scales smaller
than a pixel. Instead, one must estimate the derivative at a point by operations performed on
some neighborhood of that point. The topic of estimating derivatives is an old one, and we
will not examine it further except to point out that most analyses have concluded that such
estimates (for higher order derivatives as well) are generally computed by a kernel operator
in one dimension and by the Euclidean norm of an array of n operators in n dimensions.
For the purposes of this derivation, we consider only derivatives in the x direction. We will
generalize this in a few paragraphs. Thus, we may rewrite the prior term as
H_p(f) = −(b/(2T)) Σᵢ exp(−(f ⊗ r)ᵢ²/(2T²)).
(6.58)
By the notation (f ⊗ r)ᵢ we mean the result of applying a kernel r to the image f at point i. The
kernel may be chosen to emphasize problem-dependent image characteristics. This general
form has been used to remove noise from piecewise-constant [6.38] and piecewise-linear
[6.8, 6.9] images.
In the following derivation, we consider only the prior term.
To perform gradient descent we will need the derivative, so we calculate
∂H_p/∂fᵢ = (b/(2T)) [((f ⊗ r)/T²) exp(−(f ⊗ r)²/(2T²))] ⊗ r_rev
(6.59)
         = κ ((f ⊗ r) exp(−(f ⊗ r)²)) ⊗ r_rev .
(6.60)
In the equation above, we have lumped the constants together into κ and set the annealing
control parameter, T, to 1 for clarity. We have then made use of the observation that for
centered first derivative kernels, f ⊗ h = −(f ⊗ h_rev). We will discuss the impact of T in
the next section.
Finally, we consider the use of ∇_f H(f) in a gradient descent algorithm. In the simplest
implementations of gradient descent, f is updated by (compare with Eq. (6.20))
fᵢ^(k+1) = fᵢ^k − α (∂H/∂fᵢ),
(6.61)
where f^k denotes the value of f at iteration k, and where α is some small constant (or, in more
sophisticated algorithms, a function of the Hessian of H). Rewriting Eq. (6.61),
(fᵢ^(k+1) − fᵢ^k)/α = −∂H/∂fᵢ,
(6.62)
we note that the LHS of Eq. (6.62) then represents a change in f between iterations k and
k + 1, and in fact bears a strong resemblance to the form of the derivative of f. We make
this similarity explicit by defining that iteration k is calculated at time t and iteration k + 1
calculated at time t + Δt. (In similar contexts, Δt is sometimes known as the evolution
parameter.) Since Δt is an artificially introduced parameter, without physically meaningful
units, we may write
∂fᵢ/∂t ≅ (fᵢ(t + Δt) − fᵢ(t))/Δt,
(6.63)
where we have allowed the constant α to be renamed Δt so the expression looks like a
derivative with respect to time. Substituting this (re)definition into Eq. (6.60), we have our
final result: By simply changing notation and making explicit the time between iterations,
the derivative of the MFA prior term may be rewritten as
∂fᵢ/∂t = κ ∇ · ((∇f) exp(−(∇f/T)²)),
(6.64)
Comparing Eq. (6.64) with the diffusion equation
∂f/∂t = ∇ · (c ∇f),
(6.65)
in which the conductivity, c, is replaced by the exponential, we observe that Eq. (6.64) is
precisely the form of the diffusion equation used in VCD [6.31, 6.59, 6.62]. The constant κ
simply incorporates into one place all the constants that are involved.
By Eq. (6.65), we have shown the equivalence of MFA and VCD, provided that MFA is
performed without using the noise term, Eq. (6.11). This equivalence provides for the union of
two schools of thought concerning image analysis: The first (optimization) school considers
the properties that an image ought to have. It then sets up an optimization problem whose
solution is the desired image. One might also call this the restoration school. The second
(process) school is more concerned with determining the appropriate spatial analysis to apply.
Adaptive filtering, diffusion, template matching, etc. are more concerned with the process
itself than with what that process does to some hypothetical energy function of the image.
The result of this section demonstrates that these two schools are not just philosophically
equivalent; at least for this particular form of edge-preserving smoothing, they are precisely
equivalent.
The above equivalence considered only the prior term of the MFA objective function.
Addition of the noise term converts an image feature extraction algorithm into a constrained
restoration algorithm.
Nordstrom [6.59] also observed a similarity between diffusion techniques and regularization (optimization) methods. He notes that the anisotropic diffusion method (which Whitaker
[6.79] calls VCD) does not seek an optimal solution of any kind. This is not quite true: "the
developer of the technique did not intend for it to be used as a minimization technique" is
possibly a better way to say it. Nordstrom then proceeds to unify from the original outlook
quite different methods of regularization and anisotropic diffusion. He proceeds quite elegantly and precisely to define a cost function whose behavior is an anisotropic diffusion, in
a similar manner to the derivation presented here. Nordstrom also argues for the necessity of
adding a stabilizing cost to restrict the space of possible estimated image functions. At
this point the reader should not be surprised to learn that the form of the stabilized cost is
Σᵢ (fᵢ − gᵢ)²
(6.66)
which we showed in Eq. (6.11) is a measure of the effect of Gaussian noise in a blur-free
imaging system. Thus we see that biased anisotropic diffusion (BAD) [6.59] may be thought of
as a maximum a posteriori restoration of an image. This observation now permits researchers
in VCD/BAD to consider use of different stabilizing costs, if they have additional information
concerning the noise generation process.
6A.5
6A.6
Conclusion
An equivalence exists between the problems of image optimization and diffusion. Others
have made similar observations. In addition to Nordstrom [6.59], Geiger and Yuille came to a
similar conclusion [6.26] for an energy function requiring explicit line processes. Since it has
been shown [6.38] that the line processes are not required, their result may now be interpreted
more generally. In an image optimization problem (in particular, restoration), one defines a
criterion function which is to be minimized, and applies some minimization scheme to find a
global (or at least a good) minimum. Thus, an image restoration problem may be considered
as a combination of goals: (1) to preserve the information in the original image, and thereby
produce a resulting image which resembles that original (or the result of an operation on it)
in some way; and (2) to produce an image which possesses certain properties, such as local
smoothness except at boundaries. If one abandons the first goal, then restoration problems
become iconic feature extraction problems. Wu and Doerschuk [6.81] have developed an
attractive extension to this work.
Finally, recall that in [6.19], we demonstrated that MFA operations could be calculated
by a two-layer, locally connected, recurrent neural network. From this paper one may then
conclude that GNC and VCD are likewise implementable by straightforward neural networks.
These results lead us to conjecture the following guiding principles for the design of feature
extraction algorithms.
r
r
r
Bibliography
[6.1] I. Abdelqader, S. Rajala, W. Snyder, and G. Bilbro, Energy Minimization Approach
to Motion Estimation using Mean Field Annealing, Signal Processing, July 1992.
[6.2] E. Allgower and K. Georg, Numerical Continuation Methods, Berlin, Springer-Verlag,
1990.
2
Some authors also include a third requirement, localness, to the definition of a relaxation. We prefer to
include this in a separate bullet point.
[6.3] H. Andrews and B. Hunt, Digital Image Restoration, Englewood Cliffs, NJ, Prentice-Hall, 1977.
[6.4] D. Baker and J. Aggarwal, Geometry Guided Segmentation of Outdoor Scenes,
SPIE Applications of Artificial Intelligence, VI, pp. 576–583, 1988.
[6.5] J. Besag, Spatial Interaction and the Statistical Analysis of Lattice Systems, Journal
of the Royal Statistical Society, B, 36, pp. 192326, 1974.
[6.6] J. Besag, On the Statistical Analysis of Dirty Pictures, Journal of the Royal Statistical
Society, B, 48 (3), 1986.
[6.7] G. Bilbro and W. Snyder, Fusion of Range and Luminance Data, IEEE Symposium
on Intelligent Control, Arlington, August, 1988.
[6.8] G. Bilbro and W. Snyder, Range Image Restoration using Mean Field Annealing,
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[6.9] G. Bilbro and W. Snyder, Mean Field Annealing, an Application to Image Noise
Removal, Journal of Neural Network Computing, Fall, 1990.
[6.10] G. Bilbro and W. Snyder, Optimization of Functions with Many Minima, IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[6.11] G. Bilbro, R. Mann, T. Miller, W. Snyder, D. Van den Bout and M. White, Optimization by Mean Field Annealing, In Advances in Neural Information Processing
Systems, San Mateo, CA, Morgan-Kauffman, 1989.
[6.12] G. Bilbro, W. Snyder, S. Garnier, and J. Gault, Mean Field Annealing: a Formalism
for Constructing GNC-like Algorithms, IEEE Transactions on Neural Networks, 3(1)
pp. 131–138, 1992.
[6.13] G. Bilbro, W. Snyder, and R. Mann, Mean Field Approximation Minimizes Relative
Entropy, Journal of the Optical Society of America, A, 8(2), February 1991.
[6.14] M. Black and A. Jepson, Estimating Optical Flow in Segmented Images Using
Variable-order Parametric Models with Local Deformations, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(10), 1996.
[6.15] A. Blake, Comparison of the Efficiency of Deterministic and Stochastic Algorithms
for Visual Reconstruction, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 1(1), 1989.
[6.16] A. Blake and A. Zisserman, Visual Reconstruction, Cambridge, MA, MIT Press, 1987.
[6.17] E. Brezin, D. LeGuillon, and J. Zinn-Justin, Field Theoretical Approaches to Critical
Phenomena, Phase Transitions and Critical Phenomena, volume 6, eds. C. Domb
and M. Green, New York, Academic Press, 1976.
[6.18] R. Burden, J. Faires, and A. Reynolds, Numerical Analysis, Boston, Prindle, 1981.
[6.19] H. Chang and M. Fitzpatrick, Geometrical Image Transformation to Compensate
for MRI Distortions, SPIE Medical Imaging IV, 1233, pp. 116127, February,
1990.
[6.20] H. Derin and H. Elliot, Modeling and Segmentation of Noisy and Textured Images
using Gibbs Random Fields, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 9, pp. 3955, 1987.
[6.21] H. Ehricke, Problems and Approaches for Tissue Segmentation in 3D-MR Imaging,
SPIE Medical Imaging IV: Image Processing, 1233, pp. 128137, February, 1990.
[6.22] J. Elder, Are Edges Incomplete? International Journal of Computer Vision, 34(2),
1999.
[6.23] J. Elder and R. Goldberg, Image Editing in the Contour Domain, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[6.24] J. Elder and S. Zucker, Scale Space Localization, Blur, and Contour-based
Image Coding, IEEE Conference on Computer Vision and Pattern Recognition, San
Francisco, CA, 1996.
[6.25] D. Geiger and F. Girosi, Parallel and Deterministic Algorithms for MRFS: Surface Reconstruction and Integration, AI Memo, No 1114, Cambridge, MA, MIT,
1989.
[6.26] D. Geiger and A. Yuille, A Common Framework for Image Segmentation by Energy
Functions and Nonlinear Diffusion, MIT AI Lab Report, Cambridge, MA, 1989.
[6.27] D. Geman and S. Geman, Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6(6), November, 1984.
[6.28] R. Gonzalez and P. Wintz, Digital Image Processing, 2nd edn, Reading, MA, Addison-Wesley, 1987.
[6.29] A. Gray, J. Kay, and D. Titterington, An Empirical Study of the Simulation of Various Models used for Images, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(5), 1994.
[6.30] B. Groshong, G. Bilbro, and W. Snyder, Restoration of Eddy Current Images by
Constrained Gradient Descent, Journal of Nondestructive Evaluation, December,
1991.
[6.31] S. Grossberg, Neural Dynamics of Brightness Perception: Features, Boundaries,
Diffusion, and Resonance, Perception and Psychophysics, 36(5), pp. 428456,
1984.
[6.32] J. Hadamard, Lectures on the Cauchy Problem in Linear Partial Differential Equations,
New Haven, CT, Yale University Press, 1923.
[6.33] J. Hammersley and P. Clifford, Markov Field on Finite Graphs and Lattices,
unpublished.
[6.34] F. Hansen and H. Elliot, Image Segmentation using Simple Markov Field Models,
Computer Graphics and Image Processing, 20, pp. 101132, 1982.
[6.35] R. Haralick and G. Shapiro, Image Segmentation Techniques, Computer Vision,
Graphics, and Image Processing, 29, pp. 100132, 1985.
[6.36] E. Hensel, Inverse Theory and Applications for Engineers, Englewood Cliffs, NJ,
Prentice-Hall, 1991.
[6.37] H. Hiriyannaiah, Signal Reconstruction using Mean Field Annealing. Ph.D. Thesis,
North Carolina State University, Raleigh, NC, 1990.
[6.38] H. Hiriyannaiah, G. Bilbro, W. Snyder, and R. Mann, Restoration of Locally Homogeneous Images using Mean Field Annealing, Journal of the Optical Society of
America A, 6, pp. 19011912, December, 1989.
[6.39] J. Hopfield, Neurons with Graded Response Have Collective Computational Properties Like Those of Two-state Neurons, Proceedings of the National Academy of
Science USA, 81, pp. 3088–3092, 1984.
[6.40] J. Hopfield and D. Tank, Neural Computations of Decisions in Optimization Problems, Biological Cybernetics, 52, pp. 141–152, 1985.
[6.41] M. Irani, B. Rousso, and S. Peleg, Recovery of Ego-motion Using Region Alignment, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), 1997.
[6.42] A. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, New York,
IEEE Press, 1988.
[6.43] S. Kapoor, P. Mundkur, and U. Desai, Depth and Image Recovery using a MRF
Model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11),
1994.
[6.44] I. Kapouleas and C. Kulikowski, A Model-based System for Interpretation of MR
Human Brain Scans, Proceedings of the SPIE, Medical Imaging II, vol. 914, February,
1988.
[6.45] R. Kashyap and R. Chellappa, Estimation and Choice of Neighbors in Spatialinteraction Model of Images, IEEE Transactions on Information Theory, 29, pp.
6072, January, 1983.
[6.46] M. Kelly, In Machine Intelligence, vol 6, Edinburgh, University of Edinburgh Press,
1971.
[6.47] R. Kindermann and J. Snell, Markov Random Fields and Their Applications, Providence, RI, American Mathematical Society, 1980.
[6.48] S. Kirkpatrick, C. Gelatt, and M. Vecchi, Optimization by Simulated Annealing,
Science, 220, pp. 671–680, 1983.
[6.49] A. Kolmogorov, On the Representation of Continuous Functions of One Variable by
Superposition of Continuous Functions of One Variable and Addition, AMS Translation, 2, pp. 5559, 1957.
[6.50] R. Lee, and R. Leahy, Multi-spectral Tissue Classication of MR Images Using
Sensor Fusion Approaches, SPIE Medical Imaging IV: Image Processing, 1233,
pp. 149157, February, 1990.
[6.51] S. Li, On Discontinuity-adaptive Smoothness Priors in Computer Vision, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[6.52] R. Malik and T. Whangbo, Angle Densities and Recognition of 3D Objects, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[6.53] D. Marr, Vision, San Francisco, CA, Freeman, 1982.
[6.54] J. Marroquin, Probabilistic Solution to Inverse Problems, Doctoral Dissertation, MIT,
1985.
[6.55] P. Morris, in Nuclear Magnetic Resonance Imaging in Medicine and Biology, Oxford,
Clarendon Press, 1986.
[6.56] J. Moussouris, Gibbs and Markov Systems with Constraints, Journal of Statistical
Physics, (10), pp. 1133, 1974.
[6.57] M. Nashed, Aspects of Generalized Inverses in Analysis and Regularization, in
Generalized Inverses and Applications, ed. by M. Nashed, New York, Academic
Press, 1976.
[6.58] T. Nelson, Propagation Characteristics of a Fractal Network: Applications to the HisPurkinje Conduction System, SPIE Medical Imaging IV: Image Processing, 1233,
pp. 2332, February, 1990.
[6.59] N. Nordstrom, Biased Anisotropic Diffusion: A Unified Regularization and Diffusion Approach to Edge Detection, Image and Vision Computing, 8(4), pp. 318–327,
1990.
[6.60] T. Pavlidis, Structural Pattern Recognition, Berlin, Springer-Verlag, 1977.
[6.61] A. Pentland, Interpolation using Wavelet Bases, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[6.62] P. Perona and J. Malik, Scale-space and Edge Detection using Anisotropic Diffusion,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, pp. 629639,
July, 1990.
[6.63] H. Qi, A High-resolution, Large-area Digital Imaging System, Ph.D. Thesis, North
Carolina State University, 1999.
[6.64] A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd edn, Vol. 2, New York,
Academic Press, 1982.
[6.65] H. Samet, The Design and Analysis of Spatial Data Structures, Reading, MA, Addison-Wesley, 1989.
[6.66] P. Santago, K. Link, W. Snyder, J. Worley, S. Rajala, and Y. Han, Restoration of Cardiac Magnetic Resonance Images, Symposium on Computer Based Medical Systems,
Chapel Hill, NC, June 36, 1990.
[6.67] S. Shemlon and S. Dunn, Rule-based Interpretation with Models of Expected Structure, SPIE Medical Imaging IV, 1233, pp. 3344, February, 1990.
[6.68] W. Snyder, G. Bilbro, A. Logenthiran, and S. Rajala, Optimal Thresholding A New
Approach, Pattern Recognition Letters, 11(12), December, 1990.
[6.69] W. Snyder, P. Santago, A. Logenthiran, K. Link, G. Bilbro, and S. Rajala, Segmentation of Magnetic Resonance Images using Mean Field Annealing, XII International
Conference on Information Processing in Medical Imaging, Kent, England, July 711,
1991.
[6.70] W. Snyder, A. Logenthiran, P. Santago, K. Link, G. Bilbro, and S. Rajala, Segmentation of Magnetic Resonance Images using Mean Field Annealing, Image and Vision
Computing, 10(6), pp. 361368, 1992.
[6.71] C. Soukoulis, K. Levin, and G. Grest, Irreversibility and Metastability in Spin-glasses.
I. Ising Model, Physical Review B, 28(3), pp. 14951509, 1983.
[6.72] B. Super and W. Klarquist, Patch-based Stereo in a General Binocular Viewing
Geometry, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3),
1997.
[6.73] B. ter Haar Romeny, L. Florack, J. Koenderink, and M. Viergever, Scale Space:
Its Natural Operators and Differential Invariants, XII International Conference on
Information Processing in Medical Imaging, Kent, England, July 711, 1991.
[6.74] P. Torr, R. Szeliski, and P. Anandan, An Integrated Bayesian Approach to Layer Extraction from Image Sequences, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(3), 2001.
[6.75] J. van Laarhoven and E. Aarts, Simulated Annealing: Theory and Applications,
Norwell, MA, Reidel, 1988.
[6.76] M. Vannier, et al., Multispectral Analysis of Magnetic Resonance Images, Radiology, 154, pp. 221224, January, 1985.
[6.77] D. Van den Bout and T. Miller, Graph Partitioning using Annealed Neural Networks,
IEEE Transactions on Neural Networks, 1(2), pp. 192203, 1990.
[6.78] C. Wang, W. Snyder, and G. Bilbro, Optimal Interpolation of Images, Neural Networks for Computing Conference, Snowbird, UT, April, 1995.
[6.79] R. Whitaker, Geometry-limited Diffusion in the Characterization of Geometric
Patches in Images, TR91-039, Dept. of Computer Science, UNC, Chapel Hill, NC,
1991.
[6.80] G. Wolberg and T. Pavlidis, Restoration of Binary Images Using Stochastic Relaxation With Annealing, Pattern Recognition Letters, 3(6), pp. 375388, December,
1985.
[6.81] C. Wu and P. Doerschuk, Cluster Expansions for the Deterministic Computation
of Bayesian Estimators Based on Markov Random Fields, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[6.82] M. Yaou and W. Chang, Fast Surface Interpolation Using Multiresolution Wavelet
Transform, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[6.83] J. Yi and D. Chelberg, Discontinuity-preserving and Viewpoint Invariant Reconstruction of Visible Surfaces Using a First-order Regularization, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(6), 1995.
Mathematical morphology
A man's discourse is like to a rich Persian carpet, the beautiful figures and patterns of which
can be shown only by spreading and extending it out; when it is contracted and folded up,
they are obscured and lost
Plutarch
The suffix "-ology" means "study of," so obviously, morphology is the study of
morphs; answering critical questions like: "How come they only come out at night,
and then fly toward the light?" and "Why is it that bug zappers only toast the harmless
critters, leaving the skeeters alone?" and "HOLD IT! That's MORPH-ology, the
study of SHAPE, not moths!" Try again . . .
7.1.1
Dilation
First, the intuitive denition: The dilation of a (BINARY) image is that same image
with all the foreground regions made just a little bit bigger.
Now, formally: We consider two images, f A and f B , and let A and B be sets of
ordered pairs, consisting of the coordinates of each foreground pixel in f A and f B ,
respectively.
Consider one pixel in f_B , and its corresponding element (ordered pair) of B, call
that element b ∈ B. Create a new set by adding the ordered pair b to EVERY ordered
pair in A. Let's look at a tiny example.
For this image, A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}. Adding the
pair (−1, 1) results in the set A₍₋₁,₁₎ = {(1, 9), (2, 7), (3, 5), (4, 7), (5, 5), (6, 7),
(7, 9)}. The corresponding image is also shown in Fig. 7.1, and we hope you
Fig. 7.1. Example of dilation. (a) The original binary image. (b) The binary image dilated by
B = {(−1, 1)}.
observed that A₍₋₁,₁₎ is nothing more than a translation of A. With this concept
firmly understood, think about what would happen if you constructed a SET of
translations of A, one for each pair in B. We denote that as {A_b , b ∈ B}, that is, b is
one of the ordered pairs in B.
Formally, we define the DILATION of A by B as A ⊕ B = {a + b | a ∈ A,
b ∈ B}, which is the same as the union of all those translations of A,
A ⊕ B = ⋃(b∈B) A_b
(7.1)
and we use the same symbol to denote the dilation of images: f_A ⊕ f_B . Here is
another example.
[f_A and f_B are shown as pixel grids in the original.]
A = {(2, 8), (3, 6), (4, 4), (5, 6), (6, 4), (7, 6), (8, 8)}
(7.2)
(7.3)
and
f_A ⊕ f_B = [the dilated image, shown as a pixel grid in the original]
#(A ⊕ B) ≤ #A · #B.
(7.4)
To go further, we need to define some notation: If x is an ordered pair, then (1) the
translation¹ of a set A by x is A_x , (2) the reflection of A is Ã = {(−x, −y) | (x, y) ∈ A},
and (3) the complement of set A is Aᶜ . An example of reflection is
[f_A and its reflection f_Ã are shown as pixel grids in the original.]
In this example, A = {(0, 0), (1, 0), (1, 1)} and the reflection of A is {(0, 0), (−1, 0),
(−1, −1)}.
7.1.2
Erosion
Now, we define the (sort of) inverse of dilation, erosion,
A ⊖ B = {a | (a + b) ∈ A for every b ∈ B},
(7.5)
which can be written in terms of translations by
A ⊖ B = ⋂(b∈B̃) A_b .
(7.6)
¹ We do not have to be limited to a 2-space. As long as x and A are drawn from the same space, it works. More
generally, if A and B are sets in a space Ω, and x ∈ Ω, the translation of A by x is denoted A_x = {y | for some
a ∈ A, y = a + x}.
Notice two things: The second set, B, is reflected, and the intersection symbol is
used. Let's do an example.
[The example images f_A and f_B are shown as pixel grids in the original.]
Rather than the tedious job of listing all the 17 elements of A, just draw
[The translations f_A(0,0) and f_A(1,0) and their intersection f_A(0,0) ∩ f_A(1,0) are
shown as pixel grids in the original.]
So now we have dilation and erosion defined. You will observe that usually (for all
practical purposes) one of the images is small compared to the other; that is, in
the example above
#A ≫ #B.
(7.7)
When this is the case, we refer to the smaller image, f_B , as the structuring element
(s.e.).
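The set definitions translate directly into code. The sketch below works on small binary arrays rather than the set-of-pairs notation; the image size, the offset representation of the structuring element, and treating out-of-bounds pixels as background are illustrative choices, not the book's implementation.

/* Binary dilation and erosion of an image by a structuring element whose
 * elements are given as offsets relative to its origin. */
#include <stdio.h>

#define ROWS 10
#define COLS 10

typedef struct { int dr, dc; } Offset;       /* one element of the s.e.   */

static int get(const int img[ROWS][COLS], int r, int c)
{
    return (r >= 0 && r < ROWS && c >= 0 && c < COLS) ? img[r][c] : 0;
}

/* A (+) B: a pixel is set if ANY translate of A by an s.e. element covers it */
void dilate(const int in[ROWS][COLS], int out[ROWS][COLS],
            const Offset *se, int n)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            out[r][c] = 0;
            for (int k = 0; k < n; k++)
                if (get(in, r - se[k].dr, c - se[k].dc)) { out[r][c] = 1; break; }
        }
}

/* A (-) B: a pixel survives only if EVERY translate by the s.e. stays in A   */
void erode(const int in[ROWS][COLS], int out[ROWS][COLS],
           const Offset *se, int n)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            out[r][c] = 1;
            for (int k = 0; k < n; k++)
                if (!get(in, r + se[k].dr, c + se[k].dc)) { out[r][c] = 0; break; }
        }
}

Opening and closing then follow by composing these two calls (erode then dilate for opening, dilate then erode for closing).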
7.1.3
(7.8)
(7.9)
(7.10)
B ⊆ A ⟹ (B ⊕ K) ⊆ (A ⊕ K)
(7.11)
for any s.e. K. When this property holds, we say the operator is increasing.
An example proof: Dilation is increasing
Let the set A consist of elements Ai : A = {A1 , A2 , . . . , An }, and let B be similarly
denoted. Furthermore, suppose B ⊆ A. Now, suppose both A and B are dilated by
the same s.e., K. Take a single element of K, say K1 , and dilate each element of
A by K1 : A ⊕ K1 = {A1 + K1 , A2 + K1 , . . . , An + K1 }, and similarly dilate B.
Since every element of B was also an element of A, every element of B ⊕ Ki is in
A ⊕ Ki . Since this observation is true for an arbitrary element of K, it is true for
all elements of K. Now consider the union of the results of applying two elements
of the s.e. K to A: A12 = (A ⊕ K1 ) ∪ (A ⊕ K2 ). Since B ⊕ K1 ⊆ A ⊕ K1 and
B ⊕ K2 ⊆ A ⊕ K2 , we know from set theory that if R1 ⊆ S1 and R2 ⊆ S2 then
R1 ∪ R2 ⊆ S1 ∪ S2 , and we are done.
If the origin is an element of the s.e. K, then dilation is extensive: A ⊆ A ⊕ K.
(7.12)
As you might guess, erosion has some extensive properties as well: That is, erosion
7.1.4
(7.13)
(7.14)
(A ∪ B) ⊖ C ⊇ (A ⊖ C) ∪ (B ⊖ C)
(7.15)
A ⊖ (B ∪ C) ⊆ (A ⊖ B) ∩ (A ⊖ C)
(7.16)
A ⊕ (B ∪ C) = (A ⊕ B) ∪ (A ⊕ C).
(7.17)
(7.18)
(7.19)
An application
So what is the purpose of all this? Let's do an example: Inspection of printed circuit
(PC) boards. Here's a picture of a PC board with two traces on it shorted together by
a hair which was stuck to the board when it went through the wave solder machine.
We will use opening to identify the short.
First, erode the image using a small s.e. We choose an s.e. which is smaller than the
features of interest (the traces), but larger than the defect. The erosion then looks
like this:
and surprise, surprise, the defect is gone. For inspection purposes, one could now
subtract the original from the opened, and the difference image would contain only
the defect. Furthermore these operations can be done in hardware, blindingly fast.
(7.20)
(7.21)
This example illustrates rst of all that morphological concepts can be extended to a
continuous domain. (For the time being, remember, however, that this is continuous
in resolution, not in brightness value; still binary. We will fix that soon.) Second, it
illustrates the fact that opening preserves exactly the geometry of objects which are
big enough, and totally erases smaller objects. In this sense, opening resembles
the functioning of the median lter, where each pixel is replaced by the median of
its neighbors.
7.1.5
Duality: (A ∘ K)ᶜ = Aᶜ • K̃.
Proof of duality. Notice how this proof is done. We expect students to do proofs
this carefully.
1. (A ∘ K)ᶜ = [(A ⊖ K) ⊕ K]ᶜ   definition of opening
2.           = (A ⊖ K)ᶜ ⊖ K̃    complement of dilation
3.           = (Aᶜ ⊕ K̃) ⊖ K̃    complement of erosion
4.           = Aᶜ • K̃          definition of closing.
Idempotency: Opening and closing are idempotent. That is, repetitions of the
same operation have no further effect:
A ∘ K = (A ∘ K) ∘ K
A • K = (A • K) • K.
Closing is extensive: A • K ⊇ A.
Opening is anti-extensive: A ∘ K ⊆ A.
Images dilated by f_K remain invariant under opening by f_K . That is, f_A ⊕ f_K =
(f_A ⊕ f_K) ∘ f_K .
Proof
1. A • K ⊇ A                          since closing is extensive
2. (A • K) ⊕ K ⊇ A ⊕ K               dilation is increasing
3. ((A ⊕ K) ⊖ K) ⊕ K ⊇ A ⊕ K         definition of closing
4. (A ⊕ K) ∘ K ⊇ A ⊕ K               definition of opening
5. for any B, B ∘ K ⊆ B               opening is anti-extensive
6. therefore, (A ⊕ K) ∘ K ⊆ A ⊕ K    substitution of A ⊕ K for B
7. (A ⊕ K) ∘ K = A ⊕ K               since A ⊕ K is both greater than or equal to and less
   than or equal to (A ⊕ K) ∘ K, only equality can be true.
for some set of structuring elements K = {K 1 , K 2 , . . .}, where the set K is said to be
a basis set for this operation. The erosions could all be done in parallel and the union
performed very quickly using lookup table methods. More details are available in
[7.17].
(7.23)
To illustrate the concept of umbra, let f_A be one dimensional. (We illustrate a one-dimensional function here as a two-dimensional function in which one dimension is
always zero. That way, you have an example that is easily extended to two dimensions):
A = {(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)}
and the pixel value at the corresponding coordinate is
f A (x, 0) = [1, 2, 3, 1, 2, 3, 3].
Notice the new notation: Since f A takes on various values, depending on which
element of A is being considered, we use functional notation. Drawing f A , we have
Fig. 7.2.
In Fig. 7.2, the heavy black line represents f A , and the umbra is the shaded area
under f A . Following this gure, the heavy black line is the TOP of the umbra. We
could write the umbra as a set of ordered triples:
U ( f A ) = {(0, 0, 1), (1, 0, 1), (1, 0, 2), (2, 0, 1), (2, 0, 2), (2, 0, 3), (3, 0, 1),
(4, 0, 1), (4, 0, 2), (5, 0, 1), (5, 0, 2), (5, 0, 3), (6, 0, 1), (6, 0, 2), (6, 0, 3)}.
Here's the trick: Although the gray-level image is no longer binary (and therefore
not representable by set membership) the umbra does have those properties.
We may therefore define the dilation of a gray-scale image f_A by a gray-scale
s.e., f_B as
f_A(x, y) ⊕ f_B(x, y) = TOP(U(f_A) ⊕ U(f_B))
(7.24)
and erosion is similarly defined. Furthermore, gray-scale opening and closing can be
defined in terms of gray-scale dilation and erosion.
Generalizing this concept to two-dimensional images, the umbra becomes three
dimensional, a set of triples
U(f(x, y)) = {(x, y, z) | z ≤ f(x, y)}.
(7.25)
The TOP operation recovers the image from its umbra:
TOP(U)(x1 , y1 ) = max{z | (x1 , y1 , z) ∈ U}
for (x1 , y1 ) ∈ Z × Z, where Z denotes the set of possible pixel locations,
assumed here to be positive and integer.
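For sampled images, the TOP-of-umbra definition of Eq. (7.24) reduces to the familiar "max-plus" form, and that is what one usually codes. The sketch below uses that standard equivalence; the image size, the s.e. size, and the handling of pixels off the image edge are illustrative assumptions.

/* Gray-scale dilation in max-plus form, equivalent for sampled images to
 * TOP(U(f_A) (+) U(f_B)) in Eq. (7.24). */
#include <limits.h>

#define ROWS 64
#define COLS 64
#define SER  3            /* structuring element is (2*SER+1) x (2*SER+1) */

void gray_dilate(const int f[ROWS][COLS],
                 const int b[2 * SER + 1][2 * SER + 1],
                 int out[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int best = INT_MIN;
            for (int dr = -SER; dr <= SER; dr++)
                for (int dc = -SER; dc <= SER; dc++) {
                    int rr = r - dr, cc = c - dc;
                    if (rr < 0 || rr >= ROWS || cc < 0 || cc >= COLS)
                        continue;            /* ignore pixels off the image */
                    int v = f[rr][cc] + b[dr + SER][dc + SER];
                    if (v > best) best = v;
                }
            out[r][c] = best;
        }
}

Gray-scale erosion is the mirror image (a min over f(x + b) − b), and gray-scale opening and closing compose the two just as in the binary case.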
(7.26)
(7.27)
7.3.1
D(x, y) = min over (i, j) of (D(x + i, y + j) + T(i, j)),
(7.28)
where the minimum is taken over the elements of the mask T of Fig. 7.4, whose origin
is written +0.
A more detailed explanation follows. First, the distance transform, D(x, y), is initialized to a special symbol denoting infinity at every nonedge point, D⁰(x, y) =
∞, (x, y) ∉ R, and to zero at every edge point, D⁰(x, y) = 0, (x, y) ∈ R. Then,
application of the mask starts at the upper left corner of the image, placing the origin
of the mask over the (1, 1) pixel of the image, and applying Eq. (7.28) to calculate
a new value for the DT at (1, 1). In the example of Fig. 7.5, the DT is illustrated
with infinities denoted by blank squares, and edges denoted by zeros. The mask
of Fig. 7.4 is applied in the shaded area. The application of Eq. (7.28) produces
min(1 + 0, 1 + ∞) for the distance transform value in the pixel indicated by the
arrow.
After one pass, top-to-bottom, left-to-right, the mask is reversed (in both directions), and applied again, bottom-to-top, right-to-left.
This process is repeated at each pixel until all pixels in the image have been
processed, and then iterated until all pixels in the DT are marked with a finite value.
Masks other than that of Fig. 7.4 produce other variations of the DT. In particular,
Fig. 7.6 produces the chamfer map. If divided by three, the chamfer map produces
a DT which is not a bad approximation to the Euclidian distance.
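A sketch of the two-pass computation with a simple city-block mask (weight 1 for each of the causal neighbours) appears below; the image size, the INF sentinel, and the particular mask are illustrative assumptions rather than the book's code.

/* Two-pass distance transform: a forward raster pass using the causal
 * neighbours, then a backward pass with the mask reversed. */
#define ROWS 128
#define COLS 128
#define INF  1000000          /* plays the role of the "infinity" symbol */

static int min2(int a, int b) { return a < b ? a : b; }

/* edge[r][c] nonzero marks an edge pixel; D receives the distance to the
 * nearest edge pixel under the city-block metric. */
void distance_transform(const int edge[ROWS][COLS], int D[ROWS][COLS])
{
    /* initialization: 0 on edges, "infinity" elsewhere */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            D[r][c] = edge[r][c] ? 0 : INF;

    /* forward pass: neighbours above and to the left */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            if (r > 0) D[r][c] = min2(D[r][c], D[r - 1][c] + 1);
            if (c > 0) D[r][c] = min2(D[r][c], D[r][c - 1] + 1);
        }

    /* backward pass: mask reversed, neighbours below and to the right */
    for (int r = ROWS - 1; r >= 0; r--)
        for (int c = COLS - 1; c >= 0; c--) {
            if (r < ROWS - 1) D[r][c] = min2(D[r][c], D[r + 1][c] + 1);
            if (c < COLS - 1) D[r][c] = min2(D[r][c], D[r][c + 1] + 1);
        }
}

For this simple metric the two passes already reach the fixed point; the more general procedure described above simply repeats the pair of passes until no value changes.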
7.3.2
(7.29)
that points in the Voronoi diagram do not belong to the Voronoi domain of any
region.
7.4 Conclusion
In this chapter, we have looked at a particular approach to processing the shape of
regions. Morphological operators are particularly useful in binary images, but may
be applied to gray-scale images as well. Unlike most chapters in this book, we did
not make explicit use of either optimization methods or consistency.
7.5 Vocabulary
You should know the meanings of the following terms.
Closing
Dilation
Distance transform
Erosion
Extensive
Increasing
Opening
Umbra
Voronoi diagram
Assignment
7.1
In section 7.1.3, we stated that dilation is commutative because addition is commutative. Erosion also involves addition, but one of the two images is reversed.
Is erosion commutative? Prove or disprove it.
Assignment
7.2
In section 7.3, a mask is given and it is stated that
application of this mask produces a distance transform
which is not a bad approximation to the Euclidian
distance from the interior point to the nearest edge
point. How bad is it? Contrive an example where the
value produced by the application of the mask is different from the Euclidian distance to the nearest edge
point.
Assignment
7.3
Consider a region with an area of 500 pixels with 120
pixels on the boundary. You need to find the distance
transform from each pixel in the interior to the boundary, using the Euclidian distance. (Note: Pixels ON the
boundary are not considered IN the region -- at least
in this problem.) What is the computational complexity? Note: You may come up with some algorithm more
clever than that used to produce any of the answers
below, so if your algorithm does not produce one of
these answers, explain it.
(a) 60 000
(b) 120 000
(c) 45 600
(d) 91 200
Assignment
7.4
Trick question: For Problem Assignment 7.3, how many
square roots must you calculate to determine this
distance transform? Remember, this is the Euclidian
distance.
Assignment
7.5
Prove (or disprove) that binary images eroded by a
kernel, K, remain invariant under closing by K . That
is, prove that A ⊖ K == (A ⊖ K) • K.
Assignment
7.6
Show that dilation is not necessarily extensive if the
origin is not in the s.e.
Assignment
7.7
Prove that dilation is increasing.
Assignment
7.8
Let C be the class of binary images that have only one
dark pixel. For a particular image, let that pixel be
located at (i0 , j0 ).
Using erosion and dilation by kernels that have
{(0, 0)} as an element, devise an operator, that is,
Assignment
7.9
Which of the following statements is correct? (You
should be able to reason through this, without doing
a proof.)
(a) (A ⊕ B) ⊖ C == A ⊕ (B ⊖ C)
(7.30)
(b) (A ⊖ B) ⊖ C == A ⊖ (B ⊕ C)
(7.31)
Assignment
7.10
Use the thresholded images you created in Assignment
5.5 and Assignment 5.6. Choose a suitable structuring
element, and apply opening to remove the noise.
Topic 7A Morphology
7A.1
Matheron [7.21] proved that any of a large class of morphological operations can be
computed as a union of erosions, or by using duality, as an intersection of dilations. Choosing
the set of basis sets so that a given operation by a given structuring element may be
calculated in this way has been the subject of considerable research [7.18, 7.20].
7A.1.1
[The eight chain-code directions, 0–7, and the shapes of the basic factors U0, U2, U4, U6,
J0–J7, L0, L2, L4, L6, V1, V3, V5, V7, and R0–R7 are shown in figures in the original.]
A boundary may be written in the canonical form
U0^(S_U0) J0^(S_J0) L0^(S_L0) R0^(S_R0) 0^(S_0) J1^(S_J1) V1^(S_V1) R1^(S_R1) 1^(S_1) · · · J7^(S_J7) V7^(S_V7) R7^(S_R7) 7^(S_7)
(7.32)
where any of the superscripts may be zero. For example, V1 1^2 2 R4^2 4 6^3 R7 is a member of this
set, but J7 1 J2 is not.
Definition
An image A is a factor of an image S if and only if it is possible to write S as the dilation of
A, that is, S = A ⊕ B. A factor A is a prime factor of S if and only if A cannot be factored
into anything other than itself and single-pixel images. In Table 7.1 are listed all the prime
factors which start with R0 . The prime factors are not required to be in the form of Eq. (7.32).
In Table 7.2, we present other prime factors, listing only their chain code representation.
Now we present the approach to decomposition of the structuring element using an example.
We will decompose the structuring element illustrated in Fig. 7.11, whose chain code is
S = L0 0^3 1 2^4 R4 4^3 R6 6^2 . The concave portions of this boundary are denoted by v1 = L0 ,
v2 = R4 , v3 = R6 . The convex portions by d1 = 0, d2 = 1, d3 = 2, d4 = 4, d5 = 6.
[Fig. 7.11: the structuring element to be decomposed, shown as a pixel grid in the original.]
Table 7.1. The prime factors which start with R0:
R0 2245, R0 22V5 6, R0 22R5, R0 235, R0 2R4 6, R0 242, R0 V3 426, R0 V3 V5 6,
R0 V3 R5, R0 R3 5, R0 R3 46, R0 326, R0 34, R0 V3 45
Table 7.2. Other prime factors (chain-code representation):
U0 02 22 42
J0 022 42
J1 22 42 6
L 0 22 42
V1 22 42 V7
V1 234V7
V1 V3 42 62
V1 V3 R5 6
V1 22 52
V1 32 62
R0 22 45
R0 242
R0 R35
R1 242 V7
R1 252
R1 R4 62
U0 02 234
J0 0234
J1 22 45
L 0 234
V1 22 4R6
V1 23R6
V1 V3 42 V7
V1 V3 52
V1 245
V1 32 V7
R0 22 V5 6
R0 V3 42 6
R0 R346
R1 24R6
R1 3462
R1 R4 V7
U0 01242
J0 1242
J1 2346
U0 0134
J0 134
J1 235
V1 22 V5 62
V1 2R4 62
V1 V3 456
V1 R3 462
V1 V3 V5 V7
V1 346
R0 22 R5
R0 V3 45
R0 32 6
R1 2V5 62
R1 34V7
R1 42 6
V1 22 V5 V7
V1 2R4 V7
V1 V3 4R6
V1 R3 4V7
V1 R3 R6
V1 35
R0 235
R0 V3 V5 6
R0 34
R1 2V5 V7
R1 356
R1 45
V1 22 R5 6
V1 242 6
V1 V3 V5 62
V1 R3 56
R0 2R4 6
R0 V3 R5
R1 2R5 6
R1 3R6
First, we identify all the prime factors involving L 0 , R4 , and R6 , which are compatible
with this image. These are illustrated in Fig. 7.12. To understand more clearly how these are
compatible with the image, consider the segment R4 6^2 0 1. Observe that this matches the R4
segment which is at the upper right of Fig. 7.11.
The next step in the process is to construct a matrix Γ, where Γij represents the number
of times vi occurs in Aj . In this example,
      1 0 0 0 0
Γ =   0 1 1 0 0 ,
      0 0 0 1 1
A1 = L0 2^2 4^2
A2 = R4 6^2 0 1   (R0 2^2 4 5)
A3 = R4 6 0^2     (R0 2 4^2)
A4 = R6 0 2^2     (R0 2 4^2)
A5 = R6 1 2       (R0 3 4)
Fig. 7.12. The prime factors which match segments of the boundary of Fig. 7.11. The chain codes
in parentheses indicate what the boundary would be if rotated to R0 equivalents.
where we can see that R4 occurs once in A2 and A3 , but not at all in A1 , A4 , or A5 . Next, we
construct a matrix Δ, where Δij counts the number of times di occurs in Aj . Here,
      0 1 2 1 0
      0 1 0 0 1
Δ =   2 0 0 2 1
      2 0 0 0 0
      0 2 1 0 0
Two vectors are defined: Y, representing the number of times vi occurs in the original boundary,
and Z, the number of times di occurs in the original boundary. In this example, Y = [1 1 1]ᵀ
and Z = [3 1 4 3 2]ᵀ. A vector X which satisfies
ΓX = Y
ΔX ≤ Z
(7.33)
is a solution for the decomposition. In this example, X = [1 0 1 1 0]ᵀ satisfies these two equations. Note that ΔX = [3 0 4 2 1]ᵀ, which is less than or equal to Z in the sense that each element
of ΔX is less than or equal to the corresponding element of Z. Thus, we can decompose the
boundary by dilating by A1 once, by A2 zero times, A3 once, A4 once, and A5 zero times, or
S = A1 ⊕ A3 ⊕ A4 ⊕ B.
All that remains is to determine a kernel B. This is accomplished by looking at the difference between ΔX and Z: Z − ΔX = [0 1 0 1 1]ᵀ. Thus, we need a kernel whose boundary is
described by the sequence d2 , d4 , d5 , each repeated once. This sequence is 146, as in Fig. 7.13.
Thus, we have a sequence of structuring elements, each 3 × 3, whose sequential application
produces the same result as the kernel in Fig. 7.11.
Fig. 7.13. The s.e. described by the sequence 146.
7A.2
S ⊕ S = S
(7.34)
and
S̃ = S.
(7.35)
Equation (7.34) represents an example of the property that this particular S is closed under
dilation. A convenient grid to use is every third point, as illustrated in Fig. 7.14.
So now, sampling means to read and remember the image values at all the pixels where
the grid is black. Now suppose K is some s.e. We propose a rather simple reconstruction
algorithm. Here is the idea: Lets sample our image using the sampling grid specied. Then,
we ask the question: Under what conditions will the dilation of the sampled image by the
s.e. K, be the original image? Florencio and Schafer [7.7] have shown that the following
properties are required: First, the sampling grid itself, when dilated by the s.e., must be the
entire space:
S ⊕ K = ℰ (the entire space),
(7.36)
and
∀(x, y ∈ S, x ≠ y),  K_x ∩ K_y = ∅,
(7.37)
Actually, this is not very interesting: just the center pixel (the origin) and its eight neighbors.
But it does satisfy the three conditions.
If these properties hold for S and K, then the theorem is as follows. Let F be some image,
let P be the set of images Fi satisfying Fi = A ⊕ K for some A ⊆ S, and let Q be the set of
images Fj satisfying Fj = Fj ∘ K = Fj • K. Then:
Part A The samples of F ∈ P at the points in S are necessary and sufficient to perfectly
reconstruct F.
Part B The samples of F ∈ Q at the points in S are necessary to reconstruct F with bounded
error and sufficient to reconstruct F with an error at most r(K), where r(K) is the
radius of the smallest circle containing K.
There is a lot to talk about regarding this theorem: First, what does part A mean? Can you
figure it out from the set notation? You should be able to, but you may say to yourself "if this
means what I think it means, it's trivial." Well, what it means is this: If F can be generated
by taking some collection of the points in S, and dilating them with kernel K, then all you
have to remember is which points in S were needed. Yes, it is kind of trivial, isn't it? But
there are some profound implications, which we will get to when we talk about the Nyquist
rate.
Fig. 7.15. (a) Original image. (b) Result of sampling the original image with S of Fig. 7.14.
Now to understand part B: The Hausdorff distance is a measure of the difference between
two (sub)images. Given two sets of points S = {s1 , s2 , . . . , sn } and T = {t1 , t2 , . . . , tm }, the
Hausdorff distance is defined by
d_H(S, T) = max(h(S, T), h(T, S))
(7.38)
where h(S, T) = max over s ∈ S of min over t ∈ T of ‖s − t‖. That is, the Hausdorff distance is the largest of the
minimum distances between elements of one set to the elements of the other set.
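A direct implementation of Eq. (7.38) for small point sets is sketched below (quadratic in the set sizes); the Point type and the Euclidean norm are obvious choices made for illustration, not anything prescribed by the book.

/* Hausdorff distance between two point sets, Eq. (7.38): the larger of the
 * two directed distances h(S,T) and h(T,S), where h takes, over the first
 * set, the largest of the nearest-neighbour distances into the second set. */
#include <math.h>
#include <float.h>

typedef struct { double x, y; } Point;

static double dist(Point a, Point b)
{
    return hypot(a.x - b.x, a.y - b.y);
}

/* directed distance h(S, T) = max over s of min over t of ||s - t|| */
static double directed(const Point *S, int n, const Point *T, int m)
{
    double h = 0.0;
    for (int i = 0; i < n; i++) {
        double nearest = DBL_MAX;
        for (int j = 0; j < m; j++) {
            double d = dist(S[i], T[j]);
            if (d < nearest) nearest = d;
        }
        if (nearest > h) h = nearest;
    }
    return h;
}

double hausdorff(const Point *S, int n, const Point *T, int m)
{
    double hst = directed(S, n, T, m);
    double hts = directed(T, m, S, n);
    return hst > hts ? hst : hts;
}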
Part B says that if F is "closed under opening" by K (that's quite an expression, don't you
think? Closed under opening), that is, if F does not change when it is opened by K and
closed under closing by K, then the samples of F at the points of S are sufficient to represent
F almost exactly. By almost exactly we mean the set of sample points, when dilated by K,
is very nearly equal to the original F.
Now let's do that example: Fig. 7.15 represents an original image, F_A, and what we get
when we sample it with S.
Since we know that we could only get data on rows 0, 3, 6, or 9 and similarly for columns,
we could toss out those rows and columns in our reduced resolution version, and get a
smaller image. So there, on the left of Fig. 7.16, we have the subsampled, smaller version
of the original image. Now let's zoom it back, using dilation: Place K down at each of the
sampling points and we get the reconstruction of Fig. 7.16. Hmmm, doesn't look much like
the original, unsampled image, does it? (Whatever happened to exact reconstruction?) And
here's a tough question. This theorem claims to say that one can reconstruct a signal by
sampling every third point. Doesn't this contradict Shannon? Doesn't Shannon say we have
Fig. 7.16. Subsampled image and reconstruction by dilation. Students: Note the original image
in Fig. 7.15(a): does that image belong to P or Q?
to sample every other point anyway? (Actually, Shannon's theorem was defined for analog
signals. Students: How DOES the Shannon theorem apply to this case? What IS the Nyquist
rate?) The sampling theorem does not say we can reconstruct any signal, it says something
about frequencies, doesn't it? (Say yes.) In fact, the sampling theorem basically says that a
sampling grid with a particular frequency (which is the same as the grid spacing) cannot be
used to store information which changes more than half the grid spacing. In morphological
sampling, the theorem is complicated by the fact that we not only can choose a grid, we can
also choose a kernel. The restrictions are given by part B of the theorem. Unless the image is
one which has already been prefiltered by K, K cannot restore it. Haralick [7.12, p. 252] says
it this way: "The morphological sampling theorem cannot produce a reconstruction whose
positional accuracy is better than the radius of the circumscribing disk of the reconstruction
structuring element K."
7A.3
7A.4
7A.4.1
Introduction
In this section, we apply the concepts of the distance transform, connected components (see
Chapter 8), and binary morphology-like operations to solve the problem of closing gaps in
edges (see [7.16] for alternative strategies). In two dimensions, an edge is a curve, and in
three dimensions, a surface. As we have seen, any edge operator is certain to occasionally
fail, resulting in both extraneous edges and gaps in edges. If an edge has gaps in it, then
connected component labeling routines will fail, labeling interior and exterior points the
same. Thus, we must develop techniques which correct such edge detection errors. We will
relate these techniques to morphological operators. In [7.10], we developed a technique
called distance-transform-based closing, and have implemented this technique in both two
and three dimensions. This new technique is compared with binary morphology (mask erosion
techniques [7.1]) and iterative parallel thinning [7.44], a 3D parallel thinning technique [7.42]
Several of the figures in this chapter are from a paper written by one of the authors [7.36].
and with 3D morphological techniques. In every case, the technique reported here produces
superior performance by better preserving the shape of sharp corners.
7A.4.2
Problem definition
We are given an edge image. Due to noise, blur, or other error, the edge/surface resulting
from edge detection in a real image will occasionally have gaps: areas in which the edge
detector response is not sufficiently strong to make a positive determination. (See Pratt [7.25,
Section 17] for an excellent discussion of edge detector errors.) Such gaps may be closed by
various types of morphological operators. The algorithm described here is denoted DT-driven
closing.
Fig. 7.17. Two distance definitions which may be used to compute distance transforms. The one
on the right produces the chamfer map, closely approximating the Euclidian distance.
To construct a complete DT, Eq. (7.28) is iterated until no more changes occur between
iterations. This is the approach followed by Bister et al. [7.3]. For the application discussed
here, however, we assume some a priori knowledge of maximum gap size. This allows us to
define a value K_max reflecting this knowledge. K_max will represent a distance so great that any
pixel this far away could not possibly be part of the edge. Normally, we use values of K_max = 4,
since this allows gaps of six pixels to be bridged. Fig. 7.18 illustrates a distance transform
near a gap in an edge.
Generation of the three-dimensional distance transform proceeds in an analogous manner.
Observe that the computational complexity of distance transform generation is proportional
to the number of edge points in the image and to the size of K_max . Larger values of K_max
increase the size of the kernel k substantially in three dimensions.
7A.4.3
Bister et al. [7.3] identify local maxima in the DT, and each maximum results in a potential
region. They note, however, "Since the distance transform is sensitive to noise producing
irregularities in the borders of the regions, one cavity (region) can contain many local maxima
very close to one another. In order to eliminate these spurious maxima, a filter merges the
maxima for which the sum of the heights is much larger than the geometrical distance
between them." We conjecture that the performance of this filter is equivalent to our choice
of K_max . Instead of searching for a maximum, we use connected components (Chapter 8).
Both techniques identify an area within a region which robustly characterizes that region.
7A.4.4
A voxel is a
three-dimensional pixel.
One constraint on the relabeling algorithm is the selection of the background region as the
best neighbor. Neighbors belonging to the background are considered in a more restricted
fashion than those belonging to regions. For the background value to be selected as the best
neighbor, the background pixel must be face-connected to the pixel under consideration.
This avoids the undesirable result of occasionally finding isolated background pixels inside
a closed boundary. (See [7.30] for more discussion on this connectivity paradox.)
When k = 0, the TBA pixels we seek to relabel are either true edges or noise pixels.
An edge in an image, by denition, represents an interface between an object region and
some other region in that image. The other region may be either another object region or the
background. We choose to relabel an edge pixel as part of an image region regardless of the
outcome of the enumeration. Only in the case that the edge pixel is completely surrounded
by background, do we choose background as the best neighbor.
copyarray(L, Ltemp);
}/* end while */
}/* end for k */
}/* end relabel */
/*==============================================================*/
/* in this function, p and n are data structures containing the frame, row, and col*/
/* coordinates of a voxel*/
int Best26Neighbor(p)
{
    /* examine each of the 26 neighbors n of voxel p in turn */
    while ((n = neighbor(p)) != NULL)
    {
        if (L(n) != EDGE)
        {
            /* count region labels; count BACKGROUND only when the
               background voxel is face-connected to p               */
            if (L(n) != BACKGROUND) Card[L(n)]++;
            else if (faceconnected(n, p)) Card[BACKGROUND]++;
        }
    }
    /* an edge voxel is relabeled as part of a region whenever possible:
       if the most frequent neighbor label is BACKGROUND but p is itself
       an edge voxel, return the next most frequent (region) label       */
    if ((maximum(Card) == BACKGROUND) && (L(p) == EDGE))
        return NextMax(Card);
    else
        return maximum(Card);
} /* end Best26Neighbor */
In this algorithm, Card is simply an array which keeps a count (cardinality) of the number
of times a particular label is adjacent to the voxel of interest.
7A.4.5
Examples
In this section, we compare the technique described above with competing morphological
strategies.
Fig. 7.19. Two regions with large gaps in their boundaries. From [7.36]. Used with permission.
Fig. 7.20. Distance transform and result of connected component labeling of Fig. 7.19. From
[7.36]. Used with permission.
Fig. 7.21. Segmentation resulting from applying DT-driven closing. Note that the regions are
correctly partitioned, with a precision accurate to the individual pixel. From [7.36]. Used
with permission.
7A.4.6
Thinning differs from
skeletonization in that
thinning preserves
connectivity: A boundary
which is intact when wide
will still be intact after
thinning.
Si(A) = (A ⊖ iK) \ ((A ⊖ iK) ∘ K),
(7.39)
(a)
(b)
Fig. 7.22. (a) Original cutlery image, containing gaps in edges. (b) Distance transform of original
cutlery image. From [7.36]. Used with permission.
(c)
(d)
Fig. 7.23. (c) Thickened edge image. (d) Label image based on dilated edges. From [7.36]. Used
with permission.
Fig. 7.24. (e) Relabeled cutlery image. From [7.36]. Used with permission.
where X\Y denotes all the elements of X which are not in Y. The skeleton is then
constructed by
Skeleton = ⋃(i=0 to N) Si(A).
(7.40)
The subsets contain information about size, orientation, and connectivity. A minimal skeleton
has the property of being able to exactly reconstruct the original image, but it does not
necessarily preserve path or surface connectivity [7.20]. An alternative to the morphological
skeleton, not considered in this book, is morphological shape decomposition (MSD) [7.24].
MSD and the morphological skeleton are compared by Reinhardt and Higgins [7.27] who
conclude that MSD performs slightly better. While morphological skeletonization can be
used for many applications (such as image coding) and has been widely studied, it is not
directly comparable to 2D or 3D thinning, in which connectivity is preserved. Since, in this
application of edge/surface gap closing, we insist on preservation of connectivity, we consider
below only techniques which possess this property.
Morphological 2D thinning (Arcelli, Cordella, and Levialdi)
Arcelli et al. [7.1] use a sequence of masks to implement thinning. The original image is
eroded by each of eight 3 × 3 masks in sequence and the resulting image is used as input
Fig. 7.25. 3 × 3 masks (left) and result of erosion thinning on the cutlery image (right). From
[7.36]. Used with permission.
Fig. 7.26. Results of iterative thinning. From [7.36]. Used with permission.
to the next iteration until all possible pixels have been eroded. In each mask shown in Fig.
7.25 [7.1, 7.26], the positions marked in black denote edge pixels, white denotes background,
and hashed are pixels which do not become involved in the computation. A mask is said to
match an image at coordinates (k, l) if, when the center of the mask is registered to pixel
(k, l), then all image pixels covered by the mask are edge or background as denoted by the
corresponding mask pixel. If a mask matches an image at a particular edge point, then the
edge at that point in the image may be reset to background. The masks are applied in the
following order: A1, B1, A2, B2, A3, B3, A4, B4. The result of thinning Fig. 7.23 using this
algorithm is shown in Fig. 7.25. Note how this technique distorts the shape of sharp vertices
like the fork tines.
Iterative 2D thinning (Zhang and Suen)
Zhang and Suen's algorithm for thinning binary images [7.44] iteratively passes over the
image, deciding if contour points can be deleted. A contour point is an edge pixel with at
least one 8-neighbor that is a background pixel. Each iteration contains two passes and at
each iteration, the decision to remove a pixel is based on the number of edge neighbors of a
pixel, the number of 0-to-1 transitions in a sequence around the pixel, and on two sets of background neighbor configurations. The result of iteratively thinning the dilated cutlery image
(Fig. 7.23) is shown in Fig. 7.26. It is interesting to note the similarity with the technique of
Arcelli et al.
7A.4.7
Three-dimensional images
DT-driven closing was applied to a three-dimensional image of a box with broken interior
partitions; Figs. 7.27–7.29 show the results. The order of the images is the same as shown for
the 2D results. Note that even though the gaps are large in x, y, they are successfully bridged,
and the edges are still sharp in the final relabeled image.
The three-dimensional algorithm was also tested on a synthetic ellipsoid (Fig. 7.30) which
was deliberately undersampled to produce large gaps. The results of DT-driven closing with
a K max value of 3 are shown in Fig. 7.30.
Fig. 7.27. (a) Original broken box image, frame 20. (b) Distance transform of broken box.
Fig. 7.28. (c) Thickened edge image. (d) Label image based on dilated edges.
(a)
(b)
Fig. 7.30. Ellipsoid with gaps (a) and relabeled ellipsoid (b).
7A.4.8
7A.4.9
Preserving geometry
Probably the most significant aspect of the performance of DT-driven closing is its ability
to preserve the geometry of surfaces, particularly near vertices. Its two-dimensional performance is particularly well demonstrated in Fig. 7.24. To demonstrate how it processes
vertex geometry in three dimensions, we synthesized a cone and extracted the surface with a
three-dimensional edge detector.
Fig. 7.32. (a) Dilated ellipsoid and (b) Tsao and Fu thinned ellipsoid. From [7.36]. Used with
permission.
Fig. 7.33. A cross section through the apex of a cone which has been processed by DT-based closing (a) and by Tsao–Fu thinning (b). From [7.36]. Used with permission.
Gaps in the surface were then closed with DT-driven closing. The same cone surface was dilated using conventional dilation to close gaps, and then thinned using the Tsao–Fu algorithm. The results are shown in Fig. 7.33. Since the process of dilation replaces the edge information, the Tsao–Fu thinning algorithm has no memory of the original surface when it thins the dilated image. Of course, neither technique processes the geometry perfectly.
However, since DT-driven closing retains, via the DT, a memory of the original, undilated,
geometry, this technique is better able to restore that geometry after gaps are closed. See
[7.38] for more details and computation speed.
7A.4.10
7A.5 Vocabulary
You should know the meanings of the following terms.
Chamfer map
Prime factor
Sampling
Assignment
7.A1
In section 7A.1.1 an example decomposition of a structuring
element was given. Prove that this decomposition gives the
Assignment
7.A2
What is the major difference between the output of a thinning
algorithm and the maxima of the distance transform? Choose
the best answer from the following.
(a) A thinning algorithm preserves connectivity. The maxima of
the DT are not necessarily connected.
(b) The maxima of the DT are unique, thinning algorithms do
not produce unique results.
(c) The DT preserves connectivity. Thinning algorithms do not.
(d) Thinning algorithms produce the intensity axis of symmetry. The DT does not.
Assignment
7.A3
Use the distance transform to compute the medial axis of an
image. The name of the image will be given in class.
Bibliography
[7.1] C. Arcelli, L. Cordella, and S. Levialdi, Parallel Thinning of Binary Pictures, Electronics Letters, 11, pp. 148–149, 1975.
[7.2] H.G. Barrow, Parametric Correspondence and Chamfer Matching, Proceedings of the 5th International Joint Conference on Artificial Intelligence, August, 1977, pp. 659–663.
[7.3] M. Bister, Y. Taeymans, and J. Cornelis, Automated Segmentation of Cardiac MR Images, Computers in Cardiology, Washington, DC, IEEE Computer Society Press, pp. 215–218, 1989.
[7.4] G. Borgefors, Distance Transformations in Arbitrary Dimensions, Computer Graphics, Vision, and Image Processing, 27, pp. 321–345, 1984.
[7.5] G. Borgefors, Distance Transformations in Digital Images, Computer Graphics, Vision, and Image Processing, 34, pp. 344–371, 1986.
[7.6] H. Breu, J. Gil, D. Kirkpatrick, and M. Werman, Linear Time Euclidian Distance Transform Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5), 1995.
[7.7] D. Florencio and R. Schafer, Homotopy and Critical Morphological Sampling, Proceedings of the SPIE, 2308, June, 1994.
[7.8] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles and Practice, Reading, MA, Addison-Wesley, pp. 91–99, 1990.
[7.9] W. Gong and G. Bertrand, A Simple Parallel 3D Thinning Algorithm, 10th International Conference on Pattern Recognition, June, 1990.
[7.10] B.R. Groshong and W.E. Snyder, Using Chamfer Maps to Segment Images, Technical Report CCSP-WP-86/11, Center for Communications and Signal Processing, North Carolina State University, Raleigh, NC, USA, December, 1986.
[7.11] K. Hafford and K. Preston Jr., Three-dimensional Skeletonization of Elongated Solids, Computer Vision, Graphics, and Image Processing, 27, pp. 78–91, 1984.
[7.12] R. Haralick and L. Shapiro, Computer and Robot Vision, Volume 1, New York, Addison-Wesley, 1992.
[7.13] D. Hilbert and S. Cohn-Vossen, Geometry and the Imagination, New York, Chelsea, 1952.
[7.14] H. Hiriyannaiah, G. Bilbro, and W. Snyder, Restoration of Locally Homogeneous Images using Mean Field Annealing, Journal of the Optical Society of America, A, 6(12), pp. 1901–1912, 1989.
[7.15] C. Huang and O. Mitchell, A Euclidian Distance Transform using Grayscale Morphology Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[7.16] X. Jiang, An Adaptive Contour Closure Algorithm and its Experimental Evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[7.17] R. Jones and I. Svalbe, Morphological Filtering as Template Matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), 1994.
[7.18] R. Jones and I. Svalbe, Algorithms for the Decomposition of Gray-scale Morphological Operations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 1994.
[7.19] S. Lobregt, P. Verbeek, and F. Groen, Three-Dimensional Skeletonization: Principle and Algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(1), pp. 75–77, 1980.
[7.20] P. Maragos and R. Schafer, Morphological Skeleton Representation and Coding of Binary Images, IEEE Transactions on Acoustics, Speech, and Signal Processing, 34, pp. 1228–1244, 1986.
[7.21] G. Matheron, Random Sets and Integral Geometry, New York, Wiley, 1975.
[7.22] H. Park and R. Chin, Optimal Decomposition of Convex Morphological Structuring Elements for 4-connected Parallel Array Processors, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[7.23] H. Park and R. Chin, Decomposition of Arbitrarily Shaped Morphological Structuring Elements, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[7.24] I. Pitas and A. Venetsanopoulos, Morphological Shape Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 1990.
[7.25] W.K. Pratt, Digital Image Processing, New York, Wiley, 1978.
[7.26] K. Preston and M. Duff, Modern Cellular Automata Theory and Applications, New York, Plenum Press, 1984.
[7.27] J. Reinhardt and W. Higgins, Comparison Between the Morphological Skeleton and Morphological Shape Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.
Segmentation
Fig. 8.2. A segmentation and labeling of the image in Fig. 8.1.
In many machine vision applications, the set of possible objects in the scene is quite
limited. For example, if the camera is viewing a conveyer, there may be only one
type of part which appears, and the vision task could be to determine the position
and orientation of the part. In other applications, the part being viewed may be one
of a small set of possible parts, and the objective is to both locate and identify each
part. Finally, the camera may be used to inspect parts for quality control.
In this section, we will assume that the parts are fairly simple and can be characterized by their two-dimensional projections, as provided by a single camera view.
Furthermore, we will assume that the shape is adequate to characterize the objects.
That is, color or variation in brightness is not required. We will first consider dividing
the picture into connected regions.
A segmentation of a picture is a partitioning into connected regions, where each region is homogeneous in some sense and is identified by a unique label. For example, in Fig. 8.2 (a label image), region 1 is identified as the background. Although
region 4 is really background as well, it is labeled as a separate region since it is not
connected to region 1.
The term homogeneous deserves some discussion. It could mean all the pixels
are the same brightness, but that criterion is too strong for most practical applications.
It could mean that all pixels are close to some representative (mean) brightness.
Stated more formally [8.80], a region is homogeneous if the brightness values are
consistent with having been generated by a particular probability distribution (see
also the analysis by Ng and Lee [8.44]). In the case of range imagery [8.35], where
we (might) have an equation which describes the surface, we could say a region
is homogeneous if it can be described by the combination of that equation and
some probabilistic deformation. For example, if all the points in a region of a range
image lay in the same plane except for deviations whose distance from the plane
may be described by a particular Gaussian distribution, we might say this region is
homogeneous.
There are several ways to perform segmentation. Threshold-based techniques are guaranteed to form closed regions, for they simply assign all pixels above (or below, depending on the problem) a specified threshold to be in the same region. Edge-based techniques assume that regions are separated by neighborhoods where the edge strength is high. Region-based methods start with elemental (e.g., homogeneous) regions and split or merge them. Then, there are a variety of hybrid methods, including watershed [8.5] techniques. A watershed method generally operates on the gradient of the image; segmentation consists of flooding the image with (by analogy) water, in which region boundaries (areas of high edge strength) are erected to prevent water from different seed points from mixing. Traditional region growing
methods are really variations on watershed methods [8.1].
Before we can go much further in our discussion of issues and techniques in
segmentation, you need to understand some of the interesting and unexpected things
that happen to geometry and topology when you deal with digital images. Remember
the connectivity paradox from section 4.5? We discovered that an object could have
a closed boundary but still have an inside and outside which were connected? As
another example, consider the problem of finding the perimeter of a region, or even
the length of a line from a sampled version of that line. That problem has immediate
applications in segmentation, and yet it is not obvious how to estimate it [8.31].
Keep in mind that these sorts of things can happen, because we are going to talk a
lot about connectivity in this chapter.
Fig. 8.3. Two detectors forming an image of a rectilinear grid. The light source is uniform,
however, the images exhibit both radiometric (brightness) distortion (the left image
is brighter in the right center and the right image is brighter in the middle) and
geometric distortion (straight lines are distorted in a characteristic pincushion
form).
Probably the most important factor to note is the local nature of thresholding.
That is, a single threshold is almost never appropriate for an entire scene. It is nearly
always the local contrast between object and background that contains the relevant
information. Since camera sensitivity drops off from the center of the picture to the
edges due to parabolic distortion and/or vignetting as shown in Fig. 8.3, it is often
useless to attempt to establish a global threshold. A dramatic example of this effect
can be seen in an image of a rectilinear grid, in which the uniform white varies significantly over the surface.
Effects such as parabolic distortion and vignetting are quite predictable and easy
to correct. In fact, off-the-shelf hardware is available for just such applications.
It is more difficult, however, to predict and correct effects of nonuniform ambient illumination, such as sunlight through a window, which changes radically over
the day.
Since a single threshold cannot provide sufficient performance, we must choose local thresholds. The most common approach is called block thresholding, in which the picture is partitioned into rectangular blocks and different thresholds are used on each block. Typical block sizes are 32 × 32 or 64 × 64 for 512 × 512 images. The block is first analyzed and a threshold is chosen; then that block of the image
is thresholded using the results of the analysis. In more sophisticated (but slower)
versions of block thresholding, the block is analyzed and the threshold computed.
Then that threshold is applied only to the single pixel at the center of the block. The
block is then moved over one pixel, and the process is repeated.
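A minimal sketch of the simpler (non-sliding) block-thresholding scheme is given below, assuming an 8-bit gray-scale NumPy image; the block size and the per-block threshold rule (block mean plus a small increment) are placeholders chosen for illustration.

import numpy as np

def block_threshold(img, block=32, delta=5):
    # Threshold each block at (block mean + delta); returns a binary image.
    out = np.zeros_like(img, dtype=np.uint8)
    rows, cols = img.shape
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            blk = img[r0:r0+block, c0:c0+block]
            t = blk.mean() + delta               # threshold chosen per block
            out[r0:r0+block, c0:c0+block] = (blk > t).astype(np.uint8)
    return out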
Fig. 8.4. A histogram of a bimodal image, with many bright pixels (intensity around 169) and many dark pixels (intensity around 11).
8.2.1
Choosing a threshold
The simplest strategy for choosing a threshold is to average the intensity over the block and choose i_avg + Δi as the threshold, where Δi is some small increment, such as 5 out of 256 gray levels. Such a simple thresholding scheme can have surprisingly
good results.
However, when the simpler schemes fail, one is forced to move to more sophisticated techniques, such as thresholding based on histogram analysis. Before we describe this technique, we will first define a histogram.
The histogram h(i) of an image f(x, y) is a function of the permissible intensity values. In a typical imaging system, intensity takes on values between 0x00 (black) and 0xFF (white). A graph that shows, for each gray level, the number of times that
level occurs in the image is called the histogram of an image. Illustrated in Fig. 8.4
is a histogram of black parts on a white conveyor.
In Fig. 8.4 we note two distinct peaks, one at gray level 11, almost pure black, and
one at gray level 169, bright white. With the exception of noise pixels, every point
in the image belongs to one of these regions. A good threshold, then, is anywhere
between the two peaks.
Histograms are seldom as nice as the one in Fig. 8.4 and some additional processing is generally needed ([8.14, 8.51, 8.70] explain and experimentally compare several such methods). In the following section, we describe a more sophisticated technique for finding the optimal threshold.
In [8.7], a technique was developed which finds the global minimum of a function of several variables, even for functions which have more than one minimum. That technique, known as tree annealing (TA), may be applied to the problem of histogram analysis and thresholding in the following way [8.62].
The histogram is modeled as the sum of two Gaussians:

$$h(x) = \frac{A_1}{\sqrt{2\pi}\,\sigma_1}\exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + \frac{A_2}{\sqrt{2\pi}\,\sigma_2}\exp\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right). \qquad (8.1)$$
If h(f) is properly normalized, one may adjust the usual normalization of the two
component Gaussians so that each sums to unity on the 256 discrete gray levels
(rather than integrating to unity on the continuous interval), and thereby admit the
additional constraint that A1 + A2 = 1. Use of this constraint reduces the number
of parameters to be estimated from six to five; however, we have determined experimentally that TA actually solves the problem more accurately by using the six variables, without readjusting the normalization on each iteration. Conventional descent often terminates at a suboptimal local minimum for this two-Gaussian problem and is even less reliable for a three-Gaussian problem. TA deals easily with either. The result of fitting a sum of three Gaussians to the histogram of an image is shown
in Fig. 8.5.
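As an illustration of Eq. (8.1), the sketch below fits a two-Gaussian mixture to a histogram with SciPy's general-purpose least-squares routine rather than with tree annealing; as noted above, such a gradient-based fit may terminate in a local minimum. The initial guess p0 and all names here are illustrative.

import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(x, a1, m1, s1, a2, m2, s2):
    # Sum of two Gaussians, Eq. (8.1), with all six parameters free.
    g1 = a1 / (np.sqrt(2*np.pi) * s1) * np.exp(-(x - m1)**2 / (2 * s1**2))
    g2 = a2 / (np.sqrt(2*np.pi) * s2) * np.exp(-(x - m2)**2 / (2 * s2**2))
    return g1 + g2

def fit_histogram(img):
    h, _ = np.histogram(img, bins=256, range=(0, 256))
    h = h / h.sum()                              # normalize the histogram
    x = np.arange(256)
    p0 = (0.5, 50.0, 10.0, 0.5, 180.0, 10.0)     # rough initial guess
    params, _ = curve_fit(two_gaussians, x, h, p0=p0, maxfev=10000)
    return params

A threshold may then be placed between the two fitted means, for example where the two component Gaussians cross.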
Whatever algorithm we use, the philosophy of histogram-based thresholding is
the same: Find peaks in the histogram and choose the threshold to be between them.
In many industrial environments, the lighting may be extremely well controlled.
With such control, the best thresholds will be constant over time and may be chosen
interactively during system set up. However, in general, different thresholds are used
in different areas of the picture.
Fig. 8.6. A graph with two connected components.
11112221111111111111111111111
11122221111113333333333333111
11112221111113333333333333111
11112211111111111111333311111
11112211111111111111333331111
11112211111111111111333333111
11222222111111111113333333111
11224422211111111133333331111
11222222111111111111333111111
11112211111111111111111111111
8.3.1
(1) Find an unlabeled black pixel; that is, L(x, y) = 0. Choose a new label number for this region, call it N. If all pixels have been labeled, stop.
(2) L(x, y) ← N.
(3) If f(x − 1, y) is black and L(x − 1, y) = 0, push the coordinate pair (x − 1, y) onto the stack.
    If f(x + 1, y) is black and L(x + 1, y) = 0, push (x + 1, y) onto the stack.
    If f(x, y − 1) is black and L(x, y − 1) = 0, push (x, y − 1) onto the stack.
    If f(x, y + 1) is black and L(x, y + 1) = 0, push (x, y + 1) onto the stack.
(4) Choose a new (x, y) by popping the stack.
(5) If the stack is empty, go to 1, else go to 2.
This labeling operation results in a set of connected regions, each assigned a unique
label number. To nd the region to which any given pixel belongs, the computer has
only to interrogate the corresponding location in the L memory and read the region
number.
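The algorithm translates almost line for line into code. The sketch below assumes f is a binary NumPy array (1 = black/foreground) and returns the label image L; variable names are ours.

import numpy as np

def grow(f):
    # Label 4-connected regions of black (1) pixels with a stack, as in steps (1)-(5).
    L = np.zeros_like(f, dtype=int)
    N = 0
    rows, cols = f.shape
    for seed in zip(*np.nonzero(f)):
        if L[seed] != 0:
            continue                      # step (1): find an unlabeled black pixel
        N += 1
        stack = [seed]
        while stack:
            y, x = stack.pop()            # step (4): pop the next coordinate pair
            L[y, x] = N                   # step (2)
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # step (3)
                yy, xx = y + dy, x + dx
                if 0 <= yy < rows and 0 <= xx < cols \
                        and f[yy, xx] == 1 and L[yy, xx] == 0:
                    stack.append((yy, xx))
    return L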
EXAMPLE
Applying region growing
Fig. 8.9 shows a 4 × 7 array of pixels. Assume the initial value of x, y is 2, 4.
Apply algorithm grow and show the contents of the stack and L each time step (3)
is executed. Let the initial value of N be 1.
Solution
Pass 1. Immediately after execution of step (3). The algorithm has examined
pixel 2, 4, examined its 4-neighbors, and detected only one 4-neighbor in the
foreground, the pixel at 3, 4. Thus, the coordinates of that pixel are placed on the stack.
Stack:   3, 4   (top)

L =   7   0 0 0 0
      6   0 0 0 0
      5   0 0 0 0
      4   0 1 0 0
      3   0 0 0 0
      2   0 0 0 0
      1   0 0 0 0
          1 2 3 4
Pass 2. The pixel at 3, 4 was removed from the top of the stack and marked with
a 1 in the L image, its neighbors were examined and two 4-neighbors were found,
the pixels at 3, 3 and at 3, 5; both were put on the stack.
Stack:   3, 5   (top)
         3, 3

L =   7   0 0 0 0
      6   0 0 0 0
      5   0 0 0 0
      4   0 1 1 0
      3   0 0 0 0
      2   0 0 0 0
      1   0 0 0 0
          1 2 3 4
Pass 3. The top of the stack contained 3, 5. That pixel was removed from the
stack and marked with a 1 in the L image. All its neighbors were examined and one
4-neighbor was found, the pixel at 3, 6. Thus the coordinates of this pixel are put
on the stack.
Stack:   3, 6   (top)
         3, 3

L =   7   0 0 0 0
      6   0 0 0 0
      5   0 0 1 0
      4   0 1 1 0
      3   0 0 0 0
      2   0 0 0 0
      1   0 0 0 0
          1 2 3 4
Pass 4. The stack was popped again, this time removing the 3, 6 and marking it with a 1 in the L image. That pixel was examined and determined to have no 4-neighbors which had not already been labeled.

Stack:   3, 3   (top)

L =   7   0 0 0 0
      6   0 0 1 0
      5   0 0 1 0
      4   0 1 1 0
      3   0 0 0 0
      2   0 0 0 0
      1   0 0 0 0
          1 2 3 4
Pass 5. The stack was popped again, removing the 3, 3 and marked with a 1 in the
L image. That pixel was examined and determined to have no 4-neighbors which
had not already been labeled.
Stack:   ( )   (top)

L =   7   0 0 0 0
      6   0 0 1 0
      5   0 0 1 0
      4   0 1 1 0
      3   0 0 1 0
      2   0 0 0 0
      1   0 0 0 0
          1 2 3 4
Pass 6. The stack was popped again, producing a return value of stack empty and
the algorithm is complete since all black pixels had been labeled.
This region growing algorithm is just one of several strategies for performing
connected component analysis. Other strategies exist which are faster than the one
described, including some that run at raster-scan rates [8.6]. We will now consider
one such technique.
8.3.2
111 22
111 22
111 22
111 22
111 11?
Fig. 8.10. Ambiguity in label assignment.
f(x, y) is the gray-scale value of the (x, y) pixel in the image memory.
(x, y)_i is the ith adjacent neighbor of the (x, y) pixel.
f_i(x, y) is the gray-scale value of the ith adjacent neighbor of the (x, y) pixel.
L(x, y) is the region label corresponding to the (x, y) pixel in the image memory.
L_i(x, y) is the region label number corresponding to the ith adjacent neighbor of the (x, y) pixel.
K(i) is the contents of the ith element in the equivalence memory. This memory is a content-addressable memory.
Fig. 8.11. Flowchart of the algorithm (from [8.60]).
The algorithm is illustrated in Fig. 8.13 for an arbitrary region adapted from
Milgram et al. [8.41]. The reader should note that the relation |f(x, y) − f_i(x, y)| < T tests if two pixels are similar. There are other similarity measures which could be used, including local first- and/or second-order statistics. If two pixels meet this criterion and they are adjacent, then they are in the same region. By definition, if two pixels are in the same region, then R(a, b) holds. That is,

$$\{\mathrm{ADJACENT}(x, y, x', y') \wedge |f(x, y) - f_i(x, y)| < T\} \Rightarrow R(x, y, x', y').$$
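In software, the same idea (one raster scan assigning provisional labels, with label merges recorded in an equivalence table K) can be sketched as below. This is a plain sequential rendering in the spirit of the flowcharted algorithm, not the content-addressable-memory hardware implementation; the names and the similarity test |f(x, y) − f_i(x, y)| < T follow the discussion above.

import numpy as np

def label_regions(f, T=10):
    # One raster scan with provisional labels plus an equivalence table K.
    rows, cols = f.shape
    L = np.zeros((rows, cols), dtype=int)
    K = [0]                                   # K[p] is the equivalent of label p
    p = 0
    for y in range(rows):
        for x in range(cols):
            for dy, dx in ((-1, 0), (0, -1)):         # previously scanned neighbors
                yy, xx = y + dy, x + dx
                if yy < 0 or xx < 0 or L[yy, xx] == 0:
                    continue
                if abs(int(f[y, x]) - int(f[yy, xx])) >= T:
                    continue                          # not similar: ignore neighbor
                if L[y, x] == 0:
                    L[y, x] = K[L[yy, xx]]            # adopt the neighbor's label
                else:
                    z = max(K[L[y, x]], K[L[yy, xx]])
                    w = min(K[L[y, x]], K[L[yy, xx]])
                    if z != w:                        # record the equivalence z = w
                        for q in range(len(K)):
                            if K[q] == z:
                                K[q] = w
            if L[y, x] == 0:                          # no similar labeled neighbor
                p += 1
                K.append(p)
                L[y, x] = p
    return np.array(K)[L], K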
Fig. 8.12. Architecture of a region labeling system: the image memory (f), the region label memory (L), the equivalence memory (K), an interface/processor, and the host computer.
Fig. 8.13 illustrates a difficult labeling problem, and Table 8.1 illustrates the labeling process for this example.
The f memory and the L memory are both conventional random access memories. However, the equivalence memory has two modes of operation. It may be used as a conventional RAM, where the address input corresponds to the region table
Table 8.1. Contents of the K (equivalence) memory at successive points during the labeling of the region in Fig. 8.13.
and data output is the equivalent table. In associative memory mode it is used to
update that table. In this mode, two activities occur in synchronism with a two-phase
clock.
Phase 1: All memory cells whose contents match the contents of the data bus set their corresponding enable flip-flops (see Fig. 8.14).
Phase 2: All memory cells whose enable flip-flops are set read the contents of the data bus.
This operation effectively updates the equivalence table in parallel during the
scan.
Fig. 8.14. Organization of the K memory.
It is also possible to update the table by a search algorithm at the end of the scan.
However, doing the updating in parallel during the scan tremendously reduces the
number of equivalences (since the lowest number is always used), thus reducing
the number of bits needed for the K memory. A content-addressable memory has
the property that memory cells can be accessed or loaded by their contents [8.45,
8.69, 8.77].
The parameter most crucial to the development of a satisfactory memory, and
hence a system capable of operating in real time, is the memory size. A near real-time
system will result if the memory size necessitates a compromise in the access speed.
Issues of memory size are discussed further in [8.60] where simulation involving
real images is presented.
The final component of the architecture is the interface/processor. The primary purpose of this unit is to execute the algorithm described in this section and flowcharted in Fig. 8.11. Additionally, it must be capable of (1) processing the video
signal input into gray-scale values for storage in the memory, and (2) interpreting
the L memory in terms of the K memory.
Simulation
The algorithm was applied to a 512 × 512 image of text data that was thresholded
prior to segmentation. Two parameters were of interest: (1) the number of elemental
regions (those whose labels are stored in L), since this affects the word width of L
and K and the length of K; and (2) the number of regions perceived by the algorithm
since this determines the amount of further processing which the host computer must
perform before useful information can be gleaned from the image.
The results of one simulation are summarized below.
These results indicate that a 512 × 512 × 10 bit L memory and a 1024 × 10 bit memory would be required for this image.
In this section we have addressed the issue of performing image analysis operations in real time on television-scanned data. We have shown that it is possible to
design hardware which can perform the operation of region growing in this way.
The concept of using equivalence relations to partition an input set is fundamental
to the algorithm. Furthermore, the use of content-addressable read/write memories
facilitates the implementation of such equivalence relation processing in real time.
These concepts were developed by considering potential hardware structures;
however, nothing about the algorithm prevents its implementation on a digital computer. The program described here was written to simulate the effectiveness of this
approach. Since then, we have used it to label regions in an image segmentation. Its
speed of operation, even in simulation, is superior to our earlier region grower, for
identical performance.
8.3.3
$$n(s) = \left[\frac{\partial y(s)}{\partial s}, \; -\frac{\partial x(s)}{\partial s}\right]^T.$$
Suppose the curve is closed; then the concepts of INSIDE and OUTSIDE make sense. Given a point in the plane x = [x_i, y_i] which is not on the curve, let x′ represent the closest point on the curve to x (at this point, the arc length is defined to be s_x). Then we say x is INSIDE the curve if [x − x′] · n(s_x) > 0 and OUTSIDE otherwise.
There is a way [8.54] to perform curve evolution (see section 9.8) such that the
enclosed area remains constant.
You do not necessarily have to find salient points. Chen et al. [8.13] simply apply an orientation-selective filter at all possible orientations. If two segments have sufficiently different orientations, the filter response will have multiple peaks, and
the position of the peaks can be used to identify the segments. This approach seems
to work well for images consisting of straight lines with X or T intersections (see
Chapter 10 for a discussion of the types of intersections).
Rosen and West [8.50] propose a slightly different strategy for finding salient points. They fit the sequence of data points with whatever function seems appropriate (ellipses or straight lines). The data point which fits most poorly becomes the salient point. The curve is then divided into two segments, and the fit is repeated recursively
on each segment.
The concept of active contours was originally developed to address the fact that
any edge detection algorithm will fail on some images, because in certain areas
of the image, the edge simply does not exist. For example, Fig. 8.15 illustrates a
human heart imaged using nuclear medicine. A radioactive drug is introduced into
the circulation, and an image is made which reflects the radiation at each point. The
brightness at a point is a measure of the integral in a direction perpendicular to the
imaging plane of the amount of blood in the area subtended by that pixel. The volume
of blood within the ventricle can thus be calculated by summing the brightness over
the area of the ventricle. Of course, this requires an accurate segmentation of the
ventricle boundary, a problem made difficult by the fact that there is essentially no contrast in the upper left corner of the ventricle. This occurs because radiation from other sources behind the ventricle (superior and inferior vena cava, etc.) contributes to blur out the contrast. Thus, a technique is required which can bridge these rather large gaps, gaps which are really too large to be bridged using the closing techniques
of Chapter 7.
8.5.1
The internal energy is typically of the form

$$\|X_i - X_j\| + \|X_{i-1} - 2X_i + X_{i+1}\|,$$

where X_i = [x_i  y_i]^T is the snake point and X_j is a neighboring snake point. Minimizing the first term produces a curve
where the snake points are close together. The curve which minimizes the second
term will have little bending. A negative aspect of the first term is that it is minimized
by a snake which shrinks to a single point. Because of this, many applications also
introduce an expansion term which causes the entire curve to grow larger.
The external energy measures edginess of the region through which the boundary
passes. Again, there are many functions which may be used for this. Our favorite is
$$E_E = \sum_i \exp\left(-\|\nabla f(X_i)\|\right).$$
For two dimensions, the minimization problem is solvable using dynamic programming [9.19]. However, rather than using local edginess at the boundary, one could
use the difference in average contrast between outside and inside [9.55], which is
only meaningful if the outside is relatively homogeneous.
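A minimal greedy version of this minimization (move each snake point within its 3 × 3 neighborhood so as to reduce the combined energy) can be sketched as follows. The weights alpha, beta, gamma, the use of the gradient magnitude for edginess, and the squared norms in the internal terms are assumptions made for illustration, not the book's specific formulation.

import numpy as np

def snake_step(points, img, alpha=1.0, beta=1.0, gamma=1.0):
    # One greedy pass over the snake points (an (n, 2) integer array); returns updated points.
    gy, gx = np.gradient(img.astype(float))
    edginess = np.hypot(gx, gy)
    pts = points.copy()
    n = len(pts)
    for i in range(n):
        prev_pt, next_pt = pts[(i - 1) % n], pts[(i + 1) % n]
        best, best_e = pts[i], np.inf
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cand = pts[i] + np.array([dy, dx])
                y, x = int(cand[0]), int(cand[1])
                if not (0 <= y < img.shape[0] and 0 <= x < img.shape[1]):
                    continue
                e_cont = np.sum((cand - prev_pt) ** 2)                 # continuity term
                e_bend = np.sum((prev_pt - 2 * cand + next_pt) ** 2)   # bending term
                e_ext = np.exp(-edginess[y, x])                        # external term
                e = alpha * e_cont + beta * e_bend + gamma * e_ext
                if e < best_e:
                    best, best_e = cand, e
        pts[i] = best
    return pts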
Observation: This problem fits perfectly into the MAP philosophy, and simulated
annealing (SA) can be used [9.78]. However, the search neighborhood is problematic.
That is, as we have discussed, SA is guaranteed to find the state which globally minimizes the objective over a set of states. However, the set of states must be sampled in order for
SA to work. In [9.78], an existing contour is used as a starting point, and at each
199
iteration, the only states sampled are contours within one pixel of the current contour,
and from that set a minimum is chosen. The resultant contour is the best contour of
the set sampled, but not necessarily from the entire region of interest.
It is also important that the forms chosen for the energy functions be invariant
to scale, translation, and rotation. One way to accomplish this is to use two snakes,
suitably weighted, one outside the hypothesized boundary and contracting, and one
initiated inside and expanding [9.25].
8.5.2
(8.2)
where s_I(x, y) = 1 − ε(x, y) and s_E(x, y) = 1/(1 + κ(x, y)), in which ε(x, y) is a measure of the edginess in the image at point (x, y) and κ(x, y) measures the curvature of the contour at (x, y).
Manhaeghe et al. obtain a snake-like result using Kohonen maps [9.47]; see
also [9.88]. One advantage of this approach is that the computations are local. One
can simply look at a point on the present boundary, and consider where that point
could be moved. Choose one candidate location and determine if moving to that
point increases or decreases the energy (if you are using the energy minimization
method).
Considering only the movement of boundary points, however, introduces some
problems. The first is the difficulty in accurately computing the curvature from the
boundary points. As you know, any derivative-based operator is super-sensitive to
noise. Since the curvature involves a second derivative, it is even worse. Another
problem is that there is no really effective way to allow for the possibility that the
boundary might divide into separate components. The following level-set approach
resolves those difficulties.
Remember the distance transform? From Chapter 7, the distance transform resulted in a function DT (x, y) which was equal to zero on boundary pixels and got
larger as one went away from the boundary. Now consider a new version of the
distance transform, which is exactly the same OUTSIDE the contour of interest.
(Remember, the contour is closed, and so the concepts of INSIDE and OUTSIDE
make sense.) Inside the contour, this new function (which we will refer to as the metric function, φ(x, y)) is the negative of the distance transform:

$$\phi(x, y) = \begin{cases} DT(x, y) & (x, y) \text{ outside the contour} \\ -DT(x, y) & (x, y) \text{ inside the contour.} \end{cases} \qquad (8.3)$$
It is important to note that for points on the contour, the metric function is equal to zero. The set of points where φ(x, y) = C is called the C-level set of φ, and we are particularly interested in the zero-level set.
Now we will modify the metric function φ. For every point (x, y) we compute a new value of φ(x, y). There are several ways we could modify those points, and we will mention some of them below, but remember, the contour of interest is still the set of points where the (modified) metric function takes on a value of zero. We initialized
it to be the distance transform, but from here on, you should no longer think of it as
a distance transform (although it will retain some of those characteristics). Instead,
just think of it as another brightness, a function of x and y.
What is the gradient of the metric function? You knew how to compute the gradient
when it was brightness. It is no different. Thinking about a level set as an isophote,
you knew that the gradient is normal to the isophote, so, given the gradient vector,
how did you get the normal? Do you remember how to calculate the gradient?
$$G(x, y) = \left(\frac{\partial \phi}{\partial x}(x, y), \; \frac{\partial \phi}{\partial y}(x, y)\right), \qquad (8.4)$$

then the normal is just the normalized (naturally) version of the gradient:

$$n(x, y) = \frac{G(x, y)}{|G(x, y)|}. \qquad (8.5)$$
So we can relate the normal to the contour to the gradient of the metric function.
We can also relate [9.46] the movement of the contour in the normal direction to
a function (which is called the speed function), describing how rapidly the metric
changes, producing a differential form for the change in φ:
(8.6)
(8.7)
(8.8)
where s involves something about the brightness variations in the image and also
involves the curvature (in 2D) of the isophote at x, y. Of course, if you insist on
using the 2D curvature of the zero level set, you need to relate that to the function φ,
which fortunately is not too hard. Since the normal vector has already been worked
out, and the curvature can be related to changes in normal direction, it is possible to
show that:
$$\kappa = \frac{\phi_{xx}\phi_y^2 - 2\phi_{xy}\phi_x\phi_y + \phi_{yy}\phi_x^2}{\left(\phi_x^2 + \phi_y^2\right)^{3/2}}, \qquad (8.9)$$
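Eq. (8.9) is straightforward to evaluate numerically with finite differences. The sketch below assumes the metric function is stored as a NumPy array phi; the small eps added to the denominator is simply to avoid division by zero where the gradient vanishes.

import numpy as np

def level_set_curvature(phi, eps=1e-12):
    # Curvature of the level sets of phi, Eq. (8.9), via central differences.
    phi_y, phi_x = np.gradient(phi)            # first derivatives (rows, columns)
    phi_yy, phi_yx = np.gradient(phi_y)
    phi_xy, phi_xx = np.gradient(phi_x)
    num = phi_xx * phi_y**2 - 2.0 * phi_xy * phi_x * phi_y + phi_yy * phi_x**2
    den = (phi_x**2 + phi_y**2) ** 1.5 + eps
    return num / den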
algorithm which removes noise while seeking the best piecewise-linear fit to the data points. Such a fit is equivalent to fitting a surface with a set of planes. Points where planes meet produce either roof edges or step edges, depending on viewpoint.
If an annealing algorithm such as MFA is used, good segmentations of more general surfaces can be produced simply by not running the algorithm all the way to a truly planar solution [8.6]. A second philosophical approach to range image segmentation is to assume some equation for the surface, e.g., a quadric (a general second-order surface, defined in section 8.6.1). Then, all points which satisfy that equation and which are adjacent belong to the same surface. This philosophy mixes the problems of segmentation and fitting, for we do not know which points to use to estimate the parameters of the surface until we have some sort of segmentation [8.7, 8.59]. In the
next section, we look at these two philosophies in a bit more detail.
8.6.1
Describing surfaces
(Implicit and explicit forms of equations were defined in Chapter 4.)
In general, we must fit a surface to the data. Taubin et al. [8.68] have looked carefully at the question of fitting surfaces to data and observe first that polynomial surfaces are very attractive, but such a polynomial should be of even order. The implicit form is clearly much more attractive, but is much more difficult to fit. Consider for
example the second-order forms mentioned in Chapter 4. An explicit representation
might be
z = ax² + by² + cxy + dx + ey + f,   (8.10)

while the corresponding implicit form is

ax² + by² + cz² + dxy + exz + fyz + gx + hy + iz + j = 0.   (8.11)
The expression of Eq. (8.11) is called a quadric, and it is a general form which describes all second-order surfaces (cones, spheres, planes, ellipsoids, etc.). In Chapter
5 you learned how to fit an explicit function to data, by minimizing the squared error.
Unfortunately, the explicit form does not allow the possibility of higher order terms
in z. You could solve Eq. (8.11) for z, using the quadratic form, and then have an
explicit form, but now you have a square root on the right-hand side, and lose the
ability to use linear methods for solving for the vector of coefficients.
We can use the implicit form by first defining f(x, y, z) ≡ ax² + by² + cz² + dxy + exz + fyz + gx + hy + iz + j and making the following observation: If the point [x_i, y_i, z_i]^T is on the surface described by the parameter vector [a, b, c, d, e, f, g, h, i, j]^T, then f(x_i, y_i, z_i) should be exactly zero. We define a level set of a function as the collection of points [x_i, y_i, z_i]^T such that f(x_i, y_i, z_i) = L for some scalar constant L. Thus, we can find the coefficients by minimizing E = Σ_i (f(x_i, y_i, z_i))² (also known as the algebraic distance from the point (x_i, y_i, z_i) to the surface). In some
cases, this gives good results, but it is not really what we want; we really should be minimizing Σ_i d([x_i, y_i, z_i], f(x, y, z)), where d is some distance
metric, for example the Euclidean distance from the point to the surface (this is
known as the geometric distance [8.66] to the surface). Again, this turns out to be
algebraically intractable. (To implement this see [8.67] and [17.37] for important
details.) Although methods based on the algebraic distance work relatively well most
of the time, they can certainly fail. Whatever distance measure we use, it should have
the following properties [17.37]: (1) The measure should be zero whenever the true
(Euclidean, geometric) distance is zero (the algebraic distance does this); (2) at
the sample points, the derivatives with respect to the parameters are the same for
the true distance and the measure.
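Minimizing the algebraic distance for the quadric above is a linear least-squares problem once a normalization is chosen. The sketch below uses the common (but not unique) choice of constraining the coefficient vector to unit norm and solving with an SVD; function and variable names are illustrative.

import numpy as np

def fit_quadric(pts):
    # Fit ax^2+by^2+cz^2+dxy+exz+fyz+gx+hy+iz+j = 0 to an N x 3 point array
    # by minimizing the algebraic distance subject to ||coefficients|| = 1.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    D = np.column_stack([x*x, y*y, z*z, x*y, x*z, y*z, x, y, z, np.ones_like(x)])
    # The minimizer of ||D p|| with ||p|| = 1 is the right singular vector
    # associated with the smallest singular value of D.
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[-1]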
Of course, whatever representation you choose to use (polynomials are popular),
there is always a desire for a representation that is invariant to affine transforms
[8.27].
8.6.2
ax² + bxy + cy² + dx + ey + f = 0.   (8.12)

This implicit form describes not only ellipses, but lines, hyperbolae, parabolae, and circles. To guarantee that the resulting curve is an ellipse we must also ensure that it satisfies

b² − 4ac < 0.   (8.13)
Without this constraint, we get a solution which tends to fit areas of low curvature with hyperbolic arcs rather than with ellipses. Similar difficulties occur when attempting to fit ellipsoids to range data. See Wang et al. [8.74], Rosen and West [8.50], and Fitzgibbon et al.
[8.19] for more details.
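The same algebraic-distance idea applies to Eq. (8.12). The sketch below fits a general conic and simply reports whether the ellipse condition (8.13) happens to hold, rather than enforcing it during the fit as the direct method of Fitzgibbon et al. [8.19] does; names are illustrative.

import numpy as np

def fit_conic(x, y):
    # Fit ax^2+bxy+cy^2+dx+ey+f = 0 by algebraic distance (unit-norm constraint).
    D = np.column_stack([x*x, x*y, y*y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    a, b, c, d, e, f = vt[-1]
    is_ellipse = (b*b - 4*a*c) < 0        # Eq. (8.13)
    return (a, b, c, d, e, f), is_ellipse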
In performing such fits, it is important to know when a point is simply an outlier,
that is, it has been corrupted by substantial noise, but actually belongs to the surface
under consideration, or when it really belongs to another, possibly occluding, surface.
Darrell and Pentland [13.9] examined this question in some detail and demonstrated
204
Segmentation
that M-estimates lead to excellent segmentations. Cabrera and Meer [8.11] remove the bias from fits of an ellipse using an iterative algorithm called bootstrapping.
How you fit a function to data also depends on the nature of the noise or corruption to the data. If the noise is additive, zero-mean Gaussian (which is what we almost always assume) then the minimum vertical distance (MMSE) or the minimal normal distance (which we called eigenvector line fitting) methods work well. If the noise is not Gaussian, then other methods are more appropriate. For example, nuclear medicine images are corrupted primarily by counting (Poisson) noise. Such a noise differs from Gaussian in two important ways: it is never negative, and it is signal
dependent. Well away from zero, Poisson noise may be reasonably modeled by
additive Gaussian with a variance equal to the signal. Other sensors produce other
types of noise. Stewart [8.64] considers the case of inliers and outliers, but assumes
that the bad data are randomly distributed over the dynamic range of the sensor. That
is, the noise is not additive.
Given one segmentation, should you merge two adjacent regions? If they are
adjacent and satisfy the same equation to within some noise measure, they should
be merged [8.8, 8.29, 8.34, 8.56]. Other relevant papers on fitting surfaces include
[8.4, 8.75, 8.78].
One is also faced with the issue of what surface measurements to use as a basis
for segmentation. Curvature would appear to be particularly attractive since the curvature measurement is invariant to viewpoint. However, curvature estimates are very sensitive to quantization noise [8.71].
correct answer is. With range images, however, it is somewhat easier to determine
truth since surfaces can physically be measured.
Hoover et al. [8.22] propose the following formalism for comparing the quality of a
machine-segmented (MS) image using a human-segmented ground-truth (GT) image
as the gold standard. Let M and G denote the MS and GT images respectively; let
Mi (i = 1, . . . , m) denote a region in M; and let G j ( j = 1, . . . , m) denote a region
in G. |R| will denote the number of pixels in region R. Let Oi j be the number of
pixels which belong to both region i in the MS image and region j in the GT image.
Finally, let T be a threshold, 0.5 < T ≤ 1.0.
There are five different segmentation results, defined as follows.
(1) A correct classification occurs when O_ij ≥ T|M_i| and O_ij ≥ T|G_j|.
(2) An oversegmentation occurs when a region in the GT image is broken into several
regions in the MS image. Formally, given one region in GT, i, and several regions
in MS, j1 , j2 , . . . , jn , if
(a) at least 100T percent of the pixels in each region of MS actually belong to region i (O_{i,j_l} ≥ T|M_{j_l}| for all l), and
(b) at least 100T percent of the pixels which actually belong to region i are marked as belonging to the union of regions j_1, j_2, . . ., j_n (Σ_{l=1}^{n} O_{i,j_l} ≥ T|G_i|).
(3) Undersegmentation occurs when pixels in distinct regions in the GT image are identified as belonging to the same region in the MS image. The definition is isomorphic to the definition of oversegmentation with the two images reversed.
(4) A missed classification occurs when a region in the GT image is neither correctly segmented, nor is part of an oversegmentation or an undersegmentation.
(5) A noise classification is identical to a missed classification except that the region belongs to the MS image.
Two correct segmentations can be further compared in the case of range imagery by computing the normal vectors to the GT region and the corresponding MS region, and finding the absolute value of the angle between these two vectors.
With these definitions, we can evaluate the quality of a segmentation by counting
correct or erroneous segmentations, and measuring total angle error. By plotting these
measures vs. T and comparing these plots, a measure of segmenter performance may
be determined.
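As an illustration, the sketch below computes the overlap counts O_ij from two label images (integer NumPy arrays, one label per region, 0 assumed to be background) and applies criterion (1) above to count correct classifications. The function name, the background convention, and the default T are illustrative, not part of the formalism in [8.22].

import numpy as np

def correct_classifications(ms, gt, T=0.8):
    # Count (i, j) region pairs satisfying O_ij >= T*|M_i| and O_ij >= T*|G_j|.
    correct = []
    ms_labels = [m for m in np.unique(ms) if m != 0]
    gt_labels = [g for g in np.unique(gt) if g != 0]
    for i in ms_labels:
        mi = (ms == i)
        for j in gt_labels:
            gj = (gt == j)
            o_ij = np.logical_and(mi, gj).sum()      # pixels in both regions
            if o_ij >= T * mi.sum() and o_ij >= T * gj.sum():
                correct.append((i, j))
    return correct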
Hoover et al. [8.22] use this approach to do a thorough evaluation of the quality
of four different range image segmentation algorithms.
8.8 Conclusion
In this chapter, we used the concepts of consistency to identify the components of a
region. In the first example we studied, if all pixels were the same brightness, they
Gradient descent.
were defined to be in the same region. In the example of section 8.6.1, all the points satisfying the same surface equation were defined to fall in the same region.
In section 8.2.1 we used an optimization method, minimum squared error, to find the best threshold. In section 8.5, we obtained a closed boundary using the philosophy of active contours, by specifying a problem-specific objective function and finding the boundary which minimizes that function. Any appropriate minimization technique could be used. In section 8A.5, we will once again see function minimization (based on gradient descent with annealing) used in a maximum a posteriori method which will find the picture which minimizes a particular objective function,
in this case resulting in a segmentation.
8.9 Vocabulary
You should know the meanings of the following terms.
Active contour
Algebraic distance
Connected component
Explicit form
Geometric distance
Histogram
Homogeneous
Implicit form
Label image
Normal direction
Oversegmentation
Quadric
Region growing
Salient point
Segmentation
Snake
Speed of a curve
Thresholding
Undersegmentation
Assignment
8.1
Section 2.3.5 of the text by Haralick and Shapiro
[4.18] describes a labeling algorithm similar in
some ways to that presented in this section. Write a report which compares and contrasts these two techniques. Consider: (1) ease of implementation on a uniprocessor (simplicity of code); (2) speed on a uniprocessor; and (3) potential for parallel implementation.

Table 8.2
Address    0  1  2  3  4  5  6  7  8  9
Contents   0  0  2  2  4  4  2  4  2  2
(blank row for the answer)
Assignment
8.2
In the process of a connected component labeling scheme, we find at one point that the lookup table appears as shown in Table 8.2. Now, we discover the following equivalence: 9 = 7. On the blank row in Table 8.2, show the contents of the lookup table after resolving the equivalence.

Topic 8A
Segmentation
In this chapter, we have up to this point considered partitioning of images into areas which differ in some way in their brightness or range. The algorithms we have presented are applicable, however, to any feature which characterizes a pixel or area around a pixel. For this
reason, we mention here some other measures which may be used, including texture, color,
and motion. In this section, we also mention other approaches to segmentation, including
segmentation based on edges.
8A.1
Texture segmentation
In section 4A.2.2, texture is discussed. If one is able to quantify the concept of texture, that is, to assign two different numbers to two different textures in order to distinguish them, then texture can be used as a feature in a segmentation algorithm. Instead of defining ADJACENT as similar brightnesses, one simply defines it as having similar textures. One could also
combine color and texture [8.46, 8.65].
Several researchers have observed that there are really two fundamentally different types
of textures, those that are in some sense deterministic, and those that are in some sense
random, and which can in principle be modeled by Markov random fields [8.2]. Liu and
Picard [8.38] and others [8.20, 8.21, 8.58] observe that the peaks in the Fourier transform
give hints to how separate these texture characteristics are.
Table 8.3. Covering a unit straight line with boxes of side ε.
n    ε       M
0    1       1
1    1/2     2
2    1/4     4
n    1/2^n   2^n

Table 8.4. Covering a unit square with boxes of side ε.
n    ε       M
0    1       1
1    1/2     4
2    1/4     16
n    1/2^n   2^(2n)

8A.1.1
Fractal dimension
The fractal dimension measures the self-similarity of a shape to itself when measured on
a different scale. In that way it provides a measure of the spatial distribution of the points
within the foreground (which presumably are the object of interest). The utility of the fractal
dimension may be seen by considering a set S of points in a 2-space and defining the fractal dimension of S as

$$\dim(S) = \lim_{\epsilon\to 0}\frac{\log M}{\log(1/\epsilon)}, \qquad (8.15)$$

where M is the minimum number of boxes of side ε required to cover S. Let's look at the fractal
dimension of some example objects. We begin with a single point. Obviously, a point may
be covered by precisely one square box, independent of the size of the box. Therefore, in Eq. (8.15), M is always equal to one, and the limit of the denominator is infinity. Therefore we find the fractal dimension of a point is simply zero. Now let us consider a straight line of unit length. Clearly, such a line may be covered by a 1 × 1 box. However, it may also be covered by two boxes, each 1/2 × 1/2, or four boxes, each 1/4 × 1/4. Each time the size of the box
is halved, the number of boxes required doubles. We could tabulate this process in terms of
a parameter n, equal to the number of halvings that have been done (Table 8.3).
From Table 8.3, we may evaluate dim(S) from

$$\dim(S) = \lim_{n\to\infty}\frac{\log 2^n}{\log\left(1/(1/2)^n\right)} = 1. \qquad (8.16)$$
Finally, let S be a square area, and without loss of generality, let its sides be of length one.
We could cover this with a single 1 × 1 box, or by four 1/2 × 1/2 boxes, or by 16 1/4 × 1/4
boxes, etc. An argument similar to the one for the straight line results in Table 8.4.
From Table 8.4, the fractal dimension is

$$\dim(S) = \lim_{n\to\infty}\frac{\log 2^{2n}}{\log\left(1/(1/2)^n\right)} = \lim_{n\to\infty}\frac{\log 2^{2n}}{\log 2^n} = \lim_{n\to\infty}\frac{2n}{n} = 2. \qquad (8.17)$$
Fig. 8.17. The figure on the left has a fractal dimension of 2.0, the figure on the right has a fractal dimension of 1.58. In the right-hand figure, the covering 2 × 2 squares are shown with dark lines.
So, we end up with an intuitively appealing result: A point is a zero-dimensional thing, a line
is one dimensional, and a square is two dimensional. At least, the result agrees with intuition
for these very simple shapes.
Given an image, we must give some thought to how to apply Eq. (8.15) to the discrete pixel domain, since we obviously cannot take a limiting box size less than ε = 1. Solving Eq. (8.15) for M we find, for small values of ε,

$$\log M = \log\left(\frac{1}{\epsilon}\right)^{\dim(S)} \qquad (8.18)$$

which simplifies to

$$\log M = -(\log \epsilon)\dim(S). \qquad (8.19)$$
Thus, for small ε, the log of M is linearly related to the log of ε, with the slope magnitude being the dimension of S. We may therefore estimate the dimension of S by considering the two smallest values we have for ε, ε₁ = 1 and ε₂ = 2, and computing that slope, using

$$\dim(S) \approx \frac{\log M_1 - \log M_2}{\log \epsilon_2 - \log \epsilon_1} = \frac{\log M_1 - \log M_2}{\log 2}, \qquad (8.20)$$
producing a simple algorithm: Find M1, the number of 1 × 1 boxes required to cover the object; this is simply the area of the object measured in pixels. Find M2, the number of 2 × 2
boxes required to cover the same object, and take the difference of the logs.
Consider the following example. The image on the left of Fig. 8.17 has a foreground region which has an area of 36, and which can be covered by 9 2 × 2 squares. Thus, its fractal dimension is (log 36 − log 9)/log 2 = 2. The figure on the right has the same area, but requires 12 2 × 2 squares to cover it, and therefore has a fractal dimension of (log 36 − log 12)/log 2 = 1.58.
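The two-box-size estimate of Eq. (8.20) takes only a few lines; the sketch below assumes a binary NumPy image whose foreground pixels equal 1 and whose foreground is nonempty.

import numpy as np

def fractal_dimension(img):
    # Estimate dim(S) from 1x1 and 2x2 box counts, Eq. (8.20).
    m1 = img.sum()                                  # 1x1 boxes = foreground area
    rows, cols = img.shape
    m2 = 0
    for r in range(0, rows, 2):                     # count 2x2 boxes that touch S
        for c in range(0, cols, 2):
            if img[r:r+2, c:c+2].any():
                m2 += 1
    return (np.log(m1) - np.log(m2)) / np.log(2)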
The concept of fractal dimension can be extended to gray-scale images, and gray-scale
image features extracted using this measure [8.12].
8A.2
8A.3
Motion segmentation
If one could identify all the connected pixels which have the same motion characteristics, one
could apply the connected component methods we have already discussed. Although some papers emphasize motion segmentation (Patras et al. [8.47] use watershed methods and a multistage segmentation method), it is critical to any motion segmentation algorithm that it be based on an effective method for detecting and representing motion.
The problem of characterizing image motion has been the object of a great deal of research,
and is discussed in much more detail in section 9A.3.
8A.4
Color segmentation
Just as texture variation can produce segmentations, so can color variations. Variations on
color segmentation include posing an optimization problem and minimizing an objective
function. The image which produces the minimum of the objective function is then the
segmentation. Liu and Yang [8.36] use simulated annealing to find a good color-based segmentation. Clustering, which is described briefly in Chapter 15, also gives an approach
[8.72].
8A.5
In order to make use of the MAP methods, all one need do is modify the prior term to include the expected brightness. For example, suppose we have prior knowledge that an image should be smooth except for step discontinuities, and in addition, each pixel is allowed to have only one of three brightnesses (e.g., CSF, white matter, and gray matter). The following prior term is maximal when the brightness at pixel i is identical to its neighbors or when it has the value k1, the value k2, or the value k3:

$$H(f) = \sum_i \sum_{j \in \eta_i} \exp\left(-(f_i - f_j)^2 (f_i - k_1)^2 (f_i - k_2)^2 (f_i - k_3)^2\right), \qquad (8.21)$$

where η_i denotes the neighbors of pixel i.
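Evaluating the prior of Eq. (8.21) over an image is simple. The sketch below uses 4-neighbors and treats the three allowed brightnesses as parameters; the wrap-around at the image borders introduced by np.roll is a simplification for the sake of brevity.

import numpy as np

def prior_H(f, k=(0.0, 128.0, 255.0)):
    # Evaluate the prior of Eq. (8.21) with 4-neighbors.
    f = f.astype(float)
    k1, k2, k3 = k
    total = 0.0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        fj = np.roll(np.roll(f, dy, axis=0), dx, axis=1)   # neighbor brightness
        total += np.exp(-(f - fj)**2 * (f - k1)**2
                        * (f - k2)**2 * (f - k3)**2).sum()
    return total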
For further reading, Zhu and Yuille [8.80] show that several similar techniques may be
combined. Further, they illustrate relationships in a variety of image representations.
8A.6
Human segmentation
One of the aspects of segmentation which, regrettably, we do not have time to address in this
book is the issue of how humans do segmentation. That is, what is the correct segmentation?
For example, Koenderink and van Doorn [8.28] observed that humans tend to perceive the
projection of three-dimensional objects as collections of ellipses. See [8.57] for an overview.
Bibliography
[8.1] R. Adams and L. Bischof, Seeded Region Growing, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(6), 1994.
[8.2] P. Andrey and P. Tarroux, Unsupervised Segmentation of Markov Random Field
Modeled Textured Images using Selectionist Relaxation, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(3), 1998.
[8.3] K.E. Batcher, STARAN Parallel Processor System Hardware, Proceedings of AFIPS
National Computer Conference, vol. 43, pp. 405–410, 1974.
[8.4] J. Berkmann and T. Caelli, Computation of Surface Geometry and Segmentation
using Covariance Techniques, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(11), 1994.
[8.5] S. Beucher, Watersheds of Functions and Picture Segmentation, IEEE International
Conference on Acoustics, Speech and Signal Processing, Paris, May, 1982.
[8.6] G. Bilbro and W. Snyder, Range Image Restoration using Mean Field Annealing,
In Advances in Neural Network Information Processing Systems, San Mateo, CA,
Morgan-Kaufmann, 1989.
[8.7] G. Bilbro and W. Snyder, Optimization of Functions with Many Minima, IEEE
Transactions on Systems, Man, and Cybernetics, 21(4), July/August, 1991.
[8.8] G. Blais and M. Levine, Registering Multiview Range Data to Create 3D Computer
Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8),
1995.
[8.9] K. Boyer, M. Mirza, and G. Ganguly, The Robust Sequential Estimator: A General Approach and its Application to Surface Organization in Range Data, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.
[8.10] C.R. Brice and C.L. Fennema, Scene Analysis using Regions, Artificial Intelligence, 1, pp. 205–226, Fall, 1970.
[8.11] J. Cabrera and P. Meer, Unbiased Estimation of Ellipses by Bootstrapping, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(7), 1996.
[8.12] B. Chaudhuri and N. Sarkar, Texture Segmentation using Fractal Dimension, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[8.13] J. Chen, Y. Sato, and S. Tamura, Orientation Space Filtering for Multiple Orientation
Line Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5), 2000.
[8.14] C. Chow and T. Kaneko, Automatic Detection of the Left Ventricle from Cineangiograms, Computers and Biomedical Research, 5, pp. 388–410, 1972.
[8.15] T. Davis, Fast Decomposition of Digital Curves into Polygons using the Haar
Transform, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8),
August, 1999.
[8.16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York, Wiley,
1973.
[8.17] M. Fischler and R. Bolles, Perceptual Organization and Curve Partitioning, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(1), 1986.
[8.18] M. Fischler and H. Wolf, Locating Perceptually Salient Points on Planar Curves,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), 1994.
[8.19] A. Fitzgibbon, M. Pilu, and R. Fisher, Direct Least Square Fitting of Ellipses,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), May,
1999.
[8.20] J. Francos, Orthogonal Decomposition of 2-D Random Fields and Their Applications
in 2-D Spectral Estimation, Signal Processing and its Applications, ed. N.K. Bose
and C.R. Rao, Amsterdam, North Holland, 1993.
[8.21] J. Francos, Z. Meiri, and B. Porat, A Unified Texture Model Based on a 2-D Wold Like Decomposition, IEEE Transactions on Signal Processing, 41, pp. 2665–2678,
August, 1993.
[8.22] A. Hoover, G. Jean-Baptiste, X. Jiang, P. Flynn, H. Bunke, D. Goldgof, K. Bowyer,
D. Eggbert, A. Fitzgibbon, and R. Fisher, An Experimental Comparison of Range
Image Segmentation Algorithms, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(7), 1996.
[8.23] L. Itti, C. Koch, and E. Niebur, A Model of Saliency-based Visual Attention for Rapid
Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(11), 1998.
[8.24] D. Jacobs, Robust and Efficient Detection of Salient Convex Groups, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1996.
[8.25] K. Kanatani, Statistical Bias of Conic Fitting and Renormalization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.26] N. Katzir, M. Lindenbaum, and M. Porat, Curve Segmentation under Partial Occlusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[8.27] D. Keren, Using Symbolic Computation to Find Algebraic Invariants, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11), 1994.
[8.28] J. Koenderink and A. van Doorn, The Shape of Smooth Objects and the Way Contours End, Perception, 11, pp. 129–137, 1982.
[8.29] K. Koster and M. Spann, MIR: An Approach to Robust Clustering Application to
Range Image Segmentation, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(5), 2000.
[8.30] L. Krakauer, Computer Analysis of Visual Properties of Curved Objects, Project
MAC TR-82, 1971.
[8.31] S. Kulkarni, S. Mitter, T. Richardson, and J. Tsitsiklis, Local vs. Global Computation
of Length of Digitized Curves, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(7), 1994.
[8.32] S. Kumar, S. Han, D. Goldgof, and K. Bowyer, On Recovering Hyperquadrics
from Range Data, IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(11), 1995.
[8.33] K. Lai and R. Chin, Deformable Contours: Modeling and Extraction, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11), pp. 1084–1090, 1995.
[8.34] S. LaValle and S. Hutchinson, A Bayesian Segmentation Methodology for Parametric
Image Models, IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(2), 1995.
[8.35] K. Lee, P. Meer, and R. Park, Robust Adaptive Segmentation of Range Images,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998.
[8.36] J. Liu and Y. Yang, Multiresolution Color Image Segmentation, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(7), 1994.
[8.37] X. Liu and R. Ehrich, Subpixel Edge Location in Binary Images using Dithering,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[8.38] F. Liu and R. Picard, Periodicity, Directionality, and Randomness: Wold Features for
Image Modeling and Retrieval, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(7), 1996.
[8.39] A. Logenthiran, W. Snyder, and P. Santago, MAP Segmentation of Magnetic Resonance Images using Mean Field Annealing, SPIE Symposium on Electronic Imaging,
Science and Technology, February, 1991.
[8.40] A. Matheny and D. Goldgof, The Use of Three- and Four-dimensional Surface Harmonics for Rigid and Nonrigid Shape Recovery and Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[8.41] D.L. Milgram, A. Rosenfeld, T. Willett, and G. Tisdale, Algorithms and Hardware
Technology for Image Recognition, Final Report to U.S. Army Night Vision Lab,
March 31, 1978.
[8.42] F. Mokhtarian and A. Mackworth, Scale-based Description and Recognition of Planar
Curves and Two-dimensional Shapes, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 8(1), 1986.
[8.43] K. Mori, M. Kidode, H. Shinoda, and H. Asada, Design of Local Parallel Pattern
Processor for Image Processing, Proceedings AFIPS National Computer Conference,
vol 47, pp. 10251031 June, 1978.
[8.44] W. Ng and C. Lee, Comment on Using the Uniformity Measure for Performance Measure in Image Segmentation, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(9), 1996.
214
Segmentation
[8.45] B. Parhami, Associative Memories and Processors: An Overview and Selected Bibliography, Proceedings of the IEEE, 61, pp. 772730, June, 1973.
[8.46] D. Panjwani and G. Healey, Markov Random Field Models for Unsupervised Segmentation of Textured Color Images, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[8.47] I. Patras, E. Hendriks, and R. Lagendijk, Video Segmentation by MAP Labeling of
Watershed Segments, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 2001.
[8.48] A. Pikaz and I. Dinstein, Using Simple Decomposition for Smoothing and Feature
Point Detection of Noisy Digital Curves, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(8), 1994.
[8.49] E. Rivlin and I. Weiss, Local Invariants for Recognition, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.50] P. Rosen and G. West, Nonparametric Segmentation of Curves into Various Representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[8.51] A. Rosenfeld and A. Kak, Digital Picture Processing, 2nd edition, New York,
Academic Press, 1997.
[8.52] G. Roth and M. Levine, Geometric Primitive Extraction using a Genetic Algorithm,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[8.53] C. Samson, L. Blanc-F`eraud, G. Aubert, and J. Zerubia, A Variational Model for
Image Classication and Restoration, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[8.54] G. Sapiro and A. Tannenbaum, Area and Length Preserving Geometric Invariant
Scale Spaces, IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(1), 1995.
[8.55] H. Sheu and W. Hu, Multiprimitive Segmentation of Planar Curves a Two-level
Breakpoint Classication and Tuning Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 1999.
[8.56] H. Shum, K. Ikeuchi, and R. Reddy, Principal Component Analysis with Missing
Data and Its Application to Polyhedra1 Object Modeling, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[8.57] K. Siddiqi and B. Kimia, Parts of Visual Form: Computational Aspects, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(3), 1995.
[8.58] R. Sriram, J. Francos, and W. Pearlman, Texture Coding Using a Wold Decomposition
Model, Proc. International Conference on Pattern Recognition, Jerusalem, October,
1994.
[8.59] W. Snyder and G. Bilbro, Segmentation of Range Images, International Conference
on Robotics and Automation, St. Louis, March, 1985.
[8.60] W. Snyder and A. Cowart, An Iterative Approach to Region Growing, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 1983.
[8.61] W.E. Snyder and C.D. Savage, Content-Addressable Read-Write Memories for Image
Analysis, IEEE Transactions on Computers, 31(10), pp. 963967, 1982.
[8.62] W. Snyder, G. Bilbro, A. Logenthiran, and S. Rajala, Optimal Thresholding, A New
Approach, Pattern Recognition Letters, 11(11), 1990.
215
Bibliography
[8.63] W. Snyder, P. Santago, A. Logenthiran, K. Link, G. Bilbro, and S. Rajala, Segmentation of Magnetic Resonance Images using Mean Field Annealing, XII International
Conference on Information Processing in Medical Imaging, Kent, England, July 711,
1991.
[8.64] C. Stewart, MINIPRAN: A New Robust Estimator for Computer Vision, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[8.65] P. Suen and G. Healey, The Analysis and Recognition of Real-world Textures in Three
Dimensions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5),
2000.
[8.66] S. Sullivan, L. Sandford, and J. Ponce, Using Geometric Distance Fits for 3-D Object
Modeling and Recognition, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(12), 1994.
[8.67] G. Taubin, Nonplanar Curve and Surface Estimation in 3-space, IEEE Robotics and
Automation Conference, Philadelphia, May, 1988.
[8.68] G. Taubin, F. Cukierman, S. Sullivan, J. Ponce, and D. Kriegman, Parameterized
Families of Polynomials for Bounded Algebraic Curve and Surface Fitting, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.69] K.J. Thurber, Large-Scale Computer Architecture: Parallel and Associative Processors, Rochelle Park, NJ, Hayden, 1976.
[8.70] O. Trier and A. Jain, Goal-directed Evaluation of Binarization Methods, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12), 1995.
[8.71] E. Trucco and R. Fisher, Experiments in Curvature-based Segmentation of Range
Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), 1995.
[8.72] T. Uchiyama and M. Arbib, Color Image Segmentation Using Competitive Learning,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(12), 1994.
[8.73] K. Vincken, A. Koster, and M. Viergever, Probabilistic Multiscale Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 1997.
[8.74] R. Wang, A. Hanson, and E. Riseman, Fast Extraction of Ellipses, Ninth International Conference on Pattern Recognition, Rome, 1988.
[8.75] M. Wani and B. Batchelor, Edge-region-based Segmentation of Range Images, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(3), 1994.
[8.76] M. Worring and A. Smeulders, Digitized Circular Arcs: Characterization and Parameter Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(6), 1995.
[8.77] S. Yau and H. Jung, Associative Processor Architecture A Survey, Computer
Surveys, 9, pp. 326, March, 1977.
[8.78] X. Yu, T. Bui, and A. Kryzak, Robust Estimation for Range Image Segmentation and
Reconstruction, IEEE Transactions on Pattern Analysis and Machine Intelligence,
16(5), 1994.
[8.79] P. Zhu and P. Chirlian, On Critical Point Detection of Digital Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[8.80] S. Zhu and A. Yuille, Region Competition: Unifying Snakes, Region Growing, and
Bayes/MDL for Multiband Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.
Shape
Space tells matter how to move, and matter tells space to get bent
Douglas Adams
where R_z, as defined above, represents a rotation about the z axis. Given a region s, we can easily construct a matrix, which we denote by S, in which each column contains the x, y coordinates of a pixel in the region. For example, suppose the region is s = {(1, 2), (3, 4), (1, 3), (2, 3)}; then the corresponding coordinate matrix S is

    S = [ 1  3  1  2
          2  4  3  3 ].

We can apply an orthogonal transformation such as a rotation to the entire region by matrix multiplication:

    S' = R_z S = [ cos θ  −sin θ ]  [ 1  3  1  2 ]
                 [ sin θ   cos θ ]  [ 2  4  3  3 ].
That works wonderfully well for rotations, but how can we include translation in this formalism? To accomplish that, we augment the rotation matrices by adding a row and a column, all zeros except for a 1 in the lower right corner. With this new definition,

    R_z = [ cos θ  −sin θ  0
            sin θ   cos θ  0
            0        0     1 ].

We also augment the definition of a point by adding a 1 in the third position, so the coordinate pair (x, y) in this new notation becomes [x, y, 1]^T.
Now, we can combine translation and rotation into a single matrix representation (called a homogeneous transformation matrix). We accomplish this by changing the third column to include the translation. For example, to rotate a point about the origin by an amount θ, and then translate it by an amount dx in the x direction and dy in the y direction, we perform the matrix multiplication

    [ x' ]   [ cos θ  −sin θ  dx ] [ x ]
    [ y' ] = [ sin θ   cos θ  dy ] [ y ].        (9.1)
    [ 1  ]   [ 0        0     1  ] [ 1 ]
Thus, it is possible to represent rotation in the viewing plane (about the z axis) and
translation in that plane by a single matrix multiplication. All the transformations
mentioned above are elements of a class of transformations called similarity transformations. Similarity transformations are characterized by the fact that they may
move an object around, but they do not change its shape.
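As a concrete illustration (a minimal sketch, not taken from the text; the region, angle, and translation below are made-up values), the homogeneous form of Eq. (9.1) can be applied to every pixel of a region at once by stacking the coordinates as columns:

import numpy as np

def homogeneous_transform(theta, dx, dy):
    # Build the 3 x 3 homogeneous matrix of Eq. (9.1): rotation by theta, then translation (dx, dy).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0., 0., 1.]])

# Coordinate matrix S of the example region, one (x, y) column per pixel, augmented with a row of ones.
S = np.array([[1, 3, 1, 2],
              [2, 4, 3, 3],
              [1, 1, 1, 1]], dtype=float)

H = homogeneous_transform(np.pi / 2, dx=5.0, dy=0.0)
S_new = H @ S                 # rotate the whole region by 90 degrees, then shift it 5 units in x
print(S_new[:2])              # drop the homogeneous row to recover the transformed (x, y) coordinates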
Fig. 9.1. Affine transformations can scale the coordinate axes. If the axes are scaled differently, the image undergoes a shear distortion. (Panels: reference, affine, not affine.)

Fig. 9.2. Aircraft images which are affine transformations of each other (from [9.1]).
But what can we do, if anything, to represent rotations out of the camera plane? To answer this, we need to define an affine transformation. An affine transformation of the 2-vector x = [x, y]^T produces the 2-vector x' = [x', y']^T, where

    x' = Ax + b,        (9.2)

and b is likewise a 2-vector. This looks just like the similarity transformations mentioned above, except that we do not require the matrix A to be orthonormal, only nonsingular. An affine transformation may distort the shape of a region. For example, shear may result from an affine transformation, as illustrated in Fig. 9.1.
As you probably recognize, rotation out of the field of view of a planar object is equivalent to an affine transformation of that object. This gives us a (very limited) way to think about rotation out of the field of view. If an object is nearly planar, and the rotation out of the field of view is small, that is, nothing gets occluded, we can represent this 3D motion as a 2D affine transformation. For example, Fig. 9.2 shows some images of an airplane which are all affine transformations of each other.
The matrix which implements an affine transform (after correction for translation) can be decomposed into its constituents [9.62]: rotation, zoom, and shear:

    [ a11  a12 ]   [ cos θ   sin θ ] [ λx  0  ] [ 1  σ ]
    [ a21  a22 ] = [ −sin θ  cos θ ] [ 0   λy ] [ 0  1 ].        (9.3)
Now, what does one do with these concepts of transformations? One can correct
transformations and align objects together through inverse transformations to assist
in shape analysis. For example, one can correct for translation by shifting so the
centroid is at the origin. Correcting for rotation involves rotating the image until the
principal axes of the object are in alignment with the coordinate axes.
Finding the principal axes is accomplished by a linear transformation which transforms the covariance matrix of the object (or its boundary) into the unit matrix. This transformation is related to the whitening transformation and the K-L transform. Unfortunately, once such a transformation has been done, Euclidian distances between points in the region have been changed [9.92].
In the previous paragraph, the word distance occurs. Although one usually thinks
that distance means Euclidian distance, in this book we will use this word many
times and in many forms. For that reason, we should be just a bit more rigorous
about exactly what the concept means. The Euclidian distance (for example) is a
type of measure called a metric. Any metric d(a, b) has the following properties:
r
Properties of a metric.
r
r
d(a, a) = 0
a
d(a, b) = d(b, a)
(a, b)
d(a, b) + d(b, c) d(a, c)
(a, b, c)
x2
x1
Fig. 9.3. A region which is approximately an ellipsoid.
Fig. 9.4. A new set of coordinates, derived by a rotation of the original coordinates, in which one coordinate well represents the data.
9.2.1

Consider representing a d-dimensional random vector x as a weighted sum of basis vectors,

    x = Σ_{i=1}^{d} y_i b_i.        (9.4)

Here, the vectors b_i are deterministic (and may, in general, be specified in advance). Since any random vector x may be expressed in terms of the same d vectors b_i (i = 1, . . . , d), we say the vectors b_i span the space containing x, and refer to them as a basis set for x. To make further use of the basis set, we will require:¹
(1) The b_i vectors are linearly independent;
(2) The b_i vectors are orthonormal, i.e.,

    b_i^T b_j = { 1  (i = j)
                { 0  (i ≠ j).        (9.5)
Under these conditions, the y_i may be found by projections, where the projection operation is defined by

    y_i = b_i^T x    (i = 1, . . . , d),        (9.6)

and we let

    y = [y_1, . . . , y_d]^T.        (9.7)

Here, we say the number y_i results from projecting x onto the basis vector b_i.
Suppose we wish to ignore all but m (m < d) components of y (which we will call the principal components) and yet still represent x, although with some error. We will thus calculate (by projection onto the basis vectors) the first m elements of y, and replace the others with constants, forming the estimate

    x̂ = Σ_{i=1}^{m} y_i b_i + Σ_{i=m+1}^{d} α_i b_i.        (9.8)

¹ To be a basis, they do not have to be orthonormal, just not parallel; but here, we will require orthonormality.
The error which we have introduced by using some arbitrary constants, the alphas of Eq. (9.8), rather than the elements of y, is given by

    Δx = x − x̂
       = x − ( Σ_{i=1}^{m} y_i b_i + Σ_{i=m+1}^{d} α_i b_i )        (9.9)
       = Σ_{i=m+1}^{d} (y_i − α_i) b_i.

The mean squared error is then

    ε²(m) = E{ ‖Δx‖² }        (9.10)
          = Σ_{i=m+1}^{d} E{ (y_i − α_i)² }.        (9.11)

To choose the constants α_i, set the derivative of Eq. (9.11) with respect to α_i to zero,

    ∂/∂α_i E{ (y_i − α_i)² } = −2( E{y_i} − α_i ) = 0,        (9.12)

resulting in

    α_i = E{y_i}.        (9.13)

So, we should replace those elements of y which we do not measure by their expected values, a result which is both mathematically and intuitively appealing.
Substituting Eq. (9.13) into Eq. (9.11), we have
    ε²(m) = Σ_{i=m+1}^{d} E{ (y_i − E{y_i})² }        (9.14)
          = Σ_{i=m+1}^{d} E{ (b_i^T x − E{b_i^T x})² }
          = Σ_{i=m+1}^{d} E{ (b_i^T x − E{b_i^T x}) (x^T b_i − E{x^T b_i}) }        (9.15)
          = Σ_{i=m+1}^{d} E{ b_i^T (x − E{x}) (x^T − E{x^T}) b_i },        (9.16)

and we now recognize the term between the b's in Eq. (9.16) as the covariance of x:

    ε²(m) = Σ_{i=m+1}^{d} b_i^T K_x b_i.        (9.17)
It can be shown (and we will show it below, for the special case of fitting a straight line) that the choice of vectors b_i which minimizes Eq. (9.17) also satisfies

    K_x b_i = λ_i b_i.        (9.18)

(The proof is left as a homework assignment.) The covariance of y is

    K_y = E{ (y − E{y}) (y − E{y})^T }        (9.19)
        = B^T K_x B,        (9.20)

where the matrix B has columns made from the basis vectors b_1, b_2, . . . , b_d.
Furthermore, in the case that the columns of B are the eigenvectors of K_x, then B will be the transformation which diagonalizes K_x, resulting in

    K_y = [ λ_1  0    ...  0
            0    λ_2  ...  0
            ...  ...  ...  ...
            0    0    ...  λ_d ].        (9.21)

Substituting Eq. (9.21) into Eq. (9.17), we find

    ε²(m) = Σ_{i=m+1}^{d} b_i^T λ_i b_i.        (9.22)

Since λ_i is a scalar,

    ε²(m) = Σ_{i=m+1}^{d} λ_i b_i^T b_i        (9.23)
          = Σ_{i=m+1}^{d} λ_i.        (9.24)
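As a quick numerical check of Eq. (9.24) (a minimal sketch, not from the text; the data are synthetic), the mean squared error of a truncated eigenvector expansion equals the sum of the discarded eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.2])   # synthetic data, d = 4
X -= X.mean(axis=0)                                               # zero-mean, so the alphas of Eq. (9.13) are 0

Kx = X.T @ X / len(X)                      # covariance K_x
lam, B = np.linalg.eigh(Kx)                # eigenvalues and eigenvectors (the b_i are the columns of B)
order = np.argsort(lam)[::-1]              # sort so the principal components come first
lam, B = lam[order], B[:, order]

m = 2                                       # keep only the first m principal components
Y = X @ B                                   # projections y_i = b_i^T x for every sample
X_hat = Y[:, :m] @ B[:, :m].T               # reconstruction from m components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[m:].sum())                   # the two values agree, as Eq. (9.24) predicts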
9.2.2

Fig. 9.5. A covariance matrix may be thought of as representing a hyperellipsoid, oriented in the directions of the eigenvectors, and with extent in those directions equal to the square root of the corresponding eigenvalues.
We wish to find the straight line which best fits this set of data. Move the origin to the center of gravity of the set. Then, characterize the (currently unknown) best fitting line by its unit normal vector, n. Then, for each point x_i, the perpendicular distance from x_i to the best fitting line will be equal to the projection of that point onto n. Denote this distance by d_i(n):

    d_i²(n) = (n^T x_i)².        (9.28)
To find the best fitting straight line, we minimize the sum of the squared perpendicular distances

    ε² = Σ_{i=1}^{n} d_i²(n) = Σ_{i=1}^{n} (n^T x_i)² = Σ_{i=1}^{n} n^T x_i x_i^T n = n^T ( Σ_{i=1}^{n} x_i x_i^T ) n,        (9.29)

subject to the constraint that n be a unit vector,

    n^T n = 1.        (9.30)

Defining S = Σ_i x_i x_i^T and introducing a Lagrange multiplier λ, we take

    ∂/∂n ( n^T S n − λ(n^T n − 1) ).        (9.32)

Differentiating the quadratic form n^T S n, we get 2Sn, and setting the derivative to zero results in

    2Sn − 2λn = 0,        (9.33)
which is the same eigenvalue problem mentioned earlier. Thus we may state:
The best fitting straight line passes through the mean of the set of data points, and
will lie in the direction corresponding to the major eigenvector of the covariance of
the set.
We have now seen two different ways to find the straight line which best fits data: The method of least squares described in section 5.3, if applied to fitting a line rather than a plane, minimizes the vertical distance from the data points to the line. The method described in this section minimizes the perpendicular distance described by Eq. (9.29). Other methods also exist. For example, [9.53] finds piecewise representations which preserve moments up to an arbitrarily specified order.
Fitting functions to data occurs in many contexts. For example, O'Gorman [9.54] has looked at fitting not only straight edges, but points, straight lines, and regions with straight edges. By so doing, subpixel precision can be obtained.
In the following, we will consider a few of the many simple features which may be
used to characterize the shape of regions (for additional features, see also [9.2]).
Average gray value. In the case of black and white silhouette pictures, this is simple to compute.
Maximum gray value. Straightforward to compute.
Minimum gray value. Straightforward to compute.
Area (A). A count of all pixels in the region.
Perimeter (P). Several different definitions exist. Probably the simplest is a count of all pixels in the region that are adjacent to a pixel not in the region.
Diameter (D). The diameter is the maximum chord: the distance between those two points on the boundary of the region whose mutual distance is maximum [9.68, 9.71]. We will discuss computation of this parameter in the next section.
Thinness (also called compactness)² (T). Two definitions for compactness exist: T_a = (P²/A) − 4π measures the ratio of the squared perimeter to the area; T_b = D/A measures the ratio of the diameter to the area. Fig. 9.6 compares these two measurements on example regions.
Center of gravity (CG). The x and y coordinates of the center of gravity may be written as

    m_x = (1/N) Σ x,    m_y = (1/N) Σ y

for all N points in a region. However, we prefer the vector form

    m = (1/N) Σ_{i=1}^{N} [x_i, y_i]^T.        (9.34)

² Some authors [9.69] prefer not to confuse the mathematical definition of compactness with this definition, and thus refer to this measure as the isoperimetric measure.
Fig. 9.6. Results of applying two different compactness measures to various regions (panel labels: T_a small, T_b large; T_a large, T_b large; T_a large, T_b small). Since a circle has the minimum perimeter for a given area, it minimizes T_a. A starfish, on the other hand, has a large perimeter for the same area.
Fig. 9.7. y/x is the aspect ratio using one definition, with horizontal and vertical sides to the bounding rectangle.
XY aspect ratio. See Fig. 9.7. The aspect ratio is the length/width ratio of the bounding rectangle of the region. This is simple to compute.
Minimum aspect ratio. See Fig. 9.8. Again a length/width ratio, but much more computation is required to find the minimum such rectangle. The minimum aspect ratio can be a difficult calculation, since it requires a search for extremal points. A very good approximation can be obtained if we think of a region as represented by an ellipse-shaped distribution of points. In this case, as we discussed in Fig. 9.5, the eigenvalues of the covariance of the points are measures of the distribution of points along orthogonal (major and minor) axes. The ratio of those eigenvalues is quite a good approximation of the minimum aspect ratio.
Number of holes. One feature that is very descriptive and reasonably easy to compute is the number of holes in a region.
Triangle similarity. Consider three points on the boundary of a region, P_1, P_2, P_3; let d(P_i, P_j) denote the Euclidian distance between two of those points, and let S = d(P_1, P_2) + d(P_2, P_3) + d(P_3, P_1) be the perimeter of that triangle. The 2-vector

    ( d(P_1, P_2)/S,  d(P_2, P_3)/S ),        (9.35)

simply the ratio of side lengths to perimeter, is invariant to rotation, translation, and zoom [9.6].
Symmetry. In two dimensions, a region is said to be mirror-symmetric if it is invariant under a reflection about a line. That line is referred to as the axis of symmetry. A region is said to have rotational symmetry of order n if it is invariant under a rotation of 2π/n about some point, usually the center of gravity of the region. There are two challenges in determining the symmetry of regions. The first is simply determining the axis. The other is to answer the question: How symmetric is it? Most papers prior to 1995 which analyzed symmetry of regions
9.3.1

In this method we first find the major axis of the region and those two points on the boundary of the region which are closest to that axis. In this approach, minor deviations, such as the spur shown in Fig. 9.9, will be ignored, even though they may actually contain one of the extreme points.

(Assignment: Is this algorithm new, or is it repeated somewhere else in the text?)
The first step in the process is calculation of the major axis. This is performed by a minimization-of-squared-error technique. It is important that minimization of error be independent of the coordinate axes; therefore, we use the eigenvector line fitting technique described earlier in this section rather than the conventional MSE technique. We define a line to be the best representation of the major axis if it minimizes the sum of squares of the perpendicular distances from the points in the region to the line.
Let us assume the region R is described by a set of points R = {(x_i, y_i) | i = 1, . . . , n}. Let the point (x_i, y_i) be denoted by the vector v_i, and let d_i be the perpendicular distance from v_i to the major axis. Thus, the major axis of the region is the line which minimizes

    d² = Σ_{i=1}^{n} d_i².        (9.36)

It is easily shown that the major axis must pass through the center of gravity of the region; thus it is necessary only to find the slope of the axis. Since the line passes through the center of gravity, let us take that point as the origin of our coordinate system. Then the problem becomes: Given n points with zero mean, find the line through the origin which minimizes d².
This turns out to be the same eigenvector problem we described earlier. Thus one can find the major axis by:
(1) relocating the origin by subtracting the center of gravity from each point;
(2) finding the principal eigenvector of the scatter matrix of this modified set of points; and
(3) solving for the major axis as the line through the center of gravity parallel to the principal eigenvector.
Having found the major axis, one then treats each point on the boundary as a vector and projects each onto the major axis. The extrema are then those two points on opposite sides of the center of gravity whose projections onto the major axis are maximum in length. (The extrema are not necessarily unique.) This approach yields a solution which is an accurate representation of the shape of the region in the least-mean-squared-error sense. It may or may not actually find the two points whose mutual distance is maximum. In many applications, an approximation like this is exactly what is needed. However, occasionally one encounters regions (Fig. 9.9) with spurs on them, where the spur may be relevant. In this case, it is necessary to use an algorithm which will find the actual extrema (see section 9A.1).
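The three steps above translate directly into a few lines of linear algebra. The sketch below (not from the text; the point set is made up) finds the major axis and the two approximate extremal points by projection:

import numpy as np

def major_axis_extrema(points):
    # points: n x 2 array of (x, y) coordinates of the region (or its boundary).
    cg = points.mean(axis=0)                 # center of gravity
    centered = points - cg                   # step (1): move the origin to the CG
    scatter = centered.T @ centered          # scatter matrix, sum of x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(scatter)
    axis = eigvecs[:, np.argmax(eigvals)]    # step (2): principal eigenvector = major axis direction
    proj = centered @ axis                   # step (3): signed projections onto the major axis
    return axis, points[np.argmin(proj)], points[np.argmax(proj)]

pts = np.array([[1., 1.], [2., 1.], [2., 2.], [2., 4.], [3., 2.], [5., 5.]])
print(major_axis_extrema(pts))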
If one were to stretch a rubber band around a given region, the region that would
result would be the convex hull (Fig. 9.10).
The difference in area between a region and its convex hull is the convex discrepancy. See Shamos [9.68] for fast algorithms for computing the convex hull, and
[9.30] for such algorithms for parallel machines.
We can find the convex hull in O(n log n) time. Furthermore, finding the convex hull provides us with another simple feature, the convex discrepancy, as illustrated in Fig. 9.10.
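A minimal sketch (not from the text) of computing the convex hull and the convex discrepancy for a region given by its boundary polygon, using scipy; the L-shaped boundary below is made up:

import numpy as np
from scipy.spatial import ConvexHull

def polygon_area(vertices):
    # Shoelace formula for a simple polygon given as an ordered (n, 2) array of vertices.
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Boundary of a nonconvex (L-shaped) region, listed in order around the region.
boundary = np.array([[0, 0], [6, 0], [6, 2], [2, 2], [2, 6], [0, 6]], dtype=float)

region_area = polygon_area(boundary)             # area of the region itself
hull = ConvexHull(boundary)
convex_discrepancy = hull.volume - region_area   # for 2-D input, ConvexHull.volume is the hull's area
print(region_area, hull.volume, convex_discrepancy)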
9.4 Moments
The moments of a shape are easily calculated, and as we shall see, can be robust to
similarity transforms.
A moment of order p + q may be defined on a region as

    m_pq = Σ_x Σ_y x^p y^q f(x, y).        (9.37)

If we assume that the region is uniform in gray value and that gray value is arbitrarily set to 1 inside the region and 0 outside, the area of the region is then m_00, and we find that the center of gravity is

    m_x = m_10 / m_00,    m_y = m_01 / m_00.        (9.38)

We may derive a set of moment-like measurements (the central moments) which are invariant to translation by moving the origin:

    μ_pq = Σ_x Σ_y (x − m_x)^p (y − m_y)^q f(x, y).        (9.39)
By taking into account rotation and zoom, we can continue this sort of derivation and can now define as many features as we wish by choosing higher orders of moments or combinations thereof. From the central moments, we may define the normalized central moments by

    η_pq = μ_pq / μ_00^γ,    where γ = (p + q)/2 + 1.

Finally, the invariant moments [9.21] have the characteristic that they are invariant to translation, rotation, and scale change, which means that we get the same moment even though the image may be moved, rotated, or zoomed.³ They are listed in Table 9.1.
³ Gonzalez and Wintz [9.21] refer to zooming as "scale change." Since we use the word "scale" in a slightly different way, we refer to this as zoom.
Table 9.1. The invariant moments.

    φ_1 = η_20 + η_02
    φ_2 = (η_20 − η_02)² + 4η_11²
    φ_3 = (η_30 − 3η_12)² + (3η_21 − η_03)²
    φ_4 = (η_30 + η_12)² + (η_03 + η_21)²
    φ_5 = (η_30 − 3η_12)(η_30 + η_12)[(η_30 + η_12)² − 3(η_03 + η_21)²] +
          (3η_21 − η_03)(η_03 + η_21)[3(η_30 + η_12)² − (η_03 + η_21)²]
    φ_6 = (η_20 − η_02)[(η_30 + η_12)² − (η_03 + η_21)²] + 4η_11(η_30 + η_12)(η_21 + η_03)
    φ_7 = (3η_21 − η_03)(η_30 + η_12)[(η_30 + η_12)² − 3(η_03 + η_21)²] +
          (3η_12 − η_30)(η_21 + η_03)[3(η_30 + η_12)² − (η_03 + η_21)²]
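A minimal sketch (not from the text) of computing the first two invariant moments of Table 9.1 directly from Eqs. (9.37)-(9.39) for a binary image; extending it to φ_3 through φ_7 follows the same pattern. The example image is made up:

import numpy as np

def invariant_moments_12(f):
    # f: 2-D array, 1 inside the region and 0 outside (the convention used with Eq. (9.37)).
    y, x = np.mgrid[0:f.shape[0], 0:f.shape[1]]
    m = lambda p, q: np.sum((x ** p) * (y ** q) * f)                   # Eq. (9.37)
    mx, my = m(1, 0) / m(0, 0), m(0, 1) / m(0, 0)                      # Eq. (9.38)
    mu = lambda p, q: np.sum(((x - mx) ** p) * ((y - my) ** q) * f)    # Eq. (9.39)
    eta = lambda p, q: mu(p, q) / (mu(0, 0) ** ((p + q) / 2 + 1))      # normalized central moments
    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

img = np.zeros((32, 32))
img[8:24, 10:20] = 1.0           # a made-up rectangular region
print(invariant_moments_12(img))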
Since their original development by Hu [9.33], the concept has been extended to moments which are invariant to affine transforms by Rothe et al. [9.62].
Despite their attractiveness, strategies based on moments do have problems, not the least of which is sensitivity to quantization and sampling [9.45]. (See Assignment 9.9.)
The use of moments is actually a special case of a much more general approach to image matching [9.89] referred to as the method of normalization. In this philosophy, we first transform the region into a canonical frame by performing a (typically linear) transform on all the points. The simplest such transformation is to subtract the coordinates of the center of gravity (CG) from all the pixels, thus moving the coordinate origin to the CG of the region. In the more general case, such a transformation might be a general affine transform, including translation, rotation, and shear. We then do matching in the transform domain, where all objects of the same class (e.g., triangles) look the same.
Some refinements are also required if moments are to be calculated with gray-scale images, that is, when the f of Eq. (9.37) is not the result of a thresholding operation. All the theory of invariance still holds, but as Gruber and Hsu [9.24] point out, noise corrupts moment features in a data-dependent way.
Once a program has extracted a set of features, some use must be made of this set, either to match two observations or to match an observation to a model. The use of simple features in matching is described in section 13.2.
Fig. 9.11. The eight directions (for 8-neighbor) and four directions (for 4-neighbor) in which one might step in going from one pixel to another around a boundary.
A boundary may be encoded as a chain code: a sequence of digits, all between 0 and 7 (if using eight directions) or between 0 and 3 (if using four directions), designating the direction of each step. The eight and four cardinal directions are defined as illustrated in Fig. 9.11. The boundary of a region may then be represented by a string of single digits. A more compact representation utilizes superscripts whenever a direction is repeated. For example, 0012112776660 could be written 0²121²27²6³0, and illustrates the boundary shown in Fig. 9.12. The ability to describe boundaries with a sequence of symbols plays a significant role in the discipline known as syntactic pattern recognition, and appears frequently in the machine vision literature, including other places in this book.
(Pun alert! Bet you missed that one: note the kinds of numbers discussed in the paragraph.)
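A minimal sketch (not from the text) of producing an 8-neighbor chain code from an ordered list of boundary pixels. The direction numbering below (0 = east, increasing counterclockwise) is only an assumed convention for illustration and may not match Fig. 9.11 exactly:

# Assumed convention: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE, keyed by (dx, dy) steps.
STEPS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
         (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(boundary):
    # boundary: ordered list of (x, y) pixel coordinates around the region, implicitly closed.
    code = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        code.append(STEPS[(x1 - x0, y1 - y0)])
    return code

square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(chain_code(square))   # [0, 0, 2, 2, 4, 4, 6, 6] for this small square boundary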
The effect of simple image transformations on the Fourier descriptors of a boundary:

    In the image                          In the transform
    A change in size                      Multiplication by a constant
    A rotation of θ about the origin      Phase shift
    A translation                         A change in the DC term
How we represent the movement from one boundary point to another is critical. Simply using a 4-neighbor chain code produces poor results. Use of an 8-neighbor code reduces the error by 40 to 80%, but is still not as good as using a subpixel interpolation. There are other complications as well, including the observation that the usual parameterization of the boundary (arc length) is not invariant to affine transformations [9.1, 9.96]. Experiments [9.39] have compared affine-invariant Fourier descriptors and autoregressive methods. For more detail on Fourier descriptors, see [9.1].
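A minimal sketch (not from the text) of one common way to build Fourier descriptors: treat each boundary point as a complex number x + jy and take its FFT. The properties in the table above can then be checked directly; dropping the DC term removes translation, and dividing by the magnitude of one harmonic removes size:

import numpy as np

def fourier_descriptors(boundary, keep=8):
    # boundary: ordered (n, 2) array of boundary points; returns a size/translation-normalized descriptor.
    z = boundary[:, 0] + 1j * boundary[:, 1]    # each boundary point as a complex number
    F = np.fft.fft(z)
    F[0] = 0.0                                   # a translation only changes the DC term, so drop it
    F = F / np.abs(F[1])                         # a change in size multiplies every term by a constant
    return np.abs(F[1:keep + 1])                 # magnitudes are unaffected by the phase shift of a rotation

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ellipse = np.column_stack([3 * np.cos(t), np.sin(t)])
moved = 2.0 * ellipse + np.array([10.0, -4.0])                  # zoomed and translated copy
print(np.allclose(fourier_descriptors(ellipse), fourier_descriptors(moved)))   # True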
For each point on the medial axis, draw a circle of radius R about that point. Make R as large as possible, but such that (1) no points in the circle are outside the region and (2) the circle touches the boundary of the region at no fewer than two points. Any point on the medial axis can be shown (see Assignment 7.12) to be a local maximum of a distance transform (DT). A point with DT value k is a local maximum if none of its neighbors has a greater value. Fig. 9.13(a) repeats Fig. 7.3. The local maxima of this distance transform are illustrated in Fig. 9.13(b).
Another way to think about the medial axis is as the minimum of an electrostatic potential field. This approach is relatively easy to develop if the boundary happens to be straight lines or, in three dimensions, planes [9.12]. See [9.16] for additional algorithms to efficiently compute the medial axis.
Fig. 9.13. An example region whose DT, computed using 4-neighbors, is shown in (a). The morphological skeleton is illustrated in (b), and simply consists of the local maxima of the DT.
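A minimal sketch (not from the text) of the skeleton-as-local-maxima idea using scipy's Euclidean distance transform. Note that Fig. 9.13 uses a 4-neighbor DT, whereas distance_transform_edt is Euclidean, so the two will differ in detail; the region below is made up:

import numpy as np
from scipy import ndimage

region = np.zeros((20, 20), dtype=bool)
region[3:17, 5:15] = True                       # a made-up rectangular region

dt = ndimage.distance_transform_edt(region)     # distance from each region pixel to the background
# A pixel is a local maximum if none of its 8-neighbors has a strictly greater DT value.
local_max = (dt == ndimage.maximum_filter(dt, size=3)) & region
print(np.count_nonzero(local_max))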
9.7.1

    ψ_d(s) = ψ(s) + δ(s),        (9.40)

where s is (normalized) arc length, ψ(s) is the template as stored in the data base, ψ_d(s) is the deformed template, and δ(s) is the deformation required to make this particular template match a sequence of boundary points in the image being accessed. We emphasize that δ(s) is the difference between the original and the deformed template. The image in the data base which best matches the template is the one which minimizes the
energy

    E = ∫_0^1 { [ (dδ_x/ds)² + (dδ_y/ds)² ] + [ (d²δ_x/ds²)² + (d²δ_y/ds²)² ] + I_E(δ(s)) } ds,        (9.41)

which represents (in the first term) how much the template had to be strained to fit the object, while the second term represents the energy spent to bend the template. This is thus a deformable templates problem. The optimization problem may be solved numerically [13.5].
A variation on deformable templates is the idea of geometric flow: changing a given initial curve into a form which is better suited for identification by, say, template matching. The term geometric flow means that the flow is completely determined by the geometry of the curve. Pauwels et al. [9.58] cast this discipline as the answer to the question: Is it possible to use optimization of functionals to crystallize the geometric content of a curve by reducing the noise while at the same time enhancing the salient features?
The general quadric surface satisfies

    ax² + by² + cz² + 2fyz + 2gzx + 2hxy + px + qy + rz + d = 0.        (9.42)

This one equation describes all the second-order surfaces, some of which are illustrated in Fig. 9.14.

Fig. 9.14. The quadric equation describes a wide variety of surfaces [9.103] (CRC Press. Used with permission). Shown are an ellipsoid, hyperboloids of one and two sheets, an elliptic paraboloid, and a hyperbolic paraboloid.
If the quadric is centered at the origin, and its principal axes happen to be aligned with the coordinate axes, the quadric will take on a particular form. For example, an ellipsoid has the special form

    x²/a² + y²/b² + z²/c² = 1.        (9.43)
However, when the axes of the quadric do not align with the coordinate axes, only
the general form of Eq. (9.42) occurs.
From range or other surface data, the quadric coefficients may be determined by methods such as those in section 8.6.1. Given the coefficients, the type of quadric may be determined by the following method.⁵
If there is a constant term, d, divide by the constant, redefining the other coefficients (e.g., a ← a/d), resulting in a form of the quadric equation in which the constant term is unity:

    ax² + by² + cz² + 2fyz + 2gzx + 2hxy + px + qy + rz + 1 = 0.        (9.44)
This may be written in matrix form as

    [x  y  z  1] [ a  h  g  p ] [ x ]
                 [ h  b  f  q ] [ y ]  = 0.        (9.45)
                 [ g  f  c  r ] [ z ]
                 [ 0  0  0  1 ] [ 1 ]

Define E to be the upper-left 3 × 3 submatrix,

    E = [ a  h  g
          h  b  f
          g  f  c ].
Obtain the three eigenvalues of E, λ_1, λ_2, and λ_3, and find the reciprocal of each which is nonzero: r_1 = 1/λ_1, r_2 = 1/λ_2, and r_3 = 1/λ_3. At least one reciprocal must be positive to have a real surface. If exactly one is positive, then the surface is a hyperboloid of two sheets; if exactly two are positive, then it is a hyperboloid of one sheet; if all three are positive, it is an ellipsoid, and the square roots of r_1, r_2, r_3 are the major axes of the ellipsoid. Otherwise the distance between the foci of hyperboloids is determined by the magnitudes of the r's.
⁵ The authors are grateful to Dr G. L. Bilbro for his formulation of this method.
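A minimal sketch (not from the text) of the eigenvalue test, written here for a centered quadric expressed as x^T E x = 1 (i.e., with the constant moved to the right-hand side, which keeps the sign bookkeeping of Eq. (9.44) out of the way); the coefficients below are made up:

import numpy as np

def classify_centered_quadric(a, b, c, f, g, h):
    # E is the matrix of second-order coefficients; the quadric is x^T E x = 1.
    E = np.array([[a, h, g],
                  [h, b, f],
                  [g, f, c]])
    lam = np.linalg.eigvalsh(E)
    recip = np.array([1.0 / l for l in lam if abs(l) > 1e-12])   # reciprocals of the nonzero eigenvalues
    positive = int(np.sum(recip > 0))
    if positive == 3:
        return "ellipsoid, semi-axes " + str(np.sqrt(recip))
    if positive == 2:
        return "hyperboloid of one sheet"
    if positive == 1:
        return "hyperboloid of two sheets"
    return "no real surface"

# x^2/4 + y^2/4 + z^2/9 = 1: an ellipsoid with semi-axes 2, 2, 3.
print(classify_centered_quadric(a=0.25, b=0.25, c=1.0 / 9.0, f=0.0, g=0.0, h=0.0))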
In Cartesian coordinates, Laplace's equation is

    ∂²Ψ/∂x² + ∂²Ψ/∂y² + ∂²Ψ/∂z² = 0,        (9.47)

which, in spherical coordinates (r, θ, φ), is

    (1/r²) ∂/∂r( r² ∂Ψ/∂r ) + (1/(r² sin θ)) ∂/∂θ( sin θ ∂Ψ/∂θ ) + (1/(r² sin²θ)) ∂²Ψ/∂φ² = 0.        (9.48)
We seek solutions which are separable in the sense that they may be written as the product of functions of a single variable; that is, they have the form

    Ψ(r, θ, φ) = R(r) Θ(θ) Φ(φ).        (9.49)

With this restriction, the partial differential equation may be separated into three ordinary differential equations, and the solution can be shown to be of the form

    P_l^m(cos θ) sin mφ,        (9.50)

where the parameter l is referred to as the degree, m is an integer less than l, and P is a Legendre polynomial. Thus, any function may be represented as

    r(θ, φ) = Σ_{l=0}^{L} Σ_{m=0}^{l} [ U_l^0 P_l^m(cos θ) + U_l^m P_l^m(cos θ) cos mφ + V_l^m P_l^m(cos θ) sin mφ ],        (9.51)

where the coefficients are found by the process of fitting this form to the data.
A hyperquadric is defined by an equation of the form

    F(x, y, z) = Σ_{i=1}^{N} |A_i x + B_i y + C_i z + D_i|^{γ_i} = 1.        (9.53)

Coefficients which minimize

    Σ_{i=1}^{N} (1 − F(x_i, y_i, z_i))²        (9.54)

are presumed to be a good fit. (At every point on the surface, the value of F(x, y, z) is supposed to be 1.0.) Observe that this minimizes the algebraic distance (see section 8.6.1) from the point to the surface! This distance will be zero if the point lies on the surface, but otherwise there is not a simple relationship between the Euclidean distance from the point to the surface and the algebraic distance. Kumar et al. observe: "This function is biased, especially for oblong objects." Similar complaints exist for essentially all applications of the algebraic distance.
To get a somewhat better fit, in the case of hyperquadrics, the following approach can be used. Suppose we have an initial estimate of the surface, an estimate which is not too bad, but might not be the best estimate. Let that estimate be defined by a set of parameters A, B, C, and D, defining a function F(x, y, z). For a particular point (x_i, y_i, z_i), substitute these values into F to determine w_i = F(x_i, y_i, z_i). If the point is actually on the surface, then w_i will be equal to 1. Now, consider the surface defined by F(x, y, z) = w_i. The distance normal to this surface can be approximated by the distance d_i in the direction of the gradient,

    F(x, y, z) = F(x_i, y_i, z_i) + d_i ‖∇F(x_i, y_i, z_i)‖,        (9.55)

so

    d_i = (1 − F(x_i, y_i, z_i)) / ‖∇F(x_i, y_i, z_i)‖.        (9.56)
This d_i is really what we want to minimize. It is the distance from a surface through the point (x_i, y_i, z_i) which is in some sense parallel to the surface we wish to determine. These are estimates only; it remains to iteratively refine them until they become the real solutions. To accomplish that, rewrite the squared error in terms of d,

    E = Σ_{i=1}^{N} (1 − F(x_i, y_i, z_i))² / ‖∇F(x_i, y_i, z_i)‖².        (9.57)
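A minimal sketch (not from the text) of the gradient-weighted error of Eqs. (9.56)-(9.57), applied to an implicit surface F(x, y, z) = 1 given as a Python function; the sample surface and points are made up, and the gradient is taken numerically:

import numpy as np

def weighted_fit_error(F, points, eps=1e-5):
    # Eq. (9.57): sum over points of (1 - F)^2 / ||grad F||^2, with grad F from central differences.
    E = 0.0
    for p in points:
        val = F(*p)
        grad = np.array([(F(*(p + d)) - F(*(p - d))) / (2 * eps)
                         for d in np.eye(3) * eps])
        E += (1.0 - val) ** 2 / np.dot(grad, grad)
    return E

# Example implicit surface: an ellipsoid x^2/4 + y^2/4 + z^2/9 = 1.
F = lambda x, y, z: x**2 / 4 + y**2 / 4 + z**2 / 9
pts = np.array([[2.0, 0.0, 0.0], [0.1, 1.9, 0.2], [1.0, 1.0, 1.0]])   # near (and not so near) the surface
print(weighted_fit_error(F, pts))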
9.13 Conclusion
In this chapter, several features were defined that could be used to quantify the shape of a region. Some, like the moments, are easy measurements to make. Others, such as
In section 9.2.2 a derivation is presented which finds the straight line which best fits a set of points, in the sense that the sum of perpendicular distances is minimized. To accomplish that, we were required to use constrained minimization with Lagrange multipliers.
In section 9.8 we find a deformation to a template by performing a minimization of an integral squared error.
In section 9A.2 we will encounter a problem requiring inversion of a nonsquare matrix. Of course, one cannot formally invert such a matrix, and instead we derive the pseudo-inverse. We also show that the pseudo-inverse is really a minimum-squared-error algorithm.
9.14 Vocabulary
You should know the meanings of the following terms.
Affine transform
Aspect ratio
Basis vector
Center of gravity
Chain code
Compactness
Convex hull
Convex discrepancy
Deformable template
Diameter
Fourier descriptor
Generalized cylinder
Homogeneous transformation matrix
Invariant moment
K-L transform
Linear transformation
Medial axis
Metric
Moment
Orthogonal transformation
Principal component
Similarity transform
Thinness
Topic 9A
Shape description

9A.1
This technique is easy to program and converges rapidly. We have had good luck with it.
Unfortunately, it is not guaranteed to converge to the global extrema, although there is a high
likelihood that it will do so.
An extension of the previous algorithm provides a strategy which is guaranteed to converge. In addition, this algorithm provides a mechanism for rapidly narrowing the search space.
First, define a linear search function M(P_i, R) which returns the point P_{i+1} = M(P_i, R) such that ∀x ∈ R, d(P_i, x) ≤ d(P_i, P_{i+1}).
Choose an arbitrary point P_1 ∈ R and find P_2 = M(P_1, R − P_1). Next, find P_3 = M(P_2, R − {P_1, P_2}). The relationship between d(P_1, P_2) and d(P_2, P_3) can be one of the following three cases, which are illustrated in Figs. 9.15-9.17:
Case 1: d(P_1, P_2) > d(P_2, P_3)
Case 2: d(P_1, P_2) = d(P_2, P_3)
Case 3: d(P_1, P_2) < d(P_2, P_3).
In Case 3, we compute a new P_3, called P_3', so that d(P_2, P_3') = d(P_2, P_3) and P_1, P_2, and P_3' are colinear. We therefore have a symmetrical lens shape which encloses all the points in R using at most two linear searches.
Fig. 9.18. In this figure, B_2 is one of the extrema. The other extreme is an element of R_11 (redrawn from [9.71]).
Our heuristic states that the best direction in which to search is perpendicular to P_1 P_2 (or P_2 P_3' in Case 3).
Compute the apexes α_1 and α_2 as shown in Fig. 9.18. Then find B_1 = M(α_1, R − {P_1, P_2}). If d(α_1, B_1) ≤ d(P_1, P_2), stop; else partition R − {P_1, P_2} into two mutually exclusive regions R_1 and R_2, where R_1 = {x ∈ R − {P_1, P_2} : d(x, P_1) > d(P_1, P_2)}. Find B_2 = M(α_2, R − R_1), and find R_2 = {x ∈ R − {P_1, P_2} − R_1 : d(x, α_2) > d(P_1, P_2)}.
Note that: (1) In general, R_1 ∪ R_2 ⊂ R; however, R − {R_1 ∪ R_2} contains no points of interest. (2) If either R_1 = ∅ or R_2 = ∅, we can stop, and P_1 P_2 is the diameter.
If both R_1 and R_2 are nonempty, we can define them as antipodal regions in the following sense: If a diameter exists greater than d(P_1, P_2), one endpoint must lie in R_1 and the other in R_2, which effectively cuts the search space in half.
Two new points, α_11 and α_12, are computed by swinging an arc with center at α_1 and radius d(α_1, B_1) to intersect the lens. Similarly, α_21 and α_22 are computed.
Note that d(α_21, α_12) = d(α_11, α_22), and this is an upper bound on the diameter. In a digital picture, having an upper limit on the diameter may allow earlier stopping of the algorithm: if we know a diameter candidate and we know the upper bound, and these two values differ by less than 1.414 (the diagonal of a pixel), there is no need to search further.
Compute r = MAX(d(B_1, B_2), d(P_1, P_2)). Use this as a radius with which to draw arcs. Using α_21 as center and r as radius, partition R_1 into two regions, R_11 and R_1 − R_11. Similarly, use α_22 as center and find R_12 ⊂ R_1. Note that R_1 − {R_11 ∪ R_12} contains no points of interest. Similarly, find R_21 and R_22 as shown. If R_21 = ∅ and R_11 ∩ R_12 = ∅, then R_21 and R_12 is an antipodal pair, as is R_11 and R_22. In any case, R_21 ∪ R_22 is antipodal with R_11 ∪ R_12. These antipodal regions will be our search space in the next phase.
The strategy to this point has either identified the extrema or it has provided other useful results, specifically:
(1) An upper bound on the diameter has been derived.
(2) All points within a very large area have been eliminated as candidates for extrema.
(3) The remaining points have been partitioned into antipodal regions.
At this point, if fewer than K points remain (which is most often the case), the most appropriate technique is to compute convex hulls. (The optimal choice of K is a function of region topology. It has been our experience that K = 50 seems to work well.)
If, on the other hand, many more points remain, the algorithm can be invoked recursively, using the antipodal pairs of regions as subject areas, and choosing new starting points as those points closest to α_21 and α_12, or α_11 and α_22.
For a region R with N points, the blind exhaustive search takes O(N²) distance calculations and comparisons. Our search algorithm is exhaustive (hence guarantees convergence), but intelligent, taking the global shape of the region R into consideration. The initial search space R is sequentially divided into smaller, mutually exclusive subspaces by eliminating those points that cannot be the end points of the extrema.
Although the number of subspaces is increased by two for each recursive call of the procedure, many more points are eliminated from the search space after each call. Therefore, the search space decreases rapidly. How rapidly it decreases depends on the shape of R.
This method is derived from geometric considerations, and consequently its rate of convergence is strongly dependent on the geometry of the region. It is, therefore, difficult to accurately assess the computational complexity of the method. If used as a pre-processing technique, it operates in O(4n) time, plus the time required to perform the convex hull calculation on the remaining points, or O(k log k), where k represents the points remaining after pre-processing.
Certainly in the worst case, almost no points are eliminated and convergence operates as the convex hull does, in O(n log n). It is in fact worse than the convex hull in this case, since the program is more complex.
However, due to the large number of branch points, the algorithm exits early for virtually all regions, and it converges very rapidly.
9A.2
9A.2.1
    [ r_i ]   [ k_u f   0      u_0  0 ] [ R  T ] [ x_i ]
    [ c_i ] = [ 0       k_v f  v_0  0 ] [ 0  1 ] [ y_i ]        (9.58)
    [ 1   ]   [ 0       0      1    0 ]          [ z_i ]
                                                 [ 1   ]
where k_u, k_v, u_0, v_0, and f are projective properties of the camera, R is a 3 × 3 rotation matrix, and T is a 3 × 1 translation vector. Assuming we know for each point the actual 3D coordinates
and the corresponding 2D observations we should be able to infer the transformations and
camera properties. Variations on this problem include [9.95] the Perspective n Point problem
(PnP), when n point correspondences are known; the Perspective n Line problem (PnL), when
n pairs of corresponding lines are known; and the Perspective n Angle problem (PnA), when
n pairs of corresponding angles are observed. Analytic solutions have been determined for
P3P [9.17], P3L [9.14], and PnA [9.95]. A clear explanation of the linear approaches to using
uncalibrated cameras (but assuming correct solutions to the correspondence problem) may
be found in [9.26].
Some work [9.34] has been done on the harder version of the correspondence problem, the
case that the cameras are not only uncalibrated, but are oriented in such a way that the epipolar
assumption is not necessarily valid. In this case the stereo matching problem becomes one of
search for best matching pairs, using both radiometric and geometric information to narrow
the search.
9A.2.2
Can you find the surface normal? (If you have the normal at every point, how would you determine the surface?)
The solution may be found by solving differential equations. We start by writing the surface normal vector as n = r/|r|, where the direction vector

    r = [ −∂z/∂x,  −∂z/∂y,  1 ]^T.

In most shape-from-shading literature, the partial derivatives are abbreviated p ≡ ∂z/∂x and q ≡ ∂z/∂y.
Fig. 9.19. Light strikes a surface at an incident angle relative to the surface normal, n, and is reflected/scattered in another direction.

Despite the fact that we use the term brightness constantly, it actually does not have a rigorous physical definition. Following Horn [9.32], we define irradiance as the power per unit area falling on a surface, measured in watts per square meter. Then, we can define radiance as the power per unit foreshortened area per unit solid angle. This dependence on
foreshortened area makes it clear that the angle of observation plays an important role in
scene brightness.
Often, the reflectivity model of a surface is known, or may be measured. For example, the brightness observed might be independent of the angle of observation, and depend only on the angle of incidence. For example, the reflected brightness might be related to the incident brightness by a relationship such as

    R(x, y) = a I(x, y) cos(θ_I).        (9.59)

Thus, if we know the incident brightness I, the albedo (how the surface is painted) a, and the reflected brightness R, we should be able to solve for θ_I, and from that infer the surface normal, and from that the surface. The reflectivity function of Eq. (9.59) is known as a Lambertian model. Note that the angle of observation does not enter the Lambertian model. Another familiar reflectivity function is the specular model

    R(x, y) = a I(x, y) δ(θ_I − θ_o),        (9.60)

which describes mirrors: you only get a reflection if the angle of observation equals the angle of incidence. Of course most surfaces, even shiny ones, are not perfect specular reflectors, and a more realistic model for a mixed surface might be

    R(x, y) = a I(x, y) cos⁴(θ_I − θ_o).        (9.61)
Although the use of reflectivity functions requires radiometric calibration of cameras [9.28], that requirement in itself is not the major difficulty. To see the complexity of the problem [9.51], let us expand Eq. (9.59) in terms of the elements of the observation vector and the normal vector:

    R(x, y) = a I(x, y) cos(I · N) = a I(x, y) cos( I_x ∂z/∂x + I_y ∂z/∂y + I_z N_z ).        (9.62)
Assuming we know the angle of observation (which we will actually know only approximately
at best), the albedo and the incident brightness, we still have a partial differential equation
which we must solve to determine the surface function z.
Many papers and Horn's classic text [9.32] address approaches for solving various special cases of the shape-from-shading problem. A recent paper by Zhang et al. [9.100] surveys the field up to 1999. (Remember, Equation (9.62) is itself a special case: it assumes the brightness does not depend on the observation angle.) In the following, we discuss another special case, photometric stereo.
Photometric stereo
In many cases, it is reasonable to model the reflectivity of a surface as proportional to the cosine of the angle between the surface normal vector and the illumination vector:

    I(x, y) = r_0 (N_i · n),        (9.63)

where N_i is a unit vector in the direction of light source i. If we are fortunate enough to have an object, a Lambertian reflector, which satisfies this equation, and which possesses the same albedo (r_0), independent of the illumination, we can make use of multiple pictures from
multiple angles to determine the surface normal [9.35, 9.94]. Let us illuminate a particular
pixel with three different light sources (one at a time) and measure the brightness of that pixel
each time. At that pixel, we construct a vector from the three observations
    I = [I_1, I_2, I_3]^T.        (9.64)

We know the direction of each light source. Let those directions be denoted by unit vectors from the surface point toward the light source, N_1, N_2, and N_3. Write those three direction vectors in a single matrix by making each vector a row of the matrix:

    N = [ N_1 ]   [ n_11  n_12  n_13 ]
        [ N_2 ] = [ n_21  n_22  n_23 ],        (9.65)
        [ N_3 ]   [ n_31  n_32  n_33 ]

and now we have a matrix version of Eq. (9.63):

    I = r_0 N n.        (9.66)
Provided the three light-source directions are not coplanar, N is invertible, and the surface normal follows immediately:

    n = (1/r_0) N^{−1} I.        (9.68)
Note that this derivation of photometric stereo assumes the albedo (sometimes called the surface reflectance) is the same for every angle of illumination. In the next subsection, we illustrate an application which combines shape from shading with photometric stereo, and does not make this assumption. For example, specular reflectors provide a special condition: the angle of observation is exactly equal to the angle of incidence. This allows special techniques to be used [9.56].
But what if we used more than three light sources? This gives us a wonderful context in
which to discuss an important topic, overdetermined systems and the pseudo-inverse.
If we in fact used more than three light sources, we would hope to cancel out some effects of noise and/or measurement error. Suppose we have k light sources. Then Eq. (9.66) is rewritten as I_{k×1} = N_{k×3} n_{3×1}, where subscripts are used to emphasize the matrix dimensions, and r_0 is removed for clarity of explanation.
Now, we cannot simply multiply by the inverse of N because N is not square. Instead, as we have done so many times before, let us set up a minimization problem: We will find the surface normal vector n which minimizes the squared sum of differences between the measurements I and the products of N and n. Of course, if Eq. (9.63) were strictly true everywhere, then we would not need to do a minimization, and there would be no point in taking more than three measurements. We believe that measurements are not perfect and there is some advantage to taking additional ones. Define an objective function E which incorporates the desire to find an optimal solution:

    E = Σ_{i=1}^{k} ( I_i − N_i^T n )² = (I − N n)^T (I − N n).        (9.69)
      = I^T I − 2 I^T N n + n^T N^T N n.        (9.70)

We wish to find the surface normal vector n which minimizes this sum-squared difference E, so we differentiate E with respect to n,

    ∇_n E = −2 N^T I + 2 N^T N n,        (9.71)

and set the derivative to zero:

    N^T N n = N^T I,        (9.72)

or

    n = (N^T N)^{−1} N^T I.        (9.73)

(In case you have not recognized it yet, the pseudo-inverse just appeared.) That was a lot of work. Let's see if there is an easier way. Go back to Eq. (9.66), again omitting the r_0 for clarity, and multiply both sides by N^T:

    N^T I = N^T N n.        (9.74)
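A minimal sketch (not from the text) of photometric stereo with k > 3 light sources, solving the least squares problem of Eq. (9.69) via the pseudo-inverse of Eq. (9.73); the light directions and intensities below are synthetic, and self-shadowing (negative dot products) is ignored for simplicity:

import numpy as np

rng = np.random.default_rng(1)
true_n = np.array([0.2, -0.3, 0.93])
true_n /= np.linalg.norm(true_n)

# k = 6 light-source directions (unit vectors), one per row of N, generalizing Eq. (9.65) to k rows.
N = rng.normal(size=(6, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)

I = N @ true_n + 0.01 * rng.normal(size=6)     # noisy Lambertian measurements, r_0 = 1

n_hat = np.linalg.pinv(N) @ I                  # pseudo-inverse solution (N^T N)^(-1) N^T I
n_hat /= np.linalg.norm(n_hat)                 # renormalize; with unknown albedo, |n_hat| estimates r_0
print(true_n, n_hat)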
Fig. 9.20. A given measured brightness pair can be created by a locus of p, q values in both SE and BSE images. (The authors are grateful to B. Karacali for this figure.)
pair and the vector which is normal to the surface z(x, y), and write that difference as

    d_i(x, y) = ‖ (∂z/∂x, ∂z/∂y) − (p_i, q_i) ‖.

Finally, assuming that the two curves of Fig. 9.20 intersect at m points (we make m a function of x, y to remind the reader that all this is being done for a single x, y point), we define an objective function as

    E = Σ_{x,y} [ Σ_{i=1}^{m(x,y)} (d_i(x, y))^{−1} ]^{−1} + λR,        (9.75)
9A.2.3
Structured illumination
In section 4.2.2, the basic concepts of structured illumination were introduced. The key point
is that by controlling the lighting, one or more of the unknowns in the stereopsis problem
may be eliminated. Let's look at an example in more detail to see how this might work.
The problem to be solved is an application in robot vision: A robot is to pick up shiny,
metallic turbine blades from a rack and place them into a machine for further processing. In
order to locate the blades, a horizontal slit of light is projected onto the scene, by passing a
laser beam through a cylindrical lens. The geometry of the resulting images is illustrated in
Fig. 9.21. The presence of a blade translates the location of the light stripe vertically in the image.
Fig. 9.21 [9.57]. If there were no blade in the image, the light stripe from the laser would form a horizontal line in the image as it is reflected off the cart. The presence of the blade causes a vertical translation of the light stripe. The number of lines of vertical displacement is directly proportional to an angular difference which produces the angle φ. Knowing the two angles and the distance h between the camera and the projector allows a simple calculation of the distance z:

    z = h tan θ / (tan θ + tan φ).        (9.76)
Although this relationship is relatively simple, it turns out to be both simpler and more
accurate to simply keep a lookup table of z vs. row displacement.
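A minimal sketch (not from the text) of Eq. (9.76) together with the lookup-table idea the text recommends; the angles, baseline, and calibration pairs below are invented for illustration:

import numpy as np

def depth_from_angles(theta, phi, h):
    # Triangulation of Eq. (9.76): baseline h between camera and projector, angles in radians.
    return h * np.tan(theta) / (np.tan(theta) + np.tan(phi))

# In practice the text suggests a lookup table of z versus row displacement, built by calibration.
calib_rows = np.array([0.0, 5.0, 10.0, 15.0, 20.0])     # measured stripe displacement (rows)
calib_z = np.array([0.0, 12.0, 23.0, 33.0, 42.0])        # corresponding known depths (mm)
z_of_row = lambda rows: np.interp(rows, calib_rows, calib_z)

print(depth_from_angles(np.radians(40), np.radians(35), h=100.0))
print(z_of_row(7.5))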
One practical problem which arises in this application is the specular nature of the reflections from the turbine blades; the bright spots may be orders of magnitude brighter than the rest of the image. This is dealt with by passing the beam through a polarizing filter. By placing another such filter on the camera lens, the specular spots are considerably reduced in magnitude.
In this example of the use of structured illumination, only one light stripe was projected at a time, so there was no ambiguity about which bright spot resulted from which projector. However, in the more general case, one may have multiple light sources, and then some method for disambiguation is required [9.7, 9.50].
9A.2.5
9A.3
If an object imaged as f_1(x) moves by a small amount δ, producing a second image f_2(x), then

    f_2(x) = f_1(x + δ),        (9.77)

and expanding in a Taylor series and keeping only the first-order term,

    f_2(x) ≈ f_1(x) + δ (∂f_1/∂x),        (9.78)

so

    δ ≈ ( f_2(x) − f_1(x) ) / (∂f_1/∂x).        (9.79)
There is a serious problem here. What happens if the gradient is zero? The gradient being zero is really saying that there is no information at this point in the image. Imagine you are looking through a telescope at a semi-tractor-trailer truck which is passing by, and the field of view of your telescope allows you to see only a small area, a few square inches, of the truck. When the bumper passes, you have information, and you know motion has occurred. However, as the top of the trailer is passing, you see no changes for a relatively long time.
Another troublesome issue that arises in computing and applying optic flow is that in 2D, Eq. (9.79) becomes a differential equation (see [9.10]). To deal with these problems, researchers in optic flow have tried various ways to combine information from local measurements to infer global knowledge, such as using clustering to identify sets of points which are moving together [9.41]. As discussed in Chapter 4, one can match either points or boundaries. For example, Quan [9.61] matches conics; Taylor and Kriegman [9.85] match line segments, as does Zhang [9.99]; Smith and Nandha Kumar [4.35] match textures. Motion segmentation is discussed in [4.35, 9.3].
In two dimensions, the δ of Eq. (9.77) becomes a vector. Optic flow algorithms generate a disparity field, a vector field which associates a vector with each pixel [9.20]. Efficient implementations for optic flow have been investigated [10.19, 17.13, 17.14], including objects which deform [17.60].
The optic flow at a point in the image plane of a moving camera is

    u_i = (1/z_i) A_i T + B_i Ω,        (9.80)
where

    A_i = [ −f   0   x_i
            0   −f   y_i ]

and

    B_i = [ x_i y_i     −(1 + x_i²)   y_i
            1 + y_i²    −x_i y_i      −x_i ]        (9.81)
are completely determined by the camera. T is the translational velocity of the camera and Ω is the rotational velocity of the camera. z_i is the depth to the image point at camera coordinates (x_i, y_i). The problem of determining T and Ω has been addressed by a number of researchers [9.36, 9.37]. Earnshaw and Blostein [9.15] compare these methods and introduce a new variation.
Motion can be estimated from smear. Chen et al. [9.9] make the observation that: "Psychophysical investigation has shown that the human visual system (HVS) integrates retinal images for 120 ms. Due to the integration, motion smear is inevitable. It has been reported that when the HVS is presented with an image blurred due to motion, the amount of smear perceived by human observers increases with observing duration at short durations (up to 20 ms). At longer durations, however, the perceived image becomes sharper. It is our conjecture that the HVS is performing a deblurring, or sharpening, function on the image." This observation leads to a motion-from-smear algorithm [9.9].
Recently, a gradual shift in focus has been observed [9.5] in the motion analysis literature, from analyzing the image or camera motion to labeling the action taking place: from "How are things moving?" to "What is happening?"
Shape from motion
The discipline known as shape from motion usually assumes a solution to the correspondence problem: that certain points are identified and corresponded in each frame. Here, we describe the approach to shape from motion presented by Kanade and co-workers [9.60, 9.86]. This description is based on the unrealistic assumption that the points in the image are projected orthographically (perpendicular to the image plane). We use this derivation because it is simpler to understand. However, if you need to implement this, refer to [9.60] for the modifications required for perspective projection. See also Soatto and Perona [9.72, 9.73] for a recent general study.
In the following derivation, the object is stationary; only the camera is moving. A point at spatial coordinates s_p is projected onto the image plane at frame f at coordinates (u_fp, v_fp). The camera moves, and at each frame the camera is located at t_f and has orientation described by the three vectors (i_f, j_f, k_f). If we knew s_p, we would know the camera coordinates of the image of p, from

    u_fp = i_f^T (s_p − t_f),    v_fp = j_f^T (s_p − t_f).        (9.82)
Defining

x_f = −t_f^T i_f,    y_f = −t_f^T j_f,    (9.83)

we can rewrite Eqs. (9.82) as

u_fp = i_f^T s_p + x_f,    v_fp = j_f^T s_p + y_f.    (9.84)
Now we have an inverse problem: We know the image coordinates, and from these we must
determine the spatial position of both the camera and the object points.
Accumulate all the observations of the image points in a matrix:
W = [ u_11  . . .  u_1P
      . . .  . . .  . . .
      u_F1  . . .  u_FP
      v_11  . . .  v_1P
      . . .  . . .  . . .
      v_F1  . . .  v_FP ],    (9.85)
given that there are F frames and P points. We observe that each column in this matrix is
a listing of the image coordinates of a particular point as it is viewed in different frames,
whereas each row is a listing of the locations of all the points in a particular frame. Next, we
define a matrix M which is a 2F × 3 matrix whose rows are the i_f and j_f vectors, and a
matrix S which is a 3 × P shape matrix whose columns are the s_p vectors. Finally, define
T to be a 2F-dimensional translation vector with elements x_f and y_f. With these definitions,
we can rewrite Eq. (9.84) for all the points as
W = M S + T 1_P,    (9.86)

where 1_P is a vector of length P, composed of all ones. (Observe that the product of T with
the ROW vector of all ones, an outer product, constructs a matrix of dimension 2F × P.)

Is this parenthetical remark correct?
If we move the origin to the center of gravity of the object (why not? location of the origin
is arbitrary), we obtain
C = (1/P) Σ_{p=1}^{P} s_p = 0.    (9.87)
This location of the origin allows us to solve for T immediately, by observing that since the
sum of any row of S is zero, the sum of any row of W is simply P T, and each row of T can be
determined as the corresponding row of W divided by P. Now, subtract T from W, producing
a new matrix W̃ which satisfies

W̃ = M S.    (9.88)
Using singular value decomposition, we can find a suitable decomposition of W̃, which we
denote by W̃ = M̂ Ŝ. Unfortunately, these are not necessarily the M and S we want, since we
could put any product A A⁻¹ between them without changing the value of the product. We
thus search for a matrix A such that

M = M̂ A
S = A⁻¹ Ŝ.    (9.89)
To find A, we can make use of the fact that the rows of M are the direction vectors of the
camera and are hence orthonormal. With these additional constraints, A is determined, and
we know the positions in 3-space of all the P points, as well as knowing the camera angles at
each frame time.
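The registration and factorization steps just described map directly onto a few lines of linear algebra. The sketch below, using numpy's singular value decomposition, stops at the affine factorization of Eq. (9.88); the metric-upgrade matrix A of Eq. (9.89) is only indicated, since computing it requires assembling the orthonormality constraints described in [9.60]. The names and array layouts are illustrative, not from the book's software.

import numpy as np

def factor_shape_and_motion(W):
    # W is the 2F x P measurement matrix of Eq. (9.85).
    # Origin at the centroid (Eq. 9.87): each element of T is the mean of a row of W.
    T = W.mean(axis=1)
    W_tilde = W - T[:, None]
    # Rank-3 factorization of Eq. (9.88); noise makes W_tilde full rank, so truncate.
    U, s, Vt = np.linalg.svd(W_tilde, full_matrices=False)
    M_hat = U[:, :3] * np.sqrt(s[:3])          # 2F x 3, candidate motion matrix
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]   # 3 x P, candidate shape matrix
    # M_hat and S_hat are determined only up to an invertible 3 x 3 matrix A (Eq. 9.89);
    # A is fixed by requiring the rows of M = M_hat A to be the orthonormal axes i_f, j_f.
    return T, M_hat, S_hat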
9A.4 Vocabulary
You should know the meanings of the following terms.
Irradiance
Optic flow
Perspective
Photometric stereo
Pseudo-inverse
Reflectivity
Shape from shading
Structured illumination
Assignment
9.1
For each feature described in section 9.3, determine if that
feature is invariant to: (1) rotation in the viewing plane;
(2) translation in that plane; (3) rotation out of that plane
(an affine transform if the object is planar); and (4) zoom.
Assignment
9.2
Let the Euclidian distance between two points be denoted by
the operator d(P1 ,P2 ) (you may want to use this in the next
problem). Design a monotonic metric R(P1 ,P2 ) which maps all
distances to be between 0 and 1. That is, if d(P1 ,P2 ) = ∞,
then R(P1 ,P2 ) = 1 and if d(P1 ,P2 ) = 0 then R(P1 ,P2 ) = 0.
For the metric you developed, show how you would prove your
measure is a formal metric. Just set up the problem. Extra
credit if you actually do the proof.
Assignment
9.3
Following are five points on the boundary of a region. (1,1),
(2,1), (2,2), (2,4), (3,2). Use eigenvector methods to fit a
straight line to this set of points, thus finding the principal axes of the region. Having found the principal axes, then
estimate the aspect ratio of the region.
Assignment
9.4
Write the chain code for the figure below.
Assignment
9.5
Discuss the following postulate: Let P1 and P2 be the two extrema of a region which determine the diameter. Then P1 and
P2 are on the boundary of the region.
Assignment
9.6
Starting with Eq. (9.19), prove Eq. (9.20).
Assignment
9.7
What is the difference between the intensity axis of symmetry
and the medial axis?
Assignment
9.8
In Table 9.1, prove that the invariant moment φ1 is invariant
to zoom.
Assignment
9.9
Your instructor will specify an image containing a single
region with unity brightness and zero background.
(1) Compute the seven invariant moments of the foreground
region.
(2) Rotate the foreground region about its center of gravity through ten, twenty and forty degrees, and compute
the invariant moments of the resulting image. What do you
conclude?
Assignment
9.10
Prove that Eq. (9.35) is invariant to: (1) translation;
(2) rotation; and (3) zoom.
Assignment
9.11
Is the caption on Fig. 9.12 correct?
Assignment
9.12
Two silhouettes, A and B, are measured and their boundaries
encoded. Then, Fourier descriptors are computed. The descriptors are as in Table 9.3.
It is possible that these two objects represent similarity transforms of one another. (A similarity transform is
equivalent to a rigid body motion, translation or rotation
only.) Could they be affine transformations of one another?
(An affine transformation is a linear transformation which
includes not only rigid body motion, but also the possibility
for scaling of the coordinate axes. If both axes (in 2D) are
scaled by the same amount, you get zoom. If they are scaled
by different amounts, you get shear.)

Table 9.3 (values shown for Object B): 5.00 + i0.00, 4.2 + i1.87,
3.86 + i1.00, 2.95 + i2.05, 3.19 + i1.47, 5.83 + i1.80,
3.69 + i2.57, 3.48 + i2.00, 2.30 + i2.77, 2.70 + i2.24

If you decide that these two sets of descriptors represent
the same shape, possibly transformed, describe and justify
what type of operations convert A into B. If they are not the
same shape, explain why.
Assignment
9.13
A cylinder with unit radius and height of ten is oriented
vertically about the origin, and is known to have a surface which is a Lambertian reflector. That is, the reflected
brightness is independent of the angle of observation, and
depends on the angle of incidence following the relationship
f = aI cos θi, where a is the albedo and I is the brightness of
the source. On this cylinder, the albedo is constant.
The camera is located at x = 0, y = 2, z = 2, and the optical
axis of the camera is pointed at the origin.
[Figure: the cylinder and camera geometry, with x and y axes indicated and the annotation "Max brightness observed somewhere along here."]
Bibliography
[9.1] K. Arbter, W. Snyder, H. Burkhardt, and G. Hirzinger, Application of Affine-invariant Fourier Descriptors to Recognition of 3-D Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp. 640–647, 1990.
[9.2] D. Ballard and C. Brown, Computer Vision, Englewood Cliffs, NJ, Prentice-Hall, 1982.
[9.3] M. Bichsel, Segmenting Simply Connected Moving Objects in a Static Scene, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11), pp. 1138–1142, 1994.
[9.4] T. Binford, Visual Perception by Computer, IEEE Conference on Systems and Control, Miami, December, 1971.
[9.5] A. Bobick and J. Davis, The Recognition of Human Movement Using Temporal Templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), pp. 257–267, 2001.
[9.6] A. Califano and R. Mohan, Multidimensional Indexing for Recognizing Visual Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), pp. 373–392, 1994.
[9.7] D. Caspi, N. Kiryati, and J. Shamir, Range Imaging with Adaptive Color Structured Light, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), pp. 470–480, 1998.
[9.8] C. Chen, T. Huang, and M. Arrott, Modeling, Analysis, and Visualization of Left Ventricle Shape and Motion by Hierarchical Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4), pp. 342–356, 1994.
[9.9] W. Chen, N. Nandhakumar, and W. Martin, Image Motion Estimation from Motion Smear: A New Computational Model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), pp. 412–425, 1996.
[9.10] A. Chhabra and T. Grogan, On Poisson Solvers and Semi-direct Methods for Computing Area Based Optic Flow, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11), pp. 1133–1138, 1994.
[9.11] K. Cho and S. Dunn, Learning Shape Classes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), pp. 882–888, 1994.
[9.12] J. Chuang, C. Tsai, and M. Ko, Skeletonization of Three-dimensional Object using Generalized Potential Field, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), pp. 1241–1251, 2000.
[9.13] M. Clerc, Texture Gradient, http://www.masterworksneart.com/inventory/vas originals.htm.
[9.29] D. Heeger and A. Jepson, Subspace Methods for Recovering Rigid Motion I: Algorithm and Implementation, International Journal of Computer Vision, 7(2), pp. 95–117, 1992.
[9.30] D. Helman and J. JaJa, Efficient Image Processing Algorithms on the Scan Line Array Processor, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), pp. 47–56, 1995.
[9.31] A. Hoover, D. Goldgof, and K. Bowyer, Extracting a Valid Boundary Representation from a Segmented Range Image, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(9), pp. 920–924, 1995.
[9.32] B.K.P. Horn, Robot Vision, Cambridge, MA, MIT Press, 1986.
[9.33] M. Hu, Visual Pattern Recognition by Moment Invariants, IRE Transactions on Information Theory, 8, pp. 179–187, 1962.
[9.34] X. Hu and N. Ahuja, Matching Point Features with Ordered Geometric, Rigidity, and Disparity Constraints, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10), pp. 1041–1049, 1994.
[9.35] Y. Iwahori, R. Woodham, and A. Bagheri, Principal Components Analysis and Neural Network Implementation of Photometric Stereo, Proceedings IEEE Conference on Physics-Based Modeling in Computer Vision, June, 1995, pp. 117–125, 1995.
[9.36] A. Jepson and D. Heeger, Linear Subspace Methods for Recovering Translational Direction, In Spatial Vision in Humans and Robots, ed. L. Harris and M. Jenkin, Cambridge, Cambridge University Press, 1993.
[9.37] K. Kanatani, Unbiased Estimation and Statistical Analysis of 3-D Rigid Motion from Two Views, IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(1), pp. 37–50, 1993.
[9.38] K. Kanatani, Comments on Symmetry as a Continuous Feature, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), pp. 246–247, 1997.
[9.39] H. Kauppinen, T. Seppanen, and M. Pietikainen, An Experimental Comparison of Autoregressive and Fourier-based Descriptors in 2D Shape Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 201–207, February, 1995.
[9.40] R. Kimmel, A. Amir, and A. Bruckstein, Finding Shortest Paths on Surfaces Using Level Sets Propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), pp. 635–640, 1995.
[9.41] D. Kottke and Y. Sun, Motion Estimation via Cluster Matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11), pp. 1128–1132, 1994.
[9.42] A. Laurentini, The Visual Hull Concept for Silhouette-based Image Understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), pp. 150–162, 1994.
[9.43] A. Laurentini, How Far 3D Shapes can be Understood from 2D Silhouettes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2), pp. 188–195, 1995.
[9.44] S. Lavallee and R. Szeliski, Recovering the Position and Orientation for Free-form Objects from Image Contours Using 3D Distance Maps, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4), pp. 378–390, 1995.
[9.61] L. Quan, Conic Reconstruction and Correspondence from Two Views, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2), 1996.
[9.62] I. Rothe, H. Susse, and K. Voss, The Method of Normalization to Determine Invariants, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 1996.
[9.63] G. Sandini and V. Tagliasco, An Anthropomorphic Retina-like Structure for Scene Analysis, Computer Graphics and Image Processing, 14, pp. 365–372, 1980.
[9.64] H. Schultz, Retrieving Shape Information from Multiple Images of a Specular Surface, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2), pp. 195–201, 1994.
[9.65] E. Schwartz, Computational Anatomy and Functional Architecture of Striate Cortex, Spatial Mapping Approach to Perceptual Coding, Vision Research, 20, pp. 645–669, 1980.
[9.66] J. Sethian, Curvature and Evolution of Fronts, Communications in Mathematical Physics, 101, pp. 487–499, 1985.
[9.67] J. Sethian, Numerical Algorithms for Propagating Interfaces: Hamilton–Jacobi Equations and Conservation Laws, Journal of Differential Geometry, 31, pp. 131–161, 1990.
[9.68] M. Shamos, Geometric Complexity, 7th Annual ACM Symposium on Theory of Computing, May, 1975, Albuquerque, NM, pp. 224–233, 1975.
[9.69] D. Sinclair and A. Blake, Isoperimetric Normalization of Planar Curves, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8), pp. 769–777, 1994.
[9.70] S. Smith and J. Brady, ASSET-2: Real-time Motion Segmentation and Shape Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[9.71] W. Snyder and I. Tang, Finding the Extrema of a Region, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, pp. 266–269, 1980.
[9.72] S. Soatto and P. Perona, Reducing Structure from Motion: A General Framework for Dynamic Vision. 1. Modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(9), pp. 933–942, 1998.
[9.73] S. Soatto and P. Perona, Reducing Structure from Motion: A General Framework for Dynamic Vision. 2. Implementation and Experimental Assessment, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(9), pp. 943–960, 1998.
[9.74] B. Soroka, Generalized Cylinders from Parallel Slices, Proceedings of the Conference on Pattern Recognition and Image Processing, 1979.
[9.75] B. Soroka and R. Bajcsy, Generalized Cylinders from Serial Sections, 3rd International Joint Conference on Pattern Recognition, November, Coronado, CA, 1976.
[9.76] M. Soucy and D. Laurendeau, A General Surface Approach to the Integration of a Set of Range Views, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4), pp. 344–358, 1995.
[9.77] J. Stone and S. Isard, Adaptive Scale Filtering: A General Method for Obtaining Shape from Texture, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), pp. 713–718, 1995.
10
Consistent labeling
The single most challenging problem in all of computer vision is the local/global
inference problem. As in the fable of the blind men and the elephant, the computer
must, from a set of local measurements, infer the global properties of what is being
observed. In other words, the next level of the machine vision problem is to interpret
the global scene (which is composed of individual objects) using local information
about each object obtained from segmentation and shape analysis as we have discussed in Chapters 8 and 9. One way to approach the local/global inference problem
is to introduce the concept of consistency.
10.1 Consistency
Let's begin with some notation: Define a set of objects {x_1, x_2, . . . , x_n}, and a set
of labels for those objects {λ_1, λ_2, . . . , λ_k}, which we assume for now are mutually
exclusive (each object may have only one label) and collectively exhaustive (each
object has a label). Denote a labeling as the ordered pair (x_i, λ_j). By this notation,
we mean that object i has been assigned label j.
As an example of consistent labeling, we will consider the problem of labeling
objects in a line drawing. Researchers have been interested in analysis of line drawings since the beginnings of work in machine vision for three reasons: First, humans
can obviously look at line drawings and make interpretations with ease. Second, psychological experiments [10.1, 10.6, 10.10] have convincingly demonstrated that it is
the points where brightness changes rapidly that convey the most information, and it
is relatively easy to convert edges to lines. Third, a line drawing is a dramatic reduction in the amount of data (which is not the same as information) in an image, and
perhaps, just perhaps, learning how to process line drawings would make analysis algorithms run faster. Most of the groundwork in line drawing analysis was done in the
late 1960s and 1970s [10.5, 10.8]; however, progress continues [10.17] to be made.
Fig. 10.1. A line drawing illustrating convex, concave, and occluding lines, and a labeling of that
drawing.
The compatibility function r(.) will have differing interpretations through this chapter, depending on the specific labeling algorithm.

Check it out. Did we do the math right? n objects, with each object labeling consistent with the labeling of all other objects.
We will allow each line in such a drawing to be labeled as either convex (an edge
in three dimensions which points toward the observer, such as the corner of a desk),
concave (an edge in three dimensions which points away from the observer, such as
the joint between the wall and the floor of a room), or occluding (the edge occurs
because one surface is partially hidden by another surface). For example, consider the
drawing in Fig. 10.1. In that drawing, lines resulting from convex edges are labeled
with a plus sign, concave edges are labeled with a minus sign, and occluding edges
are labeled with an arrow. The arrow points in the direction such that the occluded
surface is on the left if one moves in the direction of the arrow. You may note that
not all the lines in this drawing have been labeled. That is deliberate. Thus we have
one type of object, lines, and three types of labels for those lines. Our mission is to
learn how to get a computer to do automatically what human beings did so easily
when we interpreted the lines in that figure.
Before we can do that, we need to address one of those unfortunate ambiguities
that natural language introduces into discussions like this. The term labeling may
have two meanings. It may refer to the label assigned to a single object, to a pair
of objects, or to an entire scene. We will try to make sure that the meaning is clear
from context in this discussion. In order to accomplish the objective of labeling a
scene, we must consider something we call consistency. The simultaneous labeling
of two objects is represented by some function which we will call compatibility
and denote by r(i, λ, j, λ′). This function is defined to have a value of +1 if the
two labelings can exist together (mutually compatible) and −1 if they cannot. For
example, r(i, +, i, −) = −1, since the object i cannot be both concave and convex
(consider the drawing in Fig. 10.2). A labeling of an image is said to be complete
and consistent when

Σ_i Σ_{j≠i} r(i, λ, j, λ′) = n(n − 1),

where any label values at all are allowed. In this chapter, we will utilize several different realizations of this
compatibility function.

Fig. 10.3. All the physically possible Y vertices.
Let's look at the line labeling in more detail to see how this works. Although we
are interested in labeling lines, it will turn out to be useful to think about vertices
as well. A vertex is where lines meet. If each line can have four labels (concave,
convex, arrow in, arrow out), there should be 4³ ways to label a vertex with three
lines meeting. It turns out, however, that not all of those combinations are physically
possible. In Figs. 10.3 to 10.5, we illustrate all the Y, ELL, and arrow vertices
which are physically possible. There are a variety of ways we could make use of this
information. One way is to use depth-first search. The algorithm is as follows:
(1) Choose a starting vertex (call it vertex 1) and label all the lines coming in to it
in a physically possible way.
(2) Choose an adjacent vertex (call it vertex 2) and label all lines coming in to it
in a physically possible way, such that the labeling is consistent with previous
labeling. That is, the line from vertex 1 to 2 can only have one label.
(3) If no consistent labeling is possible, back up.
In Fig. 10.7, we illustrate the labeling process of the 3D object shown in Fig. 10.6,
beginning with a choice of one possible labeling for the lines of vertex 1. Given the
labeling of vertex 1, we can choose any of a set of labelings for the lines of vertex 2,
but all of them must have a + sign on the line between 1 and 2, as illustrated on the
second line of the figure. Now, choose one of those labelings of vertex 2 (let's pick
the one on the left), and label vertex 3 in such a way that it is consistent with the
labeling of both 1 and 2. Now, we need to assume a correct interpretation of vertex
3 in order to label vertex 4. Again, choose the one on the left. In order to label the
lines coming into vertex 4, we must choose a labeling which is consistent with the
(assumed correct) labeling of vertex 3 and vertex 1. Since two of the lines coming in
to vertex 4 are now determined, there is only one way to label vertex 4 consistently,
and that is with an arrow coming in on the third line.
Suppose we had reached the labeling of vertex 4, and found there were no physically possible labelings. Then clearly one of the earlier assumptions was incorrect.
We now back up, and choose a new labeling for vertex 3. Suppose we run through
all the possible labelings of the lines of vertex 3 and from none of them can we
find a consistent labeling of 4; then we back up again and choose a new labeling of
vertex 2. We follow this approach until we either find a consistent labeling of the
entire object, or we fail to find one. If we fail, then the object cannot be consistently
labeled [10.7].
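The backtracking search just described can be written compactly as a recursive depth-first search. In the sketch below the catalogue of physically possible junction labelings (Figs. 10.3–10.5) is abstracted into a function possible_labelings(v), which returns candidate {line: label} assignments for vertex v; that function and the data layout are illustrative assumptions, not the book's software.

def label_drawing(vertices, possible_labelings):
    # vertices: the vertex identifiers, in the order in which they will be visited.
    # Returns a consistent {line: label} assignment, or None if no such labeling exists.
    def consistent(candidate, labeling):
        # A candidate vertex labeling must agree with every line already labeled.
        return all(labeling.get(line, lab) == lab for line, lab in candidate.items())

    def search(k, labeling):
        if k == len(vertices):
            return labeling
        for candidate in possible_labelings(vertices[k]):
            if consistent(candidate, labeling):
                result = search(k + 1, {**labeling, **candidate})
                if result is not None:
                    return result
        return None                      # nothing works at this vertex: back up

    return search(0, {})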
Since line labeling was originally developed, many researchers have added enhancements. For example, Parodi and Piccioli [10.13] demonstrate that if vanishing
points can be determined, and the 3D coordinates of a single point are known, the
3D coordinates of all the other labeled points may be found.
10.2.1 Linear relaxation
The first way one might approach the idea of using consistency is to set up a linear
system that takes into account the initial probabilities and the consistencies. We
define the compatibility of object i having label λ with object j having label λ′
as r_L(i, λ, j, λ′) (the subscript L denotes that this is used in the linear relaxation
algorithm) and require that 0 ≤ r_L ≤ 1 and

Σ_{λ′} r_L(i, λ, j, λ′) = 1    for all i, j, λ.    (10.1)

The linear relaxation process iteratively updates the label weights following

p_i(λ) = Σ_j Σ_{λ′} r_L(i, λ, j, λ′) p_j(λ′).    (10.2)
10.2.2 Nonlinear relaxation
The nonlinear relaxation algorithm updates each label weight using a normalized update of the form

p_i^{k+1}(λ) = p_i^k(λ)[1 + q_i^k(λ)] / Σ_{λ′} p_i^k(λ′)[1 + q_i^k(λ′)].    (10.3)

Don't worry about the denominator; it is just there to make sure the values of p_i stay
within the range of 0–1. The term q_i(λ) represents a measure of how compatible
p_i(λ) is with the labeling of all the other objects. While p is strictly positive, q can
be either positive or negative. If negative, q_i(λ) suggests that the current labeling of
object i by λ is incompatible with most other labelings,
q_i^k(λ) = Σ_j C_ij [ Σ_{λ′} r_N(i, λ, j, λ′) p_j^k(λ′) ],    (10.4)
Details of a specific form for r are problem-dependent.
where this is a similar compatibility r(.) to the one we saw earlier in this chapter, but with a
subscript N to denote that we will use it in nonlinear relaxation. The only difference
is that now we allow r to take on values of not only +1 and −1, but all values in
between. If two labelings are completely consistent, their compatibility is said to be
+1. If the labelings are totally inconsistent, the compatibility is −1. If the labeling
of one simply does not affect the other, the compatibility is 0. Let's examine what
Eq. (10.4) means.
The CHANGE in our confidence that object i has label λ is simply the sum of how
compatible that label for object i is with all the other labelings currently. Notice that
the compatibility is multiplied by the confidence that the other labeling is correct.
That is, if we have little confidence in a labeling of object j, we do not really care
how much it is compatible with a labeling of object i. Finally, C_ij is there for
convenience. It simply weights the influence that object j has on object i, without
regard to labels. It might be zero, for example, if objects i and j have no influence on
each other, and we know that ahead of time. C_ij is optional; one could just as well
incorporate it into the compatibility function.
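One iteration of the nonlinear relaxation update is sketched below, with the label weights p stored as an n × k array, the compatibilities r_N as an n × k × n × k array, and the weights C_ij as an n × n array; the array shapes and names are illustrative assumptions.

import numpy as np

def relaxation_step(p, r, C):
    # q_i(lambda) = sum_j C_ij sum_lambda' r_N(i,lambda,j,lambda') p_j(lambda')   (Eq. 10.4)
    q = np.einsum('ij,iajb,jb->ia', C, r, p)
    # p_i(lambda) <- p_i(lambda) (1 + q_i(lambda)), then renormalize over labels   (Eq. 10.3)
    p_new = np.clip(p * (1.0 + q), 0.0, None)     # clip guards against tiny negative values
    return p_new / p_new.sum(axis=1, keepdims=True)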
Assume you have done a segmentation of a range image into planes. Now, you wish
to find which of a set of models best matches the object being observed. Assume
the segmentation produces patches which are planar, and since the image is a range
image, we can compute the orientation of those planes in 3-space. The problem now
is to find the set of planar surfaces in the model (or collection of models) which best
matches the set of planar surfaces in the image. Patches in the image are the objects;
regions in the model(s) are the labels. One way to define compatibility of labeling
is as follows. Consider the compatibility of labeling image patch A as model region
λ1 with B as λ2. Let us now consider: Does patch A border patch B? And does region
λ1 border region λ2? There are four possibilities, illustrated in Fig. 10.8. If the two
regions in the image are adjacent, and the two regions in the model are also adjacent,
then we define the compatibility of the two labelings as

r_N(A, λ1, B, λ2) = cos(θ_AB − θ_{12}),    (10.5)

where θ_AB is the angle between the orientations of patches A and B, and θ_{12} is the angle between model regions λ1 and λ2.

Fig. 10.8. Four possibilities in the definition of compatibility of labeling between two patches
and the corresponding two models.
Let us assume that we are tracking four-wheeled vehicles as they move. Our camera
can only record the position of the tires (it is a pretty weird camera). Our objective
then is to determine which tires in one image correspond to which tires in the next
image. This is illustrated in Fig. 10.9, in which the location of the tires in frame n
are denoted by open circles and the locations in frame n + 1 by closed circles. In
this application, we have another labeling task. The objects (wheels) in frame n are
to be labeled with the labels (also wheels) in frame n + 1. So what shall we use as a
compatibility function? To figure that out, let's think about some incorrect labelings.
For example, Fig. 10.10 illustrates the labeling with an arrow, from object to label.
See if you think this makes sense.
If Fig. 10.10 is correct, then the front left tire went to the front right, and the
rear tires did the same. We can only interpret that as the car having flipped over, which,
while possible, we certainly hope did not actually happen. A more reasonable
interpretation is in Fig. 10.11, which shows the arrows as almost parallel! Ah, that's
it! We can use the cosine of the angle between the arrows,

r_N(i, m, j, p) = cos(θ(i, m) − θ(j, p)),    (10.6)

where i and j are wheels in frame n, m and p are wheels in frame n + 1, and θ(i, m)
denotes the direction of the vector from wheel i to wheel m. Equation (10.6) measures
the consistency of assuming that wheel i is wheel m and wheel j is wheel p. Although
included in a notes version of this book several years earlier, this concept was
published in 1995 by Wu [10.19].

Fig. 10.11. A labeling which makes more sense.
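A direct transcription of this compatibility is given below; the point coordinates in the example are made up purely for illustration.

import numpy as np

def correspondence_compatibility(p_i, p_m, p_j, p_p):
    # r_N of Eq. (10.6): cosine of the angle between displacement i -> m and j -> p.
    v1 = np.asarray(p_m, float) - np.asarray(p_i, float)
    v2 = np.asarray(p_p, float) - np.asarray(p_j, float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Parallel displacements (a rigid, non-flipping motion) are maximally compatible:
print(correspondence_compatibility((0, 0), (1, 2), (3, 0), (4, 2)))   # prints 1.0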
10.3 Conclusion
Conjugate gradient.
This chapter is all about consistency. We hope to have convinced the student that
the best way to fuse information from diverse sources is to seek labelings which are
consistent.
Optimization is formally used only in the next section, where an optimization is set
up and solved using a numerical optimization technique called conjugate gradient.
The conjugate gradient technique is not explained in this chapter; the reader is
referred to standard texts in numerical methods. However, the technique is in many
ways similar to gradient descent, but runs much faster.
Researchers continue to work on improving the concepts of consistent labeling,
including relaxation labeling [10.4]. Relaxation or similar algorithms have been
used in such diverse applications as optical character recognition [10.12] and edge
detection [10.15]. See [10.9, 10.16] for underlying theory.
10.4 Vocabulary
You should know the meanings of the following terms.
Compatibility
Concave edge
Consistent
Convex edge
Labeling
Linear relaxation
Local/global inference
Nonlinear relaxation
Occluded
Relaxation labeling
Assignment 10.1
OK. You have seen two examples of compatibility functions. Now you have the opportunity to make up your
own. Here is the problem. You have applied an edge
detector to an image. At every pixel in the image, a
gradient has been taken and you know the magnitude and
direction of that gradient. Recognizing that some of
these measurements may be corrupted by noise and blur,
develop an application of relaxation labeling which
helps determine real edge pixels. Hint: A real edge
pixel would have its gradient vector point in the same
direction as neighboring edge pixels. Develop a compatibility function using this concept. Describe how to
use it. You may use pseudo-code or words or flowcharts,
or all three. Do not write actual software.
Topic 10A

Let A = (x_a, y_a, z_a), B = (x_b, y_b, z_b), and C = (x_c, y_c, z_c) be three points in the drawing, and let v_1 and v_2 be the vectors from B to A and from B to C:

v_1 = A − B,    v_2 = C − B.    (10.7)
Fig. 10.13. Left column: 2D line drawings. Right: 3D interpretation using the emulation method.
This algorithm can interpret a wide range of line drawings and seems to consistently generate
the same interpretations that a human does, even without any explicit models.
To simplify the problem, we square the objective function SDA and call it S. We would
like to find the third coordinate (z_i) of the points in the 2D picture which will minimize the
objective function S:

S = n Σ_i θ_i² − (Σ_i θ_i)².    (10.8)
Its derivative with respect to each unknown depth is

∂S/∂z_i = 2n Σ_j θ_j (∂θ_j/∂z_i) − 2 (Σ_j θ_j)(Σ_j ∂θ_j/∂z_i),    (10.9)

where the angle θ, formed by two vectors v_1 and v_2, can be computed by Eq. (10.10):

θ = cos⁻¹ ( v_1 · v_2 / (|v_1| |v_2|) )
  = cos⁻¹ ( [(x_a − x_b)(x_c − x_b) + (y_a − y_b)(y_c − y_b) + (z_a − z_b)(z_c − z_b)]
            / [ √((x_a − x_b)² + (y_a − y_b)² + (z_a − z_b)²) √((x_c − x_b)² + (y_c − y_b)² + (z_c − z_b)²) ] ).    (10.10)
Some 3D interpretation results of 2D line drawings are shown in Fig. 10.13. The optimization
problem is solved using conjugate gradient.
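A sketch of this minimization is shown below, assuming the reconstructed objective of Eqs. (10.8)–(10.10) and using a generic conjugate-gradient routine (scipy's minimize with method='CG'). The drawing is represented by fixed 2D vertex coordinates plus a list of (A, B, C) triples at which angles are measured; all of these names and the tiny example are illustrative rather than taken from the book's software.

import numpy as np
from scipy.optimize import minimize

def angles(z, xy, triples):
    # Angle of Eq. (10.10) at the middle point B of each (A, B, C) triple.
    pts = np.column_stack([xy, z])
    th = []
    for a, b, c in triples:
        v1, v2 = pts[a] - pts[b], pts[c] - pts[b]
        th.append(np.arccos(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return np.array(th)

def S(z, xy, triples):
    th = angles(z, xy, triples)
    return len(th) * (th ** 2).sum() - th.sum() ** 2       # Eq. (10.8)

xy = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])    # a 2D square
triples = [(3, 0, 1), (0, 1, 2), (1, 2, 3), (2, 3, 0)]     # its four corner angles
z0 = np.array([0.05, -0.02, 0.03, -0.04])                  # small initial depths
result = minimize(S, z0, args=(xy, triples), method='CG')
print(result.x)                                            # depths that minimize S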
Wang [10.18] has made several improvements on Marill's original algorithm, most of
which are less computationally intensive, including the use of the standard deviation of
segment magnitudes (DSM) as the objective function [10.3] and the application of gradient
descent to solve the minimization problem [10.2].
References
[10.1] F. Attneave, Some Informational Aspects of Visual Perception, Psychological Review, 61(3), 1954.
[10.2] L. Baird and P. S. Wang, 3D Object Perception Using Gradient Descent, International Journal of Mathematical Imaging and Vision, 5, pp. 111–117, 1995.
[10.3] E. W. Brown and P. S. Wang, Why We See Three-Dimensional Objects: Another Approach, http://www.ccs.neu.edu/home/feneric/msdsm.html.
[10.4] W. Christmas, J. Kittler, and M. Petrou, Structural Matching in Computer Vision using Probabilistic Relaxation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 1995.
[10.5] M. Clowes, On Seeing Things, Artificial Intelligence, 2(1), 1971.
[10.6] J. Elder and S. Zucker, Evidence for Boundary-specific Grouping, Vision Research, 38(1), 1998.
[10.7] D. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, New York, Basic Books, Inc., 1979.
[10.8] D. Huffman, Impossible Objects as Nonsense Sentences, in Machine Intelligence, vol. 6, ed. B. Meltzer and D. Michie, Edinburgh University Press, 1971.
[10.9] R. Hummel and S. Zucker, On the Foundations of Relaxation Labeling Processes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 1983.
[10.10] J. Koenderink, What Does the Occluding Contour Tell Us About Solid Shape? Perception, 13, pp. 321–330, 1984.
[10.11] T. Marill, Emulating the Human Interpretation of Line-drawings as Three-dimensional Objects, International Journal of Computer Vision, 6(2), pp. 147–161, 1991.
11
Parametric transforms
Supposing I was on the other side of the glass, wouldn't the orange still be in my right hand?
Lewis Carroll
This chapter discusses another approach to the solution of the local/global inference
problem, the use of parametric transformations. In this approach, we assume that the
object for which we are searching in the image may be described by a mathematical
expression, which in turn is represented by a set of parameters. For example, a
straight line may be written in slope–intercept form:
y = ax + b,
(11.1)
where a and b are the parameters describing the line. Our approach is as follows:
Given a set of points (or other features), all of which satisfy the same equation, we
will find the parameters of that equation. In a sense, this is the same as fitting a
curve to a set of points, but as we will discover, the parametric transform approach
allows us to find multiple curves, without knowing a priori which point belongs
to which curve. We begin this process by considering the special case of finding
straight lines.
Suppose you are tasked with the problem of finding the straight lines in the image
shown in Fig. 11.1. If only one straight line were present in the image, we could
use straight line fitting to determine the parameters of the curve. But we have two
line segments here. If we could segment this first, then we could fit each segment
separately; yes, this is a segmentation problem, but we are segmenting a boundary
into boundary segments rather than segmenting an image into regions. In this section,
we will learn how to do this.
First, let us prove an illustrative theorem.
Definition
Given a point in a d-space, and a parameterized expression defining a curve in that
space, the parametric transform of that point is the curve which results from treating
the point as a constant and the parameters as variables. For example, Eq. (11.1)
may be rewritten as

b = y − xa,    (11.2)

which is itself a straight line in the 2-space (a, b). Given the point x = 3, y = 5,
the parametric transform is b = 5 − 3a.
Theorem
If n points in a 2-space are collinear, all the parametric transforms corresponding to
those points, using the form b = y − xa, intersect at a common point in the space
(a, b).
Proof
Suppose n points {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} all satisfy the same equation

y = a_0 x + b_0.

Consider two of those points, (x_i, y_i) and (x_j, y_j). The parametric transforms of the
points are the curves (which happen to be straight lines)

y_i = x_i a + b    (11.3)
y_j = x_j a + b    (11.4)

which we rewrite to make clear the fact that a and b are the independent variables:

b = y_i − x_i a    (11.5)
b = y_j − x_j a.    (11.6)

Setting these equal and solving,

b = y_i − x_i (y_j − y_i)/(x_j − x_i),    (11.7)

and we have the a and b values where the two curves intersect. However, we also
know from Eq. (11.3) that all the x's and y's satisfy the same curve. By performing
that substitution into Eq. (11.7), we obtain

b = (a_0 x_i + b_0) − x_i [ (a_0 x_j + b_0) − (a_0 x_i + b_0) ] / (x_j − x_i)    (11.8)

which simplifies to

b = (a_0 x_i + b_0) − x_i a_0 = b_0.    (11.9)
Similarly,

a = (y_j − y_i)/(x_j − x_i) = [ (a_0 x_j + b_0) − (a_0 x_i + b_0) ] / (x_j − x_i) = a_0.    (11.10)
Thus, for any two points along the straight line parameterized by a0 and b0 , their
parametric transforms intersect at the point a = a0 , and b = b0 . Since the transforms
of any two points intersect at that one point, all such transforms intersect at that
common point. QED.
Review of concept: Each POINT in the image produces a CURVE (possibly straight)
in the parameter space. If the points all lie on a straight line in the image, the
corresponding curves will intersect at a common point in parameter space.
Got that? Now, on to the next problem.
11.1.1

Consider the parameterization

x cos θ + y sin θ = ρ.    (11.11)

Pick a value of ρ and θ. Hold those values constant. Then the set of points which satisfy Eq. (11.11) can be shown to be a straight line. There is a geometric interpretation
of this equation which is illustrated in Fig. 11.3.
This representation of a straight line has a number of advantages. Unlike the use
of the slope, both of these parameters are bounded: ρ can be no larger than the largest
diagonal of the image, and θ need be no larger than 2π. A line at any angle may be
represented without singularity.
The use of this parameterization of a straight line solves one of the problems which
confronts us, the possibility of infinite slopes. The other problem is the calculation
of intersections.

Fig. 11.3. In the ρ, θ representation of a line, ρ is the perpendicular distance of the line from the
origin, and θ is the angle made with the x axis. The figure also indicates the gradient direction at a point on the line.
11.1.2
Fig. 11.5. (a) An image with two line segments which have distinctly different slopes and
intercepts, but whose actual positions are corrupted significantly by noise. (b) The
corresponding Hough transform.
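An accumulator of the kind shown in Fig. 11.5(b) can be built in a few lines. The sketch below assumes edge points are available as (x, y) pairs and uses the ρ, θ parameterization of Eq. (11.11); the bin counts and names are illustrative.

import numpy as np

def hough_lines(points, rho_max, n_theta=180, n_rho=200):
    # Each point votes for every (theta, rho) pair satisfying x cos(theta) + y sin(theta) = rho.
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-rho_max, rho_max, n_rho)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)           # one rho per theta
        bins = np.clip(np.digitize(rho, rhos) - 1, 0, n_rho - 1)
        acc[np.arange(n_theta), bins] += 1
    return acc, thetas, rhos

# Points on the line y = x accumulate near theta = 135 degrees, rho = 0:
acc, thetas, rhos = hough_lines([(i, i) for i in range(20)], rho_max=30.0)
i, j = np.unravel_index(acc.argmax(), acc.shape)
print(np.degrees(thetas[i]), rhos[j])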
11.2.1
11.3.1

(x − h)² + (y − k)² = R²    (11.12)
11.3.2
Finding circles when the origin is unknown but the radius is known
Equation (11.12) describes an equation in which x and y are assumed to be variables,
and h, k, and R are assumed to be constants. As before, let us rewrite this equation
(x_i − h)² + (y_i − k)² = R².    (11.13)
In the space (h, k) what geometric shape does this describe? You guessed it, a circle.
Each point in image space (xi , yi ) produces a curve in parameter space, and if all those
points in image space belong to the same circle, where do the curves in parameter
space intersect? You should be able to figure that one out by now.
11.3.3

Now, what if R is also unknown? It is the same problem; however, instead of
allowing h to range over all possible values and computing k, we must now allow
both h and k to range over all values and compute R. We now have a three-dimensional
parameter space. Allowing two variables to vary and computing the third defines
a surface in this 3-space. What type of surface is this (an ellipse, a hyperboloid, a
cone, a paraboloid, a plane)?
(11.14)
Again, the accumulator array should be incremented over a neighborhood rather than
just a single accumulator, and it will prove efficacious to increment by an amount
proportional to M.
Fig. 11.8. The point xi , yi is proposed to lie on a circle of radius R. The gradient points in the
direction of the arrow. If the circle is known to be dark relative to the background, the
center can be found by taking a step of length R in the direction opposite to the
gradient. If the center/surround contrast is opposite, then the step should be taken in
the direction of the gradient. If the contrast is not known a priori, steps can be taken
(and accumulators incremented) in both directions.
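The voting rule of Fig. 11.8 is sketched below for the case of known radius and unknown contrast, so each edge point votes in both directions along its gradient; array sizes, names, and the example values are illustrative.

import numpy as np

def vote_circle_centers(edge_points, gradient_dirs, R, shape):
    # edge_points: (x, y) edge pixels; gradient_dirs: gradient angles (radians) at those pixels.
    acc = np.zeros(shape, dtype=int)
    for (x, y), phi in zip(edge_points, gradient_dirs):
        for sign in (+1, -1):                               # dark-on-light and light-on-dark
            h = int(round(x + sign * R * np.cos(phi)))
            k = int(round(y + sign * R * np.sin(phi)))
            if 0 <= h < shape[0] and 0 <= k < shape[1]:
                acc[h, k] += 1
    return acc

# Points on a circle of radius 5 about (20, 20), with radially directed gradients:
a = np.linspace(0, 2 * np.pi, 24, endpoint=False)
pts = [(20 + 5 * np.cos(t), 20 + 5 * np.sin(t)) for t in a]
acc = vote_circle_centers(pts, list(a), R=5, shape=(40, 40))
print(np.unravel_index(acc.argmax(), acc.shape))            # (20, 20)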
[Fig. 11.9: an arbitrary shape with boundary points P1, P2, and P3 marked.]
So far, we have assumed that the shape we are seeking can be represented by an
analytic function, representable by a set of parameters. The concepts we have been
using, of allowing data components which agree to "vote," can be extended to
generalized shapes. Initially we suppose we have an arbitrarily shaped region, and
suppose we know the orientation, shape, and zoom. Our first problem is to figure
out how to represent this object in a manner which is suitable for use by Hough-like
methods. The following is one such approach [11.2]:
First, define some reference point. The choice of a reference point is arbitrary,
but the center of gravity is convenient. Call that point O. For each point P_i on the
boundary, calculate both the gradient vector at that point and the vector OP_i from
the reference to the boundary point. Quantize the gradient direction into, say, n
values, and create a table with n rows. Each time a point P_j on the boundary has a
gradient direction with value G_i (i = 1, . . . , n), a new column is filled in on row i,
containing OP_j. Thus, the fact that multiple points on the boundary may have
identical gradient directions is accommodated by placing a separate column in the
table for each entry. In Fig. 11.9, a shape is shown and three entries in the R-table are
illustrated in Table 11.1.
To utilize such a shape representation to perform shape matching and location,
we use the following algorithm.
(1) Form an accumulator array, which will be used to hold candidate locations of
the reference point. Initialize the accumulator to zero.
(2) For each edge point, Pi , do the following.
(2.1) Compute gradient direction, and determine which row of the R-table corresponds to that direction.
(2.2) For each entry, j, on that row:
(a) compute the location of the candidate center by adding the stored
vector to the boundary point location: A = T [i, j] + Pi
(b) increment accumulator determined by A.
Table 11.1. Three entries in the R-table for the shape of Fig. 11.9 (partial residue): gradient-direction rows containing vectors such as (1, 0.1), (0.6, 1.1), and (1, 0.5); the remaining cells are empty.
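The R-table construction and the voting algorithm above are sketched below. In this sketch the stored vectors run from each boundary point to the reference point, so that adding a stored vector to an edge point casts a vote at the candidate reference location; the quantization scheme and names are illustrative.

import numpy as np
from collections import defaultdict

def quantize(angle, n_bins):
    return int((angle % (2 * np.pi)) / (2 * np.pi) * n_bins) % n_bins

def build_r_table(boundary, gradient_dirs, reference, n_bins=36):
    table = defaultdict(list)                 # row index = quantized gradient direction
    for (x, y), g in zip(boundary, gradient_dirs):
        table[quantize(g, n_bins)].append((reference[0] - x, reference[1] - y))
    return table

def vote_reference(edge_points, gradient_dirs, table, shape, n_bins=36):
    acc = np.zeros(shape, dtype=int)          # candidate reference-point locations
    for (x, y), g in zip(edge_points, gradient_dirs):
        for dx, dy in table.get(quantize(g, n_bins), []):
            h, k = int(round(x + dx)), int(round(y + dy))
            if 0 <= h < shape[0] and 0 <= k < shape[1]:
                acc[h, k] += 1
    return acc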
11.5 Conclusion
Accumulator arrays enforce consistency.
11.6 Vocabulary
You should know the meanings of the following terms.
Accumulator array
Generalized Hough transform
Hough transform
Parametric transform
Topic 11A Parametric transforms

11A.1 Finding parabolae
Wechsler and Sklansky [11.12] have developed one approach to the problem of finding
parabolic curves in images, as described below.
A parabola is the locus of points each of whose distance from a fixed point, the focus, is
equal to its distance from a fixed straight line, the directrix, as illustrated in Fig. 11.10.
x² = (x − 2a)² + y²    (11.15)

Fig. 11.10. A parabola.
or

y² = 4a(x − a).    (11.16)

Differentiate Eq. (11.16) with respect to x:

dy/dx = tan ψ = √( a / (x − a) ).    (11.17)

Solving for a,

a = x tan²ψ / (1 + tan²ψ).    (11.18)

The angle θ of the line from the focus to the point P satisfies

tan θ = y / (x − 2a) = √( 4a(x − a) ) / (x − 2a).    (11.19)

Substituting for a from Eq. (11.18),

tan θ = 2x tan ψ / ( x(1 − tan²ψ) )    (11.20)
      = 2 tan ψ / (1 − tan²ψ) = tan 2ψ,    (11.21)

so that

θ = 2ψ.    (11.22)

Since the point P lies at distance d from the focus, the focus is displaced from P by

Δx = d cos θ = d cos 2ψ    (11.23)
Δy = d sin θ = d sin 2ψ,    (11.24)

and

x_F = x_p + Δx    (11.25)
y_F = y_p + Δy.    (11.26)
Equation (11.16) assumes the focus lies at x = a. This technique will work if only one
parabola lies in the field of view, for then the position of the origin is arbitrary. In the more
general case, however, we must make the location of the origin explicit.
The derivation of Wechsler and Sklansky [11.12] can be generalized as follows to overcome
these difficulties.
Initially, we continue to assume the parabola is symmetric about a horizontal line,
but having an origin at an arbitrary point x0 , y0 . The parabolic equation becomes
(y − y_0)² = 4a(x − x_0).    (11.27)

Fig. 11.11. Parabola with arbitrary reference.

Differentiating as before,

dy/dx = tan ψ = √( a / (x − x_0) )    (11.28)

and

a = tan²ψ (x − x_0).    (11.29)

Substituting Eq. (11.29) into Eq. (11.27),

(y − y_0)² = 4 tan²ψ (x − x_0)²,    (11.30)

so that

y_0 = y − 2 tan ψ (x − x_0).    (11.31)
Equation (11.31) describes a straight line in the x0 , y0 parameter space. Thus, by making use
of the local derivative, the 3-parameter problem of Eq. (11.27) has been reduced to a much
more tractable two-dimensional problem.
It should be obvious that shapes other than circles and parabolae can be found using this
approach [11.3].
11A.2
Clustering is the process of finding natural groupings in data. In this chapter the obvious
application is to find the best estimate of the peak(s) of the accumulator; however, these ideas
are much broader in applicability than just finding the mode of a distribution. For example,
McLean and Kotturi [11.9] use clustering [11.5] to locate the vanishing points in an image.
Clustering is discussed in detail in Chapter 15.
11A.3
11A.4
For a pair of cameras separated by a baseline B, the disparity between corresponding pixels is related to the distance z of the imaged point by

d = BF/z,    (11.32)
where F is the focal distance of either of the cameras (which are assumed to have the same
focal length). The hard problem, of course, is to determine which pixels correspond to the
same point in 3-space. Suppose we extract a small window from the leftmost image and
use the sum of squared differences (SSD) to template match that small window along a
horizontal line in the other image. We could graph that objective function vs. the disparity,
or the inverse distance (1/z). We find it convenient to use the inverse distance and we will
typically find that the match function has multiple minima, as illustrated in Fig. 11.12. If
we take an image from a third or fourth camera, with a different baseline from camera 1,
we find similar nonconvex curves. However, all the curves will have minima at the same
point, the correct disparity. Immediately, we have a consistency! We form a new function,
the sum of such curves taken from multiple baseline pairs, and find that this new function
(called the SSD-in-inverse-distance) has a sharp minimum at the correct answer. Okutomi and
Kanade [11.10] have proven that this function always exhibits a clear minimum at the correct
matching position, and that the uncertainty of the measurement decreases as the number of
baseline pairs increases.
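A sketch of the SSD-in-inverse-distance computation is given below, assuming rectified cameras so that a candidate inverse distance 1/z maps to a horizontal disparity d = B·F·(1/z) for each baseline B (Eq. 11.32); the window indexing and names are illustrative, and no bounds checking is done in this sketch.

import numpy as np

def ssd_in_inverse_distance(ref_window, images, baselines, x, y, F, inv_depths):
    # Sum, over all baselines, of the SSD between the reference window (anchored at
    # column x, row y of the reference image) and the window shifted by each disparity.
    h, w = ref_window.shape
    total = np.zeros(len(inv_depths))
    for img, B in zip(images, baselines):
        for k, iz in enumerate(inv_depths):
            d = int(round(B * F * iz))                      # disparity for this 1/z
            window = img[y:y + h, x + d:x + d + w]
            total[k] += ((window - ref_window) ** 2).sum()
    return total                                            # minimum at the true 1/z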
11A.5 Conclusion
The general concept of the parametric transform seeks consistency! That is the key here. Many
points which are in some sense "consistent" contribute to the same cell in the accumulator;
they "vote," in a sense. Hopefully, noise averages out in this voting process, and we can arrive
at consistent solutions.
In computed tomography (CT), the signal measured is a line integral along the ray from
the x-ray source to the x-ray detector. The line along which the integral is performed may be
represented using the parameterization of a straight line:
R(ρ, θ) = ∫ f(x(s), y(s)) δ(ρ − (x(s) cos θ + y(s) sin θ)) ds.    (11.33)
Examination of Eq. (11.33) leads one to the immediate conclusion that the Hough transform
may be formally represented by the Radon transform, and that, except for applications, they
are the same transform.
In addition to the use of these transforms for identifying specific shapes, Leavers [11.6] has
shown that if one considers the shape of the parameter space, rather than simply the location
of the peaks, one may determine the convex hull and several shape parameters of regions.
A fascinating alternative to the Hough transform was proposed by Aghajan and Kailath
[11.1], using wavefront propagation. The idea is to think of each pixel as a radio transmitter
emitting signals which are detected by receivers at the end of each row. Using the mathematics
of direction of arrival signal processing, they show it is possible to detect straight lines with
a computational complexity significantly less than the conventional Hough transform. This
idea of wavefront propagation makes sense as a paradigm for how the brain detects straight
lines.
11A.6 Vocabulary
You should know the meanings of the following terms.
Gauss map
Parabola
Radon transform
SSD
Assignment
11.1
In the directory named leadhole are a set of images of
wires coming through circuit board holes. The holes are
roughly circular and black. Use parametric transform methods to find the centers of the holes. This is a project and
Assignment
11.2
You are to use the generalized Hough transform approach to
both represent an object and to search for that object in
an image. It turns out that the object is a perfect square,
centered at the origin, with sides two units long, but you do
not know that ahead of time. You only have five points, those
at (0,1), (1,0), (1, 0.5), (-1,0), and (0, -1). Fill out
the R-table which will be used in the generalized Hough
transform of this object. (Table 11.2 contains four rows;
that is just a coincidence. You are not required to fill them
all in, and if you need more rows, you can add them.)
Assignment
11.3
Let P1 = [x1 , y1 ] = [3, 0] and P2 = [x2 , y2 ] = [2.39, 1.42] be two
points, both of which lie approximately on the same disk.
We do not know a priori whether the disk is dark inside or
bright. The image gradients at P1 and P2 are 5∠0 and 4.5∠π/4
(using polar notation).
Use Hough methods to estimate the location of the center of
the disk, and radius of the disk, and determine whether the
disk is darker or brighter than the background.
References
[11.1] H. Aghajan and T. Kailath, SLIDE: Subspace-based Line Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11), 1994.
[11.2] D. Ballard, Generalizing the Hough Transform to Detect Arbitrary Shapes, Pattern
Recognition, 13(2), 1981.
[11.3] N. Bennett, R. Burridge, and N. Saito, A Method to Detect and Characterize Ellipses
Using the Hough Transform, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21(7), 1999.
289
References
[11.4] Y. Cheng, Mean Shift, Mode Seeking, and Clustering, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(8), 1995.
[11.5] T. Hofmann and J. Buhmann, Pairwise Data Clustering by Deterministic Annealing, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[11.6] V. Leavers, Use of the Two-dimensional Radon Transform to Generate a Taxonomy
of Shape for the Characterization of Abrasive Powder Particles, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(12), 2000.
[11.7] P. Liang and C. Taubes, Orientation-based Differential Geometric Representations
for Computer Vision Applications, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(3), 1994.
[11.8] E. Lutton, H. Maître, and J. Lopez-Krahe, Contribution to the Determination of
Vanishing points using the Hough Transform, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(4), 1994.
[11.9] G. McLean and D. Kotturi, Vanishing Point Detection by Line Clustering, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[11.10] M. Okutomi and T. Kanade, A Multiple-Baseline Stereo, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 15(4), 1993.
[11.11] J. Princen, J. Illingworth, and J. Kittler, Hypothesis Testing: A Framework for
Analyzing and Optimizing Hough Transform Performance, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 16(4), 1994.
[11.12] H. Wechsler and J. Sklansky, Finding the Rib Cage in Chest Radiographs, Pattern
Recognition, 9, pp. 21–30, 1977.
[11.13] A. Ylä-Jääski and N. Kiryati, Adaptive Termination of Voting in the Probabilistic
Circular Hough Transform, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(9), 1994.
12
Functions are born of functions, and in turn, give birth or death to others. Forms emerge from
forms and others arise or descend from these
L. Sullivan
You have already seen the use of graph-theoretic terminology in connected component labeling in Chapter 8. The way we used the term connected components in
the past was to consider each pixel as a vertex in a graph, and think of each vertex as
having four, six, or eight edges to other vertices (that is, four-connected neighbors,
six neighbors if hexagonal pixel is used, and eight-connected neighbors). However,
we did not build elaborate set-theoretic or other data structures there. We will do
so in this chapter. The graph-matching techniques discussed in this chapter will be
used a great deal in Chapter 13.
12.1 Graphs
A graph is undirected if (∀a, b ∈ V)[(a, b) ∈ E ⇒ (b, a) ∈ E]. Otherwise the graph is directed (or, in some
special cases, partially directed, a seldom-used term).
The degree of a node is the number of edges coming into that node.
A path between nodes v_0 and v_l is a sequence of nodes v_0, v_1, . . . , v_l such that
there exists an edge between v_i and v_{i+1} for all i.
A graph is connected if there exists a path between any two nodes.
A clique (remember this word, you will see it again) is a subgraph in which there
exists an edge between any two nodes.
A tree is a graph which contains no loops. Tree representations have found application in speeding up Markov random field applications [12.8].
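To make these definitions concrete, here is a small sketch of an undirected graph stored as an adjacency structure, with the degree and connectedness checks written out; the class and method names are illustrative.

from collections import defaultdict, deque

class Graph:
    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)              # undirected: (a, b) in E implies (b, a) in E

    def degree(self, v):
        return len(self.adj[v])         # the number of edges coming into v

    def connected(self):
        # True if a path exists between every pair of nodes (breadth-first search).
        nodes = list(self.adj)
        if not nodes:
            return True
        seen, frontier = {nodes[0]}, deque([nodes[0]])
        while frontier:
            for w in self.adj[frontier.popleft()]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return len(seen) == len(nodes)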
Atoms are distinguished from nodes by a single bit, usually the most significant
bit of the computer word. Certain nodes contain a zero in their right half, indicating
the end of the list. Such nodes are indicated in Fig. 12.2 by the cross in the right-hand
half. The linked list can also be used to store computer instructions, thus providing
a powerful mechanism for automatic programs. This was the foundation of the
programming language LISP.
Fig. 12.2. In a linked list, each node contains two pointers. A pointer with the value of zero
indicates the end of a list.
Fig. 12.3. A more general data structure may contain both data and pointers.
The concept of the linked list was incorporated into more modern programming
languages and extended to allow more generality. For example, a structure such as
that illustrated in Fig. 12.3 can contain data and pointers to other data structures of
the same or different types. For example, the following C definition describes the
data structure of Fig. 12.3.
struct patch
{
    int area;
    int perimeter;
    struct patch *patch1;   /* pointer to another structure of type patch */
    struct patch *patch2;   /* (member names here are illustrative) */
};
The C programmer will recognize the *patch as denoting a pointer to another
structure of type patch.
Fig. 12.4. A polyhedron with six faces.
What a mess! Do you think there is a planar way to draw this graph? That is, can you draw it without any lines crossing?
Now, here is the problem: Given an observation, and the RAG derived from the observation, and given a collection of models and their corresponding graphs, which
model best matches the observation? We will address this matching problem later.
Other graph representations are possible and often useful, for example, the constructive solid geometry (CSG) community uses a collection of primitives subjected
to transformations to represent input to automatic parts manufacturing systems. The
primitives are objects like spheres and cylinders. Methods have been developed
[13.8] to match scenes to models constructed from CSG representations as well as
to RAG representations.
12.4.1
Fig. 12.7. Scene graph for an image with one object segmented into three patches. (Each patch
node in the figure carries fields such as Avail, B. Window, B. Volume, Direction, Normal, Near,
and Next.)
Notice the use of
cardinality rather than
area. Why do you suppose
this is? Think about a
range image viewed
obliquely.
Fig. 12.8. The coordinate system is defined so that the object to be characterized has its center
of gravity located at the origin.
Fig. 12.9. Two different aspects of the same object produce very different two-dimensional
images.
Fig. 12.10. Each partition of the VSP is identified by a list of the surfaces visible from that set of
viewpoints (redrawn from [12.5]).
In those rare camera movements, however, radical things happen to the image:
surfaces disappear or appear. That is, the topological structure of the image changes.
Two viewpoints V1 and V2 are defined to be aspect equivalent, denoted V1 ∼ V2,
if and only if there exists a sequence of infinitesimal camera motions, a path, from
V1 to V2 such that the topology of the viewed image does not change. The aspect
equivalent property is obviously symmetric, reflexive, and trivially transitive. It is
therefore an equivalence relation and thus imposes a partition, denoted the viewpoint
space partition (VSP), on the set of points on the sphere. Each element of this partition
is referred to as a viewing region. Fig. 12.10, from [12.5], illustrates the VSP for a
tetrahedron with faces A, B, C, and D. The aspect graph (referred to originally as
the visual potential) is the dual of the VSP.
Computing the aspect graph is accomplished [12.2] by first constructing the
labeled image structure graph (LISG), which is an augmented graph in which
each node in the graph corresponds to a vertex in the line drawing, and each arc
corresponds to the line segment between those nodes. The arcs are augmented
with the labels corresponding to the properties +, −, and →, as we used them in
Chapter 10 to mean convex, concave, or occluding. The algorithm in [12.2] partitions
the viewing sphere such that all points in a partition have isomorphic LISGs.
For arbitrary (potentially nonconvex) polyhedra, using orthographic projection,
the algorithm is of high (but still polynomial) computational complexity. For a
polyhedron with n faces, the total worst case time complexity is O(n^8).
Aspect graphs were originally developed by Koenderink and Van Doorn [12.3],
and extended by a variety of authors; Bowyer and Dyer [12.1] provide a good survey
of work prior to 1990. More recent work has focussed on such nasty problems as
the fact that we are dealing with sampled data [12.7].
12.7 Conclusion
Consistent labeling
searches a tree of
interpretations.
The concepts of graphs occur throughout the machine vision literature. See [12.6]
for more on scene structure graphs and Bayesian networks which use them. As we
now understand, the search algorithm used in section 10.1 to label line drawings
actually searched a tree of interpretations.
12.8 Vocabulary
You should know the meanings of the following terms.
Aspect graph
Clique
Connected
Degree
Edge
Isomorphic
Node
NP-complete
Path
RAG
Scene graph
Tree
Vertex
References
[12.1] K. Bowyer and C. Dyer, Aspect Graphs: An Introduction and Survey of Recent
Results, International Journal of Imaging Systems and Technology, 2, pp. 315–328,
1990.
[12.2] Z. Gigus and J. Malik, Computing the Aspect Graph for Line Drawings of Polyhedral
Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2),
1990.
[12.3] J. Koenderink and A. Van Doorn, The Internal Representation of Solid Shape with Respect to Vision, Biological Cybernetics, 32, pp. 211–216, 1979.
[12.4] B. Messmer and H. Bunke, A New Algorithm for Error-tolerant Subgraph Isomorphism Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(5), 1998.
[12.5] H. Plantinga and C. Dyer, Visibility, Occlusion, and the Aspect Graph, International Journal of Computer Vision, 5(2), pp. 137–160, 1990.
[12.6] S. Sarkar and P. Soundararajan, Supervised Learning of Large Perceptual Organization: Graph of Spectral Partitioning and Learning Automata, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(5), 2000.
[12.7] I. Shimshoni and J. Ponce, Finite-resolution Aspect Graphs of Polyhedral Objects,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997.
[12.8] C. Wu and P. Doerschuk, Tree Approximations to Markov Random Fields, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
13
Image matching
In this chapter we will consider issues associated with matching: matching observed images with models, as well as matching images with each other. We will consider
matching iconic representations as well as matching graph-theoretic representations.
Matching establishes an interpretation. That is, it puts two representations into
correspondence.
• Both representations may be of the same form. For example, correlation matches an observed image with a template. Similarly, subgraph isomorphism matches a region adjacency graph to a subgraph of a model graph.
• Both representations might be of different forms. For example, one image matches one paragraph describing something. In most such applications, we find ourselves matching an equation to some data, and in this case, fitting might be a better word.
In the remainder of this chapter we address all of these matching problems except fitting, which was discussed earlier in this book.
13.1 Template matching
A template is a representation for an image (or subimage) which is itself a picture. A template is typically moved around the target image until a location is found which maximizes some match function. The most obvious function is the squared error

SE(x, y) = \sum_{\alpha=1}^{N} \sum_{\beta=1}^{N} \bigl( f(x+\alpha, y+\beta) - T(\alpha, \beta) \bigr)^2,   (13.1)

(assuming the template is N × N) which provides a measure of how well the template (T) matches the image (f) at point x, y. If we expand the square and carry the
summation through, we find

SE(x, y) = \sum_{\alpha=1}^{N}\sum_{\beta=1}^{N} f^2(x+\alpha, y+\beta) - 2 \sum_{\alpha=1}^{N}\sum_{\beta=1}^{N} f(x+\alpha, y+\beta)\, T(\alpha, \beta) + \sum_{\alpha=1}^{N}\sum_{\beta=1}^{N} T^2(\alpha, \beta).   (13.2)
Let's look at these terms. The first term is the sum of the squared image brightness values at the point of application. It says nothing about how well the image matches the template (although it IS dependent on the image). The third term is simply the sum of the squared elements of the template, and is a constant, no matter where the template is applied. The second term obviously is the key to matching, and that term is the correlation.
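To make the discussion concrete, here is a small Python sketch (our own illustration, with NumPy assumed; it is not taken from the book's software) that evaluates Eq. (13.1) at every offset and keeps the best one. In practice the correlation term identified above is usually computed with FFTs rather than an exhaustive loop.

import numpy as np

def ssd_match(f, T):
    """Exhaustive search for the offset minimizing the squared error of Eq. (13.1).

    Assumes f and T are 2D float arrays and the template fits inside the image.
    """
    N, M = T.shape
    rows, cols = f.shape
    best_se, best_xy = np.inf, (0, 0)
    for x in range(rows - N + 1):
        for y in range(cols - M + 1):
            patch = f[x:x + N, y:y + M]
            se = np.sum((patch - T) ** 2)        # SE(x, y) of Eq. (13.1)
            if se < best_se:
                best_se, best_xy = se, (x, y)
    return best_xy, best_se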
In matching using an optimization criterion, the assumption is made that the quality of match can be described by a set of parameters a = {a_1, a_2, ..., a_n}, which could be the pixels themselves. We define a merit function M(a, f(x)) which quantifies the quality of the match between the template and the local image. Matching consists of determining a so that M is maximized. Typically, a is the x, y coordinates specifying where the template is placed.
If M is monotonic in a, we can maximize M by solving

M_{a_j} = \frac{\partial M}{\partial a_j} = 0 \quad \text{for } j = 1, \ldots, n.   (13.3)
If M is not monotonic, the process of finding points where the partial derivatives are zero can terminate in a local maximum. Furthermore, as we have discussed earlier, it is probably not possible to find an analytic solution to Eq. (13.3). In that case, we could use hill-climbing:
a_j^k = a_j^{k-1} + c\, M_{a_j}.   (13.4)
13.1.2 Point matching
Consider an image as just a set of points with known distances between each other
(one could consider this an instance of the springs and templates problem discussed
in the next section, with trivial templates). One example where this type of data
occurs is in the recognition of targets in Synthetic Aperture Radar (SAR) images. In
this situation one may approach the problem by assuming a 3D model of the object,
and finding the transformation from 3D to 2D which best describes the observation
[13.53]; see also [13.3].
Matching in the stereo environment can make use of the epipolar constraint as
well as characteristics of edges and vertices, and can be treated probabilistically
[13.47].
13.1.3 Segment matching
The problem of nding a match between short arcs and pieces of a long arc has been
investigated in the literature [13.3, 13.39, 13.42]. The corresponding 3D problem
has not received much attention, largely because it is difficult to extract 3D curves;
[13.22] provides one approach.
13.1.4 Eigenimages
The eigenimage approach has been an effective solution to problems like object
identification and recognition [13.49, 13.50], where the image of an unknown object
is compared to images of known objects in a data base (or a training set) and the
unknown object can be identied or recognized when a close match is found. We
can surely do a pixel-by-pixel comparison. However, this is very time-consuming,
especially when the size of the image is large and the number of images included in
the data base is large as well.
The eigenimage approach has its origin in principal component analysis (PCA),
which is a popular technique for dimensionality reduction. In section 9.2.1, one type
of PCA, the KL transform, is described in detail. PCA constructs a representation
of the data with a set of orthogonal basis vectors that are the eigenvectors of the
covariance matrix generated from the data. By projecting the data onto the dominant
eigenvectors (corresponding to the larger eigenvalues), the dimension of the original
data set can be reduced with minimal loss of information. Similarly, in the eigenimage approach, each image is represented as a linear combination of a set of dominant principal components (the eigenimages). Matching is then conducted based
on the coefficients of the linear combination (or the weights of projection onto the
eigenimages) which greatly speeds up the process. The projection preserves most of
the energy, and thus captures the highest amount of variation in the data base. Here,
we discuss the calculation of the eigenimages in detail.
Let f_1, f_2, ..., f_p represent a set of images of known objects in the data base. Without loss of generality, assume these images are of the same dimension m × n. The following steps lead us to the eigenimages.
Step 1. Construct a set of vectors {I_1, ..., I_p} where each I_i (i = 1, ..., p) is the lexicographical representation of the corresponding image f_i minus the average image, I_i = f_i - A, and A = \frac{1}{p}\sum_{i=1}^{p} f_i. Note that each vector is mn × 1.
Step 2. Calculate the covariance matrix of this set by

C = \frac{1}{p-1} \sum_{i=1}^{p} I_i I_i^T,   (13.5)

where C is an mn × mn matrix.
Step 3. Use eigenvalue decomposition techniques to obtain the eigenvectors and eigenvalues of C:

C = E \Lambda E^T

where E is an mn × mn matrix with each column vector an eigenvector of C (or an eigenimage of C), and \Lambda is a diagonal matrix

\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{mn})

with the eigenvalues of C on the diagonal, and \lambda_1 > \lambda_2 > \cdots > \lambda_{mn}.
Step 4. Suppose among the mn eigenvalues, the first k values are much larger than the other values, that is, the ratio \sum_{j=1}^{k}\lambda_j / \sum_{j=1}^{mn}\lambda_j is very close to 1. Then we can use the first k eigenvalues (and their eigenvectors) to reconstruct the original image without losing too much information. Hopefully, k ≪ mn.
Step 5. Calculate the projection coefficients of image f_i onto the selected eigenimages for comparison purposes by

W_i = I_i^T [E_1 \ldots E_k].   (13.6)

Compare the distance between W_test and all the W_i's in the data base (a Euclidean distance might be the simplest approach); the one with the closest distance is selected as a match.
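The five steps can be sketched in a few lines of Python (our own illustration using NumPy, not the book's software); note that forming the full mn × mn covariance is practical only for small images, a limitation addressed below.

import numpy as np

def build_eigenimages(images, k):
    """Steps 1-4: stack images as columns, subtract the mean, keep k eigenimages.

    Assumes `images` is a list of equally sized 2D arrays.
    """
    I = np.stack([im.ravel().astype(float) for im in images], axis=1)  # mn x p
    A = I.mean(axis=1, keepdims=True)
    I = I - A                                   # Step 1: remove the average image
    C = (I @ I.T) / (I.shape[1] - 1)            # Step 2: covariance, Eq. (13.5)
    vals, vecs = np.linalg.eigh(C)              # Step 3: eigen-decomposition
    order = np.argsort(vals)[::-1]              # largest eigenvalues first
    E = vecs[:, order[:k]]                      # Step 4: keep k dominant eigenimages
    return E, A

def project(image, E, A):
    """Step 5, Eq. (13.6): projection coefficients of one image onto the eigenimages."""
    return E.T @ (image.ravel().astype(float).reshape(-1, 1) - A)

def closest_match(test, database, E, A):
    """Return the index of the database image whose coefficients are nearest (Euclidean)."""
    w_test = project(test, E, A)
    dists = [np.linalg.norm(w_test - project(im, E, A)) for im in database]
    return int(np.argmin(dists))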
We now show an interesting example of applying the eigenimage approach to
face recognition [13.51]. Assume we have three images in the data base (Lena,
Einstein, and the Clock); the unknown image is Monalisa. Following Steps 1 through 3 described above, we can derive 64 × 64 eigenimages. We use only two of the dominant ones since the ratio between the summation of the first two eigenvalues
and the summation of all the eigenvalues is close to 1, as stated in Step 4. Fig. 13.1
shows all the original images and the two eigenimages derived. Following Step
5, we compute the projection coefficients of all four original images on the two
eigenimages; these are listed in Fig. 13.1 as well. Based on a simple Euclidean
distance calculation, it turns out that the closest match to Monalisa is Einstein. Is
that a surprise? Not really. In the eyes of the computer, these two images are indeed
more alike than Monalisa and Lena.
Even though the eigenimage approach has great potential for image matching, from
the procedure described above, we see that the most time-consuming step is the
derivation of the eigensystem. When the size of images is large, the calculation of
the covariance matrix (which is mn × mn) can take up a lot of computation resources or be completely infeasible. For more efficient calculations, readers are referred to
[13.34, 13.35].
We illustrate one approach to reducing computation through an example. Assume
that each image has only three pixels and that there are only two such images in the
set. Let them be
f_1 = [1 \;\; 3 \;\; 3]^T \qquad f_2 = [5 \;\; 9 \;\; 9]^T.
[Fig. 13.1: the original training images (I1, I2, I3) and the test image, together with the two derived eigenimages E1 and E2 and their projection coefficients.]
Subtracting the average image A = [3 \;\; 6 \;\; 6]^T gives I_1 = [-2 \;\; -3 \;\; -3]^T and I_2 = [2 \;\; 3 \;\; 3]^T.
Construct the matrix I = [I_1 \ldots I_p], in which the ith column of I is one of the images I_i, and consider the product S = I I^T. In this example,
I = \begin{pmatrix} -2 & 2 \\ -3 & 3 \\ -3 & 3 \end{pmatrix} \quad \text{and} \quad S = \begin{pmatrix} 8 & 12 & 12 \\ 12 & 18 & 18 \\ 12 & 18 & 18 \end{pmatrix}.
Observe that if p < mn, then S is the scatter matrix, identical to the covariance except for the multiplicative scale factor. S is huge: if the image is 256 × 256, then S is 256² × 256². However, if there are only, say, five images in the set, I is 256² × 5, and the much smaller p × p product I^T I has the same nonzero eigenvalues. If v_i is an eigenvector of I^T I with components v_{ik},

I^T I\, v_i = \lambda_i v_i,   (13.7)

then the corresponding eigenimage can be recovered as the linear combination

E_i = \sum_{k=1}^{p} v_{ik} I_k.   (13.8)
The unknown is distinguished from the examples by lack of a subscript.
• Decide which measurements you wish to use to describe the shape. For example, one might build a system which measures seven invariant moments and the aspect ratio, for a total of eight features. The best collection of features is application-dependent, and methods for optimally choosing feature sets are beyond the scope of this book (see [14.4, 14.11, 18.30], which are just a few of many texts in statistical methods). Organize these eight features into a vector, x = [x_1, x_2, \ldots, x_8]^T.
• Describe a model object using a collection of example images (called a training set), from which feature vectors have been extracted. Continuing the example with eight features, we could collect a set of n images of axes, measure the feature vector of each axe, and characterize the model axe by its average over this set, \mu_{axe} = \frac{1}{n}\sum_{x \in axes} x. Hatchets might be similarly characterized by an average over a set of sample hatchets.
Now, given an unknown region, characterized by its feature vector x, shape matching consists of finding the model which is closest in some sense to the observed region. Probably the simplest definition of close uses the Euclidian distance

d(model_{axe}, observation) = \sqrt{(x - \mu_{axe})^T (x - \mu_{axe})}
d(model_{hatchet}, observation) = \sqrt{(x - \mu_{hatchet})^T (x - \mu_{hatchet})}.
Euclidian distance. In Chapter 14, the concepts in this discipline of statistical pattern classification are covered in more detail.
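A minimal sketch of this minimum-distance matcher (our own illustration in Python; the feature arrays and class names in the usage comment are hypothetical):

import numpy as np

def class_mean(training_vectors):
    """Characterize a model class by the average of its training feature vectors."""
    return np.mean(np.asarray(training_vectors, dtype=float), axis=0)

def nearest_mean(x, class_means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    x = np.asarray(x, dtype=float)
    distances = {name: np.sqrt((x - mu) @ (x - mu)) for name, mu in class_means.items()}
    return min(distances, key=distances.get)

# Hypothetical usage with the axe/hatchet example and 8-element feature vectors:
# means = {"axe": class_mean(axe_features), "hatchet": class_mean(hatchet_features)}
# label = nearest_mean(unknown_features, means)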
13.3.1 Association graphs
Association graphs embody a methodology which is less restrictive than isomorphism, and which may converge more rapidly. It will converge to a solution which
is consistent, but not necessarily optimal (depending, of course, on the criteria for
optimality used in any particular application).
The method matches a set of nodes from the model to a set of nodes (extracted)
from the image.
306
Image matching
Definition
Here, a graph is denoted G = ⟨V, P, R⟩, where V represents a set of nodes, P represents a set of unary predicates on nodes, and R represents binary relations between nodes.
A predicate is a statement which takes on only the values TRUE or FALSE.
For example let x denote a region in a range image. Then CYLINDRICAL(x) is a
predicate which is true or false depending on whether all the pixels in x lie on a
cylindrical surface.
A binary relation describes a property possessed by a pair of nodes. It may be
considered as a set of ordered pairs R = {(a1 , b1 ), (a2 , b2 ), . . . (an , bn )}. In most
applications, order is important. It is possible to think of a relation as a predicate,
since for any given pair, say (ak , bk ), either it is an element of the set R or it is not.
However, it seems more descriptive to use the word relation in this context.
Given two graphs, G_1 = ⟨V_1, P, R⟩ and G_2 = ⟨V_2, P, R⟩, we construct the association graph G_A by creating a node for every pair of nodes, one from each graph, whose unary predicates agree, and by joining two such nodes with an arc whenever the pairings they represent are mutually consistent under the binary relations.
In Fig. 13.2 we illustrate an observation in which a segmentation error, oversegmentation, has occurred. That is, regions B and C are actually part of the same region, but
due to some measurement or algorithmic error, have been labeled as two separate
regions. In this example, the unary predicates are labels spherical, cylindrical, and
planar. Regions A and 1 are spherical, while B, C, D, 2, and 3 are cylindrical. The
only candidates for matches are those with the same predicate. So only A can match 1.
We now construct a graph in which all candidate matches are the nodes. We then
have the nodes of the association graph, as illustrated in Fig. 13.3.
Fig. 13.2. A range camera has observed a scene and segmented it into segments which satisfy the same equation; however, an error has occurred.
[Fig. 13.3: the association graph whose nodes are the candidate matches 1A, 2B, 2C, 2D, 3B, 3C, and 3D.]
Two of the inconsistencies, which prevent arcs from being constructed, are

r_A(⟨2, B⟩, ⟨3, B⟩) ≠ 1
r_A(⟨2, B⟩, ⟨2, C⟩) ≠ 1.
The second line says that patch B in the image could not be region 2 in the model
while simultaneously, patch C in the image is the same region. In both examples,
inconsistencies are really based on the assumption that the segmenter is working
correctly. However, one could allow the segmenter to fail. In that case, new edges are
added, because new relationships are now consistent. For example, r_A(⟨3, C⟩, ⟨3, D⟩) = 1, since we believe that two patches could be part of the same region (the segmenter can fail by oversegmentation); however, r_A(⟨2, D⟩, ⟨3, D⟩) ≠ 1 still holds, because we still believe the segmenter will not merge patches (fail by undersegmentation).
Allowing for oversegmentation produces the association graph of Fig. 13.4.
Note one other type of inconsistency which prevents an edge from being constructed: 3D and 3B are not connected since B and D do not border. That is, we
believe that if the segmentation fails by oversegmentation, the segmenter will not
introduce an entire new patch between the two. We must emphasize that how you
develop these rules is totally problem-dependent!
Once you have the allowable consistencies, the matching is straightforward. Simply find all maximal cliques. The maximal clique is not unique, since there may be several cliques of the same size.
In this case, there are at least two maximal cliques, two of which are:
{(1, A), (2, B), (2, C), (3, D)} and {(1, A), (3, B), (2, C), (2, D)}.
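For small association graphs, the maximal cliques can be enumerated by brute force, as in the following sketch (our own illustration; the node and edge values shown in the usage comment are hypothetical, and a practical system would rely on dedicated clique-finding software):

from itertools import combinations

def maximal_cliques(nodes, edges):
    """Enumerate maximal cliques by brute force.

    `edges` is a set of frozensets holding the consistent (connected) pairs.
    Exponential in the number of nodes, as expected: clique finding is NP-complete.
    """
    adjacent = lambda a, b: frozenset((a, b)) in edges
    cliques = []
    for r in range(len(nodes), 0, -1):                 # largest subsets first
        for subset in combinations(nodes, r):
            if all(adjacent(a, b) for a, b in combinations(subset, 2)):
                s = set(subset)
                if not any(s < c for c in cliques):    # keep only maximal cliques
                    cliques.append(s)
    return cliques

# Hypothetical usage (nodes are (model, image) candidate matches):
# nodes = [("1","A"), ("2","B"), ("2","C"), ("2","D"), ("3","B"), ("3","C"), ("3","D")]
# edges = {frozenset({("1","A"), ("2","B")}), ...}     # consistent pairs only
# print(maximal_cliques(nodes, edges))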
Fig. 13.5. A springs and templates model of a face.
13.3.2 Springs and templates
\text{Cost} = \sum_{d \in \text{templates}} \text{TemplateCost}(d, F(d)) + \sum_{(d,e) \in \text{ref} \times \text{ref}} \text{SpringCost}(F(d), F(e)) + \sum_{c \in R_{\text{missing}}} \text{MissingCost}(c).   (13.9)
In Eq. (13.9), d is a template and F(d) is the point in the image where that template
is applied. TemplateCost is therefore a function indicating how well a particular
template matches the image when applied at its best matching point. SpringCost is
a measure of how much the model must be distorted (the springs stretched) to apply
those particular templates at those particular locations. Finally, it may be that not every template can be located in some images (the left eye may not be visible, for example), and a cost may be imposed for things that are missing. All these costs are empirically determined.
However, once they are determined, it becomes relatively easy to determine how
well any given image matches any given model.
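A sketch of how such a cost might be evaluated (our own illustration; the argument names and the quadratic spring penalty are assumptions, since the individual cost functions are left unspecified and empirically determined):

import numpy as np

def match_cost(placements, templates, springs, missing, spring_weight):
    """Evaluate a springs-and-templates objective in the spirit of Eq. (13.9).

    placements   -- dict: template name -> (x, y) where it was placed, or None if unlocated
    templates    -- dict: name -> a TemplateCost(x, y) callable
    springs      -- dict: (name1, name2) -> rest length of the connecting spring
    missing      -- dict: name -> penalty if that template cannot be located
    spring_weight-- weight on the spring (distortion) term
    """
    total = 0.0
    for name, template_cost in templates.items():
        if placements.get(name) is None:
            total += missing[name]                       # MissingCost term
        else:
            total += template_cost(*placements[name])    # TemplateCost term
    for (a, b), rest in springs.items():
        if placements.get(a) is not None and placements.get(b) is not None:
            d = np.linalg.norm(np.subtract(placements[a], placements[b]))
            total += spring_weight * (d - rest) ** 2     # SpringCost: squared stretch
    return total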
There is one significant (among several others) problem with spring matching: The number of elements matched affects the magnitude of the cost. The costs are summed, so a poor match of only a few things may be less than a good match of many (and therefore better, in a minimal cost algorithm).
This is a problem which is not unique to springs and templates. The usual solution is to normalize the calculations, for example by dividing the total cost by the number of elements matched.
13.4 Conclusion
Association graphs use the concepts and formalisms of consistent labeling directly. The advantage of using a graph structure is that the search for the largest clique is aided by a body of available software for performing such searches as quickly as computational complexity allows. Similarly, the springs and templates ideas measure both consistency and deviation from consistency. The springs and templates concepts also illustrate both how one might construct an appropriate objective function, and a problem that can easily arise if one does not pay attention to interpretation of the objective function: if we are summing match quality, a good match of many things (adding up lots of small numbers) may be more than (and therefore worse than) a poor match of only a few things (adding up just a few rather large numbers).
We began this chapter by pointing out that formal optimization methods, either
descent or hill-climbing, are hard to apply to image matching because the search
space is littered with local minima. However, if we initialize the algorithm sufficiently close to the solution, such techniques will work. We used the sum of squared
differences (SSD), also sometimes called the sum-squared error, as the objective
function.
Eigenimages are lower dimensionality representations of the original images. The
projections are chosen by minimizing the error between the original data and the
projected data.
13.5 Vocabulary
You should know the meanings of the following terms.
Association graph
Clique
Correspondence
Deformable template
Eigenimage
Hill-climbing
Matching metric
PCA
Template
Assignment
13.1
In this chapter, we stated that the problem of finding the largest clique is NP-complete. What does that
really mean? Suppose you have an association graph with
ten nodes, interconnected with 20 edges. How many tests
must you perform to find all cliques (which you must do
in order to identify which of these are maximal)? You
ARE permitted (encouraged!) to look up clique-finding
in a graph theory text.
Assignment
13.2
In section 13.3.1, an example problem is presented
which involves an association graph which allows for
segmentation errors. The result of that graph is two
maximal cliques, which (presumably) mean two different
interpretations of the scene. Describe in words these
two interpretations.
Assignment
13.3
In the bibliography for this chapter, there is an
incomplete citation to Olson [13.36]. First, locate
a copy of that paper. You may use a search engine, the
Web, the library, or any other resource you wish. In
that paper, the author does template matching in a different way: Using a binary (edge) image and a similar template, he does not ask "Does the template match the image at this point?" Instead he asks, "At this point, how far is it to the nearest edge point?"
How does he perform this operation, apparently a
search, efficiently?
Once he knows the distance to the nearest edge point,
how does he make use of that information to compute a
quality of match measure?
Assignment
13.4
In an image-matching problem, we have two types of objects, lions and antelope (which occupy only one pixel
each).
Assignment
13.5
Do you think the concepts of springs and templates
would be applicable to Assignment 13.4? Discuss.
Assignment
13.6
Still thinking about lions and antelope, you observe the scene opposite: The sketch is not to scale.
However, for your convenience, we have tabulated the
distance between each pair of animals (Table 13.1).
Lions are yellow (which is denoted by a dotted interior) or brown (denoted by a black interior -- that's right; there aren't any). Antelope are white (denoted
by a white interior) or yellow. You wish to use
association graph methods to solve this problem; and
since this technique is not as powerful as nonlinear
Table 13.1.
Pair    Distance (arbitrary units)
1, 2    5.5
1, 3    2
1, 4    3
1, 5    2
2, 3    2
2, 4    3
2, 5    4
3, 4    2
3, 5    3.8
4, 5    3.4
13A.1 Matching
This example derivation was first described by Shapiro and Brady [13.46], who use eigenvector methods as follows.
As in the original springs and templates formulation, we are finding which of a collection of feature point sets best matches one particular set. Let d_{ij} be the Euclidian distance between feature points x_i and x_j, and construct a matrix of weights

H = [H_{ij}], \quad \text{where} \quad H_{ij} = \exp\left( -\frac{d_{ij}^2}{2\sigma^2} \right).   (13.10)
The matrix H is diagonalized using standard methods into the product of three matrices

H = E \Lambda E^T   (13.11)

where E is a matrix with the eigenvectors of H as its columns, and \Lambda is a diagonal matrix with the eigenvalues on the diagonal. Let us assume the rows and columns of E and \Lambda are sorted so that the eigenvalues appear along the diagonal in decreasing size. We think of each row of E as a feature vector, denoted F_i. Thus

E = \begin{bmatrix} F_1 \\ \vdots \\ F_m \end{bmatrix}.
Suppose we have two images, f 1 and f 2 , and suppose f 1 has m feature points while f 2 has n
feature points, and suppose m < n. Then by treating each set of feature points independently,
we have H_1 = E_1 \Lambda_1 E_1^T for image f_1, and H_2 = E_2 \Lambda_2 E_2^T for image f_2. Since the images have different numbers of points, the matrices H_1 and H_2 have different numbers of eigenvalues. We therefore choose to use only the most significant k features for comparison purposes.
It is important that the directions of the eigenvectors to be matched be consistent, but
changing the sign does not affect the orthonormality. We choose E 1 as a reference and then
orient the axes of E 2 by choosing the direction that best aligns the two sets of feature vectors;
see [13.46] for details. After aligning the axes, a matrix Z characterizing the match between image 1 and image 2 is defined by

Z_{ij} = (F_i^1 - F_j^2)^T (F_i^1 - F_j^2).   (13.12)
The best matches are indicated by the elements of Z which are the smallest in their row
and column. We will revisit this example in the next section.
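A sketch of the modal-feature construction (our own illustration in Python; the eigenvector sign alignment discussed above is omitted for brevity):

import numpy as np

def modal_features(points, sigma, k):
    """Feature vectors (rows of E) from the Gaussian proximity matrix H.

    A minimal sketch of Eqs. (13.10)-(13.11); `points` is an (m, 2) array.
    """
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    H = np.exp(-d ** 2 / (2.0 * sigma ** 2))          # Eq. (13.10)
    vals, vecs = np.linalg.eigh(H)                     # H = E Lambda E^T, Eq. (13.11)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:k]]                          # keep the k dominant modes

def match_matrix(F1, F2):
    """Z_ij of Eq. (13.12); small entries indicate good candidate correspondences."""
    diff = F1[:, None, :] - F2[None, :, :]
    return np.sum(diff ** 2, axis=2)

# Hypothetical usage:
# Z = match_matrix(modal_features(pts1, sigma=10.0, k=4),
#                  modal_features(pts2, sigma=10.0, k=4))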
Sclaroff and Pentland [13.44] present a further alternative to the springs and templates
formulation: First, compute a description of the entire shape which is robust to sampling
and parameterization error. Then, using this description of the entire shape, nd a coordinate
system which effectively describes the shape. Doing this on the image and the model makes
it straightforward to determine cardinal directions.
Wu [10.19] takes the problem of computing optic flow and uses relaxation labeling to find consistent template matches.
The concept of deformable templates can be combined with graph representations to
produce an approach [13.1] to matching of objects which are similar, but not identical in
shape (e.g. x-rays of hands). The idea of deformable templates can be viewed as an extension
of MAP methods. See [13.26] for a well-written concise description. Methods like this also
find applications in target tracking and automatic target recognition (ATR) [13.12].
13A.2
13A.2.1
Fig. 13.7. Computation performed by a single neuron. Each input (xi ) is multiplied by a weight
(wi ) and the results are added, producing a signal u, which is passed through a
sigmoid-like nonlinearity function (S) producing the neuron output y.
[Fig. 13.8 panels: a single-layer perceptron yields a decision boundary which is a hyperplane; a multilayer network (usually 3 layers) can implement arbitrary decision regions.]
Fig. 13.8. Types of feedforward neural networks, and the decision regions which they can
implement.
Fig. 13.9. A feedforward neural network with three inputs and two outputs. Each circle denotes
a neuron. Weights are not explicitly shown, but exist on the connections.
\Delta w_{ij} \propto -\frac{\partial\, MSE}{\partial w_{ij}}.
Using the three-level neural network model illustrated in Fig. 13.9, the gradient descent rule
may be readily implemented by making use of the chain rule for derivatives. Hussain and
Kabuka [13.24] demonstrate use of a neural network for character recognition.
13A.2.2
This model of the behavior of a neuron is true only in the steady state. That is, since the
output is dependent on the input, which is the output, which is dependent . . . (to iterate is
human, to recurse, divine2 ). But such a description is woefully inadequate when things are
changing. In that case, we need some model of the dynamics of the system. Many different
models can be used, and the reader is referred to the literature [13.15, 13.20, 13.23] for a
closer examination. Here, we consider a single, rather simple model, one in which the rate
of change of output from the summer is dependent on the input, and can be represented by a
first-order differential equation:
\tau \frac{d}{dt} y_i(t) = -y_i(t) + \sum_{j=1}^{n} w_{ij} S(u_j) + I_i   (13.14)
where the y_i's are the neuron outputs, the w_{ij}'s are the weights, as before, and the I_i's are inputs to each neuron from the external world (not shown in the figure). Thus the change is proportional to the current state, the inputs from all the other neurons, and the external input. In operation, a recurrent NN is presented with a particular input and then allowed to run; it should converge to a particular state.
This model was described by Hopfield [13.23], among others. In Hopfield's model, the rate constant, τ, resulted from a lumped-constant model of capacitance and resistance in an operational-amplifier implementation of such a recurrent network.
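A simple Euler-integration sketch of dynamics of this general form (our own illustration; as a simplification, the nonlinearity is applied directly to the state vector):

import numpy as np

def run_recurrent_net(W, I, steps=500, dt=0.05, tau=1.0):
    """Euler integration of recurrent dynamics in the spirit of Eq. (13.14).

    W is the (symmetric) weight matrix, I the external inputs, and S a sigmoid.
    The state should settle to a fixed point; the neuron outputs are returned.
    """
    S = lambda u: 1.0 / (1.0 + np.exp(-u))      # sigmoid-like nonlinearity
    y = np.zeros(len(I))
    for _ in range(steps):
        dy = (-y + W @ S(y) + I) / tau          # right-hand side of the dynamics
        y = y + dt * dy
    return S(y)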
Now, forget about Eq. (13.14) for a moment, and consider the objective function below, which we wish to minimize:

E = -\frac{1}{2} \sum_i \sum_j w_{ij} v_i v_j + \sum_i \int_0^{v_i} S^{-1}(\xi)\, d\xi - \sum_i I_i v_i.   (13.15)
If we are to find the v's which minimize this, we need to differentiate E with respect to those v's. Doing that, we find

\frac{\partial E}{\partial v_i} = -\sum_j w_{ij} v_j + S^{-1}(v_i) - I_i.   (13.16)
Now, we observe that the derivative of E with respect to the variables v has the same form as the dynamics of a Hopfield neural network, or

\frac{\partial E}{\partial v_i} = -\frac{du_i}{dt}.   (13.17)
Think about the steady state of the system described by Eq. (13.17). When the network has
finished changing (all the derivatives with respect to time are zero), all the partials of the
² L. P. Deutsch.
Fig. 13.11. The angle between neighboring feature points is a measurement local to feature
point i.
energy function are also zero. So we are at an extreme. It is relatively easy to show that one
may ignore that annoying integral in Eq. (13.15), and therefore a Hopfield neural network finds the set of variables v_i which minimize an objective function of the form described by
Eq. (13.15) (without the integral). We illustrate the use of such a network for matching through
an example.
Using the same set of features as the previous section, the zero crossings of the boundary
curvature, we assign a local measure to each feature point, in this case, the angle between the
vectors to neighboring points, as illustrated in Fig. 13.11. We will use this and a more global
feature, the distance between feature points, to solve the correspondence problem.
Assume image 1 (which you can think of as a model, if you wish) has n feature points,
and image 2 has m feature points. We define a matrix of neurons which has n columns and m rows. The neuron at row i, column j should have a value between zero and one depending on the degree to which feature point i in the first image matches feature point j in the second
image.
The matching process is posed as minimizing the expression

E = -\frac{A}{2} \sum_i \sum_j \sum_k \sum_l C_{ijkl} V_{ik} V_{jl} + \frac{q}{2} \left( \sum_i \sum_k \sum_{l \neq k} V_{ik} V_{il} + \sum_k \sum_i \sum_{j \neq i} V_{ik} V_{jk} \right).   (13.18)
The first term quantifies the compatibility of matches ik and jl. The last two terms are included to encourage uniqueness of matches. This form is chosen to allow for occlusions. The compatibility coefficient is the sum of three terms

C_{ijkl} = \gamma_1 \Delta(\theta_i, \theta_k) + \gamma_2 \Delta(\theta_j, \theta_l) + \gamma_3 \Delta(r_{ij}, r_{kl})

where

\Delta(a, b) = \begin{cases} 1 & \text{if } |a - b| < T \\ -1 & \text{otherwise} \end{cases}   (13.19)

for threshold T; \theta_i is a measurement local to feature point i, as illustrated in Fig. 13.11; and r_{ij} is a measure of similarity of relational measures between feature points. For example, if the distance between points i and j is the same as the distance between points k and l, then labelings ik and jl are consistent.
Proper manipulation of Eq. (13.18) allows it to be put in the form of Eq. (13.15), enabling
minimization by a neural network. More details are available in [13.28, 13.54]. In Fig. 13.12
an outline is shown of a pistol partially occluded by a hammer. A neural network using these
principles is able to identify both objects from this image.
13A.3 Image indexing
Up to this point, we have considered the process of image matching as searching a data
base of models for the model which best matches the observation. We have not addressed
the process of search itself. One could, of course, simply try all models, but that could
be prohibitively time-consuming, particularly in instances which involve large data bases
of models. In applications like automatic target recognition, where matching requires both
high speed and large data bases [13.45], better methods are required. The alternate paradigm,
indexing (sometimes called image hashing ) is analyzed in [9.6]. In an indexing scheme, a set
of parameters are extracted from the image. Obviously, such parameters need to be invariant
to as many image transformations as possible and also need to be robust [13.1]. This resulting
parameter vector is then used as indices into a lookup table containing references to models.
The lookup process returns a list of candidate models consistent with this particular parameter
vector. To see how this works, consider the following algorithm.
Begin by looking at local areas around the boundary and attempting to match each local
area with a data base of feature descriptors such as lines, circular arcs, and minima and maxima
of curvature. Assuming a successful segmentation of an unoccluded object, we start with an
edge image, where the edges are not required to be connected. About some point [x0 , y0 ]
on the edge, we sample the edge in that neighborhood using a sampling scheme3 which is
invariant to zoom. Form all possible combinations of that point with two other nearby points
and generate an invariant parameter vector similar to that described in [9.37]. That parameter
vector is then used to index a data base of local shapes. For each entry selected, a feature
instance is extracted and after all the triples have been considered, the feature instance with
the highest number of votes is selected.
Now, the boundary is represented by a sequence of feature instances, and the indexing
method may be repeated, using a look-up table of object models which are indexed by
geometry and occurrence of feature instances.
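A toy sketch of the indexing paradigm (our own illustration; the quantization step and the routine that extracts invariant parameter vectors are assumptions):

from collections import defaultdict

def quantize(params, step=0.1):
    """Quantize an invariant parameter vector so it can serve as a lookup-table key."""
    return tuple(round(p / step) for p in params)

class ModelIndex:
    """A toy indexing table: invariant parameter vectors vote for candidate models."""

    def __init__(self):
        self.table = defaultdict(set)

    def add_model(self, name, invariant_vectors):
        for v in invariant_vectors:
            self.table[quantize(v)].add(name)        # fill the lookup table offline

    def candidates(self, invariant_vectors):
        votes = defaultdict(int)
        for v in invariant_vectors:
            for model in self.table.get(quantize(v), ()):
                votes[model] += 1                     # each hit votes for a model
        return sorted(votes, key=votes.get, reverse=True)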
Numerous other approaches to indexing exist [13.5, 13.32]; an excellent review is included
in [13.48]. The space requirements for some indexing schemes are analyzed in [13.25].
As data bases get larger, one must consider the image indexing problem in the context of
the entire digital library. The reader is directed to an entire special issue of IEEE Transactions
on Pattern Analysis and Machine Intelligence (August, 1996) which addresses this.
13A.4
³ To avoid cluttering this description of the indexing paradigm with lots of details, we ask the reader to tolerate the omission of some details. They are in the cited paper.
Five points in 3D, written in homogeneous coordinates, are linearly dependent, so the fifth may be expressed in terms of the other four:

X_5 = a X_1 + b X_2 + c X_3 + d X_4.   (13.20)
We make use of the observation that the determinant of a matrix of points is invariant to
rigid body motions,⁴ and write the determinant which is constructed from any four of the five
points, using as a subscript, the index of the point we omitted. For example,
M_1 = |X_2\; X_3\; X_4\; X_5|.   (13.21)
From the linear dependence of X_5 in Eq. (13.20), we substitute for X_5 in each case, deriving

M_1 = a|X_2\; X_3\; X_4\; X_1| + b|X_2\; X_3\; X_4\; X_2| + c|X_2\; X_3\; X_4\; X_3| + d|X_2\; X_3\; X_4\; X_4|.   (13.22)
This can be simplified by observing that the determinant of a matrix which has two identical columns is zero:

M_1 = a|X_2\; X_3\; X_4\; X_1|.   (13.23)
But this can be simplified even more by observing that if you interchange two columns, you flip the sign of the determinant:

M_1 = (-a)|X_1\; X_3\; X_4\; X_2| = a|X_1\; X_3\; X_2\; X_4| = (-a)|X_1\; X_2\; X_3\; X_4|.   (13.24)
So

M_1 = -a M_5.   (13.25)
Similarly,

M_2 = b M_5, \quad M_3 = -c M_5, \quad M_4 = d M_5.   (13.26)
From this, we can write an expression for the coefficients:

a = -\frac{M_1}{M_5}, \quad b = \frac{M_2}{M_5}, \quad c = -\frac{M_3}{M_5}, \quad d = \frac{M_4}{M_5}.   (13.27)
In 2D, the same five points project to a set of 3-vectors (again, using homogeneous coordinates), and

x_5 = a x_1 + b x_2 + c x_3 + d x_4.   (13.28)
We construct 3 × 3 matrices by leaving out two indices, and denoting by subscript the indices left out:

m_{12} = |x_3\; x_4\; x_5|.   (13.29)

⁴ In fact, absolute invariants of linear forms are always ratios of powers of determinants [13.19].
At this point, we simplify the notation, get rid of the x's and just keep track of the subscripts, rewriting the definition of m_{12}:

m_{12} = |3\; 4\; 5|.   (13.30)
As above, we can do algebra to relate the determinants and the coefficients, for example,

m_{12} = a|3\; 4\; 1| + b|3\; 4\; 2| = a|1\; 3\; 4| + b|2\; 3\; 4| = a\, m_{25} + b\, m_{15}   (13.31)
and

m_{13} = a\, m_{35} - c\, m_{15}, \qquad m_{14} = a\, m_{45} + d\, m_{15}.   (13.32)
We have determined forms for the coefficients in terms of the M_i's, and adding those relations into the equations we just derived produces

M_5 m_{12} + M_1 m_{25} - M_2 m_{15} = 0
M_5 m_{13} + M_1 m_{35} - M_3 m_{15} = 0   (13.33)
M_5 m_{14} + M_1 m_{45} - M_4 m_{15} = 0.
These relations are invariant to both 3D and 2D motions except for a multiplicative scale which affects all the M_i's the same. We can eliminate this dependence by using ratios and define 3D invariants

I_1 = \frac{M_1}{M_5}, \quad I_2 = \frac{M_2}{M_5}, \quad I_3 = \frac{M_3}{M_5}   (13.34)
and 2D invariants

i_{12} = \frac{m_{12}}{m_{15}}, \quad i_{13} = \frac{m_{13}}{m_{15}}, \quad i_{25} = \frac{m_{25}}{m_{15}}, \quad i_{35} = \frac{m_{35}}{m_{15}}.   (13.35)
The denominators are not zero, since they are the determinants of matrices which we know are nonsingular. Look at Eq. (13.33) and divide the top line by M_5:

\frac{M_5}{M_5} m_{12} + \frac{M_1}{M_5} m_{25} - \frac{M_2}{M_5} m_{15} = 0,   (13.36)

which simplifies to

m_{12} + I_1 m_{25} - I_2 m_{15} = 0.   (13.37)

Dividing by m_{15}, and treating the second line of Eq. (13.33) in the same way, gives

i_{12} + I_1 i_{25} - I_2 = 0
i_{13} + I_1 i_{35} - I_3 = 0.   (13.38)
So if we have 2D invariants we have two equations for the 3D invariants. The two equations of
Eq. (13.38) do not, unfortunately, determine the three 3D invariants. Still, those two equations determine a space line in the 3-space of the I's.
How do we use an idea like this? Given a 3D model of an object, and any five points, four of which are not coplanar, we can find I_1, I_2, and I_3, a point in a 3D space. To perform recognition, we first extract from the 2D image (generally several) 5-tuples of feature points
and from them construct the 2D invariants. Each 5-tuple gives rise to two equations in I1 ,
I2 , I3 space, that is, a straight line in the 3D invariant space. If a 5-tuple in the 2D image
is a projection of some 5-tuple in 3D, then the line so obtained will pass through the single
point representing the model. If we have a different projection of those ve points, we get a
different straight line, but it still passes through the model point.
Implementing this for realistic scenes is slightly more complicated than this description
because one must actually make use of projective geometry rather than assuming orthogonal
projections. Other complications arise in determining a suitable way to choose 5-tuples, and
a means for dealing with the fact that the line may almost pass through the point. Weiss
and Ray [13.52] address these issues.
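Under the simplifying assumptions above (orthographic projection, exact feature points), the invariants themselves are easy to compute; the following sketch is our own illustration:

import numpy as np

def invariants_3d(X):
    """3D invariants I1, I2, I3 of Eq. (13.34) from five homogeneous 4-vectors.

    X is a 4 x 5 array whose columns are X1..X5; M_k is the determinant formed
    by omitting column k, Eq. (13.21).
    """
    M = [np.linalg.det(np.delete(X, k, axis=1)) for k in range(5)]
    return M[0] / M[4], M[1] / M[4], M[2] / M[4]       # I1, I2, I3

def invariants_2d(x):
    """2D invariants of Eq. (13.35) from five homogeneous 3-vectors (3 x 5 array).

    m_jk is the determinant formed by omitting columns j and k, Eq. (13.29).
    """
    m = lambda j, k: np.linalg.det(np.delete(x, [j - 1, k - 1], axis=1))
    m15 = m(1, 5)
    return m(1, 2) / m15, m(1, 3) / m15, m(2, 5) / m15, m(3, 5) / m15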
13A.5 Conclusion
13A.5.1 Which model to use?
So far, we have described quite a collection of representations for objects, but certainly not
all that are in the literature. Other methods include variations on deformable models [13.10,
13.11], especially for range images [13.21].
Consider Fig. 13.13: Should you match it to a circle or a six-sided polygon? Clearly, there
is no simple answer to this question. If you have prior, problem-specific knowledge that you
are always dealing with circular objects, you might choose to use the circular model, which
is certainly less complex than that of a polygon. The idea of minimum description length
(MDL) provides some help along these lines. The MDL paradigm states that the optimal
representation for a given image may be determined by minimizing the combined length of
the encoding of the representation and the residual error. Interestingly, a MAP representation
can be shown [13.9, 13.30] to be equivalent to the MDL representation where the prior truly
represents the signal.
Schweitzer [13.43] uses the MDL philosophy to develop algorithms for computing the optic flow, and Lanterman [13.29] uses it to characterize infrared scenes in ATR applications: if there are several descriptions compatible with the observed data, we select the most parsimonious [13.29].
Rissanen [13.41] suggests that the quality of an object/model match could be represented by

L(x, \theta) = -\log_2 P(x|\theta) + L(\theta)   (13.39)

where x is the observed object, \theta is the model, represented as a vector of parameters, P(x|\theta) is the conditional probability of making this particular measurement given the model, and L(\theta) denotes the number of bits required to represent the model. The logarithm of the conditional probability is then a measure of how well the data fits the model. We thus may trade off a more precise fit of a more complex model with a less accurate fit of a simpler model [13.6].
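A sketch of how such a two-part cost might be evaluated (our own illustration; the fixed bits-per-parameter model cost is an assumption, not Rissanen's formulation):

import numpy as np

def description_length(x, model, n_parameters, bits_per_parameter=32):
    """Two-part description length in the spirit of Eq. (13.39).

    `model(x)` is a hypothetical callable returning P(x | theta); the model cost
    L(theta) is approximated as a fixed number of bits per parameter.  The model
    with the smallest total is the most parsimonious adequate description.
    """
    data_cost = -np.log2(model(x))                    # -log2 P(x | theta)
    model_cost = n_parameters * bits_per_parameter    # L(theta), in bits
    return data_cost + model_cost

# Hypothetical usage: compare a circle (few parameters) against a hexagon (more
# parameters) and keep whichever description of the observed boundary is shorter.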
Ultimately, machine vision is not going to be solved by one program, one algorithm, or
one set of mathematical concepts. Ultimately, its solution will depend on the ability to build
systems which integrate a collection of specialists. The jury is still out on how to accomplish
this. Regrettably, only a few papers have undertaken this formidable task. For example, Grosso
and Tistarelli [13.18] combine stereopsis and motion. Bilbro and Snyder [13.4] fuse luminance
and range to improve the quality of the range imagery, and Pankanti and Jain [13.38] fuse
stereo, shading, and relaxation labeling. Zhu and Yuille [8.80] incorporate the MDL approach,
including active contours and region growing, into a unified look at segmentation. Gong and
Kulikowski [13.16] use a planning strategy, primarily in the medical application area.
13A.5.2
A recurrent neural network is an optimization engine!
13A.6
Bibliography
[13.8] T. Chen and W. Lin, A Neural Network Approach to CSG-based 3-D Object Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7),
1994.
[13.9] T. Darrell and A. Pentland, Cooperative Robust Estimation Using Layers of
Support, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5),
1995.
[13.10] D. DeCarlo and D. Metaxas, Blended Deformable Models, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.11] S. Dickinson, D. Metaxas, and A. Pentland, The Role of Model-based Segmentation
in the Recovery of Volumetric Parts From Range Data, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(3), 1997.
[13.12] M. Dubuisson Jolly, S. Lakshmanan, and A. Jain, Vehicle Segmentation and
Classification using Deformable Templates, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18(3), 1996.
[13.13] M. Fischler and R. Elschlager, The Representation and Matching of Pictorial Structures, IEEE Transactions on Computers, 22(1), 1973.
[13.14] S. Gold and A. Rangarajan, A Graduated Assignment Algorithm for Graph Matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 1996.
[13.15] R. Golden, Mathematical Methods for Neural Network Analysis and Design,
Cambridge, MA, MIT Press, 1996.
[13.16] L. Gong and C. Kulikowski, Composition of Image Analysis Processes Through
Object-centered Hierarchical Planning, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(10), 1995.
[13.17] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, and N. Otsu, Face Recognition
System Using Local Autocorrelation and Multiscale Integration, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), 1996.
[13.18] E. Grosso and M. Tistarelli, Active/Dynamic Stereo Vision, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(9), 1995.
[13.19] G. Gurevich, Foundations of the Theory of Algebraic Invariants, Transl. Raddock
and Spencer, Groningen, The Netherlands, Nordcliff Ltd, 1964.
[13.20] S. Haykin, Neural Networks, A Comprehensive Foundation, Englewood Cliff, NJ,
Prentice-Hall, 1999.
[13.21] M. Hebert, K. Ikeuchi, and H. Delingette, Spherical Representation for Recognition of Free-form Surfaces, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 17(7), 1995.
[13.22] D. Heisterkamp and P. Bhattachaya, Matching of 3D Polygonal Arcs, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1997.
[13.23] J. Hopfield, Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proceedings of the National Academy of Sciences, 79, pp. 2554–2558, 1982.
[13.24] B. Hussain and M. Kabuka, A Novel Feature Recognition Neural Network and its
Application to Character Recognition, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(1), 1994.
[13.25] D. Jacobs, The Space Requirements of Indexing Under Perspective Projections,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3), 1996.
[13.43] H. Schweitzer, Occam Algorithms for Computing Visual Motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(11), 1995.
[13.44] S. Sclaroff and A. Pentland, Model Matching for Correspondence and Recognition,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6), 1995.
[13.45] K. Sengupta and K. Boyer, Organizing Large Structural Modelbases, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4), 1995.
[13.46] L. Shapiro and J. M. Brady, Feature-based Correspondence: an Eigenvector
Approach, Image and Vision Computing, 10(5), 1992.
[13.47] X. Shen and P. Palmer, Uncertainty Propagation and Matching of Junctions as Feature Groupings, IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(12), 2000.
[13.48] A. Smeulders, M. Worring, S. Santini, G. Gupta, and R. Jain, Content-based Image
Retrieval at the End of the Early Years, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(12), 2000.
[13.49] D. Swets and J. Weng, Using Discriminant Eigenfeatures for Image Retrieval,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 1996.
[13.50] M. Turk and A. Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1), pp. 71–86, 1991.
[13.51] X. Wang and H. Qi, Face Recognition Using Optimal Non-orthogonal Wavelet
Basis Evaluated by Information Complexity, International Conference on Pattern
Recognition, vol. 1, pp. 164–167, Quebec, Canada, August, 2002.
[13.52] I. Weiss and M. Ray, Model-based Recognition of 3D Objects from Single Images,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 2001.
[13.53] M. Yang and J. Lee, Object Identication from Multiple Images Based on Point
Matching under a General Transformation, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 16(7), 1994.
[13.54] S. Yoon, A New Multiresolution Approximation Approach to Object Recognition,
Ph.D. Thesis, North Carolina State University, 1995.
14
Statistical pattern recognition
Statistics are used much like a drunk uses a lamppost: for support, not illumination
Vin Scully
The discipline of statistical pattern recognition by itself can fill textbooks (and in fact,
it does). For that reason, no effort is made to cover the topic in detail in this single
chapter. However, the student in machine vision needs to know at least something
about statistical pattern recognition in order to read the literature and to properly
put the other machine vision topics in context. For that reason, a brief overview
of the field of statistical methods is included here. To do serious research in machine vision, however, this chapter is not sufficient, and the student must take a
full course in statistical pattern recognition. For texts, we recommend several: The
original version of the text by Duda and Hart [14.3] included both statistical pattern classification and machine vision; however, the new version [14.4] is pretty much limited to classification, and we recommend it for completeness. The much older text by Fukunaga [14.6] still retains a lot of useful information, and we recommend
[14.11] for readability.
Fig. 14.1. A linear decision boundary.
14.1.1
Figure 14.1 illustrates the result of a large number of measurements of two different
industrial parts. The features, area and length have both been measured and each
measurement indicated in the figure by a single mark. (A chart like this is called a
scatter graph.) The x points represent one class, flanges, and the o points represent a second class, gaskets. A linear decision boundary has been drawn on the figure. A linear decision rule would be as follows: Decide the unknown object is a flange if the result of the measurements lies on the left of the decision boundary; otherwise decide it is a gasket. Linear decision rules are particularly attractive because they can be
implemented by linear machines which have a great deal of potential parallelism
and therefore high speed. As can be seen from the figure, these two classes are
not linearly separable. That is, there does not exist any one straight line which
completely partitions the two classes. The choice of the best such straight line is the
result of the linear classifier design process. A variation on linear machines is given
in section 14A.2, where we introduce support vector machines.
14.1.2
Conditional probabilities play a critical role in maximum likelihood functions.
14.1.3 Supervised learning
If we are given one training set for each class and from those training sets we can
develop the statistical representations of the classes, then this process is known as
supervised learning. The word supervised refers to the fact that each data point is
independently labeled according to the class to which it belongs. Each class may then
be characterized either statistically by its mean, variance or other statistical measures,
or by some other parametric representation. Fig. 14.1 illustrates the data distribution
resulting from supervised sampling, i.e., the x points are identified as belonging to
one class and the o points to another. The example we have used previously in section
13.2, of distinguishing axes from hatchets, is a supervised learning problem, since
it was assumed that we had training sets of both classes.
14.1.4 Unsupervised learning
Fig. 14.2. In unsupervised learning, only one set of measurements is taken; however, such data
may fall into natural clusters.
14.2.1 Bayes rule
We define P(w_i) to represent the a priori probability that class w_i occurs, that is, the probability of class w_i occurring before any measurements are made. For example, suppose we have a factory which manufactures flanges and gaskets, but makes nine
Fig. 14.3. The same measurement on two objects may have different average values, but due to
noise in the measurement, or actual variation, these values may overlap.
times as many flanges as gaskets. Flanges and gaskets may come down the conveyor at random times. But because of our a priori knowledge that the plant manufactures nine times as many flanges as gaskets, we know that we are much more likely to see a flange than a gasket if we choose to look at the conveyor at some random time. Thus the a priori probability of flanges is 0.9 and the a priori probability of gaskets is 0.1.
We define p(x|w_i) to represent the conditional probability density of a measurement x occurring given that the sample is known to come from class w_i. For a particular w_i, p(x|w_i) should be thought of as a function of x. Suppose we have a factory which manufactures axes and hatchets. Then we might find the probability densities for the lengths of axes and hatchets to be represented by Fig. 14.3. In that figure we see that an axe is most likely to be 30 inches long and a hatchet most likely to be 12 inches, but that some variation in length can occur.
The probability density function may be characterized in several possible ways.
One is by simply tabulating the number of times a particular value occurs for each
possible value of the variable, in this case, length. Such a tabulation is referred to
as a histogram of the variables. Properly normalized, a histogram can be a useful
representation of a probability density function, but of course requires that only a
finite number of possible values may exist. One may also describe a density function
in a parametric way using some analytic function (e.g., the Gaussian) to represent
the density.
Finally, we define P(w_i|x), the posterior conditional probability, to represent the
conditional probability that the object being observed belongs to a class w i given a
measurement x. P(w i |x) is what we are looking for. We will use it as our decision
rule or, more correctly, as our discriminant function. Our decision rule will then be
as follows: For a measurement x made on an unknown object, compute P(w i |x)
for each class, that is, for each possible value of i. Then decide that the unknown belongs to the class i for which P(w_i|x) is greater than P(w_j|x) for all j ≠ i.
When we make a classification decision based on P(w_i|x), we are using a maximum likelihood classifier.
We can relate the three functions just defined by using Bayes rule:

P(w_i|x) = \frac{p(x|w_i)\, P(w_i)}{\text{Something}}   (14.1)

\text{Something} = p(x) = \sum_{j=1}^{c} p(x|w_j)\, P(w_j).   (14.2)
In Eq. (14.1) we used something to represent the denominator for the conditional
probability density. We used the word something to call attention to the fact that this
number represents the probability density of that value of x occurring, independent
of the class of the observation. Since this number is independent of the class, and is
the same for all classes, it therefore does not provide us any help in distinguishing
which class is most likely. Instead, it is a normalization constant which we use to
ensure that the number P(w i |x) has the desirable properties of a probability; that
is, it lies between 0 and 1 and when summed over all the classes, it sums to 1 (the
observed object belongs to at least one of the classes which we are considering).
In a sense, Equation (14.1) solves the pattern recognition problem. It tells us how
to make a decision, assuming we know each of the components of the RHS. In the
next section we consider how one goes about determining those components.
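As a sketch of those components in use (our own illustration in Python; the densities and priors in the usage comment are made-up numbers):

import numpy as np

def posteriors(x, priors, densities):
    """Posterior class probabilities via Bayes rule, Eqs. (14.1)-(14.2).

    `priors` is a list of P(w_i) and `densities` a list of callables p(x | w_i).
    """
    joint = np.array([p(x) * P for p, P in zip(densities, priors)])  # p(x|w_i) P(w_i)
    return joint / joint.sum()          # divide by "Something" = p(x), Eq. (14.2)

# Hypothetical usage with made-up Gaussian length densities for axes and hatchets:
# gauss = lambda mu, s: (lambda x: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi)))
# print(posteriors(24.0, priors=[0.5, 0.5], densities=[gauss(30, 4), gauss(12, 2)]))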
14.2.2
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right).   (14.3)

Here p(x) is known as the univariate Gaussian (normal) density. The word univariate refers to the fact that x is a scalar, a single variable. Here, the mean, \mu, and standard deviation, \sigma, comprise the elements of a parameter vector:

\theta_i = \begin{bmatrix} \mu_i \\ \sigma_i \end{bmatrix}.   (14.4)

These two numbers serve to completely describe the conditional probability density of the variable x for class i assuming that the density has a Gaussian form.
In the multivariate case, the class conditional density is

p(x|w_i) = \frac{1}{(2\pi)^{d/2} |K_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T K_i^{-1} (x - \mu_i) \right).   (14.5)

In Eq. (14.5), d is the dimensionality of the vector x, \mu_i is the mean vector which represents the average (vector-valued) value of the random vector of measurements, and K_i is a d × d covariance matrix.
Thus, in the univariate Gaussian case, we can represent the class conditional
density of the measurements by two numbers, and in the multivariate case by a
d-dimensional vector and a d by d matrix. Given these parameters, we can easily
substitute them into Eq. (14.5) and then into Eq. (14.1) and compute the most likely
class for an unknown object. Unfortunately, in most applications we are not given
the mean and covariance but rather must estimate them from training sets.
Take the log of the RHS of Eq. (14.5). That will eliminate the exponential, and,
since the logarithm is monotonic, will result in an expression involving the measurement vector x, and the statistics \mu_i and K_i which characterize class i. This
expression is maximized in the same way as the original probability. Classification
is now straightforward: Just substitute x into each equation you can generate like
this using all the different means and covariances. This gives you c (assuming you
have c classes) different functions, called discriminant functions, all of which have
the same form but which have different parameters. Assign x to the class for which
the corresponding discriminant function is the largest.
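A sketch of this procedure (our own illustration; constants common to all classes are dropped, which does not change the decision):

import numpy as np

def gaussian_discriminant(x, mu, K, prior):
    """Log-discriminant for one class: log p(x|w_i) + log P(w_i) in the Gaussian case."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * diff @ np.linalg.solve(K, diff)   # quadratic (Mahalanobis) term
            - 0.5 * logdet                            # -1/2 log |K_i|
            + np.log(prior))                          # prior term from Bayes rule

def classify(x, classes):
    """Assign x to the class whose discriminant function is largest.

    `classes` is a hypothetical dict: name -> (mu, K, prior).
    """
    scores = {name: gaussian_discriminant(x, mu, K, p)
              for name, (mu, K, p) in classes.items()}
    return max(scores, key=scores.get)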
14.2.3 Density estimation
Since the Gaussian (normal) density occurs so often in actual distributions of random
variables, and is so convenient to work with, we will use it often, and treat it as a
special case.
To design a pattern classifier using supervised methods, we need to estimate the parameters of the density from a training set of samples. We will denote the parameter set by the vector \theta.
The univariate Gaussian case
Assuming that the samples are drawn independently, the probability of drawing the entire set X_i is determined by

p(X_i) = \prod_{k=1}^{n_i} p(x_{ik}).   (14.6)

Writing the dependence on the parameters explicitly,

p(X_i|\theta_i) = \prod_{k=1}^{n_i} p(x_{ik}|\theta_i).   (14.7)
The maximum likelihood estimate of \theta_i is then defined as that value \hat{\theta}_i which maximizes p(X_i|\theta_i). Eq. (14.7) describes the likelihood of any particular training set occurring, given that the probability distribution is described by the parameter vector \theta_i. Since we are dealing with the Gaussian density, we rewrite Eq. (14.7) as

p(X_i|\theta_i) = \prod_{k=1}^{n_i} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(x_{ik} - \mu_i)^2}{2\sigma_i^2} \right).   (14.8)
Now an important observation: The value of \theta_i which maximizes p(X_i|\theta_i) also maximizes \ln[p(X_i|\theta_i)]. This is true because the natural logarithm is a monotonically increasing function. Thus, we have our choice of finding the parameter vector \theta_i which maximizes either the density or its logarithm. The logarithm will be much easier to use. Taking the log of the RHS we find

\ln(p(X_i|\theta_i)) = \sum_{k=1}^{n_i} \ln\frac{1}{\sqrt{2\pi}\,\sigma_i} - \frac{1}{2}\sum_{k=1}^{n_i} \left( \frac{x_{ik} - \mu_i}{\sigma_i} \right)^2.   (14.9)
14.2.4
Differentiating Eq. (14.9) with respect to \mu_i and setting the derivative to zero gives

\sum_{k=1}^{n_i} \frac{x_{ik} - \mu_i}{\sigma_i^2} = 0,   (14.10)

which simplifies to

\sum_{k=1}^{n_i} x_{ik} - n_i \mu_i = 0   (14.11)

and

\hat{\mu}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik}.   (14.12)

\hat{\mu}_i is known as the sample mean, and the fact that it is equal to the average value is certainly intuitively satisfying.
14.2.5
The parameter vector now has two elements,

\theta = \begin{bmatrix} \mu \\ \sigma \end{bmatrix}.   (14.13)

Rewrite Eq. (14.9), to make it a bit simpler, and the log of the probability becomes

L = \sum_{k=1}^{n} \ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2}\sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^2.   (14.14)

Differentiating with respect to \mu and setting the result to zero gives

\frac{1}{\sigma^2}\sum_{k=1}^{n}(x_k - \mu) = 0,   (14.15)

and differentiating with respect to \sigma gives

-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{k=1}^{n}(x_k - \mu)^2 = 0.   (14.16)

Equation (14.15) simplifies to

\sum_{k=1}^{n} x_k = n\mu   (14.17)

and

\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k   (14.18)

as before.
Equation (14.16) simplifies similarly to yield

\frac{n}{\sigma} = \frac{1}{\sigma^3}\sum_{k=1}^{n}(x_k - \mu)^2,   (14.19)

and therefore

\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2.   (14.20)

Thus we see that the best estimates for the parameters of a normal density are the familiar sample mean and sample variance.
In the multivariate case, the analogous estimates are

\hat{\mu}_i = \frac{1}{n_i}\sum_{k=1}^{n_i} x_{ik}   (14.21)

and

K_i = \frac{1}{n_i}\sum_{k=1}^{n_i} (x_{ik} - \hat{\mu}_i)(x_{ik} - \hat{\mu}_i)^T.   (14.22)

Look at how K is defined. Is it a matrix? A scalar? A vector?

Thus we have essentially the same results for the multivariate case as for the univariate case: That the best estimate (in the maximum likelihood sense) of the mean and variance of a Gaussian are the sample mean and sample (co)variance.
Now what is there to remember from this chapter? To perform the maximum
likelihood estimate of a set of parameters, given a training set, assume independence
(if you can) and write the probability of the entire training set occurring as a product.
Take logs, differentiate, and set to zero to produce a set of simultaneous equations
which, when solved, will be the best estimates of the parameters. This approach
works for any distribution, not just the Gaussian. However, for some cases, the
process of solving the system of simultaneous equations may be intractable.
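For the Gaussian case, that recipe reduces to a few lines (our own illustration in Python):

import numpy as np

def ml_gaussian_estimates(samples):
    """Maximum likelihood estimates for one Gaussian class: sample mean and covariance.

    A minimal sketch of Eqs. (14.21)-(14.22); `samples` is an (n_i, d) array of
    training vectors for the class.
    """
    X = np.asarray(samples, dtype=float)
    mu = X.mean(axis=0)                      # sample mean, Eq. (14.21)
    diff = X - mu
    K = (diff.T @ diff) / X.shape[0]         # sample covariance, Eq. (14.22)
    return mu, K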
Finally, there are other ways to find parameters, other than maximum likelihood;
techniques which space and time do not permit us to cover here.
14.2.6
P(w_1|x) = \frac{p(x|w_1)\, P(w_1)}{p(x)}.   (14.23)
As mentioned above, remember that p(x) is the same regardless of whether x belongs
to class 1 or 2. Since this denominator is unaffected by the classification decision,
we can ignore it in making that decision.
In the two-class case, we choose class 1 if P(w 1 |x) > P(w 2 |x), or, substituting
Bayes rule, we choose class 1 if
p(x|w 1 )P(w 1 ) > p(x|w 2 )P(w 2 ),
(14.24)
that is,

\frac{p(x|w_1)}{p(x|w_2)} > \frac{P(w_2)}{P(w_1)}.   (14.25)
The expression on the left is known as the likelihood ratio. The relationship in
Eq. (14.25) provides a true/false relationship between the likelihood ratio and the
prior information. If it is false, we choose class two. Observe that this form was
derived by making the decision which maximizes the probability of making a correct
decision, using knowledge of the measurement and the prior probability of the
classes. We could use other criteria as well. For example, instead of maximizing
the probability, we could choose to minimize the conditional risk.
(14.26)
(14.27)
We also use the notation P(error|w 2 ) to mean the probability that we make an
incorrect decision when w 2 is the true state. Fig. 14.4 illustrates the a posteriori
probability density of two classes, the process of deriving the decision boundary,
and the probability of error.
In general, if p(x|w_1)P(w_1) > p(x|w_2)P(w_2), we should decide that x is in region R_1,
so that the smaller term contributes to the error integral. That is exactly what Bayes
decision rule does.
Fig. 14.4. The a posteriori probability density function of two classes as Gaussian, the decision
boundary, and the probability of error.
In the multiclass case, since there are more ways to be wrong than right, it is
simpler to compute the probability of being correct:
P(correct) = Σ_{i=1}^{c} P(x ∈ R_i, w_i) = Σ_{i=1}^{c} P(x ∈ R_i | w_i) P(w_i) = Σ_{i=1}^{c} ∫_{R_i} p(x|w_i) P(w_i) dx.        (14.28)
This result will be valid no matter how the feature space is partitioned. The Bayes classifier maximizes this probability by choosing regions which maximize the integrals.
Consider, for example, the case
p(x = 0) = 0, p(x = 1) = 0, p(x = 2) = 0.5, p(x = 3) = 0.5.
Would you agree that the expected value of x is 2.5? That is, half the time one would expect to see x equal to 2, and half the time, x will be 3.
Now, generalize that concept to lots of possible values with a probability associated
with each value. If the number of possible values is nite, the expected value is a
sum
x̄ = Σ_x x P(x).        (14.29)
In the more general case, when x is continuous, we will replace the probability with
a density, and replace the summation with an integral.
Suppose we observe x and contemplate taking action α_i. If the true state of nature is w_j, we incur loss C_{ij}. Since P(w_j|x) is the probability that w_j is true, the expected loss associated with action α_i is

r_i = Σ_{j=1}^{c} C_{ij} P(w_j|x).        (14.30)
The expected loss is called the risk. We write r_i as r(α_i|x) to make it clear that this is the conditional risk.
We wish to minimize the total risk, which we denote as r, by choosing the best α_i. A decision rule is a function α(x) that tells us what action to take to minimize r. For every x the decision function α(x) assumes one of the values α_1, . . . , α_a. The overall risk is associated, then, with the decision rule.
Since r_i or r(α_i|x) is the conditional risk, the overall risk is

r = ∫ r(α(x)|x) p(x) dx.        (14.31)

If we choose α(x) so that r(α(x)|x) is as small as possible for every value of x, we minimize the overall risk. Thus, to minimize r, compute r_i and select the action α_i for which r_i is minimal. The resulting overall risk is called the Bayes risk.
14.4.1
If there are only two classes, we have four possibilities, two that we made the correct decision, and two that we were wrong. The total risk therefore is:

r = ∫_{R_1} C_{11} P(w_1|x) dx + ∫_{R_1} C_{12} P(w_2|x) dx + ∫_{R_2} C_{21} P(w_1|x) dx + ∫_{R_2} C_{22} P(w_2|x) dx.        (14.32)

As you read these terms, consider the concepts of false negative, false positive, true positive, and true negative. Which of these terms represents the total amount of each of these? Reminder: C_{ij} is the cost of deciding i when reality is j.
That is, the probability that we guessed it was in class 1, integrated over the region of x in which our decision rule says we SHOULD guess class 1, plus . . .
Rewriting all that, we get

r = ∫_{R_1} [C_{11} P(w_1|x) + C_{12} P(w_2|x)] dx + ∫_{R_2} [C_{21} P(w_1|x) + C_{22} P(w_2|x)] dx.        (14.33)
Since everything not in R_1 is in R_2, the second integral can be taken over the whole space and the R_1 part subtracted off:

r = ∫ [C_{21} P(w_1|x) + C_{22} P(w_2|x)] dx + ∫_{R_1} [C_{11} P(w_1|x) + C_{12} P(w_2|x)] dx − ∫_{R_1} [C_{21} P(w_1|x) + C_{22} P(w_2|x)] dx,        (14.34)

which reorganizes to

r = c_2 + ∫_{R_1} ((C_{11} − C_{21}) P(w_1|x) + (C_{12} − C_{22}) P(w_2|x)) dx,        (14.35)

where c_2 = ∫ [C_{21} P(w_1|x) + C_{22} P(w_2|x)] dx does not depend on the choice of R_1.
Our objective is to minimize this quantity (remember, it is the risk incurred in making all four possible decisions). That is, the decision rule is really the determination of the decision region(s). In this case, since there are only two decision regions, and everywhere that we do not decide class one, we decide class two, all we need to do is to determine the region R_1. To accomplish that, first, we need to make an
assumption. We assume that the cost of making a correct decision is always less than the cost of an error. So (C_{11} − C_{21}) < 0, etc.
How do we choose the limits of an integral (and remember, the limits of the integral are in fact the boundaries of the region where we decide class 1) such that the integral is maximally small? Simply choose the decision region such that the integrand is negative everywhere. Doing so produces the condition required for region R_1 to be chosen: Choose R_1 such that

(C_{11} − C_{21}) P(w_1|x) + (C_{12} − C_{22}) P(w_2|x) < 0.        (14.36)
Replace the posterior probabilities with the product of the conditional densities and
prior probabilities to get
(C_{11} − C_{21}) p(x|w_1) P(w_1) < (C_{22} − C_{12}) p(x|w_2) P(w_2),        (14.37)
(14.37)
which after appropriate algebraic manipulation becomes the decision rule: Choose
class 1 if
p(x|w_1) / p(x|w_2) > (C_{12} − C_{22}) P(w_2) / ((C_{21} − C_{11}) P(w_1)).        (14.38)

Try this: Substitute the symmetrical cost function into Eq. (14.38) and see how the likelihood ratio test simplifies.
A special case of interest is the symmetric cost function

C_{ij} = 0 if i = j, and C_{ij} = 1 if i ≠ j,        (14.39)

so that all errors are equally costly and there is no cost for a correct decision. We may now rewrite the conditional risk, the cost of making decision α_i, as

r_i = Σ_{j=1}^{c} C_{ij} P(w_j|x) = Σ_{j ≠ i} P(w_j|x) = 1 − P(w_i|x).        (14.40)
Thus, to minimize the average probability of error, we select α_i as the action which maximizes the a posteriori probability P(w_i|x). That is, for minimum cost, we decide w_i if P(w_i|x) > P(w_j|x) for all j ≠ i, which we have already seen is the simple maximum likelihood classifier. Thus, we see that the maximum likelihood classifier minimizes the Bayes risk associated with a symmetric cost function.
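As an illustration of the minimum-risk rule of Eq. (14.38) (a sketch of ours, not taken from the text), the following Python code decides between two classes with univariate Gaussian class-conditional densities; the means, variances, priors, and cost matrix are placeholder values.

import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, as in Eq. (14.8) for a single sample."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def decide(x, params1, params2, P1=0.5, P2=0.5, C=((0, 1), (1, 0))):
    """Minimum-risk two-class decision, Eq. (14.38).

    params1 and params2 are (mean, variance) pairs for the two class
    conditional densities.  C[i][j] is the cost of deciding class i+1
    when class j+1 is true; the default is the symmetric cost function.
    Returns 1 or 2.
    """
    (C11, C12), (C21, C22) = C
    likelihood_ratio = gaussian_pdf(x, *params1) / gaussian_pdf(x, *params2)
    threshold = (C12 - C22) * P2 / ((C21 - C11) * P1)
    return 1 if likelihood_ratio > threshold else 2

# Example: class 1 ~ N(0, 1), class 2 ~ N(3, 4), equal priors, symmetric costs
for x in (-1.0, 1.0, 2.0, 4.0):
    print(x, decide(x, (0.0, 1.0), (3.0, 4.0)))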
A = K_1^{-1} − K_2^{-1},    b = 2(K_2^{-1} μ_2 − K_1^{-1} μ_1),        (14.42)

and

c = μ_1^T K_1^{-1} μ_1 − μ_2^T K_2^{-1} μ_2 + ln(|K_1| / |K_2|).        (14.43)
And the decision rule becomes: Decide class 1 if g(x) < T . In this formulation, we
see clearly why the Gaussian parametric classier is known as a quadratic classier.
Let's examine the implications of this rule. Consider the quantity (x − μ_1)^T K_1^{-1} (x − μ_1). This is some sort of measure involving a measurement, x, and a class parameterized by a mean vector and a covariance matrix. This quantity is known as the Mahalanobis distance.
First, let's look at the case that the covariance is the identity. Then, the Mahalanobis distance simplifies to (x − μ_1)^T (x − μ_1). That is, take the difference between the measurement and the mean. That is a vector. Then take the inner product of that vector with itself, which is, of course, the squared magnitude of that vector. What is this quantity? Of course! It is just the (squared) Euclidean distance between the measurement and the mean of the class. If the prior probabilities are the same and we use symmetric costs, the threshold T works out to be zero, and the decision rule simplifies to:
Decide class 1 if
(x − μ_1)^T (x − μ_1) − (x − μ_2)^T (x − μ_2) < 0        (14.44)
else decide class 2. If the measurement is closer to the mean of class 1 than class 2,
this quantity is less than zero. Therefore, we refer to this (very simplied) classier
as a nearest mean classier, or nearest mean decision rule.
Now, let's complicate the rule a bit. We no longer assume the covariances are equal to the identity, but do assume they are equal to each other (K_1 = K_2 ≡ K). In this case, look at Eq. (14.42) and notice that the A matrix becomes zero. Now, the operations are not quadratic any more. We have a linear classifier.
We could choose to ignore the ratio of the determinants of the covariance matrices, or, more appropriately, to include that number in the threshold T. Then we have a minimum distance decision rule, but now the distance used is not the Euclidean distance. We refer to this as a minimum Mahalanobis distance classifier.
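A minimal sketch (ours) of the minimum-distance rules just described: with a common covariance K it is the minimum Mahalanobis distance classifier, and with K equal to the identity it reduces to the nearest mean classifier. The example means and covariance are made up.

import numpy as np

def mahalanobis_sq(x, mu, K):
    """Squared Mahalanobis distance (x - mu)^T K^{-1} (x - mu)."""
    d = np.asarray(x, float) - np.asarray(mu, float)
    return float(d @ np.linalg.solve(K, d))

def minimum_distance_classify(x, means, K):
    """Assign x to the class whose mean is closest in the Mahalanobis sense.

    'means' is a list of class mean vectors; K is the covariance matrix
    assumed shared by all classes (K1 = K2 = ... = K).  Passing the
    identity for K gives the nearest mean (Euclidean) rule.
    """
    dists = [mahalanobis_sq(x, mu, K) for mu in means]
    return int(np.argmin(dists)) + 1        # classes numbered from 1

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
K = np.array([[2.0, 0.5], [0.5, 1.0]])
print(minimum_distance_classify([1.0, 2.0], means, K))
print(minimum_distance_classify([1.0, 2.0], means, np.eye(2)))   # nearest mean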
Here is another special case: What if the covariance is not only equal, but diagonal?
Now, the Mahalanobis distance takes on a special form. We illustrate this by using
a three-dimensional measurement vector, and letting the mean be zero:
[x_1 x_2 x_3] [ 1/σ_11   0       0
                0        1/σ_22  0
                0        0       1/σ_33 ] [x_1 x_2 x_3]^T

which we expand to

x_1²/σ_11 + x_2²/σ_22 + x_3²/σ_33.
Do you recall seeing this
ellipse discussion
somewhere else in this
book?
This is the equation of an ellipsoid, centered at the origin (or, in the case that the
mean is not zero, centered at the mean) with axes located along the coordinate axes.
In the more general case, with covariance which is not diagonal, the only thing
that happens is that this ellipsoid may rotate. So, the equation which represents the
Mahalanobis distance from a point to a class produces an ellipsoid.
Here is one more interesting case: Suppose the covariances are the same, diagonal, and proportional to the identity, K_i = σ²I. Now, the discriminant function for class i takes on the form

g_i(x) = (2μ_i^T x − |μ_i|²) / (2σ²) + ln P(w_i).        (14.45)

Remember! An inner product computes a projection.
Assume further that the magnitudes of all the means are the same. That is, all the means are located on a hypersphere centered at the origin. Then, we do not need to consider the second term in Eq. (14.45), and the discriminant function simplifies to

g_i(x) = μ_i^T x = Σ_{k=1}^{d} μ_{ik} x_k,        (14.46)
that is, the probability of a particular state of nature times all the decisions we could
make if that were the state of nature, and the cost associated with those decisions.
To see what this means clearly, think about the two-class case and let i = 1. We
observe that r is linear in P(w i ). Thus, r is maximized at one extreme of P(w 1 ) or
the other, e.g., P(w 1 ) = 0 or P(w 1 ) = 1. If we let C11 = C22 = 0 then the maximum
of r becomes either

∫_{R_1} C_{12} p(x|w_2) dx        (14.48)

or

∫_{R_2} C_{21} p(x|w_1) dx.        (14.49)

Since R_1 ∪ R_2 is the complete space, then the worst case is

max { ∫_{R_1} C_{12} p(x|w_2) dx,  ∫_{R_2} C_{21} p(x|w_1) dx }        (14.50)
(14.51)
This method utilizes a volume¹ V around the unknown. We simply count the number of points from the various classes which occur. Then the class-conditional density is estimated by

p(x|w_m) = k_m / (n_m V),        (14.52)

where k_m is the number of samples in class m inside the volume V centered at x, and n_m is the total number of samples in class m in the training set.
Use of a constant volume is a problem, because in regions which are densely
populated (many training set points nearby) the volume will contain many points,
resulting in too much smoothing, whereas in more sparsely populated areas, the same
volume results in estimates which are not sufficiently representative. The simple
solution is to let the volume depend on the data. For example, to estimate p(x) from
n samples, one can center a cell about x and let it grow until it contains kn samples,
where kn is some (yet to be specied) function of n. If the density of samples near x
is high, then the volume will be small, resulting in good resolution. If the density is
small, then the region will grow, providing smoothing. Duda et al. [14.4] point out that, with the total number of training samples

n = Σ_{i=1}^{c} n_i,        (14.53)

the class-conditional density, the prior, and the mixture density may all be estimated from the k samples inside the volume:

p(x|w_m) = k_m / (n_m V)        (14.54)

P(w_m) = n_m / n        (14.55)

p(x) = k / (nV).        (14.56)

If we apply Bayes rule to Eqs. (14.54)–(14.56), we find

P(w_m|x) = k_m / k.        (14.57)
Of course, in more than three dimensions, this is a hypervolume. For simplicity, we will continue to use the
word volume with the understanding that no limit on dimensionality exists.
This rule tells us to look in a neighborhood of the unknown feature vector for k samples. If, within that neighborhood, more samples lie in class i than any other class, we assign the unknown as belonging to class i. We thus have the k-nearest-neighbor classification rule.
The student should note that in the k-NN strategy, we have never defined precisely how nearest should be computed. The Euclidean metric is generally assumed to be the most reasonable measure for distance, but others may certainly be used.
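A minimal k-nearest-neighbor classifier in the spirit of this rule (our sketch, not from the text), using the Euclidean metric; the training data are invented for illustration and ties are broken arbitrarily.

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """k-nearest-neighbor rule: vote among the k closest training samples.

    train_X is an (n, d) array of feature vectors, train_y the matching
    array of class labels.  Distance is Euclidean.
    """
    d = np.linalg.norm(train_X - np.asarray(x, float), axis=1)
    nearest = np.argsort(d)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
train_y = np.array([1, 1, 2, 2, 2])
print(knn_classify([0.1, 0.2], train_X, train_y, k=3))   # expect class 1
print(knn_classify([1.0, 0.9], train_X, train_y, k=3))   # expect class 2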
In the authors' own experience in classifying large data sets of industrial data, we have found nearest neighbor algorithms to work surprisingly well.
A major practical disadvantage of the k-NN strategy for classification is the fact that all the data must be stored. This can be a massive storage burden, especially when compared with parametric methods which require only a few points. The computational burden associated with the k-NN techniques can likewise be significant, since, in order to find the k nearest neighbors, the distance from the unknown to all the neighbors must be determined. Heuristics have been published which speed up this process significantly, and the student is referred to the literature for suggestions. See, for example, the condensed nearest-neighbor rule described in Hand [15.7].
14.8 Conclusion
In this brief introduction to statistical pattern recognition, you have seen how statistical methods can assist in the process of making decisions. You have also noticed how
pervasive the optimization approach to problem solving is. Probability densities are
estimated using maximum likelihood methods, where the likelihood is a product of
probabilities. For Gaussian forms, maximum likelihood simplifies to sum-squared
error.
You learned how to find decision regions which minimize the total risk, even when different decisions have different costs, by finding limits of integration which make the argument of the integral negative. Even in the case that the risk cannot be computed, we can develop a scheme which minimizes the maximum risk.
Classification is often considered a minimum distance process. That is, we make the decision which minimizes some sort of distance, and you have seen several examples in this chapter.
14.9 Vocabulary
You should know the meaning of the following terms.
Bayes rule
Cluster
Conditional density
Decision boundary
Decision rule
Discriminant function
Feature vector
Likelihood ratio
Linear machine
Linearly separable
Maximum likelihood
Minimax
Multivariate
Prior probability
Quadratic classifier
Risk
Supervised learning
Training set
Univariate
Unsupervised learning
Assignment
14.1
Assume class 1 and 2 are well represented by Gaussian
densities with the following parameters: Class 1 mean
= 0, variance = 1. Class 2 mean = 3, variance = 4.
Substitute the forms for the Gaussian into Eq. (14.25)
and derive an equation which gives the range of x in
which class 1 is chosen. You will need to make a reasonable assumption about prior probabilities (equal
probabilities are often chosen).
Hint: After doing the substitution, take natural
logarithms of both sides.
Assignment
14.2
In a one-dimensional problem, the conditional density
for class 1 is Gaussian with mean 0 and variance 2; for
class 2, the conditional density is also Gaussian with
mean 3 and variance 1. That is:
p(x|w_1) = (1/(√(2π) √2)) exp( −(1/2)(x/√2)² )

p(x|w_2) = (1/√(2π)) exp( −(1/2)(x − 3)² )
(1) Sketch the two densities on the same axis.
(2) What is the likelihood ratio?
and C 21 = 3, use the integral form for the probability of error assuming a Bayes decision rule.
Assignment
14.3
In a one-dimensional problem the class-conditional densities for a feature x are

p(x|w_1) = exp(−(x − r)) for x ≥ r, and 0 otherwise,

p(x|w_2) = exp(x − 3) for x < 3, and 0 otherwise,

where P(w_1) = P(w_2) = 0.5.
(1) Assume that r < 3, and sketch the densities. Determine the decision rule that minimizes the probability of error, and indicate what that decision rule
means by marking a point on the x axis.
(2) Find the value of r which minimizes P(error|w 2 ).
Topic 14A
14A.1
xi =
n
i j
(14.58)
j=1
is computed, producing a 25-element vector which in some sense describes the image.
So for every image, we have a vector consisting of 25 numbers. Using that 25-vector, the
challenge is to properly make a decision. The first step is to reduce the dimensionality to something more manageable than 25.
We look for a method for reducing the dimensionality from, in general, d dimensions, to c − 1 dimensions, where we are hoping to classify the data into c classes. (Somehow, we must know c, which in this example is the number of individual faces.) The following strategy is an extension of a method known in the literature as Fisher's linear discriminant.
Assume we have c different classes, and a training set, X_i, of examples from each class. Thus, this is a supervised learning problem. Define the within-class scatter matrix to be

S_W = Σ_{i=1}^{c} S_i        (14.59)

where

S_i = Σ_{x ∈ X_i} (x − μ_i)(x − μ_i)^T,        (14.60)

and μ_i is the mean of class i. Thus, S_i is a measure of how much each class varies from its average,

μ_i = (1/n_i) Σ_{x ∈ X_i} x.        (14.61)
We define the between-class scatter matrix as

S_B = Σ_{i=1}^{c} n_i (μ_i − μ)(μ_i − μ)^T,        (14.62)

where μ is the mean of all the points in all the training sets and n_i is the number of samples in class i. To see what this means, consider Fig. 14.6. The between-class scatter is a measure of the sum of the distances between each of the class means and the overall sample mean. Maximization of some measure of S_B will push the class means apart, away from the overall mean.
The idea is to find some projection of each data vector x onto a vector y,

y = Wx        (14.63)

such that first, y is of lower dimension than x, and second, the classes are better separated after they are projected.
The projection from d-dimensional space to (c − 1)-dimensional space is accomplished by c − 1 linear discriminant functions

y_i = w_i^T x.        (14.64)
(14.65)
We now define a criterion function which is a function of W and measures the ratio of between-class scatter to within-class scatter. That is, we want to maximize S_B relative to S_W, or rather,
to maximize some measure of S_W^{-1} S_B. The trace of S_W^{-1} S_B is the sum of the spreads of S_W^{-1} S_B in the direction of the principal components of S_W^{-1} S_B. We can see clearly what this means in the two-class case:

J = tr(S_W^{-1} S_B) = (n_1 n_2 / (n_1 + n_2)) tr( S_W^{-1} (μ_1 − μ_2)(μ_1 − μ_2)^T ) = (n_1 n_2 / (n_1 + n_2)) D².        (14.66)
(14.68)
(14.69)
(14.70)
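For the two-class case, the classical solution of this maximization is a projection direction proportional to S_W^{-1}(μ_1 − μ_2). The sketch below (our own, not from the text) builds S_W from Eqs. (14.59) and (14.60) and computes that direction; the function name and the random data are placeholders.

import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher linear discriminant direction.

    X1 and X2 are (n_i, d) arrays of training samples for the two
    classes.  Returns the unit vector maximizing between-class scatter
    relative to within-class scatter, w proportional to SW^{-1}(mu1 - mu2).
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # Eq. (14.60)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2                          # Eq. (14.59)
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
w = fisher_direction(X1, X2)
print(w)                                   # projection direction
print((X1 @ w).mean(), (X2 @ w).mean())    # projected class means separate well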
14A.2
Fig. 14.7. A poor choice of the dividing hyperplane (a line in this 2D example) produces a margin which is small.
Fig. 14.8. A good choice of the dividing hyperplane results in a large margin.
14A.2.1
ν^T x_1 = q + τ,    ν^T x_2 = q − τ.        (14.71)

Define I_i to be the set of training examples from class i. For any point in I_1, ν^T x − q > τ, and for any point in I_2, ν^T x − q < −τ. We need to find two things. (1) A pair of points, one² in each class, which are as close together as possible. We will call these points support vectors. (2) A vector onto which to project the support vectors so that their projections are maximally far apart. We solve this problem as follows.
Recall that ν was a unit vector. It is thus equal to some other vector in the same direction divided by its magnitude, ν = w/‖w‖. We will look for one such vector, with certain properties which will be introduced in a moment: For now, let x_1 denote any point in I_1, not necessarily a support vector, and similarly for x_2. Then

(w/‖w‖)^T x_1 − q ≥ τ    and    (w/‖w‖)^T x_2 − q ≤ −τ,        (14.72)
It is possible to have more than one support vector in each class, since two points might both be precisely the
same distance from the hyperplane.
which leads to

w^T x_1 − q‖w‖ ≥ τ‖w‖    and    w^T x_2 − q‖w‖ ≤ −τ‖w‖.        (14.73)

Define b = −q‖w‖, and we then add a constraint to w to require that its magnitude have a particular property:

‖w‖ = 1/τ.        (14.74)
Now we have two equations which describe behavior for any points in class 1 or class 2:

w^T x_1 + b ≥ 1    and    w^T x_2 + b ≤ −1.        (14.75)

From this point on in this derivation, the subscript on the x no longer denotes the class to which x belongs, but rather just its index, as an element of the training set.
Since we wish to find the line which maximizes the margin, τ, from Eq. (14.74), we see that this is the same as finding the projection vector w whose magnitude is minimal; thus we seek a minimizer w = arg min((1/2) w^T w). Unfortunately, the null vector would minimize this, so we need to add some constraints to avoid this trivial solution.
Let y_i be the label for point x_i, and define the labels as

y_i = 1 if x_i ∈ I_1, and y_i = −1 if x_i ∈ I_2,        (14.76)

and consider the expression y_i(w^T x_i + b). This will always be greater than or equal to 1, regardless of the class of x_i. We thus have a constraint, and our minimization problem becomes: Find the w which minimizes w^T w such that y_i(w^T x_i + b) ≥ 1.
This can be accomplished by setting up the following constrained optimization problem:

L(w, b, λ) = (1/2) w^T w − Σ_{i=1}^{l} λ_i (y_i(w^T x_i + b) − 1).        (14.77)

Setting the derivative of L with respect to b to zero gives

Σ_{i=1}^{l} λ_i y_i = 0,        (14.78)

and setting the derivative with respect to w to zero gives

w = Σ_{i=1}^{l} λ_i x_i y_i.        (14.79)
(14.80)
L = (1/2) Σ_i Σ_j λ_i y_i λ_j y_j x_i^T x_j − Σ_i Σ_j λ_i y_i λ_j y_j x_i^T x_j − b Σ_i λ_i y_i + Σ_i λ_i,        (14.81)
where the first and second term are the same except for the 1/2. The third term is zero.
L = Σ_i λ_i − (1/2) Σ_i Σ_j λ_i y_i λ_j y_j x_i^T x_j,        (14.82)

or, in matrix form,

L = −(1/2) λ^T A λ + 1^T λ,        (14.83)

where A = [y_i y_j x_i^T x_j] and 1 denotes a vector of ones.
(14.84)
In principle, Eq. (14.84) may be solved for b using any i, but it is numerically better to use
an average.
Similarly, we note the dimension of A is the same as the number of samples in the training set. Thus, unless some filtering is done on the training set prior to building the SVM, the computational complexity can be substantial.
14A.2.2
space (for reasons beyond the scope of this brief explanation). It also provides a simple
mechanism for incorporating nonlinear mixtures of the information from the measurements.
The polynomial form of Eq. (14.85) is but one way of expanding the dimensionality of the
measurement vector. A more interesting collection of ways comes to mind when one looks
at Eq. (14.83), and observes that to compute the optimal separating hyperplane, one does not
need to know the vectors themselves, but only the scalars which result from computing all
possible inner products. Thus, we do not need to map each vector to a high-dimensionality
space and then take the inner product of those vectors, not if we can gure out ahead of time
what those inner products should be.
Kernels and inner products
We seek the best separating hyperplane in the higher dimensional space defined by y_i = Φ(x_i), Φ: (ℜ^d → ℜ^m), where m > d. Then the equations for the elements of A become functions of the inner products of these new vectors, A = [y_i y_j Φ^T(x_i) Φ(x_j)]. For notational convenience (and to lead to a really clever result), define a kernel operator, K(x_i, x_j), which takes into account both the nonlinear transformation and the inner product. Instead of asking What nonlinear operator should I use? let's ask a different question: Given a particular kernel, is there any chance it represents the combination of a nonlinear operator and an inner product? The amazing answer is yes, under certain conditions that is true.
These conditions are known as Mercer's conditions: Given a kernel function K(a, b) of two vector-valued arguments, if for any g(x) which has finite energy (that is, ∫(g(x))² dx is finite) we have ∫∫ K(a, b) g(a) g(b) da db ≥ 0, then there exists a mapping Φ and a decomposition of K of the form

K(a, b) = Σ_i Φ_i(a) Φ_i(b).        (14.86)
In Eq. (14.86), the subscript i denotes the ith element of the vector-valued function Φ. Thus, that expression represents an inner product. Notice that Mercer's conditions simply state that if K satisfies these conditions, then K may be decomposed into an inner product of two instances of a function Φ. It does not say what Φ is, nor does it say what the dimensionality of Φ is. But that's OK. We do not have to know. In fact, the vector Φ may have infinite dimensionality. That's still OK.
One kernel which is known to satisfy Mercer's condition, and which is very popular in the SVM literature, is the radial basis function

K(a, b) = exp( −(a − b)^T (a − b) / (2σ²) ).        (14.87)
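As a small illustration (not from the text), the radial basis kernel of Eq. (14.87) and the matrix A = [y_i y_j K(x_i, x_j)] used in Eq. (14.83) can be computed as follows; the training vectors, labels, and σ are placeholders.

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Radial basis function kernel, Eq. (14.87)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def svm_matrix(X, y, sigma=1.0):
    """Matrix A with entries y_i y_j K(x_i, x_j).

    X is an (l, d) array of training vectors, y an array of +/-1 labels.
    Note that A is l x l, the size of the training set, which is the
    source of the computational burden mentioned in the text.
    """
    l = len(y)
    A = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            A[i, j] = y[i] * y[j] * rbf_kernel(X[i], X[j], sigma)
    return A

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y = np.array([1, 1, -1])
print(svm_matrix(X, y, sigma=1.0))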
In the literature, SVMs have been applied to various problems such as face recognition [14.10] and breast cancer detection [14.1]. In previous studies [14.7] and in comparative analysis in the literature, they have empirically been shown to outperform classical classification tools such as neural networks and nearest neighbor rules [14.9, 14.13, 14.14]. Interestingly, in a comparison with a classifier based on hyperspectral data, an SVM-based classifier using multispectral data (derived from the original hyperspectral data by filtering) performed better than classifiers based on the original data [14.8].
14A.3
14A.4
Conclusion
Statistical methods provide tools for making decisions, based on measurements. If the measurements are sufficiently discriminating that simple thresholds may be used, sophisticated statistical methods may not be required. On the other hand, most collections of measurements are not sufficient to make such decisions based on trivial feature comparisons.
We have seen in section 14A.1 an example of what is in the discipline of statistical pattern
recognition, and just one application to machine vision. There is NOT enough information
in this book to teach you all you need to know about statistical methods. You really need to
take a full course. We hope this chapter has given you enough motivation to do that.
In section 14A.1 we derived an objective function which, if maximized, would result in
projected data with classes maximally separated. It turned out that this maximization problem
turns into an eigenvalue problem.
An SVM finds the decision boundary which maximizes the margin, where the margin is the distance between the closest points and the decision boundary, and the derivation of the machine requires use of constrained optimization with Lagrange multipliers.
Vocabulary
You should know the meaning of the following terms.
Between-class scatter
Fisher's linear discriminant
Margin
Mercer's conditions
Support vector
Within-class scatter
References
[14.1] P. S. Bradley, U. M. Fayyad, and O. L. Mangasarian, Mathematical Programming
for Data Mining: Formulations and Challenges, INFORMS Journal on Computing,
11(3), pp. 217–238, 1999.
[14.2] C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition,
Data Mining and Knowledge Discovery, Vol. 2, No. 2, Dordrecht, Kluwer, 1998.
[14.3] R. Duda and P. Hart, Pattern Recognition and Scene Analysis, New York, Wiley,
1973.
[14.4] R. Duda, P. Hart, and D. Stork, Pattern Classification, Second Edition, New York,
Wiley, 2001.
[14.5] R. Fletcher, Practical Methods of Optimization, New York, Wiley, 1987.
[14.6] K. Fukunaga, Introduction to Statistical Pattern Recognition, New York, Academic
Press, 1972.
[14.7] B. Karacali and H. Krim, Fast Minimization of Structural Risk Using the Nearest
Neighbor Rule, IEEE Transactions on Neural Networks, 14(1), Jan. 2003.
[14.8] B. Karacali and W. Snyder, On-the-fly Multispectral Automatic Target Recognition, Combat Identification Systems Conference, Colorado Springs, June, 2002.
[14.9] D. Li, S. M. R. Azimi, and D. J. Dobeck, Comparison of Different Neural Network Classification Paradigms for Underwater Target Discrimination, Proceedings of SPIE, Detection and Remediation Technologies for Mines and Minelike Targets V, 4038, pp. 334–345, 2000.
[14.10] E. Osuna, R. Freund, and F. Girosi, Training Support Vector Machines: An Application to Face Detection, Proceedings of Computer Vision and Pattern Recognition,
Puerto Rico, June, 1997.
[14.11] C. Therrien, Decision, Estimation, and Classication, New York, Wiley, 1989.
[14.12] V. Vapnik, The Nature of Statistical Learning Theory, Berlin, Springer, 1995.
[14.13] M. H. Yang and B. Moghaddam, Gender Classification Using Support Vector Machines, Proceedings of IEEE International Conference on Image Processing, Vancouver, BC, Canada, September, 2000, vol. 2, pp. 471–474, 2000.
[14.14] Y. Yang and X. Liu, Re-examination of Text Categorization Methods, Proceedings of the 1999 22nd International Conference on Research and Development in Information Retrieval (SIGIR99), Berkeley, CA, pp. 42–49, 1999.
15
Clustering
In this chapter, we approach the problem alluded to in Chapter 14 where the training
set simply contains points, and those points are not marked in any way to indicate
from which class they may have come. As in the previous chapter, we present only
a brief overview of the field, and refer the reader to other texts [14.4, 15.7] for
more thorough coverage. One very important area which we omit here is the use
of biologically inspired models for clustering [15.4, 15.5, 15.6], and the reader is
strongly encouraged to look into these.
We will discuss the issues of clustering in a rather general sense, but note one
particular application, which is identification of peaks in the Hough transform
array.
Consider this example from satellite pattern classication: We imagine a
downward-looking satellite orbiting the earth, which, at each observed point, makes
a number of measurements of the light emitted/reflected from that point on the earth's surface. Typically, as many as seven different measurements might be taken
from a given point, each measurement in a different spectral band. Each pixel
in the resulting image would then be a 7-vector where the elements of this vector
might represent the intensity in the far-infrared, the near-infrared, blue, green, etc.
Now suppose we have labeled training sets indicating examples of pixels containing
wheat, corn, grass, and trees. With these training sets, it would seem that we should
be able to build a pattern classier; and indeed we can. Furthermore, the problem as
stated so far is a supervised learning problem. Let's consider for a moment, however, the class which we call trees. This class consists of evergreen and deciduous trees and, depending upon the time of year, these two subclasses will give radically different spectral signatures. We thus have a pattern classification problem which is not
easily approached with parametric classiers. While we could use a non-parametric
approach, parametric methods are very attractive. An alternative to nonparametric
classiers is to consider methods for determining the existence of the subclasses,
assigning points in the training set to the correct subclass and then representing that
subclass parametrically. Fig. 15.1 illustrates a two-dimensional problem in which
the existence of two classes within the same training set is readily apparent. We will
refer to these subclasses as clusters for the duration of this discussion. Each cluster
could be represented fairly accurately by a 2D Gaussian. The entire measurement
space, however, is obviously bimodal.
Such clustering is easy (for us humans) to visualize in a 2-space, and essentially
impossible to visualize in problems with more than three dimensions.
(15.1)
(15.2)
One might also consider some generalization of the Mahalanobis distance from (point, cluster) to (cluster, cluster), defining

d_fisher(A, B) = |μ_A − μ_B| / (σ_A² + σ_B²),        (15.3)

where the sigmas are computed by first projecting the two clusters onto the line between the two means. The sample mean and sample variances of the projected data are the parameters of Eq. (15.3).
We can provide a more formal statement about how to dene a distance between
two clusters by rst stating the desirable properties of such a distance. We require
d(A, B) ≥ 0    and    d(A, B) = 0 if A = B.        (15.4)
There are many measures which satisfy these conditions; for example, we could
integrate the densities over all the sample space to obtain the divergence,
d_div(A, B) = ∫ [p(x|A) − p(x|B)] ln( p(x|A)/p(x|B) ) dx.        (15.5)
In the case of multivariate Gaussians, this becomes

(1/2)(μ_1 − μ_2)^T [K_1^{-1} + K_2^{-1}](μ_1 − μ_2) + (1/2) tr( K_1^{-1}K_2 + K_2^{-1}K_1 − 2I ),        (15.6)

which simplifies to Eq. (15.7) if the two covariance matrices are equal to each other:

(μ_1 − μ_2)^T K^{-1} (μ_1 − μ_2).        (15.7)

The Chernoff distance is

d_ch(A, B) = −ln ∫ (p(x|A))^{1−s} (p(x|B))^{s} dx.        (15.8)
(15.9)
In the case that s = 1/2, the Chernoff distance turns into the Bhattacharyya distance:

(1/8)(μ_A − μ_B)^T [ (K_A + K_B)/2 ]^{-1} (μ_A − μ_B) + (1/2) ln( |(1/2)(K_A + K_B)| / (|K_A|^{1/2} |K_B|^{1/2}) ).        (15.10)
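A short sketch (ours, not from the text) of the Bhattacharyya distance of Eq. (15.10) between two Gaussian clusters described by their sample means and covariances; the example parameters are arbitrary.

import numpy as np

def bhattacharyya(mu_a, K_a, mu_b, K_b):
    """Bhattacharyya distance between two Gaussian clusters, Eq. (15.10)."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    K = 0.5 * (np.asarray(K_a, float) + np.asarray(K_b, float))
    d = mu_a - mu_b
    term1 = 0.125 * d @ np.linalg.solve(K, d)
    term2 = 0.5 * np.log(np.linalg.det(K) /
                         np.sqrt(np.linalg.det(K_a) * np.linalg.det(K_b)))
    return term1 + term2

K1 = np.array([[1.0, 0.0], [0.0, 1.0]])
K2 = np.array([[2.0, 0.3], [0.3, 0.5]])
print(bhattacharyya([0.0, 0.0], K1, [2.0, 1.0], K2))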
The most commonly occurring cluster distance metric in the literature is the nearest neighbor measure

d_min(A, B) = min_{a ∈ A, b ∈ B} d(a, b).        (15.11)

That is, over all pairs of points, one from cluster A and one from cluster B, choose those two points which are closest together and define that to be the distance between the two clusters.
Similarly, one may define the furthest neighbor distance

d_max(A, B) = max_{a ∈ A, b ∈ B} d(a, b).        (15.12)
Each of the definitions given above simply gives us a scalar measure for representing,
in some sense, how far apart two clusters are.
Any time you use a distance measure on vector-valued quantities, you should
pay attention to the possibility that scaling of the coordinate axes might change the
results. For example, consider the set of points shown in Fig. 15.2. Another example
that fairly well shows the impact of clustering is a classication problem involving
the vector [a, b]T , where a represents population and b represents the number of
Fig. 15.2. Simple scaling of the coordinate axes can change the apparent clusters.
15.2.1
Agglomerative clustering
In agglomerative clustering, we begin by assigning each data point in the training
set to a separate cluster. If there are N data points, then at the beginning, we have N
clusters. Next, we perform iterations of: Merge the two closest clusters.
By merge we mean: (1) find the two closest clusters (each cluster may be
viewed as a set); (2) create a new set, consisting of the union of the two; and
(3) remove the original two. Fig. 15.3 shows an example.
This process continues until there are only c clusters. Presumably c is known
beforehand.
When we begin, every cluster consists of a single point, and the distance between
clusters is the same as the measure used for the distance between points. After we
have begun the process, however, we will be forced to make use of measures of
distances between clusters as described earlier.
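The agglomerative procedure just described translates into a few lines of Python; the sketch below (ours, written for clarity rather than speed) uses d_min of Eq. (15.11) as the cluster distance and merges until c clusters remain.

import numpy as np

def d_min(A, B):
    """Nearest-neighbor distance between clusters A and B, Eq. (15.11)."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agglomerate(points, c, cluster_dist=d_min):
    """Merge the two closest clusters until only c clusters remain."""
    clusters = [[np.asarray(p, float)] for p in points]   # one point per cluster
    while len(clusters) > c:
        # find the pair of clusters with the smallest mutual distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]            # merge the two sets
        del clusters[j]                                    # remove the old cluster
    return clusters

pts = [[0, 0], [0, 1], [0, 2], [5, 5], [5, 6]]
for cl in agglomerate(pts, c=2):
    print([list(map(float, p)) for p in cl])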
Fig. 15.3. Before the clustering iteration (left), the data points have already been assigned to
three clusters. Clusters B and C are determined (somehow) to be closer than A, B or
A, C, so B and C are merged and renamed to become a new cluster B.
The clusters which result from this algorithm are highly dependent on the measure
of cluster distance which is used. If, for example, one uses dmin , and illustrates
the distances used by drawing a line in Fig. 15.4, one gets the minimum spanning
tree (MST) of the graph which represents the data points by simply continuing the
algorithm until there is only one cluster. If we want three clusters, then we need only
cut the longest two edges in this graph. One then realizes that if formal graph theoretic
operations result from clustering, then the converse is likewise true: Whatever we
know about graphs may help us in designing clustering algorithms. In particular, the
following algorithm constructs the MST of a graph very quickly:
Define the operation y = FIND(x) as returning the name of the set containing x. Similarly, UNION(A, B, C) creates a new set C = A ∪ B, and then deletes the sets A and B.
Generally, in step 3.3, we index each set by an integer index, and rather than discarding indices when A and B are erased, we use the index for A or for B (whichever is
smaller) as the index for the new set C.
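A minimal serial union-find sketch of the FIND and UNION operations just described, keeping the smaller index when two sets are merged as suggested above; this is our own illustration, not the parallel implementation discussed next.

def make_sets(n):
    """Initially every element is its own set; parent[i] == i."""
    return list(range(n))

def find(parent, x):
    """Return the name (root index) of the set containing x."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root index survives."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        lo, hi = min(ra, rb), max(ra, rb)
        parent[hi] = lo

parent = make_sets(6)
union(parent, 0, 1)
union(parent, 2, 3)
union(parent, 1, 3)
print([find(parent, i) for i in range(6)])   # elements 0-3 share one root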
As has been discussed in the literature [15.1, 15.8] there exist parallel algorithms
for performing the union-find operations in constant time. The parallel algorithm
assumes the existence of a lookup table which performs the FIND operation. Then
a UNION in parallel hardware is implemented as follows: to do a UNION-FIND
operation on points u and v:
case. In particular, dmin will tend to choose clusters which are long and thin, and
dmax will choose clusters which are basically round. Students often get confused
right here, concerning the maximum and minimum criteria, so let us spend just a
few extra words to reiterate what we are doing: In this algorithm (the agglomerative
clustering algorithm) we are ALWAYS merging the two CLOSEST clusters. We get
into the maximum distance thing when we define the distance between the clusters.
dmax says to use as a measure of distance between the clusters, the distance between
those points, in those clusters, whose mutual distance is maximum.
15.2.2
k-means clustering
There is, of course, another way to do clustering. In fact there are several ways. The
k-means algorithm is probably the most popular, and is described as follows:
Algorithm: k-means clustering
Step 1. In an arbitrary way, assign samples to clusters. Or, if you don't like being that
arbitrary, choose an arbitrary set of cluster centers and then assign all the samples
to the nearest cluster. How you pick the cluster centers is problem-dependent. For
example, if you were clustering points in a color space, where the dimensions are
red, green, and blue, you might scatter all your cluster centers uniformly over this
3-space, or you might put all of them along the line from 0, 0, 0 to maxred, maxgreen,
maxblue.
Step 2. Compute the mean of each cluster.
Step 3. Reassign each sample as belonging to the cluster with the nearest mean.
Step 4. If nothing changed this iteration, exit, otherwise go to step 2.
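The four steps above translate almost directly into code. The sketch below is our own; it uses a random initial assignment for step 1 and reseeds any cluster that becomes empty, a detail the text does not address.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """k-means clustering of the rows of X into k clusters.

    Follows the algorithm in the text: assign, compute means, reassign
    each sample to the nearest mean, repeat until nothing changes.
    """
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))          # step 1: arbitrary assignment
    means = None
    for _ in range(max_iter):
        means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else X[rng.integers(len(X))]
                          for i in range(k)])          # step 2: cluster means
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)                  # step 3: nearest mean
        if np.array_equal(new_labels, labels):         # step 4: stop if unchanged
            break
        labels = new_labels
    return labels, means

X = np.vstack([np.random.default_rng(1).normal(0.0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(4.0, 0.5, (50, 2))])
labels, means = kmeans(X, k=2)
print(means)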
In Fig. 15.9 we illustrate the use of k-means to identify the peaks in a Hough
accumulator array. We use the same accumulator array illustrated in Fig. 11.5. Each
point in the accumulator array is treated as if it contains a number of points equal to
the value of the number stored at that point. The initial cluster centers were chosen
well separated and far from the actual centers. The simplest implementation of
k-means does not work in this application because there are very many points in
which the accumulator array contained only one point. All those ones add up to
place the mean at a point not near the peak we seek. The solution is to simply ignore
points with low values. In this example, any point not containing at least three points
was ignored. Other heuristics are possible (see Assignment 15.4).
The ISODATA algorithm [15.2] extends k-means. ISODATA allows the algorithm
to pick its own version of the number of clusters, and provides for more flexibility
in specifying maximum and minimum cluster sizes.
Fig. 15.9. Path followed by two cluster centers initialized far from their final positions in a Hough
accumulator array. The length of the lines indicates the move from one estimate of the
center to the estimate calculated in the next iteration.
The within-cluster scatter matrix is

S_w = Σ_{i=1}^{c} Σ_{x ∈ X_i} (x − μ_i)(x − μ_i)^T.        (15.14)

Its trace,

tr S_w = Σ_{i=1}^{c} Σ_{x ∈ X_i} Tr((x − μ_i)(x − μ_i)^T) = Σ_{i=1}^{c} Σ_{x ∈ X_i} (x − μ_i)^T (x − μ_i),        (15.15)

is simply the sum of the squared deviations of the points from their means. The principal disadvantage of the trace criterion is that when used in a clustering algorithm, it can yield different results when the variables are scaled.
The determinant of Sw is invariant to axis scaling, but use of the determinant
imposes the assumption that all clusters have roughly the same shape.
15.3.1
15.3.2
Vector quantization
In the research area known as vector quantization, the computer is presented with a set of n vectors, and is to find natural groupings among the vectors. Said another way, the computer is to find a set of c reference vectors which represent the set of n in some optimal way. If you think this sounds like clustering you are correct. It is in fact precisely clustering; so let's call it that.
15.3.3
Winner-take-all approaches
The winner-take-all (also called competitive learning) approach originally resulted
from researchers who were interested in modeling the cognitive process we know as
generalization, and therefore each reference vector/cluster center was represented
by a mathematical construct which modeled a neuron. It will not be necessary for us
to go into the physiological model of neurons here. Instead, we use the term cluster
center for what some other material might refer to as a neuron.
Each cluster center i has an associated weight vector which we will refer to by the same name, w_i = [w_{ij}]^T, j = 1, . . . , d. Note that the vector describes a location in a d-space. To do clustering, we present an input vector v = [v_1, v_2, . . . , v_d]^T. Define the winner as the cluster center closest (in whatever sense is appropriate) to the input vector. That is,

d(w_i, v) ≤ d(w_k, v)    (k ≠ i).        (15.16)
(15.17)
(15.18)
where v is the data presented to the algorithm at this iteration; F is some nonincreasing
scalar function of di j , a measure of the distance between clusters i and j; and is a
maximum on that distance. This algorithm is easily programmed and converges to
excellent clusterings.
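Eqs. (15.17) and (15.18) did not survive in this copy, so the sketch below uses the standard competitive-learning update, moving the winning center a fraction η of the way toward the presented vector; this update rule is an assumption about the book's exact form, and the data and parameters are placeholders.

import numpy as np

def competitive_learning(data, c, eta=0.1, epochs=20, seed=0):
    """Winner-take-all clustering.

    For each presented vector v, the closest cluster center (the
    winner, Eq. 15.16) is moved a fraction eta of the way toward v.
    The update w += eta * (v - w) is the usual competitive-learning
    rule and is assumed here.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    centers = data[rng.choice(len(data), size=c, replace=False)].copy()
    for _ in range(epochs):
        for v in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(centers - v, axis=1))
            centers[winner] += eta * (v - centers[winner])
    return centers

data = np.vstack([np.random.default_rng(3).normal(0.0, 0.3, (40, 2)),
                  np.random.default_rng(4).normal(3.0, 0.3, (40, 2))])
print(competitive_learning(data, c=2))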
15.4 Conclusion
Gradient descent.
As we have noted, the form of the clustering algorithm significantly affects the results
of the clustering. Some attempts to reduce the dependency on algorithm form have
been made [15.3], but this remains a fertile area for new ideas.
We view clustering as a collection of methods for determining consistency (as
in determining the peaks of the Hough transform) but not for using consistency to
solve other problems.
Clustering algorithms are totally dependent on optimization methods. In section
15.3, we minimized the trace of the scatter matrix to find a good clustering measure.
We used branch and bound to speed up the combinatorial problem which results
when points must simply be switched between clusters.
Although couched in the terminology of neural networks, the winner-take-all
methods of section 15.3.3 use Eq. (15.17) (which is quite reminiscent of gradient
descent) to find the best cluster center.
15.5 Vocabulary
You should know the meaning of the following terms.
Agglomerative clustering
Bhattacharyya distance
Branch and bound
Chernoff distance
Cluster
Competitive learning
Distance
Euclidean distance
Furthest neighbor distance
k-means
Kohonen map
Mahalanobis distance
Minimum spanning tree
Nearest neighbor distance
Union-find
Assignment
15.1
Prove that for the equal-covariance case, the
Bhattacharyya distance becomes the same as the measure
given in Eq. (15.7).
Assignment
15.2
The following points are to be partitioned into two
clusters:
[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,7],[0,8].
Sketch the points and indicate the minimum spanning
tree.
Using dmin , find and identify the two clusters. You
may do this graphically if you wish.
Using dmax also find and identify the two clusters.
Can you suggest another distance measure to use in
this case? Discuss.
Assignment
15.3
In your images directory are three images, called
facered.ifs
faceblue.ifs
and (can you guess?)
facegreen.ifs
These are the red, blue, and green components of a full
color image. Each pixel can be represented by 8 bits
of red, 8 of green and 8 of blue. Therefore, there
are potentially 2^24 colors in this image. Unfortunately, your workstation (probably) only has 8 bits
of color, for a total of 256 possible colors. Your mission (should you choose to accept it) is to figure out
a way to display this picture in full color on your
workstation.
Approach: Use some kind of clustering algorithm. Find
128 clusters which best represent the color space, and
assign all points to one of those colors. Then, make a
file with the following data in it:
15.4
In Fig. 15.9 a Hough accumulator is illustrated. That
same accumulator array is available on the CDROM as
hough.ifs. Invent a new way to find the peaks using a
clustering algorithm. (Do NOT simply find the brightest
point.) Suggestions might include weighting each point
by the square of the accumulator value, doing something
with the exponential of the square of the value, etc.
References
[15.1] R. Anderson and H. Woll, Wait-free Parallel Algorithms for the Union-Find Problem, Proc. 22nd ACM Symposium on Theory of Computing, pp. 370380, New York,
ACM Press, 1991.
[15.2] G. Ball, Data Analysis in the Social Sciences: What about the Details? Proc. AFIPS
Fall Joint Computer Conference, Washington, DC, Spartan Books, 1965.
[15.3] G. Beni and X. Liu, A Least Biased Fuzzy Clustering Method, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[15.4] G. Carpenter and S. Grossberg, A Massively Parallel Architecture for a Self-organizing Neural Pattern Recognition Machine, Computer Vision, Graphics, and Image Processing, 37, pp. 54–115, 1987.
[15.5] G. Carpenter and S. Grossberg, ART-2: Stable Self-organization of Stable Category
Recognition Codes for Analog Input Patterns, Applied Optics, 26(23), p. 4919, 1987.
[15.6] G. Carpenter and S. Grossberg, ART-3: Hierarchical Search using Chemical Transmitters in Self-Organizing Pattern Recognition Architectures, Neural Networks, 3,
pp. 129–152, 1990.
[15.7] D. Hand, Discrimination and Classification, New York, Wiley, 1989.
[15.8] W. Snyder and C. Savage, Content-Addressable Read–Write Memories for Image Analysis, IEEE Transactions on Computers, 31(10), pp. 963–967, 1982.
16
16.1 Terminology
To make more progress in this area, we need to define some terminology. The definitions are in reference to analysis of strings of symbols, such as occur in language analysis.
Terminal symbol.
Nonterminal symbol.
Grammar.
Production.
S > NP VP
VP > VP ADV
VP > V
NP > ADJ N
NP > ADJ NP
NP > ART N
N > horse
N > professor
V > runs
V > sleeps
ADV > quickly
ADJ > green
ART > the
Derivation.
16.2.1
Type 0 grammars
In a type 0 grammar, any rewrite rule is allowable. The left-hand side of the production
may contain any mix of terminal and nonterminal symbols. For example
abAaBc > abAaCCc
is allowable. Let's reiterate what this means: If, in the course of a derivation, the
string abAaBc occurs, it may be replaced by abAaCCc. e.g. aardvabAaBcark could
be replaced by aardvabAaCCcark.
For any type 0 grammar, there is a Turing machine which recognizes the language
generated by that grammar.
16.2.2 Type 1 grammars¹
Any rewrite rule is allowable, except that strings are not permitted to get shorter. The
left-hand side of the production may contain any mix of terminal and nonterminal
Type 1 grammars are sometimes called context-sensitive, but we will not use that term because some students find it confusing.
16.2.3
Type 2 grammars2
The left-hand side of the production may contain only a single nonterminal symbol.
For example
Bc > aabCC
is not allowable, because the LHS contains more than one symbol, however
B > aabbCC
is allowable.
For any type 2 grammar, there is a pushdown automaton which recognizes the
language generated by that grammar. We present one example of a type 2 grammar
which you might find interesting.
Terminals {0,1}
Nonterminals: {S}
Start symbol: S
Table 16.2. Productions in an
example type 2 grammar.
S > 0S1
S > 01
16.2.4
Type 3 grammars
The left-hand side of the production may contain only a single nonterminal symbol.
The right-hand side may contain only strings of the form a or aA, that is, a
single terminal symbol, or a single terminal followed by a single nonterminal.
For example,
B > aCCb
is not allowable, because the RHS is not of an allowable form, however
B > aC
is allowable.
16.3.1
The reader familiar with
FSMs will recognize this
as defining a Mealy
machine (or is it a Moore
machine? You look it up).
Nondeterministic
machines.
Type 3 grammars
For any type 3 grammar, there is a finite state machine (FSM) which recognizes the language generated by that grammar. An FSM is a system which may exist in a finite number of states³, denoted by upper case letters, and has a finite number of input symbols, denoted here by lower case letters or numbers. Its operation is governed by a set of transition rules of the form δ(A, a) = B; that is, when the machine is in state A and receives input a, it will change to state B. The output of the machine depends only on its state. The machine produces an output when it is in an accepting state.
To derive the machine from the grammar requires two steps: first, we construct a strange artifice called a nondeterministic finite state machine. Then, we convert that machine into one which we can actually build. Let's do that for a grammar which
describes regular ECGs.4 First, a bit about how your heart beats.
A normal heart beat is sketched in Fig. 16.1. It illustrates several waves. The
P wave is generated by the electrical signal, depolarization, which initiates the
contraction of the two atria, the smaller chambers of the heart. After the atrial beat,
the ECG returns to the isoelectric line, denoted i, for a short period of time, to
allow the larger chambers, the ventricles, to fill. Then the ventricles depolarize,
creating the QRS (denoted by just R) signal. In a healthy heart, the signal again
returns to the isoelectric line until the ventricles repolarize, creating the T wave.
Another isoelectric period occurs until the next P wave. So a healthy heart produces
A finite state machine is a system which may exist in a finite number of states. We can't believe we actually said that!
An ECG is an electrocardiogram. The familiar expression EKG comes from the German spelling of the
word.
Fig. 16.4. Atrioventricular block: Delay between the P and the R, piiritipiiiritip.
Fig. 16.5. Myocardial infarction: Signal does not return to isoelectric between R and T, pirtii.
The productions of a grammar which generates normal ECGs are

S > pA
A > iC
C > rD
D > iE
E > tF
F > iG
G > i
G > iH
H > i
H > iS

For each production of the form A > aB, construct a state change of the form δ(A, a) = B; for each production of the form A > a, construct a state change of the form δ(A, a) = Q. Finally, if a is any input symbol, δ(Q, a) = ∅, where ∅ denotes the empty symbol.
The state change description of the machine which recognizes the language generated by this grammar is shown in Table 16.4; it includes

δ(A, i) = C    δ(C, r) = D    δ(E, t) = F
δ(F, i) = G    δ(H, i) = {S, Q}    δ(Q, i) = ∅
Do you see why this is called nondeterministic? This machine goes from H to
Q under input i or H to S under the same input i. We do not mean sometimes it goes
to Q and sometimes to S. We mean it really does both, which of course is impossible
in a physically realizable machine.
To convert this into something that could be built, we construct a machine M as
follows.
The states of the new machine are all possible subsets of states of the original
machine, including ∅ (but not all will necessarily be used). In this example, there
The transitions of the resulting deterministic machine (Table 16.5) include

δ([S], p) = [A]    δ([A], i) = [C]    δ([C], r) = [D]
δ([E], t) = [F]    δ([F], i) = [G]    δ([H], i) = [S, Q]
δ([S, Q], p) = [A]    δ([Q], i) = ∅

Fig. 16.6. Simple deterministic FSM which recognizes normal ECGs. Accepting states are circled.
are 2^9 such states. These states will be denoted by square brackets and a list of the original state names. If a transition involves such a set-valued state on the left, the new state will be the union of the states to which the original machine went. (Whew! was that awkward enough?) The accepting states of the new machine are any states involving Q, that is, any states which contain accepting states of the original machine. This process produces the physically implementable machine illustrated in Table 16.5 and Fig. 16.6. In this example, although there are 2^9 states in the new machine, only a few of them are used.
Thus, we have a machine which recognizes heart rhythms. We hope you have
observed that we have only listed what to do if the normal or expected input
occurs. Now, just for the fun of it, we could modify the state diagram (Fig. 16.6)
to include pathological conditions. For example, we could add δ(D, t) = Y which
would cause a transition to an alarm state indicating the patient might be having a
myocardial infarction (or aortic dissection or other bad stuff).
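A deterministic FSM such as the one in Fig. 16.6 is easily simulated with a transition table. The sketch below is our own reconstruction from the productions given earlier, not a verbatim copy of Table 16.5: the state names, the handling of unexpected inputs, and the example strings are our choices.

# Transition table for a deterministic FSM accepting repeated normal beats.
# The state names (S, A, C, D, E, F, G, HQ, SQ) follow the derivation in the
# text; unlisted inputs go to an error state.  This table is a reconstruction
# from the productions, not a verbatim copy of Table 16.5.
TRANSITIONS = {
    ("S", "p"): "A",  ("A", "i"): "C",   ("C", "r"): "D",
    ("D", "i"): "E",  ("E", "t"): "F",   ("F", "i"): "G",
    ("G", "i"): "HQ", ("HQ", "i"): "SQ", ("SQ", "p"): "A",
}
ACCEPTING = {"HQ", "SQ"}       # states containing the accepting state Q

def recognizes(string, start="S"):
    """Run the FSM over the input; accept if it ends in an accepting state."""
    state = start
    for symbol in string:
        state = TRANSITIONS.get((state, symbol), "ERROR")
        if state == "ERROR":
            return False
    return state in ACCEPTING

print(recognizes("piritii"))          # one complete normal beat -> True
print(recognizes("piritiiipiritii"))  # two normal beats -> True
print(recognizes("pirtii"))           # no isoelectric segment after R -> False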
Next, we will give one more example of a type 3 grammar, this one using a chain
code. But first, another way to describe regular languages: Regular expressions.
Given a set of terminal symbols, T, a regular expression is a string constructed by
concatenating elements of T and the symbol * (denoting repetition), with parentheses
to delineate order of operation, and comma to denote the logical OR operation.
In this section, we will use the terminals {0, 1, 2, 3, 4, 5, 6, 7} (the elements of a
chain code).
One element of the language generated by (0, 7)(0, 7)*(7, 6)(7, 6)(61, 72)(1, 2)
(1, 2)0(0, 1)* is illustrated in Fig. 16.7. The FSM which recognizes the set of all
strings generated by this regular expression is given in Fig. 16.8.
Fig. 16.8. State diagram for the nondeterministic FSM that recognizes strings generated by the
regular expression above. Two numbers separated by a comma denote that either
input will cause this transition. Any other input will cause a transition to an error state
which is not shown.
16.3.2
Type 2 grammars
Although type 3 grammars provide simple implementations as simple finite state machines, which can be built using only flip-flops and combinational logic, they do not have sufficient generality to solve many problems. In some applications, other
types of grammars may be more appropriate. In this subsection we present two
examples of shape recognition using type 2 grammars.
Recognition of chromosomes
The following example is abstracted from the text by Gonzalez and Thomason [16.6],
based originally upon the work of Ledley et al. [16.9], and illustrates the use of a
context-free grammar to recognize types of chromosomes.
The terminals in this grammar are boundary segments, denoted by a, b, c, d, and e,
and illustrated in Fig. 16.9. In the recognition setting, these might be called boundary
primitives. A chromosome will be described by a sequence of symbols a–e. Note that
Fig. 16.9. Primitive boundary segments used for syntactic pattern recognition. Note that
segment size and direction are important.
Fig. 16.10. (a) A submedian chromosome. (b) A telocentric chromosome. (Redrawn from [16.6].)
except for the symbol d, which can appear either way, the symbols have associated directions.
There are two types of chromosomes recognized by this grammar, telocentric and submedian, as illustrated in Fig. 16.10. Each may be described by a sequence of boundary segments. The following grammar (Table 16.6) will generate either type of chromosome.

Table 16.6. Productions which generate submedian (left column) and telocentric (right column) chromosomes.

S > S1        S > S2
S1 > AA       S2 > BA
A > CA        A > AC
A > DE        A > FD
B > bB        B > Bb
C > Cb        C > bC
D > Db        D > bD
E > cD        F > Dc
C > e         C > b
D > d         D > a
These productions were not invented without some thought. The first two productions, those involving the start symbol, S, control which type of chromosome
image is being generated: S1 denotes a submedian chromosome, and S2 designates
a telocentric chromosome. In addition, the other symbols connote components of
the chromosome boundary. That is, A will result in generation of armpair, B will
result in generation of bottom, C will result in generation of side, D will result in
generation of arm, E will result in generation of rightside, F will result in generation
of leftside.
Fig. 16.11. Two of the productions used to generate a hexagonal texture. (Redrawn from [9.2].)
Shape grammars
Finally one last example from [9.2] makes use of shape grammars [17.57] to generate
and recognize textures. In a shape grammar, both the set of terminal symbols, VT ,
and the set of nonterminal symbols, VN , are sets of shapes, with the restriction that
VT ∩ VN = ∅.
In this example, the set of terminals contains only one element, a hexagon:
C and push the symbol q onto the stack. Using such a machine, it is now possible
to recognize languages such as the example in section 16.2.3, where the number
of ones must equal the number of zeros. The idea is simple: Every time we see a
zero, push a zero onto the stack and stay in the same state. When we first see a one,
change state and pop the stack. Subsequently, every time we see a one, pop the stack.
If we ever see another zero, go to an error state. Else, when the stack goes empty,
the number of ones is equal to the number of zeros, and go to an accepting state.
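The stack procedure just described can be sketched directly (our own illustration) for the language of section 16.2.3, in which every string is a run of 0s followed by an equal-length run of 1s.

def accepts_0n1n(string):
    """Pushdown recognizer for the type 2 language { 0^n 1^n : n >= 1 }.

    Push a marker for every leading 0; on the first 1, change state and
    start popping; any 0 after a 1, or an unbalanced stack, rejects.
    """
    stack = []
    state = "zeros"                    # reading the run of 0s
    for ch in string:
        if state == "zeros":
            if ch == "0":
                stack.append("0")      # push and stay in the same state
            elif ch == "1" and stack:
                stack.pop()            # first 1: change state and pop
                state = "ones"
            else:
                return False
        else:                          # state == "ones"
            if ch == "1" and stack:
                stack.pop()
            else:
                return False           # a 0 after a 1, or too many 1s
    return state == "ones" and not stack   # empty stack: counts matched

print(accepts_0n1n("000111"))   # True
print(accepts_0n1n("0011"))     # True
print(accepts_0n1n("0101"))     # False
print(accepts_0n1n("0001"))     # False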
One of the principal concerns of practitioners of syntactic pattern recognition is
illustrated well by both examples of chromosome recognition and ECG recognition.
Both systems assume that a recognizer exists which can identify the primitives such
as T waves. The underlying assumption is that such a primitive preprocessor would
be simple, perhaps a template matcher, and robust to noise. In practice, this may
be difficult to accomplish, and may require that the grammar itself be designed for
some degree of noise tolerance. The reader is referred to textbooks [16.5, 16.6] on
syntactic methods for details.
16.4 Conclusion
Syntactic methods do not rely on optimization methods or consistency except in
how the terminal symbols are classified. A lower level algorithm is required to
extract these features from images. This is in fact the principal weakness of syntactic
methods. Because they must rely on other algorithms to provide input, they can fail
in two different ways. The symbol recognizers may fail due to noise, blur, occlusion
or other unexpected variation. Second, the object may simply not look like what
the grammar was designed for. It may look similar, but there is no simple way to
incorporate similarity into a grammar.
16.5 Vocabulary
You should know the meaning of the following terms.
Derivation
Finite state machine
Grammar
Nonterminal symbol
Primitive
Production
Pushdown automaton
Regular expression
Regular grammar
Shape grammar
Terminal symbol
Assignment 16.1
Show that the string representing the submedian chromosome in Fig. 16.10(a) can be generated by the grammar of Table 16.6.

Assignment 16.2
The statement was made earlier that the grammar of Table 16.6 is a type 2 grammar. Prove or disprove this statement.
References
[16.1] M. Chen, A. Kundu, and J. Zhou, Off-line Handwritten Word Recognition using
a Hidden Markov Model Type Stochastic Network, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 16(5), 1994.
[16.2] N. Chomsky, Three Models for the Description of Language, IRE Transactions on
Information Theory, 2(3), 1956.
[16.3] A. Corazza, R. De Mori, R. Gretter, and G. Satta, Optimal Probabilistic Evaluation Functions for Search Controlled by Stochastic Context-free Grammars, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(10), 1994.
[16.4] C. Fermuller and W. Kropatsch, A Syntactic Approach to Scale-space-based
Corner Description, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7), 1994.
[16.5] K.S. Fu, Syntactic Pattern Recognition and Applications, Englewood Cliffs, NJ,
Prentice-Hall, 1982.
[16.6] R. Gonzalez and M. Thomason, Syntactic Pattern Recognition, Reading, MA,
Addison-Wesley, 1978.
[16.7] J. Hopcroft and J. Ullman, Introduction to Automata Theory, Languages, and
Computation, Reading, MA, Addison-Wesley, 1979.
[16.8] W. Kropatsch, Curve Representation in Multiple Resolution, Pattern Recognition
Letters, 6(3), 1987.
[16.9] R. Ledley, L. Rotolo, R. Kirsch, M. Ginsberg, and J. Wilson, FIDAC: Film Input
to Digital Automatic Computer and Associated Syntax-directed Pattern-recognition
Programming System, In Optical and Electro-optical Information Processing, ed.
J. Tippet, D. Beckowitz, L. Clapp, C. Koester, and A. Vanderburgh, Cambridge, MA,
MIT Press, 1965.
[16.10] B. Olstad and A. Torp, Encoding of a priori Information in Active Contour Models,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.
17
Applications
Machine vision has found a wide set of applications from astronomy [17.44] to
industrial inspection, to automatic target recognition. It would be impossible to cover
them all in the detail which they deserve. In this chapter, we choose to provide the
reader with more of an annotated bibliography than a pedagogical text. We mention
a few applications very briefly, and provide a few references. In the next chapter,
we choose one application discipline, automatic target recognition, to cover in a bit
more detail.
how industry and universities might collaborate more effectively. Over two hundred
CEOs of machine vision companies (most of which are involved in manufacturing
inspection) were invited, and fewer than 30 showed up. Why was there such a poor
turnout when the topic is seemingly so important?
One possible answer is that most machine vision companies are small. But one
might ask, "Why are so many machine vision companies so small?" We speculate
that the uniqueness of this field comes in the answer to that question. The capital
investment required to set up a machine vision company is actually quite small. One
can get into the business with a computer, some inexpensive hardware, and some
good ideas. Unless you go into custom hardware or sophisticated specializations,
you might be able to get your company running without venture capital. You do not
need to be big to be in the machine vision business. Because the companies are so
small, they are intensely market-driven, and do not see that basic academic research
can help them in the short term. Sometimes they are right.
Still, some basic research in industrial inspection is getting done, such as registration using fiducial marks [17.61], automatic extraction of features [17.62], recognition of overlapping parts [17.21], as well as applications in assembly [17.43].
If a company has been in the business of manufacturing a particular line of products for many years, it is likely that many of those product designs were not entered
into CAD data bases. Reverse engineering is the process of going from legacy designs into modern data bases. It may require reading of blueprints [17.13], and
it may also require generation of CAD models and data bases [17.59] from actual objects, which in turn may require extraction of geometric primitives such as
spheres, cylinders, cones, etc. from range data [17.40], or other coordinate measuring
machines.
Microscopy is another application area in which machine vision plays an important
role. For example, Pap smears are often screened using automated systems, and white
blood cell counts may be done by computers as well. Tracking tubular molecules
in epi-fluorescence microscopy [17.46] has been recently reported in the research
literature.
Many industrial parts are specular reflectors, and their shape and roughness can
be extracted using multiple light sources [17.17, 17.56].
However it is done, and whatever the application, machine vision requires system
building and work on sensor modeling, together with hypothesis generation [17.73].
[Fig. 17.1 panels: Image H (Lincoln's head), Image E (penny on edge), Image T (the Lincoln Memorial).]
Fig. 17.1. Three possible views of a penny. Although possible, the view of a penny standing on
edge is so unlikely as to be discarded.
17.6.1
Robot surgery
Robot-assisted surgery is becoming more important in applications where precise
placement requirements are in excess of what a human can achieve. A common
application is in brain surgery, where the head may be rigidly and precisely held
[17.23, 17.34, 17.38]. In robot-assisted surgery it is necessary to match 3D medical
images (MRI or CT) with 2D x-ray projections. This can be formulated as the
estimation of the spatial pose of a 3D object from 2D images [17.36]. Recent robotic
surgical successes include coronary artery bypass graft on the beating heart [17.9],
stomach surgery [17.7], and gall bladder surgery [17.24].
17.6.2
Robot driving
Robot navigation includes identifying and avoiding obstacles [17.74] as well as
navigating on roads and off. Finding road edges from a moving vehicle has been
approached using a variety of imaging modes [17.20, 17.65, 17.72] as well as ground
level millimeter-wave radar [17.35].
We reiterate: robotic vision is really a system science. It draws components from all of the techniques described elsewhere in this book. For example, optic flow may be used to analyze camera motion and keep the camera trained on the target [17.49]; Grosso and Tistarelli [13.18] combine stereopsis and motion. Zhang et al. [17.74]
make use of the assumption that the robot is moving in the ground plane.
In fact, if one were to design a single project which would teach students the most
about engineering, robotics would probably be the optimal topic.
The reader interested in robot vision should peruse the proceedings of the IEEE
International Conference on Robotics and Automation.
Bibliography
[17.1] M. Anbar, Quantitative Dynamic Telethermometry in Medical Diagnosis and Management, Boca Raton, FL, CRC Press, 1994.
[17.2] M. Barzohar and D. Cooper, Automatic Finding of Main Roads in Aerial Images by
Using Geometric-stochastic Models and Estimation, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(7), 1996.
[17.3] M. Berman, Automated Smoothing of Image and Other Regularly Spaced
Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5),
1994.
[17.4] O. Bozma and R. Kuc, A Physical Model-based Analysis of Heterogeneous Environments using Sonar ENDURA method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1994.
[17.5] K. Bradshaw, I. Reid, and D. Murray, The Active Recovery of 3D Motion
Trajectories and Their Use in Prediction, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 19(3), 1997.
[17.6] R. Brunelli and D. Falavigna, Person Identification Using Multiple Cues, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(10), 1995.
[17.7] G. Cadiere, J. Himpens, M. Vertruyen, J. Bruyns, O. Germay, G. Leman, and
R. Izizaw, Evaluation of Telesurgical (Robotic) NISSEN Fundoplication. Surg.
Endosc., 15(9), pp. 918–923, 2001.
[17.8] V. Caglioti, Uncertainty Minimization in the Localization of Polyhedral Objects,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1995.
[17.26] X. Jia and M. Nixon, Extending the Feature Vector for Automatic Face Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12),
1995.
[17.27] B. F. Jones, A Reappraisal of the Use of Infrared Thermal Image Analysis in
Medicine, IEEE Transactions on Medical Imaging, 17(6), pp. 1019–1027, 1998.
[17.28] J. Kanai, S. Rice, T. Nartker, and G. Nagy, Automated Evaluation of OCR Zoning,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1995.
[17.29] T. Kanungo, R. Haralick, H. Baird, W. Stuezle, and D. Madigan, A Statistical,
Nonparametric Methodology for Document Degradation Model Validation, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(11), 2000.
[17.30] A. Katz and P. Thrift, Generating Image Filters for Target Recognition by Genetic
Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9),
1994.
[17.31] J. R. Keyserlingk, P. D. Ahlgren, E. Yu, N. Belliveau, and M. Yassa, Functional Infrared Imaging of the Breast, IEEE Engineering in Medicine and Biology, May/June,
pp. 30–41, 2000.
[17.32] G. Kopec and P. Chou, Document Image Decoding using Markov Source
Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6),
1994.
[17.33] A. Kumar, Y. Bar-Shalom, and E. Oron, Precision Tracking Based on Segmentation
with Optimal Layering for Imaging Sensors, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(2), 1995.
[17.34] Y. Kwoh, J. Hou, E. Jonckheere, and S. Hayati, A Robot with Improved Absolute
Positioning Accuracy for CT Guided Stereotactic Brain Surgery, IEEE Transactions
on Biomedical Engineering, 35(2), 1988.
[17.35] S. Lakshmanan and D. Grimmer, A Deformable Template Approach to Detecting Straight Edges in Radar Images, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(4), 1996.
[17.36] S. Lavallee and R. Szeliski, Recovering the Position and Orientation of Free-form
Objects from Image Contours Using 3D Distance Maps, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(4), 1995.
[17.37] K. Lee, Y. Choy, and S. Cho, Geometric Structure Analysis of Document Images: A
Knowledge-based Approach, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(11), 2000.
[17.38] P. Le Roux, H. Das, S. Esquenzai, and P. Kelly, Robot-assisted Microsurgery; Feasibility in a Rat Microsurgical Model, Neurosurgery, 48, 2001.
[17.39] E. Marchand and F. Chaumette, Active Vision for Complete Scene Reconstruction
and Exploration, IEEE Transactions on Pattern Analysis and Machine Intelligence,
21(1), 1999.
[17.40] D. Marshall, G. Lukacs, and R. Martin, Robust Segmentation of Primitives from
Range Data in the Presence of Geometric Degeneracy, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(3), 2001.
[17.41] N. Merlet and J. Zerubia, New Prospects in Line Detection by Dynamic Programming, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4),
1996.
[17.57] G. Stiny and J. Gips, Algorithmic Aesthetics: Computer Models for Criticism and
Design in the Arts, University of California Press, 1972.
[17.58] M. Swain and D. Ballard, Color Indexing, International Journal of Computer
Vision, 7(1), 1991.
[17.59] T. Syeda-Mahmood, Indexing of Technical Line Drawing Databases, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 1999.
[17.60] H. Takeda, C. Facchinetti, and J. Latombe, Planning the Motions of a Mobile Robot
in a Sensory Uncertainty Field, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(10), 1994.
[17.61] M. Tichem and M. Cohen, Submicron Registration of Fiducial Marks using Machine
Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8),
1994.
[17.62] S. Trika and R. Kashyap, Geometric Reasoning for Extraction of Manufacturing
Features in Iso-oriented Polyhedrons, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(11), 1994.
[17.63] L. Tsap, D. Goldgof, and S. Sarkar, Nonrigid Motion Analysis Based on Dynamic
Renement of Finite Element Models, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(5), 2000.
[17.64] T. Wakahara, Shape Matching using LAT and its Application to Handwritten
Numeral Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 1994.
[17.65] R. Wallace, A. Stentz, C. Thorpe, H. Moravec, W. Whittaker, and T. Kanade, First
Results in Robot Road-Following, Proceedings of the International Joint Conference on Artificial Intelligence, 1985.
[17.66] C. Wang, Collision Detection of a Moving Polygon in the Presence of Polygonal Obstacles in the Plane, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(6), 1994.
[17.67] C. Wang and W. Snyder, MAP Transmission Image Reconstruction via Mean Field
Annealing for Segmented Attenuation Correction of PET Imaging, 17th International Conference of the IEEE Engineering in Medicine and Biology Society, Montreal, September, 1995.
[17.68] C. Wang and W. Snyder, Frequency Characteristic Study Of Filtered-Backprojection
Reconstruction And Maximum Reconstruction For PET Images, 17th International
Conference of the IEEE Engineering in Medicine and Biology Society, Montreal,
September, 1995.
[17.69] C. Wang, W. Snyder, and G. Bilbro, Performance Evaluation of Filtered Backprojection Reconstruction and Iterative Reconstruction Methods for PET Images,
Computers in Medicine and Biology, 9(3), 1998.
[17.70] C. Wang, W. Snyder, G. Bilbro, and P. Santago, A Performance Evaluation of FBP
and ML Algorithms for PET Imaging, SPIE Medical Imaging, 1996.
[17.71] D. Weinshall and W. Werman, On View Likelihood and Stability, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2).
[17.72] J. Weng, and S. Chen, Vision-guided Navigation using SHOSLIF, Neural
Networks, 1998.
18
Automatic target recognition
This is the principal application chapter of this book.1 We have selected one application area: Automatic target recognition (ATR), and illustrate how the mathematics and algorithms previously covered are used in this application. The point to be made is that almost all applications similarly benefit from not one, but fusions of most of the techniques previously described. As in previous chapters, we provide the reader with both an explanation of concepts and pointers to more advanced literature. However, since this chapter emphasizes the application, we do not include a Topics section in this chapter.
Automatic target/object recognition (ATR) is the term given to the field of engineering sciences that deals with the study of systems and techniques designed to identify, to locate, and to characterize specific physical objects (referred to as targets) [18.7, 18.9, 18.69], usually in a military environment. Limited surveys of the field are available [18.3, 18.8, 18.21, 18.66, 18.74, 18.79, 18.89]. In this chapter, the only ATR systems considered are those that make use of images. Therefore, our use of terminology (e.g., clutter) will be restricted to terms that make sense in an imaging scenario.
The authors are indebted to Rajeev Ramanath, who assisted significantly in the generation of this chapter, and in fact wrote some sections, and to Richard Sims and John Irvine, who provided careful reviews and extremely helpful feedforward.
18.1.1
ATR terminology
There are a few other terms that are often used in the ATR literature. We give the
definitions as follows.
Chip. A small image usually containing the image of a single target, extracted from a
large image of a scene. Target cueing algorithms, which identify the likely presence
of a target, often produce chips as output.
Detection rate. Fraction of targets correctly detected by the system.
Classification rate. Fraction of targets classified correctly, or more generally, the
conditional probability of correct recognition given the target was detected.
Clutter. Objects that are imaged but are not targets. Clutter typically may be trees,
houses, and other vehicles: anything that is in the picture but is not a target.
Cultural clutter. Refers to man-made objects like buildings, as opposed to natural
objects.
False alarm rate. Generally, the fraction of the number of detections that do not
correspond to actual targets. However, this definition may be modified if the task is
classification rather than detection. We observe that false alarm rate is not the same
as probability of false alarm. The false alarm rate is usually given in false alarms per
square kilometer. See section 18.3.
[Figure residue: block diagram of the ATR processing stages (Detection, Segmentation, Classification, Recognition).]
Fig. 18.2. Different imaging modalities. (a) Visible spectrum image. (b) Thermal IR image (notice
hot tank engine on far left). (c) Ground truth (from [18.5]). Used with permission.
Band                                                Wavelength range   Source
Visible (V)                                         0.4–0.7 µm         Solar illumination
Near-infrared (NIR)                                 0.7–1.1 µm         Solar illumination
Shortwave infrared (SWIR)                           1.1–2.5 µm         Solar illumination
Midwave infrared (MWIR)                             3–5 µm             Solar illumination, Thermal
Thermal infrared (TIR) or longwave infrared (LWIR)  8–12 µm            Thermal
Microwave, RADAR                                    1 mm–1 m           Thermal, Artificial
Given an image of the scene (which consists of a possible target and background), we need to detect the target. Target detection methods can be viewed as
having two steps [18.86]. In the first step, appropriate measurements are extracted
from an image using low-level image processing techniques. These measurements
are then utilized to derive a primary segmentation of the image into regions. In
the second step, higher-level descriptors of the segmented regions are used to determine the presence or absence of the target (detect) and possibly classify that
target.
[Figure residue: block diagram of the hypothesis-testing model: a source (emitting under H0 or H1) and a noise source feed the measurement space, and a decision rule produces the decision (D0 or D1).]
Z = a + n    (under H0),
Z = b + n    (under H1).    (18.1)
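For instance, if n is zero-mean Gaussian noise with the same variance under both hypotheses, the priors on H0 and H1 are equal, and a < b, the minimum-error decision rule reduces to a simple threshold: decide D1 whenever Z > (a + b)/2, and D0 otherwise.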
Probability of detection, Pd.
Sensitivity. Probability of a true positive; that is, the ratio of true positives to the sum of true positives and false negatives: P(D1|H1)P(H1). In the specific application of target detection (as distinct from classification, recognition, and identification), the sensitivity is referred to as the probability of detection and denoted Pd.
Specificity. Probability of a true negative; that is, the ratio of true negatives to the sum of true negatives and false positives: P(D0|H0)P(H0).
Observe that P(Di|Hi)P(Hi) = P(Di, Hi). Now, the probability of a correct decision is
P(C) = P(D0, H0) + P(D1, H1) = P(D0|H0)P(H0) + P(D1|H1)P(H1).
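As a quick numerical illustration (the numbers are arbitrary): if P(H1) = 0.3, P(D1|H1) = 0.9, and P(D0|H0) = 0.8, then P(C) = 0.8 × 0.7 + 0.9 × 0.3 = 0.56 + 0.27 = 0.83.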
18.3.1
18.3.2
[Figure residue: plot relating the decision probabilities P(D0|H0), P(D1|H0), P(D0|H1), and P(D1|H1), with axis values marked at 0.25, 0.5, and 0.75.]
P(E) = ∫_{Γ0} P(H1) p(z|H1) dz + ∫_{Γ1} P(H0) p(z|H0) dz,    (18.2)
where Γi is the region of measurement space over which we decide class i, P(Hi) is the prior probability of seeing an example of class i, and p(z|Hi) is the conditional probability density of whatever we are measuring (e.g., brightness) given that the example is from class i. Of course, we do not know the TRUE probability densities; we only know our estimates of those densities, determined from the training sets. Furthermore, we derived the decision regions Γi from those (estimated) densities.
We could try to determine the error rate by simply counting the number of elements in the training set that are misclassified: We call this the apparent error rate. Unfortunately, this leads to an optimistic result: It underestimates the error rate of the system when tested on data not in the training set. This occurs because the ATR has been designed to minimize the number of misclassifications of the training set, and unless the training set perfectly represents the true distribution of the data, the classifier will reflect characteristics of the training set which may not be true of the entire sample population.
We must distinguish the apparent error rate from the true error rate. Although we
have no way to determine the true error rate, we may get a better estimate using the
two different approaches now described.
This is straightforward. We simply divide our original training set into two parts (randomly of course) and build the classifier using half of the data. We then test the system on the other half. This approach works reasonably well if we have very large training sets (thousands of examples), or better said, something like 10d where d is the dimensionality of the problem. (What does dimensionality mean here?) Unfortunately, such large training sets are unlikely in most problems.
Assume there are n points in the training set. Remove point 1 from the set, and design the classifier using the other n − 1 points. Then test the resulting machine on point 1. Repeat for all points. The resulting error rate can be shown to be an almost unbiased estimate of the expected true error rate for the classifier designed using all n points. Of course, this requires that we design n machines, which could be prohibitive. However, with such a result, we have numbers which we can put into the ROC curves.
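The following Python sketch illustrates the leave-one-out estimate just described; the nearest-class-mean classifier and the synthetic samples are placeholders standing in for a real ATR classifier and training set.

import numpy as np

def nearest_mean_classifier(train_x, train_y):
    # A toy classifier: assign a point to the class whose mean is nearest.
    classes = np.unique(train_y)
    means = {c: train_x[train_y == c].mean(axis=0) for c in classes}
    return lambda x: min(means, key=lambda c: np.linalg.norm(x - means[c]))

def leave_one_out_error(x, y):
    # Design n classifiers, each trained on n - 1 points and tested on the held-out point.
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        clf = nearest_mean_classifier(x[keep], y[keep])
        errors += (clf(x[i]) != y[i])
    return errors / n

# toy two-class data (illustrative only)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print("leave-one-out error estimate:", leave_one_out_error(x, y))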
18.3.3
localizes potential target areas. Hence, probability of detection and false alarm may
be used to evaluate this step. The segmentation operation extracts the target after it has
been detected. We may therefore use measures like misclassified pixels, correlation coefficient between true and extracted target, etc.
18.4.1
All images record signals received as a sum of emitted and reflected radiation:
f(x, y) = e(x, y) + r(x, y),    (18.3)
where e(x, y) denotes the emitted and r(x, y) the reflected component. However, the amount emitted may be dramatically smaller than the amount reflected (in the case of visible light) or significant (in the case of longwave infrared). The emissivity is the ratio of emitted to total radiation:
ε(x, y) = e(x, y) / f(x, y).    (18.4)
[Fig. 18.5 residue: curves for aluminum (Al), iron (Fe), and a tire plotted against time of day, with tick marks every two hours from 00:00 to 22:00.]
Fig. 18.5. Objects may exhibit contrast reversals over a 24 hour period. (From US Army Night
Vision and Electro-optics Research Center, used with permission.)
Occlusions
Unlike industrial machine vision problems, where the setup of the manufacturing facility is specifically designed to minimize occlusion, not only do occlusions occur in ATR scenes, but targets are usually at least partially occluded. In fact, the opponent will be actively trying to have his equipment as occluded as possible [4.13]! An image of a truck occluded by a tree is shown in Fig. 18.7, later.
All these variabilities raise the question of "How well trained should the ATR be?" It is entirely too easy to over-train a system, and produce a system which performs very well on the data on which it has been trained, but poorly on data it has not seen, even though that data may (in the eyes of a human) be very similar. The problem is not to get the probability of detection high, but to do so while simultaneously keeping the false alarm rate low. The Neyman–Pearson test [18.53] provides a means to perform such a minimization with performance bounds on probability of false alarm.
18.4.2
Tracking
In ATR, many if not most applications require target tracking, and furthermore the tracking problems are less constrained and more challenging than tracking in the civilian domain. Centroid tracking is the simplest type of tracking algorithm, although there are ways to improve its sophistication [18.39]. The centroid tracker (usually) assumes there is just one target in the field of view, and that the bright spot is much brighter than the background. If those assumptions are true, the centroid of the target is the centroid of the field of view. More sophisticated tracking of moving objects is most often done using optimal filters like the Kalman–Bucy filter.
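A minimal sketch of such a centroid tracker (Python; it assumes a single bright target on a dark background, and the optional threshold is our addition):

import numpy as np

def centroid_track(frame, threshold=None):
    # Intensity-weighted centroid (row, col) of a 2D image array; under the stated
    # assumptions this is also the centroid of the target.
    if threshold is not None:
        frame = np.where(frame > threshold, frame, 0.0)
    rows, cols = np.indices(frame.shape)
    total = frame.sum()              # assumed nonzero (some target energy present)
    return (rows * frame).sum() / total, (cols * frame).sum() / total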
Haddad and Simanca [18.28] discuss the limitations of the Kalman filtering approach and propose a nonlinear tracking filter based on wavelets and the Zakai
equation. Amoozegar et al. [18.3] provide a survey of fuzzy and neural techniques
in tracking.
The process of tracking can be combined with the process of classifying vehicles
[13.12, 18.22] as well.
18.4.3
Segmentation
In most ATR scenarios, the problem of separating clutter from target is fundamental.
Clutter varies from one scene class to another and requires adaptive representations
[18.38]. However, there currently is not even a uniformly accepted definition for
signal-to-clutter [18.61, 18.68, 18.78].
Once a potential target is localized, it is extracted from the background as accurately as possible. However, every segmenter makes prior assumptions about the
target and its neighboring pixels. These assumptions may not be valid for all viewing
conditions. As we learned in Chapter 8, two common approaches to segmentation are
edge or boundary formation and region growing [18.68]. Edge detection approaches
are based upon recognizing dissimilarities in images whereas region growing utilizes
similarity properties. Because edge detection techniques are quite sensitive to noise,
successful edge detection usually depends on higher level semantic knowledge. Region growing techniques offer better immunity to noise and therefore do not have as
much reliance on semantic knowledge. Qi et al. [18.63] propose an efficient segmentation approach to segment man-made targets from unmanned aerial vehicle (UAV) imagery using curvature information derived from an image histogram smoothed by Bézier splines. Experimental results show that by enhancing the histogram instead of the original image, similar segmentation results can be obtained in a more efficient
way. In [18.87], a segmentation strategy based on the image pyramid data structure
is developed, working its way from the top of the pyramid to the bottom, processing
image detail hierarchically.
As we learned in Chapter 6, diffusion and diffusion-like processes [18.41, 18.42]
provide excellent noise-removal steps as components of a segmentation process.
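As a rough sketch of the idea of operating on the histogram rather than on the image, the following Python fragment smooths a gray-level histogram and takes the deepest interior valley as a threshold; it uses simple Gaussian smoothing as a stand-in for the Bézier-spline curvature analysis of [18.63].

import numpy as np

def histogram_threshold(image, bins=256, sigma=3.0):
    # Build and smooth the gray-level histogram, then return the gray level at the
    # deepest interior local minimum of the smoothed histogram.
    hist, edges = np.histogram(image, bins=bins)
    k = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    kernel = np.exp(-0.5 * (k / sigma) ** 2)
    kernel /= kernel.sum()
    smooth = np.convolve(hist, kernel, mode="same")
    interior = smooth[1:-1]
    valleys = np.where((interior < smooth[:-2]) & (interior < smooth[2:]))[0] + 1
    if len(valleys) == 0:
        return edges[bins // 2]      # fallback: no interior valley found
    best = valleys[np.argmin(smooth[valleys])]
    return edges[best]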
18.4.4
Feature selection
Most of the features used by researchers are geometric, topological and/or spectral
[18.7]. The primary goal of feature selection is to obtain features which maximize
the similarity of objects in the same class while minimizing the similarity of objects
in different classes. The mathematics of feature selection are covered nicely in a
textbook by Hand [18.30].
Human performance is dependent on pixels-on-target, too.
(1) ATR systems must of necessity deal with unstructured environments: it simply
is not possible to control the illumination, the viewing angle, the atmosphere,
etc.
(2) Occluded targets are not just possible, they are likely.
(3) Only a few pixels are likely to be on target. The probability of correct classification strongly depends on the number of pixels on target. This is illustrated in Fig. 18.6 for a neural network classifier, but similar results, including especially the dramatic change around 50 pixels-on-target, occur for any system, including a human.
Interestingly, the first studies relating pixels on target to Pd were done prior to
digital images, and considered scan lines rather than pixels. In infrared images
of military targets, Johnson [18.36] found that target recognition by humans
required at least four pixels across the critical (smallest) target dimension. Of
course, such results vary with target complexity and recognition task details, but
[Fig. 18.6 residue: plot of classification performance (vertical axis marked 25, 50, 75, 100) versus pixels on target (horizontal axis marked 100 through 500).]
Another type of ATR system is based on the premise that the more sensory data
that is available from the target of interest, the better the system performance. This
is intuitively obvious for sensors that have complementary properties. Due to numerous limitations of single-sensor ATR systems, there has been a move toward
multisensor targeting systems and, hence, the problem of correlating and combining
data generated from multiple sensors. This is also sometimes referred to as multisensor fusion; however, the information sources may be different sensors (sensor
fusion) or different algorithms (algorithm fusion) [18.32].
Finally, some researchers break the set of ATR algorithms into model-based methods, statistics-based methods, and template-based methods. These three categorizations are discussed in more detail below.
18.5.1
Model-based techniques
Most model-based techniques are geometry-based. They pose the question: "Given a certain viewing angle, what should the target look like?" [18.1, 18.6, 18.12, 18.13]. This is potentially a powerful philosophy, as it provides information on portions of the target that may have been occluded due to its position; e.g., from a certain viewpoint the barrel of a tank may be missing. However, if we have a 3D model
of the target, we could generate all possible views and perform a combinatorial
search [18.65] to get a match. Model-based techniques are readily combined with
different data types, especially range (laser radar) images [18.87]. However, like
almost everywhere in machine vision, some optimization problem must be solved,
using neural networks [18.29], genetic algorithms [18.10], or other optimization
methods.
Usually, descriptions that correspond to scene structure and geometry alone are
obtained as opposed to scene physics (heat, light, material properties, etc.). Matching
is then the process of hypothesizing and verifying matches between model and
image points. This process produces a 3D to 2D transformation which brings the
3D model points into correspondence with the 2D image points. The best match
is then the transformation which best explains the scene. Solution of the 3D to 2D
correspondence is basically solution of the perspective equation. Errors between
projected model points and the corresponding image points help verify how good a
match is.
These methods are powerful, but require a lot of processing and a large database.
They perform poorly when targets are occluded [18.73] as this results in a case of
incomplete information. To this end, a lot of work is being done in obtaining the
actual geometric shape of the target from occluded views [18.75].
Occlusions, in general, fall into two distinct categories: contiguous (buildings or whole trees as shown in Fig. 18.7) and distributed (tree branches). The first type is easier to deal with as it is possible to have enough information on the nonoccluded
A histogram may be constructed of the chain code directions.
18.5.2
Statistics-based techniques
The ideas behind statistics-based techniques are the same as those described in Chapter 14: (1) Obtain features; (2) develop statistical measurements which characterize
different classes; (3) make decisions which optimize some measure such as minimum cost, or maximum probability of correct decision. In this section, we consider
only multispectral measurements.
Multispectral matching
One technique for multispectral analysis uses the concept of histogram intersection
introduced by Swain and Ballard [18.82] for the realm of color images. The idea
is simply to compare the histograms of two images and determine any overlap
factor (how many pixels in the data base histogram match those in the histogram
of the new image). Specifically, given a pair of histograms, I (from new image) and M (from database), each containing n bins, the intersection is defined to be ∑_{j=1}^{n} min(I_j, M_j). The result is the number of pixels that have the same color in the
two images. This number may be normalized to obtain overlap factor. The color of an
object is subject to significant change depending upon varying lighting conditions;
and in this situation, this simple algorithm will clearly not give us a good match. To
overcome this problem, Funt and Finlayson [18.24] combined the idea of histogram
intersection with a concept referred to as color constancy [18.23], which removes
effects of varying illumination conditions and in effect, normalizes the image to
a standard illuminant. Now that we have a data base also in standard illuminant
conditions, we can compare apples with apples and use the histogram intersection
method described above. We have not put any restriction on the dimensionality of the
histogram, and hence can extend this concept to higher dimensions (more sensors),
obtaining a more robust system.
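A direct transcription of the histogram-intersection measure (Python; the argument names I and M follow the notation above, and dividing by the model count is one possible normalization to an overlap factor):

import numpy as np

def histogram_intersection(I, M, normalize=True):
    # Sum of bin-wise minima between an image histogram I and a model histogram M.
    I, M = np.asarray(I, float), np.asarray(M, float)
    match = np.minimum(I, M).sum()
    return match / M.sum() if normalize else match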
Another measure of spectral match between a known target signature and an observed signature is obtained by treating the signatures as vectors and finding the inner
product between the two vectors [18.93]. The better the match, the closer the angular
separation to zero. In other words, if we have two d-dimensional measurements in
the spectral signatures, X and Y , then the distance between these two measurements
is the angle
θ = cos⁻¹( XᵀY / (|X| |Y|) ).
A small θ indicates that the two spectra are quantitatively similar; similarly, a large θ indicates dissimilar spectra. In [18.93], Weisberg et al. use this measure to perform a clustering procedure, segmenting the image into multiple regions of interest. There are many other measures of similarity that are in use [18.34, 18.62].
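A minimal sketch of the spectral-angle measure between two d-dimensional signatures (Python; names are illustrative):

import numpy as np

def spectral_angle(X, Y):
    # Angle between two spectral signature vectors; zero means identical direction.
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    cos_theta = X @ Y / (np.linalg.norm(X) * np.linalg.norm(Y))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))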
In [18.86], Trivedi introduces the use of relative spectral information rather than
absolute information about the target for remote-sensing purposes. This introduces
some robustness into the system. For example, a certain object may be brighter than
the background in a particular channel and darker in some other channel.
18.5.3
transform. Cowart et al. [18.19] also consider the use of a parametric transform
which allows for maneuvering targets.
Viewed from above, a ship is a long, straight (unless the ship has had an unfortunate
encounter) object. When viewed with SAR,2 the image of a ship consists of spots and
dropouts. One may estimate [18.25, 18.50] the orientation of a ship by simply taking
the Hough transform. Surprisingly, it appears [18.25] that the Hough transform is
less sensitive to noise than using principal axes. If the ship is moving, its wake is
even longer, straighter, and may be more easily found [18.18].
A satellite, when viewed from a terrestrial telescope, has straight edges (see Fig.
18.8). The Hough transform can also be used to identify and characterize those edges
[18.20].
[Table 18.2 residue: outcomes listed as "Object erased", "Some parts removed, some preserved", and "Object preserved", depending on the relative size of the object, L, and the structuring-element radius, R.]
Fig. 18.9. A ring-shaped structuring element and a rectangular object (redrawn from [18.60]).
(18.7)
where B is the ring-shaped structuring element of Fig. 18.9. The behavior of this
structuring element is different according to the relative size of the object, L, and
the radius of the structuring element, R. Such methods [18.58] lead to morphology-based algorithms which are more specific to target recognition [18.10] or target tracking [18.92]. The result is shown in Table 18.2.
Shape recognition using morphology has also been addressed by Pham et al.
[18.58].
[Figure residue: block diagram of the shape-recognition chain: segmented region, chain code, histogram of chain code, histogram matching, trellis algorithm, decision.]
(1) Scale variation in the image domain is equivalent to a vertical shift in the chain
code histogram domain.
(2) Changes in orientation are equivalent to horizontal cyclic shifts in the histogram.
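A short sketch of how such a chain-code histogram might be computed (Python; an 8-direction chain code is assumed):

import numpy as np

def chain_code_histogram(chain_code):
    # Histogram of 8-connected chain-code directions (0..7) along a boundary.
    # Scaling the object changes the bin heights; rotating it by a multiple of 45 degrees
    # cyclically shifts the bins, consistent with the two properties listed above.
    codes = np.asarray(chain_code) % 8
    return np.bincount(codes, minlength=8)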
It is possible that two different objects have the same chain code histogram. To this
end a trellis algorithm may be used to distinguish between such objects [18.81]. A
trellis structure of a distorted pattern is created using each row vector as a fractional
pattern (a node in the trellis). A large collection of observed data is used to provide
a statistical basis for this trellis, to which the Viterbi algorithm (see section 2.4.2) is
applied. Although the use has been demonstrated only for handwritten characters,
this method may be universally applied to any category of distorted patterns.
18.9 Conclusion
A wide variety of papers are available in the public literature that have comparative
studies on different techniques. For example, Li et al. [18.43] compared a number
of neural, statistical, and model-based approaches and concluded that, at least for FLIR images, the neural network-based approaches gave better results than the PCA (principal components analysis)- and LDA (linear discriminant analysis)-based approaches. They found that methods based on the Hausdorff distance also performed well.
Often, papers in the literature present conclusions that make the reader think the
ATR problem is nearly solved when, in reality, either these systems have not been
tested on real military data or their scope is so limited to one specific application
and set of conditions that their actual performance may be doubted. We need
to understand that the problem being dealt with here is not only nontrivial, it gets
substantially more difficult with every new invention made by the enemy. What is
state-of-the-art today may not be tomorrow.
Nonetheless, from an engineering point of view (which is almost always practical!), the authors believe that the following are needed to produce ATR systems that may be benchmarked:
• a large set of standard real-world images, with ground truth information available to the research community,
• a tool to optimally evaluate and incorporate the best of a large number of ATR techniques into a system with optimal performance.
Several questions arise when we take these issues into consideration. How do we come up with a standard? What objectives are we trying to reach: a powerful system that has the best ROC curves, or a portable system using minimal hardware? Clearly all these cannot be met at the same time and someone has to strike a balance.
General trends are seen, however, in the development of ATR systems. The use of more than one sensory system, portability of the system, the ability to be a truly automated system with minimal human interference, and the increased use of mathematical tools to provide a sound basis for the system are clearly where research seems to be headed. In this chapter the authors hope to have communicated the vastness of this problem while also presenting the achievements of science in approaching it.
Bibliography
[18.1] J. Albus, Applications of an Efcient Algorithm for Locating 3D Models in
2D Images, Automatic Target Recognition VIII, Proceedings SPIE, 3371, April,
1998.
[18.2] K. Al-Ghoneim and B. Kumar, Combining Focused MACE Filters for Target Detection, Automatic Target Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.3] F. Amoozegar, A. Notash, and H. Pang, Survey of Fuzzy Logic and Neural Network Technology for Multi-target Tracking, Automatic Target Recognition VIII,
Proceedings SPIE, 3371, April, 1998.
[18.4] M. Barzohar and D. Cooper, Automatic Finding of Main Roads in Aerial Images by
Using Geometric-stochastic Models and Estimation, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(7), 1996.
[18.5] J. Beveridge, D. Panda, and T. Yachik, November 1993 Fort Carson RSTA Data
Collection Final Report, Colorado State University Technical Report, CS-94-118,
August, 1994.
[18.6] J. Bevington and K. Siejko, Ladar Sensor Modeling and Image Synthesis for
ATR Algorithm Development, Automatic Target Recognition VI, Proceedings SPIE,
2756, April 1996.
[18.7] B. Bhanu, Automatic Target Recognition: State of the Art Survey, IEEE Transactions on Aerospace and Electronic Systems, 22(4), 1986.
[18.8] B. Bhanu and T. Jones, Image Understanding Research for Automatic Target Recognition, Proceedings of the DARPA Image Understanding Workshop, 1992.
[18.9] B. Bhanu and T. Jones, Image Understanding Research for Automatic Target Recognition, IEEE Aerospace and Electronics Systems Magazine, 8(10), 1993.
[18.10] M. Boshra and B. Bhanu, Predicting Performance of Object Recognition, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(9), 2000.
[18.11] M. Bullock, D. Wang, S. Fairchild, and T. Patterson, Automated Training of 3-D
Morphology Algorithm for Object Recognition, Automatic Target Recognition IV,
Proceedings SPIE, 2234, April, 1994.
[18.12] J. Burrill, S. Wang, A. Barrow, M. Friedman, and M. Soffen, Model-based Matching using Elliptical Features, Automatic Target Recognition VI, Proceedings SPIE,
2756, April, 1996.
[18.63] H. Qi, W. E. Snyder, and D. Marchette, An Efcient Approach to Segmenting Manmade Targets from Unmanned Aerial Vehicle Imagery, Optical Engineering, 39(5),
pp. 1267–1274, 2000.
[18.64] H. Ranganath and R. Sims, Self Partitioning Neural Networks for Target Recognition, Automatic Target Recognition IV, Proceedings SPIE, 2234, April, 1994.
[18.65] K. Rao, Combinatorics Reduction for Target Recognition in ATR Applications,
Automatic Object Recognition II, Proceedings SPIE, 1700, 1992.
[18.66] J. Ratches, C. Walters, R. Buser, and B. Guenther, Aided and Automatic Target
Recognition Based upon Sensory Inputs from Image Forming Systems, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(9), Sept. 1997.
[18.67] A. Reno, D. Gillies, and D. Booth, Deformable Models for Object Recognition in
Aerial Images, Automatic Target Recognition VIII, Proceedings SPIE, 3371, 1998.
[18.68] E.M. Riseman and M.A. Arbib, Computational Techniques in Visual Segmentation
of Static Scenes, CCGIP, 7, Target Recognition VIII, Proceedings SPIE, 3371,
April, 1998.
[18.69] R. Robmann and H. Bunke, Towards Robust Edge Extraction a Fusion Based
Approach using Greylevel and Range Images, Automatic Target Recognition V,
Proceedings SPIE, 2485, April, 1995.
[18.70] F.A. Sadjadi, A Model-Based Technique for Recognizing Targets by Using Millimeter Wave Radar Signatures, International Journal of Infrared and Millimeter
Waves, 10(3), 337–342, 1989.
[18.71] F.A. Sadjadi, Automatic Object Recognition: Critical Issues and Current Approaches, Proceedings SPIE, 1471, 1991.
[18.72] F.A. Sadjadi, Automatic Object Recognition: Critical Issues and Current Approaches, 1991. Selected SPIE Papers on CD-ROM, Volume 6. Automatic Target
Recognition SPIE: 1 PO Box 10, Bellingham, WA 98227-0010, USA.
[18.73] F.A. Sadjadi, Automatic Object Recognition: Critical Issues and Current Approaches, 1991. Selected SPIE Papers on CD-ROM, Volume 6. Automatic Target
Recognition SPIE: 1 PO Box 10, Bellingham, WA 98227-0010, USA.
[18.74] F.A. Sadjadi, Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.75] F.A. Sadjadi, Application of Genetic Algorithm for Automatic Recognition of
Partially Occluded Objects, Automatic Object Recognition IV, Proceedings SPIE,
2234, 1994.
[18.76] R. Samy and J. Bonnet, Robust and Incremental Active Contour Models for Objects
Tracking, Automatic Target Recognition V, Proceedings SPIE, 2485, April, 1995.
[18.77] R. Sharma and N. Subotic, Construction of Hybrid Templates from Collected and
Simulated Data for SAR ATR Algorithms, Automatic Target Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.78] R. Sims, Signal to Clutter Measurement and ATR Performance, Automatic Target
Recognition VIII, Proceedings SPIE, 3371, April, 1998.
[18.79] R. Sims and B. Dasarathy, Automatic Target Recognition using a Passive Multisensor Suite, Special Section on ATR, Optical Engineering, 31(12), 1992.
[18.80] A. Srivastava, B. Thomasson, and R. Sims, A Regression Model for Prediction of
IR Images, Proceedings of SPIE Aerosense ATR XI, Orlando, 2001.
416
[18.81] L.B. Stotts, E.M. Winter, L.E. Hoff, and I.S. Reed, Clutter Rejection using Multispectral Processing, Proceedings SPIE Signal and Data Processing of Small Targets,
1305, pp. 2–10, 1990.
[18.82] M. Swain and D. Ballard, Color Indexing, International Journal of Computer
Vision, 7(1), 1991.
[18.83] H. Tanaka, Y. Hirakawa, and S. Kaneku, Recognition of Distorted Patterns Using the Viterbi Algorithm, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 4(1), 1982.
[18.84] W. Thoet, T. Rainey, D. Brettle, L. Stutz, and F. Weingard, ANVIL Neural Network
Program for Three-dimensional Automatic Target Recognition, Optical Engineering, 31(12), December, 1992.
[18.85] M.M. Trivedi, Detection of Objects in High Resolution Multispectral Aerial
Images, SPIE Applications of Artificial Intelligence II, 1985.
[18.86] M. Trivedi, Object Detection Using Their Multispectral Characteristics, Proceedings SPIE, 754, 1987.
[18.87] M. M. Trivedi and J. C. Bezdek, Low-Level Segmentation of Aerial Images with
Fuzzy Clustering, IEEE Transactions on Systems, Man, and Cybernetics, 16(4),
1986.
[18.88] A. Ueltschi and H. Bunke, Model-based Recognition of Three-dimensional Objects
from Incomplete Range Data, Automatic Target Recognition V, Proceedings SPIE,
2485, April, 1995.
[18.89] J. Wald, D. Krig, and T. DePersia, ATR: Problems and Possibilities for the IU
Community, Proceedings of the DARPA Image Understanding Workshop, IUW,
255–264, San Diego, January, 1992.
[18.90] B. Wallet, D. Marchette, and J. Solka, A Matrix Representation for Genetic Algorithms, Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.91] B. Wallet, D. Marchette, and J. Solka, Using Genetic Algorithms to Search for
Optimal Projections, Automatic Target Recognition VII, Proceedings SPIE, 3069,
April, 1997.
[18.92] S. Wang, G. Chen, D. Sapounas, H. Shi, and R. Peer, Development of Gazing
Algorithms for Tracking Oriented Recognition, Automatic Target Recognition VII,
Proceedings SPIE, 3069, April, 1997.
[18.93] A. Weisberg, M. Najarian, B. Borowski, J. Lisowski, and B. Miller, Spectral Angle
Automatic Cluster Routine (SAALT): An Unsupervised Multispectral Clustering
Algorithm, Proceedings of IEEE Aerospace Conference, 307–317, 1999.
[18.94] D. Xue, Y. Zhu, and G. Zhu, Recognition of Low-contrast FLIR Tank Object Based
on Multiscale Fractal Character Vector, Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
[18.95] C. Zhou, G. Zhang, and J. Peng, Performance Modeling Based Adaptive Target
Tracking in Multiscenario Environment, Automatic Target Recognition VI, Proceedings SPIE, 2756, April, 1996.
Author index
Aarts, E. 142
Adams, R. 211
Aggarwal, J. 48, 63, 401
Aghajan, H. 287, 288
Ahlgren, P. 388
Ahuja, N. 258, 261
Alter, T. 62
Amir, A. 201, 258
Amit, Y. 322
Amoozegar, F. 402, 412
Anandan, P. 142
Anbar, M. 386
Anderson, C. 63
Andrey, P. 211
Anh, V. 104
Arbib, M. 215, 415
Arbter, K. 256
Arcelli, C. 173
Arrott, M. 256
Astrom, K. 322
Attneave, F. 273
Aubert, G. 214
Avnir, D. 227, 262
Babaud, J. 104
Bachmann, C. 414
Badler, D. 62
Bagheri, A. 258
Baird, H. 388
Bajcsy, R. 62
Ball, G. 368
Ballard, D. 256, 288, 390, 406,
416
Baram, Y. 389
Bar-Shalom, Y. 388, 413
Barzohar, M. 386, 407, 411
Batchelor, B. 215
Baudin, M. 104
Baur, C. 414
Belliveau, N. 388
Ben-Arie, J. 324
Ben-Ezra, M. 64
Beni, G. 368
Bennett, N. 288
Berkmann, J. 211
Berman, M. 386
Bertrand, G. 178
Besag, J. 114
Beucher. S. 211
Bezdek, J. 416
Bhanu, B. 322
Bhattachaya, P. 323
Bichsel, M. 256
Bigun, J. 62
Bilbro, G. 140, 142, 143, 204, 211, 214,
322
Bilbro, S. 215
Bimbo, A. 322
Bischof, L. 211
Bister, M. 166
Black, M. 120, 139
Blackman, S. 412
Blais, G. 204, 211
Blake, A. 129, 260
Blanc-F`eraud, L. 214
Blostein, S. 251, 257
Bobick, A. 256
Bokil, A. 62
Bolle, R. 259
Bonnet, J. 415
Booth, D. 415
Borowski, B. 416
Bose, S. 262
Boult, T. 257
Bovik, A. 261
Bowyer, K. 212, 213, 237, 297
Boyer, K. 105, 211, 325
Bozma, O. 386
Bradshaw, K. 386
Brady, J. 260
Brady, M. 313, 325
Breu, H. 154, 178
Brice, C. 212
Brooks, M. 105
Brosnan, T. 414
Bruckstein, A. 201, 258
Brunelli, R. 386
Buhmann, J. 289
Bui, T. 215
Bullock, M. 411
Bunke, H. 212, 296, 415, 416
Burbeck, C. 259
Burkhardt, H. 256
Burridge, R. 288
Cryer, J. 262
Cukierman, F. 215
Dai, X. 387
Darrell, T. 323
Das, H. 388
Dasarathy, B. 415
Daugman, J. 105
Davis, J. 256
Davis, T. 212
De Mori, R. 381
DeCarlo, D. 323
Deconinck, F. 322
deFigueiredo, R. 98
Defrise, M. 322
Delingette, H. 323
Deng, W. 99, 105
DePersia, T. 416
Deriche, M. 105
Derin, H. 139
Desai, U. 141
Dhome, M. 257
Dhond, U. 48, 63, 401
Diakides, N. 387
Dickinson, S. 238, 323
Dinstein, I. 214
Doerschuk, P. 138, 143, 297
Dori, D. 387
du Buf, J. 62
Dubin, S. 259
Dubuisson-Jolly, M. 262, 323
Duda, R. 104, 326
Duncan, J. 155, 180
Dunn, D. 63
Dunn, S. 142, 256
Dyer, C. 297
Earnshaw, A. 251, 257
Eggbert, D. 212
Ehrich, E. 213
Ehricke, H. 139
Elder, J. 120, 140, 273
Elliot, H. 139, 140
Elschlager, R. 323
Ennesser, F. 387
Esquenzai, S. 388
Facchinetti, C. 390
Fairchild, S. 411
Falavigna, D. 386
Fan, C. 250, 387
Faugeras, P. 322
Fennema, C. 212
Fermuller, C. 381
Ferreira, A. 257
Ferrie, F. 262
Fiddelaers, P. 85, 106, 259
Gupta, A. 257
Gurelli, M. 63
Gurevich, G. 322
Delingette, H. 321
Hadamard, J. 140
Haddad, Z. 387, 402, 412
Hammersley, J. 140
Han, S. 213, 237
Han, Y. 142
Hanayama, M. 389
Hand, D. 368, 402
Hansen, F. 140
Hanson, A. 215, 391
Hara, Y. 414
Haralick, R. 63, 97, 388
Hart, P. 326
Hartley, R. 257
Havaldar, P. 257
Hayati, S. 387, 388
Hayes, H. 413
Haykin, S. 323
Healey, G. 63, 214, 215, 257, 389
Hebert, M. 321, 323
Heeger, D. 258
Heisterkamp, D. 323
Helman, D. 258
Hel-Or, Y. 63
Hendriks, E. 214
Hensel, E. 140
Herrington, D. 106
Higgins, W. 63, 173, 179
Hildreth, E. 106
Hingorani, S. 387
Hirakawa, Y. 416
Hiriyannaiah, H. 140
Hirzinger, G. 256
Hoff, L. 416
Hofmann, T. 289
Hoover, A. 205, 212, 243, 258
Hopfield, J. 140, 316, 323
Horn, B. 258
Horn, BKP 244
Hou, J. 388
Hsu, K. 230, 257
Hu, M. 230, 258
Hu, W. 214
Hu, X. 258
Huang, C. 154, 179
Huang, T. 256, 324
Huang, Y. 64, 387, 413
Hubel, D. 99, 105
Huffman, D. 273
Hummel, R. 274
Hussain, B. 323
Hutchinson, S. 213
Keshava, N. 414
Keyserlingk, J. 388
Khotanzad, A. 62
Kim, J. 324
Kimia, B. 214
Kimmel, R. 201, 258
Kindermann, R. 141
Kirkpatrick, D. 178
Kirkpatrick, S. 141
Kirsch, R. 381
Kiryati, N. 256, 289
Kisworo, M. 105
Kittler, J. 270, 273, 288
Klarquist, W. 142
Ko, M. 256
Koch, C. 212
Koenderink, J. 142, 213, 273, 296, 297
Kolmogorov, A. 138, 141
Kondepudy, R. 63, 257
Kong, A. 322
Kopec, G. 388
Koster, A. 215
Koster, K. 213
Kottke, D. 258
Kotturi, D. 286, 289
Kriegman, D. 215, 250, 261
Krig, D. 416
Kropatsch, W. 381
Kryzak, A. 215
Kube, P. 106
Kuc, R. 386
Kulikowski, C. 322, 323, 383
Kulkarni, S. 213
Kumar, A. 388, 413
Kumar, B. 324
Kumar, S. 213, 237
Kundu, A. 62, 381
Kuo, C. 61, 63
Kuruganti, P. 389
Kwoh, Y. 388
Kyuma, K. 323
Lagendijk, R. 214
Lai, K. 201, 213
Lakshmanan, S. 323, 324, 388
Lange, E. 323
Langley, K. 387
Lapreste, J. 257
Latombe, J. 390
Laurendeau, D. 260
Laurentini, A. 258
LaValle, S. 213
Lavallee, S. 258, 388
Lawson, W. 413
Le Roux, P. 388
Leahy, R. 141
Leavers, V. 289
Leclerc, Y. 324
Ledley, R. 377, 381
Lee, C. 182, 213
Lee, J. 325
Lee, K. 213, 388
Lee, R. 141
Lee, T. 106
Lemahieu, I. 259
Leung, D. 262
Leung, Y. 106
Levialdi, S. 173
Levin, K. 142
Levine, M. 204, 211, 214
Lew, M. 324
Li, S. 120, 141, 324
Li, Z. 261
Liang, P. 288
Liao, S. 259
Lin, C. 412
Lin, W. 323
Lindeberg, T. 106
Lindenbaum, M. 212, 324
Link, K. 142, 215
Lipson, H. 274
Lisowski, J. 416
Liu, F. 207, 213
Liu, I. 261
Liu, J. 210, 213
Liu, X. 196, 213, 368
Liu, Z. 389
Logenthiran, A. 142, 213, 214
Loizou, G. 389
Lopez-Krahe, J. 288
Lorey, R. 413
Lozano-Perez, T. 387
Lukacs, G. 388
Lutton, E. 288
Ma, S. 64
Madigan, D. 388
Maestre, F. 389
Mahalanobis, A. 413
Maintz, J. 52, 63
Maitre, N. 288
Malik, J. 106, 142, 297
Malik, R. 141
Malladi, R. 259
Mandelbrot, B. 61, 63
Manhaeghe, C. 259
Mann, R. 140
Marapan, S. 63
Marchand, E. 388
Marchette, D. 413, 416
Margalit, A. 324
Marr, D. 64, 106, 141
Marroquin, J. 141
Marshall, D. 388
Martin, R. 388
Martin, W. 251, 256
Matheny, A. 213, 236, 383
Matheron, G. 159
McLauchlan, P. 64
McLean, G. 286, 289
Medioni, G. 257, 387
Meer, P. 204, 212, 213
Meiri, Z. 212
Merlet, N. 388, 407, 414
Mersereau, R. 414
Messmer, B. 296
Metaxas, D. 238, 323
Michel, J. 259, 389
Milgram, D. 191
Miller, B. 416
Miller, T. 143
Mirelli, V. 414
Mirmirani, M. 387
Mirza, M. 211
Mitchell, O. 154, 179
Mitchell, T. 324
Mitter, S. 213
Miura, J. 389
Mohan, R. 256
Mokhtarian, F. 259
Molina, R. 389
Moons, T. 85, 106
Morano, R. 259
Moravec, H. 390
Moriyama, M. 389
Morris, P. 141
Morse, B. 259
Moussouris, J. 141
Mundkur, P. 141
Murakami, H. 324
Murase, H. 324
Murray, D. 64, 386
Nagy, G. 388
Naito, H. 389
Najarian, M. 416
Nakagawa, Y. 259
Nakajima, Y. 389
Nakano, Y. 414
Namazi, N. 250, 387
Nandhakumar, N. 64, 251, 256, 259, 387
Nartker, T. 388
Nasrabadi, N. 414
Natonek, E. 414
Navab, N. 64
Nayar, S. 259
Nelson, R. 106
Nelson, T. 141
Nevatia, R. 262
Ng, W. 182, 213
Nguyen, D. 414
Poggio, T. 64
Ponce, J. 215, 297
Porat, B. 212
Porat, M. 212
Pratt, W. 106, 165, 204
Priebe, C. 413
Princen, J. 288
Pritch, Y. 64
Qi, H. 142, 389
Qian, M. 259
Quan, L. 64, 260
Rainey, T. 416
Rajala, S. 64, 142, 214
Ranganath, H. 415
Rangarajan, A. 305, 323
Rao, K. 324
Ravichandran, G. 389
Ray, M. 321, 325
Reddi, S. 389
Reddy, R. 214
Reed, I. 416
Rece, M. 274
Reid, I. 386
Reinhardt, J. 173, 179
Reno, A. 415
Rice, S. 388
Richardson, T. 213
Richetin, M. 257
Ringach, D. 389
Riseman, E. 215, 415
Rissanen, J. 324
Rives, G. 257
Rivlin, E. 196, 214
Robmann, R. 415
Rocha, J. 389
Rogers, G. 413
Ronfard, R. 261
Rosen, P. 214
Rosenfeld, A. 107, 142, 274, 324
Rosin, P. 197, 203
Roth, G. 214
Rothe, I. 230, 260
Rotolo, L. 381
Rousso, B. 141
Sadjadi, F. 415
Saha, P. 180
Saito, J. 414
Saito, N. 288
Samson, C. 214
Samy, R. 415
Sandford, L. 215
Sandini, G. 260
Santago, P. 142, 213, 215, 390
Sapiro, G. 214
Sapounas, D. 416
Sarkar, N. 212
Sarkar, S. 105, 297, 390
Sato, Y. 212, 389
Satta, G. 381
Savage, C. 368
Schafer, R. 162
Schonfeld, D. 180
Schultz, H. 260
Schwartz, E. 260
Schwartz, J. 324
Schweitzer, H. 321, 325
Sclaroff, S. 312, 325
Sengupta, K. 325
Seppanen, T. 258
Sethian, J. 259, 260
Shah, M. 262
Shamir, J. 256
Shamos, M. 229, 260
Shapiro, L. 63, 313, 325
Sharir, M. 324
Sharma, R. 415
Shashua, A. 48, 64
Shaw, G. 414
Shemlon, S. 142
Shen, X. 325
Sheu, H. 214
Shi, H. 416
Shi, J. 104
Shimshoni, I. 297
Shpitalni, M. 274
Shum, H. 214
Siddiqi, K. 214
Sideman, S. 98
Simanca, S. 387, 402, 412
Sims, R. 415
Sinclair, D. 260
Sklansky, J. 283, 284, 289
Slaney, M. 141
Slater, D. 389
Smeulders, A. 180, 215
Smith, M. 414
Smith, P. 61, 64
Smith, S. 260
Snell, J. 141
Snyder, W. 64, 106, 140, 142, 143, 204, 211, 213,
214, 256, 260, 322, 368, 387, 408
Soatto, S. 251, 260
Sogo, T. 385, 389
Sohn, K. 324
Sohn, W. 389
Solina, F. 62
Solka, J. 413, 416
Solomon, F. 389
Soroka, B. 260
Soucy, M. 260
Soukoulis, C. 142
Soundararajan, P. 297
Spann, M. 213
Sriram, R. 214
Srivastava, A. 415
Stentz, A. 390
Stewart, C. 204, 215
Stiny, G. 390
Stone, J. 249, 260
Storvik, G. 261
Stotts, L. 416
Stuezle, W. 388
Stutz, L. 416
Subbarao, M. 261
Subotic, N. 415
Subrahmonia, J. 63
Suen, C. 174, 180, 387, 413
Suen, P. 215
Suesse, H. 261
Sull, S. 261
Sullivan, S. 215
Sun, Y. 258, 261
Super, B. 142, 261
Susse, H. 260
Svalbe, I. 179
Swain, M. 390, 406, 416
Swets, D. 325
Syeda-Mahmood, T. 390
Szeliski, R. 142, 258, 388
Tagare, H. 98, 155, 180
Tagari, H. 383
Tagliasco, V. 260
Takeda, H. 390
Tam, P. 262
Tamura, S. 212, 389
Tanaka, H. 416
Tang, I. 260
Tank, D. 141
Tannenbaum, A. 214
Tarabanis, K. 261
Taratorin, A. 98
Tarroux, P. 211
Tatsuhiro, Y. 180
Taubes, C. 288
Taubin, G. 215, 261
Taylor, C. 250, 261
ter Haar Romeny, B. 134, 142
Thoet, W. 416
Thomas, B. 387
Thomason, M. 377, 381
Thomasson, B. 415
Thomopoulos, S. 412
Thorpe, C. 390
Thrift, P. 388, 413
Thurber, K. 215
Tichem, M. 390
Tistarelli, M. 322, 323, 386
Titterington, D. 140
Tomasi, C. 261
Tong, F. 261
Toriwaki, J. 180
Torp, A. 379, 381
Torr, P. 142
Trier, O. 184, 215
Trika, S. 390
Trivedi, M. 63, 407, 416
Trucco, E. 215
Tsai, C. 256
Tsai, H. 104
Tsai, P. 262
Tsai, R. 261
Tsao, Y. 176, 180
Tsap, L. 390
Tsitsiklis, J. 213
Tsujimoto, E. 324
Turk, M. 325
Ubeda, S. 257
Uchiyama, T. 215
Ueltschi, A. 416
Uhurpa, K. 180
van den Boomgaard, R. 180
Van den Bout, D. 143
van den Elsen, P. 63
van den Elsen, P. 52
van der Heijden, F. 106
van der Hengel, A. 105
Van Doorn, A. 212, 296, 297
Van Gool, L. 85, 106, 259
Van Horn, M. 106
van Laarhoven, J. 142
Van Ness, J. 63
van Vliet, L. 106
Vanek, P. 257
Vannier, M. 142
vanVliet, L. 98
Vecchi, M. 141
Velten, V. 259, 389
Vemuri, B. 259, 261
Venetsanopoulos, A. 179
Venkatesh, S. 105
Verbeek, P. 98, 106
Viergever, M. 52, 63, 142, 215
Vincken, K. 215
Vlontzos, J. 257
Vogelaers, D. 259
Vos, F. 155, 180
Voss, K. 260, 261
Wakahara, T. 390
Wakeley, J. 63
Wald, J. 416
Wallace, R. 390
Wallet, B. 416
Zhou, J. 381
Zhu, G. 416
Zhu, P. 196, 215
Zhu, S. 215, 322
Zhu, Y. 416
Zhuang, X. 64
Zietz, S. 259
Zisserman, A. 129, 139
Zucker, S. 98, 140, 273
Zucker, W. 105
Index
3D analysis 261
4-connected 53
8-connected 53
accepting state 373
accumulator arrays 278
active contours 98, 322
active testing 387, 412
active vision 385
affine 196
affine transformation 218
aircraft 407
albedo 244
algebraic distance 202, 203, 237
algebraic invariants 389
algorithm fusion 405
algorithms, clustering 359
algorithms, performance of 395
angle density 141
anisotropic diffusion 133
annealing 121, 127, 133, 206
annealing schedule 122
annealing, tree 184
anti-extensive property 149, 151
apparent error rate 398
arc length 196
array processor 258
aspect 385
aspect equivalent 296
aspect graphs 295
aspect ratio 226, 304
assembly 384
atoms 291
ATR 367, 392
ATR, performance of 395
attention, focus of 387
aura 53
autonomous exploration 385
axes, principal 235
axis of symmetry 226
BAD 137
bar code 105
basis 11, 59, 152
basis functions 75
basis set 220
basis vectors 75, 220
ego-motion 141
eigenimage 300
eigenvalues 15, 349
eigenvector 15, 222
ejection fraction 383
electrocardiogram 373
electron-hole 43
ellipse, direct fitting 203
ellipse, finding 285, 288
ellipses, fitting 203
ellipses, least squares fitting 212
ellipsoid 202, 235
ellipsoids, fitting 203
emissivity 400
epipolar line 48
equivalence memory 190
equivalence relation 190
erosion 146
error rate 398
Euclidean distance 219, 304, 356
Euclidean distance transform 154
events 8
explicit 202
explicit representation 39, 71
extensive property 148
external energy 198
face recognition 346, 384
facet model 97
factor 159
false alarm 396
false alarm rate 393
false negative 396
false positive 396
false positive fraction 397
fast Fourier transform 75, 84
feature 304
feature selection 324
feature vector 304
FIDAC 381
filters 41
finite element models 390
finite state machine 22, 370, 373
Fisher's linear discriminant 348
fitting 69, 73
fitting surfaces 202
fitting, straight lines 223
fitting, subpixel precision of 225
FLIR 394
focus, of a parabola 284
focused filters 407
formal language 369
Fourier descriptors 231, 262
Fourier transform 41, 83, 207
fractal dimension 207, 212
fractal dimension for gray-scale 209
frame time 43
frequency response 41
FSM 23
function fitting 69
functional 39
furthest neighbor distance 358
fusion 405
Gabor function 90
Gauss curvature 61
Gauss map 286
Gaussian 78, 88, 279
Gaussian, mean of 333
Gaussian, multivariate 335, 358
generalized Hough transform 282
genetic algorithm 214, 403
genetic learning 388, 413
geometric flow 234
geometry, curve 234
geometry-based 404
GNC 129
gradient 13, 97, 279
gradient descent 16, 18, 127, 206, 270
gradient magnitude 78
graduated nonconvexity 116, 129
grammar 370
graph, directed 290
graphs, aspect 238
gray-scale 38
gray-scale morphology 152, 179
Green's function 132, 133
grid 67
ground plane 386
ground truth 396
Haar transform 212
handwritten word recognition 380
harmonic 236
Hausdorff distance 410
heart 383
heat equation 132
Hessian 14, 135
hexagonal 57
hexagonal derivative 71
hidden Markov model 22, 62, 381
histogram 184, 330
histogram intersection 406
homogeneity predicate 181
homogeneous 181
homogeneous transformation 217
homotopy 116
Hopfield neural network 137
horizontal blanking 45
horizontal resolution 46
Hough transform 275, 288, 356, 407
Hough transform, generalized 282
hull, visual 258
measurement system 49
medial axis 53, 232
median filter 151
Mercer's conditions 353
metric 219, 358
metric function 200
microscopy, epi-fluorescence 384
minimax rule 342
minimization 15
minimum description length 321
minimum spanning tree 360
minimum-squared-error 239
MMSE 204
model 304
model-based 405
modeling edges 105
molecules, tubular 384
moments 61, 224, 230
moments, central 229
morphological filtering 179
morphological sampling 162
morphological shape decomposition 173
morphological skeleton 171
morphology 144, 408, 409
motion 322, 386
motion analysis 115, 250
motion segmentation 210, 250
MRI 39, 383
MSE estimate 69
multiple light sources 384
multisensor fusion 405
multispectral 382
multispectral ATR 406
multispectral matching 406
multivariate Gaussian 79
multivariate pdf 332
multiview range data 211
MWIR 395
nearest feature line 324
nearest neighbor measure 358
neighborhood 53
neighboring state 24
neural network 137, 403
neuron 137
Newton-Raphson 17
Neyman-Pearson 401
nodes 290, 291, 305
noise 49
noise estimation 63, 257
noise sensitivity 68
noise, counting 204
noise, Gaussian 204
noise, non-Gaussian 204
noise, Poisson 204
nonlinear relaxation 267
nonmaximum suppression 97
power spectrum 61
predicate 306
prime factor 159
principal axis 219, 227
principal component 214, 220
principal component analysis 301
principal curvatures 60
prior probabilities 396
probabilistic 40
probability distribution 9
probability of error 340
probability, prior 329
probability, transition 22
production 370
projection 12, 75, 76, 220
projective invariants 322
pseudo-inverse 239
puns, bad 383
pushdown stack 186
pyramid 85, 402
quad tree 86
quadratic 202
quadratic classifier 340
quadratic form 13
quadratic variation 83
quadric 39, 202, 234
quality, segmentation 205
quantization 46
quantization error 47
quantized 42
radar 386
radial basis function 353
radiometric 63, 257
radius of a convolution kernel 71
ramp edge 76
range 39, 47
range image segmentation 211, 215
range images 201, 267
range images, merging of 204
range images, registration of 204
raster scanning 43
receiver operating characteristic 397
recognition 393
recognizer 371
recurrent network 137
recursive region growing 186
reflection 226
reflection of a set 146
reflectivity 245
region adjacency graph 292
region alignment 141
region growing 182, 322, 402
registration 204
regular expression 376
relabeling 167
relational representations 42
relaxation 107
relaxation labeling 266, 305, 313, 321
restoration 108
reverse engineering 384
ridge seeking 63
ridges 52
roads, finding in images 386
ROC 397
roof edge 76
rotation 217, 218
rotation, correction for 218
salient 196
salient group 210
sample mean 333, 334
sample variance 334
sampled 42
sampling 45
sampling grid 161
sampling theorem 46, 161
sampling time 46
SAR 408
satellite images 407
scale space 85
scale space causality 87
scaling 358
scanning 43, 192
scatter matrix 70
scene graph 293
secondary electrons 247
segmentation 115, 120, 181, 322, 402
segmentation of surfaces 201
segmentation, based on texture 207
segmentation, psychology of 211
segmentation, quality of 204
segmentation, range 388
self-organizing 368
sensitivity 397
sensor fusion 115, 405
separability 350
set of measure zero 119
shape from 243
shape from focus 249
shape from shading 244, 322
shape from texture 249
sharpness 41
shear 218
sigmoid 137
signal-to-clutter 402
signature 400
silhouettes 243, 259, 407
similarity transformations 217
simple points 180
simplicial models 261
simply connected 159
simulated annealing 19, 115, 122, 198