Alignment by Maximization of Mutual Information
PAUL VIOLA
Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 545 Technology Square,
Cambridge, MA 02139
[email protected]
Abstract. A new information-theoretic approach is presented for finding the pose of an object in an image. The
technique does not require information about the surface properties of the object, besides its shape, and is robust
with respect to variations of illumination. In our derivation few assumptions are made about the nature of the
imaging process. As a result the algorithms are quite general and may foreseeably be used in a wide variety of
imaging situations.
Experiments are presented that demonstrate the approach by registering magnetic resonance (MR) images, aligning a complex 3D object model to real scenes including clutter and occlusion, tracking a human head in a video sequence, and aligning a view-based 2D object model to real images.
The method is based on a formulation of the mutual information between the model and the image. As applied
here the technique is intensity-based, rather than feature-based. It works well in domains where edge or gradient-
magnitude based methods have difficulty, yet it is more robust than traditional correlation. Additionally, it has an
efficient implementation that is based on stochastic approximation.
formulate an imaging equation,

v(T(x)) = F(u(x), q) + η    (1)

or equivalently,

v(y) = F(u(T⁻¹(y)), q) + η.    (2)

The imaging equation has three distinct components. The first component is called a transformation, or pose, denoted T. It relates the coordinate frame of the model to the coordinate frame of the image. The transformation tells us which point in the model is responsible for a particular point in the image. The second component is the imaging function, F(u(x), q). The imaging function determines the value of image point v(T(x)). In general a pixel's value may be a function both of the model and of other exogenous factors. For example, an image of a three-dimensional object depends not only on the object but also on the lighting. The parameter q collects all of the exogenous influences into a single vector. Finally, η is a random variable that models noise in the imaging process. In most cases the noise is assumed Gaussian.

Alignment can be a difficult problem for a number of reasons:

• F, the imaging function of the physical world, can be difficult to model.
• q, the exogenous parameters, are not necessarily known and can be difficult to find. For example, computing the lighting in an image is a non-trivial problem.
• T, the space of transformations, which may have many dimensions, is difficult to search. Rigid objects often have a six-dimensional transformation space. Non-rigid objects can in principle have an unbounded number of pose parameters.

One reason that it is, in principle, possible to define F is that the image does convey information about the model. Clearly, if there were no mutual information between u and v, there could be no meaningful F. We propose to finesse the problem of finding and computing F and q by dealing with this mutual information directly. We will present an algorithm that aligns by maximizing the mutual information between model and image. It requires no a priori model of the relationship between surface properties and scene intensities; it only assumes that the model tells more about the scene when it is correctly aligned.

Though the abstract suggestion that mutual information plays a role in object recognition may not be new, to date no concrete representations or efficient algorithms have been proposed. This paper will present a new approach for evaluating entropy and mutual information called EMMA¹. It is distinguished in two ways: 1) EMMA does not require a prior model for the functional form of the distribution of the data; 2) entropy can be maximized (or minimized) efficiently using stochastic approximation.

In its full generality, EMMA can be used whenever there is a need to align images from two different sensors, the so-called "sensor fusion" problem. For example, in medical imaging, data from one type of sensor (such as magnetic resonance imaging, MRI) must be aligned to data from another sensor (such as computed tomography, CT).

2. An Alignment Example

One of the alignment problems that we will address involves finding the pose of a three-dimensional object that appears in a video image. This problem involves comparing two very different kinds of representations: a three-dimensional model of the shape of the object and a video image of that object. For example, Fig. 1 contains a video image of an example object on the left and a depth map of that same object on the right (the object in question is a person's head: RK). A depth map is an image that displays the depth from the camera to every visible point on the object model.

From the depth map alone it might be difficult to see that the image and the model are aligned. For a human observer, the task can be made much easier by simulating the imaging process and rendering an image from the 3D model. Figure 2 contains two renderings of the object model. These synthetic images are constructed assuming that the 3D model has a Lambertian surface and that the model is illuminated from the right. It is almost immediately obvious that the model on the left of the figure is more closely aligned to the video image than the model on the right. Unfortunately, what might seem like a trivial determination is difficult to reproduce with a computer. The task is made difficult because the intensities of the true video image and the synthetic images are quite different. In fact, the pixels of the real image and the correct model image are uncorrelated. Somehow, the human visual system is capable of ignoring the superficial differences that arise from changes in illumination and surface properties.
International Journal of Computer Vision, Kl470-04-Viola, July 18, 1997
Figure 1. Two different views of RK. On the left is a video image. On the right is a depth map of a model of RK that describes the distance to
each of the visible points of the model. Closer points are rendered brighter than more distant ones.
Figure 2. At left is a rendering of a 3D model of RK. The position of the model is the same as the position of the actual head. At right is a
rendering of the head model in an incorrect pose.
A successful computational theory of object recognition must be similarly robust.

Lambert's law is perhaps the simplest model of surface reflectivity. It is an accurate model of the reflectance of a matte or non-shiny surface. Lambert's law states that the visible intensity of a surface patch is related to the dot product between the surface normal and the lighting. For a Lambertian object the imaging equation is:

v(T(x)) = Σ_i α_i l_i · u(x),    (3)

where the model value u(x) is the normal vector of a surface patch on the object, l_i is a vector pointing toward light source i, and α_i is proportional to the intensity of that light source ((Horn, 1986) contains an excellent review of imaging and its relationship to vision). As the illumination changes, the functional relationship between the model and image will change.

Since we can not know beforehand what the imaging function will be, aligning a model and image can be quite difficult. These difficulties are only compounded if the surface properties of the object are not well understood. For example, many objects can not be modeled as having a Lambertian surface. Different surface finishes will have different reflectance functions. In general reflectance is a function of lighting direction, surface normal and viewing direction. The intensity of an observed patch is then:

v(T(x)) = Σ_i R(α_i, l_i, o, u(x)),    (4)

where o is a vector pointing toward the observer from the patch and R(·) is the reflectance function of the surface. For an unknown material a great deal of experimentation is necessary to completely categorize the reflectance function. Since a general vision system should work with a variety of objects and under general illumination conditions, overly constraining assumptions about reflectance or illumination should be avoided.
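As a small numerical illustration of Eq. (3), the sketch below evaluates v(T(x)) = Σ_i α_i l_i · u(x) for a few hypothetical surface normals and light sources (all numbers are invented for illustration; clamping negative dot products to zero is a common rendering convention for self-shadowed patches, not part of Eq. (3)):

```python
import numpy as np

# Hypothetical unit normals u(x) for three surface patches (one per row).
u = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.6, 0.8]])

# Two distant light sources: unit directions l_i with intensities alpha_i.
l = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
alpha = np.array([0.9, 0.4])

# Eq. (3): v(T(x)) = sum_i alpha_i * (l_i . u(x)).
# Clamping ignores patches facing away from a light (a rendering convention).
v = np.clip(u @ l.T, 0.0, None) @ alpha
```

Changing α_i or the light directions changes every intensity at once, which is exactly the illumination dependence that makes a single fixed imaging function unusable for alignment.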
Figure 3. On the left is a video image of RK with the single scan-line highlighted. On the right is a graph of the intensities observed along
this scan line.
Let us examine the relationship between a real image and model. This will allow us to build intuition both about alignment and image formation. Data from the real reflectance function can be obtained by aligning a model to a real image. An alignment associates points from the image with points from the model. If the alignment is correct, each pixel of the image can be interpreted as a sample of the imaging function R(·). The imaging function could be displayed by plotting intensity against lighting direction, viewing direction and surface normal. Unfortunately, because intensity is a function of so many different parameters the resulting plot can be prohibitively complex and difficult to visualize. Significant simplification will be necessary if we are to visualize any structure in this data.

In a wide variety of real images we can assume that the light sources are far from the object (at least in terms of the dimensions of the object). When this is true and there are no shadows, each patch of the object will be illuminated in the same way. Furthermore, we will assume that the observer is far from the object, and that the viewing direction is therefore constant throughout the image. The resulting relationship between normal and intensity is three dimensional. The normal vector has unit length and, for visible patches, is determined by two parameters: the x and y components. The image intensity is a third parameter. A three dimensional scatter plot of normal versus intensity is really a slice through the high dimensional space in which R(·) is defined. Though this graph is much simpler than the original, three dimensional plots are still quite difficult to interpret. We will slice the data once again so that all of the points have a single value for the y component of the normal.

Figure 3 contains a graph of the intensities along a single scan-line of the image of RK. Figure 4 shows similar data for the correctly aligned model of RK. Model normals from this scan-line are displayed in two graphs: the first shows the x component of the normal while the second shows the y component. Notice that we have chosen this portion of the model so that the y component of the normal is almost constant. As a result the relationship between normal and intensity can be visualized in only two dimensions. Figure 5 shows the intensities in the image plotted against the x component of the normal in the model. Notice that this relationship appears both consistent and functional. Points from the model with similar surface normals have very similar intensities. The data in this graph could be well approximated by a smooth curve. We will call an imaging function like this one consistent. Interestingly, we did not need any information about the illumination or surface properties of the object to determine that there is a consistent relationship between model normal and image intensity.

Figure 6 shows the relationship between normal and intensity when the model and image are no longer aligned. The only difference between this graph and the first is that the intensities come from a scan-line 3 centimeters below the correct alignment (i.e., the model is no longer aligned with the image, it is 3 centimeters too low). The normals used are the same. The resulting graph is no longer consistent. It does not look as though a simple smooth curve would fit this data well. In summary, when model and image are aligned there will be a consistent relationship between image intensity and model normal. This is predicted by our assumption that there is an imaging function that relates
Figure 4. On the left is a depth map of RK with the single scan-line highlighted. At top right is a graph of the x component of the surface
normal. On the bottom right is the y component of the normal.
Figure 6. The misaligned case: on the left is the misaligned scan-line from the video image of RK. On the right is a scatter plot of the intensity
of this part of the video image versus the x component of the surface normal from the model.
the image most likely,

p(v | u, T) ≈ max_{F,q} Π_{x_a} p(η = v(T(x_a)) − F(u(x_a), q)) p(F) p(q).    (6)

This approximation is accurate whenever the integral in Eq. (5) is approximated by the component of the integrand that is maximal. The approximation is a good one when a particular F and q are much more likely than any other.

Using (6) we can define an alignment procedure as a nested search: i) given an estimate for the transformation, find F and q that make the image most likely; ii) given estimates for F and q, find a new transformation that makes the image most likely. Terminate when the transformation has stabilized. In other words, a transformation associates points from the model with points in the image; for every u(x) there is a corresponding v(T(x)). A function F and parameter vector q are sought that best model the relationship between u(x) and v(T(x)). This can be accomplished by "training" a function to fit the collection of pairs {v(T(x_a)), u(x_a)}.

The search for F is not a simple process. The range of possible imaging functions is of course infinite. In order to condition the search it is necessary to make a set of assumptions about the form of F. In addition some assumptions about the smoothness of F are necessary to insure convergence of the nested search for the maximum of (6). These assumptions can be enforced by formulating a strong prior probability over the space of functions, p(F).

In many cases the search for an imaging function and exogenous parameters can be combined. For any particular F and q, another function F_q(u(x)) = F(u(x), q) can be defined. The combined function is best thought of as a reflectance map (Horn, 1986). It maps the normals of an object directly into intensities. The three dimensional alignment procedure we will describe manipulates a similar combined function.

How might Eq. (6) be approximated efficiently? It seems reasonable to assume that for most real imaging functions similar inputs should yield similar outputs. In other words, the unknown imaging function is continuous and piecewise smooth. An efficient scheme for alignment could skip the step of approximating the imaging function and attempt to directly evaluate the consistency of a transformation. A transformation is considered consistent if points that have similar values in the model project to similar values in the image. By similar we do not mean similar in physical location, as in |x_a − x_b|, but similar in value, |u(x_a) − u(x_b)| and |v(T(x_a)) − v(T(x_b))|. One ad-hoc technique for estimating consistency is to pick a similarity constant ψ and evaluate the following sum:

Consistency(T) = −Σ_{x_a ≠ x_b} g_ψ(u(x_b) − u(x_a)) (v(T(x_b)) − v(T(x_a)))²,    (7)

where g_ψ is a Gaussian with standard deviation ψ, and the sum is over points from the model, x_a and x_b. In order to minimize this measure, points that are close together in value must be more consistent, and those further apart less so.

An important drawback of consistency is that it is maximized by constancy. The most consistent transformation projects the points of the model onto a constant region of the image. For example, if scale is one of the transformation parameters, one entirely consistent transformation projects all of the points of the model down to a single point of the image.
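The consistency measure of Eq. (7) is simple to compute directly. The sketch below uses hypothetical scan-line values (none of these numbers come from the paper) and also exhibits the constancy drawback just described:

```python
import numpy as np

def consistency(u_vals, v_vals, psi=0.1):
    """Eq. (7): negative Gaussian-weighted sum of squared image-value
    differences over all pairs of model points x_a != x_b."""
    du = u_vals[:, None] - u_vals[None, :]      # u(x_b) - u(x_a)
    dv = v_vals[:, None] - v_vals[None, :]      # v(T(x_b)) - v(T(x_a))
    g = np.exp(-0.5 * (du / psi) ** 2)          # g_psi, Gaussian with std psi
    mask = ~np.eye(len(u_vals), dtype=bool)     # exclude the x_a == x_b pairs
    return -np.sum(g[mask] * dv[mask] ** 2)

u_vals = np.linspace(0.0, 1.0, 20)              # model values along a scan-line
aligned = 0.8 * u_vals + 0.1                    # image functionally related to the model
scrambled = np.random.default_rng(1).permutation(aligned)
constant = np.full(20, 0.5)                     # the degenerate "constant image"
```

With these signals, consistency(u_vals, aligned) exceeds consistency(u_vals, scrambled), while the constant image attains the maximum possible value of zero: the constancy drawback noted above.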
Figure 7. The joint distribution of data from the aligned and misaligned case above (left: aligned, right: misaligned). The weighted neighbor function approximation is shown as a thin black line.
We now have two alternatives for alignment when the imaging function is unknown: a theoretical technique that may be intractable, and an outwardly efficient ad-hoc technique that has a number of important difficulties. One would like to find a technique that combines the best features from each approach. We propose that the complex search for the most likely imaging function, F_q, be replaced with a simpler search for the most consistent imaging function.

One type of function approximator that maximizes consistency is known as kernel regression or the weighted neighbor approximator:

F*(u) = Σ_{x_a} R(u − u(x_a)) v(T(x_a)) / Σ_{x_a} R(u − u(x_a)).    (8)

The weighting function R usually has a maximum at zero, and falls off asymptotically away from zero. F* can be used to estimate the likelihood of a transformation as we did in (6). This formulation can be much more efficient than a naive implementation of (6) since there is no need to search for F_q. The model, image, and transformation define F* directly.

Figure 7 shows the weighted neighbor approximation to the data from the RK alignments (in these graphs R is the Gaussian density function with variance 0.0003). Notice F* fits the aligned model much better than the misaligned model. Assuming that the noise is Gaussian, the log likelihood of the aligned model, 1079.49, is much larger than the log likelihood of the misaligned model, 537.34.²

4. From Likelihood to Entropy

The "classical" derivation of weighted neighbor likelihood provided a context in which insights could be developed and concrete representations described. Though weighted neighbor likelihood is a powerful technique, it has three significant drawbacks (see (Viola, 1995) for a more detailed discussion).

Firstly, it will only work when the image is a function of the model. Though this was assumed at the outset, in several important applications the image data may not be a function of the model. This is frequently the case in medical registration applications. For example, a CT scan is neither a function of an MR scan, nor is an MR scan a function of a CT scan. The second drawback of weighted neighbor log likelihood is that it can be susceptible to outliers. If one assumes, as is typical, that the image is conditionally Gaussian, occlusion and specularity can ruin an otherwise good match between model and image.³ The third drawback arises from weighted neighbor likelihood's affinity for constant solutions.

Rather than require that the image be a function of the model, one natural generalization is to require that the image be predictable from the model. Predictability is closely related to the concept of entropy. A predictable random variable has low entropy, while an unpredictable random variable has high entropy. By moving to a formulation of alignment that is based on entropy, many of the drawbacks of weighted neighbor likelihood can be circumvented.

The entropy of a random variable is defined as

h(y) ≡ −∫ p(y) ln p(y) dy.    (9)

The joint entropy of two random variables z and y is

h(z, y) ≡ −∫ p(z, y) ln p(z, y) dz dy.    (10)

Log likelihood and entropy are closely related (see (Cover and Thomas, 1991) for an excellent review of entropy and its relation to statistics). It can be shown that under certain conditions the conditional log
likelihood of the image given the model is a multiple of the conditional entropy of the image given the model:

log p(v(T(x)) | u(x), T) = −N h(v(T(x)) | u(x), T),    (11)

where N is the number of model points.⁴ This is true only when the conditional distribution of v is the same as the assumed distribution for the noise, η. So if the noise is assumed Gaussian, equality holds when the conditional distribution of v is Gaussian. Note that while this is a restrictive assumption, it does not require either that the distribution of v be Gaussian, or that the joint distribution of v and u be Gaussian.

Both the constraint that v be a function of u and the constraint on the conditional distribution of v can be relaxed by estimating the conditional entropy directly:

h(v(T(x)) | u(x)) ≡ h(u(x), v(T(x))) − h(u(x)).    (12)

In the next section we will present an efficiently optimizable measure of entropy (EMMA) that can be used for this purpose. Nowhere in the derivation of EMMA will it be necessary to assume that v is a function of u. In addition, even in situations where v is a function of u, EMMA will frequently work better than weighted neighbor likelihood. While weighted neighbor likelihood requires restrictive assumptions about p(η), EMMA can be used with a wide variety of densities. This makes EMMA more robust to non-Gaussian errors.

In addition, the move from likelihood to entropy presents a principled mechanism for avoiding constant solutions. Conditional entropy, though it is more general than weighted neighbor likelihood, is still closely related to the consistency measure defined in Eq. (7). Like consistency, conditional entropy will accept a constant solution as optimal. Conditional entropy confounds two distinct situations: conditional entropy will be low when the image is predictable from the model, but it will also be low if the image by itself is predictable. Rather than conditional entropy we will estimate the mutual information between the model and the image:

I(u(x), v(T(x))) ≡ h(u(x)) + h(v(T(x))) − h(u(x), v(T(x))).    (13)

The mutual information defined in Eq. (13) has three components. The first term is the entropy in the model, and is not a function of T. The second term is the entropy of the part of the image into which the model projects. It encourages transformations that project u into complex parts of v. The third term, the (negative) joint entropy of u and v, contributes when u and v are functionally related. It encourages transformations where u explains v well. Together the last two terms identify transformations that find complexity and explain it well.

Why are weighted neighbor likelihood and conditional entropy related? Weighted neighbor likelihood measures the quality of the weighted neighbor function approximation. In the graph on the left of Fig. 7 the points of the sample lie near the weighted neighbor function approximation. In addition, the joint distribution of samples is tightly packed together. Points are not distributed throughout the space, but lie instead in a small part of the joint space. This is the hallmark of a low entropy distribution. In the graph on the right of Fig. 7 the weighted neighbor function approximation is a poor fit to the data and the data is more spread out. In general, aligned signals have low joint entropy and misaligned signals have high joint entropy.

5. EMMA Alignment

We seek an estimate of the transformation T̂ that aligns the model u and image v by maximizing their mutual information over the transformations T,

T̂ = arg max_T I(u(x), v(T(x))).    (14)

Here we treat x as a random variable over coordinate locations in the model. In the alignment algorithm described below, we will draw samples from x in order to approximate I and its derivatives.

5.1. EMMA and its Derivatives

The entropies described above are defined in terms of integrals over the probability densities associated with the random variables u and v. When analyzing signals or images we will not have direct access to the densities. In this section we describe a differentiable estimate of the entropy of a random variable that is calculated from samples.

The entropy of a random variable z may be expressed as an expectation of the negative logarithm of the probability density:

h(z) = −E_z(ln p(z)).
Our first step in estimating the entropies from samples is to approximate the underlying probability density p(z) by a superposition of Gaussian densities centered on the elements of a sample A drawn from z:

p(z) ≈ (1/N_A) Σ_{z_j∈A} G_ψ(z − z_j),

where

G_ψ(z) ≡ (2π)^(−n/2) |ψ|^(−1/2) exp(−(1/2) zᵀ ψ⁻¹ z).

This method of density estimation is widely known as the Parzen Window method. It is described in the textbook by Duda and Hart (1973). Use of the Gaussian density in the Parzen density estimate will simplify some of our subsequent analysis, but it is not necessary. Any differentiable function could be used. Another good choice is the Cauchy density.

Next we approximate statistical expectation with the sample average over another sample B drawn from z:

E_z(f(z)) ≈ (1/N_B) Σ_{z_i∈B} f(z_i).

We may now write an approximation for the entropy of a random variable z as follows,

h(z) ≈ −(1/N_B) Σ_{z_i∈B} ln (1/N_A) Σ_{z_j∈A} G_ψ(z_i − z_j).    (15)

The density of z may be a function of a set of parameters, T. In order to find maxima of mutual information, we calculate the derivative of entropy with respect to T. After some manipulation, this may be written compactly as follows,

(d/dT) h(z) ≈ (1/N_B) Σ_{z_i∈B} Σ_{z_j∈A} W_z(z_i, z_j) (z_i − z_j)ᵀ ψ⁻¹ (d/dT)(z_i − z_j),    (16)

where the weighting factor W_z(z_i, z_j) ≡ G_ψ(z_i − z_j) / Σ_{z_k∈A} G_ψ(z_i − z_k) takes on values between zero and one. It approaches one if z_j is significantly closer to z_i than any other element of A. It will be near zero if some other element of A is significantly closer to z_i. Distance is interpreted with respect to the squared Mahalanobis distance (see (Duda and Hart, 1973))

D_ψ(z) ≡ zᵀ ψ⁻¹ z.

Thus, W_z(z_i, z_j) is an indicator of the degree of match between its arguments, in a "soft" sense. It is equivalent to using the "softmax" function of neural networks (Bridle, 1989) on the negative of the Mahalanobis distance to indicate correspondence between z_i and elements of A.

The summand in Eq. (16) may also be written as:

W_z(z_i, z_j) (d/dT) (1/2) D_ψ(z_i − z_j).

In this form it is apparent that to reduce entropy, the transformation T should be adjusted such that there is a reduction in the average squared distance between those values which W indicates are nearby, i.e., clusters should be tightened.

5.2. Stochastic Maximization of Mutual Information

The entropy approximation described in Eq. (15) may now be used to evaluate the mutual information of the model and image (Eq. (13)). In order to seek a maximum of the mutual information, we will calculate an approximation to its derivative,

(d/dT) I(u(x), v(T(x))) = (d/dT) h(v(T(x))) − (d/dT) h(u(x), v(T(x)))
≈ (1/N_B) Σ_{x_i∈B} Σ_{x_j∈A} [W_v(v_i, v_j) (d/dT)(1/2) D_ψv(v_i − v_j) − W_uv(w_i, w_j) (d/dT)(1/2) D_ψuv(w_i − w_j)].    (17)

The weighting factors are defined as

W_v(v_i, v_j) ≡ G_ψv(v_i − v_j) / Σ_{x_k∈A} G_ψv(v_i − v_k)

and

W_uv(w_i, w_j) ≡ G_ψuv(w_i − w_j) / Σ_{x_k∈A} G_ψuv(w_i − w_k),

using the following notation (and similarly for indices j and k),

u_i ≡ u(x_i), v_i ≡ v(T(x_i)), and w_i ≡ [u_i, v_i]ᵀ.

If we are to increase the mutual information, then the first term in the brackets (of Eq. (17)) may be interpreted as acting to increase the squared distance between pairs of samples that are nearby in image intensity, while the second term acts to decrease the squared distance between pairs of samples that are nearby in both image intensity and the model properties. It is important to emphasize that distances are in the space of values (intensities, brightness, or surface properties), rather than coordinate locations.

The term (d/dT)(v_i − v_j) will generally involve gradients of the image intensities and the derivative of transformed coordinates with respect to the transformation. In the simple case that T is a linear operator, the following outer product expression holds:

(d/dT) v(T(x_i)) = ∇v(T(x_i)) x_iᵀ.

5.2.1. Stochastic Maximization Algorithm. We seek a local maximum of mutual information by using a stochastic analog of gradient descent. Steps are repeatedly taken that are proportional to the approximation of the derivative of the mutual information with respect to the transformation:

Repeat:
    A ← {sample of size N_A drawn from x}
    B ← {sample of size N_B drawn from x}
    T ← T + λ (dÎ/dT)

The parameter λ is called the learning rate. The above procedure is repeated a fixed number of times or until convergence is detected.

A good estimate of the derivative of the mutual information could be obtained by exhaustively sampling the data. This approach has serious drawbacks because the algorithm's cost is quadratic in the sample size. For smaller sample sizes, less effort is expended, but additional noise is introduced into the derivative estimates. Stochastic approximation is a scheme that uses noisy derivative estimates instead of the true derivative for optimizing a function (see (Widrow and Hoff, 1960; Ljung and Söderström, 1983; Haykin, 1994)). Convergence can be proven for particular linear systems, provided that the derivative estimates are unbiased, and the learning rate is annealed (decreased over time). In practice, we have found that successful alignment may be obtained using relatively small sample sizes, for example N_A = N_B = 50. We have proven that the technique will always converge to a pose estimate that is close to locally optimal (Viola, 1995).

It has been observed that the noise introduced by the sampling can effectively penetrate small local minima. Such local minima are often characteristic of continuous alignment schemes, and we have found that local minima can be overcome in this manner in these applications as well. We believe that stochastic estimates for the gradient usefully combine efficiency with effective escape from local minima.

5.2.2. Estimating the Covariance. In addition to λ, the covariance matrices of the component densities in the approximation method of Section 5.1 are important parameters of the method. These parameters may be chosen so that they are optimal in the maximum likelihood sense with respect to samples drawn from the random variables. This approach is equivalent to minimizing the cross entropy of the estimated distribution with the true distribution (Cover and Thomas, 1991). For simplicity, we assume that the covariance matrices are diagonal.

The most likely covariance parameters can be estimated on-line using a scheme that is almost identical in form to the scheme for maximizing mutual information.

6. Experiments

In this section we demonstrate alignment by maximization of mutual information in a variety of domains. In all of the following experiments, bi-linear interpolation was used when needed for non-integral indexing into images.

6.1. MRI Alignment

Our first and simplest experiment involves finding the correct alignment of two MR images (see Fig. 8).
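Before the experiments, the two-sample entropy estimate of Eq. (15) at the core of EMMA can be sketched in the scalar case (hypothetical samples and an arbitrary ψ; an illustrative sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian(z, psi):
    # Scalar G_psi: Gaussian density with variance psi.
    return np.exp(-0.5 * z * z / psi) / np.sqrt(2.0 * np.pi * psi)

def emma_entropy(sample_a, sample_b, psi=0.1):
    """Eq. (15): h(z) ~ -(1/N_B) sum_{z_i in B} ln[(1/N_A) sum_{z_j in A} G_psi(z_i - z_j)]."""
    diffs = sample_b[:, None] - sample_a[None, :]
    parzen = gaussian(diffs, psi).mean(axis=1)   # Parzen density estimate at each z_i
    return -np.log(parzen).mean()

# Hypothetical data: a tightly clustered signal vs. a spread-out one.
narrow = rng.normal(0.0, 0.1, size=200)
wide = rng.normal(0.0, 1.0, size=200)

h_narrow = emma_entropy(narrow[:50], narrow[50:100])  # N_A = N_B = 50, as in the text
h_wide = emma_entropy(wide[:50], wide[50:100])
```

The tightly clustered sample receives the lower entropy estimate; during alignment it is this effect, applied to the joint samples w_i, that rewards transformations where the model explains the image.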
Figure 8. MRI alignment (from left to right): original proton-density image, original T2-weighted image, initial alignment, composite display
of final alignment, intensity-transformed image.
The two original images are components of a double-echo MR scan and were obtained simultaneously; as a result the correct alignment should be close to the identity transformation. It is clear that the two images have high mutual information, while they are not identical.

A typical initial alignment appears in the center of Fig. 8. Notice that this image is a scaled, sheared, rotated and translated version of the original. A successful alignment is displayed as a checkerboard. Here every other 20 × 20 pixel block is taken either from the model image or the target image. Notice that the boundary of the brain in the images is very closely aligned.

We represent the transformation by a 6 element affine matrix that takes two dimensional points from the image plane of the first image into the image plane of the second image. This scheme can represent any combination of scaling, shearing, rotation and translation. Before alignment the pixel values in the two MR images are pre-scaled so that they vary from 0 to 1. The component densities are ψ_uu = ψ_vv = 0.1, and the random samples are of size 20. We used a learning rate of 0.02 for 500 iterations and 0.005 for 500 iterations. Total run time on a Sparc 10 was 12 seconds.

Over a set of 50 randomly generated initial poses that vary in position by 32 pixels, a little less than one third of the width of the head, rotations of 28 degrees, and scalings of up to 20%, the “correct” alignment is obtained reliably. Final alignments were well within one pixel in position and within 0.5% of the identity matrix for rotation/scale. We report errors in percent here because of the use of affine transformation matrices.

The two MRI images are fairly similar. Good alignment could probably be obtained with a normalized correlation metric. Normalized correlation assumes, at least locally, that one signal is a scaled and offset version of the other. Our technique makes no such assumption. In fact, it will work across a wide variety of non-linear transformations. All that is required is that the intensity transformation preserve a significant amount of information. On the right in Fig. 8 we show the model image after a non-monotonic (quadratic) intensity transformation. Alignment performance is not significantly affected by this transformation.

This last experiment is an example that would defeat traditional correlation, since the signals (the second and last in Fig. 8) are more similar in value when they are badly mis-aligned (non-overlapping) than they are when properly aligned.

6.2. Alignment of 3D Objects

6.2.1. Skull Alignment Experiments. This section describes the alignment of a real three dimensional object to its video image. The signals that are compared are quite different in nature: one is the video brightness, while the other consists of two components of the normal vector at a point on the surface of the model.

We obtained an accurate 3D model, including normals, of a skull that was derived from a computed tomography (CT) scan. Cluttered video images of the skull were obtained (see Fig. 9). In these images the pose of the model is displayed by projecting 3D points from the model’s surface into the image plane and highlighting them in white. In the upper left of Fig. 9 the model is displayed in a typical initial pose. The final alignment of the skull model is in the upper right. Notice that the boundaries of the skull model and skull image are in close agreement. We would like to emphasize that in none of these experiments have we pre-segmented the image. The initial poses often project the model into regions of the image that contain a significant amount of clutter. EMMA reliably settles on a pose where few if any of the model points project onto the background.

One difference between the method used to perform 3D alignment and that used for 2D alignment is a Z-buffering step that is used to prune hidden points
Figure 9. Skull alignment experiments: Initial alignment, final alignment, initial alignment with occlusion, final alignment with occlusion.
from the calculations. Since Z-buffer pruning is costly, and the pose does not change much between iterations, it proved sufficient to prune every 200 iterations. Another difference is that the model surface sampling was adjusted so that the sampling density in the image was corrected for foreshortening.

In this experiment, the camera has a viewing angle of 18 degrees. We represent T, the transformation from model to image coordinates, as a double quaternion followed by a perspective projection (Horn, 1986). Assuming diagonal covariance matrices, four different variances are necessary: three for the joint entropy estimate and one for the image entropy estimate. The variance for the x component of the normal was 0.3, for the y component of the normal was 0.3, for the image intensity was 0.2, and for the image entropy was 0.15. The size of the random sample used is 50 points.

Since the units of rotation and translation are very different, two separate learning rates are necessary. For an object with a 100 millimeter radius, a rotation of 0.01 radians about its center can translate a model point up to 1 millimeter. On the other hand, a translation of 0.01 can at most translate a model point 0.01 millimeters. As a result, a small step in the direction of the derivative will move some model points up to 100 times further by rotation than translation. If there is only a single learning rate a compromise must be made between the rapid changes that arise from the rotation and the slow changes that arise from translation. Since the models used have a radius that is on the order of 100 millimeters, we have chosen rotation learning rates that are 100 times smaller than translation rates. In our experiments alignment proceeds in two stages. For the first 2000 iterations the rotation learning rate is 0.0005 and the translation learning rate is 0.05. The learning rates are then reduced to 0.0001 and 0.01 respectively for an additional 2000 iterations. Running time is about 30 seconds on a Sparc 10.

A number of randomized experiments were performed to determine the reliability, accuracy and repeatability of alignment. This data is reported in Table 1. An initial alignment to an image was performed to establish a base pose. From this base pose, a random uniformly distributed offset is added to each translational axis (labeled ΔT) and then the model is rotated about a randomly selected axis by a random uniformly selected angle (Δθ). Table 1 describes four experiments, each including 50 random initial poses. The distribution of the final and initial poses can be compared by examining the variance of the location of the centroid, computed separately in X, Y and Z. In addition, the average angular rotation from the true pose is reported (labeled |Δθ|). Finally, the number of successful alignments is reported.
Table 1. Randomized skull alignment experiments. Columns: translational offset ΔT (± mm) in X, Y and Z and rotational offset Δθ (degrees); initial and final centroid standard deviations σ_X, σ_Y, σ_Z (mm); angular errors |Δθ| (degrees); and success rate (%).
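The rotation/translation scaling argument behind the two learning rates can be checked with a line of arithmetic: a point at radius r rotated by a small angle moves along a chord of length 2r sin(θ/2) ≈ rθ.

```python
import numpy as np

r = 100.0    # model radius in millimeters, as in the text
step = 0.01  # parameter step: radians for rotation, millimeters for translation

# A surface point at radius r, rotated by `step` radians about the center,
# moves along a chord of length 2 r sin(step / 2) ~ r * step.
rot_displacement = 2 * r * np.sin(step / 2)

# The same step applied as a translation moves every point exactly `step` mm.
trans_displacement = step

ratio = rot_displacement / trans_displacement
```

The ratio comes out at essentially 100, which is why the rotation learning rates above are chosen 100 times smaller than the translation rates.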
An initial alignment of the model to the first frame of the sequence was obtained using a manually-generated starting pose (this frame is not shown). In subsequent frames, the previous final pose was used as the initial pose for the next alignment. Each pose refinement took about 10 seconds on a Sparc 10.

How are the face experiments different from the skull experiments? Firstly, the face model is much smoother than the skull model. There really aren’t any creases or points of high curvature. As a result it is much less likely that an edge-based system could construct a representation either of the image or the model that would be stable under changes in illumination. Secondly, the albedo of the actual object is not exactly constant. The face contains eyebrows, lips and other regions where the albedo is not the same. As a result this is a test of EMMA’s ability to handle objects where the assumption of constant albedo is violated. Thirdly, not all of the occluding contours of the object are present in the model. The model is truncated both at the chin and the forehead. As a result experiments with this model demonstrate that EMMA can work even when the occluding contours of the image and model are not in agreement.

6.3. View Based Recognition Experiments

In the previous vision experiments we used knowledge of the physics of imaging to show that the surface normal of an object should be predictive of the intensity observed in an image. Unfortunately, in many experimental situations no three dimensional model is available. In these situations it is frequently the case that the only available information about an object is a collection of images taken under a variety of conditions. One approach for solving problems like this is to use a collection of images as the model. This is often called a “view based” approach since the model is made up of a number of views of the model object. Given a novel image of some object, each model image is compared to it in turn. If some model image is “close enough” to the novel image, the model and novel image are considered aligned (or recognized). One can significantly reduce the number of model images required by adding an affine transformation to the comparison process. The novel image is then compared to each model image under a set of affine transformations. The most commonly used comparison metric is correlation. Correlation makes the assumption that the model and the image are identical (or possibly related by a linear function).

In general the set of images that can arise from a single object under varying illumination is very broad. Figure 11 shows two images of the same object in the same pose. These images are very different and are in fact anti-correlated: bright pixels in the left image correspond to dark pixels in the right image; dark pixels in the left image correspond to bright pixels in the right image. No variant of correlation could match these images together.

We have presented techniques based on entropy that can match both correlated and anti-correlated signals. These techniques require only that there is some consistent relationship between model and image. Discouragingly, it is not difficult to find two images of the same object for which there is no consistent relationship. Figure 12 shows a novel image which is aligned with the two model images. Figure 13 contains two scatter plots of the pixel values in the novel image versus the pixel values in the model images. Clearly, there is no simple consistent relationship displayed in either of these graphs. Neither correlation nor EMMA could be used to match this novel image to either model image.

Figure 12. A novel image of the car model.
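The claim that correlation fails on anti-correlated signals while mutual information does not can be demonstrated numerically. The sketch below uses synthetic intensities and a simple histogram MI estimate (a stand-in for EMMA): negating a signal flips its correlation to −1 but leaves its mutual information with the original essentially unchanged.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram estimate of I(a; b) from paired samples (in nats)."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
model = rng.random(10_000)   # stand-in for model-image intensities
image = 1.0 - model          # perfectly anti-correlated "image"

corr = np.corrcoef(model, image)[0, 1]          # -1: correlation is defeated
mi_pos = mutual_information(model, model)       # correlated case
mi_neg = mutual_information(model, image)       # anti-correlated case
```

`corr` is exactly −1, yet `mi_neg` is as large as `mi_pos`: mutual information only asks for a consistent relationship, not a linear one.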
Figure 13. The relationship between pixels in the novel image and each of the model images.
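The entropy comparisons in these experiments rest on EMMA's non-parametric estimate, which evaluates a Parzen-window density built from one random sample at the points of a second random sample. A 1-D sketch, with the kernel variance ψ chosen arbitrarily here:

```python
import numpy as np

rng = np.random.default_rng(3)

def emma_entropy(sample_a, sample_b, psi=0.1):
    """Non-parametric entropy estimate from two random samples:
    H(z) ~ -(1/N_B) sum_{b in B} log[(1/N_A) sum_{a in A} g_psi(b - a)],
    where g_psi is a Gaussian density with variance psi."""
    diff = sample_b[:, None] - sample_a[None, :]
    g = np.exp(-0.5 * diff ** 2 / psi) / np.sqrt(2 * np.pi * psi)
    density_at_b = g.mean(axis=1)       # Parzen estimate of p(b) from sample A
    return float(-np.mean(np.log(density_at_b)))

# Sanity check on a signal whose entropy is known: a unit-variance Gaussian
# has differential entropy 0.5 * log(2*pi*e) ~ 1.419 nats.
z = rng.normal(size=2000)
a = rng.choice(z, size=500, replace=False)   # sample A
b = rng.choice(z, size=500, replace=False)   # sample B
h = emma_entropy(a, b)
```

The estimate carries a small upward smoothing bias from the kernel width, which is why the covariance parameters ψ are themselves tuned (or estimated on-line) in the experiments.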
6.3.1. Photometric Stereo. By itself each model image does not contain enough information to constrain the match between image and model. However, it is well known that taken together a collection of images can be used to determine the 3D shape of an object. As we’ve seen, the 3D shape is sufficient to constrain the match between image and model.

When multiple images of an object are available a technique called photometric stereo can be used to estimate 3D shape (Horn, 1986). Photometric stereo works with images which are taken from the same location but under different illumination conditions. It is assumed that detailed information both about illumination and surface properties is available for each image. As a result a reflectance map can be computed for each image.

The reflectance map together with the intensity of a pixel acts as a constraint on the normal vector visible from that pixel. The allowable normals usually lie along a closed curve on the unit sphere. From a second image, and its associated reflectance map, another set of allowable normals can be computed. By intersecting these constraints, two images are sufficient to determine the surface normal at each pixel. From the normals the shape can be obtained through integration.

Once the shape of the object is determined, the correct alignment could be found using the three dimensional version of EMMA alignment. The imaging function of this new two stage process is

v(T(x)) = F(G(u(x), r_1, r_2), q) + η,

where G( ) denotes the photometric stereo computation from the pair of model images u(x) and their reflectance maps r_1 and r_2. This requires detailed knowledge of the illumination that is only available under very controlled circumstances. One cannot use natural images where the lighting is unknown or difficult to determine. Luckily, we need not actually know G( ), r_1, r_2, F( ), or q. As long as they exist there will be high mutual information between any novel image and a pair of model images. This is the essence of view based EMMA alignment. We don’t actually perform photometric stereo, we simply assume that it is possible. As a result a pair of images should give information about any third image.

To demonstrate this approach we have built a model using the two images in Fig. 11. Figure 14 shows the target image, the initial pose of the model, and the final pose obtained after alignment.

Technically this experiment is very similar to the MRI alignment experiment. The main difference is that the model is constructed from a pair of model images. A sample of the model u(x) = [u_1(x), u_2(x)]^T is a two dimensional vector containing the intensity of the two images at location x. This is similar to the two component representation of normal used in the three dimensional alignment experiments. For this experiment σ is 0.1. The parameters were updated for 1000 iterations at a rate of 0.002. From a set of randomized experiments we have determined that the capture range of the alignment procedure is about 40% of the length and width of the car, and 35 degrees of rotation.
Figure 14. Top left: A novel image of the car. Top right: The initial pose of the car model. Though the model is made up of multiple images,
only one is shown here. Bottom left: The aligned pose of the car model.
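The two-component view-based model sample u(x) = [u_1(x), u_2(x)]^T described above can be sketched as follows; the random images, image size, and sample size of 50 (borrowed from the skull experiments) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two hypothetical model images of the same object under different lighting
# (stand-ins for the registered pair in Fig. 11).
h, w = 64, 64
u1 = rng.random((h, w))
u2 = rng.random((h, w))

# The view-based model pairs the two intensities at each location x:
# u(x) = [u1(x), u2(x)]^T, one 2-vector per pixel.
model = np.stack([u1, u2], axis=-1)   # shape (h, w, 2)

# A random sample of model vectors, as drawn by the stochastic approximation.
rows = rng.integers(0, h, size=50)
cols = rng.integers(0, w, size=50)
sample = model[rows, cols]            # shape (50, 2)
```

Each sampled 2-vector plays the same role that the two-component surface normal played in the 3D experiments: a multi-dimensional model value whose joint entropy with the image intensity is estimated and maximized.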
required. The metric has been rigorously derived from information theory.

In a typical vision application EMMA alignment is intensity-based, rather than feature-based. While intensity-based, it is more robust than traditional correlation, since it is insensitive to negating the image data, as well as a variety of non-linear transformations (e.g., Section 6.1), which would defeat conventional intensity-based correlation.

The sensitivity of intensity correlation may be corrected, to some extent, by performing correlations on the magnitude of the intensity gradient. This, as well as edge-based matching techniques, can perform well on objects having discontinuous surface properties, or useful silhouettes. These approaches work because the image counterparts of these discontinuities are reasonably stable with respect to illumination; however, they typically make two very strong assumptions: the edges that arise are stable under changes in lighting, and the models are well described as a collection of edges.

There are many schemes that represent models and images by collections of edges and define a distance metric between them; Huttenlocher’s use of the Hausdorff distance (Huttenlocher et al., 1991) is an example. Some methods use a metric that is proportional to the number of edges that coincide (see the excellent survey articles: (Besl and Jain, 1985; Chin and Dyer, 1986)). A smooth, optimizable version of such a metric can be defined by introducing a penalty both for unmatched edges and for the distance between those that are matched (Lowe, 1985; Wells III, 1992). This metric can then be used both for image/model comparison and for pose refinement. Additional technical details on the relationship between mutual information and other measures of alignment may be found in (Viola, 1995).

Alignment by extremizing properties of the joint signal has been used by Hill et al. (1994) to align MRI, CT, and other medical image modalities. They use third order moments of the joint histogram to characterize the clustering of the joint data. We believe that mutual information is perhaps a more direct measure of the salient property of the joint data at alignment, and demonstrate an efficient means of estimating and extremizing it. Recently, Collignon et al. (1995) described the use of joint entropy as a criterion for registration of CT and MRI data. They demonstrated a good minimum by probing the criterion, but no search techniques were described.
Image-based approaches to modeling have been previously explored by several authors. Objects need not have edges to be well represented in this way, but care must be taken to deal with changes in lighting and pose. Turk and Pentland have used a large collection of face images to train a system to construct representations that are invariant to some changes in lighting and pose (Turk and Pentland, 1991). These representations are a projection onto the largest eigenvectors of the distribution of images within the collection. Their system addresses the problem of recognition rather than alignment, and as a result much of the emphasis and many of the results are different. For instance, it is not clear how much variation in pose can be handled by their system. We do not see a straightforward extension of this or similar eigenspace work to the problem of pose refinement. In other related work, Shashua has shown that all of the images, under different lighting, of a Lambertian surface are a linear combination of any three of the images (Shashua, 1992). A procedure for image alignment could be derived from this theory. In contrast, our image alignment method does not assume that the object has a Lambertian surface.

Entropy is playing an ever increasing role within the field of neural networks. We know of no work on the alignment of models and images, but there has been work using entropy and information in vision problems. None of these techniques use a non-parametric scheme for density/entropy estimation as we do. In most cases the distributions are assumed to be either binomial or Gaussian. Entropy and mutual information play a role in the work of (Linsker, 1986; Becker and Hinton, 1992; Bell and Sejnowski, 1995).

Acknowledgments

We were partially inspired by the work of Hill and Hawkes on registration of medical images. Sanjeev Kulkarni introduced Wells to the concept of relative entropy, and its use in image processing. Nicol Schraudolph and Viola began discussions of this concrete approach to evaluating entropy in an application of un-supervised learning.

We thank Ron Kikinis and Gil Ettinger for the 3D skull model and MRI data. J.P. Mellor provided the skull images and camera model. Viola would like to thank Terrence J. Sejnowski for providing some of the facilities used during the preparation of this manuscript.

We thank the following sources for their support of this research: USAF ASSERT program, Parent Grant#: F49620-93-1-0263 (Viola), Howard Hughes Medical Institute (Viola), ARPA IU program via ONR#: N00014-94-01-0994 (Wells) and AFOSR # F49620-93-1-0604 (Wells).

Notes

1. EMMA is a random but pronounceable subset of the letters in the words “Empirical entropy Manipulation and Analysis”.
2. Log likelihood is computed by first finding the Gaussian distribution that fits the residual error, or noise, best. The log of (6) is then computed using the estimated distribution of the noise. For small amounts of noise, these estimates can be much larger than 1.
3. Correlation matching is one of many techniques that assumes a Gaussian conditional distribution of the image given the model.
4. Here we speak of the empirically estimated entropy of the conditional distribution.

References

Becker, S. and Hinton, G.E. 1992. Learning to make coherent predictions in domains with discontinuities. In Advances in Neural Information Processing, J.E. Moody, S.J. Hanson, and R.P. Lippmann (Eds.), Denver 1991. Morgan Kaufmann: San Mateo, vol. 4.
Bell, A.J. and Sejnowski, T.J. 1995. An information-maximisation approach to blind separation. In Advances in Neural Information Processing, Denver 1994. Morgan Kaufmann: San Francisco, vol. 4.
Besl, P. and Jain, R. 1985. Three-dimensional object recognition. Computing Surveys, 17:75–145.
Bridle, J.S. 1989. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing 2, D.S. Touretzky (Ed.), Morgan Kaufmann, pp. 211–217.
Chin, R. and Dyer, C. 1986. Model-based recognition in robot vision. Computing Surveys, 18:67–108.
Collignon, A., Vandermuelen, D., Suetens, P., and Marchal, G. 1995. 3D multi-modality medical image registration using feature space clustering. In Computer Vision, Virtual Reality and Robotics in Medicine, N. Ayache (Ed.), Springer Verlag, pp. 195–204.
Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. John Wiley and Sons.
Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley and Sons.
Haykin, S. 1994. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing.
Hill, D.L., Studholme, C., and Hawkes, D.J. 1994. Voxel similarity measures for automated image registration. In Proceedings of the Third Conference on Visualization in Biomedical Computing, pp. 205–216, SPIE.
Horn, B. 1986. Robot Vision. McGraw-Hill: New York.
Huttenlocher, D., Kedem, K., and Sharir, M. 1991. The upper envelope of Voronoi surfaces and its applications. In Proceedings of the Seventh ACM Symposium on Computational Geometry, pp. 194–293.
Linsker, R. 1986. From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, USA, vol. 83, pp. 7508–7512, 8390–8394, 8779–8783.
Ljung, L. and Söderström, T. 1983. Theory and Practice of Recursive Identification. MIT Press.
Lowe, D. 1985. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers.
Shashua, A. 1992. Geometry and Photometry in 3D Visual Recognition. Ph.D. thesis, M.I.T. Artificial Intelligence Laboratory, AI-TR-1401.
Turk, M. and Pentland, A. 1991. Face recognition using eigenfaces. In Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, Lahaina, Maui, Hawaii, pp. 586–591. IEEE.
Viola, P.A. 1995. Alignment by Maximization of Mutual Information. Ph.D. thesis, Massachusetts Institute of Technology.
Wells III, W. 1992. Statistical Object Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, Mass. MIT AI Laboratory TR 1398.
Widrow, B. and Hoff, M. 1960. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, IRE, New York, 4:96–104.