2.5.2 Multivariate Density
selected randomly from a distribution. It can be shown that the normal distribution
has the maximum entropy of all distributions having a given mean and variance
(Problem 20). Moreover, as stated by the Central Limit Theorem, the aggregate
effect of a large number of small, independent random disturbances will lead to a
Gaussian distribution (Computer exercise ??). Because many patterns — from fish
to handwritten characters to some speech sounds — can be viewed as some ideal or
prototype pattern corrupted by a large number of random processes, the Gaussian is
often a good model for the actual probability distribution.
The general multivariate normal density in d dimensions is written as

p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp[−(1/2)(x − µ)t Σ^{−1}(x − µ)],  (37)

where x is a d-component column vector, µ is the d-component mean vector, Σ is the
d-by-d covariance matrix, and |Σ| and Σ^{−1} are its determinant and inverse, respectively.
The inner product of two vectors a and b is written

at b = \sum_{i=1}^{d} ai bi,  (38)

and often called a dot product.
For simplicity, we often abbreviate Eq. 37 as p(x) ∼ N (µ, Σ). Formally, we have
µ ≡ E[x] = ∫ x p(x) dx  (39)
and
Σ ≡ E[(x − µ)(x − µ)t] = ∫ (x − µ)(x − µ)t p(x) dx,  (40)
where the expected value of a vector or a matrix is found by taking the expected
values of its components. In other words, if xi is the ith component of x, µi the ith
component of µ, and σij the ijth component of Σ, then
µi = E[xi]  (41)

and

σij = E[(xi − µi)(xj − µj)].  (42)

If all the off-diagonal elements of Σ are zero, p(x) reduces to the product of the
univariate normal densities for the components of x.
Linear combinations of jointly normally distributed random variables, independent
or not, are normally distributed. In particular, if A is a d-by-k matrix and y = At x
is a k-component vector, then p(y) ∼ N (At µ, At ΣA), as illustrated in Fig. 2.8. In
the special case where k = 1 and A is a unit-length vector a, y = at x is a scalar that
represents the projection of x onto a line in the direction of a; in that case at Σa is the
variance of the projection of x onto a. In general then, knowledge of the covariance
matrix allows us to calculate the dispersion of the data in any direction, or in any
subspace.
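As a numerical check of this property, the short sketch below (assuming NumPy; the values of µ, Σ, and a are illustrative, not taken from the text) draws samples from N(µ, Σ), projects them onto a unit vector a, and compares the sample mean and variance of the projection with at µ and at Σa:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the text): a 2-D Gaussian N(mu, Sigma).
mu = np.array([4.0, 6.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])

# Draw samples and project them onto a unit-length direction a.
x = rng.multivariate_normal(mu, Sigma, size=100_000)
a = np.array([1.0, 1.0]) / np.sqrt(2.0)     # unit vector
y = x @ a                                   # y = a^t x for each sample

# Empirical mean/variance of the projection versus a^t mu and a^t Sigma a.
print(y.mean(), a @ mu)          # both close to a^t mu
print(y.var(), a @ Sigma @ a)    # both close to a^t Sigma a
```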
It is sometimes convenient to perform a coordinate transformation that converts
an arbitrary multivariate normal distribution into a spherical one, i.e., one having a
covariance matrix proportional to the identity matrix I. If we define Φ to be the ma-
trix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix
of the corresponding eigenvalues, then the transformation Aw = ΦΛ^{−1/2} applied to
the coordinates ensures that the transformed distribution has covariance matrix equal
to the identity matrix. In signal processing, Aw is called a whitening transformation,
since it makes the spectrum of eigenvalues of the transformed distribution uniform.
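A minimal sketch of such a whitening transformation, again assuming NumPy and an illustrative covariance matrix, forms Aw = ΦΛ^{−1/2} from the eigendecomposition of Σ and verifies that the transformed samples have (approximately) identity covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative covariance and mean (not from the text).
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
mu = np.zeros(2)

# Eigendecomposition: columns of Phi are orthonormal eigenvectors of Sigma.
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)    # A_w = Phi Lambda^{-1/2}

# Apply A_w^t to each sample (samples are rows); the result is whitened.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A_w
print(np.cov(y, rowvar=False))          # approximately the identity matrix
```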
The multivariate normal density is completely specified by d + d(d + 1)/2 pa-
rameters — the elements of the mean vector µ and the independent elements of the
covariance matrix Σ. Samples drawn from a normal population tend to fall in a single
cloud or cluster (Fig. 2.9); the center of the cluster is determined by the mean vector,
and the shape of the cluster is determined by the covariance matrix. It follows from
Eq. 37 that the loci of points of constant density are hyperellipsoids for which the
quadratic form (x−µ)t Σ−1 (x−µ) is constant. The principal axes of these hyperellip-
soids are given by the eigenvectors of Σ (described by Φ); the eigenvalues (described
by Λ) determine the lengths of these axes. The quantity

r² = (x − µ)t Σ^{−1}(x − µ)  (43)

is sometimes called the squared Mahalanobis distance from x to µ. Thus, the contours
of constant density are hyperellipsoids of constant Mahalanobis distance to µ, and the
volume of these hyperellipsoids measures the scatter of the samples about the mean. It
can be shown (Problems 15 & 16) that the volume of the hyperellipsoid corresponding
to a Mahalanobis distance r is given by
V = Vd |Σ|^{1/2} r^d,  (44)

where Vd is the volume of a d-dimensional unit hypersphere:

Vd = π^{d/2}/(d/2)!                              (d even)

Vd = 2^d π^{(d−1)/2} ((d−1)/2)!/d!               (d odd).  (45)
Thus, for a given dimensionality, the scatter of the samples varies directly with |Σ|^{1/2}
(Problem 17).
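The squared Mahalanobis distance of Eq. 43 is simple to compute once Σ^{−1} is available; the sketch below uses NumPy with illustrative values of µ and Σ that are not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (not from the text).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[3.0, 0.5, 0.2],
                  [0.5, 2.0, 0.0],
                  [0.2, 0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    """Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^{-1} (x - mu), Eq. 43."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

x = rng.multivariate_normal(mu, Sigma)
print(mahalanobis_sq(x, mu, Sigma_inv))
```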
Figure 2.8: The action of a linear transformation on the feature space will convert an
arbitrary normal distribution into another normal distribution. One transformation,
A, takes the source distribution into distribution N (At µ, At ΣA). Another linear
transformation — a projection P onto line a — leads to N (µ, σ 2 ) measured along a.
While the transforms yield distributions in a different space, we show them super-
imposed on the original x1 − x2 space. A whitening transform leads to a circularly
symmetric Gaussian, here shown displaced.
2.6 Discriminant Functions for the Normal Density

For minimum-error-rate classification we can use the discriminant functions
gi(x) = ln p(x|ωi) + ln P(ωi). If the densities p(x|ωi) are multivariate normal, i.e., if
p(x|ωi) ∼ N(µi, Σi), this gives

gi(x) = −(1/2)(x − µi)t Σi^{−1}(x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi).  (47)
Let us examine the discriminant function and resulting classification for a number of
special cases.
2.6.1 Case 1: Σi = σ²I

The simplest case occurs when the features are statistically independent, and when
each feature has the same variance, σ². In this case the covariance matrix is diagonal,
Figure 2.9: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on
the mean µ. The red ellipses show lines of equal probability density of the Gaussian.
being merely σ² times the identity matrix I. Geometrically, this corresponds to the
situation in which the samples fall in equal-size hyperspherical clusters, the cluster
for the ith class being centered about the mean vector µi . The computation of the
determinant and the inverse of Σi is particularly easy: |Σi| = σ^{2d} and Σi^{−1} = (1/σ²)I.
Since both |Σi | and the (d/2) ln 2π term in Eq. 47 are independent of i, they are
unimportant additive constants that can be ignored. Thus we obtain the simple
discriminant functions
gi(x) = −‖x − µi‖²/(2σ²) + ln P(ωi),  (48)
where ‖·‖ is the Euclidean norm, that is,

‖x − µi‖² = (x − µi)t (x − µi).  (49)
If the prior probabilities are not equal, then Eq. 48 shows that the squared distance
‖x − µi‖² must be normalized by the variance σ² and offset by adding ln P(ωi); thus,
if x is equally near two different mean vectors, the optimal decision will favor the a
priori more likely category.
Regardless of whether the prior probabilities are equal or not, it is not actually
necessary to compute distances. Expansion of the quadratic form (x − µi )t (x − µi )
yields
gi(x) = −(1/(2σ²))[xt x − 2µit x + µit µi] + ln P(ωi),  (50)
which appears to be a quadratic function of x. However, the quadratic term xt x is
the same for all i, making it an ignorable additive constant. Thus, we obtain the
equivalent linear discriminant functions
gi (x) = wit x + wi0 , (51)
where
Figure 2.10: If the covariances of two distributions are equal and proportional to the
identity matrix, then the distributions are spherical in d dimensions, and the boundary
is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating
the means. In these 1-, 2-, and 3-dimensional examples, we indicate p(x|ωi ) and the
boundaries for the case P (ω1 ) = P (ω2 ). In the 3-dimensional case, the grid plane
separates R1 from R2 .
wi = (1/σ²) µi  (52)

and

wi0 = −(1/(2σ²)) µit µi + ln P(ωi).  (53)
We call wi0 the threshold or bias in the ith direction.
A classifier that uses linear discriminant functions is called a linear machine. This
kind of classifier has many interesting theoretical properties, some of which will be
discussed in detail in Chap. ??. At this point we merely note that the decision
surfaces for a linear machine are pieces of hyperplanes defined by the linear equations
gi (x) = gj (x) for the two categories with the highest posterior probabilities. For our
particular case, this equation can be written as
wt (x − x0 ) = 0, (54)
where
w = µi − µj (55)
and
x0 = (1/2)(µi + µj) − (σ²/‖µi − µj‖²) ln[P(ωi)/P(ωj)] (µi − µj).  (56)
This equation defines a hyperplane through the point x0 and orthogonal to the
vector w. Since w = µi − µj , the hyperplane separating Ri and Rj is orthogonal to
the line linking the means. If P (ωi ) = P (ωj ), the second term on the right of Eq. 56
vanishes, and thus the point x0 is halfway between the means, and the hyperplane is
the perpendicular bisector of the line between the means (Fig. 2.11). If P(ωi) ≠ P(ωj),
the point x0 shifts away from the more likely mean. Note, however, that if the variance
Figure 2.11: As the priors are changed, the decision boundary shifts; for sufficiently
disparate priors the boundary will not lie between the means of these 1-, 2- and
3-dimensional spherical Gaussian distributions.
σ² is small relative to the squared distance ‖µi − µj‖², then the position of the decision
boundary is relatively insensitive to the exact values of the prior probabilities.
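The boundary parameters of Eqs. 55 and 56 can be computed directly; the following sketch assumes NumPy, and the means, variance, and priors are illustrative placeholders rather than values from the text:

```python
import numpy as np

# Boundary point x0 and normal w for Case 1 (Sigma_i = sigma^2 I), Eqs. 55-56.
# The numerical values below are illustrative, not taken from the text.
mu_i = np.array([0.0, 0.0])
mu_j = np.array([4.0, 2.0])
sigma2 = 1.0
P_i, P_j = 0.7, 0.3

diff = mu_i - mu_j
w = diff                                                           # Eq. 55
x0 = 0.5 * (mu_i + mu_j) \
     - (sigma2 / np.dot(diff, diff)) * np.log(P_i / P_j) * diff    # Eq. 56

# For this two-class case, w^t (x - x0) > 0 corresponds to g_i(x) > g_j(x).
x = np.array([1.0, 1.0])
print("decide omega_i" if w @ (x - x0) > 0 else "decide omega_j")
```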
If the prior probabilities P (ωi ) are the same for all c classes, then the ln P (ωi )
term becomes another unimportant additive constant that can be ignored. When this
happens, the optimum decision rule can be stated very simply: to classify a feature
vector x, measure the Euclidean distance ‖x − µi‖ from x to each of the c mean
vectors, and assign x to the category of the nearest mean. Such a classifier is called a
minimum-distance classifier. If each mean vector is thought of as being an ideal
prototype or template for patterns in its class, then this is essentially a template-
matching procedure (Fig. 2.10), a technique we will consider again in Chap. ?? Sect. ??
on the nearest-neighbor algorithm.
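A minimal sketch of such a minimum-distance classifier, assuming NumPy and placeholder mean vectors, is given below:

```python
import numpy as np

# Minimum-distance classifier for Case 1 with equal priors:
# assign x to the class whose mean is nearest in Euclidean distance.
# The means below are illustrative placeholders, not values from the text.
means = np.array([[0.0, 0.0],
                  [3.0, 3.0],
                  [0.0, 4.0]])          # one mean vector per class

def classify(x, means):
    """Return the index of the nearest mean (Eq. 48 with equal priors)."""
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

print(classify(np.array([2.5, 2.0]), means))   # -> 1
```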
2.6.2 Case 2: Σi = Σ
Another simple case arises when the covariance matrices for all of the classes are
identical but otherwise arbitrary. Geometrically, this corresponds to the situation in
which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster
for the ith class being centered about the mean vector µi . Since both |Σi | and the
(d/2) ln 2π term in Eq. 47 are independent of i, they can be ignored as superfluous
additive constants. This simplification leads to the discriminant functions
gi(x) = −(1/2)(x − µi)t Σ^{−1}(x − µi) + ln P(ωi).  (57)
If the prior probabilities P (ωi ) are the same for all c classes, then the ln P (ωi )
term can be ignored. In this case, the optimal decision rule can once again be stated
very simply: to classify a feature vector x, measure the squared Mahalanobis distance
(x − µi )t Σ−1 (x − µi ) from x to each of the c mean vectors, and assign x to the
category of the nearest mean. As before, unequal prior probabilities bias the decision
in favor of the a priori more likely category.
Expansion of the quadratic form (x − µi )t Σ−1 (x − µi ) results in a sum involving
a quadratic term xt Σ−1 x which here is independent of i. After this term is dropped
from Eq. 57, the resulting discriminant functions are again linear:

gi(x) = wit x + wi0,  (58)

where

wi = Σ^{−1} µi  (59)
and
wi0 = −(1/2) µit Σ^{−1} µi + ln P(ωi).  (60)
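The linear discriminants of Eqs. 58-60 are equally easy to evaluate; the sketch below assumes NumPy, and the shared covariance, means, and priors are illustrative placeholders:

```python
import numpy as np

# Linear discriminant for Case 2 (shared covariance), Eqs. 58-60.
# Sigma, the class means, and the priors below are illustrative placeholders.
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

Sigma_inv = np.linalg.inv(Sigma)

def g(i, x):
    """g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i (Eq. 59)
    and w_i0 = -0.5 mu_i^t Sigma^{-1} mu_i + ln P(omega_i) (Eq. 60)."""
    w_i = Sigma_inv @ means[i]
    w_i0 = -0.5 * means[i] @ Sigma_inv @ means[i] + np.log(priors[i])
    return w_i @ x + w_i0

x = np.array([1.0, 0.5])
print(max(range(len(means)), key=lambda i: g(i, x)))   # index of the chosen class
```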
Since the discriminants are linear, the resulting decision boundaries are again
hyperplanes (Fig. 2.10). If Ri and Rj are contiguous, the boundary between them
has the equation
wt (x − x0 ) = 0, (61)
where

w = Σ^{−1}(µi − µj)  (62)

and

x0 = (1/2)(µi + µj) − [ln(P(ωi)/P(ωj)) / ((µi − µj)t Σ^{−1}(µi − µj))] (µi − µj).  (63)
Figure 2.12: Probability densities (indicated by the surfaces in two dimensions and
ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric
Gaussian distributions. The decision hyperplanes need not be perpendicular to the
line connecting the means.
2.6.3 Case 3: Σi = arbitrary

In the general multivariate normal case, the covariance matrices are different for each
category. The only term that can be dropped from Eq. 47 is the (d/2) ln 2π term, and
the resulting discriminant functions are inherently quadratic:

gi(x) = xt Wi x + wit x + wi0,  (64)

where

Wi = −(1/2) Σi^{−1},  (65)

wi = Σi^{−1} µi  (66)
and
wi0 = −(1/2) µit Σi^{−1} µi − (1/2) ln |Σi| + ln P(ωi).  (67)
The decision surfaces are hyperquadrics, and can assume any of the general forms
— hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids,
and hyperhyperboloids of various types (Problem 29). Even in one dimension, for
arbitrary covariance the decision regions need not be simply connected (Fig. 2.13).
The two- and three-dimensional examples in Fig. 2.14 & 2.15 indicate how these
different forms can arise. These variances are indicated by the contours of constant
probability density.
The extension of these results to more than two categories is straightforward
though we need to keep clear which two of the total c categories are responsible for
any boundary segment. Figure 2.16 shows the decision surfaces for a four-category
case made up of Gaussian distributions. Of course, if the distributions are more com-
plicated, the decision regions can be even more complex, though the same underlying
theory holds there too.
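In this general case one can simply evaluate Eq. 47 for each class and pick the largest value; the sketch below does so with NumPy, using placeholder class parameters that are not from the text:

```python
import numpy as np

# General-case discriminant (arbitrary Sigma_i), evaluated directly from Eq. 47.
# The class parameters (mu, Sigma, prior) below are illustrative placeholders.
params = [
    (np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
    (np.array([3.0, 2.0]), np.array([[1.0, -0.3], [-0.3, 1.5]]), 0.5),
]

def g(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^t Sigma^{-1} (x-mu) - (d/2) ln 2pi
               - 0.5 ln|Sigma| + ln P(omega_i)   (Eq. 47)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([1.0, 1.0])
scores = [g(x, mu, S, P) for mu, S, P in params]
print(int(np.argmax(scores)))   # index of the class with the largest discriminant
```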
Figure 2.13: Non-simply connected decision regions can arise in one dimension for
Gaussians having unequal variance.
Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that
are general hyperquadrics. Conversely, given any hyperquadric, one can find two
Gaussian distributions whose Bayes decision boundary is that hyperquadric.
[Figure 2.15 (three-dimensional examples of the decision boundaries referenced above) appears here; its plots are not reproduced.]
Figure 2.16: The decision regions for four normal distributions. Even with such a low
number of categories, the shapes of the boundary regions can be rather complex.