
selected randomly from a distribution. It can be shown that the normal distribution has the maximum entropy of all distributions having a given mean and variance (Problem 20). Moreover, as stated by the Central Limit Theorem, the aggregate effect of a large number of small, independent random disturbances will lead to a Gaussian distribution (Computer exercise ??). Because many patterns — from fish to handwritten characters to some speech sounds — can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.

2.5.2 Multivariate Density


The general multivariate normal density in d dimensions is written as

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right],    (37)

where x is a d-component column vector, µ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| and Σ^{-1} are its determinant and inverse, respectively, and (x − µ)^t is the transpose of x − µ.∗ Our notation for the inner product is

a^t b = \sum_{i=1}^{d} a_i b_i,    (38)

often called a dot product.
For simplicity, we often abbreviate Eq. 37 as p(x) ∼ N (µ, Σ). Formally, we have

\mu \equiv E[x] = \int x \, p(x) \, dx    (39)

and

\Sigma \equiv E[(x - \mu)(x - \mu)^t] = \int (x - \mu)(x - \mu)^t \, p(x) \, dx,    (40)

where the expected value of a vector or a matrix is found by taking the expected values of its components. In other words, if x_i is the ith component of x, µ_i the ith component of µ, and σ_{ij} the ijth component of Σ, then

\mu_i = E[x_i]    (41)

and

\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)].    (42)
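Equations 37–42 translate directly into code. The following NumPy sketch is an illustration only (the mean and covariance used to generate samples are invented): it estimates µ and Σ from data and evaluates the density of Eq. 37.

```python
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    """Evaluate the d-dimensional normal density of Eq. 37 at a point x."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

# Estimate mu (Eq. 39) and Sigma (Eq. 40) from samples of an assumed distribution.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, 2.0])
true_sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
samples = rng.multivariate_normal(true_mu, true_sigma, size=5000)

mu_hat = samples.mean(axis=0)              # sample mean vector
sigma_hat = np.cov(samples, rowvar=False)  # sample covariance matrix
print(multivariate_normal_pdf(true_mu, mu_hat, sigma_hat))
```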


The covariance matrix Σ is always symmetric and positive semidefinite. We shall restrict our attention to the case in which Σ is positive definite, so that the determinant of Σ is strictly positive.† The diagonal elements σ_ii are the variances of the respective x_i (i.e., σ_i²), and the off-diagonal elements σ_ij are the covariances of x_i and x_j. We would expect a positive covariance for the length and weight features of a population of fish, for instance. If x_i and x_j are statistically independent, σ_ij = 0. If all the off-diagonal elements are zero, p(x) reduces to the product of the univariate normal densities for the components of x.

∗ The mathematical expressions for the multivariate normal density are greatly simplified by employing the concepts and notation of linear algebra. Readers who are unsure of our notation or who would like to review linear algebra should see Appendix ??.
† If sample vectors are drawn from a linear subspace, |Σ| = 0 and p(x) is degenerate. This occurs, for example, when one component of x has zero variance, or when two components are identical or multiples of one another.
Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed. In particular, if A is a d-by-k matrix and y = A^t x is a k-component vector, then p(y) ∼ N(A^t µ, A^t ΣA), as illustrated in Fig. 2.8. In the special case where k = 1 and A is a unit-length vector a, y = a^t x is a scalar that represents the projection of x onto a line in the direction of a; in that case a^t Σa is the variance of the projection of x onto a. In general, then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.
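To make this concrete, the short NumPy check below (illustrative only; µ, Σ, and the direction a are invented) verifies that the empirical variance of the projection y = a^t x matches a^t Σ a.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 3.0])
sigma = np.array([[3.0, 1.2],
                  [1.2, 2.0]])
a = np.array([2.0, 1.0])
a = a / np.linalg.norm(a)              # unit-length direction vector

x = rng.multivariate_normal(mu, sigma, size=100_000)
y = x @ a                              # projection y = a^t x for each sample

print("a^t Sigma a:", a @ sigma @ a)   # predicted variance of the projection
print("sample var :", y.var())         # empirical variance (approximately equal)
```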
It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix I. If we define Φ to be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues, then the transformation A_w = ΦΛ^{-1/2} applied to the coordinates ensures that the transformed distribution has covariance matrix equal to the identity matrix. In signal processing, the transform A_w is called a whitening transformation, since it makes the eigenvalue spectrum of the transformed distribution uniform.
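A minimal sketch of the whitening transform under an assumed covariance matrix: the eigendecomposition of Σ gives A_w = ΦΛ^{-1/2}, and applying it leaves the samples with (approximately) identity covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], sigma, size=100_000)

eigvals, phi = np.linalg.eigh(sigma)   # Lambda (eigenvalues) and Phi (orthonormal eigenvectors)
a_w = phi @ np.diag(eigvals ** -0.5)   # whitening transform A_w = Phi Lambda^{-1/2}

x_white = x @ a_w                      # y = A_w^t x applied to each sample (stored as rows)
print(np.cov(x_white, rowvar=False))   # approximately the identity matrix
```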
The multivariate normal density is completely specified by d + d(d + 1)/2 parameters — the elements of the mean vector µ and the independent elements of the covariance matrix Σ. Samples drawn from a normal population tend to fall in a single cloud or cluster (Fig. 2.9); the center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix. It follows from Eq. 37 that the loci of points of constant density are hyperellipsoids for which the quadratic form (x − µ)^t Σ^{-1}(x − µ) is constant. The principal axes of these hyperellipsoids are given by the eigenvectors of Σ (described by Φ); the eigenvalues (described by Λ) determine the lengths of these axes. The quantity

r^2 = (x - \mu)^t \Sigma^{-1} (x - \mu)    (43)

is sometimes called the squared Mahalanobis distance from x to µ. Thus, the contours of constant density are hyperellipsoids of constant Mahalanobis distance to µ, and the volume of these hyperellipsoids measures the scatter of the samples about the mean. It can be shown (Problems 15 & 16) that the volume of the hyperellipsoid corresponding to a Mahalanobis distance r is given by

V = V_d |\Sigma|^{1/2} r^d,    (44)

where V_d is the volume of a d-dimensional unit hypersphere:

V_d = \begin{cases} \pi^{d/2} / (d/2)! & d \text{ even} \\ 2^d \pi^{(d-1)/2} \left(\frac{d-1}{2}\right)! \, / \, d! & d \text{ odd.} \end{cases}    (45)

Thus, for a given dimensionality, the scatter of the samples varies directly with |Σ|^{1/2} (Problem 17).
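As an illustration with invented numbers, the squared Mahalanobis distance of Eq. 43 and the hyperellipsoid volume of Eq. 44 can be computed as follows for d = 2, where V_2 = π.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^{-1} (x - mu) of Eq. 43."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

mu = np.array([2.0, 1.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([3.5, 0.0])

r = np.sqrt(mahalanobis_sq(x, mu, sigma))

# Volume of the ellipse of Mahalanobis radius r for d = 2 (Eq. 44 with V_2 = pi)
volume = np.pi * np.sqrt(np.linalg.det(sigma)) * r ** 2
print(r ** 2, volume)
```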

Figure 2.8: The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into distribution N(A^t µ, A^t ΣA). Another linear transformation — a projection P onto line a — leads to N(µ, σ²) measured along a. While the transforms yield distributions in a different space, we show them superimposed on the original x_1–x_2 space. A whitening transform, A_w, leads to a circularly symmetric Gaussian, here shown displaced.

2.6 Discriminant Functions for the Normal Density


In Sect. 2.4.1 we saw that the minimum-error-rate classification can be achieved by use of the discriminant functions

g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i).    (46)

This expression can be readily evaluated if the densities p(x|ω_i) are multivariate normal, i.e., if p(x|ω_i) ∼ N(µ_i, Σ_i). In this case, then, from Eq. 37 we have

g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).    (47)

Let us examine the discriminant function and resulting classification for a number of special cases.
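Before turning to those special cases, Eq. 47 can be written out directly as code. The sketch below is an illustration rather than code from the text; the class means, covariances, and priors are whatever the caller supplies.

```python
import numpy as np

def gaussian_discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) of Eq. 47 for a class with p(x | omega_i) ~ N(mu_i, Sigma_i)."""
    d = mu_i.shape[0]
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.solve(sigma_i, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))

def classify(x, means, covs, priors):
    """Assign x to the category whose discriminant g_i(x) is largest."""
    scores = [gaussian_discriminant(x, m, s, p)
              for m, s, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```

The special cases that follow simply let terms of Eq. 47 drop out or simplify.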

2.6.1 Case 1: Σ_i = σ²I

The simplest case occurs when the features are statistically independent, and when each feature has the same variance, σ². In this case the covariance matrix is diagonal, being merely σ² times the identity matrix I. Geometrically, this corresponds to the situation in which the samples fall in equal-size hyperspherical clusters, the cluster for the ith class being centered about the mean vector µ_i. The computation of the determinant and the inverse of Σ_i is particularly easy: |Σ_i| = σ^{2d} and Σ_i^{-1} = (1/σ²)I. Since both |Σ_i| and the (d/2) ln 2π term in Eq. 47 are independent of i, they are unimportant additive constants that can be ignored. Thus we obtain the simple discriminant functions

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i),    (48)

where ‖·‖ is the Euclidean norm, that is,

\|x - \mu_i\|^2 = (x - \mu_i)^t (x - \mu_i).    (49)

Figure 2.9: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The red ellipses show lines of equal probability density of the Gaussian.

If the prior probabilities are not equal, then Eq. 48 shows that the squared distance ‖x − µ_i‖² must be normalized by the variance σ² and offset by adding ln P(ω_i); thus, if x is equally near two different mean vectors, the optimal decision will favor the a priori more likely category.

Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute distances. Expansion of the quadratic form (x − µ_i)^t(x − µ_i) yields

g_i(x) = -\frac{1}{2\sigma^2}\left[ x^t x - 2\mu_i^t x + \mu_i^t \mu_i \right] + \ln P(\omega_i),    (50)

which appears to be a quadratic function of x. However, the quadratic term x^t x is the same for all i, making it an ignorable additive constant. Thus, we obtain the equivalent linear discriminant functions

g_i(x) = w_i^t x + w_{i0},    (51)

where

w_i = \frac{1}{\sigma^2}\,\mu_i    (52)

and

w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^t \mu_i + \ln P(\omega_i).    (53)

Figure 2.10: If the covariances of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these 1-, 2-, and 3-dimensional examples, we indicate p(x|ω_i) and the boundaries for the case P(ω_1) = P(ω_2). In the 3-dimensional case, the grid plane separates R_1 from R_2.
We call w_{i0} the threshold or bias in the ith direction.
A classifier that uses linear discriminant functions is called a linear machine. This kind of classifier has many interesting theoretical properties, some of which will be discussed in detail in Chap. ??. At this point we merely note that the decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. For our particular case, this equation can be written as

w^t(x - x_0) = 0,    (54)

where

w = \mu_i - \mu_j    (55)

and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j).    (56)
This equation defines a hyperplane through the point x0 and orthogonal to the
vector w. Since w = µi − µj , the hyperplane separating Ri and Rj is orthogonal to
the line linking the means. If P (ωi ) = P (ωj ), the second term on the right of Eq. 56
vanishes, and thus the point x0 is halfway between the means, and the hyperplane is
the perpendicular bisector of the line between the means (Fig. 2.11). If P(ω_i) ≠ P(ω_j), the point x_0 shifts away from the more likely mean. Note, however, that if the variance σ² is small relative to the squared distance ‖µ_i − µ_j‖², then the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.

Figure 2.11: As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these 1-, 2- and 3-dimensional spherical Gaussian distributions.
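Returning to Eqs. 55 and 56, the boundary hyperplane for this case is easy to compute. The sketch below uses invented means, variance, and priors purely for illustration: with equal priors x_0 is the midpoint of the means, and a larger P(ω_i) moves x_0 away from µ_i.

```python
import numpy as np

def case1_boundary(mu_i, mu_j, sigma_sq, p_i, p_j):
    """Hyperplane w^t (x - x0) = 0 for Sigma_i = sigma^2 I (Eqs. 55 and 56)."""
    w = mu_i - mu_j                                          # Eq. 55
    shift = (sigma_sq / np.sum(w ** 2)) * np.log(p_i / p_j)
    x0 = 0.5 * (mu_i + mu_j) - shift * w                     # Eq. 56
    return w, x0

mu_i, mu_j = np.array([3.0, 2.0]), np.array([1.0, 0.0])
print(case1_boundary(mu_i, mu_j, 1.0, 0.5, 0.5))  # x0 is the midpoint of the means
print(case1_boundary(mu_i, mu_j, 1.0, 0.7, 0.3))  # x0 shifts away from the more likely mean
```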

If the prior probabilities P(ω_i) are the same for all c classes, then the ln P(ω_i) term becomes another unimportant additive constant that can be ignored. When this happens, the optimum decision rule can be stated very simply: to classify a feature vector x, measure the Euclidean distance ‖x − µ_i‖ from x to each of the c mean vectors, and assign x to the category of the nearest mean. Such a classifier is called a minimum distance classifier. If each mean vector is thought of as being an ideal prototype or template for patterns in its class, then this is essentially a template-matching procedure (Fig. 2.10), a technique we will consider again in Chap. ?? Sect. ?? on the nearest-neighbor algorithm.
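The minimum-distance rule itself takes only a few lines; the class means below are hypothetical and serve only to illustrate it.

```python
import numpy as np

def minimum_distance_classifier(x, means):
    """Assign x to the class of the nearest mean (equal priors, Sigma_i = sigma^2 I)."""
    distances = [np.linalg.norm(x - mu) for mu in means]
    return int(np.argmin(distances))

# Hypothetical prototype means for three categories
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
print(minimum_distance_classifier(np.array([2.5, 2.0]), means))  # index of the nearest mean
```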

2.6.2 Case 2: Σi = Σ
Another simple case arises when the covariance matrices for all of the classes are
identical but otherwise arbitrary. Geometrically, this corresponds to the situation in
which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster
for the ith class being centered about the mean vector µi . Since both |Σi | and the
(d/2) ln 2π term in Eq. 47 are independent of i, they can be ignored as superfluous
additive constants. This simplification leads to the discriminant functions
g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i).    (57)
If the prior probabilities P (ωi ) are the same for all c classes, then the ln P (ωi )
term can be ignored. In this case, the optimal decision rule can once again be stated
very simply: to classify a feature vector x, measure the squared Mahalanobis distance
(x − µi )t Σ−1 (x − µi ) from x to each of the c mean vectors, and assign x to the
category of the nearest mean. As before, unequal prior probabilities bias the decision
in favor of the a priori more likely category.
Expansion of the quadratic form (x − µ_i)^t Σ^{-1}(x − µ_i) results in a sum involving a quadratic term x^t Σ^{-1} x which here is independent of i. After this term is dropped from Eq. 57, the resulting discriminant functions are again linear:

g_i(x) = w_i^t x + w_{i0},    (58)

where

w_i = \Sigma^{-1}\mu_i    (59)

and

w_{i0} = -\frac{1}{2}\,\mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i).    (60)
Since the discriminants are linear, the resulting decision boundaries are again
hyperplanes (Fig. 2.10). If Ri and Rj are contiguous, the boundary between them
has the equation

w^t(x - x_0) = 0,    (61)

where

w = \Sigma^{-1}(\mu_i - \mu_j)    (62)

and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t\,\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j).    (63)
Since w = Σ−1 (µi −µj ) is generally not in the direction of µi −µj , the hyperplane
separating Ri and Rj is generally not orthogonal to the line between the means.
However, it does intersect that line at the point x0 which is halfway between the
means if the prior probabilities are equal. If the prior probabilities are not equal, the
optimal boundary hyperplane is shifted away from the more likely mean (Fig. 2.12).
As before, with sufficient bias the decision plane need not lie between the two mean
vectors.
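For this shared-covariance case, Eqs. 62 and 63 give the boundary directly. The sketch below uses an assumed Σ and invented means purely for illustration; note that the returned w is generally not parallel to µ_i − µ_j.

```python
import numpy as np

def case2_boundary(mu_i, mu_j, sigma, p_i, p_j):
    """Hyperplane parameters w and x0 for identical covariance matrices (Eqs. 62 and 63)."""
    diff = mu_i - mu_j
    w = np.linalg.solve(sigma, diff)         # w = Sigma^{-1} (mu_i - mu_j), Eq. 62
    shift = np.log(p_i / p_j) / (diff @ w)   # scalar factor in Eq. 63
    x0 = 0.5 * (mu_i + mu_j) - shift * diff
    return w, x0

sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
w, x0 = case2_boundary(np.array([2.0, 2.0]), np.array([0.0, 0.0]), sigma, 0.6, 0.4)
print(w, x0)   # w is not a multiple of mu_i - mu_j, so the hyperplane is tilted
```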

0.2

0.1

p p 0

P(ω1)=.5 P(ω1)=.9
5
P(ω2)=.5
P(ω2)=.1
-5
0
0

5
-5

7.5 10

P(ω2)=.5 5 7.5

2.5 5
P(ω2)=.1

2.5
0

P(ω1)=.5 0
-2.5
-2 -2 P(ω1)=.9
0 4 0 4
2 2 2
0 2 0
-2 -2

Figure 2.12: Probability densities (indicated by the surfaces in two dimensions and
ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric
Gaussian distributions. The decision hyperplanes need not be perpendicular to the
line connecting the means.

2.6.3 Case 3: Σi = arbitrary


In the general multivariate normal case, the covariance matrices are different for each category. The only term that can be dropped from Eq. 47 is the (d/2) ln 2π term, and the resulting discriminant functions are inherently quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0},    (64)

where

W_i = -\frac{1}{2}\,\Sigma_i^{-1},    (65)

w_i = \Sigma_i^{-1}\mu_i    (66)

and

w_{i0} = -\frac{1}{2}\,\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).    (67)

The decision surfaces are hyperquadrics, and can assume any of the general forms — hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids of various types (Problem 29). Even in one dimension, for arbitrary covariance the decision regions need not be simply connected (Fig. 2.13). The two- and three-dimensional examples in Figs. 2.14 & 2.15 indicate how these different forms can arise; the variances are indicated by the contours of constant probability density.

The extension of these results to more than two categories is straightforward, though we need to keep clear which two of the total c categories are responsible for any boundary segment. Figure 2.16 shows the decision surfaces for a four-category case made up of Gaussian distributions. Of course, if the distributions are more complicated, the decision regions can be even more complex, though the same underlying theory holds there too.
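The general quadratic discriminant of Eqs. 64–67 can also be coded directly; the two classes below are invented for illustration and differ only in their covariance matrices, which is already enough to produce a hyperquadric boundary.

```python
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for arbitrary Sigma_i (Eqs. 64-67)."""
    sigma_inv = np.linalg.inv(sigma_i)
    W_i = -0.5 * sigma_inv                                    # Eq. 65
    w_i = sigma_inv @ mu_i                                    # Eq. 66
    w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))                                # Eq. 67
    return x @ W_i @ x + w_i @ x + w_i0

means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2), np.array([[3.0, 0.0], [0.0, 0.5]])]
x = np.array([1.5, 1.0])
scores = [quadratic_discriminant(x, m, s, 0.5) for m, s in zip(means, covs)]
print(int(np.argmax(scores)))
```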

Figure 2.13: Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.

.04

.03 0.01

.02
0.005
.01
0 0
p
p

20 20
10 10

0 0 20
-10 10
0 0
-10 -10
10 -10
20 -20

0.05
0.03
0.04

0.02 0.03
0.02
0.01
0.01
p
0
p 0

10 10

-10 0 -10 0
0 0
10 -10 10 -10
20 20

0.01
0.15

0.1
0.005

0.05
pP 0
p 0

20 5
10
0 -5 0
20
-10 10 0
0 -5
-10 5

Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

2 6
10 4
0 40 2
0
0 10
20
0 -10 -2 5
2.5 5
5 2.5 -20 0 0 0
7.5 0 0 5
-2.5 20 10 -5

4 10
10

2
5

5
0
0
10
-2
0
10 -5
5
-4
5 10
-2
0 0
0 2 0 -5 5
5
4 0
10 6 5 0

1 1

3
0.5 2 0

1 3
1 5
0 0 -1 2
0.5
0 -1 1
0 0 0
0.5 -0.5 1 0
0 2
1 2
1.5 -1 4 3 -1
2 6 4

.5
Decis

0
ion b

5
ound

20
0
10
ary

-5 -.5
0

-10 -10
0 0
10 -20 .5 .2
1 0
1.5 -.2

Figure 2.15: Arbitrary three-dimensional Gaussian distributions yield Bayes decision boundaries that are two-dimensional hyperquadrics. There are even degenerate cases in which the decision boundary is a line.


Figure 2.16: The decision regions for four normal distributions. Even with such a low
number of categories, the shapes of the boundary regions can be rather complex.
