Lecture Notes On Pattern Recognition and Image Processing
Lecture Notes On Pattern Recognition and Image Processing
Lecture Notes On Pattern Recognition and Image Processing
Jonathan G. Campbell Department of Computing, Letterkenny Institute of Technology, Co. Donegal, Ireland. email: jonathan.campbell at gmail.com, [email protected] URL:http://www.jgcampbell.com/ Report No: jc/05/0005/r Revision 1.1 (minor typos corrected, and Fig. 4.3 added). 13th August 2005
Contents
1 Introduction 2 Simple classier algorithms 2.1 Thresholding for one-dimensional data . . . . . . . . . . . . . . . . . . 2.2 Linear separating lines/planes for two-dimensions . . . . . . . . . . . . 2.3 Nearest mean classier . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Normal form of the separating line, projections, and linear discriminants 2.5 Projection and linear discriminant . . . . . . . . . . . . . . . . . . . . 2.6 Projections and linear discriminants in p dimensions . . . . . . . . . . . 2.7 Template Matching and Discriminants . . . . . . . . . . . . . . . . . . 2.8 Nearest neighbour methods . . . . . . . . . . . . . . . . . . . . . . . . 3 Statistical methods 3.1 Bayes Rule for the Inversion of Conditional Probabilities . 3.2 Parametric Methods . . . . . . . . . . . . . . . . . . . . . 3.3 Discriminants based on Gaussian Density . . . . . . . . . 3.4 Bayes-Gauss Classier Special Cases . . . . . . . . . . 3.4.1 Equal and Diagonal Covariances . . . . . . . . . . 3.4.2 Equal but General Covariances . . . . . . . . . . . 3.5 Least square error trained classier . . . . . . . . . . . . . 3.6 Generalised linear discriminant function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 3 3 5 5 6 6 7 7 8 9 10 10 11 11 12 13 13 15 16 18 18 18 22 22 23
4 Neural networks 4.1 Neurons for Boolean Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Three-layer neural network for arbitrarily complex decision regions . . . . . . . . . . . . . . . . . . . . . . 4.3 Sigmoid activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A Principal Components Analysis and Fishers Linear Discriminant Analysis A.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2 Fishers Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 1
Introduction
This note is a reworking of some of the basic pattern recognition and neural network material covered in (Campbell & Murtagh 1998); it includes some material from (Campbell 2000). Chapter 3 contains more detail than other chapters; except for the rst page, this could be skipped on rst reading. We dene/summarize a pattern recognition system using the block diagram in Figure 1.1.
x sensed data
w (omega) class
Figure 1.1: Pattern recognition system; x a tuple of p measurements, output class label. In some cases, this might work; i.e. we collect the pixel values of a sub-image that contains the object of interest and place them in a p-tuple x and leave everything to the classier algorithm. However, Figure 1.2 shows a more realistic setup. In this arrangement still somewhat idealised, for example, we assume that we have somehow isolated the object in a (sub)image we segment the image into object pixels (value 1, say) and background pixels (value 0). In such a setup we can do all the problem specic processing in the rst two stages, and pass the feature vector (in general p-dimensional) to a general purpose classier.
raw image
segment
binary image
feature extractor
tuple
classifier
features (*) General pattern recognition * For example, in the trivial case of identifying 0 and 1 we could have a feature vector (x1 = width, x2 = ink area). ++ class = 0, 1; in a true OCR problem (AZ;09), class = 1..36.
Figure 1.2: Image pattern recognition problem separated. Pattern recognition (classication) may be posed as an inference problem. The inference involves class labels, that is we have a set of examples (training data), XT = {xi , i }n . x is the pattern vector of course, we freely admit that in certain i=1 situations x is a simple scalar. is the class label, = {1 , . . . , c }; given an unseen pattern x, we infer . In general, x = (x0 x1 . . . x p1 )T , a p-dimensional vector; T denotes transposition. From now on we concentrate on classier algorithms, and we refer only occasionally to concrete examples.
Chapter 2
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 3 T 4 5 6 x1
Figure 2.1: Width data, x1 . Let us see whether we can recognise using width alone (x1 . Figure 2.1 shows some (training) data that has been collected. We see that a threshold (T) set at about x1 = 2.8 is the best we can do; the classication algorithm is:
= 1 when x1 T, = 0 otherwise. Use of histograms, see Figure 2.2 might be a more methodical way of determining the threshold, T .
(2.1) (2.2)
If enough training data were available, n , the histograms, h 0 (x1 ), h1 (x1 ), properly normalised would approach probability densities: p0 (x1 ), p1 (x1 ), more properly called class conditional probability densities: p(x 1 | ), = 0, 1, see Figure 2.3. When the feature vector is three-dimensional (p = 3) or more, it becomes impossible to estimate the probability densities using histogram binning there are a great many bins, and most of them contain no data.
freq.
h1(x1) h0(x1)
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 3 T 4 5 6 x1
p(x1 | 1) p(x1 | 0)
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 3 T 4 5 6 x1
x2 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 x1 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 2.4: Two dimensions, scatter plot. The dotted line shows that the data are separable by a straight line; it intercepts the axes at x 1 = 4.5 and x2 = 6. Apart from plotting the data and drawing the line, how could we derive it from the data? (Thinking of a computer program.)
|x 0|2
= |x 1|2 ,
(2.3) (2.4)
(2.5)
which is of the form b1 x1 + b2x2 b0 = 0. In Figure 2.5, 01 = 4, 02 = 3, 11 = 2, 12 = 1.5; with these values, eqn 2.6 becomes 4x1 + 3x2 18.75 = 0, which intercepts the x1 axis at 18.75/4 4.7 and the x2 axis at 18.75/3 = 6.25. (2.7) (2.6)
x2 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 x1 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 2.5: Two dimensional scatter plot showing means and separating line.
2.4 Normal form of the separating line, projections, and linear discriminants
Eqn 2.6 becomes more interesting and useful in its normal form, a1 x1 + a2x2 a0 = 0, where a2 + a2 = 1; eqn 2.8 can be obtained from eqn 2.6 by dividing across by 1 2 b2 + b2 . 1 2 (2.8)
Figure 2.6 shows interpretations of the normal form straight line equation, eqn 2.8. The coefcients of the unit vector normal to the line are n = (a1 a2 )T and a0 is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of n = (a1 a2 )T = (cos sin a2 )T . Thus, (Foley, van Dam, Feiner, Hughes & Phillips 1994) n corresponds to one row of a (frame) rotating matrix; in other words, see below, section 2.5, dot product of the vector expression of a point with n, corresponds to projection onto n. (Note that cos /2 = sin .) Also as shown in Figure 2.6, points x = (x1 x2 )T on the side of the line to which n = (a1 a2 )T points have a1 x1 + a2 x2 a0 > 0, whilst points on the other side have a1 x1 + a2x2 a0 < 0; as we know, points on the line have a1 x1 + a2x2 a0 = 0.
= 0 when g(x) > a0 , = 1, g(x) < a0 , = tie, g(x) = a0 . Linear discriminants, eqn. 2.12, are often written as 6
a0/a2
x1
(2.13)
Chapter 3
Statistical methods
Recall Figure 2.3, repeated here as Figure 3.1.
p(x1 | 1) p(x1 | 0)
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 3 T 4 5 6 x1
Figure 3.1: Class conditional densities. We have class conditional probability densities: p(x1 | ), = 0, 1; given a newly arrived x1 we might decide on its class according to the maximum class conditional probability density at x 1 , i.e. set a threshold T where p(x1 | 0) and p(x1 | 1) cross, see Figure 3.1. This is not completely correct. What we want is the probability of each class (its posterior probability) based on the evidence supplied by the data, combined with any prior evidence. In what follows, P( | x) is the posterior probability or a posteriori probability of class given the observation x; P(w) is the prior probability or a priori probability. We use upper case P(.) for discrete probabilities, whilst lower case p(.) denotes probability densities. Let the Bayes decision rule be represented by a function g(.) of the feature vector x: g(x) = arg maxw j [P( j | x)] (3.1)
To show that the Bayes decision rule, eqn. 3.1, achieves the minimum probability of error, we compute the probability of error conditional on the feature vector x the conditional risk associated with it: R(g(x) = j | x) =
k=1,k= j
P(k | x).
(3.2)
That is to say, for the point x we compute the posterior probabilities of all the c 1 classes not chosen. Since = {1 , . . . , c } form an event space (they are mutually exclusive and exhaustive) and the P( k | x)c are probabilik=1 ties and so sum to unity, eqn. 3.2 reduces to: R(g(x) = j ) = 1 P( j | x). (3.3)
It immediately follows that, to minimise R(g(x) = j ), we maximise P( j | x), thus establishing the optimality of eqn. 3.1. The problem now is to determine P( | x) which brings us to Bayes rule.
Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes rule for the posterior probability P( | x): p(x | )P() . (3.6) P( | x) = p(x) P() expresses our belief that will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: P(1 ) = P(2 ) . . . = P(c ), c P( j ) = 1. Although we avoid further discussion here, we note that j=1 the matter of choice of prior probabilities is the subject of considerable discussion especially in the literature on Bayesian inference, see, for example, (Sivia 1996). p(x) is the unconditional probability density of x, and can be obtained by summing the conditional densities: p(x) =
j=1 c
p(x | j )P( j ).
(3.7)
p(x | j ) is completely specied by j , the p-dimensional mean vector, and K j the corresponding p p covariance matrix: j = E[x]= j , K j = E[(x j )(x j )T ]= j . The respective maximum likelihood estimates are: j = and, Kj =
N
(3.9) (3.10)
1 Nj
n=1
xn ,
Nj
(3.11)
where we have separated the training data XT = {xn , n }N into sets according to class. n=1 10
(3.12)
so that classication, eqn. 3.1, becomes a matter of assigning x to class w j if, g j (x) > gk (x), k = j. (3.14)
Since p(x), the denominator of eqn. 3.13 is the same for all g j (x) and since eqn. 3.14 involves comparison only, we may rewrite eqn. 3.13 as g j (x) = p(x | j )P( j ). (3.15) We may derive a further possible discriminant by taking the logarithm of eqn. 3.15 since logarithm is a monotonically increasing function, application of it preserves relative order of its arguments: g j (x) = log p(x | j ) + log P( j ). In the multivariate Gaussian case, eqn. 3.16 becomes (Duda & Hart 1973), p 1 1 g j (x) = (x j )T K1 (x j ) log2 log | K j | +logP( j ) j 2 2 2 Henceforth, we refer to eqn. 3.17 as the Bayes-Gauss classier. The multivariate Gaussian density provides a good characterisation of pattern (vector) distribution where we can model the generation of patterns as ideal pattern plus measurement noise; for an instance of a measured vector x from class j : xn = j + e n , where en N(0, K j ), that is, the noise covariance is class dependent. (3.18) (3.17) (3.16)
g j (x) = =
1 (x j )T (x j ) + logP( j ) 22 1 x j ) 2 + logP( j ). 22
(3.19) (3.20)
Nearest mean classier If we assume equal prior probabilities P( j ), the second term in eqn. 3.20 may be eliminated for comparison purposes and we are left with a nearest mean classier. Linear discriminant If we further expand the squared distance term, we have, g j (x) = which may be rewritten as a linear discriminant: g j (x) = w j0 + wT x j where w j0 = and wj = 1 (T j ) + logP( j ), 22 j 1 j. 2 (3.22) (3.23) (3.24) 1 (xT x 2T x + T j ) + logP( j ), j j 22 (3.21)
Template matching In this latter form the Bayes-Gauss classier may be seen to be performing template matching or correlation matching, where w j = constant j , that is, the prototypical pattern for class j, the mean j , is the template. Matched lter In radar and communications systems a matched lter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector x is written as a time series (a digital signal), x[n], n = 0, 1, . . . then the matched lter for each signal j may be implemented as a convolution: y j [n] = x[n] h[n] =
N1
m=0
x[n m] h j [m],
(3.25)
where the kernel h[.] is a time reversed template that is, at each time instant, the correlation between h[.] and the last N samples of x[.] are computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected. Single Layer Neural Network If we restrict the problem to two classes, we can write the classication rule as: g(x) = g1 (x) g2(x) 0 : 1 , otherwise 2 = w0 + wT x,
1 where w0 = 22 (T 1 T 2 ) + log P(1) 1 2 P( )
2
(3.26) (3.27)
and w =
1 ( 2 ). 2 1
In other words, eqn. 3.27 implements a linear combination, adds a bias, and thresholds the result that is, a single layer neural network with a hard-limit activation function. (Duda & Hart 1973) further demonstrate that eqn. 3.20 implements a hyper-plane partitioning of the feature space.
12
Nearest Mean Classier, Mahalanobis Distance If we have equal prior probabilities P( j ), we arrive at a nearest mean classier where the distance calculation is weighted. The Mahalanobis distance (x j )T K1 (x j ) effectively weights j contributions according to inverse variance. Points of equal Mahalanobis distance correspond to points of equal conditional density p(x | j ). Linear Discriminant Eqn. 3.28 may be rewritten as a linear discriminant, see also section 2.5: g j (x) = w j0 + wT x j where and 1 w j0 = (T K1 j ) + logP( j ), 2 j w j = K1 j . (3.29) (3.30) (3.31)
Weighted template matching, matched lter In this latter form the Bayes-Gauss classier may be seen to be performing weighted template matching. Single Layer Neural Network As for the diagonal covariance matrix, it can be easily demonstrated that, for two classes, eqns. 3.29 3.31 may be implemented by a single neuron. The only difference from eqn. 3.27 is that the non-bias weights, instead of being simple a difference between means, is now weighted by the inverse of the covariance matrix.
where d = (d1 , d2 , . . . dc )T is the c-dimensional vector of class indicators and x, as usual, the p-dimensional feature vector. We can express individual class membership indicators as: d j = b0 + bi xi + e.
i=1 p
(3.33)
the
analysis,
we
recall
the
theory
of
linear
regression
Linear Regression The simplest linear model, y = mx + c, of school mathematics, is given by: y = b0 + b1x + e, (3.34)
which shows the dependence of the dependent variable y on the independent variable x. In other words, y is a linear function of x and the observation is subject to noise, e; e is assumed to be a zero-mean random process. Strictly eqn. 3.34 is afne, since b0 is included, but common usage dictates the use of linear. Taking the nth observation of (x, y), we have (Beck & Arnold 1977, p. 133): yn = b 0 + b 1 xn + e n (3.35) 13
Least square error estimators for b0 and b1 , b0 and b1 may be obtained from a set of paired observations {xn , yn }N by n=1 minimising the sum of squared residuals: S=
n=1 N 2 rn = (yn yn)2 n=1 N N
(3.36) (3.37)
S=
n=1
(yn b0 b1xn )2
Minimising with respect to b0 and b1 , and replacing these with their estimators, b0 and b1 , gives the familiar result: b1 = N[ yn xn ( yi )( xi )]/[N( x2 ) ( xi )2 ] i b 1 xn yn xn b0 = N N (3.38) (3.39)
The validity of these estimates does not depend on the distribution of the errors e n ; that is, assumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures, including eqns. 3.38 and 3.39, assume the x n to be error free, and that the error en is associated with yn . In the case where y, still one-dimensional, is a function of many independent variables p in our usual formulation of p-dimensional feature vectors eqn. 3.35 becomes: yn = b0 + bi xin + en
i=1 p
(3.40)
where xin is the i-th component of the n-th feature vector. Eqn. 3.40 can be written compactly as: )T yn = x T b + e n n (3.41)
where b = (b0 , b1 , . . . , b p is a p + 1 dimensional vector of coefcients, and xn = (1, x1n , x2n , . . . , x pn ) is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefcient b 0 , that is it is the so called bias term of neural networks, see sections 2.5 and 4. All N observation equations may now be collected together: y = Xb + e (3.42)
where y = (y1 , y2 , . . . , yn , . . . , yN )T is the N 1 vector of observations of the dependent variable, and e = (e1 , e2 , . . . , en , . . . , eN )T . X is the N p + 1 matrix formed by N rows of p + 1 independent variables. Now, the sum of squared residuals, eqn. 3.36, becomes: S = (y Xb)T . (3.43)
Minimising with respect to b just as eqn. 3.36 was minimised with respect to b 0 and b1 leads to a solution for b (Beck & Arnold 1977, p. 235): b = (XT X)1 XT y. (3.44) The jk-th element of the (p + 1) (p + 1) matrix XT X is N xn j xnk , in other words, just N the jk-th element of the n=1 autocorrelation matrix, R, of the vector of independent variables x estimated from the N sample vectors. If, as in the least-square-error classier, we have multiple dependent variables (y), in this case, c of them, we can replace y in eqn. 3.44 with an appropriate matrix N c matrix Y formed by N rows each of c observations. Now, eqn. 3.44 becomes: B = (XT X)1 XT Y (3.45)
XT Y is a p + 1 c matrix, and B is a (p + 1) c matrix of coefcients that is, one column of p + 1 coefcients for each class. Eqn. 3.45 denes the training algorithm of our classier.
14
Application of the classier to a feature vector x may be expressed as: y = Bx. It remains to nd the maximum of the c components of y. In a two-class case, this least-square-error training algorithm yields an identical discriminant to Fishers linear discriminant (Duda & Hart 1973). Fishers linear discriminant is described in section A.2. Eqn. 3.45 has one signicant weakness: it depends on the condition of the matrix X T X. As with any autocorrelation or auto-covariance matrix, this cannot be guaranteed; for example, linearly dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 3.45 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky & Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training procedure (Widrow & Lehr 1990) developed in the early 1960s tackles the problem in a different manner. (3.46)
In the pattern recognition literature, the solution of eqn. 3.47 involving now the vector x is called the generalised linear discriminant function (Duda & Hart 1973, Nilsson 1965). It is desirable to escape from the xed model of eqn. 3.47: the form of the f k (x) must be known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have already shown the correspondence between the linear model, eqn. 3.41, and a single layer neural network with a single output node and linear activation function. An MLP with appropriate non-linear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to learning the mapping between x and y (Kosko 1992, Hecht-Nielsen 1990, Haykin 1999).
15
Chapter 4
Neural networks
Here we show that a single neuron implements a linear discriminant (and hence also implements a separating hyperplane). Then we proceed to a discussion which indicates that a neural network comprising three processing layers can implement any arbitrarily complex decision region. Recall eqn. 2.12, with ai wi , and now (arbitrarily) allocating discriminant value zero to class 0, g(x) = wi xi w0
i=1 p
0, = 0 > 0, = 1.
(4.1)
Figure 4.1 shows a single articial neuron which implements precisely eqn. 4.1.
+1 (bias) w0 x1 x2 . . . xp w1 w2 wp
Figure 4.1: Single neuron. The signal ows into the neuron (circle) are weighted; the neuron receives w i xi . The neuron sums and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid activation function (softer transition) instead of the hard limit. The threshold term in the linear discriminant (a0 in eqn. 2.13) is provided by w0 +1. Another interpretation of bias, useful in mathematical analysis of neural networks, see section 3.5, is to represent it by a constant component, +1, as the zeroth component of the augmented feature vector. Just to reemphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the twodimensional case, 0, = 0 w1 x 1 + w 2 x 2 w 0 (4.2) > 0, = 1. The boundary between classes, given by w1 x1 + w2 x2 w0 = 0, is a straight line with x1 -axis intercept at w0 /w1 and x2 -axis intercept at w0 /w2 , see Figure 4.2. 16
x2 w0/w2
w1/w0
x1
17
x2 1
0 0 AND
x2 1
1 1 OR 1 x1
x2 1
0 1 XOR 1 x1
0 1 x1
Figure 4.3: AND, OR, XOR. It is noted that XOR cannot be implemented by a single neuron; in fact it required two layers. Two layer were a big problem in the rst wave of neural network research in the 1960s, when it was not known how to train more than one layer.
x2 4 3 2 1
d21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d24 1 1 1 1 1 1
0 0 0 0 0 d23 1 1 1 1 1 1 0 0 0 0
d22
19
+1 (bias) d11
x1 x2 . . . xp
+1 d12
+1 d13 AND . . .
+1 d14 OR d21
class
. . . d24
. . . AND
Figure 4.5: Three-layer neural network implementing an arbitrarily complex decision region.
20
Bibliography
Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley & Sons, New York. Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis, University of Ulster. Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report, Computer Science, Queens University Belfast. available at: http://www.jgcampbell.com/ip. Duda, R. & Hart, P. (1973). Pattern Classication and Scene Analysis, Wiley-Interscience, New York. Duda, R., Hart, P. & Stork, D. (2000). Pattern Classication, Wiley-Interscience. Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7: 179188. Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer Graphics, Addison Wesley. Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press, Baltimore. Haykin, S. (1999). Neural Networks - a comprehensive foundation, 2nd edn, Prentice Hall. Hecht-Nielsen, R. (1990). Neurocomputing, Addison-Wesley, Redwood City, CA. Kosko, B. (1992). Neural Networks and Fuzzy Systems. A Dynamical Systems Approach to Machine Intelligence, Prentice Hall, Englewood Cliffs. Nilsson, N. (1965). Learning Machines, McGraw-Hill, New York. Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn, Cambridge University Press, Cambridge, UK. Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K. Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag. Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 14151442.
21
Appendix A
Principal component analysis (PCA), also called Karhunen-Lo` ve transform (Duda, Hart & Stork 2000) is a linear transfore mation which maps a p-dimensional feature vector x R p to another vector y R p where the transformation is optimised such that the components of y contain maximum information in a least-square-error sense. In other words, if we take the rst r p components (y Rq ), then using the inverse transformation, we can reproduce x with minimum error. Yet another view is that the rst few components of y contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the rst two principal components give an optimum view of the spread of the data. We note however, unlike linear discriminant analysis, see section A.2, PCA does not take account of class labels. Hence it is typically a more useful visualisation of the inherent variability of the data. In general x can be represented, without error, by the following expansion: x = Uy = yi ui
i=1 p
(A.1)
where yi is the ith component of y and where U = (u1 , u2 , . . . , up ) is an orthonormal matrix: ut uk = jk = 1, when i = k; otherwise = 0. j (A.3) (A.4) (A.2)
(A.5)
22
The optimum transformation matrix U turns out to be the eigenvector matrix of the sample covariance matrix C: C= where A is the N p sample matrix. UCUt = , the diagonal matrix of eigenvalues. (A.8) 1 t A A, N (A.7)
A.2
In contrast with PCA (see section A.1), linear discriminant analysis (LDA) transforms the data to provide optimal class separability (Duda et al. 2000) (Fisher 1936). Fishers original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant u (a p-dimensional vector of weights the weights are very similar to the weights used in neural networks) which, via a dot product, maps a feature vector x to a scalar, y = ut x. (A.9)
u is optimised to maximise simultaneously, (a) the separability of the classes (between-class separability), and (b) the clustering together of same class data (within-class clustering). Mathematically, this criterion can be expressed as: J(u) = where SB is the between-class covariance, ut SB u . ut SW u (A.10)
SB = (m1 m2 )(m1 m2 )t , and Sw = C 1 + C 2 , the sum of the class-conditional covariance matrices, see section A.1. m1 and m2 are the class means. u is now computed as: u = S1 m1 m2 . w
(A.11)
(A.12)
(A.13)
There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly extensions from two-class to multi-class data. In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal to the rst, which optimises the separability and clustering criteria, subject to the orthogonality constraint. The second dimension/discriminant is useful to allow the data to be view as a two-dimensional scatter plot.
23