

A Tutorial on Principal Component Analysis (PCA) and Multiple Discriminant Analysis (MDA)

Dr. Rahul Das Gupta
1. Covariance Matrix
Given a set of feature vectors $\{X^{(k)}\}_{k=1}^{N} \subset \mathbb{R}^{n}$,

where
$X^{(k)} = k$-th feature vector $= \left[x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)}\right]^{T} \in \mathbb{R}^{n}$,
$x_i^{(k)} = i$-th component of the $k$-th feature vector,
$n$ = dimension of each feature vector,
$N$ = number of feature vectors.

Mean vector:
$$\mu = E(X) = \frac{1}{N}\sum_{k=1}^{N} X^{(k)}, \qquad \text{where } \mu = \left[\mu_1, \mu_2, \ldots, \mu_n\right]^{T} \in \mathbb{R}^{n}.$$

Covariance matrix:
$$\Sigma = E\!\left[(X-\mu)(X-\mu)^{T}\right] = \frac{1}{N}\sum_{k=1}^{N}\left(X^{(k)}-\mu\right)\left(X^{(k)}-\mu\right)^{T}$$

$$\Sigma = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n}\\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
\end{bmatrix} \in \mathbb{R}^{n \times n},$$

where $\sigma_{ij}$ = covariance between the $i$-th and $j$-th components of the sample feature vectors,
$$\sigma_{ij} = \frac{1}{N}\sum_{k=1}^{N}\left(x_i^{(k)}-\mu_i\right)\left(x_j^{(k)}-\mu_j\right), \qquad \sigma_{ji} = \sigma_{ij},$$
and $\sigma_{ii}$ = variance of the $i$-th component of the feature vectors,
$$\sigma_{ii} = \frac{1}{N}\sum_{k=1}^{N}\left(x_i^{(k)}-\mu_i\right)^{2}.$$
2. Dimension Reduction Procedure using KL-Transform:

Step #1: Computation of the mean vector $\mu \in \mathbb{R}^{n}$ and the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ for the given set of feature vectors $\{X^{(k)}\}_{k=1}^{N} \subset \mathbb{R}^{n}$. Here $N$ = number of feature vectors, $n$ = dimension of each feature vector, and $X^{(k)}$, $x_i^{(k)}$, $\mu$, $\Sigma$, $\sigma_{ij}$ are exactly as defined in Section 1:
$$\mu = E(X) = \frac{1}{N}\sum_{k=1}^{N} X^{(k)}, \qquad
\Sigma = E\!\left[(X-\mu)(X-\mu)^{T}\right] = \frac{1}{N}\sum_{k=1}^{N}\left(X^{(k)}-\mu\right)\left(X^{(k)}-\mu\right)^{T}.$$
Step #2: Determination of the eigenvalues of the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ in descending order.

The eigenvalues of the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ arranged in descending order are the following:
$$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n,$$
where $\Sigma \hat{e}_i = \lambda_i \hat{e}_i$ for all $i = 1, 2, \ldots, n$, and $\hat{e}_i$ = unit eigenvector of the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ corresponding to eigenvalue $\lambda_i$.

Step #3: Determination of the sequence of unit eigenvectors $\hat{e}_1, \hat{e}_2, \hat{e}_3, \ldots, \hat{e}_n$ corresponding to the descending sequence of eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$, where $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n$.
Step #4: In order to reduce the dimension of each feature vector $X^{(k)}$ from $n$ to $l$, where $n \gg l$, the vector $\left(X^{(k)} - \mu\right)$ is projected onto the unit eigenvectors $\hat{e}_1, \hat{e}_2, \hat{e}_3, \ldots, \hat{e}_l$.

The feature vector of reduced dimension will be as follows:
$$Y^{(k)} = \begin{bmatrix}
(X^{(k)}-\mu)^{T}\hat{e}_1\\
(X^{(k)}-\mu)^{T}\hat{e}_2\\
(X^{(k)}-\mu)^{T}\hat{e}_3\\
\vdots\\
(X^{(k)}-\mu)^{T}\hat{e}_l
\end{bmatrix} \in \mathbb{R}^{l}.$$
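Putting Steps #1 to #4 together, a possible end-to-end sketch of the KL-transform reduction (the function name `kl_transform` and the data layout, one feature vector per row, are my own choices):

```python
import numpy as np

def kl_transform(X, l):
    """Reduce N feature vectors in R^n to R^l by projecting (X^(k) - mu)
    onto the l unit eigenvectors of Sigma with the largest eigenvalues."""
    N = X.shape[0]
    mu = X.mean(axis=0)                              # Step #1: mean vector
    D = X - mu
    Sigma = (D.T @ D) / N                            # Step #1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # Steps #2-#3 (ascending order)
    E_l = eigvecs[:, np.argsort(eigvals)[::-1][:l]]  # top-l unit eigenvectors (n x l)
    return D @ E_l                                   # Step #4: row k is Y^(k)

# Example: 100 vectors in R^10 reduced to R^3.
Y = kl_transform(np.random.default_rng(1).normal(size=(100, 10)), l=3)
print(Y.shape)   # (100, 3)
```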


Theorem:
The best possible one-dimensional representations of the set of n-dimensional feature vectors can be obtained by projecting the difference of the given vector and the mean vector onto the unit principal eigenvector (the unit eigenvector corresponding to the largest eigenvalue) of the covariance matrix ($\Sigma$) of the original feature vectors.

Proof:

Let the one-dimensional representation of $X^{(k)}$ be
$$Y^{(k)} = \mu + t^{(k)}\hat{e} \;\in \mathbb{R}^{n}.$$

Total error:
$$E = \sum_{k=1}^{N}\left\|X^{(k)} - Y^{(k)}\right\|^{2} = \sum_{k=1}^{N}\left\|X^{(k)} - (\mu + t^{(k)}\hat{e})\right\|^{2} = \sum_{k=1}^{N}\left\|(X^{(k)} - \mu) - t^{(k)}\hat{e}\right\|^{2}$$
$$= \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} + \sum_{k=1}^{N}\left(t^{(k)}\right)^{2}\left\|\hat{e}\right\|^{2} - 2\sum_{k=1}^{N} t^{(k)}\hat{e}^{T}\left(X^{(k)} - \mu\right)$$
$$= \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} + \sum_{k=1}^{N}\left(t^{(k)}\right)^{2} - 2\sum_{k=1}^{N} t^{(k)}\hat{e}^{T}\left(X^{(k)} - \mu\right) \qquad \ldots(1)$$
since $\hat{e}$ is a unit vector, i.e. $\|\hat{e}\| = 1$.

$$\frac{\partial E}{\partial t^{(k)}} = \frac{\partial}{\partial t^{(k)}}\left[\sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} + \sum_{k=1}^{N}\left(t^{(k)}\right)^{2} - 2\sum_{k=1}^{N} t^{(k)}\hat{e}^{T}\left(X^{(k)} - \mu\right)\right] = 2t^{(k)} - 2\hat{e}^{T}\left(X^{(k)} - \mu\right).$$

In order to minimise the total error $E$ we must have
$$\frac{\partial E}{\partial t^{(k)}} = 0 \;\Rightarrow\; t^{(k)} = \hat{e}^{T}\left(X^{(k)} - \mu\right).$$
Hence,
$$t^{(k)} = \hat{e}^{T}\left(X^{(k)} - \mu\right) = \left(X^{(k)} - \mu\right)^{T}\hat{e} \qquad \ldots(2)$$

Now from equations (1) and (2) we have
$$E = \sum_{k=1}^{N}\left\|X^{(k)} - Y^{(k)}\right\|^{2} = \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} - \sum_{k=1}^{N}\left(t^{(k)}\right)^{2} \qquad \ldots(3)$$

From (2) and (3) we get
$$E = \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} - \sum_{k=1}^{N}\hat{e}^{T}\left(X^{(k)} - \mu\right)\left(X^{(k)} - \mu\right)^{T}\hat{e}
= \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} - \hat{e}^{T}\left[\sum_{k=1}^{N}\left(X^{(k)} - \mu\right)\left(X^{(k)} - \mu\right)^{T}\right]\hat{e}$$
$$= \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} - \hat{e}^{T}(N\Sigma)\hat{e}
\qquad \left(\because \text{covariance matrix } \Sigma = \frac{1}{N}\sum_{k=1}^{N}\left(X^{(k)} - \mu\right)\left(X^{(k)} - \mu\right)^{T}\right)$$
$$= \sum_{k=1}^{N}\left\|X^{(k)} - \mu\right\|^{2} - N\left(\hat{e}^{T}\Sigma\hat{e}\right) \qquad \ldots(4)$$

The first part of equation (4), $\sum_{k=1}^{N}\|X^{(k)} - \mu\|^{2} \ge 0$, does not depend on $\hat{e}$, so the optimisation problem takes the following form:

Minimisation of $E$ $\;\equiv\;$ Maximisation of $\hat{e}^{T}\Sigma\hat{e}$ subject to the constraint $\|\hat{e}\| = 1$, i.e. $\hat{e}^{T}\hat{e} = 1$.

Hence, the Lagrangian will be the following:
$$L(\hat{e}) = \hat{e}^{T}\Sigma\hat{e} - \lambda\left(\hat{e}^{T}\hat{e} - 1\right).$$

In order to maximise $L(\hat{e})$ with respect to the unit vector $\hat{e}$ we must have
$$\nabla_{\hat{e}} L(\hat{e}) = 0 \;(\text{null vector}).$$
Now
$$\nabla_{\hat{e}} L(\hat{e}) = \nabla_{\hat{e}}\left[\hat{e}^{T}\Sigma\hat{e} - \lambda\left(\hat{e}^{T}\hat{e} - 1\right)\right] = 2\Sigma\hat{e} - 2\lambda\hat{e}.$$
Hence,
$$\nabla_{\hat{e}} L(\hat{e}) = 0 \;(\text{null vector}) \;\Rightarrow\; \Sigma\hat{e} = \lambda\hat{e}.$$

Thus, $\lambda$ and $\hat{e}$ are an eigenvalue and eigenvector pair of the covariance matrix $\Sigma$.

Now,
$$\hat{e}^{T}\Sigma\hat{e} = \hat{e}^{T}(\Sigma\hat{e}) = \hat{e}^{T}(\lambda\hat{e}) = \lambda\left(\hat{e}^{T}\hat{e}\right) = \lambda \qquad \left[\because \Sigma\hat{e} = \lambda\hat{e}\right].$$

Hence $\max_{\hat{e}}\;\hat{e}^{T}\Sigma\hat{e} = \lambda_{\max}$. We have to select the maximum eigenvalue ($\lambda_{\max}$) of the covariance matrix $\Sigma$. Thus $\hat{e}$ will be the eigenvector of the covariance matrix $\Sigma$ corresponding to the maximum eigenvalue ($\lambda_{\max}$).

Thus, the best possible one-dimensional representation of the n-dimensional feature vector $X^{(k)} = \left[x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)}\right]^{T} \in \mathbb{R}^{n}$ will be as follows:
$$t^{(k)} = \hat{e}^{T}\left(X^{(k)} - \mu\right) = \left(X^{(k)} - \mu\right)^{T}\hat{e}.$$
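As a quick numerical sanity check of the theorem (my own addition, not part of the original text), the total error $E$ along the principal eigenvector can be compared against the error along a few random unit directions; the principal direction should never do worse. All data and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 1.0, 0.5, 0.2])   # anisotropic toy data
mu = X.mean(axis=0)
D = X - mu
Sigma = (D.T @ D) / X.shape[0]

def total_error(e_hat):
    """E = sum_k ||X^(k) - (mu + t^(k) e_hat)||^2 with t^(k) = (X^(k) - mu)^T e_hat."""
    t = D @ e_hat
    return np.sum((D - np.outer(t, e_hat)) ** 2)

eigvals, eigvecs = np.linalg.eigh(Sigma)
e_principal = eigvecs[:, -1]           # unit eigenvector of the largest eigenvalue
for _ in range(5):
    r = rng.normal(size=4)
    assert total_error(e_principal) <= total_error(r / np.linalg.norm(r)) + 1e-9
```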
3. Multiple Discriminant Analysis (MDA):
Motivation
Principal Component Analysis (PCA)
does not take class label information
into account. PCA may remove the
components which are discriminatory.
Example
If PCA is applied to a set of feature vectors representing the two types of characters 'O' and 'Q', then the tail portion of 'Q' will be removed, which is the discriminatory component between 'O' and 'Q'. Hence, for classification-based applications the discriminatory components must be preserved after dimensionality reduction.
 
Training set:
$$T = \left\{\left(X^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$$
where
$N$ = number of training samples,
$X^{(i)}$ = feature vector of the $i$-th sample $\in \mathbb{R}^{n}$,
$y^{(i)}$ = class label of the $i$-th sample,
$k$ = number of classes.

Let us denote these classes as $C_1, C_2, C_3, \ldots, C_k$ respectively.
$N_i$ = number of training samples belonging to class $C_i$,
$N$ = total number of training samples $= \sum_{i=1}^{k} N_i$.

$$m_i = \text{centre of the } i\text{-th class} = \frac{1}{N_i}\sum_{X \in C_i} X \;\in \mathbb{R}^{n},$$
$$m = \text{grand mean} = \frac{1}{N}\sum_{i=1}^{k} N_i m_i \;\in \mathbb{R}^{n}.$$
4. Dimension Reduction Procedure using Multiple Discriminant Analysis (MDA):

Given the training set $T = \left\{\left(X^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$, where $X^{(i)}$ = feature vector of the $i$-th sample $\in \mathbb{R}^{n}$, $y^{(i)}$ = class label of the $i$-th sample, $k$ = number of classes (denoted $C_1, C_2, C_3, \ldots, C_k$), $N_i$ = number of training samples belonging to class $C_i$, and $N$ = total number of training samples $= \sum_{i=1}^{k} N_i$.
Step #1: Computation of the mean vector (centre of the $i$-th class) $m_i \in \mathbb{R}^{n}$ for all the classes $C_1, C_2, C_3, \ldots, C_k$ and the grand mean vector $m \in \mathbb{R}^{n}$ for the given training set $\left\{\left(X^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$ in the following way:
$$m_i = \text{centre of the } i\text{-th class} = \frac{1}{N_i}\sum_{X \in C_i} X \;\in \mathbb{R}^{n},$$
$$m = \text{grand mean} = \frac{1}{N}\sum_{i=1}^{k} N_i m_i \;\in \mathbb{R}^{n}.$$
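A minimal sketch of Step #1, assuming `X` holds one feature vector per row and `y` holds the corresponding class labels (the function name `class_means` is my own):

```python
import numpy as np

def class_means(X, y):
    """Per-class centres m_i and the grand mean m.

    X : (N, n) feature vectors, y : (N,) class labels.
    Returns (classes, M, m), where row i of M is the centre of classes[i].
    """
    classes = np.unique(y)
    M = np.vstack([X[y == c].mean(axis=0) for c in classes])  # m_i = (1/N_i) sum_{X in C_i} X
    counts = np.array([np.sum(y == c) for c in classes])      # N_i
    m = (counts @ M) / len(y)                                 # m = (1/N) sum_i N_i m_i
    return classes, M, m
```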

Step #2: Determination of the Within-Class Scattering (Intra-Class Scattering) $S_W$ and the Between-Class Scattering (Inter-Class Scattering) $S_B$ in the following way:
$$S_W = \sum_{i=1}^{k}\sum_{X \in C_i}\left(X - m_i\right)\left(X - m_i\right)^{T}$$
and
$$S_B = \sum_{i=1}^{k} N_i\left(m_i - m\right)\left(m_i - m\right)^{T}.$$
Step #3: Determination of the eigenvalues of the discrimination matrix $S_W^{-1} S_B \in \mathbb{R}^{n \times n}$ in descending order.

Step #4: Determination of the unit eigenvectors of the discrimination matrix $S_W^{-1} S_B$ corresponding to the descending sequence of eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$, where $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n$.

The eigenvalues of the discrimination matrix $S_W^{-1} S_B \in \mathbb{R}^{n \times n}$ arranged in descending order are the following:
$$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n,$$
where $\left(S_W^{-1} S_B\right)\hat{e}_i = \lambda_i \hat{e}_i$ for all $i = 1, 2, \ldots, n$, and $\hat{e}_i$ = unit eigenvector of the discrimination matrix $S_W^{-1} S_B \in \mathbb{R}^{n \times n}$ corresponding to eigenvalue $\lambda_i$.
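Steps #3 and #4 as a sketch. $S_W^{-1} S_B$ is generally not symmetric, so `np.linalg.eig` is used and the (numerically negligible) imaginary parts are discarded; the small ridge term guarding against a singular $S_W$ is my own addition, not part of the text:

```python
import numpy as np

def discrimination_eigenpairs(S_W, S_B, ridge=1e-8):
    """Eigenvalues and unit eigenvectors of the discrimination matrix S_W^{-1} S_B,
    sorted in descending order of eigenvalue."""
    n = S_W.shape[0]
    A = np.linalg.solve(S_W + ridge * np.eye(n), S_B)    # S_W^{-1} S_B without an explicit inverse
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals, eigvecs = eigvals.real, eigvecs.real        # drop round-off imaginary parts
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]
    eigvecs = eigvecs / np.linalg.norm(eigvecs, axis=0)  # re-normalise to unit eigenvectors
    return eigvals[order], eigvecs
```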
Step #5: In order to reduce the dimension of each feature vector $X^{(k)}$ from $n$ to $l$, where $n \gg l$, the vector $\left(X^{(k)} - m\right)$ is projected onto the unit eigenvectors $\hat{e}_1, \hat{e}_2, \hat{e}_3, \ldots, \hat{e}_l$.

The feature vector of reduced dimension will be as follows:
$$Y^{(k)} = \begin{bmatrix}
(X^{(k)}-m)^{T}\hat{e}_1\\
(X^{(k)}-m)^{T}\hat{e}_2\\
(X^{(k)}-m)^{T}\hat{e}_3\\
\vdots\\
(X^{(k)}-m)^{T}\hat{e}_l
\end{bmatrix} \in \mathbb{R}^{l}.$$
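Putting Steps #1 to #5 together, a self-contained sketch of the MDA reduction (`mda_transform`, the ridge term, and the data layout are my own choices):

```python
import numpy as np

def mda_transform(X, y, l, ridge=1e-8):
    """Project each (X^(k) - m) onto the top-l eigenvectors of S_W^{-1} S_B."""
    n = X.shape[1]
    m = X.mean(axis=0)                                  # grand mean (Step #1)
    S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
    for c in np.unique(y):                              # Step #2
        X_c = X[y == c]; m_c = X_c.mean(axis=0)
        D_c = X_c - m_c
        S_W += D_c.T @ D_c
        S_B += len(X_c) * np.outer(m_c - m, m_c - m)
    A = np.linalg.solve(S_W + ridge * np.eye(n), S_B)   # Steps #3-#4
    eigvals, eigvecs = np.linalg.eig(A)
    E_l = eigvecs[:, np.argsort(eigvals.real)[::-1][:l]].real   # top-l eigenvectors
    return (X - m) @ E_l                                # Step #5: row k is Y^(k) in R^l
```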

Theorem:
The best direction for discrimination is along the unit eigenvector ($\hat{W}$) of the discrimination matrix $S_W^{-1} S_B$ corresponding to its highest eigenvalue ($\lambda_{\max}$).
Proof:
Feature vectors $X^{(i)} \in \mathbb{R}^{n}$, $i = 1, 2, 3, \ldots, N$, are now projected onto some direction given by the unit vector $\hat{W} \in \mathbb{R}^{n}$ to determine the most discriminatory feature between classes. Let the feature vectors be transformed to the following set of scalars:
$$Y^{(i)} = \hat{W}^{T} X^{(i)}, \qquad i = 1, 2, 3, \ldots, N.$$

$$\tilde{m}_i = \text{centre of the } i\text{-th class in the transformed domain} = \frac{1}{N_i}\sum_{Y \in C_i} Y,$$
$$\tilde{m} = \text{grand mean in the transformed domain} = \frac{1}{N}\sum_{i=1}^{k} N_i \tilde{m}_i.$$

$$\tilde{m}_i = \frac{1}{N_i}\sum_{Y \in C_i} Y = \frac{1}{N_i}\sum_{X \in C_i} \hat{W}^{T} X = \hat{W}^{T}\left(\frac{1}{N_i}\sum_{X \in C_i} X\right) = \hat{W}^{T} m_i,$$
$$\tilde{m} = \frac{1}{N}\sum_{i=1}^{k} N_i \tilde{m}_i = \frac{1}{N}\sum_{i=1}^{k} N_i \hat{W}^{T} m_i = \hat{W}^{T}\left(\frac{1}{N}\sum_{i=1}^{k} N_i m_i\right) = \hat{W}^{T} m.$$

In the original feature space,
$$S_W = \text{within-class scattering} = \sum_{i=1}^{k}\sum_{X \in C_i}\left(X - m_i\right)\left(X - m_i\right)^{T},$$
$$S_B = \text{between-class scattering} = \sum_{i=1}^{k} N_i\left(m_i - m\right)\left(m_i - m\right)^{T}.$$

Within-Class Scattering or Intra-Class Scattering

1. Within-Class Scattering (Intra-Class Scattering) in the original feature space:
$$S_W = \sum_{i=1}^{k}\sum_{X \in C_i}\left(X - m_i\right)\left(X - m_i\right)^{T}.$$

2. Within-Class Scattering (Intra-Class Scattering) in the transformed feature space:
$$\tilde{S}_W = \sum_{i=1}^{k}\sum_{Y \in C_i}\left(Y - \tilde{m}_i\right)\left(Y - \tilde{m}_i\right)^{T}
= \sum_{i=1}^{k}\sum_{X \in C_i}\left(\hat{W}^{T}X - \hat{W}^{T}m_i\right)\left(\hat{W}^{T}X - \hat{W}^{T}m_i\right)^{T}$$
$$= \hat{W}^{T}\left[\sum_{i=1}^{k}\sum_{X \in C_i}\left(X - m_i\right)\left(X - m_i\right)^{T}\right]\hat{W}
= \hat{W}^{T} S_W \hat{W}.$$

Small Within-Class Scattering (Intra-Class Scattering) will make the classes compact. The Within-Class Scattering (Intra-Class Scattering) should therefore be as small as possible.
Between-Class Scattering or Inter-Class Scattering

1. Between-Class Scattering (Inter-Class Scattering) in the original feature space:
$$S_B = \sum_{i=1}^{k} N_i\left(m_i - m\right)\left(m_i - m\right)^{T}.$$

2. Between-Class Scattering (Inter-Class Scattering) in the transformed feature space:
$$\tilde{S}_B = \sum_{i=1}^{k} N_i\left(\tilde{m}_i - \tilde{m}\right)\left(\tilde{m}_i - \tilde{m}\right)^{T}
= \sum_{i=1}^{k} N_i\left(\hat{W}^{T}m_i - \hat{W}^{T}m\right)\left(\hat{W}^{T}m_i - \hat{W}^{T}m\right)^{T}$$
$$= \hat{W}^{T}\left[\sum_{i=1}^{k} N_i\left(m_i - m\right)\left(m_i - m\right)^{T}\right]\hat{W}
= \hat{W}^{T} S_B \hat{W}.$$

Large Between-Class Scattering (Inter-Class Scattering) will make the classes well separable. The Between-Class Scattering (Inter-Class Scattering) should therefore be as large as possible.
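The two projection identities $\tilde{S}_W = \hat{W}^{T} S_W \hat{W}$ and $\tilde{S}_B = \hat{W}^{T} S_B \hat{W}$ are easy to check numerically. A sketch on toy data (all data and names here are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: three classes in R^4 with shifted means.
centres = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X = np.vstack([rng.normal(size=(50, 4)) + c for c in centres])
y = np.repeat([0, 1, 2], 50)

# S_W and S_B in the original feature space (as in Step #2 of Section 4).
n = X.shape[1]; m = X.mean(axis=0)
S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
for c in np.unique(y):
    X_c = X[y == c]; m_c = X_c.mean(axis=0)
    S_W += (X_c - m_c).T @ (X_c - m_c)
    S_B += len(X_c) * np.outer(m_c - m, m_c - m)

# Scattering of the projected (scalar) data along a random unit direction W_hat.
W = rng.normal(size=n); W /= np.linalg.norm(W)
Y = X @ W
S_W_t = sum(np.sum((Y[y == c] - Y[y == c].mean()) ** 2) for c in np.unique(y))
S_B_t = sum(np.sum(y == c) * (Y[y == c].mean() - Y.mean()) ** 2 for c in np.unique(y))

assert np.isclose(S_W_t, W @ S_W @ W)   # tilde(S_W) = W^T S_W W
assert np.isclose(S_B_t, W @ S_B @ W)   # tilde(S_B) = W^T S_B W
```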

The ratio of the Between-Class Scattering (Inter-Class Scattering) to the Within-Class Scattering (Intra-Class Scattering) should be as large as possible, so that the classes are compact and well separable:
$$J(\hat{W}) = \frac{\tilde{S}_B}{\tilde{S}_W} = \frac{\hat{W}^{T} S_B \hat{W}}{\hat{W}^{T} S_W \hat{W}} \qquad \ldots(1)$$

Hence, we have to maximise $J(\hat{W})$:
$$\nabla_{\hat{W}}\left(\frac{\hat{W}^{T} S_B \hat{W}}{\hat{W}^{T} S_W \hat{W}}\right) = 0 \;(\text{null vector})$$
$$\Rightarrow\;\frac{\left(\hat{W}^{T} S_W \hat{W}\right)\nabla_{\hat{W}}\left(\hat{W}^{T} S_B \hat{W}\right) - \left(\hat{W}^{T} S_B \hat{W}\right)\nabla_{\hat{W}}\left(\hat{W}^{T} S_W \hat{W}\right)}{\left(\hat{W}^{T} S_W \hat{W}\right)^{2}} = 0$$
$$\Rightarrow\;\left(\hat{W}^{T} S_W \hat{W}\right)\left(2 S_B \hat{W}\right) - \left(\hat{W}^{T} S_B \hat{W}\right)\left(2 S_W \hat{W}\right) = 0$$
$$\Rightarrow\;\left(\hat{W}^{T} S_W \hat{W}\right) S_B \hat{W} = \left(\hat{W}^{T} S_B \hat{W}\right) S_W \hat{W}$$
$$\Rightarrow\; S_B \hat{W} = \frac{\hat{W}^{T} S_B \hat{W}}{\hat{W}^{T} S_W \hat{W}}\, S_W \hat{W}$$
$$\Rightarrow\; S_B \hat{W} = \lambda\, S_W \hat{W} \qquad \ldots(2)$$
where
$$\lambda = \frac{\hat{W}^{T} S_B \hat{W}}{\hat{W}^{T} S_W \hat{W}} = \text{some scalar}.$$
$$\Rightarrow\;\left(S_W^{-1} S_B\right)\hat{W} = \lambda\hat{W}.$$

The matrix $S_W^{-1} S_B \in \mathbb{R}^{n \times n}$ is known as the discrimination matrix. Hence, $\lambda$ and $\hat{W}$ are an eigenvalue and unit eigenvector pair of the discrimination matrix.

Now
$$J(\hat{W}) = \frac{\hat{W}^{T} S_B \hat{W}}{\hat{W}^{T} S_W \hat{W}} = \lambda
\;\Rightarrow\; J_{\max}(\hat{W}) = \lambda_{\max} = \text{maximum eigenvalue of } S_W^{-1} S_B.$$

Thus, the best choice of $\hat{W}$ is the unit eigenvector corresponding to the maximum eigenvalue of the matrix $S_W^{-1} S_B$.
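Finally, a numerical check of this result (my own sketch, on toy data): the Fisher criterion $J(\hat{W})$ evaluated at the principal eigenvector of $S_W^{-1} S_B$ should equal $\lambda_{\max}$ and should be at least as large as $J$ at any random direction.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: three classes in R^3 with shifted means.
centres = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
X = np.vstack([rng.normal(size=(60, 3)) + c for c in centres])
y = np.repeat([0, 1, 2], 60)

n = X.shape[1]; m = X.mean(axis=0)
S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
for c in np.unique(y):
    X_c = X[y == c]; m_c = X_c.mean(axis=0)
    S_W += (X_c - m_c).T @ (X_c - m_c)
    S_B += len(X_c) * np.outer(m_c - m, m_c - m)

def J(W):
    return (W @ S_B @ W) / (W @ S_W @ W)     # Fisher criterion J(W_hat); scale-invariant

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
W_best = eigvecs[:, np.argmax(eigvals.real)].real
assert np.isclose(J(W_best), eigvals.real.max())   # J_max = lambda_max
for _ in range(10):
    r = rng.normal(size=n)
    assert J(W_best) >= J(r) - 1e-9                # no direction beats the eigenvector
```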
