Curse of Dimensionality, Dimensionality Reduction With PCA
Curse of Dimensionality: Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9.
[Figure: the same 9 samples spread over the unit square [0, 1] × [0, 1]]
Common pitfall:
If we can't solve a problem with a few features, adding more features seems like a good idea.
However, the number of samples usually stays the same.
The method with more features is then likely to perform worse, not better as expected.
Curse of Dimensionality: Number of Samples
For a fixed number of samples, as we add features, the graph of classification error looks like this:
[Figure: classification error versus number of features; the error first decreases, then increases past an optimal number of features]
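To make this concrete, here is a minimal numpy sketch (not from the lecture; the two-class toy data, the pure-noise features, and the 1-nearest-neighbor classifier are all illustrative assumptions): with the training-set size held fixed, test error tends to grow as useless features are added.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, n_noise):
    """One informative feature separating two classes, plus pure-noise features."""
    y = rng.integers(0, 2, size=n)
    informative = y + 0.8 * rng.standard_normal(n)
    noise = rng.standard_normal((n, n_noise))
    return np.column_stack([informative, noise]), y

def nn_error(X_train, y_train, X_test, y_test):
    """Test error of a 1-nearest-neighbor classifier (Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return (y_train[d.argmin(axis=1)] != y_test).mean()

n_train = 20                            # number of samples stays fixed
for n_noise in [0, 5, 20, 100]:         # while the number of features grows
    X, y = make_data(n_train + 500, n_noise)
    err = nn_error(X[:n_train], y[:n_train], X[n_train:], y[n_train:])
    print(f"{1 + n_noise:4d} features -> test error {err:.2f}")
```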
For example, a 4-dimensional sample x can be mapped to a lower-dimensional vector y:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \;\longrightarrow\; y = f\!\left(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}\right)$$
Ideally, the new vector y should retain from x all
information important for classification
Dimensionality Reduction
The best f(x) is most likely a non-linear function.
However, linear functions are easier to find.
For now, assume that f(x) is a linear mapping.
Thus it can be represented by a matrix W:
$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \;\longrightarrow\; W \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & \ddots & \vdots \\ w_{k1} & \cdots & w_{kd} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix} \qquad \text{with } k \ll d$$
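As a small illustration of this linear mapping (the random W below is just a placeholder, not a learned projection), reducing d-dimensional samples to k dimensions is a single matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(1)

d, k, n = 10, 3, 5                     # original dim, reduced dim (k << d), samples
X = rng.standard_normal((n, d))        # rows are samples x in R^d
W = rng.standard_normal((k, d))        # some k x d linear map (random, for illustration)

Y = X @ W.T                            # y = W x applied to every sample
print(X.shape, "->", Y.shape)          # (5, 10) -> (5, 3)
```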
Feature Combination
[Figure: two-dimensional data projected onto a line through the data; this line is a one-dimensional subspace of R²]
PCA Derivation: Shift by the Mean Vector
After subtracting the sample mean from the data, we approximate each (centered) sample x_j by a linear combination of k orthonormal basis vectors e_1, …, e_k, and minimize the total squared error
$$J(e_1,\ldots,e_k,\alpha_{11},\ldots,\alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji}\, e_i \right\|^2$$
The unknowns are the basis vectors e_i and the coefficients α_ji; each term of the outer sum is the error at one point.
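As a numerical sketch of this objective (the toy data, the basis obtained by QR, and the arbitrary coefficients are all assumptions for illustration), J can be evaluated directly from its definition:

```python
import numpy as np

def J(X, E, A):
    """Total squared reconstruction error.
    X: (n, d) centered samples, rows are x_j
    E: (k, d) basis vectors, rows are e_i (assumed orthonormal)
    A: (n, k) coefficients alpha_ji
    """
    return ((X - A @ E) ** 2).sum()

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
X -= X.mean(axis=0)                    # shift by the mean vector
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))
E = Q.T                                # rows e_1, e_2 are orthonormal
A = rng.standard_normal((6, 2))        # arbitrary coefficients (the unknowns)
print("J =", J(X, E, A))
```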
PCA: Derivation
To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, …, e_k} is an orthonormal set of vectors.
$$J(e_1,\ldots,e_k,\alpha_{11},\ldots,\alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji}\, e_i \right\|^2$$
Expanding the squared norm and using the orthonormality of the e_i:
$$J(e_1,\ldots,e_k,\alpha_{11},\ldots,\alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
PCA: Derivation
$$J(e_1,\ldots,e_k,\alpha_{11},\ldots,\alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
Taking the partial derivative with respect to a coefficient α_ml and setting it to zero:
$$-2\, x_m^t e_l + 2\,\alpha_{ml} = 0 \quad\Longrightarrow\quad \alpha_{ml} = x_m^t e_l$$
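A quick numerical check of this result (toy data and a random orthonormal basis, purely illustrative): the coefficients α_ml = x_m^t e_l never give a larger error J than perturbed coefficients.

```python
import numpy as np

def J(X, E, A):
    """Squared reconstruction error of samples X with coefficients A over basis E."""
    return ((X - A @ E) ** 2).sum()

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 5))
X -= X.mean(axis=0)                    # centered samples
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))
E = Q.T                                # rows e_1, e_2 are orthonormal

A_opt = X @ E.T                        # alpha_ml = x_m^t e_l
for _ in range(3):
    A_other = A_opt + 0.1 * rng.standard_normal(A_opt.shape)
    assert J(X, E, A_opt) <= J(X, E, A_other)
print("J at the optimal coefficients:", J(X, E, A_opt))
```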
PCA: Derivation
$$J(e_1,\ldots,e_k,\alpha_{11},\ldots,\alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
Substituting the optimal coefficients α_ji = x_j^t e_i:
$$J(e_1,\ldots,e_k) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \bigl(x_j^t e_i\bigr)^2 + \sum_{j=1}^{n}\sum_{i=1}^{k} \bigl(x_j^t e_i\bigr)^2$$
We can simplify J:
$$J(e_1,\ldots,e_k) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - \sum_{j=1}^{n}\sum_{i=1}^{k} \bigl(x_j^t e_i\bigr)^2$$
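A sanity check of this simplification on toy data (values arbitrary): with α_ji = x_j^t e_i, the direct error and the simplified expression agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)                              # centered samples
E, _ = np.linalg.qr(rng.standard_normal((4, 2))) # orthonormal basis vectors as columns
A = X @ E                                        # alpha_ji = x_j^t e_i

direct = ((X - A @ E.T) ** 2).sum()              # sum_j ||x_j - sum_i alpha_ji e_i||^2
simplified = (X ** 2).sum() - (A ** 2).sum()     # sum_j ||x_j||^2 - sum_{j,i} (x_j^t e_i)^2
print(direct, simplified)
```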
PCA: Derivation
$$J(e_1,\ldots,e_k) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - \sum_{j=1}^{n}\sum_{i=1}^{k} \bigl(x_j^t e_i\bigr)^2 = \underbrace{\sum_{j=1}^{n} \left\| x_j \right\|^2}_{\text{constant}} - \sum_{i=1}^{k} e_i^t S\, e_i$$
where
$$S = \sum_{j=1}^{n} x_j x_j^t$$
is the scatter matrix. The first term does not depend on the e_i, so it is constant.
Minimizing J is therefore equivalent to maximizing
$$\sum_{i=1}^{k} e_i^t S\, e_i$$
We should also enforce the constraints $e_i^t e_i = 1$ for all i.
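The identity behind this rewriting of J in terms of the scatter matrix, Σ_j (x_j^t e)² = e^t S e, is easy to verify numerically (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)                 # centered samples x_j as rows

S = X.T @ X                         # scatter matrix S = sum_j x_j x_j^t
e = rng.standard_normal(3)
e /= np.linalg.norm(e)              # unit-length direction, e^t e = 1

lhs = ((X @ e) ** 2).sum()          # sum_j (x_j^t e)^2
rhs = e @ S @ e                     # e^t S e
print(lhs, rhs)                     # equal up to floating-point error
```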
Use the method of Lagrange multipliers: incorporate the constraints with undetermined multipliers λ_1, …, λ_k.
We need to maximize the new function u:
$$u(e_1,\ldots,e_k) = \sum_{i=1}^{k} e_i^t S\, e_i - \sum_{j=1}^{k} \lambda_j \bigl(e_j^t e_j - 1\bigr)$$
PCA: Derivation
$$u(e_1,\ldots,e_k) = \sum_{i=1}^{k} e_i^t S\, e_i - \sum_{j=1}^{k} \lambda_j \bigl(e_j^t e_j - 1\bigr)$$
Computing the gradient of u with respect to each e_m and setting it to zero gives
$$S\, e_m = \lambda_m e_m$$
so every e_m must be an eigenvector of the scatter matrix S, with eigenvalue λ_m. Substituting back into J,
$$J(e_1,\ldots,e_k) = \underbrace{\sum_{j=1}^{n} \left\| x_j \right\|^2}_{\text{constant}} - \sum_{i=1}^{k} \lambda_i$$
so J is minimized by choosing e_1, …, e_k to be the k eigenvectors of S with the largest eigenvalues.
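Putting the whole derivation together, here is a minimal PCA sketch in numpy (the function name, toy data, and dimensions are illustrative assumptions, not the lecture's notation): center the data, form the scatter matrix, take the k eigenvectors with the largest eigenvalues as e_1, …, e_k, and project.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k eigenvectors of the scatter matrix
    with the largest eigenvalues (the first k principal components)."""
    mu = X.mean(axis=0)
    Xc = X - mu                           # shift by the mean vector
    S = Xc.T @ Xc                         # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]
    E = eigvecs[:, order]                 # columns are e_1, ..., e_k
    return Xc @ E, E, mu                  # projections y_j, basis, mean

# toy data: 3-dimensional points that mostly vary along one direction
rng = np.random.default_rng(5)
t = rng.standard_normal(200)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.standard_normal(200),
                     0.1 * rng.standard_normal(200)])
Y, E, mu = pca(X, k=1)
print(Y.shape)                            # (200, 1): reduced to one dimension
```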