Statistics For Applications 9: Principal Component Analysis (PCA)
Multivariate statistics and review of linear algebra (1)
▶ Denote by X the n × d random matrix whose i-th row is X_i^T:

        ⎛ ··· X_1^T ··· ⎞
    X = ⎜       ⋮       ⎟ .
        ⎝ ··· X_n^T ··· ⎠
Multivariate statistics and review of linear algebra (2)
▶ Mean of X:

    E[X] = (E[X^1], ..., E[X^d])^T.

▶ Covariance matrix of X: the d × d matrix Σ = (σ_{j,k})_{j,k=1,...,d}, where

    σ_{j,k} = cov(X^j, X^k).
Multivariate statistics and review of linear algebra (3)
▶ Empirical mean of X_1, ..., X_n:

    X̄ = (1/n) ∑_{i=1}^n X_i = (X̄^1, ..., X̄^d)^T.

▶ Empirical covariance of X_1, ..., X_n: the d × d matrix S = (s_{j,k})_{j,k=1,...,d}, where s_{j,k} is the empirical covariance of X_i^j, X_i^k, i = 1, ..., n.
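As a concrete illustration, here is a minimal NumPy sketch of these two estimators (the toy data and variable names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = rng.normal(size=(n, d))                   # n observations in R^d, one per row

    X_bar = X.mean(axis=0)                        # empirical mean, a vector in R^d
    S = (X - X_bar).T @ (X - X_bar) / n           # empirical covariance matrix (d x d), 1/n convention

    # np.cov divides by n - 1 by default; bias=True matches the 1/n convention above
    assert np.allclose(S, np.cov(X, rowvar=False, bias=True))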
Multivariate statistics and review of linear algebra (4)
▶ If u ∈ R^d,
    ▶ u^T Σ u is the variance of u^T X;
    ▶ u^T S u is the sample variance of u^T X_1, ..., u^T X_n.
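A quick numerical check of the second identity, on the same kind of toy data as above (the direction u is arbitrary and chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # toy data cloud in R^3
    X_bar = X.mean(axis=0)
    S = (X - X_bar).T @ (X - X_bar) / 100         # empirical covariance, 1/n convention

    u = np.array([1.0, 2.0, -1.0])
    u /= np.linalg.norm(u)                        # an arbitrary unit direction
    proj = X @ u                                  # the n numbers u^T X_1, ..., u^T X_n
    assert np.isclose(u @ S @ u, proj.var())      # numpy's var() also uses the 1/n convention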
Multivariate statistics and review of linear algebra (5)
Multivariate statistics and review of linear algebra (6)
▶ In particular, Σ and S are symmetric, positive semi-definite.
▶ Any real symmetric matrix A ∈ R^{d×d} can be written as

    A = P D P^T,

  where:
    ▶ P is a d × d orthogonal matrix, i.e., P P^T = P^T P = I_d;
    ▶ D is diagonal.
▶ The diagonal entries of D are the eigenvalues of A, and the columns of P are the corresponding eigenvectors.
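A short sketch of this decomposition applied to a sample covariance matrix (np.linalg.eigh is the natural routine here since S is symmetric; the toy data is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    S = np.cov(X, rowvar=False, bias=True)        # symmetric, positive semi-definite

    eigvals, eigvecs = np.linalg.eigh(S)          # for symmetric matrices; eigenvalues in ascending order
    D = np.diag(eigvals)
    P = eigvecs                                   # columns are orthonormal eigenvectors

    assert np.allclose(P @ D @ P.T, S)            # A = P D P^T with A = S
    assert np.allclose(P.T @ P, np.eye(3))        # P is orthogonal: P^T P = I_d
    assert np.all(eigvals >= -1e-12)              # positive semi-definite: eigenvalues >= 0 (up to rounding)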
Principal Component Analysis: Heuristics (2)
▶ Idea: Write S = P D P^T, where
    ▶ P = (v_1, ..., v_d) is an orthogonal matrix, i.e., ‖v_j‖_2 = 1, v_j^T v_k = 0, ∀ j ≠ k;
    ▶ D = Diag(λ_1, ..., λ_d), with λ_1 ≥ ... ≥ λ_d ≥ 0.
▶ In particular, for any unit vector a, a^T S a ≤ λ_1, with equality if a = v_1.
Principal Component Analysis: Main principle
▶ Idea of the PCA: Find the collection of orthogonal directions in which the cloud is most spread out.

Theorem

    v_1 ∈ argmax_{‖u‖=1} u^T S u,
    v_2 ∈ argmax_{‖u‖=1, u ⊥ v_1} u^T S u,
    ···
    v_d ∈ argmax_{‖u‖=1, u ⊥ v_j, j=1,...,d−1} u^T S u.
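A small numerical illustration of the theorem (a sketch on toy data, not part of the slides): no unit vector u beats v_1, and v_2 attains λ_2, the constrained maximum over directions orthogonal to v_1.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated toy cloud in R^4
    S = np.cov(X, rowvar=False, bias=True)

    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                         # sort eigenpairs: lambda_1 >= ... >= lambda_d
    lambdas, V = eigvals[order], eigvecs[:, order]
    v1, v2 = V[:, 0], V[:, 1]

    # u^T S u never exceeds lambda_1 over many random unit vectors u
    U = rng.normal(size=(10000, 4))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    assert np.all(np.einsum("ij,jk,ik->i", U, S, U) <= lambdas[0] + 1e-9)

    assert np.isclose(v1 @ S @ v1, lambdas[0])                # v_1 attains the maximum lambda_1
    assert np.isclose(v2 @ S @ v2, lambdas[1])                # v_2 attains lambda_2 (the constrained maximum)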
Principal Component Analysis: Algorithm (1)
1. Input: X_1, ..., X_n, a cloud of n points in dimension d.
2. Compute the empirical covariance matrix S.
3. Compute the decomposition S = P D P^T, where D = Diag(λ_1, ..., λ_d), with λ_1 ≥ ... ≥ λ_d, and P = (v_1, ..., v_d) is orthogonal.
4. Choose k < d and set P_k = (v_1, ..., v_k) ∈ R^{d×k}.
5. Output: Y_1, ..., Y_n, where

    Y_i = P_k^T X_i ∈ R^k,   i = 1, ..., n.
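Below is a minimal NumPy sketch of steps 1–5 (the function name pca and the toy data are illustrative; following the output formula above, the raw X_i are projected without re-centering):

    import numpy as np

    def pca(X, k):
        """Project the rows of X (an n x d array) onto the k leading eigenvectors
        of the empirical covariance matrix S, following steps 1-5 above."""
        n, d = X.shape
        X_bar = X.mean(axis=0)
        S = (X - X_bar).T @ (X - X_bar) / n           # step 2: empirical covariance
        eigvals, eigvecs = np.linalg.eigh(S)          # step 3: S = P D P^T (ascending eigenvalues)
        order = np.argsort(eigvals)[::-1]             # reorder so that lambda_1 >= ... >= lambda_d
        lambdas, P = eigvals[order], eigvecs[:, order]
        P_k = P[:, :k]                                # step 4: d x k matrix of leading eigenvectors
        Y = X @ P_k                                   # step 5: row i is Y_i^T = (P_k^T X_i)^T
        return Y, lambdas

    # Illustrative usage on a correlated toy cloud
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    Y, lambdas = pca(X, k=2)
    print(Y.shape)        # (200, 2): each X_i in R^5 is mapped to Y_i in R^2
    print(lambdas)        # lambda_1 >= ... >= lambda_5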
Principal Component Analysis: Algorithm (2)
Question: How to choose k?
▶ Experimental rule: Take k where there is an inflection point in the sequence λ_1, ..., λ_d (scree plot).
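The sketch below draws a scree plot on toy data with two dominant directions (the data and names are illustrative; one reads k off the "elbow" where the eigenvalues flatten out):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    # Toy cloud in R^6 driven by 2 strong directions plus small noise
    X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

    S = np.cov(X, rowvar=False, bias=True)
    lambdas = np.sort(np.linalg.eigvalsh(S))[::-1]    # lambda_1 >= ... >= lambda_d

    plt.plot(range(1, len(lambdas) + 1), lambdas, "o-")
    plt.xlabel("component j")
    plt.ylabel("eigenvalue lambda_j")
    plt.title("Scree plot: inflection point suggests k = 2 here")
    plt.show()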
Example: Expression of 500,000 genes among 1400 Europeans
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.