
Statistics for Applications

Chapter 9: Principal Component Analysis (PCA)

Multivariate statistics and review of linear algebra (1)

▶ Let X be a d-dimensional random vector and X_1, . . . , X_n be n independent copies of X.

▶ Write X_i = (X_i^1, . . . , X_i^d)^T, i = 1, . . . , n.

▶ Denote by X the random n × d matrix whose i-th row is X_i^T:

        ⎛ ···  X_1^T  ··· ⎞
    X = ⎜         ⋮        ⎟
        ⎝ ···  X_n^T  ··· ⎠

Multivariate statistics and review of linear algebra (2)

▶ Assume that E[||X||_2^2] < ∞.

▶ Mean of X:

    E[X] = (E[X^1], . . . , E[X^d])^T.

▶ Covariance matrix of X: the matrix Σ = (σ_{j,k})_{j,k=1,...,d}, where

    σ_{j,k} = cov(X^j, X^k).

▶ It is easy to see that

    Σ = E[XX^T] − E[X] E[X]^T = E[(X − E[X])(X − E[X])^T].

Multivariate statistics and review of linear algebra (3)

▶ Empirical mean of X_1, . . . , X_n:

    X̄ = (1/n) ∑_{i=1}^n X_i = (X̄^1, . . . , X̄^d)^T.

▶ Empirical covariance of X_1, . . . , X_n: the matrix S = (s_{j,k})_{j,k=1,...,d}, where s_{j,k} is the empirical covariance of the X_i^j, X_i^k, i = 1, . . . , n.

▶ It is easy to see that

    S = (1/n) ∑_{i=1}^n X_i X_i^T − X̄ X̄^T = (1/n) ∑_{i=1}^n (X_i − X̄)(X_i − X̄)^T.
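As a quick numerical sanity check (not part of the original slides; the array names are illustrative), the two expressions for S can be compared on simulated data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))                  # rows are the observations X_1, ..., X_n

    Xbar = X.mean(axis=0)                        # empirical mean
    S1 = X.T @ X / n - np.outer(Xbar, Xbar)      # (1/n) sum_i X_i X_i^T - Xbar Xbar^T
    Xc = X - Xbar                                # centered observations X_i - Xbar
    S2 = Xc.T @ Xc / n                           # (1/n) sum_i (X_i - Xbar)(X_i - Xbar)^T

    print(np.allclose(S1, S2))                                   # True: the two forms agree
    print(np.allclose(S1, np.cov(X, rowvar=False, bias=True)))   # matches NumPy's 1/n covariance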

Multivariate statistics and review of linear algebra (4)

▶ Note that X̄ = (1/n) X^T 1, where 1 = (1, . . . , 1)^T ∈ R^n.

▶ Note also that

    S = (1/n) X^T X − (1/n^2) X^T 1 1^T X = (1/n) X^T H X,

  where H = I_n − (1/n) 1 1^T.

▶ H is an orthogonal projector: H^2 = H, H^T = H. (Onto what subspace?)

▶ If u ∈ R^d:
    ▶ u^T Σ u is the variance of u^T X;
    ▶ u^T S u is the sample variance of u^T X_1, . . . , u^T X_n.
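A short numerical illustration of the last two points (a sketch, not from the slides; the simulated data and names are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 4
    X = rng.normal(size=(n, d))

    ones = np.ones((n, 1))
    H = np.eye(n) - ones @ ones.T / n            # centering matrix H = I_n - (1/n) 1 1^T
    S = X.T @ H @ X / n                          # S = (1/n) X^T H X
    print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))   # True
    print(np.allclose(H @ H, H), np.allclose(H, H.T))           # H is a symmetric projector

    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                       # unit direction u
    proj = X @ u                                 # the scalars u^T X_1, ..., u^T X_n
    print(np.isclose(u @ S @ u, proj.var()))     # u^T S u = sample variance of the projections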

Multivariate statistics and review of linear algebra (5)

▶ In particular, u^T S u measures how spread out (i.e., diverse) the points are in the direction u.

▶ If u^T S u = 0, then all the X_i's lie in an affine subspace orthogonal to u.

▶ If u^T Σ u = 0, then X lies almost surely in an affine subspace orthogonal to u.

▶ If u^T S u is large with ||u||_2 = 1, then the direction u explains much of the spread (i.e., diversity) of the sample.

Multivariate statistics and review of linear algebra (6)
▶ In particular, Σ and S are symmetric and positive semi-definite.

▶ Any real symmetric matrix A ∈ R^{d×d} has the decomposition

    A = P D P^T,

  where:
    ▶ P is a d × d orthogonal matrix, i.e., P P^T = P^T P = I_d;
    ▶ D is diagonal.

▶ The diagonal elements of D are the eigenvalues of A, and the columns of P are the corresponding eigenvectors of A.

▶ A is positive semi-definite if and only if all its eigenvalues are nonnegative.
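In NumPy this decomposition is provided by np.linalg.eigh for symmetric matrices; a minimal sketch (the matrix A below is arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.normal(size=(4, 4))
    A = B @ B.T                                  # a symmetric positive semi-definite matrix

    eigvals, P = np.linalg.eigh(A)               # eigh returns eigenvalues in increasing order
    order = np.argsort(eigvals)[::-1]            # reorder so that lambda_1 >= ... >= lambda_d
    eigvals, P = eigvals[order], P[:, order]
    D = np.diag(eigvals)

    print(np.allclose(A, P @ D @ P.T))           # A = P D P^T
    print(np.allclose(P.T @ P, np.eye(4)))       # P is orthogonal
    print(np.all(eigvals >= -1e-12))             # eigenvalues nonnegative (up to round-off)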
Principal Component Analysis: Heuristics (1)

▶ The sample X_1, . . . , X_n forms a cloud of points in R^d.

▶ In practice, d is large. If d > 3, it becomes impossible to represent the cloud in a picture.

▶ Question: Is it possible to project the cloud onto a linear subspace of dimension d' < d while keeping as much information as possible?

▶ Answer: PCA does this by preserving as much of the covariance structure as possible, i.e., by keeping orthogonal directions that discriminate well between the points of the cloud.

Principal Component Analysis: Heuristics (2)
▶ Idea: Write S = P D P^T, where
    ▶ P = (v_1, . . . , v_d) is an orthogonal matrix, i.e., ||v_j||_2 = 1 and v_j^T v_k = 0 for all j ≠ k;
    ▶ D = Diag(λ_1, . . . , λ_d), with λ_1 ≥ λ_2 ≥ . . . ≥ λ_d ≥ 0.

▶ Note that D is the empirical covariance matrix of the P^T X_i's, i = 1, . . . , n.

▶ In particular, λ_1 is the empirical variance of the v_1^T X_i's; λ_2 is the empirical variance of the v_2^T X_i's, etc.
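A brief check of these last two claims on simulated data (a sketch; the anisotropic cloud below is made up):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 500, 3
    X = rng.normal(size=(n, d)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic cloud of points

    S = np.cov(X, rowvar=False, bias=True)        # empirical covariance (1/n convention)
    eigvals, P = np.linalg.eigh(S)
    eigvals, P = eigvals[::-1], P[:, ::-1]        # sort so that lambda_1 >= ... >= lambda_d

    Z = X @ P                                     # rows are the vectors P^T X_i
    print(np.allclose(np.cov(Z, rowvar=False, bias=True), np.diag(eigvals)))   # equals D
    print(np.allclose(Z.var(axis=0), eigvals))    # lambda_j = empirical variance of v_j^T X_i's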
Principal Component Analysis: Heuristics (3)

▶ So each λ_j measures the spread of the cloud in the direction v_j.

▶ In particular, v_1 is the direction of maximal spread.

▶ Indeed, v_1 maximizes the empirical variance of a^T X_1, . . . , a^T X_n over all a ∈ R^d such that ||a||_2 = 1.

▶ Proof: For any unit vector a, show that

    a^T S a = (P^T a)^T D (P^T a) ≤ λ_1,

  with equality if a = v_1. (Hint: write b = P^T a; then ||b||_2 = ||a||_2 = 1 since P is orthogonal, and a^T S a = ∑_j λ_j b_j^2 ≤ λ_1 ∑_j b_j^2 = λ_1.)

Principal Component Analysis: Main principle
▶ Idea of PCA: Find the collection of orthogonal directions in which the cloud is most spread out.

Theorem

    v_1 ∈ argmax_{||u||=1} u^T S u,
    v_2 ∈ argmax_{||u||=1, u ⊥ v_1} u^T S u,
    · · ·
    v_d ∈ argmax_{||u||=1, u ⊥ v_j, j=1,...,d−1} u^T S u.

▶ Hence, the k orthogonal directions in which the cloud is most spread out correspond exactly to the eigenvectors associated with the k largest eigenvalues of S.
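As an illustration of this characterization (a sketch, not part of the course material): power iteration greedily finds a unit vector maximizing u^T S u, and after deflating that direction it finds the best direction orthogonal to it; both coincide with the leading eigenvectors of S.

    import numpy as np

    def top_direction(S, n_iter=1000):
        """Power iteration: a unit vector maximizing u^T S u (top eigenvector of a PSD matrix S)."""
        u = np.random.default_rng(0).normal(size=S.shape[0])
        for _ in range(n_iter):
            u = S @ u
            u /= np.linalg.norm(u)
        return u

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
    S = np.cov(X, rowvar=False, bias=True)

    v1 = top_direction(S)
    S_deflated = S - (v1 @ S @ v1) * np.outer(v1, v1)   # remove the spread along v1
    v2 = top_direction(S_deflated)                      # best direction orthogonal to v1

    eigvals, P = np.linalg.eigh(S)                      # eigenvectors, ascending eigenvalues
    print(np.isclose(abs(v1 @ P[:, -1]), 1.0))          # v1 matches the top eigenvector (up to sign)
    print(np.isclose(abs(v2 @ P[:, -2]), 1.0))          # v2 matches the second eigenvector
    print(abs(v1 @ v2) < 1e-8)                          # v1 and v2 are orthogonal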
Principal Component Analysis: Algorithm (1)

1. Input: X_1, . . . , X_n: cloud of n points in dimension d.

2. Step 1: Compute the empirical covariance matrix S.

3. Step 2: Compute the decomposition S = P D P^T, where D = Diag(λ_1, . . . , λ_d), with λ_1 ≥ λ_2 ≥ . . . ≥ λ_d, and P = (v_1, . . . , v_d) is an orthogonal matrix.

4. Step 3: Choose k < d and set P_k = (v_1, . . . , v_k) ∈ R^{d×k}.

5. Output: Y_1, . . . , Y_n, where

    Y_i = P_k^T X_i ∈ R^k, i = 1, . . . , n.
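A minimal NumPy sketch of these steps (an illustration; the function name and the simulated input are made up, and, like the slide, it projects the raw X_i without centering them first):

    import numpy as np

    def pca(X, k):
        """Project the rows of X (an n x d array) onto the top k principal directions."""
        S = np.cov(X, rowvar=False, bias=True)       # Step 1: empirical covariance matrix
        eigvals, P = np.linalg.eigh(S)               # Step 2: S = P D P^T (ascending order)
        order = np.argsort(eigvals)[::-1]
        eigvals, P = eigvals[order], P[:, order]     # reorder so lambda_1 >= ... >= lambda_d
        P_k = P[:, :k]                               # Step 3: keep the top k eigenvectors
        Y = X @ P_k                                  # Output: Y_i = P_k^T X_i, stacked as rows
        return Y, eigvals, P_k

    # Example: reduce a 10-dimensional cloud to k = 2 dimensions.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 10))
    Y, eigvals, P_k = pca(X, k=2)
    print(Y.shape)                                   # (1000, 2)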

Question: How to choose k?

Principal Component Analysis: Algorithm (2)
Question: How to choose k?

▶ Experimental rule: Take k at an inflection point of the sequence λ_1, . . . , λ_d (scree plot).

▶ Define a criterion: Take k such that

    (λ_1 + . . . + λ_k) / (λ_1 + . . . + λ_d) ≥ 1 − α,

  for some α ∈ (0, 1) that determines the approximation error that the practitioner wants to achieve.

▶ Remark: λ_1 + . . . + λ_k is called the variance explained by the PCA and λ_1 + . . . + λ_d = Tr(S) is the total variance.

▶ Data visualization: Take k = 2 or 3.
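A small sketch of the explained-variance criterion (assuming the full spectrum λ_1 ≥ . . . ≥ λ_d has already been computed; the spectrum and α below are made up):

    import numpy as np

    def choose_k(eigvals, alpha=0.1):
        """Smallest k whose top-k eigenvalues explain at least a (1 - alpha) fraction of the total variance."""
        explained = np.cumsum(eigvals) / np.sum(eigvals)   # eigvals sorted in decreasing order
        return int(np.searchsorted(explained, 1 - alpha) + 1)

    eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])     # made-up spectrum
    print(choose_k(eigvals, alpha=0.1))                    # smallest k with >= 90% variance explained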

Example: Expression of 500,000 genes among 1400 Europeans

Reprinted by permission from Macmillan Publishers Ltd: Nature. Source: John Novembre, et al. "Genes mirror geography within Europe." Nature 456 (2008): 98-101. © 2008.
Principal Component Analysis - Beyond practice (1)
▶ PCA is an algorithm that reduces the dimension of a cloud of points while keeping as much of its covariance structure as possible.

▶ In practice, this algorithm is used on clouds of points that are not necessarily random.

▶ In statistics, PCA can be used for estimation.

▶ If X_1, . . . , X_n are i.i.d. random vectors in R^d, how can we estimate their population covariance matrix Σ?

▶ If n ≫ d, then the empirical covariance matrix S is a consistent estimator.

▶ In many applications, n ≪ d (e.g., gene expression). Solution: sparse PCA.
Principal Component Analysis - Beyond practice (2)
▶ It may be known beforehand that Σ has (almost) low rank.

▶ Then, run PCA on S: write S ≈ S', where

    S' = P Diag(λ_1, . . . , λ_k, 0, . . . , 0) P^T,

  i.e., S' keeps the k largest eigenvalues of S and sets the remaining ones to zero.
▶ S' will be a better estimator of Σ under the low-rank assumption.
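A sketch of this truncation step (illustrative only; k would be chosen as discussed earlier):

    import numpy as np

    def truncate_spectrum(S, k):
        """Return S' = P Diag(lambda_1, ..., lambda_k, 0, ..., 0) P^T for the k largest eigenvalues of S."""
        eigvals, P = np.linalg.eigh(S)               # ascending eigenvalues
        eigvals, P = eigvals[::-1], P[:, ::-1]       # sort in decreasing order
        eigvals_trunc = np.where(np.arange(len(eigvals)) < k, eigvals, 0.0)
        return P @ np.diag(eigvals_trunc) @ P.T

    # Example: a 5 x 5 empirical covariance matrix truncated to rank k = 2.
    rng = np.random.default_rng(6)
    X = rng.normal(size=(50, 5))
    S = np.cov(X, rowvar=False, bias=True)
    S_prime = truncate_spectrum(S, k=2)
    print(np.linalg.matrix_rank(S_prime))            # 2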

▶ A theoretical analysis would lead to an optimal choice of the tuning parameter k.
MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
