
Statistics for Applications

Chapter 9: Principal Component Analysis (PCA)

Multivariate statistics and review of linear algebra (1)

▶ Let X be a d-dimensional random vector and X_1, . . . , X_n be n independent copies of X.

▶ Write X_i = (X_i^1, . . . , X_i^d)^T, i = 1, . . . , n.

▶ Denote by X the random n × d matrix whose i-th row is X_i^T:

        ⎛ ···  X_1^T  ··· ⎞
    X = ⎜         ⋮        ⎟
        ⎝ ···  X_n^T  ··· ⎠

Multivariate statistics and review of linear algebra (2)

▶ Assume that E[||X||_2^2] < ∞.

▶ Mean of X:

    E[X] = (E[X^1], . . . , E[X^d])^T.

▶ Covariance matrix of X: the matrix Σ = (σ_{j,k})_{j,k=1,...,d}, where

    σ_{j,k} = cov(X^j, X^k).

▶ It is easy to see that

    Σ = E[XX^T] − E[X] E[X]^T = E[(X − E[X])(X − E[X])^T].

Multivariate statistics and review of linear algebra (3)

▶ Empirical mean of X_1, . . . , X_n:

    X̄ = (1/n) ∑_{i=1}^n X_i = (X̄^1, . . . , X̄^d)^T.

▶ Empirical covariance of X_1, . . . , X_n: the matrix S = (s_{j,k})_{j,k=1,...,d}, where s_{j,k} is the empirical covariance of the X_i^j, X_i^k, i = 1, . . . , n.

▶ It is easy to see that

    S = (1/n) ∑_{i=1}^n X_i X_i^T − X̄ X̄^T = (1/n) ∑_{i=1}^n (X_i − X̄)(X_i − X̄)^T.
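As a quick numerical sanity check (not part of the original slides; the array names are illustrative), the two expressions for S can be compared on simulated data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))                  # rows are the observations X_1, ..., X_n

    Xbar = X.mean(axis=0)                        # empirical mean
    S1 = X.T @ X / n - np.outer(Xbar, Xbar)      # (1/n) sum_i X_i X_i^T - Xbar Xbar^T
    Xc = X - Xbar                                # centered observations X_i - Xbar
    S2 = Xc.T @ Xc / n                           # (1/n) sum_i (X_i - Xbar)(X_i - Xbar)^T

    print(np.allclose(S1, S2))                                   # True: the two forms agree
    print(np.allclose(S1, np.cov(X, rowvar=False, bias=True)))   # matches NumPy's 1/n covariance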

Multivariate statistics and review of linear algebra (4)

▶ Note that X̄ = (1/n) X^T 1, where 1 = (1, . . . , 1)^T ∈ R^n.

▶ Note also that

    S = (1/n) X^T X − (1/n^2) X^T 1 1^T X = (1/n) X^T H X,

  where H = I_n − (1/n) 1 1^T.

▶ H is an orthogonal projector: H^2 = H, H^T = H. (Onto what subspace?)

▶ If u ∈ R^d:
    ▶ u^T Σ u is the variance of u^T X;
    ▶ u^T S u is the sample variance of u^T X_1, . . . , u^T X_n.
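A short numerical illustration of the last two points (a sketch, not from the slides; the simulated data and names are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 4
    X = rng.normal(size=(n, d))

    ones = np.ones((n, 1))
    H = np.eye(n) - ones @ ones.T / n            # centering matrix H = I_n - (1/n) 1 1^T
    S = X.T @ H @ X / n                          # S = (1/n) X^T H X
    print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))   # True
    print(np.allclose(H @ H, H), np.allclose(H, H.T))           # H is a symmetric projector

    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                       # unit direction u
    proj = X @ u                                 # the scalars u^T X_1, ..., u^T X_n
    print(np.isclose(u @ S @ u, proj.var()))     # u^T S u = sample variance of the projections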

Multivariate statistics and review of linear algebra (5)

▶ In particular, u^T S u measures how spread out (i.e., diverse) the points are in the direction u.

▶ If u^T S u = 0, then all the X_i's lie in an affine subspace orthogonal to u.

▶ If u^T Σ u = 0, then X lies almost surely in an affine subspace orthogonal to u.

▶ If u^T S u is large with ||u||_2 = 1, then the direction u explains much of the spread (i.e., diversity) of the sample.

Multivariate statistics and review of linear algebra (6)
▶ In particular, Σ and S are symmetric and positive semi-definite.

▶ Any real symmetric matrix A ∈ R^{d×d} has the decomposition

    A = P D P^T,

  where:
    ▶ P is a d × d orthogonal matrix, i.e., P P^T = P^T P = I_d;
    ▶ D is diagonal.

▶ The diagonal elements of D are the eigenvalues of A, and the columns of P are the corresponding eigenvectors of A.

▶ A is positive semi-definite if and only if all its eigenvalues are nonnegative.
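In NumPy this decomposition is provided by np.linalg.eigh for symmetric matrices; a minimal sketch (the matrix A below is arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.normal(size=(4, 4))
    A = B @ B.T                                  # a symmetric positive semi-definite matrix

    eigvals, P = np.linalg.eigh(A)               # eigh returns eigenvalues in increasing order
    order = np.argsort(eigvals)[::-1]            # reorder so that lambda_1 >= ... >= lambda_d
    eigvals, P = eigvals[order], P[:, order]
    D = np.diag(eigvals)

    print(np.allclose(A, P @ D @ P.T))           # A = P D P^T
    print(np.allclose(P.T @ P, np.eye(4)))       # P is orthogonal
    print(np.all(eigvals >= -1e-12))             # eigenvalues nonnegative (up to round-off)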
Principal Component Analysis: Heuristics (1)

▶ The sample X_1, . . . , X_n forms a cloud of points in R^d.

▶ In practice, d is large. If d > 3, it becomes impossible to represent the cloud in a picture.

▶ Question: Is it possible to project the cloud onto a linear subspace of dimension d' < d while keeping as much information as possible?

▶ Answer: PCA does this by preserving as much of the covariance structure as possible, i.e., by keeping orthogonal directions that discriminate well between the points of the cloud.

Principal Component Analysis: Heuristics (2)
▶ Idea: Write S = P D P^T, where
    ▶ P = (v_1, . . . , v_d) is an orthogonal matrix, i.e., ||v_j||_2 = 1 and v_j^T v_k = 0 for all j ≠ k;
    ▶ D = Diag(λ_1, . . . , λ_d), with λ_1 ≥ λ_2 ≥ . . . ≥ λ_d ≥ 0.

▶ Note that D is the empirical covariance matrix of the P^T X_i's, i = 1, . . . , n.

▶ In particular, λ_1 is the empirical variance of the v_1^T X_i's; λ_2 is the empirical variance of the v_2^T X_i's, etc.
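A brief check of these last two claims on simulated data (a sketch; the anisotropic cloud below is made up):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 500, 3
    X = rng.normal(size=(n, d)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic cloud of points

    S = np.cov(X, rowvar=False, bias=True)        # empirical covariance (1/n convention)
    eigvals, P = np.linalg.eigh(S)
    eigvals, P = eigvals[::-1], P[:, ::-1]        # sort so that lambda_1 >= ... >= lambda_d

    Z = X @ P                                     # rows are the vectors P^T X_i
    print(np.allclose(np.cov(Z, rowvar=False, bias=True), np.diag(eigvals)))   # equals D
    print(np.allclose(Z.var(axis=0), eigvals))    # lambda_j = empirical variance of v_j^T X_i's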
Principal Component Analysis: Heuristics (3)

▶ So each λ_j measures the spread of the cloud in the direction v_j.

▶ In particular, v_1 is the direction of maximal spread.

▶ Indeed, v_1 maximizes the empirical variance of a^T X_1, . . . , a^T X_n over all a ∈ R^d such that ||a||_2 = 1.

▶ Proof: For any unit vector a, show that

    a^T S a = (P^T a)^T D (P^T a) ≤ λ_1,

  with equality if a = v_1. (Hint: write b = P^T a; then ||b||_2 = ||a||_2 = 1 since P is orthogonal, and a^T S a = ∑_j λ_j b_j^2 ≤ λ_1 ∑_j b_j^2 = λ_1.)

Principal Component Analysis: Main principle
▶ Idea of PCA: Find the collection of orthogonal directions in which the cloud is most spread out.

Theorem

    v_1 ∈ argmax_{||u||=1} u^T S u,
    v_2 ∈ argmax_{||u||=1, u ⊥ v_1} u^T S u,
    · · ·
    v_d ∈ argmax_{||u||=1, u ⊥ v_j, j=1,...,d−1} u^T S u.

▶ Hence, the k orthogonal directions in which the cloud is most spread out correspond exactly to the eigenvectors associated with the k largest eigenvalues of S.
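As an illustration of this characterization (a sketch, not part of the course material): power iteration greedily finds a unit vector maximizing u^T S u, and after deflating that direction it finds the best direction orthogonal to it; both coincide with the leading eigenvectors of S.

    import numpy as np

    def top_direction(S, n_iter=1000):
        """Power iteration: a unit vector maximizing u^T S u (top eigenvector of a PSD matrix S)."""
        u = np.random.default_rng(0).normal(size=S.shape[0])
        for _ in range(n_iter):
            u = S @ u
            u /= np.linalg.norm(u)
        return u

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
    S = np.cov(X, rowvar=False, bias=True)

    v1 = top_direction(S)
    S_deflated = S - (v1 @ S @ v1) * np.outer(v1, v1)   # remove the spread along v1
    v2 = top_direction(S_deflated)                      # best direction orthogonal to v1

    eigvals, P = np.linalg.eigh(S)                      # eigenvectors, ascending eigenvalues
    print(np.isclose(abs(v1 @ P[:, -1]), 1.0))          # v1 matches the top eigenvector (up to sign)
    print(np.isclose(abs(v2 @ P[:, -2]), 1.0))          # v2 matches the second eigenvector
    print(abs(v1 @ v2) < 1e-8)                          # v1 and v2 are orthogonal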
Principal Component Analysis: Algorithm (1)

1. Input: X_1, . . . , X_n: cloud of n points in dimension d.

2. Step 1: Compute the empirical covariance matrix S.

3. Step 2: Compute the decomposition S = P D P^T, where D = Diag(λ_1, . . . , λ_d), with λ_1 ≥ λ_2 ≥ . . . ≥ λ_d, and P = (v_1, . . . , v_d) is an orthogonal matrix.

4. Step 3: Choose k < d and set P_k = (v_1, . . . , v_k) ∈ R^{d×k}.

5. Output: Y_1, . . . , Y_n, where

    Y_i = P_k^T X_i ∈ R^k, i = 1, . . . , n.
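A minimal NumPy sketch of these steps (an illustration; the function name and the simulated input are made up, and, like the slide, it projects the raw X_i without centering them first):

    import numpy as np

    def pca(X, k):
        """Project the rows of X (an n x d array) onto the top k principal directions."""
        S = np.cov(X, rowvar=False, bias=True)       # Step 1: empirical covariance matrix
        eigvals, P = np.linalg.eigh(S)               # Step 2: S = P D P^T (ascending order)
        order = np.argsort(eigvals)[::-1]
        eigvals, P = eigvals[order], P[:, order]     # reorder so lambda_1 >= ... >= lambda_d
        P_k = P[:, :k]                               # Step 3: keep the top k eigenvectors
        Y = X @ P_k                                  # Output: Y_i = P_k^T X_i, stacked as rows
        return Y, eigvals, P_k

    # Example: reduce a 10-dimensional cloud to k = 2 dimensions.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 10))
    Y, eigvals, P_k = pca(X, k=2)
    print(Y.shape)                                   # (1000, 2)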

Question: How to choose k?

Principal Component Analysis: Algorithm (2)
Question: How to choose k?

▶ Experimental rule: Take k at an inflection point of the sequence λ_1, . . . , λ_d (scree plot).

▶ Define a criterion: Take k such that

    (λ_1 + . . . + λ_k) / (λ_1 + . . . + λ_d) ≥ 1 − α,

  for some α ∈ (0, 1) that determines the approximation error that the practitioner wants to achieve.

▶ Remark: λ_1 + . . . + λ_k is called the variance explained by the PCA and λ_1 + . . . + λ_d = Tr(S) is the total variance.

▶ Data visualization: Take k = 2 or 3.
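A small sketch of the explained-variance criterion (assuming the full spectrum λ_1 ≥ . . . ≥ λ_d has already been computed; the spectrum and α below are made up):

    import numpy as np

    def choose_k(eigvals, alpha=0.1):
        """Smallest k whose top-k eigenvalues explain at least a (1 - alpha) fraction of the total variance."""
        explained = np.cumsum(eigvals) / np.sum(eigvals)   # eigvals sorted in decreasing order
        return int(np.searchsorted(explained, 1 - alpha) + 1)

    eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])     # made-up spectrum
    print(choose_k(eigvals, alpha=0.1))                    # smallest k with >= 90% variance explained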

Example: Expression of 500,000 genes among 1400 Europeans

Reprinted by permission from Macmillan Publishers Ltd: Nature. Source: John Novembre, et al. "Genes mirror geography within Europe." Nature 456 (2008): 98-101. © 2008.
Principal Component Analysis - Beyond practice (1)
▶ PCA is an algorithm that reduces the dimension of a cloud of points while keeping as much of its covariance structure as possible.

▶ In practice, this algorithm is used on clouds of points that are not necessarily random.

▶ In statistics, PCA can be used for estimation.

▶ If X_1, . . . , X_n are i.i.d. random vectors in R^d, how can we estimate their population covariance matrix Σ?

▶ If n ≫ d, then the empirical covariance matrix S is a consistent estimator.

▶ In many applications, n ≪ d (e.g., gene expression). Solution: sparse PCA.
Principal Component Analysis - Beyond practice (2)
▶ It may be known beforehand that Σ has (almost) low rank.

▶ Then, run PCA on S: write S ≈ S', where

    S' = P Diag(λ_1, . . . , λ_k, 0, . . . , 0) P^T,

  i.e., S' keeps the k largest eigenvalues of S and sets the remaining ones to zero.
▶ S' will be a better estimator of Σ under the low-rank assumption.
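A sketch of this truncation step (illustrative only; k would be chosen as discussed earlier):

    import numpy as np

    def truncate_spectrum(S, k):
        """Return S' = P Diag(lambda_1, ..., lambda_k, 0, ..., 0) P^T for the k largest eigenvalues of S."""
        eigvals, P = np.linalg.eigh(S)               # ascending eigenvalues
        eigvals, P = eigvals[::-1], P[:, ::-1]       # sort in decreasing order
        eigvals_trunc = np.where(np.arange(len(eigvals)) < k, eigvals, 0.0)
        return P @ np.diag(eigvals_trunc) @ P.T

    # Example: a 5 x 5 empirical covariance matrix truncated to rank k = 2.
    rng = np.random.default_rng(6)
    X = rng.normal(size=(50, 5))
    S = np.cov(X, rowvar=False, bias=True)
    S_prime = truncate_spectrum(S, k=2)
    print(np.linalg.matrix_rank(S_prime))            # 2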

▶ A theoretical analysis would lead to an optimal choice of the tuning parameter k.
MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
