
Deep Learning Basics

Lecture 7: Factor Analysis


Princeton University COS 495
Instructor: Yingyu Liang
Supervised vs. Unsupervised
Math formulation for supervised learning
• Given training data $(x_i, y_i),\; 1 \le i \le n$, drawn i.i.d. from a distribution $D$

• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss
  $$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$$

• s.t. the expected loss is small
  $$L(f) = \mathbb{E}_{(x,y) \sim D}\left[\, l(f, x, y) \,\right]$$
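As a concrete illustration of the formulation above, here is a minimal sketch (assuming squared loss and a linear predictor, neither of which is specified on the slide) contrasting the empirical loss on a training sample with a Monte Carlo estimate of the expected loss on fresh data from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data distribution D: x ~ N(0, I), y = <w_true, x> + noise.
d, n = 5, 200
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

X_train, y_train = sample(n)

# A fixed predictor f(x) = <w, x> (here: least-squares fit on the training set).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def loss(X, y):
    # Squared loss l(f, x, y) = (f(x) - y)^2, averaged over the sample.
    return np.mean((X @ w - y) ** 2)

print("empirical loss L_hat(f):", loss(X_train, y_train))
X_test, y_test = sample(100_000)
print("estimated expected loss L(f):", loss(X_test, y_test))
```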
Unsupervised learning
• Given training data 𝑥𝑖 : 1 ≤ 𝑖 ≤ 𝑛 i.i.d. from distribution 𝐷
• Extract some “structure” from the data

• Do not have a general framework


• Typical unsupervised tasks:
• Summarization: clustering, dimension reduction
• Learning probabilistic models: latent variable model, density estimation
Principal Component Analysis (PCA)
High dimensional data
• Example 1: images

Dimension: 300x300 = 90,000


High dimensional data
• Example 2: documents
• Features:
• Unigram (count of each word): thousands
• Bigram (co-occurrence contextual information): millions

• Netflix survey: 480,189 users x 17,770 movies

           Movie 1  Movie 2  Movie 3  Movie 4  Movie 5  Movie 6  …
  User 1      5        ?        ?        1        3        ?
  User 2      ?        ?        3        1        2        5
  User 3      4        3        1        ?        5        1

Example from Nina Balcan
Principal Component Analysis (PCA)
• Data analysis point of view: a dimension reduction technique applied to a given
  set of high-dimensional data $\{x_i : 1 \le i \le n\}$

• Math point of view: eigen-decomposition of the covariance matrix (or
  singular value decomposition of the data matrix)

• Classic, commonly used tool


Principal Component Analysis (PCA)
• Extract hidden lower-dimensional structure from the data
• Try to capture as much of the variance as possible

• Computation: solved by singular value decomposition (SVD)


Principal Component Analysis (PCA)
• Definition: an orthogonal projection or transformation of the data
into a (typically lower dimensional) subspace so that the variance of
the projected data is maximized.

Figure from isomorphismes @stackexchange


Principal Component Analysis (PCA)
• An illustration of the projection to 1 dim
• Pay attention to the variance of the projected points

Figure from amoeba@stackexchange


Principal Component Analysis (PCA)
• Principal Components (PC) are directions that capture most of the
variance in the data

• First PC: direction of greatest variability in data


• Data points are most spread out when projected on the first PC compared to
any other direction
• Second PC: next direction of greatest variability, orthogonal to first PC
• Third PC: next direction of greatest variability, orthogonal to first and
second PC’s
•…
Math formulation
• Suppose the data are centered: $\sum_{i=1}^{n} x_i = 0$
• Then their projections onto any direction $v$ are also centered: $\sum_{i=1}^{n} v^T x_i = 0$

• First PC: maximize the variance of the projections
  $$\max_{v} \; \sum_{i=1}^{n} (v^T x_i)^2 \quad \text{s.t.} \quad v^T v = 1$$
  which is equivalent to
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$
  where the columns of $X$ are the data points
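A small numpy sketch (synthetic 2-D data assumed, not from the slides) that checks the algebraic identity above numerically and compares a few random unit directions by projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data; columns of X are the data points, as in the slide.
n = 500
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=n).T
X = X - X.mean(axis=1, keepdims=True)   # center: sum_i x_i = 0

def projected_variance(v):
    v = v / np.linalg.norm(v)           # enforce v^T v = 1
    return np.sum((v @ X) ** 2)         # sum_i (v^T x_i)^2

v = rng.normal(size=2)
v = v / np.linalg.norm(v)
# The two objectives on the slide coincide:
assert np.isclose(projected_variance(v), v @ (X @ X.T) @ v)

# The top eigenvector direction beats random directions.
evals, evecs = np.linalg.eigh(X @ X.T)
v1 = evecs[:, -1]                        # first PC
print(projected_variance(v1),
      max(projected_variance(rng.normal(size=2)) for _ in range(100)))
```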
Math formulation
• First PC:
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$
  where the columns of $X$ are the data points

• Solved via the Lagrangian: there exists $\lambda$ such that the problem becomes
  $$\max_{v} \; v^T X X^T v - \lambda\, v^T v$$
  Setting the derivative to zero:
  $$\frac{\partial}{\partial v} = 0 \;\Rightarrow\; (X X^T - \lambda I)\, v = 0 \;\Rightarrow\; X X^T v = \lambda v$$
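For completeness, the derivative step written out (standard matrix calculus, not spelled out on the slide):

$$\frac{\partial}{\partial v}\left( v^T X X^T v - \lambda\, v^T v \right) = 2\, X X^T v - 2\, \lambda v = 0 \quad\Longrightarrow\quad X X^T v = \lambda v$$

At such a stationary point the objective value is $v^T X X^T v = \lambda\, v^T v = \lambda$, so the maximum is attained at the eigenvector with the largest eigenvalue.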
Computation: Eigen-decomposition
• First PC: $X X^T v = \lambda v$

• $X X^T$: (proportional to) the covariance matrix of the centered data
• $v$: an eigenvector of the covariance matrix
• First PC: the first (top) eigenvector of the covariance matrix

• Top $k$ PC's: a similar argument shows they are the top $k$ eigenvectors
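A minimal sketch of this computation (numpy, synthetic data assumed): take the top $k$ eigenvectors of $X X^T$ for centered data stored as the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of X are the (centered) data points, matching the slide's convention.
d, n, k = 10, 300, 3
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]   # unequal variances
X = X - X.mean(axis=1, keepdims=True)

# Eigen-decomposition of X X^T; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
U = eigvecs[:, ::-1][:, :k]        # top-k eigenvectors = top-k principal components

H = U.T @ X                        # k-dimensional representation of each data point
print(U.shape, H.shape)            # (d, k), (k, n)
```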
Computation: Eigen-decomposition
• Top $k$ PC's: the top $k$ eigenvectors, i.e. $X X^T U = U \Lambda$,
  where $\Lambda$ is a diagonal matrix of the top eigenvalues and the columns of $U$
  are the corresponding eigenvectors
• The columns of $U$ are the left singular vectors of $X$

• Recall the SVD decomposition theorem:
  An $m \times n$ real matrix $M$ has a factorization $M = U \Sigma V^T$, where $U$ is an
  $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix
  with non-negative real numbers on the diagonal, and $V$ is an $n \times n$
  orthogonal matrix.
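The two computational routes agree. Here is a quick numpy check (synthetic data assumed) that the top left singular vectors of $X$ match the top eigenvectors of $X X^T$, with $\sigma_i^2 = \lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 8, 200, 3
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)     # centered data, columns are data points

# Route 1: eigen-decomposition of X X^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
U_eig = eigvecs[:, ::-1][:, :k]

# Route 2: SVD of X itself, X = U Sigma V^T.
U_svd, svals, Vt = np.linalg.svd(X, full_matrices=False)

# Same subspace (columns may differ by sign), and sigma_i^2 = lambda_i.
assert np.allclose(np.abs(U_eig.T @ U_svd[:, :k]), np.eye(k), atol=1e-6)
assert np.allclose(svals[:k] ** 2, eigvals[::-1][:k])
```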
Equivalent view: low rank approximation
• First PC maximizes variance:
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$

• Alternative viewpoint: find the vector $v$ such that projecting onto it yields the
  minimum mean squared reconstruction error
  $$\min_{v} \; \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2 \quad \text{s.t.} \quad v^T v = 1$$
Equivalent view: low rank approximation
• Alternative viewpoint: find the vector $v$ such that projecting onto it yields the
  minimum mean squared reconstruction error
  $$\min_{v} \; \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2 \quad \text{s.t.} \quad v^T v = 1$$

Figure from Nina Balcan
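A numpy sketch (synthetic data assumed) illustrating the equivalence: the direction that maximizes projected variance also minimizes reconstruction error, since $\|x_i\|^2 = \|v v^T x_i\|^2 + \|x_i - v v^T x_i\|^2$ for unit $v$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0], np.diag([4.0, 1.0, 0.25]), size=400).T
X = X - X.mean(axis=1, keepdims=True)

def recon_error(v):
    v = v / np.linalg.norm(v)
    P = np.outer(v, v)                          # projection onto span{v}
    return np.mean(np.sum((X - P @ X) ** 2, axis=0))

def proj_variance(v):
    v = v / np.linalg.norm(v)
    return np.sum((v @ X) ** 2)

# First PC from the eigen-decomposition.
v1 = np.linalg.eigh(X @ X.T)[1][:, -1]

# Any random direction has no more variance and no less reconstruction error.
for _ in range(100):
    v = rng.normal(size=3)
    assert proj_variance(v) <= proj_variance(v1) + 1e-9
    assert recon_error(v) >= recon_error(v1) - 1e-9
```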


Summary
• PCA: orthogonal projection that maximizes variance
• Low rank approximation: orthogonal projection that minimizes error
• Eigen-decomposition/SVD

• All equivalent for centered data


Sparse coding
A latent variable view of PCA
• Let $h_i = v^T x_i$
• Data point viewed as $x_i = v h_i + \text{noise}$

[Figure: a data point $x_i$ and the direction $v$ it is projected onto]
A latent variable view of PCA
• Consider the top $k$ PC's $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$
A latent variable view of PCA
• Consider the top $k$ PC's $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$

• PCA's structure assumption: $h$ is low dimensional. What about other assumptions?
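To make this latent-variable reading concrete, a small numpy sketch (synthetic data assumed, not from the slides): encode each point as $h_i = U^T x_i$ using the top-$k$ PC's and view $x_i$ as $U h_i$ plus a residual noise term.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 10, 500, 3
# Synthetic data that really lives near a k-dimensional subspace plus noise.
U_true = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = U_true @ rng.normal(size=(k, n)) * 3.0 + 0.1 * rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# Top-k PC's from the eigen-decomposition of X X^T.
U = np.linalg.eigh(X @ X.T)[1][:, ::-1][:, :k]

H = U.T @ X                 # latent codes h_i
X_hat = U @ H               # reconstruction U h_i
noise = X - X_hat           # the residual the slide calls "noise"
print("average ||noise||^2:", np.mean(np.sum(noise ** 2, axis=0)))
```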
Sparse coding
• Structure assumption: $h$ is sparse, i.e., $\|h\|_0$ is small
• The dimension of $h$ can be large
Sparse coding
• Latent variable probabilistic model view:
  $$x \mid h \;=\; W h + \tfrac{1}{\beta}\, \mathcal{N}(0, I), \qquad h \text{ is sparse}$$

• E.g., sparsity from a Laplacian prior: $p(h) = \frac{\lambda}{2} \exp\!\left(-\frac{\lambda}{2} \|h\|_1\right)$
Sparse coding
• Suppose $W$ is known. The MAP estimate of $h$ (maximizing the posterior) is
  $$h^* = \arg\max_h \; \log p(h \mid x)$$
  $$h^* = \arg\min_h \; \lambda \|h\|_1 + \beta \|x - W h\|_2^2$$
• Suppose both $W$ and $h$ are unknown.

• Typically alternate between updating $W$ and $h$ (a minimal sketch follows below)
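A minimal sketch of this alternating scheme (numpy, synthetic data; the slides do not prescribe a specific algorithm, so ISTA for the sparse codes and a least-squares dictionary update are assumptions here): given $W$, each code is updated by proximal gradient descent on $\lambda\|h\|_1 + \beta\|x - Wh\|_2^2$ from the previous slide; given the codes, $W$ is refit and its columns renormalized.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup (not from the slides): d-dim data, k-dim sparse codes.
d, k, n = 20, 50, 300
W_true = rng.normal(size=(d, k))
H_true = rng.normal(size=(k, n)) * (rng.random(size=(k, n)) < 0.1)   # sparse codes
X = W_true @ H_true + 0.01 * rng.normal(size=(d, n))

lam, beta = 0.1, 1.0

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def update_h(W, X, H, n_steps=100):
    """ISTA on  lambda*||h||_1 + beta*||x - W h||_2^2  for every column of X."""
    step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_steps):
        grad = 2 * beta * W.T @ (W @ H - X)
        H = soft_threshold(H - step * grad, step * lam)
    return H

def update_w(X, H):
    """Least-squares dictionary update, with columns rescaled to unit norm."""
    W = X @ H.T @ np.linalg.pinv(H @ H.T)
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)

# Alternate between the two updates.
W = rng.normal(size=(d, k))
W /= np.linalg.norm(W, axis=0, keepdims=True)
H = np.zeros((k, n))
for it in range(20):
    H = update_h(W, X, H)
    W = update_w(X, H)
    obj = lam * np.abs(H).sum() + beta * np.sum((X - W @ H) ** 2)
    if it % 5 == 0:
        print(f"iter {it:2d}  objective {obj:.2f}  "
              f"nonzeros/col {np.mean(np.abs(H) > 1e-8) * k:.1f}")
```

The soft-thresholding step is exactly the proximal operator of the $\ell_1$ penalty, which is what drives many code entries to exactly zero.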
Sparse coding
• Historical note: study on visual system

• Bruno A. Olshausen and David Field. "Emergence of simple-cell receptive field
  properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.
Project paper list
Supervised learning
• AlexNet: ImageNet Classification with Deep Convolutional Neural
Networks

• GoogLeNet: Going Deeper with Convolutions

• Residual Network: Deep Residual Learning for Image Recognition


Unsupervised learning
• Deep belief networks: A fast learning algorithm for deep belief nets

• Reducing the Dimensionality of Data with Neural Networks

• Variational autoencoder: Auto-Encoding Variational Bayes

• Generative Adversarial Nets


Recurrent neural networks
• Long Short-Term Memory

• Memory networks

• Sequence to Sequence Learning with Neural Networks


You choose the paper that interests you!
• Need to consult with TA
• Heavier responsibility on the student side if you customize the project

• Check recent papers in the conferences ICML, NIPS, ICLR


• Check papers by leading researchers: Hinton, LeCun, Bengio, etc.
• Explore whether deep learning can be applied to your application

• Browsing arXiv is not recommended: too many deep learning papers
