
Deep Learning Basics

Lecture 7: Factor Analysis


Princeton University COS 495
Instructor: Yingyu Liang
Supervised vs. Unsupervised
Math formulation for supervised learning
• Given training data $(x_i, y_i),\; 1 \le i \le n$, drawn i.i.d. from a distribution $D$

• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss
  $$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$$

• s.t. the expected loss is small
  $$L(f) = \mathbb{E}_{(x,y) \sim D}\left[\, l(f, x, y) \,\right]$$
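As a concrete illustration of the formulation above, here is a minimal sketch (assuming squared loss and a linear predictor, neither of which is specified on the slide) contrasting the empirical loss on a training sample with a Monte Carlo estimate of the expected loss on fresh data from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data distribution D: x ~ N(0, I), y = <w_true, x> + noise.
d, n = 5, 200
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

X_train, y_train = sample(n)

# A fixed predictor f(x) = <w, x> (here: least-squares fit on the training set).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def loss(X, y):
    # Squared loss l(f, x, y) = (f(x) - y)^2, averaged over the sample.
    return np.mean((X @ w - y) ** 2)

print("empirical loss L_hat(f):", loss(X_train, y_train))
X_test, y_test = sample(100_000)
print("estimated expected loss L(f):", loss(X_test, y_test))
```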
Unsupervised learning
• Given training data 𝑥𝑖 : 1 ≤ 𝑖 ≤ 𝑛 i.i.d. from distribution 𝐷
• Extract some “structure” from the data

• Do not have a general framework


• Typical unsupervised tasks:
• Summarization: clustering, dimension reduction
• Learning probabilistic models: latent variable model, density estimation
Principal Component Analysis (PCA)
High dimensional data
• Example 1: images

Dimension: 300x300 = 90,000


High dimensional data
• Example 2: documents
• Features:
• Unigram (count of each word): thousands
• Bigram (co-occurrence contextual information): millions

• Netflix survey: 480,189 users x 17,770 movies

           Movie 1  Movie 2  Movie 3  Movie 4  Movie 5  Movie 6  …
  User 1      5        ?        ?        1        3        ?
  User 2      ?        ?        3        1        2        5
  User 3      4        3        1        ?        5        1

Example from Nina Balcan
Principal Component Analysis (PCA)
• Data analysis point of view: a dimension reduction technique applied to a given
  set of high-dimensional data $\{x_i : 1 \le i \le n\}$

• Math point of view: eigen-decomposition of the covariance matrix (or
  singular value decomposition of the data matrix)

• Classic, commonly used tool


Principal Component Analysis (PCA)
• Extract hidden lower-dimensional structure from the data
• Try to capture as much of the variance as possible

• Computation: solved by singular value decomposition (SVD)


Principal Component Analysis (PCA)
• Definition: an orthogonal projection or transformation of the data
into a (typically lower dimensional) subspace so that the variance of
the projected data is maximized.

Figure from isomorphismes @stackexchange


Principal Component Analysis (PCA)
• An illustration of the projection to 1 dim
• Pay attention to the variance of the projected points

Figure from amoeba@stackexchange


Principal Component Analysis (PCA)
• Principal Components (PC) are directions that capture most of the
variance in the data

• First PC: direction of greatest variability in data


• Data points are most spread out when projected on the first PC compared to
any other direction
• Second PC: next direction of greatest variability, orthogonal to first PC
• Third PC: next direction of greatest variability, orthogonal to first and
second PC’s
•…
Math formulation
• Suppose the data are centered: $\sum_{i=1}^{n} x_i = 0$
• Then their projections onto any direction $v$ are also centered: $\sum_{i=1}^{n} v^T x_i = 0$

• First PC: maximize the variance of the projections
  $$\max_{v} \; \sum_{i=1}^{n} (v^T x_i)^2 \quad \text{s.t.} \quad v^T v = 1$$
  which is equivalent to
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$
  where the columns of $X$ are the data points
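A small numpy sketch (synthetic 2-D data assumed, not from the slides) that checks the algebraic identity above numerically and compares a few random unit directions by projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data; columns of X are the data points, as in the slide.
n = 500
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=n).T
X = X - X.mean(axis=1, keepdims=True)   # center: sum_i x_i = 0

def projected_variance(v):
    v = v / np.linalg.norm(v)           # enforce v^T v = 1
    return np.sum((v @ X) ** 2)         # sum_i (v^T x_i)^2

v = rng.normal(size=2)
v = v / np.linalg.norm(v)
# The two objectives on the slide coincide:
assert np.isclose(projected_variance(v), v @ (X @ X.T) @ v)

# The top eigenvector direction beats random directions.
evals, evecs = np.linalg.eigh(X @ X.T)
v1 = evecs[:, -1]                        # first PC
print(projected_variance(v1),
      max(projected_variance(rng.normal(size=2)) for _ in range(100)))
```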
Math formulation
• First PC:
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$
  where the columns of $X$ are the data points

• Solved via the Lagrangian: there exists $\lambda$ such that the problem becomes
  $$\max_{v} \; v^T X X^T v - \lambda\, v^T v$$
  Setting the derivative to zero:
  $$\frac{\partial}{\partial v} = 0 \;\Rightarrow\; (X X^T - \lambda I)\, v = 0 \;\Rightarrow\; X X^T v = \lambda v$$
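For completeness, the derivative step written out (standard matrix calculus, not spelled out on the slide):

$$\frac{\partial}{\partial v}\left( v^T X X^T v - \lambda\, v^T v \right) = 2\, X X^T v - 2\, \lambda v = 0 \quad\Longrightarrow\quad X X^T v = \lambda v$$

At such a stationary point the objective value is $v^T X X^T v = \lambda\, v^T v = \lambda$, so the maximum is attained at the eigenvector with the largest eigenvalue.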
Computation: Eigen-decomposition
• First PC: $X X^T v = \lambda v$

• $X X^T$: (proportional to) the covariance matrix of the centered data
• $v$: an eigenvector of the covariance matrix
• First PC: the first (top) eigenvector of the covariance matrix

• Top $k$ PC's: a similar argument shows they are the top $k$ eigenvectors
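A minimal sketch of this computation (numpy, synthetic data assumed): take the top $k$ eigenvectors of $X X^T$ for centered data stored as the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of X are the (centered) data points, matching the slide's convention.
d, n, k = 10, 300, 3
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]   # unequal variances
X = X - X.mean(axis=1, keepdims=True)

# Eigen-decomposition of X X^T; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
U = eigvecs[:, ::-1][:, :k]        # top-k eigenvectors = top-k principal components

H = U.T @ X                        # k-dimensional representation of each data point
print(U.shape, H.shape)            # (d, k), (k, n)
```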
Computation: Eigen-decomposition
• Top $k$ PC's: the top $k$ eigenvectors, i.e. $X X^T U = U \Lambda$,
  where $\Lambda$ is a diagonal matrix of the top eigenvalues and the columns of $U$
  are the corresponding eigenvectors
• The columns of $U$ are the left singular vectors of $X$

• Recall the SVD decomposition theorem:
  An $m \times n$ real matrix $M$ has a factorization $M = U \Sigma V^T$, where $U$ is an
  $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix
  with non-negative real numbers on the diagonal, and $V$ is an $n \times n$
  orthogonal matrix.
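The two computational routes agree. Here is a quick numpy check (synthetic data assumed) that the top left singular vectors of $X$ match the top eigenvectors of $X X^T$, with $\sigma_i^2 = \lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 8, 200, 3
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)     # centered data, columns are data points

# Route 1: eigen-decomposition of X X^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
U_eig = eigvecs[:, ::-1][:, :k]

# Route 2: SVD of X itself, X = U Sigma V^T.
U_svd, svals, Vt = np.linalg.svd(X, full_matrices=False)

# Same subspace (columns may differ by sign), and sigma_i^2 = lambda_i.
assert np.allclose(np.abs(U_eig.T @ U_svd[:, :k]), np.eye(k), atol=1e-6)
assert np.allclose(svals[:k] ** 2, eigvals[::-1][:k])
```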
Equivalent view: low rank approximation
• First PC maximizes variance:
  $$\max_{v} \; v^T X X^T v \quad \text{s.t.} \quad v^T v = 1$$

• Alternative viewpoint: find the vector $v$ such that projecting onto it yields the
  minimum mean squared reconstruction error
  $$\min_{v} \; \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2 \quad \text{s.t.} \quad v^T v = 1$$
Equivalent view: low rank approximation
• Alternative viewpoint: find the vector $v$ such that projecting onto it yields the
  minimum mean squared reconstruction error
  $$\min_{v} \; \frac{1}{n} \sum_{i=1}^{n} \| x_i - v v^T x_i \|^2 \quad \text{s.t.} \quad v^T v = 1$$

Figure from Nina Balcan
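A numpy sketch (synthetic data assumed) illustrating the equivalence: the direction that maximizes projected variance also minimizes reconstruction error, since $\|x_i\|^2 = \|v v^T x_i\|^2 + \|x_i - v v^T x_i\|^2$ for unit $v$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0], np.diag([4.0, 1.0, 0.25]), size=400).T
X = X - X.mean(axis=1, keepdims=True)

def recon_error(v):
    v = v / np.linalg.norm(v)
    P = np.outer(v, v)                          # projection onto span{v}
    return np.mean(np.sum((X - P @ X) ** 2, axis=0))

def proj_variance(v):
    v = v / np.linalg.norm(v)
    return np.sum((v @ X) ** 2)

# First PC from the eigen-decomposition.
v1 = np.linalg.eigh(X @ X.T)[1][:, -1]

# Any random direction has no more variance and no less reconstruction error.
for _ in range(100):
    v = rng.normal(size=3)
    assert proj_variance(v) <= proj_variance(v1) + 1e-9
    assert recon_error(v) >= recon_error(v1) - 1e-9
```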


Summary
• PCA: orthogonal projection that maximizes variance
• Low rank approximation: orthogonal projection that minimizes error
• Eigen-decomposition/SVD

• All equivalent for centered data


Sparse coding
A latent variable view of PCA
• Let $h_i = v^T x_i$
• Data point viewed as $x_i = v h_i + \text{noise}$

[Figure: a data point $x_i$ and the direction $v$ it is projected onto]
A latent variable view of PCA
• Consider the top $k$ PC's $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$
A latent variable view of PCA
• Consider the top $k$ PC's $U$
• Let $h_i = U^T x_i$
• Data point viewed as $x_i = U h_i + \text{noise}$

• PCA's structure assumption: $h$ is low dimensional. What about other assumptions?
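To make this latent-variable reading concrete, a small numpy sketch (synthetic data assumed, not from the slides): encode each point as $h_i = U^T x_i$ using the top-$k$ PC's and view $x_i$ as $U h_i$ plus a residual noise term.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 10, 500, 3
# Synthetic data that really lives near a k-dimensional subspace plus noise.
U_true = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = U_true @ rng.normal(size=(k, n)) * 3.0 + 0.1 * rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# Top-k PC's from the eigen-decomposition of X X^T.
U = np.linalg.eigh(X @ X.T)[1][:, ::-1][:, :k]

H = U.T @ X                 # latent codes h_i
X_hat = U @ H               # reconstruction U h_i
noise = X - X_hat           # the residual the slide calls "noise"
print("average ||noise||^2:", np.mean(np.sum(noise ** 2, axis=0)))
```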
Sparse coding
• Structure assumption: $h$ is sparse, i.e., $\|h\|_0$ is small
• The dimension of $h$ can be large
Sparse coding
• Latent variable probabilistic model view:
  $$x \mid h \;=\; W h + \tfrac{1}{\beta}\, \mathcal{N}(0, I), \qquad h \text{ is sparse}$$

• E.g., sparsity from a Laplacian prior: $p(h) = \frac{\lambda}{2} \exp\!\left(-\frac{\lambda}{2} \|h\|_1\right)$
Sparse coding
• Suppose $W$ is known. The MAP estimate of $h$ (maximizing the posterior) is
  $$h^* = \arg\max_h \; \log p(h \mid x)$$
  $$h^* = \arg\min_h \; \lambda \|h\|_1 + \beta \|x - W h\|_2^2$$
• Suppose both $W$ and $h$ are unknown.

• Typically alternate between updating $W$ and $h$ (a minimal sketch follows below)
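A minimal sketch of this alternating scheme (numpy, synthetic data; the slides do not prescribe a specific algorithm, so ISTA for the sparse codes and a least-squares dictionary update are assumptions here): given $W$, each code is updated by proximal gradient descent on $\lambda\|h\|_1 + \beta\|x - Wh\|_2^2$ from the previous slide; given the codes, $W$ is refit and its columns renormalized.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup (not from the slides): d-dim data, k-dim sparse codes.
d, k, n = 20, 50, 300
W_true = rng.normal(size=(d, k))
H_true = rng.normal(size=(k, n)) * (rng.random(size=(k, n)) < 0.1)   # sparse codes
X = W_true @ H_true + 0.01 * rng.normal(size=(d, n))

lam, beta = 0.1, 1.0

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def update_h(W, X, H, n_steps=100):
    """ISTA on  lambda*||h||_1 + beta*||x - W h||_2^2  for every column of X."""
    step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_steps):
        grad = 2 * beta * W.T @ (W @ H - X)
        H = soft_threshold(H - step * grad, step * lam)
    return H

def update_w(X, H):
    """Least-squares dictionary update, with columns rescaled to unit norm."""
    W = X @ H.T @ np.linalg.pinv(H @ H.T)
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)

# Alternate between the two updates.
W = rng.normal(size=(d, k))
W /= np.linalg.norm(W, axis=0, keepdims=True)
H = np.zeros((k, n))
for it in range(20):
    H = update_h(W, X, H)
    W = update_w(X, H)
    obj = lam * np.abs(H).sum() + beta * np.sum((X - W @ H) ** 2)
    if it % 5 == 0:
        print(f"iter {it:2d}  objective {obj:.2f}  "
              f"nonzeros/col {np.mean(np.abs(H) > 1e-8) * k:.1f}")
```

The soft-thresholding step is exactly the proximal operator of the $\ell_1$ penalty, which is what drives many code entries to exactly zero.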
Sparse coding
• Historical note: study on visual system

• Bruno A. Olshausen and David Field. "Emergence of simple-cell receptive field
  properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.
Project paper list
Supervised learning
• AlexNet: ImageNet Classification with Deep Convolutional Neural
Networks

• GoogLeNet: Going Deeper with Convolutions

• Residual Network: Deep Residual Learning for Image Recognition


Unsupervised learning
• Deep belief networks: A fast learning algorithm for deep belief nets

• Reducing the Dimensionality of Data with Neural Networks

• Variational autoencoder: Auto-Encoding Variational Bayes

• Generative Adversarial Nets


Recurrent neural networks
• Long Short-Term Memory

• Memory networks

• Sequence to Sequence Learning with Neural Networks


You choose the paper that interests you!
• Need to consult with TA
• Heavier responsibility on the student side if you customize the project

• Check recent papers in the conferences ICML, NIPS, ICLR


• Check papers by leading researchers: Hinton, LeCun, Bengio, etc.
• Explore whether deep learning can be applied to your application

• Browsing arXiv is not recommended: too many deep learning papers
