
Deep Learning

Jiseob Kim ( [email protected])

Artificial Intelligence Class of 2016 Spring


Dept. of Computer Science and Engineering
Seoul National University

1
History of Neural Network Research

1986 – Neural network / Back propagation
 • Solve general learning problems
 • Tied with biological systems
 But it was given up…
 • Hard to train
 • Insufficient computational resources
 • Small training sets
 • Does not work well

Then: SVM, Boosting, Decision tree, KNN, …
 • Shallow models
 • Loose tie with biological systems
 • Specific methods for specific tasks
 • Handcrafted features (GMM-HMM, SIFT, LBP, HOG)

2006 – Deep belief net (Science)
 • Unsupervised & layer-wise pre-training
 • Better designs for modeling and training (normalization, nonlinearity, dropout)
 • Feature learning
 • New development of computer architectures: GPU, multi-core computer systems
 • Large scale databases

2011 – Speech; 2012 – deep learning results
 Object recognition over 1,000,000 images and 1,000 categories (ImageNet 2012):

 Rank  Name         Error    Description
 1     U. Toronto   0.15315  Deep Conv Net
 2     U. Tokyo     0.26172  Hand-crafted features and learning models
 3     U. Oxford    0.26979  Hand-crafted features and learning models
 4     Xerox/INRIA  0.27058  Hand-crafted features and learning models
                             (the hand-crafted features are the bottleneck)

 How Many Computers to Identify a Cat? 16,000 CPU cores (or 2 GPUs)
 Kruger et al. TPAMI’13
Slides from Wanli Ouyang [email protected]
Contents

 Neural Networks
 Multilayer Perceptron Structure
 Learning Algorithm based on Back Propagation
 Deep Belief Network
 Restricted Boltzmann Machines
 Deep Learning (Deep Belief Network)
 Convolutional Neural Networks (CNN)
 CNN Structure and Learning
 Applications
NEURAL NETWORKS

Slides by Jiseob Kim


[email protected]
Classification Problem

 Problem of finding label Y given data X


 ex1) x: face image, y: person’s name
 ex2) x: blood sugar measurement, blood pressure, …
y: diagnosis of diabetes
 ex3) x: voice, y: sentence corresponding to the voice

 x: D-dimensional vector, y: integer (Discrete)

 Famous pattern recognition algorithms


 Support Vector Machine
 Decision Tree
 K-Nearest Neighbor
 Multi-Layer Perceptron (Artificial Neural Network; 인공신경망)
Perceptron (1/2)
Perceptron (2/2)

 A perceptron computes a weighted sum of its inputs:  s = w1*x1 + w2*x2 + b
 The decision boundary is the line  w1*x1 + w2*x2 + b = 0:
  > 0: classify as positive,  < 0: classify as negative

 (Figure: a unit with inputs x1, x2, weights w1, w2, bias b, and the
  corresponding decision boundary in the (x1, x2) plane)
Parameter Learning in Perceptron

start:
The weight vector w is generated randomly
test:
A vector x ∈ P ∪ N is selected randomly,
If x∈P and w·x>0 goto test,
If x∈P and w·x≤0 goto add,
If x ∈ N and w · x < 0 go to test,
If x ∈ N and w · x ≥ 0 go to subtract.
add:
Set w = w+x, goto test
subtract:
Set w = w-x, goto test
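
A minimal NumPy sketch of this learning rule (the AND toy data, the bias handling, and the iteration cap are illustrative assumptions, not part of the original slides):

    import numpy as np

    def train_perceptron(P, N, max_iters=1000, seed=0):
        """P: positive examples, N: negative examples; one input vector per row."""
        rng = np.random.default_rng(seed)
        # fold the bias b into w by appending a constant 1 to every input
        P = np.hstack([P, np.ones((len(P), 1))])
        N = np.hstack([N, np.ones((len(N), 1))])
        X = np.vstack([P, N])
        positive = np.array([True] * len(P) + [False] * len(N))
        w = rng.normal(size=X.shape[1])           # start: random weight vector
        for _ in range(max_iters):
            i = rng.integers(len(X))              # test: pick a random x in P u N
            x = X[i]
            if positive[i] and w @ x <= 0:
                w = w + x                         # add
            elif not positive[i] and w @ x >= 0:
                w = w - x                         # subtract
        return w

    # usage: learn the AND function (linearly separable)
    P = np.array([[1.0, 1.0]])
    N = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
    w = train_perceptron(P, N)
    print([bool(w @ np.append(x, 1.0) > 0) for x in np.vstack([P, N])])
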
Sigmoid Unit

 Classic Perceptron  vs.  Sigmoid Unit

 The sigmoid function is differentiable:
  ∂σ(x)/∂x = σ(x)(1 − σ(x))
Learning Algorithm of Sigmoid Unit

 Loss Function (target d, unit output f):
  ε = (d − f)²

 Gradient Descent Update
  ∂ε/∂W = −2(d − f) (∂f/∂s) X = −2(d − f) f(1 − f) X

  where  f(s) = 1 / (1 + e^(−s)),   f'(s) = f(s)(1 − f(s))

 Update rule:  W ← W + c (d − f) f(1 − f) X
Need for Multiple Units and Multiple Layers

 Multiple boundaries are needed (e.g. XOR problem)  →  Multiple Units

 More complex regions are needed (e.g. Polygons)  →  Multiple Layers
Structure of Multilayer Perceptron
(MLP; Artificial Neural Network)

 (Figure: a layered network mapping Input to Output)
Learning Parameters of MLP

 Loss Function (target d, unit output f):
  ε = (d − f)²
  We have the same Loss Function as before,
  but the number of parameters is now much larger
  (a weight for each layer and each unit)
  To use Gradient Descent, we need to calculate the gradient for all the parameters

 Recursive Computation of Gradients
  Computing the loss-gradient of the top-layer weights is the same as before
  Using the chain rule, we can compute the loss-gradient of lower-layer
  weights recursively (Back Propagation)
Back Propagation Learning Algorithm (1/3)

 Gradients of top-layer weights and update rule
  ε = (d − f)²
  ∂ε/∂W = −2(d − f) (∂f/∂s) X = −2(d − f) f(1 − f) X

 Gradient Descent update rule:  W ← W + c (d − f) f(1 − f) X

 Store an intermediate value delta for later use with the chain rule:
  δ^(top) = (d − f) ∂f/∂s = (d − f) f(1 − f)
Back Propagation Learning Algorithm (2/3)

 Gradients of lower-layer weights

 Weighted sum:  s_i^(j) = X^(j−1) · W_i^(j)

  ∂ε/∂W_i^(j) = (∂ε/∂s_i^(j)) (∂s_i^(j)/∂W_i^(j)) = (∂ε/∂s_i^(j)) X^(j−1)
              = −2(d − f) (∂f/∂s_i^(j)) X^(j−1) = −2 δ_i^(j) X^(j−1)

 Local gradient:  δ_i^(j) = (d − f) ∂f/∂s_i^(j)

 Gradient Descent update rule for lower-layer weights:
  W_i^(j) ← W_i^(j) + c δ_i^(j) X^(j−1)


Back Propagation Learning Algorithm (3/3)

 Applying the chain rule, a recursive relation between the delta’s:

  δ_i^(j) = f_i^(j) (1 − f_i^(j)) Σ_{l=1}^{m_(j+1)} δ_l^(j+1) w_il^(j+1)

Algorithm: Back Propagation

1. Randomly initialize weight parameters
2. Calculate the activations of all units (with input data)
3. Calculate the top-layer delta
4. Back-propagate the delta from the top layer to the bottom
5. Calculate the actual gradient of all units using the delta’s
6. Update weights using the Gradient Descent rule
7. Repeat 2~6 until convergence
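
A compact NumPy sketch of steps 1–7 for a one-hidden-layer MLP; the XOR toy data, layer sizes, and learning rate are illustrative assumptions:

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    rng = np.random.default_rng(0)

    # XOR toy problem (illustrative)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d = np.array([[0], [1], [1], [0]], dtype=float)

    # 1. randomly initialize weights (2 inputs -> 4 hidden -> 1 output)
    W1, b1 = rng.normal(scale=1.0, size=(2, 4)), np.zeros(4)
    W2, b2 = rng.normal(scale=1.0, size=(4, 1)), np.zeros(1)
    c = 0.5                                    # learning rate

    for epoch in range(20000):
        # 2. forward pass: activations of all units
        h = sigmoid(X @ W1 + b1)               # hidden layer outputs
        f = sigmoid(h @ W2 + b2)               # top-layer outputs
        # 3. top-layer delta: (d - f) f (1 - f)
        delta2 = (d - f) * f * (1 - f)
        # 4. back-propagate: delta_i = f_i (1 - f_i) * sum_l delta_l w_il
        delta1 = h * (1 - h) * (delta2 @ W2.T)
        # 5.-6. gradients from the deltas, then gradient-descent updates
        W2 += c * h.T @ delta2;  b2 += c * delta2.sum(axis=0)
        W1 += c * X.T @ delta1;  b1 += c * delta1.sum(axis=0)

    # 7. (a fixed number of epochs stands in for "until convergence")
    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # typically ~[0, 1, 1, 0]
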
Applications

 Almost All Classification Problems


 Face Recognition
 Object Recognition
 Voice Recognition
 Spam mail Detection
 Disease Detection
 etc.
Limitations and Breakthrough

 Limitations
  Back Propagation barely changes lower-layer parameters (Vanishing Gradient)
  Therefore, Deep Networks cannot be fully (effectively) trained with Back Propagation

 (Figure: target y vs. output y'; the error is back-propagated from the output toward the input x)

 Breakthrough
  Deep Belief Networks (Unsupervised Pre-training)
  Convolutional Neural Networks (Reducing Redundant Parameters)
  Rectified Linear Unit (Constant Gradient Propagation)
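
A small numerical sketch of the vanishing-gradient problem and why ReLU helps: back-propagating through many sigmoid layers multiplies the gradient by f'(s) ≤ 0.25 at every layer, while ReLU passes it through unchanged wherever the unit is active. The depth, width, and random weights are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width = 20, 50

    def backprop_gradient_norm(activation, activation_grad):
        """Push a unit gradient back through `depth` random layers."""
        x = rng.normal(size=width)
        Ws, fprimes = [], []
        for _ in range(depth):                 # forward pass, remembering f'(s)
            W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
            s = W @ x
            x = activation(s)
            Ws.append(W)
            fprimes.append(activation_grad(s))
        g = np.ones(width)                     # gradient arriving at the top layer
        for W, fp in zip(reversed(Ws), reversed(fprimes)):
            g = (W.T @ g) * fp                 # chain rule: one factor of f'(s) per layer
        return np.linalg.norm(g)

    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    print("sigmoid:", backprop_gradient_norm(sigmoid, lambda s: sigmoid(s) * (1 - sigmoid(s))))
    print("relu:   ", backprop_gradient_norm(lambda s: np.maximum(s, 0.0),
                                             lambda s: (s > 0).astype(float)))
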
DEEP BELIEF NETWORKS

Slides by Jiseob Kim


[email protected]
Motivation

 Idea:
 Greedy Layer-wise training
 Pre-training + Fine tuning
 Contrastive Divergence
Restricted Boltzmann Machine (RBM)

 Energy-Based Model (hidden units h1 … h5, visible units x1, x2, x3, weights W)

 Joint probability:
  P(x, h) = e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))

 Marginal probability (likelihood):
  P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))

 Conditional probability:
  P(h | x) = Π_i P(h_i | x),   P(h_i = 1 | x) = σ(c_i + W_i· · x)
  P(x | h) = Π_j P(x_j | h),   P(x_j = 1 | h) = σ(b_j + W’_·j · h)

 Energy function:
  E(x, h) = b' x + c' h + h' Wx

 Remark:
  • Conditional Independence
  • The Conditional Probability is the same as in a Neural Network
Unsupervised Learning of RBM

 Maximum Likelihood
  Use Gradient Descent

 Likelihood of a data point:
  P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))

 Gradient of the (negative log-)likelihood L(X; θ) with respect to w_ij:
  ∂L(X; θ)/∂w_ij = ∫ p(x; θ) ∂log f(x; θ)/∂θ dx − (1/K) Σ_{k=1}^K ∂log f(x^(k); θ)/∂θ
                 = ⟨x_i h_j⟩_{p(x,θ)} − ⟨x_i h_j⟩_X
                 = ⟨x_i h_j⟩_∞ − ⟨x_i h_j⟩_0
                 ≈ ⟨x_i h_j⟩_1 − ⟨x_i h_j⟩_0

  ⟨·⟩_{p(x,θ)}: expectation under the distribution of the model
  ⟨·⟩_X: expectation under the distribution of the dataset

 (Figure: a Gibbs chain over the visible and hidden units at t = 0, 1, 2, …, ∞;
  the sample at t = ∞ is “a fantasy” from the model)
Contrastive Divergence (CD) Learning of RBM parameters

 k-Contrastive Divergence Trick
  From the previous slide, to get the distribution of the model we would need
  to run many Gibbs sampling steps
  And this is per single parameter update
  Therefore, we take the sample after only k steps, where in practice k = 1 is sufficient

 (Figure: Gibbs chain v → h → v → h … at t = 0, 1, 2, …, ∞; the state after
  k steps is taken as the sample of the model distribution, instead of the
  “fantasy” at t = ∞)
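
A minimal NumPy sketch of CD-1 for a binary RBM, using the update ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_1 together with the conditionals P(h_i = 1 | x) and P(x_j = 1 | h) from the earlier slide; layer sizes, learning rate, and the toy data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    n_visible, n_hidden, lr = 6, 4, 0.1
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_visible)                    # visible biases
    c = np.zeros(n_hidden)                     # hidden biases

    # toy binary data (illustrative): two repeating patterns
    X = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

    for epoch in range(500):
        for x0 in X:
            # positive phase: P(h_i = 1 | x) = sigmoid(c_i + W_i . x), then sample h0
            p_h0 = sigmoid(c + x0 @ W)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            # negative phase (k = 1): reconstruct x1 ~ P(x | h0), then P(h | x1)
            p_x1 = sigmoid(b + W @ h0)
            x1 = (rng.random(n_visible) < p_x1).astype(float)
            p_h1 = sigmoid(c + x1 @ W)
            # CD-1 update: <x h>_0 - <x h>_1 (plus analogous bias updates)
            W += lr * (np.outer(x0, p_h0) - np.outer(x1, p_h1))
            b += lr * (x0 - x1)
            c += lr * (p_h0 - p_h1)

    # reconstruct one training pattern through a single Gibbs step
    h = (rng.random(n_hidden) < sigmoid(c + X[0] @ W)).astype(float)
    print(np.round(sigmoid(b + W @ h), 2))     # close to the first pattern after training
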


Effect of Unsupervised Training

 Unsupervised Training makes the RBM successfully catch the essential patterns

 RBM trained on MNIST hand-written digit data:
 Each cell shows the pattern each hidden node encodes
Deep Belief Network (DBN)

 Deep Belief Network (Deep Bayesian Network)
  A Bayesian Network that has a similar structure to a Neural Network
  A generative model
  Can also be used as a classifier (with an additional classifier at the top layer)
  Resolves gradient vanishing by Pre-training
  There are two modes (Classifier & Auto-Encoder), but we only consider the Classifier here
Learning Algorithm of DBN

 DBN as a stack of RBMs

 (Figure: the layers x, h1, h2, h3 of the DBN, each adjacent pair treated as
  an RBM with weights W, and a classifier attached on top of h3)

 1. Regard each pair of adjacent layers as an RBM
 2. Layer-wise pre-train each RBM in an unsupervised way
 3. Attach the classifier and fine-tune the whole network in a supervised way
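
A schematic NumPy sketch of this procedure: greedy layer-wise CD-1 pre-training of a stack of RBMs, whose weights would then initialize an MLP that is fine-tuned with back propagation (as in the earlier back-propagation sketch). Layer sizes, learning rate, and the toy data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    def pretrain_rbm(data, n_hidden, lr=0.1, epochs=200):
        """CD-1 pre-training of one RBM; returns (W, visible bias, hidden bias)."""
        n_visible = data.shape[1]
        W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for x0 in data:
                p_h0 = sigmoid(c + x0 @ W)
                h0 = (rng.random(n_hidden) < p_h0).astype(float)
                x1 = sigmoid(b + W @ h0)              # mean-field reconstruction
                p_h1 = sigmoid(c + x1 @ W)
                W += lr * (np.outer(x0, p_h0) - np.outer(x1, p_h1))
                b += lr * (x0 - x1)
                c += lr * (p_h0 - p_h1)
        return W, b, c

    # toy binary data and a 6-4-3 stack (illustrative)
    X = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
    layer_sizes = [4, 3]

    weights, data = [], X
    for n_hidden in layer_sizes:                      # 2. layer-wise pre-training
        W, b, c = pretrain_rbm(data, n_hidden)
        weights.append((W, c))
        data = sigmoid(c + data @ W)                  # hidden activations feed the next RBM

    # 3. the pre-trained (W, c) pairs initialize an MLP; a classifier layer is then
    #    attached on top and the whole network is fine-tuned with back propagation.
    print([W.shape for W, _ in weights])
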
Viewing Learning as Wake-Sleep Algorithm
Effect of Unsupervised Pre-Training in DBN (1/2)

Erhan et al., AISTATS 2009

28
Effect of Unsupervised Pre-Training in DBN (2/2)

without pre-training with pre-training

29
Internal Representation of DBN

30
Representation of Higher Layers

 Higher layers have more abstract representations

 Interpolating between different images is not desirable in lower layers,
 but natural in higher layers

Bengio et al., ICML 2013


Inference Algorithm of DBN

 As DBN is a generative model, we can also regenerate the data
  From the top layer to the bottom, conduct Gibbs sampling to generate the data samples

 (Figure: occluded inputs and the regenerated data)

Lee, Ng et al., ICML 2009
Applications

 Nowadays, CNN outperforms DBN for Image or Speech data
 However, if there is no topological information, DBN is still a good choice
 Also, if a generative model is needed, DBN is used

 (Figure: generated face patches)

Tang, Srivastava, Salakhutdinov, NIPS 2014
CONVOLUTIONAL NEURAL NETWORKS
Slides by Jiseob Kim
[email protected]
Motivation
 Idea:
  A fully connected structure has too many parameters to learn
  It is efficient to learn local patterns when there is a geometrical /
  topological structure between features, such as image data or voice data
  (spectrograms)

 (Figure: the same local pattern appearing in Image 1 and Image 2 — a DBN
  treats them as different data, while a CNN treats them as the same data
  through its shared kernels)
Structure of Convolutional Neural Network (CNN)

 Higher features are formed by repeated Convolution and Pooling (Subsampling)
  Convolution obtains a certain Feature from a local area
  Pooling reduces Dimension, while obtaining a Translation-invariant Feature

http://parse.ele.tue.nl/education/cluster2
Convolution Layer

 The Kernel detects a pattern:

  1 0 1
  0 1 0
  1 0 1

 The resulting value indicates:
  How much the pattern matches at each region
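
A small NumPy sketch of applying this 3x3 kernel over an image (implemented as cross-correlation, as is common in CNN libraries); the 5x5 input image is an illustrative assumption:

    import numpy as np

    kernel = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]], dtype=float)

    def conv2d_valid(image, kernel):
        """Slide the kernel over the image and take the dot product at each position."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # illustrative 5x5 input containing the kernel's pattern in its upper-left corner
    image = np.zeros((5, 5))
    image[0:3, 0:3] = kernel
    response = conv2d_valid(image, kernel)
    print(response)    # the largest value appears where the pattern matches best
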
Max-Pooling Layer

 The Pooling Layer summarizes the results of the Convolution Layer
  e.g.) a 10x10 result is summarized into 1 cell

 The result of the Pooling Layer is Translation-invariant
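
A matching NumPy sketch of non-overlapping max-pooling over such a feature map; the 2x2 pool size and the toy feature map are illustrative assumptions:

    import numpy as np

    def max_pool(feature_map, pool=2):
        """Summarize each non-overlapping pool x pool block by its maximum."""
        h, w = feature_map.shape
        h, w = h - h % pool, w - w % pool      # crop to a multiple of the pool size
        blocks = feature_map[:h, :w].reshape(h // pool, pool, w // pool, pool)
        return blocks.max(axis=(1, 3))

    feature_map = np.array([[1., 5., 2., 0.],
                            [0., 3., 1., 4.],
                            [7., 0., 2., 2.],
                            [1., 1., 0., 6.]])
    print(max_pool(feature_map))   # [[5. 4.] [7. 6.]] -- small shifts give the same summary
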
Remarks

 (Figure: feature hierarchy from lower to higher layers)

 • Higher layers catch more specific, abstract patterns
 • Lower layers catch more general patterns
Parameter Learning of CNN

 CNN is just another Neural Network with sparse connections


 Learning Algorithm:
 Back Propagation on Convolution Layers and Fully-Connected Layers

Back Propagation
Applications (Image Classification) (1/2)

ImageNet Competition Ranking


(1000-class, 1 million images)

ALL CNN!!

From Kyunghyun Cho’s dnn tutorial


Applications (Image Classification) (2/2)

 Krizhevsky et al.: the winner of the ImageNet 2012 Competition
  1000-class problem
  top-5 test error rate of 15.3%

 (Figure: the network architecture — convolution layers followed by fully connected layers)
Application (Speech Recognition)

 (Figure: a Convolutional Neural Network applied to speech)

 Input: Spectrogram of Speech
 CNN outperforms all previous methods that use GMMs over MFCC features
APPENDIX

Slides from Wanli Ouyang [email protected]


Good learning resources

 Webpages:
  Geoffrey E. Hinton’s readings (with source code available for DBN) http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
  Notes on Deep Belief Networks http://www.quantumg.net/dbns.php
  MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean http://videolectures.net/mlss2010au_frean_deepbeliefnets/
  Deep Learning Tutorials http://deeplearning.net/tutorial/
  Hinton’s Tutorial, http://videolectures.net/mlss09uk_hinton_dbn/
  Fergus’s Tutorial, http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
  CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
 People:
  Geoffrey E. Hinton http://www.cs.toronto.edu/~hinton
  Andrew Ng http://www.cs.stanford.edu/people/ang/index.html
  Ruslan Salakhutdinov http://www.utstat.toronto.edu/~rsalakhu/
  Yee-Whye Teh http://www.gatsby.ucl.ac.uk/~ywteh/
  Yoshua Bengio www.iro.umontreal.ca/~bengioy
  Yann LeCun http://yann.lecun.com/
  Marcus Frean http://ecs.victoria.ac.nz/Main/MarcusFrean
  Rob Fergus http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
 Acknowledgement
  Many materials in this ppt are from these papers, tutorials, etc. (especially
  Hinton’s and Frean’s). Sorry for not listing them in full detail.

45
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
Graphical model for Statistics

 Conditional independence between random variables
 Given C, A and B are independent:
  P(A, B | C) = P(A | C) P(B | C)
  P(A, B, C) = P(A, B | C) P(C)
             = P(A | C) P(B | C) P(C)
 Any two nodes are conditionally independent given the values of their parents.

 (Figure: C = Smoker?, A = Has Lung cancer, B = Has bronchitis)

http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
46
Directed and undirected graphical model

 Directed graphical model
  P(A,B,C) = P(A|C) P(B|C) P(C)
  Any two nodes are conditionally independent given the values of their parents.
  Example with four nodes: P(A,B,C,D) = P(D|A,B) P(B|C) P(A|C) P(C)

 Undirected graphical model
  P(A,B,C) ∝ f(B,C) f(A,C)   (a product of potential functions)
  Also called Markov Random Field (MRF)

 (Figures: small graphs over nodes A, B, C, D)

47
Modeling undirected model

 Probability:
  P(x; θ) = f(x; θ) / Z(θ),   Z(θ) = Σ_x f(x; θ)   so that   Σ_x P(x; θ) = 1
  Z(θ) is the partition function

 Example:  P(A, B, C) ∝ f(B, C) f(A, C)
  P(A, B, C; θ) = exp(w1·B·C + w2·A·C) / Σ_{A,B,C} exp(w1·B·C + w2·A·C)
                = exp(w1·B·C) exp(w2·A·C) / Z(w1, w2)

 (Figure: C = Is smoker?, B = Is healthy, A = Has Lung cancer; edge weights w1, w2)
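
A tiny Python sketch of this example: the partition function Z(w1, w2) is computed by enumerating all binary states of A, B, C (the weight values are illustrative assumptions):

    import numpy as np
    from itertools import product

    w1, w2 = 1.0, 2.0                          # illustrative edge weights

    def f(A, B, C):
        """Unnormalized potential exp(w1*B*C + w2*A*C)."""
        return np.exp(w1 * B * C + w2 * A * C)

    # partition function: sum of f over all binary configurations of (A, B, C)
    Z = sum(f(A, B, C) for A, B, C in product([0, 1], repeat=3))

    def P(A, B, C):
        return f(A, B, C) / Z

    print("Z =", Z)
    print("P(A=1, B=1, C=1) =", P(1, 1, 1))
    print("check normalization:", sum(P(A, B, C) for A, B, C in product([0, 1], repeat=3)))
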

48
More directed and undirected models

 (Figures: a Hidden Markov model with hidden states h1, h2, h3 and
  observations y1, y2, y3; an MRF on a 2D grid of nodes A … I)

49
More directed and undirected models

 (Figures: a directed model over A, B, C, D; an HMM over y1, y2, y3 with hidden states h1, h2, h3)

 P(A,B,C,D) = P(A) P(B) P(C|B) P(D|A,B,C)
 P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2|h1) P(h3|h2) P(y1|h1) P(y2|h2) P(y3|h3)

50
More directed and undirected models

 (Figures: (a) an RBM with visible layer v, hidden layer h, and weights W;
  (b) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2; and an HMM)

51
Extended reading on graphical model

 Zoubin Ghahramani ‘s video lecture on graphical models:


 http://videolectures.net/mlss07_ghahramani_grafm/

52
Product of Experts

 P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m) = e^(−E(x; θ)) / Σ_x e^(−E(x; θ)) = e^(−E(x; θ)) / Z(θ)

 Energy function:  E(x; θ) = −Σ_m log f_m(x_m; θ_m)
 Partition function:  Z(θ) = Σ_x e^(−E(x; θ))

 Example (MRF in 2D over nodes A … I):
  E(x; w) = w1·A·B + w2·B·C + w3·A·D + w4·B·E + …

53
Product of Experts

 (Example figure: a product of 15 experts, each a term of the form
  α_i · exp(−(x − u_i)^T Λ (x − u_i)) + c(1 − α_i), multiplied over i = 1 … 15)

54
Products of experts versus Mixture model

 Products of experts:
  P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m)
  an "and" operation
  Sharper than a mixture
  Each expert can constrain a different subset of dimensions.

 Mixture model, e.g. Gaussian Mixture model
  an "or" operation
  a weighted sum of many density functions

55
Outline

 Basic background on statistical learning and Graphical models
 Contrastive divergence and Restricted Boltzmann machine
  Product of experts
  Contrastive divergence
  Restricted Boltzmann Machine
 Deep belief net

56
Z ( )  x f ( x;  m )
Contrastive Divergence (CD)

 Probability: P(x;  )  f ( x; ) / Z ( )


 Maximum Likelihood and gradient descent

K   K

max  P(x ; )  max L( X; )  max log  P(x ; )
(k ) (k )

 k 1   
 k 1 
L( X; ) L( X; )
 t 1   t   or 0
 
 K
1 
 log Z ( )   log f (x( k); )
1 L( X; )  K k 1 

K  
 log f (x; ) 1 K  log f (x( k); )
  p ( x,  ) dx  
 K k 1 
 log f (x; )  log f (x; )
 
 p ( x , )  X

model dist.
57
data dist. expectation
(Reminder: the directed model with P(A,B,C) = P(A|C) P(B|C) P(C))

Contrastive Divergence (CD)

 Gradient of the likelihood:
  ∂L(X; θ)/∂θ = ∫ p(x; θ) ∂log f(x; θ)/∂θ dx − (1/K) Σ_{k=1}^K ∂log f(x^(k); θ)/∂θ

  The first term (over the model distribution) is intractable; the second term
  (over the data) is easy to compute.
  The intractable expectation can be approximated by Gibbs sampling of
  p(z1, z2, …, zM): running the chain to T → ∞ gives an accurate but slow
  gradient; stopping at T = 1 gives an approximate but fast gradient
  (fast contrastive divergence).

  Update:  θ^(t+1) = θ^t − η ∂L(X; θ)/∂θ |_CD

58
Gibbs Sampling for graphical model

 (Figure: an RBM with hidden units h1 … h5 and visible units x1, x2, x3)

More information on Gibbs sampling:
Pattern Recognition and Machine Learning (PRML)
59
Convergence of Contrastive divergence (CD)

 The fixed points of ML are not fixed points of CD and vice versa.
  CD is a biased learning algorithm.
  But the bias is typically very small.
  CD can be used for getting close to the ML solution, and then ML learning
  can be used for fine-tuning.
 It is not clear if CD learning converges (to a stable fixed point).
  As of 2005, a proof was not available.
  Further theoretical results? Please inform us.

M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005
60
Outline

 Basic background on statistical learning and Graphical models
 Contrastive divergence and Restricted Boltzmann machine
  Product of experts
  Contrastive divergence
  Restricted Boltzmann Machine
 Deep belief net

61
Boltzmann Machine

 Undirected graphical model, with hidden nodes.

 P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m) = e^(−E(x; θ)) / Σ_x e^(−E(x; θ)) = e^(−E(x; θ)) / Z(θ)

 E(x; θ) = Σ_{i<j} w_ij x_i x_j + Σ_i θ_i x_i,   θ: {w_ij, θ_i}

 Boltzmann machine with hidden units: E(x,h) = b' x + c' h + h' Wx + x' Ux + h' Vh

62
Boltzmann machine: E(x,h)=b' x+c' h+h' Wx+x’Ux+h’Vh
Restricted Boltzmann Machine (RBM)

 Undirected, loopy, layered model (hidden units h1 … h5, visible units x1, x2, x3)

 P(x, h) = e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
 P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h)),   partition function: Σ_{x,h} e^(−E(x,h))

 RBM: E(x,h) = b' x + c' h + h' Wx   (no connections within a layer)

 P(h | x) = Π_i P(h_i | x),   P(h_i = 1 | x) = σ(c_i + W_i· · x)
 P(x | h) = Π_j P(x_j | h),   P(x_j = 1 | h) = σ(b_j + W’_·j · h)

 Read the manuscript for details
Restricted Boltzmann Machine (RBM)

 P(x; θ) = Σ_h e^(−(b' x + c' h + h' Wx)) / Σ_{x,h} e^(−(b' x + c' h + h' Wx)) = f(x; θ) / Z(θ)

 E(x,h) = b' x + c' h + h' Wx
 x = [x1 x2 …]^T,  h = [h1 h2 …]^T

 Parameter learning: Maximum Log-Likelihood
  max_θ Π_{k=1}^K P(x^(k); θ)  ⇔  min_θ L(X; θ) = min_θ − log Π_{k=1}^K P(x^(k); θ)

Geoffrey E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, 1771–1800 (2002)
64
CD for RBM

 CD for RBM, very fast!

 P(x; θ) = Σ_h e^(−(b' x + c' h + h' Wx)) / Σ_{x,h} e^(−(b' x + c' h + h' Wx)) = f(x; θ) / Z(θ)

 θ^(t+1) = θ^t − η ∂L(X; θ)/∂θ

 ∂L(X; θ)/∂w_ij = ∫ p(x; θ) ∂log f(x; θ)/∂w_ij dx − (1/K) Σ_{k=1}^K ∂log f(x^(k); θ)/∂w_ij
                = ⟨x_i h_j⟩_{p(x,θ)} − ⟨x_i h_j⟩_X
                = ⟨x_i h_j⟩_∞ − ⟨x_i h_j⟩_0
                ≈ ⟨x_i h_j⟩_1 − ⟨x_i h_j⟩_0   (CD)

 P(x_j = 1 | h) = σ(b_j + W’_·j · h),   P(h_i = 1 | x) = σ(c_i + W_i · x)
65
L( X; )
CD for RBM  xi h j  xi h j
wij 1 0

P(xj = 1|h)
= σ(bj +W’• j ·h)

P(hi = 1|x)
= σ(ci +Wi ·x)

P(xj = 1|h)
= σ(bj +W’• j ·h)

h2 h1

x1 x2
P(xj = 1|h) = σ(bj +W’• j ·h) P(hi = 1|x) = σ(ci +Wi ·x)
66
RBM for classification

 y: classification label

67
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
RBM itself has many applications

 Multiclass classification
 Collaborative filtering
 Motion capture modeling
 Information retrieval
 Modeling natural images
 Segmentation

Y Li, D Tarlow, R Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013
V. Mnih, H Larochelle, GE Hinton , Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence,
2011.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann machines. ICML, 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model., NIPS 2009.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field., NIPS 2008

68
Outline

 Basic background on statistical learning and Graphical models
 Contrastive divergence and Restricted Boltzmann machine
 Deep belief net (DBN)
  Why deep learning?
  Learning and inference
  Applications

69
Belief Nets

 A belief net is a directed acyclic graph composed of random variables.

 (Figure: hidden nodes as random “causes”, visible nodes as “effects”)

70
Deep Belief Net

 Belief net that is deep
 A generative model
  P(x, h1, …, hl) = p(x | h1) p(h1 | h2) … p(hl−2 | hl−1) p(hl−1, hl)
 Used for unsupervised training of multi-layer deep models.

 (Figure: layers x, h1, h2, h3;
  Pixels => edges => local shapes => object parts)

71  P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
Why Deep learning?

 Pixels => edges => local shapes => object parts
 The mammal brain is organized in a deep architecture, with a given input
 percept represented at multiple levels of abstraction, each level
 corresponding to a different area of cortex.
 An architecture with insufficient depth can require many more computational
 elements, potentially exponentially more (with respect to input size), than
 architectures whose depth is matched to the task.
 Since the number of computational elements one can afford depends on the
 number of training examples available to tune or select them, the
 consequences are not just computational but also statistical: poor
 generalization may be expected when using an insufficiently deep
 architecture for representing some functions.
T. Serre, etc., “A quantitative theory of immediate visual recognition,” Progress in Brain Research, Computational
Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
72
Why Deep learning?
 Linear regression, logistic regression: depth 1
 Kernel SVM: depth 2
 Decision tree: depth 2
 Boosting: depth 2
 The basic conclusion that these results suggest is that when a function can
 be compactly represented by a deep architecture, it might need a very large
 architecture to be represented by an insufficiently deep one. (Example: logic
 gates, multi-layer NN with linear threshold units and positive weights).

Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
73
Example: sum product network (SPN)

 (Figure: a shallow representation needs N·2^(N−1) parameters, while a deep
  SPN over X1, X̄1, …, X5, X̄5 needs only O(N) parameters)

74
Depth of existing approaches

 Boosting (2 layers)
  L 1: base learner
  L 2: vote or linear combination of layer 1
 Decision tree, LLE, KNN, Kernel SVM (2 layers)
  L 1: matching degree to a set of local templates,  e.g.  b + Σ_i α_i K(x, x_i)
  L 2: combine these degrees
 Brain: 5–10 layers

75
Why decision tree has depth 2?

 Relies on a partition of the input space.
 Local estimator: relies on a partition of the input space and uses separate
 parameters for each region. Each region is associated with a leaf.
 Needs as many training samples as there are variations of interest in the
 target function. Not good for highly varying functions.
 The number of training samples is exponential in the number of dimensions
 in order to achieve a fixed error rate.

76
Deep Belief Net

 Inference problem: infer the states of the unobserved variables.
 Learning problem: adjust the interactions between variables to make the
 network more likely to generate the observed data.

 (Figure: layers x, h1, h2, h3)
 P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
77
Deep Belief Net

 Inference problem (the problem of explaining away):

 When C is the common parent of A and B:  P(A, B | C) = P(A | C) P(B | C)
 But when x1 is the common child of hidden units h11 and h12 (as in a DBN):
  P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)

 (Figure: hidden units h11, h12 in layer h1 with a common child x1 in layer x)

 An example is given in the manuscript.  Solution: Complementary prior


78
Deep Belief Net

 Inference problem (the problem of explaining away)
 Solution: Complementary prior

 (Figure: a DBN with layers x, h1 (2000 units), h2 (1000), h3 (500), h4 (30))

 P(h_i = 1 | x) = σ(c_i + W_i · x)
79
Deep Belief Net

 Explaining away problem of inference (see the manuscript)
  Solution: Complementary prior, see the manuscript
 Learning problem
  Greedy layer-by-layer RBM training (optimizes a lower bound) and fine-tuning
  Contrastive divergence for RBM training

 (Figure: the DBN layers x, h1, h2, h3 viewed as a stack of RBMs)
80
Deep Belief Net

 Why does greedy layerwise learning work?
 Optimizing a lower bound:
  log P(x) = log Σ_{h1} P(x, h1)
           ≥ Σ_{h1} { Q(h1 | x) [ log P(h1) + log P(x | h1) ] − Q(h1 | x) log Q(h1 | x) }   (1)

 When we fix the parameters for layer 1 and optimize the parameters for
 layer 2, we are optimizing P(h1) in (1)

 (Figure: the DBN layers x, h1, h2, h3 viewed as a stack of RBMs)
81
Deep Belief Net and RBM

 An RBM can be considered as a DBN that has infinitely many layers

 (Figure: an RBM with weights W between x0 and h0, unrolled into an infinite
  directed stack x0, h0, x1, h1, x2, … whose weights alternate between W and W^T)

82
Pretrain, fine-tune and inference – (autoencoder)

83 (BP)
Pretrain, fine-tune and inference - 2

y: identity or rotation degree

Pretraining
84 Fine-tuning
How many layers should we use?

 There might be no universally right depth
  Bengio suggests that several layers is better than one
  Results are robust against changes in the size of a layer, but the top
  layer should be big
  It is a parameter; it depends on your task.
  With enough narrow layers, we can model any distribution over binary
  vectors [1]

[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007

Copied from http://videolectures.net/mlss09uk_hinton_dbn/


85
Effect of Unsupervised Pre-training
Erhan et al., AISTATS 2009

86
Effect of Depth

without pre-training with pre-training


w/o pre-training

87
Why unsupervised pre-training makes sense

 (Figure: two generative stories relating hidden “stuff”, the image, and the
  label, via high-bandwidth and low-bandwidth pathways)

 If image-label pairs were generated the first way, it would make sense to try
 to go straight from images to labels.
  For example, do the pixels have even parity?

 If image-label pairs are generated the second way, it makes sense to first
 learn to recover the stuff that caused the image by inverting the
 high-bandwidth pathway.
88
Beyond layer-wise pretraining

 Layer-wise pretraining is efficient but not optimal.
 It is possible to train the parameters for all layers jointly using a
 wake-sleep algorithm.
  Bottom-up pass in a layer-wise manner
  Top-down pass and refitting the earlier models

89
Fine-tuning with a contrastive version of the “wake-sleep” algorithm

After learning many layers of features, we can fine-tune the features to
improve generation.
1. Do a stochastic bottom-up pass
  Adjust the top-down weights to be good at reconstructing the feature
  activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM
  Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass
  Adjust the bottom-up weights to be good at reconstructing the feature
  activities in the layer above.

90
Include lateral connections

 An RBM has no connections within a layer
 This can be generalized.
  Lateral connections for the first layer [1].
   Sampling from P(h | x) is still easy, but sampling from P(x | h) is more difficult.
  Lateral connections at multiple layers [2].
   Generate more realistic images.
   CD is still applicable, with a small modification.

[1] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?,” Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, “Modeling image patches with a directed hierarchy of Markov random field,” in NIPS, 2007.
91
Without lateral connection

92
With lateral connection

93
My data is real valued …

 Make it [0 1] linearly: x = ax + b
 Use another distribution

94
My data has temporal dependency …

 Static:

 Temporal

95
Consider DBN as…

 A statistical model that is used for unsupervised training of fully
 connected deep models

 A directed graphical model that is approximated by fast learning and
 inference algorithms

 A directed graphical model that is fine-tuned using a mature neural network
 learning approach – BP.

96
Outline

 Basic background on statistical learning and Graphical models
 Contrastive divergence and Restricted Boltzmann machine
 Deep belief net (DBN)
  Why DBN?
  Learning and inference
  Applications

97
Applications of deep learning

 Hand written digits recognition


 Dimensionality reduction
 Information retrieval
 Segmentation
 Denoising
 Phone recognition
 Object recognition
 Object detection
 …

Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks, Science 2006.
Welling, M. etc., Exponential Family Harmoniums with an Application to Information Retrieval, NIPS 2004
A. R. Mohamed, etc., Deep Belief Networks for phone recognition, NIPS 09 workshop on deep learning for speech
recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS09
………………………….

98
Object recognition

 NORB
  logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%,
  convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
  With extra unlabeled data (and the same amount of labeled data as before),
  DBN achieves 5.2%.

99
Learning to extract the orientation of a face patch
(Salakhutdinov & Hinton, NIPS 2007)

100
The training and test sets

 100, 500, or 1000 labeled cases; 11,000 unlabeled cases
 Face patches from new people

101
The root mean squared error in the orientation when combining GPs with deep belief nets

              GP on       GP on top-level   GP on top-level features
              the pixels  features          with fine-tuning
 100 labels   22.2        17.9              15.2
 500 labels   17.2        12.7               7.2
 1000 labels  16.3        11.2               6.4

 Conclusion: The deep features are much better than the pixels.
 Fine-tuning helps a lot.
102
Deep Autoencoders (Hinton & Salakhutdinov, 2006)

 They always looked like a really nice way to do non-linear dimensionality reduction:
  But it is very difficult to optimize deep autoencoders using backpropagation.
 We now have a much better way to optimize them:
  First train a stack of 4 RBM’s
  Then “unroll” them.
  Then fine-tune with backprop.

 (Figure: encoder 28x28 → 1000 neurons → 500 → 250 → 30 linear units with
  weights W1 … W4, and the mirrored decoder W4^T … W1^T back to 28x28)
103
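
A schematic NumPy sketch of the “unroll” step: the four pre-trained weight matrices form the encoder and their transposes form the decoder, giving a 784-1000-500-250-30-250-500-1000-784 autoencoder that would then be fine-tuned with backprop. Random weights stand in for pre-trained ones here and are an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    # stand-ins for the four pre-trained RBM weight matrices (illustrative)
    sizes = [784, 1000, 500, 250, 30]
    rbm_weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    rbm_hbias = [np.zeros(n) for n in sizes[1:]]
    rbm_vbias = [np.zeros(m) for m in sizes[:-1]]

    def autoencode(x):
        """Unrolled autoencoder: encoder uses W1..W4, decoder uses W4^T..W1^T."""
        h = x
        for W, c in zip(rbm_weights[:-1], rbm_hbias[:-1]):
            h = sigmoid(h @ W + c)                  # sigmoid encoder layers
        code = h @ rbm_weights[-1] + rbm_hbias[-1]  # 30 linear code units
        h = code
        for W, b in zip(reversed(rbm_weights), reversed(rbm_vbias)):
            h = sigmoid(h @ W.T + b)                # mirrored decoder
        return code, h

    x = rng.random(784)                             # a fake 28x28 image, flattened
    code, reconstruction = autoencode(x)
    print(code.shape, reconstruction.shape)         # (30,) (784,)
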
Deep Autoencoders
(Hinton & Salakhutdinov, 2006)

 (Figure: real data vs. 30-D deep autoencoder vs. 30-D PCA reconstructions)

104
A comparison of methods for compressing digit images to 30 real numbers.

 (Figure: real data vs. 30-D deep autoencoder vs. 30-D logistic PCA vs. 30-D PCA reconstructions)

105
Representation of DBN

106
Summary

 Deep belief net (DBN)
  is a network with deep layers, which provides strong representation power;
  is a generative model;
  can be learned layer-wise via RBMs using Contrastive Divergence;
  has many applications, and more applications are yet to be found.

Generative models explicitly or implicitly model the distribution of inputs and outputs.
Discriminative models model the posterior probabilities directly.
107
DBN VS SVM

 A very controversial topic
 Model
  DBN is generative, SVM is discriminative. But fine-tuning of DBN is discriminative.
 Application
  SVM is widely applied.
  Researchers are expanding the application area of DBN.
 Learning
  DBN is non-convex and slow.
  SVM is convex and fast (in the linear case).
 Which one is better?
  Time will tell.
  You can contribute.

Hinton: The superior classification performance of discriminative learning
methods holds only for domains in which it is not possible to learn a good
generative model. This set of domains is being eroded by Moore’s law.
108
