Deep Learning Book
History of Neural Network Research
[Timeline figure: neural networks, back propagation, deep belief nets; milestones marked "Science" and "Speech"]
Neural Networks
Multilayer Perceptron Structure
Learning Algorithm based on Back Propagation
Deep Belief Network
Restricted Boltzmann Machines
Deep Learning (Deep Belief Network)
Convolutional Neural Networks (CNN)
CNN Structure and Learning
Applications
NEURAL NETWORKS
A perceptron with inputs x1, x2, weights w1, w2, and bias b outputs 1 if w1*x1 + w2*x2 + b > 0 and 0 if w1*x1 + w2*x2 + b < 0.
[Figure: the line w1*x1 + w2*x2 + b = 0 is the decision boundary in the (x1, x2) plane]
Parameter Learning in Perceptron
start:
The weight vector w is generated randomly
test:
A vector x ∈ P ∪ N is selected randomly.
If x ∈ P and w · x > 0, go to test.
If x ∈ P and w · x ≤ 0, go to add.
If x ∈ N and w · x < 0, go to test.
If x ∈ N and w · x ≥ 0, go to subtract.
add:
Set w = w + x, go to test.
subtract:
Set w = w - x, go to test.
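A minimal NumPy sketch of this perceptron learning loop, assuming two small linearly separable point sets P and N (made-up data) and folding the bias b into the weight vector as a constant-1 feature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable sets (made up for illustration); the bias b is folded
# in as a constant 1 feature, so w = (w1, w2, b).
P = [np.array([x1, x2, 1.0]) for x1, x2 in [(2, 2), (1, 3), (3, 1.5)]]       # positive set
N = [np.array([x1, x2, 1.0]) for x1, x2 in [(-1, -2), (-2, -0.5), (0, -3)]]  # negative set

# start: the weight vector w is generated randomly
w = rng.normal(size=3)

# test / add / subtract loop from the slide, with an iteration cap
for _ in range(10_000):
    pick_P = rng.random() < 0.5                 # select x from P or N at random
    S = P if pick_P else N
    x = S[rng.integers(len(S))]
    if pick_P and w @ x <= 0:
        w = w + x                               # add
    elif not pick_P and w @ x >= 0:
        w = w - x                               # subtract
    # otherwise: go back to test (pick another sample)

print("learned (w1, w2, b):", w)
print("P classified positive:", all(w @ x > 0 for x in P))
print("N classified negative:", all(w @ x < 0 for x in N))
```

For linearly separable data this loop is guaranteed to stop making updates after finitely many corrections; the iteration cap simply replaces the open-ended "go to test".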
Sigmoid Unit
The classic perceptron uses a hard threshold; the sigmoid unit replaces it with the differentiable sigmoid function σ(x) = 1 / (1 + e^(-x)), whose derivative is
∂σ(x)/∂x = σ(x)(1 − σ(x))
Learning Algorithm of Sigmoid Unit
Minimize the squared error (d − f)^2 between the target d and the unit's output f by gradient descent, using the derivative of the sigmoid to pass the error from the output back to the weights on the input.
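A minimal sketch (with made-up data and learning rate) of training a single sigmoid unit by gradient descent on (d − f)^2, using σ'(s) = σ(s)(1 − σ(s)) from the previous slide:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # toy inputs
d = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)         # toy 0/1 targets

w = np.zeros(3)      # weights of the sigmoid unit
b = 0.0              # bias
lr = 0.5             # learning rate (arbitrary)

for _ in range(2000):
    s = X @ w + b
    f = sigmoid(s)                                             # unit output
    # dL/ds for L = (d - f)^2, using sigma'(s) = sigma(s)(1 - sigma(s))
    grad_s = -2.0 * (d - f) * f * (1.0 - f)
    w -= lr * X.T @ grad_s / len(X)                            # gradient descent step
    b -= lr * grad_s.mean()

print("mean squared error:", np.mean((d - sigmoid(X @ w + b)) ** 2))
```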
Learning Parameters of MLP
For unit i in layer j, let s_i^(j) be its weighted input, W_i^(j) its weights, X^(j-1) the output of the layer below, f the network output, and d the target. With the loss L = (d − f)^2, the chain rule gives
∂L/∂W_i^(j) = (∂L/∂s_i^(j)) (∂s_i^(j)/∂W_i^(j)) = −2(d − f) (∂f/∂s_i^(j)) X^(j-1) = δ_i^(j) X^(j-1)
where δ_i^(j) = −2(d − f) ∂f/∂s_i^(j) is the error signal back-propagated to unit i of layer j.
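A compact NumPy sketch of this chain rule for a one-hidden-layer MLP with sigmoid units and squared error; the layer sizes, data, and learning rate are made up for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                   # X^(0): inputs
d = (np.sin(X[:, 0]) + X[:, 1] > 0).astype(float)[:, None]      # targets

W1 = rng.normal(scale=0.5, size=(4, 8)); b1 = np.zeros(8)       # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)       # output layer
lr = 0.5

for _ in range(2000):
    # forward pass
    X1 = sigmoid(X @ W1 + b1)        # hidden activations X^(1)
    f = sigmoid(X1 @ W2 + b2)        # network output f

    # backward pass: delta^(j) = dL/ds^(j) for L = (d - f)^2
    delta2 = -2.0 * (d - f) * f * (1 - f)          # output-layer error signal
    delta1 = (delta2 @ W2.T) * X1 * (1 - X1)       # error back-propagated to the hidden layer

    # gradient descent: dL/dW^(j) = X^(j-1)^T delta^(j)
    W2 -= lr * X1.T @ delta2 / len(X); b2 -= lr * delta2.mean(0)
    W1 -= lr * X.T @ delta1 / len(X); b1 -= lr * delta1.mean(0)

print("training MSE:", np.mean((d - f) ** 2))
```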
Limitations
Back propagation barely changes lower-layer parameters (vanishing gradient). Therefore, deep networks cannot be fully (effectively) trained with back propagation.
[Figure: the error signal shrinks as it is back-propagated from the output y' and target y toward the input x]
Breakthrough
Deep Belief Networks (Unsupervised Pre-training)
Convolutional Neural Networks (Reducing Redundant Parameters)
Rectified Linear Unit (Constant Gradient Propagation)
DEEP BELIEF NETWORKS
Idea:
Greedy Layer-wise training
Pre-training + Fine tuning
Contrastive Divergence
Restricted Boltzmann Machine (RBM)
Energy-Based Model
[Figure: RBM with hidden units h1-h5, visible units x1-x3, and connection weights W; there are no connections within a layer]
Joint probability: P(x, h) = e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
Marginal probability (likelihood): P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
Remark: conditional independence
P(h | x) = Π_i P(h_i | x), with P(h_i = 1 | x) = σ(c_i + W_i· · x)
P(x | h) = Π_j P(x_j | h), with P(x_j = 1 | h) = σ(b_j + W'_·j · h)
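A small NumPy sketch of these two conditionals for a binary RBM matching the figure (3 visible, 5 hidden units); W, b, c here are random placeholder parameters rather than trained ones:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 5                              # x1..x3 and h1..h5 as in the figure
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # W[i, j] connects h_i and x_j
b = np.zeros(n_visible)                                 # visible biases b_j
c = np.zeros(n_hidden)                                  # hidden biases c_i

def p_h_given_x(x):
    # P(h_i = 1 | x) = sigma(c_i + W_i. · x); factorizes over i
    return sigmoid(c + W @ x)

def p_x_given_h(h):
    # P(x_j = 1 | h) = sigma(b_j + W'_.j · h); factorizes over j
    return sigmoid(b + W.T @ h)

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

x = np.array([1.0, 0.0, 1.0])             # an arbitrary visible configuration
h = sample(p_h_given_x(x))                # half a Gibbs step: h ~ P(h | x)
x_recon = sample(p_x_given_h(h))          # and back: x' ~ P(x | h)
print(h, x_recon)
```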
Maximum Likelihood
Use gradient descent. The log-likelihood gradient for an RBM weight is
∂ log P(x) / ∂w_ij = ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_∞
where ⟨·⟩_0 is the expectation with the visible units clamped to the data (the distribution of the dataset, t = 0) and ⟨·⟩_∞ is the expectation under the model distribution, which requires running the alternating Gibbs chain through t = 0, 1, 2, …, ∞.
[Figure: alternating Gibbs sampling between hidden units (j) and visible units (i) at t = 0, 1, 2, …, ∞]
Contrastive divergence approximates the model term after a single Gibbs step: ⟨x_i h_j⟩_∞ ≈ ⟨x_i h_j⟩_1.
Contrastive Divergence (CD) Learning of RBM parameters
Δw_ij = ε (⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_1)
[Figure: alternating Gibbs sampling between visible units (i) and hidden units (j); a sample drawn after running the chain to t = ∞ is called "a fantasy"]
[Figure: each pair of adjacent layers (x-h1, h1-h2, h2-h3) is treated as an RBM with weights W; the trained RBMs are stacked to form a DBN]
1. Regard each layer as RBM
2. Layer-wise Pre-train each RBM in Unsupervised way
3. Attach the classifier and fine-tune the whole network in a supervised way
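A schematic NumPy sketch of steps 1-3 on made-up binary data; the inner CD-1 update is the standard one described later in these slides, and the layer sizes are illustrative:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = (rng.random((500, 784)) < 0.1).astype(float)   # toy binary "images" (made up)
layer_sizes = [784, 500, 200]                      # x -> h1 -> h2 (illustrative)

def train_rbm_cd1(V, n_hidden, lr=0.05, epochs=5):
    """Unsupervised pre-training of one RBM with CD-1 (see the CD slides later)."""
    n_visible = V.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_visible)            # visible biases
    c = np.zeros(n_hidden)             # hidden biases
    for _ in range(epochs):
        ph0 = sigmoid(V @ W + c)                              # P(h=1 | v) at t = 0
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                           # reconstruction at t = 1
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)          # <v h>_0 - <v h>_1
        b += lr * (V - pv1).mean(0)
        c += lr * (ph0 - ph1).mean(0)
    return W, b, c

# Steps 1-2: regard each pair of layers as an RBM and pre-train them
# layer by layer in an unsupervised way.
reps, stack = X, []
for n_hidden in layer_sizes[1:]:
    W, b, c = train_rbm_cd1(reps, n_hidden)
    stack.append((W, c))
    reps = sigmoid(reps @ W + c)       # hidden activities feed the next RBM

# Step 3: attach a classifier on top of `reps` and fine-tune the whole
# network in a supervised way (e.g. with the back-propagation pass
# sketched earlier in this deck).
```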
Viewing Learning as Wake-Sleep Algorithm
Effect of Unsupervised Pre-Training in DBN (1/2)
Effect of Unsupervised Pre-Training in DBN (2/2)
Internal Representation of DBN
Representation of Higher Layers
[Figure: occluded inputs (Image 1, Image 2) and the images regenerated by generating data from the higher-layer representation]
Structure of Convolutional Neural Network (CNN)
http://parse.ele.tue.nl/education/cluster2
Convolution Layer
Example 3x3 convolution kernel:
1 0 1
0 1 0
1 0 1
Higher layers respond to increasingly specific, abstract patterns; the kernel weights are learned with back propagation.
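A small NumPy sketch applying the 3x3 kernel above to a toy image (no padding, stride 1); in CNN libraries this is usually implemented as cross-correlation, which coincides with convolution here because the kernel is symmetric:

```python
import numpy as np

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
image = rng.random((6, 6))                  # toy single-channel image

def conv2d(img, k):
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # weighted sum of the patch under the kernel
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * k)
    return out

feature_map = conv2d(image, kernel)
print(feature_map.shape)                    # (4, 4)
```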
Applications (Image Classification) (1/2)
ALL CNN!!
Convolutional Neural Network
Webpages:
Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
Notes on Deep Belief Networks http://www.quantumg.net/dbns.php
MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
Deep Learning Tutorials http://deeplearning.net/tutorial/
Hinton’s Tutorial, http://videolectures.net/mlss09uk_hinton_dbn/
Fergus’s Tutorial, http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
People:
Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
Andrew Ng http://www.cs.stanford.edu/people/ang/index.html
Ruslan Salakhutdinov http://www.utstat.toronto.edu/~rsalakhu/
Yee-Whye Teh http://www.gatsby.ucl.ac.uk/~ywteh/
Yoshua Bengio www.iro.umontreal.ca/~bengioy
Yann LeCun http://yann.lecun.com/
Marcus Frean http://ecs.victoria.ac.nz/Main/MarcusFrean
Rob Fergus http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
Acknowledgement
Many materials in this presentation are from these papers and tutorials (especially Hinton's and Frean's); apologies for not listing them in full detail.
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
Graphical model for Statistics
Conditional independence between random variables. Given C, A and B are independent:
P(A, B | C) = P(A | C) P(B | C)
P(A, B, C) = P(A, B | C) P(C) = P(A | C) P(B | C) P(C)
Any two nodes are conditionally independent given the values of their parents.
[Figure: directed graph with parent C = "Smoker?" and children A = "Has lung cancer", B = "Has bronchitis"]
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
Directed and undirected graphical model
[Figure: the same nodes A, B, C shown as a directed graph and as an undirected graph, plus a node D with parents A and B]
P(A, B, C, D) = P(D | A, B) P(B | C) P(A | C) P(C)
Modeling undirected model
Probability: P(x; θ) = f(x; θ) / Z(θ), with partition function Z(θ) = Σ_x f(x; θ), so that Σ_x P(x; θ) = 1.
Example: P(A, B, C) ∝ φ(B, C) φ(A, C)
P(A, B, C; θ) = exp(w1·B·C + w2·A·C) / Σ_{A,B,C} exp(w1·B·C + w2·A·C) = exp(w1·B·C) exp(w2·A·C) / Z(w1, w2)
[Figure: undirected graph with C = "Is smoker?", B = "Is healthy", A = "Has lung cancer"; edge B-C has weight w1, edge A-C has weight w2]
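A tiny sketch computing P(A, B, C; θ) for this binary example by enumerating all 2^3 states to obtain the partition function Z(w1, w2); the weight values are arbitrary:

```python
import itertools
import numpy as np

w1, w2 = 1.0, 2.0   # illustrative edge weights

def f(A, B, C):
    # unnormalized potential exp(w1*B*C) * exp(w2*A*C)
    return np.exp(w1 * B * C + w2 * A * C)

states = list(itertools.product([0, 1], repeat=3))      # all (A, B, C) configurations
Z = sum(f(A, B, C) for A, B, C in states)               # partition function Z(w1, w2)

P = {s: f(*s) / Z for s in states}
print("Z =", Z)
print("sum of P over all states =", sum(P.values()))    # 1.0 by construction
```

Enumeration is only feasible for tiny models; for larger ones Z is exactly the intractable quantity that motivates the sampling methods below.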
More directed and undirected models
[Figure: a 2D Markov random field over nodes A-I, and a model with observed variables y1-y3 and hidden variables h1-h3]
More directed and undirected models
[Figure: further examples over nodes A-D and over observed variables y1-y3 with hidden variables h1-h3]
More directed and undirected models
[Figure: (a) an RBM with visible units v and hidden units h connected by W; (b) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2; and an HMM over observations x]
Extended reading on graphical model
Product of Experts
P(x; θ) = Π_m f_m(x; θ_m) / Σ_x Π_m f_m(x; θ_m) = e^(−E(x; θ)) / Z(θ)
[Figure: an MRF in 2D over nodes A-I as an example of a product of experts]
Product of Experts
Example: a product of 15 experts, where expert i depends on x only through the squared distance (x − u_i)^T (x − u_i) to its center u_i, i = 1, …, 15.
Products of experts versus Mixture model
Products of experts: P(x; θ) = Π_m f_m(x; θ_m) / Σ_x Π_m f_m(x; θ_m), an "and" operation: every expert must assign the data reasonable probability.
Mixture model: P(x; θ) = Σ_m α_m f_m(x; θ_m), an "or" operation: a single component suffices.
Outline
Contrastive Divergence (CD)
Maximum likelihood for P(x; θ) = Π_m f_m(x; θ_m) / Z(θ), with Z(θ) = Σ_x Π_m f_m(x; θ_m):
max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  max_θ L(X; θ) = max_θ Σ_{k=1..K} log P(x^(k); θ)
Gradient ascent: θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ, or solve ∂L(X; θ)/∂θ = 0.
(1/K) ∂L(X; θ)/∂θ = −∂ log Z(θ)/∂θ + (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ
  = −∫ p(x; θ) (∂ log f(x; θ)/∂θ) dx + (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ
  = −⟨∂ log f(x; θ)/∂θ⟩_{p(x; θ)} + ⟨∂ log f(x; θ)/∂θ⟩_X
i.e. a model-distribution expectation subtracted from a data-distribution expectation.
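A numerical check of the last identity on a tiny, fully enumerable model P(x; θ) ∝ exp(θ·φ(x)) over three binary variables (a made-up example with φ(x) = x): the average log-likelihood gradient equals the data expectation of ∂log f/∂θ minus the model expectation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)  # all 8 states

theta = rng.normal(size=3)                               # model parameters
X = states[rng.integers(len(states), size=50)]           # toy "dataset" of K = 50 samples

# model distribution p(x; theta) with f(x; theta) = exp(theta . x)
logits = states @ theta
p_model = np.exp(logits) / np.exp(logits).sum()

# gradient from the derivation: <dlog f/dtheta>_data - <dlog f/dtheta>_model
# (here dlog f/dtheta = x)
grad = X.mean(0) - p_model @ states

# numerical gradient of (1/K) sum_k log P(x^(k); theta) for comparison
def avg_loglik(th):
    log_Z = np.log(np.exp(states @ th).sum())
    return np.mean(X @ th - log_Z)

eps = 1e-6
num_grad = np.array([(avg_loglik(theta + eps * e) - avg_loglik(theta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(grad, num_grad, atol=1e-5))            # True
```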
Contrastive Divergence (CD)
[Figure: recall the directed example P(A, B, C) = P(A | C) P(B | C) P(C) over nodes A, B with parent C]
Gradient of Likelihood:
∂L(X; θ)/∂θ = −∫ p(x; θ) (∂ log f(x; θ)/∂θ) dx + (1/K) Σ_{k=1..K} ∂ log f(x^(k); θ)/∂θ
The first term (the model expectation) is intractable; the second term is easy to compute.
The model expectation can be made tractable by Gibbs sampling from p(z1, z2, …, zM); contrastive divergence makes it fast by running the chain for only T = 1 step instead of T → ∞.
Update: θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ |_CD
[Figure: RBM with hidden units h1-h5 and visible units x1-x3]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)
Convergence of Contrastive divergence (CD)
M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005
Outline
Boltzmann Machine
P(x; θ) = Π_m f_m(x; θ_m) / Σ_x Π_m f_m(x; θ_m) = e^(−E(x; θ)) / Z(θ)
Energy: E(x; θ) = −Σ_{i<j} w_ij x_i x_j − Σ_i θ_i x_i,  θ = {w_ij, θ_i}
Boltzmann machine: E(x, h) = −b'x − c'h − h'Wx − x'Ux − h'Vh
Restricted Boltzmann Machine (RBM)
[Figure: RBM with visible units x1-x3 and a layer of hidden units h]
P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))  (the denominator is the partition function)
P(x | h) = Π_j P(x_j | h)
Maximum likelihood: max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  max_θ L(X; θ) = max_θ Σ_{k=1..K} log P(x^(k); θ)
Geoffrey E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, 1771–1800 (2002)
CD for RBM
P(x; θ) = Σ_h e^(b'x + c'h + h'Wx) / Σ_{x,h} e^(b'x + c'h + h'Wx) = f(x; θ) / Z(θ)
Update: θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ |_CD, with ∂L/∂w_ij ≈ ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_1
using P(x_j = 1 | h) = σ(b_j + W'_·j · h) and P(h_i = 1 | x) = σ(c_i + W_i· · x).
CD for RBM
∂L(X; θ)/∂w_ij ≈ ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_1
Alternate between sampling the hidden units given the visibles, P(h_i = 1 | x) = σ(c_i + W_i· · x), and the visibles given the hiddens, P(x_j = 1 | h) = σ(b_j + W'_·j · h).
[Figure: one step of alternating Gibbs sampling between visible units x1, x2 and hidden units h1, h2]
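A NumPy sketch of one CD-1 parameter update built from the two conditionals above; the batch, layer sizes, and learning rate are made up for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))   # W[j, i] links x_j and h_i
b = np.zeros(n_visible)                                   # visible biases b_j
c = np.zeros(n_hidden)                                    # hidden biases c_i

X = (rng.random((20, n_visible)) < 0.3).astype(float)     # toy binary batch

def cd1_step(x0):
    # positive phase (t = 0): h ~ P(h | x)
    ph0 = sigmoid(x0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # reconstruction (t = 1): x ~ P(x | h), then the hidden probabilities again
    px1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(px1 @ W + c)
    # CD-1 statistics: <x h>_0 - <x h>_1
    dW = (x0.T @ ph0 - px1.T @ ph1) / len(x0)
    db = (x0 - px1).mean(0)
    dc = (ph0 - ph1).mean(0)
    return dW, db, dc

for _ in range(100):
    dW, db, dc = cd1_step(X)
    W += lr * dW; b += lr * db; c += lr * dc   # gradient ascent on the log-likelihood
```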
RBM for classification
y: classification label
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
RBM itself has many applications
Multiclass classification
Collaborative filtering
Motion capture modeling
Information retrieval
Modeling natural images
Segmentation
Y Li, D Tarlow, R Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013
V. Mnih, H. Larochelle, G. E. Hinton, Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann machines. ICML, 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model., NIPS 2009.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.
Outline
Belief Nets
[Figure: a belief net; the visible units at the bottom are the observed effects]
Deep Belief Net
[Figure: DBN hidden layers h1 and h2 above the visible layer]
Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
Example: sum product network (SPN)
A shallow representation of the same function needs on the order of 2^(N-1) hidden units and N·2^(N-1) parameters, while a deep sum product network over the variables X1, …, X5 and their negations needs only O(N) parameters.
[Figure: SPN with leaves X1, X1, X2, X2, …, X5, X5 (each variable and its negation)]
Depth of existing approaches
Boosting (2 layers)
Layer 1: base learners
Layer 2: vote or linear combination of layer 1
Decision tree, LLE, KNN, Kernel SVM (2 layers)
Layer 1: matching degree to a set of local templates (e.g. a kernel SVM computes f(x) = b + Σ_i α_i K(x, x_i))
Layer 2: combine these degrees
Brain: 5-10 layers
Why does a decision tree have depth 2?
Deep Belief Net
[Figure: DBN with visible layer x and hidden layers h1, h2, h3]
P(x, h1, h2, h3) = P(x | h1) P(h1 | h2) P(h2, h3)
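A sketch of generating a sample from this factorization: run Gibbs sampling in the top-level RBM for P(h2, h3), then sample top-down through P(h1 | h2) and P(x | h1). The weights here are random placeholders standing in for pre-trained parameters, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
sizes = {"x": 784, "h1": 500, "h2": 200, "h3": 100}            # illustrative layer sizes
W1 = rng.normal(scale=0.01, size=(sizes["h1"], sizes["x"]))    # generative weights h1 -> x
W2 = rng.normal(scale=0.01, size=(sizes["h2"], sizes["h1"]))   # generative weights h2 -> h1
W3 = rng.normal(scale=0.01, size=(sizes["h2"], sizes["h3"]))   # top RBM between h2 and h3

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# P(h2, h3): Gibbs sampling in the top-level RBM (biases omitted for brevity)
h2 = sample(np.full(sizes["h2"], 0.5))
for _ in range(200):
    h3 = sample(sigmoid(h2 @ W3))
    h2 = sample(sigmoid(h3 @ W3.T))

# P(h1 | h2) and P(x | h1): directed top-down (ancestral) sampling
h1 = sample(sigmoid(h2 @ W2))
x = sample(sigmoid(h1 @ W1))
print(x.shape)    # a generated 784-dimensional binary sample
```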
Deep Belief Net
[Figure: recall that P(A, B | C) = P(A | C) P(B | C) for the directed graph with parent C]
Variational bound on the likelihood:
log P(x) ≥ Σ_{h1} { Q(h1 | x) [log P(h1) + log P(x | h1)] − Q(h1 | x) log Q(h1 | x) }
[Figure: an RBM between x0 and h0 with weights W, unrolled using the transposed weights W^T into the chain x0 → h0 → x1 → h1 → x2]
Pretrain, fine-tune and inference – (autoencoder)
(BP)
Pretrain, fine-tune and inference - 2
Pretraining
Fine-tuning
How many layers should we use?
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.
Effect of Depth
Why unsupervised pre-training makes sense
[Figure: the underlying "stuff" in the world generates images through a high-bandwidth channel and labels through a low-bandwidth channel, so it makes sense to model the image data first]
Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass. Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM. Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass. Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Include lateral connections
[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311-3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," in NIPS, 2007.
Without lateral connection
With lateral connection
My data is real valued …
Scale it to [0, 1] linearly: x ← a·x + b
Use another distribution
My data has temporal dependency …
Static:
Temporal
Consider DBN as…
Outline
Applications of deep learning
Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks, Science 2006.
Welling, M. et al., Exponential Family Harmoniums with an Application to Information Retrieval, NIPS 2004
A. R. Mohamed et al., Deep Belief Networks for phone recognition, NIPS 09 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS09
………………………….
Object recognition
NORB
logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
With the extra unlabeled data (and the same amount of labeled data as before), DBN achieves 5.2%.
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
The training and test sets
The root mean squared error in the orientation when combining GPs with deep belief nets

              GP on pixels   GP on top-level features   GP on top-level features with fine-tuning
100 labels        22.2               17.9                              15.2
500 labels        17.2               12.7                               7.2
1000 labels       16.3               11.2                               6.4
Deep Autoencoders
They always looked like a really nice way to do non-linear dimensionality reduction, but it is very difficult to optimize deep autoencoders using backpropagation. We now have a much better way to optimize them:
First train a stack of 4 RBMs, then "unroll" them, then fine-tune with backprop.
[Figure: encoder 28x28 → W1 → 1000 neurons → W2 → 500 neurons → W3 → 250 neurons → W4 → 30 linear units, unrolled into a decoder that applies the transposed weights W4^T, W3^T, W2^T, W1^T]
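A sketch of the "unrolled" architecture as a single forward pass: the encoder uses W1-W4 with a linear 30-unit code layer, and the decoder reuses the transposed weights. The random weights below are stand-ins for the RBM-pretrained ones; the whole network would then be fine-tuned with backprop.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
sizes = [28 * 28, 1000, 500, 250, 30]
Ws = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]  # W1..W4

def autoencode(x):
    # encoder: 784 -> 1000 -> 500 -> 250 -> 30 (the 30-unit code layer is linear)
    h = x
    for k, W in enumerate(Ws):
        pre = h @ W
        h = pre if k == len(Ws) - 1 else sigmoid(pre)
    code = h
    # decoder ("unrolled" transposes): 30 -> 250 -> 500 -> 1000 -> 784
    for W in reversed(Ws):
        h = sigmoid(h @ W.T)
    return code, h                                       # 30-D code, 784-D reconstruction

x = rng.random(28 * 28)                                  # toy input "image"
code, recon = autoencode(x)
print(code.shape, recon.shape)
```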
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
[Figure: real data compared with reconstructions from a 30-D deep autoencoder and 30-D PCA]
A comparison of methods for compressing digit images to 30 real numbers.
[Figure: real data compared with reconstructions from a 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA]
Representation of DBN
Summary
Generative models explicitly or implicitly model the distribution of inputs and outputs.
Discriminative models model the posterior probabilities directly.
DBN vs. SVM