Luciw RBM DBN
$$p(s_i = 1) = \frac{1}{1 + e^{-(b_i + \sum_j s_j w_{ji})}}$$

[Plot: the logistic function, rising from 0 to 1 and passing through 0.5 at zero total input]
Stochastic units
• Replace the binary threshold units by binary stochastic units
that make biased random decisions.
– The temperature controls the amount of noise.
– Decreasing all the energy gaps between configurations is
equivalent to raising the noise level.
$$p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}} = \frac{1}{1 + e^{-(b_i + \sum_j s_j w_{ij}) / T}}$$

where $T$ is the temperature.
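A minimal sketch of this sampling rule (the function name `sample_unit` and the NumPy formulation are illustrative, not from the slides):

```python
import numpy as np

def sample_unit(b_i, s, w_i, T=1.0, rng=None):
    """Sample one binary stochastic unit.

    b_i : bias of unit i
    s   : 0/1 states of the other units
    w_i : weights from those units to unit i
    T   : temperature; higher T means noisier (more random) decisions
    """
    rng = rng or np.random.default_rng()
    delta_E = b_i + s @ w_i                      # energy gap for turning the unit on
    p_on = 1.0 / (1.0 + np.exp(-delta_E / T))    # logistic of the gap, scaled by T
    return int(rng.random() < p_on)              # biased random decision
```

As $T \to 0$ the unit behaves like a deterministic binary threshold unit; raising $T$ flattens the sigmoid, which is the sense in which shrinking all the energy gaps is equivalent to raising the noise level.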
The energy of a joint configuration of the visible and hidden units:

$$E(v, h) = -\sum_{i \in \text{units}} b_i s_i - \sum_{i < j} s_i s_j w_{ij}$$

The derivative of the log probability of one training vector:

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

where the first term is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units, and the second term is the expected value of the product of states at thermal equilibrium when nothing is clamped.
The (theoretical) batch learning
algorithm
• Positive phase
– Clamp a data vector on the visible units.
– Let the hidden units reach thermal equilibrium at a temperature of 1.
– Sample $s_i s_j$ for all pairs of units.
– Repeat for all data vectors in the training set.
• Negative phase
– Do not clamp any of the units
– Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
– Sample $s_i s_j$ for all pairs of units.
– Repeat many times to get good estimates
• Weight updates
– Update each weight by an amount proportional to the difference in $\langle s_i s_j \rangle$ in the two phases.
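A sketch of this batch procedure, written here for a Restricted Boltzmann Machine so that the positive phase reaches equilibrium in a single parallel update; the negative phase approximates thermal equilibrium with a fixed number of alternating Gibbs sweeps, and biases are omitted for brevity (all of these are simplifying assumptions, not details given in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def batch_update(W, data, n_neg_sweeps=100, lr=0.01):
    """One approximate batch update; W is (n_visible x n_hidden), data is (n_cases x n_visible)."""
    # Positive phase: clamp each data vector and measure <s_i s_j>
    h_prob = sigmoid(data @ W)                 # hidden probabilities given clamped visibles
    pos_corr = data.T @ h_prob / len(data)

    # Negative phase: nothing clamped; alternating Gibbs sampling from a random start
    v = sample(0.5 * np.ones_like(data))
    for _ in range(n_neg_sweeps):
        h = sample(sigmoid(v @ W))
        v = sample(sigmoid(h @ W.T))
    neg_corr = v.T @ sigmoid(v @ W) / len(v)

    # Update each weight in proportion to the difference between the two phases
    return W + lr * (pos_corr - neg_corr)
```

The expensive part is the negative phase: drawing unbiased samples from the model's equilibrium distribution can take very many sweeps, which is what motivates the contrastive divergence shortcut below.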
Solution: Contrastive Divergence!
The contrastive divergence approximation to the maximum likelihood gradient:

$$\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} \approx \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$

[Figure: alternating Gibbs sampling between the visible units $i$ and hidden units $j$: start at the data ($t = 0$, giving $\langle s_i s_j \rangle^0$), take one full step ($t = 1$, giving $\langle s_i s_j \rangle^1$), and run forever to reach a fantasy ($t = \infty$)]

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^1 \right)$$
This is not following the gradient of the log likelihood. But it works well.
When we consider infinite directed nets it will be easy to see why it works.
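A sketch of the CD-1 update $\Delta w_{ij} = \varepsilon(\langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^1)$ for an RBM. Biases are omitted, and probabilities are used in place of sampled states for the reconstruction and the second up-pass, a common simplification that is an assumption here rather than something the slides specify:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One contrastive divergence (CD-1) step.

    W  : weights, shape (n_visible, n_hidden)
    v0 : batch of binary data vectors, shape (n_cases, n_visible)
    """
    # Up-pass: sample binary hidden states from the data
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)

    # Down-pass and second up-pass: the one-step reconstruction
    v1 = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1 @ W)

    # <s_i s_j>^0 - <s_i s_j>^1, averaged over the batch
    pos = v0.T @ h0_prob
    neg = v1.T @ h1_prob
    return W + lr * (pos - neg) / len(v0)
```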
How to learn a set of features that are good for
reconstructing images of the digit 2
[Figure: an RBM with 50 binary feature neurons connected to a 16 x 16 pixel image; on the left the data (reality), on the right the reconstruction (better than reality)]
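How the data/reconstruction pairs in a figure like this would be produced once the 50-feature RBM has been trained (a hedged sketch; `W` is assumed to be a trained 256 x 50 weight matrix, and biases are again omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(v, W):
    """v: a 256-pixel binary image; W: trained 256 x 50 weights."""
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)  # activated binary features
    return sigmoid(h @ W.T)                                      # pixel-wise reconstruction probabilities
```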
Using an RBM to learn a model of a digit class
[Figure: an RBM with 50 binary feature neurons and 256 visible units (pixels), alternating between data and reconstruction]

[Figure: rows of test digits (Data) shown alongside reconstructions by a model trained on 2's and reconstructions by a model trained on 3's]
The final 50 x 256 weights
[Figure: pairs of Data images and their Reconstruction from activated binary features]
[Figure: explaining away. Two independent hidden causes, "truck hits house" and "earthquake", each have a bias of -10 and a weight of +20 to the observed node "house jumps", which has a bias of -20. Given that the house jumps, the posterior over the two causes is p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001: observing the effect makes the two causes strongly anti-correlated, even though they are independent in the prior.]
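The posterior in this figure can be checked directly from the stated biases and weights; a small verification sketch (the variable names are illustrative):

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Biases and weights from the figure
b_truck, b_quake, b_house = -10.0, -10.0, -20.0
w_truck, w_quake = 20.0, 20.0

# p(truck, quake, house jumps = 1), up to a constant, for each cause configuration
unnorm = {}
for t, q in product([0, 1], repeat=2):
    p_causes = (sigmoid(b_truck) ** t * (1 - sigmoid(b_truck)) ** (1 - t)
                * sigmoid(b_quake) ** q * (1 - sigmoid(b_quake)) ** (1 - q))
    p_house = sigmoid(b_house + w_truck * t + w_quake * q)
    unnorm[(t, q)] = p_causes * p_house

Z = sum(unnorm.values())
posterior = {cfg: p / Z for cfg, p in unnorm.items()}
print(posterior)  # the two single-cause explanations share almost all of the mass;
                  # (1,1) and (0,0) are vanishingly unlikely
```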
Why multilayer learning is hard in a sigmoid belief net

[Figure: a sigmoid belief net with several layers of hidden variables above the data; W is the weight matrix of the bottom (likelihood) connections]

• To learn W, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically intractable because of "explaining away".
• Problem 2: The posterior depends on the prior created by higher layers as well as the likelihood.
– So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!
Solution: Complementary Priors

• Infinite DAG with replicated weights.

[Figure: an infinite directed net with layers v0, h0, v1, h1, v2, h2, ...; each visible layer is generated from the hidden layer above it through W and each hidden layer from the visible layer above it through W^T, so the stack above h0 implements a complementary prior]
Inference in a DAG with replicated weights

[Figure: the same infinite stack v0, h0, v1, h1, v2, h2, ... with the weights alternating between W and W^T]

• The variables in h0 are conditionally independent given v0.
– Inference is trivial. We just multiply v0 by W^T.
– This is because the model above h0 implements a complementary prior.
• Inference in the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
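A sketch of the trivial inference step described above, under the convention that the generative weights W map a hidden layer down to the visible layer below it (so recognition multiplies by $W^T$); logistic units and sampled binary states are the same assumptions used elsewhere in these slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_h0(v0, W):
    """Posterior sample for h0 given v0.

    W  : generative weights, shape (n_hidden, n_visible)
    v0 : binary visible vector(s)

    Because the layers above h0 implement a complementary prior, the
    posterior factorizes: multiply v0 by W^T, squash through the
    logistic, then sample each hidden unit independently.
    """
    p_h0 = sigmoid(v0 @ W.T)
    return (rng.random(p_h0.shape) < p_h0).astype(float)
```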
Divide and conquer multilayer learning
• Re-representing the data: Each time the base
learner is called, it passes a transformed
version of the data to the next learner.
– Can we learn a deep, dense DAG one layer at a
time, starting at the bottom, and still guarantee
that learning each layer improves the overall
model of the training data?
• This seems very unlikely. Surely we need to know the
weights in higher layers to learn lower layers?
Multilayer contrastive divergence
• Start by learning one hidden layer.
• Then re-present the data as the activities of
the hidden units.
– The same learning algorithm can now be applied
to the re-presented data.
• Can we prove that each step of this greedy
learning improves the log probability of the
data under the overall model?
– What is the overall model?
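A sketch of this greedy, layer-by-layer procedure. The helper `train_rbm` below is a hypothetical stand-in for the single-layer CD-1 learner sketched earlier (mean-field activities are used in place of sampled states, which is a simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.1):
    """Hypothetical single-layer learner: CD-1 with mean-field activities, no biases."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(n_epochs):
        h0 = sigmoid(data @ W)          # up-pass
        v1 = sigmoid(h0 @ W.T)          # one-step reconstruction
        h1 = sigmoid(v1 @ W)            # second up-pass
        W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
    return W

def greedy_layerwise(data, layer_sizes):
    """Learn one hidden layer at a time; re-present the data as hidden
    activities and hand them to the next learner."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)              # re-presented data for the next layer
    return weights
```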
Learning a deep directed network

(available at www.cs.toronto/~hinton)
Limits of the Generative Model