Restricted Boltzmann Machines

and Deep Belief Networks

Presented by Matt Luciw

USING A VAST, VAST MAJORITY OF SLIDES ORIGINALLY FROM:

Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood
Motivations
• Supervised training of deep models (e.g. many-layered
NNets) is difficult (optimization problem)
• Shallow models (SVMs, one-hidden-layer NNets,
boosting, etc…) are unlikely candidates for learning
high-level abstractions needed for AI
• Unsupervised learning could do “local-learning” (each
module tries its best to model what it sees)
• Inference (+ learning) is intractable in directed graphical
models with many hidden variables
• Current unsupervised learning methods don’t easily
extend to learn multiple levels of representation
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We can observe some of the variables and we would like to solve two problems:
  – The inference problem: Infer the states of the unobserved variables.
  – The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.
• We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

(Figure: stochastic hidden causes at the top connected to visible effects at the bottom.)
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron's bias, b, and the input it receives from other neurons.

$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}$$

(Figure: sigmoid curve of $p(s_i = 1)$ rising from 0 through 0.5 to 1 as the total input $b_i + \sum_j s_j w_{ji}$ increases.)
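Below is a minimal NumPy sketch of such a unit; the bias and weight values are hypothetical and only illustrate the sampling rule above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary_unit(b, s, w):
    """Sample s_i in {0, 1} with p(s_i = 1) = sigmoid(b_i + sum_j s_j * w_ji)."""
    total_input = b + s @ w
    p_on = 1.0 / (1.0 + np.exp(-total_input))
    return int(rng.random() < p_on)

s = np.array([1, 0, 1])                      # states of the other neurons
w = np.array([0.5, -1.2, 0.3])               # incoming weights w_ji (hypothetical)
print(sample_binary_unit(b=-0.1, s=s, w=w))  # 0 or 1, drawn stochastically
```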
Stochastic units
• Replace the binary threshold units by binary stochastic units that make biased random decisions.
  – The temperature controls the amount of noise.
  – Decreasing all the energy gaps between configurations is equivalent to raising the noise level.

$$p(s_i = 1) = \frac{1}{1 + e^{-\sum_j s_j w_{ij}/T}} = \frac{1}{1 + e^{-\Delta E_i / T}}$$

where $T$ is the temperature and the energy gap is $\Delta E_i = E(s_i = 0) - E(s_i = 1)$.
The Energy of a joint configuration

$$E(v, h) = -\sum_{i\,\in\,\text{units}} s_i^{vh} b_i \;-\; \sum_{i<j} s_i^{vh} s_j^{vh} w_{ij}$$

where $E(v, h)$ is the energy with configuration $v$ on the visible units and $h$ on the hidden units, $s_i^{vh}$ is the binary state of unit $i$ in joint configuration $(v, h)$, $b_i$ is the bias of unit $i$, $w_{ij}$ is the weight between units $i$ and $j$, and the second sum indexes every non-identical pair of $i$ and $j$ once.
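As an illustration, here is a small NumPy sketch (not from the slides) that evaluates this energy for a hypothetical network of binary units; the symmetric, zero-diagonal weight matrix and the bias and state values are made up.

```python
import numpy as np

def energy(s, b, W):
    """E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij for binary states s,
    biases b, and a symmetric weight matrix W with zero diagonal."""
    return -(s @ b) - 0.5 * (s @ W @ s)   # the 0.5 counts each i<j pair once

s = np.array([1, 0, 1, 1])                # a joint configuration (v, h)
b = np.array([0.1, -0.2, 0.0, 0.3])
W = np.array([[ 0.0, 0.5, -0.3, 0.2],
              [ 0.5, 0.0,  0.1, 0.0],
              [-0.3, 0.1,  0.0, 0.4],
              [ 0.2, 0.0,  0.4, 0.0]])
print(energy(s, b, W))
```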
Weights → Energies → Probabilities
• Each possible joint configuration of the visible
and hidden units has an energy
– The energy is determined by the weights and biases
(as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$p(v, h) \propto e^{-E(v, h)}$$
• The probability of a configuration over the visible
units is found by summing the probabilities of all
the joint configurations that contain it.
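To make the energy-to-probability mapping concrete, here is a brute-force sketch for a tiny, hypothetical RBM (2 visible, 2 hidden units): enumerate every joint configuration, exponentiate the negative energies, normalize, and sum over hidden configurations to get the probability of a visible vector. All weights and biases are illustrative.

```python
import itertools
import numpy as np

a = np.array([0.1, -0.2])           # visible biases (hypothetical)
b = np.array([0.3,  0.0])           # hidden biases (hypothetical)
W = np.array([[0.5, -0.4],
              [0.2,  0.1]])         # visible-to-hidden weights (hypothetical)

def energy(v, h):
    # RBM form of the energy: no visible-visible or hidden-hidden terms
    return -(a @ v) - (b @ h) - (v @ W @ h)

configs = [np.array(c) for c in itertools.product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h)) for v in configs for h in configs)

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    # sum the probabilities of all joint configurations that contain v
    return sum(p_joint(v, h) for h in configs)

print(p_visible(np.array([1, 0])))
```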
Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier.
  – Only one layer of hidden units. (We deal with more layers later.)
  – No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
  – So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
  – This is a big advantage over directed belief nets.

(Figure: bipartite graph with a layer of hidden units $j$ above a layer of visible units $i$.)
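A short sketch (hypothetical shapes and values) of what that conditional independence buys: given a data-vector on the visible units, every hidden unit can be sampled independently and in parallel from p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W, b_hid):
    """One parallel sampling step for all hidden units given visible states v."""
    p = 1.0 / (1.0 + np.exp(-(b_hid + v @ W)))   # one sigmoid per hidden unit
    return (rng.random(p.shape) < p).astype(float), p

v = np.array([1.0, 0.0, 1.0])                    # a data-vector (hypothetical)
W = rng.normal(scale=0.1, size=(3, 4))           # visible x hidden weights
b_hid = np.zeros(4)
h_sample, h_prob = sample_hidden(v, W, b_hid)
print(h_sample, h_prob)
```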
Maximizing the training data log likelihood
• We want the maximizing parameters (standard PoE form):

$$\arg\max_{\theta_1, \ldots, \theta_n} \log p(D \mid \theta_1, \ldots, \theta_n) \;=\; \arg\max_{\theta_1, \ldots, \theta_n} \log \prod_{d \in D} \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$

The product runs over all training data, assuming the $d$'s are drawn independently from $p(\cdot)$.
• Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.
• The derivation is nasty.

Frank Wood - [email protected]


Equilibrium Is Hard to Achieve
• With

$$\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} \;=\; \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$

we can now train our PoE model.
• But there's a problem:
  – $P^\infty$ is computationally infeasible to obtain (especially in an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution. Often this takes a very long time!

Frank Wood - [email protected]


A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations.

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_v - \langle s_i s_j \rangle_{\text{free}}$$

The first term is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; the second is the expected value at thermal equilibrium when nothing is clamped.
The (theoretical) batch learning algorithm
• Positive phase
  – Clamp a data vector on the visible units.
  – Let the hidden units reach thermal equilibrium at a temperature of 1.
  – Sample $s_i s_j$ for all pairs of units.
  – Repeat for all data vectors in the training set.
• Negative phase
  – Do not clamp any of the units.
  – Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
  – Sample $s_i s_j$ for all pairs of units.
  – Repeat many times to get good estimates.
• Weight updates
  – Update each weight by an amount proportional to the difference in $\langle s_i s_j \rangle$ between the two phases.
Solution: Contrastive Divergence!

$$\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} \;=\; \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$

• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after 1 iteration (or perhaps a few iterations, more typically).
• Why does this work?
  – It attempts to minimize the ways that the model distorts the data.

Frank Wood - [email protected]


Contrastive Divergence
• Maximum likelihood gradient: pull down
energy surface at the examples and pull it up
everywhere else, with more emphasis where
model puts more probability mass
• Contrastive divergence updates: pull down
energy surface at the examples and pull it up
in their neighborhood, with more emphasis
where model puts more probability mass
Restricted Boltzmann Machines
• In an RBM, the hidden units are conditionally independent given the visible states. It only takes one step to reach thermal equilibrium when the visible units are clamped.
  – So we can quickly get the exact value of $\langle s_i s_j \rangle_v$.

(Figure: bipartite graph of hidden units $j$ and visible units $i$.)
A picture of the Boltzmann machine learning algorithm for an RBM

(Figure: an alternating Gibbs chain. At $t = 0$ a training vector is on the visible units and $\langle s_i s_j \rangle^0$ is measured; the hidden and visible layers are then updated in alternation, giving $\langle s_i s_j \rangle^1$, $\langle s_i s_j \rangle^2$, ..., until $t = \infty$, where the chain produces a "fantasy" and $\langle s_i s_j \rangle^\infty$ is measured.)

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^\infty \right)$$
Contrastive divergence learning: A quick way to learn an RBM

(Figure: a one-step chain. At $t = 0$ the training vector (data) is on the visible units and $\langle s_i s_j \rangle^0$ is measured; at $t = 1$ the reconstruction is on the visible units and $\langle s_i s_j \rangle^1$ is measured.)

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again.

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^1 \right)$$

This is not following the gradient of the log likelihood. But it works well. When we consider infinite directed nets it will be easy to see why it works.
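A minimal NumPy sketch of the CD-1 update just described, for a single training vector. The shapes, learning rate, and the use of probabilities rather than samples for the final statistics are illustrative choices, not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    # t = 0: sample the hidden units from the training vector
    p_h0 = sigmoid(b_hid + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # update all the visible units in parallel to get a "reconstruction"
    p_v1 = sigmoid(b_vis + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # t = 1: update the hidden units again
    p_h1 = sigmoid(b_hid + v1 @ W)
    # delta w_ij = lr * (<s_i s_j>^0 - <s_i s_j>^1)
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid

# Example: one update with 6 visible and 3 hidden units.
W = rng.normal(scale=0.01, size=(6, 3))
b_vis, b_hid = np.zeros(6), np.zeros(3)
v0 = (rng.random(6) < 0.5).astype(float)
W, b_vis, b_hid = cd1_update(v0, W, b_vis, b_hid)
```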
How to learn a set of features that are good for reconstructing images of the digit 2

(Figure: two copies of a network with 50 binary feature neurons above a 16 x 16 pixel image. On the data ("reality"), increment the weights between an active pixel and an active feature; on the reconstruction ("better than reality"), decrement the weights between an active pixel and an active feature.)
Using an RBM to learn a model of a digit class

(Figure: an RBM with 256 visible units (pixels) and 100 hidden units (features), trained with $\langle s_i s_j \rangle^0$ measured on the data and $\langle s_i s_j \rangle^1$ on the reconstruction. Shown: the data, reconstructions by a model trained on 2's, and reconstructions by a model trained on 3's.)
The final 50 x 256 weights
Each neuron grabs a different feature.

How well can we reconstruct the digit images from the binary feature activations?

(Figure: data alongside reconstructions from the activated binary features, for new test images from the digit class that the model was trained on and for images from an unfamiliar digit class; the network tries to see every image as a 2.)
Deep Belief Networks
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become
dependent when we observe an effect that they can both
influence.
– If we learn that there was an earthquake it reduces the
probability that the house jumped because of a truck.

(Figure: two hidden causes, "truck hits house" and "earthquake", each with bias -10 and a weight of 20 to the observed node "house jumps", which has bias -20. The posterior over the two causes given that the house jumped is p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001.)
Why multilayer learning is hard in a sigmoid belief net
• To learn W, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically intractable because of "explaining away".
• Problem 2: The posterior depends on the prior created by higher layers as well as the likelihood.
  – So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

(Figure: a stack of hidden-variable layers above a data layer; W is the likelihood weight matrix between the first hidden layer and the data.)
Solution: Complementary Priors

• There is a special type of multi-layer directed model in which it is easy to infer the posterior distribution over the hidden units because it has complementary priors.
• This special type of directed model is equivalent to an
undirected model.
– At first, this equivalence just seems like a neat trick
– But it leads to a very effective new learning algorithm that
allows multilayer directed nets to be learned one layer at a
time.
• The new learning algorithm resembles boosting with each layer
being like a weak learner.
An example of a complementary prior
• Infinite DAG with replicated weights.
  – An ancestral pass of the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
  – This infinite DAG defines the same distribution as an RBM.

(Figure: an infinite directed net of alternating layers ..., h2, v2, h1, v1, h0, v0, with weights $W^T$ from each visible layer down to the hidden layer below it and $W$ from each hidden layer down to the visible layer below it, continuing upward forever.)
Inference in a DAG with replicated weights
• The variables in h0 are conditionally independent given v0.
  – Inference is trivial. We just multiply v0 by $W^T$.
  – This is because the model above h0 implements a complementary prior.
• Inference in the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

(Figure: the same infinite net of layers ..., h2, v2, h1, v1, h0, v0 with alternating $W$ and $W^T$ weights.)
Divide and conquer multilayer learning
• Re-representing the data: Each time the base
learner is called, it passes a transformed
version of the data to the next learner.
– Can we learn a deep, dense DAG one layer at a
time, starting at the bottom, and still guarantee
that learning each layer improves the overall
model of the training data?
• This seems very unlikely. Surely we need to know the
weights in higher layers to learn lower layers?
Multilayer contrastive divergence
• Start by learning one hidden layer.
• Then re-present the data as the activities of
the hidden units.
– The same learning algorithm can now be applied to the re-presented data (a sketch follows this slide).
• Can we prove that each step of this greedy
learning improves the log probability of the
data under the overall model?
– What is the overall model?
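A sketch of the greedy loop described above. It assumes a train_rbm(data, n_hidden) helper (for instance, built around the CD-1 update sketched earlier) that returns the learned weights and biases; the helper's name and signature are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_deep_belief_net(data, hidden_sizes, train_rbm):
    """Greedily train one RBM per layer, re-presenting the data each time."""
    layers = []
    layer_input = data                           # N x n_visible matrix of data
    for n_hidden in hidden_sizes:
        W, b_vis, b_hid = train_rbm(layer_input, n_hidden)
        layers.append((W, b_vis, b_hid))
        # re-present the data as hidden-unit activities for the next learner
        layer_input = sigmoid(b_hid + layer_input @ W)
    return layers
```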
Learning a deep directed network
• First learn with all the weights tied.
  – This is exactly equivalent to learning an RBM.
  – Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.

(Figure: the infinite net ..., h2, v2, h1, v1, h0, v0 with all weights tied; it collapses to a single RBM between v0 and h0 with weights W.)

• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
  – This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

(Figure: the same net with W and $W^T$ frozen between v0 and h0 and the weights above still tied; learning them is a second RBM between h0 and v1.)
A simplified version with all hidden layers the same size
• The RBM at the top can be viewed as shorthand for an infinite directed net.
• When learning W1 we can view the model in two quite different ways:
  – The model is an RBM composed of the data layer and h1.
  – The model is an infinite DAG with tied weights.
• After learning W1 we untie it from the other weight matrices.
• We then learn W2, which is still tied to all the matrices above it.

(Figure: data → h1 → h2 → h3, with recognition weights $W_1^T$, $W_2^T$ going up, generative weights $W_1$, $W_2$ going down, and an RBM with weights $W_3$ at the top.)
Why greedy learning works
• Each time we learn a new layer, the inference at the layer
below becomes incorrect, but the variational bound on the
log prob of the data improves (only true in theory -ml).
• Since the bound starts as an equality, learning a new layer
never decreases the log prob of the data, provided we start
the learning from the tied weights that implement the
complementary prior.
• Now that we have a guarantee we can loosen the restrictions
and still feel confident.
– Allow layers to vary in size.
– Do not start the learning at each layer from the weights in
the layer below.
Why the hidden configurations should be treated as data when learning the next layer of weights
• After learning the first layer of weights:

$$\log p(x) \;\ge\; \sum_\alpha p(h^1 = \alpha \mid x)\left[\log p(h^1 = \alpha) + \log p(x \mid h^1 = \alpha)\right] + \text{entropy}\big(p(h^1 \mid x)\big)$$

• If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:

$$\log p(x) \;\ge\; \sum_\alpha p(h^1 = \alpha \mid x) \log p(h^1 = \alpha) + \text{constant}$$

• Maximizing the RHS is equivalent to maximizing the log prob of "data" that occurs with probability $p(h^1 = \alpha \mid x)$.
Fine-tuning with a contrastive version of the
“wake-sleep” algorithm
After learning many layers of features, we can fine-tune the
features to improve generation.
1. Do a stochastic bottom-up pass
– Adjust the top-down weights to be good at reconstructing
the feature activities in the layer below.
2. Do a few iterations of sampling in the top level RBM
-- Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass
– Adjust the bottom-up weights to be good at reconstructing
the feature activities in the layer above.
4. Not required! But helps the recognition rate (-ml).
A neural network model of digit recognition
The top two layers form a restricted Boltzmann machine whose free energy landscape models the low-dimensional manifolds of the digits. The valleys have names.

The model learns a joint density for labels and images. To perform recognition we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the free energy of the RBM with each of the 10 labels.

(Figure: 28 x 28 pixel image → 500 units → 500 units → 2000 top-level units, with 10 label units also connected to the top level.)
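A sketch of the second recognition option (hypothetical names and shapes): clamp each of the 10 labels in turn alongside the penultimate-layer features and score it with the standard RBM free energy F(v) = -sum_i a_i v_i - sum_j log(1 + exp(b_j + sum_i v_i w_ij)), picking the label with the lowest free energy.

```python
import numpy as np

def free_energy(v, W, a, b):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)) for an RBM."""
    return -(v @ a) - np.sum(np.logaddexp(0.0, b + v @ W))

def classify(features, W, a, b, n_labels=10):
    """Score each label by the top-level RBM's free energy with that label clamped."""
    scores = []
    for k in range(n_labels):
        label = np.zeros(n_labels)
        label[k] = 1.0
        v = np.concatenate([label, features])   # label units + feature units
        scores.append(free_energy(v, W, a, b))
    return int(np.argmin(scores))
```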
Show the movie of the network
generating digits

(available at www.cs.toronto/~hinton)
Limits of the Generative Model

1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation already performed and does not learn
to attend to the most informative parts of objects.
