A Fast Learning Algorithm for Deep Belief Nets
Geoffrey E. Hinton
Simon Osindero
Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4
Yee-Whye Teh
Department of Computer Science, National University of Singapore,
Singapore 117543
1 Introduction
Figure 1: The network used to model the joint distribution of digit images and
digit labels. In this letter, each training case consists of an image and an explicit
class label, but work in progress has shown that the same learning algorithm
can be used if the “labels” are replaced by a multilayer pathway whose inputs
are spectrograms from multiple different speakers saying isolated digits. The
network then learns to generate pairs that consist of an image and a spectrogram
of the same digit class.
form a directed acyclic graph that converts the representations in the asso-
ciative memory into observable variables such as the pixels of an image.
This hybrid model has some attractive features:
• There is a fast, greedy learning algorithm that can find a fairly good
  set of parameters quickly, even in deep networks with millions of
  parameters and many hidden layers.
• The learning algorithm is unsupervised but can be applied to labeled
  data by learning a model that generates both the label and the data.
• There is a fine-tuning algorithm that learns an excellent generative
  model that outperforms discriminative methods on the MNIST
  database of hand-written digits.
• The generative model makes it easy to interpret the distributed
  representations in the deep hidden layers.
• The inference required for forming a percept is both fast and accurate.
• The learning algorithm is local. Adjustments to a synapse strength
  depend on only the states of the presynaptic and postsynaptic neuron.
• The communication is simple. Neurons need only to communicate
  their stochastic binary states.
Figure 2: A simple logistic belief net containing two independent, rare causes
that become highly anticorrelated when we observe the house jumping. The bias
of −10 on the earthquake node means that in the absence of any observation,
this node is $e^{10}$ times more likely to be off than on. If the earthquake node is on
and the truck node is off, the jump node has a total input of 0, which means
that it has an even chance of being on. This is a much better explanation of
the observation that the house jumped than the odds of $e^{-20}$, which apply if
neither of the hidden causes is active. But it is wasteful to turn on both hidden
causes to explain the observation because the probability of both happening is
$e^{-10} \times e^{-10} = e^{-20}$. When the earthquake node is turned on, it “explains away”
the evidence for the truck node.
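The explaining-away effect described in the Figure 2 caption can be checked by brute-force enumeration. The Python sketch below assumes the parameters implied by the caption: biases of −10 on both hidden causes, a bias of −20 on the jump node, and a weight of +20 from each cause. Only the earthquake bias is stated explicitly; the other values are inferred from the quoted odds, and all names are illustrative.

```python
import itertools
import math


def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumed parameters (inferred from the odds quoted in the caption).
BIAS_EARTHQUAKE = -10.0
BIAS_TRUCK = -10.0
BIAS_JUMP = -20.0
WEIGHT = 20.0  # weight from each hidden cause to the jump node


def posterior_given_jump():
    """Posterior over (earthquake, truck) given that the house jumped."""
    joint = {}
    for quake, truck in itertools.product((0, 1), repeat=2):
        p_quake_on = logistic(BIAS_EARTHQUAKE)
        p_truck_on = logistic(BIAS_TRUCK)
        p_quake = p_quake_on if quake else 1.0 - p_quake_on
        p_truck = p_truck_on if truck else 1.0 - p_truck_on
        p_jump = logistic(BIAS_JUMP + WEIGHT * (quake + truck))
        joint[(quake, truck)] = p_quake * p_truck * p_jump
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}


posterior = posterior_given_jump()
for state, prob in sorted(posterior.items()):
    print(state, f"{prob:.6f}")
```

Under these assumed parameters, the two single-cause explanations share almost all of the posterior mass, so the two causes are strongly anticorrelated once the jump is observed.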
2 Complementary Priors
$$p(s_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j w_{ij}\right)}, \tag{2.1}$$

where $b_i$ is the bias of unit $i$. If a logistic belief net has only one hidden
layer, the prior distribution over the hidden variables is factorial because
their binary states are chosen independently when the model is used to
generate data. The nonindependence in the posterior distribution is created
by the likelihood term coming from the data. Perhaps we could eliminate
explaining away in the first hidden layer by using extra hidden layers to
create a “complementary” prior that has exactly the opposite correlations to
those in the likelihood term. Then, when the likelihood term is multiplied
by the prior, we will get a posterior that is exactly factorial. It is not at
all obvious that complementary priors exist, but Figure 3 shows a simple
example of an infinite logistic belief net with tied weights in which the priors
are complementary at every hidden layer (see appendix A for a more general
treatment of the conditions under which complementary priors exist). The
use of tied weights to construct complementary priors may seem like a mere
trick for making directed models equivalent to undirected ones. As we shall
see, however, it leads to a novel and very efficient learning algorithm that
works by progressively untying the weights in each layer from the weights
in higher layers.
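As a concrete reading of equation 2.1, the following Python sketch computes the probability that a single stochastic binary unit turns on and then samples its state. The function names and the list-based representation of states and weights are assumptions made for illustration only.

```python
import math
import random


def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_on_probability(bias, states, weights):
    """Equation 2.1: p(s_i = 1) from bias b_i, states s_j and weights w_ij."""
    total_input = bias + sum(s * w for s, w in zip(states, weights))
    return logistic(total_input)


def sample_unit(bias, states, weights, rng):
    """Draw a stochastic binary state for the unit."""
    return 1 if rng.random() < unit_on_probability(bias, states, weights) else 0


rng = random.Random(0)
p_on = unit_on_probability(-1.0, [1, 0, 1], [0.5, 2.0, 0.5])
print(p_on, sample_unit(-1.0, [1, 0, 1], [0.5, 2.0, 0.5], rng))
```

With the example values the total input is zero, so the unit has an even chance of turning on.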
1 The generation process converges to the stationary distribution of the Markov chain,
so we need to start at a layer that is deep compared with the time it takes for the chain to
reach equilibrium.
2 This is exactly the same as the inference procedure used in the wake-sleep algorithm
(Hinton et al., 1995) but for the models described in this letter no variational approximation
is required because the inference procedure gives unbiased samples.
[Figure 3 diagram: a stack of layers V0, H0, V1, H1, V2, etc., from bottom to
top, with adjacent layers connected by the tied weight matrices W and W^T;
visible units are indexed v_i and hidden units h_j.]
Figure 3: An infinite logistic belief net with tied weights. The downward arrows
represent the generative model. The upward arrows are not part of the model.
They represent the parameters that are used to infer samples from the posterior
distribution at each hidden layer of the net when a data vector is clamped
on $V_0$.
because the complementary prior at each layer ensures that the posterior
distribution really is factorial.
Since we can sample from the true posterior, we can compute the derivatives
of the log probability of the data. Let us start by computing the derivative
for a generative weight, $w_{ij}^{00}$, from a unit $j$ in layer $H_0$ to unit $i$ in layer
$V_0$ (see Figure 3). In a logistic belief net, the maximum likelihood learning
rule for a single data vector, $v^0$, is

$$\frac{\partial \log p(v^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - \hat{v}_i^0\right) \right\rangle, \tag{2.2}$$

where $\langle \cdot \rangle$ denotes an average over the sampled states and $\hat{v}_i^0$ is the
probability that unit $i$ would be turned on if the visible vector was stochastically
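$$\frac{\partial \log p(v^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle. \tag{2.3}$$

$$\frac{\partial \log p(v^0)}{\partial w_{ij}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle + \left\langle v_i^1 \left(h_j^0 - h_j^1\right) \right\rangle + \left\langle h_j^1 \left(v_i^1 - v_i^2\right) \right\rangle + \cdots \tag{2.4}$$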
All of the pairwise products except the first and last cancel, leaving the
Boltzmann machine learning rule of equation 3.1.
$$\frac{\partial \log p(v^0)}{\partial w_{ij}} = \left\langle v_i^0 h_j^0 \right\rangle - \left\langle v_i^\infty h_j^\infty \right\rangle. \tag{3.1}$$
[Figure 4 diagram: a Markov chain of alternating hidden and visible updates,
running to t = infinity.]
Figure 4: This depicts a Markov chain that uses alternating Gibbs sampling.
In one full step of Gibbs sampling, the hidden units in the top layer are all
updated in parallel by applying equation 2.1 to the inputs received from the
current states of the visible units in the bottom layer; then the visible units are
all updated in parallel given the current hidden states. The chain is initialized
by setting the binary states of the visible units to be the same as a data vector.
The correlations in the activities of a visible and a hidden unit are measured
after the first update of the hidden units and again at the end of the chain. The
difference of these two correlations provides the learning signal for updating
the weight on the connection.
This learning rule is the same as the maximum likelihood learning rule
for the infinite logistic belief net with tied weights, and each step of Gibbs
sampling corresponds to computing the exact posterior distribution in a
layer of the infinite logistic belief net.
Maximizing the log probability of the data is exactly the same as minimizing
the Kullback-Leibler divergence, $KL(P^0 \| P_\theta^\infty)$, between the distribution
of the data, $P^0$, and the equilibrium distribution defined by the model, $P_\theta^\infty$.
In contrastive divergence learning (Hinton, 2002), we run the Markov chain
for only $n$ full steps before measuring the second correlation.3 This is
equivalent to ignoring the derivatives that come from the higher layers of the
infinite net. The sum of all these ignored derivatives is the derivative of the
log probability of the posterior distribution in layer $V_n$, which is also the
derivative of the Kullback-Leibler divergence between the posterior
distribution in layer $V_n$, $P_\theta^n$, and the equilibrium distribution defined by the
model. So contrastive divergence learning minimizes the difference of two
Kullback-Leibler divergences:

$$KL\left(P^0 \| P_\theta^\infty\right) - KL\left(P_\theta^n \| P_\theta^\infty\right). \tag{3.2}$$
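A minimal Python sketch of one contrastive divergence update for a small restricted Boltzmann machine is given here, following the alternating Gibbs chain of Figure 4 truncated after n full steps. It is an illustration under simplifying assumptions, not the authors' implementation: it uses a single data vector, sampled binary states for both correlations, no bias updates, and hypothetical names such as `cd_update`, `n_steps`, and `lr`.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_layer(inputs, weights, biases, rng):
    """Sample binary units in parallel; weights[i][j] links input i to output j."""
    states = []
    for j, bias in enumerate(biases):
        total = bias + sum(s * weights[i][j] for i, s in enumerate(inputs))
        states.append(1 if rng.random() < logistic(total) else 0)
    return states


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def cd_update(v0, w, vis_bias, hid_bias, rng, n_steps=1, lr=0.1):
    """Update w in place from one data vector using n full Gibbs steps."""
    h0 = sample_layer(v0, w, hid_bias, rng)  # first update of the hidden units
    v, h = v0, h0
    for _ in range(n_steps):
        v = sample_layer(h, transpose(w), vis_bias, rng)  # visible given hidden
        h = sample_layer(v, w, hid_bias, rng)             # hidden given visible
    # Learning signal: correlation at the start minus correlation after n steps.
    for i in range(len(v0)):
        for j in range(len(h0)):
            w[i][j] += lr * (v0[i] * h0[j] - v[i] * h[j])
    return w


rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(3)]  # 3 visible units, 2 hidden units
cd_update([1, 0, 1], weights, [0.0] * 3, [0.0] * 2, rng, n_steps=1)
print(weights)
```

Letting `n_steps` grow approaches the maximum likelihood rule of equation 3.1, at the cost of a longer chain per update.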
Figure 5: A hybrid network. The top two layers have undirected connections
and form an associative memory. The layers below have directed, top-down
generative connections that can be used to map a state of the associative memory
to an image. There are also directed, bottom-up recognition connections that are
used to infer a factorial representation in one layer from the binary activities in
the layer below. In the greedy initial learning, the recognition connections are
tied to the generative connections.
are directed. The undirected connections at the top are equivalent to having
infinitely many higher layers with tied weights. There are no intralayer
connections, and to simplify the analysis, all layers have the same number
of units. It is possible to learn sensible (though not optimal) values for the
parameters $W_0$ by assuming that the parameters between higher layers will
be used to construct a complementary prior for $W_0$. This is equivalent to
assuming that all of the weight matrices are constrained to be equal. The
task of learning $W_0$ under this assumption reduces to the task of learning
an RBM, and although this is still difficult, good approximate solutions can
be found rapidly by minimizing contrastive divergence. Once $W_0$ has been
learned, the data can be mapped through $W_0^T$ to create higher-level “data”
at the first hidden layer.
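The idea of learning one layer and then mapping the data upward to produce "data" for the next layer can be sketched in Python as follows. This is only an outline under stated assumptions: `train_rbm` is a hypothetical callable standing in for any RBM trainer (for example, one based on contrastive divergence), and `zero_rbm` is a placeholder that returns untrained weights so that the sketch runs on its own.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def up_pass(v, w, hid_bias, rng):
    """Sample binary hidden states from a visible vector.

    w[i][j] is the weight between visible unit i and hidden unit j, so
    inference reads the generative weights in the transposed direction.
    """
    states = []
    for j, bias in enumerate(hid_bias):
        total = bias + sum(v[i] * w[i][j] for i in range(len(v)))
        states.append(1 if rng.random() < logistic(total) else 0)
    return states


def greedy_stack(data, hidden_sizes, train_rbm, rng):
    """Train one layer at a time, feeding each layer's samples to the next."""
    layers = []
    layer_data = data
    for n_hidden in hidden_sizes:
        w, hid_bias = train_rbm(layer_data, n_hidden)
        layers.append((w, hid_bias))
        # The sampled hidden states become the "data" for the next layer.
        layer_data = [up_pass(v, w, hid_bias, rng) for v in layer_data]
    return layers, layer_data


def zero_rbm(data, n_hidden):
    """Placeholder trainer (assumption): returns untrained zero weights."""
    n_visible = len(data[0])
    return [[0.0] * n_hidden for _ in range(n_visible)], [0.0] * n_hidden


layers, top_data = greedy_stack(
    [[1, 0, 1], [0, 1, 1]], [4, 2], zero_rbm, random.Random(0)
)
print(len(layers), top_data)
```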
If the RBM is a perfect model of the original data, the higher-level “data”
will already be modeled perfectly by the higher-level weight matrices. Gen-
erally, however, the RBM will not be able to model the original data perfectly,
and we can make the generative model better using the following greedy
algorithm:
so the bound is

$$\log p(v^0) \ge \sum_{\text{all } h^0} Q(h^0 | v^0) \left[\log p(h^0) + \log p(v^0 | h^0)\right] - \sum_{\text{all } h^0} Q(h^0 | v^0) \log Q(h^0 | v^0), \tag{4.2}$$
where $h^0$ is a binary configuration of the units in the first hidden layer, $p(h^0)$
is the prior probability of $h^0$ under the current model (which is defined by
the weights above $H_0$), and $Q(\cdot | v^0)$ is any probability distribution over
the binary configurations in the first hidden layer. The bound becomes an
equality if and only if $Q(\cdot | v^0)$ is the true posterior distribution.
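The behavior of the bound can be verified numerically on a toy model. The Python sketch below assumes a tiny logistic belief net with two hidden and two visible units, a factorial prior over the hidden units, and arbitrary illustrative parameters (`HID_BIAS`, `VIS_BIAS`, `W`); none of these values come from the letter. It enumerates all hidden configurations to compare the bound with the exact log probability.

```python
import itertools
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_bernoulli(p_on, state):
    return math.log(p_on if state else 1.0 - p_on)


# Illustrative parameters (assumptions); W[j][i] links hidden j to visible i.
HID_BIAS = [-1.0, 0.5]
VIS_BIAS = [0.2, -0.3]
W = [[2.0, -1.0], [-1.5, 1.0]]


def log_prior(h):
    return sum(log_bernoulli(logistic(b), s) for b, s in zip(HID_BIAS, h))


def log_likelihood(v, h):
    total = 0.0
    for i, v_i in enumerate(v):
        top_down = VIS_BIAS[i] + sum(h[j] * W[j][i] for j in range(len(h)))
        total += log_bernoulli(logistic(top_down), v_i)
    return total


def bound(v, q):
    """Right-hand side of the bound for a distribution q over hidden configs."""
    return sum(
        q_h * (log_prior(h) + log_likelihood(v, h) - math.log(q_h))
        for h, q_h in q.items()
        if q_h > 0.0
    )


v0 = (1, 0)
configs = list(itertools.product((0, 1), repeat=2))
joint = {h: math.exp(log_prior(h) + log_likelihood(v0, h)) for h in configs}
evidence = sum(joint.values())
log_evidence = math.log(evidence)
true_posterior = {h: p / evidence for h, p in joint.items()}
uniform_q = {h: 1.0 / len(configs) for h in configs}

print(log_evidence, bound(v0, true_posterior), bound(v0, uniform_q))
```

With the true posterior the bound matches the log probability up to rounding; with any other distribution, such as the uniform one here, it falls strictly below.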
When all of the weight matrices are tied together, the factorial distribution
over $H_0$ produced by applying $W_0^T$ to a data vector is the true posterior
distribution, so at step 2 of the greedy algorithm, $\log p(v^0)$ is equal to the
bound. Step 2 freezes both $Q(\cdot | v^0)$ and $p(v^0 | h^0)$, and with these terms fixed,
the derivative of the bound is the same as the derivative of

$$\sum_{\text{all } h^0} Q(h^0 | v^0) \log p(h^0). \tag{4.3}$$
So maximizing the bound with respect to the weights in the higher layers
is exactly equivalent to maximizing the log probability of a data set in
which $h^0$ occurs with probability $Q(h^0 | v^0)$. If the bound becomes tighter, it
is possible for $\log p(v^0)$ to fall even though the lower bound on it increases,
but $\log p(v^0)$ can never fall below its value at step 2 of the greedy algorithm
because the bound is tight at this point and the bound always increases.
The greedy algorithm can clearly be applied recursively, so if we use the
full maximum likelihood Boltzmann machine learning algorithm to learn
each set of tied weights and then we untie the bottom layer of the set from
the weights above, we can learn the weights one layer at a time with a
guarantee that we will never decrease the bound on the log probability of
the data under the model.4 In practice, we replace the maximum likelihood
Boltzmann machine learning algorithm by contrastive divergence learning
because it works well and is much faster. The use of contrastive divergence
voids the guarantee, but it is still reassuring to know that extra layers
are guaranteed to improve imperfect models if we learn each layer with
sufficient patience.
To guarantee that the generative model is improved by greedily learning
more layers, it is convenient to consider models in which all layers are the
same size so that the higher-level weights can be initialized to the values
learned before they are untied from the weights in the layer below. The
same greedy algorithm, however, can be applied even when the layers are
different sizes.
Learning the weight matrices one layer at a time is efficient but not optimal.
Once the weights in higher layers have been learned, neither the weights
nor the simple inference procedure are optimal for the lower layers. The
suboptimality produced by greedy learning is relatively innocuous for su-
pervised methods like boosting. Labels are often scarce, and each label may
provide only a few bits of constraint on the parameters, so overfitting is
typically more of a problem than underfitting. Going back and refitting the
earlier models may therefore cause more harm than good. Unsupervised
methods, however, can use very large unlabeled data sets, and each case
may be very high-dimensional, thus providing many bits of constraint on
a generative model. Underfitting is then a serious problem, which can be
alleviated by a subsequent stage of back-fitting in which the weights that
were learned first are revised to fit in better with the weights that were
learned later.
After greedily learning good initial values for the weights in every layer,
we untie the “recognition” weights that are used for inference from the
“generative” weights that define the model, but retain the restriction that
the posterior in each layer must be approximated by a factorial distribution
in which the variables within a layer are conditionally independent given
the values of the variables in the layer below. A variant of the wake-sleep
algorithm described in Hinton et al. (1995) can then be used to allow the
higher-level weights to influence the lower-level ones. In the “up-pass,” the
recognition weights are used in a bottom-up pass that stochastically picks
a state for every hidden variable. The generative weights on the directed
connections are then adjusted using the maximum likelihood learning rule
in equation 2.2 (see footnote 5). The weights on the undirected connections at the top
level are learned as before by fitting the top-level RBM to the posterior
distribution of the penultimate layer.
The “down-pass” starts with a state of the top-level associative mem-
ory and uses the top-down generative connections to stochastically activate
each lower layer in turn. During the down-pass, the top-level undirected
connections and the generative directed connections are not changed. Only
the bottom-up recognition weights are modified. This is equivalent to the
sleep phase of the wake-sleep algorithm if the associative memory is al-
lowed to settle to its equilibrium distribution before initiating the down-
pass. But if the associative memory is initialized by an up-pass and then
only allowed to run for a few iterations of alternating Gibbs sampling be-
fore initiating the down-pass, this is a “contrastive” form of the wake-sleep
algorithm that eliminates the need to sample from the equilibrium distri-
bution of the associative memory. The contrastive form also fixes several
other problems of the sleep phase. It ensures that the recognition weights
are being learned for representations that resemble those used for real data,
and it also helps to eliminate the problem of mode averaging. If, given a
particular data vector, the current recognition weights always pick a partic-
ular mode at the level above and ignore other very different modes that are
equally good at generating the data, the learning in the down-pass will not
try to alter those recognition weights to recover any of the other modes as it
would if the sleep phase used a pure ancestral pass. A pure ancestral pass
would have to start by using prolonged Gibbs sampling to get an equilib-
rium sample from the top-level associative memory. By using a top-level
associative memory, we also eliminate a problem in the wake phase: inde-
pendent top-level units seem to be required to allow an ancestral pass, but
they mean that the variational approximation is very poor for the top layer
of weights.
Appendix B specifies the details of the up-down algorithm using
MATLAB-style pseudocode for the network shown in Figure 1. For sim-
plicity, there is no penalty on the weights, no momentum, and the same
learning rate for all parameters. Also, the training data are reduced to a
single case.
5 Because weights are no longer tied to the weights above them, $\hat{v}_i^0$ must be computed
using the states of the variables in the layer above $i$ and the generative weights from these
variables to $i$.
$$p_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \tag{6.1}$$
where $x_i$ is the total input received by unit $i$. Curiously, the learning rules
are unaffected by the competition between units in a softmax group, so the
synapses do not need to know which unit is competing with which other
unit. The competition affects the probability of a unit turning on, but it is
only this probability that affects the learning.

database showed that a good way to model the joint distribution of digit images and their
labels was to use an architecture of this type, but for 16 × 16 images, only three-fifths as
many units were used in each hidden layer.
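Equation 6.1 can be written as a short Python function. The sketch below subtracts the largest input before exponentiating, a standard numerical-stability step that leaves the probabilities unchanged; this step and the function name `softmax` are additions for illustration and are not described in the letter.

```python
import math


def softmax(total_inputs):
    """Equation 6.1: p_i = exp(x_i) / sum_j exp(x_j)."""
    shift = max(total_inputs)  # subtracting a constant does not change p_i
    exps = [math.exp(x - shift) for x in total_inputs]
    norm = sum(exps)
    return [e / norm for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)
```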
After the greedy layer-by-layer training, the network was trained, with a
different learning rate and weight decay, for 300 epochs using the up-down
algorithm described in section 5. The learning rate, momentum, and weight
decay7 were chosen by training the network several times and observing its
performance on a separate validation set of 10,000 images that were taken
from the remainder of the full training set. For the first 100 epochs of the
up-down algorithm, the up-pass was followed by three full iterations of
alternating Gibbs sampling in the associative memory before performing
the down-pass. For the second 100 epochs, 6 iterations were performed, and
for the last 100 epochs, 10 iterations were performed. Each time the number
of iterations of Gibbs sampling was raised, the error on the validation set
decreased noticeably.
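The schedule for the number of Gibbs iterations during the 300 fine-tuning epochs can be expressed as a small lookup. The Python sketch below assumes 1-based epoch numbering; the function name is hypothetical.

```python
def gibbs_iterations(epoch):
    """Alternating Gibbs iterations used at a given fine-tuning epoch (1 to 300)."""
    if not 1 <= epoch <= 300:
        raise ValueError("epoch must be between 1 and 300")
    if epoch <= 100:
        return 3   # first 100 epochs
    if epoch <= 200:
        return 6   # second 100 epochs
    return 10      # last 100 epochs


schedule = [gibbs_iterations(e) for e in (1, 100, 101, 200, 201, 300)]
print(schedule)
```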
The network that performed best on the validation set was tested and
had an error rate of 1.39%. This network was then trained on all 60,000
training images8 until its error rate on the full training set was as low as
its final error rate had been on the initial training set of 44,000 images. This
took a further 59 epochs, making the total learning time about a week. The
final network had an error rate of 1.25%.9 The errors made by the network
are shown in Figure 6. The 49 cases that the network gets correct but for
which the second-best probability is within 0.3 of the best probability are
shown in Figure 7.
The error rate of 1.25% compares very favorably with the error rates
achieved by feedforward neural networks that have one or two hidden lay-
ers and are trained to optimize discrimination using the backpropagation
algorithm (see Table 1). When the detailed connectivity of these networks
is not handcrafted for this particular task, the best reported error rate for
stochastic online learning with a separate squared error on each of the 10
output units is 2.95%. These error rates can be reduced to 1.53% in a net
with one hidden layer of 800 units by using small initial weights, a separate
cross-entropy error function on each output unit, and very gentle learning
7 No attempt was made to use different learning rates or weight decays for different
layers, and the learning rate and momentum were always set quite conservatively to avoid
oscillations. It is highly likely that the learning speed could be considerably improved by
a more careful choice of learning parameters, though it is possible that this would lead to
worse solutions.
8 The training set has unequal numbers of each class, so images were assigned ran-
the network was then left running with a very small learning rate and with the test error
being displayed after every epoch. After six weeks, the test error was fluctuating between
1.12% and 1.31% and was 1.18% for the epoch on which the number of training errors was
smallest.
Figure 6: The 125 test cases that the network got wrong. Each case is labeled by
the network’s guess. The true classes are arranged in standard scan order.
Figure 7: All 49 cases in which the network guessed right but had a second
guess whose probability was within 0.3 of the probability of the best guess. The
true classes are arranged in standard scan order.
neural networks from 1.5% to 0.95%. There is no obvious reason why weight
sharing and subsampling cannot be used to reduce the error rate of the gen-
erative model, and we are currently investigating this approach. Further
improvements can always be achieved by averaging the opinions of multi-
ple networks, but this technique is available to all methods.
Substantial reductions in the error rate can be achieved by supplement-
ing the data set with slightly transformed versions of the training data. Us-
ing one- and two-pixel translations, Decoste and Schoelkopf (2002) achieve
0.56%. Using local elastic deformations in a convolutional neural network,
Simard, Steinkraus, and Platt (2003) achieve 0.4%, which is slightly bet-
ter than the 0.63% achieved by the best hand-coded recognition algorithm
(Belongie, Malik, & Puzicha, 2002). We have not yet explored the use of
distorted data for learning generative models because many types of dis-
tortion need to be investigated, and the fine-tuning algorithm is currently
too slow.
6.2 Testing the Network. One way to test the network is to use a
stochastic up-pass from the image to fix the binary states of the 500 units in
the lower layer of the associative memory. With these states fixed, the label
units are given initial real-valued activities of 0.1, and a few iterations of
alternating Gibbs sampling are then used to activate the correct label unit.
This method of testing gives error rates that are almost 1% higher than the
rates reported above.
Table 1: Error rates of Various Learning Algorithms on the MNIST Digit Recog-
nition Task.
A better method is to first fix the binary states of the 500 units in the
lower layer of the associative memory and to then turn on each of the
label units in turn and compute the exact free energy of the resulting
510-component binary vector. Almost all the computation required is in-
dependent of which label unit is turned on (Teh & Hinton, 2001), and this
method computes the exact conditional equilibrium distribution over labels
instead of approximating it by Gibbs sampling, which is what the previ-
ous method is doing. This method gives error rates that are about 0.5%
higher than the ones quoted because of the stochastic decisions made in
the up-pass. We can remove this noise in two ways. The simpler is to make
the up-pass deterministic by using probabilities of activation in place of
Figure 8: Each row shows 10 samples from the generative model with a particu-
lar label clamped on. The top-level associative memory is run for 1000 iterations
of alternating Gibbs sampling between samples.
Figure 9: Each row shows 10 samples from the generative model with a par-
ticular label clamped on. The top-level associative memory is initialized by an
up-pass from a random binary image in which each pixel is on with a probability
of 0.5. The first column shows the results of a down-pass from this initial high-
level state. Subsequent columns are produced by 20 iterations of alternating
Gibbs sampling in the associative memory.
8 Conclusion
posteriors—has been replaced by the constraint that the prior must make
the variational approximation exact.
After each layer has been learned, its weights are untied from the weights
in higher layers. As these higher-level weights change, the priors for lower
layers cease to be complementary, so the true posterior distributions in
lower layers are no longer factorial, and the use of the transpose of the
generative weights for inference is no longer correct. Nevertheless, we can
use a variational bound to show that adapting the higher-level weights
improves the overall generative model.
To demonstrate the power of our fast, greedy learning algorithm, we
used it to initialize the weights for a much slower fine-tuning algorithm
that learns an excellent generative model of digit images and their labels. It
is not clear that this is the best way to use the fast, greedy algorithm. It might
be better to omit the fine-tuning and use the speed of the greedy algorithm
to learn an ensemble of larger, deeper networks or a much larger training
set. The network in Figure 1 has about as many parameters as 0.002 cubic
millimeters of mouse cortex (Horace Barlow, personal communication,
1999), and several hundred networks of this complexity could fit within
a single voxel of a high-resolution fMRI scan. This suggests that much big-
ger networks may be required to compete with human shape recognition
abilities.
Our current generative model is limited in many ways (Lee & Mumford,
2003). It is designed for images in which nonbinary values can be treated as
probabilities (which is not the case for natural images); its use of top-down
feedback during perception is limited to the associative memory in the top
two layers; it does not have a systematic way of dealing with perceptual
invariances; it assumes that segmentation has already been performed; and
it does not learn to sequentially attend to the most informative parts of
objects when discrimination is difficult. It does, however, illustrate some of
the major advantages of generative models as compared to discriminative
ones:
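$$P(x|y) = \frac{1}{\Omega(y)} \exp\left(\sum_j \Phi_j(x, y_j) + \beta(x)\right)$$

$$P(y) = \frac{1}{C} \exp\left(\log \Omega(y) + \sum_j \alpha_j(y_j)\right), \tag{A.2}$$

$$P(x, y) = \frac{1}{C} \exp\left(\sum_j \Phi_j(x, y_j) + \beta(x) + \sum_j \alpha_j(y_j)\right). \tag{A.3}$$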
This condition is useful for our construction of the infinite stack of directed
graphical models.
Identifying the conditional independencies in equations A.4 and A.5
as those satisfied by a complete bipartite undirected graphical model, and
again using the Hammersley-Clifford theorem (assuming positivity), we see
that the following form fully characterizes all joint distributions of interest,
$$P(x, y) = \frac{1}{Z} \exp\left(\sum_{i,j} \Psi_{i,j}(x_i, y_j) + \sum_i \gamma_i(x_i) + \sum_j \alpha_j(y_j)\right), \tag{A.6}$$
mixes properly, we will eventually obtain unbiased samples from the joint
distribution given in equation A.6.
Now let us imagine that we unroll this sequence of Gibbs updates in
space, such that we consider each parallel update of the variables to consti-
tute states of a separate layer in a graph. This unrolled sequence of states has
a purely directed structure (with conditional distributions taking the form
of equations A.4 and A.5 and in alternation). By equivalence to the Gibbs
sampling scheme, after many layers in such an unrolled graph, adjacent
pairs of layers will have a joint distribution as given in equation A.6.
We can formalize the above intuition for unrolling the graph as follows.
The basic idea is to unroll the graph “upwards” (i.e., moving away from the
data), so that we can put a well-defined distribution over the infinite stack of
variables. Then we verify some simple marginal and conditional properties
of the joint distribution and thus demonstrate the required properties of the
graph in the “downwards” direction.
Let $x = x^{(0)}, y = y^{(0)}, x^{(1)}, y^{(1)}, x^{(2)}, y^{(2)}, \ldots$ be a sequence (stack) of
variables, the first two of which are identified as our original observed and
hidden variable. Define the functions

$$f(x', y') = \frac{1}{Z} \exp\left(\sum_{i,j} \Psi_{i,j}(x'_i, y'_j) + \sum_i \gamma_i(x'_i) + \sum_j \alpha_j(y'_j)\right) \tag{A.8}$$

$$f_x(x') = \sum_{y'} f(x', y') \tag{A.9}$$

$$f_y(y') = \sum_{x'} f(x', y') \tag{A.10}$$

$$g_x(x'|y') = f(x', y') / f_y(y') \tag{A.11}$$

$$g_y(y'|x') = f(x', y') / f_x(x'), \tag{A.12}$$

$$P\left(x^{(0)}, y^{(0)}\right) = f\left(x^{(0)}, y^{(0)}\right) \tag{A.13}$$

$$P\left(x^{(i)} | y^{(i-1)}\right) = g_x\left(x^{(i)} | y^{(i-1)}\right) \quad i = 1, 2, \ldots \tag{A.14}$$

$$P\left(y^{(i)} | x^{(i)}\right) = g_y\left(y^{(i)} | x^{(i)}\right). \quad i = 1, 2, \ldots \tag{A.15}$$
and similarly for $P(y^{(i)})$. Now we see that the following conditional
distributions also hold true:

$$P\left(x^{(i)} | y^{(i)}\right) = P\left(x^{(i)}, y^{(i)}\right) / P\left(y^{(i)}\right) = g_x\left(x^{(i)} | y^{(i)}\right) \tag{A.19}$$

$$P\left(y^{(i)} | x^{(i+1)}\right) = P\left(y^{(i)}, x^{(i+1)}\right) / P\left(x^{(i+1)}\right) = g_y\left(y^{(i)} | x^{(i+1)}\right). \tag{A.20}$$

So our joint distribution over the stack of variables also leads to the
appropriate conditional distributions for the unrolled graph in the “downwards”
direction. Inference in this infinite graph is equivalent to inference in the
joint distribution over the sequence of variables, that is, given $x^{(0)}$, we can
obtain a sample from the posterior simply by sampling $y^{(0)}|x^{(0)}$, $x^{(1)}|y^{(0)}$,
$y^{(1)}|x^{(1)}, \ldots$. This directly shows that our inference procedure is exact for
the unrolled graph.
% UP-DOWN ALGORITHM
%
% the data and all biases are row vectors.
% the generative model is: lab <--> top <--> pen --> hid --> vis
% the number of units in layer foo is numfoo
% weight matrices have names fromlayer tolayer
% "rec" is for recognition biases and "gen" is for generative
% biases.
% for simplicity, the same learning rate, r, is used everywhere.

% PREDICTIONS
psleeppenstates = logistic(sleephidstates*hidpen + penrecbiases);
psleephidstates = logistic(sleepvisprobs*vishid + hidrecbiases);
pvisprobs = logistic(wakehidstates*hidvis + visgenbiases);
phidprobs = logistic(wakepenstates*penhid + hidgenbiases);
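Each prediction line above has the form logistic(states*weights + biases), with the states and biases as row vectors. A plain Python reading of that pattern is sketched below; the helper name `predict` and the nested-list matrix layout (`weights[i][j]` from source unit i to target unit j) are assumptions for illustration and are not part of the pseudocode.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(states, weights, biases):
    """Row vector times matrix, plus biases, passed through the logistic."""
    return [
        logistic(bias + sum(s * weights[i][j] for i, s in enumerate(states)))
        for j, bias in enumerate(biases)
    ]


# Illustrative values only: two source units predicting two target units.
example_probs = predict([1, 0], [[0.0, 2.0], [5.0, 5.0]], [0.0, -2.0])
print(example_probs)
```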
Acknowledgments
References
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition
using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 24(4), 509–522.
Carreira-Perpinan, M. A., & Hinton, G. E. (2005). On contrastive divergence learn-
ing. In R. G. Cowell & Z. Ghahramani (Eds.), Artificial Intelligence and Statistics,
2005. (pp. 33–41). Fort Lauderdale, FL: Society for Artificial Intelligence and
Statistics.
Decoste, D., & Schoelkopf, B. (2002). Training invariant support vector machines.
Machine Learning, 46, 161–190.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and
Computation, 12(2), 256–285.
Friedman, J., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the Amer-
ican Statistical Association, 76, 817–823.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive
divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. (1995). The wake-sleep algorithm for
self-organizing neural networks. Science, 268, 1158–1161.
LeCun, Y., Bottou, L., & Haffner, P. (1998). Gradient-based learning applied to doc-
ument recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex.
Journal of the Optical Society of America, A, 20, 1434–1448.
Marks, T. K., & Movellan, J. R. (2001). Diffusion networks, product of experts, and
factor analysis. In T. W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Proc.
Int. Conf. on Independent Component Analysis (pp. 481–485). San Diego.
Mayraz, G., & Hinton, G. E. (2001). Recognizing hand-written digits using hier-
archical products of experts. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24, 189–197.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113.
Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies
incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical
models (pp. 355–368). Norwell, MA: Kluwer.
Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., & Barbano, P. (2005). Toward
automatic phenotyping of developing embryos from videos. IEEE Transactions on
Image Processing, 14(9), 1360–1371.
Roth, S., & Black, M. J. (2005). Fields of experts: A framework for learning image
priors. In IEEE Conf. on Computer Vision and Pattern Recognition (pp. 860–867).
Piscataway, NJ: IEEE.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear
feedforward neural network. Neural Networks, 2(6), 459–473.
Simard, P. Y., Steinkraus, D., & Platt, J. (2003). Best practice for convolutional neural
networks applied to visual document analysis. In International Conference on
Document Analysis and Recognition (ICDAR) (pp. 958–962). Los Alamitos, CA: IEEE
Computer Society.
Teh, Y., & Hinton, G. E. (2001). Rate-coded restricted Boltzmann machines for face
recognition. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural
information processing systems, 13 (pp. 908–914). Cambridge, MA: MIT Press.
Teh, Y., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models
for sparse overcomplete representations. Journal of Machine Learning Research, 4,
1235–1260.
Welling, M., Hinton, G., & Osindero, S. (2003). Learning sparse topographic rep-
resentations with products of Student-t distributions. In S. Becker, S. Thrun,
& K. Obermayer (Eds.), Advances in neural information processing systems, 15
(pp. 1359–1366). Cambridge, MA: MIT Press.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoni-
ums with an application to information retrieval. In L. K. Saul, Y. Weiss, & L.
Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488).
Cambridge, MA: MIT Press.