Priors and Algorithms for Bayesian Neural Networks
Charalambos Stavrou
Supervised by:
Dr Ricardo Silva
September 2017
Abstract
We investigate how priors on the neural network (NN) parameters impact the model through
the activation function. Specifically, we study how sparsity-inducing priors, such as the Laplace
and Student T distributions, differ from Normal priors in their impact on the neural network and
ultimately on the algorithm's performance. We review regularisation methods and dropout, and
also provide a possible Bayesian interpretation of dropout as a variant of a mixture model.
We examine the practicality of Bayesian neural networks (BNNs) by applying the algorithms
to both classification (MNIST) and regression (a variant of Mauna Loa) problems. The
BNNs are then compared to feed-forward NNs fitted using back-propagation with a variety of
different optimisers. To approximate the BNN parameters, experiments have been conducted
using Hamiltonian (Hybrid) Monte Carlo (HMC), Stochastic Gradient Hamiltonian
Monte Carlo (SGHMC), and Variational Inference (VI), in particular the Variational
Mean Field (VMF) approximation. In our analysis we explore BNNs with 1 hidden layer, with 2
hidden layers, and also a Bayesian convolutional NN (BCNN).
Keywords: Bayesian neural networks, neural networks, Bayesian convolutional neural net-
works, Sparsity priors, HMC, SGHMC, Variational inference, VMF, Dropout.
Acknowledgements
Firstly, I would like to thank my supervisor Dr Ricardo Silva, for all our helpful discussions
and his guidance during this project which aided my understanding of the concepts that will
be discussed. Finally, I would like to thank my family for their incredible support during my
studies, without whom I would not have otherwise ventured into this Master's degree.
Contents
1 Introduction
1.1 Neural Networks
1.2 Convolutional Neural Networks
1.3 The Bayesian Approach
1.4 Dissertation Structure
4 Regularisation methods
4.1 Frequentist Regularisation
4.2 Bayesian Regularisation
4.3 Dropout
4.4 Alternative view on Dropout
4.4.1 Model equations for interpretation (A)
4.4.2 Model equations for interpretation (B)
5 Application to data
5.1 Datasets
5.2 Feed-forward NNs (Frequentist models)
5.2.1 NN with 1 hidden layer
5.2.2 NN with 2 hidden layers
5.2.3 CNN with 2 convolutional layers and a hidden layer
5.3 Bayesian models
5.3.1 BNN with 1 hidden layer
5.3.2 BNN with 2 hidden layers
5.3.3 BCNN with 2 convolutional layers and a hidden layer
6 Conclusion
6.1 Further work
Appendices
A Algorithms
A.1 Introduction
A.1.1 SGD
A.1.2 Momentum
A.1.3 Nesterov's Accelerated Gradient (NAG)
A.1.4 Adam
B Derivations
B.1 Introduction
B.1.1 Derivative of the loss w.r.t. softmax input
B.1.2 Derivative of the loss w.r.t. the hidden layer
B.1.3 Derivative of the loss w.r.t. the bias of the output layer
B.1.4 Derivative of the loss w.r.t. the weight of the output layer
B.1.5 Derivative of the loss w.r.t. the ReLU module
C Plots
C.1 Chapter 2
C.1.1 1 hidden layer neural networks
C.1.2 2 hidden layer neural networks
C.2 Chapter 5
C.2.1 2 hidden layers NN
D Additional results
D.1 Chapter 5
D.1.1 1 hidden layer NN
D.1.2 2 hidden layers NN
Notation
For the purposes of this study the notation below will be used consistently. It may differ
slightly in parts of the study, but this will be obvious from the context.
C  matrix
c  vector
c  scalar
W  weight matrix
H  hidden layer matrix
b  bias vector
X  input matrix (n rows, d columns)
Y  matrix of true responses/labels (n rows, k columns)
Ŷ  output matrix (n rows, k columns)
x_i  single input data point (vector of length d)
ŷ_i  single output point (vector of length k)
y_i  single true response point (vector of length k)
D  dataset {X, Y}
I_N  N×N identity matrix (∀ i, j ∈ {1, 2, ..., N}: I_{ij} = 1 if i = j, and 0 if i ≠ j)
Abbreviations
NN Neural network
BNN Bayesian neural network
CNN Convolutional neural network
BCNN Bayesian convolutional neural network
GP Gaussian process
MC Monte Carlo
MCMC Markov chain Monte Carlo
VI Variational inference
KL Kullback-Leibler
SGD Stochastic gradient descent
w.r.t. With respect to
i.i.d. Independent and identically distributed
s.t. Such that
∀  For any or for all
vice versa The other way around
Operators and functions
≜  Defined as
⟹  Implies
x ≡ y  x is equivalent to y
log(x)  Natural logarithm of x
X^T  The transpose of the matrix X
a ⊙ b  Element-wise product of the two matrices/vectors a and b
d/dx  Gradient vector w.r.t. x
∇_x  Vector of partial derivatives w.r.t. x
0  A matrix or vector whose entries are all equal to zero
1  A matrix or vector whose entries are all equal to one
x_{-i}  x without x_i, i.e. (x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_d), a vector with the ith entry removed
e^z  Element-wise exponentiation of the vector z
relu(Z) = max(0, Z), for a matrix Z the maximum is applied element-wise
softplus(Z) = log(1 + e^Z), for a matrix Z the logarithm is applied element-wise
tanh(x) = sinh(x)/cosh(x) = (e^x - e^{-x})/(e^x + e^{-x}), the hyperbolic tangent
tanh(X)  Hyperbolic tangent applied element-wise
softmax(z) = e^z / \sum_c e^{z_c}, for a vector z
softmax(Z)  For a matrix Z the function is applied to each row, i.e. row i of softmax(Z) is e^{z_i} / \sum_c e^{z_{ic}}
E_y[x] = ⟨x⟩_y  Expectation of x under the distribution of y, i.e. under p(y)
\|x\|_1 = \sum_{i=1}^{n} |x_i|, for a vector x of length n
\|x\|_2^2 = \sum_{i=1}^{n} x_i^2, for a vector x of length n
δ_{ij} = 1 if i = j, and 0 if i ≠ j
1_{\{condition\}} = 1 if the condition is TRUE, and 0 if the condition is FALSE
Distributions
N_x(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (Normal/Gaussian)

N_x(\mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)    (Multivariate Normal/Gaussian)

The covariance matrix \Sigma is positive semi-definite, and the |.| operator above refers to the
determinant. If in the text we say that the weights w_i are independent, then we mean that \Sigma
has a diagonal structure, s.t. each of the w_i is sampled independently from a N_{w_i}(\mu_i, \sigma_{ii}^2),
where \sigma_{ii}^2 are the diagonal entries of \Sigma.

Laplace(x | \mu, b) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}    (Laplace distribution)

Laplace(x | \mu, B) = \prod_{i=1}^{d} \frac{1}{2 b_{ii}} e^{-\frac{|x_i - \mu_i|}{b_{ii}}}    (Multivariate Laplace)

For the purposes of this dissertation the multivariate Laplace distribution we define possesses
independence across the samples x_i, and therefore B will always be diagonal. Notice also
that |.| above refers to the absolute value function. Please note that this is not the general
multivariate Laplace definition found in other literature.

StudentT_\nu(x) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} \left[1 + \frac{x^2}{\nu}\right]^{-\frac{\nu+1}{2}}    (Standard Student T distribution)

StudentT_\nu(x | \mu, s) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})\, s} \left[1 + \frac{1}{\nu}\left(\frac{x-\mu}{s}\right)^2\right]^{-\frac{\nu+1}{2}}    (Scaled Student T distribution)

Ber(p) = p^x (1-p)^{1-x} = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases}    (Bernoulli distribution)

Categorical(\theta) = \begin{cases} \theta_1 & \text{if } x = 1 \\ \vdots & \\ \theta_m & \text{if } x = m \end{cases}, \quad \text{s.t. } \sum_{i=1}^{m} \theta_i = 1    (Categorical distribution)

Alternatively,

Categorical(\theta) = \exp\left( [\log\theta_1, \ldots, \log\theta_m]\, [1_{\{x=1\}}, \ldots, 1_{\{x=m\}}]^T \right), \quad \text{s.t. } \sum_{i=1}^{m} \theta_i = 1
Chapter 1
Introduction
Neural networks have dominated much of the machine learning literature and its applications
for the past decade, as they provide a very high degree of flexibility by being able to learn ar-
bitrarily complicated functions. For such a fast-growing field there is still a large gap in Bayesian
applications to neural networks. Bayesian neural networks are the probabilistic alterna-
tive to neural networks, which provide the framework to explore uncertainty in the models.
Bayesian neural networks also provide a natural method for model regularisation through
the prior distribution and the use of ensembles of predictors, by averaging the predictions
of several samples. An important distinction between BNNs and feed-forward NNs is that
in the Bayesian setting we are no longer confined to simple models to avoid over-fitting as
the prior helps to act as a regulariser, making them more compelling when the data is scarce.
In this dissertation we will expand on the work of (Neal, 1995 [1]) using Hamiltonian
Monte Carlo (HMC) to approximate the parameters of the Bayesian neural networks (BNNs).
The goal of this dissertation is to understand how different priors impact the BNN perfor-
mance and the resulting parameter approximations. Moreover, we will explore both the
advantages and disadvantages of HMC in practice. We compare these to other methods of
BNN parameter approximation, such as Variational Inference (Peterson and Anderson,
1987 [24]; Hinton and van Camp, 1993 [23]; Jordan et al., 1999 [25]; Chen et al., 2014 [57]),
but also to feed-forward NNs.
This dissertation will focus on introducing both the feed-forward and the Bayesian NN frame-
works. Furthermore, we will examine methods of regularisation. Our main interest lies with
dropout (Hinton et al., 2012 [34] and Srivastava et al., 2014 [35]), which currently serves as
the state-of-the-art method of regularising NNs, but lacks a theoretical justification for its
remarkable performance.
To set the background of this dissertation, the rest of this chapter will provide a brief
introduction to Neural Networks (NNs), Convolutional Neural Networks (CNNs), and the un-
derlying Bayesian approach to inference. We will conclude with an overview of what to
expect throughout the different parts of this dissertation.
Figure 1.1  A single (hidden) layer NN with inputs X, outputs Ŷ and hidden layer H. Note
that for simplicity the arrows connecting the nodes represent all the operations performed from the
previous layer to the current layer.
Although early interest in the field stemmed from psychologists and neuroscientists, it
did not take long for its teachings to spread into other areas. Neurocomputing sprang
to life (Rosenblatt, 1957 [5] and 1961 [6]), with the first successful neurocomputer being
built. The loss functions of the neural networks needed to be optimised somehow; with
the popularisation of back-propagation (Rumelhart et al., 1986 [7]) the neural network field
was reanimated. (Footnote: back-propagation was formerly known as automatic differentiation
(Linnainmaa, 1970 [8]).) The neural networks we will be focusing on in this study are significantly
deeper than their ancestral form. Many more have worked on neural networks since then, developing
state-of-the-art models for almost any task, from learning how to play video games,
to identifying objects, or even discovering new drugs.
Let

X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad H = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_m \end{pmatrix}
The activation function gets its name from the fact that it activates the neuron, acting as a
switch that turns a neuron on/off depending on a specific threshold, which in this case is zero.
The NN in figure 1.1 is defined mathematically below.
W_0 is the (d × m) weight matrix and b_0 is the (1 × m) bias vector connecting the input layer
to the hidden layer. We will refer to these as the first layer of parameters. Similarly, W_1 is the
(m × k) weight matrix and b_1 is the (1 × k) bias vector connecting the hidden layer with the
output layer, which we will refer to as the second layer of parameters.
H = relu(X W_0 + b_0)

For classification:  Ŷ = softmax(H W_1 + b_1)

For regression:  Ŷ = H W_1 + b_1
For the regression problem we propose the use of the squared loss function in order to
perform optimisation, whereas for the classification problem we will use the cross-entropy
loss. Note that back-propagation on the graph of the NN is as simple as computing the
derivatives on the different modules in the graph. The modules in this context refer to
operands, e.g. the linear module (linear layer in the network), the ReLU module, etc. In
fact, breaking the derivatives into different parts makes the computation faster, as we do not
have to recompute shared derivatives. Commonly used packages and software (TensorFlow,
Theano) can compute these derivatives automatically as long as the model graph is defined
correctly.
Squared Loss = \|Y - \hat{Y}\|_2^2 = \sum_{i=1}^{n} (y_i - f(x_i; \theta))^2    (1.1)

Cross-entropy Loss = -\sum_{i=1}^{n} \sum_{c=1}^{k} y_{ic} \log(\hat{y}_{ic}) = -\sum_{i=1}^{n} \sum_{c=1}^{k} y_{ic} \log(f_c(x_i; \theta))    (1.2)

where y_{ij} = 1 if the ith observation belongs to class j, and 0 otherwise.
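To make the model concrete, the following is a minimal NumPy sketch (our own illustration, not the dissertation's code) of the forward pass defined above and of the two loss functions (1.1) and (1.2); the array shapes follow the notation of this section, and all parameter values in the example are random placeholders.

import numpy as np

def relu(Z):
    # element-wise max(0, Z)
    return np.maximum(0.0, Z)

def softmax(Z):
    # row-wise softmax, shifted for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def forward(X, W0, b0, W1, b1, task="classification"):
    # H = relu(X W0 + b0); Yhat = softmax(H W1 + b1) for classification, H W1 + b1 for regression
    H = relu(X @ W0 + b0)
    Z = H @ W1 + b1
    return softmax(Z) if task == "classification" else Z

def squared_loss(Y, Y_hat):
    # equation (1.1)
    return np.sum((Y - Y_hat) ** 2)

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    # equation (1.2); Y is one-hot with n rows and k columns
    return -np.sum(Y * np.log(Y_hat + eps))

# example with random parameters: n=5 points, d=3 inputs, m=4 hidden units, k=2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W0, b0 = rng.normal(size=(3, 4)), np.zeros(4)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(2)
Y = np.eye(2)[rng.integers(0, 2, size=5)]
Y_hat = forward(X, W0, b0, W1, b1)
loss = cross_entropy_loss(Y, Y_hat)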
Note that NNs with 1 hidden layer or more can no longer be optimised directly, and
therefore optimisation involves approximation methods such as stochastic gradient
descent (SGD), where the data used to estimate the gradient is fed batch by batch over
several epochs (cycles over the data). The SGD algorithm can be found in Appendix
A.1.1. In practice SGD can have very slow convergence and the solutions of the algorithm
are usually local minima inconsistent with the global minimum, and therefore more sophisticated
methods need to be applied for improved convergence. Momentum (Rumelhart
et al., 1986 [7]) is a better approximation method that leads to more accurate updates by
accumulating momentum in the direction of steepest decrease of the gradient, providing an
additional push in order to overcome local minima; this algorithm can be found in Appendix
A.1.2. An improved variant of momentum is Nesterov's accelerated gradient (Nesterov,
1983 [10]); the algorithm can be found in Appendix A.1.3. (Footnote: SGD was originally
named Adaline by (Widrow, 1960 [9]).)
Both SGD and Momentum unfortunately are sensitive to the value of the learning rate,
which is a hyper-parameter that needs to be set prior to running the algorithm. If the learning
rate is too low, then convergence can be painfully slow. Too high a learning rate means that
the algorithm would never converge, as each of the updates is too large, resulting in
overshooting the global minimum every time. Ideally one would want to set an adaptive learning
rate that is initially quite large and decreases gradually as the approximation approaches the
true minimum. For this reason, a more reliable, state-of-the-art gradient approximation
method is Adam (Kingma et al., 2015 [11]); this will be the preferred method for this
study, although the other optimisers will also be explored in Chapter 5.
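As a quick illustration of the update rules referenced above (our own sketch, not the appendix pseudocode; the hyper-parameter names and defaults are ours, with the Adam defaults following the cited paper), the three optimisers differ only in how a gradient g is turned into a parameter update:

import numpy as np

def sgd_step(w, g, lr=0.01):
    # plain SGD: move against the gradient
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, mu=0.9):
    # accumulate an exponentially decaying velocity and move along it
    v = mu * v - lr * g
    return w + v, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates of the gradient
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    s_hat = s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# toy usage with a dummy gradient
w, m, s = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    w, m, s = adam_step(w, np.ones(3), m, s, t)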
For an arbitrary weight matrix W and bias vector b, if Z = HW + b, then:

\hat{y}_i = softmax(z_i), \qquad \hat{y}_{im} = softmax(z_i)_m = \frac{\exp(z_{im})}{\sum_{c=1}^{k} \exp(z_{ic})}
Note that the above derivatives were defined for an arbitrary weight matrix W and
input matrix H, and they can be trivially adapted to any inputs and/or weight matrices.
Finally, assume that we are using a ReLU non-linearity (Fukushima, 1975 [12]). Let
H = relu(M) = max(0, M); then:

\frac{\partial L}{\partial M} = W^T(\hat{Y} - Y) \odot 1_{\{M > 0\}}

The above result is derived in Appendix B.1.5. Similarly, for a softplus non-linearity,
H = softplus(M) = log(1 + e^M), the indicator is replaced by the sigmoid of M:

\frac{\partial L}{\partial M} = W^T(\hat{Y} - Y) \odot \frac{1}{1 + e^{-M}}

The above result is also derived in Appendix B.1.5.
Figure 1.3 Graphical representation of the operations in the CNN used in this dissertation.
The above CNN has a 3x3 convolution layer followed by a 2x2 max-pooling, followed by an-
other 3x3 convolution layer followed by another 2x2 max-pooling. The model then has a fully-
connected/flattened layer, and finally a hidden layer with 250 hidden units. The figure was gener-
ated by adapting the code from (Ding, 2016 [52]).
years later a fully functional CNN model was applied by (Lecun et al., 1998 [33]) for object
identification. Since then CNNs have been the best performing models at virtually any image
task as demonstrated by their success in numerous Machine Learning competitions.
The ample success of these networks stems from their translation invariance property,
which refers to the object appearance being independent of its location within the image
itself. Intuitively, the number 3 will still be a number 3 regardless of where it is found in
the image. This is a very powerful property, but in order to fully understand how it arises in
CNNs we would require an understanding of the model architecture.
The convolution is the first ingredient in the CNN. The kernel/receptive field, slides over
the image and maps each instance of the window to a different single entry in the feature
map (in our example from figure 1.3 the kernel size is 3x3). This mapping can be found by
summing over all the entries of the element-wise product of the kernel (Kf ) and the weight
matrix W(c) plus a bias term, where f is the instance of the kernel over the image and c is
the feature map (channel of the output). Mathematically,
y_{kl}^{(c)} = \sum_{i=1}^{3} \sum_{j=1}^{3} x_{k+i-1,\, l+j-1}\, w_{ij}^{(c)} + b_c

where c ∈ {1, 2, ..., 10}, and k, l ∈ {1, 2, ..., 26} for the first convolution in the above example,
because we slide the kernel one pixel at a time, going from left to right and then down one
row and repeating.
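A minimal sketch of this single-channel convolution map (matching the 3x3 kernel and unit stride above; the function name and NumPy implementation are our own illustration, not taken from the dissertation code):

import numpy as np

def conv2d_single_channel(x, w, b):
    # valid 3x3 convolution, stride 1: y[k, l] = sum_ij x[k+i-1, l+j-1] * w[i, j] + b
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.empty((out_h, out_w))
    for k in range(out_h):
        for l in range(out_w):
            y[k, l] = np.sum(x[k:k + kh, l:l + kw] * w) + b
    return y

# a 28x28 input with a 3x3 kernel gives a 26x26 feature map, as in the text
x = np.random.randn(28, 28)
y = conv2d_single_channel(x, np.random.randn(3, 3), 0.0)
assert y.shape == (26, 26)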
The next ingredient in the CNN is the max-pool mapping, which takes the maximum
over a receptive field. In our specific example the kernel size is 2x2, and we slide the recep-
tive field 2 pixels at a time so that none of the receptive fields are overlapping. The purpose
of this pooling layer is to reduce the spatial resolution of the images, in particular halving the
resolution of the image. Notice that the weight matrix W^{(c)} depends only on the feature
map c and is thus constant for all receptive fields; this, in combination with the max-pool layer,
gives rise to the translation invariance in CNNs. (Footnote: after each convolution we add a
1 pixel wide padding of zeros around the perimeter of the image to return the image to its
original size before the max-pooling is applied. For example, the (26x26) image resulting after
the first convolution is returned to (28x28) before the max-pool is applied.)
Let D = {X, Y} = {(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)}. We are interested in finding the most
likely distribution to have generated the data Y. Assuming that y = f(x; θ), we would be
interested in finding the parameters θ most likely to have generated the data Y. In order
to do this we need to make an assumption about the likelihood distribution p(Y|X, θ).
For example, in the classification setting we might assume that the likelihood has a softmax
distribution on the outputs of the function f(x; θ).
p(y = c | x, θ) = \frac{\exp(f_c(x, θ))}{\sum_{c' \in C} \exp(f_{c'}(x, θ))}
where C is the set of all classes in our data.
When trying to find the best possible model, a key quantity of interest is the posterior
distribution p(θ|D), which tells us how likely the specific parameters are given the data.

p(θ|D) = p(θ|Y, X) = \frac{p(Y|X, θ)\, p(θ)}{p(Y|X)} \propto p(Y|X, θ)\, p(θ)    (1.3)
where p(θ) is said to be the prior distribution on the parameters. (Footnote: in order to be
fully Bayesian we would also need to assume a hyperprior distribution p(η) on the
hyper-parameters η of the prior over θ. The posterior distribution in (1.3) would then become
p(θ, η|D) ∝ p(Y|X, θ, η) p(θ|η) p(η).) In the i.i.d. data setting, (1.3) would simplify to:

p(θ|D) \propto p(θ) \prod_{i=1}^{n} p(y_i | x_i, θ) = p(θ) \prod_{i=1}^{n} F_{y_i}(x_i; θ)
The denominator p(Y|X) is the normalising term, which we don't usually need to concern
ourselves with until the end, but for completeness:

p(Y|X) = \int p(Y|X, θ)\, p(θ)\, dθ

Many times we receive a new data point x_{n+1} and we are interested in predicting the
most probable y_{n+1}. The best prediction we can make is to take the expectation under the
posterior distribution.
Chapter 2
When trying to classify the MNIST digits dataset one might prefer to use a sparsity induc-
ing prior distribution such as the Laplace, Cauchy or Student-T distributions. These distribu-
tions are very spiky and are believed to be better suited for modelling sparse datasets such
as MNIST. In other datasets, Normally distributed priors might prove advantageous. More
importantly, prior distributions should be dataset and model specific as they incorporate our
beliefs about the structure of the data.
In his thesis Neal (1995, [1]) showed the functions sampled from several NNs with dif-
ferent prior distributions (Student-T distribution and Cauchy). In this Chapter we will expand
the drawn function space by considering different architectures, such as different activation
functions and prior distributions, and compare these with similar drawn functions from (Neal,
1995). Neal's thesis mainly used the tanh activation function, whereas in this dissertation
we will also explore ReLU and softplus, as they are thought to suffer less from vanishing
gradients than tanh or sigmoid activation functions.
case we assume that the data have a softmax likelihood:

p(y = c | x, θ) = softmax(f(x; θ))_c

where f(x; θ) is the output of a NN with θ being the parameters of the NN.
In the regression case we assume that the data have a Gaussian likelihood:
p(y | x, θ) = N_y(f(x; θ), \tau^{-1} I_n)

where \tau is the model precision (so \tau^{-1} is the model variance), and f(x; θ) is the output of the
NN. Note that the variance-covariance matrix for the likelihood is assumed to be diagonal,
which makes the calculations easier by assuming that the data are i.i.d.; this
does not necessarily have to be the case. As a rule of thumb, if we have no reason to
suspect that the data are dependent, we always assume the simpler model.
p(θ|D) = \frac{1}{Z}\, p(Y|X, θ)\, p(θ)

Assuming that the parameters are independent of each other we have that:

p(θ|D) = \frac{1}{Z}\, p(Y|X, W_0, b_0, W_1, b_1, \ldots, W_m, b_m) \prod_{j=0}^{m} p(W_j)\, p(b_j)

where m is the number of hidden layers in the NN, and Z is the model evidence:

Z = p(Y|X) = \int_{W_0} \int_{b_0} \cdots \int_{W_m} \int_{b_m} p(Y|X, θ) \prod_{j=0}^{m} p(W_j)\, p(b_j)\; dW_m\, db_m \cdots dW_0\, db_0
Both the posterior distribution and the model evidence are analytically intractable in most
cases, apart from some simple cases such as Bayesian linear regression with a conjugate
prior distribution. The target distribution for Bayesian inference is the posterior distribution;
its analytical intractability consequently necessitates the use of the approximate inference
methods, which will be discussed in more detail in Chapter 3.
(Footnote 1: Note that for the duration of the dissertation we will not be fully Bayesian in our approach, and the hyper-parameters will be fixed (chosen by a grid-search method).)
(Footnote 2: Assuming the data are i.i.d., we can further decompose p(Y|X, θ) = \prod_{i=1}^{n} p(y_i | f(x_i; θ)).)
2.2 Priors
It has long been thought that the structure of the prior distributions has an impact on sparsity
promotion. The benefit of sparsity promotion is the pruning effect on the parameters. One
example of early work in this field is Automatic Relevance Determination (MacKay, 1992 [55];
Neal, 1995 [1]), which aids in applying Occam's razor to the modelling process. This
regularisation aspect of prior distributions is highly attractive in the field of Machine Learning,
as it provides a natural safeguard against over-fitting that is essentially incorporated into the
model architecture. Many others have studied Bayesian sparsity since then (Gerven et al.,
2009 [53]; Jylanki et al., 2014 [21]; Simpson et al., 2017 [37]; Ingraham and Marks, 2017
[54]).
The practicality of incorporating the model regularisation and pruning into the prior distri-
bution has motivated the more detailed exploration of prior distributions in the sections that
will follow. We will explore the impact of different prior distributions on the drawn functions of
neural networks.
Figure (2.1) is quite interesting as it shows that using a Normal prior for the weights leads
approximately to a Gaussian Process (GP), a result argued also by (Neal, 1995). A more
fascinating result is that for the Laplace priors, using low variance priors results in a drawn
function that represents a less smooth GP. Notice that for a higher variance the drawn function
becomes very spiky, a consequence of the heavier tails of the Laplace distribution in
comparison to the Normal distribution.
(Footnote: the link to the specific folder for this chapter on the GitHub repository is: https://github.com/cstavrou/Priors-and-Algorithms-for-Bayesian-Neural-Networks/)
Figure 2.1  Drawn functions from 1 hidden layer BNNs with 10000 hidden units each, evaluated by
taking the average drawn function from 1000 drawn samples. The top left plot has a N(0, (10^2/d_j) I_{d_j})
prior over the weights, and the bottom left plot has a N(0, (20^2/d_j) I_{d_j}) prior over the weights, where d_j is
the number of inputs from the previous layer (the number of columns of the units being multiplied
by the weight matrix). The top right plot has a Laplace(0, (5^2/d_j) I_{d_j}) prior on the weights, and the
bottom right plot has a Laplace(0, (10^2/d_j) I_{d_j}) prior distribution on the weights. All the networks have
tanh non-linearities for the hidden layer, and the input region is -1 to +1.
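A minimal sketch of how such prior draws can be generated (our own illustrative code, not the repository's): sample a weight configuration from the prior, scale it by the fan-in d_j as in the captions, push a grid of inputs through the network, and repeat. Biases are fixed to zero here for simplicity, and the Laplace scale parameterisation below is an assumption matching the captions.

import numpy as np

def draw_prior_function(x, n_hidden=1000, prior="normal", scale=10.0, rng=None):
    # draw one function f(x) from a 1 hidden layer BNN prior over an input grid x of shape (n, d)
    rng = rng or np.random.default_rng()
    d = x.shape[1]
    if prior == "normal":
        # variance scale^2 / d_j, i.e. standard deviation scale / sqrt(d_j)
        W0 = rng.normal(0.0, np.sqrt(scale**2 / d), size=(d, n_hidden))
        W1 = rng.normal(0.0, np.sqrt(scale**2 / n_hidden), size=(n_hidden, 1))
    elif prior == "laplace":
        # Laplace scale taken as scale^2 / d_j (assumed parameterisation)
        W0 = rng.laplace(0.0, scale**2 / d, size=(d, n_hidden))
        W1 = rng.laplace(0.0, scale**2 / n_hidden, size=(n_hidden, 1))
    else:
        raise ValueError("unknown prior")
    h = np.tanh(x @ W0)          # tanh hidden layer, zero biases
    return (h @ W1).ravel()

# evaluate one draw over a 2-d input grid in [-1, 1]^2, as used for the surface plots
grid = np.linspace(-1.0, 1.0, 50)
xx, yy = np.meshgrid(grid, grid)
X = np.column_stack([xx.ravel(), yy.ravel()])
f = draw_prior_function(X, prior="normal", scale=10.0)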
We will next consider BNNs with unscaled standard Student T prior distributions
on the weights, all of which are independent and identically distributed. The Student
T distribution again has heavier tails than the Normal distribution, which implies that there is
a higher probability of a weight taking a large value than for the Normally distributed weights,
which tend to be more centred around the mean. Decreasing the degrees of freedom of the
Student T distribution results in the tails becoming heavier, as the distribution has lower
certainty for the mean value. We will also explore the drawn functions when using a spike and
slab (Mitchell and Beauchamp, 1988 [27]; Ishwaran and Rao, 2005 [28]) prior distribution
over the weights. The spike and slab prior has been used by many in the past to perform
Bayesian variable selection. It works by assigning some of the weights to be exactly equal
to zero with some probability π, which results in the spike; with probability (1 - π) the weights
are drawn from a wide distribution, typically a Gaussian with large variance, known as the
slab.
Figure 2.2  Drawn functions from 1 hidden layer BNNs with 10000 hidden units each, evaluated
by taking the average drawn function from 1000 drawn samples. The top left plot displays a BNN
with i.i.d. standard Student T priors over the weights with ν = 0.5 degrees of freedom. The bottom
left plot has i.i.d. Student T priors with ν = 1 for each of the weights, which is also known as the
Cauchy distribution. The top right plot has i.i.d. Student T priors over the weights with ν = 1.5.
In the bottom right plot, the weights are i.i.d. with a spike and slab prior distribution, with π = 0.2
and σ² = 30²/d_j. All the networks have tanh non-linearities for the hidden layer, and the input
region is -10 to +10.
The drawn functions in figure (2.2) are considerably different from the ones we observed
for the Normal and the Laplace prior distributions over the weights. For the lower degrees of
freedom the function forms a relatively smoothed plane with large spikes leading to another
plateau. The drawn functions using a Student T distribution for a prior appear to have a
similar structure. The Student T prior appears to produce a smoother, but very similarly shaped,
function to the one drawn under the Spike and Slab prior.
In addition to tanh non-linearities, ReLU and Softplus non-linearities were also used
for the 1 hidden layer BNNs; their plots are much smoother planes. For the plots please refer
to Appendix C.1.1. Both the ReLU and Softplus non-linearities seem to absorb
the extensive curvature in the surfaces of the drawn functions. This helps to provide some
graphical intuition to the reader as to why ReLU tends to be so widely used in practice, for it
is believed to suffer less from exploding gradients than other rectifiers.
2.3.2 2 hidden layer BNN
We will now delve into slightly deeper neural networks; the term deeper refers to the net-
works possessing more hidden layers. For deep learning standards these networks are still
considered to be quite shallow and are therefore not yet at the stage requiring special treat-
ment to avoid vanishing gradients, or large singular values in the Jacobian. We would expect
that as we increase the depth of the models, the samples from the prior may become more
varying (Duvenaud et al., 2014 [30]). In [30] it was observed that increasing the depth of
the models resulted in one dominating singular value, which they explained to be the result
of the heavy-tailed Jacobian distribution. The state-of-the-art networks in (Duvenaud et al.,
2014 and He et al., 2015 [29]) leverage direct skip connections between the inputs and the
hidden layers in order to mitigate these adverse effects arising from increasingly deep models.
Similarly to the 1 hidden layer networks seen in figure (2.1), the drawn function from the
model with the Normally distributed prior over the weights looks very similar to a GP. A similar
result is also observed for the Laplace priors with low scale values (b = 5²/d_j).
Figure 2.4  Drawn functions from 2 hidden layer BNNs with 10000 hidden units for each of the
hidden layers (same for all drawn functions). The drawn functions are evaluated by taking the
average of 100 samples. The top left plot has a N(0, (10^2/d_j) I_{d_j}) prior distribution over the weights,
whilst the bottom left plot has a N(0, (20^2/d_j) I_{d_j}) prior distribution over the weights. The top right plot
has a Laplace(0, (10^2/d_j) I_{d_j}) prior over the weights, whilst the bottom right plot has a Laplace(0, (20^2/d_j) I_{d_j})
prior over the weights, where d_j is the number of inputs from the previous layer (the number of
columns of the units being multiplied by the weight matrix). All the networks use ReLU activation
functions for both of the hidden layers, and the input region is -60 to +60.
A very similar plot can be observed for Softplus activation functions, which can be found
in the Appendix C.1.2. We believe the highly smooth surface is an artefact of the high
variance of the inputs to the first hidden layer. In fact, changing the prior for the weights
multiplying the inputs X to have a standard deviation of 1 in the Normal case, and a scale of
1 in the Laplace case, allows us to distinguish the impact of different activation functions on
the samples from the prior distribution. From figure 2.5 it is obvious that both the ReLU and
Softplus have a smoothing effect on the drawn functions. Although both ReLU and Softplus
provide a mild curvature, if we observe closely, we can detect that the drawn functions using
ReLU activatios have a marginally coarser surface. We believe that this is caused from the
hard 0 minimum constraint on the outputs of the hidden layer, as opposed to the smoother
progression towards 0 that Softplus provides.
We will now consider using Standard T prior distribution functions of varying degrees of
freedom. Much like (Neal, 1995), we observe a plateau type structure. One would expect
such models to be naturally good at classification tasks due to the formation of a step type
structure. We could thus think of the different levels as different classes we would want our
function to predict. However, one can already see a potential problem with having such prior
Figure 2.5  Drawn functions from 2 hidden layer networks with 1000 hidden units for each layer. The
functions plotted are the average of 1000 samples. For all the plots the first layer of weights is sampled
from a distribution with variance 1 for the Normal prior networks and scale 1 for the Laplace
prior networks respectively. All the plots on the left-hand side use ReLU activation functions for the
hidden layers, and those on the right Softplus activation functions. The top two plots have all the weights
for the remaining 2 hidden layers drawn independently from a Laplace(0, 10²/d_j) distribution. The
second row plots have all their weights drawn independently from a Laplace(0, 20²/d_j). The third
row plots have all the weights for the remaining two layers of weights drawn independently from a
N(0, 10²/d_j) distribution, and the final row from a N(0, 20²/d_j) prior distribution. The input region
for all the plots is -5 to +5.
functions. Separating the different plateaus is a wall with almost infinite gradients. Depend-
ing on the likelihood function that will be used this might make it extremely difficult, if not
impossible, to learn the underlying function. The result in practice is that Student T
distribution priors are very difficult to tune to provide stable results.
Figure 2.6  Drawn functions from 2 hidden layer BNNs with 1000 hidden units for each of the
hidden layers (same for all drawn functions). The drawn functions are evaluated by taking the
average of 1000 samples. The upper plot has a StudentT(ν = 0.5) prior over the weights. The
middle plot has a StudentT(ν = 1) prior over all the weights. The bottom plot has a StudentT(ν = 1.5)
prior distribution over the weights. All the networks have tanh non-linearities for both of the
hidden layers, and the input region is -1 to +1.
Figure 2.7  Drawn functions from 2 hidden layer BNNs with 1000 hidden units for each of the hidden
layers, evaluated by taking the average of 1000 samples. All the plots have σ² = 1 for the Spike
and Slab prior over the weights multiplying the inputs; the top row has π = 0.2 and the bottom row
has π = 0.5. The remaining layers of weights are sampled independently from: an SS(0.2, 20²/d_j) prior
for the top left plot, an SS(0.2, 30²/d_j) for the top right plot, an SS(0.5, 20²/d_j) for the bottom left
plot, and an SS(0.5, 30²/d_j) for the bottom right plot. All the networks use tanh non-linearities for
their hidden layers, and the input region is -1 to +1.
When using spike and slab priors (figure 2.7) we observed that the drawn functions had
an intrinsic smooth structure. Unlike the functions drawn from pure Normal distributions
(figures 2.1 and 2.3), the surfaces observed were less rough and possessed some curved slabs.
This can be attributed to the property of spike and slab priors of assigning exactly 0 to the
weights, and as a result of eliminating variables the drawn function tends to be smoother.
We would expect such priors to perform well in scenarios where the data is sparse, and we thus
want to assign a zero weight to uninformative inputs. This is typical in image data,
where large parts of the image are just the background.
We can identify that using a mixture of different priors allows for greater flexibility in the
drawn function (figure 2.8). With the mixture of Normal and Student T priors with degrees
of freedom ν = 0.5 we can observe a plateau-like structure similar to the one we witnessed
in figure 2.6, with the added benefit of the smooth surface, meaning that the drawn func-
tions would no longer suffer from potentially infinite gradients. This result is the same for the
higher variance Normal priors tested, that have a mildly coarser surface. As we increase the
degrees of freedom of the Student T distribution we begin to loose the plateau-like behaviour
and tend towards the functions observed when using fully Normal priors over the weights,
an unsurprising result given the nature of the Student T distribution.
As in the Normal distribution case explored in figure 2.8, we have the plateau-like effect
caused by the Student T distribution priors. For ν = 0.5 we have sharper edges than those
observed when using the Normal distribution priors for the first 2 layers of the networks. In
general, the networks using the Laplace and Student T mixture can be related to the sam-
ples drawn when using a Normal and T distribution mixture, with a rougher surface.
To conclude, we expect that the surfaces of the drawn functions will ultimately have an
impact on our ability to model different data more accurately. We suspect that the rougher
functions would benefit mostly in sparser datasets where our training algorithm could be
aimed at detecting abrupt signals in the data, e.g. in the MNIST digits dataset used in chap-
ter 5 we have an input of 784 pixels, most of which are zero and few equal to 1 and thus the
signal arising from the inputs is very acute. This contrasts with regression on continuous inputs,
where the signal might arise from a smooth function and smoother priors like the Normal
distribution might make more sense.
Figure 2.8  Drawn functions from 2 hidden layer BNNs with 1000 hidden units for each of the hidden
layers, evaluated by taking the average of 1000 samples. The plots on the left have their weights
drawn independently from a N(0, 10²/d_j) for the first two layers (linear layer and first hidden layer);
the ones on the right from a N(0, 20²/d_j). The final layer of weights (from second hidden layer
to the output) is drawn independently from a standard Student T with degrees of freedom 0.5,
1, and 1.5 for the first, second and third rows of plots respectively. All the networks use tanh
non-linearities for their hidden layers, and the input region is -1 to +1.
Figure 2.9  Drawn functions from 2 hidden layer BNNs with 1000 hidden units for each of the
hidden layers, evaluated by taking the average of 1000 samples. The plots have their weights drawn
independently from a Laplace(0, 10²/d_j) for the first two layers (linear layer and first hidden layer).
The final layer of weights (from second hidden layer to the output) is drawn independently from a
standard Student T with degrees of freedom 0.5, 1, and 1.5 for the top, middle and bottom plots
respectively. All the networks use tanh non-linearities for their hidden layers, and the input region
is -1 to +1.
Chapter 3
Methods of Approximate
Inference
We have seen in 2.1 that the computations required for BNNs very quickly become analyt-
ically intractable, and therefore we require the use of approximate methods. This chapter is
intended to develop a sufficient understanding of the approximate inference machinery that
will be used in the subsequent chapters of this work.
The approximate inference methods seen in this chapter are only a small selection of
the current approximation methods applied to BNNs. Recent research has been focused
on finding good approximations that scale well on larger datasets (Hernandez-Lobato and
Adams, 2015 [39], Blundell et al., 2015 [40]). Although, both Probabilistic Backpropagation
and Bayes by Backprop are computationally efficient, they can lead to large approximation
errors and an underestimation of the model uncertainty, and will therefore not be explored in
this study.
This section will not provide a comprehensive introduction to MCMC; a more detailed ex-
ploration can be found in (Neal, 1993) and (Gilks et al., 1996). We will briefly introduce
the vital components of MCMC using adaptations of the notation and definitions from (Neal,
1993 [14] and Bishop, 2006 [16]).
Definition 1. Markov Chain
A sequence X_0, X_1, \ldots, X_n of random variables forms a Markov chain if

p(X_i | X_{i-1}, X_{i-2}, \ldots, X_0) = p(X_i | X_{i-1}), \quad \forall i \in \{1, 2, \ldots, n\}.

The joint distribution of the Markov chain is characterised by the marginal distribution of
X_0 (the initial distribution p(X_0)) and the transition probability distribution, i.e. the conditional
distribution p(X_t | X_{t-1}), t \in \{1, 2, \ldots, n\}:

p(X_n, X_{n-1}, \ldots, X_0) = p(X_0) \prod_{i=1}^{n} p(X_i | X_{i-1})

Let the transition probability for state x' at time n + 1 from state x at time n be

T_n(x, x') = p(X_{n+1} = x' | X_n = x).

A distribution \pi(x) satisfies detailed balance with respect to T if

\pi(x)\, T(x, x') = \pi(x')\, T(x', x)
where S is the state space of the Markov chain. A sufficient condition for ergodicity is that,
with some non-zero probability, starting from any state x ∈ S we can transition to any other
state x' ∈ S within some finite number of steps k,

T^k(x, x') > 0, \quad \forall x, x' \in S,

where T^k(x, x') corresponds to the transition probability kernel for k time steps ahead. If
T(·) is a transition probability matrix then this would correspond to raising the matrix to the
power k.
Since we want our Markov chain to sample from the target distribution, arranging the
Markov chain such that the target distribution is invariant is the first requirement. The reason
for the invariance requirement is that we do not want the process of generating samples to
change the distribution every time. The second requirement is that the Markov chain is also
ergodic, as an ergodic Markov chain can only have one invariant distribution. This means
that the sampling scheme should converge to the correct target distribution.
Figure 3.1  The Metropolis-Hastings sampling algorithm. Here p(x) = p̃(x)/Z is the target
distribution, with p̃ the unnormalised density and Z the normalising factor; q(x|x') is the proposal
distribution, d is the number of dimensions of each sample, and M is the total number of samples.
Using the definition of α(·, ·) and the symmetry of min(·, ·) we have that

p(x)\, T(x, x') = p(x)\, q(x'|x)\, \alpha(x, x')
              = p(x)\, q(x'|x) \min\left(1, \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)}\right)
              = \min\left(p(x)\, q(x'|x),\; p(x')\, q(x|x')\right)
              = p(x')\, q(x|x') \min\left(\frac{p(x)\, q(x'|x)}{p(x')\, q(x|x')},\; 1\right)
              = p(x')\, T(x', x)
therefore MH satisfies detailed balance, which thereby guarantees that p(x) is an invariant
distribution. However, satisfying the ergodicity condition depends on the target and proposal
distributions of the specific example.
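To make the scheme in figure 3.1 concrete, here is a minimal random-walk Metropolis-Hastings sketch (our own illustration, with a Gaussian proposal chosen for simplicity; it only needs the log of the unnormalised target p̃(x)):

import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_samples=1000, step=0.5, rng=None):
    # random-walk MH: propose x' ~ N(x, step^2 I), accept with prob min(1, p~(x')/p~(x))
    rng = rng or np.random.default_rng()
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    log_p = log_p_tilde(x)
    for m in range(n_samples):
        x_prop = x + step * rng.normal(size=x.shape)     # symmetric proposal, so q terms cancel
        log_p_prop = log_p_tilde(x_prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:   # accept/reject step
            x, log_p = x_prop, log_p_prop
        samples[m] = x
    return samples

# example: sample from an unnormalised standard Gaussian target
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2), n_samples=5000)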
Suppose that our target distribution is p(x) and that we wish to generate M samples.
Figure 3.2  The Gibbs sampling algorithm. Here p(x) is the target distribution, p(x_i | x_{-i}) is the
conditional distribution for the ith dimension of x, d is the number of dimensions of each sample,
and M is the total number of samples.
Remark: Let us first note that when we sample x'_i from the conditional distribution, the
remaining values x_{-i} remain unchanged. Similarly, when we sample x_i the values of x'_{-i} all
remain unchanged, and it obviously follows that x_{-i} = x'_{-i}.
By applying Bayes' rule and the remark above we can easily show that Gibbs sampling
satisfies the detailed balance equation, guaranteeing the invariance of the target distribution
p(x). Since the Gibbs transition for component i is T(x, x') = p(x'_i | x_{-i}),

p(x)\, T(x, x') = p(x_i | x_{-i})\, p(x_{-i})\, p(x'_i | x_{-i}) = p(x_i | x'_{-i})\, p(x'_{-i})\, p(x'_i | x'_{-i}) = p(x')\, T(x', x)

as required. Although this guarantees invariance of the target distribution, it does not
guarantee ergodicity of the Markov chain. Neal [14] argues that a sufficient condition of guar-
anteeing the ergodicity of the Markov chain is that the conditional distributions are not zero
anywhere. Thus by updating one component each step, within a finite number of steps the
chain starting from any point can reach any other point in the space.
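As a concrete illustration (our own sketch, not the dissertation's code), a Gibbs sampler for a zero-mean bivariate Gaussian uses the well-known conditional distribution of each coordinate given the other:

import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples=5000, rng=None):
    # Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho,
    # using the conditionals x_i | x_{-i} ~ N(rho * x_{-i}, 1 - rho^2)
    rng = rng or np.random.default_rng()
    x = np.zeros(2)
    samples = np.empty((n_samples, 2))
    for m in range(n_samples):
        x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))   # sample x1 | x2
        x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))   # sample x2 | x1
        samples[m] = x
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8)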
A more sophisticated and dynamic class of MCMC is Hamiltonian (Hybrid) Monte Carlo,
which was first applied in physics by (Duane et al., 1987). It was later adapted to the
statistical framework by Neal (1993 [14] and 1995 [1]), from whom we will leverage the notation
and results for this section. HMC simulates the motion of an imaginary particle using Hamilton's
equations of motion, where the position of the particle represents the variables of the target
distribution. For the purpose of clarity, this section will explain how HMC works and how it can
be implemented. However, a more detailed treatment of the subject can be found in Neal (1993 and 1995).
interest (model parameters) are represented by the position variable, q. The potential energy is
defined as

U(q) = -\log(\pi(q)\, L(q|D)) + C

where D is the data, C is a constant, \pi(q) is the prior distribution and L(q|D) is the likelihood
function given the data. (Footnote: the likelihood function L(q|D) = p(Y|X, q).) In other words,

U(q) = -\log(\underbrace{p(q|D)}_{\text{posterior}}) + C = -\log(\pi(q)) - \underbrace{\ell(q|D)}_{\text{log-likelihood}} + C

where, for i.i.d. data, \ell(q|D) = \sum_{i=1}^{n} \log p(y_i | x_i, q), with y_i being the true label or output
and x_i the true input data for the ith observation.
The kinetic energy is taken to be the quadratic form

K(p) = \frac{1}{2}\, p^T M^{-1} p

where M is a symmetric and positive-definite matrix, which for simplicity is typically diagonal,
M = diag(m_1, \ldots, m_d), so that

K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}
We assumed a quadratic form for the kinetic energy because it corresponds to the negative log
density of a zero-mean multivariate Gaussian distribution with covariance matrix M, up to a
constant, which is very easy to sample from. The Hamiltonian is H(q, p) = U(q) + K(p), and the
canonical (joint) distribution over (q, p) is

p(q, p) = \frac{1}{Z_H} \exp\left(-\frac{H(q, p)}{T}\right) \quad (3.1)
where T is referred to as the temperature, which in the canonical case is equal to 1, and Z_H is
the normalising factor so that p(q, p) is a probability distribution.
Hamilton's equations of motion are

\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}

for i \in \{1, 2, \ldots, d\}, where d is the number of dimensions of the variables of interest.
Using the definitions of the Hamiltonian function and the kinetic energy,

\frac{dq}{dt} = M^{-1} p \;\Longleftrightarrow\; \frac{dq_i}{dt} = \frac{p_i}{m_i}, \quad i \in \{1, 2, \ldots, d\}

\frac{dp}{dt} = -\frac{\partial U}{\partial q} \;\Longleftrightarrow\; \frac{dp_i}{dt} = -\frac{\partial U}{\partial q_i}, \quad i \in \{1, 2, \ldots, d\}
The leapfrog scheme discretises these dynamics with step size \epsilon:

p_i(t + \tfrac{\epsilon}{2}) = p_i(t) - \tfrac{\epsilon}{2} \frac{\partial U}{\partial q_i}(q(t))    (half-step momentum update)

q_i(t + \epsilon) = q_i(t) + \epsilon\, \frac{p_i(t + \epsilon/2)}{m_i}    (full-step position update)

p_i(t + \epsilon) = p_i(t + \tfrac{\epsilon}{2}) - \tfrac{\epsilon}{2} \frac{\partial U}{\partial q_i}(q(t + \epsilon))    (half-step momentum update)

(Footnote: if the kinetic energy function is different from the quadratic form proposed earlier, then the full-step update to the position variable has to be adjusted appropriately to the derivative dq_i/dt.)
3.4.3 HMC and leapfrog algorithms
Figure 3.3  The leapfrog algorithm. Here q is the position variable vector or matrix (depending on
the parameters of the model), p is the momentum variable, ε is the step size, and L is the number of
leapfrog steps. It assumes that K(p) = \frac{1}{2} p^T M^{-1} p; for simplicity one can take M = I.

Figure 3.4  The HMC algorithm. Here q^{(i)} is the ith sample of the position variable vector or matrix
(depending on the parameters of the model), p^{(i)} is the ith sample of the momentum variable, ε is the
step size, and L is the number of leapfrog steps. leapfrog(·) is the leapfrog algorithm defined in
figure 3.3, and nsamp is the total number of samples. It assumes that K(p) = \frac{1}{2} p^T M^{-1} p; for
simplicity one can take M = I.
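Since the algorithm boxes of figures 3.3 and 3.4 are not reproduced here, the following is a compact sketch of both (our own code, with M = I and the quadratic kinetic energy above): leapfrog integrates the dynamics, and each HMC iteration resamples the momentum, integrates, and applies a Metropolis accept/reject step.

import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    # figure 3.3 sketch: L leapfrog steps of size eps, assuming M = I
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)               # half-step momentum
    for _ in range(L - 1):
        q += eps * p                          # full-step position
        p -= eps * grad_U(q)                  # full-step momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)                # final half-step momentum
    return q, p

def hmc(U, grad_U, q0, eps=0.05, L=20, nsamp=1000, rng=None):
    # figure 3.4 sketch: HMC with K(p) = 0.5 * p^T p
    rng = rng or np.random.default_rng()
    q = np.asarray(q0, dtype=float)
    samples = np.empty((nsamp, q.size))
    for i in range(nsamp):
        p = rng.normal(size=q.shape)                       # resample momentum
        q_new, p_new = leapfrog(q, p, grad_U, eps, L)
        dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
        if np.log(rng.uniform()) < -dH:                    # Metropolis accept/reject
            q = q_new
        samples[i] = q
    return samples

# example: sample a standard Gaussian, U(q) = 0.5 ||q||^2
samples = hmc(lambda q: 0.5 * q @ q, lambda q: q, q0=np.zeros(2))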
distribution) involves several parameters from which we want to sample using HMC.
However, this would be a trivial extension of the above algorithm, performing the steps for
each of the parameters whilst holding the other parameters fixed.
Remark: From the joint distribution p(q, p) over q and p in (3.1), we have that p(q, p)
depends on (q, p) only through the value of H(q, p).

Hamiltonian dynamics keep H constant over time:

\frac{dH}{dt} = \sum_{i=1}^{d} \left( \frac{\partial H}{\partial p_i} \frac{dp_i}{dt} + \frac{\partial H}{\partial q_i} \frac{dq_i}{dt} \right) \quad \text{(chain rule)}
            = \sum_{i=1}^{d} \left( -\frac{\partial H}{\partial p_i} \frac{\partial H}{\partial q_i} + \frac{\partial H}{\partial q_i} \frac{\partial H}{\partial p_i} \right) \quad \text{(Hamilton's equations)}
            = 0

They also preserve phase-space volume, since the divergence of the vector field is zero:

\nabla \cdot \left( \frac{dq}{dt}, \frac{dp}{dt} \right) = \frac{\partial}{\partial q} \frac{dq}{dt} + \frac{\partial}{\partial p} \frac{dp}{dt}
   = \frac{\partial}{\partial q} \frac{\partial H}{\partial p} - \frac{\partial}{\partial p} \frac{\partial H}{\partial q}
   = \frac{\partial^2 H}{\partial p\, \partial q} - \frac{\partial^2 H}{\partial q\, \partial p}
   = 0
From the above two properties of the Hamiltonian dynamics we can see that p(q, p) is
left invariant. To show that detailed balance holds we will use, but not prove, two facts from
(Neal, 1993 [14]): the leapfrog scheme preserves phase-space volume, and it is reversible.
Proof that detailed balance holds for the canonical distribution (T = 1)

Let S = (q, p) and S' = (q', p') denote some regions in the phase space. Suppose that after
a sequence of L leapfrog steps of size \epsilon we go from the region S in the phase space to the
region S'. We need to show that

p(S)\, T(S, S') = p(S')\, T(S', S).

Since the leapfrog scheme preserves phase-space volume, if the volume of the phase-space
region S is equal to V, the volume of the phase-space region S' must also equal V. The two
sides are

p(S)\, T(S, S') = \frac{1}{ZV} e^{-H(S)} \underbrace{\min\left(1,\, e^{-H(S') + H(S)}\right)}_{p(\text{accept state } S')}

p(S')\, T(S', S) = \frac{1}{ZV} e^{-H(S')} \underbrace{\min\left(1,\, e^{-H(S) + H(S')}\right)}_{p(\text{accept state } S)}

If H(S') \le H(S), then

p(S)\, T(S, S') = \frac{1}{ZV} e^{-H(S)}.

Since the leapfrog scheme is reversible we can perform the scheme backwards in time, so

p(S')\, T(S', S) = \frac{1}{ZV} e^{-H(S')}\, e^{-H(S) + H(S')} = \frac{1}{ZV} e^{-H(S)} = p(S)\, T(S, S')

as required. If instead H(S') > H(S), then

p(S')\, T(S', S) = \frac{1}{ZV} e^{-H(S')}

p(S)\, T(S, S') = \frac{1}{ZV} e^{-H(S)}\, e^{-H(S') + H(S)} = \frac{1}{ZV} e^{-H(S')} = p(S')\, T(S', S)

as required.

Similarly to MH and Gibbs sampling, ergodicity of the chain is once again problem dependent.
3.5 Variational Inference
As previously, our goal is to infer the true posterior distribution p(θ|D), where θ = {θ_1, θ_2, ...}
(the set of all parameters), which cannot be done analytically. Variational inference (Jordan
et al., 1999 [25]) involves the introduction of an approximating distribution parametrised by
λ. The approximating/variational distribution q_λ(θ) is constructed so that it can be easily
inferred. The idea is that the approximating distribution is as close to the true distribution as
possible, so we want to minimise the KL divergence (Kullback and Leibler, 1951 [26]) with
respect to λ. Let us define the KL divergence to be

KL[q_λ(θ) \,\|\, p(θ|D)] = \int q_λ(θ) \log\left(\frac{q_λ(θ)}{p(θ|D)}\right) dθ

Hence we have that

KL[q_λ(θ) \,\|\, p(θ|D)] = \log p(Y|X) - \Big( E_{q_λ(θ)}[\log p(Y|X, θ)] - KL[q_λ(θ) \,\|\, p(θ)] \Big),

so minimising the KL divergence with respect to λ is equivalent to maximising the bracketed
term (the evidence lower bound), which does not involve the intractable posterior.

For simplicity, only the variational mean field approximation will be considered (Peterson
and Anderson, 1987 [24]), where q_λ(θ) is fully factorised, i.e. assuming independence of all
the weight and bias scalars in every layer of the model,

q_λ(θ) = \prod_{i,j \in (I_θ, J_θ)} q(θ_{ij})

where (I_θ, J_θ) is the set of all nodes for the parameter θ. Factorising over everything makes
the computations very simple and quick, although the method struggles to reach accurate
results for more complicated posterior distributions. Now, as for prediction using this
methodology, we define the predictive distribution p(y|x, D) as

p(y|x, D) \approx \int p(y|x, θ)\, q_{λ^*}(θ)\, dθ

where q_{λ^*}(θ) is the argmin of the KL divergence objective. Due to the inability of this method
to perform well on deeper networks we will only use this method of approximate inference
in very few models. We will not explore any further mathematical details of this method, as
they can be found in (Peterson and Anderson, 1987 [24]; Hinton and van Camp, 1993 [23];
and Jordan et al., 1999 [25]) in a much more detailed and complete form.
3.6 Prediction
As previously discussed, the predictive distribution is p(y|x, D), which is the expectation un-
der the posterior distribution. However, an explanation regarding the estimation of those
quantities is yet to be covered, as we cannot typically analytically evaluate the required
integrals. We will now show the method of Monte Carlo integration used to make the ap-
proximations.
For the variational approximation,

p(y|x, D) = \int p(y|x, θ)\, p(θ|D)\, dθ \approx \int p(y|x, θ)\, q_λ(θ)\, dθ \approx \frac{1}{T} \sum_{s=1}^{T} p(y|x, θ_s)

where θ_s \overset{\text{i.i.d.}}{\sim} q_λ(θ), s = 1, 2, ..., T. For the MCMC approximations,

p(y|x, D) = \int p(y|x, θ)\, p(θ|D)\, dθ \approx \frac{1}{T} \sum_{s=1}^{T} p(y|x, θ_s)

where θ_s \overset{\text{i.i.d.}}{\sim} \hat{p}(θ|D), s = 1, 2, ..., T, and \hat{p}(\cdot) denotes the MCMC approximation to the
true posterior distribution.
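A minimal sketch of this Monte Carlo estimate of the predictive distribution (our own code): given T parameter samples, from either the variational distribution or an MCMC chain, average the per-sample likelihoods. The toy likelihood below is purely illustrative.

import numpy as np

def mc_predictive(x, theta_samples, likelihood):
    # approximate p(y | x, D) by (1/T) * sum_s p(y | x, theta_s);
    # likelihood(x, theta) returns the predictive probabilities (e.g. softmax outputs) for one theta
    preds = np.stack([likelihood(x, theta) for theta in theta_samples])
    return preds.mean(axis=0)

# toy usage: theta_samples could be rows of an HMC chain or draws from q_lambda
theta_samples = np.random.randn(100, 2)
p_hat = mc_predictive(np.array([1.0]), theta_samples,
                      lambda x, th: np.exp(th) / np.exp(th).sum())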
Chapter 4
Regularisation methods
Regularisation is a broad area of methodologies that are primarily aimed at reducing the over-
fitting incurred by the model during the training stage. In machine learning language, over-
fitted models are said to not generalise well. This essentially means that the difference
between the expected error and the empirical/training error is large and thus the model
would perform well on the training set and poorly on the test set. Such a model would be
inaccurate for out of sample predictions. Regularisation is thus a means of penalising over
adaptation to the training data.
For Frequentists, regularisation could be as simple as adding a penalty term on the size of
the weights to the objective function; L1 and L2 regularisation are explored in more detail
in section 4.1. A more sophisticated approach to regularisation in NNs is to use the method
known as Dropout, the variants of this method will be explored in section 4.3. The Bayesian
use of a prior distribution incorporates regularisation into the underlying modelling method; in
fact, some prior distributions have an equivalent regularisation representation
in the Frequentist scheme. Bayesian linear regression with Gaussian priors can be equiv-
alently expressed as L2 regularised linear regression. This equivalence will be explored in
more detail in section 4.2.
4.1 Frequentist Regularisation

Consider the ordinary least squares objective for linear regression,

\min_{w} \; (y - Xw)^T (y - Xw)

This objective function is likely to over-fit the data in the scenario where d is large.
L1 regularisation

L1 regularisation imposes an L1-norm penalty term \|w\|_1 on the weights, so that they are
inhibited from becoming too large, and hence the new objective function is

\min_{w} \; (y - Xw)^T (y - Xw) + \lambda \|w\|_1

where \lambda > 0 is the regularisation constant. Notice that in order to solve the above objective
we would be required to either make a smooth/continuous approximation to the absolute
value function, or to instead use one of the gradient approximation methods seen earlier,
such as Adam, SGD or NAG. A possible smooth approximation could be

|x| \approx \sqrt{x^2 + \epsilon} \quad \text{for } \epsilon > 0

where \epsilon is a very small positive constant. The smaller \epsilon is, the closer the approximation is
to the absolute value function.
L2 regularisation

L2 regularisation instead imposes a squared L2-norm penalty \|w\|_2^2 on the weights to constrain
their absolute size. The objective function in this case is

\min_{w} \; (y - Xw)^T (y - Xw) + \lambda \|w\|_2^2

for \lambda > 0. Since the objective is positive semi-definite there is a unique minimum (see footnote
below); we can easily solve the objective by differentiating with respect to the vector w and
setting the derivative equal to the zero vector 0, yielding the solution

w^* = (X^T X + \lambda I_d)^{-1} X^T y \quad (4.3)

Notice that the value of \lambda impacts the solution we obtain, and it is therefore a parameter
that needs to be tuned. The most common approach to tuning this parameter is through
cross-validation. This is a method for picking the \lambda with minimum cross-validation error, by
minimising the MSE over several different subsets of held-out portions of the data (validation
sets) not used for training.

(Footnote: the proof of this argument is not provided, as it is just a simple application of algebra and Mercer's theorem, and twice differentiating the objective function.)
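A small sketch of the closed-form solution (4.3) and a naive cross-validation loop over λ (our own illustration; fold construction and candidate λ values are arbitrary choices):

import numpy as np

def ridge_fit(X, y, lam):
    # closed-form L2-regularised solution: w = (X^T X + lam * I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, n_folds=5):
    # average held-out MSE of ridge regression over simple contiguous folds
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return np.mean(errs)

# pick the lambda with the smallest cross-validation error
X, y = np.random.randn(100, 10), np.random.randn(100)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(X, y, lam))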
4.2 Bayesian Regularisation
In the Bayesian framework the prior function helps to inhibit the weights from growing too
large and thus acts as a means of regularisation. In the simple context of linear regression
we can actually show the equivalence of Laplace priors with L1 regularisation, and of Normal
priors with L2 regularisation. Assuming that the data are i.i.d. Gaussian, the likelihood for the
data has the form

p(y|X, w) = \frac{1}{Z_1} \exp\left(-\frac{\beta}{2} (y - Xw)^T (y - Xw)\right)

Let us assume that all the weights are i.i.d. Laplace with mean 0 and scale 1; then the prior
on the weights p(w) has the form

p(w) = \frac{1}{Z_2} \exp\left(-\|w\|_1\right)

Then the maximum a-posteriori (MAP) estimate of the weights w_{MAP} is

w_{MAP} = \arg\max_{w} \; -\frac{\beta}{2} (y - Xw)^T (y - Xw) - \|w\|_1 + C
        = \arg\min_{w} \; \frac{\beta}{2} (y - Xw)^T (y - Xw) + \|w\|_1

which is L1-regularised least squares. If we assume instead that the weights are i.i.d.
w_i \sim N(0, 1), i \in \{1, 2, ..., d\}, then the prior distribution on the weights p(w) has the form

p(w) = \frac{1}{Z_3} \exp\left(-\frac{1}{2} w^T w\right)

thus the MAP estimate for the parameters is

w_{MAP} = \arg\min_{w} \; \frac{\beta}{2} (y - Xw)^T (y - Xw) + \frac{1}{2} \|w\|_2^2

which is the L2-regularised least squares objective.
4.3 Dropout
We will now take a step back and think about the reason for regularisation, which is to prevent
the extensive adaptation to the training data. Large NNs have an astronomical number of
parameters that need to be learnt, making the threat of their co-adaptation highly probable,
particularly in the absence of a sufficient amount of training data. As discussed, regulari-
sation allows the model to perform better in unobserved cases by capturing the underlying
patterns in the data, as opposed to the dataset specific noise (sampling noise).
Dropout (Hinton et al., 2012 [34]; Srivastava et al., 2014 [35]) works by corrupting the
learning process with random noise, prohibiting the model from simply memorising the train-
ing data. Dropout has been shown to achieve state-of-the-art results in many machine learn-
ing tasks and benchmark datasets, making it the go-to regularisation technique currently
practised in machine learning. MC Dropout may also be useful in providing good approximations
to the true posterior distribution, a result which follows from the analysis of (Myshkov and Julier,
2016 [41]) on synthetic data.
For a dataset D = (X, y), where X \in R^{n \times d} and y \in R^{n \times 1}, linear regression minimising the
squared loss finds w such that

w^* = \arg\min_{w} \; \|y - Xw\|_2^2

where w \in R^{d \times 1} is the weight vector. Suppose that we apply dropout to the input data X, s.t.
we keep each x_{ij} with probability p. Let B \in \{0, 1\}^{n \times d}, such that b_{ij} \overset{\text{i.i.d.}}{\sim} Ber(p). In the context
of the linear regression objective minimisation,

\hat{w} = \arg\min_{w} \; \|y - (B \odot X) w\|_2^2

Taking the expectation under the probability distribution assumed for B, we have
Hence,
0 , if j m
since i {1, 2, ..., n} , bij bim =
p(1 p) + p
, if j = m
2
= y pXw2 + p(1 p) w2
2 2
w argmin
w
Where = diag(diag(XT X))1/2 , and thus scaling the weight vector by the standard devi-
ation of the corresponding d dimensions of the input data. This objective looks quite similar
to the objective we found for L2 regularisation. If we let p tend towards 1 we can see that the
regularisation constant tends towards 0 and since we tend towards retaining all the inputs
the objective tends to the least squares objective. Equivalently for w = pw we have,
$$\tilde{w}^{*} = \underset{\tilde{w}}{\operatorname{argmin}}\; \|y - X\tilde{w}\|_2^2 + \frac{1-p}{p}\|\Gamma \tilde{w}\|_2^2.$$
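This identity can be checked numerically; the sketch below (purely illustrative, with an arbitrary small design matrix) compares a Monte Carlo estimate of the expected dropout objective with the closed form above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 4, 0.8
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = X @ w + 0.1 * rng.normal(size=n)

# Monte Carlo estimate of E_B || y - (B o X) w ||^2 with b_ij ~ Ber(p)
mc = np.mean([np.sum((y - ((rng.random((n, d)) < p) * X) @ w) ** 2)
              for _ in range(100_000)])

# Closed form: ||y - pXw||^2 + p(1-p) ||Gamma w||^2 with Gamma = diag(diag(X^T X))^(1/2)
gamma = np.sqrt(np.diag(X.T @ X))
closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((gamma * w) ** 2)
print(mc, closed)   # the two agree up to Monte Carlo error
```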
Another way of applying Dropout (Konda et al., 2015 [36]) is to drop the units as before, but rescale the active units by a factor of 1/p to compensate for the loss of magnitude. The objective in that case is
$$w^{*} = \underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 + \frac{1-p}{p}\|\Gamma w\|_2^2,$$
which can be directly related to, but is not equivalent to, L2 regularisation. Baldi and Sadowski (2013, [43]) argued that stochastic gradient descent performed on a model with dropout is similar to stochastic gradient descent on a regularised error function. The reason we say that it is not equivalent to L2 regularisation is that this objective has a richer penalty term, incorporating the variability of the data and the probability p of keeping a unit. N.B. we showed how Bayesian linear regression with Gaussian priors can be equivalent to L2 regularisation, and therefore can be similar to the Dropout objective, but not equivalent.
Note that MC Dropout takes the MC estimate of all the predictions generated by a number (N) of dropout-fitted models. This is essentially an ensemble of models fitted using dropout. Alternatively, one can use the point estimate method of dropout, which is to fit the weights of dropout models stochastically using gradient descent, and adjust the predictions by a factor 1/p. Although the two methods are not equivalent, in practice they provide quite similar results. MC dropout becomes more accurate as the number of samples increases, which can be computationally expensive.
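As a concrete illustration of the two prediction schemes, the following sketch uses a hypothetical one hidden layer Softplus network with randomly chosen weights (not one of the fitted models) to contrast the MC dropout estimate, which averages predictions over N random masks, with a single deterministic pass using the expected mask (a weight-scaling approximation closely related to the point estimate method described above).

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, mask):
    """One hidden layer Softplus network with a dropout mask applied to the inputs."""
    h = np.log1p(np.exp((x * mask) @ W1))
    return h @ W2

d, h_units, p, N = 10, 20, 0.8, 1000       # p = keep probability, N = dropout repetitions
W1 = 0.1 * rng.normal(size=(d, h_units))
W2 = 0.1 * rng.normal(size=h_units)
x = rng.normal(size=d)

# MC dropout: average the predictions over N random Bernoulli(p) masks
mc_pred = np.mean([forward(x, W1, W2, rng.random(d) < p) for _ in range(N)])

# Point-estimate style prediction: a single pass using the expected mask
point_pred = forward(x, W1, W2, np.full(d, p))
print(mc_pred, point_pred)                 # similar, but not identical, values
```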
Figure 4.1 Bayesian linear regression, where $w, x_i \in \mathbb{R}^d$, and we assume that the labels $y_i$ are i.i.d. s.t. $y_i \sim N(w^T x_i, \sigma^2)$, $i \in \{1, 2, ..., n\}$. The prior distribution on the weights p(w) could be a multivariate Gaussian N(0, I).
Baldi and Sadowski (2014, [44]) discussed the ensemble averaging properties of lin-
ear networks with dropout and non-linear logistic networks. With this interpretation in mind
we will show in what ways Dropout is dissimilar to the regularisation methods that we have
previously seen. In particular, two possible interpretations of Bayesian Dropout will be pro-
vided as well as showing how the graphical model for Dropout based linear regression in a
Bayesian framework differs from that of regular Bayesian linear regression.
Figure 4.2 (A) First Bayesian interpretation of Dropout, where $x_i \in \mathbb{R}^d$ and N is the number of Dropout repetitions used to form the Monte Carlo estimate for $y_i$. The categorical variable $z_c$ determines the parameters of the distribution of $y_{ic}$ (see below for details).
Figure 4.3 (B) Second Bayesian interpretation of Dropout, where $x_i \in \mathbb{R}^d$ and N is the number of Dropout repetitions. $\pi$ is the parameter of the prior distribution on $z_c$, $c \in \{1, 2, ..., N\}$, and p is the parameter of the Bernoulli distribution (the dropout probability of keeping a node).
In order to be fully Bayesian about this problem we can define a Dirichlet hyper-prior on the Categorical distribution; for simplicity we will just assume that $\pi = (\frac{1}{N}, \frac{1}{N}, ..., \frac{1}{N})$, meaning that all data transformations are equally likely to occur. For the explicit definitions of all the distributions used please visit page 8.
$$b_{cj} \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p), \qquad z_c \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(\pi),$$
$$y_i \overset{\text{i.i.d.}}{\sim} \sum_{c=1}^{N} z_c\, N\!\left(w_{z_c}^T (b^c \circ x_i),\; \sigma_{z_c}^2\right) = \sum_{c=1}^{N} z_c\, N\!\left(w_{z_c}^T \mathrm{diag}(b^c)\, x_i,\; \sigma_{z_c}^2\right)$$
Again in order to be fully Bayesian we might want to define a Dirichlet hyper-prior on the
Categorical distribution.
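To make the mixture reading concrete, below is a small generative sketch in the spirit of the interpretations above, under the simplifying choices already made (uniform π over the N dropout repetitions and keep probability p); the per-component weights and noise levels are arbitrary stand-ins rather than quantities estimated in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, p = 100, 5, 10, 0.8

X = rng.normal(size=(n, d))
# Arbitrary stand-ins: one weight vector and one noise level per mixture component
W = rng.normal(size=(N, d))
sigma = np.full(N, 0.1)
# One Bernoulli(p) dropout mask b^c per component (the data transformation)
B = (rng.random((N, d)) < p).astype(float)

# z ~ Categorical(pi) with pi = (1/N, ..., 1/N): pick a component for each observation
z = rng.integers(N, size=n)
# y_i | z_i ~ N(w_{z_i}^T (b^{z_i} o x_i), sigma_{z_i}^2)
y = np.array([rng.normal((B[z[i]] * X[i]) @ W[z[i]], sigma[z[i]]) for i in range(n)])
print(y[:5])
```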
From the two Bayesian interpretations provided we can see that Dropout is more similar to a mixture model, which is the probabilistic interpretation of ensembles. This mixture model imposes a non-linear transformation of the input data X for each mixture component, and different weights and standard deviations for each of the model mixtures. This should make sense, as gradient descent on the transformed data would learn new weights for each dropout repetition. The predictions from all the different models are then averaged (ensemble averaging) to give the MC estimate for the predictions. We can see clearly that this is very different to Bayesian linear regression, as the model averaging that takes place has a much richer hypothesis space. In response to a paper by Domingos, Minka (2002 [42]) explained the differences between model averaging and Bayesian model averaging, emphasising that the two are not comparable.
Gal (2017, [2]), on the other hand, argues that for specific variational approximations the optimisation for VI is identical to that of dropout on a NN. This might possibly be the case; however, the similarities between the two are rather vague, especially since we do not get any indication of what the priors of such a NN are. Recent Dropout advancements leverage variational inference (Kingma et al., 2015 [47]; Molchanov et al., 2017 [48]; Li and Gal, 2017 [49]).
I believe that comparing dropout on a NN fitted using gradient descent with the equivalent BNN is simply erroneous, as the two models are too different. Konda et al. (2015 [36]) highlighted that dropout enriches the input space and can be perceived as similar to data augmentation, which could explain the much richer Bayesian model representation required to imitate MC dropout. I believe that if we were instead to compare the posterior distribution of MC dropout to the posterior of the BNN interpretations shown, we would observe similar results, presuming that the method for estimating the posterior works well. Unfortunately, much of the advancement in the Machine learning community stemmed from business applications, which reduced the interest in theory-grounded work. Many algorithms such as dropout are known to work well in practice, yet the theory justifying their success is somewhat incomplete and offers great potential for further exploration in the future.
Chapter 5
Application to data
We will now show the results of applying the different types of models discussed in this dissertation. We aim to provide an unbiased comparison of the different models and methods of estimation by choosing to show the results for the best-tuned models.
Choosing the right parameters can often be very difficult and time consuming in practice, something that was experienced first hand during this dissertation. For the BNNs the hyper-parameters shown are the result of a thorough search over a vast variety of possible hyper-parameters, and similarly for the number of steps and step sizes used for the leapfrog integration method. Again, for the learning rates used for the feed-forward NNs, a thorough search over a number of different possible learning rates was conducted.
Unless otherwise stated, all the models in this chapter were run on an NVIDIA GeForce GTX 1070 GPU, and therefore any algorithm times are relative to that. In practice one should expect a 10-50 times increase in the computation time if these models are run on a CPU, but these differences in speed depend on the model itself, with CNNs benefiting the most from GPU computation.
5.1 Datasets
We are currently living in an era defined by extensive and rich data, with classification being
at the heart of many fields. A revolution in the way machines operate is therefore to be ex-
pected. For instance, facial recognition and object detection will be a crucial element to self
driving cars and robots. Therefore our choice to look at the classical MNIST digits dataset
was quite natural. When conducting experiments, we used the MNIST dataset that comes
with the Tensorflow python package. This has 55000 training points, consisting of 55000 (28x28) pixel images and 55000 training labels taking values from 0 to 9. With this dataset we are also given 10000 test points; again the images are (28x28) pixels and the labels take values from 0 to 9. We will call this dataset (MNIST large). For all the models with the exception of the CNNs we reduce the image size to (14x14) so that the models can be run more easily.
Most of our recorded experiments were conducted using the (MNIST small) dataset, which was constructed by taking the first 2000 training and first 2000 testing examples from the MNIST large dataset. The small MNIST dataset, and the short script used to create it, can be found in [56]. When fitting the models, the image size was reduced to (14x14) due to the relatively large model sizes, which would otherwise make storing the samples generated by the HMC process impractical. Even storing as little as 100 samples sometimes resulted in files that were over 120 megabytes in size!
Subsequently, a slight variation of the Mauna Loa CO2 concentrations dataset was used for regression (Gal, 2015 [50]); this dataset was used for several uncertainty models by Gal and Ghahramani (2016 [45]). We changed the dataset from the original by combining the test and training datasets and taking a random 2/3 split of the data for training (848 examples) and 1/3 split for testing (424 examples). The dataset and the script used to create the data can be found in [56]. Note that this dataset has 1-dimensional inputs and 1-dimensional outputs.
The code for a 1 hidden layer BNN was also adapted for the solar regression dataset (Gal, 2016 [45]); however, to avoid repetition we will not present those results in this dissertation. The interested reader can refer to the code found in [56].
Much of our attention will be directed at the learning curves (test and training errors during training).
Only the notable and interesting cases will be presented in this study. All the MNIST models in this section were trained stochastically, i.e. using mini-batch learning to compute the back propagation gradients for batches of 200 observations, and cycling through the whole dataset a number of times (epochs). All the CO2 regression models were trained stochastically with batches of size 212.
When we briefly introduced our interest in BNNs we emphasised their ability to perform well even in the scenario where we lack a sufficient amount of data, whereas ordinary NNs could over-fit. The situation of a small amount of data is not rare, and might become even more common as we move into the future, where we will seek machines that are able to identify and learn in real time without much data. We therefore focus entirely on the Small MNIST dataset for these models.
Figure 5.1 The table contains the best selected results from training 1 hidden layer NNs using back-propagation. The accuracy corresponds to the final accuracy on the test set. All models used a learning rate of 0.005 as this was found to be overall the best learning rate for the 1 hidden layer models. The time refers to the time for training the models only and excludes the model evaluation and other computations; it is also the average of the times taken for the SGD and Adam optimisers for the specific model setup. All the models use Softplus non-linearities for the hidden layer.
Training NNs using back propagation is remarkably fast! Even for the networks with 3000 hidden units trained for 5000 epochs, training took less than 3 minutes. This puts them at a significant advantage in comparison to the Bayesian approaches, which are typically considerably slower. However, please note that the presented findings were obtained after we fitted numerous models and selected the best learning rates. As seen in figure 5.1, the test accuracy of these models is also sufficiently high, although for less tuned models the accuracy can be significantly lower than this.
As we had expected, using Adam as the optimiser results in significantly more accurate
models than using SGD. We also see that SGD suffers more from over-fitting to a specific
local minimum trajectory, which in many cases resulted in a decreasing test accuracy with
more training (epochs). From the plots in figure 5.2 a much more obvious observation can be made: for 1000 epochs the models are clearly under-trained, as the test accuracy seems to have a positive gradient at the end of the 1000 epochs, whereas for 5000 epochs the models appear to have taken a trajectory worsening the test accuracy with time (over-fitting to a specific incorrect local minimum). Slightly after the 1000 epochs the test accuracy starts decreasing whilst the training accuracy continues to increase; this is a classical example of over-fitting and also one of the primary reasons that we are exploring Bayesian alternatives.
We observed that in terms of training accuracy Adam converges extremely quickly, although the test accuracy increases at a slower rate. The figures also emphasise that Adam seems to consistently find better local minimum approximations than SGD. SGD struggles to distinguish the correct direction of the gradient and can instead get trapped in an incorrect trajectory. This can sometimes lead to over-fitting to the training data whilst the test accuracy decreases (see figure 5.2).
Without trial and error we cannot know in advance what the correct amount of training could be for these models, which poses many difficulties in practice, particularly if we seek some sort of automation of these models. We can always use the test accuracy during training as a proxy for the right amount of training, automating a stop in the model if the test accuracy decreases, but this could only be easily applied to SGD, where we have a much smoother learning curve. Adam has much more unpredictable results: the hill-climbing properties of the optimiser mean that there are sections of the learning curves where we observe highly negative gradients in the accuracy, followed by a sharp spike, so automating a smart stop in these models can be intrinsically difficult.
Regression is much easier to interpret than classification, as we can actually plot the trained
model and see what is in fact happening when we change the model. Using back propaga-
Figure 5.2 The learning curves during training of models fitted using back propagation and the optimiser indicated in the title of the plot. All the models are composed of 1 hidden layer of 500 units. The number of epochs is indicated in the title of the plot, along with the learning rate (LR) that was used. All the models use a Softplus rectifier for the hidden layer.
tion on this dataset posed the same problems as before, primarily that the parameters used
required extensive tuning before they could actually work well.
Figure 5.3 The learning curves during training of models fitted using back propagation and the optimiser indicated in the title of the plot (bottom plots). All the models are composed of 1 hidden layer of 500 units. LR refers to the learning rate that was used for the optimiser. The plots on the top are the learned functions from the corresponding model and the scatter points are the test data.
From figure 5.3 it is apparent that the model using the Adam optimiser was not trained
for long enough, as the learning curve seems to be continuing to decrease. The learned function also seems to be missing the structure of the data around the extremes. SGD
seems to be trapped in a local minimum and thus performs poorly, fitting almost a straight
line function to the data. This observation seems to be confirmed by looking at figure 5.4,
where even for 5000 epochs the result is the same. For 5000 epochs we observe that the
learned function using Adam is considerably better and seems to capture the structure of
the data reasonably well.
Figure 5.4 The learning curves during training of models fitted using back propagation and the
learned functions (same as figure 5.3, but trained for longer).
Naturally, increasing the size of the models to 5000 hidden units was expected to re-
quire more training, which is something that is emphasised by figure 5.6. Clearly SGD has
not been trained for long enough. We expect that this also holds for the model with the
Adam optimiser, but this is not evident from the learning curves, which seem somewhat flat
with several spikes. Not being able to identify whether the algorithm has converged causes
difficulties in practice. This problem is a recurring theme for models fitted using back prop-
agation, whereas BNNs using MCMC such as HMC benefit from gathering more samples.
That means that we can ultimately continue running our models indefinitely without the risk
of over-fitting, although this is very costly in practice.
It is important to note that for the models with 5000 hidden units we were required to
change the learning rates used. The need to find an optimal learning rate again can occur
frequently in practice, which can make the use of such methods more impractical. Finally, to
ensure that the models were trained for long enough we also trained the models for 50000
epochs (see figure 5.7). Here we can see that the model with the Adam optimiser shows a
slight improvement. However, near the extremes the model deviates from the true structure
of the data.
In order to avoid exploding gradients when using SGD, a much smaller learning rate was
Hidden units  Epochs  Adam (MSE)  SGD (MSE)  LR (Adam)  LR (SGD)  Time (s)
500 1000 0.2529 0.4889 0.01 0.005 7
500 3000 0.2363 0.4816 0.01 0.005 19
500 5000 0.2282 0.481 0.01 0.005 32
1000 1000 0.2711 0.4302 0.01 0.005 7
1000 3000 0.2978 0.3436 0.01 0.005 19
1000 5000 0.2516 0.3448 0.01 0.005 29
3000 1000 0.4074 0.5945 0.01 0.001 7
3000 3000 0.2771 0.5732 0.01 0.001 22
3000 5000 0.2466 0.5432 0.01 0.001 35
5000 1000 0.4547 0.4945 0.005 0.001 8
5000 5000 0.3082 0.3652 0.005 0.001 37
5000 50000 0.2237 0.3022 0.005 0.001 365
Figure 5.5 The table contains the best selected results from training 1 hidden layer NNs using back-propagation. The MSE corresponds to the final mean squared error on the test set. The time refers to the time for training the models only and excludes the model evaluation and other computations; it is also the average of the times taken for the SGD and Adam optimisers for the specific model setup. LR (Adam) and LR (SGD) are the learning rates used for the Adam and SGD optimisers respectively. All the models use Softplus non-linearities for the hidden layer.
Figure 5.6 The learning curves during training of models fitted using back propagation and the
learned functions, both models have 1 hidden layer with 5000 units.
used (0.001); this meant that even for 50000 epochs the models were significantly under-trained. Slow convergence is an underlying problem of the SGD optimiser, and it is reflected in these figures.
Figure 5.7 The learning curves during training of models fitted using back propagation and the
learned functions, both models have 1 hidden layer with 5000 units and were trained for 50000
epochs.
Adding a second hidden layer increases the number of parameters that need to be trained, and the danger of over-fitting is now much more severe. All the MNIST and CO2 regression models in this section were trained stochastically using batches of size 200 and 212 respectively.
Returning to the classification setting, we can see from figure 5.8 that SGD suffers the most from following an incorrect trajectory; as we increase the amount of training time, the test accuracy falls. The models using the Adam optimiser perform surprisingly well and in most cases do not seem to suffer from the adverse effects of excessive training of the NNs. In appendix C.2.1 we can again see the same pattern of SGD finding a worse local minimum and Adam being much less smooth, which brings us to the question of which optimiser should be used, and, once we know the optimiser, what learning rate should be used. These are all interesting questions, inherent in the field of Machine learning, yet they have no right answer. Different optimisers might be better at specific tasks, making model tweaking profoundly difficult.
In these experiments, using SGD as the optimiser produced really poor results, as the optimiser achieved extremely slow convergence (see figure 5.9) in comparison to Adam. Once again SGD failed to capture the true underlying structure of the data, which is no surprise given that it was not fully trained.
Hidden units  Epochs  Adam (accuracy)  SGD (accuracy)  Time (seconds)
250 1000 0.833 0.8195 30
250 3000 0.852 0.813 90
250 10000 0.842 0.798 308
300 1000 0.8345 0.836 30
300 3000 0.85 0.8115 92
300 10000 0.843 0.8085 306
500 1000 0.843 0.8285 32
500 3000 0.8475 0.8165 97
500 10000 0.865 0.81 318
Figure 5.8 The table contains the best selected results from training 2 hidden layer NNs using back-propagation. The accuracy corresponds to the final accuracy on the test set. Like previously, all models used a learning rate of 0.005. The time refers to the time for training the models only and excludes the model evaluation and other computations; it is also the average of the times taken for the SGD and Adam optimisers for the specific model setup. All the models use Softplus non-linearities for the hidden layers.
Figure 5.9 The learning curves during training of models fitted using back propagation and their learned functions; both models consist of 2 hidden layers, each with 200 units.
We begin to see that as the models become increasingly deep, the over-fitting problem becomes more severe. Furthermore, the choice of learning rate becomes crucial to the resulting model's ability to classify correctly. As is clear from figures 5.13 and 5.11, using a learning rate of 0.01 for Adam converges very quickly to the optimum (in less than 1000
Hidden units  Epochs  Adam (MSE)  SGD (MSE)  Time (s)
200 1000 0.2058 0.484 7
200 5000 0.202 0.4608 30
200 10000 0.1989 0.2423 61
300 1000 0.2032 0.4654 6
300 5000 0.2006 0.434 32
300 10000 0.1975 0.2448 59
500 1000 0.2529 0.4889 7
500 5000 0.2282 0.481 32
500 10000 0.2262 0.4036 65
Figure 5.10 The table contains the best selected results from training 2 hidden layer NNs using back-propagation. The MSE corresponds to the mean squared error on the test set. For the Adam optimiser we used a learning rate of 0.01 and for SGD a learning rate of 0.005. The time refers to the time for training the models only and excludes the model evaluation and other computations; it is also the average of the times taken for the SGD and Adam optimisers for the specific model setup. All the models use Softplus non-linearities for the hidden layers.
epochs) and then begins to radically over-fit. Lowering the learning rate to 0.005 still results in the networks over-fitting if trained for too long, but the decrease in the test accuracy is less extreme (see figures 5.14 and 5.12).
Figure 5.11 The table contains the results from training CNNs with 2 x (convolutional & maxpool)
layers followed by a fully connected layer and a hidden layer with 250 units (see figure 1.3 in 1.2).
The accuracy corresponds to the accuracy on the test set. A learning rate of 0.01 was used both
for Adam and SGD. The time was computed in the same way as previously. All the models use
Softplus non-linearities for the hidden layer.
Figure 5.12 Identical to fig. 5.11, apart from a learning rate of 0.005 used both for Adam and SGD.
Figure 5.13 The learning curves during training of models fitted using back propagation and their learned functions; both CNNs consisted of 2 (convolutional & maxpool) layers, a fully connected layer and a hidden layer with 250 units (see figure 1.3 in 1.2). The models were trained for 3000 epochs and the hidden layer used Softplus non-linearities.
Figure 5.14 The learning curves during training of models fitted using back propagation and their
learned functions, using identical models to figure 5.13, but trained for 10000 epochs and using a
learning rate of 0.005 for both the Adam and SGD optimisers.
When storing the decorrelated samples generated using MCMC we store 1 sample every $n_{\text{eff}}$ samples, defined as
$$n_{\text{eff}} = \frac{N}{ESS},$$
where N is the total number of samples generated from the MCMC procedure, and ESS is the effective sample size (Ripley, 1987 [60]), defined as
$$ESS = \frac{N}{1 + 2\sum_{k=1}^{M} \rho_k},$$
where $\rho_k$ is the auto-correlation of the particular sample at lag $k$, and $M$ is the largest integer s.t. $\rho_k > 0$ for all $k \in \{1, 2, ..., M\}$. Because we sample more than one variable, we take a sub-sample of $J$ variables $(w_1, w_2, ..., w_J)$ and compute the average sum of all their positive auto-correlations,
$$n_{\text{eff}} = 1 + \frac{2}{J}\sum_{j=1}^{J}\sum_{k=1}^{M} \rho_k(w_j).$$
This is defined as such for practicality purposes, as the MCMC process is quite inefficient due to poor mixing. It might have been more sensible to define $n_{\text{eff}}$ as
$$n_{\text{eff}} = 1 + 2 \max_{j \in \{1,...,J\}} \sum_{k=1}^{M} \rho_k(w_j),$$
but this would mean that we would discard more samples that were very costly to generate in the first place. As a consequence of our choice of $n_{\text{eff}}$ we will obtain slightly more correlated samples.
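A minimal sketch of how this n_eff could be computed from a matrix of stored samples is given below; the hypothetical array `samples` has one column per traced weight, and the autocorrelation estimator and the AR(1) test chain are illustrative only.

```python
import numpy as np

def positive_autocorr_sum(x):
    """Sum of the autocorrelations rho_k of a 1-d chain up to the largest lag M
    such that rho_k > 0 for all k <= M (the quantity used in the text)."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]                      # normalise so that rho_0 = 1
    total = 0.0
    for rho in acf[1:]:
        if rho <= 0:                        # stop at the first non-positive lag
            break
        total += rho
    return total

def n_eff(samples):
    """samples: array of shape (N_samples, J) holding the traced weights.
    Returns 1 + (2/J) * sum_j sum_{k=1}^{M} rho_k(w_j)."""
    J = samples.shape[1]
    return 1.0 + (2.0 / J) * sum(positive_autocorr_sum(samples[:, j]) for j in range(J))

# Illustration: an AR(1) chain stands in for correlated MCMC samples of J = 3 weights
rng = np.random.default_rng(0)
chain = np.zeros((5000, 3))
for t in range(1, 5000):
    chain[t] = 0.9 * chain[t - 1] + rng.normal(size=3)
print(n_eff(chain))   # roughly 1 + 2 * 0.9 / (1 - 0.9), i.e. about 19
```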
Interestingly, using Stochastic Gradient HMC (Chen et al., 2014 [57]) worked sufficiently well for the Student T distribution priors with 1.0 degrees of freedom. We believe that our models might have been too large, and the T distribution priors resulted in a posterior distribution that was difficult to sample from (due to the heavy tails), thus giving a zero acceptance rate. Using batches of the data to compute the gradients enabled us to sample more efficiently from the posterior distribution. Alternatively, it could have been more sensible to only use the T prior distributions on less wide networks, or to use them only on the final layer with Gaussian distributions for the other layers.
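For reference, the sketch below shows the kind of update SGHMC performs, written in the style of Chen et al. (2014) with the gradient-noise estimate set to 0; the toy target and the hyper-parameter values are illustrative and unrelated to the models in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sghmc(grad_U_minibatch, theta0, n_iter, eta=1e-4, alpha=0.01):
    """A minimal SGHMC-style sampler: noisy minibatch gradients of U = -log posterior,
    a friction term alpha, and injected noise of variance 2*alpha*eta (beta_hat = 0)."""
    theta, v = theta0.copy(), np.zeros_like(theta0)
    draws = np.empty((n_iter,) + theta0.shape)
    for t in range(n_iter):
        grad = grad_U_minibatch(theta)    # stochastic estimate of the gradient of U
        v = v - eta * grad - alpha * v + rng.normal(scale=np.sqrt(2 * alpha * eta), size=v.shape)
        theta = theta + v
        draws[t] = theta
    return draws

# Toy check: noisy gradients of U(theta) = theta^2 / 2 should still recover N(0, 1)
grad_U = lambda th: th + rng.normal(scale=0.5, size=th.shape)
draws = sghmc(grad_U, np.zeros(1), n_iter=100_000)
print(draws[20_000:].mean(), draws[20_000:].std())   # approximately 0 and 1
```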
The tuning of the hyper-parameters (step size and number of leapfrog iterations) of HMC was difficult and costly. However, the hyper-parameters that did not work were identified quickly, as the acceptance rates were very close to 0. A more efficient way to implement HMC would have been to use the No-U-Turn Sampler (Hoffman and Gelman, 2014 [61]), which removes the need to hand-tune the number of leapfrog steps and the step size for HMC, but the implementations in pymc3 (Salvatier et al., 2016 [62]) were considerably slower than the alternative Edward (Tran et al., 2017 [63]), which uses Tensorflow (Abadi et al., 2015 [64], 2016 [65]).
We will now take a look at a selection of models. However, due to resource limitations we could not explore every case; in particular we could not explore models as large as in the Frequentist section. The computation times reported refer only to the collection of samples (training) and do not explicitly reflect the actual computation time of the script itself. The script has additional features and involves regular model evaluation (every 50 samples) during the training, the collection of statistics, and storing the samples and statistics during training to disk to avoid losing work from interrupted models (this step is extremely inefficient). These additional features are predominantly run on the CPU rather than the GPU, which is substantially slower. The actual running time of the script itself is about 2-4 times more than the time required to collect the samples (depending on the amount of CPU usage from other processes, and the size of the model). The larger the model is, the larger the sample files stored to disk are, with some files being as large as 6GB!
Figure 5.15 BNNs with 1 hidden layer on the small MNIST dataset. For the Student T prior models the samples were generated using SGHMC; for all other models the samples were generated using HMC. s in the Student T distribution refers to the scale variable; the mean is taken to be 0. d_j refers to the number of inputs from the previous layer (same as the rescaling used in chapter 2). ESS refers to the effective sample size, and the number of samples refers to all the samples from the MCMC scheme, including the burn-in period. The point estimate accuracy refers to the accuracy from the model fitted using the average parameters from all samples.
Figure 5.16 The trace of the probabilities of each of the 10 digits in the first instance of the true labels for these digits in the dataset, using the BNN with 1 hidden layer (2000 units) with N(0, 20²/d_j) priors over all the parameters.
In terms of accuracy of prediction, these models are quite similar to the feed-forward NNs, but they are much less influenced by the initialisation of the algorithm, an effect that was commonly observed for the models minimising a loss function. However, the computation time for the BNNs using HMC can be up to 100 times longer. For these small models, where our predictive distribution is highly confident (low variance), a Bayesian approach might not have been necessary. However, in more complicated models knowing the variance of the predictor can be vital in preventing dangerous decisions in practice.
From figure 5.15 we can see that the Laplace distribution priors seem to perform better than the other prior distributions used. The difference between the Normal and Laplace priors is however insignificant, which could indicate that both models are under-fitting. In general the predictive distribution when using a Laplace prior was observed to have slightly less variance. A key observation from figure 5.16 is that the model is perhaps unidentifiable, as the output probabilities do not all seem to have converged over the samples taken, with digits 2, 4 and 7 suffering the most. This imposes practical difficulties (Silva and Gramacy, 2010 [51]). Our MCMC scheme might be required to produce many more samples in order to obtain some usable uncorrelated samples; this can be interpreted as slow mixing, which in turn results in a greatly inefficient sampling procedure.
Figure 5.17 Diagnostic and predictive plots for the 1 hidden layer NN with 500 hidden units and N(0, 15²/d_j) priors over all the parameters, using HMC to sample from the posterior. (Top left) A plot of the accuracy during sampling on mini-batches of samples. (Top right) Marginal distribution plots for a sample of weights. (Bottom left) Confusion matrix of the MC estimated model. (Bottom right) Confusion matrix standard deviations of the MC estimate.
The highly correlated samples also arise due to the fact that in the models with 1 hidden layer the step size for the leapfrog method was roughly 0.001 and a trajectory of length 70 was used, which meant that the HMC dynamics proposed new samples quite close to the previous sample. However, such a low step size was necessary to obtain an acceptance rate that was non-zero.
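The trade-off can be reproduced on a toy problem; in the sketch below (not the dissertation code; an ill-scaled Gaussian target stands in for a BNN posterior) a step size that is too large for the most constrained direction drives the acceptance rate to zero, while a safe step size produces proposals that move only a tiny distance per trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(theta, logp, grad_logp, eps, L):
    """One HMC transition: L leapfrog steps of size eps, then a Metropolis
    accept/reject based on the change in total energy."""
    r = rng.normal(size=theta.shape)                  # resample the momentum
    theta_new, r_new = theta.copy(), r.copy()
    r_new = r_new + 0.5 * eps * grad_logp(theta_new)  # initial half step for momentum
    for _ in range(L):
        theta_new = theta_new + eps * r_new           # full step for the position
        r_new = r_new + eps * grad_logp(theta_new)    # full step for the momentum
    r_new = r_new - 0.5 * eps * grad_logp(theta_new)  # correct the final half step
    log_accept = (logp(theta_new) - 0.5 * r_new @ r_new) - (logp(theta) - 0.5 * r @ r)
    return (theta_new, True) if np.log(rng.random()) < log_accept else (theta, False)

# Toy target: an ill-scaled Gaussian standing in for a BNN posterior
scales = np.array([1.0, 0.01])
logp = lambda th: -0.5 * np.sum((th / scales) ** 2)
grad_logp = lambda th: -th / scales ** 2

for eps, L in [(0.05, 30), (0.001, 30)]:
    theta, accepts = np.zeros(2), 0
    for _ in range(200):
        theta, accepted = hmc_step(theta, logp, grad_logp, eps, L)
        accepts += accepted
    print(f"eps={eps}: acceptance rate {accepts / 200:.2f}")
# A step size too large for the smallest scale gives a near-zero acceptance rate,
# while a safe step size moves only about eps * L per trajectory in the wide direction.
```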
From the marginal distribution plots in figure 5.17 we can see that most of the weights are centred around zero, with some slightly to the left or right. Bearing in mind that in the first layer we have 98000 weights, from which we have drawn only a very small sample (12 weights), this type of structure was to be expected in such a sparse dataset. From the confusion matrix we can see that our model predicts digits 0 and 1 correctly the most, and the most errors seem to be incurred for digits 5 and 8. The most variable prediction probability is observed for the digit 5. Digits 5 and 3 are most commonly confused with each other. p(ŷ = 1 | y = 0) and p(ŷ = 0 | y = 1) are both essentially zero, and the extremely low variance of these estimated probabilities indicates that our model is highly confident in these estimates. This is a positive outcome demonstrating that our models are producing sensible results. The fact that 3 and 5 are confused is not surprising, especially given that we pre-process the images, shrinking them from (28x28) to (14x14), making it highly probable that some of the real 3s are being rendered to look like 5s and vice versa.
Figure 5.18 BNNs with 1 hidden layer on the small MNIST dataset. The samples were drawn
using a Variational Mean Field approximation. s and dj are defined as before. The MC accuracy
refers to accuracy using MC integration for 500 samples from the posterior distribution.
The reader should not be alarmed that the point estimate accuracies for most of the models are much lower than the MC prediction accuracy. In fact this emphasises that our sampled marginal distributions for the weights are quite rich and do not contain significantly correlated samples. In figure 5.19 we illustrate this difference with a simple example where we have 2 Gaussians, both centred at zero but with different variances. The point estimate would be the mean, which cannot distinguish between the two distributions. In the NN sense, for the point estimate we would take the average weights and feed them forward through the network, which as expected performs worse than the MC prediction.
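The same point can be written down directly; the sketch below uses a hypothetical one hidden layer Softplus network and random stand-ins for the posterior weight samples to contrast MC integration of the predictive distribution with a single pass using the averaged weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def predict(x, W1, W2):
    """Class probabilities from a one hidden layer Softplus network (biases omitted)."""
    h = np.log1p(np.exp(x @ W1))
    return softmax(h @ W2)

d, h_units, classes, S = 196, 50, 10, 500
x = rng.random(d)                                   # a flattened (14x14) input
# Random stand-ins for S posterior samples of the two weight matrices
W1_samples = 0.05 * rng.normal(size=(S, d, h_units))
W2_samples = 0.05 * rng.normal(size=(S, h_units, classes))

# MC prediction: average the predictive probabilities over the posterior samples
mc_probs = np.mean([predict(x, W1_samples[s], W2_samples[s]) for s in range(S)], axis=0)

# Point estimate: a single pass with the averaged weights
point_probs = predict(x, W1_samples.mean(axis=0), W2_samples.mean(axis=0))
print(mc_probs.argmax(), point_probs.argmax())      # the two can disagree
```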
Interestingly, using the SGHMC algorithm to generate samples leads to the second layer weights (weights from the hidden layer to the output) converging quickly (see figure 5.20). Similarly, most of the first layer weights also seem to converge, with the exception of one of the weights. This might suggest that the MCMC procedure has not been run for long enough. The marginal distributions of the weights vary significantly from each other, and seem to be quite skewed. Note that some of the weights are centred around very large values (-60 or -20), which might suggest why using the T distribution with ν = 1 did not work with standard HMC: the models are very likely to suffer from exploding gradients, or vanishing gradients due to the accumulation of either highly positive or highly negative inputs to the non-linear or softmax layers, and thus propose new samples that are always rejected. Like before, the predictive plots seem quite sensible, although the variance of the
Figure 5.19 Probability density functions of two different Gaussian distributions, N(0, 1) and N(0, 2), with the point estimate of their mean marked.
Figure 5.20 Diagnostic and predictive plots for the 1 hidden layer NN with 500 hidden units and T(ν = 1, s = 10²/d_j) priors over all the parameters, using SGHMC to sample from the posterior. (Top left) The trace plots of samples of the weights from the two layers. The top right and bottom two plots are the same as the ones seen before.
predictions is much higher than that for the models with the Normal or Laplace distribution priors.
From the trace plot of the model with 800 hidden units and Laplace priors over the parameters (figure 5.21), we observe that whilst the weights in the second layer are sufficiently randomly scattered, suggesting that they may have converged, the weights in
Figure 5.21 Diagnostic and predictive plots for the 1 hidden layer NN with 800 hidden units and Laplace(0, 15²/d_j) priors over all the parameters, using HMC to sample from the posterior. (Top left) The trace plots of samples of the weights from the two layers. The top right and bottom two plots are the same as the ones seen before.
the first layer (W0) seem like they still have not completely converged, and perhaps taking more samples might solve this problem. Moreover, this might indicate that the Laplace priors generate less correlated samples than the Normal priors seen in figure 5.22. We also note that the second layer of weights has marginal distributions that are quite similar in structure, whereas in the first layer the weights appear to have a different scale parameter to each other, but are again centred at zero. This could mean either that the prior distribution has a strong impact on the posterior, or that the weights are in fact centred around zero. I believe that it is a combination of these two factors which gives rise to this structure, as we are only taking a very small sample of weights, most of which we expected to be centred at zero since the dataset is very sparse.
Throughout the MNIST 1 hidden layer models we can see that the variance of the pre-
dictive distribution of models with Laplace or Student T distribution priors is generally lower
than that for the models with Normal distribution priors. This could suggest that our models
using the sparsity promoting priors are more confident about their predictions.
The most surprising result in this section was that the models using the variational mean
field approximations performed remarkably well. They also converge significantly faster than
Figure 5.22 Diagnostic and predictive plots for the 1 hidden layer NN with 800 hidden units and N(0, 13²/d_j) priors over all the parameters, using HMC to sample from the posterior. (Top left) The trace plots of samples of the weights from the two layers. The top right and bottom two plots are the same as the ones seen before.
the MCMC schemes, observing running times up to 70 times faster than for the HMC sam-
pling scheme! However, this result was only restricted to the 1 hidden layer models as the
variational mean field approximations did not work at all for the 2 hidden layer and CNN
models. The factored approximation might be sensible for the smaller models, but for the
more complex models it fails (see figure 5.24).
From figure 5.23 we can see that the predictive distribution using the variational approx-
imation is quite similar in structure to the predictive distributions seen when using HMC,
although there appears to be a slightly higher variance. Unlike before, we see that most of
the second layer of weights exhibit a much more spiky structure and have lower variance
than the first layer of weights both for the Normal and the Laplace priors. There are however
some exceptions to this, where some of the weights have considerably higher variance than
the other weights in the second layer.
To check whether the results obtained on the entire MNIST dataset (55000 training points, 10000 test points) were consistent with the small dataset, we also ran HMC on the large MNIST dataset. This is the only part of the dissertation where we apply the Bayesian methodology to the entire MNIST dataset, due to the long time required to run these models.
Figure 5.23 On the left are the distribution and predictive plots for the 1 hidden layer NN with 3000 hidden units and Laplace(0, 15²/d_j) prior distributions over the weights, using the variational mean field approximation. On the right are the distribution and predictive plots for the 1 hidden layer NN with 3000 hidden units and N(0, 20²/d_j) prior distributions over all the parameters.
However, due to memory limitations we were required to feed the training data in batches of 10000, as the images could not otherwise be handled in GPU memory. This is likely to have slightly worsened the performance of HMC, as it requires gradient computations over the entire dataset; using batches means that we have acquired noisy estimates of the required gradients. Please note that batch HMC here is not the SGHMC described in (Chen et al., 2014 [57]), which uses second-order Langevin dynamics (Welling and Teh, 2011 [46]).
We observe that for the large MNIST dataset the HMC sampling scheme with both the Laplace and Normal distribution priors produces predictive distributions that are very similar to each other, even though the marginal distributions of the samples are quite different (see
Figure 5.24 Diagnostic plots of a 2 hidden layer BNN with 100 hidden units in each layer, using the VMF approximation on the small MNIST dataset. All the parameters had N(0, 7²/d_j) priors. (Top left) The learning curve on the test set during training. The other plots have been previously defined. The final model accuracy ± 1 standard deviation using the MC estimate on 500 samples was 0.535 ± 0.036.
Figure 5.25 Summary table for the 1 hidden layer BNN using (batch) HMC on the large MNIST
dataset.
figures 5.26 and 5.27). For the second layer of weights, the model with Normal priors seems to be mostly centred around 0, with the exception of some weights that have means slightly different from 0, whereas for the Laplace priors all the sampled weights are centred at zero. Again the first layer of weights does not seem to converge.
The predictive distributions for both models seem very sensible, as the predictive distribution struggles to distinguish the digit 4 from 9, and the digit 3 from 5, which is logical given the way the images are being shrunk. The fact that the point estimate accuracy for the model with the Normal priors (0.725) is higher than that for the model with the Laplace priors (0.415) is explained by the Laplace distribution having more probability mass in its tails than the Normal distribution (fatter tails). This implies that the point estimate would lie further from the truth in the case of the Laplace priors.
Figure 5.26 Diagnostic and predictive plots for the 1 hidden layer NN with 800 hidden units and Laplace(0, 13²/d_j) priors over all the parameters, applied to the large MNIST dataset.
Looking at the running times of the two models, we see that HMC scales very well with the size of the dataset, and in fact the larger dataset benefits more from running on the GPU. The ability to scale well to larger datasets is crucial for the practicality of the algorithm. The running times for HMC are, however, not comparable to those of feed-forward NNs. VI shows much promise on the less complex models, achieving great performance at running times that are comparable to feed-forward NNs.
Using the same regression dataset as previously, BNNs with 1 hidden layer will now be explored. The samples are all collected using HMC only, as VI and SGHMC did not perform as well. In [56] we also provide an example script applying a more state-of-the-art variational inference algorithm known as Automatic Differentiation Variational Inference (ADVI) (Kucukelbir et al., 2015 [66]); however, we do not evaluate this against other approximate inference algorithms (in the paper ADVI is compared to MCMC algorithms).
Referring back to the Gaussian likelihood (2.1), all the models used for the regression dataset assume a Gaussian likelihood with mean equal to the output of the NN and variance σ² = 0.07²; for simplicity this quantity was fixed at a value that seemed to perform best. A
Figure 5.27 Diagnostic and predictive plots for the 1 hidden layer NN with 800 hidden units and N(0, 20²/d_j) prior distributions over all the parameters, applied to the large MNIST dataset.
Figure 5.28 Summary table for BNNs with 1 hidden layer on the CO2 regression dataset, using
HMC to generate the samples from the posterior.
more sophisticated model would essentially try to be Bayesian about this term and allow the sampling scheme to find the right value for the output noise.
Choo (2000, [67]) explored sampling the hyper-parameters using Gibbs sampling, and also using HMC; his results, however, were unsuccessful. His explanation for this phenomenon was the inherent inability of HMC to move between multiple modal regions, which made movement from the region with high hyper-parameter values to the region with low hyper-parameter values exceedingly difficult. As a conse-
Figure 5.29 Diagnostic and predictive plots for BNNs with 1 hidden layer of 1000 units. (Left) Laplace(0, 20²/d_j) prior, (Right) N(0, 15²/d_j) prior. The bottom plots show the MC approximation of the predictive distribution (black), the point estimate (red), the test point scatter, and the orange region is the 99% credibility interval.
quence of these difficulties we will not explore fully Bayesian approaches in this dissertation. This dataset posed difficulties with finding a step size that was high enough to move at all in the space where HMC searched, but low enough for the algorithm to accept some of the proposals. This meant that we were forced to use a step size of 5 × 10⁻⁵, and to compensate for the extremely small step size we used 60 leapfrog steps. This meant that the proposals were very highly correlated. Looking at figures 5.28 and 5.29 we can see how few samples were actually used (typically 1 sample per 1000), meaning that this sampling scheme was very inefficient.
For the models with 1000 hidden units and a Laplace prior distribution, we can see that the weights of the first layer possibly burned in by iteration 60000, whereas the second layer weights W1 do not show any indication of convergence. It is possible that the chain was not run for long enough, but in order to draw this conclusion we would typically require a trace of the output probabilities instead of the weights.
The marginal distributions for the weights seem to show indications of bi-modality in some cases, which makes it exceedingly difficult for HMC to find a good solution. Nonetheless, the models should not be deemed a failure, as their performance is still relatively good, capturing the general trends in the data. The tight credibility regions are believed to be an artefact of the correlation of the samples and the low output variance. The Laplace priors seem to be better at detecting the signals from the data, and we see that the predictive distribution has a wavelike curvature in the centre, which is something that we could typically expect to see for this particular dataset. Once again, we observe that the point estimate for the model with Laplace priors deviates the most from the predictive distribution; this deviance is most noticeable at the extremities of the data.
This signal detection pattern is repeated for the networks with 2000 hidden units (figure 5.30), where the Laplace priors offer a more wavelike trend for the predictive distribution. The deviance of the point estimates is a lot more severe in this case, especially for the model with the Normal distribution priors. From the marginal distribution plots we can again observe the possibility of bi-modality, which ultimately limits the performance of the HMC algorithm.
In contrast to the model with the Normal priors, the marginal distributions for the Laplace model tend to be mostly clamped together in one region, with some marginal distributions lying significantly further from the group (see figures 5.29 and 5.30). This is believed to emphasise the sparsity effect that the Laplace prior is having. The marginal distributions for the model with the Normal priors tend to be more spread out over the space. This difference in the structure of the marginals is believed to give rise to the wavelike (signal detection) predictive distribution, whereas the Normal priors tend to smooth the predictive distribution.
We will explore models of similar scope to the ones fitted using back propagation. How-
ever, due to the algorithm time complexity we are limited to a smaller number of models. We
will not see many models using variational approximations and/or with Student T distribution
priors. These were initially tested, and were found to fail in practice. The reader is encour-
Figure 5.30 Diagnostic and predictive plots for BNNs with 1 hidden layer of 2000 units. (Right) N(0, 18²/d_j) prior, (Left) Laplace(0, 20²/d_j) prior. The bottom plots show the MC approximation of the predictive distribution (black), the point estimate (red), the test point scatter, and the orange region is the 99% credibility interval.
aged to explore such models, using the code provided in [56], as it is possible that the right
hyper-parameters for the HMC algorithm were overlooked.
All models explored in this section will use the small MNIST dataset that was used for the
previous models. The graphical output presented will refer only to the key findings. The two
hidden layer MNIST models are in general more accurate than the single hidden layer BNNs
at the cost of a higher variance.
In the models explored, the marginal distributions for the weights of the models with
Hidden units | Prior distribution | No. of samples | ESS | MC accuracy ± 1σ_mc | Point est. accuracy | Time (mins)
250 | Laplace(0, 10²/d_j) | 45000 | 156 | 0.856 ± 0.010 | 0.782 | 62.4
250 | N(0, 15²/d_j) | 45000 | 154 | 0.868 ± 0.022 | 0.622 | 64.5
300 | Laplace(0, 10²/d_j) | 36000 | 173 | 0.871 ± 0.010 | 0.791 | 55.1
300 | N(0, 15²/d_j) | 36000 | 153 | 0.863 ± 0.027 | 0.717 | 55
500 | Laplace(0, 15²/d_j) | 36000 | 196 | 0.875 ± 0.015 | 0.549 | 72.4
500 | N(0, 20²/d_j) | 36000 | 195 | 0.865 ± 0.014 | 0.537 | 72.5
Figure 5.31 Summary table for BNNs with 2 hidden layers on the small MNIST dataset, using
HMC to generate the samples.
Figure 5.32 The marginal distribution plots for the 2 hidden layer BNN with 300 units in each layer for the small MNIST dataset. (Left) The model with Laplace(0, 10²/d_j) priors over the parameters. (Right) The model with N(0, 15²/d_j) priors over the parameters.
Laplace priors tend to be more clumped together. The active weights are consequently further from the rest of the weights. With the exception of the 2 hidden layer (300 hidden unit) model (figure 5.32), the marginals for most of the less active weights are clumped together quite closely around 0. One could expect this from a Laplace prior, which has fatter tails, and thus an observation far from zero is more probable than if we had a Normal prior. This ensures that when the Laplace posterior mean is adjusted it is because of a genuine (strong) signal from the data; the Laplace prior might mean that certain weaker signals do not result in an adjustment of the posterior. On the other hand, the models with Normal priors are more likely to respond to a weaker signal from the data and thus tend to be less clumped.
The predictive plots for both the 500 hidden unit models (figures 5.33 and 5.34) seem sensible, with the models making the most mistakes on the digit 8, and the least mistakes on digits 0 and 1. As previously, the digit 4 is mistaken for the digit 9 and vice versa, and the digit 3 is mistaken for the digit 5 the most and vice versa. Intuitively, these mistakes should make sense to us. However, as the networks become wider the models become increasingly likely to be unidentifiable (see figures 5.35 and 5.36). Paradoxically, the deeper
Figure 5.33 The marginal distributions and predictive plots for the 2 hidden layer BNN with N(0, 20²/d_j) priors and 500 units in each hidden layer.
Figure 5.34 The marginal distributions and predictive plots for the 2 hidden layer BNN with Laplace(0, 15²/d_j) priors and 500 units in each hidden layer.
and less wide models have more stable output probabilities than the models with many
hidden units and only one hidden layer (figure 5.16).
Adding a second hidden layer to the BNNs introduced another layer of complications in setting the right step sizes. The step size in these models had to be reduced further to 5.5 × 10⁻⁶, which was unbearably small. Our exploration of the state space was relatively limited due to this change. Even increasing the number of leapfrog steps to 70 still did not help substantially. Sadly there was no other alternative which would avoid the rejection of all the
Figure 5.35 The trace of the output probabilities for the same datapoint used in the previous example, using the BNNs with 2 hidden layers of 300 units each. On the (left) the model with Laplace priors is used, on the (right) the model uses Normal priors.
Figure 5.36 The trace of the output probabilities for the same datapoint used in the previous example, using the BNNs with 2 hidden layers of 500 units each. On the (left) the model with Laplace priors is used, on the (right) the model uses Normal priors.
proposals. Naturally decreasing the step size means that the samples are more correlated
(see figure 5.39).
The trace plots for the first layer of weights seen in figures 5.38 and 5.39 are almost completely flat, meaning that the posterior distribution is almost identical to the prior distribution. Thus the prior distribution could be having a strong influence on this layer of weights. However, as we are naturally limited in the number of weights that can be observed, we are not in a position to conclude that this is the case for all the other weights in the first layer.
It is important to stress that there was a substantial indication that for some of the larger models, such as the ones with 500 hidden units, we did not collect enough samples to have a good approximation of the posterior distribution. This becomes evident by looking at figure 5.40. Plotting the test MSE during the sampling stage exposes the inability of the weight approximations to converge quickly enough, making the use of HMC less practical in this situation.
Figure 5.37 Summary table for BNNs with 2 hidden layers, using the CO2 regression dataset.
The continuously declining MSE during the training is both a good and a bad sign; good in the sense that the samples we are accepting seem to give more accurate predictions, and bad in the sense that the sampling scheme has not converged, and technically we should not be using the samples before convergence in building the predictive distribution. However, due to a lack of computational resources to run these models for many days in order to generate enough completely uncorrelated and converged samples from the posterior, we settle for a sub-optimal scheme.
Sub-optimality was mainly prevalent for the 2 hidden layer models with 500 or more hidden units, firstly because we were forced to use extremely small step sizes, and secondly because the space that needed to be explored was significantly larger. Comparing the number of parameters that need to be estimated in a BNN with 1 hidden layer of 5000 hidden units (a bit over 10000 parameters) with a BNN with 2 hidden layers of 500 units each (over 250000 parameters), we see that the difference in the size of the state space is immense.
When comparing the models with Student T distribution priors to the model of 100 hid-
Figure 5.38 Diagnostic and predictive plots for BNNs with 2 hidden layers of 300 units. (Left) N(0, 7²/d_j) priors are used for all the parameters, (Right) Laplace(0, 5²/d_j) priors are used for all the parameters.
den units and a Laplace prior distribution, it becomes apparent that they are significantly outperformed by the Laplace prior model, which provides a smoother and more consistent fit to the data (see figure 5.42). We believe this provides a strong indication that there is considerable room for calibration of the Student T distribution models, such as exploring a higher number of degrees of freedom and other scale parameters.
Figure 5.39 Diagnostic and predictive plots for the BNN with 2 hidden layers of 500 units and Laplace(0, 7²/d_j) priors over the weights. The (top right) plot shows the auto-correlations of a sample of weights; the x-axis is the lag and the y-axis the autocorrelation value.
The structure of the sampled weights could suggest the possibility of multi-modal marginal distributions. This means that HMC is at its limits, as it cannot handle multi-modality very well; instead the samples will be concentrated around one of the modes. To handle multi-modality one would perhaps need to incorporate this into the model structure, defining mixture of Gaussian or mixture of Laplace priors. However, this would add another layer of complexity to the model, which ultimately also needs to be approximated. For this approximation, a possible suggestion is to use a less expensive method of approximation such as VI.
What could be most interesting is to see a combination of HMC with VI; for example, using HMC to sample from the posterior distribution, followed by VI to estimate the hyper-parameters. This inexpensive addition to the iterative scheme might allow the use of more sophisticated models, such as a mixture model prior, which could ultimately enable the HMC algorithm to converge even in the case of multi-modality.
Figure 5.40 Diagnostic and predictive plots for the 2 hidden layer BNN with N(0, 10²/d_j) priors over the parameters and 500 hidden units in each hidden layer. (Top) Plot of the test mean squared error during the sampling stage of the algorithm, evaluated using MC integration on random samples from the posterior, which does not incorporate a burn-in or spacing out of the samples. Note that the first 20000 iterations are removed from this plot to allow clear visibility. The latter two plots have been explained previously.
Figure 5.41 Marginal distribution and predictive plots for 2 hidden layer BNNs with T(ν = 3, s = 5²/d_j) priors over all the parameters. (Left) model with 50 hidden units in each hidden layer. (Right) model with 100 hidden units in each hidden layer.
Figure 5.42 Marginal distribution and predictive plots for the 2 hidden layer BNN with Laplace(0, 5²/d_j) priors over all the parameters and 100 hidden units in each hidden layer.
Hidden units | Prior distribution | No. of samples | ESS | MC accuracy ± 1σ_mc | Point est. accuracy | Time (mins)
250 | Laplace(0, 10²/d_j) | 37500 | 77 | 0.927 ± 0.016 | 0.897 | 314.1
250 | N(0, 15²/d_j) | 37500 | 73 | 0.939 ± 0.013 | 0.908 | 314
Figure 5.43 Summary table for BCNNs consisting of 2 (convolutional & maxpool) layers, a fully
connected layer and a hidden layer with 250 units (see figure 1.3 in section 1.2). The samples were
generated using the HMC sampling scheme on the small MNIST dataset with unreduced images
(28x28). Both models were run on an NVIDIA K80 GPU, which was roughly 1.92 times slower
than the NVIDIA GeForce GTX 1070 used for all other experiments; the timing figures have
therefore been adjusted downwards by a factor of 1.92.
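Since the table reports effective sample sizes (ESS) and figure 5.39 shows the corresponding autocorrelation plots, the following is a minimal sketch of one common ESS estimator based on summing the positive sample autocorrelations; it is not necessarily the exact estimator used for these figures, and the names are illustrative:

import numpy as np

def effective_sample_size(chain, max_lag=1000):
    # ESS = N / (1 + 2 * sum_k rho_k), truncating the sum at the first lag whose
    # sample autocorrelation drops below zero.
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, min(max_lag, n)):
        if acf[k] < 0:
            break
        rho_sum += acf[k]
    return n / (1.0 + 2.0 * rho_sum)

# Example: a strongly correlated AR(1) chain has an ESS far below its length.
rng = np.random.default_rng(0)
chain = np.zeros(10000)
for t in range(1, len(chain)):
    chain[t] = 0.99 * chain[t - 1] + rng.normal()
print(effective_sample_size(chain))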
Figure 5.45 The traces of the output probabilities of each of the 10 digits in the first instance
of the true labels for these digits in the dataset, using a BCNN with N(0, 15²/d_j) (on the left)
and Laplace(0, 10²/d_j) (on the right) prior distributions.
Figure 5.46 The two plots correspond to the marginal distributions for the BCNN with
N(0, 15²/d_j) priors (top) and Laplace(0, 10²/d_j) priors (bottom).
Chapter 6
Conclusion
We saw that Dropout has many interpretations; however, the most convincing interpretation
to us is that Dropout enriches the input space (Konda et al., 2015). When expressing
MC Dropout for linear regression in the two possible Bayesian interpretations provided in
this report, the richness of Dropout came to light: Dropout can be expressed as a kind of
mixture model, which is considerably more expressive than ordinary Bayesian linear
regression.
In the previous chapter we witnessed the difficulty associated with identifying convergence
in the feed-forward NNs, which complicates automation. Bayesian models using HMC or
SGHMC to sample from the posterior do not suffer from this phenomenon: the more samples
that are collected, the more accurate the prediction will be, assuming that the samples came
from the invariant distribution and are sufficiently uncorrelated. However, this is incredibly
computationally costly.
Identifying the correct learning rate for the feed-forward NNs can be very difficult and
requires a thorough search over the alternatives; the trained models suffered as a result
of a poor choice of learning rate. We saw that the Adam optimiser consistently found better
local solutions than SGD.
As we discovered through the applications, tuning the hyper-parameters for HMC and
SGHMC was extremely difficult. Before identifying good leapfrog step sizes and number of
steps, a thorough and extremely costly search was conducted. The tuning of the hyper-
parameters was also dependent on the choice of prior distribution, with Student T priors
being more sensitive to the choice of hyper-parameters.
In terms of predictive accuracy, the BNNs arrived at very similar results to the feed-
forward NNs for the smaller models. However, the larger, more complex feed-forward CNN
models were outperformed by the BCNN models, which did not seem to suffer from over-
fitting. Nevertheless, BNNs still suffer from extremely slow computation, which limited the
size of the models that we were able to consider, making algorithms like HMC and SGHMC
less practical.
Furthermore, storing Bayesian NNs is a significant problem: the posterior samples needed
for prediction require large amounts of memory. This makes it extremely difficult to store
the trained models on smartphones, which have limited memory capacity. Feed-forward NNs
do not suffer from this problem, which makes them more compelling for smartphone app
developers.
We also found that using a factored (VMF) approximating distribution for Variational Infer-
ence worked surprisingly well for the models with 1 hidden layer, performing better than
the feed-forward NNs in terms of prediction for only a slight increase in computational cost.
However, the VMF approximation obtained extremely poor results for the more complex (2
hidden layer) models, making its use there quite impractical.
The choice of prior distribution was dependent on the problem, with Laplace priors slightly
outperforming the models with the Normal priors in the MNIST setting with 2 hidden layers.
For the models with 1 hidden layer the difference was negligible. For the BCNNs, the
model with Normal priors was significantly better than the model with Laplace priors. This
reiterated our point that prior distributions should be chosen according to the task at hand.
In general, the models using Student T priors performed worse than the other models.
This was believed to be a consequence of the heavy tails of the T distributions. However,
the regression models showed some weak evidence of signal identification properties that
the Student T priors might possess.
It was not uncommon for the models using MCMC to struggle to obtain stable results
in the observed (probability) traces, which could reflect the unidentifiability of
certain models. Ultimately, this made the use of HMC more impractical in such scenarios.
However, we found that decreasing the width of the hidden layers and increasing their depth
led to more stable results in terms of the output probability traces. Moreover, there seemed
to be some evidence of multi-modality in the posterior distributions, meaning that it may have
been inherently difficult to sample from such distributions using HMC.
Finally, the leapfrog step sizes used in the models explored were extremely small, in order
to allow the HMC algorithm to produce proposals with non-zero acceptance probability. This
resulted in highly correlated samples, making the HMC sampling scheme extraordinarily
inefficient.
6.1 Further work
To counteract the limitations of having to tune the HMC hyper-parameters, the use of the
No-U-Turn Sampler (Hoffman and Gelman, 2014) is suggested for future work. For models
using T distribution priors it would be interesting to leverage skip connections (Duvenaud et
al., 2014; He et al., 2015), which might help avoid large singular values in the distribution
of the Jacobian and improve the stability of inference when using such heavy-tailed
distributions.
There is a lot of scope for improvement in the choice of prior distributions, for example
using a more diverse selection of T distribution priors, such as ones with higher degrees of
freedom. Alternatively, one could mix prior families, for example placing Student T dis-
tribution priors only on the final layer of parameters and Normal distribution priors on the
earlier layers. Another very interesting prior distribution that could be used in future work
is a spike-and-slab prior over the parameters, sketched below.
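As a concrete illustration of this suggestion, the following is a minimal sketch of an independent spike-and-slab log-prior over the weights, in which the point-mass spike of Mitchell and Beauchamp (1988) [27] is relaxed to a narrow Gaussian so that the density stays differentiable for HMC or VI; the function names and default scales are assumptions, not part of the dissertation's code:

import numpy as np

def normal_logpdf(w, scale):
    return -0.5 * np.log(2 * np.pi) - np.log(scale) - 0.5 * (w / scale) ** 2

def log_spike_slab_prior(w, pi=0.5, spike_scale=1e-3, slab_scale=1.0):
    # Each weight is drawn from a narrow "spike" Normal with probability (1 - pi)
    # and from a wide "slab" Normal with probability pi; logaddexp keeps the
    # mixture log-density numerically stable.
    log_spike = np.log(1 - pi) + normal_logpdf(w, spike_scale)
    log_slab = np.log(pi) + normal_logpdf(w, slab_scale)
    return np.sum(np.logaddexp(log_spike, log_slab))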
Models that can handle multi-modality in the posterior distribution, perhaps by defining
mixture-of-Gaussian or mixture-of-Laplace prior distributions over the parameters, could be
very promising. It would be interesting to see HMC used to sample from the posterior
distribution, followed by a VI step to estimate the hyper-parameters of the mixture prior
distributions.
Appendices
Appendix A
Algorithms
A.1 Introduction
A.1.1 SGD
Figure A.1 The SGD algorithm. n_epoch is the total number of cycles through the whole dataset.
n_batches corresponds to the total number of batches per epoch, which is typically the whole
number greater than N/batchsize, where N is the number of datapoints. η is the learning rate,
L(θ; D^(j)) is the loss function for the j-th batch of the data, and θ is the set of all param-
eters that need to be learned. For every epoch, each of the batches is a disjoint set of
fixed size drawn from the set of all datapoints D.
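Because the original pseudocode listing does not reproduce well here, the following is a minimal NumPy sketch of the mini-batch SGD update described in the caption; the names (sgd, grad_loss, batches) are illustrative rather than taken from the dissertation's code:

import numpy as np

def sgd(params, grad_loss, batches, n_epoch, lr=0.01):
    # theta <- theta - eta * grad L(theta; D^(j)), looping over epochs and batches.
    for epoch in range(n_epoch):
        for batch in batches:               # disjoint, fixed-size subsets of D
            params = params - lr * grad_loss(params, batch)
    return params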
A.1.2 Momentum
Figure A.2 The SGD with momentum algorithm. n_epoch, n_batches, N, η, and L(θ; D^(j)) are
defined exactly as before. v is the momentum vector and γ is the friction coefficient.
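A corresponding sketch of SGD with momentum follows; the exact placement of the learning rate in the velocity update may differ slightly from the listing in Figure A.2, so this should be read as one common variant rather than a verbatim reproduction:

import numpy as np

def sgd_momentum(params, grad_loss, batches, n_epoch, lr=0.01, gamma=0.9):
    # v <- gamma * v + eta * grad L(theta; D^(j));  theta <- theta - v
    v = np.zeros_like(params)
    for epoch in range(n_epoch):
        for batch in batches:
            v = gamma * v + lr * grad_loss(params, batch)
            params = params - v
    return params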
A.1.3 Nesterov's Accelerated Gradient (NAG)
Figure A.3 The NAG algorithm. n_epoch, n_batches, N, η, and L(θ; D^(j)) are defined exactly
as before. v is the momentum vector and γ is the friction coefficient.
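A sketch of the Nesterov variant, which evaluates the gradient at the look-ahead point rather than at the current parameters; again this is one common formulation and may differ in detail from the listing in Figure A.3:

import numpy as np

def sgd_nesterov(params, grad_loss, batches, n_epoch, lr=0.01, gamma=0.9):
    # The gradient is taken at the look-ahead point theta - gamma * v.
    v = np.zeros_like(params)
    for epoch in range(n_epoch):
        for batch in batches:
            lookahead = params - gamma * v
            v = gamma * v + lr * grad_loss(lookahead, batch)
            params = params - v
    return params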
A.1.4 Adam
Figure A.4 The Adam algorithm. n_epoch, n_batches, N, η, and L(θ; D^(j)) are defined exactly
as before. Note that operations on vectors such as the square, division and the square root
are all element-wise operations, e.g. g² = g ⊙ g. As suggested in Kingma and Ba (2015) [11],
good default settings are β₁ = 0.9, β₂ = 0.999, η = 0.001, and ε = 10⁻⁸.
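A sketch of the Adam update using the default settings quoted in the caption; t is the global step counter used for the bias corrections, and the names are illustrative:

import numpy as np

def adam(params, grad_loss, batches, n_epoch,
         lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(params)   # first-moment estimate
    v = np.zeros_like(params)   # second-moment estimate
    t = 0
    for epoch in range(n_epoch):
        for batch in batches:
            t += 1
            g = grad_loss(params, batch)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g        # element-wise square
            m_hat = m / (1 - beta1 ** t)               # bias-corrected moments
            v_hat = v / (1 - beta2 ** t)
            params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params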
Appendix B
Derivations
B.1 Introduction
Derivative of the Softmax outputs with respect to the output-layer pre-activations:

$\dfrac{\partial \hat{y}_{im}}{\partial z_{ij}} = \hat{y}_{im}\,(\delta_{mj} - \hat{y}_{ij}) = \begin{cases} \hat{y}_{im}(1 - \hat{y}_{im}), & \text{if } m = j \\ -\hat{y}_{im}\hat{y}_{ij}, & \text{if } m \neq j \end{cases}$   (B.1)

NB: since $L = -\sum_{l=1}^{n}\sum_{c=1}^{k} y_{lc}\log(\hat{y}_{lc})$, we have
$\dfrac{\partial L}{\partial z_{ij}} = -\sum_{l=1}^{n}\sum_{c=1}^{k}\delta_{li}\,y_{ic}\,\dfrac{\partial \log(\hat{y}_{ic})}{\partial z_{ij}} = -\sum_{c=1}^{k} y_{ic}\,\dfrac{\partial \log(\hat{y}_{ic})}{\partial z_{ij}}$.

By the chain rule,
$-\sum_{c=1}^{k} y_{ic}\,\dfrac{\partial \log(\hat{y}_{ic})}{\partial z_{ij}} = -\sum_{c=1}^{k}\dfrac{y_{ic}}{\hat{y}_{ic}}\dfrac{\partial \hat{y}_{ic}}{\partial z_{ij}} = -\dfrac{y_{ij}}{\hat{y}_{ij}}\dfrac{\partial \hat{y}_{ij}}{\partial z_{ij}} - \sum_{c\neq j}\dfrac{y_{ic}}{\hat{y}_{ic}}\dfrac{\partial \hat{y}_{ic}}{\partial z_{ij}}$,
and from (B.1),

$= -y_{ij} + y_{ij}\hat{y}_{ij} + \sum_{c\neq j} y_{ic}\hat{y}_{ij} = \hat{y}_{ij}\Big(\underbrace{\textstyle\sum_{c=1}^{k} y_{ic}}_{=1}\Big) - y_{ij} = \hat{y}_{ij} - y_{ij}$   (B.2)

In matrix form, $\dfrac{dL}{dZ} = \hat{Y} - Y$.

Derivative of the pre-activations with respect to the hidden units:

$\dfrac{\partial z_{ik}}{\partial h_{im}} = \dfrac{\partial}{\partial h_{im}}\Big(\sum_{j} h_{ij}w_{jk} + b_k\Big) = \sum_{j}\delta_{jm}w_{jk} = w_{mk}$   (B.3)

Combining (B.2) and (B.3) via the chain rule,

$\dfrac{\partial L}{\partial h_{ik}} = \sum_{j}\dfrac{\partial L}{\partial z_{ij}}\dfrac{\partial z_{ij}}{\partial h_{ik}} = \sum_{j}(\hat{y}_{ij} - y_{ij})\,w_{kj}$, i.e. $\dfrac{\partial L}{\partial H} = (\hat{Y} - Y)\,W^{\mathsf{T}}$   (B.4)
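Result (B.2) can be checked numerically with finite differences; the following is a small, self-contained sketch (illustrative names, not the dissertation's code):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
n, k = 4, 3
z = rng.normal(size=(n, k))
y = np.eye(k)[rng.integers(0, k, size=n)]       # one-hot true labels

analytic = softmax(z) - y                       # dL/dZ from (B.2)

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(n):
    for j in range(k):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numeric[i, j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # should be very small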
B.1.3 Derivative of the loss w.r.t. the bias of the output layer
From before, $Z = HW + b$, i.e. $z_{ik} = \sum_j h_{ij}w_{jk} + b_k$, so

$\dfrac{\partial z_{ik}}{\partial b_m} = \dfrac{\partial}{\partial b_m}\Big(\sum_j h_{ij}w_{jk} + b_k\Big) = \delta_{km}$   (B.5)

and therefore

$\dfrac{\partial L}{\partial b_m} = \sum_{i,k}\dfrac{\partial L}{\partial z_{ik}}\dfrac{\partial z_{ik}}{\partial b_m} = \sum_{i=1}^{n}(\hat{y}_{im} - y_{im})$, i.e. $\dfrac{\partial L}{\partial b} = (\hat{Y} - Y)^{\mathsf{T}}\mathbf{1}_n$   (B.6)
B.1.4 Derivative of the loss w.r.t. the weight of the output layer
$Z = HW + b$; using standard matrix calculus,

$\dfrac{\partial Z}{\partial W} = H^{\mathsf{T}}$   (B.7)

so that $\dfrac{\partial L}{\partial W} = H^{\mathsf{T}}(\hat{Y} - Y)$.
For the ReLU activation of the hidden layer, $h_{ij} = \max(0, m_{ij})$, so

$\dfrac{\partial h_{ij}}{\partial m_{ik}} = \delta_{jk}\,\mathbf{1}\{m_{ik} > 0\}$   (B.8)

By the chain rule and (B.8),

$\dfrac{\partial L}{\partial m_{ik}} = \sum_j \dfrac{\partial L}{\partial h_{ij}}\dfrac{\partial h_{ij}}{\partial m_{ik}} = \sum_j \dfrac{\partial L}{\partial h_{ij}}\,\delta_{jk}\,\mathbf{1}\{m_{ik} > 0\} = \dfrac{\partial L}{\partial h_{ik}}\,\mathbf{1}\{m_{ik} > 0\} = \Big(\sum_j (\hat{y}_{ij} - y_{ij})\,w_{kj}\Big)\mathbf{1}\{m_{ik} > 0\}$ by (B.4),

which in matrix form is

$\dfrac{\partial L}{\partial M} = \big[(\hat{Y} - Y)\,W^{\mathsf{T}}\big] \odot \mathbf{1}\{M > 0\}$   (B.9)

where $\mathbf{1}\{M>0\}$ is a matrix with zeros where the entries of $M$ are less than or equal to zero
and ones where they are positive:

$[\mathbf{1}\{M>0\}]_{ij} = \begin{cases} 1, & \text{if } m_{ij} > 0 \\ 0, & \text{if } m_{ij} \leq 0 \end{cases}$

For the Softplus activation, whose derivative is the logistic sigmoid, the corresponding result is

$\dfrac{\partial L}{\partial M} = \big[(\hat{Y} - Y)\,W^{\mathsf{T}}\big] \odot \dfrac{1}{1 + e^{-M}}$   (B.10)
We note that since the cross-entropy is the negative log-likelihood of the Softmax model,
we can re-use most of these derivatives for the implementation of HMC, and we therefore
choose not to go into detailed derivations of the derivatives of the energy functions in
this dissertation.
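To illustrate the re-use, the following is a minimal sketch (hypothetical names, a single Softmax output layer, and an isotropic Normal prior assumed for simplicity) of the HMC potential energy U(θ) = −log p(D | θ) − log p(θ) and its gradient built from the derivatives above:

import numpy as np

def potential_energy(W, H, Y, sigma=1.0):
    Z = H @ W
    Z = Z - Z.max(axis=1, keepdims=True)
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    neg_log_lik = -np.sum(Y * log_probs)                # cross-entropy term
    neg_log_prior = 0.5 * np.sum(W ** 2) / sigma ** 2   # N(0, sigma^2) prior term
    return neg_log_lik + neg_log_prior

def grad_potential_energy(W, H, Y, sigma=1.0):
    Z = H @ W
    Z = Z - Z.max(axis=1, keepdims=True)
    Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    return H.T @ (Y_hat - Y) + W / sigma ** 2           # uses (B.2) and (B.7)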
Appendix C
Plots
Any supplementary graphics and plots not in the main body of the dissertation can be found
in this section. This section is primarily intended to provide a more comprehensive under-
standing of the models discussed in the dissertation.
C.1 Chapter 2
This section of the Appendix contains supplementary plots from Chapter 2.
C.1.1 1 hidden layer neural networks
Figure C.1 Drawn functions from 1 hidden layer BNNs with 10000 hidden units. The drawn func-
tions are evaluated by taking the average of 100 samples. The top left plot has a N(0, 10²/d_j I_{d_j})
prior distribution over the weights, whilst the bottom left plot has a N(0, 20²/d_j I_{d_j}) prior distribution
over the weights. The top right plot has a Laplace(0, 5²/d_j I_{d_j}) prior over the weights, whilst the
bottom right plot has a Laplace(0, 10²/d_j I_{d_j}) prior over the weights. Here d_j is the number of
inputs from the previous layer (the number of columns of the units being multiplied by the weight
matrix). All the networks use a ReLU activation function for the hidden layer, and the input region
is -60 to +60.
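For reference, a minimal sketch of how such prior function draws can be generated is given below, scaled down to 1000 hidden units and assuming zero-mean Normal priors on the biases as well (the caption only specifies the weight priors); the names are illustrative:

import numpy as np

def average_prior_draw(x, n_hidden=1000, w_scale=10.0, n_samples=100, rng=None):
    # Average of n_samples functions drawn from a 1 hidden layer BNN prior with
    # N(0, w_scale^2 / d_j) weights and a ReLU hidden layer (cf. figure C.1).
    rng = np.random.default_rng() if rng is None else rng
    x = x.reshape(-1, 1)                              # one input dimension, so d_j = 1
    outputs = np.zeros((n_samples, len(x)))
    for s in range(n_samples):
        w1 = rng.normal(0.0, w_scale, size=(1, n_hidden))
        b1 = rng.normal(0.0, w_scale, size=n_hidden)          # assumed bias prior
        w2 = rng.normal(0.0, w_scale / np.sqrt(n_hidden), size=(n_hidden, 1))
        b2 = rng.normal(0.0, w_scale)                         # assumed bias prior
        h = np.maximum(0.0, x @ w1 + b1)              # ReLU hidden layer
        outputs[s] = (h @ w2 + b2).ravel()
    return outputs.mean(axis=0)

x = np.linspace(-60, 60, 200)
f_bar = average_prior_draw(x)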
Figure C.2 Drawn functions from 1 hidden layer BNNs with 10000 hidden units. The drawn func-
tions are evaluated by taking the average of 1000 samples. The same prior distributions as in
figure C.1 are used, with the difference that the networks use Softplus activation functions for
the hidden layer.
Figure C.3 Drawn functions from 2 hidden layer BNNs with 10000 hidden units for each of the hid-
den layers (same for all drawn functions). The drawn functions are evaluated by taking the average
of 100 samples. The top left plot has a N(0, 10²/d_j I_{d_j}) prior distribution over the weights, whilst the
bottom left plot has a N(0, 20²/d_j I_{d_j}) prior distribution over the weights. The top right plot has a
Laplace(0, 10²/d_j I_{d_j}) prior over the weights, whilst the bottom right plot has a Laplace(0, 20²/d_j I_{d_j})
prior over the weights. Here d_j is the number of inputs from the previous layer (the number of
columns of the units being multiplied by the weight matrix). All the networks use Softplus activation
functions for both the hidden layers, and the input region is -60 to +60.
Figure C.4 Drawn functions from 2 hidden layer BNNs with 1000 hidden units for each of the
hidden layers. The drawn functions are evaluated by taking the average of 1000 samples. The
prior distributions for the weights are standard Student T distributions with 0.5, 1, and 1.5 degrees
of freedom for the top, centre, and bottom plots respectively. All the networks use Softplus
activation functions for the hidden layers, and have an input region from -60 to +60.
C.2 Chapter 5
Figure C.5 The learning curves during training of models fitted using back propagation on the
MNIST small dataset. All the models were composed of 2 hidden layers, each with 500 hidden units.
The models were both trained for 10000 epochs and used Softplus non-linearities for both their
hidden layers.
Appendix D
Additional results
D.1 Chapter 5
Figure D.1 Selected results for models with 1 hidden layer using Softplus non-linearities.
Several different learning rates and optimisers are included that have not been presented
in the main results section.
D.1.2 NNs with 2 hidden layers
(Table columns: Number of epochs, Hidden units, Optimiser, Learning Rate, MSE, Time (seconds).)
Figure D.2 Selected (best) results for models with 2 hidden layers, each with an equal number
of hidden units, using Softplus non-linearities. Several different learning rates and
optimisers were used; only the best performing models are shown in this table. MSE
refers to the final mean squared error on the test set. The time figure refers only to the
training time for the models and excludes any other computations, such as the time taken
to evaluate the models.
Appendix E
All the code for this dissertation can be found on the GitHub page https://github.com/
cstavrou/Priors-and-Algorithms-for-Bayesian-Neural-Networks/ [56]. The page has different
folders for each chapter of the dissertation, where all the relevant code and results can be
found. For instructions on how to run the code, please read the README files available for
each of the different scripts. All scripts can be run from the Terminal, and many of the scripts
take input arguments, which are described both in the code itself and on the GitHub page.
Bibliography
[1] Neal, R. M., Bayesian learning for neural networks, Thesis, University of Toronto, 1995.
[2] Gal, Y., Uncertainty in deep learning, Thesis, University of Cambridge, 2017.
[3] McCulloch, W.S. and W. Pitts, A logical calculus of the ideas immanent in nervous
activity, Bulletin of Mathematical Biophysics, vol. 5, 1943, pp. 115-133.
[4] Hebb, D. O., The Organization of Behaviour: A Neuropsychological Theory, New York,
Wiley, 1949.
[5] Rosenblatt, F., The Perceptron - a perceiving and recognizing automaton, Report No.
85-460-1, Cornell Aeronautical Laboratory, 1957.
[8] Linnainmaa, S., The representation of the cumulative rounding error of an algorithm as
a Taylor expansion of the local rounding errors, Masters Thesis, University of Helsinki,
1970.
[9] Widrow, B., An adaptive Adaline neuron using chemical memistors, Technical Re-
port No. 1553-2, 1960.
[10] Nesterov, Y., A method of solving a convex programming problem with convergence
rate O(1/k²), Soviet Mathematics Doklady, Vol. 27, 1983, pp. 372-376.
[11] Kingma, D. P., Ba, J. L. Adam: a Method for Stochastic Optimization, 3rd International
Conference on Learning Representations. San Diego, 2015.
[13] Dugas, C. et al., Incorporating Second-Order Functional Knowledge for Better Option
Pricing, Advances in Neural Information Processing Systems, Vol. 13 (NIPS 2000),
2001, pp. 1-7.
[14] Neal, R. M., Probabilistic inference using Markov chain Monte Carlo methods, Techni-
cal Report CRG-TR-93-1, University of Toronto, 1993.
[15] Gilks, W.R., S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice,
London, UK, Chapman & Hall, 1996.
[16] Bishop, C. M., Pattern Recognition and Machine Learning, Secaucus, NJ, USA,
Springer-Verlag New York, Inc., 2006.
[17] Metropolis, N. et al., Equation of state calculations by fast computing machines, The
Journal of Chemical Physics, Vol. 21, No. 6, 1953, pp. 1087-1092.
[18] Hastings, W. K., Monte Carlo sampling methods using Markov chains and their appli-
cations, Biometrika, Vol. 57, No. 1, 1970, pp. 97-109.
[19] Geman, S., and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images, IEEE Transactions on pattern analysis and machine intelligence,
Vol. PAMI-6, No. 6, 1984, pp. 721-741.
[20] Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth, Hybrid Monte Carlo,
Physics Letters B, Vol. 195, No. 2, 1987, pp. 216-222.
[21] Jylanki, P., A. Nummenmaa, A. Vehtari, Expectation Propagation for neural networks
with sparsity-promoting priors, Journal of Machine Learning Research, Vol. 15, 2014,
pp. 1849-1901.
[22] Kalaitzis, A., R. Silva, Flexible sampling of discrete data correlations without the
marginal distributions, Advances in Neural Information Processing Systems 26
(NIPS13), 2013.
[23] Hinton, G. E., D. van Camp, Keeping the neural networks simple by minimizing the
description length of the weights, Proceedings 93 COLT, 1993, pp. 5-13.
[24] Peterson, C., J. R. Anderson, A mean field theory learning algorithm for neural net-
works, Complex Systems, Vol. 1, 1987, pp. 995- 1019.
[27] Mitchell, T. J., J. J. Beauchamp, Bayesian variable selection in linear regression, Jour-
nal of the American Statistical Association, Vol. 83, No. 404, 1988, pp. 1023-1032.
[28] Ishwaran, H., J. S. Rao, Spike and slab variable selection: Frequentist and Bayesian
strategies, The Annals of Statistics, Vol. 33, No. 2, 2005, pp. 730-773.
[29] He, K. et al., Deep residual learning for image recognition, arXiv:1512.03385v1, 2015.
[31] Hubel, D. H., T. N. Wiesel, Receptive fields, binocular interaction, and functional archi-
tecture in the cat's visual cortex, Journal of Physiology, Vol. 160, 1962, pp. 106-154.
[32] Fukushima, K., S. Miyake, Neocognitron: A new algorithm for pattern recognition tol-
erant of deformations and shifts in position, Pattern Recognition, Vol. 15, 1982, pp.
455-469.
[33] Lecun, Y., L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE, Vol. 86, No. 11, 1998, pp. 2278-2324.
[35] Srivastava, N. et al., Dropout: A simple way to prevent neural networks from over-
fitting, Journal of Machine Learning Research, Vol. 15, 2014, pp. 1929-1958.
[36] Konda, K., X. Bouthillier, R. Memisevic, and P. Vincent, Dropout as data augmentation,
arXiv:1506.08700v1, 2015.
[37] Simpson, D., H. Rue, A. Riebler, T. G. Martins, S. H. Sørbye, Penalising model com-
ponent complexity: A principled, practical approach to constructing priors, Statistical
Science, Vol. 32, No. 1, 2017, pp. 1-28.
[41] Myshkov, P., S. Julier, Posterior distribution analysis for Bayesian inference in neural
networks, Workshop on Bayesian Deep Learning, NIPS 2016.
[42] Minka, T. P., Bayesian model averaging is not model combination, 2002,
https://tminka.github.io/papers/minka-bma-isnt-mc.pdf (accessed 17 September 2017).
[43] Baldi, P., P. Sadowski, Understanding Dropout, Advances in Neural Information Pro-
cessing Systems 26 (NIPS), 2013.
[44] Baldi, P., P. Sadowski, The dropout learning algorithm, Artificial Intelligence, Vol. 210,
2014, pp. 78-122.
[46] Welling, M., Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics,
Proceedings of the 28th International Conference on Machine Learning, 2011.
[47] Kingma, D. P., T. Salimans, M. Welling, Variational dropout and the local reparame-
terization trick , Advances in Neural Information Processing Systems 28 (NIPS 2015),
2015.
[48] Molchanov, D., A. Ashukha, D. Vetrov, Variational Dropout sparsifies deep neural net-
works, Proceedings of the 34th International Conference on Machine Learning, Vol. 70,
2017, pp. 2498-2507.
[49] Li, Y., Y. Gal, Dropout inference in Bayesian neural networks with alpha-divergences,
Proceedings of the 34th International Conference on Machine Learning, Vol. 70, 2017,
pp. 2052-2061.
[51] Silva, R., R. Gramacy, Gaussian process structural equation models with latent vari-
ables, Proceedings of the 26th Conference on Uncertainty on Artificial Intelligence, UAI
2010.
[53] Gerven, M. V., B. Cseke, R. Oostenveld, T. Heskes, Bayesian source localization with
the Multivariate Laplace prior , Advances in Neural Information Processing Systems 22
(NIPS 2009), 2009.
[54] Ingraham, J. B., D. S. Marks, Variational Inference for Sparse and Undirected Models,
Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2017,
pp. 1607-1616.
[55] MacKay, D. J., Bayesian interpolation, Neural Computation, Vol. 4, 1992, pp. 415-447.
[57] Chen, T., E. B. Fox, C. Guestrin, Stochastic gradient Hamiltonian Monte Carlo,
arXiv:1402.4102v2, 2014.
[58] Geyer, C. et al., Handbook of Markov Chain Monte Carlo, Boca Raton, FL, USA, Chap-
man & Hall/CRC, 2011.
[59] Blei, D. M., A. Kucukelbir, J. D. McAuliffe, Variational Inference: A Review for Statisti-
cians, arXiv preprint arXiv:1601.00670v5, 2017.
[60] Ripley, B. D., Stochastic Simulation, Chichester, UK, John Wiley & Sons, Inc., 1987.
[61] Hoffman, M. D., A. Gelman, The No-U-Turn Sampler: Adaptively Setting Path Lengths
in Hamiltonian Monte Carlo, Journal of Machine Learning Research, Vol. 15, 2014, pp.
1593-1623.
[63] Tran, D. et al., Edward: A library for probabilistic modeling, inference, and criticism,
arXiv:1610.09787v3, 2017.
[65] Abadi, M. et al., Tensorflow: A system for large-scale machine learning, Proceedings
of the 12th USENIX Symposium on OSDI16, 2016, pp. 265-283.
[67] Choo, K., Learning hyper-parameters for neural network models using Hamiltonian
dynamics, Thesis, University of Toronto, 2000.