2012 Nikolaos Nikolaou MSC
Learning objectives
Nikolaos Nikolaou
Master of Science,
Artificial Intelligence,
School of Informatics,
University of Edinburgh
August 17, 2012
Abstract
In this project we examined the problem of non-convex optimization in the context of Machine Learning, drawing inspiration from the increasing popularity of
methods such as Deep Belief Networks, which involve non-convex objectives. We
focused on the task of training the Neural Autoregressive Distribution Estimator,
a recently proposed variant of the Restricted Boltzmann Machine, in applications
to density estimation. The aim of the project was to explore the various stages
involved in implementing optimization methods and choosing the appropriate one
for a given task. We examined a number of optimization methods, ranging from
derivative-free to second order and from batch to stochastic. We experimented
with variations of these methods, presenting along the way all the major steps
and decisions involved. The challenges of the problem included the relatively
large parameter space and the non-convexity of the objective function, the large
size of some of the datasets we used, the multitude of hyperparameters and decisions involved in each method, as well as the ever-present danger of overfitting
the data. Our results show that second order Quasi-Newton batch methods like
L-BFGS and variants of stochastic first order methods like Averaged Stochastic
Gradient Descent outshine the rest of the methods we examined.
Acknowledgements
I would like to extend my thanks to my supervisor Dr. Iain Murray for his guidance and patience. Every piece of information he provided was to-the-point and
saved me many hours of work. From theoretical background, to implementation
details, to writing and typesetting style, the end result would be of far lesser
quality, in every aspect, without his feedback.
I also thank my family and all my friends for their love and support. Special
thanks go to my friends Costas and Stratis who lent me not only strength but
also their computing power for some of the experiments.
Declaration
I declare that this thesis was composed by myself, that the work contained herein
is my own except where explicitly stated otherwise in the text, and that this work
has not been submitted for any other degree or professional qualification except
as specified.
Contents
1 Introduction and Outline of the Dissertation
    1.1 Introduction
    1.2 Outline of the Dissertation

2 Background
    2.1 Optimization and its Role in Machine Learning
    2.2 Deep Belief Networks
    2.3 Restricted Boltzmann Machines
    2.4 Density Estimation
    2.5 Models Examined

3 Optimization Methods Examined

4 Hyperparameter Selection
    4.1 Step Size
    4.2 Adaptive and Individual Step Sizes
    4.3 Momentum
    4.4 Batch Size
    4.5 Epsilon in Statistical Tests
    4.6 Termination Criteria and Overfitting Control
    4.7 Evaluation
    4.8 Alternative Ways to Search the Hyperparameter Space

6 Experimental Results
    6.1 The Datasets
    6.2 Experimental Design
    6.3 Results: Average Loglikelihood (ALL)
    6.4 Results: Execution Times
    6.5 Results: Closing Remarks

Bibliography

A Notational Conventions
Chapter 1
Introduction and Outline of the
Dissertation
1.1 Introduction

1.2 Outline of the Dissertation
In Chapter 2 we will introduce the basic theoretical background behind this dissertation. Our intention is for it to act as both a theoretical basis for the rest of
the chapters and as a source of motivation for studying non-convex optimization
in the context of machine learning. We will begin by presenting the general form
of an optimization problem and introducing some basic terminology, including the
distinction between convex and non-convex optimization problems and the importance of parameter regularization. We will see how often optimization problems
arise in machine learning tasks and what distinguishes them from optimization
problems in general.
We will then move on to describing the basics of Deep Belief Networks (DBNs),
since they represent a successful recent trend in machine learning where the objectives generally are non-convex. We will mention their basic structure, as well
as the motivation for using them, drawing analogies between the principles that
govern them and how learning appears to work in humans. We will outline the
principles of training DBNs and how the non-convexity of their objectives arises.
We will then briefly present Restricted Boltzmann Machines and how they are
used as building blocks for DBNs. We will only go as deep as introducing their
limitations and how other models, which we will cover in more depth, can overcome them. Throughout these discussions we will offer examples of applications
and pointers to the bibliography which emphasize the wide range of applications and
the power of DBNs.
Finally, we will present the specific models we will be working with, emphasizing the aspects that will be examined throughout the dissertation. The main focus will be their advantages and disadvantages relative to other models (including the ones we will compare our results against), their parameters, their objective functions and the gradients of these objectives with respect to the parameters, the convexity of their objectives, their computational complexity, and the role of overfitting and regularization in practice. The principal focus of this project will be the Neural Autoregressive Distribution Estimator (NADE), with the presentation of which the chapter closes.
Chapter 3 is a combination of theory and application. Here we will delve deeper
and more formally into the subject of optimization, and more specifically into large-scale non-convex optimization problems, such as the task of training NADE with
the specific datasets we use. We will compare the benefits and shortcomings of
batch versus stochastic methods as well as what we gain and what we lose by
using methods of different order.
We will then choose a number of specific methods to examine in more depth,
most of which will include both the basic version and variants based on various
heuristics. We will present each version of each method in detail, including the
motivation behind it, relative benefits and disadvantages over other methods, the
role of their hyperparameters and some possible variations. We will then discuss
our specific implementation for each version, including justification for each of
our decisions, and present pseudocode for all of them.
Chapter 4 will discuss hyperparameter selection in more depth. This chapter continues the theme of combining theory and application: there are theoretical and experimental guidelines for this procedure, but practical considerations, like keeping the execution time manageable, were also taken into account. We will discuss termination criteria, choices about which hyperparameters we fixed to specific values and why, which ones we tried to adjust in each
method and what subspace of the hyperparameter space we explored. We will
discuss how we evaluate our hyperparameter choices, how we safeguard
against overfitting and how we chose the termination criteria of our methods. In
most cases, besides describing the specific hyperparameter selection scheme we
used, we will also discuss alternatives, some of which we explored up to some
point.
In Chapter 5 we will test some of our techniques by applying them to a toy problem, more specifically binary classification using logistic regression.
Chapter 2
Background
2.1 Optimization and its Role in Machine Learning
An optimization problem is generally the problem of finding the best solution from
all feasible solutions. We will briefly discuss the categorizations of optimization
problems in the beginning of the next chapter. Here we will restrict ourselves
to presenting the general form of continuous optimization problems and their
role in Machine Learning. In a continuous optimization problem, our goal is
to either minimize or maximize the value of a function $f(\theta) : \mathbb{R}^N \to \mathbb{R}$, called the objective function, possibly under a number of inequality and/or equality constraints ($g_i(\theta) \leq 0$ and $h_i(\theta) = 0$, respectively). The standard definition of a
continuous optimization problem [Boyd et al. , 2004] is
$$\begin{aligned}
\text{minimize} \quad & f(\theta) \\
\text{subject to} \quad & g_i(\theta) \leq 0, \quad i = 1, \ldots, m_1 \\
& h_i(\theta) = 0, \quad i = 1, \ldots, m_2 .
\end{aligned}$$        (2.1)
Any continuous optimization problem can be expressed in the form above. Constraints of the form $g'_i(\theta) \geq 0$ can be rewritten as $-g'_i(\theta) \leq 0$. If the right hand side is not equal to zero, but a constant $b_i \neq 0$, we can simply subtract $b_i$ from both sides to obtain the standard inequality constraint form. Finally, a maximization problem with objective function $f(\theta)$ can be stated as the minimization of $-f(\theta)$ subject to the same constraints.
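As a small worked example of these rewriting rules (our own illustration, not an example from the text), a maximization with a "greater than" constraint can be put in the standard form (2.1) as follows:

$$\text{maximize } -\theta_1^2 - \theta_2^2 \ \text{ s.t. } \theta_1 + \theta_2 \geq 1
\quad\Longleftrightarrow\quad
\text{minimize } f(\theta) = \theta_1^2 + \theta_2^2 \ \text{ s.t. } g_1(\theta) = 1 - \theta_1 - \theta_2 \leq 0 .$$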
Solving optimization problems usually involves starting at some point in the
parameter space and searching for close local optima. One way to categorize
optimization problems is based on whether or not the objective function is convex.
In convex objective functions, any local optimum has to be a global optimum. In
non-convex objectives¹ we may have more than one local optimum. As a result, while
¹ From time to time we may refer to the value of the objective function as the cost or the error, or simply as the objective, without properly introducing the term. We hope it will be clear that we mean the exact same thing.
methods that search for the closest local optimum will succeed in finding the
global optimum of a convex objective, in non-convex objectives the success is not
guaranteed, as we might get trapped in a local optimum. In this sense, non-convex problems are harder. In Figure 2.1 we can see two univariate functions,
a convex and a non-convex one.
Figure 2.1: A non-convex (left) and a convex (right) function of one variable.
Notice in the convex case the local minimum is also the global minimum.
Image adapted from images in lecture 4 of the course B1 Optimization from
University of Oxford, by Andrew Zisserman
http://www.robots.ox.ac.uk/~az/lectures/b1/lect4.pdf.
In the Introduction we stressed the importance of optimization in Machine Learning and gave a few examples to illustrate how common a subproblem in learning
algorithms it is. Here we should clarify that optimization in the context of learning is not the same as solving an optimization problem. Our real goal is not to
minimize the value of the objective function on the training set, but to train a
model that can generalize on new data. When this generalization capability is
lost because the model has adjusted its parameters so as only to fit the training
data, the model suffers from overfitting. In practice this often means that some
parameters have assumed extremely large values compared to others. Regularization is a way to protect training against overfitting. To implement parameter
regularization, we can add a term to the objective function that encourages the
parameters to assume small values (by penalizing big ones) or that forces many
of the parameters to assume zero values (encourages sparse solutions). A way to
encourage small parameter values for example is L2-regularization,
$$f'(\theta) = f(\theta) + \lambda \|\theta\|_2^2, \qquad \|\theta\|_2 = \sqrt{\sum_{i=1}^{N} \theta_i^2}.$$        (2.2)
Here we formed a new objective function $f'(\theta)$, by adding to the original objective function $f(\theta)$ the square of the L2-norm of $\theta$, weighted by a factor $\lambda \in \mathbb{R}^+$, which controls how harshly we penalize large values (big values of $\lambda$ mean harsh penalization, thus heavier regularization). So the new objective balances the old one (minimize $f(\theta)$ with respect to $\theta$) and the regularization term (keep parameter values as small as possible).
A way to encourage sparseness is L1-regularization, where we use the L1-norm
of $\theta$ instead,

$$f'(\theta) = f(\theta) + \lambda \|\theta\|_1, \qquad \|\theta\|_1 = \sum_{i=1}^{N} |\theta_i|.$$        (2.3)
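To make the effect of such a penalty term concrete, here is a minimal NumPy sketch (our own illustration, not code from the thesis; the names l2_regularized and toy_objective are placeholders we chose) that wraps any objective-and-gradient routine with an L2 penalty:

import numpy as np

def l2_regularized(objective_and_grad, lam):
    """Wrap an objective/gradient function with an L2 penalty term."""
    def wrapped(theta):
        f, grad = objective_and_grad(theta)
        f_reg = f + lam * np.sum(theta ** 2)    # f'(theta) = f(theta) + lambda * ||theta||_2^2
        grad_reg = grad + 2.0 * lam * theta     # gradient of the penalty term is 2 * lambda * theta
        return f_reg, grad_reg
    return wrapped

# Toy quadratic objective used only to exercise the wrapper.
def toy_objective(theta):
    return 0.5 * np.sum(theta ** 2), theta

value, gradient = l2_regularized(toy_objective, lam=0.1)(np.array([1.0, -2.0]))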
In this project we will use unregularized objectives. Our way of preventing overfitting will be by use of a statistical test (see Chapter 3) or in most cases by early
stopping (see Chapter 4).
To close this brief introduction to optimization in machine learning, we should
note that we usually choose to use convex objectives instead of non-convex ones,
whenever we have the chance, as their global optimum is easier to find. Especially
since 2006, however, Deep Belief Networks have been growing in popularity and
largely due to that, Neural Networks have been experiencing a rebirth as well.
In both cases, the objective functions for training them are usually non-convex.
2.2 Deep Belief Networks
Deep Belief Networks (DBNs) [Hinton et al. , 2006a] are probabilistic generative
models that consist of more than one layer of stochastic, latent variables, which
typically assume binary values. We call such variables hidden units or feature
detectors and we call the layers they compose hidden layers. The inputs of the
first hidden layer are the elements of the data vector and the inputs of the second
one and on are the activations of its preceding layer. Each hidden layer combines
its inputs nonlinearly to transform them into new features that will be provided
as inputs to the level above. The top two layers have undirected, symmetric
connections between them and form an associative memory.
So in DBNs we perform feature extraction at multiple levels by combining the
features from the level below in a nonlinear fashion. An example with image
data would be the following: Imagine we have T grayscale images of 100 × 100
pixels and let us assume our initial features are the D = 10000 intensity values
of the pixels. The D pixels are combined in a nonlinear fashion to form H new
features. These features are no longer pixels but nonlinear combinations of pixels;
we can think of them as edges, shapes, forms of any kind within an image which
represent a different level of abstraction. Then the next layer combines these edges
and basic shapes to form new features, which represent an even higher level of
abstraction, like entire objects, for instance. As we see, DBNs can extract features
of multiple levels of abstraction. This property allows them to approximate any
distribution over binary vectors to arbitrary accuracy [Sutskever et al. , 2008],
[Bengio, 2009a]. In Figure 2.2 we can see features learned at different levels of a
Deep Belief Network in [Lee et al. , 2011] in image recognition tasks.
Figure 2.2: An illustration of the features detected at multiple levels from a DBN.
Top row shows features learned at the second hidden layer of the convolutional
DBN (CDBN) presented in [Lee et al. , 2011]. Bottom row shows features learned
at the third hidden layer of the network. The first four columns correspond to
training the CDBN on different categories of objects, while the last column
shows the features learned by training on a mixture of images from all categories.
Notice the different levels of abstraction at different layers of the network, the
intuitiveness of the features learned (e.g. eyes and mouths of faces, wheels and
doors of cars, etc.) as well as the coherence of the features learned even when
objects of different categories are presented to the network.
Image from [Lee et al. , 2011].
Most likely humans perform similar forms of feature extraction [Morrone et al. , 1988],
[Watson, 2000]. After all, when looking at an image we observe shapes and distinct regions within it, edges and repeated patterns, not just colors, although
the initial information our eyes receive are packets of photons. Similarly we understand language as ideas, concepts and interactions among them, although its
building blocks are simple phonemes. In fact, we even use different levels of
abstraction. We don't only recognize regions of different color or texture in an
image, we recognize parts of objects (like the eyes and nose of a person), entire
objects (like faces).
Since the goal of Artificial Intelligence (AI), is to develop techniques that allow
machines to perform such tasks at a level comparable to that of humans, it is
no surprise that DBNs which can allow for such multilevel abstraction are becoming more and more prevalent in subfields of AI such as Machine Vision (e.g.
generating and recognizing images [Bengio et al. , 2007], [Ranzato et al. , 2007],
[Hinton et al. , 2006b], [Lee et al. , 2011], video sequences [Sutskever et al. , 2007],
and motion-capture data [Taylor et al. , 2007]), Speech [Lee et al. , 2009] and Natural Language Processing [Collobert et al. , 2009].
The task of training DBNs is not an easy one. First of all, DBNs by definition
have many layers and a common practice is to train each layer greedily and move
on to the next one. A second difficulty is that in order to search the parameter
space of DBNs, the objective functions we need to optimize tend to be non-convex
for two main reasons: (i) the outputs of each layer are non-linear transformations
of its inputs and (ii) due to symmetries in the networks, we can permute elements
of the weight matrices without changing the network's output. Of course there
are some models that use convex objective functions like the one presented in
[Deng et al. , 2011], but they are the exception, rather than the rule.
And now, an interesting side note. Since we have already made an analogy
of the way humans learn to the feature extraction at various abstraction levels
performed by DBNs, let us make another one that illustrates the importance
of non-convex optimization in learning. When we humans learn, the way we
are presented the examples matters. This might imply that learning in humans
also involves solving non-convex optimization problems, where different order of
presenting the examples will result in ending up in different local optima. In
[Bengio, 2009b] the authors stress this analogy, and present a formalization of
such strategies for machine learning called curriculum learning. One possible
explanation is that curriculum learning is a special case of continuation methods.
Continuation methods are strategies used in non-convex optimization
for finding global optima. We start with a smooth relaxation of the original
problem and gradually consider less smooth versions of it. Ideally, the global
optimum of the relaxation will also be the global optimum of the original problem.
The authors suggest that curriculum learning techniques can speed up convergence, allow us to reach better local optima, and even act as regularizers. In this work we will not make use of any form of curriculum learning or continuation
methods. However perhaps in a future work, such ideas could be explored on
this problem, or similar ones involving non-convex objectives. We just raised
this point to show that non-convex objectives have a great potential in Machine
Learning. Analyzing the ways to optimize them and understanding the merits
and drawbacks of each, which is the aim of this project, is an important step
towards a broader use of non-convex objectives in learning tasks.
2.3 Restricted Boltzmann Machines
Let us now return to DBNs and see how they are constructed. A typical building
block of DBNs is the Restricted Boltzmann Machine (RBM). The RBM is a type
of stochastic recurrent neural network introduced by [Smolensky, 1986].
In an RBM we have a number of observation or visible units $x_d$, $d = 1, 2, ..., D$, each of which is connected to a number of hidden units $h_k$, $k = 1, 2, ..., H$, by a
set of weights Wdk . Units of the same layer (visible, hidden) have no connections
between them, but each visible unit is connected to every hidden unit (and vice versa). So we have a bipartite graph whose connections are symmetric (undirected). The visible units also have a set of bias parameters $b \in \mathbb{R}^D$ and the hidden units a set of bias parameters $c \in \mathbb{R}^H$ associated with them.
We can construct a DBN by stacking RBMs on top of one another
[Hinton et al. , 2006a], [Sutskever et al. , 2008]. After training one RBM, the activations of its hidden units can be treated as the visible units of a higher-level
RBM and so on, allowing us to greedily train each layer. We alternatively refer to
the units as nodes, neurons or variables, following the terminology of Probabilistic Graphical Models in general and specifically Neural Networks. In Figure
2.3 we can see the graphical model of an RBM and a 3-layer DBN constructed by
RBMs.
Figure 2.3: Graphical model of an RBM (left), depicted as a bipartite graph among
the input x and hidden units h. Graphical model of a 3-layer DBN using RBMs
as its building blocks (right). To train the DBN we greedily train each hidden
layer (units $h_i$) and move on to the next one (units $h_{i+1}$).
Images from Yoshua Bengio's research webpage:
http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html
As the number of units grows, evaluating the probability an RBM assigns to an observation,

$$p(x) = \frac{1}{Z} \sum_h e^{-E(x,h)},$$        (2.4)

where

$$E(x, h) = -b^\top x - c^\top h - h^\top W x,$$        (2.5)

becomes intractable, because the partition function Z involves a sum over exponentially many configurations. We will not discuss RBMs in more detail here. A useful source of information about RBMs and their training is [Hinton, 2010]. Many of the tricks and heuristics regarding the training of RBMs from the aforementioned paper carry over to the particular model we will use, and we will make use of many of its general guidelines in Chapters 3 and 4.
The model we will be working with in this project is the Neural Autoregressive Distribution Estimator (NADE), presented in [Larochelle et al., 2011]. NADE transforms the estimation of the partition function into a tractable problem, by converting the RBM into a Bayesian Network. The discussion is restricted to binary
variables only, however the model can be extended to non-binary cases as well.
More precisely, we are going to examine various non-convex optimization techniques in the context of training NADE to perform density estimation. NADE
can be viewed as an autoencoder (it tries to reproduce each data vector from
the feature activations that they cause), whose output can be used to assign
valid probabilities to the observations. Therefore NADE can be used as a density
estimator.
2.4
Density Estimation
Density estimation, the estimation of the distribution p(x) given a number of datapoints x, is of great importance as a machine learning task. Some of its most
notable applications are mentioned in [Zoran et al. , 2011] and include estimating missing features (for example inpainting missing pixels in an image), image
processing applications like denoising and deblurring, and performing unsupervised feature extraction as a first step to solve other tasks (e.g. in Natural Language Processing, Speech Recognition and Computer Vision that we discussed, in
collaborative filtering [Salakhutdinov et al. , 2007a], time series classification and
anomaly detection [Wulsin et al. , 2011], etc.).
RBMs, NADE and DBNs are all generative models; we can therefore use them to
learn the distribution p(x). Note that we use the terms density and distribution somewhat loosely. The probability density function applies to continuous
random variables, and we will be dealing with discrete ones, so the correct term
would be probability mass function, i.e. the function that gives us the probability that the discrete random variable is exactly equal to some value.
2.5
Models Examined
Logistic Regression
In Chapter 5 we will test some of our optimization methods in a toy application
of logistic regression before applying them to the harder task of training NADE.
Furthermore, logistic regression is the building block of FVSBN, a model whose conditionals are themselves logistic regressions (see Eq. (2.13) below). In one dimension, the logistic sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-(\alpha + \beta x)}},$$        (2.6)

where $\theta = [\alpha, \beta] \in \mathbb{R}^2$. In Figure 2.4 we can see the logistic sigmoid function for $\alpha = 0$ and various values of the parameter $\beta \in \mathbb{R}$, which as we see controls the steepness of the sigmoid.
To simplify the notation we define the matrix

$$X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(T)}_1 \\ \vdots & \vdots & \ddots & \vdots \\ x^{(1)}_D & x^{(2)}_D & \cdots & x^{(T)}_D \end{bmatrix}$$

(each column is a data vector with a 1 prepended, to account for the bias term $\theta_0$) and the parameter vector

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_D \end{bmatrix}.$$

We can now define the probability that $x^{(t)}$ belongs to class 1 as

$$P(y = 1 \mid x^{(t)}; \theta) = \sigma(x^{(t)}) = \frac{1}{1 + e^{-\theta^\top x^{(t)}}}.$$        (2.7)
Figure 2.4: The logistic sigmoid function for various values of parameter $\beta$. As $|\beta| \to \infty$, the sigmoid becomes a step function ($\sigma(x) = 0$ for $x < 0$ and $\sigma(x) = 1$ for $x > 0$). As $|\beta| \to 0$, it becomes $\sigma(x) = 0.5$ for all x. For $\beta < 0$ we would have a mirrored image, i.e. it would assign values $\sigma(x) > 0.5$ for $x < 0$ and $\sigma(x) < 0.5$ for $x > 0$. Finally, for $\alpha \neq 0$ we would have a displacement of the sigmoid on the x-axis.
We can measure how well the model fits the data using the likelihood of the
logistic regression with parameters $\theta$ under the data. The likelihood is defined as the probability of observing the given data under the model with parameters $\theta$. Assuming the data are independent and identically distributed (iid), the likelihood in our case is

$$J'(\theta) = \prod_{t=1}^{T} P(y = 1 \mid x^{(t)}; \theta)^{y^{(t)}} \left(1 - P(y = 1 \mid x^{(t)}; \theta)\right)^{(1 - y^{(t)})}.$$        (2.8)
This will be our objective function and our goal is to maximize it. (Its logarithm is concave, so minimizing its negative is a convex problem.) We usually prefer to work with sums rather than products. Very small or very large values which can arise in such products can cause numerical instabilities in an algorithm (underflowing or overflowing variables). Also, adding numbers causes fewer round-off errors to occur and propagate in the calculations. So we take the logarithm of $J'(\theta)$ and get

$$J(\theta) = \log(J'(\theta)) = \sum_{t=1}^{T} \left[ y^{(t)} \log \sigma(x^{(t)}) + (1 - y^{(t)}) \log\left(1 - \sigma(x^{(t)})\right) \right].$$        (2.9)
We will use the Average Loglikelihood (ALL) of the logistic regression model with parameters $\theta$ given the data,

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \left[ y^{(t)} \log \sigma(x^{(t)}) + (1 - y^{(t)}) \log\left(1 - \sigma(x^{(t)})\right) \right].$$        (2.10)

We will use the ALL as the objective function to be maximized in the next models we will present as well. $J(\theta)$ is a concave function of the parameters $\theta_d$, $d = 0, 1, ..., D$. We will often use its negative, the Average Negative Loglikelihood (ANLL), which is convex, as an objective to minimize, posing our optimization problem in the standard form we saw in Eq. (2.1).
The partial derivative of $J(\theta)$ with respect to parameter $\theta_d$, $d = 0, 1, ..., D$, $\theta_0$ being the intercept (bias) term, is

$$\frac{\partial J(\theta)}{\partial \theta_d} = \frac{1}{T} \sum_{t=1}^{T} \left( \sigma(x^{(t)}) - y^{(t)} \right) x^{(t)}_d.$$        (2.11)
Normally, some sort of regularization is applied to the objective of logistic regression. Notice in Figure 2.4 that the absolute values of the parameters $\theta_d$, $d = 1, 2, ..., D$, control the steepness of the sigmoid in the corresponding dimension (the intercept (bias) term $\theta_0$ just displaces it on the x-axis). The steeper the sigmoid, the farther the probabilities $P(y = 1 \mid x^{(t)}; \theta)$ and $P(y = 0 \mid x^{(t)}; \theta)$ get from 0.5, so the higher their difference becomes. In a sense, the more confident the classifier is about classifying to the one class or the other after seeing the feature that corresponds to this dimension. If one $\theta_d$ has an extremely high (compared to the other parameters) absolute value, the sigmoid essentially becomes a step function in this dimension, thus assigning the examples solely based on their value of $x_d$. With regularization, we prevent any $\theta_d$ from growing to an extreme value, thus dominating $\theta^\top x^{(t)}$ and causing the classifier to ignore the rest of the features.
In Algorithm 1 we present the pseudocode for LOGREG(), a subroutine that computes the value of the ALL and its partial derivatives with respect to the parameters $\theta_d$, $d = 0, 1, 2, ..., D$, at a given point $\theta = \theta^{(0)}$ in the parameter space, for the given dataset Xtrain with labels y, under the logistic regression model we presented above. These partial derivatives form the gradient $\nabla_\theta J(\theta)$ of $J(\theta)$. We will introduce the gradient and its use more formally in the next chapter.

Data: Xtrain, y, $\theta^{(0)}$
Result: $J(\theta^{(0)})$, $\nabla_\theta J(\theta^{(0)})$
$J(\theta^{(0)}) = 0$; $\nabla_\theta J(\theta^{(0)})$ = ZEROS(SIZE($\theta^{(0)}$));
T' = SIZE(Xtrain, ROWS);
for t = 1 : T' do
    $J(\theta^{(0)}) = J(\theta^{(0)}) + [\, y^{(t)} \log \sigma(x^{(t)}) + (1 - y^{(t)}) \log(1 - \sigma(x^{(t)})) \,]$;
    for d = 0 : D do
        $\frac{\partial J(\theta^{(0)})}{\partial \theta_d} = \frac{\partial J(\theta^{(0)})}{\partial \theta_d} + \left( \sigma(x^{(t)}) - y^{(t)} \right) x^{(t)}_d$;
    end
end
// Return averages instead of sums:
$J(\theta^{(0)}) = \frac{1}{T'} J(\theta^{(0)})$;
$\nabla_\theta J(\theta^{(0)}) = \frac{1}{T'} \nabla_\theta J(\theta^{(0)})$
Algorithm 1: The pseudocode for a simple, non-vectorized version of the algorithm that computes the value of the ALL of a logistic regression model and its partial derivatives wrt the parameters $\theta_d$, $d = 0, 1, 2, ..., D$: LOGREG().
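For readers who prefer runnable code, the following NumPy sketch (ours, not part of the thesis; the names logreg_all_and_grad, Xtrain and y are our own) computes the same quantities as Algorithm 1 in vectorized form, with a leading column of ones standing in for the bias feature:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_all_and_grad(theta, Xtrain, y):
    """Average log-likelihood and the partial derivatives of Eq. (2.11), as in Algorithm 1.

    Xtrain : (T, D+1) array whose first column is all ones (bias feature).
    y      : (T,) array of 0/1 labels.
    theta  : (D+1,) parameter vector, theta[0] being the bias term.
    """
    p = sigmoid(Xtrain @ theta)                      # P(y=1 | x^(t); theta) for every t
    all_ = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = Xtrain.T @ (p - y) / len(y)               # (sigma - y) x_d, averaged over t
    return all_, grad

# Tiny usage example with random binary data.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.integers(0, 2, size=(5, 3))])
y = rng.integers(0, 2, size=5)
value, gradient = logreg_all_and_grad(np.zeros(4), X, y)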
FVSBN factorizes the joint distribution of the observations as

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}),$$        (2.12)

where $x_{<d}$ is the subvector containing all variables $x_i : i < d$, assuming that $x_i : i < d$ are the parents of $x_d$ in the Bayesian Network. Note that the ordering of the features $x_d$, $d = 1, 2, ..., D$, need not necessarily be the initial one as our notation might indicate. To avoid overloading our notation, let us just assume that we have chosen an appropriate or a random ordering of the elements of x before constructing the DAG.
The individual conditional probabilities $p(x_d \mid x_{<d})$, $d = 1, 2, ..., D$, are tractable (therefore so is $p(x)$) and they are given by logistic regression,

$$p(x_d \mid x_{<d}) = \sigma\Big(b_d + \sum_{j<d} W_{dj}\, x_j\Big).$$        (2.13)
We can use the Average Loglikelihood (ALL) of the model as the objective function to be maximized,

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log p(x^{(t)}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{d=1}^{D} \log p\big(x^{(t)}_d \mid x^{(t)}_{<d}\big).$$        (2.14)

In NADE, the hidden layer activations used for the d-th conditional are

$$h_d = \sigma\big(c + W_{\cdot,<d}\, x_{<d}\big),$$        (2.15)

and the mapping from hidden layer to output layer is done by taking

$$p(x_d = 1 \mid x_{<d}) = \sigma\big(b_d + V_{d,\cdot}\, h_d\big).$$        (2.16)
The whole procedure corresponds to a Neural Network for each $p(x_d = 1 \mid x_{<d})$, where the weighted connections going in and out of the hidden layer of each mini neural network are tied. In addition to that, connections are also tied across the individual neural networks, so that the difference between two consecutive hidden layer pre-activations is

$$\big(c + W_{\cdot,<d+1}\, x_{<d+1}\big) - \big(c + W_{\cdot,<d}\, x_{<d}\big) = W_{\cdot,d}\, x_d,$$        (2.17)

which can be computed in O(H) time, and we have D conditionals in the factorization of the joint probability p(x). So, computing p(x) under NADE takes O(HD) time overall. Contrast that with the $O(HD^2)$ time complexity of an approach that would not take into account the weight sharing across the conditionals.
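To make the O(HD) bookkeeping concrete, here is a small NumPy sketch (our own illustration under the notation above, not the thesis's NADE() routine) of how log p(x) can be accumulated for a single binary vector by updating the pre-activation incrementally, in the spirit of Eq. (2.17):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_logprob(x, b, c, W, V):
    """log p(x) for one binary vector x under NADE, in O(HD) time.

    x : (D,) binary vector, b : (D,), c : (H,), W : (H, D), V : (D, H).
    """
    D = len(x)
    a = c.copy()                              # pre-activation c + W[:, :d] @ x[:d], starts at d = 0
    logp = 0.0
    for d in range(D):
        h_d = sigmoid(a)                      # hidden activations for the d-th conditional, Eq. (2.15)
        p_d = sigmoid(b[d] + V[d] @ h_d)      # p(x_d = 1 | x_<d), Eq. (2.16)
        logp += np.log(p_d if x[d] == 1 else 1.0 - p_d)
        a += W[:, d] * x[d]                   # incremental update of Eq. (2.17): O(H) per conditional
    return logp

# Example call with random parameters (H = 4 hidden units, D = 6 inputs).
rng = np.random.default_rng(0)
H, D = 4, 6
print(nade_logprob(rng.integers(0, 2, D), rng.normal(size=D),
                   rng.normal(size=H), rng.normal(size=(H, D)), rng.normal(size=(D, H))))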
Again, we can use the ALL of this model as the objective function. It is given by Eq. (2.14). The parameter vector is now $\theta = [b, W(:), c, V(:)] \in \mathbb{R}^N$. The total number of parameters is $N = D + H + HD + DH = D + H + 2HD$. Each row of W will correspond to a new feature created by nonlinearly combining the initial features $x_d$, $d = 1, 2, ..., D$.
In Figure 2.5 we can see illustrations of NADE and FVSBN. In Algorithm 2 we
present the pseudocode for NADE(), a subroutine that computes the value of the
ALL and its partial derivatives wrt the parameters b, W , c and V at a given
point $(b, W, c, V) = (b^{(0)}, W^{(0)}, c^{(0)}, V^{(0)})$ in the parameter space, for the given training dataset Xtrain.
Note that we can have H > D or H < D. In the first case we map the initial feature vectors into a higher dimensional space, before mapping it back
to a D-dimensional space. In the second case we map the initial feature vectors into a lower dimensional space, before mapping it back to a D-dimensional
space, thus performing a non-linear dimensionality reduction step. The higher
the redundancy in the dataset, the less lossy this transformation will be. The
same idea extends to RBMs and DBNs in general, of course. For example,
[Hinton et al., 2006a] and [Salakhutdinov et al., 2007b] show that using such a dimensionality reduction, DBNs can learn short binary codes that allow very fast
retrieval of documents or images.
Chapter 3
Optimization Methods Examined
We experimented with various optimization methods and variants thereof during
the course of this project. Some of these methods, for instance the simple versions
of Steepest Gradient Descent and Coordinate Descent are not suited for large scale
non-convex settings. However we included them in some experiments to showcase
their inadequacy or to use them as a baseline to improve upon and as a means
to measure the difficulty of training on a particular dataset.
In this section we will present the methods used, the theoretical motivation behind
them and a brief discussion about the strengths, weaknesses and some potential
variations of each one of them. A brief description of each method is followed
by the corresponding pseudocode in its most basic version. Details such as step
size decay, termination criteria and tricks to reduce the number of operations
(such as indexing and vectorizing) were omitted wherever possible in order to keep
the pseudocode for each method in its most general and intelligible form. Step
size decay and termination criteria (including early stopping) are shared across
groups of methods and will be discussed in the following chapter in more detail.
3.1

The methods we examined, grouped by how much of the training set they use per update and by the order of derivative information they use:

Batch
    Zeroth order: CD basicGD, CD basicNewton
    First order: BGD basic, BGD heavyball, BGD bolddriver, BGD rprop, DIBGD basic
    Second order: Newton; Quasi-Newton (L-BFGS)
Minibatch
    First order: MBGD basic, MBGD heavyball, MBGD bolddriver
Stochastic
    First order: SGD basic, SGD heavyball, ASGD basic, IAGD basic

3.2
Zeroth order methods do not directly calculate the gradient of the objective function (see next paragraph for more on the gradient). For this reason they are
also referred to as non-gradient or derivative free optimization techniques. They
perform a line search in a single dimension of the parameter space (can do so
iteratively as well), or approximate the gradient using other techniques. Since
they need only use simple evaluations of the objective function, each iteration
is less expensive computationally compared to first order (or higher) methods.
However they are less efficient than the first order methods, since they require
more steps to converge. Most of them are easy to program and robust and an
important benefit is that they can also be applied to objective functions that are
discontinuous or non-differentiable, contrary to higher order methods. For the
definitions below we shall assume that the objective function is continuous and
k times differentiable, where k is the order of the method we are referring to.
First order methods use only gradient and objective function value information; in other words, they only consider up to the first order partial derivatives of the objective function with respect to the parameters in order to decide how the corresponding parameters will be updated. The gradient $\nabla_\theta J$ of the objective function $J(\theta)$ with respect to the parameters $\theta = [\theta_i]$, $i = 1, 2, ..., N$, is the vector of all first order partial derivatives of $J(\theta)$ with respect to $\theta$,

$$\nabla_\theta J = \nabla_\theta J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_i} \right], \quad i = 1, 2, ..., N.$$        (3.1)
In second order methods, the parameter updates also use curvature (second order derivative) information. The Hessian $H(J(\theta))$ (or simply $H(J)$, or even H when it is obvious from the context) of the objective function $J(\theta)$ with respect to the parameters $\theta = [\theta_i]$, $i = 1, 2, ..., N$, is the matrix whose (i, j)-th element is the second partial derivative of $J(\theta)$ with respect to parameters $\theta_i$ and $\theta_j$,

$$H = H(J(\theta)) = \left[ \frac{\partial^2 J(\theta)}{\partial \theta_i \, \partial \theta_j} \right], \quad i, j = 1, 2, ..., N.$$        (3.2)
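As a small worked example of these definitions (ours, for illustration): for a quadratic objective with a symmetric matrix $A \in \mathbb{R}^{N \times N}$ and a vector $b \in \mathbb{R}^{N}$,

$$J(\theta) = \tfrac{1}{2}\, \theta^\top A\, \theta - b^\top \theta
\quad\Longrightarrow\quad
\nabla_\theta J(\theta) = A\theta - b,
\qquad
H(J(\theta)) = A.$$

This is exactly the case in which, as discussed for Newton's method below, curvature information identifies the optimum in a single step.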
3.3
Batch methods use the entire training set in order to calculate the appropriate
update for each of the parameters. As a result, each step is slower compared to
that of stochastic and minibatch methods. On the other hand, since it takes into
account all the training instances, each step is indeed towards the direction that
minimizes the overall error on the training set. The stochastic methods make a
parameter update based on a single instance t, therefore it will not necessarily
lead to a parameter vector which decreases the average or total value of the
objective function (i.e. over all training instances), but only its value for the particular instance.
Comparisons of the relative advantages of batch and stochastic methods can be
found in [LeCun et al. , 1998], [Bishop, 1995], [Bengio, 2012] and [Bottou et al. , 2007].
Here we will review the main ideas presented in the aforementioned works. In
short, the batch methods have the following advantages:
(i) Their conditions of convergence are well understood.
(ii) Many acceleration techniques like conjugate gradient or 2nd order methods
only operate in batch learning (This is not entirely true, there exist second
order stochastic methods, however they are not used widely and we will
explain later on why).
(iii) Analyzing theoretically their parameter dynamics and convergence rates is
simpler.
(iv) They can be parallelized (for example by using map-reduce techniques).
The above advantages are lost in stochastic gradient descent methods, due to the
noise their stochasticity adds and to the fact that execution of each step requires
the previous step to have been executed. However, the stochastic methods also
have their advantages over the batch methods:
(i) They are much faster than batch methods in large-scale learning and in
scenarios with redundant data.
(ii) They usually result in better solutions in non-convex objectives.
(iii) They can be used for keeping track of changes to the function modeled.
(iv) They can be used in the context of online learning (Somewhat related to
(iii)).
(v) Although the convergence rate of stochastic methods is lower than that of
batch methods, there is evidence [Bottou et al. , 2007], [Bengio, 2012] that
in the context of learning tasks this convergence benefit loses its importance.
For a simple explanation for (i) let us imagine that our dataset consists of 10
consecutive identical copies of a smaller dataset of, say, 100 instances. Averaging the
gradient over all 1000 instances would yield the same result as computing the
gradient based on just the first 100. So, a batch method would need 10 times
more time to make a single update to the parameters compared to a stochastic
method. This example is of course extreme, as identical examples rarely
appear in real-world datasets. However similar ones often do. This redundancy
is what can make stochastic methods faster than their batch counterparts.
As for (ii), due to the stochasticity of the updates a stochastic method can escape
local optima. Advantage (iii) comes into play when the function we model changes
over time. Then a stochastic method can capture these changes, as it examines
every instance in sequence. Somewhat related to (iii), advantage (iv) comes into
play when the instances are presented one at a time (or few at a time) to the
training algorithm and we generally need to update our parameters as soon as the
new example(s) arrive(s). Here, using batch methods is no longer a viable option
as it would be a terrible waste computationally to use our entire old dataset of T
instances plus our 1 new example and perform operations on the new (T +1)-sized
batch. It is also a waste of memory to keep all previous instances stored. Using
a stochastic method we can avoid all this computational and memory cost. So
this is an extra advantage of the stochastic methods in large-scale optimization.
Finally, regarding (v), the convergence rate of e.g. Stochastic Gradient Descent
might be lower than that of Batch Gradient Descent. However, in learning tasks,
it has been shown that the convergence rate of the test error ( i.e. the value
of the objective function on unseen data) is O(1/t), i.e. the same as the rate of
convergence of SGD [Bottou et al. , 2007], [Bengio, 2012]. So there is no benefit
in having a faster convergence rate on the training set as this will not mean having
a faster convergence rate on the test set. And remember, in learning we don't
really care about minimizing the value of the objective function on the training
set, but about achieving generalization.
Minibatch methods try to strike a balance between the two approaches. If very
small batch sizes are used they lean towards the stochastic side, which allows
them to be used in online learning and grants them increased ability to escape
local optima in non-convex objectives (contrary to batch methods). In addition
to that, vectorization can allow each iteration to be almost as fast as a stochastic
gradient descent step and compared to stochastic methods, minibatch methods
generally need fewer update steps as their steps are less noisy (i.e. each step is
based on more training samples, so the direction of the step is towards the local
minimum with greater probability). If large batch sizes are used they become
less stochastic, but can still have benefits over batch methods, e.g. faster
updates.
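In practice, the difference between the three regimes is simply how many training rows feed each gradient evaluation. The sketch below (our own illustration; compute_objective_and_gradient is a stand-in for any routine like Algorithm 3 later in this chapter) shows one epoch of minibatch updates; setting batch_size = 1 recovers stochastic updates and batch_size = T recovers a single batch update.

import numpy as np

def run_epoch(theta, Xtrain, alpha, batch_size, compute_objective_and_gradient):
    """One pass over the data with minibatch gradient descent updates."""
    T = Xtrain.shape[0]
    order = np.random.permutation(T)                  # visit instances in random order
    for start in range(0, T, batch_size):
        batch = Xtrain[order[start:start + batch_size]]
        _, grad = compute_objective_and_gradient(batch, theta)
        theta = theta - alpha * grad                  # gradient step on this minibatch only
    return theta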
So, to summarize, when optimizing non-convex objectives we favour stochastic, minibatch or second order batch methods. In large datasets, high redundancy within the data additionally favours stochastic and small-minibatch methods.
3.4
So we should take into account many factors when deciding which optimization
method to use and how to tailor it to the problem, including the properties of
the objective function (e.g. convexity, smoothness, steepness), the dimensionality
of the parameter space, the size of the dataset and whether we need to make
updates as fast as possible or we can afford to wait. Also, any additional domain
knowledge we have, for example, whether a high amount of redundancy in the
dataset is to be expected or not, whether some features are correlated and the
parameters related to them can be treated as groups or not, all these can
determine which method is appropriate for a given problem.
In our case, the objective function we are trying to optimize is non-convex. So
we can deduce that stochastic, minibatch and second order batch methods are
the most viable options for achieving good quality optima. Furthermore, we have
relatively large datasets, so redundancies are likely to exist within them. Redundancy gives a speed advantage to stochastic methods and minibatch methods
where the size of the minibatch is small. We also have large parameter spaces
to explore, a fact that translates to high computational cost of second order
methods.
So intuitively, in the task of training NADE on the specific datasets, we expect
variants of stochastic and minibatch methods to yield better, faster results than
the other methods and while second order methods might match (or possibly
exceed) their quality of optima, they will require larger execution times.
3.5
Gradient Computation
In Algorithm 3 we can see the common function all methods we examine share:
the one that computes the value of the objective function and its gradient with
respect to the parameters at (0) on the given training set Xtrain . We imaginatively named it COMPUTE OBJECTIVE AND GRADIENT(). To clarify any
confusion, $\theta^{(0)} \in \mathbb{R}^N$ is just an instantiation of the variable $\theta$. From this point
on, however we shall use the two symbols interchangeably since we can easily
distinguish if we refer to a parameter or its actual value by the context. Also,
in the algorithms that follow we will use J instead of $J(\theta^{(0)})$ and $\nabla_\theta J$ instead
of $\nabla_\theta J(\theta^{(0)})$. Again, what we mean will be clear from the context. Of course
for every model used the objective function and the parameters involved differ,
so COMPUTE OBJECTIVE AND GRADIENT() will take the form of NADE()
or LOGREG() we presented in the previous section; here we just generalize the
concept to include any of these, as well as any other model.
As for the given training data matrix Xtrain, its dimensions change according to the actual method used. In the case of batch methods, $X_{train} \in \mathbb{R}^{T \times D}$, as we use all training instances. So N is the number of parameters of our model, T the number of instances (training datapoints) and D the number of features (the dimensionality of each datapoint). For Stochastic Gradient Descent, $X_{train} \in \mathbb{R}^{1 \times D}$, since now we only use one instance to make an update. Finally, for the Dynamically Increasing Batch Size approaches and the Mini-Batch Methods, $X_{train} \in \mathbb{R}^{T_{batch} \times D}$, where $T_{batch}$ is the current minibatch size.
In all cases except for Coordinate Descent, the function outputs $J(\theta^{(0)}) \in \mathbb{R}$ and $\nabla_\theta J(\theta^{(0)}) \in \mathbb{R}^N$. In Coordinate Descent, we still get a scalar $J(\theta^{(0)})$, but instead of $\nabla_\theta J(\theta^{(0)}) \in \mathbb{R}^N$ we get $\frac{\partial J(\theta^{(0)})}{\partial \theta_i} \in \mathbb{R}$, since we are working with a single parameter at each iteration. We use $J^{(t)}(\theta^{(0)}) \in \mathbb{R}$ to denote the cost on the t-th instance of the training set and $\nabla_\theta J^{(t)}(\theta^{(0)}) \in \mathbb{R}^N$ to denote the gradient computed using only the t-th instance of the training set. All this will become clearer in the description of each method.
All methods we implemented call such a function on every iteration. The method
we call from Mark Schmidt's minFunc package¹ (L-BFGS) also calls such
a function on every iteration. Second order methods in addition to the gradient
also need the Hessian supplied, so the algorithm below needs to also include
the computation of the Hessian in an analogous fashion ( i.e. averaging over all
instances). Higher order methods would need tensors of higher order partial
derivatives.
Data: Xtrain, $\theta^{(0)}$
Result: $J(\theta^{(0)})$, $\nabla_\theta J(\theta^{(0)})$
T' = SIZE(Xtrain, ROWS);
// Below, OBJFUNCEVAL(Xtrain(t, :), $\theta^{(0)}$) and COMPUTEGRAD(Xtrain(t, :), $\theta^{(0)}$) simply compute
// $J^{(t)}(\theta^{(0)})$ and $\nabla_\theta J^{(t)}(\theta^{(0)})$ respectively, and their implementations differ from model to model:
$J(\theta^{(0)}) = \frac{1}{T'} \sum_{t=1}^{T'}$ OBJFUNCEVAL(Xtrain(t, :), $\theta^{(0)}$);
$\nabla_\theta J(\theta^{(0)}) = \frac{1}{T'} \sum_{t=1}^{T'}$ COMPUTEGRAD(Xtrain(t, :), $\theta^{(0)}$);
Algorithm 3: The general pseudocode for the algorithm that computes the value of the objective function and its gradient with respect to the parameters: COMPUTE OBJECTIVE AND GRADIENT().
We should note here that instead of taking the average of the $J^{(t)}(\theta^{(0)})$ we could instead simply take their sum, since $\frac{1}{T'}$ is just a scaling factor. We would correct the scaling of the actual parameter update with an appropriate selection of the step size. Similarly, it bears no significance if we optimize with regard to the total cost $\sum_{t=1}^{T'} J^{(t)}(\theta^{(0)})$ or the average cost $\frac{1}{T'} \sum_{t=1}^{T'} J^{(t)}(\theta^{(0)})$ per instance.

¹ Available at http://www.di.ens.fr/~mschmidt/Software/minFunc.html.
Also, although this implementation would work for all kinds of methods, we
point out for one last time that this is a general purpose pseudocode, not the
actual implementation we used. In fact, we implemented two different versions
for this function: one for Stochastic Gradient Descent and its variants
and one for all the other methods. The benefit to this is that Xtrain degenerates
to a D-dimensional row vector in the SGD case, so we can ignore one dimension
and avoid additional operations (e.g. avoid taking averages for SGD) gaining
some computational benefits (albeit constant ones). We will not dwell any more
on this, we just mention it to showcase that a general model can be good for
abstraction and theoretical generalization, but in practice, all theoretical tools
have to be tailored to their practical implementation.
3.6
Batch Methods
BGD basic: The simple version of the method we discussed above. The pseudocode can be found in Algorithm 4.
Data: Xtrain, $\theta_{initial}$, $\alpha$
Result: $\theta$
$\theta = \theta_{initial}$;
while Termination Conditions Are Not Met do
    $[J, \nabla_\theta J]$ = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain, $\theta$);
    $\theta = \theta - \alpha \nabla_\theta J$;
end
Algorithm 4: The core pseudocode for the simple version of Steepest Gradient Descent: BGD basic.
BGD heavyball: Batch Gradient Descent using the Heavy Ball heuristic introduced in [Polyak, 1964]. The idea is borrowed from the movement of a ball
subject to the laws of classical mechanics. The simple Gradient Descent version
can be very slow when the surface of the objective function is highly non-spherical
(its curvature varies significantly with each direction). At most points on the error surface, the local gradient will not point directly towards the minimum. The
result is a zig-zagging behaviour, so Gradient Descent will take many steps to
reach the minimum. We can moderate this effect by adding a contribution from the previous step $\Delta\theta_{prev}$, weighted by a momentum term between 0 and 1, to the parameter update $\Delta\theta$, which will smooth the oscillations (the zig-zagging mentioned above), and therefore allow us to reach the minimum in fewer steps. However it adds an extra hyperparameter, the momentum term, which we have to fine-tune in addition to $\alpha$. In Figure 3.2 we can see the zig-zagging behaviour of Steepest Gradient Descent compared to the smoother convergence
of the Heavy Ball heuristic. The pseudocode for this version can be found in
Algorithm 5.
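As a minimal sketch of the heavy ball update (ours, not the thesis's Algorithm 5; the momentum coefficient is named mu only for illustration):

import numpy as np

def heavy_ball_step(theta, grad, prev_update, alpha, mu):
    """One heavy ball (momentum) update: keep a fraction of the previous step."""
    update = -alpha * grad + mu * prev_update   # mu = 0 recovers the plain gradient descent step
    return theta + update, update

# Usage inside a training loop (the objective/gradient routine is omitted here):
# update = np.zeros_like(theta)
# theta, update = heavy_ball_step(theta, grad, update, alpha=0.1, mu=0.9)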
Figure 3.3: An illustration of the Bold Driver Method. In this case the learning
rate is decreased considerably, hopefully allowing us to eventually reach the minimum. Steps shown in red indicate that the step size has been decreased, while
green ones indicate that it has been increased. The first step is taken using the
initial step size $\alpha$.
Data: Xtrain, $\theta_{initial}$, $\alpha$, increase factor $\rho > 1$, decrease factor $0 < \sigma < 1$
Result: $\theta$
$\theta = \theta_{initial}$; $J_{prev} = \infty$;
while Termination Conditions Are Not Met do
    $[J, \nabla_\theta J]$ = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain, $\theta$);
    $\theta = \theta - \alpha \nabla_\theta J$;
    if $J < J_{prev}$ then
        $\alpha = \rho\, \alpha$;
    else if $J > J_{prev}$ then
        $\alpha = \sigma\, \alpha$;
    else
        // Do Nothing...
    end
    $J_{prev} = J$;
end
Algorithm 6: The core pseudocode for the Bold Driver version of Steepest Gradient Descent: BGD bolddriver.
BGD rprop: This method is known as Resilient Backpropagation or R-prop
[Riedmiller et al. , 1993]. Here we use a different step size for every parameter.
Furthermore, we adapt each step size individually. The idea is that if the sign
of a partial derivative has changed from the last iteration, we have actually overshot the minimum in this dimension, as shown in Figure 3.4, so in this case we
should decrease the step size in order to converge to it in next iterations. On the
other hand if the sign of a partial derivative remains unchanged in consecutive
iterations, it means we have yet to overshoot the minimum in this dimension, so
increase the step size. In terms of implementation, we just keep track of (the sign
of) each partial derivative we calculated in the previous iteration, and if the sign
is the same in the current one, we multiply the step size of this parameter by a
factor $\eta^{+} > 1$. If the sign is different in the current one, we multiply the step size of this parameter by a factor $0 < \eta^{-} < 1$. Typical values for these constants are $\eta^{+} = 1.1$ and $\eta^{-} = 0.5$. These are additional hyperparameters of the method
and in the next section we can see what values we actually used and how we
selected them (the above given are again just a rule of thumb roughly showing
how big the decrease and the increase in the step sizes should be). The pseudocode for
this version can be found in Algorithm 7.
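A minimal sketch of this per-parameter adaptation follows (ours, not the thesis's Algorithm 7; a common R-prop formulation uses only the sign of each partial derivative in the update, and the thesis's BGD rprop variant may differ in that detail):

import numpy as np

def rprop_update(theta, grad, prev_grad, step_sizes, eta_plus=1.1, eta_minus=0.5):
    """One R-prop style update with an individual step size per parameter."""
    same_sign = np.sign(grad) * np.sign(prev_grad)
    step_sizes = np.where(same_sign > 0, step_sizes * eta_plus,    # no overshoot yet: grow the step
                 np.where(same_sign < 0, step_sizes * eta_minus,   # sign flipped, i.e. overshot: shrink
                          step_sizes))
    theta = theta - step_sizes * np.sign(grad)   # move each parameter by its own step size
    return theta, step_sizes, grad               # return grad so the caller can pass it as prev_grad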
Figure 3.4: An illustration of the basic idea behind the Resilient Backpropagation heuristic. The color of the arrows indicates the sign of the partial derivative
of the objective function wrt the parameter at the specific point. Green stands
for positive, red stands for negative. Once the sign changes we have overshot the
minimum.
Coordinate Descent
Coordinate Descent is technically a zeroth order method as it does not use the
gradient vector (although it can be implemented to use partial derivative at each
iteration, we do not compute the entire gradient or Hessian). Coordinate Descent
considers all but one parameter as constants in each iteration i and optimizes the objective w.r.t. the remaining one. So at each iteration $J(\theta)$ degenerates to $J(\theta_i)$. We optimize in one direction at each iteration by performing a line search.
This approach can be used in stochastic, mini-batch learning, as well as batch
learning. Here we only considered the batch variant. In Figure 3.5 we can observe
an illustration of the method.
A way to implement the line search is similar to Steepest Gradient Descent, but instead of the entire gradient $\nabla_\theta J$ at each iteration i we compute only the first partial derivative $\frac{\partial J(\theta)}{\partial \theta_i}$, $i = 1, 2, ..., N$. After executing N iterations cycling through all N parameters we have completed an epoch of the algorithm. We run as many
epochs as needed until the termination criteria are met. Line search by gradient
descent has the disadvantage that it needs us to set a step size and preferably a
different one for each parameter.
Another way to find the minimum in a single dimension is by use of Newton's Method (more in the next subsection). Newton's method is a second order method and in higher dimensions it is computationally expensive, but in just 1 dimension it degenerates into what we call an inexpensive Newton, as $H = \left[\frac{\partial^2 J(\theta_1)}{\partial \theta_1^2}\right]$, i.e. a scalar, therefore storing it in memory and inverting it is inexpensive.
We can also perform the line search by direct search methods. We iteratively
divide the search space into intervals and then reject the interval that does not
contain the minimum. Here we shall not examine such a version but we mention
it for completeness.
Due to the coordinate steps the algorithm takes it can become very slow when
many parameters are involved. Especially if many of them are not that important
in the optimization procedure, we end up wasting time optimizing along them.
One of its biggest problems is that it can get stuck at a non-stationary point if
the objective function is non-smooth, as shown in Figure 3.6. One solution to
this problem could be to use the R-prop technique mentioned earlier (applicable
in a batch learning context only) to adjust the individual step sizes.
Coordinate Descent can also take other forms, for instance, instead of cycling
through the parameters each epoch we could simply pick a parameter at random
to optimize at each iteration. An interesting variant is the Randomized (Block)
Coordinate Descent Method [Nesterov, 2010], which extends the concept of updating only one parameter at a time to updating blocks of parameters. Updating
the parameters in blocks can increase the speed of convergence by countering the
effects of the coordinate movement mentioned earlier. An important extension of this method would be finding suitable groups of parameters to update at
the same step, perhaps by examining similarities over time in the corresponding
partial derivatives. We might also consider heuristic approaches that allow us to
skip updating certain parameters, if for example we notice small change in them
during recent epochs. All these heuristics can help Coordinate Descent become
competitive to the other methods presented here.
Here we only examined the following basic versions of the method:
CD basicGD: The simple batch version of the method that cycles through the
parameters and updates them individually by performing gradient descent. The
pseudocode can be found in Algorithm 8.
CD basicNewton: The simple batch version of the method that cycles through
the parameters and updates them individually using an inexpensive Newton step
(a detailed description of Newton's Method will follow shortly). The pseudocode can be found in Algorithm 9.
Data: Xtrain, $\theta_{initial}$, $\alpha_i$, $i = 1, 2, ..., N$
Result: $\theta$
$\theta = \theta_{initial}$;
while Termination Conditions Are Not Met do
    for i = 1 : N do
        $[J, \frac{\partial J(\theta)}{\partial \theta_i}]$ = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain, $\theta$, i);
        $\theta_i = \theta_i - \alpha_i \frac{\partial J(\theta)}{\partial \theta_i}$;
    end
end
Algorithm 8: The core pseudocode for the simple batch version of Coordinate Descent where the line search was performed using gradient descent: CD basicGD.
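A compact sketch of one epoch of the gradient descent variant (our own illustration, with a hypothetical partial_derivative helper standing in for the per-coordinate call above):

def coordinate_descent_epoch(theta, Xtrain, step_sizes, partial_derivative):
    """One epoch of coordinate descent: one gradient step per coordinate.

    partial_derivative(Xtrain, theta, i) should return dJ/dtheta_i on the full batch.
    """
    for i in range(len(theta)):
        theta[i] -= step_sizes[i] * partial_derivative(Xtrain, theta, i)
    return theta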
(Algorithm 9, for CD basicNewton, replaces this gradient step with the inexpensive Newton step $\theta_i = \theta_i - \frac{\partial J(\theta)}{\partial \theta_i} \big/ \frac{\partial^2 J(\theta)}{\partial \theta_i^2}$.)
Newton's Method
Newton's method (see [Murray, 2010] for a detailed review) is a second order optimization method. Figure 3.7 shows how, by utilizing curvature information, Newton's method converges faster than gradient descent. We only implemented Newton's method for the logistic regression toy problem, just to showcase it. We also used Newton's method in CD basicNewton, but this was a trivial case. The reason we did not use Newton's method for training NADE is that in NADE's case the parameter space is too large, therefore computing the objective function's Hessian wrt the parameters would be computationally expensive.
At each iteration i, Newton's method approximates $J(\theta)$ by a quadratic function around $\theta^{(i-1)}$, and then takes a step towards the optimum of the quadratic approximation. If $J(\theta)$ happens to be a quadratic function, then the exact minimum is found in one step. For each step $i \geq 1$ we have

$$\theta^{(i)} = \theta^{(i-1)} - H\big(J(\theta^{(i-1)})\big)^{-1} \nabla_\theta J(\theta^{(i-1)}).$$        (3.3)
Usually we modify Newton's method to include a small step size $0 < \alpha < 1$,

$$\theta^{(i)} = \theta^{(i-1)} - \alpha\, H\big(J(\theta^{(i-1)})\big)^{-1} \nabla_\theta J(\theta^{(i-1)}),$$        (3.4)

mainly to ensure convergence in the common case where $J(\theta)$ is not quadratic.
To get an intuition why Newton's method works (and why it takes only one step for a quadratic $J(\theta)$), let us examine the case of a 1-dimensional parameter space.
Taking the second order Taylor expansion of $J(\theta)$ around the current point $\theta^{(n)}$,

$$J(\theta) \approx J(\theta^{(n)}) + (\theta - \theta^{(n)})\, \frac{\partial J(\theta^{(n)})}{\partial \theta} + \frac{(\theta - \theta^{(n)})^2}{2!}\, \frac{\partial^2 J(\theta^{(n)})}{\partial \theta^2}.$$        (3.5)

Since at the minimum $\theta^{(*)}$ we have

$$\frac{\partial J(\theta^{(*)})}{\partial \theta} = 0,$$        (3.6)

differentiating the approximation (3.5), setting it to zero and rearranging, we get

$$\theta^{(*)} \approx \theta^{(n)} - \left( \frac{\partial^2 J(\theta^{(n)})}{\partial \theta^2} \right)^{-1} \frac{\partial J(\theta^{(n)})}{\partial \theta}.$$        (3.7)
Note that a point where the gradient of a function is equal to zero is not necessarily a minimum, but a stationary point (a maximum, a minimum or, in higher dimensions, a saddle point). In order for Newton's method to converge to a minimum, the Hessian needs to be positive definite (all its eigenvalues positive). In case it is not, the method may converge to a saddle point or a maximum instead. If J(θ) is a quadratic function, then the approximation in Eq. (3.5) becomes an equality, and so do the subsequent equations, hence the single-step convergence to θ^(*).
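To make the one-dimensional picture concrete, here is a minimal Python sketch of the damped Newton iteration of Eq. (3.4); the quadratic toy objective is a made-up example, chosen so that a single undamped step lands exactly on the minimum.

    def newton_1d(dJ, d2J, theta0, eta=1.0, n_iters=20):
        """Damped Newton's method in 1-D: theta <- theta - eta * J'(theta) / J''(theta)."""
        theta = theta0
        for _ in range(n_iters):
            theta -= eta * dJ(theta) / d2J(theta)
        return theta

    # J(theta) = (theta - 3)^2 is quadratic, so with eta = 1 a single step suffices.
    dJ  = lambda th: 2.0 * (th - 3.0)
    d2J = lambda th: 2.0
    print(newton_1d(dJ, d2J, theta0=-10.0, n_iters=1))   # -> 3.0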
Limited-memory BFGS
The family of Quasi-Newton methods aims to mimic Newton's Method, hence the term "Quasi-Newton". These methods use approximations of the Hessian based on gradient and past update step information instead of the Hessian itself. By not explicitly computing the Hessian, these methods are faster per iteration than pure Newton's method. We shall examine here a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, the Limited-memory BFGS (L-BFGS or LM-BFGS) proposed in [Nocedal, 1980], which is a member of the Quasi-Newton family.
Standard BFGS, proposed independently in [Broyden, 1970], [Fletcher, 1970], [Goldfarb, 1970] and [Shanno, 1970], stores a dense N × N approximation of H, where N is the number of parameters. L-BFGS stores only a few vectors that represent the approximation of H, hence its name. Also, while standard BFGS computes the Hessian approximation at iteration k, H^(k), using all past updates i = 1, 2, ..., k − 1, L-BFGS only uses the past m updates i = k − m, ..., k − 1, where m is a small number (usually less than 10). So at iteration k, L-BFGS keeps a record of only the vectors [θ^(k−m), ..., θ^(k−1)]ᵀ and [∇J(θ^(k−m)), ..., ∇J(θ^(k−1))]ᵀ.
In Algorithm 10, we give the pseudocode for the two-loop recursion version of L-BFGS [Nocedal, 1980]. The algorithm gets as input θ^(k) and ∇J(θ^(k)) and returns the product of the implicit inverse-Hessian approximation with the gradient, from which the update direction is obtained.
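A minimal Python sketch of the standard two-loop recursion is given below; it assumes we have stored the last m pairs s_j = θ^(j+1) − θ^(j) and y_j = ∇J(θ^(j+1)) − ∇J(θ^(j)) (oldest first) and returns the product of the implicit inverse-Hessian approximation with the current gradient, whose negative is the search direction.

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        """Two-loop recursion: returns H_approx^{-1} @ grad using the stored
        (s, y) pairs, oldest first. The descent direction is -result."""
        q = grad.copy()
        alphas = []
        rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
        # First loop: newest pair to oldest.
        for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
            alpha = rho * (s @ q)
            q -= alpha * y
            alphas.append(alpha)
        # Scale by an initial inverse-Hessian guess gamma * I (a common heuristic).
        if s_list:
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
            q *= gamma
        # Second loop: oldest pair to newest.
        for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
            beta = rho * (y @ q)
            q += (alpha - beta) * s
        return q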
For the dynamically increasing batch size technique (DIBGD), on every iteration we test, for each parameter θi, whether the current batch determines the sign of ∂J(θ)/∂θi reliably. If the sample mean of ∂J(θ)/∂θi is negative, we require

Pr[ ∂J(θ)/∂θi ≥ 0 ] < ε,    (3.8)

and if it is non-negative, we require

Pr[ ∂J(θ)/∂θi < 0 ] = 1 − Pr[ ∂J(θ)/∂θi ≥ 0 ] < ε,    (3.9)

where the probability is estimated from the sample mean and standard deviation of the partial derivatives in the current batch.
We do this for all i = 1, 2, ..., N on every iteration. We only update the parameters that pass the statistical test. If all parameters fail the statistical test, then we need to increase the batch size, so as to reduce the variance of the sample mean partial derivatives, using a rule such as T′ = ⌈T′ · M⌉, with M ∈ ℝ⁺. We used a fixed value of M = 2 in our implementation. As for the initial batch size, we used T′ = 5T/100 (i.e. 5% of the training set) for all datasets. Once all parameters fail their respective partial derivative tests with T′ = T, we terminate the algorithm.
In order for a parameter's (mean) partial derivative to be reliable, we need its absolute value to be large (i.e. to lie far from zero) and its variance to be small. The constant 0 < ε < 0.5 is an additional hyperparameter of the algorithm that needs to be adjusted accordingly to optimize performance (see section 4.5 for more on this subject).
The potential benefits of such an implementation are numerous. First of all, by starting with a very small batch we add some stochasticity to the optimizer, which could allow it to escape local optima early on, much like a stochastic or minibatch method would do. On the other hand, by the end of the execution we will be running a full batch method, so we asymptotically have the same convergence properties as batch methods, which are better understood than stochastic ones. In fact, the technique can even speed up convergence: far from the minimum, we need less precision to determine the parameter update than closer to it, and this technique takes advantage of that. We only increase the batch size once the ability of the current batch to provide us with statistically reliable updates is exhausted. Once the current batch cannot offer us reliable updates any more, continuing to train on it is not only a waste of time, but can even lead to overfitting. This technique therefore also helps prevent overfitting. As an added bonus, the statistical test provides us with a natural stopping criterion. As a result, there is no further need for other termination conditions and overfitting safeguards like early stopping (see section 4.5 for details).
However, our results on the task of training the NADE model showed that some of these findings do not carry over to non-convex optimization tasks. It should be noted that in [Boyles et al., 2011] the authors tested the technique on convex objective functions that included a regularization sub-objective. Thus, possible explanations for the poor performance in our case include the fact that our objective function is non-convex and the lack of any sort of regularization on our part. L1-regularization makes the objective non-smooth and non-differentiable at any θi = 0, so perhaps this was a reason the authors used a derivative-free method like Coordinate Descent in the first place.
For these methods we need the vector of the standard deviations of the partial derivatives,

σ = [σ1, σ2, ..., σN]ᵀ,    (3.10)

σi = std( ∂J(θ)/∂θi ),  i = 1, 2, ..., N,    (3.11)

where each standard deviation is taken over the per-instance partial derivatives in the current batch.
Data: Xtrain, θinitial, η, ε, M
Result: θ
θ = θinitial; T′ = 5T/100; pass = ZEROS(SIZE(θ));
while Termination Conditions Are Not Met do
    Batch = a random subset of T′ rows of Xtrain;
    [J, ∇J(θ), σ] = COMPUTE OBJECTIVE AND GRADIENTS(Batch, θ);
    for i = 1 : N do
        if ∂J(θ)/∂θi < 0 then
            if Pr[ ∂J(θ)/∂θi ≥ 0 ] < ε then
                pass(i) = 1;
            end
        else
            if Pr[ ∂J(θ)/∂θi < 0 ] < ε then
                pass(i) = 1;
            end
        end
    end
    // In the line below ⊙ denotes element-wise multiplication,
    // i.e. update only the parameters that passed their
    // corresponding test:
    θ = θ − η ∇J(θ) ⊙ pass;
    if pass == 0 then
        grow batch = 1;  // T′ = ⌈T′ · M⌉
        if T′ == T then
            terminate = 1;
        end
    end
    pass = ZEROS(SIZE(θ));
end
Algorithm 12: The general pseudocode for Steepest Gradient Descent with the dynamically increasing batch size technique described in this section: DIBGD basic.
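A minimal Python sketch of one step of this idea is given below. It assumes a Gaussian approximation to each sample-mean partial derivative, passes a parameter when the estimated probability of the true partial derivative having the opposite sign falls below ε, and signals that the batch should grow when every parameter fails; the routine and its names are illustrative, not a transcription of [Boyles et al., 2011].

    import numpy as np
    from scipy.stats import norm

    def dibgd_step(per_example_grads, theta, eta, eps):
        """per_example_grads: (T', N) array of per-instance partial derivatives
        on the current batch. Returns updated theta and whether all tests failed."""
        T_prime, _ = per_example_grads.shape
        mean = per_example_grads.mean(axis=0)
        # Standard error of the mean partial derivatives.
        sem = per_example_grads.std(axis=0, ddof=1) / np.sqrt(T_prime) + 1e-12
        # Probability that the true partial derivative has the opposite sign
        # of its sample mean, under a Gaussian approximation.
        p_wrong_sign = norm.cdf(-np.abs(mean) / sem)
        passed = p_wrong_sign < eps
        theta = theta - eta * mean * passed      # update only reliable coordinates
        return theta, not passed.any()

    # If all parameters fail, grow the batch: T_prime = min(ceil(2 * T_prime), T);
    # once T_prime == T and everything still fails, terminate.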
3.7 Stochastic Methods
We will only examine Stochastic Gradient Descent and some of its variants in this work. There also exist higher order stochastic methods. A simple version of one is to apply a form of Newton's Method to one training instance per iteration. Taking this idea a step further, we can instead approximate the inverse Hessian and use a Quasi-Newton version of SGD [Bordes et al., 2009]. Most of these, however, are complicated, cost more per iteration than simple SGD, and their convergence properties are more difficult to study. Furthermore, the studies that have been made [Bordes et al., 2009], [Bottou et al., 2007] point out that, at least in the context of training machine learning algorithms, whatever computational benefits are gained are constant and the result of many orthogonal tricks applied to the methods.
Stochastic Gradient Descent
Stochastic Gradient Descent is a first order stochastic optimization method. It can simply be viewed as performing Gradient Descent on only one instance of the training set on each iteration. We can randomly pick an instance on every iteration, or cycle through all T instances within one epoch and run the algorithm for a number of epochs. We chose the second option in our implementation.
Despite its apparent simplicity, in practice SGD and its variants have been very successful on large scale optimization problems, especially in the case of non-convex objectives. This success, combined with the aforementioned simplicity, has made SGD the method of choice for many practitioners in large scale and/or non-convex optimization problems in machine learning [Bottou et al., 2004], [Bottou et al., 2007], [Bengio, 2012], [Vishwanathan et al., 2006].
The variants we examine here are the following:
SGD basic: The simple version of the method we discussed above. The pseudocode can be found in Algorithm 13.
Data: Xtrain, θinitial, η
Result: θ
θ = θinitial;
while Termination Conditions Are Not Met do
    for t = 1 : T do
        [J, ∇J] = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain(t, :), θ);
        θ = θ − η ∇J;
    end
end
Algorithm 13: The core pseudocode for the simple version of Stochastic Gradient Descent: SGD basic.
SGD heavyball: The Heavy Ball version of the method we discussed above.
The reasoning behind this heuristic in the SGD setting is to try to moderate the
stochastic behaviour of the algorithm by adding contributions from the previous
steps to each update. The pseudocode can be found in Algorithm 14.
Data: Xtrain, θinitial, η, λ (momentum)
Result: θ
θ = θinitial; Δprev = 0;
while Termination Conditions Are Not Met do
    for t = 1 : T do
        [J, ∇J] = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain(t, :), θ);
        θ = θ − η ∇J − λ Δprev;
        Δprev = η ∇J;
    end
end
Algorithm 14: The core pseudocode for the Heavy Ball version of Stochastic Gradient Descent: SGD heavyball.
ASGD basic: A basic version of Averaged Stochastic Gradient Descent. Here we use the averaged gradient over all instances visited so far, instead of the gradient of just the current instance, to compute each step. Again, the idea is to incorporate information from all previous partial derivatives in each step, so as to moderate the stochasticity of SGD. The pseudocode can be found in Algorithm 15.
Data: Xtrain, θinitial, η
Result: θ
θ = θinitial; Tpassed = 0; ∇J̄ = 0;
while Termination Conditions Are Not Met do
    for t = 1 : T do
        Tpassed++;
        [J, ∇J] = COMPUTE OBJECTIVE AND GRADIENTS(Xtrain(t, :), θ);
        ∇J̄ = (∇J̄ · (Tpassed − 1) + ∇J) / Tpassed;
        θ = θ − η ∇J̄;
    end
end
Algorithm 15: The core pseudocode for the simple version of Averaged Stochastic Gradient Descent: ASGD basic.
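The running average in Algorithm 15 can be maintained incrementally; a minimal Python sketch of one epoch (with a hypothetical compute_gradient in place of the NADE gradient code) is:

    def asgd_epoch(X, theta, eta, compute_gradient, avg_grad, t_passed):
        """One epoch of ASGD basic: step along the running average of all
        per-instance gradients seen so far, instead of the current one."""
        for x in X:                              # one training instance at a time
            t_passed += 1
            grad = compute_gradient(x, theta)
            # Incremental mean: avg <- (avg * (t - 1) + grad) / t
            avg_grad += (grad - avg_grad) / t_passed
            theta -= eta * avg_grad
        return theta, avg_grad, t_passed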
IAGD basic: A basic version of Iterate Averaging Stochastic Gradient Descent. Yet another approach for decreasing the stochasticity of the method. We do this by incorporating information from all previous updates in each step.
3.8 Mini-Batch Methods
MBGD basic: The simple version of the method we discussed above. The
pseudocode can be found in Algorithm 17.
Data: Xtrain, θinitial, η
Result: θ
θ = θinitial;
while Termination Conditions Are Not Met do
    Xtrain = SHUFFLE(Xtrain, ROWS);
    for b = 1 : B do
        Batch = Xtrain(((b − 1) · Tbatch + 1) : (b · Tbatch), :);
        [J, ∇J] = COMPUTE OBJECTIVE AND GRADIENTS(Batch, θ);
        θ = θ − η ∇J;
    end
end
Algorithm 17: The core pseudocode for the simple version of Mini-batch Gradient Descent: MBGD basic.
MBGD heavyball: Mini-Batch Gradient Descent using the Heavy Ball heuristic we presented in Subsection 3.6. We just keep track of the previous step of each parameter and add a contribution of it (i.e. the previous step multiplied by the momentum constant) to the new one. The pseudocode for this version can be found in Algorithm 18.
Data: Xtrain, θinitial, η, λ (momentum)
Result: θ
θ = θinitial; Δprev = 0;
while Termination Conditions Are Not Met do
    Xtrain = SHUFFLE(Xtrain, ROWS);
    for b = 1 : B do
        Batch = Xtrain(((b − 1) · Tbatch + 1) : (b · Tbatch), :);
        [J, ∇J] = COMPUTE OBJECTIVE AND GRADIENTS(Batch, θ);
        θ = θ − η ∇J − λ Δprev;
        Δprev = η ∇J;
    end
end
Algorithm 18: The core pseudocode for the Heavy Ball version of Mini-batch Gradient Descent: MBGD heavyball.
MBGD bolddriver: Mini-Batch Gradient Descent using a version of the Bold Driver heuristic we presented in Subsection 3.6. We evaluate at each epoch n, for each minibatch b = 1, ..., B, the value of the objective function on the training set, J(θ; n, b). If in epoch n the value of the objective function has decreased compared to J(θ; n − 1, b), we increase the stepsize slightly, multiplying it by a factor α > 1, typically around 1.1. If, on the other hand, the value of the objective function has increased compared to J(θ; n − 1, b), we decrease the stepsize severely, multiplying it by a factor 0 < β < 1, typically around 0.5. It should be noted that in this version we don't shuffle the dataset at the beginning of each epoch. This is important, because otherwise a direct comparison of the value of the error function on the same minibatch in two consecutive epochs would be impossible. Instead, we only shuffle once, at the beginning of the execution. The pseudocode for this approach can be found in Algorithm 19.
Data: Xtrain, θinitial, η, α, β
Result: θ
θ = θinitial; J(θ; b)prev = ∞ for b = 1, 2, ..., B; e = 0;
Xtrain = SHUFFLE(Xtrain, ROWS);
while Termination Conditions Are Not Met do
    e++;
    for b = 1 : B do
        Batch = Xtrain(((b − 1) · Tbatch + 1) : (b · Tbatch), :);
        [J(θ; b), ∇J(θ; b)] = COMPUTE OBJECTIVE AND GRADIENTS(Batch, θ);
        θ = θ − η ∇J(θ; b);
        if J(θ; b) < J(θ; b)prev then
            η = α · η;
        else if J(θ; b) > J(θ; b)prev then
            η = β · η;
        else
            // Do Nothing...
        end
        J(θ; b)prev = J(θ; b);
    end
end
Algorithm 19: The core pseudocode for the Bold Driver version of Mini-batch
Gradient Descent: MBGD bolddriver.
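A minimal Python sketch of the per-minibatch Bold Driver adaptation just described is given below; the objective/gradient routine is a hypothetical stand-in, and the data are assumed to have been shuffled once beforehand.

    import numpy as np

    def mbgd_bold_driver(X, theta, eta, compute_obj_grad, n_epochs,
                         alpha=1.1, beta=0.5, batch_size=100):
        """Mini-batch gradient descent where eta grows by alpha after a minibatch
        whose objective decreased since the previous epoch, and shrinks by beta
        after one whose objective increased."""
        n_batches = X.shape[0] // batch_size
        prev_obj = np.full(n_batches, np.inf)     # J(theta; b) from the previous epoch
        for epoch in range(n_epochs):
            for b in range(n_batches):
                batch = X[b * batch_size:(b + 1) * batch_size]
                obj, grad = compute_obj_grad(batch, theta)
                theta = theta - eta * grad
                if obj < prev_obj[b]:
                    eta *= alpha                   # mild increase
                elif obj > prev_obj[b]:
                    eta *= beta                    # severe decrease
                prev_obj[b] = obj
        return theta, eta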
Chapter 4
Hyperparameter Selection
In the previous section we presented the methods we used in order to optimize the
parameters of the model we were working with. However, all these methods have,
in turn, parameters of their own, also known as hyperparameters, the choice of
which can greatly impact their performance both in terms of speed of convergence
and in terms of the quality of the resulting solution. But how do we choose the
values of these hyperparameters? After all this is a new optimization problem in
its own right. Certainly applying similar techniques to the ones discussed seems
too costly. Also, the new optimization problem poses new challenges, as some hyperparameters assume values not in ℝ but in subsets of it (e.g. in ℤ), or in some cases we have constraints involving the hyperparameter values, so such techniques cannot even be applied. Thus, we should seek other alternatives for hyperparameter optimization.
So at this point we have to make some choices: Which hyperparameters should we fix, and at which values? Which ones should we try to optimize? What search space should we explore, and using what optimization technique? How should we evaluate our choices for these hyperparameters? How much of our resources should we allocate to this procedure? In this chapter we will discuss these choices, as well as the reasons behind each one, backing them up with theoretical arguments or experimental results wherever appropriate. Main sources of inspiration include [Hinton, 2010], [LeCun et al., 1998] and [Bengio, 2012].
4.1
Step Size
A hyperparameter that is common to all the methods examined is the step size η. A large step size can result in failure of the method to converge or even lead to divergence. A small step size can lead to very slow convergence. So we need to find a step size that balances between these cases. There are techniques that allow us to select the (locally) optimal step size on every iteration (e.g. by performing a line search). In the case of a quadratic objective function, the ideal step size (i.e. the individual step size for each partial derivative) would be the inverse of the corresponding second derivative.
For stochastic methods, a decreasing step-size schedule η(n) over iterations n should satisfy

Σ_{n=1}^{∞} η(n) = ∞    (4.1)

and

Σ_{n=1}^{∞} η(n)² < ∞.    (4.2)

The schedule we used is

η(n) = η₀ C / max{C, n}.    (4.3)

The idea is to keep the step size constant for the first C iterations and then decrease it inversely proportionally to the number of iterations.
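As a sketch, the schedule of Eq. (4.3) is a one-liner in Python:

    def step_size(n, eta0, C):
        """Constant step size for the first C iterations, then ~ 1/n decay,
        as in Eq. (4.3): eta(n) = eta0 * C / max(C, n)."""
        return eta0 * C / max(C, n)

    # e.g. eta0 = 0.1, C = 50: iterations 1..50 use 0.1, iteration 100 uses 0.05.
    print(step_size(10, 0.1, 50), step_size(100, 0.1, 50))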
So now we only need to set two hyperparameters: η₀ and C. The most common way to set these hyperparameters is to do a grid search in the 2-D hyperparameter space. In other words, we try a number of discrete combinations of values for the two parameters in a subset of the parameter space, since searching over continuous values would be hopeless. As to what values we should include in the set, for η₀ it makes more sense to use a logarithmic grid, e.g. {0.01, 0.1, 1, 10, 100}, in order
to the updates becoming too small and decreasing without actually becoming equal to zero. As we get closer and closer to a local minimum in a dimension of the parameter space, the partial derivative of the objective function w.r.t. that parameter will tend towards zero. We also haven't mentioned a way to explicitly set the learning rate to zero. So the updates might end up getting smaller and smaller at each iteration but never actually reaching a zero value, as this slowing down prevents us from reaching the minimum (so ∂J(θ)/∂θi ≠ 0). We also never necessarily terminate (more precisely, we always terminate due to the MAX ITERS or MAX EPOCHS criteria), since a decrease in the value of the objective function, albeit a minute one, will always occur, because η(n) ≠ 0. This prevents early stopping from activating (for a detailed explanation see section 4.6). We avoid the problem of non-termination by forcing any step size less than 10⁻⁸ to become zero. Hopefully the values selected for η₀ and C will not let this occur often.
4.2
Another idea for adjusting the step size is to use adaptive step sizes. This is the basis of heuristics such as Bold Driver and R-prop that we described in the previous section. Here, as we saw, we have two more hyperparameters to handle: α > 1 (which multiplies the learning rate to increase it) and 0 < β < 1 (which multiplies the learning rate to decrease it). We want the increase to be small and the decrease to be considerable. Initially we also included these hyperparameters in the grid search performed, searching in {1.3, 1.2, 1.1, 1.01, 1.001} for α and in {0.5, 0.6, 0.7, 0.8, 0.9} for β. In case of ties we favoured bigger values for α and smaller for β to encourage faster convergence. However, we ended up choosing to use the fixed values α = 1.01 and β = 0.6. The reason was that these were the choices of the hyperparameter selection algorithm in most cases anyway, and adding two more dimensions to our grid search would make the computation time too much to bear, as it would raise the number of dimensions to 4 or even 5.
Unless explicitly stated otherwise (i.e. in all cases except for methods using the R-prop heuristic), each parameter uses the same stepsize. And even in R-prop, we used the same global initial step size value for all parameters. These choices were made mainly to keep the total number of hyperparameters involved small (up to 3), so that the combinatorial search could be performed without adding a substantial overhead to the entire process. Individual learning rates, individual initializations, even individual update schemes for the learning rates are all conceivable options, however. We could even opt for treating the parameters as blocks (e.g. different learning rates for the biases and for the weights). For the same computational cost-related reasons we only included 5 values as options for each hyperparameter.
4.3 Momentum

4.4 Batch Size

4.5

4.6
We must specify under what conditions the optimization procedure should terminate. In Dynamically Increasing Batch Size methods, a statistical test provided us with a natural stopping criterion which also safeguarded against overfitting, so in these cases no further termination condition needed to be specified. However, just to make sure that the optimizer wouldn't get stuck in some extreme cases where the parameters would take a very long time to all fail their respective tests, we imposed a slightly stricter condition: if the number of parameters that fail the test does not change for MAX ITER SINCE MOST FAILED iterations on the entire dataset, then terminate.
For all other methods we needed to impose some other types of termination
criteria. One shared across all methods is the maximum iterations allowed. If a
batch method has run for MAX ITERS iterations or a stochastic (or minibatch)
method has run for MAX EPOCHS epochs, then it should terminate. Again,
this criterion is more of a safeguard against infinite execution than a refined
termination condition. In our implementation we used MAX ITERS = 400 or
MAX EPOCHS = 400 for all methods examined.
The most meaningful termination method we used in every method implemented
except for the Dynamically Increasing Batch Size methods we discussed above (
i.e. all save for L-BFGS for which we used the pre-existing minfunc() implementation) was early stopping. At this point we should remember that our actual
goal is not optimization in its own right, but learning. And as we know learning
implies generalization. Our aim therefore is not to minimize the training error,
but the generalization error. In other words, our goal is not to find the parameter
values that minimize the value of the objective function on the training set, but
the expected loss on an unseen test set.
We use the value of the objective function on the validation set in order to estimate
this generalization error. We keep track of the current best solution and stop
training once we detect no improvement on the value of the objective function on
the validation set over the current best solution for MAX ITERS SINCE BEST
iterations (or MAX EPOCHS SINCE BEST epochs where applicable). From this
point on, further training would lead to a decrease in training error at the cost
of generalization, in other words, in overfitting. In Algorithms 20 and 21 we
can find two slightly different implementations of early stopping. The first one
was used for batch methods with MAX ITERS SINCE BEST = 5, and the second one for stochastic and minibatch methods with MAX EPOCHS SINCE BEST = 5. The vector λ is the vector of hyperparameters of the specific optimizer used.
Other possible termination criteria we could use include the following:
(i) Terminate if the parameters do not change substantially. In this scenario, we need to define "substantially", which might turn out to be tricky, since we can't guarantee that a small change in a parameter doesn't bear significant changes in the objective function. The surface of the objective function can vary in steepness from dimension to dimension, so our definition of "substantially" should be different for each parameter to accommodate that.
(ii) Terminate if the value of the objective function does not improve substantially. In this case we also need to define "substantially", but this task can be more obvious, since the value of the objective function is what we are actually trying to minimize. We can set a tolerance constant of e.g. TOL = 10⁻³ and terminate if |(Jtrain(i) − Jtrain(i − 1)) / Jtrain(i)| < TOL. In the L-BFGS runs we used this exact criterion. The value of TOL would ideally have to be different for each problem (dataset), and we could even have compared |(Jvalid(i) − Jvalid(i − 1)) / Jvalid(i)| to it instead, for a stricter version of early stopping. We did not do so, however.
Data: Xtrain, Xvalid, θinitial, λ, MAX ITERS, MAX ITERS SINCE BEST
Result: θbest
θ = θinitial; Jbest = ∞; i = 1; terminate = 0; itersSinceBest = 0;
while i < MAX ITERS AND terminate == 0 do
    θnew = BATCH OPTIMIZER(Xtrain, θ, λ);
    J(θnew) = OBJFUNCEVAL(Xvalid, θnew);
    if J(θnew) < Jbest then
        θbest = θnew;
        Jbest = J(θnew);
    else
        itersSinceBest++;
        if itersSinceBest == MAX ITERS SINCE BEST then
            terminate = 1;
        end
    end
    θ = θnew;
    i++;
end
Algorithm 20: Early stopping incorporated into a batch optimization method.
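A minimal Python rendering of Algorithm 20 is given below, with hypothetical stand-ins for the optimizer step and the objective evaluation; this sketch resets the counter whenever the validation objective improves, which is one common reading of the criterion.

    import copy
    import numpy as np

    def train_with_early_stopping(X_train, X_valid, theta, batch_optimizer_step,
                                  objective_on, max_iters=400, lookahead=5):
        """Keep the best parameters seen on the validation set; stop after
        `lookahead` consecutive iterations without improvement."""
        best_theta, best_J = copy.deepcopy(theta), np.inf
        iters_since_best = 0
        for _ in range(max_iters):
            theta = batch_optimizer_step(X_train, theta)
            J_valid = objective_on(X_valid, theta)
            if J_valid < best_J:
                best_theta, best_J = copy.deepcopy(theta), J_valid
                iters_since_best = 0
            else:
                iters_since_best += 1
                if iters_since_best == lookahead:
                    break
        return best_theta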
4.7
Evaluation
Finally, let us discuss how we actually evaluated our choices for the hyperparameters in order to select the most appropriate values. Running the optimization algorithm for each hyperparameter configuration using the entire training set and choosing the configuration that produces the smallest value of the objective function would be far too costly computationally. For up to 3 free hyperparameters and 5 choices each, we would end up running each method on the entire dataset up to 5³ = 125 times. Instead, we randomly selected a small portion of each training set, in particular 10%, on which to train for each of these configurations for a small number (20) of iterations or epochs (wherever applicable). We used early stopping with a stricter condition (MAX ITERS SINCE BEST = 2 for batch methods, MAX EPOCHS SINCE BEST = 2 for minibatch and stochastic ones) than in the actual optimization procedure, to allow us to quickly drop bad hyperparameter choices and not waste time on them. We still end up running each method up to 125 more times, but (i) we do so using one tenth of the data, (ii) we run for far fewer iterations, and (iii) we are stricter in the early stopping condition. As a result the computational cost is far smaller.
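A minimal Python sketch of this screening procedure, under the assumption that a hypothetical quick_train_and_eval routine runs the few-iteration, strict-early-stopping training described above and returns the validation objective:

    import itertools
    import numpy as np

    def screen_hyperparameters(grids, X_train, X_valid, quick_train_and_eval,
                               subsample=0.10, seed=0):
        """grids: dict mapping hyperparameter name -> list of candidate values.
        Trains briefly on a random 10% subsample for each configuration and
        keeps the one with the best validation objective."""
        rng = np.random.default_rng(seed)
        n_sub = max(1, int(subsample * X_train.shape[0]))
        X_sub = X_train[rng.choice(X_train.shape[0], size=n_sub, replace=False)]
        best_config, best_J = None, np.inf
        for values in itertools.product(*grids.values()):
            config = dict(zip(grids.keys(), values))
            J_valid = quick_train_and_eval(config, X_sub, X_valid)
            if J_valid < best_J:
                best_config, best_J = config, J_valid
        return best_config

    # e.g. grids = {'eta0': [5, 0.5, 0.05, 0.005, 0.0005], 'C': [1, 10, 100]}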
The free and fixed hyperparameters of each method (e.g. BGD rprop, CD basicGD, L-BFGS) were drawn from the following grids: {5, 0.5, 0.05, 0.005, 0.0005} for the step size η (and, for BGD rprop, for the initial individual step sizes η₀(i), constrained so that η₀(i) = η₀(j) ∀i, j); {1, 0.1, 0.01, 0.001, 0.0001} for the momentum constant λ; {1.3, 1.2, 1.1, 1.01, 1.001} for α; and {0.5, 0.6, 0.7, 0.8, 0.9} for β. L-BFGS had no free hyperparameters (we used the default minfunc() options). Where hyperparameters were fixed rather than searched, the values used were λ = 0.01, α = 1.01 and β = 0.6.
4.8
As a closing note, before settling for a grid search on the hyperparameter space (and keeping its dimensionality below 4 by choosing fixed values for the rest of the hyperparameters), we briefly experimented with another hyperparameter selection scheme for methods that involved N ≥ 4 hyperparameters (and thus at least 5⁴ = 625 hyperparameter combinations).
We used a Genetic Algorithm to explore the hyperparameter space, with an initial population of 20 individuals, each of which represented a candidate vector of hyperparameters λ^(i) ∈ ℝ^N, initialized using values selected uniformly at random from the same set of values for each hyperparameter on which we performed the grid search. We used each individual as a hyperparameter vector for the method in question and ran 20 iterations with an early stopping lookahead of MAX ITERS SINCE BEST = 2 for batch methods and MAX EPOCHS SINCE BEST = 2 for minibatch and stochastic ones. To evaluate each candidate hyperparameter combination λ^(i), we used as a fitness function F(λ^(i)) = J(θ; λ^(i))/Niter, where J(θ; λ^(i)) is the final value of the objective function on the validation set and Niter the number of iterations until convergence. The 10 fittest individuals would be selected for crossover and would move on to the next generation along with their 10 offspring. For recombination we used single-point crossover among 5 randomly (with equal probability) chosen pairs of the selected individuals. In other words, both parent vectors were cut at the same position, selected uniformly at random from {1, 2, ..., N}, and the four resulting parts were combined to form 2 new individuals of length N each, with one part from each parent. As for the mutation operation, each element of an offspring would switch its value to one of the other 4 possible values in its corresponding hyperparameter set with equal probability. We used a mutation rate of 0.1. The above describes one generation of the algorithm; we ran it for 25 generations.
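A compact Python sketch of the scheme just described (population 20, truncation selection of the 10 fittest, single-point crossover of 5 random pairs, per-gene mutation rate 0.1) is given below; the fitness function is a hypothetical stand-in for the final validation objective divided by the number of iterations to convergence.

    import random

    def ga_hyperparameter_search(value_sets, fitness, pop_size=20, n_generations=25,
                                 mutation_rate=0.1, seed=0):
        """value_sets: list of allowed values per hyperparameter (the same sets
        as the grid search). fitness(individual) -> score; lower is better here."""
        random.seed(seed)
        n = len(value_sets)
        pop = [[random.choice(vs) for vs in value_sets] for _ in range(pop_size)]
        for _ in range(n_generations):
            pop.sort(key=fitness)                      # lower fitness = better
            parents = pop[:pop_size // 2]              # the 10 fittest survive
            offspring = []
            for _ in range(len(parents) // 2):         # 5 random pairs, single-point crossover
                a, b = random.sample(parents, 2)
                cut = random.randint(1, n - 1)
                offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
            for child in offspring:                    # mutate each gene with prob 0.1
                for i in range(n):
                    if random.random() < mutation_rate:
                        child[i] = random.choice([v for v in value_sets[i] if v != child[i]])
            pop = parents + offspring
        return min(pop, key=fitness)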
This method was faster than the grid search, since it was designed as a means to reduce running time: the maximum number of iterations of the optimization method would be 20 · 25 · 20, which is less than the 5⁴ · 20 of the grid search. However, the resulting hyperparameter choices were usually poor. We found that by fixing certain hyperparameters to specific values we get better results with a grid search on the remaining dimensions, and faster (up to 5³ · 100 iterations of the optimization method). The genetic algorithm also has hyperparameters and implementation choices of its own we could have tweaked (mutation rate, definition of the mutation and recombination operations, initial population size, allowed values for each element of the individual, selection scheme, fitness function, termination conditions etc.). We did not experiment any further with these choices, since we would risk entering a vicious cycle: we started off trying to optimize the parameters θi of our model, and now we seek to optimize the hyperparameters λj of the optimization algorithms; it is better if we do not do so using a method that involves hyperparameters of its own in need of optimization.
Chapter 5
Gradient Code Debugging and
Verification
Gradient code is particularly error prone and, to make matters worse, these errors can be hard to detect. This is so, because plotting the value of the objective
function per iteration usually gives us little information about whether our gradient code is correct. In batch methods this quantity is supposed to be decreasing
monotonically per iteration (as long as it does not overshoot the minimum) and in
stochastic methods it is supposed to decrease noisily as training progresses ( i.e.
it is allowed to increase from last iteration as well). Indeed, the same can happen
even if we miscalculate the gradient. If we just get its sign right, we will take a
step in the correct direction although the size of the step is not the one intended.
If we use an individual learning rate per parameter, perhaps this scaling error can be countered (given that the step sizes are allowed to adapt to appropriate values during training); however, with a global step size (which might be the most common practice), scaling errors can complicate matters. Furthermore, especially when the model has many parameters, even getting some partial derivative signs wrong can still lead to a decrease in the value of the objective function, since we might have moved to better values in the rest of the parameter dimensions. The point of all this is that, e.g. when using a batch method, the value of the objective can be decreasing monotonically and the method might appear to be working, but at its core (the gradient computation) the code could still be wrong. So we cannot use the value of the objective function as a means of testing our gradients.
5.1
One method to check whether the code that computes the gradient of the objective function is correct (or better: agrees with the objective function) is to
use the Finite Differences method. This technique is used to estimate derivatives
numerically using (first order) finite differences. From the Taylor series we have
J(θi^(0) + h) = J(θi^(0)) + h ∂J(θi^(0))/∂θi + (h²/2!) ∂²J(θi^(0))/∂θi² + (h³/3!) ∂³J(θi^(0))/∂θi³ + R(4)    (5.1)
and
J(θi^(0) − h) = J(θi^(0)) − h ∂J(θi^(0))/∂θi + (h²/2!) ∂²J(θi^(0))/∂θi² − (h³/3!) ∂³J(θi^(0))/∂θi³ + R(4),    (5.2)
where R(k) corresponds to the sum of terms involving derivatives of order k and
higher.
Figure 5.1: We wish to estimate the derivative of J(θ) w.r.t. θ at θ₀. It is the slope of the curve at point P.
Image adapted from
http://staff.unila.ac.id/tiryono/files/2011/09/BedaHingga-FDM.pdf
∂J(θi^(0))/∂θi ≈ [J(θi^(0) + h) − J(θi^(0))] / h    (5.3)

∂J(θi^(0))/∂θi ≈ [J(θi^(0)) − J(θi^(0) − h)] / h    (5.4)

∂J(θi^(0))/∂θi ≈ [J(θi^(0) + h) − J(θi^(0) − h)] / (2h)    (5.5)
We can obtain Eq. (5.3) and Eq. (5.4) by rearranging Eq. (5.1) and Eq. (5.2) respectively and truncating the R(2) term (hence the approximation). To get Eq. (5.5) we subtract Eq. (5.2) from Eq. (5.1) and truncate the remaining higher order terms. The smaller h is, the better the approximation. Remember, perhaps the most common definition of the derivative of J at θi^(0) is

∂J(θi^(0))/∂θi = lim_{h→0} [J(θi^(0) + h) − J(θi^(0))] / h.    (5.6)
We chose the forward difference of Eq. (5.3) for simplicity and used h = 10⁻⁹. We approximated the partial derivatives with respect to each θi, i = 1, 2, ..., N, using this procedure. We computed the mean absolute relative error

E = (1/N) Σ_{i=1}^{N} | ( [J(θi^(0) + h) − J(θi^(0))] / h − ∂J(θi^(0))/∂θi ) / ( ∂J(θi^(0))/∂θi ) |    (5.7)
of the approximation for parts of our datasets. In every case we examined it was of order 10⁻⁷, which was deemed acceptable. Some of the models examined had a large number of parameters (e.g. NADE has H + D + 2·H·D, with H up to 500 and D at least 112 in our datasets), which meant that using all features (and many hidden variables in NADE) would mean computing partial derivatives with respect to a large number of parameters using the finite differences method. This made the computational cost prohibitive, so instead we checked the approximation using only a few features and H = 5 for NADE. Due to its high computational cost, gradient checking was only used at this stage for sanity checks and it was subsequently deactivated during the actual execution of the methods.
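A minimal Python sketch of this check, using the forward difference of Eq. (5.3) and the error measure of Eq. (5.7) on a random subset of the parameters (with a toy objective, and a larger h than the 10⁻⁹ used in our runs):

    import numpy as np

    def gradient_check(J, dJ, theta, h=1e-6, n_checked=20, seed=0):
        """Compare analytic partial derivatives dJ(theta) against forward
        finite differences for a random subset of coordinates and return the
        mean absolute relative error, as in Eq. (5.7)."""
        rng = np.random.default_rng(seed)
        grad = dJ(theta)
        idx = rng.choice(theta.size, size=min(n_checked, theta.size), replace=False)
        errs = []
        for i in idx:
            theta_plus = theta.copy()
            theta_plus[i] += h
            fd = (J(theta_plus) - J(theta)) / h       # forward difference, Eq. (5.3)
            errs.append(abs((fd - grad[i]) / grad[i]))
        return np.mean(errs)

    # Sanity check on a quadratic where the analytic gradient is known exactly.
    J  = lambda th: 0.5 * np.sum(th ** 2)
    dJ = lambda th: th
    print(gradient_check(J, dJ, np.linspace(1.0, 2.0, 50)))   # of order 1e-7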
We could also use the finite differences method to check the second partial derivatives, by taking finite differences of first partial derivatives: for every partial derivative ∂J(θ)/∂θi, i = 1, 2, ..., N, we can use

∂²J(θ)/∂θi² ≈ [ ∂J(θ + h ei)/∂θi − ∂J(θ)/∂θi ] / h,    (5.8)

where ei is the i-th unit vector.

5.2
Figure 5.2: Decision boundary before training on Training set (left) and Test set
(right).
Figure 5.3: Decision boundary after training with BGD basic on Training set
(left) and Test set (right).
Figure 5.4: Decision boundary after training with BGD heavyball on Training
set (left) and Test set (right).
Notice, however, that CD basicGD performs as well as the other methods. This will not be the case in the optimization of NADE's objective function, as we will see in the next section. Also, notice that CD basicNewton gives a better solution than CD basicGD; in fact this is because it converges faster. Unfortunately we did not include CD basicNewton in further studies (i.e. on NADE or on FVSBN), as we only implemented it during the final stages of this dissertation as a viable alternative to CD basicGD. The results shown here indicate that CD basicNewton does indeed have an advantage over CD basicGD. As we see from Figures
Figure 5.5: Decision boundary after training with SGD basic on Training set
(left) and Test set (right).
Figure 5.6: Decision boundary after training with CD basicGD on Training set
(left) and Test set (right).
Figure 5.8: Decision boundary after training with MBGD basic on Training set
(left) and Test set (right).
Figure 5.9: Decision boundary after training with Newton's method on Training set (left) and Test set (right). Note that this boundary was achieved after only 1 iteration, while the results for all other methods are after 300 iterations.
Table 5.1: Logistic Regression Results

Method                Objective Function    Objective Function    Iterations
                      on Training Set       on Test Set
Before Optimization   2.006824e+000         2.027101e+000         -
BGD basic             7.808466e-002         6.683343e-002         300
BGD heavyball         7.684857e-002         6.567788e-002         300
SGD basic             4.716319e-002         3.426209e-002         300
CD basicGD            7.828068e-002         6.699035e-002         300
CD basicNewton        4.744886e-002         3.623695e-002         300
MBGD basic            4.643823e-002         3.741856e-002         300
Newton                0.2699                0.2662                1

5.3
Until now we discussed the case where the gradient of the objective function with respect to the parameters was miscalculated. How can we detect whether we have the objective function itself right? After all, the finite differences method just verifies that the objective and the gradient codes are consistent with one another. What if we got the objective wrong and consequently wrote correct gradient code for the wrong objective function? Here the answer is somewhat more problem-specific. We have to take into consideration the physical meaning of the objective function: What range can its values assume? What value does it take given particular extreme cases of inputs?
Since we use NADE to perform density estimation, we need to make sure that the values it computes for ln(p(xi)), i = 1, 2, ..., K, where K = 2^D and xi ranges over all possible values in {0, 1}^D, correspond to natural logarithms of probability densities p(xi), i = 1, 2, ..., K. So we verified that indeed p(xi) = e^{ln(p(xi))} ∈ [0, 1], i = 1, 2, ..., K, and that Σ_{i=1}^{K} p(xi) = Σ_{i=1}^{K} e^{ln(p(xi))} = 1. Thus p(xi), i = 1, 2, ..., K are indeed probability densities. But are they good ones?
To answer that, at least preliminarily, we can compare NADE to some form of elementary density estimation as a baseline, comparing the average negative log-likelihood (ANLL) of the models. The simplest baseline would be beating a random density estimator. So we assume every bit of instance x^(t) can be 0 or 1 with equal probability (i.e. p(x_d^(t) = 0) = p(x_d^(t) = 1) = 0.5, ∀d). Under this simplistic assumption the ANLL is

J(X; θ) = −(1/T) Σ_{t=1}^{T} Σ_{d=1}^{D} ln(p(x_d^(t))) = −(1/T) Σ_{t=1}^{T} ln(0.5^D) = −ln(0.5^D) = −D ln(0.5).    (5.9)

We present the values obtained under this baseline for every dataset in Table 5.2. Usually, our parameter initialization led to average negative log-likelihood values in this region, so it is a good indication of the value of our objective function before training. Of course, after training we always beat this unrefined density estimator.
Yet another model for density estimation we can use as a baseline is to calculate p(x_d = 0) and p(x_d = 1) for each d based on the counts of occurrences of bit 1 (c1) and bit 0 (c0) over all T instances. This model can be viewed as a form of Naive Bayes approach for Bernoulli variables. We used pseudocounts instead of the actual counts of occurrences, so as to avoid the problem of zero probabilities in the case of zero occurrences:

p(x_d^(t) = 0) = (c0 + a) / (T + a + b), ∀t,    (5.10)

and the ANLL under this model is

J(X; θ) = −(1/T) Σ_{t=1}^{T} Σ_{d=1}^{D} ln(p(x_d^(t))).    (5.11)
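A minimal Python sketch of both baselines, Eqs. (5.9)-(5.11), for a binary dataset stored as a (T, D) array (the pseudocount constants a and b play the roles of a and b in Eq. (5.10)):

    import numpy as np

    def anll_random(D):
        """ANLL of the uniform-bit baseline, Eq. (5.9): -D * ln(0.5)."""
        return -D * np.log(0.5)

    def anll_pseudocounts(X_train, X_test, a=1.0, b=1.0):
        """ANLL of the independent-Bernoulli baseline with pseudocounts,
        Eqs. (5.10)-(5.11), fit on X_train and evaluated on X_test."""
        T = X_train.shape[0]
        p1 = (X_train.sum(axis=0) + b) / (T + a + b)     # p(x_d = 1) per feature
        logp = X_test * np.log(p1) + (1 - X_test) * np.log(1 - p1)
        return -logp.sum(axis=1).mean()

    # Toy usage on random bits: both baselines should be close to each other.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 20)).astype(float)
    print(anll_random(20), anll_pseudocounts(X[:800], X[800:]))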
In Table 5.2 we can see the ANLL computed under this model for each of the
datasets we will train NADE on (the datasets themselves are properly introduced
in section 6.1). As we will see in section 6.3, we beat this baseline as well.
We also calculated the percentage of 1s in each dataset. We will use it here to
compare the two baselines, but it can also be thought of as a measure of density
of the dataset, with low values corresponding to sparse datasets.
Table 5.2: Baseline values of Average Negative Loglikelihood (ANLL) of simple density estimation techniques. The last column is the percentage of 1s in each dataset. All refer to the corresponding test set.

Dataset      ANLL Random   ANLL Based on Pseudocounts   % of 1s in Dataset
adult        84.5640       25.7602                      11.37
mnist        542.7342      205.6319                     13.27
connect-4    87.3365       35.3312                      33.33
dna          124.7665      100.2841                     25.14
mushrooms    77.6325       34.2084                      18.75
nips-0-12    346.5736      293.9804                     36.65
ocr-letters  88.7228       63.7293                      21.87
rcv1         103.9721      58.3684                      13.88
web          207.9442      39.7515                      3.88
As we see in Figure 5.10, the farther away from 50% the percentage of 1s in the dataset is, the less correct the assumption p(x_d^(t) = 0) = p(x_d^(t) = 1) = 0.5, ∀d, is, and so the larger the relative difference (ANLL_BASELINE1 − ANLL_BASELINE2)/ANLL_BASELINE1 between these two baselines becomes. This was to be expected, of course.
Having done these checks we are now confident that our code works as it is
supposed to and we can move on to applying our methods to training NADE.
Figure 5.10: Relative difference of the two baselines presented in section 5.3 as a
function of the number of 1s in the dataset.
Chapter 6
Experimental Results
6.1
The Datasets
The datasets used in the experiments are all from the LIBSVM database¹, except for ocr-letters² and nips-0-12³. They are all versions that have undergone preprocessing (not performed by us; these are the versions already available in the sources mentioned above). Features with missing values have been removed, and all features, whether discrete or continuous, numerical, categorical or ordinal, have been converted into an appropriate binary form. The datasets are split in
3 subsets each: Training, Validation and Test. The training set was used during
training to optimize the parameters of the model. The validation set was used in
hyperparameter selection and in overfitting control by early stopping. The test
set was the one used for the final evaluation of each method (the one reported in the tables that follow). In Table 6.1 we can see the size and number of features of
each dataset. We already saw that there is a different degree of sparsity in each
dataset in Table 5.2.
The datasets we used were the same ones used in [Larochelle et al. , 2011]. This
way we can verify that we get similar results to those reported by the authors for
NADE with H=500 using SGD basic and compare them to the other baselines
presented there. As we see in Table 6.3 we indeed get similar results, yet not
exactly the same due to the following reasons:
The authors in [Larochelle et al. , 2011] use a slightly different learning rate
decrease scheme than the one used here.
They use a lookahead of 10 iterations for early stopping. We use 5 instead.
It is not entirely clear how they performed the hyperparameter selection
(e.g. did they use the entire validation set, did they train on the entire
training set, did they train until convergence, did they use early stopping
Table 6.1: The datasets we used for our experiments. T is the number of instances, D the number of features. Dimensions are given as T × D for the Training, Validation and Test splits.
adult: Training 5000 × 123, Validation 1414 × 123, Test 26147 × 123. Census data, consisting of 14 features. (Ver. a5a).
binarized mnist: Training 50000 × 784, Validation 10000 × 784, Test 10000 × 784. 28 × 28 images of handwritten digits 0, 1, ..., 9.
connect-4: Training 16000 × 126, Validation 4000 × 126, Test 47557 × 126. All legal positions in a game of connect-4 and the player that controls each.
dna: Training 1400 × 180, Validation 600 × 180, Test 1186 × 180. Primate splice-junction gene sequences (DNA).
mushrooms: Validation 500 × 112, Test 5624 × 112. Mushrooms described in terms of 21 physical characteristics.
nips-0-12: Training 1240 × 500, Validation 500 × 500, Test 400 × 500. Paper information of NIPS proceedings from volume 0 to 12.
ocr-letters: Validation 10000 × 128, Test 10000 × 128. 16 × 8 images of handwritten English characters.
rcv1: Training 40000 × 150, Validation 10000 × 150, Test 150000 × 150. Reuters Corpus Volume I. Contains manually categorized newswire stories.
web: Training 14000 × 300, Validation 3188 × 300, Test 32561 × 300. A subset of the ICML-09 URL reputation dataset. (Ver. w6a).
and, if yes, whether they used the same lookahead as in the final execution of the algorithm or not). Most likely we have introduced some differences in this procedure, since it involves many decisions. Here, we are exploring a larger range of hyperparameters (5 options per hyperparameter compared to the authors' 3), but most likely in a more hurried way (i.e. we run for only a few iterations, with stricter early stopping, and training on only a subset of the training set).
The number of runs we did to obtain the statistics might differ; we used Nruns = 5.
The inherent stochasticity of the entire procedure. In Table 6.2 we can see the sources of variance involved in each method group. In the case of comparing our results to those of [Larochelle et al., 2011], the ones for SGD apply.
Table 6.2: Sources of variance for each method group examined for NADE (BGD, CD, L-BFGS, SGD, MBGD, DIBGD). Even if we use the same method with the same hyperparameters, these factors will introduce variation among the results of different runs. The sources of variance are: the random initialization of the parameters θi (note that NADE's ALL is non-convex); the ordering of the features di (note that NADE computes the conditionals p(x_di | x_<di)); the ordering of the parameters θi; the ordering of the instances ti; the ordering of the minibatches bi; and the instances within each minibatch (MBGD) or within the initial batch (DIBGD). Which of these sources apply depends on the method group.

6.2 Experimental Design
The final version of our code needs only three inputs specified: the dataset we
are working with, the optimization method we will use and the number of hidden
variables H. It returns the iterations spent for parameter selection, the iterations
needed for the actual execution of the optimization procedure and the Average
Log Likelihood (ALL) of the trained NADE model on the Test set.
We initialized the parameters of the model using random values θi(0) ~ U(−1, 1), i = 1, 2, ..., N. We used a random permutation of the columns of the feature vectors, thus shuffling the order of the features. The ordering of the features affects the result of NADE but not of models such as FVSBN. However, the authors in [Larochelle et al., 2011] reported that the effect of the ordering on the ALL is not statistically significant. In addition to that, stochastic and minibatch methods shuffle the instances at the beginning of their execution (exceptions: MBGD basic and MBGD heavyball, which shuffle them at the beginning of every epoch), and DIBGD randomly selects a small subset of the data for the initial batch.
In most cases we executed Nruns = 5 runs for each method on each dataset (for each choice of H ∈ {5, 50, 500}) in order to obtain statistics. We present the average of these runs as well as 95% confidence intervals computed from the Nruns runs. Each of the Nruns runs for all methods on the same dataset and number of hidden variables H was performed using the same parameter initialization and shuffling of features. In methods added in the last stages of this dissertation, using the same initialization and shuffling was impossible, since we kept no record of it. Also, in a few cases Nruns < 5. Finally, for some methods we only present results on two datasets. These methods were deemed non-competitive, so we did not test them on the remaining datasets. For example, DIBGD basic and CD basicGD gave poor ALL results (we will see possible explanations as to why this happens), and MBGD bolddriver and IAGD basic appear not to improve upon the basic versions they are supposed to improve upon (MBGD basic and SGD basic, respectively), so we excluded them from further comparisons.
6.3
In the tables that follow we present the ALL results attained, as well as baseline results from other models. In this subsection we only compare the methods in terms of the value of the objective function; therefore, when we refer to the performance of a method we mean the quality of the optima it finds.
Table 6.3: Average Log Likelihood (ALL) of NADE with H = 500 hidden units trained by SGD. Results from our work and from [Larochelle et al., 2011] for verification. Averages were normalized by subtracting from them the average ALL of the Mixture of Bernoullis (MoB) model, shown in Table 6.4. Higher values correspond to better density estimation.

Dataset           Larochelle & Murray   Present work
adult             7.25 ± 0.05           7.24 ± 0.04
connect-4         11.42 ± 0.01          11.42 ± 0.01
dna               13.38 ± 0.57          13.23 ± 0.40
mushrooms         4.65 ± 0.04           4.65 ± 0.04
nips-0-12         16.94 ± 1.11          16.26 ± 1.20
ocr-letters       13.34 ± 0.21          13.08 ± 0.37
rcv1              0.93 ± 0.11           0.92 ± 0.09
web               1.77 ± 0.20           1.64 ± 0.18
binarized mnist   48.78                 42.39
In [Larochelle et al., 2011], NADE's performance (in terms of mean ALL value) is compared to that of some other models. We present their results here as well, to include them in our comparisons with the results obtained by training NADE using the methods discussed here. These additional models are the following, and their mean ALL, as well as 95% confidence intervals, are shown in Table 6.4:
MoB: a mixture of multivariate Bernoullis.
RBM: a RBM (briefly described in Chapter 1) made tractable by using
only 23 hidden units and trained by contrastive divergence with up to 25
steps of Gibbs sampling.
RBM mult: a RBM where units have been split into groups and within
each group only one unit can be active (equal to 1) at any time. This model
was proposed in [Larochelle et al. , 2010].
RBForest: an RBM where the activation of hidden units within a group
obeys tree constraints. This model was proposed in [Frey, 1998].
FVSBN: A Fully Visible Sigmoid Belief Network, as described in Chapter 1.
Table 6.4: Average Log Likelihood (ALL) for other models in [Larochelle et al., 2011]. These results were not reproduced by us, but we will use them in our comparisons. Averages were normalized by subtracting from them the average ALL of the Mixture of Bernoullis (MoB) model, shown in the MoB column. Higher values correspond to better density estimation. Entries marked "–" could not be recovered.

Dataset           MoB           RBM            RBM mult.      RBForest       FVSBN          Normalization
adult             0.00 ± 0.10   4.18 ± 0.06    4.15 ± 0.06    4.12 ± 0.06    7.27 ± 0.04    20.44
connect-4         0.00 ± 0.04   0.75 ± 0.02    1.72 ± 0.03    0.59 ± 0.02    11.02 ± 0.01   23.41
dna               0.00 ± 0.53   1.29 ± 0.48    1.45 ± 0.40    1.39 ± 0.49    14.55 ± 0.50   98.19
mushrooms         –             –              –              –              –              –
nips-0-12         0.00 ± 1.12   12.65 ± 1.07   11.25 ± 1.06   12.61 ± 1.07   13.14 ± 0.98   290.02
ocr-letters       0.00 ± 0.32   2.49 ± 0.30    0.99 ± 0.29    3.78 ± 0.28    1.26 ± 0.23    40.56
rcv1              0.00 ± 0.11   1.29 ± 0.11    0.04 ± 0.11    0.56 ± 0.11    2.24 ± 0.11    –
web               0.00 ± 0.23   0.78 ± 0.20    0.02 ± 0.21    0.15 ± 0.21    0.81 ± 0.20    –
binarized mnist   0.00          51.3           –              –              40.19          –
We will describe the main findings regarding the average ALL values in words. The comments below, unless stated otherwise, refer to Tables 6.5 and 6.6:
Low-order batch methods vs the rest: As expected, batch methods below second order (BGD variants, DIBGD and CD) give optima of lesser quality than those found by L-BFGS, stochastic and minibatch methods. First and lower order batch methods do not use the curvature of the objective function, nor do they gain from the stochasticity of stochastic and minibatch methods to escape local optima. There are, however, some datasets that appear to be "easy" to train on (adult, mushrooms, connect-4). In these cases, the relative difference in ALL between BGD variants and the other methods is considerably smaller. In fact, for adult, the results for BGD and MBGD even overlap. In some "difficult" datasets (binarized mnist, ocr-letters, connect-4) we can also observe some overlap between the resulting ALL of batch and minibatch methods.
Coordinate descent: In these runs we used the CD basicGD version as
we only implemented CD basicNewton in the last stages of the dissertation.
As a result the coordinate descent version here was slow. It terminated in
most cases due to MAX ITERS being reached. So it performs worse than
all other methods, especially for problems with many parameters (i.e. large
H and/or D). An indication for this is that the relative performance of
Table 6.5: Average Log Likelihood (ALL) for each optimization method training
NADE with H = 5 hidden units. Averages were normalized by subtracting them
by the average ALL of the Mixture of Bernoullis (MoB model), shown in 6.4.
Higher values correspond to better density estimation.
Method
BGD basic
adult
5.90
0.18
BGD heavyball
6.13
0.17
6.16
BGD bolddriver
0.23
6.22
BGD rprop
0.21
CD basicGD
3.93
0.24
L-BFGS
7.12
0.11
5.09
DIBGD basic
0.26
7.08
SGD basic
0.06
SGD heavyball
7.08
0.05
7.10
ASGD basic
0.05
IAGD basic
7.02
0.08
6.22
MBGD basic
0.13
MBGD heavyball
6.27
0.14
MBGD bolddriver 6.14
0.23
Normalization
20.44
* Results for Nruns = 2.
connect-4
dna
5.92
0.03
6.18
0.03
6.13
0.05
6.14
0.05
2.99
0.3
7.52
0.02
5.11
0.3
7.45
0.01
7.57
0.02
7.59
0.02
7.42
0.03
7.01
0.04
7.07
0.03
6.88
0.06
23.41
1.23
0.46
1.27
0.45
1.28
0.46
1.20
0.44
5.93
0.51
5.96
0.40
6.02
0.38
6.02
0.33
3.92
0.37
3.99
0.47
98.19
3.24
0.06
3.22
0.06
3.20
0.06
3.23
0.06
1.99
0.05
2.03
0.08
14.46
2.27
1.59
2.31
1.47
2.30
1.49
2.30
1.63
5.38
1.66
5.21
1.60
5.33
1.62
5.36
1.25
3.84
1.54
3.93
1.69
290.02
1.24
0.58
1.30
0.60
1.24
0.62
1.37
0.55
7.72
0.06
7.58
0.57
7.58
0.54
7.61
0.58
2.05
0.59
2.04
0.61
40.56
rcv1
web
binarized
mnist
0.31
0.19
20.31
0.08 0.29
5.12
0.29
0.19
26.33
0.07 0.31
4.73
0.28
0.21
24.29
0.08 0.33
3.45
0.32
0.21
25.25
0.08 0.30
4.87
0.61
0.72
31.49
0.29 0.29
4.29
0.57
0.67
30.20
0.07 0.27
3.53
0.60
0.70
29.34
0.05 0.25
3.37
0.59
0.72
30.06
0.05 0.22
3.40
0.34
0.55
24.64
0.08 0.27
3.81
0.33
0.59
25.15
0.08 0.24
2.55
CD basicGD compared to all other methods is lower in the case of connect-4 than in adult, and, more convincingly, from the fact that coordinate descent gives worse ALL results for H = 50 than for H = 5. This result is not normal and happens only with this method. It is a result of the method not getting enough time to converge. Again, this is perhaps mainly a weakness of the implementation we used, and we mention it mainly to explain the results. A different implementation of Coordinate Descent might have been on par with the other methods. However, the large parameter space suggests that Coordinate Descent might be a poor choice for these problems in any form that doesn't update parameters in blocks.
Dynamically increasing batch Gradient Descent: Contrary to what we hoped, DIBGD basic is the second-worst performing method (in terms of ALL attained). It even performs considerably worse than BGD basic, the method to which it is supposed to be asymptotically equivalent. An explanation for this is that DIBGD overfits on the initial batch, as we did not perform any sort of regularization. Trapped in a local minimum, the weights grow dramatically and subsequent increases of the minibatch could not counter this effect.
Table 6.6: Average Log Likelihood (ALL) for each optimization method training
NADE with H = 50 hidden units. Averages were normalized by subtracting
them by the average ALL of the Mixture of Bernoullis (MoB model), shown in
6.4. Higher values correspond to better density estimation.
Method
BGD basic
adult
6.53
0.07
BGD heavyball
6.78
0.07
6.89
BGD bolddriver
0.08
6.74
BGD rprop
0.09
CD basicGD
2.17
0.10
L-BFGS
7.22
0.07
4.91
DIBGD basic
0.11
7.17
SGD basic
0.02
SGD heavyball
7.19
0.02
7.20
ASGD basic
0.02
IAGD basic
7.17
0.05
6.79
MBGD basic
0.05
MBGD heavyball
6.83
0.07
MBGD bolddriver 6.17
0.09
Normalization
20.44
* Results for Nruns = 2.
connect-4
dna
6.12
0.03
6.41
0.03
6.32
0.04
6.47
0.05
1.41
0.07
9.15
0.09
4.16
0.07
9.18
0.02
9.20
0.02
9.21
0.02
9.07
0.04
7.18
0.03
7.43
0.02
7.13
0.04
23.41
4.24
0.46
4.76
0.48
4.72
0.53
4.71
0.55
9.13
0.46
9.04
0.30
9.08
0.27
9.10
0.33
6.44
0.39
6.49
0.33
98.19
4.15
0.03
4.06
0.03
4.07
0.03
4.07
0.03
3.56
0.04
3.59
0.02
14.46
4.79
1.26
4.77
1.25
4.89
1.28
4.79
1.32
10.79
1.52
10.60
1.14
10.64
1.17
10.69
1.14
7.50
1.23
7.72
1.26
290.02
4.56
0.45
4.70
0.48
4.50
0.48
4.56
0.52
10.98
0.56
9.97
0.32
10.02
0.37
10.06
0.33
6.74
0.39
6.78
0.36
40.56
rcv1
web
binarized
mnist
0.36
1.35
27.48
0.19 0.22
5.13
0.34
1.41
26.82
0.18 0.24
4.22
0.42
1.45
24.46
0.22 0.26
6.46
0.33
1.45
28.17
0.21 0.26
5.57
0.62
1.79
36.27
0.18 0.16
5.27
0.70
1.75
41.11
0.09
0.09
3.17
0.73
1.75
40.07
0.09 0.11
3.25
0.73
1.76
46.62
0.10 0.09
3.40
0.46
1.60
32.11
0.14 0.16
4.83
0.45
1.63
34.10
0.17 0.15
3.99
Stochastic methods vs the rest: Stochastic methods give the best results among all lower order methods examined. They also demonstrate
a consistency in these results in that the variance of ALL values reached
is smaller than in any other group of methods. We did expect stochastic
methods to reach better optima due to the non-convexity of the objective.
The ALL of the stochastic methods is in most cases slightly lower than that
of L-BFGS, with a few cases even matching or exceeding that.
L-BFGS vs the rest: In terms of ALL results, training NADE with L-BFGS appears to outshine any other method. However, there is a caveat: the variance of the ALL of NADE models trained by L-BFGS is generally higher than that of stochastic methods (though not as high as in lower order batch methods). The reason for this difference in variance is that L-BFGS, at least in our case, is more prone to overfitting. In other words, L-BFGS does indeed (almost) always find better local optima of the objective
function than SGD when training (i.e. on the training set), but this can
lead it to overfit on the training set and fail to generalize on the test set.
In fact, had we shown4 the results for ALL on both training and test set
for L-BFGS and SGD, we would have seen a much larger difference on the
training set in terms of ALL of the solution found. This all happens because
we took no measures to protect L-BFGS against overfitting. On the other
hand, SGD uses early stopping.
Effect of heuristics (generally): The simple versions of each method
were in most cases outperformed by versions based on the various heuristics. The Heavy Ball version nearly always improves upon the basic version and Bold Driver and R-prop in BGD improve ALL significantly
over that of the basic version in many cases. Exceptions to this rule are
MBGD bolddriver and IAGD basic. Interestingly, the improvement that
the heuristics offer over the basic version is far more pronounced in batch
methods. At the other extreme, SGD seems little affected by the heuristic
modifications we applied to it, although the effect is consistently positive.
Effect of heuristics (Heavy Ball): In nearly every case, a method
that uses the Heavy Ball heuristic achieves better performance than the
basic version. In batch methods this happens because the memory of the
past updates can help the method avoid bad local optima (certainly not to the extent that the noise in SGD does, though) by "rolling over" them, to keep with the ball analogy. On the other hand, in stochastic methods
the Heavy Ball heuristic diminishes the stochasticity of the updates. In
minibatch methods it can potentially do either of these two (i.e. jump
over local minima and do a sort of weighted averaging over past updates)
depending on the minibatch size.
Effect of heuristics (Bold Driver and R-prop): Both Bold Driver and Resilient Backpropagation generally improve over the simple versions.
(4: Records were not retained to be included in this discussion, but we observed this during our experiments.)
On datasets with a large number of features (binarized mnist, nips-0-12, web, dna), NADE using fewer
hidden units performs considerably worse. On the other hand, on datasets with a small number of features (adult, mushrooms, connect-4), NADE's performance seems little affected by H. Of course, D is not the sole factor that determines the number of hidden units needed to encode the dataset with little loss. Another factor is the amount of redundancy (meaning inter-feature dependencies in this case) in the data. As we saw above, adult and mushrooms seem to be the easiest of the datasets, a fact that is apparent from the small variance of the results obtained, not only per method of training, but per model (as apparent in the results shown in Table 6.4). Apparently the resulting ALL also exhibits small variance with respect to the choice of H, which might imply a high amount of redundancy in these datasets.
Effect of the number of instances T : The only observation that we can
make regarding the effect of the size of the dataset on the ALL results is that
the advantage of mini-batch methods over batch methods diminishes as T
increases. In theory the contrary should happen, due to the stochasticity of
the MBGD updates allowing it to escape local optima. This inconsistency
in our case is most likely caused by the way we choose the batchsize (for
large datasets even 2% of them is still a large dataset).
Comparison of NADE's results to those of other models: If we compare our results with those in Table 6.4, we will notice that in many cases (e.g. in adult) NADE outperforms most of the baseline models there, even when trained by a lower order batch method, and even when using as few as H = 5 hidden units. The only exception is FVSBN, which generally achieves a performance comparable to that of NADE trained with H = 500 hidden units using SGD (and presumably L-BFGS, although we did not train a NADE with H = 500 hidden units using L-BFGS in these experiments).
Comments on variance: The variance of the ALL attained seems to be greater for batch methods than for minibatch methods, and greater for minibatch methods than for stochastic methods. This gives rise to an interesting observation: the noisier the updates of an optimization method are, the more robust the method is in a non-convex setting. This appears counter-intuitive at first, but it follows from the fact that noisy updates let us escape poor local optima, so the possible outcomes in terms of ALL become fewer.
6.4 Number of Iterations and Execution Times
Now we will present the average number of iterations needed for each method
to converge and the number of iterations spent on hyperparameter optimization
for each of the datasets. These numbers are not to be directly compared as the
computational cost per iteration differs dramatically across methods and across
datasets. It is interesting, however, to compare in each case the number of iterations spent on hyperparameter optimization and on optimization itself. It
gives us a nice picture of what price we pay computationally for adding extra
hyperparameters.
Below we give results only for the runs for H = 50. For H = 5 and H = 500
the parameter space of the objective function becomes significantly smaller and
bigger, respectively. As a result, the number of iterations a method will need until
it converges will be different. It is therefore not safe to make any assumptions
about the number of iterations needed for H = 5 and H = 500 based on the
results below for H = 50. That said, there are some observations we can make
regarding the number of iterations which are likely to hold even for different
values of H:
Observations about methods: Notice that CD basicGD always uses the full number of iterations (i.e. it terminates by reaching MAX ITERS without converging) in its optimization run. This explains its bad performance, which we have already attributed to our poor choice of line search (the BGD step).
BGD bolddriver and BGD rprop take fewer iterations than BGD basic to
terminate. They mainly do so due to the learning rate(s) assuming an
extreme value. This effect on MBGD bolddriver might explain its worse
average ALL compared to MBGD basic.
Heavy Ball methods converge in fewer iterations than their basic counterparts. The parameter space is large and the objective function non-convex, so differences in curvature along different directions are to be expected. The result is the zig-zagging problem we discussed in Chapter 3, and the Heavy Ball heuristic helps counter it and speed up convergence.
ASGD basic and IAGD basic also converge in fewer iterations than SGD basic. They require the same number of hyperparameters as SGD basic, unlike SGD heavyball, which also has the momentum parameter.
Finally, to state the obvious, L-BFGS requires the fewest iterations to converge, and it requires no hyperparameter optimization, which is a huge advantage over the other methods examined. BGD methods take more iterations than L-BFGS but fewer than MBGD and SGD, and SGD methods require the most. Still, this means little without factoring in the individual cost per iteration.
Observations about datasets: The datasets that need the fewest iterations are adult, mushrooms, dna and nips-0-12. The first two are apparently easy to train on, so the optimizers find a good local optimum quickly. The other two appear to be among the most difficult datasets; here termination is fast because the optimizers get stuck quickly in not necessarily good local optima and early stopping is triggered.
Also, we can notice that on easy datasets the hyperparameter optimization procedure takes more time. This happens because the outcome of the optimization on these datasets is less sensitive to the hyperparameters than on more difficult ones. As a result, more hyperparameter configurations are run for all (or at least closer to) 20 iterations (the MAX ITERS or MAX EPOCHS value for hyperparameter optimization) without early stopping being triggered. For example, compare the iterations spent on hyperparameter selection for adult and mushrooms (easy) to those for dna and nips-0-12 (difficult) in Table 6.7. This, combined with the fact that the optimization itself requires fewer iterations, widens the gap in iterations between hyperparameter selection and the optimization process. See, for instance, the extreme cases of BGD bolddriver and BGD rprop on mushrooms, where the optimizer needs an overhead of about 20 times the iterations it will spend in optimization just to set its hyperparameters. A lesson from this observation is that in such easy cases we should not dedicate a large part of our resources to hyperparameter optimization, since (i) a large range of configurations will offer reasonably good results, and (ii) hyperparameter optimization will take more iterations than on more difficult datasets, since it will reject hyperparameter combinations early less often.
Now we present the average time per iteration for each method group on each
dataset. In this case, getting a rough estimate of the corresponding average time
per iteration for the case of H = 5 and H = 500 is simpler. Since the cost of training NADE is O(HD) we simply need to multiply the results below by 0.1 and 10
respectively. Again, this is an idealized projection that disregards constant computational overheads, different for each method and subject to implementation
details.
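To make the scaling concrete (an idealized calculation that ignores the overheads just mentioned): since the per-iteration cost grows linearly with H, t_iter(H = 5) ≈ (5/50) · t_iter(H = 50) = 0.1 · t_iter(H = 50), and t_iter(H = 500) ≈ (500/50) · t_iter(H = 50) = 10 · t_iter(H = 50).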
Finally, by multiplying the number of iterations from Table 6.7 by the corresponding time from Table 6.8 we can get the total time the execution of each method takes on average. The results are shown in Table C.1 of Appendix C, where again we show times for hyperparameter selection (top row), times for optimization (middle row) and total times (bottom row). In Table 6.9 we show only the total times, normalized (all times divided by 19021.1438 = maximum × 10^-1) to enhance readability. Again, we can see that L-BFGS is the fastest of the methods examined.
6.5 Discussion
As expected, L-BFGS and SGD and its variants outperform all the other methods in terms of the quality of the optima they find. L-BFGS gives us the best ALL results; its iterations are the slowest among the methods we examined here, but it has the benefit of not requiring fine-tuning of hyperparameters such as the step size. This, together with the fact that it requires fewer iterations than any other method, also makes it the fastest among the methods examined. Its main drawback is that it is more prone to overfitting than SGD. Evidence for this is its greater variance in average ALL attained and its larger gap between the values of the objective function, after optimization, on the training and the test set.
Table 6.7: Average number of iterations until termination for each optimization method training NADE with H = 50 hidden units. First row shows the average number of iterations for the hyperparameter optimization step, second row the average number of iterations for the parameter optimization step (the execution of the optimizer), and third row the sum of these two. The numbers were computed from the average iterations or epochs until termination, rounded up to integers and then multiplied by batchsizes and numbers of instances where appropriate.
[Table body: one row triple (hyperparameter selection, optimization, total) per method — the BGD, CD, DIBGD, L-BFGS, SGD, ASGD, IAGD and MBGD variants — and one column per dataset (adult, connect-4, dna, rcv1, web, binarized mnist).]
Most likely, L-BFGS should be the method of choice for training NADE, provided that a heuristic such as early stopping is applied to protect it from overfitting. SGD is therefore more robust. We could say that L-BFGS is more suitable when the goal is optimization itself, while SGD is better suited for learning tasks, where the goal is to generalize. Pure SGD is fairly slow, but its variants help speed it up.
Table 6.8: Average time (in seconds × 10^-5) for one iteration of each method group, on each dataset. These times were all computed on the same machine.
[Table body: one row per method group, one column per dataset (adult, connect-4, dna, rcv1, web, binarized mnist).]
Table 6.9: Average time (in seconds × 10^-5) until termination for each optimization method training NADE with H = 50 hidden units. First row shows the average time for the hyperparameter optimization step, second row the average time for the parameter optimization step (the execution of the optimizer), and third row the sum of these two. All times are divided by 19021.1438 = maximum × 10^-1.
[Table body: one row triple per method and one column per dataset.]
The gains that SGD heavyball offers in speed and in the value of the objective function attained are small compared to those that ASGD offers. IAGD, on the other hand, seems to be problematic, at least in the non-convex, high-dimensional, relatively large-scale task we examine. SGD methods lead to faster optimization than BGD methods, and this effect becomes more pronounced the larger the dataset is, as a result of redundancies (i.e. many similar instances) within the dataset.
Generally, the various heuristics seem to improve upon the basic versions, both in speed of convergence and in the quality of the optima found. The Heavy Ball heuristic appears to be more effective on batch methods and on minibatch methods with large batch sizes than on SGD. Bold Driver and R-prop help mainly by speeding up convergence, in some cases significantly. In fact, on some datasets the fastest methods prove to be BGD rprop (more often) and BGD bolddriver. Of course, they do not match the quality of optima found by SGD and L-BFGS.
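For concreteness, below is a minimal sketch of the Bold Driver rule in Python, using the rule-of-thumb increase/decrease factors 1.1 and 0.5 mentioned later in this section; R-prop applies the same idea per parameter, based on the sign of successive partial derivatives. All names are illustrative and the objective/gradient functions are placeholders, not our implementation.

    def bold_driver_gd(theta, objective, gradient, step_size=0.1,
                       inc=1.1, dec=0.5, max_iters=100):
        """Batch gradient descent with the Bold Driver heuristic: grow the
        step size after a successful step, shrink it (and reject the step)
        after an unsuccessful one."""
        f_old = objective(theta)
        for _ in range(max_iters):
            theta_new = theta - step_size * gradient(theta)
            f_new = objective(theta_new)
            if f_new < f_old:              # objective decreased: accept, speed up
                theta, f_old = theta_new, f_new
                step_size *= inc
            else:                          # objective increased: reject, slow down
                step_size *= dec
        return theta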
We had hoped that MBGD would match the performance of SGD. In general it did not, although a different choice of batchsizes might have helped it become more competitive. MBGD methods do improve upon batch methods, though only in terms of the optima found. Their execution times, however, suffer from the additional hyperparameter that needs to be set (the batchsize). Had we instead fixed the batchsize to any of the values of the grid, MBGD would most likely still perform better than BGD, both in terms of optima found (see in Tables 6.5 and 6.6 the big difference in ALL between BGD and MBGD methods) and in terms of execution times (compare only the optimization times in Table C.1 of Appendix C, ignoring hyperparameter optimization; generally, the larger the dataset, the faster MBGD methods are compared to BGD).
To close our discussion of the methods, DIBGD basic, although promising as an idea, seems not to have been properly implemented here. No measure was taken to prevent it from overfitting, and the non-convexity of the objective function of NADE appears to make the optimizer get stuck early in local optima.
As for the hyperparameters selected in each case, although we did not keep proper records of the hyperparameter vectors selected, we made some general observations during the experiments that might be useful in further studies. On easy datasets (the extremes being adult and mushrooms) we observed more variation in the choice of hyperparameters per run than on difficult ones (the extremes being dna and nips-0-12). The (initial) learning rate, and to a lesser extent the parameter C that controls its decay, exhibited the highest variation among datasets. Experimentation showed that further tweaking the learning rate to values not on the grid we searched could lead to improved performance (faster convergence/better optima). The 20 epoch/20 iteration limit might have been harsher on batch methods, forcing them to terminate hyperparameter optimization runs far from the optimum and, as a result, to choose suboptimal learning rates, particularly favouring larger values. The momentum hyperparameter appeared to be less important than the step size, and values within the same scale did not considerably affect the end result. As we said, the way we picked the batchsize in MBGD was probably not ideal. In retrospect, by fixing the batchsize to, say, 2% of the full batch, MBGD methods could have achieved similar or better performance in terms of ALL without paying the heavy computational price of an extra free hyperparameter. Similarly, the increase and decrease factors of Bold Driver and R-prop in the BGD methods should have been set to the rule-of-thumb values 1.1 and 0.5 to remove the hyperparameter optimization overhead for BGD bolddriver and BGD rprop. However, these two methods were very sensitive to these hyperparameters, so we should first experiment to select appropriate values for them on each dataset separately.
In this project we did not go into a detailed study of the properties of the individual datasets beforehand, as our goal was to build a framework that could be applied to any dataset, allowing us to compare the methods. Hence the broad range of hyperparameters on the grid (5 choices per hyperparameter) and the universality of the termination criteria across datasets. It is generally a good idea to get a feel for a dataset first, perhaps using only parts of it, to see how easy it is to train on (how sensitive it is to the choice of method or the setting of hyperparameters, and how different the results are when training on different subsets of it or when using different initializations). If we detect high variance in the results, this is an indication that we need careful fine-tuning of the hyperparameters. If the variance is small, we can perhaps afford to dedicate few resources to hyperparameter optimization. Early stopping conditions in difficult-to-optimize hyperparameter spaces should be more forgiving, allowing more iterations without improving on the best solution found so far, while for easy datasets we can be stricter. The size of the parameter space (which in many models, such as NADE, depends on the dimensionality of the dataset) should also affect the parameter initialization.
Having seen how the individual methods perform, we suggest using L-BFGS as a first option, since it is faster and does not require any hyperparameters to be set. Ideally, measures should be taken to prevent it from overfitting. In the case of large datasets (judging from our results, this means roughly 30000 instances or more), SGD methods might be a better choice, especially ASGD. In any case each dataset is different (so each parameter space is different), and other factors such as redundancy can give an advantage to SGD. In large-scale non-convex problems we should probably avoid using BGD.
Chapter 7
Conclusion and Future Work
In this study we examined optimization in the context of machine learning. We covered the basic steps of implementing and using optimization algorithms: from understanding the ideas behind the basic algorithms themselves to optimizing the hyperparameters involved, and from checking the gradient code and protecting the methods from overfitting to evaluating the methods and choosing the most suitable among them. We focused on a non-convex objective function with a high-dimensional parameter space and used datasets of small to fairly large size; however, most of this work carries over to general optimization problems in machine learning.
Among the methods we examined, the ones best suited to our task are L-BFGS and SGD. L-BFGS is faster in most cases and always finds slightly better local optima, but it is more likely to overfit the training data than SGD. The most successful variant of SGD was ASGD, which improves both the speed (on large datasets it is even faster than L-BFGS) and the quality of the optima found by SGD, without adding any additional hyperparameters.
We also verified the power of the NADE model, showing that it can outperform simpler models even when trained with first order batch methods, despite the objective function being non-convex, and even with a small number of hidden units. An interesting direction to explore in further studies would be to find the minimal value of H in NADE that leads to the optimal density estimation results on each of the datasets. The optimal value of H could also indicate the amount of redundancy (interdependencies among features) within the datasets.
This study was far from exhaustive. We did not cover higher order stochastic methods such as Stochastic Quasi-Newton [Bordes et al., 2009]. We also did not cover many popular batch methods such as Conjugate Gradient and members of the Quasi-Newton family like Barzilai-Borwein [Barzilai et al., 1988] (Barzilai-Borwein approximates the Hessian by a diagonal matrix to obtain a step size for gradient descent). Both are implemented in the minfunc() package, so the comparisons could easily be extended to include them as well.
There are possible directions to explore even with the methods we experimented with. We could combine different methods in different stages of the optimization. For example, we could start with a stochastic method and, once training has progressed (and we are nearing a local optimum), switch to a batch method for the few remaining iterations. A simple way to mimic this approach is to use a minibatch method with batch sizes that increase every few epochs, as sketched below. The Dynamically Increasing Batch size methods do something similar, but as we saw they did not perform well on our task. We hypothesized that the poor performance was due to the absence of regularization combined with the non-convexity of the objective function.
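A minimal sketch of this growing-batch idea, assuming a generic gradient function over a minibatch of training instances (all names here are illustrative, not part of our implementation):

    import numpy as np

    def sgd_with_growing_batches(theta, data, grad_minibatch, step_size=0.01,
                                 batch_size=20, growth=2, grow_every=5, epochs=30):
        """Minibatch gradient descent whose batch size grows by a factor of
        `growth` every `grow_every` epochs, moving gradually from a stochastic
        to an (almost) batch method."""
        T = data.shape[0]
        for epoch in range(epochs):
            perm = np.random.permutation(T)
            for start in range(0, T, batch_size):
                batch = data[perm[start:start + batch_size]]
                theta = theta - step_size * grad_minibatch(theta, batch)
            if (epoch + 1) % grow_every == 0:
                batch_size = min(T, growth * batch_size)
        return theta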
An interesting experiment here would be to use DIBGD basic to optimize a convex objective function. If the method proved successful, we would have another indication that our hypothesis is correct, namely that it gets trapped in local optima and overfits the training data in a non-convex setting. Furthermore, we could explore other methods of the Dynamically Increasing Batch Size family by applying its idea of growing the batch size based on the result of a statistical test to other batch methods. These could be methods we already examined here, e.g. BGD rprop or L-BFGS, which proved better suited to the problem than simple BGD.
Another heuristic we could use is to execute the optimization algorithm starting from various initializations and keep the best result as evaluated on the validation set; a sketch of this is given below. This idea could be used to give us a good initialization of the parameters, and we could also compare how well different methods perform as initializers for others. Another unexplored idea would be a minibatch version of L-BFGS; this might help improve both speed (by exploiting redundancy in the dataset) and quality of optima (by adding some amount of stochasticity to the updates).
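A minimal sketch of the restart heuristic described above; train and validation_loss stand in for whichever optimizer and validation criterion (e.g. the negative ALL on the validation set) are being used, and are not part of our code.

    def best_of_restarts(n_restarts, init_params, train, validation_loss):
        """Run the optimizer from several random initializations and keep the
        parameters that do best on the validation set."""
        best_theta, best_loss = None, float('inf')
        for _ in range(n_restarts):
            theta = train(init_params())      # full optimization run from a fresh init
            loss = validation_loss(theta)
            if loss < best_loss:
                best_theta, best_loss = theta, loss
        return best_theta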
Finally, since our results indicated that L-BFGS is prone to overfitting, a good idea would be to modify L-BFGS to include some form of overfitting control. A simple example would be to use early stopping. Another approach would be to use a regularized objective (e.g. with L1-regularization) and combine it with a variant of L-BFGS like Orthant-wise limited-memory quasi-Newton (OWL-QN), which is specifically designed for fitting L1-regularized models [Galen et al., 2007].
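One way early stopping could be grafted onto L-BFGS, which we did not implement, is to run the optimizer in short segments and stop when the validation loss no longer improves. The sketch below uses SciPy's L-BFGS-B routine purely as a stand-in for the minfunc() implementation; train_obj_grad is assumed to return the objective value and its gradient, and all names are illustrative.

    from scipy.optimize import minimize

    def lbfgs_with_early_stopping(theta0, train_obj_grad, validation_loss,
                                  segment_iters=10, patience=3, max_segments=50):
        """Run L-BFGS in short segments and keep the parameters that did best
        on the validation set; stop after `patience` segments without improvement."""
        theta, best_theta = theta0, theta0
        best_loss, bad_segments = float('inf'), 0
        for _ in range(max_segments):
            res = minimize(train_obj_grad, theta, jac=True, method='L-BFGS-B',
                           options={'maxiter': segment_iters})
            theta = res.x
            loss = validation_loss(theta)
            if loss < best_loss:
                best_theta, best_loss, bad_segments = theta, loss, 0
            else:
                bad_segments += 1
                if bad_segments >= patience:
                    break
        return best_theta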
We also saw that variants of stochastic gradient descent such as ASGD work very well and appear to offer more consistent results. However, they are in most cases slower than L-BFGS and require fine-tuning of their learning rate. We could modify ASGD to increase its speed by exploiting the sparsity of the datasets (and of the resulting update vectors); a version of ASGD that takes sparsity into account is presented in [Xu, 2012]. Another interesting variant to explore is the version of SGD proposed in [Schaul et al., 2012], which automatically adjusts the step size on every iteration so as to minimize the expected error after the next update. The same method can be applied to individual (or even grouped) learning rates per parameter. Using this version of SGD, there is no longer any need to fine-tune the step size (or the parameter C that controls its decay in our version). This would effectively remove all need for hyperparameter optimization in our SGD implementation.
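For reference, a minimal sketch of plain ASGD (SGD that returns the running average of its iterates) is given below. The decaying schedule eta0 / (1 + C*t) is one common choice in the spirit of the step-size decay we describe; all names are illustrative rather than those of our implementation.

    import numpy as np

    def asgd(theta, data, grad_single, eta0=0.05, C=1e-4, epochs=5):
        """Averaged SGD: take ordinary SGD steps but return the running average
        of the iterates, which smooths out the noise of the individual updates."""
        theta_bar, t = theta.copy(), 0
        T = data.shape[0]
        for _ in range(epochs):
            for i in np.random.permutation(T):
                t += 1
                eta = eta0 / (1.0 + C * t)            # decaying step size
                theta = theta - eta * grad_single(theta, data[i])
                theta_bar += (theta - theta_bar) / t  # running average of iterates
        return theta_bar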
As for hyperparameter optimization in general, we saw that for many of the methods examined it plays an important role. We also saw that performing a combinatorial search over a grid in hyperparameter space can become the bottleneck of the entire procedure (for more than 2 hyperparameters with 5 options each). In [Bergstra et al., 2012] the authors propose random search as an alternative to grid search. They argue that only a few of the hyperparameters really matter for most datasets (and not the same ones in all). When doing, say, an n × n grid search, we only examine n values per hyperparameter. If, on the other hand, we examine n² random hyperparameter configurations, we can cover up to n² different values per hyperparameter at the same computational cost. Thus, by doing a random search we can find hyperparameter configurations that are as good as or better than those found by a grid search, but much faster. In [Bergstra et al., 2011] the authors present two greedy sequential hyperparameter optimization techniques capable of dealing with both discrete and continuous hyperparameters. Both methods take into account priors over the hyperparameters and are based on the optimization of the Expected Improvement criterion [Jones, 2001]. When compared to random search in the hyperparameter space and to manual grid-aided search, they outperform both. It would be a good idea to explore alternative schemes such as these to automatically select the optimal hyperparameter configuration.
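As an illustration of the random-search alternative, the sketch below draws hyperparameter configurations at random instead of walking a grid; the sampled ranges and the evaluation function (e.g. the validation ALL after a short training run) are placeholders, not the grid we actually used.

    import numpy as np

    def random_search(n_configs, evaluate):
        """Random hyperparameter search: each trial draws every hyperparameter
        independently, so n_configs trials can cover up to n_configs distinct
        values per hyperparameter (a grid of the same cost covers far fewer)."""
        best_config, best_score = None, float('inf')
        for _ in range(n_configs):
            config = {
                'step_size': 10 ** np.random.uniform(-4, 0),   # log-uniform draw
                'momentum': np.random.uniform(0.0, 0.99),
                'batch_size': int(np.random.choice([10, 20, 50, 100, 200])),
            }
            score = evaluate(config)   # lower is better, e.g. validation loss
            if score < best_score:
                best_config, best_score = config, score
        return best_config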
Bibliography
[Barzilai et al., 1988] Barzilai J. & Borwein J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8, 141-148.
[Bengio et al., 2007] Bengio Y., Lamblin P., Popovici P. & Larochelle H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA.
[Bengio, 2009a] Bengio Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, vol. 2, iss. 1, pp. 1-127.
[Bengio, 2009b] Bengio Y., Louradour J., Collobert R. & Weston J. (2009). Curriculum Learning. Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML 2009), ACM.
[Bengio et al., 2000] Bengio Y. & Bengio S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 400-406. MIT Press.
[Bengio, 2012] Bengio Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. CoRR abs/1206.5533.
[Bergstra et al., 2011] Bergstra J., Bardenet R., Bengio Y. & Kegl B. (2011). Algorithms for Hyper-parameter Optimization. NIPS 2011.
[Bergstra et al., 2012] Bergstra J. & Bengio Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.
[Bishop, 1995] Bishop C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
[Bordes et al., 2009] Bordes A., Bottou L. & Gallinari P. (2009). SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737-1754.
[Bottou et al., 2004] Bottou L. & LeCun Y. (2004). Large Scale Online Learning. Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA.
[Bottou et al., 2007] Bottou L. & Bousquet O. (2007). The Tradeoffs of Large Scale Learning. In Proceedings of NIPS 2007.
[Broyden, 1970] Broyden C. G. (1970). The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and Its Applications, 6, 76-90.
[Boyd et al., 2004] Boyd S. P. & Vandenberghe L. (2004). Convex Optimization. Cambridge University Press, p. 129. ISBN 978-0-521-83378-3.
[Boyles et al., 2011] Boyles L., Korattikara A., Ramanan D. & Welling M. (2011). Statistical Tests for Optimization Efficiency. NIPS 2011.
[Collobert et al., 2009] Collobert R. & Weston J. (2009). A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML 2009), ACM.
[Deng et al., 2011] Deng L. & Yu D. (2011). Deep Convex Network: A Scalable Architecture for Deep Learning. Interspeech 2011, pp. 2285-2288.
[Fletcher, 1970] Fletcher R. (1970). A New Approach to Variable Metric Algorithms. Computer Journal, 13 (3), 317-322.
[Frey, 1998] Frey B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.
[Galen et al., 2007] Galen A. & Jianfeng G. (2007). Scalable Training of L1-Regularized Log-Linear Models. 24th Annual International Conference on Machine Learning (ICML 2007).
[Goldfarb, 1970] Goldfarb D. (1970). A Family of Variable Metric Updates Derived by Variational Means. Mathematics of Computation, 24 (109), 23-26.
[Hahn et al., 2011] Hahn S., Lehnen P. & Ney H. (2011). Powerful extensions to CRFs for grapheme to phoneme conversion. In Proceedings of ICASSP 2011, pp. 4912-4915.
[Heigold et al., 2009] Heigold G., Rybach D., Schlüter R. & Ney H. (2009). Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition. Interspeech 2009.
[Hestenes et al., 1952] Hestenes M. R. & Stiefel E. (1952). Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards, 49 (6).
[Hinton et al., 2006a] Hinton G. E. & Salakhutdinov R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313 (5786), pp. 504-507.
[Hinton et al., 2006b] Hinton G. E., Osindero S. & Teh Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
[Hinton, 2010] Hinton G. E. (2010). A Practical Guide to Training Restricted Boltzmann Machines. Technical report.
Appendix A
Notational Conventions
Vectors are denoted by lowercase boldface letters or sometimes (mainly in pseudocodes) by variables written in bold starting with a lowercase letter (e.g. pass).
Matrices are denoted by uppercase boldface letters or sometimes (mainly in pseudocodes) by variables written in bold starting with an uppercase letter (e.g. H, Batch).
We use the notation A(1, :) in pseudocodes and A_{1,:} in formulae to refer to the first row of A (i.e. A(1, :) ≡ A_{1,:}, and it is a vector).
We use the notation x(t) to denote the t-th instance. We use x^T to denote the transpose of a vector x.
When defining a D-dimensional vector a by a ∈ R^D without specifying whether it is a column or a row vector, assume it is whichever of the two is required for the operations a is involved in to be defined.
We use {a(n)} to refer to the series a(0), a(1), a(2), ..., a(n).
We use f(x) interchangeably with f(x0) when it is clear from the context that we refer to the value of f(x) for x = x0. Similarly, we might just use f in some cases to denote the same thing (especially in pseudocodes) when doing so adds no ambiguity. This notational slip carries over to the partial derivatives, the gradient and the Hessian of f as well.
In pseudocodes, constants are usually written in plain uppercase characters (e.g. N, MAX ITERS). Their values are always explained in the main text and, whenever deemed necessary, in the algorithms themselves in comments or by an explicit value assignment. So if some symbol or term is not specified in a pseudocode, then it is a constant and it is bound to be explained in the main text, most likely in the subsection describing the pseudocode.
In pseudocodes, functions are also written in uppercase (e.g. OBJFUNCEVAL(), COMPUTE OBJECTIVE AND GRADIENT()). Most of these are
Appendix B
Table of Method Names
Table B.1: The methods discussed here, a brief description of each and a reference to the algorithm that describes it.
BGD basic: the simple version of batch gradient descent.
BGD heavyball: batch gradient descent using the Heavy Ball heuristic.
BGD bolddriver: batch gradient descent using the Bold Driver heuristic.
BGD rprop: batch gradient descent using the Resilient Backpropagation heuristic.
CD basicGD: coordinate descent using a batch gradient descent step to update each parameter.
CD basicNewton: coordinate descent using an inexpensive Newton step to update each parameter (only used in the logistic regression toy problem).
Newton: Newton's method (only used in the logistic regression toy problem).
L-BFGS: not implemented by us; the minfunc() implementation was used.
DIBGD basic: batch gradient descent with a dynamically increasing batch size.
[The algorithm references and the remaining rows of the original table are not reproduced here.]
Appendix C
Average Execution Times Table
Table C.1: Average time (in seconds × 10^-5) until termination for each optimization method training NADE with H = 50 hidden units. First row shows the average time for the hyperparameter optimization step, second row the average time for the parameter optimization step (the execution of the optimizer), and third row the sum of these two.
[Table body: one row triple per method and one column per dataset (adult, connect-4, dna, rcv1, web, binarized mnist).]