Bayesian Optimization for Machine Learning
Ryan P. Adams
School of Engineering and Applied Sciences
Harvard University
http://hips.seas.harvard.edu
Painful!
Requires many training cycles.
Possibly noisy.
[Figure: animation of sequential Bayesian optimization; each frame shows the GP posterior over the objective, the acquisition function, and the current best observation.]
Historical Overview
As response surface methods, they date back to Box and Wilson in 1951.

Today's Topics
Parallelizing training
Gaussian Processes
Mean function m : X → R
Covariance function C : X × X → R
Any finite set of N points {x_n}_{n=1}^N in X induces a homologous N-dimensional Gaussian distribution on R^N, taken to be the distribution on {y_n = f(x_n)}_{n=1}^N.
[Figure: sample functions drawn from the GP, with the induced Gaussian marginals at a finite set of input points.]
Gaussian Processes
Marginal likelihood:
ln p(y | X, θ) = −(N/2) ln 2π − (1/2) ln |K_θ| − (1/2) yᵀ K_θ⁻¹ y

Predictive distribution at test points {x_m}_{m=1}^M:
y_test ~ N(m, Σ), where
m = kᵀ K⁻¹ y
Σ = κ − kᵀ K⁻¹ k
(k: cross-covariance between training and test inputs; κ: covariance among the test inputs.)
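A minimal numpy sketch of these predictive equations, assuming a squared-exponential covariance and noise-free observations apart from a small jitter term (variable names here are illustrative, not part of the slides):

```python
import numpy as np

def se_kernel(A, B):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2)

def gp_predict(X, y, X_test, jitter=1e-8):
    """Posterior mean m = k' K^{-1} y and covariance S = kappa - k' K^{-1} k."""
    K = se_kernel(X, X) + jitter * np.eye(len(X))
    k = se_kernel(X, X_test)             # cross-covariance (train x test)
    kappa = se_kernel(X_test, X_test)    # covariance among test inputs
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    m = k.T @ alpha
    v = np.linalg.solve(L, k)
    S = kappa - v.T @ v
    return m, S
```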
Examples of GP Covariances

Squared-Exponential:
C(x, x′) = exp( −(1/2) Σ_{d=1}^D (x_d − x′_d)² )

Matérn:
C(r) = (2^{1−ν} / Γ(ν)) (√(2ν) r)^ν K_ν(√(2ν) r)

Neural Network:
C(x, x′) = (2/π) sin⁻¹( 2 xᵀ Σ x′ / √( (1 + 2 xᵀ Σ x)(1 + 2 x′ᵀ Σ x′) ) )

Periodic:
C(x, x′) = exp( −2 sin²( (x − x′)/2 ) )
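A short sketch of two of these covariances in Python; length scales and amplitudes are omitted for brevity, so treat the exact parameterization as an assumption rather than the deck's:

```python
import numpy as np

def matern52(r):
    """Matérn covariance with nu = 5/2, as a function of distance r >= 0."""
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def periodic(x, x2):
    """Periodic covariance between scalar inputs x and x2."""
    return np.exp(-2.0 * np.sin(0.5 * (x - x2))**2)
```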
Probability of Improvement
For x ∈ X, with GP posterior mean μ(x) and standard deviation σ(x):
γ(x) = ( f(x_best) − μ(x) ) / σ(x)
a_PI(x) = Φ(γ(x))
[Figure: probability-of-improvement acquisition plotted beneath the GP posterior.]
Expected Improvement
[Figure: expected-improvement acquisition plotted beneath the GP posterior.]
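With the γ(x) defined above, the expected improvement has the standard closed form a_EI(x) = σ(x)( γ(x) Φ(γ(x)) + φ(γ(x)) ). A small sketch of both acquisitions (minimization convention, scipy assumed; not the slides' code):

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, f_best):
    """Probability of improvement and expected improvement for minimization,
    given arrays of posterior means and standard deviations."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    gamma = (f_best - mu) / sigma
    a_pi = norm.cdf(gamma)                    # Phi(gamma)
    a_ei = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
    return a_pi, a_ei
```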
[Figure: GP sample functions under the Matérn covariance for ν = 1/2, 1, 3/2, 5/2.]
[Figure: classification error vs. number of function evaluations. Snoek, Larochelle & RPA, NIPS 2012.]
Integrated expected improvement: marginalize the acquisition over the GP hyperparameters θ,
â(x) = ∫ a(x; θ) p(θ | {x_n, y_n}_{n=1}^N) dθ
     ≈ (1/K) Σ_{k=1}^K a(x; θ^{(k)}),  θ^{(k)} ~ p(θ | {x_n, y_n}_{n=1}^N)
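A hedged sketch of this Monte Carlo average, assuming hyperparameter samples are already available (e.g., from slice sampling) and that ei_given_theta is a hypothetical helper returning EI under a fixed θ:

```python
import numpy as np

def integrated_acquisition(x_candidates, theta_samples, ei_given_theta):
    """Average the acquisition over K posterior hyperparameter samples."""
    total = np.zeros(len(x_candidates))
    for theta in theta_samples:
        total += ei_given_theta(x_candidates, theta)
    return total / len(theta_samples)
```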
[Figure: classification error vs. function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm. Snoek, Larochelle & RPA, NIPS 2012.]
[Figure: classification error vs. time (hours) for GP EI MCMC and GP EI per Second. Snoek, Larochelle & RPA, NIPS 2012.]
[Figure: classification error vs. time (hours). Snoek, Larochelle & RPA, NIPS 2012.]
Parallelizing Bayesian Optimization

Classical Bayesian optimization is sequential: do an experiment, wait until it finishes, repeat.
Compute clusters let us do many things at once.
How do we efficiently pick multiple experiments?
Fantasize outcomes from the Gaussian process: the probabilistic model gives us coherent predictions, so we can use them to imagine possible futures.

Excerpt from Snoek, Larochelle & RPA, NIPS 2012:

... we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x.

3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization. With the advent of multi-core computing, it is natural to ask how we can parallelize our Bayesian optimization procedures. More generally than simply batch parallelism, however, we would like to be able to decide what x should be evaluated next, even while a set of points are being evaluated. Clearly, we cannot use the same acquisition function again, or we will repeat one of the pending experiments. We would ideally perform a roll-out of our acquisition policy, to choose a point that appropriately balanced information gain and exploitation. However, such roll-outs are generally intractable. Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.

Consider the situation in which N evaluations have completed, yielding data {x_n, y_n}_{n=1}^N, and in which J evaluations are pending at locations {x_j}_{j=1}^J. Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations:

(7)  â(x; {x_n, y_n}, θ, {x_j}) = ∫_{R^J} a(x; {x_n, y_n}, θ, {x_j, y_j}) p({y_j}_{j=1}^J | {x_j}_{j=1}^J, {x_n, y_n}_{n=1}^N) dy_1 ⋯ dy_J

This is simply the expectation of a(x) under a J-dimensional Gaussian distribution, whose mean and covariance can easily be computed. As in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition.

Alternative: shrink the variance and integrate out the mean, as in Desautels, Krause & Burdick, JMLR 2014.
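A hedged sketch of the fantasy procedure: draw joint samples of the pending outcomes from the GP, condition on each fantasy, and average the acquisition. The helpers gp_predict and expected_improvement stand in for the earlier sketches; nothing here is Spearmint's actual API.

```python
import numpy as np

def mc_pending_acquisition(x_cand, X_done, y_done, X_pending,
                           gp_predict, expected_improvement,
                           n_fantasies=10, seed=0):
    """Monte Carlo estimate of the acquisition marginalized over possible
    outcomes of pending evaluations (cf. Eq. 7 above).

    gp_predict(X, y, X_star) -> (mean, covariance) and
    expected_improvement(mu, sigma, f_best) -> EI values are assumed helpers."""
    rng = np.random.default_rng(seed)
    mu_p, cov_p = gp_predict(X_done, y_done, X_pending)
    acq = np.zeros(len(x_cand))
    for _ in range(n_fantasies):
        y_fant = rng.multivariate_normal(mu_p, cov_p)   # fantasized outcomes
        X_aug = np.vstack([X_done, X_pending])
        y_aug = np.concatenate([y_done, y_fant])
        m, S = gp_predict(X_aug, y_aug, x_cand)
        sigma = np.sqrt(np.maximum(np.diag(S), 1e-12))
        acq += expected_improvement(m, sigma, y_aug.min())
    return acq / n_fantasies
```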
[Figure: out-of-sample perplexity vs. time (days). Snoek, Larochelle & RPA, NIPS 2012.]
Multi-Task Bayesian Optimization

Examples:
Object classification for different image types.
Speech recognition for different languages/speakers.
Different folds of the same cross-validation setup.

STL-10: 5K 96x96 natural images
CIFAR-10: 50K 32x32 natural images
Swersky, Snoek & RPA, NIPS 2013
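One common way to couple such tasks is a product (intrinsic coregionalization) kernel over inputs and task identities. The sketch below assumes that construction and is not necessarily the exact kernel used in the paper:

```python
import numpy as np

def multitask_kernel(X1, t1, X2, t2, input_kernel, task_cov):
    """Covariance between (input, task) pairs:
    k((x, t), (x', t')) = k_x(x, x') * B[t, t']."""
    Kx = input_kernel(X1, X2)            # covariance over inputs
    Kt = task_cov[np.ix_(t1, t2)]        # learned task-similarity matrix B
    return Kx * Kt
```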
[Figures: multi-task (MTBO) vs. single-task (STBO) Bayesian optimization; classification error and perplexity vs. function evaluations and wall-clock time, comparing independent vs. dependent predictions, including a digit recognition task and the time taken by MTBO vs. STBO.]
Noisy Constraints
objective function, constraint functions

Decoupled Constraints
objective function, constraint functions, confidence threshold
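A common way to fold such constraints into the acquisition, shown here only as an illustrative sketch rather than the method on the slides: weight the expected improvement by the modeled probability that every constraint is satisfied.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(ei, constraint_means, constraint_stds):
    """Weight EI by the probability that each modeled constraint c_k(x) >= 0.

    constraint_means / constraint_stds: arrays of shape (K, N) giving the GP
    posterior over each of K constraint functions at N candidate points."""
    prob_feasible = norm.cdf(constraint_means / np.maximum(constraint_stds, 1e-12))
    return ei * np.prod(prob_feasible, axis=0)
```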
[Figures: constrained Bayesian optimization results vs. function evaluations.]
[Figure: example input warpings; (a) linear, (b) exponential, (c) logarithmic, (d) sigmoidal.]
Input Warping for Nonstationary Objectives

[Table: average validation loss and standard deviation on hyperparameter benchmarks; the algorithm with the lowest validation loss is shown in bold, and on some of the benchmarks our algorithm converges to a solution in far fewer evaluations than the protocol allows. Reported rows include 7.3±0.2 / 8.2±0.6 / 6.88±0.0 with a budget of 40 evaluations, 1272.6±10.3 / 1271.5±3.5 / 1266.2±0.1 with 50, and 24.6±0.9 / 24.2±0.0 / 24.1±0.1 with 100.]
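A sketch of the beta-CDF warping idea: each input dimension, rescaled to [0, 1], is passed through a Beta CDF whose per-dimension shape parameters are learned jointly with the GP hyperparameters. The exact parameterization in Spearmint may differ.

```python
import numpy as np
from scipy.stats import beta

def warp_inputs(X, a, b):
    """Warp each dimension of X (assumed scaled to [0, 1]) through a Beta CDF.

    a[d], b[d] are per-dimension shape parameters treated as additional GP
    hyperparameters. Applying a stationary kernel to the warped inputs induces
    a non-stationary kernel on the original inputs."""
    return np.column_stack([beta.cdf(X[:, d], a[d], b[d])
                            for d in range(X.shape[1])])
```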
[Figures: GP EI MCMC in the original space, GP EI MCMC in log space, and the Warped GP on Logistic Regression, Online LDA, Structured SVM, and a CIFAR-10 subset; validation error / perplexity vs. function evaluations.]

Caption (paper excerpt): ... parameter benchmarks proposed in Eggensperger et al. (2013). We compare to SMAC (Hutter et al., 2011) and the Tree Parzen Estimator (TPE) (Bergstra et al., 2011). Results for SMAC, Spearmint and TPE are reproduced from Eggensperger et al. (2013). Each algorithm was run ten times for the given number of evaluations, and the algorithm with the lowest validation loss is shown in bold. We note that on some of the benchmarks our algorithm converges to a solution in far fewer evaluations than the protocol allows.

Figure 3 (paper excerpt): An empirical comparison of Bayesian optimization following the standard Gaussian process expected improvement algorithm (GP EI MCMC) and our strategy (Warped GP EI MCMC) for modeling input non-stationarity, on challenging problems involving the optimization of the hyperparameters of popular models: (a) Logistic Regression, (b) Online LDA, (c) Structured SVM, (d) CIFAR-10 Subset. It is clear that the learned warpings are non-trivial. In some cases, like with learning rates, they agree with intuition, while for others, like dropout, they yield surprising warpings.
Freeze-Thaw Bayesian Optimization

Fig 1: Example functions from the exponential decay kernel. (a) Example functions from our basis set (parameter value 1.0). (b) Samples.

[Figure: training error vs. epochs for many partially trained models, colored by the iteration at which each run was started.]

Algorithm fragments: forecast y_{t+1} using Equation 20 and y_1 using Equation 21; at each iteration, compute P_min over the basket using Monte Carlo simulation and Equation 19; the information gain is H(P_min); choose model k as the next model to run.

[Figures: Freeze-Thaw vs. Warped GP EI MCMC; logistic regression error, online LDA perplexity, and PMF RMSE as a function of training epochs.]
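The training-curve model rests on a kernel over epochs built from exponentially decaying basis functions. The sketch below assumes the form k(t, t′) = β^α / (t + t′ + β)^α, which arises from integrating e^{-λt} basis functions against a Gamma(α, β) prior on the decay rate; treat this exact parameterization as an assumption.

```python
import numpy as np

def exp_decay_kernel(t, t2, alpha=1.0, beta=1.0):
    """Covariance between epoch indices t and t2 (1-D arrays):
    k(t, t') = beta^alpha / (t + t' + beta)^alpha."""
    t = np.asarray(t, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return beta**alpha / (t + t2 + beta)**alpha
```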
Finding new organic materials
Optimizing robot control systems
Improving turbine blade design

Human-Tuned Gait
Thanks
Coming soon at whetlab.com
Jasper Snoek, Kevin Swersky, Hugo Larochelle, Michael Gelbart
Code: https://github.com/HIPS/Spearmint