
A Tutorial on Bayesian Optimization for Machine Learning

Ryan P. Adams
School of Engineering and Applied Sciences
Harvard University
http://hips.seas.harvard.edu

Machine Learning Meta-Challenges

- Increasing model complexity: more flexible models have more parameters.
- More sophisticated fitting procedures: non-convex optimization has many knobs to turn.
- Less accessible to non-experts: harder to apply complicated techniques.
- Results are less reproducible: too many important implementation details are missing.

Example: Deep Neural Networks

- Resurgent interest in large neural networks.
- When well-tuned, very successful for visual object identification, speech recognition, computational biology, ...
- Big investments by Google, Facebook, Microsoft, etc.
- Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions.

Example: Online Latent Dirichlet Allocation

- Hoffman, Blei, and Bach (2010): approximate inference for large-scale text analysis with latent Dirichlet allocation.
- When well-tuned, gives good empirical results.
- Many choices: Dirichlet parameters, number of topics, vocabulary size, learning rate schedule, batch size.

[Figure by David Blei: illustration of the LDA generative process.]

Example: Classification of DNA Sequences

- Objective: predict which DNA sequences will bind with which proteins.
- Miller, Kumar, Packer, Goodman, and Koller (2012): latent structural support vector machine.
- Choices: margin parameter, entropy parameter, convergence criterion.

Search for Good Hyperparameters?

Define an objective function. Most often, we care about generalization performance. Use cross-validation to measure parameter quality.

How do people currently search? Black magic:
- Grid search
- Random search
- Grad student descent

Painful! Requires many training cycles. Possibly noisy.

Can We Do Better? Bayesian Optimization

- Build a probabilistic model for the objective. Include hierarchical structure about units, etc.
- Compute the posterior predictive distribution. Integrate out all the possible true functions. We use Gaussian process regression.
- Optimize a cheap proxy function instead. The model is much cheaper than the true objective.

The main insight: make the proxy function exploit uncertainty to balance exploration against exploitation.
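To make the loop concrete, here is a minimal numpy/scipy sketch on a toy 1-D problem. The synthetic objective, the fixed squared-exponential length scale, and the grid-based maximization of the acquisition are illustrative assumptions for this sketch, not details from the slides.

import numpy as np
from scipy.stats import norm

def objective(x):
    # Toy stand-in for an expensive black box, e.g. cross-validation error.
    return np.sin(3.0 * x) + 0.5 * x

def sq_exp(a, b, ell=0.1):
    # Squared-exponential covariance between two sets of 1-D points.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Closed-form GP predictive mean and standard deviation at Xs.
    K = sq_exp(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    ks = sq_exp(X, Xs)
    mu = ks.T @ alpha
    v = np.linalg.solve(L, ks)
    var = 1.0 - np.sum(v ** 2, axis=0)   # prior variance is 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

grid = np.linspace(0.0, 1.0, 500)
X = np.array([0.2, 0.8])                  # small initial design
y = objective(X)
for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))
print("best x:", X[np.argmin(y)], "best value:", y.min())

Each iteration fits the cheap GP proxy to all evaluations so far, maximizes expected improvement over the grid, and spends one expensive evaluation there.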

The Bayesian Optimization Idea

[Figure: a sequence of frames showing a GP posterior over a 1-D objective as observations accumulate, with the current best observation marked.]

Where should we evaluate next in order to improve the most?

One idea: expected improvement

Historical Overview

- Closely related to statistical ideas of optimal design of experiments, dating back to Kirstine Smith in 1918.
- As response surface methods, they date back to Box and Wilson in 1951.
- As Bayesian optimization, studied first by Kushner in 1964 and then Mockus in 1978.
- Methodologically, it touches on several important machine learning areas: active learning, contextual bandits, Bayesian nonparametrics.
- Started receiving serious attention in ML in 2007, for example:
  - Brochu, de Freitas & Ghosh, NIPS 2007 [preference learning]
  - Krause, Singh & Guestrin, JMLR 2008 [optimal sensor placement]
  - Srinivas, Krause, Kakade & Seeger, ICML 2010 [regret bounds]
  - Brochu, Hoffman & de Freitas, UAI 2011 [portfolios]
- Interest exploded when it was realized that Bayesian optimization provides an excellent tool for finding good ML hyperparameters.

Today's Topics

- Review of Gaussian process priors
- Bayesian optimization basics
- Managing covariances and kernel parameters
- Accounting for the cost of evaluation
- Parallelizing training
- Sharing information across related problems
- Better models for nonstationary functions
- Random projections for high-dimensional problems
- Accounting for constraints
- Leveraging partially-completed training runs

Gaussian Processes as Function Models

Nonparametric prior on functions, specified in terms of a positive definite kernel.

[Figure: sample functions drawn from GP priors with different kernels.]

Kernels exist for many structured objects:
- strings, e.g. "the quick brown fox jumps over the lazy dog" (Haussler 1999)
- permutations (Kondor 2008)
- proteins (Borgwardt 2007)

Gaussian Processes

- A Gaussian process (GP) is a distribution on functions.
- Allows tractable Bayesian modeling of functions without specifying a particular finite basis.
- Input space (where we're optimizing): $\mathcal{X}$
- Model scalar functions $f : \mathcal{X} \to \mathbb{R}$
- Positive definite covariance function $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
- Mean function $m : \mathcal{X} \to \mathbb{R}$

Gaussian Processes

Any finite set of $N$ points in $\mathcal{X}$, $\{x_n\}_{n=1}^N$, induces a homologous $N$-dimensional Gaussian distribution on $\mathbb{R}^N$, taken to be the distribution on $\{y_n = f(x_n)\}_{n=1}^N$.

[Figure: input points and sample functions from the induced joint Gaussian passing through them.]
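Because any finite grid induces a joint Gaussian, drawing functions from a GP prior is a single multivariate normal sample; a small numpy sketch (the squared-exponential kernel and length scale here are illustrative choices):

import numpy as np

# Evaluate the prior on a dense grid: the induced distribution on the
# grid values is N(0, K), so a Cholesky factor turns iid normals into
# correlated function values.
x = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))   # jitter for stability
samples = L @ np.random.randn(len(x), 3)            # three prior draws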

Gaussian Processes

Due to the Gaussian form, there are closed-form solutions for many useful questions about finite data.

Marginal likelihood:
$$\ln p(y \mid X, \theta) = -\frac{N}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{K}_\theta| - \frac{1}{2} y^\top \mathbf{K}_\theta^{-1} y$$

Predictive distribution at test points $\{x_m\}_{m=1}^M$:
$$y^{\text{test}} \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma}), \qquad \mathbf{m} = \mathbf{k}^\top \mathbf{K}_\theta^{-1} y, \qquad \boldsymbol{\Sigma} = \boldsymbol{\kappa} - \mathbf{k}^\top \mathbf{K}_\theta^{-1} \mathbf{k}$$

We compute these matrices from the covariance:
$$[\mathbf{K}_\theta]_{n,n'} = C(x_n, x_{n'}; \theta), \qquad [\mathbf{k}]_{n,m} = C(x_n, x_m; \theta), \qquad [\boldsymbol{\kappa}]_{m,m'} = C(x_m, x_{m'}; \theta)$$
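As a sketch, the marginal likelihood above can be computed stably from a Cholesky factorization of the kernel matrix (here assumed already built from the covariance function):

import numpy as np

def gp_log_marginal_likelihood(K, y):
    # ln p(y | X, theta) = -1/2 y^T K^-1 y - 1/2 ln|K| - N/2 ln(2 pi),
    # with ln|K| read off the Cholesky diagonal.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * np.dot(y, alpha)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))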

Examples of GP Covariances

Squared-Exponential:
$$C(x, x') = \exp\left\{-\frac{1}{2}\sum_{d=1}^D \left(\frac{x_d - x'_d}{\ell_d}\right)^2\right\}$$

Matérn:
$$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, r\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, r\right)$$

Neural Network:
$$C(x, x') = \frac{2}{\pi}\sin^{-1}\left\{\frac{2\, x^\top \Sigma\, x'}{\sqrt{(1 + 2\, x^\top \Sigma\, x)(1 + 2\, x'^\top \Sigma\, x')}}\right\}$$

Periodic:
$$C(x, x') = \exp\left\{-\frac{2}{\ell^2}\sin^2\!\left(\frac{x - x'}{2}\right)\right\}$$
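Hedged numpy/scipy sketches of three of these covariances; the parameterizations follow the standard forms above, and default parameter values are illustrative:

import numpy as np
from scipy.special import gamma, kv

def squared_exponential(x, x2, ell=1.0):
    # x: (N, D), x2: (M, D); per-dimension length scales via `ell`.
    r2 = np.sum(((x[:, None, :] - x2[None, :, :]) / ell) ** 2, axis=-1)
    return np.exp(-0.5 * r2)

def matern(r, nu=2.5):
    # General Matérn as a function of distance r > 0, via the modified
    # Bessel function K_nu (at r = 0 take the limit, which is 1).
    s = np.sqrt(2.0 * nu) * r
    return (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)

def periodic(x, x2, ell=1.0):
    # One-dimensional periodic covariance.
    d = x[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(0.5 * d) ** 2 / ell ** 2)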

GPs Provide Closed-Form Predictions

[Figure: GP posterior mean and credible interval conditioned on a few observations.]

Using Uncertainty in Optimization

Find the minimum: $x^\star = \arg\min_{x \in \mathcal{X}} f(x)$

- We can evaluate the objective pointwise, but do not have an easy functional form or gradients.
- After performing some evaluations, the GP gives us easy closed-form marginal means and variances.
- Exploration: seek places with high variance.
- Exploitation: seek places with low mean.
- The acquisition function balances these for our proxy optimization to determine the next evaluation.

Closed-Form Acquisition Functions

The GP posterior gives a predictive mean function $\mu(x)$ and a predictive marginal variance function $\sigma^2(x)$. Define
$$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x)}{\sigma(x)}$$

Probability of Improvement (Kushner 1964):
$$a_{\text{PI}}(x) = \Phi(\gamma(x))$$

Expected Improvement (Mockus 1978):
$$a_{\text{EI}}(x) = \sigma(x)\left(\gamma(x)\,\Phi(\gamma(x)) + \mathcal{N}(\gamma(x)\,;\,0,1)\right)$$

GP Upper Confidence Bound (Srinivas et al. 2010):
$$a_{\text{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)$$
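All three acquisitions are cheap to evaluate from the GP's predictive mean and standard deviation; a minimal sketch, using the minimization convention of this slide:

import numpy as np
from scipy.stats import norm

def gamma_score(mu, sigma, f_best):
    # gamma(x) = (f(x_best) - mu(x)) / sigma(x)
    return (f_best - mu) / sigma

def a_pi(mu, sigma, f_best):
    return norm.cdf(gamma_score(mu, sigma, f_best))

def a_ei(mu, sigma, f_best):
    g = gamma_score(mu, sigma, f_best)
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

def a_lcb(mu, sigma, kappa=2.0):
    # Lower confidence bound for minimization: pick the argmin of this;
    # kappa trades off exploration against exploitation.
    return mu - kappa * sigma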

Probability of Improvement

[Figure: GP posterior with the probability-of-improvement acquisition plotted below.]

Expected Improvement

[Figure: GP posterior with the expected-improvement acquisition plotted below.]

GP Upper (Lower) Confidence Bound

[Figure: GP posterior with the confidence-bound acquisition plotted below.]

Distribution Over Minimum (Entropy Search)

[Figure: GP posterior with the induced distribution over the location of the minimum.]

Illustrating Bayesian Optimization

[Figure: successive animation frames of Bayesian optimization selecting and evaluating new points.]

Why Doesn't Everyone Use This?

These ideas have been around for decades. Why isn't Bayesian optimization in broader use?

- Fragility and poor default choices. Getting the function model wrong can be catastrophic.
- There hasn't been standard software available. It's a bit tricky to build such a system from scratch.
- Experiments are run sequentially. We want to take advantage of cluster computing.
- Limited scalability in dimensions and evaluations. We want to solve big problems.

Fragility and Poor Default Choices

Ironic problem: Bayesian optimization has its own hyperparameters!

- Covariance function selection. This turns out to be crucial to good performance. The default choice for regression is way too smooth. Instead: use an adaptive Matérn 5/2 kernel.
- Gaussian process hyperparameters. The typical empirical Bayes approach can fail horribly. Instead: use Markov chain Monte Carlo integration. Slice sampling means no additional parameters!

Covariance Function Choice

$$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, r\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, r\right)$$

[Figure: GP posterior samples under the Matérn kernel for ν = 1/2, ν = 1, ν = 3/2, and ν = 5/2; larger ν yields smoother functions.]

Choosing Covariance Functions

[Figure: structured SVM for protein motif finding (Miller et al. 2012). Classification error vs. function evaluations for Matérn 5/2 ARD, squared-exponential ISO, squared-exponential ARD, and Matérn 3/2 ARD covariances.]

Snoek, Larochelle & RPA, NIPS 2012

MCMC for GP Hyperparameters

- Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency.
- Slice sampling of hyperparameters (e.g., Murray and Adams 2010) is comparably fast and easy, but accounts for uncertainty in length scale, mean, and amplitude.

Integrated Acquisition Function:
$$\hat{a}(x) = \int a(x\,;\,\theta)\, p(\theta \mid \{x_n, y_n\}_{n=1}^N)\, d\theta \approx \frac{1}{K}\sum_{k=1}^K a(x\,;\,\theta^{(k)}), \qquad \theta^{(k)} \sim p(\theta \mid \{x_n, y_n\}_{n=1}^N)$$

For a theoretical discussion of the implications of inferring hyperparameters with BayesOpt, see recent work by Wang and de Freitas (http://arxiv.org/abs/1406.7758).

Snoek, Larochelle & RPA, NIPS 2012
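In code, the Monte Carlo form of the integrated acquisition is a plain average over posterior hyperparameter samples; in this sketch, `acquisition` and the slice-sampled `thetas` are assumed to be supplied by the surrounding GP machinery:

import numpy as np

def integrated_acquisition(x_cand, thetas, acquisition):
    # a_hat(x) ~= (1/K) sum_k a(x; theta_k), theta_k ~ p(theta | data).
    return np.mean([acquisition(x_cand, th) for th in thetas], axis=0)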

Integrating Out GP Hyperparameters

[Figure: posterior samples with three different length scales; the length-scale-specific expected improvement for each; and the integrated expected improvement that averages over them.]

MCMC for Hyperparameters

[Figure: logistic regression for handwritten digit recognition. Classification error vs. function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm.]

Snoek, Larochelle & RPA, NIPS 2012

Cost-Sensitive Bayesian Optimization

- Practical problems often have varying costs, e.g., durations, hardware requirements, materiel.
- We often have a budget or a deadline and would ideally like to account for these limited resources.
- It may be sensible to first characterize the function with cheap evaluations before trying expensive ones.
- These costs are probably unknown, but they can themselves be modeled with Gaussian processes.
- One idea: expected improvement per second (a sketch follows below).

Snoek, Larochelle & RPA, NIPS 2012
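A sketch of the idea: fit a second GP to the log duration of each evaluation and scale expected improvement by the predicted cost. The helper below assumes `mu_log_cost` is that cost model's predictive mean of log-seconds at the candidate points:

import numpy as np
from scipy.stats import norm

def ei_per_second(mu, sigma, f_best, mu_log_cost):
    # Standard EI divided by the predicted evaluation time, so that
    # cheap-but-promising points beat expensive ones of equal promise.
    g = (f_best - mu) / sigma
    ei = sigma * (g * norm.cdf(g) + norm.pdf(g))
    return ei / np.exp(mu_log_cost)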

Expected Improvement per Second

[Figure: structured SVM for protein motif finding (Miller et al. 2012). Minimum function value vs. time (hours) for GP EI MCMC and GP EI per Second.]

Snoek, Larochelle & RPA, NIPS 2012

Expected Improvement per Second

CIFAR-10: deep convolutional neural network (Krizhevsky). Achieves 9.5% test error vs. 11% with hand tuning.

[Figure: classification error vs. time (hours) for GP EI MCMC, GP EI per Second, and the state of the art.]

We can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x.

Snoek, Larochelle & RPA, NIPS 2012

Parallelizing Bayesian Optimization

- Classical Bayesian optimization is sequential: do an experiment, wait until it finishes, repeat.
- Compute clusters let us do many things at once. How do we efficiently pick multiple experiments?
- Fantasize outcomes from the Gaussian process! The probabilistic model gives us coherent predictions.
- Use the predictions to imagine possible futures.
- Evaluate points that are good under the average future.
- Alternative: shrink the variance and integrate out the mean, as in Desautels, Krause & Burdick, JMLR 2014.

From the accompanying paper (Section 3.3, Monte Carlo acquisition for parallelizing Bayesian optimization): more generally than simple batch parallelism, we would like to decide what x should be evaluated next even while a set of points is being evaluated. Clearly, we cannot use the same acquisition function again, or we will repeat one of the pending experiments. We would ideally perform a roll-out of our acquisition policy to choose a point that appropriately balances information gain and exploitation, but such roll-outs are generally intractable. Instead, a sequential strategy takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations. Consider the situation in which $N$ evaluations have completed, yielding data $\{x_n, y_n\}_{n=1}^N$, and in which $J$ evaluations are pending at locations $\{x_j\}_{j=1}^J$. Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations:

$$\hat{a}(x\,;\,\{x_n, y_n\}, \theta, \{x_j\}) = \int_{\mathbb{R}^J} a(x\,;\,\{x_n, y_n\}, \theta, \{x_j, y_j\})\; p(\{y_j\}_{j=1}^J \mid \{x_j\}_{j=1}^J, \{x_n, y_n\}_{n=1}^N)\; dy_1 \cdots dy_J$$

This is simply the expectation of $a(x)$ under a $J$-dimensional Gaussian distribution whose mean and covariance can easily be computed, so, as in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition.

Snoek, Larochelle & RPA, NIPS 2012
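A sketch of this Monte Carlo estimate: draw joint fantasy outcomes at the pending locations, condition the GP on each fantasy, and average the resulting acquisition functions. Here `gp_joint_sample` and `gp_acquisition` are assumed helpers standing in for the GP machinery, not a specific library API:

import numpy as np

def mc_parallel_acquisition(x_cand, X, y, X_pend, gp_joint_sample,
                            gp_acquisition, n_fantasies=10):
    # gp_joint_sample(X, y, X_pend): one joint posterior draw of the
    #   pending outcomes (a J-dimensional Gaussian sample).
    # gp_acquisition(x_cand, X, y): acquisition values at x_cand for a
    #   GP conditioned on the data (X, y).
    total = np.zeros(len(x_cand))
    for _ in range(n_fantasies):
        y_fantasy = gp_joint_sample(X, y, X_pend)        # fantasize outcomes
        X_aug = np.concatenate([X, X_pend])
        y_aug = np.concatenate([y, y_fantasy])
        total += gp_acquisition(x_cand, X_aug, y_aug)    # acquisition under fantasy
    return total / n_fantasies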

Parallelizing Bayesian Optimization

[Figure: two pending experiments with three joint fantasies; three expected improvement functions, one for each fantasy; integrating out the fantasies yields an overall expected improvement.]

Parallelized Bayesian Optimization

[Figure: Online Latent Dirichlet Allocation on 250K documents (Hoffman, Blei & Bach 2010). Out-of-sample perplexity vs. time (days) for GP EI MCMC with 1, 3, 5, and 10 parallel evaluations.]

Snoek, Larochelle & RPA, NIPS 2012

Multi-Task Bayesian Optimization

We usually don't just have one problem to solve. Probably, we have a set of related objectives.

Examples:
- Object classification for different image types.
- Speech recognition for different languages/speakers.
- Different folds of the same cross-validation setup.

How can we share information between problems?
- Use a prior on vector-valued functions (a kernel sketch follows below).
- Reframe acquisitions to leverage output covariance.
- Make decisions that look for bits about the minimum.

Swersky, Snoek & RPA, NIPS 2013
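One common prior on vector-valued functions, in the spirit of the multi-task approach here, is the intrinsic model of coregionalization: the covariance factorizes into an input kernel times a task-similarity matrix. A minimal sketch:

import numpy as np

def multitask_cov(Kx, B, tasks_a, tasks_b):
    # Cov[(x, t), (x', t')] = K_x(x, x') * B[t, t'], where B is a PSD
    # task-similarity matrix; Kx is the input-kernel Gram matrix and
    # tasks_a/tasks_b give the task index of each row/column.
    return Kx * B[np.ix_(tasks_a, tasks_b)]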

Parallel-Task Bayesian Optimization

Convolutional neural networks with identical architectures:
- Google Street View House Numbers: 73K 32x32 digits
- STL-10: 5K 96x96 natural images
- CIFAR-10: 50K 32x32 natural images

Swersky, Snoek & RPA, NIPS 2013

Correlated Objective Functions

[Figure: three dependent functions (1), (2), (3); independent predictions vs. dependent predictions that share information across tasks.]

Swersky, Snoek & RPA, NIPS 2013

Parallel-Task Bayesian Optimization

Visual object recognition: STL-10 and CIFAR-10. Achieves 70.7% test accuracy; the best previous result was 64.5%.

[Figure: minimum function value (error rate) vs. function evaluations for STL-10 from scratch and STL-10 from CIFAR-10.]

Swersky, Snoek & RPA, NIPS 2013

Parallel-Task Bayesian Optimization

Recognizing house numbers, leveraging object recognition. Achieves 4.77% test error, the best non-dropout result.

[Figure: minimum function value (error rate) vs. function evaluations for SVHN from scratch, SVHN from CIFAR-10, and SVHN from a subset of SVHN.]

Swersky, Snoek & RPA, NIPS 2013

Auxiliary-Task Bayesian Optimization

Using a smaller problem to go fast on a bigger problem.

[Figure: digit recognition. Minimum function value (classification error %) vs. time (minutes), and time taken by multi-task BayesOpt (MTBO) vs. single-task BayesOpt (STBO) to reach matching error levels.]

[Figure: online latent Dirichlet allocation. Minimum function value (perplexity) vs. time (days), and time taken by MTBO vs. STBO to reach matching perplexity levels.]

Swersky, Snoek & RPA, NIPS 2013

Bayesian Optimization with Unknown Constraints

- In general, you need to specify a range for each parameter you wish to optimize.
- However, there may be regimes in which things break that are not easily pre-specified.
- Alternatively, you may have expensive constraints that you want to make optimal decisions about.

$$\min_x f(x) \quad \text{s.t.} \quad \forall k \;\; g_k(x) \geq 0$$

Here $f$ is the objective function and the $g_k$ are constraint functions.

Gelbart, Snoek & RPA, UAI 2014

Noisy Constraints

- Realistically, lots of problems we care about have noisy constraints, just like they have noisy objectives.
- This is a problem because we can never be completely sure that we have a valid solution.
- This can be addressed via probabilistic constraints:

$$\min_x \mathbb{E}[f(x)] \quad \text{s.t.} \quad \forall k \;\; \Pr[g_k(x) \geq 0] \geq 1 - \delta_k$$

Here $f$ is the objective function, the $g_k$ are constraint functions, and $1 - \delta_k$ is a confidence threshold.

Gelbart, Snoek & RPA, UAI 2014
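A sketch of how probabilistic constraints can enter the acquisition: with independent GPs on the objective and on each constraint, expected improvement is weighted by the probability that every constraint is satisfied. The independence assumption and array shapes here are illustrative, not the paper's exact formulation:

import numpy as np
from scipy.stats import norm

def constraint_weighted_ei(mu_f, sigma_f, f_best, mu_g, sigma_g):
    # mu_g, sigma_g: (K, M) predictive means/stds of K constraint GPs
    # at M candidates; Pr[g_k(x) >= 0] = Phi(mu_k / sigma_k) under each.
    g = (f_best - mu_f) / sigma_f
    ei = sigma_f * (g * norm.cdf(g) + norm.pdf(g))
    p_feasible = np.prod(norm.cdf(mu_g / sigma_g), axis=0)
    return ei * p_feasible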

Decoupled Constraints

- Just like in the multi-task case, it may be possible to evaluate the constraints separately from the objective.
- The constraints may have vastly different costs.
- We call these decoupled constraints.
- We use entropy search to determine whether to evaluate the constraint or the objective.

Gelbart, Snoek & RPA, UAI 2014

LDA with Sparsity Constraints

[Figure: online LDA. Minimum function value vs. function evaluations for constrained and unconstrained Bayesian optimization, with the method of Gramacy and Lee (2010) for comparison.]

Require that online LDA result in low-entropy topics.

Neural Network with Memory Constraints

[Figure: minimum function value vs. function evaluations for constrained and unconstrained Bayesian optimization.]

Decoupled: find a good neural network for MNIST, subject to stringent memory limitations.

Bayesian Optimization with Random Projections

- Bayesian optimization relies on having a good prior on functions, which is challenging in high dimensions!
- Realistically, many problems have low effective dimensionality, i.e., the objective is primarily determined by a manifold in the input space.
- Wang, Zoghi, Hutter, Matheson & de Freitas (IJCAI 2013) suggest a way to deal with many dimensions.
- Approach: choose a random embedding of the input dimensions and then optimize in the subspace (a sketch follows below).
- Straightforward to implement and offers some progress for large and difficult BayesOpt problems.
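A sketch of the random-embedding trick: draw a random D-by-d matrix once and hand BayesOpt the composed d-dimensional objective. The box bounds and clipping rule here are illustrative assumptions:

import numpy as np

def make_embedded_objective(f, D, d, seed=0):
    # Optimize g(z) over z in R^d instead of f(x) over x in R^D, d << D.
    A = np.random.default_rng(seed).standard_normal((D, d))
    def g(z):
        x = np.clip(A @ z, -1.0, 1.0)   # map back into the original box
        return f(x)
    return g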

Input Warping for Nonstationary Objectives

- Good priors are key to Bayesian optimization.
- Almost all widely-used kernels are stationary, i.e., the covariance depends only on the distance.
- For machine learning hyperparameters, this is a big deal; learning rate changes may have big effects in one regime and small ones in another.
- Real problems are highly nonstationary, but it is difficult to strike a balance between flexibility and tractability.
- One approach: extend the kernel to include a parametric bijection of each input, and integrate it out with MCMC.
- The CDFs of the beta and Kumaraswamy distributions are suitable bijections on the unit hypercube and allow interesting nonlinear effects.

[Figure: example warpings of the unit interval: (a) linear, (b) exponential, (c) logarithmic, (d) sigmoidal.]

Snoek, Swersky, Zemel & RPA, ICML 2014
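The warping itself is a one-liner: each input dimension is pushed through a Kumaraswamy CDF before the stationary kernel sees it, with the shape parameters integrated out by MCMC. A sketch:

import numpy as np

def kumaraswamy_warp(x, a, b):
    # CDF of the Kumaraswamy distribution: a monotone bijection of
    # [0, 1] onto itself; a and b control where resolution concentrates.
    return 1.0 - (1.0 - x ** a) ** b

# e.g., warp each dimension before computing any stationary kernel:
# K = stationary_kernel(kumaraswamy_warp(X, a, b), kumaraswamy_warp(X2, a, b))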

Input Warping for Nonstationary Objectives

[Table: average validation loss and standard deviation on the hyperparameter benchmarks proposed by Eggensperger et al. (2013), comparing SMAC (Hutter et al., 2011), the Tree Parzen Estimator (TPE) (Bergstra et al.), Spearmint, and the warped method. Each algorithm was run ten times for the given number of evaluations; the algorithm with the lowest validation loss is shown in bold, and on some benchmarks the warped algorithm converges to a solution in far fewer evaluations than the protocol allows.]

[Figure 3: an empirical comparison of Bayesian optimization following the standard protocol (GP EI MCMC, in the original and log space) and the warping strategy (Warped GP EI MCMC) on challenging problems involving the hyperparameters of popular models: minimum function value vs. function evaluations for (a) logistic regression, (b) online LDA, (c) structured SVM, (d) CIFAR-10 subset.]

It is clear that the learned warpings are non-trivial. In some cases, like with learning rates, they agree with intuition, while for others like dropout they yield surprising results.

Freeze-Thaw Bayesian Optimization

- For ML BayesOpt, the objective itself is often the result of an incremental training procedure, e.g., SGD.
- Humans often abort jobs that aren't promising.
- Freeze-thaw Bayesian optimization tries to do the same thing.
- It tries to project when a training procedure will give a good value, and only continues promising ones.
- Requires a prior on training curves as a function of input hyperparameters. Introduces a new kernel for this (a sketch follows below).

[Fig 1: example functions from the exponential decay kernel. (a) Exponential decay basis: example functions from the basis set; (b) samples; (c) training curve samples.]

Swersky, Snoek & RPA, Under Review
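A sketch of the exponential-decay covariance between epochs of a training curve, which corresponds to an infinite mixture of exponentially decaying basis functions; the form follows the paper's kernel, with illustrative parameter values:

import numpy as np

def exp_decay_kernel(t, t2, alpha=1.0, beta=0.5):
    # k(t, t') = beta^alpha / (t + t' + beta)^alpha: covariance between
    # points on a training curve, decaying as training progresses.
    return beta ** alpha / (np.add.outer(t, t2) + beta) ** alpha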

Freeze-Thaw Bayesian Optimization

[Figure: error vs. epochs for many partially-trained models started at different iterations, with unpromising runs frozen early and promising ones thawed and continued.]

Swersky, Snoek & RPA, Under Review

Freeze-Thaw Bayesian Optimization

Sketch of the selection procedure from the paper: predict $y_{t+1}$ for each running model using Equation 20 and $y_1$ for a new model using Equation 21; for each candidate action, compute $P_{\min}$ over the basket using Monte Carlo simulation and Equation 19; score each action by the information gain $H(P_{\min})$; then select $x_k$, where $k = \arg\max_k a(k)$, as the next model to run.

[Fig 3: results of the empirical comparison on machine learning hyperparameter optimization problems, reporting the minimum observed over all training epochs evaluated by each method: (a) logistic regression (classification error rate vs. epochs), (b) online LDA (perplexity vs. epochs), (c) probabilistic matrix factorization (RMSE vs. epochs), comparing FreezeThaw against Warped GP EI MCMC.]

Given a basket, the task now becomes one of choosing which model to run; a naive criterion would always favor picking new models rather than running old ones.

Swersky, Snoek & RPA, Under Review

New Directions for Bayesian Optimization

- Finding new organic materials
- Optimizing robot control systems
- Improving turbine blade design
- Designing DNA for protein affinities

Cheetah Gait - Sangbae Kim's Lab @ MIT

[Video: human-tuned gait]

[Video: Bayesian-optimized gait]

"I just want to use Bayesian optimization."

- The tool my group maintains for performing Bayesian optimization is called Spearmint and is available at https://github.com/HIPS/Spearmint
- Spearmint is available for non-commercial use, and has most of the fancy research bits I've described today.
- For turn-key, super-easy Bayesian optimization, we're launching a startup at whetlab.com. It provides a straightforward REST API and clients in Python, Matlab, R, etc.
- Our intention is to make Whetlab cheap/free for ML researchers.

Shameless Plug: Whetlab Python Client

import whetlab

# Define parameters to optimize.
parameters = {'learn_rate'   : {'type':'float',   'min':0,  'max':15,   'size':1},
              'weight_decay' : {'type':'float',   'min':-5, 'max':10,   'size':1},
              # . . . more parameters here . . .
              'layer1_size'  : {'type':'integer', 'min':1,  'max':1000, 'size':1}}

name = 'My Fancy Deep Network'
description = 'Optimize the hyperparams of my network.'
outcome = {'name':'Cross-Validation Accuracy', 'type':'float'}
scientist = whetlab.Experiment(name=name, description=description,
                               parameters=parameters, outcome=outcome)

# Loop until you're happy.
# You can have multiple of these in parallel, too.
for ii in xrange(1000):
    # Ask the Whetlab scientist for a new experiment to try.
    job = scientist.suggest()

    # Perform the experiment with your own code.
    # . . . do training and cross-validation here . . .

    # Inform the Whetlab scientist about the outcome.
    scientist.update(job, outcome)

Another Shameless Plug: Kayak

- Simple open-source Python autodiff library.
- Meant to be a lightweight alternative to, e.g., Theano.
- Available at https://github.com/HIPS/Kayak
- Not yet all that featureful.

import kayak
import numpy.random as npr

# Create a batcher object.
batcher = kayak.Batcher(batch_size, inputs.shape[0])

# Specify inputs and targets.
X = kayak.Inputs(inputs, batcher)
T = kayak.Targets(targets, batcher)

# First-layer weights and biases, with random initializations.
W1 = kayak.Parameter( 0.1*npr.randn( inputs.shape[1], layer1_sz ))
B1 = kayak.Parameter( 0.1*npr.randn(1, layer1_sz) )

# First hidden layer: ReLU + Dropout
H1 = kayak.Dropout(kayak.HardReLU(kayak.ElemAdd(kayak.MatMult(X, W1), B1)), layer1_dropout)

# . . . various other bits in your architecture . . .

# Specify your loss function (Y is the network output defined in the
# elided architecture above).
loss = kayak.MatSum(kayak.LogMultinomialLoss(Y, T))

# Gradients are so easy!
grad_W1 = loss.grad(W1)
grad_B1 = loss.grad(B1)

Thanks

Coming soon at whetlab.com

Jasper Snoek, Kevin Swersky, Hugo Larochelle, Michael Gelbart

Code: https://github.com/HIPS/Spearmint
