Bayesian Optimization for Machine Learning
Ryan P. Adams
School of Engineering and Applied Sciences
Harvard University
http://hips.seas.harvard.edu
Painful!
Requires many training cycles.
Possibly noisy.
[Figure: animation of sequential Bayesian optimization; each frame shows the GP posterior over the objective, the acquisition function, and the current best observation.]
Historical Overview
As response surface methods, they date back to Box and Wilson in 1951.

Today's Topics
Parallelizing training
Gaussian Processes
Mean function m : X → R
Covariance function C : X × X → R
Any finite set of N points {x_n}_{n=1}^N in X induces a homologous N-dimensional Gaussian distribution on R^N, taken to be the distribution on {y_n = f(x_n)}_{n=1}^N.
[Figure: sample functions drawn from the GP, with the induced Gaussian marginals at a finite set of input points.]
Gaussian Processes
Marginal likelihood:
ln p(y | X, θ) = −(N/2) ln 2π − (1/2) ln |K_θ| − (1/2) yᵀ K_θ⁻¹ y

Predictive distribution at test points {x_m}_{m=1}^M:
y_test ~ N(m, Σ), where
m = kᵀ K⁻¹ y
Σ = κ − kᵀ K⁻¹ k
(k: cross-covariance between training and test inputs; κ: covariance among the test inputs.)
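A minimal numpy sketch of these predictive equations, assuming a squared-exponential covariance and noise-free observations apart from a small jitter term (variable names here are illustrative, not part of the slides):

```python
import numpy as np

def se_kernel(A, B):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2)

def gp_predict(X, y, X_test, jitter=1e-8):
    """Posterior mean m = k' K^{-1} y and covariance S = kappa - k' K^{-1} k."""
    K = se_kernel(X, X) + jitter * np.eye(len(X))
    k = se_kernel(X, X_test)             # cross-covariance (train x test)
    kappa = se_kernel(X_test, X_test)    # covariance among test inputs
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    m = k.T @ alpha
    v = np.linalg.solve(L, k)
    S = kappa - v.T @ v
    return m, S
```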
Examples of GP Covariances

Squared-Exponential:
C(x, x′) = exp( −(1/2) Σ_{d=1}^D (x_d − x′_d)² )

Matérn:
C(r) = (2^{1−ν} / Γ(ν)) (√(2ν) r)^ν K_ν(√(2ν) r)

Neural Network:
C(x, x′) = (2/π) sin⁻¹( 2 xᵀ Σ x′ / √( (1 + 2 xᵀ Σ x)(1 + 2 x′ᵀ Σ x′) ) )

Periodic:
C(x, x′) = exp( −2 sin²( (x − x′)/2 ) )
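A short sketch of two of these covariances in Python; length scales and amplitudes are omitted for brevity, so treat the exact parameterization as an assumption rather than the deck's:

```python
import numpy as np

def matern52(r):
    """Matérn covariance with nu = 5/2, as a function of distance r >= 0."""
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def periodic(x, x2):
    """Periodic covariance between scalar inputs x and x2."""
    return np.exp(-2.0 * np.sin(0.5 * (x - x2))**2)
```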
Probability of Improvement
For x ∈ X, with GP posterior mean μ(x) and standard deviation σ(x):
γ(x) = ( f(x_best) − μ(x) ) / σ(x)
a_PI(x) = Φ(γ(x))
[Figure: probability-of-improvement acquisition plotted beneath the GP posterior.]
Expected Improvement
[Figure: expected-improvement acquisition plotted beneath the GP posterior.]
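With the γ(x) defined above, the expected improvement has the standard closed form a_EI(x) = σ(x)( γ(x) Φ(γ(x)) + φ(γ(x)) ). A small sketch of both acquisitions (minimization convention, scipy assumed; not the slides' code):

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, f_best):
    """Probability of improvement and expected improvement for minimization,
    given arrays of posterior means and standard deviations."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    gamma = (f_best - mu) / sigma
    a_pi = norm.cdf(gamma)                    # Phi(gamma)
    a_ei = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
    return a_pi, a_ei
```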
[Figure: GP sample functions under the Matérn covariance for ν = 1/2, 1, 3/2, 5/2.]
[Figure: classification error vs. number of function evaluations. Snoek, Larochelle & RPA, NIPS 2012.]
Integrated expected improvement: marginalize the acquisition over the GP hyperparameters θ,
â(x) = ∫ a(x; θ) p(θ | {x_n, y_n}_{n=1}^N) dθ
     ≈ (1/K) Σ_{k=1}^K a(x; θ^{(k)}),  θ^{(k)} ~ p(θ | {x_n, y_n}_{n=1}^N)
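A hedged sketch of this Monte Carlo average, assuming hyperparameter samples are already available (e.g., from slice sampling) and that ei_given_theta is a hypothetical helper returning EI under a fixed θ:

```python
import numpy as np

def integrated_acquisition(x_candidates, theta_samples, ei_given_theta):
    """Average the acquisition over K posterior hyperparameter samples."""
    total = np.zeros(len(x_candidates))
    for theta in theta_samples:
        total += ei_given_theta(x_candidates, theta)
    return total / len(theta_samples)
```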
[Figure: classification error vs. function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm. Snoek, Larochelle & RPA, NIPS 2012.]
[Figure: classification error vs. time (hours) for GP EI MCMC and GP EI per Second. Snoek, Larochelle & RPA, NIPS 2012.]
[Figure: classification error vs. time (hours). Snoek, Larochelle & RPA, NIPS 2012.]
Parallelizing Bayesian Optimization

Classical Bayesian optimization is sequential: do an experiment, wait until it finishes, repeat.
Compute clusters let us do many things at once.
How do we efficiently pick multiple experiments?
Fantasize outcomes from the Gaussian process: the probabilistic model gives us coherent predictions, so we can use them to imagine possible futures.

Excerpt from Snoek, Larochelle & RPA, NIPS 2012:

... we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x.

3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization. With the advent of multi-core computing, it is natural to ask how we can parallelize our Bayesian optimization procedures. More generally than simply batch parallelism, however, we would like to be able to decide what x should be evaluated next, even while a set of points are being evaluated. Clearly, we cannot use the same acquisition function again, or we will repeat one of the pending experiments. We would ideally perform a roll-out of our acquisition policy, to choose a point that appropriately balanced information gain and exploitation. However, such roll-outs are generally intractable. Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations.

Consider the situation in which N evaluations have completed, yielding data {x_n, y_n}_{n=1}^N, and in which J evaluations are pending at locations {x_j}_{j=1}^J. Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations:

(7)  â(x; {x_n, y_n}, θ, {x_j}) = ∫_{R^J} a(x; {x_n, y_n}, θ, {x_j, y_j}) p({y_j}_{j=1}^J | {x_j}_{j=1}^J, {x_n, y_n}_{n=1}^N) dy_1 ⋯ dy_J

This is simply the expectation of a(x) under a J-dimensional Gaussian distribution, whose mean and covariance can easily be computed. As in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition.

Alternative: shrink the variance and integrate out the mean, as in Desautels, Krause & Burdick, JMLR 2014.
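A hedged sketch of the fantasy procedure: draw joint samples of the pending outcomes from the GP, condition on each fantasy, and average the acquisition. The helpers gp_predict and expected_improvement stand in for the earlier sketches; nothing here is Spearmint's actual API.

```python
import numpy as np

def mc_pending_acquisition(x_cand, X_done, y_done, X_pending,
                           gp_predict, expected_improvement,
                           n_fantasies=10, seed=0):
    """Monte Carlo estimate of the acquisition marginalized over possible
    outcomes of pending evaluations (cf. Eq. 7 above).

    gp_predict(X, y, X_star) -> (mean, covariance) and
    expected_improvement(mu, sigma, f_best) -> EI values are assumed helpers."""
    rng = np.random.default_rng(seed)
    mu_p, cov_p = gp_predict(X_done, y_done, X_pending)
    acq = np.zeros(len(x_cand))
    for _ in range(n_fantasies):
        y_fant = rng.multivariate_normal(mu_p, cov_p)   # fantasized outcomes
        X_aug = np.vstack([X_done, X_pending])
        y_aug = np.concatenate([y_done, y_fant])
        m, S = gp_predict(X_aug, y_aug, x_cand)
        sigma = np.sqrt(np.maximum(np.diag(S), 1e-12))
        acq += expected_improvement(m, sigma, y_aug.min())
    return acq / n_fantasies
```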
[Figure: out-of-sample perplexity vs. time (days). Snoek, Larochelle & RPA, NIPS 2012.]
Multi-Task Bayesian Optimization

Examples:
Object classification for different image types.
Speech recognition for different languages/speakers.
Different folds of the same cross-validation setup.

STL-10: 5K 96x96 natural images
CIFAR-10: 50K 32x32 natural images
Swersky, Snoek & RPA, NIPS 2013
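One common way to couple such tasks is a product (intrinsic coregionalization) kernel over inputs and task identities. The sketch below assumes that construction and is not necessarily the exact kernel used in the paper:

```python
import numpy as np

def multitask_kernel(X1, t1, X2, t2, input_kernel, task_cov):
    """Covariance between (input, task) pairs:
    k((x, t), (x', t')) = k_x(x, x') * B[t, t']."""
    Kx = input_kernel(X1, X2)            # covariance over inputs
    Kt = task_cov[np.ix_(t1, t2)]        # learned task-similarity matrix B
    return Kx * Kt
```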
[Figures: multi-task (MTBO) vs. single-task (STBO) Bayesian optimization; classification error and perplexity vs. function evaluations and wall-clock time, comparing independent vs. dependent predictions, including a digit recognition task and the time taken by MTBO vs. STBO.]
Noisy Constraints
objective function, constraint functions

Decoupled Constraints
objective function, constraint functions, confidence threshold
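A common way to fold such constraints into the acquisition, shown here only as an illustrative sketch rather than the method on the slides: weight the expected improvement by the modeled probability that every constraint is satisfied.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(ei, constraint_means, constraint_stds):
    """Weight EI by the probability that each modeled constraint c_k(x) >= 0.

    constraint_means / constraint_stds: arrays of shape (K, N) giving the GP
    posterior over each of K constraint functions at N candidate points."""
    prob_feasible = norm.cdf(constraint_means / np.maximum(constraint_stds, 1e-12))
    return ei * np.prod(prob_feasible, axis=0)
```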
[Figures: constrained Bayesian optimization results vs. function evaluations.]
[Figure: example input warpings; (a) linear, (b) exponential, (c) logarithmic, (d) sigmoidal.]
Input Warping for Nonstationary Objectives

[Table: average validation loss and standard deviation on hyperparameter benchmarks; the algorithm with the lowest validation loss is shown in bold, and on some of the benchmarks our algorithm converges to a solution in far fewer evaluations than the protocol allows. Reported rows include 7.3±0.2 / 8.2±0.6 / 6.88±0.0 with a budget of 40 evaluations, 1272.6±10.3 / 1271.5±3.5 / 1266.2±0.1 with 50, and 24.6±0.9 / 24.2±0.0 / 24.1±0.1 with 100.]
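A sketch of the beta-CDF warping idea: each input dimension, rescaled to [0, 1], is passed through a Beta CDF whose per-dimension shape parameters are learned jointly with the GP hyperparameters. The exact parameterization in Spearmint may differ.

```python
import numpy as np
from scipy.stats import beta

def warp_inputs(X, a, b):
    """Warp each dimension of X (assumed scaled to [0, 1]) through a Beta CDF.

    a[d], b[d] are per-dimension shape parameters treated as additional GP
    hyperparameters. Applying a stationary kernel to the warped inputs induces
    a non-stationary kernel on the original inputs."""
    return np.column_stack([beta.cdf(X[:, d], a[d], b[d])
                            for d in range(X.shape[1])])
```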
[Figures: GP EI MCMC in the original space, GP EI MCMC in log space, and the Warped GP on Logistic Regression, Online LDA, Structured SVM, and a CIFAR-10 subset; validation error / perplexity vs. function evaluations.]

Caption (paper excerpt): ... parameter benchmarks proposed in Eggensperger et al. (2013). We compare to SMAC (Hutter et al., 2011) and the Tree Parzen Estimator (TPE) (Bergstra et al., 2011). Results for SMAC, Spearmint and TPE are reproduced from Eggensperger et al. (2013). Each algorithm was run ten times for the given number of evaluations, and the algorithm with the lowest validation loss is shown in bold. We note that on some of the benchmarks our algorithm converges to a solution in far fewer evaluations than the protocol allows.

Figure 3 (paper excerpt): An empirical comparison of Bayesian optimization following the standard Gaussian process expected improvement algorithm (GP EI MCMC) and our strategy (Warped GP EI MCMC) for modeling input non-stationarity, on challenging problems involving the optimization of the hyperparameters of popular models: (a) Logistic Regression, (b) Online LDA, (c) Structured SVM, (d) CIFAR-10 Subset. It is clear that the learned warpings are non-trivial. In some cases, like with learning rates, they agree with intuition, while for others, like dropout, they yield surprising warpings.
Freeze-Thaw Bayesian Optimization

Fig 1: Example functions from the exponential decay kernel. (a) Example functions from our basis set (parameter value 1.0). (b) Samples.

[Figure: training error vs. epochs for many partially trained models, colored by the iteration at which each run was started.]

Algorithm fragments: forecast y_{t+1} using Equation 20 and y_1 using Equation 21; at each iteration, compute P_min over the basket using Monte Carlo simulation and Equation 19; the information gain is H(P_min); choose model k as the next model to run.

[Figures: Freeze-Thaw vs. Warped GP EI MCMC; logistic regression error, online LDA perplexity, and PMF RMSE as a function of training epochs.]
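The training-curve model rests on a kernel over epochs built from exponentially decaying basis functions. The sketch below assumes the form k(t, t′) = β^α / (t + t′ + β)^α, which arises from integrating e^{-λt} basis functions against a Gamma(α, β) prior on the decay rate; treat this exact parameterization as an assumption.

```python
import numpy as np

def exp_decay_kernel(t, t2, alpha=1.0, beta=1.0):
    """Covariance between epoch indices t and t2 (1-D arrays):
    k(t, t') = beta^alpha / (t + t' + beta)^alpha."""
    t = np.asarray(t, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return beta**alpha / (t + t2 + beta)**alpha
```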
Finding new organic materials
Optimizing robot control systems
Improving turbine blade design

Human-Tuned Gait
Thanks
Coming soon at whetlab.com
Jasper Snoek, Kevin Swersky, Hugo Larochelle, Michael Gelbart
Code: https://github.com/HIPS/Spearmint