A Tutorial On Bayesian Optimization of
A Tutorial On Bayesian Optimization of
A Tutorial On Bayesian Optimization of
Reinforcement Learning
Eric Brochu, Vlad M. Cora and Nando de Freitas
December 14, 2010
Abstract
We present a tutorial on Bayesian optimization, a method of finding
the maximum of expensive cost functions. Bayesian optimization employs
the Bayesian technique of setting a prior over the objective function and
combining it with evidence to get a posterior function. This permits a
utility-based selection of the next observation to make on the objective
function, which must take into account both exploration (sampling from
areas of high uncertainty) and exploitation (sampling areas likely to offer
improvement over the current best observation). We also present two
detailed extensions of Bayesian optimization, with experiments—active
user modelling with preferences, and hierarchical reinforcement learning—
and a discussion of the pros and cons of Bayesian optimization based on
our experiences.
1 Introduction
An enormous body of scientific literature has been devoted to the problem of
optimizing a nonlinear function f (x) over a compact set A. In the realm of
optimization, this problem is formulated concisely as follows:
max f (x)
x∈A⊂Rd
One typically assumes that the objective function f (x) has a known mathe-
matical representation, is convex, or is at least cheap to evaluate. Despite the
influence of classical optimization on machine learning, many learning problems
do not conform to these strong assumptions. Often, evaluating the objective
function is expensive or even impossible, and the derivatives and convexity prop-
erties are unknown.
In many realistic sequential decision making problems, for example, one can
only hope to obtain an estimate of the objective function by simulating fu-
ture scenarios. Whether one adopts simple Monte Carlo simulation or adaptive
1
schemes, as proposed in the fields of planning and reinforcement learning, the
process of simulation is invariably expensive. Moreover, in some applications,
drawing samples f (x) from the function corresponds to expensive processes:
drug trials, destructive tests or financial investments. In active user modeling,
x represents attributes of a user query, and f (x) requires a response from the
human. Computers must ask the right questions and the number of questions
must be kept to a minimum so as to avoid annoying the user.
P (M |E) ∝ P (E|M )P (M ).
Inside this simple equation is the key to optimizing the objective function. In
Bayesian optimization, the prior represents our belief about the space of possible
objective functions. Although the cost function is unknown, it is reasonable to
assume that there exists prior knowledge about some of its properties, such as
smoothness, and this makes some possible objective functions more plausible
than others.
Let’s define xi as the ith sample, and f (xi ) as the observation of the objective
function at xi . As we accumulate observations1 D1:t = {x1:t , f (x1:t )}, the prior
distribution is combined with the likelihood function P (D1:t |f ). Essentially,
given what we think we know about the prior, how likely is the data we have
seen? If our prior belief is that the objective function is very smooth and noise-
free, data with high variance or oscillations should be considered less likely than
data that barely deviate from the mean. Now, we can combine these to obtain
our posterior distribution:
P (f |D1:t ) ∝ P (D1:t |f )P (f ).
1 Here we use subscripts to denote sequences of data, i.e. y1:t = {y1 , . . . , yt }.
2
t=2
t=4
The posterior captures our updated beliefs about the unknown objective func-
tion. One may also interpret this step of Bayesian optimization as estimating
the objective function with a surrogate function (also called a response sur-
face), described formally in §2.1 with the posterior mean function of a Gaussian
process.
To sample efficiently, Bayesian optimization uses an acquisition function to
determine the next location xt+1 ∈ A to sample. The decision represents an
automatic trade-off between exploration (where the objective function is very
uncertain) and exploitation (trying values of x where the objective function is
expected to be high). This optimization technique has the nice property that it
aims to minimize the number of objective function evaluations. Moreover, it is
likely to do well even in settings where the objective function has multiple local
maxima.
3
Figure 1 shows a typical run of Bayesian optimization on a 1D problem.
The optimization starts with two points. At each iteration, the acquisition
function is maximized to determine where next to sample from the objective
function—the acquisition function takes into account the mean and variance of
the predictions over the space to model the utility of sampling. The objective
is then sampled at the argmax of the acquisition function, the Gaussian process
is updated and the process is repeated. +One may also interpret this step
of Bayesian optimization as estimating the objective function with a surrogate
function (also called a response surface), described formally in §2.1 with the
posterior mean function of a Gaussian process.
1.2 Overview
In §2, we give an overview of the Bayesian optimization approach and its history.
We formally present Bayesian optimization with Gaussian process priors (§2.1)
and describe covariance functions (§2.2), acquisition functions (§2.3) and the role
of Gaussian noise (§2.4). In §2.5, we cover the history of Bayesian optimization,
and the related fields of kriging, GP experimental design and GP active learning.
The second part of the tutorial builds on the basic Bayesian optimization
model. In §3 and §4 we discuss extensions to Bayesian optimization for active
user modelling in preference galleries, and hierarchical control problems, respec-
tively. Finally, we end the tutorial with a brief discussion of the pros and cons
of Bayesian optimization in §5.
g(x) = −f (x).
We also assume that the objective is Lipschitz-continuous. That is, there
exists some constant C, such that for all x1 , x2 ∈ A:
4
If −f (x) is convex, then any local maximum is also a global maximum. However,
in our optimization problems, we cannot assume that the negative objective
function is convex. It might be the case, but we have no way of knowing before
we begin optimizing.
It is common in global optimization, and true for our problem, that the ob-
jective is a black box function: we do not have an expression of the objective
function that we can analyze, and we do not know its derivatives. Evaluating
the function is restricted to querying at a point x and getting a (possibly noisy)
response. Black box optimization also typically requires that all dimensions
have bounds on the search space. In our case, we can safely make the sim-
plifying assumption these bounds are all axis-aligned, so the search space is a
hyperrectangle of dimension d.
A number of approaches exist for this kind of global optimization and have
been well-studied in the literature (e.g., [Törn and Žilinskas, 1989, Mongeau et al., 1998,
Liberti and Maculan, 2006, Zhigljavsky and Žilinskas, 2008]). Deterministic ap-
proaches include interval optimization and branch and bound methods. Stochas-
tic approximation is a popular idea for optimizing unknown objective func-
tions in machine learning contexts [Kushner and Yin, 1997]. It is the core
idea in most reinforcement learning algorithms [Bertsekas and Tsitsiklis, 1996,
Sutton and Barto, 1998], learning methods for Boltzmann machines and deep
belief networks [Younes, 1989, Hinton and Salakhutdinov, 2006] and parameter
estimation for nonlinear state space models [Poyiadjis et al., 2005, Martinez–Cantin et al., 2006].
However, these are generally unsuitable for our domain because they still re-
quire many samples, and in the active user-modelling domain drawing samples
is expensive.
Even in a noise-free domain, evaluating an objective function with Lipschitz
continuity C on a d-dimensional unit hypercube, guaranteeing the best observa-
tion f (x+ ) ≥ f (x? ) − requires (C/2)d samples [Betrò, 1991]. This can be an
incredibly expensive premium to pay for insurance against unlikely scenarios.
As a result, the idea naturally arises to relax the guarantees against patho-
logical worst-case scenarios. The goal, instead, is to use evidence and prior
knowledge to maximize the posterior at each step, so that each new evalua-
tion decreases the distance between the true global maximum and the expected
maximum given the model. This is sometimes called “one-step” [Močkus, 1994]
“average-case” [Streltsov and Vakili, 1999] or “practical” [Lizotte, 2008] opti-
mization. This average-case approach has weaker demands on computation
than the worst-case approach. As a result, it may provide faster solutions in
many practical domains where one does not believe the worst-case scenario is
plausible.
Bayesian optimization uses the prior and evidence to define a posterior dis-
tribution over the space of functions. The Bayesian model allows for an elegant
means by which informative priors can describe attributes of the objective func-
tion, such as smoothness or the most likely locations of the maximum, even
when the function itself is not known. Optimizing follows the principle of max-
imum expected utility, or, equivalently, minimum expected risk. The process of
deciding where to sample next requires the choice of a utility function and a
5
Algorithm 1 Bayesian Optimization
1: for t = 1, 2, . . . do
2: Find xt by optimizing the acquisition function over the GP: xt = argmaxx u(x|D1:t−1 ).
way of optimizing the expectation of this utility with respect to the posterior
distribution of the objective function. This secondary optimization problem is
usually easier because the utility is typically chosen so that it is easy to evaluate,
though still nonconvex. To make clear which function we are discussing, we will
refer to this utility as the acquisition function (also sometimes called the infill
function). In §2.3, we will discuss some common acquisition functions.
In practice, there is also have the possibility of measurement noise, which
we will assume is Gaussian. We define xi as the ith sample and yi = f (xi ) + εi ,
iid 2
with εi ∼ N (0, σnoise ), as the noisy observation of the objective function at xi .
We will discuss noise in more detail in §2.4.
The Bayesian optimization procedure is shown in Algorithm 1. As men-
tioned earlier, it has two components: the posterior distribution over the ob-
jective and the acquisition function. Let us focus on the posterior distribu-
tion first and come back to the acquisition function in §2.3. As we accumu-
late observations D1:t = {x1:t , y1:t }, a prior distribution P (f ) is combined
with the likelihood function P (D1:t |f ) to produce the posterior distribution:
P (f |D1:t ) ∝ P (D1:t |f )P (f ). The posterior captures the updated beliefs about
the unknown objective function. One may also interpret this step of Bayesian
optimization as estimating the objective function with a surrogate function (also
called a response surface). In §2.1, we will discuss how Gaussian process priors
can be placed on f .
6
Figure 2: Simple 1D Gaussian process with three observations. The solid black line
is the GP surrogate mean prediction of the objective function given the data, and the
shaded area shows the mean plus and minus the variance. The superimposed Gaussians
correspond to the GP mean and standard deviation (µ(·) and σ(·)) of prediction at the
points, x1:3 .
includes a very large family of common optimization tasks, and Močkus showed
that the GP prior is well-suited to the task.
A GP is an extension of the multivariate Gaussian distribution to an infinite-
dimension stochastic process for which any finite combination of dimensions will
be a Gaussian distribution. Just as a Gaussian distribution is a distribution over
a random variable, completely specified by its mean and covariance, a GP is a
distribution over functions, completely specified by its mean function, m and
covariance function, k:
7
tion m(x) = 0; alternative priors for the mean can be found in, for example
[Martinez–Cantin et al., 2009, Brochu et al., 2010a]. This leaves us the more
interesting question of defining the covariance function k. A very popular choice
is the squared exponential function:
1 2
k(xi , xj ) = exp − kxi − xj k . (1)
2
Note that this function approaches 1 as values get close together and 0 as
they get further apart. Two points that are close together can be expected
to have a very large influence on each other, whereas distant points have almost
none. This is a necessary condition for convergence under the assumptions of
[Močkus, 1994]. We will discuss more sophisticated kernels in §2.2.
If we were to sample from the prior, we would choose {x1:t } and sample the
values of the function at these indices to produce the pairs {x1:t , f1:t }, where
f1:t = f (x1:t ). The function values are drawn according to a multivariate normal
distribution N (0, K), where the kernel matrix is given by:
k(x1 , x1 ) . . . k(x1 , xt )
K= .. .. ..
.
. . .
k(xt , x1 ) ... k(xt , xt )
Of course, the diagonal values of this matrix are 1 (each point is perfectly
correlated with itself), which is only possible in a noise-free environment. We
will discuss noise in §2.4. Also, recall that we have for simplicity chosen the
zero mean function.
In our optimization tasks, however, we will use data from an external model
to fit the GP and get the posterior. Assume that we already have the observa-
tions {x1:t , f1:t }, say from previous iterations, and that we want to use Bayesian
optimization to decide what point xt+1 should be considered next. Let us de-
note the value of the function at this arbitrary point as ft+1 = f (xt+1 ). Then,
by the properties of Gaussian processes, f1:t and ft+1 are jointly Gaussian:
f1:t K k
∼ N 0, T ,
ft+1 k k(xt+1 , xt+1 )
where
k = k(xt+1 , x1 ) k(xt+1 , x2 ) · · · k(xt+1 , xt )
Using the Sherman-Morrison-Woodbury formula (see, e.g., [Rasmussen and Williams, 2006,
Press et al., 2007]), one can easily arrive at an expression for the predictive dis-
tribution:
P (ft+1 |D1:t , xt+1 ) = N µt (xt+1 ), σt2 (xt+1 )
where
8
1.0
θ =0.1 3
0.8 2
0.6 1
0
0.4 1
0.2 2
0.01.5 1.0 0.5 0.0 0.5 1.0 1.5 30 1 2 3 4 5
1.0
θ =0.2 2.5
2.0
0.8 1.5
0.6 1.0
0.5
0.4 0.0
0.5
0.2 1.0
1.5
0.01.5 1.0 0.5 0.0 0.5 1.0 1.5 2.00 1 2 3 4 5
1.0
θ =0.5 1.5
1.0
0.8 0.5
0.6 0.0
0.5
0.4 1.0
1.5
0.2 2.0
2.5
0.01.5 1.0 0.5 0.0 0.5 1.0 1.5 3.00 1 2 3 4 5
Figure 3: The effect of changing the kernel hyperparameters. Shown are squared
exponential kernels with θ = 0.1, 0.2, 0.5. On the left is the function k(0, x). On the
right are some one-dimensional functions sampled from a GP with the hyperparameter
value.
That is, µt (·) and σt2 (·) are the sufficient statistics of the predictive posterior
distribution P (ft+1 |D1:t , xt+1 ). For legibility, we will omit the subscripts on
µ and σ except where it might be unclear. In the sequential decision making
setting, the number of query points is relatively small and, consequently, the
GP predictions are easy to compute.
9
[Rasmussen and Williams, 2006, page 106]:
where diag(θ) is a diagonal matrix with d entries θ along the diagonal. In-
tuitively, if a particular θ` has a small value, the kernel becomes indepen-
dent of `-th input, effectively removing it automatically. Hence, irrelevant
dimensions are discarded. Figure 3 shows examples of different hyperparam-
eter values on the squared exponential function and what functions sampled
from those values look like. Typically, the hyperparameter values are learned
by “seeding” with a few random samples and maximizing the log-likelihood
of the evidence given θ [Jones et al., 1998, Sasena, 2002, Santner et al., 2003,
Rasmussen and Williams, 2006]. This can often be aided with an informative
hyperprior on the hyperparameters, often a log normal prior [Lizotte, 2008,
Frean and Boyle, 2008]. Methods of learning these values more efficiently is cur-
rently an active subfield of research (e.g. [Osborne, 2010, Brochu et al., 2010a]).
Another important kernel for Bayesian optimization is the Matérn kernel
[Matérn, 1960, Stein, 1999], which incorporates a smoothness parameter ς to
permit greater flexibility in modelling functions:
1 √ ς √
k(xi , xj ) = (2 ς kxi − xj k) Hς (2 ς kxi − xj k) ,
2ς−1 Γ(ς)
where Γ(·) and Hς (·) are the Gamma function and the Bessel function of order ς.
Note that as ς → ∞, the Matérn kernel reduces to the squared exponential ker-
nel, and when ς = 0.5, it reduces to the unsquared exponential kernel. As with
the squared exponential, length-scale hyperparameter are often incorporated.
While the squared exponential and Matérn are the most common kernels
for GPs, numerous others have been examined in the machine learning liter-
ature (see, e.g., [Genton, 2001] or [Rasmussen and Williams, 2006, Chapter 4]
for an overview). Appropriate covariance functions can also be used to ex-
tend the model in other interesting ways. For example, the recent sequential
sensor work of Osborne, Garnett and colleagues uses GP models with exten-
sions to the covariance function to model the characteristics of changepoints
[Osborne et al., 2010] and the locations of sensors in a network [Garnett et al., 2010a].
A common additional hyperparameter is simply a scalar applied to k to control
the magnitude of the variance.
Determining which of a set of possible kernel functions to use for a problem
typically requires a combination of engineering and automatic model selection,
either hierarchical Bayesian model selection [Mackay, 1992] or cross-validation.
However, these methods require fitting a model given a representative sam-
ple of data. In [Brochu et al., 2010a], we discuss how model selection can be
performed using models believed to be similar. The techniques introduced in
[Brochu et al., 2010b] could also be applied to model selection, though that is
outside the scope of this tutorial.
10
2.3 Acquisition Functions for Bayesian Optimization
Now that we have discussed placing priors over smooth functions and how to
update these priors in light of new observations, we will focus our attention on
the acquisition component of Bayesian optimization. The role of the acquisition
function is to guide the search for the optimum. Typically, acquisition functions
are defined such that high acquisition corresponds to potentially high values of
the objective function, whether because the prediction is high, the uncertainty
is great, or both. Maximizing the acquisition function is used to select the
next point at which to evaluate the function. That is, we wish to sample f at
argmaxx u(x|D), where u(·) is the generic symbol for an acquisition function.
where Φ(·) is the normal cumulative distribution function. This function is also
sometimes called MPI (for “maximum probability of improvement”) or “the
P -algorithm” (since the utility is the probability of improvement).
The drawback, intuitively, is that this formulation is pure exploitation.
Points that have a high probability of being infinitesimally greater than f (x+ )
will be drawn over points that offer larger gains but less certainty. As a result,
a modification is to add a trade-off parameter ξ ≥ 0:
The exact choice of ξ is left to the user, though Kushner recommended a sched-
ule for ξ, so that it started fairly high early in the optimization, to drive ex-
ploration, and decreased toward zero as the algorithm continued. Several re-
searchers have studied the empirical impact of different values of ξ in different
domains [Törn and Žilinskas, 1989, Jones, 2001, Lizotte, 2008].
An appealing characteristic of this formulation for perceptual and preference
models is that while maximizing PI(·) is still greedy, it selects the point most
likely to offer an improvement of at least ξ. This can be useful in psychoper-
ceptual tasks, where there is a threshold of perceptual difference.
Jones [2001] notes that the performance of PI(·)
“is truly impressive. It would be quite natural if the reader, like
so many others, became enthusiastic about this approach. But if
there is a single lesson to be taken away from this paper, it is that
11
Figure 4: Gaussian process from Figure 2, additionally showing the region of probable
improvement. The maximum observation is at x+ . The darkly-shaded area in the
superimposed Gaussian above the dashed line can be used as a measure of improvement,
I(x). The model predicts almost no possibility of improvement by observing at x1 or
x2 , while sampling at x3 is more likely to improve on f (x+ ).
12
a new trial point:
Note that this decision process is myopic in that it only considers one-step-
ahead choices. However, if we want to plan two steps ahead, we can easily
apply recursion:
0 ?
xt+1 = argmin E min 0
E(kf t+2 (x ) − f (x )k |Dt+1 ) |D 1:t
x x
That is, I(x) is positive when the prediction is higher than the best value known
thus far. Otherwise, I(x) is set to zero. The new query point is found by
maximizing the expected improvement:
13
where φ(·) and Φ(·) denote the PDF and CDF of the standard normal distribu-
tion respectively. Figure 4 illustrates a typical expected improvement scenario.
It should be said that being myopic is not a requirement here. For ex-
ample, it is possible to derive analytical expressions for the two-step ahead
expected improvement [Ginsbourger et al., 2008] and multistep Bayesian opti-
mization [Garnett et al., 2010b]. This is indeed a very promising recent direc-
tion.
where
(
µ(x)−f (x+ )−ξ
if σ(x) > 0
Z = σ(x) .
0 if σ(x) = 0
This ξ is very similar in flavour to the ξ used in Eqn (2), and to the approach
used by Jones et al. [2001]. Lizotte’s experiments suggest that setting ξ = 0.01
(scaled by the signal variance if necessary) works well in almost all cases, and
interestingly, setting a cooling schedule for ξ to encourage exploration early and
exploitation later does not work well empirically, contrary to intuition (though
Lizotte did find that a cooling schedule for ξ might slightly improve performance
on short runs (t < 30) of PI optimization).
14
posterior 1.0
0.5
0.0
0.0 0.2 0.4 0.6 0.8 1.0
ξ =0.01
ξ =0.10
ξ =1.00
PI
ξ =0.01
ξ =0.10
ξ =1.00
EI
ν =0.2
ν =1.0
GP-UCB
ν =2.0
15
Figure 6: Examples of acquisition functions and their settings in 2 dimensions. The
top row shows the objective function (which is the Branin function here), and the
posterior mean and variance estimates µ(·) and σ 2 (·). The samples used to train the
GP are shows with white dots. The second row shows the acquisition functions for the
GP. From left to right: probability of improvement (Eqn (2)), expected improvement
(Eqn (4)) and upper confidence bound (Eqn (5)). The maximum of each function is
shown with a triangle marker.
This in turn implies a lower-bound on the convergence rate for the optimization
problem.
Figures 5 and 6 show how with the same GP posterior, different acquisition
functions with different maxima are defined. Figure 7 gives an example of how
2 These bounds hold for reasonably smooth kernel functions, where the exact formulation
of the bounds depends upon the form of kernel used. We refer the interested reader to the
original paper [Srinivas et al., 2010].
16
t=1 t=2 t=3 t=4 t=5 t=6
1
0
1
0.0 0.5 1.0
PI
0.0 0.5 1.0
EI
17
0.0 0.5 1.0
GP-UCB
0.0 0.5 1.0
Figure 7: Comparison of probability of improvement (top), expected improvement (middle) and upper confidence bound (bottom) acqui-
sition functions on a toy 1D problem. In the upper rows, the objective function is shown with the dotted red line, the solid blue line is the
GP posterior mean. In the lower rows, the respective infill functions are shown, with a star denoting the maximum. The optimizations
are initialized with the same two points, but quickly follow different sampling trajectories. In particular, note that the greedy EI algorithm
ignores the region around x = 0.4 once it is determined there is minimal chance of improvement, while GP-UCB continues to explore.
PI, EI and GP-UCB give rise to distinct sampling behaviour over time.
With several different parameterized acquisition functions in the literature,
it is often unclear which one to use. Brochu et al. [2010b] present one method
of utility selection. Instead of using a single acquisition function, they adopt
a portfolio of acquisition functions governed by an online multi-armed bandit
strategy, which almost always outperforms the best individual acquisition func-
tion of a suite of standard test problems.
2.4 Noise
The model we’ve used so far assumes that we have perfectly noise-free observa-
tions. In real life, this is rarely possible, and instead of observing f (x), we can
often only observe a noisy transformation of f (x).
The simplest transformation arises when f (x) is corrupted with Gaussian
2
noise ∼ N (0, σnoise ) [Rasmussen and Williams, 2006]. If the noise is additive,
we can easily add the noise distribution to the Gaussian distribution N (0, K)
and define
yi = f (xi ) + i .
Since the mean is zero, this type of noise simply requires that we replace the
kernel K with the following kernel for the noisy observations of f (·):
k(x1 , x1 ) . . . k(x1 , xt )
K= .. .. .. 2
+ σnoise I (6)
. . .
k(xt , x1 ) ... k(xt , xt )
This yields the predictive distribution:
P (yt+1 |D1:t , xt+1 ) = N (µt (xt+1 ), σt2 (xt+1 ) + σnoise
2
),
and the sufficient statistics
µt (xt+1 ) = kT [K + σnoise
2
I]−1 y1:t
σt2 (xt+1 ) = k(xt+1 , xt+1 ) − kT [K + σnoise
2
I]−1 k.
18
In a noisy environment, we also change the definition of the incumbent in
the PI and EI acquisition functions. Instead of using the best observation, we
use the distribution at the sample points, and define as the incumbent, the point
with the highest expected value,
µ+ = argmax µ(xi ).
xi ∈x1:t
19
There exist several consistency proofs for this algorithm in the one-dimensional
setting [Locatelli, 1997] and one for a simplification of the algorithm using sim-
plicial partitioning in higher dimensions [Žilinskas and Žilinskas, 2002]. The
convergence of the algorithm using multivariate Gaussian processes has been
recently established in [Vasquez and Bect, 2008].
2.6 Kriging
Kriging has been used in geostatistics and environmental science since the 1950s
and remains important today. We will briefly summarize the connection to
Bayesian optimization here. More detailed examinations can be found in, for
example, [Stein, 1999, Sasena, 2002, Diggle and Ribeiro, 2007]. This section is
primarily drawn from these sources.
In many modelling techniques in statistics and machine learning, it is as-
sumed that samples drawn from a process with independent, identically dis-
2
tributed residuals, typically, ε ∼ N (0, σnoise ):
y(x) = f (x) + ε
In kriging, however, the usual assumption is that errors are not independent,
and are, in fact, spatially correlated: where errors are high, it is expected that
nearby errors will also be high. Kriging is a combination of a linear regression
model and a stochastic model fitted to the residual errors of the linear model.
The residual is modelled with a zero-mean Gaussian process, so ε is actually
parameterized by x: ε(x) ∼ N (0, σ 2 (x)).
The actual regression model depends on the type of kriging. In simple krig-
ing, f is modelled with the zero function, making it a zero-mean GP model. In
ordinary kriging, f is modelled with a constant but unknown function. Univer-
sal kriging models f with a polynomial of degree k with bases m and coefficients
β, so that
Xk
y(x) = βj mj (x) + ε(x).
j=1
20
2.7 Experimental design
Kriging has been applied to experimental design under the name DACE, after
“Design and Analysis of Computer Experiments”, the title of a paper by Sacks
et al. [1989] (and more recently a book by Santner et al. [2003]). In DACE, the
regression model is a best linear unbiased predictor (BLUP), and the residual
model is a noise-free Gaussian process. The goal is to find a design point or
points that optimizes some criterion.
The “efficient global optimization”, or EGO, algorithm is the combination
of DACE model with the sequential expected improvement (§2.3.1) acquisition
criterion. It was published in a paper by Jones et al. [1998] as a refinement
of the SPACE algorithm (Stochastic Process Analysis of Computer Experi-
ments) [Schonlau, 1997]. Since EGO’s publication, there has evolved a body of
work devoted to extending the algorithm, particularly in adding constraints to
the optimization problem [Audet et al., 2000, Sasena, 2002, Boyle, 2007], and
in modelling noisy functions [Bartz-Beielstein et al., 2005, Huang et al., 2006,
Hutter et al., 2009, Hutter, 2009].
In so-called “classical” experimental design, the problem to be addressed is
often to learn the parameters ζ of a function gζ such that
yi = gζ (xi ) + εi , ∀i ∈ 1, . . . , t
with noise εi (usually Gaussian) for scalar output yi . xi is the ith set of exper-
imental conditions. Usually, the assumption is that gζ is linear, so that
yi = ζ T xi + εi .
An experiment is represented by a design matrix X, whose rows are the inputs
x1:t . If we let ε ∼ N (0, σ), then for the linear model, the variance of the
parameter estimate ζb is
b = σ 2 (XT X)−1 ,
V ar(ζ)
and for an input xi , the prediction is
yt ) = σ 2 xTi (XT X)−1 xi .
V ar(b
An optimal design is a design matrix that minimizes some characteristic of
the inverse moment matrix (XT X)−1 . Common criteria include A-optimality,
which minimizes the trace; D-optimality, which minimizes the determinant; and
E-optimality, which minimizes the maximum eigenvalue.
Experimental design is usually non-adaptive: the entire experiment is de-
signed before data is collected. However, sequential design is an important and
active subfield (e.g. [Williams et al., 2000, Busby, 2009].
21
learning or experimental design is often arbitrary. However, there are a few
distinguishing characteristics that most (but by no means all) active learning
approaches use.
• Active learning is most often adaptive: a model exists that incorporates
all the data seen, and this is used to sequentially select candidates for
labelling. Once labelled, the data are incorporated into the model and
new candidates selected. This is, of course, the same strategy used by
Bayesian optimization.
• Active learning employs an oracle for data labelling, and this oracle is very
often a human being, as is the case with interactive Bayesian optimization.
• Active learning is usually concerned with selecting candidates from a fi-
nite set of available data (pool-based sampling). Experimental design and
optimization are usually concerned with continuous domains.
• Finally, active learning is usually used to learn a model for classification,
or, less commonly, regression. Usually the candidate selection criterion is
the maximization of some informativeness measure, or the minimization
of uncertainty. These criteria are often closely related to the alphabetic
criteria of experimental design.
An excellent recent overview of active learning is [Settles, 2010]. We are
particularly interested in active learning because it often uses a human oracle for
label acquisition. An example using GPs is the object categorization of Kapoor
et al. [2007]. In this work, candidates are selected from a pool of unlabelled
images so as to maximize the margin of the GP classifier and minimize the
uncertainty. Osborne et al. [2010] also use active learning in a Gaussian process
problem, deciding when to sample variables of interest in a sequential sampling
problem with faults and changepoints. Chu and Ghahramani [2005a] briefly
discuss how active learning could be used for ranking GPs, by selecting sample
points that maximize entropy gained with a new preference relation.
Interesting recent work with GPs that straddles the boundary between ac-
tive learning and experimental design is the sensor placement problem of Krause
et al. [2008]. They examine several criteria, including maximum entropy, and
argue for using mutual information. Ideally, they would like to simultaneously
select a set of points to place the entire set of sensors in a way that maximizes the
mutual information. This is essentially a classical experimental design problem
with maximum mutual information as the design criterion. However, maxi-
mizing mutual information over a set of samples is NP-complete, so they use
an active learning approach. By exploiting the submodularity of mutual infor-
mation, they are able to show that sequentially selecting sensor locations that
greedily maximize mutual information, they can bound the divergence of the
active learning approach from the experimental design approach. This work
influenced the GP-UCB acquisition function (§2.3.3).
Finally, as an aside, in active learning, a common acquisition strategy is se-
lecting the point of maximum uncertainty. This is called uncertainty sampling
22
[Lewis and Gale, 1994]. GPs have in the useful property that the posterior vari-
ance (interpreted as uncertainty) is independent of the actual observations! As
a result, if this is the criterion, the entire active learning scheme can be designed
before a single observation are made, making adaptive sampling unnecessary.
2.9 Applications
Bayesian optimization has recently begun to appear in the machine learning
literature as a means of optimizing difficult black box optimizations. A few
recent examples include:
• Lizotte et al. [2007, 2008] used Bayesian optimization to learn a set of
robot gait parameters that maximize velocity of a Sony AIBO ERS-7
robot. As an acquisition function, the authors used maximum probabil-
ity of improvement (§2.3.1). They show that the Bayesian optimization
approach not only outperformed previous techniques, but used drastically
fewer evaluations.
• Frean and Boyle [2008] use Bayesian optimization to learn the weights of
a neural network controller to balance two vertical poles with different
weights and lengths on a moving cart.
• Cora’s MSc thesis [2008] uses Bayesian optimization to learn a hierarchical
policy for a simulated driving task. At the lowest level, using the vehicle
controls as inputs and fitness to a course as the response, a policy is learned
by which a simulated vehicle performs various activities in an environment.
• Martinez–Cantin et al. [2009] also applied Bayesian optimization to policy
search. In this problem, the goal was to find a policy for robot path
planning that would minimize uncertainty about its location and heading,
as well as minimizing the uncertainty about the environmental navigation
landmarks.
• Hutter’s PhD thesis [2009] studies methods of automatically tuning algo-
rithm parameters, and presents several sequential approaches using Bayesian
optimization.
• The work of Osborne, Garnett et al. [Osborne, 2010, Osborne et al., 2010,
Garnett et al., 2010b] uses Bayesian optimization to select the locations
of a set of (possibly heterogenous) sensors in a dynamic system. In this
case, the samples are a function of the locations of the entire set of sensors
in the network, and the objective is the root mean squared error of the
predictions made by the sensor network.
23
man judgement, for instance, preferences are often more accurate than rat-
ings. Prospect theory, for example, employs utility models based on rela-
tion to a reference point, based on evidence that the human perceptual ap-
paratus is attuned to evaluate differences rather than absolute magnitudes
[Kahneman and Tversky, 1979, Tversky and Kahneman, 1992]. We present here
a Bayesian optimization application based on discrete choice for a “preference
gallery” application, originally presented in [Brochu et al., 2007a, Brochu et al., 2007b].
In the case of a person rating the suitability of a procedurally-generated ani-
mation or image, each sample of valuation involves creating an instance with the
given parameters and asking a human to provide feedback, which is interpreted
as the function response. This is a very expensive class of functions to evaluate!
Furthermore, it is in general impossible to even sample the function directly
and get a consistent response from users. Asking humans to rate an animation
on a numerical scale has built-in problems—not only will scales vary from user
to user, but human evaluation is subject to phenomena such as drift, where
the scale varies over time, anchoring, in which early experiences dominate the
scale [Siegel and Castellan, 1988, Payne et al., 1993]. However, human beings
do excel at comparing options and expressing a preference for one over others
[Kingsley, 2006]. This insight allows us to approach the optimization function
in another way. By presenting two or more realizations to a user and requiring
only that they indicate preference, we can get far more robust results with much
less cognitive burden on the user [Kendall, 1975]. While this means we can’t get
responses for a valuation function directly, we model the valuation as a latent
function, inferred from the preferences, which permits a Bayesian optimization
approach.
Probability models for learning from discrete choices have a long history in
psychology and econometrics [Thurstone, 1927, McFadden, 1980, Stern, 1990].
They have been studied extensively, for example, in rating chess players, and
the Elo system [Élő, 1978] was adopted by the World Chess Federation FIDE to
model the probability of one player beating another. It has since been adopted
to many other two-player games such as Go and Scrabble, and, more recently,
online computer gaming [Herbrich and Graepel, 2006].
Parts of §3.1 are based on [Chu and Ghahramani, 2005b], which presents a
preference learning method using probit models and Gaussian processes. They
use a Thurstone–Mosteller model (below), but with an innovative nonparametric
model of the valuation function.
24
In each case, the user has chosen which item she likes best. The data set
therefore consists of the ranked pairs:
D = {ri ci ; i = 1, . . . , M },
where the symbol indicates that the user prefers r to c. We use x1:t to denote
the t distinct elements in the training data. That is, ri and ci correspond to
two elements of x1:t . ri ci can be interpreted as a binary variable that takes
value 1 when ri is preferred to ci and is 0 otherwise.
In the probit approach, we model the value functions v(·) for items r and c
as follows:
v(ri ) = f (ri ) + ε
v(ci ) = f (ci ) + ε, (7)
2
where the noise terms are Gaussian: ε ∼ N (0, σnoise ). Following [Chu and Ghahramani, 2005b],
we assign a nonparametric Gaussian process prior to the unknown mean valua-
tion: f (·) ∼ GP(0, K(·, ·)). That is, at the t training points:
− 12 1 T −1
P (f ) = |2πK| exp − f K f ,
2
where f = {f (x1 ), f (x2 ), . . . , f (xt )}.
Random utility models such as (7) have a long and influential history in
psychology and the study of individual choice behaviour in economic markets.
Daniel McFadden’s Nobel Prize speech [McFadden, 2001] provides a glimpse of
this history. Many more comprehensive treatments appear in classical economics
books on discrete choice theory.
Under our Gaussian utility models, the probability that item r is preferred
to item c is given by:
P (ri ci |f (ri ), f (ci )) = P (v(ri ) > v(ci )|f (ri ), f (ci ))
= P (ε − ε < f (ri ) − f (ci ))
= Φ(Zi ),
where
f (ri ) − f (ci )
Zi = √
2σnoise
and Φ(·) is the CDF of the standard normal distribution. This model, relating
binary observations to a continuous latent function, is known as the Thurstone-
Mosteller law of comparative judgement [Thurstone, 1927, Mosteller, 1951]. In
statistics it goes by the name of binomial-probit regression. Note that one could
−1
also easily adopt a logistic (sigmoidal) link function ϕ (Zi ) = (1 + exp (−Zi )) .
In fact, such choice is known as the Bradley-Terry model [Stern, 1990]. If the
user had more than two choices one could adopt a polychotomous regression
model [Holmes and Held, 2006]. This multi-category extension would, for ex-
ample, enable the user to state no preference, or a degree of preference for any
of the two items being presented.
25
Note that this approach is related to, but distinct from the binomial logistic-
linear model used in geostatistics [Diggle et al., 1998], in which the responses y
represent the outcomes of Bernoulli trials which are conditionally independent
given the model (i.e., the responses are binary observations yt ∈ {0, 1} for xt ,
rather than preference observations between {rt , ct }).
Our goal is to estimate the posterior distribution of the latent utility function
given the discrete data. That is, we want to maximize
M
Y
P (f |D) ∝ P (f ) P (ri ci |f (ri ), f (ci )).
i=1
g = ∇f log P (f |D)
" M
#
1 T −1 X
= ∇f const − f K f + log Φ(Zi )
2 i=1
"M #
X
−1
= −K f + ∇f log Φ(Zi ) = −K−1 f + b,
i=1
where φ(·) denotes the PDF of the standard normal distribution. Clearly, the
∂
derivative hi (xj ) = ∂f (xj)
(f (ri ) − f (ci )) is 1 when xj = ri , -1 when xj = ci
and 0 otherwise. Proceeding to compute the second derivative, one obtains the
26
preferences
0.2 0.1
0.35 0.5
0.2 0.35
0.2 0.6
0.8 0.7
The Hessian is a positive semi-definite matrix. Hence, one can find the MAP
estimate with a simple Newton–Raphson recursion:
At f = fMAP , we have
with b = K−1 fMAP . The goal of our derivation, namely the predictive distribu-
tion P (ft+1 |D), follows by straightforward convolution of two Gaussians:
Z
P (ft+1 |D) = P (ft+1 |fMAP )P (fMAP |D)dfMAP
27
3.2 Application: Interactive Bayesian optimization for ma-
terial design
T arget 1.
2.
3.
4.
28
of realistic image synthesis. The appearance of a material is formalized by the
notion of the Bidirectional Reflectance Distribution Function (BRDF). In com-
puter graphics, BRDFs are most often specified using various analytical models.
Analytical models that are of interest to realistic image synthesis are the ones
that observe the physical laws of reciprocity and energy conservation while typ-
ically also exhibiting shadowing, masking and Fresnel reflectance phenomenon.
Realistic models are therefore fairly complex with many parameters that need to
be adjusted by the designer for the proper material appearance. Unfortunately
these parameters can interact in non-intuitive ways, and small adjustments to
certain settings may result in non-uniform changes in the appearance. This can
make the material design process quite difficult for the artist end user, who is
not expected to be an expert in the field, but who knows the look that she
desires for a particular application without necessarily being interested in un-
derstanding the various subtleties of reflection. We attempt to deal with this
using a preference gallery approach, in which users are simply required to view
two or more images rendered with different material properties and indicate
which they prefer, in an iterative process.
We use the interactive Bayesian optimization model with probit responses on
an example gallery application for helping users find a BRDF. For the purposes
of this example, we limit ourselves to isotropic materials and ignore wavelength
dependent effects in reflection. Our gallery demonstration presents the user
with two BRDF images at a time. We start with four predetermined queries
to “seed” the parameter space, and after that use the learned model to select
gallery images. The GP model is updated after each preference is indicated. We
use parameters of real measured materials from the MERL database for seeding
the parameter space, but can draw arbitrary parameters after that.
By querying the user with a paired comparison, one can estimate statistics
of the valuation function at the query point, but only at considerable expense.
Thus, we wish to make sure that the samples we do draw will generate the
maximum possible improvement.
Our method for achieving this goal iterates over the following steps:
1. Present the user with a new set of instances and record preferences
from the user: Augment the training set of paired choices with the new user
data.
2. Infer the valuation function: Here we use a Thurstone–Mosteller model
with Gaussian processes. See §3.1 for details. Note that in this application, the
valuation function is the objective of Bayesian optimization. We will use the
terms interchangeably.
3. Optimize the acquisition function of the valuation to obtain the query
points for the next gallery: Methods for selecting a set of instances are
described in §3.2.1.
29
algorithm trials n (mean ± std)
random 50 18.40 ± 7.87
argmaxσ 50 17.87 ± 8.60
argmaxEI 50 8.56 ± 5.23
• random The second point is sampled uniformly from the parameter do-
main A.
• argmaxσ The second point is the point of highest uncertainty, argmaxx σ(x).
The results are shown in Table 1. n is the number clicks required of the user
to find the target image. Clearly argmaxEI dominates, with a mean n less than
half that of the competing algorithms. Interestingly, selecting images using
maximum variance does not perform much better than random. We suspect
that this is because argmaxσ has a tendency to select images from the corners
of the parameter space, which adds limited information to the other images,
whereas Latin hypercubes at least guarantees that the selected images fill the
space.
30
game AI solutions, from behaviour trees, to hierarchically decomposed agents
(teams vs. players), implemented by a multitude of customized hierarchical
state machines. The benefits are due to isolating complex decision logic to
fairly independent functional units (tasks). The standard game AI development
process consists of the programmer implementing a large number of behaviours
(in as many ways as there are published video games), and hooking them up to
a more manageable number of tuneable parameters. We present a class of algo-
rithms that attempt to bridge the gap between game development, and general
reinforcement learning. They reduce the amount of hand-tuning traditionally
encountered during game development, while still maintaining the full flexibility
of manually hard-coding a policy when necessary.
The Hierarchical Reinforcement Learning [Barto and Mahadevan, 2003] field
models repeated decision making by structuring the policy into tasks (actions)
composed of subtasks that extend through time (temporal abstraction) and are
specific to a subset of the total world state space (state abstraction). Many al-
gorithms have recently been developed, and are described further in Section 4.1.
The use of Bayesian optimization for control has previously been proposed by
Murray-Smith and Sbarbaro [2002], and (apparently independently) by Frean
and Boyle [2008], who used it for a control problem of balancing two poles on a
cart. This work did not involve a nonhierarchical setting, however.
The exploration policies typically employed in HRL research tend to be slow
in practice, even after the benefits of state abstraction and reward shaping.
We demonstrate an integration of the MAXQ hierarchical task learner with
Bayesian active exploration that significantly speeds up the learning process,
applied to hybrid discrete and continuous state and action spaces. Section 4.2
describes an extended Taxi domain, running under The Open Racing Car Sim-
ulator [Wymann et al., 2009], a 3D game engine that implements complex vehi-
cle dynamics complete with manual and automatic transmission, engine, clutch,
tire, suspension and aerodynamic models.
31
the formulation is very nice and would match game AI development processes,
the underlying solver based on HAMs flattens the task hierarchy by including
the program’s memory and call-stack into a new joint-state space, and solves
this new MDP instead. It is less clear how to extend and implement per-task
customized learning with this formulation. Even if this difficulty is surmounted,
as evidenced by the last line in the concluding remarks of [Marthi et al., 2005],
there is an imperative need for designing faster algorithms in HRL. This paper
aims to address this need.
In our solution, we still require a programmer or designer to specify the
task hierarchy. In most cases breaking a plan into sub-plans is much easier
than coding the decision logic. With the policy space constrained by the task
hierarchy, termination and state abstraction functions, the rate of learning is
greatly improved, and the amount of memory required to store the solution
reduces. The benefits of HRL are very dependant however on the quality of
these specifications, and requires the higher-level reasoning of a programmer or
designer. An automatic solution to this problem would be an agent that can
learn how to program, and anything less than that will have limited applicability.
We can use Bayesian optimization to learn the relevant aspects of value
functions by focusing on the most relevant parts of the parameter space. In the
work on this section, we use refer to the objective as the value function, to be
consistent with the HRL literature.
4.1.1 Semi-MDPs
Each task in an HRL hierarchy is a semi-Markov Decision Process [Sutton et al., 1999],
that models repeated decision making in a stochastic environment, where the
actions can take more than one time step. Formally, an SMDP is defined as a
tuple: {S, A, P (s0 , N |s, a), R(s, a)} where S is the set of state variables, A is a
set of actions, P (s0 , N |s, a) is the transition probability of arriving to state s0
in N time steps after taking action a in s, and R(s, a) is the reward received.
The solution of this process is a policy π ∗ (s) ∈ A, that selects the action with
the highest expected discounted reward in each state. The function V ∗ (s) is the
value of state s when following the optimal policy. Equivalently, the Q∗ (s, a)
function stores the value of taking action a in state s and following the optimal
policy thereafter. These quantities follow the classical Bellman recursions:
X
V ∗ (s) = max R(s, a) + γ P (s0 , N |s, a)γ N V ∗ (s0 )
a∈A
s0 ,N
X
Q∗ (s, a) = R(s, a) + γ P (s0 , N |s, a)γ N V ∗ (s0 ) (8)
s0 ,N
32
is a termination predicate, Zi (s) is a state abstraction function that returns a
subset of the state relevant to the current subtask, and πi (s) ∈ Ai is the policy
learned by the agent (or used to explore during learning). Each task is effec-
tively a separate, decomposed SMDP that has allowed us to integrate active
learning for discrete map navigation with continuous low-level vehicle control.
This is accomplished by decomposing the Q function into two parts:
a = πi (s) (9)
Qπ (i, s, a) = V π (a, s) + C π (i, s, a)
X
C π (i, s, a) = Piπ (s0 , N |s, a)γ N Qπ (i, s0 , πi (s0 ))
s0 ,N
π
π Q
P (i, s, π0 i (s)) 0 if composite
V (i, s) =
s0 P (s |s, i)R(s |s, i) if primitive
Here, γ is the discount factor, i is the current task, and a is a child action
given that we are following policy πi . The Q function is decomposed into two
parts: the value of V π being the expected one step reward, plus C π which is the
expected completion reward for i after a completes. V is defined recursively,
as the expected value of its child actions, or the expected reward itself if i is a
primitive (atomic) action. The MAXQ learning routine is a simple modification
of the typical Q-learning algorithm. In task i, we execute subtask a, observe
the new state s0 and reward r. If a is primitive, we update V (s, a), otherwise
we update C(i, s, a), with learning rate α ∈ (0, 1):
33
Figure 10: City Experiment uses a simplified map (orange overlay) roughly based
on downtown Vancouver, and used by the TORCS simulator. Each waypoint is labeled,
and pickup and dropoff locations are marked by the Taxi icons. One way streets are
accounted for in the waypoint adjacency matrix. Source image care of Google Maps.
34
Table 2: Comparing Domain Size
35
Root - this task selects between Get and Put to pickup and deliver the pas-
senger. It requires no learning because the termination criteria of the subtasks
fully determine when they should be invoked. TRoot = (P assLock = P assDest)
and ZRoot = {}.
Get - getting the passenger involves navigating through the city, parking
the car and picking up the passenger. In this task, the LegalLoad state is true
when the taxi is at the passenger’s location. Receives a reward of 750 when the
passenger is picked up, TGet = ((P assLoc = 0) or (P assLoc = P assDest)),
and ZGet = {}.
Put - similar to Get, also receives reward of 750 when passenger is success-
fully delivered. The passenger destination PassDest is passed to the Navigate
task. The abstracted LegalLoad state is true when the taxi is at the passenger’s
destination location. TP ut = ((P assLoc > 0) or (P assLoc = P assDest)) and
ZP ut = {}.
Pickup - this is a primitive action, with a reward of 0 if successful, and
−2500 if a pickup is invalid (if the taxi is not stopped, or if LegalLoad is false).
ZP ickup = {LegalLoad, Stopped}.
Dropoff - this is a primitive action, with a reward of 1500 if successful, and
−2500 if a dropoff is invalid. ZDropof f = {LegalLoad, Stopped}.
Navigate - this task learns the sequence of intersections from the current
TaxiLoc to a target destination T. By parameterizing the value function of
this task, we can apply Active Path learning as described in Section 4.3.2.
TN avigate = (T axiLoc = T ) and ZN avigate = {T, T axiLoc}.
Follow - this is the previously trained continuous trajectory following task
that takes as input an adjacent waypoint WP, and generates continuous steering
and throttle values to follow the straight-line trajectory from TaxiLoc to WP.
TF ollow = (T axiLoc = WP ) and ZF ollow = {WP , Ωerr , Verr , Yerr , Vy }.
Park - this is a hard-coded task which simply puts on the brakes (steer = 0,
throttle = −1).
Drive - this performs one time step of the physics simulation, with the given
steer and throttle inputs. The default reward per time step of driving is −0.75.
36
Figure 11: Task Hierarchies. Each composite task is a separate SMDP whose policy
is optimal given the optimal policies of its subtasks (recursive optimality). Triangles
are composite tasks, and rectangle are primitive actions. The hierarchy on the right
simplifies learning by reusing policies for navigating form waypoint to waypoint, and the
Navigation task only needs to learn the sequence of waypoints to get to the destination.
For the continuous case, the discrete actions N/S/E/W are replaced by one continuous
Drive(steer, throttle) task, with driving parameters generated by the parameterized
policy contained in the Follow task.
the first local minimum, and it is explicitly designed to minimize the number of
expensive value function evaluations.
The lowest level Drive task uses the parameterized function illustrated in
Figure 12 to generate continuous steer and throttle values, within the range of
-1 to 1. The |x| = 15 parameters (weights) are trained using the Bayesian active
policy learning Algorithm 2. We first generate and evaluate a set of 30 Latin
hypercube samples of x and store them and corresponding values vector V in
the data matrix D. The value of a trajectory is the negative accumulated error
between the car’s position and velocity, and the desired position and velocity.
The policy evaluation consists of averaging 10 episodes along the same trajec-
37
ȍ
ȍ
ȍ
tory but with different, evenly spaced starting angles, where the car needs to
accelerate from rest, go to the first waypoint, perform a u-turn, and arrive back
to the starting location. In a noisier environment, more samples would be neces-
sary to properly evaluate a policy. The TORCS simulator is deterministic, and
a small amount of noise arises from unmodeled tire slipping and random bumpi-
ness of the road. The 10 different starting angles were sufficient for evaluating
a policy in our experiments. Subsequently, we perform the iteration described
in Algorithm 2 to search for the best instantiation of the parameters.
38
Algorithm 3 Active Path Learning with GPs
1: function N avigateT askLearner(N avigate i, State s)
39
nation and terminates, but we still need to mark visited intersections to avoid
indefinite looping. Lines 23–26 compute an estimated value for V (s) by sum-
ming the immediate discounted reward of executing F ollow(WP , s) with the
discounted, previously recorded value of the new state V (s0 ), and a heuristic
penalty factor to avoid looping. Once we reach the destination of this task,
T argeti , we have the necessary information to propagate the discounted reward
to all the intersections along the trajectory, in lines 15–21.
4.4 Simulations
The nature of the domain requires that we run policy optimization first to
train the Follow task. This is reasonable, since the agent cannot be expected
to learn map navigation before learning to drive the car. Figure 13 com-
pares the results of three different values for the GP kernel k, when running
the active policy optimization algorithm from §4.3.1. The desired velocity
is 60 km/hr, a time step lasts 0.25 seconds, and the trajectory reward
R = −∑_t [ 1 × Ỹ²_err + 0.8 × Ṽ²_err + 1 × Ω̃²_err + 0.7 × ã′ã ]
is the negative weighted sum of normalized squared error values between the vehicle
and the desired trajectory, including a = [steer, throttle] to penalize abrupt actions. After ∼ 50
more parameter samples (after the initial 30 random samples), the learner has
already found a usable policy.
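For concreteness, a direct transcription of this reward into code might look as follows (a sketch; the error traces are assumed to be already normalized, per the tildes above).

import numpy as np

def trajectory_reward(Y_err, V_err, Omega_err, actions):
    # Negative weighted sum of normalized squared errors over the episode;
    # actions[t] = [steer_t, throttle_t], and the a'a term penalizes
    # abrupt control inputs.
    Y_err, V_err, Omega_err = map(np.asarray, (Y_err, V_err, Omega_err))
    a = np.asarray(actions)
    per_step = (1.0 * Y_err ** 2 + 0.8 * V_err ** 2
                + 1.0 * Omega_err ** 2 + 0.7 * np.sum(a ** 2, axis=1))
    return -per_step.sum()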
Subsequently, the best parameters are fixed inside the Follow task, and we
run the full hierarchical learners, with results in Figure 14. We averaged the
results from 10 runs of RAR, MAXQ, and the value learning Algorithm 3 ap-
plied only to the Navigate task (with the rest of the hierarchy using MAXQ).
All the experiments use the hierarchical task model presented in Section 4.2.1.
Each reward time step lasts 0.3 seconds, so the fastest learner, V_TM GP with
ε = 0.2, drove for ∼ 4 hours real-time at ∼ 60 km/hr before finding a good ap-
proximation of the V_Navigate value function. Refer to Figure 15 for an intuition
of how fitting the GP over the sampled values transfers observations to adja-
cent areas of the state space. This effect is controlled through the GP kernel
parameter k. While the application is specific to navigating a topological map,
the algorithm is general and can be applied to any continuous state space of
reasonable dimensionality.
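As an illustration of this effect (not the exact kernel parameterization used in our experiments), a squared-exponential kernel with length scale k shows how an observation's influence on a neighbouring state grows with k:

import numpy as np

def sq_exp(x1, x2, k):
    # Squared-exponential kernel; the length scale k controls the
    # 'footprint' of an observation, as in Figure 15.
    d2 = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    return np.exp(-d2 / (2.0 * k ** 2))

# Influence of a value observed at (50, 50) on the nearby state (60, 50):
print(sq_exp([50, 50], [60, 50], k=5.0))    # narrow footprint, ~0.14
print(sq_exp([50, 50], [60, 50], k=20.0))   # wide footprint,  ~0.88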
[Plot residue from Figures 13 and 14: "Active Policy Optimization for Trajectory Following"; y-axis: Accumulated Reward per Episode; x-axis: # of Reward Samples (×10^5); legend: RAR, MAXQ, MAXQ VTM e-Greedy e=0.1, MAXQ VTM GP k=.01 e=0.2.]
Figure 14: Parameterized V_TM vs. RAR and MAXQ: These experiments compare
the original Recursive Average Reward (RAR) and MAXQ (discounted reward)
algorithms against the parameterized V_TM(TaxiLoc, WP) path learner.
[Plot residue from Figure 15: GP Estimate of Value Function, V(x_T, y_T, x_C, y_C) over the (X_C, Y_C) plane, with labelled sample points; (a) V_TM GP, k = 0.01; (b) V_TM GP, k = 0.02.]
Figure 15: GP Response Surface. A small kernel value narrows the ‘footprint’ of
an observation, whereas a larger k interpolates to the surrounding state space.
However, Bayesian optimization is also a fairly recent addition to the ma-
chine learning community, and it has not yet been extensively studied in user applications.
Here, we wish to describe some of the shortcomings we have experienced in our
work with Bayesian optimization, both as caveats and as opportunities for other
researchers.
A particular issue is that the design of the prior is absolutely critical to
efficient Bayesian optimization. Gaussian processes are not always the best or
easiest solution, but even when they are, great care must be taken in the de-
sign of the kernel. In many cases, though, little is known about the objective
function, and, of course, it is expensive to sample from (or we wouldn’t need
to use Bayesian optimization in the first place). The practical result is that in
the absence of (expensive) data, either strong assumptions are made without
certainty that they hold, or a weak prior must be used. It is also often unclear
how to handle the trade-off between exploration and exploitation in the acqui-
sition function. Too much exploration, and many iterations can go by without
improvement. Too much exploitation leads to local maximization.
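One common way to expose this trade-off is through an explicit parameter in the acquisition function, as in the upper-confidence-bound criterion sketched below (a generic illustration, not tied to any of the experiments above).

import numpy as np

def gp_ucb(mu, sigma, kappa):
    # kappa trades off exploitation (posterior mean mu) against
    # exploration (posterior standard deviation sigma).
    return mu + kappa * sigma

# mu and sigma would come from the GP posterior at candidate points:
mu = np.array([0.2, 0.5, 0.1])
sigma = np.array([0.05, 0.02, 0.40])
print(np.argmax(gp_ucb(mu, sigma, kappa=0.1)))   # small kappa exploits: index 1
print(np.argmax(gp_ucb(mu, sigma, kappa=2.0)))   # large kappa explores: index 2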
These problems are exacerbated as dimensionality increases: more dimen-
sions mean that more samples are required to cover the space, and that more parameters
and hyperparameters may need to be tuned, as well. In order to deal with this
problem effectively, it may be necessary to do automatic feature selection, or
assume independence and optimize each dimension individually.
Another limitation of Bayesian optimization is that the acquisition function is cur-
rently myopic and permits only a single sample per iteration. Looking
forward to some horizon would be extremely valuable for reinforcement learn-
ing problems, as well as in trying to optimize within a known budget of future
observations. Recent work [Garnett et al., 2010b, Azimi et al., 2011] has indi-
cated very promising directions for this work to follow. Being able to efficiently
select entire sets of samples to be labelled at each iteration would be a boon to
design galleries and other batch-incremental problems.
Finally, there are many extensions that will need to be made to Bayesian
optimization for particular applications—feature selection, time-varying models,
censored data, heteroskedasticity, nonstationarity, non-Gaussian noise, etc. In
many cases, these can be dealt with as extensions to the prior—in the case of
Gaussian processes, for example, a rich body of literature exists in which such
extensions have been proposed. However, these extensions need to take into
account the adaptive and iterative nature of the optimization problem, and doing
so can range from trivial to impossible.
Clearly, there is a lot of work to be done in Bayesian optimization, but we
feel that the doors it opens make it worthwhile. It is our hope that as Bayesian
optimization proves itself useful in the machine learning domain, the community
will embrace the fascinating new problems and applications it opens up.
References
[Andre, 2003] D. Andre. Programmable Reinforcement Learning Agents. PhD thesis,
University of California at Berkeley, 2003.
[Audet et al., 2000] C. Audet, J. E. Dennis Jr., D. W. Moore, A. Booker, and
P. D. Frank. Surrogate-model-based method for constrained optimization. In
AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Opti-
mization, 2000.
[Azimi et al., 2011] J. Azimi, A. Fern, and X. Z. Fern. Batch Bayesian optimization
via simulation matching. In Advances in Neural Information Processing Systems
24, 2011.
[Barto and Mahadevan, 2003] A. G. Barto and S. Mahadevan. Recent advances in
hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–
77, 2003.
[Bartz-Beielstein et al., 2005] T. Bartz-Beielstein, C. Lasarczyk, and M. Preuss. Se-
quential parameter optimization. In Proc. CEC-05, 2005.
[Baxter and Bartlett, 2001] J. Baxter and P. L. Bartlett. Infinite-horizon policy-
gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic
Programming. Athena Scientific, 1996.
[Betrò, 1991] B. Betrò. Bayesian methods in global optimization. J. Global Optimiza-
tion, 1:1–14, 1991.
[Boyle, 2007] P. Boyle. Gaussian Processes for Regression and Optimisation. PhD
thesis, Victoria University of Wellington, Wellington, New Zealand, 2007.
[Brochu et al., 2007a] E. Brochu, N. de Freitas, and A. Ghosh. Active preference
learning with discrete choice data. In Advances in Neural Information Processing
Systems 21, 2007.
[Brochu et al., 2007b] E. Brochu, A. Ghosh, and N. de Freitas. Preference galleries
for material design. In ACM SIGGRAPH 2007 Posters, page 105, 2007.
[Brochu et al., 2010a] E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interac-
tive optimization approach to procedural animation design. In Eurographics/ ACM
SIGGRAPH Symposium on Computer Animation, 2010.
[Brochu et al., 2010b] E. Brochu, M. Hoffman, and N. de Freitas. Hedging strategies
for Bayesian optimization. eprint arXiv:1009.5419, arXiv.org, September 2010.
[Busby, 2009] D. Busby. Hierarchical adaptive experimental design for Gaussian pro-
cess emulators. Reliability Engineering and System Safety, 94(7):1183–1193, July
2009.
[Chu and Ghahramani, 2005a] W. Chu and Z. Ghahramani. Extensions of Gaussian
processes for ranking: semi-supervised and active learning. In Learning to Rank
workshop at NIPS-18, 2005.
[Chu and Ghahramani, 2005b] W. Chu and Z. Ghahramani. Preference learning with
Gaussian processes. In Proc. 22nd International Conf. on Machine Learning, 2005.
[Cora, 2008] V. M. Cora. Model-based active learning in hierarchical policies. Master’s
thesis, University of British Columbia, Vancouver, Canada, April 2008.
[Cox and John, 1992] D. D. Cox and S. John. A statistical method for global opti-
mization. In Proc. IEEE Conference on Systems, Man and Cybernetics, volume 2,
pages 1241–1246, 1992.
[Cox and John, 1997] D. D. Cox and S. John. SDO: A statistical method for global
optimization. In M. N. Alexandrov and M. Y. Hussaini, editors, Multidisciplinary
Design Optimization: State of the Art, pages 315–329. SIAM, 1997.
[Dietterich, 2000] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ
value function decomposition. Journal of Artificial Intelligence Research, 13:227–
303, 2000.
[Diggle and Ribeiro, 2007] P. J. Diggle and P. J. Ribeiro. Model-Based Geostatistics.
Springer Series in Statistics. Springer, 2007.
[Diggle et al., 1998] P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geo-
statistics. Journal of the Royal Statistical Society: Series C (Applied Statistics),
47(3):299–350, 1998.
[Elder, 1992] J. F. Elder, IV. Global R^d optimization when probes are expensive: The
GROPE algorithm. In Proc. IEEE International Conference on Systems, Man and
Cybernetics, 1992.
[Élő, 1978] Á. Élő. The Rating of Chess Players: Past and Present. Arco Publishing,
New York, 1978.
[Frean and Boyle, 2008] M. Frean and P. Boyle. Using Gaussian processes to optimize
expensive functions. In W. Wobcke and M. Zhang, editors, AI 2008: Advances in
Artificial Intelligence, volume 5360 of Lecture Notes in Computer Science, pages
258–267. Springer Berlin / Heidelberg, 2008.
[Garnett et al., 2010a] R. Garnett, M. Osborne, S. Reece, A. Rogers, and S. Roberts.
Sequential Bayesian prediction in the presence of changepoints and faults. The
Computer Journal, 2010.
[Garnett et al., 2010b] R. Garnett, M. Osborne, and S. Roberts. Bayesian optimiza-
tion for sensor set selection. In Proceedings of the 9th ACM/IEEE International
Conference on Information Processing in Sensor Networks, pages 209–219. ACM,
2010.
[Genton, 2001] M. G. Genton. Classes of kernels for machine learning: A statistics
perspective. Journal of Machine Learning Research, 2:299–312, 2001.
[Ghavamzadeh, 2005] M. Ghavamzadeh. Hierarchical Reinforcement Learning in Con-
tinuous State and Multi-agent Environments. PhD thesis, University of Mas-
sachusetts Amherst, 2005.
[Ginsbourger et al., 2008] D. Ginsbourger, R. Le Riche, and L. Carraro. A Multi-
points Criterion for Deterministic Parallel Global Optimization based on Gaussian
Processes. 2008.
[Goldberg et al., 1998] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Re-
gression with input-dependent noise: A Gaussian process treatment. In Advances
in Neural Information Processing Systems 10, 1998.
[Herbrich and Graepel, 2006] R. Herbrich and T. Graepel. Trueskill: A Bayesian skill
rating system. Technical Report MSR-TR-2006-80, Microsoft Research, June 2006.
[Hinton and Salakhutdinov, 2006] G. Hinton and R. Salakhutdinov. Reducing the
dimensionality of data with neural networks. Science, 313(5786):504 – 507, 2006.
[Holmes and Held, 2006] C. Holmes and L. Held. Bayesian auxiliary variable models
for binary and multinomial regression. Bayesian Analysis, 1(1):145–168, 2006.
[Huang et al., 2006] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng. Global op-
timization of stochastic black-box systems via sequential Kriging meta-models. J.
Global Optimization, 34(3):441–466, March 2006.
[Hutter et al., 2009] F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy.
An experimental investigation of model-based parameter optimisation: SPO and
beyond. In Proc. GECCO’09, 2009.
[Hutter, 2009] F. Hutter. Automating the Configuration of Algorithms for Solving
Hard Computational Problems. PhD thesis, University of British Columbia, Van-
couver, Canada, August 2009.
[Jones et al., 1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian
optimization without the Lipschitz constant. J. Optimization Theory and Apps,
79(1):157–181, 1993.
[Jones et al., 1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global opti-
mization of expensive black-box functions. J. Global Optimization, 13(4):455–492,
1998.
[Jones, 2001] D. R. Jones. A taxonomy of global optimization methods based on
response surfaces. J. Global Optimization, 21:345–383, 2001.
[Kahneman and Tversky, 1979] D. Kahneman and A. Tversky. Prospect theory: an
analysis of decision making under risk. Econometrica, 47:263–291, 1979.
[Kapoor et al., 2007] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active
learning with Gaussian processes for object categorization. In Proc. International
Conference on Computer Vision (ICCV), 2007.
[Kendall, 1975] M. Kendall. Rank Correlation Methods. Griffin Ltd, 1975.
[Kingsley, 2006] D. C. Kingsley. Preference uncertainty, preference refinement and
paired comparison choice experiments. Working Paper 06-06, Center for Economic
Analysis, University of Colorado at Boulder, 2006. Dept. of Economics, University
of Colorado.
[Krause et al., 2008] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor place-
ments in Gaussian processes: Theory, efficient algorithms and empirical studies. J.
Machine Learning Research, 9:235–284, 2008.
[Krige, 1951] D. G. Krige. A statistical approach to some basic mine valuation prob-
lems on the Witwatersrand. J. the Chemical, Metallurgical and Mining Soc. of South
Africa, 52(6), 1951.
[Kushner and Yin, 1997] H. J. Kushner and G. G. Yin. Stochastic Approximation
Algorithms and Applications. Springer-Verlag, 1997.
[Kushner, 1964] H. J. Kushner. A new method of locating the maximum of an arbi-
trary multipeak curve in the presence of noise. J. Basic Engineering, 86:97–106,
1964.
[Lewis and Gale, 1994] D. Lewis and W. Gale. A sequential algorithm for training
text classifiers. In Proc. ACM SIGIR Conference on Research and Development in
Information Retrieval, 1994.
[Liberti and Maculan, 2006] L. Liberti and N. Maculan, editors. Global Optimiza-
tion: From Theory to Implementation. Springer Nonconvex Optimization and Its
Applications. Springer, 2006.
[Lizotte et al., 2007] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Auto-
matic gait optimization with Gaussian process regression. In IJCAI, 2007.
[Lizotte, 2008] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University
of Alberta, Edmonton, Alberta, Canada, 2008.
[Locatelli, 1997] M. Locatelli. Bayesian algorithms for one-dimensional global opti-
mization. J. Global Optimization, 1997.
[Mackay, 1992] D. J. C. Mackay. A practical Bayesian framework for backpropagation
networks. Neural Computation, 4(3):448–472, 1992.
[Marthi et al., 2005] B. Marthi, D. Latham, S. Russell, and C. Guestrin. Concurrent
hierarchical reinforcement learning. In Proceedings of the 19th International Joint
Conference on Artificial Intelligence, 2005.
[Martinez–Cantin et al., 2006] R. Martinez–Cantin, N. de Freitas, and J. Castellanos.
Analysis of particle methods for simultaneous robot localization and mapping and
a new algorithm: Marginal-SLAM. In Proc. IEEE International Conference on
Robots and Automation, 2006.
[Martinez–Cantin et al., 2007] R. Martinez–Cantin, N. de Freitas, A. Doucet, and
J. A. Castellanos. Active policy learning for robot planning and exploration un-
der uncertainty. Robotics: Science and Systems (RSS), 2007.
[Martinez–Cantin et al., 2009] R. Martinez–Cantin, N. de Freitas, E. Brochu,
J. Castellanos, and A. Doucet. A Bayesian exploration-exploitation approach for op-
timal online sensing and planning with a visually guided mobile robot. Autonomous
Robots, 27(2):93–103, 2009.
[Matérn, 1960] B. Matérn. Spatial Variation. Springer-Verlag, 2nd (1986) edition,
1960.
[Matheron, 1971] G. Matheron. The theory of regionalized variables and its applica-
tions. Cahier du Centre de Morphologie Mathematique, Ecoles des Mines, 1971.
[McFadden, 1980] D. McFadden. Econometric models for probabilistic choice among
products. Journal of Business, 53(3):13–29, 1980.
[McFadden, 2001] D. McFadden. Economic choices. The American Economic Review,
91:351–378, 2001.
[Močkus et al., 1978] J. Močkus, V. Tiesis, and A. Žilinskas. Toward Global Opti-
mization, volume 2, chapter The Application of Bayesian Methods for Seeking the
Extremum, pages 117–128. Elsevier, 1978.
[Močkus, 1982] J. Močkus. The Bayesian approach to global optimization. In
R. Drenick and F. Kozin, editors, System Modeling and Optimization, volume 38,
pages 473–481. Springer Berlin / Heidelberg, 1982.
[Močkus, 1994] J. Močkus. Application of Bayesian approach to numerical methods
of global and stochastic optimization. J. Global Optimization, 4(4):347 – 365, 1994.
[Mongeau et al., 1998] M. Mongeau, H. Karsenty, V. Rouzé, and J.-B. Hiriart-Urruty.
Comparison of public-domain software for black-box global optimization. Technical
Report LAO 98-01, Universite Paul Sabatier, Toulouse, France, 1998.
[Mosteller, 1951] F. Mosteller. Remarks on the method of paired comparisons: I. the
least squares solution assuming equal standard deviations and equal correlations.
Psychometrika, 16:3–9, 1951.
[Murray-Smith and Girard, 2001] R. Murray-Smith and A. Girard. Gaussian process
priors with ARMA noise models. In Irish Signals and Systems Conference, 2001.
[Murray-Smith and Sbarbaro, 2002] R. Murray-Smith and D. Sbarbaro. Nonlinear
adaptive control using non-parametric Gaussian process prior models. In 15th IFAC
World Congress on Automatic Control. Citeseer, 2002.
[Ng and Jordan, 2000] A. Y. Ng and M. I. Jordan. Pegasus: A policy search method
for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence (UAI2000),
2000.
[O’Hagan, 1978] A. O’Hagan. On curve fitting and optimal design for regression.
Journal of the Royal Statistical Society B, 40:1–42, 1978.
[Osborne et al., 2010] M. Osborne, R. Garnett, and S. Roberts. Active data selection
for sensor networks with faults and changepoints. In IEEE International Conference
on Advanced Information Networking and Applications, 2010.
[Osborne, 2010] M. Osborne. Bayesian Gaussian Processes for Sequential Prediction,
Optimization and Quadrature. PhD thesis, University of Oxford, 2010.
[Parr, 1998] R. E. Parr. Hierarchical control and learning for Markov decision pro-
cesses. PhD thesis, University of California, Berkeley, 1998.
[Payne et al., 1993] J. W. Payne, J. R. Bettman, and E. J. Johnson. The Adaptive
Decision Maker. Cambridge University Press, 1993.
[Poyiadjis et al., 2005] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle methods for
optimal filter derivative: Application to parameter estimation. In IEEE ICASSP,
pages 925–928, 2005.
[Press et al., 2007] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery.
Numerical Recipes: The Art of Scientific Computing. Cambridge University Press,
3rd edition, 2007.
[Rasmussen and Williams, 2006] C. E. Rasmussen and C. K. I. Williams. Gaussian
Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.
[Sacks et al., 1989] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and
analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
[Santner et al., 2003] T. J. Santner, B. Williams, and W. Notz. The Design and Anal-
ysis of Computer Experiments. Springer, 2003.
[Sasena, 2002] M. J. Sasena. Flexibility and Efficiency Enhancement for Constrained
Global Design Optimization with Kriging Approximations. PhD thesis, University
of Michigan, 2002.
[Schonlau, 1997] M. Schonlau. Computer Experiments and Global Optimization. PhD
thesis, University of Waterloo, Waterloo, Ontario, Canada, 1997.
[Settles, 2010] B. Settles. Active learning literature survey. Computer Science Tech-
nical Report 1648, University of Wisconsin-Madison, January 2010.
[Siegel and Castellan, 1988] S. Siegel and N. J. Castellan. Nonparametric Statistics
for the Behavioral Sciences. McGraw-Hill, 1988.
[Srinivas et al., 2010] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian
process optimization in the bandit setting: No regret and experimental design. In
Proc. International Conference on Machine Learning (ICML), 2010.
[Stein, 1999] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging.
Springer Series in Statistics. Springer, 1999.
[Stern, 1990] H. Stern. A continuum of paired comparison models. Biometrika,
77:265–273, 1990.
[Streltsov and Vakili, 1999] S. Streltsov and P. Vakili. A non-myopic utility function
for statistical global optimization algorithms. J. Global Optimization, 14:283–298,
1999.
[Stuckman, 1988] B. Stuckman. A global search method for optimizing nonlinear
systems. IEEE Transactions on Systems, Man and Cybernetics, 18(6):965–977,
1988.
[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An
Introduction. MIT Press, 1998.
[Sutton et al., 1999] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement learning. Ar-
tificial Intelligence, 112(1-2):181–211, 1999.
[Thurstone, 1927] L. Thurstone. A law of comparative judgement. Psychological Re-
view, 34:273–286, 1927.
[Törn and Žilinskas, 1989] A. Törn and A. Žilinskas. Global Optimization. Springer-
Verlag, 1989.
[Tversky and Kahneman, 1992] A. Tversky and D. Kahneman. Advances in prospect
theory: Cumulative representation of uncertainty. J. Risk and Uncertainty, 5:297–
323, 1992.
[Vasquez and Bect, 2008] E. Vasquez and J. Bect. On the convergence of the expected
improvement algorithm. Technical Report arXiv:0712.3744v2, arXiv.org, Feb 2008.
[Williams et al., 2000] B. J. Williams, T. J. Santner, and W. I. Notz. Sequential
design of computer experiments to minimize integrated response functions. Statistica
Sinica, 10:1133–1152, 2000.
[Wymann et al., 2009] B. Wymann, C. Dimitrakakis, and C. Alexopoulos. The open
racing car simulator (http://torcs.sourceforge.net/), 2009.
[Younes, 1989] L. Younes. Parameter estimation for imperfectly observed Gibbsian
fields. Prob. Theory and Rel. fields, 82:625–645, 1989.
[Zhigljavsky and Žilinskas, 2008] A. Zhigljavsky and A. Žilinskas. Stochastic Global
Optimization. Springer Optimization and Its Applications. Springer, 2008.
[Žilinskas and Žilinskas, 2002] A. Žilinskas and J. Žilinskas. Global optimization based
on a statistical model and simplicial partitioning. Computers and Mathematics with
Applications, 44:957–967, 2002.
[Žilinskas, 1980] A. Žilinskas. Lecture Notes in Control and Information Sciences,
chapter On the Use of Statistical Models of Multimodal Functions for the Con-
struction of Optimization Algorithms. Number 23. Springer-Verlag, 1980.