Bayesian task embedding for few-shot Bayesian optimization
arXiv:2001.00637v1 [cs.LG] 2 Jan 2020
Steven Atkinson∗ , Sayan Ghosh† , Natarajan Chennimalai-Kumar‡ , Genghis Khan§ , and Liping Wang¶
GE Research, 1 Research Circle, Niskayuna, NY, 12309, USA
We describe a method for Bayesian optimization by which one may incorporate data from
multiple systems whose quantitative interrelationships are unknown a priori. All general (nonreal-valued) features of the systems are associated with continuous latent variables that enter
as inputs into a single metamodel that simultaneously learns the response surfaces of all of
the systems. Bayesian inference is used to determine appropriate beliefs regarding the latent
variables. We explain how the resulting probabilistic metamodel may be used for Bayesian
optimization tasks and demonstrate its implementation on a variety of synthetic and realworld examples, comparing its performance under zero-, one-, and few-shot settings against
traditional Bayesian optimization, which usually requires substantially more data from the
system of interest.
I. Introduction
Many engineering design problems can be cast as optimization problems of the form
x ∗ = arg min η(x),
(1)
x ∈X
where x ∈ X is a dx -dimensional vector of design and operation parameters, and η is some quantity of interest, such as
the strength of a material or the efficiency of an engine. Solving optimization problems that fit under the scope of Eq.
(1) can be challenging if the dimension dx of the search space is high and/or the function η is nonconvex and highly
rugged, resulting in an intractably large number of potential solutions that may be considered. Furthermore, if it is
expensive to evaluate η (corresponding to performing a laboratory experiment or running a complex simulation code),
one may be limited by their experimental budget and be forced to try and determine a good solution to Eq. (1) with few
opportunities to gain information about the true response surface for the system of interest.
The most elementary approaches for solving Eq. (1) involve specifying some space-filling design and querying η at
all of the design points. While rudimentary designs such as factorial designs suffer from the curse of dimensionality and
are intractable for all but the simplest problems, more advanced space-filling designs such as Latin hypercube designs
still may require far too many evaluations than is feasible. Improving on these by orders of magnitude, in Bayesian
optimization (BO), one defines a Bayesian metamodel y(x) [such as a Gaussian process (GP)] to approximate η and
uses it to adaptively select promising designs until the optimum is found or an experimental budget is exhausted. Still,
such an approach commonly takes tens to hundreds of evaluations to converge to a satisfactory degree in practice. In
this work, we are interested in improving on this by another order of magnitude, finding plausible solutions to Eq. (1)
with a handful–or no—evaluations of η.
Humans intuitively solve optimization problems in their daily lives in novel settings by leveraging knowledge about
previous related experiences. By contrast, it is common to start quantitative engineering design problems from scratch
since it is unclear how to re-use data from legacy systems with different response surfaces. There is a growing literature
around methods that seek to devise suitable models to bridge the gap between these disparate sources of information.
Through this, we hope to leverage so-called “legacy” data about previously seen tasks in a manner that endows our
current effort with strong, but reasonable inductive biases that guide it towards effective solutions.
The main technical challenge is to devise a means of reasoning in a quantitative way about features of data that
are not numerical in nature and therefore not suitable for standard modeling approaches∗ These “general” features are
∗ Research
Engineer, Mechanical Systems, GE Research, Niskayuna, New York.
Engineer, Mechanical Systems, GE Research, Niskayuna, New York.
‡ Senior Engineer, Mechanical Systems, GE Research, Niskayuna, New York.
§ Senior Principal Engineer, Mechanical Systems, GE Research, Niskayuna, New York.
¶ Technology Manager, Mechanical Systems, GE Research, Niskayuna, New York.
∗ Some literature refer to these as “qualitative” features, though this seems somewhat of a misnomer since certain types of attributes in question
can be numerical in nature, such as zip codes, yet are clearly unsuitable to treat as numbers in a model; others may not be quantitative, but are
nonetheless precise (e.g. the name of an operator), yet “qualitative” does not convey this preciseness.
† Lead
typically found when describing the difference between tasks [1] and could include, for example, the serial number of a
machine, the identity of an operator, or a chemical compound involved in some process of interest.
The key to our approach is to learn a probabilistic embedding associated with the general features associated with a
system such that notions of similarity can be quantified and utilized by a downstream data-driven model. Minding our
ultimate goal of solving optimization problems, we focus on Gaussian process metamodels and call the composition of
our probabilistic embedding with the Gaussian process metamodel “Bayesian embedding GP” (BEGP). The contributions
of this work are the following:
1) We define the structure of the BEGP metamodel designed to fuse information from systems with differing general
features. We use a variational inference scheme to learn to infer reasonable probabilistic embeddings of the
general features that capture uncertainty due to limited data while showing that the compositional model can be
recognized as a deep Gaussian process [2] with a particular choice of kernel function.
2) We explain and demonstrate the application of this model to the task of Bayesian optimization, showing how the
BEGP can be used to satisfy the usual requirements of the algorithm.
3) We conduct a series of computational experiments on a variety of synthetic and real-world systems to illustrate
the usage of our approach and compare its performance to existing methods and evaluating the contribution of
various components of the metamodel.
The scope of our work here is optimization problems of the form in Eq. (1). However, because our approach can
be used as a drop-in replacement for other Bayesian metamodels, it is straightforward to extend our work to cases
including multi-objective optimization and problems involving complex or unknown constraints. We also consider
regression tasks as a stepping stone to our ultimate application of interest since satisfactory predictive power is a desired
preliminary skill.
The remainder of this paper is organized as follows: In section II, we explain the formulation of our metamodel and
its usage within Bayesian optimization In section III, we review related approaches and results from the literature. In
section IV, we demonstrate our approach on a variety of synthetic and real-world examples. In section V, we offer
concluding remarks.
II. Methodology
In this section, we describe the methodology including our model definition, training objective, predictive distribution,
and application to Bayesian optimization.
A. Model definition
We begin the discussion of our methodology by defining our model, explaining how inference may be done to
determine its posterior, and how predictions are computed.
1. Gaussian process regression
We begin by reviewing Gaussian processes for regression; the interested reader may consult [3] for more details. A
Gaussian process (GP) GP(µ(·), k(·, ·)) with mean function µ and kernel k is a distribution over functions such that
the marginal distribution of the random function over a finite index set is a multivariate Gaussian. Concretely, let
f : X → R be described by a GP where X is an input space that indexes the random process; traditionally, this will be
dx -dimensional Euclidean space Rd x or some subset therein. However, any (infinite) set of inputs might be used, and
we are mindful of this potential generality in the review in this section. Given n inputs X ∈ X n with corresponding
outputs f ∈ Rn , we write the joint probability density of f as
p( f |X) = N ( f |µ, K f f ),
(2)
where µi = µ(x i ), and (K f f )i j = k(x i, x j ). The quantity f is recognized as the latent output of the GP model. Next, we
define a Gaussian likelihood
p(y| f ) = N (y| f , σy2 In×n ),
(3)
where y ∈ Rn denotes the observed output values. Integrating out f results in the familiar marginal likelihood
p(y|X) = N (y|µ, Kyy ),
(4)
where Kyy = K f f + σy2 In×n . The negative logarithm of this quantity is conventionally used to determine appropriate
parameters for the mean and kernel functions as well as the likelihood through gradient-based minimization. Alternatively,
2
approximate Bayesian inference of the model parameters may be done as well using Markov chain Monte Carlo or
variational inference once suitable priors are defined [4].
Given a training set D = {X, y} ∈ X n × Rn , predictions are made by using Bayes’ rule to condition the GP on the
∗
∗
training set. The resulting predictive distribution over the latent outputs f ∗ ∈ Rn at some test inputs X∗ ∈ X n is
∗
−1
p( f ∗ |X∗, D) = N ( f ∗ |K∗ f K−1
yy (y − µ) + µ , K∗∗ − K∗ f Kyy K f ∗ ),
(5)
(K f ∗ )i, j = k(x i, x ∗j ),
(6)
where
(K∗∗ )i, j =
k(x ∗i ,
x ∗j ),
(7)
K∗ f = K f ∗ , and µ∗i = µ(x ∗i ). Applying the likelihood and marginalizing out f ∗ gives the posterior in output space
⊺
∗
−1
2
p(y ∗ |X∗, D) = N (y ∗ |K∗ f K−1
yy (y − µ) + µ , K∗∗ − K∗ f Kyy K f ∗ + σy In∗ ×n∗ ).
(8)
Remark: Our use of a likelihood term in our model definition generalizes the concept of a “nugget” or “jitter” that
is sometimes used within the GP metamodeling community [4, 5], usually with the stated purpose of either improving
the conditioning of the kernel matrix or representing the noisiness in the observations being modeled.
2. Incorporating general input features as inputs to Gaussian process regression models
The main source of the rich behavior in GP models is their kernel function, which encodes strong inductive biases
about the statistics of the functions being modeled. It is most common to consider parametric kernels of the form
k(·, ·; θ k ) : Rd x × Rd x → R. For example, the popular exponentiated quadratic kernel is defined as
" d
#
x
Õ
xi − xi′ 2
′
2
,
(9)
k(x, x ; θ k ) = σk exp −
li
i=1
and can be evaluated easily for any pair of dx -dimensional vectors x and x ′.
While this is suitable for real-valued inputs commonly found in engineering problems, it is not suitable for more
general inputs such as those mentioned above. One workaround is to embed elements from general input spaces as
dz -dimensional latent variables; one can then form the complete input vector by concatenating these latents z with
the real-valued inputs x (r) to form a dx (r ) + dz -dimensional input vector amenable to computation with traditional
parametric kernels such as Eq. (9).
More general nonparametric positive-definite kernels exist that are valid for Gaussian processes. For example, the
white noise kernel function,
(
2
σwhite
if x = x ′
′
k white (x, x ; θ) =
,
(10)
0
otherwise
does not require Euclidean inputs; instead, it merely requires that we are able to distinguish between elements in its
index set.
Therefore, to accomplish the embedding mentioned above, we map our general inputs x (g) ∈ Xg through a dz -variate
Gaussian process with white noise kernel:
z i ∼ GP (0; k white (·, ·)) : Xg → R, i = 1, . . . , dz .
(11)
Sampling this GP at some general input returns potential embeddings of the input as a dz -dimensional vector in
Euclidean space. These latent variables may then be concatenated with the numerical feature space and fed through
a second Gaussian process that serves as the familiar regression model employed in Bayesian metamodeling. The
Bayesian embedding GP (BEGP) model that combines the embedding Gaussian process with the GP regression model
is shown in Fig. 1. This model is composed of two GPs, where the outputs of the first (embedding) GP are fed as
inputs to the second (regression) GP, making it a deep Gaussian process with serial topology similar to those originally
discussed in [2].
Our modeling approach may be thought of as a type of multi-task learning in that one doing regression traditionally
segregates datasets along differences in general features, building a separate model for each dataset using the
3
𝒙(𝑔)
𝒢𝒫
𝒢𝒫
𝒛
𝒇
𝑝(𝒚|𝒇)
𝒚
𝒙(𝑟)
Fig. 1 Probabilistic graphical model of the Bayesian embedding Gaussian process (BEGP) model. General
inputs x (g) are first mapped through a GP with white noise kernel to latent variables z. These latents are
concatenated with real-valued inputs x (r) to form the full inputs x to the second GP model with a traditional
parametric mean function and kernel. White nodes denote variables that are unobserved, and shaded nodes are
observed. Point nodes are deterministic, and large nodes are associated with probability distributions. Model
parameters associated with mean functions, kernels, and the likelihood are hidden for clarity.
remaining real-valued features; by contrast, here we build in correlations across general features within a single model,
simultaneously improving the predictive power across all of the input space.
Remark: for systems with multi-dimensional general inputs (e.g. operator IDs and machine serial numbers), it is
straightforward to extend our approach to embed each general input dimension separately to its own latent space; these
latents are then all concatenated with each other to define the total latent variable which is then concatenated with x (r)
as before.
3. Inference for the Bayesian embedding GP model
Having defined our model, we now discuss how to do inference on it. Inference for the Bayesian EGP model presents
two challenges. First, inferring the posterior of the embedding layer is challenging due to its nonparameteric nature
stemming from the white noise kernel combined with the fact that its posterior will generally be non-Gaussian due to
the nonlinearity of the downstream model that operates on it. Second, unlike a traditional Gaussian process regression
model, it is analytically intractable to marginalize out the latent variables z. In this section, we discuss simple strategies
for overcoming these two challenges so that we may devise an objective function for training our model.
First, we consider the challenges associated with the embedding layer. Let ng denote the number of distinct general
inputs that we observe from Xg (i.e. the number of legacy datasets that we would like to incorporate in our model). Due
n
to the kernel, the joint prior density over latents Z ∈ Rng ×dz associated with general inputs X(g) ∈ Xg g is
dz
n Ö
Ö
N (zi j |0, σw2 ),
p Z|X(g) =
(12)
i=1 j=1
where zi j is the j-th dimension associated with the latent for input i. For reasons analogous to those mentioned in
previous literature [2, 6, 7], the posterior over these inputs is generally non-Gaussian. We define a mean-field variational
posterior q(Z) to approximate the true posterior p(Z|D) with the form
q(Z) =
ng Ö
dz
Ö
N (zi j |mi j , si j ),
(13)
i=1 j=1
where M, S ∈ Rng ×dz are variational parameters. Fortunately, one is usually only interested in a relatively small number
of inputs in Xg , so it is not a challenge to track these variational parameters. In fact, computational challenges associated
with data scalability of Gaussian process models are likely to become problematic well before challenges associated
with working with a high number of tasks, even though recent advances have make significant progress in lift traditional
barriers to Gaussian process modeling in large-data regimes [7–12].
Lastly, note that if no data have been observed for some held-out general input x (g),∗ , then, by the nature of the white
noise kernel, the posterior over the associated latent is equal to the prior. this observation is critical for enabling our
model to make credible predictions in a zero-shot setting, where no data about some task of interest is available at the
4
outset of optimization. Indeed, we will see that this probabilistic approach enables our model to make surprisingly good
inferences to guide the first iteration of Bayesian optimization in such settings.
Having resolved the challenge of representing the posterior of the embedding layer, we now turn our attention to the
second challenge regarding the intractability of the model’s marginal log-likelihood (or evidence), which usually serves
as the objective function for training probabilistic models. Again, following a variational strategy, we derive a tractable
evidence lower bound (ELBO) that can be used in place of the exact marginal log-likelihood. While this approach was
first demonstrated for Gaussian processes using a sparse approach based on inducing variables, it turns out that later
advances in variational inference [13] allow us to avoid relying on a sparse GP formulation. That said, we recognize that
such an approach may realistically be helpful in the plausible cases where an abundance of legacy data takes the problem
into a large-data setting. This is in contrast to the usual assumptions of data-scarcity in engineering metamodeling,
which are usually myopically focused only on solving a single task in isolation. For the sake of completeness, we now
provide the derivation of the ELBO for the BEGP model. Based on the probabilistic graphical model shown in Fig. 1,
the joint probability of our model with n observations is
P = p(Z, f , y|X(r), X(g) ) = p(Z|X(g) )p( f |Z, X(r) )p(y| f ),
(14)
where p(Z|X(r) ) is given by Eq. (12), p( f |Z, X(r) ) is analogous to Eq. (2), and p(y| f ) is the likelihood of Eq. (3). and
the marginal log-likelihood is obtained by simply integrating over the latent variables and taking the logarithm of the
result:
∫
log p(y|X(r), X(g) ) = log Pd f dZ.
(15)
From this point onwards, we will omit the conditioning on the inputs for brevity. While we can integrate out f due to
the conjugacy of the likelihood, Z cannot be integrated out analytically. We multiply and divide the integrand by q(Z)
to obtain
∫
p(y, Z)
log p(y) = log q(Z)
dZ.
(16)
q(Z)
Applying Jensen’s equality, we move the logarithm inside the integrand to get
∫
p(y, Z)
log p(y) ≥ E LBO =
q(Z) log
dZ.
q(Z)
Rearranging and using p(y, Z) = p(y|Z)p(Z) gives
∫
E LBO =
q(Z) (log p(y|Z) + log p(Z) − q(Z)) dZ.
(17)
(18)
Additional progress may be made at this point by restricting our choice of kernel function for the second GP layer to
either an exponentiated quadratic or a linear kernel since the kernel expectations that form the crux of the challenge
may then be evaluated analytically. Furthermore, under the special case where the same real-valued input points are
sampled for each general input, Atkinson and Zabaras [14, 15] showed that the kernel expectations can be further broken
down into a Kronecker product between a deterministic kernel matrix and a kernel expectation enabling extensions to
modeling millions of data. In this work, we instead opt to estimate Eq. (18) through sampling so as to maintain the
ability to leverage arbitrary kernel functions.
The model is trained by maximizing Eq. (18) with respect to the parameters of the variational posteriors as well as
the kernel hyperparameters and likelihood variance σy2 through gradient-based methods. We find that it is typically
sufficient to approximate the integral with a single sample from q(Z) at each iteration of the optimization.
Note that there is a redundancy in defining the scale of the latent space in that our model possesses both a parameter
σw2 for the prior variance of the embedding layer as well as length scales for the parametric kernel of the GP regression
layer. Given this freedom, one can make the ELBO arbitrarily high by bringing σw2 towards zero while scaling the
variational parameters of q(Z) and the length scales of the regression kernel correspondingly without changing the
predictions of the model. to avoid this pathology and establish the scale of the latent space, we fix σw2 = 1.
We choose to learn point estimates of the other model parameters since they are informed by all of the tasks’ data.
Therefore, the uncertainty contributed to the predictions due to uncertainty in these parameters will generally be small.
This is again only because we are incorporating a large amount of data into our model by leveraging legacy data; in
the typical single-task engineering setting, we might expect to have far fewer data, and Bayesian inference over the
parameters may provide value to the model.
5
A remark on variational inference for EGP A reasonable alternative to our variational inference approach is to
instead perform inference using Monte Carlo. We favor VI because we need to update the model posteriors repeatedly
during Bayesian optimization as new data are added to the model. This is simple to do with VI since we can continue
our gradient-based optimization from the previous variational posterior. By contrast, Monte Carlo requires us to run the
chain for almost as long as the initial chain every time that we need to update the model. Thus, while there is not such a
clear advantage for one method compared to the other when initially training the model on legacy data, VI quickly
becomes more attractive in the following steps. Furthermore, we note that updating the model parameters is rather
important since we are focusing on extreme low-data regimes (fewer than 10 data from the task of interest), where each
new datum will generally have considerable effects on the posterior.
Another reason for using MC inference is because highly-factorized variational posterior distributions are known
to underestimate the entropy of the true posterior when there is mismatch between the variational form and the true
posterior [16]. One might reasonably worry that we are neglecting important uncertainty information by using a
mean-field posterior for q(z). We find in practice that this approximation successfully captures much of the uncertainty
and is sufficient to provide competitive or superior predictions compared to the baselines considered in our examples.
However, our method is not incompatible with more powerful VI methods such as multivariate Gaussians with full
covariance matrices [15] or normalizing flow-based approaches [17].
4. Predictions with the model
Given a trained model, we are now interested in making predictions at held-out inputs in Xr × Xg . Consider a set of
n∗
n∗
test inputs X∗ ∈ Xg g × Xr r with ng∗ distinct general inputs and nr∗ distinct real-valued inputs, combined to form n∗ test
∗
inputs in all. We first apply the embedding layer X(g),∗ to obtain a sample of the latents Z∗ ∈ Rng ×dz as well as the
training data’s
to form the full training and test inputs for the regression model,
latents Z. These latents are expanded
∗
{X, X∗ } = [X(r), Z], [X(r),∗, Z∗ ] ∈ Rn×d x × Rn ×d x , where dx = dx (r ) + dz . Given these samples we can compute the
conditional predictive distributions over the latent and observed outputs, given by Eq. (5) and (8), respectively.
Due to the nonlinearity of the GP model, marginalizing over latent variables’ variational posterior q(Z, Z∗ ) generally
induces a non-Gaussian predictive distribution. However, given a Gaussian variational posterior, the moments of the
marginal predictive distribution admit analytic expressions; previous works [6, 18] have approximated the predictive
distribution as Gaussian with these statistics. Alternatively, one may sample the predictive distribution in order to better
resolve its details. We utilize the former approach in our examples.
B. Bayesian optimization
Having discussed the formulation of our model, including training and predictions, we now turn our attention to
using it in the context of optimization. The main algorithm for Bayesian optimization is given in Algorithm 1. We
restrict ourselves to the task of optimizing subject to a fixed general input x ∗g , though our method permits searching over
multiple general inputs with trivial modification.
Algorithm 1 Bayesian optimization
Require: Training data D, general input of interest x ∗g , acquisition function a(·), computational budget n∗ .
Ensure: optimal design x (r),∗
1: M → BEGP(D).
2: n → 0, D ∗ = ∅.
3: for i = 1, . . . , n∗ do
4:
Train M on D.
(r)
5:
x (r)
next → arg min x (r ) ∈Xr a(x ).
6:
ynext → η(x ∗g, x (r)
next ), n → n + 1.
7:
8:
D → D ∪ {(x ∗g, x (r)
next , ynext )}
D ∗ → D ∗ ∪ {(x ∗g, x (r)
next , ynext )}
return x (r),∗ = arg minx (r ) ∈D ∗ y(x (r) ).
6
1. Acquisition functions
One ingredient that must be specified for BO is the acquisition function a, which is used to steer the selection of
points at which one conducts experiments to evaluate η. In this work, for systems in which we can evaluate η at any
location in Xr for the current task x ∗g , we consider the expected improvement
E I(x) = h(y(x) − ymin )Θ(ymin − y(x))i p(y |x, D),
(19)
where Θ(·) is the Heaviside theta function and ymin is the value of the current best design. When p(y|x, D) has a known
form such as a Gaussian, evaluation of Eq. (19) can be done cheaply. Notice as a matter of convention that we have
defined a in Eq. (19) such that lower values are better.
Finally, the traditional approach to maximizing the acquisition function has been to simple select a large random
sampling of Xr , evaluate a on all of the samples (which is generally cheap), then select the best sample. However,
following the widespread adoption of computational frameworks supporting automatic differentiation such as TensorFlow
[19] and PyTorch [20], it has become customary to accelerate the inner loop optimization over a using gradient-based
methods by simply backpropagating the acquisition function back through the model to obtain its gradient with respect
to the inputs x (r) [21]. Other recent work has investigated this matter more broadly [22]. Here, we use gradient-based
search with restarts to find the next point to select similarly to [21]. For the case where one must estimate a through
sampling (e.g. due to estimating the predictive posterior by sampling latents from q(Z), stochastic backpropagation
[23] can be used to deal with the variance in the estimates of a due to sampling. We also note that this setup may be
applied immediately to solve robust optimization problems with no modification by additionally sampling from any
uncontrolled inputs.
In our real-world examples, we do not have access to the real-world data-generating process and must instead work
with a static, pre-computed dataset. In this case, our design space is practically the finite set at which experiments have
already been carried out. In this case, we can instead directly estimate the negative† probability that a given available
point will be the best design:
Ö
(r) (g),∗
(r) (g),∗
a(x (r)
)
=
P
y(x
,
x
)
<
y(x
,
x
)
.
(20)
i
i
j
j,i
We approximate Eq. (20) by repeatedly sampling the joint predictive distribution over all available design points and
counting the frequency with which each design point is predicted to have the best output. We find that this crude
approximation is not overly computationally burdensome and results in good performance in practice as shown by our
examples.
III. Related work
McMillan et al. [24] study Gaussian process models based on “qualitative” and “quantitative” variables. This
approach would be analogous to a deterministic version of our Bayesian embedding layer. As we show in our examples,
the incorporation of Bayesian uncertainty estimates provides highly valuable information that is essential to making the
predictive distribution of the model credible.
The Gaussian process latent variable model [25] is a seminal work for inferring latent variables by using the marginal
log-likelihood of a Gaussian process as training signal. The later Bayesian extension by Titsias and Lawrence [6] was
found to significantly improve the quality of the model and gives compelling evidence in favor of modeling latent
variables in a Bayesian manner.
The task of inferring input-output relationships where the inputs are only partially-specified was first identified by
Damianou and Lawrence as “semi-described learning” [26]. Our problem statement may be regarded as similar in that
we choose to associate the general features in our data with latent variables that are unspecified a priori.
Using related systems to improve knowledge about a related system of interest has a long history in multi-fidelity
modeling [27, 28], where one traditionally considers a sequence of systems that constitute successively more accurate
(and expensive) representations of a true physical process. However, it is unclear how to define a “sequence” of legacy
datasets that are all “on equal footing” (such as different operators or machines performing the same task), particularly
as the number of such instances becomes large.
Our model is similar to the multi-task Gaussian process introduced in [29]. Our approaches are similar in that we
both learn a kernel over tasks; however, [29] require that the kernel matrix decompose as a Kronecker product between a
task-wise covariance matrix and the input kernel matrix. The Kronecker product kernel matrix can be equivalent to a
† We
take the negative probability to keep with the convention that lower values of a are better.
7
kernel matrix evaluated on our augmented inputs in the special cases of if one uses either a linear or a Gaussian kernel,
but more general kernels cannot be composed in this way. By finding representations of tasks in a Euclidean input
space, we lift this restriction, allowing us to use arbitrary kernels instead, wherein the Kronecker product decomposition
of [29] is a special case. Additionally, by posing a prior over the latent input space, our method also allows us to do
principled zero-shot predictions on unseen inputs; it is significantly more challenging to formulate a reasonable prior on
the kernel matrix itself. Finally, we can inject additional modeling bias into our model by selecting the dimensionality
dz of our latent space. While a low-rank approximation to the task covariance matrix might be derived to obtain similar
savings, it again seems more natural in our opinion to exercise this ability in a latent space.
The task of utilizing “legacy” data in order to improve the predictive capability on a related system of interest was
explored in [30]. However, the approach used in that work can only utilize legacy models through linear combinations,
making extrapolation on the system of interest difficult, particularly when few (or no) data are available from the task of
interest. Additionally, their method requires a cross-validation scheme for model selection; by contrast, the probabilistic
representation of tasks in the current method enables the use of Bayesian inference to guide model selection in the sense
of identifying relevance between tasks.
A recent work by Zhang et al. [31] also studies Bayesian optimization in the context of qualitative and quantitative
inputs. However, they learn point embeddings via maximum likelihood estimation rather than inferring a posterior
through Bayesian inference; this makes the overall model prone to overfitting in practice and generally unsuitable
for providing credible uncertainty estimates. This also makes their model difficult to apply in few-shot settings, and
impossible to apply in a principled way for zero-shot learning. A related work by Iyer et al. [32] identifies a pathology
associated with the point estimate inference strategy and proposes an ad hoc workaround; a natural outcome of our
Bayesian embedding approach is that this challenge vanishes.
The BEGP model can be thought of as a latent variable model [6, 25, 33] and has similarities with previous works
with latent variable GPs [14, 15, 34], though those works additionally impose structure over latent variables as well as
(potentially) the input training data. However, neither works were applied within the context of Bayesian optimization,
nor did they identify that multiple general feature sets could be decomposed as we have chosen to do.
IV. Examples
In this section, we demonstrate our method on a variety of synthetic and real-world examples. The embedding
GP model is implemented using gptorch‡ , a Gaussian process framework built on PyTorch [20]. Source code for
implementing the model and synthetic experiments described below can be found on GitHub§ . For each system, we
consider both a regression and optimization problem using our embedding GP model with both probabilistic and
deterministic embedding layers; the latter is achieved by replacing the posterior of Eq. (13) with a delta function at its
mode. We use an RBF kernel as in Eq. (9) and constant mean function whose value is determined via an MLE point
estimate.
We compare our results against a vanilla Gaussian process metamodel that is unable to directly leverage the legacy
data as a baseline. Since GP models typically fare very poorly the the extreme low-data regime that we are interested in,
we conduct full Bayesian inference over the mean and kernel function parameters, using as prior a factorized Gaussian
over the parameters¶ whose mean and variance are estimated by fitting GPs to each legacy task and using the statistics
of the point estimates of the model parameters found therein. Inference for the Bayesian GP is carried out using
Hamiltonian Monte Carlo [35] with the NUTS sampler [36] as implemented in Pyro [37].
A. Systems
We begin by describing the systems that we consider. Each system contains a set of tasks that we will jointly learn.
1. Toy system
We first consider a system with one-dimensional input and Ωx = [0, 1]. The set of response surfaces are given by
η(x, θ) = 0.1 ∗ z 4 − z2 + (2.0 + θ 2 ) sin(2z),
‡ https://github.com/cics-nd/gptorch
§ https://github.com/sdatkinson/BEBO
¶ Or
the logarithm of the parameters that are constrained to be positive.
8
(21)
where we define z = θ 1 + 4x − 4. Different response surfaces are selected by sampling θ ∼ U[0, 1]2 . We consider a
problem setting where 5 legacy tasks are available, each with 5 data with inputs sampled uniformly from Ωx . In the
regression setting, we randomly sample a number of data from one response surface as the current task and split it
into a training and test set. We quantify our prediction accuracy in terms of mean negative log probability (MNLP),
mean absolute error (MAE), and root mean square error (RMSE). In the optimization setting, we perform Bayesian
optimization over the current task and track the best design point as a function of the number of evaluations on the
current task. We repeat both experiments with 10 different random seeds to quantify the performance statistics of each
metamodeling approach.
2. Forrester function
We also consider a generalization of the Forrester function [28] to a many-task setting:
η(x; θ) = θ 1 ∗ η f orr ester (x) + θ 2 ∗ (x − 0.5) + θ 3,
(22)
η f orr ester (x) = (6x − 2)2 sin(12x − 4).
(23)
where
Under the parameterization of Eq. (22), the original “high-fidelity” function considered in [28] is obtained when
θ = (1, 0, 0), and the low-fidelity function corresponds to θ = (0.5, 10, −5). We always include the low-fidelity function
in the legacy tasks along with other tasks generated by sampling θ ∼ U[0, 1] × U[0, 10] × U[−5, 5]. Note that this
places the high-fidelity task at an extremum of the system parameter space, meaning that multi-task models will not
necessarily benefit by assuming a priori that the held-out task lives at the center of latent space (assuming that their
learned latent representation resembles that of the parameterization we have chosen). We consider a design space
Ωx = [0, 1]. Each of the 5 legacy tasks includes 5 data where the input was sampled uniformly in Ωx . We consider a
regression and an optimization with identical setup to the toy system in Sec. 1.
3. Pump data
Our first real-world dataset comprises of data in which the performance of a pump is quantified in terms of seven
design variables. Our data includes data from 6 pump families, each with of which has between 11 and 55 data. We
consider each pump family in turn as the “current” task while using the other families as legacy task data. We repeat
our experiment 6 times, each time using a different pump family as the current task and the other 5 datasets as legacy
task data. In the regression setting, we split the current task data into a train and test set, training our metamodel on
all of the legacy data as well as whatever data is available from the current task, then quantify our predictions on the
held-out test set of the current task in terms of MNLP, MAE, and RMSE. Each experiment is repeated 10 times with
different random initializations and train-test splits. In the optimization setting, we begin with no evaluations from the
current task and carry out Bayesian optimization using the available data as a (finite) design set. Data are selected at
each iteration according the acquisition function of Eq. (20). We track the best evaluation (relative to the per-task best
design) as a function of the number of evaluations on the current task.
4. Additive manufacturing data
We consider a dataset in which the performance of an additive manufacturing process is quantified in terms of four
design variables. We have data from 35 different chemistries, each of which has 24 data. We conduct experiments on
each of the 35 chemistries in an analogous manner to the experiments on the pump data.
B. Regression results
Figure 2 shows regression results for the synthetic systems described in Sections 1 and 2. We notice that the
embedding GPs typically outperform the Bayesian GP by a margin, though there is overlap in the statistics of their
performance over repeats. Curiously, we notice that the deterministic embedding GP occasionally outperforms the
Bayesian embedding in terms of RMSE and MAE, but fails catastrophically in terms of MNLP, where the overconfidence
of its predictions is severely penalized.
Figure 3 shows our results for the pump dataset. For this problem, we observe mixed results with the embedding
GPs outperforming the baseline on some of the tasks, but underperforming on others. We hypothesize that this is
because certain tasks are substantially different from the others, causing the prior information from legacy tasks to be
9
forrester, task 0
forrester, task 0
7
EGP
forrester, task 0
BEGP
6
100
BGP
EGP
BEGP
5
BGP
BGP
EGP
BEGP
80
4
5
3
3
MNLP
MAE
RMSE
60
4
40
2
2
20
1
1
0
0
0
0
1
2
3
4
ntrain
5
6
7
8
0
1
2
3
4
ntrain
5
6
7
8
0
synthetic, task 0
synthetic, task 0
EGP
3
4
ntrain
5
6
7
8
6
7
8
synthetic, task 0
1.4
BEGP
2
100
BGP
EGP
BEGP
1.6
BGP
2.0
1
BGP
EGP
BEGP
80
1.2
60
1.0
MNLP
1.0
MAE
RMSE
1.5
0.8
40
0.6
0.4
0.5
20
0.2
0
0.0
0.0
0
1
2
3
4
ntrain
5
6
7
8
0
1
2
3
4
ntrain
5
6
7
8
0
1
2
3
4
ntrain
5
Fig. 2 (Regression, synthetic systems) Performance on held-out tasks. From left to right, performance in
terms of root mean squared error (RMSE), mean absolute error (MAE), and median negative log probability
(MNLP). Top row: Forrester function. Bottom row: synthetic system of functions. Solid lines show the median
performance over 10 random seeds and the shaded region shows the 80% coverage interval.
unhelpful. However, on certain tasks (3 and 4), we see that the Bayesian embedding GP outperforms the baseline even
in a zero-shot setting. We also notice that the deterministic embedding GP consistently gives rather poor uncertainty
estimates as evidenced by its MNLP scores. Thus, we see a clear benefit to the Bayesian embedding approach for
improving the credibility of the predictions.
Figure 4 shows our results for the additive dataset for the first few held-out tasks. In contrast to the pump problem,
we often see considerable improvements by using the embedding models. For many of the held-out tasks, the zero-shot
predictions from the embedding models are, on average, superior to those of the Bayesian GP even after it has seen 10
examples from the current task. On others (e.g. RMSE, MAE for Task 2 in Fig. 4), the Bayesian GP does somewhat
outperform the embedding models, though the embedding models continue to improve. This may be because the
prior knowledge built up from the legacy tasks is unhelpful for the held-out task; the embedding model must learn to
“overturn” this prior knowledge by finding a way to separate the embedding of the held-out task from the unrelated
legacy tasks. However, this requires observations to gradually overturn the belief via Bayesian updating.
Additionally, we see again that the deterministic embedding consistently results in very poor performance in terms
of MNLP, frequently performing so much worse that it is impractical to show on the same axes as the Bayesian GP
and Bayesian EGP models. Figure 5 shows the results when zoomed out to view the performance of the deterministic
EGP for one task. We see a similar deficiency in all of the other held-out tasks, firmly underscoring that it is critical to
account for uncertainty in the embedding in order to obtain credible predictions. This makes sense, given the high
number of tasks we must embed relative to the number of data available to elucidate their input-output characteristics.
This is not uncommon in engineering design problems, where one typically has a limited computational budget to
explore any given design problem, but may potentially have access to a large archive of legacy tasks.
C. Optimization results
Figure 6 shows the best design discovered for the synthetic systems as a function of the number of evaluations on the
current task. We report the median performance over 10 splits as well as the 80% coverage interval (10-th to 90-th
percentile). We see that the legacy task data enables the embedding GPs to consistently find near-optimal designs with a
either a single evaluation (Forrester) or without any data from the current task (synthetic). By contrast, the GP usually
requires a handful of evaluations before it begins to find good solutions.
Figures 7 and 8 show the zero- and one-shot posteriors of all three methods on the toy and Forrester systems,
respectively. We see that the legacy task data equip the embedding GPs with helpful priors that aid them in identifying
10
pump, task 0
pump, task 0
BGP
EGP
BEGP
1.4
1.2
1.2
3
0.8
0
0.4
0.4
0
1
2
3
4
ntrain
5
6
7
0.2
8
1
0
1
2
pump, task 1
ntrain
5
6
7
0.6
0.5
0.5
0.4
0.4
0.3
3
4
ntrain
5
6
7
8
0.7
MAE
0.5
0.4
2
3
4
ntrain
5
6
7
1
2
3
4
ntrain
5
6
7
2
8
pump, task 2
1.6
3
0.45
1
0.35
0
1
2
3
4
ntrain
5
6
7
2
8
pump, task 3
1.4
MNLP
MAE
RMSE
1.0
4
5
6
7
8
0.65
3
4
ntrain
5
6
7
pump, task 4
0.45
0
1
2
3
4
ntrain
5
6
7
8
0.40
0.25
1
2
3
4
ntrain
5
6
7
8
pump, task 3
BGP
EGP
BEGP
0
1
2
3
4
ntrain
5
6
7
8
pump, task 4
BGP
EGP
BEGP
5
3
0.45
2
1
0
0.30
0.35
0
4
0.35
0.40
BGP
EGP
BEGP
6
BGP
EGP
BEGP
MNLP
MAE
0.50
pump, task 2
2
2
8
0.50
0.55
RMSE
2
0.55
0.60
0.30
1
0.60
BGP
EGP
BEGP
8
1
0
pump, task 4
0.70
7
0
0.7
ntrain
6
1
0.8
3
5
3
0.9
2
4
ntrain
4
1.1
1
3
5
1.2
0
2
6
BGP
EGP
BEGP
1.3
1.0
1
1
0
1.2
0
2
0.40
1.4
BGP
EGP
BEGP
5
4
BGP
EGP
BEGP
8
pump, task 1
6
BGP
EGP
BEGP
0.50
pump, task 3
1.8
7
2
0.55
8
6
1
0.25
1
5
0
0.30
0
4
ntrain
1
0.60
0.6
3
3
0.65
BGP
EGP
BEGP
2
4
0
pump, task 2
1
5
MNLP
2
0
6
BGP
EGP
BEGP
MNLP
0.6
1
2
8
MAE
RMSE
0.7
0.7
0
RMSE
4
0.8
0.8
0.3
3
pump, task 1
BGP
EGP
BEGP
0.9
2
1
0.6
0.6
0.2
4
MNLP
MAE
RMSE
0.8
BGP
EGP
BEGP
5
1.0
1.0
pump, task 0
6
BGP
EGP
BEGP
1.4
1
0
1
2
3
4
ntrain
5
6
7
8
2
0
1
2
3
4
ntrain
5
6
7
8
Fig. 3 (Regression, pump dataset) Regression performance on held-out tasks. From left to right, performance
in terms of root mean squared error (RMSE), mean absolute error (MAE), and median negative log probability
(MNLP). Each row shows the performance for a different held-out task. Solid lines show the median performance
over 10 random seeds and the shaded region shows the 80% coverage interval.
the response surface. By contrast, the Bayesian GP guesses randomly in the zero-shot setting and has minimal insight for
its one-shot predictions. We also notice that while the deterministic embedding model seems to fortunately pick good
designs, its model of the response surface is highly overconfident. By contrast, we see that the Bayesian embedding
11
additive, task 0
additive, task 0
BGP
additive, task 0
BEGP
0.8
0.9
3
BGP
EGP
BEGP
0.9
EGP
1.0
BGP
EGP
BEGP
2
0.7
1
0.7
0.6
MNLP
MAE
RMSE
0.8
0.5
0.6
0
0.4
1
0.5
0.3
0.2
0.4
0
1
2
3
4
ntrain
5
6
7
8
additive, task 1
1.4
0
1
2
EGP
4
ntrain
5
6
7
0
1
2
3
4
ntrain
5
6
7
8
additive, task 1
3
BGP
EGP
BEGP
0.9
BEGP
2
8
additive, task 1
1.0
BGP
3
BGP
EGP
BEGP
2
1.2
0.8
0.7
0.8
1
MNLP
MAE
RMSE
1.0
0.6
0.5
0
0.4
0.6
1
0.3
0.4
0.2
0
1
2
3
4
ntrain
5
6
7
8
0
1
2
3
4
ntrain
5
6
7
2
8
0
additive, task 2
additive, task 2
0.8
0.8
3
4
ntrain
5
6
7
8
additive, task 2
EGP
BEGP
2
3
BGP
EGP
BEGP
BGP
0.9
1
BGP
EGP
BEGP
2
0.6
1
MNLP
0.6
MAE
RMSE
0.7
0.5
0
0.4
0.4
1
0.3
0.2
0.2
0
1
2
3
4
ntrain
5
6
7
0
8
1
2
3
4
ntrain
5
6
7
2
8
0
additive, task 3
additive, task 3
1.2
3
4
ntrain
5
6
7
8
6
7
8
additive, task 3
0.7
BEGP
2
3
BGP
EGP
BEGP
BGP
EGP
1
BGP
EGP
BEGP
2
1.0
0.6
MNLP
MAE
RMSE
1
0.8
0.5
0
0.4
0.6
1
0.3
0.4
0
1
2
3
4
ntrain
5
6
7
0
8
1
2
3
4
ntrain
5
6
7
2
8
0
additive, task 4
additive, task 4
EGP
1.0
0.7
BEGP
2
3
4
ntrain
5
additive, task 4
3
BGP
EGP
BEGP
BGP
1
BGP
EGP
BEGP
2
0.9
0.6
0.8
0.6
0.5
MNLP
MAE
RMSE
1
0.7
0.4
0
0.5
0.3
0.4
1
0.2
0.3
0
1
2
3
4
ntrain
5
6
7
8
0
1
2
3
4
ntrain
5
6
7
8
2
0
1
2
3
4
ntrain
5
6
7
8
Fig. 4 (Regression, additive dataset) Regression performance on held-out tasks. From left to right, performance
in terms of root mean squared error (RMSE), mean absolute error (MAE), and median negative log probability
(MNLP). Each row shows the performance for a different held-out task. Solid lines show the median performance
over 10 random seeds and the shaded region shows the 80% coverage interval.
endows our models with well-calibrated uncertainty estimates that simultaneously allows is to find good designs while
retaining a credible metamodel. Thus, while the EGP performs comparably to the BEGP, we must be cautious about
generalizing these results. We also point out that while it seems like the BGP has picked a very good first point in Fig. 7,
12
additive, task 0
600
BGP
EGP
BEGP
500
MNLP
400
300
200
100
0
0
1
2
3
4
ntrain
5
6
7
8
Fig. 5 (Regression, additive dataset) Zoomed-out MNLP performance for held-out task 0. The deterministic
embedding GP performs worse than the other methods by orders of magnitude.
5
17.5
BGP
EGP
BEGP
4
BGP
EGP
BEGP
15.0
12.5
Best
Best
3
2
10.0
7.5
5.0
1
2.5
0
0.0
2
4
6
Experiments
8
10
2
4
6
Experiments
8
10
Fig. 6 Bayesian optimization for the synthetic and Forrester systems. The solid curves show the median
performance over 10 repeats, and the shaded region shows the 80% coverage interval over all repeats.
this selection is by pure luck since BGP’s predictive distribution with zero training data (i.e. the prior) is flat in its inputs.
Indeed, we see that it picks the same first point for the Forrester function in Fig. 8, which is not nearly so close to optimal.
Figure 9 shows the running best design for the pump and additive systems as a function of the number of current task
evaluations. Since each legacy system has a different optimum, results show the performance relative to the best design
for the current task. Again, we see that the embedding GPs vastly outperform the vanilla GP approach, usually finding
good designs on their first attempts. However, here we see that the Bayesian embedding gives better results over a
deterministic embedding; this can be directly attributed to our earlier observation that the deterministic embedding tends
to overfit, particularly when there are very few data to inform the embedding. By contrast, the Bayesian embedding
safely guards against overfitting and provides robust performance.
V. Conclusion
We have introduced a method for Bayesian optimization applicable to settings where one has access to a variety of
related datasets but does not know how they relate to each other a priori. The Bayesian embedding GP automatically
learns latent representations that enable a single metamodel to share information across multiple tasks, thereby improving
the predictive performance in all cases, particularly in novel settings where limited data is available. We observe that our
method enables Bayesian optimization to more quickly and reliably find good solutions compared to traditional methods.
We find that variational inference with rudimentary variational approximations (factorized Gaussians) over the latent
13
2
Prediction
4
4
Prediction
Ground truth
Ground truth
1
3
2
2
0
0
0
1
f(x)
f(x)
f(x)
1
2
2
1
3
2
4
4
3
Prediction
Ground truth
6
4
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
x
0.6
0.8
1.0
0.0
0.2
0.4
x
4
Prediction
4
0.6
Ground truth
1
2
1.0
Prediction
2
Ground truth
0.8
x
2
f(x)
f(x)
f(x)
0
0
1
2
2
0
2
3
4
4
Prediction
4
0.0
0.2
0.4
0.6
x
0.8
1.0
Ground truth
0.0
0.2
0.4
0.6
0.8
x
1.0
0.0
0.2
0.4
0.6
0.8
1.0
x
Fig. 7 Predictive posteriors for the metamodels on the toy system. Top row: zero-shot predictions. Bottom
row: one-shot predictions. From left to right: BGP, EGP, BEGP metamodels.
variables provides sufficient uncertainty information that credible estimates may be obtained in practice and is robust
with respect to the number of latent dimensions. Our experiments show that this Bayesian embedding is critical for
obtaining credible uncertainty quantification.
There are a number of ways in which our work might be extended. For one, our method might be applied to
optimization settings with unknown constraints. In such cases, one might simultaneously model the probability of
violating a constraint with a multi-task model. Modeling binary outputs such as constraint violation with GPs requires
non-conjugate inference, which can be enabled using techniques from [38]. Second, our approach is straightforward to
apply to problems of robust design. In fact, we have already shown how uncertainty in our metamodel can be propagated
efficiently during design selection through stochastic backpropagation, eliminating the need for expensive double-loop
optimization. Other approaches [39, 40] exploit analytic properties to solve the inner-loop optimization, though our
work suggests that it might be approximately solved in general using stochastic optimization. Finally, our model learns a
single embedding for each general input (task) as a latent vector. We would like to explore the effect of a hierarchical
latent representation in which the embedding of each task is allowed to vary with x (r) . Such a conditional multi-task
model might enable one to exploit partial similarities between tasks in certain regions of input space while still providing
for the possibility that their trends might differ elsewhere. While this was studied in [30], we expect that our Bayesian
latent representation might improve performance in few-shot settings such as those considered in this work.
Funding Sources
Funding for this work was provided internally by GE Research.
14
Prediction
15
Prediction
15
Ground truth
Prediction
15
Ground truth
10
Ground truth
10
10
5
f(x)
f(x)
f(x)
5
5
0
0
0
5
5
10
5
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
x
0.8
1.0
0.0
0.2
0.4
x
Prediction
15
0.6
Prediction
Ground truth
0.6
0.8
1.0
0.6
0.8
1.0
x
Prediction
15
Ground truth
Ground truth
15
10
10
5
f(x)
f(x)
f(x)
10
5
5
0
5
0
0
10
5
5
15
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
x
0.4
0.6
0.8
1.0
0.0
0.2
x
0.4
x
Fig. 8 Predictive posteriors for the metamodels on the Forrester system. Top row: zero-shot predictions.
Bottom row: one-shot predictions. From left to right: GP, EGP, BEGP metamodels.
BGP
EGP
BEGP
3.0
2.5
0.5
0.4
Best
2.0
Best
BGP
EGP
BEGP
0.6
1.5
0.3
1.0
0.2
0.5
0.1
0.0
0.0
2
4
6
Experiments
8
10
0
5
10
15
Experiments
20
25
Fig. 9 Bayesian optimization for the pump and additive systems. Results are shown relative to the per-task
best design. The solid curves show the median performance over all repeats, and the shaded region shows the
80% coverage interval.
15
References
[1] Argyriou, A., Evgeniou, T., and Pontil, M., “Multi-task feature learning,” Advances in Neural Information Processing Systems,
2007, pp. 41–48.
[2] Damianou, A., and Lawrence, N., “Deep Gaussian processes,” Artificial Intelligence and Statistics, 2013, pp. 207–215.
[3] Rasmussen, C. E., and Williams, C. K., Gaussian processes for machine learning, Vol. 2, MIT Press Cambridge, MA, 2006.
[4] Bilionis, I., Zabaras, N., Konomi, B. A., and Lin, G., “Multi-output separable Gaussian process: Towards an efficient, fully
Bayesian paradigm for uncertainty quantification,” Journal of Computational Physics, Vol. 241, 2013, pp. 212–239.
[5] Bilionis, I., and Zabaras, N., “Multi-output local Gaussian process regression: Applications to uncertainty quantification,”
Journal of Computational Physics, Vol. 231, No. 17, 2012, pp. 5718–5746.
[6] Titsias, M., and Lawrence, N. D., “Bayesian Gaussian process latent variable model,” Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, 2010, pp. 844–851.
[7] Salimbeni, H., and Deisenroth, M., “Doubly stochastic variational inference for deep Gaussian processes,” Advances in Neural
Information Processing Systems, 2017, pp. 4588–4599.
[8] Snelson, E., and Ghahramani, Z., “Sparse Gaussian processes using pseudo-inputs,” Advances in Neural Information Processing
Systems, 2006, pp. 1257–1264.
[9] Titsias, M., “Variational learning of inducing variables in sparse Gaussian processes,” Artificial Intelligence and Statistics,
2009, pp. 567–574.
[10] Hensman, J., Fusi, N., and Lawrence, N. D., “Gaussian processes for big data,” arXiv preprint arXiv:1309.6835, 2013.
[11] Wilson, A., and Nickisch, H., “Kernel interpolation for scalable structured Gaussian processes (KISS-GP),” International
Conference on Machine Learning, 2015, pp. 1775–1784.
[12] Wang, K. A., Pleiss, G., Gardner, J. R., Tyree, S., Weinberger, K. Q., and Wilson, A. G., “Exact Gaussian Processes on a Million
Data Points,” arXiv preprint arXiv:1903.08114, 2019.
[13] Ranganath, R., Gerrish, S., and Blei, D., “Black box variational inference,” Artificial Intelligence and Statistics, 2014, pp.
814–822.
[14] Atkinson, S., and Zabaras, N., “Structured Bayesian Gaussian process latent variable model,” arXiv preprint arXiv:1805.08665,
2018.
[15] Atkinson, S., and Zabaras, N., “Structured Bayesian Gaussian process latent variable model: Applications to data-driven
dimensionality reduction and high-dimensional inversion,” Journal of Computational Physics, Vol. 383, 2019, pp. 166–195.
[16] Bishop, C. M., Pattern Recognition and Machine Learning, Springer Science+ Business Media, 2006.
[17] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M., “Improved variational inference with
inverse autoregressive flow,” Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[18] Girard, A., Rasmussen, C. E., Candela, J. Q., and Murray-Smith, R., “Gaussian process priors with uncertain inputs application
to multiple-step ahead time series forecasting,” Advances in Neural Information Processing Systems, 2003, pp. 545–552.
[19] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al., “Tensorflow:
A system for large-scale machine learning,” 12th {USENIX} Symposium on Operating Systems Design and Implementation
({OSDI} 16), 2016, pp. 265–283.
[20] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A.,
“Automatic differentiation in pytorch,” 2017.
[21] Snoek, J., Larochelle, H., and Adams, R. P., “Practical Bayesian optimization of machine learning algorithms,” Advances in
Neural Information Processing Systems, 2012, pp. 2951–2959.
[22] Zhang, Y., Kristensen, J., Ghosh, S., Vandeputte, T., Tallman, J., and Wang, L., “Finding Maximum Expected Improvement for
High-Dimensional Design Optimization,” AIAA Aviation 2019 Forum, 2019, p. 2985.
[23] Rezende, D. J., Mohamed, S., and Wierstra, D., “Stochastic backpropagation and approximate inference in deep generative
models,” arXiv preprint arXiv:1401.4082, 2014.
16
[24] McMillan, N. J., Sacks, J., Welch, W. J., and Gao, F., “Analysis of protein activity data by Gaussian stochastic process models,”
Journal of Biopharmaceutical Statistics, Vol. 9, No. 1, 1999, pp. 145–160.
[25] Lawrence, N. D., “Gaussian process latent variable models for visualisation of high dimensional data,” Advances in Neural
Information Processing Systems, 2004, pp. 329–336.
[26] Damianou, A., and Lawrence, N. D., “Semi-described and semi-supervised learning with Gaussian processes,” arXiv preprint
arXiv:1509.01168, 2015.
[27] Kennedy, M. C., and O’Hagan, A., “Predicting the output from a complex computer code when fast approximations are
available,” Biometrika, Vol. 87, No. 1, 2000, pp. 1–13.
[28] Forrester, A. I., Sóbester, A., and Keane, A. J., “Multi-fidelity optimization via surrogate modelling,” Proceedings of the royal
society a: mathematical, physical and engineering sciences, Vol. 463, No. 2088, 2007, pp. 3251–3269.
[29] Swersky, K., Snoek, J., and Adams, R. P., “Multi-task Bayesian optimization,” Advances in Neural Information Processing
Systems, 2013, pp. 2004–2012.
[30] Ghosh, S., Asher, I., Kristensen, J., Ling, Y., Ryan, K., and Wang, L., “Bayesian Multi-Source Modeling with Legacy Data,”
2018 AIAA Non-Deterministic Approaches Conference, 2018, p. 1663.
[31] Zhang, Y., Apley, D., and Chen, W., “Bayesian Optimization for Materials Design with Mixed Quantitative and Qualitative
Variables,” arXiv preprint arXiv:1910.01688, 2019.
[32] Iyer, A., Zhang, Y., Prasad, A., Tao, S., Wang, Y., Schadler, L., Brinson, L. C., and Chen, W., “Data-Centric Mixed-Variable
Bayesian Optimization For Materials Design,” arXiv preprint arXiv:1907.02577, 2019.
[33] Kingma, D. P., and Welling, M., “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[34] Dai, Z., Alvarez, M., and Lawrence, N., “Efficient modeling of latent information in supervised learning using gaussian
processes,” Advances in Neural Information Processing Systems, 2017, pp. 5131–5139.
[35] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D., “Hybrid monte carlo,” Physics letters B, Vol. 195, No. 2, 1987, pp.
216–222.
[36] Hoffman, M. D., and Gelman, A., “The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.”
Journal of Machine Learning Research, Vol. 15, No. 1, 2014, pp. 1593–1623.
[37] Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and
Goodman, N. D., “Pyro: Deep Universal Probabilistic Programming,” Journal of Machine Learning Research, 2018.
[38] Hensman, J., Matthews, A., and Ghahramani, Z., “Scalable variational Gaussian process classification,” 2015.
[39] Ryan, K. M., Kristensen, J., Ling, Y., Ghosh, S., Asher, I., and Wang, L., “A Gaussian Process Modeling Approach for Fast
Robust Design With Uncertain Inputs,” ASME Turbo Expo 2018: Turbomachinery Technical Conference and Exposition,
American Society of Mechanical Engineers, 2018, pp. V07AT32A012–V07AT32A012.
[40] Ling, Y., Ryan, K., Asher, I., Kristensen, J., Ghosh, S., and Wang, L., “Efficient robust design optimization using Gaussian
process and intelligent sampling,” 2018 Multidisciplinary Analysis and Optimization Conference, 2018, p. 4175.
17