Big Data Research 14 (2018) 12–26
Fast Gaussian Process Regression for Big Data
Sourish Das a, Sasanka Roy b, Rajiv Sambasivan a,*
a Chennai Mathematical Institute, India
b Indian Statistical Institute, India
* Corresponding author. E-mail address: [email protected] (R. Sambasivan).
https://doi.org/10.1016/j.bdr.2018.06.002
Article info
Article history:
Received 31 July 2017
Received in revised form 20 April 2018
Accepted 20 June 2018
Available online 4 July 2018
Keywords:
Big data
Gaussian Process
Regression
Abstract
Gaussian Processes are widely used for regression tasks. A known limitation in the application of
Gaussian Processes to regression tasks is that the computation of the solution requires performing a
matrix inversion. The solution also requires the storage of a large matrix in memory. These factors
restrict the application of Gaussian Process regression to small and moderate size datasets. We present
an algorithm that combines estimates from models developed using subsets of the data obtained in a
manner similar to the bootstrap. The sample size is a critical parameter for this algorithm. Guidelines
for reasonable choices of algorithm parameters, based on a detailed experimental study, are provided.
Various techniques have been proposed to scale Gaussian Processes to large-scale regression tasks. The
most appropriate choice depends on the problem context. The proposed method is most appropriate for
problems where an additive model works well and the response depends on a small number of features.
The minimax rate of convergence for such problems is attractive and we can build effective models
with a small subset of the data. The Stochastic Variational Gaussian Process and the Sparse Gaussian
Process are also appropriate choices for such problems. Results from experiments conducted as part of
this study indicate that the algorithm presented in this work can be as effective as these methods. Unlike
these methods, the proposed algorithm requires minimal hyper-parameter tuning and is much simpler to
implement. The rate of convergence is also attractive.
2018 Elsevier Inc. All rights reserved.
1. Introduction
Gaussian Processes (GP) are attractive tools to perform supervised learning tasks on complex datasets on which traditional
parametric methods may not be effective. Traditional parametric
methods make a priori rigid assumptions about the nature and
form of the model. For example, a linear regression model is a
parametric model choice. This choice may not adapt to the data as
it changes, for example, updates to the data might suggest that a
non-linear model is more appropriate. Finding a suitable parametric form for a complex dataset is very challenging without prior
experience with the data. Gaussian Processes offer flexibility in this
regard. We only need to commit to a broad family of functions
that are suitable for the problem by specifying an appropriate covariance function. We do not need to commit to a rigid functional
form between the response and predictor variables. This can simplify the effort required to pick a good model when confronted
with an unfamiliar dataset that requires a complex model. They
are also easier to use in comparison to alternatives like neural
networks [1]. Gaussian Processes offer some practical advantages
over Support Vector Machines (SVM) [2]. They offer uncertainty
estimates with predictions. The kernel and regularization parameters can be learned directly from the data instead of using cross-validation. Feature selection can be incorporated into the learning
algorithm. For regression, exact inference is possible with Gaussian
Processes. To apply Gaussian Processes to classification, we need to
resort to approximate inference techniques such as Markov Chain
Monte Carlo, Laplace Approximation or Variational Inference. Even
though exact inference is possible for Gaussian Process regression,
the computation of the solution requires matrix inversion. For a
dataset with n rows or data instances, the time complexity of
matrix inversion is O(n³). The space complexity associated with storing the n × n covariance matrix is O(n²). This restricts the applicability
of the technique to small or moderate-sized datasets.
In this paper we present an algorithm that uses subset selection
and ideas borrowed from bootstrap aggregation to mitigate the
problem discussed above. Parallel implementation of this algorithm
is also possible and can further improve performance.
The rest of this paper is organized as follows: In section 2, we
motivate the problem. In section 3, we present our solution to
the problem. Our solution is based on combining estimators developed on subsets of the data. The selection of the subsets is
based on simple random sampling with replacement (similar to
what is done in the bootstrap). The size of the subset selected is
a key aspect of this algorithm. This is determined empirically. We
present two methods to determine the subset size. We present the
motivating ideas leading to the final form of the algorithm. When
the model is an additive structure of univariate components, this
has attractive implications on the convergence rate [3]. An additive
model worked well for the datasets used in this study. Relevant
facts from Minimax theory for non-parametric regression, that are
consistent with the experimental results reported in this work, are
presented.
In section 4, we present a brief summary of related work. Applying Gaussian Processes to large datasets has attracted a lot of
interest from the machine learning research community. Connecting ideas to research related to the algorithm reported in this work
are presented. Selecting parameters for an algorithm is an arduous task. However this algorithm has only two important parameters, the subset size and the number of estimators. We provide
guidelines to pick these parameters based on detailed experiments
across a range of datasets.
In section 5 we provide experimental results that provide insights into the effect of the parameters associated with the algorithm. In section 6, we illustrate the application of our algorithm
to synthetic and real world data sets. We applied the algorithm developed in this work to data sets with over a million instances. We
compare the performance of the proposed method to the Sparse
Gaussian Process [4] and the Stochastic Variational Gaussian Process [5]. The inputs required by these algorithms are similar to
the inputs required for the proposed method and therefore are applicable in a similar context. We compare the estimates obtained
from the reported algorithm with two other popular methods to
perform regression on large datasets – Gradient Boosted Trees (using XGBoost, [6]) and the Generalized Additive Model (GAM) [7].
Results from experiments performed as part of this study show
that accuracies from the proposed method are comparable to those
obtained from Gradient Boosted Trees or GAM’s. However there
are some distinct advantages to using a Gaussian Process model.
A Gaussian Process model yields uncertainty estimates directly
whereas methods like Gradient Boosted Trees do not provide this
(at least not directly); see [8] and [9]. A Gaussian Process model is also
directly interpretable [2, pg. 5, 6] in comparison to methods like
Gradient Boosted Trees or Neural Networks. Therefore, the proposed method can yield both explanatory and predictive models.
It is possible to use stacking [10] to combine the estimates from
the proposed model with those obtained from a competing model
(like Gradient Boosted Trees) and obtain higher accuracies. Combining a Gaussian Process solution with XGBoost has been used
by [11].
In section 7, we present the conclusions from this work. The
contribution of this work is as follows. The proposed method to
perform Gaussian Process regression on large datasets has a very
simple implementation in comparison to other alternatives, with
similar levels of accuracy. The algorithm has two key parameters – the subset size and the number of estimators. Detailed
guidelines to pick these parameters are provided. The choice of
a method to scale Gaussian Process regression to large datasets
depends on the characteristics of the problem. This is discussed
in section 4. The proposed method is most effective for problems
where the response depends on a small number of features and
the kernel characteristics are unknown. In such cases, exploratory
data analysis can be used to arrive at appropriate kernel choices
[12]. Additive models may work well for these problems. Appropriate preprocessing like principal component analysis can be used
if needed to create additive models. The rate of convergence for
additive models is attractive [13]. This implies that we can build
very effective models with a small proportion of samples. Sparse
Gaussian Processes [14] and Stochastic Variational Gaussian
Processes [5] are also appropriate for such problems. These require
a more complex implementation and may require extensive effort
to tune the optimization component of the algorithm (see [5]).
Results of the experiments conducted as part of this study show
that the proposed method can match or exceed the performance
of these methods.
2. Problem formulation
A Gaussian Process y with additive noise can be represented
as:
y = f (x) + η.
(1)
Here:
• y represents the observed response.
• x represents an input vector of covariates.
• η represents the noise. The noise terms are assumed to be independent and identically distributed (IID) random variables drawn from a normal distribution with variance σ_n².
• f represents the function being modeled. It is a multivariate normal with mean function μ(x) and covariance function
K(x).
If we want to make a prediction for the value of the response
at a test point X ∗ , then the predictive equations are given by (see
[15]):
f_* | X, y, X_* ∼ N( f̄_*, cov(f_*) ),    (2)

f̄_* = E[ f_* | X, y, X_* ] = K(X_*, X) [K(X, X) + σ_n² I]^(−1) y,    (3)

V(f_*) = K(X_*, X_*) − K(X_*, X) [K(X, X) + σ_n² I]^(−1) K(X, X_*).    (4)
Here:
• f ∗ is the value of the function at the test point X ∗ .
• K( X ∗ , X ) represents the covariance between the test point and
the training set.
• K( X , X ) represents the covariance matrix for the training set.
• I is the identity matrix.
• V( f ∗ ) is the variance at the test point X ∗ .
Equation (3) is the key equation to make predictions. An inspection
of equation (3) shows that this equation requires the computation
of an inverse. For a training dataset of size n, computation of the
inverse has O(n³) time complexity. This is one of the bottlenecks in the application of Gaussian Processes. Calculation of the solution also requires the storage of the matrix (K(X, X) + σ_n² I). This is associated with a space complexity of O(n²). This is the other bottleneck associated with Gaussian Processes. The uncertainty
associated with our prediction is provided by Equation (4). The
covariance K( X , X ) is expressed in terms of a kernel that is appropriate for the modeling task. The kernel is associated with a set of
hyper-parameters. These may be known, for example if the modeling task has been well studied, or unknown if the problem has not
been well studied. In this work, we treat these kernel parameters
as unknown. When the kernel parameters are unknown, these are
estimated using maximum likelihood estimation. The marginal log
likelihood is given by (see [15]):
log p(y | X) = −(1/2) yᵀ (K + σ_n² I)^(−1) y − (1/2) log |K + σ_n² I| − (n/2) log 2π.    (5)
Using an appropriate optimization technique with Equation (5)
as the objective function, the hyper-parameters of the kernel (K)
can be determined.
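Equations (2) through (5) translate directly into a few lines of linear algebra. The following sketch is illustrative rather than the implementation used for the experiments in this paper: it assumes a zero-mean GP, a single RBF kernel with fixed hyper-parameters, and uses a Cholesky factorization of (K(X, X) + σ_n² I) in place of an explicit inverse for numerical stability.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Illustrative RBF (squared exponential) covariance between the rows of A and B.
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    # Predictive mean (Eq. (3)), predictive covariance (Eq. (4)) and the
    # marginal log likelihood (Eq. (5)) of a zero-mean GP.
    n = X.shape[0]
    K = rbf_kernel(X, X)                                 # K(X, X)
    K_s = rbf_kernel(X_star, X)                          # K(X_*, X)
    K_ss = rbf_kernel(X_star, X_star)                    # K(X_*, X_*)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))    # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma_n^2 I)^{-1} y
    mean = K_s @ alpha                                   # Eq. (3)
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v                                 # Eq. (4)
    log_lik = (-0.5 * y @ alpha
               - np.sum(np.log(np.diag(L)))
               - 0.5 * n * np.log(2.0 * np.pi))          # Eq. (5)
    return mean, cov, log_lik

Maximizing the returned log likelihood with respect to the kernel hyper-parameters (for example with a gradient-based optimizer) is the estimation step described above.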
3. Proposed solution

Since GP regression is both effective and versatile on complex datasets in comparison to parametric methods, a solution to the bottlenecks mentioned above will enable us to apply GP regression to large datasets that are complex. We run into such datasets routinely in this age of big data. Our solution to applying Gaussian Process regression to large datasets is based on developing estimators on smaller subsets of the data. The size of the subset and the desired accuracy are key parameters for the proposed solution. The accuracy is specified in terms of an acceptable error threshold, ε. We pick K smaller subsets of size N_s from the original dataset (of size N) in a manner similar to bootstrap aggregation. Sampling is done with replacement to ensure that the models developed on the samples are independent. This has benefits in the analysis of the estimates from each of the models: we can treat them as independent. Sampling with replacement permits us to assume that each data instance has the same probability of making it into the subset used to develop the model. We develop a Gaussian Process estimator on each subset. The GP fit procedure includes hyper-parameter selection using the likelihood as the objective function. We want a subset size such that when we combine the estimators developed on the K subsets, we have an estimator that yields a prediction error that is acceptable. The rationale for this approach is based on results from Minimax theory for non-parametric regression that are presented later in this section. To combine estimators, we simply average the predictions from each of the K estimators.

The time complexity of fitting a Gaussian Process regression on the entire dataset of size N is O(N³). The algorithm presented above requires the fitting of K Gaussian Process regression models to smaller datasets of size N_s. Therefore, the time complexity is O(K · N_s³).

We present two methods to determine the subset size:

1. Estimating the subset size using statistical inference. For a dataset of size N, the subset size is expressed as:

   N_s = N^δ, where 0 < δ < 1.    (6)

   δ is a random variable and is determined based on inference of a proportion using a small sample. The details of this method are presented in the next subsection.

2. Estimating the subset size using an empirical estimator. We use the following observations to derive this empirical estimator:

   (a) The subset size N_s should be proportional to the size of the dataset, N:

       N_s ∝ N.

       We posited that as the size of the dataset increases, there is possibly more detail or variation to account for in the model. Therefore larger datasets would require larger sample sizes.

   (b) The sample size N_s should be a decreasing function of the desired error rate ε. This means that decreasing the desired error rate should increase the sample size:

       N_s ∝ 1 / g(ε),

       where g(ε) is an increasing function.

   Application of the above observations yields the following estimator for the sample size:

       N_s = N^(δ(N)) / g(ε).    (7)

   Here:

   • δ(N) is a function that characterizes the fact that an increase in N should increase the sample size N_s.
   • g(ε) is a function that characterizes the fact that the sample size should increase as ε (the desired error level) decreases.
The algorithm is summarized in Algorithm 1.
input : A dataset D of size N, δ, K
output: An estimator f_resampled that combines the estimators fitted from resampling

for i ← 1 to K do
    /* Select a sample from D. Two ways of selecting the sample size are
       presented. Sampling with replacement is used so that the samples used
       to develop each of the K models are independent. */
    N_s ← SampleWithReplacement(D, δ);
    /* A kernel is fit for each sample. Hyper-parameter selection is done for
       each sample. This computation can be parallelized. */
    f̂_i ← FitGP(N_s);
end
/* The estimate for a point x ∈ D_test (the test dataset) is the average of
   the estimates from the K estimators fitted above. */
f_resampled(x) ← (1/K) Σ_{i=1}^{K} f̂_i(x)

Algorithm 1: Gaussian Process Regression Using Resampling.
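A minimal sketch of Algorithm 1 is shown below. It uses scikit-learn's GaussianProcessRegressor as a stand-in for the GPy-based implementation used in the experiments; the kernel choice (RBF plus white noise) and the rule N_s = ⌊N^δ⌋ are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_resampled_gp(X, y, delta=0.5, K=30, seed=0):
    # Fit K GP estimators on subsets of size N_s = N**delta drawn with replacement.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_s = max(2, int(N ** delta))                  # subset size N_s = N^delta
    models = []
    for _ in range(K):
        idx = rng.integers(0, N, size=n_s)         # sampling with replacement
        kernel = RBF() + WhiteKernel()             # hyper-parameters re-optimized per sample
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X[idx], y[idx])
        models.append(gp)
    return models

def predict_resampled(models, X_test):
    # Average the K predictive means (model averaging; see Equation (10) below).
    preds = np.stack([m.predict(X_test) for m in models])
    return preds.mean(axis=0)

The K fits are independent of one another, which is the parallelization opportunity noted in the introduction.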
An important observation from experiments conducted for this
study with regard to the choice of δ is as follows. The choice of
δ affects the complexity of the functions used in modeling the response. To model complex responses, we need complex functions
from the K models used in Algorithm 1. This is obtained by using
a sufficiently large δ . The choice of δ is the primary requirement
to develop a good model. Experiments showed (section 5.4) that
increasing K is beneficial to a point – about 30 for most datasets
used in this study. Increasing K beyond this point yielded very little increase in performance.
The proposed algorithm is based on model averaging as described in [16, Chapter 14]. This is similar to combining estimates
from regression trees in the Random Forest [17] algorithm. Given
i = {1, 2, . . . , K } models, the conditional distribution of the response at a point X j is obtained from:
p(Y | X_j) = Σ_{i=1}^{K} p(Y | i, X_j) p(i).    (8)

If each of the models is equally probable, then p(i) = 1/K. If each of the p(Y | i, X_j) is a Gaussian N(μ_i, Σ_i), as would be the case when each of the estimators is a GP based on Equation (1), then Equation (8) is a linear combination of Gaussian random variables. We can then represent Equation (8) as:¹

p(Y | X_j) ∼ N(μ_C(X_j), σ_C²(X_j)),    (9)

where:

μ_C(X_j) = (1/K) Σ_{i=1}^{K} μ_i(X_j),    (10)

σ_C²(X_j) = (1/K²) Σ_{i=1}^{K} σ_i²(X_j).    (11)

Model combination is an alternative approach to combining estimates from a set of models. The product of experts model is an example of this approach. In the product of experts approach, each model is considered to be an independent expert. The conditional distribution of the response at a point X_j is obtained from:

p(Y | X_j) = (1/Z) Π_{i=1}^{K} p_i(Y | X_j),    (12)

where Z is the normalization constant. [18] report a study where the individual experts are Gaussian Processes. In this case, each of the p_i(Y | X_j) is N(μ_i(X_j), σ_i²(X_j)). The mean and variance associated with p(Y | X_j) are:

μ_POE(X_j) = σ_POE²(X_j) Σ_{i=1}^{K} μ_i(X_j) T_i(X_j),    (13)

T_POE(X_j) = Σ_{i=1}^{K} T_i(X_j),    (14)

σ_POE²(X_j) = T_POE(X_j)^(−1).    (15)

Here T_i(X_j) is the precision of expert i at point X_j. The results from using a product of experts model are also included in this study (see section 6).

Model averaging and model combination (Equation (9) and Equation (13)) are simply ways to combine estimates from the component estimators used in Algorithm 1. They do not specify any information about the character of the estimator developed using Algorithm 1. To do this, we need the following preliminaries:

Assumption 1. The function being modeled is in the Reproducing Kernel Hilbert Space of the kernel (σ) of the estimator. The reproducing property of the kernel can be expressed as:

f(t) = ⟨f(·), σ(·, t)⟩    ∀ t ∈ T.    (16)

Here T represents the index set used to select the predictor instances.²

Algorithm 1 makes use of K estimators f_1, f_2, ..., f_K. The reproducing kernels associated with the estimators are σ_1, σ_2, ..., σ_K. We show that the estimator resulting from combining these K estimators is a Gaussian Process.

Lemma 1. The estimator f_rm obtained from the individual estimators f_1, f_2, ..., f_K using:

f_rm(t) = (1/K) Σ_{i=1}^{K} f_i(t),    (17)

is associated with the following kernel:

σ_rm = (1/K) Σ_{i=1}^{K} σ_i.

Proof. From Assumption 1, the kernel associated with each estimator has the reproducing property. So the following equation holds:

⟨f(·), σ_1(·, t)⟩ = ⟨f(·), σ_2(·, t)⟩ = ... = ⟨f(·), σ_K(·, t)⟩ = f(t).

Consider the kernel σ_rm = (1/K) Σ_{i=1}^{K} σ_i and the inner product ⟨f(·), σ_rm⟩. The inner product can be written as:

⟨f(·), σ_rm⟩ = (1/K) [⟨f(·), σ_1(·, t)⟩ + ... + ⟨f(·), σ_K(·, t)⟩]    (K terms)
             = (1/K) [f(t) + ... + f(t)]    (K terms)
             = K f(t) / K
             = f(t).  □

¹ To see why the variance has the term 1/K² in Equation (11), recall that for independent random variables X_1, X_2, ..., X_K, the variance of a linear combination is V[Σ_i b_i X_i] = b_1² V[X_1] + ... + b_K² V[X_K] = Σ_i b_i² V[X_i]. In this case X_1, X_2, ..., X_K are Gaussian random variables and b_i = 1/K.
² In Equation (16) the · symbol represents a placeholder for the variable of interest.
Using Lemma 1, the estimator developed using Algorithm 1 is a
Gaussian Process. This Gaussian Process is associated with a kernel
defined by Equation (17). This kernel is defined as a mixture of the
kernels used by the individual estimators. The uncertainty estimate
at any point X can be obtained using Equation (4).
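Both combination rules can be evaluated directly from the component means and variances. The sketch below is illustrative; means and variances are assumed to be arrays of shape (K, n_test) holding μ_i(X_j) and σ_i²(X_j) from the K fitted estimators.

import numpy as np

def combine_model_averaging(means, variances):
    # Model averaging: Equations (10) and (11).
    K = means.shape[0]
    mu_c = means.mean(axis=0)                    # (1/K) sum_i mu_i(X_j)
    var_c = variances.sum(axis=0) / K**2         # (1/K^2) sum_i sigma_i^2(X_j)
    return mu_c, var_c

def combine_product_of_experts(means, variances):
    # Product of experts: Equations (13)-(15), with precisions T_i = 1 / sigma_i^2.
    T = 1.0 / variances
    T_poe = T.sum(axis=0)                        # Eq. (14)
    var_poe = 1.0 / T_poe                        # Eq. (15)
    mu_poe = var_poe * (means * T).sum(axis=0)   # Eq. (13)
    return mu_poe, var_poe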
Consistency of estimators is concerned with the asymptotic
properties of the developed estimator as we increase the sample
size. Consistency of Gaussian Processes has been widely studied
(see [1, Chapter 7, Section 7.1] for details). The convergence rate
for the algorithm is a characteristic of great interest. This characteristic tells us the rate at which we can drop the error as we
increase the sample size. Minimax theory provides a framework
to assess this (see [19] or [20]). A famous result by [13] states
the minimax rate of convergence for non-parametric regression is
n^(−α/(2α+d)), where α describes the smoothness of the function being
modeled and d represents the dimensionality of the feature space.
This suggests that the rate of convergence is affected by the curse
of dimensionality. However as noted by [21], the following factors
usually mitigate the curse of dimensionality:
1. The data lie in a manifold of low dimensionality even though
the dimensionality of the feature space is large. The work of
Yang et al. [22] reports a method for such problems.
2. The function being modeled depends only on a small set of
predictors. All datasets reported in this study had this characteristic. Feature selection can be performed in Gaussian Processes using Automatic Relevance Determination (see [1, Chapter 5, Section 5.1]).
3. The function being modeled admits an additive structure with
univariate components. The minimax rate of convergence for
this problem is very attractive (see [3]). For all datasets reported in this study such an additive model yielded good results.
Feature Selection is therefore a critical first step. This is discussed
in section 6. The additive structure and the dependence on a very
small set of predictors suggest that we can get reasonable models
with a small subset of the data. This was consistent with the experimental results reported in this work. Simple preprocessing like
Principal Component Analysis (PCA) could be used to make features independent. This preprocessing can be limited to the set of
highly correlated features. The data for this study came from public
repositories and from very diverse application domains. This suggests that datasets with these characteristics are not uncommon.
When data lie in a manifold, Bayesian Manifold Regression may be
more appropriate.
Zaytsev and Burnaev [23] investigate the minimax interpolation
error under certain conditions (known covariance, stationary Gaussian Process). As noted in this recent study, theoretical work in
estimating the convergence rate using minimax theory is an active
area of research. In this work, we investigated the empirical estimation of the sample size. We describe two methods to determine
the sample (subset) size.
3.1. Estimating subset size on the basis of inference of a proportion
Since δ takes values between 0 and 1, it can be interpreted as
a proportion. We treat it as a random variable that can be inferred
from a small sample. To estimate δ , we do the following: We pick
a small sample of the original dataset by simple random sampling.
We start with a small value of δ and check if the prediction error
with this value of δ is acceptable. If not we increment δ until we
arrive at a δ that yields an acceptable error. This procedure yields
the smallest δ value that produces an acceptable error on this sample. Since the size of this dataset is small, the above procedure can
be performed quickly. This technique yielded reliable estimates of
δ on both synthetic and real-world datasets.
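A sketch of this search is given below. It assumes a helper rmse_for_delta(pilot_X, pilot_y, delta) that fits Algorithm 1 on the pilot sample with the given δ and returns a held-out RMSE; the starting value, step size and error threshold are illustrative choices, not values prescribed by the method.

def estimate_delta(pilot_X, pilot_y, rmse_for_delta,
                   epsilon=1.0, delta_start=0.3, step=0.05, delta_max=1.0):
    # Return the smallest delta whose error on the pilot sample is acceptable.
    delta = delta_start
    while delta <= delta_max:
        if rmse_for_delta(pilot_X, pilot_y, delta) <= epsilon:
            return delta           # smallest delta meeting the error threshold
        delta += step              # otherwise increment delta and try again
    return delta_max               # fall back to using the full pilot sample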
3.2. Empirical estimation of the subset size
Equation (7) provides the functional form for the second empirical estimator of the subset size N s . Appropriate choices for δ( N )
and g (ǫ ) are based on observations from the experimental evaluation of the parameters of Algorithm 1. Choices that provided good
results are provided in section 5.2.
In this work, we considered random sub-sampling. Refining this
sample sequentially is an area of future work. One strategy for refinement could be to choose the first sample in a random manner
and then choose the subsequent samples based on the errors observed in a validation set. In each subsequent iteration, we can
adjust the sampling (by weights) such that we sample more in
the regions where the previous estimator made errors. Another
possible approach could be a hybrid approach. Rather than picking the first sample in a random manner, it is picked based on
some optimality criterion such as what is done in the Sparse Gaussian Process. This sample is then refined sequentially as discussed
above.
Time series datasets require special attention. Using temporal
features is one strategy to handle temporal correlation. This approach is common and was used in this work. Another strategy is
to adapt the sampling process to ensure that the autocorrelation
component is captured. Such a strategy would involve sampling in
blocks. Rather than pick a single element during sampling, we can
pick a block of contiguous data elements. The block size should
be such that it captures the auto-correlation present. For example, if exploratory data analysis reveals that each instance could
be correlated to five instances before it (lag = 5), then we pick
a block size that is larger than this window during sampling. Another strategy to consider for using Gaussian Processes with time
series data is to consider a model like the Kalman Filter [24]. In
such an approach, a time window is associated with the forecasting procedure. A Gaussian Process is used to model the response
within a window and use it to forecast the next window. The posterior mean of the previous window is used as the prior for the
Gaussian Process used to model the next window. See [25] for
an application of this technique in the financial domain to predict
bond yield curves. When the window sizes are big, we could use
the approach presented in this work to model the regression function in each window. We would need to ensure that the sample
size selected captures the autocorrelation in any of the windows
using a block sampling strategy as mentioned earlier.
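A block-sampling step of this kind could look like the following sketch; the block length is an assumed parameter and should exceed the autocorrelation window identified during exploratory analysis (a lag of 5, for example, would suggest a block length of 10 or more).

import numpy as np

def sample_blocks(X, y, n_s, block_len=10, seed=0):
    # Draw roughly n_s rows as contiguous blocks (with replacement) so that the
    # autocorrelation structure within each block is preserved.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_blocks = max(1, n_s // block_len)
    starts = rng.integers(0, N - block_len, size=n_blocks)   # block start indices
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return X[idx], y[idx]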
4. Related work
Rasmussen and Williams [15, Chapter 8] provide a detailed
discussion of the approaches used to apply Gaussian Process regression to large datasets. The study of Quiñonero-Candela et al.
[26] is another detailed review of the various approaches to applying Gaussian Processes to large datasets. The choice of a method
appropriate for a regression task is dependent on the problem context. Therefore we discuss the work related to scaling Gaussian
Process regression taking the problem context into consideration.
The problem context refers to relevant background information
from the manifestation of the problem. This includes, but is not
limited to, the following:
• Knowledge of the model parameters: Do we know the kernel types and range of hyper-parameter values from prior experience?
• Properties of the input space: Does the input come from a particular distribution? Does it manifest in a very regular manner over a grid?
• Properties of the required covariance: Is an isotropic covariance a reasonable choice for the problem?
• Frequency of model development: For Internet facing applications, if the application is deployed in batch mode rather than in an online manner, there is a need for model iteration when a new batch of data becomes available. Frequent model iteration may impose constraints on the complexity and availability of time to fine tune hyper-parameter and optimization parameters.
• Properties of the output: Do we need prediction at a large number of points or a small number of points?
Specific problem characteristics determine the appropriate choice
of a method used to scale Gaussian Process regression. The following discussion provides examples of these characteristics and the
appropriate methods to scale Gaussian Process regression in each
case.
If the data associated with the problem has been well studied
and kernel methods have been successfully applied to the problem,
then we may have reasonable insights into the nature of the kernel
appropriate for the regression task. We may be able to arrive at the
kernel hyper-parameters quickly from a small set of experiments.
On the other hand, if the problem and data are new, then we may
not have a lot of information about the kernel. In general, scaling
Gaussian Process regression to large datasets has two challenges:
1. Finding a kernel that has good generalization properties for
the dataset.
2. Overcoming the computational hurdles – O(N³) for training and O(N²) for storage.
Learning a kernel that has good generalization properties is
a related area of research in Gaussian Processes (see [27], [28]).
When a good kernel representation has been learned, there are
many techniques to overcome the computational hurdles. The Nystrom method to approximate the Gram matrix ([29]) and the
Random Kitchen Sinks ([30]) are probably the most well known.
The Random Kitchen Sinks approach maps the data into a low
dimensional feature space and learns a linear estimator in this
space. It should be noted that these methods work well when the
problem needs a stationary kernel for which we know the hyperparameters. Using “sensible defaults” for hyper-parameters and applying these techniques to problems that require a non-stationary
kernel may yield poor results. For example with the airline dataset,
described later in this study (see section 5.1), the Random Kitchen
Sinks cannot be used directly and would require suitable preprocessing (like removing simple trends or using a mean function) so
that a stationary kernel would be applicable. Using the Random
Kitchen Sinks directly with no preprocessing and using default
kernel choices provided with the scikit-learn [31] implementation
yielded poor results (RMSE of 31.49 as opposed to 8.75 with the
proposed method).
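As an aside, the Random Kitchen Sinks idea can be sketched in a few lines: for a stationary RBF kernel, random Fourier features map the inputs into a low-dimensional space in which a linear model approximates the kernel machine. The sketch below is a generic illustration (it is not the scikit-learn implementation used for the comparison quoted above), and the lengthscale and feature count are assumed values.

import numpy as np

def random_fourier_features(X, n_features=200, lengthscale=1.0, seed=0):
    # Approximate an RBF kernel via z(x) = sqrt(2/D) * cos(W x + b).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# A linear model (e.g. ridge regression) fit on random_fourier_features(X)
# then approximates kernel regression with a stationary RBF kernel.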
Using a kernel learning approach to determine a good kernel
representation and then solving the computational hurdles independently is one way to approach scaling Gaussian Process regression to large datasets. Another approach to determine the appropriate kernel is to use exploratory data analysis. Guidelines to pick
kernels based on exploratory analysis of the data is provided in
[12]. This is a practical approach when the number of relevant
features is not too many, as was the case with the datasets used
in this study. It should be noted that hyper-parameters for these
choices still need to be specified. We may be able to build additive
models using this approach. Appropriate preprocessing could help,
for example, principal component analysis can be applied to make
the features independent. Minimax theory for non-parametric regression indicates that the convergence rate for additive models is
very attractive. We can build effective models with a small proportion of the data.
The choice of kernel hyper-parameters is critical and can affect
the performance. When datasets are large and the kernel hyperparameters are unknown, we need algorithms that can address
both these issues. Ideally, the algorithm should be able to work
with both stationary and non-stationary kernels. The proposed algorithm is one such candidate. Sparse Gaussian Processes ([4])
and Stochastic Variational Gaussian Processes ([5]) are two others.
Like the proposed algorithm, these algorithms require the specification of an input size. A subset of points is selected from the
dataset for training. The criteria for subset selection is different
in each case. These algorithms do not require the specification of
the kernel hyper-parameters. These are estimated from the data.
Stochastic Variational Gaussian Processes can require considerable
manual tuning of the optimization parameters. Typical implementations (like [32]) for Sparse GP and Stochastic Variational GP use
stochastic gradient descent for hyper-parameter optimization. This
explores the entire dataset in batch size increments. Hensman et
al. [5] report the details associated with picking the parameters
for the optimization task (learning rates, momentum, batch sizes
etc.). In contrast, sample sizes with the proposed algorithm, even for datasets with over a million instances, are typically small (on the order of a few hundred instances). Learning hyper-parameters over
small datasets is considerably easier. The experiments reported in
this work required no tuning effort. We report the performance
of Sparse Gaussian Process, Stochastic Variational Gaussian Process
and the proposed method on a variety of datasets in section 6.
The proposed algorithm uses ideas that have proven to be
effective with other machine learning techniques. Bagging has
been used to improve performance using regression trees (Random Forests, [17]). Like with Random Forests, the algorithm uses
model averaging to combine estimates from component Gaussian
Process Regressions. Dropout ([33]) is a technique used in neural
networks to prevent overfitting. Dropping random units achieves
regularization in neural networks. In the proposed algorithm, selecting a sample can be viewed as dropping random instances from
the training dataset. Sparse Gaussian Processes and Stochastic Variational Gaussian Processes use theoretical ideas to select a small
subset of points to develop a Gaussian Process regression model.
This study suggests that combined with model averaging, random
selection of the subset can also work well.
In some areas like spatial statistics or engineering design the
input data points lie in a grid at regularly spaced intervals. Belyaev
et al. [34] discuss an approach to scaling Gaussian Process regression when the problem has this structure. The covariance matrix
in these problems has a specific structure that permits an efficient
computational approach. This approach also uses a prior on the
hyper-parameters to deal with anisotropy. So the implementation
is quite complex. Using a divide and conquer strategy is another
theme in scaling Gaussian Process regression to large datasets. The
Bayesian Committee Machine (BCM) ([35]), is an idea related to
the algorithm presented in this work. The BCM proposes a partition of the dataset into M parts. An estimator is developed on
each partition. The BCM does not choose a subset of the partition. It uses the entire partition for developing the estimator. This
is the key difference between the method proposed in this work
and the BCM. The estimates from each estimator are assumed to
be independent. The BCM assumes that computing a GP solution
on the partitions of the dataset is computationally tractable because the partition sizes are small. Datasets encountered today are
much larger than those reported in [35]. In present-day datasets
the partitions of the dataset based on guidelines provided in [35]
would be very big and computing a full Gaussian Process solution
on them may not be computationally tractable. Even when the partition size is not big enough to create computational hurdles, using
all the data may result in overfitting. Using a hierarchical model
as in [36] or [37] are possible ways to work around the size of
the partitions, however this requires a complex implementation to
partition and recombine computations.
The Locally Approximate Gaussian Process [38] fits a local Gaussian Process for a prediction point using a local neighborhood and
local isotropy assumption. This method too requires some tuning; the size of the neighborhood and the method to choose the neighbors are
important parameters. For datasets where the local isotropy assumption works well and when the size of the test set is small,
this method might be useful. When prediction is required at a
large number of test points, this method might be slow if we
use a large neighborhood. For example with the airline dataset,
described in section 5.1, using the defaults provided with the
laGP ([39]) package did not yield good results on the airline delay dataset (RMSE of 24.89 as opposed to 8.75 with the proposed
method). The running time for laGP was also considerably longer.
Scoring the test set in batches of 15000 rows, the test set prediction took about 50 minutes for the airline delay dataset. The
proposed algorithm builds a single model that is used to score the
entire test set and completed in about 6.5 minutes.
In summary, there are several ways to scale Gaussian Process
regression to large datasets. The choice of a particular method
should be guided by the characteristics of the problem. The
method proposed in this work is appropriate for large datasets
that have a small number of important features for which the kernel characteristics are unknown. In such cases, exploratory data
analysis can be used to determine appropriate kernel types. Additive models may work well for such datasets. Preprocessing such as
principal component analysis can be used if needed to make features independent (so that we can use additive models). Stochastic
Variational Gaussian Processes and Sparse Gaussian Processes are
also good candidates for such problems. Kernel hyper-parameters
are learned from the data by these methods. Results of the ex-
periments conducted as part of this study show that the proposed
method can match or exceed the performance of the Sparse Gaussian Process or the Stochastic Variational Gaussian Process.
5. Effect of the parameters
Selection of algorithm parameters appropriate for a machine
learning task is an arduous task for all practitioners. To alleviate
this difficulty, we provide guidelines for parameter selection based
on detailed experimentation. The proposed algorithm has three parameters:
1. The dataset size
2. The subset size
3. The number of estimators
Accordingly, three sets of experiments were performed to capture
the effect of each of these parameters on the performance of the
algorithm. These experiments are described in this section.
Fig. 1. Delta Elevators.
5.1. Datasets
The following datasets were used in this study:
1. Combined Cycle Power Plant: This dataset was obtained from
the UCI Machine Learning repository [40]. This dataset has
9568 instances. The target variable is the net hourly electrical power output from a power plant. The dataset has four
features.
2. Ailerons: This dataset was obtained from the LIAD (Laboratory
of Artificial Intelligence and Decision) [41]. The target variable
for this dataset is the control action associated with the control of an F-16 aircraft. The dataset has 40 features and 7154
instances.
3. Elevators: This dataset was obtained from the LIAD repository
[41]. This dataset is also related to the control of an F-16 aircraft. The target for this dataset is the control action variation
for the elevators of the aircraft. The dataset has 6 features and
9517 instances.
4. California Housing: This dataset was obtained from the LIAD
repository [41]. The target variable for this dataset is the median house price. The dataset has 8 features and 20460 instances.
5. Individual Household Electric Power Consumption: This dataset was obtained from the UCI Machine Learning repository
[40]. It captures the electric power consumption in a single household over a four year period at a one-minute sampling rate. For the experiments in this study, the Voltage was
treated as the target variable. Seasonality and periodicity are
important characteristics of this dataset (identified during exploratory data analysis of this dataset). For this reason, minute
and hour attributes were created out of the timestamp attribute. Similarly, day of week and day of month attributes
were created out of the date attribute. This dataset has over
2 million instances and 12 features.
6. Airline Delay: This dataset was obtained from the US Department of Transportation’s website [42]. The data represents arrival delays for US domestic flights during January and February of 2016. This dataset had 12 features and over two hundred and fifty thousand instances. Departure delay is included
as one of the predictors while [5] does not include it. Also,
the raw data includes a significant amount of early arrivals
(negative delays). For all regression methods considered in this
study, better models were obtained by limiting the data to the
delayed flights only (arrival delays were greater than zero).
This suggests that arrival delays and early arrivals are better modeled separately.

Fig. 2. California Housing.
7. The Sinc Function: This is a one dimensional synthetic dataset
where the response variable is the sine cardinal function otherwise called the sinc function (noise free). The sinc function
is a complex function to learn and is, therefore, a candidate for
this as well as many other machine learning research studies.
The dataset had one hundred thousand instances.
5.2. Effect of dataset size
These experiments study the effect of the size of the dataset
(N) on the subset size (δ ). For each dataset, we pick a fraction
of the data elements and determine the subset size (N^δ) required
to achieve a target accuracy. The target accuracy is an input parameter for the experiment. We repeat this procedure for various
settings of the fraction of the dataset selected (0.1 through 1). The
number of estimators for these experiments was maintained at 30.
The rationale for this choice is provided in section 5.4. The results
are shown in Fig. 1 through 7.
Fig. 3. Airline Delay. Fig. 4. Combined Cycle Power Plant. Fig. 5. House Hold Power Consumption. Fig. 6. Ailerons. Fig. 7. Sinc Function (Synthetic Data).

The key observation from these experiments was that the δ required to maintain a preset accuracy decreases very slowly as N increases. This set of experiments was used to identify candidate choices for the parameters of Equation (7). Since we wanted a function δ(N) that decreases slowly as N increases, we considered {1/√N, 1/log(N), 1/log(log(N))}. These are slowly decreasing functions of N, in decreasing order of slowness. After a rigorous empirical study we found that the choice δ(N) = 1/log(log(N)) works well on all real-world datasets and on synthetic data. As discussed in section 3, the time complexity of the proposed algorithm is O(K · N_s³), or O(K · N^(3·δ(N))). The exponent of N decreases monotonically as N increases. Since N can be expressed as 2^(2^x), for N > 2^(2^x) the running time is less than O(K · N^(3/x)). The number of estimators K, determined empirically, is a constant (about 30) for all experiments. The rationale for this choice is explained in section 5.4. The time complexity of the GP computation is therefore O(K · N^(3/x)). So when N is large enough (such that x > 3), the GP computation is sub-linear. For example, when N = 2^(2^4), the time complexity is O(K · N^(3/4)). g(ε) is a monotonically increasing function that characterizes the fact that the sample size should increase as the acceptable error threshold ε decreases (e.g. g(ε) = √ε, ε, ε^(1/10), ...). An optimal choice of g(ε) is an area of future work. For the experiments reported in this work, g(ε) = C · ε^(1/10), with C = 1 for low-noise datasets (RMSE << 1) and C = 0.5 for noisy datasets (RMSE > 1), worked well.
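With these choices, the empirical estimator of Equation (7) reduces to a one-line computation. The sketch below evaluates N_s = N^(1/log(log(N))) / (C · ε^(1/10)); the use of natural logarithms is an implementation assumption.

import math

def empirical_subset_size(N, epsilon, C=1.0):
    # Empirical sample size estimate: N_s = N**delta(N) / g(epsilon).
    delta_N = 1.0 / math.log(math.log(N))     # delta(N) = 1 / log(log(N))
    g_eps = C * epsilon ** 0.1                # g(epsilon) = C * epsilon^(1/10)
    return int(N ** delta_N / g_eps)

# Example: one million rows and an acceptable RMSE of about 1.0 (a noisy dataset,
# so C = 0.5) give a subset on the order of a few hundred instances.
print(empirical_subset_size(1_000_000, epsilon=1.0, C=0.5))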
5.3. Effect of subset size
These experiments explore the effect of the subset size, captured by the parameter δ , on the accuracy (ǫ ), for a fixed dataset
size (N). For each dataset, a fraction of the dataset is picked for
this experiment. This represents the data for the experiment (N).
We pick N such that N^1.0 (i.e., δ = 1.0) is computationally tractable (about 2000). The subset N_s used for Gaussian Process model development was N^δ. We fix the number of estimators to be 30 (see
section 5.4 for the rationale). For each δ value in a range of values,
we record the RMSE (ǫ ). The key insight from these experiments
was that complex functions, like the Sinc function (see Fig. 14)
needed a larger subset size to produce a given level of accuracy.
Note that we can expect the δ to drop with the increase in dataset
size because of the effect reported in section 5.2. Section 6.3 provides the δ values associated with the full dataset. The results for these experiments are shown in Fig. 8 through 14.

Fig. 8. Delta Elevators. Fig. 9. California Housing. Fig. 10. Airline Delay. Fig. 11. Combined Cycle Power Plant. Fig. 12. House Hold Power Consumption. Fig. 13. Ailerons.

5.4. Effect of number of estimators

These experiments capture the effect of the number of estimators (K) on the accuracy of the algorithm (ε), for a particular set of δ and N values.
Fig. 14. Sinc Function (Synthetic Data). Fig. 15. Delta Elevators. Fig. 16. California Housing. Fig. 17. Airline Delay.
For each dataset, the following experiment was performed. The
fraction of the dataset to use for the experiment (N) and the
subset size (δ ) are selected. For each K in a range of values, Algorithm 1 is applied and the error (ǫ ) is recorded. The key insight
from this set of experiments was that given a subset size (N s ),
there is a point up to which increasing the number of estimators
(K ) drops the error, but once a threshold value is reached (around
30, for all the datasets), there is little benefit in increasing the
number of estimators. A plausible explanation for this behavior
could be that increasing the number of estimators reduces the
variance component of the error in the bias-variance decomposition of the error (ǫ ). The results for these experiments are shown
in Fig. 15 through 21.
6. Application of the algorithm
In this section we describe the results of applying the algorithm
reported in this work to the datasets described in section 5.1.
Fig. 18. Combined Cycle Power Plant.
6.1. Independent performance assessments
It may be of interest to see how the estimates from Gaussian Process regression compare to estimates from other methods for
large regression tasks. We report the performance of two methods.
XGBoost [6] is a recent tree based algorithm using the gradient
boosting framework that is very scalable and gaining adoption
among practitioners. Trees partition the predictor space and can
account for interaction. This method uses boosting instead of bagging. The difference in accuracy estimates between XGBoost and
the proposed method for a particular dataset could be attributed
to the influence of boosting or effects of interaction among variables. In most cases, the estimates from XGBoost were comparable
to the estimates from the proposed method. As discussed in section 6.4, it is possible to combine estimates from XGBoost with the estimates from the proposed method using stacking. The Generalized Additive Model (GAM) [7] is a scalable regression technique. As the name suggests, it fits an additive model in terms of smooth non-parametric functions (typically splines) of the predictor variables. The GAM estimate serves as an independent performance estimate from another non-parametric regression algorithm based on an additive model. In most cases, the accuracy obtained from GAMs was similar to that obtained from Gaussian Process regression.

Fig. 19. House Hold Power Consumption. Fig. 20. Ailerons. Fig. 21. Sinc Function (Synthetic Data).

6.2. Feature relevance

As discussed in section 3, an additive model worked well for the datasets used in this study. Further, in all these datasets, the response depended only on a small number of attributes. XGBoost can report feature importance. This is summarized in Table 1. The response depends on a small subset of predictors in all the datasets reported in this study. GPy [32], the package used to implement the experiments reported in this work, implements the Automatic Relevance Determination (see [1, Chapter 5]) feature. Features that are not relevant are dropped from the model.

Table 1
Feature Importance.

Dataset              Features   Important Features
Airline Delay        11         1
Ailerons             40         3
Power Plant          4          2
Delta Elevators      6          2
California Housing   8          3
House Hold Power     12         4^a

^a Timestamp is counted as a feature. Variables derived from timestamp are not included in this count.

6.3. Predictive performance
As discussed in section 4, the choice of a GP method to use for
a regression task depends on the problem context. The algorithm
presented in this work does not require the hyper-parameter values to be specified. In terms of the inputs to the algorithm, the
algorithm presented in this work is similar to the Sparse Gaussian
Process and the Stochastic Variational Gaussian Process. These algorithms only require the dataset, kernel and the subset size as
input. We report the performance of the Sparse Gaussian Process,
Stochastic Variational Gaussian Process and the proposed algorithm
for each dataset. These algorithms are good choices to begin the
knowledge discovery process in large datasets. The results of applying Algorithm 1 to all the datasets listed in section 5.1 are
shown in Table 2. For each dataset, Table 2 includes the following columns:
1. BM1 (Bagging Method 1): This is the result of applying Algorithm 1 using the Inference of a Proportion method to select
the subset.
2. BM2 (Bagging Method 2): This is the result of applying Algorithm 1 using the Empirical Estimator method to select the
subset.
3. POE: This is the result of applying the Product of Experts Algorithm (Equation (13)) instead of bagging. The subset size used
is the same as that for BM1.
4. SVGP: This is the result of applying the Stochastic Variational
Gaussian Process algorithm.
5. SPGP: This is the result of applying the Sparse Gaussian Process algorithm.
6. XGBoost: This is the result obtained from the XGBoost algorithm.
7. GAM: This is the result obtained from GAM.
8. SD: This is the standard deviation of the response. If we use
a model that simply predicts the mean response for each test
point, then the standard deviation represents the error asso-
ciated with such a model. For regression models to be useful, they must perform better than this model.

For all datasets, the split between the training and the test sets was 70%–30%. The reported RMSE values are on the test set. The subset sizes required for each dataset are captured by the δ parameter. The δ values associated with Table 2 are presented in Table 3. The δ values for the bagging methods and POE correspond to the most accurate estimates we could obtain for the datasets on a laptop with 16 GB of RAM. The SVGP and the SPGP columns of Table 3 represent the δ at which the best performance was obtained for the Stochastic Variational GP and the Sparse GP methods. For the airline dataset, the best performance for Stochastic Variational GP was obtained at a smaller subset size than the one used for the proposed method. For the airline dataset, Sparse GP showed no improvement in performance for δ greater than 0.33.

The kernels used for the datasets are shown in Table 4. The experiments reported in this work used [32] for implementation. The Sparse GP implementation in this package is based on the Variational Sparse GP method (see [4]). The Stochastic Variational GP implementation is the one used in [5]. Both these methods use Stochastic Gradient Descent for hyper-parameter learning. Kernel support for these methods is also limited to simple kernels; kernels like the Brownian Motion kernel or the Periodic Matern kernel are not supported. For this reason, we do not report accuracy for SVGP or the Sparse GP for datasets that required these complex kernels (Household Power Consumption and California Housing). This also highlights the fact that Algorithm 1 may be a good candidate for datasets with the desired characteristics described earlier but requiring a complex kernel to model the underlying function.

Table 2
RMSE.

Dataset          BM1       BM2       POE        SVGP      SPGP      XGBoost   GAM       SD
Ailerons         0.000192  0.000189  0.0001893  0.000207  0.000192  0.00017   0.00020   0.00041
Delta Elevators  0.00146   0.00147   0.001453   0.00144   0.00146   0.00143   0.00144   0.00237
CCPP             4.19      3.99      3.97       5.24      3.93      3.73      4.11      17.07
Cal Housing      0.293     0.294     0.293      NA        NA        0.240     0.281     0.569
Airline Delay    8.75      8.75      8.74       8.85      10.94     8.74      9.45      31.31
HPC              2.039     1.992     2.120      NA        NA        1.63      2.18      3.24
Sinc Function    0.0267    0.0200    0.0132     0.0627    0.0170    NA        NA        0.0634

Table 3
δ Requirements for the Datasets.

Dataset          BM1    BM2    POE    SPGP   SVGP
Ailerons         0.75   0.65   0.30   0.75   0.30
Delta Elevators  0.625  0.55   0.625  0.625  0.30
CCPP             0.6    0.6    0.6    0.6    0.3
Cal Housing      0.625  0.625  0.625  NA     NA
Airline Delay    0.56   0.49   0.56   0.33   0.523
HPC              0.43   0.38   0.43   NA     NA
Sinc Function    0.50   0.50   0.50   0.34   0.36

Table 4
Kernels for the Datasets.

Dataset             Kernel
Ailerons            sum of linear and RBF
Delta Elevators     sum of linear and RBF
CCPP                sum of linear and RBF
California Housing  sum of Periodic Matern 32, Linear, RBF and a product kernel of Linear and RBF on the medianIncome attribute
Airline Delay       sum of RBF, linear and White Noise
HPC                 sum of Bias, Cosine, RBF, Linear and Brownian Motion
Sinc Function       RBF
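The composite kernels in Table 4 are sums (and products) of standard base kernels. A sketch of this kind of composition using scikit-learn kernel objects is shown below for illustration; the experiments themselves used GPy, and kernels such as Brownian Motion or the Periodic Matern have no direct scikit-learn equivalent.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, DotProduct, WhiteKernel, ExpSineSquared

# "sum of linear and RBF" (Ailerons, Delta Elevators, CCPP in Table 4)
linear_plus_rbf = DotProduct() + RBF()

# Airline Delay: sum of RBF, linear and a white-noise term
airline_kernel = RBF() + DotProduct() + WhiteKernel()

# A periodic component (for seasonal structure) can be added with ExpSineSquared
seasonal_kernel = RBF() + ExpSineSquared()

gp = GaussianProcessRegressor(kernel=airline_kernel, normalize_y=True)
# gp.fit(X_subset, y_subset) would then optimize the kernel hyper-parameters.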
Table 5
Explained Variance.

Dataset             BM1    XGBoost
Airline Delay       0.92   0.92
HPC                 0.61   0.75
Delta Elevators     0.64   0.64
Ailerons            0.77   0.82
CCPP                0.94   0.95
California Housing  0.73   0.82
An analysis of Table 2 shows that the proposed method can perform as well as XGBoost for the Airline Delay, Delta Elevators and
the Power Plant (CCPP) datasets. Table 5 provides the variance explained by the model for XGBoost and Bagging Method 1. Gaussian
Process regression models offer some advantages over the XGBoost
model given the similar levels of prediction performance. Gaussian
Process models provide variance and probability estimates directly
unlike XGBoost models. This is important in many applications ([8]
and [9]). In some cases like the California Housing and the Household Power Consumption datasets, XGBoost does better. In these
cases we explored if it is possible to construct a better estimator using both these estimators. This is discussed in section 6.4.
[43, Chapter 2] discusses ANOVA based approaches that permit
different regression methods to be compared. Application of these
techniques to compare the methods in Table 2 is an area of future
work.
In small to moderate-sized datasets, a validation dataset is required to assess the generalization of the model. Since the datasets
considered in this study were big, the test sets were also big. The
error observed over a single large test set was similar to the error
observed when a validation methodology was used. Algorithm 1
uses a bagging type approach where we average the estimates
of the individual models. An error estimate similar to the out of
bag error estimate [44] provides a better picture of the generalization performance. The key idea here is that the samples that
are not used for developing any one of the models used in creating the bagged estimator are used to evaluate the performance of
the composite model. This becomes the out of bag estimate. We
apply this procedure repeatedly to obtain a picture of the generalization of the algorithm. For each iteration, we report the RMSE on
the out of bag dataset. Since the dataset is big, this was repeated
three times. The results are shown in Table 6. These results are
shown for Bagging Method 1, Sparse GP, Stochastic Variational GP,
and XGBoost. For the Sparse GP, Stochastic Variational GP and XGBoost, the training dataset is the dataset used to create the model
for Bagging Method 1. The test dataset is the out of bag dataset.
These values are comparable to those reported in Table 2 in most cases. The performance of the Stochastic Variational Gaussian Process was quite variable for the CCPP dataset. This suggests that the
optimization parameters for this method may need tuning. This is
one of the problems with the SVGP method in practice. Arriving
at good hyper-parameters for the optimization procedure is a time
consuming and cumbersome procedure.
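To make this out-of-bag style evaluation concrete, the sketch below fits GP models on fixed-size random subsamples, averages their predictions, and reports the RMSE on the points that no subsample used. The kernel, sample size, and number of models are placeholders for illustration, not the settings used in our experiments.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def oob_rmse(X, y, n_models=30, sample_size=1000, seed=0):
    """Bag GP models fit on random subsamples and evaluate the averaged
    prediction on the points left out of every subsample.
    Assumes the subsamples do not cover the whole dataset."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    used = np.zeros(n, dtype=bool)
    models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=sample_size, replace=False)
        used[idx] = True
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                      normalize_y=True)
        models.append(gp.fit(X[idx], y[idx]))
    oob = ~used  # points never seen by any of the bagged models
    preds = np.mean([m.predict(X[oob]) for m in models], axis=0)
    return np.sqrt(np.mean((y[oob] - preds) ** 2))
```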
Table 6
Out of Bag Prediction Performance.

Dataset              Method      OOB set 1   OOB set 2   OOB set 3   Mean RMSE
HPC                  BM1         2.006400    2.013900    2.008600    2.009600
Airline Delay        BM1         8.644600    8.784300    8.777100    8.735350
Ailerons             BM1         0.0001950   0.0001961   0.0001979   0.0001964
Delta Elevators      BM1         0.001452    0.001466    0.001442    0.001453
CCPP                 BM1         4.129000    4.179600    4.190900    4.166500
SincFunction         BM1         0.037051    0.034210    0.030093    0.033780
California Housing   BM1         0.295650    0.293180    0.294890    0.294570
SincFunction         Sparse GP   0.018060    0.015330    0.018230    0.017210
Ailerons             Sparse GP   0.000217    0.000204    0.000224    0.000215
Delta Elevators      Sparse GP   0.001453    0.001472    0.001456    0.001460
CCPP                 Sparse GP   3.679500    3.892800    3.772700    3.781600
Airline Delay        Sparse GP   10.915200   10.934100   10.862900   10.904000
HPC                  Sparse GP   NA          NA          NA          NA
California Housing   Sparse GP   NA          NA          NA          NA
SincFunction         SVIGP       0.059540    0.061600    0.065490    0.062210
Ailerons             SVIGP       0.000199    0.000212    0.000196    0.000202
Delta Elevators      SVIGP       0.001441    0.001442    0.001431    0.001438
CCPP                 SVIGP       6.442400    26.397900   18.143190   16.994500
HPC                  SVIGP       NA          NA          NA          NA
Airline Delay        SVIGP       11.105500   9.470400    9.167900    9.914600
California Housing   SVIGP       NA          NA          NA          NA
Ailerons             XGBoost     0.000167    0.000170    0.000173    0.000170
Delta Elevators      XGBoost     0.001423    0.001451    0.001475    0.001449
CCPP                 XGBoost     3.710000    3.620000    3.540000    3.620000
California Housing   XGBoost     0.243694    0.240122    0.241744    0.241800
HPC                  XGBoost     1.800968    1.777669    1.771144    1.783000
Airline Delay        XGBoost     8.733207    8.765240    8.805467    8.767971
Table 7
Performance – Locally Approximate Gaussian Process.

Dataset              Test Set Size   Running Time (s)   Holdout set 1   Holdout set 2   Holdout set 3   Mean RMSE
California Housing   6873            142                3.888           0.792           0.735           1.805
Airline Delay        85401           5636               30.008          26.273          28.710          28.330
HPC                  10247           3240               2.897           2.950           3.004           2.951
CCPP                 3190            80                 6.500           6.032           5.366           5.960
Ailerons             2382            18                 0.00026304      0.00028159      0.00026438      0.0002696
Delta Elevators      3169            75                 0.001701        0.001757        0.001803        0.0017539
Sinc Function        33300           590                0.066           0.062           0.085           0.071
In section 4 we discussed the Locally Approximate Gaussian Process as a technique that fits a Gaussian Process around each test point. This method fits an isotropic Gaussian kernel to the neighborhood of the test point using the training data. The neighborhood size and the method used to select the neighborhood points are parameters that must be tuned for each dataset. We used the guidelines provided in the documentation for the package [39] to determine test set sizes that could be processed within about an hour with this package. The method is not practical for very large test sets; for example, the Household Power Consumption dataset (about 2 million instances) yields a test set of roughly six hundred and sixty thousand instances under a 70–30 training-test split. The testing methodology is similar to the one associated with Table 2. The notion of an out of bag sample does not apply in this case, but a similar approach to testing was used so that the results can be compared. For these experiments, a test set of the size specified in Table 7 is set aside and the remainder of the data serves as the training data for the local fits. This procedure was repeated three times with different test and training sets. The results of applying the Locally Approximate Gaussian Process are shown in Table 7. For these experiments a neighborhood size of fifty was used, and the “Minimize Mean Square Predictive Error” (MSPE) criterion (see [39]) was used to select the neighborhood of the test points.
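The experiments above used the laGP package in R. Purely for illustration, the Python sketch below mimics the basic idea by fitting a small GP to the nearest training neighbors of each test point; it uses a plain nearest-neighbor rule rather than laGP's sequential MSPE-based neighborhood selection, so it is a simplified stand-in, not the procedure used for Table 7.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def local_gp_predict(X_train, y_train, X_test, n_neighbors=50):
    """Predict each test point with a GP fit only to its nearest
    training neighbors (illustrative stand-in for local approximation)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
    _, neighbor_idx = nn.kneighbors(X_test)
    preds = np.empty(X_test.shape[0])
    for i, idx in enumerate(neighbor_idx):
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                      normalize_y=True)
        gp.fit(X_train[idx], y_train[idx])
        preds[i] = gp.predict(X_test[i:i + 1])[0]
    return preds
```

Even this simplified version shows why the cost of the approach grows with the number of test points rather than with the training set size.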
The running times for Bagging Method 1 (based on the algorithm reported in this work), the Sparse GP, and the Stochastic Variational GP are shown in Table 8. As mentioned in the discussion following Table 2, the Sparse GP and the Stochastic Variational GP implementations do not support complex kernels such as the Brownian Motion kernel or the Matern kernel that are required by the Household Power Consumption and California Housing datasets. The implementation of the algorithm reported in this work can be parallelized; the experiments reported here were performed with an implementation that is not parallelized. An implementation that parallelizes the bagging step can improve the run-time of Algorithm 1.

Table 8
Running Times for GP Methods (in seconds).

Dataset              BM1    Sparse GP   SVGP
HPC                  1828   NA          NA
Airline Delay        458    76          423
Ailerons             286    364         37
Delta Elevators      22     59          42
CCPP                 24     291         36
SincFunction         50     603         131
California Housing   1876   NA          NA
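As noted above, the bagging step of Algorithm 1 is embarrassingly parallel: each subsample model can be fit independently. The sketch below shows one way this could be done with joblib; the library choice, kernel, and parameter values are our assumptions for illustration, not part of Algorithm 1 as reported.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def _fit_one(X, y, sample_size, seed):
    # Fit a single GP on a random subsample of the training data.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                  normalize_y=True)
    return gp.fit(X[idx], y[idx])

def fit_bagged_gp(X, y, n_models=30, sample_size=1000, n_jobs=-1):
    # Fit the subsample models in parallel; their predictions are averaged later.
    return Parallel(n_jobs=n_jobs)(
        delayed(_fit_one)(X, y, sample_size, seed) for seed in range(n_models))
```

Predictions are then obtained by averaging the predictions of the returned models, exactly as in the serial version.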
6.4. Combining estimators
Combining estimators is not a new idea [10]. More recently, [11] combined gradient boosted tree models with Gaussian Process models. This prompted us to explore combining estimators for the datasets where the Gaussian Process model produced slightly lower accuracy. There were three datasets where the performance of the Gaussian Process model was slightly lower than that of the XGBoost model: the Combined Cycle Power Plant, California Housing, and Household Power Consumption datasets. To combine estimators, the outputs of the XGBoost model and the GP model were used as the inputs to a classifier (the stacking classifier). The response of the classifier is the best model for a given pair of XGBoost and GP model responses. A K-nearest neighbor classifier was used for this purpose; the best value of K was determined through experimentation. Given a test point, the estimates from the XGBoost model and the GP model are obtained, the classifier predicts the best model for this pair of estimates, and that model is then used to provide the estimate for the test point. This procedure improved the accuracy for the California Housing and Combined Cycle Power Plant datasets. A K-nearest neighbor regression performed better than the K-nearest neighbor classifier for the Household Power Consumption dataset. The results are shown in Table 9.

Table 9
RMSE using XGBoost and bagged GP models.

Dataset              RMSE
California Housing   0.211
CCPP                 2.943
HPC                  1.610
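A minimal sketch of this stacking step is given below. The variable names are hypothetical, and we assume the "best model" label for each held-out point is simply the base model with the smaller absolute error; the K-nearest neighbor regression variant used for the Household Power Consumption dataset would replace the classifier with a KNeighborsRegressor fit on the same inputs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_stacker(gp_pred, xgb_pred, y_holdout, k=5):
    """Train a classifier that, given a (GP, XGBoost) prediction pair,
    picks the model expected to be more accurate."""
    X_stack = np.column_stack([gp_pred, xgb_pred])
    best = (np.abs(xgb_pred - y_holdout) <
            np.abs(gp_pred - y_holdout)).astype(int)  # 1 = XGBoost, 0 = GP
    return KNeighborsClassifier(n_neighbors=k).fit(X_stack, best)

def stacked_predict(stacker, gp_pred, xgb_pred):
    # Use whichever base model the stacker selects for each test point.
    X_stack = np.column_stack([gp_pred, xgb_pred])
    choice = stacker.predict(X_stack)
    return np.where(choice == 1, xgb_pred, gp_pred)
```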
A comparison of Table 9 with Table 2 shows that combining estimators yields solutions that are more accurate than the individual models. The idea of
combining estimators can be refined in many ways. We have not
included the Sparse GP and Stochastic Variational GP in this solution. The choice of the classifier or regression model to combine
estimates from the component models is another modeling decision. The intent here is to show that it is possible to combine the
GP model developed using the method presented in this work with
other regression models to produce solutions that are better than
the individual solutions. An optimal choice of the estimators and
the method used for stacking is beyond the scope of this work.
7. Conclusion
There are many methods to scale Gaussian Process regression
to large datasets. The appropriate choice depends on the problem context. For example, if Gaussian Process regression has been
applied to similar datasets, then the kernel types and the kernel
hyper-parameters may be known and methods such as the Nyström method are suitable. If we only seek estimates at a small
number of test points, methods such as the Locally Approximate
Gaussian Process may be suitable (see section 4 for a detailed discussion). The method proposed in this work is appropriate for large
datasets with a small set of important features for which the kernel characteristics are unknown. Kernel choices can be determined
through exploratory data analysis [12]. Kernel hyper-parameters
can be learned from the data. The data for the experiments reported in this work came from diverse application areas and in
a wide range of sizes (a few thousand to two million instances). In these
datasets, the target variable depended on a small set of variables
and an additive model matched the performance of models that
permit interaction like XGBoost. This suggests that datasets with
these characteristics are not uncommon. Results from Minimax
theory for non-parametric regression indicate that additive models that depend on a small set of predictors have an attractive rate
of convergence. This suggests that we can develop adequate regression models with a small subset of samples from the dataset.
This was consistent with results observed in the experiments conducted as part of this study. The Stochastic Variational Gaussian
Process and the Sparse Gaussian Process are also good candidates
for problems with these characteristics. The results of this study
show that the proposed algorithm can match or exceed the performance of the Sparse Gaussian Process or the Stochastic Variational
Gaussian Process. The results of this study also show that Gaussian
Processes can be as effective as ensemble methods like Gradient
Boosted Trees on large datasets. Gaussian Processes are based on a
probabilistic framework and can provide variance and probability
estimates directly as compared to other tools for large regression
tasks like XGBoost. This can be very important for some applications (see [9] or [8]). The regression function is also interpretable
in the case of Gaussian Process models in contrast to methods
like gradient boosted trees. Therefore Gaussian Process models can
be good explanatory models as well as good predictive models.
An important feature of this algorithm is the simplicity of implementation. In Internet applications, frequent model development
and deployment are needed. An algorithm that is simple to implement and easy to tune but still performs as well as or better than
the other choices for scaling Gaussian Process regression will be
attractive to practitioners. Finally, it should be noted that it is possible to combine this algorithm with other algorithms like Gradient
Boosted Trees using model stacking to achieve performance gains.
Refinement of the sampling strategy used in the algorithm reported in this work is an area of future work. Parallelization of the algorithm is simple and will yield performance gains; it will also permit further evaluation of the algorithm's characteristics using techniques such as those described in [43, Chapter 2].
Acknowledgement
The authors would like to thank the reviewers for their many
insightful suggestions and comments that have helped improve the
presentation of the paper significantly.
References
[1] C.E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006.
[2] Z. Ghahramani, A tutorial on Gaussian processes (or why I don’t use SVMs), MLSS Workshop talk by Zoubin Ghahramani on Gaussian Processes, http://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf, 2011. (Accessed 19 July 2016).
[3] C.J. Stone, Additive regression and other nonparametric models, Ann. Stat.
(1985) 689–705.
[4] M.K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: AISTATS, vol. 12, 2009, pp. 567–574.
[5] J. Hensman, N. Fusi, N.D. Lawrence, Gaussian processes for big data, in: Conference on Uncertainty in Artificial Intelligence, auai.org, 2013, pp. 282–290.
[6] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, preprint, arXiv:1603.02754.
[7] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning,
Springer Series in Statistics, vol. 1, Springer, Berlin, 2001.
[8] M.A. Pimentel, D.A. Clifton, L. Clifton, L. Tarassenko, Probabilistic estimation of
respiratory rate using Gaussian processes, in: 2013 IEEE 35th Annual International Conference on Engineering in Medicine and Biology Society, EMBC, IEEE,
2013, pp. 2902–2905.
[9] Y. Weng, R. Rajagopal, Probabilistic baseline estimation via Gaussian process,
in: IEEE Power & Energy Society General Meeting, IEEE, 2015, pp. 1–5.
[10] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (2) (1992) 241–259.
[11] J.R. Lloyd, Gefcom2012 hierarchical load forecasting: gradient boosting machines and Gaussian processes, Int. J. Forecast. 30 (2) (2014) 369–374.
[12] D. Duvenaud, Automatic Model Construction with Gaussian Processes, Ph.D.
Thesis, University of Cambridge, 2014.
[13] C.J. Stone, Optimal global rates of convergence for nonparametric regression,
Ann. Stat. (1982) 1040–1053.
[14] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in:
Advances in Neural Information Processing Systems, 2005, pp. 1257–1264.
[15] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning,
Adaptive Computation and Machine Learning, The MIT Press, 2005.
[16] C.M. Bishop, Pattern Recognition, vol. 128, 2006.
[17] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[18] Y. Cao, D.J. Fleet, Generalized product of experts for automatic and principled
fusion of Gaussian process predictions, preprint, arXiv:1410.7827.
[19] A.B. Tsybakov, Introduction to Nonparametric Estimation, 2009, revised and extended from the 2004 French original, translated by Vladimir Zaiats.
[20] L. Györfi, M. Kohler, A. Krzyzak, H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer Science & Business Media, 2006.
[21] Y. Yang, S.T. Tokdar, et al., Minimax-optimal nonparametric regression in high
dimensions, Ann. Stat. 43 (2) (2015) 652–674.
[22] Y. Yang, D.B. Dunson, et al., Bayesian manifold regression, Ann. Stat. 44 (2)
(2016) 876–905.
[23] A. Zaytsev, E. Burnaev, Minimax approach to variable fidelity data interpolation,
in: Artificial Intelligence and Statistics, 2017, pp. 652–661.
[24] R.E. Kalman, A new approach to linear filtering and prediction problems, J. Basic Eng. 82 (1) (1960) 35–45.
[25] R. Sambasivan, S. Das, A statistical machine learning approach to yield curve
forecasting, in: 2017 International Conference on Computational Intelligence in
Data Science, ICCIDS, 2017, pp. 1–6.
[26] J. Quiñonero-Candela, C.E. Rasmussen, C.K. Williams, Approximation methods for Gaussian process regression, in: Large-Scale Kernel Machines, 2007,
pp. 203–224.
[27] A.G. Wilson, Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes, Ph.D. Thesis, University of Cambridge,
2014.
[28] A. Wilson, R. Adams, Gaussian process kernels for pattern discovery and extrapolation, in: Proceedings of the 30th International Conference on Machine
Learning, ICML-13, 2013, pp. 1067–1075.
[29] P. Drineas, M.W. Mahoney, On the Nyström method for approximating a gram
matrix for improved kernel-based learning, J. Mach. Learn. Res. 6 (Dec) (2005)
2153–2175.
[30] A. Rahimi, B. Recht, Random features for large-scale kernel machines, in: Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in
Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[32] GPy, GPy: a Gaussian process framework in python, http://github.com/SheffieldML/GPy, 2012–2014.
[33] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
Dropout: a simple way to prevent neural networks from overfitting, J. Mach.
Learn. Res. 15 (1) (2014) 1929–1958.
[34] M. Belyaev, E. Burnaev, Y. Kapushev, Computationally efficient algorithm for
Gaussian process regression in case of structured samples, Comput. Math.
Math. Phys. 56 (4) (2016) 499–513.
[35] V. Tresp, A Bayesian committee machine, Neural Comput. 12 (11) (2000)
2719–2741.
[36] M.P. Deisenroth, J.W. Ng, Distributed Gaussian processes, in: International Conference on Machine Learning, vol. 2, ICML, 2015, p. 5.
[37] S. Park, S. Choi, Hierarchical Gaussian process regression, in: ACML, 2010,
pp. 95–110.
[38] R.B. Gramacy, D.W. Apley, Local Gaussian process approximation for large computer experiments, J. Comput. Graph. Stat. 24 (2) (2015) 561–578.
[39] R.B. Gramacy, laGP: large-scale spatial modeling via local approximate Gaussian processes in R, J. Stat. Softw. 72 (1) (2016) 1–46, https://doi.org/10.18637/jss.v072.i01.
[40] M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml,
2016.
[41] L. Torgo, Large regression datasets, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236, 2016.
[42] B. USDOT, RITA airline delay data download, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236, 2016.
[43] C.E. Rasmussen, Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression, Ph.D. Thesis, Cambridge, 1999, http://mlg.eng.cam.ac.uk/pub/pdf/Ras96b.pdf.
[44] L. Breiman, Out-of-Bag Estimation, Tech Report on UC Berkeley Website, https://www.stat.berkeley.edu/users/breiman/OOBestimation.pdf. (Accessed 14 April 2018).