
Big Data Research 14 (2018) 12–26

Fast Gaussian Process Regression for Big Data

Sourish Das (Chennai Mathematical Institute, India), Sasanka Roy (Indian Statistical Institute, India), Rajiv Sambasivan (Chennai Mathematical Institute, India; corresponding author)

Article history: Received 31 July 2017; Received in revised form 20 April 2018; Accepted 20 June 2018; Available online 4 July 2018.
Keywords: Big data; Gaussian Process; Regression
https://doi.org/10.1016/j.bdr.2018.06.002
© 2018 Elsevier Inc. All rights reserved.

Abstract

Gaussian Processes are widely used for regression tasks. A known limitation in the application of Gaussian Processes to regression tasks is that the computation of the solution requires performing a matrix inversion. The solution also requires the storage of a large matrix in memory. These factors restrict the application of Gaussian Process regression to small and moderate size datasets. We present an algorithm that combines estimates from models developed using subsets of the data obtained in a manner similar to the bootstrap. The sample size is a critical parameter for this algorithm. Guidelines for reasonable choices of algorithm parameters, based on a detailed experimental study, are provided. Various techniques have been proposed to scale Gaussian Processes to large-scale regression tasks. The most appropriate choice depends on the problem context. The proposed method is most appropriate for problems where an additive model works well and the response depends on a small number of features. The minimax rate of convergence for such problems is attractive and we can build effective models with a small subset of the data. The Stochastic Variational Gaussian Process and the Sparse Gaussian Process are also appropriate choices for such problems. Results from experiments conducted as part of this study indicate that the algorithm presented in this work can be as effective as these methods. Unlike these methods, the proposed algorithm requires minimal hyper-parameter tuning and is much simpler to implement. The rate of convergence is also attractive.

1. Introduction

Gaussian Processes (GP) are attractive tools for supervised learning tasks on complex datasets where traditional parametric methods may not be effective. Traditional parametric methods make rigid a priori assumptions about the nature and form of the model. For example, a linear regression model is a parametric model choice. This choice may not adapt to the data as it changes; for example, updates to the data might suggest that a non-linear model is more appropriate. Finding a suitable parametric form for a complex dataset is very challenging without prior experience with the data. Gaussian Processes offer flexibility in this regard. We only need to commit to a broad family of functions that is suitable for the problem by specifying an appropriate covariance function; we do not need to commit to a rigid functional form between the response and the predictor variables. This can simplify the effort required to pick a good model when confronted with an unfamiliar dataset that requires a complex model. Gaussian Processes are also easier to use than alternatives like neural networks [1], and they offer some practical advantages over Support Vector Machines (SVM) [2].
Gaussian Processes offer uncertainty estimates with their predictions. The kernel and regularization parameters can be learned directly from the data instead of through cross-validation. Feature selection can be incorporated into the learning algorithm. For regression, exact inference is possible with Gaussian Processes; to apply Gaussian Processes to classification, we need to resort to approximate inference techniques such as Markov Chain Monte Carlo, the Laplace approximation or variational inference. Even though exact inference is possible for Gaussian Process regression, the computation of the solution requires a matrix inversion. For a dataset with n rows or data instances, the time complexity of matrix inversion is $O(n^3)$. The space complexity associated with storing a matrix of size n is $O(n^2)$. This restricts the applicability of the technique to small or moderate-sized datasets. In this paper we present an algorithm that uses subset selection and ideas borrowed from bootstrap aggregation to mitigate this problem. A parallel implementation of this algorithm is also possible and can further improve performance.

The rest of this paper is organized as follows. In section 2, we motivate the problem. In section 3, we present our solution, which is based on combining estimators developed on subsets of the data. The subsets are selected by simple random sampling with replacement (similar to what is done in the bootstrap). The size of the selected subset is a key aspect of this algorithm and is determined empirically; we present two methods to determine the subset size. We also present the motivating ideas leading to the final form of the algorithm. When the model is an additive structure of univariate components, this has attractive implications for the convergence rate [3]. An additive model worked well for the datasets used in this study. Relevant facts from minimax theory for non-parametric regression, consistent with the experimental results reported in this work, are presented. In section 4, we present a brief summary of related work. Applying Gaussian Processes to large datasets has attracted a lot of interest from the machine learning research community, and we connect that research to the algorithm reported in this work. Selecting parameters for an algorithm is an arduous task; however, this algorithm has only two important parameters, the subset size and the number of estimators. We provide guidelines to pick these parameters based on detailed experiments across a range of datasets. In section 5 we provide experimental results that give insight into the effect of the parameters associated with the algorithm. In section 6, we illustrate the application of our algorithm to synthetic and real world datasets. We applied the algorithm developed in this work to datasets with over a million instances. We compare the performance of the proposed method to the Sparse Gaussian Process [4] and the Stochastic Variational Gaussian Process [5]. The inputs required by these algorithms are similar to the inputs required for the proposed method, so they are applicable in a similar context. We also compare the estimates obtained from the reported algorithm with two other popular methods for regression on large datasets – Gradient Boosted Trees (using XGBoost, [6]) and the Generalized Additive Model (GAM) [7].
Results from experiments performed as part of this study show that the accuracies from the proposed method are comparable to those obtained from Gradient Boosted Trees or GAMs. However, there are some distinct advantages to using a Gaussian Process model. A Gaussian Process model yields uncertainty estimates directly, whereas methods like Gradient Boosted Trees do not provide this (at least directly); see [8] and [9]. A Gaussian Process model is also directly interpretable [2, pg. 5, 6] in comparison to methods like Gradient Boosted Trees or Neural Networks. Therefore, the proposed method can yield both explanatory and predictive models. It is possible to use stacking [10] to combine the estimates from the proposed model with those obtained from a competing model (like Gradient Boosted Trees) and obtain higher accuracies. Combining a Gaussian Process solution with XGBoost has been used by [11]. In section 7, we present the conclusions from this work.

The contribution of this work is as follows. The proposed method to perform Gaussian Process regression on large datasets has a very simple implementation in comparison to other alternatives, with similar levels of accuracy. The algorithm has two key parameters – the subset size and the number of estimators – and detailed guidelines to pick these parameters are provided. The choice of a method to scale Gaussian Process regression to large datasets depends on the characteristics of the problem; this is discussed in section 4. The proposed method is most effective for problems where the response depends on a small number of features and the kernel characteristics are unknown. In such cases, exploratory data analysis can be used to arrive at appropriate kernel choices [12]. Additive models may work well for these problems, and appropriate preprocessing like principal component analysis can be used if needed to create additive models. The rate of convergence for additive models is attractive [13]. This implies that we can build very effective models with a small proportion of samples. Sparse Gaussian Processes (see [14]) and Stochastic Variational Gaussian Processes [5] are also appropriate for such problems, but they require a more complex implementation and may require extensive effort to tune the optimization component of the algorithm (see [5]). Results of the experiments conducted as part of this study show that the proposed method can match or exceed the performance of these methods.

2. Problem formulation

A Gaussian Process y with additive noise can be represented as:

$y = f(x) + \eta$.   (1)

Here:
• y represents the observed response.
• x represents an input vector of covariates.
• η represents the noise. The noise terms are assumed to be independent, identically distributed (IID) random variables drawn from a normal distribution with variance $\sigma_n^2$.
• f represents the function being modeled. It is a multivariate normal with mean function μ(x) and covariance function K(x).

If we want to make a prediction for the value of the response at a test point $X_*$, the predictive equations are given by (see [15]):

$f_* \mid X, y, X_* \sim \mathcal{N}\left(\bar{f}_*, \operatorname{cov}(f_*)\right)$,   (2)

$\bar{f}_* = \mathbb{E}[f_* \mid X, y, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} y$,   (3)

$\mathbb{V}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)$.   (4)

Here:
• $f_*$ is the value of the function at the test point $X_*$.
• $K(X_*, X)$ represents the covariance between the test point and the training set.
• $K(X, X)$ represents the covariance matrix for the training set.
• I is the identity matrix.
• $\mathbb{V}(f_*)$ is the variance at the test point $X_*$.
Equation (3) is the key equation for making predictions. An inspection of Equation (3) shows that it requires the computation of an inverse. For a training dataset of size n, computation of the inverse has $O(n^3)$ time complexity. This is one of the bottlenecks in the application of Gaussian Processes. Calculation of the solution also requires the storage of the matrix $(K(X, X) + \sigma_n^2 I)$, which has a space complexity of $O(n^2)$. This is the other bottleneck associated with Gaussian Processes. The uncertainty associated with our prediction is provided by Equation (4).

The covariance $K(X, X)$ is expressed in terms of a kernel that is appropriate for the modeling task. The kernel is associated with a set of hyper-parameters. These may be known, for example if the modeling task has been well studied, or unknown if the problem has not been well studied. In this work, we treat the kernel parameters as unknown. When the kernel parameters are unknown, they are estimated using maximum likelihood estimation. The marginal log likelihood is given by (see [15]):

$\log p(y \mid X) = -\frac{1}{2}\, y^{T} (K + \sigma_n^2 I)^{-1} y - \frac{1}{2} \log \lvert K + \sigma_n^2 I \rvert - \frac{n}{2} \log 2\pi$.   (5)

Using an appropriate optimization technique with Equation (5) as the objective function, the hyper-parameters of the kernel (K) can be determined.
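To make the computational cost in Equations (3)–(5) concrete, the following NumPy sketch computes the predictive mean and variance and the marginal log likelihood for an RBF kernel. It is an illustrative sketch, not the implementation used in this work (the experiments reported here use GPy); the RBF kernel, the fixed noise level and the helper names are our assumptions.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=0.1):
    """Exact GP prediction (Equations (3) and (4)) and the marginal log
    likelihood (Equation (5)). Factorizing the n x n matrix K + sigma_n^2 I
    is the O(n^3) bottleneck discussed in the text."""
    n = X.shape[0]
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(n)
    L = np.linalg.cholesky(K)                          # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = rbf_kernel(X_star, X)
    mean = K_s @ alpha                                 # Equation (3)
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(X_star, X_star) - v.T @ v         # Equation (4)
    log_lik = (-0.5 * y @ alpha
               - np.log(np.diag(L)).sum()
               - 0.5 * n * np.log(2 * np.pi))          # Equation (5)
    return mean, np.diag(var), log_lik

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(50)
mu, s2, ll = gp_predict(X, y, np.linspace(-3, 3, 5)[:, None])

In practice the hyper-parameters (lengthscale, variance, noise) would be chosen by maximizing the log likelihood returned above, which is exactly what makes repeated $O(n^3)$ factorizations expensive for large n.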
3. Proposed solution

Since GP regression is both effective and versatile on complex datasets in comparison to parametric methods, a solution to the bottlenecks mentioned above will enable us to apply GP regression to large datasets that are complex. We run into such datasets routinely in this age of big data. Our solution to applying Gaussian Process regression to large datasets is based on developing estimators on smaller subsets of the data. The size of the subset and the desired accuracy are the key parameters for the proposed solution. The accuracy is specified in terms of an acceptable error threshold, ε. We pick K smaller subsets of size $N_s$ from the original dataset (of size N) in a manner similar to bootstrap aggregation. Sampling is done with replacement to ensure that the models developed on the samples are independent. This has benefits in the analysis of the estimates from each of the models: we can treat them as independent. Sampling with replacement also permits us to assume that each data instance has the same probability of making it into the subset used to develop a model. We develop a Gaussian Process estimator on each subset. The GP fit procedure includes hyper-parameter selection using the likelihood as the objective function. We want a subset size such that, when we combine the estimators developed on the K subsets, we obtain an estimator that yields an acceptable prediction error. The rationale for this approach is based on results from minimax theory for non-parametric regression that are presented later in this section. To combine estimators, we simply average the predictions from each of the K estimators. The time complexity of fitting a Gaussian Process regression on the entire dataset of size N is $O(N^3)$; the algorithm presented above instead fits K Gaussian Process regression models to datasets of the smaller size $N_s$, so its time complexity is $O(K \cdot N_s^3)$. We present two methods to determine the subset size:

1. Estimating the subset size using statistical inference. For a dataset of size N, the subset size is expressed as:

$N_s = N^{\delta}$, where $0 < \delta < 1$.   (6)

Here δ is a random variable and is determined based on inference of a proportion using a small sample. The details of this method are presented in the next subsection.

2. Estimating the subset size using an empirical estimator. We use the following observations to derive this empirical estimator:
(a) The subset size $N_s$ should be proportional to the size of the dataset, N: $N_s \propto N$. We posited that as the size of the dataset increases, there are possibly more details or variations to account for in the model; therefore larger datasets would require larger sample sizes.
(b) The sample size $N_s$ should be a decreasing function of the desired error rate ε. This means that decreasing the desired error rate should increase the sample size: $N_s \propto \frac{1}{g(\epsilon)}$, where $g(\epsilon)$ is an increasing function.

Application of the above observations yields the following estimator for the sample size:

$N_s = \frac{N^{\delta(N)}}{g(\epsilon)}$.   (7)

Here:
• δ(N) is a function that characterizes the fact that an increase in N should increase the sample size $N_s$.
• g(ε) is a function that characterizes the fact that the sample size should increase as ε (the desired error level) decreases.

The algorithm is summarized in Algorithm 1.

Algorithm 1: Gaussian Process Regression Using Resampling.
input : A dataset D of size N, δ, K
output: An estimator f that combines the estimators fitted from resampling
for i ← 1 to K do
    /* Select a sample from D. Two ways of selecting the sample size are presented. Sampling with replacement is used so that the samples used to develop each of the K models are independent. */
    N_s ← SampleWithReplacement(D, δ);
    /* A kernel is fit for each sample. Hyper-parameter selection is done for each sample. This computation can be parallelized. */
    f̂_i ← FitGP(N_s);
end
/* The estimate for a point x ∈ D_test (the test dataset) is the average of the estimates from the K estimators fitted above. */
f_resampled(x) ← (1/K) Σ_{i=1}^{K} f̂_i(x)

An important observation from the experiments conducted for this study with regard to the choice of δ is as follows. The choice of δ affects the complexity of the functions used in modeling the response. To model complex responses, we need complex functions from the K models used in Algorithm 1. This is obtained by using a sufficiently large δ. The choice of δ is the primary requirement for developing a good model. Experiments showed (section 5.4) that increasing K is beneficial up to a point – about 30 for most datasets used in this study. Increasing K beyond this point yielded very little increase in performance.
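A minimal Python sketch of Algorithm 1 is given below, using scikit-learn's GaussianProcessRegressor for the per-subset fits (the experiments in this paper use GPy); the RBF-plus-noise kernel, the default parameter values and the function names are our illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_resampled_gp(X, y, delta=0.5, K=30, random_state=0):
    """Algorithm 1 sketch: fit K GPs on subsets of size N_s = N**delta
    drawn with replacement, re-optimizing hyper-parameters on each subset."""
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    N_s = max(2, int(N ** delta))
    models = []
    for _ in range(K):
        idx = rng.integers(0, N, size=N_s)          # sample with replacement
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X[idx], y[idx])                      # marginal-likelihood hyper-parameter fit
        models.append(gp)
    return models

def predict_resampled_gp(models, X_test):
    """Average the K predictive means; the variance of the average of
    K independent Gaussian predictions carries a 1/K^2 factor."""
    means, stds = zip(*[m.predict(X_test, return_std=True) for m in models])
    mean = np.mean(means, axis=0)
    var = (np.array(stds) ** 2).sum(axis=0) / len(models) ** 2
    return mean, var

With K around 30 and $N_s = N^{\delta}$ typically in the hundreds, each fit is cheap and the loop over subsets parallelizes trivially, which is the source of the $O(K \cdot N_s^3)$ cost quoted above.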
The proposed algorithm is based on model averaging as described in [16, Chapter 14]. This is similar to combining estimates from regression trees in the Random Forest [17] algorithm. Given i = {1, 2, ..., K} models, the conditional distribution of the response at a point $X_j$ is obtained from:

$p(Y \mid X_j) = \sum_{i=1}^{K} p(Y \mid i, X_j)\, p(i)$.   (8)

If each of the models is equally probable, then $p(i) = \frac{1}{K}$. If each $p(Y \mid i, X_j)$ is a Gaussian $\mathcal{N}(\mu_i, \Sigma_i)$, as would be the case when each of the estimators is a GP based on Equation (1), then Equation (8) is a linear combination of Gaussian random variables. We can then represent Equation (8) as:

$p(Y \mid X_j) \sim \mathcal{N}\left(\mu_C(X_j), \sigma_C^2(X_j)\right)$,   (9)

where:

$\mu_C(X_j) = \frac{1}{K} \sum_{i=1}^{K} \mu_i(X_j)$,   (10)

$\sigma_C^2(X_j) = \frac{1}{K^2} \sum_{i=1}^{K} \sigma_i^2(X_j)$.   (11)

(To see why the variance in Equation (11) has the factor $\frac{1}{K^2}$, recall that for independent random variables $X_1, X_2, \ldots, X_K$, the variance of a linear combination is $\mathbb{V}[\sum_{i=1}^{K} b_i X_i] = b_1^2 \mathbb{V}[X_1] + \ldots + b_K^2 \mathbb{V}[X_K] = \sum_{i=1}^{K} b_i^2 \mathbb{V}[X_i]$. In this case the $X_i$ are Gaussian random variables and $b_i = \frac{1}{K}$.)

Model combination is an alternative approach to combining estimates from a set of models. The product of experts model is an example of this approach. In the product of experts approach, each model is considered to be an independent expert. The conditional distribution of the response at a point $X_j$ is obtained from:

$p(Y \mid X_j) = \frac{1}{Z} \prod_{i=1}^{K} p_i(Y \mid X_j)$,   (12)

where Z is the normalization constant. [18] report a study where the individual experts are Gaussian Processes. In this case, each $p_i(Y \mid X_j)$ is $\mathcal{N}(\mu_i(X_j), \sigma_i^2(X_j))$. The mean and variance associated with $p(Y \mid X_j)$ are:

$\mu_{POE}(X_j) = \sigma_{POE}^2(X_j) \sum_{i=1}^{K} \mu_i(X_j)\, T_i(X_j)$,   (13)

$T_{POE}(X_j) = \sum_{i=1}^{K} T_i(X_j)$,   (14)

$\sigma_{POE}^2(X_j) = T_{POE}(X_j)^{-1}$.   (15)

Here $T_i(X_j)$ is the precision of the expert i at the point $X_j$. The results from using a product of experts model are also included in this study (see section 6).

Model averaging and model combination (Equation (9) and Equation (13)) are simply ways to combine estimates from the component estimators used in Algorithm 1. They do not specify any information about the character of the estimator developed using Algorithm 1. To do this, we need the following preliminaries:

Assumption 1. The function being modeled is in the Reproducing Kernel Hilbert Space of the kernel (σ) of the estimator. The reproducing property of the kernel can be expressed as:

$f(t) = \langle f(\cdot), \sigma(\cdot, t) \rangle \quad \forall t \in T$.   (16)

Here T represents the index set used to select the predictor instances, and the "$\cdot$" symbol in Equation (16) is a placeholder for the variable of interest.

Algorithm 1 makes use of K estimators, $f_1, f_2, \ldots, f_K$. The reproducing kernels associated with the estimators are $\sigma_1, \sigma_2, \ldots, \sigma_K$. We show that the estimator resulting from combining these K estimators is a Gaussian Process.

Lemma 1. The estimator $f_{rm}$ obtained from the individual estimators $f_1, f_2, \ldots, f_K$ using

$f_{rm}(t) = \frac{1}{K} \sum_{i=1}^{K} f_i(t)$

is associated with the following kernel:

$\sigma_{rm} = \frac{1}{K} \sum_{i=1}^{K} \sigma_i$.   (17)

Proof. From Assumption 1, the kernel associated with each estimator has the reproducing property, so the following holds:

$\langle f(\cdot), \sigma_1(\cdot, t) \rangle = \langle f(\cdot), \sigma_2(\cdot, t) \rangle = \ldots = \langle f(\cdot), \sigma_K(\cdot, t) \rangle = f(t)$.

Consider the kernel $\sigma_{rm} = \frac{1}{K} \sum_{i=1}^{K} \sigma_i$ and the inner product $\langle f(\cdot), \sigma_{rm} \rangle$. The inner product can be written as:

$\langle f(\cdot), \sigma_{rm} \rangle = \frac{1}{K}\left[\langle f(\cdot), \sigma_1(\cdot, t) \rangle + \ldots + \langle f(\cdot), \sigma_K(\cdot, t) \rangle\right] = \frac{1}{K}\left[f(t) + \ldots + f(t)\right] = \frac{K f(t)}{K} = f(t). \quad \Box$

Using Lemma 1, the estimator developed using Algorithm 1 is a Gaussian Process. This Gaussian Process is associated with the kernel defined by Equation (17), a mixture of the kernels used by the individual estimators. The uncertainty estimate at any point X can be obtained using Equation (4). Consistency of estimators is concerned with the asymptotic properties of the developed estimator as we increase the sample size. Consistency of Gaussian Processes has been widely studied; see [1, Chapter 7, Section 7.1] for details.
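As a concrete illustration of Equations (9)–(11) and (12)–(15), the short sketch below combines the predictive means and variances of K experts by model averaging and by the product of experts rule; the function names are ours.

import numpy as np

def combine_model_averaging(means, variances):
    """Equations (9)-(11): average the K predictive means; the variance of
    the average of K independent Gaussians carries a 1/K^2 factor."""
    means, variances = np.asarray(means), np.asarray(variances)
    K = means.shape[0]
    return means.mean(axis=0), variances.sum(axis=0) / K ** 2

def combine_product_of_experts(means, variances):
    """Equations (12)-(15): precision-weighted combination of K experts."""
    means, variances = np.asarray(means), np.asarray(variances)
    precisions = 1.0 / variances                          # T_i(X_j)
    T_poe = precisions.sum(axis=0)                        # Equation (14)
    var_poe = 1.0 / T_poe                                 # Equation (15)
    mu_poe = var_poe * (precisions * means).sum(axis=0)   # Equation (13)
    return mu_poe, var_poe

# Example: 3 experts, predictions at 2 test points.
mu = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
s2 = [[0.5, 0.4], [0.3, 0.6], [0.4, 0.5]]
print(combine_model_averaging(mu, s2))
print(combine_product_of_experts(mu, s2))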
The convergence rate of the algorithm is a characteristic of great interest. It tells us the rate at which the error drops as we increase the sample size. Minimax theory provides a framework to assess this (see [19] or [20]). A famous result by [13] states that the minimax rate of convergence for non-parametric regression is $n^{-\frac{\alpha}{2\alpha + d}}$, where α describes the smoothness of the function being modeled and d represents the dimensionality of the feature space. This suggests that the rate of convergence is affected by the curse of dimensionality. However, as noted by [21], the following factors usually mitigate the curse of dimensionality:

1. The data lie in a manifold of low dimensionality even though the dimensionality of the feature space is large. The work of Yang et al. [22] reports a method for such problems.
2. The function being modeled depends only on a small set of predictors. All datasets reported in this study had this characteristic. Feature selection can be performed in Gaussian Processes using Automatic Relevance Determination (see [1, Chapter 5, Section 5.1]).
3. The function being modeled admits an additive structure with univariate components. The minimax rate of convergence for this problem is very attractive (see [3]). For all datasets reported in this study such an additive model yielded good results.

Feature selection is therefore a critical first step; this is discussed in section 6. The additive structure and the dependence on a very small set of predictors suggest that we can get reasonable models with a small subset of the data. This was consistent with the experimental results reported in this work. Simple preprocessing like Principal Component Analysis (PCA) could be used to make features independent; this preprocessing can be limited to the set of highly correlated features. The data for this study came from public repositories and from very diverse application domains, which suggests that datasets with these characteristics are not uncommon. When the data lie in a manifold, Bayesian Manifold Regression may be more appropriate. Zaytsev and Burnaev [23] investigate the minimax interpolation error under certain conditions (known covariance, stationary Gaussian Process). As noted in that recent study, theoretical work on estimating the convergence rate using minimax theory is an active area of research. In this work, we investigated the empirical estimation of the sample size. We describe two methods to determine the sample (subset) size.

3.1. Estimating subset size on the basis of inference of a proportion

Since δ takes values between 0 and 1, it can be interpreted as a proportion. We treat it as a random variable that can be inferred from a small sample. To estimate δ, we do the following. We pick a small sample of the original dataset by simple random sampling. We start with a small value of δ and check if the prediction error with this value of δ is acceptable. If not, we increment δ until we arrive at a δ that yields an acceptable error. This procedure yields the smallest δ value that produces an acceptable error on this sample. Since the size of this dataset is small, the above procedure can be performed quickly. This technique yielded reliable estimates of δ on both synthetic and real-world datasets.
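A minimal sketch of the procedure in section 3.1, assuming a fit/predict pair like the one sketched after Algorithm 1 and a user-supplied error threshold; the grid of δ values and the pilot-sample size are illustrative choices, not values prescribed in this paper.

import numpy as np

def estimate_delta(X, y, fit_fn, predict_fn, error_threshold,
                   pilot_size=2000, deltas=np.arange(0.3, 0.95, 0.05),
                   random_state=0):
    """Section 3.1: on a small pilot sample, increase delta until the
    held-out RMSE falls below the acceptable error threshold."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(X.shape[0], size=min(pilot_size, X.shape[0]), replace=False)
    split = int(0.7 * idx.size)
    train, test = idx[:split], idx[split:]
    for delta in deltas:
        models = fit_fn(X[train], y[train], delta=delta)
        pred, _ = predict_fn(models, X[test])
        rmse = np.sqrt(np.mean((pred - y[test]) ** 2))
        if rmse <= error_threshold:
            return delta, rmse              # smallest acceptable delta
    return deltas[-1], rmse                 # fall back to the largest delta tried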
3.2. Empirical estimation of the subset size

Equation (7) provides the functional form for the second empirical estimator of the subset size $N_s$. Appropriate choices for δ(N) and g(ε) are based on observations from the experimental evaluation of the parameters of Algorithm 1. Choices that provided good results are given in section 5.2.

In this work, we considered random sub-sampling. Refining this sample sequentially is an area of future work. One strategy for refinement could be to choose the first sample in a random manner and then choose the subsequent samples based on the errors observed on a validation set. In each subsequent iteration, we can adjust the sampling (by weights) so that we sample more in the regions where the previous estimator made errors. Another possible approach is a hybrid one: rather than picking the first sample in a random manner, it is picked based on some optimality criterion, such as what is done in the Sparse Gaussian Process, and this sample is then refined sequentially as discussed above.

Time series datasets require special attention. Using temporal features is one strategy to handle temporal correlation; this approach is common and was used in this work. Another strategy is to adapt the sampling process to ensure that the autocorrelation component is captured. Such a strategy would involve sampling in blocks: rather than picking a single element during sampling, we pick a block of contiguous data elements. The block size should be such that it captures the autocorrelation present. For example, if exploratory data analysis reveals that each instance could be correlated with the five instances before it (lag = 5), then we pick a block size that is larger than this window during sampling (see the sketch below). Another strategy to consider for using Gaussian Processes with time series data is a model like the Kalman Filter [24]. In such an approach, a time window is associated with the forecasting procedure. A Gaussian Process is used to model the response within a window and to forecast the next window. The posterior mean of the previous window is used as the prior for the Gaussian Process used to model the next window. See [25] for an application of this technique in the financial domain to predict bond yield curves. When the window sizes are big, we could use the approach presented in this work to model the regression function in each window. We would need to ensure that the sample size selected captures the autocorrelation in any of the windows, using a block sampling strategy as mentioned earlier.
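The block sampling strategy described above can be sketched as follows; the block length and the helper name are illustrative assumptions, not part of this paper.

import numpy as np

def sample_blocks(n, subset_size, block_len=10, random_state=0):
    """Draw a subset of row indices for a time-ordered dataset by sampling
    contiguous blocks (block_len should exceed the autocorrelation lag)."""
    rng = np.random.default_rng(random_state)
    n_blocks = max(1, subset_size // block_len)
    starts = rng.integers(0, n - block_len, size=n_blocks)   # with replacement
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return idx[:subset_size]

# Example: pick about 200 rows from a series of length 100000 in blocks of 10.
indices = sample_blocks(100000, subset_size=200, block_len=10)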
4. Related work

Rasmussen and Williams [15, Chapter 8] provide a detailed discussion of the approaches used to apply Gaussian Process regression to large datasets. The study of Quiñonero-Candela et al. [26] is another detailed review of the various approaches to applying Gaussian Processes to large datasets. The choice of a method appropriate for a regression task depends on the problem context, so we discuss the work related to scaling Gaussian Process regression taking the problem context into consideration. The problem context refers to relevant background information from the manifestation of the problem. This includes, but is not limited to, the following:

• Knowledge of the model parameters: Do we know the kernel types and the range of hyper-parameter values from prior experience?
• Properties of the input space: Does the input come from a particular distribution? Does it manifest in a very regular manner over a grid?
• Properties of the required covariance: Is an isotropic covariance a reasonable choice for the problem?
• Frequency of model development: For Internet facing applications, if the application is deployed in batch mode rather than in an online manner, there is a need for model iteration when a new batch of data becomes available. Frequent model iteration may impose constraints on the complexity and on the time available to fine tune hyper-parameters and optimization parameters.
• Properties of the output: Do we need prediction at a large number of points or at a small number of points?

Specific problem characteristics determine the appropriate choice of a method used to scale Gaussian Process regression. The following discussion provides examples of these characteristics and the appropriate methods to scale Gaussian Process regression in each case. If the data associated with the problem have been well studied and kernel methods have been successfully applied to the problem, then we may have reasonable insights into the nature of the kernel appropriate for the regression task, and we may be able to arrive at the kernel hyper-parameters quickly from a small set of experiments. On the other hand, if the problem and data are new, then we may not have much information about the kernel. In general, scaling Gaussian Process regression to large datasets has two challenges:

1. Finding a kernel that has good generalization properties for the dataset.
2. Overcoming the computational hurdles – $O(N^3)$ for training and $O(N^2)$ for storage.

Learning a kernel that has good generalization properties is a related area of research in Gaussian Processes (see [27], [28]). When a good kernel representation has been learned, there are many techniques to overcome the computational hurdles. The Nystrom method to approximate the Gram matrix ([29]) and the Random Kitchen Sinks ([30]) are probably the most well known. The Random Kitchen Sinks approach maps the data into a low dimensional feature space and learns a linear estimator in this space (see the sketch below). It should be noted that these methods work well when the problem needs a stationary kernel for which we know the hyper-parameters. Using "sensible defaults" for hyper-parameters and applying these techniques to problems that require a non-stationary kernel may yield poor results. For example, with the airline dataset, described later in this study (see section 5.1), the Random Kitchen Sinks cannot be used directly and would require suitable preprocessing (like removing simple trends or using a mean function) so that a stationary kernel would be applicable. Using the Random Kitchen Sinks directly with no preprocessing and the default kernel choices provided with the scikit-learn [31] implementation yielded poor results (RMSE of 31.49 as opposed to 8.75 with the proposed method).

Using a kernel learning approach to determine a good kernel representation and then solving the computational hurdles independently is one way to approach scaling Gaussian Process regression to large datasets. Another approach to determining the appropriate kernel is to use exploratory data analysis. Guidelines to pick kernels based on exploratory analysis of the data are provided in [12]. This is a practical approach when the number of relevant features is not too large, as was the case with the datasets used in this study. It should be noted that hyper-parameters for these choices still need to be specified. We may be able to build additive models using this approach.
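For reference, the Random Kitchen Sinks idea mentioned above can be sketched with scikit-learn's RBFSampler followed by a linear model; the hyper-parameter values here are arbitrary illustrative defaults, not the settings used in the comparison reported above.

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Random Kitchen Sinks: random Fourier features approximating an RBF kernel,
# followed by a linear (ridge) estimator in the transformed feature space.
rks_model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=200, random_state=0),
    Ridge(alpha=1.0),
)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10000, 2))
y = np.sinc(X[:, 0]) + 0.1 * rng.standard_normal(10000)
rks_model.fit(X, y)
predictions = rks_model.predict(X[:5])

As noted above, this works well when a stationary kernel with known (or easily tuned) hyper-parameters fits the problem; it does not by itself address non-stationary behavior.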
Appropriate preprocessing could help; for example, principal component analysis can be applied to make the features independent. Minimax theory for non-parametric regression indicates that the convergence rate for additive models is very attractive, so we can build effective models with a small proportion of the data. The choice of kernel hyper-parameters is critical and can affect performance. When datasets are large and the kernel hyper-parameters are unknown, we need algorithms that can address both these issues. Ideally, the algorithm should be able to work with both stationary and non-stationary kernels. The proposed algorithm is one such candidate; Sparse Gaussian Processes ([4]) and Stochastic Variational Gaussian Processes ([5]) are two others. Like the proposed algorithm, these algorithms require the specification of an input size. A subset of points is selected from the dataset for training, and the criterion for subset selection is different in each case. These algorithms do not require the specification of the kernel hyper-parameters; these are estimated from the data. Stochastic Variational Gaussian Processes can require considerable manual tuning of the optimization parameters. Typical implementations (like [32]) of the Sparse GP and the Stochastic Variational GP use stochastic gradient descent for hyper-parameter optimization, which explores the entire dataset in batch size increments. Hensman et al. [5] report the details associated with picking the parameters for the optimization task (learning rates, momentum, batch sizes etc.). In contrast, sample sizes with the proposed algorithm, even for datasets with over a million instances, are typically small (on the order of a few hundred instances). Learning hyper-parameters over small datasets is considerably easier; the experiments reported in this work required no tuning effort. We report the performance of the Sparse Gaussian Process, the Stochastic Variational Gaussian Process and the proposed method on a variety of datasets in section 6.

The proposed algorithm uses ideas that have proven to be effective with other machine learning techniques. Bagging has been used to improve performance with regression trees (Random Forests, [17]). As with Random Forests, the algorithm uses model averaging to combine estimates from the component Gaussian Process regressions. Dropout ([33]) is a technique used in neural networks to prevent overfitting: dropping random units achieves regularization. In the proposed algorithm, selecting a sample can be viewed as dropping random instances from the training dataset. Sparse Gaussian Processes and Stochastic Variational Gaussian Processes use theoretical ideas to select a small subset of points to develop a Gaussian Process regression model. This study suggests that, combined with model averaging, random selection of the subset can also work well.

In some areas, like spatial statistics or engineering design, the input data points lie on a grid at regularly spaced intervals. Belyaev et al. [34] discuss an approach to scaling Gaussian Process regression when the problem has this structure. The covariance matrix in these problems has a specific structure that permits an efficient computational approach. This approach also uses a prior on the hyper-parameters to deal with anisotropy, so the implementation is quite complex. Using a divide and conquer strategy is another theme in scaling Gaussian Process regression to large datasets.
The Bayesian Committee Machine (BCM) ([35]) is an idea related to the algorithm presented in this work. The BCM proposes a partition of the dataset into M parts, and an estimator is developed on each partition. The BCM does not choose a subset of the partition; it uses the entire partition to develop the estimator. This is the key difference between the method proposed in this work and the BCM. The estimates from each estimator are assumed to be independent. The BCM assumes that computing a GP solution on the partitions of the dataset is computationally tractable because the partition sizes are small. Datasets encountered today are much larger than those reported in [35]. In present-day datasets, the partitions based on the guidelines provided in [35] would be very big, and computing a full Gaussian Process solution on them may not be computationally tractable. Even when the partition size is not big enough to create computational hurdles, using all the data may result in overfitting. Using a hierarchical model as in [36] or [37] is a possible way to work around the size of the partitions; however, this requires a complex implementation to partition and recombine computations.

The Locally Approximate Gaussian Process [38] fits a local Gaussian Process for a prediction point using a local neighborhood and a local isotropy assumption. This method too requires some tuning: the size of the neighborhood and the method used to choose the neighbors are important parameters. For datasets where the local isotropy assumption works well and when the size of the test set is small, this method might be useful. When prediction is required at a large number of test points, this method might be slow if we use a large neighborhood. For example, with the airline dataset described in section 5.1, using the defaults provided with the laGP ([39]) package did not yield good results (RMSE of 24.89 as opposed to 8.75 with the proposed method). The running time for laGP was also considerably longer: scoring the test set in batches of 15000 rows, the test set prediction took about 50 minutes for the airline delay dataset, whereas the proposed algorithm builds a single model that is used to score the entire test set and completed in about 6.5 minutes.

In summary, there are several ways to scale Gaussian Process regression to large datasets. The choice of a particular method should be guided by the characteristics of the problem. The method proposed in this work is appropriate for large datasets that have a small number of important features and for which the kernel characteristics are unknown. In such cases, exploratory data analysis can be used to determine appropriate kernel types. Additive models may work well for such datasets. Preprocessing such as principal component analysis can be used if needed to make features independent (so that we can use additive models). Stochastic Variational Gaussian Processes and Sparse Gaussian Processes are also good candidates for such problems; kernel hyper-parameters are learned from the data by these methods. Results of the experiments conducted as part of this study show that the proposed method can match or exceed the performance of the Sparse Gaussian Process and the Stochastic Variational Gaussian Process.

5. Effect of the parameters

Selection of algorithm parameters appropriate for a machine learning task is an arduous task for all practitioners.
To alleviate this difficulty, we provide guidelines for parameter selection based on detailed experimentation. The proposed algorithm has three parameters:

1. The dataset size
2. The subset size
3. The number of estimators

Accordingly, three sets of experiments were performed to capture the effect of each of these parameters on the performance of the algorithm. These experiments are described in this section.

5.1. Datasets

The following datasets were used in this study:

1. Combined Cycle Power Plant: This dataset was obtained from the UCI Machine Learning repository [40]. It has 9568 instances. The target variable is the net hourly electrical power output from a power plant. The dataset has four features.
2. Ailerons: This dataset was obtained from the LIAD (Laboratory of Artificial Intelligence and Decision) repository [41]. The target variable for this dataset is the control action associated with the control of an F-16 aircraft. The dataset has 40 features and 7154 instances.
3. Elevators: This dataset was obtained from the LIAD repository [41]. This dataset is also related to the control of an F-16 aircraft. The target for this dataset is the control action variation for the elevators of the aircraft. The dataset has 6 features and 9517 instances.
4. California Housing: This dataset was obtained from the LIAD repository [41]. The target variable for this dataset is the median house price. The dataset has 8 features and 20460 instances.
5. Individual Household Electric Power Consumption: This dataset was obtained from the UCI Machine Learning repository [40]. It captures the electric power consumption in a single household over a four year period at a one-minute sampling rate. For the experiments in this study, the Voltage was treated as the target variable. Seasonality and periodicity are important characteristics of this dataset (identified during exploratory data analysis). For this reason, minute and hour attributes were created out of the timestamp attribute, and day of week and day of month attributes were created out of the date attribute (see the sketch after this list). This dataset has over 2 million instances and 12 features.
6. Airline Delay: This dataset was obtained from the US Department of Transportation's website [42]. The data represent arrival delays for US domestic flights during January and February of 2016. This dataset had 12 features and over two hundred and fifty thousand instances. Departure delay is included as one of the predictors, whereas [5] does not include it. Also, the raw data include a significant amount of early arrivals (negative delays). For all regression methods considered in this study, better models were obtained by limiting the data to the delayed flights only (arrival delays greater than zero). This suggests that arrival delays and early arrivals are better modeled separately.
7. The Sinc Function: This is a one dimensional synthetic dataset where the response variable is the sine cardinal (sinc) function, noise free. The sinc function is a complex function to learn and is, therefore, a candidate for this as well as many other machine learning research studies. The dataset had one hundred thousand instances.
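The temporal features described for the household power consumption dataset (item 5) can be derived as in the sketch below; the column names are assumptions about the raw UCI file layout, not necessarily the exact names used in this study.

import pandas as pd

def add_temporal_features(df):
    """Derive minute/hour features from the timestamp and day-of-week /
    day-of-month features from the date, as described for the household
    power consumption dataset."""
    ts = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
    return df.assign(
        minute=ts.dt.minute,
        hour=ts.dt.hour,
        day_of_week=ts.dt.dayofweek,
        day_of_month=ts.dt.day,
    )

# Example with a semicolon-separated file containing Date and Time columns.
# df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?")
# df = add_temporal_features(df)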
5.2. Effect of dataset size

These experiments study the effect of the size of the dataset (N) on the subset size (δ). For each dataset, we pick a fraction of the data elements and determine the subset size ($N^{\delta}$) required to achieve a target accuracy. The target accuracy is an input parameter for the experiment. We repeat this procedure for various settings of the fraction of the dataset selected (0.1 through 1). The number of estimators for these experiments was maintained at 30; the rationale for this choice is provided in section 5.4. The results are shown in Fig. 1 through 7.

[Fig. 1. Delta Elevators. Fig. 2. California Housing. Fig. 3. Airline Delay. Fig. 4. Combined Cycle Power Plant. Fig. 5. House Hold Power Consumption. Fig. 6. Ailerons. Fig. 7. Sinc Function (Synthetic Data).]

The key observation from these experiments was that the δ required to maintain a preset accuracy decreases very slowly as N increases. This set of experiments was used to identify candidate choices for the parameters of Equation (7). Since we wanted a function δ(N) that decreases slowly as N increases, we considered $\{\frac{1}{\sqrt{N}}, \frac{1}{\log(N)}, \frac{1}{\log(\log(N))}\}$. These are slowly decreasing functions of N, in decreasing order of slowness. After a rigorous empirical study we found that the choice $\delta(N) = \frac{1}{\log(\log(N))}$ works well on all the real world datasets and on the synthetic data. As discussed in section 3, the time complexity of the proposed algorithm is $O(K \cdot N_s^3)$, or $O(K \cdot N^{3\delta(N)})$. The exponent of N decreases monotonically as N increases: writing N as $2^{2^x}$, the time complexity of the GP computation is $O(K \cdot N^{\frac{3}{x}})$. The number of estimators K, determined empirically, is a constant (about 30) for all experiments; the rationale for this choice is explained in section 5.4. So when N is large enough (such that x > 3), the GP computation is sub-linear in N; for example, when $N = 2^{2^4}$, the time complexity is $O(K \cdot N^{\frac{3}{4}})$. g(ε) is a monotonically increasing function that characterizes the fact that the sample size should increase as the acceptable error threshold ε decreases (e.g. $g(\epsilon) = \sqrt{\epsilon}, \epsilon, \epsilon^{\frac{1}{10}}, \ldots$). An optimal choice of g(ε) is an area of future work. For the experiments reported in this work, $g(\epsilon) = C\epsilon^{\frac{1}{10}}$, with C = 1 for low noise datasets (RMSE << 1) and C = 0.5 for noisy datasets (RMSE > 1), worked well.

5.3. Effect of subset size

These experiments explore the effect of the subset size, captured by the parameter δ, on the accuracy (ε), for a fixed dataset size (N). For each dataset, a fraction of the dataset is picked for the experiment; this represents the data for the experiment (N). We pick N such that $N^{1.0}$ (i.e., δ = 1.0) is computationally tractable (about 2000). The subset $N_s$ used for Gaussian Process model development was $N^{\delta}$. We fix the number of estimators at 30 (see section 5.4 for the rationale). For each δ value in a range of values, we record the RMSE (ε). The key insight from these experiments was that complex functions, like the Sinc function (see Fig. 14), needed a larger subset size to produce a given level of accuracy. Note that we can expect the δ to drop with an increase in dataset size because of the effect reported in section 5.2. Section 6.3 provides the δ values associated with the full datasets. The results for these experiments are shown in Fig. 8 through 14.

[Fig. 8. Delta Elevators. Fig. 9. California Housing. Fig. 10. Airline Delay. Fig. 11. Combined Cycle Power Plant. Fig. 12. House Hold Power Consumption. Fig. 13. Ailerons. Fig. 14. Sinc Function (Synthetic Data).]
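The empirical subset-size rule from section 5.2 ($\delta(N) = \frac{1}{\log(\log N)}$ with $g(\epsilon) = C\epsilon^{\frac{1}{10}}$) can be written as a small helper; the use of natural logarithms and the rounding are our assumptions.

import math

def empirical_subset_size(N, epsilon, C=1.0):
    """Equation (7) with delta(N) = 1/log(log(N)) and g(eps) = C * eps**0.1
    (section 5.2). Natural logarithms are assumed here."""
    delta = 1.0 / math.log(math.log(N))
    g = C * epsilon ** 0.1
    return max(2, int(round(N ** delta / g)))

# Example: a dataset with one million rows and a target RMSE of 0.5
# (a "noisy" dataset, so C = 0.5) gives a subset of a few hundred rows.
print(empirical_subset_size(1_000_000, epsilon=0.5, C=0.5))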
5.4. Effect of number of estimators

These experiments capture the effect of the number of estimators (K) on the accuracy of the algorithm (ε), for a particular set of δ and N values. For each dataset, the following experiment was performed. The fraction of the dataset to use for the experiment (N) and the subset size (δ) are selected. For each K in a range of values, Algorithm 1 is applied and the error (ε) is recorded. The key insight from this set of experiments was that, for a given subset size ($N_s$), there is a point up to which increasing the number of estimators (K) drops the error; once a threshold value is reached (around 30, for all the datasets), there is little benefit in increasing the number of estimators further. A plausible explanation for this behavior is that increasing the number of estimators reduces the variance component of the error in the bias-variance decomposition of the error (ε). The results for these experiments are shown in Fig. 15 through 21.

[Fig. 15. Delta Elevators. Fig. 16. California Housing. Fig. 17. Airline Delay. Fig. 18. Combined Cycle Power Plant. Fig. 19. House Hold Power Consumption. Fig. 20. Ailerons. Fig. 21. Sinc Function (Synthetic Data).]

6. Application of the algorithm

In this section we describe the results of applying the algorithm reported in this work to the datasets described in section 5.1.

6.1. Independent performance assessments

It may be of interest to see how the estimates from Gaussian Process regression compare to estimates from other methods for large regression tasks. We report the performance of two such methods. XGBoost [6] is a recent tree based algorithm using the gradient boosting framework that is very scalable and is gaining adoption among practitioners. Trees partition the predictor space and can account for interaction. This method uses boosting instead of bagging. The difference in accuracy estimates between XGBoost and the proposed method for a particular dataset could be attributed to the influence of boosting or to effects of interaction among variables. In most cases, the estimates from XGBoost were comparable to the estimates from the proposed method.
As discussed in section 6.4, it is possible to combine estimates from XGBoost with the estimates from the proposed method using stacking. The Generalized Additive Model (GAM) [7] is a scalable regression technique. As the name suggests, it fits an additive model in terms of smooth non-parametric functions (typically splines) of the predictor variables. The GAM estimate serves as an independent performance estimate from another non-parametric regression algorithm based on an additive model. In most cases, the accuracy obtained from GAMs was similar to that obtained from Gaussian Process regression.

6.2. Feature relevance

As discussed in section 3, an additive model worked well for the datasets used in this study. Further, in all these datasets, the response depended only on a small number of attributes. XGBoost can report feature importance; this is summarized in Table 1. The response depends on a small subset of predictors in all the datasets reported in this study. GPy [32], the package used to implement the experiments reported in this work, implements Automatic Relevance Determination (see [1, Chapter 5]). Features that are not relevant are dropped from the model.

Table 1
Feature Importance.

Dataset              Features   Important Features
Airline Delay        11         1
Ailerons             40         3
Power Plant          4          2
Delta Elevators      6          2
California Housing   8          3
House Hold Power     12         4 (a)

(a) Timestamp is counted as a feature. Variables derived from the timestamp are not included in this count.

6.3. Predictive performance

As discussed in section 4, the choice of a GP method to use for a regression task depends on the problem context. The algorithm presented in this work does not require the hyper-parameter values to be specified. In terms of its inputs, the algorithm presented in this work is similar to the Sparse Gaussian Process and the Stochastic Variational Gaussian Process: these algorithms only require the dataset, the kernel and the subset size as input. We report the performance of the Sparse Gaussian Process, the Stochastic Variational Gaussian Process and the proposed algorithm for each dataset. These algorithms are good choices to begin the knowledge discovery process on large datasets. The results of applying Algorithm 1 to all the datasets listed in section 5.1 are shown in Table 2. For each dataset, Table 2 includes the following columns:

1. BM1 (Bagging Method 1): the result of applying Algorithm 1 using the Inference of a Proportion method to select the subset.
2. BM2 (Bagging Method 2): the result of applying Algorithm 1 using the Empirical Estimator method to select the subset.
3. POE: the result of applying the Product of Experts combination (Equation (13)) instead of bagging. The subset size used is the same as that for BM1.
4. SVGP: the result of applying the Stochastic Variational Gaussian Process algorithm.
5. SPGP: the result of applying the Sparse Gaussian Process algorithm.
6. XGBoost: the result obtained from the XGBoost algorithm.
7. GAM: the result obtained from GAM.
8. SD: the standard deviation of the response. If we use a model that simply predicts the mean response for each test point, then the standard deviation represents the error associated with such a model. For regression models to be useful, they must perform better than this model.

Table 2
RMSE.

Dataset           BM1        BM2        POE        SVGP       SPGP       XGBoost    GAM        SD
Ailerons          0.000192   0.000189   0.0001893  0.000207   0.000192   0.00017    0.00020    0.00041
Delta Elevators   0.00146    0.00147    0.001453   0.00144    0.00146    0.00143    0.00144    0.00237
CCPP              4.19       3.99       3.97       5.24       3.93       3.73       4.11       17.07
Cal Housing       0.293      0.294      0.293      NA         NA         0.240      0.281      0.569
Airline Delay     8.75       8.75       8.74       8.85       10.94      8.74       9.45       31.31
HPC               2.039      1.992      2.120      NA         NA         1.63       2.18       3.24
Sinc Function     0.0267     0.0200     0.0132     0.0627     0.0170     NA         NA         0.0634
For all datasets, the split between the training and the test sets was 70%–30%. The reported RMSE values are on the test set. The subset sizes required for each dataset are captured by the δ parameter. The δ values associated with Table 2 are presented in Table 3. The δ values for the bagging methods and POE correspond to the most accurate estimates we could obtain for the datasets on a laptop with 16 GB of RAM. The SVGP and SPGP columns of Table 3 represent the δ at which the best performance was obtained for the Stochastic Variational GP and the Sparse GP methods. For the airline dataset, the best performance for the Stochastic Variational GP was obtained at a smaller subset size than the one used for the proposed method, and the Sparse GP showed no improvement in performance for δ greater than 0.33.

Table 3
δ Requirements for the Datasets.

Dataset           BM1     BM2     POE     SPGP    SVGP
Ailerons          0.75    0.65    0.30    0.75    0.30
Delta Elevators   0.625   0.55    0.625   0.625   0.30
CCPP              0.6     0.6     0.6     0.6     0.3
Cal Housing       0.625   0.625   0.625   NA      NA
Airline Delay     0.56    0.49    0.56    0.33    0.523
HPC               0.43    0.38    0.43    NA      NA
Sinc Function     0.50    0.50    0.50    0.34    0.36

The kernels used for the datasets are shown in Table 4.

Table 4
Kernels for the Datasets.

Dataset              Kernel
Ailerons             Sum of Linear and RBF
Delta Elevators      Sum of Linear and RBF
CCPP                 Sum of Linear and RBF
California Housing   Sum of Periodic Matern32, Linear, RBF and a product kernel of Linear and RBF on the medianIncome attribute
Airline Delay        Sum of RBF, Linear and White Noise
HPC                  Sum of Bias, Cosine, RBF, Linear and Brownian Motion
Sinc Function        RBF

The experiments reported in this work used [32] for implementation. The Sparse GP implementation in this package is based on the Variational Sparse GP method (see [4]). The Stochastic Variational GP implementation is the one used in [5]. Both these methods use stochastic gradient descent for hyper-parameter learning. Kernel support for these methods is also limited to simple kernels; kernels like the Brownian Motion kernel or the Periodic Matern kernel are not supported. For this reason, we do not report accuracy for the SVGP or the Sparse GP on datasets that required these complex kernels (Household Power Consumption and California Housing). This also highlights the fact that Algorithm 1 may be a good candidate for datasets with the desired characteristics described earlier but requiring a complex kernel to model the underlying function.

An analysis of Table 2 shows that the proposed method can perform as well as XGBoost for the Airline Delay, Delta Elevators and Power Plant (CCPP) datasets. Table 5 provides the variance explained by the model for XGBoost and Bagging Method 1. Gaussian Process regression models offer some advantages over the XGBoost model given the similar levels of prediction performance: Gaussian Process models provide variance and probability estimates directly, unlike XGBoost models. This is important in many applications ([8] and [9]). In some cases, like the California Housing and the Household Power Consumption datasets, XGBoost does better. In these cases we explored whether it is possible to construct a better estimator using both these estimators; this is discussed in section 6.4. [43, Chapter 2] discusses ANOVA based approaches that permit different regression methods to be compared. Application of these techniques to compare the methods in Table 2 is an area of future work.

Table 5
Explained Variance.

Dataset              BM1    XGBoost
Airline Delay        0.92   0.92
HPC                  0.61   0.75
Delta Elevators      0.64   0.64
Ailerons             0.77   0.82
CCPP                 0.94   0.95
California Housing   0.73   0.82

In small to moderate-sized datasets, a validation dataset is required to assess the generalization of the model. Since the datasets considered in this study were big, the test sets were also big, and the error observed over a single large test set was similar to the error observed when a validation methodology was used. Algorithm 1 uses a bagging type approach where we average the estimates of the individual models. An error estimate similar to the out of bag error estimate [44] provides a better picture of the generalization performance. The key idea here is that the samples that are not used in developing any of the models used to create the bagged estimator are used to evaluate the performance of the composite model; this becomes the out of bag estimate. We apply this procedure repeatedly to obtain a picture of the generalization of the algorithm. For each iteration, we report the RMSE on the out of bag dataset. Since the datasets are big, this was repeated three times.
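A sketch of this out-of-bag style check is given below; the number of repetitions and the RMSE metric follow the description above, while the scikit-learn GP fit and the kernel choice are our illustrative assumptions (the paper's experiments use GPy).

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def out_of_bag_rmse(X, y, delta, K=30, repeats=3, random_state=0):
    """Repeatedly fit the bagged GP of Algorithm 1 and score the averaged
    prediction on the rows not drawn for any of the K subsets."""
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    N_s = max(2, int(N ** delta))
    scores = []
    for _ in range(repeats):
        used = np.zeros(N, dtype=bool)
        models = []
        for _ in range(K):
            idx = rng.integers(0, N, size=N_s)      # sample with replacement
            used[idx] = True
            gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                          normalize_y=True)
            models.append(gp.fit(X[idx], y[idx]))
        oob = ~used                                  # out of bag rows
        pred = np.mean([m.predict(X[oob]) for m in models], axis=0)
        scores.append(float(np.sqrt(np.mean((pred - y[oob]) ** 2))))
    return scores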
The key idea here is that the samples that are not used in developing any of the models that make up the bagged estimator are used to evaluate the performance of the composite model. This becomes the out of bag estimate. We apply this procedure repeatedly to obtain a picture of the generalization of the algorithm. For each iteration, we report the RMSE on the out of bag dataset. Since the datasets are big, this was repeated three times. The results are shown in Table 6. These results are shown for Bagging Method 1, the Sparse GP, the Stochastic Variational GP, and XGBoost. For the Sparse GP, the Stochastic Variational GP and XGBoost, the training dataset is the dataset used to create the model for Bagging Method 1, and the test dataset is the out of bag dataset. A comparison of these values with those reported in Table 2 shows that they are consistent in most cases. The performance of the Stochastic Variational Gaussian Process was quite variable for the CCPP dataset. This suggests that the optimization parameters for this method may need tuning. This is one of the problems with the SVGP method in practice. Arriving at good hyper-parameters for the optimization procedure is a time consuming and cumbersome process.

Table 6. Out of Bag Prediction Performance.

Dataset             Method     OOB set 1   OOB set 2   OOB set 3   Mean RMSE
HPC                 BM1        2.006400    2.013900    2.008600    2.009600
Airline Delay       BM1        8.644600    8.784300    8.777100    8.735350
Ailerons            BM1        0.0001950   0.0001961   0.0001979   0.0001964
Delta Elevators     BM1        0.001452    0.001466    0.001442    0.001453
CCPP                BM1        4.129000    4.179600    4.190900    4.166500
Sinc Function       BM1        0.037051    0.034210    0.030093    0.033780
California Housing  BM1        0.295650    0.293180    0.294890    0.294570
Sinc Function       Sparse GP  0.018060    0.015330    0.018230    0.017210
Ailerons            Sparse GP  0.000217    0.000204    0.000224    0.000215
Delta Elevators     Sparse GP  0.001453    0.001472    0.001456    0.001460
CCPP                Sparse GP  3.679500    3.892800    3.772700    3.781600
Airline Delay       Sparse GP  10.915200   10.934100   10.862900   10.904000
HPC                 Sparse GP  NA          NA          NA          NA
California Housing  Sparse GP  NA          NA          NA          NA
Sinc Function       SVGP       0.059540    0.061600    0.065490    0.062210
Ailerons            SVGP       0.000199    0.000212    0.000196    0.000202
Delta Elevators     SVGP       0.001441    0.001442    0.001431    0.001438
CCPP                SVGP       6.442400    26.397900   18.143190   16.994500
HPC                 SVGP       NA          NA          NA          NA
Airline Delay       SVGP       11.105500   9.470400    9.167900    9.914600
California Housing  SVGP       NA          NA          NA          NA
Ailerons            XGBoost    0.000167    0.000170    0.000173    0.000170
Delta Elevators     XGBoost    0.001423    0.001451    0.001475    0.001449
CCPP                XGBoost    3.710000    3.620000    3.540000    3.620000
California Housing  XGBoost    0.243694    0.240122    0.241744    0.241800
HPC                 XGBoost    1.800968    1.777669    1.771144    1.783000
Airline Delay       XGBoost    8.733207    8.765240    8.805467    8.767971
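The sketch below illustrates this out of bag style evaluation. It reuses the illustrative fit_experts and combine helpers from the sketch following Table 4 and is an assumption about the bookkeeping rather than the exact code used to produce Table 6: rows that no expert saw form the out of bag set, the composite (bagged) prediction is scored on those rows, and the whole procedure is repeated three times.

```python
import numpy as np

def out_of_bag_rmse(X, y, n_repeats=3, **fit_kwargs):
    """Score the bagged GP on rows that none of its subset models saw; repeat a few times."""
    scores = []
    for rep in range(n_repeats):
        experts, used = fit_experts(X, y, seed=rep, **fit_kwargs)
        oob = ~used                                    # the out of bag rows for this repeat
        pred = combine(experts, X[oob], method="bagging")
        scores.append(float(np.sqrt(np.mean((pred - y[oob]) ** 2))))
    return scores                                      # one RMSE per out of bag set
```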
In section 4 we discussed the Locally Approximate Gaussian Process as a technique that fits a Gaussian Process around a test point. This method fits an isotropic Gaussian kernel to the neighborhood of the test point using the training data. The neighborhood size and the method used to select the neighborhood points are parameters that must be tuned for each dataset. We used the guidelines provided in the documentation for the package [39] to determine test set sizes that could be processed within an hour with this package. This method is not practical for very large test sets. For example, the Household Power Consumption dataset has about two million instances; a 70–30 training–test split for this dataset yields a test set of about six hundred and sixty thousand instances. The testing methodology is similar to the one associated with Table 2. The notion of an out of bag sample does not apply to this case, but a similar approach to testing was used so that the results can be compared. For these experiments, a test set of the size specified in Table 7 was set aside for testing and the remainder was used for prediction. This procedure was repeated three times with different test and training sets. The results of applying the Locally Approximate Gaussian Process are shown in Table 7. For these experiments a neighborhood size of fifty was used. The "Minimize Mean Square Predictive Error" (MSPE) criterion (see [39]) was used to select the neighborhood of the test points.

Table 7. Performance – Locally Approximate Gaussian Process.

Dataset             Test Set Size  Running Time (s)  Holdout set 1  Holdout set 2  Holdout set 3  Mean RMSE
California Housing  6873           142               3.888          0.792          0.735          1.805
Airline Delay       85401          5636              30.008         26.273         28.710         28.330
HPC                 10247          3240              2.897          2.950          3.004          2.951
CCPP                3190           80                6.500          6.032          5.366          5.960
Ailerons            2382           18                0.00026304     0.00028159     0.00026438     0.0002696
Delta Elevators     3169           75                0.001701       0.001757       0.001803       0.0017539
Sinc Function       33300          590               0.066          0.062          0.085          0.071

The running times for Bagging Method 1 (based on the algorithm reported in this work), the Sparse GP and the Stochastic Variational GP are shown in Table 8. As mentioned in the discussion following Table 2, the Sparse GP and the Stochastic Variational GP implementations do not support complex kernels like the Brownian Motion kernel or the Matern kernel that are required by the Household Power Consumption and the California Housing datasets. The implementation of the algorithm reported in this work can be parallelized. The experiments reported in this work were performed with an implementation that is not parallelized. An implementation that parallelizes the bagging step can improve the run-time for Algorithm 1.

Table 8. Running Times for GP Methods (in seconds).

Dataset             BM1   Sparse GP  SVGP
HPC                 1828  NA         NA
Airline Delay       458   76         423
Ailerons            286   364        37
Delta Elevators     22    59         42
CCPP                24    291        36
Sinc Function       50    603        131
California Housing  1876  NA         NA

6.4. Combining estimators

Combining estimators is not a new idea [10]. More recently, [11] combined gradient boosted tree models with Gaussian Process models. This prompted us to explore combining estimators for the datasets where the Gaussian Process model produced a slightly lower accuracy. There were three datasets where the performance of the Gaussian Process model was slightly lower than that of the XGBoost model: the Combined Cycle Power Plant, California Housing, and Household Power Consumption datasets. To combine estimators, the outputs of the XGBoost model and the GP model were used as the inputs to a classifier (the stacking classifier). The response from the classifier was the best model for a given pair of XGBoost and GP model responses. A K-nearest neighbor classifier was used for this purpose. The best value of K was determined through experimentation. Given a test point, the estimates from the XGBoost model and the GP model are obtained. The classifier then predicts the best model for this pair of estimates, and that model is used to provide the estimate for the test point.
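A minimal sketch of this stacking step is given below, using scikit-learn's KNeighborsClassifier. The arrays xgb_pred and gp_pred are assumed to hold the two models' estimates on a held-out stacking set, and the labelling rule (mark whichever model has the smaller absolute error as the "best" model) is an illustrative assumption about how the classifier's target can be constructed, not necessarily the exact construction used in our experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_stacker(xgb_pred, gp_pred, y_true, k=5):
    """Learn which model to trust for a given pair of (XGBoost, GP) estimates."""
    features = np.column_stack([xgb_pred, gp_pred])
    # Label 1 when the GP estimate is closer to the truth, 0 when XGBoost is closer.
    labels = (np.abs(gp_pred - y_true) < np.abs(xgb_pred - y_true)).astype(int)
    stacker = KNeighborsClassifier(n_neighbors=k)
    stacker.fit(features, labels)
    return stacker

def stacked_estimate(stacker, xgb_pred, gp_pred):
    """For each test point, return the estimate of the model the classifier selects."""
    features = np.column_stack([xgb_pred, gp_pred])
    choice = stacker.predict(features)       # 0 -> use XGBoost, 1 -> use the GP
    return np.where(choice == 1, gp_pred, xgb_pred)
```

The classifier can also be replaced by a K-nearest neighbor regression fitted directly on the response, a variant discussed in the next paragraph.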
This procedure improved the accuracy for the California Housing and Combined Cycle Power Plant datasets. A K-nearest neighbor regression performed better than the K-nearest neighbor classifier for the Household Power Consumption dataset. The results are shown in Table 9.

Table 9. RMSE using XGBoost and bagged GP models.

Dataset             RMSE
California Housing  0.211
CCPP                2.943
HPC                 1.610

A comparison of Table 9 with Table 2 shows that combining estimators yields solutions that are more accurate. The idea of combining estimators can be refined in many ways. We have not included the Sparse GP and the Stochastic Variational GP in this solution. The choice of the classifier or regression model used to combine estimates from the component models is another modeling decision. The intent here is to show that it is possible to combine the GP model developed using the method presented in this work with other regression models to produce solutions that are better than the individual solutions. An optimal choice of the estimators and of the method used for stacking is beyond the scope of this work.

7. Conclusion

There are many methods to scale Gaussian Process regression to large datasets. The appropriate choice depends on the problem context. For example, if Gaussian Process regression has been applied to similar datasets, then the kernel types and the kernel hyper-parameters may be known and methods such as the Nystrom method are suitable. If we only seek estimates at a small number of test points, methods such as the Locally Approximate Gaussian Process may be suitable (see section 4 for a detailed discussion). The method proposed in this work is appropriate for large datasets with a small set of important features for which the kernel characteristics are unknown. Kernel choices can be determined through exploratory data analysis [12]. Kernel hyper-parameters can be learned from the data. The data for the experiments reported in this work came from diverse application areas and covered a wide range of sizes (a few thousand to two million instances). In these datasets, the target variable depended on a small set of variables, and an additive model matched the performance of models that permit interactions, such as XGBoost. This suggests that datasets with these characteristics are not uncommon. Results from minimax theory for non-parametric regression indicate that additive models that depend on a small set of predictors have an attractive rate of convergence. This suggests that we can develop adequate regression models with a small subset of samples from the dataset. This was consistent with the results observed in the experiments conducted as part of this study. The Stochastic Variational Gaussian Process and the Sparse Gaussian Process are also good candidates for problems with these characteristics. The results of this study show that the proposed algorithm can match or exceed the performance of the Sparse Gaussian Process or the Stochastic Variational Gaussian Process. The results of this study also show that Gaussian Processes can be as effective as ensemble methods like Gradient Boosted Trees on large datasets. Gaussian Processes are based on a probabilistic framework and can provide variance and probability estimates directly, in contrast to other tools for large regression tasks such as XGBoost. This can be very important for some applications (see [9] or [8]). The regression function is also interpretable in the case of Gaussian Process models, in contrast to methods like gradient boosted trees. Therefore Gaussian Process models can be good explanatory models as well as good predictive models.
An important feature of this algorithm is the simplicity of its implementation. In Internet applications, models need to be developed and deployed frequently. An algorithm that is simple to implement and easy to tune, but still performs as well as or better than the other choices for scaling Gaussian Process regression, will be attractive to practitioners. Finally, it should be noted that it is possible to combine this algorithm with other algorithms like Gradient Boosted Trees using model stacking to achieve performance gains. Refinement of the sampling strategy used in the algorithm reported in this work is an area of future work. Parallelization of the algorithm is simple and will yield performance gains. Parallelization will also permit further evaluation of the algorithm's characteristics using techniques such as those described in [43] [Chapter 2].

Acknowledgement

The authors would like to thank the reviewers for their many insightful suggestions and comments that have helped improve the presentation of the paper significantly.

References

[1] C.E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006.
[2] Z. Ghahramani, A tutorial on Gaussian processes (or why I don't use SVMs), MLSS Workshop talk by Zoubin Ghahramani on Gaussian Processes, http://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf, 2011. (Accessed 19 July 2016).
[3] C.J. Stone, Additive regression and other nonparametric models, Ann. Stat. (1985) 689–705.
[4] M.K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: AISTATS, vol. 12, 2009, pp. 567–574.
[5] J. Hensman, N. Fusi, N.D. Lawrence, Gaussian processes for big data, in: Conference on Uncertainty in Artificial Intelligence, auai.org, 2013, pp. 282–290.
[6] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, preprint, arXiv:1603.02754.
[7] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, Springer Series in Statistics, vol. 1, Springer, Berlin, 2001.
[8] M.A. Pimentel, D.A. Clifton, L. Clifton, L. Tarassenko, Probabilistic estimation of respiratory rate using Gaussian processes, in: 2013 IEEE 35th Annual International Conference on Engineering in Medicine and Biology Society, EMBC, IEEE, 2013, pp. 2902–2905.
[9] Y. Weng, R. Rajagopal, Probabilistic baseline estimation via Gaussian process, in: IEEE Power & Energy Society General Meeting, IEEE, 2015, pp. 1–5.
[10] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (2) (1992) 241–259.
[11] J.R. Lloyd, GEFCom2012 hierarchical load forecasting: gradient boosting machines and Gaussian processes, Int. J. Forecast. 30 (2) (2014) 369–374.
[12] D. Duvenaud, Automatic Model Construction with Gaussian Processes, Ph.D. Thesis, University of Cambridge, 2014.
[13] C.J. Stone, Optimal global rates of convergence for nonparametric regression, Ann. Stat. (1982) 1040–1053.
[14] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Advances in Neural Information Processing Systems, 2005, pp. 1257–1264.
[15] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, The MIT Press, 2005.
[16] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[17] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[18] Y. Cao, D.J. Fleet, Generalized product of experts for automatic and principled fusion of Gaussian process predictions, preprint, arXiv:1410.7827.
[19] A.B. Tsybakov, Introduction to Nonparametric Estimation, Springer, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
[20] L. Györfi, M. Kohler, A. Krzyzak, H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer Science & Business Media, 2006.
[21] Y. Yang, S.T. Tokdar, et al., Minimax-optimal nonparametric regression in high dimensions, Ann. Stat. 43 (2) (2015) 652–674.
[22] Y. Yang, D.B. Dunson, et al., Bayesian manifold regression, Ann. Stat. 44 (2) (2016) 876–905.
[23] A. Zaytsev, E. Burnaev, Minimax approach to variable fidelity data interpolation, in: Artificial Intelligence and Statistics, 2017, pp. 652–661.
[24] R.E. Kalman, A new approach to linear filtering and prediction problems, J. Basic Eng. 82 (1) (1960) 35–45.
[25] R. Sambasivan, S. Das, A statistical machine learning approach to yield curve forecasting, in: 2017 International Conference on Computational Intelligence in Data Science, ICCIDS, 2017, pp. 1–6.
[26] J. Quiñonero-Candela, C.E. Rasmussen, C.K. Williams, Approximation methods for Gaussian process regression, in: Large-Scale Kernel Machines, 2007, pp. 203–224.
[27] A.G. Wilson, Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes, Ph.D. Thesis, University of Cambridge, 2014.
[28] A. Wilson, R. Adams, Gaussian process kernels for pattern discovery and extrapolation, in: Proceedings of the 30th International Conference on Machine Learning, ICML-13, 2013, pp. 1067–1075.
[29] P. Drineas, M.W. Mahoney, On the Nyström method for approximating a Gram matrix for improved kernel-based learning, J. Mach. Learn. Res. 6 (Dec) (2005) 2153–2175.
[30] A. Rahimi, B. Recht, Random features for large-scale kernel machines, in: Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[32] GPy, GPy: a Gaussian process framework in Python, http://github.com/SheffieldML/GPy, 2012–2014.
[33] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[34] M. Belyaev, E. Burnaev, Y. Kapushev, Computationally efficient algorithm for Gaussian process regression in case of structured samples, Comput. Math. Math. Phys. 56 (4) (2016) 499–513.
[35] V. Tresp, A Bayesian committee machine, Neural Comput. 12 (11) (2000) 2719–2741.
[36] M.P. Deisenroth, J.W. Ng, Distributed Gaussian processes, in: International Conference on Machine Learning, vol. 2, ICML, 2015, p. 5.
[37] S. Park, S. Choi, Hierarchical Gaussian process regression, in: ACML, 2010, pp. 95–110.
[38] R.B. Gramacy, D.W. Apley, Local Gaussian process approximation for large computer experiments, J. Comput. Graph. Stat. 24 (2) (2015) 561–578.
[39] R.B. Gramacy, laGP: large-scale spatial modeling via local approximate Gaussian processes in R, J. Stat. Softw. 72 (1) (2016) 1–46, https://doi.org/10.18637/jss.v072.i01.
[40] M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml, 2016.
[41] L. Torgo, Large regression datasets, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236, 2016.
[42] B. USDOT, RITA airline delay data download, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236, 2016.
[43] C.E. Rasmussen, Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression, Ph.D. Thesis, Cambridge, 1999, http://mlg.eng.cam.ac.uk/pub/pdf/Ras96b.pdf.
[44] L. Breiman, Out-of-Bag Estimation, Technical Report, Statistics Department, University of California, Berkeley, https://www.stat.berkeley.edu/users/breiman/OOBestimation.pdf. (Accessed 14 April 2018).