
Stacking and Stability



arXiv:1901.09134v1 [cs.LG] 26 Jan 2019

Nino Arsov
Macedonian Academy of Sciences and Arts
1000 Skopje, Macedonia
[email protected]

Martin Pavlovski
Temple University
Philadelphia, PA 19122
[email protected]

Ljupco Kocarev
Macedonian Academy of Sciences and Arts
1000 Skopje, Macedonia
[email protected]

January 29, 2019

Abstract

Stacking is a general approach for combining multiple models toward greater predictive accuracy. It has found various applications across different domains, ensuing from its meta-learning nature. Our understanding of how and why stacking works, nevertheless, remains intuitive and lacking in theoretical insight. In this paper, we use the stability of learning algorithms as an elemental analysis framework suitable for addressing the issue. To this end, we analyze the hypothesis stability of stacking, bag-stacking, and dag-stacking and establish a connection between bag-stacking and weighted bagging. We show that the hypothesis stability of stacking is a product of the hypothesis stability of each of the base models and the combiner. Moreover, in bag-stacking and dag-stacking, the hypothesis stability depends on the sampling strategy used to generate the training set replicates. Our findings suggest that 1) subsampling and bootstrap sampling improve the stability of stacking, and 2) stacking improves the stability of both subbagging and bagging.

Keywords: stacking · stacked generalization · bagging · bootstrap · algorithmic stability · generalization

1 Introduction

Stacked generalization (stacking) [1] is a prominent and popular off-the-shelf meta-learning approach suitable for a variety of downstream machine learning tasks in areas such as computer vision and computational biology. Stacking is related to ensemble learning approaches such as bagging [2] and boosting [3]. With the rising prevalence of machine learning techniques for everyday problem solving, it has also reached the status of a missing-ingredient algorithm on competition platforms such as Kaggle. Moreover, stacking has recently found various applications across different domains [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Currently, however, there is still very little insight into how and why stacking works, barring rare blog posts by machine learning practitioners in which the explanations tend to be facile, based on intuition, and lacking in theoretical grounding. In the machine learning research community, there are only two scant papers that investigate the effectiveness of stacking and its variations, called bag-stacking and dag-stacking, in terms of predictive accuracy [15, 16].

A common way to analyze a learning algorithm's capability to generalize is to examine how sensitive it is to small changes in the training set. Sensitivity can be quantified using different notions of stability, and once we know how stable a learning algorithm is, we can use that knowledge to derive an upper bound on its generalization error. The research focused on bounding the generalization error has been prolific and has produced significant advances toward understanding how well learning algorithms can generalize. More importantly, it has established a precise relationship between stability and generalization. There are different classes of upper bounds that vary in how tight they are; tighter upper bounds convey more reliable estimates of the generalization error.
Examples include upper bounds based on the Vapnik-Chervonenkis (VC) dimension [17], probably approximately correct (PAC) learning [18], PAC-Bayes bounds [19], the stability of learning algorithms [20, 21], and so on.

Inspired by upper bounds as a tool to assess the generalization error of a learning algorithm, we formally investigate how stability interacts in stacking, bag-stacking, and dag-stacking. This paper is a first look into the effectiveness of these three approaches from the perspective of stability analysis. The contributions of this paper are:

• we analyze the hypothesis stability of stacking, bag-stacking, and dag-stacking,
• we establish a connection between bag-stacking and weighted bagging, i.e., we show that bag-stacking equals weighted bagging,
• we show that subsampling/bootstrap sampling improves the hypothesis stability of stacking and that stacking improves the hypothesis stability of both dagging (subbagging) and bagging.

Regarding the third contribution, we additionally show that stacking improves the stability of the stacked learning algorithms by a factor of 1/m, whereas using bootstrap sampling lends further improvements on top of stacking. In this paper, we provide a thorough description of the way stability interacts in bagging, stacking, and in-between, i.e., in bag-stacking. We maintain that stability is the appropriate formal apparatus to demonstrate the effectiveness of stacking toward reducing the generalization error of the base models. We show that, in contrast to the traditional heterogeneous ensemble structure of stacking, its homogeneous counterpart, bag-stacking, can be used instead to stabilize the algorithm further and, consequently, yield a tighter stability-based upper bound, reducing the generalization error further.

The paper is organized as follows: Section 2 introduces the definitions and notation used throughout the text, Section 3 discusses the link between meta-learning and ensemble learning approaches, Section 4 presents the existing work on the stability of deterministic and randomized learning algorithms, and Section 5 contains the main results and contributions. We conclude the paper with Section 6.

2 Notation and Prerequisites

This section briefly introduces the notation used in the rest of the paper and the prerequisite notions that help analyze learning algorithms from the perspective of stability. Bold symbols represent vectors, such as x ∈ R^3, whereas x ∈ R is a scalar. The symbols below designate various quantities or measures:

• P[·]: probability
• E_X[·]: expected value with respect to X
• Var(·): variance
• e(·): error function

Let X and Y denote the input and output space of a learning algorithm. Here, we assume that X ⊆ V^d, where d ≥ 1 and V is a vector space, while Y ⊆ R for regression tasks, and Y = C, a set of labels that are not necessarily numerical, for classification tasks. Let Z = X × Y. Consider a sample D of m input-output pairs,

D = \{z_1 = (x_1, y_1), z_2 = (x_2, y_2), \ldots, z_m = (x_m, y_m)\} = (D_X, D_Y).

The examples z_1, z_2, ..., z_m are drawn i.i.d. from an unknown distribution P and comprise the training set D. If we take Z^m to denote the set of all possible training sets of size m, then D ∈ Z^m. By removing the i-th example from D, we get a new training set of m − 1 examples,

D^{\setminus i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}.

Replacing the i-th example of D with some z ∈ Z, z ∉ D, yields

D^{\setminus i} \cup \{z\} = \{z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_m\}, \quad z \in Z, \ z \notin D.
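To ground this notation, the following tiny sketch (ours, not the paper's) builds a training set D as parallel arrays and forms the perturbed sets D^{\i} and D^{\i} ∪ {z}; all values are arbitrary illustrations:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))        # D_X: m = 5 examples in R^3
    y = rng.integers(0, 2, size=5)     # D_Y: binary labels

    i = 2
    keep = np.arange(len(X)) != i
    X_minus_i, y_minus_i = X[keep], y[keep]              # D^{\i}: m - 1 examples

    x_new, y_new = rng.normal(size=3), 1                 # some z = (x, y) not in D
    X_repl = np.vstack([X[:i], x_new[None, :], X[i+1:]]) # D^{\i} ∪ {z}
    y_repl = np.concatenate([y[:i], [y_new], y[i+1:]])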
A training algorithm A learns a mapping (function) f : X → Y that approximates P. When the learning algorithm A learns f using a training set D, we denote this by f_{A(D)}. For brevity, however, it is useful to omit A and reduce f_{A(D)} to f_D, barring a few cases when it is necessary to distinguish different learning algorithms. In addition, f_D also refers to a model (for f), a hypothesis, or a predictor. Lastly, for any S ⊆ Z, we define f_D(S_X) = {f_D(x) | x ∈ S_X}, where S_Y and f_D(S_X) retain the order of examples.

3 Combining Predictors: Ensemble Learning

Ensemble learning is a modeling paradigm in which at least two predictors are combined into a single model, called an ensemble. At its simplest, an ensemble is a merger of weak models, each with a limited capability to generalize; combined, they generalize significantly better. In 1988, Kearns and Valiant posed the question of whether weak models could be turned into a single strong predictor. Weak models are those that predict only slightly better than random guessing; a succinct definition can be found in [22, Def. 4.1]. A paper by Rob Schapire analyzed the predictive strength of weak models as early as 1990 [23]. Ensemble learning has seen two major breakthroughs: bootstrap aggregation (bagging) [2] and boosting [3]. Both are meta-learning approaches that aim to "boost" the accuracy of a given algorithm. Moreover, bagging is designed to reduce variance, whereas boosting reduces both the bias and the variance that comprise the expected prediction error, e = bias^2 + variance.

3.1 Bagging as a parallel and independent ensemble learning approach

In bagging, a learning algorithm repeatedly trains weak models using different bootstrap samples D_1, D_2, ..., D_T drawn from a training set D. The training rounds are independent and thus easily parallelizable. The trained models are then fused into a single predictor f_a by aggregation. Bagging does not minimize a particular loss function, for it is engineered to reduce variance and stabilize an unstable learning algorithm by aggregation [2]. The basic reasoning behind bagging is that we want to train weak learners using many different training sets from Z^m and then aggregate them by taking their expected value

f_a = E_D[f_D].    (1)

In reality, however, the data distribution P is unknown and there is only one training set D available to work with. Bootstrap sampling, or random uniform sampling with replacement, thus allows one to repeatedly re-sample D and generate many different replicates of the training set in order to simulate and estimate the aggregated predictor f_a by F_D:

\hat{f}_a(x) = F_D(x) = \frac{1}{T} \sum_{t=1}^{T} f_{D_t}(x).

A large number of replicates gives a better estimate of f_a = E_D[f_D] since the sample used to estimate f_a is larger. Although the error of the estimate does not taper off as the number of replicates increases, its improvement starts faltering at some point: once the number of bootstrap samples becomes sufficiently large to provide a very accurate estimate of f_a, adding new samples makes only minute improvements. The bagging procedure is given in Algorithm 1.

Algorithm 1 Bagging [2]
1: procedure BAGGING(A, D, T)
2:   for t = 1, ..., T do
3:     D_t ← BootstrapSampling(D)                                   ⊲ Generate a bootstrap sample
4:     f_{D_t} ← A(D_t)                                             ⊲ Apply A on D_t
5:   return F_D(x) = (1/T) \sum_{t=1}^{T} f_{D_t}(x)                ⊲ For regression
6:          F_D(x) = argmax_{C_k} \sum_{t=1}^{T} I(f_{D_t}(x) = C_k) ⊲ For classification
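To make Algorithm 1 concrete, here is a minimal Python sketch of the regression variant (ours, not the paper's); it assumes the base algorithm A is a class exposing scikit-learn-style fit/predict methods:

    import numpy as np

    def bagging(A, X, y, T, rng=None):
        """Train T base models on bootstrap samples of (X, y) and
        return a function that averages their predictions (regression)."""
        rng = rng or np.random.default_rng()
        m = len(X)
        models = []
        for _ in range(T):
            idx = rng.integers(0, m, size=m)  # bootstrap: m indices with replacement
            models.append(A().fit(X[idx], y[idx]))
        # F_D(x) = (1/T) * sum_t f_{D_t}(x)
        return lambda X_new: np.mean([f.predict(X_new) for f in models], axis=0)

For classification, the averaging in the last line would be replaced by a majority vote over the T predictions, as in line 6 of Algorithm 1.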
A variation of bagging, called subbagging, is based on sampling from D without replacement to generate the T training sets, each comprising n < m examples. Although this might compound instability in contrast to bagging, it speeds up the computation when n is significantly smaller than m. The stability analysis of subbagging is deferred to Section 5.

3.2 Boosting as a sequential and dependent ensemble learning approach

Boosting [3] follows the same ensemble learning paradigm as bagging, but it has two key differences. The first is that boosting minimizes the exponential loss e^{−yF(x)}, which is an upper bound on the 0-1 classification loss. In the first round, all training examples are equally important. In each following round, the algorithm adjusts the importance so that the current weak model focuses on examples mispredicted by the preceding one, while those already learned become less important. This way, the boosting algorithm transforms the training set D into T different training sets D(w_1), D(w_2), ..., D(w_T), where w_t ∈ R^m and ‖w_t‖_1 = 1 for t = 1, 2, ..., T. These importances form a probability distribution, i.e., they add up to 1. They have the same effect as oversampling D: the least important example occurs once in the oversampled set, while the rest occur multiple times. This approach, unlike bagging, introduces a dependency between the weak models. At the same time, the t-th weak learner gets a weight α_t reflecting its error. Accurate weak models get higher weights, which are then used to aggregate them into a single strong model with boosted accuracy using the mixture

\hat{f}_a(x) = F_D(x) = \sum_{t=1}^{T} \alpha_t f_{D_t}(x).

The second difference is that boosting is a fully deterministic algorithm that minimizes the loss function e^{−yF(x)}. In other words, boosting is a method for margin maximization [24]. More details on how each step in Algorithm 2 helps minimize the exponential loss can be found in [3].

Algorithm 2 AdaBoost.M1 [3]
1: procedure ADABOOST.M1(A, D, T)
2:   Initialize the example importances ω_{1i} = 1/m, i = 1, 2, ..., m
3:   for t = 1, ..., T do
4:     f_{D_t} ← A(D(ω_t))                                                        ⊲ Apply A on D(ω_t)
5:     err_t ← (\sum_{i=1}^{m} ω_{ti} I(y_i ≠ f_{D_t}(x_i))) / (\sum_{i=1}^{m} ω_{ti}) ⊲ Compute the weighted error
6:     α_t ← log((1 − err_t)/err_t)                                               ⊲ Compute f_{D_t}'s weight
7:     ω_{ti} ← ω_{ti} exp[α_t I(y_i ≠ f_{D_t}(x_i))], i = 1, ..., m              ⊲ Adjust the example importances
8:   return F_D(x) = (1/T) \sum_{t=1}^{T} α_t f_{D_t}(x)                          ⊲ For regression
9:          F_D(x) = argmax_{C_k} \sum_{t=1}^{T} α_t I(f_{D_t}(x) = C_k)          ⊲ For classification
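A minimal sketch of Algorithm 2 for binary labels in {−1, +1} follows (ours, not the paper's); it assumes the base learner's fit method accepts per-example weights via a sample_weight argument, as scikit-learn classifiers commonly do:

    import numpy as np

    def adaboost_m1(A, X, y, T):
        """AdaBoost.M1 sketch for labels y in {-1, +1}."""
        m = len(X)
        w = np.full(m, 1.0 / m)              # example importances, sum to 1
        models, alphas = [], []
        for _ in range(T):
            f = A().fit(X, y, sample_weight=w)
            miss = (f.predict(X) != y)
            err = np.dot(w, miss) / w.sum()  # weighted error (line 5)
            err = min(max(err, 1e-12), 1 - 1e-12)  # guard degenerate cases
            alpha = np.log((1 - err) / err)  # model weight (line 6)
            w *= np.exp(alpha * miss)        # re-weight mispredicted examples
            w /= w.sum()
            models.append(f); alphas.append(alpha)
        def F(X_new):                        # weighted majority vote
            votes = sum(a * f.predict(X_new) for a, f in zip(alphas, models))
            return np.sign(votes)
        return F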
3.3 Stacking: a meta-learning perspective

Meta-learning strategies are used to learn latent variables of a model in supervised and unsupervised settings. They always aim to consolidate various virtues of the model. On one hand, meta-learning works with trained models and uses ground-truth data, as boosting does, to find the optimal mixture weights α with which to combine weak models. Regularization, on the other hand, is added to the training process to mitigate overfitting. In general, meta-learning approaches operate on top of either the input space, the output space, or the span of the parameter space. They are less rigidly specified than classical learning in the sense that they can often be heuristic and allow different interpretations of the latent variables across learning scenarios.

Stacking is a meta-learning approach for combining multiple models. Stacking operates over the output space and is strictly limited to supervised learning settings. At its simplest, it is a two-level pipeline of models, where predictions from the first level move on to the second level to act as the input to a separate learning component called a combiner. Least-squares and logistic regression are regular instances of combiner algorithms for regression and classification tasks, respectively. Stacking has an intrinsic connection to ensemble learning. Stacking combines T ≥ 2 models, each of which is trained using a (usually different) learning algorithm over the same training set D. Thus there are T learned mappings from X to Y, which we denote by f_D^{(t)}, t = 1, 2, ..., T. Accordingly, there are T sets of predictions on D, f_D^{(t)}(D_X), t = 1, 2, ..., T. The combiner learns a new mapping g, with unknown parameters θ_g ∈ R^T, over the Cartesian product of the base models' outcomes. In some cases, f_D^{(t)}(D_X) may be subtly different from the entirety of Y, for instance when algorithms produce unrecognizable output (randomly generated output or external noise). To eliminate the ambiguities, the work presented here relies on the assumption that the outcomes are always in Y.

To learn θ_g, the combiner algorithm uses the outcomes {f_D^{(t)}(D_X)}_{t=1}^{T}, which comprise the features of the new T-dimensional training set D̃ = (D̃_X, D̃_Y), where

\tilde{D}_X = \left\{ \left( f_D^{(1)}(x_i), f_D^{(2)}(x_i), \ldots, f_D^{(T)}(x_i) \right) \right\}_{i=1}^{m}, \quad \tilde{D}_Y = D_Y.

Stacking thus learns T + 1 mappings: it learns f_D^{(t)} : X → Y, t = 1, 2, ..., T, on the first level using D as the training set, and then, using these outcomes, it learns one mapping g : D̃_X → Y on the second level, such that g ∘ f : X → Y. The combiner learns the parameters θ_g by minimizing a loss function L,

\theta_g^* = \arg\min_{\theta} L(\theta; \tilde{D}).    (2)

The equation of the stacking model F is given by

F_D(x) = g_{\tilde{D}}\left( \left( f_D^{(1)}(x), f_D^{(2)}(x), \ldots, f_D^{(T)}(x) \right); \theta_g \right), \quad x \in X,    (3)

and can be reduced to

F_D(x) = (g_{\tilde{D}} \circ f_D)(x), \quad x \in X,    (4)

where g can simply be regarded as the model equation of the combiner. For instance, if the combiner is a least-squares regressor, then it minimizes the mean squared error L(\theta; \tilde{D}) = \sum_{z_i \in D} ((g \circ f_D)(x_i) - y_i)^2; classifiers are stacked in the same way, only with the mean squared error replaced by the cross-entropy loss.

Fundamentally, stacking has a conceptual connection to ensemble learning; the parameters θ_g^*, depending on their nature, are simply mixture weights used to combine the T predictors, which is particularly the case for linear combiners. For example, linear or logistic regression learns the weights of the predictors, i.e., the latent coefficients of their linear combination that minimize the mean squared error or the cross-entropy loss, respectively. The former is nothing but a weighted average of regressors (see Example 3.1 below), while the latter amounts to weighted majority voting: logistic regression chooses the weights that minimize the cross-entropy loss and thus does not treat ensemble members equally.
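To make the two-level pipeline concrete, here is a minimal stacking sketch in Python (ours, not the paper's); it assumes scikit-learn-style base estimators and a logistic-regression combiner, and it follows the description above by training the combiner on the base models' predictions on D_X:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stack(base_algorithms, X, y):
        """Two-level stacking: base models trained on D, combiner g
        trained on their predictions over D_X (the set D~ above)."""
        models = [A().fit(X, y) for A in base_algorithms]
        # Build the T-dimensional meta-training set: one column per base model
        X_tilde = np.column_stack([f.predict(X) for f in models])
        g = LogisticRegression().fit(X_tilde, y)   # the combiner
        def F(X_new):
            Z = np.column_stack([f.predict(X_new) for f in models])
            return g.predict(Z)
        return F

In practice, out-of-fold predictions are often used to build D̃ so that the combiner does not overfit the base models' behavior on their own training set; the sketch keeps the paper's simpler formulation.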
Example 3.1 (Weighted bagging equals bag-stacking). Suppose we are given a bagging ensemble of size T. We want to transform it into a weighted ensemble so that each member has a different importance in the final vote (aggregation). To achieve this, a real number θ, called a weight, is associated with each base model. The weights act as measures of relative importance among the T members. We call this weighted bagging, and extending Equation (1) to lend itself to this model, we get

f_a(x; \theta) = E_D[\theta_D f_D(x)].    (5)

To estimate f_a(x; θ), we use T replicates of the training set D, drawn using bootstrap sampling. We assume that the sampling is guided by a random parameter r ∈ R^m that stores the sampling probabilities of the examples in D; in practice, all elements of r are equal to 1/m. The bootstrap samples are thus D(r_t), t = 1, 2, ..., T. Let θ = [θ_1 θ_2 ... θ_T]^⊤ be a latent parameter vector in R^T that stores the weight of each base model f_{D(r_t)}. The objective here is to find the θ that yields the optimal combination of the T base models. The model equation of weighted bagging therefore becomes

F_{D,r,\theta}(x) = \sum_{t=1}^{T} \theta_t f_{D(r_t)}(x).

This equation is applicable to both regression and classification settings. In a classification setting, this is called weighted majority voting. For binary {−1, +1} classification tasks, the prediction ŷ for any x is ŷ = sign[F_{D,r,θ}(x)].

To learn θ after learning f_{D(r_1)}, f_{D(r_2)}, ..., f_{D(r_T)}, we minimize either a squared loss function in regression,

\theta^* = \arg\min_{\theta} \sum_{i=1}^{m} \left( y_i - F_{D,r,\theta}(x_i) \right)^2 = \arg\min_{\theta} \sum_{i=1}^{m} \left( y_i - \sum_{t=1}^{T} \theta_t f_{D(r_t)}(x_i) \right)^2,    (6)

or the cross-entropy loss in classification, as follows:

\theta^* = \arg\min_{\theta} -L(\theta; D, F_{D,r}) = \arg\min_{\theta} - \sum_{k=1}^{K} \sum_{\substack{i=1 \\ y_i = C_k}}^{m} \log p(C_k \mid x_i, F_{D,r}; \theta).    (7)

Setting the partial derivatives of Equation (6) with respect to each coordinate of θ to zero yields a system of T linear equations with T unknowns, which can be solved using any relevant method. Equation (7), by contrast, can be solved using gradient descent or other optimization methods.

The takeaway from this example is that Equations (6) and (7) are in fact the loss functions of the combiner in bag-stacking (linear regression and logistic regression, respectively), hence the relationship of weighted bagging to stacking.
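A minimal sketch of Example 3.1 (ours, not the paper's): bag T base regressors on bootstrap samples and solve Equation (6) for θ by least squares; scikit-learn-style regressors are assumed:

    import numpy as np

    def weighted_bagging(A, X, y, T, rng=None):
        """Bag T base regressors, then learn mixture weights theta
        by least squares, as in Equation (6)."""
        rng = rng or np.random.default_rng()
        m = len(X)
        models = []
        for _ in range(T):
            idx = rng.integers(0, m, size=m)           # bootstrap sample D(r_t)
            models.append(A().fit(X[idx], y[idx]))
        P = np.column_stack([f.predict(X) for f in models])   # m x T predictions
        theta, *_ = np.linalg.lstsq(P, y, rcond=None)  # min ||P theta - y||^2
        return lambda X_new: np.column_stack(
            [f.predict(X_new) for f in models]) @ theta

This is exactly bag-stacking with a linear-regression combiner, which is the point of the example.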
4 Related work: stability of learning algorithms

The first conceptions of the stability of learning algorithms date back several decades, to a time when machine learning was more a part of statistics than it is today. As a general quantitative theory, some of the early notions of hypothesis stability stem from [1]. The term itself is relatively intelligible to a broader audience: it describes a quantifiable measure of a learning algorithm's sensitivity to changes in the training set. In the last decade, this notion of stability has been used to derive upper bounds on the generalization error of deterministic and randomized learning algorithms in Elisseeff and Bousquet's work [20, 21]. Compared with upper bounds based on the Vapnik-Chervonenkis (VC) dimension [17] or probably approximately correct (PAC) learnability theory [18], stability-based bounds are easily controllable and serve as a powerful tool for designing new learning algorithms. An important virtue of stability-based bounds is their simplicity from a mathematical standpoint. Originally not as tight as the PAC-Bayes bounds [19], considered the tightest, they can be optimized or consolidated, i.e., made significantly tighter in different ways, not the least of which is collaborative learning [25, 26], which has attained significant generalization improvement.

4.1 Stability and generalization: basic principles for deterministic and randomized learning algorithms

A learning algorithm is stable when it meets a particular stability criterion. The easiest way to define the stability of a learning algorithm is to start from the goal of establishing an upper bound on its generalization error: we want this bound to be tight when the algorithm meets the stability criterion. More restrictive stability criteria lead to tighter upper bounds [20]. The randomness in supervised learning comes from the sampling of the training set, and stability is thus considered with respect to changes in the training set, such as the removal or replacement of a training example [20]. This definition of stability helps establish upper bounds on the generalization error. Other ways of defining the stability of a learning algorithm and establishing upper bounds include the VC dimension of its search space [27] and PAC or PAC-Bayes bounds [19, 28].

The rest of this section provides the basics of hypothesis stability: notation, variations, stability for deterministic and randomized algorithms, and a definition of the generalization error and its upper bound. The definitions hereinafter apply to both deterministic and randomized learning algorithms, unless explicitly stated otherwise.

Definition 4.1 (Loss functions). Let ℓ(y′, y) ∈ R⁺₀ be a loss function that measures the loss of y′ = f(x) on some z = (x, y) ∈ Z. For brevity, we denote this by ℓ(f, z). There are three kinds of loss functions (see Figure 1):

1. The squared loss ℓ_sq(f, z) ∈ R,
   ℓ_sq(f, z) = (y − f(x))^2.
2. The classification loss ℓ(f, z) ∈ {0, 1},
   ℓ(f, z) = I(f(x) ≠ y),
   where I(C) is an indicator function equal to 1 when C is true and 0 otherwise.
3. The γ-loss ℓ_γ ∈ [0, 1],

   \ell_\gamma(f, z) = \begin{cases} 1, & \text{if } yf(x) < 0, \\ 1 - \frac{yf(x)}{\gamma}, & \text{if } 0 \leq yf(x) \leq \gamma, \\ 0, & \text{otherwise.} \end{cases}

Here, ℓ_γ(f, z) takes into account the margin m(f, z) = yf(x): the loss grows as the margin approaches zero, so the γ-loss favors correct and confident predictions, with the intensity controlled by γ.

[Figure 1: Loss functions: the squared loss (a), the classification loss (b), and the γ-loss for γ = 1 and γ = 2 (c and d). Each loss function, barring the squared loss, is confined to [0, 1].]

Next, we lay out the definitions of the true error of a learning algorithm, also called the generalization error, different notions of stability, and the corresponding upper bounds on the generalization error. To begin with, let f_D(x) be the outcome of a learning algorithm trained on D.

Definition 4.2 (Generalization error). Let ℓ be a loss function. The true, i.e., generalization, error of a learning algorithm that trains a model f using D is expressed as the expected loss of f over all z ∈ Z, that is,

R_{gen}(f_D) = E_z[\ell(f_D, z)].    (8)

The simplest estimator of R_{gen}(f_D) is the empirical error observed on the sample D, also known as the training error,

R_{emp}(f, D) = \frac{1}{m} \sum_{i=1}^{m} \ell(f_D, z_i).    (9)

In addition, the leave-one-out error is given by

R_{loo}(f, D) = \frac{1}{m} \sum_{i=1}^{m} \ell(f_{D^{\setminus i}}, z_i).    (10)
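The two estimators in Equations (9) and (10) are straightforward to compute; a minimal sketch (ours, not the paper's) using the classification loss and a scikit-learn-style estimator class A:

    import numpy as np

    def empirical_error(A, X, y):
        """R_emp: average 0-1 loss of a model trained on the full set D."""
        f = A().fit(X, y)
        return np.mean(f.predict(X) != y)

    def leave_one_out_error(A, X, y):
        """R_loo: for each i, train on D \ {z_i} and test on z_i (Eq. (10))."""
        m = len(X)
        losses = []
        for i in range(m):
            keep = np.arange(m) != i
            f_i = A().fit(X[keep], y[keep])
            losses.append(float(f_i.predict(X[i:i+1])[0] != y[i]))
        return np.mean(losses)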
Definition 4.3 (Hypothesis stability) [21, Def. 1]. A learning algorithm has hypothesis stability β with respect to a loss function ℓ if the following holds:

\forall i \in \{1, 2, \ldots, m\}, \quad E_{D,z}\left[ |\ell(f_D, z) - \ell(f_{D^{\setminus i}}, z)| \right] \leq \beta.    (11)

With this definition of stability, the expected loss difference is measured on a fixed example z ∈ Z while one example at a time is removed from the training set D. This strategy provides the means to derive an upper bound on the generalization error based on the leave-one-out error R_loo(f, D) of the algorithm.

Theorem 4.1 (Hypothesis stability generalization error bound) [21, Thm. 2]. Let f_D be the outcome of a learning algorithm with hypothesis stability β with respect to a loss function ℓ such that 0 ≤ ℓ(f, z) ≤ M. Then, with probability 1 − δ over the random draw of the training set D,

R_{gen}(f_D) \leq R_{loo}(f_D, D) + \sqrt{\delta^{-1} \frac{M^2 + 6Mm\beta}{2m}}.    (12)

A slightly different notion of stability, called pointwise hypothesis stability, measures the loss change on one of the training examples z_i ∈ D when it is replaced by some z ∈ Z that is not originally in D.

Definition 4.4 (Pointwise hypothesis stability) [21, Def. 3]. A learning algorithm has pointwise hypothesis stability β with respect to the loss function ℓ if the following holds:

\forall i \in \{1, 2, \ldots, m\}, \quad E_D\left[ |\ell(f_D, z_i) - \ell(f_{D^{\setminus i} \cup z}, z_i)| \right] \leq \beta.    (13)

With pointwise hypothesis stability we get a similar error bound that can now include the empirical (training) error R_emp(f_D, D) of the algorithm.

Theorem 4.2 (Pointwise hypothesis stability generalization error bound) [21, Thm. 4]. Let f_D be the outcome of a learning algorithm with pointwise hypothesis stability β with respect to a loss function ℓ such that 0 ≤ ℓ(f, z) ≤ M. Then, with probability 1 − δ over the random draw of the training set D,

R_{gen}(f_D) \leq R_{emp}(f_D, D) + \sqrt{\delta^{-1} \frac{M^2 + 12Mm\beta}{2m}}.    (14)

An even stronger notion of stability, called uniform stability, provides tighter bounds on the generalization error.

Definition 4.5 (Uniform stability) [21, Def. 5]. An algorithm has uniform stability β with respect to the loss function ℓ if the following holds:

\forall D \in Z^m, \ \forall i \in \{1, \ldots, m\}, \quad \|\ell(f_D, \cdot) - \ell(f_{D^{\setminus i}}, \cdot)\|_\infty \leq \beta.    (15)

Uniform stability is an upper bound on hypothesis and pointwise hypothesis stability [20].

Theorem 4.3 (Uniform stability generalization error bound) [21, Thm. 6]. Let f_D be the outcome of a learning algorithm with uniform stability β with respect to a loss function ℓ such that 0 ≤ ℓ(f, z) ≤ M, for all z ∈ Z and all sets D. Then, for any m ≥ 1 and any δ ∈ (0, 1), the following bound holds with probability 1 − δ over the random draw of the training set D:

R_{gen}(f_D) \leq R_{emp}(f_D, D) + 2\beta + (4m\beta + M)\sqrt{\frac{\log(1/\delta)}{2m}}.    (16)

In this bound, the dependence on δ is \sqrt{\log(1/\delta)} instead of \sqrt{\delta^{-1}}, which implies a tighter upper bound on the generalization error.

Algorithms whose stability scales as O(1/m) are considered stable [20]. When an algorithm is randomized using a random parameter r, we have a randomized outcome f_{D,r} and, consequently, randomized hypothesis, pointwise hypothesis, and uniform stability definitions. When r is fixed, they are equal to Definitions 4.3 and 4.4. The randomized definitions of stability and the resulting upper bounds on the generalization error are strikingly similar to the deterministic ones.
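As a worked illustration of Theorem 4.1, the following sketch (ours; all numbers are illustrative placeholders) evaluates the right-hand side of Equation (12) for a k-NN classifier, whose hypothesis stability is on the order of k/m with respect to the classification loss [21]:

    import math

    def hypothesis_stability_bound(r_loo, beta, m, M=1.0, delta=0.05):
        """Right-hand side of Theorem 4.1 (Eq. (12)):
        R_gen <= R_loo + sqrt((M**2 + 6*M*m*beta) / (2*m*delta))."""
        return r_loo + math.sqrt((M**2 + 6 * M * m * beta) / (2 * m * delta))

    # Illustrative values: m = 10000 examples, k = 5 neighbors, R_loo = 0.12
    m, k = 10_000, 5
    print(hypothesis_stability_bound(r_loo=0.12, beta=k / m, m=m))

Because β scales as 1/m here, the additive term shrinks as the training set grows, which is the sense in which such algorithms are "stable."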
Definition 4.6 (Uniform stability of randomized algorithms) [21, Def. 13]. A randomized learning algorithm has uniform stability β with respect to the loss function ℓ if, for every i = 1, ..., m,

\sup_{D,z} \left| E_r[\ell(f_{D,r}, z)] - E_r[\ell(f_{D^{\setminus i},r}, z)] \right| \leq \beta.    (17)

The empirical estimate R_emp(f, D) can be used to construct a guess for the generalization error concealed behind the unknown distribution P of Z. The stability of a learning algorithm can then be leveraged to construct an upper bound on the generalization error.

4.2 Stability and generalization error upper bounds for bagging and subbagging

This section focuses on bagging and subbagging. The notation conforms to that used in [21]. Let T be the number of bootstrap samples D(r_1), ..., D(r_T), where D(r_t) denotes the t-th bootstrap set. The random parameters r_t ∈ R = {1, ..., m}^m are instances of a random variable corresponding to random sampling with replacement of m elements from the training set D; such random variables have a multinomial distribution with parameters (1/m, ..., 1/m), which means each example in D has an equal probability of being sampled [21]. For simplicity, the fact that the base algorithm can also be randomized is omitted. The bagging model can thus be written as

F_{D,r} = \frac{1}{T} \sum_{t=1}^{T} f_{D(r_t)}.

Proposition 4.1 (Random hypothesis stability of bagging for regression) [21, Prop. 4.1]. Let the loss ℓ be B-Lipschitzian with respect to its first variable and let F_{D,r} be the outcome of a bagging algorithm whose base algorithm f_{D(r_t)} has hypothesis (respectively, pointwise hypothesis) stability γ_m with respect to the ℓ₁ loss function. Then the random hypothesis (respectively, pointwise hypothesis) stability β_m of F_{D,r} with respect to ℓ is bounded by

\beta_m \leq B \sum_{k=1}^{m} \frac{k\gamma_k}{m} P_r[d(r) = k],    (18)

where d(r), r ∈ R, is the number of distinct examples in one bootstrap iteration.

Proposition 4.2 (Random hypothesis stability of bagging for classification) [21, Prop. 4.2]. Let F_{D,r} be the outcome of a bagging algorithm whose base algorithm f_{D(r_t)} has hypothesis (respectively, pointwise hypothesis) stability γ_m with respect to the classification loss function. Then the random hypothesis (respectively, pointwise hypothesis) stability β_m of F_{D,r} with respect to the ℓ₁ loss is bounded by

\beta_m \leq 2 \sum_{k=1}^{m} \frac{k\gamma_k}{m} P_r[d(r) = k].    (19)

The upper bounds for subbagging are less complicated since the sampling is without replacement, meaning that the examples in each training set D(r_t) are distinct.

Proposition 4.3 (Stability of subbagging for regression) [21, Prop. 4.4]. Assume that the loss ℓ is B-Lipschitzian with respect to its first variable. Let F_{D,r} be the outcome of a subbagging algorithm whose base algorithm has uniform (respectively, hypothesis or pointwise hypothesis) stability γ_p with respect to the ℓ₁ loss function, and subbagging is done by sampling p ≤ m points without replacement. Then the random uniform (respectively, hypothesis or pointwise hypothesis) stability β_m of F_{D,r} with respect to ℓ is bounded by

\beta_m \leq \frac{B\gamma_p p}{m}.    (20)

Proposition 4.4 (Stability of subbagging for classification) [21, Prop. 4.5]. Let F_{D,r} be the outcome of a subbagging algorithm whose base algorithm has hypothesis (respectively, pointwise hypothesis) stability γ_p with respect to the classification loss function, and subbagging is done by subsampling p ≤ m examples from D without replacement. Then the random hypothesis (respectively, pointwise hypothesis) stability β_m of F_{D,r} with respect to the ℓ₁ loss is bounded by

\beta_m \leq \frac{2\gamma_p p}{m}.    (21)
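The bagging bounds (18)-(19) involve the distribution of d(r), the number of distinct examples in a bootstrap sample, which is easy to estimate by simulation. The sketch below (ours; the stability function gamma is an illustrative placeholder) evaluates both families of bounds:

    import numpy as np

    def subbagging_stability(gamma_p, p, m, B=1.0):
        """Eq. (20)/(21): beta_m <= B * gamma_p * p / m (use B=2 for Eq. (21))."""
        return B * gamma_p * p / m

    def bagging_stability(gamma, m, B=1.0, trials=20_000, rng=None):
        """Eq. (18)/(19): beta_m <= B * sum_k (k * gamma(k) / m) * P[d(r) = k],
        with P[d(r) = k] estimated by simulating bootstrap draws."""
        rng = rng or np.random.default_rng(0)
        # d(r): number of distinct indices in a bootstrap sample of size m
        d = np.array([len(np.unique(rng.integers(0, m, size=m)))
                      for _ in range(trials)])
        ks, counts = np.unique(d, return_counts=True)
        return B * sum(k * gamma(k) / m * (c / trials)
                       for k, c in zip(ks, counts))

    # Illustrative placeholder: base learner with stability gamma_k = 5 / k
    print(bagging_stability(lambda k: 5 / k, m=500, B=2.0))

Since d(r) concentrates around 0.632m, the bagging bound behaves roughly like 0.632γ_{0.632m}, matching the remark after Theorem 4.4 below.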
The propositions above can be used to derive upper bounds on the generalization errors of bagging and subbagging. The following theorem gives the upper bounds for subbagging, the latter being tighter, where the dependence is \sqrt{\log(2/\delta)} rather than \sqrt{1/\delta}.

Theorem 4.4 (Hypothesis and pointwise hypothesis stability upper bound on the generalization error of subbagging) [21, Thm. 16]. Assume that the loss ℓ is B-Lipschitzian with respect to its first variable. Let F_{D,r} be the outcome of a subbagging algorithm. Assume subbagging is done with T sets of size p subsampled without replacement from D, and the base learning algorithm has hypothesis stability γ_p and pointwise hypothesis stability γ′_p with respect to the loss ℓ. The following bounds hold separately, each with probability at least 1 − δ:

R_{gen}(F_{D,r}) \leq R_{loo}(F_{D,r}, D) + \sqrt{\delta^{-1} \frac{2M^2 + 12MBp\gamma_p}{m}},    (22)

R_{gen}(F_{D,r}) \leq R_{emp}(F_{D,r}, D) + \sqrt{\delta^{-1} \frac{2M^2 + 12MBp\gamma'_p}{m}}.    (23)

The bounds above are derived for subbagging, but the same bounds hold for bagging [21] if pγ_p/m is replaced by \sum_{i=1}^{m} \frac{i\gamma_i}{m} P(d(r) = i), which is roughly equal to 0.632γ_{0.632m} when m is sufficiently large; here d(r), r ∈ R, is the number of distinct sampled points in one bootstrap iteration.

5 Stability analysis of bagging, stacking and bag-stacking: why and when do they work?

This section is devoted to providing readers with a thorough view of how stability interacts in bagging, stacking, and in-between. Here we emphasize that stability is the appropriate formal apparatus to argue the effectiveness of stacking, more so than other relatively lenient, yet pragmatically acclaimed, expositions. To illustrate one, the following statement appears to convey cogent reasoning: "Students in a school are classified into three groups: bad, average, and excellent. The confusion matrix shows that a k-NN classifier does well only in discerning bad and excellent students; a decision tree classifier, on the other hand, does so with average and excellent. Now, stacking the k-NN and decision tree models together, one might expect to obtain a predictor that accurately discerns all three classes of students." While a trial-and-error approach might prove this statement correct, the conventional wisdom used here does not suffice, hence the need to formally investigate the effectiveness of stacking.

The upper bound on the generalization error of a learning algorithm becomes tighter as its stability increases, i.e., as the value of the stability measure decreases. Throughout this section, we are going to show that, as opposed to the conventional heterogeneous nature of the stacking ensemble, its homogeneous counterpart, bagging, can instead be used on the first level of the stacking pipeline to further improve stability and, as a consequence, yield a tighter upper bound on the generalization error and hence improve performance.

Bag-stacking can be seen as a simple modification of stacking: instead of passing the predictions of different kinds of models to the combiner algorithm, we use several models of the same kind, learned on different training sets constructed via bootstrap sampling. Viewed from the other side, it is perhaps the simplest modification of bagging: the predictor aggregation step of bagging is omitted and replaced by the outcome of a combiner algorithm. This lends a homogeneity property to stacking and can be used to overcome shortcomings of both stacking (bootstrap sampling generates different training sets and thus reduces the variance of the stacking model) and bagging (stacking the bagged models introduces model weights, unlike bagging, which treats all base models equally).
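Putting the earlier pieces together, a minimal bag-stacking sketch (ours, not the paper's; same scikit-learn-style assumptions as before) replaces bagging's averaging with a logistic-regression combiner:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def bag_stack(A, X, y, T, rng=None):
        """Bag-stacking sketch: T bootstrap-trained copies of one base
        learner, with a combiner in place of bagging's plain aggregation."""
        rng = rng or np.random.default_rng()
        m = len(X)
        models = []
        for _ in range(T):
            idx = rng.integers(0, m, size=m)      # bootstrap sample D(r_t)
            models.append(A().fit(X[idx], y[idx]))
        X_tilde = np.column_stack([f.predict(X) for f in models])
        g = LogisticRegression().fit(X_tilde, y)  # weighted vote over members
        return lambda X_new: g.predict(
            np.column_stack([f.predict(X_new) for f in models]))

Dag-stacking would differ only in the sampling line, drawing p < m indices without replacement, e.g. rng.choice(m, size=p, replace=False).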
The results in [16] show that bag-stacking and its variation, dag-stacking, usually outperform both bagging and stacking. Bag-stacking, however, has never attained the success that bagging and stacking have had in the research community, though it is clear that it integrates their traits. The work on bag-stacking consists of a scant paper [16] that compares its error rate to those of bagging, stacking, and a special case referred to as dag-stacking, where subsets of unique points are used in place of bootstrap samples. The paper is, however, wanting in theoretical analysis as well as a formal explanation of bag-stacking and dag-stacking's effectiveness.

5.1 Stability of Stacking

In this part, we analyze the hypothesis stability of stacking according to Definition 4.3. Here, we use the classification loss given in Definition 4.1. The smooth ℓ₁ loss is also applicable because ℓ₁(y′, y) ≤ ℓ(y′, y) for all y, y′ ∈ Y. To analyze the interaction of hypothesis stability among the constituent models in stacking, we first look at the expected absolute loss difference with respect to z ∈ Z, i.e., E_z[|ℓ(f_D, z) − ℓ(f_{D^{\setminus i}}, z)|], and then take the expectation with respect to D in order to get E_{D,z}[|ℓ(f_D, z) − ℓ(f_{D^{\setminus i}}, z)|]. The goal is to bound from above E_z[|ℓ(f_D, z) − ℓ(f_{D^{\setminus i}}, z)|] and the changes in the outcome f_D(x) when z_i is removed from D. For instance, in the k-NN algorithm, the loss difference is bounded by the probability of the set of examples v_i whose nearest neighbor in the training set is z_i, that is, v_i = {z′ | dist(z′, z_i) ≤ dist(z′, z_j), ∀j ≠ i}; the loss difference depends on P(v_i), and thus for z ∈ Z it holds that E_z[|ℓ(f_D, z) − ℓ(f_{D^{\setminus i}}, z)|] ≤ P(v_i) [21, Example 1]. We apply the same logic here, noting that when dealing with ensembles, one needs to take into account the stability of the base learning algorithm(s).

To analyze the hypothesis stability of stacking, it is important to stress that (i) the base learning algorithms are applied independently of one another, and (ii) the combiner learning algorithm is applied independently of the base learning algorithms. Consequently, the stability of each base algorithm is independent of the rest, and the stability of the combiner algorithm is also independent of that of the base algorithms. Next, we express these statements formally: (i) for the outcomes f_D^{(t)}(x) and f_D^{(s)}(x) of each pair of base models, it holds that

E_z\left[ |\ell(f_D^{(t)}, z) - \ell(f_{D^{\setminus i}}^{(t)}, z)| \right] E_z\left[ |\ell(f_D^{(s)}, z) - \ell(f_{D^{\setminus i}}^{(s)}, z)| \right] \leq P_{f_D^{(t)}} P_{f_D^{(s)}}, \quad t \neq s,    (24)

which follows from multiplying the inequalities E_z[|ℓ(f_D^{(j)}, z) − ℓ(f_{D^{\setminus i}}^{(j)}, z)|] ≤ P_{f_D^{(j)}} and E_z[|ℓ(f_D^{(k)}, z) − ℓ(f_{D^{\setminus i}}^{(k)}, z)|] ≤ P_{f_D^{(k)}}, whereas (ii) for the outcome g_{D̃}(x) of the combiner algorithm and the outcome f_D(x) of a base algorithm, it holds that

E_z\left[ |\ell(f_D, z) - \ell(f_{D^{\setminus i}}, z)| \right] E_{\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P_{f_D} P_{g_{\tilde{D}}},    (25)

where D̃ = (f_D^{(t)}(D_X))_{t=1}^{T}.
Combining Equations (24) and (25) into a single equation for a stacking model yields the following bound on the expected absolute loss difference with respect to z̃ ∈ Z̃ such that, for z ∈ Z, z̃ has the same output, i.e., ỹ = y:

E_{\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P_{g_{\tilde{D}}} \prod_{t=1}^{T} P_{f_D^{(t)}}.    (26)

Finally, taking E_D of both sides of Equation (26) to obtain the hypothesis stability, we get

E_{D,\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq E_D\left[ P_{g_{\tilde{D}}} \prod_{t=1}^{T} P_{f_D^{(t)}} \right] = E_{\tilde{D}}[P_{g_{\tilde{D}}}] \prod_{t=1}^{T} E_D[P_{f_D^{(t)}}].    (27)

Note that taking E_D has the same effect as taking E_{D̃}. The expectations E_{D̃}[P_{g_{D̃}}] and E_D[P_{f_D^{(t)}}] are essentially the hypothesis stability expressions β(g_{D̃}) = E_{D̃}[P_{g_{D̃}}] and β(f^{(t)}) = E_D[P_{f_D^{(t)}}], as given in Definition 4.3. Finally, the hypothesis stability of stacking is

E_{D,\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq \beta(g_{\tilde{D}}) \prod_{t=1}^{T} \beta(f_D^{(t)}).    (28)

The rightmost equality in Equation (27) follows from statements (i) and (ii). In other words, the hypothesis stability of stacking is the product of the hypothesis stabilities of all the base models and the combiner. The independence between the base and combiner algorithms eases the computations. Equation (28) also shows that increasing the number of base models improves the stability of stacking. For example, consider a stacking ensemble in which a ridge regression classifier with penalty λ acts as the combiner and there are three base k-NN classifiers with k₁, k₂, and k₃ neighbors; their hypothesis stabilities are 1/(λm), k₁/m, k₂/m, and k₃/m, respectively. The hypothesis stability of this stacking ensemble is

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq \frac{k_1 k_2 k_3}{\lambda m^4}.    (29)

5.2 Stability of dag-stacking and bag-stacking

In bag-stacking, the only change is that the base models are trained on bootstrap samples drawn from D instead of on D itself. In addition, the bootstrap samples still allow one to use different base learning algorithms. When stacking is combined with bagging, it is easy to see that the merger of the two is a weighted bagging ensemble. We continue by analyzing the hypothesis stability of bag-stacking and dag-stacking.

5.2.1 Hypothesis stability of bag-stacking: why bootstrap sampling improves stacking

In this part, we describe how bootstrap sampling improves stacking. At the same time, Equation (32) gives the hypothesis stability of bag-stacking.

Recall that for any z = (x, y) ∈ Z, z̃ = ((f_D^{(t)}(x))_{t=1}^{T}, y) ∈ Z̃. Since the second coordinate of z and z̃ is the same, the expected value with respect to z, E_z, is proportional to the one with respect to z̃, E_{z̃}. Let N_i be the number of bootstrap samples in which the training example z_i appears. The probability of any training example appearing in a bootstrap sample is 0.632/m. Thus, N_i follows a Binomial distribution with p = 0.632/m and n = T. In ensemble learning, we are interested in whether N_i > s, for 1 ≤ s ≤ T; given that N_i ∼ B(0.632/m, T), we have a sum of Binomial probabilities,

P(N_i > s) = \sum_{k=s+1}^{T} \binom{T}{k} \left( \frac{0.632}{m} \right)^k \left( 1 - \frac{0.632}{m} \right)^{T-k}.    (30)

For bag-stacking, the expected absolute loss difference depends on whether z_i appears in more than half of the bootstrap samples, i.e., on P(N_i > T/2). Therefore, the hypothesis stability of stacking with bootstrap sampling is

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P(N_i > T/2) \, E_{\tilde{D}}[P_{g_{\tilde{D}}}] \prod_{t=1}^{T} E_D[P_{f_D^{(t)}}],    (31)

i.e.,

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P(N_i > T/2) \, \beta(g_{\tilde{D}}) \prod_{t=1}^{T} \beta(f_D^{(t)}),    (32)

where P(N_i > T/2) = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} (0.632/m)^k (1 - 0.632/m)^{T-k}.
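The binomial tail in Equation (30) is cheap to evaluate exactly with the standard library; a tiny sketch (ours; m and T are illustrative) of the stability factor in Equation (32):

    from math import comb, floor

    def tail_prob(T, p, s):
        """P(N_i > s) for N_i ~ Binomial(T, p), as in Equation (30)."""
        return sum(comb(T, k) * p**k * (1 - p)**(T - k)
                   for k in range(s + 1, T + 1))

    # Bag-stacking factor from Eq. (32), with p = 0.632/m per the derivation above
    m, T = 1_000, 25
    print(tail_prob(T, 0.632 / m, floor(T / 2)))  # very small: large stability gain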
In classical stacking, without bootstrap sampling, N_i = T, which means that P(N_i > T/2) = 1 and the hypothesis stability reduces to Equation (28). Therefore, we can conclude that bootstrap sampling improves the stability of stacking by a factor proportional to the order of P(N_i > T/2). The whole method is known as bag-stacking. Following the example given in Section 5.1, the hypothesis stability of a bag-stacking ensemble arranged in the same way as the stacking ensemble given there is

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P(N_i > 1) \frac{k_1 k_2 k_3}{\lambda m^4} = \frac{k_1 k_2 k_3}{\lambda m^4} \left( 1 - \sum_{k=0}^{1} \binom{3}{k} \left( \frac{0.632}{m} \right)^k \left( 1 - \frac{0.632}{m} \right)^{3-k} \right).    (33)

5.2.2 Hypothesis stability of dag-stacking: why subsampling improves stacking

In this part, we explain why subsampling improves stacking. At the same time, Equation (34) gives the hypothesis stability of dag-stacking. In dag-stacking, we use subsampling instead of bootstrap sampling on top of stacking: T subsamples of size p < m are drawn from D without replacement. Thus, the probability that a training example z_i appears in one of the subsamples is p/m instead of 0.632/m. The resulting hypothesis stability of dag-stacking is the same as in Equation (32), i.e., we have the same inequality

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P(N_i > T/2) \, \beta(g_{\tilde{D}}) \prod_{t=1}^{T} \beta(f_D^{(t)}),    (34)

except that, this time, P(N_i > T/2) = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} (p/m)^k (1 - p/m)^{T-k}. Again, we can conclude that subsampling improves the stability of stacking by a factor proportional to the order of P(N_i > T/2) with respect to p. For example, a dag-stacking ensemble in which the subsamples drawn from D are of size p has hypothesis stability

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq P(N_i > 1) \frac{k_1 k_2 k_3}{\lambda m^4} = \frac{k_1 k_2 k_3}{\lambda m^4} \left( 1 - \sum_{k=0}^{1} \binom{3}{k} \left( \frac{p}{m} \right)^k \left( 1 - \frac{p}{m} \right)^{3-k} \right).    (35)

5.3 Why stacking improves bagging and subbagging

In this part, we show that using a combiner on top of bagging or subbagging improves stability by a factor proportional to the hypothesis stability of the combiner learning algorithm. This way, the model becomes "weighted" bagging or "weighted" subbagging, because the ensemble members are not treated equally in the final majority vote. It is necessary to emphasize again that the stability of bagging (respectively, subbagging) and the stability of the combiner are independent. According to Equations (19) and (21) (the hypothesis stabilities of bagging and subbagging whose base algorithms have stabilities γ_k and γ_p, respectively), adding a combiner g_{D̃} trained on D̃ yields the following inequalities with respect to a B-Lipschitzian loss function ℓ:

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq B\beta(g_{\tilde{D}}) \sum_{k=1}^{m} \frac{k\gamma_k}{m} P_r[d(r) = k],    (36)

E_{\tilde{D},\tilde{z}}\left[ |\ell(g_{\tilde{D}}, \tilde{z}) - \ell(g_{\tilde{D}^{\setminus i}}, \tilde{z})| \right] \leq 2B\beta(g_{\tilde{D}}) \frac{\gamma_p p}{m}.    (37)

The factor β(g_{D̃}) on the right-hand side is the hypothesis stability of the combiner algorithm. It follows immediately that stacking improves the stability of bagging and subbagging by a factor proportional to the stability of the combiner algorithm. For instance, if we use k-NN as the combiner, which has hypothesis stability k/m, it improves the stability of bagging and subbagging by a factor proportional to O(1/m), which, in theory, is a significant improvement.
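To see the three stability factors side by side, the following sketch (ours; all constants are illustrative, and tail_prob is the helper defined in the earlier sketch) evaluates Equations (29), (33), and (35) for the running example:

    from math import comb

    def tail_prob(T, p, s):  # as in the earlier sketch
        return sum(comb(T, k) * p**k * (1 - p)**(T - k)
                   for k in range(s + 1, T + 1))

    # Running example from Section 5.1: ridge combiner (penalty lam) and
    # three k-NN base models with k1, k2, k3 neighbors (so T = 3)
    m, lam, k1, k2, k3, T, p = 1_000, 1.0, 3, 5, 7, 3, 100
    stacking = k1 * k2 * k3 / (lam * m**4)                 # Eq. (29)
    bag_stacking = tail_prob(T, 0.632 / m, 1) * stacking   # Eq. (33)
    dag_stacking = tail_prob(T, p / m, 1) * stacking       # Eq. (35)
    print(stacking, bag_stacking, dag_stacking)            # smaller = more stable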
To summarize, Equations (32) and (34), compared to Equation (28), imply that a larger number of base models T improves the hypothesis stability of all three approaches: stacking, bag-stacking, and dag-stacking. Conversely, bagging and subbagging improve the stability of the base learning algorithm, while adding a combiner on top improves it even further.

6 Conclusion

In this paper, we studied the effectiveness of stacking, bag-stacking, and dag-stacking from the perspective of algorithmic stability. This aspect allowed us to formally study the performance of stacking by analyzing its hypothesis stability and establishing a connection to bag-stacking and dag-stacking. Additionally, stacking turned out to improve stability by a factor of O(1/m) when the combiner learning algorithm is stable, whereas subsampling/bootstrap sampling (within stacking) improved it even further. Finally, we found that the converse holds as well: stacking improves the stability of both subbagging and bagging.

References

[1] David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
[2] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
[4] Fatai Anifowose, Jane Labadin, and Abdulazeez Abdulraheem. Improving the prediction of petroleum reservoir characterization with a stacked generalization ensemble model of support vector machines. Applied Soft Computing, 26:483–496, 2015.
[5] Wenchao Yu, Fuzhen Zhuang, Qing He, and Zhongzhi Shi. Learning deep representations via extreme learning machines. Neurocomputing, 149:308–315, 2015.
[6] Luis F. Nicolas-Alonso, Rebeca Corralejo, Javier Gomez-Pilar, Daniel Álvarez, and Roberto Hornero. Adaptive stacked generalization for multiclass motor imagery-based brain computer interfaces. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 23(4):702–712, 2015.
[7] Wanli Xing, Xin Chen, Jared Stein, and Michael Marcinkowski. Temporal predication of dropouts in MOOCs: Reaching the low hanging fruit through stacking generalization. Computers in Human Behavior, 58:119–129, 2016.
[8] Samir Bhatt, Ewan Cameron, Seth R. Flaxman, Daniel J. Weiss, David L. Smith, and Peter W. Gething. Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization. Journal of The Royal Society Interface, 14(134):20170520, 2017.
[9] Sean P. Healey, Warren B. Cohen, Zhiqiang Yang, C. Kenneth Brewer, Evan B. Brooks, Noel Gorelick, Alexander J. Hernandez, Chengquan Huang, M. Joseph Hughes, Robert E. Kennedy, et al. Mapping forest change using stacked generalization: An ensemble approach. Remote Sensing of Environment, 204:717–728, 2018.
[10] Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13(3):917–1007, 2018.
[11] Shervin Malmasi and Mark Dras. Native language identification with classifier stacking and ensembles. Computational Linguistics, 44(3):403–446, 2018.
[12] Yufei Xia, Chuanzhe Liu, Bowen Da, and Fangming Xie. A novel heterogeneous ensemble credit scoring model based on bstacking approach. Expert Systems with Applications, 93:182–199, 2018.
[13] Rémy Peyret, Ahmed Bouridane, Fouad Khelifi, Muhammad Atif Tahir, and Somaya Al-Maadeed. Automatic classification of colorectal and prostatic histologic tumor images using multiscale multispectral local binary pattern texture features and stacked generalization. Neurocomputing, 275:83–93, 2018.
[14] Ruobing Wang. Significantly improving the prediction of molecular atomization energies by an ensemble of machine learning algorithms and rescanning input space: A stacked generalization approach. The Journal of Physical Chemistry C, 122(16):8868–8873, 2018.
[15] Kai Ming Ting and Ian H. Witten. Stacked generalization: when does it work? 1997.
[16] Kai Ming Ting and Ian H. Witten. Stacking bagged and dagged models. 1997.
[17] Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
[18] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[19] Amiran Ambroladze, Emilio Parrado-Hernández, and John S. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems, pages 9–16, 2007.
[20] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[21] André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(Jan):55–79, 2005.
[22] Samuel Kutin and Partha Niyogi. The interaction of stability and weakness in AdaBoost. Technical Report TR-2001-30, 2001.
[23] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[24] Robert E. Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[25] Nino Arsov, Martin Pavlovski, Lasko Basnarkov, and Ljupco Kocarev. Generating highly accurate prediction hypotheses through collaborative ensemble learning. Scientific Reports, 7:44649, 2017.
[26] Martin Pavlovski, Fang Zhou, Nino Arsov, Ljupco Kocarev, and Zoran Obradovic. Generalization-aware structured regression towards balancing bias and variance. In IJCAI, pages 2616–2622, 2018.
[27] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
[28] Sanjoy Dasgupta, Michael L. Littman, and David A. McAllester. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems, pages 375–382, 2002.