A probabilistic classifier ensemble weighting scheme
James Large · Jason Lines · Anthony Bagnall
https://doi.org/10.1007/s10618-019-00638-y
Received: 21 August 2017 / Accepted: 3 June 2019 / Published online: 17 June 2019
© The Author(s) 2019
Abstract
Our hypothesis is that building ensembles of small sets of strong classifiers constructed
with different learning algorithms is, on average, the best approach to classification
for real-world problems. We propose a simple mechanism for building small het-
erogeneous ensembles based on exponentially weighting the probability estimates of
the base classifiers with an estimate of the accuracy formed through cross-validation
on the train data. We demonstrate through extensive experimentation that, given the
same small set of base classifiers, this method has measurable benefits over commonly
used alternative weighting, selection or meta-classifier approaches to heterogeneous
ensembles. We also show that an ensemble of five well-known, fast classifiers is not
significantly worse than large homogeneous ensem-
bles and tuned individual classifiers on datasets from the UCI archive. We provide
evidence that the performance of the cross-validation accuracy weighted probabilistic
ensemble (CAWPE) generalises to a completely separate set of datasets, the UCR time
series classification archive, and we also demonstrate that our ensemble technique can
significantly improve the state-of-the-art classifier for this problem domain. We inves-
tigate the performance in more detail, and find that the improvement is most marked
in problems with smaller train sets. We perform a sensitivity analysis and an ablation
study to demonstrate the robustness of the ensemble and the significant contribution
of each design element of the classifier. We conclude that it is, on average, better to
ensemble strong classifiers with a weighting scheme rather than perform extensive
tuning and that CAWPE is a sensible starting point for combining classifiers.
Corresponding author: Anthony Bagnall
[email protected]
Extended author information available on the last page of the article
1 Introduction
We compare the difference in performance between combining a set of classifiers with CAWPE and picking the best of
them based on the train set of any given dataset (Sect. 6.1). To better understand the
nature of the improvements, we also carry out an ablation study that builds up from
simple majority voting to CAWPE (Sect. 6.2), and perform a sensitivity analysis of
CAWPE’s parameter, α (Sect. 6.3). Finally, we conclude in Sect. 7. Our conclusion is
that it is, on average, better to ensemble the probability estimates of strong classifiers
with a weighting scheme based on cross-validated estimates of accuracy than expend
resources on a large amount of tuning of a single classifier and that the CAWPE scheme
means that classifiers can be incrementally added to the ensemble with very little extra
computational cost.
2 Background
We use the following notation. A dataset D of size n is a set of attribute vectors with an
associated observation of a class variable (the response), D = {(x1 , y1 ), . . . , (xn , yn )},
where the class variable has c possible values, y ∈ {1, . . . , c} and we assume there are
m attributes, xi = {xi,1, . . . , xi,m}. A learning algorithm L takes a training dataset
Dr and constructs a classifier or model M. To avoid any ambiguity, we stress that
all model selection, parameter tuning and/or model fitting that may occur with any
classifier are conducted on the train set, which may or may not require nested cross-
validation. The final model M produced by L by training on Dr is evaluated on a test
dataset De . A classifier M is a mapping from the space of possible attribute vectors
to the space of possible probability distributions over the c valid values of the class
variable, M(x) = p̂, where p̂ = { p̂(y = 1|M, x), . . . , p̂(y = c|M, x)}. Given p̂, the
estimate of the response is simply the value with the maximum probability.
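For concreteness, this notation can be read directly as code. The following Python sketch is illustrative only (it is not the authors' Java/Weka implementation, and the function name is hypothetical): a trained classifier returns a probability distribution over the class values, and the response estimate is its argmax.

```python
import numpy as np

# A trained classifier M maps an attribute vector x (length m) to a probability
# distribution p_hat over the c possible class values.
def predict_proba_example(x):
    # Hypothetical model output for a c = 3 class problem: p_hat(y = i | M, x)
    return np.array([0.2, 0.5, 0.3])

x = np.zeros(10)                  # an attribute vector with m = 10 attributes
p_hat = predict_proba_example(x)
y_hat = int(np.argmax(p_hat))     # the estimate of the response is the value with
                                  # maximum probability (0-based here, 1..c in the text)
```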
a role. Broadly speaking, diversity can be engineered by either changing the training
data or training scheme for each of a set of the same base classifier to form a homoge-
neous ensemble or by employing different classification algorithms to train each base
classifier, forming a heterogeneous ensemble.
Heterogeneous ensemble design focuses on how to use the output of the base classi-
fiers to form a prediction for a new case: given k predictions {ŷ1, . . . , ŷk} or k
probability distributions { p̂1 , . . . , p̂k }, how to produce a single prediction ŷ or proba-
bility distribution p̂. There are three core approaches: define a weighting function on
the model output (weighting schemes); select a subset of the models and ignore other
output (ensemble selection schemes); or build a model on the training output of the
models (stacking) (Re and Valentini 2011).
The family of techniques most similar to our approach are weighted combination
schemes, which estimate a weight w j for each base classifier and then apply it to their
predictions. Base classifier predictions multiplied by some weight are summed,
$$s_i = \sum_{j=1}^{k} w_j \cdot d(i, \hat{y}_j), \quad \text{where} \quad d(a,b) = \begin{cases} 1, & \text{if } a = b\\ 0, & \text{otherwise,} \end{cases}$$

$$\hat{y} = \arg\max_{i \in \{1,\ldots,c\}} s_i.$$
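The weighted vote above can be written as a short function. The following numpy sketch is illustrative only and assumes the weights and component predictions are already available.

```python
import numpy as np

def weighted_majority_vote(predictions, weights, n_classes):
    """Weighted vote over k component predictions: returns
    argmax_i sum_j w_j * d(i, y_hat_j), with d(a, b) = 1 iff a == b."""
    scores = np.zeros(n_classes)
    for y_hat_j, w_j in zip(predictions, weights):
        scores[y_hat_j] += w_j        # d(i, y_hat_j) is non-zero only for i == y_hat_j
    return int(np.argmax(scores))

# Three classifiers predict classes 0, 2 and 2 with weights 0.9, 0.6 and 0.7.
print(weighted_majority_vote(np.array([0, 2, 2]), np.array([0.9, 0.6, 0.7]), 3))  # -> 2
```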
4. Naive Bayes combiner (NBC): The Naive Bayes combiner uses the conditional
distributions to form an overall distribution, assuming conditional independence.
$$\hat{p}(y = i \mid \hat{y}_1, \ldots, \hat{y}_k) \propto \hat{p}(y = i \mid \hat{y}_1) \cdot \hat{p}(y = i \mid \hat{y}_2) \cdots \hat{p}(y = i \mid \hat{y}_k),$$
where the probability estimates are derived directly from the train cross-validation
confusion matrix. The final prediction is the index of the maximum probability.
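A minimal sketch of the Naive Bayes combiner as described, assuming the train cross-validation confusion matrix of each component is available (illustrative Python, not the authors' implementation):

```python
import numpy as np

def naive_bayes_combiner(confusions, predictions):
    """confusions[j][i, v]: number of class-i train cases that classifier j
    predicted as class v during the train cross-validation.
    predictions: the k predicted class indices for a new case.
    Returns the index of the maximum combined probability."""
    c = confusions[0].shape[0]
    combined = np.ones(c)
    for conf, v in zip(confusions, predictions):
        col = conf[:, v] + 1e-6          # small constant avoids zero probabilities
        combined *= col / col.sum()      # estimate of p_hat(y = i | y_hat_j = v)
    return int(np.argmax(combined))
```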
2.1.3 Stacking
Homogeneous ensemble design focuses more on how to diversify the base classifiers
than on how to combine outputs. Popular homogeneous ensemble algorithms based on
sampling cases or attributes include: Bagging decision trees (Breiman 1996); Random
Committee, a technique that creates diversity through randomising the base classi-
fiers, which are a form of random tree; Dagging (Ting and Witten 1997); Random
Forest (Breiman 2001), which combines bootstrap sampling with random attribute
selection to construct a collection of unpruned trees; and Rotation Forest (Rodriguez
et al. 2006), which involves partitioning the attribute space then transforming into the
principal components space. Of these, we think it fair to say Random Forest is by far
the most popular. These methods combine outputs through a majority vote scheme,
which assigns an equal weight to the output of each model.
Boosting ensemble algorithms seek diversity through iteratively re-weighting the
training cases and are also very popular. These include AdaBoost (Adaptive Boost-
ing) (Freund and Schapire 1996), which iteratively re-weights based on the training
error of the base classifier; Multiboost (Webb 2000), a combination of a boosting
strategy (similar to AdaBoost) and Wagging, a Poisson weighted form of Bagging;
LogitBoost (Friedman et al. 1998) which employs a form of additive logistic regres-
sion; and gradient boosting algorithms (Friedman 2001), which have become popular
through the performance of recent incarnations such as XGBoost (Chen 2016). Boost-
ing algorithms also produce a weighting for each classifier in addition to iteratively
re-weighting instances. This weight is usually derived from the training process
of the base classifier, which may involve regularisation if cross-validation is not used.
The key features that define the weighting scheme we propose in the context of other
commonly used weighting schemes such as those described above are that, firstly, we
weight with accuracy estimated through cross-validation instead of a single hold-out
validation set, secondly, we accentuate differences in accuracy estimates by raising
each estimate to the power of α and thirdly, we weight the probability outputs of the
base classifiers instead of the predictions. To clarify, prediction weighting takes just
the prediction from each member classifier,
$$\hat{p}(y = i \mid E, x) \propto \sum_{j=1}^{k} w_j \, d(i, \hat{y}_j),$$

whereas probability weighting uses the full distribution produced by each member,

$$\hat{p}(y = i \mid E, x) \propto \sum_{j=1}^{k} w_j \, \hat{p}_j(y = i \mid M_j, x). \qquad (1)$$
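The contrast between the two schemes can be made concrete with a short sketch (illustrative numpy code, assuming the weights and the base classifiers' probability estimates for a single test case are already available).

```python
import numpy as np

def weight_predictions(probas, weights):
    """Prediction weighting: each classifier contributes its weight only to
    its single predicted class (the first equation above)."""
    k, c = probas.shape
    scores = np.zeros(c)
    for j in range(k):
        scores[np.argmax(probas[j])] += weights[j]
    return scores / scores.sum()

def weight_probabilities(probas, weights):
    """Probability weighting (Eq. 1): each classifier contributes its whole
    probability distribution, scaled by its weight."""
    scores = weights @ probas          # sum_j w_j * p_hat_j(y = i | M_j, x)
    return scores / scores.sum()

# Two classifiers, three classes: the second classifier is unsure (0.4 vs 0.35);
# probability weighting retains that information, prediction weighting discards it.
probas = np.array([[0.1, 0.2, 0.7],
                   [0.4, 0.35, 0.25]])
weights = np.array([0.8, 0.7])
print(weight_predictions(probas, weights))
print(weight_probabilities(probas, weights))
```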
Fig. 1 Illustration of the different effects of combination and weighting schemes on a toy instance classifi-
cation. Each stage progressively pushes the predicted class probabilities further in the correct direction for
this prediction
Hence we use cross-validation, as opposed to the single validation set used in a number
of previous works (Kohavi 1995).
The optimal value of α will therefore allow the strongest classifiers to steer the
ensemble, but enable them to be overruled when sufficiently outvoted. This value will
be dependent on the relative performances and distribution of probabilistic outputs
of the base classifiers on the given dataset. To keep in line with the general ethos of
simplicity, we remove the need to tune α and potentially overfit it by fixing α to 4 for
all experiments and all component structures presented. We chose the value 4 fairly
arbitrarily as a sensible starting point before running any experiments. In Sect. 6 we
revisit the importance of the α parameter and whether it could benefit from tuning, as
well as other design decisions we have made.
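Putting these design choices together, a minimal sketch of the whole scheme, written against scikit-learn-style components, is given below. This is illustrative only and is not the authors' Java/Weka implementation; the class name is hypothetical, ten-fold cross-validation is used to estimate accuracy, and α is fixed to 4 as in the text.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

class CAWPESketch:
    """Illustrative sketch of the CAWPE scheme (not the authors' Java/Weka code):
    weight each base classifier by its cross-validation accuracy on the train
    data raised to the power alpha, then combine weighted probability estimates."""

    def __init__(self, base_classifiers, alpha=4, cv=10):
        self.base_classifiers = base_classifiers
        self.alpha = alpha
        self.cv = cv

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_, weights = [], []
        for clf in self.base_classifiers:
            acc = cross_val_score(clone(clf), X, y, cv=self.cv,
                                  scoring="accuracy").mean()
            weights.append(acc ** self.alpha)          # accentuate differences
            self.models_.append(clone(clf).fit(X, y))  # final model on all train data
        self.weights_ = np.asarray(weights)
        return self

    def predict_proba(self, X):
        # Weighted sum of the base classifiers' probability estimates (Eq. 1).
        combined = sum(w * m.predict_proba(X)
                       for w, m in zip(self.weights_, self.models_))
        return combined / combined.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

Any set of classifiers exposing fit and predict_proba can be passed in; beyond the fixed α, the ensemble itself has nothing to tune.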
4 Experimental design
The UCI dataset archive1 is widely used in the machine learning and data mining
literature. An extensive evaluation of 179 classifiers on 121 datasets from the UCI
archive, including different implementations of notionally the same classifier, was
performed by Fernández-Delgado et al. (2014). It is worth mentioning there have
been several problems identified with the experimental procedure used in this study
(see Wainberg et al. 2016 for a critique). Firstly, some algorithms were tuned, others
were used with the built-in default parameters, which are often poor. For example,
random forest in Weka defaults to 10 trees. Secondly, for some of the tuned algorithms,
there was an overlap between validation and test datasets, which will have introduced
bias. Thirdly, the data were formatted to contain only real valued attributes, with the
categorical attributes in some data sets being naively converted to real values. We
retain this formatting in order to maintain consistency with previous research but
this may bias against certain types of classifier. Comparisons between heterogeneous
ensembles should be entirely unaffected, since they are all built on the same base
classifier prediction information. We have no prior belief as to the impact of the
formatting on other base classifiers and in order to avoid any suggestion of a priori
bias, we use the exact same 121 datasets. A summary of the data is provided in Table 4
in the “Appendix”.
1 http://archive.ics.uci.edu/ml/index.php.
The UCR archive is a continually growing collection of real valued time series
classification (TSC) datasets.2 A recent study by Bagnall et al. (2017) implemented 18
state-of-the-art TSC classifiers within a common framework and evaluated them on 85
datasets in the archive. The best performing algorithm, the collective of transformation-
based ensembles (COTE), was a heterogeneous ensemble of strong classifiers. These
results were our primary motivation for further exploring heterogeneous ensembles
for classification problems in general.
We aim to use this data to test the generality of some of the core results obtained on
the UCI archive, serving as an independent collection of data with entirely different
characteristics and separate from the problems with the UCI data described previously.
A summary of this data is provided in Table 5 in the “Appendix”.
Experiments are conducted by averaging over 30 stratified resamples. Data, results
and code can all be found at the accompanying website for this research.3 For the UCI
data, 50% of the data is taken for training and 50% for testing. Therefore there is no overlap
between train and test data of the kind previously observed by Wainberg et al. (2016), and the data can be
used in a similar manner to Wainer and Cawley (2017) without introducing bias. The
UCR archive provides a default train/test split. We perform resamples using the number
of train and test cases defined in these default splits. We always compare classifiers
on the same resamples, and these can be exactly reproduced with the published code.
Resample creation is deterministic and can be reproduced using the method
Experiments.sampleDataset(directory,datasetName,foldID), or
alternatively the initial train/test split and all resampled folds can be downloaded.
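The exact folds used in the paper come from that method or the downloadable splits; purely as an illustration of the kind of seeded, stratified resampling described, a sketch with scikit-learn (hypothetical helper name) follows.

```python
from sklearn.model_selection import train_test_split

def stratified_resample(X, y, train_size, fold_id):
    # The fold id seeds the split, so the same stratified (train, test) partition
    # is recreated on every run. Illustrative only; the paper's exact folds come
    # from Experiments.sampleDataset or the downloadable files.
    return train_test_split(X, y, train_size=train_size, stratify=y,
                            random_state=fold_id)

# e.g. 30 resamples with a 50/50 split, as used for the UCI data:
# splits = [stratified_resample(X, y, 0.5, fold) for fold in range(30)]
```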
This means we can compare two classifiers with paired two-sample tests, such as the
Wilcoxon signed-rank test. For comparing two classifiers on multiple datasets we
compare either the number of datasets where there is a significant difference over
resamples, or we can do a pairwise comparison of the average errors over all resam-
ples. All code is available and open source. The experiments can be reproduced (see
class vector_classifiers.CAWPE). In the course of experiments we have gen-
erated gigabytes of prediction information and results. These are available in raw
format and in summary spreadsheets. For comparing multiple classifiers on multiple
datasets, we follow the recommendation of Demšar (2006) and use the Friedman test
to determine if there are any statistically significant differences in the rankings of the
classifiers. However, following recent recommendations by Benavoli et al. (2016) and
García and Herrera (2008), we have abandoned the Nemenyi post-hoc test originally
used by Demšar (2006) to form cliques (groups of classifiers within which there is
no significant difference in ranks). Instead, we compare all classifiers with pairwise
Wilcoxon signed-rank tests, and form cliques using the Holm correction (which adjusts
family-wise error less conservatively than a Bonferroni adjustment).
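As an illustration of this testing procedure, the sketch below runs the pairwise Wilcoxon signed-rank tests and applies a Holm step-down correction; it is a simplified stand-in for the clique-forming code used in the experiments, with all names hypothetical.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

def pairwise_wilcoxon_holm(scores, names, alpha=0.05):
    """scores: (n_classifiers x n_datasets) array of e.g. average accuracies.
    Returns, for each pair of classifiers, whether the difference is significant
    after a Holm step-down correction of the Wilcoxon signed-rank p-values."""
    pairs = list(combinations(range(len(names)), 2))
    pvals = np.array([wilcoxon(scores[i], scores[j]).pvalue for i, j in pairs])
    order = np.argsort(pvals)                     # smallest p-value first
    significant = np.zeros(len(pairs), dtype=bool)
    m = len(pairs)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):      # Holm's adjusted threshold
            significant[idx] = True
        else:
            break                                 # remaining pairs are not significant
    return {(names[i], names[j]): bool(sig) for (i, j), sig in zip(pairs, significant)}
```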
We assess classifier performance by four statistics of the predictions and the proba-
bility estimates. Predictive power is assessed by test set error and balanced test set error.
The quality of the probability estimates is measured with the negative log likelihood
(NLL). The ability to rank predictions is estimated by the area under the receiver
operating characteristic curve (AUC). For problems with two classes, we treat the minority
2 http://www.timeseriesclassification.com.
3 http://www.timeseriesclassification.com/CAWPE.php.
class as a positive outcome. For multiclass problems, we calculate the AUC for each
class and weight it by the class frequency in the train data, as recommended by Provost
and Domingos (2003).
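A sketch of this AUC computation (illustrative Python assuming scikit-learn and 0-based class indices; not the evaluation code used for the published results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_multiclass_auc(y_test, probas, y_train, classes):
    """One-vs-rest AUC per class, weighted by the class frequencies in the train
    data. probas[:, k] is the probability assigned to classes[k] for each test case."""
    train_freq = np.array([(y_train == c).mean() for c in classes])
    aucs = [roc_auc_score((y_test == c).astype(int), probas[:, k])
            for k, c in enumerate(classes)]
    return float(np.average(aucs, weights=train_freq))
```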
5 Results
Throughout, we make the associated point that CAWPE is significantly better than
its components when they are approximately equivalent. CAWPE has a single param-
eter, α, which is set to the default value of 4 for all experiments. We stress that we
perform no tuning of CAWPE’s parameter α: it simply combines classifier output using
the algorithm described in Algorithm 1. We investigate the sensitivity of CAWPE to
α in Sect. 6.3.
We present results in this section through critical difference diagrams which display
average rankings. A full list of the average scores for each classifier is provided in
Table 6 in the “Appendix”, while further spreadsheets are available on the accompa-
nying website.
Ensembling multiple classifiers inherently involves more work than using any single
one of them. As a basic sanity check, we assess whether applying CAWPE to a random
set of classifiers improves performance. We randomly sampled 5 out of 22 classifiers
available in Weka and constructed CAWPE on top of them. Over 200 random configu-
rations, CAWPE was significantly more accurate than the individual component with
the best average rank on 143 (71.5%), and more accurate, though not significantly so, on a further
34 (17%), over the 121 UCI datasets. CAWPE was never significantly worse than the
best individual component. Note that many of these sets contain components that are
significantly different, with average accuracies across the archive ranging between
62.7 and 81.4%.
To avoid confusion as to the components of any CAWPE instantiation, we con-
tinue the evaluation with two sets of base classifiers. The first, simpler set contains
well known classifiers that are fast to build. These are: logistic regression (Logistic);
C4.5 decision tree (C4.5); linear support vector machine (SVML); nearest neighbour
classifier (NN); and a multilayer perceptron with a single hidden layer (MLP1). These
Fig. 2 Critical difference diagrams CAWPE-S with its base classifiers (left), and CAWPE-A with its base
classifiers (right). Ranks formed on test set accuracy averaged over 30 resamples
classifiers are each distinct in their method of modelling the data, and are roughly
equivalent in performance. We call this version CAWPE-S.
The second set of five classifiers are more complex, and generally considered
more accurate than the previous set. These are: random forest (RandF); rotation forest
(RotF); a quadratic support vector machine (SVMQ); a multi layer perceptron imple-
mentation with two hidden layers (MLP2); and extreme gradient boosting (XGBoost).
We call CAWPE built on this second set of advanced classifiers CAWPE-A.
In Fig. 2 we compare CAWPE-A and CAWPE-S against their respective base clas-
sifiers in terms of accuracy. In both cases, CAWPE is significantly better than all
components. CAWPE also significantly improves on all the base components in terms
of balanced accuracy, AUROC, and log likelihood.
The improvement is not particularly surprising for CAWPE-S, since the benefits of
ensembling weaker learners are well known. It is perhaps more noteworthy, however,
that learners often considered state-of-the-art, such as random forest, rotation forest and
XGBoost, are improved by inclusion in the CAWPE-A ensemble. This improvement is
achieved at a computational cost. The CAWPE scheme will require more computation
than using a single classifier, since a cross-validation procedure is required for each
base classifier. If a ten-fold cross-validation is used, as we do in all our experiments,
CAWPE requires approximately 50 times longer to train than the average training
time of its five base classifiers. In terms of time taken to predict a new test case,
CAWPE simply needs five times the average prediction time of the base classifiers.
We have experimentally verified this is the case, but exclude results for brevity (see
the associated webpage). This constant time overhead is easy to mitigate against: it is
trivial to distribute CAWPE’s base classifiers and even the cross-validation for each
classifier can easily be parallelised.
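As a sketch of this mitigation (illustrative only, using the standard library process pool and scikit-learn-style components; the helper names are hypothetical), the per-classifier cross-validation and final fit can be distributed as follows.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def _estimate_and_fit(clf, X, y, cv=10):
    # Work unit for one base classifier: the cross-validation accuracy estimate
    # plus the final model trained on the full train data.
    acc = cross_val_score(clone(clf), X, y, cv=cv, scoring="accuracy").mean()
    return acc, clone(clf).fit(X, y)

def fit_bases_in_parallel(base_classifiers, X, y, alpha=4, workers=5):
    # The per-classifier work is independent, so with one worker per classifier the
    # overhead approaches a single ten-fold cross-validation of the slowest component.
    k = len(base_classifiers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(_estimate_and_fit, base_classifiers, [X] * k, [y] * k))
    accs, models = zip(*results)
    return np.asarray(accs) ** alpha, list(models)
```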
We compare the particular weighting scheme used in CAWPE to well known alterna-
tives. We compare CAWPE-S and CAWPE-A to the weighting, selection and stacking
approaches described in Sect. 2. In each comparison, all ensembles use the same set
of base classifiers, so the only source of variation is the ensemble scheme. Algorithms
such as ensemble selection were originally described as using a single validation set
to assess models. However, cross-validation will on average give a better estimate of
the true error than a single hold-out validation set (Kohavi 1995). Given that CAWPE
uses cross-validation error estimates and that these estimates are already available to
us, we also use these for all ensembles. Hence, we are purely testing the ability of the
ensembles to combine predictions with exactly the same meta-information available.
Figure 3 shows the summary ranks of ten heterogeneous ensembles built on the
simpler classifier set on the 121 UCI datasets using four performance metrics. CAWPE-
S is highest ranked for error and in the top clique for both error and balanced error. It
is significantly better than all other approaches for AUC and NLL. It has significantly
lower error than all but SMLR, and significantly lower balanced error than all but
NBC.
Figure 4 shows the summary ranks of the same ten heterogeneous ensembles on the
121 UCI datasets using the more advanced classifiers. The pattern of results is very
similar to those for the simple classifiers. CAWPE-A is top ranked for error and in
a clique with majority vote and weighted majority vote. For balanced error, it is not
significantly different to NBC and is significantly better than the others. For both AUC
and NLL, it is significantly better than all the other methods. Considering the results
for both CAWPE-S and CAWPE-A, it is apparent that the CAWPE scheme is more
consistent than other approaches, since it is the only algorithm in the top clique for all
measures for both sets of classifiers. We think this suggests that the CAWPE scheme
on this data is the best heterogeneous ensemble technique, at least for the simple and
advanced component sets studied.
Given the ensembles are using the same base classifiers and accompanying error
estimates, and these are all good classifiers in their own right, we would expect the
actual differences in average error to be small, and this is indeed the case (see Table 6 in the "Appendix").
Table 1 Summaries of train times for CAWPE-S and the homogeneous ensembles
Table 1 summarises the train times of CAWPE-S and the homogeneous ensembles
in seconds. CAWPE on this simpler component set has a much larger mean train
time than RandF and XGBoost. This largely comes down to the logistic regression
component, which takes much longer to train on datasets with
larger numbers of classes. The median times are closer; however, XGBoost especially
still achieves predictive performance not significantly different to that of CAWPE-S
in much shorter times on average.
These timings should be interpreted with the understanding that XGBoost is a
highly optimised library, while the logistic regression and MLP1 components in particular
are relatively straightforward, unoptimised implementations in Java. The fact that
CAWPE-S has a median train time within the same order of magnitude as XGBoost
while not being significantly less accurate is, we think, a positive result.
In Sect. 5.1 we showed the ensemble scheme outperforms its set of base classifiers.
However, finding the weights requires an order of magnitude more work than build-
ing a single classifier because of the ten-fold cross-validation across the different
components. Given it is widely accepted that tuning parameters on the train data
can significantly improve classifier accuracy (Bagnall and Cawley 2017), perhaps a
carefully tuned classifier will do as well as or better than CAWPE built on untuned
classifiers. To investigate whether this is the case, we tune an SVM with a radial
Table 2 Tuning parameter ranges for SVMRBF, random forest, MLP and XGBoost

Classifier (combinations)   Parameter                          Values
Random forest (1000)        Number of trees (10 values)        {10, 100, 200, . . . , 900}
                            Feature subset size (10 values)    {√m, (log2 m + 1), m/10, m/9, . . . , m/3}
                            Max tree depth (10 values)         {0, m/9, m/8, . . . , m}
XGBoost (625)               Number of trees (5 values)         {50, 100, 250, 500, 1000}
                            Learning rate (5 values)           {0.01, 0.05, 0.1, 0.2, 0.3}
                            Max tree depth (5 values)          {2, 4, 6, 8, 10}
                            Min child weight (5 values)        {1, 3, 5, 7, 9}

c is the number of classes and m the number of attributes
Fig. 6 Average ranked errors for a CAWPE-S and b CAWPE-A against four tuned classifiers on 117 datasets
in the UCI archive. The datasets adult, chess-krvk, miniboone and magic are omitted due to computational
constraints
basis function kernel (SVMRBF), XGBoost, MLP and a random forest and compare
the results to CAWPE-S and CAWPE-A. We tune by performing a ten-fold cross-
validation on each train resample for a large number of possible parameter values,
described in Table 2. This requires a huge computational effort. We can distribute
resamples and parameter combinations over a reasonably sized cluster. Even so, con-
siderable computation is required; we were unable to complete a full parameter search
for 4 datasets (within a 7 day limit): adult; chess-krvk; miniboone; and magic. To
avoid bias, we perform this analysis without these results.
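To illustrate the scale of this search, the XGBoost ranges from Table 2 expand as follows (a sketch using the xgboost Python wrapper's parameter names, which are an assumption here rather than the exact configuration used in the experiments).

```python
from itertools import product

# XGBoost ranges from Table 2: 5 x 5 x 5 x 5 = 625 parameter combinations.
xgb_grid = {
    "n_estimators":     [50, 100, 250, 500, 1000],
    "learning_rate":    [0.01, 0.05, 0.1, 0.2, 0.3],
    "max_depth":        [2, 4, 6, 8, 10],
    "min_child_weight": [1, 3, 5, 7, 9],
}

combos = [dict(zip(xgb_grid, values)) for values in product(*xgb_grid.values())]
print(len(combos))  # 625; with ten-fold cross-validation, 6250 models per resample
```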
Figure 6 compares CAWPE-S and CAWPE-A to tuned versions of MLP, XGBoost,
RandF and SVM. On average, CAWPE-S, containing the five simpler untuned base
classifiers (Logistic, C4.5, SVML, NN and MLP1), is significantly better than the
tuned MLP and not significantly worse than tuned versions of XGBoost, SVMRBF
and Random Forest (Fig. 6a). The highest ranked tuned classifier is SVM, but it is
still ranked lower than CAWPE-S. This is despite the fact that CAWPE-S is two orders
of magnitude faster than the tuned SVM and at least one order of magnitude faster
than tuned Random Forest, MLP and XGBoost. Sequential execution of CAWPE-S
for miniboone (including all internal cross-validation to find the weights) takes 5 hours.
For TunedSVM, ten-fold cross-validation on 1089 different parameter combinations
gives 10,890 models trained for each resample of each dataset. For the slowest dataset
(miniboone), sequential execution would have taken more than 6 months. Of course,
such extensive tuning may not be necessary. However, the amount and exact method
of tuning to perform is in itself very hard to determine. Our observation is that using a
simple approach such as CAWPE-S entirely avoids the problem of guessing how much
to tune.
If we use CAWPE-A, containing the more advanced components (RandF, RotF,
SVMQ, MLP2 and XGBoost), we get a classifier that is significantly more accurate
than any of the individuals (Fig. 6b). CAWPE-A takes significantly longer to train
than CAWPE-S, but it is still not slower on average than the tuned classifiers. We are
not claiming that CAWPE-A is significantly faster than tuning a base classifier in the
general case, because this is obviously dependent on the tuning strategy. CAWPE-A
involves a ten-fold cross-validation of five classifiers, so it is going to be comparable in
run time to one of these single classifiers tuned over 50 parameter settings. However,
our experiments demonstrate that tuning a single base learner over a much larger
parameter space does not, on average, result in as strong a model.
Our goal is not to propose a particular set of classifiers that should be used with
CAWPE. Rather, we maintain that if one has some set of classifiers they wish to apply
to a problem, ensembling them using CAWPE is generally at least as strong as other
heterogeneous ensemble schemes when we have a relatively small number of base
classifiers, that it significantly improves base classifiers that are approximately equally
strong, and that the degree of improvement is such that state-of-the-art level results can
be achieved with minimal effort. Once a classifier is trained and the results are stored,
ensembling is very quick. To perhaps belabour the point, we ensembled the four tuned
classifiers using the parameter ranges given in Table 2 and the resulting classifier was
significantly better than the components in a manner reflecting the patterns observed
in Sect. 5.1.
Fig. 7 Average ranked errors for DTW against a CAWPE-S and its components and b CAWPE-A and its
components on the 85 datasets in the UCR archive
Bagnall et al. (2017) found that of 18 such algorithms, only 13 were significantly better (in terms of
accuracy) than DTW.
Our goal is to test how well the results observed for CAWPE on the UCI data
generalise to other data, by testing whether CAWPE significantly improves over its
components on the UCR archive also. To do so, we ignore the ordering of the series and
treat each time step in the series as a feature for traditional vector-based classification.
The UCR datasets generally have many more features than the UCI data. This has
meant we have had to make one change to CAWPE-S: we remove logistic regression
because it cannot feasibly be built on many of the datasets. Since DTW is a 1-nearest
neighbour classifier, it always produces 0/1 probability estimates. Because of this, we
omit a probabilistic evaluation using AUC and NLL, as it has little meaning for DTW.
Figure 7 shows the critical difference diagrams for accuracy of CAWPE-S, CAWPE-
A, their respective constituents, and DTW. Both sets of base classifiers are significantly
improved by CAWPE once more. These results closely mirror those on the UCI datasets
presented above. Furthermore, neither of the CAWPE versions are significantly worse
than DTW and both have higher average rank. This should be considered in the context
that neither classifier takes advantage of any information in the ordering of attributes.
Despite this, CAWPE-A has a higher average rank than 9 of the 18 bespoke time series
classification algorithms evaluated in Bagnall et al. (2017), and is not significantly
worse than 11 of them. CAWPE, a simple ensemble using off-the-shelf components
and a simple weighting scheme, proves to be as accurate as complex algorithms
that use a range of complicated techniques such as forming bags of patterns, using
edit distance based similarity, differential based distances, compression techniques
and decision trees based on short subseries features.
Using standard classifiers for TSC is unlikely to be the best approach. The best per-
forming TSC algorithm in Bagnall et al. (2017), significantly more accurate than all
the others, was the Collective of Transformation-based Ensembles (COTE) (Bagnall
et al. 2015). It has components built on different representations of the data. COTE
uses an ensemble structure that is the progenitor of CAWPE. The latest version of
COTE, HIVE-COTE (Lines et al. 2016) uses weighted majority voting for five mod-
ularised classifier components defined on shapelet, elastic distance, power spectrum,
bag-of-words and interval based representations, and is significantly more accurate
than the previous version, flat-COTE, and all of the competing algorithms. HIVE-
COTE exploits the diversity of the representations through an ensemble scheme. We
address the question of whether CAWPE is the best ensemble scheme for HIVE-COTE.
6 Analysis
We perform a more in-depth analysis of results to determine whether there are any
patterns in the results that indicate when and why CAWPE performs well. We compare
various facets of performance against choosing the best component on any given
dataset (Sect. 6.1). We then perform an ablation study of CAWPE (Sect. 6.2), and a
sensitivity study of its parameter, α (Sect. 6.3).
Given that CAWPE ensembles classifiers based on estimates of accuracy obtained from the train data
and gives increasingly larger weights to the better classifiers, it seems reasonable to ask:
why not just choose the single classifier with the highest estimate of accuracy? Figure 3
demonstrated that choosing a single classifier is on average significantly worse than
using the CAWPE ensembles. When comparing algorithms over entire archives, we
get a good sense of those which are better for general purpose classification. However,
differences in aggregated ranks do not tell the whole story of differences between
classifiers. It could be the case that CAWPE is just more consistent than its components:
it could be a jack of all trades ensemble that achieves a high ranking most of the time, but
is usually beaten by one or more of its components. A more interesting improvement
is an ensemble that consistently achieves higher accuracy than all of its components.
For this to happen, the act of ensembling needs not only to cover for the weaknesses
of the classifiers outside their areas of strength, but also to accentuate their strengths
within their specialisations. Figure 9 shows the scatter plots of accuracy for
choosing the best base classifier from their respective component sets against using
CAWPE. This demonstrates that CAWPE has higher accuracy than Pick Best on the
majority of problems, and that the differences are not tiny.
Figure 10 shows the counts of the rankings achieved by CAWPE built on the simpler
(a) and advanced (b) components, in terms of accuracy, over the 121 UCI datasets.
Fig. 9 Accuracy of a CAWPE-S and b CAWPE-A versus picking the best component
Fig. 10 Clustered histograms of accuracy rankings over the 121 UCI datasets for a CAWPE-S and b
CAWPE-A and their respective components. For each classifier, the number of occurrences of each rank
being achieved relative to the other classifiers is shown
CAWPE is the single best classifier far more often than any of its components, and is in
fact more often the best classifier than second best. Both versions of CAWPE are never
ranked fifth or sixth, and very rarely ranked fourth, demonstrating the consistency of
the improvement. This suggests that the simple combination scheme used in CAWPE
is able to actively enhance the predictions of its locally specialised members, rather
than just achieve a consistently good rank.
For clarity we restrict further analysis to the CAWPE-S results. Comparable results
for CAWPE-A are available on the accompanying website.
Comparing overall performance of classifiers is obviously desirable; it addresses the
general question: given no other information, what classifier should I use? However,
we do have further information. We know the number of train cases, the number of
attributes and the number of classes. Does any of this information indicate scenarios
where CAWPE is gaining an advantage? The most obvious factor is train set size,
since picking the best classifier based on train estimates is likely to be less reliable
with small train sets.
Table 3 breaks down the results of CAWPE-S compared to Pick Best by train set
size. With under 1000 train cases, CAWPE-S is clearly superior. With 1000–5000
Table 3 CAWPE-S versus picking the best component (Pick Best), grouped by train set size

Train set size   Problems   CAWPE-S wins   Mean error difference (%)
1–100            28         21             1.49
101–500          46         36             0.71
501–1000         12         11             1.51
1001–5000        23         11             0.16
> 5000           9          2              0.02

The three datasets with the same average error have been removed (acute-inflammation, acute-nephritis and
breast-cancer-wisc-diag). If there is a significant difference within a group (tested using a Wilcoxon
signed-rank test) the row is in bold
Fig. 11 The difference in average errors in increasing order between CAWPE-S and picking the best
classifier on each dataset. Significant differences according to paired t-tests over folds are also reported.
CAWPE-S is significantly more accurate on 46, the best individual classifier on 18, and there is no significant
difference on 57
cases, there is little difference. With over 5000 cases, CAWPE-S is better on just 2 of
9 problems, but there is only a tiny difference in error. This would indicate that if one
has over 5000 cases then there may be little benefit in using CAWPE-S, although it is
unlikely to be detrimental. Analysis shows there is no detectable significant effect of
number of attributes. For the number of classes, there is a benefit for CAWPE-S on
problems with more than 5 classes. CAWPE-S wins on 62% of problems with five or
fewer classes (53 out of 85) and wins on 85% of problems with 6 or more (28 out of
33). This is not unexpected, as a large number of classes means fewer cases per class,
which is likely to introduce more noise into the estimate of error.
Despite using the same classification algorithms, not all of the differences between
pick best and CAWPE-S are small in magnitude. Figure 11 shows the ordered dif-
ferences between the two approaches. The largest difference in favour of CAWPE-S
(averaged over 30 folds) is 4.42% (on the arrhythmia dataset) and in favour of pick
best 4.5% (on energy-y1). This demonstrates the importance of the selection method
for classifiers; it can cause large differences on unseen data.
This analysis indicates that CAWPE-S is likely to be a better approach than simply
picking the best when there is not a large amount of training data, there are a large
number of classes and/or the problem is hard. Overall, CAWPE requires almost no
extra work beyond pick best and yet is more accurate.
Section 6.2 has shown that exaggerating the weights of classifiers using α gives a
significant increase in performance over standard weighted averaging of probabilities,
even with all else being equal. As stated at the end of Sect. 3, the value of α was
fixed to 4 for CAWPE for all experiments reported throughout the previous sections.
This value was decided on while developing HIVE-COTE. Having performed our
experiments with α = 4, we were interested to find out how sensitive the performance
of CAWPE is to this single parameter.
Figure 14 depicts what happens if we fix α to progressively higher values over both
dataset archives and both base classifier sets used throughout, the basic set (Logistic,
C4.5, SVML, NN and MLP1) and the advanced set (RandF, RotF, SVMQ, MLP2 and
XGBoost). To keep everything on the same scale and to appropriately highlight the
actual differences in accuracy, the average accuracy of each α value is expressed as
the difference between itself and using α = 0, i.e. no weighting of the base classifiers.
Even across the two different archives and base classifier sets, the test performances
of different values of α show a fairly consistent pattern, rising steadily until around
five to seven before tapering off or eventually falling again. Ultimately as α tends to
infinity, we know that the ensemble becomes equivalent to picking the best individual,
at which point the line has fallen far below 0 on these graphs. While not included for
the sake of space and clarity, the results for the other three test statistics (balanced
error, AUC, and NLL) follow an effectively identical pattern.
These results give us an understanding of the surprisingly consistent properties
of α overall. However, given some particular set of base classifiers, their relative
performances and ability to estimate their own performance on the training set could
vary to different extents depending on the individual dataset provided. As such, the
amount that we want to accentuate the differences between the classifiers could change
from dataset to dataset. It is therefore natural to wonder whether the α parameter
could be tuned. To do this in a completely fair and unbiased way, we would need to
perform a further nested level of cross-validation. However, we can find a much faster
(but possibly biased) estimate of the ensemble’s error by using exactly the same folds
as the base classifiers once more, and simply recombining their predictions.
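A sketch of this faster tuning procedure (illustrative numpy code; it assumes the cross-validation probability estimates of the base classifiers have been stored, that class labels are 0-based indices, and that ties are resolved conservatively by taking the smallest tied α, i.e. the ConTie rule discussed below):

```python
import numpy as np

def tune_alpha(cv_probas, cv_accs, y_train, candidates=tuple(range(16))):
    """cv_probas: (k x n x c) stored cross-validation probability estimates of the
    k base classifiers for the n train cases; cv_accs: their k accuracy estimates;
    y_train: 0-based class indices. Returns the smallest candidate alpha with the
    best estimated ensemble accuracy."""
    best_alpha, best_acc = None, -1.0
    for alpha in candidates:
        weights = np.asarray(cv_accs, dtype=float) ** alpha
        combined = np.tensordot(weights, cv_probas, axes=1)        # (n x c)
        acc = np.mean(np.argmax(combined, axis=1) == y_train)
        if acc > best_acc:               # strict '>' keeps the smallest tied alpha
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```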
However, as Fig. 15 shows, tuning α over the range {0, 1, . . . , 15, ∞} appears to
offer little to no benefit when doing so with simple and sensible tuning rules such as
picking the α with the best accuracy estimate, and resolving ties (which can be quite
common in this scenario) either randomly (RandTie) or conservatively, by choosing
the smallest tied value of α (ConTie). ConTie tends towards more evenly averaging
the base classifiers' outputs, both to counteract any potential overfitting by the base
classifiers and to counteract the tendency, shown in Fig. 14, for higher values of α to
lead to increasingly optimistic estimates of the ensemble's own performance.
One could imagine many more complex tuning schemes potentially having a posi-
tive effect, such as sticking to the default value of 4, and only deviating if another value
significantly improves accuracy over the cross-validation folds. However, considering
both this analysis of α and the findings of the previous section, and remembering our
initial guiding principle of simplicity, we believe we can reasonably fall back to fixing
the value of α.
7 Conclusions
The key message of this paper is simple: forming heterogeneous ensembles of approx-
imately equivalent classifiers produces on average a significantly better classifier (in
terms of error, ordering and probability estimates) than a wide range of potential base
classifiers, and that when we use a weighted probabilistic combination mechanism,
ensembles of simple classifiers can be at least as good as homogeneous ensembles,
heterogeneous ensembles or tuned classifiers. The CAWPE method we propose is sig-
nificantly better than many equivalent methods and, if the number of classifiers being
ensembled is relatively small, represents a sensible starting point. CAWPE is quick,
simple and easy to understand. A CAWPE ensemble of five simple untuned classifiers is not
significantly worse than a heavily tuned support vector machine, multilayer perceptron,
random forest and XGBoost. CAWPE is significantly better than similar heterogeneous
schemes based on predictions rather than probabilities. Clearly, CAWPE is not always
the best approach, but given the short time it takes to build the simple classifiers we
have used to test it, it seems a sensible starting point.
CAWPE has limitations or areas where it is untested. Firstly, as the train set size
increases, the value in ensembling, as opposed to just picking the best, reduces. How-
ever, picking best rather than ensembling requires a similar amount of work, and
ensembling is unlikely to make things worse. Secondly, with a larger pool of classi-
fiers, it may be better to select a subset via some ensemble selection (ES) type algorithm
rather than use all classifiers. We have not tested this, because unless we choose the overproduce
and select methodology of including multiple copies of the same learning algorithm,
there are not that many learning algorithms that would be considered equivalent. Our
approach is to use fewer very different base classifiers, then combine their output in
a way that retains the maximum information. Thirdly, it may well be possible that
advanced classifiers such as boosting, deep learning and support vector machines can
be designed to beat CAWPE, but if this is the case it is not trivial, as we have shown.
Finally, the data we have used has only continuous attributes. We made this decision
based on the fact that we wanted to extend previous research and because we come to
this problem from time series classification, where all data is real valued. It may be
that the variation in classifier performance on nominal data is such that the ensembling
offers no benefit. However, given that CAWPE is classifier neutral, it seems unlikely
that the pattern of results would be much different.
Ultimately we hope to drive a better understanding of what classifier to use for a
new problem and how best to use it. With current technology, our conclusion is that,
rather than expend extra computational time tuning a single classifier, it is better to
ensemble different classifiers from different families of algorithms, and that the best
way of doing this is to weight the probability estimates from each base classifier with
an exponentiated accuracy estimate derived from the train data.
Acknowledgements This work is supported by the UK Engineering and Physical Sciences Research Coun-
cil (EPSRC) (Grant No. EP/M015807/1) and the Biotechnology and Biological Sciences Research Council
(BBSRC) Norwich Research Park Biosciences Doctoral Training Partnership (Grant No. BB/M011216/1).
The experiments were carried out on the High Performance Computing Cluster supported by the Research
and Specialist Computing Support service at the University of East Anglia and using a Titan X Pascal
donated by the NVIDIA Corporation.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Interna-
tional License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made.
Appendix
Table 4 A full list of the UCI datasets used in the experiments in Sects. 5.1–5.4

Dataset   Atts   Classes   Cases      Dataset   Atts   Classes   Cases
abalone 8 3 4177 monks-1 6 2 556
acute-inflammation 6 2 120 monks-2 6 2 601
acute-nephritis 6 2 120 monks-3 6 2 554
adult 14 2 48,842 mushroom 21 2 8124
annealing 31 5 898 musk-1 166 2 476
arrhythmia 262 13 452 musk-2 166 2 6598
audiology-std 59 18 196 nursery 8 5 12,960
balance-scale 4 3 625 oocytes_m_nucleus_4d 41 2 1022
balloons 4 2 16 oocytes_m_states_2f 25 3 1022
bank 16 2 4521 oocytes_t_nucleus_2f 25 2 912
blood 4 2 748 oocytes_t_states_5b 32 3 912
breast-cancer 9 2 286 optical 62 10 5620
breast-cancer-w 9 2 699 ozone 72 2 2536
breast-cancer-w-diag 30 2 569 page-blocks 10 5 5473
breast-cancer-w-prog 33 2 198 parkinsons 22 2 195
breast-tissue 9 6 106 pendigits 16 10 10,992
car 6 4 1728 pima 8 2 768
cardio-10clases 21 10 2126 pit-bri-MATERIAL 7 3 106
cardio-3clases 21 3 2126 pit-bri-REL-L 7 3 103
chess-krvk 6 18 28,056 pit-bri-SPAN 7 3 92
chess-krvkp 36 2 3196 pit-bri-T-OR-D 7 2 102
Table 4 continued
hill-valley 100 2 1212 synthetic-control 60 6 600
horse-colic 25 2 368 teaching 5 3 151
ilpd-indian-liver 9 2 583 thyroid 21 3 7200
image-segmentation 18 7 2310 tic-tac-toe 9 2 958
ionosphere 33 2 351 titanic 3 2 2201
iris 4 3 150 trains 29 2 10
led-display 7 10 1000 twonorm 20 2 7400
lenses 4 3 24 vert-col-2clases 6 2 310
letter 16 26 20,000 vert-col-3clases 6 3 310
libras 90 15 360 wall-following 24 4 5456
low-res-spect 100 9 531 waveform 21 3 5000
lung-cancer 56 3 32 waveform-noise 40 3 5000
lymphography 18 4 148 wine 13 3 178
magic 10 2 19,020 wine-quality-red 11 6 1599
mammographic 5 2 961 wine-quality-white 11 7 4898
miniboone 50 2 130,064 yeast 8 10 1484
molec-biol-promoter 57 2 106 zoo 16 7 101
molec-biol-splice 60 3 3190
Experiments were conducted on 30 stratified resamples of each dataset, with 50% of the data taken for training and 50% for testing. All classifiers were aligned on the
same folds
Table 5 The 85 UCR time series classification problems used in the experiments for Sect. 5.5
Dataset Atts Classes Train Test Dataset Atts Classes Train Test
ECGFiveDays 136 2 23 861 SonyAIBORSurface2 65 2 27 953
ElectricDevices 96 7 8926 7711 StarlightCurves 1024 3 1000 8236
FaceAll 131 14 560 1690 Strawberry 235 2 613 370
FaceFour 350 4 24 88 SwedishLeaf 128 15 500 625
FacesUCR 131 14 200 2050 Symbols 398 6 25 995
FiftyWords 270 50 450 455 SyntheticControl 60 6 300 300
Fish 463 7 175 175 ToeSegmentation1 277 2 40 228
FordA 500 2 3601 1320 ToeSegmentation2 343 2 36 130
FordB 500 2 3636 810 Trace 275 4 100 100
GunPoint 150 2 50 150 TwoLeadECG 82 2 23 1139
Ham 431 2 109 105 TwoPatterns 128 4 1000 4000
HandOutlines 2709 2 1000 370 UWaveAll 945 8 896 3582
Haptics 1092 5 155 308 UWaveX 315 8 896 3582
Herring 512 2 64 64 UWaveY 315 8 896 3582
InlineSkate 1882 7 100 550 UWaveZ 315 8 896 3582
InsectWingbeatSound 256 11 220 1980 Wafer 152 2 1000 6164
ItalyPowerDemand 24 2 67 1029 Wine 234 2 57 54
LargeKitchApps 720 3 375 375 WordSynonyms 270 25 267 638
Lightning2 637 2 60 61 Worms 900 5 181 77
Lightning7 319 7 70 73 WormsTwoClass 900 2 181 77
Mallat 1024 8 55 2345 Yoga 426 2 300 3000
Meat 448 3 60 60
Experiments were conducted on 30 stratified resamples of each dataset and all classifiers were aligned on the same folds. Each UCR dataset has an initial default train and
test partition that was used for the first experiment, and each subsequent experiment was conducted using resamples of the data that preserve the class distributions and sizes
of the default train and test splits
Table 6 Average scores for each classifier over the 121 UCI datasets

Group   Classifier   Sections   Error   Balanced error   AUC   NLL
Heterogeneous ensembles, ES-A 5.2 0.176 0.246 0.817 0.847
advanced components MV-A 5.2 0.176 0.249 0.815 0.833
NBC-A 5.2 0.183 0.249 0.821 1.031
PB-A 5.2 0.193 0.261 0.876 0.843
RC-A 5.2 0.177 0.262 0.813 0.87
SMLR-A 5.2 0.19 0.263 0.752 1.141
SMLRE-A 5.2 0.203 0.275 0.747 1.232
SMM5-A 5.2 0.188 0.261 0.757 1.019
WMV-A 5.2 0.175 0.248 0.817 0.837
Homogeneous ensembles AdaBoost 5.3 0.353 0.469 0.775 3.258
(RandF and XGBoost repeated) Bagging 5.3 0.206 0.303 0.868 0.775
LogitBoost 5.3 0.241 0.302 0.836 8.246
RandF 5.1, 5.3 0.185 0.259 0.886 0.713
XGBoost 5.1, 5.3 0.193 0.261 0.876 0.843
Tuned classifiers TunedMLP 5.4 0.227 0.318 0.857 1.009
TunedRandF 5.4 0.188 0.271 0.879 0.719
TunedSVM 5.4 0.188 0.255 0.857 0.955
(On 117 UCI datasets) TunedXGBoost 5.4 0.194 0.267 0.869 0.86
CAWPE-T 5.4 0.175 0.244 0.891 0.653
References
Bagnall A, Cawley G (2017) On the use of default parameter settings in the empirical evaluation of classi-
fication algorithms. arXiv:1703.06777
Bagnall A, Lines J, Hills J, Bostrom A (2015) Time-series classification with COTE: the collective of
transformation-based ensembles. IEEE Trans Knowl Data Eng 27:2522–2535
Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2017) The great time series classification bake off: a review
and experimental evaluation of recent algorithmic advances. Data Min Knowl Discov 31(3):606–660
Bagnall A, Lines J, Keogh E (2018) The UEA UCR time series classification archive. http://timeseriesclassification.com. Accessed 6 June 2019
Benavoli A, Corani G, Mangili F (2016) Should we really use post-hoc tests based on mean-ranks? J Mach
Learn Res 17:1–10
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Caruana R, Niculescu-Mizil A (2004) Ensemble selection from libraries of models. In: Proceedings of the
21st international conference on machine learning
Chen T (2016) XGBoost: a scalable tree boosting system. In: Proceedings of 22nd ACM SIGKDD inter-
national conference on knowledge discovery and data mining
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dietterich T (2000) An experimental comparison of three methods for constructing ensembles of decision
trees: bagging, boosting, and randomization. Mach Learn 40(2):139–157
Džeroski S, Ženko B (2004) Is combining classifiers with stacking better than selecting the best one? Mach
Learn 54(3):255–273
Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve
real world classification problems? J Mach Learn Res 15:3133–3181
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of international
conference on machine learning, vol 96, pp 148–156
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Friedman J, Hastie T, Tibshirani R (1998) Additive logistic regression: a statistical view of boosting. Stanford
University, technical report
García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for
all pairwise comparisons. J Mach Learn Res 9:2677–2694
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomised trees. Mach Learn 63(1):3–42
Giacinto G, Roli F (2001) An approach to the automatic design of multiple classifier systems. Pattern
Recognit Lett 22(1):25–33
Hansen L, Salamon P (1990) Neural network ensembles. IEEE Trans Pattern Anal Mach Intell 12(10):993–
1001
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection.
In: Proceedings of 14th international joint conference on artificial intelligence. Morgan Kaufmann
Publishers Inc., pp 1137–1143
Kuncheva L, Rodríguez J (2014) A weighted voting framework for classifiers ensembles. Knowl Inf Syst
38(2):259–275
Kuncheva L, Whitaker C (2003) Measures of diversity in classifier ensembles and their relationship with
the ensemble accuracy. Mach Learn 51(2):181–207
Didaci L, Fumera G, Roli F (2013) Diversity in classifier ensembles: fertile concept or dead end? In:
International workshop on multiple classifier systems. Springer, pp 37–48
Lines J, Taylor S, Bagnall A (2016) HIVE-COTE: the hierarchical vote collective of transformation-based
ensembles for time series classification. In: Proceedings of IEEE international conference on data
mining
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–198
Partridge D, Yates W (1996) Engineering multiversion neural-net systems. Neural Comput 8(4):869–893
Provost F, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3):199–215
Ratanamahatana C, Keogh E (2005) Three myths about dynamic time warping data mining. In: Proceedings
of 5th SIAM international conference on data mining
Re M, Valentini G (2011) Ensemble methods: a review. Data mining and machine learning for astronomical
applications. Data Mining and Knowledge Discovery Series, Chapman & Hall
Rodriguez J, Kuncheva L, Alonso C (2006) Rotation forest: a new classifier ensemble method. IEEE Trans
Pattern Anal Mach Intell 28(10):1619–1630
Tang K, Suganthan P, Yao X (2006) An analysis of diversity measures. Mach Learn 65(1):247–271
Ting K, Witten I (1997) Stacking bagged and dagged models. In: Proceedings of 14th international confer-
ence on machine learning, pp 367–375
Ting K, Witten I (1999) Issues in stacked generalization. J Artif Intell Res 10:271–289
Wainberg M, Alipanahi B, Frey B (2016) Are random forests truly the best classifiers? J Mach Learn Res
17:1–5
Wainer J, Cawley G (2017) Empirical evaluation of resampling procedures for optimising SVM hyperpa-
rameters. J Mach Learn Res 18(15):1–35
Webb G (2000) Multiboosting: a technique for combining boosting and wagging. Mach Learn 40(2):159–
196
Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Affiliations
James Large
[email protected]
Jason Lines
[email protected]
1 School of Computing Sciences, University of East Anglia, Norwich, UK