Ensemble Methods in Data Mining
Ensemble Methods in Data Mining
Ensemble Methods in Data Mining
Data Mining:
Improving Accuracy
Through Combining Predictions
Synthesis Lectures on Data
Mining and Knowledge
Discovery
Editor
Robert Grossman, University of Illinois, Chicago
Ensemble Methods in Data Mining: Improving Accuracy Through Combining
Predictions
Giovanni Seni and John F. Elder
2010
Modeling and Data Mining in Blogosphere
Nitin Agarwal and Huan Liu
2009
Copyright 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any meanselectronic, mechanical, photocopy, recording, or any other except for brief quotations in
printed reviews, without the prior permission of the publisher.
Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni and John F. Elder
www.morganclaypool.com
ISBN: 9781608452842 paperback
ISBN: 9781608452859 ebook
DOI 10.2200/S00240ED1V01Y200912DMK002
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY
Lecture #2
Series Editor: Robert Grossman, University of Illinois, Chicago
Series ISSN
Synthesis Lectures on Data Mining and Knowledge Discovery
Print 2151-0067 Electronic 2151-0075
Ensemble Methods in
Data Mining:
Improving Accuracy
Through Combining Predictions
Giovanni Seni
Elder Research, Inc. and Santa Clara University
John F. Elder
Elder Research, Inc. and University of Virginia
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY
#2
C
M
&
cLaypool Morgan publishers
&
ABSTRACT
Ensemble methods have been called the most inuential development in Data Mining and Machine
Learning in the past decade. They combine multiple models into one usually more accurate than
the best of its components. Ensembles can provide a critical boost to industrial challenges from
investment timing to drug discovery, and fraud detection to recommendation systems where
predictive accuracy is more vital than model interpretability.
Ensembles are useful with all modeling algorithms, but this book focuses on decision trees
to explain them most clearly. After describing trees and their strengths and weaknesses, the authors
provide an overview of regularization today understood to be a key reason for the superior per-
formance of modern ensembling algorithms. The book continues with a clear description of two
recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble
methods bagging, random forests, and boosting to be special cases of a single algorithm, thereby
showing how to improve their accuracy and speed. REs are linear rule models derived from decision
tree ensembles. They are the most interpretable version of ensembles, which is essential to appli-
cations such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how
ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.
This book is aimed at novice and advanced analytic researchers and practitioners especially
in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn
why and how to employ this breakthrough method, and advanced practitioners will gain insight into
building even more powerful models. Throughout, snippets of code in R are provided to illustrate
the algorithms described and to encourage the reader to try the techniques
1
.
The authors are industry experts in data mining and machine learning who are also adjunct
professors and popular speakers. Although early pioneers in discovering and using ensembles, they
here distill and clarify the recent groundbreaking work of leading academics (such as Jerome Fried-
man) to bring the benets of ensembles to practitioners.
The authors would appreciate hearing of errors in or suggested improvements to this book,
and may be emailed at [email protected] and [email protected]. Errata and
updates will be available from www.morganclaypool.com
KEYWORDS
ensemble methods, rule ensembles, importance sampling, boosting, randomforest, bag-
ging, regularization, decision trees, data mining, machine learning, pattern recognition,
model interpretation, model complexity, generalized degrees of freedom
1
R is an Open Source Language and environment for data analysis and statistical modeling available through the Comprehensive
R Archive Network (CRAN). The R systems library packages offer extensive functionality, and be downloaded form http://
cran.r-project.org/ for many computing platforms. The CRAN web site also has pointers to tutorial and comprehensive
documentation. A variety of excellent introductory books are also available; we particularly like Introductory Statistics with R by
Peter Dalgaard and Modern Applied Statistics with S by W.N. Venables and B.D. Ripley.
To the loving memory of our fathers,
Tito and Fletcher
ix
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Foreword by Jaffray Woodriff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Foreword by Tin KamHo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
1
Ensembles Discovered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Building Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Real-World Examples: Credit Scoring + the Netix Challenge . . . . . . . . . . . . . . . . . . 7
1.4 Organization of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2
Predictive Learning and DecisionTrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Decision Tree Induction Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Decision Tree Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Decision Tree Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3
Model Complexity, Model Selection and Regularization . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 What is the Right Size of a Tree? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Bias-Variance Decomposition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Regularization and Cost-Complexity Tree Pruning 25
3.3.2 Cross-Validation 26
3.3.3 Regularization via Shrinkage 28
3.3.4 Regularization via Incremental Model Building 32
3.3.5 Example 34
3.3.6 Regularization Summary 37
x CONTENTS
4
Importance Sampling and the Classic Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Parameter Importance Measure 43
4.1.2 Perturbation Sampling 45
4.2 Generic Ensemble Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Example 49
4.3.2 Why it Helps? 53
4.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Example 58
4.5.2 Why the Exponential Loss? 59
4.5.3 AdaBoosts Population Minimizer 60
4.6 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 MART. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8 Parallel vs. Sequential Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5
Rule Ensembles and Interpretation Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1 Rule Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 Simulated Data Example 68
5.2.2 Variable Importance 73
5.2.3 Partial Dependences 74
5.2.4 Interaction Statistic 74
5.3 Manufacturing Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6
Ensemble Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Generalized Degrees of Freedom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Examples: Decision Tree Surface with Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
CONTENTS xi
6.4 R Code for GDF and Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Summary and Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A
AdaBoost Equivalence to FSF Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
B
Gradient Boosting and Robust Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Authors Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Acknowledgments
We would like to thank the many people who contributed to the conception and completion of this
project. Giovanni had the privilege of meeting with Jerry Friedman regularly to discuss many of
the statistical concepts behind ensembles. Prof. Friedmans inuence is deep. Bart Goethels and the
organizers of ACM-KDD07rst welcomedour tutorial proposal onthe topic.TinKamHofavorably
reviewed the book idea, Keith Bettinger offered many helpful suggestions on the manuscript, and
Matt Strampe assisted with R code. The staff at Morgan & Claypool especially executive editor
Diane Cerra were diligent and patient in turning the manuscript into a book. Finally, we would
like to thank our families for their love and support.
Giovanni Seni and John F. Elder
January 2010
Foreword by Jaffray Woodriff
John Elder is a well-known expert in the eld of statistical prediction. He is also a good friend
who has mentored me about many techniques for mining complex data for useful information. I
have been quite fortunate to collaborate with John on a variety of projects, and there must be a good
reason that ensembles played the primary role each time.
I need to explain how we met, as ensembles are responsible! I spent my four years at the
University of Virginia investigating the markets. My plan was to become an investment manager
after I graduated. All I needed was a protable technical style that t my skills and personality (thats
all!). After I graduated in 1991, I followed where the data led me during one particular caffeine-
fueled, double all-nighter. In a t of crazed trial and error brainstorming I stumbled upon the
winning concept of creating one super-model from a large and diverse group of base predictive
models.
After ten years of combining models for investment management, I decided to investigate
where my ideas t in the general academic body of work. I had moved back to Charlottesville after
a stint as a proprietary trader on Wall Street, and I sought out a local expert in the eld.
I found Johns rm, Elder Research, on the web and hoped that theyd have the time to talk to
a data mining novice. I quickly realized that John was not only a leading expert on statistical learning,
but a very accomplished speaker popularizing these methods. Fortunately for me, he was curious to
talk about prediction and my ideas. Early on, he pointed out that my multiple model method for
investing described by the statistical prediction term, ensemble.
John and I have worked together on interesting projects over the past decade. I teamed
with Elder Research to compete in the KDD Cup in 2001. We wrote an extensive proposal for a
government grant to fund the creation of ensemble-based research and software. In 2007 we joined
up to compete against thousands of other teams on the Netix Prize - achieving a third-place ranking
at one point (thanks partly to simple ensembles). We even pulled a brainstorming all-nighter coding
up our user rating model, which brought back fond memories of that initial breakthrough so many
years before.
The practical implementations of ensemble methods are enormous. Most current implemen-
tations of them are quite primitive and this book will denitely raise the state of the art. Giovanni
Senis thorough mastery of the cutting-edge research and John Elders practical experience have
combined to make an extremely readable and useful book.
Looking forward, I can imagine software that allows users to seamlessly build ensembles in
the manner, say, that skilled architects use CAD software to create design images. I expect that
xvi FOREWORDBY JAFFRAY WOODRIFF
Giovanni and John will be at the forefront of developments in this area, and, if I am lucky, I will be
involved as well.
Jaffray Woodriff
CEO, Quantitative Investment Management
Charlottesville, Virginia
January 2010
[Editors note: Mr. Woodriff s investment rm has experienced consistently positive results, and has
grown to be the largest hedge fund manager in the South-East U.S.]
Foreword by Tin KamHo
Fruitful solutions to a challenging task have often been found to come from combining an
ensemble of experts. Yet for algorithmic solutions to a complex classication task, the utilities of
ensembles were rst witnessed only in the late 1980s, when the computing power began to support
the exploration and deployment of a rich set of classication methods simultaneously. The next
two decades saw more and more such approaches come into the research arena, and the develop-
ment of several consistently successful strategies for ensemble generation and combination. Today,
while a complete explanation of all the elements remains elusive, the ensemble methodology has
become an indispensable tool for statistical learning. Every researcher and practitioner involved in
predictive classication problems can benet from a good understanding of what is available in this
methodology.
This book by Seni and Elder provides a timely, concise introduction to this topic. After an
intuitive, highly accessible sketch of the key concerns in predictive learning, the book takes the
readers through a shortcut into the heart of the popular tree-based ensemble creation strategies, and
follows that with a compact yet clear presentation of the developments in the frontiers of statistics,
where active attempts are being made to explain and exploit the mysteries of ensembles through
conventional statistical theory and methods. Throughout the book, the methodology is illustrated
with varied real-life examples, and augmented with implementations in R-code for the readers
to obtain rst-hand experience. For practitioners, this handy reference opens the door to a good
understanding of this rich set of tools that holds high promises for the challenging tasks they face.
For researchers and students, it provides a succinct outline of the critically relevant pieces of the vast
literature, and serves as an excellent summary for this important topic.
The development of ensemble methods is by no means complete. Among the most interesting
open challenges are a more thorough understanding of the mathematical structures, mapping of the
detailed conditions of applicability, nding scalable and interpretable implementations, dealing with
incomplete or imbalanced training samples, and evolving models to adapt to environmental changes.
It will be exciting to see this monograph encourage talented individuals to tackle these problems in
the coming decades.
Tin Kam Ho
Bell Labs, Alcatel-Lucent
January 2010
1
C H A P T E R 1
Ensembles Discovered
and in a multitude of counselors there is safety.
Proverbs 24:6b
A wide variety of competing methods are available for inducing models from data, and their
relative strengths are of keen interest. The comparative accuracy of popular algorithms depends
strongly on the details of the problems addressed, as shown in Figure 1.1 (from Elder and Lee
(1997)), which plots the relative out-of-sample error of ve algorithms for six public-domain prob-
lems. Overall, neural network models did the best on this set of problems, but note that every
algorithm scored best or next-to-best on at least two of the six data sets.
Relative Performance Examples: 5 Algorithms on 6 Datasets
(J ohn Elder, Elder Research & Stephen Lee, U. Idaho, 1997)
1 00
)
80
.90
1.00
Neural Network
Logistic Regression
Linear Vector Quantization
Projection Pursuit Regression
e
r
i
s
b
e
t
t
e
r
)
60
.70
.80
Decision Tree
n
i
q
u
e
s
(
l
o
w
e
40
.50
.60
P
e
e
r
T
e
c
h
n
20
.30
.40
R
e
l
a
t
i
v
e
t
o
P
00
.10
.20
E
r
r
o
r
R
.00
Diabetes Gaussian Hypothyroid German Credit Waveform Investment
Figure 1.1: Relative out-of-sample error of ve algorithms on six public-domain problems (based
on Elder and Lee (1997)).
2 1. ENSEMBLES DISCOVERED
How can we tell, ahead of time, which algorithm will excel for a given problem? Michie et al.
(1994) addressed this question by executing a similar but larger study (23 algorithms on 22 data
sets) and building a decision tree to predict the best algorithm to use given the properties of a data
set
1
. Though the study was skewed toward trees they were 9 of the 23 algorithms, and several of
the (academic) data sets had unrealistic thresholds amenable to trees the study did reveal useful
lessons for algorithm selection (as highlighted in Elder, J. (1996a)).
Still, there is a way to improve model accuracy that is easier and more powerful than judicious
algorithm selection: one can gather models into ensembles. Figure 1.2 reveals the out-of-sample
accuracy of the models of Figure 1.1 when they are combined four different ways, including aver-
aging, voting, and advisor perceptrons (Elder and Lee, 1997). While the ensemble technique of
advisor perceptrons beats simple averaging on every problem, the difference is small compared to the
difference between ensembles and the single models. Every ensemble method competes well here
against the best of the individual algorithms.
This phenomenon was discovered by a handful of researchers, separately and simultaneously,
to improve classication whether using decision trees (Ho, Hull, and Srihari, 1990), neural net-
works (Hansen and Salamon, 1990), or math theory (Kleinberg, E., 1990). The most inuential
early developments were by Breiman, L. (1996) with Bagging, and Freund and Shapire (1996) with
AdaBoost (both described in Chapter 4).
One of us stumbled across the marvel of ensembling (which we called model fusion or
bundling) while striving to predict the species of bats from features of their echo-location sig-
nals (Elder, J., 1996b)
2
. We built the best model we could with each of several very different
algorithms, such as decision trees, neural networks, polynomial networks, and nearest neighbors
(see Nisbet et al. (2009) for algorithm descriptions). These methods employ different basis func-
tions and training procedures, which causes their diverse surface forms as shown in Figure 1.3
and often leads to surprisingly different prediction vectors, even when the aggregate performance is
very similar.
The project goal was to classify a bats species noninvasively, by using only its chirps. Univer-
sity of Illinois Urbana-Champaign biologists captured 19 bats, labeled each as one of 6 species, then
recorded 98 signals, from which UIUC engineers calculated 35 time-frequency features
3
. Figure 1.4
illustrates a two-dimensional projection of the data where each class is represented by a different
color and symbol. The data displays useful clustering but also much class overlap to contend with.
Each bat contributed 3 to 8 signals, and we realized that the set of signals froma given bat had
to be kept together (in either training or evaluation data) to fairly test the models ability to predict
a species of an unknown bat. That is, any bat with a signal in the evaluation data must have no other
1
The researchers (Michie et al., 1994, Section 10.6) examined the results of one algorithm at a time and built a C4.5 decision
tree (Quinlan, J., 1992) to separate those datasets where the algorithm was applicable (where it was within a tolerance of
the best algorithm) to those where it was not. They also extracted rules from the tree models and used an expert system to
adjudicate between conicting rules to maximize net information score. The book is online at http://www.amsta.leeds.
ac.uk/charles/statlog/whole.pdf
2
Thanks to collaboration with Doug Jones and his EE students at the University of Illinois, Urbana-Champaign.
3
Features such as lowfrequency at the 3-decibel level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.
3
Ensemble methods all improve performance
1 00
)
80
.90
1.00
e
r
i
s
b
e
t
t
e
r
)
60
.70
.80
n
i
q
u
e
s
(
l
o
w
e
40
.50
.60
P
e
e
r
T
e
c
h
n
20
.30
.40
Advisor Perceptron
AP weighted average
Vote
Average
R
e
l
a
t
i
v
e
t
o
P
00
.10
.20
E
r
r
o
r
R
.00
Diabetes Gaussian Hypothyroid German Credit Waveform Investment
Figure 1.2: Relative out-of-sample error of four ensemble methods on the problems of Figure 1.1(based
on Elder and Lee (1997)).
signals from it in training. So, evaluating the performance of a model type consisted of building
and cross-validating 19 models and accumulating the out-of-sample results ( a leave-one-bat-out
method).
On evaluation, the baseline accuracy (always choosing the plurality class) was 27%. Deci-
sion trees got 46%, and a tree algorithm that was improved to look two-steps ahead to choose
splits (Elder, J., 1996b) got 58%. Polynomial networks got 64%. The rst neural networks tried
achieved only 52%. However, unlike the other methods, neural networks dont select variables; when
the inputs were then pruned in half to reduce redundancy and collinearity, neural networks improved
to 63% accuracy. When the inputs were pruned further to be only the 8 variables the trees employed,
neural networks improved to 69% accuracy out-of-sample. (This result is a clear demonstration of
the need for regularization, as described in Chapter 3, to avoid overt.) Lastly, nearest neighbors,
using those same 8 variables for dimensions, matched the neural network score of 69%.
Despite their overall scores being identical, the two best models neural network and nearest
neighbor disagreed a third of the time; that is, they made errors on very different regions of the
data. We observed that the more condent of the two methods was right more often than not.
4 1. ENSEMBLES DISCOVERED
(Their estimates were between 0 and 1 for a given class; the estimate more close to an extreme was
usually more correct.) Thus, we tried averaging together the estimates of four of the methods two-
step decision tree, polynomial network, neural network, and nearest neighbor and achieved 74%
accuracy the best of all. Further study of the lessons of each algorithm (such as when to ignore an
estimate due to its inputs clearly being outside the algorithms training domain) led to improvement
reaching 80%. In short, it was discovered to be possible to break through the asymptotic performance
ceiling of an individual algorithmby employing the estimates of multiple algorithms. Our fascination
with what came to be known as ensembling began.
1.1 BUILDINGENSEMBLES
Building anensemble consists of twosteps: (1) constructing variedmodels and(2) combining their es-
timates (see Section 4.2). One may generate component models by, for instance, varying case weights,
data values, guidance parameters, variable subsets, or partitions of the input space. Combination can
be accomplishedby voting, but is primarily done throughmodel estimate weights, withgating andad-
visor perceptrons as special cases. For example, Bayesian model averaging sums estimates of possible
Figure 1.3: Example estimation surfaces for ve modeling algorithms. Clockwise from top left: deci-
sion tree, Delaunay planes (based on Elder, J. (1993)), nearest neighbor, polynomial network (or neural
network), kernel.
1.1. BUILDINGENSEMBLES 5
t10
Var4
\
\
bw
\
\
\
Figure 1.4: Sample projection of signals for 6 different bat species.
models, weighted by their posterior evidence. Bagging (bootsrap aggregating; Breiman, L. (1996))
bootstraps the training data set (usually to build varied decision trees) and takes the majority vote or
the average of their estimates (see Section 4.3). Random Forest (Ho, T., 1995; Breiman, L., 2001)
adds a stochastic component to create more diversity among the trees being combined (see Sec-
tion 4.4) AdaBoost (Freund and Shapire, 1996) and ARCing (Breiman, L., 1996) iteratively build
models by varying case weights (up-weighting cases with large current errors and down-weighting
those accurately estimated) and employs the weighted sumof the estimates of the sequence of models
(see Section 4.5). Gradient Boosting (Friedman, J., 1999, 2001) extended the AdaBoost algorithm
to a variety of error functions for regression and classication (see Section 4.6).
The Group Method of Data Handling (GMDH) (Ivakhenko, A., 1968) and its descendent,
Polynomial Networks (Barron et al., 1984; Elder and Brown, 2000), can be thought of as early en-
semble techniques. They build multiple layers of moderate-order polynomials, t by linear regression,
6 1. ENSEMBLES DISCOVERED
where variety arises from different variable sets being employed by each node. Their combination is
nonlinear since the outputs of interior nodes are inputs to polynomial nodes in subsequent layers.
Network construction is stopped by a simple cross-validation test (GMDH) or a complexity penalty.
An early popular method, Stacking (Wolpert, D., 1992) employs neural networks as components
(whose variety can stem from simply using different guidance parameters, such as initialization
weights), combined in a linear regression trained on leave-1-out estimates from the networks.
Models have to be individually good to contribute to ensembling, and that requires knowing
when to stop; that is, how to avoid overt the chief danger in model induction, as discussed next.
1.2 REGULARIZATION
A widely held principle in Statistical and Machine Learning model inference is that accuracy and
simplicity are both desirable. But there is a tradeoff between the two: a exible (more complex) model
is often needed to achieve higher accuracy, but it is more susceptible to overtting and less likely to
generalize well. Regularization techniques damp down the exibility of a model tting procedure
by augmenting the error function with a term that penalizes model complexity. Minimizing the
augmented error criterion requires a certain increase in accuracy to pay for the increase in model
complexity (e.g., adding another term to the model). Regularization is today understood to be one
of the key reasons for the superior performance of modern ensembling algorithms.
An inuential paper was Tibshiranis introduction of the Lasso regularization technique for
linear models (Tibshirani, R., 1996). The Lasso uses the sum of the absolute value of the coefcients
in the model as the penalty function and had roots in work done by Breiman on a coefcient
post-processing technique which he had termed Garotte (Breiman et al., 1993).
Another important development came withthe LARSalgorithmby Efron et al. (2004), which
allows for an efcient iterative calculation of the Lasso solution. More recently, Friedman published
a technique called Path Seeker (PS) that allows combining the Lasso penalty with a variety of
loss (error) functions (Friedman and Popescu, 2004), extending the original Lasso paper which was
limited to the Least-Squares loss.
Careful comparison of the Lasso penalty with alternative penalty functions (e.g., using the
sum of the squares of the coefcients) led to an understanding that the penalty function has two
roles: controlling the sparseness of the solution (the number of coefcients that are non-zero) and
controlling the magnitude of the non-zero coefcients (shrinkage). This led to development of
the Elastic Net (Zou and Hastie, 2005) family of penalty functions which allow searching for the
best shrinkage/sparseness tradeoff according to characteristics of the problem at hand (e.g., data
size, number of input variables, correlation among these variables, etc.). The Coordinate Descent
algorithm of Friedman et al. (2008) provides fast solutions for the Elastic Net.
Finally, an extension of the Elastic Net family to non-convex members producing sparser
solutions (desirable when the number of variables is much larger than the number of observations)
is now possible with the Generalized Path Seeker algorithm (Friedman, J., 2008).
1.3. REAL-WORLDEXAMPLES: CREDITSCORING+ THENETFLIXCHALLENGE 7
1.3 REAL-WORLD EXAMPLES: CREDIT SCORING + THE
NETFLIXCHALLENGE
Many of the examples we show are academic; they are either curiosities (bats) or kept very simple to
best illustrate principles. We close Chapter 1 by illustrating that even simple ensembles can work in
very challenging industrial applications. Figure 1.5 reveals the out-of-sample results of ensembling
up to ve different types of models on a credit scoring application. (The output of each model is
ranked, those ranks are averaged and re-ranked, and the credit defaulters in a top percentage is
counted. Thus, lower is better.) The combinations are ordered on the horizontal axis by the number
of models used, and Figure 1.6 highlights the nding that the mean error reduces with increasing
degree of combination. Note that the nal model with all ve component models does better than
the best of the single models.
50
55
60
65
70
75
80
0 1 2 3 4 5
#Modelscombined(averagingoutputrange)
#
D
e
f
a
u
l
t
e
r
s
M
i
s
s
e
d
(
f
e
w
e
r
i
s
b
e
t
t
e
r
)
Bundled Trees
Stepwise Regression
Polynomial Network
Neural Network
MARS
NT
NS
ST
MT
PS
PT
NP
MS
MN
MP
SNT
SPT
SMT
SPN
MNT
SMP
SMN
MPT
PNT
MPN
SPNT
SMPT
SMNT
SMPN
MPNT
SMPNT
Figure 1.5: Out-of-sample errors on a credit scoring application when combining one to ve different
types of models into ensembles. T represents bagged trees; S, stepwise regression; P, polynomial networks;
N, neural networks; M, MARS. The best model, MPN, thus averages the models built by MARS, a
polynomial network, and a neural network algorithm.
Each model in the collection represents a great deal of work, and it was constructed by
advocates of that modeling algorithm competing to beat the other methods. Here, MARS was the
best and bagged trees was the worst of the ve methods (though a considerable improvement over
single trees, as also shown in many examples in Chapter 4).
8 1. ENSEMBLES DISCOVERED
55
60
65
70
75
1 2 3 4 5
Number of models in combination
N
u
m
b
e
r
o
f
D
e
f
a
u
l
t
e
r
s
M
i
s
s
e
d
Figure 1.6: Box plot for Figure 1.5; median (and mean) error decreased as more models are combined.
Most of the ensembling being done in research and applications use variations of one kind
of modeling method particularly decision trees (as described in Chapter 2 and throughout this
book). But one great example of heterogenous ensembling captured the imagination of the geek
community recently. In the Netix Prize, a contest ran for two years in which the rst teamto submit
a model improving on Netixs internal recommendation system by 10% would win $1,000,000.
Contestants were supplied with entries from a huge movie/user matrix (only 2% non-missing) and
asked to predict the ranking (from1 to 5) of a set of the blank cells. Ateamone of us was on, Ensemble
Experts, peaked at 3
rd
place at a time when over 20,000 teams had submitted. Moving that high in
the rankings using ensembles may have inspired other leading competitors, since near the end of the
contest, when the two top teams were extremely close to each other and to winning the prize, the
nal edge was obtained by weighing contributions from the models of up to 30 competitors.
Note that the ensembling techniques explained in this book are even more advanced than
those employed in the nal stages of the Netix prize.
1.4 ORGANIZATIONOFTHIS BOOK
Chapter 2 presents the formal problem of predictive learning and details the most popular nonlinear
method decision trees, which are used throughout the book to illustrate concepts. Chapter 3
discusses model complexity and how regularizing complexity helps model selection. Regularization
techniques play an essential role in modern ensembling. Chapters 4 and 5 are the heart of the book;
there, the useful new concepts of Importance Sampling Learning Ensembles (ISLE) and Rule
Ensembles developed by J. Friedman and colleagues are explained clearly. The ISLE framework
1.4. ORGANIZATIONOFTHIS BOOK 9
allows us to viewthe classic ensemble methods of Bagging, RandomForest, AdaBoost, and Gradient
Boosting as special cases of a single algorithm. This unied view claries the properties of these
methods and suggests ways to improve their accuracy and speed. Rule Ensembles is a new ISLE-
based model built by combining simple, readable rules. While maintaining (and often improving)
the accuracy of the classic tree ensemble, the rule-based model is much more interpretable. Chapter 5
also illustrates recently proposed interpretation statistics, which are applicable to Rule Ensembles as
well as to most other ensemble types. Chapter 6 concludes by explaining why ensembles generalize
much better than their apparent complexity would seem to allow. Throughout, snippets of code in
R are provided to illustrate the algorithms described.
11
C H A P T E R 2
Predictive Learning and
DecisionTrees
In this chapter, we provide an overview of predictive learning and decision trees. Before introducing
formal notation, consider a very simple data set represented by the following data matrix:
Table 2.1: Asimple data set. Each rowrepresents
a data point and each column corresponds to an
attribute. Sometimes, attribute values could be
unknown or missing (denoted by a ? below).
TI PE Response
1.0 M2 good
2.0 M1 bad
4.5 M5 ?
Each row in the matrix represents an observation or data point. Each column corresponds
to an attribute of the observations: TI, PE, and Response, in this example. TI is a numeric attribute,
PE is an ordinal attribute, and Response is a categorical attribute. A categorical attribute is one that
has two or more values, but there is no intrinsic ordering to the values e.g., either good or bad in
Table 2.1. An ordinal attribute is similar to a categorical one but with a clear ordering of the attribute
values. Thus, in this example M1 comes before M2, M2 comes before M3, etc. Graphically, this data
set can be represented by a simple two-dimensional plot with numeric attribute TI rendered on the
horizontal axis and ordinal attribute PE, rendered on the vertical axis (Figure 2.1).
When presented with a data set such as the one above, there are two possible modeling tasks:
1. Describe: Summarize existing data in an understandable and actionable way
2. Predict: What is the Response (e.g., class) of new point ? See (Hastie et al., 2009).
More formally, we say we are given training data D = {y
i
, x
i1
, x
i2
, , x
in
}
N
1
= {y
i
, x
i
}
N
1
where
- y
i
, x
ij
are measured values of attributes (properties, characteristics) of an object
- y
i
is the response (or output) variable
12 2. PREDICTIVELEARNINGANDDECISIONTREES
PE
TI
M9
M4
M3
M2
M1
.
.
.
2 5
Figure 2.1: A graphical rendering of the data set from Table 2.1. Numeric and ordinal attributes make
appropriate axes because they are ordered, while categorical attributes require color coding the points.
The diagonal line represents the best linear boundary separating the blue cases from the green cases.
- x
ij
are the predictor (or input) variables
- x
i
is the input vector made of all the attribute values for the i-th observation
- n is the number of attributes; thus, we also say that the size of x is n
- N is the number of observations
- D is a random sample from some unknown (joint) distribution p(x, y) i.e., it is assumed
there is a true underlying distribution out there, and that through a data collection effort, weve
drawn a random sample from it.
Predictive Learning is the problem of using D to build a functional model
y =
F(x
1
, x
2
, , x
n
) =
F(x)
which is the best predictor of y given input x. It is also often desirable for the model to offer an
interpretable description of how the inputs affect the outputs. When y is categorical, the problem is
termed a classication problem; when y is numeric, the problem is termed a regression problem.
The simplest model, or estimator, is a linear model, with functional form
F(x) = a
0
+
n
j=1
a
j
x
j
i.e., a weighted linear combination of the predictors. The coefcients {a
j
}
n
0
are to be determined
via a model tting process such as ordinary linear regression (after assigning numeric labels to the
points i.e., +1 to the blue cases and 1 to the green cases). We use the notation
F(x) to refer
13
to the output of the tting process an approximation to the true but unknown function F
(x)
linking the inputs to the output. The decision boundary for this model, the points where
F(x) = 0,
is a line (see Figure 2.1), or a plane, if n > 2. The classication rule simply checks which side of the
boundary a given point is at i.e.,
F(x)
_
0 (blue)
else (green)
In Figure 2.1, the linear model isnt very good, with several blue points on the (mostly) green
side of the boundary.
Decision trees (Breiman et al., 1993; Quinlan, J., 1992) instead create a decision boundary by
asking a sequence of nested yes/no questions. Figure 2.2 shows a decision tree for classifying the
data of Table 2.1. The rst, or root, node splits on variable TI : cases for which T I 5, follow the
left branch and are all classied as blue; cases for which T I < 5, go to the right daughter of the
root node, where they are subject to additional split tests.
PE
TI
M9
M4
M3
M2
M1
.
.
.
2 5
TI 2
true false
PE {M1, M2, M3 }
TI 5
Figure 2.2: Decision tree example for the data of Table 2.1. There are two types of nodes: split and
terminal. Terminal nodes are given a class label. When reading the tree, we follow the left branch when
a split test condition is met and the right branch otherwise.
At every new node the splitting algorithm takes a fresh look at the data that has arrived at it,
and at all the variables and all the splits that are possible. When the data arriving at a given node is
mostly of a single class, then the node is no longer split and is assigned a class label corresponding
to the majority class within it; these nodes become terminal nodes.
To classify a new observation, such as the white dot in Figure 2.1, one simply navigates the
tree starting at the top (root), following the left branch when a split test condition is met and the
right branch otherwise, until arriving at a terminal node. The class label of the terminal node is
returned as the tree prediction.
14 2. PREDICTIVELEARNINGANDDECISIONTREES
The tree of Figure 2.2 can also be expressed by the following expert system rule (assuming
green = bad and blue = good):
T I [2, 5] AND PE {M1, M2, M3} bad
ELSE good
which offers an understandable summary of the data (a descriptive model). Imagine this data came
from a manufacturing process, where M1, M2, M3, etc., were the equipment names of machines
used at some processing step, and that the T I values represented tracking times for the machines.
Then, the model also offers an actionable summary: certain machines used at certain times lead to
bad outcomes (e.g., defects). The ability of decision trees to generate interpretable models like this
is an important reason for their popularity.
In summary, the predictive learning problem has the following components:
- Data: D = {y
i
, x
i
}
N
1
- Model : the underlying functional form sought from the data e.g., a linear model, a decision
tree model, etc. We say the model represents a family F of functions, each indexed by a
parameter vector p:
F(x) =
F(x; p) F
In the case where F are decision trees, for example, the parameter vector p represents the splits
dening each possible tree.
- Score criterion: judges the quality of a tted model. This has two parts:
Loss function: Penalizes individual errors in prediction. Examples for regression tasks in-
clude the squared-error loss, L(y, y) = (y y)
2
, and the absolute-error loss, L(y, y) =
|y y|. Examples for 2-class classication include the exponential loss, L(y, y) =
exp(y y) , and the (negative) binomial log-likelihood, L(y, y) = log(1 + e
y y
).
Risk: the expected loss over all predictions, R(p) = E
y,x
L(y, F(x; p)), which we often
approximate by the average loss over the training data:
R(p) =
1
N
N
i=1
L(y
i
,
F(x
i
; p)) (2.1)
In the case of ordinary linear regression (OLR), for instance, which uses squared-error
loss, we have
R(p) =
R(a) =
1
N
N
i=1
y
i
a
0
n
j=1
a
j
x
j
2
2.1. DECISIONTREEINDUCTIONOVERVIEW 15
- Search Strategy: the procedure used to minimize the risk criterion i.e., the means by which
we solve
p = arg min
p
R(p)
In the case of OLR, the search strategy corresponds to direct matrix algebra. In the case of
trees, or neural networks, the search strategy is a heuristic iterative algorithm.
It should be pointed out that no model family is universally better; each has a class of target functions,
sample size, signal-to-noise ratio, etc., for which it is best. For instance, trees work well when 100s of
variables are available, but the output vector only depends on a fewof them(say <10); the opposite is
true for Neural Networks (Bishop, C., 1995) and Support Vector Machines (Scholkopf et al., 1999).
How to choose the right model family then? We can do the following:
- Match the assumptions for particular model to what is known about the problem, or
- Try several models and choose the one that performs the best, or
- Use several models and allow each subresult to contribute to the nal result (the ensemble
method).
2.1 DECISIONTREEINDUCTIONOVERVIEW
In this section, we look more closely at the algorithm for building decision trees. Figure 2.3 shows
an example surface built by a regression tree. Its a piece-wise constant surface: there is a region R
m
in input space for each terminal node in the tree i.e., the (hyper) rectangles induced by tree cuts.
There is a constant associated with each region, which represents the estimated prediction y = c
m
that the tree is making at each terminal node.
Formally, an M-terminal node tree model is expressed by:
y = T (x) =
M
m=1
c
m
I
R
m
(x)
where I
A
(x) is 1 if x A and 0 otherwise. Because the regions are disjoint, every possible input x
belongs in a single one, and the tree model can be thought of as the sum of all these regions.
Trees allowfor different loss functions fairly easily. The two most used for regression problems
are squared-error where the optimal constant c
m
is the mean and the absolute-error where the optimal
constant is the median of the data points within region R
m
(Breiman et al., 1993).
16 2. PREDICTIVELEARNINGANDDECISIONTREES
Figure 2.3: Sample regression tree and corresponding surface in input (x) space (adapted
from (Hastie et al., 2001)).
If we choose to use squared-error loss, then the search problem, nding the tree T (x) with
lowest prediction risk, is stated:
_
c
m
,
R
m
_
M
1
=arg min
{c
m
,R
m
}
M
1
N
i=1
[y
i
T (x
i
)]
2
= arg min
{c
m
,R
m
}
M
1
N
i=1
_
y
i
M
m=1
c
m
I
R
m
(x
i
)
_
2
To solve, one searches over the space of all possible constants and regions to minimize average
loss. Unrestricted optimization with respect to {R
m
}
M
1
is very difcult, so one universal technique is
to restrict the shape of the regions (see Figure 2.4).
Joint optimization with respect to {R
m
}
M
1
and {c
m
}
M
1
, simultaneously, is also extremely dif-
cult, so a greedy iterative procedure is adopted (see Figure 2.5). The procedure starts with all the
data points being in a single region R and computing a score for it; in the case of squared-error loss
this is simply:
e(R) =
1
N
xR
_
y
i
mean ({y
i
}
N
1
)
_
2
Then each input variable x
j
, and each possible test s
j
on that particular variable for splitting R into
R
l
(left region) and R
r
(right region), is considered, and scores e(R
l
) and e(R
r
) computed. The
2.1. DECISIONTREEINDUCTIONOVERVIEW 17
x
2
x
2
x
1
x
1
X
Figure 2.4: Examples of invalid and valid regions induced by decision trees. To make the problem of
building a tree computationally fast, the region boundaries are restricted to be rectangles parallel to the
axes. Resulting regions are simple, disjoint, and cover the input space (adapted from(Hastie et al., 2001)).
- Starting with a single region -- i.e., all given data
- At the m-th iteration:
Figure 2.5: Forward stagewise additive procedure for building decision trees.
quality, or improvement, score of the split s
j
is deemed to be
I(x
j
, s
j
) = e(R) e(R
l
) e(R
r
)
i.e., the reduction in overall error as a result of the split. The algorithm chooses the variable and the
split that improves the t the most, with no regard to whats going to happen subsequently. And
then the original region is replaced with the two new regions and the splitting process continues
iteratively (recursively).
Note the data is consumed exponentiallyeach split leads to solving two smaller subsequent
problems. So, when should the algorithm stop? Clearly, if all the elements of the set {x : x R} have
the same value of y, then no split is going to improve the score i.e., reduce the risk; in this case,
18 2. PREDICTIVELEARNINGANDDECISIONTREES
we say the region R is pure. One could also specify a maximum number of desired terminal nodes,
maximum tree depth, or minimum node size. In the next chapter, we will discuss a more principled
way of deciding the optimal tree size.
This simple algorithmcan be coded in a fewlines. But, of course, to handle real and categorical
variables, missing values and various loss functions takes thousands of lines of code. In R, decision
trees for regression and classication are available in the rpart package (rpart).
2.2 DECISIONTREEPROPERTIES
As recently as 2007, a KDNuggets poll (Data Mining Methods, 2007) concluded that trees were the
method most frequently used by practitioners. This is so because they have many desirable data
mining properties. These are as follows:
1. Ability to deal with irrelevant inputs. Since at every node, we scan all the variables and pick the
best, trees naturally do variable selection. And, thus, anything you can measure, you can allow
as a candidate without worrying that they will unduly skew your results.
Trees also provide a variable importance score basedonthe contributionto error (risk) reduction
across all the splits in the tree (see Chapter 5).
2. No data preprocessing needed. Trees naturally handle numeric, binary, and categorical variables.
Numeric attributes have splits of the form x
j
< cut _value; categorical attributes have splits
of the form x
j
{value1, value2, . . .}.
Monotonic transformations wont affect the splits, so you dont have problems with input
outliers. If cut _value = 3 and a value x
j
is 3.14 or 3,100, its greater than 3, so it goes to the
same side. Output outliers can still be inuential, especially with squared-error as the loss.
3. Scalable computation. Trees are very fast to build and run compared to other iterative techniques.
Building a tree has approximate time complexity of O (nN log N).
4. Missing value tolerant. Trees do not suffer much loss of accuracy due to missing values.
Some tree algorithms treat missing values as a separate categorical value. CART handles them
via a clever mechanism termed surrogate splits (Breiman et al., 1993); these are substitute
splits in case the rst variable is unknown, which are selected based on their ability to approx-
imate the splitting of the originally intended variable.
One may alternatively create a newbinary variable x
j
_is_NA(not available) when one believes
that there may be information in x
j
s being missing i.e., that it may not be missing at
random.
5. Off-the-shelf procedure: there are only few tunable parameters. One can typically use them
within minutes of learning about them.
2.3. DECISIONTREELIMITATIONS 19
6. Interpretable model representation. The binary tree graphic is very interpretable, at least to a few
levels.
2.3 DECISIONTREELIMITATIONS
Despite their many desirable properties, trees also suffer from some severe limitations:
1. Discontinuous piecewise constant model. If one is trying to t a trend, piecewise constants are
a very poor way to do that (see Figure 2.6). In order to approximate a trend well, many splits
would be needed, and in order to have many splits, a large data set is required.
x <= cutValue
C
1
C
2
C
1
C
2
y
x
F
*
(x)
cutValue
Figure 2.6: A 2-terminal node tree approximation to a linear function.
2. Data fragmentation. Each split reduces training data for subsequent splits. This is especially
problematic in high dimensions where the data is already very sparse and can lead to overt
(as discussed in Chapter 6).
3. Not good for lowinteraction target functions F
(x)=a
o
+
n
j=1
a
j
x
j
=
n
j=1
f
j
_
x
j
_
i.e., no interactions, additive model
and in order for x
j
to enter the model, the tree must split on it, but once the root split variable
is selected, additional variables enter as products of indicator functions. For instance,
R
1
in
Figure 2.3 is dened by the product of I (x
1
> 22) and I (x
2
> 27).
20 2. PREDICTIVELEARNINGANDDECISIONTREES
4. Not good for target functions F
(x) + (3.1)
where F
(x) is the target function that we are trying to learn. We dont really know F
, and because
either we are not measuring everything that is relevant, or we have problems with our measurement
equipment, or what we measure has noise in it, the response variable we have contains the truth
plus some error. We assume that these errors are independent and identically distributed. Specically,
we assume is normally distributed i.e., N(0,
2
) (although this is not strictly necessary).
Now consider the idealized aggregate estimator
F(x) = E
F
D
(x) (3.2)
which is the average t over all possible data sets. One can think of the expectation operator as an
averaging operator. Going back to the manufacturing example, each
F represents the model t to
the data set from a given day. And assuming many such data sets can be collected,
F can be created
as the average of all those
Fs.
Now, lets look at what the error of one of these
Fs is on one particular data point, say x
0
, under
one particular loss function, the squared-error loss, which allows easy analytical manipulation. The
error, known as the Mean Square Error (MSE) in this case, at that particular point is the expectation
of the squared difference between the target y and
F:
Err(x
0
) = E
_
y
F(x)|x = x
0
_
2
= E
_
F
(x
0
)
F(x
0
)
_
2
+
2
= E
_
F
(x
0
)
F(x
0
) +
F(x
0
)
F(x
0
)
_
2
+
2
1
Unless two cases have identical input values and different output values.
3.2. BIAS-VARIANCEDECOMPOSITION 23
The derivation above follows from equations Equations (3.1) and (3.2), and properties of the
expectation operator. Continuing, we arrive at:
= E
_
F(x
0
) F
(x
0
)
_
2
+ E
_
F(x
0
)
F(x
0
)
_
2
+
2
=
_
F(x
0
) F
(x
0
)
_
2
+ E
_
F(x
0
)
F(x
0
)
_
2
+
2
= Bias
2
(
F(x
0
)) +Var(
F(x
0
)) +
2
The nal expression says that the error is made of three components:
- [
F(x
0
) F
(x
0
)]
2
: known as squared-bias, is the amount by which the average estimator
is represented by the
blue circle. The model family F, or model space, is represented by the region to the right of the red
curve. For a given target realization y, one
F is t, which is the member from the model space F
that is closest to y. After repeating the tting process many times, the average
F can be computed.
Thus, the orange circle represents variance, the spread of the
Fs around their mean
F. Similarly,
the distance between the average estimator
F and the truth F
E
r
r
o
r
Figure 3.3: Bias-Variance trade-off as a function of model complexity (adapted from (Hastie et al.,
2001)). A simple model has high bias but low variance. A complex model has low bias but high variance.
To determine the optimal model a test set is required.
3.3. REGULARIZATION 25
Finally, the bias-variance tradeoff also means that the more complex (exible) we make the
model
F, the lower the bias but the higher the variance it is subject to. We want to be able to use
exible models, but a way to control variance is needed. This is where regularization comes in.
3.3 REGULARIZATION
What is regularization? We offer the following denition by Rosset, S. (2003):
any part of model building which takes into account implicitly or explicitly the
niteness and imperfection of the data and the limited information in it, which we can
term variance in an abstract sense.
We know of at least three different ways of regularizing:
1. Explicitly via constraints on model complexity. This means augmenting the risk score criterion
being minimizedwitha penalty termP(F) that is a functionof Fs complexity.This complexity
term penalizes for the increased variance associated with more complex models.
Cost-complexity pruning and shrinkage-based regression are examples of this form of regu-
larization. They are discussed in Sections 3.3.1 and 3.3.3, respectively.
2. Implicitly through incremental building of the model. This means using learning algorithms
that update model parameter estimates very slowly.
The forward stagewise linear regression algorithmis an example of this approach. It is discussed
in Section 3.3.4.
3. Implicitly through the choice of robust loss functions. Since the presence of outliers is a source
of variance in the regression procedure, by using a robust loss function that is less sensitive to
them, we are controlling variance, and thus doing implicit regularization.
The Huber loss (Huber, P., 1964)
L(y, y) =
_
1
2
(y y)
2
y y
y y
/
2
_
y y
>
is an example of a robust loss function for regression.
It is said that regularization builds on the bet on sparsity principle: use a method that is
known to work well onsparse problems (where only a subset of the predictors really matter) because
most methods are going to do poorly on dense problems.
3.3.1 REGULARIZATIONANDCOST-COMPLEXITYTREEPRUNING
The rst example of augmenting the score criterion with a complexity term comes from the CART
approach to tree pruning (Breiman et al., 1993):
(T ) =
R(T ) + P(T ) (3.3)
26 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
Where
-
R(T ) is the error or risk of tree T i.e., how good it ts the data as discussed in Chapter 2.
- P(T ) = |T | measures the complexity of the tree by its size i.e., number of terminal nodes
(analogous to an L0-norm).
- is called the complexity parameter and represents the cost of adding another split variable
to the model.
Note that the complexity term is deterministic and independent of the particular random
sample; it thereby provides a stabilizing inuence on the criterion being minimized. It acts as a
counterbalancing force to the data-dependent part of the error; the model can get more complex if
it can reduce the error by a certain amount to compensate for the increased penalty in complexity.
Our goal then is rephrased from nding the tree that has minimum risk
R(T ) to nding the
tree that has minimum regularized risk
R
(T ).
The parameter controls the degree of stabilization of the regularized component of the
error. At one extreme, if = 0, theres no regularization; a fully grown tree , T
max
, is obtained which
corresponds to the least stable estimate. At the other extreme, >> 0 (much greater than 0), and
the resulting tree, T
0
= root, is completely deterministic: no matter what the data indicates, the
complexity penalty wins over the loss component of the risk, and a stump results. No tree can be
built. In between, varying produces a nested (nite) sequence of subtrees:
T
max
T
1
T
2
root
=0
1
2
3
>>0
The procedure can be set up in such a way that every tree is a subset of the previous
tree (Breiman et al., 1993). This nestedness property for trees allows the process of minimizing
R
).
3.3.2 CROSS-VALIDATION
As mentioned in Section 3.2, while discussing the bias-variance tradeoff, the error on the training
set is not a useful estimator of model performance. What is needed is a way to estimate prediction
risk, also called test risk or future risk. If not enough data is available to partition it into separate
training and test sets, one may use the powerful general technique of cross-validation.
To perform what is called 3-fold cross-validation: simply, randomly split the data D into
three non-overlapping groups D
1
, D
2
, and D
3
, and generate data folds (partitions) with one D
i
3.3. REGULARIZATION 27
T
(1)
T
(2)
T
(3)
Figure 3.4: Illustration of 3-fold cross-validation. The original data set D randomly split into three
non-overlapping groups D
1
, D
2
, and D
3
. Two of the groups are allocated for training and one for testing.
The process is repeated three times.
designated for testing and the other two for training (see Figure 3.4). For each fold, a model (e.g., a
tree) is built on the training part of it and evaluated on the test part of it.
Note that every observation x
i
in D is assigned to a test sub-group only once, so an indexing
function can be dened,
(i) : {1, . . . , N} {1, . . . , V}
which maps the observation number, 1, . . . , N, to a fold number 1, . . . , V. Thus, function (i)
indicates the partition in which observation i is a test observation. The cross-validated estimate of
risk is then computed as:
R
CV
=
1
N
N
i=1
L
_
y
i
, T
(i)
(x
i
)
_
This estimate of prediction risk can be plotted against model complexity. Since varying the
value of the regularization parameter varies the complexity of the model e.g., the size of the tree,
the risk estimate can be written as a function of :
R
CV
() =
1
N
N
i=1
L
_
y
i
, T
(i)
(x
i
)
_
Figure 3.5 shows a plot of
R
CV
(); it was generated using the plotcp command from the
rpart package in R (rpart), which implements regression and classication trees. Risk estimate
R
CV
appears on the y-axis, in the lower x-axis and tree size in the top x-axis. Since
R
CV
is an average
for every value of , the corresponding standard deviation (or standard-error) can be computed, and
is represented by the short vertical bars.
To choose the optimal tree size, simply nd the minimum in the
R
CV
() plot. For the
example of Figure 3.5, this minimum occurs for a tree of size 28. Sometimes the following more
conservative selection approach is used: locate the minimum of the risk curve, from this point go up
one standard-error, and then move horizontally left until crossing the risk curve again. The tree size
value corresponding to where the crossing occurred is selected as the optimal tree size. In Figure 3.5,
this occurs for a tree of size 15. This procedure, known as the 1-SE rule, corresponds to the notion
28 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
1 2 4 6 8 9 10 12 15 28 34
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
Inf
0.1 0.077 0.037 0.011 0.041
X
-
v
a
l
R
e
l
a
t
i
v
e
E
r
r
o
r
sizeoftree
j=1
a
j
x
j
3.3. REGULARIZATION 29
and the coefcient estimation problem is simply stated as:
{ a
j
} = arg min
{a
j
}
N
i=1
L
y
i
, a
0
+
n
j=1
a
j
x
ij
(3.4)
Here, we are trying to nd values for the coefcients a
j
that minimize risk (as estimated by the
average loss). There are two main reasons why the ordinary linear regression (OLR) solution is often
not satisfactory for solving this problem. One is that often there is a high variance in the coefcient
estimates (Tibshirani, R., 1996). The other is interpretation: when the number of variables is large,
we would like, if possible, to identify the subset of the variables that capture the stronger effects. A
solution vector a with a
j
= 0 for these small subset of important variables, and a
j
= 0 for all other
variables, is termed a sparse solution, and it is often preferred over a dense solution where all
coefcients are non-zero.
There are generally two types of techniques to improve OLR. One is subset selec-
tion (Hastie et al., 2001), which tries to identify the best subset among variables x
j
to include
in the model. Like the tree growing algorithm of Section 2.1, subset selection is a greedy discrete
process (a variable is either in or out of the model) where often different data sub-samples give rise
to different variable subsets.
The second approach is to use shrinkage, which is a continuous process. As in Equation (3.3),
shrinkage works by augmenting the error criterion being minimized with a data-independent penalty
term:
{ a
j
} = arg min
{a
j
}
N
i=1
L
y
i
, a
0
+
n
j=1
a
j
x
ij
+ P(a)
where controls the amount of regularization (as did in the case of trees). The penalty function
P(a) can take different forms e.g.,
- Ridge: P(a) =
n
j=1
a
2
j
- Lasso (Tibshirani, R., 1996): P(a) =
n
j=1
|a
j
|
- Elastic Net (Zou and Hastie, 2005): P
(a) =
n
j=1
_
1
2
(1 ) a
2
j
+ |a
j
|
_
which is a
weighted mixture of the Lasso and Ridge penalties.
Note that the Ridge and Lasso penalties correspond to the L2- and L1- normof the coefcient
vector a, respectively. If we were to use the L0-norm simply counting the number of non-zero
coefcients the penalty would be analogous to what we used in trees. And, as in the case of trees,
the penalty term promotes reduced variance of the estimated values, by encouraging less complex
models i.e., those with fewer or smaller coefcients.
The Lasso differs from the Ridge penalty in that it encourages sparse coefcient vectors
where some entries are set to zero. This is often the case in the presence of correlated variables;
30 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
Ridge will shrink the coefcients of the correlated subset towards zero, whereas Lasso will often
select a variable from among them. Thus, the Lasso can be viewed as a continuous version of
variable selection.
Obtaining the Lasso coefcients with the squared-error loss involves solving a quadratic pro-
gramming problem with constraints, which can be computationally demanding for large problems.
In the past fewyears, however, fast iterative algorithms have been devised to solve this problemmore
quickly, and allow other loss functions (see Section 3.3.3).
As with for cost-complexity tree pruning, is a meta-parameter of the minimization pro-
cedure that needs to be estimated (generally via cross-validation). For every value of , there is a
corresponding optimal coefcient vector a(); having >> 0 gives maximum regularization and
only the constant model y = a
0
is produced (maximum bias and minimum variance). At the other
extreme, setting = 0 results in no regularization (minimum bias, maximum variance or least stable
estimates). This is depicted in a plot of error against the values of the regularization parameter (see
Figure 3.6).
0.0
0
.
6
Bias
2
Variance
MSE
0
.
0
0
.
2
0
.
4
0
.
8
1
.
0
1
.
2
0.1 0.2 0.3 0.4 0.5
M
S
E
Shrinkage:
1
Figure 3.6: Sample plot of the estimate of prediction error (MSE) as a function of regularization pa-
rameter . The optimal model, a(
i=1
(y
i
a
0
a
1
x
i1
a
2
x
i2
)
2
+ (|a
1
| + |a
2
|) (3.5)
Since we are using least-squares, the error function has a bowl shape with elliptical contours.
We visualize the error surface as sitting on top of the plane spanned by a
1
and a
2
(see Figure 3.7).
Figure 3.7: Illustration of Lasso penalty in 2D space.
The global minimum is marked by a
LS
. Since it can be shown that the penalized formulation
of our minimization problem, Equation (3.5), is equivalent to a constrained formulation,
{ a
0
, a
1
, a
2
} = arg min
{a
j
}
N
i=1
(y
i
a
0
a
1
x
i1
a
2
x
i2
)
2
subject to ( |a
1
| + |a
2
| )
the complexity penalty can be represented by a diamond around the origin i.e., the set of value pairs
(a
1
, a
2
) for which the inequality condition |a
1
| + |a
2
| is true. If the coefcient estimates stay
within the diamond, the constraint is met. Thus, the Lasso solution is the rst point a = (a
1
, a
2
)
where the blue contours touch the red constraint region.
32 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
In high dimensional space, this diamond becomes a rhomboid with many corners. And so if
the un-regularized error surface touches one of the corners, a solution vector a with many entries
equal to zero can be obtained. In the case of a Ridge penalty, the constrained region is actually a
circle, which doesnt have corners, so one almost never gets zero value coefcients.
Finally, a note on normalization of the variables x
j
prior to using the above regularization
procedure. Centering, or transforming the variables so as to have zero mean, is required as the effect
of the penalty is to pull the corresponding coefcients towards the origin. Scaling, or transforming
the variables so as to have unit variance, is optional, but its easy to see that if the variables have
vastly different scales, then the effect of the penalty would be uneven. Sometimes, even the response
is centered so that theres no need to include an offset term in the model.
3.3.4 REGULARIZATIONVIAINCREMENTAL MODEL BUILDING
As the number of variables increases, the quadratic programming approach to nding the Lasso solu-
tion a
Lasso
becomes more demanding computationally. Thus, iterative algorithms producing solutions
that closely approximate the effect of the lasso have been proposed. One such algorithm is called
the Forward Stagewise Linear Regression (or Epsilon Stagewise Linear Regression) (Hastie et al.,
2001). Figure 3.8 sketches out the algorithm.
( )
{ }
=
= =
=
+
=
=
> =
n
j
j j
j j
j
N
i
n
l
ij il l i
j
j
x a F
sign a a
a
x x a y j
M m
M a
1
*
1
2
1
,
* *
) (
write
}
magnitude of
) (
amount mal infinitesi by Increment //
min arg ,
residuals current fits best that predictor Select //
{ to 1 For
large ; 0 ; 0 Initialize
* *
*
Figure 3.8: The Epsilon Stagewise Linear Regression algorithm that approximates the Lasso solution.
The algorithm starts with all the coefcients set to zero and a small epsilon dened. Inside
the loop, the algorithm rst selects the predictor variable that best ts the current residuals i.e.,
the difference between the response and the current model r
i
= (y
i
n
l=1
a
l
x
il
)
2
. The coefcient
associated with that particular variable is incremented by a small amount. The process then continues,
slowly incrementing the value of one coefcient at a time for up to M iterations. M here is a meta-
3.3. REGULARIZATION 33
parameter that needs to be estimated via cross-validation. In the end, some coefcients may never
get updated, staying equal to zero, and in general, | a
j
(M)| is smaller than | a
LS
j
|, the least-squares
estimate.
Note that M acts as the inverse of the regularization parameter . M = 0 corresponds to
>> 0 (full regularization) and M >> 0 corresponds to = 0 (no regularization).
Another algorithm for iteratively computing the Lasso solution is LARS (Efron et al., 2004).
LARS, however, is limited to the least-squares loss. The Path Seeker (PS) algorithm of Fried-
man (Friedman and Popescu, 2004; Friedman, J., 2008) allows the use of other differentiable loss
functions e.g., the Huber Loss. This is very desirable because least-squares is not robust in the
presence of outliers.
The evolution of each coefcient from a
j
(M = 0) = 0 to a
j
(M >> 0) = a
LS
j
, as the algo-
rithmevolves, gradually relaxing the amount of regularization, gives rise to what are called coefcient
paths (see Figure 3.9). At the beginning, all the coefcients are equal to zero; the horizontal axis
-
0
.
2
0
.
0
0
.
2
0
.
4
0
.
6
1.0 0.8 0.6 0.4 0.2 0.0
C
o
e
f
f
i
c
i
e
n
t
s
shrienkage 1/M
x
6
, x
7
x
3
x
5
x
8
x
1
x
2
x
4
Figure 3.9: Coefcient paths example in a problem with eight predictor variables. With maximum
regularization, M = 0, all coefcients a
j
= 0. As the amount of regularizationis decreased, the coefcients
episodically become non-zero and gradually drift further away from zero.
34 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
corresponds to the shrinkage factor s 1/M and the vertical axis corresponds to the values of
the coefcients. As the amount of regularization is reduced, a new variable enters the model and its
coefcient starts growing. The red bar indicates the optimal value of the regularization parameter,
estimated via cross validation. At this point in this example, the coefcient vector a has ve, out of
eight, non-zero coefcients.
3.3.5 EXAMPLE
The data for this example comes from a small study of 97 men with prostate cancer; it is available in
Rs faraway package (farway). There are 97 observations and 9 variables (see Table 3.1). The goal is
to build a linear model for predicting lpsa as a function of the other eight predictors.
Table 3.1: Variables in the prostate cancer data set.
Variable Name Definition
1 lcavol log(cancer volume)
2 lweight log(prostate weight)
3 age age
4 lbph log(benign prostatic hyperplasia amount)
5 svi seminal vesicle invasion
6 lcp log(capsular penetration)
7 gleason Gleason score
8 pgg45 percentage Gleason scores 4 or 5
9 lpsa log(prostate specific antigen)
In R, accessing the data is accomplished with the following commands:
library(faraway)
data(prostate)
attach(prostate)
To check the data size and column names we use:
dim(prostate)
[1] 97 9
names(prostate)
[1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
[8] "pgg45" "lpsa"
3.3. REGULARIZATION 35
Following the analysis of this data done in (Hastie et al., 2001), we randomly split the dataset
into a training set of size 67 and a test set of size 30:
set.seed(321); i.train <- sample(1:97, 67)
x.train <- prostate[i.train, 1:8]; y.train <- prostate[i.train, 9]
x.test <- prostate[-i.train, 1:8]; y.test <- prostate[-i.train, 9]
Package lars (lars) contains an implementation of the Lasso tting algorithm for the least-
squares loss (e.g., Equation (3.5)):
library(lars)
fit.lasso <- lars(as.matrix(x.train), y.train, type="lasso")
plot(fit.lasso, breaks=F, xvar="norm")
The last command plots the coefcient paths for the prostate t (see Figure 3.10(a)) as a
function of the fraction s = |a| /Max |a| (the L1 norm of the coefcient vector, as a fraction of the
maximal L1 norm), which is another way of expressing the amount of regularization done. To select
the optimal value of the coefcient vector a along these paths, we need to estimate prediction error
as a function of s, via cross validation:
cv.lasso <- cv.lars(as.matrix(x.train), y.train, type="lasso")
which results in the plot of Figure 3.10(b). To select the minimum according to the 1-SE rule
discussed earlier (Section 3.3.2), we use:
i.min <- which.min(cv.lasso$cv)
i.se <- which.min(abs(cv.lasso$cv -
(cv.lasso$cv[i.min]+cv.lasso$cv.error[i.min])))
s.best <- cv.lasso$fraction[i.se]
a(s.best ) corresponds to the most parsimonious model with a prediction error within one
standard error of the minimum.
The optimal coefcient values can now be retrieved:
predict.lars(fit.lasso, s = s.best, type = "coefficients",
mode = "fraction")
lcavol lweight age lbph svi lcp
0.5040864 0.1054586 0.0000000 0.0000000 0.3445356 0.0000000
gleason pgg45
0000000 0.0000000
We observe that only three coefcients are non-zero. The corresponding error on the test set
is computed by:
36 3. MODEL COMPLEXITY, MODEL SELECTIONANDREGULARIZATION
0
0.0 0.2 0.4 0.6 0.8 1.0
-
2
2
4
3
6
4
7
5
1
S
t
a
n
d
a
r
d
i
z
e
d
C
o
e
f
f
i
c
i
e
n
t
s
|a|/max|a|
(a)
0.0 0.2 0.4 0.6 0.8 1.0
0
.
6
0
.
8
1
.
0
1
.
2
1
.
4
1
.
6
c
v
fraction
(b)
Figure 3.10: Coefcient paths and cross-validation error for a Lasso t to the Prostate data.
3.3. REGULARIZATION 37
y.hat.test <- predict.lars(fit.lasso, x.test, s = s.best, type = "fit",
mode = "fraction")
sum((y.hat.test$fit - y.test)2) / 30
[1] 0.5775056
which compares favorably to an OLR t using all coefcients:
fit.lm <-lm(lpsa ., data = prostate[i.train,])
fit.lm$coeff
(Intercept) lcavol lweight age lbph
-0.787006412 0.593999810 0.499004071 -0.012048469 0.093761682
svi lcp gleason pgg45
0.847659670 -0.023149568 0.180985575 -0.002979421
y.hat.lm.test <- predict(fit.lm, prostate[-i.train,])
sum((y.hat.lm.test - prostate$lpsa[-i.train])2) / 30
[1] 0.6356946
3.3.6 REGULARIZATIONSUMMARY
Lets revisit Figure 3.2, which provided a schematic of what bias and variance are. The effect of
adding the complexity penalty P(a) to the risk criterion is to restrict the original model space
(see Figure 3.10). The distance between the average regularized estimator
F
Lasso
and the truth F
m=1
c
m
T
m
(x)
where the {T
m
(x)}
M
1
are known as basis functions or also called base learners. For example, each T
m
can be a decision tree. The ensemble is a linear model in a (very) high dimensional space of derived
variables. Additive models like this are not new: Neural Networks (Bishop, C., 1995), Support Vector
Machines (Scholkopf et al., 1999), and wavelets in Signal Processing (Coifman et al., 1992), to name
just a few, have a similar functional form.
Its going to be convenient to use the notation T (x; p
m
) when referring to each base learner
T
m
. That is, each base learner is described by a set of parameters or parameter vector p. For example,
if T
m
is a neural net, p
m
corresponds to the weights that dene the neural net. If T
m
is a tree, p
m
corresponds to the splits that dene the tree. Each possible base learner can then be thought of as a
point in a high-dimensional parameter space P.
With this notation, the ensemble learning problem can be stated as follows: nd the points
p
m
P and the constants c
m
R (real numbers) that minimize the average loss:
{ c
m
, p
m
}
M
o
= min
{c
m
, p
m
}
M
o
N
i=1
L
_
y
i
, c
0
+
M
m=1
c
m
T (x; p
m
)
_
(4.1)
Much like in the case of decision tree regression (see Section 2.2), the joint optimization of
this problem is very difcult. A heuristic two-step approach is useful:
1. Choose the points p
m
. This is equivalent to saying choose a subset of M base learners out
of the space of all possible base learners from a pre-specied family e.g., the family of
5-terminal node trees.
40 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
2. Determine the coefcients, or weights, c
m
via regularized linear regression (see Chapter 3 for
an overview of regularization).
Before introducing the details of the ensemble construction algorithm, consider the example
of Figure 4.1. It is a 2-input, 2-class problem that has a linear decision boundary given by the
diagonal line. As discussed in Chapter 2, linear decision boundaries such as this one are a hard
case for trees, which have to build a stair-like approximation. The decision boundary built by a tree
ensemble based on Boosting (Section 4.5) is still piece-wise constant but with a ner resolution,
thus better capturing the diagonal boundary.
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
x1 x1
x
2
x
2
(a) (b)
Figure 4.1: Comparison of decision boundaries generated by a single decision tree (a) and a Boosting-
based ensemble model (b). The true boundary is the black diagonal. The ensemble approximation is
superior.
Table 4.1 includes code for generating and plotting the data. In R, tting a classication tree
to this data is accomplished with the following commands:
library(rpart)
set.seed(123)
tree <- rpart(y . , data = data2d, cp = 0, minsplit = 4, minbucket = 2,
parms = list(prior=c(.5,.5)))
To prune the tree using the 1-SE rule, discussed in Chapter 3, we use:
i.min <- which.min(tree$cptable[,"xerror"])
41
i.se <- which.min(abs(tree$cptable[,"xerror"] -
(tree$cptable[i.min,"xerror"]
+ tree$cptable[i.min,"xstd"])))
alpha.best <- tree$cptable[i.se,"CP"]
tree.p <- prune(tree, cp = alpha.best)
And to plot the decision boundary, we evaluate the tree on a 100 100 grid and color the
points where the two classes are equally probable:
xp <- seq(0, 1, length = 100)
yp <- seq(0, 1, length = 100)
data2dT <- expand.grid(x1 = xp, x2 = yp)
Z <- predict(tree.p, data2dT)
zp.cart <- Z[,1] - Z[,2]
contour(xp, yp, matrix(zp.cart, 100), add=T, levels=0, labcex=0.9, labels="",
col = "green", lwd=2)
The boosting library available in R, gbm (gbm), requires assigning numeric labels to the
training data points i.e., a +1 to the blue cases and a 0 to the red cases. This is easily done with:
y01 <- rep(0, length(data2d$y))
y01[which(data2d$y == blue)] <- 1
data4gbm <- data.frame(y01, x1, x2)
and tting the boosting model to this data is accomplished with the following commands:
library(gbm)
set.seed(123)
boostm <- gbm(y01 ., data = data4gbm, distribution = "bernoulli",
n.trees = 100, interaction.depth = 2,
n.minobsinnode = 4, shrinkage = 0.1,
bag.fraction = 0.5, train.fraction = 1.0, cv.folds = 10)
here too we rely on cross-validation to choose the optimal model size:
best.iter <- gbm.perf(boostm, method = "cv")
as we did for a single tree, we plot the decision boundary by evaluating the model on the 100 100
grid generated above:
Z.gbm <- predict(boostm, data2dT, n.tree = best.iter, type = "response")
zp.gbm <- 1 - 2*Z.gbm
contour(xp, yp, matrix(zp.gbm, 100), add=T, levels=0, labcex=0.9, labels="",
col = "green", lwd=2)
42 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
Table 4.1: Sample R code for generating the 2-input, 2-class
data of Figure 4.1.
genData <- function(seed, N) {
set.seed(seed)
x1 <- runif(N)
x2 <- runif(N)
y <- rep("", N)
for (i in 1:N) {
if ( x1[i] + x2[i] > 1.0) {
y[i] <- "blue"
}
if ( x1[i] + x2[i] < 1.0) {
y[i] <- "red"
}
if ( x1[i] + x2[i] == 1.0) {
if ( runif(1) < 0.5 ) {
y[i] <- "red"
} else {
y[i] <- "blue"
}
}
}
y <- as.factor(y)
return(data.frame(x1, x2, y))
}
## Generate data
data2d <- genData(123, 200)
summary(data2d$y)
blue red
108 92
## Plot data
i.red <- y == 'red'
i.blue <- y == 'blue'
plot(x1, x2, type="n")
points(x1[i.blue], x2[i.blue], col = "blue", pch = 19)
points(x1[i.red], x2[i.red], col = "red", pch = 19)
lines(c(1,0), c(0,1), lwd=2)
4.1 IMPORTANCESAMPLING
The ISLE framework helps us develop an answer to the question of how to judiciously choose the
basis functions. The goal is to ndgood {p
m
}
M
1
so that the ensemble-based approximation is close
to the target function:
F(x; {p
m
}
M
1
, {c
m
}
M
0
) = c
0
+
M
m=1
c
m
T (x; p
m
)
F
(x) (4.2)
4.1. IMPORTANCESAMPLING 43
ISLE makes a connection with numerical integration. The key idea is that the function were
trying to t above is analogous to a high-dimensional integral:
_
P
I (p) p
M
m=1
w
m
I (p
m
) (4.3)
Consider, for example, the problem of integrating the function I (p) drawn in red in Figure 4.2. The
integral is approximated by the average of the function evaluated at a set of points p
1
, p
2
, . . . , p
M
.
The usual algorithms uniformly choose the points p
m
at which the integrand I (p) is evaluated.
Importance Sampling, on the other hand, recognizes that certain values of these p
m
variables have
more impact on the accuracy of the integral being estimated, and thus these important values
should be emphasized by sampling more frequently from among them.
vs.
Figure 4.2: Numerical integration example. Accuracy of the integral improves when we choose more
points from the circled region.
Thus, techniques such as Monte Carlo integration used for computing the approximation in
Equation (4.3) can be studied for developing algorithms for nding the approximation of Equa-
tion (4.2).
4.1.1 PARAMETERIMPORTANCEMEASURE
In order to formalize the notion of choosing good points from among the space P of all possible
p
m
, we need to dene a sampling distribution. Thus, assume that it is possible to dene a sampling
probability density function (pdf ) r(p) according to which we are going to draw the points p
m
i.e.,
{p
m
r(p)}
M
1
. The simplest approach would be to have r(p) be uniform, but this wouldnt have the
effect of encouraging the selection of important points p
m
. In our Predictive Learning problem,
44 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
we want r(p) to be inversely related to the risk of p
m
(see Chapter 2 Equation (2.1)). If T (x; p
m
)
has high error, then p
m
has low relevance and r(p) should be small.
In Monte Carlo integration, point importance can also be measured in single or group
fashion. In single point importance, the relevance of every point is determined without regard for the
other points that are going to be used in computing the integral. In group importance, the relevance
is computed for groups of points. Group importance is more appealing because a particular point
may not look very relevant by itself, but when it is evaluated in the context of other points that are
selected together, its relevance may be higher.
Computationally, however, the problem of assigning importance to every possible group of
points is very demanding. Thus, one often uses the sequential approximation to group importance,
where the relevance of a particular point is judged in the context of the points that have been selected
so far but not the points that are yet to be selected.
Like any density distribution, r(p) can be characterized by its center and width. Assume
p
represents the best single base learner e.g., the single tree T (x; p
(see Figure 4.3). In Figure 4.3(a), r(p) is narrow, which means points
p
m
will be selected from a small vicinity of p
= arg min
p
E
xy
L(y, T (x; p)) (4.4)
that is, we know how to nd the best single base learner. There are at least three aspects of this
problem that can be perturbed:
1. Perturbation of the data distribution < x, y >. For instance, by re-weighting the observations.
2. Perturbation of the loss function L(). For example, by modifying its argument.
3. Perturbation of the search algorithm used to nd min
p
.
Repeatedly nding the solution to a particular perturbed version of Equation (4.4) is then
equivalent to sampling p
m
s according to r(p), with the width of r(p) controlled by the degree of
perturbation done.
In terms of perturbation sampling, generating the ensemble members {p
m
}
M
1
is thus expressed
with the following algorithm:
For m = 1 to M {
p
m
= PERTURB
m
_
arg min
p
E
xy
L
_
y, T (x; p)
__
}
where PERTURB{} is a small perturbation of the type mentioned above. Later in this chapter,
we will see examples of all these types of perturbations. For instance, the AdaBoost algorithm
46 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
reweights the observations (Section 4.5), Gradient Boosting modies the argument to the loss
function (Section 4.6), and Random Forest (Section 4.4) modies the search algorithm.
4.2 GENERICENSEMBLEGENERATION
The generic ensemble generation algorithm described above was formalized
by Friedman and Popescu (2003) and consists of two steps: select points and t coefcients.
In detail:
Step 1 - Choose {p
m
}
M
1
:
( )
{ }
M
m
m m m
m m
S i
i i m i m
T
T F F
T T
T F y L
M m
F
m
1
1
) (
1
0
) ( write
}
) ( ) ( ) (
) ; ( ) (
) ; ( ) ( , min arg
{ to 1 For
0 ) (
x
x x x
p x x
p x x p
x
p
+ =
=
+ =
=
=
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Algorithm 4.1
The algorithmstarts the ensemble F
0
with some constant function (Line 1); it could be zero or
another suitable constant. Then, at each iteration, a new base-learner T
m
is added into the collection
(Line 3). F
m1
represents the ensemble of base learners up to step m 1.
The expression p
m
= arg min
p
in Line 3 is a slightly modied version of Equation (4.4),
which stands for nding the best (lowest error) base-learner on the selected data; here the inclusion
of F
m1
as an argument to the loss function corresponds to the implementation of the sequential
approximation to group importance. That is, we want to nd the base-learner that in combination
with the ones that have already been selected best approximates the response. The ensemble is then
updated with the newly selected base-learner T
m
(Line 5). After M base learners have been built, the
algorithm terminates in Line 6. Notice the similarity with the forward stagewise tting procedure
for regularized regression of Chapter 3.
Three parameters control the operation of the algorithm, L, , :
- L: the loss function. For example, the AdaBoost algorithm, discussed in Section 4.5, uses the
exponential loss.
- : controls the amount of perturbation to the data distribution. The notation S() in Line 3
indicates a random sub-sample of size N, less than or equal to the original data size.
4.2. GENERICENSEMBLEGENERATION 47
Intuitively, smaller values of will increase the ensemble diversity. As the size of the sample
used to t each base-learner is reduced, the perturbation is larger. also has an impact on
computing time because, if every T
m
is built on 10 percent of the data, for example, the total
time to built the ensemble can be reduced substantially.
- : controls the amount of perturbation to the loss function. Note that Line 5 in the algorithm
can be alternatively expressed as F
m1
(x) =
m1
k=1
T
k
(x), and thus controls how much
the approximation built up to the current iteration inuences the selection of the next base
learner. (F
m1
is sometimes referred to as the memory function). Accordingly, having = 0
is equivalent to using single point importance, and having 0 < 1 corresponds to the
sequential approximation to group importance.
Step 2: Choose coefcients {c
m
}
M
0
Once all the base learners {T
m
(x)}
M
1
= {T (x; p
m
)}
M
1
have been selected, the coefcients are
obtained by a regularized linear regression:
{ c
m
} = arg min
{c
m
}
N
i=1
L
_
y
i
, c
0
+
M
m=1
c
m
T
m
(x
i
)
_
+ P(c) (4.5)
where P(c) is the complexity penalty and is the meta-parameter controlling the amount of regu-
larization as discussed in Chapter 3.
Regularization here helps reduce bias, in addition to variance, because it allows the use of a
wider sampling distribution r(p) in Step 1. Awide sampling distribution permits many weak learners
to be included in the ensemble. If too many weak learners are included, however, the performance of
the ensemble might be poor because those weak learners are tted to small fractions of the data and
could bring noise into the model. But using a narrow distribution isnt desirable either as it would
lead to an ensemble where the predications made by the individual T (x; p
m
) are highly correlated.
With a regularization-based post-processing step, a wide distribution can be used in Step 1, and
then using a Lasso-like penalty in Step 2, the coefcients c
m
of those base learners that are not very
helpful can be forced to be zero.
This regularization step is done on an N M data matrix, with the output of the T
m
s as the
predictor variables. Since M can be on the order of tens of thousands, using traditional algorithms
to solve it can be computationally prohibitive. However, fast iterative algorithms are now available
for solving this problem for a variety of loss functions, including GLMs via Coordinate Descent
(Friedman et al., 2008) and Generalized Path Seeker (Friedman and Bogdan, 2008).
We now turn our attention to the classic ensemble methods of Bagging, Random Forest,
AdaBoot, and Gradient Boosting. To viewthemas special cases of the Generic Ensemble Generation
procedure, we will copy Algorithm 4.1. and explain how the control parameters are set in each case.
We will also discuss the improvements that the ISLE framework suggests in each case.
48 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
4.3 BAGGING
Breimans Bagging, short for Bootstrap Aggregation, is one of the earliest and simplest ensembling
algorithms (Breiman, L., 1996). In Bagging, the T
m
s are tted to bootstrap replicates of the training
set D (a bootstrap replicate D
e
u
q
= 0 means there is nomemory, and each T
m
is tted independently without any awareness
of the other base-learners that have already been selected. = N/2 means each T
m
is built using half-
size samples without replacement (it can be shown that using size N/2 samples without replacement
is roughly equivalent to using size N samples with replacement (Friedman and Popescu, 2003), as
called for in the original formulation of Bagging). The base learners T
m
were typically large un-
pruned trees. Well see in Section 4.7, devoted to the MART algorithm, that there are situations
where this is not the best thing to do.
The coefcients in Bagging are c
o
= 0, {c
m
= 1/M}
M
1
, namely, a simple average. They are not
tted to the data. This averaging for regression is justiable assuming a squared-error loss. Although
the original formulation called for a majority vote to be used for classication problems, averaging
would also work since trees already estimate class probabilities.
Bagging does not t the coefcients to the data after the trees have been selected. It just
simply assigns the same coefcient to all the trees. In summary, Bagging perturbs only one of the
three possible knobs, perturbing the data distribution only.
Note that because Bagging has no memory, it is easily parallelizable: the task of building each
T
m
can be spun to a different CPU(Panda et al., 2009). This divisibility is a computational advantage
of the algorithm.
4.3. BAGGING 49
What improvements can be made to Bagging when viewed from the generic ensemble gen-
eration algorithm perspective? We suggest:
1. Use a different sampling strategy. There is no theoretical reason to prefer = N/2 over other
possible values. Using smaller values, e.g., <50% samples, will make the ensemble build faster.
2. Fit the coefcients to the data. That is, instead of using a simple average, {c
m
= 1/M}
M
1
, the
coefcients can be obtained by a regularized linear regression (see Equation (4.5)).
In R, an implementation of Bagging is available in the ipred package (ipred).
Figure 4.4 and Figure 4.5 (from Elder and Ridgeway (1999)) show how the out-of-sample
error for a suite of regression and classication problems, respectively, reduced on all of the problems
attempted, when going from a single tree to a bagged ensemble model. Though these are primarily
academic datasets, our practical experience has been consistent with this nding; so far, we have
always seen ensembles of trees beat single trees in out-of-sample performance. (This is not the case
for ensembles vs. single models, in general, as can be seen from the examples in Chapter 1, but just
for ensembles of decision trees vs. single trees.)
4.3.1 EXAMPLE
We now look at an example of Bagging in action using a very simple data set borrowed from the
Elements of Statistical Learning book (Hastie et al., 2001). (Which is a must-have reference in our
eld.)
There are ve predictor variables, each having a standard Normal distribution with pairwise
correlation of 0.95. The response variable is created so as to generate a 2-class problem according to
the rule
P(y = 1|x
1
0.5) = 0.2, P(y = 1|x
1
> 0.5) = 0.8
Thus, the minimum possible error rate (known as the Bayes error rate) is 0.2. Table 4.3 has R
code for generating the data.
It follows from its denition that the response depends on x
1
only. A small training data
set is generated with N = 30 (see functions genPredictors and genTarget in Table 4.3), from which
200 bootstrap samples are created (see function genBStrapSamp). A separate test set of size 2000
is also generated to evaluate an ensemble model built using Bagging. Table 4.4 shows R code for
building the ensemble using these bootstrap replicates of the training data set.
The rst two trees are shown in Figure 4.6. In the rst, we see that the root split is on x
1
,
and that the split denition closely matches the rule used to generate the data. In the second tree,
the rst split is on x
2
; because the variables are highly correlated and the sample size small, it is
possible for x
2
to take the role of x
1
. Bagging then continues along these lines, tting a tree for each
bootstrap sample.
Figure 4.7 shows the test error of the bagged ensemble. The Bayes error (minimum possible,
with unlimited data) is indicated with a green line. The error of a single tree, tted to the original
50 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
Boston Housing Ozone Friedman1 Friedman2 Friedman3
Figure 4.4: Bagged trees better than single tree for ve regression problems taken from the UC Irvine
data repository.
Diabetes Breast Ionosphere Heart Soybean Glass Waveform
Figure 4.5: Bagged trees better than single tree on seven classication problems taken from the UC
Irvine data repository.
4.3. BAGGING 51
Table 4.3: Sample R code for generating articial data. Function
genPredictors generates 5-dimensional observations, function genTarget
generates the corresponding response variable and function genBStrap-
Samp generates bootstrap replicates.
genPredictors <- function(seed = 123, N = 30) {
# Load package with random number generation
# for the multivariate normal distribution
library(mnormt)
# 5 "features" each having a "standard" Normal
# distribution with pairwise correlation 0.95
Rho <- matrix(c(1,.95,.95,.95,.95,
+ .95, 1,.95,.95,.95,
+ .95,.95,1,.95,.95,
+ .95,.95,.95,1,.95,
+ .95,.95,.95,.95,1), 5, 5)
mu <- c(rep(0,5))
set.seed(seed);
x <- rmnorm(N, mu, Rho)
colnames(x) <- c("x1", "x2", "x3", "x4", "x5")
return(x)
}
genTarget <- function(x, N, seed = 123) {
# Response Y is generated according to:
# Pr(Y = 1 | x1 <= 0.5) = 0.2,
# Pr(Y = 1 | x1 > 0.5) = 0.8
y <- c(rep(-1, N))
set.seed(seed);
for (i in 1:N) {
if ( x[i,1] <= 0.5 ) {
if ( runif(1) <= 0.2 ) {
y[i] <- 1
} else {
y[i] <- 0
}
} else {
if ( runif(1) <= 0.8 ) {
y[i] <- 1
} else {
y[i] <- 0
}
}
}
return(y)
}
genBStrapSamp <- function(seed = 123, N = 200, Size = 30) {
set.seed(seed)
sampleList <- vector(mode = "list", length = N)
for (i in 1:N) {
sampleList[[i]] <- sample(1:Size, replace=TRUE)
}
return(sampleList)
}
52 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
x2<-0.2202
x2>=-0.2202
x3>=-1.644
x3<-1.644
x1>=-0.1458
x1<-0.1458
x1>=0.5656
x1<0.5656
x1 >=0.823
x1>=- 1.121
x1<- 1.121
x2<- 0.2202
x2>=- 0.2202
x2>=0.4795
x2<0.4795
x1<-0.409
x1>=-0.409
x1<0.823
Figure 4.6: First two classication trees in a bagged ensemble for the problem of Table 4.3.
4.3. BAGGING 53
Table 4.4: Sample R code for generating a classication ensemble based
on the Bagging algorithm. Function tClassTree ts a single decision tree
to a given training data set. Function tBStrapTrees keeps track of the trees
built for each bootstrap replicate.
fitBStrapTrees <- function(data, sampleList, N) {
treeList <- vector(mode = "list", length = N)
for (i in 1:N) {
tree.params=list(minsplit = 4, minbucket = 2, maxdepth = 7)
treeList[[i]] <- fitClassTree(data[sampleList[[i]],],
tree.params)
}
return(treeList)
}
fitClassTree <- function(x, params, w = NULL,
seed = 123) {
library(rpart)
set.seed(seed)
tree <- rpart(y ~ ., method = "class",
data = x, weights = w, cp = 0,
minsplit = params.minsplit,
minbucket = params.minbucket,
maxdepth = params.maxdepth)
return(tree)
}
training set, is indicated with a red line. This single tree happens to be a stump after cross-validation
was used to decide the optimal pruning. Selecting only a stump isnt surprising giventhe highvariance
on this data due to the correlation in the predictors.
The error of the Bagged ensemble is shown (in blue) as a function of the ensemble size. As
the number of trees increases, the error generally decreases until it eventually attens out. Bagging
effectively succeeds in smoothing out the variance here and hence reduces test error.
The attening of the error of the bagged ensemble is not unique behavior for this example.
As discussed in the next section, Bagging for regression wont over-t the data when the number
of trees is arbitrarily increased.
4.3.2 WHY ITHELPS?
So why does bagging help? The main reason is that Bagging reduces the variance and leaves bias
unchanged. Under the squared-error loss, L(y, y) = (y y)
2
, the following formal analysis can be
done. Consider the idealized bagging (aggregate) estimator
F(x) = E
F
Z
(x) i.e., the average of
the
F
Z
s, each t to a bootstrap data set Z = {y
i
, x
i
}
N
1
. For this analysis, these Zs are sampled from
the actual population distribution (not the training data). Using simple linear algebra and properties
54 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
of the expectation operator, we can write:
E
_
Y
F
Z
(x)
_
2
= E
_
Y
F(x) +
F(x)
F
Z
(x)
_
2
= E
_
Y
F(x)
_
2
+ E
_
F
Z
(x)
F(x)
_
2
E
_
Y
F(x)
_
2
The above inequality establishes that the error of one single
F
Z
is always greater than the
error of
F. So true population aggregation never increases mean square error, and it often reduces it.
The above argument does not hold for classication because of the non-additivity of bias
and variance. That is, bagging a bad classier can make it worse!
4.4 RANDOMFOREST
The RandomForest technique (Breiman, L., 2001) is Bagging plus a perturbation of the algorithm
used to t the base learners. The specic form of the perturbation is called subset splitting.
As discussed in Section 2.1, the process of building a single tree entails successively nding the
best split at each node by considering every possible variable in turn; for each variable, every possible
Stump tree from CV
Bayes
200 150 100 50 0
0
.
2
0
0
.
2
5
0
.
3
5
0
.
4
0
0
.
4
5
Ensemble Size
T
e
s
t
E
r
r
o
r
0
.
3
0
Figure 4.7: Evaluation error rate for a bagged ensemble as a function of the number of trees. The
ensemble was t to the data of Table 4.2.
4.4. RANDOMFOREST 55
split point is evaluated. In RandomForest, only a randomsubset of the variables is considered at each
node. Breimans recommended size for the random subset is n
s
=
_
log
2
(n) + 1
_
. Thus, with 100
predictor variables; every time that a tree node needs to be split, a random sample of 11 predictors
is drawn.
Clearly, Random Forest falls under the perturbation sampling paradigm we have discussed.
And, compared to Bagging, its {T
m
(x)}
M
1
set has increased diversity i.e., wider r(p), the width
of which is (inversely) controlled by n
s
. Random Forest also has a signicant speed improvement
advantage over Bagging since fewer splits need to be evaluated at each node.
As in the case of Bagging, two potential improvements are possible: 1. use of a different data
sampling strategy (not xed to bootstrap samples), and 2. t the quadrature coefcients to the data.
In R, Random Forest for classication and regression is available in the randomForest pack-
age (randomForest).
How do Bagging and Random Forest compare in terms of accuracy? Figure 4.8 shows the
results of a simulation study with 100 different target functions conducted by Friedman and Popescu
(2004). In the x-axis, we have four algorithms: Bagging (Bag), Random Forest (RF), and one im-
provement of each. The y-axis in the plot represents what is called comparative root mean squared
(RMS) error: for every problem, the error of each algorithm is divided by the error of the single
best algorithm for that particular problem; the resulting distribution is summarized with a boxplot.
Thus, if a given algorithm was consistently the best across all problems, the corresponding boxplot
will only show a horizontal line at 1.0.
One sees that the increased diversity in RandomForest often results in higher error. To benet
fromthat additional diversity, we really need to do the post-processing phase of the ISLEframework.
The notation xxx_6_5%_P indicates 6 terminal nodes trees (instead of large unpruned ones), 5%
samples without replacement (instead of bootstrap ones), and Post-processing i.e., tting the {c
m
}
M
0
coefcients to the data using regularized regression.
In both cases, the ISLE-based versions of Bagging and Random Forest improved accuracy
over their standard formulations. Their rankings are now reversed: ISLE-RF is often more accurate
than ISLE-Bagging, so the extra perturbation was worth it. Finally, the ISLE-based versions can
be 20-100 times faster than the standard versions because every tree is built on 5 percent of the data
only, and at every iteration, a tree is built with only six terminal nodes, as opposed to fully grown
trees.
Although we dont know how much each of the changes made in the algorithms contributed
to the improvement in accuracy, it should be clear that they work together: by increasing the width
of r(p), a more diverse collection of base learners is obtained, and then the post-process step lters
out those base learners that arent very helpful. But the right width of r(p), controlled by the amount
of perturbation done, is problem-dependent and unknown in advance. Thus, experimentation is
required.
56 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
4.5 ADABOOST
Next, the AdaBoost algorithm of Freund and Schapire (1997) is considered. AdaBoost, short for
Adaptive Boosting, was initially proposed for 2-class classication problems, and derives its name
from its ability to boost the performance of weak (moderately accurate) base classiers T
m
.
Table 4.5 expresses AdaBoost as an ISLE algorithm.
AdaBoost uses an exponential loss function: L(y, y) = exp(y y). = 1 means AdaBoost
implements a sequential sampling approximation to group importance where the relevance of base
learner T
m
is judged in the context of the (xed) previously chosen base learners T
1
, . . . , T
m1
.
= N, so the data distribution is not perturbed via random sub-sampling; instead, observation
weights are used.
The coefcients {c
m
}
M
1
are not estimated with post-processing, but they are not set to 1/M
either, as was done in the case of Bagging and RandomForest. They are estimated sequentially in
the following way: at each iteration, the best T
m
is found and then immediately the corresponding
c
m
is estimated.
In its original formulation, the algorithmwas restricted to 2-class problems and the model out-
put was y = sign (F
M
(x)) = sign
_
M
m=1
c
m
T
m
(x)
_
. Section 4.5.1 explains why this is a reasonable
C
o
m
p
a
r
a
t
i
v
e
R
M
S
E
r
r
o
r
1
.
0
1
.
1
1
.
2
1
.
3
1
.
4
Bag RF Bag_6_5%_P RF_6_5%_P
Figure 4.8: Error comparison between classic Bagging and Random Forest with their ISLE-based
improved versions (based on (Friedman and Popescu, 2003)). The improved versions use 6-terminal
node trees and post-processing for tting the tree coefcients.
4.5. ADABOOST 57
Table 4.5: AdaBoost as an ISLE-based algorithm.
- ) exp( ) , ( y y y y L =
} 1 , 1 { e y
- 1 = u
- N = q
- ) (x
m
T : any weak learner
-
M
m o
c c
1
} { , 0 = : sequential
partial regression coefficients
( )
M
m m
m m m m
m m
S i
i i m i
c
m m
T c
T c F F
T T
T c F y L c
M m
F
m
1
1
) (
1
,
0
)} ( , { write
}
) ( ) ( ) (
) ; ( ) (
) ; ( ) ( , min arg ) , (
{ to 1 For
0 ) (
x
x x x
p x x
p x x p
x
p
+ =
=
+ =
=
=
e
u
q
classication rule under the exponential loss. The algorithm was later extended to regression prob-
lems (Real AdaBoost (Friedman, J., 2001)). In R, package gbm (gbm) implements the AdaBoost
exponential loss for 0-1 outcomes.
Table 4.6 shows the AdaBoost algorithm as presented in its original refer-
ence (Freund and Schapire, 1997). Comparing it with the ISLE-based formulation of Table 4.5,
the two algorithms dont look alike. However, it can be shown that they are equivalent.
Table 4.6: The AdaBoost algorithmin its original
(non-ISLE) formulation.
| |
( )
=
+
=
=
= =
=
=
=
=
=
M
m
m m
i m i m
m
i
m
i
m m m
N
i
m
i
i m i
N
i
m
i
m
m
i m
i
T sign
T y I w w
err err
w
T y I w
err
w T
M m
N w
1
) ( ) 1 (
1
) (
1
) (
) (
) 0 (
) ( Output
}
) ( ( exp Set d.
) ) 1 ( log( c.
)) ( (
Compute b.
data with training to ) ( classifier a Fit a.
{ to 1 For
1 : weights n observatio
x
x
x
x
o
o
o Compute
The proof, given in Appendix A, requires showing:
58 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
1. Line p
m
= arg min ()
p
in the ISLE-based algorithm is equivalent to line a. in the AdaBoost
algorithm
2. Line c
m
= arg min ()
c
in the ISLE-based algorithm is equivalent to line c. in the AdaBoost
algorithm
3. How the weights w
(m)
i
are derived/updated.
4.5.1 EXAMPLE
Implementing the AdaBoost algorithm in R is straightforward, as shown in Table 4.7. Figure 4.9
illustrates the evolution of observation weights for the 2-input, 2 class problem introduced at the
beginning of this chapter.
Table 4.7: The AdaBoost algorithm in R for 2-class
classication.
fitAdaBoostTrees <- function(data, M) {
N <- nrow(data)
w <- rep(1/N, N)
alpha <- rep(NA, M)
treeList <- vector(mode = "list", length = M)
tree.params <- list(minsplit = 4, minbucket = 4, maxdepth = 2)
for (m in 1:M) {
treeList[[m]] <- fitClassTree(data, tree.params, w)
yHat <- predict(treeList[[m]], data, type = "class")
i.mistakes <- which(yHat != data$y)
err <- sum(w[i.mistakes])/sum(w)
alpha[m]=log((1-err)/err)
ind <- rep(0, N)
ind[i.mistakes] <- 1
if ( m == 1 ) {
W <- w
} else {
W <- cbind(W, w)
}
w <- w * exp(alpha[m] * ind)
}
return(list(trees=treeList, wghts=W))
}
Initially, all the weights are equal and set to 1
_
N. As the algorithm evolves, it up-weights
the cases that remain in error-especially if theyre rare amongst those that are misclassied-and it
down-weights the cases that are being recognized correctly. After changing the case weights, and
saving the previous model, the algorithm builds a new model using these new case weights. So as
a case is misclassied over and over, its relative weight grows, and the cases that the model gets
right start to fade in importance. Eventually as you iterate, the points that are in dispute have all the
weight and are all along the boundary.
4.5. ADABOOST 59
0.0 0.2 0.4 0.6
x1
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
1.0 0.8 0.0 0.2 0.4 0.6
x1
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
x
2
1.0 0.8
x
2
(a) (b)
Figure 4.9: Evolution of AdaBoost observation weights in a 2-input, 2 class problem that has a linear
decision boundary. In (a), all observations have the same weight. In (b), after a few iterations, only points
along the decision boundary have a non-zero weight.
4.5.2 WHYTHEEXPONENTIAL LOSS?
It is not evident fromTable 4.6 that AdaBoost is using an exponential loss. This fact was established
in (Friedman, J., 2001) after the algorithm had been proposed and found by practitioners to work
well. One advantage of using the exponential loss is implementation convenience. That is, the
exponential loss leads to the straightforward algorithm of Table 4.7. With other loss functions that
one might like to use, solving the optimization problem involved in nding each T
m
(i.e., solving
Equation (4.4)) is a more elaborate process.
Putting implementation convenience aside, is there a connection between the function built
by AdaBoost and class probabilities? Happily, it can be shown that the algorithm converges towards
the half-log odds (see proof in Section 4.5.2):
F
AdaBoost
(x) = arg min
F(x)
E
Y|x
_
e
YF(x)
_
=
1
2
log
Pr(Y=1|x)
Pr(Y=1|x)
This is a reassuring result; it says that the exponential loss leads to a meaningful population
minimizer, which justies the use of sign(F) as the classication rule. It can also be shown that this
is the same population minimizer obtained using the (negative) binomial log-likelihood,
E
Y|x
(l(Y, F(x)) = E
Y|x
_
log(1 + e
YF(x)
_
60 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
which is a more familiar loss in other domains. The exponential loss and the (negative) binomial log-
likelihood, or binomial deviance, can be seen as continuous approximations to the misclassication
loss. This is illustrated in Figure 4.10, which shows them both as functions of the margin y F.
As indicated in Table 4.5, we are using a 1 vs 1 encoding of the response in a 2-class classication
problem. If the response y has the same sign as the ensemble output F(x), then the misclassication
cost is 0; otherwise, it is 1 (see the blue line).
L
o
s
s
Figure 4.10: The exponential loss and binomial deviance approximations to misclassication loss.
Misclassication loss, however, is discontinuous at y F = 0 and this precludes the use of
gradient-descent algorithms to nd the minimum risk T
m
(Line 3 of Algorithm 4.1). The expo-
nential loss and the binomial deviance are thus surrogate criteria used to solve the minimization
problem. Both penalize negative margin values (i.e., mistakes) more heavily than they reward in-
creasingly positive ones. The difference is in the degree: binomial deviance increases linearly, whereas
exponential loss increases exponentially, thus concentrating more inuence onobservations withlarge
negative margins. Thus, in noisy situations where there may be mislabeled data, exponential loss will
lead to performance degradation.
Finally, the exponential loss has no underlying probability and has no natural generalization
to K classes, whereas the binomial deviance does (Friedman, J., 1999).
4.5.3 ADABOOSTS POPULATIONMINIMIZER
In this section, we show that F
AdaBoost
(x) = arg min
F(x)
E
Y|x
_
e
YF(x)
_
=
1
2
log
Pr(Y=1|x)
Pr(Y=1|x)
. We start
by expanding E
Y|x
_
e
YF(x)
_
:
E(e
yF(x)
|x) = Pr(Y = 1|x) e
F(x)
+ Pr(Y = 1|x) e
F(x)
4.6. GRADIENTBOOSTING 61
Next, we nd the minimum of this function by setting its rst derivative to zero and solving
for F(x):
E(e
yF(x)
|x)
F(x)
= Pr(Y = 1|x) e
F(x)
+ Pr(Y = 1|x) e
F(x)
thus
E(e
yF(x)
|x)
F(x)
= 0 Pr(Y = 1|x) e
F(x)
= Pr(Y = 1|x) e
F(x)
ln Pr(Y = 1|x) + F(x) = ln Pr(Y = 1|x) F(x)
2F(x) = ln Pr(Y = 1|x) ln Pr(Y = 1|x)
F(x) =
1
2
ln
Pr(Y = 1|x)
Pr(Y = 1|x)
.
4.6 GRADIENTBOOSTING
The next seminal work in the theory of Boosting came with the generalization of AdaBoost to any
differential loss function. Friedmans Gradient Boosting algorithm (Friedman, J., 2001), with the
default values for the algorithm parameters, is summarized in Table 4.8.
Table 4.8: Gradient Boosting as an ISLE-based algorithm.
- General ) , ( y y L and y
- 1 . 0 = u
- 2 / N = q
- ) (x
m
T : any weak learner
-
=
=
N
i
i c o
c y L c
1
) , ( min arg
- { }
M
1 m
c : shrunk sequential
partial regression coefficients
( )
M
m m
m m m m
m m
S i
i i m i
c
m m
T c
T c F F
T T
T c F y L c
M m
c F
m
1
1
) (
1
,
0 0
)} ( ), {( write
}
) ( ) ( ) (
) ; ( ) (
) ; ( ) ( , min arg ) , (
{ to 1 For
) (
x
x x x
p x x
p x x p
x
p
+ =
=
+ =
=
=
e
u
u
q
The rst element of the ensemble is a constant function which depends on the particular loss
being used (see Appendix B for an example). > 0, means Gradient Boosting also implements
the sequential sampling approximation to group importance; however, compared with AdaBoosts
= 1, more perturbation of the loss is taking place here. As in AdaBoosts case, the coefcients are
tted incrementally, one at every step.
The similarity with the Forward Stagewise Linear Regression procedure of Section 3.3.4,
with {T
m
(x)}
M
1
as predictors, should be evident. And the fact that < 1 means that Gradient
62 4. IMPORTANCESAMPLINGANDTHECLASSICENSEMBLEMETHODS
Boosting is doing shrinkage implicitly: the coefcients are shrunk with a Lasso-type penalty with
the shrinkage controlled by . This fact is today understood to be one of the key reasons for the
superior performance of the algorithm.
The process of instantiating the above template to any differential loss function is covered in
Appendix B.
In R, package gbm (gbm) implements Gradient Boosting for regression and classication
problems using a variety of loss functions (e.g., least squares, absolute-deviations, logistic, etc.).
4.7 MART
MART, short for Multiple Additive Regression Trees, is the name that J. Friedman gave to his im-
plementation of Gradient Boosting when the base learners are trees, which leads to computationally
efcient update rules in the gradient descent algorithm required to solve Equation (4.4) (as shown
in Appendix B).
When using trees as base learners, one has the option of controlling the size of each tree,
which is important for the following reason. The theory of ANOVA decomposition of a function
tells us that most multivariate functions can be expressed as a constant plus the sum of functions
of one variable (main effects), plus the sum of functions of two variables (two-variable interaction
effects), and so on. Most functions can then be expressed in the following form:
F(x) =
j
f
j
(x
j
)
+
j,k
f
jk
(x
j
, x
k
)
+
j,k,l
f
jkl
(x
j
, x
k
, x
l
)
+. . .
and usually there is a dominant interaction order among its variables. Thus, if the functional
model we are using to build our approximation doesnt support the right interaction order, we will
be incurring extra bias and/or variance.
If the base learners T
m
(x) are Jterminal-node trees, then the interaction order allowed by
the approximation can be controlled by adjusting J. For instance, using J = 2 means the ensemble
can model main-effects-only functions because a 2-terminal-node tree is a function of only one
variable. Similarly, using J = 3 means the ensemble can model target functions of up to two-variable
interactions. Again, this is so because a 3-terminal-node tree is a function of at most 2 variables and
so on.
Obviously, the optimal value of J should reect the (unknown) dominant interaction level of
the target function. Thus, one must ne-tune the value of J by trying different values and choosing
the one that produces the lowest error. As a rule of thumb, however, in most practical problems
low-order interactions tend to dominate; i.e., 2 J 8.
4.8. PARALLELVS. SEQUENTIAL ENSEMBLES 63
4.8 PARALLELVS. SEQUENTIAL ENSEMBLES
Figure 4.11 is taken from a Friedman and Popescu (2003) comparative study of over 100 different
target functions. The rst four algorithms are the same ones discussed in Figure 4.8: Bagging,
Random Forest, and their corresponding ISLE based improvements. The notation Seq__%_P
indicates 6-terminal-node trees, sequential approximation to importance sampling, memory
factor, %-size samples without replacement, and post-processing. Thus, MART is equivalent to
Seq_0.1_50%.
As in the cases of Bagging and Random Forest, adding the post-processing step of tting the
coefcients {c
m
}
M
0
to the data using regularized regression after the {T
m
}
M
1
have been tted results
in an improvement over MART. The best algorithm also differs from MART in terms of and :
a smaller means that each T
m
is tted to a smaller data set (increased diversity), which requires
smaller values of (slower learning).
Overall, the sequential ISLE algorithms tend to perform better than parallel versions. This
is consistent with results observed in classical Monte Carlo integration (Friedman and Popescu,
2003).
1
.
0
1
.
2
C
o
m
p
a
r
a
t
i
v
e
R
M
S
E
r
r
o
r
1
.
4
1
.
6
1
.
8
2
.
0
Sequential
Parallel
Bag
Bag_6_5%_P
RF
RF_6_5%_P
Seq_0.1_50%
Seq_0.01_20%_P MART
Figure 4.11: Comparison between Parallel and Sequential ensemble models (adapted
from (Friedman and Popescu, 2003)).
65
C H A P T E R 5
Rule Ensembles and
Interpretation Statistics
Ensemble methods perform extremely well in a variety of problem domains, have desirable statis-
tical properties, and scale well computationally. A classic ensemble model, however, is not as easy
to interpret as a simple decision tree. In this chapter, we provide an overview of Rule Ensem-
bles (Friedman and Popescu, 2005; Friedman and Bogdan, 2008), a new ISLE-based model built
by combining simple, readable rules. While maintaining (and often improving) the accuracy of the
classic tree ensemble, the rule-based model is much more interpretable. In this chapter, we will also
illustrate recently proposed interpretation statistics which are applicable to Rule Ensembles as well
as to most other ensemble types.
Recapping from Chapter 4, an ensemble model of the type we have been considering is a
linear expansion of the form
F(x) = c
0
+
M
m=1
c
m
f
m
(x) (5.1)
where the {f
m
(x)}
M
1
are derived predictors (basis functions or base learners) which capture non-
linearities and interactions. Fitting this model to data is a 2-step process:
1. Generate basis functions
_
f
m
(x) = f (x; p
m
)
_
M
1
, and
2. post-t to the data via regularized regression to determine the coefcients {c
m
}
M
0
.
The resulting model is almost always signicantly more accurate than a single decision tree.
The simple interpretation offered by a single tree, however, is no longer available. A signicant
improvement toward interpretability is achieved with rule ensembles.
5.1 RULEENSEMBLES
As discussed in Chapter 2, the functional form of a Jterminal node decision tree can be expressed
as a linear expansion of indicator functions:
T (x) =
J
j=1
c
j
I
R
j
(x)
66 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
where
I
A
(x) is 1 if x A and 0 otherwise,
the c
j
s are the constants associated with each terminal node,
and the
R
j
s are the regions in x-space dened by the terminal nodes.
In the case of terminal nodes that are at a depth greater than 1, the corresponding I
R
j
(x)
functions can themselves be expressed as a product of indicator functions. For instance,
R
1
in
Figure 2.3 is dened by the product of I (x
1
> 22) and I (x
2
> 27). Thus, I
R
1
(x) is equivalently
expressed by the following conjunctive rule:
if x
1
> 22 and x
2
> 27 then 1 else 0
Table 5.1 lists the rules for the other regions induced by the tree of Figure 2.3. In the ensemble
model of Equation (5.1), the base learners f
m
(x) typically have been trees, but one could use the
rules r
j
(x) instead. Because each rule is a simple readable statement about attributes of x, then a
model using these rules can be more interpretable.
Table 5.1: The regions in input (x) space generated by a decision tree expressed as
a set of binary rules (adapted from (Hastie et al., 2001)).
Region Rule
R
1
r
1
(x) =I(x
1
>22) I(x
2
>27)
R
2
r
2
(x) =I(x
1
>22) I(0 x
2
27)
R
3
r
3
(x) =I(15<x
1
22) I(0 x
2
)
R
4
r
4
(x) =I(0 x
1
15) I( x
2
>15)
R
5
r
5
(x) =I(0 x
1
15) I( 0 x
2
15 )
A rule ensemble is still a piecewise constant model and linear targets are still be problematic
as they were with single trees. But it is possible to combine base learners from different families.
Thus, the non-linear rules can be complemented with purely linear terms. The resulting model has
the form:
F(x) = a
0
+
m
a
m
r
m
(x) +
j
b
j
x
j
(5.2)
In terms of the ISLE framework (Algorithm 4.1), we need some approximate optimization
approach to solve:
p
m
= arg min
p
iS
m
()
L
_
y
i
, F
m1
(x
i
) + r(x
i
; p)
_
5.2. INTERPRETATION 67
where p
m
are the split denitions for rule r
m
(x). But instead of solving this optimization problem
directly, it is possible to take advantage of a decision tree ensemble in the following way: once the
set {T
m
(x)}
M
1
has been built, each tree T
m
(x) can be decomposed into its rules r
j
(x), and let those
be the base learners that are being combined together during the coefcient estimation step.
In the case of shallow trees, as commonly used with Boosting, regions corresponding to non-
terminal nodes can also be included. If all nodes are counted, a Jterminal node tree generates
2 (J 1) rules. Once the rules have been generated, the next step is to t the corresponding
coefcients using the same linear regularized procedure discussed in Chapters 3 and 4:
({ a
k
}, {
b
j
}) = arg min
{a
k
},{b
j
}
N
i=1
L
y
i
, a
0
+
K
k=1
a
k
r
k
(x
i
) +
p
j=1
b
j
x
ij
k=1
|a
k
| +
p
j=1
b
j
where,
p n indicates the number of input predictors that are of continuous type, and which are
desired to be included as purely linear terms. It is often sensible to replace these x
j
variables
by their winzorized (Winsorize) versions.
M is the number of trees in the ensemble.
K =
M
m=1
2 (J
m
1) denotes the total number of rules.
As in the case of tree ensembles, tree size, J, controls rule complexity: a Jterminal node
tree can generate rules involving up to (J 1) variables. Thus, modeling Jorder interactions
requires rules with J or more components.
5.2 INTERPRETATION
Chapter 2 dened the Predictive Learning problem and identied two types of modeling tasks:
predictive and descriptive. A predictive classication model has the goal of assigning a set of
observations into one of two or more classes. For example, a predictive model can be used to score
a credit card transaction as fraudulent or legitimate. A descriptive model, on the other hand, has
the additional goal of describing the classication instead of just determining it. In semiconductor
manufacturing, for example, descriptive models are used to identify and understand defect causes.
In these descriptive tasks, the ability to derive interpretations from the resulting classication model
is paramount.
Over the last few years, new summary statistics have been developed to interpret the models
built by ensemble methods. These fall into three groups:
68 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
1. Importance scores: to answer the question of which particular variables are the most relevant.
Importance scores quantify the relative inuence or contribution of each variable in predicting
the response. In the manufacturing domain, for example, datasets can contain thousands of
predictor variables, and importance scores allow the analyst to focus the defect diagnosis effort
on a smaller subset of them.
2. Interaction statistic: to answer the question of which variables are involved in interactions
with other variables and to measure the strength and degrees of those interactions. In the
manufacturing domain case again, interaction statistics could show that a combination of a
particular machine being used for a certain step is a cause of defects but only when operated
at a certain time of the day.
3. Partial dependence plots: to understand the nature of the dependence of the response on inu-
ential inputs. For example, does the response increase monotonically with the values of some
predictor x
j
? Partial dependence plots allow visualizing the model as a function of, say, two of
the most important variables at a time (while averaging over all other variables).
Most of the math related to the calculation of these interpretation statistics can be applied to
any base learner, but they are easier to compute for trees (as shown in Sections 5.2.2 through 5.2.4
below).
5.2.1 SIMULATEDDATAEXAMPLE
To illustrate the interpretation methodology, consider the following simple articial example (bor-
rowed from (Breiman et al., 1993)). There are ten predictor variables generated according to the
following rule:
p(x
1
= 1) = p(x
1
= 1) = 1
_
2
p(x
2
= 1) = p(x
2
= 1) = 1
_
2
p(x
m
= 1) = p(x
m
= 0) = p(x
m
= 1) = 1
_
3 m = 3, . . . ,10 (5.3)
and the response variable according to:
y =
_
3 + 3 x
2
+ 2 x
3
+ x
4
+ z x
1
= 1
3 + 3 x
5
+ 2 x
6
+ x
7
+ z x
1
= 1
(5.4)
Thus, x
1
and x
2
take two values, 1 and 1, with equal probability, and the other eight variables
take three values, 1, 0, and 1, also with equal probability. The response variable depends on x
1
through x
7
, but not on x
8
, x
9
, or x
10
. Therefore, x
8
, x
9
, and x
10
are irrelevant inputs that will
be present in the input vectors x. The response variable denition also indicates that there is an
interaction between x
1
and x
2
, x
1
and x
3
, x
1
and x
4
, but there is no interaction among them.
Similarly, there is an interaction between x
1
and x
5
, x
1
and x
6
, x
1
and x
7
, with no interaction among
5.2. INTERPRETATION 69
x
5
, x
6
, and x
7
. Finally, z represents noise added to the response variable and has a normal distribution
z N (0, 2).
The goal is to see whether or not one can recover the target rule denition using the ensemble
interpretation methodology.
We generated a small training data set with N = 200. Figure 5.1 shows a single tree tted
to this data. The tree correctly splits on x
1
rst, suggesting the greater importance of this variable.
Then, it splits on x
2
and x
5
at the second level, suggesting that these variables come next in terms
of relevance in predicting the response. The tree splits on x
3
and x
7
after that. Irrelevant variables
x
8
, x
9
, and x
10
are not present in any tree split.
x1=-1
x2=-1 x5=1,0
x6=-1,0 x5=1
x6=1,0 x6=-1,0
x7=-1
-1.27 -7.55-4.11
-5.59-3.64
-1.312.12 -2.24
x3=1
x3=0
x3=-1,0
x2=0 x2=0
5.68 8.05
6 3.78 2.94 1.28 3.62 1.21
x4=-1,0
x3=-1 x3=-1
0.134
Figure 5.1: Decision tree for data generated according to Equations (5.3) and (5.4) in the text.
Table 5.2 shows the eight globally most important rules in a rule ensemble model t to the
same data. For every rule, its importance, coefcient, and support is listed. Since the rules are binary
70 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
functions, the coefcient values, a
m
in Equation (5.2), represent the change in predicted value if the
rule is satised (or res). Support refers to the fraction of the data to which the rule applies. Rule
importance is based on the coefcient magnitude and support of the rule (dened in Section 5.2.2).
Thus, rule 1 applies to 30% of the data and is interpreted as cases for which x
1
= 1 and x
2
is either
0 or 1, tend to have higher response (given the magnitude and sign of its coefcient). The rst rule
suggests x
1
is interacting with x
2
, the second rule suggests x
1
is interacting with x
5
, and so on.
Table 5.2: Rule ensemble for the data generated according to Equa-
tions (5.3) and (5.4) in the text.
Importance Coefficient Support Rule
100 1.80 0.30 x
1
=1 and x
2
{0,1}
95 -2.25 0.14 x
1
=-1and x
5
{-1}
83 -1.59 0.24 x
1
=1 and x
2
{-1,0} and x
3
{-1,0}
71 -1.00 0.35 x
1
=-1and x
6
{-1}
62 1.37 0.17 x
1
=1 and x
2
{0,1} and x
3
{0,1}
57 1.25 0.35 x
1
=-1and x
6
{-1,0}
46 1.00 0.11 x
1
=-1and x
5
{1}and x
7
{0,1}
42 0.91 0.51 x
1
=1 and x
4
{1}
Figure 5.2(a) shows the variable importance score computed using the R package rule-
t (RuleFit). The importance score ranks all the input variables according to the strength of their
effect on the model. As expected, the most important variable is x
1
, and it is followed by x
2
and x
5
with roughly the same relevance. Then, x
3
and x
6
, followed by x
4
and x
7
. Finally, x
8
, x
9
, and x
10
,
which dont play a role in the target, appear last. Its interesting to note that, while none of these
variables have any impact on the single tree in Figure 5.1, they have non-zero importance, as can be
seen in the graph. This means the variables do show up in some rules in the ensemble of trees but
the corresponding coefcients are very small.
Figure 5.3(b) shows the interactionfor eachvariable. It indicates whichvariables are likely to be
involved in interactions with others. The red bars correspond to the reference, or null, distribution
values corresponding to the hypothesis of no interaction effects. Thus, the height of each yellow bar
reects the value of the interaction statistic in excess of its expected value.
The reference distributions are computed using a bootstrap method. The idea is to generate
random data sets that are similar to the original training data, repeatedly compute the interaction
statistic on these articial data, and compare the resulting values with those obtained from the
original data (Friedman and Popescu, 2005).
One sees that x
1
has been identied as the variable interacting the strongest (most involved
in interactions) then x
2
and x
5
, followed by x
3
and x
6
to a lesser extent. Going back to the target
function denition, Equation (5.4), and examining the magnitude of the different coefcients there,
5.2. INTERPRETATION 71
R
e
l
a
t
i
v
e
i
m
p
o
r
t
a
n
c
e
Input variable
0
20
40
60
80
100
X
1
X
5
X
2
X
6
X
3
X
7
X
4
X
1
0
X
9
X
8
N
u
l
l
a
d
j
u
s
t
e
d
i
n
t
e
r
a
c
t
i
o
n
s
t
r
e
n
g
t
h
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
X
1
X
2
X
3
X
4
X
5
X
6
X
7
X
1
0
X
9
X
8
Variable
(a) (b)
Figure 5.2: Variable Importance and Interaction Statistic for the data generated according to Equa-
tions (5.3) and (5.4).
N
u
l
l
a
d
j
u
s
t
e
d
i
n
t
e
r
a
c
t
i
o
n
s
t
r
e
n
g
t
h
w
i
t
h
X
5
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
X
1
X
2
X
3
X
4
X
6
X
7
X
1
0
X
9
X
8
Variable
N
u
l
l
a
d
j
u
s
t
e
d
i
n
t
e
r
a
c
t
i
o
n
s
t
r
e
n
g
t
h
w
i
t
h
X
1
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
X
2
X
3
X
4
X
5
X
6
X
7
X
1
0
X
9
X
8
Variable
Figure 5.3: Two-variable Interaction Statistic for the data generated according to Equations (5.3)
and (5.4).
72 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
one sees a good match with this ranking. The participation of x
4
and x
7
in the interactions, with x
1
in the target, is not detectedi.e., it is measured to be too weak. Having identied which variables
are involved in interactions, the next question in the model interpretation process is nding those
variables with which they are interacting. The two-variable interaction statistic is intended to answer
this question.
Figure 5.3 shows the two-variable interaction statistics (X
5
, ) and (X
1
, ). It is seen that
x
5
has been identied to be only interacting with x
1
, which corresponds with how we constructed
the data. And with which variables is x
1
interacting? We see the (x
1
, x
2
) interaction as strong as the
(x
1
, x
5
) one, then, the equal strength pairs (x
1
, x
3
) and (x
1
, x
6
) and, nally, the weaker interactions
with x
4
and x
7
, and no interaction with the remaining variables.
Lastly, lets examine the partial dependence plots which allow us to understand how the
response variable changes as a function of the most important variables. For instance, the detailed
nature of the x
1
interactionwithx
5
canbe further exploredwiththe correspondingpartial dependence
plot (see Figure 5.4 here translated to have a minimum value of zero). In the absence of an
interaction between these variables, all x
5
partial dependence distributions, conditioned on different
values of x
1
, would be the same. Here, one sees that when x
1
= 1, x
5
has no effect on the response.
The distribution for x
1
= 1 is very different and captures the essence of the interaction effect
between these two variables: response increases as x
5
varies from 1 to 1.
8
6
4
2
0
P
a
r
t
i
a
l
d
e
p
e
n
d
e
n
c
e
-
1 0 1
X5
-
1
0 1
X5
8
6
4
2
0
P
a
r
t
i
a
l
d
e
p
e
n
d
e
n
c
e
X1 = -1 X1 = 1
Figure 5.4: Two-variable Partial Dependence example for the simulated data generated according to
Equations (5.3) and (5.4).
5.2. INTERPRETATION 73
5.2.2 VARIABLEIMPORTANCE
How are variable importance scores such as those shown in Figure 5.2 computed? The single tree
variable importance measure is given by the following expression (Breiman et al., 1993):
(x
l
; T ) =
t T
I
_
v(t ), s
v(t )
_
I (v(t ) = l) (5.5)
which is simply the sum of the goodness of split scores
I (see Section 2.1) whenever x
l
is used
in the split denition of an internal node t in the tree T . Since only the relative magnitudes of the
(x
l
) are interesting, often the following normalized measure is reported instead:
Imp
_
x
j
_
= 100
_
x
j
_
/ max
1 l n
(x
l
)
For a tree ensemble, Equation (5.5) can be easily generalized to an average across all the trees
in the ensemble:
_
x
j
_
=
1
M
M
m=1
_
x
j
; T
m
_
In the case of a rule ensemble, which is a linear model of rules and linear terms, its possible
to dene global and local (term) importance scores:
- Global term importance: as generally used in linear models, dened as the absolute value of
the coefcient of the standardized predictor:
Rule term:
k
=
a
k
s
k
(1 s
k
)
where s
k
denotes the support of the rule; i.e., s
k
=
1
N
N
i=1
r
k
(x
i
).
Linear term:
j
=
b
j
std(x
j
).
- Local term importance: dened at each point x as the absolute change in
F(x) when the term
is removed from the ensemble:
Rule term:
k
(x) =
a
k
|r
k
(x) s
k
|.
Linear term:
j
(x) =
b
j
x
j
x
j
.
Note that a term has an importance score at a particular point x even if the rule doesnt re, or
the particular predictor x
j
= 0, for the given case. The two local term importance scores can
be combined to arrive at a local variable importance score:
(x
l
; x) =
l
(x) +
x
l
r
k
k
(x)
_
size(r
k
)
which can be averaged over any subset of the input space to obtain a regional importance
score. For example, the variable ranking could be computed for the region where all xs are
positive. It can be shown that, if the variable rankings (x
l
; x) are averaged over the entire
input space, the global metrics are recovered (Friedman and Popescu, 2004).
74 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
5.2.3 PARTIAL DEPENDENCES
The purpose of the partial dependence plots, such as the one shown in Figure 5.4, is to visualize the
effect on
F(x) of a small subset of the input variables x
S
{x
1
, x
2
, . . . , x
n
} e.g., a subset that has
been deemed interesting based on the variable importance score.
Since
F(x) also depends on the other (complement) variables x
C
i.e.,
F(x) =
F(x
S
, x
C
)
and x
S
x
C
= {x
1
, x
2
, . . . , x
n
}, such visualization is only possible after accounting for the (average)
effects of x
C
. Thus, the partial dependence on x
S
is dened by (Hastie et al., 2001):
F
S
(x
S
) = E
X
C
F(x
S
, x
C
)
which is approximated by:
F
S
(x
S
) =
1
N
N
i=1
F(x
S
, x
iC
)
where {x
1C
, . . . , x
NC
} are the values of x
C
occurring in the training data.
5.2.4 INTERACTIONSTATISTIC
Howare interaction scores such as those shown in Figure 5.3 computed? If x
j
and x
k
do not interact,
then
F(x) can be expressed as the sum of two functions:
F(x) = f
\j
(x
\j
) + f
\k
(x
\k
)
with f
\j
(x
\j
) not depending on x
j
and f
\k
(x
\k
) not depending on x
k
. Thus, the partial dependence
of
F(x) on x
S
=
_
x
j
, x
k
_
can also be decomposed as the sum of two functions:
F
j,k
(x
j
, x
k
) =
F
j
(x
j
) +
F
k
(x
k
) (5.6)
namely, the sum of the respective partial dependencies. This immediately suggests a way to test for
the presence of an (x
j
, x
k
) interaction: check whether the equality of Equation (5.6) holds. More
formally, the two-variable interaction statistic is dened by (Friedman and Popescu, 2005):
H
2
jk
=
N
i=1
_
F
j,k
(x
ij
, x
ik
)
F
j
(x
ij
)
F
k
(x
ik
)
_
2
_
N
i=1
F
2
j,k
(x
ij
, x
ik
)
with H
2
jk
0 indicating
F
j,k
(x
j
, x
k
)
F
j
(x
j
) +
F
k
(x
k
) and thus no interaction between x
j
and
x
k
. Likewise, H
2
jk
>> 0 would indicate the presence of an interaction.
Similarly, if x
j
does not interact with any other variable, then
F(x) can be expressed as the
sum:
F(x) = F
j
(x
j
) + F
\j
(x
\j
)
5.3. MANUFACTURINGDATAEXAMPLE 75
with F
j
(x
j
) being a function only of x
j
and F
\j
(x
\j
) a function of all variables except x
j
. Thus, a
single-variable interaction statistic to test whether x
j
interacts with any other variable can be dened
by (Friedman and Popescu, 2005):
H
2
j
=
N
i=1
_
F(x
i
)
F
j
(x
ij
)
F
\j
(x
i\j
)
_
2
_
N
i=1
F
2
(x
i
) .
5.3 MANUFACTURINGDATAEXAMPLE
In this section, we present the application of rule ensembles, and the associated interpretation
methodology discussed above, to an analysis of semiconductor manufacturing data (Seni et al., 2007).
The data consists of 2045 observations, each described by 680 input variables. Observations corre-
spond to wafers, and the variables encode manufacturing attributes such as the machine name and
time of various processing steps. There were missing values present in the data. The response variable
corresponds to yield (the number of die in a wafer that pass several electrical tests) which ranges
from 60% to 96% in this data. The goal is to build a model that can help establish if certain machines
(or combinations of them) at a certain time, and/or at certain process steps, cause low yield.
Table 5.3 summarizes the estimated test error results using three models: main effects only
(i.e., using single-variable rules in the ensemble), a single tree, and an rule ensemble with interactions
allowed. For the single tree, the tree size is given. For the ensemble model, the number of terms
(rules and linear) with nonzero coefcients is shown.
Table 5.3: Average absolute prediction error (estimated
via cross-validation) for different models applied to a
semiconductor data set.
Error Size
Main effects only model 0.91
Single tree 0.44 32 terminal nodes
Model with interactions 0.35 220 terms
The average absolute prediction error for the ensemble model is signicantly lower than the
corresponding error for an additive model restricted to main effects only. Thus, there is good evidence
for interaction effects being present. The single tree model had 32 terminal nodes and a depth of 7
i.e., it is not so easy to read. In this case, the ensemble model improves the single-trees error by
20% (.44 to .35).
Table 5.4 shows the three globally most important rules in the ensemble model. Figure 5.5(a)
shows the relative importance of the ten most important input variables (out of the 680 given ones),
as averaged over all predictions. Variable PE.3880.4350.ILTM444, which is also present in the top
rules, gures prominently. An analysis (not shown) of the partial dependence of yield on this variable,
76 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
reveals that wafers using YEQUIP707, or YEQUIP709, at step 3880.4350 tend to have noticeably
lower yield.
Table 5.4: Top three rules based on global importance for the ensemble
model for the yield data.
Imp Coef Supp Rule
100 1.58 0.25 PE.1550.1000.LTH203 {LEQUIP754} &
PE.3560.4720.ILT112 {YEQUIP704, YEQUIP706,
YEQUIP710, YEQUIP711} &
PE.3880.4350.ILTM444 {YEQUIP702, YEQUIP706,
YEQUIP707, YEQUIP709} &
TI.1850.1805.WETETCH13 d 38620
89 -1.37 0.27 PE.3300.0400.WETCFF21 {SEQUIP702} &
PE.3670.4200.CMP552 {PEQUIP702} &
TI.3230.2115.INS711 t 38620
71 -1.09 0.29 PE.2100.1175.ION621 {IEQUIP703} &
PE.2450.1040.WETS23 {CEQUIP704} &
PE.3970.4200.CMP554 {PEQUIP706, PEQUIP707}
The analysis of interaction effects can now be focused on the smaller set of variables deemed
most relevant. The yellow bars in Figure 5.5(b) show the values of the statistic used to test whether
a specied variable interacts with any other variable. The red bars correspond to the reference (null)
distribution values. Thus, the height of each yellow bar reects the value of the interaction statistic
in excess of its expected value under the null hypothesis of no interaction effects.
Although the strengths of the interaction effects shown in Figure 5.5(b) are not large, at
least two of the fteen most inuential variables appear to be involved in interactions with other
variables. After identifying those variables that interact with others e.g., PE.2100.1175.ION621
and PE.2550.1000.LTH233 above, we need to determine the particular other variables with which
they are interacting.
The values of the two-variable interaction strength statistic for PE.2550.1000.LTH233
are shown in Figure 5.6. Here, see that PE.2550.1000.LTH233 dominantly interacts with
PE.2997.0100.WETC755.
The detailed nature of this interaction can be further explored with the corresponding par-
tial dependence plot (see Figure 5.7). In the absence of an interaction between these variables,
all PE.2550.1000.LTH233 partial dependence distributions, conditioned on different values of
PE.2997.0100.WETC755, would be the same.
Here we see similar distributions when PE.2997.0100.WETC755 takes the values DE-
QUIP701 (Figure 5.7(a)) and DEQUIP702 (Figure 5.7(b)) (with one PE.2550.1000.LTH233 value
not represented in one case). The distribution for PE.2997.0100.WETC755 = DEQUIP703 (Fig-
ure 5.7(c)) is fairly different from the others, suggesting it captures the essence of the interaction
effect between these two variables. (Yield is lower throughout in this case.) The impact of this in-
teraction can be further visualized using wafer map plots: Figure 5.8(a) shows average yield for the
5.3. MANUFACTURINGDATAEXAMPLE 77
R
e
l
a
t
i
v
e
I
m
p
o
r
t
a
n
c
e
0
20
40
60
80
100
I
n
t
e
r
a
c
t
i
o
n
S
t
r
e
n
g
t
h
0.00
0.01
0.02
0.03
0.04
0.05
0.06
Variable
(a) (b)
Variable
Figure 5.5: Interpretation statistics for the yield data. In (a), input variable relative (global) importance.
In (b), total interaction strengths in excess of expected null value for the top 15 input variables.
I
n
t
e
r
a
c
t
i
o
n
S
t
r
e
n
g
t
h
0.00
0.15
0.05
0.10
variable
Figure 5.6: Two-variable interaction strengths of variables interacting with
PE.2550.1000.LTH233.
78 5. RULEENSEMBLES ANDINTERPRETATIONSTATISTICS
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
1.5
1.0
0.5
0.0
PE.2997.0100.WETC755 = DEQUIP701
PE.2550.1000.LTH233
PE.2997.0100.WETC755 = DEQUIP702
PE.2550.1000.LTH233 PE.2550.1000.LTH233
PE.2997.0100.WETC755 = DEQUIP703
(a) (b) (c)
L
E
Q
U
I
P
7
1
4
L
E
Q
U
I
P
7
0
3
L
E
Q
U
I
P
7
0
1
L
E
Q
U
I
P
7
5
2
L
E
Q
U
I
P
7
5
3
L
E
Q
U
I
P
7
5
4
L
E
Q
U
I
P
7
5
5
L
E
Q
U
I
P
7
1
4
L
E
Q
U
I
P
7
0
3
L
E
Q
U
I
P
7
0
1
L
E
Q
U
I
P
7
5
2
L
E
Q
U
I
P
7
5
3
L
E
Q
U
I
P
7
5
5
L
E
Q
U
I
P
7
1
4
L
E
Q
U
I
P
7
0
3
L
E
Q
U
I
P
7
5
2
L
E
Q
U
I
P
7
5
3
Figure 5.7: Joint partial dependence on variables PE.2550.1000.LTH233 (found earlier to be a relevant
input) and PE.2997.0100.WETC755.
wafers using LEQUIP701 at step 2550.1000 or using DEQUIP703 at step 2997.0100. Figure 5.8(b)
shows average yield for the complement set of wafers. A 7.5% yield loss is observed.
5.4 SUMMARY
Ensemble methods in general, and boosted decision trees in particular, constitute one of the most
important advances in machine learning in recent years. In the absence of detailed a priori knowledge
of the problem at hand, they provide superior performance. A number of interpretation tools have
been developed that, while applicable to other algorithms, are often easiest to compute for trees.
With the introduction of rule ensembles, the interpretability of the ensemble model has been further
signicantly improved. Together, they provide a powerful way to solve a variety of regression and
classication problems.
5.4. SUMMARY 79
Yield: 81.3% #Wafers: 373 Yield: 88.7% #Wafers: 1672
Figure 5.8: Wafer map representation of yield. Minimum values are represented by darker colored die,
graduating to lighter die for higher values. In (a), average yield for the 373 wafers using LEQUIP701 at
step 2550.1000 or using DEQUIP703 at step 2997.0100. In (b), average yield for the remaining 1672
wafers.
81
C H A P T E R 6
Ensemble Complexity
Ensemble
1
models appear very complex, yet we have seen how they can strongly outperform their
component models on new data. This seems to violate Occams Razor the useful and widespread
analytic doctrine that can be stated when accuracy of two hypotheses is similar, prefer the simpler.
We argue that the problem is really that complexity has traditionally been measured incorrectly.
Instead of counting parameters to assess the complexity of a modeling process (as with linear regres-
sion), we need to instead measure its exibility as by Generalized Degrees of Freedom, GDF (Ye, J.,
1998). By measuring complexity according to a models behavior rather than its appearance, the util-
ity of Occams Razor is restored. Well demonstrate this on a two-dimensional decision tree example
where the whole (an ensemble of trees) has less GDF complexity than any of its parts.
6.1 COMPLEXITY
One criticism of ensembles is that interpretation is much harder than with a single model. For
example, decision trees have properties so attractive that, second to linear regression (LR), they
are the modeling method most widely employed, despite having the worst accuracy of the major
algorithms. Bundling trees into an ensemble makes themcompetitive on this crucial property, though
at a serious loss in interpretability. (The Rule Ensembles and companion interpretation methodology
of Chapter 5 make ensembles much more interpretable, but some loyal tree users will still prefer the
simplicity of a single tree.) Note that an ensemble of trees can itself be represented as a tree, as it
produces a piecewise constant response surface
2
. But the tree equivalent to an ensemble can have
vastly more nodes than the component trees; for example, a bag of M stumps (single-split binary
trees) requires up to 2
M
leaves to be represented by a single tree (when all the splits are on different
variables).
Indeed, Bumping (Tibshirani and Knight, 1999a) was designed to get some of the benet
of bagging without requiring multiple models, in order to retain some interpretability. It builds
competing models frombootstrapped datasets and keeps only the one with least error on the original
data. This typically outperforms, on new data, a model built simply on the original data, likely due
to a bumped model being robust enough to do well on two related, but different datasets. But, the
expected improvement is greater with ensembles.
1
This chapter is based on Elder, J. (2003). The Generalization Paradox of Ensembles, Journal of Computational and Graphical
Statistics 12, No. 4: 853-864.
2
The decision boundary induced by a weighted sum of trees, as from bagging or boosting, is piecewise constant and so can be
represented by a single tree. To build it, generate training data from a ne grid of input points and run it through the ensemble
to generate the target variable from the model, then t a tree perfectly to the resulting data.
82 6. ENSEMBLECOMPLEXITY
Another criticism of ensembles more serious to those for whom an incremental increase
in accuracy is worth a multiplied decrease in interpretability is the concern that their increased
complexity will lead to overt, that is, inaccuracy on new data. In fact, not observing ensemble
overt in practical applications has helped throw into doubt, for many, the Occams Razor axiom
that generalization is hurt by complexity. (This and other critiques of the axiom are argued in an
award-winning paper by Domingos, P. (1998).)
But, are ensembles truly complex? They appear so; but do they act so? The key question is how
we should measure complexity. For LR, one can merely count coefcients (to measure the degrees
of freedom of the data used up by the process of parameter estimation), yet this is known to fail for
nonlinear models. It is possible for a single parameter in a nonlinear model to have the inuence of
less than a single linear parameter, or greater than several e.g., it is estimated that each parameter
in Multivariate Adaptive Regression Splines uses three effective degrees of freedom (Friedman, J.,
1991; Owen, A., 1991). The under-linear case can occur with say, a neural network that hasnt
trained long enough to pull all its weights into play. The over-linear case is more widely known. For
example, Friedman and Silverman (1989) note the following: The results of Hastie and Tibshirani
(1985), together with those of Hinkley, D. (1969, 1970) and Feder, P. (1975), indicate that the
number of degrees of freedom associated with nonlinear least squares regression can be considerably
more than the number of parameters involved in the t.
The number of parameters and their degree of optimization is not all that contributes to a
models complexity or its potential for overt. The model form alone doesnt reveal the extent of the
search for structure. For example, the winning model for the 2001 Knowledge Discovery and Data
Mining (KDD) Cup settled on a linear model employing only three variables. This appears simple,
by denition. Yet, the data had 140,000 candidate variables, constrained by only 2,000 cases. Given
a large enough ratio of unique candidate variables to cases, even a blind search will very likely nd
some variables that look explanatory when there is no true relationship. As Hjorth, U. (1989) warned,
the evaluation of a selected model can not be based on that model alone, but requires information
about the class of models and the selection procedure. We thus need to employ model selection
metrics that include the effect of model selection! (Note that the valuable tool of cross-validation
(CV) can be used effectively, but only if all steps in the modeling process are automated and included
inside the CV loop.)
There is a growing realization that complexity should be measured not just for a model, but for
an entire modeling procedure, and that it is closely related to that procedures exibility. For example,
the Covariance Ination Criterion (Tibshirani and Knight, 1999b) ts a model and saves the output
estimates, y
i
, then randomly shufes the output variable, y, re-runs the modeling procedure, and
measures the covariance between the new and old estimates. The greater the change (adaptation
to randomness, or exibility) the greater the complexity penalty needed to restrain the model from
overt (see Section 3.3). Somewhat more simply, Generalized Degrees of Freedom, GDF (Ye, J.,
1998) adds random noise to the output variable, re-runs the modeling procedure, and measures the
6.2. GENERALIZEDDEGREES OF FREEDOM 83
changes to the estimates. Again, the more a modeling procedure adapts to match the added noise
the more exible (and therefore more complex) its model is deemed to be.
The key step in both a randomized loop around a modeling procedure is reminiscent of the
Regression Analysis Tool (Farway, J., 1991), which measured, through resampling, the robustness
of results from multi-step automated modeling. Whereas at that time sufcient resamples of a two-
second procedure took two days, increases in computing power have made such empirical measures
much more practical.
6.2 GENERALIZEDDEGREES OF FREEDOM
For LR, the degrees of freedom, K, equal the number of coefcients, though this does not extrapolate
well to nonlinear regression. But, there exists another denition that does:
GDF(F, D) =
N
i=1
y
i
_
y
i
where
3
y
i
= y
e
i
y
i
, and y
i
= y
e
i
y
i
y
e
i
= y
i
+
i
with
i
N(0,
2
)
y
i
= F
y
(x
i
) is the output of model F trained on data D = {y
i
, x
i
}
N
1
y
e
i
= F
y
e
(x
i
) is the output of model F trained on data D
e
= {y
e
i
, x
i
}
N
1
(6.1)
GDFis thus dened to be the sumof the average sensitivity of each tted value, y
i
, to perturbations in
its corresponding target, y
i
. As it may be calculated experimentally for all algorithms, GDFprovides a
valuable way to measure and compare the exibility (and parameter power) of any modeling methods.
In practice, Ye suggests generating a table of perturbation sensitivities, then employing a
horizontal method of calculating GDF, as diagrammed in Figure 6.1. In that table, the rows
correspond to the observations (cases), the columns correspond to the randomly-perturbed replicant
data sets D
e
, and each cell holds the perturbed output, y
e
, and its estimate, y
e
, for one case and
sample. A cases sensitivity (of its estimate to its perturbing noise), m
i
, is estimated by tting a LR
to y
i
vs. y
i
using the row of data for case i. (Since y
i
and y
i
are constant, the LR simplies to
be y
e
i
vs. y
e
i
.) The GDF is then the sum of these slopes, m
i
. This horizontal estimate seems to
be more robust than that obtained by the vertical method of averaging the value obtained for each
column of data (i.e., the GDF estimate for each perturbation dataset).
6.3 EXAMPLES: DECISIONTREESURFACEWITHNOISE
We take as a starting point for our tests the two-dimensional piecewise constant surface used to
introduce GDF (Ye, J., 1998), shown in Figure 6.2.
3
We enjoyed naming the perturbed output, (y
e
= y + ) after GDFs inventor, Ye.
84 6. ENSEMBLECOMPLEXITY
y = 0.2502x 0.8218
-1.80
-1.60
-1.40
-1.20
-1.00
-0.80
-0.60
-0.60 -0.80 -1.00 -1.20 -1.40 -1.60 -1.80
e
( y
e
,
e
)
e
y
e
y
e
y
X
+
Figure 6.1: Diagram of GDF computation process.
It is generated by (and so it can be perfectly t by) a decision tree with ve terminal (leaf )
nodes (i.e., four splits), whose smallest structural change (level change) is 0.5. Figure 6.3 illustrates the
surface after Gaussian noise N(0, 0.5) has been added, and Figure 6.4 shows a sample made of 100
random observations of that space. This tree+noise data is the dataset, D
e
= {y
e
i
, x
i
}
100
1
, employed
for the experiments. For GDF perturbations, we employed 50 such random samples (replicates),
D
1
e
, . . . , D
50
e
, where each added to y Gaussian noise, N(0, 0.25),having half the standard deviation
of the noise already in the training data
4
.
Figure 6.5 shows the GDF vs. K (number of parameters) sequence for LR models, single
trees, and ensembles of ve trees (and two more sequences described below). Conrming theory, the
GDF for the LR models closely matches the number of terms, K. For single decision trees (Tree 2d)
of different sizes, K (maximum number of splits), the GDF grew at about 3.67 times the rate of K.
Bagging (See Section 4.3) ve trees together (Bagged Tree 2d), the rate of complexity growth was
4
Where the degree of noise in a dataset can be estimated, it is a rule of thumb for the perturbation magnitude to be half as large.
6.3. EXAMPLES: DECISIONTREESURFACEWITHNOISE 85
-
2
-
1
0
1
2
3
Figure 6.2: (Noiseless version of ) two-dimensional tree surface used in experiments (after Ye, J. (1998)).
-
4
-
2
0
2
4
6
Figure 6.3: Tree surface of Figure 6.2 after adding N(0, 0.5) noise.
86 6. ENSEMBLECOMPLEXITY
Figure 6.4: Sample of 100 observations from Figure 6.3 (dotted lines connect points to zero plane).
0
5
10
15
20
25
30
35
40
45
1 2 3 4 5 6 7 8 9
Parameters,
Tree 2d
Bagged Tree 2d
Tree 2d+8 noise
Bagged Tree 2d+8 noise
Linear Regression 1d
3.67
Slope = 3.05
4.15
4.96
G
D
F
k
Figure 6.5: GDF vs. splits, K, for ve models using from one to nine parameters (splits) for the data of
Figure 6.4.
6.3. EXAMPLES: DECISIONTREESURFACEWITHNOISE 87
3.05. Surprisingly perhaps, the bagged trees of a given size, K, are about a fth simpler, by GDF,
than each of their components! (The other two sequences are explained below.)
Figure 6.6 illustrates two of the surfaces in the sequence of bagged trees. Bagging ve trees
limitedtofour leaf nodes (three splits) eachproduces the estimationsurface of Figure 6.6(a). Allowing
eight leaves (seven splits) produces that of Figure 6.6(b). The bag of more complex trees creates a
surface with ner detail most of which here does not relate to actual structure in the underlying
data-generating function, as the tree is more complex than needed. For both bags, the surface has
gentler stair-steps than those of a lone tree, revealing how bagging trees can improve their ability to
estimate smooth functions.
-
3
-
2
-
1
0
2
z
1
-
3
-
2
-
1
0
2
1
-
3
-
2
-
1
0
2
1
3
(a) (b)
Figure 6.6: Surface of bag of ve trees using (a) three splits, and (b) seven splits.
Expanding the experiment (after Ye, J. (1998), we appended eight random candidate input
variables, x
3
, . . . , x
10
, to x, to introduce selection noise, and re-ran the sequence of individual and
bagged trees. Figures 6.7(a) and 6.7(b) illustrate two of the resulting bagged surfaces (projected
onto the space of the two relevant inputs), again for component trees with three and seven splits,
respectively. The structure in the data is clear enough for the under-complex model to avoid using the
random inputs, but the over-complex model picks some up. Figure 6.5 shows the GDF progression
for the individual and bagged trees with ten candidate inputs (Tree 2d+ 8 noise, and Bagged tree 2d+ 8
noise, respectively). Note that the complexity slope for the bagged tree ensemble (4.15) is again less
than that for its components (4.96). Note also that the complexity for each ten-input experiment is
greater than its corresponding two-input one. Thus, even though one cannot tell by looking at a
nal model using only the relevant inputs x
1
and x
2
that random variables were considered, their
presence increases the chance for overt, and this is appropriately reected in the GDF measure of
complexity.
88 6. ENSEMBLECOMPLEXITY
-
3
-
2
-
1
0
2
3
1
(a) (b)
Figure 6.7: Surface of bag of ve trees t to a ten-dimensional input data with 8 irrelevant (noise) inputs.
In (a), three split trees were used. In (b), seven split trees were used. Surface is drawn on the plane of the
two relevant inputs.
6.4 RCODEFORGDF ANDEXAMPLE
Table 6.1 displays code for generating the two-dimensional data of Figure 6.3.
To generate the data we simply do:
data2d.100 <- gen2dData(N=100, noise.seed=321, noise.sd=0.5)
and to compute GDF for a single decision tree:
gdf <- GDF(data2d.100, modelTrainer=treeModelTrainer, replicates=50,
noise.seed=321, noise.sd=0.25)
where a decision tree is built by:
treeModelTrainer <- function(data, K) {
library(rpart)
tree <- rpart(y . , data=data,control=
rpart.control(minsplit=2, cp=0.0001))
## Prune tree back to desired number of splits
i.alpha <- which.min(abs(tree$cptable[,"nsplit"] - K))
alpha.K <- tree$cptable[i.alpha,"CP"]
tree.p <- prune(tree, cp = alpha.K)
return(tree.p)
}
6.5. SUMMARY ANDDISCUSSION 89
Table 6.1: Sample R code for generating the 2-input
data of Figure 6.3.
gen2dData <- function(N, noise.seed, noise.sd) {
set.seed(noise.seed)
x1 <- runif(N, 0, 1)
x2 <- runif(N, 0, 1)
y <- rep(NA, N)
for (i in 1:N) {
if ( x1[i] > .6 ) {
if ( x2[i] > .8 ) {
y[i] <- rnorm(1, 2.5, noise.sd)
} else {
y[i] <- rnorm(1, 2, noise.sd)
}
} else {
if ( x2[i] < .3 ) {
y[i] <- rnorm(1, 0, noise.sd)
} else {
if ( x1[i] > .3 ) {
y[i] <- rnorm(1, -1, noise.sd)
} else {
y[i] <- rnorm(1, -2, noise.sd)
}
}
}
}
return(data.frame(x1, x2, y))
}
6.5 SUMMARY ANDDISCUSSION
Bundling competing models into ensembles almost always improves generalization and using
different algorithms as the perturbation operator is an effective way to obtain the requisite diversity
of components. Ensembles appear to increase complexity, as they have many more parameters than
their components; so, their ability to generalize better seems to violate the preference for simplicity
summarized by Occams Razor. Yet, if we employ GDF an empirical measure of the exibility
of a modeling process to measure complexity, we nd that ensembles can be simpler than their
components. We argue that when complexity is thereby more properly measured, Occams Razor is
restored.
Under GDF, the more a modeling process can match an arbitrary change made to its output,
the more complex it is. The measure agrees with linear theory, but can also fairly compare very
different, multi-stage modeling processes. In our tree experiments, GDF increased in the presence
of distracting input variables, and with parameter power (i.e., decision trees use up more degrees of
freedom per parameter than does LR), GDF is expected to also increase with search thoroughness,
and to decrease with use of Bayesian parameter priors, or parameter shrinkage, or when the structure
in the data is clear relative to the noise. Additional observations (constraints) may affect GDF either
way.
90 6. ENSEMBLECOMPLEXITY
Table 6.2: Sample R code to compute Generalized Degrees of Freedom (GDF) for a
given data set and model. Argument modelTrainer is a function which can be invoked
to train a model of some desired type; K stands for number of terms, number of splits, etc.,
in the model. Argument n.rep is the number of times to replicate (perturb) the data. The
last two arguments control the noise added in the perturbations.
GDF <- function(data, modelTrainer, K, n.rep, noise.seed, noise.sd) {
N <- nrow(data)
## Create the noise to be added over the data replicates
set.seed(noise.seed)
perturbations <- matrix(rnorm(N * n.rep, 0, noise.sd), nrow=N)
ye.mat <- matrix(data$y, nrow=N, ncol=n.rep, byrow=F) + perturbations
## Train a model on input data; store yHat
base_model <- modelTrainer(data, K)
yHat <- predict(base_model, data)
yeHat.mat <- matrix(NA, nrow=N, ncol=n.rep)
data_perturbed <- data
for (i in 1:n.rep) {
data_perturbed$y <- ye.mat[,i]
## Train a model on perturbed data; evaluate on input x
model_perturbed <- modelTrainer(data_perturbed, K)
yeHat.mat[,i] <- predict(model_perturbed, data)
}
GDF.m <- c(NA, N)
for (i in 1:N) {
lmodel <- lm(yeHat.mat[i,] - yHat[i] ~ perturbations[i,])
GDF.m[i] <- lmodel$coefficients[2]
}
GDF <- sum(GDF.m)
return(GDF)
}
Lastly, case-wise (horizontal) computation of GDF has an interesting by-product: a measure
of the complexity contribution of each case. Figures 6.8(a) and 6.8(b) illustrates these contributions
for two of the single-tree models of Figure 6.5 (having three and seven splits, respectively). The
under-t tree results of Figure 6.8(a) reveal only a few observations to be complex, that is, to lead to
changes in the models estimates when perturbed by random noise. (Contrastingly, the complexity
is more diffuse for the results of the overt tree in Figure 6.8(b).) A future modeling algorithm
could recursively seek such complexity contribution outliers and focus its attention on the local model
structure necessary to reduce them, without increasing model detail in regions which are stable.
6.5. SUMMARY ANDDISCUSSION 91
Figure 6.8: (a): Complexity contribution of each sample for bag of ve trees using three splits.
Figure 6.8: (b): Complexity contribution of each sample for bag of ve trees using seven splits.
93
A P P E N D I X A
AdaBoost Equivalence to FSF
Procedure
In this appendix we show that the AdaBoost algorithm presented in Table 4.6 is equivalent to the
Forward Stagewise Fitting (FSF) procedure of Table 4.5. At every iteration of the algorithm, we
need to solve:
(c
m
, p
m
) = arg min
c, p
N
i=1
L
_
y
i
, F
m1
(x
i
) + c T (x
i
; p)
_
and since L(y, y) = exp(y y) , we can write
(c
m
, p
m
) = arg min
c, p
N
i=1
exp
_
y
i
F
m1
(x
i
) c y
i
T (x
i
; p)
_
= arg min
c, p
N
i=1
w
(m)
i
exp
_
c y
i
T (x
i
; p)
_
(A.1)
where w
(m)
i
= e
y
i
F
m1
(x
i
)
. Since w
(m)
i
doesnt depend on c or p, it can be regarded as an observation
weight. A solution to Equation (A.1) can be obtained in two steps:
- Step 1: given c, solve for T
m
= T (x; p
m
):
T
m
= arg min
T
N
i=1
w
(m)
i
exp (c y
i
T (x
i
)) (A.2)
- Step 2: given T
m
, solve for c
c
m
= arg min
c
N
i=1
w
(m)
i
exp (c y
i
T
m
(x
i
)) (A.3)
We need to show that solving Step 1 above is equivalent to line a. in the AdaBoost algorithm,
and that the solution to Step 2 is equivalent to line c. in the same algorithm (Table 4.5). That is, we
94 A. ADABOOSTEQUIVALENCETOFSF PROCEDURE
need to show that T
m
is the classier that minimizes the weighted error rate. We start by expanding
Equation (A.2) above:
T
m
=arg min
T
N
i=1
w
(m)
i
exp (c y
i
T (x
i
))=arg min
T
e
c
y
i
=T (x
i
)
w
(m)
i
+ e
c
y
i
=T (x
i
)
w
(m)
i
e
c
y
i
=T (x
i
)
w
(m)
i
+ e
c
y
i
=T (x
i
)
w
(m)
i
= arg min
T
e
c
i=1
w
(m)
i
e
c
y
i
=T (x
i
)
w
(m)
i
+ e
c
y
i
=T (x
i
)
w
(m)
i
= arg min
T
e
c
i=1
w
(m)
i
+ (e
c
e
c
)
y
i
=T (x
i
)
w
(m)
i
= arg min
T
_
_
e
c
e
c
_
i=1
w
(m)
i
I (y
i
= T (x
i
)) + e
c
i=1
w
(m)
i
_
_
e
c
e
c
_
is constant and e
c
i=1
w
(m)
i
doesnt depend on T , thus the last line above leads to:
= arg min
T
_
N
i=1
w
(m)
i
I (y
i
= T (x
i
))
_
In other words, the T that solves Equation (A.2) is the classier that minimizes the weighted
error rate.
Next, we need to show that the constant c
m
that solves Equation (A.3) is c
m
=
1
2
log
1err
m
err
m
as required by line c. in the AdaBoost algorithm. We start by expanding Equation (A.3):
c
m
= arg min
c
_
_
e
c
e
c
_
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) + e
c
i=1
w
(m)
i
_
95
and computing and setting to zero its derivative with respect to c:
c
_
(e
c
e
c
)
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) + e
c
i=1
w
(m)
i
_
= (e
c
+ e
c
)
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) e
c
i=1
w
(m)
i
= 0
which follows simply from the derivative of the Exponential function. Thus,
e
c
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) + e
c
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) e
c
i=1
w
(m)
i
= 0
dividing by e
c
:
e
2c
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) +
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
N
i=1
w
(m)
i
= 0
e
2c
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
)) =
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
e
2c
=
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
c =
1
2
ln
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
(A.4)
Separately, we have that the AdaBoost algorithm calls for c
m
=
1
2
ln
1err
m
err
m
with err
m
=
N
i=1
w
(m)
i
I (y
i
=T
m
(x
i
))
N
i=1
w
(m)
i
. Thus,
c
m
=
1
2
ln
1
N
i=1
w
(m)
i
I (y
i
=T
m
(x
i
))
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
=T
m
(x
i
))
N
i=1
w
(m)
i
=
1
2
ln
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
=T
m
(x
i
))
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
=T
m
(x
i
))
N
i=1
w
(m)
i
=
1
2
ln
N
i=1
w
(m)
i
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
N
i=1
w
(m)
i
I (y
i
= T
m
(x
i
))
which is the same as Equation (A.4) above, so the equivalence between the two algorithms is
established.
97
A P P E N D I X B
Gradient Boosting and Robust
Loss Functions
In this appendix we illustrate the process of instatiating the Gradient Boosting algorithmof Table 4.8
to a particular differentiable loss function.
We need to solve:
(c
m
, p
m
) = arg min
c,p
N
i=1
L
_
y
i
, F
m1
(x
i
) + c T (x
i
; p)
_
(B.1)
This is done in two steps:
- Step 1: given c, solve for T
m
= T (x; p
m
):
p
m
= arg min
p
N
i=1
L
_
y
i
, F
m1
(x
i
) + c T (x
i
; p)
_
(B.2)
- Step 2: given T
m
, solve for c:
c
m
= arg min
c
N
i=1
L
_
y
i
, F
m1
(x
i
) + c T (x
i
; p
m
)
_
(B.3)
Solving Equation (B.2) for robust loss functions L(y, y) such as the absolute loss, Huber loss,
binomial deviance, etc., requires use of a surrogate, more convenient, criterion which is derived
from analogy to numerical optimization in function space.
The minimization problem of Equation (B.2) can be simply stated as nd the function f
that has minimum risk i.e.,
f = arg min
f
R(f ). Each possible f can be viewed as a point in
N
i.e., f = f (x
1
), f (x
2
), . . . , f (x
N
), and gradient-descent (Duda et al., 2001) can be used in
this space to locate the minimum. This is illustrated in Figure B.1 with
m
being the step size and
m
R being the gradient vector:
m
R =
R
_
f (x
1
)
R
_
f (x
N
)
f =f
m1
=
L(y
1
, f (x
1
))
_
f (x
1
)
L(y
N
, f (x
N
))
_
f (x
N
)
f =f
m1
98 B. GRADIENTBOOSTINGANDROBUSTLOSS FUNCTIONS
Figure B.1: Gradient-descent illustration. The risk critetion R(f ), is plotted as a function of f evaluated
at two training data points x
1
and x
2
. Starting with an initial guess, f (x
0
), a sequence converging towards
the minimum of R(f ) is generated by moving in the direction of steepest descenti.e., along the
negative of the gradient.
One difculty, however, is that these f s are dened on the training data x
1
, x
2
, . . . , x
N
only.
To obtain a function that is dened for all xs, we can choose the base learner T (x; p) that is most
parallel to R(f ) (this is illustrated in Figure B.2) with being the angle between the two vectors
T (x; p) and R(f ). Taking advantage of the geometric interpretation of correlation, we write:
cos = corr
_
{R(f (x
i
))}
N
i=1
,
_
T (x
i
; p)
_
N
i=1
_
Figure B.2: Surrogate-loss illustration. The base-learner most parallel to the negative gradient vector is
chosen at every step of the gradient-descent algorithm.
99
and the most highly correlated T (x; p) is given by the solution to:
p
m
= arg min
,p
N
i=1
_
m
R(f (x
i
)) T (x
i
; p)
_
2
.
Thus, given a differentiable loss function L(y, y), solving Equation (B.2) simply entails:
1. Set pseudo response y
i
=
L(y
i
, y
i )
y
i
y= y
m1
.
For instance, if L(y, y) =
y y
i=1
_
y
i
T (x
i
; p)
_
2
. If the T s are decision
trees, this is just tting a regression tree to the data { y
i
, x
i
}, using squared-error loss, something
we know how to do (see Chapter 2).
Solving Equation (B.3) can be done as outlined in Appendix A, namely by computing and
setting to zero the derivative of risk with respect to c, R(c)
_
c. But this step also simplies itself
in the case where the T
m
s are trees. Since adding a Jterminal-node tree T
m
(x) to the ensemble
model is like adding J separate (basis) functions and because the terminal node regions are disjoint,
Equation (B.3) can be expanded into J separate minimization problems, one in each terminal node:
jm
= arg min
x
i
R
jm
L(y
i
, F
m1
(x
i
) + ) 1 j J
that is, the optimal constant update ineachterminal node region. Using the absolute loss, for example,
this is simply
jm
= median
x
i
R
jm
{y
i
F
m1
(x
i
)}
N
1
1 j J .
Figure B.3 outlines the complete Gradient Boosting algorithm for the absolute (LAD) loss
L(y, y) =
y y
) ( ) (
expansion // Update
1 ) (
ts coefficien find : Step2 //
,
~
tree regression - LS node terminal
) (
~
) ( find : Step1 //
{ to 1 For
) (
1
1
1 1
1
1
1
1 0
e + =
= =
=
=
=
=
J
j
jm i jm m m
N
i m i
R
jm
N
i i
J
jm
i m i i
m
N
i
R I F F
J j F y median
y J R
F y sign y
T
M m
y median F
jm i
x x x
x
x
x
x
x
x
u