Machine Learning
Chapter 7: Ensembles, Random Forest
An introduction using Python, Scikit-Learn, Keras, and TensorFlow
Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Ensembles
Combining two or more classifiers for better results
There are many ways of doing this; we’ll only cover a few:
Bagging
Random Forest
Boosting
XGBoost
Stacking
Diverse classifiers
Voting diverse classifiers
A (majority) vote can produce a classifier better than the single
best one
We want classifiers that are:
As good as possible by themselves
As independent from all others as possible
This is a bit of a conundrum:
Perfect classifiers would be identical, i.e. perfectly correlated
Voting diverse classifiers
Majority count == ‘hard’ voting
Averaging probabilities == ‘soft’ voting
Law of large numbers: 1000 independent classifiers with 51% accuracy each => about 75% ensemble accuracy
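A small worked check of this claim, assuming the 1000 classifiers are fully independent (a strong assumption):

```python
from math import comb

# Probability that at least half of n independent classifiers, each correct
# with probability p, vote for the right class (ties counted as correct here).
n, p = 1000, 0.51
p_majority = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2, n + 1))
print(round(p_majority, 3))   # roughly 0.75
```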
SciKit-Learn Voting classifier
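A minimal sketch of Scikit-Learn's VotingClassifier; the moons dataset and the three base classifiers are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Example data: the moons dataset, split into train and test sets.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three diverse classifiers combined by a majority ('hard') vote.
voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(probability=True))],
    voting='hard')   # voting='soft' averages the predicted probabilities instead
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```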
Bagging and Pasting
What if we only want to use one type of classifier, e.g. trees?
Subsample the training data:
With replacement => Bagging (bootstrap aggregation)
Without replacement => Pasting
SciKit-Learn Bagging
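A sketch of bagging 500 trees with BaggingClassifier, reusing the X_train/y_train split from the voting sketch above; the per-bag sample size of 100 is an arbitrary choice:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 trees, each trained on 100 examples drawn with replacement (bagging).
# Set bootstrap=False to sample without replacement (pasting).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
```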
Smooth decision boundary
One decision tree on the moons dataset, versus a bagging ensemble of 500 trees
Out-of-bag evaluation
Each bootstrap sample (subset) is called a bag.
Because of sampling with replacement, a bag will contain multiple copies of
some training examples, while others will not appear at all.
On average, a bag contains about 63% of the unique training examples.
In other words, when bagging e.g. 1000 trees, any single example will on
average be part of about 630 bags, but be left out of the remaining 370 bags.
=> We can collect predictions for this example from those ~370 trees,
average them, and compute accuracy, F1, and similar metrics
This is almost like having a validation set for free
OOB in Scikit-Learn
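A sketch of out-of-bag evaluation, again reusing the training split from the sketches above; oob_score=True is the only addition compared to plain bagging:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# oob_score=True evaluates each training example only on the ensemble
# members whose bags did not contain it.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)            # out-of-bag accuracy estimate
# bag_clf.oob_decision_function_ holds the averaged OOB class probabilities
```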
Random Patches/Subspaces
We can also subsample the features, not just the examples
Random subspace:
only sample features, not examples (so this is NOT bagging)
Random patches:
sample both examples + features
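A sketch of both variants via BaggingClassifier options; the sampling fractions (0.7, 0.6) are arbitrary assumptions:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random patches: subsample both the training examples and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,            # subsample examples
    max_features=0.6, bootstrap_features=True)  # and subsample features

# Random subspaces: keep every training example, subsample only the features.
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,           # use all examples
    max_features=0.6, bootstrap_features=True)
```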
Random Forest
Genius idea that combines Bagging with Feature randomization
inside the decision tree:
For every split in the tree:
Randomly select a feature subset
Choose the best split from this subset
Scikit-Learn: Random Forest
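A minimal Random Forest sketch, assuming the same training data as before; the hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

# A Random Forest: bagged trees with per-split feature randomization built in.
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
```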
Extra-Trees
Extremely Randomized Trees ensemble:
Do not search for the best splitting threshold for numeric
attributes; instead, choose one at random
Much more random, but also much faster to train
May or may not perform better than a Random Forest, but usually
need many more ensemble members
Scikit-Learn:
ExtraTreesClassifier and ExtraTreesRegressor
Same options/API as RandomForestClassifier and RandomForestRegressor
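A matching Extra-Trees sketch, again assuming the same training data:

```python
from sklearn.ensemble import ExtraTreesClassifier

# Same interface as RandomForestClassifier; split thresholds are picked at random.
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf.fit(X_train, y_train)
```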
Feature Importance
Built into the Random Forest code:
Measures how much each selected feature reduces Gini impurity on
average, weighted by the number of samples reaching each split; the results are scaled to sum to 1.0:
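A sketch of reading the importances; the iris dataset is used purely as an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris.data, iris.target)

# One importance score per feature; the scores sum to 1.0.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, round(score, 3))
```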
Feature Importance: MNIST
Case study using the digits image classification problem:
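A hedged reconstruction of this kind of case study: each pixel is a feature, so the importances can be reshaped back into an image (the dataset name and plotting details are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# MNIST: 70,000 images of 28x28 = 784 pixels, each pixel being one feature.
mnist = fetch_openml('mnist_784', as_frame=False)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnd_clf.fit(mnist.data, mnist.target)

# Reshape the 784 importance scores into a 28x28 heat map.
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap='hot')
plt.axis('off')
plt.colorbar()
plt.show()
```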
FAQ1
Pasting vs. Cross-validation?
CV is usually stratified, and “exhaustive”: every example is used for testing exactly once, whereas pasting draws its subsets at random
FAQ2
Why does voting help?
Because the members’ errors are at least partly independent, they tend to cancel out in the vote, so the majority is right more often than any single member (law of large numbers)
FAQ3
Hard vs. soft voting:
Hard voting counts the predicted classes; soft voting averages the predicted class probabilities, so more confident members carry more weight
FAQ4
Sampling with replacement: sample 100 times from [0,100)
Summary stats:
missing: 36
once: 37
twice: 20
3 times: 5
4 times: 2
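A small simulation that reproduces this kind of summary (the seed is arbitrary, so the exact counts will vary):

```python
import numpy as np

# Draw 100 indices with replacement from [0, 100) and count how often
# each index was drawn.
rng = np.random.default_rng(42)
sample = rng.integers(0, 100, size=100)
counts = np.bincount(sample, minlength=100)
for k in range(counts.max() + 1):
    print(f"drawn {k} times: {np.sum(counts == k)}")
```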
Boosting
Another way of improving a ‘weak’ classifier
Train a series of classifiers one after the other, each trying to
correct the mistakes of the previous ones
Training is inherently sequential
Two main ways:
AdaBoost
Gradient Boosting
AdaBoost
Adaptive Boosting:
Increase the weight of mis-classified examples
Decrease the weight of correctly classified examples
Better base classifiers get a higher weight
Every single base classifier should be non-perfect (why?)
AdaBoost example
Moons dataset, SVM with RBF kernel is being boosted:
AdaBoost MATHS
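As a reconstruction of the maths (following the notation in Géron's book): the weighted error rate of the j-th predictor, and the predictor weight derived from it, are

```latex
% Weighted error rate of the j-th predictor (w^{(i)} is the weight of example i):
r_j = \frac{\sum_{i:\, \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

% Predictor weight (\eta is the learning rate); the lower r_j, the larger \alpha_j:
\alpha_j = \eta \log \frac{1 - r_j}{r_j}
```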
AdaBoost MATHS 2
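Continuing the reconstruction: the example-weight update and the final weighted-vote prediction are

```latex
% Weight update: increase the weights of the examples predictor j got wrong,
% then normalize so that all weights sum to 1:
w^{(i)} \leftarrow
\begin{cases}
  w^{(i)}                & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
  w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}

% Final prediction: weighted vote over all N predictors:
\hat{y}(\mathbf{x}) = \operatorname*{argmax}_{k} \sum_{j:\, \hat{y}_j(\mathbf{x}) = k} \alpha_j
```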
Scikit-Learn AdaBoost
Uses SAMME, a multi-class version of AdaBoost; the SAMME.R variant can
also utilize class probabilities
A tree with max_depth=1 is called a Decision Stump
If AdaBoost overfits (which it easily can), use fewer iterations, a
lower learning rate, or more regularization of the base classifier
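A minimal AdaBoost sketch with decision stumps, reusing the classification split from earlier; the number of estimators and the learning rate are illustrative choices:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps (max_depth=1), boosted with a moderate learning rate.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)
```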
Gradient Boosting
Don’t modify the example weights; instead, fit each new predictor directly to
the remaining error, the ‘residual’
A three-step ‘manual’ regression example (easier to follow, sketch below):
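A sketch of those three manual steps, using a small quadratic toy dataset as a stand-in:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (stand-in for the quadratic example).
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Step 1: fit a first tree to the target.
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Step 2: fit a second tree to the residual errors of the first.
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Step 3: fit a third tree to the residuals of the first two.
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# The ensemble prediction is the sum of the three trees' predictions.
X_new = np.array([[0.2]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
```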
How many base classifiers?
Generally: the more base classifiers, the lower the learning rate should be
How many: as with gradient descent, use early stopping
Use validation to find right #
Use ‘staged_predict’ to get predictions for every prefix of the
ensemble: 1 tree, 2 trees, 3 trees, …
Find the smallest validation error, and retrain at that ensemble size (sketch below)
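A sketch of this procedure, assuming the regression data (X, y) from the manual example above; 120 trees is an arbitrary upper limit:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Train a large ensemble, then measure validation error for every prefix.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n = int(np.argmin(errors)) + 1   # +1: prefixes are 1-indexed

# Retrain with the best ensemble size found above.
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n)
gbrt_best.fit(X_train, y_train)
```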
Result plot
Manual early stopping
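A sketch of growing the ensemble incrementally with warm_start and stopping after five rounds without improvement (the patience of 5 is an arbitrary choice), reusing the train/validation split from the previous sketch:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# warm_start=True keeps the existing trees when fit() is called again,
# so the ensemble can be grown one tree at a time.
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float('inf')
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:   # stop after 5 rounds without improvement
            break
```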
Tedious: use a better implementation
Three industry-strength implementations (including GPU support):
XGBoost (open source)
LightGBM (Microsoft)
CatBoost (Yandex)
Stacking
Smarter variant of voting: replace the vote with a learner (aka
blender, or meta-learner)
Scikit-Learn:
Not built in at the time of these slides (newer versions do provide
StackingClassifier and StackingRegressor); otherwise either
- program it yourself, or
- find a 3rd-party lib
How to train the Blender?
Split the training set into two parts: the first trains the base level, the second trains the blender
Training the Blender
Get the base classifiers’ predictions on the second part, and use them as features to train the blender (sketch below)
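A sketch of this two-part training scheme, assuming a classification training set (X_train, y_train) and two arbitrary base classifiers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split the training data: part 1 trains the base classifiers,
# part 2 produces the predictions the blender learns from.
X_base, X_blend, y_base, y_blend = train_test_split(X_train, y_train, test_size=0.5)

base_clfs = [RandomForestClassifier(n_estimators=100), SVC(probability=True)]
for clf in base_clfs:
    clf.fit(X_base, y_base)

# The base classifiers' predictions on part 2 become the blender's features.
blend_features = np.column_stack([clf.predict(X_blend) for clf in base_clfs])
blender = LogisticRegression()
blender.fit(blend_features, y_blend)

# At prediction time, run the base classifiers first, then the blender.
def stacked_predict(X_new):
    features = np.column_stack([clf.predict(X_new) for clf in base_clfs])
    return blender.predict(features)
```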
Multiple Blenders blended
“Going crazy”: three levels => needs 3 subsets