
COMPX310-19A

Machine Learning
Chapter 7: Ensembles, Random Forest
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Ensembles
 Combining two or more classifiers for better results
 Many different ways of doing this; we’ll only cover a few:
 Bagging
 Random Forest
 Boosting
 XGBoost
 Stacking

 Ensembles reduce variance and/or bias

Diverse classifiers

Voting diverse classifiers
 A (majority) vote can produce a classifier better than the single
best one
 We want classifiers that are:
 As good as possible by themselves
 As independent from all others as possible
 This is a bit of a conundrum:
 Perfect classifiers would be identical, that is perfectly correlated

 “Weak classifier”: better than chance, but not brilliant


 Ensembling can turn “weak” classifiers into “strong” ones

Voting diverse classifiers

Voting diverse classifiers
 Majority count == ‘hard’ voting
 Averaging probabilities == ‘soft’ voting
 Law of large numbers: 1,000 independent classifiers, each with 51% accuracy, can reach roughly 75% ensemble accuracy

SciKit-Learn Voting classifier
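The code screenshots are not reproduced here. Below is a minimal sketch of hard/soft voting in the spirit of Géron's example; the choice of base classifiers, the hyperparameters, and the moons data split are assumptions, not taken from the slide.

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True enables soft voting
    ],
    voting="soft",  # "hard" = majority vote, "soft" = average the predicted probabilities
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # typically a bit better than any single base classifier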

Bagging and Pasting
 What if we only want to use one type of classifier, e.g. trees?
 Subsample the training data:
 With replacement => Bagging (bootstrap aggregation)
 Without replacement => Pasting

 Scales very well, because it is “embarrassingly parallel”:
 All trees can be trained in parallel (multi-core or distributed)
 All trees can predict in parallel (but the predictions must be fused)

 Bagging is much more popular, but Pasting is very useful for extremely
large datasets: split them into disjoint subsets and process on separate machines

Bagging and Pasting

SciKit-Learn Bagging

 Does soft voting automatically, if the base classifier has a
predict_proba method
 n_jobs=-1 means ‘use all cores’
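The screenshot is not shown; here is a minimal BaggingClassifier sketch. The 500 trees and 100 examples per bag are illustrative values, and X_train, y_train are the moons split from the voting sketch above.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,   # size of each bag
    bootstrap=True,    # True = Bagging (with replacement), False = Pasting
    n_jobs=-1,         # use all CPU cores
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)  # soft voting, since trees provide predict_proba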

Smooth decision boundary
 One decision tree on the Moons dataset, versus a Bagging ensemble of 500 trees

Out-of-bag evaluation
 Every sample/subset is called a bag.
 Because of replacements, there will be multiple copies of some
training examples in a bag, while others will not appear at all.
 On average, a bag contains about 63% of the unique training examples
(since 1 − (1 − 1/m)^m ≈ 1 − 1/e ≈ 0.632)
 In other words, e.g. when Bagging 1000 trees, then on average any single
example will be part of about 630 bags, but be left out of the remaining 370 bags
 => We can collect predictions for this example from those 370 trees,
average them, and compute accuracy, F1, and similar metrics
 This is almost like having a validation set for free

OOB in Scikit-Learn
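The screenshot is not shown; this is a sketch of OOB evaluation with the same bagged trees, again reusing the moons split and the imports from the Bagging sketch above.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # evaluate every tree on the examples left out of its bag
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)                  # "free" accuracy estimate, no validation set needed
print(bag_clf.oob_decision_function_[:3])  # OOB class probabilities for the first examples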

Random Patches/Subspaces
 We can also subsample the features, not just the examples
 max_features parameter in the BaggingClassifier:
 each base classifier will be trained on a different feature subset
 Random subspaces:
 only sample features, not examples (so this is NOT bagging)
 Random patches:
 sample both examples + features
 Both increase diversity of the base classifiers (see the sketch below)
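A sketch of both variants via BaggingClassifier parameters, using the imports from the Bagging sketch above; the 0.5 and 0.7 fractions are purely illustrative.

# Random patches: subsample both training examples and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,             # sample examples (with replacement)
    max_features=0.5, bootstrap_features=True,   # sample features
    n_jobs=-1)

# Random subspaces: keep all examples, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,            # keep all examples, no resampling
    max_features=0.5, bootstrap_features=True,
    n_jobs=-1)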

Random Forest
 Genius idea that combines Bagging with Feature randomization
inside the decision tree:
 For every split in the tree:
 Randomly select a feature subset
 Choose the best split from this subset

 Has many options (the union of Tree and Bagging options), but the
defaults often work really well, provided enough trees are
generated (100s to 1000s)
 => a perfect, simple ‘off-the-shelf’ learning algorithm
 RandomForestClassifier + RandomForestRegressor

Scikit-Learn: Random Forest

 Is somewhat similar to a BaggingClassifier of feature-randomized trees (sketch below):
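A sketch of the comparison; the hyperparameter values are illustrative, and X_train, y_train are the moons split from earlier.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

# Roughly equivalent: bagged trees that also consider only a random
# subset of the features at every split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1)
bag_clf.fit(X_train, y_train)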

Extra-Trees
 Extremely Randomized Trees ensemble:
 Do not search for the best splitting threshold for numeric
attributes, instead choose one at random
 Much more random, but also much faster to train
 May or may not perform better than a Random Forest, but usually
needs many more ensemble members

 Scikit-Learn:
 ExtraTreesClassifier and ExtraTreesRegressor
 Same options/API as the RandomForest C+R

Feature Importance
 Built into the Random Forest code:
 Measure how much splits on an attribute reduce Gini impurity on
average, weighted by the number of samples they affect; results are scaled to sum to 1.0:
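The screenshot is not shown; a sketch using the iris data, in the spirit of Géron's example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, score)   # the importances sum to 1.0; petal length/width usually dominate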

Feature Importance: MNIST
 Case study using the digits image classification problem:
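The figure is not reproduced. A sketch of how such a plot can be recreated: train a forest on MNIST, then display the per-pixel importances as a 28x28 image (downloading and fitting take a while; the hyperparameters are assumptions).

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

mnist = fetch_openml("mnist_784", version=1)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(mnist.data, mnist.target)

# Every pixel is a feature, so the 784 importances form a 28x28 image
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.show()   # central pixels matter most, border pixels hardly at all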

FAQ1
 Pasting vs. cross-validation?
 CV is usually stratified, and “exhaustive”

 Pasting batch sizes?
 No replacement, so the batches have to be SMALLER than the training set
 Bagging can use any size; the default is the training set size m

 Pasting vs. Bagging?
 Smaller datasets => bag, very large ones => paste

 Feature sampling: replacement (usually) makes no sense

 Random patches vs. random subspaces:
 Use subspaces for “few examples, many features” datasets
 Use patches for “many examples, many features” datasets

FAQ2
 Why does voting help?

 Three models with 60% accuracy each, 8 possible prediction cases:

 c/0.6 c/0.6 c/0.6 => correct, 0.216
 c/0.6 c/0.6 w/0.4 => correct, 0.144
 c/0.6 w/0.4 c/0.6 => correct, 0.144
 c/0.6 w/0.4 w/0.4 => wrong, 0.096
 w/0.4 c/0.6 c/0.6 => correct, 0.144
 w/0.4 c/0.6 w/0.4 => wrong, 0.096
 w/0.4 w/0.4 c/0.6 => wrong, 0.096
 w/0.4 w/0.4 w/0.4 => wrong, 0.064

 Ensemble voting total probabilities:

 Correct: 0.216 + 0.144 + 0.144 + 0.144 => 0.648, or 64.8% accuracy
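A quick sanity check of this arithmetic in plain Python (not from the slides); the second print applies the same idea to the 1,000-classifier claim from the earlier slide, assuming perfectly independent voters.

from itertools import product
from math import comb

def majority_vote_accuracy(p, n_models):
    # Probability that more than half of n independent models,
    # each correct with probability p, vote for the right class.
    total = 0.0
    for outcome in product([True, False], repeat=n_models):
        prob = 1.0
        for correct in outcome:
            prob *= p if correct else (1 - p)
        if sum(outcome) > n_models / 2:
            total += prob
    return total

print(majority_vote_accuracy(0.6, 3))   # 0.648, matching the table above

# Binomial probability that a strict majority of 1000 independent
# 51%-accurate voters is correct (the law-of-large-numbers effect):
p = 0.51
print(sum(comb(1000, k) * p**k * (1 - p)**(1000 - k) for k in range(501, 1001)))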

FAQ3
 Hard vs. soft voting:

 3 models and their predicted probabilities for a hypothetical example:

 M1: 0.95, 0.05
 M2: 0.45, 0.55
 M3: 0.40, 0.60

 Hard voting: Class 2 (a 1:2 vote)

 Soft voting: average the probabilities:
 Sums: 1.8 vs. 1.2; normalize:
 0.6, 0.4 => Class 1

FAQ4
 Sampling with replacement: sample 100 times from [0,100)
 Summary stats:
 Missing: 36
 Once: 37
 Twice: 20
 3 times: 5
 4 times: 2
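A small sketch to reproduce such statistics; the exact counts will vary from run to run.

import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
sample = rng.integers(0, 100, size=100)   # sample 100 times from [0, 100) with replacement
counts = Counter(sample.tolist())         # how often each value was drawn
missing = 100 - len(counts)               # values never drawn; about 37 expected (roughly 1/e)
print("missing:", missing)
for k, n_values in sorted(Counter(counts.values()).items()):
    print(f"drawn {k} times:", n_values)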

Boosting
 Another way for improving a ‘weak’ classifier
 Train a series of classifiers one after the other, always trying to
correct for the mistakes of the previous ones
 Training is inherently sequential
 Two main ways:
 AdaBoost
 Gradient Boosting

AdaBoost
 Adaptive Boosting:
 Increase the weight of mis-classified examples
 Decrease the weight of correctly classified examples
 Better base classifiers get a higher weight
 Every single base classifier should be non-perfect (why?)

AdaBoost example
 Moons dataset; an SVM with RBF kernel is being boosted:

AdaBoost MATHS
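The formula slide is not reproduced here. For reference, the usual AdaBoost quantities, in LaTeX notation and following Géron's presentation (so details may differ slightly from the original slide): the weighted error rate of the j-th predictor, its weight (with learning rate \eta), and the example-weight update.

r_j = \frac{\sum_{i:\ \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}

w^{(i)} \leftarrow
\begin{cases}
  w^{(i)}                & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
  w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
\quad \text{(then normalize so the weights sum to 1)}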

AdaBoost MATHS 2
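Also not reproduced; the final AdaBoost prediction over N predictors and classes k, again following Géron's presentation:

\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j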

Scikit-Learn AdaBoost
 Uses a version that also works for multi-class problems, and can
also utilize probabilities
 A tree with max_depth=1 is called a Decision Stump
 If AdaBoost overfits (which it easily can), use fewer iterations, a
lower learning rate, or more regularization of the base classifier
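A minimal sketch of boosting decision stumps; the hyperparameters are illustrative, and X_train, y_train are the moons split from the earlier examples.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # a decision stump as the base classifier
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)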

Gradient Boosting
 Don’t modify the example weights; instead, each new model tries to
directly correct the remaining error, the ‘residuals’
 3 step ‘manual’ regression example (easier to follow):
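The worked example is not reproduced; below is a sketch in the spirit of Géron's version, on a toy quadratic regression problem (the data generation is an assumption).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)                      # step 1: fit the original targets

y2 = y - tree_reg1.predict(X)            # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)                     # step 2: fit the residuals

y3 = y2 - tree_reg2.predict(X)           # residuals remaining after two trees
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)                     # step 3: fit the remaining residuals

# The ensemble prediction is simply the sum of the three trees
X_new = np.array([[0.4]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

# GradientBoostingRegressor does the same thing in one go
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)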

How many base classifiers?
 Generally: the more trees, the lower the learning rate should be
 How many trees: as with gradient descent, use early stopping

Use validation to find right #
 Use ‘staged_predict’ to get predictions for every prefix of the
ensemble: 1 tree, 2 trees, 3 trees, …
 Find the smallest validation error, and retrain with that ensemble size (sketch below)
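A sketch of this procedure, reusing the toy X, y from the manual gradient boosting example; the upper limit of 120 trees is arbitrary.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# Validation error of every prefix of the ensemble: 1 tree, 2 trees, 3 trees, ...
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1

# Retrain with the best ensemble size
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)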

Result plot

Manual early stopping
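The screenshot is not shown; a sketch that uses warm_start to grow the ensemble one tree at a time and stops once the validation error has not improved for five rounds (continuing from the split and imports in the previous sketch).

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)  # keep the trees between fit() calls

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)          # warm_start: only the new tree is trained
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break                        # early stopping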

Tedious, use a better implementation
 3 industry-strength implementations (including GPU support):
 XGBoost (open source)
 LightGBM (Microsoft)
 CatBoost (Yandex)

 Useful built-in features, like early stopping, feature sampling, …

Stacking
 Smarter variant of voting: replace voting by a learner (aka
blender, or meta-learner)

Scikit-Learn:
NOT built in; either
- program it yourself, or
- find a 3rd-party library

How to train the Blender?
 Split the training set into 2 parts: the first to train the base level, the 2nd to train the blender

Training the Blender
 Get the base learners’ predictions for the 2nd part, and use them to train the Blender (see the sketch below)
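The figures are not reproduced. Below is a manual sketch of this two-part scheme for a binary problem; the choice of base learners, the logistic-regression blender, and the moons data are assumptions, and recent Scikit-Learn releases also ship a StackingClassifier that automates this.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Part 1 trains the base level, part 2 trains the blender
X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_part1, X_part2, y_part1, y_part2 = train_test_split(
    X_moons, y_moons, test_size=0.5, random_state=42)

base_learners = [RandomForestClassifier(random_state=42),
                 SVC(probability=True, random_state=42),
                 LogisticRegression(random_state=42)]
for learner in base_learners:
    learner.fit(X_part1, y_part1)

# The base learners' predictions on part 2 become the blender's training features
blend_features = np.column_stack([learner.predict_proba(X_part2)[:, 1]
                                  for learner in base_learners])
blender = LogisticRegression()
blender.fit(blend_features, y_part2)

# At prediction time: base learners first, then the blender on top
def stacked_predict(X_new):
    features = np.column_stack([learner.predict_proba(X_new)[:, 1]
                                for learner in base_learners])
    return blender.predict(features)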

Multiple Blenders blended
 “Going crazy”: three levels => needs 3 subsets

