
COMPX310-19A

Machine Learning
Chapter 7: Ensembles, Random Forest
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Ensembles
 Combining two or more classifiers for better results
 Many different ways of doing this; we’ll only cover a few:
 Bagging
 Random Forest
 Boosting
 XGBoost
 Stacking

 Ensembles reduce variance and/or bias

Diverse classifiers

Voting diverse classifiers
 A (majority) vote can produce a classifier better than the single
best one
 We want classifiers that are:
 As good as possible by themselves
 As independent from all others as possible
 This is a bit of a conundrum:
 Perfect classifiers would be identical, that is perfectly correlated

 “Weak classifier”: better than chance, but not brilliant


 Ensembling can turn “weak” classifiers into “strong” ones

Voting diverse classifiers

Voting diverse classifiers
 Majority count == ‘hard’ voting
 Averaging probabilities == ‘soft’ voting
 Law of large numbers: 1,000 independent classifiers, each with 51% accuracy, can reach roughly 75% ensemble accuracy

SciKit-Learn Voting classifier
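The code screenshots are not reproduced here. Below is a minimal sketch of hard/soft voting in the spirit of Géron's example; the choice of base classifiers, the hyperparameters, and the moons data split are assumptions, not taken from the slide.

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True enables soft voting
    ],
    voting="soft",  # "hard" = majority vote, "soft" = average the predicted probabilities
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # typically a bit better than any single base classifier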

Bagging and Pasting
 What if we only want to use one type of classifier, e.g. trees?
 Subsample the training data:
 With replacement => Bagging (bootstrap aggregation)
 Without replacement => Pasting

 Scales very well, because it is “embarrassingly parallel”:
 All trees can be trained in parallel (multi-core or distributed)
 All trees can predict in parallel (but the predictions must be fused)

 Bagging is much more popular, but Pasting is very useful for extremely
large datasets: split them into disjoint subsets and process on separate machines

Bagging and Pasting

SciKit-Learn Bagging

 Does soft voting automatically, if the base classifier has a
predict_proba method
 n_jobs=-1 means ‘use all cores’
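The screenshot is not shown; here is a minimal BaggingClassifier sketch. The 500 trees and 100 examples per bag are illustrative values, and X_train, y_train are the moons split from the voting sketch above.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,   # size of each bag
    bootstrap=True,    # True = Bagging (with replacement), False = Pasting
    n_jobs=-1,         # use all CPU cores
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)  # soft voting, since trees provide predict_proba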

Smooth decision boundary
 One decision tree on the Moons dataset, versus a Bagging ensemble of 500 trees

Out-of-bag evaluation
 Every sample/subset is called a bag.
 Because of replacements, there will be multiple copies of some
training examples in a bag, while others will not appear at all.
 On average, a bag contains about 63% of the unique training examples
(since 1 − (1 − 1/m)^m ≈ 1 − 1/e ≈ 0.632)
 In other words, e.g. when Bagging 1000 trees, then on average any single
example will be part of about 630 bags, but be left out of the remaining 370 bags
 => We can collect predictions for this example from those 370 trees,
average them, and compute accuracy, F1, and similar metrics
 This is almost like having a validation set for free

OOB in Scikit-Learn
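The screenshot is not shown; this is a sketch of OOB evaluation with the same bagged trees, again reusing the moons split and the imports from the Bagging sketch above.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # evaluate every tree on the examples left out of its bag
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)                  # "free" accuracy estimate, no validation set needed
print(bag_clf.oob_decision_function_[:3])  # OOB class probabilities for the first examples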

Random Patches/Subspaces
 We can also subsample the features, not just the examples
 max_features parameter in the BaggingClassifier:
 each base classifier will be trained on a different feature subset
 Random subspaces:
 only sample features, not examples (so this is NOT bagging)
 Random patches:
 sample both examples + features
 Both increase diversity of the base classifiers (see the sketch below)
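A sketch of both variants via BaggingClassifier parameters, using the imports from the Bagging sketch above; the 0.5 and 0.7 fractions are purely illustrative.

# Random patches: subsample both training examples and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=0.7, bootstrap=True,             # sample examples (with replacement)
    max_features=0.5, bootstrap_features=True,   # sample features
    n_jobs=-1)

# Random subspaces: keep all examples, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=1.0, bootstrap=False,            # keep all examples, no resampling
    max_features=0.5, bootstrap_features=True,
    n_jobs=-1)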

Random Forest
 Genius idea that combines Bagging with Feature randomization
inside the decision tree:
 For every split in the tree:
 Randomly select a feature subset
 Choose the best split from this subset

 Has many options (the union of Tree and Bagging options), but the
defaults often work really well, provided enough trees are
generated (100s to 1000s)
 => a perfect, simple ‘off-the-shelf’ learning algorithm
 RandomForestClassifier + RandomForestRegressor

Scikit-Learn: Random Forest

 Is somewhat similar to a BaggingClassifier of feature-randomized trees (sketch below):
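A sketch of the comparison; the hyperparameter values are illustrative, and X_train, y_train are the moons split from earlier.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

# Roughly equivalent: bagged trees that also consider only a random
# subset of the features at every split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1)
bag_clf.fit(X_train, y_train)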

Extra-Trees
 Extremely Randomized Trees ensemble:
 Do not search for the best splitting threshold for numeric
attributes, instead choose one at random
 Much more random, but also much faster to train
 May or may not perform better than a Random Forest, but usually
needs many more ensemble members

 Scikit-Learn:
 ExtraTreesClassifier and ExtraTreesRegressor
 Same options/API as the RandomForest C+R

Feature Importance
 Built into the Random Forest code:
 Measure how much splits on an attribute reduce Gini impurity on
average, weighted by the number of samples they affect; results are scaled to sum to 1.0:
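The screenshot is not shown; a sketch using the iris data, in the spirit of Géron's example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, score)   # the importances sum to 1.0; petal length/width usually dominate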

Feature Importance: MNIST
 Case study using the digits image classification problem:
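The figure is not reproduced. A sketch of how such a plot can be recreated: train a forest on MNIST, then display the per-pixel importances as a 28x28 image (downloading and fitting take a while; the hyperparameters are assumptions).

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

mnist = fetch_openml("mnist_784", version=1)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(mnist.data, mnist.target)

# Every pixel is a feature, so the 784 importances form a 28x28 image
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.show()   # central pixels matter most, border pixels hardly at all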

FAQ1
 Pasting vs. cross-validation?
 CV is usually stratified, and “exhaustive”

 Pasting batch sizes?
 No replacement, so the batches have to be SMALLER than the training set
 Bagging can use any size; the default is the training set size m

 Pasting vs. Bagging?
 Smaller datasets => bag, very large ones => paste

 Feature sampling: replacement (usually) makes no sense

 Random patches vs. random subspaces:
 Use subspaces for “few examples, many features” datasets
 Use patches for “many examples, many features” datasets

FAQ2
 Why does voting help?

 Three models with 60% accuracy each, 8 possible prediction cases:

 c/0.6 c/0.6 c/0.6 => correct, 0.216
 c/0.6 c/0.6 w/0.4 => correct, 0.144
 c/0.6 w/0.4 c/0.6 => correct, 0.144
 c/0.6 w/0.4 w/0.4 => wrong, 0.096
 w/0.4 c/0.6 c/0.6 => correct, 0.144
 w/0.4 c/0.6 w/0.4 => wrong, 0.096
 w/0.4 w/0.4 c/0.6 => wrong, 0.096
 w/0.4 w/0.4 w/0.4 => wrong, 0.064

 Ensemble voting total probabilities:

 Correct: 0.216 + 0.144 + 0.144 + 0.144 => 0.648, or 64.8% accuracy
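A quick sanity check of this arithmetic in plain Python (not from the slides); the second print applies the same idea to the 1,000-classifier claim from the earlier slide, assuming perfectly independent voters.

from itertools import product
from math import comb

def majority_vote_accuracy(p, n_models):
    # Probability that more than half of n independent models,
    # each correct with probability p, vote for the right class.
    total = 0.0
    for outcome in product([True, False], repeat=n_models):
        prob = 1.0
        for correct in outcome:
            prob *= p if correct else (1 - p)
        if sum(outcome) > n_models / 2:
            total += prob
    return total

print(majority_vote_accuracy(0.6, 3))   # 0.648, matching the table above

# Binomial probability that a strict majority of 1000 independent
# 51%-accurate voters is correct (the law-of-large-numbers effect):
p = 0.51
print(sum(comb(1000, k) * p**k * (1 - p)**(1000 - k) for k in range(501, 1001)))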

FAQ3
 Hard vs. soft voting:

 3 models and their predicted probabilities for a hypothetical example:

 M1: 0.95, 0.05
 M2: 0.45, 0.55
 M3: 0.40, 0.60

 Hard voting: Class 2 (a 1:2 vote)

 Soft voting: average the probabilities:
 Sums: 1.8 vs. 1.2; normalize:
 0.6, 0.4 => Class 1

FAQ4
 Sampling with replacement: sample 100 times from [0,100)
 Summary stats:
 Missing: 36
 Once: 37
 Twice: 20
 3 times: 5
 4 times: 2
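A small sketch to reproduce such statistics; the exact counts will vary from run to run.

import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
sample = rng.integers(0, 100, size=100)   # sample 100 times from [0, 100) with replacement
counts = Counter(sample.tolist())         # how often each value was drawn
missing = 100 - len(counts)               # values never drawn; about 37 expected (roughly 1/e)
print("missing:", missing)
for k, n_values in sorted(Counter(counts.values()).items()):
    print(f"drawn {k} times:", n_values)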

Boosting
 Another way for improving a ‘weak’ classifier
 Train a series of classifiers one after the other, always trying to
correct for the mistakes of the previous ones
 Training is inherently sequential
 Two main ways:
 AdaBoost
 Gradient Boosting

AdaBoost
 Adaptive Boosting:
 Increase the weight of mis-classified examples
 Decrease the weight of correctly classified examples
 Better base classifiers get a higher weight
 Every single base classifier should be non-perfect (why?)

AdaBoost example
 Moons dataset; an SVM with RBF kernel is being boosted:

AdaBoost MATHS
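The formula slide is not reproduced here. For reference, the usual AdaBoost quantities, in LaTeX notation and following Géron's presentation (so details may differ slightly from the original slide): the weighted error rate of the j-th predictor, its weight (with learning rate \eta), and the example-weight update.

r_j = \frac{\sum_{i:\ \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}

w^{(i)} \leftarrow
\begin{cases}
  w^{(i)}                & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
  w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
\quad \text{(then normalize so the weights sum to 1)}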

AdaBoost MATHS 2
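Also not reproduced; the final AdaBoost prediction over N predictors and classes k, again following Géron's presentation:

\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j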

Scikit-Learn AdaBoost
 Uses a version that also works for multi-class problems, and can
also utilize probabilities
 A tree with max_depth=1 is called a Decision Stump
 If AdaBoost overfits (which it easily can), use fewer iterations, a
lower learning rate, or more regularization of the base classifier
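A minimal sketch of boosting decision stumps; the hyperparameters are illustrative, and X_train, y_train are the moons split from the earlier examples.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # a decision stump as the base classifier
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)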

Gradient Boosting
 Don’t modify the example weights; instead, each new model tries to
directly correct the remaining error, the ‘residuals’
 3 step ‘manual’ regression example (easier to follow):
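The worked example is not reproduced; below is a sketch in the spirit of Géron's version, on a toy quadratic regression problem (the data generation is an assumption).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)                      # step 1: fit the original targets

y2 = y - tree_reg1.predict(X)            # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)                     # step 2: fit the residuals

y3 = y2 - tree_reg2.predict(X)           # residuals remaining after two trees
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)                     # step 3: fit the remaining residuals

# The ensemble prediction is simply the sum of the three trees
X_new = np.array([[0.4]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

# GradientBoostingRegressor does the same thing in one go
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)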

How many base classifiers?
 Generally: the more trees, the lower the learning rate should be
 How many trees: as with gradient descent, use early stopping

Use validation to find right #
 Use ‘staged_predict’ to get predictions for every prefix of the
ensemble: 1 tree, 2 trees, 3 trees, …
 Find the smallest validation error, and retrain with that ensemble size (sketch below)
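A sketch of this procedure, reusing the toy X, y from the manual gradient boosting example; the upper limit of 120 trees is arbitrary.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# Validation error of every prefix of the ensemble: 1 tree, 2 trees, 3 trees, ...
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1

# Retrain with the best ensemble size
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)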

Result plot

Manual early stopping
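The screenshot is not shown; a sketch that uses warm_start to grow the ensemble one tree at a time and stops once the validation error has not improved for five rounds (continuing from the split and imports in the previous sketch).

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)  # keep the trees between fit() calls

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)          # warm_start: only the new tree is trained
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break                        # early stopping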

Tedious, use a better implementation
 3 industry-strength implementations (including GPU support):
 XGBoost (open source)
 LightGBM (Microsoft)
 CatBoost (Yandex)

 Useful built-in features, like early stopping, feature sampling, …

Stacking
 Smarter variant of voting: replace voting by a learner (aka
blender, or meta-learner)

Scikit-Learn:
NOT built in; either
- program it yourself, or
- find a 3rd-party library

How to train the Blender?
 Split the training set into 2 parts: the first to train the base level, the 2nd to train the blender

Training the Blender
 Get the base learners’ predictions for the 2nd part, and use them to train the Blender (see the sketch below)
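The figures are not reproduced. Below is a manual sketch of this two-part scheme for a binary problem; the choice of base learners, the logistic-regression blender, and the moons data are assumptions, and recent Scikit-Learn releases also ship a StackingClassifier that automates this.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Part 1 trains the base level, part 2 trains the blender
X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_part1, X_part2, y_part1, y_part2 = train_test_split(
    X_moons, y_moons, test_size=0.5, random_state=42)

base_learners = [RandomForestClassifier(random_state=42),
                 SVC(probability=True, random_state=42),
                 LogisticRegression(random_state=42)]
for learner in base_learners:
    learner.fit(X_part1, y_part1)

# The base learners' predictions on part 2 become the blender's training features
blend_features = np.column_stack([learner.predict_proba(X_part2)[:, 1]
                                  for learner in base_learners])
blender = LogisticRegression()
blender.fit(blend_features, y_part2)

# At prediction time: base learners first, then the blender on top
def stacked_predict(X_new):
    features = np.column_stack([learner.predict_proba(X_new)[:, 1]
                                for learner in base_learners])
    return blender.predict(features)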

Multiple Blenders blended
 “Going crazy”: three levels => needs 3 subsets

