
Decision Tree

Mingsheng Long
Tsinghua University

Spring 2024
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
2-class Linear Model Hypothesis Space
Classification: h_w(x) = w · Φ(x)
[Pipeline: dataset → feature engineering (polynomial basis function Φ, discretization) → from the input space X = ℝ^d to the feature space Z = ℝ^D, with discrete label space Y = {0, 1}.]
Issues with raw features: discrete values, different scales.
How can we deal with feature heterogeneity and achieve model interpretability?


Human Classifier
• How do people solve this problem? Do we combine features as in a linear model (LM)?

Feature → Label
- green, round, no leaf, sweet, … → Apple
- red, round, leaf, sweet, … → Apple
- yellow, round, leaf, sour, … → Lemon
- yellow, curved, no leaf, sweet, … → Banana
- green, curved, no leaf, sweet, … → Banana
Human Classifier
• We find the most useful feature first, such as shape: curved or round.
- Split the dataset: if the fruit is curved, then it is a banana.
• After the first split, select the next best feature, until each node contains only one class.
  ① Shape: Curved / Round
  ② Taste: Sweet / Sour
• At last, we have built a decision tree.
Decision Tree
Interpretability ✓   Generalizes to unseen data ✓
• During test:

- Why do you think this fruit is lemon?


- Because it is round and sour.
Compared with LM:
[ISLR Figure 5.7: logistic regression fits on the two-dimensional classification data of Figure 2.13. The Bayes decision boundary is shown as a purple dashed line; the estimated decision boundaries from linear, quadratic, cubic and quartic (degree 1–4) logistic regressions are shown in black. The test error rates of the four fits are 0.201, 0.197, 0.160, and 0.162, while the Bayes error rate is 0.133.]
[Figure: the fruit decision tree — ① Shape: Curved / Round, ② Taste: Sweet / Sour.]
Decision Tree
• When we face many features, it is hard to build the tree by hand.
• Can we teach a machine to find it? What are the difficulties?
- How to find the most useful feature at each node?
- When should we stop growing the tree?
- What if some features are missing or continuous-valued?

Decision Tree
[Example tree: Outlook ∈ {Sunny, Overcast, Rain}; under Sunny, Humidity ∈ {High → No, Normal → Yes}.]
• Each internal node tests an attribute.
• Each branch corresponds to an attribute value.
• Each leaf node assigns a classification.
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
Node Splitting
• Which split is better? Splitting on Shape (Curved / Round) isolates a single class, while splitting on Leaf (Leaf / No Leaf) does not.
• As an example, we visualize two splits in a binary classification problem (y = +1 / y = −1):
- Split 1 (x₁ = 1 vs. x₁ = 0): the children are more 'balanced'.
- Split 2 (x₂ = 1 vs. x₂ = 0): the children are more 'pure' ✓
• Which split is better? The purer one.
How to Measure a Node
• We want pure leaf nodes, as close to a single class as possible:
- They lead to a lower classification error.
- For a node with n₁ and n₂ samples in the two classes (n = n₁ + n₂ in total), the classification error is min(n₁/n, n₂/n).
• We will choose the splitting feature that minimizes an impurity measure.
[Figure: a very impure group (error 13/29) → a less impure group (error 2/14) → minimum impurity (error 0).]
Node Impurity Measures
• Three standard node impurity measures for a node S with classes k = 1, …, K, where S_k ⊂ S is the set of examples of class k:
- Misclassification error:
  Err(S) = 1 − max_{1≤k≤K} |S_k|/|S|
- Entropy (used in ID3 and C4.5):
  H(S) = − Σ_{k=1}^{K} (|S_k|/|S|) log(|S_k|/|S|)   (with the convention 0 log 0 = 0)
- Gini index (used in CART):
  Gini(S) = 1 − Σ_{k=1}^{K} (|S_k|/|S|)²
• Two-class examples (entropy as a function of the proportion p of class 1):
- A group in which all examples belong to the same class: entropy = −1 log₂ 1 = 0 (minimum impurity; not a good training set for learning).
- A group with 50% in either class: entropy = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 (maximum impurity; a good training set for learning).
- Entropy is maximal at p = 0.5 and minimal (= 0) at p = 0 or p = 1. (A small code sketch follows.)
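A minimal Python sketch (my addition, not from the lecture) of the three measures above, computed from a node's class counts; the helper name impurities is illustrative.

import numpy as np

def impurities(counts):
    """counts: per-class example counts |S_k| of a node S."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                      # class proportions |S_k| / |S|
    err = 1.0 - p.max()                  # misclassification error
    nz = p[p > 0]                        # convention: 0 log 0 = 0
    entropy = -(nz * np.log2(nz)).sum()  # entropy (base 2)
    gini = 1.0 - (p ** 2).sum()          # Gini index
    return err, entropy, gini

print(impurities([15, 15]))  # 50/50 node: (0.5, 1.0, 0.5) -- maximum impurity
print(impurities([30, 0]))   # pure node:  (0.0, 0.0, 0.0) -- minimum impurity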
How to Measure a Split
Entire dataset (30 instances): S = S₁ ∪ S₂
Parent entropy:
H(S) = −(14/30) log₂(14/30) − (16/30) log₂(16/30) = 0.996
Child entropies:
S₁ (13 instances): H(S₁) = −(1/13) log₂(1/13) − (12/13) log₂(12/13) = 0.391
S₂ (17 instances): H(S₂) = −(13/17) log₂(13/17) − (4/17) log₂(4/17) = 0.787
• We want to measure this split — how much it reduces entropy.
- How about H(S) − (H(S₁) + H(S₂))/2?
- But the number of instances on each node is different → weighting.
Information Gain (IG)
• A good split gives a minimal weighted average of the child node impurities:
  (|S₁|/|S|) H(S₁) + (|S₂|/|S|) H(S₂)
• Equivalent to maximizing the Information Gain (IG) of the parent node over its child nodes:
  IG = H(S₁ ∪ S₂) − (|S₁|/|S|) H(S₁) − (|S₂|/|S|) H(S₂)
• In information theory, entropy is the expected number of bits needed to encode a randomly drawn value of the class C.
• Larger entropy means less information; a split that reduces entropy gains information — that is why we call it IG.
https://en.wikipedia.org/wiki/Entropy_(information_theory)
[Photo: Claude Shannon]
Calculating Information Gain (IG)
Entire population (30 instances).
Parent entropy: −(14/30) log₂(14/30) − (16/30) log₂(16/30) = 0.996
Child entropy (13 instances): −(1/13) log₂(1/13) − (12/13) log₂(12/13) = 0.391
Child entropy (17 instances): −(13/17) log₂(13/17) − (4/17) log₂(4/17) = 0.787
Information Gain (IG):
0.996 − (13/30) · 0.391 − (17/30) · 0.787 = 0.38
(A short code check follows.)
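A small sketch (not from the slides) reproducing the IG calculation above; entropy is an illustrative helper.

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

parent = [14, 16]               # 30 instances in the parent node
left, right = [1, 12], [13, 4]  # the two child nodes (13 and 17 instances)

n, n_l, n_r = 30, 13, 17
ig = entropy(parent) - (n_l / n) * entropy(left) - (n_r / n) * entropy(right)
print(round(ig, 2))             # 0.38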
Split Search
[Figure: a dataset with four classes; Split 1 (Top vs. Bottom) gives Information Gain = 0.40, Split 2 (Left vs. Right) gives Information Gain = 0.69. How do we compute IG with four classes?]
• Compute IG for all splits induced by every feature.
- If the IGs of all features are small (e.g. < ε): stop.
- Else: split on the feature that maximizes the Information Gain (IG).
ID3 Algorithm
Ross Quinlan — greedy search
• Start
- Create the root node.
- Assign all examples to the root.
• Main Loop
1. A ← the best decision attribute for the next node.
2. For each value of A, create a new descendant of the node.
3. Sort the training examples to the leaf nodes.
4. If the training examples are well classified, then STOP; else iterate over the new leaf nodes. (See the sketch below.)
Time complexity (n examples, d features): O(nd · depth)
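Below is a compact, illustrative ID3-style sketch for categorical features following the greedy loop above; it is a simplification, and the function and variable names (entropy, info_gain, id3) are mine, not the original implementation.

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n, h = len(labels), entropy(labels)
    for value in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        h -= len(idx) / n * entropy([labels[i] for i in idx])
    return h

def id3(rows, labels, attrs):
    if len(set(labels)) == 1 or not attrs:           # stop: pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))  # greedy attribute choice
    tree = {}
    for value in set(r[best] for r in rows):         # one branch per attribute value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attrs if a != best])
    return tree

rows = [{"shape": "curved", "taste": "sweet"},
        {"shape": "round", "taste": "sweet"},
        {"shape": "round", "taste": "sour"}]
print(id3(rows, ["banana", "apple", "lemon"], ["shape", "taste"]))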
ID3 Algorithm
[Figure: the grown tree, with a class label at each leaf.]
• Stop criterion: the examples at each leaf are well classified (a single class label per leaf).
• Work per layer is O(nd); the depth is at most about min(d, log n).
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
ID3 → C4.5
• IG rate: a modification of the splitting criterion.
• Pruning: solves overfitting.
• Missing values, continuous values, attributes with costs: adapt to various features.
Overfitting in Decision Tree
• A decision tree has the ability to fit all the data:
- except noisy samples with the same features but different labels.
• Overfitting also appears in decision trees:
[Figure: a desired fit vs. an overfitted decision boundary.]
Pre-Pruning
• Tree complexity is measured by the number of layers and branches.
- To prevent overfitting: control the number of layers — pruning.
• The simplest way is pre-pruning:
- During training, for every split, use the change of accuracy on a validation set to measure the necessity of this split.
- If the accuracy grows by no more than ε, or even decreases (e.g. before split: 71%, after split: 71%): stop the split.
- Otherwise (e.g. before: 71%, after: 80% ✓): go on training.
• Pre-pruning may cause underfitting:
- a good split may not immediately change the classification error.
Post-Pruning
• Pre-pruning may lead to underfitting.
• A safer idea is post-pruning:
- prune the tree after training.
• From the leaf nodes to the root node:
- Use the change of accuracy on the validation set to measure the necessity of pruning at each node (e.g. a candidate whose removal raises validation accuracy from 54% to 58% is pruned ✓, while one that drops it from 68% to 57% is kept ×).
- Remove the node whose pruning most improves validation set accuracy (e.g. before pruning: 54%, after pruning: 68%).
Tree Complexity Regularization
• The cost complexity criterion with parameter α for a tree T is
  C_α(T) = E(T) + α|T| = Σ_{t=1}^{|T|} N_t H_t(T) + α|T|
         = − Σ_{t=1}^{|T|} Σ_{k=1}^{K} N_{tk} log(N_{tk} / N_t) + α|T|
- |T|: the number of leaf nodes; N_{tk} is the number of examples of class k in leaf node t (N_t in total).
- E(T) is the empirical error of the tree on the training data.
• Cost Complexity Pruning (given a tree T₀ and a parameter α):
- Compute the empirical entropy H_t(T) of each node.
- Recursively shrink from the leaf nodes toward the internal nodes:
  if C_α(T_after) ≤ C_α(T_before), prune the leaves of that node and use their parent as a new leaf. (A scikit-learn sketch follows.)
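As a practical aside (my addition, not from the slides), scikit-learn exposes minimal cost-complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter; note that it penalizes the tree's total weighted leaf impurity, which mirrors but is not literally the entropy-based criterion above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values and the total impurity of each pruned subtree.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)
print(path.impurities)

# Refit with a chosen alpha: larger alpha -> smaller tree (fewer leaves |T|).
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.05).fit(X, y)
print(pruned.get_n_leaves())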
ID3 → C4.5 (recap): IG rate (criterion), pruning (overfitting), missing / continuous values and attributes with costs (various features).
• How can we understand a decision tree? What does it learn?
Understanding Decision Tree
• Hypothesis space of decision trees:
- Decision trees divide the feature space into axis-parallel rectangles.
- Each rectangular region is labeled with a specific label.
- Curse of dimensionality.
[ISLR Figure 8.3. Top Left: a partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: the output of recursive binary splitting on a two-dimensional example (regions R1–R5 from splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4). Bottom Left: the tree corresponding to the partition in the top right panel.]
Understanding Decision Tree
• Decision tree on the Iris data. What phenomena can you find?
[Figure: the Iris samples plotted by Petal.Length (x-axis) against Petal.Width (y-axis).]

## Construct Decision Trees
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
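A small optional follow-up (my addition, not from the slides): printing the fitted tree as text shows the axis-parallel rules it has learned.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)
print(export_text(clf, feature_names=iris.feature_names))  # threshold rules on petal/sepal features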
Nonlinear Feature Extraction
• Suppose a decision tree T gives a partition with regions R₁, …, R_M.
• Then the decision tree can be written as
  h(x) = Σ_{m=1}^{M} c_m 1{x ∈ R_m},   where c_m is the label of region R_m.
• Each region can be viewed as extracting a feature for every sample:
  x → Φ(x) = (1{x ∈ R₁}, …, 1{x ∈ R_M})
- h(x) = c · Φ(x) is a linear classifier on the nonlinear features Φ.
• What is the kernel induced by these nonlinear features?
• Is this a form of feature learning? Can it be extended by layering trees? (A small sketch of the feature map follows.)
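A hedged sketch of this idea (my code): use the leaf reached by each sample as a one-hot feature map Φ(x) and fit a linear model on it; the variable names (dt, phi, linear) are illustrative.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

leaves = dt.apply(X).reshape(-1, 1)                 # index of the region R_m containing each x
phi = OneHotEncoder().fit_transform(leaves)         # Phi(x) = (1{x in R_1}, ..., 1{x in R_M})
linear = LogisticRegression(max_iter=1000).fit(phi, y)  # linear classifier on the tree features
print(phi.shape, linear.score(phi, y))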
Expressivity vs. Interpretability
• Multivariate decision trees: good expressivity but weak interpretability.
• Are expressivity and interpretability in conflict? Stronger expressivity ↔ weaker interpretability?
[Figures. ISLR Figure 5.7: logistic regression fits of degree 1–4 on the two-dimensional classification data of Figure 2.13; the Bayes decision boundary is the purple dashed line; test error rates 0.201, 0.197, 0.160, 0.162, Bayes error rate 0.133. Coefficient paths for the lasso (plotted versus the relative ℓ1 norm of the coefficient vector) and for ridge regression (versus the relative ℓ2 norm). A decision tree as a set of nodes and edges organized hierarchically: each internal node stores a split (test) function applied to the incoming data, and each leaf stores the final answer (predictor).]
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
Regression Tree
• Can we use decision trees in a regression task?
• Recall the visualization of the decision tree prediction landscape:
[ISLR Figure 8.3: the recursive binary partition of the feature space into regions R1–R5, the corresponding tree, and a perspective plot of the prediction surface of that tree.]
• In classification, we assign a discrete class label to each of the regions.
• In regression, we assign a real target value (response) to each region: discrete class label → real target value.
Regression Tree
• Given the partition {R₁, …, R_M}, the final prediction is
  ŷ = h(x) = Σ_{m=1}^{M} c_m 1{x ∈ R_m}
• How do we choose c₁, …, c_M (the coefficients of this linear regressor on the region indicators)?
• For the L2 loss ℓ(ŷ, y) = (ŷ − y)², the best choice is
  ĉ_m = avg(y_i | x_i ∈ R_m)
  The average. Why?
• In a classification tree, we look for nodes of high purity.
• In a regression tree, we look for nodes with low L2 loss around their average target value.
Continuous Variables
• Add a threshold s to split on a continuous feature x_j:
  {x | x_j ≤ s} ∪ {x | x_j > s}
[Figure: samples x⁽¹⁾, …, x⁽⁷⁾ ordered along feature x_j, divided by the threshold s into regions R₁ and R₂.]
• In classification, we find the best feature and split point by the best IG.
• In regression, we want the feature j and splitting threshold s that minimize the L2 loss
  Σᵢ (y_i − avg(y_i | x_i ∈ R))²
  on the left node and the right node of the split.
Node Splitting
• For each splitting variable j and splitting point s:
- Assume the region R is split by (j, s) into R₁ and R₂.
  ĉ₁(j, s) = avg(y_i | x_i ∈ R₁(j, s))
  ĉ₂(j, s) = avg(y_i | x_i ∈ R₂(j, s))
• Find the (j, s) that minimize the loss
  ℓ(j, s) = Σ_{i: x_i ∈ R₁(j,s)} (y_i − ĉ₁(j, s))² + Σ_{i: x_i ∈ R₂(j,s)} (y_i − ĉ₂(j, s))²
  (the L2 loss for the samples in R₁(j, s) plus the L2 loss for the samples in R₂(j, s))
• As in classification, search over all (j, s) and pick the best one (see the sketch below).
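A minimal sketch of this exhaustive search (my code, with an illustrative best_split helper and toy data), assuming the features are stored in a NumPy array.

import numpy as np

def best_split(X, y):
    """Return (feature j, threshold s, loss) minimizing the summed child L2 loss."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):                 # candidate thresholds from the data
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.9])
print(best_split(X, y))   # splits at x_0 <= 3.0, separating the two groups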
CART Algorithm
• We have determined R₁ and R₂ from the root node R.
• Repeat until done (greedy search):
1. Find the best split for the points in R₁.
2. Find the best split for the points in R₂.
• All the previous techniques carry over to regression trees.
[Figure: a regression tree with internal tests x1 ≥ 2, x2 ≥ 2, x2 ≥ 3, x1 ≥ 3 and constant leaf predictions h1 = 0.1, h2 = 0.7, h3 = 0.8, h4 = 0.9, h5 = 0.2 — a constant prediction at each leaf node.]
Decision Tree
• Advantages:
-Inexpensive to construct and fast at classifying new samples.
-Easy to interpret for small-sized trees (caution: unstable).
• Trees make no use of geometry
- A nonmetric method, feature scale irrelevant (vs. SVM or KNN).
- Can simultaneously cope with categorical and continuous variables.
- Can naturally cope with missing values and noisy data (why?)
• Prediction functions are not continuous
- Not so bad for classification, but may be undesirable for regression.

Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
Ensemble Learning

• Ensemble methods train multiple learners and combine them for use.
Random Forest
[ESL, Model Inference and Averaging: the original tree and trees b = 1, …, 11 fit to bootstrap samples of the same dataset (n examples, d features); the bootstrap trees use different splits, e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, ….]
• Inference: take the majority vote over the trees.
• Different trees — why, and how do we get them?
Random Forest
[The same bootstrap trees as on the previous slide.]
• Idea: different samples and different features give different trees.
• Inference: take the majority vote over the trees.
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
Bootstrap Sample
• Bootstrap (自助法): randomly draw datasets with replacement from
the training data, with the same sample size as the original training set

[Figure: from the total dataset, draw Sample 1 → Model I, Sample 2 → Model II, Sample 3 → Model III.]
Bootstrap Sample
• Definition: a bootstrap sample from D_n is a sample of size n drawn with replacement from D_n.
• In a bootstrap sample, some elements of D_n
- will show up multiple times,
- and some will not show up at all. Will this happen?
• Each x_i has probability (1 − 1/n)^n of not being selected.
• Recall from calculus that for large n,
  (1 − 1/n)^n ≈ 1/e ≈ 0.368.
• So we expect ~63.2% of the elements of D_n to show up at least once (see the simulation below).
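A quick simulation (my addition) checking the ~63.2% coverage figure by drawing bootstrap index sets.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200
# Fraction of distinct original points that appear at least once in each bootstrap sample.
coverage = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(trials)]
print(np.mean(coverage))   # approximately 1 - 1/e ~= 0.632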
Bootstrap Method
• Definition: the bootstrap method simulates B independent samples from the data distribution P by taking B bootstrap samples from the dataset D_n.
• Given the original data D_n, compute B bootstrap samples D_n^(1), …, D_n^(B).
• For each bootstrap sample, compute some function:
  φ(D_n^(1)), …, φ(D_n^(B))
- Work with these values as though D_n^(1), …, D_n^(B) were i.i.d. samples from P.
- For example, train a set of (roughly) independent classifiers h_{D^(1)}, …, h_{D^(B)}.
• Amazing fact:
- The values φ(D_n^(1)), …, φ(D_n^(B)) often come out very close to what we would obtain with truly independent samples from P.
Bootstrap Aggregation (Bagging)
• Bagging is a general technique to reduce the variance of an estimator.
• Draw B bootstrap samples D_n^(1), …, D_n^(B) from the original data D_n.
• Let h_{D^(1)}, …, h_{D^(B)} be the prediction functions trained on each sample.
• The bagged prediction function is a combination of these functions:
  h_bag(x) = Combine(h_{D^(1)}(x), …, h_{D^(B)}(x))
• How can we combine them? (A small sketch follows.)
- Binary/multiclass predictions: vote.
- Class probability predictions: average.
- Prediction functions for regression: average.
• Var(h_bag) ≤ Var(h_{D^(1)})   [Leo Breiman]
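A bare-bones bagging sketch (illustrative, not the lecture's code): train B trees on bootstrap samples of the Iris data and combine their class predictions by majority vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
B, n = 25, len(X)

trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample D^(b)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])      # B x n matrix of class predictions
h_bag = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)  # majority vote
print((h_bag == y).mean())                           # training accuracy of the bagged predictor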
Out-of-Bag Validation (OOB)
• Each bagged prediction function is trained on about 63% of the data.
• The remaining ~37% are called out-of-bag (OOB) observations.
• For each training point x_i, let
  S_i = {b : D_n^(b) does not contain x_i}
• The OOB prediction at x_i is
  h_OOB(x_i) = (1 / |S_i|) Σ_{b ∈ S_i} h_{D^(b)}(x_i)
• The error of h_OOB is a good estimate of the test error.
• OOB error is akin to a validation error: although it is computed on the training set, each point is evaluated only by trees that did not see it (see the example below).
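In practice (a sketch, my addition, using scikit-learn's built-in OOB estimate): with oob_score=True, each point is scored only by trees whose bootstrap sample did not contain it.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # OOB accuracy, a proxy for test accuracy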
Outline

• Decision Tree
-ID3 Algorithm

-C4.5 Algorithm
-Regression Tree
• Random Forest

-Bagging
-Breiman Algorithm
Random Forest
• ① Bootstrap samples from the training data (n examples, d features).
• Construct each decision tree with q randomly sampled features for each split (q ≈ √d): different samples, different features for each tree.
[ESL, Model Inference and Averaging: the original tree and the bootstrap trees b = 1, 2, … grown on the bootstrap samples.]
Breiman Algorithm

[Figure: for each tree, draw a bootstrap sample and use randomized features (q ≈ √d) at each split.]
• Bootstrap samples + randomized features.
• By Leo Breiman.
Breiman Algorithm
• A widely-used algorithm in practice.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,   ## The number of trees
    criterion="gini",
    max_depth=2,        ## The maximum depth of each tree
    random_state=0)
clf.fit(X, y)

Bootstrap samples + randomized features.
[ISLR Figure 8.8: bagging and random forest results for the Heart data. The test error (black and orange) is shown as a function of B, the number of bootstrapped training sets used; random forests were applied with m = √p. The dashed line indicates the test error of a single classification tree. The green and blue traces show the OOB error, which in this case is considerably lower.]
[ISLR Figure 8.10: results from random forests on the fifteen-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees; each colored line corresponds to a different value of m ∈ {p, p/2, √p}, the number of predictors available for splitting at each interior tree node. Random forests (m < p) lead to a slight improvement over bagging (m = p).]
Bias-Variance Decomposition
[Figure: expected new-data error Ē_new versus model complexity, decomposed into the irreducible error Var(y|x), the variance Var_D(h_D(x) | x), and the squared bias Bias(h_D(x) | x)²; underfitting on the left, overfitting on the right. Side panels: individual fits, the mean of all fits, the true f(x), and the squared error for model order 1 versus model order 10.]
Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.
Decision Trees Have High Variance
• Input space 𝒳 = ℝ⁵ and output space 𝒴 = {−1, 1}.
• Sample size n = 30.
• Each bootstrap tree is quite different:
- different splitting variables at the root.
• This shows that tree models suffer from high variance.
• Bagging is needed.
[ESL, Model Inference and Averaging: the original tree and trees fit to bootstrap samples b = 1, …, 8.]
Random Forest
• A usual approach is to build very deep trees:
- low training error, high generalization error.
• Diversity in the individual tree prediction functions comes from:
- bootstrap samples,
- randomized tree building.
• Bagging works better when we are combining a diverse set of prediction functions.
• Random forests find a way to achieve both.
[Figure: the statistical view — hypotheses h1, h2, h3, h4 around the target f in the hypothesis space ℋ.]
Thank You

Questions?
Mingsheng Long
[email protected]
http://ise.thss.tsinghua.edu.cn/~mlong
Office hours: Room 413, Section 11, East Main Building (东主楼11区413室)
