M01 Tree-Based Methods


Module 1

Tree-Based Models
Decision trees – classification trees and
regression trees
 CART (Classification And Regression Trees)
 Classification tree: target variable Y is discrete, finitely-many
valued
 Leaf nodes are associated with discrete class labels
 Regression trees: The target variable Y is continuous-valued
 Leaf nodes are associated with numerical, continuous values

 We mainly focus on classification trees


 Ensemble tree-based ML methods:
 Random forest (bagged trees with additional improvement)
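The following is a minimal illustration, not from the slides: a scikit-learn (sk-learn) sketch contrasting a classification tree and a regression tree; the tiny arrays are made up.

# Minimal sketch (assumes scikit-learn is installed); the toy data is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # two binary attributes

# Classification tree: discrete target, leaves hold class labels
y_class = np.array([0, 1, 1, 0])
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y_class)
print(clf.predict([[1, 0]]))                     # -> a class label

# Regression tree: continuous target, leaves hold numeric values
y_reg = np.array([0.1, 0.9, 1.1, 0.2])
reg = DecisionTreeRegressor().fit(X, y_reg)
print(reg.predict([[1, 0]]))                     # -> a numeric value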
Decision tree examples
Decision tree representation:

 Each internal (non-leaf) node tests an attribute


 Each branch corresponds to an attribute value
 Each leaf node assigns a class label

 How would we represent:


 , , XOR

 M of N
When to Consider Decision Trees for classification

 Instances describable by attribute-value pairs


 Target function is discrete valued
 Disjunctive hypothesis may be required
 Possibly noisy training data
 Human interpretation of decision may be important

Examples:

 Medical diagnosis
 Credit risk analysis
 Equipment fault diagnosis
Top-down induction of decision tree algorithm
(ID3)
 Main loop: (given the current Node with dataset S)
If the termination condition at Node is satisfied, then STOP
Else:
 A  the ``best'' decision attribute for splitting the data S at
Node
 Assign A as decision attribute for Node
 For each value of A, create a new descendant of Node
 Sort training examples to the new leaf nodes
 Iterate over new leaf nodes
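As a sketch only (my own rendering of the loop above, not code from the slides), here is one way the main loop could look in Python; examples are assumed to be (attribute-dict, label) pairs and best_attribute stands in for whichever criterion function is used.

from collections import Counter

def id3(examples, attributes, best_attribute):
    """Recursive ID3 sketch. `examples` is a list of (attr_dict, label) pairs;
    `best_attribute(examples, attributes)` returns the attribute to split on."""
    labels = [label for _, label in examples]
    # Termination: pure node, or no attributes left -> leaf with the majority class
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the "best" decision attribute for splitting the data at this node
    A = best_attribute(examples, attributes)
    node = {"attribute": A, "children": {}}
    # For each value of A, create a descendant and sort the examples into it
    for value in {attrs[A] for attrs, _ in examples}:
        subset = [(attrs, lab) for attrs, lab in examples if attrs[A] == value]
        remaining = [a for a in attributes if a != A]
        node["children"][value] = id3(subset, remaining, best_attribute)
    return node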
Criterion function for selecting the “best” attribute

[Diagram: a node holding dataset S is split on attribute A; each value a1, a2, …, ak of A leads to a child node holding the corresponding subset of S.]
Several criterion functions are available:

• Information gain (= mutual information)
• Gini impurity
• Misclassification rate
• Others….

Example: for the PlayTennis data below, with 9 positive and 5 negative examples, Entropy([9+, 5−]) ≈ 0.940.
Entropy – measuring the impurity of a set S
 S is a sample of training examples
 p₊ is the proportion (probability) of positive examples in S
 p₋ is the proportion (probability) of negative examples in S
 Entropy measures the impurity of S:
Entropy(S) = −p₊ log₂ p₊ − p₋ log₂ p₋
 Note the symmetry of the Entropy(S) curve with respect to p₊ = 0.5.
 Namely, for two datasets S₁ and S₂, if p₊(S₁) = p₋(S₂) then Entropy(S₁) = Entropy(S₂)
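A small sketch of the formula above (my own, standard library only); it also shows the symmetry of the curve around p₊ = 0.5.

from math import log2

def entropy(p_pos):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, treating 0*log2(0) as 0."""
    p_neg = 1.0 - p_pos
    return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(9 / 14))             # the [9+, 5-] PlayTennis data: about 0.940
print(entropy(0.3), entropy(0.7))  # symmetry: Entropy(p) == Entropy(1 - p)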
Entropy – measuring how random a random
variable is
 Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a
randomly drawn member of S (under the optimal, shortest-length encoding)
Why?
 Information theory: an optimal-length code assigns −log₂ p bits to encode a message
having probability p
 So, the expected number of bits to encode the class of a random member of S (a value
of the random variable “Class”) is
p₊ (−log₂ p₊) + p₋ (−log₂ p₋) = Entropy(S)

 Note that Entropy(S) = Entropy(Class)

 “Class” is a random variable taking values ⊕ and ⊖, with probabilities p₊ and p₋, respectively
General formula of Entropy(X):
 X is a discrete-valued random variable
For example, X = outlook, Pr(X=Sunny) =0.3, Pr(X=Rain) =
0.25, Pr(X=overcast) = 0.45
 X takes values {x₁, x₂, …, x_k} with probabilities p₁, p₂, …, p_k
 Entropy(X) = −Σᵢ pᵢ log₂ pᵢ
 Entropy(X) is maximal if all the pᵢ are equal, i.e., each pᵢ = 1/k.
In this case, X is very random/uncertain.
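A short sketch (my own) of the general formula, applied to the Outlook example above.

from math import log2

def entropy_dist(probs):
    """Entropy(X) = -sum_i p_i log2 p_i for a discrete distribution."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy_dist([0.3, 0.25, 0.45]))  # Pr(Sunny), Pr(Rain), Pr(Overcast)
print(entropy_dist([1/3, 1/3, 1/3]))    # maximal for equal p_i: log2(3), about 1.585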
Information gain (mutual information)
 Gain(S, A) = expected reduction in entropy due to sorting on attribute A

Gain(S, A) = Entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · Entropy(S_v)

 Note that the first term is the same for all attributes, and the second term is
non-negative ⇒ we just need to select the attribute A minimizing the second term
Information gain (mutual information)

 Gain(S, A) = expected reduction in entropy due to sorting on attribute A

Gain(S, A) = Entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · Entropy(S_v)

 Gain(S, A) here denotes the mutual information between the random
variables “Class” and “A”.

 Gain(S, A) measures the amount of information gained about the class
variable after observing values of the random variable A
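A sketch of this formula in Python (my own illustration): given the class labels of S and the corresponding values of attribute A, it returns Gain(S, A).

from collections import Counter
from math import log2

def entropy_of_labels(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [lab for lab, av in zip(labels, attr_values) if av == v]
        remainder += (len(subset) / n) * entropy_of_labels(subset)
    return entropy_of_labels(labels) - remainder

# Toy check: a perfectly informative attribute vs. an uninformative one
print(information_gain(["+", "+", "-", "-"], ["a", "a", "b", "b"]))  # 1.0
print(information_gain(["+", "+", "-", "-"], ["a", "b", "a", "b"]))  # 0.0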
Training data

Day Outlook Temp Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Selecting the best attribute
A partially learned tree – ID3 is applied at each current leaf node
recursively
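To make the attribute selection concrete, here is a self-contained sketch (my own, using the PlayTennis table above) that computes Gain(S, A) for each attribute at the root; Outlook comes out highest.

from collections import Counter
from math import log2

# PlayTennis training data: (Outlook, Temp, Humidity, Wind, label)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for v in {r[attr_index] for r in rows}:
        sub = [r[-1] for r in rows if r[attr_index] == v]
        remainder += (len(sub) / len(rows)) * entropy(sub)
    return entropy(labels) - remainder

for i, a in enumerate(attributes):
    print(a, round(gain(data, i), 3))   # Outlook has the largest gain (about 0.247)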
Hypothesis space search by ID3
Hypothesis space search by ID3

 Hypothesis space is complete!


 Target function surely in there...
 Outputs a single hypothesis (which one?)
 Can't play 20 questions...

 No backtracking – greedy search guided by the criterion function


 Local minima... (no guarantee of optimal tree) (-)

 Computationally quite efficient (+)


 Statistically-based search choices
 Robust to noisy data...

 Inductive bias: approximately “prefer shortest trees”


Inductive Bias in ID3

Note H is the power set of instances


 Unbiased?
Not really…
 Preference for short trees, and for those with high information gain
attributes near the root
 Bias is a preference for some hypotheses, rather than a restriction
of hypothesis space H
 Occam's razor: prefer the shortest hypothesis that fits the data
Overfitting in decision tree learning
Consider adding noisy training example #15:
< Sunny, Hot, Normal, Strong, PlayTennis=No>
What effect on the entire tree?
If we insist that all leaf nodes be “pure”, we need to grow the tree further to “fit the noise”.
This may cause overfitting and degrade the model's accuracy.
Overfitting definition

Two notions of error: Training error and true error:


 Training error error_train(h): the error of hypothesis h over the training data,
i.e., the fraction of training examples misclassified by h.

 True error error_D(h): the error of hypothesis h over the entire distribution D of data,
where D is an underlying distribution over the instance space X.

 Hypothesis h overfits the training data if there is another hypothesis h′ such
that
 error_train(h) < error_train(h′), but error_D(h) > error_D(h′)
Overfitting in decision tree learning

 After the tree size exceeds about 20, the training accuracy keeps increasing
 while the accuracy on the test data is decreasing
 ⇒ the more complex tree has overfitted the training data
Avoiding Overfitting

How can we avoid overfitting?


 Early stopping:
 stop growing when data split not statistically significant
 limit tree depth
 Grow the full tree, then post-prune ⇐ this is the more common approach!

How to select the “best” tree:


 Measure performance over training data
 Measure performance over separate validation data set
 MDL: minimize size(tree) + size(misclassifications(tree))
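A hedged sk-learn sketch of the early-stopping and validation-set ideas above: limit the tree depth and pick the limit on a held-out validation set. The dataset and the candidate depths are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_acc = None, 0.0
for depth in [1, 2, 3, 5, 8, None]:          # None = grow the full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)           # accuracy on the validation set
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print("chosen max_depth:", best_depth, "validation accuracy:", round(best_acc, 3))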
Reduced-error pruning (used in C4.5)
Split the data into a training set and a validation set
 T ← ID3(training set) // build a decision tree using the training data
 Do until further pruning is harmful (i.e., the validation error would increase):
 For each internal node n of T, build a candidate pruned tree T_n:
 Prune the subtree rooted at node n
 Label the new leaf node n with the most numerous class of the training data
points at node n
 Compute error_val(T_n), the error of the pruned tree T_n on the validation data
 Greedily remove the one node n* whose pruning most improves validation set accuracy
(if error_val(T_{n*}) < error_val(T)):
 T ← T_{n*}

 Produces the smallest version of the most accurate subtree

 What if data is limited?
Effect of reduced error pruning
 Note that as pruning reduces the size of the decision tree, the accuracy on the
validation data (small dashed line) improves, until the pruned tree size reaches about 20
Rule post-pruning

 Convert tree to equivalent set of rules


 Prune each rule independently of others
 Sort final rules into desired sequence for use

Perhaps the most frequently used method (e.g., in C4.5)


Converting a tree to rules

 IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
 IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes
 IF (Outlook = Overcast)
THEN PlayTennis = Yes
 IF (Outlook = Rain) ∧ (Wind = Strong)
THEN PlayTennis = No
 IF (Outlook = Rain) ∧ (Wind = Weak)
THEN PlayTennis = Yes
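In sk-learn, a fitted tree can be printed in a rule-like text form with export_text (this only renders the tree; any per-rule pruning would still be a separate step). A minimal sketch on a built-in dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path in the printout reads as one IF ... THEN ... rule
print(export_text(tree, feature_names=list(iris.feature_names)))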
Extensions: continuous valued attributes
 Discretize in advance and then use ID3
 Create a binary attribute (on the fly) to test continuous variable
values
For example, for variable Temperature, a new attribute
(Temperature > 85 ) takes value true/false

Temperature: 40 48 60 72 80 90
PlayTennis : No No Yes Yes Yes No

The split point should be placed just between class-label transitions to get
better information gain (see the sketch below)
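A small sketch (my own) that lists the candidate thresholds for the Temperature example above: midpoints between adjacent sorted values where the class label changes.

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, labels))
candidates = [(t1 + t2) / 2                   # midpoint between adjacent values
              for (t1, l1), (t2, l2) in zip(pairs, pairs[1:])
              if l1 != l2]                    # only where the class label changes
print(candidates)                             # [54.0, 85.0]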
Other attribute selection criteria
 If an attribute has many values, Gain will tend to select it
Imagine using Date = Jun_1_2020 as an attribute ⇒ the tree cannot
generalize at all
 One approach: use GainRatio instead
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = Entropy(A) = −Σᵢ (|Sᵢ| / |S|) log₂ (|Sᵢ| / |S|)
where Sᵢ is the subset of S for which attribute A has value aᵢ
Other attribute selection criteria

 Gini impurity:
Y: a discrete random variable (e.g., Y = Class)
values(Y) = {y₁, …, y_k} with probabilities p₁, …, p_k, respectively.
Gini-imp(Y) = Σᵢ pᵢ (1 − pᵢ) = 1 − Σᵢ pᵢ²
If we use Gini-impurity(S) to denote Gini-impurity(Class) for the
class variable associated with dataset S:
Reduction_Gini(S, A) = Gini-impurity(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · Gini-impurity(S_v)

 Misclassification rate: Mis(Y) = 1 − maxᵢ pᵢ
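A sketch (my own) of the three impurity measures side by side for a binary class distribution; all of them peak at p = 0.5.

from math import log2

def entropy(probs):
    return sum(-p * log2(p) for p in probs if p > 0)

def gini(probs):
    return 1.0 - sum(p * p for p in probs)    # Gini-imp(Y) = 1 - sum_i p_i^2

def misclassification(probs):
    return 1.0 - max(probs)                   # Mis(Y) = 1 - max_i p_i

for p in (0.1, 0.3, 0.5):
    dist = [p, 1 - p]
    print(p, round(entropy(dist), 3), round(gini(dist), 3), round(misclassification(dist), 3))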


Random forest
 RF is an ensemble learning method
 Building multiple DTs (so the name “forest”)
 Each tree is built from a sampled (with replacement) dataset
from training set
 When selecting a split attribute, use only a random subset of
attributes at the node as candidates (so the name “random
forest”)
 For prediction on a new instance x, use majority vote of the
trees in the ensemble
Ensemble learners
Idea:
 Build multiple models/hypotheses
 Each model capturing some patterns (but may overfit – low
bias, but high variance)
 Combine (average) the predictions of all the models in
predicting the output for a new input
 The combined overall model would not (hopefully) overfit – with
low bias and low variance
 Typically we get such an ensemble through bagging (bootstrap aggregation)
Bagging (Bootstrap Aggregation)
[Diagram: from the original training dataset (N points), sample with replacement to create bootstrap sets Set1–Set5 of N′ points each (typically N′ = N); train Model 1–Model 5, one per set; for a new input instance x, each model makes a prediction and the predictions are combined by average/vote.]
Bagging

 Bagging is an extremely powerful idea based on two things:


 Averaging: reduces variance!
 Bootstrapping: plenty of training datasets!

 Why does averaging reduce variance?


 Averaging a set of observations reduces variance. We will learn
later, in Chapter 5 of Tom Mitchell’s book, that given a set of n
observations of independent random variables each with
variance σ², the variance of their mean is given by σ²/n
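A quick numerical check of this fact (my own sketch; the parameters are made up):

import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 25, 2.0, 200_000

# Draw many samples of n independent observations with variance sigma^2
# and look at the empirical variance of their means.
means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)
print(means.var())       # empirically close to ...
print(sigma**2 / n)      # ... the predicted sigma^2 / n = 0.16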
Bagged tree classifiers

 Generate k different bootstrapped training datasets S₁, …, S_k; each Sᵢ is
generated by sampling with replacement from the (single)
training data set S.
 For each Sᵢ, train a decision tree Tᵢ using dataset Sᵢ
 For prediction:
Take the majority vote of the predictions by the k trees
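A hedged sketch (my own, with made-up data) of exactly this procedure: k bootstrapped datasets, one decision tree per dataset, majority vote at prediction time.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
k = 25

# Train one tree per bootstrapped dataset (sampled with replacement from S)
trees = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample indices
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Prediction: majority vote of the k trees (labels here are 0/1)
votes = np.stack([t.predict(X[:5]) for t in trees])   # shape (k, 5)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority, y[:5])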
Random forest
 It builds on the idea of bagging, but provides an improvement – it also randomizes
the set of candidate attributes
 Decision tree:
 Easy to achieve 0% error rate on training data - may overfit
 Random forest: Bagging of decision tree
 Resampling training data is not sufficient
 Randomly restrict the attributes used in each split

 How does it work?


 Build a number of decision trees on bootstrapped training sample
 When learning a DT, at each split node, use only a random subset of all available
attributes at that node as candidates
 Need to select parameters properly – such as “n_estimators”, “max_depth”, etc.,
when using sk-learn’s implementation (see the sketch below)
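A minimal hedged sketch of sk-learn's RandomForestClassifier with the parameters mentioned above; the dataset is made up for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_depth=None,        # let each tree grow fully (tune to control overfitting)
    max_features="sqrt",   # random subset of attributes considered at each split
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))   # predictions aggregated across the trees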
Summary: Tree-Based Models
 Decision trees:
 Adv: Easy for human comprehension
 ID3 top-down construction is quite efficient O(m*n*|T|)

m: number of training data points


n: number of attributes (features)
|T|: number of internal nodes in tree T
 Statistics-based criterion for node split – quite robust to noisy
data
 Overfitting handled by pruning
Random Forest
 Random forest
 Ensemble method, multiple trees
 Similar to bagged trees: bootstrap samples
 New improvement: randomized candidate attribute subset at
each internal node
 Help handle overfitting
