Module 3 Chap 3: Decision Tree Learning
INTRODUCTION
• Decision tree learning is one of the most widely used methods for inductive
inference
• It is used to approximate discrete-valued functions and is robust to noisy
data
• It is capable of learning disjunctive expressions
• Disjunctive Expressions – (A ∧ B ∧ C) ∨ (D ∧ E ∧ F)
CONSIDER THE DATASET
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
DECISION TREE REPRESENTATION
REPRESENTATION
• Internal node = attribute test
• Branch = attribute value
• Leaf node = classification
DECISION TREE REPRESENTATION -
CLASSIFICATION
• An example is classified by sorting it down the tree from the root to a leaf
node, testing the attribute at each internal node and following the branch that
matches the example's value; the leaf node gives the classification (a small
Python sketch follows)
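As a minimal illustration (not from the slides), the PlayTennis tree derived later in this chapter can be written as nested dictionaries and an example classified by walking it from the root; the names play_tennis_tree and classify are illustrative:

```python
# A minimal sketch: inner dicts map an attribute to {value: subtree};
# plain strings are leaf labels (classifications).
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Sort an example down the tree from the root to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][example[attribute]]   # follow the branch for its value
    return tree                                      # leaf node = classification

# Example D12 (Overcast, Mild, High, Strong) is classified Yes.
print(classify(play_tennis_tree, {"Outlook": "Overcast", "Temperature": "Mild",
                                  "Humidity": "High", "Wind": "Strong"}))
```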
DECISION TREE REPRESENTATION - EXAMPLE
APPROPRIATE PROBLEMS FOR DECISION
TREE LEARNING
DECISION TREE ALGORITHMS
BASIC ID3 LEARNING ALGORITHM APPROACH
• Top-down construction of the tree, beginning with the question "which
attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine
how well it alone classifies the training examples.
• The best attribute is selected and used as the test at the root node of the
tree.
• A descendant of the root node is then created for each possible value of
this attribute.
• The training examples are sorted to the appropriate descendant node
• The entire process is then repeated at each descendant node, using only the
training examples associated with that node
• GREEDY approach – ID3 commits to the attribute that looks best at each node
• No backtracking – so the search may return a suboptimal (locally optimal)
tree; a Python sketch of this recursive procedure is given under ID3 ALGORITHM
below
ID3 ALGORITHM
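A minimal Python sketch of this greedy, top-down procedure (the function names id3, information_gain, etc. are illustrative, not taken from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Expected reduction in entropy from partitioning rows on attr."""
    base = entropy([r[target] for r in rows])
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        base -= len(subset) / len(rows) * entropy(subset)
    return base

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples share one class -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: attribute with the highest information gain (no backtracking).
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

With the 14 training examples encoded as a list of dicts (one key per column of the table above), calling id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis") should reproduce the tree derived in the rest of this chapter.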
TOP-DOWN INDUCTION OF DECISION TREES
WHICH ATTRIBUTE IS THE BEST CLASSIFIER?
• ID3 uses a statistical property called information gain, which measures how
well a given attribute separates the training examples according to their
target classification; information gain is defined in terms of entropy
ENTROPY - EXAMPLE
• Entropy(S) = -p+ log2(p+) - p- log2(p-), where p+ and p- are the proportions
of positive and negative examples in S
• If all examples are positive (p+ = 1, p- = 0):
Entropy = -1 · log2(1) - 0 · log2(0) = 0
• If there are equal numbers of positive and negative examples:
Entropy = 1 (checked numerically below)
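A quick numeric check of these boundary cases, and of the 9-versus-5 split used later, as a small sketch using Python's math.log2:

```python
from math import log2

def entropy(p_pos, p_neg):
    """Entropy of a boolean-classified sample, treating 0 * log2(0) as 0."""
    return 0.0 - sum(p * log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(1.0, 0.0))    # all examples positive      -> 0.0
print(entropy(0.5, 0.5))    # equal positives/negatives  -> 1.0
print(entropy(9/14, 5/14))  # the 14 PlayTennis examples -> ~0.940
```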
INFORMATION GAIN
• Information gain measures the expected reduction in entropy caused by
partitioning the examples according to an attribute:
Gain(S, A) = Entropy(S) - Σ over v ∈ Values(A) of (|Sv| / |S|) · Entropy(Sv)
• Example: S contains 64 examples with Entropy(S) = 0.994
• Gain(S, A1) = 0.994 - (26/64) · 0.706 - (38/64) · 0.742 = 0.266
• Gain(S, A2) = 0.994 - (51/64) · 0.937 - (13/64) · 0.619 = 0.121
• The information gained by partitioning along A1 (0.266) is larger than along
A2 (0.121), so A1 is the better classifier (see the arithmetic check below)
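Plugging the slide's numbers into the Gain formula as a small sketch (the subset sizes and entropies are taken from the example above):

```python
# Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over the values v of A.
entropy_S = 0.994  # 64 training examples in S

gain_A1 = entropy_S - (26 / 64) * 0.706 - (38 / 64) * 0.742
gain_A2 = entropy_S - (51 / 64) * 0.937 - (13 / 64) * 0.619

# Prints roughly 0.267 and 0.122; the slide's 0.266 and 0.121 differ only in the
# last digit because the subset entropies above are already rounded.
print(gain_A1, gain_A2)
```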
An Illustrative Example
DECISION TREE LEARNING
■ Let’s Try an Example!
■ Let
– E([X+,Y-]) represent a sample containing X positive and Y negative training
examples.
■ Therefore the entropy of the training data, E(S), can be written as
E([9+,5-]), because of the 14 training examples 9 are yes and 5 are no:
– E([9+,5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ The information gain for each attribute is:
– Gain(S, Outlook) = 0.246
– Gain(S, Humidity) = 0.151
– Gain(S, Wind) = 0.048
– Gain(S, Temperature) = 0.029
■ Outlook has the highest information gain, so it is selected as the decision
attribute for the root node (a verification sketch follows)
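A small verification sketch for these numbers, recomputing each attribute's gain over the 14 examples (DATA and the helper names are illustrative):

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14, from the table above.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
COLS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    col = COLS[attr]
    g = entropy([r[-1] for r in rows])              # Entropy(S)
    for v in set(r[col] for r in rows):             # subtract weighted subset entropies
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for a in COLS:
    print(a, round(gain(DATA, a), 3))
# Roughly: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048.
# The slide's 0.246 and 0.151 round Entropy(S) to 0.940 first; either way
# Outlook has by far the largest gain and becomes the root.
```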
AN ILLUSTRATIVE EXAMPLE
ROOT NODE
• Outlook is placed at the root; a branch is created for each of its values
(Sunny, Overcast, Rain) and the training examples are sorted down the
corresponding branch
• The Overcast branch contains only positive examples and becomes a Yes leaf;
the Sunny and Rain branches still contain a mix of examples and must be
expanded further
AN ILLUSTRATIVE EXAMPLE
Day Outlook Temp. Humidity Wind Decision
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
INTERMEDIATE NODE COMPUTATION
• Ssunny = {D1, D2, D8, D9, D11}
• Entropy(Ssunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.970
• Gain(Ssunny, Humidity)
= 0.970 - (3/5) · 0.0 - (2/5) · 0.0
= 0.970
(both Humidity subsets, [0+,3-] and [2+,0-], are pure, so their entropies are 0)
• Gain(Ssunny, Temperature)
= 0.970 - (2/5) · 0.0 - (2/5) · 1.0 - (1/5) · 0.0
= 0.570
• Gain(Ssunny, Wind)
= 0.970 - (2/5) · 1.0 - (3/5) · 0.918
= 0.019
• Humidity gives the largest gain, so it is tested at the Sunny branch
(a verification sketch follows)
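A verification sketch for the Sunny branch, mirroring the root-node sketch above (names are illustrative):

```python
from collections import Counter
from math import log2

# The Sunny subset Ssunny = {D1, D2, D8, D9, D11}: (Temperature, Humidity, Wind, PlayTennis).
SUNNY = [
    ("Hot", "High", "Weak", "No"),    ("Hot", "High", "Strong", "No"),
    ("Mild", "High", "Weak", "No"),   ("Cool", "Normal", "Weak", "Yes"),
    ("Mild", "Normal", "Strong", "Yes"),
]
COLS = {"Temperature": 0, "Humidity": 1, "Wind": 2}

def entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    col = COLS[attr]
    g = entropy([r[-1] for r in rows])
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for a in COLS:
    print(a, round(gain(SUNNY, a), 3))
# Roughly: Humidity 0.971, Temperature 0.571, Wind 0.02 (the slide's .970/.570/.019
# round Entropy(Ssunny) to 0.970), so Humidity is tested below the Sunny branch.
```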
FINAL DECISION TREE:
DECISION FOR LEFT BRANCH (HUMIDITY)
• Under Outlook = Sunny, Humidity = High covers D1, D2, D8 (all No) and
Humidity = Normal covers D9, D11 (all Yes), so both branches become leaves:
High ⇒ No, Normal ⇒ Yes
DECISION FOR RIGHT BRANCH (WIND)
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
• Under Outlook = Rain, Wind = Weak covers D4, D5, D10 (all Yes) and
Wind = Strong covers D6, D14 (all No), so the branches become leaves:
Weak ⇒ Yes, Strong ⇒ No
FINAL DECISION TREE:
• Outlook = Sunny → test Humidity: High ⇒ No, Normal ⇒ Yes
• Outlook = Overcast ⇒ Yes
• Outlook = Rain → test Wind: Strong ⇒ No, Weak ⇒ Yes
HYPOTHESIS SPACE SEARCH IN DECISION
TREE LEARNING
ID3 – Capabilities and Limitations
• ID3’s hypothesis space (the set of all possible decision trees) is a complete
space of the finite discrete-valued functions, relative to the available
attributes
• ID3 maintains a single current hypothesis as it searches
through the space of decision trees
• ID3 in its pure form performs no backtracking in its
search
• ID3 uses all training examples at each step in the search
to make statistically based decisions
INDUCTIVE BIAS IN ID3
• Approximate inductive bias of ID3: shorter trees are preferred over larger
trees, and trees that place attributes with high information gain closest to
the root are preferred over those that do not
• This is a preference (search) bias, arising from ID3’s search strategy,
rather than a restriction on the hypothesis space
OCCAM’S RAZOR – WHY PREFER SHORTER
HYPOTHESIS
• William of Occam was a 14th-century English logician
• The principle states: "All other things being equal, the simplest solution
is the best"
OCCAM’S RAZOR
• Occam’s Razor: Prefer the simplest hypothesis that fits the
data
• Argument in favor:
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence
• Argument opposed:
• There are many ways to define small sets of hypotheses, so "simple" is not
uniquely determined
• Two learners that perceive the same training examples in terms of different
internal representations may arrive at two different hypotheses, each
appearing "simpler" in its own representation
ISSUES IN DECISION TREE LEARNING
• Overfitting
• Incorporating Continuous-valued attributes
• Alternative measures for selecting attributes
• Handling attributes with costs
• Handling examples with missing attribute values
OVERFITTING
• Consider a hypothesis h and its error over
• the training data: error_train(h)
• the entire distribution D of instances: error_D(h)
• Hypothesis h overfits the training data if there is an alternative hypothesis
h' such that error_train(h) < error_train(h') but error_D(h') < error_D(h)
OVERFITTING IN DECISION TREE LEARNING
AVOIDING OVERFITTING
• Two broad approaches: stop growing the tree before it perfectly classifies
the training data, or grow the full tree and then post-prune it
• Either way, a criterion is needed for determining the correct final tree
size; most commonly, performance is measured on a separate validation set of
examples withheld from training
REDUCED-ERROR PRUNING
• Consider each decision node as a candidate for pruning; pruning a node means
removing the subtree rooted at it and turning the node into a leaf labelled
with the most common classification of the training examples associated with it
• A node is pruned only if the resulting tree performs no worse than the
original over a separate validation set; nodes are pruned iteratively until
further pruning reduces validation accuracy (a rough sketch follows)
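A rough sketch of this procedure, assuming a nested-node representation that stores each node's majority class; all names, and the (example, label) validation format, are illustrative, not from the slides:

```python
import copy

# A node is either a leaf label (str) or a dict:
#   {"attr": attribute name, "branches": {value: subtree},
#    "majority": most common class among the training examples reaching the node}.

def classify(node, example):
    while isinstance(node, dict):
        node = node["branches"].get(example[node["attr"]], node["majority"])
    return node

def accuracy(tree, validation):
    """validation is a non-empty list of (example_dict, label) pairs."""
    return sum(classify(tree, ex) == label for ex, label in validation) / len(validation)

def internal_nodes(node, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def prune_at(tree, path):
    """Return a copy of tree with the node at `path` replaced by its majority leaf."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return tree

def reduced_error_prune(tree, validation):
    """Greedily prune nodes while validation accuracy does not drop."""
    best, best_acc = tree, accuracy(tree, validation)
    improved = True
    while improved:
        improved = False
        for path in list(internal_nodes(best)):
            candidate = prune_at(best, path)
            acc = accuracy(candidate, validation)
            if acc >= best_acc:          # pruning does not hurt -> keep the smaller tree
                best, best_acc, improved = candidate, acc, True
                break
    return best
```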
EFFECT OF REDUCED-ERROR PRUNING
RULE POST-PRUNING
• The major drawback of reduced-error pruning is that when data is limited,
withholding a validation set further reduces the number of examples available
for training
• Rule post-pruning avoids this: grow the full tree, convert it into one rule
per root-to-leaf path, prune each rule by removing any precondition whose
removal does not hurt the rule's estimated accuracy, and finally sort the
pruned rules by estimated accuracy and apply them in that order when
classifying new instances
CONTINUOUS-VALUED ATTRIBUTES
• For a continuous attribute A, ID3 can define new boolean attributes of the
form A > c; candidate thresholds c are the midpoints between adjacent sorted
values where the classification changes
• For the Temperature attribute this gives the candidate thresholds
(48 + 60)/2 = 54 and (80 + 90)/2 = 85, which are then evaluated by information
gain like any other attribute
• So if Temperature = 75 (between the two thresholds), we can infer that
PlayTennis = Yes (a sketch of the threshold computation follows)
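A sketch of where these thresholds come from, assuming Mitchell's usual Temperature sequence (values 40, 48, 60, 72, 80, 90 with labels No, No, Yes, Yes, Yes, No), which is consistent with the two midpoints on the slide:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the classification changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps  = [40, 48, 60, 72, 80, 90]                       # assumed example values
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))              # [54.0, 85.0]
# Each candidate (Temperature > 54, Temperature > 85) is then scored by
# information gain like any other boolean attribute; Temperature = 75 falls
# between the two thresholds, in the region where the examples are Yes.
```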
ALTERNATIVE MEASURES FOR SELECTING
ATTRIBUTES
• Problem: if an attribute has many distinct values, Gain will tend to select
it, even though it may fit the training data perfectly yet generalize poorly
• Example: using a Date attribute, which has a unique value for every training
example
• One approach is the Gain Ratio, which divides the gain by the attribute's
split information:
SplitInformation(S, A) = -Σ (|Si|/|S|) log2(|Si|/|S|), over the values of A
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
• A short sketch of the computation is given below
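A short sketch of the split information and gain ratio computation (helper names are illustrative):

```python
from collections import Counter
from math import log2

def split_information(values):
    """SplitInformation(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|)) over A's values."""
    n = len(values)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(values)

# A Date-like attribute that is unique for every example has a large split
# information (log2 of the number of examples), which drives its ratio down:
outlook_values = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
date_values    = [f"D{i}" for i in range(1, 15)]
print(split_information(outlook_values))   # about 1.58
print(split_information(date_values))      # log2(14), about 3.81
```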
ATTRIBUTES WITH COSTS
• Problem:
• Medical diagnosis, BloodTest has cost $150
• Robotics, Width_from_1ft has cost 23 sec
• One approach: replace Gain with a measure that also accounts for the
attribute's cost, for example
• Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
• Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w
• where w ∈ [0, 1] is a constant that determines the relative importance of
cost versus information gain (both measures are sketched below)
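Both measures can be written directly as small functions; the formulas follow Mitchell's presentation of Tan and Schlimmer (1990) and Nunez (1988), and the example gains and costs are illustrative:

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Example: an attribute with Gain 0.25; a cheap test (cost 1) vs an expensive one (cost 150).
print(tan_schlimmer(0.25, 1), tan_schlimmer(0.25, 150))
print(nunez(0.25, 1, w=0.5), nunez(0.25, 150, w=0.5))
```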