Module 3 Chap 3 Decision Tree Learning


DECISION TREE LEARNING

MODULE - 2

1
INTRODUCTION

• Decision tree learning is one of the most widely used methods for inductive
inference
• It approximates discrete-valued functions and is robust to noisy data
• It is capable of learning disjunctive expressions

• The learned function is represented by a decision tree.

• Disjunctive expression example – (A ∧ B ∧ C) ∨ (D ∧ E ∧ F)

2
CONSIDER THE DATASET
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

3
DECISION TREE REPRESENTATION

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

PlayTennis: This decision tree classifies Saturday mornings
according to whether or not they are suitable for playing tennis

4
REPRESENTATION

 internal node = attribute test
 branch = attribute value
 leaf node = classification

5
DECISION TREE REPRESENTATION -
CLASSIFICATION

• An example is classified by sorting it through the tree from the root to a leaf node

• Example – (Outlook = Sunny, Humidity = High) => (PlayTennis = No)

PlayTennis: This decision tree classifies Saturday mornings
according to whether or not they are suitable for playing tennis

6
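Since the tree figure is not reproduced in this text export, the following minimal Python sketch encodes the standard PlayTennis tree (as derived later in this module) as a nested dict and classifies an example by walking from the root to a leaf. The dict layout and the name `classify` are illustrative choices for this sketch, not part of the original slides.

```python
# Minimal sketch: classify an example by sorting it through a decision tree.
# The nested-dict layout below is an illustrative choice; the tree itself is the
# PlayTennis tree derived later in this module.

play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, example):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    node = tree
    while isinstance(node, dict):          # internal node: test an attribute
        value = example[node["attribute"]]
        node = node["branches"][value]     # follow the branch for that value
    return node                            # leaf node: the classification

# Example from the slide: (Outlook = Sunny, Humidity = High) => PlayTennis = No
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High"}))  # -> "No"
```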
DECISION TREE REPRESENTATION

• In general, decision trees represent a disjunction of conjunctions
of constraints on the attribute values of instances

• Example – the PlayTennis tree corresponds to
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
7
APPROPRIATE PROBLEMS FOR DECISION
TREE LEARNING

• Instances are represented by attribute-value pairs
• The target function has discrete output values
• Disjunctive hypotheses may be required
• The training data may be noisy
• The training data may contain errors
• The training data may contain missing attribute values
• Examples – classification problems such as:
1. Equipment or medical diagnosis
2. Credit risk analysis

8
DECISION TREE ALGORITHMS

• Decision tree algorithms employ a top-down, greedy
search through the space of possible decision trees

• ID3 (Iterative Dichotomizer 3) by Quinlan (1986) and
C4.5 by Quinlan (1993) use this approach

• Most algorithms that have been developed for
learning decision trees are variations on this core
algorithm

9
BASIC ID3 LEARNING ALGORITHM APPROACH
• Top-down construction of the tree, beginning with the question "Which
attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine
how well it alone classifies the training examples.
• The best attribute is selected and used as the test at the root node of the
tree.
• A descendant of the root node is then created for each possible value of
this attribute.
• The training examples are sorted to the appropriate descendant node.
• The entire process is then repeated at each descendant node using the
training examples associated with that node.
• GREEDY approach
• No backtracking, so we may end up with a suboptimal solution.
10
ID3 ALGORITHM

11
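The original slide shows the ID3 pseudocode as an image that did not survive the text export. Below is a minimal, illustrative Python sketch of the approach described on the previous slide (greedy, top-down, no backtracking); the helper names and the dict-based tree representation are assumptions made for this sketch, not taken from the slides. Entropy and information gain are defined formally a few slides later.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (0 log 0 is treated as 0)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from partitioning the examples on `attribute`."""
    labels = [e[target] for e in examples]
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes, target):
    """Greedy top-down tree construction; returns a leaf label or a nested dict."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples have the same class
        return labels[0]
    if not attributes:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {"attribute": best, "branches": {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = id3(subset, remaining, target)
    return tree
```

Run on the 14-example PlayTennis table above, this sketch would place Outlook at the root, matching the hand computation worked out in the following slides.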
TOP-DOWN INDUCTION OF DECISION TREES

1. Find A, the best decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are classified perfectly, STOP; else
iterate over the new leaf nodes

12
WHICH ATTRIBUTE IS THE BEST CLASSIFIER?

• Information Gain – a statistical property that measures
how well a given attribute separates the training
examples according to their target classification.
• This measure is used to select among the candidate
attributes at each step while growing the tree.
13
ENTROPY
• Entropy (E) is the expected number of bits needed to encode the class (yes or no) of a
randomly drawn member of S
• Entropy is commonly used in information theory. It characterizes the (im)purity of an
arbitrary collection of examples.
• S is a sample of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Then the entropy measures the impurity of S:

  Entropy(S) = -p⊕ log2(p⊕) - p⊖ log2(p⊖)

• But if the target attribute can take c different values:

  Entropy(S) = Σ (i = 1 to c) -pi log2(pi)
14
ENTROPY - EXAMPLE

• Entropy([29+, 35-]) = - (29/64) log2(29/64) - (35/64) log2(35/64) = 0.994

• Note: 0 log 0 is defined to be 0


15
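As a quick check on this arithmetic, the short snippet below (an illustrative sketch, not part of the slides) computes the same value, treating 0 log 0 as 0.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # 0 log 0 is defined to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(29, 35), 3))   # 0.994, as on the slide
print(round(entropy(9, 5), 2))     # 0.94, the PlayTennis training set used on later slides
```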
ENTROPY

• For a Boolean classification, the entropy varies between 0 and 1.

• If all the members belong to the same class, the entropy is 0.
In this case p⊕ = 1 and p⊖ = 0:
  Entropy(S) = -1 * log2(1) - 0 * log2(0) = -1 * 0 - 0 = 0
• If there are equal numbers of positive and negative examples,
the entropy is 1

16
INFORMATION GAIN
• Information gain measures the expected reduction in entropy

• It measures the effectiveness of an attribute in classifying the
training data
• Gain(S,A) = expected reduction in entropy caused by
partitioning the examples according to attribute A:

  Gain(S,A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) * Entropy(Sv)

• Here Values(A) is the set of all possible values for attribute A, and Sv
is the subset of S for which attribute A has value v
17
INFORMATION GAIN - EXAMPLE

For S = [29+, 35-], E = 0.994. Attribute A1 splits S into subsets of sizes 26 and 38
with entropies 0.706 and 0.742; attribute A2 splits S into subsets of sizes 51 and 13
with entropies 0.937 and 0.619.

• Gain(S,A1) = 0.994 – (26/64)*0.706 – (38/64)*0.742 = 0.266
• Gain(S,A2) = 0.994 – (51/64)*0.937 – (13/64)*0.619 = 0.121

• Information gained by partitioning along attribute A1 is 0.266,
and along attribute A2 it is 0.121, so A1 is the better classifier here.
18
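The two gains can be rechecked directly from the subset sizes and (rounded) entropies printed on the slide; the snippet below is just that arithmetic and is not part of the original material.

```python
def expected_entropy_reduction(parent_entropy, weighted_children):
    """Gain = parent entropy minus the size-weighted sum of child entropies."""
    return parent_entropy - sum(w * e for w, e in weighted_children)

# Plugging in the (rounded) values shown on the slide:
gain_a1 = expected_entropy_reduction(0.994, [(26/64, 0.706), (38/64, 0.742)])
gain_a2 = expected_entropy_reduction(0.994, [(51/64, 0.937), (13/64, 0.619)])
print(f"{gain_a1:.3f}")  # 0.267 (the slide's 0.266 uses unrounded subset entropies)
print(f"{gain_a2:.3f}")  # 0.122 (the slide's 0.121 uses unrounded subset entropies)
```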
An Illustrative Example

19
DECISION TREE LEARNING
■ Let’s try an example!
■ Let
– E([X+,Y-]) represent the entropy of a collection with X positive training elements
and Y negative elements.
■ Therefore the entropy of the training data, E(S), can be
represented as E([9+,5-]), because, of the 14 training examples,
9 of them are yes and 5 of them are no.

20
DECISION TREE LEARNING:
A SIMPLE EXAMPLE

■ Let’s start off by calculating the entropy of the training set.

■ E(S) = E([9+,5-]) = (-9/14 log2 9/14) + (-5/14 log2 5/14) = 0.94

■ Next we will need to calculate the information gain G(S,A) for
each attribute A, where A is taken from the set {Outlook,
Temperature, Humidity, Wind}.

21
DECISION TREE LEARNING:
A SIMPLE EXAMPLE
■ The information gain for Outlook is:

– Gain(S,Outlook) = E(S) – [5/14 * E(Outlook=sunny) + 4/14 *
E(Outlook=overcast) + 5/14 * E(Outlook=rain)]

– Gain(S,Outlook) = E([9+,5-]) – [5/14*E([2+,3-]) + 4/14*E([4+,0-]) +
5/14*E([3+,2-])]

  where E([2+,3-]) = (-2/5 log2(2/5)) + (-3/5 log2(3/5)) = 0.971

– Gain(S,Outlook) = 0.94 – [5/14*0.971 + 4/14*0.0 + 5/14*0.971]

– Gain(S,Outlook) = 0.246
22
DECISION TREE LEARNING:
A SIMPLE EXAMPLE

■ Gain(S,Temperature) = 0.94 – [4/14*E(Temperature=hot) +
6/14*E(Temperature=mild) + 4/14*E(Temperature=cool)]
■ Gain(S,Temperature) = 0.94 – [4/14*E([2+,2-]) + 6/14*E([4+,2-]) +
4/14*E([3+,1-])]
■ Gain(S,Temperature) = 0.94 – [4/14*1.0 + 6/14*0.918 + 4/14*0.811]
■ Gain(S,Temperature) = 0.029

23
DECISION TREE LEARNING:
A SIMPLE EXAMPLE

■ Gain(S,Humidity) = 0.94 – [7/14*E(Humidity=high) +
7/14*E(Humidity=normal)]
■ Gain(S,Humidity) = 0.94 – [7/14*E([3+,4-]) + 7/14*E([6+,1-])]

■ Gain(S,Humidity) = 0.94 – [7/14*0.985 + 7/14*0.592]

■ Gain(S,Humidity) = 0.1515

24
DECISION TREE LEARNING:
A SIMPLE EXAMPLE

■ G(S,Wind) = 0.94 – [8/14*E([6+,2-]) + 6/14*E([3+,3-])]

■ G(S,Wind) = 0.94 – [8/14*0.811 + 6/14*1.00]

■ G(S,Wind) = 0.048

25
AN ILLUSTRATIVE EXAMPLE

• Gain(S, Outlook) = 0.246


• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
• Since Outlook attribute provides the
best prediction of the target attribute,
PlayTennis, it is selected as the
decision attribute for the root node, and
branches are created with its possible
values (i.e., Sunny, Overcast, and
Rain).

26
ROOT NODE

27
AN ILLUSTRATIVE EXAMPLE

For Overcast – all examples are Yes, so the decision class can be obtained directly

Day Outlook Temp. Humidity Wind Decision

3 Overcast Hot High Weak Yes

7 Overcast Cool Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

28
AN ILLUSTRATIVE EXAMPLE
For Sunny – the decision class cannot be obtained (mixed Yes and No):

Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

For Rain – the decision class cannot be obtained (mixed Yes and No):

Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No


29
AN ILLUSTRATIVE EXAMPLE

30
INTERMEDIATE NODE COMPUTATION
Ssunny = {D1, D2, D8, D9, D11}
Entropy(Ssunny) = (-2/5 log2 2/5) + (-3/5 log2 3/5) = 0.970
• Gain(Ssunny, Humidity)
■ = 0.970 - (3/5)*E([0+,3-]) - (2/5)*E([2+,0-])
■ = 0.970 - (3/5)*0.0 - (2/5)*0.0
■ = 0.970
• Gain(Ssunny, Temperature)
■ = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0
■ = 0.570
• Gain(Ssunny, Wind)
■ = 0.970 - (2/5)*1.0 - (3/5)*0.918
■ = 0.019
31
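To tie the hand calculations on the preceding slides together, the self-contained sketch below (illustrative code, not from the slides) recomputes the four top-level gains and the gains for the Sunny subset directly from the 14-example table. The small last-digit differences relative to the slides come from the slides rounding intermediate entropies.

```python
import math
from collections import Counter

# The 14 PlayTennis examples: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr):
    """Gain(S,A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    i = ATTRS[attr]
    remainder = 0.0
    for v in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

for a in ATTRS:                               # top-level gains (slide 26)
    print(a, round(gain(DATA, a), 3))
# prints Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (slides report 0.246, 0.029, 0.151, 0.048 using rounded intermediate entropies)

sunny = [r for r in DATA if r[0] == "Sunny"]  # Ssunny = {D1, D2, D8, D9, D11} (slide 31)
for a in ("Humidity", "Temperature", "Wind"):
    print(a, round(gain(sunny, a), 3))
# prints Humidity 0.971, Temperature 0.571, Wind 0.02
# (slides report 0.970, 0.570, 0.019)
```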
INTERMEDIATE NODE COMPUTATION

For the rightmost branch: Rain

• Gain (SRain, Temperature) = 0.019

• Gain (SRain, Humidity) = 0.019

• Gain (SRain, Wind)= 0.97

32
FINAL DECISION TREE :

(Outlook at the root; Sunny => test Humidity: High => No, Normal => Yes;
Overcast => Yes; Rain => test Wind: Strong => No, Weak => Yes)

33
DECISION FOR LEFT BRANCH (HUMIDITY)

Humidity = High:
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No

Humidity = Normal:
Day Outlook Temp. Humidity Wind Decision
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

34
DECISION FOR RIGHT BRANCH (WIND)
Wind = Weak:
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes

Wind = Strong:
Day Outlook Temp. Humidity Wind Decision
6 Rain Cool Normal Strong No
14 Rain Mild High Strong No

35
FINAL DECISION TREE :

Test Instance: (Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong) = ?
36
EXAMPLE 2:

37
DECISION TREE LEARNING

38
HYPOTHESIS SPACE SEARCH IN DECISION
TREE LEARNING
ID3 – Capabilities and Limitations
• ID3’s hypothesis space of all decision trees is a complete space
of finite discrete-valued functions
• ID3 maintains a single current hypothesis as it searches
through the space of decision trees
• ID3 in its pure form performs no backtracking in its
search
• ID3 uses all training examples at each step in the search
to make statistically based decisions
45
INDUCTIVE BIAS IN ID3

• Inductive bias is the set of assumptions that, along with
the training data, justify the classifications assigned by the
learner to future instances.
• Given a collection of training examples, there are typically several
decision trees consistent with these examples
• Inductive bias of ID3: which of these decision trees does ID3 choose?
INDUCTIVE BIAS IN ID3

Approximate inductive bias of ID3

• ID3 prefers shorter trees over larger trees
• Trees that place high-information-gain attributes closer to
the root are preferred over those that do not.

47
INDUCTIVE BIAS IN ID3

Restriction Biases and Preference Biases

• There is an interesting difference between the types of inductive
bias exhibited by ID3 and by the Candidate-Elimination
algorithm:
• ID3 searches a complete hypothesis space
incompletely
• Candidate-Elimination searches an incomplete hypothesis space
completely

48
INDUCTIVE BIAS IN ID3

• ID3 – preference bias
• Candidate-Elimination – restriction bias

Which type of inductive bias is preferable?

• A preference bias is generally more desirable, since the learner works
within a complete hypothesis space that is sure to contain the target function
• A combination of the two is also possible

49
OCCAM’S RAZOR – WHY PREFER SHORTER
HYPOTHESIS
• William of Occam was a 14th-century English logician

• The principle states that, “All other things being equal, the
simplest solution is the best”

• When multiple competing theories are equal in other
respects, the principle recommends selecting the theory
that introduces the fewest assumptions.

50
OCCAM’S RAZOR
• Occam’s Razor: Prefer the simplest hypothesis that fits the
data
• Argument in favor –
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence

• Argument opposed –
• There are many ways to define small sets of hypotheses
• Two learners that perceive the same training examples in terms of
different internal representations may arrive at two different
hypotheses
51
ISSUES IN DECISION TREE LEARNING

• Overfitting
• Incorporating Continuous-valued attributes
• Alternative measures for selecting attributes
• Handling attributes with costs
• Handling examples with missing attribute values

52
OVERFITTING
• Consider the error of a hypothesis h over
• the training data: errortrain(h)

• the entire distribution D of data: errorD(h)

• The hypothesis h ∈ H overfits the training data if there is an
alternative hypothesis h’ ∈ H such that
• errortrain(h) < errortrain(h’) AND
• errorD(h) > errorD(h’)
53
OVERFITTING
Definition:
Given a hypothesis space H, a hypothesis h belonging to H
is said to overfit the training data if there exists some
alternative hypothesis h’ belonging to H, such that h has
smaller error than h’ over the training examples, but h’ has a
smaller error than h over the entire distribution of instances.

54
OVERFITTING IN DECISION TREE LEARNING

55
AVOIDING OVERFITTING

57
REDUCED-ERROR PRUNING

• Split the data into training and validation sets

• Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each
possible node
2. Greedily remove the one that most improves validation
set accuracy

(A code sketch of this procedure follows below.)
64
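A minimal sketch of this greedy procedure, assuming the nested-dict tree representation used in the earlier sketches and assuming each internal node also stores the majority class of its training examples under a `"majority"` key; all names here are illustrative choices for this sketch, not from the slides.

```python
import copy

def classify(node, example):
    while isinstance(node, dict):
        # unseen attribute values fall back to the node's majority class
        node = node["branches"].get(example[node["attribute"]], node["majority"])
    return node

def accuracy(tree, examples, target="PlayTennis"):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def internal_nodes(node, path=()):
    """Yield the branch-path of every internal (non-leaf) node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def prune_at(tree, path):
    """Return a copy of the tree with the node at `path` replaced by its majority-class leaf."""
    pruned = copy.deepcopy(tree)
    if not path:
        return pruned["majority"]
    node = pruned
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return pruned

def reduced_error_prune(tree, validation):
    """Greedily prune whichever node most improves validation accuracy; stop when none helps."""
    while True:
        best_tree, best_acc = tree, accuracy(tree, validation)
        for path in internal_nodes(tree):
            candidate = prune_at(tree, path)
            if accuracy(candidate, validation) > best_acc:
                best_tree, best_acc = candidate, accuracy(candidate, validation)
        if best_tree is tree:      # no pruning step improved validation accuracy
            return tree
        tree = best_tree
```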
EFFECT OF REDUCED-ERROR PRUNING

65
RULE POST-PRUNING
• The major drawback of reduced-error pruning is that when
data is limited, withholding a validation set further reduces
the number of examples available for training.

Hence rule post-pruning: grow the full tree, convert it to an equivalent set of
rules (one rule per root-to-leaf path), prune each rule by removing preconditions
that do not reduce its estimated accuracy, then sort the pruned rules by accuracy.

67
CONTINUOUS VALUED-ATTRIBUTES

• Create a discrete-valued attribute to test a continuous one

• Dynamically define new discrete-valued attributes
that partition the continuous attribute’s value into a
discrete set of intervals
• For a continuous attribute A, create a new Boolean attribute Ac
that is true if A < c and false otherwise

75
(48 + 60) / 2 = 54 - candidate threshold

(80 + 90) / 2 = 85 - candidate threshold

• So if Temperature = 75, it falls between the two thresholds, in the
region where the examples are positive
• We can infer that PlayTennis = Yes

76
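A small sketch of the threshold-picking step described above: candidate thresholds are the midpoints between adjacent sorted values where the class label changes. The four (value, label) pairs below reuse only the temperatures named on this slide, with labels chosen so the class changes at the two boundaries; they are an illustration, not the full column from the original example.

```python
def candidate_thresholds(values_with_labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(values_with_labels)
    return [
        (a_val + b_val) / 2
        for (a_val, a_lab), (b_val, b_lab) in zip(pairs, pairs[1:])
        if a_lab != b_lab
    ]

temps = [(48, "No"), (60, "Yes"), (80, "Yes"), (90, "No")]
print(candidate_thresholds(temps))   # [54.0, 85.0], the two thresholds on the slide

# A new Boolean attribute such as Temperature<54 (or Temperature<85) can then be
# evaluated with information gain like any other discrete attribute.
```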
ALTERNATIVE MEASURES FOR SELECTING
ATTRIBUTES
• Problem:
• If an attribute has many values (e.g., a Date attribute), Gain will tend to favor it
• One approach – Gain Ratio:

  GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)

  SplitInformation(S,A) = - Σ (i = 1 to c) (|Si| / |S|) log2(|Si| / |S|)

  where Si is the subset of S for which A has value vi
77
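A short sketch of the two quantities defined above (illustrative code, not from the slides), showing why a many-valued attribute is penalized.

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S,A) = - sum_i |Si|/|S| * log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(n / total * math.log2(n / total) for n in subset_sizes if n)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)."""
    return gain / split_information(subset_sizes)

# A Date-like attribute splitting 14 examples into 14 singletons has split
# information log2(14) ~ 3.81, so even a large gain yields a modest gain ratio;
# Outlook's split [5, 4, 5] has split information of only about 1.58.
print(round(split_information([1] * 14), 2))   # 3.81
print(round(split_information([5, 4, 5]), 2))  # 1.58
print(round(gain_ratio(0.246, [5, 4, 5]), 3))  # 0.156 for Outlook on PlayTennis
```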
EXAMPLES WITH MISSING ATTRIBUTE
VALUES
• What if some examples are missing values of attribute A?
• Use the training examples anyway and sort them through the tree
• If node n tests A, assign the example the most common value of A among
the other examples at node n
• Alternatively, assign a probability pi to each possible value vi of A and
send a fraction pi of the example down each corresponding branch

78
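A minimal sketch of the first strategy above (most common value at node n); the function names and dict-based examples are illustrative choices for this sketch, not from the slides.

```python
from collections import Counter

def most_common_value(examples, attribute):
    """Most common value of `attribute` among the examples that do have it."""
    values = [e[attribute] for e in examples if e.get(attribute) is not None]
    return Counter(values).most_common(1)[0][0]

def fill_missing(example, attribute, examples_at_node):
    """Strategy 1: assign the most common value of the attribute at node n."""
    if example.get(attribute) is None:
        example = dict(example, **{attribute: most_common_value(examples_at_node, attribute)})
    return example

# Example: a day with an unrecorded Humidity reaching a node whose examples are mostly High
node_examples = [{"Humidity": "High"}, {"Humidity": "High"}, {"Humidity": "Normal"}]
print(fill_missing({"Outlook": "Sunny", "Humidity": None}, "Humidity", node_examples))
# {'Outlook': 'Sunny', 'Humidity': 'High'}

# The probabilistic alternative would instead send a fraction pi of the example
# down each branch, one fraction per possible value vi.
```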
ATTRIBUTES WITH COSTS
• Problem:
• Medical diagnosis: BloodTest has cost $150
• Robotics: Width_from_1ft has cost 23 seconds
• One approach – replace Gain with a cost-sensitive measure:

• Tan and Schlimmer (1990): Gain²(S,A) / Cost(A)

• Nunez (1988): (2^Gain(S,A) - 1) / (Cost(A) + 1)^w
• where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information
gain.
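A small sketch of the two cost-sensitive measures named above. The gain values in the usage lines are made up for illustration; only the $150 cost comes from the slide.

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S,A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Illustrative comparison: a cheap attribute with modest gain vs. an expensive
# attribute (cost $150, as in the BloodTest example) with higher gain.
print(tan_schlimmer(0.2, cost=1), tan_schlimmer(0.4, cost=150))   # 0.04 vs ~0.001
print(nunez(0.2, cost=1, w=1.0), nunez(0.4, cost=150, w=1.0))     # ~0.074 vs ~0.002
```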

79
