2. Decision Trees
Overfitting
Decision trees are among the most widely used methods for
inductive inference.
Each node in the tree specifies a test for some attribute of the
instance.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
Decision trees represent a disjunction (or) of conjunctions (and) of constraints on the attribute values: each root-to-leaf path is one conjunction, and the tree as a whole is the disjunction of these paths.
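For example, assuming the standard PlayTennis tree (Outlook at the root, Humidity tested under Sunny and Wind tested under Rain; this structure is an assumption here), the represented concept reads:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)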
[Figure: example decision tree for PlayTennis with Outlook at the root; leaf nodes labelled No/Yes]
Main loop:
1 A ← the “best” decision attribute for next node
2 Assign A as decision attribute for node
3 For each value of A, create new descendant of node
4 Sort training examples to leaf nodes
5 If training examples perfectly classified, Then STOP, Else iterate
over new leaf nodes
This is basically the ID3 algorithm
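A minimal Python sketch of this loop, assuming examples are stored as dictionaries and the attribute-selection heuristic (e.g. information gain, defined below) is passed in as a function; the names are mine, not from any library:

from collections import Counter

def id3(examples, target, attributes, choose_attribute):
    """Sketch of the ID3 main loop.

    examples: list of dicts mapping attribute names (and the target key) to values.
    choose_attribute: callable (examples, attributes, target) -> "best" attribute,
    e.g. the one with highest information gain.
    """
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # training examples perfectly classified -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for value in {ex[best] for ex in examples}:        # one branch per value of best
        subset = [ex for ex in examples if ex[best] == value]
        subtree[value] = id3(subset, target, remaining, choose_attribute)
    return {best: subtree}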
What do we mean by best?
[Figure: Entropy(S), ranging from 0 to 1.0, plotted against the proportion of positive examples; it reaches its maximum of 1.0 when the classes are evenly split (p⊕ = 0.5)]
We can also say that Entropy equals the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S using the optimal, shortest-length code.
Information theory: optimal length code assigns − log2 p bits to
message having probability p.
Imagine I am choosing elements from S at random and telling you whether they are ⊕ or ⊖. How many bits per element will I need? (We work out the encoding beforehand.)
If a message has probability 1 then its encoding length is 0.
If its probability is 0.5 then we need 1 bit (the maximum).
So, the expected number of bits to encode ⊕ or ⊖ for a random member of S is:
p⊕ (− log2 p⊕) + p⊖ (− log2 p⊖) = Entropy(S)
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)    (2)
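As a sketch (the helper names are mine, not from the slides), Entropy and Gain can be computed directly from Equation (2):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes c of -p_c * log2(p_c)
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)   (Equation (2))
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder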
[Figure: the collection S: [9+, 5−] with Entropy E = 0.940, split separately by Humidity and by Wind to compare the information gain of the two attributes]
Gain(S, color) = Entropy(S) − (|S_red|/|S|) Entropy(S_red) − (|S_green|/|S|) Entropy(S_green) − (|S_blue|/|S|) Entropy(S_blue)
= 1 − (3/6)(0.9182) − 0 − 0 = 0.5409.
Consider a classification problem where the class label y can take the values −1 and +1 and there are two features, x1 and x2, which both have possible values 0, 1, and 2. Let H = {h1, h2, h3} be a hypothesis space for this problem that contains the following three hypotheses:
x1  x2  y
0   2   +1
2   2   −1
1   2   −1
0   0   +1
h1(x) = +1 if x1 · x2 = 0, −1 otherwise.
h2(x) = +1 if x1 ≠ x2, −1 otherwise.
h3(x) = +1 if x1 = 0, −1 otherwise.
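As a quick sanity check (treating the table's third column as the label y, which is my reading of it), the three hypotheses can be written as Python functions and scored against the four examples:

# Training examples from the table above: (x1, x2, y)
data = [(0, 2, +1), (2, 2, -1), (1, 2, -1), (0, 0, +1)]

h1 = lambda x1, x2: +1 if x1 * x2 == 0 else -1   # +1 iff x1 * x2 = 0
h2 = lambda x1, x2: +1 if x1 != x2 else -1       # +1 iff x1 != x2
h3 = lambda x1, x2: +1 if x1 == 0 else -1        # +1 iff x1 = 0

for name, h in [("h1", h1), ("h2", h2), ("h3", h3)]:
    errors = sum(h(x1, x2) != y for x1, x2, y in data)
    print(name, "training errors:", errors)      # h1: 0, h2: 2, h3: 0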
1 Using ID3 find the best attribute for the root node of the
decision tree for the 10 training examples.
2 What is the best attribute for the Parents child node?
Hypothesis Space Search by ID3
[Figure: ID3's search through the hypothesis space: a sequence of progressively larger decision trees, grown by adding tests on attributes A1, A2, A3, A4 at new nodes]
Given a set of examples, there are many trees that would fit it.
Which one does ID3 pick?
[Figure: accuracy (0.5 to 0.9) plotted against the size of the tree (number of nodes, 0 to 100), illustrating overfitting as the tree grows]
[Figure: decision tree for PlayTennis with Outlook at the root; leaf nodes labelled No/Yes]
For a continuous-valued attribute such as Temperature, candidate thresholds are created dynamically, typically midway between adjacent values where the classification changes:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
Here the candidate tests are Temperature > 54 and Temperature > 85.
Problem:
If an attribute has many values, Gain will tend to select it.
Imagine using Date = Jun_3_1996 as an attribute: every example gets its own value.
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i|/|S|) log2 (|S_i|/|S|)

where S_i is the subset of S for which A has value v_i
The SplitInformation term discourages the selection of attributes with many uniformly distributed values.
SplitInformation(S, Outlook) = (−(5/14) log2 (5/14)) × 2 + (−(4/14) log2 (4/14)) = 1.577
GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157
GainRatio(S, Temperature) = 0.029 / 1.362 = 0.021
GainRatio(S, Humidity) = 0.152 / 1.000 = 0.152
GainRatio(S, Wind) = 0.048 / 0.985 = 0.049
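A small check of the Outlook numbers, assuming the usual PlayTennis split of 5 Sunny, 4 Overcast and 5 Rain examples (which matches the 5/14 and 4/14 fractions above); the function name is mine:

import math
from collections import Counter

def split_information(values):
    # SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
si = split_information(outlook)
print(round(si, 3))                 # 1.577
print(round(0.247 / si, 3))         # GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157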
1 Problem: gain ratio may choose an attribute just because its intrinsic (split) information is very low
2 Fix: first, only consider attributes with greater than average information gain
3 Then, compare them on gain ratio
Attributes With Costs
Consider:
medical diagnosis, where BloodTest has cost $150
robotics, where Width_from_1ft has cost 23 sec.
1 Compute the Gini Index for the overall collection of training examples.
2 Compute the Gini Index for the Customer ID attribute.
3 Compute the Gini index for Gender, Car Type and ShirtSize and show which attribute is better.
A B Class Label
T F +
T T +
T T +
T F -
T T +
F F -
F F -
F F -
T T -
T F
For the following generic histogram the Gini index can be given as follows:
C1 C2
L a1 a2
R b1 b2
gini = ((a1 + a2)/n) [1 − (a1/(a1 + a2))² − (a2/(a1 + a2))²] + ((b1 + b2)/n) [1 − (b1/(b1 + b2))² − (b2/(b1 + b2))²]
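A direct Python transcription of this formula (a1, a2 are the class counts in the left partition L, b1, b2 those in the right partition R, and n their total; the function name is mine):

def gini_of_split(a1, a2, b1, b2):
    # Weighted Gini index of a binary split, given per-class counts in each partition.
    n = a1 + a2 + b1 + b2
    def gini(c1, c2):
        total = c1 + c2
        return 1 - (c1 / total) ** 2 - (c2 / total) ** 2
    return (a1 + a2) / n * gini(a1, a2) + (b1 + b2) / n * gini(b1, b2)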
D: a data partition
Consider attribute A with continuous values
To determine the best binary split on A
What to examine?
Examine each possible split point
The midpoint between each pair of (sorted) adjacent values is
taken as a possible split-point
How to examine?
For each split point, compute the weighted sum of the impurity of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point):
Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
The split-point that gives the minimum Gini index for attribute A is selected.
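A sketch of this search over candidate split-points (midpoints of adjacent distinct sorted values), using a small Gini helper; the function names are mine:

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_split_point(values, labels):
    # Try the midpoint between each pair of adjacent distinct sorted values and
    # return (split_point, weighted Gini) with the minimum weighted Gini.
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        mid = (v1 + v2) / 2
        left = [y for x, y in pairs if x <= mid]
        right = [y for x, y in pairs if x > mid]
        score = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
        if score < best_score:
            best_point, best_score = mid, score
    return best_point, best_score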
S′ = {family, truck}
gini_{S′}(T) = (4/6)[1 − (2/4)² − (2/4)²] + (2/6)[1 − (2/2)² − (0/2)²]
= (4/6)[1 − 4/16 − 4/16] + (2/6)[1 − 1 − 0]
= (4/6)[1 − 8/16]
= (4/6) × (1/2)
= 4/12 = 1/3
D: a data partition
Consider attribute A with v outcomes {a1, . . ., av}
To determine the best binary split on A
What to examine?
Examine the partitions resulting from all possible subsets of {a1, . . ., av}
Each subset S_A defines a binary test of attribute A of the form A ∈ S_A?
There are 2^v possible subsets; excluding the full set and the empty set leaves 2^v − 2 candidate subsets.
How to examine?
For each subset, compute the weighted sum of the impurity of
each of the two resulting partitions
Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
The subset that gives the minimum Gini index for attribute A is
selected as its splitting subset
Using attribute income: there are three values: low, medium and high
Choosing the subset {low, medium} results in two partitions:
D1 (income∈ {low, medium} ): 10 tuples
D2 (income ∈ {high} ): 4 tuples
Gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
= (10/14)(1 − (6/10)² − (4/10)²) + (4/14)(1 − (1/4)² − (3/4)²)
= 0.450
= Gini_{income ∈ {high}}(D)
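This number can be reproduced directly from the class counts implied by the computation above (6 and 4 in D1, 1 and 3 in D2); the helper name is mine:

def gini(counts):
    # Gini impurity from a list of per-class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

d1, d2 = [6, 4], [1, 3]        # class counts in D1 (income in {low, medium}) and D2 (income in {high})
n = sum(d1) + sum(d2)          # 14 tuples in total
g = sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)
print(round(g, 3))             # 0.45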