2. Decision Trees
Overfitting
Decision trees are among the most widely used methods for
inductive inference.
Each node in the tree specifies a test for some attribute of the
instance.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
Decision trees represent a disjunction (or) of conjunctions (and) of constraints on the attribute values: each root-to-leaf path is one conjunction, and the tree as a whole is the disjunction of these paths.
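For example, assuming the standard PlayTennis tree (Outlook at the root, Humidity tested under Sunny and Wind tested under Rain; this structure is an assumption here), the represented concept reads:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)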
[Figure: example decision tree for PlayTennis with Outlook at the root; leaf nodes labelled No/Yes]
Main loop:
1 A ← the “best” decision attribute for next node
2 Assign A as decision attribute for node
3 For each value of A, create new descendant of node
4 Sort training examples to leaf nodes
5 If training examples perfectly classified, Then STOP, Else iterate
over new leaf nodes
This is basically the ID3 algorithm
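A minimal Python sketch of this loop, assuming examples are stored as dictionaries and the attribute-selection heuristic (e.g. information gain, defined below) is passed in as a function; the names are mine, not from any library:

from collections import Counter

def id3(examples, target, attributes, choose_attribute):
    """Sketch of the ID3 main loop.

    examples: list of dicts mapping attribute names (and the target key) to values.
    choose_attribute: callable (examples, attributes, target) -> "best" attribute,
    e.g. the one with highest information gain.
    """
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # training examples perfectly classified -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for value in {ex[best] for ex in examples}:        # one branch per value of best
        subset = [ex for ex in examples if ex[best] == value]
        subtree[value] = id3(subset, target, remaining, choose_attribute)
    return {best: subtree}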
What do we mean by best?
[Figure: Entropy(S), ranging from 0 to 1.0, plotted against the proportion of positive examples; it reaches its maximum of 1.0 when the classes are evenly split (p⊕ = 0.5)]
We can also say that Entropy equals the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S using the optimal, shortest-length code.
Information theory: optimal length code assigns − log2 p bits to
message having probability p.
Imagine I am choosing elements from S at random and telling you whether they are ⊕ or ⊖. How many bits per element will I need? (We work out the encoding beforehand.)
If a message has probability 1 then its encoding length is 0.
If its probability is 0.5 then we need 1 bit (the maximum).
So, the expected number of bits to encode ⊕ or ⊖ for a random member of S is:
p⊕ (− log2 p⊕) + p⊖ (− log2 p⊖) = Entropy(S)
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)    (2)
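As a sketch (the helper names are mine, not from the slides), Entropy and Gain can be computed directly from Equation (2):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes c of -p_c * log2(p_c)
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)   (Equation (2))
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder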
[Figure: the collection S: [9+, 5−] with Entropy E = 0.940, split separately by Humidity and by Wind to compare the information gain of the two attributes]
Gain(S, color) = Entropy(S) − (|S_red|/|S|) Entropy(S_red) − (|S_green|/|S|) Entropy(S_green) − (|S_blue|/|S|) Entropy(S_blue)
= 1 − (3/6)(0.9182) − 0 − 0 = 0.5409.
Consider a classification problem where the class label y can take the values −1 and +1 and there are two features, x1 and x2, which both have possible values 0, 1, and 2. Let H = {h1, h2, h3} be a hypothesis space for this problem that contains the following three hypotheses:
x1  x2  y
0   2   +1
2   2   −1
1   2   −1
0   0   +1
h1(x) = +1 if x1 · x2 = 0, −1 otherwise.
h2(x) = +1 if x1 ≠ x2, −1 otherwise.
h3(x) = +1 if x1 = 0, −1 otherwise.
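As a quick sanity check (treating the table's third column as the label y, which is my reading of it), the three hypotheses can be written as Python functions and scored against the four examples:

# Training examples from the table above: (x1, x2, y)
data = [(0, 2, +1), (2, 2, -1), (1, 2, -1), (0, 0, +1)]

h1 = lambda x1, x2: +1 if x1 * x2 == 0 else -1   # +1 iff x1 * x2 = 0
h2 = lambda x1, x2: +1 if x1 != x2 else -1       # +1 iff x1 != x2
h3 = lambda x1, x2: +1 if x1 == 0 else -1        # +1 iff x1 = 0

for name, h in [("h1", h1), ("h2", h2), ("h3", h3)]:
    errors = sum(h(x1, x2) != y for x1, x2, y in data)
    print(name, "training errors:", errors)      # h1: 0, h2: 2, h3: 0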
1 Using ID3 find the best attribute for the root node of the
decision tree for the 10 training examples.
2 What is the best attribute for the Parents child node?
Hypothesis Space Search by ID3
[Figure: ID3's search through the hypothesis space: a sequence of progressively larger decision trees, grown by adding tests on attributes A1, A2, A3, A4 at new nodes]
Given a set of examples, there are many trees that would fit it.
Which one does ID3 pick?
[Figure: accuracy (0.5 to 0.9) plotted against the size of the tree (number of nodes, 0 to 100), illustrating overfitting as the tree grows]
[Figure: decision tree for PlayTennis with Outlook at the root; leaf nodes labelled No/Yes]
For a continuous-valued attribute such as Temperature, candidate thresholds are created dynamically, typically midway between adjacent values where the classification changes:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
Here the candidate tests are Temperature > 54 and Temperature > 85.
Problem:
If an attribute has many values, Gain will tend to select it.
Imagine using Date = Jun_3_1996 as an attribute: every example gets its own value.
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i|/|S|) log2 (|S_i|/|S|)

where S_i is the subset of S for which A has value v_i
The SplitInformation term discourages the selection of attributes with many uniformly distributed values.
SplitInformation(S, Outlook) = (−(5/14) log2 (5/14)) × 2 + (−(4/14) log2 (4/14)) = 1.577
GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157
GainRatio(S, Temperature) = 0.029 / 1.362 = 0.021
GainRatio(S, Humidity) = 0.152 / 1.000 = 0.152
GainRatio(S, Wind) = 0.048 / 0.985 = 0.049
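A small check of the Outlook numbers, assuming the usual PlayTennis split of 5 Sunny, 4 Overcast and 5 Rain examples (which matches the 5/14 and 4/14 fractions above); the function name is mine:

import math
from collections import Counter

def split_information(values):
    # SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
si = split_information(outlook)
print(round(si, 3))                 # 1.577
print(round(0.247 / si, 3))         # GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157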
1 Problem: gain ratio may choose an attribute just because its intrinsic (split) information is very low
2 Fix: first, only consider attributes with greater than average information gain
3 Then, compare them on gain ratio
Attributes With Costs
Consider:
medical diagnosis, where BloodTest has cost $150
robotics, where Width_from_1ft has cost 23 sec.
1 Compute the Gini Index for the overall collection of training examples.
2 Compute the Gini Index for the Customer ID attribute.
3 Compute the Gini index for Gender, Car Type and ShirtSize and show which attribute is better.
A B Class Label
T F +
T T +
T T +
T F -
T T +
F F -
F F -
F F -
T T -
T F
For the following generic histogram the Gini index can be given as follows:
C1 C2
L a1 a2
R b1 b2
gini = ((a1 + a2)/n) [1 − (a1/(a1 + a2))² − (a2/(a1 + a2))²] + ((b1 + b2)/n) [1 − (b1/(b1 + b2))² − (b2/(b1 + b2))²]
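A direct Python transcription of this formula (a1, a2 are the class counts in the left partition L, b1, b2 those in the right partition R, and n their total; the function name is mine):

def gini_of_split(a1, a2, b1, b2):
    # Weighted Gini index of a binary split, given per-class counts in each partition.
    n = a1 + a2 + b1 + b2
    def gini(c1, c2):
        total = c1 + c2
        return 1 - (c1 / total) ** 2 - (c2 / total) ** 2
    return (a1 + a2) / n * gini(a1, a2) + (b1 + b2) / n * gini(b1, b2)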
D: a data partition
Consider attribute A with continuous values
To determine the best binary split on A
What to examine?
Examine each possible split point
The midpoint between each pair of (sorted) adjacent values is
taken as a possible split-point
How to examine?
For each split point, compute the weighted sum of the impurity of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point):
Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
The split-point that gives the minimum Gini index for attribute A is selected.
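A sketch of this search over candidate split-points (midpoints of adjacent distinct sorted values), using a small Gini helper; the function names are mine:

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_split_point(values, labels):
    # Try the midpoint between each pair of adjacent distinct sorted values and
    # return (split_point, weighted Gini) with the minimum weighted Gini.
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        mid = (v1 + v2) / 2
        left = [y for x, y in pairs if x <= mid]
        right = [y for x, y in pairs if x > mid]
        score = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
        if score < best_score:
            best_point, best_score = mid, score
    return best_point, best_score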
S′ = {family, truck}
gini_{S′}(T) = (4/6)[1 − (2/4)² − (2/4)²] + (2/6)[1 − (2/2)² − (0/2)²]
= (4/6)[1 − 4/16 − 4/16] + (2/6)[1 − 1 − 0]
= (4/6)[1 − 8/16]
= (4/6) × (1/2)
= 4/12 = 1/3
D: a data partition
Consider attribute A with v outcomes {a1, . . ., av}
To determine the best binary split on A
What to examine?
Examine the partitions resulting from all possible subsets of {a1, . . ., av}
Each subset S_A defines a binary test of attribute A of the form A ∈ S_A?
There are 2^v possible subsets; excluding the full set and the empty set leaves 2^v − 2 candidate subsets.
How to examine?
For each subset, compute the weighted sum of the impurity of
each of the two resulting partitions
Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
The subset that gives the minimum Gini index for attribute A is
selected as its splitting subset
Using attribute income: there are three values: low, medium and high
Choosing the subset {low, medium} results in two partitions:
D1 (income∈ {low, medium} ): 10 tuples
D2 (income ∈ {high} ): 4 tuples
Gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
= (10/14)(1 − (6/10)² − (4/10)²) + (4/14)(1 − (1/4)² − (3/4)²)
= 0.450
= Gini_{income ∈ {high}}(D)
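This number can be reproduced directly from the class counts implied by the computation above (6 and 4 in D1, 1 and 3 in D2); the helper name is mine:

def gini(counts):
    # Gini impurity from a list of per-class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

d1, d2 = [6, 4], [1, 3]        # class counts in D1 (income in {low, medium}) and D2 (income in {high})
n = sum(d1) + sum(d2)          # 14 tuples in total
g = sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)
print(round(g, 3))             # 0.45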