09 Decision Tree Induction
Concept of Entropy
Basic concept of entropy in information theory
Mathematical formulation of entropy
Calculation of entropy of a training set
A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7
All nodes drawn with circle (ellipse) are called internal nodes.
All nodes drawn with rectangle boxes are called terminal nodes or leaf
nodes.
The series of questions and their answers can be organized in the form of
a decision tree
As a hierarchical structure consisting of nodes and edges
A decision tree T is a tree associated with D that has the following properties:
Each internal node is labeled with an attribute Ai
Each edge is labeled with a predicate that can be applied to the attribute
associated with its parent node
Each leaf node is labeled with class cj
Greedy strategy:
A top-down recursive divide-and-conquer approach
Examples: ID3, C4.5, CART, etc.
5. For each subset Dk of D:
   BuildDT(Dk) // Recursive call
6. Stop
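To make the recursive, greedy scheme concrete, the following is a minimal Python sketch of the BuildDT idea. It assumes records are dictionaries with a 'class' key and uses weighted entropy (discussed later) as the splitting criterion; all names are illustrative rather than taken from the original algorithm listing.

```python
import math
from collections import Counter

def entropy(records):
    # Entropy of the class distribution of a list of records.
    counts = Counter(r['class'] for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_attribute(records, attributes):
    # Pick the attribute whose split gives the lowest weighted entropy.
    def weighted_entropy(a):
        values = Counter(r[a] for r in records)
        return sum((cnt / len(records)) *
                   entropy([r for r in records if r[a] == v])
                   for v, cnt in values.items())
    return min(attributes, key=weighted_entropy)

def build_dt(records, attributes):
    classes = {r['class'] for r in records}
    if len(classes) == 1 or not attributes:          # stopping criteria
        return Counter(r['class'] for r in records).most_common(1)[0][0]
    a = best_attribute(records, attributes)
    node = {'attribute': a, 'children': {}}
    for v in {r[a] for r in records}:                # one branch per distinct value
        subset = [r for r in records if r[a] == v]
        node['children'][v] = build_dt(subset, [x for x in attributes if x != a])
    return node
```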
The test condition for a binary attribute generates only two outcomes
Multi-way split: Outcome depends on the number of distinct values of the
corresponding attribute
A multi-way split
A binary split
Binary splitting attribute values should be grouped maintaining the order property
of the attribute values
In this case, decision tree induction must consider all possible split positions
Range query: v_i ≤ A < v_{i+1} for i = 1, 2, …, q (if q ranges are chosen)
How do we measure it?
Universe!
What was its entropy value at its starting point?
The first time it was used to measure the information content in messages.
Is there a planet like Earth in the galaxy? (The answer carries extreme uncertainty.)
Approach 1: Brute-force approach
In this approach, we can ask such questions randomly choosing one city at a time. As a
matter of randomness, let us ask the questions, not necessarily in the order, as they are in
the list.
Q.1: Is the city Bangalore? No
Q.2: Is the city Bhubaneswar? No
Q.3: Is the city Bhopal? No
Q.4: Is the city Delhi? No
Q.5: Is the city Hyderabad? No
Q.6: Is the city Madras? No
Q.7: Is the city Mumbai? No
No need to ask a further question! The answer is already known by Q.7. If asked randomly,
each of these possibilities is equally likely with probability 1/8. Hence, on the average, we
need
(1 + 2 + 3 + 4 + 5 + 6 + 7 + 7) / 8 = 4.375 questions.
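A two-line check of this average, under the stated assumption that all eight cities are equally likely:

```python
# Number of questions needed for each of the 8 equally likely cities in the
# linear questioning strategy: the k-th city takes k questions, and the last
# one is known for free after 7 "No" answers.
questions = [1, 2, 3, 4, 5, 6, 7, 7]
print(sum(questions) / len(questions))   # 4.375
```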
Approach 2: Clever approach
Clever approach (binary search)
In this approach, we divide the list into two halves, pose a question for a half
Repeat the same recursively until we get yes answer for the unknown.
Note:
Approach 2 is considered the best strategy because it will invariably find the answer and
will, on average, do so with fewer questions than any other strategy.
Approach 1 occasionally does better (when you are lucky enough!)
It is no coincidence that 8 = 2^3, and the minimum number of yes/no questions needed is 3.
If m = 16, then 16 = 2^4, and we can argue that we need 4 questions to solve the problem.
If m = 32, then 5 questions; if m = 256, then 8 questions; and so on.
To encode eight values, say eight different colours, we need to use three bits, such as
000 : Violet
001 : Indigo
010 : Blue
011 : Green
100 : Yellow
101 : Orange
110 : Red
111 : White
(here m = 8 = 2^n with n = 3)
Thus, in general, to code m values, each in a distinct manner, we need n bits such that m = 2^n.
The entropy of a set of m distinct values is the number of bits needed to encode all
the values in the most efficient way.
We are to identify a sequence of k such values (each one chosen independently of the others,
that is, repetitions are allowed). Note that if k = 1, it is the type of game we have already dealt
with.
A sequence of k values is one out of m^k equally likely possibilities, so the number of yes/no questions needed is

\log_2 (m^k) = k \log_2 m

In general, E = \log_2 m questions (bits) are needed per value.
The entropy of a set of m distinct objects is \log_2 m even when m is not exactly a
power of 2.
Now suppose the m values are not equally likely, but occur with frequencies (probabilities) p_1, p_2, …, p_m, where

\sum_{i=1}^{m} p_i = 1

Thus, in this example, there are m = 4 objects (A, B, C and D) with p_1 = 1/2, p_2 = 1/4, p_3 = 1/8 and p_4 = 1/8.
Using standard 2-bit encoding, we can represent them as A = 00, B = 01, C = 10,
D = 11.
Also, we can follow variable length coding (also called Huffman coding) as an
improved way of representing them.
The Huffman coding of A, B, C and D with their frequencies 1/2, 1/4, 1/8 and 1/8 is shown
below.
A=1
B = 01
C = 001
D = 000
Discriminating amongst m values (m ≠ 2^n)
With the above representation say, if A is to be identified, then we need to
examine only one question, for B it is 2 and for C and D both, it is 3.
Thus, on the average, we need

(1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/8) × 3 = 1.75

This is the average number of yes/no questions needed to identify any one of the four objects,
whose frequencies of occurrence are not uniform.
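A small sketch that checks this figure, and also shows that 1.75 is exactly the entropy of the frequency distribution (the symbols and codes are the ones listed above):

```python
import math

probs = {'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8}
code  = {'A': '1', 'B': '01', 'C': '001', 'D': '000'}   # the Huffman code above

avg_len = sum(p * len(code[s]) for s, p in probs.items())   # expected code length
entropy = -sum(p * math.log2(p) for p in probs.values())    # information content
print(avg_len, entropy)   # both are 1.75 bits
```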
Other codings of A, B, C and D are also possible, for example:
1) A = 0,   B = 11,  C = 100,   D = 101
2) A = 01,  B = 1,   C = 001,   D = 000
3) A = 101, B = 001, C = 10011, D = 100001
In general, the average number of yes/no questions (bits) needed is

E = \sum_{i=1}^{m} p_i \log_2 (1/p_i) = -\sum_{i=1}^{m} p_i \log_2 p_i
Note:
If all values are equally likely, then p_i = 1/m and E = \log_2 m; it is the special case.
For a training set with k classes, the entropy is

E = -\sum_{i=1}^{k} p_i \log_2 p_i

where p_i is the fraction of instances belonging to the i-th class.
Here, E is measured in bits of information.
Note:
The above formula should be summed over the non-empty classes only, that is, classes
for which p_i ≠ 0
E is always a positive quantity (or zero)
E takes its minimum value (zero) if and only if all the instances have the same class
(i.e., the training set has only one non-empty class, for which the probability is 1).
Entropy takes its maximum value when the instances are equally distributed among the k
possible classes. In this case, the maximum value of E is \log_2 k.
In the OPTH database, there are 3 classes and 4 instances with class 1, 5 instances
with class 2 and 15 instances with class 3. Hence, entropy E of the database is:
E = -(4/24) \log_2 (4/24) - (5/24) \log_2 (5/24) - (15/24) \log_2 (15/24) = 1.3261
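The same value can be reproduced with a few lines of Python (the class counts 4, 5 and 15 are taken from the OPTH database as stated above):

```python
import math

counts = [4, 5, 15]                 # instances in class 1, 2 and 3
n = sum(counts)
E = -sum((c / n) * math.log2(c / n) for c in counts)
print(round(E, 4))                  # 1.3261
```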
It is very crucial to decide the series of questions about the values of a set of attributes,
which collectively determine the classification. Sometimes it may take one question,
sometimes many more.
Decision tree induction helps us to ask such a series of questions. In other words, we
can utilize entropy concept to build a better decision tree.
How entropy can be used to build a decision tree is our next topic of discussion.
The procedure is to choose an attribute and use it to split a larger training
set into smaller training sets.
2. Splitting criteria
Several algorithms have been proposed for the above tasks. In this lecture, we
shall limit our discussion to three important ones:
ID3
C4.5
CART
It is observed that splitting on any attribute has the property that average entropy
of the resulting training subsets will be less than or equal to that of the previous
training set.
Splitting on an attribute partitions the training set into a number of smaller training sets,
based on the distinct values of the attribute under split.
E(D) = -\sum_{i=1}^{k} p_i \log_2 p_i

where the training set D has c_1, c_2, …, c_k, the k distinct classes. If splitting on an attribute A partitions D into the subsets D_1, D_2, …, D_m, the weighted average entropy after the split is

E_A(D) = \sum_{j=1}^{m} (|D_j| / |D|) E(D_j)

Here, the term |D_j| / |D| denotes the weight of the j-th training set.
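As a sketch of how E(D), E_A(D) and the information gain fit together, the snippet below uses the per-value class counts of the Age attribute of OPTH (the same counts that appear in the frequency table later in this lecture); the helper names are illustrative.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Class counts inside each subset D_j produced by splitting OPTH on Age.
subsets = [[2, 2, 4],   # D1: Age = 1
           [1, 2, 5],   # D2: Age = 2
           [1, 1, 6]]   # D3: Age = 3
N = sum(sum(s) for s in subsets)

E_D  = entropy([sum(col) for col in zip(*subsets)])     # entropy of D (class totals 4, 5, 15)
E_A  = sum((sum(s) / N) * entropy(s) for s in subsets)  # weighted average entropy E_A(D)
gain = E_D - E_A
print(round(E_D, 4), round(E_A, 4), round(gain, 4))     # 1.3261 1.2867 0.0394
```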
Information Gain Calculation
Example 9.11 : Information gain on splitting OPTH
Training set D_1 (Age = 1), the 8 instances of OPTH with Age = 1:
E(D_1) = 1.5; weighted contribution (8/24) × 1.5 = 0.5000
Calculating Information Gain
Training set D_2 (Age = 2), 8 instances:
E(D_2) = 1.2988; weighted contribution (8/24) × 1.2988 = 0.4329
Calculating Information Gain
Training set: 3 (Age = 3)
3 2 1 2 2
3 2 2 1 3
3 2 2 2 3
In the example with the OPTH database, the largest value of information gain is
α(Use Type, D) = 0.5488.
Hence, the attribute chosen for splitting is Use Type.
The process of splitting on nodes is repeated for each branch of the evolving
decision tree; the final tree would look like the one shown in the following
slide, and the calculation is left for practice.
Age            Eye-sight      Use Type       Astigmatic
α = 0.0394     α = 0.0395     α = 0.5488     α = 0.3770
Splitting on Use Type produces two training subsets: D_1, containing all instances with
Use Type = 1, and D_2, containing all instances with Use Type = 2. Each subset is described
by the remaining attributes Age, Eye-sight and Astigmatic together with the class label.
E(D_1) = ?    E(D_2) = ?
For each subset, the information gains for Age, Eye-sight and Astigmatic (marked ? above)
are then to be calculated to decide the next splitting attribute.
Frequency Table: Calculating E_A(D)
Calculation of entropy for each table and hence information gain for a
particular split appears tedious (at least manually)!
As an alternative, we discuss a short-cut method of doing the same using a
special data structure called Frequency Table.
A frequency table for an attribute A has one row for each class and one column for each distinct value of A:
Number of rows = number of classes
Number of columns = number of distinct attribute values
Each cell f_ij holds the number of instances with class c_i and the j-th attribute value; the column sums give N_j = |D_j|.
Calculate E_A(D) = (S − V)/N, where V = \sum f_{ij} \log_2 f_{ij} over all non-zero cells, S = \sum_j N_j \log_2 N_j over the column sums, and N = |D|.
For the Age attribute of OPTH (frequency table shown later in this lecture):
V = 2 \log_2 2 + 1 \log_2 1 + 1 \log_2 1 + 2 \log_2 2 + 2 \log_2 2 + 1 \log_2 1 + 4 \log_2 4 + 5 \log_2 5 + 6 \log_2 6 = 41.1194
S = 8 \log_2 8 + 8 \log_2 8 + 8 \log_2 8 = 72
E_Age(D) = (72 − 41.1194) / 24 = 1.2867
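The short-cut translates directly into a few lines of Python; the table below is the Age frequency table used in this calculation, with rows as classes and columns as attribute values.

```python
import math

table = [[2, 1, 1],    # Class 1
         [2, 2, 1],    # Class 2
         [4, 5, 6]]    # Class 3  (columns: Age = 1, 2, 3)

col_sums = [sum(col) for col in zip(*table)]              # N_j for each column
N = sum(col_sums)
V = sum(f * math.log2(f) for row in table for f in row if f > 0)
S = sum(nj * math.log2(nj) for nj in col_sums if nj > 0)
print(round((S - V) / N, 4))                              # E_Age(D) = 1.2867
```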
Proof of Equivalence
In the following, we prove the equivalence of the short-cut of entropy
calculation using Frequency Table.
Splitting on an attribute A with m values produces m subsets of the training
dataset D (of size N = |D|). The subset D_j (j = 1, 2, …, m) contains all
the instances for which the attribute takes its j-th value. Let N_j denote
the number of instances in D_j, and let f_ij denote the number of instances in
D_j belonging to class c_i. Then

\sum_{j=1}^{m} N_j = N

\sum_{i=1}^{k} f_{ij} = N_j
Proof of Equivalence
Denoting E_j as the entropy of the j-th subset D_j, we have

E_j = -\sum_{i=1}^{k} (f_{ij} / N_j) \log_2 (f_{ij} / N_j)

so the weighted average entropy over all the subsets is

E_A(D) = \sum_{j=1}^{m} (N_j / N) E_j
       = -\frac{1}{N} \sum_{j=1}^{m} \sum_{i=1}^{k} f_{ij} ( \log_2 f_{ij} - \log_2 N_j )
       = \frac{1}{N} \left[ -\sum_{j=1}^{m} \sum_{i=1}^{k} f_{ij} \log_2 f_{ij} + \sum_{j=1}^{m} N_j \log_2 N_j \right]
       = \frac{S - V}{N}

with V = \sum_{i,j} f_{ij} \log_2 f_{ij} and S = \sum_{j} N_j \log_2 N_j, exactly the quantities used in the short-cut calculation.
Hence, the equivalence is proved.
Limiting Values of Information Gain
The information gain metric used in ID3 is always positive or zero.
On the other hand, when a training set with k classes is such that its entropy
takes the largest value, i.e., \log_2 k (this occurs when the classes are
balanced), and the split leaves each subset with the same balanced class
distribution, then the information gain will be zero.
Note:
The absence of information gain does not imply that there is no profit for
splitting on the attribute.
Even if it is chosen for splitting, ultimately it will lead to a final decision tree
with the branches terminated by a leaf node and thus having an entropy of
zero.
Information gain can never be a negative value.
Splitting of Continuous Attribute Values
In the foregoing discussion, we assumed that the attribute to be split has
a finite number of discrete values. The situation is quite different if the
attribute is not so, but is instead a continuous-valued attribute.
There are two approaches mainly to deal with such a case.
(Figure: the sorted attribute values a_1 < a_2 < … < a_n with candidate split points placed between consecutive values.)
Here, if there are n distinct values for the attribute A, then we choose n − 1
split points, as shown above.
For example, there is a split point v_i = (a_i + a_{i+1}) / 2 in between a_i and a_{i+1}.
For each split point v_i, we have two partitions: A ≤ v_i and A > v_i; finally, the
point with the maximum information gain is chosen as the split point for that
attribute.
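A minimal sketch of this search over candidate split points, using a tiny made-up (value, class) sample purely for illustration:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# Hypothetical continuous attribute values with their class labels.
data = sorted([(1.2, 'no'), (1.8, 'no'), (2.5, 'yes'), (3.1, 'yes'), (4.0, 'no')])
values = [v for v, _ in data]
E_D = entropy([c for _, c in data])

best = None
for i in range(len(values) - 1):
    v = (values[i] + values[i + 1]) / 2                   # candidate split point v_i
    left  = [c for x, c in data if x <= v]
    right = [c for x, c in data if x > v]
    E_split = (len(left) / len(data)) * entropy(left) \
            + (len(right) / len(data)) * entropy(right)
    gain = E_D - E_split
    if best is None or gain > best[1]:
        best = (v, gain)
print(best)   # (split point, information gain) with the maximum gain
```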
Algorithm CART
CART is a technique that generates a binary decision tree; that is, unlike ID3,
in CART only two children are created for each node.
ID3 uses information gain as a measure to select the best attribute to be split,
whereas CART does the same using another measure called the Gini index. It
is also known as the Gini index of diversity and is denoted here by Gini(D).
Gini(D) = 1 - \sum_{i=1}^{k} p_i^2

where p_i is the probability that a tuple in D belongs to class c_i, and can be
estimated as

p_i = |C_{i,D}| / |D|

where |C_{i,D}| denotes the number of tuples in D with class c_i.
It takes its largest value, 1 − 1/k,
when the classes are evenly distributed among the tuples, that is, when the frequency of
each class is 1/k.
If a binary split on attribute A partitions D into D_1 and D_2, the Gini index of this partitioning is

Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)

This binary partition of D reduces the impurity, and the reduction in impurity is
measured by

\gamma(A, D) = Gini(D) - Gini_A(D)
The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum value of Gini_A(D)) is selected as the attribute to be split.
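A short sketch of the Gini computation for a binary split; the class counts below echo the EMP example used later in this lecture (9 tuples of one class and 5 of the other, split into partitions with counts (6, 1) and (3, 4)):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

D  = [9, 5]    # class counts in D
D1 = [6, 1]    # class counts in partition D1
D2 = [3, 4]    # class counts in partition D2

n, n1, n2 = sum(D), sum(D1), sum(D2)
gini_A = (n1 / n) * gini(D1) + (n2 / n) * gini(D2)        # Gini_A(D)
print(round(gini(D), 4), round(gini_A, 4), round(gini(D) - gini_A, 4))  # reduction gamma(A, D)
```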
We shall discuss how the same is possible for an attribute with more than two
values.
To determine the best binary split on A, we examine all possible subsets
S_A of A that can be formed using the values of A.
Each subset S_A defines the binary test "A ∈ S_A?", with the Yes branch giving D_1 and the No branch giving D_2.
This test is satisfied if the value of A for a tuple is among the values listed
in S_A.
If A has m distinct values in D, then there are 2^m possible subsets, out of
which the empty subset { } and the full set {a_1, a_2, …, a_m} should be
excluded (as they do not really represent a split).
Thus, there are 2^m − 2 possible ways to form two partitions of the dataset D
based on a binary split of A.
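The candidate subsets are easy to enumerate; for a three-valued attribute such as Salary with values {L, M, H} this lists the 2^3 − 2 = 6 non-trivial subsets (each unordered partition appears twice):

```python
from itertools import combinations

values = ['L', 'M', 'H']
subsets = [set(c) for r in range(1, len(values))          # proper, non-empty subsets
                  for c in combinations(values, r)]
for s in subsets:
    print(sorted(s), 'vs', sorted(set(values) - s))       # the binary test "A in s?"
```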
n-ary Attribute Values to Binary Splitting
Case 2: Continuous-valued attributes
For a continuous-valued attribute, each possible split point must be taken into
account.
According to the strategy discussed earlier, we take the mid-point between each pair of
(sorted) adjacent values a_i and a_{i+1} as a candidate split point:

v_i = (a_i + a_{i+1}) / 2

Each candidate defines the binary test "A ≤ v_i?", with the Yes branch giving D_1 and the No branch giving D_2.
The point giving the minimum Gini Index is taken as the split-point of
the attribute A.
Note
The attribute A and either its splitting subset (for discrete-valued splitting
attribute) or split-point (for continuous valued splitting attribute) together
form the splitting criteria.
For the EMP data set, with 9 tuples of one class and 5 of the other (14 in all),

Gini(D) = 1 - \sum_{i=1}^{k} p_i^2 = 1 - [ (9/14)^2 + (5/14)^2 ] = 0.459
Now let us consider the calculation of Gini_A(D) for Age, Salary, Job and
Performance.
For the attribute Salary, with values {L, M, H}, the six candidate subsets give three distinct
binary groupings: ({L}, {M, H}), ({M}, {L, H}) and ({H}, {L, M}). For example, the test
"Salary ∈ {H}?" sends the {H} tuples along the Yes branch and the {L, M} tuples along the No
branch. The corresponding Gini values are:
Gini_1 = Gini_2 = 0.3000
Gini_3 = Gini_4 = 0.3150
Gini_5 = Gini_6 = 0.4508
Gini_A(D) = (7/14) [ 1 - (3/7)^2 - (4/7)^2 ] + (7/14) [ 1 - (6/7)^2 - (1/7)^2 ] = ?

\gamma(A, D) = ?
Out of these, γ(Salary, D) gives the maximum value and hence the attribute
Salary would be chosen, with the splitting subsets {L, M} and {H}.
Note:
It can be noted that the procedure for information gain calculation (i.e.
α(A, D)) and that for impurity reduction calculation (i.e. γ(A, D)) are nearly
the same.
Thus, in the same way, the average weighted Gini index can be calculated
using the same frequency table used to calculate the information gain α(A, D),
which is as follows:

Gini_A(D) = \sum_{j=1}^{m} (|D_j| / |D|) \left[ 1 - \sum_{i=1}^{k} ( f_{ij} / |D_j| )^2 \right]
          = 1 - \frac{1}{|D|} \sum_{j=1}^{m} \frac{1}{N_j} \sum_{i=1}^{k} f_{ij}^2
The above gives a formula for m attribute values; however, it can be fine-tuned to
subsets of attribute values also.
Let us consider the frequency table for the OPTH database considered earlier. Also
consider the attribute A1 (Age) with three values 1, 2 and 3. The frequency table is
shown below.
             Age = 1   Age = 2   Age = 3
Class 1         2         1         1
Class 2         2         2         1
Class 3         4         5         6
Column sum      8         8         8
1. For each non-empty column, form the sum of the squares of the values in the
body of the table and divide by the column sum.
2. Add the values obtained for all columns and divide by |D|, the size of the
database.
The value obtained here is 0.5382.
The calculation can be extended to the other attributes in the OPTH database and is
left as an exercise.
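The two steps above translate directly into code; the sketch below applies them to the Age frequency table and prints the resulting weighted Gini index.

```python
table = [[2, 1, 1],    # Class 1   (columns: Age = 1, 2, 3)
         [2, 2, 1],    # Class 2
         [4, 5, 6]]    # Class 3

cols = list(zip(*table))                                  # columns of the table
col_sums = [sum(col) for col in cols]
N = sum(col_sums)

inner = sum(sum(f * f for f in col) / nj                  # step 1, per column
            for col, nj in zip(cols, col_sums) if nj > 0)
print(1 - inner / N)                                      # weighted Gini index for the Age split
```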
Decision Trees with ID3 and CART Algorithms
Example 9.17 : Comparing Decision Trees of EMP Data set
Compare two decision trees obtained using ID3 and CART for the EMP dataset.
The decision tree according to ID3 is given for your ready reference (subject to
verification).
(Figure: the decision tree for the EMP data set obtained with ID3; the root tests Age, and its branches lead to tests on Job and Performance with leaf nodes labelled Y and N.)
The decision tree using CART is left to be derived.
ID3 uses the information gain measure, which is, in fact, biased towards splitting
attributes having a large number of outcomes.
For example, if an attribute has a distinct value for each tuple, then splitting on it would
result in a large number of partitions, each one containing just one tuple.
In such a case, note that each partition is pure, and hence the entropy of each
partition is E(D_j) = 0, so that

E_A(D) = \sum_{j=1}^{N} (1/N) E(D_j) = \sum_{j=1}^{N} (1/N) \cdot 0 = 0

and the information gain attains its maximum value E(D).
Note:
Decision Tree Induction Algorithm ID3 may suffer from overfitting problem.
In order to reduce the effect of this bias due to the use of
information gain, C4.5 uses a different measure called Gain Ratio, denoted as
β.
The gain ratio can be defined as follows. We first define the split information
E_A^*(D) as

E_A^*(D) = -\sum_{j=1}^{m} (|D_j| / |D|) \log_2 (|D_j| / |D|)

and then the gain ratio is the information gain divided by this split information:

β(A, D) = α(A, D) / E_A^*(D)
Note that for each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D.
To illustrate E_A^*(D), let us examine the case where there are 32 instances and
we are splitting on an attribute A which has four distinct values a_1, a_2, a_3 and a_4.
Consider the following distributions of the instances over these values.

Distribution 1:
             a_1   a_2   a_3   a_4
Frequency    32    0     0     0

E_A^*(D) = -(32/32) \log_2 (32/32) = -\log_2 1 = 0
Distribution 2:
             a_1   a_2   a_3   a_4
Frequency    16    16    0     0

E_A^*(D) = -(16/32) \log_2 (16/32) - (16/32) \log_2 (16/32) = 1
Physical Interpretation of E_A^*(D)
Distribution 3:
             a_1   a_2   a_3   a_4
Frequency    16    8     8     0

E_A^*(D) = -(16/32) \log_2 (16/32) - (8/32) \log_2 (8/32) - (8/32) \log_2 (8/32) = 1.5
Distribution 4:
             a_1   a_2   a_3   a_4
Frequency    16    8     4     4

E_A^*(D) = 1.75
Distribution 5 (uniform distribution of attribute values):
             a_1   a_2   a_3   a_4
Frequency    8     8     8     8

E_A^*(D) = -\left( (8/32) \log_2 (8/32) \right) \times 4 = \log_2 4 = 2.0
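These five values can be verified with a few lines of Python (the frequencies are the ones tabulated above):

```python
import math

def split_info(freqs):
    n = sum(freqs)
    return -sum((f / n) * math.log2(f / n) for f in freqs if f > 0)

for freqs in ([32, 0, 0, 0], [16, 16, 0, 0], [16, 8, 8, 0],
              [16, 8, 4, 4], [8, 8, 8, 8]):
    print(freqs, round(split_info(freqs), 2))   # 0.0, 1.0, 1.5, 1.75, 2.0
```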
Physical Interpretation of E_A^*(D)
In general, if there are m attribute values, each occurring equally frequently,
then the split information is \log_2 m.
Based on the Example 9.18, we can summarize our observation on split
information as under:
Split information is 0 when there is a single attribute value. It is a trivial case and
implies the minimum possible value of split information.
For a given data set, when instances are uniformly distributed with respect to the
attribute values, split information increases as the number of different attribute
values increases.
The maximum value of split information occurs when there are many possible
attribute values, all equally frequent.
Note:
Split information varies between 0 and \log_2 m (both inclusive)
On the other hand, split information forms the denominator in the gain ratio
formula.
This implies that the higher the value of split information, the lower the gain ratio;
in turn, this penalizes attributes that information gain alone would over-favour.
Further, information gain is large when there are many distinct attribute values, but
in that case the split information is also large.
In this way, split information reduces the value of the gain ratio, resulting in a
more balanced measure than information gain alone.
Like information gain (in ID3), the attribute with the maximum gain ratio is
selected as the splitting attribute in C4.5.
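As a sketch, the gain ratio of an attribute can be computed from the same per-value class counts used earlier for information gain (the counts below are the Age counts of OPTH; the function name is illustrative):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(subsets):
    # subsets[j] = class counts in the j-th subset D_j produced by the split.
    N = sum(sum(s) for s in subsets)
    class_totals = [sum(col) for col in zip(*subsets)]
    gain  = entropy(class_totals) - sum((sum(s) / N) * entropy(s) for s in subsets)
    split = entropy([sum(s) for s in subsets])       # split information E_A*(D)
    return gain / split if split > 0 else 0.0

print(round(gain_ratio([[2, 2, 4], [1, 2, 5], [1, 1, 6]]), 4))   # gain ratio of Age
```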
Practice:
Using Gain ratio as the measurement of splitting attributes, draw the decision
trees for OPTH and EMP data sets. Give calculation of gain ratio at each node.
Summary of Decision Tree Induction Algorithms
We have learned the building of a decision tree given a training data.
For a given training data D, the important task is to build the decision tree so
that:
All test data can be classified accurately
The tree is balanced and has as small a depth as possible, so that classification
can be done at a faster rate.
2. Missing data and noise: Decision tree induction algorithms are quite robust to
the data set with missing values and presence of noise. However, proper data pre-
processing can be followed to nullify these discrepancies.
6. Tree Pruning: A sub-tree may be replicated two or more times in a decision tree
(see the figure below). This makes the decision tree unnecessarily large and cumbersome
for classifying a test record. To avoid such a sub-tree replication problem, all sub-trees
except one can be pruned from the tree.
(Figure: a decision tree rooted at A in which the same sub-tree, with internal nodes C and D and leaf labels 0 and 1, appears on two different branches.)
Notes on Decision Tree Induction algorithms
7. Decision tree equivalence: The different splitting criteria followed in
different decision tree induction algorithms have little effect on the
performance of the algorithms, because the different heuristic measures (such
as information gain α, Gini index γ and gain ratio β) are quite consistent with
each other.