Lecture 8
Non-metric Methods
• Numerical Attributes
– Nearest-neighbor -- distance
– Neural networks: two similar inputs lead to similar outputs
– SVMs: Dot Product
Classifier (Decision Tree)
[Figure: example decision tree — root node age? with branches <=30, 30..40, and >40; the outer branches test further conditions (including credit rating: fair / excellent) before reaching yes / no leaves]
• Misclassification impurity
Note: there is a mistake in this slide. For a two-class problem, entropy has its maximum value = 1, while the other impurity measures (Gini index and misclassification error), for two-class problems, have their maximum value = 0.5.
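A small Python sketch (illustrative helper names, not from the slides) to verify these maxima for a two-class problem — entropy peaks at 1, Gini index and misclassification impurity peak at 0.5:

    import math

    def entropy(p):
        # entropy of a two-class distribution with P(class 1) = p
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def gini(p):
        # Gini index for two classes: 1 - p^2 - (1 - p)^2
        return 1 - p**2 - (1 - p)**2

    def misclassification(p):
        # misclassification impurity: 1 - max(p, 1 - p)
        return 1 - max(p, 1 - p)

    # all three measures are maximal at p = 0.5 for a two-class problem
    print(entropy(0.5), gini(0.5), misclassification(0.5))   # 1.0 0.5 0.5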
5. For each Dk ∈ D
BuildTD(Dk) // Recursive call
6. Stop
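A minimal recursive sketch of this BuildTD procedure in Python; the splitting criterion (best_split) is a placeholder and the data layout is an assumption, not the exact algorithm on the slide:

    from collections import Counter

    def best_split(D, attributes):
        # placeholder: ID3 would pick the attribute with maximum information gain here
        return attributes[0]

    def build_td(D, attributes):
        # D is a list of (record, class_label) pairs; record is a dict of attribute values
        labels = [y for _, y in D]
        if len(set(labels)) == 1 or not attributes:       # stopping condition
            return Counter(labels).most_common(1)[0][0]   # leaf node = majority class
        a = best_split(D, attributes)
        tree = {a: {}}
        for v in set(rec[a] for rec, _ in D):             # one branch per distinct value of a
            Dk = [(rec, y) for rec, y in D if rec[a] == v]
            tree[a][v] = build_td(Dk, [x for x in attributes if x != a])  # recursive call
        return tree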
– The test condition for a binary attribute generates only two outcomes
– Multi-way split: the outcome depends on the number of distinct values of the
corresponding attribute
– For a continuous attribute, decision tree induction must consider all possible split positions
• Range query: vi ≤ A < vi+1 for i = 1, 2, …, q (if q ranges are chosen)
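One common way to enumerate the possible split positions of a continuous attribute is to sort its distinct values and test the midpoints between consecutive values; a rough Python sketch (the attribute values are made up for illustration):

    def candidate_splits(values):
        # midpoints between consecutive distinct values of a continuous attribute A
        vs = sorted(set(values))
        return [(vs[i] + vs[i + 1]) / 2 for i in range(len(vs) - 1)]

    heights = [1.5, 1.6, 1.7, 1.9, 2.1, 2.5]
    print(candidate_splits(heights))
    # each midpoint t defines a binary test A < t vs. A >= t;
    # a range query vi <= A < vi+1 instead groups the values into q intervals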
Attributes:
Gender = {Male(M), Female (F)} // Binary attribute
Height = {1.5, …, 2.5} // Continuous attribute
– It partitions the training set into a number of smaller training sets based on the
distinct values of the attribute under split.
Calculating Information Gain
Attributes: Age, Eye-sight, Use Type, Astigmatic; class label: Class

Records with Use = 1:
Age  Eye  Ast  Use  Class
1    1    1    1    3
1    1    2    1    3
1    2    1    1    3
1    2    2    1    3
2    1    1    1    3
2    2    1    1    3
2    2    2    1    3
3    1    1    1    3
3    1    2    1    3
3    2    1    1    3
3    2    2    1    3

Records with Use = 2:
Age  Eye  Ast  Use  Class
1    1    1    2    2
1    1    2    2    1
1    2    1    2    2
1    2    2    2    1
2    1    1    2    2
2    1    2    2    1
2    2    1    2    2
3    1    1    2    3
3    1    2    2    3
3    2    1    2    2
3    2    2    2    3

[Figure: candidate splits on Age, Eye-sight, and Astigmatic within each partition]
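A short sketch of how the information gain of an attribute (e.g. Use) could be computed from a table like the one above, assuming the records are stored as (Age, Eye, Ast, Use, Class) tuples:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(records, attr_index, class_index=-1):
        # Gain(A) = Entropy(D) - sum_k (|Dk|/|D|) * Entropy(Dk), Dk = records with A = k
        base = entropy([r[class_index] for r in records])
        n = len(records)
        gain = base
        for v in set(r[attr_index] for r in records):
            Dk = [r for r in records if r[attr_index] == v]
            gain -= (len(Dk) / n) * entropy([r[class_index] for r in Dk])
        return gain

    # records = [(1, 1, 1, 1, 3), (1, 1, 1, 2, 2), ...]  # rows of the table above
    # print(info_gain(records, attr_index=3))            # gain of splitting on Use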
Splitting of Continuous Attribute Values
[Figure: examples of binary (Yes/No) splits that group attribute values, e.g. {O} vs. {Y, M} and {H} vs. {L, M}]
             Partition 1   Partition 2   Partition 3
Class 1           2             1             1
Class 2           2             2             1
Class 3           4             5             6
Column sum        8             8             8
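From such a class-count table, the impurity of a multi-way split can be computed as the weighted average of the per-partition impurities; a sketch using the Gini index (CART's measure), with the columns taken from the table above:

    def gini_from_counts(counts):
        # Gini index of one partition, given its class counts
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    # class counts (Class 1, Class 2, Class 3) per partition of the split
    columns = [(2, 2, 4), (1, 2, 5), (1, 1, 6)]
    total = sum(sum(col) for col in columns)              # 24 records in all
    split_gini = sum(sum(col) / total * gini_from_counts(col) for col in columns)
    print(split_gini)                                     # weighted Gini index of the 3-way split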
[Figure: decision tree fragment with nodes Job and Performance; the Performance node has branches P, G, A, E leading to leaves N, Y, Y, N; a test record to be classified is marked ?]
Decision Tree using CART
Note:
The decision tree induction algorithm ID3 may suffer from the overfitting problem.
– Distribution 1: Frequency 32  0  0  0
– Distribution 2: Frequency 16 16  0  0
– Distribution 3: Frequency 16  8  8  0
– Distribution 4: Frequency 16  8  4  4
– Distribution 5: Frequency  8  8  8  8
• On the other hand, split information forms the denominator in the gain ratio
formula.
– This implies that the higher the value of split information, the lower the gain ratio.
– In turn, this offsets the effect of a large information gain.
• Further, information gain is large when there are many distinct attribute
values.
– When there are many distinct values, split information is also large.
– In this way, split information reduces the gain ratio, thus resulting in a
balanced value for the information gain.
• Like information gain (in ID3), the attribute with the maximum gain ratio is
selected as the splitting attribute in C4.5.
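A sketch that computes the split information for the five example distributions above, showing how it grows with the number of non-empty partitions, and how it enters the gain ratio as a denominator:

    import math

    def split_info(freqs):
        # SplitInfo = -sum_k (|Dk|/|D|) * log2(|Dk|/|D|) over non-empty partitions
        n = sum(freqs)
        return -sum((f / n) * math.log2(f / n) for f in freqs if f > 0)

    distributions = [(32, 0, 0, 0), (16, 16, 0, 0), (16, 8, 8, 0), (16, 8, 4, 4), (8, 8, 8, 8)]
    for d in distributions:
        print(d, round(split_info(d), 3))   # 0.0, 1.0, 1.5, 1.75, 2.0

    def gain_ratio(gain, freqs):
        # GainRatio = InformationGain / SplitInfo (undefined when SplitInfo = 0)
        return gain / split_info(freqs)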
• For a given training data D, the important task is to build the decision tree
so that:
– All test data can be classified accurately
– The tree is balanced and of as small a depth as possible, so that
classification can be done at a faster rate.
2. Missing data and noise: Decision tree induction algorithms are quite robust to
data sets with missing values and noise. However, proper data
pre-processing can be applied to handle these discrepancies.
6. Tree Pruning: A sub-tree can be replicated two or more times in a decision tree
(see figure below). Such replication makes the tree unnecessarily large, although a
test record is still classified unambiguously. To avoid this sub-tree replication
problem, all replicated sub-trees except one can be pruned from the tree.
[Figure: decision tree with root A in which the sub-tree testing C and D, with leaves 0 and 1, is replicated under two different branches]
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (3rd Edn.), Morgan Kaufmann, 2015.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2014.