Ch4: Supervised Learning


Data Mining

Decision Tree Induction and Classification

Nazar Zaki, PhD.


Professor, College of Information Technology, UAEU
Overview

• Linear regression
• Process of Classification
• Decision Tree (DT) Approach for Classification
• DT Algorithm
• Attribute Selection Measures
• Tree Pruning for Solving Model Overfitting
• Evaluating Performance of Decision Trees
• Decision Tree Induction in Weka
Process of Classification

Training, Validation and Testing Examples


[Figure: a three-stage workflow. (1) Model construction: a construction method applied to the training records (training set) produces an initial model. (2) Model refinement: a refinement method applied to the validation records turns the initial model into a refined model. (3) Measuring model accuracy: the refined model is evaluated on the test set, yielding the final model with measured accuracy. Typical split ratios shown: 2/3:1/3 and 50:50.]
Categories of DM Problem and Output Patterns

(What we will learn in this chapter)
Regression

Here, we will learn how to derive a linear regression model from one-dimensional data: the number of features is 1, denoted x.
Regression (least squares)

[Figures: step-by-step worked example of fitting a least-squares line to the data]
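The standard closed-form least-squares solution for one feature minimises Σ(yi − b0 − b1·xi)² and gives b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. A minimal Python sketch (the data points are made up for illustration):

```python
import numpy as np

# Toy one-dimensional data (illustrative values, not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Closed-form least squares for y = b0 + b1 * x
x_mean, y_mean = x.mean(), y.mean()
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()  # slope
b0 = y_mean - b1 * x_mean                                             # intercept
print(f"y = {b0:.3f} + {b1:.3f} * x")
```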
Regression (prediction error: R^2)

In statistics, the coefficient of determination, denoted R^2 or r^2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
Regression (prediction error: R^2)

R^2 is a measure of the goodness of fit of a model. In regression, the R^2 coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R^2 of 1 indicates that the regression predictions perfectly fit the data.
Regression (prediction error: R^2)

• R-squared approaches 0 when there is no relationship between the regression estimates and the actual values.
Regression (prediction error: R^2)

• R-squared approaches 1 when the regression estimates closely match the actual values.
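A minimal sketch of the R^2 computation, using SS_res (residual sum of squares) and SS_tot (total sum of squares); the toy data repeats the least-squares example above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Fit the least-squares line, then score it with R^2 = 1 - SS_res / SS_tot
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x

ss_res = ((y - y_pred) ** 2).sum()    # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation about the mean
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")  # near 1: good fit; near 0: no linear relationship
```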
Classification
Process of Classification

Influential Factors for a Good Model

• Classification Accuracy
• Estimated accuracy during development stage vs. actual
accuracy during practical use
• Classification Performance
• Time taken for model construction
• Time taken for classification
• Comprehensibility of the model
• Ease of interpreting decisions by the classification model
A set of classification techniques
Method 1:
Decision trees
Decision Tree Induction

The Approach

Input: training examples with descriptive attributes and a class attribute.
• Root node: top attribute
• Internal nodes: other attributes
• Links: attribute values
• Leaf/decision nodes: class labels
Different algorithms to construct decision trees

• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (Chi-square Automatic Interaction Detection): performs multi-level splits when computing classification trees
• MARS: extends decision trees to handle numerical data better
Decision Tree Induction

Principle of Tree Construction

1. Determine which attribute should be selected as the root of the current tree
2. Partition the input examples into subsets according to the values of the selected root attribute
3. Construct a decision tree recursively for each subset
4. Connect the roots of the subtrees to the root of the whole tree via labelled links
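A minimal recursive sketch of these four steps; the (attribute_dict, label) data layout and the pluggable choose_attribute heuristic are illustrative assumptions (selection measures are covered next):

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    # examples: list of (attribute_dict, class_label) pairs
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class label

    root = choose_attribute(examples, attributes)     # step 1: select root attribute
    tree = {root: {}}
    remaining = [a for a in attributes if a != root]
    for value in {row[root] for row, _ in examples}:  # step 2: partition by value
        subset = [(row, label) for row, label in examples if row[root] == value]
        tree[root][value] = build_tree(subset, remaining, choose_attribute)  # step 3
    return tree                                       # step 4: subtrees linked by value
```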
Decision Tree Induction

Random Selection of Attributes


Criterion for attribute selection: taking single
attribute
Which is the best attribute?
Criterion for attribute selection: taking single
attribute
Which is the best attribute?
Criterion for attribute selection: computing likelihood / probability

Counts and likelihoods from the weather data (14 days: Play = Yes on 9, No on 5); each entry reads: likelihood given Yes | likelihood given No:

Outlook:      Sunny 2/9 | 3/5    Overcast 4/9 | 0/5    Rainy 3/9 | 2/5
Temperature:  Hot 2/9 | 2/5      Mild 4/9 | 2/5        Cool 3/9 | 1/5
Humidity:     High 3/9 | 4/5     Normal 6/9 | 1/5
Windy:        False 6/9 | 2/5    True 3/9 | 3/5
Play:         Yes 9/14           No 5/14

An unseen day to classify:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
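The same arithmetic as a quick sketch (fractions taken straight from the likelihood table above):

```python
# Unseen day: Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
p_yes = like_yes / (like_yes + like_no)             # ≈ 0.205
p_no  = like_no  / (like_yes + like_no)             # ≈ 0.795
print(round(p_yes, 3), round(p_no, 3))
```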
Criterion for attribute selection: impurity and
information gain
Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
• Information gain increases with the average purity of the subsets
• Strategy: choose attribute that gives greatest information gain

Computing information gain (entropy)

• Measure information in bits
• Given a probability distribution, the info required to predict an event is the distribution’s entropy
• Entropy gives the information required in bits (can involve fractions of bits!)
• Formula for computing the entropy:

entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
Example: attribute Outlook

[Figure: worked example. Outlook = Sunny gives class counts [2,3], Overcast [4,0], Rainy [3,2] (as in the likelihood table earlier); combining them yields gain(Outlook) ≈ 0.247 bits, as computed in the sketch below.]
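A small sketch of the entropy formula and the resulting information gain for Outlook (class counts from the table earlier; overall distribution 9 yes / 5 no):

```python
from math import log2

def entropy(counts):
    # Entropy in bits of a class distribution given as a list of counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

base = entropy([9, 5])                               # info([9,5]) ≈ 0.940 bits
split = (5/14) * entropy([2, 3]) \
      + (4/14) * entropy([4, 0]) \
      + (5/14) * entropy([3, 2])                     # expected info after split
print(round(base - split, 3))                        # gain(Outlook) ≈ 0.247 bits
```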
Naïve Bayes for classification

• Classification learning: what is the probability of the class given an instance?


• Evidence E = instance’s non-class attribute values
• Event H = class value of instance
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
• Probability of an event H given observed evidence E:

P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
  • Probability of event before evidence is seen
• A posteriori probability of H: P(H | E)
  • Probability of event after evidence is seen
Method 2:
Naïve Bayes
Naïve Bayes for classification

Frequency and likelihood table

Training data (Outlook, Play):
0 Rainy Yes     5 Rainy Yes      10 Sunny Yes
1 Sunny Yes     6 Sunny Yes      11 Rainy No
2 Overcast Yes  7 Overcast Yes   12 Overcast Yes
3 Overcast Yes  8 Rainy No       13 Overcast Yes
4 Sunny No      9 Sunny No

Weather    Yes  No   Total
Overcast   5    0    5/14 = 0.35
Rainy      2    2    4/14 = 0.29
Sunny      3    2    5/14 = 0.35
All        10/14 = 0.71    4/14 = 0.29

Problem: if the weather is “Sunny”, should the player play or not?

P(H | E) = P(E | H) P(H) / P(E)

Applying Bayes’ theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.3 × 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 × 0.29 / 0.35 = 0.41

P(Yes|Sunny) > P(No|Sunny), hence on a Sunny day the player can play the game.
Naïve Bayes for classification

Training data (Outlook, Play): the same 14 records as above.

Problem: if the weather is “Rainy”, should the player play or not?

P(H | E) = P(E | H) P(H) / P(E)
Naïve Bayes for classification

• Naïve assumption: evidence splits into parts (i.e., attributes) that are conditionally independent:
  P(H | E) = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)

Evidence E:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class “yes”:
P(yes | E) = P(Outlook = Sunny | yes)
           × P(Temperature = Cool | yes)
           × P(Humidity = High | yes)
           × P(Windy = True | yes)
           × P(yes) / P(E)
         = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)

Why did we ignore P(E)? It is the same for every class, so it cancels when the class scores are normalized.
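A generic sketch of this computation; the (attribute_dict, label) layout is an assumption for illustration, and P(E) is never computed because normalising the class scores makes it cancel:

```python
from collections import Counter

def naive_bayes(train, instance):
    # train: list of (attribute_dict, class_label); returns P(class | instance)
    class_counts = Counter(label for _, label in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)                     # prior P(H)
        for attr, value in instance.items():         # product of P(Ei | H)
            matches = sum(1 for row, label in train
                          if label == c and row[attr] == value)
            score *= matches / n_c
        scores[c] = score
    total = sum(scores.values())                     # P(E) cancels here
    return {c: s / total for c, s in scores.items()}
```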
Missing values

• Training: an instance with a missing value is not included in the frequency count for that attribute value–class combination
• Classification: the attribute is omitted from the calculation
• Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%
Association rules: Item sets for weather data

Total number of item sets with a minimum support of at least two instances: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets.

All rules with support > 1 and confidence = 100%:
• In total: 3 rules with support four, 5 with support three, 50 with support two
Problem of
overfitting
The Problem of Model Overfitting

Problem Description (overfitting)


• Any algorithms that build a
classification model based on a finite
set of training examples tend to have
the problem of model overfitting: the
model induced from the training
examples fits the training examples
too well. It reflects the specific
features of the examples than
features of actual data at large.
• Causes of the problem include
presence of noise data, lack of
representative training examples, etc.
The Problem of Model Overfitting
Approaches to Tackle the Problem
• Early stopping rule (also known as pre-pruning): tree construction is
halted at some stage.
• Tree pruning (also known as post-pruning): to “cut back” certain
subtrees and replace them with leaf nodes, making the tree smaller
and more robust.

Pruning methods:
• Reduced error pruning
• Cost complexity pruning
• Pessimistic pruning and many others
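A reduced-error pruning sketch over the dict-based tree from the earlier construction sketch; the classify helper and the shortcut of taking the majority over leaf labels (rather than over training counts) are simplifying assumptions:

```python
from collections import Counter

def classify(node, row):
    # Descend to a leaf; assumes every attribute value was seen in training
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][row[attr]]
    return node

def leaf_labels(node):
    if not isinstance(node, dict):
        return [node]
    attr = next(iter(node))
    return [l for child in node[attr].values() for l in leaf_labels(child)]

def reduced_error_prune(node, validation):
    # validation: held-out list of (attribute_dict, class_label) pairs
    if not isinstance(node, dict):
        return node
    attr = next(iter(node))
    for value, child in node[attr].items():          # prune bottom-up first
        subset = [(r, y) for r, y in validation if r[attr] == value]
        node[attr][value] = reduced_error_prune(child, subset)
    majority = Counter(leaf_labels(node)).most_common(1)[0][0]
    err_tree = sum(classify(node, r) != y for r, y in validation)
    err_leaf = sum(majority != y for _, y in validation)
    return majority if err_leaf <= err_tree else node  # cut back if no worse
```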
Classification:
Evaluation
Evaluating Tree Accuracy

What to Evaluate?
• Accuracy is measured in terms of error rate
• Details of errors are shown in a confusion matrix

e.g. [Figure: confusion matrix of actual vs. predicted classes]

Overall error rate = number of misclassified records / total number of records
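For example (made-up counts), with rows as actual classes and columns as predicted classes:

```python
import numpy as np

cm = np.array([[50, 10],     # actual class 0: 50 correct, 10 misclassified
               [ 5, 35]])    # actual class 1: 5 misclassified, 35 correct
error_rate = 1 - np.trace(cm) / cm.sum()   # off-diagonal total / grand total
print(f"overall error rate = {error_rate:.2%}")  # 15.00%
```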


Evaluating Tree Accuracy
How to Evaluate?
• Normally, a separate independently sampled test set is used to
measure accuracy of a classification model

• Evaluation methods:
• Holdout Method: divide data set 50-50 as training and test sets
• Random Subsampling: use the holdout method several times and take the
average of the accuracy.
• Bootstrap: use sampling with replacement. Training examples are also test
examples.
• Cross Validation: the data set is divided into k equal-size partitions. For each
round of decision tree induction, one partition is used for testing and the
rest used for training. After k rounds, the average error rate is used.
• Leave-One-Out: a particular form of cross-validation
Holdout evaluation
• What to do if the amount of data is limited?
• The holdout method reserves a certain amount for testing and uses the
remainder for training
• Usually: one third for testing, the rest for training
• Problem: the samples might not be representative
• Example: class might be missing in the test data
• Advanced version uses stratification
• Ensures that each class is represented with approximately equal proportions in
both subsets
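A stratified holdout sketch, assuming scikit-learn is available (toy arrays for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)      # toy feature matrix
y = np.array([0] * 20 + [1] * 10)     # classes in a 2:1 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # ~2:1 preserved in both subsets
```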

Random subsampling / Bootstrapping

• Random subsampling performs k iterations over the entire dataset, i.e., we form k replicas of the given data.
• For each iteration (each replica), a fixed number of observations is chosen and kept aside as the test set.
• The model is fitted to the training set in each iteration, and an estimate of the prediction error is obtained from each test set.
• Let the estimated prediction error (PE) on the i-th test set be denoted Ei. The true error estimate is obtained as the average of the separate estimates Ei.

Problem: the same records might appear in the test sets of several iterations.
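A quick sketch of sampling with replacement (14 records, echoing the weather data); it also shows why test records can overlap between rounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 14
boot = rng.choice(n, size=n, replace=True)      # bootstrap training sample
out_of_bag = np.setdiff1d(np.arange(n), boot)   # records never drawn
# Some records repeat inside `boot`; across repeated iterations the
# out-of-bag records used for testing can appear in several test sets.
print(sorted(boot.tolist()), out_of_bag.tolist())
```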
(K-fold) Cross-validation evaluation

• Cross-validation helps to compare and select an appropriate model for the specific predictive modeling problem. [Figure: example of 4-fold cross-validation]

1. Divide the dataset into two parts: one for training, the other for testing
2. Train the model on the training set
3. Validate the model on the test set
4. Repeat steps 1–3 a couple of times; this number depends on the CV method that you are using
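A 10-fold sketch with scikit-learn (assumed available; the iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 folds
```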
Leave-one out evaluation

1. Choose one sample from the dataset; it will be the test set
2. The remaining n − 1 samples will be the training set
3. Train the model on the training set; on each iteration, a new model must be trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1–5 n times, as for n samples we have n different training and test sets
7. To get the final score, average the results obtained in step 5
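The same idea with a leave-one-out splitter (scikit-learn assumed; n rounds, one test record per round):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=LeaveOneOut())
print(scores.mean())  # final score: average over the n single-record tests
```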
Decision Tree Induction in Weka

Overview
• ID3 (only works with categorical attributes)
• J48 (Java implementation of C4.5)
• RandomTree (with K attributes)
• RandomForest (a forest of random trees)
• REPTree (regression tree with reduced error pruning)
• BFTree (best-first tree, using Gain or Gini)
• FT (functional tree, logistic regression as split nodes)
• SimpleCart (CART with cost-complexity pruning)
Related Issues

Strengths
• Capability of generating understandable rules
• Efficiency in classifying unseen data
• Ability to indicate the most important attributes for classification

Weaknesses
• Error rate increases when the training set contains only a small number of instances of a large variety of classes
• Computationally expensive to build
Decision Tree Induction in Practice

Key Stages in a Classification Project with Decision Tree


Induction
• Selection of the class attribute
• Inclusion of descriptive attributes (manual selection of the attributes to be used is not recommended)
• Ensuring sufficient coverage of all classes and all descriptive values
• Data transformation, if necessary, to suit the solution
• Using a suitable test option
• Setting solution parameters (minimum number of leaf node
examples, pruned or unpruned)
• Evaluating model accuracy (confusion matrix details)
Regression and
Classification in
Weka
Decision Tree Induction in Weka
Preparation
• Pre-processing attributes if necessary
• Specifying the class attribute
• Selecting attributes
Decision Tree Induction in Weka
Constructing Classification Models (ID3)

1. Choosing a method and setting parameters
2. Setting a test option
3. Starting the process
4. Viewing the model and evaluation results
5. Selecting the option to view the tree
Decision Tree Induction in Weka
J48 (unpruned tree)
Decision Tree Induction in Weka
Random Tree
Decision Tree Induction in Weka

Classifying Unseen Records


1. Preparing unseen records in an ARFF file: class values are left as “unknown” (“?”)
Decision Tree Induction in Weka

Classifying Unseen Records


2. Classifying unseen records in the file
   1. Select this option and click the Set… button
   2. Press the button and load the file
   3. Press to start the classification
Decision Tree Induction in Weka

Classifying Unseen Records


3. Saving classification results into a file
   1. Select the option to pop up the visualisation
   2. Set both X and Y to instance_number
   3. Save the results into a file
Decision Tree Induction in Weka

Classifying Unseen Records


4. Classification results in an ARFF file: class labels assigned
Class Activities
Heart disease dataset (P = heart disease
present, N = heart disease absent)
a) Calculate the information gain over the
attribute Blood Pressure.
b) Given that gain(BodyWeight) = 0.0275 bits,
gain(BodyHeight) = 0.2184 bits,
gain(BodySugarLevel) =0.1407 bits and
gain(Habits) = 0.0721 bits which attribute
should be selected by ID3 algorithm?
c) Use the data in Table 6.4 as training
dataset and perform the following using
Weka:
i. Use the J48 to construct a decision tree
model and use the data in Table 6.6 as
testing dataset.
ii. Determine the classes for the unseen data
records in Table 6.5.
Homework: Other Classification Techniques
Other Classification Techniques to be presented by group members (how
it works, advantages/disadvantages, and example):

1. Support Vector Machines (G1, G10)


2. Neural Network (G2, G9)
3. Naïve Bayes Method (G3, G8)
4. K-Nearest Neighbour (G4, G7)
5. Random Forest (G5,G6)

Group homework: 10 Points


10 min max.

You might also like