Ch4: Supervised Learning


Data Mining

Decision Tree Induction and Classification

Nazar Zaki, PhD.


Professor, College of Information Technology, UAEU
Overview

• Linear regression
• Process of Classification
• Decision Tree (DT) Approach for Classification
• DT Algorithm
• Attribute Selection Measures
• Tree Pruning for Solving Model Overfitting
• Evaluating Performance of Decision Trees
• Decision Tree Induction in Weka
Process of Classification

Training, Validation and Testing Examples


[Figure: a three-stage workflow. (1) Model construction: a construction method applied to the training records (training set) produces an initial model. (2) Model refinement: a refinement method applied to the validation records turns the initial model into a refined model. (3) Measuring model accuracy: the refined model is evaluated on the test set, yielding the final model with measured accuracy. Typical split ratios shown: 2/3:1/3 and 50:50.]
Categories of DM Problem and Output Patterns

(What we will learn in this chapter)
Regression

Here, we will learn how to derive a linear regression model from one-dimensional data: the number of features is 1, denoted x.
Regression (least squares)

[Figures: step-by-step worked example of fitting a least-squares line to the data]
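The standard closed-form least-squares solution for one feature minimises Σ(yi − b0 − b1·xi)² and gives b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. A minimal Python sketch (the data points are made up for illustration):

```python
import numpy as np

# Toy one-dimensional data (illustrative values, not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Closed-form least squares for y = b0 + b1 * x
x_mean, y_mean = x.mean(), y.mean()
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()  # slope
b0 = y_mean - b1 * x_mean                                             # intercept
print(f"y = {b0:.3f} + {b1:.3f} * x")
```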
Regression (prediction error: R^2)

In statistics, the coefficient of determination, denoted R^2 or r^2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
Regression (prediction error: R^2)

R^2 is a measure of the goodness of fit of a model. In regression, the R^2 coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R^2 of 1 indicates that the regression predictions perfectly fit the data.
Regression (prediction error: R^2)

• R-squared approaches 0 when there is no relationship between the regression estimates and the actual values.
Regression (prediction error: R^2)

• R-squared approaches 1 when the regression estimates closely match the actual values.
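A minimal sketch of the R^2 computation, using SS_res (residual sum of squares) and SS_tot (total sum of squares); the toy data repeats the least-squares example above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Fit the least-squares line, then score it with R^2 = 1 - SS_res / SS_tot
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x

ss_res = ((y - y_pred) ** 2).sum()    # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation about the mean
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")  # near 1: good fit; near 0: no linear relationship
```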
Classification
Process of Classification

Influential Factors for a Good Model

• Classification Accuracy
• Estimated accuracy during development stage vs. actual
accuracy during practical use
• Classification Performance
• Time taken for model construction
• Time taken for classification
• Comprehensibility of the model
• Ease of interpreting decisions by the classification model
A set of classification techniques
Method 1:
Decision trees
Decision Tree Induction

The Approach

Input: training examples with descriptive attributes and a class attribute.
• Root node: top attribute
• Internal nodes: other attributes
• Links: attribute values
• Leaf/decision nodes: class labels
Different algorithms to construct decision trees

• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (Chi-square Automatic Interaction Detection): performs multi-level splits when computing classification trees
• MARS: extends decision trees to handle numerical data better
Decision Tree Induction

Principle of Tree Construction

1. Determine which attribute should be selected as the root of the current tree
2. Partition the input examples into subsets according to the values of the selected root attribute
3. Construct a decision tree recursively for each subset
4. Connect the roots of the subtrees to the root of the whole tree via labelled links
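A minimal recursive sketch of these four steps; the (attribute_dict, label) data layout and the pluggable choose_attribute heuristic are illustrative assumptions (selection measures are covered next):

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    # examples: list of (attribute_dict, class_label) pairs
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class label

    root = choose_attribute(examples, attributes)     # step 1: select root attribute
    tree = {root: {}}
    remaining = [a for a in attributes if a != root]
    for value in {row[root] for row, _ in examples}:  # step 2: partition by value
        subset = [(row, label) for row, label in examples if row[root] == value]
        tree[root][value] = build_tree(subset, remaining, choose_attribute)  # step 3
    return tree                                       # step 4: subtrees linked by value
```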
Decision Tree Induction

Random Selection of Attributes


Criterion for attribute selection: taking single
attribute
Which is the best attribute?
Criterion for attribute selection: taking single
attribute
Which is the best attribute?
Criterion for attribute selection: computing likelihood / probability

Counts and likelihoods from the weather data (14 days: Play = Yes on 9, No on 5); each entry reads: likelihood given Yes | likelihood given No:

Outlook:      Sunny 2/9 | 3/5    Overcast 4/9 | 0/5    Rainy 3/9 | 2/5
Temperature:  Hot 2/9 | 2/5      Mild 4/9 | 2/5        Cool 3/9 | 1/5
Humidity:     High 3/9 | 4/5     Normal 6/9 | 1/5
Windy:        False 6/9 | 2/5    True 3/9 | 3/5
Play:         Yes 9/14           No 5/14

An unseen day to classify:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
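The same arithmetic as a quick sketch (fractions taken straight from the likelihood table above):

```python
# Unseen day: Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
p_yes = like_yes / (like_yes + like_no)             # ≈ 0.205
p_no  = like_no  / (like_yes + like_no)             # ≈ 0.795
print(round(p_yes, 3), round(p_no, 3))
```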
Criterion for attribute selection: impurity and
information gain
Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
• Information gain increases with the average purity of the subsets
• Strategy: choose attribute that gives greatest information gain

Computing information gain (entropy)

• Measure information in bits
• Given a probability distribution, the info required to predict an event is the distribution’s entropy
• Entropy gives the information required in bits (can involve fractions of bits!)
• Formula for computing the entropy:

entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
Example: attribute Outlook

[Figure: worked example. Outlook = Sunny gives class counts [2,3], Overcast [4,0], Rainy [3,2] (as in the likelihood table earlier); combining them yields gain(Outlook) ≈ 0.247 bits, as computed in the sketch below.]
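A small sketch of the entropy formula and the resulting information gain for Outlook (class counts from the table earlier; overall distribution 9 yes / 5 no):

```python
from math import log2

def entropy(counts):
    # Entropy in bits of a class distribution given as a list of counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

base = entropy([9, 5])                               # info([9,5]) ≈ 0.940 bits
split = (5/14) * entropy([2, 3]) \
      + (4/14) * entropy([4, 0]) \
      + (5/14) * entropy([3, 2])                     # expected info after split
print(round(base - split, 3))                        # gain(Outlook) ≈ 0.247 bits
```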
Naïve Bayes for classification

• Classification learning: what is the probability of the class given an instance?


• Evidence E = instance’s non-class attribute values
• Event H = class value of instance
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
• Probability of an event H given observed evidence E:

P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
  • Probability of event before evidence is seen
• A posteriori probability of H: P(H | E)
  • Probability of event after evidence is seen
Method 2:
Naïve Bayes
Naïve Bayes for classification

Frequency and likelihood table

Training data (Outlook, Play):
0 Rainy Yes     5 Rainy Yes      10 Sunny Yes
1 Sunny Yes     6 Sunny Yes      11 Rainy No
2 Overcast Yes  7 Overcast Yes   12 Overcast Yes
3 Overcast Yes  8 Rainy No       13 Overcast Yes
4 Sunny No      9 Sunny No

Weather    Yes  No   Total
Overcast   5    0    5/14 = 0.35
Rainy      2    2    4/14 = 0.29
Sunny      3    2    5/14 = 0.35
All        10/14 = 0.71    4/14 = 0.29

Problem: if the weather is “Sunny”, should the player play or not?

P(H | E) = P(E | H) P(H) / P(E)

Applying Bayes’ theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.3 × 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 × 0.29 / 0.35 = 0.41

P(Yes|Sunny) > P(No|Sunny), hence on a Sunny day the player can play the game.
Naïve Bayes for classification

Training data (Outlook, Play): the same 14 records as above.

Problem: if the weather is “Rainy”, should the player play or not?

P(H | E) = P(E | H) P(H) / P(E)
Naïve Bayes for classification

• Naïve assumption: evidence splits into parts (i.e., attributes) that are conditionally independent:
  P(H | E) = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)

Evidence E:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class “yes”:
P(yes | E) = P(Outlook = Sunny | yes)
           × P(Temperature = Cool | yes)
           × P(Humidity = High | yes)
           × P(Windy = True | yes)
           × P(yes) / P(E)
         = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)

Why did we ignore P(E)? It is the same for every class, so it cancels when the class scores are normalized.
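A generic sketch of this computation; the (attribute_dict, label) layout is an assumption for illustration, and P(E) is never computed because normalising the class scores makes it cancel:

```python
from collections import Counter

def naive_bayes(train, instance):
    # train: list of (attribute_dict, class_label); returns P(class | instance)
    class_counts = Counter(label for _, label in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)                     # prior P(H)
        for attr, value in instance.items():         # product of P(Ei | H)
            matches = sum(1 for row, label in train
                          if label == c and row[attr] == value)
            score *= matches / n_c
        scores[c] = score
    total = sum(scores.values())                     # P(E) cancels here
    return {c: s / total for c, s in scores.items()}
```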
Missing values

• Training: an instance with a missing value is not included in the frequency count for that attribute value–class combination
• Classification: the attribute is omitted from the calculation
• Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%
Association rules: Item sets for weather data

Total number of item sets with a minimum support of at least two instances: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets.

All rules with support > 1 and confidence = 100%:
• In total: 3 rules with support four, 5 with support three, 50 with support two
Problem of
overfitting
The Problem of Model Overfitting

Problem Description (overfitting)


• Any algorithms that build a
classification model based on a finite
set of training examples tend to have
the problem of model overfitting: the
model induced from the training
examples fits the training examples
too well. It reflects the specific
features of the examples than
features of actual data at large.
• Causes of the problem include
presence of noise data, lack of
representative training examples, etc.
The Problem of Model Overfitting
Approaches to Tackle the Problem
• Early stopping rule (also known as pre-pruning): tree construction is
halted at some stage.
• Tree pruning (also known as post-pruning): to “cut back” certain
subtrees and replace them with leaf nodes, making the tree smaller
and more robust.

Pruning methods:
• Reduced error pruning
• Cost complexity pruning
• Pessimistic pruning and many others
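A reduced-error pruning sketch over the dict-based tree from the earlier construction sketch; the classify helper and the shortcut of taking the majority over leaf labels (rather than over training counts) are simplifying assumptions:

```python
from collections import Counter

def classify(node, row):
    # Descend to a leaf; assumes every attribute value was seen in training
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][row[attr]]
    return node

def leaf_labels(node):
    if not isinstance(node, dict):
        return [node]
    attr = next(iter(node))
    return [l for child in node[attr].values() for l in leaf_labels(child)]

def reduced_error_prune(node, validation):
    # validation: held-out list of (attribute_dict, class_label) pairs
    if not isinstance(node, dict):
        return node
    attr = next(iter(node))
    for value, child in node[attr].items():          # prune bottom-up first
        subset = [(r, y) for r, y in validation if r[attr] == value]
        node[attr][value] = reduced_error_prune(child, subset)
    majority = Counter(leaf_labels(node)).most_common(1)[0][0]
    err_tree = sum(classify(node, r) != y for r, y in validation)
    err_leaf = sum(majority != y for _, y in validation)
    return majority if err_leaf <= err_tree else node  # cut back if no worse
```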
Classification:
Evaluation
Evaluating Tree Accuracy

What to Evaluate?
• Accuracy is measured in terms of error rate
• Details of errors are shown in a confusion matrix

e.g. [Figure: confusion matrix of actual vs. predicted classes]

Overall error rate = number of misclassified records / total number of records
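For example (made-up counts), with rows as actual classes and columns as predicted classes:

```python
import numpy as np

cm = np.array([[50, 10],     # actual class 0: 50 correct, 10 misclassified
               [ 5, 35]])    # actual class 1: 5 misclassified, 35 correct
error_rate = 1 - np.trace(cm) / cm.sum()   # off-diagonal total / grand total
print(f"overall error rate = {error_rate:.2%}")  # 15.00%
```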


Evaluating Tree Accuracy
How to Evaluate?
• Normally, a separate independently sampled test set is used to
measure accuracy of a classification model

• Evaluation methods:
• Holdout Method: divide data set 50-50 as training and test sets
• Random Subsampling: use the holdout method several times and take the
average of the accuracy.
• Bootstrap: use sampling with replacement. Training examples are also test
examples.
• Cross Validation: the data set is divided into k equal-size partitions. For each
round of decision tree induction, one partition is used for testing and the
rest used for training. After k rounds, the average error rate is used.
• Leave-One-Out: a particular form of cross-validation
Holdout evaluation
• What to do if the amount of data is limited?
• The holdout method reserves a certain amount for testing and uses the
remainder for training
• Usually: one third for testing, the rest for training
• Problem: the samples might not be representative
• Example: class might be missing in the test data
• Advanced version uses stratification
• Ensures that each class is represented with approximately equal proportions in
both subsets
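A stratified holdout sketch, assuming scikit-learn is available (toy arrays for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)      # toy feature matrix
y = np.array([0] * 20 + [1] * 10)     # classes in a 2:1 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # ~2:1 preserved in both subsets
```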

Random subsampling / Bootstrapping

• Random subsampling performs k iterations over the entire dataset, i.e., we form k replicas of the given data.
• For each iteration (each replica), a fixed number of observations is chosen and kept aside as the test set.
• The model is fitted to the training set in each iteration, and an estimate of the prediction error is obtained from each test set.
• Let the estimated prediction error (PE) on the i-th test set be denoted Ei. The true error estimate is obtained as the average of the separate estimates Ei.

Problem: the same records might appear in the test sets of several iterations.
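A quick sketch of sampling with replacement (14 records, echoing the weather data); it also shows why test records can overlap between rounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 14
boot = rng.choice(n, size=n, replace=True)      # bootstrap training sample
out_of_bag = np.setdiff1d(np.arange(n), boot)   # records never drawn
# Some records repeat inside `boot`; across repeated iterations the
# out-of-bag records used for testing can appear in several test sets.
print(sorted(boot.tolist()), out_of_bag.tolist())
```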
(K-fold) Cross-validation evaluation

• Cross-validation helps to compare and select an appropriate model for the specific predictive modeling problem. [Figure: example of 4-fold cross-validation]

1. Divide the dataset into two parts: one for training, the other for testing
2. Train the model on the training set
3. Validate the model on the test set
4. Repeat steps 1–3 a couple of times; this number depends on the CV method that you are using
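A 10-fold sketch with scikit-learn (assumed available; the iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 folds
```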
Leave-one out evaluation

1. Choose one sample from the dataset; it will be the test set
2. The remaining n − 1 samples will be the training set
3. Train the model on the training set; on each iteration, a new model must be trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1–5 n times, as for n samples we have n different training and test sets
7. To get the final score, average the results obtained in step 5
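The same idea with a leave-one-out splitter (scikit-learn assumed; n rounds, one test record per round):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=LeaveOneOut())
print(scores.mean())  # final score: average over the n single-record tests
```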
Decision Tree Induction in Weka

Overview
• ID3 (only works with categorical attributes)
• J48 (Java implementation of C4.5)
• RandomTree (with K attributes)
• RandomForest (a forest of random trees)
• REPTree (regression tree with reduced error pruning)
• BFTree (best-first tree, using Gain or Gini)
• FT (functional tree, logistic regression as split nodes)
• SimpleCart (CART with cost-complexity pruning)
Related Issues

Strengths
• Capability of generating understandable rules
• Efficiency in classifying unseen data
• Ability to indicate the most important attributes for classification

Weaknesses
• Error rate increases when the training set contains only a small number of instances of a large variety of classes
• Computationally expensive to build
Decision Tree Induction in Practice

Key Stages in a Classification Project with Decision Tree


Induction
• Selection of the class attribute
• Inclusion of descriptive attributes (manual selection of the attributes to be used is not recommended)
• Ensuring sufficient coverage of all classes and all descriptive values
• Data transformation, if necessary, to suit the solution
• Using a suitable test option
• Setting solution parameters (minimum number of leaf node
examples, pruned or unpruned)
• Evaluating model accuracy (confusion matrix details)
Regression and
Classification in
Weka
Decision Tree Induction in Weka
Preparation
• Pre-processing attributes if necessary
• Specifying the class attribute
• Selecting attributes
Decision Tree Induction in Weka
Constructing Classification Models (ID3)

1. Choosing a method and setting parameters
2. Setting a test option
3. Starting the process
4. Viewing the model and evaluation results
5. Selecting the option to view the tree
Decision Tree Induction in Weka
J48 (unpruned tree)
Decision Tree Induction in Weka
Random Tree
Decision Tree Induction in Weka

Classifying Unseen Records


1. Preparing unseen records in an ARFF file: class values are left as “unknown” (“?”)
Decision Tree Induction in Weka

Classifying Unseen Records


2. Classifying unseen records in the file
   1. Select this option and click the Set… button
   2. Press the button and load the file
   3. Press to start the classification
Decision Tree Induction in Weka

Classifying Unseen Records


3. Saving classification results into a file
   1. Select the option to pop up the visualisation
   2. Set both X and Y to instance_number
   3. Save the results into a file
Decision Tree Induction in Weka

Classifying Unseen Records


4. Classification results in an ARFF file: class labels assigned
Class Activities
Heart disease dataset (P = heart disease
present, N = heart disease absent)
a) Calculate the information gain over the
attribute Blood Pressure.
b) Given that gain(BodyWeight) = 0.0275 bits,
gain(BodyHeight) = 0.2184 bits,
gain(BodySugarLevel) =0.1407 bits and
gain(Habits) = 0.0721 bits which attribute
should be selected by ID3 algorithm?
c) Use the data in Table 6.4 as training
dataset and perform the following using
Weka:
i. Use the J48 to construct a decision tree
model and use the data in Table 6.6 as
testing dataset.
ii. Determine the classes for the unseen data
records in Table 6.5.
Homework: Other Classification Techniques
Other Classification Techniques to be presented by group members (how
it works, advantages/disadvantages, and example):

1. Support Vector Machines (G1, G10)


2. Neural Network (G2, G9)
3. Naïve Bayes Method (G3, G8)
4. K-Nearest Neighbour (G4, G7)
5. Random Forest (G5,G6)

Group homework: 10 Points


10 min max.

You might also like