Data Mining: Model Overfitting
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
03/26/2018 Introduction to Data Mining, 2nd Edition 1
Classification Errors
• Training errors (apparent errors)
– Errors committed on the training set
• Test errors
– Errors committed on the test set
• Generalization errors
– Expected error of a model over a random selection of records from the same distribution

Example Data Set
Two class problem:
+ : 5200 instances
• 5000 instances generated from a Gaussian centered at (10,10)
• 200 noisy instances added
o : 5200 instances
• Generated from a uniform distribution
10% of the data used for training and 90% of the data used for testing
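A minimal sketch of how a data set like this could be generated, using only the Python standard library. The slide does not give the Gaussian's spread, the range of the uniform distribution, or how the 200 noisy instances were drawn, so `sigma`, `low`, `high`, and the choice to draw the noisy instances uniformly are all assumptions made for illustration.

```python
import random

def make_dataset(seed=0, sigma=1.0, low=0.0, high=20.0):
    """Build the two-class data set as (x, y, label) tuples.

    sigma, low, and high are assumed values; the slide does not state them.
    """
    rng = random.Random(seed)
    # Class "+": 5000 Gaussian instances centered at (10, 10)
    plus = [(rng.gauss(10, sigma), rng.gauss(10, sigma), "+") for _ in range(5000)]
    # Class "+": 200 noisy instances (assumed here to be uniformly scattered)
    plus += [(rng.uniform(low, high), rng.uniform(low, high), "+") for _ in range(200)]
    # Class "o": 5200 instances from a uniform distribution
    circle = [(rng.uniform(low, high), rng.uniform(low, high), "o") for _ in range(5200)]
    data = plus + circle
    rng.shuffle(data)
    return data

def holdout_split(data, train_fraction=0.1):
    """Use the first train_fraction of the (already shuffled) data for training."""
    n_train = int(round(len(data) * train_fraction))
    return data[:n_train], data[n_train:]

data = make_dataset()
train, test = holdout_split(data)
print(len(train), len(test))  # 1040 9360
```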
Increasing number of nodes in Decision Trees
Figures: a decision tree with 4 nodes and a decision tree with 50 nodes, each shown with its decision boundaries on the training data
Which tree is better?

Model Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, training error is small but test error is large

Model Overfitting
Using twice the number of data instances
• If training data is under-representative, testing errors increase and training errors decrease as the number of nodes increases
• Increasing the size of training data reduces the difference between training and testing errors at a given number of nodes
Model Overfitting
Figure: two panels, each titled "Decision Tree with 50 nodes"; slide caption: using twice the number of data instances
Reasons for Model Overfitting
• Limited Training Size
• High Model Complexity
– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days

Day 1   Up
Day 2   Down
Day 3   Down
Day 4   Up
Day 5   Down
Day 6   Down
Day 7   Up
Day 8   Up
Day 9   Up
Day 10  Down

• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
$P(\#\,\text{correct} \ge 8) = \dfrac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$
• Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
$P(\#\,\text{correct} \ge 8) = 1 - (1 - 0.0547)^{50} = 0.9399$
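The two probabilities above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `prob_at_least` is chosen here for illustration.

```python
from math import comb

def prob_at_least(n_guesses, min_correct, p=0.5):
    """Probability of at least min_correct successes in n_guesses independent guesses."""
    return sum(
        comb(n_guesses, k) * p**k * (1 - p) ** (n_guesses - k)
        for k in range(min_correct, n_guesses + 1)
    )

# One analyst: at least 8 correct out of 10 random guesses
p_single = prob_at_least(10, 8)

# 50 independent analysts: at least one of them gets 8 or more correct
p_any = 1 - (1 - p_single) ** 50

print(round(p_single, 4), round(p_any, 4))  # 0.0547 0.9399
```

Picking the best of 50 random guessers almost guarantees finding one who looks skilled, which is the multiple comparison effect.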
• Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M’ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M’ if improvement, Δ(M, M’) > α
• Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
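The greedy step and its risk can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure of any particular algorithm: every candidate component is pure noise, its apparent improvement is the number of correct predictions beyond chance on 20 records, and the threshold `alpha = 4` is an arbitrary illustrative value. All function names are hypothetical.

```python
import random

def greedy_step(model, gains, alpha):
    """Add the candidate with the largest apparent gain if that gain exceeds alpha.

    model: frozenset of component names; gains: dict of component -> apparent gain.
    """
    best = max(gains, key=gains.get)
    return model | {best} if gains[best] > alpha else model

def noise_gain(rng, n_records=20):
    """Apparent gain of an irrelevant component: correct predictions beyond
    n_records / 2, where each prediction is right with probability 0.5."""
    # Counting the 1-bits of n_records random bits is a Binomial(n_records, 0.5) draw.
    correct = bin(rng.getrandbits(n_records)).count("1")
    return correct - n_records // 2

def fraction_accepting_noise(k, alpha=4, trials=500, seed=0):
    """Fraction of trials in which a noise component is added, given k candidates."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        gains = {f"gamma_{i}": noise_gain(rng) for i in range(k)}
        if greedy_step(frozenset(), gains, alpha) != frozenset():
            accepted += 1
    return accepted / trials

rate_1 = fraction_accepting_noise(1)
rate_100 = fraction_accepting_noise(100)
print(rate_1, rate_100)
```

With these settings the exact probabilities are about 0.02 for a single candidate and about 0.88 for 100 candidates, so the simulated rates should land close to those values.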

Effect of Multiple Comparison - Example
• Use additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes.
• Use 30% of the data for training and 70% of the data for testing
Figure panel: using only X and Y as attributes
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records
• Need ways for estimating generalization errors

Model Selection
• Performed during model building
• Purpose is to ensure that model is not overly complex (to avoid overfitting)
• Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
– Estimating Statistical Bounds
Model Selection: Using Validation Set
• Divide training data into two parts:
– Training set: use for model building
– Validation set: use for estimating generalization error
• Note: validation set is not the same as test set
• Drawback:
– Less data available for training
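As a rough illustration of validation-based model selection, the sketch below fits models of increasing complexity on a training part and keeps the one with the lowest validation error. Everything here is an assumption for the example: the toy 1-D data, the histogram classifier (where the number of bins stands in for model complexity), and the 2/3 to 1/3 split.

```python
import random

def fit_histogram(train, n_bins):
    """Majority-class label per equal-width bin on [0, 1); n_bins controls complexity."""
    counts = [[0, 0] for _ in range(n_bins)]
    for x, y in train:
        counts[min(int(x * n_bins), n_bins - 1)][y] += 1
    return [0 if c[0] >= c[1] else 1 for c in counts]

def error_rate(model, records):
    """Fraction of records whose bin label differs from the true label."""
    n_bins = len(model)
    wrong = sum(model[min(int(x * n_bins), n_bins - 1)] != y for x, y in records)
    return wrong / len(records)

rng = random.Random(0)

def sample(n):
    """Hypothetical toy data: label is 1 when x > 0.5, flipped with probability 0.2."""
    out = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        out.append((x, y))
    return out

labeled = sample(300)
# Divide the labeled data: 2/3 for model building, 1/3 for validation (assumed ratio)
train, validation = labeled[:200], labeled[200:]

candidates = [1, 2, 4, 8, 16, 32, 64, 128]
val_errors = {b: error_rate(fit_histogram(train, b), validation) for b in candidates}
best_bins = min(candidates, key=lambda b: val_errors[b])
print(best_bins, val_errors[best_bins])
```

The test set plays no part in this choice; it would only be used afterwards to evaluate the selected model.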

Model Selection: Incorporating Model Complexity
• Rationale: Occam’s Razor
– Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
– A complex model has a greater chance of being fitted accidentally by errors in data
– Therefore, one should include model complexity when evaluating a model
$\text{Gen. Error}(\text{Model}) = \text{Train. Error}(\text{Model}, \text{Train. Data}) + \alpha \times \text{Complexity}(\text{Model})$
Estimating the Complexity of Decision Trees
• Pessimistic Error Estimate of decision tree T with k leaf nodes:
$err_{gen}(T) = err(T) + \Omega \times \dfrac{k}{N_{train}}$
– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α); relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records

Estimating the Complexity of Decision Trees: Example
e(T_L) = 4/24
e(T_R) = 6/24
Ω = 1
e_gen(T_L) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(T_R) = 6/24 + 1 × 4/24 = 10/24 = 0.417
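The arithmetic in this example can be reproduced with a short function. This is a sketch; the function and parameter names are chosen here, with `omega` standing for the trade-off hyper-parameter, and the leaf counts (7 and 4) are read off the example's own numbers.

```python
def pessimistic_generalization_error(train_errors, n_leaves, n_train, omega=1.0):
    """Training error rate plus a complexity penalty of omega per leaf node."""
    return train_errors / n_train + omega * n_leaves / n_train

# Left tree: 4 training errors, 7 leaf nodes; right tree: 6 errors, 4 leaf nodes
e_gen_left = pessimistic_generalization_error(4, 7, 24)
e_gen_right = pessimistic_generalization_error(6, 4, 24)

print(round(e_gen_left, 3), round(e_gen_right, 3))  # 0.458 0.417
```

Although the left tree has the lower training error, the right tree has the lower pessimistic estimate once the leaf penalty is included.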
Estimating the Complexity of Decision Trees
• Resubstitution Estimate:
– Using training error as an optimistic estimate of generalization error
– Referred to as optimistic error estimate
e(T_L) = 4/24
e(T_R) = 6/24

Minimum Description Length (MDL)
Figure: a table of records X1, X2, …, Xn with known class labels y (A), the same records with unknown labels "?" (B), and a decision tree with test conditions A?, B?, and C? as the model
• Cost(Model,Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
• Cost(Data|Model) encodes the misclassification errors.
• Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
Estimating Statistical Bounds
$e'(N, e, \alpha) = \dfrac{e + \dfrac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}$
Before splitting: e = 2/7, e’(7, 2/7, 0.25) = 0.503
e’(T) = 7 × 0.503 = 3.521
After splitting:
e(T_L) = 1/4, e’(4, 1/4, 0.25) = 0.537
e(T_R) = 1/3, e’(3, 1/3, 0.25) = 0.650
e’(T) = 4 × 0.537 + 3 × 0.650 = 4.098
Therefore, do not split
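The bound and the split decision above can be recomputed with the standard library. This sketch assumes that z is the upper α/2 quantile of the standard normal distribution, which reproduces the slide's numbers; the function name is chosen here.

```python
from math import sqrt
from statistics import NormalDist

def upper_error_bound(n, e, alpha):
    """Upper bound e'(N, e, alpha) on the error rate of a node with n records
    and observed error rate e."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, assumed upper quantile
    numerator = e + z * z / (2 * n) + z * sqrt(e * (1 - e) / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

# Before splitting: one node with 7 records and 2 errors
bound_before = upper_error_bound(7, 2 / 7, 0.25)   # about 0.503
errors_before = 7 * bound_before

# After splitting: nodes with 4 records (1 error) and 3 records (1 error)
bound_left = upper_error_bound(4, 1 / 4, 0.25)     # about 0.537
bound_right = upper_error_bound(3, 1 / 3, 0.25)    # about 0.650
errors_after = 4 * bound_left + 3 * bound_right

split = errors_after < errors_before
print(split)  # False
```

Small differences in the last digit relative to the slide come from the slide multiplying rounded bounds.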

Model Selection for Decision Trees
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
– More restrictive conditions:
    • Stop if number of instances is less than some user-specified threshold
    • Stop if class distribution of instances is independent of the available features (e.g., using χ² test)
    • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
    • Stop if estimated generalization error falls below a certain threshold
Model Selection for Decision Trees
• Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
    • Trim the nodes of the decision tree in a bottom-up fashion
    • If generalization error improves after trimming, replace sub-tree by a leaf node
    • Class label of leaf node is determined from majority class of instances in the sub-tree
– Subtree raising
    • Replace subtree with most frequently used branch
Example of Post-Pruning
Before splitting (single node):
Class = Yes: 20, Class = No: 10, Error = 10/30
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting on A? (branches A1, A2, A3, A4):
Branch   Class = Yes   Class = No
A1       8             4
A2       3             4
A3       4             1
A4       5             1
Training Error (After splitting) = 9/30
Pessimistic error (After splitting) = (9 + 4 × 0.5)/30 = 11/30
PRUNE!
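The pruning decision in this example can be sketched in a few lines. The function name is hypothetical, and the penalty of 0.5 per leaf is taken from the example's own arithmetic; each leaf is assumed to predict its majority class, so its errors equal its minority count.

```python
def pessimistic_error(leaf_counts, penalty=0.5):
    """leaf_counts: list of (yes, no) class counts, one pair per leaf node."""
    n_records = sum(yes + no for yes, no in leaf_counts)
    # Each leaf predicts its majority class, so it misclassifies the minority count
    errors = sum(min(yes, no) for yes, no in leaf_counts)
    return (errors + penalty * len(leaf_counts)) / n_records

before = pessimistic_error([(20, 10)])                        # (10 + 0.5)/30
after = pessimistic_error([(8, 4), (3, 4), (4, 1), (5, 1)])   # (9 + 4*0.5)/30

# Prune when the split does not lower the pessimistic error
prune = after >= before
print(round(before, 3), round(after, 3), prune)  # 0.35 0.367 True
```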
Examples of Post-pruning

Model Evaluation
• Purpose:
– To estimate performance of classifier on previously unseen data (test set)
• Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
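The k-fold partitioning described above can be sketched as an index generator. This is a minimal illustration with a hypothetical function name; it splits indices in order, so in practice the records would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k disjoint folds over range(n)."""
    # Spread any remainder over the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# 3-fold cross-validation over 9 records
folds = list(k_fold_indices(9, 3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)

# Leave-one-out is the special case k = n
loo = list(k_fold_indices(5, 5))
```

Every record appears in exactly one test fold, so averaging the per-fold errors uses each record once for testing.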
Cross-validation Example
• 3-fold cross-validation
