Data Mining: Model Overfitting
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
Model Overfitting
Classification Errors
Test errors
– Errors committed on the test set
Generalization errors
– Expected error of a model over a random selection of records from the same distribution
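As a rough illustration (not from the book), the sketch below contrasts the test error measured on one finite test set with a Monte Carlo estimate of the generalization error; the synthetic data-generating function, the noise level, and the use of scikit-learn are assumptions made only for this example.

```python
# Sketch: test error vs. generalization error (data and parameters are
# illustrative assumptions, not from the book).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample(n):
    """Draw n labeled records from a fixed synthetic distribution."""
    X = rng.uniform(0, 20, size=(n, 2))
    y = (np.linalg.norm(X - 10, axis=1) < 6).astype(int)  # true concept: a disk
    y ^= rng.random(n) < 0.1                               # 10% label noise
    return X, y

X_train, y_train = sample(1000)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Test error: error rate on one particular held-out test set
X_test, y_test = sample(500)
test_err = np.mean(model.predict(X_test) != y_test)

# Generalization error: expected error over fresh records from the same
# distribution, approximated here by averaging over many independent samples
gen_err = np.mean([np.mean(model.predict(Xs) != ys)
                   for Xs, ys in (sample(500) for _ in range(50))])
print("test error:", round(test_err, 3),
      "estimated generalization error:", round(float(gen_err), 3))
```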
[Example data set figure: class 'o' has 5200 instances generated from a uniform distribution]
Decision Tree
Model Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, the training error is small but the test error is large
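A minimal sketch of this effect, assuming scikit-learn is available (the synthetic data and the depth range are illustrative choices, not from the book): training error keeps shrinking as the tree grows, while test error eventually rises again.

```python
# Sketch: training vs. test error as model complexity (tree depth) grows.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 20, size=(3000, 2))
y = (np.linalg.norm(X - 10, axis=1) < 6).astype(int)
y ^= rng.random(len(y)) < 0.1                      # noisy labels make overfitting visible

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for depth in (1, 2, 4, 8, 16, None):               # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    tr_err = 1 - tree.score(X_tr, y_tr)
    te_err = 1 - tree.score(X_te, y_te)
    print(f"max_depth={depth}: training error={tr_err:.3f}, test error={te_err:.3f}")
```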
Model Overfitting
Overfitting can also arise from a multiple comparison effect. Example task: predict whether the stock market will rise or fall on each of the next 10 trading days.
Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most correct predictions
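A small simulation of this selection effect (a sketch with assumed parameters: 50 analysts, 10 guesses each, fair-coin outcomes): the best analyst out of 50 looks much better than chance even though every guess is random.

```python
# Sketch: the multiple comparison effect. Every analyst guesses at random,
# yet the best of 50 analysts usually gets far more than 5 of 10 correct.
import numpy as np

rng = np.random.default_rng(42)
n_analysts, n_guesses, n_trials = 50, 10, 10_000

best_scores = []
for _ in range(n_trials):
    # Each guess is correct with probability 0.5, independently of everything else
    correct = rng.random((n_analysts, n_guesses)) < 0.5
    best_scores.append(correct.sum(axis=1).max())   # score of the chosen "best" analyst

print("average #correct of a single analyst: 5.0 (by construction)")
print("average #correct of the best of 50  :", float(np.mean(best_scores)))
```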
Notes on Overfitting
Overfitting produces models that are more complex than necessary, and the training error no longer provides a good estimate of how well the model will perform on previously unseen records. This motivates the methods for estimating generalization error discussed next.
Model Selection: Using a Validation Set
Divide the training data into two parts:
– Training set: used for model building
– Validation set: used for estimating generalization error
Note: the validation set is not the same as the test set
Drawback:
– Less data is available for training
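A minimal sketch of model selection with a validation set, assuming scikit-learn (the split ratios and the candidate depths are illustrative assumptions): candidate trees are compared on the validation split, and only the final chosen model is evaluated on the test set.

```python
# Sketch: model selection using a validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.uniform(0, 20, size=(3000, 2))
y = ((X[:, 0] - 10) ** 2 + (X[:, 1] - 10) ** 2 < 36).astype(int)

# Hold out a test set first; then split the remaining training data further
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=2)

best_model, best_val_err = None, 1.0
for depth in (1, 2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=2).fit(X_tr, y_tr)
    val_err = 1 - model.score(X_val, y_val)        # estimate of generalization error
    if val_err < best_val_err:
        best_model, best_val_err = model, val_err

print("chosen depth:", best_model.get_depth(),
      "validation error:", round(best_val_err, 3),
      "test error:", round(1 - best_model.score(X_test, y_test), 3))
```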
Resubstitution Estimate:
– Using training error as an optimistic estimate of generalization error
– Referred to as the optimistic error estimate
Example (two candidate trees from the figure): e(TL) = 4/24, e(TR) = 6/24; under the resubstitution estimate, TL would be preferred.
Post-pruning
– Grow the decision tree to its entirety
– Subtree replacement:
Trim the nodes of the decision tree in a bottom-up fashion
If the generalization error improves after trimming, replace the sub-tree by a leaf node
The class label of the leaf node is determined from the majority class of instances in the sub-tree
– Subtree raising:
Replace a subtree with its most frequently used branch
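As a rough sketch of subtree replacement (not the book's code), the function below prunes a hypothetical binary tree bottom-up, replacing a subtree by a leaf whenever the error on the validation records reaching that node does not get worse; the Node structure and the use of a validation set as a proxy for generalization error are assumptions.

```python
# Sketch of bottom-up subtree replacement. The Node structure is a hypothetical
# stand-in, and validation error approximates generalization error.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None       # splitting attribute (internal nodes only)
    threshold: float = 0.0              # split threshold (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    majority_class: int = 0             # majority class of training records in this subtree

def predict(node, x):
    while node.left is not None:        # descend until a leaf is reached
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.majority_class

def error(node, X, y):
    return sum(predict(node, xi) != yi for xi, yi in zip(X, y)) / max(len(y), 1)

def prune(node, X_val, y_val):
    """Prune the children first (bottom-up), then consider replacing this subtree by a leaf."""
    if node.left is None:
        return node
    goes_left = [xi[node.feature] <= node.threshold for xi in X_val]
    X_l = [xi for xi, g in zip(X_val, goes_left) if g]
    y_l = [yi for yi, g in zip(y_val, goes_left) if g]
    X_r = [xi for xi, g in zip(X_val, goes_left) if not g]
    y_r = [yi for yi, g in zip(y_val, goes_left) if not g]
    node.left = prune(node.left, X_l, y_l)
    node.right = prune(node.right, X_r, y_r)
    leaf = Node(majority_class=node.majority_class)     # candidate replacement leaf
    # Replace the subtree by a leaf if the error on the validation records
    # reaching this node does not get worse
    if error(leaf, X_val, y_val) <= error(node, X_val, y_val):
        return leaf
    return node
```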
Example of Post-Pruning
Training Error (Before splitting) = 10/30
[Figure: decision tree with internal nodes splitting on attributes A1, A2, A3, A4]
Model Evaluation
Purpose:
– To estimate the performance of a classifier on previously unseen data (test set)
Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
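A brief sketch of holdout and random subsampling (repeated holdout), assuming scikit-learn; the 70/30 split and the number of repetitions are illustrative assumptions.

```python
# Sketch: random subsampling = repeating the holdout method with different
# random splits and averaging the test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.uniform(0, 20, size=(1000, 2))
y = ((X[:, 0] - 10) ** 2 + (X[:, 1] - 10) ** 2 < 36).astype(int)

errors = []
for seed in range(10):                               # 10 repeated holdout runs
    X_tr, X_te, y_tr, y_te = train_test_split(       # 70% train / 30% test
        X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

print("holdout error of run 0:", round(errors[0], 3))
print("random subsampling (mean of 10 runs):", round(float(np.mean(errors)), 3))
```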
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Cross-validation Example
3-fold cross-validation
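A minimal sketch of the 3-fold case, assuming scikit-learn's KFold; the synthetic data and the choice of classifier are illustrative assumptions. Each record is used for testing exactly once, and the fold errors are averaged into a single estimate.

```python
# Sketch: 3-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.uniform(0, 20, size=(900, 2))
y = ((X[:, 0] - 10) ** 2 + (X[:, 1] - 10) ** 2 < 36).astype(int)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=4).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    fold_errors.append(1 - model.score(X[test_idx], y[test_idx]))

print("per-fold test errors:", [round(e, 3) for e in fold_errors])
print("cross-validated error estimate:", round(float(np.mean(fold_errors)), 3))
```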