IS4242 W6 Model Evaluation and Selection


Recap: Predictive Modeling

Summary of important concepts

• Terminology (model, predictive model, dimensionality, etc.)


• Supervised vs unsupervised learning
• Selecting Informative Attributes
• Entropy
• Information gain
• Decision trees
• Geometric interpretation of a model
Model Evaluation and Selection

SUN Chenshuo
Department of Information Systems & Analytics
Model Evaluation & Selection

Many other models

Which model is better?


Model Evaluation
Tasks vs Models
DM Tasks

Regression, Classification, Outlier Detection

Models

Linear Regression, Logistic Regression, K Nearest Neighbors

Tasks & Models

Multiple models can be used for the same task

How to choose a model?
Model Evaluation
Training & Test Error
Goodness of fit vs. Predictive Accuracy
Evaluation Metric
Model Selection
Model Complexity
Parameters & Hyperparameters
Cross-Validation
Training
Training refers to learning/estimating the
model parameters from data
Also called fitting the model to data

Model parameters are usually learnt by optimization

E.g. the parameter estimates $\hat{\beta}_0$, $\hat{\beta}_1$ in linear or logistic regression are obtained from data
fit()

The fit() function (in sklearn) indicates parameter learning from data

It is also used in contexts other than model training
E.g. learning the mean and sd during scaling
fit() & transform()
fit(): learns the parameters from data

transform(): applies the transformation

E.g. scaling

predict(): predicts using the learnt model (for ML models)

E.g. predictions from regression/classification models
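A minimal sketch of this fit()/transform()/predict() pattern in sklearn, using small hypothetical arrays (X_train, y_train, X_new) purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[180.0], [220.0], [300.0], [150.0]])  # e.g. cholesterol level (made-up)
y_train = np.array([0, 0, 1, 0])                          # e.g. heart-disease indicator (made-up)

scaler = StandardScaler()
scaler.fit(X_train)                          # fit(): learn mean and sd from the data
X_train_scaled = scaler.transform(X_train)   # transform(): apply the learnt scaling

model = LogisticRegression()
model.fit(X_train_scaled, y_train)           # fit(): learn beta_0, beta_1 by optimization

X_new = scaler.transform(np.array([[260.0]]))
print(model.predict(X_new))                  # predict(): predictions from the learnt model
```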
Hyperparameters
Other parameters that are typically not
learnt from the data and/or not via
optimization

E.g., k in KNN
There are exceptions to both these conditions
- sometimes hyperparameters are also learnt
from data and via optimization
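A minimal sketch (with made-up data) of how k is supplied to sklearn's KNeighborsClassifier as a hyperparameter rather than learnt from the data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)      # k = 3: chosen by the user, not estimated by optimization
knn.fit(X_train, y_train)                      # "training" here just stores the data
print(knn.predict(np.array([[2.5], [10.5]])))  # -> [0 1]
```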
Training & Test Error
Training and Test Error
Training Error

Error when the model is used on data that was used in training

Test Error

Error when the model is used on data that was not used in training

Training Error usually underestimates test error
Model Building
[Pipeline diagram: Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation

We have to estimate the unknown test error


Model Deployment
[Pipeline diagram (model building): Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation
We have to estimate the unknown test error

[Pipeline diagram (deployment): Input → Preprocessing/Feature Extraction → Model → Predictions]
Model is deployed for prediction => Real Test Error
Evaluation Metrics
Evaluating Predictions: Regression

Accuracy of predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$

Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$
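A minimal sketch of computing RMSE with NumPy, using hypothetical y_true and y_pred values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # square root of the mean squared error
print(rmse)
```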
Evaluating Predictions: Binary Classification
Binary Classification Table
Counts of: True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
Evaluating Predictions: Binary Classification
Classification Accuracy = (TP + TN) / (TP + FP + TN + FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
Evaluating Predictions: Binary Classification
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
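A minimal sketch of these metrics in sklearn, on a hypothetical test set chosen so the confusion matrix comes out as TP=3, FP=1, FN=1, TN=3:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical true labels and predictions for 8 observations
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # sklearn orders the cells as TN, FP, FN, TP
print(tp, fp, fn, tn)                                      # -> 3 1 1 3

print((tp + tn) / (tp + fp + fn + tn))   # classification accuracy, same as accuracy_score(y_true, y_pred)
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 0.75
```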
Trade-Offs
Precision-Recall
Example

Logistic Regression

Target: probabilities
Different values of the classification threshold will yield different precision and recall values
Example

     C Level    Heart Disease
1    220        0
2    300        1
3    150        0
…    -          -

Fit a Logistic Regression Model
Y: Heart Disease Indicator
X: Cholesterol Level
[Figure: logistic regression decision boundary; high cholesterol levels]
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
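A minimal sketch of fitting such a model with sklearn; the cholesterol levels and labels below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cholesterol levels (X) and heart-disease indicator (y)
X = np.array([[150.], [180.], [220.], [240.], [260.], [280.], [300.], [320.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X)[:, 1]    # estimated P(heart disease | cholesterol)
print(np.round(proba, 2))
print((proba >= 0.5).astype(int))    # default 0.5 threshold -> predicted class
```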
Example
Evaluation on Test Data

1: Predicted Negative (= No Heart Disease) because LR assigns a probability < 0.5: False Negative

2: Predicted Positive (= Heart Disease) because LR assigns a probability > 0.5: False Positive

All other samples are correctly classified

[Figure: high cholesterol levels]

https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
Evaluation on Test Data (continued)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
1. Define Positive & Negative (Positive = HD-YES, Negative = HD-NO)

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      TP       FP
              HD-NO       FN       TN
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
2. Fill the table using the test results

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        1
              HD-NO       1        3
Example

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        1
              HD-NO       1        3

Precision: 75%
Recall: 75%
Example
Threshold: 0.9

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        0
              HD-NO       1        4

Precision: 100%, Recall: 75%

Threshold: 0

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      4        4
              HD-NO       0        0

Precision: 50%, Recall: 100%


https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
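A minimal sketch of how changing the threshold moves precision and recall, using hypothetical predicted probabilities chosen so the three thresholds reproduce the numbers above (0 → 50%/100%, 0.5 → 75%/75%, 0.9 → 100%/75%):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical test labels and predicted probabilities from a logistic regression
y_true  = np.array([0,   0,   0,   0,   1,   1,   1,    1])
y_proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9, 0.95, 0.99])

for threshold in [0.0, 0.5, 0.9]:
    y_pred = (y_proba >= threshold).astype(int)   # classify as positive above the threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```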
Decision Boundary

[Figure: panels showing the Input data and the decision boundaries of 1-NN, 3-NN, and Logistic Regression]
ROC Curve

Receiver Operating Characteristic

Illustrates the overall performance of a binary classifier at different classification thresholds (that affect the trade-off)
ROC Curve

Plot of
TPR (Recall) = TP / (TP + FN)
vs
FPR (Fall-out) = FP / (TN + FP)

at various classification thresholds
ROC Curve
Example: classifier performance at 4 different thresholds (values not shown)

Threshold    TPR     FPR
1            0       0
2            0.5     0.25
3            0.75    0.5
4            1       1
ROC Curve
[Figure: ROC curve plotting the (FPR, TPR) pairs from the table above]

https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/
Three ROC curves represent the performance levels of three classifiers. Classifier A clearly outperforms classifiers B and C in this example.
AUROC

Area Under the ROC Curve (aka AUC)

A single number that characterises the performance of a binary classifier

Can be used to compare classifiers

[Figure: ROC curves with True Positive Rate on the y-axis and False Positive Rate on the x-axis]

https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/
The score is 1.0 for the classifier with the perfect performance level and 0.5 for the classifier with the random performance level.
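A minimal sketch of computing the ROC curve and AUC with sklearn, reusing the hypothetical probabilities from the thresholding example above:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0,   0,   0,   0,   1,   1,   1,    1])
y_proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9, 0.95, 0.99])

fpr, tpr, thresholds = roc_curve(y_true, y_proba)  # TPR and FPR at each candidate threshold
auc = roc_auc_score(y_true, y_proba)               # area under the ROC curve (1.0 = perfect, 0.5 = random)
print(list(zip(np.round(fpr, 2), np.round(tpr, 2))))
print(round(auc, 2))
```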
Evaluation Metrics
Regression: RMSE

Binary Classification
Classification Accuracy

Precision, Recall

AUC
Economic Significance
Real World
Model Complexity
Model Fitting
Fit
How well you approximate a function

Fitting: training a model on a dataset

We do not want high training error

We do not want zero training error either!
Model Fitting
Underfitting

Model is not trained well and does not


generalize well to unseen test data
Overfitting

Model is trained very well but does not


generalize well to unseen test data
Data: Blue Circles, generated from a sin function (Green Curve) with errors,
Fitted M-order polynomials: Red Curves
Figure from Bishop, Pattern Recognition and Machine Learning
RMSE vs Order of polynomial
Figure from Bishop, Pattern Recognition and Machine Learning
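A minimal sketch reproducing the flavour of this experiment: made-up noisy sin data, polynomial regression of increasing order M, and training vs test RMSE:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 10)).reshape(-1, 1)            # 10 noisy training points
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)                      # fresh test data from the same process
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.3, 100)

for degree in [1, 3, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    rmse_train = np.sqrt(np.mean((model.predict(x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((model.predict(x_test) - y_test) ** 2))
    print(f"M={degree}: train RMSE={rmse_train:.2f}, test RMSE={rmse_test:.2f}")
```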
Model Complexity
Increase in M (order of polynomial)

→ increase in complexity/flexibility

→ better training accuracy

→ but lower generalizability and lower test accuracy
Train vs Test Set Errors
[Figure: training and test error vs. model complexity, with underfitting and overfitting regions]
Train vs Test Set Errors

Train Error   Test Error   Interpretation
High          -            Underfitting: requires further training; may increase model complexity
Very Low      High         Overfitting (no generalization): reduce model complexity; use regularization (will see later)
Low           Low          Good model
Model Selection

We never know the "true" test error

How do we choose and train a model to minimise the true test error?

We require a way to estimate the "true" test error
Estimate Test Error
Create a 'Test' set from the available data

[Figure: the full data table (features f1 … fp, rows 1 … n) is split row-wise into a Training Set and a Test Set]
Estimate Test Error

Simple strategy: a single "hold-out" test set

Holdout
Randomly split the data into two sets:
Train Set
Test/Validation/Holdout Set
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
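A minimal sketch of the holdout split with sklearn's train_test_split, on toy data and an assumed 70/30 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = np.arange(20).reshape(-1, 1).astype(float)   # toy features
y = (X.ravel() > 10).astype(int)                 # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # holdout estimate of the "true" test performance
```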
Holdout
Choose hyperparameter values

Train the model with the chosen hyperparameters on the Training data
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Holdout
Evaluate model performance on the holdout test set
Performance -> estimate of "true" test error

Can be used to compare different models (on the same test data)
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Holdout
With the chosen hyperparameters and model, use the entire data to train
Final model used in deployment

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Drawbacks
In the simple hold-out strategy:

1. The choice of hyperparameters is ad hoc

2. The test error estimate depends on which samples are in the holdout/validation set

May overestimate the test error
Hyperparameter Tuning
3-way holdout: create 3 sets

Train, Validation & Test Sets
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning

Use different hyperparameter values and train different models using the Training Data
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Evaluate the performance of each model on the Validation Set and choose the best hyperparameters
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Using the best hyperparameter values, train the model on the (Training + Validation) Set to learn a new model
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Test the performance of this model on the holdout test set -> estimate of "true" test error

Use the entire data to train a model for deployment
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
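A minimal sketch of this 3-way holdout procedure on toy data, tuning k for a KNN classifier; the split sizes and candidate k values are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

# Split into train / validation / test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Try different hyperparameter values on the training set, score each on the validation set
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:
    acc = accuracy_score(y_val, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit on (train + validation) with the best k, then estimate test error on the holdout test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(np.vstack([X_train, X_val]),
                                                     np.concatenate([y_train, y_val]))
print(best_k, accuracy_score(y_test, final.predict(X_test)))
# For deployment, the model would finally be retrained on all of the data
```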
Drawbacks
In the simple hold-out strategy:

1. The choice of hyperparameters is ad hoc

2. The test error estimate depends on which samples are in the holdout/validation set

May overestimate the test error
These drawbacks motivate cross-validation
Cross-Validation
K-fold Cross-Validation
Split Data into K subsets
Repeat K times:

Choose one fold (previously not chosen) for


validation, remaining data for training
Compute test error on validation set
Average K results
K-fold Cross-Validation

[Figure: data divided into K = 5 folds; in each round one fold is the Validation Set and the remaining folds form the Training Set]
K-fold Cross-Validation
Divide the entire data
into K folds

In each fold, train on


the training set and
evaluate on the
validation set

Average the
performance over all
the folds

Estimate of the
“true” test error
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
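A minimal sketch of K-fold cross-validation with sklearn's cross_val_score on toy data (K = 5):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X = np.arange(50).reshape(-1, 1).astype(float)   # toy features
y = (X.ravel() > 25).astype(int)                 # toy labels

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)  # one score per fold
print(scores, scores.mean())  # average over the K folds -> estimate of the "true" test performance
```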
Cross Validation

Leave-one-out Cross-Validation
N observations
N folds

1 test observation in each fold
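A minimal sketch of LOOCV on toy data; LeaveOneOut produces N folds with a single test observation each:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.arange(30).reshape(-1, 1).astype(float)
y = (X.ravel() > 15).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # N folds (here 30), averaged into one performance estimate
```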


Automobile Data
[Figure: LOOCV and 10-fold CV estimates of test error (Mean Squared Error) vs. Degree of Polynomial]
Hyperparameter Tuning

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Nested Cross Validation

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
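A minimal sketch of nested cross-validation on toy data: an inner GridSearchCV tunes k, while an outer cross_val_score estimates the test error of the whole tuning procedure (the fold counts and k values are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]},
                     cv=KFold(n_splits=3, shuffle=True, random_state=0))   # inner loop: tune k
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))  # outer loop: estimate test error
print(outer_scores.mean())
```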
Grid Search
List of hyperparameters: model trained and
tested on each combination
Computationally Intensive

Search space grows very quickly

There are algorithms to do more systematic


search: hyperparameter optimization
Grid search is still commonly used
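A minimal sketch of a grid search with sklearn's GridSearchCV on toy data; note that even a small 4 × 2 grid already means 8 hyperparameter combinations, each cross-validated:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}  # 4 x 2 = 8 combinations
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)   # trains and cross-validates a model for every combination
print(search.best_params_, search.best_score_)
```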
Model Selection

No Free Lunch!

No one method dominates all others over all possible data sets
Summary: Why Cross-Validation?
Model Building & Deployment
We want to select a model (pipeline) that will be deployed
Select pipeline of operations

Select model

Model parameters & hyperparameters

After selection, we can train the model on the entire available data and deploy
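A minimal sketch of treating the preprocessing and the model as one sklearn Pipeline, cross-validating the whole pipeline, and then refitting it on all of the (toy) data before deployment:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.arange(40).reshape(-1, 1).astype(float)
y = (X.ravel() > 20).astype(int)

pipe = Pipeline([("scale", StandardScaler()),       # preprocessing step
                 ("model", LogisticRegression())])  # model step

print(cross_val_score(pipe, X, y, cv=5).mean())     # CV estimate of test performance for the whole pipeline

pipe.fit(X, y)   # after selection, refit on all available data before deployment
```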
Model Building
[Pipeline diagram: Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation

CV only gives an estimate of the test error


Model Deployment
[Pipeline diagram (model building): Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation
CV only gives an estimate of the test error

[Pipeline diagram (deployment): Input → Preprocessing/Feature Extraction → Model → Predictions]
Model is deployed for prediction => Real Test Error


Summary
Training & Test Error
Goodness of fit vs. Predictive Accuracy
Evaluation Metric - Regression, Classification
Model Selection

Model Complexity
Parameters & Hyperparameters
Cross-Validation

Data Leakage
