IS4242 W6 Model Evaluation and Selection


Recap: Predictive Modeling

Summary of important concepts

• Terminology (model, predictive model, dimensionality, etc.)


• Supervised vs unsupervised learning
• Selecting Informative Attributes
• Entropy
• Information gain
• Decision trees
• Geometric interpretation of a model
Model Evaluation and Selection

SUN Chenshuo
Department of Information Systems & Analytics
Model Evaluation & Selection

Many other models

Which model is better?


Model Evaluation
Tasks vs Models
DM Tasks

Regression, Classification, Outlier Detection

Models

Linear Regression, Logistic Regression, K Nearest Neighbors

Tasks & Models

Multiple models can be used for the same task

How to choose a model?
Model Evaluation
Training & Test Error
Goodness of fit vs. Predictive Accuracy
Evaluation Metric
Model Selection
Model Complexity
Parameters & Hyperparameters
Cross-Validation
Training
Training refers to learning/estimating the
model parameters from data
Also called fitting the model to data

Model parameters are usually learnt by optimization

E.g. the parameter estimates $\hat{\beta}_0$, $\hat{\beta}_1$ in linear or logistic regression are obtained from data
fit()

The fit() function (in sklearn) indicates parameter learning from data

It is also used in contexts other than model training
E.g. learning the mean and sd during scaling
fit() & transform()
fit(): learns the parameters from data

transform(): applies the transformation

E.g. scaling

predict(): predicts using the learnt model (for ML models)

E.g. predictions from regression/classification models
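A minimal sketch of this fit()/transform()/predict() pattern in sklearn, using small hypothetical arrays (X_train, y_train, X_new) purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[180.0], [220.0], [300.0], [150.0]])  # e.g. cholesterol level (made-up)
y_train = np.array([0, 0, 1, 0])                          # e.g. heart-disease indicator (made-up)

scaler = StandardScaler()
scaler.fit(X_train)                          # fit(): learn mean and sd from the data
X_train_scaled = scaler.transform(X_train)   # transform(): apply the learnt scaling

model = LogisticRegression()
model.fit(X_train_scaled, y_train)           # fit(): learn beta_0, beta_1 by optimization

X_new = scaler.transform(np.array([[260.0]]))
print(model.predict(X_new))                  # predict(): predictions from the learnt model
```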
Hyperparameters
Other parameters that are typically not
learnt from the data and/or not via
optimization

E.g., k in KNN
There are exceptions to both these conditions
- sometimes hyperparameters are also learnt
from data and via optimization
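A minimal sketch (with made-up data) of how k is supplied to sklearn's KNeighborsClassifier as a hyperparameter rather than learnt from the data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)      # k = 3: chosen by the user, not estimated by optimization
knn.fit(X_train, y_train)                      # "training" here just stores the data
print(knn.predict(np.array([[2.5], [10.5]])))  # -> [0 1]
```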
Training & Test Error
Training and Test Error
Training Error

Error when the model is used on data that was used in training

Test Error

Error when the model is used on data that was not used in training

Training Error usually underestimates test error
Model Building
[Pipeline diagram: Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation

We have to estimate the unknown test error


Model Deployment
[Pipeline diagram (model building): Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation
We have to estimate the unknown test error

[Pipeline diagram (deployment): Input → Preprocessing/Feature Extraction → Model → Predictions]
Model is deployed for prediction => Real Test Error
Evaluation Metrics
Evaluating Predictions: Regression

Accuracy of predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$

Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$
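A minimal sketch of computing RMSE with NumPy, using hypothetical y_true and y_pred values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # square root of the mean squared error
print(rmse)
```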
Evaluating Predictions: Binary Classification
Binary Classification Table
Counts of: True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
Evaluating Predictions: Binary Classification
Classification Accuracy = (TP + TN) / (TP + FP + TN + FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
Evaluating Predictions: Binary Classification
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
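A minimal sketch of these metrics in sklearn, on a hypothetical test set chosen so the confusion matrix comes out as TP=3, FP=1, FN=1, TN=3:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical true labels and predictions for 8 observations
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # sklearn orders the cells as TN, FP, FN, TP
print(tp, fp, fn, tn)                                      # -> 3 1 1 3

print((tp + tn) / (tp + fp + fn + tn))   # classification accuracy, same as accuracy_score(y_true, y_pred)
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 0.75
```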
Trade-Offs
Precision-Recall
Example

Logistic Regression

Target: probabilities
Different values of the classification threshold will yield different precision and recall values
Example

     C Level    Heart Disease
1    220        0
2    300        1
3    150        0
…    -          -

Fit a Logistic Regression Model
Y: Heart Disease Indicator
X: Cholesterol Level
[Figure: logistic regression decision boundary; high cholesterol levels]
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
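A minimal sketch of fitting such a model with sklearn; the cholesterol levels and labels below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cholesterol levels (X) and heart-disease indicator (y)
X = np.array([[150.], [180.], [220.], [240.], [260.], [280.], [300.], [320.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X)[:, 1]    # estimated P(heart disease | cholesterol)
print(np.round(proba, 2))
print((proba >= 0.5).astype(int))    # default 0.5 threshold -> predicted class
```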
Example
Evaluation on Test Data

1: Predicted Negative (= No Heart Disease) because LR assigns a probability < 0.5: False Negative

2: Predicted Positive (= Heart Disease) because LR assigns a probability > 0.5: False Positive

All other samples are correctly classified

[Figure: high cholesterol levels]

https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
Evaluation on Test Data (continued)

                        TRUTH
                        +       -
PREDICTION    +         TP      FP
              -         FN      TN
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
1. Define Positive & Negative (Positive = HD-YES, Negative = HD-NO)

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      TP       FP
              HD-NO       FN       TN
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example
2. Fill the table using the test results

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        1
              HD-NO       1        3
Example

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        1
              HD-NO       1        3

Precision: 75%
Recall: 75%
Example
Threshold: 0.9

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      3        0
              HD-NO       1        4

Precision: 100%, Recall: 75%

Threshold: 0

                          TRUTH
                          HD-YES   HD-NO
PREDICTION    HD-YES      4        4
              HD-NO       0        0

Precision: 50%, Recall: 100%


https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
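A minimal sketch of how changing the threshold moves precision and recall, using hypothetical predicted probabilities chosen so the three thresholds reproduce the numbers above (0 → 50%/100%, 0.5 → 75%/75%, 0.9 → 100%/75%):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical test labels and predicted probabilities from a logistic regression
y_true  = np.array([0,   0,   0,   0,   1,   1,   1,    1])
y_proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9, 0.95, 0.99])

for threshold in [0.0, 0.5, 0.9]:
    y_pred = (y_proba >= threshold).astype(int)   # classify as positive above the threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```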
Decision Boundary

[Figure: panels showing the Input data and the decision boundaries of 1-NN, 3-NN, and Logistic Regression]
ROC Curve

Receiver Operating Characteristic

Illustrates the overall performance of a binary classifier at different classification thresholds (that affect the trade-off)
ROC Curve

Plot of
TPR (Recall) = TP / (TP + FN)
vs
FPR (Fall-out) = FP / (TN + FP)

at various classification thresholds
ROC Curve
Example: classifier performance at 4 different thresholds (values not shown)

Threshold    TPR     FPR
1            0       0
2            0.5     0.25
3            0.75    0.5
4            1       1
ROC Curve
[Figure: ROC curve plotting the (FPR, TPR) pairs from the table above]

https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/
Three ROC curves represent the performance levels of three classifiers. Classifier A clearly outperforms classifiers B and C in this example.
AUROC

Area Under the ROC Curve (aka AUC)

A single number that characterises the performance of a binary classifier

Can be used to compare classifiers

[Figure: ROC curves with True Positive Rate on the y-axis and False Positive Rate on the x-axis]

https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/
The score is 1.0 for the classifier with the perfect performance level and 0.5 for the classifier with the random performance level.
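A minimal sketch of computing the ROC curve and AUC with sklearn, reusing the hypothetical probabilities from the thresholding example above:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0,   0,   0,   0,   1,   1,   1,    1])
y_proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9, 0.95, 0.99])

fpr, tpr, thresholds = roc_curve(y_true, y_proba)  # TPR and FPR at each candidate threshold
auc = roc_auc_score(y_true, y_proba)               # area under the ROC curve (1.0 = perfect, 0.5 = random)
print(list(zip(np.round(fpr, 2), np.round(tpr, 2))))
print(round(auc, 2))
```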
Evaluation Metrics
Regression: RMSE

Binary Classification
Classification Accuracy

Precision, Recall

AUC
Economic Significance
Real World
Model Complexity
Model Fitting
Fit
How well you approximate a function

Fitting: training a model on a dataset

We do not want high training error

We do not want zero training error either!
Model Fitting
Underfitting

Model is not trained well and does not


generalize well to unseen test data
Overfitting

Model is trained very well but does not


generalize well to unseen test data
Data: Blue Circles, generated from a sin function (Green Curve) with errors,
Fitted M-order polynomials: Red Curves
Figure from Bishop, Pattern Recognition and Machine Learning
RMSE vs Order of polynomial
Figure from Bishop, Pattern Recognition and Machine Learning
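A minimal sketch reproducing the flavour of this experiment: made-up noisy sin data, polynomial regression of increasing order M, and training vs test RMSE:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 10)).reshape(-1, 1)            # 10 noisy training points
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)                      # fresh test data from the same process
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.3, 100)

for degree in [1, 3, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    rmse_train = np.sqrt(np.mean((model.predict(x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((model.predict(x_test) - y_test) ** 2))
    print(f"M={degree}: train RMSE={rmse_train:.2f}, test RMSE={rmse_test:.2f}")
```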
Model Complexity
Increase in M (order of polynomial)

→ increase in complexity/flexibility

→ better training accuracy

→ but lower generalizability and lower test accuracy
Train vs Test Set Errors
[Figure: training and test error vs. model complexity, with underfitting and overfitting regions]
Train vs Test Set Errors

Train Error   Test Error   Interpretation
High          -            Underfitting: requires further training; may increase model complexity
Very Low      High         Overfitting (no generalization): reduce model complexity; use regularization (will see later)
Low           Low          Good model
Model Selection

We never know the "true" test error

How do we choose and train a model to minimise the true test error?

We require a way to estimate the "true" test error
Estimate Test Error
Create a 'Test' set from the available data

[Figure: the full data table (features f1 … fp, rows 1 … n) is split row-wise into a Training Set and a Test Set]
Estimate Test Error

Simple strategy: a single "hold-out" test set

Holdout
Randomly split the data into two sets:
Train Set
Test/Validation/Holdout Set
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
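A minimal sketch of the holdout split with sklearn's train_test_split, on toy data and an assumed 70/30 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = np.arange(20).reshape(-1, 1).astype(float)   # toy features
y = (X.ravel() > 10).astype(int)                 # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # holdout estimate of the "true" test performance
```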
Holdout
Choose hyperparameter values

Train the model with the chosen hyperparameters on the Training data
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Holdout
Evaluate model performance on the holdout test set
Performance -> estimate of "true" test error

Can be used to compare different models (on the same test data)
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Holdout
With the chosen hyperparameters and model, use the entire data to train
Final model used in deployment

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Drawbacks
In the simple hold-out strategy:

1. The choice of hyperparameters is ad hoc

2. The test error estimate depends on which samples are in the holdout/validation set

May overestimate the test error
Hyperparameter Tuning
3-way holdout: create 3 sets

Train, Validation & Test Sets
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning

Use different hyperparameter values and train different models using the Training Data
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Evaluate the performance of each model on the Validation Set and choose the best hyperparameters
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Using the best hyperparameter values, train the model on the (Training + Validation) Set to learn a new model
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Hyperparameter Tuning
Test the performance of this model on the holdout test set -> estimate of "true" test error

Use the entire data to train a model for deployment
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
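A minimal sketch of this 3-way holdout procedure on toy data, tuning k for a KNN classifier; the split sizes and candidate k values are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

# Split into train / validation / test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Try different hyperparameter values on the training set, score each on the validation set
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:
    acc = accuracy_score(y_val, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit on (train + validation) with the best k, then estimate test error on the holdout test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(np.vstack([X_train, X_val]),
                                                     np.concatenate([y_train, y_val]))
print(best_k, accuracy_score(y_test, final.predict(X_test)))
# For deployment, the model would finally be retrained on all of the data
```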
Drawbacks
In the simple hold-out strategy:

1. The choice of hyperparameters is ad hoc

2. The test error estimate depends on which samples are in the holdout/validation set

May overestimate the test error
These drawbacks motivate cross-validation
Cross-Validation
K-fold Cross-Validation
Split Data into K subsets
Repeat K times:

Choose one fold (previously not chosen) for


validation, remaining data for training
Compute test error on validation set
Average K results
K-fold Cross-Validation

[Figure: data divided into K = 5 folds; in each round one fold is the Validation Set and the remaining folds form the Training Set]
K-fold Cross-Validation
Divide the entire data
into K folds

In each fold, train on


the training set and
evaluate on the
validation set

Average the
performance over all
the folds

Estimate of the
“true” test error
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
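A minimal sketch of K-fold cross-validation with sklearn's cross_val_score on toy data (K = 5):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X = np.arange(50).reshape(-1, 1).astype(float)   # toy features
y = (X.ravel() > 25).astype(int)                 # toy labels

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)  # one score per fold
print(scores, scores.mean())  # average over the K folds -> estimate of the "true" test performance
```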
Cross Validation

Leave-one-out Cross-Validation
N observations
N folds

1 test observation in each fold
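A minimal sketch of LOOCV on toy data; LeaveOneOut produces N folds with a single test observation each:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.arange(30).reshape(-1, 1).astype(float)
y = (X.ravel() > 15).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # N folds (here 30), averaged into one performance estimate
```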


Automobile Data
[Figure: LOOCV and 10-fold CV estimates of test error (Mean Squared Error) vs. Degree of Polynomial]
Hyperparameter Tuning

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Nested Cross Validation

https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
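A minimal sketch of nested cross-validation on toy data: an inner GridSearchCV tunes k, while an outer cross_val_score estimates the test error of the whole tuning procedure (the fold counts and k values are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]},
                     cv=KFold(n_splits=3, shuffle=True, random_state=0))   # inner loop: tune k
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))  # outer loop: estimate test error
print(outer_scores.mean())
```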
Grid Search
List of hyperparameters: model trained and
tested on each combination
Computationally Intensive

Search space grows very quickly

There are algorithms to do more systematic


search: hyperparameter optimization
Grid search is still commonly used
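A minimal sketch of a grid search with sklearn's GridSearchCV on toy data; note that even a small 4 × 2 grid already means 8 hyperparameter combinations, each cross-validated:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.arange(60).reshape(-1, 1).astype(float)
y = (X.ravel() > 30).astype(int)

param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}  # 4 x 2 = 8 combinations
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)   # trains and cross-validates a model for every combination
print(search.best_params_, search.best_score_)
```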
Model Selection

No Free Lunch!

No one method dominates all others over all possible data sets
Summary: Why Cross-Validation?
Model Building & Deployment
We want to select a model (pipeline) that will be deployed
Select pipeline of operations

Select model

Model parameters & hyperparameters

After selection, we can train the model on the entire available data and deploy
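A minimal sketch of treating the preprocessing and the model as one sklearn Pipeline, cross-validating the whole pipeline, and then refitting it on all of the (toy) data before deployment:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.arange(40).reshape(-1, 1).astype(float)
y = (X.ravel() > 20).astype(int)

pipe = Pipeline([("scale", StandardScaler()),       # preprocessing step
                 ("model", LogisticRegression())])  # model step

print(cross_val_score(pipe, X, y, cv=5).mean())     # CV estimate of test performance for the whole pipeline

pipe.fit(X, y)   # after selection, refit on all available data before deployment
```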
Model Building
[Pipeline diagram: Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation

CV only gives an estimate of the test error


Model Deployment
[Pipeline diagram (model building): Input + Target → Preprocessing/Feature Extraction/Modeling → Model]

Data Mining Steps used to choose the final pipeline of operations:
Data Analysis/Cleaning, Data Transformation, Feature Engineering, Model Building & Evaluation
CV only gives an estimate of the test error

[Pipeline diagram (deployment): Input → Preprocessing/Feature Extraction → Model → Predictions]
Model is deployed for prediction => Real Test Error


Summary
Training & Test Error
Goodness of fit vs. Predictive Accuracy
Evaluation Metric - Regression, Classification
Model Selection

Model Complexity
Parameters & Hyperparameters
Cross-Validation

Data Leakage
