IS4242 W6 Model Evaluation and Selection
SUN Chenshuo
Department of Information Systems & Analytics
Model Evaluation & Selection
E.g., scaling
E.g., k in KNN
There are exceptions to both these conditions
- sometimes hyperparameters are also learnt from data and via optimization
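A minimal sketch (not from the slides, with made-up data) contrasting a hyperparameter such as k in KNN, which is set before fitting, with parameters that are learnt from the data during fitting:

# Sketch: k is a hyperparameter we choose; logistic-regression coefficients
# are parameters learnt from the data. The tiny dataset is made up.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # hyperparameter: set before fitting
knn.fit(X, y)

logreg = LogisticRegression().fit(X, y)
print(logreg.coef_, logreg.intercept_)      # parameters: learnt from the data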
Training and Test Error
Training Error
Input → Preprocessing/Feature Extraction → Model → Predictions
Regression
Confusion matrix cells: True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN)
Evaluating Predictions: Binary Classification

Classification Accuracy = (TP + TN) / (TP + FP + TN + FN)

                    TRUTH
                    +     -
PREDICTION    +     TP    FP
              -     FN    TN
Evaluating Predictions: Binary Classification

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)

                    TRUTH
                    +     -
PREDICTION    +     TP    FP
              -     FN    TN
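A hedged sketch (labels made up for illustration) of computing these confusion-matrix metrics with scikit-learn:

# Sketch: confusion-matrix-based metrics with scikit-learn; labels are made up.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + FP + TN + FN)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)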
Precision-Recall Trade-Offs
Example: Logistic Regression
The model outputs probabilities for the target
Different threshold values will yield different predicted labels, and hence different precision and recall
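A small sketch of this thresholding step; the predicted probabilities below are hypothetical model outputs, not from the slides:

# Sketch: how the decision threshold changes predicted labels, and hence
# precision and recall. Probabilities are hypothetical.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])  # P(y = 1)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          "precision =", precision_score(y_true, y_pred),
          "recall =", recall_score(y_true, y_pred))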
Example

Patient   Cholesterol Level   Heart Disease
1         220                 0
2         300                 1
3         150                 0
…         …                   …
https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69
Example: Evaluation on Test Data (High Cholesterol Levels)

                    TRUTH
                    +     -
PREDICTION    +     TP    FP
              -     FN    TN
Example

                    TRUTH
                    +     -
PREDICTION    +     TP    FP
              -     FN    TN

                         TRUTH
                         +     -
PREDICTION    HD-YES     TP    FP
              HD-NO      FN    TN
Example: different thresholds give different confusion matrices

                         TRUTH
                         +     -
PREDICTION    HD-YES     TP    FP
              HD-NO      FN    TN

                         TRUTH
                         +     -
PREDICTION    HD-YES     3     0
              HD-NO      1     4

                         TRUTH
                         +     -
PREDICTION    HD-YES     4     4
              HD-NO      0     0
ROC Curve

Plot of TPR against FPR:
TPR (Recall)   = TP / (TP + FN)
FPR (Fall-out) = FP / (TN + FP)

Threshold   TPR    FPR
1           0      0
2           0.5    0.25
3           0.75   0.5
4           1      1
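A hedged sketch of computing TPR/FPR over all thresholds and plotting the ROC curve with scikit-learn; the scores below are hypothetical:

# Sketch: ROC curve from hypothetical scores with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]   # e.g. predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")   # random-classifier diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.show()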
https://classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/
Three ROC curves represent the performance levels of three classifiers. Classifier A clearly outperforms classifiers B and C in this example.
AUROC
The score is 1.0 for a classifier with perfect performance and 0.5 for a classifier with random performance.
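Sketch: the AUROC score with scikit-learn (same hypothetical scores as above):

# Sketch: AUROC = area under the ROC curve; 1.0 = perfect, 0.5 = random.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]
print(roc_auc_score(y_true, scores))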
Evaluation Metrics
Regression: RMSE (see the sketch after this list)
Binary Classification
Classification Accuracy
Precision, Recall
AUC
Economic Significance
Real World
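A short sketch of RMSE for a regression model, assuming hypothetical targets and predictions:

# Sketch: RMSE = sqrt of the mean squared error; values are hypothetical.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)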
Model Complexity
Model Fitting
Fit: how well you approximate a function
Increase in complexity → better training accuracy
Decrease in flexibility → lower generalizability
Training Error   Test Error   Fit                              Remedy
High             -            Underfitting                     Requires further training; may increase model complexity
Very Low         High         Overfitting (no generalization)  Reduce model complexity; use regularization (will see later)
[Figure: data matrix with n observations (rows 1 … n) and p features f1, f2, …, fp]
Estimate Test Error
Create a ‘Test’ set from the available data
[Figure: the full data matrix (n rows × p features) is split into a Training Set and a Test Set]
Estimate Test Error
https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
Holdout
Choose Hyperparameter Values
Holdout
Evaluate model performance on holdout test set
Performance -> estimate of “true” test error
Holdout
With the chosen hyperparameters and model, use the entire data to train
Final model used in deployment
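A hedged sketch of this holdout workflow; the dataset, model, and hyperparameter choices are illustrative only:

# Sketch: split off a test set, fit with the chosen hyperparameters, estimate
# test error, then retrain on all the data for deployment.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)        # chosen hyperparameter value
model.fit(X_train, y_train)
print("Estimated test accuracy:", accuracy_score(y_test, model.predict(X_test)))

final_model = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # retrain on all data for deployment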
Drawbacks
In the simple hold-out strategy, the hyperparameters are chosen against the same test set that is then used to estimate the test error
Hyperparameter Tuning
Hyperparameter Tuning
Evaluate performance of each model on the Validation Set and choose the best hyperparameters
Hyperparameter Tuning
Using the best hyperparameter values, train the model on the (Training + Validation) Set to learn a new model
Hyperparameter Tuning
Test the performance of this model on the holdout set -> estimate of the “true” test error
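A sketch of this three-way split (train / validation / holdout), tuning k in KNN; the dataset and candidate values are illustrative:

# Sketch: tune k on a validation set, retrain on train+validation with the
# best k, then report performance on the untouched holdout set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_k, best_score = None, -np.inf
for k in (1, 3, 5, 7, 9):
    score = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tmp, y_tmp)   # train + validation
print("best k:", best_k, "holdout accuracy:", final.score(X_test, y_test))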
Drawbacks
In the simple hold-out strategy, the performance estimate depends on a single random split of the data
[Figure: the data is divided into K = 5 folds; in each round one fold is the Validation Set and the remaining folds form the Training Set]
K-fold Cross-Validation
Divide the entire data into K folds
In each round, train on K−1 folds and evaluate on the held-out fold
Average the performance over all the folds
Estimate of the “true” test error
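A sketch of 5-fold cross-validation with scikit-learn; the model and dataset are illustrative:

# Sketch: 5-fold CV; the mean over the folds estimates the "true" test error.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())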
Cross Validation
Leave-one-out Cross-Validation
N observations
N folds
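A sketch of leave-one-out cross-validation, which uses as many folds as observations; the dataset and model are illustrative:

# Sketch: LOOCV = N folds for N observations.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())   # one score per observation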
[Figure: two panels plotting Mean Squared Error (16 to 28) against values 2 to 10, showing the Test Error]
Hyperparameter Tuning
Nested Cross Validation
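A hedged sketch of the standard nested cross-validation recipe: an inner cross-validation (inside GridSearchCV) tunes the hyperparameter, while an outer cross-validation estimates the test error of the whole tuning procedure. The grid and model are illustrative.

# Sketch: nested CV = outer CV around a tuned (inner-CV) estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())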
Grid Search
List of hyperparameter values: the model is trained and tested on each combination
Computationally Intensive
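A sketch of grid search with scikit-learn's GridSearchCV; the grid and model below are illustrative:

# Sketch: every combination in the grid is trained and cross-validated,
# which is why grid search becomes expensive quickly.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)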
No Free Lunch!
Select model
Input → Preprocessing/Feature Extraction → Model → Predictions
Model Complexity
Parameters & Hyperparameters
Cross-Validation
Data Leakage
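A hedged sketch related to the Data Leakage point: fitting a scaler on the full data before cross-validation leaks information from the held-out folds, whereas wrapping the scaler and model in a Pipeline refits the scaler inside each training fold. The dataset and model choices are illustrative.

# Sketch: putting preprocessing inside a Pipeline avoids leaking held-out
# information during cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipe, X, y, cv=5).mean())   # scaling happens inside each fold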