Cross-Validation and Model Selection
•The model is fitted using one part of the data only. The fitted model is then
used to predict the values in the other part of the data. A valid model
should show good predictive accuracy.
•R-squared offers no protection against overfitting. Cross validation, in
contrast, tests the model on cases that were not used to fit it, and so
inherently offers protection against overfitting.
CROSS VALIDATION – THE IDEAL PROCEDURE
1. Divide the data into three sets: training, validation and test sets
2. Fit the candidate models on the training set, and use the validation set to
select the best one
3. See how well the selected model can predict the test set
•Training Data: candidate models are fitted for each degree d
•Validation Data: d = 2 is chosen
•Test Data: a test error of 1.3 is computed for d = 2
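A minimal base-R sketch of this three-way procedure, assuming an illustrative data frame df with columns x and y and polynomial candidates indexed by degree d (all names and split proportions here are assumptions, not from the slides):

set.seed(1)                                   # assumed seed, for reproducibility
n       <- nrow(df)
idx     <- sample(n)                          # shuffle the row indices
n_train <- floor(0.5 * n)
n_valid <- floor(0.25 * n)
train <- df[idx[1:n_train], ]
valid <- df[idx[(n_train + 1):(n_train + n_valid)], ]
test  <- df[idx[(n_train + n_valid + 1):n], ]

# 1. Fit one candidate model per degree d on the training set only.
fits <- lapply(1:5, function(d) lm(y ~ poly(x, d), data = train))

# 2. Use the validation set to choose d (smallest validation MSE).
val_mse <- sapply(fits, function(f) mean((valid$y - predict(f, valid))^2))
best_d  <- which.min(val_mse)                 # e.g. d = 2 in the slide's example

# 3. Report how well the chosen model predicts the untouched test set.
test_mse <- mean((test$y - predict(fits[[best_d]], test))^2)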
LOOCV (Leave-one-out Cross Validation)
• For k = 1 to R:
1. Let (xk, yk) be the kth example
2. Temporarily remove (xk, yk) from the dataset
3. Train on the remaining R − 1 examples
4. Note the prediction error on the held-out (xk, yk)
• Report the mean of these R errors as the LOOCV estimate
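A minimal base-R sketch of LOOCV for a simple linear model, again assuming an illustrative data frame df with columns x and y:

# Leave-one-out CV: each of the R rows is held out exactly once.
loocv_mse <- function(df) {
  R      <- nrow(df)
  errors <- numeric(R)
  for (k in 1:R) {
    train     <- df[-k, ]                 # remove the k-th example
    held_out  <- df[k, , drop = FALSE]    # the single held-out example
    fit       <- lm(y ~ x, data = train)  # train on the remaining R - 1 rows
    errors[k] <- held_out$y - predict(fit, held_out)
  }
  mean(errors^2)                          # mean squared leave-one-out error
}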
LOOCV for Quadratic Regression
LOOCV for Join The Dots
Which kind of Cross Validation?
K-FOLD CROSS VALIDATION
›Since data are often scarce, there might not be enough to set aside for a
validation sample
›To work around this issue, k-fold CV works as follows (a code sketch follows this list):
1. Split the sample into k subsets of equal size
2. For each fold, estimate a model on all the subsets except one
3. Use the left-out subset to test the model, by calculating a CV metric of
choice
4. Average the CV metric across subsets to get the CV error
›This has the advantage of using all the data for estimation; however,
finding a good value for k can be tricky
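A minimal base-R sketch of the four steps above, using the same illustrative data frame df with columns x and y:

# k-fold CV: every row is used k - 1 times for training and once for testing.
kfold_cv <- function(df, k = 5) {
  folds    <- sample(rep(1:k, length.out = nrow(df)))  # 1. random, equal-sized subsets
  fold_mse <- numeric(k)
  for (i in 1:k) {
    train <- df[folds != i, ]               # 2. estimate on all subsets except one
    test  <- df[folds == i, ]
    fit   <- lm(y ~ x, data = train)
    pred  <- predict(fit, newdata = test)
    fold_mse[i] <- mean((test$y - pred)^2)  # 3. CV metric (here: MSE) on the left-out subset
  }
  mean(fold_mse)                            # 4. average across subsets = CV error
}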
K-fold Cross Validation Example
1. Split the data into 5 samples
2. Fit a model to the training samples and use the test sample to calculate a
CV metric
3. Repeat the process for the next sample, until every sample has been used
once as the test sample
Which kind of Cross Validation?
Improve cross-validation
• Even better: repeated cross-validation
Example:
10-fold cross-validation is repeated 10 times and the results are
averaged (this reduces the variance of the CV estimate)
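A short sketch of repeated CV, reusing the kfold_cv() helper sketched earlier (a helper assumed in this document, not a library function):

# Repeat 10-fold CV 10 times with different random fold assignments, then average.
repeated_cv <- replicate(10, kfold_cv(df, k = 10))
mean(repeated_cv)   # averaged CV error, with lower variance than a single run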
Cross Validation - Metrics
• How do we determine if one model is predicting better than another model?
Cross Validation Metrics
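The best-practice slide below refers to RMSE and MAPE; as a hedged sketch, these two common CV metrics can be computed from vectors of actual and predicted values as follows:

# Root Mean Squared Error: penalises large errors, in the units of the response.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean Absolute Percentage Error: scale-free, expressed as a percentage.
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100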
Best Practice for Reporting Model Fit
1. Use Cross Validation to find the best model
2. Report the RMSE and MAPE statistics from the cross validation procedure
3. Report the R Squared from the model as you normally would
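The fold-by-fold listings below are cross-validation output for a house-price regression (columns area, bedrooms and sale.price). A minimal base-R sketch that produces listings of this shape, assuming a data frame called houseprices with those columns (the data frame name, seed and fold assignment are assumptions):

set.seed(1)                                             # assumed seed; folds are random
folds <- sample(rep(1:3, length.out = nrow(houseprices)))
for (i in 1:3) {
  train  <- houseprices[folds != i, ]
  test   <- houseprices[folds == i, ]
  fit    <- lm(sale.price ~ area + bedrooms, data = train)
  cvpred <- predict(fit, newdata = test)                # out-of-fold predictions
  cv_res <- test$sale.price - cvpred                    # CV residuals
  cat("fold", i, "\n")
  print(rbind(cvpred = round(cvpred, 1),
              sale.price = test$sale.price,
              "CV residual" = round(cv_res, 1)))
  cat("Sum of squares =", round(sum(cv_res^2)),
      " Mean square =", round(mean(cv_res^2)), " n =", nrow(test), "\n\n")
}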
fold 1
Observations in test set: 5
                 11    20     21      22    23
area            802   696  771.0  1006.0  1191
cvpred          204   188  199.3   234.7   262
sale.price      215   255  260.0   293.0   375
CV residual      11    67   60.7    58.3   113

fold 3
Observations in test set: 5
                  9    12      15     16      19
area          694.0  1366  821.00  714.0  790.00
cvpred        183.2   388  221.94  189.3  212.49
sale.price    192.0   274  212.00  220.0  221.50
CV residual     8.8  -114   -9.94   30.7    9.01
Sum of squares = 14241    Mean square = 2848    n = 5

Response: sale.price
          Df Sum Sq Mean Sq F value Pr(>F)
area       1  18566   18566    17.0 0.0014 **
bedrooms   1  17065   17065    15.6 0.0019 **
Residuals 12  13114    1093
fold 1
Observations in test set: 5
                 11     20     21     22    23
Predicted       206    249  259.8  293.3   378
cvpred          204    188  199.3  234.7   262
sale.price      215    255  260.0  293.0   375
CV residual      11     67   60.7   58.3   113
Sum of squares = 24351    Mean square = 4870    n = 5

fold 2
Observations in test set: 5
                 10     13     14     17     18
Predicted     220.5  193.6  228.8  236.6  218.0
cvpred        226.1  204.9  232.6  238.8  224.1
sale.price    215.0  112.7  185.0  276.0  260.0
CV residual   -11.1  -92.2  -47.6   37.2   35.9
Sum of squares = 13563    Mean square = 2713    n = 5

fold 3
Observations in test set: 5
                  9     12     15     16    19
Predicted     190.5  286.3  208.6  193.3   204
cvpred        174.8  312.5  200.8  178.9   194
sale.price    192.0  274.0  212.0  220.0   222
CV residual    17.2  -38.5   11.2   41.1    27
Sum of squares = 4323    Mean square = 865    n = 5
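As a worked example of turning the per-fold output above into a single CV error: the three fold sums of squares total 24351 + 13563 + 4323 = 42237 over n = 15 held-out observations, giving an overall CV mean square of about 2816 and hence a CV RMSE of roughly sqrt(2816) ≈ 53 (in the units of sale.price).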
MEASURING THE MODEL ACCURACY