Practical Issues


CMPE 442 Introduction to Machine Learning
• Model/Feature Selection
• Cross-Validation
Address Overfitting

 Options:
1) Reduce number of features
 Manually select which features to keep
 Model selection algorithm

2) Regularization
 Keep all features, but reduce the values of the parameters 𝜃_j
 Each of the features contributes to predicting y (a minimal regularization sketch follows this list)
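As an illustration of the regularization option, here is a minimal ridge-regression sketch in NumPy. The toy data, the penalty values, and the use of the closed-form solution are assumptions for the example, not the lecture's exact formulation.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: minimize ||X @ theta - y||^2 + lam * ||theta||^2.
    The penalty shrinks the parameters theta_j toward zero but keeps every feature."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# toy usage: a larger penalty lam gives smaller parameter values
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.standard_normal(50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```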
Model Selection

 Suppose we are trying to select among several different models for a


learning problem
 Ex: Assume we are using a polynomial regression model and wish to decide
on the degree d of the polynomial
 Ex: Assume we want to automatically choose the bandwidth parameter 𝜏
for locally weighted linear regression, or 𝐶 parameter for SVM.
 𝑀 = {𝑀_1, … , 𝑀_d} denotes the finite set of candidate models
 𝑀 may contain completely different models: Ex: We might be deciding
between using an SVM, a neural network or logistic regression
 How can we automatically trade off between bias and variance?
Model Selection

 Suppose we are given a training set 𝑆. Then the algorithm could be:
1. Train each model 𝑀_i on 𝑆, to get some hypothesis ℎ_i
2. Pick the hypothesis with the smallest training error
Model Selection

 Suppose we are given a training set 𝑆. Then the algorithm could be:
1. Train each model 𝑀_i on 𝑆, to get some hypothesis ℎ_i
2. Pick the hypothesis with the smallest training error

 This does not work: Ex: for polynomial regression, the higher the order of the
polynomial, the better it fits the training set 𝑆, and thus the lower the training
error. This method therefore always chooses a high-order, high-variance model.
Cross-Validation: Hold-out cross
validation
 Better algorithm:
1. Randomly split 𝑆 into 𝑆_train (say, 70% of the data) and 𝑆_cv (the remaining 30%). Here,
𝑆_cv is called the hold-out cross validation set.
2. Train each model 𝑀_i on 𝑆_train only, to get some hypothesis ℎ_i
3. Select and output the hypothesis ℎ_i that had the smallest error 𝜀̂_cv(ℎ_i) on the
hold-out cross validation set (a minimal sketch follows this list).
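As an illustration, here is a minimal hold-out cross validation sketch in plain NumPy; the toy data and the polynomial-degree candidates are assumptions for the example, with np.polyfit standing in for the training step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=100)
y = np.sin(X) + 0.3 * rng.standard_normal(100)      # toy regression data (assumed)

# 1. Randomly split S into S_train (70%) and S_cv (30%)
idx = rng.permutation(len(X))
train_idx, cv_idx = idx[:70], idx[70:]

# 2. Train each candidate model M_i on S_train only
errors = {}
for degree in (1, 2, 3, 5, 9):                      # candidate models: polynomial degrees
    coeffs = np.polyfit(X[train_idx], y[train_idx], degree)
    pred = np.polyval(coeffs, X[cv_idx])
    errors[degree] = np.mean((pred - y[cv_idx]) ** 2)   # error on the hold-out set

# 3. Output the hypothesis with the smallest hold-out error
best_degree = min(errors, key=errors.get)
print(errors, "-> chosen degree:", best_degree)
```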
Cross-Validation: Hold-out cross
validation
 By testing on 𝑆_cv, which the model did not see in the training phase, we obtain
a better estimate of each hypothesis ℎ_i's true generalization error
 Usually, 1/4 - 1/3 of the data is used in the hold-out cross validation set
 30% is a typical choice
 Disadvantage: “wastes” about 30% of the data.
Cross-Validation: k-fold cross validation

 Holds out less data each time:


1. Randomly split 𝑆 into 𝑘 disjoint subsets of 𝑚/𝑘 training examples each: 𝑆_1, … , 𝑆_k
2. For each model 𝑀_i, evaluate it as follows:
For 𝑗 = 1, … , 𝑘
Train the model 𝑀_i on 𝑆_1 ∪ ⋯ ∪ 𝑆_{j−1} ∪ 𝑆_{j+1} ∪ ⋯ ∪ 𝑆_k (i.e. train on all the data except 𝑆_j) to get some
hypothesis ℎ_ij

Test the hypothesis ℎ_ij on 𝑆_j to get 𝜀̂_{S_j}(ℎ_ij)

The estimated generalization error of model 𝑀_i is then calculated as the average of the
𝜀̂_{S_j}(ℎ_ij)'s (averaged over 𝑗).

3. Pick the model 𝑀_i with the lowest estimated generalization error, and retrain that
model on the entire training set 𝑆. The resulting hypothesis is then output as our
final answer (a k-fold sketch follows this list).
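A minimal k-fold sketch, again in plain NumPy, with hypothetical polynomial-degree models standing in for the candidate set:

```python
import numpy as np

def kfold_error(X, y, degree, k=10, seed=0):
    """Estimate the generalization error of a polynomial model of the given degree
    by k-fold cross validation (plain NumPy sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)            # S_1, ..., S_k
    errs = []
    for j in range(k):
        test_idx = folds[j]
        train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
        coeffs = np.polyfit(X[train_idx], y[train_idx], degree)   # train on all but S_j
        pred = np.polyval(coeffs, X[test_idx])
        errs.append(np.mean((pred - y[test_idx]) ** 2))           # error on S_j
    return np.mean(errs)                                          # average over j

# pick the degree with the lowest estimated error, then retrain on the entire set S
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=120)
y = np.sin(X) + 0.3 * rng.standard_normal(120)
best = min((1, 2, 3, 5, 9), key=lambda d: kfold_error(X, y, d))
final_model = np.polyfit(X, y, best)
print("chosen degree:", best)
```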
Cross Validation: k-fold cross validation

 A typical choice for the number of folds is 𝑘 = 10


 The fraction of data held out each time is now 1/𝑘, much smaller than before

 May be more computationally expensive than hold-out cross validation → we need
to train each model 𝑘 times
5 × 2 Cross-Validation (5×2cv)

 Randomly divide S into two equal subsets: T1 and T2


 Use T1 for training and T2 for testing
 Then the other way around: T2 for training and T1 for testing
 Repeat the process 5 times, each time with a different random division into
two subsets (a short sketch follows this list).
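A minimal 5×2cv error-estimation sketch in plain NumPy, again with a hypothetical polynomial model as the learner:

```python
import numpy as np

def five_by_two_cv_error(X, y, degree, seed=0):
    """5x2cv: 5 random halvings of S; each half is used once for training and
    once for testing, and the resulting 10 error estimates are averaged."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(5):                          # 5 different random divisions
        idx = rng.permutation(len(X))
        half = len(X) // 2
        T1, T2 = idx[:half], idx[half:]
        for train_idx, test_idx in ((T1, T2), (T2, T1)):   # swap the roles of T1 and T2
            coeffs = np.polyfit(X[train_idx], y[train_idx], degree)
            pred = np.polyval(coeffs, X[test_idx])
            errs.append(np.mean((pred - y[test_idx]) ** 2))
    return np.mean(errs)
```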
Cross Validation: Leave-one-out cross
validation
 Sometimes 𝑘 = 𝑚 is used to leave out as little data as possible each time:
1. Repeatedly train on all but one of the training examples in S, and test on that
held-out example
2. The resulting 𝑚 = 𝑘 errors are then averaged together to obtain the estimate of
the generalization error of a model (a short sketch follows this list).
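For instance, assuming scikit-learn is available, leave-one-out evaluation of a classifier could be written as below; the dataset and model are stand-ins for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)            # stand-in dataset
model = LogisticRegression(max_iter=1000)    # stand-in classifier

# one fold per example: train on m-1 points, test on the single held-out point
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV error estimate:", 1 - scores.mean())   # average of the m per-example errors
```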
Stratified Approach

 Suppose the data is highly imbalanced: 60 positive and 940 negative


examples
 If you use N-fold cross-validation with N=10, on average 6 positive samples
will fall in one fold
 This is not guaranteed – some folds might end up with no positive samples at all
 Stratified approach: make sure that every fold contains the same
representation of the examples from the individual classes
 Ex: if N=5 then each fold should contain 200 examples, of which 12 are
positive (a short sketch follows this list)
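Assuming scikit-learn, stratified folds can be produced as follows; the class counts in each fold stay proportional to those of the full data set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# imbalanced toy labels: 60 positive and 940 negative examples, as above
y = np.array([1] * 60 + [0] * 940)
X = np.random.randn(1000, 3)                 # hypothetical features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # each test fold has ~200 examples, of which ~12 are positive
    print(f"fold {fold}: size={len(test_idx)}, positives={int(y[test_idx].sum())}")
```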
Cross Validation

 Can be used to select a model


 Or to evaluate a single model or algorithm!
No Free Lunch Theorem

 There is no single machine learning technique that works well under all
circumstances
 Each technique has its own advantages and shortcomings
 No machine learning approach will outperform all other machine learning
approaches under all circumstances
Feature Selection

 Suppose we have a supervised learning problem where the number of


features is very large (𝑛 ≫ 𝑚)
 What if only a small number of features are relevant to the learning task?
 What if we want to reduce the number of features?
 Given 𝑛 features, there are 2ⁿ possible feature subsets
 Selecting features by searching over all 2ⁿ models → too expensive
 Two approaches:
1. Filter
2. Wrapper
Feature Selection:
1. Filter
 Independent of classifier used
 Rank features using some criteria based on their relevance to the
classification task
 For example, mutual information between a binary feature 𝑥_i and the label 𝑦:
𝑀𝐼(𝑥_i, 𝑦) = Σ_{𝑥_i ∈ {0,1}} Σ_{𝑦 ∈ {0,1}} 𝑝(𝑥_i, 𝑦) log [ 𝑝(𝑥_i, 𝑦) / (𝑝(𝑥_i) 𝑝(𝑦)) ]

 Choose a subset based on the sorted scores for the criterion used (a small
scoring sketch follows this list)
 Disadvantages:
 Ignores the relations between the attributes
 Cannot identify redundant attributes – attributes that do not bring
additional information beyond that provided by other attributes
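A minimal filter-style sketch that scores binary features with the mutual information formula above and keeps the top-ranked ones; the cutoff of 10 features is an arbitrary assumption.

```python
import numpy as np

def mutual_information(x, y, eps=1e-12):
    """Mutual information between a binary feature x and a binary label y,
    estimated from empirical counts."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y + eps))
    return mi

def filter_select(X, y, n_keep=10):
    """Filter approach: rank every feature by MI with the label, independently of
    any classifier, and keep the n_keep highest-scoring ones."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:n_keep]   # indices of the top-scoring features
```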
Feature Selection:
2. Wrapper
 More powerful but more computationally expensive
Forward-search (a greedy sketch follows below):
1. Start with the empty feature set 𝐹 = ∅
2. Repeatedly: for each feature 𝑖 ∉ 𝐹, evaluate 𝐹 ∪ {𝑖} with cross validation; add the
single feature that gives the lowest estimated error to 𝐹
3. Stop when 𝐹 reaches the desired size (or no addition improves the error) and output
the best feature subset found
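A minimal forward-search wrapper sketch; `cv_error(X_subset, y)` stands for any cross validation error estimate of the wrapped learner on the given feature subset, so both it and `max_features` are assumptions of the example.

```python
import numpy as np

def forward_search(X, y, cv_error, max_features=None):
    """Greedy wrapper: start from the empty feature set and repeatedly add the
    single feature whose inclusion yields the lowest cross validation error."""
    n = X.shape[1]
    max_features = n if max_features is None else max_features
    selected, best_err = [], np.inf
    while len(selected) < max_features:
        candidates = [j for j in range(n) if j not in selected]
        errs = {j: cv_error(X[:, selected + [j]], y) for j in candidates}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:        # no candidate improves the CV error
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return selected                          # indices of the chosen features
```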
Question#1

 Consider learning a classifier in a situation with 1000 features total. 50 of


them are truly informative about class. Another 50 features are direct
copies of the first 50 features. The final 900 features are not informative.
Assume there is enough data to reliably assess how useful features are, and
the feature selection methods are using good thresholds.
 How many features will be selected by mutual information filtering?
Question#1

 Consider learning a classifier in a situation with 1000 features total. 50 of


them are truly informative about class. Another 50 features are direct
copies of the first 50 features. The final 900 features are not informative.
Assume there is enough data to reliably assess how useful features are, and
the feature selection methods are using good thresholds.
 How many features will be selected by mutual information filtering?
About 100
Question#2

 Consider learning a classifier in a situation with 1000 features total. 50 of


them are truly informative about class. Another 50 features are direct
copies of the first 50 features. The final 900 features are not informative.
Assume there is enough data to reliably assess how useful features are, and
the feature selection methods are using good thresholds
 How many features will be selected by a wrapper method?
Question#2

 Consider learning a classifier in a situation with 1000 features total. 50 of


them are truly informative about class. Another 50 features are direct
copies of the first 50 features. The final 900 features are not informative.
Assume there is enough data to reliably assess how useful features are, and
the feature selection methods are using good thresholds
 How many features will be selected by a wrapper method?
About 50
Question#4

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

Assume we will learn a decision tree on this data. Assume that when the decision tree splits on the real-valued
attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using the
Gini measure as the splitting criterion, the decision tree would initially choose to split at x=5, which is halfway
between x=4 and x=6.
Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e. only one split)
Let Algorithm DT* be the method of learning a decision tree fully with no pruning
Question#4a

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be the training set error of DT2 on our data? You can express it as the number of misclassifications out of 10.
Question#4a

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be the training set error of DT2 on our data? You can express it as the number of misclassifications out of 10.

1/10, because DT2 will split at x=5 and will make one mistake on the right branch
Question#4b

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be leave-one-out cross validation error of DT2 on our data?


Question#4b

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be leave-one-out cross validation error of DT2 on our data?

1/10, because the tree will split at approximately x=5 on each fold, and the left-out point will be consistent with the
tree's prediction in every fold except the “leave out x=8.5” fold
Question#4c

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be the training set error of DT* on our data?


Question#4c

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be the training set error of DT* on our data?

0/10, because there will be no inconsistencies in any leaves


Question#4d

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be leave-one-out cross validation error of DT* on our data?


Question#4d

Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:

What will be leave-one-out cross validation error of DT* on our data?

3/10: the left-out points that will be wrongly predicted are x=8, x=8.5 and x=9
